Overview
Learn to implement a systematic, metric-driven framework for detecting and correcting problematic behaviors in production LLM agent systems through this technical conference talk. Discover how to instrument agent loops with comprehensive observability signals, including tool-selection quality, error rates, action progression, latency, and domain-specific metrics, then integrate these into evaluation layers like Galileo for continuous system improvement.

Explore the challenges that arise when prompts, retrieval systems, external data, and policies interact unpredictably, causing agents to drift into failure states. Follow a practical demonstration using a stock-trading system example that illustrates how brittle retrieval and faulty business logic lead to undesirable agent behavior, then see how to systematically refactor prompts and adjust retrieval pipelines while verifying improvements through enhanced metrics.

Master techniques for adding observability with minimal code changes, pinpointing root causes through detailed tracing, and establishing a virtuous cycle of continuous, metric-validated system enhancement for agentic AI systems operating at production scale.
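The instrumentation idea above — wrapping each tool invocation so the loop emits latency and error-rate signals with minimal code changes — can be sketched in Python. This is an illustrative example only, not the speaker's actual implementation; the `AgentMetrics` class and `instrumented_call` wrapper are hypothetical names, and a real system would forward these signals to an evaluation layer such as Galileo rather than keep them in memory.

```python
import time
from dataclasses import dataclass, field

@dataclass
class AgentMetrics:
    """Accumulates observability signals for one agent run (hypothetical schema)."""
    tool_calls: int = 0
    tool_errors: int = 0
    latencies: list = field(default_factory=list)  # per-call wall-clock seconds

    @property
    def error_rate(self) -> float:
        """Fraction of tool calls that raised, a basic tool-quality signal."""
        return self.tool_errors / self.tool_calls if self.tool_calls else 0.0


def instrumented_call(metrics: AgentMetrics, tool_fn, *args, **kwargs):
    """Wrap one tool invocation: record latency and errors, then re-raise.

    The agent loop swaps `tool_fn(...)` for `instrumented_call(metrics, tool_fn, ...)`,
    so observability is added without changing the tools themselves.
    """
    start = time.perf_counter()
    metrics.tool_calls += 1
    try:
        return tool_fn(*args, **kwargs)
    except Exception:
        metrics.tool_errors += 1
        raise
    finally:
        metrics.latencies.append(time.perf_counter() - start)
```

In the talk's stock-trading scenario, a call like `instrumented_call(metrics, get_quote, "ACME")` would record the retrieval tool's latency and failures, and a rising `error_rate` across runs is the kind of signal that flags drift into a failure state.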
Syllabus
Taming Rogue AI Agents with Observability-Driven Evaluation
Taught by
Databricks