Beyond the Gold Standard - Evaluating and Trusting Agents in the Wild

Learn how to evaluate and deploy AI agents in production, beyond what traditional accuracy benchmarks measure, in this 25-minute talk from the Coding Agents Conference. Discover the critical challenges of moving from controlled test environments to real-world deployment, where agents encounter ambiguous data, edge cases, and complex workflows that standard benchmarks do not cover. Explore technical strategies for building "living ground truth" systems that evolve with your deployed agents, incorporating structured feedback from subject matter experts to maintain reliability over time. Examine practical frameworks for auditing, measuring, and improving agent trustworthiness, using healthcare examples such as validating clinical dates and bed levels, where 80% accuracy falls short of requirements. Understand how these reliability principles apply across industries including e-commerce, fraud detection, and logistics, and how to address the fundamental question of when an agent is production-ready and how to keep it trustworthy after deployment.