From Lab To Life - Practical AI System Evaluation

Explore a conference talk on the challenge of evaluating agentic AI systems as they move from laboratory environments to real-world applications. Learn about the operational, reputational, and financial risks that enterprises face when deploying dynamic AI systems, and understand why static benchmarks such as MMLU fail to capture the complexity of real-world AI behavior. Discover a practical evaluation framework, inspired by the University of Michigan's "Evaluation Framework for AI Systems in the Wild," that integrates performance, fairness, and ethics considerations. Examine how this risk-adjusted approach combines continuous, outcome-oriented methods with both human and automated assessments to increase stakeholder trust and transparency. Gain actionable insights into implementing these evaluation methods with open-source technologies across the entire AI system development lifecycle, from initial conception through ongoing real-world monitoring.