Building Metrics That Actually Work for AI Applications

Learn to build reliable evaluation metrics for AI applications in this workshop led by a former Google Search Product Director. The session draws on decades of AI development experience at Google Search, adapted for modern LLM applications. Brainstorm and design custom metrics that accurately measure performance in your specific use case, then use rapid experimentation to identify which types of signals (natural language, code, or other models) work best. Explore techniques for combining metrics and calibrating them against ground-truth data using real-world examples, with accessible tools like Google Sheets for visualization and analysis. Gain practical skills in integrating scoring models into both online workflows for agent control and offline processes for model comparison and training evaluation. The session provides actionable strategies for creating metrics that are accurate, fast, and tunable to ground-truth rater judgments and user behavior, all of which are essential for building trustworthy AI evaluations in production environments.
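
As a rough illustration of what combining and calibrating metrics against ground truth can look like, here is a minimal Python sketch. It is not the workshop's method: the two signals (a code-based check and a model-based score), the weighted-average combination, the grid search, and the example data are all assumptions made up for this illustration.

```python
from itertools import product

def combine(signals, weights):
    """Weighted average of per-signal scores, each assumed to lie in [0, 1]."""
    return sum(w * s for w, s in zip(weights, signals)) / sum(weights)

def mean_abs_error(weights, examples):
    """Average gap between the combined score and the human rating."""
    gaps = [abs(combine(signals, weights) - rating) for signals, rating in examples]
    return sum(gaps) / len(gaps)

def calibrate(examples, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Grid-search the signal weights that best match the ground-truth ratings."""
    n_signals = len(examples[0][0])
    # Skip the all-zero combination, which would divide by zero in combine().
    candidates = [w for w in product(grid, repeat=n_signals) if sum(w) > 0]
    return min(candidates, key=lambda w: mean_abs_error(w, examples))

# Hypothetical ground truth: ((code_check_score, model_score), human_rating).
EXAMPLES = [
    ((0.9, 0.2), 0.9),
    ((0.1, 0.8), 0.1),
    ((0.5, 0.5), 0.5),
    ((0.7, 0.1), 0.7),
]

best_weights = calibrate(EXAMPLES)
print("weights:", best_weights, "error:", mean_abs_error(best_weights, EXAMPLES))
```

In this toy data the human ratings track the first signal exactly, so the search gives the second signal zero weight. With real rater data the same loop could be run in a spreadsheet, and the error measure is a design choice: rank correlation or agreement at a decision threshold may suit some applications better than mean absolute error.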