AI Evaluations Clearly Explained in 50 Minutes - Real Example

Learn to build AI evaluations in this 53-minute tutorial featuring Hamel Husain, who has trained over 2,000 product managers and engineers from AI companies including OpenAI, Anthropic, and Google. Follow a live walkthrough that analyzes 100 real production traces and turns the findings into evaluation criteria using a simple spreadsheet. See why the tutorial favors binary pass/fail ratings over 1-5 scores, and how a raw agreement metric can mislead product managers about evaluator quality. Get a practical explanation of true positive and true negative rates, learn how to set up continuous evaluations in production, and find out which part of the evaluation process delivers the most value.

Syllabus

00:00 What the most valuable part of evals is
01:25 Live walkthrough: Analyzing 100 real production traces
09:50 Creating the eval criteria using a simple spreadsheet
24:44 Why binary pass/fail ratings beat 1-5 scores every time
28:52 The agreement metric trap that fools most PMs
30:08 True positive and negative rates explained
36:00 How to set up continuous evals in production
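
The gap between raw agreement and true positive/negative rates named in the syllabus can be illustrated with a small sketch. This is not taken from the tutorial: the labels are made up, the function name `judge_metrics` is hypothetical, and treating "pass" as the positive class is an assumption made here for illustration.

```python
def judge_metrics(human, judge):
    """Compare an automated judge's verdicts against human labels.

    Both inputs are lists of booleans, where True means "pass".
    Treating "pass" as the positive class is an assumption of this sketch.
    """
    pairs = list(zip(human, judge))
    tp = sum(h and j for h, j in pairs)              # both say pass
    tn = sum((not h) and (not j) for h, j in pairs)  # both say fail
    pos = sum(human)                                 # human-labeled passes
    neg = len(human) - pos                           # human-labeled fails
    return {
        "agreement": (tp + tn) / len(pairs),
        "tpr": tp / pos if pos else None,
        "tnr": tn / neg if neg else None,
    }


# Made-up data: 90 passing traces, 10 failing ones,
# and a judge that marks every trace as a pass.
human = [True] * 90 + [False] * 10
judge = [True] * 100

metrics = judge_metrics(human, judge)
print(metrics)
```

With these made-up labels the judge agrees with the human reviewer on 90% of traces yet catches none of the failures, so its true negative rate is 0. Reporting the two rates separately exposes what the single agreement number hides when passes far outnumber failures.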