AI Engineer - Learn how to integrate AI into software applications
Overview
Learn how to evaluate non-deterministic AI systems through a practical framework that goes beyond traditional unit testing. Discover why conventional testing methods fall short for AI applications and explore the evaluation methodology developed at Vercel for its v0 AI system. Understand the fundamental challenges of AI reliability through real-world examples, including the "Fruit Letter Counter" app failure case study.

Master the basketball court analogy for defining evaluation boundaries and learn how to scope the domain of user queries your AI system needs to handle. Explore systematic approaches to collecting evaluation data, including strategies for gathering representative test cases that reflect actual user behavior. Examine how evals keep the data constant while varying the task, and discover practical scoring methodologies for assessing AI system performance.

Gain insights into integrating evals into continuous integration and deployment pipelines, enabling automated quality assurance for AI applications. Understand how a proper evaluation framework can significantly improve AI system reliability and user experience while providing measurable metrics for improvement.
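The structure described above — data held constant, task varied, outputs scored — can be sketched in a few lines of TypeScript. The sketch below is illustrative only: `generateAnswer`, the sample cases, and the keyword scorer are hypothetical stand-ins, not the actual v0 evaluation code.

```typescript
// Hypothetical eval harness; generateAnswer stands in for the AI system under test.

// The data is held constant: a fixed set of representative user queries
// with simple expectations, ideally collected from real usage.
interface EvalCase {
  query: string;          // user input, identical across every eval run
  mustContain: string[];  // keywords the answer is expected to include
}

const cases: EvalCase[] = [
  { query: "How many times does the letter r appear in 'strawberry'?", mustContain: ["3"] },
  { query: "Create a button component with a loading state", mustContain: ["button", "loading"] },
];

// The task is the variable part: the prompt, model, or pipeline version being evaluated.
async function generateAnswer(query: string): Promise<string> {
  // Placeholder: call the model or pipeline under test here.
  return `stub answer for: ${query}`;
}

// A simple scorer: the fraction of expected keywords present in the output.
function score(output: string, expected: string[]): number {
  const hits = expected.filter((kw) => output.toLowerCase().includes(kw.toLowerCase()));
  return hits.length / expected.length;
}

export async function runEvals(): Promise<number> {
  const scores: number[] = [];
  for (const c of cases) {
    const output = await generateAnswer(c.query);
    scores.push(score(output, c.mustContain));
  }
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  console.log(`Mean eval score: ${mean.toFixed(2)} across ${cases.length} cases`);
  return mean;
}
```

Because the cases never change between runs, any movement in the mean score can be attributed to changes in the task (prompt, model, or pipeline) rather than to the test data.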
Syllabus
00:00 Introduction to Vercel's v0 and its growth
01:00 The problem with AI unreliability
02:44 The "Fruit Letter Counter" app example of AI failure
03:33 Introducing "evals" and the basketball court analogy
05:09 Defining the "court": understanding the domain of user queries
07:53 Data collection for evals
09:13 Structuring evals: constants in data, variables in task
10:45 Scoring evals
12:35 Integrating evals into CI/CD (see the sketch after this syllabus)
13:40 The benefits of using evals
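The CI/CD integration covered at 12:35 can be as simple as failing the build when the aggregate eval score drops below a threshold. This is a minimal sketch, assuming the hypothetical `runEvals` helper from the earlier example is exported from `./evals`; the threshold value is likewise an assumption, not something prescribed in the talk.

```typescript
// Hypothetical CI gate: run the evals and fail the build if quality regresses.
// Assumes the runEvals helper from the earlier sketch is exported from ./evals.
import { runEvals } from "./evals";

const THRESHOLD = 0.8; // assumed pass bar; tune against your own baseline

runEvals().then((mean) => {
  if (mean < THRESHOLD) {
    console.error(`Mean eval score ${mean.toFixed(2)} is below ${THRESHOLD}; failing the build.`);
    process.exit(1); // a non-zero exit code fails the CI step
  }
  console.log("Evals passed.");
});
```

Wired into a pipeline step (for example, running a file like this with ts-node), the eval suite becomes an automated quality gate, which is the benefit discussed in the final segment.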
Taught by
AI Engineer