Overview
Learn AI evaluations through a hands-on tutorial in which two product managers build an evaluation system from scratch for an AI customer support agent. The session covers the four essential types of AI evaluations every practitioner should understand, then walks through the full process of creating an effective evaluation framework: defining evaluation criteria, using Anthropic's console to generate strong prompts, and adding human labels to a golden dataset. It then turns to scaling evaluations with LLM-judge prompts and aligning those judges with human judgment so the results stay reliable. By the end, you will have seen how to build a robust evaluation system that measures AI performance in real-world applications.
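The workflow described above (evaluation criteria, a human-labeled golden dataset, and an LLM judge) can be sketched in a few lines of Python. The example below is an illustrative sketch, not code from the video: the dataset rows, the criterion wording, and the use of the Anthropic Python SDK are assumptions, and you would substitute your own support-agent transcripts and criteria.

```python
# Sketch: grading support-agent replies against one evaluation criterion
# with an LLM judge. Assumes the Anthropic Python SDK is installed and
# ANTHROPIC_API_KEY is set; the dataset rows and criterion text are made up.
from anthropic import Anthropic

client = Anthropic()

# Tiny "golden dataset": each row holds the customer message, the agent's
# reply, and a human pass/fail label for the criterion being tested.
golden_dataset = [
    {
        "customer": "My order arrived damaged, what do I do?",
        "agent_reply": "Sorry to hear that! I've started a replacement order for you.",
        "human_label": "pass",
    },
    {
        "customer": "Can I get a refund for last month's charge?",
        "agent_reply": "Refunds are not something I can discuss.",
        "human_label": "fail",
    },
]

CRITERION = "The reply acknowledges the customer's problem and offers a concrete next step."

JUDGE_PROMPT = """You are grading a customer support reply.
Criterion: {criterion}

Customer message: {customer}
Agent reply: {agent_reply}

Answer with a single word, pass or fail."""

def judge(row: dict) -> str:
    """Ask the LLM judge for a pass/fail label on one dataset row."""
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # swap in whatever model you have access to
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(criterion=CRITERION, **row),
        }],
    )
    return response.content[0].text.strip().lower()

if __name__ == "__main__":
    for row in golden_dataset:
        print(judge(row), "| human said:", row["human_label"])
```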
Syllabus
00:00 What are AI evals and how to get good at them
02:52 The 4 types of AI evaluations everyone should know
06:08 Live demo: Building evals for a customer support agent
10:29 Using Anthropic's console to generate great prompts
15:13 Creating the evaluation criteria
17:40 Adding human labels to the golden dataset
31:05 Scaling evals with LLM-judge prompts
38:21 How to align LLM judges with human judgment
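The last syllabus item, aligning LLM judges with human judgment, usually comes down to measuring how often the judge agrees with the human labels in the golden dataset and iterating on the judge prompt until agreement is high enough to trust. A minimal sketch of that agreement check follows; the label values and the 90% threshold are placeholders for illustration, not figures from the video.

```python
# Sketch: measuring judge/human agreement on a labeled golden dataset.
# The labels below are placeholders; in practice they come from the human
# annotations and from running the LLM-judge prompt over the same rows.
human_labels = ["pass", "fail", "pass", "pass", "fail"]
judge_labels = ["pass", "fail", "fail", "pass", "fail"]

matches = sum(h == j for h, j in zip(human_labels, judge_labels))
agreement = matches / len(human_labels)

print(f"Judge agrees with humans on {agreement:.0%} of examples")

# Assumed rule of thumb: keep revising the judge prompt until agreement
# clears a bar you trust (e.g. 90%) before scaling it to unlabeled
# production transcripts.
if agreement < 0.9:
    print("Judge prompt needs more work before it can replace human review.")
```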
Taught by
Peter Yang