Overview
Learn how to scale up the evaluation of AI applications through automated evaluation techniques in this comprehensive tutorial. Explore the challenges of evaluating open-ended LLM tasks that typically require human assessment, and discover practical solutions using automated evals. Master the typical LLM workflow and understand the common problems that arise when building AI applications.

Dive deep into two distinct types of automated evaluations and their applications in real-world scenarios. Follow along with a detailed case study featuring an eval-driven LinkedIn Ghostwriter project that demonstrates the complete process, from identifying failure modes to creating LLM judges. Gain hands-on experience with curating user inputs, generating content, applying evaluations, and refining results based on feedback.

Access example code and references to implement these techniques in your own AI projects, and see a live demonstration of the automated evaluation system in action.
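The lesson links to its own example code; as a rough, non-authoritative illustration of the LLM-judge pattern described above, the sketch below grades a draft post against a single failure mode. The model name, the rubric wording, and the emoji-overuse check are assumptions made for this example, not the instructor's actual implementation.

```python
# Minimal sketch of an LLM-judge style automated eval.
# Assumptions: OpenAI's Python SDK, a made-up rubric, and a
# hypothetical failure mode ("overuses emojis") -- not the
# instructor's prompts or code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a LinkedIn post draft.
Failure mode to check: the post overuses emojis.
Answer with exactly one word: PASS or FAIL.

Post:
{post}
"""

def judge_post(post: str) -> bool:
    """Return True if the draft passes the emoji-overuse check."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(post=post)}],
        temperature=0,  # deterministic grading
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("PASS")

if __name__ == "__main__":
    draft = "Excited to share my new article! 🚀🔥🎉🙌💯"
    print("PASS" if judge_post(draft) else "FAIL")
```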
Syllabus
Introduction - 0:00
The Typical LLM Workflow - 0:21
The Problem - 1:11
Automated Evals - 1:50
2 Types of Automated Evals - 4:25
Example: Eval-driven LinkedIn Ghostwriter - 7:03
Step 1: Identify Failure Modes - 9:36
Step 2: Create LLM Judge - 10:49
Step 3: Curate User Inputs - 19:49
Step 4: Generate LinkedIn Posts - 20:30
Step 5: Apply Evals - 21:12
Step 6: Review Results and Refine - 22:06
The Results - 25:19
Demo - 26:59
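
Steps 3 through 6 above form a loop: curate inputs, generate posts, apply evals, and review failures before refining the prompts. A compact sketch of that loop follows; both helpers are hypothetical stubs rather than the course's code (a real generate_post would call an LLM, and judge_post would be the LLM judge sketched earlier).

```python
# Compact sketch of the eval loop from steps 3-6 above.
def generate_post(idea: str) -> str:
    """Draft a LinkedIn post from a user-supplied idea (stub)."""
    return f"Here's a thought on {idea}... 🚀"

def judge_post(post: str) -> bool:
    """Grade a draft against one failure mode (stub for the LLM judge)."""
    return "🚀" not in post

# Step 3: curate user inputs
ideas = ["automated evals", "LLM judges", "prompt engineering"]

# Steps 4-5: generate posts and apply evals
drafts = [(idea, generate_post(idea)) for idea in ideas]
graded = [(idea, post, judge_post(post)) for idea, post in drafts]

# Step 6: review failures so the generation prompt can be refined
for idea, post, passed in graded:
    if not passed:
        print(f"FAIL [{idea}]: {post}")
```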
Taught by
Shaw Talebi