How to Evaluate AI Agents - Part 2

Data Science Dojo via YouTube

This 50-minute webinar from Data Science Dojo explores the complex challenge of evaluating AI agents, explaining why traditional LLM metrics are insufficient for agent evaluation. Dive into four core evaluation methods: code-based evaluations, LLM-driven assessments, human feedback systems, and ground truth comparisons. Learn to craft high-quality LLM evaluations aligned with real-world tasks, build benchmarks using ground truth data, implement best practices for telemetry capture, and leverage OpenInference standards for system consistency. The session includes a hands-on demonstration evaluating a travel agent using Arize Phoenix, covering agent components (routers, skills, paths), evaluation techniques, template building, test dataset implementation, and guardrails for prompt injection detection. Perfect for data scientists and AI practitioners looking to develop robust evaluation frameworks for complex agent systems.

Syllabus

0:00 - Introduction and Series Overview
1:26 - Focus of Today: Evaluating AI Agents
2:10 - Agent Components Overview: Router, Skills, Path
4:39 - How to Evaluate a Router
6:10 - How to Evaluate Skills: API, RAG, Code
7:37 - Evaluating Agent Paths: Trajectory Eval
9:52 - Evaluation Techniques Overview
10:15 - Technique 1: LLM as a Judge
19:44 - Technique 2: Code-Based Evaluation
22:08 - Technique 3: Human Annotations
24:24 - Live Demo: Evaluating a Travel Agent
27:03 - Example of LLM-as-a-Judge in Action
30:11 - How to Build and Apply Evaluation Templates
34:50 - Using Test Datasets for Evaluation
42:04 - Guardrails and Prompt Injection Detection
46:04 - Summary: Combining Techniques in Dev & Prod
48:30 - Multimodal Evaluation Note: Voice, Image, Video
49:16 - Final Wrap-Up and Next Steps