Overview
Learn practical evaluation strategies for large language models that go beyond traditional benchmarks to assess real-world performance, reliability, and user satisfaction in this 32-minute conference talk. Discover how to move past impressive accuracy scores and leaderboard metrics to create evaluation frameworks that reflect how LLMs actually perform in production environments, complex workflows, and agentic systems.

Explore concrete examples using open-source frameworks including GuideLLM, lm-eval-harness, and OpenAI Evals to build custom evaluation suites tailored to specific use cases. Master techniques for measuring reasoning quality, agent consistency, and MCP (Model Context Protocol) integration, while implementing human-in-the-loop feedback systems and agent reliability checks that mirror production conditions.

Gain actionable insights for validating MCP and agent interactions and integrating human assessments for better user-aligned outcomes. Walk away with best practices for confidently deploying chatbots, copilots, and autonomous AI agents by ensuring your LLMs meet real-world expectations rather than just achieving high leaderboard positions.
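As a flavor of what "custom evaluation suite" means here, the sketch below implements two of the ideas the talk covers: a pass/fail grader per test case and a consistency check that re-samples the model and measures agreement across runs. This is a minimal illustration in plain Python, not the actual APIs of GuideLLM, lm-eval-harness, or OpenAI Evals; the grading rule and the `stub` model are invented stand-ins.

```python
# Minimal sketch of a custom LLM evaluation suite: per-case graders
# plus a consistency metric over repeated samples. Illustrative only;
# not the real GuideLLM / lm-eval-harness / OpenAI Evals interfaces.
from collections import Counter
from typing import Callable

def consistency_score(model: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    """Fraction of runs agreeing with the most common answer.

    Sampled LLMs can give different answers to the same prompt; a low
    score flags prompts where the model is unreliable even when any
    single answer looks plausible.
    """
    answers = [model(prompt).strip().lower() for _ in range(runs)]
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / runs

def run_suite(model, cases):
    """cases: list of (prompt, grader) pairs; each grader returns True/False."""
    results = []
    for prompt, grader in cases:
        output = model(prompt)
        results.append({
            "prompt": prompt,
            "passed": grader(output),
            "consistency": consistency_score(model, prompt),
        })
    return results

# Usage with a deterministic stub standing in for a real LLM call.
stub = lambda prompt: "4" if "2 + 2" in prompt else "unsure"
cases = [("What is 2 + 2?", lambda out: out.strip() == "4")]
report = run_suite(stub, cases)
print(report[0]["passed"], report[0]["consistency"])  # → True 1.0
```

In a production harness the grader would often be another model or a human review step (the human-in-the-loop feedback the talk discusses), and the consistency threshold would gate deployment of an agent rather than just print a number.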
Syllabus
Strategies for LLM Evals (GuideLLM, lm-eval-harness, OpenAI Evals Workshop) — Taylor Jordan Smith
Taught by
Taylor Jordan Smith, AI Engineer