Strategies for LLM Evals (GuideLLM, lm-eval-harness, OpenAI Evals Workshop)
Overview
Learn practical evaluation strategies for large language models that go beyond traditional benchmarks to assess real-world performance, reliability, and user satisfaction in this 32-minute conference talk. Discover how to move past impressive accuracy scores and leaderboard metrics toward evaluation frameworks that reflect how LLMs actually behave in production environments, complex workflows, and agentic systems. Explore concrete examples using open-source frameworks, including GuideLLM, lm-eval-harness, and OpenAI Evals, to build custom evaluation suites tailored to specific use cases. Master techniques for measuring reasoning quality, agent consistency, and MCP (Model Context Protocol) integration, implementing human-in-the-loop feedback systems, and running agent reliability checks that mirror production conditions. Gain actionable insights for validating MCP and agent interactions and for integrating human assessments to reach better user-aligned outcomes. Walk away with best practices for confidently deploying chatbots, copilots, and autonomous AI agents by ensuring your LLMs meet real-world expectations rather than just achieving high leaderboard positions.
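For a concrete sense of what a custom evaluation suite can look like, here is a minimal sketch (not taken from the talk itself) that drives one of the frameworks mentioned above, EleutherAI's lm-eval-harness, through its Python API. The model checkpoint, task name, and sample limit are placeholder assumptions chosen for illustration.

import lm_eval

# Minimal sketch: score a small Hugging Face model on a standard task.
# The checkpoint and task below are placeholders; a real suite would
# register custom tasks that mirror the production workload.
results = lm_eval.simple_evaluate(
    model="hf",                                     # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m", # placeholder checkpoint
    tasks=["hellaswag"],                            # swap in custom tasks here
    num_fewshot=0,
    batch_size=8,
    limit=100,                                      # subsample for a quick smoke test
)

print(results["results"])  # per-task metrics, e.g. accuracy

The same pattern extends to the custom suites the talk advocates: define tasks that reflect your own workflows, then track the resulting metrics alongside human-in-the-loop judgments rather than leaderboard scores alone.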
Syllabus
Strategies for LLM Evals (GuideLLM, lm-eval-harness, OpenAI Evals Workshop) — Taylor Jordan Smith
Taught by
Taylor Jordan Smith (AI Engineer)