Strategies for LLM Evals (GuideLLM, lm-eval-harness, OpenAI Evals Workshop)
Overview
Learn practical evaluation strategies for large language models that go beyond traditional benchmarks to assess real-world performance, reliability, and user satisfaction in this 32-minute conference talk. Discover how to move past impressive accuracy scores and leaderboard metrics toward evaluation frameworks that reflect how LLMs actually behave in production environments, complex workflows, and agentic systems. Explore concrete examples using open-source frameworks, including GuideLLM, lm-eval-harness, and OpenAI Evals, to build custom evaluation suites tailored to specific use cases. Master techniques for measuring reasoning quality, agent consistency, and MCP (Model Context Protocol) integration, implementing human-in-the-loop feedback systems, and running agent reliability checks that mirror production conditions. Gain actionable insights for validating MCP and agent interactions and for integrating human assessments to reach better user-aligned outcomes. Walk away with best practices for confidently deploying chatbots, copilots, and autonomous AI agents by ensuring your LLMs meet real-world expectations rather than just achieving high leaderboard positions.
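For a concrete sense of what a custom evaluation suite can look like, here is a minimal sketch (not taken from the talk itself) that drives one of the frameworks mentioned above, EleutherAI's lm-eval-harness, through its Python API. The model checkpoint, task name, and sample limit are placeholder assumptions chosen for illustration.

import lm_eval

# Minimal sketch: score a small Hugging Face model on a standard task.
# The checkpoint and task below are placeholders; a real suite would
# register custom tasks that mirror the production workload.
results = lm_eval.simple_evaluate(
    model="hf",                                     # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m", # placeholder checkpoint
    tasks=["hellaswag"],                            # swap in custom tasks here
    num_fewshot=0,
    batch_size=8,
    limit=100,                                      # subsample for a quick smoke test
)

print(results["results"])  # per-task metrics, e.g. accuracy

The same pattern extends to the custom suites the talk advocates: define tasks that reflect your own workflows, then track the resulting metrics alongside human-in-the-loop judgments rather than leaderboard scores alone.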
Syllabus
Strategies for LLM Evals (GuideLLM, lm-eval-harness, OpenAI Evals Workshop) — Taylor Jordan Smith
Taught by
Taylor Jordan Smith (AI Engineer)