Overview
Explore advanced LLM evaluation strategies that move beyond toy benchmarks toward real-world production impact in this 25-minute conference talk. Learn how to architect and implement evaluation pipelines that work across both online and offline environments, reducing development complexity and accelerating iteration. Discover LLM-as-a-judge frameworks, human-in-the-loop evaluation techniques, and hybrid approaches that unlock more robust and nuanced performance assessments. Examine technical architectures, real implementation patterns, and trade-offs between evaluation techniques to make informed engineering decisions. Gain practical strategies for crafting efficient, scalable, and accurate evaluation pipelines tailored to custom LLM products, whether building from scratch or refining existing workflows. Presented by Dat Ngo, Director of AI Solutions at Arize, and Aman Khan, Director of Product for LLM at Arize AI, who bring extensive experience from working with industry leaders including Reddit, Booking.com, Siemens, and Roblox on production AI systems.
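The talk itself is the primary resource; purely as a rough illustration of the LLM-as-a-judge pattern mentioned above (not code from the presenters), a minimal evaluation loop might look like the sketch below, where call_llm is a hypothetical placeholder for whatever model client a real pipeline would use.

```python
# Illustrative sketch only: a minimal LLM-as-a-judge evaluation loop.
# `call_llm` is a hypothetical stand-in, not any specific vendor API.
from dataclasses import dataclass

JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}
Reply with exactly one word: "correct" or "incorrect"."""

@dataclass
class EvalRecord:
    question: str
    answer: str

def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real model call in an actual pipeline.
    return "correct"

def judge(record: EvalRecord) -> bool:
    # Ask the judge model for a verdict and parse it into a boolean.
    verdict = call_llm(JUDGE_PROMPT.format(question=record.question,
                                           answer=record.answer))
    return verdict.strip().lower().startswith("correct")

def run_eval(records: list[EvalRecord]) -> float:
    # Return the fraction of answers the judge marks correct.
    graded = [judge(r) for r in records]
    return sum(graded) / len(graded) if graded else 0.0

if __name__ == "__main__":
    sample = [EvalRecord("What is 2 + 2?", "4")]
    print(f"judge pass rate: {run_eval(sample):.2f}")
```

In practice this same scoring function can run offline over a fixed dataset or online over sampled production traffic, which is the online/offline split the talk addresses.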
Syllabus
Engineering Better Evals: Scalable LLM Evaluation Pipelines That Work — Dat Ngo, Aman Khan, Arize
Taught by
AI Engineer