Overview
Learn how to evaluate Large Language Model (LLM) performance in production environments through the lens of educational assessment principles in this 15-minute conference talk. Discover why traditional machine learning metrics such as accuracy and error rates fall short for subjective text-generation quality, and explore the additional complexities that arise when LLMs are combined with other tools in agentic contexts. Understand how to adapt academic evaluation methodologies by creating clear, objective rubrics that define success criteria for LLM outputs. Master the technique of deploying additional, separately tested LLMs to grade outputs systematically against these rubrics, an efficient solution to the evaluation challenge. Examine the remaining gaps and limitations in current LLM evaluation approaches, drawing insights from both machine learning engineering and educational assessment perspectives. The presentation is grounded in real-world experience deploying AI systems to production and applies lessons from academic teaching to practical machine learning evaluation problems.
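To make the rubric-plus-judge pattern described above concrete, here is a minimal sketch, not the speaker's implementation: it assumes the OpenAI Python client, and the rubric text, model choice, and the `judge_output` helper are all illustrative names introduced for this example.

```python
# Minimal sketch of rubric-based "LLM-as-judge" evaluation.
# Assumes the OpenAI Python client; rubric wording, model name, and
# helper names are hypothetical choices, not taken from the talk.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A clear, objective rubric defining success criteria for the LLM output.
RUBRIC = """Score the answer on each criterion from 1 (poor) to 5 (excellent):
1. Factual accuracy: claims are correct and verifiable.
2. Relevance: the answer addresses the user's question directly.
3. Completeness: no essential step or caveat is missing.
Return JSON: {"accuracy": int, "relevance": int, "completeness": int, "rationale": str}"""

def judge_output(question: str, answer: str, model: str = "gpt-4o") -> dict:
    """Ask a separate, already-tested LLM to grade an answer against the rubric."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question:\n{question}\n\nAnswer to grade:\n{answer}"},
        ],
        response_format={"type": "json_object"},  # constrain the judge to JSON output
        temperature=0,  # make grading as deterministic as possible
    )
    return json.loads(response.choices[0].message.content)

# Example: grade one production response against the rubric.
scores = judge_output(
    question="How do I roll back a failed database migration?",
    answer="Run the down migration, then restore from the latest backup if needed.",
)
print(scores)
```

In practice such per-output scores would be aggregated over a held-out test set, which is what makes the approach an efficient substitute for manual review at production scale.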
Syllabus
Why Language Models Need a Lesson in Education
Taught by
MLOps.community