
Beyond BLEU and ROUGE - Evaluating LLMs and AI Systems

Conf42 via YouTube

Overview

Explore advanced evaluation methodologies for large language models and AI systems in this 46-minute conference talk from Conf42 ML 2025. Discover why traditional metrics like BLEU and ROUGE fall short in assessing modern AI capabilities, and learn about comprehensive evaluation frameworks that address factual accuracy, toxicity, bias, and human-centric considerations. Examine the limitations of n-gram overlap methods through real-world examples and understand how these flaws distort AI system assessment. Delve into modern evaluation approaches including fact scoring, automated evaluation technologies, and the OpenAI Evals framework. Master generative evaluation techniques and learn to assess Retrieval-Augmented Generation (RAG) systems effectively. Compare human judgment against automated metrics and understand when each approach is most appropriate. Build knowledge of robust evaluation ecosystems and apply these concepts through practical scenarios such as evaluating a customer service AI. Gain insights into the future of AI evaluation and how to implement comprehensive assessment strategies that go beyond surface-level metrics to truly measure AI system performance and reliability.
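The n-gram overlap flaw the talk examines can be illustrated with a minimal sketch (this example is ours, not from the talk): unigram precision with clipping, the simplest BLEU-style ingredient, gives a valid paraphrase a very low score simply because it uses different words.

```python
# Illustrative sketch: why n-gram overlap penalizes valid paraphrases.
# We compute clipped unigram precision by hand (the simplest
# BLEU-style component); the sentence pairs are made up for the demo.
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate words that also appear in the reference,
    clipped by reference counts, as BLEU's modified precision does."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    matched = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    return matched / sum(cand_counts.values())

reference  = "the cat sat on the mat"
exact      = "the cat sat on the mat"        # identical wording
paraphrase = "a feline rested upon the rug"  # same meaning, new words

print(unigram_precision(exact, reference))       # 1.0
print(unigram_precision(paraphrase, reference))  # ~0.17 despite same meaning
```

A semantically perfect answer scores near zero here, which is exactly the failure mode that motivates the fact-based and model-based evaluations covered later in the talk.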

Syllabus

00:00 Introduction and Session Overview
00:49 Speaker Introductions
01:47 Evaluating AI Systems: Beyond Traditional Metrics
02:28 Key Dimensions of AI Evaluation
04:50 Limitations of Traditional Metrics: BLEU and ROUGE
06:16 Real-World Examples Highlighting Metric Flaws
10:17 Understanding N-gram Overlap
13:12 Modern Evaluation Frameworks
14:44 Factual Accuracy and Fact Score
20:32 Addressing Toxicity and Bias in AI
23:31 Human-Centric Evaluation Methods
24:35 Conclusion and Transition
25:38 Introduction to Evaluation Shifts
25:50 Strengths of Large Language Models in Evaluation
26:43 Automated Evaluation: Importance and Benefits
27:24 Key Technologies in Automated Evaluation
28:35 Deep Dive: OpenAI Evals Framework
32:40 Generative Evaluation (GE) Explained
35:44 Evaluation of Retrieval-Augmented Generation (RAG) Systems
38:19 Human Judgment vs. Automated Metrics
41:23 Building Robust Evaluation Ecosystems
43:05 Real-World Scenario: Evaluating Customer Service AI
45:22 Conclusion and Future of Evaluation
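The fact-scoring idea from the syllabus (14:44) can be sketched in a few lines. This is our own toy illustration, not the talk's implementation: real fact scoring (e.g. FActScore-style pipelines) uses an LLM to decompose output into atomic claims and verify each against a knowledge source, whereas here the claims are hand-written and "support" is crude substring matching.

```python
# Toy fact-scoring sketch (our illustration, not from the talk).
# Real systems decompose model output into atomic claims with an LLM
# and verify each against retrieved evidence; we fake both steps.
def fact_score(claims: list[str], source: str) -> float:
    """Fraction of atomic claims supported by the source text,
    using naive case-insensitive substring matching as a stand-in
    for a proper entailment/verification model."""
    supported = sum(1 for c in claims if c.lower() in source.lower())
    return supported / len(claims)

source = "Marie Curie won the Nobel Prize in Physics in 1903."
claims = [
    "Curie won the Nobel Prize",   # supported by the source
    "the prize was in chemistry",  # contradicted / unsupported
]
print(fact_score(claims, source))  # 0.5
```

The key design point the talk builds on: unlike n-gram overlap, the score is tied to verifiable claims rather than surface wording, so a paraphrased but factually correct answer is not penalized.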

Taught by

Conf42

