2025 is the Year of Evals! Just like 2024, and 2023, and … - Enterprise AI/ML Evaluation and Monitoring

AI Engineer via YouTube

Explore the evolution and critical importance of AI/ML evaluation systems in enterprise environments through this 19-minute conference talk by Mozilla AI CEO John Dickerson. Examine the persistent gap between the recognized need for AI evaluation and the actual implementation of proper guardrails and governance in production systems. Learn about the decade-long journey of enterprise AI evaluation, from pre-ChatGPT deployments through the current era of generative AI and autonomous agents. Discover how the landscape has shifted from traditional ML monitoring to the complex challenges of evaluating multi-agent systems and connecting AI performance to downstream business KPIs. Understand the disconnect between what practitioners need for effective AI evaluation and what C-suite executives are willing to invest in, while gaining insights into venture capital predictions and macroeconomic factors influencing the AI evaluation market. Delve into the definition and monitoring challenges of AI agents, the transition from single-model to multi-agent system evaluation, and the role of domain expertise in creating effective evaluation frameworks. The presentation includes a Q&A session covering topics such as the use of LLMs as judges in evaluation processes and the practical considerations for implementing robust AI evaluation systems in enterprise settings.

Syllabus

00:00 Introduction to Arthur AI and Mozilla AI
00:46 2025: The Year of Evals
01:15 AI/ML monitoring and evaluation
02:48 The Year of the Agent
03:26 The need for 'evals' wasn't obvious to the C-suite
04:15 Pre-ChatGPT launch
06:06 Venture capitalists' predictions
07:03 Macroeconomic side of things
08:06 OpenAI launching ChatGPT
09:15 2023: The Year of GenAI
09:39 2024: GenAI applications in production
10:22 2025: Scaling and autonomy
11:35 Definition of an agent
12:06 Connecting to downstream business KPIs
14:40 Shift to multi-agent systems monitoring
15:42 Q&A
16:16 Discussion on domain expertise in evaluations
18:13 Discussion on LLMs as judges