The Inadequacy of Offline LLM Evaluations - A Need to Account for Personalization in Model Behavior

Simons Institute via YouTube

Overview

In this research presentation by Angelina Wang of Cornell Tech, explore how standard offline evaluations of large language models fail to capture real-world model behavior. Examine the fundamental disconnect between traditional benchmark testing and actual language model performance in personalized user interactions. Discover empirical evidence from a study of 800 real users of ChatGPT and Gemini showing that identical questions can produce markedly different responses depending on whether they are posed to a stateless system, within one user's chat session, or within another user's session. Learn why personalization fundamentally alters model behavior and why AI systems should be evaluated on their behavior in human interactions rather than on decontextualized, prediction-only outputs. Gain insight into the limitations of current evaluation methodologies and the importance of assessment frameworks that account for the dynamic, personalized nature of language models deployed in real-world applications.

Syllabus

The Inadequacy of Offline LLM Evaluations: A Need to Account for Personalization in Model Behavior

Taught by

Simons Institute
