The Data Addition Dilemma: Navigating Distribution Shifts in Machine Learning

Simons Institute via YouTube

Overview

Watch a 37-minute lecture by UC Berkeley researcher Irene Y. Chen at the Simons Institute on why combining data from different sources for machine learning training isn't always beneficial. Learn about the "Data Addition Dilemma," in which mixing dissimilar data sources can reduce accuracy, create fairness issues, and harm performance for underrepresented groups. Examine the fundamental trade-off between the benefits of larger data scale and the costs of distribution shift when datasets are combined. Discover practical strategies and heuristics for deciding which data sources to combine to improve model performance. Gain insight into key considerations for data collection and composition as AI models continue to grow in size and complexity.