Overview
Explore a comprehensive research presentation examining straggler issues in large language model training through what-if analysis methodology. Learn how researchers from New York University, ByteDance, and Zhejiang University conducted a five-month study of ByteDance's LLM training cluster to understand performance bottlenecks caused by slow workers in distributed GPU computations. Discover findings on the frequency and impact of stragglers on training job performance, analyze temporal and spatial patterns in straggler occurrence, and examine the complex root causes beyond simple hardware failures. Gain insights into how thousands of GPUs with frequent synchronization requirements create susceptibility to performance degradation, and understand the simulation techniques used to contrast actual performance with straggler-free scenarios in large-scale machine learning infrastructure.
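To illustrate the core idea behind the talk's what-if methodology, here is a minimal, hypothetical sketch (not the researchers' actual code) of how straggler impact can be estimated from a per-worker timing trace: in synchronous data-parallel training, each step finishes only when the slowest worker does, so comparing the observed step time (the max across workers) against a simulated straggler-free step time (here approximated by the per-step median) yields a job-level slowdown ratio. All function names and the toy trace are assumptions for illustration.

```python
import statistics

def step_times(per_worker_times):
    """Actual per-step time: the max across workers (sync barrier)."""
    return [max(step) for step in zip(*per_worker_times)]

def straggler_free_times(per_worker_times):
    """What-if estimate: replace each step's time with the per-step
    median across workers, simulating a cluster with no slow outliers."""
    return [statistics.median(step) for step in zip(*per_worker_times)]

# Toy trace: 3 workers x 4 steps; worker 2 straggles on step 3.
times = [
    [1.0, 1.0, 1.1, 1.0],   # worker 0
    [1.0, 1.1, 1.0, 1.0],   # worker 1
    [1.0, 1.0, 3.0, 1.0],   # worker 2 (straggler at step 3)
]

actual = sum(step_times(times))            # observed total time
ideal = sum(straggler_free_times(times))   # simulated straggler-free time
slowdown = actual / ideal                  # job-level slowdown from stragglers
print(f"actual={actual:.1f}s  ideal={ideal:.1f}s  slowdown={slowdown:.2f}x")
```

Even one straggling worker on one step inflates the whole job's runtime, which is why, at the scale of thousands of frequently synchronizing GPUs, small per-worker slowdowns compound into large performance losses.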
Syllabus
OSDI '25 - Understanding Stragglers in Large Model Training Using What-if Analysis
Taught by
USENIX