Understanding Stragglers in Large Model Training Using What-if Analysis

Explore a research presentation that examines stragglers in large language model training through what-if analysis. Learn how researchers from New York University, ByteDance, and Zhejiang University conducted a five-month study of ByteDance's LLM training cluster to understand the performance bottlenecks caused by slow workers in distributed GPU computation. Discover how often stragglers occur and how much they affect training job performance, analyze temporal and spatial patterns in their occurrence, and examine root causes that go beyond simple hardware failures. Gain insight into why jobs that span thousands of GPUs and synchronize frequently are susceptible to performance degradation from slow workers, and understand the simulation techniques used to compare actual performance with straggler-free scenarios in large-scale machine learning infrastructure.
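
To make the what-if idea concrete, here is a minimal Python sketch. It is not the simulator used in the study; it assumes a toy model in which every training step ends with a synchronization, so a step takes as long as its slowest worker. The trace values, the choice of the per-step median as the "straggler-free" duration, and the function names `slowdown` and `speedup_if_fixed` are all illustrative assumptions.

```python
from statistics import median

# Hypothetical trace: trace[step][worker] = compute time in seconds.
# Worker 3 is slow in steps 0 and 2.
trace = [
    [1.0, 1.0, 1.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 3.0],
]

def job_time(trace):
    """Synchronous training: each step waits for its slowest worker."""
    return sum(max(step) for step in trace)

def straggler_free(trace):
    """What-if scenario: every worker runs at the step's median time."""
    return [[median(step)] * len(step) for step in trace]

def slowdown(trace):
    """Ratio of actual job time to the simulated straggler-free job time."""
    return job_time(trace) / job_time(straggler_free(trace))

def speedup_if_fixed(trace, worker):
    """What-if scenario: only `worker` is reset to the per-step median."""
    fixed = []
    for step in trace:
        new_step = list(step)
        new_step[worker] = median(step)
        fixed.append(new_step)
    return job_time(trace) / job_time(fixed)

print(slowdown(trace))             # actual vs. straggler-free
print(speedup_if_fixed(trace, 3))  # fix only the slow worker
print(speedup_if_fixed(trace, 0))  # fix a healthy worker
```

In this toy trace, fixing worker 3 alone recovers the whole slowdown, while fixing worker 0 changes nothing. Comparing such per-worker scenarios is one simple way to ask which workers account for the lost time.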