Understanding Stragglers in Large Model Training Using What-if Analysis

USENIX via YouTube

Overview

Explore a research presentation that uses what-if analysis to examine stragglers in large language model (LLM) training. Learn how researchers from New York University, ByteDance, and Zhejiang University conducted a five-month study of ByteDance's LLM training cluster to understand the performance bottlenecks that slow workers cause in distributed GPU computation. Discover how often stragglers occur and how much they affect training job performance, analyze temporal and spatial patterns in their occurrence, and examine root causes that go beyond simple hardware failures. Gain insight into why jobs that span thousands of GPUs and synchronize frequently are susceptible to performance degradation, and understand the simulation techniques used to compare actual performance with a straggler-free scenario in large-scale machine learning infrastructure.
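
The idea of contrasting actual performance with a straggler-free scenario can be illustrated with a toy calculation. The sketch below is not the simulator from the talk. It assumes, for illustration only, that every step ends when the slowest worker finishes (because workers synchronize each step) and that the straggler-free step time is the median worker time. The function name `what_if_slowdown` and the sample trace are hypothetical.

```python
from statistics import median


def what_if_slowdown(trace):
    """Compare actual and hypothetical straggler-free job time.

    trace: list of steps; each step is a list of per-worker durations in seconds.
    Assumption (illustrative): workers synchronize after every step, so the
    actual step time is the maximum across workers. The straggler-free step
    time is approximated by the median worker duration.
    Returns (actual_total, ideal_total, slowdown).
    """
    actual_total = sum(max(step) for step in trace)
    ideal_total = sum(median(step) for step in trace)
    return actual_total, ideal_total, actual_total / ideal_total


# Hypothetical trace: 3 steps on 4 workers, with one slow worker in step 2.
trace = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 3.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]

actual, ideal, slowdown = what_if_slowdown(trace)
print(f"actual={actual:.1f}s ideal={ideal:.1f}s slowdown={slowdown:.2f}x")
```

In this toy trace a single slow worker in one step stretches the whole job from 3 seconds to 5 seconds, which shows why frequent synchronization makes large jobs sensitive to even isolated stragglers.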

Syllabus

OSDI '25 - Understanding Stragglers in Large Model Training Using What-if Analysis

Taught by

USENIX
