Overview
Learn about GREYHOUND, an automated system for detecting and mitigating fail-slows (stragglers) in large-scale hybrid-parallel GPU training, in this 18-minute conference presentation from USENIX ATC '25. Discover findings from a characterization study conducted on a production cluster with over 10,000 GPUs, which shows that fail-slows manifest as transient stragglers lasting from under a minute to nearly ten hours and slow training jobs by 1.34× on average. Explore the root causes of these slowdowns: slow computation and slow communication, caused by contention, device degradation, and network congestion. Understand the limitations of current practice, which relies on manual detection and checkpoint-and-restart recovery, then examine how GREYHOUND automatically identifies slow GPUs and communication links and handles them with a multi-level mitigation mechanism, all without human intervention. Review the reported results, including over 99% accuracy in detecting fail-slows in production environments and a 1.58× improvement in end-to-end throughput when handling stragglers in testbed experiments with 256 H800 GPUs.
Syllabus
USENIX ATC '25 - GREYHOUND: Hunting Fail-Slows in Hybrid-Parallel Training at Scale
Taught by
USENIX