Overview
Learn about GREYHOUND, an automated system for detecting and mitigating fail-slows (stragglers) in large-scale hybrid-parallel GPU training, in this 18-minute conference presentation from USENIX ATC '25. Discover findings from a characterization study conducted on a production cluster with over 10,000 GPUs, which show that fail-slows manifest as transient stragglers lasting from under a minute to nearly ten hours and delay training jobs by 1.34× on average. Explore the root causes: slow computation and slow communication, arising from contention, device degradation, and network congestion. Understand the limitations of current manual detection and checkpoint-and-restart approaches, then examine how GREYHOUND automatically identifies slow GPUs and communication links without human intervention and applies a multi-level mitigation mechanism. Review the reported results, including over 99% accuracy in detecting fail-slows in production environments and a 1.58× improvement in end-to-end throughput when handling stragglers in testbed experiments with 256 H800 GPUs.
Syllabus
USENIX ATC '25 - GREYHOUND: Hunting Fail-Slows in Hybrid-Parallel Training at Scale
Taught by
USENIX