Overview
Learn about Barre, a novel congestion control scheme designed for high-speed AI clusters, in this 17-minute conference presentation from USENIX ATC '25. Discover how Barre addresses the challenges RoCEv2 networks face at 400 Gbps, particularly managing congestion under the high-throughput workloads of modern AI and HPC environments. Explore the limitations of existing advanced congestion control algorithms, including their complex parameter tuning and their dependence on sophisticated hardware features, both of which hinder large-scale data center deployment. Understand how Barre uses commodity hardware and standard network functionality to achieve near-optimal fairness, congestion responsiveness, and scalability with minimal overhead. Examine real-world results from more than a year of deployment in a 400 Gbps RoCE cluster supporting up to 10,000 GPUs, which show an average 9.6% improvement in AI training task throughput. Gain insight into how Barre's core principles can also be applied to enhance DCQCN, a widely deployed congestion control algorithm, highlighting the scheme's practicality and versatility for modern data center networks.
Syllabus
USENIX ATC '25 - Barre: Empowering Simplified and Versatile Programmable Congestion Control in...
Taught by
USENIX