Overview
Learn about Barre, a novel congestion control scheme designed for high-speed AI clusters, in this 17-minute conference presentation from USENIX ATC '25. Discover how Barre addresses the challenges that RoCEv2 networks face at 400 Gbps, particularly managing congestion under the high-throughput workloads of modern AI and HPC environments. Explore the limitations of existing advanced congestion control algorithms, including complex parameter tuning and a dependency on sophisticated hardware features, both of which hinder large-scale data center deployment. Understand how Barre uses commodity hardware and standard network functionality to achieve near-optimal fairness, congestion responsiveness, and scalability with minimal overhead. Examine real-world results from more than a year of deployment in a 400 Gbps RoCE cluster supporting up to 10,000 GPUs, which show an average 9.6% improvement in AI training task throughput. Gain insight into how Barre's core principles can be applied to enhance DCQCN, a widely deployed congestion control algorithm, highlighting the scheme's practicality and versatility for modern data center networks.
Syllabus
USENIX ATC '25 - Barre: Empowering Simplified and Versatile Programmable Congestion Control in...
Taught by
USENIX