Barre - Empowering Simplified and Versatile Programmable Congestion Control in High-Speed AI Clusters

USENIX via YouTube

Overview

Learn about a novel congestion control scheme designed for high-speed AI clusters through this 17-minute conference presentation from USENIX ATC '25. Discover how Barre addresses the significant challenges faced by RoCEv2 networks operating at 400 Gbps, particularly in managing congestion under high-throughput workloads in modern AI and HPC environments. Explore the limitations of existing advanced congestion control algorithms, including their complex parameter tuning requirements and dependency on sophisticated hardware features that hinder large-scale data center deployment. Understand how Barre leverages commodity hardware and standard network functionalities to achieve near-optimal performance in fairness, congestion responsiveness, and scalability with minimal overhead. Examine the real-world deployment results from a 400 Gbps RoCE cluster supporting up to 10,000 GPUs over more than a year, demonstrating an average 9.6% improvement in AI training task throughput. Gain insights into how Barre's core principles can be applied to enhance DCQCN, a widely deployed congestion control algorithm, highlighting the scheme's practicality and versatility for modern data center networks.

Syllabus

USENIX ATC '25 - Barre: Empowering Simplified and Versatile Programmable Congestion Control in...

Taught by

USENIX

