Methodology and Observation of Congestion Control Impact on MoE Training Job Completion Time
Open Compute Project via YouTube
Overview
Learn methodology and emulation techniques to quantify AI network fabric performance for GPU clusters running Mixture-of-Experts (MoE) training workloads in this 15-minute conference presentation. Explore how congestion control and load balancing schemes affect training job completion times in AI data centers through systematic observation and comparison. Discover practical approaches to measuring the effectiveness of the network that interconnects GPU clusters, with a focus on real-world implementations that connect theoretical insights to actionable strategies for AI network infrastructure research and applications.
Syllabus
Methodology and Observation of Congestion Control Impact on MoE Training Job Completion Time
Taught by
Open Compute Project