Source Routing for AI Fabrics - Optimizing Network Traffic in Multi-tenant AI Clusters
Open Compute Project via YouTube
Overview
Learn about an approach to scheduling AI workloads in Ethernet fabrics in this technical presentation from Marvell experts Kishore Atreya and Prathyaya Bhandarkar. Explore how a source routing framework can address the challenges of large-scale, multi-tenant AI clusters, where high tail latency and jitter degrade training performance. Discover a simplified solution that uses SAI (Switch Abstraction Interface) to predetermine flow paths and program them across access nodes, taking advantage of the predictability of AI training flows. Examine how software controllers can engineer traffic flows between training elements to optimize bandwidth utilization, load, and latency, reducing network cost and power requirements compared to traditional fabric scheduling approaches. Gain insights into congestion avoidance in AI infrastructure without the complexity and unpredictable behavior of alternatives such as enhanced congestion control, load balancing, packet spraying, and fabric scheduling.
Syllabus
Source Routing for AI Fabrics
Taught by
Open Compute Project