Overview
Explore a technical deep-dive conference talk on Microsoft's deployment of SRv6 to address networking challenges in large-scale AI training clusters. Learn how the synchronized nature of AI workloads produces massive, bursty elephant flows that break traditional data center designs, causing ECMP hash collisions, congestion, and significant job completion delays. Review the traffic characteristics of large-scale training jobs and compare NIC-based and switch-based load balancing techniques. Understand Microsoft's shift to deterministic multipathing based on source routing with Segment Routing over IPv6 (SRv6), which aims for conflict-free traffic placement in AI backend networks. Gain practical insight into the real-world implementation of SRv6 micro-SID (uSID) in the SONiC network operating system, including operational data on deploying, monitoring, and troubleshooting this architecture in production. The talk is presented by Pablo Camarillo, Principal Engineer at Cisco and lead architect of SRv6 technology, who shares lessons learned from implementing this solution in one of the world's largest AI infrastructures, moving from theoretical concepts to practical application in hyperscale network fabrics.
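To make the contrast between hash-based ECMP and deterministic placement concrete, here is a minimal sketch. It is not taken from the talk and does not model SRv6 encoding or Microsoft's design: the flow tuples, the helper names, and the use of SHA-256 as a stand-in for a switch's hash function are all assumptions chosen for illustration.

```python
import hashlib
from collections import Counter

def ecmp_link(flow, n_links):
    """Pick an uplink by hashing the flow tuple (stand-in for a switch hash)."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_links

def ecmp_placement(flows, n_links):
    """Each flow lands wherever its hash says; collisions are possible."""
    return {flow: ecmp_link(flow, n_links) for flow in flows}

def deterministic_placement(flows, n_links):
    """The sender chooses the path explicitly, here by simple round-robin."""
    return {flow: i % n_links for i, flow in enumerate(flows)}

def max_link_load(placement):
    """Number of flows on the busiest uplink."""
    return max(Counter(placement.values()).values())

# Hypothetical workload: 8 equal-sized elephant flows over 8 uplinks.
# Tuple layout: (src IP, dst IP, src port, dst port, protocol).
flows = [
    (f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i, 4791, "UDP")
    for i in range(8)
]

if __name__ == "__main__":
    print("ECMP busiest link:         ", max_link_load(ecmp_placement(flows, 8)))
    print("Deterministic busiest link:", max_link_load(deterministic_placement(flows, 8)))
```

With as many uplinks as flows, the round-robin placement puts exactly one flow on each link, while the hashed placement gives no such guarantee: any two flows that hash to the same link share its bandwidth. In a synchronized training job, the slowest flow sets the pace for the whole step, which is why collisions translate into longer job completion times.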
Syllabus
AI Backend: Deploying SRv6 uSID and SONiC for Deterministic Load Balancing
Taught by
NANOG