Overview
Explore a technical deep-dive conference talk on Microsoft's deployment of SRv6 to address critical networking challenges in large-scale AI training clusters. Learn how the synchronized nature of AI workloads creates massive, bursty elephant flows that break traditional data center designs, causing ECMP hashing collisions, congestion, and significant job completion delays. The talk characterizes the traffic of large-scale training jobs and compares NIC-based and switch-based load balancing techniques. It then explains Microsoft's strategic shift to deterministic multipathing, using source routing with Segment Routing over IPv6 (SRv6) to ensure conflict-free traffic placement in AI backend networks. Gain practical insights into the real-world implementation of SRv6 uSID (micro-segment identifiers) within the SONiC network operating system, including operational data on deploying, monitoring, and troubleshooting this architecture in production. The speaker is Pablo Camarillo, Principal Engineer at Cisco and lead architect of SRv6 technology, who shares lessons learned from implementing this solution in one of the world's largest AI infrastructures, moving beyond theoretical concepts to practical application in hyperscale network fabrics.
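To make the ECMP collision problem concrete, here is a minimal Python sketch. It is not taken from the talk: the flow tuples, the SHA-256 hash, and the round-robin placement are all illustrative assumptions, standing in for a switch's hash function and for the path selection a source-routed design would perform. It contrasts hashing a handful of large flows onto equal-cost paths with assigning each flow an explicit path.

```python
import hashlib
from collections import Counter

def ecmp_path(flow, num_paths):
    """Pick a path by hashing the flow tuple.

    SHA-256 is an illustrative stand-in; real switches use their own
    hardware hash functions.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

def deterministic_paths(flows, num_paths):
    """Assign flows to paths round-robin, as a simple stand-in for
    explicit (source-routed) path placement."""
    return {flow: i % num_paths for i, flow in enumerate(flows)}

def max_load(assignment):
    """Number of flows on the busiest path."""
    return max(Counter(assignment.values()).values())

NUM_PATHS = 8
# Hypothetical flows: (src IP, dst IP, src port, dst port, protocol).
# 4791 is the RoCEv2 UDP destination port; the addresses are made up.
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 49152, 4791, "UDP") for i in range(8)]

hashed = {flow: ecmp_path(flow, NUM_PATHS) for flow in flows}
placed = deterministic_paths(flows, NUM_PATHS)

print("ECMP hashing, busiest path carries:", max_load(hashed), "flows")
print("Deterministic placement, busiest path carries:", max_load(placed), "flows")
```

With as many flows as paths, the deterministic assignment puts exactly one flow on each path, while hashing can place several elephant flows on the same link and leave others idle. With only a few very large flows, there is too little statistical averaging for hashing to balance load well.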
Syllabus
AI Backend: Deploying SRv6 uSID and SONiC for Deterministic Load Balancing
Taught by
NANOG