Overview
Explore a technical deep-dive conference talk on Microsoft's deployment of SRv6 to address critical networking challenges in large-scale AI training clusters. Learn how the synchronized nature of AI workloads creates massive, bursty elephant flows that break traditional data center designs, causing ECMP hash collisions, congestion, and significant delays in job completion. Discover the traffic characteristics of large-scale training jobs and compare NIC-based and switch-based load balancing techniques. Understand Microsoft's strategic shift to deterministic multipathing using source routing with SRv6 (Segment Routing over IPv6) to ensure conflict-free traffic placement in AI backend networks. Gain practical insights into the real-world implementation of SRv6 uSID in the SONiC network operating system, including operational data on deploying, monitoring, and troubleshooting this architecture in production. The talk is presented by Pablo Camarillo, Principal Engineer at Cisco and lead architect of SRv6 technology, who shares lessons learned from implementing this solution in one of the world's largest AI infrastructures, moving beyond theory to practical application in hyperscale network fabrics.
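To make the ECMP collision problem concrete, the following is a minimal sketch, not the implementation discussed in the talk. It assumes a hypothetical fabric with 8 equal-cost paths and 8 equally sized elephant flows, and it uses a SHA-256 hash of the 5-tuple as a stand-in for a switch's ECMP hash. The flow addresses, ports, and function names are all illustrative. It compares hash-based placement with a deterministic placement in which the sender assigns each flow its own path, which is the kind of choice source routing makes possible.

```python
import hashlib
from collections import Counter

NUM_PATHS = 8  # assumed number of equal-cost paths (hypothetical)


def ecmp_path(flow, num_paths):
    """Pick a path by hashing the 5-tuple, as a stateless ECMP hash would.

    SHA-256 is only a stand-in; real switches use vendor-specific hashes.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


def path_loads(assignments, num_paths):
    """Count how many flows landed on each path."""
    counts = Counter(assignments)
    return [counts.get(p, 0) for p in range(num_paths)]


# Hypothetical elephant flows: (src IP, dst IP, src port, dst port, protocol).
flows = [
    (f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i, 4791, "UDP")
    for i in range(NUM_PATHS)
]

# Hash-based placement: each flow's path depends only on its header fields.
ecmp_loads = path_loads([ecmp_path(f, NUM_PATHS) for f in flows], NUM_PATHS)

# Deterministic placement: the sender picks a distinct path per flow.
det_loads = path_loads([i % NUM_PATHS for i in range(len(flows))], NUM_PATHS)

print("ECMP flows per path:         ", ecmp_loads)
print("Deterministic flows per path:", det_loads)
```

With as many flows as paths, the deterministic assignment places exactly one flow on each path, while the hashed assignment can place several flows on one path and leave others idle. Any path carrying more than one elephant flow becomes a bottleneck for a synchronized training step.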
Syllabus
AI Backend: Deploying SRv6 uSID and SONiC for Deterministic Load Balancing
Taught by
NANOG