Enabling Technologies for Next Generation Large Scale AI Backend Networking
Open Compute Project via YouTube
Overview
Learn about Microsoft's next-generation backend AI network architecture, designed for massive GPU clusters, in this conference talk from the Open Compute Project. Discover how AI acceleration has transformed networking infrastructure requirements, driving the growth of backend clusters from tens of thousands to hundreds of thousands of GPUs. Explore the distinct networking challenges these hyper-scale environments present compared to traditional data centers, including demands for ultra-low latency, high throughput, and proactive fault detection. Examine three key technologies deployed in Microsoft's data centers: Segment Routing over IPv6 (SRv6) for advanced traffic engineering, High-Frequency Streaming Telemetry (HFST) for real-time network monitoring, and trimming techniques for optimized performance. Understand the implementation details and driving factors behind each technology, and gain insight into how SAI (Switch Abstraction Interface) and SONiC (Software for Open Networking in the Cloud) support the deployment of these hyper-scale AI backend networks. Gain valuable perspective on the future of networking infrastructure as AI workloads continue to scale exponentially.
Syllabus
Enabling Technologies for Next Generation Large Scale AI Backend Networking
Taught by
Open Compute Project