Enabling Technologies for Next Generation Large Scale AI Backend Networking
Open Compute Project via YouTube
Overview
Learn about Microsoft's next-generation backend AI network architecture, designed for massive GPU clusters, in this conference talk from the Open Compute Project. Discover how AI acceleration has transformed networking infrastructure requirements, driving the growth of backend clusters from tens of thousands to hundreds of thousands of GPUs. Explore the distinct networking challenges these hyper-scale environments present compared to traditional data centers, including demands for ultra-low latency, high throughput, and proactive fault detection. Examine three key technologies deployed in Microsoft's data centers: Segment Routing over IPv6 (SRv6) for advanced traffic engineering, High-Frequency Streaming Telemetry (HFST) for real-time network monitoring, and trimming techniques for optimized performance. Understand the implementation details and driving factors behind each technology, and gain insight into how SAI (Switch Abstraction Interface) and SONiC (Software for Open Networking in the Cloud) support the deployment of these hyper-scale AI backend networks. Gain valuable perspective on the future of networking infrastructure as AI workloads continue to scale exponentially.
Syllabus
Enabling Technologies for Next Generation Large Scale AI Backend Networking
Taught by
Open Compute Project