Evolution of Ethernet Based Switch Platforms and Fabrics to Meet Meta's AI Training Clusters
Open Compute Project via YouTube
Overview
Learn how Meta evolved Ethernet-based switch platforms and fabrics to support large-scale AI training clusters in this 23-minute conference talk from OCP 2024. Follow the journey from deep-buffer to shallow-buffer disaggregated Ethernet platforms used to build non-blocking fabrics that interconnect clusters of up to 4,000 GPUs. Explore the switching and routing platforms, their features, and the scaling challenges that arose when constructing significantly larger non-blocking fabrics for generative AI training workloads. Understand the specific SAI and FBOSS enhancements that enabled adaptation to new platform features, including the evolution from cell-based fabrics to more scalable architectures. Gain insight into the technical decisions and engineering solutions that Meta's software and hardware engineers implemented to meet the demands of modern AI training infrastructure.
Syllabus
Evolution of Ethernet Based Switch Platforms and Fabrics to Meet Meta's AI Training Clusters
Taught by
Open Compute Project