Evolution of Ethernet Based Switch Platforms and Fabrics to Meet Meta's AI Training Clusters
Open Compute Project via YouTube
Overview
Learn how Meta evolved Ethernet-based switch platforms and fabrics to support large-scale AI training clusters in this 23-minute conference talk from OCP 2024. Discover the journey from deep-buffered to shallow-buffered disaggregated Ethernet platforms for building non-blocking fabrics that interconnect GPU clusters of up to 4,000 units. Explore the switching and routing platforms, their features, and the scaling challenges encountered when constructing significantly larger non-blocking fabrics for generative AI training workloads. Understand the specific SAI and FBOSS enhancements that enabled adaptation to new platform features, including the evolution from cell-based fabrics to more scalable architectures. Gain insights into the technical decisions and engineering solutions that Meta's software and hardware engineers implemented to meet the demanding requirements of modern AI training infrastructure.
Syllabus
Evolution of Ethernet based switch platforms and fabrics to meet Meta's AI training clusters
Taught by
Open Compute Project