Evolution of Ethernet Based Switch Platforms and Fabrics to Meet Meta's AI Training Clusters

Open Compute Project via YouTube

Overview

Learn how Meta evolved Ethernet-based switch platforms and fabrics to support large-scale AI training clusters in this 23-minute conference talk from OCP 2024. Discover the journey from deep-buffered to shallow-buffer disaggregated Ethernet platforms for building non-blocking fabrics that interconnect GPU clusters of up to 4,000 units. Explore the switching and routing platforms, features, and unique scaling challenges encountered when constructing significantly larger non-blocking fabrics for generative AI training workloads. Understand the specific SAI and FBOSS enhancements that enabled adaptation to new platform features, including the evolution from cell-based fabrics to more scalable architectures. Gain insights into the technical decisions and engineering solutions that Meta's software and hardware engineers implemented to meet the demanding requirements of modern AI training infrastructure.

Syllabus

Evolution of Ethernet-based switch platforms and fabrics to meet Meta's AI training clusters

Taught by

Open Compute Project
