Scale Up and Scale Out AI Fabrics - A Polymorphic Ethernet Architecture for Systems of Systems
Open Compute Project via YouTube
Overview
Explore the technical challenges and solutions involved in designing AI fabrics that satisfy both scale-up and scale-out requirements in this 22-minute conference presentation by Jai Kumar, Distinguished Engineer at Broadcom. Learn how to reconcile the competing demands of inference workloads, which need low-latency, bandwidth-efficient access to a unified GPU memory domain, and large language model training workloads, which need multi-tiered architectures that manage distributed GPU domains. Discover how Ethernet can serve as the basis for a polymorphic architecture that converges these conflicting requirements into a robust system of systems. Examine key technical considerations, including memory versus network semantics, protocol overhead optimization, latency management, fabric topology design, and congestion control algorithms, and understand how to address challenges such as incast/outcast traffic patterns, multipathing, and remote memory access in distributed AI computing environments.
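To make the incast problem mentioned above concrete: it arises when many senders (for example, GPUs returning gradient shards) converge on a single receiver port and oversubscribe its link. The talk's specific congestion control algorithms are not detailed here; the sketch below is only a toy AIMD (additive-increase, multiplicative-decrease) loop, with hypothetical link capacity and sender count, showing how per-flow rates back off toward a fair share of the bottleneck.

```python
# Toy incast model (illustrative assumptions, not from the talk):
# N synchronized senders share one receiver link; each reacts to
# congestion with AIMD, the classic back-off scheme underlying many
# Ethernet/TCP-style congestion control algorithms.

LINK_CAPACITY = 100.0   # Gb/s at the receiver port (hypothetical)
N_SENDERS = 8           # hypothetical incast degree

rates = [LINK_CAPACITY] * N_SENDERS  # every sender starts at line rate

for step in range(200):
    offered = sum(rates)
    if offered > LINK_CAPACITY:           # queue builds at the port
        rates = [r * 0.5 for r in rates]  # multiplicative decrease
    else:
        rates = [r + 1.0 for r in rates]  # additive increase (1 Gb/s/step)

fair_share = LINK_CAPACITY / N_SENDERS
print(f"fair share: {fair_share:.1f} Gb/s per sender")
print("final per-sender rates:", [round(r, 2) for r in rates])
```

Because the flows here are perfectly synchronized, every sender oscillates in lockstep around the 12.5 Gb/s fair share; real fabrics add ECN marking, pacing, and multipath spraying to damp these oscillations and avoid the deep buffers that incast otherwise demands.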
Syllabus
Scale Up and Scale Out AI Fabrics - A Polymorphic Ethernet Architecture for Systems of Systems
Taught by
Open Compute Project