Overview
Learn how to extend InfiniBand GPUDirect Async (IBGDA) support beyond NVIDIA GPUs and InfiniBand networks to create GPU-agnostic solutions for open AI systems. Discover how IBGDA can be implemented over Ethernet NICs running RoCE (RDMA over Converged Ethernet), extending a technology that has traditionally been limited to specific hardware configurations. Explore the technical challenges of adapting this high-performance networking technology for broader compatibility, and understand optimization strategies for Open Compute Project (OCP) AI systems and various GPU platforms. Examine how RDMA enables AI scale-out interconnects, and how IBGDA improves latency and message rates by allowing GPUs to initiate RDMA transactions directly between endpoints. Gain insight into the architectural considerations and practical implementation approaches that make advanced GPU networking capabilities accessible across different hardware ecosystems in AI infrastructure deployments.
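The latency and message-rate gains mentioned above come from taking the CPU out of the communication critical path. As a conceptual sketch only (not the actual IBGDA implementation; the step names below are illustrative, and real IBGDA works with NIC work-queue entries and doorbell registers mapped into GPU address space), the two control paths can be contrasted like this:

```python
# Conceptual model of GPU-initiated RDMA vs. the traditional CPU-proxy path.
# All step names are illustrative, not a real API.

def cpu_proxy_path():
    """Traditional path: a CPU proxy thread posts RDMA work on the GPU's behalf."""
    return [
        "GPU kernel writes a request to a host-visible queue",
        "CPU proxy thread polls the queue and picks up the request",
        "CPU builds the work-queue entry and rings the NIC doorbell",
        "NIC performs the RDMA transfer",
    ]

def ibgda_path():
    """IBGDA-style path: the GPU builds the work-queue entry and rings the
    NIC doorbell itself, with no CPU involvement on the critical path."""
    return [
        "GPU thread builds the work-queue entry in NIC-registered memory",
        "GPU thread rings the NIC doorbell via a GPU-mapped register",
        "NIC performs the RDMA transfer",
    ]

if __name__ == "__main__":
    print(f"CPU-proxy path: {len(cpu_proxy_path())} steps")
    print(f"IBGDA path:     {len(ibgda_path())} steps")
```

Because every GPU thread can post its own work without serializing through a proxy thread, many transactions can be initiated concurrently, which is where the message-rate improvement comes from.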
Syllabus
Enable IBGDA support in a GPU-agnostic manner for open AI systems
Taught by
Open Compute Project