Building Resilience for Large-Scale AI Training: GPU Management, Failure Detection, and Beyond
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Explore critical strategies for managing GPU failures and building resilient AI training systems in this conference talk from KubeCon + CloudNativeCon. Learn how to tackle the challenges of hardware failures when scaling AI training across thousands of GPUs and hundreds of machines. Discover effective approaches to GPU fault detection, network performance monitoring, and proactive problem identification using tools like NVIDIA DCGM. Gain insights into fault-tolerant distributed training principles that help minimize the impact of GPU failures. Drawing from real-world experience in cloud computing and large language model training, master best practices for identifying, remediating, and preventing GPU failures that can otherwise lead to increased costs and development delays. Understand why even minor performance degradation can significantly impact large-scale training jobs and how proper observability can help maintain optimal training efficiency.
Syllabus
Building Resilience for Large-Scale AI Training: GPU Man... Ganeshkumar Ashokavardhanan & Ace Eldeib
Taught by
CNCF [Cloud Native Computing Foundation]