Towards Building a Flexible, Efficient and Resilient Training with Adaptive Checkpoint

SNIAVideo via YouTube

Overview

Explore a conference presentation that introduces Universal Checkpointing (UCP), a system designed to enable flexible and efficient generative AI training with reconfigurable parallelism on large-scale AMD GPU clusters. Learn how it addresses a central problem in scaling AI infrastructure: as models and datasets grow, Mean Time Between Failures (MTBF) drops and jobs fail more often. Existing systems offer minimal support for reconfiguring parallelism mid-training because their distributed checkpoints are tightly coupled to the parallelism layout that wrote them, which slows progress after hardware failures or GPU re-allocation.

Understand the storage and memory performance challenges that UCP addresses through hardware and software architecture choices, including an in-depth analysis of the PyTorch GPU-Storage data path between AMD GPU clusters and high-performance remote storage systems. Examine how UCP's optimizations enable reconfiguration across a broad set of popular parallelism strategies for various generative AI models at minimal reconfiguration cost, improving flexibility and resilience in training workflows.

Gain insights into how these findings apply across entire AI data pipelines, including reducing cold start overhead during inference and improving checkpoint loading for downstream post-training tasks such as Supervised Fine-Tuning and Reinforcement Learning. The work is presented as a collaborative effort between UIUC and AMD.
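To make the idea of decoupling a checkpoint from its parallelism layout more concrete, here is a minimal Python sketch. It is not UCP's actual format or API, which this overview does not describe. It assumes the simplest possible case: one flat parameter split into contiguous per-rank shards. The function names (`shard`, `consolidate`, `reshard`) and the example sizes are hypothetical.

```python
from typing import List


def shard(flat: List[float], world_size: int) -> List[List[float]]:
    """Split a flat parameter into contiguous per-rank shards.

    Assumption: plain 1-D contiguous sharding; trailing shards may be
    shorter (or empty) when the length does not divide evenly.
    """
    per_rank = -(-len(flat) // world_size)  # ceiling division
    return [flat[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]


def consolidate(shards: List[List[float]]) -> List[float]:
    """Merge rank-ordered shards into one layout-independent list."""
    return [value for rank_shard in shards for value in rank_shard]


def reshard(shards: List[List[float]], new_world_size: int) -> List[List[float]]:
    """Re-split a checkpoint saved under one world size for another."""
    return shard(consolidate(shards), new_world_size)


# Hypothetical scenario: saved on 4 ranks, resumed on 3 after losing a GPU.
params = [float(i) for i in range(10)]
saved = shard(params, 4)
resumed = reshard(saved, 3)

# The parameter values are unchanged; only the layout differs.
assert consolidate(resumed) == params
```

The point of the intermediate consolidated form is that the saved state no longer depends on how many ranks wrote it, so a job can resume on a different number of GPUs. A real system would also have to handle optimizer state, tensor and pipeline parallelism, and storage throughput, none of which this sketch models.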

Syllabus

SNIA SDC 2025 - Towards Building a Flexible, Efficient & Resilient Training w/ Adaptive Checkpoint

Taught by

SNIAVideo
