Universal Checkpointing - A Flexible and Efficient Distributed Checkpointing System for Large-Scale DNN Training with Reconfigurable Parallelism

USENIX via YouTube

Overview

Explore a 15-minute conference presentation from USENIX ATC '25 that introduces Universal Checkpointing (UCP), a checkpointing system for flexible and efficient distributed deep neural network (DNN) training with reconfigurable parallelism. Learn how researchers from the University of Illinois Urbana-Champaign, Microsoft, and Snowflake address a critical limitation of existing DNN training systems: distributed checkpoints are tightly coupled to the specific model parallelism and hardware configuration under which they were saved. Discover how UCP decouples checkpoint structure from parallel training strategies and hardware configurations, so that large-scale training jobs can adapt efficiently to hardware failures and resource elasticity. Understand the pattern-based reconfiguration pipeline that maps checkpoint state automatically, flexibly, and efficiently to various parallelism strategies. Examine evaluation results across a range of DNN models, including state-of-the-art dense and sparse large language models (LLMs), showing that UCP enables reconfiguration for a broader set of widely used parallelism strategies than existing solutions while adding negligible reconfiguration cost. Gain insights into how UCP has been deployed in real LLM training workloads, improving their flexibility and resilience in dynamic hardware environments.
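
To make the decoupling idea concrete, here is a minimal sketch of resuming a job on a different number of ranks. It is an illustration under simplifying assumptions, not UCP's actual format or API: parameters are flat lists split evenly across ranks, and the names `shard`, `consolidate`, and `reshard` are hypothetical. Per-rank shards are first merged into one parallelism-agnostic copy of each parameter, which is then re-split for the new layout.

```python
# Illustrative sketch only: NOT the UCP implementation or any library's API.
# The function names and the dict-of-lists checkpoint layout are hypothetical.


def shard(values, num_ranks):
    """Split one flat parameter into equal contiguous slices, one per rank."""
    if len(values) % num_ranks != 0:
        raise ValueError("parameter length must divide evenly across ranks")
    size = len(values) // num_ranks
    return [values[r * size:(r + 1) * size] for r in range(num_ranks)]


def consolidate(rank_checkpoints):
    """Merge per-rank shards into one parallelism-agnostic copy per parameter."""
    names = rank_checkpoints[0].keys()
    return {
        name: [x for ckpt in rank_checkpoints for x in ckpt[name]]
        for name in names
    }


def reshard(consolidated, num_ranks):
    """Produce per-rank checkpoints for a new degree of parallelism."""
    per_param = {name: shard(vals, num_ranks) for name, vals in consolidated.items()}
    return [
        {name: shards[r] for name, shards in per_param.items()}
        for r in range(num_ranks)
    ]


# A toy model state, saved by a job that ran on 4 ranks.
full = {
    "layer0.weight": [float(i) for i in range(8)],
    "layer0.bias": [0.5, 1.5, 2.5, 3.5],
}
saved = reshard(full, 4)

# After losing hardware, the job resumes on 2 ranks from the same checkpoint.
resumed = reshard(consolidate(saved), 2)
```

Because the consolidated form does not depend on the rank count, the same saved state can be loaded under either layout. Real systems must also handle optimizer state, uneven or padded partitions, and multi-dimensional sharding patterns, which this sketch leaves out.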

Syllabus

USENIX ATC '25 - Universal Checkpointing: A Flexible and Efficient Distributed Checkpointing...

Taught by

USENIX
