Overview
This 15-minute conference presentation from USENIX ATC '25 introduces Universal Checkpointing (UCP), a checkpointing system for flexible and efficient distributed deep neural network (DNN) training with reconfigurable parallelism. The work comes from researchers at the University of Illinois Urbana-Champaign, Microsoft, and Snowflake.
Existing DNN training systems tightly couple distributed checkpoints to a specific model parallelism strategy and hardware configuration, so a checkpoint saved under one setup cannot easily be resumed under another. UCP decouples checkpoint structure from both, which lets large-scale training jobs adapt to hardware failures and to changes in available resources. A pattern-based reconfiguration pipeline maps checkpoint state automatically and efficiently onto different parallelism strategies.
The talk also covers evaluation results across a range of DNN models, including state-of-the-art dense and sparse large language models (LLMs). According to the authors, UCP supports reconfiguration for a broader set of widely used parallelism strategies than existing solutions while adding negligible reconfiguration cost, and it has been deployed in real LLM training workloads, where it improved their flexibility and resilience in dynamic hardware environments.
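The idea of decoupling checkpoint state from the parallelism layout can be illustrated with a minimal sketch. The code below is not UCP's format or API: the function names `consolidate` and `reshard`, the parameter name `layer0.weight`, and the even split of a flat parameter across ranks are all assumptions made for illustration. It merges per-rank slices saved under one degree of parallelism into a layout-independent form, then splits them again for a different number of ranks.

```python
from typing import Dict, List

# A shard maps parameter names to the slice of values held by one rank.
# (Hypothetical representation; real checkpoints store tensors, not lists.)
Shard = Dict[str, List[float]]


def consolidate(shards: List[Shard]) -> Dict[str, List[float]]:
    """Concatenate each parameter's per-rank slices, in rank order."""
    merged: Dict[str, List[float]] = {}
    for shard in shards:
        for name, values in shard.items():
            merged.setdefault(name, []).extend(values)
    return merged


def reshard(params: Dict[str, List[float]], world_size: int) -> List[Shard]:
    """Split each consolidated parameter evenly across `world_size` ranks."""
    shards: List[Shard] = [{} for _ in range(world_size)]
    for name, values in params.items():
        if len(values) % world_size != 0:
            raise ValueError(f"{name} cannot be split evenly across {world_size} ranks")
        chunk = len(values) // world_size
        for rank in range(world_size):
            shards[rank][name] = values[rank * chunk:(rank + 1) * chunk]
    return shards


# Checkpoint written by 4 ranks, each holding 2 of the 8 values.
old_shards = [{"layer0.weight": [float(2 * r), float(2 * r + 1)]} for r in range(4)]

# Resume on 2 ranks: go through the layout-independent form, then re-split.
new_shards = reshard(consolidate(old_shards), world_size=2)
```

In this toy setup the consolidated form is the same whether it is built from the four old shards or the two new ones, which is the property that makes the intermediate representation independent of the parallelism degree.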
Syllabus
USENIX ATC '25 - Universal Checkpointing: A Flexible and Efficient Distributed Checkpointing...
Taught by
USENIX