Overview
Learn about ByteCheckpoint, an industrial-grade checkpointing system for large-scale foundation model training, in this 13-minute conference presentation from NSDI '25. Discover how this unified system preserves training state during foundation model development, so that training can resume after failures and move between different GPU resources and parallelism configurations. Explore its key features: a parallelism-agnostic checkpoint representation that enables efficient load-time resharding, generic workflows that support multiple training frameworks and storage backends, and optimizations for high I/O efficiency and scalability. Examine the reported performance improvements, including a 54.20× average reduction in runtime checkpoint stalls and up to 9.96× faster saving times compared with existing open-source solutions. Understand how the system's monitoring tools support large-scale performance analysis and bottleneck detection in production environments, where foundation models of different sizes and training scales call for different frameworks and storage backends.
Syllabus
NSDI '25 - ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development
Taught by
USENIX