Overview
Learn about ByteCheckpoint, an industrial-grade checkpointing system for large-scale foundation model training, in this 13-minute conference presentation from NSDI '25. Discover how this unified system preserves training state during foundation model development, so that training can resume after failures and move between different GPU resources and parallelism configurations. Explore its key features: a parallelism-agnostic checkpoint representation that enables efficient load-time resharding, generic workflows that support multiple training frameworks and storage backends, and optimizations for high I/O efficiency and scalability. Examine the reported performance improvements, including a 54.20× average reduction in runtime checkpoint stalls and up to 9.96× faster saving times compared to existing open-source solutions. Understand how the system's monitoring tools support large-scale performance analysis and bottleneck detection in production environments, where the choice of framework and storage backend varies with model size and training scale.
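To make the idea of load-time resharding concrete, here is a minimal sketch in plain Python. It is not ByteCheckpoint's API or on-disk format. The names `ShardMeta`, `even_ranges`, `save_sharded`, and `load_resharded` are hypothetical, and a one-dimensional list stands in for a tensor. The assumption it illustrates is that if every saved shard records its position in the global tensor, a job with a different number of ranks can reassemble its own slice from whichever saved shards overlap it.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ShardMeta:
    """Hypothetical record: where a saved shard sits in the global tensor."""
    file_key: str
    global_offset: int
    length: int


def even_ranges(total: int, world_size: int) -> List[Tuple[int, int]]:
    """Split [0, total) into world_size contiguous (start, end) ranges."""
    base, extra = divmod(total, world_size)
    ranges, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def save_sharded(tensor: List[float], world_size: int):
    """Save one shard per rank, plus metadata that ignores the rank layout."""
    storage: Dict[str, List[float]] = {}
    metadata: List[ShardMeta] = []
    for rank, (start, end) in enumerate(even_ranges(len(tensor), world_size)):
        key = f"shard_{rank}"
        storage[key] = tensor[start:end]
        metadata.append(ShardMeta(key, start, end - start))
    return storage, metadata


def load_resharded(storage, metadata, total, new_world_size, rank):
    """Rebuild the slice owned by `rank` under a new world size."""
    start, end = even_ranges(total, new_world_size)[rank]
    out: List[float] = []
    for meta in sorted(metadata, key=lambda m: m.global_offset):
        # Intersect the wanted range with this saved shard's global range.
        lo = max(start, meta.global_offset)
        hi = min(end, meta.global_offset + meta.length)
        if lo < hi:
            local_lo = lo - meta.global_offset
            out.extend(storage[meta.file_key][local_lo:local_lo + (hi - lo)])
    return out


# Save with 4 ranks, then load with 3 ranks.
original = [float(i) for i in range(10)]
storage, metadata = save_sharded(original, world_size=4)
pieces = [load_resharded(storage, metadata, len(original), 3, r) for r in range(3)]
restored = [x for piece in pieces for x in piece]
```

In this sketch the metadata stores only global offsets, never the rank count used at save time, which is what lets the loader pick any new layout. A real system would also have to handle multi-dimensional tensors, optimizer state, and irregular sharding.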
Syllabus
NSDI '25 - ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development
Taught by
USENIX