Transparent, Infra-Level Checkpoint and Restore for Resilient AI/ML Workloads
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
This conference talk explores how infrastructure-level checkpointing can make AI/ML workloads more resilient than application-framework checkpointing alone. It explains how Checkpoint/Restore in Userspace (CRIU) can efficiently address scheduling and resilience issues in an application-agnostic way as production workloads scale. The presenters demonstrate a Kubernetes operator that uses CRIU, CRI-O, and cuda-checkpoint to checkpoint and hot-restart distributed ML workloads, and they describe synchronization mechanisms for JobSets running stateful workloads during node maintenance. The talk covers the use cases and limitations of platform-layer checkpoint/restore for stateful ML applications, gives a technical overview of the implementation, and discusses the roadmap for productionizing this emerging technology.
Syllabus
Transparent, Infra-Level Checkpoint and Restore for Resil... Ganeshkumar Ashokavardhanan & Bernie Wu
Taught by
CNCF [Cloud Native Computing Foundation]