Transparent, Infra-Level Checkpoint and Restore for Resilient AI/ML Workloads
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
This conference talk explores how infrastructure-level checkpointing can make AI/ML workloads more resilient than traditional application-framework checkpointing alone. Learn how Checkpoint/Restore in Userspace (CRIU) can efficiently address scheduling and resilience issues in an application-agnostic way as production workloads scale. The presenters demonstrate a Kubernetes operator that uses CRIU, CRI-O, and cuda-checkpoint to checkpoint and hot-restart distributed ML workloads. Discover synchronization mechanisms for JobSets running stateful workloads during node maintenance scenarios. The presentation covers use cases and limitations of platform-layer checkpoint/restore for stateful ML applications, provides a technical overview of the implementation, and discusses the roadmap for productionizing this emerging technology.
Syllabus
Transparent, Infra-Level Checkpoint and Restore for Resil... Ganeshkumar Ashokavardhanan & Bernie Wu
Taught by
CNCF [Cloud Native Computing Foundation]