Transparent, Infra-Level Checkpoint and Restore for Resilient AI/ML Workloads
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
This conference talk explores how infrastructure-level checkpointing can enhance resilience for AI/ML workloads beyond traditional application framework checkpointing. Learn how Checkpoint/Restore in Userspace (CRIU) can efficiently address scheduling and resilience issues in an application-agnostic way as production workloads scale. The presenters demonstrate a Kubernetes operator that leverages CRIU, CRI-O, and cuda-checkpoint to checkpoint and hot-restart distributed ML workloads. Discover synchronization mechanisms for JobSets running stateful workloads during node maintenance scenarios. The presentation covers use cases and limitations of platform-layer checkpoint/restore for stateful ML applications, provides a technical overview of the implementation, and discusses the roadmap for productionizing this emerging technology.
Syllabus
Transparent, Infra-Level Checkpoint and Restore for Resil... Ganeshkumar Ashokavardhanan & Bernie Wu
Taught by
CNCF [Cloud Native Computing Foundation]