Training Foundation Model Workloads on Kubernetes at Scale with MCAD
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Explore how IBM Research built Vela, a cloud-native AI supercomputer, to train foundation models on Kubernetes at scale. Learn about the challenges of supporting multiple frameworks, such as PyTorch, Ray, and Spark, for diverse research teams. Discover the role of the Multi-Cluster App Dispatcher (MCAD) in queuing custom resources for large-scale AI training, and how it interacts with the underlying Kubernetes scheduler. Gain insight into the implementation of gang priority, gang preemption, and fault tolerance for training processes that span hundreds of GPUs and run for extended periods. This conference talk is aimed at researchers and developers who scale foundation model workloads on Kubernetes.
Syllabus
Training Foundation Model Workloads on Kubernetes at Scale W... Abhishek Malvankar & Olivier Tardieu
Taught by
CNCF [Cloud Native Computing Foundation]