Cluster Management for Large Scale AI and GPUs: Challenges and Opportunities
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
This conference talk explores the challenges and opportunities in managing large GPU clusters for cloud native AI workloads. Learn how IBM Research configures and manages large-scale GPU clusters, addressing issues such as effective GPU utilization, dynamic sharing across teams, and performance degradations and faults that impact multi-GPU jobs. Discover how cloud native tools such as Kubeflow and Kueue serve as building blocks for the GPU clusters used across IBM Research for training, tuning, and inference jobs. The speakers share lessons learned in cluster configuration and showcase their development of Kubernetes-native automation for GPU health checks and reporting. The presentation also covers diagnostic techniques that enable dynamic quota adjustments to account for faulty GPUs and automatic steering of workloads away from problematic nodes.
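To make the idea of quota adjustment and workload steering concrete, here is a minimal sketch in plain Python. It is not the speakers' implementation: the `Node` type, its fields, and both function names are hypothetical, and the faulty-GPU counts are assumed to come from some health check. In a real Kubernetes cluster the corresponding actions might be updating a queue's quota and labeling or tainting nodes, but that mapping is an assumption here.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical view of one GPU node as reported by a health check."""
    name: str
    total_gpus: int
    faulty_gpus: int  # assumed input: number of GPUs flagged as unhealthy


def adjusted_quota(nodes):
    """Return the number of GPUs that can still be handed out as quota.

    Faulty GPUs are subtracted so that teams are not promised capacity
    the cluster cannot actually deliver.
    """
    return sum(n.total_gpus - n.faulty_gpus for n in nodes)


def schedulable_nodes(nodes):
    """Return names of nodes with no faulty GPUs.

    Steering multi-GPU jobs to these nodes avoids placing them on
    hardware that is known to be degraded.
    """
    return [n.name for n in nodes if n.faulty_gpus == 0]


if __name__ == "__main__":
    cluster = [
        Node("node-a", total_gpus=8, faulty_gpus=0),
        Node("node-b", total_gpus=8, faulty_gpus=2),
        Node("node-c", total_gpus=8, faulty_gpus=0),
    ]
    print("adjusted quota:", adjusted_quota(cluster))
    print("schedulable nodes:", schedulable_nodes(cluster))
```

The sketch treats a node with any faulty GPU as unsuitable for new jobs while still counting its healthy GPUs toward quota; a real policy could reasonably choose differently.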
Syllabus
Cluster Management for Large Scale AI and GPUs: Challenges and Oppor... Claudia Misale & David Grove
Taught by
CNCF [Cloud Native Computing Foundation]