Cluster Management for Large Scale AI and GPUs: Challenges and Opportunities
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
This conference talk explores the challenges and opportunities in managing large GPU clusters for cloud native AI workloads. Learn how IBM Research configures and manages large-scale GPU clusters, addressing issues such as effective GPU utilization, dynamic sharing across teams, and the performance degradations and faults that affect multi-GPU jobs. Discover how cloud native tools such as Kubeflow and Kueue serve as building blocks for the GPU clusters used across IBM Research for training, tuning, and inference jobs. The speakers share lessons learned in cluster configuration and showcase the development of Kubernetes-native automation for GPU health checks and reporting. The presentation also covers diagnostic techniques that enable dynamic quota adjustments to account for faulty GPUs and automatically steer workloads away from problematic nodes.
Syllabus
Cluster Management for Large Scale AI and GPUs: Challenges and Oppor... Claudia Misale & David Grove
Taught by
CNCF [Cloud Native Computing Foundation]