CNCF [Cloud Native Computing Foundation]

Cluster Management for Large Scale AI and GPUs: Challenges and Opportunities

CNCF [Cloud Native Computing Foundation] via YouTube

Overview

This conference talk explores the challenges and opportunities in managing large GPU clusters for cloud native AI workloads. Learn how IBM Research configures and manages large-scale GPU clusters, addressing issues such as keeping GPUs effectively utilized, sharing them dynamically across teams, and handling the performance degradation and faults that affect multi-GPU jobs. Discover how cloud native tools such as Kubeflow and Kueue serve as building blocks for the GPU clusters used across IBM Research for training, tuning, and inference jobs. The speakers share lessons learned in cluster configuration and showcase the development of Kubernetes-native automation for GPU health checks and reporting. The presentation also covers diagnostic techniques that enable dynamic quota adjustments to account for faulty GPUs and that automatically steer workloads away from problematic nodes.
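
As a rough illustration of the last point, and not the speakers' implementation, the sketch below shows one way health-check results could feed a quota adjustment and a list of nodes to steer work away from. The `NodeHealth` record, the function names, and the sample node names are all hypothetical; none of them are Kueue, Kubeflow, or Kubernetes APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeHealth:
    """Hypothetical per-node health report (not a Kueue or Kubeflow type)."""
    name: str
    total_gpus: int
    faulty_gpus: int


def adjusted_gpu_quota(nodes: list[NodeHealth]) -> int:
    """Return the number of GPUs still healthy across all nodes."""
    return sum(n.total_gpus - n.faulty_gpus for n in nodes)


def nodes_to_avoid(nodes: list[NodeHealth]) -> list[str]:
    """Return names of nodes with at least one faulty GPU, in input order."""
    return [n.name for n in nodes if n.faulty_gpus > 0]


# Assumed sample data: three 8-GPU nodes, one of which has two faulty GPUs.
reports = [
    NodeHealth("node-a", total_gpus=8, faulty_gpus=0),
    NodeHealth("node-b", total_gpus=8, faulty_gpus=2),
    NodeHealth("node-c", total_gpus=8, faulty_gpus=0),
]

print(adjusted_gpu_quota(reports))  # 22
print(nodes_to_avoid(reports))      # ['node-b']
```

In a real cluster the adjusted total would presumably be written back to whatever quota object the scheduler reads, and the flagged nodes would be excluded from placement; how the talk does either is not stated in this overview.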

Syllabus

Cluster Management for Large Scale AI and GPUs: Challenges and Oppor... Claudia Misale & David Grove

Taught by

CNCF [Cloud Native Computing Foundation]
