This conference talk explores the challenges and opportunities of managing large GPU clusters for cloud native AI workloads. Learn how IBM Research configures and manages large-scale GPU clusters, addressing issues such as effective GPU utilization, dynamic sharing across teams, and the performance degradations and faults that affect multi-GPU jobs. Discover how cloud native tools such as Kubeflow and Kueue serve as building blocks for the GPU clusters used across IBM Research for training, tuning, and inference jobs. The speakers share lessons learned in cluster configuration and showcase the development of Kubernetes-native automation for GPU health checks and reporting. The presentation also covers diagnostic techniques that enable dynamic quota adjustments to account for faulty GPUs and automatically steer workloads away from problematic nodes.
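
To make the last point concrete, here is a minimal Python sketch of the idea behind health-aware quota adjustment and workload steering. It is an illustration only, not the speakers' implementation: the `Node` type, the function names, the proportional scaling rule, and the sample data are all hypothetical. In a real cluster the equivalent logic would presumably act on Kueue quota settings and Kubernetes node scheduling controls rather than in-memory dictionaries.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    total_gpus: int
    faulty_gpus: int  # hypothetical count reported by a health check


def healthy_capacity(nodes):
    """Total number of GPUs that passed the (hypothetical) health check."""
    return sum(n.total_gpus - n.faulty_gpus for n in nodes)


def adjust_quotas(nominal, nodes):
    """Scale each team's nominal GPU quota down to fit healthy capacity.

    Quotas are unchanged when healthy capacity covers the nominal total.
    Scaled values are rounded down, so a few GPUs may stay unassigned.
    """
    capacity = healthy_capacity(nodes)
    total = sum(nominal.values())
    if total <= capacity:
        return dict(nominal)
    return {team: quota * capacity // total for team, quota in nominal.items()}


def nodes_to_avoid(nodes):
    """Names of nodes with at least one faulty GPU (assumed steering rule)."""
    return [n.name for n in nodes if n.faulty_gpus > 0]


# Hypothetical cluster: three 8-GPU nodes, one with two faulty GPUs.
nodes = [Node("node-a", 8, 0), Node("node-b", 8, 2), Node("node-c", 8, 0)]
nominal = {"team-x": 16, "team-y": 8}

adjusted = adjust_quotas(nominal, nodes)
avoid = nodes_to_avoid(nodes)
print(adjusted, avoid)
```

Proportional scaling is just one possible policy; a production system might instead prioritize certain teams or reserve spare capacity for failover.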