Overview
Explore how multi-cluster batch schedulers address the scalability challenges of AI/ML workloads in this conference talk from KubeCon + CloudNativeCon. Learn about the limitations of single-cluster schedulers when handling millions of diverse, resource-intensive batch jobs that require GPU bursts for training and CPU- and memory-intensive preprocessing tasks. Discover how multi-cluster schedulers federate Kubernetes clusters to dynamically extend capacity across on-premises and cloud environments while providing tenant isolation and zone outage resilience. Examine the implementation of critical batch scheduling features, including globally coordinated preemption for reclaiming capacity, fair-share quota enforcement to ensure equitable compute distribution among teams, and gang scheduling that reserves resources across clusters for synchronized multi-node job launches. Gain insights into the architectural approaches used by multi-cluster schedulers to overcome etcd scalability limits and single-region failure domains while maintaining efficient resource utilization across federated Kubernetes environments.
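To make the gang scheduling idea above concrete, here is a minimal sketch of all-or-nothing placement across several clusters. It is an illustration only, not the scheduler discussed in the talk: the `Cluster` type, the `gang_schedule` function, and the greedy largest-first placement are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical view of one federated cluster's spare GPU capacity."""
    name: str
    free_gpus: int


def gang_schedule(clusters, pods, gpus_per_pod):
    """Place every pod of a job, or none of them.

    Returns a {cluster_name: pod_count} plan and deducts the reserved GPUs,
    or returns None and leaves all clusters untouched if the gang cannot fit.
    """
    plan = {}
    remaining = pods
    # Greedy assumption: try the clusters with the most free GPUs first.
    for cluster in sorted(clusters, key=lambda c: c.free_gpus, reverse=True):
        if remaining == 0:
            break
        fit = min(remaining, cluster.free_gpus // gpus_per_pod)
        if fit > 0:
            plan[cluster.name] = fit
            remaining -= fit
    if remaining > 0:
        return None  # all-or-nothing: no partial reservation is made
    # Commit the reservation only once the whole gang is known to fit.
    for cluster in clusters:
        cluster.free_gpus -= plan.get(cluster.name, 0) * gpus_per_pod
    return plan


clusters = [Cluster("on-prem", 8), Cluster("cloud", 4)]
first = gang_schedule(clusters, pods=5, gpus_per_pod=2)
second = gang_schedule(clusters, pods=2, gpus_per_pod=2)
```

In this sketch the first job spans both clusters, while the second is rejected as a whole because only one of its two pods would fit. A real scheduler would also have to handle concurrent reservations, preemption, and per-team quotas, which are left out here.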
Syllabus
Multi-Cluster Wars: The Scheduler Awakens - Dejan Pejchev & Priyanka Ravi, G-Research
Taught by
CNCF [Cloud Native Computing Foundation]