Multi-Cluster Wars - The Scheduler Awakens

Explore how multi-cluster batch schedulers address the scalability challenges of AI/ML workloads in this conference talk from KubeCon + CloudNativeCon. Learn where single-cluster schedulers fall short when handling millions of diverse, resource-intensive batch jobs, from training runs that need bursts of GPUs to CPU- and memory-intensive preprocessing tasks. Discover how multi-cluster schedulers federate Kubernetes clusters to extend capacity dynamically across on-premises and cloud environments while providing tenant isolation and resilience to zone outages. Examine the implementation of critical batch scheduling features: globally coordinated preemption to reclaim capacity, fair-share quota enforcement to distribute compute equitably among teams, and gang scheduling that reserves resources across clusters so multi-node jobs launch in sync. Gain insights into the architectural approaches multi-cluster schedulers use to overcome etcd scalability limits and single-region failure domains while keeping resource utilization efficient across federated Kubernetes environments.
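
To make the gang scheduling idea concrete, here is a minimal Python sketch of all-or-nothing placement across clusters. It is an illustration only, not the implementation presented in the talk: the `Cluster` class, the `gang_schedule` function, the cluster names, and the greedy "most free GPUs first" ordering are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical view of one member cluster's spare GPU capacity."""
    name: str
    free_gpus: int
    reserved: dict = field(default_factory=dict)  # job_id -> GPUs held


def gang_schedule(job_id, workers, gpus_per_worker, clusters):
    """Place every worker of a job or none of them.

    Returns {cluster_name: worker_count} on success, or None if the
    federation cannot hold the whole gang (nothing is reserved then).
    """
    plan = {}
    remaining = workers

    # Phase 1: build a tentative plan without touching cluster state.
    # Assumed policy: fill the clusters with the most free GPUs first.
    for c in sorted(clusters, key=lambda c: c.free_gpus, reverse=True):
        if remaining == 0:
            break
        fit = min(remaining, c.free_gpus // gpus_per_worker)
        if fit > 0:
            plan[c.name] = fit
            remaining -= fit

    if remaining > 0:
        return None  # partial placement is useless for a gang; hold nothing

    # Phase 2: the whole gang fits, so commit the reservations.
    by_name = {c.name: c for c in clusters}
    for name, count in plan.items():
        gpus = count * gpus_per_worker
        by_name[name].free_gpus -= gpus
        by_name[name].reserved[job_id] = gpus
    return plan


clusters = [Cluster("onprem", 8), Cluster("cloud", 4)]
first = gang_schedule("train-1", workers=5, gpus_per_worker=2, clusters=clusters)
second = gang_schedule("train-2", workers=2, gpus_per_worker=2, clusters=clusters)
print(first, second)
```

The two-phase shape (plan, then commit) is the point of the sketch: a job that cannot be fully placed leaves capacity untouched, so it never blocks other work with a partial reservation. A real scheduler would also need to handle concurrent requests and reservation timeouts, which this example omits.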