Unlocking Peak Workload Performance and Efficiency with Ray on Kubernetes

Learn how to build an efficient, cost-effective, accelerator-aware ML platform with Ray on Kubernetes in this 27-minute conference talk from Ray Summit 2025. See how GKE Custom Compute Classes and label-based scheduling lay the foundation for smarter resource management, letting workloads shift across spot instances, different GPU and TPU types, and reserved capacity to reduce cloud costs. Explore how GKE image streaming speeds up container image and model loading to cut startup latency, and how Kueue, the Kubernetes-native job queueing system, integrates directly with Ray to provide fair-share access to compute resources. Learn techniques for large-scale training on TPUs using Ray Train with JAX and PyTorch, and use the TPU metrics integrated into the Ray Dashboard to simplify performance debugging. Examine high-throughput serving and batch inference strategies that dynamically shift workloads between GPUs and TPUs to optimize for both cost and latency. Preview upcoming Ray features that enhance large-scale training, high-throughput inference, and complex RL and agentic workloads running on a secure, decoupled GKE architecture. Come away with a blueprint for the recommended Ray on Kubernetes architecture, aimed at improving performance, cost-efficiency, and operational simplicity for ML workloads at any scale.