Overview
Learn how to deploy large-scale LLM inference systems without becoming a Kubernetes expert in this conference talk from Ray Summit 2025. Discover how AWS's EKS Auto Mode combined with Ray Serve creates a fully automated, production-grade serving platform that eliminates operational overhead and infrastructure management complexity. Explore the transformation from labor-intensive, manually managed clusters to self-healing, cost-efficient systems through a real-world deployment example.

Master intelligent node provisioning tailored for AI workloads, automatic workload-driven scaling for CPUs and GPUs, built-in observability, seamless GPU lifecycle management, burst-capacity handling that maintains low latency under unpredictable load, and cost-optimization strategies for expensive inference accelerators. Understand how Ray Serve orchestrates high-throughput, multi-model LLM inference to create resilient systems that scale from prototype to production with minimal complexity.

Gain a clear blueprint for deploying scalable, reliable, and cost-efficient LLM inference on AWS, ideal for ML engineers who want to focus on AI applications rather than infrastructure, and for platform teams seeking turnkey AI infrastructure.
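The workload-driven scaling the talk describes can be sketched as a simple replica calculator: size the fleet so each replica handles roughly a target number of in-flight requests, clamped between a floor and a ceiling. This mirrors the shape of Ray Serve's autoscaling configuration (minimum/maximum replicas and a per-replica request target), but the function below is a standalone illustration with hypothetical names, not Ray's actual implementation.

```python
import math


def desired_replicas(ongoing_requests: int,
                     target_per_replica: int,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Toy workload-driven scaler (illustration only, not Ray Serve code).

    Sizes the fleet so each replica carries roughly `target_per_replica`
    in-flight requests, clamped to [min_replicas, max_replicas].
    """
    raw = math.ceil(ongoing_requests / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


# Quiet period: scale down to the floor.
print(desired_replicas(0, 16, 1, 8))    # -> 1
# Burst: 200 in-flight requests at 16 per replica needs 13, capped at 8;
# the overflow is where burst capacity (extra headroom) keeps latency low.
print(desired_replicas(200, 16, 1, 8))  # -> 8
```

Clamping the upper bound is what makes GPU cost predictable: expensive accelerators never exceed a budgeted ceiling, and bursts beyond it are absorbed by queuing rather than unbounded scale-out.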
Syllabus
Scaling Production LLM Inference Using EKS Auto Mode & Ray Serve | Ray Summit 2025
Taught by
Anyscale