LLMs on Kubernetes - Squeeze 5x GPU Efficiency With Cache, Route, Repeat!
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn how to dramatically improve GPU efficiency for large language model deployments on Kubernetes in this conference talk from KubeCon + CloudNativeCon. Discover battle-tested computer science principles that can increase your cluster's throughput by up to 5x, with no magic required. Explore the open-source "Production Stack" project, a first-party vLLM initiative that supercharges vLLM on Kubernetes through intelligent caching strategies that offload KV cache to CPU, disk, or remote storage to eliminate redundant computation. Master smarter routing techniques that send requests to GPUs already holding pre-computed caches for lower Time To First Token (TTFT), implement enhanced fault tolerance that can migrate live requests mid-generation when a node fails, and speed up RAG workflows by blending non-prefix caches from retrieved chunks with CacheBlend for 3x faster TTFT. Examine real-world benchmarks demonstrating 5x throughput improvements over a vanilla vLLM deployment. Gain actionable deployment patterns for faster, cheaper, and more reliable LLM infrastructure, whether you're an Infrastructure Engineer, ML Developer, or Site Reliability Engineer facing GPU shortages and high inference costs in production.
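To make the "smarter routing" idea concrete, below is a minimal Python sketch of prefix-cache-aware routing: requests that share a prompt prefix (such as a common system prompt) are hashed to the same vLLM replica, so that replica's already-computed KV cache for the prefix is reused and TTFT drops. This is an illustrative assumption, not the Production Stack's actual router API; the class, endpoint names, and the fixed-length-prefix heuristic are all hypothetical.

```python
# Hypothetical sketch of prefix-cache-aware routing (NOT the Production
# Stack's real implementation; all names here are illustrative).
import hashlib


class PrefixCacheRouter:
    """Route requests sharing a prompt prefix to the same vLLM replica,
    so the replica's KV cache for that prefix can be reused (lower TTFT)."""

    def __init__(self, replicas: list[str], prefix_chars: int = 512):
        self.replicas = replicas          # e.g. Kubernetes pod endpoints
        self.prefix_chars = prefix_chars  # how much of the prompt keys the route

    def route(self, prompt: str) -> str:
        # Hash only the leading portion of the prompt: requests that share a
        # system prompt or few-shot preamble hash identically and therefore
        # land on the replica that already holds the matching KV cache.
        prefix = prompt[: self.prefix_chars].encode("utf-8")
        digest = int.from_bytes(hashlib.sha256(prefix).digest()[:8], "big")
        return self.replicas[digest % len(self.replicas)]


router = PrefixCacheRouter(["vllm-0:8000", "vllm-1:8000", "vllm-2:8000"])
shared = "You are a helpful assistant. " * 10
print(router.route(shared + "Summarize this article."))
print(router.route(shared + "Translate this sentence."))
# Both requests map to the same replica, hitting its pre-computed prefix cache.
```

A production router would also weigh load and cache-eviction state rather than hashing alone, but the core trade-off is the same: cache locality versus even load spreading across GPUs.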
Syllabus
LLMs on Kubernetes: Squeeze 5x GPU Efficiency With Cache, Route, Repeat! - Yuhan Liu & Suraj Deshmukh
Taught by
CNCF [Cloud Native Computing Foundation]