Routing Stateful AI Workloads in Kubernetes
Overview
Explore advanced routing strategies for stateful AI workloads in Kubernetes through this 30-minute conference talk from CNCF. Learn why traditional Kubernetes routing approaches fall short for modern generative AI applications, which require context-aware routing to maximize performance and reduce costs.

Discover layered routing strategies ranging from basic round-robin to sophisticated KV-cache-aware load balancing, understanding when to apply each approach and its performance implications. Gain insights from the speakers' experience developing llm-d, a framework built on the Kubernetes Gateway API Inference Extension through collaboration between Google, IBM Research, and Red Hat.

Master routing patterns for long-context and sessionful traffic, implement global cache indices and local offloading for intelligent routing decisions, and examine benchmarks demonstrating improvements in latency, cache hit rates, and GPU utilization. Understand practical methods for adopting cache-aware routing without major infrastructure changes, making this essential viewing for anyone scaling multi-turn, agentic, or LLM-powered workloads in Kubernetes environments.
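To make the contrast between round-robin and KV-cache-aware routing concrete, here is a minimal Python sketch. It assumes a global index that records which replicas already hold the KV cache for a given prompt prefix; all names (`Replica`, `cache_aware_route`, the pod names) are illustrative and not part of the llm-d or Gateway API Inference Extension APIs.

```python
# Hypothetical sketch: round-robin vs. KV-cache-aware replica selection.
# Assumption: each replica tracks its active load and the prompt prefixes
# whose KV cache it currently holds (a stand-in for a global cache index).
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Replica:
    name: str
    active_requests: int = 0
    cached_prefixes: set = field(default_factory=set)


def round_robin_router(replicas):
    """Baseline: stateless rotation, ignoring cache locality and load."""
    rr = cycle(replicas)

    def route(_prompt_prefix):
        return next(rr)

    return route


def cache_aware_route(replicas, prompt_prefix):
    """Prefer replicas whose KV cache already holds the prefix
    (avoiding prefill recomputation); break ties by current load."""
    hits = [r for r in replicas if prompt_prefix in r.cached_prefixes]
    pool = hits if hits else replicas  # fall back to least-loaded replica
    return min(pool, key=lambda r: r.active_requests)


replicas = [
    Replica("pod-a", active_requests=3, cached_prefixes={"chat:42"}),
    Replica("pod-b", active_requests=1),
    Replica("pod-c", active_requests=0),
]

# Cache hit wins even on a busier replica; cache miss falls back to load.
print(cache_aware_route(replicas, "chat:42").name)  # pod-a
print(cache_aware_route(replicas, "chat:99").name)  # pod-c
```

The design point the talk highlights is visible even in this toy version: a session pinned to the replica holding its KV cache skips redundant prefill work, which is where the latency and GPU-utilization gains come from.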
Syllabus
Routing Stateful AI Workloads in Kubernetes - Maroon Ayoub, IBM & Michey Mehta, Red Hat
Taught by
CNCF [Cloud Native Computing Foundation]