Routing Stateful AI Workloads in Kubernetes
Overview
Explore advanced routing strategies for stateful AI workloads in Kubernetes through this 30-minute conference talk from CNCF. Learn why traditional Kubernetes routing approaches fall short for modern generative AI applications, which require context-aware routing to maximize performance and reduce costs.

Discover layered routing strategies ranging from basic round-robin to sophisticated KV-cache-aware load balancing, understanding when to apply each approach and its performance implications. Gain insights from the speakers' experience developing llm-d, a framework built on the Kubernetes Gateway API Inference Extension through collaboration between Google, IBM Research, and Red Hat.

Master routing patterns for long-context and sessionful traffic, implement global cache indices and local offloading for intelligent routing decisions, and examine benchmarks demonstrating improvements in latency, cache hit rates, and GPU utilization. Understand practical methods for adopting cache-aware routing without major infrastructure changes, making this essential viewing for anyone scaling multi-turn, agentic, or LLM-powered workloads in Kubernetes environments.
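To make the contrast between round-robin and KV-cache-aware routing concrete, here is a minimal Python sketch. It assumes a global index that records which replicas already hold the KV cache for a given prompt prefix; all names (`Replica`, `cache_aware_route`, the pod names) are illustrative and not part of the llm-d or Gateway API Inference Extension APIs.

```python
# Hypothetical sketch: round-robin vs. KV-cache-aware replica selection.
# Assumption: each replica tracks its active load and the prompt prefixes
# whose KV cache it currently holds (a stand-in for a global cache index).
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Replica:
    name: str
    active_requests: int = 0
    cached_prefixes: set = field(default_factory=set)


def round_robin_router(replicas):
    """Baseline: stateless rotation, ignoring cache locality and load."""
    rr = cycle(replicas)

    def route(_prompt_prefix):
        return next(rr)

    return route


def cache_aware_route(replicas, prompt_prefix):
    """Prefer replicas whose KV cache already holds the prefix
    (avoiding prefill recomputation); break ties by current load."""
    hits = [r for r in replicas if prompt_prefix in r.cached_prefixes]
    pool = hits if hits else replicas  # fall back to least-loaded replica
    return min(pool, key=lambda r: r.active_requests)


replicas = [
    Replica("pod-a", active_requests=3, cached_prefixes={"chat:42"}),
    Replica("pod-b", active_requests=1),
    Replica("pod-c", active_requests=0),
]

# Cache hit wins even on a busier replica; cache miss falls back to load.
print(cache_aware_route(replicas, "chat:42").name)  # pod-a
print(cache_aware_route(replicas, "chat:99").name)  # pod-c
```

The design point the talk highlights is visible even in this toy version: a session pinned to the replica holding its KV cache skips redundant prefill work, which is where the latency and GPU-utilization gains come from.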
Syllabus
Routing Stateful AI Workloads in Kubernetes - Maroon Ayoub, IBM & Michey Mehta, Red Hat
Taught by
CNCF [Cloud Native Computing Foundation]