Overview
Explore advanced routing strategies for stateful AI workloads in Kubernetes in this 30-minute conference talk from CNCF. Learn why traditional Kubernetes routing approaches fall short for generative AI applications, which need context-aware routing to maximize performance and reduce costs. Discover layered routing strategies, from basic round-robin to KV-cache-aware load balancing, along with when to apply each one and what it means for performance. The speakers draw on their experience developing llm-d, a framework built on the Kubernetes Gateway API Inference Extension through a collaboration between Google, IBM Research, and Red Hat. Learn routing patterns for long-context and sessionful traffic, implement global cache indices and local offloading to inform routing decisions, and examine benchmarks showing improvements in latency, cache hit rates, and GPU utilization. Finally, understand practical ways to adopt cache-aware routing without major infrastructure changes, which makes this talk useful for anyone scaling multi-turn, agentic, or other LLM-powered workloads in Kubernetes.
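To make the contrast between round-robin and KV-cache-aware routing concrete, here is a minimal Python sketch, assuming a toy model in which each pod tracks which prompt-prefix blocks sit in its KV cache. It is not llm-d's implementation or the Gateway API Inference Extension's API; `BLOCK_SIZE`, `block_hashes`, `Pod`, `CacheAwareRouter`, `load_weight`, and `serve` are all hypothetical names, and the scoring rule (cached prefix blocks minus a load penalty) is one illustrative heuristic among many.

```python
from dataclasses import dataclass, field
from itertools import count
import hashlib

BLOCK_SIZE = 4  # hypothetical: tokens per KV-cache block (real engines use larger blocks)


def block_hashes(tokens: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each full block of tokens, chaining in the previous block's hash
    so that each hash identifies the whole prefix ending at that block."""
    hashes: list[str] = []
    prev = ""
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for i in range(0, full, block_size):
        payload = prev + ",".join(map(str, tokens[i:i + block_size]))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(prev)
    return hashes


@dataclass
class Pod:
    name: str
    cached: set[str] = field(default_factory=set)  # prefix-block hashes held in this pod's KV cache
    active: int = 0  # in-flight requests; a real system would keep this updated

    def prefix_hits(self, hashes: list[str]) -> int:
        """Count leading blocks already cached (reuse stops at the first miss)."""
        hits = 0
        for h in hashes:
            if h not in self.cached:
                break
            hits += 1
        return hits


class RoundRobinRouter:
    """Baseline: cycle through pods, ignoring cache contents."""

    def __init__(self, pods: list[Pod]):
        self.pods = pods
        self._counter = count()

    def pick(self, tokens: list[int]) -> Pod:
        return self.pods[next(self._counter) % len(self.pods)]


class CacheAwareRouter:
    """Prefer the pod with the longest cached prefix, penalized by its load."""

    def __init__(self, pods: list[Pod], load_weight: float = 1.0):
        self.pods = pods
        self.load_weight = load_weight  # hypothetical cache-vs-load trade-off knob

    def pick(self, tokens: list[int]) -> Pod:
        hashes = block_hashes(tokens)
        # max() returns the first pod among ties, so an idle, cold cluster falls back to list order.
        return max(self.pods, key=lambda p: p.prefix_hits(hashes) - self.load_weight * p.active)


def serve(router, tokens: list[int]) -> tuple[str, int]:
    """Route a request, report how many cached prefix blocks were reused,
    then record the request's blocks in the chosen pod's cache."""
    pod = router.pick(tokens)
    hashes = block_hashes(tokens)
    hits = pod.prefix_hits(hashes)
    pod.cached.update(hashes)
    return pod.name, hits


if __name__ == "__main__":
    # A three-turn conversation: each turn resends the growing history.
    turns = [list(range(8)), list(range(16)), list(range(24))]
    for router_cls in (RoundRobinRouter, CacheAwareRouter):
        pods = [Pod("pod-a"), Pod("pod-b"), Pod("pod-c")]
        router = router_cls(pods)
        print(router_cls.__name__, [serve(router, t) for t in turns])
```

In this toy run, round-robin scatters the three turns across three pods and reuses no cached blocks, while the cache-aware router keeps the session on `pod-a` and reuses 2 and then 4 prefix blocks. The `load_weight` penalty is there so that a hot pod with a good cache match can still lose to an idle one; how real systems balance those signals is exactly the kind of trade-off the talk covers.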
Syllabus
Routing Stateful AI Workloads in Kubernetes - Maroon Ayoub, IBM & Michey Mehta, Red Hat
Taught by
CNCF (Cloud Native Computing Foundation)