LLMs on Kubernetes - Squeeze 5x GPU Efficiency With Cache, Route, Repeat!
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
This KubeCon + CloudNativeCon talk shows how to improve GPU efficiency for Large Language Model (LLM) deployments on Kubernetes. The speakers argue that well-established computer science principles, rather than magic solutions, can raise a cluster's efficiency by up to 5x.
The talk centers on the open-source "Production Stack" project, a first-party vLLM initiative for running vLLM on Kubernetes. It covers four techniques: caching that offloads the KV cache to CPU, disk, or remote storage so that the same computation is not repeated; routing that sends each request to a GPU already holding a matching pre-computed cache, which lowers Time To First Token (TTFT); fault tolerance that can migrate live requests mid-generation when a failure occurs; and CacheBlend, which blends non-prefix caches from retrieved chunks in RAG workflows for a reported 3x faster TTFT.
The speakers present real-world benchmarks showing 5x higher throughput than vanilla vLLM, along with deployment patterns aimed at faster, cheaper, and more reliable LLM infrastructure. The talk is intended for Infrastructure Engineers, ML Developers, and Site Reliability Engineers dealing with GPU shortages and high inference costs in production.
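To make the routing idea concrete, here is a minimal Python sketch of cache-aware routing: each request goes to the replica whose cached prompts share the longest token prefix with the new prompt, with ties broken by current load. This is an illustration only, not the Production Stack's actual router or API; the names `Replica`, `shared_prefix_len`, `pick_replica`, and `route` are hypothetical, and real systems track KV cache blocks rather than whole token lists.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical stand-in for one vLLM serving replica."""
    name: str
    # Token sequences whose KV cache this replica is assumed to hold.
    cached_prefixes: list = field(default_factory=list)
    in_flight: int = 0


def shared_prefix_len(a, b):
    """Number of leading tokens that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_replica(replicas, prompt_tokens):
    """Prefer the longest cached prefix; break ties by lowest load."""
    def score(replica):
        best = max(
            (shared_prefix_len(p, prompt_tokens) for p in replica.cached_prefixes),
            default=0,
        )
        return (best, -replica.in_flight)

    return max(replicas, key=score)


def route(replicas, prompt_tokens):
    """Choose a replica and record that it now caches this prompt."""
    chosen = pick_replica(replicas, prompt_tokens)
    chosen.cached_prefixes.append(list(prompt_tokens))
    chosen.in_flight += 1
    return chosen.name


replicas = [Replica("gpu-a"), Replica("gpu-b")]
first = route(replicas, [1, 2, 3, 4])   # no caches yet, so load decides
second = route(replicas, [1, 2, 3, 9])  # shares a 3-token prefix with the first prompt
third = route(replicas, [7, 8])         # no shared prefix, so the idle replica wins
```

In this toy run the second request follows the first to the same replica because of the shared prefix, while the unrelated third request goes to the idle one. A production router would also need cache eviction and would decrement `in_flight` when requests finish.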
Syllabus
LLMs on Kubernetes: Squeeze 5x GPU Efficiency With Cache, Route, Repea... Yuhan Liu & Suraj Deshmukh
Taught by
CNCF [Cloud Native Computing Foundation]