
CNCF [Cloud Native Computing Foundation]

LLMs on Kubernetes - Squeeze 5x GPU Efficiency With Cache, Route, Repeat!

CNCF [Cloud Native Computing Foundation] via YouTube

Overview

This conference talk from KubeCon + CloudNativeCon shows how to improve GPU efficiency for Large Language Model (LLM) deployments on Kubernetes. It applies established computer science principles rather than one-off tricks, and the speakers report up to a 5x gain in cluster efficiency. The talk centers on the open-source "Production Stack" project, a first-party vLLM initiative for running vLLM on Kubernetes. It covers four techniques. Caching: offload the KV Cache to CPU, disk, or remote storage to eliminate redundant computation. Routing: match each request to a GPU that already holds a pre-computed cache, which lowers Time To First Token (TTFT). Fault tolerance: migrate live requests mid-generation when a failure occurs. RAG workflows: blend non-prefix caches from retrieved chunks using CacheBlend, for a reported 3x faster TTFT. The talk also presents real-world benchmarks showing 5x throughput improvements over vanilla vLLM, along with deployment patterns for faster, cheaper, and more reliable LLM infrastructure. It is aimed at Infrastructure Engineers, ML Developers, and Site Reliability Engineers dealing with GPU shortages and high inference costs in production.
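To make the routing idea concrete, here is a minimal Python sketch of cache-aware routing: each request goes to the replica that already holds the longest matching prefix, with ties broken by load. This is an illustrative toy, not the Production Stack router or any vLLM API. The names `Replica`, `PrefixAwareRouter`, `block_hashes`, and the block size of 4 tokens are all assumptions made for the example.

```python
from dataclasses import dataclass, field


def block_hashes(tokens, block_size=4):
    """Return one chained hash per full block of tokens.

    Each hash covers the block and everything before it, so two requests
    share a hash only if they share the entire prefix up to that block.
    """
    hashes = []
    prev = 0
    for start in range(0, len(tokens) - block_size + 1, block_size):
        prev = hash((prev, tuple(tokens[start:start + block_size])))
        hashes.append(prev)
    return hashes


@dataclass
class Replica:
    """Hypothetical model server: tracks cached prefix blocks and load."""
    name: str
    cached: set = field(default_factory=set)
    in_flight: int = 0


class PrefixAwareRouter:
    """Toy router: prefer the longest cached prefix, then the lowest load."""

    def __init__(self, replicas, block_size=4):
        self.replicas = list(replicas)
        self.block_size = block_size

    def _matched_blocks(self, replica, hashes):
        # Count leading blocks already cached; stop at the first miss,
        # because a KV cache is only reusable as a contiguous prefix.
        count = 0
        for h in hashes:
            if h not in replica.cached:
                break
            count += 1
        return count

    def route(self, tokens):
        hashes = block_hashes(tokens, self.block_size)
        best = max(
            self.replicas,
            key=lambda r: (self._matched_blocks(r, hashes), -r.in_flight),
        )
        # Assume the chosen replica now caches every block of this request.
        best.cached.update(hashes)
        best.in_flight += 1
        return best.name


if __name__ == "__main__":
    router = PrefixAwareRouter([Replica("gpu-a"), Replica("gpu-b")])
    system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
    print(router.route(system_prompt))            # first request, no cache yet
    print(router.route(system_prompt + [9, 10]))  # shares the cached prefix
    print(router.route([9, 9, 9, 9]))             # unrelated prompt
```

A real deployment would also need cache eviction, request completion tracking, and a way to learn which blocks each server actually holds. The sketch leaves all of that out to keep the routing decision visible.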

Syllabus

LLMs on Kubernetes: Squeeze 5x GPU Efficiency With Cache, Route, Repea... Yuhan Liu & Suraj Deshmukh

Taught by

CNCF [Cloud Native Computing Foundation]

