
KV-Cache Storage Offloading for Efficient Inference in LLMs

SNIAVideo via YouTube

Overview

Explore KV-cache storage offloading techniques for optimizing large language model inference in this 51-minute conference talk from SNIA SDC 2025. Learn how the growing memory demands of Key-Value (KV) caches exceed GPU capacity as LLMs serve more users and generate longer outputs, creating bottlenecks for large-scale inference systems. Discover how relocating attention cache data to high-speed, low-latency storage tiers alleviates GPU memory constraints and unlocks new levels of scalability for serving large models. Dive deep into inference workload architecture, understand the structure and role of the KV-cache, and examine a practical implementation of storage offloading. Gain insights into why external storage is essential for modern inference workloads, what makes the KV-cache a bottleneck in large-scale deployments, and how inference engines work with KV-cache offloading enhancements. Learn when and how to implement KV-cache storage offloading to improve inference performance, presented by Ugur Kaynar from Dell Technologies.
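To make the core idea concrete, below is a minimal, illustrative sketch of KV-cache offloading: a small cache manager that keeps a fixed budget of "hot" KV blocks in device memory and spills cold blocks to a storage tier, reloading them on demand. The class name, tiering policy, and file layout are assumptions for illustration only and do not reflect the specific implementation presented in the talk.

```python
# Minimal sketch of KV-cache storage offloading (illustrative only; names and
# eviction policy are assumptions, not the talk's implementation).
import os
import tempfile
import torch


class KVCacheOffloader:
    """Keeps hot KV blocks in device memory, spills cold ones to storage."""

    def __init__(self, max_device_blocks: int, spill_dir: str):
        self.max_device_blocks = max_device_blocks
        self.spill_dir = spill_dir
        self.device_blocks = {}  # block_id -> (key_tensor, value_tensor)
        self.spilled = {}        # block_id -> file path on the storage tier

    def put(self, block_id: int, k: torch.Tensor, v: torch.Tensor):
        # Evict the oldest device-resident block when the device budget is exceeded.
        if len(self.device_blocks) >= self.max_device_blocks:
            victim_id, (vk, vv) = next(iter(self.device_blocks.items()))
            path = os.path.join(self.spill_dir, f"kv_{victim_id}.pt")
            torch.save((vk.cpu(), vv.cpu()), path)  # move the cold block to storage
            self.spilled[victim_id] = path
            del self.device_blocks[victim_id]
        self.device_blocks[block_id] = (k, v)

    def get(self, block_id: int, device: str = "cpu"):
        # Hit in device memory: return directly; otherwise reload from storage.
        if block_id in self.device_blocks:
            return self.device_blocks[block_id]
        k, v = torch.load(self.spilled[block_id])
        k, v = k.to(device), v.to(device)
        self.put(block_id, k, v)  # promote the block back to the hot tier
        return k, v


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        cache = KVCacheOffloader(max_device_blocks=2, spill_dir=d)
        for i in range(4):  # 4 blocks, only 2 fit in the "device" tier at once
            cache.put(i, torch.randn(8, 16, 64), torch.randn(8, 16, 64))
        k, v = cache.get(0)  # block 0 was spilled and is reloaded transparently
        print("reloaded block 0:", k.shape, v.shape)
```

In a real serving stack the spill target would be a high-speed, low-latency storage tier (e.g., local NVMe or a disaggregated store) and transfers would be asynchronous and batched, but the hot/cold tiering pattern above is the essence of what offloading adds to an inference engine.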

Syllabus

SNIA SDC 2025 - KV-Cache Storage Offloading for Efficient Inference in LLMs

Taught by

SNIAVideo

