Overview
Explore KV-cache storage offloading techniques for optimizing large language model inference in this 51-minute conference talk from SNIA SDC 2025. Learn how the growing memory demands of Key-Value caches exceed GPU capacity as LLMs serve more users and generate longer outputs, creating bottlenecks for large-scale inference systems. Discover how relocating attention cache data to high-speed, low-latency storage tiers alleviates GPU memory constraints and unlocks new levels of scalability for serving large models.

Dive into inference workload architecture, understand the structure and role of the KV-cache, and examine a practical implementation of storage offloading. Gain insights into why external storage is essential for modern inference workloads, what makes the KV-cache a bottleneck in large-scale deployments, and how inference engines work with KV-cache offloading enhancements. Learn when and how to implement KV-cache storage offloading to improve inference performance, presented by Ugur Kaynar from Dell Technologies.
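To make the offloading idea concrete, here is a minimal, hypothetical sketch (not the talk's actual implementation): a toy KV-cache that keeps a bounded number of entries in a fast tier (standing in for GPU memory) and evicts the least-recently-used entries to a slower storage tier (here, local disk), reloading them transparently on access. The class name, capacity scheme, and file layout are all illustrative assumptions.

```python
import pickle
from collections import OrderedDict
from pathlib import Path


class KVCacheOffloader:
    """Toy KV-cache with LRU offload to a storage tier.

    Illustrative sketch only: real inference engines manage paged GPU
    blocks; this stand-in just moves evicted entries to files on disk.
    """

    def __init__(self, capacity: int, storage_dir: str):
        self.capacity = capacity      # max entries kept in the fast tier
        self.hot = OrderedDict()      # simulated GPU-resident cache
        self.dir = Path(storage_dir)

    def put(self, seq_id: str, kv_blocks) -> None:
        self.hot[seq_id] = kv_blocks
        self.hot.move_to_end(seq_id)  # mark as most recently used
        while len(self.hot) > self.capacity:
            # Evict the least-recently-used entry to the storage tier.
            victim, blocks = self.hot.popitem(last=False)
            (self.dir / f"{victim}.kv").write_bytes(pickle.dumps(blocks))

    def get(self, seq_id: str):
        if seq_id in self.hot:        # hit in the fast tier
            self.hot.move_to_end(seq_id)
            return self.hot[seq_id]
        # Miss: reload from the storage tier and promote back.
        blocks = pickle.loads((self.dir / f"{seq_id}.kv").read_bytes())
        self.put(seq_id, blocks)
        return blocks
```

With `capacity=2`, inserting a third sequence's cache evicts the oldest to disk; a later `get` for that sequence reloads it from storage instead of recomputing attention over the full prompt, which is the core trade the talk examines: storage I/O latency versus GPU recomputation cost.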
Syllabus
SNIA SDC 2025 - KV-Cache Storage Offloading for Efficient Inference in LLMs
Taught by
SNIAVideo