Overview
Explore KV-cache storage offloading techniques for optimizing large language model inference in this 51-minute conference talk from SNIA SDC 2025. Learn how the growing memory demands of Key-Value caches exceed GPU capacity as LLMs serve more users and generate longer outputs, creating bottlenecks for large-scale inference systems. Discover how relocating attention cache data to high-speed, low-latency storage tiers alleviates GPU memory constraints and unlocks new levels of scalability for serving large models.

Dive deep into inference workload architecture, understand the structure and role of the KV-cache, and examine a practical implementation of storage offloading. Gain insights into why external storage is essential for modern inference workloads, what makes the KV-cache a bottleneck in large-scale deployments, and how inference engines work with KV-cache offloading enhancements. Master the timing and methods for implementing KV-cache storage offloading to improve inference performance. Presented by Ugur Kaynar of Dell Technologies.
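To make the offloading idea concrete, here is a minimal sketch, not taken from the talk, of the general pattern it describes: keep the most recently used KV-cache blocks in the fast tier and spill older blocks to storage, reloading them on demand. The class name `OffloadingKVCache`, the block size, and the use of NumPy arrays and local disk (standing in for GPU tensors and an NVMe or remote storage tier) are all illustrative assumptions, not details of the presented system.

```python
import os
import tempfile
from collections import OrderedDict

import numpy as np


class OffloadingKVCache:
    """Illustrative per-block KV cache with LRU offload to a storage tier.

    Real inference engines hold GPU tensors and spill to NVMe or remote
    storage; this sketch uses NumPy arrays and local .npy files instead.
    """

    def __init__(self, max_blocks_in_memory=4, block_tokens=128,
                 n_heads=8, head_dim=64, spill_dir=None):
        self.max_blocks = max_blocks_in_memory
        # One block holds K and V stacked: (2, tokens, heads, head_dim).
        self.shape = (2, block_tokens, n_heads, head_dim)
        self.hot = OrderedDict()  # block_id -> ndarray, in LRU order
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="kv_spill_")

    def _spill_path(self, block_id):
        return os.path.join(self.spill_dir, f"block_{block_id}.npy")

    def put(self, block_id, kv_block):
        assert kv_block.shape == self.shape
        self.hot[block_id] = kv_block
        self.hot.move_to_end(block_id)
        # Over budget: evict the least recently used block to storage.
        while len(self.hot) > self.max_blocks:
            victim_id, victim = self.hot.popitem(last=False)
            np.save(self._spill_path(victim_id), victim)

    def get(self, block_id):
        if block_id in self.hot:
            self.hot.move_to_end(block_id)
            return self.hot[block_id]
        # Miss in the fast tier: reload from storage and re-admit,
        # which may in turn evict another block.
        block = np.load(self._spill_path(block_id))
        self.put(block_id, block)
        return block


if __name__ == "__main__":
    cache = OffloadingKVCache(max_blocks_in_memory=2)
    for i in range(5):  # write more blocks than the fast tier can hold
        cache.put(i, np.random.randn(2, 128, 8, 64).astype(np.float32))
    # Block 0 was spilled to disk; get() transparently reloads it.
    print(cache.get(0).shape)  # (2, 128, 8, 64)
```

The design point this illustrates is the one the talk centers on: because attention must read back earlier tokens' K/V data, the cache cannot simply be discarded, so a slower but much larger tier trades some access latency for the GPU memory that limits batch size and context length.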
Syllabus
SNIA SDC 2025 - KV-Cache Storage Offloading for Efficient Inference in LLMs
Taught by
SNIAVideo