Overview
Explore KV-cache storage offloading techniques for optimizing large language model (LLM) inference in this 51-minute conference talk from SNIA SDC 2025, presented by Ugur Kaynar from Dell Technologies. Learn how the memory demands of Key-Value (KV) caches grow beyond GPU capacity as LLMs serve more users and generate longer outputs, creating a bottleneck for large-scale inference systems. Discover how relocating attention cache data to high-speed, low-latency storage tiers relieves GPU memory pressure and lets deployments scale to serve larger models and more requests. Examine the architecture of inference workloads, the structure and role of the KV-cache, and a practical implementation of storage offloading. Understand why external storage is essential for modern inference workloads, what makes the KV-cache a bottleneck in large-scale deployments, and how inference engines work with KV-cache offloading enhancements. Learn when and how to implement KV-cache storage offloading to improve inference performance.
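To give a sense of why the KV-cache can outgrow GPU memory, here is a minimal back-of-the-envelope sketch. It is not taken from the talk. It uses the standard sizing rule for transformer attention caches (one key tensor and one value tensor per layer), and the model dimensions, batch size, and sequence length below are hypothetical values chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV-cache size in bytes for a decoder-only transformer.

    The leading factor of 2 accounts for storing both keys and values
    in every layer. bytes_per_elem=2 assumes 16-bit (FP16/BF16) storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model: 32 layers, 32 KV heads, head dimension 128.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch_size=1)

# Hypothetical workload: 16 concurrent requests, 4096 tokens each.
total = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=16)

print(f"per token: {per_token / 2**10:.0f} KiB")
print(f"total:     {total / 2**30:.0f} GiB")
```

Under these assumed numbers the cache costs 512 KiB per token and 32 GiB for the whole batch, before counting model weights. Because the size scales linearly with both sequence length and the number of concurrent requests, longer outputs and more users push it past the capacity of a single GPU, which is the pressure that offloading to a storage tier is meant to relieve.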
Syllabus
SNIA SDC 2025 - KV-Cache Storage Offloading for Efficient Inference in LLMs
Taught by
SNIAVideo