
Disaggregated KV Storage - A New Tier for Efficient Scalable LLM Inference

SNIAVideo via YouTube

Overview

Explore a technical conference presentation that introduces a disaggregated key-value storage architecture designed to address the growing infrastructure costs of large language model inference. Learn how this approach offloads KV-cache tensors to shared storage to reduce GPU compute pressure while maintaining low-latency, high-throughput performance in generative AI systems. Discover the first end-to-end system based on shared storage for KV-cache offloading that integrates with production-scale orchestration frameworks like Dynamo and Production Stack, enabling scalable deployment across distributed GPU clusters. Examine theoretical analysis and empirical evaluation against state-of-the-art inference engines such as vLLM, with benchmarks demonstrating 5–8× higher request throughput and 5–7× faster prefill latency than baseline systems. Review experiments covering various GPU types and LLMs including DeepSeek-V3, simulating diverse use cases such as multi-turn conversations, long-context generation, and agentic workloads. Understand how this stateless external KV store enables direct GPU-initiated I/O and overlapping of compute and data access, improving efficiency at the infrastructure level compared to traditional block or file storage systems. Gain insights into system design principles, performance characteristics, and practical deployment lessons for engineers, system architects, and infrastructure practitioners seeking scalable, storage-centric approaches to improve LLM inference efficiency and elasticity at scale.
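To make the KV-cache offloading idea concrete, here is a minimal sketch of how a stateless external store keyed by token-prefix hashes might let a serving engine reuse prefill work across requests. All names (`ExternalKVStore`, `prefill`, the block size, the hashing scheme) are illustrative assumptions, not the API of the system presented in the talk:

```python
# Hypothetical sketch of prefix-keyed KV-cache offloading.
# Store API, block granularity, and hashing are assumptions for illustration.
import hashlib

BLOCK_TOKENS = 16  # tokens per KV block (illustrative choice)


class ExternalKVStore:
    """Stateless shared store mapping prefix hashes to KV blocks."""

    def __init__(self):
        self._blocks = {}  # prefix hash -> serialized KV tensors

    def get(self, key):
        return self._blocks.get(key)

    def put(self, key, kv_block):
        self._blocks[key] = kv_block


def prefix_key(tokens):
    """Content-address a token prefix so identical prefixes share KV blocks."""
    return hashlib.sha256(repr(tokens).encode("utf-8")).hexdigest()


def prefill(tokens, store, compute_kv):
    """Reuse offloaded KV blocks for cached prefixes; compute only the rest.

    compute_kv stands in for the GPU prefill pass over one token block.
    Returns the KV blocks for the prompt and the number of blocks reused.
    """
    kv_blocks, reused = [], 0
    for i in range(0, len(tokens), BLOCK_TOKENS):
        key = prefix_key(tokens[: i + BLOCK_TOKENS])
        cached = store.get(key)
        if cached is not None:
            kv_blocks.append(cached)  # cache hit: skip GPU prefill for block
            reused += 1
        else:
            block = compute_kv(tokens[i : i + BLOCK_TOKENS])
            store.put(key, block)  # write back for future requests
            kv_blocks.append(block)
    return kv_blocks, reused
```

In this toy model, the first request over a prompt computes and offloads every block, while a later request sharing the same prefix (a follow-up turn in a multi-turn conversation, say) skips prefill for the shared blocks entirely; the real system additionally overlaps these loads with GPU compute via direct GPU-initiated I/O.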

Syllabus

SNIA SDC 2025 - Disaggregated KV Storage: A New Tier for Efficient Scalable LLM Inference

Taught by

SNIAVideo
