
Disaggregated KV Storage - A New Tier for Efficient Scalable LLM Inference

SNIAVideo via YouTube

Overview

Explore a technical conference presentation that introduces a disaggregated key-value storage architecture designed to address the growing infrastructure costs of large language model inference. Learn how this approach offloads KV-cache tensors to shared storage to reduce GPU compute pressure while maintaining low-latency, high-throughput performance in generative AI systems. Discover the first end-to-end system based on shared storage for KV-cache offloading that integrates with production-scale orchestration frameworks like Dynamo and Production Stack, enabling scalable deployment across distributed GPU clusters. Examine theoretical analysis and empirical evaluation against state-of-the-art inference engines such as vLLM, with benchmarks demonstrating 5–8× higher request throughput and 5–7× faster prefill latency than baseline systems. Review experiments covering various GPU types and LLMs including DeepSeek-V3, simulating diverse use cases such as multi-turn conversations, long-context generation, and agentic workloads. Understand how this stateless external KV store enables direct GPU-initiated I/O and overlapping of compute and data access, improving efficiency at the infrastructure level compared to traditional block or file storage systems. Gain insights into system design principles, performance characteristics, and practical deployment lessons for engineers, system architects, and infrastructure practitioners seeking scalable, storage-centric approaches to improve LLM inference efficiency and elasticity at scale.
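To make the KV-cache offloading idea concrete, here is a minimal sketch of how a stateless external store keyed by token-prefix hashes might let a serving engine reuse prefill work across requests. All names (`ExternalKVStore`, `prefill`, the block size, the hashing scheme) are illustrative assumptions, not the API of the system presented in the talk:

```python
# Hypothetical sketch of prefix-keyed KV-cache offloading.
# Store API, block granularity, and hashing are assumptions for illustration.
import hashlib

BLOCK_TOKENS = 16  # tokens per KV block (illustrative choice)


class ExternalKVStore:
    """Stateless shared store mapping prefix hashes to KV blocks."""

    def __init__(self):
        self._blocks = {}  # prefix hash -> serialized KV tensors

    def get(self, key):
        return self._blocks.get(key)

    def put(self, key, kv_block):
        self._blocks[key] = kv_block


def prefix_key(tokens):
    """Content-address a token prefix so identical prefixes share KV blocks."""
    return hashlib.sha256(repr(tokens).encode("utf-8")).hexdigest()


def prefill(tokens, store, compute_kv):
    """Reuse offloaded KV blocks for cached prefixes; compute only the rest.

    compute_kv stands in for the GPU prefill pass over one token block.
    Returns the KV blocks for the prompt and the number of blocks reused.
    """
    kv_blocks, reused = [], 0
    for i in range(0, len(tokens), BLOCK_TOKENS):
        key = prefix_key(tokens[: i + BLOCK_TOKENS])
        cached = store.get(key)
        if cached is not None:
            kv_blocks.append(cached)  # cache hit: skip GPU prefill for block
            reused += 1
        else:
            block = compute_kv(tokens[i : i + BLOCK_TOKENS])
            store.put(key, block)  # write back for future requests
            kv_blocks.append(block)
    return kv_blocks, reused
```

In this toy model, the first request over a prompt computes and offloads every block, while a later request sharing the same prefix (a follow-up turn in a multi-turn conversation, say) skips prefill for the shared blocks entirely; the real system additionally overlaps these loads with GPU compute via direct GPU-initiated I/O.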

Syllabus

SNIA SDC 2025 - Disaggregated KV Storage: A New Tier for Efficient Scalable LLM Inference

Taught by

SNIAVideo
