Oneiros - KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving
Centre for Networked Intelligence, IISc via YouTube
Overview
Attend this technical seminar to explore Oneiros, an innovative approach to optimizing KV cache memory management for multi-tenant Large Language Model (LLM) serving systems. Learn how this solution addresses the memory bottleneck in LLM inference by introducing parameter remapping techniques that repurpose model parameter memory for KV cache storage, eliminating the need for costly CPU-GPU memory swapping. Discover the key insight that model parameters remain constant during runtime while KV caches update dynamically, and understand how Oneiros leverages this observation to achieve significant performance improvements.

Examine the technical implementation details of parameter remapping in multi-tenant environments, where inactive models' memory can be aggressively reclaimed for active models' KV cache needs. Analyze comprehensive performance benchmarks demonstrating a 44.8%-82.5% reduction in tail time-between-token latency, a 20.7%-99.3% improvement in tail time-to-first-token latency, and 6.6%-86.7% higher throughput compared to existing vLLM solutions.

Explore how modern hardware architectures like the NVIDIA Grace Hopper Superchip enable high CPU-GPU bandwidth utilization for optimal parameter remapping efficiency. Gain insights into the broader implications for datacenter power management, efficient memory allocation strategies, and workload characterization in AI systems from Dr. Ruihao Li, Research Scientist at Meta's AI and Systems Co-Design group.
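The core idea described above can be illustrated with a minimal sketch. This is not the Oneiros implementation; all class and method names below are hypothetical, and a real system would operate on GPU memory mappings rather than Python data structures. The sketch shows why the key insight matters: because model parameters are read-only at runtime, an inactive tenant's parameter blocks can be lent to an active tenant's KV cache with no write-back needed on reclaim, since the weights can simply be re-streamed from the host copy.

```python
# Hypothetical sketch of parameter remapping in a shared block pool.
# Assumption: GPU memory is managed as fixed-size blocks, and each
# tenant's weights have an authoritative copy in host memory.

class BlockPool:
    def __init__(self, num_blocks):
        self.free = set(range(num_blocks))
        self.param_blocks = {}   # tenant -> blocks holding its (immutable) weights
        self.kv_blocks = {}      # tenant -> blocks holding its KV cache

    def load_params(self, tenant, n):
        """Allocate n blocks for a tenant's model weights."""
        blocks = [self.free.pop() for _ in range(n)]
        self.param_blocks[tenant] = blocks
        return blocks

    def remap_inactive(self, inactive, active, n):
        """Lend n parameter blocks of an inactive tenant to an active
        tenant's KV cache. Safe because weights never change at runtime:
        no save/swap-out of the donor blocks is required."""
        donor = self.param_blocks[inactive]
        lent = [donor.pop() for _ in range(n)]
        self.kv_blocks.setdefault(active, []).extend(lent)
        return lent

    def restore(self, inactive, lent, active):
        """Reclaim lent blocks when the inactive tenant resumes: evict
        the KV entries they hold, then refill weights from host memory."""
        for b in lent:
            self.kv_blocks[active].remove(b)
        self.param_blocks[inactive].extend(lent)
```

In contrast to CPU-GPU swapping, the only data movement here happens on the restore path (re-copying weights from host), which is exactly the transfer that high-bandwidth links such as those on the Grace Hopper Superchip make cheap.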
Syllabus
Time: 7:00 PM - PM IST
Taught by
Centre for Networked Intelligence, IISc