Explore the first systematic characterization of KV cache workload patterns from a leading large language model service provider in this 16-minute conference talk from USENIX ATC '25. Discover how researchers from Shanghai Jiao Tong University and Alibaba Group analyzed real-world LLM serving workloads to understand KV cache performance, revealing that KV cache reuse is skewed across requests and that single-turn requests matter as much as multi-turn requests for reuse. Learn about the diverse reuse times and reuse probabilities across different request categories, and understand why the cache size needed for near-optimal hit ratios remains moderate in practice. Examine the workload-aware cache eviction policy proposed by the research team, which improves serving performance under real-world traces, particularly when cache capacity is limited. Gain insights into system design decisions for LLM serving infrastructure and understand how workload-dependent cache eviction policies can optimize throughput and latency for cloud-scale language model deployments.
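The talk summary does not spell out the eviction algorithm itself, but the core idea of workload-aware eviction can be illustrated with a minimal sketch: score each cached KV prefix by a per-category reuse probability that decays over that category's typical reuse window, and evict the lowest-scoring entry when capacity runs short. Everything below, including the class names, the request categories, and the constants, is an illustrative assumption, not the authors' implementation.

```python
import time
from dataclasses import dataclass

# Hypothetical per-category reuse statistics. In a real system these
# would be fitted offline from workload traces (reuse probability and
# typical time-to-reuse per request category), as the talk describes.
CATEGORY_REUSE_PROB = {"multi_turn": 0.8, "single_turn": 0.5, "batch": 0.1}
CATEGORY_REUSE_HALFLIFE_S = {"multi_turn": 60.0, "single_turn": 300.0, "batch": 30.0}


@dataclass
class CacheEntry:
    key: str
    size_tokens: int
    category: str
    last_access: float


class WorkloadAwareCache:
    """Sketch of a workload-aware KV cache with score-based eviction.

    An entry's score decays with idle time at a category-specific rate,
    so categories with short reuse windows age out quickly while
    categories with long reuse windows are retained longer.
    """

    def __init__(self, capacity_tokens: int):
        self.capacity = capacity_tokens
        self.used = 0
        self.entries: dict[str, CacheEntry] = {}

    def _score(self, entry: CacheEntry, now: float) -> float:
        # Expected reuse value = base reuse probability for the
        # category, halved every CATEGORY_REUSE_HALFLIFE_S seconds of idle time.
        idle = now - entry.last_access
        decay = 0.5 ** (idle / CATEGORY_REUSE_HALFLIFE_S[entry.category])
        return CATEGORY_REUSE_PROB[entry.category] * decay

    def access(self, key: str, size_tokens: int, category: str) -> bool:
        """Return True on a cache hit; on a miss, insert (evicting if needed)."""
        now = time.monotonic()
        if key in self.entries:
            self.entries[key].last_access = now
            return True
        # Evict the lowest-scoring entries until the new entry fits.
        while self.used + size_tokens > self.capacity and self.entries:
            victim = min(self.entries.values(), key=lambda e: self._score(e, now))
            self.used -= victim.size_tokens
            del self.entries[victim.key]
        if self.used + size_tokens <= self.capacity:
            self.entries[key] = CacheEntry(key, size_tokens, category, now)
            self.used += size_tokens
        return False
```

Under this sketch, a policy tuned with real per-category statistics can outperform a one-size-fits-all LRU when capacity is scarce, because LRU treats an idle multi-turn prefix (likely to be reused soon) the same as an idle batch prefix (unlikely to ever be reused).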