Overview
This conference talk from FAST '25 presents IMPRESS, an importance-informed multi-tier prefix KV storage system designed to optimize large language model (LLM) inference. Learn how researchers from Zhejiang University and Huawei Cloud address the challenge of efficiently storing and reusing prefix key-value pairs (KVs) from repeated contexts in LLM applications. Discover their approach, which identifies important token indices across attention heads and applies I/O-efficient algorithms to reduce time to first token (TTFT). The presentation demonstrates that IMPRESS reduces TTFT by up to 2.8× compared with state-of-the-art systems while maintaining comparable inference accuracy. This makes it particularly valuable for LLM applications with limited CPU memory, where disk I/O latency becomes a bottleneck.
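To illustrate the core idea, here is a minimal sketch of importance-based token selection: tokens that receive the most attention mass across heads are kept in the fast tier, while the rest can be spilled to disk. This is not the authors' implementation; all names (`select_important_tokens`, `keep_ratio`) and the simple sum-over-heads scoring are assumptions for illustration only.

```python
import numpy as np

def select_important_tokens(attn, keep_ratio=0.25):
    """Score each prefix token by its total attention mass across heads
    and return the indices of the top keep_ratio fraction.

    attn: array of shape (num_heads, seq_len), the attention each head
    pays to the prefix tokens (hypothetical input format)."""
    importance = attn.sum(axis=0)              # aggregate across heads
    k = max(1, int(attn.shape[1] * keep_ratio))
    top = np.argsort(importance)[-k:]          # indices of the k highest scores
    return np.sort(top)

# Toy demo: 4 heads, 16 prefix tokens with random attention weights
rng = np.random.default_rng(0)
attn = rng.random((4, 16))
attn /= attn.sum(axis=1, keepdims=True)        # normalize per head
hot = select_important_tokens(attn, keep_ratio=0.25)
# The KVs of `hot` tokens would stay in the fast tier (CPU memory);
# the remaining KVs would be evicted to disk and fetched on demand.
print(hot)
```

In a real multi-tier system the selection would also have to coordinate per-head differences and batch disk reads, which is where the I/O-efficient algorithms described in the talk come in.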
Syllabus
FAST '25 - IMPRESS: An Importance-Informed Multi-Tier Prefix KV Storage System for Large Language...
Taught by
USENIX