Overview
This conference talk from FAST '25 presents IMPRESS, an importance-informed multi-tier prefix KV storage system designed to optimize large language model inference. Learn how researchers from Zhejiang University and Huawei Cloud address the challenge of efficiently storing and reusing prefix key-value pairs (KVs) from repeated contexts in LLM applications. Discover their innovative approach that identifies important token indices across attention heads and implements I/O-efficient algorithms to reduce time to first token (TTFT). The presentation demonstrates how IMPRESS can reduce TTFT by up to 2.8× compared to state-of-the-art systems while maintaining comparable inference accuracy, making it particularly valuable for LLM applications with limited CPU memory where disk I/O latency becomes a bottleneck.
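The core idea described above — identifying important token indices across attention heads so that only a fraction of prefix KV pairs needs to be fetched from slower storage — can be illustrated with a small sketch. This is a hypothetical simplification, not the actual IMPRESS algorithm: the function name, the sum-across-heads importance metric, and the `keep_ratio` parameter are all assumptions made for illustration.

```python
# Hypothetical sketch of importance-informed token selection (NOT the real
# IMPRESS algorithm): aggregate per-token attention scores across heads,
# then keep only the top fraction of token indices. A KV storage system
# could load just those tokens' KV pairs from disk, cutting I/O and TTFT.

def select_important_tokens(attention_scores, keep_ratio=0.5):
    """attention_scores: list of per-head score lists, one score per token.
    Returns the indices of tokens whose aggregated score ranks in the top
    `keep_ratio` fraction, in original token order."""
    num_tokens = len(attention_scores[0])
    # Aggregate importance across heads (simple sum; the paper's metric differs).
    importance = [sum(head[t] for head in attention_scores)
                  for t in range(num_tokens)]
    k = max(1, int(num_tokens * keep_ratio))
    top = sorted(range(num_tokens), key=lambda t: importance[t], reverse=True)[:k]
    return sorted(top)  # restore token order for sequential reads

# Example: two attention heads scoring a six-token prefix.
scores = [
    [0.9, 0.1, 0.4, 0.05, 0.7, 0.2],
    [0.8, 0.2, 0.3, 0.10, 0.6, 0.1],
]
print(select_important_tokens(scores, keep_ratio=0.5))  # → [0, 2, 4]
```

In this toy setting, tokens 0, 2, and 4 dominate the aggregated attention mass, so only half of the prefix KVs would need to be read back — the kind of I/O reduction the talk attributes to IMPRESS's importance-informed design.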
Syllabus
FAST '25 - IMPRESS: An Importance-Informed Multi-Tier Prefix KV Storage System for Large Language...
Taught by
USENIX