Overview
This conference talk from FAST '25 presents IMPRESS, an importance-informed multi-tier prefix KV storage system designed to optimize large language model inference. Learn how researchers from Zhejiang University and Huawei Cloud address the challenge of efficiently storing and reusing prefix key-value pairs (KVs) from repeated contexts in LLM applications. Discover their innovative approach that identifies important token indices across attention heads and implements I/O-efficient algorithms to reduce time to first token (TTFT). The presentation demonstrates how IMPRESS can reduce TTFT by up to 2.8× compared to state-of-the-art systems while maintaining comparable inference accuracy, making it particularly valuable for LLM applications with limited CPU memory where disk I/O latency becomes a bottleneck.
Syllabus
FAST '25 - IMPRESS: An Importance-Informed Multi-Tier Prefix KV Storage System for Large Language...
Taught by
USENIX