Overview
Explore a 15-minute conference talk from USENIX OSDI '24 that introduces DistServe, a novel approach to improving large language model (LLM) serving performance. Learn how DistServe disaggregates prefill and decoding computation, assigning each phase to different GPUs to eliminate interference between them and to allocate resources to each phase independently. Discover how this design lets the system meet stringent latency requirements for both time to first token (TTFT), which is dominated by prefill, and time per output token (TPOT), which is dominated by decoding. Understand the benefits of DistServe's co-optimization of resource allocation and parallelism for each phase, which enables it to serve up to 7.4 times more requests, or meet a 12.6 times tighter SLO, than state-of-the-art systems while keeping over 90% of requests within their latency constraints across a range of popular LLMs and applications.
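To make the core idea concrete, here is a minimal toy sketch (not DistServe's actual implementation) of disaggregated serving: prefill and decode requests are queued to separate GPU pools, so a long prompt's prefill never stalls the per-token decode steps of requests already in flight. All class and method names here are illustrative assumptions.

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class Request:
    rid: int            # request id
    prompt_tokens: int  # prompt length processed during prefill
    output_tokens: int  # tokens to generate during decode
    generated: int = 0  # tokens generated so far

class DisaggregatedServer:
    """Toy model: prefill and decode run on separate GPU pools,
    so prefill bursts do not interfere with decode latency (TPOT)."""

    def __init__(self):
        self.prefill_queue = deque()  # requests waiting for a prefill GPU
        self.decode_queue = deque()   # requests active on a decode GPU

    def submit(self, req):
        # New requests first go to the prefill pool.
        self.prefill_queue.append(req)

    def prefill_step(self):
        # A prefill GPU processes one request's full prompt, then hands
        # the request (conceptually, its KV cache) to the decode pool.
        if self.prefill_queue:
            req = self.prefill_queue.popleft()
            self.decode_queue.append(req)
            return req.rid
        return None

    def decode_step(self):
        # A decode GPU emits one token per active request per step;
        # requests that reach their output length are finished.
        finished = []
        for _ in range(len(self.decode_queue)):
            req = self.decode_queue.popleft()
            req.generated += 1
            if req.generated < req.output_tokens:
                self.decode_queue.append(req)
            else:
                finished.append(req.rid)
        return finished
```

In a colocated system, the prefill of a newly arrived long prompt would share the same GPU batch as ongoing decode steps, inflating TPOT; routing the two phases to separate pools, as above, is the interference elimination the talk describes.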
Syllabus
OSDI '24 - DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language...
Taught by
USENIX