
Accelerating LLM Serving with Prompt Cache Offloading via CXL

Open Compute Project via YouTube

Overview

Learn how to accelerate Large Language Model (LLM) serving through prompt cache offloading with CXL (Compute Express Link) technology in this 14-minute conference talk. Discover how LLM inference uses key-value (KV) caches to store intermediate activations for token-wise reuse, reducing redundant computation and latency. Explore the challenges of prompt caching across multi-turn or long-session workloads: as user counts and sequence lengths grow, GPU HBM alone cannot hold the full cache footprint, forcing offloads to higher-capacity memory such as host DRAM. Understand the limits of host DRAM scalability and examine how CXL-attached memory connected via PCIe switches can unlock multi-terabyte capacity beyond traditional host limits. Analyze ShareGPT benchmark results showing that augmenting a GPU with an additional 80GB of CXL memory alongside host DRAM achieved a 2.5x increase in prompt cache hit rate, a 32.9% reduction in time-to-first-token, and a 38.5% boost in overall throughput. Gain insight into this cost-effective tiered design, which delivers consistently low-latency responses under heavy load while reducing infrastructure costs, making LLM service deployment both faster and more economical.
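
The tiered design described above can be pictured, very roughly, as an LRU prompt/KV cache whose entries spill from GPU HBM into host DRAM and then into CXL-attached memory, with hits promoted back toward the GPU. The Python sketch below is illustrative only and is not taken from the talk; the PromptCache class, tier names, capacities, and block representation are all hypothetical.

```python
# Illustrative sketch (not from the talk): a tiered prompt-cache in which KV-cache
# blocks live in GPU HBM while hot and are demoted to host DRAM and then to
# CXL-attached memory as capacity runs out. All names and sizes are hypothetical.
from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass
class Tier:
    name: str
    capacity_blocks: int
    blocks: "OrderedDict[str, bytes]" = field(default_factory=OrderedDict)

    def full(self) -> bool:
        return len(self.blocks) >= self.capacity_blocks


class PromptCache:
    """LRU prompt/KV cache spanning GPU HBM, host DRAM, and CXL memory tiers."""

    def __init__(self):
        # Fastest/smallest tier first; lookups promote hits back toward HBM.
        self.tiers = [
            Tier("gpu_hbm", capacity_blocks=1_000),
            Tier("host_dram", capacity_blocks=10_000),
            Tier("cxl_mem", capacity_blocks=100_000),  # large pool behind a PCIe switch
        ]

    def get(self, prefix_hash: str):
        """Return cached KV blocks for a prompt prefix, promoting them on a hit."""
        for tier in self.tiers:
            if prefix_hash in tier.blocks:
                kv = tier.blocks.pop(prefix_hash)
                self._insert(0, prefix_hash, kv)  # promote to the fastest tier
                return kv
        return None  # miss: prefill must recompute the KV cache for this prefix

    def put(self, prefix_hash: str, kv: bytes):
        self._insert(0, prefix_hash, kv)

    def _insert(self, level: int, prefix_hash: str, kv: bytes):
        if level >= len(self.tiers):
            return  # evicted past the last (CXL) tier: drop the blocks
        tier = self.tiers[level]
        if tier.full():
            victim_hash, victim_kv = tier.blocks.popitem(last=False)  # LRU victim
            self._insert(level + 1, victim_hash, victim_kv)  # demote one tier down
        tier.blocks[prefix_hash] = kv
        tier.blocks.move_to_end(prefix_hash)  # mark as most recently used
```

Under this kind of scheme, adding a CXL tier raises the chance that a returning prompt prefix is still resident somewhere in the hierarchy rather than evicted outright, which is what lets the prefill step be skipped and is the mechanism behind the hit-rate and time-to-first-token improvements the talk reports.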

Syllabus

Accelerating LLM Serving with Prompt Cache Offloading via CXL

Taught by

Open Compute Project

