
Accelerating LLM Serving with Prompt Cache Offloading via CXL

Open Compute Project via YouTube

Overview

Learn how to accelerate Large Language Model (LLM) serving through prompt cache offloading with CXL (Compute Express Link) technology in this 14-minute conference talk. Discover how LLM inference uses key-value (KV) caches to store intermediate activations for token-wise reuse, reducing redundant computation and latency. Explore the challenges of prompt caching across multi-turn or long-session workloads: as user counts and sequence lengths grow, GPU HBM alone cannot hold the full cache footprint, forcing offloads to higher-capacity memory such as host DRAM. Understand the limits of host DRAM scalability and examine how CXL-attached memory connected via PCIe switches can unlock multi-terabyte capacity beyond traditional host limits. Analyze ShareGPT benchmark results showing that augmenting a GPU with an additional 80GB of CXL memory alongside host DRAM achieved a 2.5x increase in prompt cache hit rate, a 32.9% reduction in time-to-first-token, and a 38.5% boost in overall throughput. Gain insight into this cost-effective tiered design, which delivers consistently low-latency responses under heavy load while reducing infrastructure costs, making LLM service deployment both faster and more economical.
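
The tiered design described above can be pictured, very roughly, as an LRU prompt/KV cache whose entries spill from GPU HBM into host DRAM and then into CXL-attached memory, with hits promoted back toward the GPU. The Python sketch below is illustrative only and is not taken from the talk; the PromptCache class, tier names, capacities, and block representation are all hypothetical.

```python
# Illustrative sketch (not from the talk): a tiered prompt-cache in which KV-cache
# blocks live in GPU HBM while hot and are demoted to host DRAM and then to
# CXL-attached memory as capacity runs out. All names and sizes are hypothetical.
from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass
class Tier:
    name: str
    capacity_blocks: int
    blocks: "OrderedDict[str, bytes]" = field(default_factory=OrderedDict)

    def full(self) -> bool:
        return len(self.blocks) >= self.capacity_blocks


class PromptCache:
    """LRU prompt/KV cache spanning GPU HBM, host DRAM, and CXL memory tiers."""

    def __init__(self):
        # Fastest/smallest tier first; lookups promote hits back toward HBM.
        self.tiers = [
            Tier("gpu_hbm", capacity_blocks=1_000),
            Tier("host_dram", capacity_blocks=10_000),
            Tier("cxl_mem", capacity_blocks=100_000),  # large pool behind a PCIe switch
        ]

    def get(self, prefix_hash: str):
        """Return cached KV blocks for a prompt prefix, promoting them on a hit."""
        for tier in self.tiers:
            if prefix_hash in tier.blocks:
                kv = tier.blocks.pop(prefix_hash)
                self._insert(0, prefix_hash, kv)  # promote to the fastest tier
                return kv
        return None  # miss: prefill must recompute the KV cache for this prefix

    def put(self, prefix_hash: str, kv: bytes):
        self._insert(0, prefix_hash, kv)

    def _insert(self, level: int, prefix_hash: str, kv: bytes):
        if level >= len(self.tiers):
            return  # evicted past the last (CXL) tier: drop the blocks
        tier = self.tiers[level]
        if tier.full():
            victim_hash, victim_kv = tier.blocks.popitem(last=False)  # LRU victim
            self._insert(level + 1, victim_hash, victim_kv)  # demote one tier down
        tier.blocks[prefix_hash] = kv
        tier.blocks.move_to_end(prefix_hash)  # mark as most recently used
```

Under this kind of scheme, adding a CXL tier raises the chance that a returning prompt prefix is still resident somewhere in the hierarchy rather than evicted outright, which is what lets the prefill step be skipped and is the mechanism behind the hit-rate and time-to-first-token improvements the talk reports.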

Syllabus

Accelerating LLM Serving with Prompt Cache Offloading via CXL

Taught by

Open Compute Project

