
PagedAttention - Revolutionizing LLM Inference with Efficient Memory Management

DevConf via YouTube

Overview

Explore a conference talk introducing PagedAttention, a technique for optimizing large language model (LLM) inference through efficient memory management. Learn how this approach, inspired by virtual-memory paging in operating systems, addresses the memory bottlenecks that plague traditional LLM serving systems by partitioning the key-value (KV) cache into smaller, non-contiguous blocks that can be allocated dynamically and reused flexibly.

Discover the technical implementation within vLLM, an open-source, high-performance LLM serving system developed at UC Berkeley, and understand how PagedAttention decouples the physical cache layout from its logical structure to minimize memory fragmentation and overhead. Examine the performance improvements this method achieves, including up to 30× higher throughput than traditional serving systems such as Hugging Face Transformers, Orca, and NVIDIA's FasterTransformer, while reducing KV cache waste to roughly 4% for near-optimal memory usage.

Understand how this optimization enables larger batch sizes, supports advanced sampling techniques such as beam search without a latency penalty, and makes LLM deployment feasible even on resource-constrained hardware. Gain insight into the approach's limitations, including lookup-table management overhead and potential latency increases in certain scenarios, along with ongoing research directions, such as optimized data structures and prefetching strategies, that address them.
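The core idea described above, splitting the KV cache into fixed-size blocks and tracking each sequence's logical-to-physical block mapping in a table, can be sketched as a toy allocator. This is a minimal illustration of the technique, not vLLM's actual API; all names (`BlockManager`, `append_token`, `BLOCK_SIZE`) are invented for this example.

```python
BLOCK_SIZE = 16  # tokens stored per KV cache block (illustrative value)

class BlockManager:
    """Toy PagedAttention-style allocator: maps each sequence's logical
    blocks to physical blocks drawn from a shared free pool."""

    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id, num_tokens_so_far):
        """Allocate a new physical block only when a sequence crosses a
        block boundary, so memory grows on demand, not up front."""
        table = self.block_tables.setdefault(seq_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:  # all current blocks full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks; preempt a sequence")
            table.append(self.free_blocks.pop())

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```

Because blocks are allocated one at a time from a shared pool, at most one block per sequence is partially empty, which is where the near-optimal (~4% waste) figure mentioned in the talk comes from.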

Syllabus

PagedAttention: Revolutionizing LLM Inference with Efficient Memory Management - DevConf.CZ 2025

Taught by

DevConf

