
Toppings - CPU-Assisted, Rank-Aware Adapter Serving for LLM Inference

USENIX via YouTube

Overview

Learn about Toppings, a system for efficiently serving many Low-Rank Adaptation (LoRA) adapters derived from a common base large language model, in this 14-minute conference presentation from USENIX ATC '25. Discover how the system addresses the high GPU loading overhead that delays time-to-first-token and interrupts ongoing decoding in continuous batching scenarios. Explore the CPU-assisted LoRA serving approach, which uses CPUs to compute the lightweight adaptation during prefilling while the LoRA adapter is still loading onto the GPU, then seamlessly switches to GPU computation once loading completes. Examine the optimized synchronization mechanism and pipelined loading scheme designed to coordinate LoRA computation efficiently across CPUs and GPUs. Understand the rank-aware scheduling algorithm that schedules heterogeneous LoRA requests to maximize Service Level Objective (SLO) attainment. Review performance results showing up to 1.7× lower average request-serving latency and up to 99% SLO attainment compared to state-of-the-art LoRA serving systems, as presented by researchers from Hong Kong University of Science and Technology, The Chinese University of Hong Kong, Shenzhen, TeleAI (China Telecom), and Huawei Cloud.
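
To make the CPU-assisted idea concrete, here is a minimal Python/PyTorch sketch, not the authors' implementation: all class and function names are hypothetical, and the real system adds pipelined loading and careful CPU-GPU synchronization. The sketch serves the low-rank delta (B @ A @ x, which is cheap at small rank r) on the CPU while the adapter weights upload to the GPU in a background thread, then switches to the GPU path once the copy finishes.

import threading
import torch

class HybridLoRA:
    """Hypothetical adapter wrapper: CPU fallback while GPU copy is in flight."""

    def __init__(self, lora_a: torch.Tensor, lora_b: torch.Tensor):
        # Adapter starts resident in host memory; shapes (r, d) and (d, r).
        self.cpu_a, self.cpu_b = lora_a, lora_b
        self.gpu_a = self.gpu_b = None
        self.loaded = threading.Event()
        # Kick off the GPU upload in the background.
        threading.Thread(target=self._load_to_gpu, daemon=True).start()

    def _load_to_gpu(self):
        if torch.cuda.is_available():
            self.gpu_a = self.cpu_a.to("cuda", non_blocking=True)
            self.gpu_b = self.cpu_b.to("cuda", non_blocking=True)
            torch.cuda.synchronize()  # make sure the copies have landed
        self.loaded.set()

    def delta(self, x: torch.Tensor) -> torch.Tensor:
        """Return the low-rank update B @ A @ x for a batch of activations."""
        if self.loaded.is_set() and self.gpu_a is not None:
            # Fast path: adapter is on the GPU, compute there.
            xg = x.to(self.gpu_a.device)
            return (self.gpu_b @ (self.gpu_a @ xg.T)).T.to(x.device)
        # Fallback: adapter still loading; the rank-r math is cheap on CPU.
        xc = x.to("cpu")
        return (self.cpu_b @ (self.cpu_a @ xc.T)).T.to(x.device)

# Usage: a rank-8 adapter for a 4096-wide layer can serve its first
# (prefill) tokens immediately, before the GPU copy completes.
r, d = 8, 4096
adapter = HybridLoRA(torch.randn(r, d), torch.randn(d, r))
y = adapter.delta(torch.randn(2, d))

The point of the design, as the talk describes it, is that time-to-first-token no longer waits on the adapter upload: the CPU handles the small rank-r matrix products during prefilling, and steady-state decoding runs on the GPU once loading completes.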

Syllabus

USENIX ATC '25 - Toppings: CPU-Assisted, Rank-Aware Adapter Serving for LLM Inference

Taught by

USENIX

