Overview
Learn about Toppings, a system for efficiently serving multiple Low-Rank Adaptation (LoRA) adapters derived from a common base large language model, in this 14-minute conference presentation from USENIX ATC '25. Discover how the system addresses high GPU loading overhead, which delays time-to-first-token responses and interrupts ongoing decoding in continuous batching scenarios. Explore the CPU-assisted LoRA serving approach, which uses CPUs to compute the lightweight adaptations during prefilling while LoRA adapters are still loading onto GPUs, then seamlessly switches to GPU computation once loading completes. Examine the highly optimized synchronization mechanism and pipelined loading scheme designed to efficiently coordinate LoRA computation across CPUs and GPUs. Understand the rank-aware scheduling algorithm that schedules heterogeneous LoRA requests to maximize Service Level Objective (SLO) attainment. Review performance results showing up to 1.7× better average request serving latency and up to 99% SLO attainment compared to state-of-the-art LoRA serving systems, as presented by researchers from the Hong Kong University of Science and Technology, The Chinese University of Hong Kong, Shenzhen, TeleAI (China Telecom), and Huawei Cloud.
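The core idea in the overview can be illustrated with a minimal sketch (not the Toppings implementation; all names, shapes, and the `serve_token` routine are hypothetical): because a LoRA update is low-rank, the delta `x @ A @ B` is cheap enough to compute on the CPU while the adapter is still loading onto the GPU, and the server can switch to the GPU path once loading completes.

```python
# Illustrative sketch only (hypothetical names/shapes), assuming a LoRA
# update of the form y = base_out + x @ A @ B with rank r << hidden dim.
import numpy as np

HIDDEN, RANK = 64, 8  # low rank keeps the CPU-side matmuls lightweight

class LoRAAdapter:
    def __init__(self, rng):
        self.A = rng.standard_normal((HIDDEN, RANK)) * 0.01
        self.B = rng.standard_normal((RANK, HIDDEN)) * 0.01
        self.loaded_on_gpu = False  # flips True once the async GPU load finishes

def lora_delta(adapter, x):
    # Two thin matmuls (HIDDEN x RANK, RANK x HIDDEN) instead of one dense
    # HIDDEN x HIDDEN update -- this is what makes the CPU fallback viable.
    return (x @ adapter.A) @ adapter.B

def serve_token(adapter, x, base_out):
    # While the adapter is in flight, compute the adaptation on the CPU so the
    # request is not blocked on GPU loading (shorter time-to-first-token);
    # afterwards, use the GPU-resident copy.
    path = "gpu" if adapter.loaded_on_gpu else "cpu"
    return base_out + lora_delta(adapter, x), path

rng = np.random.default_rng(0)
adapter = LoRAAdapter(rng)
x = rng.standard_normal(HIDDEN)
base = np.zeros(HIDDEN)

_, path1 = serve_token(adapter, x, base)   # adapter still loading -> CPU path
adapter.loaded_on_gpu = True               # async load completes
_, path2 = serve_token(adapter, x, base)   # subsequent tokens -> GPU path
print(path1, path2)  # cpu gpu
```

The real system additionally needs the synchronization and pipelined-loading machinery mentioned above so the CPU-to-GPU handoff does not stall ongoing decoding; this sketch only shows the switch itself.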
Syllabus
USENIX ATC '25 - Toppings: CPU-Assisted, Rank-Aware Adapter Serving for LLM Inference
Taught by
USENIX