
BlitzScale - Fast and Live Large Model Autoscaling with O(1) Host Caching

USENIX via YouTube

Overview

Learn about BlitzScale, an autoscaling system for large language models that addresses the fundamental trade-off between scaling speed and storage requirements in serverless model-as-a-service architectures. Discover how researchers from Shanghai Jiao Tong University and Huawei Cloud achieve a fast data plane with O(1) host caching by loading parameters over the GPU compute network and using network-optimized multicast to scale multiple instances at once. Explore how live autoscaling is enabled by a fine-grained, layer-level scaling abstraction instead of the traditional coarse-grained, instance-level approach, allowing computation to be offloaded from overloaded serving instances to newly scaled ones before parameter loading completes. Examine the system's performance results: up to 94% lower tail latency than ServerlessLLM, and a 49% reduction in GPU serving time compared to non-autoscaling systems such as DistServe and vLLM, while meeting the same service-level agreements under real-world workloads.
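To make the layer-level idea concrete, here is a minimal sketch of the difference between instance-level and layer-level scaling. All names (`ScaledInstance`, `route_layer`) are illustrative assumptions, not from the BlitzScale codebase: the point is that a newly scaled instance can accept offloaded work for a layer as soon as that layer's parameters arrive, rather than waiting for the whole model.

```python
# Hedged sketch of fine-grained, layer-level autoscaling. A newly
# scaled instance starts serving individual layers as they stream in
# over the network, instead of waiting for a complete model load.

class ScaledInstance:
    def __init__(self, num_layers):
        self.num_layers = num_layers
        self.loaded = set()  # layers whose parameters have arrived

    def receive_layer(self, layer_id):
        """Simulate one layer's parameters arriving over the GPU network."""
        self.loaded.add(layer_id)

    def can_offload(self, layer_id):
        """Layer-level scaling: only this layer must be resident.
        Instance-level scaling would instead require all layers."""
        return layer_id in self.loaded


def route_layer(overloaded, scaled, layer_id):
    """Offload a layer's computation to the scaled instance when its
    parameters are loaded; otherwise keep it on the overloaded one."""
    return scaled if scaled.can_offload(layer_id) else overloaded


if __name__ == "__main__":
    old = ScaledInstance(num_layers=4)
    for layer in range(4):
        old.receive_layer(layer)      # existing instance is fully loaded

    new = ScaledInstance(num_layers=4)
    new.receive_layer(0)              # only layer 0 has streamed in so far

    plan = [route_layer(old, new, layer) is new for layer in range(4)]
    print(plan)                       # layer 0 already offloads to `new`
```

Under an instance-level abstraction, `new` would reject all work until every layer loaded; the layer-level check lets it relieve the overloaded instance immediately, which is what makes the scaling "live."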

Syllabus

OSDI '25 - BlitzScale: Fast and Live Large Model Autoscaling with O(1) Host Caching

Taught by

USENIX

