Learn about BlitzScale, an innovative autoscaling system for large language models that addresses the fundamental trade-off between scaling speed and storage requirements in serverless model-as-a-service architectures.

Discover how researchers from Shanghai Jiao Tong University and Huawei Cloud built a fast data plane with O(1) caching by loading model parameters over the GPU compute network and using network-optimized multicast to scale multiple instances at once.

Explore the breakthrough approach of live autoscaling through a fine-grained layer-level scaling abstraction, rather than traditional coarse-grained instance-level methods, which lets overloaded serving instances offload computation to newly scaled ones without waiting for complete parameter loading.

Examine the system's impressive performance results: up to 94% lower tail latency compared to ServerlessLLM, and a 49% reduction in GPU serving time compared to non-autoscaling systems like DistServe and vLLM, while maintaining the same service-level agreements under real-world workloads.
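To make the multicast idea concrete, here is a minimal sketch of how scaling to several instances can avoid re-sending parameters from the source for every target: receivers forward shards down a chain, so each link carries each layer once. This is an illustrative toy model, not BlitzScale's actual protocol; the function names (`multicast_plan`, `broadcast_layers`) and the chain topology are assumptions for the example.

```python
def multicast_plan(source: int, targets: list[int]) -> list[tuple[int, int]]:
    """Build a simple forwarding chain: the source emits each parameter
    shard once, and every receiver forwards it to the next target, so
    per-node link usage stays O(1) instead of O(#targets) at the source.
    (Hypothetical sketch; real systems may use trees or other topologies.)"""
    chain = [source] + targets
    return [(chain[i], chain[i + 1]) for i in range(len(chain) - 1)]


def broadcast_layers(num_layers: int, source: int, targets: list[int]) -> dict[int, list[int]]:
    """Simulate pipelined multicast of model layers over the compute
    network: layer k can move down the chain while layer k+1 leaves the
    source, so all targets finish shortly after a single point-to-point
    transfer would."""
    plan = multicast_plan(source, targets)
    received: dict[int, list[int]] = {t: [] for t in targets}
    for layer in range(num_layers):
        for _sender, receiver in plan:
            received[receiver].append(layer)
    return received
```

For example, `broadcast_layers(4, 0, [1, 2])` delivers all four layers to both targets while the source transmits each layer only once, to its immediate neighbor.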
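The layer-level scaling abstraction can also be sketched in a few lines: the new instance serves each layer as soon as its weights arrive, while the overloaded instance covers layers still in flight. This is a hedged illustration of the general idea, assuming a simple sequential layer layout; the `Instance` class and `plan_forward_pass` helper are hypothetical names, not BlitzScale's API.

```python
class Instance:
    """A serving instance that tracks how many transformer layers it has loaded."""

    def __init__(self, name: str, num_layers: int, loaded: int = 0):
        self.name = name
        self.num_layers = num_layers
        self.loaded = loaded  # layers [0, loaded) are resident on this GPU


def plan_forward_pass(old: Instance, new: Instance) -> list[tuple[str, int]]:
    """Live autoscaling at layer granularity: run a layer on the new
    instance as soon as its weights are resident, and fall back to the
    old (overloaded) instance for layers still loading. With coarse
    instance-level scaling, every layer would stay on `old` until `new`
    finished loading all parameters."""
    plan = []
    for layer in range(old.num_layers):
        owner = new if layer < new.loaded else old
        plan.append((owner.name, layer))
    return plan
```

With a new instance that has loaded 3 of 8 layers, the first three layers already run on the scaled instance and only the remaining five fall back to the overloaded one, so the new GPU starts absorbing load before loading completes.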