Overview
Explore a 16-minute conference presentation from USENIX ATC '25 that introduces Torpor, a serverless platform designed to deliver GPU-efficient, low-latency inference services. Learn how researchers from The Chinese University of Hong Kong, Hong Kong University of Science and Technology, Alibaba Group, and Nokia Bell Labs tackle a key limitation of existing serverless platforms: the lack of efficient GPU support for high-performance inference.

Discover Torpor's approach of keeping models in main memory and dynamically swapping them onto GPUs as requests arrive, a strategy the authors call late binding with model swapping. Understand the technical innovations that minimize latency overhead, including asynchronous API redirection, GPU runtime sharing, pipelined model execution, and efficient GPU memory management. Examine the interference-aware request scheduling algorithm that leverages high-speed GPU interconnects to meet latency service-level objectives for individual inference functions.

Review the performance results showing that Torpor can concurrently serve hundreds of inference functions on a worker node with 4 GPUs while achieving latency comparable to native execution, and learn about pilot deployment results demonstrating GPU provisioning cost reductions of 70% for users and 65% for the platform.
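To build intuition for the core idea, here is a minimal Python sketch of late binding with model swapping. This is an illustrative toy, not Torpor's implementation: all names (`ModelSwapper`, `gpu_capacity`, the list-of-numbers "weights") are invented for the example, the "GPU" is simulated by an ordered dict, and eviction is simple LRU, whereas the real system manages actual GPU memory and pipelines the swap with execution.

```python
from collections import OrderedDict

class ModelSwapper:
    """Toy sketch: all models stay in host memory; a model is copied into a
    limited set of GPU slots only when a request for it arrives (late binding),
    evicting the least-recently-used resident model when the GPU is full."""

    def __init__(self, gpu_capacity):
        self.host_models = {}             # model name -> weights, always in host memory
        self.gpu_models = OrderedDict()   # models currently resident on the "GPU" (LRU order)
        self.gpu_capacity = gpu_capacity  # how many models fit on the GPU at once

    def register(self, name, weights):
        # Models are loaded into main memory up front, not onto the GPU.
        self.host_models[name] = weights

    def infer(self, name, request):
        # Late binding: decide GPU placement only when the request arrives.
        if name not in self.gpu_models:
            if len(self.gpu_models) >= self.gpu_capacity:
                self.gpu_models.popitem(last=False)          # evict the LRU model
            self.gpu_models[name] = self.host_models[name]   # swap-in: host -> GPU copy
        self.gpu_models.move_to_end(name)                    # mark as most recently used
        weights = self.gpu_models[name]
        return sum(weights) * request                        # stand-in for real inference
```

Because every model is registered in host memory, a swap-in is a memory copy rather than a cold start from disk, which is what lets many functions share a few GPUs without container-startup latency on each request.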
Syllabus
USENIX ATC '25 - Torpor: GPU-Enabled Serverless Computing for Low-Latency, Resource-Efficient...
Taught by
USENIX