Overview
Explore a 16-minute conference presentation from NSDI '25 that introduces SuperServe, an ML inference serving system designed to handle unpredictable and bursty workloads in production environments. Learn about the challenges of serving multiple machine learning models under varying request patterns while balancing latency, accuracy requirements, and resource efficiency. Discover SubNetAct, a mechanism that uses specialized control-flow operators in pre-trained, weight-shared super-networks to dynamically route requests through the super-network and actuate specific subnetworks that meet individual latency and accuracy targets. Understand how this approach enables serving significantly more models while requiring up to 2.6× less memory than existing systems. Examine the SlackFit scheduling policy and see how SuperServe achieves 4.67% higher accuracy at the same latency targets and 2.85× higher latency-target attainment at the same accuracy when tested on real-world Microsoft workload traces. Gain insights into fine-grained, reactive scheduling policies and their impact on ML inference serving efficiency from researchers at Georgia Institute of Technology, UC Berkeley, and Adobe.
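To make the core idea concrete, here is a minimal sketch of slack-aware subnetwork selection in the spirit of SlackFit. The subnetwork names and latency/accuracy profiles are hypothetical, invented for illustration; the actual policy and super-network actuation described in the talk are more involved.

```python
from dataclasses import dataclass

@dataclass
class SubNet:
    name: str
    latency_ms: float  # profiled inference latency (hypothetical)
    accuracy: float    # profiled accuracy (hypothetical)

# Hypothetical profiles of subnetworks actuable from one weight-shared
# super-network, ordered from fastest/least accurate to slowest/most accurate.
SUBNETS = [
    SubNet("tiny", 5.0, 0.70),
    SubNet("small", 12.0, 0.76),
    SubNet("base", 25.0, 0.80),
    SubNet("large", 60.0, 0.84),
]

def pick_subnet(slack_ms: float) -> SubNet:
    """SlackFit-style choice (sketch): among subnetworks whose latency
    fits within the request's remaining slack, pick the most accurate;
    if none fits, fall back to the fastest subnetwork."""
    feasible = [s for s in SUBNETS if s.latency_ms <= slack_ms]
    if feasible:
        return max(feasible, key=lambda s: s.accuracy)
    return min(SUBNETS, key=lambda s: s.latency_ms)
```

The point of the sketch is the trade-off the talk describes: when a request has generous slack, the scheduler can route it through a larger, more accurate subnetwork; as load spikes shrink the slack, it degrades gracefully to faster subnetworks instead of missing deadlines.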
Syllabus
NSDI '25 - SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads
Taught by
USENIX