Overview
Learn about an innovative system architecture for optimizing text-to-image generation workflows in this conference presentation from USENIX ATC '25. Discover how researchers from Hong Kong University of Science and Technology and Alibaba Group developed a solution to efficiently serve diffusion models augmented with multiple ControlNet and LoRA adapters in production AI cloud environments.

Explore the key technical challenges in serving workflows where base diffusion models are enhanced with numerous adapters to control image details like shapes, outlines, poses, and styles. Understand the system's differentiation between compute-heavy ControlNets and compute-light LoRAs, and how it addresses their distinct bottlenecks: computational overhead for the former and loading delays for the latter.

Examine the ControlNet-as-a-Service design that decouples ControlNets from base models, deploying them as separate, independently scalable services on dedicated GPUs to enable caching, parallelization, and sharing. Learn about the bounded asynchronous loading technique that overlaps LoRA loading with initial base model execution by up to K steps while maintaining image quality. Discover how latent parallelism accelerates base model execution across multiple GPUs.

Review the impressive performance results showing up to 7.8× latency reduction and 1.7× throughput improvement when serving SDXL models on H800 GPUs without compromising image quality compared to state-of-the-art text-to-image serving systems.
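The bounded asynchronous loading idea described above can be sketched in a few lines: run up to K denoising steps with the base model alone while the LoRA weights load in a background thread, then finish the remaining steps with the adapter applied. This is a minimal illustration only, not the paper's implementation; `base_step` and `load_lora` are hypothetical placeholders standing in for a denoising step and an adapter-loading routine.

```python
import threading

def bounded_async_generate(base_step, load_lora, latents, num_steps, k_bound):
    """Sketch of bounded asynchronous LoRA loading (hypothetical API).

    At most k_bound denoising steps run with the base model only while the
    LoRA weights load in a background thread; once loading finishes (or the
    bound is reached, whichever comes later), the remaining steps use the
    adapter.
    """
    loaded = {}
    done = threading.Event()

    def loader():
        loaded["lora"] = load_lora()  # e.g., fetch adapter weights from storage
        done.set()

    threading.Thread(target=loader, daemon=True).start()

    step = 0
    # Phase 1: base-only steps, bounded by k_bound, overlapping the load.
    while step < num_steps and step < k_bound and not done.is_set():
        latents = base_step(latents, step, lora=None)
        step += 1

    done.wait()  # block only if the bound was hit before loading finished
    lora = loaded["lora"]

    # Phase 2: remaining steps with the LoRA applied.
    while step < num_steps:
        latents = base_step(latents, step, lora=lora)
        step += 1
    return latents
```

The bound K caps how many adapter-free steps are tolerated, which is how the technique trades a small, controlled deviation in the early steps for hiding the loading latency without degrading final image quality.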
Syllabus
USENIX ATC '25 - Katz: Efficient Workflow Serving for Diffusion Models with Many Adapters
Taught by
USENIX