PPipe - Efficient Video Analytics Serving on Heterogeneous GPU Clusters via Pool-Based Pipeline Parallelism

Learn about PPipe, an inference serving system that uses pool-based pipeline parallelism to serve video analytics efficiently on heterogeneous GPU clusters, in this 17-minute conference talk from USENIX ATC '25. Discover how researchers from Purdue University apply pipeline parallelism, traditionally used for throughput-oriented deep learning model training, to latency-bound model inference. Explore the synergy between diversity in model layers and diversity in GPU architectures: for many layers, low-class and high-class GPUs achieve comparable inference latency. Understand the system's architecture, which pairs a control plane based on mixed-integer linear programming (MILP) with a data plane that performs resource reservation-based adaptive batching. Examine evaluation results across 18 CNN models showing that PPipe achieves 41.1%–65.5% higher utilization of low-class GPUs while maintaining high utilization of high-class GPUs, yielding 32.2%–75.1% higher serving throughput than baseline approaches. Gain insights into how this approach addresses the growing prevalence of heterogeneous GPU clusters in both public clouds and on-premises data centers.
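
To make the layer-level intuition concrete, here is a minimal Python sketch of splitting one model between a low-class and a high-class GPU under a latency budget. It is an illustration only, not PPipe's MILP formulation: the function name `best_split`, the per-layer latencies, the transfer cost, and the brute-force search are all assumptions made for this example.

```python
def best_split(low_ms, high_ms, slo_ms, transfer_ms=0.0):
    """Toy partitioner (hypothetical, not PPipe's control plane).

    Layers [0, split) run on a low-class GPU and layers [split, n) on a
    high-class GPU. Returns (split, total_latency_ms) for the largest
    split that still meets the latency budget, or None if none does.
    """
    n = len(low_ms)
    best = None
    for split in range(n + 1):
        low_part = sum(low_ms[:split])
        high_part = sum(high_ms[split:])
        # The inter-GPU hop is paid only when both GPUs are actually used.
        hop = transfer_ms if 0 < split < n else 0.0
        total = low_part + high_part + hop
        if total <= slo_ms:
            # A later feasible split offloads more layers to the low-class GPU.
            best = (split, total)
    return best


# Made-up per-layer latencies in milliseconds: the first two layers are
# nearly as fast on the low-class GPU, the last two are much slower there.
low_ms = [2.0, 2.5, 6.0, 9.0]
high_ms = [1.8, 2.0, 2.0, 2.5]

print(best_split(low_ms, high_ms, slo_ms=10.0, transfer_ms=0.5))  # (2, 9.5)
```

With these assumed numbers, the first two layers can move to the low-class GPU without breaking the 10 ms budget, which frees high-class GPU time for the layers that benefit from it most.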