Overview
This 17-minute conference talk from USENIX ATC '25 introduces PPipe, an inference serving system that uses pool-based pipeline parallelism to serve video analytics efficiently on heterogeneous GPU clusters. Researchers from Purdue University show that pipeline parallelism, traditionally used for throughput-oriented deep learning model training, can also be applied effectively to latency-bound model inference. The talk explores the synergy between diversity in model layers and diversity in GPU architectures: for many layers, low-class and high-class GPUs achieve comparable inference latency. It describes the system's architecture, which pairs an MILP-based control plane with a data plane that performs resource reservation-based adaptive batching. Evaluation results across 18 CNN models show that PPipe achieves 41.1%–65.5% higher utilization of low-class GPUs while maintaining high utilization of high-class GPUs, resulting in 32.2%–75.1% higher serving throughput compared to baseline approaches. The approach addresses the growing prevalence of heterogeneous GPU clusters in both public clouds and on-premises data centers.
Syllabus
USENIX ATC '25 - PPipe: Efficient Video Analytics Serving on Heterogeneous GPU Clusters via...
Taught by
USENIX