Overview
Learn about PPipe, a novel inference serving system that leverages pool-based pipeline parallelism to efficiently serve video analytics on heterogeneous GPU clusters, in this 17-minute conference talk from USENIX ATC '25. Discover how researchers from Purdue University demonstrate the effective application of pipeline parallelism—traditionally used for throughput-oriented deep learning model training—to latency-bound model inference scenarios.

Explore the synergy between diversity in model layers and GPU architectures, revealing how low-class and high-class GPUs can achieve comparable inference latency for many layers. Understand the system's architecture, featuring an MILP-based control plane and a data plane that performs resource reservation-based adaptive batching.

Examine evaluation results across 18 CNN models showing PPipe's ability to achieve 41.1%–65.5% higher utilization of low-class GPUs while maintaining high utilization of high-class GPUs, resulting in 32.2%–75.1% higher serving throughput compared to baseline approaches. Gain insights into how this approach addresses the growing prevalence of heterogeneous GPU clusters in both public clouds and on-premise data centers.
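To build intuition for how pipeline partitioning can exploit the observation that low- and high-class GPUs have comparable latency on many layers, here is a minimal, illustrative sketch. It brute-forces a two-stage split of a model's layers between two GPU classes to minimize the pipeline bottleneck; this is a simplified stand-in for the paper's MILP formulation, not the actual PPipe algorithm, and the per-layer latencies below are hypothetical.

```python
# Illustrative sketch: latency-balanced two-stage pipeline partitioning
# across two GPU classes. NOT the PPipe MILP; a brute-force toy stand-in.

def best_split(lat_low, lat_high):
    """Assign a prefix of layers to the low-class GPU and the suffix to
    the high-class GPU, choosing the split that minimizes the pipeline
    bottleneck (the slower of the two stage latencies)."""
    n = len(lat_low)
    assert n == len(lat_high), "need one latency per layer on each class"
    best = None
    for k in range(n + 1):                  # k layers run on the low-class GPU
        stage_low = sum(lat_low[:k])        # prefix latency on low-class GPU
        stage_high = sum(lat_high[k:])      # suffix latency on high-class GPU
        bottleneck = max(stage_low, stage_high)
        if best is None or bottleneck < best[1]:
            best = (k, bottleneck)
    return best                             # (split index, bottleneck latency)

# Hypothetical per-layer latencies (ms): early layers are nearly as fast on
# the low-class GPU, while later layers benefit from the high-class GPU.
low  = [2.0, 2.0, 3.0, 5.0, 8.0]
high = [1.8, 1.9, 2.5, 2.0, 3.0]
print(best_split(low, high))  # → (3, 7.0)
```

With these numbers, the best split runs the first three layers on the low-class GPU, keeping both stages busy and the bottleneck low; the real system generalizes this to pools of many GPUs and solves the assignment with an MILP.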
Syllabus
USENIX ATC '25 - PPipe: Efficient Video Analytics Serving on Heterogeneous GPU Clusters via...
Taught by
USENIX