
GPU-Disaggregated Serving for Deep Learning Recommendation Models at Scale

USENIX via YouTube

Overview

Learn about Prism, a production deep learning recommendation model (DLRM) serving system that addresses GPU fragmentation through resource disaggregation, in this 18-minute conference presentation from NSDI '25. Discover why online recommender systems struggle to provision DLRM services efficiently at scale: these models need many CPU cores and large amounts of memory but only a few GPUs, which wastes resources on multi-GPU servers. Explore Prism's architecture, which separates CPU nodes and heterogeneous GPU nodes into independently scalable resource pools connected via RDMA, and automatically divides DLRMs into CPU- and GPU-intensive subgraphs for optimized scheduling. Examine the system's latency-minimization techniques, including optimal graph partitioning, topology-aware resource management, and SLO-aware communication scheduling, which reduce CPU fragmentation by 53% and GPU fragmentation by 27% in crowded clusters. Understand how Prism enables efficient capacity loaning from training clusters during seasonal events, saving over 90% of GPUs, and learn from real-world deployment insights from a system that has run on more than 10,000 GPUs in production for over two years.
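The overview describes Prism dividing a DLRM into CPU-intensive subgraphs (sparse embedding lookups, which are memory-bound) and GPU-intensive subgraphs (dense MLP layers, which are compute-bound) so each can be scheduled on its own resource pool. A minimal sketch of that idea is below; the operator names and the kind-based cost model are hypothetical illustrations, not Prism's actual code or API.

```python
# Hypothetical sketch of the CPU/GPU subgraph split described in the talk.
# Op names and the kind-based partitioning rule are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    kind: str  # e.g. "embedding_lookup" (memory-bound) or "mlp" (compute-bound)

# Memory-bound sparse ops are assigned to the CPU pool;
# compute-bound dense layers are assigned to the GPU pool.
CPU_KINDS = {"embedding_lookup", "feature_transform"}

def partition(ops):
    """Split a DLRM op list into CPU- and GPU-intensive subgraphs."""
    cpu_subgraph = [op for op in ops if op.kind in CPU_KINDS]
    gpu_subgraph = [op for op in ops if op.kind not in CPU_KINDS]
    return cpu_subgraph, gpu_subgraph

model = [
    Op("user_emb", "embedding_lookup"),
    Op("item_emb", "embedding_lookup"),
    Op("interact", "feature_transform"),
    Op("top_mlp", "mlp"),
]
cpu_part, gpu_part = partition(model)
print([op.name for op in cpu_part])  # sparse lookups -> CPU node pool
print([op.name for op in gpu_part])  # dense MLP -> GPU node pool
```

In the disaggregated design the talk describes, the two subgraphs would then run on separate CPU and GPU nodes, exchanging intermediate tensors over RDMA.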

Syllabus

NSDI '25 - GPU-Disaggregated Serving for Deep Learning Recommendation Models at Scale

Taught by

USENIX

