

Colocating ML Inference and Training with Fast GPU Memory Handover

USENIX via YouTube

Overview

Learn about SIRIUS, a GPU resource sharing system that efficiently colocates machine learning inference and training workloads, in this 15-minute conference presentation from USENIX ATC '25. Discover how the system meets strict latency Service Level Objectives (SLOs) for inference tasks while maximizing GPU utilization by running training tasks on the leftover resources.

Explore three key innovations: leveraging the characteristics of gradient computation to adjust training memory consumption within milliseconds, managing memory reclamation explicitly so the handover is safe, and employing SLO-aware memory reallocation to minimize initialization overhead and prevent thrashing under fluctuating workloads (a rough sketch of such a control loop follows below).

Examine evaluation results showing that SIRIUS outperforms existing colocation approaches, improving inference SLO compliance by 57.0% on average (up to 97.0%) and training throughput by 2.2× (up to 13.7×). Gain insights into spatial GPU resource sharing techniques, memory management strategies for machine learning workloads, and practical approaches to maximizing GPU utilization in production environments where inference and training tasks must coexist efficiently.
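To make the handover idea more concrete, here is a minimal, hypothetical sketch in Python of an SLO-aware memory handover control loop. It is not the SIRIUS implementation: the class names (`TrainingJob`, `InferenceServer`), the thresholds, the cooldown value, and the toy demand model are all assumptions made purely for illustration of the shrink/reclaim/regrow pattern described in the talk.

```python
"""Minimal sketch (not the SIRIUS implementation) of an SLO-aware GPU
memory handover loop. All names, constants, and the demand model below
are hypothetical stand-ins for illustration only."""
import random
import time

GPU_MEMORY_MB = 40_000          # assumed total device memory
REALLOC_COOLDOWN_S = 2.0        # hysteresis to avoid thrashing under bursty load


class TrainingJob:
    """Training task that can shrink its footprint quickly by capping
    its memory budget before the next gradient computation step."""
    def __init__(self, full_budget_mb):
        self.budget_mb = full_budget_mb

    def shrink_to(self, budget_mb):
        # In SIRIUS this handover reportedly completes within milliseconds;
        # here we simply record the new cap.
        self.budget_mb = max(budget_mb, 0)

    def grow_to(self, budget_mb):
        self.budget_mb = budget_mb


class InferenceServer:
    """Inference task with a strict latency SLO and a reserved baseline."""
    def __init__(self, reserved_mb):
        self.reserved_mb = reserved_mb

    def predicted_demand_mb(self, load):
        # Toy model: memory demand scales linearly with request load (0.0-1.0).
        return int(self.reserved_mb + load * 20_000)


def control_loop(steps=10):
    train = TrainingJob(full_budget_mb=GPU_MEMORY_MB // 2)
    infer = InferenceServer(reserved_mb=10_000)
    last_realloc = 0.0

    for step in range(steps):
        load = random.random()                  # stand-in for observed load
        demand = infer.predicted_demand_mb(load)
        leftover = GPU_MEMORY_MB - demand

        now = time.monotonic()
        if leftover < train.budget_mb:
            # Inference at risk of missing its SLO: reclaim training memory
            # immediately (fast handover path).
            train.shrink_to(leftover)
            last_realloc = now
        elif now - last_realloc > REALLOC_COOLDOWN_S:
            # Load has eased and the cooldown has passed: return memory to
            # training, but only up to the leftover (SLO-aware reallocation).
            train.grow_to(leftover)
            last_realloc = now

        print(f"step {step}: load={load:.2f} "
              f"inference={demand} MB training_budget={train.budget_mb} MB")


if __name__ == "__main__":
    control_loop()
```

The asymmetry in the loop mirrors the idea described in the overview: reclamation happens immediately whenever inference demand rises, while regrowth is deliberately delayed to avoid paying initialization overhead repeatedly when the workload fluctuates.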

Syllabus

USENIX ATC '25 - Colocating ML Inference and Training with Fast GPU Memory Handover

Taught by

USENIX

