Overview
Explore a 22-minute conference talk from USENIX ATC '23 on the underutilization of GPUs in the clusters of large tech companies. Learn how GPU-sharing techniques, which let several workloads run on one GPU, fragment GPU capacity in large clusters. The talk introduces Fragmentation Gradient Descent (FGD), an approach that quantifies GPU fragmentation and places each workload where it minimizes the growth of that fragmentation. Discover how FGD, implemented as a new scheduler in Kubernetes, significantly reduces the number of unallocated GPUs and improves overall utilization. The evaluation replays production traces on an emulated cluster of more than 6,200 GPUs, and the talk considers what FGD could mean for GPU resource management in large-scale machine learning environments.
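The scheduling idea described above can be illustrated with a small toy model. This is a simplified sketch under stated assumptions, not the authors' implementation or their Kubernetes scheduler. It assumes that a task requests either a fraction of one GPU (below 1.0) or a whole number of GPUs. It also assumes that a node's free capacity counts as "fragmented" for a task when that task cannot use it, and that the expected workload mix is given as a probability for each task type. Every name here (`Task`, `node_frag_for_task`, `expected_frag`, `schedule`) is hypothetical. The scheduler tries every feasible placement and picks the one that increases expected fragmentation the least.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass(frozen=True)
class Task:
    # Hypothetical request model: a value < 1.0 is a share of one GPU,
    # a value >= 1.0 is a whole number of dedicated GPUs.
    gpu: float


def node_frag_for_task(free: List[float], task: Task) -> float:
    """Free GPU capacity on a node that `task` could NOT use (toy rule)."""
    total_free = sum(free)
    if task.gpu < 1.0:
        if any(f >= task.gpu for f in free):
            # Task fits somewhere; GPUs with too little leftover are fragmented.
            return sum(f for f in free if f < task.gpu)
        return total_free  # task fits nowhere: all free capacity is unusable
    need = int(task.gpu)
    fully_free = sum(1 for f in free if f >= 1.0)
    if fully_free >= need:
        # Partially used GPUs cannot serve a whole-GPU task.
        return sum(f for f in free if f < 1.0)
    return total_free


def expected_frag(free: List[float], workload: Dict[Task, float]) -> float:
    """Fragmentation weighted by how often each task type is expected."""
    return sum(p * node_frag_for_task(free, t) for t, p in workload.items())


def candidate_placements(free: List[float], task: Task) -> Iterator[List[float]]:
    """Yield the node's free-capacity vector after each feasible placement."""
    if task.gpu < 1.0:
        for i, f in enumerate(free):
            if f >= task.gpu:
                new = list(free)
                new[i] = f - task.gpu
                yield new
    else:
        need = int(task.gpu)
        idx = [i for i, f in enumerate(free) if f >= 1.0][:need]
        if len(idx) == need:
            new = list(free)
            for i in idx:
                new[i] = 0.0
            yield new


def schedule(
    nodes: Dict[str, List[float]], task: Task, workload: Dict[Task, float]
) -> Optional[Tuple[str, List[float]]]:
    """Greedily pick the placement with the smallest fragmentation increase."""
    best: Optional[Tuple[float, str, List[float]]] = None
    for name, free in nodes.items():
        base = expected_frag(free, workload)
        for new in candidate_placements(free, task):
            delta = expected_frag(new, workload) - base
            if best is None or delta < best[0]:
                best = (delta, name, new)
    return None if best is None else (best[1], best[2])


if __name__ == "__main__":
    nodes = {"a": [1.0, 0.5], "b": [1.0, 1.0]}
    workload = {Task(0.5): 0.5, Task(1.0): 0.5}
    print(schedule(nodes, Task(0.5), workload))
```

In this toy run, a half-GPU task goes onto the half-used GPU of node `a` rather than onto an idle GPU. That choice keeps whole GPUs free for future whole-GPU tasks, which is the intuition behind minimizing fragmentation growth. A plain "first fit" or "least loaded" policy might instead split a fully free GPU.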
Syllabus
USENIX ATC '23 - Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation...
Taught by
USENIX