Improving GPU Utilization and Accelerating Model Training with Kubernetes Scheduling Framework and NRI
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Discover how to improve GPU utilization and speed up model training using the Kubernetes scheduling framework and the Node Resource Interface (NRI) in this 24-minute conference talk by He Cao from ByteDance. Learn where vanilla Kubernetes falls short in managing heterogeneous resources and explore the solutions implemented at ByteDance. See how GPU-sharing scheduling allocates GPU resources at a finer granularity than a whole device, which improves GPU utilization in AI inference scenarios. Understand how topology-aware scheduling and customized GPU-RDMA affinity strategies at the root complex level accelerate large model training with GPUDirect RDMA. The talk is aimed at anyone optimizing resource management and performance for AI workloads on Kubernetes clusters.
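To make the idea of fine-grained GPU sharing concrete, here is a minimal Python sketch of a filter-then-score placement step. It is an illustration only, not the scheduler plugin described in the talk: the milli-GPU unit, the `Gpu` and `Node` classes, and the `pick_gpu` function are all hypothetical names, and the tightest-fit scoring rule is an assumed policy chosen to reduce fragmentation.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical unit: 1000 milli-GPU represents one whole physical device.
GPU_CAPACITY_MILLI = 1000


@dataclass
class Gpu:
    index: int
    used_milli: int = 0

    @property
    def free_milli(self) -> int:
        return GPU_CAPACITY_MILLI - self.used_milli


@dataclass
class Node:
    name: str
    gpus: list[Gpu] = field(default_factory=list)


def pick_gpu(nodes: list[Node], request_milli: int) -> Optional[tuple[str, int]]:
    """Place a fractional GPU request on the tightest-fitting device.

    Returns (node name, GPU index), or None when no single device has
    enough free capacity. The chosen device's usage is updated in place.
    """
    best = None
    best_leftover = None
    for node in nodes:
        for gpu in node.gpus:
            leftover = gpu.free_milli - request_milli
            if leftover < 0:
                continue  # filter step: this device cannot hold the request
            # score step: prefer the device left with the least spare capacity
            if best_leftover is None or leftover < best_leftover:
                best, best_leftover = (node, gpu), leftover
    if best is None:
        return None
    node, gpu = best
    gpu.used_milli += request_milli
    return node.name, gpu.index


nodes = [
    Node("node-a", [Gpu(0, used_milli=700), Gpu(1)]),
    Node("node-b", [Gpu(0, used_milli=500)]),
]
first = pick_gpu(nodes, 300)   # fills the remaining 30% of node-a GPU 0
second = pick_gpu(nodes, 400)  # node-b GPU 0 is a tighter fit than the empty GPU
```

Packing small inference requests onto already partially used devices keeps whole GPUs free for requests that need a full device, which is one common motivation for this kind of scoring.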
Syllabus
Improving GPU Utilization and Accelerating Model Training with Scheduling Framework and NRI - He Cao
Taught by
CNCF [Cloud Native Computing Foundation]