Improving GPU Utilization and Accelerating Model Training with Kubernetes Scheduling Framework and NRI
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Discover how to enhance GPU utilization and speed up model training using the Kubernetes scheduling framework and the Node Resource Interface (NRI) in this 24-minute conference talk by He Cao of ByteDance. Learn about the limitations of vanilla Kubernetes in managing heterogeneous resources and the solutions ByteDance has implemented. Gain insight into GPU-sharing scheduling techniques that enable fine-grained resource allocation, improving GPU utilization in AI inference scenarios. Understand how topology-aware scheduling and customized GPU-RDMA affinity strategies at the root-complex level accelerate large-model training with GPUDirect RDMA. The talk offers practical guidance for optimizing resource management and performance of AI workloads on Kubernetes clusters.
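To make the GPU-sharing idea concrete, here is a minimal sketch of the filter-and-score pattern that scheduler-framework plugins follow. This is a hypothetical, simplified model, not the real Kubernetes plugin API or ByteDance's implementation: the `Node`, `filter_nodes`, `score`, and `schedule` names are invented for illustration, and fractional GPU capacity is modeled as a plain float per device.

```python
# Simplified model of GPU-sharing scheduling (hypothetical; not the real
# Kubernetes scheduler-framework API). A filter step rejects nodes that
# cannot fit a pod's fractional GPU request, and a score step packs pods
# onto the most-utilized GPU, leaving whole GPUs free for training jobs.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    # Free fraction of each physical GPU on the node (1.0 = fully idle).
    gpu_free: list[float]

def filter_nodes(nodes: list[Node], request: float) -> list[Node]:
    """Keep only nodes with at least one GPU that can host the request."""
    return [n for n in nodes if any(f >= request for f in n.gpu_free)]

def score(node: Node, request: float) -> float:
    """Prefer the tightest fit: the feasible GPU with the least leftover."""
    feasible = [f for f in node.gpu_free if f >= request]
    return 1.0 - min(f - request for f in feasible)

def schedule(nodes: list[Node], request: float) -> str:
    """Filter infeasible nodes, then pick the highest-scoring candidate."""
    candidates = filter_nodes(nodes, request)
    if not candidates:
        raise RuntimeError("no node can fit the GPU request")
    return max(candidates, key=lambda n: score(n, request)).name

nodes = [
    Node("node-a", gpu_free=[1.0, 1.0]),   # two idle GPUs
    Node("node-b", gpu_free=[0.6, 0.1]),   # partially shared GPUs
]
# A 0.5-GPU inference pod lands on node-b's 0.6-free GPU (tightest fit),
# keeping node-a's whole GPUs available for large training jobs.
print(schedule(nodes, 0.5))  # node-b
```

The bin-packing score is one plausible policy for the inference scenario the talk describes: consolidating fractional requests raises per-GPU utilization while preserving unfragmented GPUs for workloads that need them whole.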
Syllabus
Improving GPU Utilization and Accelerating Model Training with Scheduling Framework and NRI - He Cao
Taught by
CNCF [Cloud Native Computing Foundation]