Improving GPU Utilization and Accelerating Model Training with Kubernetes Scheduling Framework and NRI

In this 24-minute conference talk, He Cao of ByteDance shows how the Kubernetes scheduling framework and the Node Resource Interface (NRI) can raise GPU utilization and speed up model training. The talk covers the limitations of vanilla Kubernetes in managing heterogeneous resources and the solutions ByteDance implemented to address them. GPU-sharing scheduling allocates GPU resources at a finer granularity than whole devices, which improves utilization in AI inference scenarios. Topology-aware scheduling, combined with customized GPU-RDMA affinity strategies at the root complex level, accelerates large model training that uses GPUDirect RDMA. The talk is aimed at anyone optimizing resource management and performance for AI workloads on Kubernetes clusters.
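
The two scheduling ideas mentioned above can be illustrated with a minimal Python sketch. It is not ByteDance's implementation and does not use any Kubernetes or NRI API. The `Gpu` type, its fields, and both functions are hypothetical names introduced only for illustration. The sketch assumes that GPU sharing means fitting a fractional request onto a partly used device, and that GPU-RDMA affinity means preferring a GPU attached to the same root complex as the RDMA NIC.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Gpu:
    index: int
    root_complex: int     # hypothetical ID of the root complex the GPU is attached to
    free_fraction: float  # share of the GPU still unallocated, from 0.0 to 1.0


def pick_shared_gpu(gpus: List[Gpu], request: float) -> Optional[Gpu]:
    """Best-fit choice for a fractional request (assumed policy).

    Picks the GPU with the smallest free share that still satisfies the
    request, so that whole GPUs stay available for larger jobs.
    """
    candidates = [g for g in gpus if g.free_fraction >= request]
    if not candidates:
        return None
    return min(candidates, key=lambda g: g.free_fraction)


def pick_gpu_near_nic(gpus: List[Gpu], nic_root_complex: int) -> Optional[Gpu]:
    """Choose a fully free GPU, preferring one that shares the NIC's root complex."""
    free = [g for g in gpus if g.free_fraction == 1.0]
    if not free:
        return None
    # False sorts before True, so GPUs under the NIC's root complex come first;
    # ties are broken by the lowest GPU index.
    return min(free, key=lambda g: (g.root_complex != nic_root_complex, g.index))


# Example node: four GPUs spread over two root complexes.
gpus = [
    Gpu(index=0, root_complex=0, free_fraction=1.0),
    Gpu(index=1, root_complex=0, free_fraction=0.5),
    Gpu(index=2, root_complex=1, free_fraction=0.25),
    Gpu(index=3, root_complex=1, free_fraction=1.0),
]
```

With this example node, a quarter-GPU inference request would land on GPU 2, the tightest fit, while a training pod whose RDMA NIC sits under root complex 1 would be given GPU 3. A real scheduler plugin would express these preferences as filter and score steps rather than a single selection.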