Boosting Training and Inference Performance via Topology-Aware Scheduling of Heterogeneous Resources
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn how ByteDance optimizes LLM workload performance through enhanced topology-aware scheduling in this technical conference talk. Explore solutions for managing high-density processors, including die-level affinity and anti-affinity configuration between memory-bandwidth-intensive pods. Discover techniques for achieving inter-RDMA affinity at the ToR (top-of-rack) switch level to prevent switch congestion, implementing GPU-RDMA affinity at the PCIe switch level to accelerate communication via GPUDirect RDMA, and establishing job-level topology affinity on top of the Kubernetes scheduler's pod-level operations. Gain insights into addressing the limitations of Kubernetes topology management for new-generation processors, the shift of performance bottlenecks from computation to networking, and practical approaches for scheduling heterogeneous resources such as GPUs and RDMA NICs.
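Upstream Kubernetes pod anti-affinity operates only at node-label granularity, which is one of the limitations the talk addresses; die-level or PCIe-switch-level placement within a node needs custom scheduler support like the system described. As a minimal sketch of the anti-affinity idea using stock Kubernetes primitives, the spec below spreads memory-bandwidth-intensive pods across nodes. The label `workload-class: membw-intensive` and the image name are illustrative assumptions, not from the talk; only `kubernetes.io/hostname` is a standard well-known topology key.

```yaml
# Sketch: keep memory-bandwidth-intensive pods apart using stock
# pod anti-affinity. Granularity is per-node (kubernetes.io/hostname);
# sub-node (die-level) affinity requires a custom scheduler extension.
apiVersion: v1
kind: Pod
metadata:
  name: training-worker            # illustrative name
  labels:
    workload-class: membw-intensive  # assumed label convention
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                workload-class: membw-intensive
            topologyKey: kubernetes.io/hostname
  containers:
    - name: worker
      image: registry.example.com/llm-train:latest  # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
```

Using `preferredDuringSchedulingIgnoredDuringExecution` rather than the `required` variant keeps the constraint soft, so jobs still schedule when the cluster is full.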
Syllabus
Boosting Training and Inference Performance via Topology-Aware Scheduling of Heterogeneous... He Cao
Taught by
CNCF [Cloud Native Computing Foundation]