Overview
Learn about AutoCCL, an automated tuning method for optimizing collective communication libraries in distributed deep neural network training, through this 15-minute conference presentation from NSDI '25. Discover how researchers from the University of Science and Technology of China and Microsoft Research address the critical challenge of parameter selection in communication libraries, a step often overlooked in network optimizations.

Explore the divide-and-conquer algorithm that tackles the state-explosion problem in configuration search spaces by decoupling implementation-related parameters from search-sensitive ones. Understand the online tuning approach that accounts for communication-computation interference while hiding tuning overhead within early training iterations.

Examine the implementation built on top of NVIDIA's NCCL library, and review evaluation results showing 1.24-1.29× speedups on microbenchmarks compared to NCCL, up to 1.80× improvements with concurrent computation, and 1.07-1.32× reductions in per-iteration training time for large language models and vision models across multi-node GPU clusters with various interconnect configurations.
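The decoupling idea can be illustrated with a minimal sketch: instead of searching the full cross-product of all knobs, fix implementation-related choices first under default settings, then search only the remaining sensitive parameters. The parameter names (`algorithm`, `nchannels`, `chunk_kb`) and the cost model below are illustrative assumptions, not AutoCCL's or NCCL's actual tuning knobs.

```python
import itertools

# Hypothetical parameter groups (names are illustrative, not real NCCL knobs).
IMPL_PARAMS = {"algorithm": ["ring", "tree"]}          # implementation-related
SEARCH_PARAMS = {"nchannels": [1, 2, 4],               # search-sensitive
                 "chunk_kb": [64, 128, 256]}

def measure(config):
    """Stand-in cost model for a real timed collective (lower is better)."""
    cost = config["nchannels"] * 0.5 + 512 / config["chunk_kb"]
    if config["algorithm"] == "tree":
        cost *= 0.9
    return cost

def tune():
    """Divide-and-conquer sketch: resolve the implementation-related choice
    once under defaults, then search only the sensitive parameters, giving
    |A| + |S| measurements instead of |A| * |S|."""
    defaults = {"nchannels": 2, "chunk_kb": 128}
    # Stage 1: pick the implementation-related parameter with default settings.
    best_algo = min(IMPL_PARAMS["algorithm"],
                    key=lambda a: measure({"algorithm": a, **defaults}))
    # Stage 2: exhaustively search the sensitive parameters under that choice.
    keys = list(SEARCH_PARAMS)
    return min(({"algorithm": best_algo, **dict(zip(keys, vals))}
                for vals in itertools.product(*SEARCH_PARAMS.values())),
               key=measure)

if __name__ == "__main__":
    print(tune())  # best configuration under the toy cost model
```

With this toy model the two-stage search takes 2 + 9 measurements rather than the 18 a flat grid search would need; the real system additionally overlaps its measurements with early training iterations to hide the tuning cost.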
Syllabus
NSDI '25 - AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and...
Taught by
USENIX