Overview
Learn about OptiReduce, a collective-communication system for distributed deep learning in shared cloud environments, through this 16-minute conference presentation from NSDI '25. Discover how the system addresses computation and communication variability by providing bounded, predictable completion times for deep-learning jobs. Explore how it exploits the inherent resiliency and stochastic nature of distributed deep-learning training and fine-tuning to work effectively with approximated or lost gradients. Understand the key mechanisms introduced, including an unreliable bounded transport with adaptive timeouts to improve tail execution time, and techniques such as Transpose AllReduce and the Hadamard transform to mitigate the accuracy impact of dropped gradients. Examine evaluation results showing that OptiReduce achieves 70% faster time-to-accuracy than Gloo and 30% faster than NCCL in shared cloud environments, while maintaining an efficient balance between tail performance and trained-model accuracy.
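To give a flavor of the Hadamard-transform idea mentioned above: applying an orthonormal Hadamard transform to a gradient vector before transmission spreads each coordinate's information across all entries, so that a dropped packet (a zeroed transformed entry) turns into small, evenly distributed error after the inverse transform rather than a wholly missing gradient coordinate. The sketch below is illustrative only (it is not OptiReduce's actual implementation); the `fwht` helper is a standard fast Walsh-Hadamard transform, hypothetical in name.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform with orthonormal scaling.

    Length of x must be a power of two; with the 1/sqrt(n) scaling the
    transform is its own inverse (H is symmetric and orthogonal).
    """
    x = np.asarray(x, dtype=float).copy()
    n = len(x)
    h = 1
    while h < n:
        # Butterfly step: combine pairs (j, j+h) within blocks of size 2h.
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)

# Illustrative use: transform a gradient, simulate one lost entry,
# then invert. The loss becomes a small error spread over all
# coordinates instead of one missing gradient value.
g = np.array([1.0, 2.0, 3.0, 4.0])
t = fwht(g)       # transform before sending
t[1] = 0.0        # simulate a dropped packet
g_hat = fwht(t)   # self-inverse transform recovers an approximation of g
```

Because the transform is orthonormal, the squared error introduced by zeroing k transformed entries equals the squared magnitude of those entries, and it is distributed across every coordinate of the recovered gradient, which is what makes stochastic training tolerant of such drops.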
Syllabus
NSDI '25 - OptiReduce: Resilient and Tail-Optimal AllReduce for Distributed Deep Learning in the...
Taught by
USENIX