YouTube

OptiReduce - Resilient and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud

USENIX via YouTube

Overview

Learn about OptiReduce, a collective-communication system designed for distributed deep learning in cloud environments, in this 16-minute conference presentation from NSDI '25. Discover how the system copes with variability in computation and communication by providing bounded, predictable completion times for deep-learning jobs. Explore how it exploits the inherent resiliency and stochastic nature of distributed training and fine-tuning to tolerate approximated or lost gradients. Understand its key mechanisms, including an unreliable bounded transport with adaptive timeouts that improves tail execution time, and techniques such as Transpose AllReduce and the Hadamard transform that mitigate the impact of dropped gradients on model accuracy. Examine evaluation results showing that, in shared cloud environments, OptiReduce reaches target accuracy 70% faster than Gloo and 30% faster than NCCL, while maintaining an efficient balance between tail performance and trained-model accuracy.
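The Hadamard-transform idea mentioned in the overview can be illustrated with a small sketch. This is not OptiReduce's actual implementation — the array length, drop pattern, and helper `fwht` are illustrative assumptions — but it shows why rotating a gradient before transmission helps: if a chunk is dropped in the rotated domain, the loss is spread thinly across every coordinate after inverting, instead of zeroing a contiguous block outright.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform; len(x) must be a power of two.
    Applying it twice returns the input scaled by len(x)."""
    x = x.astype(float).copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

rng = np.random.default_rng(0)
n, m = 1024, 64             # gradient length and size of the dropped chunk (hypothetical)
grad = rng.normal(size=n)

# Naive drop: losing one chunk wipes out those coordinates entirely.
naive = grad.copy()
naive[:m] = 0.0

# Hadamard-style drop: rotate first, lose the same chunk, rotate back.
t = fwht(grad)
t[:m] = 0.0
recovered = fwht(t) / n     # FWHT is self-inverse up to a factor of n

# The rotation spreads the loss over all coordinates, so the worst
# per-coordinate error is far smaller than losing a chunk outright.
worst_naive = np.max(np.abs(grad - naive))
worst_hadamard = np.max(np.abs(grad - recovered))
```

By Parseval's identity the total error energy is the same either way; the gain is that no single coordinate is completely lost, which is what makes the stochastic training process tolerant of the drop.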

Syllabus

NSDI '25 - OptiReduce: Resilient and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud

Taught by

USENIX
