Overview
Explore the intersection of high-performance computing and modern deep learning in this conference talk, which bridges traditional HPC paradigms with PyTorch's distributed computing ecosystem. Discover how familiar HPC concepts such as collective operations, point-to-point communication, and process groups manifest in PyTorch's distributed APIs, and learn how PyTorch builds on battle-tested communication backends (NCCL, Gloo, and MPI) while introducing primitives optimized for gradient synchronization and model parallelism.

Move beyond basic data parallelism to examine memory-saving techniques such as Fully Sharded Data Parallel (FSDP), PyTorch's native solution for scaling memory, and explore the emerging Tensor and Pipeline Parallelism APIs, which show how these techniques compose to train massive models. By mapping distributed-systems concepts onto PyTorch's implementation, the talk offers a comprehensive view of PyTorch's distributed architecture, one of the most actively developed areas in modern ML infrastructure, and highlights where familiar patterns from parallel computing leave room for innovation and improvement.
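To make the collective-operation concept concrete before the talk, here is a minimal, pure-Python simulation of the ring all-reduce pattern, the communication scheme commonly used by backends such as NCCL for gradient synchronization. This sketch is not from the talk and is not PyTorch code (a real run would use `torch.distributed.all_reduce` across processes); the "workers" here are just lists in one process, purely to illustrate the two phases (reduce-scatter, then all-gather) that the collective performs.

```python
def ring_all_reduce(buffers):
    """Simulate ring all-reduce: each worker starts with one vector and
    ends with the elementwise sum of all vectors. Workers are modeled as
    lists in a single process; real backends exchange these chunks over
    the network or NVLink."""
    n = len(buffers)                      # number of workers in the ring
    assert all(len(b) == len(buffers[0]) for b in buffers)
    assert len(buffers[0]) % n == 0       # each vector splits into n chunks
    chunk = len(buffers[0]) // n
    data = [list(b) for b in buffers]     # per-worker working copies

    def seg(c):                           # slice covering chunk c
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. In each of n-1 steps, worker i sends one
    # chunk to its right neighbor, which accumulates it. Afterward,
    # worker i owns the fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, data[i][seg((i - step) % n)])
                 for i in range(n)]       # snapshot before applying
        for i, c, payload in sends:
            dst = (i + 1) % n
            for k, v in enumerate(payload):
                data[dst][c * chunk + k] += v

    # Phase 2: all-gather. The reduced chunks circulate around the ring
    # until every worker holds the complete summed vector.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][seg((i + 1 - step) % n)][:])
                 for i in range(n)]
        for i, c, payload in sends:
            data[(i + 1) % n][seg(c)] = payload

    return data


# Three workers, each contributing a 3-element gradient vector:
workers = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
result = ring_all_reduce(workers)
print(result)  # every worker ends with [111, 222, 333]
```

Each worker sends only one chunk per step, so total bytes sent per worker stay near 2x the vector size regardless of the number of workers, which is why this pattern scales well for large gradient tensors.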
Syllabus
MPI Meets Machine Learning: Unlocking PyTorch distributed for scaling AI workloads - DevConf.IN 2026
Taught by
DevConf