Overview
Learn distributed training strategies essential for efficiently scaling deep learning models in this 31-minute conference talk from Ray Summit 2025. Discover the core techniques of data parallelism, model parallelism, and pipeline parallelism, and when each approach is most effective as models and datasets grow. Explore advanced methods such as sharded training and ZeRO, along with the tradeoffs that arise in real-world large-cluster environments. Address the toughest challenges in distributed training, including communication overhead, fault tolerance, reproducibility, and heterogeneous compute. See demonstrations of how PyTorch and Ray can be combined to implement these strategies with minimal code changes, and learn to use Ray with PyTorch to launch, orchestrate, and monitor large-scale distributed training jobs, making it easier to scale from prototype to production.
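The talk itself demonstrates these strategies with PyTorch and Ray; as a minimal standalone sketch of the data-parallel idea only (not the talk's actual code, and using no Ray or PyTorch APIs), the example below fits a toy model y = w * x with synchronous data parallelism: each simulated worker computes a gradient on its own shard of the data, the gradients are averaged (the role an all-reduce plays in real systems), and every worker applies the identical update.

```python
# Toy illustration of data parallelism (hypothetical example, not from the talk).
# Each "worker" holds a shard of the data, computes a local gradient, and the
# gradients are averaged before a shared update -- the same pattern PyTorch's
# DistributedDataParallel implements with an all-reduce across GPUs.

def local_gradient(w, shard):
    """Gradient of mean squared error for y = w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr=0.01):
    """One synchronous step: compute per-worker grads, then 'all-reduce'."""
    grads = [local_gradient(w, s) for s in shards]  # done in parallel in practice
    avg_grad = sum(grads) / len(grads)              # stand-in for all-reduce
    return w - lr * avg_grad                        # identical update on every worker

# Data generated from y = 3x, split round-robin across 4 simulated workers.
data = [(x, 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
# w converges toward the true slope, 3.0
```

Because every worker sees the same averaged gradient, the model stays bit-identical across workers after each step; the cost of that synchronization is exactly the communication overhead the talk discusses.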
Syllabus
How to Get Started with Distributed Training at Scale | Ray Summit 2025
Taught by
Anyscale