Overview
Learn distributed training strategies essential for efficiently scaling deep learning models in this 31-minute conference talk from Ray Summit 2025. Discover the core techniques of data parallelism, model parallelism, and pipeline parallelism, and when each approach is most effective as models and datasets grow. Explore advanced methods such as sharded training and ZeRO, along with the tradeoffs that arise in real-world large-cluster environments. Address the toughest challenges in distributed training, including communication overhead, fault tolerance, reproducibility, and heterogeneous compute. See demonstrations of how PyTorch and Ray can be combined to implement these strategies with minimal code changes, and learn to use Ray with PyTorch to launch, orchestrate, and monitor large-scale distributed training jobs, making it easier to scale from prototype to production.
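The talk itself demonstrates these strategies with PyTorch and Ray; as a minimal standalone sketch of the data-parallel idea only (not the talk's actual code, and using no Ray or PyTorch APIs), the example below fits a toy model y = w * x with synchronous data parallelism: each simulated worker computes a gradient on its own shard of the data, the gradients are averaged (the role an all-reduce plays in real systems), and every worker applies the identical update.

```python
# Toy illustration of data parallelism (hypothetical example, not from the talk).
# Each "worker" holds a shard of the data, computes a local gradient, and the
# gradients are averaged before a shared update -- the same pattern PyTorch's
# DistributedDataParallel implements with an all-reduce across GPUs.

def local_gradient(w, shard):
    """Gradient of mean squared error for y = w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr=0.01):
    """One synchronous step: compute per-worker grads, then 'all-reduce'."""
    grads = [local_gradient(w, s) for s in shards]  # done in parallel in practice
    avg_grad = sum(grads) / len(grads)              # stand-in for all-reduce
    return w - lr * avg_grad                        # identical update on every worker

# Data generated from y = 3x, split round-robin across 4 simulated workers.
data = [(x, 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
# w converges toward the true slope, 3.0
```

Because every worker sees the same averaged gradient, the model stays bit-identical across workers after each step; the cost of that synchronization is exactly the communication overhead the talk discusses.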
Syllabus
How to Get Started with Distributed Training at Scale | Ray Summit 2025
Taught by
Anyscale