Overview
Learn distributed training strategies essential for efficiently scaling deep learning models in this 31-minute conference talk from Ray Summit 2025. Discover the core techniques of data parallelism, model parallelism, and pipeline parallelism, and learn when each approach is most effective as models and datasets grow. Explore advanced methods such as sharded training and ZeRO, along with the tradeoffs that arise in real-world large-cluster environments. Address the toughest challenges in distributed training, including communication overhead, fault tolerance, reproducibility, and managing heterogeneous compute. Finally, see demonstrations of how PyTorch and Ray can be combined to launch, orchestrate, and monitor large-scale distributed training jobs with minimal code changes, making it easier to scale from prototype to production.
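To make the first of these techniques concrete, here is a minimal, stdlib-only sketch of the idea behind data parallelism: each worker computes gradients on its own shard of the data, and an all-reduce averages those gradients so every replica applies the same update. This is an illustration, not the speaker's code; the model (a single scalar weight fit to y = 3x with squared loss), the shard layout, and all function names are assumptions for the example.

```python
def local_gradient(w, shard):
    """Average gradient of (w*x - y)^2 over one worker's data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    """Stand-in for collective communication: average the workers' gradients."""
    return sum(grads) / len(grads)

def train(shards, w=0.0, lr=0.01, steps=100):
    for _ in range(steps):
        # In a real cluster these run in parallel, one per worker/GPU.
        grads = [local_gradient(w, shard) for shard in shards]
        # Synchronized update: every replica sees the same averaged gradient.
        w -= lr * all_reduce_mean(grads)
    return w

data = [(x, 3.0 * x) for x in range(1, 9)]
shards = [data[0:2], data[2:4], data[4:6], data[6:8]]  # 4 "workers"
w = train(shards)
print(round(w, 3))  # prints 3.0
```

In frameworks such as PyTorch DDP, `all_reduce_mean` is replaced by a real collective over NCCL or Gloo, and Ray Train can handle launching one such worker per GPU across the cluster.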
Syllabus
How to Get Started with Distributed Training at Scale | Ray Summit 2025
Taught by
Anyscale