
How to Get Started with Distributed Training at Scale

Anyscale via YouTube

Overview

Learn the distributed training strategies essential for efficiently scaling deep learning models in this 31-minute conference talk from Ray Summit 2025. Discover the core techniques of data parallelism, model parallelism, and pipeline parallelism, and understand when each approach is most effective as models and datasets grow. Explore advanced methods, including sharded training and ZeRO, along with the tradeoffs that arise in real-world large-cluster environments. Confront the toughest challenges in distributed training, such as communication overhead, fault tolerance, reproducibility, and managing heterogeneous compute. Finally, see demonstrations of how PyTorch and Ray can be combined to launch, orchestrate, and monitor large-scale distributed training jobs with minimal code changes, making it easier to scale from prototype to production.
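
To make the "minimal code changes" idea concrete, the following is an illustrative sketch of data-parallel training with Ray Train's TorchTrainer (assuming the Ray 2.x API); the model, dataset, and train_func here are toy stand-ins for illustration, not code from the talk:

```python
# A minimal sketch of data-parallel PyTorch training with Ray Train.
# Assumes Ray 2.x; the model and data are toy placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func():
    # Each Ray worker runs this loop; Ray sets up torch.distributed,
    # wraps the model in DistributedDataParallel, and places it on a device.
    model = nn.Linear(10, 1)
    model = ray.train.torch.prepare_model(model)

    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    loader = DataLoader(dataset, batch_size=32)
    # Adds a DistributedSampler so each worker sees a distinct data shard.
    loader = ray.train.torch.prepare_data_loader(loader)

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()

    for epoch in range(2):
        for x, y in loader:
            loss = loss_fn(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Report metrics back to the driver for monitoring.
        ray.train.report({"epoch": epoch, "loss": loss.item()})


trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=False),
)
result = trainer.fit()
print(result.metrics)
```

The training loop itself stays plain PyTorch; scaling from a laptop to a cluster mostly means changing the ScalingConfig, which reflects the prototype-to-production pattern the talk describes.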

Syllabus

How to Get Started with Distributed Training at Scale | Ray Summit 2025

Taught by

Anyscale

