Building Fault-Tolerant Massive Ray Clusters on Anyscale

Anyscale via YouTube

Overview

Learn how to build and operate fault-tolerant Ray clusters at massive scale in this conference talk from Ray Summit 2025. Discover the engineering challenges that emerge when running AI workloads on clusters exceeding 10,000 nodes, including network instability, spot instance preemptions, hardware failures, resource contention, and unpredictable infrastructure behavior. Explore how Ray's distributed runtime is architected to handle these failures gracefully while keeping workloads running reliably despite constant system churn. Gain insights into key engineering strategies for fault tolerance, distributed state management, recovery mechanisms, elasticity, and workload-aware scheduling at extreme scale. Examine upcoming improvements to Ray's scalability, reliability, and performance designed to support the next generation of AI applications, including large-scale model training, distributed inference, reinforcement learning, and multimodal pipelines.

Syllabus

Building Fault-Tolerant Massive Ray Clusters on Anyscale | Ray Summit 2025

Taught by

Anyscale
