Building Fault-Tolerant Massive Ray Clusters on Anyscale

Learn how to build and operate fault-tolerant Ray clusters at massive scale in this conference talk from Ray Summit 2025. Discover the engineering challenges that emerge when running AI workloads on clusters exceeding 10,000 nodes, including network instability, spot instance preemptions, hardware failures, resource contention, and unpredictable infrastructure behavior. Explore how Ray's distributed runtime is architected to handle these failures gracefully, maintaining workload continuity and strong reliability despite constant system churn. Gain insights into key engineering strategies for building fault tolerance, managing distributed state, enabling recovery mechanisms, achieving elasticity, and applying workload-aware scheduling at extreme scale. Examine upcoming improvements to Ray's scalability, reliability, and performance designed to support the next generation of AI applications, including large-scale model training, distributed inference, reinforcement learning, and multimodal pipelines.