
How AWS Scales Reinforcement Learning Across Thousands of GPUs

Anyscale via YouTube

Overview

Learn how to build fault-tolerant, scalable reinforcement learning systems for large-scale LLM alignment by combining Ray's elasticity with Amazon SageMaker HyperPod's resiliency in this 32-minute conference talk from Ray Summit 2025.

Discover the evolving landscape of reinforcement learning as it expands beyond gaming and robotics into LLM alignment, agentic workloads, and real-world control with trillion-parameter base models, where success depends on robustness, scalability, and elasticity rather than just raw computational speed. Understand why many RL pipelines fail at cluster scale due to GPU failures, preemptions, and tail latencies that degrade goodput, the true measure of useful learning per GPU-hour.

Explore Ray's unified programming model, which lets RL pipelines scale seamlessly from single machines to massive distributed clusters and already powers popular frameworks like Verl for worker orchestration, rollout coordination, and efficient data movement across nodes. Get introduced to Amazon SageMaker HyperPod, a persistent, highly resilient GPU cluster designed for distributed AI at extreme scale, and learn how pairing Ray's efficiency with HyperPod's robustness enables scalable post-training pipelines across hundreds or thousands of GPUs.

Dive into architectural details, including running Ray Jobs on HyperPod at massive scale, leveraging vLLM for high-throughput inference workers, coordinating PPO, GRPO, and DAPO pipelines using Verl + Ray, recovering gracefully from GPU failures and preemptions to maximize goodput, and designing RL factory architectures suitable for trillion-parameter model alignment. Access reference implementations for post-training large open-weight models, along with practical strategies for optimizing performance, cost, and reliability in next-generation fault-tolerant, massively scalable RL systems.
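To make the goodput idea concrete, here is a minimal sketch of the metric as the talk frames it (useful learning per GPU-hour). The function name and the failure/recovery numbers are illustrative assumptions, not figures from the talk.

```python
# Hypothetical illustration of "goodput": the fraction of GPU-hours
# that actually advance training, after failures and recovery rework.
# All names and numbers here are assumptions for the sketch.

def goodput(useful_gpu_hours: float, total_gpu_hours: float) -> float:
    """Fraction of total GPU-hours that produced useful training progress."""
    if total_gpu_hours <= 0:
        raise ValueError("total_gpu_hours must be positive")
    return useful_gpu_hours / total_gpu_hours

# Example: a 256-GPU job runs for 10 hours (2560 GPU-hours total),
# but 3 GPU failures each stall the whole cluster for ~30 minutes
# while the job recovers from its last checkpoint.
total = 256 * 10.0
lost = 3 * 0.5 * 256  # 384 GPU-hours spent on recovery and rework
print(f"goodput: {goodput(total - lost, total):.1%}")
```

Under these assumed numbers the job delivers 85% goodput, which is why the talk emphasizes graceful failure recovery over raw per-step speed: every stall multiplies across the entire cluster.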

Syllabus

How AWS Scales Reinforcement Learning Across Thousands of GPUs | Ray Summit 2025

Taught by

Anyscale

