
How AWS Scales Reinforcement Learning Across Thousands of GPUs

Anyscale via YouTube

Overview

Learn how to build fault-tolerant, scalable reinforcement learning systems for large-scale LLM alignment by combining Ray's elasticity with Amazon SageMaker HyperPod's resiliency in this 32-minute conference talk from Ray Summit 2025.

Discover the evolving landscape of reinforcement learning as it expands beyond gaming and robotics into LLM alignment, agentic workloads, and real-world control with trillion-parameter base models, where success depends on robustness, scalability, and elasticity rather than just raw computational speed. Understand why many RL pipelines fail at cluster scale due to GPU failures, preemptions, and tail latencies that degrade goodput, the true measure of useful learning per GPU-hour.

Explore Ray's unified programming model, which lets RL pipelines scale seamlessly from single machines to massive distributed clusters and already powers popular frameworks like Verl for worker orchestration, rollout coordination, and efficient data movement across nodes. Get introduced to Amazon SageMaker HyperPod, a persistent, highly resilient GPU cluster designed for distributed AI at extreme scale, and learn how pairing Ray's efficiency with HyperPod's robustness enables scalable post-training pipelines across hundreds or thousands of GPUs.

Dive into architectural details, including running Ray Jobs on HyperPod at massive scale, leveraging vLLM for high-throughput inference workers, coordinating PPO, GRPO, and DAPO pipelines using Verl + Ray, recovering gracefully from GPU failures and preemptions to maximize goodput, and designing RL factory architectures suitable for trillion-parameter model alignment. Access reference implementations for post-training large open-weight models, along with practical strategies for optimizing performance, cost, and reliability in next-generation fault-tolerant, massively scalable RL systems.
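To make the goodput idea concrete, here is a minimal sketch of the metric as the talk frames it (useful learning per GPU-hour). The function name and the failure/recovery numbers are illustrative assumptions, not figures from the talk.

```python
# Hypothetical illustration of "goodput": the fraction of GPU-hours
# that actually advance training, after failures and recovery rework.
# All names and numbers here are assumptions for the sketch.

def goodput(useful_gpu_hours: float, total_gpu_hours: float) -> float:
    """Fraction of total GPU-hours that produced useful training progress."""
    if total_gpu_hours <= 0:
        raise ValueError("total_gpu_hours must be positive")
    return useful_gpu_hours / total_gpu_hours

# Example: a 256-GPU job runs for 10 hours (2560 GPU-hours total),
# but 3 GPU failures each stall the whole cluster for ~30 minutes
# while the job recovers from its last checkpoint.
total = 256 * 10.0
lost = 3 * 0.5 * 256  # 384 GPU-hours spent on recovery and rework
print(f"goodput: {goodput(total - lost, total):.1%}")
```

Under these assumed numbers the job delivers 85% goodput, which is why the talk emphasizes graceful failure recovery over raw per-step speed: every stall multiplies across the entire cluster.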

Syllabus

How AWS Scales Reinforcement Learning Across Thousands of GPUs | Ray Summit 2025

Taught by

Anyscale

