Distributed ML Training with KubeRay at Robinhood

Explore how Robinhood scaled its machine learning platform to support large-model and large-dataset training with KubeRay in this 32-minute conference talk from Ray Summit 2025. Learn from Lanting Chiang and Robert Macy as they detail Robinhood's journey from the limits of single-node training to the distributed training capabilities needed for future model development. Discover the evaluation process and architectural decisions that led to adopting KubeRay for large-scale distributed training, including how Ray was integrated into the existing ML training stack. Understand the platform-level abstractions Robinhood built to make distributed training seamless and accessible for internal teams, and examine how the company's unique Kubernetes environment influenced the choice between native KubeRay components and alternative solutions. Gain practical insights into integrating Ray into a production ML platform, including lessons learned, architectural best practices, and strategies for enabling distributed training at scale in enterprise environments.