Overview
Learn how NVIDIA and Roblox built a modern ML platform using Ray to train large-scale 3D foundation models in this 30-minute conference talk from Ray Summit 2025. Discover the platform's architecture, including the integration of KubeRay with Istio and Kubeflow to support authentication, multi-tenancy, and secure orchestration. Explore infrastructure innovations such as peer-to-peer Docker image distribution, lazy image pulling, and the ability to scale Ray jobs across multiple clusters, along with the open-sourcing of the new KubeRay dashboard for faster iterative development.

Examine the challenges of applying Ray to foundation-model training workloads, including orchestrating massive LLM batch-labeling jobs, using Ray Data at scale, and supporting large distributed pipelines across heterogeneous compute. Understand how Roblox moved from MPI-based distributed training to Ray Train as its default framework, improving reliability and simplifying operations while gaining critical capabilities such as observability and fault tolerance.

Gain practical insights into building production-grade Ray platforms, modernizing distributed training workflows, and supporting multimodal foundation model development at enterprise scale.
Syllabus
NVIDIA’s Framework for Scalable Data Curation | Ray Summit 2025
Taught by
Anyscale