NVIDIA NeMo Curator - Scaling Multi-Modal Data Curation Workflows

Learn how to architect and scale petabyte-level multi-modal data curation workflows for Generative AI applications in this 26-minute conference talk from Ray Summit 2025. Discover the challenges of processing text, video, and audio at massive scale, and explore how Ray's distributed computing framework serves as the backbone for large, heterogeneous pipelines. Understand why multi-modal processing is difficult: diverse workloads must be coordinated, stateful operations such as deduplication must be managed, and GPU acceleration must be used efficiently, all while maintaining reliability across distributed clusters. Examine architectural patterns and best practices drawn from real-world experience building NVIDIA NeMo Curator: Ray Actors enable stateful, long-running operations for deduplication and metadata tracking, Ray Tasks provide highly parallel processing for batch transformations, and heterogeneous CPU/GPU resource management maximizes throughput across multi-modal workloads. Gain practical insights into operating efficient pipelines across petabytes of data and thousands of distributed workers, and acquire the knowledge needed to build scalable, resilient, GPU-accelerated data pipelines for next-generation Generative AI applications.
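
The split between stateless, parallel transformations and a long-lived stateful deduplicator can be illustrated with a minimal sketch. This is not NeMo Curator code and it does not use Ray: it imitates the pattern with Python's standard library so that it runs anywhere. The names `normalize`, `Deduplicator`, and `curate` are hypothetical, chosen only for this illustration. In Ray itself, the function would typically be declared as a task and the class as an actor with the `@ray.remote` decorator, and GPU-bound stages would request resources through options such as `num_gpus`.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def normalize(record: str) -> str:
    """Stateless transformation (the role the talk assigns to Ray Tasks)."""
    # Lowercase and collapse runs of whitespace into single spaces.
    return " ".join(record.lower().split())


class Deduplicator:
    """Long-lived stateful component (the role the talk assigns to Ray Actors).

    Hypothetical stand-in: it keeps the set of content hashes seen so far.
    """

    def __init__(self) -> None:
        self._seen = set()

    def filter_new(self, records):
        """Return only the records whose content hash has not been seen."""
        fresh = []
        for record in records:
            digest = hashlib.sha256(record.encode("utf-8")).hexdigest()
            if digest not in self._seen:
                self._seen.add(digest)
                fresh.append(record)
        return fresh


def curate(batches, max_workers: int = 4):
    """Normalize each batch in parallel, then deduplicate across all batches."""
    dedup = Deduplicator()  # one owner of the dedup state for the whole run
    kept = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in batches:
            # pool.map preserves input order, so results stay deterministic.
            cleaned = list(pool.map(normalize, batch))
            kept.extend(dedup.filter_new(cleaned))
    return kept


if __name__ == "__main__":
    batches = [["Hello  World", "hello world"], ["HELLO WORLD", "New Doc"]]
    print(curate(batches))
```

The design point the sketch tries to capture is that the transformation carries no state and can be fanned out freely, while deduplication state has a single owner that outlives any one batch. A thread pool on one machine only approximates this; distributing the work across a cluster is what Ray provides.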