NVIDIA NeMo Curator - Scaling Multi-Modal Data Curation Workflows

Anyscale via YouTube

Overview

Learn how to architect and scale petabyte-level multi-modal data curation workflows for Generative AI applications in this 26-minute conference talk from Ray Summit 2025. Discover the unique challenges of processing diverse data types including text, video, and audio at massive scale, and explore how Ray's distributed computing framework serves as the backbone for large, heterogeneous pipelines. Understand why multi-modal data processing presents distinct difficulties in coordinating diverse workloads, managing stateful operations like deduplication, and efficiently utilizing GPU acceleration while maintaining reliability across distributed clusters. Examine architectural patterns and best practices demonstrated through real-world experience building NVIDIA NeMo Curator, including how Ray Actors enable stateful, long-running operations for deduplication and metadata tracking, how Ray Tasks provide highly parallel processing for batch transformations, and how heterogeneous CPU/GPU resource management maximizes throughput across multi-modal workloads. Gain practical insights into operating efficient pipelines across petabytes of data and thousands of distributed workers, and acquire the knowledge needed to build scalable, resilient, GPU-accelerated data pipelines for next-generation Generative AI applications.
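The split described above, between stateless transformations that run in parallel and a single stateful component that tracks what has already been seen, can be illustrated with a small sketch. This is only an illustration under stated assumptions: it uses the Python standard library's thread pool as a stand-in and does not use Ray or NeMo Curator APIs. The names `normalize`, `Deduplicator`, and `curate`, and the record layout, are hypothetical. In a Ray-based pipeline, the role of `normalize` would roughly correspond to a task and the role of `Deduplicator` to an actor.

```python
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor


def normalize(record: dict) -> dict:
    """Stateless transform (hypothetical): safe to run on many records in parallel."""
    # Lowercase and collapse whitespace so trivially different copies compare equal.
    text = " ".join(record["text"].lower().split())
    return {"id": record["id"], "modality": record["modality"], "text": text}


class Deduplicator:
    """Stateful component (hypothetical): remembers hashes of content already seen."""

    def __init__(self) -> None:
        self._seen = set()
        self._lock = threading.Lock()  # guards the shared set of hashes

    def is_new(self, record: dict) -> bool:
        digest = hashlib.sha256(record["text"].encode("utf-8")).hexdigest()
        with self._lock:
            if digest in self._seen:
                return False
            self._seen.add(digest)
            return True


def curate(records: list, workers: int = 4) -> list:
    """Run the parallel transform, then filter through the single stateful deduplicator."""
    dedup = Deduplicator()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so the first copy of a duplicate wins.
        normalized = list(pool.map(normalize, records))
    return [r for r in normalized if dedup.is_new(r)]
```

With a few made-up records, where the text field stands in for a caption or transcript, the exact-duplicate entry is dropped after normalization:

```python
records = [
    {"id": 1, "modality": "text", "text": "Hello   World"},
    {"id": 2, "modality": "audio", "text": "hello world"},
    {"id": 3, "modality": "video", "text": "Another clip"},
]
kept = curate(records)
print([r["id"] for r in kept])
```

Keeping the deduplication state in one place makes the parallel stage easy to scale, at the cost of funnelling every record through a single component. The sketch does not cover exact-versus-fuzzy deduplication, GPU scheduling, or fault tolerance, all of which the talk addresses at cluster scale.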

Syllabus

NVIDIA NeMo Curator: Scaling Multi-Modal Data Curation Workflows | Ray Summit 2025

Taught by

Anyscale
