Overview
This 35-minute conference talk from Ray Summit 2025 shows how to turn complex multimodal dataset construction into a scalable, automated pipeline. Engineers from Netflix and LanceDB describe how they handle the resource-intensive work of curating massive video and image datasets by combining Ray's distributed processing with LanceDB's high-performance storage. Netflix uses Ray for distributed ingestion, filtering, and large-scale inference across very large multimodal corpora, while LanceDB serves as the unified storage and query layer throughout the data curation lifecycle. The talk covers three areas: distributing processing across hundreds of GPUs to accelerate ingestion and filtering, running batch inference at scale with vision-language models to score and caption content, and using LanceDB's columnar design for curation and sampling that reduces dataset size while increasing diversity. The techniques are aimed at multimodal dataset pipelines in both research and production environments.
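To make "sampling that reduces dataset size while increasing diversity" concrete, here is a minimal sketch of one common approach, greedy farthest-point sampling over embedding vectors. This is an illustrative assumption, not the method presented in the talk: the function name `farthest_point_sample` and the toy 2-D points are hypothetical, and the sketch uses only the Python standard library rather than Ray or LanceDB.

```python
import math

def farthest_point_sample(embeddings, k):
    """Greedily pick k indices so each new pick is far from all earlier picks.

    Hypothetical helper for illustration; embeddings is a list of
    equal-length numeric tuples (e.g. one vector per image or clip).
    """
    n = len(embeddings)
    if n == 0 or k <= 0:
        return []
    k = min(k, n)
    selected = [0]  # arbitrary starting item
    # min_dist[i] = distance from item i to its nearest selected item
    min_dist = [math.dist(e, embeddings[0]) for e in embeddings]
    min_dist[0] = -1.0  # mark as already selected so it is never re-picked
    while len(selected) < k:
        # Choose the item that is farthest from everything chosen so far.
        nxt = max(range(n), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], math.dist(e, embeddings[nxt]))
        min_dist[nxt] = -1.0
    return selected

# Toy data: three tight clusters of near-duplicate points.
points = [(0, 0), (0.1, 0), (0, 0.1), (10, 0), (10.1, 0), (0, 10)]
sample = farthest_point_sample(points, 3)
```

On the toy data, the three selected indices fall in three different clusters, so near-duplicates are dropped while coverage is kept. In a pipeline like the one the talk describes, the vectors would presumably come from a model and be read from storage rather than held in a Python list; this quadratic-time loop is only meant to show the idea.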
Syllabus
Scaling Multimodal Data Curation with Ray and LanceDB | Ray Summit 2025
Taught by
Anyscale