Ray Data and vLLM for Scalable Image Captioning

Learn how to build scalable post-training workflows for large-scale vision-language models using Ray Data, Ray Data LLM, and vLLM in this 31-minute conference talk from Ray Summit 2025. Discover how Anindya Saha from Zoox transforms complex, multi-stage pipelines into clean, scalable, production-ready systems by addressing the coordination challenges typically seen in post-training pipelines for image-text models. Explore an end-to-end image captioning workflow that demonstrates Ray's unified stack for both rapid prototyping and seamless scaling to production workloads.

Master practical patterns including evolving from prototype to production using Ray-based abstractions, leveraging Ray Data for efficient distributed loading and transformation of massive image datasets, and using Ray Data LLM's Processor abstraction to integrate vLLM with preprocessing and postprocessing logic. Understand how to scale vLLM inference across multiple GPUs for high-throughput caption generation, implement fully customized preprocessing and postprocessing pipelines with class-based state management, and integrate Prometheus and Grafana for real-time performance visibility.

Learn to optimize GPU utilization across distributed workloads using Ray's scheduling and resource controls, enabling high throughput, strong resource efficiency, and simplified maintenance for complex post-training workloads involving millions of images. Gain actionable design patterns and a clear path for building robust, scalable vision-language post-training pipelines.
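As a rough illustration of the "class-based state management" pattern mentioned above, the following is a minimal, library-free Python sketch and not the code shown in the talk. Every name in it (PromptBuilder, CaptionCleaner, fake_generate, run_pipeline) is hypothetical, and fake_generate is only a stand-in for the vLLM inference stage. The idea it sketches is that each stage is a callable object that sets up its state once in its constructor and is then applied to batch after batch.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# A batch is modeled here as a dict of equal-length columns (an assumption
# made for this sketch, not a statement about the talk's data format).
Batch = Dict[str, List[str]]


class PromptBuilder:
    """Hypothetical preprocessing stage: builds one prompt per image path."""

    def __init__(self, template: str) -> None:
        # State is created once and reused for every batch; in a real
        # pipeline this might be a tokenizer or an image processor.
        self.template = template

    def __call__(self, batch: Batch) -> Batch:
        prompts = [self.template.format(path=p) for p in batch["image_path"]]
        return {**batch, "prompt": prompts}


class CaptionCleaner:
    """Hypothetical postprocessing stage: trims and length-limits raw output."""

    def __init__(self, max_words: int) -> None:
        self.max_words = max_words

    def __call__(self, batch: Batch) -> Batch:
        captions = [
            " ".join(raw.split()[: self.max_words]) for raw in batch["raw_output"]
        ]
        return {**batch, "caption": captions}


def fake_generate(batch: Batch) -> Batch:
    """Stand-in for the inference stage; a real pipeline would call the model."""
    outputs = [f"  a photo described by: {p}  " for p in batch["prompt"]]
    return {**batch, "raw_output": outputs}


def run_pipeline(
    batches: Iterable[Batch], stages: List[Callable[[Batch], Batch]]
) -> Iterator[Batch]:
    """Apply each stage in order to every batch, yielding the final batches."""
    for batch in batches:
        for stage in stages:
            batch = stage(batch)
        yield batch


if __name__ == "__main__":
    stages = [PromptBuilder("Describe {path}"), fake_generate, CaptionCleaner(4)]
    for result in run_pipeline([{"image_path": ["a.jpg", "b.jpg"]}], stages):
        print(result["caption"])
```

In Ray Data, callable classes of this shape are typically passed to Dataset.map_batches so that each worker constructs the state once; how the talk wires such stages around vLLM through the Processor abstraction is not specified in this description.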