Overview
Learn how to optimize vLLM for massive-scale batch inference through this conference talk from Ray Summit 2025. Discover Sutro's approach to building an accelerated batch inference service that handles workloads ranging from hundreds of tokens to tens of billions per job for synthetic data generation, evaluations, and large-scale unstructured data processing.

Explore the critical importance of predictability in cost, performance, and execution transparency for large offline workloads, and examine Sutro's deeply optimized, vLLM-powered inference engine designed specifically for large batch processing. Dive into custom internal implementation layers built on top of vLLM, including:

- a performance profiler that measures and predicts system behavior in real time
- throughput estimation algorithms that inform batching, scheduling, and hardware allocation
- cost attribution instrumentation that provides precise, job-level visibility into resource usage

Gain practical techniques for designing transparent, high-performance vLLM infrastructure at scale, with insights particularly valuable for teams operating at large batch sizes, generating synthetic datasets, or building evaluation pipelines where cost predictability and throughput consistency are essential.
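To make the ideas concrete, here is a minimal sketch of what throughput estimation and job-level cost attribution on top of a batch inference engine might look like. All names (`JobProfile`, the functions, the $/GPU-hour rate) are invented for illustration and are not from the talk or from vLLM's API; Sutro's actual instrumentation is proprietary.

```python
# Hypothetical sketch: throughput measurement feeding runtime prediction,
# plus per-job cost attribution. Names and pricing are illustrative only.
from dataclasses import dataclass


@dataclass
class JobProfile:
    job_id: str
    total_tokens: int        # tokens processed so far
    elapsed_seconds: float   # wall-clock time consumed so far
    gpu_hourly_rate: float   # assumed $/GPU-hour (illustrative)
    num_gpus: int


def throughput_tokens_per_second(p: JobProfile) -> float:
    """Observed throughput; this is the signal that would inform
    batching, scheduling, and hardware-allocation decisions."""
    return p.total_tokens / p.elapsed_seconds


def estimate_completion_seconds(remaining_tokens: int, tps: float) -> float:
    """Predict remaining runtime from measured throughput."""
    return remaining_tokens / tps


def attribute_cost(p: JobProfile) -> float:
    """Job-level cost attribution: GPU-hours consumed times the rate."""
    gpu_hours = p.num_gpus * p.elapsed_seconds / 3600.0
    return gpu_hours * p.gpu_hourly_rate


profile = JobProfile("job-001", total_tokens=1_000_000,
                     elapsed_seconds=500.0, gpu_hourly_rate=2.0, num_gpus=4)
tps = throughput_tokens_per_second(profile)          # 2000.0 tokens/s
eta = estimate_completion_seconds(9_000_000, tps)    # 4500.0 s for 9M more tokens
cost = attribute_cost(profile)                       # ~$1.11 for this slice
```

The point of the sketch is the feedback loop: measured throughput both predicts job completion (cost and time predictability) and attributes spend to individual jobs, which is the kind of transparency the talk emphasizes for large offline workloads.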
Syllabus
High-Throughput Inference for Synthetic Data & Evals at Sutro | Ray Summit 2025
Taught by
Anyscale