Overview
Learn how to optimize vLLM for massive-scale batch inference in this conference talk from Ray Summit 2025. Discover Sutro's approach to building an accelerated batch inference service that handles workloads ranging from hundreds of tokens to tens of billions per job for synthetic data generation, evaluations, and large-scale unstructured data processing.

Explore the critical importance of predictability in cost, performance, and execution transparency for large offline workloads, and examine Sutro's deeply optimized, vLLM-powered inference engine designed specifically for large batch processing. Dive into custom internal implementation layers built on top of vLLM, including a performance profiler that measures and predicts system behavior in real time, throughput estimation algorithms that inform batching, scheduling, and hardware allocation, and cost attribution instrumentation that provides precise, job-level visibility into resource usage.

Gain practical techniques for designing transparent, high-performance vLLM infrastructure at scale, with insights particularly valuable for teams operating at large batch sizes, generating synthetic datasets, or building evaluation pipelines where cost predictability and throughput consistency are essential.
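The overview mentions throughput estimation and job-level cost attribution feeding scheduling decisions. As a rough illustration of the idea (not Sutro's actual implementation; all names and formulas here are hypothetical), such an estimator can be sketched as profiler samples averaged into a tokens-per-second figure, which then yields a predicted runtime and dollar cost for a pending batch job:

```python
from dataclasses import dataclass


@dataclass
class ProfileSample:
    """One measurement from a (hypothetical) real-time performance profiler."""
    batch_size: int
    tokens_processed: int
    seconds: float


def estimate_throughput(samples: list[ProfileSample]) -> float:
    """Aggregate throughput (tokens/sec) across profiler samples."""
    total_tokens = sum(s.tokens_processed for s in samples)
    total_time = sum(s.seconds for s in samples)
    return total_tokens / total_time


def estimate_job_cost(
    total_tokens: int, throughput_tps: float, gpu_hourly_usd: float
) -> tuple[float, float]:
    """Predict wall-clock seconds and dollar cost for a batch job,
    assuming throughput stays near the profiled rate."""
    seconds = total_tokens / throughput_tps
    return seconds, seconds / 3600 * gpu_hourly_usd


# Example: two profiler samples, then a cost prediction for a 3.6M-token job.
samples = [ProfileSample(8, 1000, 2.0), ProfileSample(8, 3000, 2.0)]
tps = estimate_throughput(samples)            # 1000.0 tokens/sec
secs, cost = estimate_job_cost(3_600_000, tps, gpu_hourly_usd=2.0)
```

A production system would go well beyond this, e.g. modeling throughput as a function of batch size and sequence length rather than a single average, but the core loop of measure, predict, then schedule is the same.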
Syllabus
High-Throughput Inference for Synthetic Data & Evals at Sutro | Ray Summit 2025
Taught by
Anyscale