Overview
Learn how to optimize vLLM for massive-scale batch inference through this conference talk from Ray Summit 2025. Discover Sutro's approach to building an accelerated batch inference service that handles workloads ranging from hundreds of tokens to tens of billions per job for synthetic data generation, evaluations, and large-scale unstructured data processing.

Explore the critical importance of predictability in cost, performance, and execution transparency for large offline workloads, and examine Sutro's deeply optimized, vLLM-powered inference engine designed specifically for large batch processing. Dive into custom internal implementation layers built on top of vLLM, including:

- a performance profiler that measures and predicts system behavior in real time
- throughput estimation algorithms that inform batching, scheduling, and hardware allocation
- cost attribution instrumentation that provides precise, job-level visibility into resource usage

Gain practical techniques for designing transparent, high-performance vLLM infrastructure at scale, with insights particularly valuable for teams operating at large batch sizes, generating synthetic datasets, or building evaluation pipelines where cost predictability and throughput consistency are essential.
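To make the ideas concrete, here is a minimal sketch of what throughput estimation and job-level cost attribution on top of a batch inference engine might look like. All names (`JobProfile`, the functions, the $/GPU-hour rate) are invented for illustration and are not from the talk or from vLLM's API; Sutro's actual instrumentation is proprietary.

```python
# Hypothetical sketch: throughput measurement feeding runtime prediction,
# plus per-job cost attribution. Names and pricing are illustrative only.
from dataclasses import dataclass


@dataclass
class JobProfile:
    job_id: str
    total_tokens: int        # tokens processed so far
    elapsed_seconds: float   # wall-clock time consumed so far
    gpu_hourly_rate: float   # assumed $/GPU-hour (illustrative)
    num_gpus: int


def throughput_tokens_per_second(p: JobProfile) -> float:
    """Observed throughput; this is the signal that would inform
    batching, scheduling, and hardware-allocation decisions."""
    return p.total_tokens / p.elapsed_seconds


def estimate_completion_seconds(remaining_tokens: int, tps: float) -> float:
    """Predict remaining runtime from measured throughput."""
    return remaining_tokens / tps


def attribute_cost(p: JobProfile) -> float:
    """Job-level cost attribution: GPU-hours consumed times the rate."""
    gpu_hours = p.num_gpus * p.elapsed_seconds / 3600.0
    return gpu_hours * p.gpu_hourly_rate


profile = JobProfile("job-001", total_tokens=1_000_000,
                     elapsed_seconds=500.0, gpu_hourly_rate=2.0, num_gpus=4)
tps = throughput_tokens_per_second(profile)          # 2000.0 tokens/s
eta = estimate_completion_seconds(9_000_000, tps)    # 4500.0 s for 9M more tokens
cost = attribute_cost(profile)                       # ~$1.11 for this slice
```

The point of the sketch is the feedback loop: measured throughput both predicts job completion (cost and time predictability) and attributes spend to individual jobs, which is the kind of transparency the talk emphasizes for large offline workloads.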
Syllabus
High-Throughput Inference for Synthetic Data & Evals at Sutro | Ray Summit 2025
Taught by
Anyscale