Overview
Learn how to maximize batch inference throughput for large-scale LLM processing through dynamic partitioning in this 34-minute conference talk from Ray Summit 2025. Discover how Daft addresses a fundamental tension in batch inference: maximizing prefix cache hits typically means pre-sorting data by shared prefix, a blocking step that leaves GPUs idle. Instead of static pre-sorting, Daft uses dynamic prefix partitioning, continuously adjusting partitions in-flight as data streams into vLLM, sustaining high prefix cache hit rates without manual preprocessing while keeping GPUs saturated for the duration of a query. Examine how vLLM's high-performance inference engine is integrated into Daft's distributed execution model powered by Ray, including insights into Daft's query optimizer and execution engine architecture. Review benchmarks on real multimodal pipelines demonstrating end-to-end throughput gains on batch workloads. Understand how dynamic partitioning transforms petabyte-scale multimodal query processing, making large-scale batch inference faster, more efficient, and significantly easier to operate in complex, end-to-end AI pipelines.
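The core idea can be illustrated with a minimal, hypothetical sketch (this is not Daft's actual implementation; the prefix length, flush threshold, and function name below are assumptions chosen for illustration): prompts are bucketed by a shared prefix as they stream in, and each bucket is dispatched as soon as it is large enough, so the inference engine never waits for a global pre-sort of the dataset.

```python
from collections import defaultdict
from typing import Iterable, Iterator

PREFIX_LEN = 20      # assumed: leading characters treated as the cacheable prefix
FLUSH_THRESHOLD = 4  # assumed: bucket size at which a batch is dispatched

def dynamic_prefix_batches(prompts: Iterable[str]) -> Iterator[list[str]]:
    """Group streaming prompts by shared prefix, flushing buckets in-flight
    so the inference engine is never starved by a global pre-sort."""
    buckets: dict[str, list[str]] = defaultdict(list)
    for prompt in prompts:
        key = prompt[:PREFIX_LEN]
        buckets[key].append(prompt)
        if len(buckets[key]) >= FLUSH_THRESHOLD:
            # Dispatch a prefix-homogeneous batch immediately: a prefix cache
            # can reuse the shared KV entries across the whole batch.
            yield buckets.pop(key)
    # Drain any partial buckets once the stream ends.
    yield from buckets.values()

if __name__ == "__main__":
    templates = ["Summarize this review: ", "Translate to French: "]
    stream = (templates[i % 2] + f"item {i}" for i in range(10))
    for batch in dynamic_prefix_batches(stream):
        print(f"batch of {len(batch)} sharing prefix {batch[0][:PREFIX_LEN]!r}")
```

In practice, each yielded batch would be sent to an engine with prefix caching enabled, for example vLLM's `LLM(model=..., enable_prefix_caching=True)`, so the KV cache for the shared prefix is computed once and reused across the batch rather than recomputed per prompt.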
Syllabus
How Daft Boosts Batch Inference Throughput with Dynamic Partitioning | Ray Summit 2025
Taught by
Anyscale