
YouTube

How Daft Boosts Batch Inference Throughput with Dynamic Partitioning

Anyscale via YouTube

Overview

Learn how to maximize batch inference throughput for large-scale LLM processing through dynamic partitioning in this 34-minute conference talk from Ray Summit 2025. Discover how Daft addresses the fundamental challenge of maximizing prefix caching without stalling GPUs by using dynamic prefix partitioning instead of traditional static pre-sorting. Explore how this technique continuously adjusts partitions in-flight as data streams into vLLM, sustaining high prefix cache hit rates without manual preprocessing while keeping GPUs fully saturated for the duration of a query. Examine the integration of vLLM's high-performance inference engine into Daft's Ray-powered distributed execution model, including detailed insights into the Daft query optimizer and execution engine architecture. Review benchmarks on real multimodal pipelines demonstrating end-to-end performance gains across batch workloads. Understand how dynamic partitioning transforms petabyte-scale multimodal query processing, making large-scale batch inference faster, more efficient, and significantly easier to operate in complex, end-to-end AI pipelines.
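To make the core idea concrete, here is a minimal, illustrative sketch of prefix-aware routing in plain Python. It is an assumption-laden toy, not Daft's actual implementation: the `route_by_prefix` helper and the fixed `prefix_len` are hypothetical, and the real system adjusts partitions dynamically inside its execution engine rather than in a single batch pass. The sketch only shows why grouping same-prefix prompts onto the same worker raises prefix cache hit rates.

```python
from collections import defaultdict
from typing import Iterable

def route_by_prefix(prompts: Iterable[str],
                    num_workers: int,
                    prefix_len: int = 32) -> dict[int, list[str]]:
    """Toy sketch of prefix-aware partitioning (hypothetical helper).

    Prompts that share a leading prefix are routed to the same worker,
    so that worker's inference engine can reuse its cached KV state for
    the shared prefix instead of recomputing it per prompt. Daft's
    dynamic approach additionally rebalances partitions in-flight so no
    GPU sits idle; this static pass only illustrates the grouping step.
    """
    buckets: dict[int, list[str]] = defaultdict(list)
    for prompt in prompts:
        # Hash the leading characters so identical prefixes always land
        # on the same worker index.
        key = hash(prompt[:prefix_len]) % num_workers
        buckets[key].append(prompt)
    return dict(buckets)

# Example: two prompts sharing a long system-prompt prefix end up on
# the same worker, while an unrelated prompt may land elsewhere.
system = "You are a helpful assistant. Answer concisely. "
batch = [system + "Summarize this report.",
         system + "Translate this caption.",
         "Describe the attached image in one sentence."]
routed = route_by_prefix(batch, num_workers=4)
```

A static pre-sort achieves the same grouping once, up front; the point of the dynamic variant described in the talk is that the grouping decision is revisited as data streams in, trading a little routing work for sustained GPU saturation.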

Syllabus

How Daft Boosts Batch Inference Throughput with Dynamic Partitioning | Ray Summit 2025

Taught by

Anyscale

