Overview
Learn how to build a robust, scalable inference platform for next-generation generative models in this 17-minute conference talk from Ray Summit 2025. Discover DigitalOcean's approach to handling the rising complexity of inference as models grow in size, context length, and modality, using Ray and vLLM on Kubernetes for both serverless and dedicated GPU workloads.

Explore how Ray's scheduling primitives ensure reliable execution across distributed clusters, how placement groups guarantee GPU affinity and predictable performance, and how Ray's observability tools provide deep insight into system health and workload behavior. Understand how vLLM delivers fast token streaming, optimized batching, and advanced memory and KV-cache management to meet real-world performance requirements. Examine the platform's two operational modes: serverless inference for automatic scaling and cost efficiency, and dedicated inference for fine-grained GPU partitioning and performance isolation.

Dive into advanced optimization techniques for long-context models exceeding 8K tokens, including dynamic batching by token length, KV-cache reuse strategies, and speculative decoding for improved latency and throughput. Get insights into the roadmap for a fully multimodal, multi-tenant inference platform featuring concurrent model orchestration, tenant isolation, security-aware billing, and a unified model registry for intelligent model placement and lifecycle management. Whether you're optimizing a production stack or architecting a new system, gain practical knowledge for building future-ready platforms that serve large, dynamic, multimodal generative models at scale.
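To make the placement-group idea concrete, here is a minimal Ray sketch of the general pattern the talk describes: reserving co-located GPU bundles and pinning actor workers to them so a model's workers get GPU affinity. The InferenceWorker actor, its generate method, and the two-GPU cluster size are hypothetical stand-ins for illustration, not DigitalOcean's actual code.

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# Reserve two GPU+CPU bundles packed onto the same node, so the model's
# workers are co-located for predictable performance (assumes the cluster
# actually has 2 GPUs available).
pg = placement_group([{"GPU": 1, "CPU": 4}] * 2, strategy="STRICT_PACK")
ray.get(pg.ready())  # block until the reservation is granted

@ray.remote(num_gpus=1)
class InferenceWorker:
    """Hypothetical worker; a real one would load model weights onto its GPU."""

    def generate(self, prompt: str) -> str:
        return f"completion for: {prompt}"

# Pin each actor to a specific bundle inside the placement group.
workers = [
    InferenceWorker.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg, placement_group_bundle_index=i
        )
    ).remote()
    for i in range(2)
]

print(ray.get(workers[0].generate.remote("hello")))
```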
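On the vLLM side, a minimal offline-generation sketch is below, assuming the small open facebook/opt-125m checkpoint purely for illustration. Continuous batching and paged KV-cache management happen inside the engine rather than in user code; speculative decoding is likewise enabled through engine configuration, whose exact options vary across vLLM versions, so it is omitted here.

```python
from vllm import LLM, SamplingParams

# The engine batches concurrent requests continuously and manages the
# KV cache with PagedAttention; no manual batching code is needed here.
llm = LLM(model="facebook/opt-125m")  # small open model for illustration

params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
prompts = [
    "Explain KV-cache reuse in one sentence.",
    "Why does batching by token length help long-context serving?",
]

# generate() returns one RequestOutput per prompt, with the sampled
# tokens already assembled into completed text.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
```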
Syllabus
How DigitalOcean Builds Next-Gen Inference with Ray, vLLM & More | Ray Summit 2025
Taught by
Anyscale