Overview
Learn how to advance large-scale Expert Parallelism (EP) for efficient, scalable inference of Mixture-of-Experts (MoE) models in this 34-minute conference talk from Ray Summit 2025. Discover the core constraint in MoE serving: EP requires massive, monolithic deployment units — for example, DeepSeek R1/V3 needs 144 GPUs for a single serving instance — so traditional inter-instance autoscaling systems struggle with real-world workload fluctuations. Explore the intra-instance Elastic EP technique, which enables fine-grained, low-latency autoscaling within a single EP instance, allowing vLLM to match GPU resources precisely to workload demand without downtime, fragmentation, or inefficient overprovisioning. Understand how Ray orchestrates Elastic EP scaling across distributed clusters, providing coordination, lifecycle management, and the flexibility to adjust expert-parallel resources dynamically while maintaining fast, reliable inference. Gain practical strategies for serving large MoE models at scale, optimizing KV-cache and expert utilization, and using Ray to coordinate sophisticated intra-instance parallelism patterns. Presented by Yongji Wu from UC Berkeley and Rui Qiao from Anyscale.
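To make the scaling idea above concrete, here is a minimal, hypothetical sketch of the kind of policy intra-instance elastic autoscaling implies: adjusting the GPU count of one EP instance in fine-grained steps based on observed load, rather than adding or removing whole 144-GPU instances. This is an illustration only, not vLLM's or Ray's actual API; the function name, thresholds, and step size are all assumptions.

```python
def target_ep_gpus(current_gpus, utilization, min_gpus=8, max_gpus=144,
                   low=0.4, high=0.8, step=8):
    """Hypothetical scaling policy (not vLLM's real interface).

    Grow when utilization exceeds `high`, shrink when it falls below `low`.
    `step` models fine-grained granularity (e.g. one 8-GPU node at a time)
    instead of duplicating a monolithic multi-node serving instance.
    """
    if utilization > high and current_gpus < max_gpus:
        return min(max_gpus, current_gpus + step)  # scale up one step
    if utilization < low and current_gpus > min_gpus:
        return max(min_gpus, current_gpus - step)  # scale down one step
    return current_gpus  # load is in the comfortable band; hold steady
```

In a real deployment, the talk describes Ray handling the orchestration side of such decisions: coordinating worker lifecycle and redistributing experts as the instance resizes.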
Syllabus
Elastic Expert Parallelism for vLLM | Ray Summit 2025
Taught by
Anyscale