Overview
Learn how to efficiently serve Mixture-of-Experts (MoE) models at scale in this 31-minute conference talk from Ray Summit 2025. Discover why MoE models offer cost-effective scaling for large language models through selective expert activation, and understand the complex orchestration challenges these architectures introduce in production environments. Explore how Multi-head Latent Attention (MLA) reduces the KV-cache footprint through low-rank compression, yet creates new challenges when combined with high degrees of expert parallelism.

Understand the tradeoffs between tensor parallelism and data-parallel attention for MoE inference, particularly how KV-cache duplication affects performance decisions. See how combining data parallelism with expert parallelism unlocks distinct optimizations for the prefill and decode phases of inference, making prefill/decode disaggregation a powerful strategy for maximizing utilization across heterogeneous resources.

Finally, see how Ray Serve and vLLM work together to balance flexibility, high throughput, and operational simplicity for production-grade MoE serving at scale. Gain practical knowledge for designing and operating distributed MoE inference pipelines, optimizing KV-cache usage, and leveraging Ray Serve to coordinate complex parallelism strategies across large GPU clusters.
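To make "selective expert activation" concrete, here is a minimal pure-Python sketch of top-k MoE routing: a router scores every expert for a token, but only the k highest-scoring experts actually run, with their gate weights softmax-normalized over just those k. All names, the toy experts, and the top-2 convention are illustrative assumptions, not details from the talk.

```python
import math

def top_k_route(router_logits, k=2):
    """Select the top-k experts for one token and softmax-normalize
    the gate values over just those k experts (a common MoE convention)."""
    topk = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:k]
    exps = [math.exp(router_logits[i]) for i in topk]
    total = sum(exps)
    return [(expert_id, w / total) for expert_id, w in zip(topk, exps)]

def moe_forward(x, experts, router_logits, k=2):
    """Run only the routed experts and sum their weighted outputs;
    the other len(experts) - k experts stay idle for this token."""
    return sum(weight * experts[eid](x)
               for eid, weight in top_k_route(router_logits, k))

# Toy example: 4 "experts", each a scalar function; only 2 run per token.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x ** 2, lambda x: -x]
logits = [0.1, 2.0, 1.5, -1.0]   # hypothetical router scores for one token
print(top_k_route(logits, k=2))  # experts 1 and 2 receive this token
print(moe_forward(3.0, experts, logits))
```

This is exactly why MoE scales cost-effectively: total parameter count grows with the number of experts, but per-token compute grows only with k. It is also the source of the orchestration challenge the talk addresses, since routed tokens must reach whichever GPUs host their experts.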
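The KV-cache duplication tradeoff can be illustrated with back-of-the-envelope arithmetic. Under a simplified model, MLA stores one compressed latent vector per token per layer; that latent cannot be split across attention heads, so every tensor-parallel rank ends up holding a full copy, whereas data-parallel attention partitions requests (and thus cache) across ranks. All functions, the duplication model, and the "roughly DeepSeek-V3-like" numbers below are assumptions for illustration, not figures from the talk.

```python
def kv_bytes_per_token(latent_dim, layers, dtype_bytes=2):
    """Bytes of MLA latent KV cache per token: one compressed latent
    vector per layer instead of per-head K and V tensors."""
    return latent_dim * layers * dtype_bytes

def per_gpu_kv_bytes(tokens_in_flight, per_token_bytes, gpus, mode):
    """Per-GPU KV footprint under two attention layouts (simplified):
    'tp' -> the latent cache is replicated on every tensor-parallel rank;
    'dp' -> data-parallel attention partitions requests across ranks."""
    if mode == "tp":
        return tokens_in_flight * per_token_bytes          # full copy per rank
    if mode == "dp":
        return tokens_in_flight * per_token_bytes // gpus  # partitioned
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical, roughly DeepSeek-V3-like shape: 512-d latent, 61 layers, fp16.
per_tok = kv_bytes_per_token(latent_dim=512, layers=61)
print(per_gpu_kv_bytes(100_000, per_tok, gpus=8, mode="tp") / 1e9, "GB/GPU (TP)")
print(per_gpu_kv_bytes(100_000, per_tok, gpus=8, mode="dp") / 1e9, "GB/GPU (DP)")
```

Under these assumptions, data-parallel attention cuts per-GPU cache by the rank count, which is why pairing it with expert parallelism (rather than sharding attention with tensor parallelism) is an attractive layout for MLA-based MoE models.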
Syllabus
Ray + vLLM: Efficient Multi-Node Orchestration for Sparse MoE Model Serving | Ray Summit 2025
Taught by
Anyscale