

Ray + vLLM - Efficient Multi-Node Orchestration for Sparse MoE Model Serving

Anyscale via YouTube

Overview

Learn how to efficiently serve Mixture-of-Experts (MoE) models at scale through this 31-minute conference talk from Ray Summit 2025. Discover why MoE models offer cost-effective scaling for large language models through selective expert activation, while understanding the complex orchestration challenges these architectures introduce in production environments. Explore how Multi-head Latent Attention (MLA) reduces KV-cache footprint through low-rank compression, yet creates new challenges when combined with high degrees of expert parallelism. Understand the tradeoffs between tensor parallelism and data-parallel attention for MoE inference, particularly how KV-cache duplication affects performance decisions. Master the combination of data parallelism with expert parallelism to unlock unique optimizations for prefill versus decode phases of inference, making prefill/decode disaggregation a powerful strategy for maximizing utilization across heterogeneous resources. See how Ray Serve and vLLM work together to balance flexibility, high throughput, and operational simplicity for production-grade MoE serving at scale. Gain practical knowledge for designing and operating distributed MoE inference pipelines, optimizing KV-cache usage, and leveraging Ray Serve to coordinate complex parallelism strategies across large GPU clusters.
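To make the Ray Serve + vLLM pairing described above more concrete, here is a minimal sketch (not from the talk) of a Ray Serve deployment wrapping a vLLM engine for MoE serving. The model name is a placeholder, the `enable_expert_parallel` argument follows recent vLLM releases and may differ by version, and multi-node GPU placement (placement groups, prefill/decode disaggregation) is omitted for brevity.

```python
# Hypothetical sketch: serving a sparse MoE model with Ray Serve + vLLM.
from ray import serve
from vllm import LLM, SamplingParams


@serve.deployment(num_replicas=1)
class MoEDeployment:
    def __init__(self):
        # One vLLM engine per replica. tensor_parallel_size shards the dense
        # layers; enable_expert_parallel (assumed flag, check your vLLM
        # version) spreads MoE experts across the same GPUs instead of
        # replicating them.
        self.llm = LLM(
            model="deepseek-ai/DeepSeek-V2-Lite",  # placeholder MoE checkpoint
            tensor_parallel_size=8,
            enable_expert_parallel=True,
        )

    async def __call__(self, request) -> str:
        # Expect a JSON body like {"prompt": "..."} and return generated text.
        prompt = (await request.json())["prompt"]
        outputs = self.llm.generate(prompt, SamplingParams(max_tokens=128))
        return outputs[0].outputs[0].text


app = MoEDeployment.bind()
# serve.run(app)  # deploy onto a running Ray cluster
```

In a production setup along the lines the talk describes, separate deployments would typically handle prefill and decode, with Ray Serve routing requests between them and scaling each independently.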

Syllabus

Ray + vLLM - Efficient Multi-Node Orchestration for Sparse MoE Model Serving | Ray Summit 2025

Taught by

Anyscale

