
How Red Hat Scales Large-Scale Serving with vLLM

Anyscale via YouTube

Overview

Learn how vLLM is evolving to support advanced parallelism techniques for serving frontier-scale models like DeepSeek-R1 across large GPU clusters in this 37-minute conference talk from Ray Summit 2025, presented by Robert Shaw from Red Hat.

Discover how prefill/decode (p/d) disaggregation enables more efficient resource utilization by separating the heavy, bursty prefill phase from the latency-sensitive decode phase, allowing clusters to scale each stage independently for maximum throughput and cost efficiency.

Explore the implementation of wide expert parallelism (EP) in vLLM and how it lets MoE-based models spread large numbers of experts across multi-node environments, with detailed coverage of the orchestration, scheduling, and memory-management challenges that arise at cluster scale.

Examine the design decisions that make these deployments practical, and understand the tradeoffs and system-level considerations when serving massive models: GPU topology, communication overhead, batching behavior, and cluster elasticity. Gain deep insight into how vLLM implements next-generation parallelism strategies and what it takes to run cluster-scale MoE and non-MoE LLMs efficiently in real production environments.
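To make the p/d disaggregation idea concrete, here is a minimal toy sketch (not vLLM's actual API; all class and field names are illustrative). Prefill workers absorb the bursty, compute-heavy prompt pass and hand off a KV cache; decode workers then generate tokens for admitted requests. Because the two pools are separate, each stage can be sized independently.

```python
# Toy sketch of prefill/decode (p/d) disaggregation. Names are
# illustrative, not vLLM's real interfaces: the point is that the
# prefill stage runs once per request and hands its KV cache to a
# separate, independently sized decode stage.
from dataclasses import dataclass
from collections import deque


@dataclass
class Request:
    prompt_tokens: int      # prefill cost grows with prompt length
    kv_cache: object = None # filled in by the prefill stage, then handed off


class PrefillPool:
    """Throughput-oriented stage: absorbs bursty prompt traffic in a queue."""

    def __init__(self, workers: int):
        self.workers = workers
        self.queue = deque()

    def submit(self, req: Request):
        self.queue.append(req)

    def step(self):
        """Run up to `workers` prefills this step; return requests ready to decode."""
        ready = []
        for _ in range(min(self.workers, len(self.queue))):
            req = self.queue.popleft()
            # Stand-in for computing and transferring the real KV cache.
            req.kv_cache = f"kv[{req.prompt_tokens} tokens]"
            ready.append(req)
        return ready


class DecodePool:
    """Latency-sensitive stage: holds in-flight requests generating tokens."""

    def __init__(self, workers: int):
        self.workers = workers
        self.active = []

    def admit(self, reqs):
        self.active.extend(reqs)


# The two pools scale independently: a burst of long prompts grows the
# prefill queue without stealing compute from in-flight decodes.
prefill = PrefillPool(workers=2)
decode = DecodePool(workers=4)
for n in (512, 2048, 128):
    prefill.submit(Request(prompt_tokens=n))
decode.admit(prefill.step())  # two prefill workers move two requests to decode
```

In a real disaggregated deployment the handoff is a KV-cache transfer between GPU fleets rather than an in-process list append, which is where the topology and communication-overhead tradeoffs discussed in the talk come in.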

Syllabus

How Red Hat Scales Large-Scale Serving with vLLM | Ray Summit 2025

Taught by

Anyscale

