
How Red Hat Scales Large-Scale Serving with vLLM

Anyscale via YouTube

Overview

Learn how vLLM is evolving to support advanced parallelism techniques for serving frontier-scale models like DeepSeek-R1 across large GPU clusters in this 37-minute conference talk from Ray Summit 2025, presented by Robert Shaw from Red Hat.

Discover how prefill/decode (p/d) disaggregation enables more efficient resource utilization by separating the heavy, bursty prefill phase from the latency-sensitive decode phase, allowing clusters to scale each stage independently for maximum throughput and cost efficiency.

Explore the implementation of wide expert parallelism (EP) in vLLM and how it lets MoE-based models spread large numbers of experts across multi-node environments, with detailed coverage of the orchestration, scheduling, and memory-management challenges that arise at cluster scale.

Examine the design decisions that make these deployments practical, and understand the tradeoffs and system-level considerations when serving massive models: GPU topology, communication overhead, batching behavior, and cluster elasticity. Gain deep insight into how vLLM implements next-generation parallelism strategies and what it takes to run cluster-scale MoE and non-MoE LLMs efficiently in real production environments.
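To make the p/d disaggregation idea concrete, here is a minimal toy sketch (not vLLM's actual API; all class and field names are illustrative). Prefill workers absorb the bursty, compute-heavy prompt pass and hand off a KV cache; decode workers then generate tokens for admitted requests. Because the two pools are separate, each stage can be sized independently.

```python
# Toy sketch of prefill/decode (p/d) disaggregation. Names are
# illustrative, not vLLM's real interfaces: the point is that the
# prefill stage runs once per request and hands its KV cache to a
# separate, independently sized decode stage.
from dataclasses import dataclass
from collections import deque


@dataclass
class Request:
    prompt_tokens: int      # prefill cost grows with prompt length
    kv_cache: object = None # filled in by the prefill stage, then handed off


class PrefillPool:
    """Throughput-oriented stage: absorbs bursty prompt traffic in a queue."""

    def __init__(self, workers: int):
        self.workers = workers
        self.queue = deque()

    def submit(self, req: Request):
        self.queue.append(req)

    def step(self):
        """Run up to `workers` prefills this step; return requests ready to decode."""
        ready = []
        for _ in range(min(self.workers, len(self.queue))):
            req = self.queue.popleft()
            # Stand-in for computing and transferring the real KV cache.
            req.kv_cache = f"kv[{req.prompt_tokens} tokens]"
            ready.append(req)
        return ready


class DecodePool:
    """Latency-sensitive stage: holds in-flight requests generating tokens."""

    def __init__(self, workers: int):
        self.workers = workers
        self.active = []

    def admit(self, reqs):
        self.active.extend(reqs)


# The two pools scale independently: a burst of long prompts grows the
# prefill queue without stealing compute from in-flight decodes.
prefill = PrefillPool(workers=2)
decode = DecodePool(workers=4)
for n in (512, 2048, 128):
    prefill.submit(Request(prompt_tokens=n))
decode.admit(prefill.step())  # two prefill workers move two requests to decode
```

In a real disaggregated deployment the handoff is a KV-cache transfer between GPU fleets rather than an in-process list append, which is where the topology and communication-overhead tradeoffs discussed in the talk come in.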

Syllabus

How Red Hat Scales Large-Scale Serving with vLLM | Ray Summit 2025

Taught by

Anyscale

