Overview
Learn how to accelerate vLLM inference on Huawei Ascend NPUs using Ray Compiled Graphs in this 11-minute conference talk from Ray Summit 2025. Discover how Huawei Canada engineers achieve over 50% performance gains compared to existing NPU-based solutions through a new extension to Ray Compiled Graphs, and explore how this work serves as both a production-grade optimization and a proof of concept for SPMD-mode support in the upcoming vLLM V1 integration with Ray.

Understand the central design element, the new NPU Store, inspired by Ray's GPU Store, which streamlines tensor movement and improves cross-device efficiency in heterogeneous pipelines. Examine three key contributions:

- the Multi-Accelerator Support Layer, a generic abstraction layer compatible with the GPU Store that enables NCCL-style peer-to-peer tensor transfers across accelerators;
- the SPMD-Mode NPU Backend, which leverages advanced operator fusion and optimized memory scheduling for high-performance inference on Huawei Ascend NPUs;
- the Optimized Cross-Device Tensor Transfer system, featuring a prototype NPU Store that maximizes throughput and minimizes latency for CPU-NPU tensor movement.

Gain insights into how this design simplifies future integration of other hardware backends, including TPUs, NPUs, and emerging accelerators, while unlocking substantial speedups for large-scale LLM inference in hybrid and post-training pipelines.
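For orientation before watching, the sketch below illustrates the Ray Compiled Graphs programming model the talk builds on: actors are wired into a dataflow graph once, compiled, and then executed repeatedly with low per-call overhead. The actor and method names here are illustrative assumptions, not code from the talk, and the NPU-specific pieces (NPU Store, Ascend backend) are internal designs the talk describes rather than public APIs shown below.

```python
# Minimal sketch of Ray Compiled Graphs; actor/method names are
# illustrative, not taken from the talk.
import ray
from ray.dag import InputNode


@ray.remote
class InferenceWorker:
    """Stands in for a vLLM model-runner actor. On Ascend hardware this
    actor would own an NPU device; here it just transforms its input."""

    def forward(self, batch):
        return [x * 2 for x in batch]


ray.init()
stage1 = InferenceWorker.remote()
stage2 = InferenceWorker.remote()

# Declare the dataflow once. Compiling it lets repeated executions skip
# per-call scheduling overhead -- the property the talk exploits for
# SPMD-mode vLLM inference.
with InputNode() as batch:
    dag = stage2.forward.bind(stage1.forward.bind(batch))

# On multi-accelerator deployments, recent Ray releases let graph edges
# carry a tensor-transport hint (e.g. with_tensor_transport("nccl")) so
# tensors move peer-to-peer between devices instead of through the CPU
# object store; the talk's Multi-Accelerator Support Layer and NPU Store
# extend this transfer path to Ascend NPUs.
compiled = dag.experimental_compile()

result = ray.get(compiled.execute([1, 2, 3]))
print(result)  # [4, 8, 12]
```

The key design point the talk leans on is that the graph is static: because the full dataflow is known at compile time, the runtime can pre-allocate communication channels between devices, which is what makes optimized CPU-NPU and NPU-NPU transfers possible.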
Syllabus
Boosting vLLM Inference on Huawei NPU with Ray Compiled Graphs — Huawei | Ray Summit 2025
Taught by
Anyscale