High-Performance LLM Serving on Intel - vLLM for XPU, HPU and CPU

Learn how to deploy high-performance vLLM inference across Intel's hardware portfolio, including GPUs (XPU), Gaudi accelerators (HPU), and CPUs, in this 26-minute conference talk from Ray Summit 2025. Intel engineers Ding Ke and Chendi Xue present Intel's latest work on bringing top-tier vLLM performance to these hardware backends. Explore the status of vLLM enablement on Intel platforms, covering feature parity and performance with the new vLLM V1 architecture, including the KV connector, data parallelism, and multi-token prediction. Examine Intel-optimized support for models such as DeepSeek and GPT-OSS, and see how the supported model ecosystem continues to expand. Gain insight into Intel's strategy for migrating vLLM capabilities from CUDA to non-CUDA environments while minimizing developer friction by aligning device APIs with torch.cuda behavior. Review the open-source kernels (CUTLASS, Triton) that have been upstreamed into vLLM, BitsAndBytes, and other libraries. Finally, preview Intel's roadmap of upcoming optimizations and capabilities aimed at improving performance and developer experience across all Intel hardware platforms. You will come away with practical knowledge for deploying performant LLM inference with vLLM on Intel platforms, lessons from real-world migration challenges, and a view of Intel's vision for a unified, developer-friendly AI ecosystem.