Overview
Learn how Amazon Web Services advances large-scale LLM inference through deep support for, and contributions to, vLLM, the leading open-source engine for high-throughput, low-latency model serving, in this 14-minute conference talk from Ray Summit 2025. Discover how vLLM serves as a foundational component of the Amazon Rufus shopping assistant, handling millions of customer requests on heterogeneous hardware that includes AWS Trainium and NVIDIA GPUs. Explore Amazon's cost-optimized, multi-node inference architecture, which routes each request to the most appropriate accelerator to cut serving costs while maintaining top-tier performance. Examine deployment best practices for running vLLM on AWS at scale, understand how Amazon builds multi-accelerator inference clusters from Trainium devices and GPUs, and review the open-source work streams and contributions Amazon has made to vLLM. Gain insight into Amazon's production-scale vLLM operations, learn to architect heterogeneous inference pipelines on AWS, and understand Amazon's initiatives to strengthen the vLLM ecosystem for AWS customers and the broader community.
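To make the routing idea concrete, here is a minimal sketch of a request router that chooses between two vLLM serving pools. It is not Amazon's actual router, whose internals the talk does not publish: the endpoint URLs, model name, and token threshold are illustrative assumptions. It relies only on real, documented behavior, namely that `vllm serve` exposes an OpenAI-compatible API that the standard `openai` Python client can call.

```python
# A minimal sketch (NOT Amazon's production router): route chat requests
# between two vLLM pools -- one assumed to run on AWS Trainium, one on
# NVIDIA GPUs -- using a simple prompt-length heuristic.
from openai import OpenAI

# Each pool runs `vllm serve <model>`, which exposes an OpenAI-compatible
# API. Both base URLs below are hypothetical placeholders.
TRAINIUM_POOL = OpenAI(base_url="http://trainium-pool.internal:8000/v1",
                       api_key="EMPTY")
GPU_POOL = OpenAI(base_url="http://gpu-pool.internal:8000/v1",
                  api_key="EMPTY")

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model name
LONG_PROMPT_TOKENS = 2048                   # hypothetical routing cutoff


def route(prompt: str):
    """Send short prompts to the cost-optimized Trainium pool and long
    prompts to the GPU pool (a heuristic chosen for illustration only)."""
    # Rough token estimate; a production router would use the model's
    # real tokenizer instead of a character-count approximation.
    approx_tokens = len(prompt) // 4
    client = TRAINIUM_POOL if approx_tokens < LONG_PROMPT_TOKENS else GPU_POOL
    return client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )


if __name__ == "__main__":
    reply = route("Recommend a waterproof hiking backpack under $100.")
    print(reply.choices[0].message.content)
```

A real heterogeneous router would weigh accelerator cost, queue depth, and per-pool latency rather than prompt length alone, but the structure (several independent vLLM deployments behind one routing layer) matches the architecture the talk describes.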
Syllabus
AWS + vLLM: Building the Future of Open, Fast LLM Serving | Ray Summit 2025
Taught by
Anyscale