Overview
Learn how to deploy and scale Large Language Model (LLM) serving infrastructure using the vLLM Production Stack in this 40-minute conference talk from the Linux Foundation. Discover the evolution of the vLLM serving engine from single-node deployments to a comprehensive full-stack inference system designed for enterprise-scale operations. Explore key architectural components including KV cache sharing for accelerated inference, prefix-aware routing that optimizes query distribution to appropriate vLLM instances, and robust observability features for monitoring and autoscaling. Master deployment strategies for Kubernetes clusters through simplified single-command operations, and understand how these optimizations work together to achieve high reliability, throughput, and low latency in production environments. Gain insights into best practices for LLM inference performance optimization through real-time demonstrations and practical examples from industry experts at the University of Chicago and IBM Research.
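The prefix-aware routing idea described above can be sketched in a few lines: requests whose prompts share a leading prefix are deterministically routed to the same vLLM instance, so that instance's KV cache entries for the shared prefix can be reused. This is an illustrative sketch, not the stack's actual router; the function and instance names are hypothetical.

```python
import hashlib

def route_by_prefix(prompt: str, instances: list[str], prefix_len: int = 64) -> str:
    """Pick a serving instance by hashing the prompt's leading characters.

    Requests sharing a prefix land on the same instance, letting it reuse
    the KV cache it already built for that prefix. (Hypothetical sketch;
    the real production-stack router is more sophisticated.)
    """
    prefix = prompt[:prefix_len]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(instances)
    return instances[index]

instances = ["vllm-0", "vllm-1", "vllm-2"]

# Two requests sharing a long system prompt route to the same instance,
# so the cached KV tensors for that prompt are reused:
shared = "SYSTEM: You are a helpful assistant. " * 3
a = route_by_prefix(shared + "Question 1", instances)
b = route_by_prefix(shared + "Question 2", instances)
assert a == b
```

A consistent-hashing variant of the same idea would also keep routing stable when instances are added or removed during autoscaling.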
Syllabus
Scalable and Efficient LLM Serving With the vLLM Production Stack - Junchen Jiang & Yue Zhu
Taught by
Linux Foundation