Overview
Learn how to deploy and scale Large Language Model (LLM) serving infrastructure using the vLLM Production Stack in this 40-minute conference talk from the Linux Foundation. Discover the evolution of the vLLM serving engine from single-node deployments to a comprehensive full-stack inference system designed for enterprise-scale operations. Explore key architectural components including KV cache sharing for accelerated inference, prefix-aware routing that optimizes query distribution to appropriate vLLM instances, and robust observability features for monitoring and autoscaling. Master deployment strategies for Kubernetes clusters through simplified single-command operations, and understand how these optimizations work together to achieve high reliability, throughput, and low latency in production environments. Gain insights into best practices for LLM inference performance optimization through real-time demonstrations and practical examples from industry experts at the University of Chicago and IBM Research.
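The prefix-aware routing idea described above can be sketched in a few lines: requests whose prompts share a leading prefix are deterministically routed to the same vLLM instance, so that instance's KV cache entries for the shared prefix can be reused. This is an illustrative sketch, not the stack's actual router; the function and instance names are hypothetical.

```python
import hashlib

def route_by_prefix(prompt: str, instances: list[str], prefix_len: int = 64) -> str:
    """Pick a serving instance by hashing the prompt's leading characters.

    Requests sharing a prefix land on the same instance, letting it reuse
    the KV cache it already built for that prefix. (Hypothetical sketch;
    the real production-stack router is more sophisticated.)
    """
    prefix = prompt[:prefix_len]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(instances)
    return instances[index]

instances = ["vllm-0", "vllm-1", "vllm-2"]

# Two requests sharing a long system prompt route to the same instance,
# so the cached KV tensors for that prompt are reused:
shared = "SYSTEM: You are a helpful assistant. " * 3
a = route_by_prefix(shared + "Question 1", instances)
b = route_by_prefix(shared + "Question 2", instances)
assert a == b
```

A consistent-hashing variant of the same idea would also keep routing stable when instances are added or removed during autoscaling.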
Syllabus
Scalable and Efficient LLM Serving With the vLLM Production Stack - Junchen Jiang & Yue Zhu
Taught by
Linux Foundation