
Linux Foundation

Scalable and Efficient LLM Serving With the vLLM Production Stack

Linux Foundation via YouTube

Overview

Learn how to deploy and scale Large Language Model (LLM) serving infrastructure using the vLLM Production Stack in this 40-minute conference talk from the Linux Foundation. Discover the evolution of the vLLM serving engine from single-node deployments to a comprehensive full-stack inference system designed for enterprise-scale operations. Explore key architectural components including KV cache sharing for accelerated inference, prefix-aware routing that optimizes query distribution to appropriate vLLM instances, and robust observability features for monitoring and autoscaling. Master deployment strategies for Kubernetes clusters through simplified single-command operations, and understand how these optimizations work together to achieve high reliability, throughput, and low latency in production environments. Gain insights into best practices for LLM inference performance optimization through real-time demonstrations and practical examples from industry experts at the University of Chicago and IBM Research.
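To make the prefix-aware routing idea concrete, here is a minimal Python sketch of the underlying technique: hashing a fixed-length prompt prefix so that requests sharing a prefix land on the same vLLM instance, which can then reuse its KV cache for that prefix instead of recomputing it. The instance URLs, prefix length, and function names are illustrative assumptions, not the production stack's actual router API.

    # Minimal sketch of prefix-aware routing (illustrative; not the
    # vLLM Production Stack's actual router implementation).
    import hashlib

    # Hypothetical pool of vLLM serving instances behind the router.
    VLLM_INSTANCES = [
        "http://vllm-0:8000",
        "http://vllm-1:8000",
        "http://vllm-2:8000",
    ]

    # Assumed prefix length; a real router may match token-level prefixes.
    PREFIX_CHARS = 256

    def route(prompt: str) -> str:
        """Pick an instance by hashing the prompt's leading characters.

        Requests that share a prefix (e.g. the same system prompt) hash
        to the same instance, so its cached KV entries for that prefix
        are reused rather than rebuilt on another node.
        """
        prefix = prompt[:PREFIX_CHARS]
        digest = hashlib.sha256(prefix.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(VLLM_INSTANCES)
        return VLLM_INSTANCES[index]

    # Two requests with the same system prompt go to the same instance.
    system = "You are a helpful assistant. " * 8
    print(route(system + "Summarize this document."))
    print(route(system + "Translate this sentence."))

Consistent-hash-style routing like this is one simple way to keep cache locality; the talk describes how the production stack combines such routing with KV cache sharing and observability-driven autoscaling.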

Syllabus

Scalable and Efficient LLM Serving With the vLLM Production Stack - Junchen Jiang & Yue Zhu

Taught by

Linux Foundation

