

Fast Inference, Furious Scaling - Leveraging vLLM With KServe

Linux Foundation via YouTube

Overview

Explore high-performance large language model deployment through the integration of two open-source projects in this 33-minute conference talk from the Linux Foundation. Learn about vLLM, a library for LLM inference and serving that delivers high throughput and memory efficiency through techniques such as PagedAttention, continuous batching, and optimized CUDA kernels. Discover KServe, a Kubernetes-based model-serving platform that provides autoscaling, monitoring, and model versioning for production AI workloads. Watch a practical demonstration of how the two projects combine into a complete solution for serving LLMs in production, and understand how pairing vLLM's inference optimizations with KServe's scaling features lets organizations achieve low-latency inference that scales smoothly across cloud platforms, meeting enterprise-grade LLM serving requirements.
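The talk's two building blocks can be sketched in a few lines of Python. First, vLLM's offline inference API: the snippet below is a minimal sketch assuming the open-source vllm package, with facebook/opt-125m used purely as a placeholder model, not one taken from the talk.

```python
from vllm import LLM, SamplingParams

# vLLM stores the KV cache in fixed-size blocks (PagedAttention) and
# schedules requests with continuous batching, so one engine instance
# can serve many prompts at high throughput.
llm = LLM(model="facebook/opt-125m")  # placeholder model name

sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "Summarize PagedAttention in one sentence.",
    "What does continuous batching do for throughput?",
]

# generate() pushes every prompt through the batching engine in one call.
for output in llm.generate(prompts, sampling):
    print(output.prompt, "->", output.outputs[0].text)
```

Second, the KServe side: an InferenceService wraps a model server in a Kubernetes resource that KServe autoscales, monitors, and versions. The sketch below uses KServe's Python SDK and assumes its Hugging Face serving runtime, which can delegate generation to a vLLM backend; the service name, namespace, replica bounds, and storage URI are illustrative assumptions, not details from the talk.

```python
from kubernetes import client
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1ModelFormat,
    V1beta1ModelSpec,
    V1beta1PredictorSpec,
    constants,
)

isvc = V1beta1InferenceService(
    api_version=constants.KSERVE_V1BETA1,
    kind=constants.KSERVE_KIND,
    # Name and namespace are hypothetical placeholders.
    metadata=client.V1ObjectMeta(name="llm-demo", namespace="default"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            min_replicas=1,  # scale down when traffic is low
            max_replicas=4,  # scale out under load
            model=V1beta1ModelSpec(
                model_format=V1beta1ModelFormat(name="huggingface"),
                storage_uri="hf://facebook/opt-125m",  # placeholder model
            ),
        )
    ),
)

# Requires a reachable cluster with KServe installed and kubeconfig set up.
KServeClient().create(isvc)
```

Read together, the first snippet is roughly what runs inside each serving pod, while the second tells Kubernetes how many of those pods to run and how to scale and route traffic to them.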

Syllabus

Fast Inference, Furious Scaling: Leveraging vLLM With KServe - Rafael Vasquez, IBM

Taught by

Linux Foundation

