
Fast Inference, Furious Scaling - Leveraging VLLM With KServe

Linux Foundation via YouTube

Overview

Learn how to integrate vLLM and KServe for high-performance, scalable large language model deployment in production environments through this 23-minute conference talk from the Linux Foundation. Discover vLLM, a specialized library for LLM inference that delivers exceptional throughput and efficiency using advanced techniques like PagedAttention, continuous batching, and optimized CUDA kernels. Explore KServe, a Kubernetes-based platform that provides robust model deployment capabilities including autoscaling, monitoring, and model versioning for AI models in production. Watch a practical demonstration showing how these two open-source projects integrate to create a powerful solution that combines vLLM's inference optimizations with KServe's scalability features. Understand how organizations can leverage this integration to deploy large language models effectively in production, achieving fast, low-latency inference while maintaining seamless scaling capabilities across cloud platforms.
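The integration described above centers on deploying a model through KServe so that vLLM handles the inference path. As a rough illustration of what that looks like in practice, the manifest below is a minimal sketch based on KServe's Hugging Face serving runtime, which uses vLLM as its default backend for supported models; the service name, model ID, and resource limits are illustrative, not taken from the talk.

```yaml
# Hypothetical KServe InferenceService using the Hugging Face runtime
# (vLLM backend). Names and resource values are illustrative.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-vllm
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface        # runtime that delegates to vLLM
      args:
        - --model_name=llama3
        - --model_id=meta-llama/Meta-Llama-3-8B-Instruct
      resources:
        limits:
          nvidia.com/gpu: "1"    # vLLM requires GPU for this model size
```

Once such a service is ready, KServe's autoscaling and versioning apply to it like any other InferenceService, while requests are served through vLLM's PagedAttention and continuous-batching engine; the runtime exposes an OpenAI-compatible completions endpoint for clients.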

Syllabus

Fast Inference, Furious Scaling: Leveraging VLLM With KServe - Rafael Vasquez, IBM

Taught by

Linux Foundation

