Overview
Explore how to achieve high-performance large language model deployment by integrating two powerful open-source projects in this 33-minute conference talk from the Linux Foundation. Learn about vLLM, a specialized library for LLM inference and serving that delivers exceptional throughput and efficiency through advanced techniques like PagedAttention, continuous batching, and optimized CUDA kernels. Discover KServe, a Kubernetes-based platform that provides scalable model deployment capabilities with robust features including autoscaling, monitoring, and model versioning for production AI environments. Watch a practical demonstration showing how these technologies integrate to create a comprehensive solution for deploying LLMs in production environments. Understand how combining vLLM's inference optimizations with KServe's scalability features enables organizations to achieve fast, low-latency inference while ensuring seamless scaling across cloud platforms, making it ideal for enterprise-grade LLM serving requirements.
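The integration described above — vLLM as the inference engine, KServe as the deployment layer — can be sketched as a KServe `InferenceService` whose predictor runs vLLM's OpenAI-compatible server in a custom container. This is a minimal illustrative manifest, not the configuration shown in the talk: the model name, replica bounds, and resource figures are placeholder assumptions.

```yaml
# Hypothetical sketch: vLLM's OpenAI-compatible server deployed as a
# custom KServe predictor. Model, image tag, replica counts, and GPU
# resources are illustrative assumptions, not taken from the talk.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-vllm
spec:
  predictor:
    minReplicas: 1        # KServe autoscaling bounds
    maxReplicas: 4
    containers:
      - name: kserve-container
        image: vllm/vllm-openai:latest   # official vLLM serving image
        args:
          - --model
          - facebook/opt-125m            # placeholder model
        ports:
          - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: "1"          # vLLM requires a GPU
```

Once applied with `kubectl apply -f`, KServe exposes the predictor behind its ingress and scales replicas between the configured bounds, while requests reach vLLM's OpenAI-compatible endpoints (e.g. `/v1/completions`) on port 8000.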
Syllabus
Fast Inference, Furious Scaling: Leveraging VLLM With KServe - Rafael Vasquez, IBM
Taught by
Linux Foundation