Overview
This 33-minute conference talk from the Linux Foundation shows how to deploy large language models at high performance by combining two open-source projects. vLLM is a library for LLM inference and serving that raises throughput and efficiency through PagedAttention, continuous batching, and optimized CUDA kernels. KServe is a Kubernetes-based model serving platform that provides autoscaling, monitoring, and model versioning for production AI environments. A practical demonstration shows how the two integrate: vLLM supplies the inference optimizations and KServe supplies the scaling, giving fast, low-latency inference that scales across cloud platforms for enterprise-grade LLM serving.
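To give a feel for why continuous batching helps throughput, here is a minimal toy sketch in Python. It is not vLLM's scheduler and uses none of its API. The function names and the "one token per request per step" model are assumptions made purely for illustration. It counts decode steps for a queue of requests under static batching (a batch runs until its longest request finishes) and under continuous batching (a freed slot is refilled immediately).

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes.

    `lengths` holds the number of tokens each request still has to generate.
    (Hypothetical helper for illustration only.)
    """
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled right away.

    (Hypothetical helper for illustration only.)
    """
    waiting = deque(lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots before the next step.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        # One decode step yields one token per active request;
        # requests with nothing left to generate leave the batch.
        active = [r - 1 for r in active if r - 1 > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    lengths = [8, 2, 2, 2]  # one long request followed by three short ones
    print("static:", static_batching_steps(lengths, batch_size=2))
    print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

In this toy model the short requests no longer wait for the long one to finish, so the same work completes in fewer steps. Real serving systems also have to manage memory for each in-flight request, which this sketch ignores.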
Syllabus
Fast Inference, Furious Scaling: Leveraging VLLM With KServe - Rafael Vasquez, IBM
Taught by
Linux Foundation