Overview
Explore how to achieve high-performance large language model deployment by integrating two powerful open-source projects in this 33-minute conference talk from the Linux Foundation. Learn about vLLM, a specialized library for LLM inference and serving that delivers exceptional throughput and efficiency through advanced techniques like PagedAttention, continuous batching, and optimized CUDA kernels. Discover KServe, a Kubernetes-based platform that provides scalable model deployment capabilities with robust features including autoscaling, monitoring, and model versioning for production AI environments. Watch a practical demonstration showing how these technologies integrate to create a comprehensive solution for deploying LLMs in production environments. Understand how combining vLLM's inference optimizations with KServe's scalability features enables organizations to achieve fast, low-latency inference while ensuring seamless scaling across cloud platforms, making it ideal for enterprise-grade LLM serving requirements.
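To give a feel for the PagedAttention idea mentioned above, here is a toy Python sketch of block-based KV-cache management: each sequence maps logical token positions to fixed-size physical cache blocks drawn from a shared pool, so memory waste is bounded by at most one partially filled block per sequence. All class and variable names here are hypothetical for illustration; vLLM's real implementation manages GPU block tables with custom CUDA kernels and is far more involved.

```python
# Toy sketch of PagedAttention-style KV-cache block allocation.
# Illustrative only; not the vLLM API.

BLOCK_SIZE = 4  # tokens stored per KV-cache block (assumed for the sketch)

class BlockAllocator:
    """Hands out fixed-size cache blocks from a shared pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)

class Sequence:
    """Tracks one request's logical-to-physical block mapping."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # physical block ids, in logical order
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new block only when the previous one is full,
        # so fragmentation is bounded by one block per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

# Two concurrent requests share the same physical pool, which is what
# lets continuous batching pack many sequences into one GPU's memory.
allocator = BlockAllocator(num_blocks=8)
seq_a, seq_b = Sequence(allocator), Sequence(allocator)
for _ in range(6):
    seq_a.append_token()   # 6 tokens -> occupies 2 blocks
for _ in range(3):
    seq_b.append_token()   # 3 tokens -> occupies 1 block
print(len(seq_a.block_table), len(seq_b.block_table), len(allocator.free))
```

When a sequence finishes, its blocks return to the pool via `release`, so new requests can start immediately instead of waiting for a whole static batch to drain.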
Syllabus
Fast Inference, Furious Scaling: Leveraging VLLM With KServe - Rafael Vasquez, IBM
Taught by
Linux Foundation