Overview
Learn how to integrate vLLM and KServe for high-performance, scalable large language model (LLM) deployment in production through this 23-minute conference talk from the Linux Foundation. Discover vLLM, a library for LLM inference that achieves high throughput and memory efficiency through techniques such as PagedAttention, continuous batching, and optimized CUDA kernels. Explore KServe, a Kubernetes-based model-serving platform that provides autoscaling, monitoring, and model versioning for models in production. Watch a practical demonstration of how the two open-source projects integrate, combining vLLM's inference optimizations with KServe's deployment and scaling features. Understand how organizations can use this integration to serve large language models in production with low-latency inference and elastic scaling across cloud platforms.
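To make the vLLM side concrete, here is a minimal sketch of standalone inference with the vLLM Python library, following its quickstart pattern rather than anything shown in the talk; the model name is just an example, and PagedAttention and continuous batching are applied automatically inside the engine.

# Minimal vLLM inference sketch (quickstart-style, not from the talk).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # example model; any HF causal LM works
params = SamplingParams(temperature=0.8, max_tokens=64)

# Submitting a batch of prompts lets the engine schedule them with
# continuous batching; KV-cache memory is managed via PagedAttention.
outputs = llm.generate(["What is KServe?", "What is vLLM?"], params)
for out in outputs:
    print(out.outputs[0].text.strip())

And on the KServe side, a hedged sketch of what calling a vLLM-backed InferenceService might look like from a client. The host, model name, and route below are hypothetical: vLLM serves an OpenAI-compatible completions API, and KServe's Hugging Face/vLLM runtime exposes OpenAI-style endpoints, but the exact URL and path depend on your cluster and KServe version.

# Hypothetical client call to a vLLM model deployed behind KServe.
import requests

# Placeholder InferenceService URL, e.g. as reported by
# `kubectl get inferenceservice` in your cluster.
BASE_URL = "http://llama-demo.default.example.com"

payload = {
    "model": "llama-demo",  # hypothetical served-model name
    "prompt": "Explain PagedAttention in one sentence.",
    "max_tokens": 64,
    "temperature": 0.2,
}

# OpenAI-style completions route; the path prefix may vary by KServe version.
resp = requests.post(f"{BASE_URL}/openai/v1/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])

Because the endpoint is OpenAI-compatible, existing OpenAI client code can typically be pointed at the KServe URL with little change, which is part of the appeal of the integration described in the talk.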
Syllabus
Fast Inference, Furious Scaling: Leveraging vLLM with KServe - Rafael Vasquez, IBM
Taught by
Linux Foundation