Overview
Learn how to integrate vLLM and KServe for high-performance, scalable large language model (LLM) deployment in production through this 23-minute conference talk from the Linux Foundation. Discover vLLM, a library for LLM inference that achieves high throughput and memory efficiency through techniques such as PagedAttention, continuous batching, and optimized CUDA kernels. Explore KServe, a Kubernetes-based model-serving platform that provides autoscaling, monitoring, and model versioning for models in production. Watch a practical demonstration of how the two open-source projects integrate, combining vLLM's inference optimizations with KServe's deployment and scaling features. Understand how organizations can use this integration to serve large language models in production with low-latency inference and elastic scaling across cloud platforms.
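To make the vLLM side concrete, here is a minimal sketch of standalone inference with the vLLM Python library, following its quickstart pattern rather than anything shown in the talk; the model name is just an example, and PagedAttention and continuous batching are applied automatically inside the engine.

# Minimal vLLM inference sketch (quickstart-style, not from the talk).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # example model; any HF causal LM works
params = SamplingParams(temperature=0.8, max_tokens=64)

# Submitting a batch of prompts lets the engine schedule them with
# continuous batching; KV-cache memory is managed via PagedAttention.
outputs = llm.generate(["What is KServe?", "What is vLLM?"], params)
for out in outputs:
    print(out.outputs[0].text.strip())

And on the KServe side, a hedged sketch of what calling a vLLM-backed InferenceService might look like from a client. The host, model name, and route below are hypothetical: vLLM serves an OpenAI-compatible completions API, and KServe's Hugging Face/vLLM runtime exposes OpenAI-style endpoints, but the exact URL and path depend on your cluster and KServe version.

# Hypothetical client call to a vLLM model deployed behind KServe.
import requests

# Placeholder InferenceService URL, e.g. as reported by
# `kubectl get inferenceservice` in your cluster.
BASE_URL = "http://llama-demo.default.example.com"

payload = {
    "model": "llama-demo",  # hypothetical served-model name
    "prompt": "Explain PagedAttention in one sentence.",
    "max_tokens": 64,
    "temperature": 0.2,
}

# OpenAI-style completions route; the path prefix may vary by KServe version.
resp = requests.post(f"{BASE_URL}/openai/v1/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])

Because the endpoint is OpenAI-compatible, existing OpenAI client code can typically be pointed at the KServe URL with little change, which is part of the appeal of the integration described in the talk.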
Syllabus
Fast Inference, Furious Scaling: Leveraging vLLM with KServe - Rafael Vasquez, IBM
Taught by
Linux Foundation