Improve AI Inference - Serving Models With KServe and vLLM

Learn how to improve AI model inference performance with KServe and vLLM in this 34-minute conference talk from the Linux Foundation. Discover how Red Hat integrates these technologies into OpenShift AI, its MLOps platform, and how Red Hat engineers contribute to both upstream projects. Explore the architecture and components of Red Hat OpenShift AI, all of which are derived from open source projects, and examine how KServe serves as the model serving platform within it. Dive into the advantages of using vLLM with KServe as the runtime for large language models (LLMs), including techniques for faster inference and lower resource consumption: continuous batching, PagedAttention, and speculative decoding. Gain insight into further resource savings from LLM quantization with vLLM's LLM Compressor library, and come away with practical knowledge for deploying and serving models more efficiently.
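
To give a rough sense of why continuous batching speeds up inference, the toy simulation below counts decode steps for a queue of requests under two scheduling policies. It is a minimal sketch and not vLLM's scheduler: the function names, the request lengths, and the assumption that every request costs one decode step per generated token are all hypothetical and chosen only for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Short requests sit idle until the longest one in the batch is done.
        steps += max(batch)
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next step."""
    waiting = deque(lengths)   # tokens still to generate, per queued request
    running = []               # tokens still to generate, per active request
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step produces one token for every active request.
        steps += 1
        running = [left - 1 for left in running if left - 1 > 0]
    return steps


# Hypothetical output lengths (in tokens) for eight queued requests.
lengths = [2, 8, 3, 8, 2, 2, 3, 2]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(f"static: {static_steps} steps, continuous: {continuous_steps} steps")
```

In this toy model the continuous policy finishes in fewer steps because short requests no longer wait on the longest request in their batch. Real schedulers also have to account for prefill cost and memory limits, which this sketch ignores.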