LMCache - Lower LLM Performance Costs in the Enterprise
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn how to reduce GPU costs and improve LLM performance in enterprise environments in this 26-minute conference talk from CNCF. Discover LMCache, an open-source LLM serving engine extension that significantly reduces Time to First Token (TTFT) and increases throughput for large language model deployments. Explore the key enterprise concerns of cost optimization and return on investment (ROI) when deploying AI applications such as copilots, search engines, document understanding, and chatbots that rely on GPU clusters for high-throughput inference. Examine LMCache's high-performance KV cache management layer and see demonstrations of its integration with production inference engines, including vLLM and KServe deployed on Kubernetes clusters. Understand real-world applications through examples of document analysis and high-speed RAG (Retrieval-Augmented Generation) support. Gain insights into the growing open source community developing KV caching solutions that are already improving ROI at major companies, including Red Hat, IBM, Google, NVIDIA, and CoreWeave.
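The talk itself stays at the architecture level, but the core mechanism behind LMCache's TTFT savings, reusing precomputed KV caches across requests that share a prompt prefix, can be illustrated with a short sketch. The example below is a hypothetical toy: the names PrefixKVCache and fake_prefill are invented for illustration and are not LMCache's actual API.

```python
# Toy sketch of prefix KV-cache reuse, the idea behind LMCache-style serving.
# All names here are hypothetical and invented for illustration.
import hashlib
from typing import Optional


class PrefixKVCache:
    """Maps a hash of a token prefix to its (stand-in) KV tensors."""

    def __init__(self) -> None:
        self._store: dict[str, list[float]] = {}

    @staticmethod
    def _key(tokens: list[int]) -> str:
        return hashlib.sha256(str(tokens).encode()).hexdigest()

    def get(self, tokens: list[int]) -> Optional[list[float]]:
        # Return the KV for the longest stored prefix of `tokens`.
        # Real engines match at fixed block granularity instead.
        for end in range(len(tokens), 0, -1):
            hit = self._store.get(self._key(tokens[:end]))
            if hit is not None:
                return hit
        return None

    def put(self, tokens: list[int], kv: list[float]) -> None:
        self._store[self._key(tokens)] = kv


def fake_prefill(tokens: list[int]) -> list[float]:
    # Stand-in for the expensive GPU prefill that produces real KV tensors.
    return [float(t) for t in tokens]


cache = PrefixKVCache()
doc = list(range(1000))            # shared document context, e.g. for RAG
cache.put(doc, fake_prefill(doc))  # pay the prefill cost once

# A later query sharing the document prefix reuses the cached KV, so only
# the new suffix needs prefill -- this is what cuts Time to First Token.
query = doc + [42, 43]
print("cache hit:", cache.get(query) is not None)  # -> cache hit: True
```

In a real deployment the cached KV tensors live in GPU or CPU memory, or in remote storage, and are matched at block granularity; the point of the sketch is only that a shared document prefix (common in RAG and document-analysis workloads) pays the prefill cost once.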
Syllabus
LMCache: Lower LLM Performance Costs in the Enterprise - Martin Hickey & Junchen Jiang
Taught by
CNCF [Cloud Native Computing Foundation]