Making Long-context LLM Inference 10x Faster and 10x Cheaper Through Knowledge Sharing
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn about an innovative knowledge-sharing system for Large Language Models (LLMs) in this technical conference talk from CNCF. Explore how LLMs can efficiently share digested knowledge through KV caches, eliminating the need to reprocess the same documents for every request. Discover implementation techniques on Kubernetes that enable storing KV caches on cost-effective storage devices while significantly reducing LLM serving delays. Examine practical demonstrations showing how this approach not only improves economic efficiency but also enhances performance, particularly time-to-first-token. Gain insights into solving the challenge of storing and quickly serving KV caches without relying solely on GPU/CPU memory.
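The core idea described above, reusing a document's digested KV cache instead of re-running the expensive prefill pass, can be illustrated with a minimal sketch. All names here (`KVCacheStore`, `toy_prefill`) are hypothetical illustrations, not the talk's actual implementation; a real system would persist transformer key/value tensors on cheap storage rather than a Python dict.

```python
import hashlib

class KVCacheStore:
    """Toy content-addressed store: one prefill per unique document."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(document: str) -> str:
        # Key the cache by document content, so identical contexts share one entry.
        return hashlib.sha256(document.encode()).hexdigest()

    def get_or_compute(self, document: str, prefill):
        """Return the cached KV state for a document, computing it only once."""
        key = self._key(document)
        if key in self._store:
            self.hits += 1                         # reuse: skip the prefill
        else:
            self.misses += 1
            self._store[key] = prefill(document)   # expensive prefill, done once
        return self._store[key]

def toy_prefill(document: str):
    # Stand-in for the expensive pass that builds the KV tensors.
    return [ord(c) for c in document]

store = KVCacheStore()
doc = "a long shared context document"
store.get_or_compute(doc, toy_prefill)   # first request: computes the cache
store.get_or_compute(doc, toy_prefill)   # second request: cache hit
print(store.hits, store.misses)          # → 1 1
```

Because the cache is keyed by content rather than by session, any request carrying the same document benefits, which is what makes cross-request knowledge sharing cut both cost and first-token latency.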
Syllabus
Making Long-context LLM Inference 10x Faster and 10x Cheaper - Junchen Jiang, Yihua Cheng, & Zhou Sun
Taught by
CNCF [Cloud Native Computing Foundation]