Making Long-context LLM Inference 10x Faster and 10x Cheaper Through Knowledge Sharing
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn about an innovative knowledge-sharing system for Large Language Models (LLMs) in this technical conference talk from CNCF. Explore how LLMs can efficiently share digested knowledge through KV caches, eliminating the need to reprocess the same documents for every query. Discover implementation techniques on Kubernetes that enable storing KV caches on cost-effective devices while significantly reducing LLM serving delays. Examine practical demonstrations showing how this approach not only improves economic efficiency but also enhances performance, particularly in first-token response times. Gain insights into solving the challenge of storing and quickly serving KV caches without relying solely on GPU/CPU memory.
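To make the idea concrete, the following is a minimal conceptual sketch (not the speakers' implementation) of reusing a precomputed KV cache instead of re-prefilling a long document for every query, with the cache offloaded to inexpensive local storage. All names here (KVCacheStore, prefill, decode_with_cache) are hypothetical placeholders for whatever the serving engine provides.

```python
# Hypothetical sketch: reuse a document's KV cache across queries instead of
# re-running the expensive prefill. Names and APIs are illustrative only.
import hashlib
import pickle
from pathlib import Path


class KVCacheStore:
    """Keeps serialized KV caches on cheap local storage, keyed by document hash."""

    def __init__(self, root: str = "./kv_cache"):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, document: str) -> Path:
        return self.root / hashlib.sha256(document.encode()).hexdigest()

    def get(self, document: str):
        path = self._path(document)
        if path.exists():
            return pickle.loads(path.read_bytes())  # reuse the digested knowledge
        return None

    def put(self, document: str, kv_cache) -> None:
        self._path(document).write_bytes(pickle.dumps(kv_cache))


def answer(model, store: KVCacheStore, document: str, question: str) -> str:
    """Answer a question over a long document, skipping prefill when a cache exists."""
    kv = store.get(document)
    if kv is None:
        kv = model.prefill(document)   # expensive step: runs only once per document
        store.put(document, kv)
    # Fast path: decode the new question on top of the loaded KV cache,
    # which is what shortens time-to-first-token for repeated documents.
    return model.decode_with_cache(kv, question)
```

The design choice this illustrates is the one the talk argues for: the prefill cost of a long context is paid once, and later requests (possibly from other model instances) load the cached result from cheaper-than-GPU-memory storage rather than recomputing it.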
Syllabus
Making Long-context LLM Inference 10x Faster & 10x Cheaper - Junchen Jiang, Yihua Cheng, & Zhou Sun
Taught by
CNCF [Cloud Native Computing Foundation]