Overview
Explore how to optimize Large Language Model (LLM) serving on Kubernetes through prefix-aware routing in this 26-minute conference talk. Learn about the challenges of efficiently serving LLMs on Kubernetes, including poor GPU utilization, higher latency, and rising costs caused by diverse prompts and request patterns. Discover how prefix-aware routing intelligently analyzes initial tokens of incoming prompts to identify patterns and optimize LLM inference requests through smart routing, prioritization, and caching. Examine the architecture of a prefix-aware scorer plugin and its integration with the Kubernetes Gateway API Inference Extension. Understand how this approach enables reuse of cached data like KV caches, improves batching of similar requests, and efficiently utilizes model shards or LoRA adapters. Gain insights into real-world performance benefits including increased throughput, reduced latency, and maximized resource efficiency for GenAI workloads running on CNCF infrastructure through smarter routing strategies.
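To make the routing idea concrete, here is a minimal sketch in Go of how a prefix-aware scorer might behave. This is not the plugin presented in the talk: the endpoint struct and the prefixKey and pickEndpoint helpers are hypothetical, and a real scorer would run inside the Kubernetes Gateway API Inference Extension rather than as a standalone program. The sketch hashes the opening bytes of each prompt and prefers a pod that has already served that prefix, so requests sharing a system prompt land where the KV cache is likely warm.

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// endpoint models a serving pod that may hold warm KV-cache entries
// for prompt prefixes it has recently processed (hypothetical struct).
type endpoint struct {
	name           string
	load           int             // in-flight requests on this pod
	cachedPrefixes map[string]bool // prefix hashes seen recently
}

// prefixKey hashes the first n bytes of the prompt, so requests that
// share an opening segment (system prompt, few-shot examples) map to
// the same key even when their suffixes differ.
func prefixKey(prompt string, n int) string {
	if len(prompt) < n {
		n = len(prompt)
	}
	sum := sha256.Sum256([]byte(prompt[:n]))
	return hex.EncodeToString(sum[:8])
}

// pickEndpoint prefers a pod that already holds the prefix in cache
// (KV-cache reuse); otherwise it falls back to the least-loaded pod,
// which then warms its cache for subsequent matching requests.
func pickEndpoint(pods []*endpoint, prompt string) *endpoint {
	key := prefixKey(prompt, 32)
	var best *endpoint
	for _, p := range pods {
		if p.cachedPrefixes[key] && (best == nil || p.load < best.load) {
			best = p
		}
	}
	if best == nil { // no cache hit anywhere: pick least-loaded pod
		for _, p := range pods {
			if best == nil || p.load < best.load {
				best = p
			}
		}
	}
	best.cachedPrefixes[key] = true
	best.load++
	return best
}

func main() {
	pods := []*endpoint{
		{name: "llm-0", cachedPrefixes: map[string]bool{}},
		{name: "llm-1", cachedPrefixes: map[string]bool{}},
	}
	shared := "You are a helpful assistant. Answer concisely.\n"
	for _, q := range []string{"What is Kubernetes?", "What is a KV cache?"} {
		p := pickEndpoint(pods, shared+q)
		fmt.Printf("%q -> %s\n", q, p.name)
	}
}

Running this, both questions route to the same pod because they share the hashed prefix, illustrating how prefix-aware scoring steers similar requests toward cached state instead of spraying them across replicas.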
Syllabus
You Got a Match! LLM Prefix Aware Routing With Kubernetes - Ricardo Noriega & Cong Liu
Taught by
CNCF (Cloud Native Computing Foundation)