Overview
Explore how to optimize Large Language Model (LLM) serving on Kubernetes through prefix-aware routing in this 26-minute conference talk. Learn about the challenges of efficiently serving LLMs on Kubernetes, including poor GPU utilization, higher latency, and rising costs caused by diverse prompts and request patterns. Discover how prefix-aware routing intelligently analyzes initial tokens of incoming prompts to identify patterns and optimize LLM inference requests through smart routing, prioritization, and caching. Examine the architecture of a prefix-aware scorer plugin and its integration with the Kubernetes Gateway API Inference Extension. Understand how this approach enables reuse of cached data like KV caches, improves batching of similar requests, and efficiently utilizes model shards or LoRA adapters. Gain insights into real-world performance benefits including increased throughput, reduced latency, and maximized resource efficiency for GenAI workloads running on CNCF infrastructure through smarter routing strategies.
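To make the routing idea concrete, here is a minimal sketch in Go of how a prefix-aware scorer might behave. This is not the plugin presented in the talk: the endpoint struct and the prefixKey and pickEndpoint helpers are hypothetical, and a real scorer would run inside the Kubernetes Gateway API Inference Extension rather than as a standalone program. The sketch hashes the opening bytes of each prompt and prefers a pod that has already served that prefix, so requests sharing a system prompt land where the KV cache is likely warm.

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// endpoint models a serving pod that may hold warm KV-cache entries
// for prompt prefixes it has recently processed (hypothetical struct).
type endpoint struct {
	name           string
	load           int             // in-flight requests on this pod
	cachedPrefixes map[string]bool // prefix hashes seen recently
}

// prefixKey hashes the first n bytes of the prompt, so requests that
// share an opening segment (system prompt, few-shot examples) map to
// the same key even when their suffixes differ.
func prefixKey(prompt string, n int) string {
	if len(prompt) < n {
		n = len(prompt)
	}
	sum := sha256.Sum256([]byte(prompt[:n]))
	return hex.EncodeToString(sum[:8])
}

// pickEndpoint prefers a pod that already holds the prefix in cache
// (KV-cache reuse); otherwise it falls back to the least-loaded pod,
// which then warms its cache for subsequent matching requests.
func pickEndpoint(pods []*endpoint, prompt string) *endpoint {
	key := prefixKey(prompt, 32)
	var best *endpoint
	for _, p := range pods {
		if p.cachedPrefixes[key] && (best == nil || p.load < best.load) {
			best = p
		}
	}
	if best == nil { // no cache hit anywhere: pick least-loaded pod
		for _, p := range pods {
			if best == nil || p.load < best.load {
				best = p
			}
		}
	}
	best.cachedPrefixes[key] = true
	best.load++
	return best
}

func main() {
	pods := []*endpoint{
		{name: "llm-0", cachedPrefixes: map[string]bool{}},
		{name: "llm-1", cachedPrefixes: map[string]bool{}},
	}
	shared := "You are a helpful assistant. Answer concisely.\n"
	for _, q := range []string{"What is Kubernetes?", "What is a KV cache?"} {
		p := pickEndpoint(pods, shared+q)
		fmt.Printf("%q -> %s\n", q, p.name)
	}
}

Running this, both questions route to the same pod because they share the hashed prefix, illustrating how prefix-aware scoring steers similar requests toward cached state instead of spraying them across replicas.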
Syllabus
You Got a Match! LLM Prefix Aware Routing With Kubernetes - Ricardo Noriega & Cong Liu
Taught by
CNCF (Cloud Native Computing Foundation)