LLM Prefix Aware Routing With Kubernetes

CNCF [Cloud Native Computing Foundation] via YouTube

Overview

Explore how to optimize Large Language Model (LLM) serving on Kubernetes through prefix-aware routing in this 26-minute conference talk. Learn about the challenges of efficiently serving LLMs on Kubernetes, including poor GPU utilization, higher latency, and rising costs caused by diverse prompts and request patterns. Discover how prefix-aware routing intelligently analyzes initial tokens of incoming prompts to identify patterns and optimize LLM inference requests through smart routing, prioritization, and caching. Examine the architecture of a prefix-aware scorer plugin and its integration with the Kubernetes Gateway API Inference Extension. Understand how this approach enables reuse of cached data like KV caches, improves batching of similar requests, and efficiently utilizes model shards or LoRA adapters. Gain insights into real-world performance benefits including increased throughput, reduced latency, and maximized resource efficiency for GenAI workloads running on CNCF infrastructure through smarter routing strategies.
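To make the routing idea concrete, here is a minimal, self-contained sketch of a prefix-aware scorer in Python. It is an illustration of the general technique described above, not the talk's actual plugin or the Gateway API Inference Extension code: prompts are hashed in fixed-size chained blocks (mirroring how KV-cache blocks are keyed), the scorer remembers which server last saw each block, and incoming requests are scored by the length of their cached prefix match. The class and function names (`PrefixScorer`, `prefix_blocks`) and the block size are illustrative assumptions.

```python
import hashlib

BLOCK = 16  # tokens per hashed prefix block (illustrative choice)

def prefix_blocks(tokens, block=BLOCK):
    """Hash the prompt in fixed-size blocks, chaining each hash with the
    previous one so a block's key depends on the entire prefix before it,
    similar to how KV-cache blocks are keyed in LLM serving engines."""
    keys, prev = [], b""
    for i in range(0, len(tokens) - len(tokens) % block, block):
        h = hashlib.sha256(prev + repr(tokens[i:i + block]).encode()).digest()
        keys.append(h)
        prev = h
    return keys

class PrefixScorer:
    """Toy prefix-aware scorer: records which server saw which prefix
    blocks, then scores candidate servers by how many leading blocks of a
    new request they already have cached."""

    def __init__(self):
        self.block_to_server = {}  # block hash -> server that cached it

    def record(self, tokens, server):
        """Remember that `server` has processed (and cached) this prompt."""
        for key in prefix_blocks(tokens):
            self.block_to_server[key] = server

    def score(self, tokens, servers):
        """Return a per-server score: the number of consecutive leading
        blocks of `tokens` that server is believed to have cached."""
        scores = {s: 0 for s in servers}
        for key in prefix_blocks(tokens):
            owner = self.block_to_server.get(key)
            if owner is None or owner not in scores:
                break  # longest cached prefix ends here
            scores[owner] += 1
        return scores
```

A router built on this would send each request to the highest-scoring server (falling back to load-based routing on a tie), so requests sharing a long prompt prefix land on the replica that can reuse its KV cache instead of recomputing it.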

Syllabus

You Got a Match! LLM Prefix Aware Routing With Kubernetes - Ricardo Noriega & Cong Liu

Taught by

CNCF [Cloud Native Computing Foundation]

