Navigating the Rapid Evolution of Large Model Inference - Where Does Kubernetes Fit?
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Explore the complex intersection of large language model inference and Kubernetes infrastructure in this 30-minute conference talk from CNCF. Learn from the chairs of the Kubernetes Serving Working Group (WG Serving) and industry leaders from ByteDance, Red Hat, Google, and Microsoft as they address the critical decisions infrastructure teams face when deploying advanced LLM serving patterns. Discover how emerging techniques such as model and expert parallelism, prefill/decode disaggregation, multi-LoRA serving, and KV cache offloading challenge traditional serving architectures and push beyond conventional Kubernetes primitives. Gain practical frameworks for evaluating when to extend Kubernetes core functionality versus leveraging specialized runtimes and ecosystem projects. Understand the delicate balance between maintaining control and ensuring observability while adapting infrastructure to the rapidly evolving demands of large-scale LLM workloads. Acquire actionable insights for navigating the blurry boundaries between Kubernetes-native capabilities, inference engines, and specialized tooling in the dynamic landscape of AI infrastructure.
Syllabus
Navigating the Rapid Evolution of Large Model Inference - Where Does Kubernetes Fit? - Jiaxin Shan, Yuan Tang, Sergey Kanzhelev & Rita Zhang
Taught by
CNCF [Cloud Native Computing Foundation]