Navigating the Rapid Evolution of Large Model Inference - Where Does Kubernetes Fit?
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Explore the complex intersection of large language model inference and Kubernetes infrastructure in this 30-minute conference talk from CNCF. Learn from the chairs of the Kubernetes Serving Working Group (WG Serving) and industry leaders from ByteDance, Red Hat, Google, and Microsoft as they address the critical decisions infrastructure teams face when deploying advanced LLM serving patterns. Discover how emerging techniques such as model and expert parallelism, prefill/decode disaggregation, multi-LoRA serving, and KV cache offloading challenge traditional serving architectures and push beyond conventional Kubernetes primitives. Gain practical frameworks for evaluating when to extend Kubernetes core functionality versus leveraging specialized runtimes and ecosystem projects. Understand the delicate balance between maintaining control and ensuring observability while adapting infrastructure to the rapidly evolving demands of large-scale LLM workloads. Acquire actionable insights for navigating the blurry boundaries between Kubernetes-native capabilities, inference engines, and specialized tooling in the dynamic landscape of AI infrastructure.
Syllabus
Navigating the Rapid Evolution of Large Model Inference - Where Does Kubernetes Fit? - Jiaxin Shan, Yuan Tang, Sergey Kanzhelev & Rita Zhang
Taught by
CNCF [Cloud Native Computing Foundation]