AI Models Are Huge, but Your GPUs Aren't - Mastering Multi-Node Distributed Inference on Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn to deploy massive AI models exceeding 600B parameters for inference using Kubernetes in this conference talk from CNCF. Explore production-ready strategies for the infrastructure challenges that arise when AI models outgrow single-GPU capabilities, covering day-0/day-1 operations with a focus on latency, cost, and accuracy tradeoffs. Discover how to choose between full-precision and quantized models, size worker nodes for optimal GPU, memory, and networking performance, and manage model parallelism effectively. Master Kubernetes-native challenges including topology-aware scheduling, GPU-NIC binding, and orchestrating inference phases with custom controllers. Examine traffic routing strategies and adaptive approaches to balancing cost and performance at scale. Understand Prefill/Decode disaggregation techniques in both static and pooled modes to support varied prompt lengths. Gain practical insights from real-world benchmarks and production experience, walking away with actionable diagrams, checklists, and manifests for confident deployment of distributed AI inference workloads on Kubernetes.
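To make the full-precision-versus-quantized sizing tradeoff concrete, here is a minimal back-of-the-envelope sketch (not taken from the talk) that estimates how many GPUs are needed just to hold a model's weights at different precisions. It ignores KV cache, activations, and framework overhead, so real deployments need additional headroom; the 600B parameter count and 80 GB GPU memory figure are illustrative assumptions.

```python
import math

def gpus_for_weights(params_b: float, bytes_per_param: float,
                     gpu_mem_gb: float = 80.0, headroom: float = 0.9) -> int:
    """Minimum GPUs to hold the weights alone, assuming even sharding.

    params_b: parameter count in billions.
    bytes_per_param: 2.0 for fp16/bf16, 1.0 for int8, 0.5 for int4.
    headroom: fraction of GPU memory usable for weights (rest is reserved
    for KV cache, activations, and runtime overhead -- an assumption here).
    """
    weight_gb = params_b * bytes_per_param  # 1B params at 1 byte/param = 1 GB
    usable_gb = gpu_mem_gb * headroom
    return math.ceil(weight_gb / usable_gb)

# A hypothetical 600B-parameter model on 80 GB GPUs:
for name, bytes_pp in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    print(f"{name}: {gpus_for_weights(600, bytes_pp)} GPUs (weights only)")
```

At fp16 the weights alone span well over a dozen 80 GB GPUs, which is why the talk's multi-node topics (model parallelism, topology-aware scheduling, GPU-NIC binding) become unavoidable, while int4 quantization can shrink the footprint by roughly 4x at some cost in accuracy.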
Syllabus
AI Models Are Huge, but Your GPUs Aren't: Mastering Multi-Node Distributed Inference on Kubernetes - E. Wong & J. Shan
Taught by
CNCF [Cloud Native Computing Foundation]