AI Models Are Huge, but Your GPUs Aren't - Mastering Multi-Node Distributed Inference on Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn to deploy massive AI models exceeding 600B parameters for inference using Kubernetes in this conference talk from CNCF. Explore production-ready strategies for the infrastructure challenges that arise when AI models outgrow single-GPU capabilities, covering day-0/day-1 operations with a focus on latency, cost, and accuracy tradeoffs. Discover how to choose between full-precision and quantized models, size worker nodes for optimal GPU, memory, and networking performance, and manage model parallelism effectively. Master Kubernetes-native challenges including topology-aware scheduling, GPU-NIC binding, and orchestrating inference phases with custom controllers. Examine traffic routing strategies and adaptive approaches to balancing cost and performance at scale. Understand Prefill/Decode disaggregation techniques in both static and pooled modes to support varied prompt lengths. Gain practical insights from real-world benchmarks and production experience, walking away with actionable diagrams, checklists, and manifests for confident deployment of distributed AI inference workloads on Kubernetes.
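To make the full-precision-versus-quantized sizing tradeoff concrete, here is a minimal back-of-the-envelope sketch (not taken from the talk) that estimates how many GPUs are needed just to hold a model's weights at different precisions. It ignores KV cache, activations, and framework overhead, so real deployments need additional headroom; the 600B parameter count and 80 GB GPU memory figure are illustrative assumptions.

```python
import math

def gpus_for_weights(params_b: float, bytes_per_param: float,
                     gpu_mem_gb: float = 80.0, headroom: float = 0.9) -> int:
    """Minimum GPUs to hold the weights alone, assuming even sharding.

    params_b: parameter count in billions.
    bytes_per_param: 2.0 for fp16/bf16, 1.0 for int8, 0.5 for int4.
    headroom: fraction of GPU memory usable for weights (rest is reserved
    for KV cache, activations, and runtime overhead -- an assumption here).
    """
    weight_gb = params_b * bytes_per_param  # 1B params at 1 byte/param = 1 GB
    usable_gb = gpu_mem_gb * headroom
    return math.ceil(weight_gb / usable_gb)

# A hypothetical 600B-parameter model on 80 GB GPUs:
for name, bytes_pp in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    print(f"{name}: {gpus_for_weights(600, bytes_pp)} GPUs (weights only)")
```

At fp16 the weights alone span well over a dozen 80 GB GPUs, which is why the talk's multi-node topics (model parallelism, topology-aware scheduling, GPU-NIC binding) become unavoidable, while int4 quantization can shrink the footprint by roughly 4x at some cost in accuracy.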
Syllabus
AI Models Are Huge, but Your GPUs Aren't: Mastering Multi-Node Distributed Inference on Kubernetes - E. Wong & J. Shan
Taught by
CNCF [Cloud Native Computing Foundation]