More Than Model Sharding - LWS and Distributed Inference
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn about LeaderWorkerSet (LWS), a Kubernetes API designed to address the challenges of distributed inference for large language models that go beyond simple model sharding. Discover how native Kubernetes falls short for multi-node workloads such as Llama 3.1-405B or DeepSeek-V3 (671B), which require distributed inference across multiple nodes using frameworks like vLLM with a Ray backend.

Key challenges explored include:
- standalone StatefulSets with no coordination between them
- gang-scheduling requirements
- uncontrolled startup order between master and workers, causing boot lag
- HPA limitations that scale individual StatefulSets rather than the entire group
- the need for stable indices and ranks
- topology-aware grouping
- failure recovery, where a single pod or GPU failure can disrupt inference as a whole

Understand how LWS addresses these problems through coordinated leader-worker groups, improved performance via co-location strategies, integrated HPA scaling of the entire LWS group, and all-or-nothing restart policies that treat the group as a cohesive unit for fault tolerance.
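The leader-worker grouping described in the talk can be sketched as a LeaderWorkerSet manifest. The structure below follows the LWS API (`leaderworkerset.x-k8s.io/v1`), but the metadata name, container images, and resource values are illustrative assumptions, not taken from the talk:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm-inference          # hypothetical name
spec:
  replicas: 2                   # number of leader-worker groups; HPA scales this,
                                # not the individual StatefulSet-style pods
  leaderWorkerTemplate:
    size: 4                     # pods per group (1 leader + 3 workers),
                                # scheduled as a gang, all-or-nothing
    restartPolicy: RecreateGroupOnRestart   # a single pod/GPU failure
                                            # restarts the whole group
    leaderTemplate:
      spec:
        containers:
        - name: vllm-leader
          image: vllm/vllm-openai:latest    # illustrative image
          resources:
            limits:
              nvidia.com/gpu: "8"
    workerTemplate:
      spec:
        containers:
        - name: vllm-worker
          image: vllm/vllm-openai:latest    # illustrative image
          resources:
            limits:
              nvidia.com/gpu: "8"
```

Each pod in a group receives stable group and worker indices (exposed as labels on the pods), which frameworks such as vLLM with a Ray backend can use to derive stable ranks across restarts.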
Syllabus
More Than Model Sharding: LWS & Distributed Inference - Peter Pan, Nicole Li & Shane Wang
Taught by
CNCF [Cloud Native Computing Foundation]