More Than Model Sharding - LWS and Distributed Inference
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn about LeaderWorkerSet (LWS), a Kubernetes solution for distributed inference of large language models that goes beyond simple model sharding. Discover where native Kubernetes falls short for models such as Llama3.1-405B or DeepSeek-V3 (671B), which must be served across multiple nodes using frameworks like vLLM with a Ray backend. Explore the key challenges: standalone StatefulSets that run without coordination; the need for gang scheduling; uncontrolled startup order between master and workers, which causes boot lag; Horizontal Pod Autoscaler (HPA) behavior that scales individual StatefulSets rather than the entire group; requirements for stable index and rank; topology-aware grouping; and failure recovery, where a single pod or GPU failure can disrupt the whole inference deployment. Understand how LWS addresses these problems: it coordinates resources through leader-worker sets, improves performance through co-location strategies, integrates with HPA to scale the entire LWS group, and applies an all-or-nothing restart policy so that the group recovers from failures as one cohesive unit.
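To see why models of this size end up spanning nodes, a rough weights-only estimate helps. The sketch below is a back-of-the-envelope calculation, not something taken from the talk: the 80 GB per GPU, 8 GPUs per node, and bytes-per-parameter figures (2 for 16-bit weights, 1 for 8-bit weights) are all assumptions, and the function names are hypothetical. It ignores KV cache, activations, and runtime overhead, so real deployments need more headroom than this suggests.

```python
import math

# Assumed hardware, not stated in the talk description.
GPU_MEMORY_GB = 80   # memory per GPU (assumption)
GPUS_PER_NODE = 8    # GPUs per node (assumption)


def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes).

    params_billions * 1e9 params * bytes_per_param / 1e9 bytes per GB
    simplifies to params_billions * bytes_per_param.
    """
    return params_billions * bytes_per_param


def min_nodes_for_weights(params_billions: float, bytes_per_param: float) -> int:
    """Lower bound on the node count needed just to hold the weights."""
    node_memory_gb = GPU_MEMORY_GB * GPUS_PER_NODE
    return math.ceil(weight_memory_gb(params_billions, bytes_per_param) / node_memory_gb)


if __name__ == "__main__":
    # (model name, parameters in billions, assumed bytes per parameter)
    models = [
        ("Llama3.1-405B, 16-bit weights", 405, 2),
        ("DeepSeek-V3 (671B), 8-bit weights", 671, 1),
    ]
    for name, params, bytes_per_param in models:
        gb = weight_memory_gb(params, bytes_per_param)
        nodes = min_nodes_for_weights(params, bytes_per_param)
        print(f"{name}: ~{gb:.0f} GB of weights, at least {nodes} node(s)")
```

Under these assumptions a single node offers 640 GB of GPU memory, so neither model's weights fit on one node, which is the situation that calls for a coordinated leader-worker group rather than independent pods.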
Syllabus
More Than Model Sharding: LWS & Distributed Inference - Peter Pan, Nicole Li & Shane Wang
Taught by
CNCF [Cloud Native Computing Foundation]