More Than Model Sharding - LWS and Distributed Inference
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn about LeaderWorkerSet (LWS), a Kubernetes API designed to address the challenges of distributed inference for large language models that go beyond simple model sharding. Discover why native Kubernetes falls short for multi-node workloads such as Llama 3.1 405B or DeepSeek-V3 (671B), which must be served across multiple nodes using frameworks like vLLM with a Ray backend. The talk explores the key challenges:

- Standalone StatefulSets with no coordination between them
- Gang-scheduling requirements, where all pods in a group must be scheduled together
- Uncontrolled startup order between the master and workers, causing boot lag
- HPA limitations: autoscaling acts on individual StatefulSets rather than the entire group
- The need for stable indices and ranks across pods
- Topology-aware grouping, so a group's pods land on well-connected nodes
- Failure recovery, where a single pod or GPU failure can disrupt inference for the whole group

It then shows how LWS addresses these problems: coordinated resource management through leader-worker groups, improved performance via co-location strategies, integrated scaling in which the HPA targets the entire LWS group, and all-or-nothing restart policies that treat each group as a cohesive unit for fault tolerance.
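To make the ideas above concrete, a LeaderWorkerSet manifest groups one leader pod with a fixed number of worker pods and can restart them as a single unit; an HPA can then target the LeaderWorkerSet so scaling operates on whole groups. The sketch below assumes the LWS v1 API (`leaderworkerset.x-k8s.io/v1`); the names, images, and resource numbers are illustrative placeholders, not values from the talk.

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm                       # hypothetical name
spec:
  replicas: 2                      # two independent inference groups
  leaderWorkerTemplate:
    size: 4                        # pods per group: 1 leader + 3 workers
    restartPolicy: RecreateGroupOnRestart  # all-or-nothing: a pod failure recreates the group
    leaderTemplate:
      spec:
        containers:
        - name: vllm-leader
          image: vllm/vllm-openai:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: "8"
    workerTemplate:
      spec:
        containers:
        - name: vllm-worker
          image: vllm/vllm-openai:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: "8"
---
# The HPA targets the LeaderWorkerSet itself, so scaling up or down adds or
# removes entire leader-worker groups rather than individual pods.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    name: vllm
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
```

With `RecreateGroupOnRestart`, a failed worker or GPU does not leave a degraded group serving requests; the whole leader-worker group is recreated together, matching the fault-tolerance model described in the talk.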
Syllabus
More Than Model Sharding: LWS & Distributed Inference - Peter Pan, Nicole Li & Shane Wang
Taught by
CNCF [Cloud Native Computing Foundation]