New Pattern for Sailing Multi-host LLM Inference
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Explore a conference talk that introduces LeaderWorkerSet (LWS), a specialized Kubernetes project designed to address the challenges of distributed large language model inference across multiple hosts. Learn how this solution, developed under the guidance of Kubernetes SIG Apps and the Serving Working Group, tackles the complexity of serving massive foundation models like Llama 3.1 405B and DeepSeek R1 that cannot fit on a single node. Discover LWS's key features, including its dual-template architecture for different Pod types, fine-grained rolling update strategies, topology management, and all-or-nothing failure handling. Examine real-world adoption practices from industry leaders including NVIDIA and Google, and see practical demonstrations of LWS integration with popular inference engines such as vLLM and SGLang. Gain insights into how this cloud-native approach simplifies the deployment and management of distributed inference workloads while maintaining reliability and scalability in production environments.
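The dual-template architecture mentioned above can be sketched as a minimal LeaderWorkerSet manifest: one template for the leader Pod and one for the workers in each replica group. This is an illustrative sketch only — the resource name, image, group size, and port here are assumptions, not taken from the talk:

```yaml
# Hypothetical LeaderWorkerSet running vLLM across multiple hosts.
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm-llama            # illustrative name
spec:
  replicas: 2                 # two independent inference groups
  leaderWorkerTemplate:
    size: 4                   # 1 leader + 3 workers per group
    # "All-or-nothing" failure handling: restart the whole group
    # if any Pod in it fails.
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:           # Pod template for the leader
      spec:
        containers:
        - name: vllm-leader
          image: vllm/vllm-openai:latest   # assumed image
          ports:
          - containerPort: 8000            # assumed serving port
    workerTemplate:           # Pod template for the workers
      spec:
        containers:
        - name: vllm-worker
          image: vllm/vllm-openai:latest   # assumed image
```

Because leaders and workers have separate templates, each role can get its own image arguments and resources, while LWS schedules, scales, and rolls out the group as a single unit.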
Syllabus
New Pattern for Sailing Multi-host LLM Inference - Kante Yin, DaoCloud
Taught by
CNCF [Cloud Native Computing Foundation]