Sailing Multi-host Inference for LLM on Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn how to deploy distributed inference for large language models on Kubernetes using LeaderWorkerSet (LWS) and vLLM in this conference talk. Explore the challenges of serving large foundation models such as Llama 3.1 405B or DeepSeek R1 that cannot fit on a single node and therefore require distributed inference with model parallelism. Discover LeaderWorkerSet, a dedicated multi-host inference project developed under the guidance of Kubernetes SIG Apps and the Serving Working Group, which addresses these complexities through features including dual-template support for different Pod types, fine-grained rolling update strategies, topology management, and all-or-nothing failure handling. See practical demonstrations of deploying distributed inference workloads with vLLM, a popular inference engine known for its performance and ease of use, integrated with LWS on Kubernetes infrastructure. Gain insights into solving the increasingly prevalent and vital inference workload challenges in the cloud native ecosystem.
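To make the features mentioned above concrete, here is a minimal sketch of what a LeaderWorkerSet manifest for a distributed vLLM deployment can look like. This is an illustrative example, not the manifest shown in the talk: it assumes the LWS CRD (`leaderworkerset.x-k8s.io/v1`) is installed on the cluster, and the name, group size, image, and GPU counts are placeholder values.

```yaml
# Hypothetical LWS manifest: each replica is a group of pods
# (one leader plus workers) that together serve one model instance.
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm-distributed          # placeholder name
spec:
  replicas: 1                     # number of leader/worker groups
  leaderWorkerTemplate:
    size: 2                       # pods per group: 1 leader + 1 worker
    # All-or-nothing failure handling: if any pod in a group fails,
    # the whole group is recreated.
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:               # dual-template support: leader pod spec
      spec:
        containers:
          - name: vllm-leader
            image: vllm/vllm-openai:latest
            resources:
              limits:
                nvidia.com/gpu: "8"
    workerTemplate:               # worker pod spec, scheduled alongside the leader
      spec:
        containers:
          - name: vllm-worker
            image: vllm/vllm-openai:latest
            resources:
              limits:
                nvidia.com/gpu: "8"
```

In this sketch the leader pod would typically start the vLLM server and the worker pod joins its distributed runtime (for example via a Ray cluster), so the model's tensor- or pipeline-parallel shards span both pods while clients see a single serving endpoint.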
Syllabus
Sailing Multi-host Inference for LLM on Kubernetes - Kay Yan, DaoCloud
Taught by
CNCF [Cloud Native Computing Foundation]