Sailing Multi-host Inference for LLM on Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn how to deploy distributed inference for large language models on Kubernetes using LeaderWorkerSet (LWS) and vLLM in this conference talk. Explore the challenges of serving large foundation models such as Llama 3.1 405B or DeepSeek R1 that cannot fit on a single node and therefore require distributed inference with model parallelism. Discover LeaderWorkerSet, a dedicated multi-host inference project developed under the guidance of Kubernetes SIG Apps and the Serving Working Group, which addresses these complexities through features including dual-template support for different Pod types, fine-grained rolling update strategies, topology management, and all-or-nothing failure handling. See practical demonstrations of deploying distributed inference workloads with vLLM, a popular inference engine known for its performance and ease of use, integrated with LWS on Kubernetes infrastructure. Gain insights into solving the increasingly prevalent and vital inference workload challenges in the cloud native ecosystem.
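To make the features mentioned above concrete, here is a minimal sketch of what a LeaderWorkerSet manifest for a distributed vLLM deployment can look like. This is an illustrative example, not the manifest shown in the talk: it assumes the LWS CRD (`leaderworkerset.x-k8s.io/v1`) is installed on the cluster, and the name, group size, image, and GPU counts are placeholder values.

```yaml
# Hypothetical LWS manifest: each replica is a group of pods
# (one leader plus workers) that together serve one model instance.
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm-distributed          # placeholder name
spec:
  replicas: 1                     # number of leader/worker groups
  leaderWorkerTemplate:
    size: 2                       # pods per group: 1 leader + 1 worker
    # All-or-nothing failure handling: if any pod in a group fails,
    # the whole group is recreated.
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:               # dual-template support: leader pod spec
      spec:
        containers:
          - name: vllm-leader
            image: vllm/vllm-openai:latest
            resources:
              limits:
                nvidia.com/gpu: "8"
    workerTemplate:               # worker pod spec, scheduled alongside the leader
      spec:
        containers:
          - name: vllm-worker
            image: vllm/vllm-openai:latest
            resources:
              limits:
                nvidia.com/gpu: "8"
```

In this sketch the leader pod would typically start the vLLM server and the worker pod joins its distributed runtime (for example via a Ray cluster), so the model's tensor- or pipeline-parallel shards span both pods while clients see a single serving endpoint.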
Syllabus
Sailing Multi-host Inference for LLM on Kubernetes - Kay Yan, DaoCloud
Taught by
CNCF [Cloud Native Computing Foundation]