Sailing Multi-host Inference for LLM on Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn how to deploy distributed inference for large language models on Kubernetes using LeaderWorkerSet (LWS) and vLLM in this conference talk. Large foundation models such as Llama 3.1 405B or DeepSeek R1 do not fit on a single node, so serving them requires distributed inference with model parallelism across several hosts. The talk introduces LeaderWorkerSet, a project dedicated to multi-host inference and developed under the guidance of Kubernetes SIG Apps and the Serving Working Group. LWS addresses these complexities with dual-template support for different Pod types (leader and worker), fine-grained rolling update strategies, topology management, and all-or-nothing failure handling. See practical demonstrations of deploying distributed inference workloads on Kubernetes with LWS and vLLM, a popular inference engine known for its performance and ease of use. Gain insight into solving inference workload challenges that are becoming increasingly common and important in the cloud native ecosystem.
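As a rough illustration of the ideas in the overview, the sketch below first estimates why a 405B-parameter model does not fit on one node, then builds a minimal LeaderWorkerSet manifest as a Python dictionary. The hardware figures (8 GPUs of 80 GB per node, 16-bit weights), the helper names, the container image tag, and the manifest layout are assumptions for illustration and are not taken from the talk. The field names follow the LWS API as commonly documented and should be checked against the LWS release you run. A real deployment would also need vLLM launch commands, GPU resource requests, and a Ray or similar bootstrap step, all omitted here.

```python
import json
import math


def weight_bytes(num_params, bytes_per_param=2):
    """Memory needed for the model weights alone (no KV cache, no activations)."""
    return num_params * bytes_per_param


def min_nodes(num_params, gpus_per_node=8, gpu_mem_gb=80, bytes_per_param=2):
    """Lower bound on node count, assuming weights are sharded evenly across GPUs."""
    node_bytes = gpus_per_node * gpu_mem_gb * 10**9
    return math.ceil(weight_bytes(num_params, bytes_per_param) / node_bytes)


def build_lws_manifest(name, group_size, replicas=1,
                       image="vllm/vllm-openai:latest"):
    """Hypothetical helper: one leader Pod plus (group_size - 1) worker Pods per replica."""
    def pod_template(role):
        return {
            "metadata": {"labels": {"role": role}},
            "spec": {"containers": [{"name": f"vllm-{role}", "image": image}]},
        }

    return {
        "apiVersion": "leaderworkerset.x-k8s.io/v1",
        "kind": "LeaderWorkerSet",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "leaderWorkerTemplate": {
                # size counts the leader and its workers together
                "size": group_size,
                # all-or-nothing handling: recreate the whole group if a Pod restarts
                "restartPolicy": "RecreateGroupOnPodRestart",
                # dual templates: leader and worker Pods can differ
                "leaderTemplate": pod_template("leader"),
                "workerTemplate": pod_template("worker"),
            },
        },
    }


if __name__ == "__main__":
    params = 405 * 10**9  # Llama 3.1 405B
    print("weights in GB:", weight_bytes(params) // 10**9)
    nodes = min_nodes(params)
    print("minimum nodes:", nodes)
    print(json.dumps(build_lws_manifest("vllm", group_size=nodes), indent=2))
```

Under these assumptions the weights alone take about 810 GB, more than the 640 GB of GPU memory on one such node, so each serving replica needs a group of at least two hosts. The printed JSON can be applied with kubectl, which accepts JSON manifests as well as YAML.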
Syllabus
Sailing Multi-host Inference for LLM on Kubernetes - Kay Yan, DaoCloud
Taught by
CNCF [Cloud Native Computing Foundation]