Sailing Multi-host Inference for LLM on Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn how to deploy distributed inference for large language models on Kubernetes using LeaderWorkerSet (LWS) and vLLM in this conference talk. Explore the challenges of serving large foundation models such as Llama 3.1 405B or DeepSeek R1, which cannot fit on a single node and therefore require distributed inference with model parallelism. Discover LeaderWorkerSet, a dedicated multi-host inference project developed under the guidance of Kubernetes SIG Apps and the Serving Working Group, which addresses these complexities with features including dual-template support for leader and worker Pods, fine-grained rolling update strategies, topology management, and all-or-nothing failure handling. See practical demonstrations of deploying distributed inference workloads with the popular vLLM inference engine, known for its performance and ease of use, integrated with LWS on Kubernetes, as sketched below. Gain insights into solving the increasingly prevalent inference workload challenges in the cloud native ecosystem.
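
To make the moving pieces concrete, here is a minimal sketch of an LWS manifest for multi-host vLLM serving, assuming the leaderworkerset.x-k8s.io/v1 API from sigs.k8s.io/lws. The image tag, model name, parallelism sizes, and startup commands are illustrative assumptions, not the exact configuration demonstrated in the talk:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 2                        # two independent inference groups, each serving the full model
  rolloutStrategy:
    type: RollingUpdate
    rollingUpdateConfiguration:      # fine-grained rolling updates operate on whole groups
      maxUnavailable: 1
      maxSurge: 1
  leaderWorkerTemplate:
    size: 2                          # Pods per group: one leader plus one worker
    restartPolicy: RecreateGroupOnPodRestart   # all-or-nothing: any Pod failure recreates the group
    leaderTemplate:                  # dual-template support: leader Pods get their own spec
      spec:
        containers:
        - name: vllm-leader
          image: vllm/vllm-openai:latest       # illustrative image
          command: ["sh", "-c"]
          args:                      # start a Ray head, then serve the model across the group
          - ray start --head --port=6379 &&
            vllm serve meta-llama/Llama-3.1-405B
            --tensor-parallel-size 8 --pipeline-parallel-size 2
          ports:
          - containerPort: 8000      # OpenAI-compatible API exposed by the leader
          resources:
            limits:
              nvidia.com/gpu: "8"
    workerTemplate:                  # worker Pods join the leader's Ray cluster
      spec:
        containers:
        - name: vllm-worker
          image: vllm/vllm-openai:latest
          command: ["sh", "-c"]
          args:                      # LWS injects LWS_LEADER_ADDRESS so workers can find their leader
          - ray start --address=$(LWS_LEADER_ADDRESS):6379 --block
          resources:
            limits:
              nvidia.com/gpu: "8"

Applying this with kubectl creates replicas x size Pods (four here), and a rolling update replaces one whole group at a time rather than individual Pods, so a partially updated group never serves traffic.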
Syllabus
Sailing Multi-host Inference for LLM on Kubernetes - Kay Yan, DaoCloud
Taught by
CNCF [Cloud Native Computing Foundation]