

llm-d - Multi-Accelerator LLM Inference on Kubernetes

CNCF [Cloud Native Computing Foundation] via YouTube

Overview

Explore a conference talk that introduces llm-d, a Kubernetes-native distributed-inference stack designed to optimize large language model serving across diverse accelerator types. Learn how modern Kubernetes clusters can make effective use of mixed hardware, including GPUs, TPUs, and custom AI ASICs, through a unified approach that goes beyond the traditional single-GPU-per-pod configuration.

Discover the architecture of llm-d, built around vLLM and featuring a workload-aware scheduler, disaggregated prefill and decode, a tiered KV cache, and visibility into interconnect bandwidth from NIXL fabrics to GPU peer-to-peer links. Understand how llm-d feeds topology data into Kubernetes so that each request is routed to the accelerator and network path that meets its latency requirements at the lowest cost.

Gain insight into how llm-d decides among accelerator classes and interconnects, and take away a practical scorecard for selecting hardware combinations for different use cases, including chat applications, long-context processing, and batch generation workloads. Walk away with a blueprint for deploying llm-d to achieve high performance while staying within budget.
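The routing idea the talk describes, sending each request to the accelerator that satisfies its latency target at the lowest cost, can be sketched as a simple scoring function. This is an illustrative toy, not llm-d's actual scheduler API; the accelerator names, latency estimates, and prices below are made up for the example.

```python
from dataclasses import dataclass

@dataclass
class Accelerator:
    name: str              # illustrative label, e.g. "gpu-a100"
    est_latency_ms: float  # estimated decode latency for this request shape
    cost_per_hour: float   # hypothetical price of the backing node

def route_request(accels, latency_slo_ms):
    """Pick the cheapest accelerator whose estimated latency meets the SLO.

    Mirrors the idea of routing each request to an accelerator and
    network path that meets latency requirements while minimizing cost.
    """
    eligible = [a for a in accels if a.est_latency_ms <= latency_slo_ms]
    if not eligible:
        return None  # nothing meets the SLO; caller must queue or degrade
    return min(eligible, key=lambda a: a.cost_per_hour)

fleet = [
    Accelerator("gpu-h100", est_latency_ms=18.0, cost_per_hour=4.50),
    Accelerator("gpu-a100", est_latency_ms=35.0, cost_per_hour=2.20),
    Accelerator("tpu-v5e",  est_latency_ms=60.0, cost_per_hour=1.10),
]

# Interactive chat: a tight SLO rules out the slow, cheap class.
print(route_request(fleet, latency_slo_ms=40.0).name)   # gpu-a100
# Batch generation: a loose SLO lets the cheapest class win.
print(route_request(fleet, latency_slo_ms=200.0).name)  # tpu-v5e
```

The same shape generalizes to the use cases the talk's scorecard covers: chat traffic tends toward faster, pricier accelerators, while long-context and batch workloads can tolerate the cheaper tiers.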

Syllabus

llm-d: Multi-Accelerator LLM Inference on Kubernetes - Erwan Gallen, Red Hat

Taught by

CNCF [Cloud Native Computing Foundation]

