llm-d: Multi-Accelerator LLM Inference on Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Explore a conference talk that introduces llm-d, a Kubernetes-native distributed inference stack designed to optimize large language model serving across diverse accelerator types. Learn how modern Kubernetes clusters can make effective use of mixed hardware environments, including GPUs, TPUs, and custom AI ASICs, through a unified approach that goes beyond the traditional single-GPU-per-pod configuration. Discover the architecture of llm-d, built around vLLM and featuring a workload-aware scheduler, disaggregated prefill and decode, a tiered KV cache, and comprehensive visibility into interconnect bandwidth, from NIXL fabrics to GPU peer-to-peer links.
Understand how llm-d feeds topology data into Kubernetes so that each request is routed to the accelerator and network path that meets its latency requirements at the lowest cost. Gain insight into how llm-d decides among accelerator classes and interconnects, and receive practical guidance through a clear scorecard for selecting effective hardware combinations for different use cases, including chat applications, long-context processing, and batch generation workloads. Walk away with a comprehensive blueprint for deploying llm-d to achieve high performance while staying within budget in large language model deployments.
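To make the routing idea concrete, here is a minimal sketch of the kind of decision the talk describes: admit only accelerator and network paths that can meet a request's latency requirement, then pick the cheapest feasible option. This is not llm-d's actual API; all names (AcceleratorPool, score_pool, route) and the numbers in the example are hypothetical illustrations.

```python
from dataclasses import dataclass

@dataclass
class AcceleratorPool:
    name: str                        # e.g. "h100-nvlink", "l40s-pcie" (hypothetical)
    decode_ms_per_token: float       # rough per-token decode latency on this pool
    interconnect_gbps: float         # bandwidth available for KV-cache movement
    cost_per_hour: float             # relative accelerator cost

def score_pool(pool: AcceleratorPool, slo_ms_per_token: float,
               kv_cache_gb: float) -> float:
    """Cost-biased score: infeasible pools score infinity, feasible pools
    score their cost, so the cheapest SLO-meeting pool wins."""
    # Seconds to move this request's KV cache across the interconnect.
    transfer_s = (kv_cache_gb * 8) / pool.interconnect_gbps
    if pool.decode_ms_per_token > slo_ms_per_token or transfer_s > 1.0:
        return float("inf")  # cannot meet the latency requirement
    return pool.cost_per_hour

def route(pools: list[AcceleratorPool], slo_ms_per_token: float,
          kv_cache_gb: float) -> AcceleratorPool:
    # Route the request to the lowest-scoring (cheapest feasible) pool.
    return min(pools, key=lambda p: score_pool(p, slo_ms_per_token, kv_cache_gb))

pools = [
    AcceleratorPool("h100-nvlink", decode_ms_per_token=8.0,
                    interconnect_gbps=400.0, cost_per_hour=4.0),
    AcceleratorPool("l40s-pcie", decode_ms_per_token=20.0,
                    interconnect_gbps=64.0, cost_per_hour=1.0),
]
# A latency-sensitive chat request lands on the faster, pricier pool;
# relaxing the SLO to 25 ms/token would make the cheaper pool feasible.
print(route(pools, slo_ms_per_token=10.0, kv_cache_gb=2.0).name)
```

The same shape of trade-off underlies the talk's scorecard: chat favors low per-token latency, long-context work stresses interconnect bandwidth for KV-cache movement, and batch generation tolerates slower, cheaper hardware.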
Syllabus
llm-d: Multi-Accelerator LLM Inference on Kubernetes - Erwan Gallen, Red Hat
Taught by
CNCF [Cloud Native Computing Foundation]