Overview
Explore large-scale distributed inference for Large Language Models using LLM-D and Kubernetes in this conference talk. Learn how to overcome the significant challenges of deploying LLMs in production, including high GPU/TPU costs, hardware scarcity, and the difficult balance between performance, availability, scalability, and cost-efficiency. Discover LLM-D, a cloud-native, Kubernetes-based, high-performance distributed LLM inference framework designed to deliver the fastest time-to-value and competitive performance per dollar across diverse hardware accelerators.

Begin with a gentle introduction to inference on Kubernetes before diving deep into LLM-D's architecture and the specific problems it addresses. Understand how LLM-D builds on existing projects such as vLLM, Prometheus, and the Kubernetes Gateway API to create an opinionated set of components optimized for GenAI deployments, and examine the framework's KV-cache-aware routing and disaggregated serving capabilities, which operationalize generative AI at scale.

Gain insights from this Apache 2.0-licensed project, created by the makers of vLLM at Red Hat, Google, and ByteDance, and learn how to serve LLMs effectively in critical business applications while maintaining optimal resource utilization.
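To make the KV-cache-aware routing idea concrete before the talk, here is a minimal Python sketch of the scheduling decision it implies: send a request to the replica that already holds the longest cached prefix of the incoming prompt, falling back to the least-loaded replica on a cache miss. This is an illustrative assumption about how such a router could work, not LLM-D's actual implementation; the names (Replica, block_hashes, route, BLOCK_SIZE) are all hypothetical.

```python
# Hypothetical sketch of KV-cache-aware routing; NOT LLM-D's code.
# Assumption: each vLLM replica can report which prompt-prefix blocks
# it has cached, and the router favors the longest cached prefix.

from dataclasses import dataclass, field
import hashlib

BLOCK_SIZE = 16  # tokens per KV-cache block (hypothetical value)


def block_hashes(tokens: list[int]) -> list[str]:
    """Hash the prompt prefix block by block, so identical prefixes
    produce identical hash chains across requests."""
    hashes, prefix = [], b""
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        prefix += str(tokens[i:i + BLOCK_SIZE]).encode()
        hashes.append(hashlib.sha256(prefix).hexdigest())
    return hashes


@dataclass
class Replica:
    name: str
    cached_blocks: set[str] = field(default_factory=set)
    in_flight: int = 0


def route(replicas: list[Replica], tokens: list[int]) -> Replica:
    """Pick the replica with the longest cached prefix of this prompt;
    break ties (and handle full misses) by current load."""
    hashes = block_hashes(tokens)

    def cached_prefix_len(r: Replica) -> int:
        n = 0
        for h in hashes:
            if h not in r.cached_blocks:
                break
            n += 1
        return n

    best = max(replicas, key=lambda r: (cached_prefix_len(r), -r.in_flight))
    best.cached_blocks.update(hashes)  # this replica now caches the prefix
    best.in_flight += 1
    return best
```

Per the talk description, LLM-D itself builds this kind of routing on the Kubernetes Gateway API rather than in application-level code like the sketch above, which only captures the cache-affinity heuristic.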
Syllabus
Large Scale Distributed LLM Inference with LLM-D and Kubernetes by Abdel Sghiouar
Taught by
Devoxx