Overview
Explore large-scale distributed inference for Large Language Models using LLM-D and Kubernetes in this conference talk. Learn how to overcome the significant challenges of deploying LLMs in production, including high GPU/TPU costs, hardware scarcity, and the difficult balance between performance, availability, scalability, and cost-efficiency. Discover LLM-D, a cloud-native, Kubernetes-based, high-performance distributed LLM inference framework designed to deliver the fastest time-to-value and competitive performance per dollar across diverse hardware accelerators.

Begin with a gentle introduction to inference on Kubernetes before diving deep into LLM-D's architecture and the specific problems it addresses. Understand how LLM-D builds on existing projects such as vLLM, Prometheus, and the Kubernetes Gateway API to create an opinionated set of components optimized for GenAI deployments, and examine the framework's KV-cache-aware routing and disaggregated serving capabilities, which operationalize generative AI at scale.

Gain insights from this Apache 2.0-licensed project, created by the makers of vLLM at Red Hat, Google, and ByteDance, and learn how to serve LLMs effectively in critical business applications while maintaining optimal resource utilization.
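To make the KV-cache-aware routing idea concrete before the talk, here is a minimal Python sketch of the scheduling decision it implies: send a request to the replica that already holds the longest cached prefix of the incoming prompt, falling back to the least-loaded replica on a cache miss. This is an illustrative assumption about how such a router could work, not LLM-D's actual implementation; the names (Replica, block_hashes, route, BLOCK_SIZE) are all hypothetical.

```python
# Hypothetical sketch of KV-cache-aware routing; NOT LLM-D's code.
# Assumption: each vLLM replica can report which prompt-prefix blocks
# it has cached, and the router favors the longest cached prefix.

from dataclasses import dataclass, field
import hashlib

BLOCK_SIZE = 16  # tokens per KV-cache block (hypothetical value)


def block_hashes(tokens: list[int]) -> list[str]:
    """Hash the prompt prefix block by block, so identical prefixes
    produce identical hash chains across requests."""
    hashes, prefix = [], b""
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        prefix += str(tokens[i:i + BLOCK_SIZE]).encode()
        hashes.append(hashlib.sha256(prefix).hexdigest())
    return hashes


@dataclass
class Replica:
    name: str
    cached_blocks: set[str] = field(default_factory=set)
    in_flight: int = 0


def route(replicas: list[Replica], tokens: list[int]) -> Replica:
    """Pick the replica with the longest cached prefix of this prompt;
    break ties (and handle full misses) by current load."""
    hashes = block_hashes(tokens)

    def cached_prefix_len(r: Replica) -> int:
        n = 0
        for h in hashes:
            if h not in r.cached_blocks:
                break
            n += 1
        return n

    best = max(replicas, key=lambda r: (cached_prefix_len(r), -r.in_flight))
    best.cached_blocks.update(hashes)  # this replica now caches the prefix
    best.in_flight += 1
    return best
```

Per the talk description, LLM-D itself builds this kind of routing on the Kubernetes Gateway API rather than in application-level code like the sketch above, which only captures the cache-affinity heuristic.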
Syllabus
Large Scale Distributed LLM Inference with LLM-D and Kubernetes by Abdel Sghiouar
Taught by
Devoxx