Overview
Explore large-scale distributed inference for Large Language Models (LLMs) using LLM-D and Kubernetes in this conference talk. Learn how to overcome the main challenges of deploying LLMs in production: high GPU/TPU costs, hardware scarcity, and the trade-offs between performance, availability, scalability, and cost-efficiency. Discover LLM-D, a cloud-native, Kubernetes-based, high-performance distributed LLM inference framework designed to provide fast time-to-value and competitive performance per dollar across diverse hardware accelerators. Begin with a gentle introduction to inference on Kubernetes before diving into LLM-D's architecture and the specific challenges it addresses. Understand how LLM-D builds on existing projects such as vLLM, Prometheus, and the Kubernetes Gateway API to create an opinionated set of components optimized for GenAI deployments. Examine the framework's KV-cache-aware routing and disaggregated serving capabilities, which operationalize generative AI at scale. Gain insights into this Apache 2.0-licensed project, created by the makers of vLLM from Red Hat, Google, and ByteDance, and learn how to serve LLMs in critical business applications while keeping resource utilization high.
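As a rough illustration of what KV-cache-aware routing means, the sketch below sends each request to the replica that already holds the longest cached prefix of its tokens, breaking ties by lowest load. It is a toy model and not LLM-D's actual scheduler: `Replica`, `route`, `block_hashes`, and `BLOCK_SIZE` are hypothetical names invented for this example, and real systems track cache state and load in far more detail.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per cache block (hypothetical value for the sketch)


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash each full block so that a hash identifies the whole prefix up to it."""
    hashes = []
    prev = 0
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for i in range(0, full, block_size):
        prev = hash((prev, tuple(tokens[i:i + block_size])))
        hashes.append(prev)
    return hashes


@dataclass
class Replica:
    """Toy stand-in for one model server (assumption, not an LLM-D type)."""
    name: str
    cached: set = field(default_factory=set)  # prefix-block hashes held in KV cache
    load: int = 0                             # requests routed here so far


def cached_prefix_blocks(replica, hashes):
    """Count how many leading blocks of the request are already cached on the replica."""
    count = 0
    for h in hashes:
        if h not in replica.cached:
            break
        count += 1
    return count


def route(replicas, tokens):
    """Pick the replica with the longest cached prefix; prefer lower load on ties."""
    hashes = block_hashes(tokens)
    best = max(replicas, key=lambda r: (cached_prefix_blocks(r, hashes), -r.load))
    best.cached.update(hashes)  # the chosen replica now holds these blocks
    best.load += 1
    return best
```

In this model, two requests that share a long system prompt tend to land on the same replica, so the shared prefix does not have to be recomputed, while unrelated requests spread out by load.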
Syllabus
Large Scale Distributed LLM Inference with LLM D and Kubernetes by Abdel Sghiouar
Taught by
Devoxx