Overview
Learn to overcome the technical challenges of running ultra-low-latency Large Language Model inference at massive scale in this conference talk by Haytham Abuelfutuh, Co-founder and CTO of Union.ai. Explore the obstacles unique to LLM inference, including managing large model sizes and implementing efficient key-value (KV) caching strategies. Discover the complexities teams encounter when scaling LLM inference to handle high-volume request loads, from leveraging specialized hardware such as GPUs and TPUs to implementing effective scaling strategies and designing new routing architectures.

Gain insight into practical solutions developed at Union for optimizing inference workload performance, presented in a cloud- and platform-agnostic way so they can be applied across diverse infrastructure environments. Drawing on Abuelfutuh's 15 years of experience designing distributed systems and cloud applications at Microsoft, Google, and Lyft, as well as his work co-authoring the Flyte.org ML workflow orchestration system, this 30-minute presentation delivers actionable strategies for achieving ultra-low-latency LLM inference at enterprise scale.
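The description above only names KV caching, so here is a minimal sketch of the idea it refers to. This is illustrative Python, not code from the talk, and every name in it (KVCache, W_q, W_k, W_v, d_model) is a hypothetical stand-in: during autoregressive decoding, the key and value projections of past tokens never change, so a cache computes them once per token and reuses them instead of reprojecting the whole sequence at every step.

import numpy as np

d_model = 64  # hypothetical model dimension for this toy example
rng = np.random.default_rng(0)
# Random stand-ins for one attention layer's learned projection matrices.
W_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_k = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

class KVCache:
    """Accumulates per-token key/value vectors for one attention layer."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, token_embedding):
        # Project only the newest token; earlier projections are reused.
        self.keys.append(token_embedding @ W_k)
        self.values.append(token_embedding @ W_v)

    def attend(self, query):
        # Scaled dot-product attention over all cached tokens.
        K = np.stack(self.keys)    # (seq_len, d_model)
        V = np.stack(self.values)  # (seq_len, d_model)
        scores = K @ query / np.sqrt(d_model)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V

cache = KVCache()
for step in range(8):                 # simulate decoding 8 tokens
    x = rng.standard_normal(d_model)  # stand-in for a token embedding
    cache.append(x)                   # constant projection work per step
    out = cache.attend(x @ W_q)
print(out.shape)  # (64,)

Because only the newest token is projected at each step, the per-token projection cost stays constant rather than growing with context length; this is the mechanism that the KV-caching strategies mentioned in the overview build on.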
Syllabus
Scaling Ultra Low Latency LLM Inference
Taught by
MLOps World: Machine Learning in Production