Mastering LLM Inference Optimization - From Theory to Cost Effective Deployment
AI Engineer via YouTube
Overview
Learn to optimize Large Language Model (LLM) inference for production environments in this comprehensive conference talk on the unique challenges of deploying LLMs at scale. Discover why LLM inference differs fundamentally from standard deep learning model deployment, and explore the critical factors that drive performance and cost-effectiveness. Examine current and upcoming NVIDIA GPU architectures to understand which hardware configurations work best for specific models and deployment scenarios.

Master the intricacies of building inference engines and gain deep insight into attention mechanisms, including the variants used in production. Dive into KV cache management strategies to maximize throughput per model deployment, and explore parallelization techniques (tensor, data, sequence, pipeline, and expert parallelism) to reduce latency. Understand quantization methods for weights, activations, and the KV cache to improve GPU utilization and reduce engine sizes. Learn advanced throughput optimization techniques, including in-flight batching, and analyze detailed performance metrics such as Time to First Token (TTFT), inter-token latency, and deployment characterizations to minimize cost.

Get hands-on insights into the TensorRT-LLM (TRT-LLM) inference engine and the open-source NVIDIA Triton Inference Server. Presented by Dr. Mark Moyou, Senior Data Scientist at NVIDIA, who brings extensive experience in scalable machine learning and hosts multiple AI-focused podcasts, including The AI Portfolio Podcast.
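The emphasis on KV cache management becomes concrete with a quick back-of-envelope calculation. The sketch below is a rough illustration (not taken from the talk; the model shape and FP16 assumption are ours) of how much GPU memory the KV cache consumes per token, which is what ultimately bounds batch size and throughput per deployment.

```python
# Back-of-envelope KV cache sizing (illustrative assumptions, not from the talk).
# Per token, each transformer layer stores one K and one V vector per KV head.

def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache needed per token (factor of 2 = one K plus one V)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Example: a Llama-2-7B-like shape -- 32 layers, 32 KV heads, head_dim 128, FP16.
per_token = kv_cache_bytes_per_token(32, 32, 128, 2)            # 524,288 B
print(f"{per_token / 2**20:.2f} MiB per token")                 # 0.50 MiB

# A single request with a 4,096-token context:
print(f"{4096 * per_token / 2**30:.2f} GiB per sequence")       # ~2 GiB
```

At roughly 2 GiB of cache per 4K-token sequence in this hypothetical shape, only a few dozen concurrent sequences fit alongside the weights on one GPU, which is why the talk treats cache management as a first-order throughput lever.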
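The engine-size impact of weight quantization is similarly easy to estimate. This sketch uses our own illustrative numbers: weight memory scales linearly with bits per parameter, ignoring activations, the KV cache, and runtime overhead.

```python
def engine_weight_gib(n_params, bits_per_weight):
    """Approximate weight memory for an inference engine (weights only;
    activations, KV cache, and runtime overhead are excluded)."""
    return n_params * bits_per_weight / 8 / 2**30

for name, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"7B model @ {name}: ~{engine_weight_gib(7e9, bits):.1f} GiB")
# FP16 ~13.0 GiB, 8-bit ~6.5 GiB, 4-bit ~3.3 GiB
```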
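The latency metrics the talk analyzes are also straightforward to instrument. This minimal sketch assumes a hypothetical streaming client, `stream_tokens`, that yields tokens one at a time; the function name and interface are illustrative and not part of the TRT-LLM or Triton API.

```python
import time

def measure_latency(stream_tokens, prompt):
    """Measure Time to First Token (TTFT) and mean inter-token latency
    for a hypothetical token-streaming generator."""
    start = time.perf_counter()
    timestamps = []
    for _ in stream_tokens(prompt):     # yields one token per iteration
        timestamps.append(time.perf_counter())

    ttft = timestamps[0] - start
    # Gaps between consecutive token arrivals after the first token.
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

# Usage with any generator that yields tokens:
# ttft, itl = measure_latency(my_client.stream, "Hello")
# print(f"TTFT: {ttft*1000:.1f} ms, mean inter-token latency: {itl*1000:.1f} ms")
```

TTFT is dominated by the prefill (prompt-processing) phase, while inter-token latency reflects the decode phase; the talk's deployment characterizations trade these off against throughput and cost.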
Syllabus
Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou
Taught by
AI Engineer