

Hacking the Inference Pareto Frontier

AI Engineer via YouTube

Overview

Learn advanced techniques for optimizing large language model inference systems through this 20-minute conference talk by Kyle Kranen from NVIDIA. Explore the concept of the Pareto frontier as it applies to AI model deployment, understanding how the trade-offs between cost, throughput, latency, and quality form boundaries that determine where LLM systems can be successfully applied. Discover how models and their inference systems create "token factories" with specific performance characteristics, and examine what happens when systems fall outside optimal trade-off curves. Master cutting-edge optimization techniques enabled by NVIDIA Dynamo, a datacenter-scale distributed inference framework, including disaggregation for separating LLM generation phases, speculation for predicting multiple tokens per cycle, KV routing and storage optimization, and pipelining improvements for agent workflows. Understand the three key drivers for modifying the Pareto frontier: scale, structure, and dynamism, with practical examples of worker specialization and dynamic load balancing. Gain insights into how these advanced inference techniques can reshape performance boundaries, enabling successful deployment of LLM applications that would otherwise be constrained by traditional cost-latency-quality limitations.
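One of the techniques the talk covers, speculation, can be sketched in a few lines: a cheap "draft" model proposes several tokens per cycle and the expensive "target" model verifies them in one pass, so multiple tokens land per expensive step. The models below are toy integer counters invented for illustration, not real LLMs or NVIDIA Dynamo APIs.

```python
# Minimal sketch of speculative decoding with toy stand-in "models".
# draft_model and target_model are hypothetical illustrations only.

def draft_model(context, k):
    """Cheap model: propose k next tokens, but guess wrong on multiples of 5."""
    proposal, cur = [], context[-1]
    for _ in range(k):
        cur += 1
        proposal.append(cur + 1 if cur % 5 == 0 else cur)
    return proposal

def target_model(context, proposal):
    """Expensive model: accept proposed tokens until the first mismatch,
    then emit one corrected token, so every verify pass makes progress."""
    accepted, last = [], context[-1]
    for tok in proposal:
        true_next = last + 1          # toy ground truth: count upward
        if tok != true_next:
            accepted.append(true_next)  # reject the guess, keep the correction
            break
        accepted.append(tok)
        last = tok
    return accepted

def generate(context, n_tokens, k=4):
    """Generate n_tokens, accepting up to k tokens per target-model pass."""
    out = list(context)
    while len(out) < len(context) + n_tokens:
        out.extend(target_model(out, draft_model(out, k)))
    return out[:len(context) + n_tokens]

print(generate([0], 8))  # → [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

When the draft model guesses well, all k tokens are accepted in a single verify pass; when it guesses wrong, only the corrected token is kept, which is the throughput-versus-acceptance trade-off the talk situates on the Pareto frontier.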

Syllabus

00:00 Introduction to Breaking the Inference Pareto Frontier
00:33 Introduction of Kyle Kranen and NVIDIA Dynamo
01:31 The Three Pillars of Deployment: Quality, Latency, Cost
02:11 Understanding the Pareto Frontier
03:06 Application-Specific Prioritization of Quality, Latency, and Cost
04:32 Common Techniques to Manipulate the Pareto Frontier: Quantization, RAG, Reasoning
05:19 Compounding Techniques
06:04 Three Drivers for Modifying the Pareto Frontier: Scale, Structure, Dynamism
06:20 Scale: Disaggregation
11:02 Scale: Routing
13:00 Structure: Inference Time Scaling
16:14 Structure: KV Manipulation
17:43 Dynamism: Worker Specialization
18:42 Dynamism: Dynamic Load Balancing
19:55 Conclusion and NVIDIA Dynamo Resources

Taught by

AI Engineer
