Overview
Syllabus
00:00 Introduction to Breaking the Inference Pareto Frontier
00:33 Introduction of Kyle Kranen and NVIDIA Dynamo
01:31 The Three Pillars of Deployment: Quality, Latency, Cost
02:11 Understanding the Pareto Frontier
03:06 Application-Specific Prioritization of Quality, Latency, and Cost
04:32 Common Techniques to Manipulate the Pareto Frontier: Quantization, RAG, Reasoning
05:19 Compounding Techniques
06:04 Three Drivers for Modifying the Pareto Frontier: Scale, Structure, Dynamism
06:20 Scale: Disaggregation
11:02 Scale: Routing
13:00 Structure: Inference Time Scaling
16:14 Structure: KV Manipulation
17:43 Dynamism: Worker Specialization
18:42 Dynamism: Dynamic Load Balancing
19:55 Conclusion and NVIDIA Dynamo Resources
Taught by
AI Engineer