Classroom Contents
Hacking the Inference Pareto Frontier
- 1 00:00 Introduction to Breaking the Inference Pareto Frontier
- 2 00:33 Introduction of Kyle Kranen and NVIDIA Dynamo
- 3 01:31 The Three Pillars of Deployment: Quality, Latency, Cost
- 4 02:11 Understanding the Pareto Frontier
- 5 03:06 Application-Specific Prioritization of Quality, Latency, and Cost
- 6 04:32 Common Techniques to Manipulate the Pareto Frontier: Quantization, RAG, Reasoning
- 7 05:19 Compounding Techniques
- 8 06:04 Three Drivers for Modifying the Pareto Frontier: Scale, Structure, Dynamism
- 9 06:20 Scale: Disaggregation
- 10 11:02 Scale: Routing
- 11 13:00 Structure: Inference Time Scaling
- 12 16:14 Structure: KV Manipulation
- 13 17:43 Dynamism: Worker Specialization
- 14 18:42 Dynamism: Dynamic Load Balancing
- 15 19:55 Conclusion and NVIDIA Dynamo Resources