Overview
Explore distributed and disaggregated inference techniques for scalable execution of large language models in this 41-minute conference talk by Dr. Séverine Habert from NVIDIA. Discover how reasoning and agentic AI systems can be optimized through architectural improvements such as KV caching, prefix reuse, KV-cache-aware routing, and KV-cache offloading. Learn about performance enhancements that reduce latency and enable efficient deployment of inference workloads at the cluster level. Gain insights into recent developments in high-performance computing approaches to modern AI inference, with practical applications for large-scale language model deployment. A minimal sketch of one of the named ideas, prefix reuse with a KV cache, appears below.
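The following is a minimal, illustrative Python sketch of prefix reuse with a KV cache, not the speaker's implementation or any particular serving framework's API. The cache class, function names, and the placeholder "KV" strings are assumptions made purely for illustration: requests that share a common prompt prefix (for example, a system prompt) can skip recomputing attention keys/values for that prefix and prefill only the new suffix.

```python
# Illustrative-only sketch of prefix reuse with a KV cache.
# All names here (PrefixKVCache, run_request) are hypothetical; real serving
# stacks store actual key/value tensors, typically in fixed-size blocks.
from dataclasses import dataclass, field


@dataclass
class PrefixKVCache:
    # Maps a token prefix (tuple of tokens) to its "computed" KV entry.
    store: dict = field(default_factory=dict)

    def lookup(self, tokens: tuple):
        """Return (cached_kv, matched_len) for the longest cached prefix."""
        for end in range(len(tokens), 0, -1):
            prefix = tokens[:end]
            if prefix in self.store:
                return self.store[prefix], end
        return None, 0

    def insert(self, tokens: tuple, kv) -> None:
        self.store[tokens] = kv


def run_request(cache: PrefixKVCache, tokens: tuple):
    """Prefill a request, reusing cached KV for the longest matching prefix."""
    cached_kv, matched = cache.lookup(tokens)
    if matched:
        # Only the unmatched suffix needs prefill; the prefix KV is reused.
        suffix_kv = f"kv{tokens[matched:]}"   # placeholder for real prefill
        full_kv = (cached_kv, suffix_kv)
    else:
        full_kv = f"kv{tokens}"               # full prefill, no reuse
    cache.insert(tokens, full_kv)
    return matched


if __name__ == "__main__":
    cache = PrefixKVCache()
    system_prompt = ("You", "are", "a", "helpful", "assistant", ".")
    reused_1 = run_request(cache, system_prompt + ("Hello",))
    reused_2 = run_request(cache, system_prompt + ("Summarize", "this"))
    print(f"first request reused {reused_1} tokens, second reused {reused_2}")
```

KV-cache-aware routing and offloading build on the same idea at the cluster level: a router can send a request to the worker that already holds the matching prefix, and rarely used cache blocks can be moved to CPU memory or storage instead of being discarded.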
Syllabus
HPC Café: Inference in the Age of Reasoning Models
Taught by
NHR@FAU