Overview
In this advanced technical session, discover NVIDIA Dynamo, a new distributed inference serving framework designed for deploying reasoning large language models (LLMs) across multi-node environments. Explore the framework's architecture and the key components that enable seamless scaling within data centers while driving advanced inference optimization. Learn about cutting-edge inference serving techniques, including disaggregated serving, which separates the prefill and decode phases of LLM inference to optimize request handling and increase throughput. The session also covers how to quickly deploy the framework using NVIDIA NIM. Presented by NVIDIA experts Harry Kim, Neelay Shah, Ryan Olson, and Tanmay Verma, this 89-minute technical presentation is a replay of NVIDIA GTC Session ID S73042 and features NVIDIA technologies including TensorRT, DALI, NVLink/NVSwitch, and Triton.
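To illustrate the disaggregated-serving idea the session covers, here is a minimal, purely conceptual Python sketch: prefill processes the whole prompt once to build a KV cache, and decode then generates tokens one at a time against that cache, with the two phases connected by a handoff queue so they could run on different workers. All names and data structures here are hypothetical toy stand-ins, not Dynamo's actual API.

```python
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    kv_cache: list = field(default_factory=list)  # stand-in for a real KV cache
    output: list = field(default_factory=list)

def prefill(req: Request) -> Request:
    """Compute-bound phase: process the full prompt once, filling the KV cache."""
    req.kv_cache = [f"kv({tok})" for tok in req.prompt.split()]
    return req

def decode(req: Request) -> Request:
    """Memory-bound phase: generate tokens one by one, reusing the handed-off cache."""
    for i in range(req.max_new_tokens):
        req.output.append(f"tok{i}")        # placeholder for a sampled token
        req.kv_cache.append(f"kv(tok{i})")  # the cache grows as decoding proceeds
    return req

def serve(requests):
    """Run prefill for each request, then pass results to the decode stage via a queue."""
    handoff: Queue = Queue()
    for r in requests:
        handoff.put(prefill(r))  # in a real system, a separate prefill worker pool
    results = []
    while not handoff.empty():
        results.append(decode(handoff.get()))  # and a separate decode worker pool
    return results

out = serve([Request("hello world", 3), Request("nvidia dynamo", 2)])
print([r.output for r in out])
```

Separating the phases matters because prefill is compute-bound and decode is memory-bandwidth-bound; placing each on hardware sized for its bottleneck, as disaggregated serving does, raises overall throughput.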
Syllabus
Introducing NVIDIA Dynamo: A Distributed Inference Serving Framework for Reasoning models
Taught by
NVIDIA Developer