Inside NVIDIA Dynamo - Faster, Scalable AI Deployment

This 34-minute conference talk from Ray Summit 2025 explains how NVIDIA Dynamo improves large-scale LLM inference through system-level optimizations. NVIDIA's Harry Kim shows how Dynamo integrates with high-performance engines such as vLLM, SGLang, and TensorRT-LLM to address a core challenge: delivering efficiency gains across distributed serving stacks as LLMs grow in size, context length, and real-world usage.

The talk covers Dynamo's key innovations: smart scheduling that routes requests based on KV-cache hit rates and system load while autoscaling and disaggregating the prefill and decode phases; hierarchical memory management that transparently uses HBM, CPU memory, local NVMe, and remote storage to minimize latency and maximize model capacity; and low-latency KV-cache transfer across nodes and memory tiers.

It also examines production-grade LLM serving capabilities: tools for identifying optimal disaggregated serving configurations offline, automated tuning based on real-time traffic, topology-aware gang scheduling for dynamically scaling prefill and decode workers, and LLM-specific fault-tolerance mechanisms for reliable serving at scale.

The stated goal is higher throughput, lower latency, and better cost efficiency across distributed LLM deployments, while organizations keep the flexibility to use their preferred inference engine.
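
To make the idea of routing on KV-cache hit rate and system load more concrete, here is a minimal Python sketch. It is not Dynamo's API or its actual algorithm: the `Worker` class, the `pick_worker` function, the block size, and the `load_weight` parameter are all hypothetical names and values chosen for illustration. Each worker is scored by the fraction of the request's prefix blocks it already holds, minus a penalty for its current load.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV-cache block (illustrative value, an assumption)


@dataclass
class Worker:
    """Hypothetical view of a serving worker as seen by the router."""
    name: str
    active_requests: int = 0
    cached_blocks: set = field(default_factory=set)  # hashes of cached prefix blocks


def prefix_block_hashes(tokens):
    """Hash each full block chained with the blocks before it.

    Chaining means a matching hash identifies a matching prefix,
    not just a matching block in isolation.
    """
    hashes = []
    prev = 0
    full_len = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore the partial tail block
    for i in range(0, full_len, BLOCK_SIZE):
        prev = hash((prev, tuple(tokens[i:i + BLOCK_SIZE])))
        hashes.append(prev)
    return hashes


def cache_hit_rate(worker, block_hashes):
    """Fraction of the request's leading blocks already cached on this worker."""
    if not block_hashes:
        return 0.0
    hits = 0
    for h in block_hashes:
        if h not in worker.cached_blocks:
            break  # a prefix match stops at the first missing block
        hits += 1
    return hits / len(block_hashes)


def pick_worker(workers, tokens, load_weight=0.1):
    """Choose the worker with the best trade-off of cache reuse and load."""
    block_hashes = prefix_block_hashes(tokens)

    def score(worker):
        return cache_hit_rate(worker, block_hashes) - load_weight * worker.active_requests

    best = max(workers, key=score)
    # Assume the chosen worker now holds this prefix and has one more request in flight.
    best.active_requests += 1
    best.cached_blocks.update(block_hashes)
    return best
```

In this sketch, a repeated prompt keeps landing on the worker that already cached its prefix until that worker's load penalty outweighs the cache benefit, at which point the request goes to a less busy worker. A real router would also need to handle cache eviction and request completion, which are omitted here.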