NVIDIA Dynamo Disaggregated Prefill-Decode LLM Serving and High Performance PyTorch/CUDA Optimizations
Generative AI on AWS via YouTube
Overview
Explore advanced LLM serving architectures and CUDA optimization techniques in this technical meetup featuring two specialized presentations.

Learn how NVIDIA Dynamo improves large language model serving by disaggregating the prefill and decode stages, enabling each to scale independently for higher throughput while staying within latency constraints. Discover the technical implementation details of this disaggregated serving approach and understand how it addresses performance bottlenecks in production LLM deployments.

Gain insights into high-performance PyTorch and CUDA optimization through Luminal's approach of discovering and generating optimized CUDA and Metal kernels from high-level operations, using performance-guided search to select the fastest candidates for specific workloads.

Access comprehensive resources, including GitHub repositories, O'Reilly publications, and supplementary learning materials, to deepen your understanding of AI performance engineering and generative AI systems optimization.
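To make the disaggregation idea concrete, here is a minimal conceptual sketch in Python. All names (`prefill`, `decode`, `KVCache`) are hypothetical illustrations, not Dynamo's API; in a real deployment the two stages run in separate GPU worker pools and the KV cache is transferred between them over a fast interconnect.

```python
from dataclasses import dataclass, field

@dataclass
class KVCache:
    """Stand-in for the per-request attention key/value cache
    that prefill produces and decode consumes."""
    tokens: list = field(default_factory=list)

def prefill(prompt: str) -> KVCache:
    # Compute-bound stage: process the whole prompt in one pass
    # and materialize the KV cache for the decode stage.
    return KVCache(tokens=prompt.split())

def decode(cache: KVCache, max_new_tokens: int) -> list:
    # Memory-bandwidth-bound stage: generate one token per step,
    # reading and extending the KV cache each iteration.
    generated = []
    for step in range(max_new_tokens):
        new_token = f"tok{step}"  # placeholder for real sampling
        generated.append(new_token)
        cache.tokens.append(new_token)
    return generated

# The KV cache is the handoff point between the two worker pools;
# because the stages are decoupled, each pool can scale on its own.
cache = prefill("Explain disaggregated serving")
out = decode(cache, max_new_tokens=3)
print(out)
```

Because prefill is compute-bound and decode is memory-bandwidth-bound, separating them lets an operator size each pool for its own bottleneck instead of provisioning one fleet for both.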
Syllabus
NVIDIA Dynamo + Disaggregated Prefill-Decode LLM Serving + PyTorch/CUDA Performance with Luminal
Taught by
Generative AI on AWS