NVIDIA Dynamo Disaggregated Prefill-Decode LLM Serving and High Performance PyTorch/CUDA Optimizations
Generative AI on AWS via YouTube
Overview
Explore advanced LLM serving architectures and CUDA optimization techniques in this technical meetup featuring two presentations. Learn how NVIDIA Dynamo serves large language models by disaggregating the prefill and decode stages, so each stage can be scaled independently to improve throughput while meeting latency constraints, and see how this disaggregated approach addresses performance bottlenecks in production LLM deployments.

Then turn to high-performance PyTorch and CUDA optimization with Luminal, which discovers and generates optimized CUDA and Metal kernels from high-level operations, using performance-guided search to select the fastest candidate kernels for a given workload. Supplementary resources include GitHub repositories, O'Reilly publications, and additional learning materials on AI performance engineering and generative AI systems optimization.
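The disaggregation idea above can be sketched in a few lines: a prefill worker processes the whole prompt once and hands off a KV cache, and a separate decode worker generates tokens from that cache. This is a minimal toy simulation in Python, not Dynamo's actual API; the function names and fake token values are illustrative assumptions only.

```python
# Toy sketch of disaggregated prefill/decode serving
# (assumption: simplified stand-in, not NVIDIA Dynamo's real interface).

def prefill(prompt_tokens):
    # Prefill worker: in a real system this is one large batched forward
    # pass over the prompt; here we fabricate key/value pairs per token.
    kv_cache = [(t, t * 2) for t in prompt_tokens]
    return kv_cache

def decode(kv_cache, num_new_tokens):
    # Decode worker: consumes the transferred KV cache and generates
    # tokens one at a time, appending each new entry to the cache.
    out = []
    for _ in range(num_new_tokens):
        nxt = len(kv_cache)  # fake "next token" derived from cache length
        kv_cache.append((nxt, nxt * 2))
        out.append(nxt)
    return out

# Because the two stages run as separate workers, each can be scaled
# independently: more prefill replicas for long prompts, more decode
# replicas for long generations.
cache = prefill([10, 11, 12])
tokens = decode(cache, 4)
```

The key property the sketch shows is the clean handoff: once the KV cache is transferred, the decode stage never re-reads the prompt, which is what lets the two stages scale on different hardware pools.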
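The kernel-search idea behind Luminal can likewise be illustrated with a tiny benchmark loop: generate several candidate implementations of the same operation, verify they agree, time each on a representative workload, and keep the fastest. This is an illustrative Python analogy, not Luminal's actual Rust codegen or its CUDA/Metal kernels; the candidate functions here are assumptions for demonstration.

```python
import timeit

# Hedged sketch of performance-guided kernel selection: benchmark
# interchangeable candidates and pick the fastest for this workload.

def sum_loop(xs):
    # Candidate 1: explicit Python loop.
    total = 0
    for x in xs:
        total += x
    return total

def sum_builtin(xs):
    # Candidate 2: built-in sum, implemented in C.
    return sum(xs)

candidates = {"loop": sum_loop, "builtin": sum_builtin}
workload = list(range(10_000))

# Every candidate must produce the same result before timing is meaningful.
reference = sum_builtin(workload)
assert all(f(workload) == reference for f in candidates.values())

# Time each candidate on the representative workload and keep the fastest.
timings = {
    name: timeit.timeit(lambda f=f: f(workload), number=50)
    for name, f in candidates.items()
}
fastest = min(timings, key=timings.get)
```

The same loop structure scales up to real kernel search: the candidate set becomes generated GPU kernels, the workload becomes the target tensor shapes, and correctness checking against a reference output stays mandatory before any timing comparison.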
Syllabus
NVIDIA Dynamo + Disaggregated Prefill-Decode LLM Serving + PyTorch/CUDA Performance with Luminal
Taught by
Generative AI on AWS