NVIDIA Dynamo Disaggregated Prefill-Decode LLM Serving and High Performance PyTorch/CUDA Optimizations
Generative AI on AWS via YouTube
Overview
Explore advanced LLM serving architectures and CUDA optimization techniques in this technical meetup, which features two specialized presentations. The first covers NVIDIA Dynamo, which serves large language models by disaggregating the prefill and decode stages so that each can be scaled independently, improving throughput while staying within latency constraints. It walks through the implementation details of this disaggregated approach and the performance bottlenecks it addresses in production LLM deployments. The second covers high-performance PyTorch and CUDA optimization with Luminal, which discovers and generates optimized CUDA and Metal kernels from high-level operations and uses performance-guided search to select the fastest candidates for a specific workload. Supplementary resources include GitHub repositories, O'Reilly publications, and further learning materials on AI performance engineering and generative AI systems optimization.
Syllabus
NVIDIA Dynamo + Disaggregated Prefill-Decode LLM Serving + PyTorch/CUDA Performance with Luminal
Taught by
Generative AI on AWS