

NVIDIA Dynamo Disaggregated Prefill-Decode LLM Serving and High Performance PyTorch/CUDA Optimizations

Generative AI on AWS via YouTube

Overview

Explore advanced LLM serving architectures and CUDA optimization techniques in this technical meetup featuring two specialized presentations. Learn how NVIDIA Dynamo restructures large language model serving by disaggregating the prefill and decode stages, allowing each to scale independently for higher throughput while still meeting latency constraints. Discover the technical implementation details of this disaggregated serving approach and how it addresses performance bottlenecks in production LLM deployments.

Then gain insights into high-performance PyTorch and CUDA optimization through Luminal's approach to generating optimized CUDA and Metal kernels from high-level operations, using search-based autotuning to benchmark candidate kernels and select the fastest one for a given workload. Access accompanying resources, including GitHub repositories, O'Reilly publications, and supplementary learning materials, to deepen your understanding of AI performance engineering and generative AI systems optimization.
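To make the disaggregation idea concrete, here is a minimal sketch of prefill-decode separation. All names and data structures are hypothetical stand-ins for illustration; this is not NVIDIA Dynamo's actual API. The point is only the division of labor: prefill processes the whole prompt once to build the KV cache, while decode generates tokens incrementally against that cache, so the two stages can run on separate, independently scaled workers.

```python
# Illustrative sketch of disaggregated prefill-decode serving.
# All names (KVCache, prefill, decode) are hypothetical, not Dynamo's API.

from dataclasses import dataclass, field


@dataclass
class KVCache:
    # Stand-in for the attention key/value cache built during prefill.
    tokens: list = field(default_factory=list)


def prefill(prompt_tokens):
    """Prefill stage: process the full prompt once and build the KV cache.

    In a disaggregated deployment this runs on a dedicated prefill worker,
    which is compute-bound and sized for large-batch prompt processing.
    """
    return KVCache(tokens=list(prompt_tokens))


def decode(cache, max_new_tokens):
    """Decode stage: generate tokens one at a time against the cache.

    Runs on a separate decode worker, which is memory-bandwidth-bound
    and can be scaled independently of the prefill fleet.
    """
    generated = []
    for _ in range(max_new_tokens):
        # Toy "model": the next token is just the running sequence length.
        next_token = len(cache.tokens)
        cache.tokens.append(next_token)
        generated.append(next_token)
    return generated


# A prompt is prefilled once; decoding then proceeds incrementally.
cache = prefill([101, 102, 103])
out = decode(cache, max_new_tokens=3)
print(out)  # -> [3, 4, 5]
```

In a real deployment, the KV cache produced by the prefill worker is transferred to the decode worker over a fast interconnect, which is where much of the engineering effort in disaggregated serving goes.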

Syllabus

NVIDIA Dynamo + Disaggregated Prefill-Decode LLM Serving + PyTorch/CUDA Performance with Luminal
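The kernel-search idea covered in the Luminal portion of the session can be sketched in miniature: generate several candidate implementations of the same operation, benchmark each on representative inputs, and keep the fastest. The candidates below are plain Python functions rather than real CUDA/Metal kernels, and all names are illustrative, not Luminal's actual API.

```python
# Illustrative sketch of search-based kernel selection: time several
# functionally equivalent candidates and pick the fastest for this workload.
# Candidates here are plain Python, not real CUDA/Metal kernels.

import timeit


def matmul_naive(a, b):
    # Candidate 1: textbook triple loop.
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            out[i][j] = s
    return out


def matmul_zip(a, b):
    # Candidate 2: same result, different loop structure (transpose b first).
    bt = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def select_fastest(candidates, *args, repeats=5):
    """Benchmark each candidate on the given inputs; return (name, fn) of the best."""
    timings = {}
    for name, fn in candidates.items():
        timings[name] = min(
            timeit.repeat(lambda: fn(*args), number=10, repeat=repeats)
        )
    best = min(timings, key=timings.get)
    return best, candidates[best]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
name, kernel = select_fastest({"naive": matmul_naive, "zip": matmul_zip}, a, b)
# All candidates must agree on results; only speed differs.
assert kernel(a, b) == matmul_naive(a, b)
```

Real kernel-search systems add a much larger candidate space (tilings, memory layouts, fusion choices) and verify numerical equivalence before trusting timing results, but the select-by-measurement loop is the core idea.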

Taught by

Generative AI on AWS

