PyTorch Data Loader Tuning and GPU Cross-Architecture Optimizations: CUDA and AMD
Generative AI on AWS via YouTube
Overview
This webinar features two technical talks on AI performance optimization. After an introduction by Chris Fregly, Chaim Rand presents "Solving Bottlenecks with Data Input Pipeline with PyTorch Profiler and TensorBoard," which shows how to identify and resolve performance bottlenecks in PyTorch data pipelines using profiling tools. Quentin Anthony then presents "How to Write Cross-Architecture Kernels: NVIDIA CUDA and AMD ROCm," which explains techniques for developing GPU kernels that run efficiently on both NVIDIA and AMD hardware, a topic particularly relevant for deploying modern AI models such as DeepSeek-R1 and Llama-4. The talk also covers kernel sizing and cross-architecture optimization strategies for different SIMD hardware implementations. Additional resources include a GitHub repository, a related O'Reilly book, and free Generative AI course materials for further study of AI performance engineering.
Syllabus
PyTorch Data Loader Tuning + GPU Cross-Architecture Optimizations: CUDA and AMD
Taught by
Generative AI on AWS