PyTorch Data Loader Tuning and GPU Cross-Architecture Optimizations: CUDA and AMD
Generative AI on AWS via YouTube
Overview
This webinar features two technical talks on AI performance optimization. After an introduction by Chris Fregly, Chaim Rand presents "Solving Bottlenecks with Data Input Pipeline with PyTorch Profiler and TensorBoard," which shows how to use profiling tools to identify and resolve performance bottlenecks in PyTorch data input pipelines.

Next, Quentin Anthony presents "How to Write Cross-Architecture Kernels: NVIDIA CUDA and AMD ROCm," explaining techniques for writing GPU kernels that run efficiently on both NVIDIA and AMD hardware. This is particularly relevant for deploying modern AI models such as DeepSeek-R1 and Llama-4. The talk covers kernel sizing and cross-architecture optimization strategies for different SIMD hardware implementations.

Additional resources include a GitHub repository, a related O'Reilly book, and free Generative AI course materials for further study of AI performance engineering.
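The first talk uses PyTorch Profiler and TensorBoard to find input-pipeline bottlenecks. The sketch below does not use those tools. It is a simpler, stdlib-only stand-in for the same diagnosis: time how long each training step blocks waiting for the next batch versus how long the step itself takes. The names `measure_input_stall`, `slow_loader`, and `fake_step` are hypothetical, and the loader and step are simulated with `time.sleep`. On a real GPU, kernels launch asynchronously, so the step timing would need a synchronization point (for example `torch.cuda.synchronize()`) to be meaningful.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_input_stall(
    loader: Iterable,
    train_step: Callable[[object], None],
    max_steps: int = 50,
) -> Tuple[float, float, int]:
    """Return (seconds waiting on the loader, seconds in train_step, steps run)."""
    wait_s = 0.0
    step_s = 0.0
    steps = 0
    batches = iter(loader)
    while steps < max_steps:
        t0 = time.perf_counter()
        try:
            batch = next(batches)  # time blocked on the input pipeline
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)  # time spent in the (simulated) training step
        t2 = time.perf_counter()
        wait_s += t1 - t0
        step_s += t2 - t1
        steps += 1
    return wait_s, step_s, steps


def slow_loader(n_batches: int, delay_s: float) -> Iterator[int]:
    """Hypothetical loader: sleeps to simulate expensive decoding/augmentation."""
    for i in range(n_batches):
        time.sleep(delay_s)
        yield i


def fake_step(batch: object) -> None:
    """Hypothetical training step that takes about 2 ms."""
    time.sleep(0.002)


if __name__ == "__main__":
    wait, step, n = measure_input_stall(slow_loader(20, 0.01), fake_step)
    stall = wait / (wait + step)
    print(f"{n} steps, data wait {wait:.3f}s, compute {step:.3f}s, stall {stall:.0%}")
```

A stall fraction near zero means the training step dominates; a large fraction suggests tuning the input pipeline first (for example, raising `num_workers` or setting `pin_memory=True` on `torch.utils.data.DataLoader`) before optimizing the model itself.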
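One concrete kernel-sizing difference between the two vendors is SIMD width. NVIDIA GPUs execute threads in 32-thread warps, while AMD CDNA data-center GPUs (such as the MI200 and MI300 series) execute 64-thread wavefronts. A block size that is a whole number of warps can therefore leave a wavefront partly idle on AMD. The helper below is a hypothetical sketch of that idea and is not taken from the talk. It rounds a requested block size up to a multiple of the SIMD width, caps it at an assumed 1024-thread limit, and shrinks it for small inputs. The names `launch_config`, `LaunchConfig`, and the default target of 256 threads are illustrative assumptions.

```python
from dataclasses import dataclass

# Assumed SIMD widths: NVIDIA warp = 32 threads, AMD CDNA wavefront = 64 threads.
SIMD_WIDTH = {"nvidia": 32, "amd_cdna": 64}


@dataclass(frozen=True)
class LaunchConfig:
    block_size: int  # threads per block (CUDA) / workgroup (HIP)
    grid_size: int   # number of blocks needed to cover every element


def _round_up(x: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= x."""
    return -(-x // multiple) * multiple


def launch_config(
    n_elements: int, simd_width: int, target_block: int = 256, max_block: int = 1024
) -> LaunchConfig:
    """Hypothetical 1-D launch sizing that keeps blocks a whole number of warps/wavefronts."""
    if n_elements <= 0 or simd_width <= 0 or simd_width > max_block:
        raise ValueError("invalid sizes")
    block = _round_up(target_block, simd_width)               # whole warps/wavefronts only
    block = min(block, (max_block // simd_width) * simd_width)  # respect the assumed limit
    block = min(block, _round_up(n_elements, simd_width))      # don't oversize tiny inputs
    grid = -(-n_elements // block)                             # ceil: every element covered
    return LaunchConfig(block_size=block, grid_size=grid)


if __name__ == "__main__":
    for vendor, width in SIMD_WIDTH.items():
        print(vendor, launch_config(1_000_000, width, target_block=96))
    # nvidia LaunchConfig(block_size=96, grid_size=10417)
    # amd_cdna LaunchConfig(block_size=128, grid_size=7813)
```

In HIP and CUDA device code, the built-in `warpSize` reports the actual width at run time, which is generally safer than hard-coding 32 or 64, since some AMD consumer (RDNA) GPUs can also run in 32-wide mode.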
Syllabus
PyTorch Data Loader Tuning + GPU Cross-Architecture Optimizations: CUDA and AMD
Taught by
Generative AI on AWS