PyTorch Data Loader Tuning and GPU Cross-Architecture Optimizations: CUDA and AMD

This webinar features two technical talks on AI performance optimization. Begin with an introduction by Chris Fregly, followed by Chaim Rand's presentation, "Solving Bottlenecks with Data Input Pipeline with PyTorch Profiler and TensorBoard," which shows how to identify and resolve performance bottlenecks in PyTorch data pipelines using profiling tools. Then dive into Quentin Anthony's talk, "How to Write Cross-Architecture Kernels: NVIDIA CUDA and AMD ROCm," in which he explains techniques for developing GPU kernels that run efficiently on both NVIDIA and AMD hardware, a topic particularly relevant to deploying modern AI models such as DeepSeek-R1 and Llama-4. Learn about kernel sizing and cross-architecture optimization strategies for different SIMD hardware implementations. Additional resources include a GitHub repository, a related O'Reilly book, and free Generative AI course materials to deepen your understanding of AI performance engineering.