
Mastering GPU Parallel Programming with CUDA: ( HW & SW )

via Udemy

Overview

Performance Optimization and Analysis for High-Performance Computing

What you'll learn:
  • Comprehensive understanding of GPU vs. CPU architecture
  • Learn the history of the graphics processing unit (GPU) up to the most recent products
  • Understand the internal structure of the GPU
  • Understand the different types of memory and how they affect performance
  • Understand the most recent technologies in GPU internal components
  • Understand the basics of CUDA programming on the GPU
  • Start programming GPUs with CUDA on both Windows and Linux
  • Understand the most efficient approaches to parallelization
  • Profiling and performance tuning
  • Leveraging shared memory

This hands-on course teaches you how to unlock the huge parallel-processing power of modern GPUs with CUDA. You’ll start with the fundamentals of GPU hardware, trace the evolution of flagship architectures (Fermi → Pascal → Volta → Ampere → Hopper), and learn—through code-along labs—how to write, profile, and optimize high-performance kernels.

This is an independent training resource. It is not sponsored by, endorsed by, or otherwise affiliated with NVIDIA Corporation. “CUDA”, “Nsight”, and the architecture codenames are trademarks of NVIDIA and are used here only as factual references.

What you’ll master

  • GPU vs. CPU fundamentals – why GPUs dominate data-parallel workloads.

  • Generational design advances – the hardware features that matter most for performance.

  • CUDA toolkit installation – Windows, Linux, and WSL, plus first-run sanity checks.

  • Core CUDA concepts – threads, blocks, grids, and the memory hierarchy, built up with labs such as vector addition.

  • Profiling & tuning with Nsight Compute / nvprof – measure occupancy, hide latency, and break bottlenecks.

  • 2-D indexing for matrices – write efficient kernels for real-world linear-algebra tasks.

  • Optimization playbook – handle non-power-of-two data, leverage shared memory, maximize bandwidth, and minimize warp divergence.

  • Robust debugging & error handling – use runtime-API checks to ship production-ready code.
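The core concepts above — threads, blocks, grids, and the vector-addition lab — can be sketched in a few lines. This is a minimal illustration, not course material; the unified-memory allocation is an assumption chosen to keep the host code short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal vector-addition kernel: each thread computes one element.
// The global index is built from the block index, block size, and
// thread index -- the core CUDA indexing pattern.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: the grid may overshoot n
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);   // unified memory keeps the demo short
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;                          // threads per block
    int blocks  = (n + threads - 1) / threads;  // ceil-divide to cover all n
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();                    // wait for the async launch
    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The ceil-divide for the block count is why the `i < n` guard matters: the last block usually has threads past the end of the array.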

By the end, you’ll be able to design, analyze, and fine-tune CUDA kernels that run efficiently on today’s GPUs—equipping you to tackle demanding scientific, engineering, and AI workloads.
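The runtime-API error checking mentioned above is usually done with a wrapper macro. `CUDA_CHECK` below is a common community convention, not an official NVIDIA API; it wraps each runtime call and reports the failure with file and line context.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// CUDA_CHECK: a conventional (unofficial) error-checking wrapper around
// CUDA runtime calls. On failure it prints the error string and exits.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                  \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

int main() {
    float *d;
    CUDA_CHECK(cudaMalloc(&d, 1024 * sizeof(float)));
    // Kernel launches return no status directly; query it explicitly:
    //   CUDA_CHECK(cudaGetLastError());       // launch-configuration errors
    //   CUDA_CHECK(cudaDeviceSynchronize());  // asynchronous execution errors
    CUDA_CHECK(cudaFree(d));
    return 0;
}
```

The two commented lines are the key habit: a bad launch configuration surfaces in `cudaGetLastError()`, while errors inside the kernel only appear after a synchronizing call.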

Syllabus

  • Introduction to NVIDIA GPU hardware
  • Installing CUDA and other programs
  • Introduction to CUDA programming
  • Profiling
  • Performance analysis for the previous applications
  • 2D Indexing
  • Shared Memory + Warp Divergence
  • Debugging tools
  • Vector Reduction
  • Roofline model
  • Matrix Multiplication (Bonus)
  • Profiling - Nsight Systems
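The 2D Indexing module covers the pattern used throughout the matrix labs: a 2-D grid of 2-D blocks, with each thread mapped to one (row, col) pair. A hedged sketch, assuming a row-major matrix layout:

```cuda
#include <cuda_runtime.h>

// 2-D indexing sketch: scale every element of a rows x cols matrix.
// Each thread owns one (row, col); x covers columns, y covers rows.
__global__ void scaleMatrix(float *m, int rows, int cols, float s) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols)      // guard the matrix edges
        m[row * cols + col] *= s;      // row-major flattening
}

void launchScale(float *d_m, int rows, int cols, float s) {
    dim3 block(16, 16);                               // 256 threads per block
    dim3 grid((cols + block.x - 1) / block.x,         // ceil-divide columns
              (rows + block.y - 1) / block.y);        // ceil-divide rows
    scaleMatrix<<<grid, block>>>(d_m, rows, cols, s);
}
```

Mapping `x` to columns keeps neighboring threads on neighboring memory addresses, which is what makes global-memory accesses coalesce.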

Taught by

Hamdy egy

Reviews

4.5 rating at Udemy based on 542 ratings
