Class Central is learner-supported. When you buy through links on our site, we may earn an affiliate commission.

Coursera

GPU Programming with C++ and CUDA

Packt via Coursera

Overview

In this course, you’ll master GPU programming using C++ and CUDA to significantly enhance your software's performance. By focusing on parallelism, you’ll learn to leverage the full power of GPUs for high-performance computing applications. You will acquire practical knowledge of managing GPU devices, optimizing GPU resource usage, and integrating GPU code with Python to build scalable and efficient applications.

This course combines fundamental theory with hands-on applications, emphasizing real-world strategies for optimizing performance and building reusable libraries. You'll not only understand the core concepts but also implement them in real-world projects, such as creating libraries for Python integration.

Ideal for C++ developers with experience in basic programming concepts, this course will take you through advanced topics, from parallel algorithms to multi-GPU usage. A background in operating systems is recommended for tackling the more complex concepts. Based on the book GPU Programming with C++ and CUDA, by Paulo Motta.
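To give a flavor of the kind of code the course builds toward, here is a minimal CUDA vector-addition kernel. This is an illustrative sketch, not an excerpt from the course materials; it uses unified memory (`cudaMallocManaged`) for brevity, while the course also covers explicit host/device transfers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // One thread per element, rounded up to whole blocks.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %.1f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Compile with `nvcc vecadd.cu -o vecadd` on a machine with the CUDA toolkit installed.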

Syllabus

  • Introduction to Parallel Programming
    • In this section, we explore parallelism in software, its importance, and the differences between CPU and GPU architectures to build a foundation for GPU programming.
  • Setting Up Your Development Environment
    • In this section, we configure a GPU environment using Docker, locate official Linux documentation, and install the CUDA toolkit on Ubuntu 20.04 or 22.04 for AI and machine learning workflows.
  • Hello CUDA
    • In this section, we introduce GPU programming fundamentals, including kernel execution, device inspection, and setting up a working environment for CUDA development.
  • Hello Again, but in Parallel
    • In this section, we explore SIMD execution, data movement, and parallel vector addition for GPU programming.
  • A Closer Look into the World of GPUs
    • In this section, we explore GPU thread, block, and grid configurations, asynchronous data transfer, streams, events, and shared memory to optimize performance in parallel computing.
  • Parallel Algorithms with CUDA
    • In this section, we explore parallel algorithm design, focusing on matrix operations, reduction, and workload balancing for efficient GPU execution.
  • Performance Strategies
    • In this section, we explore GPU optimization strategies and profiling with NVIDIA Nsight Compute.
  • Overlaying Multiple Operations
    • In this section, we explore debugging CUDA code with VS Code, using CUDA streams to overlap memory and kernel operations, and configuring multiple GPUs for parallel processing.
  • Exposing Your Code to Python
    • In this section, we explore methods to integrate C++ GPU code with Python, focusing on Ctypes, custom wrappers, and performance analysis for efficient cross-language execution.
  • Exploring Existing GPU Models
    • In this section, we explore GPU development using cuBLAS and Thrust, optimize code for memory and thread efficiency, and test with GTest and Pytest to ensure reliability and performance.
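The shared-memory and reduction topics above can be sketched with a classic block-level tree reduction. This is a simplified illustration under the assumption of a power-of-two block size, not the course's own implementation; production code would also handle warp-level optimizations.

```cuda
// Block-level sum reduction using shared memory (illustrative sketch).
// Each block writes one partial sum; a second pass (or host loop)
// combines the per-block results.
__global__ void blockSum(const float *in, float *out, int n) {
    extern __shared__ float sdata[];
    unsigned tid = threadIdx.x;
    unsigned i = blockIdx.x * blockDim.x + tid;

    // Stage one element per thread into fast shared memory.
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}
```

The kernel would be launched with dynamic shared memory sized to the block, e.g. `blockSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);`.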

Taught by

Packt - Course Instructors
