
Shared Memory Optimization


Details

Provider: CodeSignal
Pricing: Free Certificate
Languages: English
Certificate: Certificate Available
Effort: 3 hours
Sessions: Self-Paced
Level: Advanced

Found in

Part of: Introduction to CUDA Kernel Programming

Overview

In this course, you will reduce the cost of global memory latency by staging data in fast, on-chip shared memory. You will learn to declare shared memory, synchronize the threads of a block that cooperate through it, implement a boundary-safe tiled matrix multiplication algorithm, and empirically compare it against a naive implementation using validated benchmarks.
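To give a sense of the technique, here is a minimal sketch of a boundary-safe tiled matrix multiplication in CUDA. It is an illustration under stated assumptions, not the course's own code: the tile width `TILE`, the kernel name `tiledMatMul`, and the host wrapper `tiledMatMulHost` are hypothetical, matrices are assumed to be row-major, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 16;  // assumed tile width; the course may use a different value

// Computes C (M x N) = A (M x K) * B (K x N); all matrices are row-major.
__global__ void tiledMatMul(const float* A, const float* B, float* C,
                            int M, int K, int N) {
    __shared__ float tileA[TILE][TILE];
    __shared__ float tileB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    int numTiles = (K + TILE - 1) / TILE;  // round up so a partial last tile is covered
    for (int t = 0; t < numTiles; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;

        // Boundary guard: out-of-range elements are loaded as zero,
        // so edge tiles contribute nothing extra to the dot product.
        tileA[threadIdx.y][threadIdx.x] =
            (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] =
            (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // every load must finish before any thread reads the tiles

        for (int k = 0; k < TILE; ++k)
            acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();  // every read must finish before the tiles are overwritten
    }

    if (row < M && col < N)  // threads outside the matrix write nothing
        C[row * N + col] = acc;
}

// Hypothetical host wrapper: copies inputs to the GPU, launches the kernel,
// and returns the result. Error checking is omitted for brevity.
std::vector<float> tiledMatMulHost(const std::vector<float>& A,
                                   const std::vector<float>& B,
                                   int M, int K, int N) {
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, static_cast<size_t>(M) * N * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    tiledMatMul<<<grid, block>>>(dA, dB, dC, M, K, N);

    std::vector<float> C(static_cast<size_t>(M) * N);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

The two `__syncthreads()` calls do different jobs: the first keeps any thread from reading a tile that is still being filled, and the second keeps any thread from overwriting a tile that others are still reading. Because the guards zero-fill out-of-range loads, the sketch should also handle dimensions that are not multiples of `TILE`.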

Syllabus

Unit 1: Shared Memory Declaration

Shared Memory Mystery
Shared Memory Teamwork
Shared Tile Debugging
Building Block Cooperation
Indexing Under Pressure

Unit 2: Thread Synchronization

The Transpose Mystery
Completing the Tile Transpose
Barrier Trouble in Transpose
Shared Memory Row Flip
Mirror Tile Challenge

Unit 3: Implementing Tiled Computation

Shared Memory Race Rescue
Guarding Edge Tiles
Building Tile Loads
Covering the Whole Matrix
Finishing Tiled Matrix Multiply

Unit 4: Comparative Performance Analysis

Trusting Benchmark Results
Timing the Benchmark Right
Guarding Matrix Edges
Warming Up GPU Benchmarks
Benchmarking with Lambdas
Cleaning Up Benchmark Timing
Benchmarking Across Matrix Sizes
