In this course, you will tackle global memory latency by using fast, on-chip shared memory. You will learn to declare shared memory, synchronize the threads that cooperate through it, implement a boundary-safe tiled matrix multiplication algorithm, and empirically compare it against a naive implementation using validated benchmarks.
Overview
Syllabus
- Unit 1: Shared Memory Declaration
  - Shared Memory Mystery
  - Shared Memory Teamwork
  - Shared Tile Debugging
  - Building Block Cooperation
  - Indexing Under Pressure
- Unit 2: Thread Synchronization
  - The Transpose Mystery
  - Completing the Tile Transpose
  - Barrier Trouble in Transpose
  - Shared Memory Row Flip
  - Mirror Tile Challenge
- Unit 3: Implementing Tiled Computation
  - Shared Memory Race Rescue
  - Guarding Edge Tiles
  - Building Tile Loads
  - Covering the Whole Matrix
  - Finishing Tiled Matrix Multiply
- Unit 4: Comparative Performance Analysis
  - Trusting Benchmark Results
  - Timing the Benchmark Right
  - Guarding Matrix Edges
  - Warming Up GPU Benchmarks
  - Benchmarking with Lambdas
  - Cleaning Up Benchmark Timing
  - Benchmarking Across Matrix Sizes