Overview
Learn about Voltrix-SpMM, a GPU kernel design for sparse matrix-matrix multiplication (SpMM) on Tensor Cores, presented at USENIX ATC '25. Discover how researchers from Wuhan University, NVIDIA Corporation, and the University of Macau address a core difficulty: Tensor Cores are built for dense computation, while the input matrices are inherently sparse. Explore the asynchronous data loading pipeline, which uses a bit-wise compressed format for sparse matrices and bulk memory copy instructions for dense matrices, together with a warp-specialized producer-consumer model that overlaps data loading with computation. Examine the persistent, I/O co-balanced kernel mechanism, whose two-stage partition strategy is designed to balance input and output operations. Understand how this CUDA 12.6 implementation achieves average speedups of 36.5x over TC-GNN, 1.8x over DTC-SpMM, and 1.7x over RoDe, with applications in scientific computing and machine learning.
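To make the idea of a bit-wise compressed sparse format more concrete, here is a minimal Python sketch. It is not Voltrix's actual layout, which this overview does not specify. The 8x8 tile size, the row-major bit order, and the names `TILE`, `compress_tile`, and `decompress_tile` are all assumptions made for illustration. Each tile is stored as one 64-bit mask marking where the nonzeros sit, plus the nonzero values packed in order.

```python
TILE = 8  # hypothetical tile edge: 8 * 8 = 64 cells, so one 64-bit mask per tile


def compress_tile(tile):
    """Pack a TILE x TILE tile into (bitmask, packed nonzero values)."""
    mask = 0
    values = []
    for r in range(TILE):
        for c in range(TILE):
            v = tile[r][c]
            if v != 0:
                # Bit (r * TILE + c) records a nonzero at row r, column c.
                mask |= 1 << (r * TILE + c)
                values.append(v)
    return mask, values


def decompress_tile(mask, values):
    """Rebuild the dense tile from the bitmask and the packed values."""
    tile = [[0.0] * TILE for _ in range(TILE)]
    k = 0  # index of the next packed value to place
    for i in range(TILE * TILE):
        if (mask >> i) & 1:
            tile[i // TILE][i % TILE] = values[k]
            k += 1
    return tile


# A mostly empty tile with three nonzeros.
tile = [[0.0] * TILE for _ in range(TILE)]
tile[0][1] = 2.0
tile[3][3] = -1.0
tile[7][0] = 4.0

mask, values = compress_tile(tile)
restored = decompress_tile(mask, values)
```

In this toy example the tile shrinks from 64 stored values to one 64-bit mask and three values, and decompression restores the dense layout that a dense matrix unit expects. A real GPU kernel would do the equivalent work with hardware bit operations across many tiles in parallel.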
Syllabus
USENIX ATC '25 - Voltrix: Sparse Matrix-Matrix Multiplication on Tensor Cores with Asynchronous...
Taught by
USENIX