High Performance Unstructured SpMM Computation Using Tensor Cores
Scalable Parallel Computing Lab, SPCL @ ETH Zurich via YouTube
Overview
This conference talk presents "High Performance Unstructured SpMM Computation Using Tensor Cores," research on optimizing sparse matrix-matrix multiplication. It shows how the SMaT (Sparse Matrix Matrix Tensor Core-accelerated) library uses Tensor Cores efficiently for unstructured sparse matrices, overcoming the hardware constraints that typically limit sparse computations to structured patterns. The presentation explains how the library builds on the low-level CUDA MMA API to maximize GPU performance, together with algorithmic optimizations such as a sparse-matrix permutation that minimizes the number of non-zero blocks. Evaluation results show SMaT outperforming state-of-the-art libraries by up to 125x (2.6x on average) on NVIDIA A100 GPUs. The talk introduces the problem, details the SMaT approach and its performance-modeling methodology, presents comprehensive evaluation results, and concludes with implications for scientific computing, large-model training, and inference. The 31-minute presentation from ETH Zurich's Scalable Parallel Computing Lab was delivered at SC '24, the International Conference for High Performance Computing, Networking, Storage, and Analysis.
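To make the permutation idea concrete: when a sparse matrix is tiled into fixed-size blocks for Tensor Cores, reordering rows so that rows with similar sparsity patterns sit in the same block row can shrink the number of non-zero blocks that must be processed. The sketch below is only an illustration of that effect on a toy matrix; the counting function, the hand-picked permutation, and the matrix itself are illustrative assumptions, not SMaT's actual reordering algorithm.

```python
import numpy as np

def count_nonzero_blocks(A, b):
    """Count the b x b tiles of A that contain at least one non-zero.
    Each such tile would cost one Tensor Core MMA in a blocked SpMM."""
    rows, cols = A.shape
    count = 0
    for i in range(0, rows, b):
        for j in range(0, cols, b):
            if np.any(A[i:i + b, j:j + b]):
                count += 1
    return count

# Toy 4x4 sparse matrix: rows 0 and 2 share one sparsity pattern,
# rows 1 and 3 share another (hypothetical example).
A = np.array([
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
], dtype=np.float32)

b = 2
print(count_nonzero_blocks(A, b))             # natural order: 4 non-zero blocks

perm = [0, 2, 1, 3]                           # group rows with matching patterns
print(count_nonzero_blocks(A[perm], b))       # after permutation: 2 non-zero blocks
```

Grouping the two matching rows into each block row halves the number of dense tiles here, which directly reduces the MMA work a blocked SpMM kernel would issue.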
Syllabus
00:00 Introduction
03:45 SMaT: Sparse Matrix Matrix Tensor Core-accelerated
06:10 Performance Model
09:35 Evaluation
21:30 Conclusion
Taught by
Scalable Parallel Computing Lab, SPCL @ ETH Zurich