Overview
Learn about GeneralSparse, a novel GPU-based solution for optimizing sparse matrix multiplication (SpMM) in pruned Large Language Models, in this 19-minute conference presentation from USENIX ATC '25. Discover how researchers from SKLP, Institute of Computing Technology, CAS, and the University of Chinese Academy of Sciences address the computational challenges posed by the rapid growth of generative model parameters and the resulting deployment difficulties in weight storage and inference latency.

Explore how weight pruning transforms dense matrix multiplications into SpMM computations, and understand why existing solutions struggle with the diverse sparsity patterns produced by different pruning methods. Examine GeneralSparse's approach, which builds on memory-access and reduction-space abstractions, featuring a dynamic box-division process that adapts to various pruning patterns and hierarchical reduction algorithms optimized for GPU architectures.

Review performance results showing up to a 20.82× speedup over the cuSPARSE library on pruned LLM weight matrices and SuiteSparse collections, plus up to a 2.33× speedup in end-to-end LLM inference compared to existing counterparts.
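To make the core idea concrete, here is a minimal illustrative sketch (not the GeneralSparse implementation itself) of how magnitude pruning turns a dense weight matmul into an SpMM: the pruned weights are stored in a compressed sparse row (CSR) format, and only the surviving nonzeros participate in the multiplication. The helper names `to_csr` and `spmm_csr` are hypothetical, introduced only for this example.

```python
import numpy as np

def to_csr(dense):
    """Convert a dense matrix to CSR (values, column indices, row pointers)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return (np.array(values),
            np.array(col_idx, dtype=int),
            np.array(row_ptr, dtype=int))

def spmm_csr(values, col_idx, row_ptr, X):
    """SpMM: multiply a CSR sparse matrix by a dense matrix X,
    touching only the stored nonzero weights."""
    out = np.zeros((len(row_ptr) - 1, X.shape[1]))
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            out[i] += values[k] * X[col_idx[k]]
    return out

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))   # toy "weight matrix"
X = rng.standard_normal((8, 4))   # toy activations

# Magnitude pruning: zero out the 50% smallest-magnitude weights.
thresh = np.quantile(np.abs(W), 0.5)
W_pruned = np.where(np.abs(W) >= thresh, W, 0.0)

# The pruned dense matmul and the CSR SpMM produce the same result.
assert np.allclose(W_pruned @ X, spmm_csr(*to_csr(W_pruned), X))
```

In a real pruned LLM the sparsity pattern depends on the pruning method (unstructured, N:M, block, etc.), which is exactly the diversity that a general SpMM kernel like GeneralSparse must handle efficiently on GPUs.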
Syllabus
USENIX ATC '25 - GeneralSparse: Bridging the Gap in SpMM for Pruned Large Language Model Inference.
Taught by
USENIX