GeneralSparse - Bridging the Gap in SpMM for Pruned Large Language Model Inference on GPUs

USENIX via YouTube

Overview

Learn about GeneralSparse, a GPU-based solution for optimizing sparse matrix-matrix multiplication (SpMM) in pruned large language models, in this 19-minute conference presentation from USENIX ATC '25. Discover how researchers from SKLP, Institute of Computing Technology, CAS, and the University of Chinese Academy of Sciences address the computational challenges posed by the rapid growth of generative model parameters and the resulting difficulties in weight storage and inference latency.

Explore how weight pruning transforms dense matrix multiplications into SpMM computations, and why existing solutions struggle with the diverse sparsity patterns produced by different pruning methods. Examine GeneralSparse's approach, which leverages memory-access and reduction-space abstractions, featuring a dynamic box-division process that adapts to various pruning patterns and hierarchical reduction algorithms optimized for GPU architectures.

Review performance evaluation results showing up to a 20.82× speedup over the cuSPARSE library on pruned LLM weight matrices and the SuiteSparse collection, plus up to a 2.33× speedup in end-to-end LLM inference compared to existing counterparts.
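To make the pruning-to-SpMM transformation concrete, here is a minimal sketch (not GeneralSparse's actual kernel, which targets GPUs; all function names below are illustrative) of how zeroing out weights lets a dense matrix multiply be replaced by a sparse multiply over a compressed (CSR) representation that stores and touches only the surviving nonzeros:

```python
def to_csr(dense):
    """Convert a dense 2-D list into CSR arrays (values, col_indices, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def spmm(values, col_idx, row_ptr, x):
    """CSR sparse-times-dense multiply: only nonzero weights do any work."""
    m, n = len(row_ptr) - 1, len(x[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            v, j = values[k], col_idx[k]
            for c in range(n):
                out[i][c] += v * x[j][c]
    return out

# A weight matrix after magnitude pruning: most entries are exactly zero.
w = [[0.9, 0.0, 0.0],
     [0.0, 0.0, -1.2],
     [0.0, 0.5, 0.0]]
x = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
result = spmm(*to_csr(w), x)  # matches the dense product w @ x
```

Real pruning methods (unstructured, N:M, block-structured) produce very different nonzero layouts in `w`, which is why a one-size-fits-all sparse format underperforms and motivates the adaptive box division the talk describes.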

Syllabus

USENIX ATC '25 - GeneralSparse: Bridging the Gap in SpMM for Pruned Large Language Model Inference..

Taught by

USENIX
