Overview
Learn about GeneralSparse, a novel GPU-based solution for optimizing sparse matrix multiplication (SpMM) in pruned Large Language Models, in this 19-minute conference presentation from USENIX ATC '25. Discover how researchers from SKLP, Institute of Computing Technology, CAS, and the University of Chinese Academy of Sciences address the computational challenges posed by the rapid growth of generative model parameters and the resulting deployment difficulties in weight storage and inference latency.

Explore how weight pruning transforms dense matrix multiplications into SpMM computations, and understand why existing solutions struggle with the diverse sparsity patterns produced by different pruning methods. Examine GeneralSparse's approach, which builds on memory-access and reduction-space abstractions, featuring a dynamic box-division process that adapts to various pruning patterns and hierarchical reduction algorithms optimized for GPU architectures.

Review performance results showing up to a 20.82× speedup over the cuSPARSE library on pruned LLM weight matrices and SuiteSparse collections, plus up to a 2.33× speedup in end-to-end LLM inference compared to existing counterparts.
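To make the core idea concrete, here is a minimal illustrative sketch (not the GeneralSparse implementation itself) of how magnitude pruning turns a dense weight matmul into an SpMM: the pruned weights are stored in a compressed sparse row (CSR) format, and only the surviving nonzeros participate in the multiplication. The helper names `to_csr` and `spmm_csr` are hypothetical, introduced only for this example.

```python
import numpy as np

def to_csr(dense):
    """Convert a dense matrix to CSR (values, column indices, row pointers)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return (np.array(values),
            np.array(col_idx, dtype=int),
            np.array(row_ptr, dtype=int))

def spmm_csr(values, col_idx, row_ptr, X):
    """SpMM: multiply a CSR sparse matrix by a dense matrix X,
    touching only the stored nonzero weights."""
    out = np.zeros((len(row_ptr) - 1, X.shape[1]))
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            out[i] += values[k] * X[col_idx[k]]
    return out

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))   # toy "weight matrix"
X = rng.standard_normal((8, 4))   # toy activations

# Magnitude pruning: zero out the 50% smallest-magnitude weights.
thresh = np.quantile(np.abs(W), 0.5)
W_pruned = np.where(np.abs(W) >= thresh, W, 0.0)

# The pruned dense matmul and the CSR SpMM produce the same result.
assert np.allclose(W_pruned @ X, spmm_csr(*to_csr(W_pruned), X))
```

In a real pruned LLM the sparsity pattern depends on the pruning method (unstructured, N:M, block, etc.), which is exactly the diversity that a general SpMM kernel like GeneralSparse must handle efficiently on GPUs.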
Syllabus
USENIX ATC '25 - GeneralSparse: Bridging the Gap in SpMM for Pruned Large Language Model Inference.
Taught by
USENIX