Overview
Learn how unstructured sparsity intersects with tensor core architectures in this 37-minute conference talk, drawing on lessons from sparse attention mechanisms and Mixture of Experts (MoE) models. Explore the challenges and opportunities of implementing sparse computational patterns on modern GPU tensor cores, discover optimization strategies for managing parallelism in sparse workloads, and understand the performance implications of different sparsity patterns on tensor hardware, including the trade-offs between computational efficiency and memory access patterns. Examine how unstructured sparsity can be effectively leveraged in deep learning applications while working within the constraints and capabilities of specialized tensor processing units.
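The talk's own code is not reproduced here, but the central tension it names can be illustrated with a minimal NumPy sketch: unstructured magnitude pruning places nonzeros anywhere, while sparse tensor cores (NVIDIA Ampere and later) accelerate only the 2:4 semi-structured pattern, where at most 2 of every 4 consecutive weights are nonzero. The helper names below are illustrative, not from the talk.

```python
import numpy as np

def unstructured_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights globally (unstructured sparsity).
    Flexible, but the surviving nonzeros land anywhere, so dense tensor cores
    cannot exploit the zeros directly."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) > threshold, w, 0.0)

def prune_2_to_4(w: np.ndarray) -> np.ndarray:
    """Keep the 2 largest-magnitude weights in each contiguous group of 4
    along the last axis: the 2:4 semi-structured pattern that sparse tensor
    cores can consume in hardware."""
    assert w.shape[-1] % 4 == 0, "last dim must be a multiple of 4"
    groups = w.reshape(-1, 4)
    # Indices of the two smallest-magnitude entries in each group of four.
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)

w_unstructured = unstructured_prune(w, sparsity=0.5)
w_24 = prune_2_to_4(w)

# Both results are 50% sparse, but only the 2:4 version has the regular
# layout that sparse tensor-core units accelerate.
print("unstructured nonzeros:", np.count_nonzero(w_unstructured))
print("2:4 nonzeros:         ", np.count_nonzero(w_24))
```

Both pruned matrices end up equally sparse, yet only the structured one maps onto the hardware; bridging that gap for genuinely unstructured patterns, as they arise in sparse attention and MoE workloads, is the problem the talk addresses.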
Syllabus
Unstructured Sparsity Meets Tensor Cores: Lessons from Sparse Attention and MoE
Taught by
Simons Institute