Overview
Explore a comprehensive video explanation of the BigBird paper, which introduces a novel sparse attention mechanism for transformers to handle longer sequences. Learn about the challenges of quadratic memory requirements in full attention models and how BigBird addresses this issue through a combination of random, window, and global attention. Discover the theoretical foundations, including universal approximation and Turing completeness, as well as the practical implications for NLP tasks such as question answering and summarization. Gain insights into the experimental parameters, structured block computations, and results that demonstrate BigBird's improved performance on various NLP tasks and its potential applications in genomics.
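The three attention patterns described above can be sketched as a sparse attention mask. The following is a minimal illustrative sketch, not the paper's actual implementation: the function name, parameter values, and use of a dense boolean mask are all assumptions for clarity (BigBird itself uses blocked computations so the mask is never materialized densely).

```python
import numpy as np

def bigbird_attention_mask(seq_len, window=3, num_global=2, num_random=2, seed=0):
    """Sketch of a BigBird-style sparse attention mask (True = may attend).

    Combines the three patterns from the paper:
      - window: each token attends to its local neighborhood,
      - global: a few tokens attend to, and are attended by, all tokens,
      - random: each token attends to a few extra random tokens.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # Window attention: a band of half-width `window` around the diagonal.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True

    # Global attention: the first `num_global` tokens connect both ways.
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # Random attention: each query row gets `num_random` extra random keys.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True

    return mask

mask = bigbird_attention_mask(seq_len=16)
```

Because window, global, and random connections each contribute only O(n) entries, the number of allowed attention pairs grows linearly with sequence length, whereas full attention would fill the entire n x n mask.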
Syllabus
- Intro & Overview
- Quadratic Memory in Full Attention
- Architecture Overview
- Random Attention
- Window Attention
- Global Attention
- Architecture Summary
- Theoretical Result
- Experimental Parameters
- Structured Block Computations
- Recap
- Experimental Results
- Conclusion
Taught by
Yannic Kilcher