Overview
Explore the inner workings of mixture-of-experts (MoE) models through an in-depth analysis of GPT-OSS-20B, OpenAI's first open-weight model since GPT-2. Challenge common misconceptions about how MoE models operate by examining whether these systems actually contain specialized domain experts for mathematics, coding, or language tasks. Discover through empirical investigation that the reality of expert specialization differs significantly from popular assumptions. Analyze the architecture of this transformer-based MoE model and learn how it processes information differently than expected. Investigate token-routing mechanisms and uncover patterns using trigram analysis. Examine attention mechanisms and their role in model behavior, and distinguish between position specialists and context specialists within the expert framework. Access the accompanying research materials and code implementations to deepen your understanding of these advanced language model architectures and their surprising operational characteristics.
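The token-routing mechanism mentioned above can be sketched as a standard top-k gating step: a small router scores every expert for each token, and only the top-k experts actually run. This is a minimal illustrative sketch of generic top-k MoE routing, not GPT-OSS-20B's actual router code; the logit values and expert count are made up.

```python
import math

def softmax(logits):
    # Numerically stable softmax over the router's gate logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(gate_logits, k=2):
    """Pick the top-k experts for one token and renormalize their
    gate weights so they sum to 1 (generic top-k MoE routing;
    GPT-OSS-20B's exact router may differ)."""
    probs = softmax(gate_logits)
    topk = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in topk)
    return [(i, probs[i] / total) for i in topk]

# One token's gate logits over 4 hypothetical experts; only the
# top-2 experts (indices 1 and 3 here) process this token.
print(route_token([1.0, 3.0, 0.5, 2.0], k=2))
```

The key design point the course interrogates is what, if anything, these per-token gate decisions correlate with — domain content, or shallower signals such as token position and local n-gram context.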
Syllabus
- Intro
- Dense vs MoE models
- Not Domain Experts
- Disproving Token Routing
- Identifying patterns with TriGrams
- Attention is all you need
- Position Specialists vs Context Specialists
- Conclusion
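The trigram-analysis chapter asks which local token patterns each expert ends up handling. A toy version of that tally can be sketched as below; the token sequence and expert assignments are invented for illustration, not drawn from the course's actual data.

```python
from collections import Counter, defaultdict

def trigram_expert_profile(tokens, expert_ids):
    """Tally which expert each token trigram is routed to.
    `expert_ids[i]` is the expert chosen for token i; each trigram
    is attributed to the expert that processed its final token."""
    profile = defaultdict(Counter)
    for i in range(len(tokens) - 2):
        trigram = tuple(tokens[i:i + 3])
        profile[trigram][expert_ids[i + 2]] += 1
    return profile

# Hypothetical routing trace: 6 tokens, each assigned one expert id.
tokens = ["the", "cat", "sat", "the", "cat", "ran"]
experts = [0, 1, 2, 0, 1, 2]
prof = trigram_expert_profile(tokens, experts)
print(prof[("the", "cat", "sat")])  # which expert saw this trigram
```

Aggregating such counts over a large corpus is one way to test whether routing tracks local n-gram context rather than semantic domain, which is the distinction the course draws between context specialists and domain experts.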
Taught by
Chris Hay