Overview
Explore the intricate mechanics of token routing in Mixture of Experts (MoE) models through this comprehensive 34-minute technical video. Delve into the fundamental concepts of MoE layers and understand how tokens are intelligently distributed across different expert networks. Learn to compute router logits, implement sparsity through top-k expert selection, and normalize logits into router probabilities. Master the slot selection process, handle oversubscribed token scenarios, and construct final weight matrices for optimal model performance. Follow along with practical code implementations using the provided Colab notebook while referencing detailed slides that illustrate each algorithmic step. Gain deep insights into how modern large language models efficiently scale by routing computational tasks to specialized expert networks, making this essential viewing for machine learning practitioners working with transformer architectures and distributed computing systems.
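The first half of the routing pipeline described above (router logits, top-k sparsity, and normalization into router probabilities) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the video's code: the shapes and names (`num_tokens`, `hidden_dim`, `num_experts`, `top_k`) are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, hidden_dim, num_experts, top_k = 4, 8, 4, 2

tokens = rng.normal(size=(num_tokens, hidden_dim))     # token representations
router_w = rng.normal(size=(hidden_dim, num_experts))  # router weight matrix

# 1) Compute router logits: one score per (token, expert) pair.
logits = tokens @ router_w                             # shape (num_tokens, num_experts)

# 2) Sparsity: keep only each token's top-k experts.
topk_idx = np.argsort(logits, axis=-1)[:, -top_k:]     # indices of the chosen experts
topk_logits = np.take_along_axis(logits, topk_idx, axis=-1)

# 3) Normalize the selected logits into router probabilities (softmax).
shifted = topk_logits - topk_logits.max(axis=-1, keepdims=True)
router_probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

# Each token's probabilities over its selected experts sum to one.
assert np.allclose(router_probs.sum(axis=-1), 1.0)
```

Softmax over only the top-k logits (rather than all experts) is one common convention; implementations differ on whether normalization happens before or after the top-k cut.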
Syllabus
Introduction:
Laying the Foundation for Mixture of Experts (MoE):
Focus on Token Routing:
What is a Mixture of Experts Layer?:
Problem Statement and Configurations:
Compute Router Logits:
Sparsity and Selecting Top-K Experts:
Normalizing Logits to Router Probabilities:
Slot Selection:
Dropping Oversubscribed Tokens:
Updated Normalized Token Weights:
Updated Slot Selection and Token Slots:
Final Weight Matrix Construction:
Conclusion:
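The later syllabus steps (slot selection, dropping oversubscribed tokens, and final weight matrix construction) can be sketched as a capacity-limited assignment. This is a hedged illustration under assumed names (`capacity`, `expert_choice`, `token_weight`), not the video's exact algorithm.

```python
import numpy as np

num_tokens, num_experts, capacity = 5, 2, 2

# Assume each token has already picked one expert with some routing weight.
expert_choice = np.array([0, 0, 0, 1, 1])        # chosen expert per token
token_weight = np.array([0.9, 0.8, 0.7, 0.6, 0.5])

# Slot selection: a token's slot is its position, in order, among the
# tokens routed to the same expert.
slot = np.zeros(num_tokens, dtype=int)
counts = np.zeros(num_experts, dtype=int)
for t in range(num_tokens):
    slot[t] = counts[expert_choice[t]]
    counts[expert_choice[t]] += 1

# Dropping oversubscribed tokens: slot >= capacity means the expert is full.
kept = slot < capacity

# Final weight matrix: (num_tokens, num_experts), zero for dropped tokens.
weights = np.zeros((num_tokens, num_experts))
weights[kept, expert_choice[kept]] = token_weight[kept]
```

Here expert 0 receives three tokens but has capacity for two, so the third token's row in `weights` stays all zeros, i.e. it is dropped rather than processed by any expert.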
Taught by
Hugging Face