Muon Optimizer for Dense Linear Layer Explained - Newton-Schulz + Momentum

Yacine Mahdid via YouTube

Details

Provider: YouTube
Pricing: Free Video
Languages: English
Effort: 33 minutes
Sessions: Self-Paced
Level: Advanced

This 33-minute tutorial explains the Muon optimizer, which has been used to train large language models such as Kimi K2. It opens with the motivation for Muon and a review of the Adam and AdamW optimizers for context. It then gives an overview of Muon and its authors and looks at reported results, including Kimi K2's performance with Muon-CLIP. The core of the video is a detailed walkthrough of the Newton-Schulz method and how it orthogonalizes updates for dense linear layers. The tutorial closes by implementing the Muon optimizer in NumPy. Supplementary materials, including interactive problems and written overviews, are available for further study of how Muon combines Newton-Schulz orthogonalization with momentum-based updates.

Syllabus

- introduction: 0:00
- why muon is useful?: 2:04
- adam overview: 3:30
- adamw overview: 4:32
- what muon is doing?: 7:31
- muon authors overview: 8:26
- muon results: 10:39
- kimi k2 performance with muon-clip: 12:29
- what does muon do?: 13:54
- deep dive in newton schulz: 16:52
- coding muon in numpy: 27:59
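
As a rough companion to the last two syllabus items, here is a minimal NumPy sketch of the idea described above: accumulate momentum, orthogonalize it with a Newton-Schulz iteration, and apply the result as the weight update. It is not the code from the video. The function names, the hyperparameter defaults (`lr`, `beta`, `steps`), and the use of the classic cubic Newton-Schulz step are all assumptions made for illustration; published Muon implementations typically use a tuned higher-order polynomial that needs fewer steps.

```python
import numpy as np


def newton_schulz_orthogonalize(G, steps=10, eps=1e-7):
    """Approximate the U @ V.T factor of G's SVD without computing an SVD."""
    # Frobenius normalization bounds every singular value by 1,
    # which keeps the cubic iteration inside its convergence region.
    X = G / (np.linalg.norm(G) + eps)
    # Work in the wide orientation so X @ X.T is the smaller Gram matrix.
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        # Classic cubic step: maps each singular value s to 1.5*s - 0.5*s**3,
        # which pushes nonzero singular values toward 1.
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X.T if transposed else X


def muon_step(W, grad, momentum_buf, lr=0.02, beta=0.95, ns_steps=10):
    """One Muon-style update for a 2-D weight matrix (hypothetical helper)."""
    # Momentum: accumulate the raw gradient.
    momentum_buf = beta * momentum_buf + grad
    # Orthogonalize the momentum so all update directions have similar scale.
    update = newton_schulz_orthogonalize(momentum_buf, steps=ns_steps)
    return W - lr * update, momentum_buf


# Usage: orthogonalize a small 3x2 matrix, then take one optimizer step.
G = np.array([[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
Q = newton_schulz_orthogonalize(G)
gram = Q.T @ Q  # should be close to the 2x2 identity

W = np.zeros((3, 2))
buf = np.zeros_like(W)
W, buf = muon_step(W, G, buf)
```

The cubic step converges slowly when a singular value is very small after normalization, so poorly conditioned gradients may need more iterations than the default used here.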