Muon Optimizer for Dense Linear Layer Explained - Newton-Schulz + Momentum

Yacine Mahdid via YouTube

Overview

Explore the Muon optimizer in this 33-minute tutorial, which breaks down an optimization technique used to train large language models such as Kimi K2. Begin with the motivation for Muon, then review the Adam and AdamW optimizers to establish context. Get an overview of the Muon authors' approach and the reported results, including Kimi K2's performance with MuonClip. Dive into the core mechanism with a detailed look at the Newton-Schulz method and how it approximately orthogonalizes the update matrix for dense linear layers. Conclude by coding the Muon optimizer in NumPy, combining Newton-Schulz orthogonalization with momentum-based updates. Additional resources include interactive problems, overviews, and supplementary materials.
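As a rough illustration of the idea described above, here is a minimal NumPy sketch of a Muon-style update for a single dense weight matrix. It is not the code from the video. The function names `newton_schulz` and `muon_step`, the learning rate, and the momentum value are illustrative assumptions, and plain (non-Nesterov) momentum is assumed. The default quintic coefficients are the ones commonly quoted for Muon; the classical cubic Newton-Schulz iteration corresponds to `coeffs=(1.5, -0.5, 0.0)`.

```python
import numpy as np


def newton_schulz(G, steps=5, coeffs=(3.4445, -4.7750, 2.0315), eps=1e-7):
    """Approximately orthogonalize G with an odd-polynomial Newton-Schulz iteration.

    Each step applies x -> a*x + b*x**3 + c*x**5 to the singular values of X,
    pushing them toward 1 while leaving the singular vectors unchanged.
    """
    a, b, c = coeffs
    # Dividing by the Frobenius norm keeps every singular value at or below 1,
    # which the iteration needs in order to stay bounded.
    X = G / (np.linalg.norm(G) + eps)
    # Work on the wide orientation so that X @ X.T is the smaller Gram matrix.
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X


def muon_step(W, grad, momentum_buf, lr=0.02, beta=0.95):
    """One Muon-style step (assumed form): momentum, then orthogonalized update."""
    momentum_buf = beta * momentum_buf + grad  # accumulate momentum
    update = newton_schulz(momentum_buf)       # approximate orthogonalization
    W = W - lr * update
    return W, momentum_buf


# Example usage on a random 4x6 layer (shapes and seed are arbitrary).
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))
grad = rng.standard_normal((4, 6))
buf = np.zeros_like(W)
W, buf = muon_step(W, grad, buf)
```

With the quintic coefficients, five steps bring the singular values of the update close to 1 rather than exactly to 1, so the result is only approximately orthogonal. The cubic variant converges to an exactly orthogonal factor but needs more iterations.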

Syllabus

- introduction: 0:00
- why muon is useful?: 2:04
- adam overview: 3:30
- adamw overview: 4:32
- what muon is doing?: 7:31
- muon authors overview: 8:26
- muon results: 10:39
- kimi k2 performance with muon-clip: 12:29
- what does muon do?: 13:54
- deep dive in newton schulz: 16:52
- coding muon in numpy: 27:59

Taught by

Yacine Mahdid
