Overview
Learn to implement the AdamW optimizer from scratch in Python through this 20-minute tutorial that breaks down the optimization algorithm used in most modern deep learning training. Explore how AdamW differs from the traditional Adam optimizer by decoupling weight decay from the gradient-based update, which yields more stable training and better generalization. Understand the fundamental differences between L2 regularization and weight decay, discover why combining Adam with L2 regularization was problematic, and examine the mathematical formulation behind AdamW's superior performance. Follow along with a complete code implementation while gaining insight into where this regularization technique fits within the broader landscape of optimization algorithms and why it has become the preferred choice for training deep neural networks.
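As a preview of the kind of from-scratch implementation the tutorial builds, here is a minimal NumPy sketch of a single AdamW update step. The function name `adamw_step` and the default hyperparameters are illustrative choices, not the tutorial's exact code; the key point it demonstrates is that the weight-decay term is applied directly to the parameters, outside the adaptive moment-based step.

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update (illustrative sketch).

    Unlike Adam with L2 regularization, the weight-decay term is NOT
    folded into the gradient; it is applied directly to the parameters,
    decoupled from the adaptive moment estimates.
    """
    m = beta1 * m + (1 - beta1) * grad        # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    # Adaptive gradient step plus decoupled weight decay.
    param = param - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v

# Usage: minimize f(x) = x**2 from a starting point of 5.0.
x = np.array([5.0])
m, v = np.zeros_like(x), np.zeros_like(x)
for t in range(1, 2001):
    grad = 2 * x                              # gradient of f(x) = x**2
    x, m, v = adamw_step(x, grad, m, v, t, lr=0.05)
```

After a couple of thousand steps the parameter settles near the minimum at zero; swapping the decoupled `weight_decay * param` term for an L2 penalty added to `grad` would instead let the adaptive denominator rescale the regularization, which is exactly the problem the tutorial discusses.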
Syllabus
- Introduction: 0:00
- Where does AdamW fit?: 2:30
- What type of regularization does AdamW apply?: 3:25
- Why Adam with L2 fell short: 4:49
- Aren't L2 and weight decay the same?: 5:48
- AdamW formula breakdown: 6:19
- AdamW code implementation: 14:32
- AdamW Recap: 19:34
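The formula-breakdown chapter walks through the update rule; in standard AdamW notation (with learning rate $\eta$, weight-decay coefficient $\lambda$, and gradient $g_t$), the decoupled update can be written as:

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \\
\theta_t &= \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_{t-1} \right)
\end{aligned}
```

In Adam with L2 regularization, the penalty $\lambda\, \theta_{t-1}$ would instead be added to $g_t$ before the moment estimates, so the adaptive denominator $\sqrt{\hat{v}_t}$ rescales the regularization per coordinate; AdamW keeps it outside that rescaling, which is the decoupling the tutorial emphasizes.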
Taught by
Yacine Mahdid