Attend this ECE AI seminar exploring why the Adam optimizer outperforms stochastic gradient descent (SGD) on large language models, and discover new methods for finding optimal per-variable step sizes. Topics include:

- Evidence challenging the prevailing theory that heavy-tailed noise in stochastic gradients explains Adam's effectiveness, and how class imbalance in language tasks emerges as a key factor driving the performance gap between optimizers.
- How the large number of low-frequency classes in language modeling causes SGD to converge slowly while having little effect on Adam and sign-descent algorithms.
- Experiments showing that adding low-frequency classes to computer-vision models and linear models can artificially induce performance gaps between SGD and Adam.
- Theoretical proofs, in simplified settings, explaining why gradient descent converges slowly compared to sign-descent methods.
- The first provably optimal method for updating per-variable step sizes, which performs within a known factor of the optimal fixed per-variable step sizes on smooth, strongly convex functions.
- A novel multi-dimensional backtracking procedure that adaptively uses hyper-gradients to generate cutting planes, shrinking the search space of candidate step sizes.
- Practical linear-time variants that overcome the computational cost of traditional black-box cutting-plane approaches such as the ellipsoid method, making the theoretical advances applicable to real-world optimization problems.
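To make the contrast concrete, here is a minimal sketch (not from the seminar itself; all function names and the toy gradient values are illustrative assumptions) of the three per-variable update rules under discussion: SGD, sign descent, and Adam. On a coordinate with a tiny gradient (a stand-in for a rare class), SGD barely moves, while sign descent and Adam take comparably sized steps on every coordinate.

```python
# Illustrative sketch only: contrasts SGD, sign descent, and Adam updates
# on an "imbalanced" gradient where one coordinate is 100x smaller.
import numpy as np

def sgd_step(w, g, lr=0.1):
    # Plain gradient descent: the step scales with gradient magnitude,
    # so coordinates with tiny gradients (rare classes) move slowly.
    return w - lr * g

def sign_step(w, g, lr=0.1):
    # Sign descent: every coordinate moves by the same amount
    # regardless of gradient magnitude.
    return w - lr * np.sign(g)

def adam_step(w, g, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # Adam normalizes each coordinate by a running estimate of its
    # gradient scale, behaving like a smoothed sign descent.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy imbalance: coordinate 0 is a frequent class (large gradient),
# coordinate 1 a rare class (gradient 100x smaller).
g = np.array([1.0, 0.01])
w = np.zeros(2)
print(sgd_step(w, g))   # rare coordinate barely moves: [-0.1, -0.001]
print(sign_step(w, g))  # both coordinates move equally: [-0.1, -0.1]
```

This is the intuition behind the seminar's class-imbalance argument: when many parameters receive small but meaningful gradients, magnitude-proportional steps (SGD) stall on them, while sign-like updates (Adam) do not.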