Attend this ECE AI seminar exploring why the Adam optimizer outperforms stochastic gradient descent (SGD) on large language models, and discover new methods for finding optimal per-variable step sizes. Topics include:

- Evidence challenging the prevailing theory that heavy-tailed noise in stochastic gradients explains Adam's effectiveness, and how class imbalance in language tasks emerges as a key factor driving the performance gap between optimizers.
- How the large number of low-frequency classes in language modeling causes SGD to converge slowly while having little effect on Adam and sign-descent algorithms.
- Experiments showing that adding low-frequency classes to computer-vision models and linear models can artificially induce performance gaps between SGD and Adam.
- Theoretical proofs, in simplified settings, explaining why gradient descent converges slowly compared to sign-descent methods.
- The first provably optimal method for updating per-variable step sizes, which performs within a known factor of the optimal fixed per-variable step sizes on smooth, strongly convex functions.
- A novel multi-dimensional backtracking procedure that adaptively uses hyper-gradients to generate cutting planes, shrinking the search space of candidate step sizes.
- Practical linear-time variants that overcome the computational cost of traditional black-box cutting-plane approaches such as the ellipsoid method, making the theoretical advances applicable to real-world optimization problems.
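To make the contrast concrete, here is a minimal sketch (not from the seminar itself; all function names and the toy gradient values are illustrative assumptions) of the three per-variable update rules under discussion: SGD, sign descent, and Adam. On a coordinate with a tiny gradient (a stand-in for a rare class), SGD barely moves, while sign descent and Adam take comparably sized steps on every coordinate.

```python
# Illustrative sketch only: contrasts SGD, sign descent, and Adam updates
# on an "imbalanced" gradient where one coordinate is 100x smaller.
import numpy as np

def sgd_step(w, g, lr=0.1):
    # Plain gradient descent: the step scales with gradient magnitude,
    # so coordinates with tiny gradients (rare classes) move slowly.
    return w - lr * g

def sign_step(w, g, lr=0.1):
    # Sign descent: every coordinate moves by the same amount
    # regardless of gradient magnitude.
    return w - lr * np.sign(g)

def adam_step(w, g, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # Adam normalizes each coordinate by a running estimate of its
    # gradient scale, behaving like a smoothed sign descent.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy imbalance: coordinate 0 is a frequent class (large gradient),
# coordinate 1 a rare class (gradient 100x smaller).
g = np.array([1.0, 0.01])
w = np.zeros(2)
print(sgd_step(w, g))   # rare coordinate barely moves: [-0.1, -0.001]
print(sign_step(w, g))  # both coordinates move equally: [-0.1, -0.1]
```

This is the intuition behind the seminar's class-imbalance argument: when many parameters receive small but meaningful gradients, magnitude-proportional steps (SGD) stall on them, while sign-like updates (Adam) do not.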