
Why Does Adam Work So Well for LLMs? And Can We Find Optimal Per-Variable Step Sizes

New York University (NYU) via YouTube

Overview

Attend this ECE AI seminar exploring the fundamental reasons behind the Adam optimizer's superior performance on large language models compared to stochastic gradient descent (SGD), and discover new methods for finding optimal per-variable step sizes.

Examine evidence challenging the prevailing theory that heavy-tailed noise in stochastic gradients explains Adam's effectiveness, and learn how class imbalance in language tasks emerges as a key factor driving the performance gap between optimizers. Investigate how the large number of low-frequency classes in language modeling causes SGD to converge slowly while having minimal impact on Adam and sign descent algorithms. Explore experimental demonstrations showing how adding low-frequency classes to computer vision models and linear models can artificially induce performance gaps between SGD and Adam. Understand theoretical proofs in simplified settings that explain why gradient descent converges slowly compared to sign descent methods.

Discover the first provably optimal method for updating per-variable step sizes, one that performs within a known factor of the optimal fixed per-variable step sizes for smooth strongly-convex functions. Learn about the novel multi-dimensional backtracking procedure that adaptively uses hyper-gradients to generate cutting planes for reducing the search space of optimal step sizes. Examine practical linear-time variants developed to overcome the computational limitations of traditional black-box cutting-plane approaches like the ellipsoid method, making the theoretical advances applicable to real-world optimization problems.
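The contrast between the update rules discussed in the seminar can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the talk: it shows how SGD scales each coordinate's step by the gradient magnitude (so rarely-updated, small-gradient coordinates move slowly), while sign descent, and Adam through its second-moment normalization, take steps of comparable magnitude in every coordinate.

```python
import numpy as np

def sgd_step(w, g, lr=0.1):
    # SGD: one global step size; per-coordinate movement is
    # proportional to the gradient magnitude in that coordinate.
    return w - lr * g

def sign_descent_step(w, g, lr=0.1):
    # Sign descent: only the sign of each gradient coordinate is used,
    # so low-frequency coordinates move as far as high-frequency ones.
    return w - lr * np.sign(g)

def adam_step(w, g, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: exponential moving averages of the gradient (m) and its
    # square (v) give an effective per-coordinate step size
    # lr / (sqrt(v_hat) + eps), which normalizes away magnitude.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# A gradient with one large and one tiny coordinate, mimicking
# high- vs low-frequency classes:
w0 = np.zeros(3)
g = np.array([10.0, 0.01, -5.0])

w_sgd = sgd_step(w0, g)                        # tiny step on coordinate 1
w_sign = sign_descent_step(w0, g)              # equal-magnitude steps
w_adam, _, _ = adam_step(w0, g, np.zeros(3), np.zeros(3), t=1)
```

On the very first step Adam's normalized update reduces to roughly `lr * sign(g)`, which is one way to see its kinship with sign descent.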

Syllabus

ECE AI SEMINAR: Why does Adam work so well for LLMs? And can we find optimal per-variable step sizes

Taught by

NYU Tandon School of Engineering

Reviews

