Scaling Laws for Gradient Descent and Sign Descent for Linear Bigram Models under Zipf's Law
Centre de recherches mathématiques - CRM via YouTube
Overview
Explore optimization challenges in transformer-based language models through this mathematical research lecture, which derives scaling laws for gradient descent and sign descent applied to linear bigram models under Zipf's law.

Delve into why gradient descent struggles with the first and last layers of language models when word frequencies are heavy-tailed, following the 1/k pattern characteristic of natural language text. Learn how a power-law token distribution with exponent α affects training performance, and why the case α = 1 observed in real text represents a worst-case scenario for gradient descent.

Follow the derivation of scaling laws showing that gradient descent requires a number of iterations growing almost linearly with the dimension for Zipf-distributed data, while sign descent (as a proxy for the Adam optimizer) needs only on the order of the square root of the dimension. Gain insight into why optimizers like Adam outperform gradient descent on natural language processing tasks, moving beyond the typical assumption that eigenvalues decay with exponent α > 1 to the more challenging heavy-tailed distributions encountered in practice.
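For a concrete picture of the setup, here is a minimal, illustrative sketch (not taken from the lecture): it fits a linear bigram model on tokens whose frequencies follow a power law with exponent α and compares plain gradient descent with sign descent. The vocabulary size, learning rates, and step counts below are hypothetical choices made for illustration only.

    # Illustrative sketch: linear bigram model under a Zipf-like token distribution.
    # Rare tokens receive tiny gradient signal, which is the regime discussed in the lecture.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 200                                     # vocabulary size (hypothetical)
    alpha = 1.0                                 # Zipf exponent; alpha = 1 matches natural text
    freq = 1.0 / np.arange(1, d + 1) ** alpha
    freq /= freq.sum()                          # token frequencies follow a power law

    # Synthetic "true" bigram transition probabilities used to generate targets.
    P_true = rng.dirichlet(np.ones(d), size=d)

    def loss_and_grad(W):
        """Frequency-weighted cross-entropy of softmax(W[k]) against P_true[k]."""
        logits = W - W.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        loss = -(freq[:, None] * P_true * np.log(probs + 1e-12)).sum()
        grad = freq[:, None] * (probs - P_true)   # softmax cross-entropy gradient, row-weighted
        return loss, grad

    def train(update, lr, steps=2000):
        W = np.zeros((d, d))
        for _ in range(steps):
            loss, grad = loss_and_grad(W)
            W -= lr * update(grad)
        return loss

    gd_loss = train(lambda g: g, lr=5.0)            # gradient descent
    sign_loss = train(lambda g: np.sign(g), lr=0.01)  # sign descent (proxy for Adam)
    print(f"gradient descent loss: {gd_loss:.4f}")
    print(f"sign descent loss:     {sign_loss:.4f}")

In this toy setting, each row of the gradient is scaled by its token frequency, so gradient descent moves the rows for rare tokens very slowly, whereas sign descent updates every coordinate at the same rate; this is only an informal illustration of the phenomenon the lecture analyzes rigorously.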
Syllabus
Francis Bach: Scaling Laws for Gradient Descent & Sign Descent for Linear Bigram Models under Zipf's Law
Taught by
Centre de recherches mathématiques - CRM