A Framework for Designing Non-Diagonal Adaptive Training Methods
Institute for Pure & Applied Mathematics (IPAM) via YouTube
Overview
Explore a 49-minute conference talk by Wu Lin of the Vector Institute, presented at IPAM's Theory and Practice of Deep Learning Workshop. Delve into a framework for designing non-diagonal adaptive training methods for deep learning optimization. Discover how a probabilistic reformulation of the optimization problem can exploit the Fisher-Rao geometric structure of families of probability distributions, yielding new quasi-Newton methods for large-scale neural network training. Examine the second-order perspective on adaptive methods such as RMSProp and full-matrix AdaGrad. Understand the concept of preconditioner invariance and how it makes non-diagonal adaptive methods inverse-free while preserving preconditioner structure, a requirement for modern low-precision mini-batch training. Investigate Kronecker-factored adaptive methods as a bridge between non-diagonal and diagonal adaptive methods. Gain insight into why these methods suit training large neural networks in half precision: they eliminate numerically unstable and computationally intensive matrix decompositions and inversions.
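To make the second-order perspective concrete, here is a minimal NumPy sketch contrasting a diagonal RMSProp-style step with a full-matrix AdaGrad-style step. The hyperparameters (lr, beta, eps) and the toy gradient are illustrative assumptions, not details taken from the talk.

```python
import numpy as np

def rmsprop_step(w, g, v, lr=1e-2, beta=0.9, eps=1e-8):
    """Diagonal adaptive step: v tracks an exponential moving average
    of coordinate-wise squared gradients."""
    v = beta * v + (1 - beta) * g**2
    return w - lr * g / (np.sqrt(v) + eps), v

def full_matrix_adagrad_step(w, g, G, lr=1e-2, eps=1e-8):
    """Non-diagonal adaptive step: G accumulates gradient outer
    products, and the update applies the inverse matrix square root
    of G via an eigendecomposition."""
    G = G + np.outer(g, g)
    evals, evecs = np.linalg.eigh(G + eps * np.eye(len(g)))
    P = evecs @ np.diag(evals**-0.5) @ evecs.T   # G^{-1/2}
    return w - lr * P @ g, G

w, v, G = np.zeros(3), np.zeros(3), np.zeros((3, 3))
g = np.array([0.5, -1.0, 2.0])                   # a toy gradient
w_diag, v = rmsprop_step(w, g, v)
w_full, G = full_matrix_adagrad_step(w, g, G)
```

The eigendecomposition and inversion in the full-matrix step are exactly the operations that become unstable in half precision, which is what the inverse-free methods discussed in the talk aim to avoid.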
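In the same spirit, a hedged sketch of the Kronecker-factored idea, using standard Kronecker-product algebra rather than the talk's specific update rules: two small factors precondition a matrix-shaped gradient without ever forming the full preconditioner. The factor statistics and damping value below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
G = rng.normal(size=(m, n))            # gradient of an (m x n) weight matrix

# Small Kronecker factors (assumed statistics, damped to stay invertible).
A = G @ G.T / n + 1e-3 * np.eye(m)     # row-space factor, (m x m)
B = G.T @ G / m + 1e-3 * np.eye(n)     # column-space factor, (n x n)

# Preconditioned gradient computed with the small factors only:
# (A kron B)^{-1} vec(G) == vec(A^{-1} G B^{-1}) for symmetric B.
precond_small = np.linalg.solve(A, G) @ np.linalg.inv(B)

# Check against the explicit (mn x mn) Kronecker preconditioner.
full = np.kron(A, B)
precond_full = np.linalg.solve(full, G.reshape(-1)).reshape(m, n)
print(np.allclose(precond_small, precond_full))  # True
```

Working with the (m x m) and (n x n) factors instead of the (mn x mn) matrix is what lets Kronecker-factored methods scale between the diagonal and full-matrix extremes.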
Syllabus
Wu Lin - A framework for designing (non-diagonal) adaptive training methods - IPAM at UCLA
Taught by
Institute for Pure & Applied Mathematics (IPAM)