Explore the mathematical foundations of deep ResNet training dynamics in this 47-minute conference talk, which presents a rigorous framework for analyzing practical architectures including Transformers. Combining stochastic approximation of ODEs with propagation-of-chaos arguments, the talk develops three key insights:

- Infinite-depth ResNets of any hidden width behave as if they were infinitely wide throughout training.
- The phase diagram of Transformers mirrors that of two-layer perceptrons under appropriate substitutions.
- Under optimal shape scaling, Transformers of optimal shape converge to their limiting dynamics at rate P^{-1/6} for parameter budget P.

Gain a deep mathematical understanding of why depth creates effective width in neural networks, and how this principle extends across architectures from ResNets to modern Transformers.
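One heuristic way to see where a P^{-1/6} rate could come from (a back-of-envelope sketch under assumed error scalings, not the talk's actual derivation): suppose the depth-discretization fluctuation scales as L^{-1/2} for depth L (as in stochastic approximation of an ODE), the finite-width fluctuation as n^{-1/2} for width n (as in propagation of chaos), and the parameter budget as P ~ L n^2. Balancing the two error terms then recovers the stated rate:

```latex
% Heuristic balancing sketch (assumed scalings, for illustration only):
% depth error ~ L^{-1/2}, width error ~ n^{-1/2}, budget P ~ L n^2.
\[
  L^{-1/2} \asymp n^{-1/2}
  \;\Longrightarrow\; L \asymp n,
  \qquad
  P \asymp L n^{2} = n^{3}
  \;\Longrightarrow\; n \asymp P^{1/3},
\]
\[
  \text{error} \;\asymp\; n^{-1/2} \;\asymp\; P^{-1/6}.
\]
```

Note that the choice of L ~ n here is exactly the "optimal shape" being balanced: making the network deeper or wider than this trade-off allows would leave one error term dominating.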