A Proof of Learning Rate Transfer under $\mu$P
Soufiane Hayou
Abstract
We provide the first proof of learning rate transfer for a multi-layer perceptron (MLP) parameterized with $\mu$P, a neural network parameterization designed to ``maximize'' feature learning in the infinite-width limit. We show that under $\mu$P, the optimal learning rate converges to a non-zero constant as the width goes to infinity. In contrast, we show that this does not hold under other parameterizations such as the Standard Parameterization (SP) and the Neural Tangent Parameterization (NTP). We provide extensive empirical results validating our theoretical findings.
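The sketch below illustrates the kind of experiment the abstract describes: sweep a grid of base learning rates for MLPs of increasing width and record which one performs best at each width. It is a minimal illustration, not the paper's exact setup; the $\mu$P-style scalings used (a $1/\text{width}$ multiplier on the readout and Adam learning rates scaled by $1/\text{width}$ for hidden and output layers) are assumptions following the commonly cited $\mu$P prescription for Adam, and the task, architecture, and hyperparameters are placeholders.

```python
# Minimal sketch of a learning-rate-transfer check across widths.
# Assumed muP-style scalings (not the paper's exact setup):
#   - readout divided by width, so outputs stay O(1) as width grows,
#   - Adam lr scaled by 1/width for hidden and output layers.
import torch
import torch.nn as nn

torch.manual_seed(0)


class MuPMLP(nn.Module):
    """Three-layer MLP with a muP-style 1/width multiplier on the readout."""

    def __init__(self, d_in: int, width: int, d_out: int):
        super().__init__()
        self.width = width
        self.fc_in = nn.Linear(d_in, width)
        self.fc_hidden = nn.Linear(width, width)
        self.fc_out = nn.Linear(width, d_out, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.fc_in(x))
        h = torch.relu(self.fc_hidden(h))
        # Assumed muP output scaling: divide the readout by the width.
        return self.fc_out(h) / self.width


def train_once(width: int, lr: float, steps: int = 200) -> float:
    d_in, d_out = 8, 1
    model = MuPMLP(d_in, width, d_out)
    # Per-layer learning rates: constant for the input layer, 1/width for
    # hidden and output layers (assumed Adam-style muP scaling).
    opt = torch.optim.Adam([
        {"params": model.fc_in.parameters(), "lr": lr},
        {"params": model.fc_hidden.parameters(), "lr": lr / width},
        {"params": model.fc_out.parameters(), "lr": lr / width},
    ])
    x = torch.randn(512, d_in)
    y = torch.sin(x.sum(dim=1, keepdim=True))  # toy regression target
    for _ in range(steps):
        loss = ((model(x) - y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()


if __name__ == "__main__":
    lrs = [1e-3, 3e-3, 1e-2, 3e-2, 1e-1]
    for width in [64, 256, 1024]:
        losses = [train_once(width, lr) for lr in lrs]
        best = lrs[min(range(len(lrs)), key=lambda i: losses[i])]
        # Under muP the best base learning rate should stay roughly constant
        # as width grows; under SP it typically drifts with width.
        print(f"width={width:5d}  best base lr={best:g}")
```

Repeating the same sweep with a standard parameterization (no readout multiplier, a single global learning rate) is the natural baseline for contrasting against the transfer behavior claimed for $\mu$P.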