
Groundbreaking research has provided the first rigorous proof of learning rate transfer under the Maximal Update Parametrization (μP), a significant development for efficiently scaling large neural networks. The result, shared by researcher Quanquan Gu on social media on November 5, 2025, appears in a preprint by Soufiane Hayou titled "A Proof of Learning Rate Transfer under μP." The proof addresses a central challenge in training very large AI models: justifying the practice of tuning the learning rate on a small model and reusing it at much larger scale.
The Maximal Update Parametrization (μP) is a scaling rule that prescribes how initialization variances and per-layer learning rates should change with network width so that training dynamics remain consistent as models grow, which is what makes hyperparameter transfer possible. Prior to this proof, learning rate transfer under μP had largely been an empirical observation. The new work shows that under μP, the optimal learning rate converges to a non-zero constant as network width approaches infinity, offering a rigorous theoretical explanation for this crucial property.
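For concreteness, the scaling rules can be written out for a deep linear network of the kind the paper studies. The block below is one common formulation of μP for SGD; conventions and constants vary across papers, so it should be read as an illustrative sketch rather than the paper's exact setup.

```latex
% muP for a width-n linear MLP  f(x) = W_{L+1} W_L \cdots W_1 x  trained with SGD:
% initialization variance on the left, per-layer learning rate on the right.
\begin{align*}
W_1     &\sim \mathcal{N}\!\left(0, \tfrac{1}{d_{\mathrm{in}}}\right), & \eta_1     &= \eta\, n, \\
W_\ell  &\sim \mathcal{N}\!\left(0, \tfrac{1}{n}\right), \quad 2 \le \ell \le L, & \eta_\ell  &= \eta, \\
W_{L+1} &\sim \mathcal{N}\!\left(0, \tfrac{1}{n^2}\right), & \eta_{L+1} &= \tfrac{\eta}{n}.
\end{align*}
% These choices keep both the hidden features and the network output moving by
% Theta(1) per step regardless of n; the theorem then says the optimal base rate
% \eta converges to a non-zero constant as n -> infinity.
```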
The paper contrasts μP with alternative parametrizations, such as the Standard Parametrization (SP) and the Neural Tangent Parametrization (NTP), under which the optimal learning rate does not settle to a width-independent constant, so a rate tuned at one width need not remain optimal at another. This distinction underscores μP's advantage in simplifying hyperparameter tuning. "🔥 Learning rate transfer under μP is now proven!" Quanquan Gu wrote in the post, linking to the preprint.
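A standard back-of-the-envelope calculation from the μP literature (not taken from the paper itself) illustrates why SP behaves differently. Consider only the output layer of a width-n network with the usual 1/n initialization variance:

```latex
% Standard parametrization: output weights v in R^n with entries of variance 1/n,
% hidden features h with Theta(1) entries. One SGD step on v changes the output by
\[
\Delta f(x) \;=\; \Delta v^\top h \;=\; -\,\eta\,\frac{\partial \mathcal{L}}{\partial f}\,\lVert h \rVert^2 \;=\; \Theta(\eta\, n),
\]
% so keeping this update of order one forces \eta = \Theta(1/n): under SP the
% optimal learning rate drifts toward zero as width grows, rather than converging
% to a non-zero constant as it does under muP.
```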
The research, which focuses primarily on linear multi-layer perceptrons (MLPs), shows that the optimal learning rate can be reliably determined on smaller models and applied to much larger ones without extensive re-tuning. This capability is expected to significantly reduce the computational cost and time of developing and deploying increasingly complex AI systems, particularly large language models. While the current proof is established for linear networks trained with gradient descent, the author expresses optimism that the analysis can be extended to non-linear architectures and other optimization algorithms in future work.
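The practical recipe implied by the result is easy to simulate. The sketch below is an illustrative width sweep on a synthetic linear regression task, not the paper's experiment: it trains linear MLPs under the μP scaling shown earlier with plain gradient descent and prints the best base learning rate found at each width, which should stay roughly constant as the width grows. All hyperparameters (depth, widths, step counts, learning-rate grid) are arbitrary choices.

```python
# Illustrative sketch only (not the paper's experiment): train linear MLPs of
# increasing width under the muP scaling described above, using plain gradient
# descent on a synthetic linear regression task, and report the best base
# learning rate at each width. All hyperparameters here are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, depth = 16, 1, 3              # depth = number of width-sized hidden layers
X = rng.standard_normal((256, d_in))       # fixed synthetic inputs
w_star = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
Y = X @ w_star                             # linear teacher targets

def init_mup(width):
    """muP init: input var 1/d_in, hidden var 1/width, output var 1/width**2."""
    Ws = [rng.standard_normal((d_in, width)) / np.sqrt(d_in)]
    for _ in range(depth - 1):
        Ws.append(rng.standard_normal((width, width)) / np.sqrt(width))
    Ws.append(rng.standard_normal((width, d_out)) / width)
    return Ws

def lr_scales(width):
    """Per-layer LR multipliers under muP (SGD): input ~ width, hidden ~ 1, output ~ 1/width."""
    return [float(width)] + [1.0] * (depth - 1) + [1.0 / width]

def train(width, base_lr, steps=200):
    Ws, scales = init_mup(width), lr_scales(width)
    loss = np.inf
    for _ in range(steps):
        hs = [X]                           # forward pass, caching each layer's input
        for W in Ws:
            hs.append(hs[-1] @ W)
        err = hs[-1] - Y
        loss = float(np.mean(err ** 2))
        if not np.isfinite(loss):          # diverged: report infinite loss
            return np.inf
        g = 2.0 * err / X.shape[0]         # gradient of the mean-squared error
        grads = []
        for W, h in zip(reversed(Ws), reversed(hs[:-1])):
            grads.append(h.T @ g)          # gradient w.r.t. this layer's weights
            g = g @ W.T                    # backpropagate to the previous layer
        grads.reverse()
        for W, gW, s in zip(Ws, grads, scales):
            W -= base_lr * s * gW          # layer-wise muP learning rates
    return loss

lrs = [2.0 ** k for k in range(-8, 1)]     # grid of base learning rates
for width in [64, 256, 1024]:
    losses = [train(width, lr) for lr in lrs]
    print(f"width={width:5d}  best base lr={lrs[int(np.argmin(losses))]:.4f}")
# Under muP the printed best base lr should stay roughly constant across widths;
# the same sweep under standard parametrization would show it shrinking with width.
```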