Beyond the Ideal: Analyzing the Inexact Muon Update
Egor Shulgin ⋅ AlRashed ⋅ Peter Richtarik ⋅ Francesco Orabona
Abstract
The Muon optimizer has rapidly emerged as a powerful, geometry-aware alternative to AdamW, demonstrating strong performance in large-scale training of neural networks. However, a critical theory-practice disconnect exists: Muon's efficiency relies on fast, approximate orthogonalization, while most theoretical analyses study idealized exact-SVD updates. This work moves beyond the ideal by providing a general analysis of the *inexact* orthogonalized update at Muon's core. We develop our analysis within the general framework of Linear Minimization Oracle (LMO)-based optimization, introducing a realistic additive error model to capture the inexactness of practical approximation schemes. Our analysis yields explicit bounds that quantify performance degradation as a function of the LMO error $\delta$. We reveal a fundamental coupling between this inexactness and the optimal step size and momentum: lower oracle precision requires a smaller step size but a larger momentum parameter. These findings elevate the approximation procedure, such as the number of Newton-Schulz steps, from an implementation detail to a critical parameter that must be *co-tuned* with the learning schedule. NanoGPT experiments directly confirm the predicted coupling, with optimal learning rates clearly shifting as approximation precision changes.
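To make the notion of an inexact orthogonalization concrete, the following is a minimal sketch comparing an exact SVD-based orthogonalization with a truncated Newton-Schulz approximation. It rests on several assumptions that are not taken from the paper: it uses the classic cubic Newton-Schulz iteration rather than whatever tuned polynomial a given Muon implementation uses, it measures the gap in spectral norm as an illustrative stand-in for $\delta$ (the paper's additive error model may be defined differently), and the function names and the random test matrix are hypothetical.

```python
import numpy as np


def exact_orthogonalize(G):
    """Exact orthogonalization: the polar factor U @ V^T from a thin SVD."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt


def newton_schulz(G, steps):
    """Approximate the polar factor with the classic cubic Newton-Schulz iteration.

    Assumption: this is the textbook iteration X <- 1.5 X - 0.5 X X^T X,
    chosen for illustration; practical Muon code may use different coefficients.
    """
    # Frobenius normalization keeps every singular value in (0, 1],
    # which lies inside the convergence region of the cubic iteration.
    X = G / (np.linalg.norm(G) + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X


def orthogonalization_error(G, steps):
    """Spectral-norm gap to the exact polar factor (illustrative proxy for delta)."""
    return np.linalg.norm(newton_schulz(G, steps) - exact_orthogonalize(G), ord=2)


# Hypothetical stand-in for a gradient or momentum matrix.
rng = np.random.default_rng(0)
G = rng.standard_normal((32, 64))

errors = {k: orthogonalization_error(G, k) for k in (1, 5, 10, 40)}
for k, err in errors.items():
    print(f"steps={k:2d}  error={err:.3e}")
```

In this sketch the number of iterations plays the role of the precision knob: fewer steps are cheaper but leave a larger gap to the exact orthogonal factor, which is the kind of trade-off the abstract argues should be tuned jointly with step size and momentum.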