Skip to yearly menu bar Skip to main content


Oral

Oral Session 3: Optimization, Training Methods & Learning Dynamics

Main Ballroom

Moderator: Quentin Berthet

Sat 2 May 6 a.m. PDT — 7 a.m. PDT
Abstract:
Chat is not available.

We provide the first proof of learning rate transfer with a multi-layer perceptron (MLP) parametrized with $\mu P$, a neural network parameterization designed to ``maximize'' feature learning in the infinite-width limit. We show that with $\mu P$, the optimal learning rate converges to a non-zero constant as width goes to infinity. In contrast, we show that this doesn't hold with other parametrizations such as Standard Parameterization (SP) and Neural Tangent Parametrization (NTP). We provide extensive empirical results validating our theoretical findings.

Diagonal linear networks (DLNs) are a tractable model that captures several nontrivial behaviors in neural network training, such as initialization-dependent solutions and incremental learning. These phenomena are typically studied in isolation, leaving the overall dynamics insufficiently understood. In this work, we present a unified analysis of various phenomena in the gradient flow dynamics of DLNs. Using Dynamical Mean-Field Theory (DMFT), we derive a low-dimensional effective process that captures the asymptotic gradient flow dynamics in high dimensions. Analyzing this effective process yields new insights into DLN dynamics, including loss convergence rates and their trade-off with generalization, and systematically reproduces many of the previously observed phenomena. These findings deepen our understanding of DLNs and demonstrate the effectiveness of the DMFT approach in analyzing high-dimensional learning dynamics of neural networks.


Spotlight
A projection-based framework for gradient-free and parallel learning

Andreas Bergmeister ⋅ Manish Krishan Lal ⋅ Stefanie Jegelka ⋅ Suvrit Sra

We present a feasibility-seeking approach to neural network training. This mathematical optimization framework is distinct from conventional gradient-based loss minimization and uses projection operators and iterative projection algorithms. We reformulate training as a large-scale feasibility problem: finding network parameters and states that satisfy local constraints derived from its elementary operations. Training then involves projecting onto these constraints, a local operation that can be parallelized across the network. We introduce PJAX, a JAX-based software framework that enables this paradigm. PJAX composes projection operators for elementary operations, automatically deriving the solution operators for the feasibility problems (akin to autodiff for derivatives). It inherently supports GPU/TPU acceleration, provides a familiar NumPy-like API, and is extensible. We train diverse architectures (MLPs, CNNs, RNNs) on standard benchmarks using PJAX, demonstrating its functionality and generality. Our results show that this approach is a compelling alternative to gradient-based training, with clear advantages in parallelism and the ability to handle non-differentiable operations.


Spotlight
Generalized and Optimal Straight-Through Estimators

James Hooper ⋅ Alexander Shekhovtsov

Modern ML models often utilize discrete components within their computational graphs, making training challenging. In such cases, approximate-chain-rule gradient estimators can be applied. They work reasonably well but are obtained by combining diverse rationales with ad-hoc choices. In this work, we propose a principled axiomatic approach to define a general family of gradient estimators and show that it subsumes many existing methods. Within this family, we derive optimal estimators with respect to a minimum variance criterion subject to interpretable bias-limiting constraints, addressing integer and one-hot categorical discrete variables. We empirically demonstrate that our estimator can achieve a better bias-variance trade-off than existing ones on synthetic problems and outperforms them on training variational auto-encoders with discrete latent variables.


Spotlight
GiVA: Gradient-Informed Bases for Vector-Based Adaptation

Neeraj Gangwar ⋅ Rishabh Deshmukh ⋅ Michael Shavlovsky ⋅ Hancao Li ⋅ Vivek Mittal ⋅ Lexing Ying ⋅ Nickvash Kani

As model sizes continue to grow, parameter-efficient fine-tuning has emerged as a powerful alternative to full fine-tuning. While LoRA is widely adopted among these methods, recent research has explored vector-based adaptation methods due to their extreme parameter efficiency. However, these methods typically require substantially higher ranks than LoRA to match its performance, leading to increased training costs. This work introduces GiVA, a gradient-based initialization strategy for vector-based adaptation. It achieves training times comparable to LoRA and maintains the extreme parameter efficiency of vector-based adaptation. We evaluate GiVA across diverse benchmarks, including natural language understanding, natural language generation, and image classification. Experiments show that our approach consistently outperforms or achieves performance competitive with existing vector-based adaptation methods and LoRA while reducing rank requirements by a factor of eight ($8\times$).

Tyler’s M-estimator is a popular method for robust covariance estimation. Although Tyler proposed a widely used fixed-point iteration to compute the estimator nearly four decades ago, the theoretical properties of the fixed-point iteration have remained elusive. In particular, no deterministic global convergence rate has been established, despite decades of research. In this paper, we resolve this longstanding question by interpreting the fixed-point iteration as an instance of the convex-concave procedure, and analyzing it through a novel combination of relative smoothness and Riemannian optimization. Our analysis requires us to navigate two distinct geometric frameworks: one induced by the Kullback--Leibler divergence, and the other one arising from the Riemannian geometry on the manifold of positive definite matrices. This interplay between the two geometries also allows us to prove that a regularized variant of the fixed-point iteration converges linearly. Beyond contributing to the theoretical understanding of Tyler’s M-estimator, our analysis demonstrates how the integration of Euclidean and Riemannian perspectives can lead to new insights into the design and analysis of optimization algorithms.


Spotlight
Gaussian Approximation and Multiplier Bootstrap for Stochastic Gradient Descent

Marina Sheshukova ⋅ Sergey Samsonov ⋅ Denis Belomestny ⋅ Eric Moulines ⋅ Qi-Man Shao ⋅ Zhuo-Song Zhang ⋅ Alexey Naumov

In this paper, we establish the non-asymptotic validity of the multiplier bootstrap procedure for constructing the confidence sets using the Stochastic Gradient Descent (SGD) algorithm. Under appropriate regularity conditions, our approach avoids the need to approximate the limiting covariance of Polyak-Ruppert SGD iterates, which allows us to derive approximation rates in convex distance of order up to $1/\sqrt{n}$. Notably, this rate can be faster than the one that can be proven in the Polyak-Juditsky central limit theorem. To our knowledge, this provides the first fully non-asymptotic bound on the accuracy of bootstrap approximations in SGD algorithms. Our analysis builds on the Gaussian approximation results for nonlinear statistics of independent random variables.


Spotlight
Composable Coresets for Constrained Determinant Maximization and Beyond

Sepideh Mahabadi ⋅ Thuy-Duong Vuong

We study algorithms for construction of *composable coresets* for the task of *Determinant Maximization* under *partition constraint*. Given a point set $V \subset R^d$ that is partitioned into $s$ groups $V_1,\cdots, V_s$, and integers $k_1,...,k_s$, where $k=\sum_i k_i$, the goal is to pick $k_i$ points from group $V_i$ such that the overall determinant of the picked $k$ points is maximized. Determinant Maximization and its constrained variants have gained a lot of interest for modeling diversity, and have found applications in the context of data summarization. When the cardinality $k$ of the selected set is greater than the dimension $d$, we show a peeling algorithm that gives us a composable coreset of size $kd$ with a provably optimal approximation factor of $d^{O(d)}.$ When $k\leq d$, we show a simple coreset construction with optimal size and approximation factor. As a further application of our technique, we get a composable coreset for determinant maximization under the more general laminar matroid constraints, and a composable coreset for unconstrained determinant maximization in a previously unresolved regime. Our results generalize to all strongly Rayleigh distributions and to several other experimental design problems. As an application, we improve the runtime of the practical local-search based algorithm of [Anari-Vuong--COLT'22] for determinantal maximization under partition constraint from $O(n^{2^s}k^{2^s})$ to $O(n k^{2^s})$, making it only linear on the number of points $n$.