


Oral Session 3: Optimization



Cubic regularized subspace Newton for non-convex optimization

Jim Zhao · Nikita Doikov · Aurelien Lucchi

This paper addresses the optimization problem of minimizing non-convex continuous functions, a problem highly relevant in high-dimensional machine learning scenarios, particularly those involving over-parameterization. We analyze a randomized coordinate second-order method named SSCN, which can be interpreted as applying cubic regularization of Newton's method in random subspaces. This approach effectively reduces the computational complexity associated with utilizing second-order information, making it applicable in higher-dimensional scenarios. Theoretically, we establish strong global convergence guarantees for non-convex functions to a stationary point, with interpolating rates for arbitrary subspace sizes and allowing inexact curvature estimation, starting from an arbitrary initialization. When increasing the subspace size, our complexity matches the $\mathcal{O}(\epsilon^{-3/2})$ rate of the full Newton's method with cubic regularization. Additionally, we propose an adaptive sampling scheme ensuring the exact convergence rate of $\mathcal{O}(\epsilon^{-3/2}, \epsilon^{-3})$ to a second-order stationary point, without requiring all coordinates to be sampled. Experimental results demonstrate substantial speed-ups achieved by SSCN compared to conventional first-order methods and other second-order subspace methods.
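As a concrete illustration of the subspace idea, the sketch below performs one cubic-regularized Newton step restricted to a random coordinate block. It assumes direct access to gradient and Hessian oracles and uses a generic BFGS solve of the cubic subproblem; the function names (`sscn_step`, `grad`, `hess`) and the choice of subproblem solver are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a cubic-regularized subspace Newton step (hypothetical helper,
# not the paper's code): sample a coordinate block, build the restricted cubic
# model, solve it approximately, and update only the sampled coordinates.
import numpy as np
from scipy.optimize import minimize

def sscn_step(x, grad, hess, tau, M, rng):
    """One step over a random coordinate block of size tau; M is the cubic penalty."""
    n = x.size
    S = rng.choice(n, size=tau, replace=False)             # random coordinate subspace
    g = grad(x)[S]                                          # restricted gradient
    H = hess(x)[np.ix_(S, S)]                               # restricted Hessian block
    # Cubic model m(s) = g^T s + 0.5 s^T H s + (M/6) ||s||^3 and its gradient.
    m = lambda s: g @ s + 0.5 * s @ H @ s + (M / 6) * np.linalg.norm(s) ** 3
    dm = lambda s: g + H @ s + 0.5 * M * np.linalg.norm(s) * s
    s = minimize(m, np.zeros(tau), jac=dm, method="BFGS").x  # approximate subproblem solve
    x_new = x.copy()
    x_new[S] += s                                           # update only the sampled block
    return x_new
```

Taking the block size tau up to the full dimension recovers a cubic-regularized Newton step, consistent with the rate interpolation described in the abstract.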


Implicit Diffusion: Efficient optimization through stochastic sampling

Pierre Marion · Anna Korba · Peter Bartlett · Mathieu Blondel · Valentin De Bortoli · Arnaud Doucet · Felipe Llinares-López · Courtney Paquette · Quentin Berthet

Sampling and automatic differentiation are both ubiquitous in modern machine learning. At their intersection, differentiating through a sampling operation with respect to the parameters of the sampling process is a problem that is both challenging and broadly applicable. We introduce a general framework and a new algorithm for first-order optimization of parameterized stochastic diffusions, performing optimization and sampling steps jointly, in a single loop. This approach is inspired by recent advances in bilevel optimization and automatic implicit differentiation, leveraging the point of view of sampling as optimization over the space of probability distributions. We provide theoretical and experimental results showcasing the performance of our method.
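To make the single-loop structure concrete, here is a minimal caricature, assuming a Langevin sampler for the Gaussian family $\pi_\theta = \mathcal{N}(\theta, I)$ and a toy outer loss $\mathbb{E}_{x\sim\pi_\theta}[\|x-\mathrm{target}\|^2/2]$, whose gradient in $\theta$ reduces to $\mathbb{E}[x] - \mathrm{target}$. The step sizes and names are assumptions; the paper's general estimator based on implicit differentiation is more involved.

```python
# Single-loop joint sampling and optimization (toy illustration, not the paper's
# algorithm): each iteration does one Langevin step on the particle cloud and one
# gradient step on the sampler parameters using the current particles.
import numpy as np

rng = np.random.default_rng(0)
d, n_particles = 2, 512
target = np.array([3.0, -1.0])

theta = np.zeros(d)                                   # sampler parameters (Gaussian mean)
x = rng.normal(size=(n_particles, d))                 # particle cloud

eta, alpha = 0.1, 0.05                                # sampling / optimization step sizes
for t in range(2000):
    # Inner step: unadjusted Langevin update targeting pi_theta = N(theta, I).
    x += -eta * (x - theta) + np.sqrt(2 * eta) * rng.normal(size=x.shape)
    # Outer step: outer-loss gradient estimated on the current (non-stationary) particles.
    theta -= alpha * (x.mean(axis=0) - target)

print(theta)  # approaches `target`, the minimizer of the outer objective
```

The point is the loop structure: sampling and optimization steps are interleaved one-for-one rather than nesting a full sampling run inside every parameter update.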

We introduce ScoreFusion, a theoretically grounded method for fusing multiple pre-trained diffusion models that are assumed to generate from auxiliary populations. ScoreFusion is particularly useful for enhancing the generative modeling of a target population with limited observed data. Our starting point considers the family of KL barycenters of the auxiliary populations, which is proven to be an optimal parametric class in the KL sense, but difficult to learn. Nevertheless, by recasting the learning problem as score matching in denoising diffusion, we obtain a tractable way of computing the optimal KL barycenter weights. We prove a dimension-free sample complexity bound in total variation distance, provided that the auxiliary models are well-fitted for their own task and the auxiliary tasks combined capture the target well. The sample efficiency of ScoreFusion is demonstrated by learning handwritten digits. We also provide a simple adaptation of a Stable Diffusion denoising pipeline that enables sampling from the KL barycenter of two auxiliary checkpoints; on a portrait generation task, our method produces faces that enhance population heterogeneity relative to the auxiliary distributions.
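The sketch below illustrates one way such weight learning can be tractable, assuming the fused score takes the form of a convex combination $\sum_k \lambda_k s_k(x)$ of the auxiliary scores; the two Gaussian "checkpoints", the softmax parameterization of the weights, and all hyperparameters are illustrative assumptions rather than the paper's pipeline.

```python
# Toy sketch: learn fusion weights lambda so that sum_k lambda_k * s_k(x) matches
# the denoising score-matching target on limited data from the target population.
# The auxiliary scores are exact Gaussian scores standing in for pre-trained
# diffusion checkpoints.
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5                                        # noise level of the denoising objective

# "Pre-trained" auxiliary scores: N(-2, 1) and N(+2, 1) convolved with N(0, sigma^2).
aux_scores = [lambda x, m=m: -(x - m) / (1.0 + sigma**2) for m in (-2.0, 2.0)]

# Limited target data: a 30/70 mixture of the two auxiliary populations.
x0 = np.where(rng.random(200) < 0.3, rng.normal(-2, 1, 200), rng.normal(2, 1, 200))
eps = rng.normal(size=x0.shape)
x_noisy = x0 + sigma * eps
y = -eps / sigma                                   # denoising score-matching target
S = np.stack([fn(x_noisy) for fn in aux_scores], axis=1)   # (n, K) auxiliary scores

w = np.zeros(2)                                    # logits of the fusion weights
for _ in range(500):
    lam = np.exp(w) / np.exp(w).sum()
    r = S @ lam - y                                # residual of the fused score
    g = S.T @ r / len(y)                           # gradient of the quadratic loss in lambda
    w -= 0.5 * lam * (g - lam @ g)                 # chain rule through the softmax
print(np.exp(w) / np.exp(w).sum())                 # roughly recovers the 30/70 proportions
```

Since the fused score is linear in the weights, the denoising objective is quadratic in them, so the fit stays cheap even with very little target data (here it is solved by gradient descent on softmax logits).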


The Pivoting Framework: Frank-Wolfe Algorithms with Active Set Size Control

Mathieu Besançon · Sebastian Pokutta · Elias Wirth

We propose the pivoting meta algorithm (PM) to enhance optimization algorithms that generate iterates as convex combinations of vertices of a feasible region $C\subseteq \mathbb{R}^n$, including Frank-Wolfe (FW) variants. PM guarantees that the active set (the set of vertices in the convex combination) of the modified algorithm remains as small as $\dim(C)+1$, as stipulated by Carathéodory's theorem. PM achieves this by reformulating the active set expansion task as an equivalent linear program, which can be efficiently solved using a single pivot step akin to the primal simplex algorithm; the convergence rates of the original algorithms are maintained. Furthermore, we establish the connection between PM and active set identification, in particular showing under mild assumptions that PM applied to the away-step Frank-Wolfe algorithm or the blended pairwise Frank-Wolfe algorithm bounds the active set size by the dimension of the optimal face plus $1$. We provide numerical experiments to illustrate the practicality and efficacy of PM for active set size reduction.
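The size-control mechanism can be illustrated with a Carathéodory-style reduction: whenever the active set exceeds the dimension bound, an affine dependence among the active vertices gives a direction along which weights can be shifted until one of them hits zero, leaving the iterate unchanged. The sketch below shows this reduction using the ambient dimension as a stand-in for $\dim(C)$; it conveys the idea, while the paper implements the step as a single simplex-like pivot on an equivalent linear program.

```python
# Hypothetical helper (not the paper's code): drop one vertex from a convex
# combination without changing the represented point, whenever the active set
# exceeds the Caratheodory bound.
import numpy as np

def drop_one_vertex(vertices, weights):
    """vertices: (k, n) active vertices as rows; weights: (k,) positive convex weights."""
    k, n = vertices.shape
    if k <= n + 1:
        return vertices, weights                   # already within the Caratheodory bound
    # Find d != 0 with sum_i d_i v_i = 0 and sum_i d_i = 0 (an affine dependence).
    A = np.vstack([vertices.T, np.ones((1, k))])   # (n + 1, k): columns are [v_i; 1]
    d = np.linalg.svd(A)[2][-1]                    # null-space direction of A
    # Shift weights along -d until the first one hits zero (the "pivot").
    pos = d > 1e-12
    t = np.min(weights[pos] / d[pos])              # largest step keeping all weights >= 0
    new_w = weights - t * d                        # same point, same total weight 1
    keep = new_w > 1e-12
    return vertices[keep], new_w[keep]
```

Each call removes at least one vertex while leaving the iterate and the weight sum unchanged, so interleaving such a step with FW-style updates keeps the active set within the bound.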


Unbiased and Sign Compression in Distributed Learning: Comparing Noise Resilience via SDEs

Enea Monzio Compagnoni · Rustem Islamov · Frank Proske · Aurelien Lucchi

Distributed methods are essential for handling machine learning pipelines comprising large-scale models and datasets. However, their benefits often come at the cost of increased communication overhead between the central server and agents, which can become the main bottleneck in such systems, making training costly or even infeasible. Compression methods such as quantization and sparsification can alleviate this issue. Still, their robustness to large and heavy-tailed gradient noise, a phenomenon sometimes observed in language modeling, remains poorly understood. This work addresses this gap by analyzing Distributed Compressed SGD (DCSGD) and Distributed SignSGD (DSignSGD) using stochastic differential equations (SDEs). Our results show that DCSGD with unbiased compression is more vulnerable to noise in stochastic gradients, while DSignSGD remains robust, even under large and heavy-tailed noise. Additionally, we propose new scaling rules for hyperparameter tuning to mitigate performance degradation due to compression. These findings are empirically validated across multiple deep learning architectures and datasets, providing practical recommendations for distributed optimization.
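The contrast between the two compression schemes can be seen even in a toy experiment. The sketch below, a rough illustration rather than the paper's setup, compares unbiased rand-$k$ sparsification (DCSGD-style) with sign compression plus a majority vote (DSignSGD-style) on a quadratic objective whose stochastic gradients carry heavy-tailed Student-t noise; the noise model, step sizes, and dimensions are all assumptions.

```python
# Toy comparison of unbiased vs. sign compression under heavy-tailed gradient noise.
import numpy as np

rng = np.random.default_rng(0)
d, n_workers, k = 10, 8, 2
x_star = rng.normal(size=d)                        # minimizer of 0.5 * ||x - x_star||^2

def worker_grad(x):
    # Stochastic gradient with heavy-tailed (Student-t, df=2) noise on each worker.
    return (x - x_star) + rng.standard_t(df=2.0, size=d)

def rand_k(g):
    # Unbiased sparsifier: keep k random coordinates and rescale by d / k.
    mask = np.zeros(d)
    mask[rng.choice(d, k, replace=False)] = 1.0
    return (d / k) * mask * g

x_dcsgd = np.zeros(d)
x_dsign = np.zeros(d)
for t in range(2000):
    grads = [worker_grad(x_dcsgd) for _ in range(n_workers)]
    x_dcsgd -= 0.01 * np.mean([rand_k(g) for g in grads], axis=0)   # unbiased compression
    grads = [worker_grad(x_dsign) for _ in range(n_workers)]
    x_dsign -= 0.01 * np.sign(np.sum(np.sign(grads), axis=0))       # sign + majority vote

# Under this heavy-tailed noise the sign-compressed iterate is typically the more
# stable of the two; its bounded updates cap the damage any single noisy gradient can do.
print(np.linalg.norm(x_dcsgd - x_star), np.linalg.norm(x_dsign - x_star))
```

The bounded per-coordinate updates of the sign scheme are one intuition for the robustness contrast described above, whereas the unbiased scheme passes raw, occasionally huge, gradient magnitudes through to the iterate.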