### Oral

## Oral: Optimization

##### Auditorium 1

**The sample complexity of ERMs in stochastic convex optimization**

Daniel Carmon · Amir Yehudayoff · Roi Livni

Stochastic convex optimization is one of the most well-studied models for learning in modern machine learning.Nevertheless, a central fundamental question in this setup remained unresolved:how many data points must be observed so that any empirical risk minimizer (ERM) shows good performance on the true population?This question was proposed by Feldman who proved that $\Omega(\frac{d}{\epsilon} + \frac{1}{\epsilon^2} )$ data points are necessary (where $d$ is the dimension and $\epsilon > 0$ the accuracy parameter). Proving an $\omega(\frac{d}{\epsilon} + \frac{1}{\epsilon^2})$ lower bound was left as an open problem. In this work we show that in fact $\tilde{O}(\frac{d}{\epsilon} + \frac{1}{\epsilon^2})$ data points are also sufficient. This settles the question and yields a new separation between ERMs and uniform convergence.This sample complexity holds for the classical setup of learning bounded convex Lipschitz functions over the Euclidean unit ball. We further generalize the result and show that a similar upper bound holds for all symmetric convex bodies. The general bound is composed of two terms: (i) a term of the form $\tilde{O}(\frac{d}{\epsilon})$ with an inverse-linear dependence on the accuracy parameter, and (ii) a term that depends on the statistical complexity of the class of linear functions (captured by the Rademacher complexity). The proof builds a mechanism for controlling the behavior of stochastic convex optimization problems.

**Stochastic Methods in Variational Inequalities: Ergodicity, Bias and Refinements**

Emmanouil Vasileios Vlatakis-Gkaragkounis · Angeliki Giannou · Yudong Chen · Qiaomin Xie

For min-max optimization and variational inequalities problems (VIPs), Stochastic Extragradient (SEG) and Stochastic Gradient Descent Ascent (SGDA) have emerged as preeminent algorithms. Constant step-size versions of SEG/SGDA have gained popularity due to several appealing benefits, but their convergence behaviors are complicated even in rudimentary bilinear models. Our work elucidates the probabilistic behavior of these algorithms and their projected variants, for a wide range of monotone and non-monotone VIPs with potentially biased stochastic oracles. By recasting them as time-homogeneous Markov Chains, we establish geometric convergence to a unique invariant distribution and aymptotic normality. Specializing to min-max optimization, we characterize the relationship between the step-size and the induced bias with respect to the global solution, which in turns allows for bias refinement via the Richardson-Romberg scheme. Our theoretical analysis is corroborated by numerical experiments.

**Absence of spurious solutions far from ground truth: A low-rank analysis with high-order losses**

Ziye Ma · Ying Chen · Javad Lavaei · Somayeh Sojoudi

Matrix sensing problems exhibit pervasive non-convexity, plaguing optimization with a proliferation of suboptimal spurious solutions. Avoiding convergence to these critical points poses a major challenge. This work provides new theoretical insights that help demystify the intricacies of the non-convex landscape. In this work, we prove that under certain conditions, critical points sufficiently distant from the ground truth matrix exhibit favorable geometry by being strict saddle points rather than troublesome local minima. Moreover, we introduce the notion of higher-order losses for the matrix sensing problem and show that the incorporation of such losses into the objective function amplifies the negative curvature around those distant critical points. This implies that increasing the complexity of the objective function via high-order losses accelerates the escape from such critical points and acts as a desirable alternative to increasing the complexity of the optimization problem via over-parametrization. By elucidating key characteristics of the non-convex optimization landscape, this work makes progress towards a comprehensive framework for tackling broader machine learning objectives plagued by non-convexity.

We consider the problem of graph searching with prediction recently introduced by Banerjee et al. (2023). In this problem, an agent starting at some vertex r has to traverse a (potentially unknown) graph G to find a hidden goal node g while minimizing the total distance traveled. We study a setting in which at any node v, the agent receives a noisy estimate of the distance from v to g. We design algorithms for this search task on unknown graphs. We establish the first formal guarantees on unknown weighted graphs and provide lower bounds showing that the algorithms we propose have optimal or nearly-optimal dependence on the prediction error. Further, we perform numerical experiments demonstrating that in addition to being robust to adversarial error, our algorithms perform well in typical instances in which the error is stochastic. Finally, we provide simpler performance bounds on the algorithms of Banerjee et al. (2023) for the case of searching on a known graph and establish new lower bounds for this setting.

**Graph Partitioning with a Move Budget**

Mina Dalirrooyfard · Elaheh Fata · Majid Behbahani · Yuriy Nevmyvaka

In many real world networks, there already exists a (not necessarily optimal) $k$-partitioning of the network. Oftentimes, for such networks, one aims to find a $k$-partitioning with a smaller cut value by moving only a few nodes across partitions. The number of nodes that can be moved across partitions is often a constraint forced by budgetary limitations. Motivated by such real-world applications, we introduce and study the $r$-move $k$-partitioning~problem, a natural variant of the Multiway cut problem. Given a graph, a set of $k$ terminals and an initial partitioning of the graph, the $r$-move $k$-partitioning~problem aims to find a $k$-partitioning with the minimum-weighted cut among all the $k$-partitionings that can be obtained by moving at most $r$ non-terminal nodes to partitions different from their initial ones. Our main result is a polynomial time $3(r+1)$ approximation algorithm for this problem. We further show that this problem is $W[1]$-hard, and give an FPTAS for when $r$ is a small constant.