### Session

## Optimization / Learning Theory / Generalization

Moderator: Cassio de Campos

**Projection-Free Optimization on Uniformly Convex Sets**

Thomas Kerdreux · Alexandre d'Aspremont · Sebastian Pokutta

The Frank-Wolfe method solves smooth constrained convex optimization problems at a generic sublinear rate of $\mathcal{O}(1/T)$, and it (or its variants) enjoys accelerated convergence rates for two fundamental classes of constraints: polytopes and strongly-convex sets. Uniformly convex sets non-trivially subsume strongly convex sets and form a large variety of \textit{curved} convex sets commonly encountered in machine learning and signal processing. For instance, the $\ell_p$-balls are uniformly convex for all $p > 1$, but strongly convex for $p\in]1,2]$ only. We show that these sets systematically induce accelerated convergence rates for the original Frank-Wolfe algorithm, which continuously interpolate between known rates. Our accelerated convergence rates emphasize that it is the curvature of the constraint sets -- not just their strong convexity -- that leads to accelerated convergence rates. These results also importantly highlight that the Frank-Wolfe algorithm is adaptive to much more generic constraint set structures, thus explaining faster empirical convergence. Finally, we also show accelerated convergence rates when the set is only locally uniformly convex around the optima and provide similar results in online linear optimization.

**Measure Transport with Kernel Stein Discrepancy**

Matthew Fisher · Tui Nolan · Matthew Graham · Dennis Prangle · Chris Oates

Measure transport underpins several recent algorithms for posterior approximation in the Bayesian context, wherein a transport map is sought to minimise the Kullback--Leibler divergence (KLD) from the posterior to the approximation. The KLD is a strong mode of convergence, requiring absolute continuity of measures and placing restrictions on which transport maps can be permitted. Here we propose to minimise a kernel Stein discrepancy (KSD) instead, requiring only that the set of transport maps is dense in an $L^2$ sense and demonstrating how this condition can be validated. The consistency of the associated posterior approximation is established and empirical results suggest that KSD is competitive and more flexible alternative to KLD for measure transport.

**Learn to Expect the Unexpected: Probably Approximately Correct Domain Generalization**

Vikas Garg · Adam Tauman Kalai · Katrina Ligett · Steven Wu

Domain generalization is the problem of machine learning when the training data and the test data come from different ``domains'' (data distributions). We propose an elementary theoretical model of the domain generalization problem, introducing the concept of a meta-distribution over domains. In our model, the training data available to a learning algorithm consist of multiple datasets, each from a single domain, drawn in turn from the meta-distribution. We show that our model can capture a rich range of learning phenomena specific to domain generalization for three different settings: learning with Massart noise, learning decision trees, and feature selection. We demonstrate approaches that leverage domain generalization to reduce computational or data requirements in each of these settings. Experiments demonstrate that our feature selection algorithm indeed ignores spurious correlations and improves generalization.

**Improving Adversarial Robustness via Unlabeled Out-of-Domain Data**

Zhun Deng · Linjun Zhang · Amirata Ghorbani · James Zou

Data augmentation by incorporating cheap unlabeled data from multiple domains is a powerful way to improve prediction especially when there is limited labeled data. In this work, we investigate how adversarial robustness can be enhanced by leveraging out-of-domain unlabeled data. We demonstrate that for broad classes of distributions and classifiers, there exists a sample complexity gap between standard and robust classification. We quantify the extent to which this gap can be bridged by leveraging unlabeled samples from a shifted domain by providing both upper and lower bounds. Moreover, we show settings where we achieve better adversarial robustness when the unlabeled data come from a shifted domain rather than the same domain as the labeled data. We also investigate how to leverage out-of-domain data when some structural information, such as sparsity, is shared between labeled and unlabeled domains. Experimentally, we augment object recognition datasets (CIFAR-10, CINIC-10, and SVHN) with easy-to-obtain and unlabeled out-of-domain data and demonstrate substantial improvement in the model's robustness against $\ell_\infty$ adversarial attacks on the original domain.