### Session

## Learning Theory

Moderator: Andrew Gordon Wilson

**A constrained risk inequality for general losses**

John Duchi · Feng Ruan

We provide a general constrained risk inequality that applies to arbitrary non-decreasing losses, extending a result of Brown and Low [\emph{Ann.~Stat.~1996}]. Given two distributions $P_0$ and $P_1$, we find a lower bound for the risk of estimating a parameter $\theta(P_1)$ under $P_1$ given an upper bound on the risk of estimating the parameter $\theta(P_0)$ under $P_0$. The inequality is a useful pedagogical tool, as its proof relies only on the Cauchy-Schwartz inequality, it applies to general losses, and it transparently gives risk lower bounds on super-efficient and adaptive estimators.

**Misspecification in Prediction Problems and Robustness via Improper Learning**

Annie Marsden · John Duchi · Gregory Valiant

We study probabilistic prediction games when the underlying model is misspecified, investigating the consequences of predicting using an incorrect parametric model. We show that for a broad class of loss functions and parametric families of distributions, the regret of playing a ``proper'' predictor---one from the putative model class---relative to the best predictor in the same model class has lower bound scaling at least as $\sqrt{\gamma n}$, where $\gamma$ is a measure of the model misspecification to the true distribution in terms of total variation distance. In contrast, using an aggregation-based (improper) learner, one can obtain regret $d \log n$ for any underlying generating distribution, where $d$ is the dimension of the parameter; we exhibit instances in which this is unimprovable even over the family of all learners that may play distributions in the convex hull of the parametric family. These results suggest that simple strategies for aggregating multiple learners together should be more robust, and several experiments conform to this hypothesis.

**Minimax Optimal Regression over Sobolev Spaces via Laplacian Regularization on Neighborhood Graphs**

Alden Green · Sivaraman Balakrishnan · Ryan Tibshirani

In this paper we study the statistical properties of Laplacian smoothing, a graph-based approach to nonparametric regression. Under standard regularity conditions, we establish upper bounds on the error of the Laplacian smoothing estimator \smash{$\widehat{f}$}, and a goodness-of-fit test also based on \smash{$\widehat{f}$}. These upper bounds match the minimax optimal estimation and testing rates of convergence over the first-order Sobolev class $H^1(\mathcal{X})$, for $\mathcal{X} \subseteq \mathbb{R}^d$ and $1 \leq d < 4$; in the estimation problem, for $d = 4$, they are optimal modulo a $\log n$ factor. Additionally, we prove that Laplacian smoothing is manifold-adaptive: if $\mathcal{X} \subseteq \mathbb{R}^d$ is an $m$-dimensional manifold with $m < d$, then the error rate of Laplacian smoothing (in either estimation or testing) depends only on $m$, in the same way it would if $\mathcal{X}$ were a full-dimensional set in $\mathbb{R}^m$.

**Faster Kernel Interpolation for Gaussian Processes**

Mohit Yadav · Daniel Sheldon · Cameron Musco

A key challenge in scaling Gaussian Process (GP) regression to massive datasets is that exact inference requires computation with a dense n × n kernel matrix, where n is the number of data points. Significant work focuses on approximating the kernel matrix via interpolation using a smaller set of m “inducing points”. Structured kernel interpolation (SKI) is among the most scalable methods: by placing inducing points on a dense grid and using structured matrix algebra, SKI achieves per-iteration time of O(n + m log m) for approximate inference. This linear scaling in n enables approximate inference for very large data sets; however, the cost is per-iteration, which remains a limitation for extremely large n. We show that the SKI per-iteration time can be reduced to O(m log m) after a single O(n) time precomputation step by reframing SKI as solving a natural Bayesian linear regression problem with a fixed set of m compact basis functions. For a fixed grid, our new method scales to truly massive data sets: after the initial linear time pass, all subsequent computations are independent of n. We demonstrate speedups in practice for a wide range of m and n and for all the main GP inference tasks. With per-iteration complexity independent of the dataset size n for a fixed grid, our method scales to truly massive data sets. We demonstrate speedups in practice for a wide range of m and n and apply the method to GP inference on a three-dimensional weather radar dataset with over 100 million points.