**Unlabeled Data Help: Minimax Analysis and Adversarial Robustness**

Yue Xing · Qifan Song · Guang Cheng

The recent proposed self-supervised learning (SSL) approaches successfully demonstrate the great potential of supplementing learning algorithms with additional unlabeled data. However, it is still unclear whether the existing SSL algorithms can fully utilize the information of both labelled and unlabeled data. This paper gives an affirmative answer for the reconstruction-based SSL algorithm (Lee et al., 2020) under several statistical models. While existing literature only focuses on establishing the upper bound of the convergence rate, we provide a rigorous minimax analysis, and successfully justify the rate-optimality of the reconstruction-based SSL algorithm under different data generation models. Furthermore, we incorporate the reconstruction-based SSL into the exist- ing adversarial training algorithms and show that learning from unlabeled data helps improve the robustness.

**A Bayesian Approach for Stochastic Continuum-armed Bandit with Long-term Constraints**

Zai Shi · Atilla Eryilmaz

Despite many valuable advances in the domainof online convex optimization over thelast decade, many machine learning and networkingproblems of interest do not fit intothat framework due to their nonconvex objectivesand the presence of constraints. Thismotivates us in this paper to go beyond convexityand study the problem of stochasticcontinuum-armed bandit with long-termconstraints. For noiseless observations ofconstraint functions, we propose a genericmethod using a Bayesian approach based ona class of penalty functions, and prove thatit can achieve a sublinear regret with respectto the global optimum and a sublinear constraintviolation (CV), which can match thebest results of previous methods. Additionally,we propose another method to deal withthe case where constraint functions are observedwith noise, which can achieve a sublinearregret and a sublinear CV with more assumptions.Finally, we use two experimentsto compare our methods with two benchmarkmethods in online optimization and Bayesianoptimization, which demonstrates the advantagesof our algorithms.

**Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning**

Yongchan Kwon · James Zou

Data Shapley has recently been proposed as a principled framework to quantify the contribution of individual datum in machine learning. It can effectively identify helpful or harmful data points for a learning algorithm. In this paper, we propose Beta Shapley, which is a substantial generalization of Data Shapley. Beta Shapley arises naturally by relaxing the efficiency axiom of the Shapley value, which is not critical for machine learning settings. Beta Shapley unifies several popular data valuation methods and includes data Shapley as a special case. Moreover, we prove that Beta Shapley has several desirable statistical properties and propose efficient algorithms to estimate it. We demonstrate that Beta Shapley outperforms state-of-the-art data valuation methods on several downstream ML tasks such as: 1) detecting mislabeled training data; 2) learning with subsamples; and 3) identifying points whose addition or removal have the largest positive or negative impact on the model.

**Synthsonic: Fast, Probabilistic modeling and Synthesis of Tabular Data**

Max Baak · Simon Brugman · Ilan Fridman Rojas · Lorraine d'Almeida · Ralph Urlus · Jean-Baptiste Oger

The creation of realistic, synthetic datasets has several purposes with growing demand in recent times, e.g. privacy protection and other cases where real data cannot be easily shared. A multitude of primarily neural networks (NNs), e.g. Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), or Bayesian Network (BN) approaches have been created to tackle this problem, however these require extensive compute resources, lack interpretability, and in some instances lack replication fidelity as well. We propose a hybrid, probabilistic approach for synthesizing pairwise independent tabular data, called Synthsonic. A sequence of well-understood, invertible statistical transformations removes first-order correlations, then a Bayesian Network jointly models continuous and categorical variables, and a calibrated discriminative learner captures the remaining dependencies. Replication studies on MIT's SDGym benchmark show marginally or significantly better performance than all prior BN-based approaches, while being competitive with NN-based approaches (first place in 10 out of 13 benchmark datasets). The computational time required to learn the data distribution is at least one order of magnitude lower than the NN methods. Furthermore, inspecting intermediate results during the synthetic data generation allows easy diagnostics and tailored corrections. We believe the combination of out-of-the-box performance, speed and interpretability make this method a significant addition to the synthetic data generation

We study meta-learning in Markov Decision Processes (MDP) with linear transition models in the undiscounted episodic setting.Under a task sharedness metric based on model proximity we study task families characterized by a distribution over models specified by a bias term and a variance component. We then propose BUC-MatrixRL, a version of the UC-Matrix RL algorithm and show it can meaningfully leverage a set of sampled training tasks to quickly solve a test task sampled from the same task distribution by learning an estimator of the bias parameter of the task distribution. The analysis leverages and extends results in the learning to learn linear regression and linear bandit setting to the more general case of MDP's with linear transition models. We prove that compared to learning the tasks in isolation, BUC-Matrix RL provides significant improvements in the transfer regret for high bias low variance task distributions.

**Differentially Private Histograms under Continual Observation: Streaming Selection into the Unknown**

Adrian Rivera Cardoso · Ryan Rogers

We generalize the continuous observation privacy setting from Dwork et al. and Chan et al. by allowing each event in a stream to be a subset of some (possibly unknown) universe of items. We design differentially private (DP) algorithms for histograms in several settings, including top-k selection, with privacy loss that scales with polylog(T), where T is the maximum length of the input stream. We present a meta-algorithm that can use existing one-shot top-k private algorithms as a subroutine to continuously release DP histograms from a stream. Further, we present more practical DP algorithms for two settings: 1) continuously releasingthe top-k counts from a histogram over a known domain when an event can consist of an arbitrary number of items, and 2) continuously releasing histograms over an unknown domain when an event has a limited number of items.

**Many processors, little time: MCMC for partitions via optimal transport couplings**

Tin Nguyen · Brian Trippe · Tamara Broderick

Markov chain Monte Carlo (MCMC) methods are often used in clustering since they guarantee asymptotically exact expectations in the infinite-time limit. In finite time, though, slow mixing often leads to poor performance. Modern computing environments offer massive parallelism, but naive implementations of parallel MCMC can exhibit substantial bias. In MCMC samplers of continuous random variables, Markov chain couplings can overcome bias. But these approaches depend crucially on paired chains meetings after a small number of transitions. We show that straightforward applications of existing coupling ideas to discrete clustering variables fail to meet quickly. This failure arises from the "label-switching problem": semantically equivalent cluster relabelings impede fast meeting of coupled chains. We instead consider chains as exploring the space of partitions rather than partitions' (arbitrary) labelings. Using a metric on the partition space, we formulate a practical algorithm using optimal transport couplings. Our theory confirms our method is accurate and efficient. In experiments ranging from clustering of genes or seeds to graph colorings, we show the benefits of our coupling in the highly parallel, time-limited regime.

**Optimizing Early Warning Classifiers to Control False Alarms via a Minimum Precision Constraint**

Preetish Rath · Michael Hughes

Early warning prediction systems can suffer from high false alarm rates that limit utility, especially in settings with high class imbalance such as healthcare. Despite the widespread need to control false alarms, the dominant classifier training paradigm remains minimizing cross entropy, a loss function which does not treat false alarms differently than other types of mistakes. While existing efforts often try to reduce false alarms by post-hoc threshold selection after training, we suggest a comprehensive solution by changing the loss function used to train the classifier. Our proposed objective maximizes recall while enforcing a constraint requiring precision to exceed a specified value. We make our objective tractable for gradient-based optimization by developing tight sigmoidal bounds on the counts needed to compute precision and recall. Our objective is applicable to any classifier trainable via gradient descent, including linear models and neural networks. When predicting mortality risk across two large hospital datasets, we show how our method satisfies a desired constraint on false alarms while achieving better recall than alternatives.

Online testing procedures aim to control the extent of false discoveries over a sequence of hypothesis tests, allowing for the possibility that early-stage test results influence the choice of hypotheses to be tested in later stages. Typically, online methods assume that a permanent decision regarding the current test (reject or not reject) must be made before advancing to the next test. We instead assume that each hypothesis requires an immediate preliminary decision, but also allows us to update that decision until a preset deadline. Roughly speaking, this lets us apply a Benjamini-Hochberg-type procedure over a moving window of hypotheses, where the threshold parameters for upcoming tests can be determined based on preliminary results. We show that our approach can control the false discovery rate (FDR) at every stage of testing, even under arbitrary p-value dependencies. That said, our approach offers much greater flexibility if the p-values exhibit a known independence structure. For example, if the p-value sequence is finite and all p-values are independent, then we can additionally control FDR at adaptively chosen stopping times.

**Efficient Kernelized UCB for Contextual Bandits**

Houssam Zenati · Alberto Bietti · Eustache Diemert · Julien Mairal · Matthieu Martin · Pierre Gaillard

In this paper, we tackle the computational efficiency of kernelized UCB algorithms in contextual bandits. While standard methods require a $\mathcal{O}(CT^3)$ complexity where~$T$ is the horizon and the constant $C$ is related to optimizing the UCB rule, we propose an efficient contextual algorithm for large-scale problems. Specifically, our method relies on incremental Nystr\"om approximations of the joint kernel embedding of contexts and actions. This allows us to achieve a complexity of $\mathcal{O}(CTm^2)$ where $m$ is the number of Nystr\"om points. To recover the same regret as the standard kernelized UCB algorithm, $m$ needs to be of order of the effective dimension of the problem, which is at most $\mathcal{O}(\sqrt{T})$ and nearly constant in some cases.

**Offline Policy Selection under Uncertainty**

Sherry Yang · Bo Dai · Ofir Nachum · George Tucker · Dale Schuurmans

The presence of uncertainty in policy evaluation significantly complicates the process of policy ranking and selection in real-world settings. We formally consider offline policy selection as learning preferences over a set of policy prospects given a fixed experience dataset. While one can select or rank policies based on point estimates of their expected values or high-confidence intervals, access to the full distribution over one's belief of the policy value enables more flexible selection algorithms under a wider range of downstream evaluation metrics. We propose a Bayesian approach for estimating this belief distribution in terms of posteriors of distribution correction ratios derived from stochastic constraints. Empirically, despite being Bayesian, the credible intervals obtained are competitive with state-of-the-art frequentist approaches in confidence interval estimation. More importantly, we show how the belief distribution may be used to rank policies with respect to arbitrary downstream policy selection metrics, and empirically demonstrate that this selection procedure significantly outperforms existing approaches, such as ranking policies according to mean or high-confidence lower bound value estimates.

**QLSD: Quantised Langevin Stochastic Dynamics for Bayesian Federated Learning**

Maxime Vono · Vincent Plassier · Alain Durmus · Aymeric Dieuleveut · Eric Moulines

The objective of Federated Learning (FL) is to perform statistical inference for data which are decentralised and stored locally on networked clients. FL raises many constraints which include privacy and data ownership, communication overhead, statistical heterogeneity, and partial client participation. In this paper, we address these problems in the framework of the Bayesian paradigm. To this end, we propose a novel federated Markov Chain Monte Carlo algorithm, referred to as Quantised Langevin Stochastic Dynamics which may be seen as an extension to the FL setting of Stochastic Gradient Langevin Dynamics, which handles the communication bottleneck using gradient compression. To improve performance, we then introduce variance reduction techniques, which lead to two improved versions coined \texttt{QLSD}$^\star$ and \texttt{QLSD}$^{++}$. We give both non-asymptotic and asymptotic convergence guarantees for the proposed algorithms. We illustrate their performances using various Bayesian Federated Learning benchmarks.

Markov Chain Monte Carlo (MCMC) algorithms ubiquitously employ complex deterministic transformations to generate proposal points that are then filtered by the Metropolis-Hastings-Green (MHG) test. However, the condition of the target measure invariance puts restrictions on the design of these transformations. In this paper, we first derive the acceptance test for the stochastic Markov kernel considering arbitrary deterministic maps as proposal generators. When applied to the transformations with orbits of period two (involutions), the test reduces to the MHG test. Based on the derived test we propose two practical algorithms: one operates by constructing periodic orbits from any diffeomorphism, another on contractions of the state space (such as optimization trajectories). Finally, we perform an empirical study demonstrating the practical advantages of both kernels.

**Adaptive Sampling for Heterogeneous Rank Aggregation from Noisy Pairwise Comparisons**

Yue Wu · Tao Jin · Hao Lou · PAN XU · Farzad Farnoud · Quanquan Gu

In heterogeneous rank aggregation problems, users often exhibit various accuracy levels when comparing pairs of items. Thus, a uniform querying strategy over users may not be optimal. To address this issue, we propose an elimination-based active sampling strategy, which estimates the ranking of items via noisy pairwise comparisons from multiple users and improves the users' average accuracy by maintaining an active set of users. We prove that our algorithm can return the true ranking of items with high probability. We also provide a sample complexity bound for the proposed algorithm, which outperforms the non-active strategies in the literature and close to oracle under mild conditions. Experiments are provided to show the empirical advantage of the proposed methods over the state-of-the-art baselines.

Consider the rank-1 spiked model: $\bf{X}=\sqrt{\nu}\xi \bf{u}+ \bf{Z}$, where $\nu$ is the spike intensity, $\bf{u}\in\mathbb{S}^{k-1}$ is an unknown direction and $\xi\sim \mathcal{N}(0,1),\bf{Z}\sim \mathcal{N}(\bf{0},\bf{I})$. Motivated by recent advances in analog-to-digital conversion, we study the problem of recovering $\bf{u}\in \mathbb{S}^{k-1}$ from $n$ i.i.d. modulo-reduced measurements $\bf{Y}=[\bf{X}]\mod \Delta$, focusing on the high-dimensional regime ($k\gg 1$). We develop and analyze an algorithm that, for most directions $\bf{u}$ and $\nu=\mathrm{poly}(k)$, estimates $\bf{u}$ to high accuracy using $n=\mathrm{poly}(k)$ measurements, provided that $\Delta\gtrsim \sqrt{\log k}$. Up to constants, our algorithm accurately estimates $\bf{u}$ at the smallest possible $\Delta$ that allows (in an information-theoretic sense) to recover $\bf{X}$ from $\bf{Y}$. A key step in our analysis involves estimating the probability that a line segment of length $\approx\sqrt{\nu}$ in a random direction $\bf{u}$ passes near a point in the lattice $\Delta \mathbb{Z}^k$. Numerical experiments show that the developed algorithm performs well even in a non-asymptotic setting.

A key challenge in smooth games is that there is no general guarantee for gradient methods to converge to an equilibrium. Recently, Chavdarova et al. (2021) reported a promising empirical observation that Lookahead (Zhang et al., 2019) significantly improves GAN training. While promising, few theoretical guarantees has been studied for Lookahead in smooth games. In this work, we establish the first convergence guarantees of Lookahead for smooth games. We present a spectral analysis and provide a geometric explanation of how and when it actually improves the convergence around a stationary point. Based on the analysis, we derive sufficient conditions for Lookahead to stabilize or accelerate the local convergence in smooth games. Our study reveals that Lookahead provides a general mechanism for stabilization and acceleration in smooth games.

Informed Markov chain Monte Carlo (MCMC) methods have been proposed as scalable solutions to Bayesian posterior computation on high-dimensional discrete state spaces, but theoretical results about their convergence behavior in general settings are lacking. In this article, we propose a class of MCMC schemes called informed importance tempering (IIT), which combine importance sampling and informed local proposals, and derive generally applicable spectral gap bounds for IIT estimators. Our theory shows that IIT samplers have remarkable scalability when the target posterior distribution concentrates on a small set. Further, both our theory and numerical experiments demonstrate that the informed proposal should be chosen with caution: the performance may be very sensitive to the shape of the target distribution. We find that the ``square-root proposal weighting'' scheme tends to perform well in most settings.

As an example of the nonlinear Fokker-Planck equation, the \textit{mean field Langevin dynamics} recently attracts attention due to its connection to (noisy) gradient descent on infinitely wide neural networks in the mean field regime, and hence the convergence property of the dynamics is of great theoretical interest. In this work, we give a concise and self-contained convergence rate analysis of the mean field Langevin dynamics with respect to the (regularized) objective function in both continuous and discrete time settings. The key ingredient of our proof is a \textit{proximal Gibbs distribution} $p_q$ associated with the dynamics, which, in combination with techniques in \cite{vempala2019rapid}, allows us to develop a simple convergence theory parallel to classical results in convex optimization. Furthermore, we reveal that $p_q$ connects to the \textit{duality gap} in the empirical risk minimization setting, which enables efficient empirical evaluation of the algorithm convergence.

**Density Ratio Estimation via Infinitesimal Classification**

Kristy Choi · Chenlin Meng · Yang Song · Stefano Ermon

Density ratio estimation (DRE) is a fundamental machine learning technique for comparing two probability distributions.However, existing methods struggle in high-dimensional settings, as it is difficult to accurately compare probability distributions based on finite samples.In this work we propose DRE-$\infty$, a divide-and-conquer approach to reduce DRE to a series of easier subproblems. Inspired by Monte Carlo methods, we smoothly interpolate between the two distributions via an infinite continuum of intermediate bridge distributions. We then estimate the instantaneous rate of change of the bridge distributions indexed by time (the ``time score'')---a quantity defined analogously to data (Stein) scores---with a novel time score matching objective. Crucially, the learned time scores can then be integrated to compute the desired density ratio. In addition, we show that traditional (Stein) scores can be used to obtain integration paths that connect regions of high density in both distributions, improving performance in practice. Empirically, we demonstrate that our approach performs well on downstream tasks such as mutual information estimation and energy-based modeling on complex, high-dimensional datasets.

**Sketch-and-lift: scalable subsampled semidefinite program for K-means clustering**

Yubo Zhuang · Xiaohui Chen · Yun Yang

Semidefinite programming (SDP) is a powerful tool for tackling a wide range of computationally hard problems such as clustering. Despite the high accuracy, semidefinite programs are often too slow in practice with poor scalability on large (or even moderate) datasets. In this paper, we introduce a linear time complexity algorithm for approximating an SDP relaxed K-means clustering. The proposed sketch-and-lift (SL) approach solves an SDP on a subsampled dataset and then propagates the solution to all data points by a nearest-centroid rounding procedure. It is shown that the SL approach enjoys a similar exact recovery threshold as the K-means SDP on the full dataset, which is known to be information-theoretically tight under the Gaussian mixture model. The SL method can be made adaptive with enhanced theoretic properties when the cluster sizes are unbalanced. Our simulation experiments demonstrate that the statistical accuracy of the proposed method outperforms state-of-the-art fast clustering algorithms without sacrificing too much computational efficiency, and is comparable to the original K-means SDP with substantially reduced runtime.

**Learning and Generalization in Overparameterized Normalizing Flows**

Kulin Shah · Amit Deshpande · Navin Goyal

In supervised learning, it is known that overparameterized neural networks with one hidden layer provably and efficiently learn and generalize, when trained using stochastic gradient descent with a sufficiently small learning rate and suitable initialization. In contrast, the benefit of overparameterization in unsupervised learning is not well understood. Normalizing flows (NFs) constitute an important class of models in unsupervised learning for sampling and density estimation. In this paper, we theoretically and empirically analyze these models when the underlying neural network is a one-hidden-layer overparametrized network. Our main contributions are two-fold: (1) On the one hand, we provide theoretical and empirical evidence that for constrained NFs (this class of NFs underlies most NF constructions) with the one-hidden-layer network, overparametrization hurts training. (2) On the other hand, we prove that unconstrained NFs, a recently introduced model, can efficiently learn any reasonable data distribution under minimal assumptions when the underlying network is overparametrized and has one hidden-layer.

We study balancing weight estimators, which reweight outcomes from a source population to estimate missing outcomes in a target population. These estimators minimize the worst-case error by making an assumption about the outcome model. In this paper, we show that this outcome assumption has two immediate implications. First, we can replace the minimax optimization problem for balancing weights with a simple convex loss over the assumed outcome function class. Second, we can replace the commonly-made overlap assumption with a more appropriate quantitative measure, the minimum worst-case bias. Finally, we show conditions under which the weights remain robust when our assumptions on the outcomes are wrong.

**Convergence of Langevin Monte Carlo in Chi-Squared and Rényi Divergence**

Murat Erdogdu · Rasa Hosseinzadeh · Shunshi Zhang

We study sampling from a target distribution $\nu_* = e^{-f}$ using the unadjusted Langevin Monte Carlo (LMC) algorithm when the potential $f$ satisfies a strong dissipativity condition and it is first-order smooth with a Lipschitz gradient. We prove that, initialized with a Gaussian random vector that has sufficiently small variance, iterating the LMC algorithm for $\widetilde{\mathcal{O}}(\lambda^2 d\epsilon^{-1})$ steps is sufficient to reach $\epsilon$-neighborhood of the target in both Chi-squared and Rényi divergence, where $\lambda$ is the logarithmic Sobolev constant of $\nu_*$. Our results do not require warm-start to deal with the exponential dimension dependency in Chi-squared divergence at initialization.In particular, for strongly convex and first-order smooth potentials, we show that the LMC algorithm achieves the rate estimate $\widetilde{\mathcal{O}}(d\epsilon^{-1})$ which improves the previously known rates in both of these metrics, under the same assumptions. Translating this rate to other metrics, our results also recover the state-of-the-art rate estimates in KL divergence, total variation and $2$-Wasserstein distance in the same setup. Finally, as we rely on the logarithmic Sobolev inequality, our framework covers a range of non-convex potentials that are first-order smooth and exhibit strong convexity outside of a compact region.

**PAC Mode Estimation using PPR Martingale Confidence Sequences**

Shubham A Jain · Rohan Shah · Sanit Gupta · Denil Mehta · Inderjeet Nair · Jian Vora · Sushil Khyalia · Sourav Das · Vinay Ribeiro · Shivaram Kalyanakrishnan

We consider the problem of correctly identifying the mode of a discrete distribution $\mathcal{P}$ with sufficiently high probability by observing a sequence of i.i.d. samples drawn from $\mathcal{P}$. This problem reduces to the estimation of a single parameter when $\mathcal{P}$ has a support set of size $K = 2$. After noting that this special case is handled very well by prior-posterior-ratio (PPR) martingale confidence sequences (Waudby-Smith and Ramdas, 2020), we propose a generalisation to mode estimation, in which $\mathcal{P}$ may take $K \geq 2$ values. To begin, we show that the "one-versus-one" principle to generalise from $K = 2$ to $K \geq 2$ classes is more efficient than the "one-versus-rest" alternative. We then prove that our resulting stopping rule, denoted PPR-1v1, is asymptotically optimal (as the mistake probability is taken to 0). PPR-1v1 is simple and computationally light, and incurs significantly fewer samples than competitors even in the non-asymptotic regime. We demonstrate its gains in two practical applications of sampling: election forecasting and verification of smart contracts in blockchains.

**Conditionally Tractable Density Estimation using Neural Networks**

Hailiang Dong · Chiradeep Roy · Tahrima Rahman · Vibhav Gogate · Nicholas Ruozzi

Tractable models such as cutset networks and sum-product networks (SPNs) have become increasingly popular because they have superior predictive performance. Among them, cutset networks, which model the mechanics of Pearl’s cutset conditioning algorithm, demonstrate great scalability and prediction accuracy. Existing research on cutset networks has mainly focused on discrete domains, and the best mechanism to extend cutset networks to continuous domains is unclear. We propose one possible alternative to cutset networks that models the full joint distribution as the product of a local, complex distribution over a small subset of variables and a fully tractable conditional distribution whose parameters are controlled using a neural network. This model admits exact inference when all variables in the local distribution are observed, and although the model is not fully tractable in general, we show that ``cutset'' sampling can be employed to efficiently generate accurate predictions in practice. We show that our model performs comparably or better than existing competitors through a variety of prediction tasks on real datasets.

Kernel methods are a highly effective and widely used collection of modern machine learning algorithms. A fundamental limitation of virtually all such methods are computations involving the kernel matrix that naïvely scale quadratically (e.g., matrix-vector multiplication) or cubically (solving linear systems) with the size of the dataset N. We propose the Fast Kernel Transform (FKT), a general algorithm to compute matrix-vector multiplications (MVMs) for datasets in moderate dimensions with quasilinear complexity. Typically, analytically grounded fast multiplication methods require specialized development for specific kernels. In contrast, our scheme is based on auto-differentiation and automated symbolic computations that leverage the analytical structure of the underlying kernel. This allows the FKT to be easily applied to a broad class of kernels, including Gaussian, Matérn, and Rational Quadratic covariance functions and Green's functions, including those of the Laplace and Helmholtz equations. Furthermore, the FKT maintains a high, quantifiable, and controllable level of accuracy---properties that many acceleration methods lack. We illustrate the efficacy and versatility of the FKT by providing timing and accuracy benchmarks with comparisons to adjacent methods, and by applying it to scale the stochastic neighborhood embedding (t-SNE) and Gaussian processes to large real-world datasets.

**Controlling Epidemic Spread using Probabilistic Diffusion Models on Networks**

Amy Babay · Michael Dinitz · Aravind Srinivasan · Leonidas Tsepenekas · Anil Vullikanti

The spread of an epidemic is often modeled by an SIR random process on a social network graph. The MinInfEdge problem for optimal social distancing involves minimizing the expected number of infections, when we are allowed to break at most B edges; similarly the MinInfNode problem involves removing at most B vertices. These are fundamental problems in epidemiology and network science. While a number of heuristics have been considered, the complexity of this problem remains generally open. In this paper, we present two bicriteria approximation algorithms for the MinInfEdge problem, which give the first non-trivial approximations for this problem. The first is based on the cut sparsification result technique of Karger, which works for any graph, when the transmission probabilities are not too small. The second is a Sample Average Approximation (SAA) based algorithm, which we analyze forthe Chung-Lu random graph model. We also extend some of our results for the MinInfNode problem.

Often machine learning and statistical models will attempt to describe the majority of the data. However, there may be situations where only a fraction of the data can be fit well by a linear regression model. Here, we are interested in a case where such inliers can be identified by a Disjunctive Normal Form (DNF) formula. We give a polynomial time algorithm for the conditional linear regression task, which identifies a DNF condition together with the linear predictor on the corresponding portion of the data. In this work, we improve on previous algorithms by removing a requirement that the covariances of the data satisfying each of the terms of the condition have to all be very similar in spectral norm to the covariance of the overall condition.

Programmatic weak supervision creates models without hand-labeled training data by combining the outputs of heuristic labelers. Existing frameworks make the restrictive assumption that labelers output a single class label. Enabling users to create partial labelers that output subsets of possible class labels would greatly expand the expressivity of programmatic weak supervision. We introduce this capability by defining a probabilistic generative model that can estimate the underlying accuracies of multiple noisy partial labelers without ground truth labels. We show how to scale up learning, for example learning on 100k examples in one minute, a 300$\times$ speed up compared to a naive implementation. We also prove that this class of models is generically identifiable up to label swapping under mild conditions. We evaluate our framework on three text classification and six object classification tasks. On text tasks, adding partial labels increases average accuracy by 8.6 percentage points. On image tasks, we show that partial labels allow us to approach some zero-shot object classification problems with programmatic weak supervision by using class attributes as partial labelers. On these tasks, our framework has accuracy comparable to recent embedding-based zero-shot learning methods, while using only pre-trained attribute detectors.

**Firebolt: Weak Supervision Under Weaker Assumptions**

Charles Kwong · Chidubem Arachie · Bangyong Liang · Pradyumna Narayana · Giulia DeSalvo · Michael Quinn · Bert Huang · Geoffrey Downs · Yang Yang

Modern machine learning demands a large amount of training data. Weak supervision is a promising approach to meet this demand. It aggregates multiple labeling functions (LFs)—noisy, user-provided labeling heuristics---to rapidly and cheaply curate probabilistic labels for large-scale unlabeled data. However, standard assumptions in weak supervision---such as user-specified class balance, similar accuracy of an LF in classifying different classes, and full knowledge of LF dependency at inference time---might be undesirable in practice. In response, we present Firebolt, a new weak supervision framework that seeks to operate under weaker assumptions. In particular, Firebolt learns the class balance and class-specific accuracy of LFs jointly from unlabeled data. It carries out inference in an efficient and interpretable manner. We analyze the parameter estimation error of Firebolt and characterize its impact on downstream model performance. Furthermore, we show that on five publicly available datasets, Firebolt outperforms a state-of-the-art weak supervision method by up to 5.8 points in AUC. We also provide a case study in the production setting of a tech company, where a Firebolt-supervised model outperforms the existing weakly-supervised production model by 1.3 points in AUC and speedup label model training and inference from one hour to three minutes.

**Efficient Hyperparameter Tuning for Large Scale Kernel Ridge Regression**

Giacomo Meanti · Luigi Carratino · Ernesto De Vito · Lorenzo Rosasco

Kernel methods provide a principled approach to nonparametric learning. While their basic implementations scale poorly to large problems, recent advances showed that approximate solvers can efficiently handle massive datasets.A shortcoming of these solutions is that hyperparameter tuning is not taken care of, and left for the user to perform.Hyperparameters are crucial in practice and the lack of automated tuning greatly hinders efficiency and usability. In this paper, we work to fill in this gap focusing on kernel ridge regression based on the Nystr\"om approximation.After reviewing and contrasting a number of hyperparameter tuning strategies,we propose a complexity regularization criterion based on a data dependent penalty, and discuss its efficient optimization.Then, we proceed to a careful and extensive empirical evaluation highlighting strengths and weaknesses of the different tuning strategies. Our analysis shows the benefit of the proposed approach, that we hence incorporate in a library for large scale kernel methods to derive adaptively tuned solutions.

**A prior-based approximate latent Riemannian metric**

Georgios Arvanitidis · Bogdan Georgiev · Bernhard Schölkopf

Stochastic generative models enable us to capture the geometric structure of a data manifold lying in a high dimensional space through a Riemannian metric in the latent space. However, its practical use is rather limited mainly due to inevitable functionality problems and computational complexity. In this work we propose a surrogate conformal Riemannian metric in the latent space of a generative model that is simple, efficient and robust. This metric is based on a learnable prior that we propose to learn using a basic energy-based model. We theoretically analyze the behavior of the proposed metric and show that it is sensible to use in practice. We demonstrate experimentally the efficiency and robustness, as well as the behavior of the new approximate metric. Also, we show the applicability of the proposed methodology for data analysis in the life sciences.

**Expressivity of Neural Networks via Chaotic Itineraries beyond Sharkovsky's Theorem**

Clayton Sanford · Vaggos Chatziafratis

Given a target function $f$, how large must a neural network be in order to approximate $f$? Recent works examine this basic question on neural network \textit{expressivity} from the lens of dynamical systems and provide novel ``depth-vs-width'' tradeoffs for a large family of functions $f$. They suggest that such tradeoffs are governed by the existence of \textit{periodic} points or \emph{cycles} in $f$. Our work, by further deploying dynamical systems concepts, illuminates a more subtle connection between periodicity and expressivity: we prove that periodic points alone lead to suboptimal depth-width tradeoffs and we improve upon them by demonstrating that certain ``chaotic itineraries'' give stronger exponential tradeoffs, even in regimes where previous analyses only imply polynomial gaps. Contrary to prior works, our bounds are nearly-optimal, tighten as the period increases, and handle strong notions of inapproximability (e.g., constant $L_1$ error). More broadly, we identify a phase transition to the \textit{chaotic regime} that exactly coincides with an abrupt shift in other notions of function complexity, including VC-dimension and topological entropy.

Minimax optimization has been central in addressing various applications in machine learning, game theory, and control theory. Prior literature has thus far mainly focused on studying such problems in the continuous domain, e.g., convex-concave minimax optimization is now understood to a significant extent. Nevertheless, minimax problems extend far beyond the continuous domain to mixed continuous-discrete domains or even fully discrete domains. In this paper, we study mixed continuous-discrete minimax problems where the minimization is over a continuous variable belonging to Euclidean space and the maximization is over subsets of a given ground set. We introduce the class of convex-submodular minimax problems, where the objective is convex with respect to the continuous variable and submodular with respect to the discrete variable. Even though such problems appear frequently in machine learning applications, little is known about how to address them from algorithmic and theoretical perspectives. For such problems, we first show that obtaining saddle points are hard up to any approximation, and thus introduce new notions of (near-) optimality. We then provide several algorithmic procedures for solving convex and monotone-submodular minimax problems and characterize their convergence rates, computational complexity, and quality of the final solution according to our notions of optimally. Our proposed algorithms are iterative and combine tools from both discrete and continuous optimization. Finally, we provide numerical experiments to showcase the effectiveness of our purposed methods.

**Zeroth-Order Methods for Convex-Concave Min-max Problems: Applications to Decision-Dependent Risk Minimization**

Chinmay Maheshwari · Chih-Yuan Chiu · Eric Mazumdar · Shankar Sastry · Lillian Ratliff

Min-max optimization is emerging as a key framework for analyzing problems of robustness to strategically and adversarially generated data. We propose the random reshuffling-based gradient-free Optimistic Gradient Descent-Ascent algorithm for solving convex-concave min-max problems with finite sum structure. We prove that the algorithm enjoys the same convergence rate as that of zeroth-order algorithms for convex minimization problems. We deploy the algorithm to solve the distributionally robust strategic classification problem, where gradient information is not readily available, by reformulating the latter into a finite dimensional convex concave min-max problem. Through illustrative simulations, we observe that our proposed approach learns models that are simultaneously robust against adversarial distribution shifts and strategic decisions from the data sources, and outperforms existing methods from the strategic classification literature.

**Deep Generative model with Hierarchical Latent Factors for Time Series Anomaly Detection**

Cristian Challu · Peihong Jiang · Ying Nian Wu · Laurent Callot

Multivariate time series anomaly detection has become an active area of research in recent years, with Deep Learning models outperforming previous approaches on benchmark datasets. Among reconstruction-based models, most previous work has focused on Variational Autoencoders and Generative Adversarial Networks. This work presents DGHL, a new family of generative models for time series anomaly detection, trained by maximizing the observed likelihood by posterior sampling and alternating back-propagation. A top-down Convolution Network maps a novel hierarchical latent space to time series windows, exploiting temporal dynamics to encode information efficiently. Despite relying on posterior sampling, it is computationally more efficient than current approaches, with up to 10x shorter training times than RNN based models. Our method outperformed current state-of-the-art models on four popular benchmark datasets. Finally, DGHL is robust to variable features between entities and accurate even with large proportions of missing values, settings with increasing relevance with the advent of IoT. We demonstrate the superior robustness of DGHL with novel occlusion experiments in this literature. Our code is available at https://github.com/cchallu/dghl.

We study asynchronous finite sum minimization in a distributed-data setting with a central parameter server. While asynchrony is well understood in parallel settings where the data is accessible by all machines---e.g., modifications of variance-reduced gradient algorithms like SAGA work well---little is known for the distributed-data setting. We develop an algorithm ADSAGA based on SAGA for the distributed-data setting, in which the data is partitioned between many machines.We show that with $m$ machines, under a natural stochastic delay model with an mean delay of $m$,ADSAGA converges in $\tilde{O}\left(\left(n + \sqrt{m}\kappa\right)\log(1/\epsilon)\right)$ iterations, where $n$ is the number of component functions, and $\kappa$ is a condition number. This complexity sits squarely between the complexity $\tilde{O}\left(\left(n + \kappa\right)\log(1/\epsilon)\right)$ of SAGA \textit{without delays} and the complexity $\tilde{O}\left(\left(n + m\kappa\right)\log(1/\epsilon)\right)$ of parallel asynchronous algorithms where the delays are \textit{arbitrary} (but bounded by $O(m)$), and the data is accessible by all. Existing asynchronous algorithms with distributed-data setting and arbitrary delays have only been shown to converge in $\tilde{O}(n^2\kappa\log(1/\epsilon))$ iterations. We empirically compare on least-squares problems the iteration complexity and wallclock performance of ADSAGA to existing parallel and distributed algorithms, including synchronous minibatch algorithms. Our results demonstrate the wallclock advantage of variance-reduced asynchronous approaches over SGD or synchronous approaches.

**Fundamental limits for rank-one matrix estimation with groupwise heteroskedasticity**

Joshua Behne · Galen Reeves

Low-rank matrix recovery problems involving high-dimensional and heterogeneous data appear in applications throughout statistics and machine learning. The contribution of this paper is to establish the fundamental limits of recovery for a broad class of these problems. In particular, we study the problem of estimating a rank-one matrix from Gaussian observations where different blocks of the matrix are observed under different noise levels. In the setting where the number of blocks is fixed while the number of variables tends to infinity, we prove asymptotically exact formulas for the minimum mean-squared error in estimating both the matrix and underlying factors. These results are based on a novel reduction from the low-rank matrix tensor product model (with homogeneous noise) to a rank-one model with heteroskedastic noise.As an application of our main result, we show that show recently proposed methods based on applying principal component analysis (PCA) to weighted combinations of the data are optimal in some settings but sub-optimal in others. We also provide numerical results comparing our asymptotic formulas with the performance of methods based weighted PCA, gradient descent, and approximate message passing.

**The role of optimization geometry in single neuron learning**

Nicholas Boffi · Stephen Tu · Jean-Jacques Slotine

Recent numerical experiments have demonstrated that the choice of optimization geometry used during training can impact generalization performance when learning expressive nonlinear model classes such as deep neural networks. These observations have important implications for modern deep learning, but remain poorly understood due to the difficulty of the associated nonconvex optimization. Towards an understanding of this phenomenon, we analyze a family of pseudogradient methods for learning generalized linear models under the square loss – a simplified problem containing both nonlinearity in the model parameters and nonconvexity of the optimization which admits a single neuron as a special case. We prove non-asymptotic bounds on the generalization error that sharply characterize how the interplay between the optimization geometry and the feature space geometry sets the out-of-sample performance of the learned model. Experimentally, selecting the optimization geometry as suggested by our theory leads to improved performance in generalized linear model estimation problems such as nonlinear and nonconvex variants of sparse vector recovery and low-rank matrix sensing.

**Polynomial Time Reinforcement Learning in Factored State MDPs with Linear Value Functions**

ZIHAO DENG · Siddartha Devic · Brendan Juba

Many reinforcement learning (RL) environments in practice feature enormous state spaces that may be described compactly by a “factored” structure, that may be modeled by Factored Markov Decision Processes (FMDPs). We present the first polynomial time algorithm for RL in Factored State MDPs (generalizing FMDPs) that neither relies on an oracle planner nor requires a linear transition model; it only requires a linear value function with a suitable local basis with respect to the factorization, permitting efficient variable elimination. With this assumption, we can solve this family of Factored State MDPs in polynomial time by constructing an efficient separation oracle for convex optimization. Importantly, and in contrast to prior work on FMDPs, we do not assume that the transitions on various factors areconditionally independent.

**Optimal Compression of Locally Differentially Private Mechanisms**

Abhin Shah · Wei-Ning Chen · Johannes Ballé · Peter Kairouz · Lucas Theis

Compressing the output of $\epsilon$-locally differentially private (LDP) randomizers naively leads to suboptimal utility. In this work, we demonstrate the benefits of using schemes that jointly compress and privatize the data using shared randomness. In particular, we investigate a family of schemes based on Minimal Random Coding (Havasi et al., 2019) and prove that they offer optimal privacy-accuracy-communication tradeoffs. Our theoretical and empirical findings show that our approach can compress PrivUnit (Bhowmick et al., 2018) and Subset Selection (Ye et al., 2018), the best known LDP algorithms for mean and frequency estimation, to the order of $\epsilon$ bits of communication while preserving their privacy and accuracy guarantees.

This paper proposes a convex formulation for sparse multicategory linear discriminant analysis and then extend it to the distributed setting when data are stored across multiple sites. The key observation is that for the purpose of classification it suffices to recover the discriminant subspace which is invariant to orthogonal transformations. Theoretically, we establish statistical properties ensuring that the distributed sparse multicategory linear discriminant analysis performs as good as the centralized version after {a few rounds} of communications. Numerical studies lend strong support to our methodology and theory.

Ensembles of decision rules extracted from tree ensembles, like RuleFit, promise a good trade-off between predictive performance and model simplicity. However, they are affected by competing interests: While a sufficiently large number of binary, non-smooth rules is necessary to fit smooth, well generalizing decision boundaries, a too high number of rules in the ensemble severely jeopardizes interpretability. As a way out of this dilemma, we propose to take an extra step in the rule extraction step and compress clusters of similar rules into ensemble rules. The outputs ofthe individual rules in each cluster are pooled to produce a single soft output, reflecting the original ensemble's marginal smoothing behaviour. The final model, that we call Compressed Rule Ensemble (CRE), fits a linear combination of ensemble rules. We empirically show that CRE is both sparse and accurate on various datasets, carrying over the ensemble behaviour while remaining interpretable.

We introduce an independence criterion based on entropy regularized optimal transport. Our criterion can be used to test for independence between two samples. We establish non-asymptotic bounds for our test statistic and study its statistical behavior under both the null hypothesis and the alternative hypothesis. The theoretical results involve tools from U-process theory and optimal transport theory. We also offer a random feature type approximation for large-scale problems, as well as a differentiable program implementation for deep learning applications. We present experimental results on existing benchmarks for independence testing, illustrating the interest of the proposed criterion to capture both linear and nonlinear dependencies in synthetic data and real data.

**Confident Least Square Value Iteration with Local Access to a Simulator**

Botao Hao · Nevena Lazic · Dong Yin · Yasin Abbasi-Yadkori · Csaba Szepesvari

Learning with simulators is ubiquitous in mod-ern reinforcement learning (RL). The simulatorcan either correspond to a simplified version ofthe real environment (such as a physics simulation of a robot arm) or to the environment itself (such as in games like Atari and Go). Among algorithms that are provably sample-efficient in this setting, most make the unrealistic assumption that all possible environment states are known before learning begins, or perform global optimistic planning which is computationally inefficient. In this work, we focus on simulation-based RL under a more realistic local access protocol, where the state space is unknown and the simulator can only be queried at states that have previously been observed (initial states and those returned by previous queries). We propose an algorithm named CONFIDENT-LSVI based on the template of least-square value iteration. CONFIDENT-LSVI incrementally builds a coreset of important states and uses the simulator to revisit them. Assuming that the linear function class has low approximation error under the Bell-man optimality operator (a.k.a. low inherent Bell-man error), we bound the algorithm performance in terms of this error, and show that it is query-and computationally-efficient.

**Entrywise Recovery Guarantees for Sparse PCA via Sparsistent Algorithms**

Joshua Agterberg · Jeremias Sulam

Sparse Principal Component Analysis (PCA) is a prevalent tool across a plethora of subfield of applied statistics. While several results have characterized the recovery error of the principal eigenvectors, these are typically in spectral or Frobenius norms. In this paper, we provide entrywise $\ell_{2,\infty}$ bounds for Sparse PCA under a general high-dimensional subgaussian design. In particular, our bounds hold for any algorithm that selects the correct support with high probability, those that are sparsistent. Our bound improves upon known results by providing a finer characterization of the estimation error, and our proof uses techniques recently developed for entrywise subspace perturbation theory.

**Learning in Stochastic Monotone Games with Decision-Dependent Data**

Adhyyan Narang · Evan Faulkner · Dmitriy Drusvyatskiy · Maryam Fazel · Lillian Ratliff

Learning problems commonly exhibit an interesting feedback mechanism wherein the population data reacts to competing decision makers' actions. This paper formulates a new game theoretic framework for this phenomenon, called multi-player performative prediction. We establish transparent sufficient conditions for strong monotonicity of the game and use them to develop algorithms for finding Nash equilibria. We investigate derivative free methods and adaptive gradient algorithms wherein each player alternates between learning a parametric description of their distribution and gradient steps on the empirical risk. Synthetic and semi-synthetic numerical experiments illustrate the results.

**Towards an Understanding of Default Policies in Multitask Policy Optimization**

Ted Moskovitz · Michael Arbel · Jack Parker-Holder · Aldo Pacchiano

Much of the recent success of deep reinforcement learning has been driven by regularized policy optimization (RPO) algorithms with strong performance across multiple domains. In this family of methods, agents are trained to maximize cumulative reward while penalizing deviation in behavior from some reference, or default policy. In addition to empirical success, there is a strong theoretical foundation for understanding RPO methods applied to single tasks, with connections to natural gradient, trust region, and variational approaches. However, there is limited formal understanding of desirable properties for default policies in the multitask setting, an increasingly important domain as the field shifts towards training more generally capable agents. Here, we take a first step towards filling this gap by formally linking the quality of the default policy to its effect on optimization. Using these results, we then derive a principled RPO algorithm for multitask learning with strong performance guarantees.

**A general class of surrogate functions for stable and efficient reinforcement learning**

Sharan Vaswani · Olivier Bachem · Simone Totaro · Robert Müller · Shivam Garg · Matthieu Geist · Marlos C. Machado · Pablo Samuel Castro · Nicolas Le Roux

Common policy gradient methods rely on the maximization of a sequence of surrogate functions. In recent years, many such surrogate functions have been proposed, most without strong theoretical guarantees, leading to algorithms such as TRPO, PPO, or MPO. Rather than design yet another surrogate function, we instead propose a general framework (FMA-PG) based on functional mirror ascent that gives rise to an entire family of surrogate functions. We construct surrogate functions that enable policy improvement guarantees, a property not shared by most existing surrogate functions. Crucially, these guarantees hold regardless of the choice of policy parameterization. Moreover, a particular instantiation of FMA-PG recovers important implementation heuristics (e.g., using forward vs reverse KL divergence) resulting in a variant of TRPO with additional desirable properties. Via experiments on simple reinforcement learning problems, we evaluate the algorithms instantiated by FMA-PG. The proposed framework also suggests an improved variant of PPO, whose robustness and efficiency we empirically demonstrate on the MuJoCo suite.

**Faster One-Sample Stochastic Conditional Gradient Method for Composite Convex Minimization**

Gideon Dresdner · Maria-Luiza Vladarean · Gunnar Rätsch · Francesco Locatello · Volkan Cevher · Alp Yurtsever

We propose a stochastic conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms. Existing CGM variants for this template either suffer from slow convergence rates, or require carefully increasing the batch size over the course of the algorithm's execution, which leads to computing full gradients. In contrast, the proposed method, equipped with a stochastic average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques. In applications we put special emphasis on problems with a large number of separable constraints. Such problems are prevalent among semidefinite programming (SDP) formulations arising in machine learning and theoretical computer science. We provide numerical experiments on matrix completion, unsupervised clustering, and sparsest-cut SDPs.

**Sample-and-threshold differential privacy: Histograms and applications**

Graham Cormode · Akash Bharadwaj

Federated analytics seeks to compute accurate statistics from data distributed across users' devices while providing a suitable privacy guarantee and being practically feasible to implement and scale. In this paper, we show how a strong (epsilon, delta)-Differential Privacy (DP) guarantee can be achieved for the fundamental problem of histogram generation in a federated setting, via a highly practical sampling-based procedure that does not add noise to disclosed data. Given the ubiquity of sampling in practice, we thus obtain a DP guarantee almost for free, avoid over-estimating histogram counts, and allow easy reasoning about how privacy guarantees may obscure minorities and outliers. Using such histograms, related problems such as heavy hitters and quantiles can be answered with provable error and privacy guarantees. Experimental results show that our sample-and-threshold approach is accurate and scalable.

**On the Implicit Bias of Gradient Descent for Temporal Extrapolation**

Edo Cohen-Karlik · Avichai Ben David · Nadav Cohen · Amir Globerson

When using recurrent neural networks (RNNs) it is common practice to apply trained models to sequences longer than those seen in training. This ``extrapolating'' usage deviates from the traditional statistical learning setup where guarantees are provided under the assumption that train and test distributions are identical. Here we set out to understand when RNNs can extrapolate, focusing on a simple case where the data generating distribution is memoryless. We first show that even with infinite training data, there exist RNN models that interpolate perfectly (i.e., they fit the training data) yet extrapolate poorly to longer sequences. We then show that if gradient descent is used for training, learning will converge to perfect extrapolation under certain assumptions on initialization.Our results complement recent studies on the implicit bias of gradient descent, showing that it plays a key role in extrapolation when learning temporal prediction models.

**On the Oracle Complexity of Higher-Order Smooth Non-Convex Finite-Sum Optimization**

Nicolas Emmenegger · Rasmus Kyng · Ahad N. Zehmakan

We prove lower bounds for higher-order methods in smooth non-convex finite-sum optimization. Our contribution is threefold: We first show that a deterministic algorithm cannot profit from the finite-sum structure of the objective and that simulating a pth-order regularized method on the whole function by constructing exact gradient information is optimal up to constant factors. We further show lower bounds for randomized algorithms and compare them with the best-known upper bounds. To address some gaps between the bounds, we propose a new second-order smoothness assumption that can be seen as an analogue of the first-order mean-squared smoothness assumption. We prove that it is sufficient to ensure state-of-the-art convergence guarantees while allowing for a sharper lower bound.

**Causally motivated shortcut removal using auxiliary labels **

Maggie Makar · Ben Packer · Dan Moldovan · Davis Blalock · Yoni Halpern · Alexander D'Amour

Shortcut learning, in which models make use of easy-to-represent but unstable associations, is a major failure mode for robust machine learning. We study a flexible, causally-motivated approach to training robust predictors by discouraging the use of specific shortcuts, focusing on a common setting where a robust predictor could achieve optimal i.i.d generalization in principle, but is overshadowed by a shortcut predictor in practice. Our approach uses auxiliary labels, typically available at training time, to enforce conditional independences implied by the causal graph. We show both theoretically and empirically that causally-motivated regularization schemes (a) lead to more robust estimators that generalize well under distribution shift, and (b) have better finite sample efficiency compared to usual regularization schemes, even when no shortcut is present. Our analysis highlights important theoretical properties of training techniques commonly used in the causal inference, fairness, and disentanglement literatures. Our code is available at github.com/mymakar/causally*motivated*shortcut_removal

**A View of Exact Inference in Graphs from the Degree-4 Sum-of-Squares Hierarchy**

Kevin Bello · Chuyang Ke · Jean Honorio

Performing inference in graphs is a common task within several machine learning problems, e.g., image segmentation, community detection, among others. For a given undirected connected graph, we tackle the statistical problem of \textit{exactly} recovering an unknown ground-truth binary labeling of the nodes from a \textit{single} corrupted observation of each edge. Such problem can be formulated as a quadratic combinatorial optimization problem over the boolean hypercube, where it has been shown before that one can (with high probability and in polynomial time) exactly recover the ground-truth labeling of graphs that have an isoperimetric number that grows with respect to the number of nodes (e.g., complete graphs, regular expanders). In this work, we apply a powerful hierarchy of relaxations, known as the sum-of-squares (SoS) hierarchy, to the combinatorial problem. Motivated by empirical evidence on the improvement in exact recoverability, we center our attention on the degree-4 SoS relaxation and set out to understand the origin of such improvement from a graph theoretical perspective. We show that the solution of the dual of the relaxed problem is related to finding edge weights of the Johnson and Kneser graphs, where the weights fulfill the SoS constraints and intuitively allow the input graph to increase its algebraic connectivity. Finally, as byproduct of our analysis, we derive a novel Cheeger-type lower bound for the algebraic connectivity of graphs with \textit{signed} edge weights.

**Communication-Compressed Adaptive Gradient Method for Distributed Nonconvex Optimization**

Yujia Wang · Lu Lin · Jinghui Chen

Due to the explosion in the size of the training datasets, distributed learning has received growing interest in recent years. One of the major bottlenecks is the large communication cost between the central server and the local workers. While error feedback compression has been proven to be successful in reducing communication costs with stochastic gradient descent (SGD), there are much fewer attempts in building communication-efficient adaptive gradient methods with provable guarantees, which are widely used in training large-scale machine learning models. In this paper, we propose a new communication-compressed AMSGrad for distributed nonconvex optimization problem, which is provably efficient. Our proposed distributed learning framework features an effective gradient compression strategy and a worker-side model update design. We prove that the proposed communication-efficient distributed adaptive gradient method converges to the first-order stationary point with the same iteration complexity as uncompressed vanilla AMSGrad in the stochastic nonconvex optimization setting. Experiments on various benchmarks back up our theory.

**A Bandit Model for Human-Machine Decision Making with Private Information and Opacity**

Sebastian Bordt · Ulrike von Luxburg

Applications of machine learning inform human decision makers in a broad range of tasks. The resulting problem is usually formulated in terms of a single decision maker. We argue that it should rather be described as a two-player learning problem where one player is the machine and the other the human. While both players try to optimize the final decision, the setup is often characterized by (1) the presence of private information and (2) opacity, that is imperfect understanding between the decision makers. We prove that both properties can complicate decision making considerably. A lower bound quantifies the worst-case hardness of optimally advising a decision maker who is opaque or has access to private information. An upper bound shows that a simple coordination strategy is nearly minimax optimal. More efficient learning is possible under certain assumptions on the problem, for example that both players learn to take actions independently. Such assumptions are implicit in existing literature, for example in medical applications of machine learning, but have not been described or justified theoretically.

**A Unified View of SDP-based Neural Network Verification through Completely Positive Programming**

Robin Brown · Edward Schmerling · Navid Azizan · Marco Pavone

Verifying that input-output relationships of a neural network conform to prescribed operational specifications is a key enabler towards deploying these networks in safety-critical applications. Semidefinite programming (SDP)-based approaches to Rectified Linear Unit (ReLU) network verification transcribe this problem into an optimization problem, where the accuracy of any such formulation reflects the level of fidelity in how the neural network computation is represented, as well as the relaxations of intractable constraints. While the literature contains much progress on improving the tightness of SDP formulations while maintaining tractability, comparatively little work has been devoted to the other extreme, i.e., how to most accurately capture the original verification problem before SDP relaxation. In this work, we develop an exact, convex formulation of verification as a completely positive program (CPP), and provide analysis showing that our formulation is minimal—the removal of any constraint fundamentally misrepresents the neural network computation. We leverage our formulation to provide a unifying view of existing approaches, and give insight into the source of large relaxation gaps observed in some cases.

In recent years, researchers have extensively studied adversarial robustness in a variety of threat models, including $l_0, l_1, l_2$, and $l_{\infty}$-norm bounded adversarial attacks. However, attacks bounded by fractional $l_p$-“norms” (quasi-norms defined by the $l_p$ distance with $0 < p < 1$) have yet to be thoroughly considered. We proactively propose a defense with several desirable properties: it provides provable (certified) robustness, scales to ImageNet, and yields deterministic (rather than high-probability) certified guarantees when applied to quantized data (e.g., images). Our technique for fractional $l_p$ robustness constructs expressive, deep classifiers that are globally Lipschitz with respect to the $l_p^p$ metric, for any $0 < p < 1$. However, our method is even more general: we can construct classifiers which are globally Lipschitz with respect to any metric defined as the sum of concave functions of components. Our approach builds on a recent work, Levine and Feizi (2021), which provides a provable defense against $l_1$ attacks. However, we demonstrate that our proposed guarantees are highly non-vacuous, compared to the trivial solution of using (Levine and Feizi, 2021) directly and applying norm inequalities.

Code is available at https://github.com/alevine0/fractionalLpRobustness.

**Optimal Design of Stochastic DNA Synthesis Protocols based on Generative Sequence Models**

Eli Weinstein · Alan Amin · Will Grathwohl · Daniel Kassler · Jean Disset · Debora Marks

Generative probabilistic models of biological sequences have widespread existing and potential applications in analyzing, predicting and designing proteins, RNA and genomes. To test the predictions of such a model experimentally, the standard approach is to draw samples, and then synthesize each sample individually in the laboratory. However, often orders of magnitude more sequences can be experimentally assayed than can be affordably synthesized individually. In this article, we propose instead to use stochastic synthesis methods, such as mixed nucleotides or trimers. We describe a black-box algorithm for optimizing stochastic synthesis protocols to produce approximate samples from any target generative model. We establish theoretical bounds on the method's performance, and validate it in simulation using held-out sequence-to-function predictors trained on real experimental data. We show that using optimized stochastic synthesis protocols in place of individual synthesis can increase the number of hits in protein engineering efforts by orders of magnitude, e.g. from zero to a thousand.

**An Online Learning Approach to Interpolation and Extrapolation in Domain Generalization**

Elan Rosenfeld · Pradeep Ravikumar · Andrej Risteski

A popular assumption for out-of-distribution generalization is that the training data comprises sub-datasets, each drawn from a distinct distribution; the goal is then to "interpolate" these distributions and "extrapolate" beyond them---this objective is broadly known as domain generalization. A common belief is that ERM can interpolate but not extrapolate and that the latter task is considerably more difficult, but these claims are vague and lack formal justification. In this work, we recast generalization over sub-groups as an online game between a player minimizing risk and an adversary presenting new test distributions. Under an existing notion of inter- and extrapolation based on reweighting of sub-group likelihoods, we rigorously demonstrate that extrapolation is computationally much harder than interpolation, though their statistical complexity is not significantly different. Furthermore, we show that ERM---possibly with added structured noise---is \emph{provably minimax-optimal} for both tasks. Our framework presents a new avenue for the formal analysis of domain generalization algorithms which may be of independent interest.

**Faster Single-loop Algorithms for Minimax Optimization without Strong Concavity**

Junchi Yang · Antonio Orvieto · Aurelien Lucchi · Niao He

Gradient descent ascent (GDA), the simplest single-loop algorithm for nonconvex minimax optimization, is widely used in practical applications such as generative adversarial networks (GANs) and adversarial training. Albeit its desirable simplicity, recent work shows inferior convergence rates of GDA in theory, even when assuming strong concavity of the objective in terms of one variable. This paper establishes new convergence results for two alternative single-loop algorithms -- \emph{alternating GDA} and \emph{smoothed GDA} -- under the mild assumption that the objective satisfies the Polyak-$\L$ojasiewicz (PL) condition about one variable. We prove that, to find an $\epsilon$-stationary point, (i) alternating GDA and its stochastic variant (without mini batch) respectively require $O(\kappa^{2} \epsilon^{-2})$ and $O(\kappa^{4} \epsilon^{-4})$ iterations, while (ii) smoothed GDA and its stochastic variant (without mini batch) respectively require $O(\kappa \epsilon^{-2})$ and $O(\kappa^{2} \epsilon^{-4})$ iterations. The latter greatly improves over the vanilla GDA and gives the hitherto best known complexity results among single-loop algorithms under similar settings. We further showcase the empirical efficiency of these algorithms in training GANs and robust nonlinear regression.

We study the problem of learning individualized dose intervals using observational data. There are very few previous works for policy learning with continuous treatment, and all of them focused on recommending an optimal dose rather than an optimal dose interval. In this paper, we propose a new method to estimate such an optimal dose interval, named probability dose interval (PDI). The potential outcomes for doses in the PDI are guaranteed better than a pre-specified threshold with a given probability (e.g., 50%). The associated nonconvex optimization problem can be efficiently solved by the Difference-of-Convex functions (DC) algorithm. We prove that our estimated policy is consistent, and its risk converges to that of the best-in-class policy at a root-n rate. Numerical simulations show the advantage of the proposed method over outcome modeling based benchmarks. We further demonstrate the performance of our method in determining individualized Hemoglobin A1c (HbA1c) control intervals for elderly patients with diabetes.

We address the multi-task Gaussian process (GP) regression problem with the goal of decomposing input effects on outputs into components shared across or specific to tasks and samples. We propose a family of mixed-effects GPs, including doubly and translated mixed-effects GPs, that performs such a decomposition, while also modeling the complex task relationships. Instead of the tensor product widely used in multi-task GPs, we use the direct sum and Kronecker sum for Cartesian product to combine task and sample covariance functions. With this kernel, the overall input effects on outputs decompose into four components: fixed effects shared across tasks and across samples and random effects specific to each task and to each sample. We describe an efficient stochastic variational inference method for our proposed models that also significantly reduces the cost of inference for the existing mixed-effects GPs. On simulated and real-world data, we demonstrate that our approach provides higher test accuracy and interpretable decomposition.

**Fast and Scalable Spike and Slab Variable Selection in High-Dimensional Gaussian Processes**

Hugh Dance · Brooks Paige

Variable selection in Gaussian processes (GPs) is typically undertaken by thresholding the inverse lengthscales of automatic relevance determination kernels, but in high-dimensional datasets this approach can be unreliable. A more probabilistically principled alternative is to use spike and slab priors and infer a posterior probability of variable inclusion. However, existing implementations in GPs are very costly to run in both high-dimensional and large-n datasets, or are only suitable for unsupervised settings with specific kernels. As such, we develop a fast and scalable variational inference algorithm for the spike and slab GP that is tractable with arbitrary differentiable kernels. We improve our algorithm's ability to adapt to the sparsity of relevant variables by Bayesian model averaging over hyperparameters, and achieve substantial speed ups using zero temperature posterior restrictions, dropout pruning and nearest neighbour minibatching. In experiments our method consistently outperforms vanilla and sparse variational GPs whilst retaining similar runtimes (even when n=10^6) and performs competitively with a spike and slab GP using MCMC but runs up to 1000 times faster.

**Parameter-Free Online Linear Optimization with Side Information via Universal Coin Betting**

Jongha Jon Ryu · Alankrita Bhatt · Young-Han Kim

A class of parameter-free online linear optimization algorithms is proposed that harnesses the structure of an adversarial sequence by adapting to some side information. These algorithms combine the reduction technique of Orabona and P{\'a}l (2016) for adapting coin betting algorithms for online linear optimization with universal compression techniques in information theory for incorporating sequential side information to coin betting.Concrete examples are studied in which the side information has a tree structure and consists of quantized values of the previous symbols of the adversarial sequence, including fixed-order and variable-order Markov cases.By modifying the context-tree weighting technique of Willems, Shtarkov, and Tjalkens (1995), the proposed algorithm is further refined to achieve the best performance over all adaptive algorithms with tree-structured side information of a given maximum order in a computationally efficient manner.

**Hardness of Learning a Single Neuron with Adversarial Label Noise**

Ilias Diakonikolas · Daniel Kane · Pasin Manurangsi · Lisheng Ren

We study the problem of distribution-free learning of a single neuronunder adversarial label noise with respect to the squared loss.For a wide range of activation functions, including ReLUs and sigmoids,we prove hardness of learning results in the Statistical Query model andunder a well-studied assumption on the complexity of refuting XOR formulas.Specifically, we establish that no polynomial-time learning algorithm, even improper,can approximate the optimal loss value within any constant factor.

**TD-GEN: Graph Generation Using Tree Decomposition**

Hamed Shirzad · Hossein Hajimirsadeghi · Amir Abdi · Greg Mori

We propose TD-GEN, a graph generation framework based on tree decomposition, and introduce a reduced upper bound on the maximum number of decisions needed for graph generation. The framework includes a permutation invariant tree generation model which forms the backbone of graph generation. Tree nodes are supernodes, each representing a cluster of nodes in the graph. Graph nodes and edges are incrementally generated inside the clusters by traversing the tree supernodes, respecting the structure of the tree decomposition, and following node sharing decisions between the clusters. Further, we discuss the shortcomings of the standard evaluation criteria based on statistical properties of the generated graphs. We propose to compare the generalizability of models based on expected likelihood. Empirical results on a variety of standard graph generation datasets demonstrate the superior performance of our method.

**Recoverability Landscape of Tree Structured Markov Random Fields under Symmetric Noise**

Ashish Katiyar · Soumya Basu · Vatsal Shah · Constantine Caramanis

We study the problem of learning tree-structured Markov random fields (MRF) on discrete random variables with common support when the observations are corrupted by a k-ary symmetric noise channel with unknown probability of error. For Ising models (support size = 2), past work has shown that graph structure can only be recovered up to the leaf clusters (a leaf node, its parent, and its siblings form a leaf cluster) and exact recovery is impossible. No prior work has addressed the setting of support size of 3 or more, and indeed this setting is far richer. As we show, when the support size is 3 or more, the structure of the leaf clusters may be partially or fully identifiable. We provide a precise characterization of this phenomenon and show that the extent of recoverability is dictated by the joint PMF of the random variables. In particular, we provide necessary and sufficient conditions for exact recoverability. Furthermore, we present a polynomial time, sample efficient algorithm that recovers the exact tree when this is possible, or up to the unidentifiability as promised by our characterization, when full recoverability is impossible. Finally, we demonstrate the efficacy of our algorithm experimentally.

**Common Information based Approximate State Representations in Multi-Agent Reinforcement Learning**

Hsu Kao · Vijay Subramanian

Due to information asymmetry, finding optimal policies for Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) is hard with the complexity growing doubly exponentially in the horizon length. The challenge increases greatly in the multi-agent reinforcement learning (MARL) setting where the transition probabilities, observation kernel, and reward function are unknown. Here, we develop a general compression framework with approximate common and private state representations, based on which decentralized policies can be constructed. We derive the optimality gap of executing dynamic programming (DP) with the approximate states in terms of the approximation error parameters and the remaining time steps. When the compression is exact (no error), the resulting DP is equivalent to the one in existing work. Our general framework generalizes a number of methods proposed in the literature. The results shed light on designing practically useful deep-MARL network structures under the "centralized learning distributed execution" scheme.

The analysis of streaming PCA has gained significant traction through the analysis of an early simple variant: Oja's algorithm, which implements online projected gradient descent for the trace objective. Several other streaming PCA algorithms have been developed, each with their own performance guarantees or empirical studies, and the question arises whether there is a relationship between the algorithms. We show that the Grassmannian Rank-One Subspace Estimation (GROUSE) algorithm is indeed equivalent to Oja's algorithm in the sense that, at each iteration, given a step size for one of the algorithms, we may construct a step size for the other algorithm that results in an identical update. This allows us to apply all results on one algorithm to the other. In particular, we have (1) better global convergence guarantees of GROUSE to the global minimizer of the PCA objective with full data; and (2) local convergence guarantees for Oja's algorithm with incomplete or compressed data.

**Scaling and Scalability: Provable Nonconvex Low-Rank Tensor Completion**

Tian Tong · Cong Ma · Ashley Prater-Bennette · Erin Tripp · Yuejie Chi

Tensors, which provide a powerful and flexible model for representing multi-attribute data and multi-way interactions, play an indispensable role in modern data science across various fields in science and engineering. A fundamental task is tensor completion, which aims to faithfully recover the tensor from a small subset of its entries in a statistically and computationally efficient manner. Harnessing the low-rank structure of tensors in the Tucker decomposition, this paper develops a scaled gradient descent (ScaledGD) algorithm to directly recover the tensor factors with tailored spectral initializations, and shows that it provably converges at a linear rate independent of the condition number of the ground truth tensor for tensor completion as soon as the sample size is above the order of $n^{3/2}$ ignoring other parameter dependencies, where $n$ is the dimension of the tensor. To the best of our knowledge, ScaledGD is the first algorithm that achieves near-optimal statistical and computational complexities simultaneously for low-rank tensor completion with the Tucker decomposition. Our algorithm highlights the power of appropriate preconditioning in accelerating nonconvex statistical estimation, where the iteration-varying preconditioners promote desirable invariance properties of the trajectory with respect to the underlying symmetry in low-rank tensor factorization.

Generative models for affiliation networks condition the edges on the membership of their nodes to communities. The problem of community detection under these models is addressed by inferring the membership parameters from the network structure. Current models make several unrealistic assumptions to make the inference feasible, and are mostly designed to work on regular graphs that cannot handle multi-way connections between nodes. While the models designed for hypergraphs attempt to capture the latter, they add further strict assumptions on the structure and size of hyperedges and are usually computationally intractable for real data. This paper proposes an efficient probabilistic generative model for detecting overlapping communities that process hyperedges without any changes or restrictions on their size. Our model represents the entire state space of the hyperedges, which is exponential in the number of nodes. We develop a mathematical computation reduction scheme that reduces the inference time to linear in the volume of the hypergraph without sacrificing precision. Our experimental results validate the effectiveness and scalability of our model and demonstrate the superiority of our approach over state-of-the-art community detection methods.

In this work we study Variational Autoencoders (VAEs) from the perspective of harmonic analysis. By viewing a VAE's latent space as a Gaussian Space, a variety of measure space, we derive a series of results that show that the encoder variance of a VAE controls the frequency content of the functions parameterised by the VAE encoder and decoder neural networks. In particular we demonstrate that larger encoder variances reduce the high frequency content of these functions. Our analysis allows us to show that increasing this variance effectively induces a soft Lipschitz constraint on the decoder network of a VAE, which is a core contributor to the adversarial robustness of VAEs. We further demonstrate that adding Gaussian noise to the input of a VAE allows us to more finely control the frequency content and the Lipschitz constant of the VAE encoder networks. Finally, we show that the KL term of the VAE loss serves as single point of action for modulating the frequency content of both encoder and decoder networks; whereby upweighting this term decreases the high-frequency content of both networks. To support our theoretical analysis we run experiments using VAEs with small fully-connected neural networks and with larger convolutional networks, demonstrating empirically that our theory holds for a variety of neural network architectures.

**Learning from an Exploring Demonstrator: Optimal Reward Estimation for Bandits**

Wenshuo Guo · Kumar Agrawal · Aditya Grover · Vidya Muthukumar · Ashwin Pananjady

We introduce the ``inverse bandit'' problem of estimating the rewards of a multi-armed bandit instance from observing the learning process of a low-regret demonstrator. Existing approaches to the related problem of inverse reinforcement learning assume the execution of an optimal policy, and thereby suffer from an identifiability issue. In contrast, we propose to leverage the demonstrator's behavior en route to optimality, and in particular, the exploration phase, for reward estimation. We begin by establishing a general information-theoretic lower bound under this paradigm that applies to any demonstrator algorithm, which characterizes a fundamental tradeoff between reward estimation and the amount of exploration of the demonstrator. Then, we develop simple and efficient reward estimators for upper-confidence-based demonstrator algorithms that attain the optimal tradeoff, showing in particular that consistent reward estimation---free of identifiability issues---is possible under our paradigm. Extensive simulations on both synthetic and semi-synthetic data corroborate our theoretical results.

**On the Assumptions of Synthetic Control Methods**

Claudia Shi · Dhanya Sridhar · Vishal Misra · David Blei

Synthetic control (SC) methods have been widely applied to estimate the causal effect of large-scale interventions, e.g., the state-wide effect of a change in policy.The idea of synthetic controls is to approximate one unit's counterfactual outcomes using a weighted combination of some other units' observed outcomes.The motivating question of this paper is: how does the SC strategy lead to valid causal inferences?We address this question by re-formulating the causal inference problem targeted by SC with a more fine-grained model, where we change the unit of analysis from `large units" (e.g., states) to`

small units" (e.g., individuals in states).Under the re-formulation, we derive sufficient conditions for the non-parametric causal identification of the causal effect.We show that, in some settings, existing linear SC estimators are valid even when the data generating process is non-linear.We highlight two implications of the reformulation: 1) it clarifies where ``linearity" comes from, and how it falls naturally out of the more fine-grained and flexible model; 2) it suggests new ways of using available data with SC methods for valid causal inference, in particular, new ways of selecting observations from which to estimate the counterfactual.

Multiclass classification problems often have many semantically similar classes. For example, 90 of ImageNet’s 1000 classes are for different breeds of dog. We should expect that these semantically similar classes will have similar parameter vectors, but the standard cross entropy loss does not enforce this constraint.We introduce the tree loss as a drop-in replacement for the cross entropy loss. The tree loss re-parameterizes the parameter matrix in order to guarantee that semantically similar classes will have similar parameter vectors. Using simple properties of stochastic gradient descent, we show that the tree loss’s generalization error is asymptotically better than the cross entropy loss’s. We then validate these theoretical results on synthetic data, image data (CIFAR100, ImageNet), and text data (Twitter).

**Outlier-Robust Optimal Transport: Duality, Structure, and Statistical Analysis**

Sloan Nietert · Ziv Goldfeld · Rachel Cummings

The Wasserstein distance, rooted in optimal transport (OT) theory, is a popular discrepancy measure between probability distributions with various applications to statistics and machine learning. Despite their rich structure and demonstrated utility, Wasserstein distances are sensitive to outliers in the considered distributions, which hinders applicability in practice. We propose a new outlier-robust Wasserstein distance $\mathsf{W}_p^\varepsilon$ which allows for $\varepsilon$ outlier mass to be removed from each contaminated distribution. Under standard moment assumptions, $\mathsf{W}_p^\varepsilon$ is shown to be minimax optimal for robust estimation under the Huber $\varepsilon$-contamination model. Our formulation of this robust distance amounts to a highly regular optimization problem that lends itself better for analysis compared to previously considered frameworks. Leveraging this, we conduct a thorough theoretical study of $\mathsf{W}_p^\varepsilon$, encompassing robustness guarantees, characterization of optimal perturbations, regularity, duality, and statistical estimation. In particular, by decoupling the optimization variables, we arrive at a simple dual form for $\mathsf{W}_p^\varepsilon$ that can be implemented via an elementary modification to standard, duality-based OT solvers. We illustrate the virtues of our framework via applications to generative modeling with contaminated datasets.

**A Bayesian Model for Online Activity Sample Sizes**

Thomas Richardson · Yu Liu · James McQueen · Doug Hains

In many contexts it is useful to predict the number of individuals in some population who will initiate a particular activity during a given period. For example, the number of users who will install a software update, the number of customers who will use a new feature on a website or who will participate in an A/B test. In practical settings, there is heterogeneity amongst individuals with regard to the distribution of time until they will initiate. For these reasons it is inappropriate to assume that the number of new individuals observed on successive days will be identically distributed. Given observations on the number of unique users participating in an initial period, we present a simple but novel Bayesian method for predicting the number of additional individuals who will participate during a subsequent period. We illustrate the performance of the method in predicting sample size in online experimentation.

**Learning Quantile Functions without Quantile Crossing for Distribution-free Time Series Forecasting**

Youngsuk Park · Danielle Maddix · François-Xavier Aubet · Kelvin Kan · Jan Gasthaus · Yuyang Wang

Quantile regression is an effective technique to quantify uncertainty, fit challenging underlying distributions, and often provide full probabilistic predictions through joint learnings over multiple quantile levels. A common drawback of these joint quantile regressions, however, is quantile crossing, which violates the desirable monotone property of the conditional quantile function. In this work, we propose the Incremental (Spline) Quantile Functions I(S)QF, a flexible and efficient distribution-free quantile estimation framework that resolves quantile crossing with a simple neural network layer. Moreover, I(S)QF inter/extrapolate to predict arbitrary quantile levels that differ from the underlying training ones. Equipped with the analytical evaluation of the continuous ranked probability score of I(S)QF representations, we apply our methods to NN-based times series forecasting cases, where the savings of the expensive re-training costs for non-trained quantile levels is particularly significant. We also provide a generalization error analysis of our proposed approaches under the sequence-to-sequence setting. Lastly, extensive experiments demonstrate the improvement of consistency and accuracy errors over other baselines.

**Weak Separation in Mixture Models and Implications for Principal Stratification**

Nhat Ho · Avi Feller · Evan Greif · Luke Miratrix · Natesh Pillai

Principal stratification is a popular framework for addressing post-randomization complications, often in conjunction with finite mixture models for estimating the causal effects of interest. Unfortunately, standard estimators of mixture parameters, like the MLE, are known to exhibit pathological behavior. We study this behavior in a simple but fundamental example, a two-component Gaussian mixture model in which only the component means and variances are unknown, and focus on the setting in which the components are weakly separated. In this case, we show that the asymptotic convergence rate of the MLE is quite poor, such as $O(n^{-1/6})$ or even $O(n^{-1/8})$. We then demonstrate via both theoretical arguments and extensive simulations that the MLE behaves like a threshold estimator in finite samples, in the sense that the MLE can give strong evidence that the means are equal when the truth is otherwise. We also explore the behavior of the MLE when the MLE is non-zero, showing that it is difficult to estimate both the sign and magnitude of the means in this case. We provide diagnostics for all of these pathologies and apply these ideas to re-analyzing two randomized evaluations of job training programs, JOBS II and Job Corps. Our results suggest that the corresponding maximum likelihood estimates should be interpreted with caution in these cases.

A Neyman-Scott process is a special case of a Cox process. The latent andobservable stochastic processes are both Poisson processes. We consider adeep Neyman-Scott process in this paper, for which the building componentsof a network are all Poisson processes. We develop an efficient posteriorsampling via Markov chain Monte Carlo and use it for likelihood-basedinference. Our method opens up room for the inference in sophisticatedhierarchical point processes. We show in the experiments that more hiddenPoisson processes brings better performance for likelihood fitting andevents types prediction. We also compare our method with state-of-the-artmodels for temporal real-world datasets and demonstrate competitiveabilities for both data fitting and prediction, using far fewer parameters.

**A Dual Approach to Constrained Markov Decision Processes with Entropy Regularization**

Donghao Ying · Yuhao Ding · Javad Lavaei

We study entropy-regularized constrained Markov decision processes (CMDPs) under the soft-max parameterization, in which an agent aims to maximize the entropy-regularized value function while satisfying constraints on the expected total utility. By leveraging the entropy regularization, our theoretical analysis shows that its Lagrangian dual function is smooth and the Lagrangian duality gap can be decomposed into the primal optimality gap and the constraint violation. Furthermore, we propose an accelerated dual-descent method for entropy-regularized CMDPs. We prove that our method achieves the global convergence rate $\widetilde{\mathcal{O}}(1/T)$ for both the optimality gap and the constraint violation for entropy-regularized CMDPs. A discussion about a linear convergence rate for CMDPs with a single constraint is also provided.

**Cross-Loss Influence Functions to Explain Deep Network Representations**

Andrew Silva · Rohit Chopra · Matthew Gombolay

As machine learning is increasingly deployed in the real world, it is paramount that we develop the tools necessary to analyze the decision-making of the models we train and deploy to end-users. Recently, researchers have shown that influence functions, a statistical measure of sample impact, can approximate the effects of training samples on classification accuracy for deep neural networks. However, this prior work only applies to supervised learning, where training and testing share an objective function. No approaches currently exist for estimating the influence of unsupervised training examples for deep learning models. To bring explainability to unsupervised and semi-supervised training regimes, we derive the first theoretical and empirical demonstration that influence functions can be extended to handle mismatched training and testing (i.e., "cross-loss") settings. Our formulation enables us to compute the influence in an unsupervised learning setup, explain cluster memberships, and identify and augment biases in language models. Our experiments show that our cross-loss influence estimates even exceed matched-objective influence estimation relative to ground-truth sample impact.

We study the effect of reward variance heterogeneity in the approximate top-$m$ arm identification setting. In this setting, the reward for the $i$-th arm follows a $\sigma^2_i$-sub-Gaussian distribution, and the agent needs to incorporate this knowledge to minimize the expected number of arm pulls to identify $m$ arms with the largest means within error $\epsilon$ out of the $n$ arms, with probability at least $1-\delta$. We show that the worst-case sample complexity of this problem is $$\Theta\left( \sum_{i =1}^n \frac{\sigma_i^2}{\epsilon^2} \ln\frac{1}{\delta} + \sum_{i \in G^{m}} \frac{\sigma_i^2}{\epsilon^2} \ln(m) + \sum_{j \in G^{l}} \frac{\sigma_j^2}{\epsilon^2} \text{Ent}(\sigma^2_{G^{r}}) \right), $$where $G^{m}, G^{l}, G^{r}$ are certain specific subsets of the overall arm set $\{1, 2, \ldots, n\}$, and $\text{Ent}(\cdot)$ is an entropy-like function which measures the heterogeneity of the variance proxies. The upper bound of the complexity is obtained using a divide-and-conquer style algorithm, while the matching lower bound relies on the study of a dual formulation.

**A Contraction Theory Approach to Optimization Algorithms from Acceleration Flows**

Pedro Cisneros · Francesco Bullo

Much recent interest has focused on the design of optimization algorithms from the discretization of an associated optimization flow, i.e., a system of differential equations (ODEs) whose trajectories solve an associated optimization problem. Such a design approach poses an important problem: how to find a principled methodology to design and discretize appropriate ODEs. This paper aims to provide a solution to this problem through the use of contraction theory. We first introduce general mathematical results that explain how contraction theory guarantees the stability of the implicit and explicit Euler integration methods. Then, we propose a novel system of ODEs, namely the Accelerated-Contracting-Nesterov flow, and use contraction theory to establish it is an optimization flow with exponential convergence rate, from which the linear convergence rate of its associated optimization algorithm is immediately established. Remarkably, a simple explicit Euler discretization of this flow corresponds to the Nesterov acceleration method. Finally, we present how our approach leads to performance guarantees in the design of optimization algorithms for time-varying optimization problems.

**Contrasting the landscape of contrastive and non-contrastive learning**

Ashwini Pokle · Jinjin Tian · Yuchen Li · Andrej Risteski

A lot of recent advances in unsupervised feature learning are based on designing features which are invariant under semantic data augmentations. A common way to do this is contrastive learning, which uses positive and negative samples. Some recent works however have shown promising results for non-contrastive learning, which does not require negative samples. However, the non-contrastive losses have obvious ``collapsed'' minima, in which the encoders output a constant feature embedding, independent of the input. A folk conjecture is that so long as these collapsed solutions are avoided, the produced feature representations should be good. In our paper, we cast doubt on this story: we show through theoretical results and controlled experiments that even on simple data models, non-contrastive losses have a preponderance of non-collapsed bad minima. Moreover, we show that the training process does not avoid these minima. Code for this work can be found at https://github.com/ashwinipokle/contrastive_landscape.

**Loss as the Inconsistency of a Probabilistic Dependency Graph: Choose Your Model, Not Your Loss Function**

Oliver Richardson

In a world blessed with a great diversity of loss functions, we argue that that choice between them is not a matter of taste or pragmatics, but of model. Probabilistic depencency graphs (PDGs) are probabilistic models that come equipped with a measure of "inconsistency". We prove that many standard loss functions arise as the inconsistency of a natural PDG describing the appropriate scenario, and use the same approach to justify a well-known connection between regularizers and priors. We also show that the PDG inconsistency captures a large class of statistical divergences, and detail benefits of thinking of them in this way, including an intuitive visual language for deriving inequalities between them. In variational inference, we find that the ELBO, a somewhat opaque objective for latent variable models, and variants of it arise for free out of uncontroversial modeling assumptions---as do simple graphical proofs of their corresponding bounds. Finally, we observe that inconsistency becomes the log partition function (free energy) in the setting where PDGs are factor graphs.

**A Manifold View of Adversarial Risk**

Wenjia Zhang · Yikai Zhang · Xiaoling Hu · Mayank Goswami · Chao Chen · Dimitris Metaxas

The adversarial risk of a machine learning model has been widely studied. Most previous works assume that the data lies in the whole ambient space. We propose to take a new angle and take the manifold assumption into consideration. Assuming data lies in a manifold, we investigate two new types of adversarial risk, the normal adversarial risk due to perturbation along normal direction, and the in-manifold adversarial risk due to perturbation within the manifold. We prove that the classic adversarial risk can be bounded from both sides using the normal and in-manifold adversarial risks. We also show with a surprisingly pessimistic case that the standard adversarial risk can be nonzero even when both normal and in-manifold risks are zero. We finalize the paper with empirical studies supporting our theoretical results. Our results suggest the possibility of improving the robustness of a classifier by only focusing on the normal adversarial risk.

**Pulling back information geometry**

Georgios Arvanitidis · Miguel González-Duque · Alison Pouplin · Dimitris Kalatzis · Soren Hauberg

Latent space geometry has shown itself to provide a rich and rigorous framework for interacting with the latent variables of deep generative models. The existing theory, however, relies on the decoder being a Gaussian distribution as its simple reparametrization allows us to interpret the generating process as a random projection of a deterministic manifold. Consequently, this approach breaks down when applied to decoders that are not as easily reparametrized. We here propose to use the Fisher-Rao metric associated with the space of decoder distributions as a reference metric, which we pull back to the latent space. We show that we can achieve meaningful latent geometries for a wide range of decoder distributions for which the previous theory was not applicable, opening the door to 'black box' latent geometries.

**Estimating Functionals of the Out-of-Sample Error Distribution in High-Dimensional Ridge Regression**

Pratik Patil · Alessandro Rinaldo · Ryan Tibshirani

We study the problem of estimating the distribution of the out-of-sample prediction error associated with ridge regression. In contrast, the traditional object of study is the uncentered second moment of this distribution (the mean squared prediction error), which can be estimated using cross-validation methods. We show that both generalized and leave-one-out cross-validation (GCV and LOOCV) for ridge regression can be suitably extended to estimate the full error distribution. This is still possible in a high-dimensional setting where the ridge regularization parameter is zero. In an asymptotic framework in which the feature dimension and sample size grow proportionally, we prove that almost surely, with respect to the training data, our estimators (extensions of GCV and LOOCV) converge weakly to the true out-of-sample error distribution. This result requires mild assumptions on the response and feature distributions. We also establish a more general result that allows us to estimate certain functionals of the error distribution, both linear and nonlinear. This yields various applications, including consistent estimation of the quantiles of the out-of-sample error distribution, which gives rise to prediction intervals with asymptotically exact coverage conditional on the training data.

We study the neural network (NN) compression problem, viewing the tension between the compression ratio and NN performance through the lens of rate-distortion theory. We choose a distortion metric that reflects the effect of NN compression on the model output and then derive the tradeoff between rate (compression ratio) and distortion. In addition to characterizing theoretical limits of NN compression, this formulation shows that \emph{pruning}, implicitly or explicitly, must be a part of a good compression algorithm. This observation bridges a gap between parts of the literature pertaining to NN and data compression, respectively, providing insight into the empirical success of pruning for NN compression. Finally, we propose a novel pruning strategy derived from our information-theoretic formulation and show that it outperforms the relevant baselines on CIFAR-10 and ImageNet datasets.

**Adaptive Private-K-Selection with Adaptive K and Application to Multi-label PATE**

Yuqing Zhu · Yu-Xiang Wang

We provide an end-to-end Renyi DP based-framework for differentially private top-$k$ selection. Unlike previous approaches, which require a data-independent choice on $k$, we propose to privately release a data-dependent choice of $k$ such that the gap between $k$-th and the $(k+1)$st ``quality'' is large. This is achieved by an extension of the Report-Noisy-Max algorithm with a more concentrated Gaussian noise.Not only does this eliminates one hyperparameter, the adaptive choice of $k$ also certifies the stability of the top-$k$ indices in the unordered set so we can release them using a combination of the propose-test-release (PTR) framework and the Distance-to-Stability mechanism. We show that our construction improves the privacy-utility trade-offs compared to the previous top-$k$ selection algorithms theoretically and empirically. Additionally, we apply our algorithm to ``Private Aggregation of Teacher Ensembles (PATE)'' in \emph{multi-label} classification tasks with a large number of labels and show that it leads to significant performance gains.

**Cycle Consistent Probability Divergences Across Different Spaces**

Zhengxin Zhang · Youssef Mroueh · Ziv Goldfeld · Bharath Sriperumbudur

Discrepancy measures between probability distributions are at the core of statistical inference and machine learning. In many applications, distributions of interest are supported on different spaces, and yet a meaningful correspondence between data points is desired. Motivated to explicitly encode consistent bidirectional maps into the discrepancy measure, this work proposes a novel unbalanced Monge optimal transport formulation for matching, up to isometries, distributions on different spaces. Our formulation arises as a principled relaxation of the Gromov-Haussdroff distance between metric spaces, and employs two cycle-consistent maps that push forward each distribution onto the other. We study structural properties of the proposed discrepancy and, in particular, show that it captures the popular cycle-consistent generative adversarial network (GAN) framework as a special case, thereby providing the theory to explain it. Motivated by computational efficiency, we then kernelize the discrepancy and restrict the mappings to parametric function classes. The resulting kernelized version is coined the generalized maximum mean discrepancy (GMMD). Convergence rates for empirical estimation of GMMD are studied and experiments to support our theory are provided.

**Projection Predictive Inference for Generalized Linear and Additive Multilevel Models**

Alejandro Catalina · Paul Bürkner · Aki Vehtari

Projection predictive inference is a decision theoretic Bayesian approach that decouples model estimation from decision making.Given a reference model previously built including all variables present in the data, projection predictive inference projects its posterior onto a constrained space of a subset of variables. Variable selection is then performed by sequentially adding relevant variables until predictive performance is satisfactory. Previously, projection predictive inference has been demonstrated only for generalized linear models (GLMs) and Gaussian processes (GPs) where it showed superior performance to competing variable selection procedures. In this work, we extend projection predictive inference to support variable and structure selection for generalized linear multilevel models (GLMMs) and generalized additive multilevel models (GAMMs). Our simulative and real-world experiments demonstrate that our method can drastically reduce the model complexity required to reach reference predictive performance and achieve good frequency properties.

**Heavy-tailed Streaming Statistical Estimation**

Che-Ping Tsai · Adarsh Prasad · Sivaraman Balakrishnan · Pradeep Ravikumar

We consider the task of heavy-tailed statistical estimation given streaming $p$-dimensional samples. This could also be viewed as stochastic optimization under heavy-tailed distributions, with an additional $O(p)$ space complexity constraint. We design a clipped stochastic gradient descent algorithm and provide an improved analysis, under a more nuanced condition on the noise of the stochastic gradients, which we show is critical when analyzing stochastic optimization problems arising from general statistical estimation problems. Our results guarantee convergence not just in expectation but with exponential concentration, and moreover does so using $O(1)$ batch size. We provide consequences of our results for mean estimation and linear regression. Finally, we provide empirical corroboration of our results and algorithms via synthetic experiments for mean estimation and linear regression.

**Adversarial Tracking Control via Strongly Adaptive Online Learning with Memory**

Zhiyu Zhang · Ashok Cutkosky · Ioannis Paschalidis

We consider the problem of tracking an adversarial state sequence in a linear dynamical system subject to adversarial disturbances and loss functions, generalizing earlier settings in the literature. To this end, we develop three techniques, each of independent interest. First, we propose a comparator-adaptive algorithm for online linear optimization with movement cost. Without tuning, it nearly matches the performance of the optimally tuned gradient descent in hindsight. Next, considering a related problem called online learning with memory, we construct a novel strongly adaptive algorithm that uses our first contribution as a building block. Finally, we present the first reduction from adversarial tracking control to strongly adaptive online learning with memory. Summarizing these individual techniques, we obtain an adversarial tracking controller with a strong performance guarantee even when the reference trajectory has a large range of movement.

**Minimax Kernel Machine Learning for a Class of Doubly Robust Functionals with Application to Proximal Causal Inference**

AmirEmad Ghassami · Andrew Ying · Ilya Shpitser · Eric Tchetgen Tchetgen

Robins et al. (2008) introduced a class of influence functions (IFs) which could be used to obtain doubly robust moment functions for the corresponding parameters. However, that class does not include the IF of parameters for which the nuisance functions are solutions to integral equations. Such parameters are particularly important in the field of causal inference, specifically in the recently proposed proximal causal inference framework of Tchetgen Tchetgen et al. (2020), which allows for estimating the causal effect in the presence of latent confounders. In this paper, we first extend the class of Robins et al. to include doubly robust IFs in which the nuisance functions are solutions to integral equations. Then we demonstrate that the double robustness property of these IFs can be leveraged to construct estimating equations for the nuisance functions, which enables us to solve the integral equations without resorting to parametric models. We frame the estimation of the nuisance functions as a minimax optimization problem. We provide convergence rates for the nuisance functions and conditions required for asymptotic linearity of the estimator of the parameter of interest. The experiment results demonstrate that our proposed methodology leads to robust and high-performance estimators for average causal effect in the proximal causal inference framework.

**Equivariance Discovery by Learned Parameter-Sharing**

Raymond A. Yeh · Yuan-Ting Hu · Mark Hasegawa-Johnson · Alexander Schwing

Designing equivariance as an inductive bias into deep-nets has been a prominent approach to build effective models, e.g., a convolutional neural network incorporates translation equivariance. However, incorporating these inductive biases requires knowledge about the equivariance properties of the data, which may not be available, e.g., when encountering a new domain. To address this, we study how to "discover interpretable equivariances" from data. Specifically, we formulate this discovery process as an optimization problem over a model's parameter-sharing schemes. We propose to use the partition distance to empirically quantify the accuracy of the recovered equivariance. Also, we theoretically analyze the method for Gaussian data and provide a bound on the mean squared gap between the studied discovery scheme and the oracle scheme. Empirically, we show that the approach recovers known equivariances, such as permutations and shifts, on sum of numbers and spatially-invariant data.

**A Dimensionality Reduction Method for Finding Least Favorable Priors with a Focus on Bregman Divergence**

Alex Dytso · Mario Goldenbaum · H. Vincent Poor · Shlomo Shamai

A common way of characterizing minimax estimators in point estimation is by moving the problem into the Bayesian estimation domain and finding a least favorable prior distribution. The Bayesian estimator induced by a least favorable prior, under mild conditions, is then known to be minimax. However, finding least favorable distributions can be challenging due to inherent optimization over the space of probability distributions, which is infinite-dimensional. This paper develops a dimensionality reduction method that allows us to move the optimization to a finite-dimensional setting with an explicit bound on the dimension. The benefit of this dimensionality reduction is that it permits the use of popular algorithms such as projected gradient ascent to find least favorable priors. Throughout the paper, in order to make progress on the problem, we restrict ourselves to Bayesian risks induced by a relatively large class of loss functions, namely Bregman divergences.

**A Random Matrix Perspective on Mixtures of Nonlinearities in High Dimensions**

Ben Adlam · Jake Levinson · Jeffrey Pennington

One of the distinguishing characteristics of modern deep learning systems is their use of neural network architectures with enormous numbers of parameters, often in the millions and sometimes even in the billions. While this paradigm has inspired significant research on the properties of large networks, relatively little work has been devoted to the fact that these networks are often used to model large complex datasets, which may themselves contain millions or even billions of constraints. In this work, we focus on this high-dimensional regime in which both the dataset size and the number of features tend to infinity. We analyze the performance of random feature regression with features $F=f(WX+B)$ for a random weight matrix $W$ and bias vector $B$, obtaining exact formulae for the asymptotic training and test errors for data generated by a linear teacher model. The role of the bias can be understood as parameterizing a distribution over activation functions, and our analysis directly generalizes to such distributions, even those not expressible with a traditional additive bias. Intriguingly, we find that a mixture of nonlinearities can improve both the training and test errors over the best single nonlinearity, suggesting that mixtures of nonlinearities might be useful for approximate kernel methods or neural network architecture design.

**One-bit Submission for Locally Private Quasi-MLE: Its Asymptotic Normality and Limitation**

Hajime Ono · Kazuhiro Minami · Hideitsu Hino

Local differential privacy~(LDP) is an information-theoretic privacy definition suitable for statistical surveys that involve an untrusted data curator. An LDP version of quasi-maximum likelihood estimator~(QMLE) has been developed, but the existing method to build LDP QMLE is difficult to implement for a large-scale survey system in the real world due to long waiting time, expensive communication cost, and the boundedness assumption of derivative of a log-likelihood function. We provided alternative LDP protocols without those issues, which are potentially much easily deployable to a large-scale survey. We also provided sufficient conditions for the consistency and asymptotic normality and limitations of our protocol. Our protocol is less burdensome for the users, and the theoretical guarantees cover more realistic cases than those for the existing method.

**Exploring Counterfactual Explanations Through the Lens of Adversarial Examples: A Theoretical and Empirical Analysis**

Martin Pawelczyk · Chirag Agarwal · Shalmali Joshi · Sohini Upadhyay · Himabindu Lakkaraju

As machine learning (ML) models becomemore widely deployed in high-stakes applications, counterfactual explanations have emerged as key tools for providing actionable model explanations in practice. Despite the growing popularity of counterfactual explanations, the theoretical understanding of these explanations is still lacking behind. In this work, we systematically analyze counterfactual explanations through the lens of adversarial examples. We do so by formalizing the similarities between popular counterfactual explanation and adversarial example generation methods identifying conditions when they are equivalent. We then derive upper bounds between the solutions output by counterfactual explanation and adversarial example generation methods, which we validate on several real world data sets. By establishing these theoretical and empirical similarities between counterfactual explanations and adversarial examples, our work raises fundamental questions about the design and development of existing counterfactual explanation algorithms.

We prove asymptotic convergence for a general class of k-means algorithms performed over streaming data from a distribution—the centers asymptotically converge to the set of stationary points of the k-means objective function. To do so, we show that online k-means over a distribution can be interpreted as stochastic gradient descent with a stochastic learning rate schedule. Then, we prove convergence by extending techniques used in optimization literature to handle settings where center-specific learning rates may depend on the past trajectory of the centers.