Timezone:
Schedule Tue Wed Thu

Opening Remarks Tue 13 Apr 08:45 a.m.  

Arindam Banerjee · Kenji Fukumizu

Please join us immediately after the opening remarks in Emmanuel Candes' Invited Talk


Invited Talk: Emmanuel Candes

Reliable Predictions? Counterfactual Predictions? Equitable Treatment? Some Recent Progress in Predictive Inference

Recent progress in machine learning provides us with many potentially effective tools to learn from datasets of ever increasing sizes and make useful predictions. How do we know that these tools can be trusted in critical and high-sensitivity systems? If a learning algorithm predicts the GPA of a prospective college applicant, what guarantees do I have concerning the accuracy of this prediction? How do we know that it is not biased against certain groups of applicants? This talk introduces statistical ideas to ensure that the learned models satisfy some crucial properties, especially reliability and fairness (in the sense that the models need to apply to individuals in an equitable manner). To achieve these important objectives, we shall not “open up the black box” and try understanding its underpinnings. Rather we discuss broad methodologies that can be wrapped around any black box to produce results that can be trusted and are equitable. We also show how our ideas can inform causal inference predictive; for instance, we will answer counterfactual predictive problems: i.e. predict the outcome of a treatment would have been given that the patient was actually not treated.

Emmanuel Candes

 

Emmanuel Jean Candès is the Barnum-Simons Chair in Mathematics and Statistics and Professor of Mathematics, of Statistics, and of Electrical Engineering at Stanford University, where he is also Director of Data Science. His research interests include compressive sensing, mathematical signal processing, computational harmonic analysis, statistics, and scientific computing. Candès is also interested in applications to the imaging sciences and inverse problems, as well as theoretical computer science, mathematical optimization, and information theory. A graduate of the École Polytechnique, Candès earned a Ph.D. in statistics from Stanford University. Candès has received numerous awards, including the James H. Wilkinson Prize in Numerical Analysis and Scientific Computing, the Vasil A. Popov Prize, the Alan T. Waterman Award, the George Pólya Prize (with Terence Tao), the ICIAM Collatz Prize, the Lagrange Prize in Continuous Optimization, the Dannie Heineman Prize, the George David Birkhoff Prize, and a MacArthur Fellowship. Candès is a fellow of SIAM and the American Mathematical Society, and was elected to the National Academy of Sciences.



Oral: Theory of Statistical and Deep Learning Methods Tue 13 Apr 10:30 a.m.  

Homeomorphic-Invariance of EM: Non-Asymptotic Convergence in KL Divergence for Exponential Families via Mirror Descent
Frederik Kunstner · Raunak Kumar · Mark Schmidt

Expectation maximization (EM) is the default algorithm for fitting probabilistic models with missing or latent variables, yet we lack a full understanding of its non-asymptotic convergence properties. Previous works show results along the lines of "EM converges at least as fast as gradient descent" by assuming the conditions for the convergence of gradient descent apply to EM. This approach is not only loose, in that it does not capture that EM can make more progress than a gradient step, but the assumptions fail to hold for textbook examples of EM like Gaussian mixtures. In this work we first show that for the common setting of exponential family distributions, viewing EM as a mirror descent algorithm leads to convergence rates in Kullback-Leibler (KL) divergence. Then, we show how the KL divergence is related to first-order stationarity via Bregman divergences. In contrast to previous works, the analysis is invariant to the choice of parametrization and holds with minimal assumptions. We also show applications of these ideas to local linear (and superlinear) convergence rates, generalized EM, and non-exponential family distributions.

Recovery Guarantees for Kernel-based Clustering under Non-parametric Mixture Models
Leena Chennuru Vankadara · Sebastian Bordt · Ulrike von Luxburg · Debarghya Ghoshdastidar

Despite the ubiquity of kernel-based clustering, surprisingly few statistical guarantees exist beyond settings that consider strong structural assumptions on the data generation process. In this work, we take a step towards bridging this gap by studying the statistical performance of kernel-based clustering algorithms under non-parametric mixture models. We provide necessary and sufficient separability conditions under which these algorithms can consistently recover the underlying true clustering. Our analysis provides guarantees for kernel clustering approaches without structural assumptions on the form of the component distributions. Additionally, we establish a key equivalence between kernel-based data-clustering and kernel density-based clustering. This enables us to provide consistency guarantees for kernel-based estimators of non-parametric mixture models. Along with theoretical implications, this connection could have practical implications, including in the systematic choice of the bandwidth of the Gaussian kernel in the context of clustering.

Towards a Theoretical Understanding of the Robustness of Variational Autoencoders
Alexander Camuto · Matthew Willetts · Stephen Roberts · Chris Holmes · Tom Rainforth
We make inroads into understanding the robustness of Variational Autoencoders (VAEs) to adversarial attacks and other input perturbations. While previous work has developed algorithmic approaches to attacking and defending VAEs, there remains a lack of formalization for what it means for a VAE to be robust. To address this, we develop a novel criterion for robustness in probabilistic models: $r$-robustness. We then use this to construct the first theoretical results for the robustness of VAEs, deriving margins in the input space for which we can provide guarantees about the resulting reconstruction. Informally, we are able to define a region within which any perturbation will produce a reconstruction that is similar to the original reconstruction. To support our analysis, we show that VAEs trained using disentangling methods not only score well under our robustness metrics, but that the reasons for this can be interpreted through our theoretical results.
Stable ResNet
Soufiane Hayou · Eugenio Clerico · Bobby He · George Deligiannidis · Arnaud Doucet · Judith Rousseau

Deep ResNet architectures have achieved state of the art performance on many tasks. While they solve the problem of gradient vanishing, they might suffer from gradient exploding as the depth becomes large (Yang et al. 2017). Moreover, recent results have shown that ResNet might lose expressivity as the depth goes to infinity (Yang et al. 2017, Hayou et al. 2019). To resolve these issues, we introduce a new class of ResNet architectures, calledStable ResNet, that have the property of stabilizing the gradient while ensuring expressivity in the infinite depth limit.


Oral: Sampling Methods Tue 13 Apr 11:30 a.m.  

Couplings for Multinomial Hamiltonian Monte Carlo
Kai Xu · Tor Erlend Fjelde · Charles Sutton · Hong Ge

Hamiltonian Monte Carlo (HMC) is a popular sampling method in Bayesian inference. Recently, Heng & Jacob (2019) studied Metropolis HMC with couplings for unbiased Monte Carlo estimation, establishing a generic parallelizable scheme for HMC. However, in practice a different HMC method, multinomial HMC, is considered as the go-to method, e.g. as part of the no-U-turn sampler. In multinomial HMC, proposed states are not limited to end-points as in Metropolis HMC; instead points along the entire trajectory can be proposed. In this paper, we establish couplings for multinomial HMC, based on optimal transport for multinomial sampling in its transition. We prove an upper bound for the meeting time -- the time it takes for the coupled chains to meet -- based on the notion of local contractivity. We evaluate our methods using three targets: 1,000 dimensional Gaussians, logistic regression and log-Gaussian Cox point processes. Compared to Heng & Jacob (2019), coupled multinomial HMC generally attains a smaller meeting time, and is more robust to choices of step sizes and trajectory lengths, which allows re-use of existing adaptation methods for HMC. These improvements together paves the way for a wider and more practical use of coupled HMC methods.

An Adaptive-MCMC Scheme for Setting Trajectory Lengths in Hamiltonian Monte Carlo
Matthew Hoffman · Alexey Radul · Pavel Sountsov

Hamiltonian Monte Carlo (HMC) is a powerful MCMC algorithm based on simulating Hamiltonian dynamics. Its performance depends strongly on choosing appropriate values for two parameters: the step size used in the simulation, and how long the simulation runs for. The step-size parameter can be tuned using standard adaptive-MCMC strategies, but it is less obvious how to tune the simulation-length parameter. The no-U-turn sampler (NUTS) eliminates this problematic simulation-length parameter, but NUTS’s relatively complex control flow makes it difficult to efficiently run many parallel chains on accelerators such as GPUs. NUTS also spends some extra gradient evaluations relative to HMC in order to decide how long to run each iteration without violating detailed balance. We propose ChEES-HMC, a simple adaptive-MCMC scheme for automatically tuning HMC’s simulation-length parameter, which minimizes a proxy for the autocorrelation of the state’s second moments. We evaluate ChEES-HMC and NUTS on many tasks, and find that ChEES-HMC typically yields larger effective sample sizes per gradient evaluation than NUTS does. When running many chains on a GPU, ChEES-HMC can also run significantly more gradient evaluations per second than NUTS, allowing it to quickly provide accurate estimates of posterior expectations.

Maximal Couplings of the Metropolis-Hastings Algorithm
Guanyang Wang · John O'Leary · Pierre Jacob

Couplings play a central role in the analysis of Markov chain Monte Carlo algorithms and appear increasingly often in the algorithms themselves, e.g. in convergence diagnostics, parallelization, and variance reduction techniques. Existing couplings of the Metropolis-Hastings algorithm handle the proposal and acceptance steps separately and fall short of the upper bound on one-step meeting probabilities given by the coupling inequality. This paper introduces maximal couplings which achieve this bound while retaining the practical advantages of current methods. We consider the properties of these couplings and examine their behavior on a selection of numerical examples.

GANs with Conditional Independence Graphs: On Subadditivity of Probability Divergences
Mucong Ding · Constantinos Daskalakis · Soheil Feizi

Generative Adversarial Networks (GANs) are modern methods to learn the underlying distribution of a data set. GANs have been widely used in sample synthesis, de-noising, domain transfer, etc. GANs, however, are designed in a model-free fashion where no additional information about the underlying distribution is available. In many applications, however, practitioners have access to the underlying independence graph of the variables, either as a Bayesian network or a Markov Random Field (MRF). We ask: how can one use this additional information in designing model-based GANs? In this paper, we provide theoretical foundations to answer this question by studying subadditivity properties of probability divergences, which establish upper bounds on the distance between two high-dimensional distributions by the sum of distances between their marginals over (local) neighborhoods of the graphical structure of the Bayes-net or the MRF. We prove that several popular probability divergences satisfy some notion of subadditivity under mild conditions. These results lead to a principled design of a model-based GAN that uses a set of simple discriminators on the neighborhoods of the Bayes-net/MRF, rather than a giant discriminator on the entire network, providing significant statistical and computational benefits. Our experiments on synthetic and real-world datasets demonstrate the benefits of …


Affinity Event: WiML and Caucus for Women in Statistics Tue 13 Apr 12:30 p.m.  

Tatjana Chavdarova · Sarah Tan · Jessica Kohlschmidt

WiML and the Caucus for Women in Statistics (CWS) are excited to announce a joint event at AISTATS 2021. The event has two components: community-driven mentoring, and a panel. The event will be held on the Icebreaker.video platform on Tuesday, April 13, 2021, 12.30pm – 2pm PT.

WiML Homepage
Caucus for Women in Statistics Homepage

Joining Instructions



How to join: Join the Icebreaker event Event limited to 200 participants. You’ll be asked to sign in to Google, and give Icebreaker permission to access your camera and microphone. Google Chrome browser recommended.

Participant instructions: Whether you will participate as a mentor or mentee, we suggest preparing one or two lines to describe your work and research, as well as any other topics you may want to discuss. During the panel, you can type questions for the panelists in Icebreaker chat, so bring any questions on reviewing and publishing! See below for more information on Icebreaker.

Questions? Email workshop@wimlworkshop.org or cws@cwstat.org. Note that this is a separate event from the AISTATS mentoring sessions. By joining the event, you agree to abide by the AISTATS Code of Conduct and WiML Code of Conduct.


Mentorship Session 1 Tue 13 Apr 12:30 p.m.  

Please use the Mementor portal to schedule virtual mentor sessions during AISTATS and beyond. The goal is to enable mentorship opportunities for researchers in machine learning, both as mentors and mentees, with a special focus on under-represented minorities.

The mentorship session serves as a platform to share experiences. These could be technical and research related (e.g., research topics and technical discussions), or could be about scientific communication (e.g., paper writing, presentation, networking), or could also be mental health, burnouts, work ethics, PhD life etc. The goal is to facilitate sharing of experiences between members of the community which would not happen otherwise.


Poster Session 1 Tue 13 Apr 02:00 p.m.  

Variational Selective Autoencoder: Learning from Partially-Observed Heterogeneous Data
Yu Gong · Hossein Hajimirsadeghi · Jiawei He · Thibaut Durand · Greg Mori

Learning from heterogeneous data poses challenges such as combining data from various sources and of different types. Meanwhile, heterogeneous data are often associated with missingness in real-world applications due to heterogeneity and noise of input sources. In this work, we propose the variational selective autoencoder (VSAE), a general framework to learn representations from partially-observed heterogeneous data. VSAE learns the latent dependencies in heterogeneous data by modeling the joint distribution of observed data, unobserved data, and the imputation mask which represents how the data are missing. It results in a unified model for various downstream tasks including data generation and imputation. Evaluation on both low-dimensional and high-dimensional heterogeneous datasets for these two tasks shows improvement over state-of-the-art models.

Does Invariant Risk Minimization Capture Invariance?
Pritish Kamath · Akilesh Tangella · Danica Sutherland · Nathan Srebro

We show that the Invariant Risk Minimization (IRM) formulation of Arjovsky et al. (2019) can fail to capture "natural" invariances, at least when used in its practical "linear" form, and even on very simple problems which directly follow the motivating examples for IRM. This can lead to worse generalization on new environments, even when compared to unconstrained ERM. The issue stems from a significant gap between the linear variant (as in their concrete method IRMv1) and the full non-linear IRM formulation. Additionally, even when capturing the "right" invariances, we show that it is possible for IRM to learn a sub-optimal predictor, due to the loss function not being invariant across environments. The issues arise even when measuring invariance on the population distributions, but are exacerbated by the fact that IRM is extremely fragile to sampling.

Comparing the Value of Labeled and Unlabeled Data in Method-of-Moments Latent Variable Estimation
Mayee Chen · Benjamin Cohen-Wang · Stephen Mussmann · Frederic Sala · Christopher Re

Labeling data for modern machine learning is expensive and time-consuming. Latent variable models can be used to infer labels from weaker, easier-to-acquire sources operating on unlabeled data. Such models can also be trained using labeled data, presenting a key question: should a user invest in few labeled or many unlabeled points? We answer this via a framework centered on model misspecification in method-of-moments latent variable estimation. Our core result is a bias-variance decomposition of the generalization error, which shows that the unlabeled-only approach incurs additional bias under misspecification. We then introduce a correction that provably removes this bias in certain cases. We apply our decomposition framework to three scenarios---well-specified, misspecified, and corrected models---to 1) choose between labeled and unlabeled data and 2) learn from their combination. We observe theoretically and with synthetic experiments that for well-specified models, labeled points are worth a constant factor more than unlabeled points. With misspecification, however, their relative value is higher due to the additional bias but can be reduced with correction. We also apply our approach to study real-world weak supervision techniques for dataset construction.

Simultaneously Reconciled Quantile Forecasting of Hierarchically Related Time Series
Xing Han · Sambarta Dasgupta · Joydeep Ghosh

Many real-life applications involve simultaneously forecasting multiple time series that are hierarchically related via aggregation or disaggregation operations. For instance, commercial organizations often want to forecast inventories simultaneously at store, city, and state levels for resource planning purposes. In such applications, it is important that the forecasts, in addition to being reasonably accurate, are also consistent w.r.t one another. Although forecasting such hierarchical time series has been pursued by economists and data scientists, the current state-of-the-art models use strong assumptions, e.g., all forecasts being unbiased estimates, noise distribution being Gaussian. Besides, state-of-the-art models have not harnessed the power of modern nonlinear models, especially ones based on deep learning. In this paper, we propose using a flexible nonlinear model that optimizes quantile regression loss coupled with suitable regularization terms to maintain the consistency of forecasts across hierarchies. The theoretical framework introduced herein can be applied to any forecasting model with an underlying differentiable loss function. A proof of optimality of our proposed method is also provided. Simulation studies over a range of datasets highlight the efficacy of our approach.

A Dynamical View on Optimization Algorithms of Overparameterized Neural Networks
Zhiqi Bu · Shiyun Xu · Kan Chen

When equipped with efficient optimization algorithms, the over-parameterized neural networks have demonstrated high level of performance even though the loss function is non-convex and non-smooth. While many works have been focusing on understanding the loss dynamics by training neural networks with the gradient descent (GD), in this work, we consider a broad class of optimization algorithms that are commonly used in practice. For example, we show from a dynamical system perspective that the Heavy Ball (HB) method can converge to global minimum on mean squared error (MSE) at a linear rate (similar to GD); however, the Nesterov accelerated gradient descent (NAG) may only converge to global minimum sublinearly.

Our results rely on the connection between neural tangent kernel (NTK) and finitely-wide over-parameterized neural networks with ReLU activation, which leads to analyzing the limiting ordinary differential equations (ODE) for optimization algorithms. We show that, optimizing the non-convex loss over the weights corresponds to optimizing some strongly convex loss over the prediction error. As a consequence, we can leverage the classical convex optimization theory to understand the convergence behavior of neural networks. We believe our approach can also be extended to other optimization algorithms and network architectures.

Inductive Mutual Information Estimation: A Convex Maximum-Entropy Copula Approach
Yves-Laurent Kom Samo
We propose a novel estimator of the mutual information between two ordinal vectors $x$ and $y$. Our approach is inductive (as opposed to deductive) in that it depends on the data generating distribution solely through some nonparametric properties revealing associations in the data, and does not require having enough data to fully characterize the true joint distributions $P_{x, y}$. Specifically, our approach consists of (i) noting that $I\left(y; x\right) = I\left(u_y; u_x\right)$ where $u_y$ and $u_x$ are the \emph{copula-uniform dual representations} of $y$ and $x$ (i.e. their images under the probability integral transform), and (ii) estimating the copula entropies $h\left(u_y\right)$, $h\left(u_x\right)$ and $h\left(u_y, u_x\right)$ by solving a maximum-entropy problem over the space of copula densities under a constraint of the type $\bm{\alpha}_m = E\left[\phi_m(u_y, u_x)\right]$. We prove that, so long as the constraint is feasible, this problem admits a unique solution, it is in the exponential family, and it can be learned by solving a convex optimization problem. The resulting estimator, which we denote MIND, is marginal-invariant, always non-negative, unbounded for any sample size $n$, consistent, has MSE rate $O(1/n)$, and is more data-efficient than competing approaches.
Bayesian Active Learning by Soft Mean Objective Cost of Uncertainty
Guang Zhao · Edward Dougherty · Byung-Jun Yoon · Francis J. Alexander · Xiaoning Qian

To achieve label efficiency for training supervised learning models, pool-based active learning sequentially selects samples from a set of candidates as queries to label by optimizing an acquisition function. One category of existing methods adopts one-step-look-ahead strategies based on acquisition functions tailored with the learning objectives, for example based on the expected loss reduction (ELR) or the mean objective cost of uncertainty (MOCU) proposed recently. These active learning methods are optimal with the maximum classification error reduction when one considers a single query. However, it is well-known that there is no performance guarantee in the long run for these myopic methods. In this paper, we show that these methods are not guaranteed to converge to the optimal classifier of the true model because MOCU is not strictly concave. Moreover, we suggest a strictly concave approximation of MOCU---Soft MOCU---that can be used to define an acquisition function to guide Bayesian active learning with theoretical convergence guarantee. For training Bayesian classifiers with both synthetic and real-world data, our experiments demonstrate the superior performance of active learning by Soft MOCU compared to other existing methods.

Matérn Gaussian Processes on Graphs
Viacheslav Borovitskiy · Iskander Azangulov · Alexander Terenin · Peter Mostowsky · Marc Deisenroth · Nicolas Durrande

Gaussian processes are a versatile framework for learning unknown functions in a manner that permits one to utilize prior information about their properties. Although many different Gaussian process models are readily available when the input space is Euclidean, the choice is much more limited for Gaussian processes whose input space is an undirected graph. In this work, we leverage the stochastic partial differential equation characterization of Matérn Gaussian processes—a widely-used model class in the Euclidean setting—to study their analog for undirected graphs. We show that the resulting Gaussian processes inherit various attractive properties of their Euclidean and Riemannian analogs and provide techniques that allow them to be trained using standard methods, such as inducing points. This enables graph Matérn Gaussian processes to be employed in mini-batch and non-conjugate settings, thereby making them more accessible to practitioners and easier to deploy within larger learning frameworks.

A Hybrid Approximation to the Marginal Likelihood
Eric Chuu · Debdeep Pati · Anirban Bhattacharya

Computing the marginal likelihood or evidence is one of the core challenges in Bayesian analysis. While there are many established methods for estimating this quantity, they predominantly rely on using a large number of posterior samples obtained from a Markov Chain Monte Carlo (MCMC) algorithm. As the dimension of the parameter space increases, however, many of these methods become prohibitively slow and potentially inaccurate. In this paper, we propose a novel method in which we use the MCMC samples to learn a high probability partition of the parameter space and then form a deterministic approximation over each of these partition sets. This two-step procedure, which constitutes both a probabilistic and a deterministic component, is termed a Hybrid approximation to the marginal likelihood. We demonstrate its versatility in a plethora of examples with varying dimension and sample size, and we also highlight the Hybrid approximation's effectiveness in situations where there is either a limited number or only approximate MCMC samples available.

Fundamental Limits of Ridge-Regularized Empirical Risk Minimization in High Dimensions
Hossein Taheri · Ramtin Pedarsani · Christos Thrampoulidis

Despite the popularity of Empirical Risk Minimization (ERM) algorithms, a theory that explains their statistical properties in modern high-dimensional regimes is only recently emerging. We characterize for the first time the fundamental limits on the statistical accuracy of convex ridge-regularized ERM for inference in high-dimensional generalized linear models. For a stylized setting with Gaussian features and problem dimensions that grow large at a proportional rate, we start with sharp performance characterizations and then derive tight lower bounds on the estimation and prediction error. Our bounds provably hold over a wide class of loss functions, and, for any value of the regularization parameter and of the sampling ratio. Our precise analysis has several attributes. First, it leads to a recipe for optimally tuning the loss function and the regularization parameter. Second, it allows to precisely quantify the sub-optimality of popular heuristic choices, such as optimally-tuned least-squares. Third, we use the bounds to precisely assess the merits of ridge-regularization as a function of the sampling ratio. Our bounds are expressed in terms of the Fisher Information of random variables that are simple functions of the data distribution, thus making ties to corresponding bounds in classical statistics.

Hogwild! over Distributed Local Data Sets with Linearly Increasing Mini-Batch Sizes
Nhuong Nguyen · Toan Nguyen · PHUONG HA NGUYEN · Quoc Tran-Dinh · Lam Nguyen · Marten van Dijk

Hogwild! implements asynchronous Stochastic Gradient Descent (SGD) where multiple threads in parallel access a common repository containing training data, perform SGD iterations and update shared state that represents a jointly learned (global) model. We consider big data analysis where training data is distributed among local data sets in a heterogeneous way -- and we wish to move SGD computations to local compute nodes where local data resides. The results of these local SGD computations are aggregated by a central ``aggregator'' which mimics Hogwild!. We show how local compute nodes can start choosing small mini-batch sizes which increase to larger ones in order to reduce communication cost (round interaction with the aggregator). We improve state-of-the-art literature and show O(K^{0.5}) communication rounds for heterogeneous data for strongly convex problems, where K is the total number of gradient computations across all local compute nodes. For our scheme, we prove a tight and novel non-trivial convergence analysis for strongly convex problems for heterogeneous data which does not use the bounded gradient assumption as seen in many existing publications. The tightness is a consequence of our proofs for lower and upper bounds of the convergence rate, which show a constant factor difference. We show experimental …

Detection and Defense of Topological Adversarial Attacks on Graphs
Yingxue Zhang · Florence Regol · Soumyasundar Pal · Sakif Khan · Liheng Ma · Mark Coates

Graph neural network (GNN) models achieve superior performance when classifying nodes in graph-structured data. Given that state-of-the-art GNNs share many similarities with their CNN cousins and that CNNs suffer adversarial vulnerabilities, there has also been interest in exploring analogous vulnerabilities in GNNs. Indeed, recent work has demonstrated that node classification performance of several graph models, including the popular graph convolution network (GCN) model, can be severely degraded through adversarial perturbations to the graph structure and the node features. In this work, we take a first step towards detecting adversarial attacks against graph models. We first propose a straightforward single node threshold test for detecting nodes subject to targeted attacks. Subsequently, we describe a kernel-based two-sample test for detecting whether a given subset of nodes within a graph has been maliciously corrupted. The efficacy of our algorithms is established via thorough experiments using commonly used node classification benchmark datasets. We also illustrate the potential practical benefit of our detection method by demonstrating its application to a real-world Bitcoin transaction network.

Efficient Interpolation of Density Estimators
Paxton Turner · Jingbo Liu · Philippe Rigollet

We study the problem of space and time efficient evaluation of a nonparametric estimator that approximates an unknown density. In the regime where consistent estimation is possible, we use a piecewise multivariate polynomial interpolation scheme to give a computationally efficient construction that converts the original estimator to a new estimator that can be queried efficiently and has low space requirements, all without adversely deteriorating the original approximation quality. Our result gives a new statistical perspective on the problem of fast evaluation of kernel density estimators in the presence of underlying smoothness. As a corollary, we give a succinct derivation of a classical result of Kolmogorov---Tikhomirov on the metric entropy of Holder classes of smooth functions.

On the Convergence of Gradient Descent in GANs: MMD GAN As a Gradient Flow
Youssef Mroueh · Truyen Nguyen

We consider the maximum mean discrepancy MMD GAN problem and propose a parametric kernelized gradient flow that mimics the min-max game in gradient regularized MMD GAN. We show that this flow provides a descent direction minimizing the MMD on a statistical manifold of probability distributions. We then derive an explicit condition which ensures that gradient descent on the parameter space of the generator in gradient regularized MMD GAN is globally convergent to the target distribution. Under this condition , we give non asymptotic convergence results for MMD GAN. Another contribution of this paper is the introduction of a dynamic formulation of a regularization of MMD and demonstrating that the parametric kernelized descent for MMD is the gradient flow of this functional with respect to the new Riemannian structure. Our obtained theoretical result allows ones to treat gradient flows for quite general functionals and thus has potential applications to other types of variational inferences on a statistical manifold beyond GANs. Finally, numerical experiments suggest that our parametric kernelized gradient flow stabilizes GAN training and guarantees convergence.

Deep Spectral Ranking
ILKAY YILDIZ · Jennifer Dy · Deniz Erdogmus · Susan Ostmo · J. Peter Campbell · Michael F. Chiang · Stratis Ioannidis

Learning from ranking observations arises in many domains, and siamese deep neural networks have shown excellent inference performance in this setting. However, SGD does not scale well, as an epoch grows exponentially with the ranking observation size. We show that a spectral algorithm can be combined with deep learning methods to significantly accelerate training. We combine a spectral estimate of Plackett-Luce ranking scores with a deep model via the Alternating Directions Method of Multipliers with a Kullback-Leibler proximal penalty. Compared to a state-of-the-art siamese network, our algorithms are up to 175 times faster and attain better predictions by up to 26% Top-1 Accuracy and 6% Kendall-Tau correlation over five real-life ranking datasets.

Designing Transportable Experiments Under S-admissability
My Phan · David Arbour · Drew Dimmery · Anup Rao

We consider the problem of designing a randomized experiment on a source population to estimate the Average Treatment Effect (ATE) on a target population. We propose a novel approach which explicitly considers the target when designing the experiment on the source. Under the covariate shift assumption, we design an unbiased importance-weighted estimator for the target population's ATE. To reduce the variance of our estimator, we design a covariate balance condition (Target Balance) between the treatment and control groups based on the target population. We show that Target Balance achieves a higher variance reduction asymptotically than methods that do not consider the target population during the design phase. Our experiments illustrate that Target Balance reduces the variance even for small sample sizes.

Semi-Supervised Learning with Meta-Gradient
Taihong Xiao · Xin-Yu Zhang · Haolin Jia · Ming-Ming Cheng · Ming-Hsuan Yang

In this work, we propose a simple yet effective meta-learning algorithm in semi-supervised learning. We notice that most existing consistency-based approaches suffer from overfitting and limited model generalization ability, especially when training with only a small number of labeled data. To alleviate this issue, we propose a learn-to-generalize regularization term by utilizing the label information and optimize the problem in a meta-learning fashion. Specifically, we seek the pseudo labels of the unlabeled data so that the model can generalize well on the labeled data, which is formulated as a nested optimization problem. We address this problem using the meta-gradient that bridges between the pseudo label and the regularization term. In addition, we introduce a simple first-order approximation to avoid computing higher-order derivatives and provide theoretic convergence analysis. Extensive evaluations on the SVHN, CIFAR, and ImageNet datasets demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.

Have We Learned to Explain?: How Interpretability Methods Can Learn to Encode Predictions in their Interpretations.
Neil Jethani · Mukund Sudarshan · Yindalon Aphinyanaphongs · Rajesh Ranganath

While the need for interpretable machine learning has been established, many common approaches are slow, lack fidelity, or hard to evaluate. Amortized explanation methods reduce the cost of providing interpretations by learning a global selector model that returns feature importances for a single instance of data. The selector model is trained to optimize the fidelity of the interpretations, as evaluated by a predictor model for the target. Popular methods learn the selector and predictor model in concert, which we show allows predictions to be encoded within interpretations. We introduce EVAL-X as a method to quantitatively evaluate interpretations and REAL-X as an amortized explanation method, which learn a predictor model that approximates the true data generating distribution given any subset of the input. We show EVAL-X can detect when predictions are encoded in interpretations and show the advantages of REAL-X through quantitative and radiologist evaluation.

Sample Complexity Bounds for Two Timescale Value-based Reinforcement Learning Algorithms
Tengyu Xu · Yingbin Liang
Two timescale stochastic approximation (SA) has been widely used in value-based reinforcement learning algorithms. In the policy evaluation setting, it can model the linear and nonlinear temporal difference learning with gradient correction (TDC) algorithms as linear SA and nonlinear SA, respectively. In the policy optimization setting, two timescale nonlinear SA can also model the greedy gradient-Q (Greedy-GQ) algorithm. In previous studies, the non-asymptotic analysis of linear TDC and Greedy-GQ has been studied in the Markovian setting, with single-sample update at each iteration. For the nonlinear TDC algorithm, only the asymptotic convergence has been established. In this paper, we study the non-asymptotic convergence rate of two time-scale linear and nonlinear TDC and Greedy-GQ under Markovian sampling and with mini-batch data for each update. For linear TDC, we provide a novel non-asymptotic analysis and our sample complexity result achieves the complexity $\mathcal{O}(\epsilon^{-1}\log(1/\epsilon))$. For nonlinear TDC and Greedy-GQ, we show that both algorithms attain $\epsilon$-accurate stationary solution with sample complexity $\mathcal{O}(\epsilon^{-2})$. It is the first time that non-asymptotic convergence result has been established for nonlinear TDC and our result for Greedy-GQ outperforms previous result orderwisely by a factor of $\mathcal{O}(\epsilon^{-1}\log(1/\epsilon))$.
Reinforcement Learning for Mean Field Games with Strategic Complementarities
Kiyeob Lee · Desik Rengarajan · Dileep Kalathil · Srinivas Shakkottai

Mean Field Games (MFG) are the class of games with a very large number of agents and the standard equilibrium concept is a Mean Field Equilibrium (MFE). Algorithms for learning MFE in dynamic MFGs are unknown in general. Our focus is on an important subclass that possess a monotonicity property called Strategic Complementarities (MFG-SC). We introduce a natural refinement to the equilibrium concept that we call Trembling-Hand-Perfect MFE (T-MFE), which allows agents to employ a measure of randomization while accounting for the impact of such randomization on their payoffs. We propose a simple algorithm for computing T-MFE under a known model. We also introduce a model-free and a model-based approach to learning T-MFE and provide sample complexities of both algorithms. We also develop a fully online learning scheme that obviates the need for a simulator. Finally, we empirically evaluate the performance of the proposed algorithms via examples motivated by real-world applications.

Maximizing Agreements for Ranking, Clustering and Hierarchical Clustering via MAX-CUT
Vaggos Chatziafratis · Mohammad Mahdian · Sara Ahmadian

In this paper, we study a number of well-known combinatorial optimization problems that fit in the following paradigm: the input is a collection of (potentially inconsistent) local relationships between the elements of a ground set (e.g., pairwise comparisons, similar/dissimilar pairs, or ancestry structure of triples of points), and the goal is to aggregate this information into a global structure (e.g., a ranking, a clustering, or a hierarchical clustering) in a way that maximizes agreement with the input. Well-studied problems such as rank aggregation, correlation clustering, and hierarchical clustering with triplet constraints fall in this class of problems.

We study these problems on stochastic instances with a hidden embedded ground truth solution. Our main algorithmic contribution is a unified technique that uses the maximum cut problem in graphs to approximately solve these problems. Using this technique, we can often get approximation guarantees in the stochastic setting that are better than the known worst case inapproximability bounds for the corresponding problem. On the negative side, we improve the worst case inapproximability bound on several hierarchical clustering formulations through a reduction to related ranking problems.

Training a Single Bandit Arm
Eren Ozbay · Vijay Kamble
In several applications of the stochastic multi-armed bandit problem, the traditional objective of maximizing the expected sum of rewards obtained can be inappropriate. Motivated by the problem of optimizing job assignments to train novice workers of unknown quality in labor platforms, we consider a new objective in the classical setup. Instead of maximizing the expected total reward from $T$ pulls, we consider the vector of cumulative rewards earned from the $K$ arms at the end of $T$ pulls, and aim to maximize the expected value of the highest cumulative reward across the $K$ arms. This corresponds to the objective of training a single, highly skilled worker using a limited supply of training jobs. For this new objective, we show that any policy must incur an instance-dependent asymptotic regret of $\Omega(\log T)$ (with a higher instance-dependent constant compared to the traditional objective) and an instance-independent regret of $\Omega(K^{1/3}T^{2/3})$. We then design an explore-then-commit policy, featuring exploration based on appropriately tuned confidence bounds on the mean reward and an adaptive stopping criterion, which adapts to the problem difficulty and achieves these bounds (up to logarithmic factors). Our numerical experiments demonstrate the efficacy of this policy compared to several natural alternatives in practical …
Sketch based Memory for Neural Networks
Rina Panigrahy · Xin Wang · Manzil Zaheer

Deep learning has shown tremendous success on a variety of problems. However, unlike traditional computational paradigm, most neural networks do not have access to a memory, which might be hampering its ability to scale to large data structures such as graphs, lookup-tables, databases. We propose a neural architecture where sketch based memory is integrated into a neural network in a uniform manner at every layer. This architecture supplements a neural layer by information accessed from the memory before feeding it to the next layer, thereby significantly expanding the capacity of the network to solve larger problem instances. We show theoretically that problems involving key-value lookup that are traditionally stored in standard databases can now be solved using neural networks augmented by our memory architecture. We also show that our memory layer can be viewed as a kernel function. We show benefits on diverse problems such as long tail image classification, language model, large graph multi hop traversal, etc. arguing that they are all build upon the classical key-value lookup problem (or the variant where the keys may be fuzzy).

Provably Efficient Actor-Critic for Risk-Sensitive and Robust Adversarial RL: A Linear-Quadratic Case
Yufeng Zhang · Zhuoran Yang · Zhaoran Wang

Risk-sensitivity plays a central role in artificial intelligence safety. In this paper, we study the global convergence of the actor-critic algorithm for risk-sensitive reinforcement learning (RSRL) with exponential utility, which remains challenging for policy optimization as it lacks the linearity needed to formulate policy gradient. To bypass such an issue of nonlinearity, we resort to the equivalence between RSRL and robust adversarial reinforcement learning (RARL), which is formulated as a zero-sum Markov game with a hypothetical adversary. In particular, the Nash equilibrium (NE) of such a game yields the optimal policy for RSRL, which is provably robust. We focus on a simple yet fundamental setting known as linear-quadratic (LQ) game. To attain the optimal policy, we develop a nested natural actor-critic algorithm, which provably converges to the NE of the LQ game at a sublinear rate, thus solving both RSRL and RARL. To the best knowledge, the proposed nested actor-critic algorithm appears to be the first model-free policy optimization algorithm that provably attains the optimal policy for RSRL and RARL in the LQ setting, which sheds light on more general settings.

List Learning with Attribute Noise
Mahdi Cheraghchi · Elena Grigorescu · Brendan Juba · Karl Wimmer · Ning Xie

We introduce and study the model of list learning with attribute noise. Learning with attribute noise was introduced by Shackelford and Volper (COLT, 1988) as a variant of PAC learning, in which the algorithm has access to noisy examples and uncorrupted labels, and the goal is to recover an accurate hypothesis. Sloan (COLT, 1988) and Goldman and Sloan (Algorithmica, 1995) discovered information-theoretic limits to learning in this model, which have impeded further progress. In this article we extend the model to that of list learning, drawing inspiration from the list-decoding model in coding theory, and its recent variant studied in the context of learning. On the positive side, we show that sparse conjunctions can be efficiently list learned under some assumptions on the underlying ground-truth distribution. On the negative side, our results show that even in the list-learning model, efficient learning of parities and majorities is not possible regardless of the representation used.

No-Regret Algorithms for Private Gaussian Process Bandit Optimization
Abhimanyu Dubey

The widespread proliferation of data-driven decision-making has ushered in a recent interest in the design of privacy-preserving algorithms. In this paper, we consider the ubiquitous problem of gaussian process (GP) bandit optimization from the lens of privacy-preserving statistics. We propose a solution for differentially private GP bandit optimization that combines uniform kernel approximation with random perturbations, providing a generic framework to create differentially-private (DP) Gaussian process bandit algorithms. For two specific DP settings - joint and local differential privacy, we provide algorithms based on efficient quadrature Fourier feature approximators, that are computationally efficient and provably no-regret for a class of stationary kernel functions. In contrast to previous work, our algorithms maintain differential privacy throughout the optimization procedure and critically do not rely on the sample path for prediction, making them scalable and straightforward to release as well.

The Minecraft Kernel: Modelling correlated Gaussian Processes in the Fourier domain
Fergus Simpson · Alexis Boukouvalas · Vaclav Cadek · Elvijs Sarkans · Nicolas Durrande

In the univariate setting, using the kernel spectral representation is an appealing approach for generating stationary covariance functions. However, performing the same task for multiple-output Gaussian processes is substantially more challenging. We demonstrate that current approaches to modelling cross-covariances with a spectral mixture kernel possess a critical blind spot. Pairs of highly correlated (or highly anti-correlated) processes are not reproducible, aside from the special case when their spectral densities are of identical shape. We present a solution to this issue by replacing the conventional Gaussian components of a spectral mixture with block components of finite bandwidth (i.e. rectangular step functions). The proposed family of kernel represents the first multi-output generalisation of the spectral mixture kernel that can approximate any stationary multi-output kernel to arbitrary precision.

Non-Stationary Off-Policy Optimization
Joey Hong · Branislav Kveton · Manzil Zaheer · Yinlam Chow · Amr Ahmed

Off-policy learning is a framework for evaluating and optimizing policies without deploying them, from data collected by another policy. Real-world environments are typically non-stationary and the offline learned policies should adapt to these changes. To address this challenge, we study the novel problem of off-policy optimization in piecewise-stationary contextual bandits. Our proposed solution has two phases. In the offline learning phase, we partition logged data into categorical latent states and learn a near-optimal sub-policy for each state. In the online deployment phase, we adaptively switch between the learned sub-policies based on their performance. This approach is practical and analyzable, and we provide guarantees on both the quality of off-policy optimization and the regret during online deployment. To show the effectiveness of our approach, we compare it to state-of-the-art baselines on both synthetic and real-world datasets. Our approach outperforms methods that act only on observed context.

Learning the Truth From Only One Side of the Story
Heinrich Jiang · Qijia Jiang · Aldo Pacchiano

Learning under one-sided feedback (i.e., where we only observe the labels for examples we predicted positively on) is a fundamental problem in machine learning -- applications include lending and recommendation systems. Despite this, there has been surprisingly little progress made in ways to mitigate the effects of the sampling bias that arises. We focus on generalized linear models and show that without adjusting for this sampling bias, the model may converge suboptimally or even fail to converge to the optimal solution. We propose an adaptive approach that comes with theoretical guarantees and show that it outperforms several existing methods empirically. Our method leverages variance estimation techniques to efficiently learn under uncertainty, offering a more principled alternative compared to existing approaches.

Generalization of Quasi-Newton Methods: Application to Robust Symmetric Multisecant Updates
Damien Scieur · Lewis Liu · Thomas Pumir · Nicolas Boumal

Quasi-Newton (qN) techniques approximate the Newton step by estimating the Hessian using the so-called secant equations. Some of these methods compute the Hessian using several secant equations but produce non-symmetric updates. Other quasi-Newton schemes, such as BFGS, enforce symmetry but cannot satisfy more than one secant equation. We propose a new type of quasi-Newton symmetric update using several secant equations in a least-squares sense. Our approach generalizes and unifies the design of quasi-Newton updates and satisfies provable robustness guarantees.

Counterfactual Representation Learning with Balancing Weights
Serge Assaad · Shuxi Zeng · Chenyang Tao · Shounak Datta · Nikhil Mehta · Ricardo Henao · Fan Li · Lawrence Carin Duke

A key to causal inference with observational data is achieving balance in predictive features associated with each treatment type. Recent literature has explored representation learning to achieve this goal. In this work, we discuss the pitfalls of these strategies – such as a steep trade-off between achieving balance and predictive power – and present a remedy via the integration of balancing weights in causal learning. Specifically, we theoretically link balance to the quality of propensity estimation, emphasize the importance of identifying a proper target population, and elaborate on the complementary roles of feature balancing and weight adjustments. Using these concepts, we then develop an algorithm for flexible, scalable and accurate estimation of causal effects. Finally, we show how the learned weighted representations may serve to facilitate alternative causal learning procedures with appealing statistical features. We conduct an extensive set of experiments on both synthetic examples and standard benchmarks, and report encouraging results relative to state-of-the-art baselines.

An Efficient Algorithm For Generalized Linear Bandit: Online Stochastic Gradient Descent and Thompson Sampling
QIN DING · Cho-Jui Hsieh · James Sharpnack
We consider the contextual bandit problem, where a player sequentially makes decisions based on past observations to maximize the cumulative reward. Although many algorithms have been proposed for contextual bandit, most of them rely on finding the maximum likelihood estimator at each iteration, which requires $O(t)$ time at the $t$-th iteration and are memory inefficient. A natural way to resolve this problem is to apply online stochastic gradient descent (SGD) so that the per-step time and memory complexity can be reduced to constant with respect to $t$, but a contextual bandit policy based on online SGD updates that balances exploration and exploitation has remained elusive. In this work, we show that online SGD can be applied to the generalized linear bandit problem. The proposed SGD-TS algorithm, which uses a single-step SGD update to exploit past information and uses Thompson Sampling for exploration, achieves $\tilde{O}(\sqrt{dT})$ regret with the total time complexity that scales linearly in $T$ and $d$, where $T$ is the total number of rounds and $d$ is the number of features. Experimental results show that SGD-TS consistently outperforms existing algorithms on both synthetic and real datasets.
Regression Discontinuity Design under Self-selection
Sida Peng · Yang Ning

Regression Discontinuity (RD) design is commonly used to estimate the causal effect of a policy. Existing RD relies on the continuity assumption of potential outcomes. However, self selection leads to different distributions of covariates on two sides of the policy intervention, which violates this assumption. The standard RD estimators are no longer applicable in such setting. We show that the direct causal effect can still be recovered under a class of weighted average treatment effects. We propose a set of estimators through a weighted local linear regression framework and prove the consistency and asymptotic normality of the estimators. We apply our method to a novel data set from Microsoft Bing on Generalized Second Price (GSP) auction and show that by placing the advertisement on the second ranked position can increase the click-ability by 1.91%.

Multitask Bandit Learning Through Heterogeneous Feedback Aggregation
Zhi Wang · Chicheng Zhang · Manish Kumar Singh · Laurel Riek · Kamalika Chaudhuri
In many real-world applications, multiple agents seek to learn how to perform highly related yet slightly different tasks in an online bandit learning protocol. We formulate this problem as the $\epsilon$-multi-player multi-armed bandit problem, in which a set of players concurrently interact with a set of arms, and for each arm, the reward distributions for all players are similar but not necessarily identical. We develop an upper confidence bound-based algorithm, RobustAgg($\epsilon$), that adaptively aggregates rewards collected by different players. In the setting where an upper bound on the pairwise dissimilarities of reward distributions between players is known, we achieve instance-dependent regret guarantees that depend on the amenability of information sharing across players. We complement these upper bounds with nearly matching lower bounds. In the setting where pairwise dissimilarities are unknown, we provide a lower bound, as well as an algorithm that trades off minimax regret guarantees for adaptivity to unknown similarity structure.
Problem-Complexity Adaptive Model Selection for Stochastic Linear Bandits
Avishek Ghosh · Abishek Sankararaman · Ramchandran Kannan
We consider the problem of model selection for two popular stochastic linear bandit settings, and propose algorithms that adapts to the unknown problem complexity. In the first setting, we consider the $K$ armed mixture bandits, where the mean reward of arm $i \in [K]$ is $\mu_i+ \langle \alpha_{i,t},\theta^* \rangle $, with $\alpha_{i,t} \in \mathbb{R}^d$ being the known context vector and $\mu_i \in [-1,1]$ and $\theta^*$ are unknown parameters. We define $\|\theta^*\|$ as the problem complexity and consider a sequence of nested hypothesis classes, each positing a different upper bound on $\|\theta^*\|$. Exploiting this, we propose Adaptive Linear Bandit (ALB), a novel phase based algorithm that adapts to the true problem complexity, $\|\theta^*\|$. We show that ALB achieves regret scaling of $\widetilde{O}(\|\theta^*\|\sqrt{T})$, where $\|\theta^*\|$ is apriori unknown. As a corollary, when $\theta^*=0$, ALB recovers the minimax regret for the simple bandit algorithm without such knowledge of $\theta^*$. ALB is the first algorithm that uses parameter norm as model section criteria for linear bandits. Prior state of art algorithms achieve a regret of $\widetilde{O}(L\sqrt{T})$, where $L$ is the upper bound on $\|\theta^*\|$, fed as an input to the problem. In the second setting, we consider the standard linear bandit problem (with possibly …
Cluster Trellis: Data Structures & Algorithms for Exact Inference in Hierarchical Clustering
Sebastian Macaluso · Craig Greenberg · Nicholas Monath · Ji Ah Lee · Patrick Flaherty · Kyle Cranmer · Andrew McGregor · Andrew McCallum

Hierarchical clustering is a fundamental task often used to discover meaningful structures in data. Due to the combinatorial number of possible hierarchical clusterings, approximate algorithms are typically used for inference. In contrast to existing methods, we present novel dynamic-programming algorithms for exact inference in hierarchical clustering based on a novel trellis data structure, and we prove that we can exactly compute the partition function, maximum likelihood hierarchy, and marginal probabilities of sub-hierarchies and clusters. Our algorithms scale in time and space proportional to the powerset of N elements, which is super-exponentially more efficient than explicitly considering each of the (2N − 3)!! possible hierarchies. Also, for larger datasets where our exact algorithms become infeasible, we introduce an approximate algorithm based on a sparse trellis that out- performs greedy and beam search baselines.

Dynamic Cutset Networks
Chiradeep Roy · Tahrima Rahman · Hailiang Dong · Nicholas Ruozzi · Vibhav Gogate

Tractable probabilistic models (TPMs) are appealing because they admit polynomial-time inference for a wide variety of queries. In this work, we extend the cutset network (CN) framework, a powerful sub-class of TPMs that often outperforms probabilistic graphical models in terms of prediction accuracy, to the temporal domain. This extension, dubbed dynamic cutset networks (DCNs), uses a CN to model the prior distribution and a conditional CN to model the transition distribution. We show that although exact inference is intractable when arbitrary conditional CNs are used, particle filtering is efficient. To ensure tractability of exact inference, we introduce a novel constrained conditional model called AND/OR conditional cutset networks and show that under certain conditions exact inference is linear in the size of the corresponding constrained DCN. Experiments on several sequential datasets demonstrate the efficacy of our framework.

A Theoretical Analysis of Catastrophic Forgetting through the NTK Overlap Matrix
Thang Doan · Mehdi Abbana Bennani · Bogdan Mazoure · Guillaume Rabusseau · Pierre Alquier

Continual learning (CL) is a setting in which an agent has to learn from an incoming stream of data during its entire lifetime. Although major advances have been made in the field, one recurring problem which remains unsolved is that of Catastrophic Forgetting (CF). While the issue has been extensively studied empirically, little attention has been paid from a theoretical angle. In this paper, we show that the impact of CF increases as two tasks increasingly align. We introduce a measure of task similarity called the NTK overlap matrix which is at the core of CF. We analyze common projected gradient algorithms and demonstrate how they mitigate forgetting. Then, we propose a variant of Orthogonal Gradient Descent (OGD) which leverages structure of the data through Principal Component Analysis (PCA). Experiments support our theoretical findings and show how our method can help reduce CF on classical CL datasets.

Learning-to-Rank with Partitioned Preference: Fast Estimation for the Plackett-Luce Model
Jiaqi Ma · Xinyang Yi · Weijing Tang · Zhe Zhao · Lichan Hong · Ed Chi · Qiaozhu Mei
We consider the problem of listwise learning-to-rank (LTR) on data with \textit{partitioned preference}, where a set of items are sliced into ordered and disjoint partitions, but the ranking of items within a partition is unknown. The Plackett-Luce (PL) model has been widely used in listwise LTR methods. However, given $N$ items with $M$ partitions, calculating the likelihood of data with partitioned preference under the PL model has a time complexity of $O(N+S!)$, where $S$ is the maximum size of the top $M-1$ partitions. This computational challenge restrains existing PL-based listwise LTR methods to only a special case of partitioned preference, \textit{top-$K$ ranking}, where the exact order of the top $K$ items is known. In this paper, we exploit a random utility model formulation of the PL model and propose an efficient approach through numerical integration for calculating the likelihood. This numerical approach reduces the aforementioned time complexity to $O(N+MS)$, which allows training deep-neural-network-based ranking models with a large output space. We demonstrate that the proposed method outperforms well-known LTR baselines and remains scalable through both simulation experiments and applications to real-world eXtreme Multi-Label (XML) classification tasks. The proposed method also achieves state-of-the-art performance on XML datasets with relatively large numbers …
Nonparametric Variable Screening with Optimal Decision Stumps
Jason Klusowski · Peter M Tian

Decision trees and their ensembles are endowed with a rich set of diagnostic tools for ranking and screening variables in a predictive model. Despite the widespread use of tree based variable importance measures, pinning down their theoretical properties has been challenging and therefore largely unexplored. To address this gap between theory and practice, we derive finite sample performance guarantees for variable selection in nonparametric models using a single-level CART decision tree (a decision stump). Under standard operating assumptions in variable screening literature, we find that the marginal signal strength of each variable and ambient dimensionality can be considerably weaker and higher, respectively, than state-of-the-art nonparametric variable selection methods. Furthermore, unlike previous marginal screening methods that estimate each marginal projection via a truncated basis expansion, the fitted model used here is a simple, parsimonious decision stump, thereby eliminating the need for tuning the number of basis terms. Thus, surprisingly, even though decision stumps are highly inaccurate for estimation purposes, they can still be used to perform consistent model selection.

Principal Subspace Estimation Under Information Diffusion
Fan Zhou · Ping Li · Zhixin Zhou
Let $\mathbf{A} = \mathbf{L}_0 + \mathbf{S}_0$, where $\mathbf{L}_0 \in \mathbb{R}^{d\times d}$ is low rank and $\mathbf{S}_0$ is a perturbation matrix. We study the principal subspace estimation of $\mathbf{L}_0$ through observations $\mathbf{y}_j = f(\mathbf{A})\mathbf{x}_j$, $j=1,\dots,n$, where $f:\mathbb{R}\rightarrow \mathbb{R}$ is an unknown polynomial and $\mathbf{x}_j$'s are i.i.d. random input signals. Such models are widely used in graph signal processing to model information diffusion dynamics over networks with applications in network topology inference and data analysis. We develop an estimation procedure based on nuclear norm penalization, and establish upper bounds on the principal subspace estimation error when $\mathbf{A}$ is the adjacency matrix of a random graph generated by $\mathbf{L}_0$. Our theory shows that when the signal strength is strong enough, the exact rank of $\mathbf{L}_0$ can be recovered. By applying our results to blind community detection, we show that consistency of spectral clustering can be achieved for some popular stochastic block models. Together with the experimental results, our theory show that there is a fundamental limit of using the principal components obtained from diffused graph signals which is commonly adapted in current practice. Finally, under some structured perturbation $\mathbf{S}_0$, we build the connection between this model with spiked covariance model and develop a …
Learning with Gradient Descent and Weakly Convex Losses
Dominic Richards · Mike Rabbat

We study the learning performance of gradient descent when the empirical risk is weakly convex, namely, the smallest negative eigenvalue of the empirical risk's Hessian is bounded in magnitude. By showing that this eigenvalue can control the stability of gradient descent, generalisation error bounds are proven that hold under a wider range of step sizes compared to previous work. Out of sample guarantees are then achieved by decomposing the test error into generalisation, optimisation and approximation errors, each of which can be bounded and traded off with respect to algorithmic parameters, sample size and magnitude of this eigenvalue. In the case of a two layer neural network, we demonstrate that the empirical risk can satisfy a notion of local weak convexity, specifically, the Hessian's smallest eigenvalue during training can be controlled by the normalisation of the layers, i.e., network scaling. This allows test error guarantees to then be achieved when the population risk minimiser satisfies a complexity assumption. By trading off the network complexity and scaling, insights are gained into the implicit bias of neural network scaling, which are further supported by experimental findings.

Active Online Learning with Hidden Shifting Domains
Yining Chen · Haipeng Luo · Tengyu Ma · Chicheng Zhang

Online machine learning systems need to adapt to domain shifts. Meanwhile, acquiring label at every timestep is expensive. We propose a surprisingly simple algorithm that adaptively balances its regret and its number of label queries in settings where the data streams are from a mixture of hidden domains. For online linear regression with oblivious adversaries, we provide a tight tradeoff that depends on the durations and dimensionalities of the hidden domains. Our algorithm can adaptively deal with interleaving spans of inputs from different domains. We also generalize our results to non-linear regression for hypothesis classes with bounded eluder dimension and adaptive adversaries. Experiments on synthetic and realistic datasets demonstrate that our algorithm achieves lower regret than uniform queries and greedy queries with equal labeling budget.

Differentially Private Online Submodular Maximization
Sebastian Perez-Salazar · Rachel Cummings
In this work we consider the problem of online submodular maximization under a cardinality constraint with differential privacy (DP). A stream of T submodular functions over a common finite ground set U arrives online, and at each time-step the decision maker must choose at most k elements of U before observing the function. The decision maker obtains a profit equal to the function evaluated on the chosen set and aims to learn a sequence of sets that achieves low expected regret. In the full-information setting, we develop an $(\varepsilon,\delta)$-DP algorithm with expected (1-1/e)-regret bound of $O( \frac{k^2\log |U|\sqrt{T \log k/\delta}}{\varepsilon} )$. This algorithm contains k ordered experts that learn the best marginal increments for each item over the whole time horizon while maintaining privacy of the functions. In the bandit setting, we provide an $(\varepsilon,\delta+ O(e^{-T^{1/3}}))$-DP algorithm with expected (1-1/e)-regret bound of $O( \frac{\sqrt{\log k/\delta}}{\varepsilon} (k (|U| \log |U|)^{1/3})^2 T^{2/3} )$. One challenge for privacy in this setting is that the payoff and feedback of expert i depends on the actions taken by her i-1 predecessors. This particular type of information leakage is not covered by post-processing, and new analysis is required. Our techniques for maintaining privacy with feedforward may …
Fast Statistical Leverage Score Approximation in Kernel Ridge Regression
Yifan Chen · Yun Yang

Nyström approximation is a fast randomized method that rapidly solves kernel ridge regression (KRR) problems through sub-sampling the n-by-n empirical kernel matrix appearing in the objective function. However, the performance of such a sub-sampling method heavily relies on correctly estimating the statistical leverage scores for forming the sampling distribution, which can be as costly as solving the original KRR. In this work, we propose a linear time (modulo poly-log terms) algorithm to accurately approximate the statistical leverage scores in the stationary-kernel-based KRR with theoretical guarantees. Particularly, by analyzing the first-order condition of the KRR objective, we derive an analytic formula, which depends on both the input distribution and the spectral density of stationary kernels, for capturing the non-uniformity of the statistical leverage scores. Numerical experiments demonstrate that with the same prediction accuracy our method is orders of magnitude more efficient than existing methods in selecting the representative sub-samples in the Nyström approximation.

Distributionally Robust Optimization for Deep Kernel Multiple Instance Learning
Hitesh Sapkota · Yiming Ying · Feng Chen · Qi Yu

Multiple Instance Learning (MIL) provides a promising solution to many real-world problems, where labels are only available at the bag level but missing for instances due to a high labeling cost. As a powerful Bayesian non-parametric model, Gaussian Processes (GP) have been extended from classical supervised learning to MIL settings, aiming to identify the most likely positive (or least negative) instance from a positive (or negative) bag using only the bag-level labels. However, solely focusing on a single instance in a bag makes the model less robust to outliers or multi-modal scenarios, where a single bag contains a diverse set of positive instances. We propose a general GP mixture framework that simultaneously considers multiple instances through a latent mixture model. By adding a top-k constraint, the framework is equivalent to choosing the top-k most positive instances, making it more robust to outliers and multimodal scenarios. We further introduce a Distributionally Robust Optimization (DRO) constraint that removes the limitation of specifying a fix k value. To ensure the prediction power over high-dimensional data (e.g., videos and images) that are common in MIL, we augment the GP kernel with fixed basis functions by using a deep neural network to learn adaptive basis …

Finding First-Order Nash Equilibria of Zero-Sum Games with the Regularized Nikaido-Isoda Function
Ioannis Tsaknakis · Mingyi Hong

Efficiently finding First-order Nash Equilibria (FNE) in zero-sum games can be challenging, even in a two-player setting. This work proposes an algorithm for finding the FNEs of a two-player zero-sum game, in which the local cost functions can be non-convex, and the players only have access to local stochastic gradients. The proposed approach is based on reformulating the problem of interest as minimizing the Regularized Nikaido-Isoda (RNI) function. We show that the global minima of the RNI correspond to the set of FNEs, and that for certain classes of non-convex games the RNI minimization problem becomes convex. Moreover, we introduce a first-order (stochastic) optimization method, and establish its convergence to a neighborhood of a stationary solution of the RNI objective. The key in the analysis is to properly control the bias between the local stochastic gradient and the true one. Although the RNI function has been used in analyzing convex games, to our knowledge, this is the first time that the properties of the RNI formulation have been exploited to find FNEs for non-convex games in a stochastic setting.

Efficient Methods for Structured Nonconvex-Nonconcave Min-Max Optimization
Jelena Diakonikolas · Constantinos Daskalakis · Michael Jordan
The use of min-max optimization in the adversarial training of deep neural network classifiers, and the training of generative adversarial networks has motivated the study of nonconvex-nonconcave optimization objectives, which frequently arise in these applications. Unfortunately, recent results have established that even approximate first-order stationary points of such objectives are intractable, even under smoothness conditions, motivating the study of min-max objectives with additional structure. We introduce a new class of structured nonconvex-nonconcave min-max optimization problems, proposing a generalization of the extragradient algorithm which provably converges to a stationary point. The algorithm applies not only to Euclidean spaces, but also to general $\ell_p$-normed finite-dimensional real vector spaces. We also discuss its stability under stochastic oracles and provide bounds on its sample complexity. Our iteration complexity and sample complexity bounds either match or improve the best known bounds for the same or less general nonconvex-nonconcave settings, such as those that satisfy variational coherence or in which a weak solution to the associated variational inequality problem is assumed to exist.
On the Linear Convergence of Policy Gradient Methods for Finite MDPs
Jalaj Bhandari · Daniel Russo

We revisit the finite time analysis of policy gradient methods in the one of the simplest settings: finite state and action MDPs with a policy class consisting of all stochastic policies and with exact gradient evaluations. There has been some recent work viewing this setting as an instance of smooth non-linear optimization problems, to show sub-linear convergence rates with small step-sizes. Here, we take a completely different perspective based on illuminating connections with policy iteration, to show how many variants of policy gradient algorithms succeed with large step-sizes and attain a linear rate of convergence.

Influence Decompositions For Neural Network Attribution
Kyle Reing · Greg Ver Steeg · Aram Galstyan

Methods of neural network attribution have emerged out of a necessity for explanation and accountability in the predictions of black-box neural models. Most approaches use a variation of sensitivity analysis, where individual input variables are perturbed and the downstream effects on some output metric are measured. We demonstrate that a number of critical functional properties are not revealed when only considering lower-order perturbations. Motivated by these shortcomings, we propose a general framework for decomposing the orders of influence that a collection of input variables has on an output classification. These orders are based on the cardinality of input subsets which are perturbed to yield a change in classification. This decomposition can be naturally applied to attribute which input variables rely on higher-order coordination to impact the classification decision. We demonstrate that our approach correctly identifies higher-order attribution on a number of synthetic examples. Additionally, we showcase the differences between attribution in our approach and existing approaches on benchmark networks for MNIST and ImageNet.

Learning GPLVM with arbitrary kernels using the unscented transformation
Daniel Augusto Ramos Macedo Antunes de Souza · Diego Mesquita · João Paulo Gomes · César Lincoln Mattos

Gaussian Process Latent Variable Model (GPLVM) is a flexible framework to handle uncertain inputs in Gaussian Processes (GPs) and incorporate GPs as components of larger graphical models. Nonetheless, the standard GPLVM variational inference approach is tractable only for a narrow family of kernel functions. The most popular implementations of GPLVM circumvent this limitation using quadrature methods, which may become a computational bottleneck even for relatively low dimensions. For instance, the widely employed Gauss-Hermite quadrature has exponential complexity on the number of dimensions. In this work, we propose using the unscented transformation instead. Overall, this method presents comparable, if not better, performance than off-the-shelf solutions to GPLVM, and its computational complexity scales only linearly on dimension. In contrast to Monte Carlo methods, our approach is deterministic and works well with quasi-Newton methods, such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. We illustrate the applicability of our method with experiments on dimensionality reduction and multistep-ahead prediction with uncertainty propagation.

On the Privacy Properties of GAN-generated Samples
Zinan Lin · Vyas Sekar · Giulia Fanti

The privacy implications of generative adversarial networks (GANs) are a topic of great interest, leading to several recent algorithms for training GANs with privacy guarantees. By drawing connections to the generalization properties of GANs, we prove that under some assumptions, GAN-generated samples inherently satisfy some (weak) privacy guarantees. First, we show that if a GAN is trained on m samples and used to generate n samples, the generated samples are (epsilon, delta)-differentially-private for (epsilon, delta) pairs where delta scales as O(n/m). We show that under some special conditions, this upper bound is tight. Next, we study the robustness of GAN-generated samples to membership inference attacks. We model membership inference as a hypothesis test in which the adversary must determine whether a given sample was drawn from the training dataset or from the underlying data distribution. We show that this adversary can achieve an area under the ROC curve that scales no better than O(m^{-1/4}).

Differentially Private Weighted Sampling
Edith Cohen · Ofir Geri · Tamas Sarlos · Uri Stemmer

Common datasets have the form of elements with keys (e.g., transactions and products) and the goal is to perform analytics on the aggregated form of key and frequency pairs. A weighted sample of keys by (a function of) frequency is a highly versatile summary that provides a sparse set of representative keys and supports approximate evaluations of query statistics. We propose private weighted sampling (PWS): A method that sanitizes a weighted sample as to ensure element-level differential privacy, while retaining its utility to the maximum extent possible. PWS maximizes the reporting probabilities of keys and estimation quality of a broad family of statistics. PWS improves over the state of the art even for the well-studied special case of private histograms, when no sampling is performed. We empirically observe significant performance gains of 20%-300% increase in key reporting for common Zipfian frequency distributions and accurate estimation with x2-8 lower frequencies. PWS is applied as a post-processing of a non-private sample, without requiring the original data. Therefore, it can be a seamless addition to existing implementations, such as those optimizes for distributed or streamed data. We believe that due to practicality and performance, PWS may become a method of choice in applications …

Convergence and Accuracy Trade-Offs in Federated Learning and Meta-Learning
Zachary Charles · Jakub Konečný

We study a family of algorithms, which we refer to as local update methods, generalizing many federated and meta-learning algorithms. We prove that for quadratic models, local update methods are equivalent to first-order optimization on a surrogate loss we exactly characterize. Moreover, fundamental algorithmic choices (such as learning rates) explicitly govern a trade-off between the condition number of the surrogate loss and its alignment with the true loss. We derive novel convergence rates showcasing these trade-offs and highlight their importance in communication-limited settings. Using these insights, we are able to compare local update methods based on their convergence/accuracy trade-off, not just their convergence to critical points of the empirical loss. Our results shed new light on a broad range of phenomena, including the efficacy of server momentum in federated learning and the impact of proximal client updates.

Misspecification in Prediction Problems and Robustness via Improper Learning
Annie Marsden · John Duchi · Gregory Valiant
We study probabilistic prediction games when the underlying model is misspecified, investigating the consequences of predicting using an incorrect parametric model. We show that for a broad class of loss functions and parametric families of distributions, the regret of playing a ``proper'' predictor---one from the putative model class---relative to the best predictor in the same model class has lower bound scaling at least as $\sqrt{\gamma n}$, where $\gamma$ is a measure of the model misspecification to the true distribution in terms of total variation distance. In contrast, using an aggregation-based (improper) learner, one can obtain regret $d \log n$ for any underlying generating distribution, where $d$ is the dimension of the parameter; we exhibit instances in which this is unimprovable even over the family of all learners that may play distributions in the convex hull of the parametric family. These results suggest that simple strategies for aggregating multiple learners together should be more robust, and several experiments conform to this hypothesis.
Sharp Analysis of a Simple Model for Random Forests
Jason Klusowski
Random forests have become an important tool for improving accuracy in regression and classification problems since their inception by Leo Breiman in 2001. In this paper, we revisit a historically important random forest model, called centered random forests, originally proposed by Breiman in 2004 and later studied by G\'erard Biau in 2012, where a feature is selected at random and the splits occurs at the midpoint of the node along the chosen feature. If the regression function is $d$-dimensional and Lipschitz, we show that, given access to $n$ observations, the mean-squared prediction error is $O((n(\log n)^{(d-1)/2})^{-\frac{1}{d\log2+1}})$. This positively answers an outstanding question of Biau about whether the rate of convergence for this random forest model could be improved beyond $O(n^{-\frac{1}{d(4/3)\log2+1}})$. Furthermore, by a refined analysis of the approximation and estimation errors for linear models, we show that our new rate cannot be improved in general. Finally, we generalize our analysis and improve current prediction error bounds for another random forest model, called median random forests, in which each tree is constructed from subsampled data and the splits are performed at the empirical median along a chosen feature.
On Learning Continuous Pairwise Markov Random Fields
Abhin Shah · Devavrat Shah · Gregory Wornell

We consider learning a sparse pairwise Markov Random Field (MRF) with continuous-valued variables from i.i.d samples. We adapt the algorithm of Vuffray et al. (2019) to this setting and provide finite-sample analysis revealing sample complexity scaling logarithmically with the number of variables, as in the discrete and Gaussian settings. Our approach is applicable to a large class of pairwise MRFs with continuous variables and also has desirable asymptotic properties, including consistency and normality under mild conditions. Further, we establish that the population version of the optimization criterion employed in Vuffray et al. (2019) can be interpreted as local maximum likelihood estimation (MLE). As part of our analysis, we introduce a robust variation of sparse linear regression a` la Lasso, which may be of interest in its own right.

Non-asymptotic Performance Guarantees for Neural Estimation of f-Divergences
Sreejith Sreekumar · Zhengxin Zhang · Ziv Goldfeld

Statistical distances (SDs), which quantify the dissimilarity between probability distributions, are central to machine learning and statistics. A modern method for estimating such distances from data relies on parametrizing a variational form by a neural network (NN) and optimizing it. These estimators are abundantly used in practice, but corresponding performance guarantees are partial and call for further exploration. In particular, there seems to be a fundamental tradeoff between the two sources of error involved: approximation and estimation. While the former needs the NN class to be rich and expressive, the latter relies on controlling complexity. This paper explores this tradeoff by means of non-asymptotic error bounds, focusing on three popular choices of SDs---Kullback-Leibler divergence, chi-squared divergence, and squared Hellinger distance. Our analysis relies on non-asymptotic function approximation theorems and tools from empirical process theory. Numerical results validating the theory are also provided.

Sampling in Combinatorial Spaces with SurVAE Flow Augmented MCMC
Priyank Jaini · Didrik Nielsen · Max Welling

Hybrid Monte Carlo is a powerful Markov Chain Monte Carlo method for sampling from complex continuous distributions. However, a major limitation of HMC is its inability to be applied to discrete domains due to the lack of gradient signal. In this work, we introduce a new approach based on augmenting monte carlo methods with SurVAE Flow to sample from discrete distributions using a combination of neural transport methods like normalizing flows, variational dequantization, and the Metropolis-Hastings rule. Our method first learns a continuous embedding of the discrete space using a surjective map and subsequently learns a bijective transformation from the continuous space to an approximately Gaussian distributed latent variable. Sampling proceeds by simulating MCMC chains in the latent space and mapping these samples to the target discrete space via the learned transformations. We demonstrate the efficacy of our algorithm on a range of examples from statistics, computational physics, and, machine learning, and observe improvements compared to alternative algorithms.

Alternating Direction Method of Multipliers for Quantization
Tianjian Huang · Prajwal Singhania · Maziar Sanjabi · Pabitra Mitra · Meisam Razaviyayn

Quantization of the parameters of machine learning models, such as deep neural networks, requires solving constrained optimization problems, where the constraint set is formed by the Cartesian product of many simple discrete sets. For such optimization problems, we study the performance of the Alternating Direction Method of Multipliers for Quantization (ADMM-Q) algorithm, which is a variant of the widely-used ADMM method applied to our discrete optimization problem. We establish the convergence of the iterates of ADMM-Q to certain stationary points. To the best of our knowledge, this is the first analysis of an ADMM-type method for problems with discrete variables/constraints. Based on our theoretical insights, we develop a few variants of ADMM-Q that can handle inexact update rules, and have improved performance via the use of "soft projection" and "injecting randomness to the algorithm". We empirically evaluate the efficacy of our proposed approaches.

Convergence of Gaussian-smoothed optimal transport distance with sub-gamma distributions and dependent samples
Yixing Zhang · Xiuyuan Cheng · Galen Reeves
The Gaussian-smoothed optimal transport (GOT) framework, recently proposed by Goldfeld et al., scales to high dimensions in estimation and provides an alternative to entropy regularization. This paper provides convergence guarantees for estimating the GOT distance under more general settings. For the Gaussian-smoothed $p$-Wasserstein distance in $d$ dimensions, our results require only the existence of a moment greater than $d + 2p$. For the special case of sub-gamma distributions, we quantify the dependence on the dimension $d$ and establish a phase transition with respect to the scale parameter. We also prove convergence for dependent samples, only requiring a condition on the pairwise dependence of the samples measured by the covariance of the feature map of a kernel space. A key step in our analysis is to show that the GOT distance is dominated by a family of kernel maximum mean discrepancy (MMD) distances with a kernel that depends on the cost function as well as the amount of Gaussian smoothing. This insight provides further interpretability for the GOT framework and also introduces a class of kernel MMD distances with desirable properties. The theoretical results are supported by numerical experiments.The Gaussian-smoothed optimal transport (GOT) framework, recently proposed by Goldfeld et al., scales …
Near-Optimal Provable Uniform Convergence in Offline Policy Evaluation for Reinforcement Learning
Ming Yin · Yu Bai · Yu-Xiang Wang
The problem of \emph{Offline Policy Evaluation} (OPE) in Reinforcement Learning (RL) is a critical step towards applying RL in real life applications. Existing work on OPE mostly focus on evaluating a \emph{fixed} target policy $\pi$, which does not provide useful bounds for offline policy learning as $\pi$ will then be data-dependent. We address this problem by \emph{simultaneously} evaluating all policies in a policy class $\Pi$ --- uniform convergence in OPE --- and obtain nearly optimal error bounds for a number of global / local policy classes. Our results imply that the model-based planning achieves an optimal episode complexity of $\widetilde{O}(H^3/d_m\epsilon^2)$ in identifying an $\epsilon$-optimal policy under the \emph{time-inhomogeneous episodic} MDP model ($H$ is the planning horizon, $d_m$ is a quantity that reflects the exploration of the logging policy $\mu$). To the best of our knowledge, this is the first time the optimal rate is shown to be possible for the offline RL setting and the paper is the first that systematically investigates the uniform convergence in OPE.
Online Model Selection for Reinforcement Learning with Function Approximation
Jonathan Lee · Aldo Pacchiano · Vidya Muthukumar · Weihao Kong · Emma Brunskill

Deep reinforcement learning has achieved impressive successes yet often requires a very large amount of interaction data. This result is perhaps unsurprising, as using complicated function approximation often requires more data to fit, and early theoretical results on linear Markov decision processes provide regret bounds that scale with the dimension of the linear approximation. Ideally, we would like to automatically identify the minimal dimension of the approximation that is sufficient to encode an optimal policy. Towards this end, we consider the problem of model selection in RL with function approximation, given a set of candidate RL algorithms with known regret guarantees. The learner's goal is to adapt to the complexity of the optimal algorithm without knowing it a priori. We present a meta-algorithm that successively rejects increasingly complex models using a simple statistical test. Given at least one candidate that satisfies realizability, we prove the meta-algorithm adapts to the optimal complexity with regret that is only marginally suboptimal in the number of episodes and number of candidate algorithms. The dimension and horizon dependencies remain optimal with respect to the best candidate, and our meta-algorithmic approach is flexible to incorporate multiple candidate algorithms and models. Finally, we show that the meta-algorithm …

Animal pose estimation from video data with a hierarchical von Mises-Fisher-Gaussian model
Libby Zhang · Tim Dunn · Jesse Marshall · Bence Olveczky · Scott Linderman

Animal pose estimation from video data is an important step in many biological studies, but current methods struggle in complex environments where occlusions are common and training data is scarce. Recent work has demonstrated improved accuracy with deep neural networks, but these methods often do not incorporate prior distributions that could improve localization. Here we present GIMBAL: a hierarchical von Mises-Fisher-Gaussian model that improves upon deep networks' estimates by leveraging spatiotemporal constraints. The spatial constraints come from the animal's skeleton, which induces a curved manifold of keypoint configurations. The temporal constraints come from the postural dynamics, which govern how angles between keypoints change over time. Importantly, the conditional conjugacy of the model permits simple and efficient Bayesian inference algorithms. We assess the model on a unique experimental dataset with video of a freely-behaving rodent from multiple viewpoints and ground-truth motion capture data for 20 keypoints. GIMBAL extends existing techniques, and in doing so offers more accurate estimates of keypoint positions, especially in challenging contexts.

Density of States Estimation for Out of Distribution Detection
Warren Morningstar · Cusuh Ham · Andrew Gallagher · Balaji Lakshminarayanan · Alex Alemi · Joshua Dillon

Perhaps surprisingly, recent studies have shown probabilistic model likelihoods have poor specificity for out-of-distribution (OOD) detection and often assign higher likelihoods to OOD data than in-distribution data. To ameliorate this issue we propose DoSE, the density of states estimator. Drawing on the statistical physics notion of density of states,'' the DoSE decision rule avoids direct comparison of model probabilities, and instead utilizes theprobability of the model probability,'' or indeed the frequency of any reasonable statistic. The frequency is calculated using nonparametric density estimators (e.g., KDE and one-class SVM) which measure the typicality of various model statistics given the training data and from which we can flag test points with low typicality as anomalous. Unlike many other methods, DoSE requires neither labeled data nor OOD examples. DoSE is modular and can be trivially applied to any existing, trained model. We demonstrate DoSE's state-of-the-art performance against other unsupervised OOD detectors on previously established ``hard'' benchmarks.

A constrained risk inequality for general losses
John Duchi · Feng Ruan
We provide a general constrained risk inequality that applies to arbitrary non-decreasing losses, extending a result of Brown and Low [\emph{Ann.~Stat.~1996}]. Given two distributions $P_0$ and $P_1$, we find a lower bound for the risk of estimating a parameter $\theta(P_1)$ under $P_1$ given an upper bound on the risk of estimating the parameter $\theta(P_0)$ under $P_0$. The inequality is a useful pedagogical tool, as its proof relies only on the Cauchy-Schwartz inequality, it applies to general losses, and it transparently gives risk lower bounds on super-efficient and adaptive estimators.
Accumulations of Projections—A Unified Framework for Random Sketches in Kernel Ridge Regression
Yifan Chen · Yun Yang

Building a sketch of an n-by-n empirical kernel matrix is a common approach to accelerate the computation of many kernel methods. In this paper, we propose a unified framework of constructing sketching methods in kernel ridge regression (KRR), which views the sketching matrix S as an accumulation of m rescaled sub-sampling matrices with independent columns. Our framework incorporates two commonly used sketching methods, sub-sampling sketches (known as the Nyström method) and sub-Gaussian sketches, as special cases with m=1 and m=infinity respectively. Under the new framework, we provide a unified error analysis of sketching approximation and show that our accumulation scheme improves the low accuracy of sub-sampling sketches when certain incoherence characteristic is high, and accelerates the more accurate but computationally heavier sub-Gaussian sketches. By optimally choosing the number m of accumulations, we show that a best trade-off between computational efficiency and statistical accuracy can be achieved. In practice, the sketching method can be as efficiently implemented as the sub-sampling sketches, as only minor extra matrix additions are needed. Our empirical evaluations also demonstrate that the proposed method may attain the accuracy close to sub-Gaussian sketches, while is as efficient as sub-sampling-based sketches.

Active Learning under Label Shift
Eric Zhao · Anqi Liu · Animashree Anandkumar · Yisong Yue

We address the problem of active learning under label shift: when the class proportions of source and target domains differ. We introduce a "medial distribution" to incorporate a tradeoff between importance weighting and class-balanced sampling and propose their combined usage in active learning. Our method is known as Mediated Active Learning under Label Shift (MALLS). It balances the bias from class-balanced sampling and the variance from importance weighting. We prove sample complexity and generalization guarantees for MALLS which show active learning reduces asymptotic sample complexity even under arbitrary label shift. We empirically demonstrate MALLS scales to high-dimensional datasets and can reduce the sample complexity of active learning by 60% in deep active learning tasks.

Competing AI: How does competition feedback affect machine learning?
Tony Ginart · Eva Zhang · Yongchan Kwon · James Zou

This papers studies how competition affects machine learning (ML) predictors. As ML becomes more ubiquitous, it is often deployed by companies to compete over customers. For example, digital platforms like Yelp use ML to predict user preference and make recommendations. A service that is more often queried by users, perhaps because it more accurately anticipates user preferences, is also more likely to obtain additional user data (e.g. in the form of a Yelp review). Thus, competing predictors cause feedback loops whereby a predictor's performance impacts what training data it receives and biases its predictions over time. We introduce a flexible model of competing ML predictors that enables both rapid experimentation and theoretical tractability. We show with empirical and mathematical analysis that competition causes predictors to specialize for specific sub-populations at the cost of worse performance over the general population. We further analyze the impact of predictor specialization on the overall prediction quality experienced by users. We show that having too few or too many competing predictors in a market can hurt the overall prediction quality. Our theory is complemented by experiments on several real datasets using popular learning algorithms, such as neural networks and nearest neighbor methods.

Random Coordinate Underdamped Langevin Monte Carlo
Zhiyan Ding · Qin Li · Jianfeng Lu · Stephen Wright

The Underdamped Langevin Monte Carlo (ULMC) is a popular Markov chain Monte Carlo sampling method. It requires the computation of the full gradient of the log-density at each iteration, an expensive operation if the dimension of the problem is high. We propose a sampling method called Random Coordinate ULMC (RC-ULMC), which selects a single coordinate at each iteration to be updated and leaves the other coordinates untouched. We investigate the computational complexity of RC-ULMC and compare it with the classical ULMC for strongly log-concave probability distributions. We show that RC-ULMC is always cheaper than the classical ULMC, with a significant cost reduction when the problem is highly skewed and high dimensional. Our complexity bound for RC-ULMC is also tight in terms of dimension dependence.

Differentiable Greedy Algorithm for Monotone Submodular Maximization: Guarantees, Gradient Estimators, and Applications
Shinsaku Sakaue
Motivated by, e.g., sensitivity analysis and end-to-end learning, the demand for differentiable optimization algorithms has been increasing. This paper presents a theoretically guaranteed differentiable greedy algorithm for monotone submodular function maximization. We smooth the greedy algorithm via randomization, and prove that it almost recovers original approximation guarantees in expectation for the cases of cardinality and $\kappa$-extendible system constraints. We then present how to efficiently compute gradient estimators of any expected output-dependent quantities. We demonstrate the usefulness of our method by instantiating it for various applications.
Differentiable Causal Discovery Under Unmeasured Confounding
Rohit Bhattacharya · Tushar Nagarajan · Daniel Malinsky · Ilya Shpitser

The data drawn from biological, economic, and social systems are often confounded due to the presence of unmeasured variables. Prior work in causal discovery has focused on discrete search procedures for selecting acyclic directed mixed graphs (ADMGs), specifically ancestral ADMGs, that encode ordinary conditional independence constraints among the observed variables of the system. However, confounded systems also exhibit more general equality restrictions that cannot be represented via these graphs, placing a limit on the kinds of structures that can be learned using ancestral ADMGs. In this work, we derive differentiable algebraic constraints that fully characterize the space of ancestral ADMGs, as well as more general classes of ADMGs, arid ADMGs and bow-free ADMGs, that capture all equality restrictions on the observed variables. We use these constraints to cast causal discovery as a continuous optimization problem and design differentiable procedures to find the best fitting ADMG when the data comes from a confounded linear system of equations with correlated errors. We demonstrate the efficacy of our method through simulations and application to a protein expression dataset. Code implementing our methods is open-source and publicly available at https://gitlab.com/rbhatta8/dcd and will be incorporated into the Ananke package.

Algorithms for Fairness in Sequential Decision Making
Min Wen · Osbert Bastani · Ufuk Topcu

It has recently been shown that if feedback effects of decisions are ignored, then imposing fairness constraints such as demographic parity or equality of opportunity can actually exacerbate unfairness. We propose to address this challenge by modeling feedback effects as Markov decision processes (MDPs). First, we propose analogs of fairness properties for the MDP setting. Second, we propose algorithms for learning fair decision-making policies for MDPs. Finally, we demonstrate the need to account for dynamical effects using simulations on a loan applicant MDP.

Minimax Optimal Regression over Sobolev Spaces via Laplacian Regularization on Neighborhood Graphs
Alden Green · Sivaraman Balakrishnan · Ryan Tibshirani
In this paper we study the statistical properties of Laplacian smoothing, a graph-based approach to nonparametric regression. Under standard regularity conditions, we establish upper bounds on the error of the Laplacian smoothing estimator \smash{$\widehat{f}$}, and a goodness-of-fit test also based on \smash{$\widehat{f}$}. These upper bounds match the minimax optimal estimation and testing rates of convergence over the first-order Sobolev class $H^1(\mathcal{X})$, for $\mathcal{X} \subseteq \mathbb{R}^d$ and $1 \leq d < 4$; in the estimation problem, for $d = 4$, they are optimal modulo a $\log n$ factor. Additionally, we prove that Laplacian smoothing is manifold-adaptive: if $\mathcal{X} \subseteq \mathbb{R}^d$ is an $m$-dimensional manifold with $m < d$, then the error rate of Laplacian smoothing (in either estimation or testing) depends only on $m$, in the same way it would if $\mathcal{X}$ were a full-dimensional set in $\mathbb{R}^m$.
The Base Measure Problem and its Solution
Alexey Radul · Boris Alexeev

Probabilistic programming systems generally compute with probability density functions, leaving the base measure of each such function implicit. This mostly works, but creates problems when densities with respect to different base measures are accidentally combined or compared. Mistakes also happen when computing volume corrections for continuous changes of variables, which in general depend on the support measure. We motivate and clarify the problem in the context of a composable library of probability distributions and bijective transformations. We solve the problem by standardizing on Hausdorff measure as a base, and deriving formulas for comparing and combining mixed-dimension densities, as well as updating densities with respect to Hausdorff measure under diffeomorphic transformations. We also propose a software architecture that implements these formulas efficiently in the common case. We hope that by adopting our solution, probabilistic programming systems can become more robust and general, and make a broader class of models accessible to practitioners.

Sample Elicitation
Jiaheng Wei · Zuyue Fu · Yang Liu · Xingyu Li · Zhuoran Yang · Zhaoran Wang
It is important to collect credible training samples $(x,y)$ for building data-intensive learning systems (e.g., a deep learning system). Asking people to report complex distribution $p(x)$, though theoretically viable, is challenging in practice. This is primarily due to the cognitive loads required for human agents to form the report of this highly complicated information. While classical elicitation mechanisms apply to eliciting a complex and generative (and continuous) distribution $p(x)$, we are interested in eliciting samples $x_i \sim p(x)$ from agents directly. We coin the above problem sample elicitation. This paper introduces a deep learning aided method to incentivize credible sample contributions from self-interested and rational agents. We show that with an accurate estimation of a certain $f$-divergence function we can achieve approximate incentive compatibility in eliciting truthful samples. We then present an efficient estimator with theoretical guarantees via studying the variational forms of the $f$-divergence function. We also show a connection between this sample elicitation problem and $f$-GAN, and how this connection can help reconstruct an estimator of the distribution based on collected samples. Experiments on synthetic data, MNIST, and CIFAR-10 datasets demonstrate that our mechanism elicits truthful samples. Our implementation is available at https://github.com/weijiaheng/Credible-sample-elicitation.git.
Bayesian Coresets: Revisiting the Nonconvex Optimization Perspective
Yibo Zhang · Rajiv Khanna · Anastasios Kyrillidis · Sanmi Koyejo

Bayesian coresets have emerged as a promising approach for implementing scalable Bayesian inference. The Bayesian coreset problem involves selecting a (weighted) subset of the data samples, such that the posterior inference using the selected subset closely approximates the posterior inference using the full dataset. This manuscript revisits Bayesian coresets through the lens of sparsity constrained optimization. Leveraging recent advances in accelerated optimization methods, we propose and analyze a novel algorithm for coreset selection. We provide explicit convergence rate guarantees and present an empirical evaluation on a variety of benchmark datasets to highlight our proposed algorithm's superior performance compared to state-of-the-art on speed and accuracy.

The Sample Complexity of Meta Sparse Regression
Zhanyu Wang · Jean Honorio
This paper addresses the meta-learning problem in sparse linear regression with infinite tasks. We assume that the learner can access several similar tasks. The goal of the learner is to transfer knowledge from the prior tasks to a similar but novel task. For $p$ parameters, size of the support set $k$, and $l$ samples per task, we show that $T \in O((k \log (p-k)) / l)$ tasks are sufficient in order to recover the common support of all tasks. With the recovered support, we can greatly reduce the sample complexity for estimating the parameter of the novel task, i.e., $l \in O(1)$ with respect to $T$ and $p$. We also prove that our rates are minimax optimal. A key difference between meta-learning and the classical multi-task learning, is that meta-learning focuses only on the recovery of the parameters of the novel task, while multi-task learning estimates the parameter of all tasks, which requires $l$ to grow with $T$. Instead, our efficient meta-learning estimator allows for $l$ to be constant with respect to $T$ (i.e., few-shot learning).
Nonlinear Projection Based Gradient Estimation for Query Efficient Blackbox Attacks
Huichen Li · Linyi Li · Xiaojun Xu · Xiaolu Zhang · Shuang Yang · Bo Li

Gradient estimation and vector space projection have been studied as two distinct topics. We aim to bridge the gap between the two by investigating how to efficiently estimate gradient based on a projected low-dimensional space. We first provide lower and upper bounds for gradient estimation under both linear and nonlinear projections, and outline checkable sufficient conditions under which one is better than the other. Moreover, we analyze the query complexity for the projection-based gradient estimation and present a sufficient condition for query-efficient estimators. Built upon our theoretic analysis, we propose a novel query-efficient Nonlinear Gradient Projection-based Boundary Blackbox Attack (NonLinear-BA). We conduct extensive experiments on four image datasets: ImageNet, CelebA, CIFAR-10, and MNIST, and show the superiority of the proposed methods compared with the state-of-the-art baselines. In particular, we show that the projection-based boundary blackbox attacks are able to achieve much smaller magnitude of perturbations with 100% attack success rate based on efficient queries. Both linear and nonlinear projections demonstrate their advantages under different conditions. We also evaluate NonLinear-BA against the commercial online API MEGVII Face++, and demonstrate the high blackbox attack performance both quantitatively and qualitatively. The code is publicly available at https://github.com/AI-secure/NonLinear-BA.

Right Decisions from Wrong Predictions: A Mechanism Design Alternative to Individual Calibration
Shengjia Zhao · Stefano Ermon

Decision makers often need to rely on imperfect probabilistic forecasts. While average performance metrics are typically available, it is difficult to assess the quality of individual forecasts and the corresponding utilities. To convey confidence about individual predictions to decision-makers, we propose a compensation mechanism ensuring that the forecasted utility matches the actually accrued utility. While a naive scheme to compensate decision-makers for prediction errors can be exploited and might not be sustainable in the long run, we propose a mechanism based on fair bets and online learning that provably cannot be exploited. We demonstrate an application showing how passengers could confidently optimize individual travel plans based on flight delay probabilities estimated by an airline.

SONIA: A Symmetric Blockwise Truncated Optimization Algorithm
Majid Jahani · MohammadReza Nazari · Rachael Tappenden · Albert Berahas · Martin Takac

This work presents a new optimization algorithm for empirical risk minimization. The algorithm bridges the gap between first- and second-order methods by computing a search direction that uses a second-order-type update in one subspace, coupled with a scaled steepest descent step in the orthogonal complement. To this end, partial curvature information is incorporated to help with ill-conditioning, while simultaneously allowing the algorithm to scale to the large problem dimensions often encountered in machine learning applications. Theoretical results are presented to confirm that the algorithm converges to a stationary point in both the strongly convex and nonconvex cases. A stochastic variant of the algorithm is also presented, along with corresponding theoretical guarantees. Numerical results confirm the strengths of the new approach on standard machine learning problems.

Online Robust Control of Nonlinear Systems with Large Uncertainty
Dimitar Ho · Hoang Le · John Doyle · Yisong Yue

Robust control is a core approach for controlling systems with performance guarantees that are robust to modeling error, and is widely used in real-world systems. However, current robust control approaches can only handle small system uncertainty, and thus require significant effort in system identification prior to controller design. We present an online approach that robustly controls a nonlinear system under large model uncertainty. Our approach is based on decomposing the problem into two sub-problems, robust control design'' (which assumes small model uncertainty) andchasing consistent models'', which can be solved using existing tools from control theory and online learning, respectively. We provide a learning convergence analysis that yields a finite mistake bound on the number of times performance requirements are not met and can provide strong safety guarantees, by bounding the worst-case state deviation. To the best of our knowledge, this is the first approach for online robust control of nonlinear systems with such learning theoretic and safety guarantees. We also show how to instantiate this framework for general robotic systems, demonstrating the practicality of our approach.

Shapley Flow: A Graph-based Approach to Interpreting Model Predictions
Jiaxuan Wang · Jenna Wiens · Scott Lundberg

Many existing approaches for estimating feature importance are problematic because they ignore or hide dependencies among features. A causal graph, which encodes the relationships among input variables, can aid in assigning feature importance. However, current approaches that assign credit to nodes in the causal graph fail to explain the entire graph. In light of these limitations, we propose Shapley Flow, a novel approach to interpreting machine learning models. It considers the entire causal graph, and assigns credit to edges instead of treating nodes as the fundamental unit of credit assignment. Shapley Flow is the unique solution to a generalization of the Shapley value axioms for directed acyclic graphs. We demonstrate the benefit of using Shapley Flow to reason about the impact of a model's input on its output. In addition to maintaining insights from existing approaches, Shapley Flow extends the flat, set-based, view prevalent in game theory based explanation methods to a deeper, graph-based, view. This graph-based view enables users to understand the flow of importance through a system, and reason about potential interventions.

Improving Classifier Confidence using Lossy Label-Invariant Transformations
Sooyong Jang · Insup Lee · James Weimer

Providing reliable model uncertainty estimates is imperative to enabling robust decision making by autonomous agents and humans alike. While recently there have been significant advances in confidence calibration for trained models, examples with poor calibration persist in most calibrated models. Consequently, multiple techniques have been proposed that leverage label-invariant transformations of the input (i.e., an input manifold) to improve worst-case confidence calibration. However, manifold-based confidence calibration techniques generally do not scale and/or require expensive retraining when applied to models with large input spaces (e.g., ImageNet). In this paper, we present the recursive lossy label-invariant calibration (ReCal) technique that leverages label-invariant transformations of the input that induce a loss of discriminatory information to recursively group (and calibrate) inputs – without requiring model retraining. We show that ReCal outperforms other calibration methods on multiple datasets, especially, on large-scale datasets such as ImageNet.

Associative Convolutional Layers
Hamed Omidvar · Vahideh Akhlaghi · Hao Su · Massimo Franceschetti · Rajesh Gupta
We provide a general and easy to implement method for reducing the number of parameters of Convolutional Neural Networks (CNNs) during the training and inference phases. We introduce a simple trainable auxiliary neural network which can generate approximate versions of ``slices'' of the sets of convolutional filters of any CNN architecture from a low dimensional ``code'' space. These slices are then concatenated to form the sets of filters in the CNN architecture. The auxiliary neural network, which we call “Convolutional Slice Generator” (CSG), is unique to the network and provides the association among its convolutional layers. We apply our method to various CNN architectures including ResNet, DenseNet, MobileNet and ShuffleNet. Experiments on CIFAR-10 and ImageNet-1000, without any hyper-parameter tuning, show that our approach reduces the network parameters by approximately $2\times$ while the reduction in accuracy is confined to within one percent and sometimes the accuracy even improves after compression. Interestingly, through our experiments, we show that even when the CSG takes random binary values for its weights that are not learned, still acceptable performances are achieved. To show that our approach generalizes to other tasks, we apply it to an image segmentation architecture, Deeplab V3, on the Pascal VOC 2012 …
An Optimal Reduction of TV-Denoising to Adaptive Online Learning
Dheeraj Baby · Xuandong Zhao · Yu-Xiang Wang
We consider the problem of estimating a function from $n$ noisy samples whose discrete Total Variation (TV) is bounded by $C_n$. We reveal a deep connection to the seemingly disparate problem of \emph{Strongly Adaptive} online learning [Daniely et al 2015] and provide an $O(n \log n)$ time algorithm that attains the near minimax optimal rate of $\tilde O (n^{1/3}C_n^{2/3})$ under squared error loss. The resulting algorithm runs online and optimally \emph{adapts} to the \emph{unknown} smoothness parameter $C_n$. This leads to a new and more versatile alternative to wavelets-based methods for (1) adaptively estimating TV bounded functions; (2) online forecasting of TV bounded trends in time series.
TenIPS: Inverse Propensity Sampling for Tensor Completion
Chengrun Yang · Lijun Ding · Ziyang Wu · Madeleine Udell

Tensors are widely used to represent multiway arrays of data. The recovery of missing entries in a tensor has been extensively studied, generally under the assumption that entries are missing completely at random (MCAR). However, in most practical settings, observations are missing not at random (MNAR): the probability that a given entry is observed (also called the propensity) may depend on other entries in the tensor or even on the value of the missing entry. In this paper, we study the problem of completing a partially observed tensor with MNAR observations, without prior information about the propensities. To complete the tensor, we assume that both the original tensor and the tensor of propensities have low multilinear rank. The algorithm first estimates the propensities using a convex relaxation and then predicts missing values using a higher-order SVD approach, reweighting the observed tensor by the inverse propensities. We provide finite-sample error bounds on the resulting complete tensor. Numerical experiments demonstrate the effectiveness of our approach.

Revisiting Model-Agnostic Private Learning: Faster Rates and Active Learning
Chong Liu · Yuqing Zhu · Kamalika Chaudhuri · Yu-Xiang Wang

The Private Aggregation of Teacher Ensembles (PATE) framework is one of the most promising recent approaches in differentially private learning. Existing theoretical analysis shows that PATE consistently learns any VC-classes in the realizable setting, but falls short in explaining its success in more general cases where the error rate of the optimal classifier is bounded away from zero. We fill in this gap by introducing the Tsybakov Noise Condition (TNC) and establish stronger and more interpretable learning bounds. These bounds provide new insights into when PATE works and improve over existing results even in the narrower realizable setting. We also investigate the compelling idea of using active learning for saving privacy budget. The novel components in the proofs include a more refined analysis of the majority voting classifier --- which could be of independent interest --- and an observation that the synthetic ``student'' learning problem is nearly realizable by construction under the Tsybakov noise condition.

Accelerating Metropolis-Hastings with Lightweight Inference Compilation
Feynman Liang · Nimar Arora · Nazanin Tehrani · Yucen Li · Michael Tingley · Erik Meijer

In order to construct accurate proposers for Metropolis-Hastings Markov Chain Monte Carlo, we integrate ideas from probabilistic graphical models and neural networks in an open-source framework we call Lightweight Inference Compilation (LIC). LIC implements amortized inference within an open-universe declarative probabilistic programming language (PPL). Graph neural networks are used to parameterize proposal distributions as functions of Markov blankets, which during ``compilation'' are optimized to approximate single-site Gibbs sampling distributions. Unlike prior work in inference compilation (IC), LIC forgoes importance sampling of linear execution traces in favor of operating directly on Bayesian networks. Through using a declarative PPL, the Markov blankets of nodes (which may be non-static) are queried at inference-time to produce proposers Experimental results show LIC can produce proposers which have less parameters, greater robustness to nuisance random variables, and improved posterior sampling in a Bayesian logistic regression and n-schools inference application.

Fast Adaptation with Linearized Neural Networks
Wesley Maddox · Shuai Tang · Pablo Moreno · Andrew Gordon Wilson · Andreas Damianou

The inductive biases of trained neural networks are difficult to understand and, consequently, to adapt to new settings. We study the inductive biases of linearizations of neural networks, which we show to be surprisingly good summaries of the full network functions. Inspired by this finding, we propose a technique for embedding these inductive biases into Gaussian processes through a kernel designed from the Jacobian of the network. In this setting, domain adaptation takes the form of interpretable posterior inference, with accompanying uncertainty estimation. This inference is analytic and free of local optima issues found in standard techniques such as fine-tuning neural network weights to a new task. We develop significant computational speed-ups based on matrix multiplies, including a novel implementation for scalable Fisher vector products. Our experiments on both image classification and regression demonstrate the promise and convenience of this framework for transfer learning, compared to neural network fine-tuning.


Oral: Bandits, Reinforcement Learning / Optimization Tue 13 Apr 04:15 p.m.  

Federated Multi-armed Bandits with Personalization
Chengshuai Shi · Cong Shen · Jing Yang

A general framework of personalized federated multi-armed bandits (PF-MAB) is proposed, which is a new bandit paradigm analogous to the federated learning (FL) framework in supervised learning and enjoys the features of FL with personalization. Under the PF-MAB framework, a mixed bandit learning problem that flexibly balances generalization and personalization is studied. A lower bound analysis for the mixed model is presented. We then propose the Personalized Federated Upper Confidence Bound (PF-UCB) algorithm, where the exploration length is chosen carefully to achieve the desired balance of learning the local model and supplying global information for the mixed learning objective. Theoretical analysis proves that PF-UCB achieves an O(log(T)) regret regardless of the degree of personalization, and has a similar instance dependency as the lower bound. Experiments using both synthetic and real-world datasets corroborate the theoretical analysis and demonstrate the effectiveness of the proposed algorithm.

Near-Optimal Provable Uniform Convergence in Offline Policy Evaluation for Reinforcement Learning
Ming Yin · Yu Bai · Yu-Xiang Wang
The problem of \emph{Offline Policy Evaluation} (OPE) in Reinforcement Learning (RL) is a critical step towards applying RL in real life applications. Existing work on OPE mostly focus on evaluating a \emph{fixed} target policy $\pi$, which does not provide useful bounds for offline policy learning as $\pi$ will then be data-dependent. We address this problem by \emph{simultaneously} evaluating all policies in a policy class $\Pi$ --- uniform convergence in OPE --- and obtain nearly optimal error bounds for a number of global / local policy classes. Our results imply that the model-based planning achieves an optimal episode complexity of $\widetilde{O}(H^3/d_m\epsilon^2)$ in identifying an $\epsilon$-optimal policy under the \emph{time-inhomogeneous episodic} MDP model ($H$ is the planning horizon, $d_m$ is a quantity that reflects the exploration of the logging policy $\mu$). To the best of our knowledge, this is the first time the optimal rate is shown to be possible for the offline RL setting and the paper is the first that systematically investigates the uniform convergence in OPE.
Provably Efficient Safe Exploration via Primal-Dual Policy Optimization
Dongsheng Ding · Xiaohan Wei · Zhuoran Yang · Zhaoran Wang · Mihailo Jovanovic
We study the safe reinforcement learning problem using the constrained Markov decision processes in which an agent aims to maximize the expected total reward subject to a safety constraint on the expected total value of a utility function. We focus on an episodic setting with the function approximation where the Markov transition kernels have a linear structure but do not impose any additional assumptions on the sampling model. Designing safe reinforcement learning algorithms with provable computational and statistical efficiency is particularly challenging under this setting because of the need to incorporate both the safety constraint and the function approximation into the fundamental exploitation/exploration tradeoff. To this end, we present an \underline{O}ptimistic \underline{P}rimal-\underline{D}ual Proximal Policy \underline{OP}timization \mbox{(OPDOP)} algorithm where the value function is estimated by combining the least-squares policy evaluation and an additional bonus term for safe exploration. We prove that the proposed algorithm achieves an $\tilde{O}(d H^{2.5}\sqrt{T})$ regret and an $\tilde{O}(d H^{2.5}\sqrt{T})$ constraint violation, where $d$ is the dimension of the feature mapping, $H$ is the horizon of each episode, and $T$ is the total number of steps. These bounds hold when the reward/utility functions are fixed but the feedback after each episode is bandit. Our bounds depend on the …
Bayesian Coresets: Revisiting the Nonconvex Optimization Perspective
Yibo Zhang · Rajiv Khanna · Anastasios Kyrillidis · Sanmi Koyejo

Bayesian coresets have emerged as a promising approach for implementing scalable Bayesian inference. The Bayesian coreset problem involves selecting a (weighted) subset of the data samples, such that the posterior inference using the selected subset closely approximates the posterior inference using the full dataset. This manuscript revisits Bayesian coresets through the lens of sparsity constrained optimization. Leveraging recent advances in accelerated optimization methods, we propose and analyze a novel algorithm for coreset selection. We provide explicit convergence rate guarantees and present an empirical evaluation on a variety of benchmark datasets to highlight our proposed algorithm's superior performance compared to state-of-the-art on speed and accuracy.


Oral: Theory and Practice of Machine Learning Tue 13 Apr 05:15 p.m.  

Entropy Partial Transport with Tree Metrics: Theory and Practice
Tam Le · Truyen Nguyen

Optimal transport (OT) theory provides powerful tools to compare probability measures. However, OT is limited to nonnegative measures having the same mass, and suffers serious drawbacks about its computation and statistics. This leads to several proposals of regularized variants of OT in the recent literature. In this work, we consider an entropy partial transport (EPT) problem for nonnegative measures on a tree having different masses. The EPT is shown to be equivalent to a standard complete OT problem on a one-node extended tree. We derive its dual formulation, then leverage this to propose a novel regularization for EPT which admits fast computation and negative definiteness. To our knowledge, the proposed regularized EPT is the first approach that yields a closed-form solution among available variants of unbalanced OT for general nonnegative measures. For practical applications without prior knowledge about the tree structure for measures, we propose tree-sliced variants of the regularized EPT, computed by averaging the regularized EPT between these measures using random tree metrics, built adaptively from support data points. Exploiting the negative definiteness of our regularized EPT, we introduce a positive definite kernel, and evaluate it against other baselines on benchmark tasks such as document classification with word embedding …

Independent Innovation Analysis for Nonlinear Vector Autoregressive Process
Hiroshi Morioka · Hermanni Hälvä · Aapo Hyvarinen

The nonlinear vector autoregressive (NVAR) model provides an appealing framework to analyze multivariate time series obtained from a nonlinear dynamical system. However, the innovation (or error), which plays a key role by driving the dynamics, is almost always assumed to be additive. Additivity greatly limits the generality of the model, hindering analysis of general NVAR processes which have nonlinear interactions between the innovations. Here, we propose a new general framework called independent innovation analysis (IIA), which estimates the innovations from completely general NVAR. We assume mutual independence of the innovations as well as their modulation by an auxiliary variable (which is often taken as the time index and simply interpreted as nonstationarity). We show that IIA guarantees the identifiability of the innovations with arbitrary nonlinearities, up to a permutation and component-wise invertible nonlinearities. We also propose three estimation frameworks depending on the type of the auxiliary variable. We thus provide the first rigorous identifiability result for general NVAR, as well as very general tools for learning such models.

Beyond Marginal Uncertainty: How Accurately can Bayesian Regression Models Estimate Posterior Predictive Correlations?
Chaoqi Wang · Shengyang Sun · Roger Grosse

While uncertainty estimation is a well-studied topic in deep learning, most such work focuses on marginal uncertainty estimates, i.e. the predictive mean and variance at individual input locations. But it is often more useful to estimate predictive correlations between the function values at different input locations. In this paper, we consider the problem of benchmarking how accurately Bayesian models can estimate predictive correlations. We first consider a downstream task which depends on posterior predictive correlations: transductive active learning (TAL). We find that TAL makes better use of models' uncertainty estimates than ordinary active learning, and recommend this as a benchmark for evaluating Bayesian models. Since TAL is too expensive and indirect to guide development of algorithms, we introduce two metrics which more directly evaluate the predictive correlations and which can be computed efficiently: meta-correlations (i.e. the correlations between the models correlation estimates and the true values), and cross-normalized likelihoods (XLL). We validate these metrics by demonstrating their consistency with TAL performance and obtain insights about the relative performance of current Bayesian neural net and Gaussian process models.

A Variational Information Bottleneck Approach to Multi-Omics Data Integration
Changhee Lee · Mihaela van der Schaar

Integration of data from multiple omics techniques is becoming increasingly important in biomedical research. Due to non-uniformity and technical limitations in omics platforms, such integrative analyses on multiple omics, which we refer to as views, involve learning from incomplete observations with various view-missing patterns. This is challenging because i) complex interactions within and across observed views need to be properly addressed for optimal predictive power and ii) observations with various view-missing patterns need to be flexibly integrated. To address such challenges, we propose a deep variational information bottleneck (IB) approach for incomplete multi-view observations. Our method applies the IB framework on marginal and joint representations of the observed views to focus on intra-view and inter-view interactions that are relevant for the target. Most importantly, by modeling the joint representations as a product of marginal representations, we can efficiently learn from observed views with various view-missing patterns. Experiments on real-world datasets show that our method consistently achieves gain from data integration and outperforms state-of-the-art benchmarks.


Poster Session 2 Tue 13 Apr 06:30 p.m.  

Understanding the wiring evolution in differentiable neural architecture search
Sirui Xie · Shoukang Hu · Xinjiang Wang · Chunxiao Liu · Jianping Shi · Xunying Liu · Dahua Lin

Controversy exists on whether differentiable neural architecture search methods discover wiring topology effectively. To understand how wiring topology evolves, we study the underlying mechanism of several existing differentiable NAS frameworks. Our investigation is motivated by three observed searching patterns of differentiable NAS: 1) they search by growing instead of pruning; 2) wider networks are more preferred than deeper ones; 3) no edges are selected in bi-level optimization. To anatomize these phenomena, we propose a unified view on searching algorithms of existing frameworks, transferring the global optimization to local cost minimization. Based on this reformulation, we conduct empirical and theoretical analyses, revealing implicit biases in the cost's assignment mechanism and evolution dynamics that cause the observed phenomena. These biases indicate strong discrimination towards certain topologies. To this end, we pose questions that future differentiable methods for neural wiring discovery need to confront, hoping to evoke a discussion and rethinking on how much bias has been enforced implicitly in existing NAS methods.

Linear Regression Games: Convergence Guarantees to Approximate Out-of-Distribution Solutions
Kartik Ahuja · Karthikeyan Shanmugam · Amit Dhurandhar
Recently, invariant risk minimization (IRM) (Arjovsky et al. 2019) was proposed as a promising solution to address out-of-distribution (OOD) generalization. In Ahuja et al. (2020), it was shown that solving for the Nash equilibria of a new class of “ensemble-games” is equivalent to solving IRM. In this work, we extend the framework in Ahuja et al. (2020) for linear regressions by projecting the ensemble-game on an $\ell_{\infty}$ ball. We show that such projections help achieve non-trivial out-of-distribution guarantees despite not achieving perfect invariance. For linear models with confounders, we prove that Nash equilibria of these games are closer to the ideal OOD solutions than the standard empirical risk minimization (ERM) and we also provide learning algorithms that provably converge to these Nash Equilibria. Empirical comparisons of the proposed approach with the state-of-the-art show consistent gains in achieving OOD solutions in several settings involving anti-causal variables and confounders.
Revisiting the Role of Euler Numerical Integration on Acceleration and Stability in Convex Optimization
Peiyuan Zhang · Antonio Orvieto · Hadi Daneshmand · Thomas Hofmann · Roy S. Smith

Viewing optimization methods as numerical integrators for ordinary differential equations (ODEs) provides a thought-provoking modern framework for studying accelerated first-order optimizers. In this literature, acceleration is often supposed to be linked to the quality of the integrator (accuracy, energy preservation, symplecticity). In this work, we propose a novel ordinary differential equation that questions this connection: both the explicit and the semi-implicit (a.k.a symplectic) Euler discretizations on this ODE lead to an accelerated algorithm for convex programming. Although semi-implicit methods are well-known in numerical analysis to enjoy many desirable features for the integration of physical systems, our findings show that these properties do not necessarily relate to acceleration.

Fenchel-Young Losses with Skewed Entropies for Class-posterior Probability Estimation
Han Bao · Masashi Sugiyama

We study class-posterior probability estimation (CPE) for binary responses where one class has much fewer data than the other. For example, events such as species co-occurrence in ecology and wars in political science are often much rarer than non-events. Logistic regression has been widely used for CPE, while it tends to underestimate the probability of rare events. Its main drawback is symmetry of the logit link---symmetric links can be misled by small and imbalanced samples because it is more incentivized to overestimate the majority class with finite samples. Parametric skewed links have been proposed to overcome this limitation, but their estimation usually results in nonconvex optimization unlike the logit link. Such nonconvexity is knotty not only from the computational viewpoint but also in terms of the parameter identifiability. In this paper, we provide a procedure to derive a convex loss for a skewed link based on the recently proposed Fenchel-Young losses. The derived losses are always convex and have a nice property suitable for class imbalance. The simulation shows the practicality of the derived losses.

γ-ABC: Outlier-Robust Approximate Bayesian Computation Based on a Robust Divergence Estimator
Masahiro Fujisawa · Takeshi Teshima · Issei Sato · Masashi Sugiyama

Approximate Bayesian computation (ABC) is a likelihood-free inference method that has been employed in various applications. However, ABC can be sensitive to outliers if a data discrepancy measure is chosen inappropriately. In this paper, we propose to use a nearest-neighbor-based γ-divergence estimator as a data discrepancy measure. We show that our estimator possesses a suitable robustness property called the redescending property. In addition, our estimator enjoys various desirable properties such as high flexibility, asymptotic unbiasedness, almost sure convergence, and linear time complexity. Through experiments, we demonstrate that our method achieves significantly higher robustness than existing discrepancy measures.

Taming heavy-tailed features by shrinkage
Ziwei Zhu · Wenjing Zhou

In this work, we focus on a variant of the generalized linear model (GLM) called corrupted GLM (CGLM) with heavy-tailed features and responses. To robustify the statistical inference on this model, we propose to apply L4-norm shrinkage to the feature vectors in the low-dimensional regime and apply elementwise shrinkage to them in the high-dimensional regime. Under bounded fourth moment assumptions, we show that the maximum likelihood estimator (MLE) based on the shrunk data enjoys nearly the minimax optimal rate with an exponential deviation bound. Our simulations demonstrate that the proposed feature shrinkage significantly enhances the statistical performance in linear regression and logistic regression on heavy-tailed data. Finally, we apply our shrinkage principle to guard against mislabeling and image noise in the human-written digit recognition problem. We add an L4-norm shrinkage layer to the original neural net and reduce the testing misclassification rate by more than 30% relatively in the presence of mislabeling and image noise.

Learning Individually Fair Classifier with Path-Specific Causal-Effect Constraint
Yoichi Chikahara · Shinsaku Sakaue · Akinori Fujino · Hisashi Kashima

Machine learning is used to make decisions for individuals in various fields, which require us to achieve good prediction accuracy while ensuring fairness with respect to sensitive features (e.g., race and gender). This problem, however, remains difficult in complex real-world scenarios. To quantify unfairness under such situations, existing methods utilize {\it path-specific causal effects}. However, none of them can ensure fairness for each individual without making impractical functional assumptions about the data. In this paper, we propose a far more practical framework for learning an individually fair classifier. To avoid restrictive functional assumptions, we define the {\it probability of individual unfairness} (PIU) and solve an optimization problem where PIU's upper bound, which can be estimated from data, is controlled to be close to zero. We elucidate why our method can guarantee fairness for each individual. Experimental results show that our method can learn an individually fair classifier at a slight cost of accuracy.

Learning Shared Subgraphs in Ising Model Pairs
Burak Varici · Saurabh Sihag · Ali Tajer

Probabilistic graphical models (PGMs) are effective for capturing the statistical dependencies in stochastic databases. In many domains (e.g., working with multimodal data), one faces multiple information layers that can be modeled by structurally similar PGMs. While learning the structures of PGMs in isolation is well-investigated, the algorithmic design and performance limits of learning from multiple coupled PGMs are investigated far less. This paper considers learning the structural similarities shared by a pair of Ising PGMs. The objective is learning the shared structure with no regard for the structures exclusive to either of the graphs, and significantly different from the existing approaches that focus on entire structure of the graphs. We propose an algorithm for the shared structure learning objective, evaluate its performance empirically, and compare with existing approaches on structure learning of single graphs.

Identification of Matrix Joint Block Diagonalization
Yunfeng Cai · Ping Li
Given a set $\mathcal{C}=\{C_i\}_{i=1}^m$ of square matrices, the matrix blind joint block diagonalization problem (BJBDP) is to find a full column rank matrix $A$ such that $C_i=A\Sigma_iA^{\T}$ for all $i$, where $\Sigma_i$'s are all block diagonal matrices with as many diagonal blocks as possible. The BJBDP plays an important role in independent subspace analysis. This paper considers the identification problem for BJBDP, that is, under what conditions and by what means, we can identify the diagonalizer $A$ and the block diagonal structure of $\Sigma_i$, especially when there is noise in $C_i$'s. In this paper, we propose a ``bi-block diagonalization'' method to solve BJBDP, and establish sufficient conditions for when the method is able to accomplish the task. Numerical simulations validate our theoretical results. To the best of the authors' knowledge, current numerical methods for BJBDP have no theoretical guarantees for the identification of the exact solution, whereas our method does.
Tight Regret Bounds for Infinite-armed Linear Contextual Bandits
Yingkai Li · Yining Wang · Xi Chen · Yuan Zhou

Linear contextual bandit is a class of sequential decision-making problems with important applications in recommendation systems, online advertising, healthcare, and other machine learning-related tasks. While there is much prior research, tight regret bounds of linear contextual bandit with infinite action sets remain open. In this paper, we consider the linear contextual bandit problem with (changing) infinite action sets. We prove a regret upper bound on the order of O(\sqrt{d^2T\log T}) \poly(\log\log T) where d is the domain dimension and T is the time horizon. Our upper bound matches the previous lower bound of \Omega(\sqrt{d^2 T\log T}) in [Li et al., 2019] up to iterated logarithmic terms.

Generalization Bounds for Stochastic Saddle Point Problems
Junyu Zhang · Mingyi Hong · Mengdi Wang · Shuzhong Zhang
This paper studies the generalization bounds for the empirical saddle point (ESP) solution to stochastic saddle point (SSP) problems. For SSP with Lipschitz continuous and strongly convex-strongly concave objective functions, we establish an {\footnotesize$\cO\left(1/n\right)$} generalization bound by using a probabilistic stability argument. We also provide generalization bounds under a variety of assumptions, including the cases without strong convexity and without bounded domains. We illustrate our results in three examples: batch policy learning in Markov decision process, stochastic composite optimization problem, and mixed strategy Nash equilibrium estimation for stochastic games. In each of these examples, we show that a regularized ESP solution enjoys a near-optimal sample complexity. To the best of our knowledge, this is the first set of results on the generalization theory of ESP.
Neural Function Modules with Sparse Arguments: A Dynamic Approach to Integrating Information across Layers
Alex Lamb · Anirudh Goyal · Agnieszka Słowik · Michael Mozer · Philippe Beaudoin · Yoshua Bengio

Feed-forward neural networks consist of a sequence of layers, in which each layer performs some processing on the information from the previous layer. A downside to this approach is that each layer (or module, as multiple modules can operate in parallel) is tasked with processing the entire hidden state, rather than a particular part of the state which is most relevant for that module. Methods which only operate on a small number of input variables are an essential part of most programming languages, and they allow for improved modularity and code re-usability. Our proposed method, Neural Function Modules (NFM), aims to introduce the same structural capability into deep learning. Most of the work in the context of feed-forward networks combining top-down and bottom-up feedback is limited to classification problems. The key contribution of our work is to combine attention, sparsity, top-down and bottom-up feedback, in a flexible algorithm which, as we show, improves the results in standard classification, out-of-domain generalization, generative modeling, and learning representations in the context of reinforcement learning.

Predictive Power of Nearest Neighbors Algorithm under Random Perturbation
Yue Xing · Qifan Song · Guang Cheng
This work investigates the predictive performance of the classical $k$ Nearest Neighbors ($k$-NN) algorithm when the testing data are corrupted by random perturbation. The impact of corruption level on the asymptotic regret is carefully characterized and we reveal a phase-transition phenomenon that, when the corruption level of the random perturbation $\omega$ is below a critical order (i.e., small-$\omega$ regime), the asymptotic regret remains the same; when it is beyond that order (i.e., large-$\omega$ regime), the asymptotic regret deteriorates polynomially. More importantly, the regret of $k$-NN classifier heuristically matches the rate of minimax regret for randomly perturbed testing data, thus implies the strong robustness of $k$-NN against random perturbation on testing data. In fact, we show that the classical $k$-NN can achieve no worse predictive performance, compared to the NN classifiers trained via the popular noise-injection strategy. Our numerical experiment also illustrates that combining $k$-NN component with modern learning algorithms will inherit the strong robustness of $k$-NN. As a technical by-product, we prove that under different model assumptions, the pre-processed 1-NN proposed in \cite{xue2017achieving} will at most achieve a sub-optimal rate when the data dimension $d>4$ even if $k$ is chosen optimally in the pre-processing step.
The Spectrum of Fisher Information of Deep Networks Achieving Dynamical Isometry
Tomohiro Hayase · Ryo Karakida

The Fisher information matrix (FIM) is fundamental to understanding the trainability of deep neural nets (DNN), since it describes the parameter space's local metric. We investigate the spectral distribution of the conditional FIM, which is the FIM given a single sample, by focusing on fully-connected networks achieving dynamical isometry. Then, while dynamical isometry is known to keep specific backpropagated signals independent of the depth, we find that the parameter space's local metric linearly depends on the depth even under the dynamical isometry. More precisely, we reveal that the conditional FIM's spectrum concentrates around the maximum and the value grows linearly as the depth increases. To examine the spectrum, considering random initialization and the wide limit, we construct an algebraic methodology based on the free probability theory. As a byproduct, we provide an analysis of the solvable spectral distribution in two-hidden-layer cases. Lastly, experimental results verify that the appropriate learning rate for the online training of DNNs is in inverse proportional to depth, which is determined by the conditional FIM's spectrum.

Provably Efficient Safe Exploration via Primal-Dual Policy Optimization
Dongsheng Ding · Xiaohan Wei · Zhuoran Yang · Zhaoran Wang · Mihailo Jovanovic
We study the safe reinforcement learning problem using the constrained Markov decision processes in which an agent aims to maximize the expected total reward subject to a safety constraint on the expected total value of a utility function. We focus on an episodic setting with the function approximation where the Markov transition kernels have a linear structure but do not impose any additional assumptions on the sampling model. Designing safe reinforcement learning algorithms with provable computational and statistical efficiency is particularly challenging under this setting because of the need to incorporate both the safety constraint and the function approximation into the fundamental exploitation/exploration tradeoff. To this end, we present an \underline{O}ptimistic \underline{P}rimal-\underline{D}ual Proximal Policy \underline{OP}timization \mbox{(OPDOP)} algorithm where the value function is estimated by combining the least-squares policy evaluation and an additional bonus term for safe exploration. We prove that the proposed algorithm achieves an $\tilde{O}(d H^{2.5}\sqrt{T})$ regret and an $\tilde{O}(d H^{2.5}\sqrt{T})$ constraint violation, where $d$ is the dimension of the feature mapping, $H$ is the horizon of each episode, and $T$ is the total number of steps. These bounds hold when the reward/utility functions are fixed but the feedback after each episode is bandit. Our bounds depend on the …
Mirror Descent View for Neural Network Quantization
Thalaiyasingam Ajanthan · Kartik Gupta · Philip Torr · RICHARD HARTLEY · Puneet Dokania

Quantizing large Neural Networks (NN) while maintaining the performance is highly desirable for resource-limited devices due to reduced memory and time complexity. It is usually formulated as a constrained optimization problem and optimized via a modified version of gradient descent. In this work, by interpreting the continuous parameters (unconstrained) as the dual of the quantized ones, we introduce a Mirror Descent (MD) framework for NN quantization. Specifically, we provide conditions on the projections (i.e., mapping from continuous to quantized ones) which would enable us to derive valid mirror maps and in turn the respective MD updates. Furthermore, we present a numerically stable implementation of MD that requires storing an additional set of auxiliary variables (unconstrained), and show that it is strikingly analogous to the Straight Through Estimator (STE) based method which is typically viewed as a “trick” to avoid vanishing gradients issue. Our experiments on CIFAR-10/100, TinyImageNet, and ImageNet classification datasets with VGG-16, ResNet-18, and MobileNetV2 architectures show that our MD variants yield state-of-the-art performance.

Adversarially Robust Estimate and Risk Analysis in Linear Regression
Yue Xing · Ruizhi Zhang · Guang Cheng

Adversarial robust learning aims to design algorithms that are robust to small adversarial perturbations on input variables. Beyond the existing studies on the predictive performance to adversarial samples, our goal is to understand statistical properties of adversarial robust estimates and analyze adversarial risk in the setup of linear regression models. By discovering the statistical minimax rate of convergence of adversarial robust estimators, we emphasize the importance of incorporating model information, e.g., sparsity, in adversarial robust learning. Further, we reveal an explicit connection of adversarial and standard estimates, and propose a straightforward two-stage adversarial training framework, which facilitates to utilize model structure information to improve adversarial robustness. In theory, the consistency of the adversarial robust estimator is proven and its Bahadur representation is also developed for the statistical inference purpose. The proposed estimator converges in a sharp rate under either low-dimensional or sparse scenario. Moreover, our theory confirms two phenomena in adversarial robust learning: adversarial robustness hurts generalization, and unlabeled data help improve the generalization. In the end, we conduct numerical simulations to verify our theory.

Bayesian Model Averaging for Causality Estimation and its Approximation based on Gaussian Scale Mixture Distributions
Shunsuke Horii

In estimation of the causal effect under linear Structural Causal Models (SCMs), it is common practice to first identify the causal structure, estimate the probability distributions, and then calculate the causal effect. However, if the goal is to estimate the causal effect, it is not necessary to fix a single causal structure or probability distributions. In this paper, we first show from a Bayesian perspective that it is Bayes optimal to weight (average) the causal effects estimated under each model rather than estimating a single model. This idea is also known as Bayesian model averaging. Although the Bayesian model averaging is optimal, as the number of candidate models increases, the weighting calculations become computationally hard. We develop an approximation to the Bayes optimal estimator by using Gaussian scale mixture distributions.

Contrastive learning of strong-mixing continuous-time stochastic processes
Bingbin Liu · Pradeep Ravikumar · Andrej Risteski

Contrastive learning is a family of self-supervised methods where a model is trained to solve a classification task constructed from unlabeled data. It has recently emerged as one of the leading learning paradigms in the absence of labels across many different domains (e.g. brain imaging, text, images). However, theoretical understanding of many aspects of training, both statistical and algorithmic, remain fairly elusive.

In this work, we study the setting of time series---more precisely, when we get data from a strong-mixing continuous-time stochastic process. We show that a properly constructed contrastive learning task can be used to the transition kernel for small-to-mid-range intervals in the diffusion case. Moreover, we give sample complexity bounds for solving this task and quantitatively characterize what the value of the contrastive loss implies for distributional closeness of the learned kernel. As a byproduct, we illuminate the appropriate settings for the contrastive distribution, as well as other hyperparameters in this setup.

Regularized Policies are Reward Robust
Hisham Husain · Kamil Ciosek · Ryota Tomioka

Entropic regularization of policies in Reinforcement Learning (RL) is a commonly used heuristic to ensure that the learned policy explores the state-space sufficiently before overfitting to a local optimal policy. The primary motivation for using entropy is for exploration and disambiguating optimal policies; however, the theoretical effects are not entirely understood. In this work, we study the more general regularized RL objective and using Fenchel duality; we derive the dual problem which takes the form of an adversarial reward problem. In particular, we find that the optimal policy found by a regularized objective is precisely an optimal policy of a reinforcement learning problem under a worst-case adversarial reward. Our result allows us to reinterpret the popular entropic regularization scheme as a form of robustification. Furthermore, due to the generality of our results, we apply to other existing regularization schemes. Our results thus give insights into the effects of regularization of policies and deepen our understanding of exploration through robust rewards at large.

A comparative study on sampling with replacement vs Poisson sampling in optimal subsampling
HaiYing Wang · Jiahui Zou

Faced with massive data, subsampling is a commonly used technique to improve computational efficiency, and using nonuniform subsampling probabilities is an effective approach to improve estimation efficiency. For computational efficiency, subsampling is often implemented with replacement or through Poisson subsampling. However, no rigorous investigation has been performed to study the difference between the two subsampling procedures such as their estimation efficiency and computational convenience. In the context of maximizing a general target function, this paper derives optimal subsampling probabilities for both subsampling with replacement and Poisson subsampling. The optimal subsampling probabilities minimize variance functions of the subsampling estimators. Furthermore, they provide deep insights on the theoretical similarities and differences between subsampling with replacement and Poisson subsampling. Practically implementable algorithms are proposed based on the optimal structural results, which are evaluated by both theoretical and empirical analysis.

Ridge Regression with Over-parametrized Two-Layer Networks Converge to Ridgelet Spectrum
Sho Sonoda · Isao Ishikawa · Masahiro Ikeda

Characterization of local minima draws much attention in theoretical studies of deep learning. In this study, we investigate the distribution of parameters in an over-parametrized finite neural network trained by ridge regularized empirical square risk minimization (RERM). We develop a new theory of ridgelet transform, a wavelet-like integral transform that provides a powerful and general framework for the theoretical study of neural networks involving not only the ReLU but general activation functions. We show that the distribution of the parameters converges to a spectrum of the ridgelet transform. This result provides a new insight into the characterization of the local minima of neural networks, and the theoretical background of an inductive bias theory based on lazy regimes. We confirm the visual resemblance between the parameter distribution trained by SGD, and the ridgelet spectrum calculated by numerical integration through numerical experiments with finite models.

Minimal enumeration of all possible total effects in a Markov equivalence class
Richard Guo · Emilija Perkovic

In observational studies, when a total causal effect of interest is not identified, the set of all possible effects can be reported instead. This typically occurs when the underlying causal DAG is only known up to a Markov equivalence class, or a refinement thereof due to background knowledge. As such, the class of possible causal DAGs is represented by a maximally oriented partially directed acyclic graph (MPDAG), which contains both directed and undirected edges. We characterize the minimal additional edge orientations required to identify a given total effect. A recursive algorithm is then developed to enumerate subclasses of DAGs, such that the total effect in each subclass is identified as a distinct functional of the observed distribution. This resolves an issue with existing methods, which often report possible total effects with duplicates, namely those that are numerically distinct due to sampling variability but are in fact causally identical.

DAG-Structured Clustering by Nearest Neighbors
Nicholas Monath · Manzil Zaheer · Kumar Avinava Dubey · Amr Ahmed · Andrew McCallum

Hierarchical clusterings compactly encode multiple granularities of clusters within a tree structure. Hierarchies, by definition, fail to capture different flat partitions that are not subsumed in one another. In this paper, we advocate for an alternative structure for representing multiple clusterings, a directed acyclic graph (DAG). By allowing nodes to have multiple parents, DAG structures are not only more flexible than trees, but also allow for points to be members of multiple clusters. We describe a scalable algorithm, Llama, which simply merges nearest neighbor substructures to form a DAG structure. Llama discovers structures that are more accurate than state-of-the-art tree-based techniques while remaining scalable to large-scale clustering benchmarks. Additionally, we support the proposed algorithm with theoretical guarantees on separated data, including types of data that cannot be correctly clustered by tree-based algorithms.

Tensor Networks for Probabilistic Sequence Modeling
Jacob E Miller · Guillaume Rabusseau · John Terilla

Tensor networks are a powerful modeling framework developed for computational many-body physics, which have only recently been applied within machine learning. In this work we utilize a uniform matrix product state (u-MPS) model for probabilistic modeling of sequence data. We first show that u-MPS enable sequence-level parallelism, with length-n sequences able to be evaluated in depth O(log n). We then introduce a novel generative algorithm giving trained u-MPS the ability to efficiently sample from a wide variety of conditional distributions, each one defined by a regular expression. Special cases of this algorithm correspond to autoregressive and fill-in-the-blank sampling, but more complex regular expressions permit the generation of richly structured data in a manner that has no direct analogue in neural generative models. Experiments on sequence modeling with synthetic and real text data show u-MPS outperforming a variety of baselines and effectively generalizing their predictions in the presence of limited data.

Minimax Model Learning
Cameron Voloshin · Nan Jiang · Yisong Yue

We present a novel off-policy loss function for learning a transition model in model-based reinforcement learning. Notably, our loss is derived from the off-policy policy evaluation objective with an emphasis on correcting distribution shift. Compared to previous model-based techniques, our approach allows for greater robustness under model misspecification or distribution shift induced by learning/evaluating policies that are distinct from the data-generating policy. We provide a theoretical analysis and show empirical improvements over existing model-based off-policy evaluation methods. We provide further analysis showing our loss can be used for off-policy optimization (OPO) and demonstrate its integration with more recent improvements in OPO.

Q-learning with Logarithmic Regret
Kunhe Yang · Lin Yang · Simon Du
This paper presents the first non-asymptotic result showing a model-free algorithm can achieve logarithmic cumulative regret for episodic tabular reinforcement learning if there exists a strictly positive sub-optimality gap. We prove that the optimistic Q-learning studied in [Jin et al. 2018] enjoys a ${\mathcal{O}}\!\left(\frac{SA\cdot \mathrm{poly}\left(H\right)}{\Delta_{\min}}\log\left(SAT\right)\right)$ cumulative regret bound where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap of the optimal Q-function. This bound matches the information theoretical lower bound in terms of $S,A,T$ up to a $\log\left(SA\right)$ factor. We further extend our analysis to the discounted setting and obtain a similar logarithmic cumulative regret bound.
Federated Multi-armed Bandits with Personalization
Chengshuai Shi · Cong Shen · Jing Yang

A general framework of personalized federated multi-armed bandits (PF-MAB) is proposed, which is a new bandit paradigm analogous to the federated learning (FL) framework in supervised learning and enjoys the features of FL with personalization. Under the PF-MAB framework, a mixed bandit learning problem that flexibly balances generalization and personalization is studied. A lower bound analysis for the mixed model is presented. We then propose the Personalized Federated Upper Confidence Bound (PF-UCB) algorithm, where the exploration length is chosen carefully to achieve the desired balance of learning the local model and supplying global information for the mixed learning objective. Theoretical analysis proves that PF-UCB achieves an O(log(T)) regret regardless of the degree of personalization, and has a similar instance dependency as the lower bound. Experiments using both synthetic and real-world datasets corroborate the theoretical analysis and demonstrate the effectiveness of the proposed algorithm.

Shadow Manifold Hamiltonian Monte Carlo
Chris van der Heide · Fred Roosta · Liam Hodgkinson · Dirk Kroese

Hamiltonian Monte Carlo and its descendants have found success in machine learning and computational statistics due to their ability to draw samples in high dimensions with greater efficiency than classical MCMC. One of these derivatives, Riemannian manifold Hamiltonian Monte Carlo (RMHMC), better adapts the sampler to the geometry of the target density, allowing for improved performances in sampling problems with complex geometric features. Other approaches have boosted acceptance rates by sampling from an integrator-dependent “shadow density” and compensating for the induced bias via importance sampling. We combine the benefits of RMHMC with those attained by sampling from the shadow density, by deriving the shadow Hamiltonian corresponding to the generalized leapfrog integrator used in RMHMC. This leads to a new algorithm, shadow manifold Hamiltonian Monte Carlo, that shows improved performance over RMHMC, and leaves the target density invariant.

Online Forgetting Process for Linear Regression Models
Yuantong Li · Chi-Hua Wang · Guang Cheng

Motivated by the EU's "Right To Be Forgotten" regulation, we initiate a study of statistical data deletion problems where users' data are accessible only for a limited period of time. This setting is formulated as an online supervised learning task with \textit{constant memory limit}. We propose a deletion-aware algorithm \texttt{FIFD-OLS} for the low dimensional case, and witness a catastrophic rank swinging phenomenon due to the data deletion operation, which leads to statistical inefficiency. As a remedy, we propose the \texttt{FIFD-Adaptive Ridge} algorithm with a novel online regularization scheme, that effectively offsets the uncertainty from deletion. In theory, we provide the cumulative regret upper bound for both online forgetting algorithms. In the experiment, we showed \texttt{FIFD-Adaptive Ridge} outperforms the ridge regression algorithm with fixed regularization level, and hopefully sheds some light on more complex statistical models.

Differentially Private Monotone Submodular Maximization Under Matroid and Knapsack Constraints
Omid Sadeghi · Maryam Fazel
Numerous tasks in machine learning and artificial intelligence have been modeled as submodular maximization problems. These problems usually involve sensitive data about individuals, and in addition to maximizing the utility, privacy concerns should be considered. In this paper, we study the general framework of non-negative monotone submodular maximization subject to matroid or knapsack constraints in both offline and online settings. For the offline setting, we propose a differentially private $(1-\frac{\kappa}{e})$-approximation algorithm, where $\kappa\in[0,1]$ is the total curvature of the submodular set function, which improves upon prior works in terms of approximation guarantee and query complexity under the same privacy budget. In the online setting, we propose the first differentially private algorithm, and we specify the conditions under which the regret bound scales as $\O(\sqrt{T})$, i.e., privacy could be ensured while maintaining the same regret bound as the optimal regret guarantee in the non-private setting.
Meta Learning in the Continuous Time Limit
Ruitu Xu · Lin Chen · Amin Karbasi

In this paper, we establish the ordinary differential equation (ODE) that underlies the training dynamics of Model-Agnostic Meta-Learning (MAML). Our continuous-time limit view of the process eliminates the influence of the manually chosen step size of gradient descent and includes the existing gradient descent training algorithm as a special case that results from a specific discretization. We show that the MAML ODE enjoys a linear convergence rate to an approximate stationary point of the MAML loss function for strongly convex task losses, even when the corresponding MAML loss is non-convex. Moreover, through the analysis of the MAML ODE, we propose a new BI-MAML training algorithm that reduces the computational burden associated with existing MAML training methods, and empirical experiments are performed to showcase the superiority of our proposed methods in the rate of convergence with respect to the vanilla MAML algorithm.

Hierarchical Inducing Point Gaussian Process for Inter-domian Observations
Luhuan Wu · Andrew Miller · Lauren Anderson · Geoff Pleiss · David Blei · John Cunningham

We examine the general problem of inter-domain Gaussian Processes (GPs): problems where the GP realization and the noisy observations of that realization lie on different domains. When the mapping between those domains is linear, such as integration or differentiation, inference is still closed form. However, many of the scaling and approximation techniques that our community has developed do not apply to this setting. In this work, we introduce the hierarchical inducing point GP (HIP-GP), a scalable inter-domain GP inference method that enables us to improve the approximation accuracy by increasing the number of inducing points to the millions. HIP-GP, which relies on inducing points with grid structure and a stationary kernel assumption, is suitable for low-dimensional problems. In developing HIP-GP, we introduce (1) a fast whitening strategy, and (2) a novel preconditioner for conjugate gradients which can be helpful in general GP settings.

Independent Innovation Analysis for Nonlinear Vector Autoregressive Process
Hiroshi Morioka · Hermanni Hälvä · Aapo Hyvarinen

The nonlinear vector autoregressive (NVAR) model provides an appealing framework to analyze multivariate time series obtained from a nonlinear dynamical system. However, the innovation (or error), which plays a key role by driving the dynamics, is almost always assumed to be additive. Additivity greatly limits the generality of the model, hindering analysis of general NVAR processes which have nonlinear interactions between the innovations. Here, we propose a new general framework called independent innovation analysis (IIA), which estimates the innovations from completely general NVAR. We assume mutual independence of the innovations as well as their modulation by an auxiliary variable (which is often taken as the time index and simply interpreted as nonstationarity). We show that IIA guarantees the identifiability of the innovations with arbitrary nonlinearities, up to a permutation and component-wise invertible nonlinearities. We also propose three estimation frameworks depending on the type of the auxiliary variable. We thus provide the first rigorous identifiability result for general NVAR, as well as very general tools for learning such models.

Gradient Descent in RKHS with Importance Labeling
Tomoya Murata · Taiji Suzuki

Labeling cost is often expensive and is a fundamental limitation of supervised learning. In this paper, we study importance labeling problem, in which we are given many unlabeled data and select a limited number of data to be labeled from the unlabeled data, and then a learning algorithm is executed on the selected one. We propose a new importance labeling scheme that can effectively select an informative subset of unlabeled data in least squares regression in Reproducing Kernel Hilbert Spaces (RKHS). We analyze the generalization error of gradient descent combined with our labeling scheme and show that the proposed algorithm achieves the optimal rate of convergence in much wider settings and especially gives much better generalization ability in a small noise setting than the usual uniform sampling scheme. Numerical experiments verify our theoretical findings.

GANs with Conditional Independence Graphs: On Subadditivity of Probability Divergences
Mucong Ding · Constantinos Daskalakis · Soheil Feizi

Generative Adversarial Networks (GANs) are modern methods to learn the underlying distribution of a data set. GANs have been widely used in sample synthesis, de-noising, domain transfer, etc. GANs, however, are designed in a model-free fashion where no additional information about the underlying distribution is available. In many applications, however, practitioners have access to the underlying independence graph of the variables, either as a Bayesian network or a Markov Random Field (MRF). We ask: how can one use this additional information in designing model-based GANs? In this paper, we provide theoretical foundations to answer this question by studying subadditivity properties of probability divergences, which establish upper bounds on the distance between two high-dimensional distributions by the sum of distances between their marginals over (local) neighborhoods of the graphical structure of the Bayes-net or the MRF. We prove that several popular probability divergences satisfy some notion of subadditivity under mild conditions. These results lead to a principled design of a model-based GAN that uses a set of simple discriminators on the neighborhoods of the Bayes-net/MRF, rather than a giant discriminator on the entire network, providing significant statistical and computational benefits. Our experiments on synthetic and real-world datasets demonstrate the benefits of …

Hindsight Expectation Maximization for Goal-conditioned Reinforcement Learning
Yunhao Tang · Alp Kucukelbir

We propose a graphical model framework for goal-conditioned RL, with an EM algorithm that operates on the lower bound of the RL objective. The E-step provides a natural interpretation of how 'learning in hindsight' techniques, such as HER, to handle extremely sparse goal-conditioned rewards. The M-step reduces policy optimization to supervised learning updates, which greatly stabilizes end-to-end training on high-dimensional inputs such as images. We show that the combined algorithm, hEM significantly outperforms model-free baselines on a wide range of goal-conditioned benchmarks with sparse rewards.

Linear Models are Robust Optimal Under Strategic Behavior
Wei Tang · Chien-Ju Ho · Yang Liu

There is an increasing use of algorithms to inform decisions in many settings, from student evaluations, college admissions, to credit scoring. These decisions are made by applying a decision rule to individual's observed features. Given the impacts of these decisions on individuals, decision makers are increasingly required to be transparent on their decision making to offer the ``right to explanation.'' Meanwhile, being transparent also invites potential manipulations, also known as gaming, that the individuals can utilize the knowledge to strategically alter their features in order to receive a more beneficial decision.

In this work, we study the problem of \emph{robust} decision-making under strategic behavior. Prior works often assume that the decision maker has full knowledge of individuals' cost structure for manipulations. We study the robust variant that relaxes this assumption: The decision maker does not have full knowledge but knows only a subset of the individuals' available actions and associated costs. To approach this non-quantifiable uncertainty, we define robustness based on the worst-case guarantee of a decision, over all possible actions (including actions unknown to the decision maker) individuals might take. A decision rule is called \emph{robust optimal} if its worst case performance is (weakly) better than that of all …

Minimax Estimation of Laplacian Constrained Precision Matrices
Jiaxi Ying · José Vinícius de Miranda Cardoso · Daniel Palomar
This paper considers the problem of high-dimensional sparse precision matrix estimation under Laplacian constraints. We prove that the Laplacian constraints bring favorable properties for estimation: the Gaussian maximum likelihood estimator exists and is unique almost surely on the basis of one observation, irrespective of the dimension. We establish the optimal rate of convergence under Frobenius norm by the derivation of the minimax lower and upper bounds. The minimax lower bound is obtained by applying Le Cam-Assouad's method with a novel construction of a subparameter space of multivariate normal distributions. The minimax upper bound is established by designing an adaptive $\ell_1$-norm regularized maximum likelihood estimation method and quantifying the rate of convergence. We prove that the proposed estimator attains the optimal rate of convergence with an overwhelming probability. Numerical experiments demonstrate the effectiveness of the proposed estimator.
Beyond Marginal Uncertainty: How Accurately can Bayesian Regression Models Estimate Posterior Predictive Correlations?
Chaoqi Wang · Shengyang Sun · Roger Grosse

While uncertainty estimation is a well-studied topic in deep learning, most such work focuses on marginal uncertainty estimates, i.e. the predictive mean and variance at individual input locations. But it is often more useful to estimate predictive correlations between the function values at different input locations. In this paper, we consider the problem of benchmarking how accurately Bayesian models can estimate predictive correlations. We first consider a downstream task which depends on posterior predictive correlations: transductive active learning (TAL). We find that TAL makes better use of models' uncertainty estimates than ordinary active learning, and recommend this as a benchmark for evaluating Bayesian models. Since TAL is too expensive and indirect to guide development of algorithms, we introduce two metrics which more directly evaluate the predictive correlations and which can be computed efficiently: meta-correlations (i.e. the correlations between the models correlation estimates and the true values), and cross-normalized likelihoods (XLL). We validate these metrics by demonstrating their consistency with TAL performance and obtain insights about the relative performance of current Bayesian neural net and Gaussian process models.

A Parameter-Free Algorithm for Misspecified Linear Contextual Bandits
Kei Takemura · Shinji Ito · Daisuke Hatano · Hanna Sumita · Takuro Fukunaga · Naonori Kakimura · Ken-ichi Kawarabayashi
We investigate the misspecified linear contextual bandit (MLCB) problem, which is a generalization of the linear contextual bandit (LCB) problem. The MLCB problem is a decision-making problem in which a learner observes $d$-dimensional feature vectors, called arms, chooses an arm from $K$ arms, and then obtains a reward from the chosen arm in each round. The learner aims to maximize the sum of the rewards over $T$ rounds. In contrast to the LCB problem, the rewards in the MLCB problem may not be represented by a linear function in feature vectors; instead, it is approximated by a linear function with additive approximation parameter $\varepsilon \geq 0$. In this paper, we propose an algorithm that achieves $\tilde{O}(\sqrt{dT\log(K)} + \eps\sqrt{d}T)$ regret, where $\tilde{O}(\cdot)$ ignores polylogarithmic factors in $d$ and $T$. This is the first algorithm that guarantees a high-probability regret bound for the MLCB problem without knowledge of the approximation parameter $\varepsilon$.
Consistent k-Median: Simpler, Better and Robust
Xiangyu Guo · Janardhan Kulkarni · Shi Li · Jiayi Xian

In this paper we introduce and study the online consistent k-clustering with outliers problem, generalizing the non-outlier version of the problem studied in Lattanzi-Vassilvitskii [18]. We show that a simple local-search based on-line algorithm can give a bicriteria constant approximation for the problem with O(k^2 log^2(nD)) swaps of medians (recourse) in total, where D is the diameter of the metric. When restricted to the problem without outliers, our algorithm is simpler, deterministic and gives better approximation ratio and recourse, compared to that of Lattanzi-Vassilvitskii [18].

Optimal query complexity for private sequential learning against eavesdropping
Jiaming Xu · Kuang Xu · Dana Yang

We study the query complexity of a learner-private sequential learning problem, motivated by the privacy and security concerns due to eavesdropping that arise in practical applications such as pricing and Federated Learning. A learner tries to estimate an unknown scalar value, by sequentially querying an external database and receiving binary responses; meanwhile, a third-party adversary observes the learner's queries but not the responses. The learner's goal is to design a querying strategy with the minimum number of queries (optimal query complexity) so that she can accurately estimate the true value, while the eavesdropping adversary even with the complete knowledge of her querying strategy cannot.

On Projection Robust Optimal Transport: Sample Complexity and Model Misspecification
Tianyi Lin · Zeyu Zheng · Elynn Chen · Marco Cuturi · Michael Jordan
Optimal transport (OT) distances are increasingly used as loss functions for statistical inference, notably in the learning of generative models or supervised learning. Yet, the behavior of minimum Wasserstein estimators is poorly understood, notably in high-dimensional regimes or under model misspecification. In this work we adopt the viewpoint of projection robust (PR) OT, which seeks to maximize the OT cost between two measures by choosing a $k$-dimensional subspace onto which they can be projected. Our first contribution is to establish several fundamental statistical properties of PR Wasserstein distances, complementing and improving previous literature that has been restricted to one-dimensional and well-specified cases. Next, we propose the integral PR Wasserstein (IPRW) distance as an alternative to the PRW distance, by averaging rather than optimizing on subspaces. Our complexity bounds can help explain why both PRW and IPRW distances outperform Wasserstein distances empirically in high-dimensional inference tasks. Finally, we consider parametric inference using the PRW distance. We provide an asymptotic guarantee of two types of minimum PRW estimators and formulate a central limit theorem for max-sliced Wasserstein estimator under model misspecification. To enable our analysis on PRW with projection dimension larger than one, we devise a novel combination of variational analysis and …
Learning Temporal Point Processes with Intermittent Observations
Vinayak Gupta · Srikanta Bedathur · Sourangshu Bhattacharya · Abir De

Marked temporal point processes (MTPP) have emerged as a powerful framework to model the underlying generative mechanism of asynchronous events localized in continuous time. Most existing models and inference methods in MTPP framework consider only the complete observation scenario i.e. the event sequence being modeled is completely observed with no missing events – an ideal setting barely encountered in practice. A recent line of work which considers missing events uses supervised learning techniques which require a missing or observed label for each event. In this work, we provide a novel unsupervised model and inference method for MTPPs in presence of missing events. We first model the generative processes of observed events and missing events using two MTPPs, where the missing events are represented as latent random variables. Then we devise an unsupervised training method that jointly learns both the MTPPs by means of variational inference. Experiments with real datasets show that our modeling and inference frameworks can effectively impute the missing data among the observed events, which in turn enhances its predictive prowess.

On the Faster Alternating Least-Squares for CCA
Zhiqiang Xu · Ping Li

We study alternating least-squares (ALS) for canonical correlation analysis (CCA). Recent research shows that the alternating least-squares solver for k-CCA can be directly accelerated with momentum and prominent performance gain has been observed in practice for the resulting simple algorithm. However, despite the simplicity, it is difficult for the accelerated rate to be analyzed in theory in order to explain and match the empirical performance gain. By looking into two neighboring iterations, in this work, we propose an even simpler variant of the faster alternating least-squares solver. Instead of applying momentum to each update for acceleration, the proposed variant only leverages momentum at every other iteration and can converge at a provably faster linear rate of nearly square-root dependence on the singular value gap of the whitened cross-covariance matrix. In addition to the high consistency between theory and practice, experimental studies also show that our variant of the alternating least-squares algorithm as a block CCA solver is even more pass efficient than other variants.

PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming
Alexander Lew · Monica Agrawal · David Sontag · Vikash Mansinghka

Data cleaning is naturally framed as probabilistic inference in a generative model of ground-truth data and likely errors, but the diversity of real-world error patterns and the hardness of inference make Bayesian approaches difficult to automate. We present PClean, a probabilistic programming language (PPL) for leveraging dataset-specific knowledge to automate Bayesian cleaning. Compared to general-purpose PPLs, PClean tackles a restricted problem domain, enabling three modeling and inference innovations: (1) a non-parametric model of relational database instances, which users' programs customize; (2) a novel sequential Monte Carlo inference algorithm that exploits the structure of PClean's model class; and (3) a compiler that generates near-optimal SMC proposals and blocked-Gibbs rejuvenation kernels based on the user's model and data. We show empirically that short (< 50-line) PClean programs can: be faster and more accurate than generic PPL inference on data-cleaning benchmarks; match state-of-the-art data-cleaning systems in terms of accuracy and runtime (unlike generic PPL inference in the same runtime); and scale to real-world datasets with millions of records.

Robust Imitation Learning from Noisy Demonstrations
Voot Tangkaratt · Nontawat Charoenphakdee · Masashi Sugiyama

Robust learning from noisy demonstrations is a practical but highly challenging problem in imitation learning. In this paper, we first theoretically show that robust imitation learning can be achieved by optimizing a classification risk with a symmetric loss. Based on this theoretical finding, we then propose a new imitation learning method that optimizes the classification risk by effectively combining pseudo-labeling with co-training. Unlike existing methods, our method does not require additional labels or strict assumptions about noise distributions. Experimental results on continuous-control benchmarks show that our method is more robust compared to state-of-the-art methods.

Entropy Partial Transport with Tree Metrics: Theory and Practice
Tam Le · Truyen Nguyen

Optimal transport (OT) theory provides powerful tools to compare probability measures. However, OT is limited to nonnegative measures having the same mass, and suffers serious drawbacks about its computation and statistics. This leads to several proposals of regularized variants of OT in the recent literature. In this work, we consider an entropy partial transport (EPT) problem for nonnegative measures on a tree having different masses. The EPT is shown to be equivalent to a standard complete OT problem on a one-node extended tree. We derive its dual formulation, then leverage this to propose a novel regularization for EPT which admits fast computation and negative definiteness. To our knowledge, the proposed regularized EPT is the first approach that yields a closed-form solution among available variants of unbalanced OT for general nonnegative measures. For practical applications without prior knowledge about the tree structure for measures, we propose tree-sliced variants of the regularized EPT, computed by averaging the regularized EPT between these measures using random tree metrics, built adaptively from support data points. Exploiting the negative definiteness of our regularized EPT, we introduce a positive definite kernel, and evaluate it against other baselines on benchmark tasks such as document classification with word embedding …

Learning with risk-averse feedback under potentially heavy tails
Matthew Holland · El Mehdi Haress

We study learning algorithms that seek to minimize the conditional value-at-risk (CVaR), when all the learner knows is that the losses (and gradients) incurred may be heavy-tailed. We begin by studying a general-purpose estimator of CVaR for potentially heavy-tailed random variables, which is easy to implement in practice, and requires nothing more than finite variance and a distribution function that does not change too fast or slow around just the quantile of interest. With this estimator in hand, we then derive a new learning algorithm which robustly chooses among candidates produced by stochastic gradient-driven sub-processes, obtain excess CVaR bounds, and finally complement the theory with a regression application.

Prediction with Finitely many Errors Almost Surely
Changlong Wu · Narayana Santhanam

Using only samples from a probabilistic model, we predict properties of the model and of future observations. The prediction game continues in an online fashion as the sample size grows with new observations. After each prediction, the predictor incurs a binary (0-1) loss. The probability model underlying a sample is otherwise unknown except that it belongs to a known class of models. The goal is to make finitely many errors (i.e. loss of 1) with probability 1 under the generating model, no matter what it may be in the known model class.

Model classes admitting predictors that make only finitely many errors are eventually almost surely (eas) predictable. When the losses incurred are observable (the supervised case), we completely characterize eas predictable classes. We provide analogous results in the unsupervised case. Our results have a natural interpretation in terms of regularization. In eas-predictable classes, we study if it is possible to have a universal stopping rule that identifies (to any given confidence) when no more errors will be made. Classes admitting such a stopping rule are eas learnable. When samples are generated iid, we provide a complete characterization of eas learnability. We also study cases when samples are not generated …

Mean-Variance Analysis in Bayesian Optimization under Uncertainty
Shogo Iwazaki · Yu Inatsu · Ichiro Takeuchi
We consider active learning (AL) in an uncertain environment in which trade-off between multiple risk measures need to be considered. As an AL problem in such an uncertain environment, we study Mean-Variance Analysis in Bayesian Optimization (MVA-BO) setting. Mean-variance analysis was developed in the field of financial engineering and has been used to make decisions that take into account the trade-off between the average and variance of investment uncertainty. In this paper, we specifically focus on BO setting with an uncertain component and consider multi-task, multi-objective, and constrained optimization scenarios for the mean-variance trade-off of the uncertain component. When the target blackbox function is modeled by Gaussian Process (GP), we derive the bounds of the two risk measures and propose AL algorithm for each of the above three scenarios based on the risk measure bounds. We show the effectiveness of the proposed AL algorithms through theoretical analysis and numerical experiments.
Sample efficient learning of image-based diagnostic classifiers via probabilistic labels
Roberto Vega · Pouneh Gorji · Zichen Zhang · Xuebin Qin · Abhilash Rakkunedeth · Jeevesh Kapur · Jacob Jaremko · Russell Greiner

Deep learning approaches often require huge datasets to achieve good generalization. This complicates its use in tasks like image-based medical diagnosis, where the small training datasets are usually insufficient to learn appropriate data representations. For such sensitive tasks it is also important to provide the confidence in the predictions. Here, we propose a way to learn and use probabilistic labels to train accurate and calibrated deep networks from relatively small datasets. We observe gains of up to 22% in the accuracy of models trained with these labels, as compared with traditional approaches, in three classification tasks: diagnosis of hip dysplasia, fatty liver, and glaucoma. The outputs of models trained with probabilistic labels are calibrated, allowing the interpretation of its predictions as proper probabilities. We anticipate this approach will apply to other tasks where few training instances are available and expert knowledge can be encoded as probabilities.

Beyond Perturbation Stability: LP Recovery Guarantees for MAP Inference on Noisy Stable Instances
Hunter Lang · Aravind Reddy Talla · David Sontag · Aravindan Vijayaraghavan

Several works have shown that perturbation stable instances of the MAP inference problem can be solved exactly using a natural linear programming (LP) relaxation. However, most of these works give few (or no) guarantees for the LP solutions on instances that do not satisfy the relatively strict perturbation stability definitions. In this work, we go beyond these stability results by showing that the LP approximately recovers the MAP solution of a stable instance even after the instance is corrupted by noise. This "noisy stable" model realistically fits with practical MAP inference problems: we design an algorithm for finding "close" stable instances, and show that several real-world instances from computer vision have nearby instances that are perturbation stable. These results suggest a new theoretical explanation for the excellent performance of this LP relaxation in practice.

Experimental Design for Regret Minimization in Linear Bandits
Andrew Wagenmaker · Julian Katz-Samuels · Kevin Jamieson

In this paper we propose a novel experimental design-based algorithm to minimize regret in online stochastic linear and combinatorial bandits. While existing literature tends to focus on optimism-based algorithms--which have been shown to be suboptimal in many cases--our approach carefully plans which action to take by balancing the tradeoff between information gain and reward, overcoming the failures of optimism. In addition, we leverage tools from the theory of suprema of empirical processes to obtain regret guarantees that scale with the Gaussian width of the action set, avoiding wasteful union bounds. We provide state-of-the-art finite time regret guarantees and show that our algorithm can be applied in both the bandit and semi-bandit feedback regime. In the combinatorial semi-bandit setting, we show that our algorithm is computationally efficient and relies only on calls to a linear maximization oracle. In addition, we show that with slight modification our algorithm can be used for pure exploration, obtaining state-of-the-art pure exploration guarantees in the semi-bandit setting. Finally, we provide, to the best of our knowledge, the first example where optimism fails in the semi-bandit regime, and show that in this setting our algorithm succeeds.

Stability and Risk Bounds of Iterative Hard Thresholding
Xiaotong Yuan · Ping Li
The Iterative Hard Thresholding (IHT) algorithm is one of the most popular and promising greedy pursuit methods for high-dimensional statistical estimation under cardinality constraint. The existing analysis of IHT mostly focuses on parameter estimation and sparsity recovery consistency. From the perspective of statistical learning theory, another fundamental question is how well the IHT estimation would perform on unseen samples. The answer to this question is important for understanding the generalization ability of IHT yet has remaind elusive. In this paper, we investigate this problem and develop a novel generalization theory for IHT from the viewpoint of algorithmic stability. Our theory reveals that: 1) under natural conditions on the empirical risk function over $n$ samples of dimension $p$, IHT with sparsity level $k$ enjoys an $\mathcal{\tilde O}(n^{-1/2}\sqrt{k\log(n)\log(p)})$ rate of convergence in sparse excess risk; and 2) a fast rate of order $\mathcal{\tilde O}(n^{-1}k(\log^3(n)+\log(p)))$ can be derived for strongly convex risk function under certain strong-signal conditions. The results have been substantialized to sparse linear regression and logistic regression models along with numerical evidence provided to support our theory.
Parametric Programming Approach for More Powerful and General Lasso Selective Inference
Vo Nguyen Le Duy · Ichiro Takeuchi

Selective Inference (SI) has been actively studied in the past few years for conducting inference on the features of linear models that are adaptively selected by feature selection methods such as Lasso. The basic idea of SI is to make inference conditional on the selection event. Unfortunately, the main limitation of the original SI approach for Lasso is that the inference is conducted not only conditional on the selected features but also on their signs---this leads to loss of power because of over-conditioning. Although this limitation can be circumvented by considering the union of such selection events for all possible combinations of signs, this is only feasible when the number of selected features is sufficiently small. To address this computational bottleneck, we propose a parametric programming-based method that can conduct SI without conditioning on signs even when we have thousands of active features. The main idea is to compute the continuum path of Lasso solutions in the direction of a test statistic, and identify the subset of the data space corresponding to the feature selection event by following the solution path. The proposed parametric programming-based method not only avoids the aforementioned computational bottleneck but also improves the performance and practicality …

One-pass Stochastic Gradient Descent in overparametrized two-layer neural networks
Hanjing Zhu · Jiaming Xu

There has been a recent surge of interest in understanding the convergence of gradient descent (GD) and stochastic gradient descent (SGD) in overparameterized neural networks. Most previous work assumes that the training data is provided a priori in a batch, while less attention has been paid to the important setting where the training data arrives in a stream. In this paper, we study the streaming data setup and show that with overparamterization and random initialization, the prediction error of two-layer neural networks under one-pass SGD converges in expectation. The convergence rate depends on the eigen-decomposition of the integral operator associated with the so-called neural tangent kernel (NTK). A key step of our analysis is to show a random kernel function converges to the NTK with high probability using the VC dimension and McDiarmid's inequality.

On the Absence of Spurious Local Minima in Nonlinear Low-Rank Matrix Recovery Problems
Yingjie Bi · Javad Lavaei

The restricted isometry property (RIP) is a well-known condition that guarantees the absence of spurious local minima in low-rank matrix recovery problems with linear measurements. In this paper, we introduce a novel property named bound difference property (BDP) to study low-rank matrix recovery problems with nonlinear measurements. Using RIP and BDP jointly, we propose a new criterion to certify the nonexistence of spurious local minima in the rank-1 case, and prove that it leads to a much stronger theoretical guarantee than the existing bounds on RIP.

Flow-based Alignment Approaches for Probability Measures in Different Spaces
Tam Le · Nhat Ho · Makoto Yamada

Gromov-Wasserstein (GW) is a powerful tool to compare probability measures whose supports are in different metric spaces. However, GW suffers from a computational drawback since it requires to solve a complex non-convex quadratic program. In this work, we consider a specific family of cost metrics, namely, tree metrics for supports of each probability measure, to develop efficient and scalable discrepancies between the probability measures. Leveraging a tree structure, we propose to align flows from a root to each support instead of pair-wise tree metrics of supports, i.e., flows from a support to another support, in GW. Consequently, we propose a novel discrepancy, named Flow-based Alignment (FlowAlign), by matching the flows of the probability measures. FlowAlign is computationally fast and scalable for large-scale applications. Further exploring the tree structure, we propose a variant of FlowAlign, named Depth-based Alignment (DepthAlign), by aligning the flows hierarchically along each depth level of the tree structures. Theoretically, we prove that both FlowAlign and DepthAlign are pseudo-metrics. We also derive tree-sliced variants of the proposed discrepancies for applications without prior knowledge about tree structures for probability measures, computed by averaging FlowAlign/DepthAlign using random tree metrics, adaptively sampled from supports of probability measures. Empirically, we test our …

Exponential Convergence Rates of Classification Errors on Learning with SGD and Random Features
Shingo Yashima · Atsushi Nitanda · Taiji Suzuki

Although kernel methods are widely used in many learning problems, they have poor scalability to large datasets. To address this problem, sketching and stochastic gradient methods are the most commonly used techniques to derive computationally efficient learning algorithms. We consider solving a binary classification problem using random features and stochastic gradient descent, both of which are common and widely used in practical large-scale problems. Although there are plenty of previous works investigating the efficiency of these algorithms in terms of the convergence of the objective loss function, these results suggest that the computational gain comes at expense of the learning accuracy when dealing with general Lipschitz loss functions such as logistic loss. In this study, we analyze the properties of these algorithms in terms of the convergence not of the loss function, but the classification error under the strong low-noise condition, which reflects a realistic property of real-world datasets. We extend previous studies on SGD to a random features setting, examining a novel analysis about the error induced by the approximation of random features in terms of the distance between the generated hypothesis to show that an exponential convergence of the expected classification error is achieved even if random features …

RankDistil: Knowledge Distillation for Ranking
Sashank Reddi · Rama Kumar Pasumarthi · Aditya Menon · Ankit Singh Rawat · Felix Yu · Seungyeon Kim · Andreas Veit · Sanjiv Kumar

Knowledge distillation is an approach to improve the performance of a student model by using the knowledge of a complex teacher. Despite its success in several deep learning applications, the study of distillation is mostly confined to classification settings. In particular, the use of distillation in top-k ranking settings, where the goal is to rank k most relevant items correctly, remains largely unexplored. In this paper, we study such ranking problems through the lens of distillation. We present a distillation framework for top-k ranking and draw connections with the existing ranking methods. The core idea of this framework is to preserve the ranking at the top by matching the order of items of student and teacher, while penalizing large scores for items ranked low by the teacher. Building on this, we develop a novel distillation approach, RankDistil, specifically catered towards ranking problems with a large number of items to rank, and establish statistical basis for the method. Finally, we conduct experiments which demonstrate that RankDistil yields benefits over commonly used baselines for ranking problems.

Multi-Fidelity High-Order Gaussian Processes for Physical Simulation
Zheng Wang · Wei Xing · Robert Kirby · Shandian Zhe

The key task of physical simulation is to solve partial differential equations (PDEs) on discretized domains, which is known to be costly. In particular, high-fidelity solutions are much more expensive than low-fidelity ones. To reduce the cost, we consider novel Gaussian process (GP) models that leverage simulation examples of different fidelities to predict high-dimensional PDE solution outputs. Existing GP methods are either not scalable to high-dimensional outputs or lack effective strategies to integrate multi-fidelity examples. To address these issues, we propose Multi-Fidelity High-Order Gaussian Process (MFHoGP) that can capture complex correlations both between the outputs and between the fidelities to enhance solution estimation, and scale to large numbers of outputs. Based on a novel nonlinear coregionalization model, MFHoGP propagates bases throughout fidelities to fuse information, and places a deep matrix GP prior over the basis weights to capture the (nonlinear) relationships across the fidelities. To improve inference efficiency and quality, we use bases decomposition to largely reduce the model parameters, and layer-wise matrix Gaussian posteriors to capture the posterior dependency and to simplify the computation. Our stochastic variational learning algorithm successfully handles millions of outputs without extra sparse approximations. We show the advantages of our method in several typical applications.

One-Round Communication Efficient Distributed M-Estimation
Yajie Bao · Weijia Xiong

Communication cost and local computation complexity are two main bottlenecks of the distributed statistical learning. In this paper, we consider the distributed M-estimation problem in both regular and sparse case and propose a novel one-round communication efficient algorithm. For regular distributed M-estimator, the asymptotic normality is provided to conduct statistical inference. For sparse distributed M-estimator, we only require solving a quadratic Lasso problem in the master machine using the same local information as the regular distributed M-estimator. Consequently, the computation complexity of the local machine is sufficiently reduced compared with the existing debiased sparse estimator. Under mild conditions, the theoretical results guarantee that our proposed distributed estimators achieve (near)optimal statistical convergence rate. The effectiveness of our proposed algorithm is verified through experiments across different M-estimation problems using both synthetic and real benchmark datasets.

Communication Efficient Primal-Dual Algorithm for Nonconvex Nonsmooth Distributed Optimization
Congliang Chen · Jiawei Zhang · Li Shen · Peilin Zhao · Zhiquan Luo
Decentralized optimization problems frequently appear in the large scale machine learning problems. However, few works work on the difficult nonconvex nonsmooth case. In this paper, we propose a decentralized primal-dual algorithm to solve this type of problem in a decentralized manner and the proposed algorithm can achieve an $\mathcal{O}(1/\epsilon^2)$ iteration complexity to attain an $\epsilon-$solution, which is the well-known lower iteration complexity bound for nonconvex optimization. To our knowledge, it is the first algorithm achieving this rate under a nonconvex, nonsmooth decentralized setting. Furthermore, to reduce communication overhead, we also modifying our algorithm by compressing the vectors exchanged between agents. The iteration complexity of the algorithm with compression is still $\mathcal{O}(1/\epsilon^2)$. Besides, we apply the proposed algorithm to solve nonconvex linear regression problem and train deep learning model, both of which demonstrate the efficiency and efficacy of the proposed algorithm.
No-Regret Reinforcement Learning with Heavy-Tailed Rewards
Vincent Zhuang · Yanan Sui

Reinforcement learning algorithms typically assume rewards to be sampled from light-tailed distributions, such as Gaussian or bounded. However, a wide variety of real-world systems generate rewards that follow heavy-tailed distributions. We consider such scenarios in the setting of undiscounted reinforcement learning. By constructing a lower bound, we show that the difficulty of learning heavy-tailed rewards asymptotically dominates the difficulty of learning transition probabilities. Leveraging techniques from robust mean estimation, we propose Heavy-UCRL2 and Heavy-Q-Learning, and show that they achieve near-optimal regret bounds in this setting. Our algorithms also naturally generalize to deep reinforcement learning applications; we instantiate Heavy-DQN as an example of this. We demonstrate that all of our algorithms outperform baselines on both synthetic MDPs and standard RL benchmarks.

Regret Minimization for Causal Inference on Large Treatment Space
Akira Tanimoto · Tomoya Sakai · Takashi Takenouchi · Hisashi Kashima

Predicting which action (treatment) will lead to a better outcome is a central task in decision support systems. To build a prediction model in real situations, learning from observational data with a sampling bias is a critical issue due to the lack of randomized controlled trial (RCT) data. To handle such biased observational data, recent efforts in causal inference and counterfactual machine learning have focused on debiased estimation of the potential outcomes on a binary action space and the difference between them, namely, the individual treatment effect. When it comes to a large action space (e.g., selecting an appropriate combination of medicines for a patient), however, the regression accuracy of the potential outcomes is no longer sufficient in practical terms to achieve a good decision-making performance. This is because a high mean accuracy on the large action space does not guarantee the nonexistence of a single potential outcome misestimation that misleads the whole decision. Our proposed loss minimizes the classification error of whether or not the action is relatively good for the individual target among all feasible actions, which further improves the decision-making performance, as we demonstrate. We also propose a network architecture and a regularizer that extracts a debiased …

Statistical Guarantees for Transformation Based Models with applications to Implicit Variational Inference
Sean Plummer · Shuang Zhou · Anirban Bhattacharya · David Dunson · Debdeep Pati
Transformation based methods have been an attractive approach in non-parametric inference for problems such as unconditioned and conditional density estimation due to their unique hierarchical structure that models the data as flexible transformation of a set of common latent variables. More recently, transformation based models have been used in variational inference (VI) to construct flexible implicit families of variational distributions. However, their use in both non-parametric inference and variational inference lacks theoretical justification. In the context of non-linear latent variable models (NL-LVM), we provide theoretical justification for the use of these models in non-parametric inference by showing that the support of the transformation induced prior in the space of densities is sufficiently large in the $L_1$ sense and show that for this class of priors the posterior concentrates at the optimal rate up to a logarithmic factor. Adopting the flexibility demonstrated in the non-parametric setting we use the NL-LVM to construct an implicit family of variational distributions, deemed as GP-IVI. We delineate sufficient conditions under which GP-IVI achieves optimal risk bounds and approximates the true posterior in the sense of the Kullback-Leibler divergence. To the best of our knowledge, this is the first work on providing theoretical guarantees for implicit …
On the Generalization Properties of Adversarial Training
Yue Xing · Qifan Song · Guang Cheng
Modern machine learning and deep learning models are shown to be vulnerable when testing data are slightly perturbed. Theoretical studies of adversarial training algorithms mostly focus on their adversarial training losses or local convergence properties. In contrast, this paper studies the generalization performance of a generic adversarial training algorithm. Specifically, we consider linear regression models and two-layer neural networks (with lazy training) using squared loss under low-dimensional regime and high-dimensional regime. In the former regime, after overcoming the non-smoothness of adversarial training, the adversarial risk of the trained models will converge to the minimal adversarial risk. In the latter regime, we discover that data interpolation prevents the adversarial robust estimator from being consistent (i.e. converge in probability). Therefore, inspired by successes of the least absolute shrinkage and selection operator (LASSO), we incorporate the $\mathcal{L}_1$ penalty in the high dimensional adversarial learning, and show that it leads to consistent adversarial robust estimation. A series of numerical studies are conducted to demonstrate that how the smoothness and $\mathcal{L}_1$ penalization help to improve the adversarial robustness of DNN models.
On the High Accuracy Limitation of Adaptive Property Estimation
Yanjun Han
Recent years have witnessed the success of adaptive (or unified) approaches in estimating symmetric properties of discrete distributions, where the learner first obtains a distribution estimator independent of the target property, and then plugs the estimator into the target property as the final estimator. Several such approaches have been proposed and proved to be adaptively optimal, i.e. they achieve the optimal sample complexity for a large class of properties within a low accuracy, especially for a large estimation error $\varepsilon\gg n^{-1/3}$ where $n$ is the sample size. In this paper, we characterize the high accuracy limitation, or the penalty for adaptation, for general adaptive approaches. Specifically, we obtain the first known adaptation lower bound that under a mild condition, any adaptive approach cannot achieve the optimal sample complexity for every $1$-Lipschitz property within accuracy $\varepsilon \ll n^{-1/3}$. In particular, this result disproves a conjecture in [Acharya et al. 2017] that the profile maximum likelihood (PML) plug-in approach is optimal in property estimation for all ranges of $\varepsilon$, and confirms a conjecture in [Han and Shiragur 2020] that their competitive analysis of the PML is tight.
Unifying Clustered and Non-stationary Bandits
Chuanhao Li · Qingyun Wu · Hongning Wang

Non-stationary bandits and clustered bandits lift the restrictive assumptions in contextual bandits and provide solutions to many important real-world scenarios. Though they have been studied independently so far, we point out the essence in solving these two problems overlaps considerably. In this work, we connect these two strands of bandit research under the notion of test of homogeneity, which seamlessly addresses change detection for non-stationary bandit and cluster identification for clustered bandit in a unified solution framework. Rigorous regret analysis and extensive empirical evaluations demonstrate the value of our proposed solution, especially its flexibility in handling various environment assumptions, e.g., a clustered non-stationary environment.

vqSGD: Vector Quantized Stochastic Gradient Descent
Venkata Gandikota · Daniel Kane · Raj Kumar Maity · Arya Mazumdar
In this work, we present a family of vector quantization schemes vqSGD (Vector-Quantized Stochastic Gradient Descent) that provide an asymptotic reduction in the communication cost with convergence guarantees in first-order distributed optimization. In the process we derive the following fundamental information theoretic fact: $\Theta(\frac{d}{R^2})$ bits are necessary and sufficient (up to an additive $O(\log d)$ term) to describe an unbiased estimator $\hat{g}(g)$ for any $g$ in the $d$-dimensional unit sphere, under the constraint that $\|\hat{g}(g)\|_2\le R$ almost surely. In particular, we consider a randomized scheme based on the convex hull of a point set, that returns an unbiased estimator of a $d$-dimensional gradient vector with almost surely bounded norm. We provide multiple efficient instances of our scheme, that are near optimal, and require only $o(d)$ bits of communication at the expense of tolerable increase in error. The instances of our quantization scheme are obtained using the properties of binary error-correcting codes and provide a smooth tradeoff between the communication and the estimation error of quantization. Furthermore, we show that vqSGD also offers some automatic privacy guarantees.
Diagnostic Uncertainty Calibration: Towards Reliable Machine Predictions in Medical Domain
Takahiro Mimori · Keiko Sasada · Hirotaka Matsui · Issei Sato

We propose an evaluation framework for class probability estimates (CPEs) in the presence of label uncertainty, which is commonly observed as diagnosis disagreement between experts in the medical domain. We also formalize evaluation metrics for higher-order statistics, including inter-rater disagreement, to assess predictions on label uncertainty. Moreover, we propose a novel post-hoc method called alpha-calibration, that equips neural network classifiers with calibrated distributions over CPEs. Using synthetic experiments and a large-scale medical imaging application, we show that our approach significantly enhances the reliability of uncertainty estimates: disagreement probabilities and posterior CPEs.

Tractable contextual bandits beyond realizability
Sanath Kumar Krishnamurthy · Vitor Hadad · Susan Athey

Tractable contextual bandit algorithms often rely on the realizability assumption – i.e., that the true expected reward model belongs to a known class, such as linear functions. In this work, we present a tractable bandit algorithm that is not sensitive to the realizability assumption and computationally reduces to solving a constrained regression problem in every epoch. When realizability does not hold, our algorithm ensures the same guarantees on regret achieved by realizability-based algorithms under realizability, up to an additive term that accounts for the misspecification error. This extra term is proportional to T times a function of the mean squared error between the best model in the class and the true model, where T is the total number of time-steps. Our work sheds light on the bias-variance trade-off for tractable contextual bandits. This trade-off is not captured by algorithms that assume realizability, since under this assumption there exists an estimator in the class that attains zero bias.

On the Suboptimality of Negative Momentum for Minimax Optimization
Guodong Zhang · Yuanhao Wang

Smooth game optimization has recently attracted great interest in machine learning as it generalizes the single-objective optimization paradigm. However, game dynamics is more complex due to the interaction between different players and is therefore fundamentally different from minimization, posing new challenges for algorithm design. Notably, it has been shown that negative momentum is preferred due to its ability to reduce oscillation in game dynamics. Nevertheless, existing analysis of negative momentum was restricted to simple bilinear games. In this paper, we extend the analysis to smooth and strongly-convex strongly-concave minimax games by taking the variational inequality formulation. By connecting Polyak’s momentum with Chebyshev polynomials, we show that negative momentum accelerates convergence of game dynamics locally, though with a suboptimal rate. To the best of our knowledge, this is the \emph{first work} that provides an explicit convergence rate for negative momentum in this setting.

Semi-Supervised Aggregation of Dependent Weak Supervision Sources With Performance Guarantees
Alessio Mazzetto · Dylan Sam · Andrew Park · Eli Upfal · Stephen Bach

We develop the first theoretical guarantees for learning from weak labelers without the (mostly unrealistic) assumption that the errors of the weak labelers are independent or come from a particular family of distributions. We show a rigorous technique for efficiently selecting small subsets of the labelers so that a majority vote from such subsets has a provably low error rate. We explore several extensions of this method and provide experimental results over a range of labeled data set sizes on 45 image classification tasks. Our performance-guaranteed methods consistently match the best performing alternative, which varies based on problem difficulty. On tasks with accurate weak labelers, our methods are on average 3 percentage points more accurate than the state-of-the-art adversarial method. On tasks with inaccurate weak labelers, our methods are on average 15 percentage points more accurate than the semi-supervised Dawid-Skene model (which assumes independence).

Corralling Stochastic Bandit Algorithms
Raman Arora · Teodor Vanislavov Marinov · Mehryar Mohri

We study the problem of corralling stochastic bandit algorithms, that is combining multiple bandit algorithms designed for a stochastic environment, with the goal of devising a corralling algorithm that performs almost as well as the best base algorithm. We give two general algorithms for this setting, which we show benefit from favorable regret guarantees. We show that the regret of the corralling algorithms is no worse than that of the best algorithm containing the arm with the highest reward, and depends on the gap between the highest reward and other rewards.

CADA: Communication-Adaptive Distributed Adam
Tianyi Chen · Ziye Guo · Yuejiao Sun · Wotao Yin

Stochastic gradient descent (SGD) has taken the stage as the primary workhorse for largescale machine learning. It is often used with its adaptive variants such as AdaGrad, Adam, and AMSGrad. This paper proposes an adaptive stochastic gradient descent method for distributed machine learning, which can be viewed as the communicationadaptive counterpart of the celebrated Adam method — justifying its name CADA. The key components of CADA are a set of new rules tailored for adaptive stochastic gradients that can be implemented to save communication upload. The new algorithms adaptively reuse the stale Adam gradients, thus saving communication, and still have convergence rates comparable to original Adam. In numerical experiments, CADA achieves impressive empirical performance in terms of total communication round reduction.

Robustness and scalability under heavy tails, without strong convexity
Matthew Holland

Real-world data is laden with outlying values. The challenge for machine learning is that the learner typically has no prior knowledge of whether the feedback it receives (losses, gradients, etc.) will be heavy-tailed or not. In this work, we study a simple, cost-efficient algorithmic strategy that can be leveraged when both losses and gradients can be heavy-tailed. The core technique introduces a simple robust validation sub-routine, which is used to boost the confidence of inexpensive gradient-based sub-processes. Compared with recent robust gradient descent methods from the literature, dimension dependence (both risk bounds and cost) is substantially improved, without relying upon strong convexity or expensive per-step robustification. We also empirically show that the proposed procedure cannot simply be replaced with naive cross-validation.

Causal Modeling with Stochastic Confounders
Thanh Vinh Vo · Pengfei Wei · Wicher Bergsma · Tze Yun Leong

This work extends causal inference in temporal models with stochastic confounders. We propose a new approach to variational estimation of causal inference based on a representer theorem with a random input space. We estimate causal effects involving latent confounders that may be interdependent and time-varying from sequential, repeated measurements in an observational study. Our approach extends current work that assumes independent, non-temporal latent confounders with potentially biased estimators. We introduce a simple yet elegant algorithm without parametric specification on model components. Our method avoids the need for expensive and careful parameterization in deploying complex models, such as deep neural networks in existing approaches, for causal inference and analysis. We demonstrate the effectiveness of our approach on various benchmark temporal datasets.

Dual Principal Component Pursuit for Learning a Union of Hyperplanes: Theory and Algorithms
Tianyu Ding · Zhihui Zhu · Manolis Tsakiris · Rene Vidal · Daniel Robinson

State-of-the-art subspace clustering methods are based on convex formulations whose theoretical guarantees require the subspaces to be low-dimensional. Dual Principal Component Pursuit (DPCP) is a non-convex method that is specifically designed for learning high-dimensional subspaces, such as hyperplanes. However, existing analyses of DPCP in the multi-hyperplane case lack a precise characterization of the distribution of the data and involve quantities that are difficult to interpret. Moreover, the provable algorithm based on recursive linear programming is not efficient. In this paper, we introduce a new notion of geometric dominance, which explicitly captures the distribution of the data, and derive both geometric and probabilistic conditions under which a global solution to DPCP is a normal vector to a geometrically dominant hyperplane. We then prove that the DPCP problem for a union of hyperplanes satisfies a Riemannian regularity condition, and use this result to show that a scalable Riemannian subgradient method exhibits (local) linear convergence to the normal vector of the geometrically dominant hyperplane. Finally, we show that integrating DPCP into popular subspace clustering schemes, such as K-ensembles, leads to superior or competitive performance over the state-of-the-art in clustering hyperplanes.

Good Classifiers are Abundant in the Interpolating Regime
Ryan Theisen · Jason Klusowski · Michael Mahoney
Within the machine learning community, the widely-used uniform convergence framework has been used to answer the question of how complex, over-parameterized models can generalize well to new data. This approach bounds the test error of the \emph{worst-case} model one could have fit to the data, but it has fundamental limitations. Inspired by the statistical mechanics approach to learning, we formally define and develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers from several model classes. We apply our method to compute this distribution for several real and synthetic datasets, with both linear and random feature classification models. We find that test errors tend to concentrate around a small \emph{typical} value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model on the same datasets, indicating that ``bad'' classifiers are extremely rare. We provide theoretical results in a simple setting in which we characterize the full asymptotic distribution of test errors, and we show that these indeed concentrate around a value $\varepsilon^*$, which we also identify exactly. We then formalize a more general conjecture supported by our empirical findings. Our results show that the usual style of analysis in statistical learning theory …
Dominate or Delete: Decentralized Competing Bandits in Serial Dictatorship
Abishek Sankararaman · Soumya Basu · Karthik Abinav Sankararaman

Online learning in a two-sided matching market, with demand side agents continuously competing to be matched with supply side (arms), abstracts the complex interactions under partial information on matching platforms (e.g. UpWork, TaskRabbit). We study the decentralized serial dictatorship setting, a two-sided matching market where the demand side agents have unknown and heterogeneous valuation over the supply side (arms), while the arms have known uniform preference over the demand side (agents). We design the first decentralized algorithm - UCB with Decentralized Dominant-arm Deletion (UCB-D3), for the agents, that does not require any knowledge of reward gaps or time horizon. UCB-D3 works in phases, where in each phase, agents delete dominated arms -- the arms preferred by higher ranked agents, and play only from the non-dominated arms according to the UCB. At the end of the phase, agents broadcast in a decentralized fashion, their estimated preferred arms through pure exploitation. We prove a new regret lower bound for the decentralized serial dictatorship model, and prove that UCB-D3 achieves order optimal regret guarantee.

Graph Gamma Process Linear Dynamical Systems
Rahi Kalantari · Mingyuan Zhou
We introduce graph gamma process (GGP) linear dynamical systems to model real-valued multivariate time series. GGP generates $S$ latent states that are shared by $K$ different communities, each of which is characterized by its own pattern of activation probabilities imposed on a $S\times S$ directed sparse graph, and allow both $S$ and $K$ to grow without bound. For temporal pattern discovery, the latent representation under the model is used to decompose the time series into a parsimonious set of multivariate sub-sequences generated by formed communities. In each sub-sequence, different data dimensions often share similar temporal patterns but may exhibit distinct magnitudes, and hence allowing the superposition of all sub-sequences to exhibit diverse behaviors at different data dimensions. On both synthetic and real-world time series, the proposed nonparametric Bayesian dynamic models, which are initialized at random, consistently exhibit good predictive performance in comparison to a variety of baseline models, revealing interpretable latent state transition patterns and decomposing the time series into distinctly behaved sub-sequences.
Bandit algorithms: Letting go of logarithmic regret for statistical robustness
Kumar Ashutosh · Jayakrishnan Nair · Anmol Kagrecha · Krishna Jagannathan

We study regret minimization in a stochastic multi-armed bandit setting, and establish a fundamental trade-off between the regret suffered under an algorithm, and its statistical robustness. Considering broad classes of underlying arms' distributions, we show that bandit learning algorithms with logarithmic regret are always inconsistent and that consistent learning algorithms always suffer a super-logarithmic regret. This result highlights the inevitable statistical fragility of all `logarithmic regret' bandit algorithms available in the literature - for instance, if a UCB algorithm designed for 1-subGaussian distributions is used in a subGaussian setting with a mismatched variance parameter, the learning performance could be inconsistent. Next, we show a positive result: statistically robust and consistent learning performance is attainable if we allow the regret to be slightly worse than logarithmic. Specifically, we propose three classes of distribution oblivious algorithms that achieve an asymptotic regret that is arbitrarily close to logarithmic.

Homeomorphic-Invariance of EM: Non-Asymptotic Convergence in KL Divergence for Exponential Families via Mirror Descent
Frederik Kunstner · Raunak Kumar · Mark Schmidt

Expectation maximization (EM) is the default algorithm for fitting probabilistic models with missing or latent variables, yet we lack a full understanding of its non-asymptotic convergence properties. Previous works show results along the lines of "EM converges at least as fast as gradient descent" by assuming the conditions for the convergence of gradient descent apply to EM. This approach is not only loose, in that it does not capture that EM can make more progress than a gradient step, but the assumptions fail to hold for textbook examples of EM like Gaussian mixtures. In this work we first show that for the common setting of exponential family distributions, viewing EM as a mirror descent algorithm leads to convergence rates in Kullback-Leibler (KL) divergence. Then, we show how the KL divergence is related to first-order stationarity via Bregman divergences. In contrast to previous works, the analysis is invariant to the choice of parametrization and holds with minimal assumptions. We also show applications of these ideas to local linear (and superlinear) convergence rates, generalized EM, and non-exponential family distributions.

Tracking Regret Bounds for Online Submodular Optimization
Tatsuya Matsuoka · Shinji Ito · Naoto Ohsaka

In this paper, we propose algorithms for online submodular optimization with tracking regret bounds. Online submodular optimization is a generic framework for sequential decision making used to select subsets. Existing algorithms for online submodular optimization have been shown to achieve small (static) regret, which means that the algorithm's performance is comparable to the performance of a fixed optimal action. Such algorithms, however, may perform poorly in an environment that changes over time. To overcome this problem, we apply a tracking-regret-analysis framework to online submodular optimization, one by which output is assessed through comparison with time-varying optimal subsets. We propose algorithms for submodular minimization, monotone submodular maximization under a size constraint, and unconstrained submodular maximization, and we show tracking regret bounds. In addition, we show that our tracking regret bound for submodular minimization is nearly tight.

Causal Inference with Selectively Deconfounded Data
Jingyi Gan · Andrew Li · Zachary Lipton · Sridhar Tayur

Given only data generated by a standard confounding graph with unobserved confounder, the Average Treatment Effect (ATE) is not identifiable. To estimate the ATE, a practitioner must then either (a) collect deconfounded data; (b) run a clinical trial; or (c) elucidate further properties of the causal graph that might render the ATE identifiable. In this paper, we consider the benefit of incorporating a large \emph{confounded} observational dataset (\emph{confounder unobserved}) alongside a small \emph{deconfounded} observational dataset (\emph{confounder revealed}) when estimating the ATE. Our theoretical results {\red suggest} that the inclusion of confounded data can significantly reduce the quantity of deconfounded data required to estimate the ATE to within a desired accuracy level. Moreover, in some cases---say, genetics---we could imagine retrospectively selecting samples to deconfound. We demonstrate that by actively selecting these samples based upon the (already observed) treatment and outcome, we can reduce sample complexity further. Our theoretical and empirical results establish that the worst-case relative performance of our approach (vs. a natural benchmark) is bounded while our best-case gains are unbounded. Finally, we demonstrate the benefits of selective deconfounding using a large real-world dataset related to genetic mutation in cancer.

Understanding Gradient Clipping In Incremental Gradient Methods
Jiang Qian · Yuren Wu · Bojin Zhuang · Shaojun Wang · Jing Xiao

We provide a theoretical analysis on how gradient clipping affects the convergence of the incremental gradient methods on minimizing an objective function that is the sum of a large number of component functions. We show that clipping on gradients of component functions leads to bias on the descent direction, which is affected by the clipping threshold, the norms of gradients of component functions, together with the angles between gradients of component functions and the full gradient. We then propose some sufficient conditions under which the increment gradient methods with gradient clipping can be shown to be convergent under the more general relaxed smoothness assumption. We also empirically observe that the angles between gradients of component functions and the full gradient generally decrease as the batchsize increases, which may help to explain why larger batchsizes generally lead to faster convergence in training deep neural networks with gradient clipping.

Non-Volume Preserving Hamiltonian Monte Carlo and No-U-TurnSamplers
Hadi Mohasel Afshar · Rafael Oliveira · Sally Cripps

Volume preservation is usually regarded as a necessary property for the leapfrog transition functions that are used in Hamiltonian Monte Carlo (HMC) and No-U-Turn (NUTS) samplers to guarantee convergence to the target distribution. In this work we rigorously prove that with minimal algorithmic modifications, both HMC and NUTS can be combined with transition functions that are not necessarily volume preserving. In light of these results, we propose a non-volume preserving transition function that conserves the Hamiltonian better than the baseline leapfrog mechanism, on piecewise-continuous distributions. The resulting samplers do not require any assumptions on the geometry of the discontinuity boundaries, and our experimental results show a significant improvement upon traditional HMC and NUTS.

Maximal Couplings of the Metropolis-Hastings Algorithm
Guanyang Wang · John O'Leary · Pierre Jacob

Couplings play a central role in the analysis of Markov chain Monte Carlo algorithms and appear increasingly often in the algorithms themselves, e.g. in convergence diagnostics, parallelization, and variance reduction techniques. Existing couplings of the Metropolis-Hastings algorithm handle the proposal and acceptance steps separately and fall short of the upper bound on one-step meeting probabilities given by the coupling inequality. This paper introduces maximal couplings which achieve this bound while retaining the practical advantages of current methods. We consider the properties of these couplings and examine their behavior on a selection of numerical examples.


Mentorship Session 2 Tue 13 Apr 08:30 p.m.