Abstract:

Chat is not available.

Prem Talwai · Ali Shameli · David Simchi-Levi

We develop novel learning rates for conditional mean embeddings by applying the theory of interpolation for reproducing kernel Hilbert spaces (RKHS). We derive explicit, adaptive convergence rates for the sample estimator under the misspecifed setting, where the target operator is not Hilbert-Schmidt or bounded with respect to the input/output RKHSs. We demonstrate that in certain parameter regimes, we can achieve uniform convergence rates in the output RKHS. We hope our analyses will allow the much broader application of conditional mean embeddings to more complex ML/RL settings involving infinite dimensional RKHSs and continuous state spaces.

Sahin Lale · Kamyar Azizzadenesheli · Babak Hassibi · Animashree Anandkumar

In this work, we study model-based reinforcement learning (RL) in unknown stabilizable linear dynamical systems. When learning a dynamical system, one needs to stabilize the unknown dynamics in order to avoid system blow-ups. We propose an algorithm that certifies fast stabilization of the underlying system by effectively exploring the environment with an improved exploration strategy. We show that the proposed algorithm attains $\Tilde{\mathcal{O}}(\sqrt{T})$ regret after $T$ time steps of agent-environment interaction. We also show that the regret of the proposed algorithm has only a polynomial dependence in the problem dimensions, which gives an exponential improvement over the prior methods. Our improved exploration method is simple, yet efficient, and it combines a sophisticated exploration policy in RL with an isotropic exploration strategy to achieve fast stabilization and improved regret. We empirically demonstrate that the proposed algorithm outperforms other popular methods in several adaptive control tasks.

Variational inference for state space models (SSMs) is known to be hard in general. Recent works focus on deriving variational objectives for SSMs from unbiased sequential Monte Carlo estimators. We reveal that the marginal particle filter is obtained from sequential Monte Carlo by applying Rao-Blackwellization operations, which sacrifices the trajectory information for reduced variance and differentiability. We propose the variational marginal particle filter (VMPF), which is a differentiable and reparameterizable variational filtering objective for SSMs based on an unbiased estimator. We find that VMPF with biased gradients gives tighter bounds than previous objectives, and the unbiased reparameterization gradients are sometimes beneficial.

Motivated by practical needs in online experimentation and off-policy learning, we study the problem of safe optimal design, where we develop a data logging policy that efficiently explores while achieving competitive rewards with a baseline production policy. We first show, perhaps surprisingly, that a common practice of mixing the production policy with uniform exploration, despite being safe, is sub-optimal in maximizing information gain. Then we propose a safe optimal logging policy for the case when no side information about the actions' expected rewards is available. We improve upon this design by considering side information and also extend both approaches to a large number of actions with a linear reward model. We analyze how our data logging policies impact errors in off-policy learning. Finally, we empirically validate the benefit of our designs by conducting extensive experiments.

Anish Acharya · AbolfazL Hashemi · Prateek Jain · Sujay Sanghavi · Inderjit Dhillon · Ufuk Topcu

Geometric median (GM) is a classical methodin statistics for achieving robust estimationof the uncorrupted data; under gross corruption,it achieves the optimal breakdownpoint of 1/2. However, its computationalcomplexity makes it infeasible for robustifyingstochastic gradient descent (SGD) inhigh-dimensional optimization problems. Inthis paper, we show that by applying GMto only a judiciously chosen block of coordinatesat a time and using a memory mechanism,one can retain the breakdown pointof 1/2 for smooth non-convex problems, withnon-asymptotic convergence rates comparableto the SGD with GM while resultingin significant speedup in training. We furthervalidate the run-time and robustness ofour approach empirically on several populardeep learning tasks. Code available at:https://github.com/anishacharya/BGMD

Samory Kpotufe · Gan Yuan · YUNFAN ZHAO

We consider nonparametric classification with smooth regression functions, where it is well known that notions of margin in E[Y|X] determine fast or slow rates in both active and passive learning. Here we elucidate a striking distinction between the two settings. Namely, we show that some seemingly benign nuances in notions of margin - involving the uniqueness of the Bayes classifier, and which have no apparent effect on rates in passive learning - determine whether or not any active learner can outperform passive learning rates. In particular, for Audibert-Tsybakov's margin condition (allowing general situations with non-unique Bayes classifiers), no active learner can gain over passive learning in commonly studied settings where the marginal on X is near uniform. Our results thus negate the usual intuition from past literature that active rates should improve over passive rates in nonparametric settings.

Tongzheng Ren · Fuheng Cui · Alexia Atsidakou · Sujay Sanghavi · Nhat Ho

We study the statistical and computational complexities of the Polyak step size gradient descent algorithm under generalized smoothness and Łojasiewicz conditions of the population loss function, namely, the limit of the empirical loss function when the sample size goes to infinity, and the stability between the gradients of the empirical and population loss functions, namely, the polynomial growth on the concentration bound between the gradients of sample and population loss functions. We demonstrate that the Polyak step size gradient descent iterates reach a final statistical radius of convergence around the true parameter after logarithmic number of iterations in terms of the sample size. It is computationally cheaper than the polynomial number of iterations on the sample size of the fixed-step size gradient descent algorithm to reach the same final statistical radius when the population loss function is not locally strongly convex. Finally, we illustrate our general theory under three statistical examples: generalized linear model, mixture model, and mixed linear regression model.

Lisha Chen · Tianyi Chen

Meta learning aims at learning a model that can quickly adapt to unseen tasks. Widely used meta learning methods include model agnostic meta learning (MAML), implicit MAML, Bayesian MAML. Thanks to its ability of modeling uncertainty, Bayesian MAML often has advantageous empirical performance. However, the theoretical understanding of Bayesian MAML is still limited, especially on questions such as if and when Bayesian MAML has provably better performance than MAML. In this paper, we aim to provide theoretical justifications for Bayesian MAML’s advantageous performance by comparing the meta test risks of MAML and Bayesian MAML. In the meta linear regression, under both the distribution agnostic and linear centroid cases, we have established that Bayesian MAML indeed has provably lower meta test risks than MAML. We verify our theoretical results through experiments, the code of which is available at https://github.com/lishachen/Bayesian-MAML-vs-MAML.

Flavio Chierichetti · Alessandro Panconesi · Giuseppe Re · Luca Trevisan

Correlation Clustering is an important clustering problem with many applications. We study the reconstruction version of this problem, in which one seeks to reconstruct a latent clustering that has been corrupted by random noise and adversarial modifications. Concerning the latter, there is a standard “post-adversarial” model in the literature, in which adversarial modifications come after the noise. Here, we introduce and analyse a “pre-adversarial” model, in which adversarial modifications come before the noise. Given an input coming from such a semi-adversarial generative model, the goal is to approximately reconstruct with high probability the latent clustering. We focus on the case where the hidden clusters have nearly equal size and show the following. In the pre-adversarial setting, spectral algorithms are optimal, in the sense that they reconstruct all the way to the information-theoretic threshold beyond which no reconstruction is possible. This is in contrast to the post-adversarial setting, in which their ability to restore the hidden clusters stops before the threshold, but the gap is optimally filled by SDP-based algorithms.These results highlight a heretofore unknown robustness of spectral algorithms, showing them less brittle than previously thought.

Kishan Panaganti · Dileep Kalathil

The Robust Markov Decision Process (RMDP) framework focuses on designing control policies that are robust against the parameter uncertainties due to the mismatches between the simulator model and real-world settings. An RMDP problem is typically formulated as a max-min problem, where the objective is to find the policy that maximizes the value function for the worst possible model that lies in an uncertainty set around a nominal model. The standard robust dynamic programming approach requires the knowledge of the nominal model for computing the optimal robust policy. In this work, we propose a model-based reinforcement learning (RL) algorithm for learning an $\epsilon$-optimal robust policy when the nominal model is unknown. We consider three different forms of uncertainty sets, characterized by the total variation distance, chi-square divergence, and KL divergence. For each of these uncertainty sets, we give a precise characterization of the sample complexity of our proposed algorithm. In addition to the sample complexity results, we also present a formal analytical argument on the benefit of using robust policies. Finally, we demonstrate the performance of our algorithm on two benchmark problems.

We develop a kernel projected Wasserstein distance for the two-sample test, an essential building block in statistics and machine learning: given two sets of samples, to determine whether they are from the same distribution. This method operates by finding the nonlinear mapping in the data space which maximizes the distance between projected distributions. In contrast to existing works about projected Wasserstein distance, the proposed method circumvents the curse of dimensionality more efficiently. We present practical algorithms for computing this distance function together with the non-asymptotic uncertainty quantification of empirical estimates. Numerical examples validate our theoretical results and demonstrate good performance of the proposed method.

Shivam Garg · Santosh Vempala

The success of gradient descent in ML and especially for learning neural networks is remarkable and robust. In the context of how the brain learns, one aspect of gradient descent that appears biologically difficult to realize (if not implausible) is that its updates rely on feedback from later layers to earlier layers through the same connections. Such bidirected links are relatively few in brain networks, and even when reciprocal connections exist, they may not be equi-weighted. Random Feedback Alignment (Lillicrap et al., 2016), where the backward weights are random and fixed, has been proposed as a bio-plausible alternative and found to be effective empirically. We investigate how and when feedback alignment (FA) works, focusing on one of the most basic problems with layered structure — low-rank matrix factorization. In this problem, given a matrix $Y_{n\times m}$, the goal is to find a low rank factorization $Z_{n \times r}W_{r \times m}$ that minimizes the error $\|ZW-Y\|_F$. Gradient descent solves this problem optimally. We show that FA finds the optimal solution when $r\ge \mbox{rank}(Y)$. We also shed light on {\em how} FA works. It is observed empirically that the forward weight matrices and (random) feedback matrices come closer during FA updates. Our analysis rigorously derives this phenomenon and shows how it facilitates convergence of FA*, a closely related variant of FA. We also show that FA can be far from optimal when $r < \mbox{rank}(Y)$. This is the first provable separation result between gradient descent and FA. Moreover, the representations found by gradient descent and FA can be almost orthogonal even when their error $\|ZW-Y\|_F$ is approximately equal. As a corollary, these results also hold for training two-layer linear neural networks when the training input is isotropic, and the output is a linear function of the input.

Eduard Gorbunov · Hugo Berard · Gauthier Gidel · Nicolas Loizou

The Stochastic Extragradient (SEG) method is one of the most popular algorithms for solving min-max optimization and variational inequalities problems (VIP) appearing in various machine learning tasks. However, several important questions regarding the convergence properties of SEG are still open, including the sampling of stochastic gradients, mini-batching, convergence guarantees for the monotone finite-sum variational inequalities with possibly non-monotone terms, and others. To address these questions, in this paper, we develop a novel theoretical framework that allows us to analyze several variants of SEG in a unified manner. Besides standard setups, like Same-Sample SEG under Lipschitzness and monotonicity or Independent-Samples SEG under uniformly bounded variance, our approach allows us to analyze variants of SEG that were never explicitly considered in the literature before. Notably, we analyze SEG with arbitrary sampling which includes importance sampling and various mini-batching strategies as special cases. Our rates for the new variants of SEG outperform the current state-of-the-art convergence guarantees and rely on less restrictive assumptions.

Jianyu Xu · Yu-Xiang Wang

In feature-based dynamic pricing, a seller sets appropriate prices for a sequence of products (described by feature vectors) on the fly by learning from the binary outcomes of previous sales sessions ("Sold" if valuation $\geq$ price, and "Not Sold" otherwise). Existing works either assume noiseless linear valuation or precisely-known noise distribution, which limits the applicability of those algorithms in practice when these assumptions are hard to verify. In this work, we study two more agnostic models: (a) a "linear policy" problem where we aim at competing with the best linear pricing policy while making no assumptions on the data, and (b) a "linear noisy valuation" problem where the random valuation is linear plus an unknown and assumption-free noise. For the former model, we show a $\Theta(d^{1/3}T^{2/3})$ minimax regret up to logarithmic factors. For the latter model, we present an algorithm that achieves an $O(T^{3/4})$ regret and improve the best-known lower bound from $Omega(T^{3/5})$ to $\Omega(T^{2/3})$. These results demonstrate that no-regret learning is possible for feature-based dynamic pricing under weak assumptions, but also reveal a disappointing fact that the seemingly richer pricing feedback is not significantly more useful than the bandit-feedback in regret reduction.

Qingfeng Lan · Samuele Tosatto · Homayoon Farrahi · Rupam Mahmood

Despite the increasing popularity of policy gradient methods, they are yet to be widely utilized in sample-scarce applications, such as robotics. The sample efficiency could be improved by making best usage of available information. As a key component in reinforcement learning, the reward function is usually devised carefully to guide the agent. Hence, the reward function is usually known, allowing access to not only scalar reward signals but also reward gradients. To benefit from reward gradients, previous works require the knowledge of environment dynamics, which are hard to obtain. In this work, we develop the Reward Policy Gradient estimator, a novel approach that integrates reward gradients without learning a model. Bypassing the model dynamics allows our estimator to achieve a better bias-variance trade-off, which results in a higher sample efficiency, as shown in the empirical analysis. Our method also boosts the performance of Proximal Policy Optimization on different MuJoCo control tasks.

Contextual bandits are a rich model for sequential decision making given side information, with important applications, e.g., in recommender systems. We propose novel algorithms for contextual bandits harnessing neural networks to approximate the unknown reward function. We resolve the open problem of proving sublinear regret bounds in this setting for general context sequences, considering both fully-connected and convolutional networks. To this end, we first analyze NTK-UCB, a kernelized bandit optimization algorithm employing the Neural Tangent Kernel (NTK), and bound its regret in terms of the NTK maximum information gain $\gamma_T$, a complexity parameter capturing the difficulty of learning. Our bounds on $\gamma_T$ for the NTK may be of independent interest. We then introduce our neural network based algorithm NN-UCB, and show that its regret closely tracks that of NTK-UCB. Under broad non-parametric assumptions about the reward function, our approach converges to the optimal policy at a $\tilde{\mathcal{O}}(T^{-1/2d})$ rate, where $d$ is the dimension of the context.

Jingfeng Wu · Vladimir Braverman · Lin Yang

For the problem of task-agnostic reinforcement learning (RL), an agent first collects samples from an unknown environment without the supervision of reward signals, then is revealed with a reward and is asked to compute a corresponding near-optimal policy. Existing approaches mainly concern the worst-case scenarios, in which no structural information of the reward/transition-dynamics is utilized. Therefore the best sample upper bound is $\propto\widetilde{\mathcal{O}}(1/\epsilon^2)$, where $\epsilon>0$ is the target accuracy of the obtained policy, and can be overly pessimistic. To tackle this issue, we provide an efficient algorithm that utilizes a gap parameter, $\rho>0$, to reduce the amount of exploration. In particular, for an unknown finite-horizon Markov decision process, the algorithm takes only $\widetilde{\mathcal{O}} (1/\epsilon \cdot (H^3SA / \rho + H^4 S^2 A) )$ episodes of exploration, and is able to obtain an $\epsilon$-optimal policy for a post-revealed reward with sub-optimality gap at least $\rho$, where $S$ is the number of states, $A$ is the number of actions, and $H$ is the length of the horizon, obtaining a nearly \emph{quadratic saving} in terms of $\epsilon$. We show that, information-theoretically, this bound is nearly tight for $\rho < \Theta(1/(HS))$ and $H>1$. We further show that $\propto\widetilde{\mathcal{O}}(1)$ sample bound is possible for $H=1$ (i.e., multi-armed bandit) or with a sampling simulator, establishing a stark separation between those settings and the RL setting.

Knowing when a graphical model perfectly encodes the conditional independence structure of a distribution is essential in applications, and this is particularly important when performing inference from data. When the model is perfect, there is a one-to-one correspondence between conditional independence statements in the distribution and separation statements in the graph. Previous work has shown that almost all models based on linear directed acyclic graphs as well as Gaussian chain graphs are perfect, the latter of which subsumes Gaussian graphical models (i.e., the undirected Gaussian models) as a special case. In this paper, we directly approach the problem of perfectness for the Gaussian graphical models, and provide a new proof, via a more transparent parametrization, that almost all such models are perfect. Our approach is based on, and substantially extends, a construction of Ln{\v{e}}ni{\v{c}}ka and Mat{\'{u}}{\v{s}} showing the existence of a perfect Gaussian distribution for any graph. The analysis involves constructing a probability measure on the set of normalized covariance matrices Markov with respect to a graph that may be of independent interest.

Darshan Chakrabarti · John Dickerson · Seyed Esmaeili · Aravind Srinivasan · Leonidas Tsepenekas

Clustering is a fundamental problem in unsupervised machine learning, and due to its numerous societal implications fair variants of it have recently received significant attention. In this work we introduce a novel definition of individual fairness for clustering problems. Specifically, in our model, each point $j$ has a set of other points $\mathcal{S}_j$ that it perceives as similar to itself, and it feels that it is being fairly treated if the quality of service it receives in the solution is $\alpha$-close (in a multiplicative sense, for some given $\alpha \geq 1$) to that of the points in $\mathcal{S}_j$. We begin our study by answering questions regarding the combinatorial structure of the problem, namely for what values of $\alpha$ the problem is well-defined, and what the behavior of the Price of Fairness (PoF) for it is. For the well-defined region of $\alpha$, we provide efficient and easily-implementable approximation algorithms for the $k$-center objective, which in certain cases also enjoy bounded-PoF guarantees. We finally complement our analysis by an extensive suite of experiments that validates the effectiveness of our theoretical results.

Jiachang Liu · Chudi Zhong · Margo Seltzer · Cynthia Rudin

We present fast classification techniques for sparse generalized linear and additive models. These techniques can handle thousands of features and thousands of observations in minutes, even in the presence of many highly correlated features. For fast sparse logistic regression, our computational speed-up over other best-subset search techniques owes to linear and quadratic surrogate cuts for the logistic loss that allow us to efficiently screen features for elimination, as well as use of a priority queue that favors a more uniform exploration of features. As an alternative to the logistic loss, we propose the exponential loss, which permits an analytical solution to the line search at each iteration. Our algorithms are generally 2 to 5 times faster than previous approaches. They produce interpretable models that have accuracy comparable to black box models on challenging datasets.

Cecilia Ferrando · Shufan Wang · Daniel Sheldon

The goal of this paper is to develop a practical and general-purpose approach to construct confidence intervals for differentially private parametric estimation. We find that the parametric bootstrap is a simple and effective solution. It cleanly reasons about variability of both the data sample and the randomized privacy mechanism and applies "out of the box" to a wide class of private estimation routines. It can also help correct bias caused by clipping data to limit sensitivity. We prove that the parametric bootstrap gives consistent confidence intervals in two broadly relevant settings, including a novel adaptation to linear regression that avoids accessing the covariate data multiple times. We demonstrate its effectiveness for a variety of estimators, and find empirically that it provides confidence intervals with good coverage even at modest sample sizes and performs better than alternative approaches.

Nicholas Irons · Meyer Scetbon · Soumik Pal · Zaid Harchaoui

Triangular flows, also known as Kn\"othe-Rosenblatt measure couplings, comprise an important building block of normalizing flow models for generative modeling and density estimation, including popular autoregressive flows such as real-valued non-volume preserving transformation models (Real NVP). We present statistical guarantees and sample complexity bounds for triangular flow statistical models. In particular, we establish the statistical consistency and the finite sample convergence rates of the minimum Kullback-Leibler divergence statistical estimator of the Kn\"othe-Rosenblatt measure coupling using tools from empirical process theory. Our results highlight the anisotropic geometry of function classes at play in triangular flows, shed light on optimal coordinate ordering, and lead to statistical guarantees for Jacobian flows. We conduct numerical experiments to illustrate the practical implications of our theoretical findings.

Michael Curry · Uro Lyi · Tom Goldstein · John Dickerson

We propose a new architecture to approximately learn incentive compatible, revenue-maximizing auctions from sampled valuations. Our architecture uses the Sinkhorn algorithm to perform a differentiable bipartite matching which allows the network to learn strategyproof revenue-maximizing mechanisms in settings not learnable by the previous RegretNet architecture. In particular, our architecture is able to learn mechanisms in settings without free disposal where each bidder must be allocated exactly some number of items. In experiments, we show our approach successfully recovers multiple known optimal mechanisms and high-revenue, low-regret mechanisms in larger settings where the optimal mechanism is unknown.

Yue Wu · Dongruo Zhou · Quanquan Gu

We study reinforcement learning in an infinite-horizon average-reward setting with linear function approximation for linear mixture Markov decision processes (MDPs), where the transition probability function of the underlying MDP admits a linear form over a feature mapping of the current state, action, and next state. We propose a new algorithm UCRL2-VTR, which can be seen as an extension of the UCRL2 algorithm with linear function approximation. We show that UCRL2-VTR with Bernstein-type bonus can achieve a regret of $\tilde{O}(d\sqrt{DT})$, where $d$ is the dimension of the feature mapping, $T$ is the horizon, and $D$ is the diameter of the MDP. We also prove a matching lower bound $\tilde{\Omega}(d\sqrt{DT})$, which suggests that the proposed UCRL2-VTR is minimax optimal up to logarithmic factors. To the best of our knowledge, our algorithm is the first nearly minimax optimal RL algorithm with function approximation in the infinite-horizon average-reward setting.

Arman Zharmagambetov · Miguel Carreira-Perpinan

Model interpretability is a topic of renewed interest given today's widespread practical use of machine learning, and the need to trust or understand automated predictions. We consider the problem of optimally learning interpretable out-of-sample mappings for nonlinear embedding methods such as $t$-SNE. We argue for the use of sparse oblique decision trees because they strike a good tradeoff between accuracy and interpretability which can be controlled via a hyperparameter, thus allowing one to achieve a model with a desired explanatory complexity. The resulting optimization problem is difficult because decision trees are not differentiable. By using an equivalent formulation of the problem, we give an algorithm that can learn such a tree for any given nonlinear embedding objective. We illustrate experimentally how the resulting trees provide insights into the data beyond what a simple 2D visualization of the embedding does.

Yuheng Bu · Gholamali Aminian · Laura Toni · Gregory Wornell · Miguel Rodrigues

We provide an information-theoretic analysis of the generalization ability of Gibbs-based transfer learning algorithms by focusing on two popular empirical risk minimization (ERM) approaches for transfer learning, $\alpha$-weighted-ERM and two-stage-ERM. Our key result is an exact characterization of the generalization behavior using the conditional symmetrized Kullback-Leibler (KL) information between the output hypothesis and the target training samples given the source training samples. Our results can also be applied to provide novel distribution-free generalization error upper bounds on these two aforementioned Gibbs algorithms. Our approach is versatile, as it also characterizes the generalization errors and excess risks of these two Gibbs algorithms in the asymptotic regime, where they converge to the $\alpha$-weighted-ERM and two-stage-ERM, respectively. Based on our theoretical results, we show that the benefits of transfer learning can be viewed as a bias-variance trade-off, with the bias induced by the source distribution and the variance induced by the lack of target samples. We believe this viewpoint can guide the choice of transfer learning algorithms in practice.

Yuqing Zhu · Jinshuo Dong · Yu-Xiang Wang

Characterizing the privacy degradation over compositions, i.e., privacy accounting, is a fundamental topic in differential privacy (DP) with many applications to differentially private machine learning and federated learning. We propose a unification of recent advances (Renyi DP, privacy profiles, $f$-DP and the PLD formalism) via the \emph{characteristic function} ($\phi$-function) of a certain \emph{dominating} privacy loss random variable. We show that our approach allows \emph{natural} adaptive composition like Renyi DP, provides \emph{exactly tight} privacy accounting like PLD, and can be (often \emph{losslessly}) converted to privacy profile and $f$-DP, thus providing $(\epsilon,\delta)$-DP guarantees and interpretable tradeoff functions. Algorithmically, we propose an \emph{analytical Fourier accountant} that represents the \emph{complex} logarithm of $\phi$-functions symbolically and uses Gaussian quadrature for numerical computation. On several popular DP mechanisms and their subsampled counterparts, we demonstrate the flexibility and tightness of our approach in theory and experiments.

Thibault Sejourne · Francois-Xavier Vialard · Gabriel Peyré

Unbalanced optimal transport (UOT) extends optimal transport (OT) to take into account mass variations when comparing distributions. This is crucial for successful ML applications of OT, as it makes it robust to data normalization and outliers. The baseline algorithm is Sinkhorn, but its convergence speed might be significantly slower for UOT than for OT. In this work, we identify the cause for this deficiency, namely the lack of a global normalization of the iterates, which equivalently corresponds to a translation of the dual OT potentials. Our first contribution leverages this idea to develop an accelerated Sinkhorn algorithm (coined ”translation invariant Sinkhorn”) for UOT, bridging the computational gap with OT. Our second contribution focuses on 1-D UOT and proposes a Frank-Wolfe solver applied to this translation invariant formulation. The linear oracle of each step amounts to solving a 1-D OT problem, resulting in a linear time complexity per iteration. Our last contribution extends this method to the computation of UOT barycenter of 1-D measures. Numerical simulations showcase the convergence speed improvement brought by these three approaches.

Ignavier Ng · Sebastien Lachapelle · Nan Rosemary Ke · Simon Lacoste-Julien · Kun Zhang

Recently, structure learning of directed acyclic graphs (DAGs) has been formulated as a continuous optimization problem by leveraging an algebraic characterization of acyclicity. The constrained problem is solved using the augmented Lagrangian method (ALM) which is often preferred to the quadratic penalty method (QPM) by virtue of its standard convergence result that does not require the penalty coefficient to go to infinity, hence avoiding ill-conditioning. However, the convergence properties of these methods for structure learning, including whether they are guaranteed to return a DAG solution, remain unclear, which might limit their practical applications. In this work, we examine the convergence of ALM and QPM for structure learning in the linear, nonlinear, and confounded cases. We show that the standard convergence result of ALM does not hold in these settings, and demonstrate empirically that its behavior is akin to that of the QPM which is prone to ill-conditioning. We further establish the convergence guarantee of QPM to a DAG solution, under mild conditions. Lastly, we connect our theoretical results with existing approaches to help resolve the convergence issue, and verify our findings in light of an empirical comparison of them.

Zaiwei Chen · Siva Theja Maguluri

In this work, we study policy-based methods for solving the reinforcement learning problem, where off-policy sampling and linear function approximation are employed for policy evaluation, and various policy update rules (including natural policy gradient) are considered for policy improvement. To solve the policy evaluation sub-problem in the presence of the deadly triad, we propose a generic algorithm framework of multi-step TD-learning with generalized importance sampling ratios, which includes two specific algorithms: the $\lambda$-averaged $Q$-trace and the two-sided $Q$-trace. The generic algorithm is single time-scale, has provable finite-sample guarantees, and overcomes the high variance issue in off-policy learning. As for the policy improvement, we provide a universal analysis that establishes geometric convergence of various policy update rules, which leads to an overall $\Tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity.

Rishi Saket · Aravindan Raghuveer · Balaraman Ravindran

In the framework of learning from label proportions (LLP) the goal is to learn a good instance-level label predictor from the observed label proportions of bags of instances. Most of the LLP algorithms either explicitly or implicitly assume the nature of bag distributions with respect to the actual labels and instances, or cleverly adapt supervised learning techniques to suit LLP. In practical applications however, the scale and nature of data could render such assumptions invalid and the many of the algorithms impractical. In this paper we address the hard problem of solving LLP with provable error bounds while being bag distribution agnostic and model agnostic.We first propose the concept of generalized bags, an extension of bags and then devise an algorithm to combine bag distributions, if possible, into good generalized bag distributions. We show that (w.h.p) any classifier optimizing the squared Euclidean label-proportion loss on such a generalized bag distribution is guaranteed to minimize the instance-level loss as well. The predictive quality of our method is experimentally evaluated and it equals or betters the previous methods on pseudo-synthetic and real-world datasets.

Warren Morningstar · Alex Alemi · Joshua Dillon

The Bayesian posterior minimizes the “inferential risk” which itself bounds the “predictive risk.” This bound is tight when the likelihood and prior are well-specified. How-ever since misspecification induces a gap,the Bayesian posterior predictive distribution may have poor generalization performance. This work develops a multi-sample loss (PACm) which can close the gap by spanning a trade-off between the two risks. The loss is computationally favorable and offers PAC generalization guarantees. Empirical study demonstrates improvement to the predictive distribution

Abhin Shah · Karthikeyan Shanmugam · Kartik Ahuja

Treatment effect estimation from observational data is a fundamental problem in causal inference. There are two very different schools of thought that have tackled this problem. On the one hand, the Pearlian framework commonly assumes structural knowledge (provided by an expert) in the form of directed acyclic graphs and provides graphical criteria such as the back-door criterion to identify the valid adjustment sets. On the other hand, the potential outcomes (PO) framework commonly assumes that all the observed features satisfy ignorability (i.e., no hidden confounding), which in general is untestable. In prior works that attempted to bridge these frameworks, there is an observational criteria to identify an *anchor variable* and if a subset of covariates (not involving the anchor variable) passes a suitable conditional independence criteria, then that subset is a valid back-door. Our main result strengthens these prior results by showing that under a different expert-driven structural knowledge --- that one variable is a direct causal parent of the treatment variable --- remarkably, testing for subsets (not involving the known parent variable) that are valid back-doors is *equivalent* to an invariance test. Importantly, we also cover the non-trivial case where the entire set of observed features is not ignorable (generalizing the PO framework) without requiring the knowledge of all the parents of the treatment variable. Our key technical idea involves generation of a synthetic sub-sampling (or environment) variable that is a function of the known parent variable. In addition to designing an invariance test, this sub-sampling variable allows us to leverage Invariant Risk Minimization, and thus, connects finding valid adjustments (in non-ignorable observational settings) to representation learning. We demonstrate the effectiveness and tradeoffs of these approaches on a variety of synthetic datasets as well as real causal effect estimation benchmarks.

Romain Laroche · Remi Tachet des Combes

In Reinforcement Learning, the optimal action at a given state is dependent on policy decisions at subsequent states. As a consequence, the learning targets evolve with time and the policy optimization process must be efficient at unlearning what it previously learnt. In this paper, we discover that the policy gradient theorem prescribes policy updates that are slow to unlearn because of their structural symmetry with respect to the value target. To increase the unlearning speed, we study a novel policy update: the gradient of the cross-entropy loss with respect to the action maximizing $q$, but find that such updates may lead to a decrease in value. Consequently, we introduce a modified policy update devoid of that flaw, and prove its guarantees of convergence to global optimality in $\mathcal{O}(t^{-1})$ under classic assumptions. Further, we assess standard policy updates and our cross-entropy policy updates along six analytical dimensions. Finally, we empirically validate our theoretical findings.

Roy Fox · Stephen McAleer · Will Overman · Ioannis Panageas

Natural policy gradient has emerged as one of the most successful algorithms for computing optimal policies in challenging Reinforcement Learning (RL) tasks, yet, very little was known about its convergence properties until recently. The picture becomes more blurry when it comes to multi-agent RL (MARL); the line of works that have theoretical guarantees for convergence to Nash policies are very limited. In this paper, we focus on a particular class of multi-agent stochastic games called Markov Potential Games and we prove that Independent Natural Policy Gradient always converges using constant learning rates. The proof deviates from the existing approaches and the main challenge lies in the fact that Markov potential Games do not have unique optimal values (as single-agent settings exhibit) so different initializations can lead to different limit point values. We complement our theoretical results with experiments that indicate that Natural Policy Gradient outperforms Policy Gradient in MARL settings (our process benchmark is multi-state congestion games).

Axel Brando · Joan Gimeno · Jose Rodriguez-Serrano · Jordi Vitria

Quantile Regression (QR) provides a way to approximate a single conditional quantile. To have a more informative description of the conditional distribution, QR can be merged with deep learning techniques to simultaneously estimate multiple quantiles. However, the minimisation of the QR-loss function does not guarantee non-crossing quantiles, which affects the validity of such predictions and introduces a critical issue in certain scenarios. In this article, we propose a generic deep learning algorithm for predicting an arbitrary number of quantiles that ensures the quantile monotonicity constraint up to the machine precision and maintains its modelling performance with respect to alternative models. The presented method is evaluated over several real-world datasets obtaining state-of-the-art results as well as showing that it scales to large-size data sets.

Dheeraj Baby · Yu-Xiang Wang

We study the framework of *universal dynamic regret* minimization with *strongly convex* losses. We answer an open problem in Baby and Wang 2021 by showing that in a *proper learning* setup, Strongly Adaptive algorithms can achieve the near optimal dynamic regret of $\tilde O(d^{1/3} n^{1/3}\text{TV}[u_{1:n}]^{2/3} \vee d)$ against any comparator sequence $u_1,\ldots,u_n$ *simultaneously*, where $n$ is the time horizon and $\text{TV}[u_{1:n}]$ is the Total Variation of comparator. These results are facilitated by exploiting a number of *new* structures imposed by the KKT conditions that were not considered in Baby and Wang 2021 which also lead to other improvements over their results such as: (a) handling non-smooth losses and (b) improving the dimension dependence on regret. Further, we also derive near optimal dynamic regret rates for the special case of proper online learning with exp-concave losses and an $L_\infty$ constrained decision set.

Yuhao Ding · Junzi Zhang · Javad Lavaei

Policy gradient (PG) methods are popular and efficient for large-scale reinforcement learning due to their relative stability and incremental nature. In recent years, the empirical success of PG methods has led to the development of a theoretical foundation for these methods. In this work, we generalize this line of research by establishing the first set of global convergence results of stochastic PG methods with momentum terms, which have been demonstrated to be efficient recipes for improving PG methods. We study both the soft-max and the Fisher-non-degenerate policy parametrizations, and show that adding a momentum term improves the global optimality sample complexities of vanilla PG methods by $\tilde{\mathcal{O}}(\epsilon^{-1.5})$ and $\tilde{\mathcal{O}}(\epsilon^{-1})$, respectively, where $\epsilon>0$ is the target tolerance. Our results for the generic Fisher-non-degenerate policy parametrizations also provide the first single-loop and finite-batch PG algorithm achieving an $\tilde{O}(\epsilon^{-3})$ global optimality sample complexity. Finally, as a by-product, our analyses provide general tools for deriving the global convergence rates of stochastic PG methods, which can be readily applied and extended to other PG estimators under the two parametrizations.

yuyang deng · Mohammad Mahdi Kamani · Mehrdad Mahdavi

In this paper we prove that Local (S)GD (or FedAvg) can optimize deep neural networks with Rectified LinearUnit (ReLU) activation function in polynomial time. Despite the established convergence theory of Local SGD on optimizing general smooth functions in communication-efficient distributed optimization, its convergence on non-smooth ReLU networks still eludes full theoretical understanding. The key property used in many Local SGD analysis on smooth function is gradient Lipschitzness, so that the gradient on local models will not drift far away from that on averaged model. However, this decent property does not hold in networks with non-smooth ReLU activation function. We show that, even though ReLU network does not admit gradient Lipschitzness property, the difference between gradients on local models and average model will not change too much, under the dynamics of Local SGD. We validate our theoretical results via extensive experiments. This work is the first to show the convergence of Local SGD on non-smooth functions, and will shed lights on the optimization theory of federated training of deep neural networks.

Sana Tonekaboni · Chun-Liang Li · Sercan Arik · Anna Goldenberg · Tomas Pfister

Real-world time series data are often generated from several sources of variation. Learning representations that capture the factors contributing to this variability enables better understanding of the data via its underlying generative process and can lead to improvements in performance on downstream machine learning tasks. In this paper, we propose a novel generative approach for learning representations for the global and local factors of variation in time series data. The local representation of each sample models non-stationarity over time with a stochastic process prior, and the global representation of the sample encodes the time-independent characteristics. To encourage decoupling between the representations, we introduce a counterfactual regularization that minimizes the mutual information between the two variables. In experiments, we demonstrate successful recovery of the true local and global factors of variability on simulated data, and show that representations learned using our method lead to superior performance on downstream tasks on real-world datasets. We believe that the proposed way of defining representations is beneficial for data modelling and can yield better insights into the complexity of the real-world data.

Wenshuo Guo · Kirthevasan Kandasamy · Joseph Gonzalez · Michael Jordan · Ion Stoica

The sharing of scarce resources among multiple rational agents is one of the classical problems in economics. In exchange economies, which are used to model such situations, agents begin with an initial endowment of resources and exchange them in a way that is mutually beneficial until they reach a competitive equilibrium (CE). The allocations at a CE are Pareto efficient and fair. Consequently, they are used widely in designing mechanisms for fair division. However, computing CEs requires the knowledge of agent preferences which are unknown in several applications of interest. In this work, we explore a new online learning mechanism, which, on each round, allocates resources to the agents and collects stochastic feedback on their experience in using that allocation. Its goal is to learn the agent utilities via this feedback and imitate the allocations at a CE in the long run. We quantify CE behavior via two losses and propose a randomized algorithm which achieves sublinear lossunder a parametric class of utilities. Empirically, we demonstrate the effectiveness of this mechanism through numerical simulations.

Yinglun Zhu · Julian Katz-Samuels · Robert Nowak

The model selection problem in the pure exploration linear bandit setting is introduced and studied in both the fixed confidence and fixed budget settings. The model selection problem considers a nested sequence of hypothesis classes of increasing complexities. Our goal is to automatically adapt to the instance-dependent complexity measure of the smallest hypothesis class containing the true model, rather than suffering from the complexity measure related to the largest hypothesis class. We provide evidence showing that a standard doubling trick over dimension fails to achieve the optimal instance-dependent sample complexity. Our algorithms define a new optimization problem based on experimental design that leverages the geometry of the action set to efficiently identify a near-optimal hypothesis class. Our fixed budget algorithm uses a novel application of a selection-validation trick in bandits. This provides a new method for the understudied fixed budget setting in linear bandits (even without the added challenge of model selection). We further generalize the model selection problem to the misspecified regime, adapting our algorithms in both fixed confidence and fixed budget settings.

Vivek Farias · Andrew Li · Tianyi Peng

The problem of low-rank matrix completion with heterogeneous and sub-exponential (as opposed to homogeneous Gaussian) noise is particularly relevant to a number of applications in modern commerce. Examples include panel sales data and data collected from web-commerce systems such as recommendation engines. An important unresolved question for this problem is characterizing the distribution of estimated matrix entries under common low-rank estimators. Such a characterization is essential to any application that requires quantification of uncertainty in these estimates and has heretofore only been available under the assumption of homogenous Gaussian noise. Here we characterize the distribution of estimated matrix entries when the observation noise is heterogeneous sub-Exponential and provide, as an application, explicit formulas for this distribution when observed entries are Poisson or Binary distributed.

Bayesian nonparametric methods are a popular choice for analysing survival data due to their ability to flexibly model the distribution of survival times. These methods typically employ a nonparametric prior on the survival function that is conjugate with respect to right-censored data. Eliciting these priors, particularly in the presence of covariates, can be challenging and inference typically relies on computationally intensive Markov chain Monte Carlo schemes. In this paper, we build on recent work that recasts Bayesian inference as assigning a predictive distribution on the unseen values of a population conditional on the observed samples, thus avoiding the need to specify a complex prior. We describe a copula-based predictive update which admits a scalable sequential importance sampling algorithm to perform inference that properly accounts for right-censoring. We provide theoretical justification through an extension of Doob's consistency theorem and illustrate the method on a number of simulated and real data sets, including an example with covariates. Our approach enables analysts to perform Bayesian nonparametric inference through only the specification of a predictive distribution.

The unsupervised task of aligning two or more distributions in a shared latent space has many applications including fair representations, batch effect mitigation, and unsupervised domain adaptation. Existing flow-based approaches estimate multiple flows independently, which is equivalent to learning multiple full generative models. Other approaches require adversarial learning, which can be computationally expensive and challenging to optimize. Thus, we aim to jointly align multiple distributions while avoiding adversarial learning. Inspired by efficient alignment algorithms from optimal transport (OT) theory for univariate distributions, we develop a simple iterative method to build deep and expressive flows. Our method decouples each iteration into two subproblems: 1) form a variational approximation of a distribution divergence and 2) minimize this variational approximation via closed-form invertible alignment maps based on known OT results. Our empirical results give evidence that this iterative algorithm achieves competitive distribution alignment at low computational cost while being able to naturally handle more than two distributions.

Ben Letham · Phillip Guan · Chase Tymms · Eytan Bakshy · Michael Shvartsman

Level set estimation (LSE) is the problem of identifying regions where an unknown function takes values above or below a specified threshold. Active sampling strategies for efficient LSE have primarily been studied in continuous-valued functions. Motivated by applications in human psychophysics where common experimental designs produce binary responses, we study LSE active sampling with Bernoulli outcomes. With Gaussian process classification surrogate models, the look-ahead model posteriors used by state-of-the-art continuous-output methods are intractable. However, we derive analytic expressions for look-ahead posteriors of sublevel set membership, and show how these lead to analytic expressions for a class of look-ahead LSE acquisition functions, including information-based methods. Benchmark experiments show the importance of considering the global look-ahead impact on the entire posterior. We demonstrate a clear benefit to using this new class of acquisition functions on benchmark problems, and on a challenging real-world task of estimating a high-dimensional contrast sensitivity function.

Ignavier Ng · Kun Zhang

Traditionally, Bayesian network structure learning is often carried out at a central site, in which all data is gathered. However, in practice, data may be distributed across different parties (e.g., companies, devices) who intend to collectively learn a Bayesian network, but are not willing to disclose information related to their data owing to privacy or security concerns. In this work, we present a federated learning approach to estimate the structure of Bayesian network from data that is horizontally partitioned across different parties. We develop a distributed structure learning method based on continuous optimization, using the alternating direction method of multipliers (ADMM), such that only the model parameters have to be exchanged during the optimization process. We demonstrate the flexibility of our approach by adopting it for both linear and nonlinear cases. Experimental results on synthetic and real datasets show that it achieves an improved performance over the other methods, especially when there is a relatively large number of clients and each has a limited sample size.

Beau Coker · Wessel Bruinsma · David Burt · Weiwei Pan · Finale Doshi-Velez

Bayesian neural networks (BNNs) combine the expressive power of deep learning with the advantages of Bayesian formalism. In recent years, the analysis of wide, deep BNNs has provided theoretical insight into their priors and posteriors. However, we have no analogous insight into their posteriors under approximate inference. In this work, we show that mean-field variational inference *entirely fails to model the data* when the network width is large and the activation function is odd. Specifically, for fully-connected BNNs with odd activation functions and a homoscedastic Gaussian likelihood, we show that the *optimal* mean-field variational posterior predictive (i.e., function space) distribution converges to the prior predictive distribution as the width tends to infinity. We generalize aspects of this result to other likelihoods. Our theoretical results are suggestive of underfitting behavior previously observered in BNNs. While our convergence bounds are non-asymptotic and constants in our analysis can be computed, they are currently too loose to be applicable in standard training regimes. Finally, we show that the optimal approximate posterior need not tend to the prior if the activation function is not odd, showing that our statements cannot be generalized arbitrarily.

Linear bandits have a wide variety of applications including recommendation systems yet they make one strong assumption: the algorithms must know an upper bound $S$ on the norm of the unknown parameter $\theta^*$ that governs the reward generation. Such an assumption forces the practitioner to guess $S$ involved in the confidence bound, leaving no choice but to wish that $\|\theta^*\|\le S$ is true to guarantee that the regret will be low. In this paper, we propose novel algorithms that do not require such knowledge for the first time. Specifically, we propose two algorithms and analyze their regret bounds: one for the changing arm set setting and the other for the fixed arm set setting. Our regret bound for the former shows that the price of not knowing $S$ does not affect the leading term in the regret bound and inflates only the lower order term. For the latter, we do not pay any price in the regret for now knowing $S$. Our numerical experiments show standard algorithms assuming knowledge of $S$ can fail catastrophically when $\|\theta^*\|\le S$ is not true whereas our algorithms enjoy low regret.

Margalit Glasgow · Honglin Yuan · Tengyu Ma

Federated Averaging (FedAvg), also known as Local SGD, is one of the most popular algorithms in Federated Learning (FL). Despite its simplicity and popularity, the convergence rate of FedAvg has thus far been undetermined. Even under the simplest assumptions (convex, smooth, homogeneous, and bounded covariance), the best-known upper and lower bounds do not match, and it is not clear whether the existing analysis captures the capacity of the algorithm. In this work, we first resolve this question by providing a lower bound for FedAvg that matches the existing upper bound, which shows the existing FedAvg upper bound analysis is not improvable. Additionally, we establish a lower bound in a heterogeneous setting that nearly matches the existing upper bound. While our lower bounds show the limitations of FedAvg, under an additional assumption of third-order smoothness, we prove more optimistic state-of-the-art convergence results in both convex and non-convex settings. Our analysis stems from a notion we call iterate bias, which is defined by the deviation of the expectation of the SGD trajectory from the noiseless gradient descent trajectory with the same initialization. We prove novel sharp bounds on this quantity, and show intuitively how to analyze this quantity from a Stochastic Differential Equation (SDE) perspective.

William Stephenson · Soumya Ghosh · Tin Nguyen · Mikhail Yurochkin · Sameer Deshpande · Tamara Broderick

Gaussian processes (GPs) are used to make medical and scientific decisions, including in cardiac care and monitoring of carbon dioxide emissions. Notably, the choice of GP kernel is often somewhat arbitrary. In particular, uncountably many kernels typically align with qualitative prior knowledge (e.g. function smoothness or stationarity). But in practice, data analysts choose among a handful of convenient standard kernels (e.g. squared exponential). In the present work, we ask: Would decisions made with a GP differ under other, qualitatively interchangeable kernels? We show how to formulate this sensitivity analysis as a constrained optimization problem over a finite-dimensional space. We can then use standard optimizers to identify substantive changes in relevant decisions made with a GP. We demonstrate in both synthetic and real-world examples that decisions made with a GP can exhibit substantial sensitivity to kernel choice, even when prior draws are qualitatively interchangeable to a user.

We study adaptive sensing of Cox point processes, a widely used model from spatial statistics. We introduce three tasks: maximization of captured events, search for the maximum of the intensity function and learning level sets of the intensity function. We model the intensity function as a sample from a truncated Gaussian process, represented in a specially constructed positive basis. In this basis, the positivity constraint on the intensity function has a simple form. We show how the *minimal description positive basis* can be adapted to the covariance kernel, to non-stationarity and make connections to common positive bases from prior works. Our adaptive sensing algorithms use Langevin dynamics and are based on posterior sampling (*Cox-Thompson*) and top-two posterior sampling (*Top2*) principles. With latter, the difference between samples serves as a surrogate to the uncertainty. We demonstrate the approach using examples from environmental monitoring and crime rate modeling, and compare it to the classical Bayesian experimental design approach.

Jianfeng Chi · Jian Shen · Xinyi Dai · Weinan Zhang · Yuan Tian · Han Zhao

Algorithmic decisions made by machine learning models in high-stakes domains may have lasting impacts over time. However, naive applications of standard fairness criterion in static settings over temporal domains may lead to delayed and adverse effects. To understand the dynamics of performance disparity, we study a fairness problem in Markov decision processes (MDPs). Specifically, we propose return parity, a fairness notion that requires MDPs from different demographic groups that share the same state and action spaces to achieve approximately the same expected time-discounted rewards. We first provide a decomposition theorem for return disparity, which decomposes the return disparity of any two MDPs sharing the same state and action spaces into the distance between group-wise reward functions, the discrepancy of group policies, and the discrepancy between state visitation distributions induced by the group policies. Motivated by our decomposition theorem, we propose algorithms to mitigate return disparity via learning a shared group policy with state visitation distributional alignment using integral probability metrics. We conduct experiments to corroborate our results, showing that the proposed algorithm can successfully close the disparity gap while maintaining the performance of policies on two real-world recommender system benchmark datasets.

Shibo Li · Zheng Wang · Robert Kirby · Shandian Zhe

Many applications, such as in physical simulation and engineering design, demand we estimate functions with high-dimensional outputs. To reduce the expensive cost of generating training examples, we usually choose several fidelities to enable a cost/quality trade-off. In this paper, we consider the active learning task to automatically identify the fidelities and training inputs to query new examples so as to achieve the best learning benefit-cost ratio. To this end, we propose DMFAL, a Deep Multi-Fidelity Active Learning approach. We first develop a deep neural network-based multi-fidelity model for high-dimensional outputs, which can flexibly capture strong complex correlations across the outputs and fidelities to enhance the learning of the target function. We then propose a mutual information based acquisition function that extends the predictive entropy principle. To overcome the computational challenges caused by large output dimensions, we use the multi-variate delta method and moment-matching to estimate the output posterior, and Weinstein-Aronszajn identity to calculate and optimize the acquisition function. We show the advantage of our method in several applications of computational physics and engineering design. The code is available at https://github.com/shib0li/DMFAL.

Xupeng Shi · Pengfei Zheng · A. Adam Ding · Yuan Gao · Weizhong Zhang

Modern deep neural networks (DNNs) are vulnerable to adversarial attacks and adversarial training has been shown to be a promising method for improving the adversarial robustness of DNNs. Pruning methods have been considered in adversarial context to reduce model capacity and improve adversarial robustness simultaneously in training. Existing adversarial pruning methods generally mimic the classical pruning methods for natural training, which follow the 'training, pruning, fine-tuning' three stages pipeline. We observe that such pruning methods do not necessarily preserve the dynamics of dense networks, making it potentially hard to be fine-tuned to compensate the accuracy degradation in pruning. Based on recent works of \textit{neural tangent kernel} (NTK), we systematically study the dynamics of adversarial training and prove the existence of trainable sparse sub-network at initialization which can be trained to be adversarial robust from scratch. This theoretically verifies the \textit{lottery ticket hypothesis} in adversarial context and we refer such sub-network structure as \textit{adversarial winning ticket} (AWT). We also show empirical evidences that AWT preserves the dynamics of adversarial training and achieve equal performance as dense adversarial training.

Devansh Bisla · Jing Wang · Anna Choromanska

In this paper, we study the sharpness of a deep learning (DL) loss landscape around local minima in order to reveal systematic mechanisms underlying the generalization abilities of DL models. Our analysis is performed across varying network and optimizer hyper-parameters, and involves a rich family of different sharpness measures. We compare these measures and show that the low-pass filter based measure exhibits the highest correlation with the generalization abilities of DL models, has high robustness to both data and label noise, and furthermore can track the double descent behavior for neural networks. We next derive the optimization algorithm, relying on the low-pass filter (LPF), that actively searches the flat regions in the DL optimization landscape using SGD-like procedure. The update of the proposed algorithm, that we call LPF-SGD, is determined by the gradient of the convolution of the filter kernel with the loss function and can be efficiently computed using MC sampling. We empirically show that our algorithm achieves superior generalization performance compared to the common DL training strategies. On the theoretical front we prove that LPF-SGD converges to a better optimal point with smaller generalization error than SGD.

Wanrong Zhang · Yajun Mei · Rachel Cummings

The sequential hypothesis testing problem is a class of statistical analyses where the sample size is not fixed in advance. Instead, the decision-process takes in new observations sequentially to make real-time decisions for testing an alternative hypothesis against a null hypothesis until some stopping criterion is satisfied. In many common applications of sequential hypothesis testing, the data can be highly sensitive and may require privacy protection; for example, sequential hypothesis testing is used in clinical trials, where doctors sequentially collect data from patients and must determine when to stop recruiting patients and whether the treatment is effective. The field of differential privacy has been developed to offer data analysis tools with strong privacy guarantees, and has been commonly applied to machine learning and statistical tasks. In this work, we study the sequential hypothesis testing problem under a slight variant of differential privacy, known as Renyi differential privacy. We present a new private algorithm based on Wald's Sequential Probability Ratio Test (SPRT) that also gives strong theoretical privacy guarantees. We provide theoretical analysis on statistical performance measured by Type I and Type II error as well as the expected sample size. We also empirically validate our theoretical results on several synthetic databases, showing that our algorithms also perform well in practice. Unlike previous work in private hypothesis testing that focused only on the classical fixed sample setting, our results in the sequential setting allow a conclusion to be reached much earlier, and thus saving the cost of collecting additional samples.

Yae Jee Cho · Jianyu Wang · Gauri Joshi

Federated learning is a distributed optimization paradigm that enables a large number of resource-limited client nodes to cooperatively train a model without data sharing. Previous works analyzed the convergence of federated learning by accounting of data heterogeneity, communication/computation limitations, and partial client participation. However, most assume unbiased client participation, where clients are selected such that the aggregated model update is unbiased. In our work, we present the convergence analysis of federated learning with biased client selection and quantify how the bias affects convergence speed. We show that biasing client selection towards clients with higher local loss yields faster error convergence. From this insight, we propose Power-of-Choice, a communication- and computation-efficient client selection framework that flexibly spans the trade-off between convergence speed and solution bias. Extensive experiments demonstrate that Power-of-Choice can converge up to 3 times faster and give 10% higher test accuracy than the baseline random selection.

Ramit Sawhney · Shivam Agarwal · Atula T Neerkaje · Kapil Pathak

Directed Acyclic Graphs and trees are widely prevalent in several real-world applications. These hierarchical structures show intriguing properties such as scale-free and bipartite nature, with fine-grained temporal irregularities among nodes. Building on advances in geometrical deep learning, we explore a time-aware neural network to model trees and Directed Acyclic Graphs in multiple Riemannian manifolds of varying curvatures. To jointly utilize the strength of these manifolds, we propose **M**ulti-Manifold **R**ecursive **I**nteraction **L**earning (**MRIL**) on Directed Acyclic Graphs where we introduce an inter-manifold learning mechanism that recursively enriches each manifold with representations from sibling manifolds. We propose the integration of the Stiefel orthogonality constraint which stabilizes the training process in Riemannian manifolds. Through a series of quantitative and exploratory experiments, we show that our method achieves competitive performance and converges much faster on data spanning several domains.

Chris Junchi Li · Yaodong Yu · Nicolas Loizou · Gauthier Gidel · Yi Ma · Nicolas Le Roux · Michael Jordan

We study the stochastic bilinear minimax optimization problem, presenting an analysis of the same-sample Stochastic ExtraGradient (SEG) method with constant step size, and presenting variations of the method that yield favorable convergence. In sharp contrasts with the basic SEG method whose last iterate only contracts to a fixed neighborhood of the Nash equilibrium, SEG augmented with iteration averaging provably converges to the Nash equilibrium under the same standard settings, and such a rate is further improved by incorporating a scheduled restarting procedure. In the interpolation setting where noise vanishes at the Nash equilibrium, we achieve an optimal convergence rate up to tight constants. We present numerical experiments that validate our theoretical findings and demonstrate the effectiveness of the SEG method when equipped with iteration averaging and restarting.

Spencer Frei · Difan Zou · Zixiang Chen · Quanquan Gu

We consider a binary classification problem when the data comes from a mixture of two rotationally symmetric distributions satisfying concentration and anti-concentration properties enjoyed by log-concave distributions among others. We show that there exists a universal constant $C_{\mathrm{err}}>0$ such that if a pseudolabeler $\beta_{\mathrm{pl}}$ can achieve classification error at most $C_{\mathrm{err}}$, then for any $\varepsilon>0$, an iterative self-training algorithm initialized at $\beta_0 := \beta_{\mathrm{pl}}$ using pseudolabels $\hat y = \mathrm{sgn}(\langle \beta_t, \xb\rangle)$ and using at most $\tilde O(d/\varepsilon^2)$ unlabeled examples suffices to learn the Bayes-optimal classifier up to $\varepsilon$ error, where $d$ is the ambient dimension. That is, self-training converts weak learners to strong learners using only unlabeled examples. We additionally show that by running gradient descent on the logistic loss one can obtain a pseudolabeler $\beta_{\mathrm{pl}}$ with classification error $C_{\mathrm{err}}$ using only $O(d)$ labeled examples (i.e., independent of $\varepsilon$). Together our results imply that mixture models can be learned to within $\varepsilon$ of the Bayes-optimal accuracy using at most $O(d)$ labeled examples and $\tilde O(d/\varepsilon^2)$ unlabeled examples by way of a semi-supervised self-training algorithm.

Blake Mason · Lalit Jain · Subhojyoti Mukherjee · Romain Camilleri · Kevin Jamieson · Robert Nowak

The level set estimation problem seeks to find all points in a domain $\mathcal{X}$ where the value of an unknown function $f:\mathcal{X}\rightarrow \mathbb{R}$ exceeds a threshold $\alpha$. The estimation is based on noisy function evaluations that may be acquired at sequentially and adaptively chosen locations in $\mathcal{X}$. The threshold value $\alpha$ can either be \emph{explicit} and provided a priori, or \emph{implicit} and defined relative to the optimal function value, i.e. $\alpha = (1-\epsilon)f(\mathbf{x}_\ast)$ for a given $\epsilon > 0$ where $f(\mathbf{x}_\ast)$ is the maximal function value and is unknown. In this work we provide a new approach to the level set estimation problem by relating it to recent adaptive experimental design methods for linear bandits in the Reproducing Kernel Hilbert Space (RKHS) setting. We assume that $f$ can be approximated by a function in the RKHS up to an unknown misspecification and provide novel algorithms for both the implicit and explicit cases in this setting with strong theoretical guarantees. Moreover, in the linear (kernel) setting, we show that our bounds are nearly optimal, namely, our upper bounds match existing lower bounds for threshold linear bandits. To our knowledge this work provides the first instance-dependent, non-asymptotic upper bounds on sample complexity of level-set estimation that match information theoretic lower bounds.

Van Bach Nguyen · Kanishka Ghosh Dastidar · Michael Granitzer · Wissam Siblini

Fraud detection systems (FDS) mainly perform two tasks: (i) real-time detection while the payment is being processed and (ii) posterior detection to block the card retrospectively and avoid further frauds. Since human verification is often necessary and the payment processing time is limited, the second task manages the largest volume of transactions. In the literature, fraud detection challenges and algorithms performance are widely studied but the very formulation of the problem is never disrupted: it aims at predicting if a transaction is fraudulent based on its characteristics and the past transactions of the cardholder. Yet, in posterior detection, verification often takes days, so new payments on the card become available before a decision is taken. This is our motivation to propose a new paradigm: posterior fraud detection with “future” information. We start by providing evidence of the on-time availability of subsequent transactions, usable as extra context to improve detection. We then design a Bidirectional LSTM to make use of these transactions. On a real-world dataset with over 30 million transactions, it achieves higher performance than a regular LSTM, which is the state-of-the-art classifier for fraud detection that only uses the past context. We also introduce new metrics to show that the proposal catches more frauds, more compromised cards, and based on their earliest frauds. We believe that future works on this new paradigm will have a significant impact on the detection of compromised cards.

Statistical Relational Learning (SRL) models have attracted significant attention due to their ability to model complex data while handling uncertainty. However, most of these models have been restricted to discrete domains owing to the complexity of inference in continuous domains. In this work, we introduce Relational Neural Markov Random Fields (RN-MRFs) that allow handling of complex relational hybrid domains, i.e., those that include discrete and continuous quantities, and we propose a maximum pseudolikelihood estimation-based learning algorithm with importance sampling for training the neural potential parameters. The key advantage of our approach is that it makes minimal data distributional assumptions and can seamlessly embed human knowledge through potentials or relational rules. Our empirical evaluations across diverse domains, such as image processing and relational object mapping, demonstrate its practical utility.

Chuanhao Li · Hongning Wang

Linear contextual bandit is a popular online learning problem. It has been mostly studied in centralized learning settings. With the surging demand of large-scale decentralized model learning, e.g., federated learning, how to retain regret minimization while reducing communication cost becomes an open challenge. In this paper, we study linear contextual bandit in a federated learning setting. We propose a general framework with asynchronous model update and communication for a collection of homogeneous clients and heterogeneous clients, respectively. Rigorous theoretical analysis is provided about the regret and communication cost under this distributed learning framework; and extensive empirical evaluations demonstrate the effectiveness of our solution.

Kartik Sreenivasan · Shashank Rajput · Jy-yong Sohn · Dimitris Papailiopoulos

A recent work by Ramanujan et al. (2020) provides significant empirical evidence that sufficiently overparameterized, random neural networks contain untrained subnetworks that achieve state-of-the-art accuracy on several predictive tasks. A follow-up line of theoretical work provides justification of these findings by proving that slightly overparameterized neural networks, with commonly used continuous-valued random initializations can indeed be pruned to approximate any target network. In this work, we show that the amplitude of those random weights does not even matter. We prove that any target network of width $d$ and depth $l$ can be approximated up to arbitrary accuracy $\varepsilon$ by simply pruning a random network of binary $\{\pm1\}$ weights that is wider and deeper than the target network only by a polylogarithmic factor of $d, l$ and $\varepsilon$.

Jiafan He · Dongruo Zhou · Quanquan Gu

Learning Markov decision processes (MDPs) in the presence of the adversary is a challenging problem in reinforcement learning (RL). In this paper, we study RL in episodic MDPs with adversarial reward and full information feedback, where the unknown transition probability function is a linear function of a given feature mapping, and the reward function can change arbitrarily episode by episode. We propose an optimistic policy optimization algorithm POWERS and show that it can achieve $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode, $T$ is the number of interaction with the MDP, and $d$ is the dimension of the feature mapping. Furthermore, we also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors. Our key technical contributions are two-fold: (1) a new value function estimator based on importance weighting; and (2) a tighter confidence set for the transition kernel. They together lead to the nearly minimax optimal regret.

Guillaume Martinet · Alexander Strzalkowski · Barbara Engelhardt

Selecting powerful predictors for an outcome is a cornerstone task for machine learning. However, some types of questions can only be answered by identifying the predictors that causally affect the outcome. A recent approach to this causal inference problem leverages the invariance property of a causal mechanism across differing experimental environments (Peters et al., 2016; Heinze-Deml et al., 2018). This method, invariant causal prediction (ICP), has a substantial computational defect -- the runtime scales exponentially with the number of possible causal variables. In this work, we show that the approach taken in ICP may be reformulated as a series of nonparametric tests that scales linearly in the number of predictors. Each of these tests relies on the minimization of a novel loss function -- the Wasserstein variance -- that is derived from tools in optimal transport theory and is used to quantify distributional variability across environments. We prove under mild assumptions that our method is able to recover the set of identifiable direct causes, and we demonstrate in our experiments that it is competitive with other benchmark causal discovery algorithms.

Hamiltonian Monte Carlo (HMC) has become a go-to family of Markov chain Monte Carlo (MCMC) algorithms for Bayesian inference problems, in part because we have good procedures for automatically tuning its parameters. Much less attention has been paid to automatic tuning of generalized HMC (GHMC), in which the auxiliary momentum vector is partially updated frequently instead of being completely resampled infrequently. Since GHMC spreads progress over many iterations, it is not straightforward to tune GHMC based on quantities typically used to tune HMC such as average acceptance rate and squared jumped distance. In this work, we propose an ensemble-chain adaptation (ECA) algorithm for GHMC that automatically selects values for all of GHMC's tunable parameters each iteration based on statistics collected from a population of many chains. This algorithm is designed to make good use of SIMD hardware accelerators such as GPUs, allowing most chains to be updated in parallel each iteration. Unlike typical adaptive-MCMC algorithms, our ECA algorithm does not perturb the chain's stationary distribution, and therefore does not need to be ``frozen'' after warmup. Empirically, we find that the proposed algorithm quickly converges to its stationary distribution, producing accurate estimates of posterior expectations with relatively few gradient evaluations per chain.

Strategic classification studies the design of a classifier robust to the manipulation of input by strategic individuals. However, the existing literature does not consider the effect of competition among individuals as induced by the algorithm design. Motivated by constrained allocation settings such as college admissions, we introduce strategic ranking, in which the (designed) individual reward depends on an applicant's post-effort rank in a measurement of interest. Our results illustrate how competition among applicants affects the resulting equilibria and model insights. We analyze how various ranking reward designs, belonging to a family of step functions, trade off applicant, school, and societal utility, as well as how ranking design counters inequities arising from disparate access to resources. In particular, we find that randomization in the reward design can mitigate two measures of disparate impact, welfare gap and access.

Chih-Kuan Yeh · Kuan-Yun Lee · Frederick Liu · Pradeep Ravikumar

A popular explainable AI (XAI) approach to quantify feature importance of a given model is via Shapley values. These Shapley values arose in cooperative games, and hence a critical ingredient to compute these in an XAI context is a so-called value function, that computes the ``value'' of a subset of features, and which connects machine learning models to cooperative games. There are many possible choices for such value functions, which broadly fall into two categories: on-manifold and off-manifold value functions, which take an observational and an interventional viewpoint respectively. Both these classes however have their respective flaws, where on-manifold value functions violate key axiomatic properties and are computationally expensive, while off-manifold value functions pay less heed to the data manifold and evaluate the model on regions for which it wasn't trained. Thus, there is no consensus on which class of value functions to use. In this paper, we show that in addition to these existing issues, both classes of value functions are prone to adversarial manipulations on low density regions. We formalize the desiderata of value functions that respect both the model and the data manifold in a set of axioms and are robust to perturbation on off-manifold regions, and show that there exists a unique value function that satisfies these axioms, which we term the Joint Baseline value function, and the resulting Shapley value the Joint Baseline Shapley (JBshap), and validate the effectiveness of JBshap in experiments.

We consider the problem of multiway clustering in the presence of unknown degree heterogeneity. Such data problems arise commonly in applications such as recommendation system, neuroimaging, community detection, and hypergraph partitions in social networks. The allowance of degree heterogeneity provides great flexibility in clustering models, but the extra complexity poses significant challenges in both statistics and computation. Here, we develop a degree-corrected tensor block model with estimation accuracy guarantees. We present the phase transition of clustering performance based on the notion of angle separability, and we characterize three signal-to-noise regimes corresponding to different statistical-computational behaviors. In particular, we demonstrate that an intrinsic statistical-to-computational gap emerges only for tensors of order three or greater. Further, we develop an efficient polynomial-time algorithm that provably achieves exact clustering under mild signal conditions. The efficacy of our procedure is demonstrated through both simulations and analyses of Peru Legislation dataset.

This paper is concerned with the algorithmic aspects of sharper stationarity of a nonconvex, nonsmooth, Clarke irregular machine learning model. We study the SVM problem with a $\rho$-margin loss function, which is the margin theory generalization bound of SVM introduced in the learning theory textbook by Mohri et al. [2018], and has been extensively studied in operations research, statistics, and machine learning communities. However, due to its nonconvex, nonsmooth, and irregular nature, none of the existing optimization methods can efficiently compute a d(irectional)-stationary point, which turns out to be also a local minimum, for the $\rho$-margin loss SVM problem. After a detailed discussion of various nonsmooth stationarity notions, we propose a highly efficient nonconvex semi-proximal ADMM-based scheme that provably computes d-stationary points and enjoys a local linear convergence rate. We report concrete examples to demonstrate the necessity of our assumptions. Numerical results verify the effectiveness of the new algorithm and complement our theoretical results.

Aman Bansal · Rahul Chunduru · Deepesh Data · Manoj Prabhakaran

Differential Privacy (DP) has become a gold standard in privacy-preserving data analysis. While it provides one of the most rigorous notions of privacy, there are many settings where its applicability is limited.Our main contribution is in augmenting differential privacy with {\em Flexible Accuracy}, which allows small distortions in the input (e.g., droppingoutliers) before measuring accuracy of the output, allowing one to extend DPmechanisms to high-sensitivity functions. We present mechanisms that can help in achieving this notion for functions that had no meaningful differentially private mechanisms previously. In particular, we illustrate an application to differentially private histograms, which in turn yields mechanisms for revealing the support of a dataset or the extremal values in the data. Analyses of our constructions exploit newversatile composition theorems that facilitate modular design. All the above extensions use our new definitional framework, which is in terms of ``lossy Wasserstein distance'' -- a 2-parameter error measure for distributions. This may be of independent interest.

Meta-, multi-task, and federated learning can be all viewed as solving similar tasks, drawn from a distribution that reflects task similarities. We provide a unified view of all these problems, as learning to act in a hierarchical Bayesian bandit. We propose and analyze a natural hierarchical Thompson sampling algorithm (HierTS) for this class of problems. Our regret bounds hold for many variants of the problems, including when the tasks are solved sequentially or in parallel; and show that the regret decreases with a more informative prior. Our proofs rely on a novel total variance decomposition that can be applied beyond our models. Our theory is complemented by experiments, which show that the hierarchy helps with knowledge sharing among the tasks. This confirms that hierarchical Bayesian bandits are a universal and statistically-efficient tool for learning to act with similar bandit tasks.

Branislav Kveton · Ofer Meshi · Masrour Zoghi · Zhen Qin

This paper addresses the cold-start problem in online learning to rank (OLTR). We show both theoretically and empirically that priors improve the quality of ranked lists presented to users interactively based on user feedback. These priors can come in the form of unbiased estimates of the relevance of the ranked items, or more practically, can be obtained from offline-learned models. Our experiments show the effectiveness of priors in improving the short-term regret of tabular OLTR algorithms, based on Thompson sampling and BayesUCB.

Matias Altamirano · Felipe Tobar

Kernel design for Multi-output Gaussian Processes (MOGP) has received increased attention recently, in particular, the Multi-Output Spectral Mixture kernel (MOSM) approach has been praised as a general model in the sense that it extends other approaches such as Linear Model of Corregionalization, Intrinsic Corregionalization Model and Cross-Spectral Mixture. MOSM relies on Cramér’s theorem to parametrise the power spectral densities (PSD) as a Gaussian mixture, thus, having a structural restriction: by assuming the existence of a PSD, the method is only suited for multi-output stationary processes. We develop a nonstationary extension of MOSM by proposing the family of harmonizable kernels for MOGPs, a class of kernels that contains both stationary and a vast majority of non-stationary processes. A main contribution of the proposed harmonizable kernels is that they automatically identify a possible nonstationary behaviour meaning that practitioners do not need to choose between stationary or non-stationary kernels. The proposed method is first validated on synthetic data with the purpose of illustrating the key properties of our approach, and then compared to existing MOGP methods on two real-world settings from finance and electroencephalography.

Piotr Indyk · Frederik Mallmann-Trenn · Slobodan Mitrovic · Ronitt Rubinfeld

We consider online algorithms for the {\em page migration problem} that use predictions, potentially imperfect, to improve their performance. The best known online algorithms for this problem, due to Westbrook'94 and Bienkowski et al'17, have competitive ratios strictly bounded away from 1. In contrast, we show that if the algorithm is given a prediction of the input sequence, then it can achieve a competitive ratio that tends to $1$ as the prediction error rate tends to $0$. Specifically, the competitive ratio is equal to $1+O(q)$, where $q$ is the prediction error rate. We also design a ``fallback option'' that ensures that the competitive ratio of the algorithm for {\em any} input sequence is at most $O(1/q)$. Our result adds to the recent body of work that uses machine learning to improve the performance of ``classic'' algorithms.

John Nguyen · Kshitiz Malik · Hongyuan Zhan · Ashkan Yousefpour · Mike Rabbat · Mani Malek · Dzmitry Huba

Scalability and privacy are two critical concerns for cross-device federated learning (FL) systems. In this work, we identify that synchronous FL — synchronized aggregation of client updates in FL — cannot scale efficiently beyond a few hundred clients training in parallel. It leads to diminishing returns in model performance and training speed, analogous to large-batch training. On the other hand, asynchronous aggregation of client updates in FL (i.e., asynchronous FL) alleviates the scalability issue. However, aggregating individual client updates is incompatible with Secure Aggregation, which could result in an undesirable level of privacy for the system. To address these concerns, we propose a novel buffered asynchronous aggregation method, FedBuff, that is agnostic to the choice of optimizer, and combines the best properties of synchronous and asynchronous FL. We empirically demonstrate that FedBuff is 3.3x⇥ more efficient than synchronous FL and up to 2.5x⇥ more efficient than asynchronous FL, while being compatible with privacy-preserving technologies such as Secure Aggregation and differential privacy. We provide theoretical convergence guarantees in a smooth non-convex setting. Finally, we show that under differentially private training, FedBuff can outperform FedAvgM at low privacy settings and achieve the same utility for higher privacy settings.

We study the problem of adversarially robust learning in the transductive setting. For classes H of bounded VC dimension, we propose a simple transductive learner that when presented with a set of labeled training examples and a set of unlabeled test examples (both sets possibly adversarially perturbed), it correctly labels the test examples with a robust error rate that is linear in the VC dimension and is adaptive to the complexity of the perturbation set. This result provides an exponential improvement in dependence on VC dimension over the best known upper bound on the robust error in the inductive setting, at the expense of competing with a more restrictive notion of optimal robust error.

Chenjun Xiao · ILBIN LEE · Bo Dai · Dale Schuurmans · Csaba Szepesvari

In high stake applications, active experimentation may be considered too risky and thus data are often collected passively. While in simple cases, such as in bandits, passive and active data collection are similarly effective, the price of passive sampling can be much higher when collecting data from a system with controlled states. The main focus of the current paper is the characterization of this price.For example, when learning in episodic finite state-action Markov decision processes (MDPs) with $\rS$ states and $\rA$ actions, we show that even with the best (but passively chosen) logging policy, $\Omega(\rA^{\min(\rS-1, H)}/\varepsilon^2)$ episodes are necessary (and sufficient) to obtain an $\epsilon$-optimal policy, where $H$ is the length of episodes.Note that this shows that the sample complexity blows up exponentially compared to the case of active data collection, a result which is not unexpected, but, as far as we know, have not been published beforehand and perhaps the form of the exact expression is a little surprising. We also extend these results in various directions, such as other criteria or learning in the presence of function approximation, with similar conclusions. A remarkable feature of our result is the sharp characterization of the exponent that appears, which is critical for understanding what makes passive learning hard.

Robert Dyro · Edward Schmerling · Nikos Arechiga · Marco Pavone

In this work we derive a second-order approach to bilevel optimization, a type of mathematical programming in which the solution to a parameterized optimization problem (the `lower'' problem) is itself to be optimized (in the`

upper'' problem) as a function of the parameters. Many existing approaches to bilevel optimization employ first-order sensitivity analysis, based on the implicit function theorem (IFT), for the lower problem to derive a gradient of the lower problem solution with respect to its parameters; this IFT gradient is then used in a first-order optimization method for the upper problem. This paper extends this sensitivity analysis to provide second-order derivative information of the lower problem (which we call the IFT Hessian), enabling the usage of faster-converging second-order optimization methods at the upper level. Our analysis shows that (i) much of the computation already used to produce the IFT gradient can be reused for the IFT Hessian, (ii) errors bounds derived for the IFT gradient readily apply to the IFT Hessian, (iii) computing IFT Hessians can significantly reduce overall computation by extracting more information from each lower level solve. We corroborate our findings and demonstrate the broad range of applications of our method by applying it to problem instances of least squares hyperparameter auto-tuning, multi-class SVM auto-tuning, and inverse optimal control.

Shengchao Liu · Meng Qu · Zuobai Zhang · Huiyu Cai · Jian Tang

Multi-task learning for molecular property prediction is becoming increasingly important in drug discovery. However, in contrast to other domains, the performance of multi-task learning in drug discovery is still not satisfying as the number of labeled data for each task is too limited, which calls for additional data to complement the data scarcity. In this paper, we study multi-task learning for molecular property prediction in a novel setting, where a relation graph between tasks is available. We first construct a dataset including around 400 tasks as well as a task relation graph. Then to better utilize such relation graph, we propose a method called SGNN-EBM to systematically investigate the structured task modeling from two perspectives. (1) In the \emph{latent} space, we model the task representations by applying a state graph neural network (SGNN) on the relation graph. (2) In the \emph{output} space, we employ structured prediction with the energy-based model (EBM), which can be efficiently trained through noise-contrastive estimation (NCE) approach. Empirical results justify the effectiveness of SGNN-EBM. Code is available on https://github.com/chao1224/SGNN-EBM.

Tatsuki Koga · Casey Meehan · Kamalika Chaudhuri

Aggregate time-series data like traffic flow and site occupancy repeatedly sample statistics from a population across time. Such data can be profoundly useful for understanding trends within a given population, but also pose a significant privacy risk, potentially revealing \emph{e.g.,} who spends time where. Producing a private version of a time-series satisfying the standard definition of Differential Privacy (DP) is challenging due to the large influence a single participant can have on the sequence: if an individual can contribute to each time step, the amount of additive noise needed to satisfy privacy increases linearly with the number of time steps sampled. As such, if a signal spans a long duration or is oversampled, an excessive amount of noise must be added, drowning out underlying trends. However, in many applications an individual realistically \emph{cannot} participate at every time step. When this is the case, we observe that the influence of a single participant (sensitivity) can be reduced by subsampling and/or filtering in time, while still meeting privacy requirements. Using a novel analysis, we show this significant reduction in sensitivity and propose a corresponding class of privacy mechanisms. We demonstrate the utility benefits of these techniques empirically with real-world and synthetic time-series data.

In this paper, we study the problem of recovering the community structure of a network under federated myopic learning. Under this paradigm, we have several clients, each of them having a myopic view, i.e., observing a small subgraph of the network. Each client sends a censored evidence graph to a central server. We provide an efficient algorithm, which computes a consensus signed weighted graph from clients evidence, and recovers the underlying network structure in the central server. We analyze the topological structure conditions of the network, as well as the signal and noise levels of the clients that allow for recovery of the network structure. Our analysis shows that exact recovery is possible and can be achieved in polynomial time. In addition, our experiments show that in an extremely sparse network with 10000 nodes, our method can achieve exact recovery of the community structure even if every client has access to only 20 nodes.We also provide information-theoretic limits for the central server to recover the network structure from any single client evidence.Finally, as a byproduct of our analysis, we provide a novel Cheeger-type inequality for general signed weighted graphs.

Siu Lun Chau · Javier Gonzalez · Dino Sejdinovic

We revisit widely used preferential Gaussian processes (PGP) by Chu and Ghahramani [2005] and challenge their modelling assumption that imposes rankability of data items via latent utility function values. We propose a generalisation of PGP which can capture more expressive latent preferential structures in the data and thus be used to model inconsistent preferences, i.e. where transitivity is violated, or to discover clusters of comparable items via spectral decomposition of the learned preference functions. We also consider the properties of associated covariance kernel functions and its reproducing kernel Hilbert Space (RKHS), giving a simple construction that satisfies universality in the space of preference functions. Finally, we provide an extensive set of numerical experiments on simulated and real-world datasets showcasing the competitiveness of our proposed method with state-of-the-art. Our experimental findings support the conjecture that violations of rankability are ubiquitous in real-world preferential data.

Yaodong Yu · Tianyi Lin · Eric Mazumdar · Michael Jordan

Distributionally robust supervised learning (DRSL) is emerging as a key paradigm for building reliable machine learning systems for real-world applications---reflecting the need for classifiers and predictive models that are robust to the distribution shifts that arise from phenomena such as selection bias or nonstationarity. Existing algorithms for solving Wasserstein DRSL--- one of the most popular DRSL frameworks based around robustness to perturbations in the Wasserstein distance---have serious limitations that limit their use in large-scale problems---in particular they involve solving complex subproblems and they fail to make use of stochastic gradients. We revisit Wasserstein DRSL through the lens of min-max optimization and derive scalable and efficiently implementable stochastic extra-gradient algorithms which provably achieve faster convergence rates than existing approaches. We demonstrate their effectiveness on synthetic and real data when compared to existing DRSL approaches. Key to our results is the use of variance reduction and random reshuffling to accelerate stochastic min-max optimization, the analysis of which may be of independent interest.

Joey Hong · Branislav Kveton · Manzil Zaheer · Mohammad Ghavamzadeh · Craig Boutilier

We study Thompson sampling (TS) in online decision making, where the uncertain environment is sampled from a mixture distribution. This is relevant in multi-task learning, where a learning agent faces different classes of problems. We incorporate this structure in a natural way by initializing TS with a mixture prior, and call the resulting algorithm MixTS. To analyze MixTS, we develop a novel and general proof technique for analyzing the concentration of mixture distributions. We use it to derive Bayes regret bounds for MixTS in both linear bandits and finite-horizon reinforcement learning (RL). Our regret bounds reflect the structure of the mixture prior, and depend on the number of mixture components and their width. We demonstrate the empirical effectiveness of MixTS in synthetic and real-world experiments.

Rachit Chhaya · Anirban Dasgupta · Jayesh Choudhari · Supratim Shit

In this paper we present coresets for Fair Regression with Statistical Parity (SP) constraints and for Individually Fair Clustering. Due to the fairness constraints, the classical coreset definition is not enough for these problems. We first define coresets for both the problems. We show that to obtain such coresets, it is sufficient to sample points based on the probabilities dependent on combination of sensitivity score and a carefully chosen term according to the fairness constraints. We give provable guarantees with relative error in preserving the cost and a small additive error in preserving fairness constraints for both problems. Since our coresets are much smaller in size as compared to $n$, the number of points, they can give huge benefits in computational costs (from polynomial to polylogarithmic in $n$), especially when $n \gg d$, where $d$ is the input dimension. We support our theoretical claims with experimental evaluations.

Eduardo Pavez

This paper considers the problem of estimating high dimensional Laplacian constrained precision matrices by minimizing Stein's loss. We obtain a necessary and sufficient condition for existence of this estimator, that consists on checking whether a certain data dependent graph is connected. We also prove consistency in the high dimensional setting under the symmetrized Stein loss. We show that the error rate does not depend on the graph sparsity, or other type of structure, and that Laplacian constraints are sufficient for high dimensional consistency. Our proofs exploit properties of graph Laplacians, the matrix tree theorem, and a characterization of the proposed estimator based on effective graph resistances. We validate our theoretical claims with numerical experiments.

Antonio Orvieto · Jonas Kohler · Dario Pavllo · Thomas Hofmann · Aurelien Lucchi

Deep ReLU networks are at the basis of many modern neural architectures. Yet, the loss landscape of such networks and its interaction with state-of-the-art optimizers is not fully understood. One of the most crucial aspects is the landscape at random initialization, which often influences convergence speed dramatically. In their seminal works, Xavier & Bengio, 2010 and He et al., 2015 propose an initialization strategy that is supposed to prevent gradients from vanishing. Yet, we identify some shortcomings of their expectation analysis as network depth increases, and show that the proposed initialization can actually fail to deliver stable gradient norms. More precisely, by leveraging an in-depth analysis of the median of the forward pass, we first show that, with high probability, vanishing gradients cannot be circumvented when the network width scales with less than O(depth). Second, we extend this analysis to second-order derivatives and show that random i.i.d. initialization also gives rise to Hessian matrices with eigenspectra that vanish as networks grow in depth. Whenever this happens, optimizers are initialized in a very flat, saddle point-like plateau, which is particularly hard to escape with stochastic gradient descent (SGD) as its escaping time is inversely related to curvature magnitudes. We believe that this observation is crucial for fully understanding (a) the historical difficulties of training deep nets with vanilla SGD and (b) the success of adaptive gradient methods, which naturally adapt to curvature and thus quickly escape flat plateaus.

The growing use of machine learning models in consequential settings has highlighted an important and seemingly irreconcilable tension between transparency and vulnerability to gaming. While this has sparked sizable debate in legal literature, there has been comparatively less technical study of this contention. In this work, we propose a clean-cut formulation of this tension and a way to make the tradeoff between transparency and gaming. We identify the source of gaming as being points close to the \emph{decision boundary} of the model. And we initiate an investigation on how to provide example-based explanations that are expansive and yet consistent with a version space that is sufficiently uncertain with respect to the boundary points' labels. Finally, we furnish our theoretical results with empirical investigations of this tradeoff on real-world datasets.

YooJung Choi · Tal Friedman · Guy Van den Broeck

Probabilistic circuits (PCs) are a class of tractable probabilistic models that allow efficient, often linear-time, inference of queries such as marginals and most probable explanations (MPE). However, marginal MAP, which is central to many decision-making problems, remains a hard query for PCs unless they satisfy highly restrictive structural constraints. In this paper, we develop a pruning algorithm that removes parts of the PC that are irrelevant to a marginal MAP query, shrinking the PC while maintaining the correct solution. This pruning technique is so effective that we are able to build a marginal MAP solver based solely on iteratively transforming the circuit---no search is required. We empirically demonstrate the efficacy of our approach on real-world datasets.

In lifelong learning, tasks (or classes) to be learned arrive sequentially over time in arbitrary order. During training, knowledge from previous tasks can be captured and transferred to subsequent ones to improve sample efficiency. We consider the setting where all target tasks can be represented in the span of a small number of unknown linear or nonlinear features of the input data. We propose a lifelong learning algorithm that maintains and refines the internal feature representation. We prove that for any desired accuracy on all tasks, the dimension of the representation remains close to that of the underlying representation. The resulting sample complexity improves significantly on existing bounds. In the setting of linear features, our algorithm is provably efficient and the sample complexity for input dimension $d$, $m$ tasks with $k$ features up to error $\epsilon$ is $\tilde{O}(dk^{1.5}/\epsilon+km/\epsilon)$. We also prove a matching lower bound for any lifelong learning algorithm that uses a single task learner as a black box. We complement our analysis with an empirical study, including a heuristic lifelong learning algorithm for deep neural networks. Our method performs favorably on challenging realistic image datasets compared to state-of-the-art continual learning methods.

Vincent Hsiao · Dana Nau · Rina Dechter

Bayesian Networks are useful for analyzing the properties of systems with large populations of interacting agents (e.g., in social modeling applications and distributed service applications). These networks typically have large functions (CPTs), making exact inference intractable. However, often these models have additive symmetry. In this paper we show how summation-based CPTs, especially in the presence of symmetry, can be computed efficiently through the usage of the Fast Fourier Transform (FFT).In particular, we propose an efficient method using the FFT for reducing the size of Conditional Probability Tables (CPTs) in Bayesian Networks with summation-based causal independence (CI). We then show how to apply this reduction directly towards the acceleration of Bucket Elimination, and we subsequently provide experimental results demonstrating the computational speedup provided by our method.

We study model selection in linear bandits, where the learner must adapt to the dimension (denoted by $d_\star$) of the smallest hypothesis class containing the true linear model while balancing exploration and exploitation. Previous papers provide various guarantees for this model selection problem, but have limitations; i.e., the analysis requires favorable conditions that allow for inexpensive statistical testing to locate the right hypothesis class or are based on the idea of ``corralling'' multiple base algorithms, which often performs relatively poorly in practice. These works also mainly focus on upper bounds. In this paper, we establish the first lower bound for the model selection problem. Our lower bound implies that, even with a fixed action set, adaptation to the unknown dimension $d_\star$ comes at a cost: There is no algorithm that can achieve the regret bound $\widetilde{O}(\sqrt{d_\star T})$ simultaneously for all values of $d_\star$. We propose Pareto optimal algorithms that match the lower bound. Empirical evaluations show that our algorithm enjoys superior performance compared to existing ones.

Deployed supervised machine learning models make predictions that interact with and influence the world. This phenomenon is called *performative prediction* by Perdomo et al. (ICML 2020). It is an ongoing challenge to understand the influence of such predictions as well as design tools so as to control that influence. We propose a theoretical framework where the response of a target population to the deployed classifier is modeled as a function of the classifier and the current state (distribution) of the population. We show necessary and sufficient conditions for convergence to an equilibrium of two retraining algorithms, *repeated risk minimization* and a lazier variant. Furthermore, convergence is near an optimal classifier. We thus generalize results of Perdomo et al., whose performativity framework does not assume any dependence on the state of the target population. A particular phenomenon captured by our model is that of distinct groups that acquire information and resources at different rates to be able to respond to the latest deployed classifier. We study this phenomenon theoretically and empirically.

Rebecca Roelofs · Nicholas Cain · Jonathon Shlens · Mike Mozer

For an AI system to be reliable, the confidence it expresses in its decisions must match its accuracy. To assess the degree of match, examples are typically binned by confidence and the per-bin mean confidence and accuracy are compared. Most research in calibration focuses on techniques to reduce this empirical measure of calibration error, ECE*bin. We instead focus on assessing statistical bias in this empirical measure, and we identify better estimators. We propose a framework through which we can compute the bias of a particular estimator for an evaluation data set of a given size. The framework involves synthesizing model outputs that have the same statistics as common neural architectures on popular data sets. We find that binning-based estimators with bins of equal mass (number of instances) have lower bias than estimators with bins of equal width. Our results indicate two reliable calibration-error estimators: the debiased estimator (Brocker, 2012; Ferro and Fricker, 2012) and a method we propose, ECE*sweep, which uses equal-mass bins and chooses the number of bins to be as large as possible while preserving monotonicity in the calibration function. With these estimators, we observe improvements in the effectiveness of recalibration methods and in the detection of model miscalibration.

Michael Dinitz · Aravind Srinivasan · Leonidas Tsepenekas · Anil Vullikanti

Graph cut problems are fundamental in combinatorial Optimization, and are a central object of study in both theory and practice. Further, the study of fairness in Algorithmic Design and Machine Learning has recently received significant attention, with many different notions proposed and analyzed for a variety of contexts. In this paper we initiate the study of fairness for graph cut problems by giving the first fair definitions for them, and subsequently we demonstrate appropriate algorithmic techniques that yield a rigorous theoretical analysis. Specifically, we incorporate two different notions of fairness, namely demographic and probabilistic individual fairness, in a particular cut problem that models disaster containment scenarios. Our results include a variety of approximation algorithms with provable theoretical guarantees.

Sumedh Sontakke · Stephen Iota · Zizhao Hu · Arash Mehrjou · Laurent Itti · Bernhard Schölkopf

Out-of-distribution (OOD) detection is a well-studied topic in supervised learning. Extending the successes in supervised learning methods to the reinforcement learning (RL) setting, however, is difficult due to the data generating process - RL agents actively query their environment for data and this data is a function of the policy followed by the agent. Thus, an agent could neglect a shift in the environment if its policy did not lead it to explore the aspect of the environment that shifted. Therefore, to achieve safe and robust generalization in RL, there exists an unmet need for OOD detection through active experimentation. Here, we attempt to bridge this lacuna by first - defining a causal framework for OOD scenarios or environments encountered by RL agents in the wild. Then, we propose a novel task - that of Out-of-Task Distribution (OOTD) detection. We introduce an RL agent which actively experiments in a test environment and subsequently concludes whether it is OOTD or not. We name our method GalilAI, in honor of Galileo Galilei, as it also discovers, among other causal processes, that gravitational acceleration is independent of the mass of a body. Finally, we propose a simple probabilistic neural network baseline for comparison, which extends extant Model-Based RL. We find that our method outperforms the baseline significantly.

Zhiyuan Jerry Lin · Raul Astudillo · Peter Frazier · Eytan Bakshy

We consider Bayesian optimization of expensive-to-evaluate experiments that generate vector-valued outcomes over which a decision-maker (DM) has preferences. These preferences are encoded by a utility function that is not known in closed form but can be estimated by asking the DM to express preferences over pairs of outcome vectors. To address this problem, we develop Bayesian optimization with preference exploration, a novel framework that alternates between interactive real-time preference learning with the DM via pairwise comparisons between outcomes, and Bayesian optimization with a learned compositional model of DM utility and outcomes. Within this framework, we propose preference exploration strategies specifically designed for this task, and demonstrate their performance via extensive simulation studies.

Sébastien Arnold · Pierre L'Ecuyer · Liyu Chen · Yi-fan Chen · Fei Sha

Hard integrals arise frequently in reinforcement learning, for example when computing expectations in policy evaluation and policy iteration. They are often analytically intractable and typically estimated with Monte Carlo methods, whose sampling contributes to high variance in policy values and gradients. In this work, we propose to replace Monte Carlo samples with low-discrepancy point sets. We combine policy gradient methods with Randomized Quasi-Monte Carlo, yielding variance-reduced formulations of policy gradient and actor-critic algorithms. These formulations are effective for policy evaluation and policy improvement, as they outperform state-of-the-art algorithms on standardized continuous control benchmarks. Our empirical analyses validate the intuition that replacing Monte Carlo with Quasi-Monte Carlo yields significantly more accurate gradient estimates.

Jinhang Zuo · Xutong Liu · Carlee Joe-Wong · John C. S. Lui · Wei Chen

Online influence maximization has attracted much attention as a way to maximize influence spread through a social network while learning the values of unknown network parameters. Most previous works focus on single-item diffusion. In this paper, we introduce a new Online Competitive Influence Maximization (OCIM) problem, where two competing items (e.g., products, news stories) propagate in the same network and influence probabilities on edges are unknown. We adopt a combinatorial multi-armed bandit (CMAB) framework for OCIM, but unlike the non-competitive setting, the important monotonicity property (influence spread increases when influence probabilities on edges increase) no longer holds due to the competitive nature of propagation, which brings a significant new challenge to the problem. We provide a nontrivial proof showing that the Triggering Probability Modulated (TPM) condition for CMAB still holds in OCIM, which is instrumental for our proposed algorithms OCIM-TS and OCIM-OFU to achieve sublinear Bayesian and frequentist regret, respectively. We also design an OCIM-ETC algorithm that requires less feedback and easier offline computation, at the expense of a worse frequentist regret bound. Experimental evaluations demonstrate the effectiveness of our algorithms.

Setareh Ariafar · Justin Gilmer · Zachary Nado · Jasper Snoek · Rodolphe Jenatton · George Dahl

Black box optimization requires specifying a search space to explore for solutions, e.g. a d-dimensional compact space, and this choice is critical for getting the best results at a reasonable budget. Unfortunately, determining a high quality search space can be challenging in many applications. For example, when tuning hyperparameters for machine learning pipelines on a new problem given a limited budget, one must strike a balance between excluding potentially promising regions and keeping the search space small enough to be tractable. The goal of this work is to motivate---through example applications in tuning deep neural networks---the problem of predicting the quality of search spaces conditioned on budgets, as well as to provide a simple scoring method based on a utility function applied to a probabilistic response surface model, similar to Bayesian optimization. We show that the method we present can compute meaningful budget-conditional scores in a variety of situations. We also provide experimental evidence that accurate scores can be useful in constructing and pruning search spaces. Ultimately, we believe scoring search spaces should become standard practice in the experimental workflow for deep learning.

Variational Gaussian process (GP) approximations have become a standard tool in fast GP inference. This technique requires a user to select variational features to increase efficiency. So far the common choices in the literature are disparate and lacking generality. We propose to view the GP as lying in a Banach space which then facilitates a unified perspective. This is used to understand the relationship between existing features and to draw a connection between kernel ridge regression and variational GP approximations.

Ethan Weinberger · Nicasia Beebe-Wang · Su-In Lee

In the contrastive analysis (CA) setting, machine learning practitioners are specifically interested in discovering patterns that are enriched in a target dataset as compared to a background dataset generated from sources of variation irrelevant to the task at hand. For example, a biomedical data analyst may seek to understand variations in genomic data only present among patients with a given disease as opposed to those also present in healthy control subjects. Such scenarios have motivated the development of contrastive latent variable models to isolate variations unique to these target datasets from those shared across the target and background datasets, with current state of the art models based on the variational autoencoder (VAE) framework. However, previously proposed models do not explicitly enforce the constraints on latent variables underlying CA, potentially leading to the undesirable leakage of information between the two sets of latent variables. Here we propose the moment matching contrastive VAE (MM-cVAE), a reformulation of the VAE for CA that uses the maximum mean discrepancy to explicitly enforce two crucial latent variable constraints underlying CA. On three challenging CA tasks we find that our method outperforms the previous state-of-the-art both qualitatively and on a set of quantitative metrics.

Osama Hanna · Lin Yang · Christina Fragouli

The multi-armed bandit (MAB) problem is an active learning framework that aims to select the best among a set of actions by sequentially observing rewards. Recently, it has become popular for a number of applications over wireless networks, where communication constraints can form a bottleneck. Existing works usually fail to address this issue and can become infeasible in certain applications. In this paper we address the communication problem by optimizing the communication of rewards collected by distributed agents. By providing nearly matching upper and lower bounds, we tightly characterize the number of bits needed per reward for the learner to accurately learn without suffering additional regret. In particular, we establish a generic reward quantization algorithm, QuBan, that can be applied on top of any (no-regret) MAB algorithm to form a new communication-efficient counterpart, that requires only a few (as low as 3) bits to be sent per iteration while preserving the same regret bound. Our lower bound is established via constructing hard instances from a subgaussian distribution. Our theory is further corroborated by numerically experiments.

Saeid Naderiparizi · Adam Scibior · Andreas Munk · Mehrdad Ghadiri · Atilim Gunes Baydin · Bradley Gram-Hansen · Christian Schroeder de Witt · Robert Zinkov · Philip Torr · Tom Rainforth · Yee Whye Teh · Frank Wood

Naive approaches to amortized inference in probabilistic programs with unbounded loops can produce estimators with infinite variance. This is particularly true of importance sampling inference in programs that explicitly include rejection sampling as part of the user-programmed generative procedure. In this paper we develop a new and efficient amortized importance sampling estimator. We prove finite variance of our estimator and empirically demonstrate our method's correctness and efficiency compared to existing alternatives on generative programs containing rejection sampling loops and discuss how to implement our method in a generic probabilistic programming framework.