Abstract:

Ziping Xu · Amirhossein Meisami · Ambuj Tewari

This paper studies the decision making problem with Funnel Structure. Funnel structure, a well-known concept in the marketing field, occurs in those systems where the decision maker interacts with the environment in a layered manner receiving far fewer observations from deep layers than shallow ones. For example, in the email marketing campaign application, the layers correspond to Open, Click and Purchase events. Conversions from Click to Purchase happen very infrequently because a purchase cannot be made unless the link in an email is clicked on.

We formulate this challenging decision making problem as a contextual bandit with funnel structure and develop a multi-task learning algorithm that mitigates the lack of sufficient observations from deeper layers. We analyze both the prediction error and the regret of our algorithms. We verify our theory on prediction errors through a simple simulation. Experiments on both a simulated environment and an environment based on real-world data from a major email marketing company show that our algorithms offer significant improvement over previous methods.

Anilesh K. Krishnaswamy · Zhihao Jiang · Kangning Wang · Yu Cheng · Kamesh Munagala

Standard approaches to group-based notions of fairness, such as \emph{parity} and \emph{equalized odds}, try to equalize absolute measures of performance across known groups (based on race, gender, etc.). Consequently, a group that is inherently harder to classify may hold back the performance on other groups; and no guarantees can be provided for unforeseen groups. Instead, we propose a fairness notion whose guarantee, on each group $g$ in a class $\mathcal{G}$, is relative to the performance of the best classifier on $g$. We apply this notion to broad classes of groups, in particular, where (a) $\mathcal{G}$ consists of all possible groups (subsets) in the data, and (b) $\mathcal{G}$ is more streamlined. For the first setting, which is akin to groups being completely unknown, we devise the {\sc PF} (Proportional Fairness) classifier, which guarantees, on any possible group $g$, an accuracy that is proportional to that of the optimal classifier for $g$, scaled by the relative size of $g$ in the data set. Due to including all possible groups, some of which could be too complex to be relevant, the worst-case theoretical guarantees here have to be proportionally weaker for smaller subsets. For the second setting, we devise the {\sc BeFair} (Best-effort Fair) framework which seeks an accuracy, on every $g \in \mathcal{G}$, which approximates that of the optimal classifier on $g$, independent of the size of $g$. Aiming for such a guarantee results in a non-convex problem, and we design novel techniques to get around this difficulty when $\mathcal{G}$ is the set of linear hypotheses. We test our algorithms on real-world data sets, and present interesting comparative insights on their performance.

Ilker Demirel · Cem Tekin

Combinatorial bandit models and algorithms are used in many sequential decision-making tasks ranging from item list recommendation to influence maximization. Typical algorithms proposed for combinatorial bandits, including combinatorial UCB (CUCB) and combinatorial Thompson sampling (CTS) do not exploit correlations between base arms during the learning process. Moreover, their regret is usually analyzed under independent base arm outcomes. In this paper, we use Gaussian Processes (GPs) to model correlations between base arms. In particular, we consider a combinatorial bandit model with probabilistically triggered arms, and assume that the expected base arm outcome function is a sample from a GP. We assume that the learner has access to an exact computation oracle, which returns an optimal solution given expected base arm outcomes, and analyze the regret of Combinatorial Gaussian Process Upper Confidence Bound (ComGP-UCB) algorithm for this setting. Under (triggering probability modulated) Lipschitz continuity assumption on the expected reward function, we derive ($O( \sqrt{m T \log T \gamma_{T, \boldsymbol{\mu}}^{PTA}})$) $O(m \sqrt{\frac{T \log T}{p^*}})$ upper bounds for the regret of ComGP-UCB that hold with high probability, where $m$ denotes the number of base arms, $p^*$ denotes the minimum non-zero triggering probability, and $\gamma_{T, \boldsymbol{\mu}}^{PTA}$ denotes the pseudo-information gain. Finally, we show via simulations that when the correlations between base arm outcomes are strong, ComGP-UCB significantly outperforms CUCB and CTS.

Antonious Girgis · Deepesh Data · Suhas Diggavi · Peter Kairouz · Ananda Theertha Suresh

We consider a distributed empirical risk minimization (ERM) optimization problem with communication efficiency and privacy requirements, motivated by the federated learning (FL) framework. We propose a distributed communication-efficient and local differentially private stochastic gradient descent (CLDP-SGD) algorithm and analyze its communication, privacy, and convergence trade-offs. Since each iteration of the CLDP-SGD aggregates the client-side local gradients, we develop (optimal) communication-efficient schemes for mean estimation for several $\ell_p$ spaces under local differential privacy (LDP). To overcome performance limitation of LDP, CLDP-SGD takes advantage of the inherent privacy amplification provided by client subsampling and data subsampling at each selected client (through SGD) as well as the recently developed shuffled model of privacy. For convex loss functions, we prove that the proposed CLDP-SGD algorithm matches the known lower bounds on the \textit{centralized} private ERM while using a finite number of bits per iteration for each client, \emph{i.e.,} effectively getting communication efficiency for ``free''. We also provide preliminary experimental results supporting the theory.

Paxton Turner · Jingbo Liu · Philippe Rigollet

Coresets have emerged as a powerful tool to summarize data by selecting a small subset of the original observations while retaining most of its information. This approach has led to significant computational speedups but the performance of statistical procedures run on coresets is largely unexplored. In this work, we develop a statistical framework to study coresets and focus on the canonical task of nonparameteric density estimation. Our contributions are twofold. First, we establish the minimax rate of estimation achievable by coreset-based estimators. Second, we show that the practical coreset kernel density estimators are near-minimax optimal over a large class of Holder-smooth densities.

Chi Zhang · Carlos Cinelli · Bryant Chen · Judea Pearl

Assumptions about equality of effects are commonly made in causal inference tasks. For example, the well-known `difference-in-differences'' method assumes that confounding remains constant across time periods. Similarly, it is not unreasonable to assume that causal effects apply equally to units undergoing interference. Finally, sensitivity analysis often hypothesizes equality among existing and unaccounted for confounders. Despite the ubiquity of these`

equality constraints,'' modern identification methods have not leveraged their presence in a systematic way. In this paper, we develop a novel graphical criterion that extends the well-known method of generalized instrumental sets to exploit such additional constraints for causal identification in linear models. We further demonstrate how it solves many diverse problems found in the literature in a general way, including difference-in-differences, interference, as well as benchmarking in sensitivity analysis.

The iterative selection of examples for labeling in active machine learning is conceptually similar to feedback channel coding in information theory: in both tasks, the objective is to seek a minimal sequence of actions to encode information in the presence of noise. While this high-level overlap has been previously noted, there remain open questions on how to best formulate active learning as a communications system to leverage existing analysis and algorithms in feedback coding. In this work, we formally identify and leverage the structural commonalities between the two problems, including the characterization of encoder and noisy channel components, to design a new algorithm. Specifically, we develop an optimal transport-based feedback coding scheme called Approximate Posterior Matching (APM) for the task of active example selection and explore its application to Bayesian logistic regression, a popular model in active learning. We evaluate APM on a variety of datasets and demonstrate learning performance comparable to existing active learning methods, at a reduced computational cost. These results demonstrate the potential of directly deploying concepts from feedback channel coding to design efficient active learning strategies.

In this work, we address the problem of balanced treatment assignment for experiments by considering an interpretation of the problem as optimization of a two-sample test between test and control units. Using this lens we provide an assignment algorithm that is optimal with respect to the minimum spanning tree test of Friedman and Rafsky [1979]. This assignment to treatment groups may be performed exactly in polynomial time and allows for the design of experiments explicitly targeting the individual treatment effect. We provide a probabilistic interpretation of this process in terms of the most probable element of designs drawn from a determinantal point process. We provide a novel formulation of estimation as transductive inference and show how the tree structures used in design can also be used in an adjustment estimator. We conclude with a simulation study demonstrating the improved efficacy of our method.

Omer Deniz Akyildiz · Gerrit van den Burg · Theodoros Damoulas · Mark Steel

We introduce the probabilistic sequential matrix factorization (PSMF) method for factorizing time-varying and non-stationary datasets consisting of high-dimensional time-series. In particular, we consider nonlinear Gaussian state-space models where sequential approximate inference results in the factorization of a data matrix into a dictionary and time-varying coefficients with potentially nonlinear Markovian dependencies. The assumed Markovian structure on the coefficients enables us to encode temporal dependencies into a low-dimensional feature space. The proposed inference method is solely based on an approximate extended Kalman filtering scheme, which makes the resulting method particularly efficient. PSMF can account for temporal nonlinearities and, more importantly, can be used to calibrate and estimate generic differentiable nonlinear subspace models. We also introduce a robust version of PSMF, called rPSMF, which uses Student-t filters to handle model misspecification. We show that PSMF can be used in multiple contexts: modeling time series with a periodic subspace, robustifying changepoint detection methods, and imputing missing data in several high-dimensional time-series, such as measurements of pollutants across London.

We study a novel curriculum learning scheme where in each round, samples are selected to achieve the greatest progress and fastest learning speed towards the ground-truth on all available samples. Inspired by an analysis of optimization dynamics under gradient flow for both regression and classification, the problem reduces to selecting training samples by a score computed from samples' residual and linear temporal dynamics. It encourages the model to focus on the samples at learning frontier, i.e., those with large loss but fast learning speed. The scores in discrete time can be estimated via already-available byproducts of training, and thus require a negligible amount of extra computation. We discuss the properties and potential advantages of the proposed dynamics optimization via current deep learning theory and empirical study. By integrating it with cyclical training of neural networks, we introduce "dynamics-optimized curriculum learning (DoCL)", which selects the training set for each step by weighted sampling based on the scores. On nine different datasets, DoCL significantly outperforms random mini-batch SGD and recent curriculum learning methods both in terms of efficiency and final performance.

We consider dimensionality reduction for data sets with two or more independent degrees of freedom. For example, measurements of deformable shapes with several parts that move independently fall under this characterization. Mathematically, if the space of each continuous independent motion is a manifold, then their combination forms a product manifold. In this paper, we present an algorithm for manifold factorization given a sample of points from the product manifold. Our algorithm is based on spectral graph methods for manifold learning and the separability of the Laplacian operator on product spaces. Recovering the factors of a manifold yields meaningful lower-dimensional representations, allowing one to focus on particular aspects of the data space while ignoring others. We demonstrate the potential use of our method for an important and challenging problem in structural biology: mapping the motions of proteins and other large molecules using cryo-electron microscopy data sets.

We provide a deterministic space-efficient algorithm for estimating ridge regression. For n data points with d features and a large enough regularization parameter, we provide a solution within eps L_2 error using only O(d/eps) space. This is the first o(d^2) space deterministic streaming algorithm with guaranteed solution error and risk bound for this classic problem. The algorithm sketches the covariance matrix by variants of Frequent Directions, which implies it can operate in insertion-only streams and a variety of distributed data settings. In comparisons to randomized sketching algorithms on synthetic and real-world datasets, our algorithm has less empirical error using less space and similar time.

Kishor Jothimurugan · Osbert Bastani · Rajeev Alur

We propose a novel hierarchical reinforcement learning framework for control with continuous state and action spaces. In our framework, the user specifies subgoal regions which are subsets of states; then, we (i) learn options that serve as transitions between these subgoal regions, and (ii) construct a high-level plan in the resulting abstract decision process (ADP). A key challenge is that the ADP may not be Markov; we propose two algorithms for planning in the ADP that address this issue. Our first algorithm is conservative, allowing us to prove theoretical guarantees on its performance, which help inform the design of subgoal regions. Our second algorithm is a practical one that interweaves planning at the abstract level and learning at the concrete level. In our experiments, we demonstrate that our approach outperforms state-of-the-art hierarchical reinforcement learning algorithms on several challenging benchmarks.

Pratik Patil · Yuting Wei · Alessandro Rinaldo · Ryan Tibshirani

We examine generalized and leave-one-out cross-validation for ridge regression in a proportional asymptotic framework where the dimension of the feature space grows proportionally with the number of observations. Given i.i.d.\ samples from a linear model with an arbitrary feature covariance and a signal vector that is bounded in $\ell_2$ norm, we show that generalized cross-validation for ridge regression converges almost surely to the expected out-of-sample prediction error, uniformly over a range of ridge regularization parameters that includes zero (and even negative values). We prove the analogous result for leave-one-out cross-validation. As a consequence, we show that ridge tuning via minimization of generalized or leave-one-out cross-validation asymptotically almost surely delivers the optimal level of regularization for predictive accuracy, whether it be positive, negative, or zero.

We study the problem of robustly estimating the mean of a $d$-dimensional distribution given $N$ examples, where most coordinates of every example may be missing and $\varepsilon N$ examples may be arbitrarily corrupted. Assuming each coordinate appears in a constant factor more than $\varepsilon N$ examples, we show algorithms that estimate the mean of the distribution with information-theoretically optimal dimension-independent error guarantees in nearly-linear time $\widetilde O(Nd)$. Our results extend recent work on computationally-efficient robust estimation to a more widely applicable incomplete-data setting.

Alessandro Rinaldo · Daren Wang · Qin Wen · Rebecca Willett · Yi Yu

This paper addresses the problem of localizing change points in high-dimensional linear regression models with piecewise constant regression coefficients. We develop a dynamic programming approach to estimate the locations of the change points whose performance improves upon the current state-of-the-art, even as the dimension, the sparsity of the regression coefficients, the temporal spacing between two consecutive change points, and the magnitude of the difference of two consecutive regression coefficient vectors are allowed to vary with the sample size. Furthermore, we devise a computationally-efficient refinement procedure that provably reduces the localization error of preliminary estimates of the change points. We demonstrate minimax lower bounds on the localization error that nearly match the upper bound on the localization error of our methodology and show that the signal-to-noise condition we impose is essentially the weakest possible based on information-theoretic arguments. Extensive numerical results support our theoretical findings, and experiments on real air quality data reveal change points supported by historical information not used by the algorithm.

Yishay Mansour · Mehryar Mohri · Jae Ro · Ananda Theertha Suresh · Ke Wu

We study multiple-source domain adaptation, when the learner has access to abundant labeled data from multiple-source domains and limited labeled data from the target domain. We analyze existing algorithms for this problem, and propose a novel algorithm based on model selection. Our algorithms are efficient, and experiments on real data-sets empirically demonstrate their benefits.

Ian Covert · Su-In Lee

The Shapley value concept from cooperative game theory has become a popular technique for interpreting ML models, but efficiently estimating these values remains challenging, particularly in the model-agnostic setting. Here, we revisit the idea of estimating Shapley values via linear regression to understand and improve upon this approach. By analyzing the original KernelSHAP alongside a newly proposed unbiased version, we develop techniques to detect its convergence and calculate uncertainty estimates. We also find that the original version incurs a negligible increase in bias in exchange for significantly lower variance, and we propose a variance reduction technique that further accelerates the convergence of both estimators. Finally, we develop a version of KernelSHAP for stochastic cooperative games that yields fast new estimators for two global explanation methods.

Due to recent empirical successes, the options framework for hierarchical reinforcement learning is gaining increasing popularity. Rather than learning from rewards, we consider learning an options-type hierarchical policy from expert demonstrations. Such a problem is referred to as hierarchical imitation learning. Converting this problem to parameter inference in a latent variable model, we develop convergence guarantees for the EM approach proposed by Daniel et al. (2016b). The population level algorithm is analyzed as an intermediate step, which is nontrivial due to the samples being correlated. If the expert policy can be parameterized by a variant of the options framework, then, under regularity conditions, we prove that the proposed algorithm converges with high probability to a norm ball around the true parameter. To our knowledge, this is the first performance guarantee for an hierarchical imitation learning algorithm that only observes primitive state-action pairs.

Providing meaningful privacy to users of location based services is particularly challenging when multiple locations are revealed in a short period of time. This is primarily due to the tremendous degree of dependence that can be anticipated between points. We propose a R\'enyi divergence based privacy framework for bounding expected privacy loss for conditionally dependent data. Additionally, we demonstrate an algorithm for achieving this privacy under Gaussian process conditional priors. This framework both exemplifies why conditionally dependent data is so challenging to protect and offers a strategy for preserving privacy to within a fixed radius for sensitive locations in a user's trace.

Aditya Bhaskara · Ashok Cutkosky · Ravi Kumar · Manish Purohit

We consider the online linear optimization problem with movement costs, a variant of online learning in which the learner must not only respond to cost vectors $c_t$ with points $x_t$ in order to maintain low regret, but is also penalized for movement by an additional cost $\|x_t-x_{t+1}\|^{1+\epsilon}$ for some $\epsilon>0$. Classically, simple algorithms that obtain the optimal $\sqrt{T}$ regret already are very stable and do not incur a significant movement cost. However, recent work has shown that when the learning algorithm is provided with weak ``hint'' vectors that have a positive correlation with the costs, the regret can be significantly improved to $\log(T)$. In this work, we study the stability of such algorithms, and provide matching upper and lower bounds showing that incorporating movement costs results in intricate tradeoffs between $\log(T)$ when $\epsilon\ge 1$ and $\sqrt{T}$ regret when $\epsilon=0$.

Jeongyeol Kwon · Nhat Ho · Constantine Caramanis

We study the convergence rates of the EM algorithm for learning two-component mixed linear regression under all regimes of signal-to-noise ratio (SNR). We resolve a long-standing question that many recent results have attempted to tackle: we completely characterize the convergence behavior of EM, and show that the EM algorithm achieves minimax optimal sample complexity under all SNR regimes. In particular, when the SNR is sufficiently large, the EM updates converge to the true parameter $\theta^{*}$ at the standard parametric convergence rate $\calo((d/n)^{1/2})$ after $\calo(\log(n/d))$ iterations. In the regime where the SNR is above $\calo((d/n)^{1/4})$ and below some constant, the EM iterates converge to a $\calo({\rm SNR}^{-1} (d/n)^{1/2})$ neighborhood of the true parameter, when the number of iterations is of the order $\calo({\rm SNR}^{-2} \log(n/d))$. In the low SNR regime where the SNR is below $\calo((d/n)^{1/4})$, we show that EM converges to a $\calo((d/n)^{1/4})$ neighborhood of the true parameters, after $\calo((n/d)^{1/2})$ iterations. Notably, these results are achieved under mild conditions of either random initialization or an efficiently computable local initialization. By providing tight convergence guarantees of the EM algorithm in middle-to-low SNR regimes, we fill the remaining gap in the literature, and significantly, reveal that in low SNR, EM changes rate, matching the $n^{-1/4}$ rate of the MLE, a behavior that previous work had been unable to show.

andres munoz · Umar Syed · Sergei Vassilvtiskii · Ellen Vitercik

We study the problem of differentially private optimization with linear constraints when the right-hand-side of the constraints depends on private data. This type of problem appears in many applications, especially resource allocation. Previous research provided solutions that retained privacy but sometimes violated the constraints. In many settings, however, the constraints cannot be violated under any circumstances. To address this hard requirement, we present an algorithm that releases a nearly-optimal solution satisfying the constraints with probability 1. We also prove a lower bound demonstrating that the difference between the objective value of our algorithm's solution and the optimal solution is tight up to logarithmic factors among all differentially private algorithms. We conclude with experiments demonstrating that our algorithm can achieve nearly optimal performance while preserving privacy.

Meta-learning has enabled learning statistical models that can be quickly adapted to new prediction tasks. Motivated by use-cases in personalized federated learning, we study the often overlooked aspect of the modern meta-learning algorithms—their data efficiency. To shed more light on which methods are more efficient, we use techniques from algorithmic stability to derive bounds on the transfer risk that have important practical implications, indicating how much supervision is needed and how it must be allocated for each method to attain the desired level of generalization. Further, we introduce a new simple framework for evaluating meta-learning methods under a limit on the available supervision, conduct an empirical study of MAML, Reptile, andProtoNets, and demonstrate the differences in the behavior of these methods on few-shot and federated learning benchmarks. Finally, we propose active meta-learning, which incorporates active data selection into learning-to-learn, leading to better performance of all methods in the limited supervision regime.

Aristide Baratin · Thomas George · César Laurent · R Devon Hjelm · Guillaume Lajoie · Pascal Vincent · Simon Lacoste-Julien

We approach the problem of implicit regularization in deep learning from a geometrical viewpoint. We highlight a regularization effect induced by a dynamical alignment ofthe neural tangent features introduced by Jacot et al. (2018), along a small number of task-relevant directions. This can be interpreted as a combined mechanism of feature selection and compression. By extrapolating a new analysis of Rademacher complexity bounds for linear models, we motivate and study a heuristic complexity measure that captures this phenomenon, in terms of sequences of tangent kernel classes along optimization paths. The code for our experiments is available as https://github.com/tfjgeorge/ntk_alignment.

We propose a novel method for selective classification (SC), a problem which allows a classifier to abstain from predicting some instances, thus trading off accuracy against coverage (the fraction of instances predicted). In contrast to prior gating or confidence-set based work, our proposed method optimises a collection of class-wise decoupled one-sided empirical risks, and is in essence a method for explicitly finding the largest decision sets for each class that have few false positives. This one-sided prediction (OSP) based relaxation yields an SC scheme that attains near-optimal coverage in the practically relevant high target accuracy regime, and further admits efficient implementation, leading to a flexible and principled method for SC. We theoretically derive generalization bounds for SC and OSP, and empirically we show that our scheme strongly outperforms state of the art methods in coverage at small error levels.

Zichong Li · Pin-Yu Chen · Sijia Liu · Songtao Lu · Yangyang Xu

First-order methods have been studied for nonlinear constrained optimization within the framework of the augmented Lagrangian method (ALM) or penalty method. We propose an improved inexact ALM (iALM) and conduct a unified analysis for nonconvex problems with either convex or nonconvex constraints. Under certain regularity conditions (that are also assumed by existing works), we show an $\tilde{O}(\varepsilon^{-\frac{5}{2}})$ complexity result for a problem with a nonconvex objective and convex constraints and an $\tilde{O}(\varepsilon^{-3})$ complexity result for a problem with a nonconvex objective and nonconvex constraints, where the complexity is measured by the number of first-order oracles to yield an $\varepsilon$-KKT solution. Both results are the best known. The same-order complexity results have been achieved by penalty methods. However, two different analysis techniques are used to obtain the results, and more importantly, the penalty methods generally perform significantly worse than iALM in practice. Our improved iALM and analysis close the gap between theory and practice. Numerical experiments on nonconvex problems with convex or nonconvex constraints are provided to demonstrate the effectiveness of our proposed method.

Soumya Basu · Orestis Papadigenopoulos · Constantine Caramanis · Sanjay Shakkottai

We study a novel variant of the multi-armed bandit problem, where at each time step, the player observes an independently sampled context that determines the arms' mean rewards. However, playing an arm blocks it (across all contexts) for a fixed number of future time steps. The above contextual setting captures important scenarios such as recommendation systems or ad placement with diverse users. This problem has been recently studied [Dickerson et al., AAAI 2018] in the full-information setting (i.e., assuming knowledge of the mean context-dependent arm rewards), where competitive ratio bounds have been derived. We focus on the bandit setting, where these means are initially unknown; we propose a UCB-based variant of the full-information algorithm that guarantees a $\mathcal{O}(\log T)$-regret w.r.t. an $\alpha$-optimal strategy in $T$ time steps, matching the $\Omega(\log(T))$ regret lower bound in this setting. Due to the time correlations caused by blocking, existing techniques for upper bounding regret fail. For proving our regret bounds, we introduce the novel concepts of delayed exploitation and opportunistic subsampling and combine them with ideas from combinatorial bandits and non-stationary Markov chains coupling.

Song Wei · Shixiang Zhu · Minghe Zhang · Yao Xie

Recently there have been many research efforts in developing generative models for self-exciting point processes, partly due to their broad applicability for real-world applications. However, rarely can we quantify how well the generative model captures the nature or ground-truth since it is usually unknown. The challenge typically lies in the fact that the generative models typically provide, at most, good approximations to the ground-truth (e.g., through the rich representative power of neural networks), but they cannot be precisely the ground-truth. We thus cannot use the classic goodness-of-fit (GOF) test framework to evaluate their performance. In this paper, we develop a GOF test for generative models of self-exciting processes by making a new connection to this problem with the classical statistical theory of Quasi-maximum-likelihood estimator (QMLE). We present a non-parametric self-normalizing statistic for the GOF test: the Generalized Score (GS) statistics, and explicitly capture the model misspecification when establishing the asymptotic distribution of the GS statistic. Numerical simulation and real-data experiments validate our theory and demonstrate the proposed GS test's good performance.

We consider the problem of Bayesian Optimization (BO) in which the goal is to design an adaptive querying strategy to optimize a function $f:[0,1]^d\mapsto \reals$. The function is assumed to be drawn from a Gaussian Process, and can only be accessed through noisy oracle queries. The most commonly used oracle in BO literature is the noisy Zeroth-Order-Oracle~(ZOO) which returns noise-corrupted function value $y = f(x) + \eta$ at any point $x \in \domain$ queried by the agent. A less studied oracle in BO is the First-Order-Oracle~(FOO) which also returns noisy gradient value at the queried point. In this paper we consider the fundamental question of quantifying the possible improvement in regret that can be achieved under FOO access as compared to the case in which only ZOO access is available. Under some regularity assumptions on $K$, we first show that the expected cumulative regret with ZOO of any algorithm must satisfy a lower bound of $\Omega(\sqrt{2^d n})$, where $n$ is the query budget. This lower bound captures the appropriate scaling of the regret on both dimension $d$ and budget $n$, and relies on a novel reduction from BO to a multi-armed bandit~(MAB) problem. We then propose a two-phase algorithm which, with some additional prior knowledge, achieves a vastly improved $\mc{O}\lp d (\log n)^2 \rp$ regret when given access to a FOO. Together, these two results highlight the significant value of incorporating gradient information in BO algorithms.

We consider the problem of filtering in linear state-space models (e.g., the Kalman filter setting) through the lens of regret optimization. Specifically, we study the problem of causally estimating a desired signal, generated by a linear state-space model driven by process noise, based on noisy observations of a related observation process. We define a novel regret criterion for estimator design as the difference of the estimation error energies between a clairvoyant estimator that has access to all future observations (a so-called smoother) and a causal one that only has access to current and past observations. The regret-optimal estimator is the causal estimator that minimizes the worst-case regret across all bounded-energy noise sequences. We provide a solution for the regret filtering problem at two levels. First, an horizon-independent solution at the operator level is obtained by reducing the regret to the well-known Nehari problem. Secondly, our main result for state-space models is an explicit estimator that achieves the optimal regret. The regret-optimal estimator is represented as a finite-dimensional state-space whose parameters can be computed by solving three Riccati equations and a single Lyapunov equation. We demonstrate the applicability and efficacy of the estimator in a variety of problems and observe that the estimator has average and worst-case performances that are simultaneously close to their optimal values.

Zhenhuan Yang · Yunwen Lei · Siwei Lyu · Yiming Ying

Pairwise learning has recently received increasing attention since it subsumes many important machine learning tasks (e.g. AUC maximization and metric learning) into a unifying framework. In this paper, we give the first-ever-known stability and generalization analysis of stochastic gradient descent (SGD) for pairwise learning with non-smooth loss functions, which are widely used (e.g. Ranking SVM with the hinge loss). We introduce a novel decomposition in its stability analysis to decouple the pairwisely dependent random variables, and derive generalization bounds consistent with pointwise learning. Furthermore, we apply our stability analysis to develop differentially private SGD for pairwise learning, for which our utility bounds match with the state-of-the-art output perturbation method (Huai et al., 2020) with smooth losses. Finally, we illustrate the results using specific examples of AUC maximization and similarity metric learning. As a byproduct, we provide an affirmative solution to an open question on the advantage of the nuclear-norm constraint over Frobenius norm constraint in similarity metric learning.

Moses Charikar · Lunjia Hu

In the non-negative matrix factorization (NMF) problem, the input is an $m\times n$ matrix $M$ with non-negative entries and the goal is to factorize it as $M\approx AW$. The $m\times k$ matrix $A$ and the $k\times n$ matrix $W$ are both constrained to have non-negative entries. This is in contrast to singular value decomposition, where the matrices $A$ and $W$ can have negative entries but must satisfy the orthogonality constraint: the columns of $A$ are orthogonal and the rows of $W$ are also orthogonal. The orthogonal non-negative matrix factorization (ONMF) problem imposes both the non-negativity and the orthogonality constraints, and previous work showed that it leads to better performances than NMF on many clustering tasks. We give the first constant-factor approximation algorithm for ONMF when one or both of $A$ and $W$ are subject to the orthogonality constraint. We also show an interesting connection to the correlation clustering problem on bipartite graphs. Our experiments on synthetic and real-world data show that our algorithm achieves similar or smaller errors compared to previous ONMF algorithms while ensuring perfect orthogonality (many previous algorithms do not satisfy the hard orthogonality constraint).

Andrew Bennett · Nathan Kallus · Lihong Li · Ali Mousavi

Off-policy evaluation (OPE) in reinforcement learning is an important problem in settings where experimentation is limited, such as healthcare. But, in these very same settings, observed actions are often confounded by unobserved variables making OPE even more difficult. We study an OPE problem in an infinite-horizon, ergodic Markov decision process with unobserved confounders, where states and actions can act as proxies for the unobserved confounders. We show how, given only a latent variable model for states and actions, policy value can be identified from off-policy data. Our method involves two stages. In the first, we show how to use proxies to estimate stationary distribution ratios, extending recent work on breaking the curse of horizon to the confounded setting. In the second, we show optimal balancing can be combined with such learned ratios to obtain policy value while avoiding direct modeling of reward functions. We establish theoretical guarantees of consistency and benchmark our method empirically.

Yihan Wu · Aleksandar Bojchevski · Aleksei Kuvshinov · Stephan Günnemann

Randomized smoothing is currently the most competitive technique for providing provable robustness guarantees. Since this approach is model-agnostic and inherently scalable we can certify arbitrary classifiers. Despite its success, recent works show that for a small class of i.i.d. distributions, the largest $l_p$ radius that can be certified using randomized smoothing decreases as $O(1/d^{1/2-1/p})$ with dimension $d$ for $p > 2$. We complete the picture and show that similar no-go results hold for the $l_2$ norm for a much more general family of distributions which are continuous and symmetric about the origin. Specifically, we calculate two different upper bounds of the $l_2$ certified radius which have a constant multiplier of order $\Theta(1/d^{1/2})$. Moreover, we extend our results to $l_p (p>2)$ certification with spherical symmetric distributions solidifying the limitations of randomized smoothing. We discuss the implications of our results for how accuracy and robustness are related, and why robust training with noise augmentation can alleviate some of the limitations in practice. We also show that on real-world data the gap between the certified radius and our upper bounds is small.

Samuel Stanton · Wesley Maddox · Ian Delbridge · Andrew Gordon Wilson

Gaussian processes (GPs) provide a gold standard for performance in online settings, such as sample-efficient control and black box optimization, where we need to update a posterior distribution as we acquire data in a sequential online setting. However, updating a GP posterior to accommodate even a single new observation after having observed $n$ points incurs at least $\mathcal{O}(n)$ computations in the exact setting. We show how to use structured kernel interpolation to efficiently reuse computations for constant-time $\mathcal{O}(1)$ online updates with respect to the number of points $n$, while retaining exact inference. We demonstrate the promise of our approach in a range of online regression and classification settings, Bayesian optimization, and active sampling to reduce error in malaria incidence forecasting.

Learning rich visual representations using contrastive self-supervised learning has been extremely successful. However, it is still a major question whether we could use a similar approach to learn superior auditory representations. In this paper, we expand on prior work (SimCLR) to learn better auditory representations. We (1) introduce various data augmentations suitable for auditory data and evaluate their impact on predictive performance, (2) show that training with time-frequency audio features substantially improves the quality of the learned representations compared to raw signals, and (3) demonstrate that training with both supervised and contrastive losses simultaneously improves the learned representations compared to self-supervised pre-training followed by supervised fine-tuning. We illustrate that by combining all these methods and with substantially less labeled data, our framework (CLAR) achieves significant improvement on prediction performance compared to supervised approach. Moreover, compared to self-supervised approach, our framework converges faster with significantly better representations.

Charles Guille-Escuret · Manuela Girotti · Baptiste Goujaud · Ioannis Mitliagkas

In this work we introduce a new framework for the theoretical study of convergence and tuning of first-order optimization algorithms (FOA). The study of such algorithms typically requires assumptions on the objective functions: the most popular ones are probably smoothness and strong convexity. These metrics are used to tune the hyperparameters of FOA. We introduce a class of perturbations quantified via a new norm, called *-norm. We show that adding a small perturbation to the objective function has an equivalently small impact on the behavior of any FOA, which suggests that it should have a minor impact on the tuning of the algorithm. However, we show that smoothness and strong convexity can be heavily impacted by arbitrarily small perturbations, leading to excessively conservative tunings and convergence issues. In view of these observations, we propose a notion of continuity of the metrics, which is essential for a robust tuning strategy. Since smoothness and strong convexity are not continuous, we propose a comprehensive study of existing alternative metrics which we prove to be continuous. We describe their mutual relations and provide their guaranteed convergence rates for the Gradient Descent algorithm accordingly tuned. Finally we discuss how our work impacts the theoretical understanding of FOA and their performances.

Ali Lotfi Rezaabad · Rahi Kalantari · Sriram Vishwanath · Mingyuan Zhou · Jonathan Tamir

Efficient modeling of relational data arising in physical, social, and information sciences is challenging due to complicated dependencies within the data. In this work we build off of semi-implicit graph variational auto-encoders to capture higher order statistics in a low-dimensional graph latent representation. We incorporate hyperbolic geometry in the latent space through a Poincare embedding to efficiently represent graphs exhibiting hierarchical structure. To address the naive posterior latent distribution assumptions in classical variational inference, we use semi-implicit hierarchical variational Bayes to implicitly capture posteriors of given graph data, which may exhibit heavy tails, multiple modes, skewness, and highly correlated latent structures. We show that the existing semi-implicit variational inference objective provably reduces information in the observed graph. Based on this observation, we estimate and add an additional mutual information term to the semi-implicit variational inference learning objective to capture rich correlations arising between the input and latent spaces. We show that the inclusion of this regularization term in conjunction with the \poincare embedding boosts the quality of learned high-level representations and enables more flexible and faithful graphical modeling. We experimentally demonstrate that our approach outperforms existing graph variational auto-encoders both in Euclidean and in hyperbolic spaces for edge link prediction and node classification.

Changhee Lee · Mihaela van der Schaar

Integration of data from multiple omics techniques is becoming increasingly important in biomedical research. Due to non-uniformity and technical limitations in omics platforms, such integrative analyses on multiple omics, which we refer to as views, involve learning from incomplete observations with various view-missing patterns. This is challenging because i) complex interactions within and across observed views need to be properly addressed for optimal predictive power and ii) observations with various view-missing patterns need to be flexibly integrated. To address such challenges, we propose a deep variational information bottleneck (IB) approach for incomplete multi-view observations. Our method applies the IB framework on marginal and joint representations of the observed views to focus on intra-view and inter-view interactions that are relevant for the target. Most importantly, by modeling the joint representations as a product of marginal representations, we can efficiently learn from observed views with various view-missing patterns. Experiments on real-world datasets show that our method consistently achieves gain from data integration and outperforms state-of-the-art benchmarks.

Jeet Mohapatra · Ching-Yun Ko · Lily Weng · Pin-Yu Chen · Sijia Liu · Luca Daniel

The fragility of modern machine learning models has drawn a considerable amount of attention from both academia and the public. While immense interests were in either crafting adversarial attacks as a way to measure the robustness of neural networks or devising worst-case analytical robustness verification with guarantees, few methods could enjoy both scalability and robustness guarantees at the same time. As an alternative to these attempts, randomized smoothing adopts a different prediction rule that enables statistical robustness arguments which easily scale to large networks. However, in this paper, we point out the side effects of current randomized smoothing workflows. Specifically, we articulate and prove two major points: 1) the decision boundaries of smoothed classifiers will shrink, resulting in disparity in class-wise accuracy; 2) applying noise augmentation in the training process does not necessarily resolve the shrinking issue due to the inconsistent learning objectives.

This paper explores the statistical properties of fair representation learning, a pre-processing method that preemptively removes the correlations between features and sensitive attributes by mapping features to a fair representation space. We show that the demographic parity of a representation can be certified from a finite sample if and only if the mapping guarantees that the chi-squared mutual information between features and representations is finite for distributions of the features. Empirically, we find that smoothing representations with an additive Gaussian white noise provides generalization guarantees of fairness certificates, which improves upon existing fair representation learning approaches.

Warren Morningstar · Sharad Vikram · Cusuh Ham · Andrew Gallagher · Joshua Dillon

Automatic Differentiation Variational Inference (ADVI) is a useful tool for efficiently learning probabilistic models in machine learning. Generally approximate posteriors learned by ADVI are forced to be unimodal in order to facilitate use of the reparameterization trick. In this paper, we show how stratified sampling may be used to enable mixture distributions as the approximate posterior, and derive a new lower bound on the evidence analogous to the importance weighted autoencoder (IWAE). We show that this "SIWAE" is a tighter bound than both IWAE and the traditional ELBO, both of which are special instances of this bound. We verify empirically that the traditional ELBO objective disfavors the presence of multimodal posterior distributions and may therefore not be able to fully capture structure in the latent space. Our experiments show that using the SIWAE objective allows the encoder to learn more complex distributions which regularly contain multimodality, resulting in higher accuracy and better calibration in the presence of incomplete, limited, or corrupted data.

Zhengqing Zhou · Qinxun Bai · Zhengyuan Zhou · Linhai Qiu · Jose Blanchet · Peter Glynn

While reinforcement learning has witnessed tremendous success recently in a wide range of domains, robustness--or the lack thereof--remains an important issue that remains inadequately addressed. In this paper, we provide a distributionally robust formulation of offline learning policy in tabular RL that aims to learn a policy from historical data (collected by some other behavior policy) that is robust to the future environment arising as a perturbation of the training environment. We first develop a novel policy evaluation scheme that accurately estimates the robust value (i.e. how robust it is in a perturbed environment) of any given policy and establish its finite-sample estimation error. Building on this, we then develop a novel and minimax-optimal distributionally robust learning algorithm that achieves $O_P\left(1/\sqrt{n}\right)$ regret, meaning that with high probability, the policy learned from using $n$ training data points will be $O\left(1/\sqrt{n}\right)$ close to the optimal distributionally robust policy. Finally, our simulation results demonstrate the superiority of our distributionally robust approach compared to non-robust RL algorithms.

Sinho Chewi · Julien Clancy · Thibaut Le Gouic · Philippe Rigollet · George Stepaniants · Austin Stromme

We propose a new method for smoothly interpolating probability measures using the geometry of optimal transport. To that end, we reduce this problem to the classical Euclidean setting, allowing us to directly leverage the extensive toolbox of spline interpolation. Unlike previous approaches to measure-valued splines, our interpolated curves (i) have a clear interpretation as governing particle flows, which is natural for applications, and (ii) come with the first approximation guarantees on Wasserstein space. Finally, we demonstrate the broad applicability of our interpolation methodology by fitting surfaces of measures using thin-plate splines.

In a low-rank linear bandit problem, the reward of an action (represented by a matrix of size $d_1 \times d_2$) is the inner product between the action and an unknown low-rank matrix $\Theta^*$. We propose an algorithm based on a novel combination of online-to-confidence-set conversion~\citep{abbasi2012online} and the exponentially weighted average forecaster constructed by a covering of low-rank matrices. In $T$ rounds, our algorithm achieves $\widetilde{O}((d_1+d_2)^{3/2}\sqrt{rT})$ regret that improves upon the standard linear bandit regret bound of $\widetilde{O}(d_1d_2\sqrt{T})$ when the rank of $\Theta^*$: $r \ll \min\{d_1,d_2\}$. We also extend our algorithmic approach to the generalized linear setting to get an algorithm which enjoys a similar bound under regularity conditions on the link function. To get around the computational intractability of covering based approaches, we propose an efficient algorithm by extending the "Explore-Subspace-Then-Refine" algorithm of~\citet{jun2019bilinear}. Our efficient algorithm achieves $\widetilde{O}((d_1+d_2)^{3/2}\sqrt{rT})$ regret under a mild condition on the action set $\mathcal{X}$ and the $r$-th singular value of $\Theta^*$. Our upper bounds match the conjectured lower bound of \cite{jun2019bilinear} for a subclass of low-rank linear bandit problems. Further, we show that existing lower bounds for the sparse linear bandit problem strongly suggest that our regret bounds are unimprovable. To complement our theoretical contributions, we also conduct experiments to demonstrate that our algorithm can greatly outperform the performance of the standard linear bandit approach when $\Theta^*$ is low-rank.

In linear regression, SLOPE is a new convex analysis method that generalizes the Lasso via the sorted $\ell_1$ penalty: larger fitted coefficients are penalized more heavily. This magnitude-dependent regularization requires an input of penalty sequence $\blam$, instead of a scalar penalty as in the Lasso case, thus making the design extremely expensive in computation. In this paper, we propose two efficient algorithms to design the possibly high-dimensional SLOPE penalty, in order to minimize the mean squared error. For Gaussian data matrices, we propose a first order Projected Gradient Descent (PGD) under the Approximate Message Passing regime. For general data matrices, we present a zero-th order Coordinate Descent (CD) to design a sub-class of SLOPE, referred to as the $k$-level SLOPE. Our CD allows a useful trade-off between the accuracy and the computation speed. We demonstrate the performance of SLOPE with our designs via extensive experiments on synthetic data and real-world datasets.

Chen-Yu Wei · Mehdi Jafarnia · Haipeng Luo · Rahul Jain

We develop several new algorithms for learning Markov Decision Processes in an infinite-horizon average-reward setting with linear function approximation. Using the optimism principle and assuming that the MDP has a linear structure, we first propose a computationally inefficient algorithm with optimal O(\sqrt{T}) regret and another computationally efficient variant with O(T^{3/4}) regret, where T is the number of interactions. Next, taking inspiration from adversarial linear bandits, we develop yet another efficient algorithm with O(\sqrt{T}) regret under a different set of assumptions, improving the best existing result by Hao et al. (2020) with O(T^{2/3}) regret. Moreover, we draw a connection between this algorithm and the Natural Policy Gradient algorithm proposed by Kakade (2002), and show that our analysis improves the sample complexity bound recently given by Agarwal et al. (2020).

A common assumption in generative models is that the generator immerses the latent space into a Euclidean ambient space. Instead, we consider the ambient space to be a Riemannian manifold, which allows for encoding domain knowledge through the associated Riemannian metric. Shortest paths can then be defined accordingly in the latent space to both follow the learned manifold and respect the ambient geometry. Through careful design of the ambient metric we can ensure that shortest paths are well-behaved even for deterministic generators that otherwise would exhibit a misleading bias. Experimentally we show that our approach improves the interpretability and the functionality of learned representations both using stochastic and deterministic generators.

Edmond Cunningham · Madalina Fiterau

Rectangular matrix-vector products (MVPs) are used extensively throughout machine learning and are fundamental to neural networks such as multi-layer perceptrons. However, the use of rectangular MVPs in successive normalizing flow transformations is notably missing. This paper identifies this methodological gap and plugs it with a tall and wide MVP change of variables formula. Our theory builds up to a practical algorithm that envelops existing dimensionality increasing flow methods such as augmented flows. We show that tall MVPs are closely related to the stochastic inverse of wide MVPs and empirically demonstrate that they improve density estimation over existing dimension changing methods.

Aditya Bhaskara · Aravinda Kanchana Ruwanpathirana · Pruthuvi Maheshakya Wijewardena

Principal Component Regression (PCR) is a popular method for prediction from data, and is one way to address the so-called multi-collinearity problem in regression. It was shown recently that algorithms for PCR such as hard singular value thresholding (HSVT) are also quite robust, in that they can handle data that has missing or noisy covariates. However, such spectral approaches require strong distributional assumptions on which entries are observed. Specifically, every covariate is assumed to be observed with probability (exactly) $p$, for some value of $p$. Our goal in this work is to weaken this requirement, and as a step towards this, we study a ``semi-random'' model. In this model, every covariate is revealed with probability $p$, and then an adversary comes in and reveals additional covariates. While the model seems intuitively easier, it is well known that algorithms such as HSVT perform poorly. Our approach is based on studying the closely related problem of Noisy Matrix Completion in a semi-random setting. By considering a new semidefinite programming relaxation, we develop new guarantees for matrix completion, which is our core technical contribution.

Yichen Ruan · Xiaoxi Zhang · Shu-Che Liang · Carlee Joe-Wong

Traditional federated learning algorithms impose strict requirements on the participation rates of devices, which limit the potential reach of federated learning. This paper extends the current learning paradigm to include devices that may become inactive, compute incomplete updates, and depart or arrive in the middle of training. We derive analytical results to illustrate how allowing more flexible device participation can affect the learning convergence when data is not independently and identically distributed (non-IID). We then propose a new federated aggregation scheme that converges even when devices may be inactive or return incomplete updates. We also study how the learning process can adapt to early departures or late arrivals, and analyze their impacts on the convergence.

Benjamin Moseley · Sergei Vassilvtiskii · Yuyan Wang

Hierarchical clustering is a widely used data analysis method, but suffers from scalability issues, requiring quadratic time in general metric spaces.

In this work, we demonstrate how approximate nearest neighbor (ANN) queries can be used to improve the running time of the popular single-linkage and average-linkage methods. Our proposed algorithms are the first subquadratic time algorithms for non-Euclidean metrics. We complement our theoretical analysis with an empirical evaluation showcasing our methods' efficiency and accuracy.

Yilun Zhou · Adithya Renduchintala · Xian Li · Sida Wang · Yashar Mehdad · Asish Ghoshal

Active learning (AL) algorithms may achieve better performance with fewer data because the model guides the data selection process. While many algorithms have been proposed, there is little study on what the optimal AL algorithm looks like, which would help researchers understand where their models fall short and iterate on the design. In this paper, we present a simulated annealing algorithm to search for this optimal oracle and analyze it for several tasks. We present qualitative and quantitative insights into the behaviors of this oracle, comparing and contrasting them with those of various heuristics. Moreover, we are able to consistently improve the heuristics using one particular insight. We hope that our findings can better inform future active learning research. The code is available at https://github.com/YilunZhou/optimal-active-learning.

The global optimization of a high-dimensional black-box function under black-box constraints is a pervasive task in machine learning, control, and engineering. These problems are challenging since the feasible set is typically non-convex and hard to find, in addition to the curses of dimensionality and the heterogeneity of the underlying functions. In particular, these characteristics dramatically impact the performance of Bayesian optimization methods, that otherwise have become the defacto standard for sample-efficient optimization in unconstrained settings, leaving practitioners with evolutionary strategies or heuristics. We propose the scalable constrained Bayesian optimization (SCBO) algorithm that overcomes the above challenges and pushes the applicability of Bayesian optimization far beyond the state-of-the-art. A comprehensive experimental evaluation demonstrates that SCBO achieves excellent results on a variety of benchmarks. To this end, we propose two new control problems that we expect to be of independent value for the scientific community.

Shiyun Xu · Zhiqi Bu

Recent years have witnessed strong empirical performance of over-parameterized neural networks on various tasks and many advances in the theory, e.g. the universal approximation and provable convergence to global minimum. In this paper, we incorporate over-parameterized neural networks into semi-parametric models to bridge the gap between inference and prediction, especially in the high dimensional linear problem. By doing so, we can exploit a wide class of networks to approximate the nuisance functions and to estimate the parameters of interest consistently. Therefore, we may offer the best of two worlds: the universal approximation ability from neural networks and the interpretability from classic ordinary linear model, leading to valid inference and accurate prediction. We show the theoretical foundations that make this possible and demonstrate with numerical experiments. Furthermore, we propose a framework, DebiNet, in which we plug-in arbitrary feature selection methods to our semi-parametric neural network and illustrate that our framework debiases the regularized estimators and performs well, in terms of the post-selection inference and the generalization error.

The paper provides a thorough investigation of Direct Loss Minimization (DLM), which optimizes the posterior to minimize predictive loss, in sparse Gaussian processes. For the conjugate case, we consider DLM for log-loss and DLM for square loss showing a significant performance improvement in both cases. The application of DLM in non-conjugate cases is more complex because the logarithm of expectation in the log-loss DLM objective is often intractable and simple sampling leads to biased estimates of gradients. The paper makes two technical contributions to address this. First, a new method using product sampling is proposed, which gives unbiased estimates of gradients (uPS) for the objective function. Second, a theoretical analysis of biased Monte Carlo estimates (bMC) shows that stochastic gradient descent converges despite the biased gradients. Experiments demonstrate empirical success of DLM. A comparison of the sampling methods shows that, while uPS is potentially more sample-efficient, bMC provides a better tradeoff in terms of convergence time and computational efficiency.

Adarsh Subbaswamy · Roy Adams · Suchi Saria

As the use of machine learning in high impact domains becomes widespread, the importance of evaluating safety has increased. An important aspect of this is evaluating how robust a model is to changes in setting or population, which typically requires applying the model to multiple, independent datasets. Since the cost of collecting such datasets is often prohibitive, in this paper, we propose a framework for evaluating this type of stability using the available data. We use the original evaluation data to determine distributions under which the algorithm performs poorly, and estimate the algorithm's performance on the "worst-case" distribution. We consider shifts in user defined conditional distributions, allowing some distributions to shift while keeping other portions of the data distribution fixed. For example, in a healthcare context, this allows us to consider shifts in clinical practice while keeping the patient population fixed. To address the challenges associated with estimation in complex, high-dimensional distributions, we derive a "debiased" estimator which maintains root-N consistency even when machine learning methods with slower convergence rates are used to estimate the nuisance parameters. In experiments on a real medical risk prediction task, we show this estimator can be used to analyze stability and accounts for realistic shifts that could not previously be expressed. The proposed framework allows practitioners to proactively evaluate the safety of their models without requiring additional data collection.

Model-Agnostic Meta-Learning (MAML) and its variants have achieved success in meta-learning tasks on many datasets and settings. Nonetheless, we have just started to understand and analyze how they are able to adapt fast to new tasks. In this work, we contribute by conducting a series of empirical and theoretical studies, and discover several interesting, previously unknown properties of the algorithm. First, we find MAML adapts better with a deep architecture even if the tasks need only a shallow one. Secondly, linear layers can be added to the output layers of a shallower model to increase the depth without altering the modelling capacity, leading to improved performance in adaptation. Alternatively, an external and separate neural network meta-optimizer can also be used to transform the gradient updates of a smaller model so as to obtain improved performances in adaptation. Drawing from these evidences, we theorize that for a deep neural network to meta-learn well, the upper layers must transform the gradients of the bottom layers as if the upper layers were an external meta-optimizer, operating on a smaller network that is composed of the bottom layers.

Aldo Pacchiano · Mohammad Ghavamzadeh · Peter Bartlett · Heinrich Jiang

We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies, whose expected cumulative reward over the course of multiple rounds is maximum, and each one of them has an expected cost below a certain threshold. We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB), and prove a sublinear bound on its regret that is inversely proportional to the difference between the constraint threshold and the cost of a known feasible action. Our algorithm balances exploration and constraint satisfaction using a novel idea that scales the radii of the reward and cost confidence sets with different scaling factors. We further specialize our results to multi-armed bandits and propose a computationally efficient algorithm for this setting and prove a a regret bound that is better than simply casting multi-armed bandits as an instance of linear bandits and using the regret bound of OPLB. We also prove a lower-bound for the problem studied in the paper and provide simulations to validate our theoretical results. Finally, we show how our algorithm and analysis can be extended to multiple constraints and to the case when the cost of the feasible action is unknown.

Nathaniel Grammel · Brian Brubach · Wei Ma · Aravind Srinivasan

We study several generalizations of the Online Bipartite Matching problem. We consider settings with stochastic rewards, patience constraints, and weights (considering both vertex- and edge-weighted variants). We introduce a stochastic variant of the patience-constrained problem, where the patience is chosen randomly according to some known distribution and is not known in advance. We also consider stochastic arrival settings (i.e. the nature in which the online vertices arrive is determined by a known random process), which are natural settings that are able to beat the hard worst-case bounds of adversarial arrivals.

We design black-box algorithms for star graphs under various models of patience, which solve the problem optimally for deterministic or geometrically-distributed patience, and yield a 1/2-approximation for any patience distribution. These star graph algorithms are then used as black boxes to solve the online matching problems under different arrival settings. We show improved (or first-known) competitive ratios for these problems. We also present negative results that include formalizing the concept of a stochasticity gap for LP upper bounds on these problems, showing some new stochasticity gaps for popular LPs, and bounding the worst-case performance of some greedy approaches.

Shuang Song · Thomas Steinke · Om Thakkar · Abhradeep Thakurta

We revisit the well-studied problem of differentially private empirical risk minimization (ERM). We show that for unconstrained convex generalized linear models (GLMs), one can obtain an excess empirical risk of $\tilde O\left(\sqrt{\rank}/\epsilon n\right)$, where $\rank$ is the rank of the feature matrix in the GLM problem, $n$ is the number of data samples, and $\epsilon$ is the privacy parameter. This bound is attained via differentially private gradient descent (DP-GD). Furthermore, via the \emph{first lower bound for unconstrained private ERM}, we show that our upper bound is tight. In sharp contrast to the constrained ERM setting, there is no dependence on the dimensionality of the ambient model space ($p$). (Notice that $\rank\leq \min\{n, p\}$.) Besides, we obtain an analogous excess population risk bound which depends on $\rank$ instead of $p$. For the smooth non-convex GLM setting (i.e., where the objective function is non-convex but preserves the GLM structure), we further show that DP-GD attains a dimension-independent convergence of $\tilde O\left(\sqrt{\rank}/\epsilon n\right)$ to a first-order-stationary-point of the underlying objective. Finally, we show that for convex GLMs, a variant of DP-GD commonly used in practice (which involves clipping the individual gradients) also exhibits the same dimension-independent convergence to the minimum of a well-defined objective. To that end, we provide a structural lemma that characterizes the effect of clipping on the optimization profile of DP-GD.

We propose a simple robust hypothesis test that has the same sample complexity as that of the optimal Neyman-Pearson test up to constants, but robust to distribution perturbations under Hellinger distance. We discuss the applicability of such a robust test for estimating distributions in Hellinger distance. We empirically demonstrate the power of the test on canonical distributions.

yuyang deng · Mehrdad Mahdavi

Local SGD is a promising approach to overcome the communication overhead in distributed learning by reducing the synchronization frequency among worker nodes. Despite the recent theoretical advances of local SGD in empirical risk minimization, the efficiency of its counterpart in minimax optimization remains unexplored. Motivated by large scale minimax learning problems, such as adversarial robust learning and GANs, we propose local Stochastic Gradient Descent Ascent (local SGDA), where the primal and dual variables can be trained locally and averaged periodically to significantly reduce the number of communications. We show that local SGDA can provably optimize distributed minimax problems in both homogeneous and heterogeneous data with reduced number of communications and establish convergence rates under strongly-convex-strongly-concave and nonconvex-strongly-concave settings. In addition, we propose a novel variant, dubbed as local SGDA+, to solve nonconvex-nonconcave problems. We also give corroborating empirical evidence on different distributed minimax problems.

Farzin Haddadpour · Mohammad Mahdi Kamani · Aryan Mokhtari · Mehrdad Mahdavi

In federated learning, communication cost is often a critical bottleneck to scale up distributed optimization algorithms to collaboratively learn a model from millions of devices with potentially unreliable or limited communication and heterogeneous data distributions. Two notable trends to deal with the communication overhead of federated algorithms are gradient compression and local computation with periodic communication. Despite many attempts, characterizing the relationship between these two approaches has proven elusive. We address this by proposing a set of algorithms with periodical compressed (quantized or sparsified) communication and analyze their convergence properties in both homogeneous and heterogeneous local data distributions settings. For the homogeneous setting, our analysis improves existing bounds by providing tighter convergence rates for both strongly convex and non-convex objective functions. To mitigate data heterogeneity, we introduce a local gradient tracking scheme and obtain sharp convergence rates that match the best-known communication complexities without compression for convex, strongly convex, and nonconvex settings. We complement our theoretical results by demonstrating the effectiveness of our proposed methods on real-world datasets.

Xiaoyun Li · Ping Li

The commonly used Gaussian kernel has a tuning parameter $\gamma$. This makes the design of quantization schemes for random Fourier features (RFF) challenging, which is a popular technique to approximate the Gaussian kernel. Intuitively one would expect that a different quantizer is needed for a different $\gamma$ value (and we need to store a different set of quantized data for each $\gamma$). Fortunately, the recent work~\citep{Report:Li_2021_RFF} showed that only one Lloyd-Max (LM) quantizer is needed as the marginal distribution of RFF is free of the tuning parameter $\gamma$. On the other hand, \citet{Report:Li_2021_RFF} still required to store a different set of quantized data for each $\gamma$ value. In this paper, we adopt the ``one-sketch-for-all'' strategy for quantizing RFFs. Basically, we only store one set of quantized data after applying random projections on the original data. From the same set of quantized data, we can construct approximate RFFs to approximate Gaussian kernels for any tuning parameter $\gamma$. Compared with \citet{Report:Li_2021_RFF}, our proposed scheme would lose some accuracy as one would expect. Nevertheless, the proposed method still perform noticeably better than the quantization scheme based on random rounding. We provide statistical analysis on the properties of the proposed method and experiments are conducted to empirically illustrate its effectiveness.

Yuki Ohnishi · Jean Honorio

We introduce several novel change of measure inequalities for two families of divergences: $f$-divergences and $\alpha$-divergences. We show how the variational representation for $f$-divergences leads to novel change of measure inequalities. We also present a multiplicative change of measure inequality for $\alpha$-divergences and a generalized version of Hammersley-Chapman-Robbins inequality. Finally, we present several applications of our change of measure inequalities, including PAC-Bayesian bounds for various classes of losses and non-asymptotic intervals for Monte Carlo estimates.

We establish the consistency of K-medoids in the context of metric spaces. We start by proving that K-medoids is asymptotically equivalent to K-means restricted to the support of the underlying distribution under general conditions, including a wide selection of loss functions. This asymptotic equivalence, in turn, enables us to apply the work of Parna (1986) on the consistency of K-means. This general approach applies also to non-metric settings where only an ordering of the dissimilarities is available. We consider two types of ordinal information: one where all quadruple comparisons are available; and one where only triple comparisons are available. We provide some numerical experiments to illustrate our theory.

We study the problem of computing a good k-median clustering in a parallel computing environment. We design an efficient algorithm that gives a constant-factor approximation to the optimal solution for stable clustering instances. The notion of stability that we consider is resilience to perturbations of the distances between the points. Our computational experiments show that our algorithm works well in practice - we are able to find better clusterings than Lloyd’s algorithm and a centralized coreset construction using samples of the same size.

Elchanan Solomon · Alexander Wagner · Paul Bendich

Topological statistics, in the form of persistence diagrams, are a class of shape descriptors that capture global structural information in data. The mapping from data structures to persistence diagrams is almost everywhere differentiable, allowing for topological gradients to be backpropagated to ordinary gradients. However, as a method for optimizing a topological functional, this backpropagation method is expensive, unstable, and produces very fragile optima. Our contribution is to introduce a novel backpropagation scheme that is significantly faster, more stable, and produces more robust optima. Moreover, this scheme can also be used to produce a stable visualization of dots in a persistence diagram as a distribution over critical, and near-critical, simplices in the data structure.

Alican Bozkurt · Babak Esmaeili · Jean-Baptiste Tristan · Dana Brooks · Jennifer Dy · Jan-Willem van de Meent

Variational autoencoders (VAEs) optimize an objective that comprises a reconstruction loss (the distortion) and a KL term (the rate). The rate is an upper bound on the mutual information, which is often interpreted as a regularizer that controls the degree of compression. We here examine whether inclusion of the rate term also improves generalization. We perform rate-distortion analyses in which we control the strength of the rate term, the network capacity, and the difficulty of the generalization problem. Lowering the strength of the rate term paradoxically improves generalization in most settings, and reducing the mutual information typically leads to underfitting. Moreover, we show that generalization performance continues to improve even after the mutual information saturates, indicating that the gap on the bound (i.e. the KL divergence relative to the inference marginal) affects generalization. This suggests that the standard spherical Gaussian prior is not an inductive bias that typically improves generalization, prompting further work to understand what choices of priors improve generalization in VAEs.

In the Multiple Instance Learning (MIL) scenario, the training data consists of instances grouped into bags. Bag labels indicate whether each bag contains at least one positive instance, but instance labels are not observed. Recently, Haussmann et al (CVPR 2017) tackled the MIL instance label prediction task by introducing the Multiple Instance Learning Gaussian Process Logistic (MIL-GP-Logistic) model, an adaptation of the Gaussian Process Logistic Classification model that inherits its uncertainty quantification and flexibility. Notably, they provide a fast mean-field variational inference procedure. However, due to their choice of the logistic link, they do not maximize the ELBO objective directly, but rather a lower bound on it. This approximation, as we show, hurts predictive performance. In this work, we propose the Multiple Instance Learning Gaussian Process Probit (MIL-GP-Probit) model, an adaptation of the Gaussian Process Probit Classification model to solve the MIL instance label prediction problem. Leveraging the analytical tractability of the probit link, we give a variational inference procedure based on variable augmentation that maximizes the ELBO objective directly. Applying it, we show MIL-GP-Probit is significantly more calibrated than MIL-GP-Logistic on all 20 datasets of the benchmark 20 Newsgroups dataset collection, and achieves higher AUC than MIL-GP-Logistic on an additional 51 out of 59 datasets. Furthermore, we show how the probit formulation enables principled bag label predictions and a Gibbs sampling scheme. This is the first exact posterior inference procedure for any Bayesian model for the MIL scenario.

Maxime Vandegar · Michael Kagan · Antoine Wehenkel · Gilles Louppe

We revisit g-modeling empirical Bayes in the absence of a tractable likelihood function, as is typical in scientific domains relying on computer simulations. We investigate how the empirical Bayesian can make use of neural density estimators first to use all noise-corrupted observations to estimate a prior or source distribution over uncorrupted samples, and then to perform single-observation posterior inference using the fitted source distribution. We propose an approach based on the direct maximization of the log-marginal likelihood of the observations, examining both biased and de-biased estimators, and comparing to variational approaches. We find that, up to symmetries, a neural empirical Bayes approach recovers ground truth source distributions. With the learned source distribution in hand, we show the applicability to likelihood-free inference and examine the quality of the resulting posterior estimates. Finally, we demonstrate the applicability of Neural Empirical Bayes on an inverse problem from collider physics.

In this paper, we focus on answering queries, in a differentially private manner, on graph streams. We adopt the sliding window model of privacy, where we wish to perform analysis on the last $W$ updates and ensure that privacy is preserved for the entire stream. We show that in this model, the price of ensuring differential privacy is minimal. Furthermore, since differential privacy is preserved under post-processing, our results can be used as a subroutine in many tasks, including Lipschitz learning on graphs, cut functions, and spectral clustering.

Haoming Jiang · Zhehui Chen · Yuyang Shi · Bo Dai · Tuo Zhao

Adversarial training provides a principled approach for training robust neural networks. From an optimization perspective, adversarial training is essentially solving a bilevel optimization problem. The leader problem is trying to learn a robust classifier, while the follower maximization is trying to generate adversarial samples. Unfortunately, such a bilevel problem is difficult to solve due to its highly complicated structure. This work proposes a new adversarial training method based on a generic learning-to-learn (L2L) framework. Specifically, instead of applying existing hand-designed algorithms for the inner problem, we learn an optimizer, which is parametrized as a convolutional neural network. At the same time, a robust classifier is learned to defense the adversarial attack generated by the learned optimizer. Experiments over CIFAR-10 and CIFAR-100 datasets demonstrate that L2L outperforms existing adversarial training methods in both classification accuracy and computational efficiency. Moreover, our L2L framework can be extended to generative adversarial imitation learning and stabilize the training.

It is well-known that selling different goods in a single bundle can significantly increase revenue. However, bundling is no longer profitable if the goods have high production costs. To overcome this challenge, we introduce a new mechanism, Pure Bundling with Disposal for Cost (PBDC), where after buying the bundle, the customer is allowed to return any subset of goods for their costs.

We provide two types of guarantees on the profit of PBDC mechanisms relative to the optimum in the presence of production costs, under the assumption that customers have valuations which are additive over the items and drawn independently. We first provide a distribution-dependent guarantee which shows that PBDC earns at least 1-6c^{2/3} of the optimal profit, where c denotes the coefficient of variation of the welfare random variable. c approaches 0 if there are a large number of items whose individual valuations have bounded coefficients of variation, and our constants improve upon those from the classical result of Bakos and Brynjolfsson (1999) without costs.

We then provide a distribution-free guarantee which shows that either PBDC or individual sales earns at least 1/5.2 times the optimal profit, generalizing and improving the constant of 1/6 from the celebrated result of Babaioff et al. (2014). Conversely, we also provide the best-known upper bound on the performance of any partitioning mechanism (which captures both individual sales and pure bundling), of 1/1.19 times the optimal profit, improving on the previously-known upper bound of 1/1.08.

Finally, we conduct simulations under the same playing field as the extensive numerical study of Chu et al. (2011), which confirm that PBDC outperforms other simple pricing schemes overall.

Mansur Arief · Zhiyuan Huang · Guru Koushik Senthil Kumar · Yuanlu Bai · Shengyi He · Wenhao Ding · Henry Lam · Ding Zhao

Evaluating the reliability of intelligent physical systems against rare safety-critical events poses a huge testing burden for real-world applications. Simulation provides a useful platform to evaluate the extremal risks of these systems before their deployments. Importance Sampling (IS), while proven to be powerful for rare-event simulation, faces challenges in handling these learning-based systems due to their black-box nature that fundamentally undermines its efficiency guarantee, which can lead to under-estimation without diagnostically detected. We propose a framework called Deep Probabilistic Accelerated Evaluation (Deep-PrAE) to design statistically guaranteed IS, by converting black-box samplers that are versatile but could lack guarantees, into one with what we call a relaxed efficiency certificate that allows accurate estimation of bounds on the safety-critical event probability. We present the theory of Deep-PrAE that combines the dominating point concept with rare-event set learning via deep neural network classifiers, and demonstrate its effectiveness in numerical examples including the safety-testing of an intelligent driving algorithm.

Multiple proposal Markov chain Monte Carlo (MP-MCMC) as introduced in Calderhead (2014) allow for computationally efficient and parallelisable inference, whereby multiple states are proposed and computed simultaneously. In this paper, we improve the resulting integral estimators by sequentially using the multiple states within a Rao-Blackwellised estimator. We further propose a novel adaptive Rao-Blackwellised MP-MCMC algorithm, which generalises the adaptive MCMC algorithm introduced by Haario et al. (2001) to allow for multiple proposals. We prove its asymptotic unbiasedness, and demonstrate significant improvements in sampling efficiency through numerical studies.

Ruqi Zhang · Yingzhen Li · Christopher De Sa · Sam Devlin · Cheng Zhang

Variational inference (VI) plays an essential role in approximate Bayesian inference due to its computational efficiency and broad applicability. Crucial to the performance of VI is the selection of the associated divergence measure, as VI approximates the intractable distribution by minimizing this divergence. In this paper we propose a meta-learning algorithm to learn the divergence metric suited for the task of interest, automating the design of VI methods. In addition, we learn the initialization of the variational parameters without additional cost when our method is deployed in the few-shot learning scenarios. We demonstrate our approach outperforms standard VI on Gaussian mixture distribution approximation, Bayesian neural network regression, image generation with variational autoencoders and recommender systems with a partial variational autoencoder.

Steve Hanneke · Liu Yang

While the literature on the theory of pool-based active learning has seen much progress in the past 15 years, and is now fairly mature, much less is known about its cousin problem: online selective sampling. In the stochastic online learning setting, there is a stream of iid data, and the learner is required to predict a label for each instance, and we are interested in the rate of growth of the number of mistakes the learner makes. In the selective sampling variant of this problem, after each prediction, the learner can optionally request to observe the true classification of the point. This introduces a trade-off between the number of these queries and the number of mistakes as a function of the number T of samples in the sequence. This work explores various properties of the optimal trade-off curve, both abstractly (for general VC classes), and more-concretely for several constructed examples that expose important properties of the trade-off.

Gauthier Gidel · David Balduzzi · Wojciech Czarnecki · Marta Garnelo · Yoram Bachrach

Adversarial training, a special case of multi-objective optimization, is an increasingly prevalent machine learning technique: some of its most notable applications include GAN-based generative modeling and self-play techniques in reinforcement learning which have been applied to complex games such as Go or Poker. In practice, a \emph{single} pair of networks is typically trained in order to find an approximate equilibrium of a highly nonconcave-nonconvex adversarial problem. However, while a classic result in game theory states such an equilibrium exists in concave-convex games, there is no analogous guarantee if the payoff is nonconcave-nonconvex. Our main contribution is to provide an approximate minimax theorem for a large class of games where the players pick neural networks including WGAN, StarCraft II and Blotto Game. Our findings rely on the fact that despite being nonconcave-nonconvex with respect to the neural networks parameters, these games are concave-convex with respect to the actual models (e.g., functions or distributions) represented by these neural networks.

The increasing availability of data has generated unprecedented prospects for network analyses in many biological fields, such as neuroscience (e.g., brain networks), genomics (e.g., gene-gene interaction networks), and ecology (e.g., species interaction networks). A powerful statistical framework for estimating such networks is Gaussian graphical models, but standard estimators for the corresponding graphs are prone to large numbers of false discoveries. In this paper, we introduce a novel graph estimator based on knockoffs that imitate the partial correlation structures of unconnected nodes. We then show that this new estimator provides accurate control of the false discovery rate and yet large power.

Melrose Roderick · Vaishnavh Nagarajan · Zico Kolter

A key challenge in applying reinforcement learning to safety-critical domains is understanding how to balance exploration (needed to attain good performance on the task) with safety (needed to avoid catastrophic failure). Although a growing line of work in reinforcement learning has investigated this area of "safe exploration," most existing techniques either 1) do not guarantee safety during the actual exploration process; and/or 2) limit the problem to a priori known and/or deterministic transition dynamics with strong smoothness assumptions. Addressing this gap, we propose Analogous Safe-state Exploration (ASE), an algorithm for provably safe exploration in MDPs with unknown, stochastic dynamics. Our method exploits analogies between state-action pairs to safely learn a near-optimal policy in a PAC-MDP sense. Additionally, ASE also guides exploration towards the most task-relevant states, which empirically results in significant improvements in terms of sample efficiency, when compared to existing methods.

Haoxian Chen · Ziyi Huang · Henry Lam · Huajie Qian · Haofeng Zhang

We study the generation of prediction intervals in regression for uncertainty quantification. This task can be formalized as an empirical constrained optimization problem that minimizes the average interval width while maintaining the coverage accuracy across data. We strengthen the existing literature by studying two aspects of this empirical optimization. First is a general learning theory to characterize the optimality-feasibility tradeoff that encompasses Lipschitz continuity and VC-subgraph classes, which are exemplified in regression trees and neural networks. Second is a calibration machinery and the corresponding statistical theory to optimally select the regularization parameter that manages this tradeoff, which bypasses the overfitting issues in previous approaches in coverage attainment. We empirically demonstrate the strengths of our interval generation and calibration algorithms in terms of testing performances compared to existing benchmarks.

Zhuolin Yang · Zhaoxi Chen · Tiffany Cai · Xinyun Chen · Bo Li · Yuandong Tian

Adversarial examples have appeared as a ubiquitous property of machine learning models where bounded adversarial perturbation could mislead the models to make arbitrarily incorrect predictions. Such examples provide a way to assess the robustness of machine learning models as well as a proxy for understanding the model training process. Extensive studies try to explain the existence of adversarial examples and provide ways to improve model robustness (e.g. adversarial training). While they mostly focus on models trained on datasets with predefined labels, we leverage the teacher-student framework and assume a teacher model, or \emph{oracle}, to provide the labels for given instances. We extend~\citet{tian2019student} in the case of low-rank input data and show that \emph{student specialization} (trained student neuron is highly correlated with certain teacher neuron at the same layer) still happens within the input subspace, but the teacher and student nodes could \emph{differ wildly} out of the data subspace, which we conjecture leads to adversarial examples. Extensive experiments show that student specialization correlates strongly with model robustness in different scenarios, including student trained via standard training, adversarial training, confidence-calibrated adversarial training, and training with robust feature dataset. Our studies could shed light on the future exploration about adversarial examples, and enhancing model robustness via principled data augmentation.

Alan Kuhnle

We consider the problem of monotone, submodular maximization over a ground set of size $n$ subject to cardinality constraint $k$. For this problem, we introduce the first deterministic algorithms with linear time complexity; these algorithms are streaming algorithms. Our single-pass algorithm obtains a constant ratio in $\lceil n / c \rceil + c$ oracle queries, for any $c \ge 1$. In addition, we propose a deterministic, multi-pass streaming algorithm with a constant number of passes that achieves nearly the optimal ratio with linear query and time complexities. We prove a lower bound that implies no constant-factor approximation exists using $o(n)$ queries, even if queries to infeasible sets are allowed. An empirical analysis demonstrates that our algorithms require fewer queries (often substantially less than $n$) yet still achieve better objective value than the current state-of-the-art algorithms, including single-pass, multi-pass, and non-streaming algorithms.

Setareh Ariafar · Zelda Mariet · Dana Brooks · Jennifer Dy · Jasper Snoek

Many contemporary machine learning models require extensive tuning of hyperparameters to perform well. A variety of methods, such as Bayesian optimization, have been developed to automate and expedite this process. However, tuning remains extremely costly as it typically requires repeatedly fully training models. To address this issue, Bayesian optimization methods have been extended to use cheap, partially trained models to extrapolate to expensive complete models. While this approach enlarges the set of explored hyperparameters, including many low-fidelity observations adds to the intrinsic randomness of the procedure and makes extrapolation challenging. We propose to accelerate hyperparameter tuning for neural networks in a robust way by taking into account the relative amount of information contributed by each training example. To do so, we integrate importance sampling with Bayesian optimization, which significantly increases the quality of the black-box function evaluations and their runtime. To overcome the additional overhead cost of using importance sampling, we cast hyperparameter search as a multi-task Bayesian optimization problem over both hyperparameters and importance sampling design, which achieves the best of both worlds. Through learning a trade-off between training complexity and quality, our method improves upon validation error, in the average and worst-case. We show that this results in more reliable performance of our method in less wall-clock time across a variety of and datasets complex neural architectures.

Nikhil Mehta · Kevin Liang · Vinay Kumar Verma · Lawrence Carin Duke

Naively trained neural networks tend to experience catastrophic forgetting in sequential task settings, where data from previous tasks are unavailable. A number of methods, using various model expansion strategies, have been proposed recently as possible solutions. However, determining how much to expand the model is left to the practitioner, and often a constant schedule is chosen for simplicity, regardless of how complex the incoming task is. Instead, we propose a principled Bayesian nonparametric approach based on the Indian Buffet Process (IBP) prior, letting the data determine how much to expand the model complexity. We pair this with a factorization of the neural network's weight matrices. Such an approach allows us to scale the number of factors of each weight matrix to the complexity of the task, while the IBP prior encourages sparse weight factor selection and factor reuse, promoting positive knowledge transfer between tasks. We demonstrate the effectiveness of our method on a number of continual learning benchmarks and analyze how weight factors are allocated and reused throughout the training.