Track: Poster Session 3

A Cramér Distance perspective on Quantile Regression based Distributional Reinforcement Learning

Alix LHERITIER · Nicolas Bondoux

Distributional reinforcement learning (DRL) extends the value-based approach by approximating the full distribution over future returns instead of the mean only, providing a richer signal that leads to improved performances. Quantile Regression (QR)-based methods like QR-DQN project arbitrary distributions into a parametric subset of staircase distributions by minimizing the 1-Wasserstein distance. However, due to biases in the gradients, the quantile regression loss is used instead for training, guaranteeing the same minimizer and enjoying unbiased gradients. Non-crossing constraints on the quantiles have been shown to improve the performance of QR-DQN for uncertainty-based exploration strategies. The contribution of this work is in the setting of fixed quantile levels and is twofold. First, we prove that the Cramér distance yields a projection that coincides with the 1-Wasserstein one and that, under non-crossing constraints, the squared Cramér and the quantile regression losses yield collinear gradients, shedding light on the connection between these important elements of DRL. Second, we propose a low complexity algorithm to compute the Cramér distance.

Ada-BKB: Scalable Gaussian Process Optimization on Continuous Domains by Adaptive Discretization

Marco Rando · Luigi Carratino · Silvia Villa · Lorenzo Rosasco

Gaussian process optimization is a successful class of algorithms(e.g. GP-UCB) to optimize a black-box function through sequential evaluations. However, for functions with continuous domains, Gaussian process optimization has to rely on either a fixed discretization of the space, or the solution of a non-convex ptimization subproblem at each evaluation. The first approach can negatively affect performance, while the second approach requires a heavy computational burden. A third option, only recently theoretically studied, is to adaptively discretize the function domain. Even though this approach avoids the extra non-convex optimization costs, the overall computational complexity is still prohibitive. An algorithm such as GP-UCB has a runtime of $O(T^4)$ , where $T$ is the number of iterations. In this paper, we introduce Ada-BKB (Adaptive Budgeted Kernelized Bandit), a no-regret Gaussian process optimization algorithm for functions on continuous domains, that provably runs in $O(T^2 d_\text{eff}^2)$ , where $d_\text{eff}$ is the effective dimension of the explored space, and which is typically much smaller than $T$ . We corroborate our theoretical findings with experiments on synthetic non-convex functions and on the real-world problem of hyper-parameter optimization, confirming the good practical performances of the proposed approach.

Almost Optimal Universal Lower Bound for Learning Causal DAGs with Atomic Interventions

Vibhor Porwal · Piyush Srivastava · Gaurav Sinha

A well-studied challenge that arises in the structure learning problem of causal directed acyclic graphs (DAG) is that using observational data, one can only learn the graph up to a "Markov equivalence class" (MEC). The remaining undirected edges have to be oriented using interventions, which can be very expensive to perform in applications. Thus, the problem of minimizing the number of interventions needed to fully orient the MEC has received a lot of recent attention, and is also the focus of this work. We prove two main results. The first is a new universal lower bound on the number of atomic interventions that any algorithm (whether active or passive) would need to perform in order to orient a given MEC. Our second result shows that this bound is, in fact, within a factor of two of the size of the smallest set of atomic interventions that can orient the MEC. Our lower bound is provably better than previously known lower bounds. The proof of our lower bound is based on the new notion of clique-block shared-parents (CBSP) orderings, which are topological orderings of DAGs without v-structures and satisfy certain special properties. Further, using simulations on synthetic graphs and by giving examples of special graph families, we show that our bound is often significantly better.

Are All Linear Regions Created Equal?

Matteo Gamba · Adrian Chmielewski-Anders · Josephine Sullivan · Hossein Azizpour · Marten Bjorkman

The number of linear regions has been studied as a proxy of complexity for ReLU networks. However, the empirical success of network compression techniques like pruning and knowledge distillation, suggest that in the overparameterized setting, linear regions density might fail to capture the effective nonlinearity. In this work, we propose an efficient algorithm for discovering linear regions and use it to investigate the effectiveness of density in capturing the nonlinearity of trained VGGs and ResNets on CIFAR-10 and CIFAR-100. We contrast the results with a more principled nonlinearity measure based on function variation, highlighting the shortcomings of linear regions density. Furthermore, interestingly, our measure of nonlinearity clearly correlates with model-wise deep double descent, connecting reduced test error with reduced nonlinearity, and increased local similarity of linear regions.

Asymptotically Optimal Locally Private Heavy Hitters via Parameterized Sketches

Hao Wu · Anthony Wirth

We study the frequency estimation problem under the local differential privacy model. Frequency estimation is a fundamental computational question, and differential privacy has become the de-facto standard, with the local version (LDP) affording even greater protection. On large input domains, sketching methods and hierarchical search methods are commonly and successfully, in practice, applied for reducing the size of the domain, and for identifying frequent elements. It is therefore of interest whether the current theoretical analysis of such algorithms is tight, or whether we can obtain algorithms in a similar vein that achieve optimal error guarantee. We introduce two algorithms for LDP frequency estimation. One solves the fundamental frequency oracle problem; the other solves the well-known heavy hitters identification problem. As a function of failure probability, β, the former achieves optimal worst-case estimation error for every β; the latter is optimal when β is at least inverse polynomial in n, the number of users.In each algorithm, server running time and memory usage are tilde{O}(n) and tilde{O}(sqrt{n}), respectively, while user running time and memory usage are both tilde{O}(1). Our frequency-oracle algorithm achieves lower estimation error than Bassily et al. (NeurIPS 2017). On the other hand, our heavy hitters identification method improves the worst-case error of TreeHist (ibid) by a factor of Omega(sqrt{log n}); it avoids invoking error-correcting codes, known to be theoretically powerful, but yet to be implemented.

Deep Layer-wise Networks Have Closed-Form Weights

Chieh Wu · Aria Masoomi · Arthur Gretton · Jennifer Dy

There is currently a debate within the neuroscience community over the likelihood of the brain performing backpropagation (BP). To better mimic the brain, training a network one layer at a time with only a "single forward pass" has been proposed as an alternative to bypass BP; we refer to these networks as "layer-wise" networks. We continue the work on layer-wise networks by answering two outstanding questions. First, do they have a closed-form solution? Second, how do we know when to stop adding more layers? This work proves that the "Kernel Mean Embedding" is the closed-form solution that achieves the network global optimum while driving these networks to converge towards a highly desirable kernel for classification; we call it the Neural Indicator Kernel.

Distributionally Robust Structure Learning for Discrete Pairwise Markov Networks

Yeshu Li · Zhan Shi · Xinhua Zhang · Brian Ziebart

We consider the problem of learning the underlying structure of a general discrete pairwise Markov network. Existing approaches that rely on empirical risk minimization may perform poorly in settings with noisy or scarce data. To overcome these limitations, we propose a computationally efficient and robust learning method for this problem with near-optimal sample complexities. Our approach builds upon distributionally robust optimization (DRO) and maximum conditional log-likelihood. The proposed DRO estimator minimizes the worst-case risk over an ambiguity set of adversarial distributions within bounded transport cost or f-divergence of the empirical data distribution. We show that the primal minimax learning problem can be efficiently solved by leveraging sufficient statistics and greedy maximization in the ostensibly intractable dual formulation. Based on DRO's approximation to Lipschitz and variance regularization, we derive near-optimal sample complexities matching existing results. Extensive empirical evidence with different corruption models corroborates the effectiveness of the proposed methods.

Diversity and Generalization in Neural Network Ensembles

Luis Antonio Ortega Andrés · Rafael Cabañas · Andres Masegosa

Ensembles are widely used in machine learning and, usually, provide state-of-the-art performance in many prediction tasks. From the very beginning, the diversity of an ensemble has been identified as a key factor for the superior performance of these models. But the exact role that diversity plays in ensemble models is poorly understood, specially in the context of neural networks. In this work, we combine and expand previously published results in a theoretically sound framework that describes the relationship between diversity and ensemble performance for a wide range of ensemble methods. More precisely, we provide sound answers to the following questions: how to measure diversity, how diversity relates to the generalization error of an ensemble, and how diversity is promoted by neural network ensemble algorithms. This analysis covers three widely used loss functions, namely, the squared loss, the cross-entropy loss, and the 0-1 loss; and two widely used model combination strategies, namely, model averaging and weighted majority vote. We empirically validate this theoretical analysis with neural network ensembles.

Efficient Algorithms for Extreme Bandits

Dorian Baudry · Yoan Russac · Emilie Kaufmann

In this paper, we contribute to the Extreme Bandits problem, a variant of Multi-Armed Bandits in which the learner seeks to collect the largest possible reward. We first study the concentration of the maximum of i.i.d random variables under mild assumptions on the tail of the rewards distributions. This analysis motivates the introduction of Quantile of Maxima (QoMax). The properties of QoMax are sufficient to build an Explore-Then-Commit (ETC) strategy, QoMax-ETC, achieving strong asymptotic guarantees despite its simplicity. We then propose and analyze a more adaptive, anytime algorithm, QoMax-SDA, which combines QoMax with a subsampling method recently introduced by Baudry et al. (2021). Both algorithms are more efficient than existing approaches in two senses: (1) they lead to better empirical performance (2) they enjoy a significant reduction of the storage and computational cost.

Efficient and passive learning of networked dynamical systems driven by non-white exogenous inputs

Harish Doddi · Deepjyoti Deka · Saurav Talukdar · Murti Salapaka

We consider a networked linear dynamical system with p agents/nodes. We study the problem of learning the underlying graph of interactions/dependencies from observations of the nodal trajectories over a time-interval T. We present a regularized non-casual consistent estimator for this problem and analyze its sample complexity over two regimes: (a) where the interval T consists of n i.i.d. observation windows of length T/n (restart and record), and (b) where T is one continuous observation window (consecutive). Using the theory of M-estimators, we show that the estimator recovers the underlying interactions, in either regime, in a time-interval that is logarithmic in the system size p. To the best of our knowledge, this is the first work to analyze the sample complexity of learning linear dynamical systems driven by unobserved not-white wide-sense stationary (WSS) inputs.

Extragradient Method: O(1/K) Last-Iterate Convergence for Monotone Variational Inequalities and Connections With Cocoercivity

Eduard Gorbunov · Nicolas Loizou · Gauthier Gidel

Extragradient method (EG) (Korpelevich, 1976) is one of the most popular methods for solving saddle point and variational inequalities problems (VIP). Despite its long history and significant attention in the optimization community, there remain important open questions about convergence of EG. In this paper, we resolve one of such questions and derive the first last-iterate O(1/K) convergence rate for EG for monotone and Lipschitz VIP without any additional assumptions on the operator unlike the only known result of this type (Golowich et al., 2020) that relies on the Lipschitzness of the Jacobian of the operator. The rate is given in terms of reducing the squared norm of the operator. Moreover, we establish several results on the (non-)cocoercivity of the update operators of EG, Optimistic Gradient Method, and Hamiltonian Gradient Method, when the original operator is monotone and Lipschitz.

Improved analysis of randomized SVD for top-eigenvector approximation

Ruo-Chun Tzeng · Po-An Wang · Florian Adriaens · Aristides Gionis · Chi-Jen Lu

Computing the top eigenvectors of a matrix is a problem of fundamental interest to various fields.While the majority of the literature has focused on analyzing the reconstruction errorof low-rank matrices associated with the retrieved eigenvectors, in many applications one is interested in finding one vector with high Rayleigh quotient.In this paper we study the problem of approximating the top-eigenvector.Given a symmetric matrix $\mathbf{A}$ with largest eigenvalue $\lambda_1$ , our goal is to find a vector $\hat{\mathbf{u}}$ that approximates the leading eigenvector $\mathbf{u}_1$ with high accuracy, as measured by the ratio $R(\hat{\mathbf{u}})=\lambda_1^{-1}{\hat{\mathbf{u}}^T\mathbf{A}\hat{\mathbf{u}}}/{\hat{\mathbf{u}}^T\hat{\mathbf{u}}}$ .We present a novel analysis of the randomized SVD algorithm of \citet{halko2011finding} and derive tight bounds in many cases of interest.Notably, this is the first work that provides non-trivial bounds of $R(\hat{\mathbf{u}})$ for randomized SVD with any number of iterations.Our theoretical analysis is complemented with a thorough experimental study that confirms the efficiency and accuracy of the method.

k-Pareto Optimality-Based Sorting with Maximization of Choice

Jean Ruppert · Marharyta Aleksandrova · Thomas Engel

Topological sorting is an important technique in numerous practical applications, such as information retrieval, recommender systems, optimization, etc. In this paper, we introduce a problem of generalized topological sorting with maximization of choice, that is, of choosing a subset of items of a predefined size that contains the maximum number of equally preferable options (items) with respect to a dominance relation. We formulate this problem in a very abstract form and prove that sorting by k-Pareto optimality yields a valid solution. Next, we show that the proposed theory can be useful in practice. We apply it during the selection step of genetic optimization and demonstrate that the resulting algorithm outperforms existing state-of-the-art approaches such as NSGA-II and NSGA-III. We also demonstrate that the provided general formulation allows discovering interesting relationships and applying the developed theory to different applications.

Lagrangian manifold Monte Carlo on Monge patches

Marcelo Hartmann · Mark Girolami · Arto Klami

The efficiency of Markov Chain Monte Carlo (MCMC) depends on how the underlying geometry of the problem is taken into account. For distributions with strongly varying curvature, Riemannian metrics help in efficient exploration of the target distribution. Unfortunately, they have significant computational overhead due to e.g. repeated inversion of the metric tensor, and current geometric MCMC methods using the Fisher information matrix to induce the manifold are in practice slow. We propose a new alternative Riemannian metric for MCMC, by embedding the target distribution into a higher-dimensional Euclidean space as a Monge patch, thus using the induced metric determined by direct geometric reasoning. Our metric only requires first-order gradient information and has fast inverse and determinants, and allows reducing the computational complexity of individual iterations from cubic to quadratic in the problem dimensionality. We demonstrate how Lagrangian Monte Carlo in this metric efficiently explores the target distributions.

Lifted Primal-Dual Method for Bilinearly Coupled Smooth Minimax Optimization

Kiran Thekumparampil · Niao He · Sewoong Oh

We study the bilinearly coupled minimax problem: $\min_{x} \max_{y} f(x) + y^\top A x - h(y)$ , where $f$ and $h$ are both strongly convex smooth functions and admit first-order gradient oracles. Surprisingly, no known first-order algorithms have hitherto achieved the lower complexity bound of $\Omega((\sqrt{\frac{L_x}{\mu_x}} + \frac{\|A\|}{\sqrt{\mu_x \mu_y}} + \sqrt{\frac{L_y}{\mu_y}}) \log(\frac1{\varepsilon}))$ for solving this problem up to an $\varepsilon$ primal-dual gap in the general parameter regime, where $L_x, L_y,\mu_x,\mu_y$ are the corresponding smoothness and strongly convexity constants. We close this gap by devising the first optimal algorithm, the Lifted Primal-Dual (LPD) method. Our method lifts the objective into an extended form that allows both the smooth terms and the bilinear term to be handled optimally and seamlessly with the same primal-dual framework. Besides optimality, our method yields a desirably simple single-loop algorithm that uses only one gradient oracle call per iteration. Moreover, when $f$ is just convex, the same algorithm applied to a smoothed objective achieves the nearly optimal iteration complexity. We also provide a direct single-loop algorithm, using the LPD method, that achieves the iteration complexity of $O(\sqrt{\frac{L_x}{\varepsilon}} + \frac{\|A\|}{\sqrt{\mu_y \varepsilon}} + \sqrt{\frac{L_y}{\varepsilon}})$ . Numerical experiments on quadratic minimax problems and policy evaluation problems further demonstrate the fast convergence of our algorithm in practice.

Optimal Rates of (Locally) Differentially Private Heavy-tailed Multi-Armed Bandits

Youming Tao · Yulian Wu · Peng Zhao · Di Wang

In this paper we investigate the problem of stochastic multi-armed bandits (MAB) in the (local) differential privacy (DP/LDP) model. Unlike previous results that assume bounded/sub-Gaussian reward distributions, we focus on the setting where each arm's reward distribution only has $(1+v)$ -th moment with some $v\in (0, 1]$ . In the first part, we study the problem in the central $\epsilon$ -DP model. We first provide a near-optimal result by developing a private and robust Upper Confidence Bound (UCB) algorithm. Then, we improve the result via a private and robust version of the Successive Elimination (SE) algorithm. Finally, we establish the lower bound to show that the instance-dependent regret of our improved algorithm is optimal. In the second part, we study the problem in the $\epsilon$ -LDP model. We propose an algorithm that can be seen as locally private and robust version of SE algorithm, which provably achieves (near) optimal rates for both instance-dependent and instance-independent regret. Our results reveal differences between the problem of private MAB with bounded/sub-Gaussian rewards and heavy-tailed rewards. To achieve these (near) optimal rates, we develop several new hard instances and private robust estimators as byproducts, which might be used to other related problems. Finally, experiments also support our theoretical findings and show the effectiveness of our algorithms.

Particle-based Adversarial Local Distribution Regularization

Nguyen Thanh · Trung Le · He Zhao · Jianfei Cai · Dinh Phung

Adversarial training defense (ATD) and virtual adversarial training (VAT) are the two most effective methods to improve model robustness against attacks and model generalization. While ATD is usually applied in robust machine learning, VAT is used in semi-supervised learning and domain adaption. In this paper, we introduce a novel adversarial local distribution regularization. The adversarial local distribution is defined by a set of all adversarial examples within a ball constraint given a natural input. We illustrate this regularization is a general form of previous methods (e.g., PGD, TRADES, VAT and VADA). We conduct comprehensive experiments on MNIST, SVHN and CIFAR10 to illustrate that our method outperforms well-known methods such as PGD, TRADES and ADT in robust machine learning, VAT in semi-supervised learning and VADA in domain adaption. Our implementation is on Github: https://github.com/PotatoThanh/ALD-Regularization.

Sobolev Transport: A Scalable Metric for Probability Measures with Graph Metrics

Tam Le · Truyen Nguyen · Dinh Phung · Viet Anh Nguyen

Optimal transport (OT) is a popular measure to compare probability distributions. However, OT suffers a few drawbacks such as (i) a high complexity for computation, (ii) indefiniteness which limits its applicability to kernel machines. In this work, we consider probability measures supported on a graph metric space and propose a novel Sobolev transport metric. We show that the Sobolev transport metric yields a closed-form formula for fast computation and it is negative definite. We show that the space of probability measures endowed with this transport distance is isometric to a bounded convex set in a Euclidean space with a weighted l_p distance. We further exploit the negative definiteness of the Sobolev transport to design positive-definite kernels, and evaluate their performances against other baselines in document classification with word embeddings and in topological data analysis.

A Class of Geometric Structures in Transfer Learning: Minimax Bounds and Optimality

Xuhui Zhang · Jose Blanchet · Soumyadip Ghosh · Mark Squillante

We study the problem of transfer learning, observing that previous efforts to understand its information-theoretic limits do not fully exploit the geometric structure of the source and target domains. In contrast, our study first illustratesthe benefits of incorporating a natural geometric structure within a linear regression model, which corresponds to the generalized eigenvalue problem formed by the Gram matrices of both domains. We next establish a finite-sample minimax lower bound, propose a refined model interpolation estimator that enjoys a matching upper bound, and then extend our framework to multiple source domains and generalized linear models. Surprisingly, as long as information is available on the distance between the source and target parameters, negative-transfer does not occur. Simulation studies show that our proposed interpolation estimator outperforms state-of-the-art transfer learning methods in both moderate- and high-dimensional settings.

A Last Switch Dependent Analysis of Satiation and Seasonality in Bandits

Pierre Laforgue · Giulia Clerici · Nicolò Cesa-Bianchi · Ran Gilad-Bachrach

Motivated by the fact that humans like some level of unpredictability or novelty, and might therefore get quickly bored when interacting with a stationary policy, we introduce a novel non-stationary bandit problem, where the expected reward of an arm is fully determined by the time elapsed since the arm last took part in a switch of actions. Our model generalizes previous notions of delay-dependent rewards, and also relaxes most assumptions on the reward function. This enables the modeling of phenomena such as progressive satiation and periodic behaviours. Building upon the Combinatorial Semi-Bandits (CSB) framework, we design an algorithm and prove a bound on its regret with respect to the optimal non-stationary policy (which is NP-hard to compute). Similarly to previous works, our regret analysis is based on defining and solving an appropriate trade-off between approximation and estimation. Preliminary experiments confirm the superiority of our algorithm over both the oracle greedy approach and a vanilla CSB solver.

Amortised Likelihood-free Inference for Expensive Time-series Simulators with Signatured Ratio Estimation

Joel Dyer · Patrick Cannon · Sebastian Schmon

Simulation models of complex dynamics in the natural and social sciences commonly lack a tractable likelihood function, rendering traditional likelihood-based statistical inference impossible. Recent advances in machine learning have introduced novel algorithms for estimating otherwise intractable likelihood functions using a likelihood ratio trick based on binary classifiers. Consequently, efficient likelihood approximations can be obtained whenever good probabilistic classifiers can be constructed. We propose a kernel classifier for sequential data using path signatures based on the recently introduced signature kernel. We demonstrate that the representative power of signatures yields a highly performant classifier, even in the crucially important case where sample numbers are low. In such scenarios, our approach can outperform sophisticated neural networks for common posterior inference tasks.

Analysis of a Target-Based Actor-Critic Algorithm with Linear Function Approximation

Anas Barakat · Pascal Bianchi · Julien Lehmann

Actor-critic methods integrating target networks have exhibited a stupendous empirical success in deep reinforcement learning. However, a theoretical understanding of the use of target networks in actor-critic methods is largely missing in the literature. In this paper, we reduce this gap between theory and practice by proposing the first theoretical analysis of an online target-based actor-critic algorithm with linear function approximation in the discounted reward setting. Our algorithm uses three different timescales: one for the actor and two for the critic. Instead of using the standard single timescale temporal difference (TD) learning algorithm as a critic, we use a two timescales target-based version of TD learning closely inspired from practical actor-critic algorithms implementing target networks. First, we establish asymptotic convergence results for both the critic and the actor under Markovian sampling. Then, we provide a finite-time analysis showing the impact of incorporating a target network into actor-critic methods.

A Non-asymptotic Approach to Best-Arm Identification for Gaussian Bandits

Antoine Barrier · Aurélien Garivier · Tomáš Kocák

We propose a new strategy for best-arm identification with fixed confidence of Gaussian variables with bounded means and unit variance. This strategy, called Exploration-Biased Sampling, is not only asymptotically optimal: it is to the best of our knowledge the first strategy with non-asymptotic bounds that asymptotically matches the sample complexity.But the main advantage over other algorithms like Track-and-Stop is an improved behavior regarding exploration: Exploration-Biased Sampling is biased towards exploration in a subtle but natural way that makes it more stable and interpretable. These improvements are allowed by a new analysis of the sample complexity optimization problem, which yields a faster numerical resolution scheme and several quantitative regularity results that we believe of high independent interest.

An Optimal Algorithm for Strongly Convex Minimization under Affine Constraints

Adil Salim · Laurent CONDAT · Dmitry Kovalev · Peter Richtarik

Optimization problems under affine constraints appear in various areas of machine learning. We consider the task of minimizing a smooth strongly convex function F(x) under the affine constraint Kx = b, with an oracle providing evaluations of the gradient of F and multiplications by K and its transpose. We provide lower bounds on the number of gradient computations and matrix multiplications to achieve a given accuracy. Then we propose an accelerated primal–dual algorithm achieving these lower bounds. Our algorithm is the first optimal algorithm for this class of problems.

A Witness Two-Sample Test

Jonas Kübler · Wittawat Jitkrittum · Bernhard Schölkopf · Krikamol Muandet

The Maximum Mean Discrepancy (MMD) has been the state-of-the-art nonparametric test for tackling the two-sample problem. Its statistic is given by the difference in expectations of the witness function, a real-valued function defined as a weighted sum of kernel evaluations on a set of basis points. Typically the kernel is optimized on a training set, and hypothesis testing is performed on a separate test set to avoid overfitting (i.e., control type-I error). That is, the test set is used to simultaneously estimate the expectations and define the basis points, while the training set only serves to select the kernel and is discarded. In this work, we propose to use the training data to also define the weights and the basis points for better data efficiency. We show that 1) the new test is consistent and has a well-controlled type-I error; 2) the optimal witness function is given by a precision-weighted mean in the reproducing kernel Hilbert space associated with the kernel; and 3) the test power of the proposed test is comparable or exceeds that of the MMD and other modern tests, as verified empirically on challenging synthetic and real problems (e.g., Higgs data).

Basis Matters: Better Communication-Efficient Second Order Methods for Federated Learning

Xun Qian · Rustem Islamov · Mher Safaryan · Peter Richtarik

Recent advances in distributed optimization have shown that Newton-type methods with proper communication compression mechanisms can guarantee fast local rates and low communication cost compared to first order methods. We discover that the communication cost of these methods can be further reduced, sometimes dramatically so, with a surprisingly simple trick: {\em Basis Learn (BL)}. The idea is to transform the usual representation of the local Hessians via a change of basis in the space of matrices and apply compression tools to the new representation. To demonstrate the potential of using custom bases, we design a new Newton-type method (BL1), which reduces communication cost via both {\em BL} technique and bidirectional compression mechanism. Furthermore, we present two alternative extensions (BL2 and BL3) to partial participation to accommodate federated learning applications. We prove local linear and superlinear rates independent of the condition number. Finally, we support our claims with numerical experiments by comparing several first and second order methods.

Bayesian Classifier Fusion with an Explicit Model of Correlation

Susanne Trick · Constantin Rothkopf

Combining the outputs of multiple classifiers or experts into a single probabilistic classification is a fundamental task in machine learning with broad applications from classifier fusion to expert opinion pooling. Here we present a hierarchical Bayesian model of probabilistic classifier fusion based on a new correlated Dirichlet distribution. This distribution explicitly models positive correlations between marginally Dirichlet-distributed random vectors thereby allowing explicit modeling of correlations between base classifiers or experts. The proposed model naturally accommodates the classic Independent Opinion Pool and other independent fusion algorithms as special cases. It is evaluated by uncertainty reduction and correctness of fusion on synthetic and real-world data sets. We show that a change in performance of the fused classifier due to uncertainty reduction can be Bayes optimal even for highly correlated base classifiers.

Bayesian Inference and Partial Identification in Multi-Treatment Causal Inference with Unobserved Confounding

Jiajing Zheng · Alexander D'Amour · Alexander Franks

In causal estimation problems, the parameter of interest is often only partially identified, implying that the parameter cannot be recovered exactly, even with infinite data. Here, we study Bayesian inference for partially identified treatment effects in multi-treatment causal inference problems with unobserved confounding. In principle, inferring the partially identified treatment effects is natural under the Bayesian paradigm, but the results can be highly sensitive to parameterization and prior specification, often in surprising ways. It is thus essential to understand which aspects of the conclusions about treatment effects are driven entirely by the prior specification. We use a so-called transparent parameterization to contextualize the effects of more interpretable scientifically motivated prior specifications on the multiple effects. We demonstrate our analysis in an example quantifying the effects of gene expression levels on mouse obesity.

Being a Bit Frequentist Improves Bayesian Neural Networks

Agustinus Kristiadi · Matthias Hein · Philipp Hennig

Despite their compelling theoretical properties, Bayesian neural networks (BNNs) tend to perform worse than frequentist methods in classification-based uncertainty quantification (UQ) tasks such as out-of-distribution (OOD) detection. In this paper, based on empirical findings in prior works, we hypothesize that this issue is because even recent Bayesian methods have never considered OOD data in their training processes, even though this OOD training'' technique is an integral part of state-of-the-art frequentist UQ methods. To validate this, we treat OOD data as a first-class citizen in BNN training by exploring four different ways of incorporating OOD data into Bayesian inference. We show in extensive experiments that OOD-trained BNNs are competitive to recent frequentist baselines. This work thus provides strong baselines for future work in Bayesian UQ.

Best Arm Identification with Safety Constraints

Zhenlin Wang · Andrew Wagenmaker · Kevin Jamieson

The best arm identification problem in the multi-armed bandit setting is an excellent model of many real-world decision-making problems, yet it fails to capture the fact that in the real-world, safety constraints often must be met while learning. In this work we study the question of best-arm identification in safety-critical settings, where the goal of the agent is to find the best safe option out of many, while exploring in a way that guarantees certain, initially unknown safety constraints are met. We first analyze this problem in the setting where the reward and safety constraint takes a linear structure, and show nearly matching upper and lower bounds. We then analyze a much more general version of the problem where we only assume the reward and safety constraint can be modeled by monotonic functions, and propose an algorithm in this setting which is guaranteed to learn safely. We conclude with experimental results demonstrating the effectiveness of our approaches in scenarios such as safely identifying the best drug out of many in order to treat an illness.

Beyond Data Samples: Aligning Differential Networks Estimation with Scientific Knowledge

Arshdeep Sekhon · Zhe Wang · Yanjun Qi

Learning the differential statistical dependency network between two contexts is essential for many real-life applications, mostly in the high dimensional low sample regime. In this paper, we propose a novel differential network estimator that allows integrating various sources of knowledge beyond data samples. The proposed estimator is scalable to a large number of variables and achieves a sharp asymptotic convergence rate. Empirical experiments on extensive simulated data and four real-world applications (one on neuroimaging and three from functional genomics) show that our approach achieves improved differential network estimation and provides better supports to downstream tasks like classification. Our results highlight significant benefits of integrating group, spatial and anatomic knowledge during differential genetic network identification and brain connectome change discovery.

Bias-Variance Decompositions for Margin Losses

Danny Wood · Tingting Mu · Gavin Brown

We introduce a novel bias-variance decomposition for a range of strictly convex margin losses, including the logistic loss (minimized by the classic LogitBoost algorithm) as well as the squared margin loss and canonical boosting loss.Furthermore we show that, for all strictly convex margin losses, the expected risk decomposes into the risk of a "central" model and a term quantifying variation in the functional margin with respect to variations in the training data. These decompositions provide a diagnostic tool for practitioners to understand model overfitting/underfitting, and have implications for additive ensemble models---for example, when our bias-variance decomposition holds, there is a corresponding "ambiguity" decomposition, which can be used to quantify model diversity.

Calibration Error for Heterogeneous Treatment Effects

Yizhe Xu · Steve Yadlowsky

Recently, many researchers have advanced data-driven methods for modeling heterogeneous treatment effects (HTEs). Even still, estimation of HTEs is a difficult task--these methods frequently over- or under-estimate the treatment effects, leading to poor calibration of the resulting models. However, while many methods exist for evaluating the calibration of prediction and classification models, formal approaches to assess the calibration of HTE models are limited to the calibration slope. In this paper, we define an analogue of the (L2) expected calibration error for HTEs, and propose a robust estimator. Our approach is motivated by doubly robust treatment effect estimators, making it unbiased, and resilient to confounding, overfitting, and high-dimensionality issues. Furthermore, our method is straightforward to adapt to many structures under which treatment effects can be identified, including randomized trials, observational studies, and survival analysis. We illustrate how to use our proposed metric to evaluate the calibration of learned HTE models through the application to the CRITEO-UPLIFT Trial.

Can Functional Transfer Methods Capture Simple Inductive Biases?

Arne Nix · Suhas Shrinivasan · Edgar Walker · Fabian Sinz

Transferring knowledge embedded in trained neural networks is a core problem in areas like model compression and continual learning. Among knowledge transfer approaches, functional transfer methods such as knowledge distillation and representational distance learning are particularly promising, since they allow for transferring knowledge across different architectures and tasks. Considering various characteristics of networks that are desirable to transfer, equivariance is a notable property that enables a network to capture valuable relationships in the data. We assess existing functional transfer methods on their ability to transfer equivariance and empirically show that they fail to even transfer shift equivariance, one of the simplest equivariances. Further theoretical analysis demonstrates that representational similarity methods, in fact, cannot guarantee the transfer of the intended equivariance. Motivated by these findings, we develop a novel transfer method that learns an equivariance model from a given teacher network and encourages the student network to acquire the same equivariance, via regularization. Experiments show that our method successfully transfers equivariance even in cases where highly restrictive methods, such as directly matching student and teacher representations, fail.

Can Pretext-Based Self-Supervised Learning Be Boosted by Downstream Data? A Theoretical Analysis

Jiaye Teng · Weiran Huang · Haowei He

Pretext-based self-supervised learning learns the semantic representation via a handcrafted pretext task over unlabeled data and then uses the learned representation for downstream tasks, which effectively reduces the sample complexity of downstream tasks under Conditional Independence (CI) condition. However, the downstream sample complexity gets much worse if the CI condition does not hold. One interesting question is whether we can make the CI condition hold by using downstream data to refine the unlabeled data to boost self-supervised learning. At first glance, one might think that seeing downstream data in advance would always boost the downstream performance. However, we show that it is not intuitively true and point out that in some cases, it hurts the final performance instead. In particular, we prove both model-free and model-dependent lower bounds of the number of downstream samples used for data refinement. Moreover, we conduct various experiments on both synthetic and real-world datasets to verify our theoretical results.

Causal Effect Identification with Context-specific Independence Relations of Control Variables

Ehsan Mokhtarian · Fateme Jamshidi · Jalal Etesami · Negar Kiyavash

We study the problem of causal effect identification from observational distribution given the causal graph and some context-specific independence (CSI) relations. It was recently shown that this problem is NP-hard, and while a sound algorithm to learn the causal effects is proposed in Tikka et al. (2019), no complete algorithm for the task exists. In this work, we propose a sound and complete algorithm for the setting when the CSI relations are limited to observed nodes with no parents in the causal graph. One limitation of the state of the art in terms of its applicability is that the CSI relations among all variables, even unobserved ones, must be given (as opposed to learned). Instead, We introduce a set of graphical constraints under which the CSI relations can be learned from mere observational distribution. This expands the set of identifiable causal effects beyond the state of the art.

Certifiably Robust Variational Autoencoders

Ben Barrett · Alexander Camuto · Matthew Willetts · Tom Rainforth

We introduce an approach for training variational autoencoders (VAEs) that are certifiably robust to adversarial attack. Specifically, we first derive actionable bounds on the minimal size of an input perturbation required to change a VAE's reconstruction by more than an allowed amount, with these bounds depending on certain key parameters such as the Lipschitz constants of the encoder and decoder. We then show how these parameters can be controlled, thereby providing a mechanism to ensure a priori that a VAE will attain a desired level of robustness. Moreover, we extend this to a complete practical approach for training such VAEs to ensure our criteria are met. Critically, our method allows one to specify a desired level of robustness upfront and then train a VAE that is guaranteed to achieve this robustness. We further demonstrate that these Lipschitz-constrained VAEs are more robust to attack than standard VAEs in practice.

CF-GNNExplainer: Counterfactual Explanations for Graph Neural Networks

Ana Lucic · Maartje ter Hoeve · Gabriele Tolomei · Maarten de Rijke · Fabrizio Silvestri

Given the increasing promise of graph neural networks (GNNs) in real-world applications, several methods have been developed for explaining their predictions. Existing methods for interpreting predictions from GNNs have primarily focused on generating subgraphs that are especially relevant for a particular prediction. However, such methods are not counterfactual (CF) in nature: given a prediction, we want to understand how the prediction can be changed in order to achieve an alternative outcome. In this work, we propose a method for generating CF explanations for GNNs: the minimal perturbation to the input (graph) data such that the prediction changes. Using only edge deletions, we find that our method, CF-GNNExplainer, can generate CF explanations for the majority of instances across three widely used datasets for GNN explanations, while removing less than 3 edges on average, with at least 94\% accuracy. This indicates that CF-GNNExplainer primarily removes edges that are crucial for the original predictions, resulting in minimal CF explanations.

Complex Momentum for Optimization in Games

Jonathan Lorraine · David Acuna · Paul Vicol · David Duvenaud

We generalize gradient descent with momentum for optimization in differentiable games to have complex-valued momentum. We give theoretical motivation for our method by proving convergence on bilinear zero-sum games for simultaneous and alternating updates. Our method gives real-valued parameter updates, making it a drop-in replacement for standard optimizers. We empirically demonstrate that complex-valued momentum can improve convergence in realistic adversarial games—like generative adversarial networks—by showing we can find better solutions with an almost identical computational cost. We also show a practical complex-valued Adam variant, which we use to train BigGAN to improve inception scores on CIFAR-10.

ContextGen: Targeted Data Generation for Low Resource Domain Specific Text Classification

Lukas Fromme · Jasmina Bogojeska · Jonas Kuhn

To address the challenging low-resource non-topical text classification problems in domain specific settings we introduce ContextGen -- a novel approach that uses targeted text generation with no fine tuning to augment the available small annotated dataset. It first adapts the powerful GPT-2 text generation model to generate samples relevant for the domain by using properly designed context text as input for generation. Then it assigns class labels to the newly generated samples after which they are added to the initial training set. We demonstrate the superior performance of a state-of-the-art text classifier trained with the augmented labelled dataset for four different non-topical tasks in the low resource setting, three of which are from specialized domains.

Co-Regularized Adversarial Learning for Multi-Domain Text Classification

Yuan Wu · Diana Inkpen · Ahmed El-Roby

Multi-domain text classification (MDTC) aims to leverage all available resources from multiple domains to learn a predictive model that can generalize well on these domains. Recently, many MDTC methods adopt adversarial learning, shared-private paradigm, and entropy minimization to yield state-of-the-art results. However, these approaches face three issues: (1) Minimizing domain divergence can not fully guarantee the success of domain alignment; (2) Aligning marginal feature distributions can not fully guarantee the discriminability of the learned features; (3) Standard entropy minimization may make the predictions on unlabeled data over-confident, deteriorating the discriminability of the learned features. In order to address the above issues, we propose a co-regularized adversarial learning (CRAL) mechanism for MDTC. This approach constructs two diverse shared latent spaces, performs domain alignment in each of them, and punishes the disagreements of these two alignments with respect to the predictions on unlabeled data. Moreover, virtual adversarial training (VAT) with entropy minimization is incorporated to impose consistency regularization to the CRAL method. Experiments show that our model outperforms state-of-the-art methods on two MDTC benchmarks.

Corruption-robust Offline Reinforcement Learning

Xuezhou Zhang · Yiding Chen · Xiaojin Zhu · Wen Sun

We study the adversarial robustness in offline reinforcement learning. Given a batch dataset consisting of tuples $(s, a, r, s')$ , an adversary is allowed to arbitrarily modify $\epsilon$ fraction of the tuples.From the corrupted dataset the learner aims to robustly identify a near-optimal policy. We first show that a worst-case $\Omega(d\epsilon)$ optimality gap is unavoidable in linear MDP of dimension $d$ , even if the adversary only corrupts the reward element in a tuple. This contrasts with dimension-free results in robust supervised learning and best-known lower-bound in the online RL setting with corruption. Next, we propose robust variants of the Least-Square Value Iteration (LSVI) algorithm utilizing robust supervised learning oracles, which achieve near-matching performances in cases both with and without full data coverage. The algorithm requires the knowledge of $\epsilon$ to design the pessimism bonus in the no-coverage case. Surprisingly, in this case, the knowledge of $\epsilon$ is necessary, as we show that being adaptive to unknown $\epsilon$ is impossible.This again contrasts with recent results on corruption-robust online RL and implies that robust offline RL is a strictly harder problem.

Data Appraisal Without Data Sharing

XINLEI XU · Awni Hannun · Laurens van der Maaten

One of the most effective approaches to improving the performance of amachine learning model is to procure additional training data.A model owner seeking relevant training data from a data owner needsto appraise the data before acquiring it.However, without a formal agreement, the data owner does not wantto share data.The resulting Catch-22 prevents efficient data marketsfrom forming.This paper proposes adding a data appraisal stage that requires nodata sharing between data owners and model owners. Specifically,we use multi-party computation to implement an appraisal functioncomputed on private data. The appraised value serves as a guide tofacilitate data selection and transaction. We propose an efficientdata appraisal method based on forward influence functions thatapproximates data value through its first-order lossreduction on the current model.The method requires no additional hyper-parameters or re-training.We show that in private, forward influence functions provide anappealing trade-off between high quality appraisal and required computation,in spite of label noise, class imbalance, and missing data.Our work seeks to inspire an open market that incentivizes efficient, equitable exchange of domain-specific training data.

Denoising and change point localisation in piecewise-constant high-dimensional regression coefficients

Fan Wang · Oscar Madrid · Yi Yu · Alessandro Rinaldo

We study the theoretical properties of the fused lasso procedure originally proposed by \cite{tibshirani2005sparsity} in the context of a linear regression model in which the regression coefficient are totally ordered and assumed to be sparse and piecewise constant. Despite its popularity, to the best of our knowledge, estimation error bounds in high-dimensional settings have only been obtained for the simple case in which the design matrix is the identity matrix. We formulate a novel restricted isometry condition on the design matrix that is tailored to the fused lasso estimator and derive estimation bounds for both the constrained version of the fused lasso assuming dense coefficients and for its penalised version. We observe that the estimation error can be dominated by either the lasso or the fused lasso rate, depending on whether the number of non-zero coefficient is larger than the number of piece-wise constant segments. Finally, we devise a post-processing procedure to recover the piecewise-constant pattern of the coefficients. Extensive numerical experiments support our theoretical findings.

Differential privacy for symmetric log-concave mechanisms

Staal Vinterbo

Adding random noise to database query results is an important tool for achieving privacy. A challenge is to minimize this noise while still meeting privacy requirements. Recently, a sufficient and necessary condition for $(\epsilon, \delta)$ -differential privacy for Gaussian noise was published. This condition allows the computation of the minimum privacy-preserving scale for this distribution. We extend this work and provide a sufficient and necessary condition for $(\epsilon, \delta)$ -differential privacy for all symmetric and log-concave noise densities. Our results allow fine-grained tailoring of the noise distribution to the dimensionality of the query result. We demonstrate that this can yield significantly lower mean squared errors than those incurred by the currently used Laplace and Gaussian mechanisms for the same $\epsilon$ and $\delta$ .

Dimensionality Reduction and Prioritized Exploration for Policy Search

Marius Memmel · Puze Liu · Davide Tateo · Jan Peters

Black-box policy optimization is a class of reinforcement learning algorithms that explores and updates the policies at the parameter level. This class of algorithms is widely applied in robotics with movement primitives or non-differentiable policies. Furthermore, these approaches are particularly relevant where exploration at the action level could cause actuator damage or other safety issues. However, Black-box optimization does not scale well with the increasing dimensionality of the policy, leading to high demand for samples, which are expensive to obtain in real-world systems. In many practical applications, policy parameters do not contribute equally to the return. Identifying the most relevant parameters allows to narrow down the exploration and speed up the learning. Furthermore, updating only the effective parameters requires fewer samples, improving the scalability of the method. We present a novel method to prioritize the exploration of effective parameters and cope with full covariance matrix updates. Our algorithm learns faster than recent approaches and requires fewer samples to achieve state-of-the-art results. To select the effective parameters, we consider both the Pearson correlation coefficient and the Mutual Information. We showcase the capabilities of our approach on the Relative Entropy Policy Search algorithm in several simulated environments, including robotics simulations. Code is available at https://git.ias.informatik.tu-darmstadt.de/ias_code/aistats2022/dr-creps.

Dropout as a Regularizer of Interaction Effects

Benjamin Lengerich · Eric Xing · Rich Caruana

We examine Dropout through the perspective of interactions. This view provides a symmetry to explain Dropout: given N variables, there are N choose k possible sets of k variables to form an interaction (i.e. O(N^k)); conversely, the probability an interaction of k variables survives Dropout at rate p is (1-p)^k (decaying with k). These rates effectively cancel, and so Dropout regularizes against higher-order interactions. We prove this perspective analytically and empirically. This perspective of Dropout as a regularizer against interaction effects has several practical implications: (1) higher Dropout rates should be used when we need stronger regularization against spurious high-order interactions, (2) caution should be exercised when interpreting Dropout-based explanations and uncertainty measures, and (3) networks trained with Input Dropout are biased estimators. We also compare Dropout to other regularizers and find that it is difficult to obtain the same selective pressure against high-order interactions with these methods.

Duel-based Deep Learning system for solving IQ tests

Paulina Tomaszewska · Adam Żychowski · Jacek Mańdziuk

One of the relevant aspects of Artificial General Intelligence is the ability of machines to demonstrate abstract reasoning skills, for instance, through solving (human) IQ tests. This work presents a new approach to machine IQ tests solving formulated as Raven’s Progressive Matrices (RPMs), called Duel-IQ. The proposed solution incorporates the concept of a tournament in which the best answer is chosen based on a set of duels between candidate RPM answers. The three relevant aspects are: (1) low computational and design complexity, (2) proposition of two schemes of pairing up candidate answers for the duels and (3) evaluation of the system on a dataset of shapes other than those used for training. Depending on a particular variant, the system reaches up to 82.8% accuracy on average in RPM tasks with 5 candidate answers and is on par with human performance and superior to other literature approaches of comparable complexity when training and test sets are from the same distribution.

Efficient computation of the the volume of a polytope in high-dimensions using Piecewise Deterministic Markov Processes

Augustin Chevallier · Frédéric Cazals · Paul Fearnhead

Computing the volume of a polytope in high dimensions iscomputationally challenging but has wide applications. Currentstate-of-the-art algorithms to compute such volumes rely on efficientsampling of a Gaussian distribution restricted to the polytope, usinge.g. Hamiltonian Monte Carlo. We present a new sampling strategy thatuses a Piecewise Deterministic Markov Process. Like Hamiltonian MonteCarlo, this new method involves simulating trajectories of anon-reversible process and inherits similar good mixingproperties. However, importantly, the process can be simulated moreeasily due to its piecewise linear trajectories — and this leads to areduction of the computational cost by a factor of the dimension ofthe space. Our experiments indicate that our method is numericallyrobust and is one order of magnitude faster (or better) than existingmethods using Hamiltonian Monte Carlo. On a single core processor, wereport computational time of a few minutes up to dimension 500.

Encrypted Linear Contextual Bandit

Evrard Garcelon · Matteo Pirotta · Vianney Perchet

Contextual bandit is a general framework for online learning in sequential decision-making problems that has found application in a wide range of domains, including recommendation systems, online advertising, and clinical trials. A critical aspect of bandit methods is that they require to observe the contexts --i.e., individual or group-level data-- and rewards in order to solve the sequential problem.The large deployment in industrial applications has increased interest in methods that preserve the users' privacy.In this paper, we introduce a privacy-preserving bandit framework based on homomorphic encryption which allows computations using encrypted data. The algorithm only observes encrypted information (contexts and rewards) and has no ability to decrypt it.Leveraging the properties of homomorphic encryption, we show that despite the complexity of the setting, it is possible to solve linear contextual bandits over encrypted data with a $\widetilde{O}(d\sqrt{T})$ regret bound in any linear contextual bandit problem, while keeping data encrypted.

Estimators of Entropy and Information via Inference in Probabilistic Models

Feras Saad · Marco Cusumano-Towner · Vikash Mansinghka

Estimating information-theoretic quantities such as entropy and mutual information is central to many problems in statistics and machine learning, but challenging in high dimensions. This paper presents estimators of entropy via inference (EEVI), which deliver upper and lower bounds on many information quantities for arbitrary variables in a probabilistic generative model. These estimators use importance sampling with proposal distribution families that include amortized variational inference and sequential Monte Carlo, which can be tailored to the target model and used to squeeze true information values with high accuracy. We present several theoretical properties of EEVI and demonstrate scalability and efficacy on two problems from the medical domain: (i) in an expert system for diagnosing liver disorders, we rank medical tests according to how informative they are about latent diseases, given a pattern of observed symptoms and patient attributes; and (ii) in a differential equation model of carbohydrate metabolism, we find optimal times to take blood glucose measurements that maximize information about a diabetic patient's insulin sensitivity, given their meal and medication schedule.

Exact Community Recovery over Signed Graphs

Xiaolu Wang · Peng Wang · Anthony Man-Cho So

Signed graphs encode similarity and dissimilarity relationships among different entities with positive and negative edges. In this paper, we study the problem of community recovery over signed graphs generated by the signed stochastic block model (SSBM) with two equal-sized communities. Our approach is based on the maximum likelihood estimation (MLE) of the SSBM. Unlike many existing approaches, our formulation reveals that the positive and negative edges of a signed graph should be treated unequally. We then propose a simple two-stage iterative algorithm for solving the regularized MLE. It is shown that in the logarithmic degree regime, the proposed algorithm can exactly recover the underlying communities in nearly-linear time at the information-theoretic limit. Numerical results on both synthetic and real data are reported to validate and complement our theoretical developments and demonstrate the efficacy of the proposed method.

Feature Collapsing for Gaussian Process Variable Ranking

Isaac Sebenius · Topi Paananen · Aki Vehtari

At present, there is no consensus on the most effective way to establish feature relevance for Gaussian process models. The most common heuristic, Automatic Relevance Determination, has several downsides; many alternate methods incur unacceptable computational costs. Existing methods based on sensitivity analysis of the posterior predictive distribution are promising, but are heavily biased and show room for improvement. This paper proposes Feature Collapsing as a novel method for performing GP feature relevance determination in an effective, unbiased, and computationally-inexpensive manner compared to existing algorithms.

Finite Sample Analysis of Mean-Volatility Actor-Critic for Risk-Averse Reinforcement Learning

Khaled Eldowa · Lorenzo Bisi · Marcello Restelli

The goal in the standard reinforcement learning problem is to find a policy that optimizes the expected return. However, such an objective is not adequate in a lot of real-life applications, like finance, where controlling the uncertainty of the outcome is imperative. The mean-volatility objective penalizes, through a tunable parameter, policies with high variance of the per-step reward. An interesting property of this objective is that it admits simple linear Bellman equations that resemble, up to a reward transformation, those of the risk-neutral case. However, the required reward transformation is policy-dependent, and requires the (usually unknown) expected return of the used policy. In this work, we propose two general methods for policy evaluation under the mean-volatility objective: the direct method and the factored method. We then extend recent results for finite sample analysis in the risk-neutral actor-critic setting to the mean-volatility case. Our analysis shows that the sample complexity to attain an $\epsilon$ -accurate stationary point is the same as that of the risk-neutral version, using either policy evaluation method for training the critic. Finally, we carry out experiments to test the proposed methods in a simple environment that exhibits some trade-off between optimality, in expectation, and uncertainty of outcome.

Forward Looking Best-Response Multiplicative Weights Update Methods for Bilinear Zero-sum Games

Michail Fasoulakis · Evangelos Markakis · Yannis Pantazis · Constantinos Varsos

Our work focuses on extra gradient learning algorithms for finding Nash equilibria in bilinear zero-sum games. The proposed method, which can be formally considered as a variant of Optimistic Mirror Descent \citep{DBLP:conf/iclr/MertikopoulosLZ19}, uses a large learning rate for the intermediate gradient step which essentially leads to computing (approximate) best response strategies against the profile of the previous iteration. Although counter-intuitive at first sight due to the irrationally large, for an iterative algorithm, intermediate learning step, we prove that the method guarantees last-iterate convergence to an equilibrium.Particularly, we show that the algorithm reaches first an $\eta^{1/\rho}$ -approximate Nash equilibrium, with $\rho > 1$ , by decreasing the Kullback-Leibler divergence of each iterate by at least $\Omega(\eta^{1+\frac{1}{\rho}})$ , for sufficiently small learning rate $\eta$ , until the method becomes a contracting map, and converges to the exact equilibrium.Furthermore, we perform experimental comparisons with the optimistic variant of the multiplicative weights update method, by \citep{Daskalakis2019LastIterateCZ} and show that our algorithm has significant practical potential since it offers substantial gains in terms of accelerated convergence.

Generalised GPLVM with Stochastic Variational Inference

Vidhi Lalchand · Aditya Ravuri · Neil Lawrence

Gaussian process latent variable models (GPLVM) are a flexible and non-linear approach to dimensionality reduction, extending classical Gaussian processes to an unsupervised learning context. The Bayesian incarnation of the GPLVM uses a variational framework, where the posterior over latent variables is approximated by a well-behaved variational family, a factorised Gaussian yielding a tractable lower bound. However, the non-factorisability of the lower bound prevents truly scalable inference. In this work, we study the doubly stochastic formulation of the Bayesian GPLVM model amenable with minibatch training. We show how this framework is compatible with different latent variable formulations and perform experiments to compare a suite of models. Further, we demonstrate how we can train in the presence of massively missing data and obtain high-fidelity reconstructions. We demonstrate the model's performance by benchmarking against the canonical sparse GPLVM for high dimensional data examples.

How to scale hyperparameters for quickshift image segmentation

Damien Garreau

Quickshift is a popular algorithm for image segmentation, used as a preprocessing step in many applications. Unfortunately, it is quite challenging to understand the hyperparameters' influence on the number and shape of superpixels produced by the method. In this paper, we study theoretically a slightly modified version of the quickshift algorithm, with a particular emphasis on homogeneous image patches with i.i.d. pixel noise and sharp boundaries between such patches. Leveraging this analysis, we derive a simple heuristic to scale quickshift hyperparameters with respect to the image size, which we check empirically.

Identity Testing of Reversible Markov Chains

Sela Fried · Geoffrey Wolfer

We consider the problem of identity testing of Markov chain transition matrices based on a single trajectory of observations under the distance notion introduced by Daskalakis et al. (2018a) and further analyzed by Cherapanamjeri and Bartlett (2019). Both works made the restrictive assumption that the Markov chains under consideration are symmetric. In this work we relax the symmetry assumption and show that it is possible to perform identity testing under the much weaker assumption of reversibility, provided that the stationary distributions of the reference and of the unknown Markov chains are close under a distance notion related to the separation distance. Additionally, we provide intuition on the distance notion of Daskalakis et al. (2018a) by showing how it behaves under several natural operations. In particular, we address some of their open questions.

Improving Attribution Methods by Learning Submodular Functions

Piyushi Manupriya · Tarun Menta · SakethaNath Jagarlapudi · Vineeth N Balasubramanian

This work explores the novel idea of learning a submodular scoring function to improve the specificity/selectivity of existing feature attribution methods. Submodular scores are natural for attribution as they are known to accurately model the principle of diminishing returns. A new formulation for learning a deep submodular set function that is consistent with the real-valued attribution maps obtained by existing attribution methods is proposed. The final attribution value of a feature is then defined as the marginal gain in the induced submodular score of the feature in the context of other highly attributed features, thus decreasing the attribution of redundant yet discriminatory features. Experiments on multiple datasets illustrate that the proposed attribution method achieves higher specificity along with good discriminative power. The implementation of our method is publicly available at https://github.com/Piyushi-0/SEA-NN.

Investigating the Role of Negatives in Contrastive Representation Learning

Jordan Ash · Surbhi Goel · Akshay Krishnamurthy · Dipendra Misra

Noise contrastive learning is a popular technique for unsupervised representation learning. In this approach, a representation is obtained via reduction to supervised learning, where given a notion of semantic similarity, the learner tries to distinguish a similar (positive) example from a collection of random (negative) examples. The success of modern contrastive learning pipelines relies on many design decisions, such as the choice of data augmentation, the number of negative examples, and the batch size; however, there is limited understanding as to how these parameters interact and affect downstream performance. We focus on disambiguating the role of one of these parameters: the number of negative examples. Theoretically, we show the existence of a collision-coverage trade-off suggesting that the optimal number of negative examples should scale with the number of underlying concepts in the data. Empirically, we scrutinize the role of the number of negatives in both NLP and vision tasks.

Jointly Efficient and Optimal Algorithms for Logistic Bandits

Louis Faury · Marc Abeille · Kwang-Sung Jun · Clement Calauzenes

Logistic Bandits have recently undergone careful scrutiny by virtue of their combined theoretical and practical relevance. This research effort delivered statistically efficient algorithms, improving the regret of previous strategies by exponentially large factors. Such algorithms are however strikingly costly as they require $\Omega(t)$ operations at each round. On the other hand, a different line of research focused on computational efficiency ( $\mathcal{O}(1)$ per-round cost), but at the cost of letting go of the aforementioned exponential improvements. Obtaining the best of both world is unfortunately not a matter of marrying both approaches. Instead we introduce a new learning procedure for Logistic Bandits. It yields confidence sets which sufficient statistics can be easily maintained online without sacrificing statistical tightness. Combined with efficient planning mechanisms we design fast algorithms which regret performance still match the problem-dependent lower-bound of Abeille et al (2021). To the best of our knowledge, those are the first Logistic Bandit algorithms that simultaneously enjoy statistical and computational efficiency.

Label differential privacy via clustering

Hossein Esfandiari · Vahab Mirrokni · Umar Syed · Sergei Vassilvtiskii

We present new mechanisms for \emph{label differential privacy}, a relaxation of differentially private machine learning that only protects the privacy of the labels in the training set. Our mechanisms cluster the examples in the training set using their (non-private) feature vectors, randomly re-sample each label from examples in the same cluster, and output a training set with noisy labels as well as a modified version of the true loss function. We prove that when the clusters are both large and high-quality, the model that minimizes the modified loss on the noisy training set converges to small excess risk at a rate that is comparable to the rate for non-private learning. We also describe a learning problem in which large clusters are necessary to achieve both strong privacy and either good precision or good recall. Our experiments show that randomizing the labels within each cluster significantly improves the privacy vs. accuracy trade-off compared to applying uniform randomized response to the labels, and also compared to learning a model via DP-SGD.

Learning Quantile Functions for Temporal Point Processes with Recurrent Neural Splines

Souhaib BEN TAIEB

We can build flexible predictive models for rich continuous-time event data by combining the framework of temporal point processes (TPP) with (recurrent) neural networks. We propose a new neural parametrization for TPPs based on the conditional quantile function. Specifically, we use a flexible monotonic rational-quadratic spline to learn a smooth continuous quantile function. Conditioning on historical events is achieved through a recurrent neural network. This novel parametrization provides a flexible yet tractable TPP model with multiple advantages, such as analytical sampling and closed-form expressions for quantiles and prediction intervals. While neural TPP models are often trained using maximum likelihood estimation, we consider the more robust continuous ranked probability score (CRPS). We additionally derive a closed-form expression for the CRPS of our model. Finally, we demonstrate that the proposed model achieves state-of-the-art performance in standard prediction tasks on both synthetic and real-world event data.

Learning Tensor Representations for Meta-Learning

Samuel Deng · Yilin Guo · Daniel Hsu · Debmalya Mandal

We introduce a tensor-based model of shared representation for meta-learning from a diverse set of tasks. Prior works on learning linear representations for meta-learning assume that there is a common shared representation across different tasks, and do not consider the additional task-specific observable side information. In this work, we model the meta-parameter through an order- $3$ tensor, which can adapt to the observed task features of the task. We propose two methods to estimate the underlying tensor. The first method solves a tensor regression problem and works under natural assumptions on the data generating process. The second method uses the method of moments under additional distributional assumptions and has an improved sample complexity in terms of the number of tasks.We also focus on the meta-test phase, and consider estimating task-specific parameters on a new task. Substituting the estimated tensor from the first step allows us estimating the task-specific parameters with very few samples of the new task, thereby showing the benefits of learning tensor representations for meta-learning. Finally, through simulation and several real-world datasets, we evaluate our methods and show that it improves over previous linear models of shared representations for meta-learning.

Learning to Plan Variable Length Sequences of Actions with a Cascading Bandit Click Model of User Feedback

Anirban Santara · Gaurav Aggarwal · Shuai Li · Claudio Gentile

Motivated by problems of ranking with partial information, we introduce a variant of the cascading bandit model that considers flexible length sequences with varying rewards and losses. We formulate two generative models for this problem within the generalized linear setting, and design and analyze upper confidence algorithms for it. Our analysis delivers tight regret bounds which, when specialized to standard cascading bandits, results in sharper guarantees than previously available in the literature. We evaluate our algorithms against a representative sample of cascading bandit baselines on a number of real-world datasets and show significantly improved empirical performance.

Lifted Division for Lifted Hugin Belief Propagation

Moritz Hoffmann · Tanya Braun · Ralf Möller

The lifted junction tree algorithm (LJT) is an inference algorithm that allows for tractable inference regarding domain sizes. To answer multiple queries efficiently, it decomposes a first-order input model into a first-order junction tree. During inference, degrees of belief are propagated through the tree. This propagation significantly contributes to the runtime complexity not just of LJT but of any tree-based inference algorithm. We present a lifted propagation scheme based on the so-called Hugin scheme whose runtime complexity is independent of the degree of the tree. Thereby, lifted Hugin can achieve asymptotic speed improvements over the existing lifted Shafer-Shenoy propagation. An empirical evaluation confirms these results.

LocoProp: Enhancing BackProp via Local Loss Optimization

Ehsan Amid · Rohan Anil · Manfred Warmuth

Second-order methods have shown state-of-the-art performance for optimizing deep neural networks. Nonetheless, their large memory requirement and high computational complexity, compared to first-order methods, hinder their versatility in a typical low-budget setup. This paper introduces a general framework of layerwise loss construction for multilayer neural networks that achieves a performance closer to second-order methods while utilizing first-order optimizers only. Our methodology lies upon a three-component loss, target, and regularizer combination, for which altering each component results in a new update rule. We provide examples using squared loss and layerwise Bregman divergences induced by the convex integral functions of various transfer functions. Our experiments on benchmark models and datasets validate the efficacy of our new approach, reducing the gap between first-order and second-order optimizers.

Maillard Sampling: Boltzmann Exploration Done Optimally

Jie Bian · Kwang-Sung Jun

The PhD thesis of Maillard (2013) presents a rather obscure algorithm for the $K$ -armed bandit problem. This less-known algorithm, which we call Maillard sampling (MS), computes the probability of choosing each arm in a \textit{closed form}, which is not true for Thompson sampling, a widely-adopted bandit algorithm in the industry. This means that the bandit-logged data from running MS can be readily used for counterfactual evaluation, unlike Thompson sampling. Motivated by such merit, we revisit MS and perform an improved analysis to show that it achieves both the asymptotical optimality and $\sqrt{KT\log{T}}$ minimax regret bound where $T$ is the time horizon, which matches the known bounds for asymptotically optimal UCB. %'s performance. We then propose a variant of MS called MS $^+$ that improves its minimax bound to $\sqrt{KT\log{K}}$ . MS $^+$ can also be tuned to be aggressive (i.e., less exploration) without losing the asymptotic optimality, a unique feature unavailable from existing bandit algorithms. Our numerical evaluation shows the effectiveness of MS $^+$ .

Mean Nyström Embeddings for Adaptive Compressive Learning

Antoine Chatalic · Luigi Carratino · Ernesto De Vito · Lorenzo Rosasco

Compressive learning is an approach to efficient large scale learning based on sketching an entire dataset to a single mean embedding (the sketch), i.e. a vector of generalized moments. The learning task is then approximately solved as an inverse problem using an adapted parametric model. Previous works in this context have focused on sketches obtained by averaging random features, that while universal can be poorly adapted to the problem at hand. In this paper, we propose and study the idea of performing sketching based on data-dependent Nyström approximation.From a theoretical perspective we prove that the excess risk can be controlled under a geometric assumption relating the parametric model used to learn from the sketch and the covariance operator associated to the task at hand. Empirically, we show for k-means clustering and Gaussian modeling that for a fixed sketch size, Nyström sketches indeed outperform those built with random features.

Metalearning Linear Bandits by Prior Update

Amit Peleg · Naama Pearl · Ron Meir

Fully Bayesian approaches to sequential decision-making assume that problem parameters are generated from a known prior. In practice, such information is often lacking. This problem is exacerbated in setups with partial information, where a misspecified prior may lead to poor exploration and performance. In this work we prove, in the context of stochastic linear bandits and Gaussian priors, that as long as the prior is sufficiently close to the true prior, the performance of the applied algorithm is close to that of the algorithm that uses the true prior. Furthermore, we address the task of learning the prior through metalearning, where a learner updates her estimate of the prior across multiple task instances in order to improve performance on future tasks. We provide an algorithm and regret bounds, demonstrate its effectiveness in comparison to an algorithm that knows the correct prior, and support our theoretical results empirically. Our theoretical results hold for a broad class of algorithms, including Thompson Sampling and Information Directed Sampling.

Minimal Expected Regret in Linear Quadratic Control

Yassir Jedra · Alexandre Proutiere

We consider the problem of online learning in Linear Quadratic Control systems whose state transition and state-action transition matrices $A$ and $B$ may be initially unknown. We devise an online learning algorithm and provide guarantees on its expected regret. This regret at time $T$ is upper bounded (i) by $\widetilde{O}((d_u+d_x)\sqrt{d_xT})$ when $A$ and $B$ are unknown, (ii) by $\widetilde{O}(d_x^2\log(T))$ if only $A$ is unknown, and (iii) by $\widetilde{O}(d_x(d_u+d_x)\log(T))$ if only $B$ is unknown and under some mild non-degeneracy condition ( $d_x$ and $d_u$ denote the dimensions of the state and of the control input, respectively). These regret scalings are minimal in $T$ , $d_x$ and $d_u$ as they match existing lower bounds in scenario (i) when $d_x\le d_u$ [SF20], and in scenario (ii) [Lai86]. We conjecture that our upper bounds are also optimal in scenario (iii) (there is no known lower bound in this setting).Existing online algorithms proceed in epochs of (typically exponentially) growing durations. The control policy is fixed within each epoch, which considerably simplifies the analysis of the estimation error on $A$ and $B$ and hence of the regret. Our algorithm departs from this design choice: it is a simple variant of certainty-equivalence regulators, where the estimates of $A$ and $B$ and the resulting control policy can be updated as frequently as we wish, possibly at every step. Quantifying the impact of such a constantly-varying control policy on the performance of these estimates and on the regret constitutes one of the technical challenges tackled in this paper.

Mode estimation on matrix manifolds: Convergence and robustness

Hiroaki Sasaki · Jun-ichiro Hirayama · Takafumi Kanamori

Data on matrix manifolds are ubiquitous on a wide range of research fields. The key issue is estimation of the modes (i.e., maxima) of the probability density function underlying the data. For instance, local modes (i.e., local maxima) can be used for clustering, while the global mode (i.e., the global maximum) is a robust alternative to the Fr{\'e}chet mean. Previously, to estimate the modes, an iterative method has been proposed based on a Riemannian gradient estimator and empirically showed the superior performance in clustering (Ashizawa et al., 2017). However, it has not been theoretically investigated if the iterative method is able to capture the modes based on the gradient estimator. In this paper, we propose simple iterative methods for mode estimation on matrix manifolds based on the Euclidean metric. A key contribution is to perform theoretical analysis and establish sufficient conditions for the monotonic ascending and convergence of the proposed iterative methods. In addition, for the previous method, we prove the monotonic ascending property towards a mode. Thus, our work can be also regarded as compensating for the lack of theoretical analysis in the previous method. Furthermore, the robustness of the iterative methods is theoretically investigated in terms of the breakdown point. Finally, the proposed methods are experimentally demonstrated to work well in clustering and robust mode estimation on matrix manifolds.

MT3: Meta Test-Time Training for Self-Supervised Test-Time Adaption

Alexander Bartler · Andre Bühler · Felix Wiewel · Mario Döbler · Bin Yang

An unresolved problem in Deep Learning is the ability of neural networks to cope with domain shifts during test-time, imposed by commonly fixing network parameters after training. Our proposed method Meta Test-Time Training (MT3), however, breaks this paradigm and enables adaption at test-time. We combine meta-learning, self-supervision and test-time training to learn to adapt to unseen test distributions. By minimizing the self-supervised loss, we learn task-specific model parameters for different tasks. A meta-model is optimized such that its adaption to the different task-specific models leads to higher performance on those tasks. During test-time a single unlabeled image is sufficient to adapt the meta-model parameters. This is achieved by minimizing only the self-supervised loss component resulting in a better prediction for that image. Our approach significantly improves the state-of-the-art results on the CIFAR-10-Corrupted image classification benchmark.

Nearly Tight Convergence Bounds for Semi-discrete Entropic Optimal Transport

Alex Delalande

We derive nearly tight and non-asymptotic convergence bounds for solutions of entropic semi-discrete optimal transport. These bounds quantify the stability of the dual solutions of the regularized problem (sometimes called Sinkhorn potentials) w.r.t. the regularization parameter, for which we ensure a better than Lipschitz dependence. Such facts may be a first step towards a mathematical justification of $\varepsilon$ -scaling heuristics for the numerical resolution of regularized semi-discrete optimal transport. Our results also entail a non-asymptotic and tight expansion of the difference between the entropic and the unregularized costs.

New Coresets for Projective Clustering and Applications

Murad Tukan · Xuan Wu · Samson Zhou · Vladimir Braverman · Dan Feldman

$(j,k)$ -projective clustering is the natural generalization of the family of $k$ -clustering and $j$ -subspace clustering problems. Given a set of points $P$ in $\mathbb{R}^d$ , the goal is to find $k$ flats of dimension $j$ , i.e., affine subspaces, that best fit $P$ under a given distance measure. In this paper, we propose the first algorithm that returns an $L_\infty$ coreset of size polynomial in $d$ . Moreover, we give the first strong coreset construction for general $M$ -estimator regression. Specifically, we show that our construction provides efficient coreset constructions for Cauchy, Welsch, Huber, Geman-McClure, Tukey, $L_1-L_2$ , and Fair regression, as well as general concave and power-bounded loss functions. Finally, we provide experimental results based on real-world datasets, showing the efficacy of our approach.

Online Continual Adaptation with Active Self-Training

Shiji Zhou · Han Zhao · Shanghang Zhang · Lianzhe Wang · Heng Chang · Zhi Wang · Wenwu Zhu

Models trained with offline data often suffer from continual distribution shifts and expensive labeling in changing environments. This calls for a new online learning paradigm where the learner can continually adapt to changing environments with limited labels. In this paper, we propose a new online setting -- Online Active Continual Adaptation, where the learner aims to continually adapt to changing distributions using both unlabeled samples and active queries of limited labels. To this end, we propose Online Self-Adaptive Mirror Descent (OSAMD), which adopts an online teacher-student structure to enable online self-training from unlabeled data, and a margin-based criterion that decides whether to query the labels to track changing distributions. Theoretically, we show that, in the separable case, OSAMD has an $O({T}^{2/3})$ dynamic regret bound under mild assumptions, which is aligned with the $\Omega(T^{2/3})$ lower bound of online learning algorithms with full labels. In the general case, we show a regret bound of $O({T}^{2/3} + \alpha^* T)$ , where $\alpha^*$ denotes the separability of domains and is usually small. Our theoretical results show that OSAMD can fast adapt to changing environments with active queries. Empirically, we demonstrate that OSAMD achieves favorable regrets under changing environments with limited labels on both simulated and real-world data, which corroborates our theoretical findings.

On Multimarginal Partial Optimal Transport: Equivalent Forms and Computational Complexity

Khang Le · Huy Nguyen · Khai Nguyen · Tung Pham · Nhat Ho

We study the multi-marginal partial optimal transport (POT) problem between $m$ discrete (unbalanced) measures with at most $n$ supports. We first prove that we can obtain two equivalent forms of the multimarginal POT problem in terms of the multimarginal optimal transport problem via novel extensions of cost tensors. The first equivalent form is derived under the assumptions that the total masses of each measure are sufficiently close while the second equivalent form does not require any conditions on these masses but at the price of more sophisticated extended cost tensor. Our proof techniques for obtaining these equivalent forms rely on novel procedures of moving mass in graph theory to push transportation plan into appropriate regions. Finally, based on the equivalent forms, we develop an optimization algorithm, named the ApproxMPOT algorithm, that builds upon the Sinkhorn algorithm for solving the entropic regularized multimarginal optimal transport. We demonstrate that the ApproxMPOT algorithm can approximate the optimal value of multimarginal POT problem with a computational complexity upper bound of the order $\bigOtil(m^3(n+1)^{m}/ \varepsilon^2)$ where $\varepsilon > 0$ stands for the desired tolerance.

PAC Learning of Quantum Measurement Classes : Sample Complexity Bounds and Universal Consistency

Arun Padakandla · Abram Magner

We formulate a quantum analogue of the fundamental classical PAC learning problem. As on a quantum computer, we model data to be encoded by modifying specific attributes - spin axis of an electron, plane of polarization of a photon - of sub-atomic particles. Any interaction, including reading off, extracting or learning from such data is via quantum measurements, thus leading us to a problem of PAC learning Quantum Measurement Classes. We propose and analyze the sample complexity of a new ERM algorithm that respects quantum non-commutativity. Our study entails that we define the VC dimension of Positive Operator Valued Measure(ments) (POVMs) concept classes. Our sample complexity bounds involve optimizing over partitions of jointly measurable classes. Finally, we identify universally consistent sequences of POVM classes. Technical components of this work include computations involving tensor products, trace and uniform convergence bounds.

Pairwise Fairness for Ordinal Regression

Matthäus Kleindessner · Samira Samadi · Muhammad Bilal Zafar · Krishnaram Kenthapadi · Chris Russell

We initiate the study of fairness for ordinal regression. We adapt two fairness notions previously considered in fair ranking and propose a strategy for training a predictor that is approximately fair according to either notion. Our predictor has the form of a threshold model, composed of a scoring function and a set of thresholds, and our strategy is based on a reduction to fair binary classification for learning the scoring function and local search for choosing the thresholds. We provide generalization guarantees on the error and fairness violation of our predictor, and we illustrate the effectiveness of our approach in extensive experiments.

Pick-and-Mix Information Operators for Probabilistic ODE Solvers

Nathanael Bosch · Filip Tronarp · Philipp Hennig

Probabilistic numerical solvers for ordinary differential equations compute posterior distributions over the solution of an initial value problem via Bayesian inference. In this paper, we leverage their probabilistic formulation to seamlessly include additional information as general likelihood terms. We show that second-order differential equations should be directly provided to the solver, instead of transforming the problem to first order. Additionally, by including higher-order information or physical conservation laws in the model, solutions become more accurate and more physically meaningful. Lastly, we demonstrate the utility of flexible information operators by solving differential-algebraic equations. In conclusion, the probabilistic formulation of numerical solvers offers a flexible way to incorporate various types of information, thus improving the resulting solutions.

Predicting the impact of treatments over time with uncertainty aware neural differential equations.

Edward De Brouwer · Javier Gonzalez · Stephanie Hyland

Predicting the impact of treatments from ob- servational data only still represents a major challenge despite recent significant advances in time series modeling. Treatment assignments are usually correlated with the predictors of the response, resulting in a lack of data support for counterfactual predictions and therefore in poor quality estimates. Developments in causal inference have lead to methods addressing this confounding by requiring a minimum level of overlap. However, overlap is difficult to assess and usually not satisfied in practice. In this work, we propose Counterfactual ODE (CF-ODE), a novel method to predict the impact of treatments continuously over time using Neural Ordinary Differential Equations equipped with uncertainty estimates. This allows to specifically assess which treatment outcomes can be reliably predicted. We demonstrate over several longitudinal datasets that CF-ODE provides more accurate predictions and more reliable uncertainty estimates than previously available methods.

Predictive variational Bayesian inference as risk-seeking optimization

Futoshi Futami · Tomoharu Iwata · Naonori Ueda · Issei Sato · Masashi Sugiyama

Since the Bayesian inference works poorly under model misspecification, various solutions have been explored to counteract the shortcomings. Recently proposed predictive Bayes (PB) that directly optimizes the Kullback Leibler divergence between the empirical distribution and the approximate predictive distribution shows excellent performances not only under model misspecification but also for over-parametrized models. However, its behavior and superiority are still unclear, which limits the applications of PB.Specifically, the superiority of PB has been shown only in terms of the predictive test log-likelihood and the performancein the sense of parameter estimation has not been investigated yet.Also, it is not clear why PB is superior with misspecified and over-parameterized models. In this paper, we clarify these ambiguities by studying PB in the framework of risk-seeking optimization. To achieve this, first, we provide a consistency theory for PB and then present intuition of robustness of PB to model misspecification using a response function theory. Thereafter, we theoretically and numerically show that PB has an implicit regularization effect that leads to flat local minima in over-parametrized models.

Privacy Amplification by Decentralization

Edwige Cyffers · Aurélien Bellet

Analyzing data owned by several parties while achieving a good trade-off between utility and privacy is a key challenge in federated learning and analytics. In this work, we introduce a novel relaxation of local differential privacy (LDP) that naturally arises in fully decentralized algorithms, i.e., when participants exchange information by communicating along the edges of a network graph without central coordinator. This relaxation, that we call network DP, captures the fact that users have only a local view of the system. To show the relevance of network DP, we study a decentralized model of computation where a token performs a walk on the network graph and is updated sequentially by the party who receives it. For tasks such as real summation, histogram computation and optimization with gradient descent, we propose simple algorithms on ring and complete topologies. We prove that the privacy-utility trade-offs of our algorithms under network DP significantly improve upon what is achievable under LDP, and often match the utility of the trusted curator model. Our results show for the first time that formal privacy gains can be obtained from full decentralization.We also provide experiments to illustrate the improved utility of ourapproach for decentralized training with stochastic gradient descent.

Quadric Hypersurface Intersection for Manifold Learning in Feature Space

Fedor Pavutnitskiy · Sergei O. Ivanov · Evgeniy Abramov · Viacheslav Borovitskiy · Artem Klochkov · Viktor Vialov · Anatolii Zaikovskii · Aleksandr Petiushko

The knowledge that data lies close to a particular submanifold of the ambient Euclidean space may be useful in a number of ways. For instance, one may want to automatically mark any point far away from the submanifold as an outlier or to use the geometry to come up with a better distance metric. Manifold learning problems are often posed in a very high dimension, e.g. for spaces of images or spaces of words. Today, with deep representation learning on the rise in areas such as computer vision and natural language processing, many problems of this kind may be transformed into problems of moderately high dimension, typically of the order of hundreds. Motivated by this, we propose a manifold learning technique suitable for moderately high dimension and large datasets. The manifold is learned from the training data in the form of an intersection of quadric hypersurfaces---simple but expressive objects. At test time, this manifold can be used to introduce a computationally efficient outlier score for arbitrary new data points and to improve a given similarity metric by incorporating the learned geometric structure into it.

Robust Stochastic Linear Contextual Bandits Under Adversarial Attacks

QIN DING · Cho-Jui Hsieh · James Sharpnack

Stochastic linear contextual bandit algorithms have substantial applications in practice, such as recommender systems, online advertising, clinical trials, etc. Recent works show that optimal bandit algorithms are vulnerable to adversarial attacks and can fail completely in the presence of attacks. Existing robust bandit algorithms only work for the non-contextual setting under the attack of rewards and cannot improve the robustness in the general and popular contextual bandit environment. In addition, none of the existing methods can defend against attacked context. In this work, we provide the first robust bandit algorithm for stochastic linear contextual bandit setting under a fully adaptive and omniscient attack with sub-linear regret. Our algorithm not only works under the attack of rewards, but also under attacked context. Moreover, it does not need any information about the attack budget or the particular form of the attack. We provide theoretical guarantees for our proposed algorithm and show by experiments that our proposed algorithm improves the robustness against various kinds of popular attacks.

Sequential Multivariate Change Detection with Calibrated and Memoryless False Detection Rates

Oliver Cobb · Arnaud Van Looveren · Janis Klaise

Responding appropriately to the detections of a sequential change detector requires knowledge of the rate at which false positives occur in the absence of change. Setting detection thresholds to achieve a desired false positive rate is challenging. Existing works resort to setting time-invariant thresholds that focus on the expected runtime of the detector in the absence of change, either bounding it loosely from below or targeting it directly but with asymptotic arguments that we show cause significant miscalibration in practice. We present a simulation-based approach to setting time-varying thresholds that allows a desired expected runtime to be accurately targeted whilst additionally keeping the false positive rate constant across time steps. Whilst the approach to threshold setting is metric agnostic, we show how the cost of using the popular quadratic time MMD estimator can be reduced from $O(N^2B)$ to $O(N^2+NB)$ during configuration and from $O(N^2)$ to $O(N)$ during operation, where $N$ and $B$ are the numbers of reference and bootstrap samples respectively.

Spectral risk-based learning using unbounded losses

Matthew Holland · El Mehdi Haress

In this work, we consider the setting of learning problems under a wide class of spectral risk (or "L-risk") functions, where a Lipschitz-continuous spectral density is used to flexibly assign weight to extreme loss values. We obtain excess risk guarantees for a derivative-free learning procedure under unbounded heavy-tailed loss distributions, and propose a computationally efficient implementation which empirically outperforms traditional risk minimizers in terms of balancing spectral risk and misclassification error.

Stateful Offline Contextual Policy Evaluation and Learning

Nathan Kallus · Angela Zhou

We study off-policy evaluation and learning from sequential data in a structured class of Markov decision processes that arise from repeated interactions with an exogenous sequence of arrivals with contexts, which generate unknown individual-level responses to agent actions. This model can be thought of as an offline generalization of contextual bandits with resource constraints. We formalize the relevant causal structure of problems such as dynamic personalized pricing and other operations management problems in the presence of potentially high-dimensional user types. The key insight is that an individual-level response is often not causally affected by the state variable and can therefore easily be generalized across timesteps and states. When this is true, we study implications for (doubly robust) off-policy evaluation and learning by instead leveraging single time-step evaluation, estimating the expectation over a single arrival via data from a population, for fitted-value iteration in a marginal MDP. We study sample complexity and analyze error amplification that leads to the persistence, rather than attenuation, of confounding error over time. In simulations of dynamic and capacitated pricing, we show improved out-of-sample policy performance in this class of relevant problems.

Tile Networks: Learning Optimal Geometric Layout for Whole-page Recommendation

Shuai Xiao · Zaifan Jiang · Shuang Yang

Finding optimal configurations in a geometric space is a key challenge in many technological disciplines. Current approaches either rely heavily on human domain expertise and are difficult to scale. In this paper we show it is possible to solve configuration optimization problems for whole-page recommendation using reinforcement learning. The proposed \textit{Tile Networks} is a neural architecture that optimizes 2D geometric configurations by arranging items on proper positions. Empirical results on real dataset demonstrate its superior performance compared to traditional learning to rank approaches and recent deep models.

Top K Ranking for Multi-Armed Bandit with Noisy Evaluations

Evrard Garcelon · Vashist Avadhanula · Alessandro Lazaric · Matteo Pirotta

We consider a multi-armed bandit setting where, at the beginning of each round, the learner receives noisy independent, and possibly biased, evaluations of the true reward of each arm and it selects $K$ arms with the objective of accumulating as much reward as possible over $T$ rounds. Under the assumption that at each round the true reward of each arm is drawn from a fixed distribution, we derive different algorithmic approaches and theoretical guarantees depending on how the evaluations are generated. First, we show a $\widetilde{O}(T^{2/3})$ regret in the general case when the observation functions are a genearalized linear function of the true rewards. On the other hand, we show that an improved $\widetilde{O}(\sqrt{T})$ regret can be derived when the observation functions are noisy linear functions of the true rewards. Finally, we report an empirical validation that confirms our theoretical findings, provides a thorough comparison to alternative approaches, and further supports the interest of this setting in practice.

Transfer Learning with Gaussian Processes for Bayesian Optimization

Petru Tighineanu · Kathrin Skubch · Paul Baireuther · Attila Reiss · Felix Berkenkamp · Julia Vinogradska

Bayesian optimization is a powerful paradigm to optimize black-box functions based on scarce and noisy data. Its data efficiency can be further improved by transfer learning from related tasks. While recent transfer models meta-learn a prior based on large amount of data, in the low-data regime methods that exploit the closed-form posterior of Gaussian processes (GPs) have an advantage. In this setting, several analytically tractable transfer-model posteriors have been proposed, but the relative advantages of these methods are not well understood. In this paper, we provide a unified view on hierarchical GP models for transfer learning, which allows us to analyze the relationship between methods. As part of the analysis, we develop a novel closed-form boosted GP transfer model that fits between existing approaches in terms of complexity. We evaluate the performance of the different approaches in large-scale experiments and highlight strengths and weaknesses of the different transfer-learning methods.

Warping Layer: Representation Learning for Label Structures in Weakly Supervised Learning

Yingyi Ma · Xinhua Zhang

Many learning tasks only receive weak supervision, such as semi-supervised learning and few-shot learning. With limited labeled data, prior structures become especially important, and prominent examples include hierarchies and mutual exclusions in the class space. However, most existing approaches only learn the representations \emph{separately} in the feature space and the label space, and do not explicitly enforce the logical relationships. In this paper, we propose a novel warping layer that \emph{jointly} learns representations in \emph{both} spaces, and thanks to the modularity and differentiability, it can be directly embedded into generative models to leverage the prior hierarchical structure and unlabeled data. The effectiveness of the warping layer is demonstrated on both few-shot and semi-supervised learning, outperforming the state of the art in practice.