**Differentiable Bayesian inference of SDE parameters using a pathwise series expansion of Brownian motion**

Sanmitra Ghosh · Paul J. Birrell · Daniela De Angelis

By invoking a pathwise series expansion of Brownian motion, we propose to approximate a stochastic differential equation (SDE) with an ordinary differential equation (ODE). This allows us to reformulate Bayesian inference for an SDE as the parameter estimation task for an ODE. Unlike a nonlinear SDE, the likelihood for an ODE model is tractable and its gradient can be obtained using adjoint sensitivity analysis. This reformulation allows us to use efficient samplers, such as NUTS, that rely on the gradient of the log posterior. Applying the reparameterisation trick, variational inference can also be used for the same estimation task. We illustrate the proposed method on a variety of SDE models and obtain parameter estimates similar to those produced by data augmentation techniques.
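A minimal sketch of the core idea, under illustrative assumptions (an Ornstein-Uhlenbeck SDE, a truncated Karhunen-Loève series for Brownian motion, and a plain Euler solver; none of these choices are taken from the paper): replacing $W(t)$ with a finite series turns the SDE $dX = -\theta X\,dt + \sigma\,dW$ into an ODE in which the noise path is a smooth, differentiable function of a finite set of Gaussian coefficients.

```python
import numpy as np

def brownian_kl_derivative(t, xi):
    """Time derivative of the truncated Karhunen-Loeve expansion of
    Brownian motion on [0, 1], given standard normal coefficients xi."""
    k = np.arange(1, len(xi) + 1)
    freqs = (k - 0.5) * np.pi
    return np.sqrt(2.0) * np.sum(xi * np.cos(freqs * t))

def solve_ou_ode(theta, sigma, x0, xi, n_steps=1000):
    """Euler scheme for the ODE dx/dt = -theta*x + sigma * dW_N/dt,
    the pathwise approximation of the OU SDE."""
    dt = 1.0 / n_steps
    x = x0
    for i in range(n_steps):
        t = i * dt
        x += dt * (-theta * x + sigma * brownian_kl_derivative(t, xi))
    return x

rng = np.random.default_rng(0)
xi = rng.standard_normal(50)   # finite-dimensional noise representation
x1 = solve_ou_ode(theta=1.0, sigma=0.5, x0=1.0, xi=xi)
```

Because the solution is now a deterministic, differentiable function of the parameters and of `xi`, gradients of the likelihood can flow through the solver (e.g. via adjoint sensitivities), which is what enables NUTS or reparameterised variational inference.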

**Federated Reinforcement Learning with Environment Heterogeneity**

Hao Jin · Yang Peng · Wenhao Yang · Shusen Wang · Zhihua Zhang

We study the Federated Reinforcement Learning (FedRL) problem, in which $n$ agents collaboratively learn a single policy without sharing the trajectories they collected during agent-environment interaction. In this paper, we stress the constraint of environment heterogeneity, meaning that the $n$ environments corresponding to these $n$ agents have different state transitions. To obtain a value function or a policy function that optimizes the overall performance across all environments, we propose two federated RL algorithms, \texttt{QAvg} and \texttt{PAvg}. We theoretically prove that these algorithms converge to suboptimal solutions, where the suboptimality depends on how heterogeneous the $n$ environments are. Moreover, we propose a heuristic that achieves personalization by embedding the $n$ environments into $n$ vectors. The personalization heuristic not only improves training but also allows for better generalization to new environments.
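A hedged sketch of the \texttt{QAvg} idea as the abstract describes it (local Q-learning in each agent's own environment, followed by averaging the Q-tables); the random MDPs and hyperparameters below are illustrative placeholders, not the paper's setup.

```python
import numpy as np

def q_learning_step(Q, P, R, gamma, lr, rng, n_updates=200):
    """A few tabular Q-learning updates in one agent's environment (P, R)."""
    nS, nA = Q.shape
    for _ in range(n_updates):
        s = rng.integers(nS)
        a = rng.integers(nA)
        s2 = rng.choice(nS, p=P[s, a])          # heterogeneous transitions
        target = R[s, a] + gamma * Q[s2].max()
        Q[s, a] += lr * (target - Q[s, a])
    return Q

def qavg_round(Qs, Ps, Rs, gamma, lr, rng):
    """One FedRL round: local learning per environment, then average tables."""
    Qs = [q_learning_step(Q.copy(), P, R, gamma, lr, rng)
          for Q, P, R in zip(Qs, Ps, Rs)]
    Q_avg = np.mean(Qs, axis=0)
    return [Q_avg.copy() for _ in Qs]           # broadcast averaged table

rng = np.random.default_rng(1)
nS, nA, n_agents = 4, 2, 3
Ps = [rng.dirichlet(np.ones(nS), size=(nS, nA)) for _ in range(n_agents)]
Rs = [rng.random((nS, nA)) for _ in range(n_agents)]
Qs = [np.zeros((nS, nA)) for _ in range(n_agents)]
for _ in range(5):
    Qs = qavg_round(Qs, Ps, Rs, gamma=0.9, lr=0.5, rng=rng)
```

Only Q-tables cross the network; trajectories stay local, which is the privacy constraint the abstract emphasizes.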

The vanishing ideal of a set of points X is the set of polynomials that evaluate to 0 over all points x in X, and it admits an efficient representation by a finite set of polynomials called generators. To accommodate noise in the data set, we introduce the Conditional Gradients Approximately Vanishing Ideal algorithm (CGAVI) for the construction of the set of generators of the approximately vanishing ideal. The constructed set of generators captures polynomial structures in data and gives rise to a feature map that can, for example, be used in combination with a linear classifier for supervised learning. In CGAVI, we construct the set of generators by solving specific instances of (constrained) convex optimization problems with the Pairwise Frank-Wolfe algorithm (PFW). Among other things, the constructed generators inherit the LASSO generalization bound and vanish not only on the training data but also on out-of-sample data. Moreover, CGAVI admits a compact representation of the approximately vanishing ideal by constructing few generators with sparse coefficient vectors.

We consider the $k$-clustering problem with $\ell_p$-norm cost, which includes $k$-median, $k$-means and $k$-center, under an individual notion of fairness proposed by Jung et al. [2020]: given a set of points $P$ of size $n$, a set of $k$ centers induces a fair clustering if every point in $P$ has a center among its $n/k$ closest neighbors. Mahabadi and Vakilian [2020] presented a $(p^{O(p)},7)$-bicriteria approximation for fair clustering with $\ell_p$-norm cost: every point finds a center within distance at most $7$ times its distance to its $(n/k)$-th closest neighbor and the $\ell_p$-norm cost of the solution is at most $p^{O(p)}$ times the cost of an optimal fair solution. In this work, for any $\epsilon>0$, we present an improved $(16^p +\epsilon,3)$-bicriteria approximation for this problem. Moreover, for $p=1$ ($k$-median) and $p=\infty$ ($k$-center), we present improved cost-approximation factors of $7.081+\epsilon$ and $3+\epsilon$ respectively. To achieve our guarantees, we extend the framework of [Charikar et al., 2002, Swamy, 2016] and devise a $16^p$-approximation algorithm for facility location with $\ell_p$-norm cost under a matroid constraint, which might be of independent interest. Besides, our approach suggests a reduction from our individually fair clustering problem to a clustering problem with the group fairness requirement proposed by [Kleindessner et al., 2019], which is essentially the matroid median problem.

Continual Learning addresses the challenge of learning a number of different tasks sequentially. The goal of maintaining knowledge of earlier tasks without re-accessing them starkly conflicts with standard SGD training for artificial neural networks. An influential class of methods for tackling this problem without storing old data are the so-called regularisation approaches. They measure the importance of each parameter for solving a given task and subsequently protect important parameters from large changes. In the literature, three ways to measure parameter importance have been put forward and they have inspired a large body of follow-up work. Here, we present strong theoretical and empirical evidence that these three methods, Elastic Weight Consolidation (EWC), Synaptic Intelligence (SI) and Memory Aware Synapses (MAS), are surprisingly similar and are all linked to the same theoretical quantity. Concretely, we show that, despite stemming from very different motivations, both SI and MAS approximate the square root of the Fisher Information, with the Fisher being the theoretically justified basis of EWC. Moreover, we show that for SI the relation to the Fisher -- and in fact its performance -- is due to a previously unknown bias. On top of uncovering unknown similarities and unifying regularisation approaches, we also demonstrate that our insights enable practical performance improvements for large batch training.
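The theoretical quantity linking the three methods is the (diagonal) Fisher Information. As an illustrative sketch only (a toy logistic-regression "network", not any of the papers' models), the diagonal Fisher used by EWC can be estimated as the mean squared per-sample log-likelihood gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def diagonal_fisher(w, X, y):
    """F_j ~= E[(d log p(y|x,w) / dw_j)^2], estimated on samples (X, y)."""
    p = sigmoid(X @ w)
    grads = (y - p)[:, None] * X     # per-sample log-likelihood gradients
    return np.mean(grads ** 2, axis=0)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = (rng.random(200) < sigmoid(X @ w_true)).astype(float)
F = diagonal_fisher(w_true, X, y)

# EWC then penalizes deviation from the old task's weights w_A:
#   loss_B + (lam / 2) * sum_j F[j] * (w[j] - w_A[j])**2
```

SI and MAS compute their importance weights by very different bookkeeping during training; the paper's claim is that both end up approximating the square root of this quantity.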

**Counterfactual Explanation Trees: Transparent and Consistent Actionable Recourse with Decision Trees**

Kentaro Kanamori · Takuya Takagi · Ken Kobayashi · Yuichi Ike

Counterfactual Explanation (CE) is a post-hoc explanation method that provides a perturbation for altering the prediction result of a classifier. An individual can interpret the perturbation as an "action" to obtain the desired decision results. Existing CE methods focus on providing an action, which is optimized for a given single instance. However, these CE methods do not address the case where we have to assign actions to multiple instances simultaneously. In such a case, we need a framework of CE that assigns actions to multiple instances in a transparent and consistent way. In this study, we propose Counterfactual Explanation Tree (CET) that assigns effective actions with decision trees. Due to the properties of decision trees, our CET has two advantages: (1) Transparency: the reasons for assigning actions are summarized in an interpretable structure, and (2) Consistency: these reasons do not conflict with each other. We learn a CET in two steps: (i) compute one effective action for multiple instances and (ii) partition the instances to balance the effectiveness and interpretability. Numerical experiments and user studies demonstrated the efficacy of our CET in comparison with existing methods.

**Derivative-Based Neural Modelling of Cumulative Distribution Functions for Survival Analysis**

Dominic Danks · Christopher Yau

Survival models --- particularly those able to account for patient comorbidities via competing risks analysis --- offer valuable prognostic information to clinicians making critical decisions and represent a growing area of application for machine learning approaches. However, current methods typically involve restrictive parameterisations, discretisation of time or the modelling of only one event cause. In this paper, we highlight how general cumulative distribution functions can be naturally expressed via neural network-based ordinary differential equations and how this observation can be utilised in survival analysis. In particular, we present DeSurv, a neural derivative-based approach capable of avoiding the aforementioned restrictions and of flexibly modelling competing-risk survival data in continuous time. We apply DeSurv to both single-risk and competing-risk synthetic and real-world datasets and obtain results which compare favourably with current state-of-the-art models.

**Optimal transport with $f$-divergence regularization and generalized Sinkhorn algorithm**

Dávid Terjék · Diego González-Sánchez

Entropic regularization provides a generalization of the original optimal transport problem. It introduces a penalty term defined by the Kullback-Leibler divergence, making the problem more tractable via the celebrated Sinkhorn algorithm. Replacing the Kullback-Leibler divergence with a general $f$-divergence leads to a natural generalization. The case of divergences defined by superlinear functions was recently studied by Di Marino and Gerolin. Using convex analysis, we extend the theory developed so far to include all $f$-divergences defined by functions of Legendre type, and prove that under some mild conditions, strong duality holds, optimums in both the primal and dual problems are attained, the generalization of the $c$-transform is well-defined, and we give sufficient conditions for the generalized Sinkhorn algorithm to converge to an optimal solution. We propose a practical algorithm for computing an approximate solution of the optimal transport problem with $f$-divergence regularization via the generalized Sinkhorn algorithm. Finally, we present experimental results on synthetic 2-dimensional data, demonstrating the effects of using different $f$-divergences for regularization, which influences convergence speed, numerical stability and sparsity of the optimal coupling.
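For reference, the classical KL-regularized Sinkhorn iteration that the paper generalizes can be written in a few lines; this is the textbook algorithm, not the authors' $f$-divergence variant.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iters=500):
    """Entropic-regularized OT: returns an approximate optimal coupling
    between histograms a and b under cost matrix C."""
    K = np.exp(-C / eps)                  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                 # alternate diagonal scaling
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]    # coupling P = diag(u) K diag(v)

x = np.linspace(0, 1, 5)
C = (x[:, None] - x[None, :]) ** 2        # squared-distance cost
a = np.full(5, 0.2)                       # uniform source histogram
b = np.full(5, 0.2)                       # uniform target histogram
P = sinkhorn(a, b, C)
```

In the generalized setting, the KL penalty is replaced by an $f$-divergence of Legendre type, and the scaling updates are replaced by the corresponding generalized $c$-transform steps; as the experiments note, that choice affects convergence speed, stability, and the sparsity of the coupling.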

**On Some Fast And Robust Classifiers For High Dimension, Low Sample Size Data**

Sarbojit Roy · Jyotishka Ray Choudhury · Subhajit Dutta

In high dimension, low sample size (HDLSS) settings, the distance concentration phenomenon affects the performance of several popular classifiers that are based on Euclidean distances. The behaviour of these classifiers in high dimensions is completely governed by the first- and second-order moments of the underlying class distributions. Moreover, the classifiers become useless for such HDLSS data when the first two moments of the competing distributions are equal, or when the moments do not exist. In this work, we propose robust, computationally efficient and tuning-free classifiers applicable in the HDLSS scenario. As the data dimension increases, these classifiers yield perfect classification if the one-dimensional marginals of the underlying distributions are different. We establish strong theoretical properties for the proposed classifiers in ultrahigh-dimensional settings. Numerical experiments with a wide variety of simulated examples and analysis of real data sets exhibit clear and convincing advantages over existing methods.

**Faster Rates, Adaptive Algorithms, and Finite-Time Bounds for Linear Composition Optimization and Gradient TD Learning**

Anant Raj · Pooria Joulani · Andras Gyorgy · Csaba Szepesvari

Gradient temporal difference (GTD) algorithms are provably convergent policy evaluation methods for off-policy reinforcement learning. Despite much progress, proper tuning of the stochastic approximation methods used to solve the resulting saddle point optimization problem requires the knowledge of several (unknown) problem-dependent parameters. In this paper we apply adaptive step-size tuning strategies to greatly reduce this dependence on prior knowledge, and provide algorithms with adaptive convergence guarantees. In addition, we use the underlying refined analysis technique to obtain new O(1/T) rates that do not depend on the strong-convexity parameter of the problem, and also apply to the Markov noise setting, as well as the unbounded i.i.d. noise setting.

**Exploiting Correlation to Achieve Faster Learning Rates in Low-Rank Preference Bandits**

Aadirupa Saha · Suprovat Ghoshal

We introduce the Correlated Preference Bandits problem with random utility-based choice models (RUMs), where the goal is to identify the best item from a given pool of $n$ items through online subsetwise preference feedback. We investigate whether models with a simple correlation structure, e.g. low rank, can result in faster learning rates. While we show that the problem can be impossible to solve for general `low rank' choice models, faster learning rates can be attained assuming more structured item correlations. In particular, we introduce a new class of Block-Rank based RUM models, for which the best item is shown to be $(\epsilon,\delta)$-PAC learnable with only $O(r \epsilon^{-2} \log(n/\delta))$ samples. This improves on the standard sample complexity bound of $\tilde{O}(n\epsilon^{-2} \log(1/\delta))$ known for the usual learning algorithms, which might not exploit the item correlations ($r \ll n$). We complement the above sample complexity with a matching lower bound (up to logarithmic factors), justifying the tightness of our analysis. Further, we extend the results to a more general noisy Block-Rank model, which ensures the robustness of our techniques. Overall, our results justify the advantage of playing subsetwise queries over pairwise preferences $(k=2)$: we show that the latter provably fail to exploit correlation.

**Provably Efficient Policy Optimization for Two-Player Zero-Sum Markov Games**

Yulai Zhao · Yuandong Tian · Jason Lee · Simon Du

Policy-based methods with function approximation are widely used for solving two-player zero-sum games with large state and/or action spaces. However, it remains elusive how to obtain optimization and statistical guarantees for such algorithms. We present a new policy optimization algorithm with function approximation and prove that under standard regularity conditions on the Markov game and the function approximation class, our algorithm finds a near-optimal policy within a polynomial number of samples and iterations. To our knowledge, this is the first provably efficient policy optimization algorithm with function approximation that solves two-player zero-sum Markov games.

**Differentially Private Federated Learning on Heterogeneous Data**

Maxence Noble-Bourillot · Aurélien Bellet · Aymeric Dieuleveut

Federated Learning (FL) is a paradigm for large-scale distributed learning which faces two key challenges: (i) training efficiently from highly heterogeneous user data, and (ii) protecting the privacy of participating users. In this work, we propose a novel FL approach (DP-SCAFFOLD) to tackle these two challenges together by incorporating Differential Privacy (DP) constraints into the popular SCAFFOLD algorithm. We focus on the challenging setting where users communicate with an ``honest-but-curious'' server without any trusted intermediary, which requires ensuring privacy not only towards a third party observing the final model but also towards the server itself. Using advanced results from DP theory, we establish the convergence of our algorithm for convex and non-convex objectives. Our paper clearly highlights the trade-off between utility and privacy and demonstrates the superiority of DP-SCAFFOLD over the state-of-the-art algorithm DP-FedAvg when the number of local updates and the level of heterogeneity grow. Our numerical results confirm our analysis and show that DP-SCAFFOLD provides significant gains in practice.

Units in online A/B tests are often embedded in social networks, so their outcomes may depend on the treatment of their neighbors. Many such networks exhibit cluster structures that can be exploited in the design to reduce the bias from network interference. When the average treatment effect (ATE) is considered from the individual perspective, conditions for valid estimation restrict the use of these features in the design. We show that such restrictions can be alleviated if the ATE is considered from the cluster perspective. Using an illustrative example, we further show that the weights employed by the Horvitz-Thompson estimator may not appropriately accommodate the network structure, and that purely relying on graph-cluster randomization may generate very unbalanced cluster-treated structures across the treatment arms. The measures of such structures for one cluster may depend on the treatment of other clusters, posing a great challenge for the design of A/B tests. To address these issues, we propose a rerandomized-adaptive randomization to balance the clusters and a cluster-adjusted estimator to alleviate the problem of the weights. Numerical studies are conducted to demonstrate the usage of the proposed procedure.
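As a toy baseline (not the paper's rerandomized design), the Horvitz-Thompson ATE estimator under graph-cluster randomization looks as follows; assuming no interference, its unbiasedness can be verified by enumerating every cluster assignment on a tiny example.

```python
import itertools
import numpy as np

def horvitz_thompson(y, z_unit, p=0.5):
    """HT estimate of the ATE from observed outcomes y under unit-level
    treatment indicators z_unit, with cluster treatment probability p."""
    return np.mean(y * z_unit / p - y * (1 - z_unit) / (1 - p))

clusters = np.array([0, 0, 1, 1])        # unit -> cluster membership
y1 = np.array([3.0, 1.0, 2.0, 4.0])      # potential outcomes if treated
y0 = np.array([1.0, 0.0, 1.0, 2.0])      # potential outcomes if control
true_ate = np.mean(y1 - y0)

estimates = []
for z_cluster in itertools.product([0, 1], repeat=2):  # all 4 assignments
    z_unit = np.array(z_cluster)[clusters]             # clusters move jointly
    y_obs = np.where(z_unit == 1, y1, y0)
    estimates.append(horvitz_thompson(y_obs, z_unit))
```

The paper's point is that once interference and unbalanced cluster-treated structures enter, these fixed $1/p$ weights no longer accommodate the network, motivating the cluster-adjusted estimator.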

**Fixed Support Tree-Sliced Wasserstein Barycenter**

Yuki Takezawa · Ryoma Sato · Zornitsa Kozareva · Sujith Ravi · Makoto Yamada

The Wasserstein barycenter has been widely studied in various fields, including natural language processing and computer vision. However, solving the Wasserstein barycenter problem is computationally expensive because computing the Wasserstein distance requires quadratic time in the number of supports. By contrast, the Wasserstein distance on a tree, called the tree-Wasserstein distance, can be computed in linear time and allows for the fast comparison of a large number of distributions. In this study, we propose a barycenter under the tree-Wasserstein distance, called the fixed support tree-Wasserstein barycenter (FS-TWB), and its extension, called the fixed support tree-sliced Wasserstein barycenter (FS-TSWB). More specifically, we first show that the FS-TWB and FS-TSWB problems are convex optimization problems that can be solved by projected subgradient descent. Moreover, we propose a more efficient algorithm to compute the subgradient and objective function value by using the properties of tree-Wasserstein barycenter problems. Through real-world experiments, we show that, using the proposed algorithm, the FS-TWB and FS-TSWB can be solved two orders of magnitude faster than the original Wasserstein barycenter.
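The linear-time property comes from the closed form of the tree-Wasserstein distance: sum over edges of edge length times the absolute difference of subtree masses. A minimal sketch on a small tree (the FS-TWB/FS-TSWB solvers themselves are not reproduced here):

```python
import numpy as np

def tree_wasserstein(parent, weight, mu, nu):
    """Tree-Wasserstein distance between distributions mu and nu on the
    nodes of a tree. parent[i] is i's parent (root has parent -1) and
    weight[i] is the length of the edge (i, parent[i]); nodes must be
    ordered so that children come after their parents."""
    mass = (mu - nu).astype(float)
    dist = 0.0
    for i in range(len(parent) - 1, 0, -1):   # sweep leaves -> root
        dist += weight[i] * abs(mass[i])      # |mass of subtree below edge i|
        mass[parent[i]] += mass[i]            # push subtree mass upward
    return dist

# Chain 0 - 1 - 2 with unit edge lengths: moving a point mass from node 0
# to node 2 costs 2.
parent = np.array([-1, 0, 1])
weight = np.array([0.0, 1.0, 1.0])
d = tree_wasserstein(parent, weight,
                     np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]))
```

A single pass over the edges suffices, versus the quadratic (or worse) cost of a general OT solve, which is what makes barycenters under this metric tractable at scale.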

We propose a fast non-gradient-based method for rank-1 non-negative matrix factorization (NMF) with missing data, called A1GM, that minimizes the KL divergence from an input matrix to the reconstructed rank-1 matrix. Our method is based on our new finding of an analytical closed-form solution for the best rank-1 non-negative multiple matrix factorization (NMMF), a variant of NMF. NMMF is known to solve NMF with missing data exactly if the positions of the missing values satisfy a certain condition, and A1GM transforms a given matrix so that the analytical solution to NMMF can be applied. We empirically show that A1GM is more efficient than a gradient method with competitive reconstruction errors.
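For fully observed matrices, the best rank-1 non-negative factorization under KL divergence has a well-known closed form built from row and column sums (the "independence model"). The sketch below shows that analytical ingredient only, not A1GM's missing-data transformation:

```python
import numpy as np

def best_rank1_kl(X):
    """argmin over rank-1 nonnegative W H^T of KL(X || w h^T),
    for a fully observed nonnegative matrix X: the closed-form
    solution is the outer product of the row sums and the
    normalized column sums."""
    total = X.sum()
    w = X.sum(axis=1)               # row sums
    h = X.sum(axis=0) / total       # column sums, normalized
    return np.outer(w, h)

X = np.outer([1.0, 2.0], [3.0, 1.0])   # already rank-1: recovery is exact
X_hat = best_rank1_kl(X)
```

Because the minimizer is analytical, no iterative gradient descent is needed, which is the source of A1GM's speed advantage over gradient methods.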

**Acceleration in Distributed Optimization under Similarity**

Ye Tian · Gesualdo Scutari · Tianyu Cao · Alexander Gasnikov

We study distributed (strongly convex) optimization problems over a network of agents, with no centralized nodes. The loss functions of the agents are assumed to be similar, due to statistical data similarity or otherwise. In order to reduce the number of communications needed to reach a given solution accuracy, we propose a preconditioned, accelerated distributed method. An $\varepsilon$-solution is achieved in $\tilde{\mathcal{O}}\big(\sqrt{\frac{\beta/\mu}{1-\rho}}\log(1/\varepsilon)\big)$ communication steps, where $\beta/\mu$ is the relative condition number between the global and local loss functions, and $\rho$ characterizes the connectivity of the network. This rate matches (up to poly-log factors) lower communication complexity bounds for distributed gossip algorithms applied to the class of problems of interest. Numerical results show significant communication savings with respect to existing accelerated distributed schemes, especially when solving ill-conditioned problems.

**Sampling from Arbitrary Functions via PSD Models**

Ulysse Marteau-Ferey · Francis Bach · Alessandro Rudi

In many areas of applied statistics and machine learning, generating an arbitrary number of independent and identically distributed (i.i.d.) samples from a given distribution is a key task. When the distribution is known only through evaluations of the density, current methods either scale badly with the dimension or require very involved implementations. Instead, we take a two-step approach by first modeling the probability distribution and then sampling from that model. We use the recently introduced class of positive semi-definite (PSD) models, which have been shown to be efficient for approximating probability densities. We show that these models can approximate a large class of densities concisely using few evaluations, and present a simple algorithm to effectively sample from these models. We also present preliminary empirical results to illustrate our assertions.

In this paper, we consider the problem of black-box optimization using Gaussian Process (GP) bandit optimization with a small number of batches. Assuming the unknown function has a low norm in the Reproducing Kernel Hilbert Space (RKHS), we introduce a batch algorithm inspired by batched finite-arm bandit algorithms, and show that it achieves the cumulative regret upper bound $O^\ast(\sqrt{T\gamma_T})$ using $O(\log\log T)$ batches within time horizon $T$, where the $O^\ast(\cdot)$ notation hides dimension-independent logarithmic factors and $\gamma_T$ is the maximum information gain associated with the kernel. This bound is near-optimal for several kernels of interest and improves on the typical $O^\ast(\sqrt{T}\gamma_T)$ bound, and our approach is arguably the simplest among algorithms attaining this improvement. In addition, in the case of a constant number of batches (not depending on $T$), we propose a modified version of our algorithm, and characterize how the regret is impacted by the number of batches, focusing on the squared exponential and Mat\'ern kernels. The algorithmic upper bounds are shown to be nearly minimax optimal via analogous algorithm-independent lower bounds.

**Increasing the accuracy and resolution of precipitation forecasts using deep generative models**

Ilan Price · Stephan Rasp

Accurately forecasting extreme rainfall is notoriously difficult, but is also ever more crucial for society as climate change increases the frequency of such extremes. Global numerical weather prediction models often fail to capture extremes, and are produced at too low a resolution to be actionable, while regional, high-resolution models are hugely expensive both in computation and labour. In this paper we explore the use of deep generative models to simultaneously correct and downscale (super-resolve) global ensemble forecasts over the Continental US. Specifically, using fine-grained radar observations as our ground truth, we train a conditional Generative Adversarial Network---coined CorrectorGAN---via a custom training procedure and augmented loss function, to produce ensembles of high-resolution, bias-corrected forecasts based on coarse, global precipitation forecasts in addition to other relevant meteorological fields. Our model outperforms an interpolation baseline, as well as super-resolution-only and CNN-based univariate methods, and approaches the performance of an operational regional high-resolution model across an array of established probabilistic metrics. Crucially, CorrectorGAN, once trained, produces predictions in seconds on a single machine. These results raise exciting questions about the necessity of regional models, and whether data-driven downscaling and correction methods can be transferred to data-poor regions that so far have had no access to high-resolution forecasts.

**Node Feature Kernels Increase Graph Convolutional Network Robustness**

Mohamed El Amine Seddik · Changmin Wu · Johannes Lutzeyer · Michalis Vazirgiannis

The robustness of the much used Graph Convolutional Networks (GCNs) to perturbations of their input is becoming a topic of increasing importance. In this paper the random GCN is introduced for which a random matrix theory analysis is possible. This analysis suggests that if the graph is sufficiently perturbed, or in the extreme case random, then the GCN fails to benefit from the node features. It is furthermore observed that enhancing the message passing step in GCNs by adding the node feature kernel to the adjacency matrix of the graph structure solves this problem. An empirical study of a GCN utilised for node classification on six real datasets further confirms the theoretical findings and demonstrates that perturbations of the graph structure can result in GCNs performing significantly worse than Multi-Layer Perceptrons run on the node features alone. In practice, adding a node feature kernel to the message passing of perturbed graphs results in a significant improvement of the GCN's performance, thereby rendering it more robust to graph perturbations. Our code is publicly available at: https://github.com/ChangminWu/RobustGCN.
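The remedy the abstract describes, augmenting the message-passing matrix with a node-feature kernel, can be sketched as follows. This is an illustrative single layer in NumPy; the kernel normalization and the weighting `alpha` are assumptions for the sketch, not the paper's exact formulation.

```python
import numpy as np

def gcn_layer_with_kernel(A, X, W, alpha=1.0):
    """One GCN-style layer propagating with S = A + I + alpha * K/||K||_F,
    where K = X X^T is the node-feature kernel. When A is heavily
    perturbed, the kernel term lets messages still follow feature
    similarity."""
    K = X @ X.T
    S = A + np.eye(len(A)) + alpha * K / np.linalg.norm(K)
    return np.maximum(S @ X @ W, 0.0)      # ReLU activation

rng = np.random.default_rng(0)
n, d, d_out = 6, 4, 2
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T)                     # symmetrize a random graph
X = rng.standard_normal((n, d))
H = gcn_layer_with_kernel(A, X, rng.standard_normal((d, d_out)))
```

In the extreme case where `A` is pure noise, propagation through `K` still aggregates over feature-similar nodes, which matches the theoretical finding that the kernel term restores the benefit of node features.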

**Adversarially Robust Kernel Smoothing**

Jia-Jie Zhu · Christina Kouridi · Yassine Nemmour · Bernhard Schölkopf

We propose a scalable robust learning algorithm combining kernel smoothing and robust optimization. Our method is motivated by the convex analysis perspective of distributionally robust optimization based on probability metrics, such as the Wasserstein distance and the maximum mean discrepancy. We adapt the integral operator using supremal convolution in convex analysis to form a novel function majorant used for enforcing robustness. Our method is simple in form and applies to general loss functions and machine learning models. Exploiting a connection with optimal transport, we prove theoretical guarantees for certified robustness under distribution shift. Furthermore, we report experiments with general machine learning models, such as deep neural networks, to demonstrate competitive performance with the state-of-the-art certifiable robust learning algorithms based on the Wasserstein distance.

**Identifiable Energy-based Representations: An Application to Estimating Heterogeneous Causal Effects**

Yao Zhang · Jeroen Berrevoets · Mihaela van der Schaar

Conditional average treatment effects (CATEs) allow us to understand the effect heterogeneity across a large population of individuals. However, typical CATE learners assume all confounding variables are measured in order for the CATE to be identifiable. This requirement can be satisfied by collecting many variables, at the expense of increased sample complexity for estimating CATEs. To combat this, we propose an energy-based model (EBM) that learns a low-dimensional representation of the variables by employing a noise contrastive loss function. With our EBM we introduce a preprocessing step that alleviates the curse of dimensionality for any existing learner developed for estimating CATEs. We prove that our EBM keeps the representations partially identifiable up to some universal constant, as well as having universal approximation capability. These properties enable the representations to converge and keep the CATE estimates consistent. Experiments demonstrate the convergence of the representations, as well as show that estimating CATEs on our representations performs better than on the variables or the representations obtained through other dimensionality reduction methods.

Learning representations through deep generative modeling is a powerful approach to dynamical modeling: it discovers a simplified, compressed underlying description of the data, which can then be used for other tasks such as prediction. Most learning tasks have intrinsic symmetries, i.e., input transformations either leave the output unchanged or cause it to undergo a similar transformation. The learning process, however, is usually uninformed of these symmetries. Therefore, the learned representations for individually transformed inputs may not be meaningfully related. In this paper, we propose an SO(3) equivariant deep dynamical model (EqDDM) for motion prediction that learns a structured representation of the input space, in the sense that the embedding varies with symmetry transformations. EqDDM is equipped with equivariant networks to parameterize the state-space emission and transition models. We demonstrate the superior predictive performance of the proposed model on various motion data.

**Nonstochastic Bandits and Experts with Arm-Dependent Delays**

Dirk van der Hoeven · Nicolò Cesa-Bianchi

We study nonstochastic bandits and experts in a delayed setting where delays depend on both time and arms. While the setting in which delays only depend on time has been extensively studied, the arm-dependent delay setting better captures real-world applications at the cost of introducing new technical challenges. In the full information (experts) setting, we design an algorithm with a first-order regret bound that reveals an interesting trade-off between delays and losses. We prove a similar first-order regret bound also for the bandit setting, when the learner is allowed to observe how many losses are missing. Our bounds are the first in the delayed setting that only depend on the losses and delays of the best arm. In the bandit setting, when no information other than the losses is observed, we still manage to prove a regret bound for bandits through a modification to the algorithm of \citet{zimmert2020optimal}. Our analyses hinge on a novel bound on the drift, measuring how much better an algorithm can perform when given a look-ahead of one round.

**Adaptation of the Independent Metropolis-Hastings Sampler with Normalizing Flow Proposals**

James Brofos · Marylou Gabrie · Marcus Brubaker · Roy Lederman

Markov Chain Monte Carlo (MCMC) methods are a powerful tool for computation with complex probability distributions. However, the performance of such methods is critically dependent on properly tuned parameters, most of which are difficult if not impossible to know a priori for a given target distribution. Adaptive MCMC methods aim to address this by allowing the parameters to be updated during sampling based on previous samples from the chain, at the expense of requiring a new theoretical analysis to ensure convergence. In this work we extend the convergence theory of adaptive MCMC methods to a new class of methods built on a powerful class of parametric density estimators known as normalizing flows. In particular, we consider an independent Metropolis-Hastings sampler where the proposal distribution is represented by a normalizing flow whose parameters are updated using stochastic gradient descent. We explore the practical performance of this procedure on both synthetic settings and in the analysis of a physical field system, and compare it against both adaptive and non-adaptive MCMC methods.
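The skeleton of an independent Metropolis-Hastings sampler with a fixed parametric proposal is shown below; a Gaussian stands in for the normalizing flow, and the paper's adaptive step (updating the proposal's parameters by SGD during sampling) is omitted.

```python
import numpy as np

def imh(log_target, sample_prop, log_prop, x0, n_samples, rng):
    """Independent Metropolis-Hastings: proposals are drawn from a fixed
    distribution q, independent of the current state."""
    x, lt, lp = x0, log_target(x0), log_prop(x0)
    out = []
    for _ in range(n_samples):
        y = sample_prop(rng)
        lt_y, lp_y = log_target(y), log_prop(y)
        # Accept with probability min(1, pi(y) q(x) / (pi(x) q(y)))
        if np.log(rng.random()) < (lt_y - lt) + (lp - lp_y):
            x, lt, lp = y, lt_y, lp_y
        out.append(x)
    return np.array(out)

rng = np.random.default_rng(0)
log_target = lambda x: -0.5 * x ** 2               # standard normal target
log_prop = lambda x: -0.5 * (x / 2.0) ** 2         # wider Gaussian proposal
sample_prop = lambda rng: 2.0 * rng.standard_normal()
samples = imh(log_target, sample_prop, log_prop, 0.0, 5000, rng)
```

Replacing the Gaussian with a trained normalizing flow makes the proposal much closer to the target, driving the acceptance rate up; the adaptive analysis in the paper addresses what happens when the flow is updated while the chain runs.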

**On Facility Location Problem in the Local Differential Privacy Model**

Vincent Cohen-Addad · Yunus Esencayi · Chenglin Fan · Marco Gaboardi · Shi Li · Di Wang

We study the facility location problem under the constraints imposed by local differential privacy (LDP). Recently, Gupta et al. (2010) and Esencayi et al. (2019) proposed lower and upper bounds for the problem in the central differential privacy (DP) model, where a trusted curator first collects all data and processes it. In this paper, we focus on the LDP model, where we protect a client's participation in the facility location instance. Under the HST metric, we show that there is a non-interactive $\epsilon$-LDP algorithm achieving an $O(n^{1/4}/\epsilon^2)$-approximation ratio, where $n$ is the size of the metric. On the negative side, we show a lower bound of $\Omega(n^{1/4}/\sqrt{\epsilon})$ on the approximation ratio for any non-interactive $\epsilon$-LDP algorithm. Thus, our results are tight up to a polynomial factor of $\epsilon$. Moreover, unlike previous results, our results generalize to non-uniform facility costs.

Link prediction aims to reveal missing edges in a graph. We introduce a deep graph convolutional Gaussian process model for this task, which addresses recent challenges in graph machine learning with oversmoothing and overfitting. Using simplified graph convolutions, we transform a Gaussian process to leverage the topological information of the graph domain. To scale the Gaussian process model to larger graphs, we introduce a variational inducing point method that places pseudo-inputs on a graph-structured domain. Multiple Gaussian processes are assembled into a hierarchy whose structure allows skipping convolutions and thus counteracting oversmoothing. The proposed model represents the first Gaussian process for link prediction that makes use of both node features and topological information. We evaluate our model on multiple graph data sets with up to thousands of nodes and report consistent improvements over competitive link prediction approaches.

**On the complexity of the optimal transport problem with graph-structured cost**

Jiaojiao Fan · Isabel Haasler · Johan Karlsson · Yongxin Chen

Multi-marginal optimal transport (MOT) is a generalization of optimal transport to multiple marginals. Optimal transport has evolved into an important tool in many machine learning applications, and its multi-marginal extension opens the door to addressing new challenges in the field of machine learning. However, the use of MOT has been largely impeded by its computational complexity, which scales exponentially in the number of marginals. Fortunately, in many applications, such as barycenter or interpolation problems, the cost function adheres to certain structures, which have recently been exploited for developing efficient computational methods. In this work we derive computational bounds for these methods. In particular, with $m$ marginal distributions supported on $n$ points, we provide a $\mathcal{\tilde O}(d(\mathcal{T})m n^{w(G)+1}\epsilon^{-2})$ bound for $\epsilon$-accuracy when the problem is associated with a graph that can be factored as a junction tree with diameter $d(\mathcal{T})$ and tree-width $w(G)$. For the special case of the Wasserstein barycenter problem, which corresponds to a star-shaped tree, our bound aligns with the existing complexity bound for it.

**GraphAdaMix: Enhancing Node Representations with Graph Adaptive Mixtures**

Da Sun Handason Tam · Siyue Xie · Wing Cheong Lau

Graph Neural Networks (GNNs) are the current state-of-the-art models in learning node representations for many predictive tasks on graphs. Typically, GNNs reuse the same set of model parameters across all nodes in the graph to improve training efficiency and exploit the translation-invariant properties present in many datasets. However, this parameter sharing scheme prevents GNNs from distinguishing between two nodes with the same local structure, and the translation-invariance property may not hold in real-world graphs. In this paper, we present Graph Adaptive Mixtures (GraphAdaMix), a novel approach for learning node representations in a graph by introducing multiple independent GNN models and a trainable mixture distribution for each node. GraphAdaMix can adapt to tasks with different settings. Specifically, for semi-supervised tasks, we optimize GraphAdaMix using the Expectation-Maximization (EM) algorithm, while in unsupervised settings, GraphAdaMix is trained following the paradigm of contrastive learning. We evaluate GraphAdaMix on ten benchmark datasets with extensive experiments. GraphAdaMix is demonstrated to consistently boost state-of-the-art GNN variants in semi-supervised and unsupervised node classification tasks. The code of GraphAdaMix is available online.

**FLIX: A Simple and Communication-Efficient Alternative to Local Methods in Federated Learning**

Elnur Gasanov · Ahmed Khaled · Samuel Horváth · Peter Richtarik

Federated Learning (FL) is an increasingly popular machine learning paradigm in which multiple nodes try to collaboratively learn under privacy, communication and multiple heterogeneity constraints. A persistent problem in federated learning is that it is not clear what the optimization objective should be: the standard average risk minimization of supervised learning is inadequate in handling several major constraints specific to federated learning, such as communication adaptivity and personalization control. We identify several key desiderata in frameworks for federated learning and introduce a new framework, FLIX, that takes into account the unique challenges brought by federated learning. FLIX has a standard finite-sum form, which enables practitioners to tap into the immense wealth of existing (potentially non-local) methods for distributed optimization. Through a smart initialization that does not require any communication, FLIX does not require the use of local steps but is still provably capable of performing dissimilarity regularization on par with local methods. We give several algorithms for solving the FLIX formulation efficiently under communication constraints. Finally, we corroborate our theoretical results with extensive experimentation.

It has been recently shown in the literature (Nie et al, 2018; Shin et al, 2019a,b) that the sample averages from online learning experiments are biased when used to estimate the mean reward. To correct the bias, off-policy evaluation methods, including importance sampling and doubly robust estimators, typically calculate the conditional propensity score, which is ill-defined for non-randomized policies such as UCB. This paper provides a bootstrap-based procedure to debias the samples, which does not require knowledge of the reward distribution and can be applied to any adaptive policy. Numerical experiments demonstrate effective bias reduction for samples generated by popular multi-armed bandit algorithms such as Explore-Then-Commit (ETC), UCB, Thompson sampling (TS) and $\epsilon$-greedy (EG). We analyze and provide theoretical justifications for the procedure under the ETC algorithm, including the asymptotic convergence of the bias decay rate in the real and bootstrap worlds.
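The bootstrap debiasing idea can be sketched for Explore-Then-Commit as follows: treat the observed pulls as the "bootstrap world" truth, re-run the algorithm on resampled rewards, and subtract the measured bias of the adaptive sample mean. The two-armed Gaussian setup and the helper names `run_etc` and `bootstrap_debias` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def run_etc(rng, mus, m, T):
    # Explore-Then-Commit: pull each arm m times, then commit to the
    # empirically better arm for the remaining T - 2m rounds.
    pulls = {a: list(rng.normal(mus[a], 1.0, size=m)) for a in (0, 1)}
    best = max((0, 1), key=lambda a: np.mean(pulls[a]))
    pulls[best].extend(rng.normal(mus[best], 1.0, size=T - 2 * m))
    return {a: np.asarray(p) for a, p in pulls.items()}

def bootstrap_debias(pulls, m, T, B=200, seed=1):
    # Bootstrap world: treat observed pulls as ground truth, re-run ETC
    # on resampled rewards, and measure the adaptive sample mean's bias.
    rng = np.random.default_rng(seed)
    truth = {a: pulls[a].mean() for a in (0, 1)}
    bias = {a: [] for a in (0, 1)}
    for _ in range(B):
        re = {a: list(rng.choice(pulls[a], size=m)) for a in (0, 1)}
        best = max((0, 1), key=lambda a: np.mean(re[a]))
        re[best].extend(rng.choice(pulls[best], size=T - 2 * m))
        for a in (0, 1):
            bias[a].append(np.mean(re[a]) - truth[a])
    # Debiased estimate: adaptive sample mean minus the estimated bias.
    return {a: pulls[a].mean() - np.mean(bias[a]) for a in (0, 1)}

raw = run_etc(np.random.default_rng(0), (0.0, 0.0), m=20, T=200)
debiased = bootstrap_debias(raw, m=20, T=200)
```

Note that nothing here uses the true reward distribution: resampling from the observed pulls is what makes the procedure distribution-free.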

**Adaptive Gaussian Processes on Graphs via Spectral Graph Wavelets**

Felix Opolka · Yin-Cong Zhi · Pietro Lió · Xiaowen Dong

Graph-based models require aggregating information in the graph from neighbourhoods of different sizes. In particular, when the data exhibit varying levels of smoothness on the graph, a multi-scale approach is required to capture the relevant information. In this work, we propose a Gaussian process model using spectral graph wavelets, which can naturally aggregate neighbourhood information at different scales. Through maximum likelihood optimisation of the model hyperparameters, the wavelets automatically adapt to the different frequencies in the data, and as a result our model goes beyond capturing low-frequency information. We achieve scalability to larger graphs by using a spectrum-adaptive polynomial approximation of the filter function, which is designed to yield a low approximation error in dense areas of the graph spectrum. Synthetic and real-world experiments demonstrate the ability of our model to infer scales accurately and deliver competitive performance against state-of-the-art models in graph-based learning tasks.

**Data-splitting improves statistical performance in overparameterized regimes**

Nicole Mücke · Enrico Reiss · Jonas Rungenhagen · Markus Klein

While large training datasets generally offer improvement in model performance, the training process becomes computationally expensive and time-consuming. Distributed learning is a common strategy to reduce the overall training time by exploiting multiple computing devices. Recently, it has been observed in the single machine setting that overparameterization is essential for benign overfitting in ridgeless regression in Hilbert spaces. We show that in this regime, data splitting has a regularizing effect, hence improving statistical performance and computational complexity at the same time. We further provide a unified framework that allows us to analyze both the finite- and infinite-dimensional settings. We numerically demonstrate the effect of different model parameters.

**Marginalized Operators for Off-policy Reinforcement Learning**

Yunhao Tang · Mark Rowland · Remi Munos · Michal Valko

In this work, we propose marginalized operators, a new class of off-policy evaluation operators for reinforcement learning. Marginalized operators strictly generalize generic multi-step operators, such as Retrace, as special cases. Marginalized operators also suggest a form of sample-based estimates with potential variance reduction, compared to sample-based estimates of the original multi-step operators. We show that the estimates for marginalized operators can be computed in a scalable way, which also generalizes prior results on marginalized importance sampling as special cases. Finally, we empirically demonstrate that marginalized operators provide performance gains to off-policy evaluation problems and downstream policy optimization algorithms.
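For concreteness, here is a tabular sketch of the Retrace operator that marginalized operators generalize: the off-policy correction applies truncated importance ratios $c_s = \lambda \min(1, \pi(a_s|x_s)/\mu(a_s|x_s))$ to the TD errors along a trajectory. This is the standard Retrace correction, not the marginalized estimator itself, and the function and variable names are illustrative.

```python
import numpy as np

def retrace_target(traj, Q, pi, mu, gamma=0.99, lam=1.0):
    """Retrace target for Q(x_0, a_0) from one off-policy trajectory.

    traj: list of (x, a, r, x_next) with integer state/action indices,
    generated under behaviour policy mu; pi and mu are [state, action]
    probability matrices.
    """
    x0, a0 = traj[0][0], traj[0][1]
    total, coef = 0.0, 1.0
    for t, (x, a, r, x_next) in enumerate(traj):
        if t > 0:
            # Truncated importance-sampling ratio c_s = lam * min(1, pi/mu).
            coef *= lam * min(1.0, pi[x, a] / mu[x, a])
        # TD error under the target policy's expected next-state value.
        delta = r + gamma * np.dot(pi[x_next], Q[x_next]) - Q[x, a]
        total += (gamma ** t) * coef * delta
    return Q[x0, a0] + total

# Tiny example: 2 states, 2 actions, zero-initialized Q, uniform policies.
Q = np.zeros((2, 2))
uniform = np.full((2, 2), 0.5)
target = retrace_target([(0, 0, 1.0, 1)], Q, uniform, uniform, gamma=0.9)
```

A marginalized estimator would replace the per-step product of trace coefficients with marginalized (state-visitation) ratios, which is where the variance reduction discussed above comes from.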

**Coresets for Data Discretization and Sine Wave Fitting**

Alaa Maalouf · Murad Tukan · Eric Price · Daniel Kane · Dan Feldman

In the \emph{monitoring} problem, the input is an unbounded stream $P=\{p_1,p_2,\cdots\}$ of integers in $[N]:=\{1,\cdots,N\}$ that are obtained from a sensor (such as GPS readings or the heart beats of a human). The goal (e.g., for anomaly detection) is to approximate the $n$ points received so far in $P$ by a sinusoid of a single frequency, i.e., to solve $\min_{c\in C}cost(P,c)+\lambda(c)$, where $cost(P,c)=\sum_{i=1}^n \sin^2(\frac{2\pi}{N} p_ic)$, $C\subseteq [N]$ is a feasible set of solutions, and $\lambda$ is a given regularization function. For any approximation error $\varepsilon>0$, we prove that \emph{every} set $P$ of $n$ integers has a weighted subset $S\subseteq P$ (sometimes called a core-set) of cardinality $|S|\in O(\log(N)^{O(1)})$ that approximates $cost(P,c)$ (for every $c\in [N]$) up to a multiplicative factor of $1\pm\varepsilon$. Using known coreset techniques, this implies streaming algorithms using only $O((\log(N)\log(n))^{O(1)})$ memory. Our results hold for a large family of functions. Experimental results and open-source code are provided.
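The cost function above is straightforward to evaluate directly; a small sketch, with an optional weight vector as would be supplied by a coreset, might look like this (the function name and interface are illustrative):

```python
import numpy as np

def cost(P, c, N, weights=None):
    # cost(P, c) = sum_i w_i * sin^2(2*pi*p_i*c / N); uniform weights by default.
    P = np.asarray(P, dtype=float)
    w = np.ones_like(P) if weights is None else np.asarray(weights, dtype=float)
    return float(np.sum(w * np.sin(2 * np.pi * P * c / N) ** 2))
```

The coreset guarantee says that for a suitable weighted subset $S$, `cost(S, c, N, weights)` stays within a $1\pm\varepsilon$ factor of `cost(P, c, N)` simultaneously for every candidate frequency $c$.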

**Convergent Working Set Algorithm for Lasso with Non-Convex Sparse Regularizers**

Alain Rakotomamonjy · Rémi Flamary · Joseph Salmon · Gilles Gasso

Non-convex sparse regularizers are common tools for learning with high-dimensional data. To accelerate convergence for Lasso problems involving these regularizers, a working set strategy addresses the optimization problem through an iterative algorithm, gradually incrementing the number of variables to optimize until the solution support is identified. We propose in this paper the first Lasso working set algorithm for non-convex sparse regularizers with convergence guarantees. The algorithm, named FireWorks, is based on a non-convex reformulation of a recent duality-based approach and leverages the geometry of the residuals. We provide theoretical guarantees showing that convergence is preserved even when the inner solver is inexact, under sufficient decay of the error across iterations. Experimental results demonstrate strong computational gains when using our working set strategy compared to full problem solvers with either block-coordinate descent or a proximal gradient solver.

**Learning Personalized Item-to-Item Recommendation Metric via Implicit Feedback**

Nghia Hoang · Anoop Deoras · Tong Zhao · Jin Li · George Karypis

This paper studies the item-to-item recommendation problem in recommender systems from a new perspective of metric learning via implicit feedback. We develop and investigate a personalizable deep metric model that captures both the internal contents of items and how they were interacted with by users. There are two key challenges in learning such a model. First, there is no explicit similarity annotation, which deviates from the assumption of most metric learning methods. Second, these approaches do not account for the fact that items are often represented by multiple sources of meta data and different users use different combinations of these sources to form their own notion of similarity. To address these challenges, we develop a new metric representation embedded as kernel parameters of a probabilistic model. This helps express the correlation between items that a user has interacted with, which can be used to predict user interaction with new items. Our approach hinges on the intuition that similar items induce similar interactions from the same user, thus fitting a metric-parameterized model to predict an implicit feedback signal could indirectly guide it towards finding the most suitable metric for each user. To this end, we also analyze how and when the proposed method is effective from a theoretical lens. Its empirical effectiveness is also demonstrated on several real-world datasets.

**A Globally Convergent Evolutionary Strategy for Stochastic Constrained Optimization with Applications to Reinforcement Learning**

Youssef Diouane · Aurelien Lucchi · Vihang Patil

Evolutionary strategies have recently been shown to achieve competitive levels of performance for complex optimization problems in reinforcement learning. In such problems, one often needs to optimize an objective function subject to a set of constraints, such as constraints on the entropy of a policy or restrictions on the set of actions or states accessible to an agent. Convergence guarantees for evolutionary strategies to optimize \emph{stochastic} constrained problems are however lacking in the literature. In this work, we address this problem by designing a novel optimization algorithm with a sufficient decrease mechanism that ensures convergence and that is based only on estimates of the functions. We demonstrate the applicability of this algorithm on two types of experiments: i) a control task for maximizing rewards and ii) maximizing rewards subject to a non-relaxable set of constraints.

**Efficient Online Bayesian Inference for Neural Bandits**

Gerardo Duran-Martin · Aleyna Kara · Kevin Murphy

In this paper we present a new algorithm for online (sequential) inference in Bayesian neural networks, and show its suitability for tackling contextual bandit problems. The key idea is to combine the extended Kalman filter (which locally linearizes the likelihood function at each time step) with a (learned or random) low-dimensional affine subspace for the parameters; the use of a subspace enables us to scale our algorithm to models with $\sim 1M$ parameters. While most other neural bandit methods need to store the entire past dataset in order to avoid the problem of ``catastrophic forgetting'', our approach uses constant memory. This is possible because we represent uncertainty about all the parameters in the model, not just the final linear layer. We show good results on the ``Deep Bayesian Bandit Showdown'' benchmark, as well as MNIST and a recommender system.

**Neural score matching for high-dimensional causal inference**

Oscar Clivio · Fabian Falck · Brieuc Lehmann · George Deligiannidis · Chris Holmes

Traditional methods for matching in causal inference are impractical for high-dimensional datasets. They suffer from the curse of dimensionality: exact matching and coarsened exact matching find exponentially fewer matches as the input dimension grows, and propensity score matching may match highly unrelated units together. To overcome this problem, we develop theoretical results which motivate the use of neural networks to obtain non-trivial, multivariate balancing scores of a chosen level of coarseness, in contrast to the classical, scalar propensity score. We leverage these balancing scores to perform matching for high-dimensional causal inference and call this procedure neural score matching. We show that our method is competitive against other matching approaches on semi-synthetic high-dimensional datasets, both in terms of treatment effect estimation and reducing imbalance.

Several works have recently focused on nonparametric active learning, especially in the binary classification setting under H\"older smoothness assumptions on the regression function. These works have highlighted the benefit of active learning by providing better rates of convergence compared to the passive counterpart. In this paper, we extend these results to multiclass classification under a more general smoothness assumption, which takes into account a broader class of underlying distributions. We present a new algorithm called \texttt{MKAL} for multiclass K-nearest neighbors active learning, and prove its theoretical benefits. Additionally, we empirically study \texttt{MKAL} on several datasets and discuss its merits and potential improvements.

Solving Partially Observable Markov Decision Processes (POMDPs) is hard. Learning optimal controllers for POMDPs when the model is unknown is harder. Online learning of optimal controllers for unknown POMDPs, which requires efficient learning using regret-minimizing algorithms that effectively trade off exploration and exploitation, is even harder, and no solution currently exists. In this paper, we consider infinite-horizon average-cost POMDPs with an unknown transition model but a known observation model. We propose a natural posterior sampling-based reinforcement learning algorithm (PSRL-POMDP) and show that it achieves a regret bound of $O(\log T)$, where $T$ is the time horizon, when the parameter set is finite. In the general case (continuous parameter set), we show that the algorithm achieves $O(T^{2/3})$ regret under two technical assumptions. To the best of our knowledge, this is the first online RL algorithm for POMDPs and has sub-linear regret.

**On Global-view Based Defense via Adversarial Attack and Defense Risk Guaranteed Bounds**

Trung Le · Anh Bui · LE Minh Tri Tue · He Zhao · Paul Montague · Quan Tran · Dinh Phung

It is well-known that deep neural networks (DNNs) are susceptible to adversarial attacks, exposing a severe fragility of deep learning systems. Despite achieving impressive performance, most current state-of-the-art classifiers remain highly vulnerable to carefully crafted, imperceptible adversarial perturbations. Research efforts to understand neural network attacks and defenses have become increasingly urgent and important. While rapid progress has been made on this front, there is still an important theoretical gap in achieving guaranteed bounds on attack/defense models, leaving uncertainty in the quality and certified guarantees of these models. To this end, we systematically address this problem in this paper. More specifically, we formulate attack and defense in a generic setting where there exists a family of adversaries (i.e., attackers) for attacking a family of classifiers (i.e., defenders). We develop a novel class of f-divergences suitable for measuring divergence among multiple distributions. This equips us to study the interactions between attackers and defenders in a countervailing game where we formulate a joint risk on attack and defense schemes. This is followed by our key results on guaranteed upper and lower bounds on this risk that can provide a better understanding of the behaviors of those parties from the attack and defense perspectives, thereby having important implications for both attack and defense sides. Finally, building on our theory, we propose an empirical approach based on a global view to defend against adversarial attacks. The experimental results conducted on benchmark datasets show that the global view for attack/defense, if exploited appropriately, can help improve adversarial robustness.

**SparseFed: Mitigating Model Poisoning Attacks in Federated Learning with Sparsification**

Ashwinee Panda · Saeed Mahloujifar · Arjun Nitin Bhagoji · Supriyo Chakraborty · Prateek Mittal

Federated learning is inherently vulnerable to model poisoning attacks because its decentralized nature allows attackers to participate with compromised devices. In model poisoning attacks, the attacker reduces the model's performance on targeted sub-tasks (e.g. classifying planes as birds) by uploading "poisoned" updates. In this paper we introduce SparseFed, a novel defense that uses global top-k update sparsification and device-level gradient clipping to mitigate model poisoning attacks. We propose a theoretical framework for analyzing the robustness of defenses against poisoning attacks, and provide robustness and convergence analysis of our algorithm. To validate its empirical efficacy we conduct an open-source evaluation at scale across multiple benchmark datasets for computer vision and federated learning.
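A minimal sketch of one aggregation round combining the two mechanisms named above, device-level clipping followed by global top-k sparsification of the aggregate. The exact SparseFed update rule (e.g. any error-feedback buffering) may differ, and the function names here are illustrative.

```python
import numpy as np

def clip(update, c):
    # Device-level clipping: scale the update so its L2 norm is at most c,
    # bounding the influence any single (possibly poisoned) device can have.
    norm = np.linalg.norm(update)
    return update if norm <= c else update * (c / norm)

def sparsefed_round(updates, k, c):
    # Clip each device update, average, then keep only the k largest
    # coordinates of the aggregate by magnitude (global top-k sparsification).
    agg = np.mean([clip(u, c) for u in updates], axis=0)
    out = np.zeros_like(agg)
    top = np.argsort(np.abs(agg))[-k:]
    out[top] = agg[top]
    return out

# A large malicious update is clipped before it can dominate the aggregate.
example = sparsefed_round(
    [np.array([10.0, 0.0, 0.0, 0.0]), np.array([0.1, 0.2, 0.3, 0.4])],
    k=1, c=1.0)
```

The intuition: a poisoned sub-task gradient must survive both the norm clip and the top-k mask to affect the global model, which is what the robustness analysis quantifies.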

**Optimal channel selection with discrete QCQP**

Yeonwoo Jeong · Deokjae Lee · Gaon An · Changyong Son · Hyun Oh Song

Reducing the high computational cost of large convolutional neural networks is crucial when deploying the networks to resource-constrained environments. We first show that the greedy approach of recent channel pruning methods ignores the inherent quadratic coupling between channels in neighboring layers and cannot safely remove inactive weights during the pruning procedure. Furthermore, due to these inactive weights, the greedy methods cannot guarantee satisfaction of the given resource constraints and deviate from the true objective. In this regard, we propose a novel channel selection method that optimally selects channels via discrete QCQP, which provably prevents any inactive weights and guarantees to meet the resource constraints tightly in terms of FLOPs, memory usage, and network size. We also propose a quadratic model that accurately estimates the actual inference time of the pruned network, which allows us to adopt inference time as a resource constraint option. Furthermore, we generalize our method to extend the selection granularity beyond channels and handle non-sequential connections. Our experiments on CIFAR-10 and ImageNet show our proposed pruning method outperforms other fixed-importance channel pruning methods on various network architectures.

**On the Generalization of Representations in Reinforcement Learning**

Charline Le Lan · Stephen Tu · Adam Oberman · Rishabh Agarwal · Marc G. Bellemare

In reinforcement learning, state representations are used to tractably deal with large problem spaces. State representations serve not only to approximate the value function with few parameters, but also to generalize to newly encountered states. Their features may be learned implicitly (as part of a neural network) or explicitly (for example, the successor representation of Dayan (1993)). While the approximation properties of representations are reasonably well-understood, a precise characterization of how and when these representations generalize is lacking. In this work, we address this gap and provide an informative bound on the generalization error arising from a specific state representation. This bound is based on the notion of effective dimension, which measures the degree to which knowing the value at one state informs the value at other states. Our bound applies to any state representation and quantifies the natural tension between representations that generalize well and those that approximate well. We complement our theoretical results with an empirical survey of classic representation learning methods from the literature and results on the Arcade Learning Environment, and find that the generalization behaviour of learned representations is well-explained by their effective dimension.

Machine learning systems based on minimizing average error have been shown to perform inconsistently across notable subsets of the data, which is not exposed by a low average error for the entire dataset. Distributionally Robust Optimization (DRO) seemingly addresses this problem by minimizing the worst expected risk across subpopulations. We establish theoretical results that clarify the relation between DRO and the optimization of the same loss averaged on an adequately weighted training dataset. The results cover finite and infinite number of training distributions, as well as convex and non-convex loss functions. An implication of our results is that for each DRO problem there exists a data distribution such that learning this distribution is equivalent to solving the DRO problem. Yet, important problems that DRO seeks to address (for instance, adversarial robustness and fighting bias) cannot be reduced to finding the one 'unbiased' dataset. Our discussion section addresses this important discrepancy.

**Effective Nonlinear Feature Selection Method based on HSIC Lasso and with Variational Inference**

Kazuki Koyama · Keisuke Kiritoshi · Tomomi Okawachi · Tomonori Izumitani

HSIC Lasso is one of the most effective sparse nonlinear feature selection methods based on the Hilbert-Schmidt independence criterion. We propose an adaptive nonlinear feature selection method, which is based on the HSIC Lasso, that uses a stochastic model with a family of super-Gaussian prior distributions for sparsity enhancement. The method includes easily implementable closed-form update equations that are derived approximately from variational inference and can handle high-dimensional and large datasets. We applied the method to several synthetic datasets and real-world datasets and verified its effectiveness regarding redundancy, computational complexity, and classification and prediction accuracy using the selected features. The results indicate that the method can more effectively remove irrelevant features, leaving only relevant features. In certain problem settings, the method assigned non-zero importance only to the actually relevant features. This is an important characteristic for practical use.

We study the problem of learning a single neuron $\mathbf{x}\mapsto \sigma(\mathbf{w}^T\mathbf{x})$ with gradient descent (GD). All the existing positive results are limited to the case where $\sigma$ is monotonic. However, it has recently been observed that non-monotonic activation functions outperform the traditional monotonic ones in many applications. To fill this gap, we establish learnability without assuming monotonicity. Specifically, when the input distribution is the standard Gaussian, we show that mild conditions on $\sigma$ (e.g., $\sigma$ has a dominating linear part) are sufficient to guarantee learnability in polynomial time and polynomial samples. Moreover, with a stronger assumption on the activation function, the condition on the input distribution can be relaxed to a non-degeneracy of the marginal distribution. We remark that our conditions on $\sigma$ are satisfied by practical non-monotonic activation functions, such as SiLU/Swish and GELU. We also discuss how our positive results are related to existing negative results on training two-layer neural networks.
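A small sketch of the realizable setting studied here: gradient descent on the squared loss for a single SiLU neuron with standard Gaussian inputs. The learning rate, sample size, and step count are arbitrary illustrative choices, and this empirical demo is not a substitute for the paper's learnability analysis.

```python
import numpy as np

def silu(z):
    # SiLU/Swish: non-monotonic, with a dominating linear part for large z.
    return z / (1.0 + np.exp(-z))

def learn_single_neuron(w_star, n=2000, steps=300, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    d = len(w_star)
    X = rng.standard_normal((n, d))   # standard Gaussian inputs
    y = silu(X @ w_star)              # realizable labels from the teacher neuron
    w = np.zeros(d)
    for _ in range(steps):
        z = X @ w
        s = 1.0 / (1.0 + np.exp(-z))
        grad_act = s * (1.0 + z * (1.0 - s))  # derivative of silu(z)
        residual = silu(z) - y
        # Gradient of the empirical squared loss 1/(2n) * sum (silu(z)-y)^2.
        w -= lr * (X.T @ (residual * grad_act)) / n
    return w

w_star = np.array([1.0, -0.5])
w_hat = learn_single_neuron(w_star)
```

Under the conditions described above, GD recovers the teacher weights despite the non-monotonic activation.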

**On PAC-Bayesian reconstruction guarantees for VAEs**

Badr-Eddine Chérief-Abdellatif · Yuyang Shi · Arnaud Doucet · Benjamin Guedj

Despite its wide use and empirical successes, the theoretical understanding and study of the behaviour and performance of the variational autoencoder (VAE) have only emerged in the past few years. We contribute to this recent line of work by analysing the VAE's reconstruction ability for unseen test data, leveraging arguments from the PAC-Bayes theory. We provide generalisation bounds on the theoretical reconstruction error, and provide insights on the regularisation effect of VAE objectives. We illustrate our theoretical results with supporting experiments on classical benchmark datasets.

**Variational Continual Proxy-Anchor for Deep Metric Learning**

Minyoung Kim · Ricardo Guerrero · Hai Pham · Vladimir Pavlovic

The recent proxy-anchor method achieved outstanding performance in deep metric learning, which can be attributed to its data-efficient loss based on hard example mining, as well as a far lower sampling complexity than pair-based approaches. In this paper we extend the proxy-anchor method by posing it within the continual learning framework, motivated by its batch-expected loss form (instead of instance-expected, as is typical in deep learning), which can potentially incur catastrophic forgetting of historic batches. By regarding each batch as a task in continual learning, we adopt the Bayesian variational continual learning approach to derive a novel loss function. Interestingly, the resulting loss has two key modifications to the original proxy-anchor loss: i) we inject noise into the proxies when optimizing the proxy-anchor loss, and ii) we encourage momentum updates to avoid abrupt model changes. As a result, the learned model achieves higher test accuracy than proxy-anchor due to its robustness to noise in the data (through model perturbation during training) and the reduced batch forgetting effect. We demonstrate the improved results on several benchmark datasets.

Deep kernel learning is a promising combination of deep neural networks and nonparametric function estimation. However, as a data driven approach, the performance of deep kernel learning can still be restricted by scarce or insufficient data, especially in extrapolation tasks. To address these limitations, we propose Physics Informed Deep Kernel Learning (PI-DKL) that exploits physics knowledge represented by differential equations with latent sources. Specifically, we use the posterior function sample of the Gaussian process as the surrogate for the solution of the differential equation, and construct a generative component to integrate the equation in a principled Bayesian hybrid framework. For efficient and effective inference, we marginalize out the latent variables in the joint probability and derive a collapsed model evidence lower bound (ELBO), based on which we develop a stochastic model estimation algorithm. Our ELBO can be viewed as a nice, interpretable posterior regularization objective. On synthetic datasets and real-world applications, we show the advantage of our approach in both prediction accuracy and uncertainty quantification. The code is available at https://github.com/GregDobby/PIDKL.

**Approximate Function Evaluation via Multi-Armed Bandits**

Tavor Baharav · Gary Cheng · Mert Pilanci · David Tse

We study the problem of estimating the value of a known smooth function $f$ at an unknown point $\mu \in \mathbb{R}^n$, where each component $\mu_i$ can be sampled via a noisy oracle. Sampling the components of $\mu$ that correspond to directions with larger directional derivatives of the function more frequently is more sample-efficient. However, as $\mu$ is unknown, the optimal sampling frequencies are also unknown. We design an instance-adaptive algorithm that learns to sample according to the importance of each coordinate and, with probability at least $1-\delta$, returns an $\epsilon$-accurate estimate of $f(\mu)$. We generalize our algorithm to adapt to heteroskedastic noise, and prove asymptotic optimality when $f$ is linear. We corroborate our theoretical results with numerical experiments, showing the dramatic gains afforded by adaptivity.
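For the linear case $f(\mu) = a^\top \mu$ with unit-variance noise, the variance of $\sum_i a_i \hat\mu_i$ under a budget of $T$ oracle calls is $\sum_i a_i^2 / n_i$, which is minimized by the allocation $n_i \propto |a_i|$. The sketch below uses that oracle allocation directly; the paper's algorithm additionally has to learn these frequencies adaptively since $\mu$ (and hence the gradient) is unknown. Names and the Gaussian noise model are illustrative.

```python
import numpy as np

def estimate_linear(a, mu, T, seed=0):
    # f(mu) = a . mu; each oracle call to coordinate i returns mu_i + N(0,1).
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    # Variance-optimal budget split for linear f: n_i proportional to |a_i|.
    n = np.maximum(1, np.round(T * np.abs(a) / np.abs(a).sum())).astype(int)
    means = np.array([rng.normal(mu[i], 1.0, size=n[i]).mean()
                      for i in range(len(a))])
    return float(a @ means)

# Most of the budget goes to the first coordinate, whose coefficient is 30x larger.
est = estimate_linear([3.0, 0.1], [1.0, 2.0], T=1000)
```

Compared with a uniform split, this allocation concentrates samples where the directional derivative is large, which is the source of the adaptivity gains reported above.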

The paper considers the problem of multi-objective decision support when outcomes are uncertain. We extend the concept of Pareto-efficient decisions to take into account the uncertainty of decision outcomes across varying contexts. This enables quantifying trade-offs between decisions in terms of tail outcomes that are relevant in safety-critical applications. We propose a method for learning efficient decisions with statistical confidence, building on results from the conformal prediction literature. The method adapts to weak or nonexistent context covariate overlap and its statistical guarantees are evaluated using both synthetic and real data.

**Near-Optimal Task Selection for Meta-Learning with Mutual Information and Online Variational Bayesian Unlearning**

Yizhou Chen · Shizhuo Zhang · Bryan Kian Hsiang Low

This paper addresses the problem of active task selection which involves selecting the most informative tasks for meta-learning. We propose a novel active task selection criterion based on the mutual information between latent task vectors. Unfortunately, such a criterion scales poorly in the number of candidate tasks when optimized. To resolve this issue, we exploit the submodularity property of our new criterion for devising the first active task selection algorithm for meta-learning with a near-optimal performance guarantee. To further improve efficiency, we propose an online variant of the Stein variational gradient descent to perform fast belief updates of the meta-parameters via maintaining a set of forward (and backward) particles when learning (or unlearning) from each selected task. We empirically demonstrate the performance of our proposed algorithm on real-world datasets.

**Factorization Approach for Low-complexity Matrix Completion Problems: Exponential Number of Spurious Solutions and Failure of Gradient Methods**

Baturalp Yalçın · Haixiang Zhang · Javad Lavaei · Somayeh Sojoudi

The Burer-Monteiro (B-M) factorization approach can efficiently solve low-rank matrix optimization problems under the Restricted Isometry Property (RIP) condition. It is natural to ask whether B-M factorization-based methods can succeed on any low-rank matrix optimization problem with low information-theoretic complexity, i.e., polynomial-time solvable problems that have a unique solution. We provide a negative answer to this question. We investigate the landscape of B-M factorized polynomial-time solvable matrix completion (MC) problems, which are the most popular subclass of low-rank matrix optimization problems without the RIP condition. We construct an instance of polynomial-time solvable MC problems with exponentially many spurious local minima, which leads to the failure of most gradient-based methods. We define a new complexity metric that measures the solvability of low-rank matrix optimization problems based on the B-M factorization approach. In addition, we show that more measurements can deteriorate the landscape, which further reveals the unfavorable behavior of the B-M factorization.

Given a graph, the densest subgraph problem asks for a set of vertices such that the average degree among these vertices is maximized. Densest subgraph has numerous applications in learning, e.g., community detection in social networks, link spam detection, correlation mining, bioinformatics, and so on. Although there are efficient algorithms that output either exact or approximate solutions to the densest subgraph problem, existing algorithms may violate the privacy of the individuals in the network, e.g., leaking the existence/non-existence of edges. In this paper, we study the densest subgraph problem in the framework of differential privacy, and we derive upper and lower bounds for this problem. We show that there exists a linear-time $\epsilon$-differentially private algorithm that finds a 2-approximation of the densest subgraph with an extra poly-logarithmic additive error. Our algorithm not only reports the approximate density of the densest subgraph, but also reports the vertices that form the dense subgraph. Our upper bound almost matches the famous 2-approximation by Charikar both in performance and in approximation ratio, but we additionally achieve differential privacy at the cost of an extra poly-logarithmic additive error. We partly justify the additive error with a new lower bound, showing that for any differentially private algorithm that provides a constant-factor approximation, a sub-logarithmic additive error is inherent. We also study our differentially private algorithm on real-world graphs, and we show that in practice the algorithm finds a solution which is very close to the optimal one.
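
For context, here is a minimal sketch of Charikar's greedy peeling algorithm, the classical non-private 2-approximation the abstract compares against (density taken as $|E|/|V|$, i.e., half the average degree; the example graph is illustrative):

```python
def densest_subgraph_peel(edges, vertices):
    """Charikar's greedy peeling (non-private): repeatedly delete a minimum-degree
    vertex and return the densest intermediate subgraph, where density = |E|/|V|
    (half the average degree). This is a 2-approximation to the optimum."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    m, alive = len(edges), set(adj)
    best_density, best_set = 0.0, set(alive)
    while alive:
        density = m / len(alive)
        if density > best_density:
            best_density, best_set = density, set(alive)
        v = min(alive, key=lambda u: len(adj[u]))   # peel a min-degree vertex
        for u in adj[v]:
            adj[u].discard(v)
        m -= len(adj[v])
        adj[v].clear()
        alive.discard(v)
    return best_density, best_set

# K4 plus a pendant vertex: the densest subgraph is the K4, with density 6/4 = 1.5.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
```

The private algorithm in the abstract keeps this algorithm's approximation ratio while perturbing the process so that individual edges remain protected, which is where the additive poly-logarithmic error enters.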

**Embedded Ensembles: infinite width limit and operating regimes**

Maksim Velikanov · Roman Kail · Ivan Anokhin · Roman Vashurin · Maxim Panov · Alexey Zaytsev · Dmitry Yarotsky

A memory-efficient approach to ensembling neural networks is to share most weights among the ensembled models by means of a single reference network. We refer to this strategy as Embedded Ensembling (EE); its particular examples are BatchEnsembles and Monte-Carlo dropout ensembles. In this paper, we perform a systematic theoretical and empirical analysis of embedded ensembles with different numbers of models. Theoretically, we use a Neural-Tangent-Kernel-based approach to derive the wide network limit of the gradient descent dynamics. In this limit, we identify two ensemble regimes, independent and collective, depending on the architecture and initialization strategy of ensemble models. We prove that in the independent regime the embedded ensemble behaves as an ensemble of independent models. We confirm our theoretical prediction with a wide range of experiments with finite networks, and further study empirically various effects such as transition between the two regimes, scaling of ensemble performance with the network width and number of models, and the dependence of performance on a number of architecture and hyperparameter choices.

**Disentangling Whether from When in a Neural Mixture Cure Model for Failure Time Data**

Matthew Engelhard · Ricardo Henao

The mixture cure model allows failure probability to be estimated separately from failure timing in settings wherein failure never occurs in a subset of the population. In this paper, we draw on insights from representation learning and causal inference to develop a neural network based mixture cure model that is free of distributional assumptions, yielding improved prediction of failure timing, yet still effectively disentangles information about failure timing from information about failure probability. Our approach also mitigates effects of selection biases in the observation of failure and censoring times on estimation of the failure density and censoring density, respectively. Results suggest this approach could be applied to distinguish factors predicting failure occurrence versus timing and mitigate biases in real-world observational datasets.

**Harmless interpolation in regression and classification with structured features**

Andrew McRae · Santhosh Karnik · Mark Davenport · Vidya Muthukumar

Overparametrized neural networks tend to perfectly fit noisy training data yet generalize well on test data. Inspired by this empirical observation, recent work has sought to understand this phenomenon of benign overfitting or harmless interpolation in the much simpler linear model. Previous theoretical work critically assumes that either the data features are statistically independent or the input data is high-dimensional; this precludes general nonparametric settings with structured feature maps. In this paper, we present a general and flexible framework for upper bounding regression and classification risk in a reproducing kernel Hilbert space. A key contribution is that our framework describes precise sufficient conditions on the data Gram matrix under which harmless interpolation occurs. Our results recover prior independent-features results (with a much simpler analysis), but they furthermore show that harmless interpolation can occur in more general settings such as features that are a bounded orthonormal system. Furthermore, our results show an asymptotic separation between classification and regression performance in a manner that was previously only shown for Gaussian features.

**On Structured Filtering-Clustering: Global Error Bound and Optimal First-Order Algorithms**

Nhat Ho · Tianyi Lin · Michael Jordan

The filtering-clustering models, including trend filtering and convex clustering, have become an important source of ideas and modeling tools in machine learning and related fields. The statistical guarantee of optimal solutions in these models has been extensively studied, yet the investigations on the computational aspect have remained limited. In particular, practitioners often employ first-order algorithms in real-world applications and are impressed by their superior performance regardless of ill-conditioned structures of difference operator matrices, thus leaving open the problem of understanding the convergence property of first-order algorithms. This paper settles this open problem and contributes to the broad interplay between statistics and optimization by identifying a \textit{global error bound} condition, which is satisfied by a large class of dual filtering-clustering problems, and designing a class of \textit{generalized dual gradient ascent} algorithms, which are \textit{optimal} first-order algorithms in deterministic, finite-sum and online settings. Our results are new and help explain why the filtering-clustering models can be efficiently solved by first-order algorithms. We also provide a detailed convergence rate analysis for the proposed algorithms in different settings, shedding light on their potential to solve the filtering-clustering models efficiently. We also conduct experiments on real datasets and the numerical results demonstrate the effectiveness of our algorithms.

**Feature screening with kernel knockoffs**

Benjamin Poignard · Peter Naylor · Héctor Climente-González · Makoto Yamada

This article analyses three feature screening procedures: Kendall's Tau and Spearman Rho (TR), the Hilbert-Schmidt Independence Criterion (HSIC) and the conditional Maximum Mean Discrepancy (cMMD), where the latter is a modified version of the standard MMD for categorical classification. These association measures are not based on any specific underlying model, such as linear regression. We provide the conditions for which the sure independence screening (SIS) property is satisfied under a lower bound assumption on the minimum signal strength of the association measure. The SIS property for the HSIC and cMMD is established for given bounded and symmetric kernels. Within the high-dimensional setting, we propose a two-step approach to control the false discovery rate (FDR) using knockoff filtering. The performance of the association measures is assessed through simulated and real data experiments and compared with existing competing screening methods.
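
For concreteness, a minimal sketch of the biased empirical HSIC estimator with Gaussian kernels (one of the association measures used for screening; the toy data is illustrative). On a nonlinear dependence it is clearly larger than on independent draws:

```python
import numpy as np

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC with Gaussian kernels: trace(K H L H) / n^2,
    where H is the centering matrix. Nonnegative; larger means more dependence."""
    n = len(x)
    K = np.exp(-np.subtract.outer(x, x) ** 2 / (2 * sigma ** 2))
    L = np.exp(-np.subtract.outer(y, y) ** 2 / (2 * sigma ** 2))
    H = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(K @ H @ L @ H)) / n ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=200)
dependent = hsic(x, x ** 2)                   # nonlinear dependence (zero correlation)
independent = hsic(x, rng.normal(size=200))   # no dependence
```

Note that $y = x^2$ has zero linear correlation with $x$, so a rank- or correlation-based screen could miss it while HSIC does not; this model-free behavior is what the screening conditions in the article are stated in terms of.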

Deep networks usually require a massive amount of labeled data for their training. Yet, such data may include some mistakes in the labels. Interestingly, networks have been shown to be robust to such errors. This work uses spectral analysis of their learned mapping to provide an explanation for their robustness. In particular, we relate the smoothness regularization that usually exists in conventional training to the attenuation of high frequencies, which mainly characterize noise. By using a connection between the smoothness and the spectral norm of the network weights, we suggest that one may further improve robustness via spectral normalization. Empirical experiments validate our claims and show the advantage of this normalization for classification with label noise.

**Learning Proposals for Practical Energy-Based Regression**

Fredrik Gustafsson · Martin Danelljan · Thomas Schön

Energy-based models (EBMs) have experienced a resurgence within machine learning in recent years, including as a promising alternative for probabilistic regression. However, energy-based regression requires a proposal distribution to be manually designed for training, and an initial estimate has to be provided at test-time. We address both of these issues by introducing a conceptually simple method to automatically learn an effective proposal distribution, which is parameterized by a separate network head. To this end, we derive a surprising result, leading to a unified training objective that jointly minimizes the KL divergence from the proposal to the EBM, and the negative log-likelihood of the EBM. At test-time, we can then employ importance sampling with the trained proposal to efficiently evaluate the learned EBM and produce stand-alone predictions. Furthermore, we utilize our derived training objective to learn mixture density networks (MDNs) with a jointly trained energy-based teacher, consistently outperforming conventional MDN training on four real-world regression tasks within computer vision. Code is available at https://github.com/fregu856/ebms_proposals.

**Parallel MCMC Without Embarrassing Failures**

Daniel Augusto de Souza · Diego Mesquita · Samuel Kaski · Luigi Acerbi

Embarrassingly parallel Markov Chain Monte Carlo (MCMC) exploits parallel computing to scale Bayesian inference to large datasets by using a two-step approach. First, MCMC is run in parallel on (sub)posteriors defined on data partitions. Then, a server combines local results. While efficient, this framework is very sensitive to the quality of subposterior sampling. Common sampling problems such as missing modes or misrepresentation of low-density regions are amplified -- instead of being corrected -- in the combination phase, leading to catastrophic failures. In this work, we propose a novel combination strategy to mitigate this issue. Our strategy, Parallel Active Inference (PAI), leverages Gaussian Process (GP) surrogate modeling and active learning. After fitting GPs to subposteriors, PAI (i) shares information between GP surrogates to cover missing modes; and (ii) uses active sampling to individually refine subposterior approximations. We validate PAI in challenging benchmarks, including heavy-tailed and multi-modal posteriors and a real-world application to computational neuroscience. Empirical results show that PAI succeeds where previous methods catastrophically fail, with a small communication overhead.

**Identification in Tree-shaped Linear Structural Causal Models**

Benito van der Zander · Marcel Wienöbst · Markus Bläser · Maciej Liskiewicz

Linear structural equation models represent direct causal effects as directed edges and confounding factors as bidirected edges. An open problem is to identify the causal parameters from correlations between the nodes. We investigate models, whose directed component forms a tree, and show that there, besides classical instrumental variables, missing cycles of bidirected edges can be used to identify the model. They can yield systems of quadratic equations that we explicitly solve to obtain one or two solutions for the causal parameters of adjacent directed edges. We show how multiple missing cycles can be combined to obtain a unique solution. This results in an algorithm that can identify instances that previously required approaches based on Gröbner bases, which have doubly-exponential time complexity in the number of structural parameters.

**A Single-Timescale Method for Stochastic Bilevel Optimization**

Tianyi Chen · Yuejiao Sun · Quan Xiao · Wotao Yin

Stochastic bilevel optimization generalizes the classic stochastic optimization from the minimization of a single objective to the minimization of an objective function that depends on the solution of another optimization problem. Recently, bilevel optimization has been regaining popularity in emerging machine learning applications such as hyper-parameter optimization and model-agnostic meta learning. To solve this class of optimization problems, existing methods require either double-loop or two-timescale updates, which are sometimes less efficient. This paper develops a new optimization method for a class of stochastic bilevel problems that we term Single-Timescale stochAstic BiLevEl optimization (\textbf{STABLE}) method. STABLE runs in a single loop fashion, and uses a single-timescale update with a fixed batch size. To achieve an $\epsilon$-stationary point of the bilevel problem, STABLE requires ${\cal O}(\epsilon^{-2})$ samples in total; and to achieve an $\epsilon$-optimal solution in the strongly convex case, STABLE requires ${\cal O}(\epsilon^{-1})$ samples. To the best of our knowledge, when STABLE was proposed, it was the \emph{first} bilevel optimization algorithm achieving the same order of sample complexity as SGD for single-level stochastic optimization.

Pufferfish privacy achieves $\epsilon$-indistinguishability over a set of secret pairs in the disclosed data. This paper studies how to attain $\epsilon$-pufferfish privacy by an exponential mechanism, an additive noise scheme that generalizes Laplace noise. It is shown that the disclosed data is $\epsilon$-pufferfish private if the noise is calibrated to the sensitivity of the Kantorovich optimal transport plan. Such a plan can be obtained directly from the data statistics conditioned on the secret, i.e., the prior knowledge of the system. The sufficient condition is further relaxed to reduce the noise power. It is also proved that the Gaussian mechanism based on the Kantorovich approach attains a $\delta$-approximation of $\epsilon$-pufferfish privacy.
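
For background, the classical Laplace mechanism that this calibration generalizes can be sketched in a few lines; the pufferfish scheme replaces the worst-case query sensitivity below with the cost of a Kantorovich optimal transport plan between secret-conditional data distributions (the query and data here are illustrative):

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Classical Laplace mechanism: release value + Laplace noise of scale
    sensitivity / epsilon. The pufferfish scheme above swaps the worst-case
    sensitivity for the cost of a Kantorovich optimal transport plan between
    the data distributions conditioned on each secret in a pair."""
    return value + rng.laplace(scale=sensitivity / epsilon)

# Example: privately release a count over 0/1 entries (counting queries have sensitivity 1).
rng = np.random.default_rng(1)
data = np.array([1, 0, 1, 1, 0, 1])
noisy_count = laplace_mechanism(float(data.sum()), sensitivity=1.0, epsilon=1.0, rng=rng)
```

Because the transport-plan cost can be much smaller than the worst-case sensitivity when the prior is informative, the calibrated noise power is correspondingly reduced, which is the point of the relaxed sufficient condition.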

**Adaptive Importance Sampling meets Mirror Descent : a Bias-variance Tradeoff**

Anna Korba · François Portier

Adaptive importance sampling is a widespread Monte Carlo technique that uses a re-weighting strategy to iteratively estimate the so-called target distribution. A major drawback of adaptive importance sampling is the large variance of the weights, which is known to badly impact the accuracy of the estimates. This paper investigates a regularization strategy whose basic principle is to raise the importance weights to a certain power. This regularization parameter, which may evolve between zero and one during the algorithm, is shown (i) to balance between the bias and the variance and (ii) to be connected to the mirror descent framework. Using a kernel density estimate to build the sampling policy, uniform convergence is established under mild conditions. Finally, several practical ways to choose the regularization parameter are discussed and the benefits of the proposed approach are illustrated empirically.
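
The tempering idea can be sketched on a toy Gaussian example (the target, proposal, and test function are illustrative choices, not the paper's kernel-based sampler): raising the self-normalized weights to a power $\alpha \in (0, 1]$ flattens them, increasing the effective sample size at the cost of bias.

```python
import numpy as np

def tempered_is(alpha, n=50000, seed=0):
    """Self-normalized importance sampling of E[x^2] = 1 under a N(0, 1) target,
    with a N(0, 3^2) proposal and importance weights raised to the power alpha."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 3.0, n)
    # log target density minus log proposal density (up to an additive constant)
    logw = -0.5 * x ** 2 - (-0.5 * (x / 3.0) ** 2 - np.log(3.0))
    w = np.exp(alpha * logw)
    w /= w.sum()
    estimate = float((w * x ** 2).sum())   # consistent only when alpha = 1
    ess = float(1.0 / (w ** 2).sum())      # effective sample size (weight stability)
    return estimate, ess
```

Smaller $\alpha$ pulls the estimate toward the proposal's larger second moment (a biased answer) while raising the effective sample size, which is precisely the bias-variance trade-off the regularization parameter controls.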

**Near-optimal Local Convergence of Alternating Gradient Descent-Ascent for Minimax Optimization**

Guodong Zhang · Yuanhao Wang · Laurent Lessard · Roger Grosse

Smooth minimax games often proceed by simultaneous or alternating gradient updates. Although algorithms with alternating updates are commonly used in practice, the majority of existing theoretical analyses focus on simultaneous algorithms for convenience of analysis. In this paper, we study alternating gradient descent-ascent~(Alt-GDA) in minimax games and show that Alt-GDA is superior to its simultaneous counterpart~(Sim-GDA) in many settings. We prove that Alt-GDA achieves a near-optimal local convergence rate for strongly convex-strongly concave (SCSC) problems while Sim-GDA converges at a much slower rate. To our knowledge, this is the \emph{first} result of any setting showing that Alt-GDA converges faster than Sim-GDA by more than a constant. We further adapt the theory of integral quadratic constraints (IQC) and show that Alt-GDA attains the same rate \emph{globally} for a subclass of SCSC minimax problems. Empirically, we demonstrate that alternating updates speed up GAN training significantly and the use of optimism only helps for simultaneous algorithms.
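
The contrast between the two update schemes can be seen on a toy strongly convex-strongly concave quadratic (an illustrative instance, not the paper's general analysis):

```python
import numpy as np

# Toy SCSC quadratic game f(x, y) = 0.5*x**2 + c*x*y - 0.5*y**2:
# grad_x f = x + c*y, grad_y f = c*x - y; the unique saddle point is (0, 0).
c, eta, steps = 3.0, 0.1, 100

def sim_gda():
    x, y = 1.0, 1.0
    for _ in range(steps):
        gx, gy = x + c * y, c * x - y
        x, y = x - eta * gx, y + eta * gy     # both players use stale iterates
    return np.hypot(x, y)

def alt_gda():
    x, y = 1.0, 1.0
    for _ in range(steps):
        x = x - eta * (x + c * y)             # descent step first
        y = y + eta * (c * x - y)             # ascent step sees the fresh x
    return np.hypot(x, y)
```

With $c = 3$ and $\eta = 0.1$, the simultaneous iteration contracts at modulus $\sqrt{(1-\eta)^2 + \eta^2 c^2} \approx 0.949$ per step, while the alternating iteration attains spectral radius $1 - \eta = 0.9$, so Alt-GDA ends up orders of magnitude closer to the saddle after the same number of steps.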

**Dual-Level Adaptive Information Filtering for Interactive Image Segmentation**

Ervine Zheng · Qi Yu · Rui Li · Pengcheng Shi · Anne Haake

Image segmentation can be performed interactively by accepting user annotations to refine the segmentation. It seeks frequent feedback from humans, and the model is updated with a smaller batch of data in each iteration of the feedback loop. Such a training paradigm requires effective information filtering to guide the model so that it can encode vital information and avoid overfitting due to limited data and its inherent heterogeneity and noise. We propose an adaptive interactive segmentation framework to support user interaction while introducing dual-level information filtering to train a robust model. The framework integrates an encoder-decoder architecture with a style-aware augmentation module that applies augmentation to feature maps and customizes the segmentation prediction for different latent styles. It also applies a systematic label softening strategy to generate uncertainty-aware soft labels for model updates. Experiments on both medical and natural image segmentation tasks demonstrate the effectiveness of the proposed framework.

**Adaptive Multi-Goal Exploration**

Jean Tarbouriech · Omar Darwiche Domingues · Pierre Ménard · Matteo Pirotta · Michal Valko · Alessandro Lazaric

We introduce a generic strategy for provably efficient multi-goal exploration. It relies on AdaGoal, a novel goal selection scheme that leverages a measure of uncertainty in reaching states to adaptively target goals that are neither too difficult nor too easy. We show how AdaGoal can be used to tackle the objective of learning an $\epsilon$-optimal goal-conditioned policy for the (initially unknown) set of goal states that are reachable within $L$ steps in expectation from a reference state $s_0$ in a reward-free Markov decision process. In the tabular case with $S$ states and $A$ actions, our algorithm requires $\tilde{O}(L^3 S A \epsilon^{-2})$ exploration steps, which is nearly minimax optimal. We also readily instantiate AdaGoal in linear mixture Markov decision processes, yielding the first goal-oriented PAC guarantee with linear function approximation. Beyond its strong theoretical guarantees, we anchor AdaGoal in goal-conditioned deep reinforcement learning, both conceptually and empirically, by connecting its idea of selecting "uncertain" goals to maximizing value ensemble disagreement.

**Implicitly Regularized RL with Implicit Q-values**

Nino Vieillard · Marcin Andrychowicz · Anton Raichuk · Olivier Pietquin · Matthieu Geist

The $Q$-function is a central quantity in many Reinforcement Learning (RL) algorithms for which RL agents behave following a (soft)-greedy policy w.r.t. $Q$. It is a powerful tool that allows action selection without a model of the environment and even without explicitly modeling the policy. Yet, this scheme can only be used in discrete action tasks, with small numbers of actions, as the softmax over actions cannot be computed exactly otherwise. More specifically, the usage of function approximation to deal with continuous action spaces in modern actor-critic architectures intrinsically prevents the exact computation of a softmax. We propose to alleviate this issue by parametrizing the $Q$-function \emph{implicitly}, as the sum of a log-policy and a value function. We use the resulting parametrization to derive a practical off-policy deep RL algorithm, suitable for large action spaces, and that enforces the softmax relation between the policy and the $Q$-value. We provide a theoretical analysis of our algorithm: from an Approximate Dynamic Programming perspective, we show its equivalence to a regularized version of value iteration, accounting for both entropy and Kullback-Leibler regularization, and that enjoys beneficial error propagation results. We then evaluate our algorithm on classic control tasks, where its results compete with state-of-the-art methods.
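
The implicit parametrization admits a one-screen numerical check: defining the soft value $V(s) = \tau \log \sum_a \exp(Q(s,a)/\tau)$ and the policy $\pi = \operatorname{softmax}(Q/\tau)$, the identity $Q = \tau \log \pi + V$ holds exactly (the numbers below are illustrative):

```python
import numpy as np

tau = 0.5
q = np.array([1.0, -0.3, 2.2])                    # Q-values for one state

v = tau * np.log(np.sum(np.exp(q / tau)))          # soft value: tau * logsumexp(Q / tau)
pi = np.exp(q / tau) / np.sum(np.exp(q / tau))     # soft-greedy policy: softmax(Q / tau)

# Implicit parametrization: the Q-function is recovered exactly as
# Q(s, a) = tau * log pi(a|s) + V(s).
q_reconstructed = tau * np.log(pi) + v
```

Parametrizing the log-policy and the value with separate network heads and summing them therefore yields a $Q$-function for which the softmax relation holds by construction, with no explicit softmax over a (possibly continuous) action space.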

**Accurate Shapley Values for explaining tree-based models**

Salim I. Amoukou · Tangi Salaün · Nicolas Brunel

Although Shapley Values (SV) are widely used in explainable AI, they can be poorly understood and estimated, implying that their analysis may lead to spurious inferences and explanations. As a starting point, we recall an invariance principle for SV and derive the correct approach for computing the SV of categorical variables that are particularly sensitive to the encoding used. In the case of tree-based models, we introduce two estimators of Shapley Values that exploit the tree structure efficiently and are more accurate than state-of-the-art methods. Simulations and comparisons are performed with state-of-the-art algorithms and show the practical gain of our approach. Finally, we discuss the ability of SV to provide reliable local explanations. We also provide a Python package that computes our estimators at https://github.com/salimamoukou/acv00.

We give a general recipe for derandomising PAC-Bayesian bounds using margins, with the critical ingredient being that our randomised predictions concentrate around some value. The tools we develop straightforwardly lead to margin bounds for various classifiers, including linear prediction---a class that includes boosting and the support vector machine---single-hidden-layer neural networks with an unusual erf activation function, and deep ReLU networks. Further we extend to partially-derandomised predictors where only some of the randomness of our estimators is removed, letting us extend bounds to cases where the concentration properties of our estimators are otherwise poor.

**Differentially Private Regression with Unbounded Covariates**

Jason Milionis · Alkis Kalavasis · Dimitris Fotakis · Stratis Ioannidis

We provide computationally efficient, differentially private algorithms for the classical regression settings of Least Squares Fitting, Binary Regression and Linear Regression with unbounded covariates. Prior to our work, privacy constraints in such regression settings were studied under strong a priori bounds on covariates. We consider the case of Gaussian marginals and extend recent differentially private techniques on mean and covariance estimation (Kamath et al., 2019; Karwa and Vadhan, 2018) to the sub-gaussian regime. We provide a novel technical analysis yielding differentially private algorithms for the above classical regression settings. Through the case of Binary Regression, we capture the fundamental and widely-studied models of logistic regression and linearly-separable SVMs, learning an unbiased estimate of the true regression vector, up to a scaling factor.

**A cautionary tale on fitting decision trees to data from additive models: generalization lower bounds**

Yan Shuo Tan · Abhineet Agarwal · Bin Yu

Decision trees are important both as interpretable models amenable to high-stakes decision-making, and as building blocks of ensemble methods such as random forests and gradient boosting. Their statistical properties, however, are not well understood. The most cited prior works have focused on deriving pointwise consistency guarantees for CART in a classical nonparametric regression setting. We take a different approach, and advocate studying the generalization performance of decision trees with respect to different generative regression models. This allows us to elicit their inductive bias, that is, the assumptions the algorithms make (or do not make) to generalize to new data, thereby guiding practitioners on when and how to apply these methods. In this paper, we focus on sparse additive generative models, which have both low statistical complexity and some nonparametric flexibility. We prove a sharp squared error generalization lower bound for a large class of decision tree algorithms fitted to sparse additive models with $C^1$ component functions. This bound is surprisingly much worse than the minimax rate for estimating such sparse additive models. The inefficiency is due not to greediness, but to the loss in power for detecting global structure when we average responses solely over each leaf, an observation that suggests opportunities to improve tree-based algorithms, for example, by hierarchical shrinkage. To prove these bounds, we develop new technical machinery, establishing a novel connection between decision tree estimation and rate-distortion theory, a sub-field of information theory.

**Leveraging Time Irreversibility with Order-Contrastive Pre-training**

Monica Agrawal · Hunter Lang · Michael Offin · Lior Gazit · David Sontag

Label-scarce, high-dimensional domains such as healthcare present a challenge for modern machine learning techniques. To overcome the difficulties posed by a lack of labeled data, we explore an "order-contrastive" method for self-supervised pre-training on longitudinal data. We sample pairs of time segments, switch the order for half of them, and train a model to predict whether a given pair is in the correct order. Intuitively, the ordering task allows the model to attend to the least time-reversible features (for example, features that indicate progression of a chronic disease). The same features are often useful for downstream tasks of interest. To quantify this, we study a simple theoretical setting where we prove a finite-sample guarantee for the downstream error of a representation learned with order-contrastive pre-training. Empirically, in synthetic and longitudinal healthcare settings, we demonstrate the effectiveness of order-contrastive pre-training in the small-data regime over supervised learning and other self-supervised pre-training baselines. Our results indicate that pre-training methods designed for particular classes of distributions and downstream tasks can improve the performance of self-supervised learning.
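
The pre-training task itself is simple to set up; a minimal sketch on synthetic data (the helper and signal are illustrative, not the paper's pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

def order_contrastive_pairs(series, seg_len, n_pairs):
    """Sample pairs of consecutive windows from a longitudinal series and swap
    the order for roughly half of them; label 1 iff the pair is in correct order."""
    pairs, labels = [], []
    for _ in range(n_pairs):
        t = int(rng.integers(0, len(series) - 2 * seg_len))
        a = series[t:t + seg_len]
        b = series[t + seg_len:t + 2 * seg_len]
        if rng.random() < 0.5:
            pairs.append((a, b))
            labels.append(1)     # correct temporal order
        else:
            pairs.append((b, a))
            labels.append(0)     # swapped
    return pairs, np.array(labels)

# Toy monotone "progression" signal, so ordering is detectable from the data.
series = np.cumsum(rng.random(200))
pairs, labels = order_contrastive_pairs(series, seg_len=10, n_pairs=64)
```

A model trained to predict these labels must latch onto the least time-reversible features of the series, which is exactly the inductive bias the abstract argues transfers to downstream tasks such as tracking disease progression.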

A recent line of work has focused on training machine learning (ML) models in the performative setting, i.e. when the data distribution reacts to the deployed model. The goal in this setting is to learn a model which both induces a favorable data distribution and performs well on the induced distribution, thereby minimizing the test loss. Previous work on finding an optimal model assumes that the data distribution immediately adapts to the deployed model. In practice, however, this may not be the case, as the population may take time to adapt to the model. In many applications, the data distribution depends on both the currently deployed ML model and on the ``state'' that the population was in before the model was deployed. In this work, we propose a new algorithm, Stateful Performative Gradient Descent (Stateful PerfGD), for minimizing the performative loss even in the presence of these effects. We provide theoretical guarantees for the convergence of Stateful PerfGD. Our experiments confirm that Stateful PerfGD substantially outperforms previous state-of-the-art methods.

**Super-Acceleration with Cyclical Step-sizes**

Baptiste Goujaud · Damien Scieur · Aymeric Dieuleveut · Adrien Taylor · Fabian Pedregosa

We develop a convergence-rate analysis of momentum with cyclical step-sizes. We show that under an assumption on the spectral gap of Hessians in machine learning, cyclical step-sizes are provably faster than constant step-sizes. More precisely, we develop a convergence rate analysis for quadratic objectives that provides optimal parameters and shows that cyclical learning rates can improve upon traditional lower complexity bounds. We further propose a systematic approach to design optimal first order methods for quadratic minimization with a given spectral structure. Finally, we provide a local convergence rate analysis beyond quadratic minimization for the proposed methods and illustrate our findings through benchmarks on least squares and logistic regression problems.
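
A toy instance of the phenomenon (illustrative, not the paper's analysis): when the Hessian eigenvalues form two clusters, cycling between one step size per cluster can beat the best constant step size by a wide margin, even for plain gradient descent without momentum.

```python
import numpy as np

# Quadratic objective 0.5 * sum(lam * x**2) whose Hessian eigenvalues form two
# clusters {1, 100}, i.e., a spectral gap (illustrative toy instance).
lam = np.array([1.0, 1.0, 100.0, 100.0])

def gd(step_sizes, iters=20):
    x = np.ones_like(lam)
    for i in range(iters):
        x = x - step_sizes[i % len(step_sizes)] * lam * x   # gradient step
    return float(np.linalg.norm(x))

constant = gd([2.0 / (1.0 + 100.0)])      # optimal constant step for the range [1, 100]
cyclical = gd([1.0, 1.0 / 100.0])         # one step size per cluster, cycled
```

On this idealized instance the two-step cycle annihilates both clusters almost exactly, while the best constant step only contracts at rate $(100-1)/(100+1) \approx 0.98$ per iteration; with eigenvalues spread within each cluster the gain is smaller but persists, which is what the convergence-rate analysis quantifies.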

This paper proposes a method to clarify image regions that are not well encoded by an invertible neural network (INN), i.e., image regions that significantly decrease the likelihood of the input image. The proposed method can diagnose the limitation of the representation capacity of an INN. Given an input image, our method extracts image regions, which are not well encoded, by maximizing the likelihood of the image. We explicitly model the distribution of not-well-encoded regions. A metric is proposed to evaluate the extraction of the not-well-encoded regions. Finally, we use the proposed method to analyze several state-of-the-art INNs trained on various benchmark datasets.

**Diversified Sampling for Batched Bayesian Optimization with Determinantal Point Processes**

Elvis Nava · Mojmir Mutny · Andreas Krause

In Bayesian Optimization (BO) we study black-box function optimization with noisy point evaluations and Bayesian priors. Convergence of BO can be greatly sped up by batching, where multiple evaluations of the black-box function are performed in a single round. The main difficulty in this setting is to propose at the same time diverse and informative batches of evaluation points. In this work, we introduce DPP-Batch Bayesian Optimization (DPP-BBO), a universal framework for inducing batch diversity in sampling based BO by leveraging the repulsive properties of Determinantal Point Processes (DPP) to naturally diversify the batch sampling procedure. We illustrate this framework by formulating DPP-Thompson Sampling (DPP-TS) as a variant of the popular Thompson Sampling (TS) algorithm and introducing a Markov Chain Monte Carlo procedure to sample from it. We then prove novel Bayesian simple regret bounds for both classical batched TS as well as our counterpart DPP-TS; with the latter bound being tighter. Our real-world, as well as synthetic, experiments demonstrate improved performance of DPP-BBO over classical batching methods with Gaussian process and Cox process models.

In this paper, we estimate the free energy, average mutual information, and minimum mean square error (MMSE) of a linear model under the assumption that the source is generated by a Markov chain. Our estimates are based on the replica method from statistical physics. We show that under the MMSE estimator, the linear model with Markov or hidden Markov sources decouples into single-input AWGN channels with state information available at both encoder and decoder, where the state distribution follows the stationary distribution of the stochastic matrix of the Markov chain. Numerical results show that the free energies and MSEs obtained via the replica method closely match their counterparts obtained via MCMC simulations.

**k-experts - Online Policies and Fundamental Limits**

Samrat Mukhopadhyay · Sourav Sahoo · Abhishek Sinha

We introduce the k-experts problem, a generalization of the classic Prediction with Expert Advice framework. Unlike the classic version, where the learner selects exactly one expert from a pool of $N$ experts at each round, in this problem the learner selects a subset of $k$ experts at each round ($1 \le k \le N$). The reward obtained by the learner at each round is assumed to be a function of the $k$ selected experts. The primary objective is to design an online learning policy with small regret. In this pursuit, we propose SAGE (Sampled Hedge), a framework for designing efficient online learning policies by leveraging statistical sampling techniques. For a wide class of reward functions, we show that SAGE either achieves the first sublinear regret guarantee or improves upon the existing ones. Furthermore, going beyond the notion of regret, we fully characterize the mistake bounds achievable by online learning policies for stable loss functions. We conclude the paper by establishing a tight regret lower bound for a variant of the k-experts problem and carrying out experiments with standard datasets.
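
The sampling-based selection of $k$ experts can be loosely sketched as follows. This toy uses the Gumbel top-k trick (sampling $k$ experts without replacement, favouring high-weight ones) and a full-information multiplicative update; both are illustrative assumptions, not SAGE's actual scheme.

```python
import numpy as np

def k_experts_hedge(rewards, k, eta=0.5, seed=0):
    """Toy Hedge-style loop for the k-experts setting: keep one weight
    per expert, draw k experts each round via the Gumbel top-k trick,
    and update the log-weights from the observed rewards."""
    rng = np.random.default_rng(seed)
    T, N = rewards.shape
    logw = np.zeros(N)
    total = 0.0
    for t in range(T):
        noise = rng.gumbel(size=N)
        chosen = np.argsort(logw + noise)[-k:]  # the k experts played
        total += rewards[t, chosen].max()       # e.g. best-of-k reward
        logw += eta * rewards[t]                # multiplicative update
    return total

# One expert (index 3) always pays off; the sampler should lock on to it.
T, N = 200, 5
payoffs = np.zeros((T, N))
payoffs[:, 3] = 1.0
collected = k_experts_hedge(payoffs, k=2)
```

After a handful of rounds the log-weight of the paying expert dominates the Gumbel noise, so it is included in nearly every sampled subset.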

**Structured variational inference in Bayesian state-space models**

Honggang Wang · Anirban Bhattacharya · Debdeep Pati · Yun Yang

Variational inference is routinely deployed in Bayesian state-space models as an efficient computational technique. Motivated by the inconsistency issue observed by Wang and Titterington (2004) for the mean-field approximation in linear state-space models, we consider a more expressive variational family for approximating the joint posterior of the latent variables so as to retain their dependence, while maintaining the mean-field (i.e., independence) structure between latent variables and parameters. In state-space models, such a latent-structure-adapted mean-field approximation can be efficiently computed using the belief propagation algorithm. Theoretically, we show that this adapted mean-field approximation achieves consistency of the variational estimates. Furthermore, we derive a non-asymptotic risk bound for an averaged alpha-divergence from the true data generating model, suggesting that the posterior mean of the best variational approximation for the static parameters shows optimal concentration. From a broader perspective, we add to the growing literature on statistical accuracy of variational approximations by allowing dependence between the latent variables, and the techniques developed here should be useful in related contexts.

We give a complete characterisation of families of probability distributions that are invariant under the action of ReLU neural network layers (in the same way that the family of Gaussian distributions is invariant to affine linear transformations). The need for such families arises during the training of Bayesian networks or the analysis of trained neural networks, e.g., in the context of uncertainty quantification (UQ) or explainable artificial intelligence (XAI). We prove that no invariant parametrised family of distributions can exist unless at least one of the following three restrictions holds: First, the network layers have a width of one, which is unreasonable for practical neural networks. Second, the probability measures in the family have finite support, which basically amounts to sampling distributions. Third, the parametrisation of the family is not locally Lipschitz continuous, which excludes all computationally feasible families. Finally, we show that these restrictions are individually necessary. For each of the three cases we can construct an invariant family exploiting exactly one of the restrictions but not the other two.

**Marginalising over Stationary Kernels with Bayesian Quadrature**

Saad Hamid · Sebastian Schulze · Michael A. Osborne · Stephen Roberts

Marginalising over families of Gaussian Process kernels produces flexible model classes with well-calibrated uncertainty estimates. Existing approaches require likelihood evaluations of many kernels, rendering them prohibitively expensive for larger datasets. We propose a Bayesian Quadrature scheme to make this marginalisation more efficient and thereby more practical. Using maximum mean discrepancies between distributions, we define a kernel over kernels that captures invariances between Spectral Mixture (SM) kernels. Kernel samples are selected by generalising an information-theoretic acquisition function for warped Bayesian Quadrature. We show that our framework achieves more accurate predictions with better-calibrated uncertainty than state-of-the-art baselines, especially when given limited (wall-clock) time budgets.
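
A minimal sketch of the MMD ingredient, under the assumption that each stationary kernel is represented by samples from its spectral density; the function and variable names are illustrative, not the paper's.

```python
import numpy as np

def mmd2(X, Y, gamma=1.0):
    """Biased estimate of the squared maximum mean discrepancy between
    two samples under an RBF base kernel. Applied to samples drawn from
    the spectral densities of two stationary kernels, it acts as a
    distance between the kernels themselves."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()

rng = np.random.default_rng(0)
# Spectral samples from two hypothetical one-component SM kernels
# whose spectral densities are Gaussians centred at 0 and 3.
freq_a = rng.normal(0.0, 1.0, size=(200, 1))
freq_b = rng.normal(3.0, 1.0, size=(200, 1))
```

Two kernels with identical spectral measures have zero MMD, while well-separated spectral densities yield a clearly positive distance.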

In the problem of classical group testing, one aims to identify a small subset (of size $d$) of diseased individuals/defective items in a large population (of size $n$) via a minimal number of suitably designed group tests on subsets of items, where the test outcome is positive iff the given test contains at least one defective item. Motivated by physical considerations, we consider a generalized setting that subsumes as special cases a variety of noiseless and noisy group-testing models in the literature: the test outcome is positive with probability $f(x)$, where $x$ is the number of defectives tested in a pool, and $f(\cdot)$ is an arbitrary {\it monotonically increasing} (stochastic) test function. Our main contributions are as follows.
1. We present a non-adaptive scheme that with probability $1-\varepsilon$ identifies all defective items. Our scheme requires at most ${\cal O}(H(f) d\log(n/\varepsilon))$ tests, where $H(f)$ is a suitably defined ``sensitivity parameter'' of $f(\cdot)$; $H(f)$ is never larger than ${\cal O}(d^{1+o(1)})$, but may be substantially smaller for many $f(\cdot)$.
2. We argue that any non-adaptive group testing scheme needs at least $\Omega(h(f) d\log(n/d))$ tests to ensure high-reliability recovery. Here $h(f)$ is a suitably defined ``concentration parameter'' of $f(\cdot)$, and $h(f) \in \Omega(1)$.
3. We prove that our sample-complexity bounds for generalized group testing are information-theoretically near-optimal for a variety of sparse-recovery group-testing models in the literature. That is, for {\it any} ``noisy'' test function $f(\cdot)$ (i.e., $0 < f(0) < f(d) < 1$), and for a variety of ``(one-sided) noiseless'' test functions $f(\cdot)$ (i.e., either $f(0)=0$, or $f(d)=1$, or both) studied in the literature, we show that $H(f)/h(f) \in \Theta(1)$.
As a by-product we tightly characterize the heretofore open information-theoretic sample complexity of the well-studied model of threshold group testing. For general (near-)noiseless test functions $f(\cdot)$ we show that $H(f)/h(f) \in {\cal O}(d^{1+o(1)})$. We also demonstrate a ``natural'' test function $f(\cdot)$ whose sample complexity scales ``extremally'' as $\Theta(d^2\log n)$, rather than $\Theta(d\log n)$ as in classical group testing. Some of our techniques may be of independent interest: our achievability requires a delicate saddle-point approximation; our impossibility proof relies on a novel bound relating the mutual information of a pair of random variables to the mean and variance of a specific function; and we derive novel structural results about monotone functions.
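
The generalized test model itself is straightforward to simulate. The sketch below is an illustrative reading of the setting; `f_noisy` is an assumed example, not a function from the paper.

```python
import numpy as np

def pooled_test(pool, defectives, f, rng):
    """One generalized group test: the outcome is positive with
    probability f(x), where x counts the defectives in the pool."""
    x = len(set(pool) & set(defectives))
    return rng.random() < f(x)

# Special cases recovered by choosing f:
f_classical = lambda x: 1.0 if x >= 1 else 0.0  # noiseless OR channel
f_noisy = lambda x: 0.05 + 0.9 * (x >= 1)       # an assumed noisy variant

rng = np.random.default_rng(0)
```

Monotonicity of $f(\cdot)$ means a pool containing more defectives is never less likely to test positive, which is the structural assumption the bounds exploit.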

**A general sample complexity analysis of vanilla policy gradient**

Rui YUAN · Robert Gower · Alessandro Lazaric

We adapt recent tools developed for the analysis of Stochastic Gradient Descent (SGD) in non-convex optimization to obtain convergence and sample complexity guarantees for the vanilla policy gradient (PG). Our only assumptions are that the expected return is smooth w.r.t. the policy parameters, that its $H$-step truncated gradient is close to the exact gradient, and that a certain ABC assumption holds. The ABC assumption requires the second moment of the estimated gradient to be bounded by $A \geq 0$ times the suboptimality gap, plus $B \geq 0$ times the norm of the full-batch gradient, plus an additive constant $C \geq 0$, or any combination of these. We show that the ABC assumption is more general than the assumptions on the policy space commonly used to prove convergence to a stationary point. We provide a single convergence theorem that recovers the $\widetilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity of PG. Our results also afford greater flexibility in the choice of hyperparameters such as the step size, and place no restriction on the batch size $m$, including the single-trajectory case (i.e., $m=1$). We then instantiate our theorem in different settings, where we both recover existing results and obtain improved sample complexities, e.g., for convergence to the global optimum for Fisher-non-degenerate parameterized policies.
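
For concreteness, here is a single-trajectory ($m = 1$) vanilla policy gradient loop on a toy bandit, i.e. plain REINFORCE with a softmax policy. This is a generic sketch of the algorithm being analysed, not the paper's setting or code.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_bandit(means, steps=2000, lr=0.1, seed=0):
    """Vanilla PG with batch size m = 1: at each step, sample one
    action from the softmax policy, observe a noisy reward, and take
    an ascent step along r * grad log pi(a | theta)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(means))
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choice(len(means), p=probs)
        r = means[a] + 0.1 * rng.standard_normal()  # noisy reward
        grad_logp = -probs
        grad_logp[a] += 1.0          # grad of log softmax at action a
        theta += lr * r * grad_logp  # single-sample REINFORCE step
    return softmax(theta)

p = reinforce_bandit(np.array([0.2, 1.0, 0.5]))
```

Even with a single trajectory per update, the policy concentrates on the highest-mean arm.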

**Chernoff Sampling for Active Testing and Extension to Active Regression**

Subhojyoti Mukherjee · Ardhendu Tripathy · Robert Nowak

Active learning can reduce the number of samples needed to perform a hypothesis test and to estimate the parameters of a model. In this paper, we revisit the work of Chernoff that described an asymptotically optimal algorithm for performing a hypothesis test. We obtain a novel sample complexity bound for Chernoff's algorithm, with a non-asymptotic term that characterizes its performance at a fixed confidence level. We also develop an extension of Chernoff sampling that can be used to estimate the parameters of a wide variety of models, and we obtain a non-asymptotic bound on the estimation error. We apply our extension of Chernoff sampling to actively learn neural network models and to estimate parameters in real-data linear and non-linear regression problems, where our approach compares favorably with state-of-the-art methods.

Dynamic scheduling is an important problem in applications ranging from queuing to wireless networks. It addresses how to choose one among multiple scheduling items at each timestep to achieve a long-term goal. Most conventional approaches to dynamic scheduling find the optimal policy for a given specific system, so the resulting policy is usable only under the corresponding system characteristics. Hence, it is hard to use such approaches in a practical system whose characteristics change dynamically. This paper proposes a novel policy structure for MDP-based dynamic scheduling, a descriptive policy, which has a system-agnostic capability to adapt to unseen system characteristics for an identical task (dynamic scheduling). To this end, the descriptive policy learns a system-agnostic scheduling principle--in a nutshell, ``which condition of items should have a higher priority in scheduling''. The scheduling principle can be applied to any system, so the descriptive policy learned in one system can be used for another. Experiments with simple explanatory and realistic application scenarios demonstrate that the descriptive policy enables system-agnostic meta-learning with very little performance degradation.

**Masked Training of Neural Networks with Partial Gradients**

Amirkeivan Mohtashami · Martin Jaggi · Sebastian Stich

State-of-the-art training algorithms for deep learning models are based on stochastic gradient descent (SGD). Recently, many variations have been explored: perturbing parameters for better accuracy (such as in Extragradient), limiting SGD updates to a subset of parameters for increased efficiency (such as meProp), or a combination of both (such as Dropout). However, the convergence of these methods is often not studied in theory. We propose a unified theoretical framework to study such SGD variants---encompassing the aforementioned algorithms and additionally a broad variety of methods used for communication-efficient training or model compression. Our insights can be used as a guide to improve the efficiency of such methods and facilitate generalization to new applications. As an example, we tackle the task of jointly training networks, a version of which (limited to sub-networks) is used to create Slimmable Networks. By training a low-rank Transformer jointly with a standard one, we obtain better performance than when it is trained separately.
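
One of the SGD variants covered, a meProp-style partial-gradient update, can be sketched in a few lines; the quadratic objective below is an illustrative choice, not from the paper.

```python
import numpy as np

def masked_sgd_step(w, grad, k, lr=0.1):
    """A meProp-style update: keep only the k largest-magnitude
    gradient coordinates and zero out the rest before stepping."""
    mask = np.zeros_like(grad)
    mask[np.argsort(np.abs(grad))[-k:]] = 1.0
    return w - lr * grad * mask

# Minimise f(w) = 0.5 * ||w||^2, whose gradient is w itself,
# updating only 2 of the 4 coordinates per step.
w = np.array([4.0, -3.0, 0.5, 0.1])
for _ in range(200):
    w = masked_sgd_step(w, grad=w, k=2)
```

Because the masked coordinates rotate as magnitudes equalise, the iterate still converges to the minimiser despite each step touching only part of the parameter vector.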

**Model-agnostic out-of-distribution detection using combined statistical tests**

Federico Bergamin · Pierre-Alexandre Mattei · Jakob Drachmann Havtorn · Hugo Sénétaire · Hugo Schmutz · Lars Maaløe · Soren Hauberg · Jes Frellsen

We present simple methods for out-of-distribution detection using a trained generative model. These techniques, based on classical statistical tests, are model-agnostic in the sense that they can be applied to any differentiable generative model. The idea is to combine a classical parametric test (Rao's score test) with the recently introduced typicality test. These two test statistics are both theoretically well-founded and exploit different sources of information: the likelihood for the typicality test and its gradient for the score test. We show that combining them using Fisher's method overall leads to a more accurate out-of-distribution test. We also discuss the benefits of casting out-of-distribution detection as a statistical testing problem, noting in particular that false positive rate control can be valuable for practical out-of-distribution detection. Despite their simplicity and generality, these methods can be competitive with model-specific out-of-distribution detection algorithms without any assumptions on the out-distribution.
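
Fisher's method itself is simple to implement. The sketch below combines arbitrary p-values; the example numbers are illustrative, not results from the paper.

```python
import math

def fisher_combine(pvalues):
    """Fisher's method: under H0, T = -2 * sum_i log(p_i) follows a
    chi-squared distribution with 2k degrees of freedom. For even
    degrees of freedom the survival function has the closed form
    P(T > t) = exp(-t/2) * sum_{i<k} (t/2)^i / i!, so no special
    functions are needed."""
    k = len(pvalues)
    t = -2.0 * sum(math.log(p) for p in pvalues)
    half = t / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))

# Combining a score-test p-value with a typicality-test p-value
# (illustrative numbers):
combined = fisher_combine([0.04, 0.30])
```

With a single p-value the method is the identity, and combining two uninformative p-values of 0.5 yields roughly 0.597, both standard sanity checks.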