**On the Interplay between Information Loss and Operation Loss in Representations for Classification**

Jorge Silva · Felipe Tobar

Information-theoretic measures have been widely adopted in the design of features for learning and decision problems. Inspired by this, we look at the relationship between i) a weak form of information loss in the Shannon sense and ii) operational loss in the minimum probability of error (MPE) sense when considering a family of lossy continuous representations of an observation. Our first result offers a lower bound on a weak form of information loss as a function of its respective operation loss when adopting a discrete lossy representation (quantization) instead of the original raw observation. From this, our main result shows that a specific form of vanishing information loss (a weak notion of asymptotic informational sufficiency) implies a vanishing MPE loss (or asymptotic operational sufficiency) when considering a family of lossy continuous representations. Our theoretical findings support the observation that the selection of feature representations that attempt to capture informational sufficiency is appropriate for learning, but this design principle is a rather conservative if the intended goal is achieving MPE in classification. On this last point, we discuss about studying weak forms of informational sufficiencies to achieve operational sufficiency in learning settings.

Fast Johnson-Lindenstrauss Transform (FJLT) and Sparse Johnson-Lindenstrauss Transform (SJLT) are two important oblivious subspace embeddings. So far, the developments of these two methods are almost orthogonal. In this work, we propose an iterative algorithm for oblivious subspace embedding which makes a connection between these two methods. The proposed method is built upon an iterative implementation of FJLT and is equipped with several theoretically motivated modifications. One important strategy we adopt is the early stopping strategy. On the one hand, the early stopping strategy makes our algorithm fast. On the other hand, it results in a sparse embedding matrix. As a result, the proposed algorithm is not only faster than the FJLT, but also faster than the SJLT with the same degree of sparsity. We present a general theoretical framework to analyze the embedding property of sparse embedding methods, which is used to prove the embedding property of the proposed method. This framework is also of independent interest. Lastly, we conduct numerical experiments to verify the good performance of the proposed algorithm.

**Regret Bounds for Expected Improvement Algorithms in Gaussian Process Bandit Optimization**

Hung Tran · Sunil Gupta · Santu Rana · Svetha Venkatesh

The expected improvement (EI) algorithm is one of the most popular strategies for optimization under uncertainty due to its simplicity and efficiency. Despite its popularity, the theoretical aspects of this algorithm have not been properly analyzed. In particular, whether in the noisy setting, the EI strategy with a standard incumbent converges is still an open question of the Gaussian process bandit optimization problem. We aim to answer this question by proposing a variant of EI with a standard incumbent defined via the GP predictive mean. We prove that our algorithm converges, and achieves a cumulative regret bound of $\mathcal O(\gamma_T\sqrt{T})$, where $\gamma_T$ is the maximum information gain between $T$ observations and the Gaussian process model. Based on this variant of EI, we further propose an algorithm called Improved GP-EI that converges faster than previous counterparts. In particular, our proposed variants of EI do not require the knowledge of the RKHS norm and the noise's sub-Gaussianity parameter as in previous works. Empirical validation in our paper demonstrates the effectiveness of our algorithms compared to several baselines.

This paper considers convex shape-restricted nonparametric regression over subgaussian domain and noise with the squared loss. It introduces a tractable convex piecewise-linear estimator which precomputes a partition of the training data by an adaptive version of farthest-point clustering, approximately fits hyperplanes over the partition cells by minimizing the regularized empirical risk, and projects the result into the max-affine class. The analysis provides an upper bound on the generalization error of this estimator matching the rate of Lipschitz nonparametric regression and proves its adaptivity to the intrinsic dimension of the data mitigating the effect of the curse of dimensionality. The experiments conclude with competitive performance, improved overfitting robustness, and significant computational savings compared to existing convex regression methods.

**Nonparametric Relational Models with Superrectangulation**

Masahiro Nakano · Ryo Nishikimi · Yasuhiro Fujiwara · Akisato Kimura · Takeshi Yamada · Naonori Ueda

This paper addresses the question, ''What is the smallest object that contains all rectangular partitions with n or fewer blocks?'' and shows its application to relational data analysis using a new strategy we call super Bayes as an alternative to Bayesian nonparametric (BNP) methods. Conventionally, standard BNP methods have combined the Aldous-Hoover-Kallenberg representation with parsimonious stochastic processes on rectangular partitioning to construct BNP relational models. As a result, conventional methods face the great difficulty of searching for a parsimonious random rectangular partition that fits the observed data well in Bayesian inference. As a way to essentially avoid such a problem, we propose a strategy to combine an extremely redundant rectangular partition as a deterministic (non-probabilistic) object. Specifically, we introduce a special kind of rectangular partitioning, which we call superrectangulation, that contains all possible rectangular partitions. Delightfully, this strategy completely eliminates the difficult task of searching around for random rectangular partitions, since the superrectangulation is deterministically fixed in inference. Experiments on predictive performance in relational data analysis show that the super Bayesian model provides a more stable analysis than the existing BNP models, which are less likely to be trapped in bad local optima.

**An Alternate Policy Gradient Estimator for Softmax Policies**

Shivam Garg · Samuele Tosatto · Yangchen Pan · Martha White · Rupam Mahmood

Policy gradient (PG) estimators are ineffective in dealing with softmax policies that are sub-optimally saturated, which refers to the situation when the policy concentrates its probability mass on sub-optimal actions. Sub-optimal policy saturation may arise from bad policy initialization or sudden changes in the environment that occur after the policy has already converged. Current softmax PG estimators require a large number of updates to overcome policy saturation, which causes low sample efficiency and poor adaptability to new situations. To mitigate this problem, we propose a novel PG estimator for softmax policies that utilizes the bias in the critic estimate and the noise present in the reward signal to escape the saturated regions of the policy parameter space. Our theoretical analysis and experiments, conducted on bandits and various reinforcement learning environments, show that this new estimator is significantly more robust to policy saturation.

**Can we Generalize and Distribute Private Representation Learning?**

Sheikh Shams Azam · Taejin Kim · Seyyedali Hosseinalipour · Carlee Joe-Wong · Saurabh Bagchi · Christopher Brinton

We study the problem of learning representations that are private yet informative i.e., provide information about intended "ally" targets while hiding sensitive "adversary" attributes. We propose Exclusion-Inclusion Generative Adversarial Network (EIGAN), a generalized private representation learning (PRL) architecture that accounts for multiple ally and adversary attributes unlike existing PRL solutions. While centrally-aggregated dataset is a prerequisite for most PRL techniques, data in real-world is often siloed across multiple distributed nodes unwilling to share the raw data because of privacy concerns. We address this practical constraint by developing D-EIGAN, the first distributed PRL method that learns representations at each node without transmitting the source data. We theoretically analyze the behavior of adversaries under the optimal EIGAN and D-EIGAN encoders and the impact of dependencies among ally and adversary tasks on the optimization objective. Our experiments on various datasets demonstrate the advantages of EIGAN in terms of performance, robustness, and scalability. In particular, EIGAN outperforms the previous state-of-the-art by a significant accuracy margin (47% improvement), and D-EIGAN's performance is consistently on par with EIGAN under different network settings.

**Aligned Multi-Task Gaussian Process**

Olga Mikheeva · Ieva Kazlauskaite · Adam Hartshorne · Hedvig Kjellström · Carl Henrik Ek · Neill Campbell

Multi-task learning requires accurate identification of the correlations between tasks. In real-world time-series, tasks are rarely perfectly temporally aligned; traditional multi-task models do not account for this and subsequent errors in correlation estimation will result in poor predictive performance and uncertainty quantification. We introduce a method that automatically accounts for temporal misalignment in a unified generative model that improves predictive performance. Our method uses Gaussian processes (GPs) to model the correlations both within and between the tasks. Building on the previous work by Kazlauskaite et al. (2019), we include a separate monotonic warp of the input data to model temporal misalignment. In contrast to previous work, we formulate a lower bound that accounts for uncertainty in both the estimates of the warping process and the underlying functions. Also, our new take on a monotonic stochastic process, with efficient path-wise sampling for the warp functions, allows us to perform full Bayesian inference in the model rather than MAP estimates. Missing data experiments, on synthetic and real time-series, demonstrate the advantages of accounting for misalignments (vs standard unaligned method) as well as modelling the uncertainty in the warping process (vs baseline MAP alignment approach).

**Infinitely Deep Bayesian Neural Networks with Stochastic Differential Equations**

Winnie Xu · Ricky T. Q. Chen · Xuechen Li · David Duvenaud

We perform scalable approximate inference in continuous-depth Bayesian neural networks. In this model class, uncertainty about separate weights in each layer gives hidden units that follow a stochastic differential equation. We demonstrate gradient-based stochastic variational inference in this infinite-parameter setting, producing arbitrarily-flexible approximate posteriors. We also derive a novel gradient estimator that approaches zero variance as the approximate posterior over weights approaches the true posterior. This approach brings continuous-depth Bayesian neural nets to a competitive comparison against discrete-depth alternatives, while inheriting the memory-efficient training and tunable precision of Neural ODEs.

We study modeling joint densities over sets of random variables (next-step movements of multiple agents) which are conditioned on aligned observations (past trajectories). For this setting, we propose an autoregressive approach to model intra-timestep dependencies, where distributions over joint movements are represented by autoregressive factorizations. In our approach, factors are randomly ordered and estimated with a graph neural network to account for permutation equivariance, while a recurrent neural network encodes past trajectories. We further propose a conditional two-stream attention mechanism, to allow for efficient training of random factorizations. We experiment on trajectory data from professional soccer matches and find that we model low frequency trajectories better than variational approaches.

Recent studies have empirically investigated different methods to train stochastic neural networks on a classification task by optimising a PAC-Bayesian bound via stochastic gradient descent. Most of these procedures need to replace the misclassification error with a surrogate loss, leading to a mismatch between the optimisation objective and the actual generalisation bound. The present paper proposes a novel training algorithm that optimises the PAC-Bayesian bound, without relying on any surrogate loss. Empirical results show that this approach outperforms currently available PAC-Bayesian training methods.

**Non-stationary Online Learning with Memory and Non-stochastic Control**

Peng Zhao · Yu-Xiang Wang · Zhi-Hua Zhou

We study the problem of Online Convex Optimization (OCO) with memory, which allows loss functions to depend on past decisions and thus captures temporal effects of learning problems. In this paper, we introduce \emph{dynamic policy regret} as the performance measure to design algorithms robust to non-stationary environments, which competes algorithms' decisions with a sequence of changing comparators. We propose a novel algorithm for OCO with memory that provably enjoys an optimal dynamic policy regret. The key technical challenge is how to control the \emph{switching cost}, the cumulative movements of player's decisions, which is neatly addressed by a novel decomposition of dynamic policy regret and a careful design of meta-learner and base-learner that explicitly regularizes the switching cost. The results are further applied to tackle non-stationarity in \emph{online non-stochastic control} [Agarwal et al., 2019], i.e., controlling a linear dynamical system with adversarial disturbance and convex cost functions. We derive a novel gradient-based controller with dynamic policy regret guarantees, which is the first controller provably competitive to a sequence of changing policies for online non-stochastic control.

We consider a graph-structured change point problem in which we observe a random vector with piece-wise constant but otherwise unknown mean and whose independent, sub-Gaussian coordinates correspond to the $n$ nodes of a fixed graph. We are interested in the localisation task of recovering the partition of the nodes associated to the constancy regions of the mean vector or, equivalently, of estimating the cut separating the sub-graphs over which the mean remains constant. Although graph-valued signals of this type have been previously studied in the literature for the different tasks of testing for the presence of an anomalous cluster and of estimating the mean vector, no localisation results are known outside the classical case of chain graphs. When the partition $\mathcal{S}$ consists of only two elements, we characterise the difficulty of the localisation problem in terms of four key parameters: the maximal noise variance $\sigma^2$, the size $\Delta$ of the smaller element of the partition, the magnitude $\kappa$ of the difference in the signal values across contiguous elements of the partition and the sum of the effective resistance edge weights $|\partial_r(\mathcal{S})|$ of the corresponding cut -- a graph theoretic quantity quantifying the size of the partition boundary. In particular, we demonstrate an information theoretical lower bound implying that, in the low signal-to-noise ratio regime $\kappa^2 \Delta \sigma^{-2} |\partial_r(\mathcal{S})|^{-1} \lesssim 1$, no consistent estimator of the true partition exists. On the other hand, when $\kappa^2 \Delta \sigma^{-2} |\partial_r(\mathcal{S})|^{-1} \gtrsim \zeta_n \log\{r(|E|)\}$, with $r(|E|)$ being the sum of effective resistance weighted edges and $\zeta_n$ being any diverging sequence in $n$, we show that a polynomial-time, approximate $\ell_0$-penalised least squared estimator delivers a localisation error -- measured by the symmetric difference between the true and estimated partition -- of order $ \kappa^{-2} \sigma^2 |\partial_r(\mathcal{S})| \log\{r(|E|)\}$. Aside from the $\log\{r(|E|)\}$ term, this rate is minimax optimal. Finally, we provide discussions on the localisation error for more general partitions of unknown sizes.

**The Curse Revisited: When are Distances Informative for the Ground Truth in Noisy High-Dimensional Data?**

Robin Vandaele · Bo Kang · Tijl De Bie · Yvan Saeys

Distances between data points are widely used in machine learning applications. Yet, when corrupted by noise, these distances---and thus the models based upon them---may lose their usefulness in high dimensions. Indeed, the small marginal effects of the noise may then accumulate quickly, shifting empirical closest and furthest neighbors away from the ground truth. In this paper, we exactly characterize such effects in noisy high-dimensional data using an asymptotic probabilistic expression. Previously, it has been argued that neighborhood queries become meaningless and unstable when distance concentration occurs, which means that there is a poor relative discrimination between the furthest and closest neighbors in the data. However, we conclude that this is not necessarily the case when we decompose the data in a ground truth---which we aim to recover---and noise component. More specifically, we derive that under particular conditions, empirical neighborhood relations affected by noise are still likely to be truthful even when distance concentration occurs. We also include thorough empirical verification of our results, as well as interesting experiments in which our derived `phase shift' where neighbors become random or not turns out to be identical to the phase shift where common dimensionality reduction methods perform poorly or well for recovering low-dimensional reconstructions of high-dimensional data with dense noise.

Labelling of data for supervised learning can be costly and time-consuming and the risk of incorporating label noise in large data sets is imminent. When training a flexible discriminative model using a strictly proper loss, such noise will inevitably shift the solution towards the conditional distribution over noisy labels. Nevertheless, while deep neural networks have proven capable of fitting random labels, regularisation and the use of robust loss functions empirically mitigate the effects of label noise. However, such observations concern robustness in accuracy, which is insufficient if reliable uncertainty quantification is critical. We demonstrate this by analysing the properties of the conditional distribution over noisy labels for an input-dependent noise model. In addition, we evaluate the set of robust loss functions characterised by noise-insensitive, asymptotic risk minimisers. We find that strictly proper and robust loss functions both offer asymptotic robustness in accuracy, but neither guarantee that the final model is calibrated. Moreover, even with robust loss functions, overfitting is an issue in practice. With these results, we aim to explain observed robustness of common training practices, such as early stopping, to label noise. In addition, we aim to encourage the development of new noise-robust algorithms that not only preserve accuracy but that also ensure reliability.

Recent work on permutation equivariant neural networkshas mostly focused on the first order case (sets) and second order case (graphs). We describe the machinery for generalizing permutation equivariance to arbitrary $k$-ary interactions between entities for any value of $k$.We demonstrate the effectiveness of higher orderpermutation equivariant models on several real world applications and find that our resultscompare favorably to existing permutation invariant/equivariant baselines.

The foundational concept of Max-Margin in machine learning is ill-posed for output spaces with more than two labels such as in structured prediction. In this paper, we show that the Max-Margin loss can only be consistent to the classification task under highly restrictive assumptions on the discrete loss measuring the error between outputs. These conditions are satisfied by distances defined in tree graphs, for which we prove consistency, thus being the first losses shown to be consistent for Max-Margin beyond the binary setting. We finally address these limitations by correcting the concept of Max-Margin and introducing the Restricted-Max-Margin, where the maximization of the loss-augmented scores is maintained, but performed over a subset of the original domain. The resulting loss is also a generalization of the binary support vector machine and it is consistent under milder conditions on the discrete loss.

Predicting stochastic spreading processes on complex networks is critical in epidemic control, opinion propagation, and viral marketing. We focus on the problem of inferring the time-dependent marginal probabilities of states for each node which collectively quantifies the spreading results. Dynamic Message Passing (DMP) has been developed as an efficient inference algorithm for several spreading models, and it is asymptotically exact on locally tree-like networks. However, DMP can struggle in diffusion networks with lots of local loops. We address this limitation by using Graph Neural Networks (GNN) to learn the dependency amongst messages implicitly. Specifically, we propose a hybrid model in which the GNN module runs jointly with DMP equations. The GNN module refines the aggregated messages in DMP iterations by learning from simulation data. We demonstrate numerically that after training, our model's inference accuracy substantially outperforms DMP in conditions of various network structure and dynamics parameters. Moreover, compared to pure data-driven models, the proposed hybrid model has a better generalization ability for out-of-training cases, profiting from the explicitly utilized dynamics priors in the hybrid model. A PyTorch implementation of our model is at https://github.com/FeiGSSS/NEDMP.

Probabilistic time series forecasting has played critical role in decision-making processes due to its capability to quantify uncertainties. Deep forecasting models, however, could be prone to input perturbations, and the notion of such perturbations, together with that of robustness, has not even been completely established in the regime of probabilistic forecasting. In this work, we propose a framework for robust probabilistic time series forecasting. First, we generalize the concept of adversarial input perturbations, based on which we formulate the concept of robustness in terms of bounded Wasserstein deviation. Then we extend the randomized smoothing technique to attain robust probabilistic forecasters with theoretical robustness certificates against certain classes of adversarial perturbations. Lastly, extensive experiments demonstrate that our methods are empirically effective in enhancing the forecast quality under additive adversarial attacks and forecast consistency under supplement of noisy observations. The code for our experiments is available at https://github.com/tetrzim/robust-probabilistic-forecasting.

**Triple-Q: A Model-Free Algorithm for Constrained Reinforcement Learning with Sublinear Regret and Zero Constraint Violation**

Honghao Wei · Xin Liu · Lei Ying

This paper presents the first {\em model-free}, {\em simulator-free} reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation. The algorithm is named {\em Triple-Q} because it includes three key components: a Q-function (also called action-value function) for the cumulative reward, a Q-function for the cumulative utility for the constraint, and a virtual-Queue that (over)-estimates the cumulative constraint violation. Under Triple-Q, at each step, an action is chosen based on the pseudo-Q-value that is a combination of the three ``Q'' values. The algorithm updates the reward and utility Q-values with learning rates that depend on the visit counts to the corresponding (state, action) pairs and are periodically reset. In the episodic CMDP setting, Triple-Q achieves $\tilde{\cal O}\left(\frac{1 }{\delta}H^4 S^{\frac{1}{2}}A^{\frac{1}{2}}K^{\frac{4}{5}} \right)$ regret\footnote{{\bf Notation:} $f(n) = \tilde{\mathcal O}(g(n))$ denotes $f(n) = {\mathcal O}(g(n){\log}^k n)$ with $k>0.$ The same applies to $\tilde{\Omega}.$ $\mathbb R^+$ denotes non-negative real numbers. $[H]$ denotes the set $\{1,2,\cdots, H\}.$ }, where $K$ is the total number of episodes, $H$ is the number of steps in each episode, $S$ is the number of states, $A$ is the number of actions, and $\delta$ is Slater's constant. Furthermore, {Triple-Q} guarantees zero constraint violation, both on expectation and with a high probability, when $K$ is sufficiently large. Finally, the computational complexity of {Triple-Q} is similar to SARSA for unconstrained MDPs, and is computationally efficient.

Post-deployment monitoring of ML systems is critical for ensuring reliability, especially as new user inputs can differ from the training distribution. Here we propose a novel approach, MLDemon, for ML DEployment MONitoring. MLDemon integrates both unlabeled data and a small amount of on-demand labels to produce a real-time estimate of the ML model’s current performance on a given data stream. Subject to budget constraints, MLDemon decides when to acquire additional, potentially costly, expert supervised labels to verify the model. On temporal datasets with diverse distribution drifts and models, MLDemon outperforms existing approaches. Moreover, we provide theoretical analysis to show that MLDemon is minimax rate optimal for a broad class of distribution drifts.

We consider the problem of learning the optimal policy for infinite-horizon Markov decision processes (MDPs). For this purpose, some variant of Stochastic Mirror Descent is proposed for convex programming problems with Lipschitz-continuous functionals. An important detail is the ability to use inexact values of functional constraints and compute the value of dual variables. We analyze this algorithm in a general case and obtain an estimate of the convergence rate that does not accumulate errors during the operation of the method. Using this algorithm, we get the first parallel algorithm for mixing average-reward MDPs with a generative model without reduction to discounted MDP. One of the main features of the presented method is low communication costs in a distributed centralized setting, even with very large networks.

**SAN: Stochastic Average Newton Algorithm for Minimizing Finite Sums**

Jiabin CHEN · Rui YUAN · Guillaume Garrigos · Robert Gower

We present a principled approach for designing stochastic Newton methods for solving finite sum optimization problems. Our approach has two steps. First, we re-write the stationarity conditions as a system of nonlinear equations that associates each data point to a new row. Second, we apply a Subsampled Newton Raphson method to solve this system of nonlinear equations. Using our approach, we develop a new Stochastic Average Newton (SAN) method, which is incremental by design, in that it requires only a single data point per iteration. It is also cheap to implement when solving regularized generalized linear models, with a cost per iteration of the order of the number of the parameters. We show through extensive numerical experiments that SAN requires no knowledge about the problem, neither parameter tuning, while remaining competitive as compared to classical variance reduced gradient methods (e.g. SAG and SVRG), incremental Newton and quasi-Newton methods (e.g. SNM, IQN).

**DEANN: Speeding up Kernel-Density Estimation using Approximate Nearest Neighbor Search**

Matti Karppa · Martin Aumüller · Rasmus Pagh

Kernel Density Estimation (KDE) is a nonparametric method for estimatig the shape of a density function, given a set of samples from the distribution. Recently, locality-sensitive hashing, originally proposed as a tool for nearest neighbor search, has been shown to enable fast KDE data structures. However, these approaches do not take advantage of the many other advances that have been made in algorithms for nearest neighbor algorithms. We present an algorithm called Density Estimation from Approximate Nearest Neighbors (DEANN) where we apply Approximate Nearest Neighbor (ANN) algorithms as a black box subroutine to compute an unbiased KDE. The idea is to find points that have a large contribution to the KDE using ANN, compute their contribution exactly, and approximate the remainder with Random Sampling (RS). We present a theoretical argument that supports the idea that an ANN subroutine can speed up the evaluation. Furthermore, we provide a C++ implementation with a Python interface that can make use of an arbitrary ANN implementation as a subroutine for KDE evaluation. We show empirically that our implementation outperforms state of the art implementations in all high dimensional datasets we considered, and matches the performance of RS in cases where the ANN yield no gains in performance.

**Tight bounds for minimum $\ell_1$-norm interpolation of noisy data**

Guillaume Wang · Konstantin Donhauser · Fanny Yang

We provide matching upper and lower bounds of order $\sigma^2/\log(d/n)$ for the prediction error of the minimum $\ell_1$-norm interpolator, a.k.a. basis pursuit. Our result is tight up to negligible terms when $d \gg n$, and is the first to imply asymptotic consistency of noisy minimum-norm interpolation for isotropic features and sparse ground truths. Our work complements the literature on "benign overfitting" for minimum $\ell_2$-norm interpolation, where asymptotic consistency can be achieved only when the features are effectively low-dimensional.

**Crowdsourcing Regression: A Spectral Approach**

Yaniv Tenzer · Omer Dror · Boaz Nadler · Erhan Bilal · Yuval Kluger

Merging the predictions of multiple experts is a frequent task. When ground-truth response values are available, this merging is often based on the estimated accuracies of the experts. In various applications, however, the only available information are the experts' predictions on unlabeled test data, which do not allow to directly estimate their accuracies. Moreover, simple merging schemes such as majority voting in classification or the ensemble mean or median in regression, are clearly sub-optimal when some experts are more accurate than others.Focusing on regression tasks, in this work we propose U-PCR, a framework for {\em unsupervised ensemble regression}. Specifically, we develop spectral-based methods that under mild assumptions and in the absence of ground truth data, are able to estimate the mean squared error of the different experts and combine their predictions to a more accurate meta-learner. We provide theoretical support for U-PCR as well as empirical evidence for the validity of its underlying assumptions. On a variety of regression problems, we illustrate the improved accuracy of U-PCR over various unsupervised merging strategies. Finally, we also illustrate its applicability to unsupervised multi-class ensemble learning.

**Efficient interventional distribution learning in the PAC framework**

Arnab Bhattacharyya · Sutanu Gayen · Saravanan Kandasamy · Vedant Raval · Vinodchandran N. Variyam

We consider the problem of efficiently inferring interventional distributions in a causal Bayesian network from a finite number of observations. Let P be a causal model on a set V of observable variables on a given causal graph G. For sets X,Y \subseteq V, and setting x to X, P*x(Y) denotes the interventional distribution on Y with respect to an intervention x to variables X. Shpitser and Pearl (AAAI 2006), building on the work of Tian and Pearl (AAAI 2001), proved that the {\em ID algorithm} is sound and complete for recovering P*x(Y) from observations.We give the first provably efficient version of the ID algorithm. In particular, under natural assumptions, we give a polynomial-time algorithm that on input a causal graph G on observable variables V, a setting x of a set X \subseteq V of bounded size, outputs succinct descriptions of both an evaluator and a generator for a distribution \hat{P} that is epsilon-close (in total variation distance) to P*x(Y) where Y = V \ X, if P*x(Y) is identifiable. We also show that when Y is an arbitrary subset of V \ X, there is no efficient algorithm that outputs an evaluator of a distribution that is epsilon-close to P_x(Y) unless all problems that have statistical zero-knowledge proofs, including the Graph Isomorphism problem, have efficient randomized algorithms.

**Improved Algorithms for Misspecified Linear Markov Decision Processes**

Daniel Vial · Advait Parulekar · Sanjay Shakkottai · R Srikant

For the misspecified linear Markov decision process (MLMDP) model of Jin et al. [2020], we propose an algorithm with three desirable properties. (P1) Its regret after K episodes scales as Kmax{εmis,εtol}, where εmis is the degree of misspecification and εtol is a user-specified error tolerance. (P2) Its space and per-episode time complexities remain bounded as K→∞. (P3) It does not require εmis as input. To our knowledge, this is the first algorithm satisfying all three properties. For concrete choices of εtol, we also improve existing regret bounds (up to log factors) while achieving either (P2) or (P3) (existing algorithms satisfy neither). At a high level, our algorithm generalizes (to MLMDPs) and refines the Sup-Lin-UCB algorithm, which Takemura et al. [2021] recently showed satisfies (P3) in the contextual bandit setting. We also provide an intuitive interpretation of their result, which informs the design of our algorithm.

**p-Generalized Probit Regression and Scalable Maximum Likelihood Estimation via Sketching and Coresets**

Alexander Munteanu · Simon Omlor · Christian Peters

We study the $p$-generalized probit regression model, which is a generalized linear model for binary responses. It extends the standard probit model by replacing its link function, the standard normal cdf, by a $p$-generalized normal distribution for $p\in[1, \infty)$. The $p$-generalized normal distributions (Subbotin, 1923) are of special interest in statistical modeling because they fit much more flexibly to data. Their tail behavior can be controlled by choice of the parameter $p$, which influences the model's sensitivity to outliers. Special cases include the Laplace, the Gaussian, and the uniform distributions. We further show how the maximum likelihood estimator for $p$-generalized probit regression can be approximated efficiently up to a factor of $(1+\varepsilon)$ on large data by combining sketching techniques with importance subsampling to obtain a small data summary called coreset.

An important problem in machine learning is the ability to learn tasks in a sequential manner. If trained with standard first-order methods most models forget previously learned tasks when trained on a new task, which is often referred to as catastrophic forgetting.A popular approach to overcome forgetting is to regularize the loss function by penalizing models that perform poorly on previous tasks. For example, elastic weight consolidation (EWC) regularizes with a quadratic form involving a diagonal matrix build based on past data. While EWC works very well for some setups, we show that, even under otherwise ideal conditions, it can provably suffer catastrophic forgetting if the diagonal matrix is a poor approximation of the Hessian matrix of previous tasks. We propose a simple approach to overcome this: Regularizing training of a new task with sketches of the Jacobian matrix of past data. This provably enables overcoming catastrophic forgetting for linear models and for wide neural networks, at the cost of memory. The overarching goal of this paper is to provided insights on when regularization-based continual learning algorithms work and under what memory costs.

**Discovering Inductive Bias with Gibbs Priors: A Diagnostic Tool for Approximate Bayesian Inference**

Luca Rendsburg · Agustinus Kristiadi · Philipp Hennig · Ulrike von Luxburg

Full Bayesian posteriors are rarely analytically tractable, which is why real-world Bayesian inference heavily relies on approximate techniques.Approximations generally differ from the true posterior and require diagnostic tools to assess whether the inference can still be trusted.We investigate a new approach to diagnosing approximate inference: the approximation mismatch is attributed to a change in the inductive bias by treating the approximations as exact and reverse-engineering the corresponding prior.We show that the problem is more complicated than it appears to be at first glance, because the solution generally depends on the observation.By reframing the problem in terms of incompatible conditional distributions we arrive at a natural solution: the Gibbs prior.The resulting diagnostic is based on pseudo-Gibbs sampling, which is widely applicable and easy to implement.We illustrate how the Gibbs prior can be used to discover the inductive bias in a controlled Gaussian setting and for a variety of Bayesian models and approximations.

**Obtaining Causal Information by Merging Datasets with MAXENT**

Sergio Garrido Mejia · Elke Kirschbaum · Dominik Janzing

The investigation of the question “which treatment has a causal effect on a target variable?” is of particular relevance in a large number of scientific disciplines. This challenging task becomes even more difficult if not all treatment variables were or even can not be observed jointly with the target variable. In this paper, we discuss how causal knowledge can be obtained without having observed all variables jointly, but by merging the statistical information from different datasets. We show how the maximum entropy principle can be used to identify edges among random variables when assuming causal sufficiency and an extended version of faithfulness, and when only subsets of the variables have been observed jointly.

**SHAFF: Fast and consistent SHApley eFfect estimates via random Forests**

Clément Bénard · Gérard Biau · Sébastien Da Veiga · Erwan Scornet

Interpretability of learning algorithms is crucial for applications involving critical decisions, and variable importance is one of the main interpretation tools. Shapley effects are now widely used to interpret both tree ensembles and neural networks, as they can efficiently handle dependence and interactions in the data, as opposed to most other variable importance measures. However, estimating Shapley effects is a challenging task, because of the computational complexity and the conditional expectation estimates. Accordingly, existing Shapley algorithms have flaws: a costly running time, or a bias when input variables are dependent. Therefore, we introduce SHAFF, SHApley eFfects via random Forests, a fast and accurate Shapley effect estimate, even when input variables are dependent. We show SHAFF efficiency through both a theoretical analysis of its consistency, and the practical performance improvements over competitors with extensive experiments. An implementation of SHAFF in C++ and R is available online.

**Probing GNN Explainers: A Rigorous Theoretical and Empirical Analysis of GNN Explanation Methods**

Chirag Agarwal · Marinka Zitnik · Himabindu Lakkaraju

As Graph Neural Networks (GNNs) are increasingly being employed in critical real-world applications, several methods have been proposed in recent literature to explain the predictions of these models. However, there has been little to no work on systematically analyzing the reliability of these methods. Here, we introduce the first-ever theoretical analysis of the reliability of state-of-the-art GNN explanation methods. More specifically, we theoretically analyze the behavior of various state-of-the-art GNN explanation methods with respect to several desirable properties (e.g., faithfulness, stability, and fairness preservation) and establish upper bounds on the violation of these properties. We also empirically validate our theoretical results using extensive experimentation with nine real-world graph datasets. Our empirical results further shed light on several interesting insights about the behavior of state-of-the-art GNN explanation methods.

Motivated by the tremendous success of boosting methods in the standard centralized model of learning, we initiate the theory of boosting in the Federated Learning setting. The primary challenges in the Federated Learning setting are heterogeneity in client data and the requirement that no client data can be transmitted to the server. We develop federated functional gradient boosting (FFGB) an algorithm that is designed to handle these challenges. Under appropriate assumptions on the weak learning oracle, the FFGB algorithm is proved to efficiently converge to certain neighborhoods of the global optimum. The radii of these neighborhoods depend upon the level of heterogeneity measured via the total variation distance and the much tighter Wasserstein-1 distance, and diminish to zero as the setting becomes more homogeneous. In practice, as suggested by our theoretical findings, we propose using FFGB to warm-start existing Federated Learning solvers and observe significant performance boost in highly heterogeneous settings. The code can be found here.

**Sampling in Dirichlet Process Mixture Models for Clustering Streaming Data**

Or Dinari · Oren Freifeld

Practical tools for clustering streaming data must be fast enough to handle the arrival rate of the observations. Typically, they also must adapt on the fly to possible lack of stationarity; i.e., the data statistics may be time-dependent due to various forms of drifts, changes in the number of clusters, etc. The Dirichlet Process Mixture Model (DPMM), whose Bayesian nonparametric nature allows it to adapt its complexity to the data, seems a natural choice for the streaming-data case. In its classical formulation, however, the DPMM cannot capture common types of drifts in the data statistics. Moreover, and regardless of that limitation, existing methods for online DPMM inference are too slow to handle rapid data streams. In this work we propose adapting both the DPMM and a known DPMM sampling-based non-streaming inference method for streaming-data clustering. We demonstrate the utility of the proposed method on several challenging settings, where it obtains state-of-the-art results while being on par with other methods in terms of speed.

**Robust Deep Learning from Crowds with Belief Propagation**

Hoyoung Kim · Seunghyuk Cho · Dongwoo Kim · Jungseul Ok

Crowdsourcing systems enable us to collect large-scale dataset, but inherently suffer from noisy labels of low-paid workers. We address the inference and learning problems using such a crowdsourced dataset with noise. Due to the nature of sparsity in crowdsourcing, it is critical to exploit both probabilistic model to capture worker prior and neural network to extract task feature despite risks from wrong prior and overfitted feature in practice. We hence establish a neural-powered Bayesian framework, from which we devise deepMF and deepBP with different choice of variational approximation methods, mean field (MF) and belief propagation (BP), respectively. This provides a unified view of existing methods, which are special cases of deepMF with different priors. In addition, our empirical study suggests that deepBP is a new approach, which is more robust against wrong prior, feature overfitting and extreme workers thanks to the more sophisticated BP than MF.

**Practical Schemes for Finding Near-Stationary Points of Convex Finite-Sums**

Kaiwen Zhou · Lai Tian · Anthony Man-Cho So · James Cheng

In convex optimization, the problem of finding near-stationary points has not been adequately studied yet, unlike other optimality measures such as the function value. Even in the deterministic case, the optimal method (OGM-G, due to Kim and Fessler (2021)) has just been discovered recently. In this work, we conduct a systematic study of algorithmic techniques for finding near-stationary points of convex finite-sums. Our main contributions are several algorithmic discoveries: (1) we discover a memory-saving variant of OGM-G based on the performance estimation problem approach (Drori and Teboulle, 2014); (2) we design a new accelerated SVRG variant that can simultaneously achieve fast rates for minimizing both the gradient norm and function value; (3) we propose an adaptively regularized accelerated SVRG variant, which does not require the knowledge of some unknown initial constants and achieves near-optimal complexities. We put an emphasis on the simplicity and practicality of the new schemes, which could facilitate future work.

**Robust Bayesian Inference for Simulator-based Models via the MMD Posterior Bootstrap**

Charita Dellaporta · Jeremias Knoblauch · Theodoros Damoulas · Francois-Xavier Briol

Simulator-based models are models for which the likelihood is intractable but simulation of synthetic data is possible. They are often used to describe complex real-world phenomena, and as such can often be misspecified in practice. Unfortunately, existing Bayesian approaches for simulators are known to perform poorly in those cases. In this paper, we propose a novel algorithm based on the posterior bootstrap and maximum mean discrepancy estimators. This leads to a highly-parallelisable Bayesian inference algorithm with strong robustness properties. This is demonstrated through an in-depth theoretical study which includes generalisation bounds and proofs of frequentist consistency and robustness of our posterior. The approach is then assessed on a range of examples including a g-and-k distribution and a toggle-switch model.

**Statistical Depth Functions for Ranking Distributions: Definitions, Statistical Learning and Applications**

Morgane Goibert · Stephan Clemencon · Ekhine Irurozki · Pavlo Mozharovskyi

The concept of \textit{median/consensus} has been widely investigated in order to provide a statistical summary of ranking data, \textit{i.e.} realizations of a random permutation $\Sigma$ of a finite set, $\{1,\; \ldots,\; n\}$ with $n\geq 1$ say. As it sheds light onto only one aspect of $\Sigma$'s distribution $P$, it may neglect other informative features. It is the purpose of this paper to define analogues of quantiles, ranks and statistical procedures based on such quantities for the analysis of ranking data by means of a metric-based notion of \textit{depth function} on the symmetric group. Overcoming the absence of vector space structure on $\mathfrak{S}_n$, the latter defines a center-outward ordering of the permutations in the support of $P$ and extends the classic metric-based formulation of \textit{consensus ranking} (\textit{medians} corresponding then to the \textit{deepest} permutations). The axiomatic properties that \textit{ranking depths} should ideally possess are listed, while computational and generalization issues are studied at length. Beyond the theoretical analysis carried out, the relevance of the novel concepts and methods introduced for a wide variety of statistical tasks are also supported by numerous numerical experiments.

**Last Layer Marginal Likelihood for Invariance Learning**

Pola Schwöbel · Martin Jørgensen · Sebastian Ober · Mark van der Wilk

Data augmentation is often used to incorporate inductive biases into models. Traditionally, these are hand-crafted and tuned with cross validation. The Bayesian paradigm for model selection provides a path towards end-to-end learning of invariances using only the training data, by optimising the marginal likelihood.Computing the marginal likelihood is hard for neural networks, but success with tractable approaches that compute the marginal likelihood for the last layer only raises the question of whether this convenient approach might be employed for learning invariances. We show partial success on standard benchmarks, in the low-data regime and on a medical imaging dataset by designing a custom optimisation routine. Introducing a new lower bound to the marginal likelihood allows us to perform inference for a larger class of likelihood functions than before. On the other hand, we demonstrate failure modes on the CIFAR10 dataset, where the last layer approximation is not sufficient due to the increased complexity of our neural network. Our results indicate that once more sophisticated approximations become available the marginal likelihood is a promising approach for invariance learning in neural networks.

**ExactBoost: Directly Boosting the Margin in Combinatorial and Non-decomposable Metrics**

Daniel Csillag · Carolina Piazza · Thiago Ramos · João Vitor Romano · Roberto Oliveira · Paulo Orenstein

Many classification algorithms require the use of surrogate losses when the intended loss function is combinatorial or non-decomposable. This paper introduces a fast and exact stagewise optimization algorithm, dubbed ExactBoost, that boosts stumps to the actual loss function. By developing a novel extension of margin theory to the non-decomposable setting, it is possible to provably bound the generalization error of ExactBoost for many important metrics with different levels of non-decomposability. Through extensive examples, it is shown that such theoretical guarantees translate to competitive empirical performance. In particular, when used as an ensembler, ExactBoost is able to significantly outperform other surrogate-based and exact algorithms available.

**AdaBlock: SGD with Practical Block Diagonal Matrix Adaptation for Deep Learning**

Jihun Yun · Aurelie Lozano · Eunho Yang

We introduce AdaBlock, a class of adaptive gradient methods that extends popular approaches such as Adam by adopting the simple and natural idea of using block-diagonal matrix adaption to effectively utilize structural characteristics of deep learning architectures. Unlike other quadratic or block-diagonal approaches, AdaBlock has complete freedom to select block-diagonal groups, providing a wider trade-off applicable even to extremely high-dimensional problems. We provide convergence and generalization error bounds for AdaBlock, and study both theoretically and empirically the impact of the block size on the bounds and advantages over usual diagonal approaches. In addition, we propose a randomized layer-wise variant of Adablock to further reduce computations and memory footprint, and devise an efficient spectrum-clipping scheme for AdaBlock to benefit from Sgd's superior generalization performance. Extensive experiments on several deep learning tasks demonstrate the benefits of block diagonal adaptation compared to adaptive diagonal methods, vanilla Sgd, as well as modified versions of full-matrix adaptation.

We consider a multi-armed bandit problem in which a set of arms is registered by each agent, and the agent receives reward when its arm is selected. An agent might strategically submit more arms with replications, which can bring more reward by abusing the bandit algorithm's exploration-exploitation balance. Our analysis reveals that a standard algorithm indeed fails at preventing replication and suffers from linear regret in time $T$. We aim to design a bandit algorithm which demotivates replications and also achieves a small cumulative regret. We devise Hierarchical UCB (H-UCB) of replication-proof, which has $O(\ln T)$-regret under any equilibrium. We further propose Robust Hierarchical UCB (RH-UCB) which has a sublinear regret even in a realistic scenario with irrational agents replicating careless. We verify our theoretical findings through numerical experiments.

**Point Cloud Generation with Continuous Conditioning**

Larissa Triess · Andre Bühler · David Peter · Fabian Flohr · Marius Zöllner

Generative models can be used to synthesize 3D objects of high quality and diversity. However, there is typically no control over the properties of the generated object.This paper proposes a novel generative adversarial network (GAN) setup that generates 3D point cloud shapes conditioned on a continuous parameter. In an exemplary application, we use this to guide the generative process to create a 3D object with a custom-fit shape. We formulate this generation process in a multi-task setting by using the concept of auxiliary classifier GANs. Further, we propose to sample the generator label input for training from a kernel density estimation (KDE) of the dataset. Our ablations show that this leads to significant performance increase in regions with few samples. Extensive quantitative and qualitative experiments show that we gain explicit control over the object dimensions while maintaining good generation quality and diversity.

**PAC Top-$k$ Identification under SST in Limited Rounds**

Arpit Agarwal · Sanjeev Khanna · Prathamesh Patil

We consider the problem of finding top-$k$ items from a set of $n$ items using actively chosen pairwise comparisons. This problem has been widely studied in machine learning and has widespread applications in recommendation systems, sports, social choice etc. Motivated by applications where there can be a substantial delay between requesting comparisons and receiving feedback, we consider an active/adaptive learning setting where the algorithm uses limited rounds of \emph{parallel} interaction with the feedback generating oracle.We study this problem under the \emph{strong stochastic transitivity} (SST) noise model which is a widely studied ranking model and captures many applications. A special case of this model is the noisy comparison model for which it was recently shown that $O(n \log k)$ comparisons and $\log^* n$ rounds of adaptivity are sufficient to find the set of top-$k$ items (Cohen-Addad et al., 2020; Braverman et al., 2019). Under the more general SST model, it is known that $O(n)$ comparisons and $O(n)$ rounds are sufficient to find a PAC top-1 item (Falahatgar et al., 2017a,b), however, not much seems to be known for general $k$, even given unbounded rounds of adaptivity. We first show that $\Omega (nk)$ comparisons are necessary for PAC top-$k$ identification under SST even with unbounded adaptivity, establishing that this problem is strictly harder under SST than it is for the noisy comparison model. Our main contribution is to show that the 2-round query complexity for this problem is $\widetilde{\Theta} (n^{4/3} + nk)$, and to show that just 3 rounds are sufficient to obtain a nearly optimal query complexity of $\widetilde{\Theta}(nk)$. We further show that our 3-round result can be improved by a $\log (n)$ factor using $2 \log^* n + 4$ rounds.

**CATVI: Conditional and Adaptively Truncated Variational Inference for Hierarchical Bayesian Nonparametric Models**

Yirui Liu · Xinghao Qiao · Jessica Lam

Current variational inference methods for hierarchical Bayesian nonparametric models can neither characterize the correlation structure among latent variables due to the mean-field setting, nor infer the true posterior dimension because of the universal truncation. To overcome these limitations, we propose the conditional and adaptively truncated variational inference method (CATVI) by maximizing the nonparametric evidence lower bound and integrating Monte Carlo into the variational inference framework. CATVI enjoys several advantages over traditional methods, including a smaller divergence between variational and true posteriors, reduced risk of underfitting or overfitting, and improved prediction accuracy. Empirical studies on three large datasets reveal that CATVI applied in Bayesian nonparametric topic models substantially outperforms competing models, providing lower perplexity and clearer topic-words clustering.

**Rejection sampling from shape-constrained distributions in sublinear time**

Sinho Chewi · Patrik Gerber · Chen Lu · Thibaut Le Gouic · Philippe Rigollet

We consider the task of generating exact samples from a target distribution, known up to normalization, over a finite alphabet. The classical algorithm for this task is rejection sampling, and although it has been used in practice for decades, there is surprisingly little study of its fundamental limitations. In this work, we study the query complexity of rejection sampling in a minimax framework for various classes of discrete distributions. Our results provide new algorithms for sampling whose complexity scales sublinearly with the alphabet size. When applied to adversarial bandits, we show that a slight modification of the EXP3 algorithm reduces the per-iteration complexity from O(K) to O(log(K) log(K/δ)) with probability 1-δ, where K is the number of arms.

**Fast and accurate optimization on the orthogonal manifold without retraction**

Pierre Ablin · Gabriel Peyré

We consider the problem of minimizing a function over the manifold of orthogonal matrices. The majority of algorithms for this problem compute a direction in the tangent space, and then use a retraction to move in that direction while staying on the manifold. Unfortunately, the numerical computation of retractions on the orthogonal manifold always involves some expensive linear algebra operation, such as matrix inversion, exponential or square-root. These operations quickly become expensive as the dimension of the matrices grows. To bypass this limitation, we propose the landing algorithm which does not use retractions. The algorithm is not constrained to stay on the manifold but its evolution is driven by a potential energy which progressively attracts it towards the manifold. One iteration of the landing algorithm only involves matrix multiplications, which makes it cheap compared to its retraction counterparts. We provide an analysis of the convergence of the algorithm, and demonstrate its promises on large-scale and deep learning problems, where it is faster and less prone to numerical errors than retraction-based methods.

**Testing Granger Non-Causality in Panels with Cross-Sectional Dependencies**

Lenon Minorics · Caner Turkmen · David Kernert · Patrick Bloebaum · Laurent Callot · Dominik Janzing

This paper proposes a new approach for testing Granger non-causality on panel data. Instead of aggregating panel member statistics, we aggregate their corresponding p-values and show that the resulting p-value approximately bounds the type I error by the chosen significance level even if the panel members are dependent. We compare our approach against the most widely used Granger causality algorithm on panel data and show that our approach yields lower FDR at the same power for large sample sizes and panels with cross sectional dependencies. Finally, we examine COVID-19 data about confirmed cases and deaths measured in countries/regions worldwide and show that our approach is able to discover the true causal relation between confirmed cases and deaths while state-of-the-art approaches fail.

Non-parametric goodness-of-fit testing procedures based on kernel Stein discrepancies (KSD) are promising approaches to validate general unnormalised distributions in various scenarios. Existing works focused on studying kernel choices to boost test performances. However, the choices of (non-unique) Stein operators also have considerable effect on the test performances. Inspired by the standardisation technique that was originally developed to better derive approximation properties for normal distributions, we present a unifying framework, called standardisation-function kernel Stein discrepancy (Sf-KSD), to study different Stein operators in KSD-based tests for goodness-of-fit. We derive explicitly how the proposed framework relates to existing KSD-based tests and show that Sf-KSD can be used as a guide to develop novel kernel-based non-parametric tests on complex data scenarios, e.g. truncated distributions or compositional data. Experimental results demonstrate that the proposed tests control type-I error well and achieve higher test power than existing approaches.

**On the Convergence Rate of Off-Policy Policy Optimization Methods with Density-Ratio Correction**

Jiawei Huang · Nan Jiang

In this paper, we study the convergence properties of off-policy policy optimization algorithms with state-action density ratio correction under function approximation setting, where the objective function is formulated as a max-max-min problem. We first clearly characterize the bias of the learning objective, and then present two strategies with finite-time convergence guarantees. In our first strategy, we propose an algorithm called P-SREDA with convergence rate $O(\epsilon^{-3})$, whose dependency on $\epsilon$ is optimal. Besides, in our second strategy, we design a new off-policy actor-critic style algorithm named O-SPIM. We prove that O-SPIM converges to a stationary point with total complexity $O(\epsilon^{-4})$, which matches the convergence rate of some recent actor-critic algorithms in the on-policy setting.

**Momentum Accelerates the Convergence of Stochastic AUPRC Maximization**

Guanghui Wang · MING YANG · Lijun Zhang · Tianbao Yang

In this paper, we study stochastic optimization of areas under precision-recall curves (AUPRC), which is widely used for combating imbalanced classification tasks. Although a few methods have been proposed for maximizing AUPRC, stochastic optimization of AUPRC with convergence guarantee remains an undeveloped territory. A state-of-the-art complexity is $O(1/\epsilon^5)$ for finding an $\epsilon$-stationary solution. In this paper, we further improve the stochastic optimization of AURPC by (i) developing novel stochastic momentum methods with a better iteration complexity of $O(1/\epsilon^4)$ for finding an $\epsilon$-stationary solution; and (ii) designing a novel family of stochastic adaptive methods with the same iteration complexity, which enjoy faster convergence in practice. To this end, we propose two innovative techniques that are critical for improving the convergence: (i) the biased estimators for tracking individual ranking scores are updated in a randomized coordinate-wise manner; and (ii) a momentum update is used on top of the stochastic gradient estimator for tracking the gradient of the objective. The novel analysis of Adam-style updates is also one main contribution. Extensive experiments on various data sets demonstrate the effectiveness of the proposed algorithms. Of independent interest, the proposed stochastic momentum and adaptive algorithms are also applicable to a class of two-level stochastic dependent compositional optimization problems.

**Safe Active Learning for Multi-Output Gaussian Processes**

Cen-You Li · Barbara Rakitsch · Christoph Zimmer

Multi-output regression problems are commonly encountered in science and engineering. In particular, multi-output Gaussian processes have been emerged as a promising tool for modeling these complex systems since they can exploit the inherent correlations and provide reliable uncertainty estimates. In many applications, however, acquiring the data is expensive and safety concerns might arise (e.g. robotics, engineering). We propose a safe active learning approach for multi-output Gaussian process regression. This approach queries the most informative data or output taking the relatedness between the regressors and safety constraints into account. We prove the effectiveness of our approach by providing theoretical analysis and by demonstrating empirical results on simulated datasets and on a real-world engineering dataset. On all datasets, our approach shows improved convergence compared to its competitors.

An increasing number of machine learning problems, such as robust or adversarial variants of existing algorithms, require minimizing a loss function that is itself defined as a maximum. Carrying a loop of stochastic gradient ascent (SGA) steps on the (inner) maximization problem, followed by an SGD step on the (outer) minimization, is known as Epoch Stochastic Gradient Descent Ascent(ESGDA). While successful in practice, the theoretical analysis of ESGDA remains challenging, with no clear guidance on choices for the inner loop size nor on the interplay between inner/outer step sizes. We propose RSGDA (Randomized SGDA), a variant of ESGDA with stochastic loop size with a simpler theoretical analysis. RSGDA comes with the first (among SGDA algorithms) almost sure convergence rates when used on nonconvex min/strongly-concave max settings. RSGDA can be parameterized using optimal loop sizes that guarantee the best convergence rates known to hold for SGDA. We test RSGDA on toy and larger scale problems, using distributionally robust optimization and single-cell data matching using optimal transport as a testbed.

**Semi-Implicit Hybrid Gradient Methods with Application to Adversarial Robustness**

Beomsu Kim · Junghoon Seo

Adversarial examples, crafted by adding imperceptible perturbations to natural inputs, can easily fool deep neural networks (DNNs). One of the most successful methods for training adversarially robust DNNs is solving a nonconvex-nonconcave minimax problem with an adversarial training (AT) algorithm. However, among the many AT algorithms, only Dynamic AT (DAT) and You Only Propagate Once (YOPO) is guaranteed to converge to a stationary point with rate O(1/K^{1/2}). In this work, we generalize the stochastic primal-dual hybrid gradient algorithm to develop semi-implicit hybrid gradient methods (SI-HGs) for finding stationary points of nonconvex-nonconcave minimax problems. SI-HGs have the convergence rate O(1/K), which improves upon the rate O(1/K^{1/2}) of DAT and YOPO. We devise a practical variant of SI-HGs, and show that it outperforms other AT algorithms in terms of convergence speed and robustness.

**Probabilistic Numerical Method of Lines for Time-Dependent Partial Differential Equations**

Nicholas Krämer · Jonathan Schmidt · Philipp Hennig

This work develops a class of probabilistic algorithms for the numerical solution of nonlinear, time-dependent partial differential equations (PDEs). Current state-of-the-art PDE solvers treat the space- and time-dimensions separately, serially, and with black-box algorithms, which obscures the interactions between spatial and temporal approximation errors and misguides the quantification of the overall error. To fix this issue, we introduce a probabilistic version of a technique called method of lines. The proposed algorithm begins with a Gaussian process interpretation of finite difference methods, which then interacts naturally with filtering-based probabilistic ordinary differential equation (ODE) solvers because they share a common language: Bayesian inference. Joint quantification of space- and time-uncertainty becomes possible without losing the performance benefits of well-tuned ODE solvers. Thereby, we extend the toolbox of probabilistic programs for differential equation simulation to PDEs.

**REPID: Regional Effect Plots with implicit Interaction Detection**

Julia Herbinger · Bernd Bischl · Giuseppe Casalicchio

Machine learning models can automatically learn complex relationships, such as non-linear and interaction effects.Interpretable machine learning methods such as partial dependence plots visualize marginal feature effects but may lead to misleading interpretations when feature interactions are present. Hence, employing additional methods that can detect and measure the strength of interactions is paramount to better understand the inner workings of machine learning models. We demonstrate several drawbacks of existing global interaction detection approaches, characterize them theoretically, and evaluate them empirically.Furthermore, we introduce regional effect plots with implicit interaction detection, a novel framework to detect interactions between a feature of interest and other features. The framework also quantifies the strength of interactions and provides interpretable and distinct regions in which feature effects can be interpreted more reliably, as they are less confounded by interactions.We prove the theoretical eligibility of our method and show its applicability on various simulation and real-world examples.

Bayesian optimization is a class of global optimization techniques. In Bayesian optimization, the underlying objective function is modeled as a realization of a Gaussian process. Although the Gaussian process assumption implies a random distribution of the Bayesian optimization outputs, quantification of this uncertainty is rarely studied in the literature. In this work, we propose a novel approach to assess the output uncertainty of Bayesian optimization algorithms, which proceeds by constructing confidence regions of the maximum point (or value) of the objective function. These regions can be computed efficiently, and their confidence levels are guaranteed by the uniform error bounds for sequential Gaussian process regression newly developed in the present work. Our theory provides a unified uncertainty quantification framework for all existing sequential sampling policies and stopping criteria.

**Noise Regularizes Over-parameterized Rank One Matrix Recovery, Provably**

Tianyi Liu · Yan Li · Enlu Zhou · Tuo Zhao

We investigate the role of noise in optimization algorithms for learning over-parameterized models. Specifically, we consider the recovery of a rank one matrix $Y^*\in R^{d\times d}$ from a noisy observation $Y$ using an over-parameterization model. Specifically, we parameterize the rank one matrix $Y^*$ by $XX^\top$, where $X\in R^{d\times d}$. We then show that under mild conditions, the estimator, obtained by the randomly perturbed gradient descent algorithm using the square loss function, attains a mean square error of $O(\sigma^2/d)$, where $\sigma^2$ is the variance of the observational noise. In contrast, the estimator obtained by gradient descent without random perturbation only attains a mean square error of $O(\sigma^2)$. Our result partially justifies the implicit regularization effect of noise when learning over-parameterized models, and provides new understanding of training over-parameterized neural networks.

We study the optimal sample complexity of learning a Gaussian directed acyclic graph (DAG) from observational data. Our main results establish the minimax optimal sample complexity for learning the structure of a linear Gaussian DAG model in two settings of interest: 1) Under equal variances without knowledge of the true ordering, and 2) For general linear models given knowledge of the ordering. In both cases the sample complexity is $n\asymp q\log(d/q)$, where $q$ is the maximum number of parents and $d$ is the number of nodes. We further make comparisons with the classical problem of learning (undirected) Gaussian graphical models, showing that under the equal variance assumption, these two problems share the same optimal sample complexity. In other words, at least for Gaussian models with equal error variances, learning a directed graphical model is statistically no more difficult than learning an undirected graphical model. Our results also extend to more general identification assumptions as well as subgaussian errors.

**Weighted Gaussian Process Bandits for Non-stationary Environments**

Yuntian Deng · Xingyu Zhou · Baekjin Kim · Ambuj Tewari · Abhishek Gupta · Ness Shroff

In this paper, we consider the Gaussian process (GP) bandit optimization problem in a non-stationary environment. To capture external changes, the black-box function is allowed to be time-varying within a reproducing kernel Hilbert space (RKHS). To this end, we develop WGP-UCB, a novel UCB-type algorithm based on weighted Gaussian process regression. A key challenge is how to cope with infinite-dimensional feature maps. To that end, we leverage kernel approximation techniques to prove a sublinear regret bound, which is the first (frequentist) sublinear regret guarantee on weighted time-varying bandits with general nonlinear rewards. This result generalizes both non-stationary linear bandits and standard GP-UCB algorithms. Further, a novel concentration inequality is achieved for weighted Gaussian process regression with general weights. We also provide universal upper bounds and weight-dependent upper bounds for weighted maximum information gains. These results are of independent interest for applications such as news ranking and adaptive pricing, where weights can be adopted to capture the importance or quality of data. Finally, we conduct experiments to highlight the favorable gains of the proposed algorithm in many cases when compared to existing methods.

**Non-separable Spatio-temporal Graph Kernels via SPDEs**

Alexander Nikitin · ST John · Arno Solin · Samuel Kaski

Gaussian processes (GPs) provide a principled and direct approach for inference and learning on graphs. However, the lack of justified graph kernels for spatio-temporal modelling has held back their use in graph problems. We leverage an explicit link between stochastic partial differential equations (SPDEs) and GPs on graphs, introduce a framework for deriving graph kernels via SPDEs, and derive non-separable spatio-temporal graph kernels that capture interaction across space and time. We formulate the graph kernels for the stochastic heat equation and wave equation. We show that by providing novel tools for spatio-temporal GP modelling on graphs, we outperform pre-existing graph kernels in real-world applications that feature diffusion, oscillation, and other complicated interactions.

**Modelling Non-Smooth Signals with Complex Spectral Structure**

Wessel Bruinsma · Martin Tegnér · Richard Turner

The Gaussian Process Convolution Model (GPCM; Tobar et al., 2015a) is a model for signals with complex spectral structure. A significant limitation of the GPCM is that it assumes a rapidly decaying spectrum: it can only model smooth signals. Moreover, inference in the GPCM currently requires (1) a mean-field assumption, resulting in poorly calibrated uncertainties, and (2) a tedious variational optimisation of large covariance matrices. We redesign the GPCM model to induce a richer distribution over the spectrum with relaxed assumptions about smoothness: the Causal Gaussian Process Convolution Model (CGPCM) introduces a causality assumption into the GPCM, and the Rough Gaussian Process Convolution Model (RGPCM) can be interpreted as a Bayesian nonparametric generalisation of the fractional Ornstein—Uhlenbeck process. We also propose a more effective variational inference scheme, going beyond the mean-field assumption: we design a Gibbs sampler which directly samples from the optimal variational solution, circumventing any variational optimisation entirely. The proposed variations of the GPCM are validated in experiments on synthetic and real-world data, showing promising results.

LIME (Locally Interpretable Model-Agnostic Explanations) has become a popular way of generating explanations for tabular, image and natural language models, providing insight into why an instance was given a particular classification. In this paper we adapt LIME to time series classification, an under-explored area with existing approaches failing to account for the structure of this kind of data. We frame the non-trivial challenge of adapting LIME to time series classification as the following open questions: `What is a meaningful interpretable representation of a time series?'',`

How does one realistically perturb a time series?'' and ``What is a local neighbourhood around a time series?''. We propose solutions to all three questions and combine them into a novel time series explanation framework called LIMESegment, which outperforms existing adaptations of LIME to time series on a variety of classification tasks.

**Two-way Sparse Network Inference for Count Data**

Sijia Li · Martín López-García · Neil Lawrence · Luisa Cutillo

Classically, statistical datasets have a larger number of data points than features (n > p). The standard model of classical statistics caters for the case where data points are considered conditionally independent given the parameters. However, for n ≈ p or p > n such models are poorly determined. Kalaitzis et al. (2013) introduced the Bigraphical Lasso, an estimator for sparse precision matrices based on the Cartesian product of graphs. Unfortunately, the original Bigraphical Lasso algorithm is not applicable in case of large p and n due to memory requirements. We exploit eigenvalue decomposition of the Cartesian product graph to present a more efficient version of the algorithm which reduces memory requirements from O(n^2p^2) to O(n^2 +p^2). Many datasets in different application fields, such as biology, medicine and social science, come with count data, for which Gaussian based models are not applicable. Our multiway network inference approach can be used for discrete data.Our methodology accounts for the dependencies across both instances and features, reduces the computational complexity for high dimensional data and enables to deal with both discrete and continuous data. Numerical studies on both synthetic and real datasets are presented to showcase the performance of our method.

As one of the most popular methods in the field of reinforcement learning, Q-learning has received increasing attention. Recently, there have been more theoretical works on the regret bound of algorithms that belong to the Q-learning class in different settings. In this paper, we analyze the cumulative regret when conducting Nash Q-learning algorithm on 2-player turn-based stochastic Markov games (2-TBSG), and propose the very first gap dependent logarithmic upper bounds in the episodic tabular setting. This bound matches the theoretical lower bound only up to a logarithmic term. Furthermore, we extend the conclusion to the discounted game setting with infinite horizon and propose a similar gap dependent logarithmic regret bound. Also, under the linear MDP assumption, we obtain another logarithmic regret for 2-TBSG, in both centralized and independent settings.

Making an informed decision---for example, when choosing a career or housing---requires knowledge about the available options. Such knowledge is generally acquired through costly trial and error, but this learning process can be disrupted by competition. In this work, we study how competition affects the long-term outcomes of individuals as they learn. We build on a line of work that models this setting as a two-sided matching market with bandit learners. A recent result in this area states that it is impossible to simultaneously guarantee two natural desiderata: stability and low optimal regret for all agents. Resource-allocating platforms can point to this result as a justification for assigning good long-term outcomes to some agents and poor ones to others. We show that this impossibility need not hold true. In particular, by modeling two additional components of competition---namely, costs and transfers---we prove that it is possible to simultaneously guarantee four desiderata: stability, low optimal regret, fairness in the distribution of regret, and high social welfare.

This paper studies regret minimization in a multi-armed bandit. It is well known that side information, such as the prior distribution of arm means in Thompson sampling, can improve the statistical efficiency of the bandit algorithm. While the prior is a blessing when correctly specified, it is a curse when misspecified. To address this issue, we introduce the assumption of a random-effect model to bandits. In this model, the mean arm rewards are drawn independently from an unknown distribution, which we estimate. We derive a random-effect estimator of the arm means, analyze its uncertainty, and design a UCB algorithm ReUCB that uses it. We analyze ReUCB and derive an upper bound on its n-round Bayes regret, which improves upon not using the random-effect structure. Our experiments show that ReUCB can outperform Thompson sampling, without knowing the prior distribution of arm means.

**Grassmann Stein Variational Gradient Descent**

Xing Liu · Harrison Zhu · Jean-Francois Ton · George Wynne · Andrew Duncan

Stein variational gradient descent (SVGD) is a deterministic particle inference algorithm that provides an efficient alternative to Markov chain Monte Carlo. However, SVGD has been found to suffer from variance underestimation when the dimensionality of the target distribution is high. Recent developments have advocated projecting both the score function and the data onto real lines to sidestep this issue, although this can severely overestimate the epistemic (model) uncertainty. In this work, we propose Grassmann Stein variational gradient descent (GSVGD) as an alternative approach, which permits projections onto arbitrary dimensional subspaces. Compared with other variants of SVGD that rely on dimensionality reduction, GSVGD updates the projectors simultaneously for the score function and the data, and the optimal projectors are determined through a coupled Grassmann-valued diffusion process which explores favourable subspaces. Both our theoretical and experimental results suggest that GSVGD enjoys efficient state-space exploration in high-dimensional problems that have an intrinsic low-dimensional structure.

**Statistical and computational thresholds for the planted k-densest sub-hypergraph problem**

Luca Corinzia · Paolo Penna · Wojciech Szpankowski · Joachim Buhmann

In this work, we consider the problem of recovery a planted k-densest sub-hypergraph on d-uniform hypergraphs. This fundamental problem appears in different contexts, e.g., community detection, average-case complexity, and neuroscience applications as a structural variant of tensor-PCA problem. We provide tight information-theoretic upper and lower bounds for the exact recovery threshold by the maximum-likelihood estimator, as well as algorithmic bounds based on approximate message passing algorithms. The problem exhibits a typical statistical-to-computational gap observed in analogous sparse settings that widen with increasing sparsity of the problem. The bounds show that the signal structure impacts the location of the statistical and computational phase transition that the known existing bounds for the tensor-PCA model do not capture. This effect is due to the generic planted signal prior that this latter model addresses.

**Double Control Variates for Gradient Estimation in Discrete Latent Variable Models**

Michalis Titsias · Jiaxin Shi

Stochastic gradient-based optimisation for discrete latent variable models is challenging due to the high variance of gradients. We introduce a variance reduction technique for score function estimators that makes use of double control variates. These control variates act on top of a main control variate, and try to further reduce the variance of the overall estimator. We develop a double control variate for the REINFORCE leave-one-out estimator using Taylor expansions. For training discrete latent variable models, such as variational autoencoders with binary latent variables, our approach adds no extra computational cost compared to standard training with the REINFORCE leave-one-out estimator. We apply our method to challenging high-dimensional toy examples and for training variational autoencoders with binary latent variables. We show that our estimator can have lower variance compared to other state-of-the-art estimators.

**Resampling Base Distributions of Normalizing Flows**

Vincent Stimper · Bernhard Schölkopf · Jose Miguel Hernandez-Lobato

Normalizing flows are a popular class of models for approximating probability distributions. However, their invertible nature limits their ability to model target distributions whose support have a complex topological structure, such as Boltzmann distributions. Several procedures have been proposed to solve this problem but many of them sacrifice invertibility and, thereby, tractability of the log-likelihood as well as other desirable properties. To address these limitations, we introduce a base distribution for normalizing flows based on learned rejection sampling, allowing the resulting normalizing flow to model complicated distributions without giving up bijectivity. Furthermore, we develop suitable learning algorithms using both maximizing the log-likelihood and the optimization of the Kullback-Leibler divergence, and apply them to various sample problems, i.e. approximating 2D densities, density estimation of tabular data, image generation, and modeling Boltzmann distributions. In these experiments our method is competitive with or outperforms the baselines.

**Information-Theoretic Analysis of Epistemic Uncertainty in Bayesian Meta-learning **

Sharu Theresa Jose · Sangwoo Park · Osvaldo Simeone

The overall predictive uncertainty of a trained predictor can be decomposed into separate contributions due to epistemic and aleatoric uncertainty. Under a Bayesian formulation, assuming a well-specified model, the two contributions can be exactly expressed (for the log-loss) or bounded (for more general losses) in terms of information-theoretic quantities (Xu and Raginsky [2020]). This paper addresses the study of epistemic uncertainty within an information-theoretic framework in the broader setting of Bayesian meta-learning. A general hierarchical Bayesian model is assumed in which hyperparameters determine the per-task priors of the model parameters. Exact characterizations (for the log-loss) and bounds (for more general losses) are derived for the epistemic uncertainty – quantified by the minimum excess meta-risk (MEMR) – of optimal meta-learning rules. This characterization is leveraged to bring insights into the dependence of the epistemic uncertainty on the number of tasks and on the amount of per-task training data. Experiments are presented that use the proposed information-theoretic bounds, evaluated via neural mutual information estimators, to compare the performance of conventional learning and meta-learning as the number of meta-learning tasks increases.

**An Unsupervised Hunt for Gravitational Lenses**

Stephen Sheng · Keerthi Vasan G C · Chi Po Choi · James Sharpnack · Tucker Jones

Strong gravitational lenses allow us to peer into the farthest reaches of space by bending the light from a background object around a massive object in the foreground. Unfortunately, these lenses are extremely rare, and manually finding them in astronomy surveys is difficult and time-consuming. We are thus tasked with finding them in an automated fashion with few, if any, known lenses to form positive samples. To assist us with training, we can simulate realistic lenses within our survey images to form positive samples. Naively training a ResNet model with these simulated lenses results in a poor precision for the desired high recall, because the simulations contain artifacts that are learned by the model. In this work, we develop a lens detection method that combines simulation, data augmentation, semi-supervised learning, and GANs to improve this performance by an order of magnitude. We perform ablation studies and examine how performance scales with the number of non-lenses and simulated lenses. These findings allow researchers to go into a survey mostly "blind" and still be able to classify strong gravitational lenses with high precision and recall.

**Sinkformers: Transformers with Doubly Stochastic Attention**

Michael Sander · Pierre Ablin · Mathieu Blondel · Gabriel Peyré

Attention based models such as Transformers involve pairwise interactions between data points, modeled with a learnable attention matrix. Importantly, this attention matrix is normalized with the SoftMax operator, which makes it row-wise stochastic. In this paper, we propose instead to use Sinkhorn's algorithm to make attention matrices doubly stochastic. We call the resulting model a Sinkformer. We show that the row-wise stochastic attention matrices in classical Transformers get close to doubly stochastic matrices as the number of epochs increases, justifying the use of Sinkhorn normalization as an informative prior. On the theoretical side, we show that, unlike the SoftMax operation, this normalization makes it possible to understand the iterations of self-attention modules as a discretized gradient-flow for the Wasserstein metric. We also show in the infinite number of samples limit that, when rescaling both attention matrices and depth, Sinkformers operate a heat diffusion. On the experimental side, we show that Sinkformers enhance model accuracy in vision and natural language processing tasks. In particular, on 3D shapes classification, Sinkformers lead to a significant improvement.

**Proximal Optimal Transport Modeling of Population Dynamics**

Charlotte Bunne · Laetitia Papaxanthos · Andreas Krause · Marco Cuturi

We propose a new approach to model the collective dynamics of a population of particles evolving with time. As is often the case in challenging scientific applications, notably single-cell genomics, measuring features for these particles requires destroying them. As a result, the population can only be monitored with periodic snapshots, obtained by sampling a few particles that are sacrificed in exchange for measurements. Given only access to these snapshots, can we reconstruct likely individual trajectories for all other particles? We propose to model these trajectories as collective realizations of a causal Jordan-Kinderlehrer-Otto (JKO) flow of measures: The JKO scheme posits that the new configuration taken by a population at time t+1 is one that trades off an improvement, in the sense that it decreases an energy, while remaining close (in Wasserstein distance) to the previous configuration observed at t. In order to learn such an energy using only snapshots, we propose JKOnet, a neural architecture that computes (in end-to-end differentiable fashion) the JKO flow given a parametric energy and initial configuration of points. We demonstrate the good performance and robustness of the JKOnet fitting procedure, compared to a more direct forward method.

**Using time-series privileged information for provably efficient learning of prediction models**

Rickard Karlsson · Martin Willbo · Zeshan Hussain · Rahul Krishnan · David Sontag · Fredrik Johansson

We study prediction of future outcomes with supervised models that use privileged information during learning. The privileged information comprises samples of time series observed between the baseline time of prediction and the future outcome; this information is only available at training time which differs from the traditional supervised learning. Our question is when using this privileged data leads to more sample-efficient learning of models that use only baseline data for predictions at test time. We give an algorithm for this setting and prove that when the time series are drawn from a non-stationary Gaussian-linear dynamical system of fixed horizon, learning with privileged information is more efficient than learning without it. On synthetic data, we test the limits of our algorithm and theory, both when our assumptions hold and when they are violated. On three diverse real-world datasets, we show that our approach is generally preferable to classical learning, particularly when data is scarce. Finally, we relate our estimator to a distillation approach both theoretically and empirically.

**Off-Policy Risk Assessment for Markov Decision Processes**

Audrey Huang · Liu Leqi · Zachary Lipton · Kamyar Azizzadenesheli

Addressing such diverse ends as mitigating safety risks, aligning agent behavior with human preferences, and improving the efficiency of learning, an emerging line of reinforcement learning research addresses the entire distribution of returns and various risk functionals that depend upon it. In the contextual bandit setting, recently work on off-policy risk assessment estimates the target policy's CDF of returns, providing finite sample guarantees that extend to (and hold simultaneously over) plugin estimates of an arbitrarily large set of risk functionals. In this paper, we lift OPRA to Markov decision processes (MDPs), where importance sampling (IS) CDF estimators suffer high variance on longer trajectories due to vanishing (and exploding) importance weights. To mitigate these problems, we incorporate model-based estimation to develop the first doubly robust (DR) estimator for the CDF of returns in MDPs. The DR estimator enjoys significantly less variance and, when the model is well specified, achieves the Cramer-Rao variance lower bound. Moreover, for many risk functionals, the downstream estimates enjoy both lower bias and lower variance. Additionally, we derive the first minimax lower bounds for off-policy CDF and risk estimation, which match our error bounds up to a constant. Finally, we demonstrate the efficacy of our DR CDF estimates experimentally on several different environments.

**Multivariate Quantile Function Forecaster**

Kelvin Kan · François-Xavier Aubet · Tim Januschowski · Youngsuk Park · Konstantinos Benidis · Lars Ruthotto · Jan Gasthaus

We propose Multivariate Quantile Function Forecaster (MQF2), a global probabilistic forecasting method constructed using a multivariate quantile function and investigate its application to multi-horizon forecasting. Prior approaches are either autoregressive, implicitly capturing the dependency structure across time but exhibiting error accumulation with increasing forecast horizons, or multi-horizon sequence-to-sequence models, which do not exhibit error accumulation, but also do typically not model the dependency structure across time steps. MQF2 combines the benefits of both approaches, by directly making predictions in the form of a multivariate quantile function, defined as the gradient of a convex function which we parametrize using input-convex neural networks. By design, the quantile function is monotone with respect to the input quantile levels and hence avoids quantile crossing. We provide two options to train MQF2: with energy score or with maximum likelihood. Experimental results on real-world and synthetic datasets show that our model has comparable performance with state-of-the-art methods in terms of single time step metrics while capturing the time dependency structure.

**Common Failure Modes of Subcluster-based Sampling in Dirichlet Process Gaussian Mixture Models - and a Deep-learning Solution**

Vlad Winter · Or Dinari · Oren Freifeld

The Dirichlet Process Gaussian Mixture Model (DPGMM) is often used to cluster data when the number of clusters is unknown. One main DPGMM inference paradigm relies on sampling. Here we consider a known state-of-art sampler (proposed by Chang and Fisher III (2013) and improved by~Dinari \etal (2019)), analyze its failure modes, and show how to improve it, often drastically. Concretely, in that sampler, whenever a new cluster is formed it is augmented with two subclusters whose labels are initialized at random. Upon their evolution, the subclusters serve to propose a split of the parent cluster. We show that the random initialization is often problematic and hurts the otherwise-effective sampler. Specifically, we demonstrate that this initialization tends to lead to poor split proposals and/or too many iterations before a desired split is accepted. This slows convergence and can damage the clustering. As a remedy, we propose two drop-in-replacement options for the subcluster-initialization subroutine. The first is an intuitive heuristic while the second is based on deep learning. We show that the proposed approach yields better splits, which in turn translate to substantial improvements in performance, results, and stability. Our code is publicly available.

Mixture models are widely used to fit complex and multimodal datasets. In this paper we study mixtures with high dimensional sparse latent parameter vectors and consider the problem of support recovery of those vectors. While parameter learning in mixture models is well-studied, the sparsity constraint remains relatively unexplored. Sparsity of parameter vectors is a natural constraint in variety of settings, and support recovery is a major step towards parameter estimation.We provide efficient algorithms for support recovery that have a logarithmic sample complexity dependence on the dimensionality of the latent space. Our algorithms are quite general, namely they are applicable to 1) mixtures of many different canonical distributions including Uniform, Poisson, Laplace, Gaussians, etc. 2) Mixtures of linear regressions and linear classifiers with Gaussian covariates under different assumptions on the unknown parameters. In most of these settings, our results are the first guarantees on this problem while in the rest, we provide significant improvements on existing results in certain regimes.

**Pairwise Supervision Can Provably Elicit a Decision Boundary**

Han Bao · Takuya Shimada · Liyuan Xu · Issei Sato · Masashi Sugiyama

Similarity learning is a general problem to elicit useful representations by predicting the relationship between a pair of patterns. This problem is related to various important preprocessing tasks such as metric learning, kernel learning, and contrastive learning. A classifier built upon the representations is expected to perform well in downstream classification; however, little theory has been given in literature so far and thereby the relationship between similarity and classification has remained elusive. Therefore, we tackle a fundamental question: can similarity information provably leads a model to perform well in downstream classification? In this paper, we reveal that a product-type formulation of similarity learning is strongly related to an objective of binary classification. We further show that these two different problems are explicitly connected by an excess risk bound. Consequently, our results elucidate that similarity learning is capable of solving binary classification by directly eliciting a decision boundary.

**Reward-Free Policy Space Compression for Reinforcement Learning**

Mirco Mutti · Stefano Del Col · Marcello Restelli

In reinforcement learning, we encode the potential behaviors of an agent interacting with an environment into an infinite set of policies, called policy space, typically represented by a family of parametric functions. Dealing with such a policy space is a hefty challenge, which often causes sample and computational inefficiencies. However, we argue that a limited number of policies are actually relevant when we also account for the structure of the environment and of the policy parameterization, as many of them would induce very similar interactions, i.e., state-action distributions. In this paper, we seek for a reward-free compression of the policy space into a finite set of representative policies, such that, given any policy $\pi$, the minimum Rényi divergence between the state-action distributions of the representative policies and the state-action distribution of $\pi$ is bounded. We show that this compression of the policy space can be formulated as a set cover problem, and it is inherently NP-hard. Nonetheless, we propose a game-theoretic reformulation for which a locally optimal solution can be efficiently found by iteratively stretching the compressed space to cover the most challenging policy. Finally, we provide an empirical evaluation to illustrate the compression procedure in simple domains, and its ripple effects in reinforcement learning.

**On Uncertainty Estimation by Tree-based Surrogate Models in Sequential Model-based Optimization**

Jungtaek Kim · Seungjin Choi

Sequential model-based optimization sequentially selects a candidate point by constructing a surrogate model with the history of evaluations, to solve a black-box optimization problem. Gaussian process (GP) regression is a popular choice as a surrogate model, because of its capability of calculating prediction uncertainty analytically. On the other hand, an ensemble of randomized trees is another option and has practical merits over GPs due to its scalability and easiness of handling continuous/discrete mixed variables. In this paper we revisit various ensembles of randomized trees to investigate their behavior in the perspective of prediction uncertainty estimation. Then, we propose a new way of constructing an ensemble of randomized trees, referred to as BwO forest, where bagging with oversampling is employed to construct bootstrapped samples that are used to build randomized trees with random splitting. Experimental results demonstrate the validity and good performance of BwO forest over existing tree-based models in various circumstances.

**Survival regression with proper scoring rules and monotonic neural networks**

David Rindt · Robert Hu · David Steinsaltz · Dino Sejdinovic

We consider frequently used scoring rules for right-censored survival regression models such as time-dependent concordance, survival-CRPS, integrated Brier score and integrated binomial log-likelihood, and prove that neither of them is a proper scoring rule. This means that the true survival distribution may be scored worse than incorrect distributions, leading to inaccurate estimation. We prove, in contrast to these scores, that the right-censored log-likelihood is a proper scoring rule, i.e. the highest expected score is achieved by the true distribution. Despite this, modern feed-forward neural-network-based survival regression models are unable to train and validate directly on right-censored log-likelihood, due to its intractability, and resort to the aforementioned alternatives, i.e. non-proper scoring rules. We therefore propose a simple novel survival regression method capable of directly optimizing log-likelihood using a monotonic restriction on the time-dependent weights, coined SurvivalMonotonic-net (SuMo-net). SuMo-net achieves state-of-the-art log-likelihood scores across several datasets with 20--100x computational speedup on inference over existing state-of-the-art neural methods and is readily applicable to datasets with several million observations.

This paper studies the performative prediction problem which optimizes a stochastic loss function with data distribution that depends on the decision variable. We consider a setting where the agent(s) provides samples adapted to both the learner's and agent's previous states. The samples are then used by the learner to update his/her state to optimize a loss function. Such closed loop update dynamics is studied as a state dependent stochastic approximation (SA) algorithm, which is shown to find a fixed point known as the performative stable solution. Our setting captures the unforgetful nature and reliance on past experiences of agents. Our contributions are three-fold. First, we present a framework for state dependent performative prediction with biased stochastic gradients driven by a controlled Markov chain whose transition probability depends on the learner's state. Second, we present a new finite-time performance analysis of the SA algorithm. We show that the expected squared distance to the performative stable solution decreases as O(1/k), where k is the iteration number. Third, numerical experiments verify our findings.

**Multiple Importance Sampling ELBO and Deep Ensembles of Variational Approximations**

Oskar Kviman · Harald Melin · Hazal Koptagel · Victor Elvira · Jens Lagergren

In variational inference (VI), the marginal log-likelihood is estimated using the standard evidence lower bound (ELBO), or improved versions as the importance weighted ELBO (IWELBO). We propose the multiple importance sampling ELBO (MISELBO), a \textit{versatile} yet \textit{simple} framework. MISELBO is applicable in both amortized and classical VI, and it uses ensembles, e.g., deep ensembles, of independently inferred variational approximations. As far as we are aware, the concept of deep ensembles in amortized VI has not previously been established.We prove that MISELBO provides a tighter bound than the average of standard ELBOs, and demonstrate empirically that it gives tighter bounds than the average of IWELBOs. MISELBO is evaluated in density-estimation experiments that include MNIST and several real-data phylogenetic tree inference problems. First, on the MNIST dataset, MISELBO boosts the density-estimation performances of a state-of-the-art model, nouveau VAE. Second, in the phylogenetic tree inference setting, our framework enhances a state-of-the-art VI algorithm that uses normalizing flows. On top of the technical benefits of MISELBO, it allows to unveil connections between VI and recent advances in the importance sampling literature, paving the way for further methodological advances. We provide our code at \url{https://github.com/Lagergren-Lab/MISELBO}.

**Spectral Pruning for Recurrent Neural Networks**

Takashi Furuya · Kazuma Suetake · Koichi Taniguchi · Hiroyuki Kusumoto · Ryuji Saiin · Tomohiro Daimon

Recurrent neural networks (RNNs) are a class of neural networks used in sequential tasks. However, in general, RNNs have a large number of parameters and involve enormous computational costs by repeating the recurrent structures in many time steps. As a method to overcome this difficulty, RNN pruning has attracted increasing attention in recent years, and it brings us benefits in terms of the reduction of computational cost as the time step progresses. However, most existing methods of RNN pruning are heuristic. The purpose of this paper is to study the theoretical scheme for RNN pruning method. We propose an appropriate pruning algorithm for RNNs inspired by “spectral pruning”, and provide the generalization error bounds for compressed RNNs. We also provide numerical experiments to demonstrate our theoretical results and show the effectiveness of our pruning method compared with the existing methods.

**An Information-theoretical Approach to Semi-supervised Learning under Covariate-shift**

Gholamali Aminian · Mahed Abroshan · Mohammad Mahdi Khalili · Laura Toni · Miguel Rodrigues

A common assumption in semi-supervised learning is that the labeled, unlabeled, and test data are drawn from the same distribution. However, this assumption is not satisfied in many applications. In many scenarios, the data is collected sequentially (e.g., healthcare) and the distribution of the data may change over time often exhibiting so-called covariate shifts. In this paper, we propose an approach for semi-supervised learning algorithms that is capable of addressing this issue. Our framework also recovers some popular methods, including entropy minimization and pseudo-labeling. We provide new information-theoretical based generalization error upper bounds inspired by our novel framework. Our bounds are applicable to both general semi-supervised learning and the covariate-shift scenario. Finally, we show numerically that our method outperforms previous approaches proposed for semi-supervised learning under the covariate shift.

Generative models are typically trained on grid-like data such as images. As a result, the size of these models usually scales directly with the underlying grid resolution. In this paper, we abandon discretized grids and instead parameterize individual data points by continuous functions. We then build generative models by learning distributions over such functions. By treating data points as functions, we can abstract away from the specific type of data we train on and construct models that are agnostic to discretization. To train our model, we use an adversarial approach with a discriminator that acts on continuous signals. Through experiments on a wide variety of data modalities including images, 3D shapes and climate data, we demonstrate that our model can learn rich distributions of functions independently of data type and resolution.

**Learning Sparse Fixed-Structure Gaussian Bayesian Networks**

Arnab Bhattacharyya · Davin Choo · Rishikesh Gajjala · Sutanu Gayen · Yuhao WANG

Gaussian Bayesian networks are widely used to model causal interactions among continuous variables. In this work, we study the problem of learning a fixed-structure Gaussian Bayesian network up to a bounded error in total variation distance. We analyze the commonly used node-wise least squares regression LeastSquares and prove that it has the near-optimal sample complexity. We also study a couple of new algorithms for the problem:BatchAvgLeastSquares takes the average of several batches of least squares solutions at each node, so that one can interpolate between the batch size and the number of batches. We show that BatchAvgLeastSquares also has near-optimal sample complexity. CauchyEst takes the median of solutions to several batches of linear systems at each node. We show that the algorithm specialized to polytrees, CauchyEstTree, has near-optimal sample complexity. Experimentally, we show that for uncontaminated, realizable data, the LeastSquares algorithm performs best, but in the presence of contamination or DAG misspecification, CauchyEst/CauchyEstTree and BatchAvgLeastSquares respectively perform better.

**VFDS: Variational Foresight Dynamic Selection in Bayesian Neural Networks for Efficient Human Activity Recognition**

Randy Ardywibowo · Shahin Boluki · Zhangyang Wang · Bobak Mortazavi · Shuai Huang · Xiaoning Qian

In many machine learning tasks, input features with varying degrees of predictive capability are acquired at varying costs. In order to optimize the performance-cost trade-off, one would select features to observe a priori. However, given the changing context with previous observations, the subset of predictive features to select may change dynamically. Therefore, we face the challenging new problem of foresight dynamic selection (FDS): finding a dynamic and light-weight policy to decide which features to observe next, before actually observing them, for overall performance-cost trade-offs. To tackle FDS, this paper proposes a Bayesian learning framework of Variational Foresight Dynamic Selection (VFDS). VFDS learns a policy that selects the next feature subset to observe, by optimizing a variational Bayesian objective that characterizes the trade-off between model performance and feature cost. At its core is an implicit variational distribution on binary gates that are dependent on previous observations, which will select the next subset of features to observe. We apply VFDS on the Human Activity Recognition (HAR) task where the performance-cost trade-off is critical in its practice. Extensive results demonstrate that VFDS selects different features under changing contexts, notably saving sensory costs while maintaining or improving the HAR accuracy. Moreover, the features that VFDS dynamically select are shown to be interpretable and associated with the different activity types. We will release the code.

**Reconstructing Test Labels from Noisy Loss Functions**

Abhinav Aggarwal · Shiva Kasiviswanathan · Zekun Xu · Oluwaseyi Feyisetan · Nathanael Teissier

Machine learning classifiers rely on loss functions for performance evaluation, often on a private (hidden) dataset. In a recent line of research, label inference was introduced as the problem of reconstructing the ground truth labels of this private dataset from just the (possibly perturbed) cross-entropy loss function values evaluated at chosen prediction vectors (without any other access to the hidden dataset). In this paper, we formally study the necessary and sufficient conditions under which label inference is possible from \emph{any} (noisy) loss function value. Using tools from analytical number theory, we show that a broad class of commonly used loss functions, including general Bregman divergence-based losses and multiclass cross-entropy with common activation functions like sigmoid and softmax, it is possible to design label inference attacks that succeed even for arbitrary noise levels and using only a single query from the adversary. We formally study the computational complexity of label inference and show that while in general, designing adversarial prediction vectors for these attacks is co-NP-hard, once we have these vectors, the attacks can also be carried out through a lightweight augmentation to any neural network model, making them look benign and hard to detect. The observations in this paper provide a deeper understanding of the vulnerabilities inherent in modern machine learning and could be used for designing future trustworthy ML.