

Timezone: CET

Registration Desk: Registration + Assistance Wed 26 Apr 07:00 a.m.  


Invited Talk: Shakir Mohamed

Elevating our Evaluations: Technical and Sociotechnical Standards of Assessment in Machine Learning

Evaluation in Machine Learning does not always get the attention it deserves. I hope to focus our attention for the time of this talk on the questions of systematic evaluation in machine learning and the changes that we should continue to make as we elevate the standard of evaluation across our field. The breadth of application areas we collaborate on in machine learning requires a variety of approaches for evaluation, and we’ll explore this variety by considering applications in generative models, social good, healthcare, and environmental science. Grounded in these applications, we will expand the conceptual aperture through which we think about machine learning evaluations, starting from purely technical evaluations (thinking about likelihoods), moving to mixed methods (with proper scoring rules and expert assessments), and then to sociotechnical assessments (considering fairness, impacts, and participation). My core message is that broad and expansive evaluation remains fundamental and an area into which I hope we will drive even greater investments as a community, together.
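The proper scoring rules mentioned above can be illustrated with a minimal sketch (an illustration only, not material from the talk): a scoring rule is proper when reporting the true outcome probability minimizes the expected score, as the Brier score does.

```python
import numpy as np

def brier_score(p, y):
    """Brier score for a binary outcome y given forecast probability p: (p - y)^2, lower is better."""
    return (np.asarray(p) - np.asarray(y)) ** 2

def expected_brier(p, q):
    """Expected Brier score of forecast p when the true outcome probability is q."""
    return q * (p - 1.0) ** 2 + (1.0 - q) * p ** 2

# A proper scoring rule is minimized in expectation by the true probability.
grid = np.linspace(0.0, 1.0, 101)
q_true = 0.7
best = grid[np.argmin(expected_brier(grid, q_true))]
print(best)  # close to 0.7
```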

Shakir Mohamed


Shakir Mohamed works on technical and sociotechnical questions in machine learning research, working on problems in machine learning principles, applied problems in healthcare and environment, and ethics and diversity. Shakir is a Director for Research at DeepMind in London, an Associate Fellow at the Leverhulme Centre for the Future of Intelligence, and an Honorary Professor at University College London. Shakir is also a founder and trustee of the Deep Learning Indaba, a grassroots charity whose work is to build pan-African capacity and leadership in AI. Amongst other roles, Shakir served as the senior programme chair for ICLR 2021, and as the General Chair for NeurIPS 2022. Shakir also serves on the Boards of Directors for some of the leading conferences in the field of machine learning and AI (ICML, ICLR, NeurIPS), is a member of the Royal Society diversity and inclusion committee, and serves on the international scientific advisory committee for the pan-Canadian AI strategy. Shakir is from South Africa, completed a postdoc at the University of British Columbia, received his PhD from the University of Cambridge, and received his master's and undergraduate degrees in Electrical and Information Engineering from the University of the Witwatersrand, Johannesburg.



Oral: Probabilistic Methods 1 Wed 26 Apr 10:30 a.m.  

Oral
Mrinank Sharma · Sebastian Farquhar · Eric Nalisnick · Tom Rainforth

[ Auditorium 1 ]

We investigate the benefit of treating all the parameters in a Bayesian neural network stochastically and find compelling theoretical and empirical evidence that this standard construction may be unnecessary. To this end, we prove that expressive predictive distributions require only small amounts of stochasticity. In particular, partially stochastic networks with only n stochastic biases are universal probabilistic predictors for n-dimensional predictive problems. In empirical investigations, we find no systematic benefit of full stochasticity across four different inference modalities and eight datasets; partially stochastic networks can match and sometimes even outperform fully stochastic networks, despite their reduced memory costs.
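A minimal sketch of the idea (an illustration, not the authors' construction): keep all weights deterministic and treat only a small number of biases as random, so that sampling those biases alone induces a predictive distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Deterministic weights, e.g. point estimates obtained from training.
W1, b1 = rng.normal(size=(16, 1)), np.zeros(16)
W2 = rng.normal(size=(1, 16))

def predict(x, b2):
    """Forward pass; only the final bias b2 is treated as stochastic."""
    h = np.tanh(W1 @ x + b1)
    return W2 @ h + b2

# A predictive distribution arises from sampling the single stochastic bias.
x = np.array([0.5])
samples = np.array([predict(x, rng.normal(scale=0.1)) for _ in range(1000)])
print(samples.mean(), samples.std())
```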

Oral
Quanhan Xi · Benjamin Bloem-Reddy

[ Auditorium 1 ]

Most modern probabilistic generative models, such as the variational autoencoder (VAE), have certain indeterminacies that are unresolvable even with an infinite amount of data. Different tasks tolerate different indeterminacies, however recent applications have indicated the need for strongly identifiable models, in which an observation corresponds to a unique latent code. Progress has been made towards reducing model indeterminacies while maintaining flexibility, and recent work excludes many---but not all---indeterminacies. In this work, we motivate model-identifiability in terms of task-identifiability, then construct a theoretical framework for analyzing the indeterminacies of latent variable models, which enables their precise characterization in terms of the generator function and prior distribution spaces. We reveal that strong identifiability is possible even with highly flexible nonlinear generators, and give two such examples. One is a straightforward modification of iVAE (Khemakhem et al., 2020); the other uses triangular monotonic maps, leading to novel connections between optimal transport and identifiability.

Oral
Rick Presman · Jason Xu

[ Auditorium 1 ]

Constrained learning is prevalent in countless statistical tasks. Recent work proposes distance-to-set penalties to derive estimators under general constraints that can be specified as sets, but focuses on obtaining point estimates that do not come with corresponding measures of uncertainty. To remedy this, we approach distance-to-set regularization from a Bayesian lens. We consider a class of smooth distance-to-set priors, showing that they yield well-defined posteriors toward quantifying uncertainty for constrained learning problems. We discuss relationships and advantages over prior work on Bayesian constraint relaxation. Moreover, we prove that our approach is optimal in an information-geometric sense for finite penalty parameters ρ, and enjoys favorable statistical properties when ρ → ∞. The method is designed to perform effectively within gradient-based MCMC samplers, as illustrated on a suite of simulated and real data applications.

Oral
Juan Kuntz Nussio · Jen Ning Lim · Adam Johansen

[ Auditorium 1 ]

Neal and Hinton (1998) recast maximum likelihood estimation of any given latent variable model as the minimization of a free energy functional F, and the EM algorithm as coordinate descent applied to F. Here, we explore alternative ways to optimize the functional. In particular, we identify various gradient flows associated with F and show that their limits coincide with F's stationary points. By discretizing the flows, we obtain practical particle-based algorithms for maximum likelihood estimation in broad classes of latent variable models. The novel algorithms scale to high-dimensional settings and perform well in numerical experiments.
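The flavor of such particle-based algorithms can be sketched on a toy model (an illustration in the spirit of the paper, not its exact algorithm): alternate Langevin-type updates on per-datapoint particles with gradient updates on the parameter.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy latent variable model: z ~ N(0, 1), x | z ~ N(theta + z, 1).
# The marginal MLE is theta = mean(x), which gives us a ground truth to check.
x = rng.normal(loc=2.0, scale=np.sqrt(2.0), size=200)

K = 32                                  # particles per data point
z = rng.normal(size=(K, x.size))
theta, step = 0.0, 0.05

for _ in range(2000):
    # Gradient of log p(x, z; theta) w.r.t. z, used in a Langevin step on the particles.
    grad_z = -z + (x - theta - z)
    z = z + step * grad_z + np.sqrt(2 * step) * rng.normal(size=z.shape)
    # Gradient step on theta, averaging over particles and data points.
    theta = theta + step * np.mean(x - theta - z)

print(theta)  # close to x.mean()
```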


Oral: Probabilistic Methods 2 Wed 26 Apr 11:30 a.m.  

Oral
Alexander Hägele · Jonas Rothfuss · Lars Lorch · Vignesh Ram Somnath · Bernhard Schölkopf · Andreas Krause

Inferring causal structures from experimentation is a central task in many domains. For example, in biology, recent advances allow us to obtain single-cell expression data under multiple interventions such as drugs or gene knockouts. However, the targets of the interventions are often uncertain or unknown and the number of observations is limited. As a result, standard causal discovery methods can no longer be reliably used. To fill this gap, we propose a Bayesian framework (BaCaDI) for discovering and reasoning about the causal structure that underlies data generated under various unknown experimental or interventional conditions. BaCaDI is fully differentiable, which allows us to infer the complex joint posterior over the intervention targets and the causal structure via efficient gradient-based variational inference. In experiments on synthetic causal discovery tasks and simulated gene-expression data, BaCaDI outperforms related methods in identifying causal structures and intervention targets.

Oral
Kaiyu Li · Daniel Giles · Toni Karvonen · Serge Guillas · Francois-Xavier Briol

[ Auditorium 1 ]

Multilevel Monte Carlo is a key tool for approximating integrals involving expensive scientific models. The idea is to use approximations of the integrand to construct an estimator with improved accuracy over classical Monte Carlo. We propose to further enhance multilevel Monte Carlo through Bayesian surrogate models of the integrand, focusing on Gaussian process models and the associated Bayesian quadrature estimators. We show, using both theory and numerical experiments, that our approach can lead to significant improvements in accuracy when the integrand is expensive and smooth, and when the dimensionality is small or moderate. We conclude the paper with a case study illustrating the potential impact of our method in landslide-generated tsunami modelling, where the cost of each integrand evaluation is typically too large for operational settings.
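The underlying multilevel Monte Carlo telescoping estimator can be sketched as follows (a generic illustration, not the paper's Bayesian variant): many cheap samples at the coarsest level, few samples for each successively more expensive correction.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def f_level(x, level):
    """Increasingly accurate (and costly) approximations of exp(x):
    Taylor truncations stand in for a hierarchy of model fidelities."""
    n_terms = 2 ** (level + 1)          # 2, 4, 8 terms at levels 0, 1, 2
    return sum(x ** k / math.factorial(k) for k in range(n_terms))

# Telescoping estimator: E[f_2] = E[f_0] + E[f_1 - f_0] + E[f_2 - f_1],
# with many cheap samples at level 0 and few at the expensive corrections.
n_samples = [40000, 8000, 2000]
estimate = 0.0
for level, n in enumerate(n_samples):
    x = rng.standard_normal(n)
    if level == 0:
        estimate += f_level(x, 0).mean()
    else:
        estimate += (f_level(x, level) - f_level(x, level - 1)).mean()

print(estimate)  # close to E[exp(X)] = exp(0.5) for X ~ N(0, 1)
```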

Oral
Natalie Maus · Kaiwen Wu · David Eriksson · Jacob Gardner

[ Auditorium 1 ]

Bayesian optimization (BO) is a popular approach to sample-efficient optimization of black-box objective functions. While BO has been successfully applied to a wide range of scientific applications, traditional approaches to single-objective BO only seek to find a single best solution. This can be a significant limitation in situations where solutions may later turn out to be intractable, e.g., a designed molecule may turn out to later violate constraints that can only be evaluated after the optimization process has concluded. To combat this issue, we propose Rank-Ordered Bayesian Optimization with Trust-regions (ROBOT) which aims to find a portfolio of high-performing solutions that are diverse according to a user-specified diversity metric. We evaluate ROBOT on several real-world applications and show that it can discover large sets of high-performing diverse solutions while requiring few additional function evaluations compared to finding a single best solution.

Oral
Henry Moss · Sebastian Ober · Victor Picheny

[ Auditorium 1 ]

Sparse Gaussian processes are a key component of high-throughput Bayesian optimisation (BO) loops; however, we show that existing methods for allocating their inducing points severely hamper optimisation performance. By exploiting the quality-diversity decomposition of determinantal point processes, we propose the first inducing point allocation strategy designed specifically for use in BO. Unlike existing methods which seek only to reduce global uncertainty in the objective function, our approach provides the local high-fidelity modelling of promising regions required for precise optimisation. More generally, we demonstrate that our proposed framework provides a flexible way to allocate modelling capacity in sparse models and so is suitable for a broad range of downstream sequential decision making tasks.


Test Of Time: Deep Gaussian Processes Wed 26 Apr 02:00 p.m.  

Neil Lawrence · Andreas Damianou

In this paper we introduce deep Gaussian process (GP) models. Deep GPs are a deep belief network based on Gaussian process mappings. The data is modeled as the output of a multivariate GP. The inputs to that Gaussian process are then governed by another GP. A single layer model is equivalent to a standard GP or the GP latent variable model (GP-LVM). We perform inference in the model by approximate variational marginalization. This results in a strict lower bound on the marginal likelihood of the model which we use for model selection (number of layers and nodes per layer). Deep belief networks are typically applied to relatively large data sets using stochastic gradient descent for optimization. Our fully Bayesian treatment allows for the application of deep models even when data is scarce. Model selection by our variational bound shows that a five layer hierarchy is justified even when modelling a digit data set containing only 150 examples.
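A sample from a two-layer deep GP can be drawn simply by feeding one GP's sampled values into another as inputs (a minimal illustration; the kernel, lengthscales, and jitter here are arbitrary choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(a, b, lengthscale):
    """Squared-exponential kernel matrix between two sets of scalar inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_sample(inputs, lengthscale):
    """Draw one GP sample at the given inputs (jitter for numerical stability)."""
    K = rbf_kernel(inputs, inputs, lengthscale) + 1e-6 * np.eye(len(inputs))
    return np.linalg.cholesky(K) @ rng.standard_normal(len(inputs))

# Two-layer deep GP sample: the output GP takes the hidden GP's values as inputs.
x = np.linspace(-3, 3, 100)
h = gp_sample(x, lengthscale=1.0)      # hidden layer: GP over x
y = gp_sample(h, lengthscale=0.5)      # output layer: GP over h
print(y[:3])
```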


Oral: Statistical Methods 1 Wed 26 Apr 03:30 p.m.  

Oral
Hongjian Wang · Aaditya Ramdas

[ Auditorium 1 ]

Confidence sequences are confidence intervals that can be sequentially tracked, and are valid at arbitrary data-dependent stopping times. This paper presents confidence sequences for a univariate mean of an unknown distribution with a known upper bound on the p-th central moment (p > 1), but allowing for (at most) ε fraction of arbitrary distribution corruption, as in Huber's contamination model. We do this by designing new robust exponential supermartingales, and show that the resulting confidence sequences attain the optimal width achieved in the nonsequential setting. Perhaps surprisingly, the constant margin between our sequential result and the lower bound is smaller than even fixed-time robust confidence intervals based on the trimmed mean, for example. Since confidence sequences are a common tool used within A/B/n testing and bandits, these results open the door to sequential experimentation that is robust to outliers and adversarial corruptions.

Oral
Junwen Yao · Benjamin Erichson · Miles Lopes

[ Auditorium 1 ]

Random Fourier Features (RFF) is among the most popular and broadly applicable approaches for scaling up kernel methods. In essence, RFF allows the user to avoid costly computations with a large kernel matrix via a fast randomized approximation. However, a pervasive difficulty in applying RFF is that the user does not know the actual error of the approximation, or how this error will propagate into downstream learning tasks. Up to now, the RFF literature has primarily dealt with these uncertainties using theoretical error bounds, but from a user's standpoint, such results are typically impractical---either because they are highly conservative or involve unknown quantities. To tackle these general issues in a data-driven way, this paper develops a bootstrap approach to numerically estimate the errors of RFF approximations. Three key advantages of this approach are: (1) The error estimates are specific to the problem at hand, avoiding the pessimism of worst-case bounds. (2) The approach is flexible with respect to different uses of RFF, and can even estimate errors in downstream learning tasks. (3) The approach enables adaptive computation, in the sense that the user can quickly inspect the error of a rough initial kernel approximation and then predict how much extra …
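The bootstrap idea can be sketched generically (an illustration, not the paper's exact procedure): resample the random features with replacement and use the spread of the recomputed kernel approximations as a data-driven error estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

n, D = 50, 200
x = rng.uniform(-2, 2, size=n)

# Random Fourier features for the RBF kernel k(a, b) = exp(-(a - b)^2 / 2).
w = rng.standard_normal(D)
b = rng.uniform(0, 2 * np.pi, D)
Z = np.sqrt(2.0 / D) * np.cos(np.outer(x, w) + b)
K_hat = Z @ Z.T

# Bootstrap over the D random features: resample feature columns with
# replacement and measure how far the recomputed kernel estimate moves.
boot = []
for _ in range(200):
    idx = rng.integers(0, D, size=D)
    K_star = Z[:, idx] @ Z[:, idx].T
    boot.append(np.abs(K_star - K_hat).max())
error_estimate = np.quantile(boot, 0.95)

# The exact kernel is computed here only to sanity-check the estimate.
K_true = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
actual_error = np.abs(K_hat - K_true).max()
print(error_estimate, actual_error)
```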

Oral
Alberto Maria Metelli · Mirco Mutti · Marcello Restelli

[ Auditorium 1 ]

The most relevant problems in discounted reinforcement learning involve estimating the mean of a function under the stationary distribution of a Markov reward process, such as the expected return in policy evaluation, or the policy gradient in policy optimization. In practice, these estimates are produced through a finite-horizon episodic sampling, which neglects the mixing properties of the Markov process. It is mostly unclear how this mismatch between the practical and the ideal setting affects the estimation, and the literature lacks a formal study on the pitfalls of episodic sampling, and how to do it optimally. In this paper, we present a minimax lower bound on the discounted mean estimation problem that explicitly connects the estimation error with the mixing properties of the Markov process and the discount factor. Then, we provide a statistical analysis on a set of notable estimators and the corresponding sampling procedures, which includes the finite-horizon estimators often used in practice. Crucially, we show that estimating the mean by directly sampling from the discounted kernel of the Markov process brings compelling statistical properties w.r.t. the alternative estimators, as it matches the lower bound without requiring a careful tuning of the episode horizon.

Oral
Rosanne Turner · Peter Grunwald

[ Auditorium 1 ]

Sequential decision making significantly speeds up research and is more cost-effective compared to fixed-n methods. We present a method for sequential decision making for stratified count data that retains Type-I error guarantee or false discovery rate under optional stopping, using e-variables. We invert the method to construct stratified anytime-valid confidence sequences, where cross-talk between subpopulations in the data can be allowed during data collection to improve power. Finally, we combine information collected in separate subpopulations through pseudo-Bayesian averaging and switching to create effective estimates for the minimal, mean and maximal treatment effects in the subpopulations.


Poster Session 2 Wed 26 Apr 04:30 p.m.  

Poster
Natraj Raman · Daniele Magazzeni · Sameena Shah

[ Auditorium 1 Foyer ]

Counterfactual explanations utilize feature perturbations to analyze the outcome of an original decision and recommend an actionable recourse. We argue that it is beneficial to provide several alternative explanations rather than a single point solution and propose a probabilistic paradigm to estimate a diverse set of counterfactuals. Specifically, we treat the perturbations as random variables endowed with prior distribution functions. This allows sampling multiple counterfactuals from the posterior density, with the added benefit of incorporating inductive biases, preserving domain specific constraints and quantifying uncertainty in estimates. More importantly, we leverage Bayesian hierarchical modeling to share information across different subgroups of a population, which can both improve robustness and measure fairness. A gradient based sampler with superior convergence characteristics efficiently computes the posterior samples. Experiments across several datasets demonstrate that the counterfactuals estimated using our approach are valid, sparse, diverse and feasible.

Poster
Kyungsu Lee · Haeyun Lee · Jae Youn Hwang

[ Auditorium 1 Foyer ]

Deep learning (DL) techniques for the precise semantic segmentation of aerial imagery have remained a challenge because of the vague boundaries of target objects caused by the low resolution of images. Despite the improved segmentation performance using up/downsampling operations in early DL models, conventional operators cannot fully preserve spatial information and thus generate vague boundaries of target objects. Therefore, for the precise segmentation of target objects in aerial images, this paper presents two novel operators: (1) upsampling interpolation method (USIM), an operator that upsamples input feature maps and combines feature maps into one while preserving the spatial information of both inputs, and (2) USIM gate (UG), an advanced USIM operator with boundary-attention mechanisms. The experimental results demonstrate that using the USIM and UG with state-of-the-art DL models can improve the segmentation performance with clear boundaries of target objects (Intersection over Union: +6.9%; BJ: +10.1%). Furthermore, mathematical proofs verify that the USIM and UG contribute to the handling of spatial information.

Poster
Adam Elmachtoub · Vishal Gupta · YUNFAN ZHAO

[ Auditorium 1 Foyer ]

We consider a personalized pricing problem in which we have data consisting of feature information, historical pricing decisions, and binary realized demand. The goal is to perform off-policy evaluation for a new personalized pricing policy that maps features to prices. Methods based on inverse propensity weighting (including doubly robust methods) for off-policy evaluation may perform poorly when the logging policy has little exploration or is deterministic, which is common in pricing applications. Building on the balanced policy evaluation framework of Kallus (2018), we propose a new approach tailored to pricing applications. The key idea is to compute an estimate that minimizes the worst-case mean squared error or maximizes a worst-case lower bound on policy performance, where in both cases the worst-case is taken with respect to a set of possible revenue functions. We establish theoretical convergence guarantees and empirically demonstrate the advantage of our approach using a real-world pricing dataset.
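As background, the inverse propensity weighting (IPW) estimator that the paper improves upon can be sketched in a toy pricing setting (illustrative only; under a near-deterministic logging policy the weights blow up and the estimate degrades, which is the failure mode the paper targets):

```python
import numpy as np

rng = np.random.default_rng(0)

prices = np.array([1.0, 2.0])
buy_prob = np.array([0.8, 0.5])       # true demand rates, unknown in practice
log_probs = np.array([0.5, 0.5])      # logging policy with good exploration

n = 20000
a = rng.choice(2, size=n, p=log_probs)            # logged price index
d = (rng.random(n) < buy_prob[a]).astype(float)   # binary realized demand

# IPW estimate of expected revenue under "always charge prices[1]".
target = 1
w = (a == target) / log_probs[target]
ipw_estimate = np.mean(w * prices[target] * d)
print(ipw_estimate)  # close to 2.0 * 0.5 = 1.0
```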

Poster
Ervine Zheng · Qi Yu · Rui Li · Pengcheng Shi · Anne Haake

[ Auditorium 1 Foyer ]

Image captioning offers a computational process to understand the semantics of images and convey them using descriptive language. However, automated captioning models may not always generate satisfactory captions due to the complex nature of the images and the quality/size of the training data. We propose an interactive captioning framework to improve machine-generated captions by keeping humans in the loop and performing an online-offline knowledge acquisition (KA) process. In particular, online KA accepts a list of keywords specified by human users and fuses them with the image features to generate a readable sentence that captures the semantics of the image. It leverages a multimodal conditioned caption completion mechanism to ensure the appearance of all user-input keywords in the generated caption. Offline KA further learns from the user inputs to update the model and benefits caption generation for unseen images in the future. It is built upon a Bayesian transformer architecture that dynamically allocates neural resources and supports uncertainty-aware model updates to mitigate overfitting. Our theoretical analysis also proves that Offline KA automatically selects the best model capacity to accommodate the newly acquired knowledge. Experiments on real-world data demonstrate the effectiveness of the proposed framework.

Poster
Jaromír Plhák · Ondřej Sotolář · Michaela Lebedíková · David Šmahel

[ Auditorium 1 Foyer ]

Previous research on detecting risky online behavior has been rather scattered, typically identifying single risks in online samples. To our knowledge, the presented research is the first that presents a process of building models that can efficiently detect the following four types of online risky behavior: (1) aggression, harassment, hate; (2) mental health; (3) use of alcohol and drugs; and (4) sexting. Furthermore, the corpora in this research are unique because of the usage of private instant messaging conversations in the Czech language provided by adolescents. The combination of publicly unavailable and unique data with high-quality annotations of specific psychological phenomena allowed us to perform precise detection using transformer machine learning models that can handle sequential data and involve the context of utterances. The impact of the context length and text augmentation on model efficiency is discussed in detail. The final model provides promising results with an acceptable F1 score. Therefore, we believe that the model could be used in various applications, e.g., parental applications, chatbots, or services provided by Internet providers. Future research could investigate the usage of the model in other languages.

Poster
Dongping Qi · David Bindel · Alexander Vladimirsky

[ Auditorium 1 Foyer ]

We consider a task of surveillance-evading path-planning in a continuous setting. An Evader strives to escape from a 2D domain while minimizing the risk of detection (and immediate capture). The probability of detection is path-dependent and determined by the spatially inhomogeneous surveillance intensity, which is fixed but a priori unknown and gradually learned in the multi-episodic setting. We introduce a Bayesian reinforcement learning algorithm that relies on a Gaussian Process regression (to model the surveillance intensity function based on the information from prior episodes), numerical methods for Hamilton-Jacobi PDEs (to plan the best continuous trajectories based on the current model), and Confidence Bounds (to balance the exploration vs exploitation). We use numerical experiments and regret metrics to highlight the significant advantages of our approach compared to traditional graph-based algorithms of reinforcement learning.

Poster
Di He · Shanda Li · Wenlei Shi · Xiaotian Gao · Jia Zhang · Jiang Bian · Liwei Wang · Tie-Yan Liu

[ Auditorium 1 Foyer ]

Physics-Informed Neural Network (PINN) has become a commonly used machine learning approach to solve partial differential equations (PDE). However, facing high-dimensional second-order PDE problems, PINN will suffer from severe scalability issues since its loss includes second-order derivatives, the computational cost of which will grow along with the dimension during stacked back-propagation. In this work, we develop a novel approach that can significantly accelerate the training of Physics-Informed Neural Networks. In particular, we parameterize the PDE solution by the Gaussian smoothed model and show that, derived from Stein's Identity, the second-order derivatives can be efficiently calculated without back-propagation. We further discuss the model capacity and provide variance reduction methods to address key limitations in the derivative estimation. Experimental results show that our proposed method can achieve competitive error compared to standard PINN training but is significantly faster. Our code is released at https://github.com/LithiumDA/PINN-without-Stacked-BP.
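The derivative trick can be sketched in one dimension (an illustration of the identity, not the paper's full training procedure): for the Gaussian-smoothed function f_sigma(x) = E[f(x + sigma*eps)], Stein's identity gives f_sigma''(x) = E[f(x + sigma*eps)(eps^2 - 1)] / sigma^2, which needs only function evaluations, no back-propagation.

```python
import numpy as np

rng = np.random.default_rng(0)

def smoothed_second_derivative(f, x, sigma=0.5, n=200000):
    """Monte Carlo estimate of the second derivative of the Gaussian-smoothed f
    at x via Stein's identity; uses only evaluations of f, no autodiff."""
    eps = rng.standard_normal(n)
    return np.mean(f(x + sigma * eps) * (eps ** 2 - 1)) / sigma ** 2

# For f(x) = x^2 the smoothed function is x^2 + sigma^2, so its second
# derivative is exactly 2 everywhere.
est = smoothed_second_derivative(lambda t: t ** 2, x=1.0)
print(est)  # close to 2
```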

Poster
Xun Zhu · Yutong Xiong · Ming Wu · Gaozhen Nie · Bin Zhang · Ziheng Yang

[ Auditorium 1 Foyer ]

Weather forecasting is one of the cornerstones of meteorological work. In this paper, we present a new benchmark dataset named Weather2K, which aims to make up for the deficiencies of existing weather forecasting datasets in terms of real-time availability, reliability, and diversity, as well as the key bottleneck of data quality. To be specific, our Weather2K is featured from the following aspects: 1) Reliable and real-time data. The data is hourly collected from 2,130 ground weather stations covering an area of 6 million square kilometers. 2) Multivariate meteorological variables. 20 meteorological factors and 3 constants for position information are provided with a length of 40,896 time steps. 3) Applicable to diverse tasks. We conduct a set of baseline tests on time series forecasting and spatio-temporal forecasting. To the best of our knowledge, our Weather2K is the first attempt to tackle the weather forecasting task by taking full advantage of the strengths of observation data from ground weather stations. Based on Weather2K, we further propose Meteorological Factors based Multi-Graph Convolution Network (MFMGCN), which can effectively construct the intrinsic correlation among geographic locations based on meteorological factors. Sufficient experiments show that MFMGCN improves both the forecasting performance and temporal robustness. We hope our Weather2K …

Poster
Toshihiro Ota · Ikuro Sato · Rei Kawakami · Masayuki Tanaka · Nakamasa Inoue

[ Auditorium 1 Foyer ]

It has been known from neuroscience studies that partial and transient forgetting of memory often plays an important role in the brain to improve performance for certain intellectual activities. In machine learning, associative memory models such as classical and modern Hopfield networks have been proposed to express memories as attractors in the feature space of a closed recurrent network. In this work, we propose learning with partial forgetting (LwPF), where a partial forgetting functionality is designed by element-wise non-bijective projections, for memory neurons in modern Hopfield networks to improve model performance. We also incorporate LwPF into the attention mechanism, whose process has been shown to be identical to the update rule of a certain modern Hopfield network, by modifying the corresponding Lagrangian. We evaluated the effectiveness of LwPF on three diverse tasks: bit-pattern classification, immune repertoire classification for computational biology, and image classification for computer vision, and confirmed that LwPF consistently improves the performance of existing neural networks including DeepRC and vision transformers.

Poster
Pascal Esser · Satyaki Mukherjee · Mahalakshmi Sabanayagam · Debarghya Ghoshdastidar

[ Auditorium 1 Foyer ]

The central question in representation learning is what constitutes a good or meaningful representation. In this work we argue that if we consider data with inherent cluster structures, where clusters can be characterized through different means and covariances, those data structures should be represented in the embedding as well. While Autoencoders (AE) are widely used in practice for unsupervised representation learning, they do not fulfil the above condition on the embedding as they obtain a single representation of the data. To overcome this we propose a meta-algorithm that can be used to extend an arbitrary AE architecture to a tensorized version (TAE) that allows for learning cluster-specific embeddings while simultaneously learning the cluster assignment. For the linear setting we prove that TAE can recover the principal components of the different clusters, in contrast to the principal components of the entire data recovered by a standard AE. We validate this on planted models, and for general, non-linear and convolutional AEs we empirically illustrate that tensorizing the AE is beneficial in clustering and de-noising tasks.

Poster
Gavin Kerrigan · Justin Ley · Padhraic Smyth

[ Auditorium 1 Foyer ]

Diffusion generative models have recently been applied to domains where the available data can be seen as a discretization of an underlying function, such as audio signals or time series. However, these models operate directly on the discretized data, and there are no semantics in the modeling process that relate the observed data to the underlying functional forms. We generalize diffusion models to operate directly in function space by developing the foundational theory for such models in terms of Gaussian measures on Hilbert spaces. A significant benefit of our function space point of view is that it allows us to explicitly specify the space of functions we are working in, leading us to develop methods for diffusion generative modeling in Sobolev spaces. Our approach allows us to perform both unconditional and conditional generation of function-valued data. We demonstrate our methods on several synthetic and real-world benchmarks.

Poster
Fasil Cheema · Ruth Urner

[ Auditorium 1 Foyer ]

Generative modelling has seen enormous practical advances over the past few years. Evaluating the quality of a generative system however is often still based on subjective human inspection. To overcome this, very recently the research community has turned to exploring formal evaluation metrics and methods. In this work, we propose a novel evaluation paradigm based on a two-way nearest neighbor neighborhood test. We define a novel measure of mutual coverage for two continuous probability distributions. From this, we derive an empirical analogue and show analytically that it exhibits favorable theoretical properties while it is also straightforward to compute. We show that, while algorithmically simple, our derived method is also statistically sound. In contrast to previously employed distance measures, our measure naturally stems from a notion of local discrepancy, which can be accessed separately. This provides more detailed information to practitioners on the diagnosis of where their generative models will perform well, or conversely where their models fail. We complement our analysis with a systematic experimental evaluation and comparison to other recently proposed measures. Using a wide array of experiments we demonstrate our algorithm's strengths over other existing methods and confirm our results from the theoretical analysis.

Poster
Qingzhong Ai · Pengyun Wang · Lirong He · liangjian Wen · Lujia Pan · Zenglin Xu

[ Auditorium 1 Foyer ]

Learning with imbalanced data is a challenging problem in deep learning. Over-sampling is a widely used technique to re-balance the sampling distribution of training data. However, most existing over-sampling methods only use intra-class information of minority classes to augment the data but ignore the inter-class relationships with the majority ones, which is prone to overfitting, especially when the imbalance ratio is large. To address this issue, we propose a novel over-sampling model, called Majority-Guided VAE (MGVAE), which generates new minority samples under the guidance of a majority-based prior. In this way, the newly generated minority samples can inherit the diversity and richness of the majority ones, thus mitigating overfitting in downstream tasks. Furthermore, to prevent model collapse under limited data, we first pre-train MGVAE on sufficient majority samples and then fine-tune based on minority samples with Elastic Weight Consolidation (EWC) regularization. Experimental results on benchmark image datasets and real-world tabular data show that MGVAE achieves competitive improvements over other over-sampling methods in downstream classification tasks, demonstrating the effectiveness of our method.

Poster
Joel Oskarsson · Per Sidén · Fredrik Lindsten

[ Auditorium 1 Foyer ]

This paper proposes a temporal graph neural network model for forecasting irregularly observed, graph-structured time series. Our TGNN4I model is designed to handle both irregular time steps and partial observations of the graph. This is achieved by introducing a time-continuous latent state in each node, following a linear Ordinary Differential Equation (ODE) defined by the output of a Gated Recurrent Unit (GRU). The ODE has an explicit solution as a combination of exponential decay and periodic dynamics. Observations in the graph neighborhood are taken into account by integrating graph neural network layers into both the GRU state update and the predictive model. The time-continuous dynamics additionally enable the model to make predictions at arbitrary time steps. We propose a loss function that leverages this and allows for training the model to forecast over different time horizons. Experiments on simulated data and real-world data from traffic and climate modeling validate the usefulness of both the graph structure and the time-continuous dynamics in settings with irregular observations.
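A minimal sketch of such a closed-form latent update between irregular observation times. The parameter names (`rate`, `target`, `amp`, `omega`) are hypothetical; in TGNN4I the corresponding quantities are produced per node by the GRU and graph layers.

```python
import numpy as np

def evolve(h0, dt, rate=1.0, target=0.0, amp=0.5, omega=2.0):
    # closed-form latent update between two (possibly irregular)
    # observation times: exponential decay toward a target plus a
    # periodic component; all parameters here are illustrative
    decay = target + (h0 - target) * np.exp(-rate * dt)
    return decay + amp * np.sin(omega * dt)

h = np.array([2.0, -1.0])
print(evolve(h, 0.0))   # dt = 0 leaves the state unchanged
print(evolve(h, 10.0))  # long gaps forget the old state
```

Because the solution is available for any `dt`, predictions can be made at arbitrary time points without numerical ODE integration.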

Poster
Giannis Nikolentzos · Michail Chatzianastasis · Michalis Vazirgiannis

[ Auditorium 1 Foyer ]

In recent years, graph neural networks (GNNs) have emerged as a promising tool for solving machine learning problems on graphs. Most GNNs are members of the family of message passing neural networks (MPNNs). There is a close connection between these models and the Weisfeiler-Leman (WL) test of isomorphism, an algorithm that can successfully test isomorphism for a broad class of graphs. Recently, much research has focused on measuring the expressive power of GNNs. For instance, it has been shown that standard MPNNs are at most as powerful as WL in terms of distinguishing non-isomorphic graphs. However, these studies have largely ignored the distances between the representations of the nodes/graphs which are of paramount importance for learning tasks. In this paper, we define a distance function between nodes which is based on the hierarchy produced by the WL algorithm, and propose a model that learns representations which preserve those distances between nodes. Since the emerging hierarchy corresponds to a tree, to learn these representations, we capitalize on recent advances in the field of hyperbolic neural networks. We empirically evaluate the proposed model on standard graph and node classification datasets where it achieves competitive performance with state-of-the-art models.
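The WL hierarchy referenced here comes from iterated color refinement, which can be sketched in a few lines (a plain 1-WL round on a toy path graph; the paper's distance function and hyperbolic embedding are not shown):

```python
# one-dimensional Weisfeiler-Leman colour refinement on a path graph:
# each node's new colour combines its old colour with the multiset of
# its neighbours' colours; nodes separated at an earlier round are
# "further apart" in the induced hierarchy
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # path 0-1-2-3
colors = {v: 0 for v in adj}                    # uniform start
for _ in range(2):
    sig = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
           for v in adj}
    palette = {s: i for i, s in enumerate(sorted(set(sig.values())))}
    colors = {v: palette[sig[v]] for v in adj}
print(colors)  # endpoints share a colour, interior nodes share another
```

On this path, the endpoints (degree 1) and interior nodes (degree 2) are separated in the first round, and the refinement is stable thereafter; the round at which two nodes first receive different colors induces the tree-structured hierarchy the paper embeds hyperbolically.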

Poster
Abdullah Alchihabi · Yuhong Guo

[ Auditorium 1 Foyer ]

Graph Neural Networks (GNNs) require a relatively large number of labeled nodes and a reliable/uncorrupted graph connectivity structure to obtain good performance on the semi-supervised node classification task. The performance of GNNs can degrade significantly as the number of labeled nodes decreases or the graph connectivity structure is corrupted by adversarial attacks or noise in data measurement/collection. Therefore, it is important to develop GNN models that are able to achieve good performance when there is limited supervision knowledge—a few labeled nodes and a noisy graph structure. In this paper, we propose a novel Dual GNN learning framework to address this challenging task. The proposed framework has two GNN based node prediction modules. The primary module uses the input graph structure to induce typical node embeddings and predictions with a regular GNN baseline, while the auxiliary module constructs a new graph structure through fine-grained spectral clustering and learns new node embeddings and predictions. By integrating the two modules in a dual GNN learning framework, we perform joint learning in an end-to-end fashion. This general framework can be applied on many GNN baseline models. The experimental results show that the proposed dual GNN framework can greatly outperform the GNN baseline methods and …

Poster
Qi CHEN · Mario Marchand

[ Auditorium 1 Foyer ]

We use information-theoretic tools to derive a novel analysis of Multi-source Domain Adaptation (MDA) from the representation learning perspective. Concretely, we study joint distribution alignment for supervised MDA with few target labels and unsupervised MDA with pseudo labels, where the latter is relatively hard and less commonly studied. We further provide algorithm-dependent generalization bounds for these two settings, where the generalization is characterized by the mutual information between the parameters and the data. Then we propose a novel deep MDA algorithm, implicitly addressing the target shift through joint alignment. Finally, the mutual information bounds are extended to this algorithm providing a non-vacuous gradient-norm estimation. The proposed algorithm has comparable performance to the state-of-the-art on target-shifted MDA benchmark with improved memory efficiency.

Poster
Matthias Rath · Alexandru Paul Condurache

[ Auditorium 1 Foyer ]

We address the problem of improving the performance and in particular the sample complexity of deep neural networks by enforcing and guaranteeing invariances to symmetry transformations rather than learning them from data. Group-equivariant convolutions are a popular approach to obtain equivariant representations. The desired corresponding invariance is then imposed using pooling operations. For rotations, it has been shown that using invariant integration instead of pooling further improves the sample complexity. In this contribution, we first expand invariant integration beyond rotations to flips and scale transformations. We then address the problem of incorporating multiple desired invariances into a single network. For this purpose, we propose a multi-stream architecture, where each stream is invariant to a different transformation such that the network can simultaneously benefit from multiple invariances. We demonstrate our approach with successful experiments on Scaled-MNIST, SVHN, CIFAR-10 and STL-10.
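The underlying principle, guaranteeing invariance by integrating features over the transformation group rather than learning it, can be sketched with a plain group average over 90-degree rotations. This is a toy stand-in, not the paper's invariant integration formula.

```python
import numpy as np

def rotations(img):
    # the four 90-degree rotations of a square image (the group C4)
    return [np.rot90(img, k) for k in range(4)]

def invariant_features(img, feature_fn):
    # averaging any feature map over the group makes it exactly
    # invariant to transformations from that group
    return np.mean([feature_fn(r) for r in rotations(img)], axis=0)

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8))
f = lambda x: np.array([x.sum(), (x ** 2).sum(), x[0, 0]])
a = invariant_features(img, f)
b = invariant_features(np.rot90(img), f)
print(np.allclose(a, b))  # True: the features are rotation-invariant
```

The multi-stream architecture in the paper applies this idea per stream, with each stream invariant to a different transformation (rotations, flips, or scalings).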

Poster
Mehrnaz Mofakhami · Ioannis Mitliagkas · Gauthier Gidel

[ Auditorium 1 Foyer ]

Performative prediction is a framework for learning models that influence the data they intend to predict. We focus on finding classifiers that are performatively stable, i.e., optimal for the data distribution they induce. Standard convergence results for finding a performatively stable classifier with the method of repeated risk minimization assume that the data distribution is Lipschitz continuous with respect to the model's parameters. Under this assumption, the loss must be strongly convex and smooth in these parameters; otherwise, the method will diverge for some problems. In this work, we instead assume that the data distribution is Lipschitz continuous with respect to the model's predictions, a more natural assumption for performative systems. As a result, we are able to significantly relax the assumptions on the loss function. In particular, we do not need to assume convexity with respect to the model's parameters. As an illustration, we introduce a resampling procedure that models realistic distribution shifts and show that it satisfies our assumptions. We support our theory by showing that one can learn performatively stable classifiers with neural networks making predictions about real data that shift according to our proposed procedure.
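Repeated risk minimization can be illustrated on a toy scalar problem where deploying the model shifts the data distribution. The constants `mu0` and `eps` are hypothetical; the point is only that iterating "fit on the induced distribution" converges to a performatively stable point.

```python
# toy repeated risk minimization (RRM) for performative prediction:
# deploying theta shifts the data mean to mu0 + eps * theta, and
# squared-loss minimisation returns the current population mean,
# so each RRM round is theta <- mu0 + eps * theta
mu0, eps = 1.0, 0.5   # eps < 1: the induced shift is a contraction
theta = 0.0
for _ in range(50):
    theta = mu0 + eps * theta   # argmin_t E[(y - t)^2] under D(theta)
print(theta)  # converges to the stable point mu0 / (1 - eps) = 2.0
```

The stable point satisfies theta* = mu0 + eps * theta*, i.e. it is optimal for exactly the distribution it induces; contractivity of the shift (here `eps < 1`) is what Lipschitzness assumptions in the literature are buying.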

Poster
Andrew Stirn · Harm Wessels · Megan Schertzer · Laura Pereira · Neville Sanjana · David Knowles

[ Auditorium 1 Foyer ]

Heteroscedastic regression models a Gaussian variable's mean and variance as a function of covariates. Parametric methods that employ neural networks for these parameter maps can capture complex relationships in the data. Yet, optimizing network parameters via log likelihood gradients can yield suboptimal mean and uncalibrated variance estimates. Current solutions side-step this optimization problem with surrogate objectives or Bayesian treatments. Instead, we make two simple modifications to optimization. Notably, their combination produces a heteroscedastic model with mean estimates that are provably as accurate as those from its homoscedastic counterpart (i.e., fitting the mean under squared error loss). For a wide variety of network and task complexities, we find that mean estimates from existing heteroscedastic solutions can be significantly less accurate than those from an equivalently expressive mean-only model. Our approach provably retains the accuracy of an equally flexible mean-only model while also offering best-in-class variance calibration. Lastly, we show how to leverage our method to recover the underlying heteroscedastic noise variance.

Poster
Anirban Samaddar · Sandeep Madireddy · Prasanna Balaprakash · Tapabrata Maiti · Gustavo de los Campos · Ian Fischer

[ Auditorium 1 Foyer ]

The information bottleneck framework provides a systematic approach to learning representations that compress nuisance information in the input and extract semantically meaningful information about predictions. However, the choice of a prior distribution that fixes the dimensionality across all the data can restrict the flexibility of this approach for learning robust representations. We present a novel sparsity-inducing spike-slab categorical prior that uses sparsity as a mechanism to provide the flexibility that allows each data point to learn its own dimension distribution. In addition, it provides a mechanism for learning a joint distribution of the latent variable and the sparsity, and hence it can account for the complete uncertainty in the latent space. Through a series of experiments using in-distribution and out-of-distribution learning scenarios on the MNIST, CIFAR-10, and ImageNet data, we show that the proposed approach improves accuracy and robustness compared to traditional fixed-dimensional priors, as well as other sparsity induction mechanisms for latent variable models proposed in the literature.

Poster
Tina Behnia · Ganesh Ramachandra Kini · Vala Vakilian · Christos Thrampoulidis

[ Auditorium 1 Foyer ]

Various logit-adjusted parameterizations of the cross-entropy (CE) loss have been proposed as alternatives to weighted CE for training large models on label-imbalanced data far beyond the zero train error regime. The driving force behind those designs has been the theory of implicit bias, which for linear(ized) models, explains why they successfully induce bias on the optimization path towards solutions that favor minorities. Aiming to extend this theory to non-linear models, we investigate the implicit geometry of classifiers and embeddings that are learned by different CE parameterizations. Our main result characterizes the global minimizers of a non-convex cost-sensitive SVM classifier for the unconstrained features model, which serves as an abstraction of deep-nets. We derive closed-form formulas for the angles and norms of classifiers and embeddings as a function of the number of classes, the imbalance and the minority ratios, and the loss hyperparameters. Using these, we show that logit-adjusted parameterizations can be appropriately tuned to learn symmetric geometries irrespective of the imbalance ratio. We complement our analysis with experiments and an empirical study of convergence accuracy in deep-nets.

Poster
Zhichao Wang · Yizhe Zhu

[ Auditorium 1 Foyer ]

We investigate the properties of random feature ridge regression (RFRR) given by a two-layer neural network with random Gaussian initialization. We study the non-asymptotic behaviors of the RFRR with nearly orthogonal deterministic unit-length input data vectors in the overparameterized regime, where the width of the first layer is much larger than the sample size. Our analysis shows high-probability non-asymptotic concentration results for the training errors, cross-validations, and generalization errors of RFRR centered around their respective values for a kernel ridge regression (KRR). This KRR is derived from an expected kernel generated by a nonlinear random feature map. We then approximate the performance of the KRR by a polynomial kernel matrix obtained from the Hermite polynomial expansion of the activation function, whose degree only depends on the orthogonality among different data points. This polynomial kernel determines the asymptotic behavior of the RFRR and the KRR. Our results hold for a wide variety of activation functions and input data sets that exhibit nearly orthogonal properties. Based on these approximations, we obtain a lower bound for the generalization error of the RFRR for a nonlinear student-teacher model.

Poster
Haotian Ju · Dongyue Li · Aneesh Sharma · Hongyang Zhang

[ Auditorium 1 Foyer ]

Graph neural networks are widely used tools for graph prediction tasks. Motivated by their empirical performance, prior works have developed generalization bounds for graph neural networks, which scale with graph structures in terms of the maximum degree. In this paper, we present generalization bounds that instead scale with the largest singular value of the graph neural network's feature diffusion matrix. These bounds are numerically much smaller than prior bounds for real-world graphs. We also construct a lower bound of the generalization gap that matches our upper bound asymptotically. To achieve these results, we analyze a unified model that includes prior works' settings (i.e., convolutional and message-passing networks) and new settings (i.e., graph isomorphism networks). Our key idea is to measure the stability of graph neural networks against noise perturbations using Hessians. Empirically, we find that Hessian-based measurements correlate accurately with the observed generalization gaps of graph neural networks; optimizing noise stability properties for fine-tuning pretrained graph neural networks also improves test performance on several graph-level classification tasks.

Poster
Zenan Ling · Xingyu Xie · Qiuhao Wang · Zongpeng Zhang · Zhouchen Lin

[ Auditorium 1 Foyer ]

A deep equilibrium model (DEQ) is implicitly defined through an equilibrium point of an infinite-depth weight-tied model with an input-injection. Instead of infinite computations, it solves an equilibrium point directly with root-finding and computes gradients with implicit differentiation. In this paper, the training dynamics of over-parameterized DEQs are investigated, and we propose a novel probabilistic framework to overcome the challenge arising from the weight-sharing and the infinite depth. By supposing a condition on the initial equilibrium point, we prove that the gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function. We further perform a fine-grained non-asymptotic analysis about random DEQs and the corresponding weight-untied models, and show that the required initial condition is satisfied via mild over-parameterization. Moreover, we show that the unique equilibrium point always exists during the training.
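A minimal DEQ-style equilibrium computation, using naive fixed-point iteration in place of a dedicated root-finder. The sizes and the `0.1` weight scaling (which keeps the update a contraction, so the equilibrium exists and is unique) are illustrative assumptions.

```python
import numpy as np

# a tiny deep equilibrium layer: find z* satisfying
# z* = tanh(W z* + U x), i.e. the equilibrium of an infinite-depth
# weight-tied network with input injection U x
rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(4, 4))   # small norm => contraction
U = rng.normal(size=(4, 2))
x = rng.normal(size=2)

z = np.zeros(4)
for _ in range(100):
    z = np.tanh(W @ z + U @ x)      # fixed-point iteration
residual = np.linalg.norm(z - np.tanh(W @ z + U @ x))
print(residual)  # ~0: z is an equilibrium point
```

In practice DEQs use faster root-finders (e.g. Broyden-type methods) and implicit differentiation for gradients; the paper's analysis concerns when such an equilibrium exists and how gradient descent on it behaves under over-parameterization.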

Poster
Xinyi Xu · Zhaoxuan Wu · Arun Verma · Chuan Sheng Foo · Bryan Kian Hsiang Low

[ Auditorium 1 Foyer ]

Scientific discovery aims to find new patterns and test specific hypotheses by analysing large-scale experimental data. However, various practical limitations (e.g., high experimental costs or the inability to perform some experiments) make it challenging for researchers to collect sufficient experimental data for successful scientific discovery. To this end, we propose a collaborative active learning (CAL) framework that enables researchers to share their experimental data for mutual benefit. Specifically, our proposed coordinated acquisition function sets out to achieve individual rationality and fairness so that everyone can equitably benefit from collaboration. We empirically demonstrate that our method outperforms existing batch active learning ones (adapted to the CAL setting) in terms of both learning performance and fairness on various real-world scientific discovery datasets (biochemistry, material science, and physics).

Poster
Freddie Bickford Smith · Andreas Kirsch · Sebastian Farquhar · Yarin Gal · Adam Foster · Tom Rainforth

[ Auditorium 1 Foyer ]

Information-theoretic approaches to active learning, such as BALD, typically focus on maximising the information gathered about the model parameters. We highlight that this can be suboptimal from the perspective of predictive performance. In particular, BALD fails to account for the input distribution and thus is prone to prioritise data that is of low relevance to prediction. Addressing this shortfall, we propose the expected predictive information gain (EPIG), an acquisition function that measures information gain in the space of predictions rather than parameters. We find that using EPIG leads to stronger predictive performance compared with BALD across a range of datasets and models, and thus provides an appealing drop-in replacement.
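Both acquisition functions can be estimated from posterior predictive samples; the sketch below follows the standard Monte Carlo forms (array shapes and the two-sample toy posterior are illustrative, and the paper's EPIG estimator may differ in detail):

```python
import numpy as np

def entropy(p, axis=-1):
    return -(p * np.log(np.clip(p, 1e-12, None))).sum(axis=axis)

def bald(probs):
    # probs: (S, C) class probabilities from S posterior samples for
    # one candidate input; BALD = mutual information I(y; parameters)
    return entropy(probs.mean(0)) - entropy(probs).mean()

def epig(probs_x, probs_star):
    # probs_x: (S, C) predictions at the candidate input;
    # probs_star: (T, S, C) predictions at T target inputs drawn from
    # the input distribution.  EPIG = E_{x*}[ I(y; y*) ]
    S = probs_x.shape[0]
    vals = []
    for p_star in probs_star:
        joint = probs_x.T @ p_star / S               # p(y, y*)
        indep = np.outer(probs_x.mean(0), p_star.mean(0))
        vals.append((joint * np.log(np.clip(joint, 1e-12, None)
                                    / np.clip(indep, 1e-12, None))).sum())
    return float(np.mean(vals))

# two posterior samples that disagree at the candidate input
probs = np.array([[0.9, 0.1], [0.1, 0.9]])
print(bald(probs))               # > 0: high parameter information
print(epig(probs, probs[None]))  # > 0: also informative for targets
print(epig(probs, np.array([[[0.5, 0.5], [0.5, 0.5]]])))  # 0
```

The last line shows the point of the abstract: if predictions at the targets are unaffected by the parameter uncertainty, EPIG is zero even though BALD is large, so BALD would wastefully prioritize such a point.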

Poster
Aarshvi Gajjar · Christopher Musco · Chinmay Hegde

[ Auditorium 1 Foyer ]

We consider the problem of active learning for single neuron models, also sometimes called “ridge functions”, in the agnostic setting (under adversarial label noise). Such models have been shown to be broadly effective in modeling physical phenomena, and for constructing surrogate data-driven models for partial differential equations. Surprisingly, we show that for a single neuron model with any Lipschitz non-linearity (such as the ReLU, sigmoid, absolute value, or low-degree polynomial, among others), strong provable approximation guarantees can be obtained using a well-known active learning strategy for fitting linear functions in the agnostic setting. Namely, we can collect samples via statistical leverage score sampling, which has been shown to be near-optimal in other active learning scenarios. We support our theoretical results with empirical simulations showing that our proposed active learning strategy based on leverage score sampling outperforms (ordinary) uniform sampling when fitting single neuron models.
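Statistical leverage score sampling itself is easy to sketch; the single-neuron fitting step is omitted, and the sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.normal(size=(n, d))           # feature matrix of query points

# statistical leverage scores: squared row norms of an orthonormal
# basis for the column span of A; they sum to rank(A) and measure how
# influential each row is for least-squares fits
Q, _ = np.linalg.qr(A)
scores = (Q ** 2).sum(axis=1)
probs = scores / scores.sum()

# query m labels proportionally to leverage, reweighting each sampled
# row by 1 / (m * p_i) so the subsampled objective is unbiased
m = 100
idx = rng.choice(n, size=m, replace=True, p=probs)
weights = 1.0 / (m * probs[idx])
print(scores.sum())  # = d when A has full column rank
```

The active learning step is then to fit the model using only the `m` sampled, reweighted rows instead of labeling all `n` points.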

Poster
Ben London · Levi(Chenhui) Lu · Ted Sandler · Thorsten Joachims

[ Auditorium 1 Foyer ]

We propose the first boosting algorithm for off-policy learning from logged bandit feedback. Unlike existing boosting methods for supervised learning, our algorithm directly optimizes an estimate of the policy's expected reward. We analyze this algorithm and prove that the excess empirical risk decreases (possibly exponentially fast) with each round of boosting, provided a "weak" learning condition is satisfied by the base learner. We further show how to reduce the base learner to supervised learning, which opens up a broad range of readily available base learners with practical benefits, such as decision trees. Experiments indicate that our algorithm inherits many desirable properties of tree-based boosting algorithms (e.g., robustness to feature scaling and hyperparameter tuning), and that it can outperform off-policy learning with deep neural networks as well as methods that simply regress on the observed rewards.

Poster
Koki Okajima · Xiangming Meng · Takashi Takahashi · Yoshiyuki Kabashima

[ Auditorium 1 Foyer ]

We analyze the performance of the least absolute shrinkage and selection operator (Lasso) for the linear model when the number of regressors $N$ grows larger keeping the true support size $d$ finite, i.e., the ultra-sparse case. The result is based on a novel treatment of the non-rigorous replica method in statistical physics, which has been applied only to problem settings where $N$, $d$ and the number of observations $M$ tend to infinity at the same rate. Our analysis makes it possible to assess the average performance of Lasso with Gaussian sensing matrices without assumptions on the scaling of $N$ and $M$, the noise distribution, and the profile of the true signal. Under mild conditions on the noise distribution, the analysis also offers a lower bound on the sample complexity necessary for partial and perfect support recovery when $M$ diverges as $M = O(\log N)$. The obtained bound for perfect support recovery is a generalization of that given in previous literature, which only considers the case of Gaussian noise and diverging $d$. Extensive numerical experiments strongly support our analysis.

Poster
Itay Evron · Ophir Onn · Tamar Weiss · Hai Azeroual · Daniel Soudry

[ Auditorium 1 Foyer ]

Error-correcting codes (ECC) are used to reduce multiclass classification tasks to multiple binary classification subproblems. In ECC, classes are represented by the rows of a binary matrix, corresponding to codewords in a codebook. Codebooks are commonly either predefined or problem dependent. Given predefined codebooks, codeword-to-class assignments are traditionally overlooked, and codewords are implicitly assigned to classes arbitrarily. Our paper shows that these assignments play a major role in the performance of ECC. Specifically, we examine similarity-preserving assignments, where similar codewords are assigned to similar classes. Addressing a controversy in the existing literature, our extensive experiments confirm that similarity-preserving assignments induce easier subproblems and are superior to other assignment policies in terms of their generalization performance. We find that similarity-preserving assignments make predefined codebooks problem-dependent, without altering other favorable codebook properties. Finally, we show that our findings can improve predefined codebooks dedicated to extreme classification.
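The ECC pipeline can be illustrated with a toy codebook; which class receives which row is exactly the codeword-to-class assignment the paper studies. The codebook and bit predictions below are made up for illustration.

```python
import numpy as np

# toy ECC multiclass prediction: each class is a codeword (row) of a
# binary codebook; one binary classifier is trained per column, and a
# test point is assigned to the class whose codeword is nearest in
# Hamming distance to the predicted bit vector
codebook = np.array([[0, 0, 0],
                     [0, 1, 1],
                     [1, 0, 1],
                     [1, 1, 0]])    # 4 classes, 3 binary subproblems
bits = np.array([0, 1, 1])          # outputs of the binary classifiers
hamming = (codebook != bits).sum(axis=1)
pred = int(hamming.argmin())
print(pred)  # → 1
```

A similarity-preserving assignment would permute the rows so that classes that are semantically close receive codewords that are close in Hamming distance, making each column's binary subproblem easier to learn.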

Poster
Xiangqian SUN · Cheuk Hang Leung · Yijun Li · Qi Wu

[ Auditorium 1 Foyer ]

Subset selection, i.e., selecting a subset of items from a collection to achieve specific goals, has wide applications in information retrieval, statistics, and machine learning. To implement an end-to-end learning framework, different relaxed differentiable operators of subset selection have been proposed. Most existing work relies on either the regularization method or the perturbation method. In this work, we provide a probabilistic interpretation for the regularization relaxation and unify the two schemes. In addition, we construct concrete examples to show the generic connection between these two relaxations. Finally, we evaluate the perturbed selector as well as the regularized selector on two tasks: the maximum entropy sampling problem and the feature selection problem. The experimental results show that these two methods achieve competitive performance against other benchmarks.
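A minimal perturbation-style relaxation: a Monte Carlo average of hard top-k selections under Gumbel noise on the scores. Names and parameters are illustrative, and the paper's perturbed selector may differ in detail.

```python
import numpy as np

def perturbed_topk(scores, k, tau=0.1, n_samples=500, seed=0):
    # perturbation-style relaxation of subset selection: average hard
    # top-k indicator vectors under Gumbel noise on the scores; tau
    # controls how "soft" the relaxed selection is
    rng = np.random.default_rng(seed)
    sel = np.zeros_like(scores, dtype=float)
    for _ in range(n_samples):
        noisy = scores + tau * rng.gumbel(size=scores.shape)
        sel[np.argsort(noisy)[-k:]] += 1.0
    return sel / n_samples

s = np.array([3.0, 2.9, 0.1, 0.0])
sel = perturbed_topk(s, k=2)
print(sel)  # ≈ [1, 1, 0, 0]: a soft indicator of the selected subset
```

The regularization route instead smooths the top-k operator directly (e.g. with an entropy term); the probabilistic interpretation in the abstract is what connects the two views.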

Poster
Mukund Sudarshan · Aahlad Puli · Wesley Tansey · Rajesh Ranganath

[ Auditorium 1 Foyer ]

Conditional randomization tests (CRTs) assess whether a variable x is predictive of another variable y, having observed covariates z. CRTs require fitting a large number of predictive models, which is often computationally intractable. Existing solutions to reduce the cost of CRTs typically split the dataset into a train and test portion, or rely on heuristics for interactions, both of which lead to a loss in power. We propose the decoupled independence test (DIET), an algorithm that avoids both of these issues by leveraging marginal independence statistics to test conditional independence relationships. DIET tests the marginal independence of two random variables: F(x∣z) and F(y∣z) where F(⋅∣z) is a conditional cumulative distribution function (CDF). These variables are termed "information residuals." We give sufficient conditions for DIET to achieve finite sample type-1 error control and power greater than the type-1 error rate. We then prove that when using the mutual information between the information residuals as a test statistic, DIET yields the most powerful conditionally valid test. Finally, we show DIET achieves higher power than other tractable CRTs on several synthetic and real benchmarks.
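The construction of information residuals can be sketched in a toy setting where the conditional models are known rather than estimated (in DIET itself, F(x|z) and F(y|z) must be fit; the Gaussian models below are assumptions for illustration):

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(t):
    # standard normal CDF
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

rng = np.random.default_rng(1)
n = 5000
z = rng.normal(size=n)
x = z + rng.normal(size=n)   # x depends on z
y = z + rng.normal(size=n)   # y depends on z only, so x ⟂ y | z

# "information residuals": conditional CDFs evaluated at the data,
# here under the known models x|z ~ N(z, 1) and y|z ~ N(z, 1)
u = np.array([norm_cdf(v) for v in x - z])
w = np.array([norm_cdf(v) for v in y - z])

# a simple marginal-independence statistic on the residuals
r = np.corrcoef(u, w)[0, 1]
print(abs(r))  # small, consistent with conditional independence
```

DIET replaces the correlation here with a mutual-information statistic, for which the abstract states an optimality result; the key computational point is that only marginal independence of the residuals needs to be tested.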

Poster
Yuka Hashimoto · Masahiro Ikeda · Hachem Kadri

[ Auditorium 1 Foyer ]

Supervised learning in reproducing kernel Hilbert space (RKHS) and vector-valued RKHS (vvRKHS) has been investigated for more than 30 years. In this paper, we provide a new twist to this rich literature by generalizing supervised learning in RKHS and vvRKHS to reproducing kernel Hilbert C*-module (RKHM), and show how to construct effective positive-definite kernels by considering the perspective of C*-algebra. Unlike the cases of RKHS and vvRKHS, we can use C*-algebras to enlarge representation spaces. This enables us to construct RKHMs whose representation power goes beyond RKHSs, vvRKHSs, and existing methods such as convolutional neural networks. Our framework is suitable, for example, for effectively analyzing image data by allowing the interaction of Fourier components.

Poster
Frederiek Wesel · Kim Batselier

[ Auditorium 1 Foyer ]

Kernel machines are one of the most studied families of methods in machine learning. In the exact setting, training requires instantiating the kernel matrix, prohibiting application to large-sampled data. One popular kernel approximation strategy that can handle large-sampled data consists in interpolating product kernels on a set of grid-structured inducing points. However, since the number of model parameters increases exponentially with the dimensionality of the data, these methods are limited to small-dimensional datasets. In this work we lift this limitation entirely by placing inducing points on a grid and constraining the primal weights to be a low-rank Canonical Polyadic Decomposition. We derive a block coordinate descent algorithm that efficiently exploits grid-structured inducing points. The computational complexity of the algorithm scales linearly both in the number of samples and in the dimensionality of the data for any product kernel. We demonstrate the performance of our algorithm on large-scale and high-dimensional data, achieving state-of-the-art results on a laptop computer. Our results show that grid-structured approaches can work in higher-dimensional problems.

Poster
Junwen Yao · Benjamin Erichson · Miles Lopes

[ Auditorium 1 Foyer ]

Random Fourier Features (RFF) is among the most popular and broadly applicable approaches for scaling up kernel methods. In essence, RFF allows the user to avoid costly computations with a large kernel matrix via a fast randomized approximation. However, a pervasive difficulty in applying RFF is that the user does not know the actual error of the approximation, or how this error will propagate into downstream learning tasks. Up to now, the RFF literature has primarily dealt with these uncertainties using theoretical error bounds, but from a user's standpoint such results are typically impractical, either because they are highly conservative or because they involve unknown quantities. To tackle these general issues in a data-driven way, this paper develops a bootstrap approach to numerically estimate the errors of RFF approximations. Three key advantages of this approach are: (1) The error estimates are specific to the problem at hand, avoiding the pessimism of worst-case bounds. (2) The approach is flexible with respect to different uses of RFF, and can even estimate errors in downstream learning tasks. (3) The approach enables adaptive computation, in the sense that the user can quickly inspect the error of a rough initial kernel approximation and then predict how much extra …
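The bootstrap idea, resampling the random features themselves, can be sketched as follows (a simplified version with illustrative sizes; the paper's estimator may differ in detail):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, D = 60, 3, 400
X = rng.normal(size=(n, d))

# random Fourier features for the RBF kernel exp(-||u - v||^2 / 2)
W = rng.normal(size=(d, D))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)
K_hat = Z @ Z.T

# bootstrap over the D random features: resample feature columns with
# replacement, recompute the kernel estimate, and read off a quantile
# of the resulting fluctuations as a data-driven error estimate
boot_errs = []
for _ in range(200):
    idx = rng.integers(0, D, size=D)
    Kb = Z[:, idx] @ Z[:, idx].T
    boot_errs.append(np.abs(Kb - K_hat).max())
err_estimate = float(np.quantile(boot_errs, 0.9))
print(err_estimate)
```

Because the estimate is computed from the features already in hand, the user can grow `D` until the estimated error is acceptable, which is the adaptive-computation advantage mentioned above.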

Poster
Othmane Marfoq · Giovanni Neglia · Laetitia Kameni · Richard Vidal

[ Auditorium 1 Foyer ]

Federated learning (FL) is an effective solution to train machine learning models on the increasing amount of data generated by IoT devices and smartphones while keeping such data localized. Most previous work on federated learning assumes that clients operate on static datasets collected before training starts. This approach may be inefficient because 1) it ignores new samples clients collect during training, and 2) it may require a potentially long preparatory phase for clients to collect enough data. Moreover, learning on static datasets may be simply impossible in scenarios with small aggregate storage across devices. It is, therefore, necessary to design federated algorithms able to learn from data streams. In this work, we formulate and study the problem of federated learning for data streams. We propose a general FL algorithm to learn from data streams through an opportune weighted empirical risk minimization. Our theoretical analysis provides insights to configure such an algorithm, and we evaluate its performance on a wide range of machine learning tasks.

Poster
Andrew Lowy · Ali Ghafelebashi · Meisam Razaviyayn

[ Auditorium 1 Foyer ]

We study federated learning (FL), especially cross-silo FL, with non-convex loss functions and data from people who do not trust the server or other silos. In this setting, each silo (e.g. hospital) must protect the privacy of each person’s data (e.g. patient’s medical record), even if the server or other silos act as adversarial eavesdroppers. To that end, we consider inter-silo record-level (ISRL) differential privacy (DP), which requires silo $i$’s communications to satisfy record/item-level DP. We give novel ISRL-DP algorithms for FL with heterogeneous (non-i.i.d.) silo data and two classes of Lipschitz continuous loss functions: First, we consider losses satisfying the Proximal Polyak-Łojasiewicz (PL) inequality, which is an extension of the classical PL condition to the constrained setting. Prior works only considered unconstrained private optimization with Lipschitz PL loss, which rules out most interesting PL losses such as strongly convex problems and linear/logistic regression. However, by analyzing the proximal PL scenario, we permit these losses and others (e.g. LASSO, some neural nets) which are Lipschitz on a restricted parameter domain. Our algorithms nearly attain the optimal strongly convex, homogeneous (i.i.d.) rate for ISRL-DP FL without assuming convexity or i.i.d. data. Second, we give the first private algorithms for non-convex non-smooth loss functions. …
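The record-level protection unit can be illustrated with the standard clipped-gradient Gaussian mechanism. This is a generic DP primitive, not the paper's ISRL-DP algorithm; sizes and noise scale are illustrative.

```python
import numpy as np

# record-level DP primitive inside one silo: clip each per-record
# gradient to bound its influence, then add Gaussian noise calibrated
# to the clipping norm before releasing the (averaged) update
rng = np.random.default_rng(0)
grads = rng.normal(size=(32, 10))   # per-record gradients in one silo
clip_norm, sigma = 1.0, 1.2

norms = np.linalg.norm(grads, axis=1, keepdims=True)
clipped = grads / np.maximum(1.0, norms / clip_norm)
noisy_sum = clipped.sum(axis=0) + sigma * clip_norm * rng.normal(size=10)
step = noisy_sum / grads.shape[0]   # noisy average gradient
print(np.linalg.norm(clipped, axis=1).max())  # bounded by clip_norm
```

ISRL-DP requires every message a silo sends to satisfy such record-level DP, so even an adversarial server learns little about any single patient's record.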

Poster
Nikolaos Nakis · Abdulkadir Celikkanat · Louis Boucherie · Christian Djurhuus · Felix Burmester · Daniel Mathias Holmelund · Monika Frolcová · Morten Mørup

[ Auditorium 1 Foyer ]

Graph representation learning has become a prominent tool for the characterization and understanding of the structure of networks in general and social networks in particular. Typically, these representation learning approaches embed the networks into a low-dimensional space in which the role of each individual can be characterized in terms of their latent position. A major current concern in social networks is the emergence of polarization and filter bubbles promoting a mindset of ''us-versus-them'' that may be defined by extreme positions believed to ultimately lead to political violence and the erosion of democracy. Such polarized networks are typically characterized in terms of signed links reflecting likes and dislikes. We first propose the Signed Latent Distance Model (SLDM) utilizing for the first time the Skellam distribution as a likelihood function for signed networks while extending the modeling to the characterization of distinct extreme positions by constraining the embedding space to polytopes, forming the Signed Latent relational dIstance Model (SLIM). On four real social signed networks of polarization, we demonstrate that the models extract low-dimensional characterizations that well predict friendships and animosity while SLIM provides interpretable visualizations defined by extreme positions when endowing the model with an embedding space restricted to polytopes.
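The Skellam likelihood is simply the law of a difference of two independent Poisson counts, which is what makes it a natural fit for signed integer edge weights (illustrative rates below, e.g. likes minus dislikes between two users):

```python
import numpy as np

# a Skellam variable is the difference of two independent Poissons;
# its mean is lam_pos - lam_neg and its variance lam_pos + lam_neg,
# so it naturally produces signed integer edge weights
rng = np.random.default_rng(0)
lam_pos, lam_neg = 3.0, 1.0
w = rng.poisson(lam_pos, 100_000) - rng.poisson(lam_neg, 100_000)
print(w.mean())   # ≈ lam_pos - lam_neg = 2.0
print(w.var())    # ≈ lam_pos + lam_neg = 4.0
```

In SLDM/SLIM the two rates for an edge are functions of the latent distance between the endpoints, so nearby nodes tend to produce positive weights and distant ones negative weights.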

Poster
Hiromi Narimatsu · Mayuko Ozawa · Shiro Kumano

[ Auditorium 1 Foyer ]

Two important aspects of machine learning, uncertainty and calibration, have previously been studied separately. The first aspect involves knowing whether inaccuracy is due to the epistemic uncertainty of the model, which is theoretically reducible, or to the aleatoric uncertainty in the data per se, which thus becomes the upper bound of model performance. As for the second aspect, numerous calibration methods have been proposed to correct predictive probabilities to better reflect the true probabilities of being correct. In this paper, we aim to obtain the squared error of the predictive distribution from the true distribution as epistemic uncertainty. Our formulation, based on second-order Rényi entropy, integrates the two problems into a unified framework and obtains the epistemic (un)certainty as the difference between the aleatoric and predictive (un)certainties. As an auxiliary loss to ordinary losses, such as cross-entropy loss, the proposed collision probability matching (CPM) loss matches the cross-collision probability between the true and predictive distributions to the collision probability of the predictive distribution, where these probabilities correspond to accuracy and confidence, respectively. Unlike previous Shannon-entropy-based uncertainty methods, the proposed method makes the aleatoric uncertainty directly measurable as test-retest reliability, which is a summary statistic of the true distribution frequently used in …
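Collision probabilities have simple closed forms for discrete distributions. The sketch below computes the two quantities the CPM loss matches; the example distributions are made up, and reading them as "confidence" and "accuracy" follows the paper's interpretation.

```python
import numpy as np

def collision_prob(p):
    # probability that two i.i.d. draws from p coincide;
    # equals exp(-H2(p)) for the second-order Rényi entropy H2
    p = np.asarray(p, dtype=float)
    return float((p ** 2).sum())

def cross_collision_prob(p, q):
    # probability that a draw from p coincides with a draw from q
    return float((np.asarray(p, float) * np.asarray(q, float)).sum())

p_true = [0.7, 0.2, 0.1]   # hypothetical true label distribution
p_pred = [0.6, 0.3, 0.1]   # hypothetical predictive distribution
print(collision_prob(p_pred))                 # "confidence": 0.46
print(cross_collision_prob(p_true, p_pred))   # "accuracy":   0.49
```

The CPM loss drives these two quantities together, so that the model's confidence tracks its probability of agreeing with the true distribution.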

Poster
Christian Fabian · Kai Cui · Heinz Koeppl

[ Auditorium 1 Foyer ]

Although the field of multi-agent reinforcement learning (MARL) has made considerable progress in recent years, solving systems with a large number of agents remains a hard challenge. Graphon mean field games (GMFGs) enable the scalable analysis of MARL problems that are otherwise intractable. Owing to the mathematical structure of graphons, this approach is limited to dense graphs, which are insufficient to describe many real-world networks such as power law graphs. Our paper introduces a novel formulation of GMFGs, called LPGMFGs, which leverages the graph theoretical concept of $L^p$ graphons and provides a machine learning tool to efficiently and accurately approximate solutions for sparse network problems. This especially includes power law networks which are empirically observed in various application areas and cannot be captured by standard graphons. We derive theoretical existence and convergence guarantees and give empirical examples that demonstrate the accuracy of our learning approach for systems with many agents. Furthermore, we extend the Online Mirror Descent (OMD) learning algorithm to our setup to accelerate learning speed, empirically show its capabilities, and conduct a theoretical analysis using the novel concept of smoothed step graphons. In general, we provide a scalable, mathematically well-founded machine learning approach to a large class of …
Poster
Mengxiao Zhang · Shi Chen · Haipeng Luo · Yingfei Wang

[ Auditorium 1 Foyer ]

Supply chain management (SCM) has been recognized as an important discipline with applications to many industries, where the two-echelon stochastic inventory model, involving one downstream retailer and one upstream supplier, plays a fundamental role for developing firms' SCM strategies. In this work, we aim at designing online learning algorithms for this problem with an unknown demand distribution, which brings distinct features as compared to classic online convex optimization problems. Specifically, we consider the two-echelon supply chain model introduced in [Cachon and Zipkin, 1999] under two different settings: the centralized setting, where a planner decides both agents' strategy simultaneously, and the decentralized setting, where two agents decide their strategy independently and selfishly. We design algorithms that achieve favorable guarantees for both regret and convergence to the optimal inventory decision in both settings, and additionally for individual regret in the decentralized setting. Our algorithms are based on Online Gradient Descent and Online Newton Step, together with several new ingredients specifically designed for our problem. We also implement our algorithms and show their empirical effectiveness.
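The flavor of the online approach can be illustrated on the simplest single-stage inventory subproblem, the newsvendor model, using generic projected Online Gradient Descent. The costs, step sizes, and demand distribution below are illustrative; this is not the paper's algorithm for the two-echelon model.

```python
import numpy as np

rng = np.random.default_rng(0)
h, b = 1.0, 3.0                 # illustrative per-unit holding and backlog costs
q, lo, hi = 5.0, 0.0, 10.0      # order-up-to level and its feasible range
levels = []
for t in range(1, 5001):
    demand = rng.uniform(0.0, 10.0)           # demand distribution is unknown to the learner
    g = h if q > demand else -b               # subgradient of the newsvendor loss at q
    q = min(max(q - g / np.sqrt(t), lo), hi)  # projected OGD step with 1/sqrt(t) step size
    levels.append(q)
# the optimal level is the b/(b+h) = 0.75 demand quantile, i.e. 7.5 here
```

Despite never estimating the demand distribution, the iterates drift toward the optimal critical-fractile level.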

Poster
Dheeraj Baby · Yu-Xiang Wang

[ Auditorium 1 Foyer ]

We consider the problem of \emph{universal} dynamic regret minimization under exp-concave and smooth losses. We show that appropriately designed Strongly Adaptive algorithms achieve a dynamic regret of $\tilde O(d^2 n^{1/5} [\mathcal{TV}_1(w_{1:n})]^{2/5} \vee d^2)$, where $n$ is the time horizon and $\mathcal{TV}_1(w_{1:n})$ a path variational based on \emph{second order differences} of the comparator sequence. Such a path variational naturally encodes comparator sequences that are piece-wise linear -- a powerful family that tracks a variety of non-stationarity patterns in practice (Kim et al, 2009). The aforementioned dynamic regret is shown to be optimal modulo dimension dependencies and poly-logarithmic factors of $n$. To the best of our knowledge, this path variational has not been studied in the non-stochastic online learning literature before. Our proof techniques rely on analysing the KKT conditions of the offline oracle and require several non-trivial generalizations of the ideas in Baby & Wang (2021), where the latter work only implies an $\tilde{O}(n^{1/3})$ regret for the current problem.
Poster
Devansh Jalota · Karthik Gopalakrishnan · Navid Azizan · Ramesh Johari · Marco Pavone

[ Auditorium 1 Foyer ]

In transportation networks, road tolling schemes are a method to cope with the efficiency losses due to selfish user routing, wherein users choose routes to minimize individual travel costs. However, the efficacy of tolling schemes often relies on access to complete information on users' trip attributes, such as their origin-destination (O-D) travel information and their values of time, which may not be available in practice. Motivated by this practical consideration, we propose an online learning approach to set tolls in a traffic network to drive heterogeneous users with different values of time toward a system-efficient traffic pattern. In particular, we develop a simple yet effective algorithm that adjusts tolls at each time period solely based on the observed aggregate flows on the roads of the network without relying on any additional trip attributes of users, thereby preserving user privacy. In the setting where the O-D pairs and values of time of users are drawn i.i.d. at each period, we show that our approach obtains an expected regret and road capacity violation of $O(\sqrt{T})$, where $T$ is the number of periods over which tolls are updated. Our regret guarantee is relative to an offline oracle with complete information on users' trip …
Poster
Tom Norman · Kfir Yehuda Levy · Nir Weinberger

[ Auditorium 1 Foyer ]

We investigate robust linear regression where data may be contaminated by an oblivious adversary, i.e., an adversary that knows the data distribution but is otherwise oblivious to the realization of the data samples. This model has been previously analyzed under strong assumptions. Concretely, (i) all previous works assume that the covariance matrix of the features is positive definite; (ii) most of them assume that the features are centered. Additionally, all previous works make additional restrictive assumptions, e.g., assuming Gaussianity of the features or symmetric distribution of the corruptions. In this work, we investigate robust regression under a more general set of assumptions: (i) the covariance matrix may be either positive definite or positive semi-definite, (ii) features may not be centered, (iii) no assumptions beyond boundedness (or sub-Gaussianity) of the features and the measurement noise. Under these assumptions we analyze a sequential algorithm, namely, a natural SGD variant for this problem, and show that it enjoys a fast convergence rate when the covariance matrix is positive definite. In the positive semi-definite case we show that there are two regimes: if the features are centered, we can obtain a standard convergence rate; otherwise, the adversary can cause any learner to …

Poster
Jieyu Zhang · Linxin Song · Alex Ratner

[ Auditorium 1 Foyer ]

Programmatic Weak Supervision (PWS) has emerged as a widespread paradigm to synthesize training labels efficiently. The core component of PWS is the label model, which infers true labels by aggregating the outputs of multiple noisy supervision sources abstracted as labeling functions (LFs). Existing statistical label models typically rely only on the outputs of LFs, ignoring the instance features when modeling the underlying generative process. In this paper, we attempt to incorporate the instance features into a statistical label model via the proposed FABLE. In particular, it is built on a mixture of Bayesian label models, each corresponding to a global pattern of correlation, and the coefficients of the mixture components are predicted by a Gaussian Process classifier based on instance features. We adopt an auxiliary variable-based variational inference algorithm to tackle the non-conjugate issue between the Gaussian Process and Bayesian label models. An extensive empirical comparison on eleven benchmark datasets shows FABLE achieving the highest average performance against nine baselines. Our implementation of FABLE can be found in https://github.com/JieyuZ2/wrench/blob/main/wrench/labelmodel/fable.py.

Poster
Omer Gottesman · Kavosh Asadi · Cameron Allen · Samuel Lobel · George Konidaris · Michael Littman

[ Auditorium 1 Foyer ]

Principled decision-making in continuous state--action spaces is impossible without some assumptions. A common approach is to assume Lipschitz continuity of the Q-function. We show that, unfortunately, this property fails to hold in many typical domains. We propose a new coarse-grained smoothness definition that generalizes the notion of Lipschitz continuity, is more widely applicable, and allows us to compute significantly tighter bounds on Q-functions, leading to improved learning. We provide a theoretical analysis of our new smoothness definition, and discuss its implications and impact on control and exploration in continuous domains.

Poster
Mengdi Xu · Peide Huang · Yaru Niu · Visak Kumar · Jielin Qiu · Chao Fang · Kuan-Hui Lee · Xuewei Qi · Henry Lam · Bo Li · DING ZHAO

[ Auditorium 1 Foyer ]

One key challenge for multi-task reinforcement learning (RL) in practice is the absence of task specifications. Robust RL has been applied to deal with task ambiguity but may result in over-conservative policies. To balance the worst-case (robustness) and average performance, we propose the Group Distributionally Robust Markov Decision Process (GDR-MDP), a flexible hierarchical MDP formulation that encodes task groups via a latent mixture model. GDR-MDP identifies the optimal policy that maximizes the expected return under the worst-possible qualified belief over task groups within an ambiguity set. We rigorously show that GDR-MDP's hierarchical structure improves distributional robustness by adding regularization to the worst possible outcomes. We then develop deep RL algorithms for GDR-MDP for both value-based and policy-based RL methods. Extensive experiments on Box2D control tasks, MuJoCo benchmarks, and the Google Research Football platform show that our algorithms outperform classic robust training algorithms across diverse environments in terms of robustness under belief uncertainties.

Poster
Egor Gladin · Maksim Lavrik-Karmazin · Karina Zainullina · Varvara Rudenko · Alexander Gasnikov · Martin Takac

[ Auditorium 1 Foyer ]

The problem of constrained Markov decision processes is considered. An agent aims to maximize the expected accumulated discounted reward subject to multiple constraints on its costs (the number of constraints is relatively small). A new dual approach is proposed with the integration of two ingredients: an entropy-regularized policy optimizer and Vaidya's dual optimizer, both of which are critical to achieving faster convergence. The finite-time error bound of the proposed approach is provided. Despite the challenge of the nonconcave objective subject to nonconcave constraints, the proposed approach is shown to converge (with linear rate) to the global optimum. The complexity expressed in terms of the optimality gap and the constraint violation significantly improves upon the existing primal-dual approaches.

Poster
Yiding Chen · Xuezhou Zhang · Kaiqing Zhang · Mengdi Wang · Xiaojin Zhu

[ Auditorium 1 Foyer ]

We consider a distributed reinforcement learning setting where multiple agents separately explore the environment and communicate their experiences through a central server. However, an $\alpha$-fraction of agents are adversarial and can report arbitrary fake information. Critically, these adversarial agents can collude, and their fake data can be of any size. We desire to robustly identify a near-optimal policy for the underlying Markov decision process in the presence of these adversarial agents. Our main technical contribution is \textsc{COW}, a novel algorithm for the \textit{robust mean estimation from batches} problem, that can handle arbitrary batch sizes. Building upon this new estimator, in the offline setting, we design a Byzantine-robust distributed pessimistic value iteration algorithm; in the online setting, we design a Byzantine-robust distributed optimistic value iteration algorithm. Both algorithms obtain near-optimal sample complexities and achieve stronger robustness guarantees than prior works.
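The paper's COW estimator is not reproduced here, but the aggregation problem it solves can be illustrated with the simplest Byzantine-tolerant baseline, a median of per-batch means (which, unlike COW, is not statistically optimal when batch sizes vary):

```python
import numpy as np

def median_of_batch_means(batches):
    """Robustly aggregate possibly-corrupted batches of scalar samples.

    batches: list of 1-D arrays with arbitrary (adversary-chosen) sizes.

    As long as strictly fewer than half of the batches are adversarial,
    the median of the per-batch means cannot be dragged arbitrarily far
    from the honest agents' common mean.
    """
    return float(np.median([np.mean(b) for b in batches]))
```

Even a single colluding agent reporting a huge batch of extreme values has bounded influence on this estimate, whereas a plain pooled mean would be corrupted completely.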
Poster
Eric Xia · Martin Wainwright

[ Auditorium 1 Foyer ]

We present and analyze the Krylov--Bellman Boosting algorithm for policy evaluation in general state spaces. It alternates between fitting the Bellman residual using non-parametric regression (as in boosting), and estimating the value function via the least-squares temporal difference (LSTD) procedure applied with a feature set that grows adaptively over time. By exploiting the connection to Krylov methods, we equip this method with two attractive guarantees. First, we provide a general convergence bound that allows for separate estimation errors in residual fitting and LSTD computation. Consistent with our numerical experiments, this bound shows that convergence rates depend on the restricted spectral structure, and are typically super-linear. Second, by combining this meta-result with sample-size dependent guarantees for residual fitting and LSTD computation, we obtain concrete statistical guarantees that depend on the sample size along with the complexity of the function class used to fit the residuals. We illustrate the behavior of the KBB algorithm for various types of policy evaluation problems, and typically find large reductions in sample complexity relative to the standard approach of fitted value iteration.

Poster
Andrew Bennett · Dipendra Misra · Nathan Kallus

[ Auditorium 1 Foyer ]

Safety is a crucial necessity in many applications of reinforcement learning (RL), whether robotic, automotive, or medical. Many existing approaches to safe RL rely on receiving numeric safety feedback, but in many cases this feedback can only take binary values; that is, whether an action in a given state is safe or unsafe. This is particularly true when feedback comes from human experts. We therefore consider the problem of provably safe RL when given access to an offline oracle providing binary feedback on the safety of state-action pairs. We provide a novel meta algorithm, SABRE, which can be applied to any MDP setting given access to a blackbox PAC RL algorithm for that setting. SABRE applies concepts from active learning to reinforcement learning to provably control the number of queries to the safety oracle. SABRE works by iteratively exploring the state space to find regions where the agent is currently uncertain about safety. Our main theoretical result shows that, under appropriate technical assumptions, SABRE never takes unsafe actions during training, and is guaranteed to return a near-optimal safe policy with high probability. We provide a discussion of how our meta-algorithm may be applied to various settings studied in both …

Poster
Riashat Islam · Hongyu Zang · Manan Tomar · Aniket Didolkar · Md Mofijul Islam · Samin Yeasar Arnob · Tariq Iqbal · Xin Li · Anirudh Goyal · Nicolas Heess · Alex Lamb

[ Auditorium 1 Foyer ]

Several representation learning methods using self-supervised objectives have been proposed for rich observation space reinforcement learning (RL). For real world applications of RL, recovering underlying latent states is of importance, especially when sensory inputs can contain irrelevant and exogenous information. A central objective of this work is to understand the effectiveness of compressed representations, through the use of information bottlenecks, for learning robust representations in the presence of task-irrelevant information. We propose an algorithm that utilizes variational and discrete information bottlenecks, coined RepDIB, to learn structured factorized representations. By exploiting the expressiveness in representation space due to factorized representations, we introduce a simple, yet effective, representation bottleneck that can be integrated with any existing self-supervised objective for RL. We demonstrate this across several online and offline RL benchmarks, along with a real robot arm task, where we find that compressed representations with RepDIB can lead to effective performance improvements, since the learnt bottlenecks can help with predicting only the relevant state, while ignoring irrelevant information.

Poster
Samuel Alexander · David Quarel · Len Du · Marcus Hutter

[ Auditorium 1 Foyer ]

Inspired by recent progress in multi-agent Reinforcement Learning (RL), in this work we examine the collective intelligent behaviour of theoretical universal agents by introducing a weighted mixture operation. Given a weighted set of agents, their weighted mixture is a new agent whose expected total reward in any environment is the corresponding weighted average of the original agents' expected total rewards in that environment. Thus, if RL agent intelligence is quantified in terms of performance across environments, the weighted mixture's intelligence is the weighted average of the original agents' intelligence. This operation enables various interesting new theorems that shed light on the geometry of RL agent intelligence, namely: results about symmetries, convex agent-sets, and local extrema. We also show that any RL agent intelligence measure based on average performance across environments, subject to certain weak technical conditions, is identical (up to a constant factor) to performance within a single environment dependent on said intelligence measure.
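The defining property of the weighted mixture (its expected total reward is the weighted average of the members' expected total rewards) follows because the mixture commits to one member agent at the start of each episode. A toy check with made-up reward distributions:

```python
import numpy as np

rng = np.random.default_rng(2)
member_means = np.array([1.0, 3.0])   # illustrative expected returns of two agents
weights = np.array([0.25, 0.75])

def mixture_episode():
    # the weighted mixture samples which member to act as, then behaves
    # exactly like that member for the whole episode
    i = rng.choice(len(weights), p=weights)
    return rng.normal(member_means[i], 0.1)  # that member's stochastic return

estimate = np.mean([mixture_episode() for _ in range(20000)])
# estimate should be close to weights @ member_means = 2.5
```

Averaging over environments with the same argument gives the abstract's claim that the mixture's intelligence score is the weighted average of the members' scores.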

Poster
Shengbo Wang · Nian Si · Jose Blanchet · Zhengyuan Zhou

[ Auditorium 1 Foyer ]

We consider a reinforcement learning setting in which the deployment environment is different from the training environment. Applying a robust Markov decision processes formulation, we extend the distributionally robust Q-learning framework studied in [Liu et al., 2022]. Further, we improve the design and analysis of their multi-level Monte Carlo estimator. Assuming access to a simulator, we prove that the worst-case expected sample complexity of our algorithm to learn the optimal robust Q-function within an $\epsilon$ error in the sup norm is upper bounded by $\tilde O(|S||A|(1-\gamma)^{-5}\epsilon^{-2}p_{\wedge}^{-6}\delta^{-4})$, where $\gamma$ is the discount rate, $p_{\wedge}$ is the non-zero minimal support probability of the transition kernels and $\delta$ is the uncertainty size. This is the first sample complexity result for the model-free robust RL problem. Simulation studies further validate our theoretical results.
Poster
Pedro Cisneros · Boxiang Lyu · Sanmi Koyejo · Mladen Kolar

[ Auditorium 1 Foyer ]

While parallelism has been extensively used in Reinforcement Learning (RL), the quantitative effects of parallel exploration are not well understood theoretically. We study the benefits of simple parallel exploration for reward-free RL in linear Markov decision processes (MDPs) and two-player zero-sum Markov games (MGs). In contrast to the existing literature focused on approaches that encourage agents to explore over a diverse set of policies, we show that using a single policy to guide exploration across all agents is sufficient to obtain an almost-linear speedup in all cases compared to their fully sequential counterpart. Further, we show that this simple procedure is near-minimax optimal in the reward-free setting for linear MDPs. From a practical perspective, our paper shows that a single policy is sufficient and provably near-optimal for incorporating parallelism during the exploration phase.

Poster
Wanqiao Xu · Yecheng Ma · Kan Xu · Hamsa Bastani · Osbert Bastani

[ Auditorium 1 Foyer ]

A key challenge to deploying reinforcement learning in practice is avoiding excessive (harmful) exploration in individual episodes. We propose a natural constraint on exploration---uniformly outperforming a conservative policy (adaptively estimated from all data observed thus far), up to a per-episode exploration budget. We design a novel algorithm that uses a UCB reinforcement learning policy for exploration, but overrides it as needed to satisfy our exploration constraint with high probability. Importantly, to ensure unbiased exploration across the state space, our algorithm adaptively determines when to explore. We prove that our approach remains conservative while minimizing regret in the tabular setting. We experimentally validate our results on a sepsis treatment task and an HIV treatment task, demonstrating that our algorithm can learn while ensuring good performance compared to the baseline policy for every patient; the latter task also demonstrates that our approach extends to continuous state spaces via deep reinforcement learning.

Poster
Yassir Jedra · Junghyun Lee · Alexandre Proutiere · Se-Young Yun

[ Auditorium 1 Foyer ]

We consider the problem of model estimation in episodic Block MDPs. In these MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states. We are interested in estimating the latent state decoding function (the mapping from the observations to latent states) based on data generated under a fixed behavior policy. We derive an information-theoretical lower bound on the error rate for estimating this function and present an algorithm approaching this fundamental limit. In turn, our algorithm also provides estimates of all the components of the MDP. We apply our results to the problem of learning near-optimal policies in the reward-free setting. Based on our efficient model estimation algorithm, we show that we can infer a policy converging (as the number of collected samples grows large) to the optimal policy at the best possible rate. Our analysis provides necessary and sufficient conditions under which exploiting the block structure yields improvements in the sample complexity for identifying near-optimal policies. When these conditions are met, the sample complexity in the minimax reward-free setting is improved by a multiplicative factor $n$, where $n$ is the number of possible contexts.
Poster
Alberto Maria Metelli · Mirco Mutti · Marcello Restelli

[ Auditorium 1 Foyer ]

The most relevant problems in discounted reinforcement learning involve estimating the mean of a function under the stationary distribution of a Markov reward process, such as the expected return in policy evaluation, or the policy gradient in policy optimization. In practice, these estimates are produced through a finite-horizon episodic sampling, which neglects the mixing properties of the Markov process. It is mostly unclear how this mismatch between the practical and the ideal setting affects the estimation, and the literature lacks a formal study on the pitfalls of episodic sampling, and how to do it optimally. In this paper, we present a minimax lower bound on the discounted mean estimation problem that explicitly connects the estimation error with the mixing properties of the Markov process and the discount factor. Then, we provide a statistical analysis on a set of notable estimators and the corresponding sampling procedures, which includes the finite-horizon estimators often used in practice. Crucially, we show that estimating the mean by directly sampling from the discounted kernel of the Markov process yields compelling statistical properties relative to the alternative estimators, as it matches the lower bound without requiring a careful tuning of the episode horizon.

Poster
Yuqing Hu · Stephane Pateux · Vincent Gripon

[ Auditorium 1 Foyer ]

Transductive few-shot learning has gained increased attention given the cost of data annotations and the accuracy gains provided by unlabelled samples in the few-shot domain. In particular, in Few-Shot Classification (FSC), recent works explore the feature distributions aiming at maximizing likelihoods or posteriors with respect to the unknown parameters. Following this vein, and considering the parallel between FSC and clustering, we seek to better account for the uncertainty in estimation due to the lack of data, as well as for better statistical properties of the clusters associated with each class. Therefore, in this paper we propose a new clustering method based on Variational Bayesian inference, further improved by Adaptive Dimension Reduction based on Probabilistic Linear Discriminant Analysis. Our proposed method significantly improves accuracy in the realistic unbalanced transductive setting on various few-shot benchmarks when applied to features used in previous studies, with a gain of up to $6\%$ in accuracy. In addition, when applied to the balanced setting, we obtain very competitive results without making use of the class-balance artefact, which is disputable for practical use cases.
Poster
Josh Givens · Henry W. J. Reeve · Song Liu

[ Auditorium 1 Foyer ]

Density Ratio Estimation (DRE) is an important machine learning technique with many downstream applications. We consider the challenge of DRE with non-uniformly missing data. In this setting, we show that using standard DRE methods leads to biased results while our proposal (M-KLIEP), an adaptation of the popular DRE procedure KLIEP, restores consistency. Moreover, we provide finite sample estimation error bounds for M-KLIEP, which demonstrate minimax optimality with respect to both sample size and worst-case missingness. We then adapt an important downstream application of DRE, Neyman-Pearson (NP) classification, to this missing data setting. Our procedure both controls Type I error and achieves high power, with high probability. Finally, we demonstrate promising empirical performance on a range of both synthetic data and real world data with simulated missingness.
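For context, vanilla KLIEP (the procedure M-KLIEP adapts) fits a non-negative kernel mixture to the density ratio by maximizing the mean log-ratio over numerator samples under a normalization constraint on denominator samples. A 1-D sketch, without the paper's missingness correction:

```python
import numpy as np
from scipy.optimize import minimize

def kliep_1d(x_nu, x_de, centers, sigma=1.0):
    """Vanilla KLIEP for 1-D data: r(x) = sum_l a_l K(x, c_l), a_l >= 0."""
    phi = lambda x: np.exp(-((np.asarray(x)[:, None] - centers[None, :]) ** 2)
                           / (2.0 * sigma ** 2))
    P_nu, P_de = phi(x_nu), phi(x_de)
    # maximize the mean log-ratio over numerator samples ...
    neg_obj = lambda a: -np.mean(np.log(P_nu @ a + 1e-12))
    # ... subject to the fitted ratio averaging to 1 over denominator samples
    cons = {"type": "eq", "fun": lambda a: P_de.mean(axis=0) @ a - 1.0}
    a0 = np.full(centers.size, 1.0 / centers.size)
    a = minimize(neg_obj, a0, bounds=[(0.0, None)] * centers.size,
                 constraints=cons).x
    return lambda x: phi(x) @ a
```

M-KLIEP reweights these sample averages to undo the non-uniform missingness; with complete data the two coincide.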

Poster
Samuel Stocksieker · Denys Pommeret · Arthur Charpentier

[ Auditorium 1 Foyer ]

In this work, we consider the problem of imbalanced data in a regression framework when the imbalance concerns continuous or discrete covariates. Such a situation can lead to biased estimates. To address it, we propose an algorithm that combines a weighted resampling (WR) and a data augmentation (DA) procedure. In the first step, the DA procedure explores a wider support than the initial one. In the second step, the WR method drives the exogenous distribution toward a target one. We discuss the choice of the DA procedure through a numerical study that illustrates the advantages of this approach. Finally, an actuarial application is studied.
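A minimal illustration of the two steps on a univariate covariate, using Gaussian jitter for the DA step and exact density-ratio weights for the WR step; both choices are illustrative stand-ins, not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 2000)                     # imbalanced covariate sample
# Step 1 (DA): jitter a copy of the data to widen the support
x_aug = np.concatenate([x, x + rng.normal(0.0, 0.5, x.size)])
# Step 2 (WR): resample with weights = target density / source density
norm_pdf = lambda v, m, s2: np.exp(-((v - m) ** 2) / (2 * s2)) / np.sqrt(2 * np.pi * s2)
target = lambda v: norm_pdf(v, 1.0, 1.0)           # drive covariates toward N(1, 1)
source = lambda v: 0.5 * norm_pdf(v, 0.0, 1.0) + 0.5 * norm_pdf(v, 0.0, 1.25)
w = target(x_aug) / source(x_aug)
x_bal = rng.choice(x_aug, size=2000, replace=True, p=w / w.sum())
```

The resampled covariates `x_bal` approximately follow the target distribution, so a regression fit on them is no longer driven by the original imbalance.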

Poster
Nicholas Monath · Manzil Zaheer · Kelsey Allen · Andrew McCallum

[ Auditorium 1 Foyer ]

Dual encoder models are ubiquitous in modern classification and retrieval. Crucial for training such dual encoders is an accurate estimation of gradients from the partition function of the softmax over the large output space; this requires finding the negative targets that contribute most significantly ('hard negatives'). Since dual encoder model parameters change during training, the use of traditional static nearest neighbor indexes can be sub-optimal. These static indexes (1) periodically require expensive re-building of the index, which in turn requires (2) expensive re-encoding of all targets using updated model parameters. This paper addresses both of these challenges. First, we introduce an algorithm that uses a tree structure to approximate the softmax with provable bounds and that dynamically maintains the tree. Second, we approximate the effect of a gradient update on target encodings with an efficient Nyström low-rank approximation. In our empirical study on datasets with over twenty million targets, our approach cuts error in half relative to oracle brute-force negative mining. Furthermore, our method surpasses the prior state-of-the-art while using 150x less accelerator memory.

Poster
Helmuth Naumer · Farzad Kamalabadi

[ Auditorium 1 Foyer ]

This work introduces a method to select linear functional measurements of a vector-valued time series optimized for forecasting distant time-horizons. By formulating and solving the problem of sequential linear measurement design as an infinite-horizon problem with the time-averaged trace of the Cramér-Rao lower bound (CRLB) for forecasting as the cost, the most informative data can be collected irrespective of the eventual forecasting algorithm. By introducing theoretical results regarding measurements under additive noise from natural exponential families, we construct an equivalent problem from which a local dimensionality reduction can be derived. This alternative formulation is based on the future collapse of dimensionality inherent in the limiting behavior of many differential equations and can be directly observed in the low-rank structure of the CRLB for forecasting. Implementations of both an approximate dynamic programming formulation and the proposed alternative are illustrated using an extended Kalman filter for state estimation, with results on simulated systems with limit cycles and chaotic behavior demonstrating a linear improvement in the CRLB as a function of the number of collapsing dimensions of the system.

Poster
Aristeidis Panos · Ioannis Kosmidis · Petros Dellaportas

[ Auditorium 1 Foyer ]

We adopt the interpretability offered by a parametric, Hawkes-process-inspired conditional probability mass function for the marks and apply variational inference techniques to derive a general and scalable inferential framework for marked point processes. The framework can handle both exchangeable and non-exchangeable event sequences with minimal tuning and without any pre-training. This contrasts with many parametric and non-parametric state-of-the-art methods that typically require pre-training and/or careful tuning, and can only handle exchangeable event sequences. The framework's competitive computational and predictive performance against other state-of-the-art methods is illustrated through real data experiments. Its attractiveness for large-scale applications is demonstrated through a case study involving all events occurring in an English Premier League season.

Poster
Minyoung Kim

[ Auditorium 1 Foyer ]

We tackle the cross-modal retrieval problem, where the training is only supervised by the relevant multi-modal pairs in the data. Contrastive learning is the most popular approach for this task. However, it makes the potentially wrong assumption that the instances in different pairs are automatically irrelevant. To address the issue, we propose a novel loss function that is based on self-labeling of the unknown semantic classes. Specifically, we aim to predict class labels of the data instances in each modality, and assign those labels to the corresponding instances in the other modality (i.e., swapping the pseudo labels). With these swapped labels, we learn the data embedding for each modality using the supervised cross-entropy loss. This way, cross-modal instances from different pairs that are semantically related can be aligned to each other by the class predictor. We tested our approach on several real-world cross-modal retrieval problems, including text-based video retrieval, sketch-based image retrieval, and image-text retrieval. For all these tasks our method achieves significant performance improvements over contrastive learning.
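The label-swapping idea can be sketched as follows; using hard argmax pseudo labels is a simplifying assumption for illustration, not necessarily the paper's exact formulation.

```python
import numpy as np

def swapped_label_loss(p_a, p_b):
    """Swapped pseudo-label cross-entropy for paired modalities (sketch).

    p_a, p_b: (N, K) class probabilities predicted independently from the
    two modalities of N relevant pairs. Each modality is supervised with
    cross-entropy against the other modality's hard pseudo labels.
    """
    idx = np.arange(len(p_a))
    labels_a = p_a.argmax(axis=1)   # pseudo labels predicted from modality A
    labels_b = p_b.argmax(axis=1)   # pseudo labels predicted from modality B
    ce_a = -np.log(p_a[idx, labels_b] + 1e-12).mean()  # A trained on B's labels
    ce_b = -np.log(p_b[idx, labels_a] + 1e-12).mean()  # B trained on A's labels
    return ce_a + ce_b
```

The loss is small when the two modalities of a pair agree confidently on a class, without ever assuming that instances from different pairs are unrelated.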

Poster
Tom Hess · Ron Visbord · Sivan Sabato

[ Auditorium 1 Foyer ]

We propose a new algorithm for k-means clustering in a distributed setting, where the data is distributed across many machines, and a coordinator communicates with these machines to calculate the output clustering. Our algorithm guarantees a cost approximation factor and a number of communication rounds that depend only on the computational capacity of the coordinator. Moreover, the algorithm includes a built-in stopping mechanism, which allows it to use fewer communication rounds whenever possible. We show both theoretically and empirically that in many natural cases, indeed 1-4 rounds suffice. In comparison with the popular k-means|| algorithm, our approach allows exploiting a larger coordinator capacity to obtain a smaller number of rounds. Our experiments show that the k-means cost obtained by the proposed algorithm is usually better than the cost obtained by k-means||, even when the latter is allowed a larger number of rounds. Moreover, the machine running time in our approach is considerably smaller than that of k-means||. Code for running the algorithm and experiments is available at https://github.com/selotape/distributedkmeans.

Poster
Christopher Harker · Aditya Bhaskara

[ Auditorium 1 Foyer ]

Nonlinear node embedding techniques such as DeepWalk and Node2Vec are used extensively in practice to uncover structure in graphs. Despite theoretical guarantees in special regimes (such as the case of high embedding dimension), the structure of the optimal low dimensional embeddings has not been formally understood even for graphs obtained from simple generative models. We consider the stochastic block model and show that under appropriate separation conditions, the optimal embeddings can be analytically characterized. Akin to known results on eigenvector based (spectral) embeddings, we prove theoretically that solution vectors are well-clustered, up to a sublinear error.

Poster
Jeroen Berrevoets · Fergus Imrie · Trent Kyono · James Jordon · Mihaela van der Schaar

[ Auditorium 1 Foyer ]

Missing data is a systemic problem in practical scenarios that causes noise and bias when estimating treatment effects. This makes treatment effect estimation from data with missingness a particularly tricky endeavour. A key reason for this is that standard assumptions on missingness are rendered insufficient due to the presence of an additional variable, treatment, besides the input (e.g. an individual) and the label (e.g. an outcome). The treatment variable introduces additional complexity with respect to why some variables are missing that is not fully explored by previous work. In our work we introduce mixed confounded missingness (MCM), a new missingness mechanism where some missingness determines treatment selection and other missingness is determined by treatment selection. Given MCM, we show that naively imputing all data leads to poorly performing treatment effect models, as the act of imputation effectively removes information necessary to provide unbiased estimates. However, no imputation at all also leads to biased estimates, as missingness determined by treatment introduces bias in covariates. Our solution is selective imputation, where we use insights from MCM to inform precisely which variables should be imputed and which should not. We empirically demonstrate how various learners benefit from selective imputation compared to other solutions …
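The mechanics of selective imputation can be sketched as follows; which columns to impute would be determined from the MCM structure, but here the column set is simply an input, and mean imputation stands in for whatever imputer is used (both are illustrative assumptions):

```python
import numpy as np

def selective_impute(X, impute_cols):
    """Illustrative selective imputation: fill missing values (NaN) only
    in `impute_cols`; leave the remaining columns' missingness intact so
    it can still inform the downstream treatment-effect model."""
    X = X.copy()
    for j in impute_cols:
        col = X[:, j]
        miss = np.isnan(col)
        if miss.any() and (~miss).any():
            col[miss] = col[~miss].mean()  # simple mean imputation stand-in
    return X
```

The point of the abstract is precisely that neither `impute_cols = all columns` nor `impute_cols = []` is correct under MCM; only a selected subset should be filled.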

Poster
Milan Kuzmanovic · Tobias Hatt · Stefan Feuerriegel

[ Auditorium 1 Foyer ]

Estimating conditional average treatment effects (CATE) is challenging, especially when treatment information is missing. Although this is a widespread problem in practice, CATE estimation with missing treatments has received little attention. In this paper, we analyze CATE estimation in the setting with missing treatments where unique challenges arise in the form of covariate shifts. We identify two covariate shifts in our setting: (i) a covariate shift between the treated and control population; and (ii) a covariate shift between the observed and missing treatment population. We first theoretically show the effect of these covariate shifts by deriving a generalization bound for estimating CATE in our setting with missing treatments. Then, motivated by our bound, we develop the missing treatment representation network (MTRNet), a novel CATE estimation algorithm that learns a balanced representation of covariates using domain adaptation. By using balanced representations, MTRNet provides more reliable CATE estimates in the covariate domains where the data are not fully observed. In various experiments with semi-synthetic and real-world data, we show that our algorithm improves over the state-of-the-art by a substantial margin.

Poster
Osman Mian · David Kaltenpoth · Michael Kamp · Jilles Vreeken

[ Auditorium 1 Foyer ]

In critical applications, causal models are the prime choice for their trustworthiness and explainability. If data is inherently distributed and privacy-sensitive, federated learning allows for collaboratively training a joint model. Existing approaches for federated causal discovery share the locally discovered causal models in every iteration, thereby not only revealing local structure but also incurring very high communication costs. Instead, we propose an approach for privacy-preserving federated causal discovery by distributed min-max regret optimization. We prove that max-regret is a consistent scoring criterion that can be used within the well-known Greedy Equivalence Search to discover causal networks in a federated setting and is provably privacy-preserving at the same time. Through extensive experiments, we show that our approach reliably discovers causal networks without ever looking at local data and beats the state of the art both in terms of the quality of discovered causal networks as well as communication efficiency.

Poster
Ang Li · Judea Pearl

[ Auditorium 1 Foyer ]

Probabilities of causation play a crucial role in modern decision-making. Pearl defined three binary probabilities of causation: the probability of necessity and sufficiency (PNS), the probability of sufficiency (PS), and the probability of necessity (PN). These probabilities were then bounded by Tian and Pearl using a combination of experimental and observational data. However, observational data are not always available in practice; in such cases, Tian and Pearl's theorem provides valid but less effective bounds using pure experimental data. In this paper, we discuss the conditions under which observational data are worth considering to improve the quality of the bounds. More specifically, we define the expected improvement of the bounds by assuming the observational distributions are uniformly distributed on their feasible intervals. We further apply the proposed theorems to the unit selection problem defined by Li and Pearl.

Poster
Zeshan Hussain · Ming-Chieh Shih · Michael Oberst · Ilker Demirel · David Sontag

[ Auditorium 1 Foyer ]

Randomized Controlled Trials (RCTs) are relied upon to assess new treatments, but suffer from limited power to guide personalized treatment decisions. On the other hand, observational (i.e., non-experimental) studies have large and diverse populations, but are prone to various biases (e.g. residual confounding). To safely leverage the strengths of observational studies, we focus on the problem of falsification, whereby RCTs are used to validate causal effect estimates learned from observational data. In particular, we show that, given data from both an RCT and an observational study, assumptions on internal and external validity have an observable, testable implication in the form of a set of Conditional Moment Restrictions (CMRs). Further, we show that expressing these CMRs with respect to the causal effect, or causal contrast, as opposed to individual counterfactual means, provides a more reliable falsification test. In addition to giving guarantees on the asymptotic properties of our test, we demonstrate superior power and type I error of our approach on semi-synthetic and real world datasets. Our approach is interpretable, allowing a practitioner to visualize which subgroups in the population lead to falsification of an observational study.

Poster
Davin Choo · Kirankumar Shiragur

[ Auditorium 1 Foyer ]

Learning causal relationships between variables is a fundamental task in causal inference and directed acyclic graphs (DAGs) are a popular choice to represent the causal relationships. As one can recover a causal graph only up to its Markov equivalence class from observations, interventions are often used for the recovery task. Interventions are costly in general and it is important to design algorithms that minimize the number of interventions performed. In this work, we study the problem of identifying the smallest set of interventions required to learn the causal relationships between a subset of edges (target edges). Under the assumptions of faithfulness, causal sufficiency, and ideal interventions, we study this problem in two settings: when the underlying ground truth causal graph is known (subset verification) and when it is unknown (subset search). For the subset verification problem, we provide an efficient algorithm to compute a minimum sized interventional set; we further extend these results to bounded size non-atomic interventions and node-dependent interventional costs. For the subset search problem, in the worst case, we show that no algorithm (even with adaptivity or randomization) can achieve an approximation ratio that is asymptotically better than the vertex cover of the target edges when compared …

Poster
Alexander Hägele · Jonas Rothfuss · Lars Lorch · Vignesh Ram Somnath · Bernhard Schölkopf · Andreas Krause

[ Auditorium 1 Foyer ]

Inferring causal structures from experimentation is a central task in many domains. For example, in biology, recent advances allow us to obtain single-cell expression data under multiple interventions such as drugs or gene knockouts. However, the targets of the interventions are often uncertain or unknown and the number of observations limited. As a result, standard causal discovery methods can no longer be reliably used. To fill this gap, we propose a Bayesian framework (BaCaDI) for discovering and reasoning about the causal structure that underlies data generated under various unknown experimental or interventional conditions. BaCaDI is fully differentiable, which allows us to infer the complex joint posterior over the intervention targets and the causal structure via efficient gradient-based variational inference. In experiments on synthetic causal discovery tasks and simulated gene-expression data, BaCaDI outperforms related methods in identifying causal structures and intervention targets.

Poster
Mohsen Heidari · Wojciech Szpankowski

[ Auditorium 1 Foyer ]

Many conventional learning algorithms rely on loss functions other than the natural 0-1 loss for computational efficiency and theoretical tractability. Among them are approaches based on absolute loss (L1 regression) and square loss (L2 regression). The first has been proved to be an \textit{agnostic} PAC learner for various important concept classes such as \textit{juntas} and \textit{half-spaces}. On the other hand, the second is preferable because of its computational efficiency, which is linear in the sample size. However, its PAC learnability is still unknown, as guarantees have been proved only under distributional restrictions. The question of whether L2 regression is an agnostic PAC learner for 0-1 loss has been open since 1993 and has yet to be answered. This paper resolves this problem for the junta class on the Boolean cube --- proving agnostic PAC learning of k-juntas using L2 polynomial regression. Moreover, we present a new PAC learning algorithm based on the Boolean Fourier expansion with lower computational complexity. Fourier-based algorithms, such as Linial et al. (1993), have been used under distributional restrictions, such as the uniform distribution. We show that with an appropriate change one can apply those algorithms in agnostic settings without any distributional assumption. We prove our results by connecting …
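The core learner the abstract refers to, L2 (least-squares) polynomial regression over the Boolean Fourier basis followed by sign rounding, can be sketched minimally as below; this is the generic construction, not the paper's full algorithm:

```python
import itertools
import numpy as np

def monomial_features(X, degree):
    """All multilinear monomials of degree <= `degree` over x in {-1,+1}^n
    (the Fourier basis of functions on the Boolean cube)."""
    n = X.shape[1]
    cols = [np.ones(len(X))]
    for d in range(1, degree + 1):
        for S in itertools.combinations(range(n), d):
            cols.append(np.prod(X[:, list(S)], axis=1))
    return np.column_stack(cols)

def l2_poly_regress_classify(X_train, y_train, X_test, degree):
    """Least-squares fit of a low-degree polynomial to +/-1 labels,
    then sign rounding (a minimal sketch of the L2-regression learner)."""
    Phi = monomial_features(X_train, degree)
    coef, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
    scores = monomial_features(X_test, degree) @ coef
    return np.where(scores >= 0, 1, -1)
```

For a k-junta, only monomials over the k relevant coordinates carry signal, which is why a degree-k fit suffices in this sketch.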

Poster
Aldo Carranza · Sanath Kumar Krishnamurthy · Susan Athey

[ Auditorium 1 Foyer ]

Contextual bandit algorithms often estimate reward models to inform decision-making. However, true rewards can contain action-independent redundancies that are not relevant for decision-making. We show it is more data-efficient to estimate any function that explains the reward differences between actions, that is, the treatment effects. Motivated by this observation, building on recent work on oracle-based bandit algorithms, we provide the first reduction of contextual bandits to general-purpose heterogeneous treatment effect estimation, and we design a simple and computationally efficient algorithm based on this reduction. Our theoretical and experimental results demonstrate that heterogeneous treatment effect estimation in contextual bandits offers practical advantages over reward estimation, including more efficient model estimation and greater robustness to model misspecification.
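The "estimate the difference, not the rewards" idea can be illustrated in the two-action linear case with uniformly logged data; the transformed target and ridge fit below are a textbook-style sketch, not the paper's oracle-based reduction:

```python
import numpy as np

def fit_effect_model(contexts, actions, rewards, lam=1.0):
    """Ridge fit of a linear treatment-effect (reward-difference) model.

    Under uniform logging over two actions a in {0, 1}, the target
    t = 2 * (2a - 1) * r satisfies E[t | x] = E[r | x, 1] - E[r | x, 0],
    so action-independent reward components cancel out of the target.
    """
    t = 2.0 * (2 * actions - 1) * rewards
    d = contexts.shape[1]
    return np.linalg.solve(contexts.T @ contexts + lam * np.eye(d),
                           contexts.T @ t)

def act(theta, context):
    # act greedily on the estimated effect: action 1 iff the predicted
    # reward difference is positive
    return int(context @ theta > 0)
```

Note that any nuisance term shared by both actions never enters the effect model, which is the data-efficiency point the abstract makes.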

Poster
Thomas Kleine Buening · Aadirupa Saha

[ Auditorium 1 Foyer ]

We study the problem of non-stationary dueling bandits and provide the first adaptive dynamic regret algorithm for this problem. The only two existing attempts in this line of work fall short across multiple dimensions, including pessimistic measures of non-stationary complexity and non-adaptive parameter tuning that requires knowledge of the number of preference changes. We develop an elimination-based rescheduling algorithm to overcome these shortcomings and show a near-optimal $\widetilde{O}(\sqrt{S^{\texttt{CW}} T})$ dynamic regret bound, where $S^{\texttt{CW}}$ is the number of times the Condorcet winner changes in $T$ rounds. This yields the first near-optimal dynamic regret bound for unknown $S^{\texttt{CW}}$. We further study other related notions of non-stationarity for which we also prove near-optimal dynamic regret guarantees under additional assumptions on the preference model.
Poster
Debangshu Banerjee · Avishek Ghosh · Sayak Ray Chowdhury · Aditya Gopalan

[ Auditorium 1 Foyer ]

We present a non-asymptotic lower bound on the spectrum of the design matrix generated by any linear bandit algorithm with sub-linear regret when the action set has well-behaved curvature. Specifically, we show that the minimum eigenvalue of the expected design matrix grows as $\Omega(\sqrt{n})$ whenever the expected cumulative regret of the algorithm is $O(\sqrt{n})$, where $n$ is the learning horizon, and the action-space has a constant Hessian around the optimal arm. This shows that such action-spaces force a polynomial lower bound on the least eigenvalue, rather than a logarithmic lower bound as shown by \citet{lattimore2017end} for discrete (i.e., well-separated) action spaces. Furthermore, while the latter holds only in the asymptotic regime ($n \to \infty$), our result for these ``locally rich'' action spaces is any-time. Additionally, under a mild technical assumption, we obtain a similar lower bound on the minimum eigenvalue holding with high probability. We apply our result to two practical scenarios -- \emph{model selection} and \emph{clustering} in linear bandits. For model selection, we show that an epoch-based linear bandit algorithm adapts to the true model complexity at a rate exponential in the number of epochs, by virtue of our novel spectral bound. For clustering, we consider a multi-agent …
Poster
Wonyoung Kim · Myunghee Cho Paik · Min-hwan Oh

[ Auditorium 1 Foyer ]

We propose a novel algorithm for linear contextual bandits with $O(\sqrt{dT \log T})$ regret bound, where $d$ is the dimension of contexts and $T$ is the time horizon. Our proposed algorithm is equipped with a novel estimator in which exploration is embedded through explicit randomization. Depending on the randomization, our proposed estimator takes contribution either from contexts of all arms or from selected contexts. We establish a self-normalized bound for our estimator, which allows a novel decomposition of the cumulative regret into \textit{additive} dimension-dependent terms instead of multiplicative terms. We also prove a novel lower bound of $\Omega(\sqrt{dT})$ under our problem setting. Hence, the regret of our proposed algorithm matches the lower bound up to logarithmic factors. The numerical experiments support the theoretical guarantees and show that our proposed method outperforms the existing linear bandit algorithms.
Poster
Yu-Zhen Janice Chen · Lin Yang · Xuchuang Wang · Xutong Liu · Mohammad Hajiesmaili · John C. S. Lui · Don Towsley

[ Auditorium 1 Foyer ]

This paper studies a cooperative multi-agent multi-armed stochastic bandit problem where agents operate asynchronously -- agent pull times and rates are unknown, irregular, and heterogeneous -- and face the same instance of a K-armed bandit problem. Agents can share reward information to speed up the learning process at additional communication costs. We propose ODC, an on-demand communication protocol that tailors the communication of each pair of agents based on their empirical pull times. ODC is efficient when the pull times of agents are highly heterogeneous, and its communication complexity depends on the empirical pull times of agents. ODC is a generic protocol that can be integrated into most cooperative bandit algorithms. We then incorporate ODC into the natural extensions of UCB and AAE algorithms and propose two communication-efficient cooperative algorithms. Our analysis shows that both algorithms are near-optimal in regret.

Poster
Tom Huix · Shunshi Zhang · Alain Durmus

[ Auditorium 1 Foyer ]

In this paper, we consider high-dimensional contextual bandit problems, for which Thompson Sampling and its variants have been proposed and successfully applied to multiple machine learning problems. Existing theory on Thompson Sampling shows that it has suboptimal dimension dependency in contrast to upper confidence bound (UCB) algorithms. To circumvent this issue, Zhang et al. (2021) recently proposed to modify Thompson Sampling by enforcing more exploration, thereby attaining optimal regret bounds. Nonetheless, this analysis does not permit tractable implementation in high dimensions. The main challenge therein is the simulation of the posterior samples at each step given the available observations. To overcome this, we propose and analyze the use of Markov Chain Monte Carlo methods. As a corollary, we show that for contextual linear bandits, using Langevin Monte Carlo (LMC) or the Metropolis Adjusted Langevin Algorithm (MALA), our algorithm attains optimal regret bounds of $\tilde{\mathcal{O}}(d\sqrt{T})$. Furthermore, we show that this is obtained with $\tilde{\mathcal{O}}(dT^4)$ and $\tilde{\mathcal{O}}(dT^2)$ data evaluations for LMC and MALA, respectively. Finally, we validate our findings through numerical simulations and show that we outperform vanilla Thompson Sampling in high dimensions.
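In the linear bandit case, the posterior to be simulated is Gaussian, $\mathcal{N}(A^{-1}b, A^{-1})$ with $A$ the regularized design matrix and $b$ the reward-weighted feature sum. An unadjusted Langevin Monte Carlo sampler for it is a few lines; the step size and iteration count below are illustrative, not the tuned values from the paper:

```python
import numpy as np

def lmc_posterior_sample(A, b, step, n_steps, rng, theta0=None):
    """Unadjusted Langevin Monte Carlo sketch for sampling from the
    Gaussian posterior N(A^{-1} b, A^{-1}) arising in linear bandits.

    The negative log density has gradient A @ theta - b, so each step is
    a gradient step plus sqrt(2 * step) Gaussian noise.
    """
    d = len(b)
    theta = np.zeros(d) if theta0 is None else theta0.copy()
    for _ in range(n_steps):
        grad = A @ theta - b
        theta = theta - step * grad + np.sqrt(2.0 * step) * rng.standard_normal(d)
    return theta
```

MALA would add a Metropolis accept/reject correction to each proposal, removing the discretization bias of the plain update at extra cost per step.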
Poster
Kihyuk Hong · Yuhang Li · Ambuj Tewari

[ Auditorium 1 Foyer ]

We propose an algorithm for non-stationary kernel bandits that does not require prior knowledge of the degree of non-stationarity. The algorithm follows randomized strategies obtained by solving optimization problems that balance exploration and exploitation. It adapts to non-stationarity by restarting when a change in the reward function is detected. Our algorithm enjoys a tighter dynamic regret bound than previous work on non-stationary kernel bandits. Moreover, when applied to the non-stationary linear bandits by using a linear kernel, our algorithm is nearly minimax optimal, solving an open problem in the non-stationary linear bandit literature. We extend our algorithm to use a neural network for dynamically adapting the feature mapping to observed data. We prove a dynamic regret bound of the extension using the neural tangent kernel theory. We demonstrate empirically that our algorithm and the extension can adapt to varying degrees of non-stationarity.

Poster
Benjamin Howson · Ciara Pike-Burke · Sarah Filippi

[ Auditorium 1 Foyer ]

The stochastic generalised linear bandit is a well-understood model for sequential decision-making problems, with many algorithms achieving near-optimal regret guarantees under immediate feedback. However, the stringent requirement for immediate rewards is unmet in many real-world applications where the reward is almost always delayed. We study the phenomenon of delayed rewards in generalised linear bandits in a theoretical manner. We show that a natural adaptation of an optimistic algorithm to the delayed feedback setting can achieve regret of $\widetilde{\mathcal{O}}(d\sqrt{T} + d^{3/2}\mathbb{E}[\tau]\,)$, where $\mathbb{E}[\tau]$ denotes the expected delay, $d$ is the dimension and $T$ is the time horizon. This significantly improves upon existing approaches for this setting where the best known regret bound was $ \widetilde{\mathcal{O}}(\sqrt{dT}\sqrt{d + \mathbb{E}[\tau]}\,)$. We verify our theoretical results through experiments on simulated data.
Poster
Imad Aouali · Branislav Kveton · Sumeet Katariya

[ Auditorium 1 Foyer ]

A contextual bandit is a popular framework for online learning to act under uncertainty. In practice, the number of actions is huge and their expected rewards are correlated. In this work, we introduce a general framework for capturing such correlations through a mixed-effect model where actions are related through multiple shared effect parameters. To explore efficiently using this structure, we propose Mixed-Effect Thompson Sampling (meTS) and bound its Bayes regret. The regret bound has two terms, one for learning the action parameters and the other for learning the shared effect parameters. The terms reflect the structure of our model and the quality of priors. Our theoretical findings are validated empirically using both synthetic and real-world problems. We also propose numerous extensions of practical interest. While they do not come with guarantees, they perform well empirically and show the generality of the proposed framework.

Poster
Yuxuan Han · JIALIN ZENG · Yang Wang · Yang Xiang · Jiheng Zhang

[ Auditorium 1 Foyer ]

We study the stochastic contextual bandit with knapsacks (CBwK) problem, where each action, taken upon a context, not only leads to a random reward but also costs a random resource consumption in a vector form. The challenge is to maximize the total reward without violating the budget for each resource. We study this problem under a general realizability setting where the expected reward and expected cost are functions of contexts and actions in some given general function classes $\mathcal{F}$ and $\mathcal{G}$, respectively. Existing works on CBwK are restricted to the linear function class since they use UCB-type algorithms, which heavily rely on the linear form and thus are difficult to extend to general function classes. Motivated by online regression oracles that have been successfully applied to contextual bandits, we propose the first universal and optimal algorithmic framework for CBwK by reducing it to online regression. We also establish the lower regret bound to show the optimality of our algorithm for a variety of function classes.
Poster
Kurtland Chua · Qi Lei · Jason Lee

[ Auditorium 1 Foyer ]

Hierarchical reinforcement learning (HRL) has seen widespread interest as an approach to tractable learning of complex modular behaviors. However, existing works either assume access to expert-constructed hierarchies, or use hierarchy-learning heuristics with no provable guarantees. To address this gap, we analyze HRL in the meta-RL setting, where a learner learns latent hierarchical structure during meta-training for use in a downstream task. We consider a tabular setting where natural hierarchical structure is embedded in the transition dynamics. Analogous to supervised meta-learning theory, we provide “diversity conditions” which, together with a tractable optimism-based algorithm, guarantee sample-efficient recovery of this natural hierarchy. Furthermore, we provide regret bounds on a learner using the recovered hierarchy to solve a meta-test task. Our bounds incorporate common notions in HRL literature such as temporal and state/action abstractions, suggesting that our setting and analysis capture important features of HRL in practice.

Poster
Rong Tang · Yun Yang

[ Auditorium 1 Foyer ]

In this paper, we consider the problem of two-sample hypothesis testing that aims at detecting the difference between two probability densities based on finite samples. The proposed test statistic is constructed by first truncating a sample version of a negative Besov norm and then normalizing it. Here, the negative Besov norm is the norm associated with a Besov space with negative exponent, and is shown to be closely related to a class of commonly used adversarial losses (or integral probability metrics) with smooth discriminators. Theoretically, we characterize the optimal detection boundary of two-sample testing in terms of the dimensionalities and smoothness levels of the underlying densities and the discriminator class defining the adversarial loss. We also show that the proposed approach can simultaneously attain the optimal detection boundary under many common adversarial losses, including those induced by the $\ell_1$, $\ell_2$ distances and Wasserstein distances. Our numerical experiments show that the proposed test procedure tends to exhibit higher power and robustness in difference detection than existing state-of-the-art competitors.
Poster
Kenshi Abe · Kaito Ariu · Mitsuki Sakamoto · Kentaro Toyoshima · Atsushi Iwasaki

[ Auditorium 1 Foyer ]

This paper proposes Mutation-Driven Multiplicative Weights Update (M2WU) for learning an equilibrium in two-player zero-sum normal-form games and proves that it exhibits the last-iterate convergence property in both full and noisy feedback settings. In the former, players observe their exact gradient vectors of the utility functions. In the latter, they only observe noisy gradient vectors. Even the celebrated Multiplicative Weights Update (MWU) and Optimistic MWU (OMWU) algorithms may not converge to a Nash equilibrium with noisy feedback. In contrast, M2WU exhibits last-iterate convergence to a stationary point near a Nash equilibrium in both feedback settings. We then prove that it converges to an exact Nash equilibrium by iteratively adapting the mutation term. We empirically confirm that M2WU outperforms MWU and OMWU in exploitability and convergence rates.
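One step of a mutation-perturbed multiplicative-weights update can be sketched as below. The replicator-mutator-style perturbation toward a reference strategy is an illustrative assumption about what "mutation" means here, not necessarily the paper's exact M2WU rule:

```python
import numpy as np

def mwu_with_mutation(x, payoff, eta, mu, ref):
    """One illustrative mutation-perturbed MWU step on the simplex.

    x: current mixed strategy; payoff: observed payoff vector;
    eta: step size; mu: mutation strength; ref: mutation target strategy.
    The payoff is perturbed to pull the strategy toward `ref`.
    """
    perturbed = payoff + mu * (ref / np.maximum(x, 1e-12) - 1.0)
    w = x * np.exp(eta * perturbed)
    return w / w.sum()  # renormalize back onto the probability simplex
```

With `mu = 0` this reduces to plain MWU; the mutation term keeps the iterate away from the simplex boundary, which is what enables convergence of the last iterate under noise in the abstract's setting.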

Poster
Emmanouil Zampetakis · Fred Zhang

[ Auditorium 1 Foyer ]

A seminal work by Moulin (1980) shows that the median voting scheme fully characterizes (deterministic) strategy-proof facility location mechanisms for single-peaked preferences. In this simple setting, the median also achieves the optimal social cost. In $d$ dimensions, strategy-proof mechanisms are characterized by the coordinate-wise median, which is known to have a large $\sqrt{d}$ approximation ratio of the social cost in the Euclidean space, whereas the socially optimal mechanism fails at being strategy-proof. In light of the negative results in the classic, worst-case setting, we initiate the study of Bayesian mechanism design for strategy-proof facility location for multi-dimensional Euclidean preferences, where the agents' preferences are drawn from a distribution. We approach the problem via connections to algorithmic high-dimensional robust statistics. Specifically, our contributions are the following:
* We provide a general reduction from any robust estimation scheme to a Bayesian approximately strategy-proof mechanism. This leads to new strategy-proof mechanisms for Gaussian and bounded-moment distributions, by leveraging recent advances in robust statistics.
* We show that the Lugosi-Mendelson median that arises from heavy-tailed statistics can be used to obtain a Bayesian approximately strategy-proof single-facility mechanism with asymptotically optimal social cost, under mild distributional assumptions.
* We provide Bayesian approximately strategy-proof multi-facility mechanisms for Gaussian mixture distributions …
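The classical baseline the abstract starts from, the coordinate-wise median mechanism, is a one-liner:

```python
import numpy as np

def coordinatewise_median(reports):
    """The classic deterministic strategy-proof facility-location mechanism
    in d dimensions: place the facility at the coordinate-wise median of
    the agents' reported peaks. No agent can move the median toward
    themselves by misreporting, which is the source of strategy-proofness."""
    return np.median(np.asarray(reports, dtype=float), axis=0)
```

It is this mechanism's $\sqrt{d}$ social-cost gap in Euclidean space that motivates the Bayesian relaxation studied in the paper.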
Poster
Nelvin Tan · Ramji Venkataramanan

[ Auditorium 1 Foyer ]

In mixed linear regression, each observation comes from one of L regression vectors (signals), but we do not know which one. The goal is to estimate the signals from the unlabeled observations. We propose a novel approximate message passing (AMP) algorithm for estimation and rigorously characterize its performance in the high-dimensional limit. This characterization is in terms of a state evolution recursion, which allows us to precisely compute performance measures such as the asymptotic mean-squared error. This can be used to tailor the AMP algorithm to take advantage of any known structural information about the signals. Using state evolution, we derive an optimal choice of AMP `denoising' functions that minimizes the estimation error in each iteration. Numerical simulations are provided to validate the theoretical results, and show that AMP significantly outperforms other estimators including spectral methods, expectation maximization, and alternating minimization. Though our numerical results focus on mixed linear regression, the proposed AMP algorithm can be applied to a broader class of models including mixtures of generalized linear models and max-affine regression.

Poster
Minh-Toan Nguyen · Romain Couillet

[ Auditorium 1 Foyer ]

The article considers semi-supervised multitask learning on a Gaussian mixture model (GMM). Using methods from statistical physics, we compute the asymptotic Bayes risk of each task in the regime of large datasets in high dimension, from which we analyze the role of task similarity in learning and evaluate the performance gain when tasks are learned together rather than separately. In the supervised case, we derive a simple algorithm that attains the Bayes optimal performance.

Poster
Gary Cheng · Karan Chadha · John Duchi

[ Auditorium 1 Foyer ]

We develop an asymptotic framework for comparing the test performance of (personalized) federated learning algorithms, with the purpose of moving beyond algorithmic convergence arguments. To that end, we study a high-dimensional linear regression model to elucidate the statistical properties (per-client test error) of loss minimizers. Our techniques and model allow precise predictions about the benefits of personalization and information sharing in federated scenarios, including that Federated Averaging with simple client fine-tuning achieves identical asymptotic risk to more intricate meta-learning approaches and outperforms naive Federated Averaging. We evaluate and corroborate these theoretical predictions on federated versions of the EMNIST, CIFAR-100, Shakespeare, and Stack Overflow datasets.

Poster
Haiyun He · Gholamali Aminian · Yuheng Bu · Miguel Rodrigues · Vincent Tan

[ Auditorium 1 Foyer ]

We provide an exact characterization of the expected generalization error (gen-error) for semi-supervised learning (SSL) with pseudo-labeling via the Gibbs algorithm. The gen-error is expressed in terms of the symmetrized KL information between the output hypothesis, the pseudo-labeled dataset, and the labeled dataset. Distribution-free upper and lower bounds on the gen-error can also be obtained. Our findings offer new insights that the generalization performance of SSL with pseudo-labeling is affected not only by the information between the output hypothesis and input training data but also by the information shared between the labeled and pseudo-labeled data samples. This serves as a guideline to choose an appropriate pseudo-labeling method from a given family of methods. To deepen our understanding, we further explore two examples---mean estimation and logistic regression. In particular, we analyze how the ratio $\lambda$ of the number of unlabeled to labeled data affects the gen-error under both scenarios. As $\lambda$ increases, the gen-error for mean estimation decreases and then saturates at a value larger than when all the samples are labeled; the gap can be quantified exactly with our analysis and depends on the cross-covariance between the labeled and pseudo-labeled data samples. For logistic regression, the gen-error and …

Poster
Emilio Dorigatti · Benjamin Schubert · Bernd Bischl · David Rügamer

[ Auditorium 1 Foyer ]

Semi-structured regression (SSR) models jointly learn the effect of structured (tabular) and unstructured (non-tabular) data through additive predictors and deep neural networks (DNNs), respectively. Inference in SSR models aims at deriving confidence intervals for the structured predictor, although current approaches ignore the variance of the DNN estimation of the unstructured effects. This results in an underestimation of the variance of the structured coefficients and, thus, an increase of Type-I error rates. To address this shortcoming, we present here a theoretical framework for structured inference in SSR models that incorporates the variance of the DNN estimate into confidence intervals for the structured predictor. By treating this estimate as a random offset with known variance, our formulation is agnostic to the specific deep uncertainty quantification method employed. Through numerical experiments and a practical application on a medical dataset, we show that our approach results in increased coverage of the true structured coefficients and thus a reduction in Type-I error rate compared to ignoring the variance of the neural network, naive ensembling of SSR models, and a variational inference baseline.

Poster
Solenne Gaucher · Nicolas Schreuder · Evgenii Chzhen

[ Auditorium 1 Foyer ]

This work provides several fundamental characterizations of the optimal classification function under the demographic parity constraint. In the awareness framework, akin to the classical unconstrained classification case, we show that maximizing accuracy under this fairness constraint is equivalent to solving a fair regression problem followed by thresholding at level $1/2$. We extend this result to linear-fractional classification measures (e.g., ${\rm F}$-score, AM measure, balanced accuracy, etc.), highlighting the fundamental role played by regression in this framework. Our results leverage a recently developed connection between the demographic parity constraint and the multi-marginal optimal transport formulation. Informally, our result shows that the transition between the unconstrained problem and the fair one is achieved by replacing the conditional expectation of the label with the solution of the fair regression problem. Finally, leveraging our analysis, we demonstrate an equivalence between the awareness and unawareness setups for two sensitive groups.

Poster
Masahiro Kato · Masaaki Imaizumi · Kentaro Minami

[ Auditorium 1 Foyer ]

This paper provides a unified perspective on the Kullback-Leibler (KL) divergence and integral probability metrics (IPMs) from the viewpoint of maximum likelihood density-ratio estimation (DRE). Both the KL-divergence and the IPMs are widely used in applications such as generative modeling, but a unified understanding of these concepts has remained unexplored. In this paper, we show that the KL-divergence and the IPMs can be represented as maximum likelihoods differing only in their sampling schemes, and we use this result to derive a unified form of the IPMs and a relaxed estimation method. To develop the estimation problem, we construct an unconstrained maximum likelihood estimator that performs DRE with a stratified sampling scheme. We further propose a novel class of probability divergences, called Density Ratio Metrics (DRMs), that interpolates between the KL-divergence and the IPMs. Beyond these findings, we also introduce applications of the DRMs, such as DRE and generative adversarial networks. In experiments, we validate the effectiveness of our proposed methods.

Poster
Hongjian Wang · Aaditya Ramdas

[ Auditorium 1 Foyer ]

Confidence sequences are confidence intervals that can be sequentially tracked, and are valid at arbitrary data-dependent stopping times. This paper presents confidence sequences for a univariate mean of an unknown distribution with a known upper bound on the p-th central moment (p > 1), but allowing for (at most) an ε fraction of arbitrary distribution corruption, as in Huber's contamination model. We do this by designing new robust exponential supermartingales, and show that the resulting confidence sequences attain the optimal width achieved in the nonsequential setting. Perhaps surprisingly, the constant margin between our sequential result and the lower bound is smaller than even that of fixed-time robust confidence intervals based on the trimmed mean, for example. Since confidence sequences are a common tool used within A/B/n testing and bandits, these results open the door to sequential experimentation that is robust to outliers and adversarial corruptions.

Poster
Nikita Puchkin · Valeriia Shcherbakova

[ Auditorium 1 Foyer ]

We suggest a novel procedure for online change point detection. Our approach expands an idea of maximizing a discrepancy measure between points from pre-change and post-change distributions. This leads to a flexible procedure suitable for both parametric and nonparametric scenarios. We prove non-asymptotic bounds on the average running length of the procedure and its expected detection delay. The efficiency of the algorithm is illustrated with numerical experiments on synthetic and real-world data sets.

Poster
Kevin Elgui · Alex Nowak · Geneviève Robin

[ Auditorium 1 Foyer ]

The introduction of machine learning (ML) techniques to the field of survival analysis has increased the flexibility of modeling approaches, and ML-based models have become state-of-the-art. These models optimize their own cost functions, and their performance is often evaluated using the concordance index (C-index). From a statistical learning perspective, it is therefore an important problem to analyze the relationship between the optimizers of the C-index and those of the ML cost functions. We address this issue by providing C-index Fisher-consistency results and excess risk bounds for several of the commonly used cost functions in survival analysis. We identify conditions under which they are consistent, in the form of three nested families of survival models. We also study the general case where no model assumption is made and present a new, off-the-shelf method that is shown to be consistent with the C-index, although computationally expensive at inference. Finally, we perform limited numerical experiments with simulated data to illustrate our theoretical findings.

Poster
Shuo Yang · Yijun Dong · Rachel Ward · Inderjit Dhillon · Sujay Sanghavi · Qi Lei

[ Auditorium 1 Foyer ]

Data augmentation is popular in the training of large neural networks; however, theoretical understanding of the discrepancy between different algorithmic choices for leveraging augmented data currently remains limited. In this paper, we take a step in this direction -- we first present a simple and novel analysis for linear regression with label-invariant augmentations, demonstrating that data augmentation consistency (DAC) is intrinsically more efficient than empirical risk minimization on augmented data (DA-ERM). The analysis is then extended to misspecified augmentations (i.e., augmentations that change the labels), which again demonstrates the merit of DAC over DA-ERM. Further, we extend our analysis to non-linear models (e.g., neural networks) and present generalization bounds. Finally, we perform experiments that make a clean and apples-to-apples comparison (i.e., with no extra modeling or data tweaks) between DAC and DA-ERM using CIFAR-100 and WideResNet; these together demonstrate the superior efficacy of DAC.
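
For the linear-regression setting the analysis covers, the contrast between DAC and DA-ERM can be sketched as follows; the function name, the squared-error consistency penalty, and the weight `lam` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def dac_loss(w, X, y, X_aug, lam=1.0):
    """Data augmentation consistency (DAC) objective for linear regression:
    empirical risk on the original data plus a penalty forcing the model to
    agree on each point and its label-invariant augmentation. (DA-ERM would
    instead take the empirical risk over X and X_aug jointly.)"""
    erm = np.mean((X @ w - y) ** 2)                # risk on original data only
    consistency = np.mean(((X - X_aug) @ w) ** 2)  # agreement with augmentations
    return erm + lam * consistency
```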

Poster
Yikai Zhang · Jiahe Lin · Fengpei Li · Kashif Rasul · Anderson Schneider · Yuriy Nevmyvaka

[ Auditorium 1 Foyer ]

Quantifying aleatoric uncertainty is a challenging task in machine learning. It is important for decision making associated with data-dependent uncertainty in model outcomes. Many recent empirical studies of aleatoric uncertainty modeling in regression settings rely primarily on either a Gaussian likelihood or moment matching. However, the performance of these methods varies across datasets, while discussions of their theoretical guarantees are lacking. In this work, we investigate theoretical aspects of these approaches and establish risk bounds for their estimates. We provide conditions that are sufficient to guarantee the PAC-learnability of the aleatoric uncertainty. The study suggests that the likelihood- and moment matching-based methods enjoy different types of guarantees in their risk bounds, i.e., they calibrate different aspects of the uncertainty and thus exhibit distinct properties in different regimes of the parameter space. Finally, we conduct an empirical study which shows promising results and supports our theorems.

Poster
Anass Aghbalou · Anne Sabourin · François Portier

[ Auditorium 1 Foyer ]

This paper investigates the efficiency of the K-fold cross-validation (CV) procedure, and a debiased version thereof, as a means of estimating the generalization risk of a learning algorithm. We work under the general assumption of uniform algorithmic stability. We show that the K-fold risk estimate may not be consistent under such general stability assumptions, by constructing non-vanishing lower bounds on the error in realistic contexts such as regularized empirical risk minimisation and stochastic gradient descent. We thus advocate the use of a debiased version of the K-fold estimate and prove an error bound with exponential tail decay for this version. Our result is applicable to the large class of uniformly stable algorithms, in contrast to earlier works focusing on specific tasks such as density estimation. We illustrate the relevance of the debiased K-fold CV on a simple model selection problem and demonstrate empirically the usefulness of the promoted approach on real-world classification and regression datasets.

Poster
Pranjal Awasthi · Anqi Mao · Mehryar Mohri · Yutao Zhong

[ Auditorium 1 Foyer ]

Adversarial robustness is a critical property of classifiers in applications as they are increasingly deployed in complex real-world systems. Yet, achieving accurate adversarial robustness in machine learning remains a persistent challenge, and the choice of the surrogate loss function used for training is a key factor. We present a family of new loss functions for adversarial robustness, smooth adversarial losses, which we show can be derived in a general way from broad families of loss functions used in multi-class classification. We prove strong H-consistency theoretical guarantees for these loss functions, including multi-class H-consistency bounds for sum losses in the adversarial setting. We design new regularized algorithms based on the minimization of these principled smooth adversarial losses (PSAL). We further show through a series of extensive experiments with the CIFAR-10, CIFAR-100 and SVHN datasets that our PSAL algorithm consistently outperforms the current state-of-the-art technique, TRADES, for both robust accuracy against l-infinity-norm bounded perturbations and, even more significantly, for clean accuracy. Finally, we prove that, unlike PSAL, the TRADES loss in general does not admit an H-consistency property.

Poster
Hedda Cohen Indelman · Tamir Hazan

[ Auditorium 1 Foyer ]

Multi-graph matching is a prominent structured prediction task, in which the predicted label is constrained to the space of cycle-consistent matchings. While direct loss minimization is an effective method for learning predictors over structured label spaces, it cannot be applied efficiently to the problem at hand, since executing a specialized solver across sets of matching predictions is computationally prohibitive. Moreover, there is no supervision on the ground-truth matchings over cycle-consistent prediction sets. Our key insight is to strictly enforce the matching constraints in pairwise matching predictions and softly enforce the cycle-consistency constraints by casting them as weighted loss terms, such that the severity of inconsistency with global predictions is tuned by a penalty parameter. Inspired by the classic penalty method, we prove that our method theoretically recovers the optimal multi-graph matching constrained solution. Our method's advantages are brought to light in experimental results on the popular keypoint matching task on the Pascal VOC and the Willow ObjectClass datasets.

Poster
Flavio Chierichetti · Mirko Giacchini · Ravi Kumar · Alessandro Panconesi · Andrew Tomkins

[ Auditorium 1 Foyer ]

In this work we consider the problem of fitting Random Utility Models (RUMs) to user choices. Given the winner distributions of the subsets of size $k$ of a universe, we obtain a polynomial-time algorithm that finds the RUM that best approximates the given distribution on average. Our algorithm is based on a linear program that we solve using the ellipsoid method. Given that its separation oracle problem is NP-hard, we devise an approximate separation oracle that can be viewed as a generalization of the weighted Feedback Arc Set problem to hypergraphs. Our theoretical result can also be made practical: we obtain a heuristic that scales to real-world datasets.
Poster
Hossein Taheri · Christos Thrampoulidis

[ Auditorium 1 Foyer ]

Decentralized learning offers privacy and communication efficiency when data are naturally distributed among agents communicating over an underlying graph. Motivated by overparameterized learning settings, in which models are trained to zero training loss, we study algorithmic and generalization properties of decentralized learning with gradient descent on separable data. Specifically, for decentralized gradient descent (DGD) and a variety of loss functions that asymptote to zero at infinity (including the exponential and logistic losses), we derive novel finite-time generalization bounds. This complements a long line of recent work that studies the generalization performance and the implicit bias of gradient descent over separable data, but has thus far been limited to centralized learning scenarios. Notably, our generalization bounds approximately match the order of their centralized counterparts. Critical to this result, and of independent interest, is establishing novel bounds on the training loss and the rate of consensus of DGD for a class of self-bounded losses. Finally, on the algorithmic front, we design improved gradient-based routines for decentralized learning with separable data and empirically demonstrate orders-of-magnitude speed-ups in terms of both training and generalization performance.

Poster
Ronak Mehta · Vincent Roulet · Krishna Pillutla · Lang Liu · Zaid Harchaoui

[ Auditorium 1 Foyer ]

Spectral risk objectives -- also called L-risks -- allow for learning systems to interpolate between optimizing average-case performance (as in empirical risk minimization) and worst-case performance on a task. We develop LSVRG, a stochastic algorithm to optimize these quantities by characterizing their subdifferential and addressing challenges such as biasedness of subgradient estimates and non-smoothness of the objective. We show theoretically and experimentally that out-of-the-box approaches such as stochastic subgradient and dual averaging can be hindered by bias, whereas our approach exhibits linear convergence.
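
Concretely, a spectral risk is just a weighted average of the sorted losses; a minimal sketch (the function name is mine, and the weight vector `sigma` is assumed nondecreasing and summing to one):

```python
import numpy as np

def spectral_risk(losses, sigma):
    """Spectral risk (L-risk): inner product of the sorted losses with a
    nondecreasing weight vector sigma summing to one. Uniform weights give
    the average-case risk (ERM); a point mass on the last entry gives the
    worst-case loss."""
    return float(np.sort(np.asarray(losses)) @ np.asarray(sigma))
```

Intermediate weight profiles interpolate between these two extremes, which is the family of objectives the paper's LSVRG algorithm optimizes.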

Poster
Théo Lacombe

[ Auditorium 1 Foyer ]

This work studies how the introduction of the entropic regularization term in unbalanced Optimal Transport (OT) models may alter their homogeneity with respect to the input measures. We observe that in common settings (including balanced OT and unbalanced OT with Kullback-Leibler divergence to the marginals), although the optimal transport cost itself is not homogeneous, optimal transport plans and the so-called Sinkhorn divergences are indeed homogeneous. However, homogeneity does not hold in more general Unbalanced Regularized Optimal Transport (UROT) models, for instance those using the Total Variation as divergence to the marginals. We propose to modify the entropic regularization term to retrieve an UROT model that is homogeneous while preserving most properties of the standard UROT model. We showcase the importance of using our Homogeneous UROT (HUROT) model when it comes to regularize Optimal Transport with Boundary, a transportation model involving a spatially varying divergence to the marginals for which the standard (inhomogeneous) UROT model would yield inappropriate behavior.

Poster
Dan Garber · Tsur Livney · Shoham Sabach

[ Auditorium 1 Foyer ]

This paper considers a convex composite optimization problem with affine constraints, which includes problems that take the form of minimizing a smooth convex objective function over the intersection of (simple) convex sets, or regularized with multiple (simple) functions. Motivated by high-dimensional applications in which exact projection/proximal computations are not tractable, we propose a \textit{projection-free} augmented Lagrangian-based method, in which primal updates are carried out using a \textit{weak proximal oracle} (WPO). In an earlier work, WPO was shown to be more powerful than the standard \textit{linear minimization oracle} (LMO) that underlies conditional gradient-based methods (aka Frank-Wolfe methods). Moreover, WPO is computationally tractable for many high-dimensional problems of interest, including those motivated by recovery of low-rank matrices and tensors, and optimization over polytopes which admit efficient LMOs. The main result of this paper shows that under a certain curvature assumption (which is weaker than strong convexity), our WPO-based algorithm achieves an ergodic rate of convergence of $O(1/T)$ for both the objective residual and feasibility gap. This result, to the best of our knowledge, improves upon the $O(1/\sqrt{T})$ rate for existing LMO-based projection-free methods for this class of problems. Empirical experiments on a low-rank and sparse covariance matrix estimation task and the Max …
Poster
Hyunglip Bae · Jinkyu Lee · Woo Chang Kim · Yongjae Lee

[ Auditorium 1 Foyer ]

A neural-network-based stagewise decomposition algorithm called Deep Value Function Networks (DVFN) is proposed for large-scale multistage stochastic programming (MSP) problems. Traditional approaches, such as nested Benders decomposition and its stochastic variant, stochastic dual dynamic programming (SDDP), approximate value functions as piecewise linear convex functions by gradually accumulating subgradient cuts from dual solutions of stagewise subproblems. Although they have proven effective for linear problems, nonlinear problems may suffer from the growing number of subgradient cuts as the algorithms proceed. A recently developed algorithm called Value Function Gradient Learning (VFGL) replaced the piecewise linear approximation with a parametric function approximation, but its performance depends heavily on the choice of parametric form, as is the case for most traditional parametric machine learning algorithms. DVFN, in contrast, approximates value functions using neural networks, which are known to have enormous capacity in terms of their functional representations. The art of choosing an appropriate parametric form thus reduces to a simple hyperparameter search for neural networks. However, neural networks are non-convex in general, which can make the learning process unstable. We resolve this issue by using input convex neural networks, which guarantee convexity with respect to inputs. We compare DVFN with SDDP and VFGL for solving …

Poster
Marina Drygala · Sai Ganesh Nagarajan · Ola Svensson

[ Auditorium 1 Foyer ]

In recent years there has been a significant research effort on incorporating predictions into online algorithms. However, work in this area often makes the underlying assumption that predictions come for free (e.g., without any computational or monetary costs). In this paper, we consider a cost associated with making predictions. We show that interesting algorithmic subtleties arise for even the most basic online problems, such as ski rental and its generalization, the Bahncard problem. In particular, we show that with costly predictions, care needs to be taken in (i) asking for the prediction at the right time, (ii) deciding if it is worth asking for the prediction, and (iii) deciding how many predictions to ask for, in settings where it is natural to consider making multiple predictions. Specifically, (i) in the basic ski-rental setting, we compute the optimal delay before asking the predictor, (ii) in the same setting, given a priori information about the true number of ski-days through its mean and variance, we provide a simple algorithm that is near-optimal, under some natural parameter settings, in deciding if it is worth asking for the predictor and (iii) in the setting of the Bahncard problem, we provide a $(1+\varepsilon)$-approximation algorithm and quantify lower …
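
For context, the baseline ski-rental problem (without predictions) is solved by the classic break-even strategy; a sketch, with the buy threshold left as a parameter, since prediction-based variants shift it. The function name and cost convention (renting costs one unit per day) are illustrative assumptions.

```python
def rent_then_buy_cost(buy_price, ski_days, threshold):
    """Cost of renting (one unit per day) until `threshold` days, then buying.
    threshold = buy_price is the classic break-even rule, which is
    2-competitive against the offline optimum min(ski_days, buy_price)."""
    if ski_days < threshold:
        return ski_days                    # season ended while still renting
    return (threshold - 1) + buy_price     # rented threshold-1 days, then bought
```

With `buy_price = 10`, the worst case for the break-even rule pays `9 + 10 = 19`, strictly less than twice the offline optimum of 10.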
Poster
Ryuichiro Hataya · Makoto Yamada

[ Auditorium 1 Foyer ]

The essential difficulty of gradient-based bilevel optimization is to estimate the inverse Hessian-vector product of neural networks. This paper proposes to tackle this problem with the Nyström method and the Woodbury matrix identity, exploiting the low-rankness of the Hessian. Compared to existing methods using iterative approximation, such as conjugate gradient and the Neumann series approximation, the proposed method avoids numerical instability and can be computed efficiently using matrix operations. As a result, the proposed method works stably in various tasks and is two times faster than iterative approximations. Through experiments including large-scale hyperparameter optimization and meta-learning, we demonstrate that the Nyström method consistently achieves comparable or even superior performance to other approaches.
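
A minimal sketch of the core computation: build a Nyström approximation of the Hessian from sampled columns and apply its (damped) inverse to a vector via the Woodbury identity. Here `H` is materialized explicitly for clarity, whereas in bilevel optimization one would only query Hessian-vector products; the damping term and function name are assumptions, not the paper's exact formulation.

```python
import numpy as np

def nystrom_ihvp(H, v, rank=10, damping=1e-3, rng=None):
    """Approximate (damping * I + H)^{-1} v using a rank-`rank` Nystrom
    approximation of the Hessian H and the Woodbury matrix identity."""
    rng = np.random.default_rng(rng)
    n = H.shape[0]
    idx = rng.choice(n, size=rank, replace=False)
    C = H[:, idx]                     # sampled columns of H
    W = H[np.ix_(idx, idx)]           # principal submatrix on the sampled indices
    lam = damping
    # Nystrom: H ~= C W^{-1} C^T.  Woodbury then gives
    # (lam I + C W^{-1} C^T)^{-1} v = v/lam - C (W + C^T C / lam)^{-1} C^T v / lam^2
    inner = W + C.T @ C / lam
    return v / lam - C @ np.linalg.solve(inner, C.T @ v) / lam ** 2
```

Only a `rank`-by-`rank` system is solved, so the per-call cost is dominated by the matrix products rather than an iterative inner loop.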

Poster
Yiling Luo · Yiling Xie · Xiaoming Huo

[ Auditorium 1 Foyer ]

This paper improves the state-of-the-art rate of a first-order algorithm for solving entropy regularized optimal transport. The resulting rate for approximating the optimal transport (OT) has been improved from $\widetilde{\mathcal{O}}({n^{2.5}}/{\epsilon})$ to $\widetilde{\mathcal{O}}({n^2}/{\epsilon})$, where $n$ is the problem size and $\epsilon$ is the accuracy level. In particular, we propose an accelerated primal-dual stochastic mirror descent algorithm with variance reduction. Such special designs help us improve the rate compared to other accelerated primal-dual algorithms. We further propose a batch version of our stochastic algorithm, which improves the computational performance through parallel computing. To compare, we prove that the computational complexity of the Stochastic Sinkhorn algorithm is $\widetilde{\mathcal{O}}({n^2}/{\epsilon^2})$, which is slower than our accelerated primal-dual stochastic mirror algorithm. Experiments are done using synthetic and real data, and the results match our theoretical rates. Our algorithm may inspire more research to develop accelerated primal-dual algorithms that have rate $\widetilde{\mathcal{O}}({n^2}/{\epsilon})$ for solving OT.

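
For reference, the classical (non-accelerated) Sinkhorn baseline for the entropy-regularized OT problem discussed above can be sketched as follows; the regularization level `eps` and iteration count are illustrative, and this is not the paper's accelerated method.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iter=1000):
    """Classical Sinkhorn iterations for entropy-regularized OT: alternately
    rescale the Gibbs kernel K = exp(-C/eps) so the plan matches the
    marginals a and b. Returns the (approximate) optimal transport plan."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)   # match column marginal b
        u = a / (K @ v)     # match row marginal a
    return u[:, None] * K * v[None, :]
```

Each iteration costs $O(n^2)$, and reaching accuracy $\epsilon$ takes on the order of $1/\epsilon^2$ iterations, which is the gap the paper's accelerated primal-dual method closes to $\widetilde{\mathcal{O}}(n^2/\epsilon)$.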
Poster
Jane Lee · Saeid Haghighatshoar · Amin Karbasi

[ Auditorium 1 Foyer ]

Spiking neural networks (SNNs) have recently emerged as an alternative to traditional neural networks, owing to their energy-efficiency benefits and capacity to capture biological neuronal mechanisms. However, the classic backpropagation algorithm for training traditional networks has been notoriously difficult to apply to SNNs due to the hard-thresholding and discontinuities at spike times. Therefore, a large majority of prior work believes that exact gradients for SNNs w.r.t. their weights do not exist and has focused on approximation methods to produce surrogate gradients. In this paper, (1) by applying the implicit function theorem to SNNs at the discrete spike times, we prove that, albeit being non-differentiable in time, SNNs have well-defined gradients w.r.t. their weights, and (2) we propose a novel training algorithm, called \emph{forward propagation} (FP), that computes exact gradients for SNNs. FP exploits the causality structure between the spikes and allows us to parallelize computation forward in time. It can be used with other algorithms that simulate the forward pass, and it also provides insights on why other related algorithms, such as Hebbian learning and recently proposed surrogate gradient methods, may perform well.

Poster
Oisín Faust · Hamza Fawzi · James Saunderson

[ Auditorium 1 Foyer ]

The difference of convex (DC) algorithm is a conceptually simple method for the minimization of (non)convex functions that are expressed as the difference of two convex functions. An attractive feature of the algorithm is that it maintains a global overestimator on the function and does not require a choice of step size at each iteration. By adopting a Bregman divergence point of view, we simplify and strengthen many existing non-asymptotic convergence guarantees for the DC algorithm. We further present several sufficient conditions that ensure a linear convergence rate, namely a new DC Polyak-Łojasiewicz condition, as well as a relative strong convexity assumption. Importantly, our conditions do not require smoothness of the objective function. We illustrate our results on a family of minimization problems involving the quantum relative entropy, with applications in quantum information theory.
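
The DC iteration itself is simple: linearize the concave part $-h$ at the current point and minimize the resulting convex surrogate, with no step size to choose. A toy sketch on $f(x) = x^4/4 - x^2/2$; the decomposition and closed-form inner minimizer are illustrative, not from the paper.

```python
import numpy as np

def dca_step(x, grad_h, argmin_linearized):
    """One DC-algorithm iteration for f = g - h (g, h convex): linearize h
    at x and solve the convex surrogate min_u g(u) - <grad_h(x), u>."""
    return argmin_linearized(grad_h(x))

# Toy decomposition: f(x) = x^4/4 - x^2/2 with g(x) = x^4/4, h(x) = x^2/2.
# The surrogate min_u u^4/4 - s*u has the closed-form solution u = cbrt(s).
grad_h = lambda x: x
argmin_linearized = lambda s: np.cbrt(s)

x = 2.0
for _ in range(100):
    x = dca_step(x, grad_h, argmin_linearized)
# x has converged to the stationary point x = 1 of f
```

Each iterate minimizes a global convex overestimator of $f$ (up to a constant), which is the structural property the paper's Bregman-divergence analysis exploits.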

Poster
Loay Mualem · Moran Feldman

[ Auditorium 1 Foyer ]

In recent years, maximization of DR-submodular continuous functions has become an important research field, with many real-world applications in the domains of machine learning, communication systems, operations research and economics. Most works in this field study maximization subject to down-closed convex set constraints due to an inapproximability result by Vondrak (2013). However, Durr et al. (2021) showed that one can bypass this inapproximability by proving approximation ratios that are functions of m, the minimum ell-infinity norm of any feasible vector. Given this observation, it is possible to get results for maximizing a DR-submodular function subject to general convex set constraints, which has led to multiple works on this problem. The most recent of these is a polynomial time 1/4(1 - m)-approximation offline algorithm due to Du (2022). However, only a sub-exponential time (1 - m)/(3^1.5)-approximation algorithm is known for the corresponding online problem. In this work, we present a polynomial time online algorithm matching the 1/4(1 - m)-approximation of the state-of-the-art offline algorithm. We also present an inapproximability result showing that our online algorithm and Du's (2022) offline algorithm are both optimal in a strong sense. Finally, we study the empirical performance of our algorithm and the algorithm Du …

Poster
Feihu Huang · Xidong Wu · Zhengmian Hu

[ Auditorium 1 Foyer ]

In this paper, we propose a class of faster adaptive Gradient Descent Ascent (GDA) methods for solving nonconvex-strongly-concave minimax problems by using unified adaptive matrices, which include almost all existing coordinate-wise and global adaptive learning rates. In particular, we provide an effective convergence analysis framework for our adaptive GDA methods. Specifically, we propose a fast Adaptive Gradient Descent Ascent (AdaGDA) method based on the basic momentum technique, which reaches a lower gradient complexity of $\tilde{O}(\kappa^4\epsilon^{-4})$ for finding an $\epsilon$-stationary point without large batches, improving the existing results for adaptive GDA methods by a factor of $O(\sqrt{\kappa})$. Moreover, we propose an accelerated version of AdaGDA (VR-AdaGDA) based on the momentum-based variance-reduction technique, which achieves a lower gradient complexity of $\tilde{O}(\kappa^{4.5}\epsilon^{-3})$ for finding an $\epsilon$-stationary point without large batches, improving the existing results for adaptive GDA methods by a factor of $O(\epsilon^{-1})$. Further, we prove that our VR-AdaGDA method can reach the best known gradient complexity of $\tilde{O}(\kappa^{3}\epsilon^{-3})$ with mini-batch size $O(\kappa^3)$. Experiments on policy evaluation and fair classifier learning tasks are conducted to verify the efficiency of our new algorithms.

Poster
Jixiang Qing · Henry Moss · Tom Dhaene · Ivo Couckuyt

[ Auditorium 1 Foyer ]

We present Parallel Feasible Pareto Frontier Entropy Search ($\{\text{PF}\}^2$ES) --- a novel information-theoretic acquisition function for multi-objective Bayesian optimization supporting unknown constraints and batch queries. Due to the complexity of characterizing the mutual information between candidate evaluations and (feasible) Pareto frontiers, existing approaches must either employ crude approximations that significantly hamper their performance or rely on expensive inference schemes that substantially increase the optimization's computational overhead. By instead using a variational lower bound, $\{\text{PF}\}^2$ES provides a low-cost and accurate estimate of the mutual information. We benchmark $\{\text{PF}\}^2$ES against other information-theoretic acquisition functions, demonstrating its competitive performance for optimization across synthetic and real-world design problems.
Poster
Martin Jankowiak

[ Auditorium 1 Foyer ]

Bayesian variable selection is a powerful tool for data analysis, as it offers a principled method for variable selection that accounts for prior information and uncertainty. However, wider adoption of Bayesian variable selection has been hampered by computational challenges, especially in difficult regimes with a large number of covariates P or non-conjugate likelihoods. To scale to the large P regime we introduce an efficient Markov Chain Monte Carlo scheme whose cost per iteration is sublinear in P (though linear in the number of data points). In addition we show how this scheme can be extended to generalized linear models for count data, which are prevalent in biology, ecology, economics, and beyond. In particular we design efficient algorithms for variable selection in binomial and negative binomial regression, which includes logistic regression as a special case. In experiments we demonstrate the effectiveness of our methods, including on cancer and maize genomic data.

Poster
Alberto Cabezas Gonzalez · Christopher Nemeth

[ Auditorium 1 Foyer ]

We propose a new framework for efficiently sampling from complex probability distributions using a combination of normalizing flows and elliptical slice sampling (Murray et al., 2010). The central idea is to learn a diffeomorphism, through normalizing flows, that maps the non-Gaussian structure of the target distribution to an approximately Gaussian distribution. We then use the elliptical slice sampler, an efficient and tuning-free Markov chain Monte Carlo (MCMC) algorithm, to sample from the transformed distribution. The samples are then pulled back using the inverse normalizing flow, yielding samples that approximate the stationary target distribution of interest. Our transport elliptical slice sampler (TESS) is optimized for modern computer architectures, where its adaptation mechanism utilizes parallel cores to rapidly run multiple Markov chains for a few iterations. Numerical demonstrations show that TESS produces Monte Carlo samples from the target distribution with lower autocorrelation compared to non-transformed samplers, and demonstrates significant improvements in efficiency when compared to gradient-based proposals designed for parallel computer architectures, given a flexible enough diffeomorphism.
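
TESS runs a standard elliptical slice sampling transition in the (approximately Gaussian) space produced by the flow; the base sampler (Murray et al., 2010) can be sketched as follows, here for a target proportional to a zero-mean Gaussian prior times a likelihood. The interface is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

def elliptical_slice(x, log_lik, prior_sample, rng):
    """One elliptical slice sampling transition (Murray et al., 2010) for a
    target proportional to a zero-mean Gaussian prior times exp(log_lik)."""
    nu = prior_sample()                          # auxiliary draw from the prior
    log_y = log_lik(x) + np.log(rng.uniform())   # slice threshold
    theta = rng.uniform(0.0, 2.0 * np.pi)
    lo, hi = theta - 2.0 * np.pi, theta
    while True:
        prop = x * np.cos(theta) + nu * np.sin(theta)
        if log_lik(prop) > log_y:
            return prop
        # shrink the angle bracket toward the current state and retry
        if theta < 0.0:
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)
```

The transition is tuning-free (no step size), which is what makes it attractive to run in parallel across chains after the flow has Gaussianized the target.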

Poster
Eren Mehmet Kiral · Thomas Moellenhoff · Khan Emtiyaz

[ Auditorium 1 Foyer ]

The Bayesian Learning Rule provides a framework for generic algorithm design but can be difficult to use for three reasons. First, it requires a specific parameterization of the exponential family. Second, it uses gradients which can be difficult to compute. Third, its update may not always stay on the distribution's manifold. We address these difficulties by proposing an extension based on Lie groups, where posteriors are parametrized through transformations of an arbitrary base distribution and updated via the group's exponential map. This simplifies all three difficulties for many cases, providing flexible parametrizations through the group's action, simple gradient computation through reparameterization, and updates that always stay on the manifold. We use the new learning rule to derive a new algorithm for deep learning with desirable biologically plausible attributes to learn sparse features.

Poster
Tomas Geffner · Justin Domke

[ Auditorium 1 Foyer ]

Many methods exist that build powerful variational distributions based on unadjusted Langevin transitions. Most of these were developed using a wide range of different approaches and techniques. Unfortunately, the lack of a unified analysis and derivation makes developing new methods and reasoning about existing ones a challenging task. We address this by giving a single analysis that unifies and generalizes these existing techniques. The main idea is to augment the target and variational distributions by numerically simulating the underdamped Langevin diffusion process and its time reversal. The benefits of this approach are twofold: it provides a unified formulation for many existing methods, and it simplifies the development of new ones. In fact, using our formulation we propose a new method that combines the strengths of previously existing algorithms; it uses underdamped Langevin transitions and powerful augmentations parameterized by a score network. Our empirical evaluation shows that our proposed method consistently outperforms relevant baselines in a wide range of tasks.
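
The shared building block of these methods is the Langevin transition; a minimal sketch of the overdamped (unadjusted) version, noting that the underdamped variant used in the paper additionally carries a momentum variable. The function name and interface are illustrative assumptions.

```python
import numpy as np

def ula_step(x, grad_log_p, step, rng):
    """One unadjusted (overdamped) Langevin transition approximately
    targeting p: a gradient-ascent step on log p plus Gaussian noise
    of matched scale sqrt(2 * step)."""
    noise = rng.standard_normal(np.shape(x))
    return x + step * grad_log_p(x) + np.sqrt(2.0 * step) * noise
```

Chaining such transitions (without a Metropolis correction) yields the kind of variational augmentation the paper's unified analysis covers.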

Poster
Juan Kuntz Nussio · Jen Ning Lim · Adam Johansen

[ Auditorium 1 Foyer ]

Neal and Hinton (1998) recast maximum likelihood estimation of any given latent variable model as the minimization of a free energy functional F, and the EM algorithm as coordinate descent applied to F. Here, we explore alternative ways to optimize the functional. In particular, we identify various gradient flows associated with F and show that their limits coincide with F's stationary points. By discretizing the flows, we obtain practical particle-based algorithms for maximum likelihood estimation in broad classes of latent variable models. The novel algorithms scale to high-dimensional settings and perform well in numerical experiments.
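For a toy Gaussian latent-variable model (z ~ N(theta, 1), x | z ~ N(z, 1)), one such particle-based discretization (a Langevin move on the latents coupled with a gradient step on the parameter; an assumed discretization for illustration, not necessarily the paper's exact scheme) looks like:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.0
n, m = 500, 64                                      # data points, particles per point
x = rng.normal(theta_true, np.sqrt(2.0), size=n)    # marginally, x ~ N(theta, 2)

theta = 0.0
z = rng.normal(0.0, 1.0, size=(n, m))  # particle clouds approximating p(z_i | x_i, theta)
step = 0.05
for _ in range(300):
    # Langevin move of the particles along grad_z log p(x, z | theta)
    grad_z = (x[:, None] - z) - (z - theta)
    z = z + step * grad_z + np.sqrt(2.0 * step) * rng.standard_normal(z.shape)
    # gradient ascent on theta, using the particles in place of E[z | x, theta]
    theta = theta + step * np.mean(z - theta)

# theta now approximates the MLE of this model, which is mean(x)
```

The particle step and the parameter step jointly discretize a gradient flow of the free energy F; at stationarity theta sits at a maximizer of the marginal likelihood.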

Poster
Marvin Schmitt · Stefan Radev · Paul Bürkner

[ Auditorium 1 Foyer ]

Bayesian model comparison (BMC) offers a principled probabilistic approach to study and rank competing models. In standard BMC, we construct a discrete probability distribution over the set of possible models, conditional on the observed data of interest. These posterior model probabilities (PMPs) are measures of uncertainty, but---when derived from a finite number of observations---are also uncertain themselves. In this paper, we conceptualize distinct levels of uncertainty which arise in BMC. We explore a fully probabilistic framework for quantifying meta-uncertainty, resulting in an applied method to enhance any BMC workflow. Drawing on both Bayesian and frequentist techniques, we represent the uncertainty over the uncertain PMPs via meta-models which combine simulated and observed data into a predictive distribution for PMPs on new data. We demonstrate the utility of the proposed method in the context of conjugate Bayesian regression, likelihood-based inference with Markov chain Monte Carlo, and simulation-based inference with neural networks.

Poster
Syrine Belakaria · Janardhan Rao Doppa · Nicolo Fusi · Rishit Sheth

[ Auditorium 1 Foyer ]

The rising growth of deep neural networks (DNNs) and datasets in size motivates the need for efficient solutions for simultaneous model selection and training. Many methods for hyperparameter optimization (HPO) of iterative learners, including DNNs, attempt to solve this problem by querying and learning a response surface while searching for the optimum of that surface. However, many of these methods make myopic queries, do not consider prior knowledge about the response structure, and/or perform a biased cost-aware search, all of which make it harder to identify the best-performing model when a total cost budget is specified. This paper proposes a novel approach referred to as Budget-Aware Planning for Iterative Learners (BAPI) to solve HPO problems under a constrained cost budget. BAPI is an efficient non-myopic Bayesian optimization solution that accounts for the budget and leverages prior knowledge about the objective function and cost function to select better configurations and to make more informed decisions during the evaluation (training). Experiments on diverse HPO benchmarks for iterative learners show that BAPI performs better than state-of-the-art baselines in most cases.

Poster
Mrinank Sharma · Sebastian Farquhar · Eric Nalisnick · Tom Rainforth

[ Auditorium 1 Foyer ]

We investigate the benefit of treating all the parameters in a Bayesian neural network stochastically and find compelling theoretical and empirical evidence that this standard construction may be unnecessary. To this end, we prove that expressive predictive distributions require only small amounts of stochasticity. In particular, partially stochastic networks with only n stochastic biases are universal probabilistic predictors for n-dimensional predictive problems. In empirical investigations, we find no systematic benefit of full stochasticity across four different inference modalities and eight datasets; partially stochastic networks can match and sometimes even outperform fully stochastic networks, despite their reduced memory costs.

Poster
Rick Presman · Jason Xu

[ Auditorium 1 Foyer ]

Constrained learning is prevalent in countless statistical tasks. Recent work proposes distance-to-set penalties to derive estimators under general constraints that can be specified as sets, but focuses on obtaining point estimates that do not come with corresponding measures of uncertainty. To remedy this, we approach distance-to-set regularization from a Bayesian lens. We consider a class of smooth distance-to-set priors, showing that they yield well-defined posteriors toward quantifying uncertainty for constrained learning problems. We discuss relationships and advantages over prior work on Bayesian constraint relaxation. Moreover, we prove that our approach is optimal in an information-geometric sense for finite penalty parameters ρ, and enjoys favorable statistical properties when ρ → ∞. The method is designed to perform effectively within gradient-based MCMC samplers, as illustrated on a suite of simulated and real data applications.

Poster
Yirui Liu · Xinghao Qiao · Liying Wang · Jessica Lam

[ Auditorium 1 Foyer ]

Training deep graph neural networks (GNNs) poses a challenging task, as the performance of GNNs may degrade as the number of hidden message-passing layers grows. The literature has focused on over-smoothing and under-reaching to explain the performance deterioration of deep GNNs. In this paper, we propose a new explanation for this deterioration: mis-simplification, that is, mistakenly simplifying graphs by preventing self-loops and forcing edges to be unweighted. We show that such simplification can reduce the potential of message-passing layers to capture the structural information of graphs. In view of this, we propose a new framework, the edge enhanced graph neural network (EEGNN). EEGNN uses the structural information extracted from the proposed Dirichlet mixture Poisson graph model (DMPGM), a Bayesian nonparametric model for graphs, to improve the performance of various deep message-passing GNNs. We propose a Markov chain Monte Carlo inference framework for DMPGM. Experiments over different datasets show that our method achieves considerable performance gains compared to baselines.

Poster
Benjie Wang · Marta Kwiatkowska

[ Auditorium 1 Foyer ]

Probabilistic circuits (PCs) are a class of tractable probabilistic models, which admit efficient inference routines depending on their structural properties. In this paper, we introduce md-vtrees, a novel structural formulation of (marginal) determinism in structured decomposable PCs, which generalizes previously proposed classes such as probabilistic sentential decision diagrams. Crucially, we show how md-vtrees can be used to derive tractability conditions and efficient algorithms for advanced inference queries expressed as arbitrary compositions of basic probabilistic operations, such as marginalization, multiplication and reciprocals, in a sound and generalizable manner. In particular, we derive the first polytime algorithms for causal inference queries such as backdoor adjustment on PCs. As a practical instantiation of the framework, we propose MDNets, a novel PC architecture using md-vtrees, and empirically demonstrate their application to causal inference.

Poster
Viacheslav Borovitskiy · Mohammad Reza Karimi · Vignesh Ram Somnath · Andreas Krause

[ Auditorium 1 Foyer ]

We propose a principled way to define Gaussian process priors on various sets of unweighted graphs: directed or undirected, with or without loops. We endow each of these sets with a geometric structure, inducing the notions of closeness and symmetries, by turning them into a vertex set of an appropriate metagraph. Building on this, we describe the class of priors that respect this structure and are analogous to the Euclidean isotropic processes, like squared exponential or Matérn. We propose an efficient computational technique for the ostensibly intractable problem of evaluating these priors' kernels, making such Gaussian processes usable within the usual toolboxes and downstream applications. We go further to consider sets of equivalence classes of unweighted graphs and define the appropriate versions of priors thereon. We prove a hardness result, showing that in this case, exact kernel computation cannot be performed efficiently. However, we propose a simple Monte Carlo approximation for handling moderately sized cases. Inspired by applications in chemistry, we illustrate the proposed techniques on a real molecular property prediction task in the small data regime.

Poster
Daniel Jeong · Seyoung Kim

[ Auditorium 1 Foyer ]

Multi-output Gaussian process (GP) regression has been widely used as a flexible nonparametric Bayesian model for predicting multiple correlated outputs given inputs. However, the cubic complexity in the sample size and the output dimensions for inverting the kernel matrix has limited its use in the large-data regime. In this paper, we introduce the factorial stochastic differential equation as a representation of multi-output GP regression, which is a factored state-space representation as in factorial hidden Markov models. We propose a structured mean-field variational inference approach that achieves a time complexity linear in the number of samples, along with its sparse variational inference counterpart with complexity linear in the number of inducing points. On simulated and real-world data, we show that our approach significantly improves upon the scalability of previous methods, while achieving competitive prediction accuracy.

Poster
Yehu Chen · Roman Garnett · Jacob Montgomery · Annamaria Prati

[ Auditorium 1 Foyer ]

We introduce a Bayesian multi-task Gaussian process model for estimating treatment effects from panel data, where an intervention outside the observer's control influences a subset of the observed units. Our model encodes structured temporal dynamics both within and across the treatment and control groups and incorporates a flexible prior for the evolution of treatment effects over time. These innovations aid in inferring posteriors for dynamic treatment effects that encode our uncertainty about the likely trajectories of units in the absence of treatment. We also discuss the asymptotic properties of the joint posterior over counterfactual outcomes and treatment effects, which exhibits intuitive behavior in the large-sample limit. In experiments on both synthetic and real data, our approach performs no worse than existing methods and significantly better when standard assumptions are violated.

Poster
Thomas M. McDonald · Magnus Ross · Michael Smith · Mauricio Álvarez

[ Auditorium 1 Foyer ]

A key challenge in the practical application of Gaussian processes (GPs) is selecting a proper covariance function. The process convolutions construction of GPs allows some additional flexibility, but still requires choosing a proper smoothing kernel, which is non-trivial. Previous approaches have built covariance functions by using GP priors over the smoothing kernel, and by extension the covariance, as a way to bypass the need to specify it in advance. However, these models have been limited in several ways: they are restricted to single-dimensional inputs, e.g. time; they only allow modelling of single outputs; and they do not scale to large datasets, since inference is not straightforward. In this paper, we introduce a nonparametric process convolution formulation for GPs that alleviates these weaknesses. We achieve this using a functional sampling approach based on Matheron’s rule to perform fast sampling using interdomain inducing variables. We test the performance of our model on benchmarks for single output, multi-output and large-scale GP regression, and find that our approach can provide improvements over standard GP models, particularly for larger datasets.

Poster
Kaiyu Li · Daniel Giles · Toni Karvonen · Serge Guillas · Francois-Xavier Briol

[ Auditorium 1 Foyer ]

Multilevel Monte Carlo is a key tool for approximating integrals involving expensive scientific models. The idea is to use approximations of the integrand to construct an estimator with improved accuracy over classical Monte Carlo. We propose to further enhance multilevel Monte Carlo through Bayesian surrogate models of the integrand, focusing on Gaussian process models and the associated Bayesian quadrature estimators. We show, using both theory and numerical experiments, that our approach can lead to significant improvements in accuracy when the integrand is expensive and smooth, and when the dimensionality is small or moderate. We conclude the paper with a case study illustrating the potential impact of our method in landslide-generated tsunami modelling, where the cost of each integrand evaluation is typically too large for operational settings.
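The classical multilevel telescoping estimator that the paper enhances can be sketched as follows; the toy integrand and per-level sample counts below are illustrative assumptions, and the paper's contribution replaces the plain Monte Carlo averages with Bayesian quadrature under a Gaussian process surrogate:

```python
import math
import numpy as np

def mlmc_estimate(f_levels, n_samples, rng=None):
    # E[f_L] = E[f_0] + sum_l E[f_l - f_{l-1}]: each correction term is
    # estimated independently, with coarse and fine levels coupled
    # through the same random inputs u so the corrections have low variance.
    rng = np.random.default_rng() if rng is None else rng
    est = 0.0
    for level, n in enumerate(n_samples):
        u = rng.uniform(size=n)
        if level == 0:
            est += np.mean(f_levels[0](u))
        else:
            est += np.mean(f_levels[level](u) - f_levels[level - 1](u))
    return est

# Levels: truncated Taylor approximations of exp(u), increasingly accurate (and,
# in a real application, increasingly expensive).
levels = [lambda u, k=k: sum(u**i / math.factorial(i) for i in range(k))
          for k in (2, 4, 8)]
est = mlmc_estimate(levels, n_samples=[4000, 1000, 250], rng=np.random.default_rng(1))
# est approximates E[exp(U)] = e - 1 for U ~ Uniform(0, 1)
```

Because the corrections shrink across levels, most samples can be spent on the cheap coarse level, which is the source of the cost savings over classical Monte Carlo.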

Poster
Natalie Maus · Kaiwen Wu · David Eriksson · Jacob Gardner

[ Auditorium 1 Foyer ]

Bayesian optimization (BO) is a popular approach to sample-efficient optimization of black-box objective functions. While BO has been successfully applied to a wide range of scientific applications, traditional approaches to single-objective BO only seek to find a single best solution. This can be a significant limitation in situations where solutions may later turn out to be intractable, e.g., a designed molecule may later turn out to violate constraints that can only be evaluated after the optimization process has concluded. To combat this issue, we propose Rank-Ordered Bayesian Optimization with Trust-regions (ROBOT) which aims to find a portfolio of high-performing solutions that are diverse according to a user-specified diversity metric. We evaluate ROBOT on several real-world applications and show that it can discover large sets of high-performing diverse solutions while requiring few additional function evaluations compared to finding a single best solution.

Poster
Henry Moss · Sebastian Ober · Victor Picheny

[ Auditorium 1 Foyer ]

Sparse Gaussian processes are a key component of high-throughput Bayesian optimisation (BO) loops; however, we show that existing methods for allocating their inducing points severely hamper optimisation performance. By exploiting the quality-diversity decomposition of determinantal point processes, we propose the first inducing point allocation strategy designed specifically for use in BO. Unlike existing methods which seek only to reduce global uncertainty in the objective function, our approach provides the local high-fidelity modelling of promising regions required for precise optimisation. More generally, we demonstrate that our proposed framework provides a flexible way to allocate modelling capacity in sparse models and so is suitable for a broad range of downstream sequential decision making tasks.

Poster
Yohan Jung · Jinkyoo Park

[ Auditorium 1 Foyer ]

Convolutional deep sets are a deep neural network (DNN) architecture that can model stationary stochastic processes. The architecture combines a kernel smoother with a DNN to construct translation-equivariant functional representations, thereby encoding the inductive bias of stationarity into the DNN. However, because the kernel smoother is a non-parametric model, it can produce ambiguous representations when insufficiently many data points are given. To remedy this issue, in this work we introduce Bayesian convolutional deep sets, which construct random translation-equivariant functional representations with a stationary prior. Furthermore, we present how to impose a task-dependent prior for each dataset, because a wrongly imposed prior yields representations even worse than those of the kernel smoother. We validate the proposed architecture and its training on various experiments with time-series and image datasets.

Poster
Matthew Hoffman · Pavel Sountsov · Tuan Anh Le · Ben Lee · Christopher Suter · Vikash Mansinghka · Rif A. Saurous

[ Auditorium 1 Foyer ]

The problem of inferring object shape from a single 2D image is underconstrained. Prior knowledge about what objects are plausible can help, but even given such prior knowledge there may still be uncertainty about the shapes of occluded parts of objects. Recently, conditional neural radiance field (NeRF) models have been developed that can learn to infer good point estimates of 3D models from single 2D images. The problem of inferring uncertainty estimates for these models has received less attention. In this work, we propose probabilistic NeRF (ProbNeRF), a model and inference strategy for learning probabilistic generative models of 3D objects’ shapes and appearances, and for doing posterior inference to recover those properties from 2D images. ProbNeRF is trained as a variational autoencoder, but at test time we use Hamiltonian Monte Carlo (HMC) for inference. Given one or a few 2D images of an object (which may be partially occluded), ProbNeRF is able not only to accurately model the parts it sees, but also to propose realistic and diverse hypotheses about the parts it does not see. We show that key to the success of ProbNeRF are (i) a deterministic rendering scheme, (ii) an annealed-HMC strategy, (iii) a hypernetwork-based decoder architecture, …

Poster
Hiroshi Morioka · Aapo Hyvarinen

[ Auditorium 1 Foyer ]

Causal discovery methods typically extract causal relations between multiple nodes (variables) based on univariate observations of each node. However, one frequently encounters situations where each node is multivariate, i.e. has multiple observational modalities. Furthermore, the observed modalities may be generated through an unknown mixing process, so that some original latent variables are entangled inside the nodes. In such a multimodal case, the existing frameworks cannot be applied. To analyze such data, we propose a new causal representation learning framework called connectivity-contrastive learning (CCL). CCL disentangles the observational mixing and extracts a set of mutually independent latent components, each having a separate causal structure between the nodes. The actual learning proceeds by a novel self-supervised learning method in which the pretext task is to predict the label of a pair of nodes from the observations of the node pairs. We present theorems which show that CCL can indeed identify both the latent components and the multimodal causal structure under weak technical assumptions, up to some indeterminacy. Finally, we experimentally show its superior causal discovery performance compared to state-of-the-art baselines, in particular demonstrating robustness against latent confounders.

Poster
Quanhan Xi · Benjamin Bloem-Reddy

[ Auditorium 1 Foyer ]

Most modern probabilistic generative models, such as the variational autoencoder (VAE), have certain indeterminacies that are unresolvable even with an infinite amount of data. Different tasks tolerate different indeterminacies; however, recent applications have indicated the need for strongly identifiable models, in which an observation corresponds to a unique latent code. Progress has been made towards reducing model indeterminacies while maintaining flexibility, and recent work excludes many---but not all---indeterminacies. In this work, we motivate model-identifiability in terms of task-identifiability, then construct a theoretical framework for analyzing the indeterminacies of latent variable models, which enables their precise characterization in terms of the generator function and prior distribution spaces. We reveal that strong identifiability is possible even with highly flexible nonlinear generators, and give two such examples. One is a straightforward modification of iVAE (Khemakhem et al., 2020); the other uses triangular monotonic maps, leading to novel connections between optimal transport and identifiability.

Poster
Ilya Shpitser

[ Auditorium 1 Foyer ]

Graphical models such as Markov random fields (MRFs), which are associated with undirected graphs, and Bayesian networks (BNs), which are associated with directed acyclic graphs, have proven to be a very popular approach for reasoning under uncertainty, prediction problems, and causal inference. Parametric MRF likelihoods are well-studied for Gaussian and categorical data. However, in more complicated parametric and semi-parametric settings, likelihoods specified via clique potential functions are generally not known to be congenial or non-redundant. Congenial and non-redundant DAG likelihoods are far simpler to specify in both parametric and semi-parametric settings by modeling Markov factors in the DAG factorization. However, DAG likelihoods specified in this way are not guaranteed to coincide in distinct DAGs within the same Markov equivalence class. This complicates likelihood-based model selection procedures for DAGs by "sneaking in" potentially unwarranted assumptions about edge orientations. In this paper we link a density function decomposition due to Chen with the clique factorization of MRFs described by Lauritzen to provide a general likelihood for MRF models. The proposed likelihood is composed of variationally independent and non-redundant closed-form functionals of the observed data distribution, and is sufficiently general to apply to arbitrary parametric and semi-parametric models. We use an extension of …

Poster
Zhendong Wang · Ruijiang Gao · Mingzhang Yin · Mingyuan Zhou · David Blei

[ Auditorium 1 Foyer ]

This paper proposes probabilistic conformal prediction (PCP), a predictive inference algorithm that estimates a target variable by a discontinuous predictive set. Given inputs, PCP constructs the predictive set based on random samples from an estimated generative model. It is efficient and compatible with conditional generative models with either explicit or implicit density functions. We show that PCP guarantees correct marginal coverage with finite samples and give empirical evidence of conditional coverage. We study PCP on a variety of simulated and real datasets. Compared to existing conformal prediction methods, PCP provides sharper predictive sets.
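A minimal sketch of the PCP construction for a scalar target (the function name and the simple quantile calibration are our assumptions for illustration; the paper gives the exact finite-sample argument):

```python
import numpy as np

def pcp_radius(samples_cal, y_cal, alpha=0.1):
    # Conformity score: distance from the true y_i to the nearest of the
    # K generative-model draws at x_i. The calibrated radius r makes the
    # union of balls of radius r around K fresh draws a predictive set
    # with (roughly) 1 - alpha marginal coverage.
    scores = np.array([np.min(np.abs(s - y)) for s, y in zip(samples_cal, y_cal)])
    n = len(scores)
    return np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)

rng = np.random.default_rng(0)
y_cal = rng.normal(size=200)                # calibration targets
samples_cal = rng.normal(size=(200, 50))    # K = 50 draws per calibration input
r = pcp_radius(samples_cal, y_cal)
```

Because the set is a union of balls around samples, it can be discontinuous and track multimodal conditional distributions, which is where the sharpness gains over interval-based conformal methods come from.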

Poster
Lukang Sun · Avetik Karagulyan · Peter Richtarik

[ Auditorium 1 Foyer ]

Stein Variational Gradient Descent (SVGD) is an important alternative to the Langevin-type algorithms for sampling from probability distributions of the form $\pi(x) \propto \exp(-V(x))$. In the existing theory of Langevin-type algorithms and SVGD, the potential function $V$ is often assumed to be $L$-smooth. However, this restrictive condition excludes a large class of potential functions such as polynomials of degree greater than $2$. Our paper studies the convergence of the SVGD algorithm for distributions with $(L_0,L_1)$-smooth potentials. This relaxed smoothness assumption was introduced by Zhang et al. [2019a] for the analysis of gradient clipping algorithms. With the help of trajectory-independent auxiliary conditions, we provide a descent lemma establishing that the algorithm decreases the KL divergence at each iteration and prove a complexity bound for SVGD in the population limit in terms of the Stein Fisher information.
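The SVGD update analyzed in the paper can be sketched with a finite particle set, an RBF kernel, and the median-heuristic bandwidth (the step size, particle count, and quadratic demo potential below are illustrative choices; the paper's analysis covers faster-growing (L0, L1)-smooth potentials):

```python
import numpy as np

def svgd_step(x, grad_log_p, step=0.5):
    # phi(x_i) = (1/n) sum_j [ k(x_j, x_i) grad log p(x_j)
    #                          + grad_{x_j} k(x_j, x_i) ]   for an RBF kernel.
    n = x.shape[0]
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    h = np.median(d2) / max(np.log(n), 1.0) + 1e-12        # median heuristic
    k = np.exp(-d2 / h)
    attract = k @ grad_log_p(x)                            # pulls toward high density
    repulse = (2.0 / h) * (k.sum(axis=1, keepdims=True) * x - k @ x)  # keeps spread
    return x + step * (attract + repulse) / n

# Demo: V(x) = ||x||^2 / 2, i.e. grad log pi(x) = -x.
rng = np.random.default_rng(0)
particles = rng.normal(3.0, 1.0, size=(50, 2))
for _ in range(1000):
    particles = svgd_step(particles, lambda z: -z)
```

The attractive term drives the KL-divergence descent quantified in the paper's descent lemma; the repulsive term prevents particle collapse.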

Poster
Laurence Davies · Robert Salomone · Matthew Sutton · Chris Drovandi

[ Auditorium 1 Foyer ]

Reversible jump Markov chain Monte Carlo (RJMCMC) proposals that achieve reasonable acceptance rates and mixing are notoriously difficult to design in most applications. Inspired by recent advances in deep neural network-based normalizing flows and density estimation, we demonstrate an approach to enhance the efficiency of RJMCMC sampling by performing transdimensional jumps involving reference distributions. In contrast to other RJMCMC proposals, the proposed method is the first to apply a non-linear transport-based approach to construct efficient proposals between models with complicated dependency structures. It is shown that, in the setting where exact transports are used, our RJMCMC proposals have the desirable property that the acceptance probability depends only on the model probabilities. Numerical experiments demonstrate the efficacy of the approach.

Poster
Ansuman Banerjee · Shayak Chakraborty · Sourav Chakraborty · Kuldeep S Meel · Uddalok Sarkar · Sayantan Sen

[ Auditorium 1 Foyer ]

Sampling over combinatorial spaces is a fundamental problem in artificial intelligence with a wide variety of applications. Since state-of-the-art techniques heavily rely on heuristics whose rigorous analysis remains beyond the reach of current theoretical tools, the past few years have witnessed interest in the design of techniques to test the quality of samplers. The current state-of-the-art techniques, $\mathsf{Barbarik}$ and $\mathsf{Barbarik2}$, focus on the case where combinatorial spaces are encoded as Conjunctive Normal Form (CNF) formulas. While CNF is a general-purpose form, often techniques rely on exploiting specific representations to achieve speedup. Of particular interest are Horn clauses, which form the basis of the logic programming tools in AI. In this context, a natural question is whether it is possible to design a tester that can determine the correctness of a given Horn sampler. The primary contribution of this paper is an affirmative answer to the above question. We design the first tester, $\mathsf{Flash}$, which tests the correctness of a given Horn sampler: given a specific distribution $\mathcal{I}$ and parameters $\eta$, $\varepsilon$, and $\delta$, the tester $\mathsf{Flash}$ correctly (with probability at least $ 1-\delta$) distinguishes whether the underlying distribution of the Horn-sampler is ``$\varepsilon$-close'' to $\mathcal{I}$ or ``$\eta$-far'' from $\mathcal{I}$ by …

Poster
Manh Nguyen · Anderson Ye Zhang

[ Auditorium 1 Foyer ]

Item response theory (IRT) is the study of how people make probabilistic decisions, with diverse applications in educational testing, recommendation systems, and beyond. The Rasch model of binary response data, one of the most fundamental models in IRT, remains an active area of research with important practical significance. Recently, Nguyen and Zhang (2022) proposed a new spectral estimation algorithm that is both efficient and accurate. In this work, we extend their results in two important ways. First, we obtain a refined entrywise error bound for the spectral algorithm, complementing the `average error' $\ell_2$ bound in their work. Notably, the spectral algorithm achieves the minimax optimal entrywise error bound (up to a log factor). Building on the entrywise error bound, we also show that the spectral algorithm enjoys optimal sample complexity for top-$K$ recovery (e.g., identifying the best $K$ items from approval/disapproval response data), explaining interesting empirical findings in the previous work. Our second contribution addresses an important but understudied topic in IRT: privacy. Despite the human-centric applications of IRT, there has not been any proposed privacy-preserving mechanism in the literature. We develop a private extension of the spectral algorithm, leveraging its unique Markov chain formulation and the discrete Gaussian mechanism …

Poster
Rosanne Turner · Peter Grunwald

[ Auditorium 1 Foyer ]

Sequential decision making significantly speeds up research and is more cost-effective compared to fixed-n methods. We present a method for sequential decision making for stratified count data that retains Type-I error guarantees or false discovery rate control under optional stopping, using e-variables. We invert the method to construct stratified anytime-valid confidence sequences, where cross-talk between subpopulations in the data can be allowed during data collection to improve power. Finally, we combine information collected in separate subpopulations through pseudo-Bayesian averaging and switching to create effective estimates for the minimal, mean and maximal treatment effects in the subpopulations.
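As a generic illustration of how e-variables license optional stopping (a plain Bernoulli likelihood-ratio e-process, not the paper's stratified construction):

```python
import numpy as np

def e_process(xs, p0=0.5, p1=0.6):
    # Likelihood-ratio e-variables for Bernoulli data: under H0 (success
    # rate p0) each factor has expectation exactly 1, so the running
    # product is an e-process. By Ville's inequality, stopping and
    # rejecting whenever it exceeds 1/alpha keeps the Type-I error below
    # alpha at any data-dependent stopping time.
    xs = np.asarray(xs)
    factors = np.where(xs == 1, p1 / p0, (1 - p1) / (1 - p0))
    return np.cumprod(factors)

rng = np.random.default_rng(0)
e_alt = e_process(rng.random(5000) < 0.6)   # data drawn from the alternative
```

Under the alternative the product grows exponentially, so evidence accumulates and the test can stop early; under the null it stays small with high probability at every time.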

Poster
Gianluigi Lopardo · Frederic Precioso · Damien Garreau

[ Auditorium 1 Foyer ]

Anchors (Ribeiro et al., 2018) is a post-hoc, rule-based interpretability method. For text data, it proposes to explain a decision by highlighting a small set of words (an anchor) such that the model to explain has similar outputs when they are present in a document. In this paper, we present the first theoretical analysis of Anchors, considering that the search for the best anchor is exhaustive. After formalizing the algorithm for text classification, we present explicit results on different classes of models when the vectorization step is TF-IDF, and words are replaced by a fixed out-of-dictionary token when removed. Our inquiry covers models such as elementary if-then rules and linear classifiers. We then leverage this analysis to gain insights on the behavior of Anchors for any differentiable classifiers. For neural networks, we empirically show that the words corresponding to the highest partial derivatives of the model with respect to the input, reweighted by the inverse document frequencies, are selected by Anchors.

Poster
Charles Marx · Youngsuk Park · Hilaf Hasson · Yuyang Wang · Stefano Ermon · Luke Huan

[ Auditorium 1 Foyer ]

Although black-box models can accurately predict outcomes such as weather patterns, they often lack transparency, making it challenging to extract meaningful insights (such as which atmospheric conditions signal future rainfall). Model explanations attempt to identify the essential features of a model, but these explanations can be inconsistent: two near-optimal models may admit vastly different explanations. In this paper, we propose a solution to this problem by constructing uncertainty sets for explanations of the optimal model(s) in both frequentist and Bayesian settings. Our uncertainty sets are guaranteed to include the explanation of the optimal model with high probability, even though this model is unknown. We demonstrate the effectiveness of our approach in both synthetic and real-world experiments, illustrating how our uncertainty sets can be used to calibrate trust in model explanations.

Poster
Christoph Luther · Gunnar König · Moritz Grosse-Wentrup

[ Auditorium 1 Foyer ]

The Shapley Additive Global Importance (SAGE) value is a theoretically appealing interpretability method that fairly attributes global importance to a model’s features. However, its exact calculation requires the computation of the feature’s surplus performance contributions over an exponential number of feature sets. This is computationally expensive, particularly because estimating the surplus contributions requires sampling from conditional distributions. Thus, SAGE approximation algorithms only take a fraction of the feature sets into account. We propose d-SAGE, a method that accelerates SAGE approximation. d-SAGE is motivated by the observation that conditional independencies (CIs) between a feature and the model target imply zero surplus contributions, such that their computation can be skipped. To identify CIs, we leverage causal structure learning (CSL) to infer a graph that encodes (conditional) independencies in the data as d-separations. This is computationally more efficient because the expense of the one-time graph inference and the d-separation queries is negligible compared to the expense of surplus contribution evaluations. Empirically we demonstrate that d-SAGE enables the efficient and accurate estimation of SAGE values.

Poster
Sebastian Bordt · Ulrike von Luxburg

[ Auditorium 1 Foyer ]

In explainable machine learning, local post-hoc explanation algorithms and inherently interpretable models are often seen as competing approaches. This work offers a partial reconciliation between the two by establishing a correspondence between Shapley Values and Generalized Additive Models (GAMs). We introduce $n$-Shapley Values, a parametric family of local post-hoc explanation algorithms that explain individual predictions with interaction terms up to order $n$. By varying the parameter $n$, we obtain a sequence of explanations that covers the entire range from Shapley Values up to a uniquely determined decomposition of the function we want to explain. The relationship between $n$-Shapley Values and this decomposition offers a functionally-grounded characterization of Shapley Values, which highlights their limitations. We then show that $n$-Shapley Values, as well as the Shapley Taylor- and Faith-Shap interaction indices, recover GAMs with interaction terms up to order $n$. This implies that the original Shapley Values recover GAMs without variable interactions. Taken together, our results provide a precise characterization of Shapley Values as they are being used in explainable machine learning. They also offer a principled interpretation of partial dependence plots of Shapley Values in terms of the underlying functional decomposition. A package for the estimation of different interaction indices is …
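For a model with few features, the exhaustive computation underlying such explanations can be written directly; the sketch below imputes removed features with a fixed baseline (an illustrative simplification of the conditional imputation typically used) and corresponds to the order-1 end of the family:

```python
import itertools
import math

def shapley_values(f, x, baseline):
    # Exact Shapley values by enumerating all feature subsets: feasible
    # only for small d, but it makes the weighted-average definition explicit.
    d = len(x)
    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for s in range(d):
            w = math.factorial(s) * math.factorial(d - s - 1) / math.factorial(d)
            for subset in itertools.combinations(others, s):
                z = list(baseline)
                for j in subset:
                    z[j] = x[j]
                without = f(z)      # feature i absent
                z[i] = x[i]
                phi[i] += w * (f(z) - without)  # marginal contribution of i
    return phi

# Additive model without interactions: Shapley values recover each term exactly.
phi = shapley_values(lambda z: z[0] + 2 * z[1], x=[1.0, 3.0], baseline=[0.0, 0.0])
# phi == [1.0, 6.0], and the values sum to f(x) - f(baseline)
```

For models with interactions, the higher-order members of the family distribute each interaction term across the involved features, which is exactly the correspondence with GAMs described above.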
Poster
Zhun Deng · He Sun · Steven Wu · Linjun Zhang · David Parkes

[ Auditorium 1 Foyer ]

AI methods are used in societally important settings, ranging from credit to employment to housing, and it is crucial to provide fairness in regard to automated decision making. Moreover, many settings are dynamic, with populations responding to sequential decision policies. We introduce the study of reinforcement learning (RL) with stepwise fairness constraints, which require group fairness at each time step. In the case of tabular episodic RL, we provide learning algorithms with strong theoretical guarantees in regard to policy optimality and fairness violations. Our framework provides tools to study the impact of fairness constraints in sequential settings and brings up new challenges in RL.

Poster
Yao Yao · Qihang Lin · Tianbao Yang

[ Auditorium 1 Foyer ]

As machine learning is increasingly used in making high-stakes decisions, a rising challenge is to avoid unfair AI systems that lead to discriminatory decisions for protected populations. A direct approach for obtaining a fair predictive model is to train the model by optimizing its prediction performance subject to fairness constraints. Among various fairness constraints, those based on the area under the ROC curve (AUC) have emerged recently because they are threshold-agnostic and effective for unbalanced data. In this work, we formulate the problem of training a fairness-aware predictive model as an AUC optimization problem subject to a class of AUC-based fairness constraints. This problem can be reformulated as a min-max optimization problem with min-max constraints, which we solve by stochastic first-order methods based on a new Bregman divergence designed for the special structure of the problem. We numerically demonstrate the effectiveness of our approach on real-world data under different fairness metrics.
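To make the AUC-based notion concrete, here is one simple instance of such a quantity: the gap between intra-group AUCs, computed with the Mann-Whitney estimator. The scores and groups are hypothetical, and this particular gap is an illustrative choice, not necessarily the exact constraint class studied in the paper.

```python
import numpy as np

def auc(scores_pos, scores_neg):
    # Mann-Whitney estimate of P(positive scored above negative),
    # with ties counted as one half.
    sp = np.asarray(scores_pos)[:, None]
    sn = np.asarray(scores_neg)[None, :]
    return (sp > sn).mean() + 0.5 * (sp == sn).mean()

# Hypothetical model scores for two demographic groups a and b.
pos_a, neg_a = [0.9, 0.8, 0.7], [0.2, 0.1]
pos_b, neg_b = [0.6, 0.3], [0.4, 0.2]

# An AUC-based fairness quantity: the gap between intra-group AUCs.
# A fairness constraint would bound such a gap during training; note it
# involves no classification threshold, unlike demographic-parity-style
# constraints on binarized predictions.
gap = abs(auc(pos_a, neg_a) - auc(pos_b, neg_b))
```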

Poster
Rui Wang · Pengyu Cheng · Ricardo Henao

[ Auditorium 1 Foyer ]

Pretrained language models (PLMs), such as GPT-2, have achieved remarkable empirical performance in text generation tasks. However, pretrained on large-scale natural language corpora, the generated text from PLMs may exhibit social bias against disadvantaged demographic groups. To improve the fairness of PLMs in text generation, we propose to minimize the mutual information between the semantics in the generated text sentences and their demographic polarity, i.e., the demographic group to which the sentence is referring. In this way, the mention of a demographic group (e.g., male or female) is encouraged to be independent of how it is described in the generated text, thus effectively alleviating the social bias. Moreover, we propose to efficiently estimate the upper bound of the above mutual information via importance sampling, leveraging a natural language corpus. We also propose a distillation mechanism that preserves the language modeling ability of the PLMs after debiasing. Empirical results on real-world benchmarks demonstrate that the proposed method yields superior performance in terms of both fairness and language modeling ability.

Poster
Wei-Ning Chen · Ayfer Ozgur · Graham Cormode · Akash Bharadwaj

[ Auditorium 1 Foyer ]

We consider the federated frequency estimation problem, where each user holds a private item $X_i$ from a size-$d$ domain and a server aims to estimate the empirical frequency (i.e., histogram) of $n$ items with $n \ll d$. Without any security and privacy considerations, each user can communicate its item to the server by using $\log d$ bits. A naive application of secure aggregation protocols would, however, require $d\log n$ bits per user. Can we reduce the communication needed for secure aggregation, and does security come with a fundamental cost in communication? In this paper, we develop an information-theoretic model for secure aggregation that allows us to characterize the fundamental cost of security and privacy in terms of communication. We show that with security (and without privacy) $\Omega\left( n \log d \right)$ bits per user are necessary and sufficient to allow the server to compute the frequency distribution. This is significantly smaller than the $d\log n$ bits per user needed by the naive scheme but significantly higher than the $\log d$ bits per user needed without security. To achieve differential privacy, we construct a linear scheme based on a noisy sketch that locally perturbs the data and does not require a …
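The communication regimes discussed above can be illustrated with a back-of-envelope sketch. The domain size, user count, and the naive one-hot encoding below are hypothetical; the point is only the relative per-user costs $\log d$ (no security), $d\log n$ (naive secure aggregation of one-hot counters), and $\Theta(n \log d)$ (the rate shown to be necessary and sufficient).

```python
import math
import numpy as np

rng = np.random.default_rng(0)
d, n = 1024, 32                      # domain size and number of users, n << d
items = rng.integers(0, d, size=n)

# Naive secure aggregation: each user submits a one-hot vector of d
# counters (each wide enough to hold counts up to n); the server only
# learns the sum of all submissions, i.e. the histogram.
onehots = np.zeros((n, d), dtype=int)
onehots[np.arange(n), items] = 1
histogram = onehots.sum(axis=0)

bits_insecure = math.ceil(math.log2(d))              # just send the item
bits_naive_secure = d * math.ceil(math.log2(n + 1))  # d counters per user
bits_optimal_secure = n * math.ceil(math.log2(d))    # Theta(n log d) rate
```

With these numbers the naive secure scheme costs 6144 bits per user versus 10 bits without security, while the optimal secure rate sits in between at 320 bits.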
Poster
Jasper Tan · Daniel LeJeune · Blake Mason · Hamid Javadi · Richard Baraniuk

[ Auditorium 1 Foyer ]

Is overparameterization a privacy liability? In this work, we study the effect that the number of parameters has on a classifier's vulnerability to membership inference attacks. We first demonstrate how the number of parameters of a model can induce a privacy-utility trade-off: increasing the number of parameters generally improves generalization performance at the expense of lower privacy. However, remarkably, we then show that if coupled with proper regularization, increasing the number of parameters of a model can actually simultaneously increase both its privacy and performance, thereby eliminating the privacy-utility trade-off. Theoretically, we demonstrate this curious phenomenon for logistic regression with ridge regularization in a bi-level feature ensemble setting. Pursuant to our theoretical exploration, we develop a novel leave-one-out analysis tool to precisely characterize the vulnerability of a linear classifier to the optimal membership inference attack. We empirically exhibit this "blessing of dimensionality" for neural networks on a variety of tasks, using early stopping as the regularizer.
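A standard loss-threshold membership-inference baseline (not the paper's leave-one-out analysis) makes the underlying privacy risk tangible. Everything here is a hypothetical toy: an overparameterized, unregularized logistic regression memorizes pure-noise labels, so training points have conspicuously low loss and can be told apart from held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)
# More parameters than training points, with pure-noise labels, so the
# model can only fit the training set by memorizing it.
n_train, n_out, p = 40, 40, 100
X = rng.normal(size=(n_train + n_out, p))
y = rng.integers(0, 2, size=n_train + n_out).astype(float)

# Unregularized logistic regression fit by plain gradient descent.
w = np.zeros(p)
for _ in range(2_000):
    pr = 1 / (1 + np.exp(-np.clip(X[:n_train] @ w, -30, 30)))
    w -= 0.5 * X[:n_train].T @ (pr - y[:n_train]) / n_train

pr_all = 1 / (1 + np.exp(-np.clip(X @ w, -30, 30)))
losses = -(y * np.log(pr_all + 1e-12) + (1 - y) * np.log(1 - pr_all + 1e-12))
member_losses, nonmember_losses = losses[:n_train], losses[n_train:]

# Loss-threshold attack: declare "member" whenever the per-sample loss
# falls below a (hypothetically chosen) threshold tau.
tau = 0.1
attack_acc = ((member_losses < tau).mean() + (nonmember_losses >= tau).mean()) / 2
```

Adding regularization (e.g. ridge or early stopping) shrinks the member/non-member loss gap, which is the mechanism behind the trade-off discussed above.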

Poster
Truc Nguyen · Phung Lai · Khang Tran · Hai Phan · My T. Thai

[ Auditorium 1 Foyer ]

Federated learning (FL) was originally regarded as a framework for collaborative learning among clients with data privacy protection through a coordinating server. In this paper, we propose a new active membership inference (AMI) attack carried out by a dishonest server in FL. In AMI attacks, the server crafts and embeds malicious parameters into global models to effectively infer whether a target data sample is included in a client's private training data or not. By exploiting the correlation among data features through a non-linear decision boundary, AMI attacks with a certified guarantee of success can achieve strikingly high success rates even under rigorous local differential privacy (LDP) protection, thereby exposing clients' training data to significant privacy risk. Theoretical and experimental results on several benchmark datasets show that adding sufficient privacy-preserving noise to prevent our attack would significantly damage FL's model utility.

Poster
Meyer Scetbon · Elvis Dohmatob

[ Auditorium 1 Foyer ]

In this work we study the robustness to adversarial attacks of early-stopping strategies for gradient-descent (GD) methods in linear regression. More precisely, we show that early-stopped GD is optimally robust (up to an absolute constant) against Euclidean-norm adversarial attacks. However, we show that this strategy can be arbitrarily sub-optimal in the case of general Mahalanobis attacks. This observation is compatible with recent findings in the case of classification (Vardi et al., 2022) showing that GD provably converges to non-robust models. To alleviate this issue, we propose instead to apply a GD scheme on a transformation of the data adapted to the attack. This data transformation amounts to applying feature-dependent learning rates, and we show that this modified GD is able to handle any Mahalanobis attack, as well as more general attacks under some conditions. Unfortunately, choosing such adapted transformations can be hard for general attacks. Instead, we design a simple and tractable estimator whose adversarial risk is optimal up to a multiplicative constant of 1.1124 in the population regime, and which works for any norm.
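The intuition behind early stopping and Euclidean attacks can be sketched directly: for a linear predictor $w$, the worst-case squared loss when inputs are perturbed within $\|\delta\|_2 \le \varepsilon$ has the closed form $(|w^\top x - y| + \varepsilon\|w\|_2)^2$, so the adversarial risk trades the residual against the weight norm, and early-stopped GD (started at zero) keeps that norm small. The data, learning rate, and step counts below are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 2.0 * rng.normal(size=n)   # noisy linear data

def gd(steps, lr=0.1):
    # Plain GD on the least-squares objective, started from zero; the
    # weight norm grows monotonically toward the fully-fit solution.
    w = np.zeros(p)
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / n
    return w

w_early, w_full = gd(20), gd(20_000)

def adversarial_loss(w, eps):
    # Worst-case squared loss under L2 input perturbations of radius eps:
    # (|residual| + eps * ||w||_2)^2, averaged over the data.
    return ((np.abs(X @ w - y) + eps * np.linalg.norm(w)) ** 2).mean()
```

At $\varepsilon = 0$ the fully-trained model wins (smaller residuals), while for large enough $\varepsilon$ the smaller norm of the early-stopped iterate dominates; under a Mahalanobis attack the relevant norm of $w$ changes, which is why a fixed early-stopping rule can fail there.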

Poster
Kyriakos Lotidis · Nicholas Bambos · Jose Blanchet · Jiajin Li

[ Auditorium 1 Foyer ]

We focus on robust estimation of the unobserved state of a discrete-time stochastic system with linear dynamics. A standard analysis of this estimation problem assumes a baseline innovation model; with Gaussian innovations we recover the Kalman filter. However, in many settings, there is insufficient or corrupted data to validate the baseline model. To cope with this problem, we minimize the worst-case mean-squared estimation error of adversarial models chosen within a Wasserstein neighborhood around the baseline. We also constrain the adversarial innovations to form a martingale difference sequence. The martingale constraint relaxes the i.i.d. assumptions which are often imposed on the baseline model. Moreover, we show that the martingale constraints guarantee that the adversarial dynamics remain adapted to the natural time-generated information. Therefore, adding the martingale constraint allows us to improve upon overly conservative policies that also protect against unrealistic omniscient adversaries. We establish a strong duality result which we use to develop an efficient subgradient method to compute the distributionally robust estimation policy. If the baseline innovations are Gaussian, we show that the worst-case adversary remains Gaussian. Our numerical experiments indicate that the martingale constraint may also aid in adding a layer of robustness in the choice of the adversarial power.

Poster
Nabeel Seedat · Alan Jeffares · Fergus Imrie · Mihaela van der Schaar

[ Auditorium 1 Foyer ]

Conformal prediction is a powerful distribution-free tool for uncertainty quantification, establishing valid prediction intervals with finite-sample guarantees. To produce valid intervals which are also adaptive to the difficulty of each instance, a common approach is to compute normalized nonconformity scores on a separate calibration set. Self-supervised learning has been effectively utilized in many domains to learn general representations for downstream predictors. However, the use of self-supervision beyond model pretraining and representation learning has been largely unexplored. In this work, we investigate how self-supervised pretext tasks can improve the quality of the conformal regressors, specifically by improving the adaptability of conformal intervals. We train an auxiliary model with a self-supervised pretext task on top of an existing predictive model and use the self-supervised error as an additional feature to estimate nonconformity scores. We empirically demonstrate the benefit of the additional information using both synthetic and real data on the efficiency (width), deficit, and excess of conformal prediction intervals.
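The normalized split-conformal recipe described above fits in a few lines. The oracle predictor and difficulty model below are hypothetical stand-ins: in the paper, the difficulty model would additionally take the self-supervised error as an input feature, which is the novelty being proposed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-3, 3, size=n)
y = np.sin(x) + (0.1 + 0.4 * np.abs(x)) * rng.normal(size=n)  # heteroscedastic noise

# Oracle stand-ins: mu_hat plays the predictive model, sigma_hat the
# difficulty model (which the paper augments with a self-supervised
# error feature).
mu_hat = np.sin(x)
sigma_hat = 0.1 + 0.4 * np.abs(x)

# Split conformal with normalized nonconformity scores |y - mu| / sigma:
# one quantile of calibration scores scales every test interval.
cal, test = np.arange(n // 2), np.arange(n // 2, n)
alpha = 0.1
scores = np.abs(y[cal] - mu_hat[cal]) / sigma_hat[cal]
q = np.quantile(scores, np.ceil((1 - alpha) * (len(cal) + 1)) / len(cal))
lo, hi = mu_hat[test] - q * sigma_hat[test], mu_hat[test] + q * sigma_hat[test]
coverage = ((y[test] >= lo) & (y[test] <= hi)).mean()
width = hi - lo
```

Coverage lands near the nominal $1-\alpha$ level while interval widths adapt to the local difficulty; a better `sigma_hat` (e.g. one informed by self-supervision) tightens intervals on easy instances without sacrificing validity.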

Poster
Dennis Wei · Haoze Wu · Min Wu · Pin-Yu Chen · Clark Barrett · Eitan Farchi

[ Auditorium 1 Foyer ]

The softmax function is a ubiquitous component at the output of neural networks and increasingly in intermediate layers as well. This paper provides convex lower bounds and concave upper bounds on the softmax function, which are compatible with convex optimization formulations for characterizing neural networks and other ML models. We derive bounds using both a natural exponential-reciprocal decomposition of the softmax as well as an alternative decomposition in terms of the log-sum-exp function. The new bounds are provably and/or numerically tighter than linear bounds obtained in previous work on robustness verification of transformers. As illustrations of the utility of the bounds, we apply them to verification of transformers as well as of the robustness of predictive uncertainty estimates of deep ensembles.
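The two decompositions the bounds are built on can be verified numerically; the logit vector below is an arbitrary example.

```python
import numpy as np

z = np.array([1.0, -0.5, 2.0])
softmax = np.exp(z) / np.exp(z).sum()

# Exponential-reciprocal decomposition behind one family of bounds:
#   softmax_i(z) = 1 / (1 + sum_{j != i} exp(z_j - z_i))
i = 0
via_recip = 1.0 / (1.0 + np.exp(np.delete(z, i) - z[i]).sum())

# Log-sum-exp decomposition behind the alternative family:
#   softmax_i(z) = exp(z_i - logsumexp(z))
via_lse = np.exp(z[i] - np.log(np.exp(z).sum()))
```

Both forms express each softmax output through convex building blocks (exponentials, reciprocals, log-sum-exp), which is what makes convex lower and concave upper bounding tractable.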

Poster
Helen Zhou · Sivaraman Balakrishnan · Zachary Lipton

[ Auditorium 1 Foyer ]

Rates of missing data often depend on record-keeping policies and thus may change across times and locations, even when the underlying features are comparatively stable. In this paper, we introduce the problem of Domain Adaptation under Missingness Shift (DAMS). Here, (labeled) source data and (unlabeled) target data would be exchangeable but for different missing data mechanisms. We show that if missing data indicators are available, DAMS reduces to covariate shift. Addressing cases where such indicators are absent, we establish the following theoretical results for underreporting completely at random: (i) covariate shift is violated (adaptation is required); (ii) the optimal linear source predictor can perform arbitrarily worse on the target domain than always predicting the mean; (iii) the optimal target predictor can be identified, even when the missingness rates themselves are not; and (iv) for linear models, a simple analytic adjustment yields consistent estimates of the optimal target parameters. In experiments on synthetic and semi-synthetic data, we demonstrate the promise of our methods when assumptions hold. Finally, we discuss a rich family of future extensions.

Poster
Ibrahim Alabdulmohsin · Nicole Chiou · Alexander D'Amour · Arthur Gretton · Sanmi Koyejo · Matt Kusner · Stephen Pfohl · Olawale Salaudeen · Jessica Schrouff · Katherine Tsai

[ Auditorium 1 Foyer ]

We address the problem of unsupervised domain adaptation when the source domain differs from the target domain because of a shift in the distribution of a latent subgroup. When this subgroup confounds all observed data, neither covariate shift nor label shift assumptions apply. We show that the optimal target predictor can be non-parametrically identified with the help of concept and proxy variables available only in the source domain, and unlabeled data from the target. The identification results are constructive, immediately suggesting an algorithm for estimating the optimal predictor in the target. For continuous observations, when this algorithm becomes impractical, we propose a latent variable model specific to the data generation process at hand. We show how the approach degrades as the size of the shift changes, and verify that it outperforms both covariate and label shift adjustment.

Poster
Haotian Ye · James Zou · Linjun Zhang

[ Auditorium 1 Foyer ]

The existence of spurious correlations such as image backgrounds in the training environment can make empirical risk minimization (ERM) perform badly in the test environment. To address this problem, Kirichenko et al. (2022) empirically found that the core features that are related to the outcome can still be learned well even in the presence of spurious correlations. This suggests a promising strategy: first train a feature learner rather than a classifier, and then perform linear probing (last layer retraining) in the test environment. However, a theoretical understanding of when and why this approach works is lacking. In this paper, we find that core features are only learned well when their associated non-realizable noise is smaller than that of spurious features, which is not necessarily true in practice. We provide both theory and experiments to support this finding and to illustrate the importance of non-realizable noise. Moreover, we propose an algorithm called Freeze then Train (FTT), that first freezes certain salient features and then trains the rest of the features using ERM. We theoretically show that FTT preserves features that are more beneficial to test time probing. Across two commonly used spurious correlation datasets, FTT outperforms ERM, IRM, JTT and …

Poster
Ahmed Alaa · Zeshan Hussain · David Sontag

[ Auditorium 1 Foyer ]

We develop a predictive inference procedure that combines conformal prediction (CP) with unconditional quantile regression (QR)—a commonly used tool in econometrics [1] that involves regressing the re-centered influence function (RIF) of the quantile functional over input covariates. Unlike the more widely-known conditional QR, unconditional QR explicitly captures the impact of changes in covariate distribution on the quantiles of the marginal distribution of outcomes. Leveraging this property, our procedure issues adaptive predictive intervals with localized frequentist coverage guarantees. It operates by fitting a machine learning model for the RIFs using training data, and then applying the CP procedure for any test covariate with respect to a “hypothetical” covariate distribution localized around the new instance. Experiments show that our procedure is adaptive to heteroscedasticity, provides transparent coverage guarantees that are relevant to the test instance at hand, and performs competitively with existing methods in terms of efficiency.
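The recentered influence function at the heart of unconditional QR has a simple closed form (Firpo, Fortin, and Lemieux): $\mathrm{RIF}(y; q_\tau) = q_\tau + (\tau - \mathbf{1}\{y \le q_\tau\})/f(q_\tau)$, whose mean recovers the quantile itself. A quick numerical check, using a crude finite-difference density estimate in place of the kernel density estimator one would normally use:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=100_000)
tau = 0.75
q = np.quantile(y, tau)

# Density at the quantile via a finite-difference estimate
# (a kernel density estimator would be the usual choice).
h = 0.05
f_q = ((y > q - h) & (y < q + h)).mean() / (2 * h)

# Recentered influence function of the tau-quantile:
#   RIF(y) = q + (tau - 1{y <= q}) / f(q)
# Regressing this quantity on covariates is what unconditional QR does;
# its sample mean recovers the quantile itself.
rif = q + (tau - (y <= q)) / f_q
```

Fitting a machine learning model to these RIF values over covariates, and then conformalizing it, is the structure of the procedure described above.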

Poster
Thomas Kleine Buening · Christos Dimitrakakis · Hannes Eriksson · Divya Grover · Emilio Jorge

[ Auditorium 1 Foyer ]

While the Bayesian decision-theoretic framework offers an elegant solution to the problem of decision making under uncertainty, one question is how to appropriately select the prior distribution. One idea is to employ a worst-case prior. However, this is not as easy to specify in sequential decision making as in simple statistical estimation problems. This paper studies (sometimes approximate) minimax-Bayes solutions for various reinforcement learning problems to gain insights into the properties of the corresponding priors and policies. We find that while the worst-case prior depends on the setting, the corresponding minimax policies are more robust than those that assume a standard (i.e. uniform) prior.

Poster
Haoran Sun · Hanjun Dai · Bo Dai · Haomin Zhou · Dale Schuurmans

[ Auditorium 1 Foyer ]

Langevin dynamics, the particle version of the gradient flow that minimizes the KL divergence on the Wasserstein manifold, induces gradient-based Markov chain Monte Carlo (MCMC) samplers such as Langevin Monte Carlo (LMC) in continuous space. The superior efficiency of gradient-based MCMC samplers has stimulated recent attempts to generalize LMC to discrete space. However, a principled extension of Langevin dynamics to discrete space has been missing due to the lack of well-defined gradients. In this work, we generalize the Wasserstein gradient flow to discrete spaces and derive the corresponding discrete counterpart of Langevin dynamics. With this new understanding, recent "gradient"-based samplers in discrete space can be obtained by choosing proper discretizations. This new framework also enables us to derive a new algorithm, Discrete Langevin Monte Carlo (DLMC), by simulating the Wasserstein gradient flow with respect to simulation time. As a result, DLMC has a convenient parallel implementation and location-dependent velocities that allow a larger average jump distance. We demonstrate the advantages of DLMC for sampling and learning in various binary and categorical distributions.
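A minimal example of the family of "gradient"-informed discrete samplers this framework unifies (a locally balanced single-flip proposal with a Metropolis-Hastings correction, in the style of Zanella-type samplers; this is an illustrative sketch, not DLMC itself). The target is a hypothetical independent-bit distribution $p(x) \propto \exp(\theta^\top x)$ on $\{0,1\}^3$, whose marginals are known in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([1.0, -1.0, 0.5])
f = lambda x: float(theta @ x)   # log p(x) up to a constant, x in {0,1}^3

def neighbors(x):
    # All single-bit flips of x.
    return [np.where(np.arange(len(x)) == i, 1 - x, x) for i in range(len(x))]

def weights(x):
    # Locally balanced proposal: q(x'|x) proportional to exp((f(x')-f(x))/2),
    # i.e. flips that raise the log-probability are proposed more often.
    return np.array([np.exp((f(nb) - f(x)) / 2) for nb in neighbors(x)])

x = np.zeros(3, dtype=int)
samples = []
for _ in range(20_000):
    w = weights(x)
    i = rng.choice(len(w), p=w / w.sum())
    x_prop = neighbors(x)[i]
    w_prop = weights(x_prop)
    # Metropolis-Hastings correction keeps the chain exactly invariant.
    accept = np.exp(f(x_prop) - f(x)) * (w_prop[i] / w_prop.sum()) / (w[i] / w.sum())
    if rng.random() < accept:
        x = x_prop
    samples.append(x.copy())

marginals = np.mean(samples[5_000:], axis=0)   # should approach sigmoid(theta)
```

The empirical marginals converge to $\sigma(\theta_i)$, confirming the sampler targets the right distribution; DLMC's contribution is deriving such dynamics from a discrete Wasserstein gradient flow, with parallel updates and location-dependent jump sizes.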