Skip to yearly menu bar Skip to main content


Timezone: Europe/Madrid

Registration Desk: Registration + Assistance Thu 27 Apr 07:00 a.m.  


Mentorship: Mentoring: Tips for Scientific Writing Thu 27 Apr 08:00 a.m.  

Marc Deisenroth

Invited Talk: Tamara Broderick

An Automatic Finite-Sample Robustness Check: Can Dropping a Little Data Change Conclusions?

Practitioners will often analyze a data sample with the goal of applying any conclusions to a new population. For instance, if economists conclude microcredit is effective at alleviating poverty based on observed data, policymakers might decide to distribute microcredit in other locations or future years. Typically, the original data is not a perfect random sample from the population where policy is applied -- but researchers might feel comfortable generalizing anyway so long as deviations from random sampling are small, and the corresponding impact on conclusions is small as well. Conversely, researchers might worry if a very small proportion of the data sample was instrumental to the original conclusion. So we propose a method to assess the sensitivity of statistical conclusions to the removal of a very small fraction of the data set. Manually checking all small data subsets is computationally infeasible, so we propose an approximation based on the classical influence function. Our method is automatically computable for common estimators. We provide finite-sample error bounds on approximation performance and a low-cost exact lower bound on sensitivity. We find that sensitivity is driven by a signal-to-noise ratio in the inference problem, does not disappear asymptotically, and is not decided by misspecification. Empirically we find that many data analyses are robust, but the conclusions of several influential economics papers can be changed by removing (much) less than 1% of the data.

Tamara Broderick is an Associate Professor in the Department of Electrical Engineering and Computer Science at MIT. She is a member of the MIT Laboratory for Information and Decision Systems (LIDS), the MIT Statistics and Data Science Center, and the Institute for Data, Systems, and Society (IDSS). She completed her Ph.D. in Statistics at the University of California, Berkeley in 2014. Previously, she received an AB in Mathematics from Princeton University (2007), a Master of Advanced Study for completion of Part III of the Mathematical Tripos from the University of Cambridge (2008), an MPhil by research in Physics from the University of Cambridge (2009), and an MS in Computer Science from the University of California, Berkeley (2013). Her recent research has focused on developing and analyzing models for scalable Bayesian machine learning. She has been awarded selection to the COPSS Leadership Academy (2021), an Early Career Grant (ECG) from the Office of Naval Research (2020), an AISTATS Notable Paper Award (2019), an NSF CAREER Award (2018), a Sloan Research Fellowship (2018), an Army Research Office Young Investigator Program (YIP) award (2017), Google Faculty Research Awards, an Amazon Research Award, the ISBA Lifetime Members Junior Researcher Award, the Savage Award (for an outstanding doctoral dissertation in Bayesian theory and methods), the Evelyn Fix Memorial Medal and Citation (for the Ph.D. student on the Berkeley campus showing the greatest promise in statistical research), the Berkeley Fellowship, an NSF Graduate Research Fellowship, a Marshall Scholarship, and the Phi Beta Kappa Prize (for the graduating Princeton senior with the highest academic average).



Oral: Supervised Learning Thu 27 Apr 10:30 a.m.  

Oral
Neil Jethani · Adriel Saporta · Rajesh Ranganath

[ Auditorium 1 ]

Feature attribution methods identify which features of an input most influence a model's output. Most widely-used feature attribution methods (such as SHAP, LIME, and Grad-CAM) are "class-dependent" methods in that they generate a feature attribution vector as a function of class. In this work, we demonstrate that class-dependent methods can "leak" information about the selected class, making that class appear more likely than it is. Thus, an end user runs the risk of drawing false conclusions when interpreting an explanation generated by a class-dependent method. In contrast, we introduce "distribution-aware" methods, which favor explanations that keep the label's distribution close to its distribution given all features of the input. We introduce SHAP-KL and FastSHAP-KL, two baseline distribution-aware methods that compute Shapley values. Finally, we perform a comprehensive evaluation of seven class-dependent and three distribution-aware methods on three clinical datasets of different high-dimensional data types: images, biosignals, and text.

Oral
Zhe Huang · Mary-Joy Sidhom · Benjamin Wessler · Michael C. Hughes

[ Auditorium 1 ]

Semi-supervised learning (SSL) promises improved accuracy compared to training classifiers on small labeled datasets by also training on many unlabeled images. In real applications like medical imaging, unlabeled data will be collected for expediency and thus uncurated: possibly different from the labeled set in classes or features. Unfortunately, modern deep SSL often makes accuracy worse when given uncurated unlabeled data. Recent complex remedies try to detect out-of-distribution unlabeled images and then discard or downweight them. Instead, we introduce Fix-A-Step, a simpler procedure that views all uncurated unlabeled images as potentially helpful. Our first insight is that even uncurated images can yield useful augmentations of labeled data. Second, we modify gradient descent updates to prevent optimizing a multi-task SSL loss from hurting labeled-set accuracy. Fix-A-Step can "repair" many common deep SSL methods, improving accuracy on CIFAR benchmarks across all tested methods and levels of artificial class mismatch. On a new medical SSL benchmark called Heart2Heart, Fix-A-Step can learn from 353,500 truly uncurated ultrasound images to deliver gains that generalize across hospitals.

Oral
Yulai Zhao · Jianshu Chen · Simon Du

[ Auditorium 1 ]

This paper presents a new statistical analysis aiming to explain the recent superior achievements of the pre-training techniques in natural language processing (NLP).We prove that when the classes of the pre-training task (e.g., different words in the masked language model task) are sufficiently diverse, in the sense that the least singular value of the last linear layer in pre-training (denoted as $\tilde{\nu}$) is large, then pre-training can significantly improve the sample efficiency of downstream tasks.Specially, we show the transfer learning excess risk enjoys an $O\left(\frac{1}{\tilde{\nu} \sqrt{n}}\right)$ rate, in contrast to the $O\left(\frac{1}{\sqrt{m}}\right)$ rate in the standard supervised learning. Here, $n$ is the number of pre-training data and $m$ is the number of data in the downstream task, and typically $n \gg m$.Our proof relies on a vector-form Rademacher complexity chain rule for disassembling composite function classes and a modified self-concordance condition.These techniques can be of independent interest.
Oral
Ellango Jothimurugesan · Kevin Hsieh · Jianyu Wang · Gauri Joshi · Phillip B Gibbons

[ Auditorium 1 ]

Federated Learning (FL) under distributed concept drift is a largely unexplored area. Although concept drift is itself a well-studied phenomenon, it poses particular challenges for FL, because drifts arise staggered in time and space (across clients). Our work is the first to explicitly study data heterogeneity in both dimensions. We first demonstrate that prior solutions to drift adaptation, with their single global model, are ill-suited to staggered drifts, necessitating multiple-model solutions. We identify the problem of drift adaptation as a time-varying clustering problem, and we propose two new clustering algorithms for reacting to drifts based on local drift detection and hierarchical clustering. Empirical evaluation shows that our solutions achieve significantly higher accuracy than existing baselines, and are comparable to an idealized algorithm with oracle knowledge of the ground-truth clustering of clients to concepts at each time step.


Oral: Statistical Methods 2 Thu 27 Apr 11:30 a.m.  

Oral
Victoria Crawford

[ Auditorium 1 ]

In this paper, we consider the optimization problem \scpl (\scp), which is to find a minimum cost subset of a ground set $U$ such that the value of a submodular function $f$ is above a threshold $\tau$. In contrast to most existing work on \scp, it is not assumed that $f$ is monotone. Two bicriteria approximation algorithms are presented for \scp that, for input parameter $0 < \epsilon < 1$, give $O( 1 / \epsilon^2 )$ ratio to the optimal cost and ensures the function $f$ is at least $\tau(1 - \epsilon)/2$. A lower bound shows that under the value query model shows that no polynomial-time algorithm can ensure that $f$ is larger than $\tau/2$. Further, the algorithms presented are scalable to large data sets, processing the ground set in a stream. Similar algorithms developed for \scp also work for the related optimization problem of \smpl (\smp). Finally, the algorithms are demonstrated to be effective in experiments involving graph cut and data summarization functions.
Oral
Ziye Ma · Somayeh Sojoudi

[ Auditorium 1 ]

This paper is concerned with low-rank matrix optimization, which has found a wide range of applications in machine learning. This problem in the special case of matrix sensing has been studied extensively through the notion of Restricted Isometry Property (RIP), leading to a wealth of results on the geometric landscape of the problem and the convergence rate of common algorithms. However, the existing results can handle the problem in the case with a general objective function subject to noisy data only when the RIP constant is close to 0. In this paper, we develop a new mathematical framework to solve the above-mentioned problem with a far less restrictive RIP constant. We prove that as long as the RIP constant of the noiseless objective is less than 1/3, any spurious local solution of the noisy optimization problem must be close to the ground truth solution. By working through the strict saddle property, we also show that an approximate solution can be found in polynomial time. We characterize the geometry of the spurious local minima of the problem in a local region around the ground truth in the case when the RIP constant is greater than 1/3. Compared to the existing results …

Oral
Giovanni Luca Marchetti · Vladislav Polianskii · Anastasiia Varava · Florian T. Pokorny · Danica Kragic

[ Auditorium 1 ]

We introduce a non-parametric density estimator deemed Radial Voronoi Density Estimator (RVDE). RVDE is grounded in the geometry of Voronoi tessellations and as such benefits from local geometric adaptiveness and broad convergence properties. Due to its radial definition RVDE is continuous and computable in linear time with respect to the dataset size. This amends for the main shortcomings of previously studied VDEs, which are highly discontinuous and computationally expensive. We provide a theoretical study of the modes of RVDE as well as an empirical investigation of its performance on high-dimensional data. Results show that RVDE outperforms other non-parametric density estimators, including recently introduced VDEs.

Oral
Garud Iyengar · Henry Lam · Tianyu Wang

[ Auditorium 1 ]

Empirical risk minimization (ERM) and distributionally robust optimization (DRO) are popular approaches for solving stochastic optimization problems that appear in operations management and machine learning. Existing generalization error bounds for these methods depend on either the complexity of the cost function or dimension of the uncertain parameters; consequently, the performance of these methods is poor for high-dimensional problems with objective functions under high complexity. We propose a simple approach in which the distribution of uncertain parameters is approximated using a parametric family of distributions. This mitigates both sources of complexity; however, it introduces a model misspecification error. We show that this new source of error can be controlled by suitable DRO formulations. Our proposed parametric DRO approach has significantly improved generalization bounds over existing ERM / DRO methods and parametric ERM for a wide variety of settings. Our method is particularly effective under distribution shifts. We also illustrate the superior performance of our approach on both synthetic and real-data portfolio optimization and regression tasks.


Poster Session 3 Thu 27 Apr 02:00 p.m.  

Poster
Yucheng Wang · Mingyuan Zhou · Yu Sun · Xiaoning Qian

[ Auditorium 1 Foyer ]

Learning to hash has become popular for video retrieval due to its fast speed and low storage consumption. Previous efforts formulate video hashing as training a binary auto-encoder, for which noncontinuous latent representations are optimized by the biased straight-through~(ST) back-propagation heuristic. We propose to formulate video hashing as learning a discrete variational auto-encoder with the factorized Bernoulli latent distribution, termed as Bernoulli variational auto-encoder~(BerVAE). The corresponding evidence lower bound~(ELBO) in our BerVAE implementation leads to closed-form gradient expression, which can be applied to achieve principled training along with some other unbiased gradient estimators. BerVAE enables uncertainty-aware video hashing by predicting the probability distribution of video hash code-words, thus providing reliable uncertainty quantification. Experiments on both simulated and real-world large-scale video data demonstrate that our BerVAE trained with unbiased gradient estimators can achieve the state-of-the-art retrieval performance. Furthermore, we show that quantified uncertainty is highly correlated to video retrieval performance, which can be leveraged to further improve the retrieval accuracy. Our code is available at https://github.com/wangyucheng1234/BerVAE

Poster
Mingshan Sun · Ye Zheng · Tianpeng Bao · Jianqiu Chen · Guoqiang Jin · Liwei Wu · Rui Zhao · Xiaoke Jiang

[ Auditorium 1 Foyer ]

Uni6D is the first 6D pose estimation approach to employ a unified backbone network to extract features from both RGB and depth images. We discover that the principal reasons of Uni6D performance limitations are Instance-Outside and Instance-Inside noise. Uni6D's simple pipeline design inherently introduces Instance-Outside noise from background pixels in the receptive field, while ignoring Instance-Inside noise in the input depth data. In this paper, we propose a two-step denoising approach for dealing with the aforementioned noise in Uni6D. To reduce noise from non-instance regions, an instance segmentation network is utilized in the first step to crop and mask the instance. A lightweight depth denoising module is proposed in the second step to calibrate the depth feature before feeding it into the pose regression network. Extensive experiments show that our Uni6Dv2 reliably and robustly eliminates noise, outperforming Uni6D without sacrificing too much inference efficiency. It also reduces the need for annotated real data that requires costly labeling.

Poster
Yiyan Huang · Cheuk Hang Leung · Shumin Ma · Zhiri Yuan · Qi Wu · SIYI WANG · Dongdong Wang · Zhixiang Huang

[ Auditorium 1 Foyer ]

Credit policy evaluation presents profitable opportunities for E-commerce platforms through improved decision-making. The core of policy evaluation is estimating the causal effects of the policy on the target outcome. However, selection bias presents a key challenge in estimating causal effects from real-world data. Some recent causal inference methods attempt to mitigate selection bias by leveraging covariate balancing in the representation space to obtain the domain-invariant features. However, it is noticeable that balanced representation learning can be accompanied by a failure of domain discrimination, resulting in the loss of domain-related information. This is referred to as the over-balancing issue. In this paper, we introduce a novel objective for representation balancing methods to do policy evaluation. In particular, we construct a doubly robust loss based on the predictions of treatment and outcomes, serving as a prerequisite for covariate balancing to deal with the over-balancing issue. In addition, we investigate how to improve treatment effect estimations by exploiting the unconfoundedness assumption. The extensive experimental results on benchmark datasets and a newly introduced credit dataset show a general outperformance of our method compared with existing methods.

Poster
Stefan Hegselmann · Alejandro Buendia · Hunter Lang · Monica Agrawal · Xiaoyi Jiang · David Sontag

[ Auditorium 1 Foyer ]

We study the application of large language models to zero-shot and few-shot classification of tabular data. We prompt the large language model with a serialization of the tabular data to a natural-language string, together with a short description of the classification problem. In the few-shot setting, we fine-tune the large language model using some labeled examples. We evaluate several serialization methods including templates, table-to-text models, and large language models. Despite its simplicity, we find that this technique outperforms prior deep-learning-based tabular classification methods on several benchmark datasets. In most cases, even zero-shot classification obtains non-trivial performance, illustrating the method’s ability to exploit prior knowledge encoded in large language models. Unlike many deep learning methods for tabular datasets, this approach is also competitive with strong traditional baselines like gradient-boosted trees, especially in the very-few-shot setting.

Poster
Sayak Chatterjee · Dibyendu Saha · Soham Dan · Bhaswar B. Bhattacharya

[ Auditorium 1 Foyer ]

In this paper, we study the two-sample problem for inhomogeneous Erd\H{o}s-R\'enyi (IER), random graph models, in the $L_r$ norm, in the high-dimensional regime where the number of samples is smaller or comparable to the size of the graphs. Given two symmetric matrices $P, Q \in [0, 1]^{n \times n}$ (with zeros on the diagonals), the two-sample problem for IER graphs (with respect to the $L_r$ norm $||\cdot||_r$) is to test the hypothesis $H_0: P=Q$ versus $H_1: ||P-Q||_r \geq \varepsilon$, given a sample of $m$ graphs from the respective distributions. In this paper, we obtain the optimal sample complexity for testing in the $L_r$-norm, for all integers $r \geq 1$. We also derive the asymptotic distribution of the optimal tests under $H_0$ and develop a method for consistently estimating their variances, for all integers $r \geq 1$. This allows us to efficiently implement the optimal tests with precise asymptotic levels and establish their asymptotic consistency. We validate our theoretical results by numerical experiments for various natural IER models.
Poster
Mojmir Mutny · Tadeusz Janik · Andreas Krause

[ Auditorium 1 Foyer ]

A key challenge in science and engineering is to design experiments to learn about some unknown quantity of interest. Classical experimental design optimally allocates the experimental budget into measurements to maximize a notion of utility (e.g., reduction in uncertainty about the unknown quantity). We consider a rich setting, where the experiments are associated with states in a {\em Markov chain}, and we can only choose them by selecting a {\em policy} controlling the state transitions. This problem captures important applications, from exploration in reinforcement learning to spatial monitoring tasks. We propose an algorithm -- \textsc{markov-design} -- that efficiently selects policies whose measurement allocation \emph{provably converges to the optimal one}. The algorithm is sequential in nature, adapting its choice of policies (experiments) using past measurements. In addition to our theoretical analysis, we demonstrate our framework on applications in ecological surveillance and pharmacology.

Poster
Yuhang He · Andrew Markham

[ Auditorium 1 Foyer ]

We propose synperiodic filter banks, a novel multi-scale learnable filter bank construction strategy that all filters are synchronized by their rotating periodicity. By synchronizing in a certain periodicity, we naturally get filters whose temporal length are reduced if they carry higher frequency response, and vice versa. Such filters internally maintain a better time-frequency resolution trade-off. By further alternating the periodicity, we can easily obtain a group of synperiodic filter bank (we call synperiodic filter banks), where filters of same frequency response in different groups differ in temporal length. Convolving these filter banks with sound raw waveform achieves multi-scale perception in time domain. Moreover, applying the same filter banks to recursively process the 2x-downsampled waveform enables multi-scale perception in the frequency domain. Benefiting from the multi-scale perception in both time and frequency domains, our proposed synperiodic filter banks learn multi-scale time-frequency representation in a data-driven way. Experiments on both sound source direction of arrival (DoA) and physical location detection task show the superiority of synperiodic filter banks.

Poster
Shiv Shankar · Ritwik Sinha · Saayan Mitra · Moumita Sinha · Madalina Fiterau

[ Auditorium 1 Foyer ]

Brands use cookies and device identifiers to link different web visits to the same consumer. However, with increasing demands for privacy, these identifiers are about to be phased out, making identity fragmentation a permanent feature of the online world. Assessing treatment effects via randomized experiments (A/B testing) in such a scenario is challenging because identity fragmentation causes a) users to receive hybrid/mixed treatments, and b) hides the causal link between the historical treatments and the outcome. In this work, we address the problem of estimating treatment effects when a lack of identification leads to incomplete knowledge of historical treatments. This is a challenging problem which has not been addressed in literature yet. We develop a new method called DIET, which can adjust for users being exposed to mixed treatments without the entire history of treatments being available. Our method takes inspiration from the Cox model, and uses a proportional outcome approach under which we prove that one can obtain consistent estimates of treatment effects even under identity fragmentation. Our experiments, on one simulated and two real datasets, show that our method leads to up to 20% reduction in error and 25% reduction in bias over the naive estimate.

Poster
Ning Liu · Yue Yu · Huaiqian You · Neeraj Tatikola

[ Auditorium 1 Foyer ]

Neural operators, which emerge as implicit solution operators of hidden governing equations, have recently become popular tools for learning responses of complex real-world physical systems. Nevertheless, the majority of neural operator applications has thus far been data-driven, which neglects the intrinsic preservation of fundamental physical laws in data. In this paper, we introduce a novel integral neural operator architecture, to learn physical models with fundamental conservation laws automatically guaranteed. In particular, by replacing the frame-dependent position information with its invariant counterpart in the kernel space, the proposed neural operator is designed to be translation- and rotation-invariant, and consequently abides by the conservation laws of linear and angular momentums. As applications, we demonstrate the expressivity and efficacy of our model in learning complex material behaviors from both synthetic and experimental datasets, and show that, by automatically satisfying these essential physical laws, our learned neural operator is not only generalizable in handling translated and rotated datasets, but also achieves improved accuracy and efficiency from the baseline neural operator models.

Poster
Kaiqi Zhao · Animesh Jain · Ming Zhao

[ Auditorium 1 Foyer ]

Pruning is a promising approach to compress deep learning models in order to deploy them on resource-constrained edge devices. However, many existing pruning solutions are based on unstructured pruning, which yields models that cannot efficiently run on commodity hardware; and they often require users to manually explore and tune the pruning process, which is time-consuming and often leads to sub-optimal results. To address these limitations, this paper presents Automatic Attention Pruning (AAP), an adaptive, attention-based, structured pruning approach to automatically generate small, accurate, and hardware-efficient models that meet user objectives. First, it proposes iterative structured pruning using activation-based attention maps to effectively identify and prune unimportant filters. Then, it proposes adaptive pruning policies for automatically meeting the pruning objectives of accuracy-critical, memory-constrained, and latency-sensitive tasks. A comprehensive evaluation shows that AAP substantially outperforms the state-of-the-art structured pruning works for a variety of model architectures. Our code is at: https://github.com/kaiqi123/Automatic-Attention-Pruning.git.

Poster
Alexander Norcliffe · Bogdan Cebere · Fergus Imrie · Pietro Lió · Mihaela van der Schaar

[ Auditorium 1 Foyer ]

Synthetic data is becoming an increasingly promising technology, and successful applications can improve privacy, fairness, and data democratization. While there are many methods for generating synthetic tabular data, the task remains non-trivial and unexplored for specific scenarios. One such scenario is survival data. Here, the key difficulty is censoring: for some instances, we are not aware of the time of event, or if one even occurred. Imbalances in censoring and time horizons cause generative models to experience three new failure modes specific to survival analysis: (1) generating too few at-risk members; (2) generating too many at-risk members; and (3) censoring too early. We formalize these failure modes and provide three new generative metrics to quantify them. Following this, we propose SurvivalGAN, a generative model that handles survival data firstly by addressing the imbalance in the censoring and event horizons, and secondly by using a dedicated mechanism for approximating time-to-event/censoring. We evaluate this method via extensive experiments on medical datasets. SurvivalGAN outperforms multiple baselines at generating survival data, and in particular addresses the failure modes as measured by the new metrics, in addition to improving downstream performance of survival models trained on the synthetic data.

Poster
Daesoo Lee · Sara Malacarne · Erlend Aune

[ Auditorium 1 Foyer ]

Time series generation (TSG) studies have mainly focused on the use of Generative Adversarial Networks (GANs) combined with recurrent neural network (RNN) variants. However, the fundamental limitations and challenges of training GANs still remain. In addition, the RNN-family typically has difficulties with temporal consistency between distant timesteps. Motivated by the successes in the image generation (IMG) domain, we propose TimeVQVAE, the first work, to our knowledge, that uses vector quantization (VQ) techniques to address the TSG problem. Moreover, the priors of the discrete latent spaces are learned with bidirectional transformer models that can better capture global temporal consistency. We also propose VQ modeling in a time-frequency domain, separated into low-frequency (LF) and high-frequency (HF). This allows us to retain important characteristics of the time series and, in turn, generate new synthetic signals that are of better quality, with sharper changes in modularity, than its competing TSG methods. Our experimental evaluation is conducted on all datasets from the UCR archive, using well-established metrics in the IMG literature, such as Fréchet inception distance and inception scores. Our implementation on GitHub: \url{https://github.com/ML4ITS/TimeVQVAE}.

Poster
Asif Khan · Amos Storkey

[ Auditorium 1 Foyer ]

In an unsupervised attack on variational autoencoders (VAEs) an adversary finds a small perturbation in an input sample that significantly changes its latent space encoding, thereby compromising the reconstruction for a fixed decoder. A known reason for such vulnerability is the distortions in the latent space resulting from a mismatch between approximated latent posterior and a prior distribution. Consequently, a slight change in an input sample can move its encoding to a low/zero density region in the latent space resulting in an unconstrained generation. This paper demonstrates that an optimal way for an adversary to attack VAEs is to exploit a directional bias of a stochastic pullback metric tensor induced by the encoder and decoder networks. The pullback metric tensor of an encoder measures the change in infinitesimal latent volume from an input to a latent space. Thus, it can be viewed as a lens to analyse the effect of input perturbations leading to latent space distortions. We propose robustness evaluation scores using the eigenspectrum of a pullback metric tensor. Moreover, we empirically show that the scores correlate with the robustness parameter $\beta$ of the $\beta-$VAE. Since increasing $\beta$ also compromises reconstruction quality, we demonstrate a simple strategy using \textit{mixup} …
Poster
Yinglong Guo · Dongmian Zou · Gilad Lerman

[ Auditorium 1 Foyer ]

We propose a novel and trainable graph unpooling layer for effective graph generation. The unpooling layer receives an input graph with features and outputs an enlarged graph with desired structure and features. We prove that the output graph of the unpooling layer remains connected and for any connected graph there exists a series of unpooling layers that can produce it from a 3-node graph. We apply the unpooling layer within the generator of a generative adversarial network as well as the decoder of a variational autoencoder. We give extensive experimental evidence demonstrating the competitive performance of our proposed method on synthetic and real data.

Poster
Giannis Nikolentzos · Michalis Vazirgiannis

[ Auditorium 1 Foyer ]

Graph neural networks have recently attracted a lot of attention and have been applied with great success to several important graph problems. The Random Walk Graph Neural Network model was recently proposed as a more intuitive alternative to the well-studied family of message passing neural networks. This model compares each input graph against a set of latent hidden graphs'' using a kernel that counts common random walks up to some length. In this paper, we propose a new architecture, called Geometric Random Walk Graph Neural Network (GRWNN), that generalizes the above model such that it can count common walks of infinite length in two graphs. The proposed model retains the transparency of Random Walk Graph Neural Networks since its first layer also consists of a number of trainablehidden graphs'' which are compared against the input graphs using the geometric random walk kernel. To compute the kernel, we employ a fixed-point iteration approach involving implicitly defined operations. Then, we capitalize on implicit differentiation to derive an efficient training scheme which requires only constant memory, regardless of the number of fixed-point iterations. The employed random walk kernel is differentiable, and therefore, the proposed model is end-to-end trainable. Experiments on standard graph …

Poster
Amur Ghose · Yingxue Zhang · Jianye Hao · Mark Coates

[ Auditorium 1 Foyer ]

Contrastive learning has emerged as a premier method for learning representations with or without supervision. Recent studies have shown its utility in graph representation learning for pre-training. Despite successes, the understanding of how to design effective graph augmentations that can capture structural properties common to many different types of downstream graphs remains incomplete. We propose a set of well-motivated graph transformation operations derived via graph spectral analysis to provide a bank of candidates when constructing augmentations for a graph contrastive objective, enabling contrastive learning to capture useful structural representation from pre-training graph datasets. We first present a spectral graph cropping augmentation that involves filtering nodes by applying thresholds to the eigenvalues of the leading Laplacian eigenvectors. Our second novel augmentation reorders the graph frequency components in a structural Laplacian-derived position graph embedding. Further, we introduce a method that leads to improved views of local subgraphs by performing alignment via global random walk embeddings. Our experimental results indicate consistent improvements in out-of-domain graph data transfer compared to state-of-the-art graph contrastive learning methods, shedding light on how to design a graph learner that is able to learn structural properties common to diverse graph types.

Poster
Leonardo Cella · · Grégoire Pacreau · Massimiliano Pontil

[ Auditorium 1 Foyer ]

We study the problem of transfer-learning in the setting of stochastic linear contextual bandit tasks. We consider that a low dimensional linear representation is shared across the tasks, and study the benefit of learning the tasks jointly. Following recent results to design Lasso stochastic bandit policies, we propose an efficient greedy policy based on trace norm regularization. It implicitly learns a low dimensional representation by encouraging the matrix formed by the task regression vectors to be of low rank. Unlike previous work in the literature, our policy does not need to know the rank of the underlying matrix, nor {does} it requires the covariance of the arms distribution to be invertible. We derive an upper bound on the multi-task regret of our policy, which is, up to logarithmic factors, of order $T\sqrt{rN}+\sqrt{rNTd}$, where $T$ is the number of tasks, $r$ the rank, $d$ the number of variables and $N$ the number of rounds per task. We show the benefit of our strategy over an independent task learning baseline, which has a worse regret of order $T\sqrt{dN}$. We also argue that our policy {is minimax optimal} and, when $T\geq d$, has a multi-task regret which is comparable to the regret of …
Poster
Atsutoshi Kumagai · Tomoharu Iwata · Hiroshi Takahashi · Yasuhiro Fujiwara

[ Auditorium 1 Foyer ]

We propose a meta-learning method to improve the anomaly detection performance on unseen target tasks that have only unlabeled data. Existing meta-learning methods for anomaly detection have shown remarkable performance but require labeled data in target tasks. Although they can treat unlabeled data as normal assuming anomalies in the unlabeled data are negligible, this assumption is often violated in practice. As a result, the methods have low performance. Our method meta-learns with related tasks that have labeled and unlabeled data such that the expected test anomaly detection performance is directly improved when the anomaly detector is adapted to given unlabeled data. Our method is based on autoencoders (AEs), which are widely used neural network-based anomaly detectors. We model anomalous attributes for each unlabeled instance in the reconstruction loss of the AE, which are used to prevent the anomalies from being reconstructed; they can remove the effect of the anomalies. We formulate adaptation to the unlabeled data as a learning problem of the last layer of the AE and the anomalous attributes. This formulation enables the optimum solution to be obtained with a closed-form alternate update formula, which is preferable to efficiently maximize the expected test anomaly detection performance. The effectiveness …

Poster
Junjie Yang · Tianlong Chen · Mingkang Zhu · Fengxiang He · Dacheng Tao · Yingbin Liang · Zhangyang Wang

[ Auditorium 1 Foyer ]

Learning to optimize (L2O) has gained increasing popularity, which automates the design of optimizers by data-driven approaches. However, current L2O methods often suffer from poor generalization performance in at least two folds: (i) applying the L2O-learned optimizer to unseen optimizees, in terms of lowering their loss function values (optimizer generalization, or generalizable learning of optimizers"); and (ii) the test performance of an optimizee (itself as a machine learning model), trained by the optimizer, in terms of the accuracy over unseen data (optimizee generalization, orlearning to generalize"). While the optimizer generalization has been recently studied, the optimizee generalization (or learning to generalize) has not been rigorously studied in the L2O context, which is the aim of this paper. We first theoretically establish an implicit connection between the local entropy and the Hessian, and hence unify their roles in the handcrafted design of generalizable optimizers as equivalent metrics of the landscape flatness of loss functions. We then propose to incorporate these two metrics as flatness-aware regularizers into the L2O framework in order to meta-train optimizers to learn to generalize, and theoretically show that such generalization ability can be learned during the L2O meta-training process and then transformed to the optimizee loss …

Poster
Qihan Wang · Chen Dun · Fangshuo Liao · Chris Jermaine · Anastasios Kyrillidis

[ Auditorium 1 Foyer ]

Recent work on the Lottery Ticket Hypothesis (LTH) shows that there exist winning tickets'' in large neural networks. These tickets representsparse'' versions of the full model that can be trained independently to achieve comparable accuracy with respect to the full model. However, finding the winning tickets requires one to pretrain the large model for at least a number of epochs, which can be a burdensome task, especially when the original neural network gets larger. In this paper, we explore how one can efficiently identify the emergence of such winning tickets, and use this observation to design efficient pretraining algorithms.For clarity of exposition, our focus is on convolutional neural networks (CNNs). To identify good filters, we propose a novel filter distance metric that well-represents the model convergence. As our theory dictates, our filter analysis behaves consistently with recent findings of neural network learning dynamics. Motivated by these observations, we present the LOttery ticket through Filter-wise Training algorithm, dubbed as LoFT. LoFT is a model-parallel pretraining algorithm that partitions convolutional layers by filters to train them independently in a distributed setting, resulting in reduced memory and communication costs during pretraining. Experiments show that LoFT i) preserves and finds good lottery tickets, …

Poster
Pavan Karjol · Rohan Kashyap · Prathosh A P

[ Auditorium 1 Foyer ]

We consider the problem of discovering subgroup $H$ of permutation group $S_n$. Unlike the traditional $H$-invariant networks wherein $H$ is assumed to be known, we present a method to discover the underlying subgroup, given that it satisfies certain conditions. Our results show that one could discover any subgroup of type $S_k (k \leq n)$ by learning an $S_n$-invariant function and a linear transformation. We also prove similar results for cyclic and dihedral subgroups. Finally, we provide a general theorem that can be extended to discover other subgroups of $S_n$. We also demonstrate the applicability of our results through numerical experiments on image-digit sum and symmetric polynomial regression tasks.
Poster
Xuantong LIU · Jianfeng Zhang · Tianyang Hu · He CAO · Yuan Yao · Lujia Pan

[ Auditorium 1 Foyer ]

Although deep neural networks achieve tremendous success on various classification tasks, the generalization ability drops sheer when training datasets exhibit long-tailed distributions. One of the reasons is that the learned representations (i.e. features) from the imbalanced datasets are less effective than those from balanced datasets. Specifically, the learned representation under class-balanced distribution will present the Neural Collapse (NC) phenomena. NC indicates the features from the same category are close to each other and from different categories are maximally distant, showing an optimal linear separable state of classification. However, the pattern differs on imbalanced datasets and is partially responsible for the reduced performance of the model. In this work, we propose two explicit feature regularization terms to learn high-quality representation for class-imbalanced data. With the proposed regularization, NC phenomena will appear under the class-imbalanced distribution, and the generalization ability can be significantly improved. Our method is easily implemented, highly effective, and can be plugged into most existing methods. The extensive experimental results on widely-used benchmarks show the effectiveness of our method

Poster
Zheyang Xiong · Fangshuo Liao · Anastasios Kyrillidis

[ Auditorium 1 Foyer ]

The strong Lottery Ticket Hypothesis (LTH) claims the existence of a subnetwork in a sufficiently large, randomly initialized neural network that approximates some target neural network without the need of training. We extend the theoretical guarantee of the strong LTH literature to a scenario more similar to the original LTH, by generalizing the weight change in the pre-training step to some perturbation around the initialization. In particular, we focus on the following open questions: By allowing an $\varepsilon$-scale perturbation on the random initial weights, can we reduce the over-parameterization requirement for the candidate network in the strong LTH? Furthermore, does the weight change by SGD coincide with a good set of such perturbation?We answer the first question by first extending the theoretical result on subset sum to allow perturbation on the candidates. Applying this result to the neural network setting, we show that by allowing $\varepsilon$-scale perturbation, we can reduce the over-parameterization requirement of the strong LTH by a factor of $O(1/(1+\varepsilon))$. To answer the second question, we show via experiments that the perturbed weight achieved by the projected SGD shows better performance under the strong LTH pruning.
Poster
Xingyu Xu · Yuantao Gu

[ Auditorium 1 Foyer ]

Benign overfitting refers to a recently discovered intriguing phenomenon that over-parameterized neural networks, in many cases, can fit the training data perfectly but still generalize well, surprisingly contrary to the traditional belief that overfitting is harmful for generalization. In spite of its surging popularity in recent years, little has been known in the theoretical aspect of benign overfitting of neural networks.In this work, we provide a theoretical analysis of benign overfitting for two-layer neural networks with possibly non-smooth activation function. Without resorting to the popular Neural Tangent Kernel (NTK) approximation, we prove that neural networks can be trained with gradient descent to classify binary-labeled training data perfectly (achieving zero training loss) even in presence of polluted labels, but still generalize well. Our result removes the smoothness assumption in previous literature and goes beyond the NTK regime; this enables a better theoretical understanding of benign overfitting within a practically more meaningful setting, e.g., with (leaky-)ReLU activation function, small random initialization, and finite network width.

Poster
Hongru Yang · Zhangyang Wang

[ Auditorium 1 Foyer ]

Motivated by both theory and practice, this work studies how random pruning the weights affects a neural network's neural tangent kernel (NTK). In particular, this work establishes a equivalence of the NTKs between a fully-connected neural network and its randomly pruned version. The equivalence is established under two cases. The first main result studies the infinite-width asymptotic. It is shown that given a pruning probability, for fully-connected neural networks with the weights randomly pruned at the initialization, as the width of each layer grows to infinity sequentially, the NTK of the pruned neural network converges to the limiting NTK of the original network with some extra scaling. If the network weights are rescaled appropriately after pruning, this extra scaling can be removed. The second main result considers the finite width case. It is shown that to ensure the NTK's closeness to the limit, the dependence of width on the sparsity parameter is asymptotically linear, as the NTK's gap to its limit goes down to zero. Moreover, if the pruning probability is set to zero (i.e., no pruning), the bound on the required width matches the bound for fully-connected neural networks in previous works up to logarithmic factors. The proof of …

Poster
Giovanni Luca Marchetti · Gustaf Tegnér · Anastasiia Varava · Danica Kragic

[ Auditorium 1 Foyer ]

We introduce a general method for learning representations that are equivariant to symmetries of data. Our central idea is to decompose the latent space into an invariant factor and the symmetry group itself. The components semantically correspond to intrinsic data classes and poses respectively. The learner is trained on a loss encouraging equivariance based on supervision from relative symmetry information. The approach is motivated by theoretical results from group theory and guarantees representations that are lossless, interpretable and disentangled. We provide an empirical investigation via experiments involving datasets with a variety of symmetries. Results show that our representations capture the geometry of data and outperform other equivariant representation learning frameworks.

Poster
Quan Nguyen · Roman Garnett

[ Auditorium 1 Foyer ]

Active search is a setting in adaptive experimental design where we aim to uncover members of rare, valuable class(es) subject to a budget constraint. An important consideration in this problem is diversity among the discovered targets -- in many applications, diverse discoveries offer more insight and may be preferable in downstream tasks. However, most existing active search policies either assume that all targets belong to a common positive class or encourage diversity via simple heuristics. We present a novel formulation of active search with multiple target classes, characterized by a utility function chosen from a flexible family whose members induce preferences for label diversity among discoveries via a diminishing returns mechanism. We then study this problem under the Bayesian lens and prove a hardness result for approximating the optimal policy for arbitrary positive, increasing, and concave utility functions. Finally, we propose an efficient, nonmyopic approximation to the optimal policy for this class of utilities and demonstrate its superior empirical performance across a wide variety of experimental settings, including drug discovery.

Poster
Samuel Stanton · Wesley Maddox · Andrew Gordon Wilson

[ Auditorium 1 Foyer ]

Bayesian optimization is a coherent, ubiquitous approach to decision-making under uncertainty, with applications including multi-arm bandits, active learning, and black-box optimization. Bayesian optimization selects decisions (i.e. objective function queries) with maximal expected utility with respect to the posterior distribution of a Bayesian model, which quantifies reducible, epistemic uncertainty about query outcomes. In practice, subjectively implausible outcomes can occur regularly for two reasons: 1) model misspecification and 2) covariate shift. Conformal prediction is an uncertainty quantification method with coverage guarantees even for misspecified models and a simple mechanism to correct for covariate shift. We propose conformal Bayesian optimization, which directs queries towards regions of search space where the model predictions have guaranteed validity, and investigate its behavior on a suite of black-box optimization tasks and tabular ranking tasks. In many cases we find that query coverage can be significantly improved without harming sample-efficiency.

Poster
Hideaki Ishibashi · Masayuki Karasuyama · Ichiro Takeuchi · Hideitsu Hino

[ Auditorium 1 Foyer ]

Bayesian optimization (BO) improves the efficiency of black-box optimization; however, the associated computational cost and power consumption remain dominant in the application of machine learning methods. This paper proposes a method of determining the stopping time in BO. The proposed criterion is based on the difference between the expectation of the minimum of a variant of the simple regrets before and after evaluating the objective function with a new parameter setting. Unlike existing stopping criteria, the proposed criterion is guaranteed to converge to the theoretically optimal stopping criterion for any choices of arbitrary acquisition functions and threshold values. Moreover, the threshold for the stopping criterion can be determined automatically and adaptively. We experimentally demonstrate that the proposed stopping criterion finds reasonable timing to stop a BO with a small number of evaluations of the objective function.

Poster
Tung Mai · Alexander Munteanu · Cameron Musco · Anup Rao · Chris Schwiegelshohn · David Woodruff

[ Auditorium 1 Foyer ]

We study oblivious sketching for $k$-sparse linear regression under various loss functions. In particular, we are interested in a distribution over sketching matrices $S\in\R^{m\times n}$ that does not depend on the inputs $A\in\R^{n\times d}$ and $b\in\R^n$, such that, given access to $SA$ and $Sb$, we can recover a $k$-sparse $\tilde x\in\mathbb{R}^d$ with $\|A\tilde x-b\|_f\leq (1+\varepsilon) \min\nolimits_{k{\text{-sparse}\,x\in\mathbb{R}^d}} \|Ax-b\|_f$. Here $\|\cdot\|_f: \mathbb R^n \rightarrow \mathbb R$ is some loss function -- such as an $\ell_p$ norm, or from a broad class of hinge-like loss functions, which includes the logistic and ReLU losses. We show that for sparse $\ell_2$ norm regression, there is a distribution over oblivious sketches with $m=\Theta(k\log(d/k)/\varepsilon^2)$ rows, which is tight up to a constant factor. This extends to $\ell_p$ loss with an additional additive $O(k\log(k/\varepsilon)/\varepsilon^2)$ term in the upper bound. This establishes a surprising separation from the related sparse recovery problem, which is an important special case of sparse regression, where $A$ is the identity matrix. For this problem, under the $\ell_2$ norm, we observe an upper bound of $m=O(k \log (d)/\varepsilon + k\log(k/\varepsilon)/\varepsilon^2)$, showing that sparse recovery is strictly easier to sketch than sparse regression. For sparse regression under hinge-like loss functions including sparse logistic and sparse ReLU …
Poster
Ziye Ma · Somayeh Sojoudi

[ Auditorium 1 Foyer ]

This paper is concerned with low-rank matrix optimization, which has found a wide range of applications in machine learning. This problem in the special case of matrix sensing has been studied extensively through the notion of Restricted Isometry Property (RIP), leading to a wealth of results on the geometric landscape of the problem and the convergence rate of common algorithms. However, the existing results can handle the problem in the case with a general objective function subject to noisy data only when the RIP constant is close to 0. In this paper, we develop a new mathematical framework to solve the above-mentioned problem with a far less restrictive RIP constant. We prove that as long as the RIP constant of the noiseless objective is less than 1/3, any spurious local solution of the noisy optimization problem must be close to the ground truth solution. By working through the strict saddle property, we also show that an approximate solution can be found in polynomial time. We characterize the geometry of the spurious local minima of the problem in a local region around the ground truth in the case when the RIP constant is greater than 1/3. Compared to the existing results …

Poster
Ellango Jothimurugesan · Kevin Hsieh · Jianyu Wang · Gauri Joshi · Phillip B Gibbons

[ Auditorium 1 Foyer ]

Federated Learning (FL) under distributed concept drift is a largely unexplored area. Although concept drift is itself a well-studied phenomenon, it poses particular challenges for FL, because drifts arise staggered in time and space (across clients). Our work is the first to explicitly study data heterogeneity in both dimensions. We first demonstrate that prior solutions to drift adaptation, with their single global model, are ill-suited to staggered drifts, necessitating multiple-model solutions. We identify the problem of drift adaptation as a time-varying clustering problem, and we propose two new clustering algorithms for reacting to drifts based on local drift detection and hierarchical clustering. Empirical evaluation shows that our solutions achieve significantly higher accuracy than existing baselines, and are comparable to an idealized algorithm with oracle knowledge of the ground-truth clustering of clients to concepts at each time step.

Poster
Lijie Hu · Zihang Xiang · Jiabin Liu · Di Wang

[ Auditorium 1 Foyer ]

In this paper we study the (sparse) Generalized Eigenvalue Problem (GEP), which arises in anumber of modern statistical learning models, such as principal component analysis (PCA), canonical correlation analysis (CCA), Fisher's discriminant analysis (FDA) and sliced inverse regression (SIR). Existing techniques for GEP all fail to consider the protection of sensitive information in the training set. Models learned by such methods can implicitlymemorize the details of sensitive information. To address this issue, we provide the first study on GEP in the differential privacy (DP) model under both deterministic and stochastic settings. In the low dimensional case, we provide a $\rho$-Concentrated DP (CDP) method namely DP-Rayleigh Flow and show if the initial vector is close enough to the optimal vector, its output has an $\ell_2$-norm estimation error of $\tilde{O}(\frac{d}{n}+\frac{d}{n^2\rho})$ (under some mild assumptions), where $d$ is the dimension and $n$ is the sample size. Next, we discuss how to find such a initial parameter privately. In the high dimensional sparse case where $d\gg n$, we propose the DP-Truncated Rayleigh Flow method whose output could achieve an error of $\tilde{O}(\frac{s\log d}{n}+\frac{s\log d}{n^2\rho})$ for various statistical models, where $s$ is the sparsity of the underlying parameter. Moreover, we show that these errors in …
Poster
Brian Liu · Rahul Mazumder

[ Auditorium 1 Foyer ]

Tree ensembles are powerful models that achieve excellent predictive performances, but can grow to unwieldy sizes. These ensembles are often post-processed (pruned) to reduce memory footprint and improve interpretability. We present ForestPrune, a novel optimization framework to post-process tree ensembles by pruning depth layers from individual trees. Since the number of nodes in a decision tree increases exponentially with tree depth, pruning deep trees drastically compactifies ensembles. We develop a specialized optimization algorithm to efficiently obtain high-quality solutions to problems under ForestPrune. Our algorithm typically reaches good solutions in seconds for medium-size datasets and ensembles, with 10000s of rows and 100s of trees, resulting in significant speedups over existing approaches. Our experiments demonstrate that ForestPrune produces parsimonious models that outperform models extracted by existing post-processing algorithms.

Poster
Yasutoshi Ida · Sekitoshi Kanai · Atsutoshi Kumagai

[ Auditorium 1 Foyer ]

Non-convex sparse regularizations with group structures are useful tools for selecting important feature groups. For optimization with these regularizations, block coordinate descent (BCD) is a standard solver that iteratively updates each parameter group. However, it suffers from high computation costs for a large number of parameter groups. The state-of-the-art method prunes unnecessary updates in BCD by utilizing bounds on the norms of the parameter groups. Unfortunately, since it computes the bound for each iteration, the computation cost still tends to be high when the updates are not sufficiently pruned. This paper proposes a fast BCD for non-convex group regularizations. Specifically, it selects a small subset of the parameter groups from all the parameter groups on the basis of the bounds and performs BCD on the subset. The subset grows step by step in accordance with the bounds during optimization. Since it computes the bounds only when selecting and growing the subsets, the total cost for computing the bounds is smaller than in the previous method. In addition, we theoretically guarantee the convergence of our method. Experiments show that our method is up to four times faster than the state-of-the-art method and 68 times faster than the original BCD without any …

Poster
Kareem Ahmed · Kai-Wei Chang · Guy Van den Broeck

[ Auditorium 1 Foyer ]

Numerous neuro-symbolic approaches have recently been proposed typically with the goal of adding symbolic knowledge to the output layer of a neural network. Ideally, such losses maximize the probability that the neural network's predictions satisfy the underlying domain. Unfortunately, this type of probabilistic inference is often computationally infeasible. Neuro-symbolic approaches therefore commonly resort to fuzzy approximations of this probabilistic objective, sacrificing sound probabilistic semantics, or to sampling which is very seldom feasible. We approach the problem by first assuming the constraint decomposes conditioned on the features learned by the network. We iteratively strengthen our approximation, restoring the dependence between the constraints most responsible for degrading the quality of the approximation. This corresponds to computing the mutual information between pairs of constraints conditioned on the network's learned features, and may be construed as a measure of how well aligned the gradients of two distributions are. We show how to compute this efficiently for tractable circuits. We test our approach on three tasks: predicting a minimum-cost path in Warcraft, predicting a minimum-cost perfect matching, and solving Sudoku puzzles, observing that it improves upon the baselines while sidestepping intractability.

Poster
Carles Domingo-Enrich · Raaz Dwivedi · Lester Mackey

[ Auditorium 1 Foyer ]

Kernel two-sample testing provides a powerful framework for distinguishing any pair of distributions based on n sample points. However, existing kernel tests either run in n^2 time or sacrifice undue power to improve runtime. To address these shortcomings, we introduce Compress Then Test (CTT), a new framework for high-powered kernel testing based on sample compression. CTT cheaply approximates an expensive test by compressing each n point sample into a small but provably high-fidelity coreset. For standard kernels and subexponential distributions, CTT inherits the statistical behavior of a quadratic-time test---recovering the same optimal detection boundary---while running in near-linear time. Building on the same principle, we also develop near-linear time procedures for aggregating multiple kernel tests and estimating asymptotic test variance. On our real and simulated data experiments, CTT provides 100--200x speed-ups over state-of-the-art approximate MMD tests with no loss of power.

Poster
Chen Dun · Mirian Hipolito Garcia · Chris Jermaine · Dimitrios Dimitriadis · Anastasios Kyrillidis

[ Auditorium 1 Foyer ]

We focus on dropout techniques in asynchronous distributed computing for federated learning (FL) scenarios. We propose AsyncDrop, a novel asynchronous FL framework with smart (i.e., informed/structured) dropout. Overall, AsyncDrop achieves better performance compared to state of the art asynchronous methodologies, while resulting in less communication and training time overheads. The key idea revolves around creating "submodels'" out of the global model and distributing their training to workers based on device heterogeneity. We rigorously justify that such an approach can be theoretically characterized. We implement our approach and compare it against other asynchronous baselines, by adapting existing synchronous FL algorithms to asynchronous scenarios. Empirically, AsyncDrop reduces the communication cost and training time, while improving the final test accuracy in diverse non-i.i.d. FL scenarios.

Poster
Hongchang Gao · Bin Gu · My T. Thai

[ Auditorium 1 Foyer ]

Bilevel optimization has been applied to a wide variety of machine learning models and numerous stochastic bilevel optimization algorithms have been developed in recent years. However, most existing algorithms restrict their focus on the single-machine setting so that they are incapable of handling the distributed data. To address this issue, under the setting where all participants compose a network and perform peer-to-peer communication in this network, we developed two novel decentralized stochastic bilevel optimization algorithms based on the gradient tracking communication mechanism and two different gradient estimators. Additionally, we established their convergence rates for nonconvex-strongly-convex problems with novel theoretical analysis strategies. To our knowledge, this is the first work achieving these theoretical results. Finally, we applied our algorithms to practical machine learning models, and the experimental results confirmed the efficacy of our algorithms.

Poster
Giannis Nikolentzos · Michalis Vazirgiannis

[ Auditorium 1 Foyer ]

Graph kernels have become a standard approach for tackling the graph similarity and learning tasks at the same time. Most graph kernels proposed so far are instances of the R-convolution framework. These kernels decompose graphs into their substructures and sum over all pairs of these substructures. However, considerably less attention has been paid to other types of kernels. In this paper, we propose a new kernel between graphs which reorders the adjacency matrix of each graph based on soft permutation matrices, and then compares those aligned adjacency matrices to each other using a linear kernel. To compute the permutation matrices, the kernel finds corresponding vertices in different graphs. Two vertices match with each other if the Weisfeiler-Leman test of isomorphism assigns the same label to both of them. The proposed kernel is evaluated on several graph classification and graph regression datasets. Our results indicate that the kernel is competitive with traditional and state-of-the-art methods.

Poster
Weilin Cong · Mehrdad Mahdavi

[ Auditorium 1 Foyer ]

As privacy protection receives much attention, unlearning the effect of a specific node from a pre-trained graph learning model has become equally important. However, due to the \emph{node dependency} in the graph-structured data, representation unlearning in Graph Neural Networks (GNNs) is challenging and less well explored. In this paper, we fill in this gap by first studying the unlearning problem in linear-GNNs, and then introducing its extension to non-linear structures. Given a set of nodes to unlearn, we propose \textsc{Projector} that unlearns by projecting the weight parameters of the pre-trained model onto a subspace that is irrelevant to features of the nodes to be forgotten. \textsc{Projector} could overcome the challenges caused by node dependency and enjoys perfect data removal, i.e., the unlearned model parameters do not contain any information about the unlearned node features which is guaranteed by algorithmic construction. Empirical results on real-world datasets illustrate the effectiveness and efficiency of \textsc{Projector}.

Poster
Ekaterina Khramtsova · Guido Zuccon · Xi Wang · Mahsa Baktashmotlagh

[ Auditorium 1 Foyer ]

While deep neural networks are proven to be effective learning systems, their analysis is complex due to high-dimensionality of their weight space. To remedy this, persistent topological properties can be used as an additional descriptor, providing insights on how the network weights evolve during training. In this paper, we focus on convolutional neural networks, and define the topology of the space, populated by convolutional filters (i.e., kernels). We perform an extensive analysis of topological properties of the convolutional filters. Specifically, we define a metric based on persistent homology, namely, Convolutional Topology Representation, to determine an important factor in neural networks training - generalizability of the model to the test set. We further analyse how various training methods affect the topology of convolutional layers.

Poster
Prashant Trivedi · Nandyala Hemachandra

[ Auditorium 1 Foyer ]

This work considers multiple agents traversing a network from a source node to the goal node. The cost to an agent for traveling a link has a private as well as a congestion component. The agent's objective is to find a path to the goal node with minimum overall cost in a decentralized way. We model this as a fully decentralized multi-agent reinforcement learning problem and propose a novel multi-agent congestion cost minimization (MACCM) algorithm. Our MACCM algorithm uses linear function approximations of transition probabilities and the global cost function. In the absence of a central controller and to preserve privacy, agents communicate the cost function parameters to their neighbors via a time-varying communication network. Moreover, each agent maintains its estimate of the global state-action value, which is updated via a multi-agent extended value iteration (MAEVI) sub-routine. We show that our MACCM algorithm achieves a sub-linear regret. The proof requires the convergence of cost function parameters, the MAEVI algorithm, and analysis of the regret bounds induced by the MAEVI triggering condition for each agent. We implement our algorithm on a two node network with multiple links to validate it. We first identify the optimal policy, the optimal number of agents …

Poster
Yigit Efe Erginbas · Soham Phade · Kannan Ramchandran

[ Auditorium 1 Foyer ]

Large-scale online recommendation systems must facilitate the allocation of a limited number of items among competing users while learning their preferences from user feedback. As a principled way of incorporating market constraints and user incentives in the design, we consider our objectives to be two-fold: maximal social welfare with minimal instability. To maximize social welfare, our proposed framework enhances the quality of recommendations by exploring allocations that optimistically maximize the rewards. To minimize instability, a measure of users' incentives to deviate from recommended allocations, the algorithm prices the items based on a scheme derived from the Walrasian equilibria. Though it is known that these equilibria yield stable prices for markets with known user preferences, our approach accounts for the inherent uncertainty in the preferences and further ensures that the users accept most of their recommendations under offered prices. To the best of our knowledge, our approach is the first to integrate techniques from combinatorial bandits, optimal resource allocation, and collaborative filtering to obtain an algorithm that achieves sub-linear social welfare regret as well as sub-linear instability. Empirical studies on synthetic and real-world data also demonstrate the efficacy of our strategy compared to approaches that do not fully incorporate all these …

Poster
David Simchi-Levi · Chonghuan Wang

[ Auditorium 1 Foyer ]

Multi-armed bandit has been well-known for its efficiency in online decision-making in terms of minimizing the loss of the participants’ welfare during experiments (i.e., the regret). In clinical trials and many other scenarios, the statistical power of inferring the treatment effects (i.e., the gaps between the mean outcomes of different arms) is also crucial. Nevertheless, minimizing the regret entails harming the statistical power of estimating the treatment effect, since the observations from some arms can be limited. In this paper, we investigate the trade-off between efficiency and statistical power by casting the multi-armed bandit experimental design into a minimax multi-objective optimization problem. We introduce the concept of Pareto optimality to mathematically characterize the situation in which neither the statistical power nor the efficiency can be improved without degrading the other. We derive a useful sufficient and necessary condition for the Pareto optimal solutions. Additionally, we design an effective Pareto optimal multi-armed bandit experiment that can be tailored to different levels of the trade-off between the two objectives.

Poster
Aditya Bhaskara · Kamesh Munagala

[ Auditorium 1 Foyer ]

For many of the classic online learning settings, it is known that having a "hint" about the loss function before making a prediction leads to significantly better regret guarantees. In this work we study the question, do hints allow us to go beyond the standard notion of regret (which competes against the best fixed strategy) and compete against adaptive or dynamic strategies? After all, if hints were perfect, we can clearly compete against a fully dynamic strategy. For some common online learning settings, we provide upper and lower bounds for the switching regret, i.e., the difference between the loss incurred by the algorithm and the optimal strategy in hindsight that switches state at most $L$ times, where $L$ is some parameter. We show positive results for online linear optimization and the classic experts problem. Interestingly, such results turn out to be impossible for the setting of multi-arm bandits.
Poster
Shibo Li · Zheng Wang · Akil Narayan · Robert Kirby · Shandian Zhe

[ Auditorium 1 Foyer ]

Model Agnostic Meta-Learning (MAML) is widely used to find a good initialization for a family of tasks. Despite its success, a critical challenge in MAML is to calculate the gradient w.r.t. the initialization of a long training trajectory for the sampled tasks, because the computation graph can rapidly explode and the computational cost is very expensive. To address this problem, we propose Adjoint MAML (A-MAML). We view gradient descent in the inner optimization as the evolution of an Ordinary Differential Equation (ODE). To efficiently compute the gradient of the validation loss w.r.t. the initialization, we use the adjoint method to construct a companion, backward ODE. To obtain the gradient w.r.t. the initialization, we only need to run the standard ODE solver twice --- one is forward in time that evolves a long trajectory of gradient flow for the sampled task; the other is backward and solves the adjoint ODE. We need not create or expand any intermediate computational graphs, adopt aggressive approximations, or impose proximal regularizers in the training loss. Our approach is cheap, accurate, and adaptable to different trajectory lengths. We demonstrate the advantage of our approach in both synthetic and real-world meta-learning tasks. The code is available at …

Poster
Maxim Kodryan · Dmitry Kropotov · Dmitry Vetrov

[ Auditorium 1 Foyer ]

Tensor decomposition methods have proven effective in various applications, including compression and acceleration of neural networks. At the same time, the problem of determining optimal decomposition ranks, which present the crucial parameter controlling the compression-accuracy trade-off, is still acute. In this paper, we introduce MARS — a new efficient method for the automatic selection of ranks in general tensor decompositions. During training, the procedure learns binary masks over decomposition cores that “select” the optimal tensor structure. The learning is performed via relaxed maximum a posteriori (MAP) estimation in a specific Bayesian model and can be naturally embedded into the standard neural network training routine. Diverse experiments demonstrate that MARS achieves better results compared to previous works in various tasks.

Poster
Hsin-En Su · Yen-Ju Chen · Ping-Chun Hsieh · Xi Liu

[ Auditorium 1 Foyer ]

We revisit the domain of off-policy policy optimization in RL from the perspective of coordinate ascent. One commonly-used approach is to leverage the off-policy policy gradient to optimize a surrogate objective -- the total discounted in expectation return of the target policy with respect to the state distribution of the behavior policy. However, this approach has been shown to suffer from the distribution mismatch issue, and therefore significant efforts are needed for correcting this mismatch either via state distribution correction or a counterfactual method. In this paper, we rethink off-policy learning via Coordinate Ascent Policy Optimization (CAPO), an off-policy actor-critic algorithm that decouples policy improvement from the state distribution of the behavior policy without using the policy gradient. We establish the global convergence of CAPO with general coordinate selection and then further quantify the convergence rates of several instances of CAPO with popular coordinate selection rules, including the cyclic and the randomized variants of CAPO. We then extend CAPO to neural policies for a more practical implementation. Through experiments, we demonstrate that CAPO provides a competitive approach to RL in practice.

Poster
Aadirupa Saha · Aldo Pacchiano · Jonathan Lee

[ Auditorium 1 Foyer ]

We consider the problem of preference-based reinforcement learning (PbRL), where, unlike traditional reinforcement learning (RL), an agent receives feedback only in terms of 1 bit (0/1) preferences over a trajectory pair instead of absolute rewards for it. The success of the traditional reward-based RL framework crucially depends on how accurately a system designer can express an appropriate reward function, which is often a non-trivial task. The main novelty of the our framework is the ability to learn from preference-based trajectory feedback that eliminates the need to hand-craft numeric reward models. This paper sets up a formal framework for the PbRL problem with non-markovian rewards, where the trajectory preferences are encoded by a generalized linear model of dimension $d$. Assuming the transition model is known, we propose an algorithm with a regret guarantee of $\tilde {\mathcal{O}}\left( SH d \log (T / \delta) \sqrt{T} \right)$. We further extend the above algorithm to the case of unknown transition dynamics and provide an algorithm with regret $\widetilde{\mathcal{O}}((\sqrt{d} + H^2 + |\mathcal{S}|)\sqrt{dT} +\sqrt{|\mathcal{S}||\mathcal{A}|TH} )$. To the best of our knowledge, our work is one of the first to give tight regret guarantees for preference-based RL problem with trajectory preferences.
Poster
Sing-Yuan Yeh · Fu-Chieh Chang · Chang-Wei Yueh · Pei-Yuan Wu · Alberto Bernacchia · Sattar Vakili

[ Auditorium 1 Foyer ]

Modern reinforcement learning (RL) often faces an enormous action-state space. Existing analytical results are typically for settings with a small number of action-states, or simple models such as linearly modeled Q functions. To derive statistically efficient RL policies handling large action-tate spaces, with more general Q functions, some recent works have considered nonlinear function approximation using kernel ridge regression. In this work, we derive sample complexities for kernel-based Q-learning when a generative model exists. We propose a non-parametric Q-learning algorithm which finds an ε-optimal policy in an arbitrarily large scale discounted MDP. The sample complexity of the proposed algorithm is order optimal with respect to ε and the complexity of the kernel (in terms of its information gain or effective dimension). To the best of our knowledge, this is the first result showing a finite sample complexity under such a general model.

Poster
Gandharv Patil · Prashanth L.A. · Dheeraj Nagaraj · Doina Precup

[ Auditorium 1 Foyer ]

We study the finite-time behaviour of the popular temporal difference (TD) learning algorithm, when combined with tail-averaging. We derive finite time bounds on the parameter error of the tail-averaged TD iterate under a step-size choice that does not require information about the eigenvalues of the matrix underlying the projected TD fixed point. Our analysis shows that tail-averaged TD converges at the optimal O (1/t) rate, both in expectation and with high probability. In addition, our bounds exhibit a sharper rate of decay for the initial error (bias), which is an improvement over averaging all iterates. We also propose and analyse a variant of TD that incorporates regularisation, and show that this variant fares favourably in problems with ill-conditioned features.

Poster
Victor Boone · Bruno Gaujal

[ Auditorium 1 Foyer ]

This paper investigates a new learning problem,the identification of Blackwell optimal policieson deterministic MDPs (DMDPs): A learner hasto return a Blackwell optimal policy with fixedconfidence using a minimal number of queries.First, we characterize the maximal set of DMDPsfor which the identification is possible. Then,we focus on the analysis of algorithms based onproduct-form confidence regions. We minimizethe number of queries by efficiently visiting thestate-action pairs with respect to the shape ofconfidence sets. Furthermore, these confidencesets are themselves optimized to achieve betterperformances. The performances of our methodscompare to the lower bounds up to a factor $n^2$ inthe worst case – where $n$ is the number of states,and constant in certain classes of DMDPs.
Poster
Muhammad Aneeq uz Zaman · Alec Koppel · Sujay Bhatt · Tamer Basar

[ Auditorium 1 Foyer ]

We consider online reinforcement learning in Mean-Field Games (MFGs). Unlike traditional approaches, we alleviate the need for a mean-field oracle by developing an algorithm that approximates the Mean-Field Equilibrium (MFE) using the single sample path of the generic agent. We call this Sandbox Learning, as it can be used as a warm-start for any agent learning in a multi-agent non-cooperative setting. We adopt a two time-scale approach in which an online fixed-point recursion for the mean-field operates on a slower time-scale, in tandem with a control policy update on a faster time-scale for the generic agent. Given that the underlying Markov Decision Process (MDP) of the agent is communicating, we provide finite sample convergence guarantees in terms of convergence of the mean-field and control policy to the mean-field equilibrium. The sample complexity of the Sandbox learning algorithm is $\Os(\epsilon^{-4})$ where $\epsilon$ is the MFE approximation error. This is similar to works which assume access to oracle. Finally, we empirically demonstrate the effectiveness of the sandbox learning algorithm in diverse scenarios, including those where the MDP does not necessarily have a single communicating class.
Poster
Thomas Gebhart · Jakob Hansen · Paul Schrater

[ Auditorium 1 Foyer ]

Knowledge graph embedding involves learning representations of entities---the vertices of the graph---and relations---the edges of the graph---such that the resulting representations encode the known factual information represented by the knowledge graph and can be used in the inference of new relations. We show that knowledge graph embedding is naturally expressed in the topological and categorical language of cellular sheaves: a knowledge graph embedding can be described as an approximate global section of an appropriate knowledge sheaf over the graph, with consistency constraints induced by the knowledge graph's schema. This approach provides a generalized framework for reasoning about knowledge graph embedding models and allows for the expression of a wide range of prior constraints on embeddings. Further, the resulting embeddings can be easily adapted for reasoning over composite relations without special training. We implement these ideas to highlight the benefits of the extensions inspired by this new perspective.

Poster
Vincent Liu · Yash Chandak · Philip Thomas · Martha White

[ Auditorium 1 Foyer ]

In this work, we consider the off-policy policy evaluation problem for contextual bandits and finite horizon reinforcement learning in the nonstationary setting. Reusing old data is critical for policy evaluation, but existing estimators that reuse old data introduce large bias such that we can not obtain a valid confidence interval. Inspired from a related field called survey sampling, we introduce a variant of the doubly robust (DR) estimator, called the regression-assisted DR estimator, that can incorporate the past data without introducing a large bias. The estimator unifies several existing off-policy policy evaluation methods and improves on them with the use of auxiliary information and a regression approach. We prove that the new estimator is asymptotically unbiased, and provide a consistent variance estimator to a construct a large sample confidence interval. Finally, we empirically show that the new estimator improves estimation for the current and future policy values, and provides a tight and valid interval estimation in several nonstationary recommendation environments.

Poster
Chris Dann · Mohammad Ghavamzadeh · Teodor Vanislavov Marinov

[ Auditorium 1 Foyer ]

In reinforcement learning applications, we often want to accurately estimate the return of several policies of interest. We study this problem, multiple-policy high-confidence policy evaluation, where the goal is to estimate the return of all given target policies up to a desired accuracy with as few samples as possible. The natural approaches to this problem, i.e.,~evaluating each policy separately or estimating a model of the MDP, do not take into account the similarities between target policies and scale with the number of policies to evaluate or the size of the MDP, respectively.We present an alternative approach based on reusing samples from on-policy Monte-Carlo estimators and show that it is more sample-efficient in favorable cases. Specifically, we provide guarantees in terms of a notion of overlap of the set of target policies and shed light on when such an approach is indeed beneficial compared to existing methods.

Poster
Anna Winnicki · R Srikant

[ Auditorium 1 Foyer ]

A common technique in reinforcement learning is to evaluate the value function from Monte Carlo simulations of a given policy, and use the estimated value function to obtain a new policy which is greedy with respect to the estimated value function. A well-known longstanding open problem in this context is to prove the convergence of such a scheme when the value function of a policy is estimated from data collected from a single sample path obtained from implementing the policy (see page 99 of (Sutton and Barto, 2018), page 8 of (Tsitsiklis, 2002)). We present a solution to the open problem by showing that a first-visit version of such a policy iteration scheme indeed converges to the optimal policy provided that the policy improvement step uses lookahead (Silver et al., 2016, Silver et al., 2017) rather than a simple greedy policy improvement. We provide results both for the original open problem in the tabular setting and also present extensions to the function approximation setting, where we show that the policy resulting from the algorithm performs close to the optimal policy within a function approximation error.

Poster
yingying zhang · Chengchun Shi · Shikai Luo

[ Auditorium 1 Foyer ]

Off-policy evaluation is critical in a number of applications where new policiesneed to be evaluated offline before online deployment. Most existing methods focuson the expected return, define the target parameter through averaging and provide apoint estimator only. In this paper, we develop a novel procedure to produce reliableinterval estimators for a target policy’s return starting from any initial state. Ourproposal accounts for the variability of the return around its expectation, focuseson the individual effect and offers valid uncertainty quantification. Our mainidea lies in designing a pseudo policy that generates subsamples as if they weresampled from the target policy so that existing conformal prediction algorithmsare applicable to prediction interval construction. Our methods are justified bytheories, synthetic data and real data from short-video platforms.

Poster
Honghao Wei · Arnob Ghosh · Ness Shroff · Lei Ying · Xingyu Zhou

[ Auditorium 1 Foyer ]

We study model-free reinforcement learning (RL) algorithms in episodic non-stationary constrained Markov decision processes (CMDPs), in which an agent aims to maximize the expected cumulative reward subject to a cumulative constraint on the expected utility (cost). In the non-stationary environment, the reward, utility functions, and the transition kernels can vary arbitrarily over time as long as the cumulative variations do not exceed certain variation budgets. We propose the first model-free, simulator-free RL algorithms with sublinear regret and zero constraint violation for non-stationary CMDPs in both tabular and linear function approximation settings with provable performance guarantee. Our results on regret bound and constraint violation for tabular case match the corresponding best results for stationary CMDPs when the total budget is known. Furthermore, we provide a general framework for dealing with the well-known challenges when analyzing non-stationary CMDPs without requiring prior knowledge of the variation budget and apply the approach for both tabular and linear approximation settings.

Poster
Zhe Huang · Mary-Joy Sidhom · Benjamin Wessler · Michael C. Hughes

[ Auditorium 1 Foyer ]

Semi-supervised learning (SSL) promises improved accuracy compared to training classifiers on small labeled datasets by also training on many unlabeled images. In real applications like medical imaging, unlabeled data will be collected for expediency and thus uncurated: possibly different from the labeled set in classes or features. Unfortunately, modern deep SSL often makes accuracy worse when given uncurated unlabeled data. Recent complex remedies try to detect out-of-distribution unlabeled images and then discard or downweight them. Instead, we introduce Fix-A-Step, a simpler procedure that views all uncurated unlabeled images as potentially helpful. Our first insight is that even uncurated images can yield useful augmentations of labeled data. Second, we modify gradient descent updates to prevent optimizing a multi-task SSL loss from hurting labeled-set accuracy. Fix-A-Step can "repair" many common deep SSL methods, improving accuracy on CIFAR benchmarks across all tested methods and levels of artificial class mismatch. On a new medical SSL benchmark called Heart2Heart, Fix-A-Step can learn from 353,500 truly uncurated ultrasound images to deliver gains that generalize across hospitals.

Poster
Lukas Zierahn · Dirk van der Hoeven · Nicolò Cesa-Bianchi · Gergely Neu

[ Auditorium 1 Foyer ]

We study a contextual version of online combinatorial optimisation with full and semi-bandit feedback. In this sequential decision-making problem, an online learner has to select an action from a combinatorial decision space after seeing a vector-valued context in each round. As a result of its action, the learner incurs a loss that is a bilinear function of the context vector and the vector representation of the chosen action. We consider two natural versions of the problem: semi-bandit where the losses are revealed for each component appearing in the learner's combinatorial action, and full-bandit where only the total loss is observed. We design computationally efficient algorithms based on a new loss estimator that takes advantage of the special structure of the problem, and show regret bounds order $\sqrt{T}$ with respect to the time horizon. The bounds demonstrate polynomial scaling with the relevant problem parameters which is shown to be nearly optimal. The theoretical results are complemented by a set of experiments on simulated data.
Poster
Andrea Pugnana · Salvatore Ruggieri

[ Auditorium 1 Foyer ]

Selective classification (or classification with a reject option) pairs a classifier with a selection function to determine whether or not a prediction should be accepted. This framework trades off coverage (probability of accepting a prediction) with predictive performance, typically measured by distributive loss functions.In many application scenarios, such as credit scoring, performance is instead measured by ranking metrics, such as the Area Under the ROC Curve (AUC). We propose a model-agnostic approach to associate a selection function to a given probabilistic binary classifier. The approach is specifically targeted at optimizing the AUC. We provide both theoretical justifications and a novel algorithm, called AUCROSS, to achieve such a goal. Experiments show that our method succeeds in trading-off coverage for AUC, improving over existing selective classification methods targeted at optimizing accuracy.

Poster
Michal Sharoni · Sivan Sabato

[ Auditorium 1 Foyer ]

We study the supervised learning paradigm called Learning Using Privileged Information, first suggested by Vapnik and Vashist (2009). In this paradigm, in addition to the examples and labels, additional (privileged) information is provided only for training examples. The goal is to use this information to improve the classification accuracy of the resulting classifier, where this classifier can only use the non-privileged information of new example instances to predict their label. We study the theory of privileged learning with the zero-one loss under the natural Privileged ERM algorithm proposed in Peshyony and Vapnik (2010). We provide a counter example to a claim made in that work regarding the VC dimension of the loss class induced by this problem; We conclude that the claim is incorrect. We then provide a correct VC dimension analysis which gives both lower and upper bounds on the capacity of the Privileged ERM loss class.We further show, via a generalization analysis, that worst-case guarantees for Privileged ERM cannot improve over standard non-privileged ERM, unless the capacity of the privileged information is similar or smaller to that of the non-privileged information. This result points to an important limitation of the Privileged ERM approach. In our closing discussion, we …

Poster
Suya Wu · Enmao Diao · Taposh Banerjee · Jie Ding · Vahid Tarokh

[ Auditorium 1 Foyer ]

Classical change detection algorithms typically require modeling pre-change and post-change distributions. The calculations may not be feasible for various machine learning models because of the complexity of computing the partition functions and normalized distributions. Additionally, these methods may suffer from a lack of robustness to model mismatch and noise. In this paper, we develop a new variant of the classical Cumulative Sum (CUSUM) change detection, namely Score-based CUSUM (SCUSUM), based on Fisher divergence and the Hyv\"arinen score. Our method allows the applications of the quickest change detection for unnormalized distributions. We provide a theoretical analysis of the detection delay given the constraints on false alarms. We prove the asymptotic optimality of the proposed method in some particular cases. We also provide numerical experiments to demonstrate our method's computation, performance, and robustness advantages.

Poster
Paramita Koley · Harshavardhan Alimi · Shrey Singla · Sourangshu Bhattacharya · niloy ganguly · Abir De

[ Auditorium 1 Foyer ]

In this paper, we consider the problem of global change-point detection in event sequence data, where both the event distributions and change-points are assumed to be unknown. For this problem, we propose DCPD, a Log-likelihood Ratio based Global Change-point Detector, which detects change-points after observing the entire sequence. Based on the Transformer Hawkes Process (THP), a well-known neural TPP framework, we develop DCPD, a differentiable change-point detector, along with maintaining distinct intensity and mark predictor for each partition. Further, we propose DCPD-W, a sliding-window-based extension of DCPD to improve its scalability in terms of the number of events or change-points with minor sacrifice in performance. Experiments on synthetic datasets explore the effects of runtime, relative complexity, and other aspects of distributions on various properties of our change-point detector, namely robustness, detection accuracy, scalability, etc under controlled environments. Finally, we perform experiments on six real-world temporal event sequences collected from diverse domains like health, geographical regions, etc. We show that our methods either outperform or perform comparably with the baselines.

Poster
Niklas Stoehr · Benjamin Radford · Ryan Cotterell · Aaron Schein

[ Auditorium 1 Foyer ]

Many dynamical systems in the real world are naturally described by latent states with intrinsic ordering, such as "ally", "neutral", and "enemy" relationships in international relations. These latent states manifest through countries' cooperative versus conflictual interactions over time. State-space models (SSMs) explicitly relate the dynamics of observed measurements to transitions in latent states. For discrete data, SSMs commonly do so through a state-to-action emission matrix and a state-to-state transition matrix. This paper introduces the Ordered Matrix Dirichlet (OMD) as a prior distribution over ordered stochastic matrices wherein the discrete distribution in the kth row is stochastically dominated by the (k+1)th, such that probability mass is shifted to the right when moving down rows. We illustrate the OMD prior within two SSMs: a hidden Markov model, and a novel dynamic Poisson Tucker decomposition model tailored to international relations data. We find that models built on the OMD recover interpretable ordered latent structure without forfeiting predictive performance. We suggest future applications to other domains where models with stochastic matrices are popular (e.g., topic modeling), and publish user-friendly code.

Poster
Yuchao Qin · Mihaela van der Schaar · Changhee Lee

[ Auditorium 1 Foyer ]

Clustering time-series data in healthcare is crucial for clinical phenotyping to understand patients’ disease progression patterns and to design treatment guidelines tailored to homogeneous patient subgroups. While rich temporal dynamics enable the discovery of potential clusters beyond static correlations, two major challenges remain outstanding: i) discovery of predictive patterns from many potential temporal correlations in the multi-variate time-series data and ii) association of individual temporal patterns to the target label distribution that best characterizes the underlying clinical progression. To address such challenges, we develop a novel temporal clustering method, T-Phenotype, to discover phenotypes of predictive temporal patterns from labeled time-series data. We introduce an efficient representation learning approach in frequency domain that can encode variable-length, irregularly-sampled time-series into a unified representation space, which is then applied to identify various temporal patterns that potentially contribute to the target label using a new notion of path-based similarity. Throughout the experiments on synthetic and real-world datasets, we show that T-Phenotype achieves the best phenotype discovery performance over all the evaluated baselines. We further demonstrate the utility of T-Phenotype by uncovering clinically meaningful patient subgroups characterized by unique temporal patterns.

Poster
Sara Ahmadian · Maryam Negahbani

[ Auditorium 1 Foyer ]

Correlation clustering is a ubiquitous paradigm in unsupervised machine learning where addressing unfairness is a major challenge. Motivated by this, we study {\em fair correlation clustering} where the data points may belong to different protected groups and the goal is to ensure fair representation of all groups across clusters. Our paper significantly generalizes and improves on the quality guarantees of previous work of Ahmadi et al. and Ahmadian et al. as follows. \begin{itemize} \item We allow the user to specify an arbitrary upper bound on the representation of each group in a cluster. \item Our algorithm allows individuals to have multiple protected features and ensure fairness simultaneously across them all. \item We prove guarantees for clustering quality and fairness in this general setting. Furthermore, this improves on the results for the special cases studied in previous work.\end{itemize}Our experiments on real-world data demonstrate that our clustering quality compared to the optimal solution is much better than what our theoretical result suggests.

Poster
David Bosch · Ashkan Panahi · Ayca Ozcelikkale · Devdatt Dubhashi

[ Auditorium 1 Foyer ]

We compute precise asymptotic expressions for the learning curves of least squares random feature (RF) models with either a separable strongly convex regularization or the $\ell_1$ regularization. We propose a novel multi-level application of the convex Gaussian min max theorem (CGMT) to overcome the traditional difficulty of finding computable expressions for random features models with correlated data. Our result takes the form of a computable 4-dimensional scalar optimization. In contrast to previous results, our approach does not require solving an often intractable proximal operator, which scales with the number of model parameters. Furthermore, we extend the universality results for the training and generalization errors for RF models to $\ell_1$ regularization. In particular, we demonstrate that under mild conditions, random feature models with elastic net or $\ell_1$ regularization are asymptotically equivalent to a surrogate Gaussian model with the same first and second moments. We numerically demonstrate the predictive capacity of our results, and show experimentally that the predicted test error is accurate even in the non-asymptotic regime.
Poster
Nathan Kallus · Miruna Oprescu

[ Auditorium 1 Foyer ]

The conditional average treatment effect (CATE) is the best measure of individual causal effects given baseline covariates. However, the CATE only captures the (conditional) average, and can overlook risks and tail events, which are important to treatment choice. In aggregate analyses, this is usually addressed by measuring the distributional treatment effect (DTE), such as differences in quantiles or tail expectations between treatment groups. Hypothetically, one can similarly fit conditional quantile regressions in each treatment group and take their difference, but this would not be robust to misspecification or provide agnostic best-in-class predictions. We provide a new robust and model-agnostic methodology for learning the conditional DTE (CDTE) for a class of problems that includes conditional quantile treatment effects, conditional super-quantile treatment effects, and conditional treatment effects on coherent risk measures given by f-divergences. Our method is based on constructing a special pseudo-outcome and regressing it on covariates using any regression learner. Our method is model-agnostic in that it can provide the best projection of CDTE onto the regression model class. Our method is robust in that even if we learn these nuisances nonparametrically at very slow rates, we can still learn CDTEs at rates that depend on the class complexity and …

Poster
Kei Ishikawa · Niao He

[ Auditorium 1 Foyer ]

We study policy evaluation of offline contextual bandits subject to unobserved confounders. Sensitivity analysis methods are commonly used to estimate the policy value under the worst-case confounding over a given uncertainty set. However, existing work often resorts to some coarse relaxation of the uncertainty set for the sake of tractability, leading to overly conservative estimation of the policy value. In this paper, we propose a general estimator that provides a sharp lower bound of the policy value. It can be shown that our estimator contains the recently proposed sharp estimator by Dorn and Guo (2022) as a special case, and our method enables a novel extension of the classical marginal sensitivity model using f-divergence. To construct our estimator, we leverage the kernel method to obtain a tractable approximation to the conditional moment constraints, which traditional non-sharp estimators failed to take into account. In the theoretical analysis, we provide a condition for the choice of the kernel which guarantees no specification error that biases the lower bound estimation. Furthermore, we provide consistency guarantees of policy evaluation and learning. In the experiments with synthetic and real-world data, we demonstrate the effectiveness of the proposed method.

Poster
Grigor Keropyan · David Strieder · Mathias Drton

[ Auditorium 1 Foyer ]

Learning causal relationships from empirical observations is a central task in scientific research. A common method is to employ structural causal models that postulate noisy functional relations among a set of interacting variables. To ensure unique identifiability of causal directions, researchers consider restricted subclasses of structural causal models. Post-nonlinear (PNL) causal models constitute one of the most flexible options for such restricted subclasses, containing in particular the popular additive noise models as a further subclass. However, learning PNL models is not well studied beyond the bivariate case. The existing methods learn non-linear functional relations by minimizing residual dependencies and subsequently test independence from residuals to determine causal orientations. However, these methods can be prone to overfitting and, thus, difficult to tune appropriately in practice. As an alternative, we propose a new approach for PNL causal discovery that uses rank-based methods to estimate the functional parameters. This new approach exploits natural invariances of PNL models and disentangles the estimation of the non-linear functions from the independence tests used to find causal orientations. We prove consistency of our method and validate our results in numerical experiments.

Poster
Jayadev Acharya · Sourbh Nitin Bhadane · Arnab Bhattacharyya · Saravanan Kandasamy · Ziteng Sun

[ Auditorium 1 Foyer ]

We study the sample complexity of causal structure learning on a two-variable system with observational and experimental data. Specifically, for two variables $X$ and $Y$, we consider the classical scenario where either $X$ causes $Y$, $Y$ causes $X$, or there is an unmeasured confounder between $X$ and $Y$. Let $m_1$ be the number of observational samples of $(X,Y)$, and let $m_2$ be the number of interventional samples where either $X$ or $Y$ has been subject to an external intervention. We show that if $X$ and $Y$ are over a finite domain of size $k$ and are significantly correlated, the minimum $m_2$ needed is sublinear in $k$. Moreover, as $m_1$ grows, the minimum $m_2$ needed to identify the causal structure decreases. In fact, we can give a tight characterization of the tradeoff between $m_1$ and $m_2$ when $m_1 = O(k)$ or is sufficiently large. We build upon techniques for closeness testing when $m_1$ is small (e.g., sublinear in $k$), and for non-parametric density estimation when $m_2$ is large. Our hardness results are based on carefully constructing causal models whose marginal and interventional distributions form hard instances of canonical results on property testing.
Poster
Xinwei Sun · Xiangyu Zheng · Jim Weinstein

[ Auditorium 1 Foyer ]

Causal decomposition has provided a powerful tool to analyze health disparity problems by assessing the proportion of disparity caused by each mediator (the variable that mediates the effect of the exposure on the health outcome). However, most of these methods lack policy implications, as they fail to account for all sources of disparities caused by the mediator. Besides, its identifiability needs to specify a set to be admissible to make the strong ignorability condition hold, which can be problematic as some variables in this set may induce new spurious features. To resolve these issues, under the framework of the structural causal model, we propose a new decomposition, dubbed as adjusted and unadjusted effects, which is able to include all types of disparity by adjusting each mediator’s distribution from the disadvantaged group to the advantaged ones. Besides, by learning the maximal ancestral graph and implementing causal discovery from heterogeneous data, we can identify the admissible set, followed by an efficient algorithm for estimation. The theoretical correctness and efficacy of our method are demonstrated using a synthetic dataset and a common spine disease dataset.

Poster
Shinsaku Sakaue · Taihei Oki

[ Auditorium 1 Foyer ]

Learning sketching matrices for fast and accurate low-rank approximation (LRA) has gained increasing attention. Recently, Bartlett, Indyk, and Wagner (COLT 2022) presented a generalization bound for the learning-based LRA. Specifically, for rank-$k$ approximation using an $m \times n$ learned sketching matrix with $s$ non-zeros in each column, they proved an $Õ(nsm)$ bound on the \emph{fat shattering dimension} ($Õ$ hides logarithmic factors). We build on their work and make two contributions.(1) We present a better $Õ(nsk)$ bound ($k \le m$). En route to obtaining this result, we give a low-complexity \emph{Goldberg--Jerrum algorithm} for computing pseudo-inverse matrices, which would be of independent interest.(2) We alleviate an assumption of the previous study that sketching matrices have a fixed sparsity pattern. We prove that learning positions of non-zeros increases the fat shattering dimension only by $O(ns\log n)$. In addition, experiments confirm the practical benefit of learning sparsity patterns.
Poster
Shiwei Zeng · Jie Shen

[ Auditorium 1 Foyer ]

We study the problem of crowdsourced PAC learning of threshold functions. This is a challenging problem and only recently have query-efficient algorithms been established under the assumption that a noticeable fraction of the workers are perfect. In this work, we investigate a more challenging case where the majority may behave adversarially and the rest behave as the Massart noise -- a significant generalization of the perfectness assumption. We show that under the semi-verified model of Charikar et al. (2017), where we have (limited) access to a trusted oracle who always returns correct annotations, it is possible to PAC learn the underlying hypothesis class with a manageable amount of label queries. Moreover, we show that the labeling cost can be drastically mitigated via the more easily obtained comparison queries. Orthogonal to recent developments in semi-verified or list-decodable learning that crucially rely on data distributional assumptions, our PAC guarantee holds by exploring the wisdom of the crowd.

Poster
Pierre Gaillard · Aadirupa Saha · Soham Dan

[ Auditorium 1 Foyer ]

We address the problem of Internal Regret in adversarial Sleeping Bandits and the relationship between different notions of sleeping regrets in multi-armed bandits. We propose a new concept called Internal Regret for sleeping multi-armed bandits (MAB) and present an algorithm that achieves sublinear Internal Regret, even when losses and availabilities are both adversarial. We demonstrate that a low internal regret leads to both low external regret and low policy regret for i.i.d. losses. Our contribution is unifying existing notions of regret in sleeping bandits and exploring their implications for each other. In addition, we extend our results to Dueling Bandits (DB), a preference feedback version of multi-armed bandits, and design a low-regret algorithm for sleeping dueling bandits with stochastic preferences and adversarial availabilities. We validate the effectiveness of our algorithms through empirical evaluations.

Poster
Yunzhe Zhou · Zhengling Qi · Chengchun Shi · Lexin Li

[ Auditorium 1 Foyer ]

In this article, we propose a novel pessimism-based Bayesian learning method for optimal dynamic treatment regimes in the offline setting. When the coverage condition does not hold, which is common for offline data, the existing solutions would produce sub-optimal policies. The pessimism principle addresses this issue by discouraging recommendation of actions that are less explored conditioning on the state. However, nearly all pessimism-based methods rely on a key hyper-parameter that quantifies the degree of pessimism, and the performance of the methods can be highly sensitive to the choice of this parameter. We propose to integrate the pessimism principle with Thompson sampling and Bayesian machine learning for optimizing the degree of pessimism. We derive a credible set whose boundary uniformly lower bounds the optimal Q-function, and thus we do not require additional tuning of the degree of pessimism. We develop a general Bayesian learning method that works with a range of models, from Bayesian linear basis model to Bayesian neural network model. We develop the computational algorithm based on variational inference, which is highly efficient and scalable. We establish the theoretical guarantees of the proposed method, and show empirically that it outperforms the existing state-of-the-art solutions through both simulations and a …

Poster
Koji Tabata · Junpei Komiyama · Atsuyoshi Nakamura · Tamiki Komatsuzaki

[ Auditorium 1 Foyer ]

The classification bandit problem aims to determine whether a set of given $K$ arms contains at least $L$ good arms or not. Here, an arm is said to be good if its expected reward is no less than a specified threshold.To solve this problem, we introduce an asymptotically optimal algorithm, named P-tracking, based on posterior sampling. Unlike previous asymptotically optimal algorithms that require solving a linear programming problem with an exponentially large number of constraints, P-tracking solves an equivalent optimization problem that can be computed in time linear in $K$. Additionally, unlike existing algorithms, P-tracking does not require forced exploration steps. Empirical results show that P-tracking outperforms existing algorithms in sample efficiency.
Poster
Jung-hun Kim · Se-Young Yun · Minchan Jeong · Junhyun Nam · Jinwoo Shin · Richard Combes

[ Auditorium 1 Foyer ]

We study contextual linear bandit problems under feature uncertainty; they are noisy with missing entries. To address the challenges of the noise, we analyze Bayesian oracles given observed noisy features. Our Bayesian analysis finds that the optimal hypothesis can be far from the underlying realizability function, depending on the noise characteristics, which are highly non-intuitive and do not occur for classical noiseless setups. This implies that classical approaches cannot guarantee a non-trivial regret bound. Therefore, we propose an algorithm that aims at the Bayesian oracle from observed information under this model, achieving $\tilde{O}(d\sqrt{T})$ regret bound when there is a large number of arms. We demonstrate the proposed algorithm using synthetic and real-world datasets.
Poster
Nishant Mehta · Junpei Komiyama · Vamsi Potluru · Andrea Nguyen · Mica Grant-Hagen

[ Auditorium 1 Foyer ]

We introduce the thresholded linear bandit problem, a novel sequential decision making problem at the interface of structured stochastic multi-armed bandits and learning halfspaces. The set of arms is $[0, 1]^d$, the expected Bernoulli reward is piecewise constant with a jump at a separating hyperplane, and each arm is associated with a cost that is a positive linear combination of the arm's components. This problem is motivated by several practical applications. For instance, imagine tuning the continuous features of an offer to a consumer; higher values incur higher cost to the vendor but result in a more attractive offer. At some threshold, the offer is attractive enough for a random consumer to accept at the higher probability level. For the one-dimensional case, we present Leftist, which enjoys $\log^2 T$ problem-dependent regret in favorable cases and has $\log(T) \sqrt{T}$ worst-case regret; we also give a lower bound suggesting this is unimprovable. We then present MD-Leftist, our extension of Leftist to the multi-dimensional case, which obtains similar regret bounds but with $d^{2.5} \log d$ and $d^{1.5} \log d$ dependence on dimension for the two types of bounds respectively. Finally, we experimentally evaluate Leftist.
Poster
Runzhe Wan · Lin Ge · Rui Song

[ Auditorium 1 Foyer ]

Online learning in large-scale structured bandits is known to be challenging due to the curse of dimensionality. In this paper, we propose a unified meta-learning framework for a wide class of structured bandit problems where the parameter space can be factorized to item-level, which covers many popular tasks. Compared with existing approaches, the proposed solution is both scalable to large systems and robust by utilizing a more flexible model. At the core of this framework is a Bayesian hierarchical model that allows information sharing among items via their features, upon which we design a meta Thompson sampling algorithm. Three representative examples are discussed thoroughly. Theoretical analysis and extensive numerical results both support the usefulness of the proposed method.

Poster
Sujay Bhatt · Guanhua Fang · Ping Li

[ Auditorium 1 Foyer ]

Piecewise stationary stochastic multi-armed bandits have been extensively explored in the risk-neutral and sub-Gaussian setting. In this work, we consider a multi-armed bandit framework in which the reward distributions are heavy-tailed and non-stationary, and evaluate the performance of algorithms using general risk criteria. Specifically, we make the following contributions: (i) We first propose a non-parametric change detection algorithm that can detect general distributionalchanges in heavy-tailed distributions. (ii)We then propose a truncation-based UCB-type bandit algorithm integrating the above regime change detection algorithm to minimize the regret of the non-stationary learning problem. (iii) Finally, we establish the regret bounds for the proposed bandit algorithm by characterizing the statistical properties of the general change detection algorithm, along with a novel regret analysis.

Poster
Benjamin Howson · Ciara Pike-Burke · Sarah Filippi

[ Auditorium 1 Foyer ]

There are many algorithms for regret minimisation in episodic reinforcement learning. This problem is well-understood from a theoretical perspective, providing that the sequences of states, actions and rewards associated with each episode are available to the algorithm updating the policy immediately after every interaction with the environment. However, feedback is almost always delayed in practice. In this paper, we study the impact of delayed feedback in episodic reinforcement learning from a theoretical perspective and propose two general-purpose approaches to handling the delays. The first involves updating as soon as new information becomes available, whereas the second waits before using newly observed information to update the policy. For the class of optimistic algorithms and either approach, we show that the regret increases by an additive term involving the number of states, actions, episode length, the expected delay and an algorithm-dependent constant. We empirically investigate the impact of various delay distributions on the regret of optimistic algorithms to validate our theoretical results.

Poster
Jia-Wei Shan · Peng Zhao · Zhi-Hua Zhou

[ Auditorium 1 Foyer ]

Performative prediction is a framework to capture the endogenous distribution changes resulting from the reactions of deployed environments to the learner's decision. Existing results require that the collected data are sampled from the clean observed distribution. However, this is often not the case in real-world applications, and even worse, data collected in open environments may include corruption due to various undesirable factors. In this paper, we study the entanglement of endogenous distribution change and corruption in open environments, where data are obtained from a corrupted decision-dependent distribution. The central challenge in this problem is the entangling effects between changing distributions and corruptions, which impede the use of effective gradient-based updates. To overcome this difficulty, we propose a novel recursive formula that decouples the two sources of effects, which allows us to further exploit suitable techniques for handling two decoupled effects and obtaining favorable guarantees. Theoretically, we prove that our proposed algorithm converges to the desired solution under corrupted observations, and simultaneously it can retain a competitive rate in the uncorrupted case. Experimental results also support our theoretical findings.

Poster
Jianyu Xu · Dan Qiao · Yu-Xiang Wang

[ Auditorium 1 Foyer ]

We study the problem of online dynamic pricing with two types of fairness constraints: a "procedural fairness" which requires the "proposed" prices to be equal in expectation among different groups, and a "substantive fairness" which requires the "accepted" prices to be equal in expectation among different groups. A policy that is simultaneously procedural and substantive fair is referred to as "doubly fair". We show that a doubly fair policy must be random to have higher revenue than the best trivial policy that assigns the same price to different groups. In a two-group setting, we propose an online learning algorithm for the 2-group pricing problems that achieves $\tilde{O}(\sqrt{T})$ regret, zero procedural unfairness and $\tilde{O}(\sqrt{T})$ substantive unfairness over $T$ rounds of learning. We also prove two lower bounds showing that these results on regret and unfairness are both information-theoretically optimal up to iterated logarithmic factors. To the best of our knowledge, this is the first dynamic pricing algorithm that learns to price while satisfying two fairness constraints at the same time.
Poster
Michael Poli · Stefano Massaroli · Stefano Ermon · Bryan Wilder · Eric Horvitz

[ Auditorium 1 Foyer ]

We present a methodology for formulating simplifying abstractions in machine learning systems by identifying and harnessing the utility structure of decisions. Machine learning tasks commonly involve high-dimensional output spaces (e.g., predictions for every pixel in an image or node in a graph), even though a coarser output would often suffice for downstream decision-making (e.g., regions of an image instead of pixels). Developers often hand-engineer abstractions of the output space, but numerous abstractions are possible and it is unclear how the choice of output space for a model impacts its usefulness in downstream decision-making. We propose a method that configures the output space automatically in order to minimize the loss of decision-relevant information. Taking a geometric perspective, we formulate a step of the algorithm as a projection of the probability simplex --- termed fold --- that minimizes the total loss of decision-related information in the H-entropy sense. Crucially, learning in the abstracted outcome space requires significantly less data, leading to a net improvement in decision quality. We demonstrate the method in two domains: data acquisition for deep neural network training and a closed-loop wildfire management task.

Poster
Jerry Chee · Hwanwoo Kim · Panos Toulis

[ Auditorium 1 Foyer ]

In this paper, we develop a statistical inference procedure using stochastic gradient descent (SGD)-based confidence intervals. These intervals are of the simplest possible form: θ{N,j} ± 2√(γ/N) , where θN is the SGD estimate of model parameters θ over N data points, and γ is the learning rate. This construction relies only on a proper selection of the learning rate to ensure the standard SGD conditions for O(1/n) convergence. The procedure performs well in all our empirical evaluations, achieving near-nominal coverage intervals scaling up to 20× as many parameters as other SGD-based inference methods. We demonstrate our method’s practical significance on modeling adverse events in emergency general surgery patients using a novel dataset from the Hospital of the University of Pennsylvania. Our code is available on \href{https://github.com/jerry-chee/SGDInference}{GitHub}.

Poster
Negin Golrezaei · Patrick Jaillet · Jason Cheuk Nam Liang · Vahab Mirrokni

[ Auditorium 1 Foyer ]

Internet advertisers (buyers) repeatedly procure ad impressions from ad platforms (sellers) with the aim to maximize total conversion (i.e. ad value) while respecting both budget and return-on-investment (ROI) constraints for efficient utilization of limited monetary resources. Facing such a constrained buyer who aims to learn her optimal strategy to acquire impressions, we study from a seller's perspective how to learn and price ad impressions through repeated posted price mechanisms to maximize revenue. For this two-sided learning setup, we propose a learning algorithm for the seller that utilizes an episodic binary-search procedure to identify a revenue-optimal selling price. We show that such a simple learning algorithm enjoys low seller regret when within each episode, the budget and ROI constrained buyer approximately best responds to the posted price. We present simple yet natural buyer's bidding algorithms under which the buyer approximately best responds while satisfying budget and ROI constraints, leading to a low regret for our proposed seller pricing algorithm. The design of our seller algorithm is motivated by the fact that the seller's revenue function admits a bell-shaped structure when the buyer best responds to prices under budget and ROI constraints, enabling our seller algorithm to identify revenue-optimal selling prices efficiently.

Poster
El-Mahdi El-Mhamdi · Sadegh Farhadkhani · Rachid Guerraoui · Lê-Nguyên Hoang

[ Auditorium 1 Foyer ]

The geometric median, an instrumental component of the secure machine learning toolbox, is known to be effective when robustly aggregating models (or gradients), gathered from potentially malicious (or strategic) users. What is less known is the extent to which the geometric median incentivizes dishonest behaviors. This paper addresses this fundamental question by quantifying its strategyproofness. While we observe that the geometric median is not even approximately strategyproof, we prove that it is asymptotically $\alpha$-strategyproof: when the number of users is large enough, a user that misbehaves can gain at most a multiplicative factor $\alpha$, which we compute as a function of the distribution followed by the users. We then generalize our results to the case where users actually care more about specific dimensions, determining how this impacts $\alpha$. We also show how the skewed geometric medians can be used to improve strategyproofness.
Poster
Lucas CLARTE · Bruno Loureiro · Florent Krzakala · Lenka Zdeborova

[ Auditorium 1 Foyer ]

Uncertainty quantification is a central challenge in reliable and trustworthy machine learning. Naive measures such as last-layer scores are well-known to yield overconfident estimates in the context of overparametrized neural networks. Several methods, ranging from temperature scaling to different Bayesian treatments of neural networks, have been proposed to mitigate overconfidence, most often supported by the numerical observation that they yield better calibrated uncertainty measures. In this work, we provide a sharp comparison between popular uncertainty measures for binary classification in a mathematically tractable model for overparametrized neural networks: the random features model. We discuss a trade-off between classification accuracy and calibration, unveiling a double descent behavior in the calibration curve of optimally regularised estimators as a function of overparametrization. This is in contrast with the empirical Bayes method, which we show to be well calibrated in our setting despite the higher generalization error and overparametrization.

Poster
Austin J. Stromme

[ Auditorium 1 Foyer ]

The Schrödinger bridge is a stochastic process that finds the most likely coupling of two measures with respect to Brownian motion, and is equivalent to the popular entropically regularized optimal transport problem. Motivated by recent applications of the Schrödinger bridge to trajectory reconstruction problems, we study the problem of sampling from a Schrödinger bridge in high dimensions. We assume sample access to the marginals of the Schrödinger bridge process and prove that the natural plug-in sampler achieves a fast statistical rate of estimation for the population bridge in terms of relative entropy. This sampling procedure is given by computing the entropic OT plan between samples from each marginal, and joining a draw from this plan with a Brownian bridge. We apply this result to construct a new and computationally feasible estimator that yields improved rates for entropic optimal transport map estimation.

Poster
Spencer Compton · Dmitriy Katz · Benjamin Qi · Kristjan Greenewald · Murat Kocaoglu

[ Auditorium 1 Foyer ]

Given a set of discrete probability distributions, the minimum entropy coupling is the minimum entropy joint distribution that has the input distributions as its marginals. This has immediate relevance to tasks such as entropic causal inference for causal graph discovery and bounding mutual information between variables that we observe separately. Since finding the minimum entropy coupling is NP-Hard, various works have studied approximation algorithms. The work of [Compton, ISIT 2022] shows that the greedy coupling algorithm of [Kocaoglu et al., AAAI 2017] is always within log2(e) ≈ 1.44 bits of the optimal coupling. Moreover, they show that it is impossible to obtain a better approximation guarantee using the majorization lower-bound that all prior works have used: thus establishing a majorization barrier. In this work, we break the majorization barrier by designing a stronger lower-bound that we call the profile method. Using this profile method, we are able to show that the greedy algorithm is always within log2(e)/e ≈ 0.53 bits of optimal for coupling two distributions (previous best-known bound is within 1 bit), and within (1 + log_2(e))/2 ≈ 1.22 bits for coupling any number of distributions (previous best-known bound is within 1.44 bits). We also examine a …

Poster
Alankrita Bhatt · Jongha Jon Ryu · Young-Han Kim

[ Auditorium 1 Foyer ]

A new portfolio selection strategy that adapts to a continuous side-information sequence is presented, with a universal wealth guarantee against a class of state-constant rebalanced portfolios with respect to a state function that maps each side-information symbol to a finite set of states. In particular, given that a state function belongs to a collection of functions of finite Natarajan dimension, the proposed strategy is shown to achieve, asymptotically to first order in the exponent, the same wealth as the best state-constant rebalanced portfolio with respect to the best state function, chosen in hindsight from observed market. This result can be viewed as an extension of the seminal work of Cover and Ordentlich (1996) that assumes a single-state function.

Poster
Adrian N. Bishop · Edwin V. Bonilla

[ Auditorium 1 Foyer ]

We consider the Bayesian optimal filtering problem: i.e. estimating some conditional statistics of a latent time-series signal from an observation sequence. Classical approaches often rely on the use of assumed or estimated transition and observation models. Instead, we formulate a generic recurrent neural network framework and seek to learn directly a recursive mapping from observational inputs to the desired estimator statistics. The main focus of this article is the approximation capabilities of this framework. We provide approximation error bounds for filtering in general non-compact domains. We also consider strong time-uniform approximation error bounds that guarantee good long-time performance. We discuss and illustrate a number of practical concerns and implications of these results.

Poster
Zhaozhuo Xu · Zhao Song · Anshumali Shrivastava

[ Auditorium 1 Foyer ]

Markov Decision Process (MDP) with large action space naturally occurs in many applications such as language processing, information retrieval, and recommendation system. There have been various approaches to solve these MDPs through value iteration (VI). Unfortunately, all VI algorithms require expensive linear scans over the entire action space for value function estimation during each iteration. To this end, we present two provable Least-Squares Value Iteration (LSVI) algorithms with runtime complexity sublinear in the number of actions for linear MDPs. We formulate the value function estimation procedure in VI as an approximate maximum inner product search problem and propose a Locality Sensitive Hashing (LSH) type data structure to solve this problem with sublinear time complexity. Our major contribution is combining the guarantees of approximate maximum inner product search with the regret analysis of reinforcement learning. We prove that, with the appropriate choice of approximation factor, there exists a sweet spot. Our proposed Sublinear LSVI algorithms maintain the same regret as the original LSVI algorithms while reducing the runtime complexity to sublinear in the number of actions. To the best of our knowledge, this is the first work that combines LSH with reinforcement learning that results in provable improvements. We hope that …

Poster
Jillian Fisher · Lang Liu · Krishna Pillutla · Yejin Choi · Zaid Harchaoui

[ Auditorium 1 Foyer ]

Influence diagnostics such as influence functions and approximate maximum influence perturbations are popular in machine learning and in AI domain applications. Influence diagnostics are powerful statistical tools to identify influential datapoints or subsets of datapoints. We establish finite-sample statistical bounds, as well as computational complexity bounds, for influence functions and approximate maximum influence perturbations using efficient inverse-Hessian-vector product implementations. We illustrate our results with generalized linear models and large attention based models on synthetic and real data.

Poster
Raef Bassily · Mehryar Mohri · Ananda Theertha Suresh

[ Auditorium 1 Foyer ]

A key problem in a variety of applications is that of domain adaptation from a public source domain, for which a relatively large amount of labeled data with no privacy constraints is at one's disposal, to a private target domain, for which a private sample is available with very few or no labeled data. In regression problems, where there are no privacy constraints on the source or target data, a discrepancy minimization approach was shown to outperform a number of other adaptation algorithm baselines. Building on that approach, we initiate a principled study of differentially private adaptation from a source domain with public labeled data to a target domain with unlabeled private data. We design differentially private discrepancy-based adaptation algorithms for this problem. The design and analysis of our private algorithms critically hinge upon several key properties we prove for a smooth approximation of the weighted discrepancy, such as its smoothness with respect to the $\ell_1$-norm and the sensitivity of its gradient. We formally show that our adaptation algorithms benefit from strong generalization and privacy guarantees.
Poster
Qin-Cheng Zheng · Shen-Huan Lyu · Shao-Qun Zhang · Yuan Jiang · Zhi-Hua Zhou

[ Auditorium 1 Foyer ]

Decision tree learning algorithms such as CART are generally based on heuristics that maximizes the purity gain greedily. Though these algorithms are practically successful, theoretical properties such as consistency are far from clear. In this paper, we discover that the most serious obstacle encumbering consistency analysis for decision tree learning algorithms lies in the fact that the worst-case purity gain, i.e., the core heuristics for tree splitting, can be zero. Based on this recognition, we present a new algorithm, named Grid Classification And Regression Tree (GridCART), with a provable consistency rate $\mathcal{O}(n^{-1 / (d + 2)})$, which is the first consistency rate proved for heuristic tree learning algorithms.
Poster
Qian Yu · Yining Wang · Baihe Huang · Qi Lei · Jason Lee

[ Auditorium 1 Foyer ]

Optimization of smooth reward functions under bandit feedback is a long-standing problem in online learning. This paper approaches this problem by studying the convergence under smoothness and Kurdyka-Lojasiewicz conditions. We designed a search-based algorithm that achieves an improved rate compared to the standard gradient-based method. In conjunction with a matching lower bound, this algorithm shows optimality in the dependence on precision for the low-dimension regime.

Poster
Felix Biggs · Benjamin Guedj

[ Auditorium 1 Foyer ]

We introduce a modified version of the excess risk, which can be used to obtain empirically tighter, faster-rate PAC-Bayesian generalisation bounds. This modified excess risk leverages information about the relative hardness of data examples to reduce the variance of its empirical counterpart, tightening the bound. We combine this with a new bound for [−1, 1]-valued (and potentially non-independent) signed losses, which is more favourable when they empirically have low variance around 0. The primary new technical tool is a novel result for sequences of interdependent random vectors which may be of independent interest. We empirically evaluate these new bounds on a number of real-world datasets.

Poster
Sebastian G. Gruber · Florian Buettner

[ Auditorium 1 Foyer ]

Reliably estimating the uncertainty of a prediction throughout the model lifecycle is crucial in many safety-critical applications.    The most common way to measure this uncertainty is via the predicted confidence.    While this tends to work well for in-domain samples, these estimates are unreliable under domain drift and restricted to classification.    Alternatively, proper scores can be used for most predictive tasks but a bias-variance decomposition for model uncertainty does not exist in the current literature.    In this work we introduce a general bias-variance decomposition for proper scores, giving rise to the Bregman Information as the variance term.    We discover how exponential families and the classification log-likelihood are special cases and provide novel formulations.    Surprisingly, we can express the classification case purely in the logit space.    We showcase the practical relevance of this decomposition on several downstream tasks, including model ensembles and confidence regions.    Further, we demonstrate how different approximations of the instance-level Bregman Information allow reliable out-of-distribution detection for all degrees of domain drift.
Poster
Matthew Holland

[ Auditorium 1 Foyer ]

Many novel notions of "risk" (e.g., CVaR, tilted risk, DRO risk) have been proposed and studied, but these risks are all at least as sensitive as the mean to loss tails on the upside, and tend to ignore deviations on the downside. We study a complementary new risk class that penalizes loss deviations in a bi-directional manner, while having more flexibility in terms of tail sensitivity than is offered by mean-variance. This class lets us derive high-probability learning guarantees without explicit gradient clipping, and empirical tests using both simulated and real data illustrate a high degree of control over key properties of the test loss distribution of gradient-based learners.

Poster
Yulai Zhao · Jianshu Chen · Simon Du

[ Auditorium 1 Foyer ]

This paper presents a new statistical analysis aiming to explain the recent superior achievements of the pre-training techniques in natural language processing (NLP).We prove that when the classes of the pre-training task (e.g., different words in the masked language model task) are sufficiently diverse, in the sense that the least singular value of the last linear layer in pre-training (denoted as $\tilde{\nu}$) is large, then pre-training can significantly improve the sample efficiency of downstream tasks.Specially, we show the transfer learning excess risk enjoys an $O\left(\frac{1}{\tilde{\nu} \sqrt{n}}\right)$ rate, in contrast to the $O\left(\frac{1}{\sqrt{m}}\right)$ rate in the standard supervised learning. Here, $n$ is the number of pre-training data and $m$ is the number of data in the downstream task, and typically $n \gg m$.Our proof relies on a vector-form Rademacher complexity chain rule for disassembling composite function classes and a modified self-concordance condition.These techniques can be of independent interest.
Poster
Alvaro Correia · Daniel Worrall · Roberto Bondesan

[ Auditorium 1 Foyer ]

Simulated annealing (SA) is a stochastic global optimisation metaheuristic applicable to a wide range of discrete and continuous variable problems. Despite its simplicity, SA hinges on carefully handpicked components, viz. proposal distribution and annealing schedule, that often have to be fine tuned to individual problem instances. In this work, we seek to make SA more effective and easier to use by framing its proposal distribution as a reinforcement learning policy that can be optimised for higher solution quality given a computational budget. The result is Neural SA, a competitive and general machine learning method for combinatorial optimisation that is efficient, and easy to design and train. We show Neural SA with such a learnt proposal distribution, parametrised by small equivariant neural networks, outperforms SA baselines on several problems: Rosenbrock's function and the Knapsack, Bin Packing and Travelling Salesperson problems. We also show Neural SA scales well to large problems (generalising to much larger instances than those seen during training) while getting comparable performance to popular off-the-shelf solvers and machine learning methods in terms of solution quality and wall-clock time.

Poster
Joachim Spoerhase · Kamyar Khodamoradi · Benedikt Riegel · Bruno Ordozgoiti · Aristides Gionis

[ Auditorium 1 Foyer ]

In the reconciliation $k$-median problem we ask to cluster a set of data points by picking $k$ cluster centers so as to minimize the sum of distances of the data points to their cluster centers plus the sum of pairwise distances between the centers. The problem, which is a variant of classic $k$-median, aims to find a set of cluster centers that are not too far from each other, and it has applications, or example, when selecting a committee to deliberate on a controversial topic. This problem was introduced recently~\citep{ordozgoiti2019reconciliation}, and it was shown that a local-search-based algorithm is always within a factor $O(k)$ of an optimum solution and performs well in practice. In this paper, we demonstrate a close connection of reconciliation $k$-median to a variant of the $k$-facility location problem, in which each potential cluster center has an individual opening cost and we aim at minimizing the sum of client-center distances and the opening costs. This connection enables us to provide a new algorithm for reconciliation $k$-median that yields a constant-factor approximation (independent of $k$). We also provide a sparsification scheme that reduces the number of potential cluster centers to $O(k)$ in order to substantially speed up approximation …
Poster
Victoria Crawford

[ Auditorium 1 Foyer ]

In this paper, we consider the optimization problem \scpl (\scp), which is to find a minimum cost subset of a ground set $U$ such that the value of a submodular function $f$ is above a threshold $\tau$. In contrast to most existing work on \scp, it is not assumed that $f$ is monotone. Two bicriteria approximation algorithms are presented for \scp that, for input parameter $0 < \epsilon < 1$, give $O( 1 / \epsilon^2 )$ ratio to the optimal cost and ensures the function $f$ is at least $\tau(1 - \epsilon)/2$. A lower bound shows that under the value query model shows that no polynomial-time algorithm can ensure that $f$ is larger than $\tau/2$. Further, the algorithms presented are scalable to large data sets, processing the ground set in a stream. Similar algorithms developed for \scp also work for the related optimization problem of \smpl (\smp). Finally, the algorithms are demonstrated to be effective in experiments involving graph cut and data summarization functions.
Poster
Erfan Yazdandoost Hamedani · Afrooz Jalilzadeh · Necdet Serhat Aybat

[ Auditorium 1 Foyer ]

In this paper we propose a class of randomized primal-dual methods incorporating line search to contend with large-scale saddle point~(SP) problems defined by a convex-concave function $\mathcal L(\bx,y)\triangleq \sum_{i=1}^M f_i(x_i)+\Phi(\bx,y)-h(y)$. We analyze the convergence rate of the proposed method under mere convexity and strong convexity assumptions of $\mathcal L$ in $\bf x$-variable. In particular, assuming $\nabla_y\Phi(\cdot,\cdot)$ is Lipschitz and $\nabla_{\bf x}\Phi(\cdot,y)$ is coordinate-wise Lipschitz for any fixed $y$, the ergodic sequence generated by the algorithm achieves the $\mathcal O(M/k)$ convergence rate in the expected primal-dual gap.Furthermore, assuming that $\mathcal L(\cdot,y)$ is strongly convex for any $y$, and that $\Phi({\bf x},\cdot)$ is affine for any $\bf x$, the scheme enjoys a faster rate of $\mathcal O(M/k^2)$ in terms of primal solution suboptimality. We implemented the proposed algorithmic framework to solve kernel matrix learning problem, and tested it against other state-of-the-art first-order methods.
Poster
Johan Larsson · Quentin Klopfenstein · Mathurin Massias · Jonas Wallin

[ Auditorium 1 Foyer ]

The lasso is the most famous sparse regression and feature selection method. One reason for its popularity is the speed at which the underlying optimization problem can be solved. Sorted L-One Penalized Estimation (SLOPE) is a generalization of the lasso with appealing statistical properties. In spite of this, the method has not yet reached widespread interest. A major reason for this is that current software packages that fit SLOPE rely on algorithms that perform poorly in high dimensions. To tackle this issue, we propose a new fast algorithm to solve the SLOPE optimization problem, which combines proximal gradient descent and proximal coordinate descent steps. We provide new results on the directional derivative of the SLOPE penalty and its related SLOPE thresholding operator, as well as provide convergence guarantees for our proposed solver. In extensive benchmarks on simulated and real data, we demonstrate our method's performance against a long list of competing algorithms.

Poster
Sasila ILANDARIDEVA · Yannis Bekri · Anatoli Juditsky · Vianney Perchet

[ Auditorium 1 Foyer ]

We discuss an application of Stochastic Approximation to statistical estimation of high-dimensional sparse parameters. The proposed solution reduces to resolving a penalized stochastic optimization problem on each stage of a multistage algorithm; each problem being solved to a prescribed accuracy by the non-Euclidean Composite Stochastic Mirror Descent (CSMD) algorithm. Assuming that the problem objective is smooth and quadratically minorated and stochastic perturbations are sub-Gaussian, our analysis prescribes the method parameters which ensure fast convergence of the estimation error (the radius of a confidence ball of a given norm around the approximate solution). This convergence is linear during the first "preliminary’’ phase of the routine and is sublinear during the second "asymptotic" phase.We consider an application of the proposed approach to sparse Generalized Linear Regression problem. In this setting, we show that the proposed algorithm attains the optimal convergence of the estimation error under weak assumptions on the regressor distribution. We also present a numerical study illustrating the performance of the algorithm on high-dimensional simulation data.

Poster
Aleksandr Beznosikov · Eduard Gorbunov · Hugo Berard · Nicolas Loizou

[ Auditorium 1 Foyer ]

Stochastic Gradient Descent-Ascent (SGDA) is one of the most prominent algorithms for solving min-max optimization and variational inequalities problems (VIP) appearing in various machine learning tasks. The success of the method led to several advanced extensions of the classical SGDA, including variants with arbitrary sampling, variance reduction, coordinate randomization, and distributed variants with compression, which were extensively studied in the literature, especially during the last few years. In this paper, we propose a unified convergence analysis that covers a large variety of stochastic gradient descent-ascent methods, which so far have required different intuitions, have different applications and have been developed separately in various communities. A key to our unified framework is a parametric assumption on the stochastic estimates. Via our general theoretical framework, we either recover the sharpest known rates for the known special cases or tighten them. Moreover, to illustrate the flexibility of our approach we develop several new variants of SGDA such as a new variance-reduced method (L-SVRGDA), new distributed methods with compression (QSGDA, DIANA-SGDA, VR-DIANA-SGDA), and a new method with coordinate randomization (SEGA-SGDA). Although the variants of these methods were known for the minimization problems, they were never considered for solving min-max problems and VIPs. We also …

Poster
Elias Wirth · Thomas Kerdreux · Sebastian Pokutta

[ Auditorium 1 Foyer ]

Frank-Wolfe algorithms (FW) are popular first-order methods for solving constrained convex optimization problems that rely on a linear minimization oracle instead of potentially expensive projection-like oracles. Many works have identified accelerated convergence rates under various structural assumptions on the optimization problem and for specific FW variants when using line-search or short-step, requiring feedback from the objective function. Little is known about accelerated convergence regimes when utilizing open-loop step-size rules, a.k.a. FW with pre-determined step-sizes, which are algorithmically extremely simple and stable. Not only is FW with open-loop step-size rules not always subject to the same convergence rate lower bounds as FW with line-search or short-step, but in some specific cases, such as kernel herding in infinite dimensions, it has been empirically observed that FW with open-loop step-size rules leads to faster convergence than FW with line-search or short-step. We propose a partial answer to this unexplained phenomenon in kernel herding, characterize a general setting for which FW with open-loop step-size rules converges non-asymptotically faster than with line-search or short-step, and derive several accelerated convergence results for FW with open-loop step-size rules.

Poster
Ruichen Jiang · Nazanin Abolfazli · Aryan Mokhtari · Erfan Yazdandoost Hamedani

[ Auditorium 1 Foyer ]

In this paper, we study a class of bilevel optimization problems, also known as simple bilevel optimization, where we minimize a smooth objective function over the optimal solution set of another convex constrained optimization problem. Several iterative methods have been developed for tackling this class of problems. Alas, their convergence guarantees are either asymptotic for the upper-level objective, or the convergence rates are slow and sub-optimal. To address this issue, in this paper, we introduce a novel bilevel optimization method that locally approximates the solution set of the lower-level problem via a cutting plane and then runs a conditional gradient update to decrease the upper-level objective. When the upper-level objective is convex, we show that our method requires ${\mathcal{O}}(\max\{1/\epsilon_f,1/\epsilon_g\})$ iterations to find a solution that is $\epsilon_f$-optimal for the upper-level objective and $\epsilon_g$-optimal for the lower-level objective. Moreover, when the upper-level objective is non-convex, our method requires ${\mathcal{O}}(\max\{1/\epsilon_f^2,1/(\epsilon_f\epsilon_g)\})$ iterations to find an $(\epsilon_f,\epsilon_g)$-optimal solution. We also prove stronger convergence guarantees under the H\"olderian error bound assumption on the lower-level problem. To the best of our knowledge, our method achieves the best-known iteration complexity for the considered class of bilevel problems.
Poster
Michael Sucker · Peter Ochs

[ Auditorium 1 Foyer ]

We apply the PAC-Bayes theory to the setting of learning-to-optimize. To the best of our knowledge, we present the first framework to learn optimization algorithms with provable generalization guarantees (PAC-bounds) and explicit trade-off between a high probability of convergence and a high convergence speed. Even in the limit case, where convergence is guaranteed, our learned optimization algorithms provably outperform related algorithms based on a (deterministic) worst-case analysis. Our results rely on PAC-Bayes bounds for general, unbounded loss-functions based on exponential families. By generalizing existing ideas, we reformulate the learning procedure into a one-dimensional minimization problem and study the possibility to find a global minimum, which enables the algorithmic realization of the learning procedure. As a proof-of-concept, we learn hyperparameters of standard optimization algorithms to empirically underline our theory.

Poster
Louis Leconte · Sholom Schechtman · Eric Moulines

[ Auditorium 1 Foyer ]

In this paper, we develop a new algorithm, Annealed Skewed SGD - AskewSGD - for training deep neural networks (DNNs) with quantized weights. First, we formulate the training of quantized neural networks (QNNs) as a smoothed sequence of interval-constrained optimization problems. Then, we propose a new first-order stochastic method, AskewSGD, to solve each constrained optimization subproblem. Unlike algorithms with active sets and feasible directions, AskewSGD avoids projections or optimization under the entire feasible set and allows iterates that are infeasible. The numerical complexity of AskewSGD is comparable to existing approaches for training QNNs, such as the straight-through gradient estimator used in BinaryConnect, or other state of the art methods (ProxQuant, LUQ).We establish convergence guarantees for AskewSGD (under general assumptions for the objective function). Experimental results show that the AskewSGD algorithm performs better than or on par with state of the art methods in classical benchmarks.

Poster
Quan Xiao · Han Shen · Wotao Yin · Tianyi Chen

[ Auditorium 1 Foyer ]

Bilevel optimization, which captures the inherent nested structure of machine learning problems, is gaining popularity in many recent applications. Existing works on bilevel optimization mostly consider either the unconstrained problems or the constrained upper-level problems. In this context, this paper considers the stochastic bilevel optimization problems with equality constraints in both upper and lower levels. By leveraging the special structure of the equality constraints problem, the paper first presents an alternating projected SGD approach to tackle this problem and establishes the $\tilde{\cal O}(\epsilon^{-2})$ sample and iteration complexity that matches the state-of-the-art complexity of ALSET \citep{chen2021closing} for stochastic unconstrained bilevel problems. To further save the cost of projection, the paper presents an alternating projected SGD approach with lazy projection and establishes the $\tilde{\cal O}(\epsilon^{-2}/T)$ upper-level and $\tilde{\cal O}(\epsilon^{-1.5}/T^{\frac{3}{4}})$ lower-level projection complexity of this new algorithm, where $T$ is the upper-level projection interval. Application to federated bilevel optimization has been presented to showcase the performance of our algorithms. Our resultsdemonstrate that equality-constrained bilevel optimization with strongly-convex lower-level problems can be solved as efficiently as stochastic single-level optimization problems.
Poster
Antonio Orvieto · Anant Raj · Hans Kersting · Francis Bach

[ Auditorium 1 Foyer ]

Injecting noise within gradient descent has several desirable features, such as smoothing and regularizing properties. In this paper, we investigate the effects of injecting noise before computing a gradient step. We demonstrate that small perturbations can induce explicit regularization for simple models based on the L1-norm, group L1-norms, or nuclear norms. However, when applied to overparametrized neural networks with large widths, we show that the same perturbations can cause variance explosion. To overcome this, we propose using independent layer-wise perturbations, which provably allow for explicit regularization without variance explosion. Our empirical results show that these small perturbations lead to improved generalization performance compared to vanilla gradient descent.

Poster
Qixin Zhang · Zengde Deng · Zaiyi Chen · Kuangqi Zhou · Haoyuan Hu · Yu Yang

[ Auditorium 1 Foyer ]

In this paper, we revisit the online non-monotone continuous DR-submodular maximization problem over a down-closed convex set, which finds wide real-world applications in the domain of machine learning, economics, and operations research. At first, we present the \textbf{Meta-MFW} algorithm achieving a $1/e$-regret of $O(\sqrt{T})$ at the cost of $T^{3/2}$ stochastic gradient evaluations per round. As far as we know, \textbf{Meta-MFW} is the first algorithm to obtain $1/e$-regret of $O(\sqrt{T})$ for the online non-monotone continuous DR-submodular maximization problem over a down-closed convex set. Furthermore, in sharp contrast with \textbf{ODC} algorithm~\citep{thang2021online}, \textbf{Meta-MFW} relies on the simple online linear oracle without discretization, lifting, or rounding operations. Considering the practical restrictions, we then propose the \textbf{Mono-MFW} algorithm, which reduces the per-function stochastic gradient evaluations from $T^{3/2}$ to 1 and achieves a $1/e$-regret bound of $O(T^{4/5})$. Next, we extend \textbf{Mono-MFW} to the bandit setting and propose the \textbf{Bandit-MFW} algorithm which attains a $1/e$-regret bound of $O(T^{8/9})$. To the best of our knowledge, \textbf{Mono-MFW} and \textbf{Bandit-MFW} are the first sublinear-regret algorithms to explore the one-shot and bandit setting for online non-monotone continuous DR-submodular maximization problem over a down-closed convex set, respectively. Finally, we conduct numerical experiments on both synthetic and real-world datasets to verify the effectiveness …
Poster
Xiaolu Wang · Yuchen Jiao · Hoi-To Wai · Yuantao Gu

[ Auditorium 1 Foyer ]

We consider the problem of distributed principal component analysis (PCA) where the data samples are dispersed across different agents. Despite the rich literature on this problem under various specific settings, there is still a lack of efficient algorithms that are amenable to decentralized and asynchronous implementations. In this paper, we extend the incremental aggregated gradient (IAG) method in convex optimization to the nonconvex PCA problems based on an Riemannian gradient-type method named IARG-PCA. The IARG-PCA method admits low per-iteration computational and communication cost and can be readily implemented in a decentralized and asynchronous manner. Moreover, we show that the IARG-PCA method converges linearly to the leading eigenvector of the sample covariance of the whole dataset with a constant step size. The iteration complexity coincides with the best-known result of the IAG method in terms of the linear dependence on the number of agents. Meanwhile, the communication complexity is much lower than the state-of-the-art decentralized PCA algorithms if the eigengap of the sample covariance is moderate. Numerical experiments on synthetic and real datasets show that our IARG-PCA method exhibits substantially lower communication cost and comparable computational cost compared with other existing algorithms.

Poster
Zhiqiang Xu

[ Auditorium 1 Foyer ]

We revisit the acceleration of the noise-tolerant power method for which, despite previous studies, the results remain unsatisfactory as they are either wrong or suboptimal, also lacking generality. In this work, we present a simple yet general and optimal analysis via noise-corrupted Chebyshev polynomials, which allows a larger iteration rank $p$ than the target rank $k$, requires less noise conditions in a new form, and achieves the optimal iteration complexity $\Theta\left(\sqrt{\frac{\lambda_{k}-\lambda_{q+1}}{\lambda_{k}}}\log\frac{1}{\epsilon}\right)$ for some $q$ satisfying $k\leq q\leq p$ in a certain regime of the momentum parameter. Interestingly, it shows dynamic dependence of the noise tolerance on the spectral gap, i.e., from linear at the beginning to square-root near convergence, while remaining commensurate with the previous in terms of overall tolerance. We relate our new form of noise norm conditions to the existing trigonometric one, which enables an improved analysis of generalized eigenspace computation and canonical correlation analysis. We conduct an extensive experimental study to showcase the great performance of the considered algorithm with a larger iteration rank $p>k$ across different applications.
Poster
Vincent Plassier · Eric Moulines · Alain Durmus

[ Auditorium 1 Foyer ]

This paper focuses on Bayesian inference in a federated learning context (FL). While several distributed MCMC algorithms have been proposed, few consider the specific limitations of FL such as communication bottlenecks and statistical heterogeneity. Recently, Federated Averaging Langevin Dynamics (FALD) was introduced, which extends the Federated Averaging algorithm to Bayesian inference. We obtain a novel tight non-asymptotic upper bound on the Wasserstein distance to the global posterior for FALD. This bound highlights the effects of statistical heterogeneity, which causes a drift in the local updates that negatively impacts convergence. We propose a new algorithm VR-FALD* that uses control variates to correct the client drift. We establish non-asymptotic bounds showing that VR-FALD* is not affected by statistical heterogeneity. Finally, we illustrate our results on several FL benchmarks for Bayesian inference.

Poster
Tristan Cinquin · Tammo Rukat · Philipp Schmidt · Martin Wistuba · Artur Bekasov

[ Auditorium 1 Foyer ]

Gradient boosting machines (GBMs) based on decision trees consistently demonstrate state-of-the-art results on regression and classification tasks with tabular data, often outperforming deep neural networks. However, these models do not provide well-calibrated predictive uncertainties, which prevents their use for decision making in high-risk applications. The Bayesian treatment is known to improve predictive uncertainty calibration, but previously proposed Bayesian GBM methods are either computationally expensive, or resort to crude approximations. Variational inference is often used to implement Bayesian neural networks, but is difficult to apply to GBMs, because the decision trees used as weak learners are non-differentiable. In this paper, we propose to implement Bayesian GBMs using variational inference with soft decision trees, a fully differentiable alternative to standard decision trees introduced by Irsoy et al. Our experiments demonstrate that variational soft trees and variational soft GBMs provide useful uncertainty estimates, while retaining good predictive performance. The proposed models show higher test likelihoods when compared to the state-of-the-art Bayesian GBMs in 7/10 tabular regression datasets and improved out-of-distribution detection in 5/10 datasets.

Poster
Caleb Dahlke · Sue Zheng · Jason Pacheco

[ Auditorium 1 Foyer ]

Computing mutual information (MI) of random variables lacks a closed-form in nontrivial models. Variational MI approximations are widely used as flexible estimators for this purpose, but computing them typically requires solving a costly nonconvex optimization. We prove that a widely used class of variational MI estimators can be solved via moment matching operations in place of the numerical optimization methods that are typically required. We show that the same moment matching solution yields variational estimates for so-called "implicit" models that lack a closed form likelihood function. Furthermore, we demonstrate that this moment matching solution has multiple orders of magnitude computational speed up compared to the standard optimization based solutions. We show that theoretical results are supported by numerical evaluation in fully parameterized Gaussian mixture models and a generalized linear model with implicit likelihood due to nuisance variables. We also demonstrate on the implicit simulation-based likelihood SIR epidemiology model, where we avoid costly likelihood free inference and observe many orders of magnitude speedup.

Poster
XIANGMING MENG · Tomoyuki Obuchi · Yoshiyuki Kabashima

[ Auditorium 1 Foyer ]

We theoretically analyze the model selection consistency of least absolute shrinkage and selection operator (Lasso), both with and without post-thresholding, for high-dimensional Ising models. For random regular (RR) graphs of size $p$ with regular node degree $d$ and uniform couplings $\theta_0$, it is rigorously proved that Lasso without post-thresholding is model selection consistent in the whole paramagnetic phase with the same order of sample complexity $n=\Omega{(d^3\log{p})}$ as that of $\ell_1$-regularized logistic regression ($\ell_1$-LogR). This result is consistent with the conjecture in Meng, Obuchi, and Kabashima 2021 using the non-rigorous replica method from statistical physics and thus complements it with a rigorous proof. For general tree-like graphs, it is demonstrated that the same result as RR graphs can be obtained under mild assumptions of the dependency condition and incoherence condition. Moreover, we provide a rigorous proof of the model selection consistency of Lasso with post-thresholding for general tree-like graphs in the paramagnetic phase without further assumptions on the dependency and incoherence conditions. Experimental results agree well with our theoretical analysis.
Poster
Petrus Mikkola · Julien Martinelli · Louis Filstroff · Samuel Kaski

[ Auditorium 1 Foyer ]

Bayesian optimization (BO) is a powerful framework for optimizing black-box, expensive-to-evaluate functions. Over the past decade, many algorithms have been proposed to integrate cheaper, lower-fidelity approximations of the objective function into the optimization process, with the goal of converging towards the global optimum at a reduced cost. This task is generally referred to as multi-fidelity Bayesian optimization (MFBO). However, MFBO algorithms can lead to higher optimization costs than their vanilla BO counterparts, especially when the low-fidelity sources are poor approximations of the objective function, therefore defeating their purpose. To address this issue, we propose rMFBO (robust MFBO), a methodology to make any GP-based MFBO scheme robust to the addition of unreliable information sources. rMFBO comes with a theoretical guarantee that its performance can be bound to its vanilla BO analog, with high controllable probability. We demonstrate the effectiveness of the proposed methodology on a number of numerical benchmarks, outperforming earlier MFBO methods on unreliable sources. We expect rMFBO to be particularly useful to reliably include human experts with varying knowledge within BO processes.

Poster
Jeremy Sellier · Petros Dellaportas

[ Auditorium 1 Foyer ]

We introduce a novel scheme for Bayesian inference on permanental processes which models the Poisson intensity as the square of a Gaussian process. Combining generalized kernels and a Fourier features-based representation of the Gaussian process with a Laplace approximation to the posterior, we achieve a fast and efficient inference that does not require numerical integration over the input space, allows kernel design and scales linearly with the number of events. Our method builds and improves upon the state-of-theart Laplace Bayesian point process benchmark of Walder and Bishop (2017), demonstrated on both synthetic, real-world temporal and large spatial data sets.

Poster
David Watson · Kristin Blesch · Jan Kapar · Marvin N. Wright

[ Auditorium 1 Foyer ]

We propose methods for density estimation and data synthesis using a novel form of unsupervised random forests. Inspired by generative adversarial networks, we implement a recursive procedure in which trees gradually learn structural properties of the data through alternating rounds of generation and discrimination. The method is provably consistent under minimal assumptions. Unlike classic tree-based alternatives, our approach provides smooth (un)conditional densities and allows for fully synthetic data generation. We achieve comparable or superior performance to state-of-the-art probabilistic circuits and deep learning models on various tabular data benchmarks while executing about two orders of magnitude faster on average. An accompanying $\texttt{R}$ package, $\texttt{arf}$, is available on $\texttt{CRAN}$.
Poster
Giovanni Luca Marchetti · Vladislav Polianskii · Anastasiia Varava · Florian T. Pokorny · Danica Kragic

[ Auditorium 1 Foyer ]

We introduce a non-parametric density estimator deemed Radial Voronoi Density Estimator (RVDE). RVDE is grounded in the geometry of Voronoi tessellations and as such benefits from local geometric adaptiveness and broad convergence properties. Due to its radial definition RVDE is continuous and computable in linear time with respect to the dataset size. This amends for the main shortcomings of previously studied VDEs, which are highly discontinuous and computationally expensive. We provide a theoretical study of the modes of RVDE as well as an empirical investigation of its performance on high-dimensional data. Results show that RVDE outperforms other non-parametric density estimators, including recently introduced VDEs.

Poster
Simon Bartels · Kristoffer Stensbo-Smidt · Pablo Moreno-Muñoz · Wouter Boomsma · Jes Frellsen · Soren Hauberg

[ Auditorium 1 Foyer ]

We present a method to approximate Gaussian process regression models to large datasets by considering only a subset of the data. Our approach is novel in that the size of the subset is selected on the fly during exact inference with little computational overhead. From an empirical observation that the log-marginal likelihood often exhibits a linear trend once a sufficient subset of a dataset has been observed, we conclude that many large datasets contain redundant information that only slightly affects the posterior. Based on this, we provide probabilistic bounds on the full model evidence that can identify such subsets. Remarkably, these bounds are largely composed of terms that appear in intermediate steps of the standard Cholesky decomposition, allowing us to modify the algorithm to adaptively stop the decomposition once enough data have been observed.

Poster
Sebastian Tay · Quoc Phong Nguyen · Chuan Sheng Foo · Bryan Kian Hsiang Low

[ Auditorium 1 Foyer ]

The Nash equilibrium (NE) is a classic solution concept for normal-form games that is stable under potential unilateral deviations by self-interested agents. Bayesian optimization (BO) has been used to find NE in continuous general-sum games with unknown costly-to-sample utility functions in a sample-efficient manner. This paper presents the first no-regret BO algorithm that is sample-efficient in finding pure NE by leveraging theory on high probability confidence bounds with Gaussian processes and the maximum information gain of kernel functions. Unlike previous works, our algorithm is theoretically guaranteed to converge to the optimal solution (i.e., NE). We also introduce the novel setting of applying BO to finding mixed NE in unknown discrete general-sum games and show that our theoretical framework is general enough to be extended naturally to this setting by developing a no-regret BO algorithm that is sample-efficient in finding mixed NE. We empirically show that our algorithms are competitive w.r.t. suitable baselines in finding NE.

Poster
Jake Cunningham · Daniel Augusto de Souza · So Takao · Mark van der Wilk · Marc Deisenroth

[ Auditorium 1 Foyer ]

Gaussian processes (GPs) are typically criticised for their unfavourable scaling in both computational and memory requirements. For large datasets, sparse GPs reduce these demands by conditioning on a small set of inducing variables designed to summarise the data. In practice however, for large datasets requiring many inducing variables, such as low-lengthscale spatial data, even sparse GPs can become computationally expensive, limited by the number of inducing variables one can use. In this work, we propose a new class of inter-domain variational GP, constructed by projecting a GP onto a set of compactly supported B-spline basis functions. The key benefit of our approach is that the compact support of the B-spline basis functions admits the use of sparse linear algebra to significantly speed up matrix operations and drastically reduce the memory footprint. This allows us to very efficiently model fast-varying spatial phenomena with tens of thousands of inducing variables, where previous approaches failed.

Poster
Aryan Deshwal · Sebastian Ament · Maximilian Balandat · Eytan Bakshy · Janardhan Rao Doppa · David Eriksson

[ Auditorium 1 Foyer ]

We consider the problem of optimizing expensive black-box functions over high-dimensional combinatorial spaces which arises in many science, engineering, and ML applications. We use Bayesian Optimization (BO) and propose a novel surrogate modeling approach for efficiently handling a large number of binary and categorical parameters. The key idea is to select a number of discrete structures from the input space (the dictionary) and use them to define an ordinal embedding for high-dimensional combinatorial structures. This allows us to use existing Gaussian process models for continuous spaces. We develop a principled approach based on binary wavelets to construct dictionaries for binary spaces, and propose a randomized construction method that generalizes to categorical spaces. We provide theoretical justification to support the effectiveness of the dictionary-based embeddings. Our experiments on diverse real-world benchmarks demonstrate the effectiveness of our proposed surrogate modeling approach over state-of-the-art BO methods.

Poster
Francois Bachoc · Louis Béthune · Alberto González Sanz · Jean-Michel Loubes

[ Auditorium 1 Foyer ]

We present a novel kernel over the space of probability measures based on the dual formulation of optimal regularized transport. We propose an Hilbertian embedding of the space of probabilities using their Sinkhorn potentials, which are solutions of the dual entropic relaxed optimal transport between the probabilities and a reference measure $\mathcal{U}$. We prove that this construction enables to obtain a valid kernel, by using the Hilbert norms. We prove that the kernel enjoys theoretical properties such as universality and some invariances, while still being computationally feasible. Moreover we provide theoretical guarantees on the behaviour of a Gaussian process based on this kernel. The empirical performances are compared with other traditional choices of kernels for processes indexed on distributions.
Poster
Arber Qoku · Florian Buettner

[ Auditorium 1 Foyer ]

Many real-world systems are described not only by data from a single source but via multiple data views. In genomic medicine, for instance, patients can be characterized by data from different molecular layers. Latent variable models with structured sparsity are a commonly used tool for disentangling variation within and across data views. However, their interpretability is cumbersome since it requires a direct inspection and interpretation of each factor from domain experts. Here, we propose MuVI, a novel multi-view latent variable model based on a modified horseshoe prior for modeling structured sparsity. This facilitates the incorporation of limited and noisy domain knowledge, thereby allowing for an analysis of multi-view data in an inherently explainable manner. We demonstrate that our model (i) outperforms state-of-the-art approaches for modeling structured sparsity in terms of the reconstruction error and the precision/recall, (ii) robustly integrates noisy domain expertise in the form of feature sets, (iii) promotes the identifiability of factors and (iv) infers interpretable and biologically meaningful axes of variation in a real-world multi-view dataset of cancer patients.

Poster
Yaxuan Zhu · Jianwen Xie · Ping Li

[ Auditorium 1 Foyer ]

We propose the NeRF-LEBM, a likelihoodbased top-down 3D-aware 2D image generative model that incorporates 3D representation via Neural Radiance Fields (NeRF) and 2D imaging process via differentiable volume rendering. The model represents an image as a rendering process from 3D object to 2D image and is conditioned on some latent variables that account for object characteristics and are assumed to follow informative trainable energy-based prior models. We propose two likelihood-based learning frameworks to train the NeRF-LEBM: (i) maximum likelihood estimation with Markov chain Monte Carlo-based inference and (ii) variational inference with the reparameterization trick. We study our modelsin the scenarios with both known and unknown camera poses. Experiments on several benchmark datasets demonstrate that the NeRF-LEBM can infer 3D object structures from 2D images, generate 2D images with novel views and objects, learn from incomplete 2D images, and learn from 2D images with known or unknown camera poses.

Poster
Muralikrishnna Guruswamy Sethuraman · Romain Lopez · Rahul Mohan · Faramarz Fekri · Tommaso Biancalani · Jan-Christian Huetter

[ Auditorium 1 Foyer ]

Learning causal relationships between variables is a well-studied problem in statistics, with many important applications in science. However, modeling real-world systems remain challenging, as most existing algorithms assume that the underlying causal graph is acyclic. While this is a convenient framework for developing theoretical developments about causal reasoning and inference, the underlying modeling assumption is likely to be violated in real systems, because feedback loops are common (e.g., in biological systems). Although a few methods search for cyclic causal models, they usually rely on some form of linearity, which is also limiting, or lack a clear underlying probabilistic model. In this work, we propose a novel framework for learning nonlinear cyclic causal graphical models from interventional data, called NODAGS-Flow. We perform inference via direct likelihood optimization, employing techniques from residual normalizing flows for likelihood estimation. Through synthetic experiments and an application to single-cell high-content perturbation screening data, we show significant performance improvements with our approach compared to state-of-the-art methods with respect to structure recovery and predictive performance.

Poster
Hailiang Dong · James Amato · Vibhav Gogate · Nicholas Ruozzi

[ Auditorium 1 Foyer ]

Temporal models such as Dynamic Bayesian Networks (DBNs) and Hidden Markov Models (HMMs) have been widely used to model time-dependent sequential data. Typically, these approaches limit focus to discrete domains, employ first-order Markov and stationary assumptions, and limit representational power so that efficient (approximate) inference procedures can be applied. We propose a novel temporal model for continuous domains, where the transition distribution is conditionally tractable: it is modelled as a tractable continuous density over the variables at the current time slice only, while the parameters are controlled using a Recurrent Neural Network (RNN) that takes all previous observations as input. We show that, in this model, various inference tasks can be efficiently implemented using forward filtering with simple gradient ascent. Our experimental results on two different tasks over several real-world sequential datasets demonstrate the superior performance of our model against existing competitors.

Poster
Alexander Lew · George Matheos · Tan Zhi-Xuan · Matin Ghavami · Nishad Gothoskar · Stuart Russell · Vikash Mansinghka

[ Auditorium 1 Foyer ]

This paper introduces SMCP3, a method for automatically implementing custom sequential Monte Carlo samplers for inference in probabilistic programs. Unlike particle filters and resample-move SMC (Gilks and Berzuini, 2001), SMCP3 algorithms can improve the quality of samples and weights using pairs of Markov proposal kernels that are also specified by probabilistic programs. Unlike Del Moral et al. (2006b), these proposals can themselves be complex probabilistic computations that generate auxiliary variables, apply deterministic transformations, and lack tractable marginal densities. This paper also contributes an efficient implementation in Gen that eliminates the need to manually derive incremental importance weights. SMCP3 thus simultaneously expands the design space that can be explored by SMC practitioners and reduces the implementation effort. SMCP3 is illustrated using applications to 3D object tracking, state-space modeling, and data clustering, showing that SMCP3 methods can simultaneously improve the quality and reduce the cost of marginal likelihood estimation and posterior inference.

Poster
Xingyu Xu · Yuantao Gu

[ Auditorium 1 Foyer ]

We study the problem of reconstructing a continuous multidimensional signal from a small number of samples under Fourier constraints assuming that the Fourier power spectrum of the signal has some desirable properties, e.g. being compactly supported, being sparse. We further assume that the Fourier constraint can be expressed as a prior distribution on the Fourier power spectrum, which subsumes the aforementioned examples. The study of sampling and reconstructing in this vein has attracted much attention with a long history.In this paper, we are interested in finding oblivious sampling strategies, that is, sampling without knowing what specific constraint is put on the Fourier power spectrum. We show that it is possible to obliviously sample a Fourier-constrained multidimensional signal with a near-optimal (up to a logarithmic factor) number of samples that guarantee successful reconstruction, partially answering an open question in Avron et al. (2019) which considered the $1$-dimensional case. Our approach highlights a phenomenon that is unique for dimension $d\ge 2$ that the sampling strategy should depend on the geometry of the region on which the signal is to be reconstructed, unlike the case $d=1$ where all regions are of the form $[a,b]$ which are all geometrically equivalent. Our proof, using tools …
Poster
Hongwei Shang · Jean-Marc Langlois · Kostas Tsioutsiouliklis · Changsung Kang

[ Auditorium 1 Foyer ]

In this paper we study the problem of estimating accurately the precision and recall for binary classification when the classes are imbalanced and only a limited number of human labels are available. In this case, one common strategy is to over-sample the small positive class predicted by the classifier. Rather than random sampling where the values in a confusion matrix are observations coming from a multinomial distribution, we over-sample the minority positive class predicted by the classifier, resulting in two independent binomial distributions. But how much should we over-sample? And what confidence/credible intervals can we deduce based on our over-sampling?We provide formulas for (1) the confidence intervals of the adjusted precision/recall after over-sampling; (2) Bayesian credible intervals of adjusted precision/recall by obtaining their predictive posterior distribution. For precision, the higher the over-sampling rate, the narrower the confidence/credible interval. For recall, there exists an optimal over-sampling ratio, which minimizes the width of the confidence/credible interval. Also, we present experiments on synthetic data and real data to demonstrate the capability of our method to construct accurate intervals. Finally, we demonstrate how we can apply our techniques to a quality monitoring system. We find the size of the smallest editorial test for a …

Poster
simon barthelmé · Nicolas Tremblay · Pierre-Olivier Amblard

[ Auditorium 1 Foyer ]

Discrete Determinantal Point Processes (DPPs) have a wide array of potential applications for subsampling datasets. They are however held back in some cases by the high cost of sampling. In the worst-case scenario, the sampling cost scales as O(n^3) where n is the number of elements of the ground set. A popular workaround to this prohibitive cost is to sample DPPs defined by low-rank kernels. In such cases, the cost of standard sampling algorithms scales as O(np^2 + nm^2) where m is the (average) number of samples of the DPP (usually m ≪ n) and p the rank of the kernel used to define the DPP (m ≤ p ≤ n). The first term, O(np^2), comes from a SVD-like step. We focus here on the second term of this cost, O(nm^2), and show that it can be brought down to O(nm + m^3 log m) without loss on the sampling’s exactness. In practice, we observe very substantial speedupscompared to the classical algorithm as soon as n > 1, 000. The algorithm described here is a close variant of the standard algorithm for sampling continuous DPPs, and uses rejection sampling. In the specific case of projection DPPs, we also show that …

Poster
Martin Pawelczyk · Himabindu Lakkaraju · Seth Neel

[ Auditorium 1 Foyer ]

As predictive models are increasingly being employed to make consequential decisions, there is a growing emphasis on developing techniques that can provide recourse to affected individuals. While such recourses can be immensely beneficial to affected individuals, potential adversaries could also exploit these recourses to compromise privacy. In this work, we initiate the first study investigating if and how an adversary can leverage recourses to infer private information about the underlying model's training data. To this end, we propose a series of novel membership inference attacks which leverage algorithmic recourse. More specifically, we generalize the loss-based attacks proposed in the privacy literature to account for the information captured by algorithmic recourse, and introduce a general class of membership inference attacks against recourses called distance-based attacks. Lastly, we experiment with multiple real world datasets and recourse methods that demonstrate the effectiveness of our attacks; firmly establishing the privacy risks inherent in providing algorithmic recourse. Our results indicate that not only does our distance-based attack outperform random guessing, but it also outperforms the loss-based membership inference attack at low false positive rates. This finding demonstrates that our distance-based attacks have applications beyond the recourse setting to the general problem of membership inference against …

Poster
Tamara Pereira · Erik Nascimento · Lucas E. Resck · Diego Mesquita · Amauri Souza

[ Auditorium 1 Foyer ]

Explaining node predictions in graph neural networks (GNNs) often boils down to finding graph substructures that preserve predictions. Finding these structures usually implies back-propagating through the GNN, bonding the complexity (e.g., number of layers) of the GNN to the cost of explaining it. This naturally begs the question: Can we break this bond by explaining a simpler surrogate GNN? To answer the question, we propose Distill n' Explain (DnX). First, DnX learns a surrogate GNN via knowledge distillation. Then, DnX extracts node or edge-level explanations by solving a simple convex program. We also propose FastDnX, a faster version of DnX that leverages the linear decomposition of our surrogate model. Experiments show that DnX and FastDnX often outperform state-of-the-art GNN explainers while being orders of magnitude faster. Additionally, we support our empirical findings with theoretical results linking the quality of the surrogate model (i.e., distillation error) to the faithfulness of explanations.

Poster
Dmitry Malioutov · Sanjeeb Dash · Dennis Wei

[ Auditorium 1 Foyer ]

ML models take on a new life after deployment and raise a host of new challenges: data drift, model recalibration and monitoring. If performance erodes over time, engineers in charge may ask what changed -- did the data distribution change, did the model get worse after retraining? We propose a flexible paradigm for answering a variety of model diagnosis questions by finding heaviest-weight interpretable regions, which we call heavy-sets. We associate a local weight describing model mismatch at each data-point, and find a simple region maximizing the sum (or average) of these weights. Specific choice of weights can find regions where two models differ the most, where a single model makes unusually many errors, or where two data-sets have large differences in densities. The premise is that a region with overall elevated errors (weights) may discover statistically significant effects despite individual errors not standing out in the noise.We focus on interpretable regions defined by sparse AND-rules (conjunctive rule using a small subset of available features). We first describe an exact integer-programming (IP) formulation applicable to smaller data-sets. As the exact IP is NP-hard, we develop a greedy coordinate-wise dynamic-programming based formulation. For smaller datasets the heuristic often comes close in …

Poster
Muhammad Faaiz Taufiq · Patrick Bloebaum · Lenon Minorics

[ Auditorium 1 Foyer ]

Shapley values are model-agnostic methods for explaining model predictions. Many commonly used methods of computing Shapley values, known as off-manifold methods, rely on model evaluations on out-of-distribution input samples. Consequently, explanations obtained are sensitive to model behaviour outside the data distribution, which may be irrelevant for all practical purposes. While on-manifold methods have been proposed which do not suffer from this problem, we show that such methods are overly dependent on the input data distribution, and therefore result in unintuitive and misleading explanations. To circumvent these problems, we propose ManifoldShap, which respects the model's domain of validity by restricting model evaluations to the data manifold. We show, theoretically and empirically, that ManifoldShap is robust to off-manifold perturbations of the model and leads to more accurate and intuitive explanations than existing state-of-the-art Shapley methods.

Poster
Neil Jethani · Adriel Saporta · Rajesh Ranganath

[ Auditorium 1 Foyer ]

Feature attribution methods identify which features of an input most influence a model's output. Most widely-used feature attribution methods (such as SHAP, LIME, and Grad-CAM) are "class-dependent" methods in that they generate a feature attribution vector as a function of class. In this work, we demonstrate that class-dependent methods can "leak" information about the selected class, making that class appear more likely than it is. Thus, an end user runs the risk of drawing false conclusions when interpreting an explanation generated by a class-dependent method. In contrast, we introduce "distribution-aware" methods, which favor explanations that keep the label's distribution close to its distribution given all features of the input. We introduce SHAP-KL and FastSHAP-KL, two baseline distribution-aware methods that compute Shapley values. Finally, we perform a comprehensive evaluation of seven class-dependent and three distribution-aware methods on three clinical datasets of different high-dimensional data types: images, biosignals, and text.

Poster
Yixuan Zhang · Feng Zhou · · Yang Wang · Fang Chen

[ Auditorium 1 Foyer ]

In learning with fairness, for every instance, its label can be randomly flipped to another class due to the practitioner’s prejudice, namely, label bias. The existing well-studied fair representation learning methods focus on removing the dependency between the sensitive factors and the input data, but do not address how the representations retain useful information when the labels are unreliable. In fact, we find that the learned representations become random or degenerated when the instance is contaminated by label bias. To alleviate this issue, we investigate the problem of learning fair representations that are independent of the sensitive factors while retaining the task-relevant information given only access to unreliable labels. Our model disentangles the dependency between fair representations and sensitive factors in the latent space. To remove the reliance between the labels and sensitive factors, we incorporate an additional penalty based on mutual information. The learned purged fair representations can then be used in any downstream processing. We demonstrate the superiority of our method over previous works through multiple experiments on both synthetic and real-world datasets.

Poster
Ji Wang · Ding Lu · Ian Davidson · Zhaojun Bai

[ Auditorium 1 Foyer ]

There are synergies of research interests and in-dustrial efforts in modeling fairness and correct-ing algorithmic bias in machine learning. Inthis paper, we present a scalable algorithm forspectral clustering (SC) with group fairness con-straints. Group fairness is also known as statis-tical parity where in each cluster, each protectedgroup is represented with the same proportion asin the entirety. While FairSC algorithm (Klein-dessner et al., 2019) is able to find the fairer clus-tering, it is compromised by high computationalcosts due to the algorithm’s kernels of computingnullspaces and the square roots of dense matricesexplicitly. We present a new formulation of theunderlying spectral computation of FairSC by in-corporating nullspace projection and Hotelling’sdeflation such that the resulting algorithm, calleds-FairSC, only involves the sparse matrix-vectorproducts and is able to fully exploit the sparsityof the fair SC model. The experimental resultson the modified stochastic block model demon-strate that while it is comparable with FairSC inrecovering fair clustering, s-FairSC is 12× fasterthan FairSC for moderate model sizes. s-FairSCis further demonstrated to be scalable in the sensethat the computational costs of s-FairSC only in-crease marginally compared to the SC withoutfairness constraints.

Poster
Jayadev Acharya · Yuhan Liu · Ziteng Sun

[ Auditorium 1 Foyer ]

We study discrete distribution estimation under user-level local differential privacy (LDP). In user-level $\varepsilon$-LDP, each user has a $m\ge1$ samples and the privacy of all $m$ samples must be preserved simultaneously. We resolve the following dilemma: While on the one hand having more samples per user should provide more information about the underlying distribution, on the other hand, guaranteeing privacy of all $m$ samples should make estimation task more difficult. We obtain tight bounds for this problem under almost all parameter regimes. Perhaps surprisingly, we show that in suitable parameter regimes, having $m$ samples per user is equivalent to having $m$ times more users, each with only one sample. Our results demonstrate interesting phase transitions for $m$ and the privacy parameter $\varepsilon$ in the estimation risk. Finally, connecting with recent results on shuffled DP, we show that combined with random shuffling,our algorithm leads to optimal error guarantees (up to logarithmic factors) under the central model of user-level DP in certain parameter regimes. We provide several simulations to verify our theoretical findings.
Poster
Paul Mangold · Aurélien Bellet · Joseph Salmon · Marc Tommasi

[ Auditorium 1 Foyer ]

In this paper, we study differentially private empirical risk minimization (DP-ERM). It has been shown that the worst-case utility of DP-ERM reduces polynomially as the dimension increases. This is a major obstacle to privately learning large machine learning models. In high dimension, it is common for some model's parameters to carry more information than others. To exploit this, we propose a differentially private greedy coordinate descent (DP-GCD) algorithm. At each iteration, DP-GCD privately performs a coordinate-wise gradient step along the gradients' (approximately) greatest entry. We show theoretically that DP-GCD can achieve a logarithmic dependence on the dimension for a wide range of problems by naturally exploiting their structural properties (such as quasi-sparse solutions). We illustrate this behavior numerically, both on synthetic and real datasets.

Poster
Lingxiao Wang · Boxin Zhao · Mladen Kolar

[ Auditorium 1 Foyer ]

We study the matrix completion problem under joint differential privacy and develop a non-convex low-rank matrix factorization-based method for solving it. Our method comes with strong privacy and utility guarantees, has a linear convergence rate, and is more scalable than the best-known alternative (Chien et al., 2021). Our method achieves the (near) optimal sample complexity for matrix completion required by the non-private baseline and is much better than the best known result under joint differential privacy. Furthermore, we prove a tight utility guarantee that improves existing approaches and removes the impractical resampling assumption used in the literature. Numerical experiments further demonstrate the superiority of our method.

Poster
Saeyoung Rho · Rachel Cummings · Vishal Misra

[ Auditorium 1 Foyer ]

Synthetic control is a causal inference tool used to estimate the treatment effects of an intervention by creating synthetic counterfactual data. This approach combines measurements from other similar observations (i.e., donor pool) to predict a counterfactual time series of interest (i.e., target unit) by analyzing the relationship between the target and the donor pool before the intervention. As synthetic control tools are increasingly applied to sensitive or proprietary data, formal privacy protections are often required. In this work, we provide the first algorithms for differentially private synthetic control with explicit error bounds. Our approach builds upon tools from non-private synthetic control and differentially private empirical risk minimization. We provide upper and lower bounds on the sensitivity of the synthetic control query, explicit error bounds on the accuracy of our private synthetic control algorithms, and experimental results on synthetic data.

Poster
Zachary Robertson · Hantao Zhang · Sanmi Koyejo

[ Auditorium 1 Foyer ]

Inverse decision theory (IDT) aims to learn a performance metric for classification by eliciting expert classifications on examples. However, elicitation in practical settings may require many classifications of potentially ambiguous examples. To improve the efficiency of elicitation, we propose the cooperative inverse decision theory (CIDT) framework as a formalization of the performance metric elicitation problem. In cooperative inverse decision theory, the expert and a machine play a game where both are rewarded according to the expert's performance metric, but the machine does not initially know what this function is. We show that optimal policies in this framework produce active learning that leads to an exponential improvement in sample complexity over previous work. One of our key findings is that a broad class of sub-optimal experts can be represented as having uncertain preferences. We use this finding to show such experts naturally fit into our proposed framework extending inverse decision theory to efficiently deal with decision data that is sub-optimal due to noise, conflicting experts, or systematic error.

Poster
Roy Dong · Heling Zhang · Lillian Ratliff

[ Auditorium 1 Foyer ]

As data-driven methods are deployed in real-world settings, the processes that generate the observed data will often react to the decisions of the learner. For example, a data source may have some incentive for the algorithm to provide a particular label (e.g. approve a bank loan), and manipulate their features accordingly. Work in strategic classification and decision-dependent distributions seeks to characterize the closed-loop behavior of deploying learning algorithms by explicitly considering the effect of the classifier on the underlying data distribution. More recently, works in performative prediction seek to classify the closed-loop behavior by considering general properties of the mapping from classifier to data distribution, rather than an explicit form. Building on this notion, we analyze repeated risk minimization as the perturbed trajectories of the gradient flows of performative risk minimization. We consider the case where there may be multiple local minimizers of performative risk, motivated by situations where the initial conditions may have significant impact on the long-term behavior of the system. We provide sufficient conditions to characterize the region of attraction for the various equilibria in this settings. Additionally, we introduce the notion of performative alignment, which provides a geometric condition on the convergence of repeated risk minimization …

Poster
Thomas Mortier · Viktor Bengs · Eyke Hüllermeier · Stijn Luca · Willem Waegeman

[ Auditorium 1 Foyer ]

Multi-class classification methods that produce sets of probabilistic classifiers, such as ensemble learning methods, are able to model aleatoric and epistemic uncertainty. Aleatoric uncertainty is then typically quantified via the Bayes error, and epistemic uncertainty via the size of the set. In this paper, we extend the notion of calibration, which is commonly used to evaluate the validity of the aleatoric uncertainty representation of a single probabilistic classifier, to assess the validity of an epistemic uncertainty representation obtained by sets of probabilistic classifiers. Broadly speaking, we call a set of probabilistic classifiers calibrated if one can find a calibrated convex combination of these classifiers. To evaluate this notion of calibration, we propose a novel nonparametric calibration test that generalizes an existing test for single probabilistic classifiers to the case of sets of probabilistic classifiers. Making use of this test, we empirically show that ensembles of deep neural networks are often not well calibrated.

Poster
Sina Baharlouei · Fatemeh Sheikholeslami · Meisam Razaviyayn · Zico Kolter

[ Auditorium 1 Foyer ]

This work concerns the development of deep networks that are certifiably robust to adversarial attacks. Joint robust classification-detection was recently introduced as a certified defense mechanism, where adversarial examples are either correctly classified or assigned to the abstain class. In this work, we show that such a provable framework can benefit by extension to networks with multiple explicit abstain classes, where the adversarial examples are adaptively assigned to those. We show that naively adding multiple abstain classes can lead to model degeneracy, then we propose a regularization approach and a training method to counter this degeneracy by promoting full use of the multiple abstain classes. Our experiments demonstrate that the proposed approach consistently achieves favorable standard vs. robust verified accuracy tradeoffs, outperforming state-of-the-art algorithms for various choices of number of abstain classes.

Poster
Huishuai Zhang · Da Yu · Yiping Lu · Di He

[ Auditorium 1 Foyer ]

Adversarial examples, which are usually generated for specific inputs with a specific model, are ubiquitous for neural networks. In this paper we unveil a surprising property of adversarial noises when they are put together, i.e., adversarial noises crafted by one-step gradient methods are linearly separable if equipped with the corresponding labels. We theoretically prove this property for a two-layer network with randomly initialized entries and the neural tangent kernel setup where the parameters are not far from initialization. The proof idea is to show the label information can be efficiently backpropagated to the input while keeping the linear separability. Our theory and experimental evidence further show that the linear classifier trained with the adversarial noises of the training data can well classify the adversarial noises of the test data, indicating that adversarial noises actually inject a distributional perturbation to the original data distribution. Furthermore, we empirically demonstrate that the adversarial noises may become less linearly separable when the above conditions are compromised while they are still much easier to classify than original features.

Poster
Andi Nika · Adish Singla · Goran Radanovic

[ Auditorium 1 Foyer ]

We consider the problem of defense against reward-poisoning attacks in reinforcement learning and formulate it as a game in $T$ rounds between a defender and an adaptive attacker in an adversarial environment. To address this problem, we design two novel defense algorithms. First, we propose \textit{Exp3-DARP}, a defense algorithm that uses Exp3 as a hyperparameter learning subroutine, and show that it achieves order-optimal $\tilde{\Theta}(T^{1/2})$ bounds on our notion of regret with respect to a defense that always picks the optimal parameter in hindsight. We show that the order of $T$ in the bounds cannot be improved when the reward arrival process is adversarial, even if the feedback model of the defense is stronger. However, assuming that the environment is stochastic, we propose \textit{OMDUCB-DARP} that uses estimates of costs as proxies to update the randomized strategy of the learner and are able to substantially improve the bounds proportional to how smoothly the attacker's strategy changes. Furthermore, we show that weaker types of defense, that do not take into account the attack structure and the poisoned rewards, suffer linear regret with respect to a defender that always selects the optimal parameter in hindsight when faced with an adaptive attacker that uses a …
Poster
Taejin Kim · Shubhranshu Singh · Nikhil Madaan · Carlee Joe-Wong

[ Auditorium 1 Foyer ]

Federated learning allows for clients in a distributed system to jointly train a machine learning model. However, clients' models are vulnerable to attacks during the training and testing phases. In this paper, we address the issue of adversarial clients performing "internal evasion attacks'': crafting evasion attacks at test time to deceive other clients. For example, adversaries may aim to deceive spam filters and recommendation systems trained with federated learning for monetary gain. The adversarial clients have extensive information about the victim model in a federated learning setting, as weight information is shared amongst clients. We are the first to characterize the transferability of such internal evasion attacks for different learning methods and analyze the trade-off between model accuracy and robustness depending on the degree of similarities in client data. We show that adversarial training defenses in the federated learning setting only display limited improvements against internal attacks. However, combining adversarial training with personalized federated learning frameworks increases relative internal attack robustness by 60% compared to federated adversarial training and performs well under limited system resources.

Poster
Mohsen Heidari · Wojciech Szpankowski

[ Auditorium 1 Foyer ]

Motivated by the limited qubit capacity of current quantum systems, we study the quantum sample complexity of k-qubit quantum operators, i.e., operations applicable on only k out of d qubits. The problem is studied according to the quantum probably approximately correct (QPAC) model abiding by quantum mechanical laws such as no-cloning, state collapse, and measurement incompatibility. With the delicacy of quantum samples and the richness of quantum operations, one expects a significantly larger quantum sample complexity. This paper proves the contrary. We show that the quantum sample complexity of k-qubit quantum operations is comparable to the classical sample complexity of their counterparts (juntas), at least when $\frac{k}{d}\ll 1$. This is surprising, especially since sample duplication is prohibited, and measurement incompatibility would lead to an exponentially larger sample complexity with standard methods. Our approach is based on the Pauli decomposition of quantum operators and a technique called Quantum Shadow Sampling (QSS) to reduce the sample complexity exponentially. The results are proved by developing (i) a connection between the learning loss and the Pauli decomposition; (ii) a scalable QSS circuit for estimating the Pauli coefficients; and (iii) a quantum algorithm for learning $k$-qubit operators with sample complexity $O(\frac{k4^k}{\epsilon^2}\log d)$.