Skip to yearly menu bar Skip to main content


Timezone: CET

Registration Desk: Registration + Assistance Tue 25 Apr 07:00 a.m.  


Opening Remarks Tue 25 Apr 08:45 a.m.  


Invited Talk: Arthur Gretton

Causal Effect Estimation with Context and Confounders

A fundamental causal modelling task is to predict the effect of an intervention (or treatment) on an outcome, given context/covariates. Examples include predicting the effect of a medical treatment on patient health given patient symptoms and demographic information, or predicting the effect of ticket pricing on airline sales given seasonal fluctuations in demand. The problem becomes especially challenging when the treatment and context are complex (for instance, "treatment" might be a web ad design or a radiotherapy plan), and when only observational data is available (i.e., we have access to historical data, but cannot intervene ourselves). The challenge is greater still when the covariates are not observed, and constitute a hidden source of confounding.

We will provide practical tools and methods for estimating causal effects of complex, high dimensional treatments from observational data. The approach is based on conditional feature means, which represent conditional expectations of relevant model features. These features can be deep neural nets (adaptive, finite dimensional, learned from data), or kernel features (fixed, infinite dimensional, enforcing smoothness). When hidden confounding is present, we will demonstrate a neural net implementation of instrumental variable regression to correct for this confounding. We will apply these methods to modelling employment outcomes for the US Job Corps program for Disadvantaged Youth, and in policy evaluation for reinforcement learning.

Arthur Gretton is a Professor with the Gatsby Computational Neuroscience Unit, and director of the Centre for Computational Statistics and Machine Learning (CSML) at UCL. He received degrees in Physics and Systems Engineering from the Australian National University, and a PhD with Microsoft Research and the Signal Processing and Communications Laboratory at the University of Cambridge. He previously worked at the MPI for Biological Cybernetics, and at the Machine Learning Department, Carnegie Mellon University. Arthur’s recent research interests in machine learning include causal inference and representation learning, design and training of generative models (implicit: Wasserstein gradient flows, GANs; and explicit: energy-based models), and nonparametric hypothesis testing. He has been an associate editor at IEEE Transactions on Pattern Analysis and Machine Intelligence from 2009 to 2013, an Action Editor for JMLR since April 2013, an Area Chair for NeurIPS in 2008 and 2009, a Senior Area Chair for NeurIPS in 2018 and 2021, an Area Chair for ICML in 2011 and 2012, a Senior Area Chair for ICML in 2022, a member of the COLT Program Committee in 2013, and a member of Royal Statistical Society Research Section Committee since January 2020. Arthur was program chair for AISTATS in 2016 (with Christian Robert), tutorials chair for ICML 2018 (with Ruslan Salakhutdinov), workshops chair for ICML 2019 (with Honglak Lee), program chair for the Dali workshop in 2019 (with Krikamol Muandet and Shakir Mohammed), and co-organsier of the Machine Learning Summer School 2019 in London (with Marc Deisenroth).



Oral: Optimal Transport, Information Theory Tue 25 Apr 10:30 a.m.  

Oral
Charlotte Bunne · Ya-Ping Hsieh · Marco Cuturi · Andreas Krause
The static optimal transport $(\mathrm{OT})$ problem between Gaussians seeks to recover an optimal map, or more generally a coupling, to morph a Gaussian into another. It has been well studied and applied to a wide variety of tasks. Here we focus on the dynamic formulation of OT, also known as the Schrödinger bridge (SB) problem, which has recently seen a surge of interest in machine learning due to its connections with diffusion-based generative models. In contrast to the static setting, much less is known about the dynamic setting, even for Gaussian distributions. In this paper, we provide closed-form expressions for SBs between Gaussian measures. In contrast to the static Gaussian OT problem, which can be simply reduced to studying convex programs, our framework for solving SBs requires significantly more involved tools such as Riemannian geometry and generator theory. Notably, we establish that the solutions of SBs between Gaussian measures are themselves Gaussian processes with explicit mean and covariance kernels, and thus are readily amenable for many downstream applications such as generative modeling or interpolation. To demonstrate the utility, we devise a new method for modeling the evolution of single-cell genomics data and report significantly improved numerical stability compared to existing …
Oral
James Thornton · Marco Cuturi

While the optimal transport (OT) problem was originally formulated as a linear program, regularizing it with an entropic penalty has been favored by practitioners in many recent applications, where that regularization is seen as beneficial from both computational and statistical perspectives. The Sinkhorn fixed-point algorithm isthe most popular approach to solve that regularized problem, and, as a result, multiple attempts have been made to reduce its runtime using, e.g., annealing in the regularization parameter, momentum or acceleration in the iterates. The premiseof this work is that initialization of the Sinkhorn algorithm has received comparatively little attention, possibly due to two preconceptions: since the regularized OT problem is convex, it may not be worth crafting a good initialization, since any isguaranteed to work; secondly, because the outputs of the Sinkhorn algorithm are often differentiated in end-to-end pipelines, a data-dependent initialization would bias Jacobian estimates obtained when unrolling iterations. We challenge this conventional wisdom, and show that data-dependent initializers result in dramatic speed-ups, without affecting the correctness of Jacobian maps, as long as those are recovered using implicit differentiation. Our initializations rely on simple closed-forms for exact or approximate OT solutions, using known results in the 1D, Gaussianor GMM settings. These initializations …

Oral
Shelvia Wongso · Rohan Ghosh · Mehul Motani
In this paper, we study the memorization and generalization behaviour of deep neural networks (DNNs) using sliced mutual information (SMI), which is the average of the mutual information (MI) between one-dimensional random projections. We argue that the SMI between features in a DNN ($T$) and ground truth labels ($Y$), $SI(T;Y)$, can be seen as a form of \textit{usable information} that the features contain about the labels. We show theoretically that $SI(T;Y)$ can encode geometric properties of the feature distribution, such as its spherical soft-margin and intrinsic dimensionality, in a way that MI cannot. Additionally, we present empirical evidence showing how $SI(T;Y)$ can capture memorization and generalization in DNNs. In particular, we find that, in the presence of label noise, all layers start to memorize but the earlier layers stabilize more quickly than the deeper layers. Finally, we point out that, in the context of Bayesian Neural Networks, the SMI between the penultimate layer and the output represents the worst case uncertainty of the network's output.
Oral
Cheuk Ting Li · Farzan Farnia

Generative adversarial networks (GANs) represent a game between two neural network machines designed to learn the distribution of data. It is commonly observed that different GAN formulations and divergence/distance measures used could lead to considerably different performance results, especially when the data distribution is multi-modal. In this work, we give a theoretical characterization of the mode-seeking behavior of general f-divergences and Wasserstein distances, and prove a performance guarantee for the setting where the underlying model is a mixture of multiple symmetric quasiconcave distributions. This can help us understand the trade-off between the quality and diversity of the trained GANs' output samples. Our theoretical results show the mode-seeking nature of the Jensen-Shannon (JS) divergence over standard KL-divergence and Wasserstein distance measures. We subsequently demonstrate that a hybrid of JS-divergence and Wasserstein distance measures minimized by Lipschitz GANs mimics the mode-seeking behavior of the JS-divergence. We present numerical results showing the mode-seeking nature of the JS-divergence and its hybrid with the Wasserstein distance while highlighting the mode-covering properties of KL-divergence and Wasserstein distance measures. Our numerical experiments indicate the different behavior of several standard GAN formulations in application to benchmark Gaussian mixture and image datasets.


Panel: Carving out a niche as a researcher Tue 25 Apr 11:30 a.m.  

Marc Deisenroth · Jessica Schrouff · Santiago VELASCO-FORERO

Oral: Trustworthy ML and Statistics Tue 25 Apr 02:00 p.m.  

Oral
Rachel Redberg · Yuqing Zhu · Yu-Xiang Wang

[ Auditorium 1 ]

The "Propose-Test-Release" (PTR) framework is a classic recipe for designing differentially private (DP) algorithms that are data-adaptive, i.e. those that add less noise when the input dataset is "nice". We extend PTR to a more general setting by privately testing data-dependent privacy losses rather than local sensitivity, hence making it applicable beyond the standard noise-adding mechanisms, e.g. to queries with unbounded or undefined sensitivity. We demonstrate the versatility of generalized PTR using private linear regression as a case study. Additionally, we apply our algorithm to solve an open problem from “Private Aggregation of Teacher Ensembles (PATE)” --- privately releasing the entire model with a delicate data-dependent analysis.

Oral
Elvis Dohmatob · Chuan Guo · Morgane Goibert

[ Auditorium 1 ]

Machine learning models are known to be susceptible to adversarial perturbations. Even more concerning is the fact that these adversarial perturbations can be found by black-box search using surprisingly few queries, which essentially restricts the perturbation to a subspace of dimension $k$---much smaller than the dimension $d$ of the image space. This intriguing phenomenon raises the question: \emph{Is the vulnerability to black-box attacks inherent or can we hope to prevent them?} In this paper, we initiate a rigorous study of the phenomenon of low-dimensional adversarial perturbations (LDAPs). Our result characterizes precisely the sufficient conditions for the existence of LDAPs, and we show that these conditions hold for neural networks under practical settings, including the so-called lazy regime wherein the parameters of the trained network remain close to their values at initialization. Our theoretical results are confirmed by experiments on both synthetic and real data.
Oral
Jiachen T. Wang · Ruoxi Jia

[ Auditorium 1 ]

Data valuation has wide use cases in machine learning, including improving data quality and creating economic incentives for data sharing. This paper studies the robustness of data valuation to noisy model performance scores. Particularly, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the Leave-one-out error) to produce inconsistent data value rankings across different runs. To address this challenge, we introduce the concept of safety margin, which measures the robustness of a data value notion. We show that the Banzhaf value, a famous value notion that originated from cooperative game theory literature, achieves the largest safety margin among all semivalues (a class of value notions that satisfy crucial properties entailed by ML applications and include the famous Shapley value and Leave-one-out error). We propose an algorithm to efficiently estimate the Banzhaf value based on the Maximum Sample Reuse (MSR) principle. Our evaluation demonstrates that the Banzhaf value outperforms the existing semivalue-based data value notions on several ML tasks such as learning with weighted samples and noisy label detection. Overall, our study suggests that when the underlying ML algorithm is stochastic, the Banzhaf value is a …


Oral: Representations of Graphs Tue 25 Apr 03:30 p.m.  

Oral
Behrooz Tahmasebi · Derek Lim · Stefanie Jegelka

[ Auditorium 1 ]

To achieve a graph representation, most Graph Neural Networks (GNNs) follow two steps: first, each graph is decomposed into a number of subgraphs (which we call the recursion step), and then the collection of subgraphs is encoded by several iterative pooling steps. While recently proposed higher-order networks show a remarkable increase in the expressive power through a single recursion on larger neighborhoods followed by iterative pooling, the power of deeper recursion in GNNs without any iterative pooling is still not fully understood. To make it concrete, we consider a pure recursion-based GNN which we call Recursive Neighborhood Pooling GNN (RNP-GNN). The expressive power of an RNP-GNN and its computational cost quantifies the power of (pure) recursion for a graph representation network. We quantify the power by means of counting substructures, which is one main limitation of the Message Passing graph Neural Networks (MPNNs), and show how RNP-GNN can exploit the sparsity of the underlying graph to achieve low-cost powerful representations. We also compare the recent lower bounds on the time complexity and show how recursion-based networks are near optimal.

Oral
Xinyue Xia · Gal Mishne · Yusu Wang

[ Auditorium 1 ]

Graphons are general and powerful models for generating graphs of varying size. In this paper, we propose to directly model graphons using neural networks, obtaining Implicit Graphon Neural Representation (IGNR). Existing work in modeling and reconstructing graphons often approximates a target graphon by a fixed resolution piece-wise constant representation. Our IGNR has the benefit that it can represent graphons up to arbitrary resolutions, and enables natural and efficient generation of arbitrary sized graphs with desired structure once the model is learned. Furthermore, we allow the input graph data to be unaligned and have different sizes by leveraging the Gromov-Wasserstein distance. We first demonstrate the effectiveness of our model by showing its superior performance on a graphon learning task. We then propose an extension of IGNR that can be incorporated into an auto-encoder framework, and demonstrate its good performance under a more general setting of graphon learning. We also show that our model is suitable for graph representation learning and graph generation.

Oral
Hannah Sansford · Alexander Modell · Nick Whiteley · Patrick Rubin-Delanchy

[ Auditorium 1 ]

Recent work has shown that sparse graphs containing many triangles cannot be reproduced using a finite-dimensional representation of the nodes, in which link probabilities are inner products. Here, we show that such graphs can be reproduced using an infinite-dimensional inner product model, where the node representations lie on a low-dimensional manifold. Recovering a global representation of the manifold is impossible in a sparse regime. However, we can zoom in on local neighbourhoods, where a lower-dimensional representation is possible. As our constructions allow the points to be uniformly distributed on the manifold, we find evidence against the common perception that triangles imply community structure.

Oral
Ga Ming Angus Chan · Tianxi Li

[ Auditorium 1 ]

Statistical modeling of random networks has been a widely used approach to uncovering interaction mechanisms of complex systems and predicting unobserved links in real-world networks. In many applications, network connections are collected via egocentric sampling: a subset of nodes was sampled first, after which all links involving this subset of nodes were recorded; all other information was missing. Compared with the typical assumption of uniformly missing at random, the egocentrically sampled partial networks requires specially designed modeling strategies. The previous available statistical methods are either computationally infeasible or based on intuitive designs without theoretical justification. We propose a method to fit general low-rank models for egocentrically sampled networks, which include several popular network models. The method is based on spectral properties and is computationally efficient for large-scale networks. The proposed method gives a consistent recovery of the missing subnetwork due to egocentric sampling for sparse networks. To our knowledge, this is the first available theoretical guarantee for egocentric partial network estimation in the scope of low-rank models. We evaluate the method on several synthetic and real-world networks and show that it delivers competitive performance in link prediction tasks.


Poster Session 1 Tue 25 Apr 04:30 p.m.  

Poster
Aman Shrivastava · Ramprasaath R. Selvaraju · Nikhil Naik · Vicente Ordonez

[ Auditorium 1 Foyer ]

We propose CLIP-Lite, an information efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample during the optimization of its contrastive learning objective. We accomplish this by taking advantage of an information efficient lower-bound to maximize the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly reduced amounts of data and batch sizes while obtaining better performance than CLIP at the same scale. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains a +14.0% mAP absolute gain in performance on Pascal VOC classification, and a +22.1% top-1 accuracy gain on ImageNet, while being comparable or superior to other, more complex, text-supervised models. CLIP-Lite is also superior to CLIP on image and text retrieval, zero-shot classification, and visual grounding. Finally, we show that CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used in downstream tasks. Implementation: https://github.com/4m4n5/CLIP-Lite

Poster
Vitalii Bulygin · Dmytro Mykheievskyi · Kyrylo Kuchynskyi

[ Auditorium 1 Foyer ]

We propose a fast and low complexity anchor-free instance segmentation approach BlitzMask. For the first time, the approach achieves competitive results for real-time inference on mobile devices. The model architecture modifies CenterNet by adding a new lite head to the CenterNet architecture. The model contains only layers optimized for inference on mobile devices, e.g. batch normalization, standard convolution, depthwise convolution, and can be easily embedded into a mobile device. The instance segmentation task requires finding an arbitrary (not a priori fixed) number of instance masks. The proposed method predicts the number of instance masks separately for each image using a predicted heatmap. Then, it decomposes each instance mask over a predicted spanning set, which is an output of the lite head. The approach uses training from scratch with a new optimization process and a new loss function. A model with EfficientNet-Lite B4 backbone and 320x320 input resolution achieves 28.9 mask AP at 29.2 fps on Samsung S21 GPU and 28.0 mask AP at 39.4 fps on Samsung S21 DSP. This sets a new speed benchmark for inference for instance segmentation on mobile devices.

Poster
Ruitu Xu · Yifei Min · Tianhao Wang · Michael Jordan · Zhaoran Wang · Zhuoran Yang

[ Auditorium 1 Foyer ]

We study a heterogeneous agent macroeconomic model with an infinite number of households and firms competing in a labor market. Each household earns income and engages in consumption at each time step while aiming to maximize a concave utility subject to the underlying market conditions. The households aim to find the optimal saving strategy that maximizes their discounted cumulative utility given the market condition, while the firms determine the market conditions through maximizing corporate profit based on the household population behavior. The model captures a wide range of applications in macroeconomic studies, and we propose a data-driven reinforcement learning framework that finds the regularized competitive equilibrium of the model. The proposed algorithm enjoys theoretical guarantees in converging to the equilibrium of the market at a sub-linear rate.

Poster
Zhiyue Zhang · Hongyuan Mei · Yanxun Xu

[ Auditorium 1 Foyer ]

Offline reinforcement learning is a promising approach for training intelligent medical agents to learn treatment policies and assist decision making in many healthcare applications, such as scheduling clinical visits and assigning dosages for patients with chronic conditions. In this paper, we investigate the potential usefulness of Decision Transformer —a new offline reinforcement learning paradigm—in medical domains where decision making in continuous time is desired. As Decision Transformer only handles discrete-time (or, turn-based) sequential decision making scenarios, we generalize it to Continuous-Time Decision Transformer that not only considers the past clinical measurements and treatments but also the timings of previous visits, and learns to suggest the timings of future visits as well as the treatment plan at each visit. Extensive experiments on synthetic datasets and simulators motivated by real-world medical applications demonstrate that Continuous-Time Decision Transformer is able to outperform competitors and has clinical utility in terms of improving patients’ health and prolonging their survival by learning high-performance policies from logged data generated using policies of different levels of quality.

Poster
Hussein Mozannar · Hunter Lang · Dennis Wei · Prasanna Sattigeri · Subhro Das · David Sontag

[ Auditorium 1 Foyer ]

Algorithmic predictors should be able to defer the prediction to a human decision maker to ensure accurate predictions. In this work, we jointly train a classifier with a rejector, which decides on each data point whether the classifier or the human should predict. We show that prior approaches can fail to find a human-AI system with low mis-classification error even when there exists a linear classifier and rejector that have zero error (the realizable setting). We prove that obtaining a linear pair with low error is NP-hard even when the problem is realizable. To complement this negative result, we give a mixed-integer-linear-programming (MILP) formulation that can optimally solve the problem in the linear setting. However, the MILP only scales to moderately-sized problems. Therefore, we provide a novel surrogate loss function that is realizable-consistent and performs well empirically. We test our approaches on a comprehensive set of datasets and compare to a wide range of baselines.

Poster
Song Duong · Alberto Lumbreras · Mike Gartrell · Patrick Gallinari

[ Auditorium 1 Foyer ]

Data-to-text (D2T) and text-to-data (T2D) are dual tasks that convert structured data, such as graphs or tables into fluent text, and vice versa. These tasks are usually handled separately and use corpora extracted from a single source. Current systems leverage pre-trained language models fine-tuned on D2T or T2D tasks. This approach has two main limitations: first, a separate system has to be tuned for each task and source; second, learning is limited by the scarcity of available corpora. This paper considers a more general scenario where data are available from multiple heterogeneous sources. Each source, with its specific data format and semantic domain, provides a non-parallel corpus of text and structured data. We introduce a variational auto-encoder model with disentangled style and content variables that allows us to represent the diversity that stems from multiple sources of text and data. Our model is designed to handle the tasks of D2T and T2D jointly. We evaluate our model on several datasets, and show that by learning from multiple sources, our model closes the performance gap with its supervised single-source counterpart and outperforms it in some cases.

Poster
Shay Deutsch · Stefano Soatto

[ Auditorium 1 Foyer ]

We introduce the Graph Sylvester Embedding (GSE), an unsupervised graph representation of local similarity, connectivity, and global structure. GSE uses the solution of the Sylvester equation to capture both network structure and neighborhood proximity in a single representation. Unlike embeddings based on the eigenvectors of the Laplacian, GSE incorporates two or more basis functions, for instance using the Laplacian and the affinity matrix. Such basis functions are constructed not from the original graph, but from one whose weights measure the centrality of an edge (the fraction of the number of shortest paths that pass through that edge) in the original graph. This allows more flexibility and control to represent complex network structure and shows significant improvements over the state of the art when used for data analysis tasks such as predicting failed edges in material science and network alignment in the human-SARS CoV-2 protein-protein interactome.

Poster
Hannah Sansford · Alexander Modell · Nick Whiteley · Patrick Rubin-Delanchy

[ Auditorium 1 Foyer ]

Recent work has shown that sparse graphs containing many triangles cannot be reproduced using a finite-dimensional representation of the nodes, in which link probabilities are inner products. Here, we show that such graphs can be reproduced using an infinite-dimensional inner product model, where the node representations lie on a low-dimensional manifold. Recovering a global representation of the manifold is impossible in a sparse regime. However, we can zoom in on local neighbourhoods, where a lower-dimensional representation is possible. As our constructions allow the points to be uniformly distributed on the manifold, we find evidence against the common perception that triangles imply community structure.

Poster
Ga Ming Angus Chan · Tianxi Li

[ Auditorium 1 Foyer ]

Statistical modeling of random networks has been a widely used approach to uncovering interaction mechanisms of complex systems and predicting unobserved links in real-world networks. In many applications, network connections are collected via egocentric sampling: a subset of nodes was sampled first, after which all links involving this subset of nodes were recorded; all other information was missing. Compared with the typical assumption of uniformly missing at random, the egocentrically sampled partial networks requires specially designed modeling strategies. The previous available statistical methods are either computationally infeasible or based on intuitive designs without theoretical justification. We propose a method to fit general low-rank models for egocentrically sampled networks, which include several popular network models. The method is based on spectral properties and is computationally efficient for large-scale networks. The proposed method gives a consistent recovery of the missing subnetwork due to egocentric sampling for sparse networks. To our knowledge, this is the first available theoretical guarantee for egocentric partial network estimation in the scope of low-rank models. We evaluate the method on several synthetic and real-world networks and show that it delivers competitive performance in link prediction tasks.

Poster
Jiachen Yang · Tarik Dzanic · Brenden Petersen · Jun Kudo · Ketan Mittal · Vladimir Tomov · Jean-Sylvain Camier · Tuo Zhao · Hongyuan Zha · Tzanio Kolev · Robert Anderson · Daniel Faissol

[ Auditorium 1 Foyer ]

Finite element simulations of physical systems governed by partial differential equations (PDE) crucially depend on adaptive mesh refinement (AMR) to allocate computational budget to regions where higher resolution is required. Existing scalable AMR methods make heuristic refinement decisions based on instantaneous error estimation and thus do not aim for long-term optimality over an entire simulation. We propose a novel formulation of AMR as a Markov decision process and apply deep reinforcement learning (RL) to train refinement policies directly from simulation. AMR poses a new problem for RL as both the state dimension and available action set changes at every step, which we solve by proposing new policy architectures with differing generality and inductive bias. The model sizes of these policy architectures are independent of the mesh size and hence can be deployed on larger simulations than those used at training time. We demonstrate in comprehensive experiments on static function estimation and time-dependent equations that RL policies can be trained on problems without using ground truth solutions, are competitive with a widely-used error estimator, and generalize to larger and unseen test problems.

Poster
Joshua Chang · Carson Chow · Julia Porcino

[ Auditorium 1 Foyer ]

The Work Disability Functional Assessment Battery (WD-FAB) is a multidimensional item response theory (IRT) instrument designed for assessing work-related mental and physical function based on responses to an item bank. In prior iterations it was developed using traditional means -- linear factorization and null hypothesis statistical testing for item partitioning/selection, and finally, posthoc calibration of disjoint unidimensional IRT models. As a result, the WD-FAB, like many other IRT instruments, is a posthoc model. Its item partitioning, based on exploratory factor analysis, is blind to the final nonlinear IRT model and is not performed in a manner consistent with goodness of fit to the final model. In this manuscript, we develop a Bayesian hierarchical model for self-consistently performing the following simultaneous tasks: scale factorization, item selection, parameter identification, and response scoring. This method uses sparsity-based shrinkage to obviate the linear factorization and null hypothesis statistical tests that are usually required for developing multidimensional IRT models, so that item partitioning is consistent with the ultimate nonlinear factor model. We also analogize our multidimensional IRT model to probabilistic autoencoders, specifying an encoder function that amortizes the inference of ability parameters from item responses. The encoder function is equivalent to the ``VBE'' step in …

Poster
Seungjae Shin · HeeSun Bae · DongHyeok Shin · Weonyoung Joo · Il-Chul Moon

[ Auditorium 1 Foyer ]

Training neural networks on a large dataset requires substantial computational costs. Dataset reduction selects or synthesizes data instances based on the large dataset, while minimizing the degradation in generalization performance from the full dataset. Existing methods utilize the neural network during the dataset reduction procedure, so the model parameter becomes important factor in preserving the performance after reduction. By depending upon the importance of parameters, this paper introduces a new reduction objective, coined LCMat, which Matches the Loss Curvatures of the original dataset and reduced dataset over the model parameter space, more than the parameter point. This new objective induces a better adaptation of the reduced dataset on the perturbed parameter region than the exact point matching. Particularly, we identify the worst case of the loss curvature gap from the local parameter region, and we derive the implementable upper bound of such worst-case with theoretical analyses. Our experiments on both coreset selection and condensation benchmarks illustrate that LCMat shows better generalization performances than existing baselines.

Poster
Younggeun Kim · Ying Liu · Xue-Xin Wei

[ Auditorium 1 Foyer ]

The recently proposed identifiable variational autoencoder (iVAE) framework provides a promising approach for learning latent independent components (ICs). iVAEs use auxiliary covariates to build an identifiable generation structure from covariates to ICs to observations, and the posterior network approximates ICs given observations and covariates. Though the identifiability is appealing, we show that iVAEs could have local minimum solution where observations and the approximated ICs are independent given covariates.– a phenomenon we referred to as the posterior collapse problem of iVAEs. To overcome this problem, we develop a new approach, covariate-informed iVAE (CI-iVAE) by considering a mixture of encoder and posterior distributions in the objective function. In doing so, the objective function prevents the posterior collapse, resulting latent representations that contain more information of the observations. Furthermore, CI-iVAE extends the original iVAE objective function to a larger class and finds the optimal one among them, thus having tighter evidence lower bounds than the original iVAE. Experiments on simulation datasets, EMNIST, Fashion-MNIST, and a large-scale brain imaging dataset demonstrate the effectiveness of our new method.

Poster
Changwoo Lee · Xiao Hu · Hun-Seok Kim

[ Auditorium 1 Foyer ]

In this paper, we propose an iterative source error correction (ISEC) decoding scheme for deep-learning-based joint source-channel coding (Deep JSCC). Given a noisy codeword received through the channel, we use a Deep JSCC encoder and decoder pair to update the codeword iteratively to find a (modified) maximum a-posteriori (MAP) solution. For efficient MAP decoding, we utilize a neural network-based denoiser to approximate the gradient of the log-prior density of the codeword space. Albeit the non-convexity of the optimization problem, our proposed scheme improves various distortion and perceptual quality metrics from the conventional one-shot (non-iterative) Deep JSCC decoding baseline. Furthermore, the proposed scheme produces more reliable source reconstruction results compared to the baseline when the channel noise characteristics do not match the ones used during training.

Poster
Hiroki Naganuma · Hideaki Iiduka

[ Auditorium 1 Foyer ]

One of the training strategies of generative models is to minimize the Jensen–Shannon divergence between the model distribution and the data distribution. Since data distribution is unknown, generative adversarial networks (GANs) formulate this problem as a game between two models, a generator and a discriminator. The training can be formulated in the context of game theory and the local Nash equilibrium (LNE). It does not seem feasible to derive guarantees of stability or optimality for the existing methods. This optimization problem is far more challenging than the single objective setting. Here, we use the conjugate gradient method to reliably and efficiently solve the LNE problem in GANs. We give a proof and convergence analysis under mild assumptions showing that the proposed method converges to a LNE with three different learning rate update rules, including a constant learning rate. Finally, we demonstrate that the proposed method outperforms stochastic gradient descent (SGD) and momentum SGD in terms of best Fréchet inception distance (FID) score and outperforms Adam on average. The code is available at \url{https://github.com/Hiroki11x/ConjugateGradient_GAN}.

Poster
Simon Damm · Dennis Forster · Dmytro Velychko · Zhenwen Dai · Asja Fischer · Jörg Lücke

[ Auditorium 1 Foyer ]

The central objective function of a variational autoencoder (VAE) is its variational lower bound (the ELBO). Here we show that for standard (i.e., Gaussian) VAEs the ELBO converges to a value given by the sum of three entropies: the (negative) entropy of the prior distribution, the expected (negative) entropy of the observable distribution, and the average entropy of the variational distributions (the latter is already part of the ELBO). Our derived analytical results are exact and apply for small as well as for intricate deep networks for encoder and decoder. Furthermore, they apply for finitely and infinitely many data points and at any stationary point (including local maxima and saddle points). The result implies that the ELBO can for standard VAEs often be computed in closed-form at stationary points while the original ELBO requires numerical approximations of integrals. As a main contribution, we provide the proof that the ELBO for VAEs is at stationary points equal to entropy sums. Numerical experiments then show that the obtained analytical results are sufficiently precise also in those vicinities of stationary points that are reached in practice. Furthermore, we discuss how the novel entropy form of the ELBO can be used to analyze and …

Poster
Rayyan Ahmad Khan · Martin Kleinsteuber

[ Auditorium 1 Foyer ]

Network embedding has emerged as a promising research field for network analysis. Recently, an approach, named Barlow Twins, has been proposed for self-supervised learning in computer vision by applying the redundancy-reduction principle to the embedding vectors corresponding to two distorted versions of the image samples. Motivated by this, we propose Barlow Graph Auto-Encoder, a simple yet effective architecture for learning network embedding. It aims to maximize the similarity between the embedding vectors of immediate and larger neighborhoods of a node while minimizing the redundancy between the components of these projections. In addition, we also present the variational counterpart named Barlow Variational Graph Auto-Encoder. We demonstrate the effectiveness of our approach in learning multiple graph-related tasks, i.e., link prediction, clustering, and downstream node classification, by providing extensive comparisons with several well-known techniques on eight benchmark datasets.

Poster
Mohammad Sadegh Akhondzadeh Tezerjani · Vijay Lingam · Aleksandar Bojchevski

[ Auditorium 1 Foyer ]

Today we have a good theoretical understanding of the representational power of Graph Neural Networks (GNNs). For example, their limitations have been characterized in relation to a hierarchy of Weisfeiler-Lehman (WL) isomorphism tests. However, we do not know what is encoded in the learned representations. This is our main question. We answer it using a probing framework to quantify the amount of meaningful information captured in graph representations. Our findings on molecular datasets show the potential of probing for understanding the inductive biases of graph-based models. We compare different families of models, and show that Graph Transformers capture more chemically relevant information compared to models based on message passing. We also study the effect of different design choices such as skip connections and virtual nodes. We advocate for probing as a useful diagnostic tool for evaluating and developing graph-based models.

Poster
Lev Telyatinkov · Simone Scardapane

[ Auditorium 1 Foyer ]

Missing data imputation (MDI) is crucial when dealing with tabular datasets across various domains. Autoencoders can be trained to reconstruct missing values, and graph autoencoders (GAE) can additionally consider similar patterns in the dataset when imputing new values for a given instance. However, previously proposed GAEs suffer from scalability issues, requiring the user to define a similarity metric among patterns to build the graph connectivity beforehand. In this paper, we leverage recent progress in latent graph learning to propose a novel EdGe Generation Graph AutoEncoder (EGG-GAE) for missing data imputation that overcomes these two drawbacks. EGG-GAE works on randomly sampled mini-batches of the input data (hence scaling to larger datasets), and it automatically infers the best connectivity across the mini-batch for each architecture layer. We also experiment with several extensions, including an ensemble strategy for inference and the inclusion of what we call prototype nodes, obtaining significant improvements, both in terms of imputation error and final downstream accuracy, across multiple benchmarks and baselines.

Poster
Behrooz Tahmasebi · Derek Lim · Stefanie Jegelka

[ Auditorium 1 Foyer ]

To achieve a graph representation, most Graph Neural Networks (GNNs) follow two steps: first, each graph is decomposed into a number of subgraphs (which we call the recursion step), and then the collection of subgraphs is encoded by several iterative pooling steps. While recently proposed higher-order networks show a remarkable increase in the expressive power through a single recursion on larger neighborhoods followed by iterative pooling, the power of deeper recursion in GNNs without any iterative pooling is still not fully understood. To make it concrete, we consider a pure recursion-based GNN which we call Recursive Neighborhood Pooling GNN (RNP-GNN). The expressive power of an RNP-GNN and its computational cost quantifies the power of (pure) recursion for a graph representation network. We quantify the power by means of counting substructures, which is one main limitation of the Message Passing graph Neural Networks (MPNNs), and show how RNP-GNN can exploit the sparsity of the underlying graph to achieve low-cost powerful representations. We also compare the recent lower bounds on the time complexity and show how recursion-based networks are near optimal.

Poster
Ryumei Nakada · Halil Gulluk · Zhun Deng · Wenlong Ji · James Zou · Linjun Zhang

[ Auditorium 1 Foyer ]

Language-supervised vision models have recently attracted great attention in computer vision. A common approach to build such models is to use contrastive learning on paired data across the two modalities, as exemplified by Contrastive Language-Image Pre-Training (CLIP). In this paper, (i) we initiate the investigation of a general class of nonlinear loss functions for multimodal contrastive learning (MMCL) including CLIP loss and show its connection to singular value decomposition (SVD). Namely, we show that each step of loss minimization by gradient descent can be seen as performing SVD on a contrastive cross-covariance matrix. Based on this insight, (ii) we analyze the performance of MMCL under linear representation settings. We quantitatively show that the feature learning ability of MMCL can be better than that of unimodal contrastive learning applied to each modality even under the presence of wrongly matched pairs. This characterizes the robustness of MMCL to noisy data. Furthermore, when we have access to additional unpaired data, (iii) we propose a new MMCL loss that incorporates additional unpaired datasets. We show that the algorithm can detect the ground-truth pairs and improve performance by fully exploiting unpaired datasets. The performance of the proposed algorithm was verified by numerical experiments.

Poster
Ziqing Xu · Hancheng Min · Salma Tarmoun · Enrique Mallada · Rene Vidal

[ Auditorium 1 Foyer ]

Recent theoretical analyses of the convergence of gradient descent (GD) to a global minimum for over-parametrized neural networks make strong assumptions on the step size (infinitesimal), the hidden-layer width (infinite), or the initialization (spectral, balanced). In this work, we relax these assumptions and derive a linear convergence rate for two-layer linear networks trained using GD on the squared loss in the case of finite step size, finite width and general initialization. Despite the generality of our analysis, our rate estimates are significantly tighter than those of prior work. Moreover, we provide a time-varying step size rule that monotonically improves the convergence rate as the loss function decreases to zero. Numerical experiments validate our findings.

Poster
Hussein Hazimeh · Natalia Ponomareva

[ Auditorium 1 Foyer ]

Adversarial nets have proved to be powerful in various domains including generative modeling (GANs), transfer learning, and fairness. However, successfully training adversarial nets using first-order methods remains a major challenge. Typically, careful choices of the learning rates are needed to maintain the delicate balance between the competing networks. In this paper, we design a novel learning rate scheduler that dynamically adapts the learning rate of the adversary to maintain the right balance. The scheduler is driven by the fact that the loss of an ideal adversarial net is a constant known a priori. The scheduler is thus designed to keep the loss of the optimized adversarial net close to that of an ideal network. We run large-scale experiments to study the effectiveness of the scheduler on two popular applications: GANs for image generation and adversarial nets for domain adaptation. Our experiments indicate that adversarial nets trained with the scheduler are less likely to diverge and require significantly less tuning. For example, on CelebA, a GAN with the scheduler requires only one-tenth of the tuning budget needed without a scheduler. Moreover, the scheduler leads to statistically significant improvements in model quality, reaching up to 27% in Frechet Inception Distance for image …

Poster
Achraf Bahamou · Donald Goldfarb · Yi Ren

[ Auditorium 1 Foyer ]

Deep Neural Networks (DNNs) are currently predominantly trained using first-order methods. Some of these methods (e.g., Adam, AdaGrad, and RMSprop, and their variants) incorporate a small amount of curvature information by using a diagonal matrix to precondition the stochastic gradient. Recently, effective second-order methods, such as KFAC, K-BFGS, Shampoo, and TNT, have been developed for training DNNs, by preconditioning the stochastic gradient by layer-wise block-diagonal matrices. Here we propose a "mini-block Fisher (MBF)" preconditioned stochastic gradient method, that lies in between these two classes of methods. Specifically, our method uses a block-diagonal approximation to the empirical Fisher matrix, where for each layer in the DNN, whether it is convolutional or feed-forward and fully connected, the associated diagonal block is itself block-diagonal and is composed of a large number of mini-blocks of modest size. Our novel approach utilizes the parallelism of GPUs to efficiently perform computations on the large number of matrices in each layer. Consequently, MBF’s per-iteration computational cost is only slightly higher than it is for first-order methods. The performance of MBF is compared to that of several baseline methods, on Autoencoder, Convolutional Neural Network (CNN), and Graph Convolutional Network (GCN) problems, to validate its effectiveness both in terms …

Poster
James Thornton · Marco Cuturi

[ Auditorium 1 Foyer ]

While the optimal transport (OT) problem was originally formulated as a linear program, regularizing it with an entropic penalty has been favored by practitioners in many recent applications, where that regularization is seen as beneficial from both computational and statistical perspectives. The Sinkhorn fixed-point algorithm isthe most popular approach to solve that regularized problem, and, as a result, multiple attempts have been made to reduce its runtime using, e.g., annealing in the regularization parameter, momentum or acceleration in the iterates. The premiseof this work is that initialization of the Sinkhorn algorithm has received comparatively little attention, possibly due to two preconceptions: since the regularized OT problem is convex, it may not be worth crafting a good initialization, since any isguaranteed to work; secondly, because the outputs of the Sinkhorn algorithm are often differentiated in end-to-end pipelines, a data-dependent initialization would bias Jacobian estimates obtained when unrolling iterations. We challenge this conventional wisdom, and show that data-dependent initializers result in dramatic speed-ups, without affecting the correctness of Jacobian maps, as long as those are recovered using implicit differentiation. Our initializations rely on simple closed-forms for exact or approximate OT solutions, using known results in the 1D, Gaussianor GMM settings. These initializations …

Poster
Vinod Kumar Chauhan · Soheila Molaei · Marzia Hoque Tania · Anshul Thakur · Tingting Zhu · David Clifton

[ Auditorium 1 Foyer ]

Observational studies have recently received significant attention from the machine learning community due to the increasingly available non-experimental observational data and the limitations of the experimental studies, such as considerable cost, impracticality, small and less representative sample sizes, etc. In observational studies, de-confounding is a fundamental problem of individualised treatment effects (ITE) estimation. This paper proposes disentangled representations with adversarial training to selectively balance the confounders in the binary treatment setting for the ITE estimation. The adversarial training of treatment policy selectively encourages treatment-agnostic balanced representations for the confounders and helps to estimate the ITE in the observational studies via counterfactual inference. Empirical results on synthetic and real-world datasets, with varying degrees of confounding, prove that our proposed approach improves the state-of-the-art methods in achieving lower error in the ITE estimation.

Poster
Shelvia Wongso · Rohan Ghosh · Mehul Motani

[ Auditorium 1 Foyer ]

In this paper, we study the memorization and generalization behaviour of deep neural networks (DNNs) using sliced mutual information (SMI), which is the average of the mutual information (MI) between one-dimensional random projections. We argue that the SMI between features in a DNN ($T$) and ground truth labels ($Y$), $SI(T;Y)$, can be seen as a form of \textit{usable information} that the features contain about the labels. We show theoretically that $SI(T;Y)$ can encode geometric properties of the feature distribution, such as its spherical soft-margin and intrinsic dimensionality, in a way that MI cannot. Additionally, we present empirical evidence showing how $SI(T;Y)$ can capture memorization and generalization in DNNs. In particular, we find that, in the presence of label noise, all layers start to memorize but the earlier layers stabilize more quickly than the deeper layers. Finally, we point out that, in the context of Bayesian Neural Networks, the SMI between the penultimate layer and the output represents the worst case uncertainty of the network's output.
Poster
Daniel Goldfarb · Paul Hand

[ Auditorium 1 Foyer ]

Overparameterization is known to permit strong generalization performance in neural networks. In this work, we provide an initial theoretical analysis of its effect on catastrophic forgetting in a continual learning setup. We show experimentally that in Permuted MNIST image classification tasks, the generalization performance of multilayer perceptrons trained by vanilla stochastic gradient descent can be improved by overparameterization, and the extent of the performance increase achieved by overparameterization is comparable to that of state-of-the-art continual learning algorithms. We provide a theoretical explanation of this effect by studying a qualitatively similar two-task linear regression problem, where each task is related by a random orthogonal transformation. We show that when a model is trained on the two tasks in sequence without any additional regularization, the risk gain on the first task is small if the model is sufficiently overparameterized.

Poster
Dan Meller · Nicolas Berkouk

[ Auditorium 1 Foyer ]

We introduce the Singular Value Representation (SVR), a new method to represent the internal state of neural networks using SVD factorization of the weights. This construction yields a new weighted graph connecting what we call spectral neurons, that correspond to specific activation patterns of classical neurons. We derive a precise statistical framework to discriminate meaningful connections between spectral neurons for fully connected and convolutional layers. To demonstrate the usefulness of our approach for machine learning research, we highlight two discoveries we made using the SVR. First, we highlight the emergence of a dominant connection in VGG networks that spans multiple deep layers. Second, we witness, without relying on any input data, that batch normalization can induce significant connections between near-kernels of deep layers, leading to a remarkable spontaneous sparsification phenomenon.

Poster
Russell Tsuchida · Cheng Soon Ong

[ Auditorium 1 Foyer ]

Principal Component Analysis (PCA) and its exponential family extensions have three components: observations, latents and parameters of a linear transformation. We consider a generalised setting where the canonical parameters of the exponential family are a nonlinear transformation of the latents. We show explicit relationships between particular neural network architectures and the corresponding statistical models. We find that deep equilibrium models --- a recently introduced class of implicit neural networks --- solve maximum a-posteriori (MAP) estimates for the latents and parameters of the transformation. Our analysis provides a systematic way to relate activation functions, dropout, and layer structure, to statistical assumptions about the observations, thus providing foundational principles for unsupervised DEQs. For hierarchical latents, individual neurons can be interpreted as nodes in a deep graphical model. Our DEQ feature maps are end-to-end differentiable, enabling fine-tuning for downstream tasks.

Poster
Zihan Wang · Jason Lee · Qi Lei

[ Auditorium 1 Foyer ]

Understanding when and to what extent a model's gradient leaks the information of the training samples is an essential question in privacy. In this paper, we present a surprising result. Even without training and memorizing the data, we can fully recover the training samples from the gradient at a randomly chosen neural network. We prove the identifiability of reconstructing the batches of training samples under general conditions -- with shallow or deep neural networks and broad choices of activation functions. We also present efficient algorithms based on tensor decomposition to reconstruct such training data. As an effective attack for revealing sensitive training data, our findings implicate severe problems in privacy, especially in federated learning.

Poster
Ting Cai · Kirthevasan Kandasamy

[ Auditorium 1 Foyer ]

We study actively labeling streaming data, where an active learner is faced with a stream of data points and must carefully choose which of these points to label via an expensive experiment. Such problems frequently arise in applications such as healthcare and astronomy. We first study a setting when the data's inputs belong to one of K discrete distributions and formalize this problem via a loss that captures the labeling cost and the prediction error. When the labeling cost is B, our algorithm, which chooses to label a point if the uncertainty is larger than a time and cost dependent threshold, achieves a worst-case upper bound of $\tilde{\calO}(B^{\frac{1}{3}} K^{\frac{1}{3}} T^{\frac{2}{3}})$ on the loss after T rounds. We also provide a more nuanced upper bound which demonstrates that the algorithm can adapt to the arrival pattern, and achieves better performance when the arrival pattern is more favorable. We complement both upper bounds with matching lower bounds. We next study this problem when the inputs belong to a continuous domain and the output of the experiment is a smooth function with bounded RKHS norm.After T rounds in d dimensions, we show that the loss is bounded by $\tilde{\calO}(B^{\frac{1}{d+3}} T^{\frac{d+2}{d+3}})$ in an RKHS …
Poster
Shashank Singh

[ Auditorium 1 Foyer ]

Typical models of active learning assume a learner can directly manipulate or query a covariate X to study its relationship with a response Y. However, if X is a feature of a complex system, it may be possible only to indirectly influence X by manipulating a control variable Z, a scenario we refer to as Indirect Active Learning. Under a nonparametric fixed-budget model of Indirect Active Learning, we study minimax convergence rates for estimating a local relationship between X and Y, with different rates depending on the complexities and noise levels of the relationships between Z and X and between X and Y. We also derive minimax rates for passive learning under comparable assumptions, finding in many cases that, while there is an asymptotic benefit to active learning, this benefit is fully realized by a simple two-stage learner that runs two passive experiments in sequence. Experiments with simulated data validate our theoretical results.

Poster
Sivan Sabato

[ Auditorium 1 Foyer ]

Discriminative Feature Feedback is a setting first introduced by Dasgupta et al. (2018), which provides a protocol for interactive learning based on feature explanations that are provided by a human teacher. The features distinguish between the labels of pairs of possibly similar instances. That work has shown that learning in this model can have considerable statistical and computational advantages over learning in standard label-based interactive learning models.In this work, we provide new robust interactive learning algorithms for the Discriminative Feature Feedback model, with mistake bounds that are significantly lower than those of previous robust algorithms for this setting. In the adversarial setting, we reduce the dependence on the number of protocol exceptions from quadratic to linear. In addition, we provide an algorithm for a slightly more restricted model, which obtains an even smaller mistake bound for large models with many exceptions. In the stochastic setting, we provide the first algorithm that converges to the exception rate with a polynomial sample complexity. Our algorithm and analysis for the stochastic setting involve a new construction that we call Feature Influence, which may be of wider applicability.

Poster
Tyler Sypherd · Nathaniel Stromberg · Richard Nock · Visar Berisha · Lalitha Sankar

[ Auditorium 1 Foyer ]

There is a growing need for models that are interpretable and have reduced energy/computational cost (e.g., in health care analytics and federated learning). Examples of algorithms to train such models include logistic regression and boosting. However, one challenge facing these algorithms is that they provably suffer from label noise; this has been attributed to the joint interaction between oft-used convex loss functions and simpler hypothesis classes, resulting in too much emphasis being placed on outliers.In this work, we use the margin-based alpha-loss, which continuously tunes between canonical convex and quasi-convex losses, to robustly train simple models.We show that the alpha hyperparameter smoothly introduces non-convexity and offers the benefit of "giving up" on noisy training examples. We also provide results on the Long-Servedio dataset for boosting and a COVID-19 survey dataset for logistic regression, highlighting the efficacy of our approach across multiple relevant domains.

Poster
Achal Awasthi · Jason Xu

[ Auditorium 1 Foyer ]

Branching processes are a class of continuous-time Markov chains (CTMCs) prevalent for stochastic population models in ecology, biology, epidemiology, and many other fields. There transient or finite-time behavior is fully characterized by their transition probabilities. However, computing them requires marginalizing over all paths between endpoint-conditioned values, which often poses a computational bottleneck. Leveraging recent results that connect generating function methods to a compressed sensing framework, we recast this task from the lens of sparse optimization. We propose a new solution method using variable splitting; in particular, we derive closed form updates in a highly efficient ADMM algorithm. Remarkably, no matrix products---let alone inversions---are required at any step. Not only does this reduce computational cost by orders of magnitude over existing methods, but the algorithm is easily parallelizable and fairly insensitive to tuning parameters. A comparison to prior work is carried out in two applications to models of blood cell production and transposon evolution, showing that the proposed method is orders of magnitudes more scalable than existing work.

Poster
Mingtian Zhang · Yitong Sun · Chen Zhang · Steven McDonagh

[ Auditorium 1 Foyer ]

Flow-based models typically define a latent space with dimensionality identical to the observational space. In many problems, however, the data does not populate the full ambient data space that they natively reside in, rather inhabiting a lower-dimensional manifold. In such scenarios, flow-based models are unable to represent data structures exactly as their densities will always have support off the data manifold, potentially resulting in degradation of model performance. To address this issue, we propose to learn a manifold prior for flow models that leverage the recently proposed spread divergence towards fixing the crucial problem; the KL divergence and maximum likelihood estimation are ill-defined for manifold learning. In addition to improving both sample quality and representation quality, an auxiliary benefit enabled by our approach is the ability to identify the intrinsic dimension of the manifold distribution.

Poster
Joseph Janssen · Vincent Guan · Elina Robeva

[ Auditorium 1 Foyer ]

Scientists frequently prioritize learning from data rather than training the best possible model; however, research in machine learning often prioritizes the latter. Marginal contribution feature importance (MCI) was developed to break this trend by providing a useful framework for quantifying the relationships in data. In this work, we aim to improve upon the theoretical properties, performance, and runtime of MCI by introducing ultra-marginal feature importance (UMFI), which uses dependence removal techniques from the AI fairness literature as its foundation. We first propose axioms for feature importance methods that seek to explain the causal and associative relationships in data, and we prove that UMFI satisfies these axioms under basic assumptions. We then show on real and simulated data that UMFI performs better than MCI, especially in the presence of correlated interactions and unrelated features, while partially learning the structure of the causal graph and reducing the exponential runtime of MCI to super-linear.

Poster
Shalev Shaer · Gal Maman · Yaniv Romano

[ Auditorium 1 Foyer ]

This paper develops a model-free sequential test for conditional independence. The proposed test allows researchers to analyze an incoming i.i.d. data stream with any arbitrary dependency structure, and safely conclude whether a feature is conditionally associated with the response under study. We allow the processing of data points online, as soon as they arrive, and stop data acquisition once significant results are detected, rigorously controlling the type-I error rate. Our test can work with any sophisticated machine learning algorithm to enhance data efficiency to the extent possible. The developed method is inspired by two statistical frameworks. The first is the model-X conditional randomization test, a test for conditional independence that is valid in offline settings where the sample size is fixed in advance. The second is testing by betting, a ``game-theoretic'' approach for sequential hypothesis testing. We conduct synthetic experiments to demonstrate the advantage of our test over out-of-the-box sequential tests that account for the multiplicity of tests in the time horizon, and demonstrate the practicality of our proposal by applying it to real-world tasks.

Poster
Jonas Wacker · Ruben Ohana · Maurizio Filippone

[ Auditorium 1 Foyer ]

Randomized sketches of a tensor product of $p$ vectors follow a tradeoff between statistical efficiency and computational acceleration. Commonly used approaches avoid computing the high-dimensional tensor product explicitly, resulting in a suboptimal dependence of $\bigO(3^p)$ in the embedding dimension. We propose a simple Complex-to-Real (CtR) modification of well-known sketches that replaces real random projections by complex ones, incurring a lower $\bigO(2^p)$ factor in the embedding dimension. The output of our sketches is real-valued, which renders their downstream use straightforward. In particular, we apply our sketches to $p$-fold self-tensored inputs corresponding to the feature maps of the polynomial kernel. We show that our method achieves state-of-the-art performance in terms of accuracy \textit{and} speed compared to other randomized approximations from the literature.
Poster
Namrata Deka · Danica Sutherland

[ Auditorium 1 Foyer ]

We introduce a method, MMD-B-Fair, to learn fair representations of data via kernel two-sample testing. We find neural features of our data where a maximum mean discrepancy (MMD) test cannot distinguish between different values of sensitive attributes, while preserving information about the target. Minimizing the power of an MMD test is more difficult than maximizing it (as done in previous work), because the test threshold's complex behavior cannot be simply ignored. Our method exploits the simple asymptotics of block testing schemes to efficiently find fair representations without requiring the complex adversarial optimization or generative modelling schemes widely used by existing work on fair representation learning. We evaluate our approach on various datasets, showing its ability to “hide” information about sensitive attributes, and its effectiveness in downstream transfer tasks.

Poster
Youssef Allouah · Sadegh Farhadkhani · Rachid Guerraoui · Nirupam Gupta · Rafael Pinot · John Stephan

[ Auditorium 1 Foyer ]

Byzantine machine learning (ML) aims to ensure the resilience of distributed learning algorithms to misbehaving (or \emph{Byzantine}) machines. Although this problem received significant attention, prior works often assume the data held by the machines to be {\em homogeneous}, which is seldom true in practical settings. Data \emph{heterogeneity} makes Byzantine ML considerably more challenging, since a Byzantine machine can hardly be distinguished from a non-Byzantine outlier. A few solutions have been proposed to tackle this issue, but these provide suboptimal probabilistic guarantees and fare poorly in practice.This paper closes the theoretical gap, achieving optimality and inducing good empirical results. In fact, we show how to automatically adapt existing solutions for (homogeneous) Byzantine ML to the heterogeneous setting through a powerful mechanism, we call {\em nearest neighbor mixing} (NNM), which boosts any standard robust distributed gradient descent variant to yield optimal Byzantine resilience under heterogeneity. We obtain similar guarantees (in expectation) by plugging NNM in the distributed {\em stochastic} heavy ball method, a practical substitute to distributed gradient descent. We obtain empirical results that significantly outperform state-of-the-art Byzantine ML solutions.

Poster
Xun Qian · Hanze Dong · Tong Zhang · Peter Richtarik

[ Auditorium 1 Foyer ]

Communication overhead is well known to be a key bottleneck in large scale distributed learning, and a particularly successful class of methods which help to overcome this bottleneck is based on the idea of communication compression. Some of the most practically effective gradient compressors, such as TopK, are biased, which causes convergence issues unless one employs a well designed {\em error compensation/feedback} mechanism. Error compensation is therefore a fundamental technique in the distributed learning literature. In a recent development, Qian et al (NeurIPS 2021) showed that the error-compensation mechanism can be combined with acceleration/momentum, which is another key and highly successful optimization technique. In particular, they developed the error-compensated loop-less Katyusha (ECLK) method, and proved an accelerated linear rate in the strongly convex case. However, the dependence of their rate on the compressor parameter does not match the best dependence obtainable in the non-accelerated error-compensated methods. Our work addresses this problem. We propose several new accelerated error-compensated methods using the {\em catalyst acceleration} technique, and obtain results that match the best dependence on the compressor parameter in non-accelerated error-compensated methods up to logarithmic terms.

Poster
Tam Le · Truyen Nguyen · Kenji Fukumizu

[ Auditorium 1 Foyer ]

Optimal transport (OT) is a popular and powerful tool for comparing probability measures. However, OT suffers a few drawbacks: (i) input measures required to have the same mass, (ii) a high computational complexity, and (iii) indefiniteness which limits its applications on kernel-dependent algorithmic approaches. To tackle issues (ii)--(iii), Le et al. (2022) recently proposed Sobolev transport for measures on a graph having the same total mass by leveraging the graph structure over supports. In this work, we consider measures that may have different total mass and are supported on a graph metric space. To alleviate the disadvantages (i)--(iii) of OT, we propose a novel and scalable approach to extend Sobolev transport for this unbalanced setting where measures may have different total mass. We show that the proposed unbalanced Sobolev transport (UST) admits a closed-form formula for fast computation, and it is also negative definite. Additionally, we derive geometric structures for the UST and establish relations between our UST and other transport distances. We further exploit the negative definiteness to design positive definite kernels and evaluate them on various simulations to illustrate their fast computation and comparable performances against other transport baselines for unbalanced measures on a graph.

Poster
Konstantin Klemmer · Nathan Safir · Daniel Neill

[ Auditorium 1 Foyer ]

Graph neural networks (GNNs) provide a powerful and scalable solution for modeling continuous spatial data. However, they often rely on Euclidean distances to construct the input graphs. This assumption can be improbable in many real-world settings, where the spatial structure is more complex and explicitly non-Euclidean (e.g., road networks). Here, we propose PE-GNN, a new framework that incorporates spatial context and correlation explicitly into the models. Building on recent advances in geospatial auxiliary task learning and semantic spatial embeddings, our proposed method (1) learns a context-aware vector encoding of the geographic coordinates and (2) predicts spatial autocorrelation in the data in parallel with the main task. On spatial interpolation and regression tasks, we show the effectiveness of our approach, improving performance over different state-of-the-art GNN approaches. We observe that our approach not only vastly improves over the GNN baselines, but can match Gaussian processes, the most commonly utilized method for spatial interpolation problems.

Poster
Xinyue Xia · Gal Mishne · Yusu Wang

[ Auditorium 1 Foyer ]

Graphons are general and powerful models for generating graphs of varying size. In this paper, we propose to directly model graphons using neural networks, obtaining Implicit Graphon Neural Representation (IGNR). Existing work in modeling and reconstructing graphons often approximates a target graphon by a fixed resolution piece-wise constant representation. Our IGNR has the benefit that it can represent graphons up to arbitrary resolutions, and enables natural and efficient generation of arbitrary sized graphs with desired structure once the model is learned. Furthermore, we allow the input graph data to be unaligned and have different sizes by leveraging the Gromov-Wasserstein distance. We first demonstrate the effectiveness of our model by showing its superior performance on a graphon learning task. We then propose an extension of IGNR that can be incorporated into an auto-encoder framework, and demonstrate its good performance under a more general setting of graphon learning. We also show that our model is suitable for graph representation learning and graph generation.

Poster
Sarath Pattathil · Kaiqing Zhang · Asuman Ozdaglar

[ Auditorium 1 Foyer ]

Multi-agent interactions are increasingly important in the context of reinforcement learning, and the theoretical foundations of policy gradient methods have attracted surging research interest. We investigate the global convergence of natural policy gradient (NPG) algorithms in multi-agent learning. We first show that vanilla NPG may not have parameter convergence, i.e., the convergence of the vector that parameterizes the policy, even when the payoffs are regularized (which enabled strong convergence guarantees in the policy space in the literature). This non-convergence of parameters leads to stability issues in learning, which becomes especially relevant in the function approximation setting, where we can only operate on low-dimensional parameters, instead of the high-dimensional policy. We then propose variants of the NPG algorithm, for several standard multi-agent learn- ing scenarios: two-player zero-sum matrix and Markov games, and multi-player monotone games, with global last-iterate parameter convergence guarantees. Note that in our algorithms, the agents take symmetric roles. Our results might also be of independent interest for solving nonconvex-nonconcave minimax optimization problems with certain structures. Simulations are also provided to corroborate our theoretical findings.

Poster
Luca Masserano · Tommaso Dorigo · Rafael Izbicki · Mikael Kuusela · Ann Lee

[ Auditorium 1 Foyer ]

Prediction algorithms, such as deep neural networks (DNNs), are used in many domain sciences to directly estimate internal parameters of interest in simulator-based models, especially in settings where the observations include images or complex high-dimensional data. In parallel, modern neural density estimators, such as normalizing flows, are becoming increasingly popular for uncertainty quantification, especially when both parameters and observations are high-dimensional. However, parameter inference is an inverse problem and not a prediction task; thus, an open challenge is to construct conditionally valid and precise confidence regions, with a guaranteed probability of covering the true parameters of the data-generating process, no matter what the (unknown) parameter values are, and without relying on large-sample theory. Many simulator-based inference (SBI) methods are indeed known to produce biased or overly confident parameter regions, yielding misleading uncertainty estimates. This paper presents WALDO, a novel method to construct confidence regions with finite-sample conditional validity by leveraging prediction algorithms or posterior estimators that are currently widely adopted in SBI. WALDO reframes the well-known Wald test statistic, and uses a computationally efficient regression-based machinery for classical Neyman inversion of hypothesis tests. We apply our method to a recent high-energy physics problem, where prediction with DNNs has previously led …

Poster
Jing Wang · Peng Zhao · Zhi-Hua Zhou

[ Auditorium 1 Foyer ]

Non-stationary parametric bandits have attracted much attention recently. There are three principled ways to deal with non-stationarity, including sliding-window, weighted, and restart strategies. As many non-stationary environments exhibit gradual drifting patterns, the weighted strategy is commonly adopted in real-world applications. However, previous theoretical studies show that its analysis is more involved and the algorithms are either computationally less efficient or statistically suboptimal. This paper revisits the weighted strategy for non-stationary parametric bandits. In linear bandits (LB), we discover that this undesirable feature is due to an inadequate regret analysis, which results in an overly complex algorithm design. We propose a \emph{refined analysis framework}, which simplifies the derivation and importantly produces a simpler weight-based algorithm that is as efficient as window/restart-based algorithms while retaining the same regret as previous studies. Furthermore, our new framework can be used to improve regret bounds of other parametric bandits, including Generalized Linear Bandits (GLB) and Self-Concordant Bandits (SCB). For example, we develop a simple weighted GLB algorithm with an $\Ot(k_\mu^{\sfrac{5}{4}} c_\mu^{-\sfrac{3}{4}} d^{\sfrac{3}{4}} P_T^{\sfrac{1}{4}}T^{\sfrac{3}{4}})$ regret, improving the $\Ot(k_\mu^{2} c_\mu^{-1}d^{\sfrac{9}{10}} P_T^{\sfrac{1}{5}}T^{\sfrac{4}{5}})$ bound in prior work, where $k_\mu$ and $c_\mu$ characterize the reward model's nonlinearity, $P_T$ measures the non-stationarity, $d$ and $T$ denote the dimension and time horizon.
Poster
Javad Azizi · Ofer Meshi · Masrour Zoghi · Maryam Karimzadehgan

[ Auditorium 1 Foyer ]

The recent literature on online learning to rank (LTR) has established the utility of prior knowledge to Bayesian ranking bandit algorithms. However, a major limitation of existing work is the requirement for the prior used by the algorithm to match the true prior.In this paper, we propose and analyze adaptive algorithms that address this issue and additionally extend these results to the linear and generalized linear models. We also consider scalar relevance feedback on top of click feedback.Moreover, we demonstrate the efficacy of our algorithms using both synthetic and real-world experiments.

Poster
Nicolas Christianson · Junxuan Shen · Adam Wierman

[ Auditorium 1 Foyer ]

We examine the problem of designing learning-augmented algorithms for metrical task systems (MTS) that exploit machine-learned advice while maintaining rigorous, worst-case guarantees on performance. We propose an algorithm, DART, that achieves this dual objective, providing cost within a multiplicative factor $(1+\epsilon)$ of the machine-learned advice (i.e., consistency) while ensuring cost within a multiplicative factor $2^{O(1/\epsilon)}$ of a baseline robust algorithm (i.e., robustness) for any $\epsilon > 0$. We show that this exponential tradeoff between consistency and robustness is unavoidable in general, but that in important subclasses of MTS, such as when the metric space has bounded diameter and in the $k$-server problem, our algorithm achieves improved, polynomial tradeoffs between consistency and robustness.
Poster
Zeju Qiu · Weiyang Liu · Tim Xiao · Zhen Liu · Umang Bhatt · Yucen Luo · Adrian Weller · Bernhard Schölkopf

[ Auditorium 1 Foyer ]

We consider the problem of iterative machine teaching, where a teacher sequentially provides examples based on the status of a learner under a discrete input space (i.e., a pool of finite samples), which greatly limits the teacher's capability. To address this issue, we study iterative teaching under a continuous input space where the input example (i.e., image) can be either generated by solving an optimization problem or drawn directly from a continuous distribution. Specifically, we propose data hallucination teaching (DHT) where the teacher can generate input data intelligently based on labels, the learner's status and the target concept. We study a number of challenging teaching setups (e.g., linear/neural learners in omniscient and black-box settings). Extensive empirical results verify the effectiveness of DHT.

Poster
· Seyed Alireza Bakhtiari · Johannes Kirschner · Matej Jusup · Ilija Bogunovic · Csaba Szepesvari

[ Auditorium 1 Foyer ]

A practical challenge in reinforcement learning are combinatorial action spaces that make planning computationally demanding. For example, in cooperative multi-agent reinforcement learning, a potentially large number of agents jointly optimize a global reward function, which leads to a combinatorial blow-up in the action space by the number of agents. As a minimal requirement, we assume access to an argmax oracle that allows to efficiently compute the greedy policy for any Q-function in the model class. Building on recent work in planning with local access to a simulator and linear function approximation, we propose efficient algorithms for this setting that lead to polynomial compute and query complexity in all relevant problem parameters. For the special case where the feature decomposition is additive, we further improve the bounds and extend the results to the kernelized setting with an efficient algorithm.

Poster
Dan Qiao · Yu-Xiang Wang

[ Auditorium 1 Foyer ]

Motivated by personalized healthcare and other applications involving sensitive data, we study online exploration in reinforcement learning with differential privacy (DP) constraints. Existing work on this problem established that no-regret learning is possible under joint differential privacy (JDP) and local differential privacy (LDP) but did not provide an algorithm with optimal regret. We close this gap for the JDP case by designing an $\epsilon$-JDP algorithm with a regret of $\widetilde{O}(\sqrt{SAH^2T}+S^2AH^3/\epsilon)$ which matches the information-theoretic lower bound of non-private learning for all choices of $\epsilon> S^{1.5}A^{0.5} H^2/\sqrt{T}$. In the above, $S$, $A$ denote the number of states and actions, $H$ denotes the planning horizon, and $T$ is the number of steps. To the best of our knowledge, this is the first private RL algorithm that achieves \emph{privacy for free} asymptotically as $T\rightarrow \infty$. Our techniques --- which could be of independent interest --- include privately releasing Bernstein-type exploration bonuses and an improved method for releasing visitation statistics. The same techniques also imply a slightly improved regret bound for the LDP case.
Poster
Aidan Scannell · Carl Henrik Ek · Arthur Richards

[ Auditorium 1 Foyer ]

Model-based reinforcement learning (RL) algorithms do not typically consider environments with multiple dynamic modes, where it is beneficial to avoid inoperable or undesirable modes. We present a model-based RL algorithm that constrains training to a single dynamic mode with high probability. This is a difficult problem because the mode constraint is a hidden variable associated with the environment's dynamics. As such, it is 1) unknown a priori and 2) we do not observe its output from the environment, so cannot learn it with supervised learning. We present a nonparametric dynamic model which learns the mode constraint alongside the dynamic modes. Importantly, it learns latent structure that our planning scheme leverages to 1) enforce the mode constraint with high probability, and 2) escape local optima induced by the mode constraint. We validate our method by showing that it can solve a simulated quadcopter navigation task whilst providing a level of constraint satisfaction both during and after training.

Poster
Hanlin Zhu · Ruosong Wang · Jason Lee

[ Auditorium 1 Foyer ]

Value function approximation is important in modern reinforcement learning (RL) problems especially when the state space is (infinitely) large. Despite the huge empirical success, the theoretical understandings of function approximation have mainly been focused on the tabular setting and the linear setting, and the theory of general value function approximation is still not well established. In this paper, we propose a provably efficient RL algorithm (both computationally and statistically) with general value function approximation. We show that if the value functions can be approximated by a function class $\mathcal{F}$ which satisfies a certain closedness guarantee, our algorithm achieves an $\widetilde{O}(\text{poly}(\iota H)\sqrt{T})$ regret bound where $\iota$ is the product of the surprise bound and log-covering numbers, $H$ is the planning horizon, $K$ is the number of episodes and $T = HK$ is the total number of steps the agent interacts with the environment. Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting. Moreover, our algorithm only needs to solve $O(H\log K)$ empirical risk minimization (ERM) problems, which is far more efficient than previous algorithms that need to solve ERM problems for $\Omega(HK)$ times.
Poster
Xiaoyan Hu · Ho-fung Leung

[ Auditorium 1 Foyer ]

We study the regret for risk-sensitive reinforcement learning (RL) with the exponential utility in the episodic MDP. Recent works establish both a lower bound $\Omega((e^{|\beta|(H-1)/2}-1)\sqrt{SAT}/|\beta|)$ and the best known (upper) bound $\Tilde{O}((e^{|\beta|H}-1)\sqrt{H^2SAT}/|\beta|)$, where $H$ is the length of the episode, $S$ the size of state space, $A$ the size of action space, $T$ the total number of timesteps, and $\beta$ the risk parameter. The gap between the upper and the lower bound is exponential and hence is unsatisfactory. In this paper, we show that a variant of \textsc{UCB-Advantage} algorithm reduces a factor of $\sqrt{H}$ from the best previously known bound in any arbitrary MDP. To further sharpen the regret bound, we introduce a brand new mechanism of regret analysis and derive a problem-dependent regret bound without prior knowledge of the MDP from the algorithm. This bound is much tighter in MDPs with special structures. Particularly, we show that a regret that matches the information-theoretic lower bound up to logarithmic factors can be attained within a rich class of MDPs, which improves an exponential factor over the best previously known bound. Further, we derive a novel information-theoretic lower bound of $\Omega(\max_{h\in[H]} c_{v,h+1}^*\sqrt{SAT}/|\beta|)$, where $\max_{h\in[H]} c_{v,h+1}^*$ is a problem-dependent statistic. This lower …
Poster
Carlos E. Luis · Alessandro Bottero · Julia Vinogradska · Felix Berkenkamp · Jan Peters

[ Auditorium 1 Foyer ]

We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning. In particular, we focus on characterizing the variance over values induced by a distribution over MDPs. Previous work upper bounds the posterior variance over values by solving a so-called uncertainty Bellman equation, but the over-approximation may result in inefficient exploration. We propose a new uncertainty Bellman equation whose solution converges to the true posterior variance over values and explicitly characterizes the gap in previous work. Moreover, our uncertainty quantification technique is easily integrated into common exploration strategies and scales naturally beyond the tabular setting by using standard deep reinforcement learning architectures. Experiments in difficult exploration tasks, both in tabular and continuous control settings, show that our sharper uncertainty estimates improve sample-efficiency.

Poster
Samuel Holt · Alihan Hüyük · Zhaozhi Qian · Hao Sun · Mihaela van der Schaar

[ Auditorium 1 Foyer ]

Many real-world offline reinforcement learning (RL) problems involve continuous-time environments with delays. Such environments are characterized by two distinctive features: firstly, the state x(t) is observed at irregular time intervals, and secondly, the current action a(t) only affects the future state x(t + g) with an unknown delay g > 0. A prime example of such an environment is satellite control where the communication link between earth and a satellite causes irregular observations and delays. Existing offline RL algorithms have achieved success in environments with irregularly observed states in time or known delays. However, environments involving both irregular observations in time and unknown delays remains an open and challenging problem. To this end, we propose Neural Laplace Control, a continuous-time model-based offline RL method that combines a Neural Laplace dynamics model with a model predictive control (MPC) planner—and is able to learn from an offline dataset sampled with irregular time intervals from an environment that has a inherent unknown constant delay. We show experimentally on continuous-time delayed environments it is able to achieve near expert policy performance.

Poster
Jia Lin Hau · Marek Petrik · Mohammad Ghavamzadeh

[ Auditorium 1 Foyer ]

Risk-averse Markov Decision Processes (MDPs) can compute policies that achieve high returns with low variability but are usually difficult to solve. Few practical risk-averse objectives admit dynamic programming (DP) formulation, which is the mainstay of most MDP and RL algorithms. In this paper, we derive a new DP formulation for discounted risk-averse MDPs with Entropic Risk Measure (ERM) and Entropic Value at Risk (EVaR) objectives. Our DP formulation, which is possible because we define value functions with time-dependent risk levels, can approximate optimal policies in a time that is polynomial in the approximation error. Then we use the ERM algorithm to optimize the EVaR objective in polynomial time using an optimized discretization scheme. Our numerical results demonstrate the viability of EVaR and ERM in discounted MDPs.

Poster
Xiang Li · Wenhao Yang · Jiadong Liang · Zhihua Zhang · Michael Jordan

[ Auditorium 1 Foyer ]

We study Q-learning with Polyak-Ruppert averaging (a.k.a., averaged Q-learning) in a $\gamma$-discounted MDP under synchronous and tabular settings. Under a Lipschitz condition, we establish a functional central limit theorem (FCLT) for the averaged iteration $\bar{\boldsymbol{Q}}_T$ and show that its standardized partial-sum process converges weakly to a rescaled Brownian motion. The FCLT implies a fully online inference method for RL. Furthermore, we show that $\bar{\boldsymbol{Q}}_T$ is actually a regular asymptotically linear (RAL) estimator for the optimal Q-value function $\boldsymbol{Q}^*$ that has the most efficient influence function. We present a nonasymptotic analysis for the $\ell_{\infty}$ error, $\mathbb{E}\|\bar{\boldsymbol{Q}}_T-\boldsymbol{Q}^*\|_{\infty}$, showing that it matches the instance-dependent lower bound for polynomial step sizes. Similar results are provided for entropy-regularized Q-Learning without the Lipschitz condition.
Poster
Andrea Tirinzoni · Matteo Pirotta · Alessandro Lazaric

[ Auditorium 1 Foyer ]

In contextual linear bandits, the reward function is assumed to be a linear combination of an unknown reward vector and a given embedding of context-arm pairs. In practice, the embedding is often learned at the same time as the reward vector, thus leading to an online representation learning problem. Existing approaches to representation learning in contextual bandits are either very generic (e.g., model-selection techniques or algorithms for learning with arbitrary function classes) or specialized to particular structures (e.g., nested features or representations with certain spectral properties). As a result, the understanding of the cost of representation learning in contextual linear bandit is still limited. In this paper, we take a systematic approach to the problem and provide a comprehensive study through an instance-dependent perspective. We show that representation learning is fundamentally more complex than linear bandits (i.e., learning with a given representation). In particular, learning with a given set of representations is never simpler than learning with the worst realizable representation in the set, while we show cases where it can be arbitrarily harder. We complement this result with an extensive discussion of how it relates to existing literature and we illustrate positive instances where representation learning is as complex …

Poster
Zaiyan Xu · Kishan Panaganti · Dileep Kalathil

[ Auditorium 1 Foyer ]

We consider the problem of learning a control policy that is robust against the parameter mismatches between the training environment and testing environment. We formulate this as a distributionally robust reinforcement learning (DR-RL) problem where the objective is to learn the policy which maximizes the value function against the worst possible stochastic model of the environment in an uncertainty set. We focus on the tabular episodic learning setting where the algorithm has access to a generative model of the nominal (training) environment around which the uncertainty set is defined. We propose the Robust Phased Value Learning (RPVL) algorithm to solve this problem for the uncertainty sets specified by four different divergences: total variation, chi-square, Kullback-Leibler, and Wasserstein. We show that our algorithm achieves $\tilde{\mathcal{O}}(|\mathcal{S}||\mathcal{A}| H^{5})$ sample complexity, which is uniformly better than the existing results by a factor of $|\mathcal{S}|$, where $|\mathcal{S}|$ is number of states, $|\mathcal{A}|$ is the number of actions, and $H$ is the horizon length. We also provide the first-ever sample complexity result for the Wasserstein uncertainty set. Finally, we demonstrate the performance of our algorithm using simulation experiments.
Poster
Jackie Baek · Vivek Farias

[ Auditorium 1 Foyer ]

Thompson sampling has become a ubiquitous approach to online decision problems with bandit feedback. The key algorithmic task for Thompson sampling is drawing a sample from the posterior of the optimal action. We propose an alternative arm selection rule we dub TS-UCB, that requires negligible additional computational effort but provides significant performance improvements relative to Thompson sampling. At each step, TS-UCB computes a score for each arm using two ingredients: posterior sample(s) and upper confidence bounds. TS-UCB can be used in any setting where these two quantities are available, and it is flexible in the number of posterior samples it takes as input. TS-UCB achieves materially lower regret on a comprehensive suite of synthetic and real-world datasets, including a personalized article recommendation dataset from Yahoo! and a suite of benchmark datasets from a deep bandit suite proposed in Riquelme et al. (2018). Finally, from a theoretical perspective, we establish optimal regret guarantees for TS-UCB for both the K-armed and linear bandit models.

Poster
Shuqi Liu · Yuzhou Cao · Qiaozhen Zhang · Lei Feng · Bo An

[ Auditorium 1 Foyer ]

In contrast to ordinary supervised classification tasks that require a vast number of data with high-quality labels, complementary-label learning (CLL) deals with the weakly-supervised learning scenario where each instance is equipped with a complementary label, which specifies a class the example does not belong to. However, existing statistically consistent CLL approaches usually suffer from overfitting intrinsically. Although there exist other overfitting-resistant CLL approaches, they can only work with limited losses or lacks statistical guarantees. In this paper, we aim to propose overfitting-resistant and theoretically sound approaches for CLL. Considering the unique property of the distribution of complementarily labeled samples, we provide a risk estimator via order-preserving losses, which are naturally non-negative and thus can avoid overfitting caused by negative terms in risk estimators. Moreover, we provide classifier-consistency analysis and statistical guarantee for this estimator. Furthermore, we provide a reweighed version of the proposed risk estimator to further enhance its generalization ability and prove its statistical consistency. Experiments on benchmark datasets demonstrate the efficiency of our proposed methods.

Poster
Lawrence Stewart · Francis Bach · Quentin Berthet · Jean-Philippe Vert

[ Auditorium 1 Foyer ]

Neural networks can be trained to solve regression problems by using gradient-based methods to minimize the square loss. However, practitioners often prefer to reformulate regression as a classification problem, observing that training on the cross entropy loss results in better performance. By focusing on two-layer ReLU networks, which can be fully characterized by measures over their feature space, we explore how the implicit bias induced by gradient-based optimization could partly explain the above phenomenon. We provide theoretical evidence that the regression formulation yields a measure whose support can differ greatly from that for classification, in the case of one-dimensional data. Our proposed optimal supports correspond directly to the features learned by the input layer of the network. The different nature of these supports sheds light on possible optimization difficulties the square loss could encounter during training, and we present empirical results illustrating this phenomenon.

Poster
Ikko Yamane · Yann Chevaleyre · Takashi Ishida · Florian Yger

[ Auditorium 1 Foyer ]

In \emph{mediated uncoupled learning} (MU-learning), the goal is to predict an output variable $Y$ given an input variable $X$ as in ordinary supervised learning while the training dataset has no joint samples of $(X, Y)$ but only independent samples of $(X, U)$ and $(U, Y)$ each observed with a \emph{mediating} variable $U$. The existing MU-learning methods can only handle the squared loss, which prohibited the use of other popular loss functions such as the cross-entropy loss. We propose a general MU-learning framework that allows for the problems with Bregman divergences, which cover a wide range of loss functions useful for various types of tasks, in a unified manner. This loss family has \emph{maximal generality} among those whose minimizers characterize the conditional expectation. We prove that the proposed objective function is a tighter approximation to the oracle loss that one would minimize if ordinary supervised samples of $(X, Y)$ were available. We also propose an estimator of an interval containing the expected test loss of predictions of a trained model only using $(X, U)$- and $(U, Y)$-data. We provide a theoretical analysis on the excess risk for the proposed method and confirm its practical usefulness with regression experiments with synthetic data …
Poster
Pranjal Awasthi · Corinna Cortes · Christopher Mohri

[ Auditorium 1 Foyer ]

We study a problem of \emph{batch distribution drift} motivated by several applications, which consists of determining an accurate predictor for a target time segment, for which a moderate amount of labeled samples are at one's disposal, while leveraging past segments for which substantially more labeled samples are available. We give new algorithms for this problem guided by a new theoretical analysis and generalization bounds derived for this scenario. We further extend our results to the case where few or no labeled data is available for the period of interest. Finally, we report the results of extensive experiments demonstrating the benefits of our drifting algorithm, including comparisons with natural baselines. A by-product of our study is a principled solution to the problem of multiple-source adaptation with labeled source data and a moderate amount of target labeled data, which we briefly discuss and compare with.

Poster
Xiangyu Sun · Oliver Schulte · Guiliang Liu · Pascal Poupart

[ Auditorium 1 Foyer ]

We describe NTS-NOTEARS, a score-based structure learning method for time-series data to learn dynamic Bayesian networks (DBNs) that captures nonlinear, lagged (inter-slice) and instantaneous (intra-slice) relations among variables. NTS-NOTEARS utilizes 1D convolutional neural networks (CNNs) to model the dependence of child variables on their parents; 1D CNN is a neural function approximation model well-suited for sequential data. DBN-CNN structure learning is formulated as a continuous optimization problem with an acyclicity constraint, following the NOTEARS DAG learning approach. We show how prior knowledge of dependencies (e.g., forbidden and required edges) can be included as additional optimization constraints. Empirical evaluation on simulated and benchmark data shows that NTS-NOTEARS achieves state-of-the-art DAG structure quality compared to both parametric and nonparametric baseline methods, with improvement in the range of 10-20% on the F1-score. We also evaluate NTS-NOTEARS on complex real-world data acquired from professional ice hockey games that contain a mixture of continuous and discrete variables. The code is available online.

Poster
Syama Sundar Rangapuram · Shubham Kapoor · Rajbir Nirwan · Pedro Mercado · Tim Januschowski · Yuyang Wang · Michael Bohlke-Schneider

[ Auditorium 1 Foyer ]

Forecasts at different time granularities are required in practice for addressing various business problems starting from short-term operational to medium-term tactical and to long-term strategic planning. These forecasting problems are usually treated independently by learning different ML models which results in forecasts that are not consistent with the temporal aggregation structure, leading to inefficient decision making. Some of the recent work addressed this problem, however, it uses a post-hoc reconciliation strategy, which results in sub-optimal results and cannot produce probabilistic forecasts. In this paper, we present a global model that produces coherent, probabilistic forecasts for different time granularities by learning joint embeddings for the different aggregation levels with graph neural networks and temporal reconciliation. Temporal reconciliation not only enables consistent decisions for business problems across different planning horizons but also improves the quality of forecasts at finer time granularities. A thorough empirical evaluation illustrates the benefits of the proposed method.

Poster
Charline Le Lan · Joshua Greaves · Jesse Farebrother · Mark Rowland · Fabian Pedregosa · Rishabh Agarwal · Marc G. Bellemare

[ Auditorium 1 Foyer ]

Many machine learning problems encode their data as a matrix with a possibly very large number of rows and columns.In several applications like neuroscience, image compression or deep reinforcement learning, the principal subspace of such a matrix provides a useful, low-dimensional representation of individual data.Here, we are interested in determining the $d$-dimensional principal subspace of a given matrix from sample entries, i.e. from small random submatrices.Although a number of sample-based methods exist for this problem (e.g. Oja's rule \citep{oja1982simplified}), these assume access to full columns of the matrix or particular matrix structure such as symmetry and cannot be combined as-is with neural networks \citep{baldi1989neural}. In this paper, we derive an algorithm that learns a principal subspace from sample entries, can be applied when the approximate subspace is represented by a neural network, and hence can be scaled to datasets with an effectively infinite number of rows and columns.Our method consists in defining a loss function whose minimizer is the desired principal subspace, and constructing a gradient estimate of this loss whose bias can be controlled. We complement our theoretical analysis with a series of experiments on synthetic matrices, the MNIST dataset \citep{lecun2010mnist} and the reinforcement learning domain PuddleWorld \citep{sutton1995generalization} demonstrating …
Poster
Ehsan Amid · Richard Nock · Manfred K. Warmuth

[ Auditorium 1 Foyer ]

The link with exponential families has allowed k-means clustering to be generalized to a wide variety of data-generating distributions in exponential families and clustering distortions among Bregman divergences. Getting the framework to go beyond exponential families is important to lift roadblocks like the lack of robustness of some population minimizers, which is carved into their axiomatization. Current generalizations of exponential families like the q-exponential families or even the deformed exponential families fail at achieving the goal. In this paper, we provide a new attempt at getting a complete framework, grounded in a new generalization of exponential families that we introduce, called tempered exponential measures (TEMs). TEMs keep the maximum entropy axiomatization framework of q-exponential families, but instead of normalizing the measure, normalize a dual called a co-distribution. Numerous interesting properties arise for clustering, such as improved and controllable robustness for population minimizers, that keep a simple analytic form.

Poster
Chandramauli Chakraborty · Sayan Paul · Saptarshi Chakraborty · Swagatam Das

[ Auditorium 1 Foyer ]

Clustering complex high-dimensional data is particularly challenging as the signal-to-noise ratio in such data is significantly lower than their classical counterparts. This is mainly because most of the features describing a data point have little to no information about the natural grouping of the data. Filtering such features is, thus, critical in harnessing meaningful information from such large-scale data. Many recent methods have attempted to find feature importance in a centroid-based clustering setting. Though empirically successful in classical low-dimensional settings, most do not perform up to the mark, especially on microarray and single-cell RNA-seq data. This paper extends the merits of weighted center-based clustering through the Ordered Weighted $\ell_1$ (OWL) norm for better feature selection. Appealing to the elegant properties of block coordinate-descent and Frank-Wolf algorithms, we are not only able to maintain computational efficiency but also able to outperform the state-of-the-art in high-dimensional settings. The proposal also comes with finite sample theoretical guarantees, including a rate of $\mathcal{O}\left(\sqrt{k \log p/n}\right)$, under model-sparsity, bridging the gap between theory and practice of weighted clustering.
Poster
Dragana Bajovic · Dusan Jakovetic · Soummya Kar

[ Auditorium 1 Foyer ]

Recent works have shown that high probability metrics with stochastic gradient descent (SGD) exhibit informativeness and in some cases advantage over the commonly adopted mean-square error-based ones. In this work we provide a formal framework for the study of general high probability bounds with SGD, based on the theory of large deviations. The framework allows for a generic (not-necessarily bounded) gradient noise satisfying mild technical assumptions, including the dependence of the noise distribution on the current iterate. Under the preceding assumptions, we find an upper large deviation bound for SGD with strongly convex functions. The corresponding rate function captures analytical dependence on the noise distribution and other problem parameters. This is in contrast with conventional mean-square error analysis that captures only the noise dependence through the variance and does not capture the effect of higher order moments or distribution skew. We also derive exact large deviation rates for the case when the objective function is quadratic and show that the obtained function matches the one from the general upper bound hence showing the tightness of the general upper bound. Numerical examples illustrate and corroborate theoretical findings.

Poster
Charles ASSAAD · Imad Ez-zejjari · Lei ZAN

[ Auditorium 1 Foyer ]

This paper presents an approach for identifying the root causes of collective anomalies given observational time series and an acyclic summary causal graph which depicts an abstraction of causal relations present in a dynamic system at its normal regime. The paper first shows how the problem of root cause identification can be divided into many independent subproblems by grouping related anomalies using d-separation. Further, it shows how, under this setting, some root causes can be found directly from the graph and from the time of appearance of anomalies. Finally, it shows, how the rest of the root causes can be found by comparing direct causal effects in the normal and in the anomalous regime. To this end, temporal adaptations of the back-door and the single-door criterions are introduced. Extensive experiments conducted on both simulated and real-world datasets demonstrate the effectiveness of the proposed method.

Poster
Ankur Ankan · Inge M N Wortel · Kenneth Bollen · Johannes Textor

[ Auditorium 1 Foyer ]

Measurement error is ubiquitous in many variables – from blood pressure recordings in physiology to intelligence measures in psychology. Structural equation models (SEMs) account for the process of measurement by explicitly distinguishing between latent variables and their measurement indicators. Users often fit entire SEMs to data, but this can fail if some model parameters are not identified. The model-implied instrumental variables (MIIVs) approach is a more flexible alternative that can estimate subsets of model parameters in identified equations. Numerous methods to identify individual parameters also exist in the field of graphical models (such as DAGs), but many of these do not account for measurement effects. Here, we take the concept of “latent-to-observed” (L2O) transformation from the MIIV approach and develop an equivalent graphical L2O transformation that allows applying existing graphical criteria to latent parameters in SEMs. We combine L2O transformation with graphical instrumental variable criteria to obtain an efficient algorithm for non-iterative parameter identification in SEMs with latent variables. We prove that this graphical L2O transformation with the instrumental set criterion is equivalent to the state-of-the-art MIIV approach for SEMs, and show that it can lead to novel identification strategies when combined with other graphical criteria.

Poster
Sofia Triantafyllou · Fattaneh Jabbari · Gregory Cooper

[ Auditorium 1 Foyer ]

Decision making often depends on causal effect estimation. For example, clinical decisions are often based on estimates of the probability of post-treatment outcomes. Experimental data from randomized controlled trials allow for unbiased estimation of these probabilities. However, such data are usually limited in the number of samples and the set of measured covariates. Observational data, such as electronic medical records, contain many more samples and a richer set of measured covariates, which can be used to estimate more personalized treatment effects; however, these estimates may be biased due to latent confounding.In this work, we propose a Bayesian method for combining observationaland experimental data for unbiased conditional treatment effect estimation. Our method addresses the following question: Given observational data $D_o$ measuring a set of covariates $\mathbf V$, and experimental data $D_e$ measuring a possibly smaller set of covariates $\mathbf{V_b}\subseteq \mathbf{V}$, which set of covariates $\mathbf{Z}$ leads to the optimal, unbiased prediction of the post-intervention outcome $P(Y |do(X), \mathbf{Z})$, and when can we use observational data for this estimation? In simulated data, we show that our method improves the prediction of post-intervention outcomes.
Poster
Manuele Leonelli · Gherardo Varando

[ Auditorium 1 Foyer ]

Causal discovery algorithms aim at untangling complex causal relationships from data. Here, we study causal discovery and inference methods based on staged tree models, which can represent complex and asymmetric causal relationships between categorical variables. We provide a first graphical representation of the equivalence class of a staged tree, by looking only at a specific subset of its underlying independences. We further define a new pre-metric, inspired by the widely used structural intervention distance, to quantify the closeness between two staged trees in terms of their corresponding causal inference statements. A simulation study highlights the efficacy of staged trees in uncovering complexes, asymmetric causal relationships from data, and real-world data applications illustrate their use in practical causal analysis.

Poster
Alicia Curth · Mihaela van der Schaar

[ Auditorium 1 Foyer ]

We study the problem of inferring heterogeneous treatment effects (HTEs) from time-to-event data in the presence of competing events. Albeit its great practical relevance, this problem has received little attention compared to its counterparts studying HTE estimation without time-to-event data or competing events. We take an outcome modeling approach to estimating HTEs, and consider how and when existing prediction models for time-to-event data can be used as plug-in estimators for potential outcomes. We then investigate whether competing events present new challenges for HTE estimation -- in addition to the standard confounding problem --, and find that, because there are multiple definitions of causal effects in this setting -- namely total, direct and separable effects --, competing events can act as an additional source of covariate shift depending on the desired treatment effect interpretation and associated estimand. We theoretically analyze and empirically illustrate when and how these challenges play a role when using generic machine learning prediction models for the estimation of HTEs.

Poster
Jie Shen

[ Auditorium 1 Foyer ]

We study the problem of efficient PAC learning of halfspaces in $\mathbb{R}^d$ in the presence of the malicious noise, where a fraction of the training samples are adversarially corrupted. A series of recent works have developed polynomial-time algorithms that enjoy near-optimal sample complexity and noise tolerance, yet leaving open whether a {\em linear-time} algorithm exists and matches these appealing statistical performance guarantees. In this work, we give an affirmative answer by developing an algorithm that runs in time $\tilde{O}(m d )$, where $m = \tilde{O}(\frac{d}{\epsilon})$ is the sample size and $\epsilon \in (0, 1)$ is the target error rate. Notably, the computational complexity of all prior algorithms suffer either a high order dependence on the problem size, or is implicitly proportional to $\frac{1}{\epsilon^2}$ through the sample size. Our key idea is to combine localization and an approximate version of matrix multiplicative weights update method to progressively downweight the contribution of the corrupted samples while refining the learned halfspace.
Poster
Grzegorz Gluch · Khashayar Barooti · Rüdiger Urbanke

[ Auditorium 1 Foyer ]

Considering arbitrary test examples is a new approach for addressing adversarial robustness. This model was introduced by Goldwasser et al.~[2020], where the authors present a selective and transductive learning algorithm which guarantees a low test error and low rejection rate wrt to the original distribution. Moreover, a lower bound, in terms of the VC-dimension, standard risk and number of samples, is presented. We show that this barrier can be broken in the quantum world.We consider a new model, influenced by the quantum PAC-learning model introduced by Bshouty and Jackson~[1995] and similar in spirit to Goldwaser et al.~[2020]. In this model we give an interactive protocol between the learner and the adversary (at test-time) that guarantees robustness. In this work we break the lower bound from Goldwasser et al.~[2020].From the technical perspective, our protocol is inspired by recent advances in delegation of quantum computation, e.g. Mahadev~[2018]. But in order to be applicable to our task, we extend the delegation protocol to enable a new feature, e.g. by extending delegation of decision problems, i.e. BQP, to sampling problems with adversarially chosen inputs.

Poster
Hippolyte Bourel · Anders Jonsson · Odalric Maillard · Mohammad Sadegh Talebi

[ Auditorium 1 Foyer ]

We study reinforcement learning (RL) for decision processes with non-Markovian reward, in which high-level knowledge in the form of reward machines is available to the learner. Specifically, we investigate the efficiency of RL under the average-reward criterion, in the regret minimization setting. We propose two model-based RL algorithms that each exploits the structure of the reward machines, and show that our algorithms achieve regret bounds that improve over those of baselines by a multiplicative factor proportional to the number of states in the underlying reward machine. To the best of our knowledge, the proposed algorithms and associated regret bounds are the first to tailor the analysis specifically to reward machines, either in the episodic or average-reward settings. We also present a regret lower bound for the studied setting, which indicates that the proposed algorithms achieve a near-optimal regret. Finally, we report numerical experiments that demonstrate the superiority of the proposed algorithms over existing baselines in practice.

Poster
Fares Fourati · Vaneet Aggarwal · Christopher Quinn · Mohamed-Slim Alouini

[ Auditorium 1 Foyer ]

We investigate the problem of unconstrained combinatorial multi-armed bandits with full-bandit feedback and stochastic rewards for submodular maximization. Previous works investigate the same problem assuming a submodular and monotone reward function. In this work, we study a more general problem, i.e., when the reward function is not necessarily monotone, and the submodularity is assumed only in expectation. We propose Randomized Greedy Learning (RGL) algorithm and theoretically prove that it achieves a $\frac{1}{2}$-regret upper bound of $\Tilde{\mathcal{O}}(n T^{\frac{2}{3}})$ for horizon $T$ and number of arms $n$. We also show in experiments that RGL empirically outperforms other full-bandit variants in submodular and non-submodular settings.
Poster
Zixian Yang · R Srikant · Lei Ying

[ Auditorium 1 Foyer ]

Multi-server queueing systems are widely used models for job scheduling in machine learning, wireless networks, and crowdsourcing. This paper considers a multi-server system with multiple servers and multiple types of jobs, where different job types require different amounts of processing time at different servers. The goal is to schedule jobs on servers without knowing the statistics of the processing times. To fully utilize the processing power of the servers, it is known that one has to at least learn the service rates of different job types on different servers. Prior works on this topic decouple the learning and scheduling phases which leads to either excessive exploration or extremely large job delays. We propose a new algorithm, which combines the MaxWeight scheduling policy with discounted upper confidence bound (UCB), to simultaneously learn the statistics and schedule jobs to servers. We obtain performance bounds for our algorithm that hold for both stationary and nonstationary service rates. Simulations confirm that the delay performance of our algorithm is several orders of magnitude better than previously proposed algorithms. Our algorithm also has the added benefit that it can handle non-stationarity in the service processes.

Poster
Çağın Ararat · Cem Tekin

[ Auditorium 1 Foyer ]

We introduce vector optimization problems with stochastic bandit feedback, in which preferences among designs are encoded by a polyhedral ordering cone $C$. Our setup generalizes the best arm identification problem to vector-valued rewards by extending the concept of Pareto set beyond multi-objective optimization. We characterize the sample complexity of ($\epsilon,\delta$)-PAC Pareto set identification by defining a new cone-dependent notion of complexity, called the {\em ordering complexity}. In particular, we provide gap-dependent and worst-case lower bounds on the sample complexity and show that, in the worst-case, the sample complexity scales with the square of ordering complexity. Furthermore, we investigate the sample complexity of the naïve elimination algorithm and prove that it nearly matches the worst-case sample complexity. Finally, we run experiments to verify our theoretical results and illustrate how $C$ and sampling budget affect the Pareto set, the returned ($\epsilon,\delta$)-PAC Pareto set, and the success of identification.
Poster
Patrick Saux · Odalric Maillard

[ Auditorium 1 Foyer ]

In decision-making problems such as the multi-armed bandit, an agent learns sequentially by optimizing a certain feedback. While the mean reward criterion has been extensively studied, other measures that reflect an aversion to adverse outcomes, such as mean-variance or conditional value-at-risk (CVaR), can be of interest for critical applications (healthcare, agriculture). Algorithms have been proposed for such risk-aware measures under bandit feedback without contextual information. In this work, we study contextual bandits where such risk measures can be elicited as linear functions of the contexts through the minimization of a convex loss. A typical example that fits within this framework is the expectile measure, which is obtained as the solution of an asymmetric least-square problem. Using the method of mixtures for supermartingales, we derive confidence sequences for the estimation of such risk measures. We then propose an optimistic UCB algorithm to learn optimal risk-aware actions, with regret guarantees similar to those of generalized linear bandits. This approach requires solving a convex problem at each round of the algorithm, which we can relax by allowing only approximated solution obtained by online gradient descent, at the cost of slightly higher regret. We conclude by evaluating the resulting algorithms on numerical experiments.

Poster
Yueyang Liu · Benjamin Van Roy · Kuang Xu

[ Auditorium 1 Foyer ]

Thompson sampling has proven effective across a wide range of stationary bandit environments. However, as we demonstrate in this paper, it can perform poorly when applied to nonstationary environments. We show that such failures are attributed to the fact that, when exploring, the algorithm does not differentiate actions based on how quickly the information acquired loses its usefulness due to nonstationarity. Building upon this insight, we propose predictive sampling, an algorithm that deprioritizes acquiring information that quickly loses usefulness. Theoretical guarantee on the performance of predictive sampling is established through a Bayesian regret bound. We provide versions of predictive sampling for which computations tractably scale to complex bandit environments of practical interest. Through numerical simulation, we demonstrate that predictive sampling outperforms Thompson sampling in all nonstationary environments examined.

Poster
Yusha Liu · Aarti Singh

[ Auditorium 1 Foyer ]

In continuum-armed bandit problems where the underlying function resides in a reproducing kernel Hilbert space (RKHS), namely, the kernelised bandit problems, an important open problem remains of how well learning algorithms can adapt if the regularity of the associated kernel function is unknown. In this work, we study adaptivity to the regularity of translation-invariant kernels, which is characterized by the decay rate of the Fourier transformation of the kernel, in the bandit setting. We derive an adaptivity lower bound, proving that it is impossible to simultaneously achieve optimal cumulative regret in a pair of RKHSs with different regularities. To verify the tightness of this lower bound, we show that an existing bandit model selection algorithm applied with minimax non-adaptive kernelised bandit algorithms matches the lower bound in dependence of T, the total number of steps, except for log factors. By filling in the regret bounds for adaptivity between RKHSs, we connect the statistical difficulty for adaptivity in continuum-armed bandits in three fundamental types of function spaces: RKHS, Sobolev space, and Holder space.

Poster
Soumyabrata Pal · Arun Sai Suggala · Karthikeyan Shanmugam · Prateek Jain

[ Auditorium 1 Foyer ]

We consider the problem of latent bandits with cluster structure where there are multiple agents, each with an associated multi-armed bandit problem. These agents are grouped into latent clusters such that the mean reward vectors of agents within the same cluster are equivalent. At each round, an agent, selected uniformly at random, pulls an arm and observes a corresponding noisy reward. The goal of the agents is to collaborate among each other to maximize their cumulative rewards. This problem is of practical importance in recommendation systems in frameworks such as user-user collaborative filtering, and has received wide attention of late. We propose LATTICE (Latent bAndiTs via maTrIx ComplEtion), the first algorithm for this problem that achieves the minimax optimal regret of $\widetilde{O}(\sqrt{(A+B)T})$ when the number of clusters is $O(1)$. Here, $A, B$ are the number of items and agents respectively, and $T$ is the number of online rounds. Without collaboration, any algorithm will suffer $\Omega(\sqrt{ABT})$ regret. The key novelty in our algorithm is the reduction of online learning problem to offline low-rank matrix completion problem. Our algorithm is computationally efficient and only requires $O(\log{T})$ calls to an offline matrix completion oracle across all $T$ rounds.
Poster
Taira Tsuchiya · Shinji Ito · Junya Honda

[ Auditorium 1 Foyer ]

We consider the combinatorial semi-bandit problem and present a new algorithm with a best-of-both-worlds regret guarantee; the regrets are bounded near-optimally in the stochastic and adversarial regimes. In the stochastic regime, we prove a variance-dependent regret bound depending on the tight suboptimality gap introduced by Kveton et al. (2015) with a good leading constant. In the adversarial regime, we show that the same algorithm simultaneously obtains various data-dependent regret bounds. Our algorithm is based on the follow-the-regularized-leader framework with a refined regularizer and adaptive learning rate. Finally, we numerically test the proposed algorithm and confirm its superior or competitive performance over existing algorithms such as Thompson sampling in most settings.

Poster
Hanjun Dai · Yuan Xue · Niao He · Yixin Wang · Na Li · Dale Schuurmans · Bo Dai

[ Auditorium 1 Foyer ]

In real-world decision-making, uncertainty is important yet difficult to handle. Stochastic dominance provides a theoretically sound approach to comparing uncertain quantities, but optimization with stochastic dominance constraints is often computationally expensive, which limits practical applicability. In this paper, we develop a simple yet efficient approach for the problem, Light Stochastic Dominance Solver (light-SD), by leveraging properties of the Lagrangian. We recast the inner optimization in the Lagrangian as a learning problem for surrogate approximation, which bypasses the intractability and leads to tractable updates or even closed-form solutions for gradient calculations. We prove convergence of the algorithm and test it empirically. The proposed light-SD demonstrates superior performance on several representative problems ranging from finance to supply chain management.

Poster
Negin Golrezaei · Patrick Jaillet · Jason Cheuk Nam Liang

[ Auditorium 1 Foyer ]

We consider a dynamic pricing problem for repeated contextual second-price auctions with multiple strategic buyers who aim to maximize their long-term time discounted utility. The seller has limited information on buyers' overall demand curves which depends on a non-parametric market-noise distribution, and buyersmay potentially submit corrupted bids (relative to true valuations) to manipulate the seller's pricing policy for more favorable reserve prices in the future. We focus on designing the seller's learning policy to set contextual reserve prices where the seller's goal is to minimize regret compared to the revenue of a benchmark clairvoyant policy that has full information of buyers' demand. We propose a policy with a phased-structure that incorporates randomized ``isolation" periods, during which a buyer is randomly chosen to solely participate in the auction. We show that this design allows the seller to control the number of periods in which buyers significantly corrupt their bids. We then prove that our policy enjoys a T-period regret of O(\sqrt{T}) facing strategic buyers. Finally, we conduct numerical simulations to compare our proposed algorithm to standard pricing policies. Our numerical results show that our algorithm outperforms these policies under various buyer bidding behavior.

Poster
Quentin Bertrand · Wojciech Czarnecki · Gauthier Gidel

[ Auditorium 1 Foyer ]

The Elo score has been extensively used to rank players by their skill or strength in competitive games such as chess, go, or StarCraft II. The Elo score implicitly assumes games have a strong additive---hence transitive---component. In this paper, we investigate the challenge of identifying transitive components in games. As a starting point, we show that the Elo score provably fails to extract the transitive component of some elementary transitive games. Based on this observation, we propose an alternative ranking system which properly extracts the transitive components in these games. Finally, we conduct an in-depth empirical validation on real-world game payoff matrices: it shows significant prediction performance improvements compared to the Elo score.

Poster
Arshak Minasyan · Tigran Galstyan · Sona Hunanyan · Arnak Dalalyan

[ Auditorium 1 Foyer ]

We consider the problem of finding the matching map between two sets of $d$-dimensional noisy feature-vectors. The distinctive feature of our setting is that we do not assume that all the vectors of the first set have their corresponding vector in the second set. If $n$ and $m$ are the sizes of these two sets, we assume that the matching map that should be recovered is defined on a subset of unknown cardinality $k^*\le \min(n,m)$. We show that, in the high-dimensional setting, if the signal-to-noise ratio is larger than $5(d\log(4nm/\alpha))^{1/4}$, then the true matching map can be recovered with probability $1-\alpha$. Interestingly, this threshold does not depend on $k^*$ and is the same as the one obtained in prior work in the case of $k = \min(n,m)$. The procedure for which the aforementioned property is proved is obtained by a data-driven selection among candidate mappings $\{\hat\pi_k:k\in[\min(n,m)]\}$. Each $\hat\pi_k$ minimizes the sum of squares of distances between two sets of size $k$. The resulting optimization problem can be formulated as a minimum-cost flow problem, and thus solved efficiently. Finally, we report the results of numerical experiments on both synthetic and real-world data that illustrate our theoretical results and provide further insight …
Poster
Shuoguang Yang · Yuhao Yan · Xiuneng Zhu · Qiang Sun

[ Auditorium 1 Foyer ]

Sparse regression has been a popular approach to perform variable selection and enhance the prediction accuracy and interpretability of the resulting statistical model. Existing approaches focus on offline regularized regression, while the online scenario has rarely been studied. In this paper, we propose a novel online sparse linear regression framework for analyzing streaming data when data points arrive sequentially. Our proposed method is memory efficient and requires less stringent restricted strong convexity assumptions. Theoretically, we show that with a properly chosen regularization parameter, the $\ell_2$-norm statistical error of our estimator diminishes to zero in the optimal order of $\tilde \mathcal{O}(\frac{s}{\sqrt{T}})$, where $T$ is the streaming sample size. Numerical experiments demonstrate the practical efficiency of our algorithm.
Poster
Sung Jae Jun · Sokbae Lee

[ Auditorium 1 Foyer ]

The log odds ratio is a well-established metric for evaluating the association between binary outcome and exposure variables. Despite its widespread use, there has been limited discussion on how to summarize the log odds ratio as a function of confounders through averaging. To address this issue, we propose the Average Adjusted Association (AAA), which is a summary measure of association in a heterogeneous population, adjusted for observed confounders. To facilitate the use of it, we also develop efficient double/debiased machine learning (DML) estimators of the AAA. Our DML estimators use two equivalent forms of the efficient influence function, and are applicable in various sampling scenarios, including random sampling, outcome-based sampling, and exposure-based sampling. Through real data and simulations, we demonstrate the practicality and effectiveness of our proposed estimators in measuring the AAA.

Poster
Xingyu Lu · Hasin Us Sami · Basak Guler

[ Auditorium 1 Foyer ]

Collaborative machine learning enables privacy-preserving training of machine learning models without collecting sensitive client data. Despite recent breakthroughs, communication bottleneck is still a major challenge against its scalability to larger networks. To address this challenge, we propose PICO, the first collaborative learning framework with linear communication complexity, significantly improving over the quadratic state-of-the-art, under formal information-theoretic privacy guarantees. Theoretical analysis demonstrates that PICO slashes the communication cost while achieving equal computational complexity, adversary resilience, robustness to client dropouts, and model accuracy to the state-of-the-art. Extensive experiments demonstrate up to 91x reduction in the communication overhead, and up to 7x speed-up in the wall-clock training time compared to the state-of-the-art. As such, PICO addresses a key technical challenge in multi-party collaborative learning, paving the way for future large-scale privacy-preserving learning frameworks.

Poster
Cheuk Ting Li · Farzan Farnia

[ Auditorium 1 Foyer ]

Generative adversarial networks (GANs) represent a game between two neural network machines designed to learn the distribution of data. It is commonly observed that different GAN formulations and divergence/distance measures used could lead to considerably different performance results, especially when the data distribution is multi-modal. In this work, we give a theoretical characterization of the mode-seeking behavior of general f-divergences and Wasserstein distances, and prove a performance guarantee for the setting where the underlying model is a mixture of multiple symmetric quasiconcave distributions. This can help us understand the trade-off between the quality and diversity of the trained GANs' output samples. Our theoretical results show the mode-seeking nature of the Jensen-Shannon (JS) divergence over standard KL-divergence and Wasserstein distance measures. We subsequently demonstrate that a hybrid of JS-divergence and Wasserstein distance measures minimized by Lipschitz GANs mimics the mode-seeking behavior of the JS-divergence. We present numerical results showing the mode-seeking nature of the JS-divergence and its hybrid with the Wasserstein distance while highlighting the mode-covering properties of KL-divergence and Wasserstein distance measures. Our numerical experiments indicate the different behavior of several standard GAN formulations in application to benchmark Gaussian mixture and image datasets.

Poster
Lionel Riou-Durand · Pavel Sountsov · Jure Vogrinc · Charles Margossian · Sam Power

[ Auditorium 1 Foyer ]

Hamiltonian Monte Carlo (HMC) is a widely used sampler for continuous probability distributions. In many cases, the underlying Hamiltonian dynamics exhibit a phenomenon of resonance which decreases the efficiency of the algorithm and makes it very sensitive to hyperparameter values. This issue can be tackled efficiently, either via the use of trajectory length randomization (RHMC) or via partial momentum refreshment. The second approach is connected to the kinetic Langevin diffusion, and has been mostly investigated through the use of Generalized HMC (GHMC). However, GHMC induces momentum flips upon rejections causing the sampler to backtrack and waste computational resources. In this work we focus on a recent algorithm bypassing this issue, named Metropolis Adjusted Langevin Trajectories (MALT). We build upon recent strategies for tuning the hyperparameters of RHMC which target a bound on the Effective Sample Size (ESS) and adapt it to MALT, thereby enabling the first user-friendly deployment of this algorithm. We construct a method to optimize a sharper bound on the ESS and reduce the estimator variance. Easily compatible with parallel implementation, the resultant Adaptive MALT algorithm is competitive in terms of ESS rate and hits useful tradeoffs in memory usage when compared to GHMC, RHMC and NUTS.

Poster
Zhenbang Wang · Emanuel Ben-David · Martin Slawski

[ Auditorium 1 Foyer ]

In the analysis of data sets consisting of (X, Y)-pairs, a tacit assumption is that each pair corresponds to the same observational unit. If, however, such pairs are obtained via record linkage of two files, this assumption can be violated as a result of mismatch error rooting, for example, in the lack of reliable identifiers in the two files. Recently, there has been a surge of interest in this setting under the term "Shuffled Data" in which the underlying correct pairing of (X, Y)-pairs is represented via an unknown permutation. Explicit modeling of the permutation tends to be associated with overfitting, prompting the need for suitable methods of regularization. In this paper, we propose an exponential family prior on the permutation group for this purpose that can be used to integrate various structures such as sparse and local shuffling. This prior turns out to be conjugate for canonical shuffled data problems in which the likelihood conditional on a fixed permutation can be expressed as product over the corresponding (X,Y)-pairs. Inference can be based on the EM algorithm in which the E-step is approximated by sampling, e.g., via the Fisher-Yates algorithm. The M-step is shown to admit a reduction from n^2 …

Poster
hang zhang · Ping Li

[ Auditorium 1 Foyer ]

This paper studies the generalization capability of the compressed \emph{$k$-nearest neighbor} (KNN) estimator, where randomly-projected low-dimensional data are put into the KNN estimator rather than the high-dimensional raw data. Considering both regression and classification, we give improved bounds on its generalization errors, to put more specific, $\ell_2$ error for regression and mis-classification rate for classification. As a byproduct of our analysis, we prove that ordered distance is almost preserved with random projections, which we believe is for the first time. In addition, we provide numerical experiments on various public datasets to verify our theorems.
Poster
Charlotte Bunne · Ya-Ping Hsieh · Marco Cuturi · Andreas Krause

[ Auditorium 1 Foyer ]

The static optimal transport $(\mathrm{OT})$ problem between Gaussians seeks to recover an optimal map, or more generally a coupling, to morph a Gaussian into another. It has been well studied and applied to a wide variety of tasks. Here we focus on the dynamic formulation of OT, also known as the Schrödinger bridge (SB) problem, which has recently seen a surge of interest in machine learning due to its connections with diffusion-based generative models. In contrast to the static setting, much less is known about the dynamic setting, even for Gaussian distributions. In this paper, we provide closed-form expressions for SBs between Gaussian measures. In contrast to the static Gaussian OT problem, which can be simply reduced to studying convex programs, our framework for solving SBs requires significantly more involved tools such as Riemannian geometry and generator theory. Notably, we establish that the solutions of SBs between Gaussian measures are themselves Gaussian processes with explicit mean and covariance kernels, and thus are readily amenable for many downstream applications such as generative modeling or interpolation. To demonstrate the utility, we devise a new method for modeling the evolution of single-cell genomics data and report significantly improved numerical stability compared to existing …
Poster
Ralph Urlus · Max Baak · Stéphane Collot · Ilan Fridman Rojas

[ Auditorium 1 Foyer ]

Quoting robust uncertainties on machine learning (ML) model metrics, such as f1-score, precision, recall, etc., from sources of uncertainty such as data sampling, parameter initialization, and target labelling, is typically not done in the field of data science, even though these are essential for the proper interpretation and comparison of ML models. This text shows how to calculate and visualize the impact of one dominant source of uncertainty – the sampling uncertainty of the test dataset – on each point of the Precision-Recall (PR) and Receiver Operating Characteristic (ROC) curves. This is particularly relevant for PR curves, where the joint uncertainty on recall and precision can be large and non-linear, especially at low recall. Four statistical methods to evaluate this uncertainty, both frequentist and Bayesian in origin, are compared in terms of coverage and speed. Of these, Wilks’ method is the winner: it provides (near) correct coverage for samples as small as 10 records, works fine when the precision or recall are close to the edges of zero or one, and can be evaluated quickly for practical use. The presented algorithms are available through a public Python library. We recommend that showing uncertainty bands of PR or ROC curves becomes …

Poster
Ludovic Arnould · Claire Boyer · Erwan Scornet

[ Auditorium 1 Foyer ]

Statistical wisdom suggests that very complex models, interpolating training data, will be poor at predicting unseen examples.Yet, this aphorism has been recently challenged by the identification of benign overfitting regimes, specially studied in the case of parametric models: generalization capabilities may be preserved despite model high complexity.While it is widely known that fully-grown decision trees interpolate and, in turn, have bad predictive performances, the same behavior is yet to be analyzed for Random Forests (RF).In this paper, we study the trade-off between interpolation and consistency for several types of RF algorithms. Theoretically, we prove that interpolation regimes and consistency cannot be achieved simultaneously for several non-adaptive RF.Since adaptivity seems to be the cornerstone to bring together interpolation and consistency, we study interpolating Median RF which are proved to be consistent in the interpolating regime. This is the first result conciliating interpolation and consistency for RF, highlighting that the averaging effect introduced by feature randomization is a key mechanism, sufficient to ensure the consistency in the interpolation regime and beyond.Numerical experiments show that Breiman's RF are consistent while exactly interpolating, when no bootstrap step is involved.We theoretically control the size of the interpolation area, which converges fast enough to zero, giving …

Poster
Hengchao Chen · Xiang Li · Qiang Sun

[ Auditorium 1 Foyer ]

Non-asymptotic statistical analysis is often missing for modern geometry-aware machine learning algorithms due to the possibly intricate non-linear manifold structure. This paper studies an intrinsic mean model on the manifold of restricted positive semi-definite matrices and provides a non-asymptotic statistical analysis of the Karcher mean. We also consider a general extrinsic signal-plus-noise model, under which a deterministic error bound of the Karcher mean is provided. As an application, we show that the distributed principal component analysis algorithm, LRC-dPCA, achieves the same performance as the full sample PCA algorithm. Numerical experiments lend strong support to our theories.

Poster
ARNAB MAITI · Kevin Jamieson · Lillian Ratliff

[ Auditorium 1 Foyer ]

We study the sample complexity of identifying an approximate equilibrium for two-player zero-sum $n\times 2$ matrix games. That is, in a sequence of repeated game plays, how many rounds must the two players play before reaching an approximate equilibrium (e.g., Nash)? We derive instance-dependent bounds that define an ordering over game matrices that captures the intuition that the dynamics of some games converge faster than others. Specifically, we consider a stochastic observation model such that when the two players choose actions $i$ and $j$, respectively, they both observe each other's played actions and a stochastic observation $X_{ij}$ such that $\mathbb{E}[X_{ij}] = A_{ij}$. To our knowledge, our work is the first case of instance-dependent lower bounds on the number of rounds the players must play before reaching an approximate equilibrium in the sense that the number of rounds depends on the specific properties of the game matrix $A$ as well as the desired accuracy. We also prove a converse statement: there exist player strategies that achieve this lower bound.
Poster
Vivien Cabannes · Stefano Vigogna

[ Auditorium 1 Foyer ]

Optimizing the misclassification risk is in general NP-hard. Tractable solvers can be obtained by considering a surrogate regression problem. While convergence to the regression function is typically sublinear, the corresponding classification error can decay much faster. Fast and super fast rates (up to exponential) have been established for general smooth losses on problems where a hard margin is present between classes. This leaves out models based on non-smooth losses such as support vector machines, and problems where there is no hard margin, begging several questions. Are such models incapable of fast convergence? Are they therefore structurally inferior? Is the hard margin condition really necessary to obtain exponential convergence? Developing a new strategy, we provide an answer to these questions. In particular, we show not only that support vector machines can indeed converge exponentially fast, but also that they can do so even without hard margin.

Poster
Kush Bhatia · Wenshuo Guo · Jacob Steinhardt

[ Auditorium 1 Foyer ]

Specifying reward functions for complex tasks like object manipulation or driving is challenging to do by hand. Reward learning seeks to address this by learning a reward model using human feedback on selected query policies. This shifts the burden of reward specification to the optimal design of the queries. We propose a theoretical framework for studying reward learning and the associated optimal experiment design problem. Our framework models rewards and policies as nonparametric functions belonging to subsets of Reproducing Kernel Hilbert Spaces (RKHSs). The learner receives (noisy) oracle access to a true reward and must output a policy that performs well under the true reward. For this setting, we first derive non-asymptotic excess risk bounds for a simple plug-in estimator based on ridge regression. We then solve the query design problem by optimizing these risk bounds with respect to the choice of query set and obtain a finite sample statistical rate, which depends primarily on the eigenvalue spectrum of a certain linear operator on the RKHSs. Despite the generality of these results, our bounds are stronger than previous bounds developed for more specialized problems. We specifically show that the well-studied problem of Gaussian process (GP) bandit optimization is a special …

Poster
Chengliang Tang · Nathan Lenssen · Ying Wei · Tian Zheng

[ Auditorium 1 Foyer ]

Learning function-on-scalar predictive models for conditional densities and identifying factors that influence the entire probability distribution are vital tasks in many data-driven applications. We present an efficient Majorization-Minimization optimization algorithm, Wasserstein Distributional Learning (WDL), that trains Semi-parametric Conditional Gaussian Mixture Models (SCGMM) for conditional density functions and uses the Wasserstein distance $W_2$ as a proper metric for the space of density outcomes. We further provide theoretical convergence guarantees and illustrate the algorithm using boosted machines. Experiments on the synthetic data and real-world applications demonstrate the effectiveness of the proposed WDL algorithm.
Poster
Francesco Quinzan · Rajiv Khanna · Moshik Hershcovitch · Sarel Cohen · Daniel Waddington · Tobias Friedrich · Michael Mahoney

[ Auditorium 1 Foyer ]

We study the fundamental problem of selecting optimal features for model construction. This problem is computationally challenging on large datasets, even with the use of greedy algorithm variants. To address this challenge, we extend the adaptive query model, recently proposed for the greedy forward selection for submodular functions, to the faster paradigm of Orthogonal Matching Pursuit for non-submodular functions. The proposed algorithm achieves exponentially fast parallel run time in the adaptive query model, scaling much better than prior work. Furthermore, our extension allows the use of downward-closed constraints, which can be used to encode certain fairness criteria into the feature selection process. We prove strong approximation guarantees for the algorithm based on standard assumptions. These guarantees are applicable to many parametric models, including Generalized Linear Models. Finally, we demonstrate empirically that the proposed algorithm competes favorably with state-of-the-art techniques for feature selection, on real-world and synthetic datasets.

Poster
Hanni Cheng · Haosi Zheng · Ya Cong · Weihao Jiang · Shiliang Pu

[ Auditorium 1 Foyer ]

Learning-based algorithms to solve TSP are getting popular in recent years, but most existing works cannot solve very large-scale TSP instances within a limited time. To solve this problem, this paper introduces a creative and distinctive method to select and locally optimize sub-parts of a solution. Concretely, we design a novel framework to generalize a small-scale selector-and-optimizer network to large-scale TSP instances by iteratively selecting while optimizing one sub-problem. At each iteration, the time of sub-problem sampling and selection is significantly reduced due to the full use of parallel computing. Our neural model is well-designed to exploit the characteristics of the sub-problems. Furthermore, another destroy-and-repair method is raised to avoid the local minimum of the iterative algorithm from a global perspective. Extensive experiments show that our method accelerates state-of-the-art learning-based algorithms more than 2x while achieving better solution quality on large-scale TSP instances ranging in size from 200 to 20,000.

Poster
Lianke Qin · Zhao Song · Lichen Zhang · Danyang Zhuo

[ Auditorium 1 Foyer ]

Online matrix vector multiplication is a fundamental step and bottleneck in many machine learning algorithms. It is defined as follows: given a matrix at the pre-processing phase, at each iteration one receives a query vector and needs to form the matrix-vector product (approximately) before observing the next vector. In this work, we study a particular instance of such problem called the online projection matrix vector multiplication. Via a reduction, we show it suffices to solve the inverse maintenance problem. Additionally, our framework supports dimensionality reduction to speed up the computation that approximates the matrix-vector product with an optimization-friendly error guarantee. Moreover, our unified approach can handle both data-oblivious sketching and data-dependent sampling. Finally, we demonstrate the effectiveness of our framework by speeding up the empirical risk minimization solver.

Poster
Hadrien Hendrikx

[ Auditorium 1 Foyer ]

We consider a decentralized optimization problem, in which n nodes collaborate to optimize a global objective function using local communications only. While many decentralized algorithms focus on gossip communications (pairwise averaging), we consider a different scheme, in which a ``token'' that contains the current estimate of the model performs a random walk over the network, and updates its model using the local model of the node it is at. Indeed, token algorithms generally benefit from improved communication efficiency and privacy guarantees. We frame the token algorithm as a randomized gossip algorithm on a conceptual graph, which allows us to prove a series of convergence results for variance-reduced and accelerated token algorithms for the complete graph. We also extend these results to the case of multiple tokens by extending the conceptual graph, and to general graphs by tweaking the communication procedure. The reduction from token to well-studied gossip algorithms leads to tight rates for many token algorithms, and we illustrate their performance empirically.

Poster
Yutong Dai · Guanyi Wang · Daniel Robinson · Frank E. Curtis

[ Auditorium 1 Foyer ]

This paper introduces a new proximal stochastic gradient method with variance reduction and stabilization for minimizing the sum of a convex stochastic function and a group sparsity-inducing regularization function. Since the method may be viewed as a stabilized version of the recently proposed algorithm PStorm, we call our algorithm S-PStorm. Our analysis shows that S-PStorm has strong convergence results. In particular, we prove an upper bound on the number of iterations required by S-PStorm before its iterates correctly identify (with high probability) an optimal support (i.e., the zero and nonzero structure of an optimal solution). Most algorithms in the literature with such a support identification property use variance reduction techniques that require either periodically evaluating an exact gradient or storing a history of stochastic gradients. Unlike these methods, S-PStorm achieves variance reduction without requiring either of these, which is advantageous. Moreover, our support-identification result for S-PStorm shows that, with high probability, an optimal support will be identified correctly in all iterations with index above a threshold. We believe that this type of result is new to the literature since the few existing other results prove that the optimal support is identified with high probability at each iteration with a sufficiently …

Poster
Michał Grudzień · Grigorii Malinovskii · Peter Richtarik

[ Auditorium 1 Foyer ]

The celebrated FedAvg algorithm of McMahan et al. (2017) is based on three components: client sampling (CS), data sampling (DS) and local training (LT). While the first two are reasonably well understood, the third component, whose role is to reduce the number of communication rounds needed to train the model, resisted all attempts at a satisfactory theoretical explanation. Malinovsky et al. (2022) identified four distinct generations of LT methods based on the quality of the provided theoretical communication complexity guarantees. Despite a lot of progress in this area, none of the existing works were able to show that it is theoretically better to employ multiple local gradient-type steps (i.e., to engage in LT) than to rely on a single local gradient-type step only in the important heterogeneous data regime. In a recent breakthrough embodied in their ProxSkip method and its theoretical analysis, Mishchenko et al. (2022) showed that LT indeed leads to provable communication acceleration for arbitrarily heterogeneous data, thus jump-starting the 5th generation of LT methods. However, while these latest generation LT methods are compatible with DS, none of them support CS. We resolve this open problem in the affirmative. In order to do so, we had to base …

Poster
Zhaoyue Chen · Yifan Sun

[ Auditorium 1 Foyer ]

The Frank-Wolfe algorithm is a popular method in structurally constrained machine learning applications, due to its fast per-iteration complexity. However, one major limitation of the method is a slow rate of convergence that is difficult to accelerate due to erratic, zig-zagging step directions, even asymptotically close to the solution. We view this as an artifact of discretization; that is to say, the Frank-Wolfe \emph{flow}, which is its trajectory at asymptotically small step sizes, does not zig-zag, and reducing discretization error will go hand-in-hand in producing a more stabilized method, with better convergence properties. We propose two improvements: a multistep Frank-Wolfe method that directly applies optimized higher-order discretization schemes; and an LMO-averaging scheme with reduced discretization error, and whose local convergence rate over general convex sets accelerates from a rate of $O(1/k)$ to up to $O(1/k^{3/2})$.
Poster
Batiste Le Bars · Aurélien Bellet · Marc Tommasi · Erick Lavoie · Anne-Marie Kermarrec

[ Auditorium 1 Foyer ]

One of the key challenges in decentralized and federated learning is to design algorithms that efficiently deal with highly heterogeneous data distributions across agents. In this paper, we revisit the analysis of Decentralized Stochastic Gradient Descent algorithm (D-SGD) under data heterogeneity. We exhibit the key role played by a new quantity, called neighborhood heterogeneity, on the convergence rate of D-SGD. By coupling the communication topology and the heterogeneity, our analysis sheds light on the poorly understood interplay between these two concepts. We then argue that neighborhood heterogeneity provides a natural criterion to learn data-dependent topologies that reduce (and can even eliminate) the otherwise detrimental effect of data heterogeneity on the convergence time of D-SGD. For the important case of classification with label skew, we formulate the problem of learning such a good topology as a tractable optimization problem that we solve with a Frank-Wolfe algorithm. As illustrated over a set of simulated and real-world experiments, our approach provides a principled way to design a sparse topology that balances the convergence speed and the per-iteration communication costs of D-SGD under data heterogeneity.

Poster
Han Shen · Songtao Lu · Xiaodong Cui · Tianyi Chen

[ Auditorium 1 Foyer ]

Federated learning (FL) has received increasing interests during the past years, However, most of the existing works focus on supervised learning, and federated learning for sequential decision making has not been fully explored. Part of the reason is that learning a policy for sequential decision making typically requires repeated interaction with the environments, which is costly in many FL applications.To overcome this issue, this work proposes a federated offline policy optimization method abbreviated as FedOPO that allows clients to jointly learn the optimal policy without interacting with environments during training. Albeit the nonconcave-convex-strongly concave nature of the resultant max-min-max problem, we establish both the local and global convergence of our FedOPO algorithm. Experiments on the OpenAI gym demonstrate that our algorithm is able to find a near-optimal policy while enjoying various merits brought by FL, including training speedup and improved asymptotic performance.

Poster
Xianyang Zhang · Trisha Dawn

[ Auditorium 1 Foyer ]

One common approach to detecting change-points is minimizing a cost function over possible numbers and locations of change-points. The framework includes several well-established procedures, such as the penalized likelihood and minimum description length. Such an approach requires finding the cost value repeatedly over different segments of the data set, which can be time-consuming when (i) the data sequence is long and (ii) obtaining the cost value involves solving a non-trivial optimization problem. This paper introduces a new sequential updating method (SE) to find the cost value effectively. The core idea is to update the cost value using the information from previous steps without re-optimizing the objective function. The new method is applied to change-point detection in generalized linear models and penalized regression. Numerical studies show that the new approach can be orders of magnitude faster than the Pruned Exact Linear Time (PELT) method without sacrificing estimation accuracy.

Poster
Ziang Chen · Jianfeng Lu · Huajie Qian · XInshang Wang · Wotao Yin

[ Auditorium 1 Foyer ]

One implicit assumption in current stochastic gradient descent (SGD) algorithms is the identical cost for sampling each component function of the finite-sum objective. However, there are applications where the costs differ substantially, for which SGD schemes with uniform sampling invoke a high sampling load. We investigate the use of importance sampling (IS) as a cost saver in this setting, in contrast to its traditional use for variance reduction. The key ingredient is a novel efficiency metric for IS that advocates low sampling costs while penalizing high gradient variances. We then propose HeteRSGD, an SGD scheme that performs gradient sampling according to optimal probability weights stipulated by the metric, and establish theories on its optimal asymptotic and finite-time convergence rates among all possible IS-based SGD schemes. We show that the relative efficiency gain of HeteRSGD can be arbitrarily large regardless of the problem dimension and number of components. Our theoretical results are validated numerically for both convex and nonconvex problems.

Poster
Evan Becker · Jingdong Gao · Ted Zadouri · Baharan Mirzasoleiman

[ Auditorium 1 Foyer ]

We consider maximization of stochastic monotone continuous submodular functions (CSF) with a diminishing return property. Existing algorithms only guarantee the performance \textit{in expectation}, and do not bound the probability of getting a bad solution. This implies that for a particular run of the algorithms, the solution may be much worse than the provided guarantee in expectation. In this paper, we first empirically verify that this is indeed the case. Then, we provide the first \textit{high-probability} analysis of the existing methods for stochastic CSF maximization, namely PGA, boosted PGA, SCG, and SCG++. Finally, we provide an improved high-probability bound for SCG, under slightly stronger assumptions, with a better convergence rate than that of the expected solution. Through extensive experiments on non-concave quadratic programming (NQP) and optimal budget allocation, we confirm the validity of our bounds and show that even in the worst-case, PGA converges to $OPT/2$, and boosted PGA, SCG, SCG++ converge to $(1 - 1/e)OPT$, but at a slower rate than that of the expected solution.
Poster
Andi Han · Bamdev Mishra · Pratik Jawanpuria · Junbin Gao

[ Auditorium 1 Foyer ]

In this paper, we propose a convergence acceleration scheme for general Riemannian optimization problems by extrapolating iterates on manifolds. We show that when the iterates are generated from the Riemannian gradient descent method, the scheme achieves the optimal convergence rate asymptotically and is computationally more favorable than the recently proposed Riemannian Nesterov accelerated gradient methods. A salient feature of our analysis is the convergence guarantees with respect to the use of general retraction and vector transport. Empirically, we verify the practical benefits of the proposed acceleration strategy, including robustness to the choice of different averaging schemes on manifolds.

Poster
Tyler Maunu · Thibaut Le Gouic · Philippe Rigollet

[ Auditorium 1 Foyer ]

We revisit the problem of recovering a low-rank positive semidefinite matrix from rank-one projections using tools from optimal transport. More specifically, we show that a variational formulation of this problem is equivalent to computing a Wasserstein barycenter. In turn, this new perspective enables the development of new geometric first-order methods with strong convergence guarantees in Bures-Wasserstein distance. Experiments on simulated data demonstrate the advantages of our new methodology over existing methods.

Poster
Yu Wang · Mikolaj Kasprzak · Jonathan Huggins

[ Auditorium 1 Foyer ]

Variational Inference (VI) is an attractive alternative to Markov Chain Monte Carlo (MCMC) due to its computational efficiency in the case of large datasets and/or complex models with high-dimensional parameters. However, evaluating the accuracy of variational approximations remains a challenge. Existing methods characterize the quality of the whole variational distribution, which is almost always poor in realistic applications, even if specific posterior functionals such as the component-wise means or variances are accurate. Hence, these diagnostics are of practical value only in limited circumstances. To address this issue, we propose the "TArgeted Diagnostic for Distribution Approximation Accuracy" (TADDAA), which uses many short parallel MCMC chains to obtain lower bounds on the error of each posterior functional of interest. We also develop a reliability check for TADDAA to determine when the lower bounds should not be trusted. Numerical experiments validate the practical utility and computational efficiency of our approach on a range of synthetic distributions and real-data examples, including sparse logistic regression and Bayesian neural network models.

Poster
Vishnu Raj · Tianyu Cui · Markus Heinonen · Pekka Marttinen

[ Auditorium 1 Foyer ]

Bayesian neural networks (BNNs) can account for both aleatoric and epistemic uncertainty. However, in BNNs the priors are often specified over the weights which rarely reflects true prior knowledge in large and complex neural network architectures. We present a simple approach to incorporate prior knowledge in BNNs based on external summary information about the predicted classification probabilities for a given dataset. The available summary information is incorporated as augmented data and modeled with a Dirichlet process, and we derive the corresponding Summary Evidence Lower BOund. The approach is fully Bayesian without any heuristic tuning parameters, and all hyperparameters have a proper probabilistic interpretation. We show how the method can inform the model about task difficulty and class imbalance. Extensive experiments show that, with negligible computational overhead, our method parallels and in many cases outperforms popular alternatives in accuracy, uncertainty calibration, and robustness against corruptions with both balanced and imbalanced data.

Poster
Srshti Putcha · Christopher Nemeth · Paul Fearnhead

[ Auditorium 1 Foyer ]

Stochastic gradient MCMC (SGMCMC) offers a scalable alternative to traditional MCMC, by constructing an unbiased estimate of the gradient of the log-posterior with a small, uniformly-weighted subsample of the data. While efficient to compute, the resulting gradient estimator may exhibit a high variance and impact sampler performance. The problem of variance control has been traditionally addressed by constructing a better stochastic gradient estimator, often using control variates. We propose to use a discrete, non-uniform probability distribution to preferentially subsample data points that have a greater impact on the stochastic gradient. In addition, we present a method of adaptively adjusting the subsample size at each iteration of the algorithm, so that we increase the subsample size in areas of the sample space where the gradient is harder to estimate. We demonstrate that such an approach can maintain the same level of accuracy while substantially reducing the average subsample size that is used.

Poster
Chengkuan Hong · Christian Shelton

[ Auditorium 1 Foyer ]

Neyman-Scott processes (NSPs) have been applied across a range of fields to model points or temporal events with a hierarchy of clusters. Markov chain Monte Carlo (MCMC) is typically used for posterior sampling in the model. However, MCMC's mixing time can cause the resulting inference to be slow, and thereby slow down model learning and prediction. We develop the first variational inference (VI) algorithm for NSPs, and give two examples of suitable variational posterior point process distributions. Our method minimizes the inclusive Kullback-Leibler (KL) divergence for VI to obtain the variational parameters. We generate samples from the approximate posterior point processes much faster than MCMC, as we can directly estimate the approximate posterior point processes without any MCMC steps or gradient descent. We include synthetic and real-world data experiments that demonstrate our VI algorithm achieves better prediction performance than MCMC when computational time is limited.

Poster
Łukasz Struski · Marcin Mazur · Paweł Batorski · PRzemysław Spurek · Jacek Tabor

[ Auditorium 1 Foyer ]

Many crucial problems in deep learning and statistical inference are caused by a variational gap, i.e., a difference between model evidence (log-likelihood) and evidence lower bound (ELBO). In particular, in a classical VAE setting that involves training via an ELBO cost function, it is difficult to provide a robust comparison of the effects of training between models, since we do not know a log-likelihood of data (but only its lower bound). In this paper, to deal with this problem, we introduce a general and effective upper bound, which allows us to efficiently approximate the evidence of data. We provide extensive theoretical and experimental studies of our approach, including its comparison to the other state-of-the-art upper bounds, as well as its application as a tool for the evaluation of models that were trained on various lower bounds.

Poster
Raul Astudillo · Zhiyuan Jerry Lin · Eytan Bakshy · Peter Frazier

[ Auditorium 1 Foyer ]

Preferential Bayesian optimization (PBO) is a framework for optimizing a decision maker's latent utility function using preference feedback. This work introduces the expected utility of the best option (qEUBO) as a novel acquisition function for PBO. When the decision maker's responses are noise-free, we show that qEUBO is one-step Bayes optimal and thus equivalent to the popular knowledge gradient acquisition function. We also show that qEUBO enjoys an additive constant approximation guarantee to the one-step Bayes-optimal policy when the decision maker's responses are corrupted by noise. We provide an extensive evaluation of qEUBO and demonstrate that it outperforms the state-of-the-art acquisition functions for PBO across many settings. Finally, we show that, under sufficient regularity conditions, qEUBO's Bayesian simple regret converges to zero at a rate $o(1/n)$ as the number of queries, $n$, goes to infinity. In contrast, we show that simple regret under qEI, a popular acquisition function for standard BO often used for PBO, can fail to converge to zero. Enjoying superior performance, simple computation, and a grounded decision-theoretic justification, qEUBO is a promising acquisition function for PBO.
Poster
Nikolay Krantsevich · Jingyu He · P. Richard Hahn

[ Auditorium 1 Foyer ]

Determining subgroups that respond especially well (or poorly) to specific interventions (medical or policy) requires new supervised learning methods tailored specifically for causal inference. Bayesian Causal Forest (BCF) is a recent method that has been documented to perform well on data generating processes with strong confounding of the sort that is plausible in many applications. This paper develops a novel algorithm for fitting the BCF model, which is more efficient than the previous Gibbs sampler. The new algorithm can be used to initialize independent chains of the existing Gibbs sampler leading to better posterior exploration and coverage of the associated interval estimates in simulation studies. The new algorithm is compared to related approaches via simulation studies as well as an empirical analysis.

Poster
Yang Yang · Gennaro Gala · Robert Peharz

[ Auditorium 1 Foyer ]

Probabilistic circuits (PCs) are a prominent representation of probability distributions with tractable inference. While parameter learning in PCs is rigorously studied, structure learning is often more based on heuristics than on principled objectives. In this paper, we develop Bayesian structure scores for deterministic PCs, i.e., the structure likelihood with parameters marginalized out, which are well known as rigorous objectives for structure learning in probabilistic graphical models. When used within a greedy cutset algorithm, our scores effectively protect against overfitting and yield a fast and almost hyper-parameter-free structure learner, distinguishing it from previous approaches. In experiments, we achieve good trade-offs between training time and model fit in terms of log-likelihood. Moreover, the principled nature of Bayesian scores unlocks PCs for accommodating frameworks such as structural expectation-maximization.

Poster
Felix Jimenez · Matthias Katzfuss

[ Auditorium 1 Foyer ]

Bayesian optimization is a technique for optimizing black-box target functions. At the core of Bayesian optimization is a surrogate model that predicts the output of the target function at previously unseen inputs to facilitate the selection of promising input values. Gaussian processes (GPs) are commonly used as surrogate models but are known to scale poorly with the number of observations. Inducing point GP approximations can mitigate scaling issues, but may provide overly smooth estimates of the target function. In this work we adapt the Vecchia approximation, a popular GP approximation from spatial statistics, to enable scalable high-dimensional Bayesian optimization. We develop several improvements and extensions to Vecchia, including training warped GPs using mini-batch gradient descent, approximate neighbor search, and variance recalibration. We demonstrate the superior performance of Vecchia in BO using both Thompson sampling and qUCB. On several test functions and on two reinforcement-learning problems, our methods compared favorably to the state of the art, often outperforming inducing point methods and even exact GPs.

Poster
Sulin Liu · Qing Feng · David Eriksson · Ben Letham · Eytan Bakshy

[ Auditorium 1 Foyer ]

Bayesian optimization (BO) is a powerful approach to sample-efficient optimization of black-box objective functions. However, the application of BO to areas such as recommendation systems often requires taking the interpretability and simplicity of the configurations into consideration, a setting that has not been previously studied in the BO literature. To make BO applicable in this setting, we present several regularization-based approaches that allow us to discover sparse and more interpretable configurations. We propose a novel differentiable relaxation based on homotopy continuation that makes it possible to target sparsity by working directly with $L_0$ regularization. We identify failure modes for regularized BO and develop a hyperparameter-free method, sparsity exploring Bayesian optimization (SEBO) that seeks to simultaneously maximize a target objective and sparsity. SEBO and methods based on fixed regularization are evaluated on synthetic and real-world problems, and we show that we are able to efficiently optimize for sparsity.
Poster
Nicola Branchini · Virginia Aglietti · Neil Dhir · Theodoros Damoulas

[ Auditorium 1 Foyer ]

We study the problem of globally optimizing the causal effect on a target variable of an unknown causal graph in which interventions can be performed. This problem arises in many areas of science including biology, operations research and healthcare. We propose Causal Entropy Optimization (CEO), a framework that generalizes Causal Bayesian Optimization (CBO) to account for all sources of uncertainty, including the one arising from the causal graph structure. CEO incorporates the causal structure uncertainty both in the surrogate models for the causal effects and in the mechanism used to select interventions via an information-theoretic acquisition function. The resulting algorithm automatically trades-off structure learning and causal effect optimization, while naturally accounting for observation noise. For various synthetic and real-world structural causal models, CEO achieves faster convergence to the global optimum compared with CBO while also learning the graph. Furthermore, our joint approach to structure learning and causal optimization improves upon sequential, structure-learning-first approaches.

Poster
Matthias Bitzer · Mona Meister · Christoph Zimmer

[ Auditorium 1 Foyer ]

Learning precise surrogate models of complex computer simulations and physical machines often require long-lasting or expensive experiments. Furthermore, the modeled physical dependencies exhibit nonlinear and nonstationary behavior. Machine learning methods that are used to produce the surrogate model should therefore address these problems by providing a scheme to keep the number of queries small, e.g. by using active learning and be able to capture the nonlinear and nonstationary properties of the system. One way of modeling the nonstationarity is to induce input-partitioning, a principle that has proven to be advantageous in active learning for Gaussian processes. However, these methods either assume a known partitioning, need to introduce complex sampling schemes or rely on very simple geometries. In this work, we present a simple, yet powerful kernel family that incorporates a partitioning that: i) is learnable via gradient-based methods, ii) uses a geometry that is more flexible than previous ones, while still being applicable in the low data regime. Thus, it provides a good prior for active learning procedures. We empirically demonstrate excellent performance on various active learning tasks.

Poster
William Walker · Hugo Soulat · Changmin Yu · Maneesh Sahani

[ Auditorium 1 Foyer ]

We introduce a new approach to probabilistic unsupervised learning based on the recognition-parametrised model (RPM): a normalised semi-parametric hypothesis class for joint distributions over observed and latent variables. Under the key assumption that observations are conditionally independent given latents, the RPM combines parametric prior and observation-conditioned latent distributions with non-parametric observation marginals. This approach leads to a flexible learnt recognition model capturing latent dependence between observations, without the need for an explicit, parametric generative model. The RPM often admits exact maximum-likelihood learning even for powerful neural-network-based recognition. We develop effective approximations applicable in other cases. Experiments demonstrate the effectiveness of the RPM on high-dimensional data, learning image classification from weak indirect supervision; direct image-level latent Dirichlet allocation; and Recognition-Parametrised Gaussian Process Factor Analysis (RP-GPFA) applied to multi-factorial spatiotemporal datasets. The RPM provides a powerful framework to discover meaningful latent structure underlying observational data, a function critical to both animal and artificial intelligence.

Poster
Nikil Selvam · Honghua Zhang · Guy Van den Broeck

[ Auditorium 1 Foyer ]

Tree-shaped graphical models are widely used for their tractability. However, they unfortunately lack expressive power as they require committing to a particular sparse dependency structure. We propose a novel class of generative models called mixtures of all trees: that is, a mixture over all possible ($n^{n-2}$) tree-shaped graphical models over n variables. We show that it is possible to parameterize this Mixture of All Trees (MoAT) model compactly (using a polynomial-size representation) in a way that allows for tractable likelihood computation and optimization via stochastic gradient descent. Furthermore, by leveraging the tractability of tree-shaped models, we devise fast-converging conditional sampling algorithms for approximate inference, even though our theoretical analysis suggests that exact computation of marginals in the MoAT model is NP-hard. Empirically, MoAT achieves state-of-the-art performance on density estimation benchmarks when compared against powerful probabilistic models including hidden Chow-Liu Trees.
Poster
Jen Ning Lim · Sebastian Vollmer · Lorenz Wolf · Andrew Duncan

[ Auditorium 1 Foyer ]

Energy-Based Models (EBMs) have proven to be a highly effective approach for modelling densities on finite-dimensional spaces. Their ability to incorporate domain-specific choices and constraints into the structure of the model through composition make EBMs an appealing candidate for applications in physics, biology and computer vision and various other fields. Recently, Energy-Based Processes (EBP) for modelling stochastic processes was proposed for \textit{unconditional} exchangeable data (e.g., point clouds).In this work, we present a novel subclass of EBPs, called $\mathcal{F}$-EBM for \textit{conditional} exchangeable data, which is able to learn distributions of functions (such as curves or surfaces) from functional samples evaluated at finitely many points. Two unique challenges arise in the functional context. Firstly, training data is often not evaluated along a fixed set of points. Secondly, steps must be taken to control the behaviour of the model between evaluation points, to mitigate overfitting.The proposed model is an energy based model on function space that is decomposed spectrally, where a Gaussian Process path measure is used to reweight the distribution to capture smoothness properties of the underlying process being modelled. The resulting model has the ability to utilize irregularly sampled training data and can output predictions at any resolution, providing an effective …
Poster
Rob Romijnders · Yuki Asano · Christos Louizos · Max Welling

[ Auditorium 1 Foyer ]

Pandemics have a major impact on society and the economy. In the case of a new virus, such as COVID-19, high-grade tests and vaccines might be slow to develop and scarce in the crucial initial phase. With no time to waste and lock-downs being expensive, contact tracing is thus an essential tool for policymakers. In theory, statistical inference on a virus transmission model can provide an effective method for tracing infections. However, in practice, such algorithms need to run decentralized, rendering existing methods -- that require hundreds or even thousands of daily messages per person -- infeasible. In this paper, we develop an algorithm that (i) requires only a few (2-5) daily messages, (ii) works with extremely low bandwidths (3-5 bits) and (iii) enables quarantining and targeted testing that drastically reduces the peak and length of the pandemic. We compare the effectiveness of our algorithm using two agent-based simulators of realistic contact patterns and pandemic parameters and show that it performs well even with low bandwidth, imprecise tests, and incomplete population coverage. Code will be provided.

Poster
Ke Bai · Pengyu Cheng · Weituo Hao · Ricardo Henao · Larry Carin

[ Auditorium 1 Foyer ]

Total correlation (TC) is a fundamental concept in information theory that measures statistical dependency among multiple random variables. Recently, TC has shown noticeable effectiveness as a regularizer in many learning tasks, where the correlation among multiple latent embeddings requires to be jointly minimized or maximized. However, calculating precise TC values is challenging, especially when the closed-form distributions of embedding variables are unknown. In this paper, we introduce a unified framework to estimate total correlation values with sample-based mutual information (MI) estimators. More specifically, we discover a relation between TC and MI and propose two types of calculation paths (tree-like and line-like) to decompose TC into MI terms. With each MI term being bounded, the TC values can be successfully estimated. Further, we provide theoretical analyses concerning the statistical consistency of the proposed TC estimators. Experiments are presented on both synthetic and real-world scenarios, where our estimators demonstrate effectiveness in all TC estimation, minimization, and maximization tasks. The code is available at https://github.com/Linear95/TC-estimation.

Poster
Cyril Bachelard · Apostolos Chalkis · Vissarion Fisikopoulos · Elias Tsigaridas

[ Auditorium 1 Foyer ]

We propose novel randomized geometric tools to detect low-volatility anomalies in stock markets; a principal problem in financial economics.Our modeling of the (detection) problem results in sampling and estimating the (relative) volume of geodesically non-convex and non-connected spherical patchesthat arise by intersecting a non-standard simplex with a sphere.To sample, we introduce two novel Markov Chain Monte Carlo (MCMC) algorithms that exploit the geometry of the problem and employ state-of-the-art continuous geometric random walks (such as Billiard walk and Hit-and-Run) adapted on spherical patches.To our knowledge, this is the first geometric formulation and MCMC-based analysis of the volatility puzzle in stock markets.We have implemented our algorithms in C++ (along with an R interface)and we illustrate the power of our approach by performing extensive experiments on real data. Our analyses provide accurate detection and new insights into the distribution of portfolios’ performance characteristics. Moreover, we use our tools to show that classical methods for low-volatility anomaly detection in finance form bad proxies that could lead to misleading or inaccurate results.

Poster
Yue Xiang · Dongyao Zhu · Bowen Lei · Dongkuan Xu · Ruqi Zhang

[ Auditorium 1 Foyer ]

Gradients have been exploited in recent studies in proposal distributions to accelerate the convergence of Markov chain Monte Carlo algorithms on discrete distributions. However, these methods require a natural differentiable extension of the target discrete distribution, which often does not exist or does not provide an effective guidance of the gradient. In this paper, we develop a gradient-like proposal for any discrete distribution without the requirement of a natural differentiable extension. Built upon a locally-balanced proposal, our method efficiently approximates the discrete likelihood ratio via a Newton's series expansion, as a discrete analog of the continuous Taylor series expansion, to enable a large and efficient exploration in discrete spaces. We show that our method can also be viewed as a multilinear extension, thus inheriting the desired properties. We prove that our method has a guaranteed convergence rate with or without the Metropolis-Hastings step. Furthermore, our method outperforms a number of popular alternatives in several different experiments, including the Ising model, the facility location problem, text summarization, and image retrieval.

Poster
Alex Boyd · Yuxin Chang · Stephan Mandt · Padhraic Smyth

[ Auditorium 1 Foyer ]

Continuous-time event sequences, i.e., sequences consisting of continuous time stamps and associated event types (``marks''), are an important type of sequential data with many applications, e.g., in clinical medicine or user behavior modeling. Since these data are typically modeled in an autoregressive manner (e.g., using neural Hawkes processes or their classical counterparts), it is natural to ask questions about future scenarios such as ``what kind of event will occur next'' or ``will an event of type $A$ occur before one of type $B$.'' Addressing such queries with direct methods such as naive simulation can be highly inefficient from a computational perspective. This paper introduces a new typology of query types and a framework for addressing them using importance sampling. Example queries include predicting the $n^\text{th}$ event type in a sequence and the hitting time distribution of one or more event types. We also leverage these findings further to be applicable for estimating general ``$A$ before $B$'' type of queries. We prove theoretically that our estimation method is effectively always better than naive simulation and demonstrate empirically based on three real-world datasets that our approach can produce orders of magnitude improvements in sampling efficiency compared to naive methods.
Poster
Arnab Mondal · Lakshya Singhal · Piyush Lalitkumar Tiwary · Parag Singla · Prathosh A P

[ Auditorium 1 Foyer ]

Class imbalance is a common phenomenon in multiple application domains such as healthcare, where the sample occurrence of one or few class categories is more prevalent in the dataset than the rest. This work addresses the class-imbalance issue by proposing an over-sampling method for the minority classes in the latent space of a Regularized Auto-Encoder (RAE). Specifically, we construct a latent space by maximizing the conditional data likelihood using an Encoder-Decoder structure, such that oversampling through convex combinations of latent samples preserves the class identity. A jointly-trained linear classifier that separates convexly coupled latent vectors from different classes is used to impose this property on the AE's latent space. Further, the aforesaid linear classifier is used for final classification without retraining. We theoretically show that our method can achieve a low variance risk estimate compared to naive oversampling methods and is robust to overfitting. We conduct several experiments on benchmark datasets and show that our method outperforms the existing oversampling techniques for handling class imbalance.

Poster
Naoya Takeishi · Alexandros Kalousis

[ Auditorium 1 Foyer ]

The combination of deep neural nets and theory-driven models (deep grey-box models) can be advantageous due to the inherent robustness and interpretability of the theory-driven part. Deep grey-box models are usually learned with a regularized risk minimization to prevent a theory-driven part from being overwritten and ignored by a deep neural net. However, an estimation of the theory-driven part obtained by uncritically optimizing a regularizer can hardly be trustworthy if we are not sure which regularizer is suitable for the given data, which may affect the interpretability. Toward a trustworthy estimation of the theory-driven part, we should analyze the behavior of regularizers to compare different candidates and to justify a specific choice. In this paper, we present a framework that allows us to empirically analyze the behavior of a regularizer with a slight change in the architecture of the neural net and the training objective.

Poster
Munir Hiabu · Joseph Meyer · Marvin N. Wright

[ Auditorium 1 Foyer ]

We consider a global representation of a regression or classification function by decomposing it into the sum of main and interaction components of arbitrary order. We propose a new identification constraint that allows for the extraction of interventional SHAP values and partial dependence plots, thereby unifying local and global explanations. With our proposed identification, a feature's partial dependence plot corresponds to the main effect term plus the intercept. The interventional SHAP value of feature $k$ is a weighted sum of the main component and all interaction components that include $k$, with the weights given by the reciprocal of the component's dimension. This brings a new perspective to local explanations such as SHAP values which were previously motivated by game theory only. We show that the decomposition can be used to reduce direct and indirect bias by removing all components that include a protected feature. Lastly, we motivate a new measure of feature importance. In principle, our proposed functional decomposition can be applied to any machine learning model, but exact calculation is only feasible for low-dimensional structures or ensembles of those. We provide an algorithm and efficient implementation for gradient-boosted trees (xgboost) and random planted forest. Conducted experiments suggest that our …
Poster
Duy Nguyen · Ngoc Bui · Viet Anh Nguyen

[ Auditorium 1 Foyer ]

Explaining algorithmic decisions and recommending actionable feedback is increasingly important for machine learning applications in consequential domains such as healthcare, criminal justice, and college admission. Recently, significant efforts have been invested in finding a diverse set of recourses to cover the wide spectrum of users' preferences. However, existing works often neglect the requirement that the recourses should be actionable and close to the data manifold; hence, the constructed recourses might be implausible and unsatisfying to users. To address these issues, we propose a novel approach that explicitly directs the diverse set of actionable recourses towards the data manifold. We first find a diverse set of prototypes in the favorable class that balances the trade-off between diversity and proximity. We demonstrate two specific methods to find these prototypes: either by finding the maximum a posteriori estimate of a determinantal point process or by solving a quadratic binary program. To ensure the actionability constraints, we construct an actionability graph in which the nodes represent the training samples and the edges indicate the feasible action between two instances. We then find a feasible path to each prototype, and this path demonstrates the feasible actions for each recourse in the plan. The experimental results …

Poster
Gilles Audemard · Jean-Marie Lagniez · Pierre Marquis · Nicolas Szczepanski

[ Auditorium 1 Foyer ]

Boosted trees is a dominant ML model, exhibiting high accuracy. However, boosted trees are hardly intelligible, and this is a problem whenever they are used in safety-critical applications. Indeed, in such a context, provably sound explanations for the predictions made are expected. Recent work have shown how subset-minimal abductive explanations can be derived for boosted trees, using automated reasoning techniques. However, the generation of such well-founded explanations is intractable in the general case. To improve the scalability of their generation, we introduce the notion of tree-specific explanation for a boosted tree. We show that tree-specific explanations are provably sound abductive explanations that can be computed in polynomial time. We also explain how to derive a subset-minimal abductive explanation from a tree-specific explanation. Experiments on various datasets show the computational benefits of leveraging tree-specific explanations for deriving subset-minimal abductive explanations.

Poster
Jiachen T. Wang · Ruoxi Jia

[ Auditorium 1 Foyer ]

Data valuation has wide use cases in machine learning, including improving data quality and creating economic incentives for data sharing. This paper studies the robustness of data valuation to noisy model performance scores. Particularly, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the Leave-one-out error) to produce inconsistent data value rankings across different runs. To address this challenge, we introduce the concept of safety margin, which measures the robustness of a data value notion. We show that the Banzhaf value, a famous value notion that originated from cooperative game theory literature, achieves the largest safety margin among all semivalues (a class of value notions that satisfy crucial properties entailed by ML applications and include the famous Shapley value and Leave-one-out error). We propose an algorithm to efficiently estimate the Banzhaf value based on the Maximum Sample Reuse (MSR) principle. Our evaluation demonstrates that the Banzhaf value outperforms the existing semivalue-based data value notions on several ML tasks such as learning with weighted samples and noisy label detection. Overall, our study suggests that when the underlying ML algorithm is stochastic, the Banzhaf value is a …

Poster
Shaokui Wei · Jiayin Liu · Bing Li · Hongyuan Zha

[ Auditorium 1 Foyer ]

We study the fair regression problem under the notion of Mean Parity (MP) fairness, which requires the conditional mean of the learned function output to be constant with respect to the sensitive attributes. We address this problem by leveraging reproducing kernel Hilbert space (RKHS) to construct the functional space whose members are guaranteed to satisfy the fairness constraints. The proposed functional space suggests a closed-form solution for the fair regression problem that is naturally compatible with multiple sensitive attributes. Furthermore, by formulating the fairness-accuracy tradeoff as a relaxed fair regression problem, we derive a corresponding regression function that can be implemented efficiently and provides interpretable tradeoffs. More importantly, under some mild assumptions, the proposed method can be applied to regression problems with a covariance-based notion of fairness. Experimental results on benchmark datasets show the proposed methods achieve competitive and even superior performance compared with several state-of-the-art methods.

Poster
Cyrus Cousins

[ Auditorium 1 Foyer ]

Cardinal objectives serve as intuitive targets in fair machine learning by summarizing utility (welfare) or disutility (malfare) u over g groups.Under standard axioms, all welfare and malfare functions are w-weighted p-power-means, with p ≤ 1 for welfare, or p ≥ 1 for malfare.We show the same under weaker axioms, and also identify stronger axioms that naturally restrict p.It is known that power-mean malfare functions are Lipschitz continuous, and thus statistically easy to estimate or learn.We show that all power means are locally Hölder continuous, i.e., |M(u; w)−M(u′ ; w)| ≤ α λ ∥u − u′∥^α for some λ, α,∥·∥.In particular, λ and 1/α are bounded except as p → 0 or mini wi → 0, and via this analysis we bound the sample complexity of optimizing welfare.This yields a novel concept of fair-PAC learning, wherein welfare functions are only polynomially harder to optimize than malfare functions, except when p ≈ 0 or mini wi ≈ 0, which is exponentially harder.

Poster
Matthäus Kleindessner · Michele Donini · Chris Russell · Muhammad Bilal Zafar

[ Auditorium 1 Foyer ]

We revisit the problem of fair principal component analysis (PCA), where the goal is to learn the best low-rank linear approximation of the data that obfuscates demographic information. We propose a conceptually simple approach that allows for an analytic solution similar to standard PCA and can be kernelized. Our methods have the same complexity as standard PCA, or kernel PCA, and run much faster than existing methods for fair PCA based on semidefinite programming or manifold optimization, while achieving similar results.

Poster
Boris van Breugel · Hao Sun · Zhaozhi Qian · Mihaela van der Schaar

[ Auditorium 1 Foyer ]

Data is the foundation of most science. Unfortunately, sharing data can be obstructed by the risk of violating data privacy, impeding research in fields like healthcare. Synthetic data is a potential solution. It aims to generate data that has the same distribution as the original data, but that does not disclose information about individuals. Membership Inference Attacks (MIAs) are a common privacy attack, in which the attacker attempts to determine whether a particular real sample was used for training of the model. Previous works that propose MIAs against generative models either display low performance---giving the false impression that data is highly private---or need to assume access to internal generative model parameters---a relatively low-risk scenario, as the data publisher often only releases synthetic data, not the model. In this work we argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution. We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model. Experimentally we show that DOMIAS is significantly more successful at MIA than previous work, especially at attacking uncommon samples. The latter is disconcerting since these samples may correspond to underrepresented groups. We …

Poster
Ossi Räisä · Joonas Jälkö · Samuel Kaski · Antti Honkela

[ Auditorium 1 Foyer ]

While generation of synthetic data under differential privacy (DP) has received a lot of attention in the data privacy community, analysis of synthetic data has received much less. Existing work has shown that simply analysing DP synthetic data as if it were real does not produce valid inferences of population-level quantities. For example, confidence intervals become too narrow, which we demonstrate with a simple experiment. We tackle this problem by combining synthetic data analysis techniques from the field of multiple imputation (MI), and synthetic data generation using noise-aware (NA) Bayesian modeling into a pipeline NA+MI that allows computing accurate uncertainty estimates for population-level quantities from DP synthetic data. To implement NA+MI for discrete data generation using the values of marginal queries, we develop a novel noise-aware synthetic data generation algorithm NAPSU-MQ using the principle of maximum entropy. Our experiments demonstrate that the pipeline is able to produce accurate confidence intervals from DP synthetic data. The intervals become wider with tighter privacy to accurately capture the additional uncertainty stemming from DP noise.

Poster
Ruihan Wu · Jin Peng Zhou · Kilian Weinberger · Chuan Guo

[ Auditorium 1 Foyer ]

Label differential privacy (label-DP) is a popular framework for training private ML models on datasets with public features and sensitive private labels. Despite its rigorous privacy guarantee, it has been observed that in practice label-DP does not preclude label inference attacks (LIAs): Models trained with label-DP can be evaluated on the public training features to recover, with high accuracy, the very private labels that it was designed to protect. In this work, we argue that this phenomenon is not paradoxical and that label-DP is designed to limit the advantage of an LIA adversary compared to predicting training labels using the Bayes classifier. At label-DP epsilon=0$ this advantage is zero, hence the optimal attack is to predict according to the Bayes classifier and is independent of the training labels. Finally, we empirically demonstrate that our result closely captures the behavior of simulated attacks on both synthetic and real world datasets.

Poster
Michelle Chen · Olga Ohrimenko

[ Auditorium 1 Foyer ]

We consider the problem of ensuring confidentiality of dataset properties aggregated over many records of a dataset. Such properties can encode sensitive information, such as trade secrets or demographic data, while involving a notion of data protection different to the privacy of individual records typically discussed in the literature. In this work, we demonstrate how a distribution privacy framework can be applied to formalize such data confidentiality. We extend the Wasserstein Mechanism from Pufferfish privacy and the Gaussian Mechanism from attribute privacy to this framework, then analyze their underlying data assumptions and how they can be relaxed. We then empirically evaluate the privacy-utility tradeoffs of these mechanisms and apply them against a practical property inference attack which targets global properties of datasets. The results show that our mechanisms can indeed reduce the effectiveness of the attack while providing utility substantially greater than a crude group differential privacy baseline. Our work thus provides groundwork for theoretical mechanisms for protecting global properties of datasets along with their evaluation in practice.

Poster
Rachel Redberg · Yuqing Zhu · Yu-Xiang Wang

[ Auditorium 1 Foyer ]

The "Propose-Test-Release" (PTR) framework is a classic recipe for designing differentially private (DP) algorithms that are data-adaptive, i.e. those that add less noise when the input dataset is "nice". We extend PTR to a more general setting by privately testing data-dependent privacy losses rather than local sensitivity, hence making it applicable beyond the standard noise-adding mechanisms, e.g. to queries with unbounded or undefined sensitivity. We demonstrate the versatility of generalized PTR using private linear regression as a case study. Additionally, we apply our algorithm to solve an open problem from “Private Aggregation of Teacher Ensembles (PATE)” --- privately releasing the entire model with a delicate data-dependent analysis.

Poster
Banghua Zhu · Lun Wang · Qi Pang · Shuai Wang · Jiantao Jiao · Dawn Song · Michael Jordan

[ Auditorium 1 Foyer ]

We propose Byzantine-robust federated learning protocols with nearly optimal statistical rates based on recent progress in high dimensional robust statistics. In contrast to prior work, our proposed protocols improve the dimension dependence and achieve a tight statistical rate in terms of all the parameters for strongly convex losses. We also provide matching statistical lower bound for the problem. For experiments, we benchmark against competing protocols and show the empirical superiority of the proposed protocols.

Poster
Karim TIT · Teddy Furon · Mathias Rousset

[ Auditorium 1 Foyer ]

Deep neural networks are robust against random corruptions of the inputs to some extent. This global sense of safety is not sufficient in critical applications where probabilities of failure must be assessed with accuracy. Some previous works applied known statistical methods from the field of rare event analysis to classification. Yet, they use classifiers as black-box models without taking into account gradient information, readily available for deep learning models via auto-differentiation. We propose a new and highly efficient estimator of probabilities of failure dedicated to neural networks as it leverages the fast computation of gradients of the model through back-propagation.

Poster
Rajeev Verma · Daniel Barrejon · Eric Nalisnick

[ Auditorium 1 Foyer ]

We study the statistical properties of learning to defer (L2D) to multiple experts. In particular, we address the open problems of deriving a consistent surrogate loss, confidence calibration, and principled ensembling of experts. Firstly, we derive two consistent surrogates---one based on a softmax parameterization, the other on a one-vs-all (OvA) parameterization---that are analogous to the single expert losses proposed by Mozannar & Sontag (2020) and Verma & Nalisnick (2022), respectively. We then study the frameworks' ability to estimate P( m_j = y | x ), the probability that the jth expert will correctly predict the label for x. Theory shows the softmax-based loss causes mis-calibration to propagate between the estimates while the OvA-based loss does not (though in practice, we find there are trade offs). Lastly, we propose a conformal inference technique that chooses a subset of experts to query when the system defers. We perform empirical validation on tasks for galaxy, skin lesion, and hate speech classification.

Poster
Zhouhao Yang · Yihong Guo · Pan Xu · Anqi Liu · Animashree Anandkumar

[ Auditorium 1 Foyer ]

Learning an optimal policy from offline data is notoriously challenging, which requires the evaluation of the learning policy using data pre-collected from a static logging policy. We study the policy optimization problem in offline contextual bandits using policy gradient methods. We employ a distributionally robust policy gradient method, DROPO, to account for the distributional shift between the static logging policy and the learning policy in policy gradient. Our approach conservatively estimates the conditional reward distributional and updates the policy accordingly. We show that our algorithm converges to a stationary point with rate $O(1/T)$, where $T$ is the number of time steps. We conduct experiments on real-world datasets under various scenarios of logging policies to compare our proposed algorithm with baseline methods in offline contextual bandits. We also propose a variant of our algorithm, DROPO-exp, to further improve the performance when a limited amount of online interaction is allowed. Our results demonstrate the effectiveness and robustness of the proposed algorithms, especially under heavily biased offline data.
Poster
Chieh-Hsin Lai · Dongmian Zou · Gilad Lerman

[ Auditorium 1 Foyer ]

We propose a new method for novelty detection that can tolerate high corruption of the training points, whereas previous works assumed either no or very low corruption. Our method trains a robust variational autoencoder (VAE), which aims to generate a model for the uncorrupted training points. To gain robustness to high corruption, we incorporate the following four changes to the common VAE: 1. Extracting crucial features of the latent code by a carefully designed dimension reduction component for distributions; 2. Modeling the latent distribution as a mixture of Gaussian low-rank inliers and full-rank outliers, where the testing only uses the inlier model; 3. Applying the Wasserstein-1 metric for regularization, instead of the Kullback-Leibler (KL) divergence; and 4. Using a robust error for reconstruction. We establish both robustness to outliers and suitability to low-rank modeling of the Wasserstein metric as opposed to the KL divergence. We illustrate state-of-the-art results on standard benchmarks.

Poster
Hoang Phan · Trung Le · Trung Phung · Anh Bui · Nhat Ho · Dinh Phung

[ Auditorium 1 Foyer ]

Despite superior performance in many situations, deep neural networks are often vulnerable to adversarial examples and distribution shifts, limiting model generalization ability in real-world applications. To alleviate these problems, recent approaches leverage distributional robustness optimization (DRO) to find the most challenging distribution, and then minimize loss function over this most challenging distribution. Regardless of having achieved some improvements, these DRO approaches have some obvious limitations. First, they purely focus on local regularization to strengthen model robustness, missing a global regularization effect that is useful in many real-world applications (e.g., domain adaptation, domain generalization, and adversarial machine learning). Second, the loss functions in the existing DRO approaches operate in only the most challenging distribution, hence decouple with the original distribution, leading to a restrictive modeling capability. In this paper, we propose a novel regularization technique, following the veins of Wasserstein-based DRO framework. Specifically, we define a particular joint distribution and Wasserstein-based uncertainty, allowing us to couple the original and most challenging distributions for enhancing modeling capability and applying both local and global regularizations. Empirical studies on different learning problems demonstrate that our proposed approach significantly outperforms the existing regularization approaches in various domains.

Poster
Maria-Florina Balcan · Rattana Pukdee · Pradeep Ravikumar · Hongyang Zhang

[ Auditorium 1 Foyer ]

Adversarial training is a standard technique for training adversarially robust models. In this paper, we study adversarial training as an alternating best-response strategy in a 2-player zero-sum game. We prove that even in a simple scenario of a linear classifier and a statistical model that abstracts robust vs. non-robust features, the alternating best response strategy of such game may not converge. On the other hand, a unique pure Nash equilibrium of the game exists and is provably robust. We support our theoretical results with experiments, showing the non-convergence of adversarial training and the robustness of Nash equilibrium.

Poster
Elvis Dohmatob · Chuan Guo · Morgane Goibert

[ Auditorium 1 Foyer ]

Machine learning models are known to be susceptible to adversarial perturbations. Even more concerning is the fact that these adversarial perturbations can be found by black-box search using surprisingly few queries, which essentially restricts the perturbation to a subspace of dimension $k$---much smaller than the dimension $d$ of the image space. This intriguing phenomenon raises the question: \emph{Is the vulnerability to black-box attacks inherent or can we hope to prevent them?} In this paper, we initiate a rigorous study of the phenomenon of low-dimensional adversarial perturbations (LDAPs). Our result characterizes precisely the sufficient conditions for the existence of LDAPs, and we show that these conditions hold for neural networks under practical settings, including the so-called lazy regime wherein the parameters of the trained network remain close to their values at initialization. Our theoretical results are confirmed by experiments on both synthetic and real data.
Poster
Natasa Tagasovska · Firat Ozdemir · Axel Brando

[ Auditorium 1 Foyer ]

Despite the major progress of deep models as learning machines, uncertainty estimation remains a major challenge. Existing solutions rely on modified loss functions or architectural changes. We propose to compensate for the lack of built-in uncertainty estimates by supplementing any network, retrospectively, with a subsequent vine copula model, Vine-Copula Neural Networks (VCNN). Through synthetic and real-data experiments, we show that VCNNs could be task (regression/classification) and architecture (recurrent, fully connected) agnostic, providing better-callibrated uncertainty estimates, comparable to state-of-the-art built-in uncertainty solutions.

Poster
Garud Iyengar · Henry Lam · Tianyu Wang

[ Auditorium 1 Foyer ]

Empirical risk minimization (ERM) and distributionally robust optimization (DRO) are popular approaches for solving stochastic optimization problems that appear in operations management and machine learning. Existing generalization error bounds for these methods depend on either the complexity of the cost function or dimension of the uncertain parameters; consequently, the performance of these methods is poor for high-dimensional problems with objective functions under high complexity. We propose a simple approach in which the distribution of uncertain parameters is approximated using a parametric family of distributions. This mitigates both sources of complexity; however, it introduces a model misspecification error. We show that this new source of error can be controlled by suitable DRO formulations. Our proposed parametric DRO approach has significantly improved generalization bounds over existing ERM / DRO methods and parametric ERM for a wide variety of settings. Our method is particularly effective under distribution shifts. We also illustrate the superior performance of our approach on both synthetic and real-data portfolio optimization and regression tasks.


Affinity Event: LatinX in AI Social Tue 25 Apr 06:00 p.m.  

Mirian Hipolito Garcia · Alejandro Casas · Santiago VELASCO-FORERO · Laura Montoya