### Poster Session

## Poster Session 3

##### Auditorium 1 Foyer

**Online Bilevel Optimization: Regret Analysis of Online Alternating Gradient Methods**

Davoud Ataee Tarzanagh · Parvin Nazari · Bojian Hou · Li Shen · Laura Balzano

This paper introduces \textit{online bilevel optimization} in which a sequence of time-varying bilevel problems is revealed one after the other. We extend the known regret bounds for single-level online algorithms to the bilevel setting. Specifically, we provide new notions of \textit{bilevel regret}, develop an online alternating time-averaged gradient method that is capable of leveraging smoothness, and give regret bounds in terms of the path-length of the inner and outer minimizer sequences.

**Quantifying intrinsic causal contributions via structure preserving interventions**

Dominik Janzing · Patrick Bloebaum · Atalanti Mastakouri · Philipp M. Faller · Lenon Minorics · Kailash Budhathoki

We propose a notion of causal influence that describes the `intrinsic' part of the contribution of a node on a target node in a DAG. By recursively writing each node as a function of the upstream noise terms, we separate the intrinsic information added by each node from the one obtained from its ancestors. To interpret the intrinsic information as a causal contribution, we consider`

structure-preserving interventions' that randomize each node in a way that mimics the usual dependence on the parents and does not perturb the observed joint distribution. To get a measure that is invariant across arbitrary orderings of nodes we use Shapley based symmetrization and show that it reduces in the linear case to simple ANOVA after resolving the target node into noise variables. We describe our contribution analysis for variance and entropy, but contributions for other target metrics can be defined analogously.

**Near-Optimal Pure Exploration in Matrix Games: A Generalization of Stochastic Bandits \& Dueling Bandits**

ARNAB MAITI · Ross Boczar · Kevin Jamieson · Lillian Ratliff

We study the sample complexity of identifying the pure strategy Nash equilibrium (PSNE) in a two-player zero-sum matrix game with noise. Formally, we are given a stochastic model where any learner can sample an entry $(i,j)$ of the input matrix $A\in [-1,1]^{n\times m}$ and observe $A_{i,j}+\eta$ where $\eta$ is a zero-mean $1$-sub-Gaussian noise. The aim of the learner is to identify the PSNE of $A$, whenever it exists, with high probability while taking as few samples as possible. Zhou et al., (2017) presents an instance-dependent sample complexity lower bound that depends only on the entries in the row and column in which the PSNE lies. We design a near-optimal algorithm whose sample complexity matches the lower bound, up to log factors. The problem of identifying the PSNE also generalizes the problem of pure exploration in stochastic multi-armed bandits and dueling bandits, and our result matches the optimal bounds, up to log factors, in both the settings.

**Optimal Budgeted Rejection Sampling for Generative Models**

Alexandre Vérine · Muni Sreenivas Pydi · Benjamin Negrevergne · Yann Chevaleyre

Rejection sampling methods have recently been proposed to improve the performance of discriminator-based generative models. However, these methods are only optimal under an unlimited sampling budget, and are usually applied to a generator trained independently of the rejection procedure. We first propose an Optimal Budgeted Rejection Sampling (OBRS) scheme that is provably optimal with respect to \textit{any} $f$-divergence between the true distribution and the post-rejection distribution, for a given sampling budget. Second, we propose an end-to-end method that incorporates the sampling scheme into the training procedure to further enhance the model's overall performance. Through experiments and supporting theory, we show that the proposed methods are effective in significantly improving the quality and diversity of the samples.

**Discriminator Guidance for Autoregressive Diffusion Models**

Filip Ekström Kelvinius · Fredrik Lindsten

We introduce discriminator guidance in the setting of Autoregressive Diffusion Models. The use of a discriminator to guide a diffusion process has previously been used for continuous diffusion models, and in this work we derive ways of using a discriminator together with a pretrained generative model in the discrete case. First, we show that using an optimal discriminator will correct the pretrained model and enable exact sampling from the underlying data distribution. Second, to account for the realistic scenario of using a sub-optimal discriminator, we derive a sequential Monte Carlo algorithm which iteratively takes the predictions from the discriminator into account during the generation process. We test these approaches on the task of generating molecular graphs and show how the discriminator improves the generative performance over using only the pretrained model.

**Leveraging Continuous Time to Understand Momentum When Training Diagonal Linear Networks**

Hristo Papazov · Scott Pesme · Nicolas Flammarion

In this work, we investigate the effect of momentum on the optimisation trajectory of gradient descent. We leverage a continuous-time approach in the analysis of momentum gradient descent with step size $\gamma$ and momentum parameter $\beta$ that allows us to identify an intrinsic quantity $\lambda = \frac{ \gamma }{ (1 - \beta)^2 }$ which uniquely defines the optimisation path and provides a simple acceleration rule. When training a $2$-layer diagonal linear network in an overparametrised regression setting, we characterise the recovered solution through an implicit regularisation problem. We then prove that small values of $\lambda$ help to recover sparse solutions. Finally, we give similar but weaker results for stochastic momentum gradient descent. We provide numerical experiments which support our claims.

**Approximate Bayesian Class-Conditional Models under Continuous Representation Shift**

Thomas Lee · Amos Storkey

For models consisting of a classifier in some representation space, learning online from a non-stationary data stream often necessitates changes in the representation. So, the question arises of what is the best way to adapt the classifier to shifts in representation. Current methods only slowly change the classifier to representation shift, introducing noise into learning as the classifier is misaligned to the representation. We propose DeepCCG, an empirical Bayesian approach to solve this problem. DeepCCG works by updating the posterior of a class conditional Gaussian classifier such that the classifier adapts in one step to representation shift. The use of a class conditional Gaussian classifier also enables DeepCCG to use a log conditional marginal likelihood loss to update the representation. To perform the update to the classifier and representation, DeepCCG maintains a fixed number of examples in memory and so a key part of DeepCCG is selecting what examples to store, choosing the subset that minimises the KL divergence between the true posterior and the posterior induced by the subset. We explore the behaviour of DeepCCG in online continual learning (CL), demonstrating that it performs well against a spectrum of online CL methods and that it reduces the change in performance due to representation shift.

**Proving Linear Mode Connectivity of Neural Networks via Optimal Transport**

Damien Ferbach · Baptiste Goujaud · Gauthier Gidel · Aymeric Dieuleveut

The energy landscape of high-dimensional non-convex optimization problems is crucial to understanding the effectiveness of modern deep neural network architectures. Recent works have experimentally shown that two different solutions found after two runs of a stochastic training are often connected by very simple continuous paths (e.g., linear) modulo a permutation of the weights. In this paper, we provide a framework theoretically explaining this empirical observation. Based on convergence rates in Wasserstein distance of empirical measures, we show that, with high probability, two wide enough two-layer neural networks trained with stochastic gradient descent are linearly connected. Additionally, we express upper and lower bounds on the width of each layer of two deep neural networks with independent neuron weights to be linearly connected. Finally, we empirically demonstrate the validity of our approach by showing how the dimension of the support of the weight distribution of neurons, which dictates Wasserstein convergence rates is correlated with linear mode connectivity.

**Functional Flow Matching**

Gavin Kerrigan · Giosue Migliorini · Padhraic Smyth

We propose Functional Flow Matching (FFM), a function-space generative model that generalizes the recently-introduced Flow Matching model to operate directly in infinite-dimensional spaces. Our approach works by first defining a path of probability measures that interpolates between a fixed Gaussian measure and the data distribution, followed by learning a vector field on the underlying space of functions that generates this path of measures. Our method does not rely on likelihoods or simulations, making it well-suited to the function space setting. We provide both a theoretical framework for building such models and an empirical evaluation of our techniques. We demonstrate through experiments on synthetic and real-world benchmarks that our proposed FFM method outperforms several recently proposed function-space generative models.

**DE-HNN: An effective neural model for Circuit Netlist representation**

Zhishang Luo · Truong Son Hy · Puoya Tabaghi · Michaël Defferrard · elahe rezaei · Ryan Carey · Rhett Davis · rajeev jain · Yusu Wang

The run-time for optimization tools used in chip design has grown with the complexity of designs to the point where it can take several days to go through one design cycle which has become a bottleneck. Designers want fast tools that can quickly give feedback on a design. Using the input and output data of the tools from past designs, one can attempt to build a machine learning model that predicts the outcome of a design in significantly shorter time than running the tool. The accuracy of such models is affected by the representation of the design data, which is usually a netlist that describes the elements of the digital circuit and how they are connected. Graph representations for the netlist together with graph neural networks have been investigated for such models. However, the characteristics of netlists pose several challenges for existing graph learning frameworks, due to the large number of nodes and the importance of long-range interactions between nodes. To address these challenges, we represent the netlist as a directed hypergraph and propose a Directional Equivariant Hypergraph Neural Network (DE-HNN) for the effective learning of (directed) hypergraphs. Theoretically, we show that our DE-HNN can universally approximate any node or hyperedge based function that satisfies certain permutation equivariant and invariant properties natural for directed hypergraphs. We compare the proposed DE-HNN with several State-of-the-art (SOTA) machine learning models for (hyper)graphs and netlists, and show that the DE-HNN significantly outperforms them in predicting the outcome of optimized place-and-route tools directly from the input netlists.

**Towards Practical Non-Adversarial Distribution Matching**

Ziyu Gong · Ben Usman · Han Zhao · David Inouye

Distribution matching can be used to learn invariant representations with applications in fairness and robustness. Most prior works resort to adversarial matching methods but the resulting minimax problems are unstable and challenging to optimize. Non-adversarial likelihood-based approaches either require model invertibility, impose constraints on the latent prior, or lack a generic framework for distribution matching. To overcome these limitations, we propose a non-adversarial VAE-based matching method that can be applied to any model pipeline. We develop a set of alignment upper bounds for distribution matching (including a noisy bound) that have VAE-like objectives but with a different perspective. We carefully compare our method to prior VAE-based matching approaches both theoretically and empirically. Finally, we demonstrate that our novel matching losses can replace adversarial losses in standard invariant representation learning pipelines without modifying the original architectures---thereby significantly broadening the applicability of non-adversarial matching methods.

Graph classification is an important learning task for graph-structured data. Graph neural networks (GNNs) have recently gained growing attention in graph learning and shown significant improvements on many important graph problems. Despite their state-of-the-art performances, existing GNNs only use local information from a very limited neighborhood around each node, suffering from loss of multi-modal information and overheads of excessive computation. To address these issues, we propose a novel Tensor-view Topological Graph Neural Network (TTG-NN), a class of simple yet effective topological deep learning built upon persistent homology, graph convolution, and tensor operations. This new method incorporates tensor learning to simultaneously capture {\it Tensor-view Topological} (TT), as well as Tensor-view Graph (TG) structural information on both local and global levels. Computationally, to fully exploit graph topology and structure, we propose two flexible TT and TG representation learning modules which disentangles feature tensor aggregation and transformation, and learns to preserve multi-modal structure with less computation. Theoretically, we derive high probability bounds on both the out-of-sample and in-sample mean squared approximation errors for our proposed Tensor Transformation Layer (TTL). Real data experiments show that the proposed TTG-NN outperforms 20 state-of-the-art methods on various graph benchmarks.

**Accelerating Approximate Thompson Sampling with Underdamped Langevin Monte Carlo**

Haoyang Zheng · Wei Deng · Christian Moya · Guang Lin

Approximate Thompson sampling with Langevin Monte Carlo broadens its reach from Gaussian posterior sampling to encompass more general smooth posteriors. However, it still encounters scalability issues in high-dimensional problems when demanding high accuracy. To address this, we propose an approximate Thompson sampling strategy, utilizing underdamped Langevin Monte Carlo, where the latter is the go-to workhorse for simulations of high-dimensional posteriors. Based on the standard smoothness and log-concavity conditions, we study the accelerated posterior concentration and sampling using a specific potential function. This design improves the sample complexity for realizing logarithmic regrets from $\mathcal{\tilde O}(d)$ to $\mathcal{\tilde O}(\sqrt{d})$. The scalability and robustness of our algorithm are also empirically validated through synthetic experiments in high-dimensional bandit problems.

**FALCON: FLOP-Aware Combinatorial Optimization for Neural Network Pruning**

Xiang Meng · Wenyu Chen · Riade Benbaki · Rahul Mazumder

The increasing computational demands of modern neural networks present deployment challenges on resource-constrained devices. Network pruning offers a solution to reduce model size and computational cost while maintaining performance. However, current pruning methods focus primarily on improving sparsity by reducing the number of nonzero parameters, often neglecting other deployment costs such as inference time, which are closely related to the number of floating-point operations (FLOPs). In this paper, we propose {FALCON}, a novel combinatorial-optimization-based framework for network pruning that jointly takes into account model accuracy (fidelity), FLOPs, and sparsity constraints. A main building block of our approach is an integer linear program (ILP) that simultaneously handles FLOP and sparsity constraints. We present a novel algorithm to approximately solve the ILP. We propose a novel first-order method for our optimization framework which makes use of our ILP solver. Using problem structure (e.g., the low-rank structure of approx. Hessian), we can address instances with millions of parameters. Our experiments demonstrate that {FALCON}~achieves superior accuracy compared to other pruning approaches within a fixed FLOP budget. For instance, for ResNet50 with 20\% of the total FLOPs retained, our approach improves the accuracy by 48\% relative to state-of-the-art. Furthermore, in gradual pruning settings with re-training between pruning steps, our framework outperforms existing pruning methods, emphasizing the significance of incorporating both FLOP and sparsity constraints for effective network pruning.

**The effect of Leaky ReLUs on the training and generalization of overparameterized networks**

Yinglong Guo · Shaohan Li · Gilad Lerman

We investigate the training and generalization errors of overparameterized neural networks (NNs) with a wide class of leaky rectified linear unit (ReLU) functions. More specifically, we carefully upper bound both the convergence rate of the training error and the generalization error of such NNs and investigate the dependence of these bounds on the Leaky ReLU parameter, $\alpha$. We show that $\alpha =-1$, which corresponds to the absolute value activation function, is optimal for the training error bound. Furthermore, in special settings, it is also optimal for the generalization error bound. Numerical experiments empirically support the practical choices guided by the theory.

**Soft-constrained Schr\"odinger Bridge: a Stochastic Control Approach**

Jhanvi Garg · Xianyang Zhang · Quan Zhou

Schrödinger bridge can be viewed as a continuous-time stochastic control problem where the goal is to find an optimally controlled diffusion process whose terminal distribution coincides with a pre-specified target distribution. We propose to generalize this problem by allowing the terminal distribution to differ from the target but penalizing the Kullback-Leibler divergence between the two distributions. We call this new control problem soft-constrained Schrödinger bridge (SSB). The main contribution of this work is a theoretical derivation of the solution to SSB, which shows that the terminal distribution of the optimally controlled process is a geometric mixture of the target and some other distribution. This result is further extended to a time series setting. One application is the development of robust generative diffusion models. We propose a score matching-based algorithm for sampling from geometric mixtures and showcase its use via a numerical example for the MNIST data set.

**Understanding the Generalization Benefits of Late Learning Rate Decay**

Yinuo Ren · Chao Ma · Lexing Ying

Why do neural networks trained with large learning rates for longer time often lead to better generalization? In this paper, we delve into this question by examining the relation between training and testing loss in neural networks. Through visualization of these losses, we note that the training trajectory with a large learning rate navigates through the minima manifold of the training loss, finally nearing the neighborhood of the testing loss minimum. Motivated by these findings, we introduce a nonlinear model whose loss landscapes mirror those observed for real neural networks. Upon investigating the training process using SGD on our model, we demonstrate that an extended phase with a large learning rate steers our model towards the minimum norm solution of the training loss, which may achieve near-optimal generalization, thereby affirming the empirically observed benefits of late learning rate decay.

**BLIS-Net: Classifying and Analyzing Signals on Graphs**

Charles Xu · Laney Goldman · Valentina Guo · Benjamin Hollander-Bodie · Maedee Trank-Greene · Ian Adelstein · Edward De Brouwer · Rex Ying · Smita Krishnaswamy · Michael Perlmutter

Graph neural networks (GNNs) have emerged as a powerful tool for tasks such as node classification and graph classification. However, much less work has been done on signal classification, where the data consists of many functions (referred to as signals) defined on the vertices of a single graph. These tasks require networks designed differently from those designed for traditional GNN tasks. Indeed, traditional GNNs rely on localized low-pass filters, and signals of interest may have intricate multi-frequency behavior and exhibit long range interactions. This motivates us to introduce the BLIS-Net (Bi-Lipschitz Scattering Net), a novel GNN that builds on the previously introduced geometric scattering transform. Our network is able to capture both local and global signal structure and is able to capture both low-frequency and high-frequency information. We make several crucial changes to the original geometric scattering architecture which we prove increase the ability of our network to capture information about the input signal and show that BLIS-Net achieves superior performance on both synthetic and real-world data sets based on traffic flow and fMRI data.

**Two Birds with One Stone: Enhancing Uncertainty Quantification and Interpretability with Graph Functional Neural Process**

Lingkai Kong · Haotian Sun · Yuchen Zhuang · Haorui Wang · Wenhao Mu · Chao Zhang

Graph neural networks (GNNs) are powerful tools on graph data. However, their predictions are mis-calibrated and lack interpretability, limiting their adoption in critical applications. To address this issue, we propose a new uncertainty-aware and interpretable graph classification model that combines graph functional neural process and graph generative model. The core of our method is to assume a set of latent rationales which can be mapped to a probabilistic embedding space; the predictive distribution of the classifier is conditioned on such rationale embeddings by learning a stochastic correlation matrix. The graph generator serves to decode the graph structure of the rationales from the embedding space for model interpretability. For efficient model training, we adopt an alternating optimization procedure which mimics the well known Expectation-Maximization (EM) algorithm. The proposed method is general and can be applied to any existing GNN architecture. Extensive experiments on five graph classification datasets demonstrate that our framework outperforms state-of-the-art methods in both uncertainty quantification and GNN interpretability. We also conduct case studies to show that the decoded rationale structure can provide meaningful explanations.

**Understanding Inverse Scaling and Emergence in Multitask Representation Learning**

Muhammed Ildiz · Zhe Zhao · Samet Oymak

Large language models exhibit strong multitasking capabilities, however, their learning dynamics as a function of task characteristics, sample size, and model complexity remain mysterious. For instance, it is known that, as the model size grows, large language models exhibit emerging abilities where certain tasks can abruptly jump from poor to respectable performance. Such phenomena motivate a deeper understanding of how individual tasks evolve during multitasking. To this aim, we study a multitask representation learning setup where tasks can have distinct distributions, quantified by their covariance priors. Through random matrix theory, we precisely characterize the optimal linear representation for few-shot learning that minimizes the average test risk in terms of task covariances. When tasks have equal sample sizes, we prove a reduction to an equivalent problem with a single effective covariance from which the individual task risks of the original problem can be deduced. Importantly, we introduce “task competition” to explain how tasks with dominant covariance eigenspectrum emerge faster than others. We show that task competition can potentially explain the inverse scaling of certain tasks i.e. reduced test accuracy as the model grows. Overall, this work sheds light on the risk and emergence of individual tasks and uncovers new high-dimensional phenomena (including multiple-descent risk curves) that arise in multitask representation learning.

**Deep Classifier Mimicry without Data Access**

Steven Braun · Martin Mundt · Kristian Kersting

Access to pre-trained models has recently emerged as a standard across numerous machine learning domains. Unfortunately, access to the original data the models were trained on may not equally be granted. This makes it tremendously challenging to fine-tune, compress models, adapt continually, or to do any other type of data-driven update. We posit that original data access may however not be required. Specifically, we propose Contrastive Abductive Knowledge Extraction (CAKE), a model-agnostic knowledge distillation procedure that mimics deep classifiers without access to the original data. To this end, CAKE generates pairs of noisy synthetic samples and diffuses them contrastively toward a model’s decision boundary. We empirically corroborate CAKE's effectiveness using several benchmark datasets and various architectural choices, paving the way for broad application.

**Data Driven Threshold and Potential Initialization for Spiking Neural Networks**

Velibor Bojkovic · srinivas anumasa · Giulia De Masi · Bin Gu · Huan Xiong

Spiking neural networks (SNNs) present an increasingly popular alternative to artificial neural networks (ANNs), due to their energy and time efficiency when deployed on neuromorphic hardware. However, due to their discrete and highly non-differentiable nature, training SNNs is a challenging task and remains an active area of research. Some of the most prominent ways to train SNNs are based on ANN-to-SNN conversion where an SNN model is initialized with parameters from the corresponding, pre-trained ANN model. SNN models trained through ANN-to-SNN conversion or hybrid training show state of the art performance among SNNs on many machine learning tasks, comparable to those of ANNs. However, the top performing models need high latency or tailored ANNs to perform well, and in general are not using the full information available from ANNs. In this work, we propose novel method to initialize SNN's thresholds and initial membrane potential after ANN-to-SNN conversion, using distributions of ANN's activation values. We provide a theoretical framework for feature distribution-based conversion error, providing theoretical results on optimal membrane initialization and thresholds which minimize this error, as well as a practical algorithm for finding these optimal values. We test our method, both as a stand-alone ANN-to-SNN conversion and in combination with other methods, and show state of the art results on high-dimensional datasets such as CIFAR10, CIFAR100 and ImageNet and various architectures. Our code is available at \url{https://github.com/srinuvaasu/data*driven*init}

**Revisiting the Noise Model of Stochastic Gradient Descent**

Barak Battash · Lior Wolf · Ofir Lindenbaum

The effectiveness of stochastic gradient descent (SGD) in neural network optimization is significantly influenced by stochastic gradient noise (SGN). Following the central limit theorem, SGN was initially described as Gaussian, but recently Simsekli et al (2019) demonstrated that the $S\alpha S$ Lévy distribution provides a better fit for the SGN. This assertion was purportedly debunked and rebounded to the Gaussian noise model that had been previously proposed. This study provides robust, comprehensive empirical evidence that SGN is heavy-tailed and is better represented by the $S\alpha S$ distribution. Our experiments include several datasets and multiple models, both discriminative and generative. Furthermore, we argue that different network parameters preserve distinct SGN properties. We develop a novel framework based on a Lévy-driven stochastic differential equation (SDE), where one-dimensional Lévy processes describe each parameter. This leads to a more accurate characterization of the dynamics of SGD around local minima. We use our framework to study SGD properties near local minima; these include the mean escape time and preferable exit directions.

Real-world systems often involve some pool of users choosing between a set of services. With the increase in popularity of online learning algorithms, these services can now self-optimize, leveraging data collected on users to maximize some reward such as service quality. On the flipside, users may strategically choose which services to use in order to pursue their own reward functions, in the process wielding power over which services can see and use their data. Extensive prior research has been conducted on the effects of strategic users in single-service settings, with strategic behavior manifesting in the manipulation of observable features to achieve a desired classification; however, this can often be costly or unattainable for users and fails to capture the full behavior of multi-service dynamic systems. As such, we analyze a setting in which strategic users choose among several available services in order to pursue positive classifications, while services seek to minimize loss functions on their observations. We focus our analysis on realizable settings, and show that naive retraining can still lead to oscillation even if all users are observed at different times; however, if this retraining uses memory of past observations, convergent behavior can be guaranteed for certain loss function classes. We provide results obtained from synthetic and real-world data to empirically validate our theoretical findings.

**Random Oscillators Network for Time Series Processing**

Andrea Ceni · Andrea Cossu · Maximilian Stölzle · Jingyue Liu · Cosimo Della Santina · Davide Bacciu · Claudio Gallicchio

We introduce the Random Oscillators Network (RON), a physically-inspired recurrent model derived from a network of heterogeneous oscillators. Unlike traditional recurrent neural networks, RON keeps the connections between oscillators untrained by leveraging on smart random initialisations, leading to exceptional computational efficiency.A rigorous theoretical analysis finds the necessary and sufficient conditions for the stability of RON, highlighting the natural tendency of RON to lie at the edge of stability, a regime of configurations offering particularly powerful and expressive models.Through an extensive empirical evaluation on several benchmarks, we show four main advantages of RON.1) RON shows excellent long-term memory and sequence classification ability, outperforming other randomised approaches.2) RON outperforms fully-trained recurrent models and state-of-the-art randomised models in chaotic time series forecasting. 3) RON provides expressive internal representations even in a small parametrisation regime making it amenable to be deployed on low-powered devices and at the edge.4) RON is up to two orders of magnitude faster than fully-trained models.

**Non-Convex Joint Community Detection and Group Synchronization via Generalized Power Method**

Sijin Chen · Xiwei Cheng · Anthony Man-Cho So

This paper proposes a Generalized Power Method (GPM) to simultaneously solve the joint problem of community detection and group synchronization in a direct non-convex manner, in contrast to the existing method of semidefinite programming (SDP). Under a natural extension of stochastic block model (SBM), our theoretical analysis proves that the proposed algorithm is able to exactly recover the ground truth in $O(n\log^2 n)$ time for problems of size $n$, sharply outperforming the $O(n^{3.5})$ runtime of SDP. Moreover, we give a lower bound of model parameters as a sufficient condition for the exact recovery of GPM. The new bound breaches the information-theoretic limit for pure community detection under SBM, thus demonstrating the superiority of our simultaneous optimization algorithm over any two-stage method that performs the two tasks in succession. We also conduct numerical experiments on GPM and SDP to corroborate our theoretical analysis.

**Fast Minimization of Expected Logarithmic Loss via Stochastic Dual Averaging**

Chung-En Tsai · Hao-Chung Cheng · Yen-Huan Li

Consider the problem of minimizing an expected logarithmic loss over either the probability simplex or the set of quantum density matrices. This problem includes tasks such as solving the Poisson inverse problem, computing the maximum-likelihood estimate for quantum state tomography, and approximating positive semi-definite matrix permanents with the currently tightest approximation ratio. Although the optimization problem is convex, standard iteration complexity guarantees for first-order methods do not directly apply due to the absence of Lipschitz continuity and smoothness in the loss function.In this work, we propose a stochastic first-order algorithm named $B$-sample stochastic dual averaging with the logarithmic barrier. For the Poisson inverse problem, our algorithm attains an $\varepsilon$-optimal solution in $\smash{\tilde{O}}(d^2/\varepsilon^2)$ time, matching the state of the art, where $d$ denotes the dimension. When computing the maximum-likelihood estimate for quantum state tomography, our algorithm yields an $\varepsilon$-optimal solution in $\smash{\tilde{O}}(d^3/\varepsilon^2)$ time. This improves on the time complexities of existing stochastic first-order methods by a factor of $d^{\omega-2}$ and those of batch methods by a factor of $d^2$, where $\omega$ denotes the matrix multiplication exponent. Numerical experiments demonstrate that empirically, our algorithm outperforms existing methods with explicit complexity guarantees.

**General Tail Bounds for Non-Smooth Stochastic Mirror Descent**

Khaled Eldowa · Andrea Paudice

In this paper, we provide novel tail bounds on the optimization error of Stochastic Mirror Descent for convex and Lipschitz objectives. Our analysis extends the existing tail bounds from the classical light-tailed Sub-Gaussian noise case to heavier-tailed noise regimes. We study the optimization error of the last iterate as well as the average of the iterates. We instantiate our results in two important cases: a class of noise with exponential tails and one with polynomial tails. A remarkable feature of our results is that they do not require an upper bound on the diameter of the domain. Finally, we support our theory with illustrative experiments that compare the behavior of the average of the iterates with that of the last iterate in heavy-tailed noise regimes.

**Communication-Efficient Federated Learning With Data and Client Heterogeneity**

Hossein Zakerinia · Shayan Talaei · Giorgi Nadiradze · Dan Alistarh

Federated Learning (FL) enables large-scale distributed training of machine learning models, while still allowing individual nodes to maintain data locally. However, executing FL at scale comes with inherent practical challenges: 1) heterogeneity of the local node data distributions, 2) heterogeneity of node computational speeds (asynchrony), but also 3) constraints in the amount of communication between the clients and the server. In this work, we present the first variant of the classic federated averaging (FedAvg) algorithm which, at the same time, supports data heterogeneity, partial client asynchrony, and communication compression. Our algorithm comes with a novel, rigorous analysis showing that, in spite of these system relaxations, it can provide similar convergence to FedAvg in interesting parameter regimes. Experimental results in the rigorous LEAF benchmark on setups of up to $300$ nodes show that our algorithm ensures fast convergence for standard federated tasks, improving upon prior quantized and asynchronous approaches.

**Taming Nonconvex Stochastic Mirror Descent with General Bregman Divergence**

Ilyas Fatkhullin · Niao He

This paper revisits the convergence of Stochastic Mirror Descent (SMD) in the contemporary nonconvex optimization setting. Existing results for batch-free nonconvex SMD restrict the choice of the distance generating function (DGF) to be differentiable with Lipschitz continuous gradients, thereby excluding important setups such as Shannon entropy. In this work, we present a new convergence analysis of nonconvex SMD supporting general DGF, that overcomes the above limitations and relies solely on the standard assumptions. Moreover, our convergence is established with respect to the Bregman Forward-Backward envelope, which is a stronger measure than the commonly used squared norm of gradient mapping. We further extend our results to guarantee high probability convergence under sub-Gaussian noise and global convergence under the generalized Bregman Proximal Polyak-{\L}ojasiewicz condition. Additionally, we illustrate the advantages of our improved SMD theory in various nonconvex machine learning tasks by harnessing nonsmooth DGFs. Notably, in the context of nonconvex differentially private (DP) learning, our theory yields a simple algorithm with a (nearly) dimension-independent utility bound. For the problem of training linear neural networks, we develop provably convergent stochastic algorithms.

**Why is parameter averaging beneficial in SGD? An objective smoothing perspective**

Atsushi Nitanda · Ryuhei Kikuchi · Shugo Maeda · Denny Wu

It is often observed that stochastic gradient descent (SGD) and its variants implicitly select a solution with good generalization performance; such implicit bias is often characterized in terms of the sharpness of the minima. Kleinberg et al. (2018) connected this bias with the smoothing effect of SGD which eliminates sharp local minima by the convolution using the stochastic gradient noise. We follow this line of research and study the commonly-used averaged SGD algorithm, which has been empirically observed in Izmailov et al. (2018) to prefer a flat minimum and therefore achieves better generalization. We prove that in certain problem settings, averaged SGD can efficiently optimize the smoothed objective which avoids sharp local minima. In experiments, we verify our theory and show that parameter averaging with an appropriate step size indeed leads to significant improvement in the performance of SGD.

**Stochastic Extragradient with Random Reshuffling: Improved Convergence for Variational Inequalities**

Konstantinos Emmanouilidis · Rene Vidal · Nicolas Loizou

The Stochastic Extragradient (SEG)method is one of the most popular algorithms for solving finite-sum min-maxoptimization and variational inequality problems (VIPs) appearing in variousmachine learning tasks. However, existingconvergence analyses of SEG focus on itswith-replacement variants, while practicalimplementations of the method randomlyreshuffle components and sequentiallyuse them. Unlike the well-studied with-replacement variants, SEG with RandomReshuffling (SEG-RR) lacks establishedtheoretical guarantees. In this work, weprovide a convergence analysis of SEG-RRfor three classes of VIPs: (i) stronglymonotone, (ii) affine, and (iii) monotone. We derive conditions underwhich SEG-RR achieves a faster convergence rate thanthe uniform with-replacement samplingSEG. In the monotone setting, our analysis of SEG-RR guarantees convergence to an arbitrary accuracy without large batch sizes, a strong requirement needed in the classical with-replacement SEG. As a byproduct of our results, we provide convergence guarantees forShuffle Once SEG (shuffles the data onlyat the beginning of the algorithm) and theIncremental Extragradient (does not shuffle thedata). We supplement our analysis with experiments validating empirically the superior performance of SEG-RR over the classical with-replacement sampling SEG.

**Fast and Accurate Estimation of Low-Rank Matrices from Noisy Measurements via Preconditioned Non-Convex Gradient Descent**

Jialun Zhang · Richard Zhang · Hong-Ming Chiu

Non-convex gradient descent is a common approach for estimating a low-rank $n\times n$ ground truth matrix from noisy measurements, because it has per-iteration costs as low as $O(n)$ time, and is in theory capable of converging to a minimax optimal estimate. However, the practitioner is often constrained to just tens to hundreds of iterations, and the slow and/or inconsistent convergence of non-convex gradient descent can prevent a high-quality estimate from being obtained. Recently, the technique of \emph{preconditioning} was shown to be highly effective at accelerating the local convergence of non-convex gradient descent when the measurements are noiseless. In this paper, we describe how preconditioning should be done for noisy measurements to accelerate local convergence to minimax optimality. For the symmetric matrix sensing problem, our proposed preconditioned method is guaranteed to locally converge to minimax error at a linear rate that is immune to ill-conditioning and/or over-parameterization. Using our proposed preconditioned method, we perform a 60 megapixel medical image denoising task, and observe significantly reduced noise levels compared to previous approaches.

**LP-based Construction of DC Decompositions for Efficient Inference of Markov Random Fields**

Chaitanya Murti · Dhruva Kashyap · Chiranjib Bhattacharyya

The success of the convex-concave procedure (CCCP), a widely used technique for non-convex optimization, crucially depends on finding a decomposition of the objective function as a difference of convex functions (dcds). Despite the widespread applicability of CCCP, finding such dcds has attracted little attention in machine learning. For graphical models with polynomial potentials, existing methods for finding dcds require solving a Sum-of-Squares (SOS) program, which is often prohibitively expensive. In this work, we leverage tools from algebraic geometry certifying the positivity of polynomials, to derive LP-based constructions of dcds of polynomials which are particularly suited for graphical model inference. Our experiments demonstrate that using our LP-based technique constructs dcds for polynomial potentials of Markov random fields significantly faster compared to SOS-based approaches used in previous works.

**On Parameter Estimation in Deviated Gaussian Mixture of Experts**

Huy Nguyen · Khai Nguyen · Nhat Ho

We consider the parameter estimation problem in the deviated Gaussian mixture of experts in which the data are generated from $(1 - \lambda^{\ast}) g_0(Y| X)+ \lambda^{\ast} \sum_{i = 1}^{k_{\ast}} p_{i}^{\ast} f(Y|(a_{i}^{\ast})^{\top}X+b_i^{\ast},\sigma_{i}^{\ast})$, where $X, Y$ are respectively a covariate vector and a response variable, $g_{0}(Y|X)$ is a known function, $\lambda^{\ast} \in [0, 1]$ is true but unknown mixing proportion, and $(p_{i}^{\ast}, a_{i}^{\ast}, b_{i}^{\ast}, \sigma_{i}^{\ast})$ for $1 \leq i \leq k^{\ast}$ are unknown parameters of the Gaussian mixture of experts. This problem arises from the goodness-of-fit test when we would like to test whether the data are generated from $g_{0}(Y|X)$ (null hypothesis) or they are generated from the whole mixture (alternative hypothesis). Based on the algebraic structure of the expert functions and the distinguishability between $g_0$ and the mixture part, we construct novel Voronoi-based loss functions to capture the convergence rates of maximum likelihood estimation (MLE) for our models. We further demonstrate that our proposed loss functions characterize the local convergence rates of parameter estimation more accurately than the generalized Wasserstein, a loss function being commonly used for estimating parameters in the Gaussian mixture of experts.

**THE SAMPLE COMPLEXITY OF ERM IN EUCLIDEAN STOCHASTIC CONVEX OPTIMIZATION**

Daniel Carmon · Amir Yehudayoff · Roi Livni

Stochastic convex optimization is a model for optimization in a stochastic environment that is also motivated by the theory of machine learning. Empirical risk minimization (ERM) is a fundamental algorithmic principle. We study the sample complexity of ERM in stochastic convex optimization in the standard Euclidean setup in d-dimensional space with accuracy parameter $\epsilon > 0$. A folklore example shows that the sample complexity is at least $\Omega(\frac{1}{\epsilon^2})$. Shalev-Shwartz, Shamir, Srebro and Sridharan proved that the complexity is at least $\Omega(\log(d))$ for $\epsilon = \frac{1}{2}$.Later on, Feldman proved that it is at least $\Omega(\frac{d}{\epsilon})$. Proving asymptotically stronger lower bounds was left as an open problem. We prove that, in fact, the sample complexity is at most $\tilde{O}(\frac{d}{\epsilon}+\frac{1}{\epsilon^2})$. The proof builds a mechanism for controlling stochastic convexoptimization problems.

**Multi-objective Optimization via Wasserstein-Fisher-Rao Gradient Flow**

Yinuo Ren · Tesi Xiao · Tanmay Gangwani · Anshuka Rangi · Holakou Rahmanian · Lexing Ying · Subhajit Sanyal

Multi-objective optimization (MOO) aims to optimize multiple, possibly conflicting objectives with widespread applications. We introduce a novel interacting particle method for MOO inspired by molecular dynamics simulations. Our approach combines overdamped Langevin and birth-death dynamics, incorporating a ``dominance potential'' to steer particles toward global Pareto optimality. In contrast to previous methods, our method is able to relocate dominated particles, making it particularly adept at managing Pareto fronts of complicated geometries. Our method is also theoretically grounded as a Wasserstein-Fisher-Rao gradient flow with convergence guarantees. Extensive experiments confirm that our approach outperforms state-of-the-art methods on challenging synthetic and real-world datasets.

**Stochastic Smoothed Gradient Descent Ascent for Federated Minimax Optimization**

Wei Shen · Minhui Huang · Jiawei Zhang · Cong Shen

In recent years, federated minimax optimization has attracted growing interest due to its extensive applications in various machine learning tasks. While Smoothed Alternative Gradient Descent Ascent (Smoothed-AGDA) has proved successful in centralized nonconvex minimax optimization, how and whether smoothing techniques could be helpful in a federated setting remains unexplored. In this paper, we propose a new algorithm termed Federated Stochastic Smoothed Gradient Descent Ascent (FESS-GDA), which utilizes the smoothing technique for federated minimax optimization. We prove that FESS-GDA can be uniformly applied to solve several classes of federated minimax problems and prove new or better analytical convergence results for these settings. We showcase the practical efficiency of FESS-GDA in practical federated learning tasks of training generative adversarial networks (GANs) and fair classification.

**Informative Path Planning with Limited Adaptivity**

Rayen Tan · Rohan Ghuge · viswanath nagarajan

We consider the informative path planning (IPP) problem in which a robot interacts with an uncertain environment and gathers information by visiting locations. The goal is to minimize its expected travel cost to cover a given submodular function. Adaptive solutions, where the robot incorporates all available information to select the next location to visit, achieve the best objective. However, such a solution is resource-intensive as it entails recomputing after every visited location. A more practical approach is to design solutions with a small number of adaptive "rounds", where the robot recomputes only once at the start of each round. In this paper, we design an algorithm for IPP parameterized by the number k of adaptive rounds, and prove a smooth tradeoff between k and the solution quality (relative to fully adaptive solutions). We validate our theoretical results by experiments on a real road network, where we observe that a few rounds of adaptivity suffice to obtain solutions of cost almost as good as fully-adaptive ones.

**Acceleration and Implicit Regularization in Gaussian Phase Retrieval**

Tyler Maunu · Martin Molina-Fructuoso

We study accelerated optimization methods in the Gaussian phase retrieval problem. In this setting, we prove that gradient methods with Polyak or Nesterov momentum have similar implicit regularization to gradient descent. This implicit regularization ensures that the algorithms remain in a nice region, where the cost function is strongly convex and smooth despite being nonconvex in general. This ensures that these accelerated methods achieve faster rates of convergence than gradient descent. Experimental evidence demonstrates that the accelerated methods converge faster than gradient descent in practice.

**Stochastic Methods in Variational Inequalities: Ergodicity, Bias and Refinements**

Emmanouil Vasileios Vlatakis-Gkaragkounis · Angeliki Giannou · Yudong Chen · Qiaomin Xie

For min-max optimization and variational inequalities problems (VIPs), Stochastic Extragradient (SEG) and Stochastic Gradient Descent Ascent (SGDA) have emerged as preeminent algorithms. Constant step-size versions of SEG/SGDA have gained popularity due to several appealing benefits, but their convergence behaviors are complicated even in rudimentary bilinear models. Our work elucidates the probabilistic behavior of these algorithms and their projected variants, for a wide range of monotone and non-monotone VIPs with potentially biased stochastic oracles. By recasting them as time-homogeneous Markov Chains, we establish geometric convergence to a unique invariant distribution and aymptotic normality. Specializing to min-max optimization, we characterize the relationship between the step-size and the induced bias with respect to the global solution, which in turns allows for bias refinement via the Richardson-Romberg scheme. Our theoretical analysis is corroborated by numerical experiments.

**Weight-Sharing Regularization**

Mehran Shakerinava · Motahareh Sohrabi · Siamak Ravanbakhsh · Simon Lacoste-Julien

Weight-sharing is ubiquitous in deep learning. Motivated by this, we propose a ``weight-sharing regularization'' penalty on the weights $w \in \mathbb{R}^d$ of a neural network, defined as $\mathcal{R}(w) = \frac{1}{d - 1}\sum_{i > j}^d |w_i - w_j|$. We study the proximal mapping of $\mathcal{R}$ and provide an intuitive interpretation of it in terms of a physical system of interacting particles. We also parallelize existing algorithms for $\mathrm{prox}_{\mathcal{R}}$ (to run on GPU) and find that one of them is fast in practice but slow ($O(d)$) for worst-case inputs. Using the physical interpretation, we design a novel parallel algorithm which runs in $O(\log^3 d)$ when sufficient processors are available, thus guaranteeing fast training. Our experiments reveal that weight-sharing regularization enables fully connected networks to learn convolution-like filters even when pixels have been shuffled while convolutional neural networks fail in this setting. Our code is available on \href{https://github.com/motahareh-sohrabi/weight-sharing-regularization}{github}.

**First Passage Percolation with Queried Hints**

Kritkorn Karntikoon · Yiheng Shen · Sreenivas Gollapudi · Kostas Kollias · Aaron Schild · Ali Sinop

Solving optimization problems leads to elegant and practical solutions in a wide variety of real-world applications. In many of those real-world applications, some of the information required to specify the relevant optimization problem is noisy, uncertain, and expensive to obtain. In this work, we study how much of that information needs to be queried in order to obtain an approximately optimal solution to the relevant problem. In particular, we focus on the shortest path problem in graphs with dynamic edge costs. We adopt the {\em first passage percolation} model from probability theory wherein a graph $G'$ is derived from a weighted base graph $G$ by multiplying each edge weight by an independently chosen, random number in $[1, \rho]$. Mathematicians have studied this model extensively when $G$ is a $d$-dimensional grid graph, but the behavior of shortest paths in this model is still poorly understood in general graphs. We make progress in this direction for a class of graphs that resemble real-world road networks. Specifically, we prove that if $G$ has a constant continuous doubling dimension, then for a given $s-t$ pair, we only need to probe the weights on $((\rho \log n )/ \epsilon)^{O(1)}$ edges in $G'$ in order to obtain a $(1 + \epsilon)$-approximation to the $s-t$ distance in $G'$. We also generalize the result to a correlated setting and demonstrate experimentally that probing improves accuracy in estimating $s-t$ distances.

**Krylov Cubic Regularized Newton: A Subspace Second-Order Method with Dimension-Free Convergence Rate**

Ruichen Jiang · Parameswaran Raman · Shoham Sabach · Aryan Mokhtari · Mingyi Hong · Volkan Cevher

Second-order optimization methods, such as cubic regularized Newton methods, are known for their rapid convergence rates; nevertheless, they become impractical in high-dimensional problems due to their substantial memory requirements and computational costs. One promising approach is to execute second-order updates within a lower-dimensional subspace, giving rise to subspace second-order methods. However, the majority of existing subspace second-order methods randomly select subspaces, consequently resulting in slower convergence rates depending on the problem's dimension $d$. In this paper, we introduce a novel subspace cubic regularized Newton method that achieves a dimension-independent global convergence rate of $\mathcal{O}\left(\frac{1}{mk}+\frac{1}{k^2}\right)$ for solving convex optimization problems. Here, $m$ represents the subspace dimension, which can be significantly smaller than $d$. Instead of adopting a random subspace, our primary innovation involves performing the cubic regularized Newton update within the \emph{Krylov subspace} associated with the Hessian and the gradient of the objective function. This result marks the first instance of a dimension-independent convergence rate for a subspace second-order method. Furthermore, when specific spectral conditions of the Hessian are met, our method recovers the convergence rate of a full-dimensional cubic regularized Newton method. Numerical experiments show our method converges faster than existing random subspace methods, especially for high-dimensional problems.

**Adaptive Federated Minimax Optimization with Lower Complexities**

Feihu Huang · Xinrui Wang · Junyi Li · Songcan Chen

Federated learning is a popular distributed and privacy-preserving learning paradigm in machine learning. Recently, some federated learning algorithms have been proposed to solve the distributed minimax problems. However, these federated minimax algorithms still suffer from high gradient or communication complexity. Meanwhile, few algorithm focuses on using adaptive learning rate to accelerate these algorithms. To fill this gap, in the paper, we study a class of nonconvex minimax optimization, and propose an efficient adaptive federated minimax optimization algorithm (i.e., AdaFGDA) to solve these distributed minimax problems. Specifically, our AdaFGDA builds on the momentum-based variance reduced and local-SGD techniques, and it can flexibly incorporate various adaptive learning rates by using the unified adaptive matrices. Theoretically, we provide a solid convergence analysis framework for our AdaFGDA algorithm under non-i.i.d. setting. Moreover, we prove our AdaFGDA algorithm obtains a lower gradient (i.e., stochastic first-order oracle, SFO) complexity of $\tilde{O}(\epsilon^{-3})$ with lower communication complexity of $\tilde{O}(\epsilon^{-2})$ in finding $\epsilon$-stationary point of the nonconvex minimax problems. Experimentally, we conduct some experiments on the deep AUC maximization and robust neural network training tasks to verify efficiency of our algorithms.

**Adaptive Batch Sizes for Active Learning: A Probabilistic Numerics Approach**

Masaki Adachi · Satoshi Hayakawa · Martin Jørgensen · Xingchen Wan · Vu Nguyen · Harald Oberhauser · Michael A. Osborne

Active learning parallelization is widely used, but typically relies on fixing the batch size throughout experimentation. This fixed approach is inefficient because of a dynamic trade-off between cost and speed---larger batches are more costly, smaller batches lead to slower wall-clock run-times---and the trade-off may change over the run (larger batches are often preferable earlier). To address this trade-off, we propose a novel Probabilistic Numerics framework that adaptively changes batch sizes. By framing batch selection as a quadrature task, our integration-error-aware algorithm facilitates the automatic tuning of batch sizes to meet predefined quadrature precision objectives, akin to how typical optimizers terminate based on convergence thresholds. This approach obviates the necessity for exhaustive searches across all potential batch sizes. We also extend this to scenarios with constrained active learning and constrained optimization, interpreting constraint violations as reductions in the precision requirement, to subsequently adapt batch construction. Through extensive experiments, we demonstrate that our approach significantly enhances learning efficiency and flexibility in diverse Bayesian batch active learning and Bayesian optimization applications.

**Sharpened Lazy Incremental Quasi-Newton Method**

Aakash Sunil Lahoti · Spandan Senapati · Ketan Rajawat · Alec Koppel

The problem of minimizing the sum of $n$ functions in $d$ dimensions is ubiquitous in machine learning and statistics.In many applications where the number of observations $n$ is large, it is necessary to use incremental or stochastic methods, as their per-iteration cost is independent of $n$.Of these, Quasi-Newton (QN) methods strike a balance between the per-iteration cost and the convergence rate.Specifically, they exhibit a superlinear rate with $O(d^2)$ cost in contrast to the linear rate of first-order methods with $O(d)$ cost and the quadratic rate of second-order methods with $O(d^3)$ cost. However, existing incremental methods have notable shortcomings: Incremental Quasi-Newton (IQN) only exhibits asymptotic superlinear convergence.In contrast, Incremental Greedy BFGS (IGS) offers explicit superlinear convergence but suffers from poor empirical performance and has a per-iteration cost of $O(d^3)$.To address these issues, we introduce the Sharpened Lazy Incremental Quasi-Newton Method (SLIQN) that achieves the best of both worlds: an explicit superlinear convergence rate, and superior empirical performance at a per-iteration $O(d^2)$ cost.SLIQN features two key changes: first, it incorporates a hybrid strategy of using both classic and greedy BFGS updates, allowing it to empirically outperform both IQN and IGS.Second, it employs a clever constant multiplicative factor along with a lazy propagation strategy, which enables it to have a cost of $O(d^2)$.Additionally, our experiments demonstrate the superiority of SLIQN over other incremental and stochastic Quasi-Newton variants and establish its competitiveness with second-order incremental methods.

**SDEs for Minimax Optimization**

Enea Monzio Compagnoni · Antonio Orvieto · Hans Kersting · Frank Proske · Aurelien Lucchi

Minimax optimization problems have attracted a lot of attention over the past few years, with applications ranging from economics to machine learning. While advanced optimization methods exist for such problems, characterizing their dynamics in stochastic scenarios remains notably challenging. In this paper, we pioneer the use of stochastic differential equations (SDEs) to analyze and compare Minimax optimizers. Our SDE models for Stochastic Gradient Descent-Ascent, Stochastic Extragradient, and Stochastic Hamiltonian Gradient Descent are provable approximations of their algorithmic counterparts, clearly showcasing the interplay between hyperparameters, implicit regularization, and implicit curvature-induced noise. This perspective also allows for a unified and simplified analysis strategy based on the principles of Itô calculus. Finally, our approach facilitates the derivation of convergence conditions and closed-form solutions for the dynamics in simplified settings, unveiling further insights into the behavior of different optimizers.

**Parameter-Agnostic Optimization under Relaxed Smoothness**

Florian Hübler · Junchi Yang · Xiang Li · Niao He

Tuning hyperparameters, such as the stepsize, presents a major challenge of training machine learning models. To address this challenge, numerous adaptive optimization algorithms have been developed that achieve near-optimal complexities, even when stepsizes are independent of problem-specific parameters, provided that the loss function is $L$-smooth. However, as the assumption is relaxed to the more realistic $(L_0, L_1)$-smoothness, all existing convergence results still necessitate tuning of the stepsize. In this study, we demonstrate that Normalized Stochastic Gradient Descent with Momentum (NSGD-M) can achieve a (nearly) rate-optimal complexity without prior knowledge of any problem parameter, though this comes at the cost of introducing an exponential term dependent on $L_1$ in the complexity. We further establish that this exponential term is inevitable to such schemes by introducing a theoretical framework of lower bounds tailored explicitly for parameter-agnostic algorithms. Interestingly, in deterministic settings, the exponential factor can be neutralized by employing Gradient Descent with a Backtracking Line Search. To the best of our knowledge, these findings represent the first parameter-agnostic convergence results under the generalized smoothness condition. Our empirical experiments further confirm our theoretical insights.

**Stochastic Frank-Wolfe: Unified Analysis and Zoo of Special Cases**

Ruslan Nazykov · Aleksandr Shestakov · Vladimir Solodkin · Aleksandr Beznosikov · Gauthier Gidel · Alexander Gasnikov

The Conditional Gradient (or Frank-Wolfe) method is one of the most well-known methods for solving constrained optimization problems appearing in various machine learning tasks. The simplicity of iteration and applicability to many practical problems helped the method to gain popularity in the community. In recent years, the Frank-Wolfe algorithm received many different extensions, including stochastic modifications with variance reduction and coordinate sampling for training of huge models or distributed variants for big data problems. In this paper, we present a unified convergence analysis of the Stochastic Frank-Wolfe method that covers a large number of particular practical cases that may have completely different nature of stochasticity, intuitions and application areas. Our analysis is based on a key parametric assumption on the variance of the stochastic gradients. But unlike most works on unified analysis of other methods, such as SGD, we do not assume an unbiasedness of the real gradient estimation. We conduct analysis for convex and non-convex problems due to the popularity of both cases in machine learning. With this general theoretical framework, we not only cover rates of many known methods, but also develop numerous new methods. This shows the flexibility of our approach in developing new algorithms based on the Conditional Gradient approach. We also demonstrate the properties of the new methods through numerical experiments.

**The Galerkin method beats Graph-Based Approaches for Spectral Algorithms**

Vivien Cabannes · Francis Bach

Historically, the machine learning community has derived spectral decompositions from graph-based approaches. We break with this approach and prove the statistical and computational superiority of the Galerkin method, which consists in restricting the study to a small set of test functions. In particular, we introduce implementation tricks to deal with differential operators in large dimensions with structured kernels. Finally, we extend on the core principles beyond our approach to apply them to non-linear spaces of functions, such as the ones parameterized by deep neural networks, through loss-based optimization procedures.

Estimating the structure of directed acyclic graphs (DAGs) from observational data remains a significant challenge in machine learning. Most research in this area concentrates on learning a single DAG for the entire population. This paper considers an alternative setting where the graph structure varies across individuals based on available "contextual" features. We tackle this contextual DAG problem via a neural network that maps the contextual features to a DAG, represented as a weighted adjacency matrix. The neural network is equipped with a novel projection layer that ensures the output matrices are sparse and satisfy a recently developed characterization of acyclicity. We devise a scalable computational framework for learning contextual DAGs and provide a convergence guarantee and an analytical gradient for backpropagating through the projection layer. Our experiments suggest that the new approach can recover the true context-specific graph where existing approaches fail.

Graphical models use graphs to represent conditional independence structure in the distribution of a random vector. In stochastic processes, graphs may represent so-called local independence or conditional Granger causality. Under some regularity conditions, a local independence graph implies a set of independences using a graphical criterion known as delta-separation, or using its generalization, mu-separation. This is a stochastic process analogue of d-separation in DAGs. However, there may be more independences than implied by this graph and this is a violation of so-called faithfulness. We characterize faithfulness in local independence graphs and give a method to construct a faithful graph from any local independence model such that the output equals the true graph when Markov and faithfulness assumptions hold. We discuss various assumptions that are weaker than faithfulness, and we explore different structure learning algorithms and their properties under varying assumptions.

**On the connection between Noise-Contrastive Estimation and Contrastive Divergence**

Amanda Olmin · Jakob Lindqvist · Lennart Svensson · Fredrik Lindsten

Noise-contrastive estimation (NCE) is a popular method for estimating unnormalised probabilistic models, such as energy-based models, which are effective for modelling complex data distributions. Unlike classical maximum likelihood (ML) estimation that relies on importance sampling (resulting in ML-IS) or MCMC (resulting in contrastive divergence, CD), NCE uses a proxy criterion to avoid the need for evaluating an often intractable normalisation constant. Despite apparent conceptual differences, we show that two NCE criteria, ranking NCE (RNCE) and conditional NCE (CNCE), can be viewed as ML estimation methods. Specifically, RNCE is equivalent to ML estimation combined with conditional importance sampling, and both RNCE and CNCE are special cases of CD. These findings bridge the gap between the two method classes and allow us to apply techniques from the ML-IS and CD literature to NCE, offering several advantageous extensions.

**A Greedy Approximation for k-Determinantal Point Processes**

Julia Grosse · Rahel Fischer · Roman Garnett · Philipp Hennig

Determinantal point processes (DPPs) are an important concept in random matrix theory and combinatorics, and increasingly in machine learning. Samples from these processes exhibit a form of self-avoidance, so they are also helpful in guiding algorithms that explore to reduce uncertainty, such as in active learning, Bayesian optimization, reinforcement learning, and marginalization in graphical models. The best-known algorithms for sampling from DPPs exactly require significant computational expense, which can be unwelcome in machine learning applicationswhen the cost of sampling is relatively low and capturing the precise repulsive nature of the DPP may not be critical. We suggest an inexpensive approximate strategy for sampling a fixed number of points (as would typically be desired in a machine learning setting) from a so-called $k$-DPP based on iterative inverse transform sampling. We prove that our algorithm satisfies a $(1 - 1/\epsilon)$ approximationguarantee relative to exact sampling from the $k$-DPP, and provide an efficient implementation for many common kernels used in machine learning, including the Gaussian and Mat\'ern class. Finally, we compare the empirical runtime of our method to exact and Markov-Chain-Monte-Carlo (MCMC) samplers and investigate the approximation quality in a Bayesian Quadrature (BQ) setting.

**Approximate Control for Continuous-Time POMDPs**

Yannick Eich · Bastian Alt · Heinz Koeppl

This work proposes a decision-making framework for partially observable systems in continuous time with discrete state and action spaces. As optimal decision-making becomes intractable for large state spaces we employ approximation methods for the filtering and the control problem that scale well with an increasing number of states. Specifically, we approximate the high-dimensional filtering distribution by projecting it onto a parametric family of distributions, and integrate it into a control heuristic based on the fully observable system to obtain a scalable policy. We demonstrate the effectiveness of our approach on several partially observed systems, including queueing systems and chemical reaction networks.

**Policy Evaluation for Reinforcement Learning from Human Feedback: A Sample Complexity Analysis**

Zihao Li · Xiang Ji · Minshuo Chen · Mengdi Wang

A recently popular approach to solving reinforcement learning is with data from human preferences. In fact, human preference data are now used with classic reinforcement learning algorithms such as actor-critic methods, which involve evaluating an intermediate policy over a reward learned from human preference data with distribution shift, known as off-policy evaluation (OPE). Such algorithm includes (i) learning reward function from human preference dataset, and (ii) learning expected cumulative reward of a target policy. Despite the huge empirical success, existing OPE methods with preference data often lack theoretical understandings and rely heavily on heuristics. In this paper, we study the sample efficiency of OPE with human preference and establish a statistical guarantee for it. Specifically, we approach OPE with learning the value function by fitted-Q-evaluation with a deep neural network. By appropriately selecting the size of a ReLU network, we show that one can leverage any low-dimensional manifold structure in the Markov decision process and obtain a sample-efficient estimator without suffering from the curse of high data ambient dimensionality. Under the assumption of high reward smoothness, our results almost align with the classical OPE results with observable reward data. To the best of our knowledge, this is the first result that establishes a provably efficient guarantee for off-policy evaluation with RLHF.

**Symmetric Equilibrium Learning of VAEs**

Boris Flach · Dmitrij Schlesinger · Alexander Shekhovtsov

We view variational autoencoders (VAE) as decoder-encoder pairs, which map distributions in the data space to distributions in the latent space and vice versa. The standard learning approach for VAEs is the maximisation of the evidence lower bound (ELBO). It is asymmetric in that it aims at learning a latent variable model while using the encoder as an auxiliary means only. Moreover, it requires a closed form a-priori latent distribution. This limits its applicability in more complex scenarios, such as general semi-supervised learning and employing complex generative models as priors. We propose a Nash equilibrium learning approach, which is symmetric with respect to the encoder and decoder and allows learning VAEs in situations where both the data and the latent distributions are accessible only by sampling. The flexibility and simplicity of this approach allows its application to a wide range of learning scenarios and downstream tasks.

**On Feynman–Kac training of partial Bayesian neural networks**

Zheng Zhao · Sebastian Mair · Thomas Schön · Jens Sjölund

Recently, partial Bayesian neural networks (pBNNs), which only consider a subset of the parameters to be stochastic, were shown to perform competitively with full Bayesian neural networks. However, pBNNs are often multi-modal in the latent variable space and thus challenging to approximate with parametric models. To address this problem, we propose an efficient sampling-based training strategy, wherein the training of a pBNN is formulated as simulating a Feynman–Kac model. We then describe variations of sequential Monte Carlo samplers that allow us to simultaneously estimate the parameters and the latent posterior distribution of this model at a tractable computational cost. Using various synthetic and real-world datasets we show that our proposed training scheme outperforms the state of the art in terms of predictive performance.

**No-Regret Algorithms for Safe Bayesian Optimization with Monotonicity Constraints**

Arpan Losalka · Jonathan Scarlett

We consider the problem of sequentially maximizing an unknown function $f$ over a set of actions of the form $(s, x)$, where the selected actions must satisfy a safety constraint with respect to an unknown safety function $g$. We model $f$ and $g$ as lying in a reproducing kernel Hilbert space (RKHS), which facilitates the use of Gaussian process methods. While existing works for this setting have provided algorithms that are guaranteed to identify a near-optimal safe action, the problem of attaining low cumulative regret has remained largely unexplored, with a key challenge being that expanding the safe region can incur high regret. To address this challenge, we show that if $g$ is monotone with respect to just the single variable $s$ (with no such constraint on $f$), sublinear regret becomes achievable with our proposed algorithm. In addition, we show that a modified version of our algorithm is able to attain sublinear regret (for suitably defined notions of regret) for the task of finding a near-optimal $s$ corresponding to every $x$, as opposed to only finding the global safe optimum. Our findings are supported with empirical evaluations on various objective and safety functions.

**Variational Resampling**

Oskar Kviman · Nicola Branchini · Victor Elvira · Jens Lagergren

We cast the resampling step in particle filters (PFs) as a variational inference problem, resulting in a new class of resampling schemes: variational resampling. Variational resampling is flexible as it allows for choices of 1) divergence to minimize, 2) target distribution to input to the divergence, and 3) divergence minimization algorithm. With this novel application of VI to particle filters, variational resampling further unifies these two powerful and popular methodologies. We construct two variational resamplers that replicate particles in order to maximize lower bounds with respect to two different target measures. We benchmark our variational resamplers on challenging smoothing tasks, outperforming PFs that implement the state-of-the-art resampling schemes.

**Score Operator Newton transport**

Nisha Chandramoorthy · Florian Schaefer · Youssef Marzouk

We propose a new approach for sampling and Bayesian computation that uses the score of the target distribution to construct a transport from a given reference distribution to the target. Our approach is an infinite-dimensional Newton method, involving an elliptic PDE, for finding a zero of a ``score-residual'' operator. We prove sufficient conditions for convergence to a valid transport map. Our Newton iterates can be computed by exploiting fast solvers for elliptic PDEs, resulting in new algorithms for Bayesian inference and other sampling tasks. We identify elementary settings where score-operator Newton transport achieves fast convergence while avoiding mode collapse.

**Posterior Uncertainty Quantification in Neural Networks using Data Augmentation**

Luhuan Wu · Sinead Williamson

In this paper, we approach the problem of uncertainty quantification in deep learning through a predictive framework, which captures uncertainty in model parameters by specifying our assumptions about the predictive distribution of unseen future data. Under this view, we show that deep ensembling (Lakshminarayanan et al., 2017) is a fundamentally mis-specified model class, since it assumes that future data are supported on existing observations only---a situation rarely encountered in practice. To address this limitation, we propose MixupMP, a method that constructs a more realistic predictive distribution using popular data augmentation techniques. MixupMP operates as a drop-in replacement for deep ensembles, where each ensemble member is trained on a random simulation from this predictive distribution. Grounded in the recently-proposed framework of Martingale posteriors (Fong et al., 2023), MixupMP returns samples from an implicitly defined Bayesian posterior. Our empirical analysis showcases that MixupMP achieves superior predictive per- formance and uncertainty quantification on various image classification datasets, when compared with existing Bayesian and non-Bayesian approaches.

**Data-Driven Confidence Intervals with Optimal Rates for the Mean of Heavy-Tailed Distributions**

Ambrus Tamás · Szabolcs Szentpéteri · Balázs Csanád Csáji

Estimating the expected value is one of the key problems of statistics, and it serves as a backbone for countless methods in machine learning. In this paper we propose a new algorithm to build non-asymptotically exact confidence intervals for the mean of a symmetric distribution based on an independent, identically distributed sample. The method combines resampling with median-of-means estimates to ensure optimal subgaussian bounds for the sizes of the confidence intervals under mild, heavy-tailed moment conditions. The scheme is completely data-driven: the construction does not need any information about the moments, yet it manages to build exact confidence regions which shrink at the optimal rate. We also show how to generalize the approach to higher dimensions and prove dimension-free, subgaussian PAC bounds for the exclusion probabilities of false candidates. Finally, we illustrate the method and its properties for heavy-tailed distributions with numerical experiments.

**Trigonometric Quadrature Fourier Features for Scalable Gaussian Process Regression**

Kevin Li · Max Balakirsky · Simon Mak

Fourier feature approximations have been successfully applied in the literature for scalable Gaussian Process (GP) regression. In particular, Quadrature Fourier Features (QFF) derived from Gaussian quadrature rules have gained popularity in recent years due to their improved approximation accuracy and better calibrated uncertainty estimates compared to Random Fourier Feature (RFF) methods. However, a key limitation of QFF is that its performance can suffer from well-known pathologies related to highly oscillatory quadrature, resulting in mediocre approximation with limited features. We address this critical issue via a new Trigonometric Quadrature Fourier Feature (TQFF) method, which uses a novel non-Gaussian quadrature rule specifically tailored for the desired Fourier transform. We derive an exact quadrature rule for TQFF, along with kernel approximation error bounds for the resulting feature map. We then demonstrate the improved performance of our method over RFF and Gaussian QFF in a suite of numerical experiments and applications, and show the TQFF enjoys accurate GP approximations over a broad range of length-scales using fewer features.

**Cylindrical Thompson Sampling for High-Dimensional Bayesian Optimization**

Bahador Rashidi · Kerrick Johnstonbaugh · Chao Gao

Many industrial and scientific applications require optimization of one or more objectives by tuning dozens or hundreds of input parameters. While Bayesian optimization has been a popular approach for the efficient optimization of blackbox functions, its performance decreases drastically as the dimensionality of the search space increases (i.e., above twenty). Recent advancements in high-dimensional Bayesian optimization (HDBO) seek to mitigate this issue through techniques such as adaptive local search with trust regions or dimensionality reduction using random embeddings. In this paper, we provide a close examination of these advancements and show that sampling strategy plays a prominent role and is key to tackling the curse-of-dimensionality. We then propose cylindrical Thompson sampling (CTS), a novel strategy that can be integrated into single- and multi-objective HDBO algorithms. We demonstrate this by integrating CTS as a modular component in state-of-the-art HDBO algorithms. We verify the effectiveness of CTS on both synthetic and real-world high-dimensional problems, and show that CTS largely enhances existing HDBO methods.

**Benefits of Non-Linear Scale Parameterizations in Black Box Variational Inference through Smoothness Results and Gradient Variance Bounds**

Alexandra Hotti · Lennart Van der Goten · Jens Lagergren

Black box variational inference has consistently produced impressive empirical results. Convergence guarantees require that the variational objective exhibits specific structural properties and that the noise of the gradient estimator can be controlled. In this work we study the smoothness and the variance of the gradient estimator for location-scale variational families with non-linear covariance parameterizations. Specifically, we derive novel theoretical results for the popular exponential covariance parameterization and tighter gradient variance bounds for the softplus parameterization. These results reveal the benefits of using non-linear scale parameterizations on large scale datasets. With a non-linear scale parameterization, the smoothness constant of the variational objective and the upper bound on the gradient variance decrease as the scale parameter becomes smaller. Learning posterior approximations with small scales is essential in Bayesian statistics with sufficient amount of data, since under appropriate assumptions, the posterior distribution is known to contract around the parameter of interest as the sample size increases. We validate our theoretical findings through empirical analysis on several large-scale datasets, underscoring the importance of non-linear parameterizations.

We consider the problem of identifying a conditional causal effect through covariate adjustment. We focus on the setting where the causal graph is known up to one of two types of graphs: a maximally oriented partially directed acyclic graph (MPDAG) or a partial ancestral graph (PAG). Both MPDAGs and PAGs represent equivalence classes of possible underlying causal models. After defining adjustment sets in this setting, we provide a necessary and sufficient graphical criterion -- the conditional adjustment criterion -- for finding these sets under conditioning on variables unaffected by treatment. We further provide explicit sets from the graph that satisfy the conditional adjustment criterion, and therefore, can be used as adjustment sets for conditional causal effect identification.

**Equivalence Testing: The Power of Bounded Adaptivity**

Diptarka Chakraborty · Sourav Chakraborty · GUNJAN KUMAR · Kuldeep Meel

Equivalence testing, a fundamental problem in the field of distribution testing,seeks to infer if two unknown distributions on $[n]$ are the same or far apart in thetotal variation distance. Conditional sampling has emerged as a powerful querymodel and has been investigated by theoreticians and practitioners alike, leading to the design of optimal algorithms albeit in a sequential setting (also referred to as adaptive tester). Given the profound impact of parallel computing over the past decades, there has been a strong desire to design algorithms that enable high parallelization. Despite significant algorithmic advancements over the last decade, parallelizable techniques (also termed non-adaptive testers) have $\tilde{O}(\log^{12}n)$ query complexity, a prohibitively large complexity to be of practical usage. Therefore, the primary challenge is whether it is possible to design algorithms that enable high parallelization while achieving efficient query complexity.Our work provides an affirmative answer to the aforementioned challenge: we present a highly parallelizable tester with a query complexity of $\tilde{O}(\log n)$, achieved through a single round of adaptivity, marking a significant stride towardsharmonizing parallelizability and efficiency in equivalence testing.

**Optimal estimation of Gaussian (poly)trees**

Yuhao WANG · Ming Gao · Wai Ming Tai · Bryon Aragam · Arnab Bhattacharyya

We develop optimal algorithms for learning undirected Gaussian trees and directed Gaussian polytrees from data. We consider both problems of distribution learning (i.e. in KL distance) and structure learning (i.e. exact recovery). The first approach is based on the Chow-Liu algorithm, and learns an optimal tree-structured distribution efficiently. The second approach is a modification of the PC algorithm for polytrees that uses partial correlation as a conditional independence tester for constraint-based structure learning. We derive explicit finite-sample guarantees for both approaches, and show that both approaches are optimal by deriving matching lower bounds. Additionally, we conduct numerical experiments to compare the performance of various algorithms, providing further insights and empirical evidence.

**Inconsistency of Cross-Validation for Structure Learning in Gaussian Graphical Models**

Zhao Lyu · Wai Ming Tai · Mladen Kolar · Bryon Aragam

Despite numerous years of research into the merits and trade-offs of various model selection criteria, obtaining robust results that elucidate the behavior of cross-validation remains a challenging endeavor. In this paper, we highlight the inherent limitations of cross-validation when employed to discern the structure of a Gaussian graphical model. We provide finite-sample bounds on the probability that the Lasso estimator for the neighborhood of a node within a Gaussian graphical model, optimized using a prediction oracle, misidentifies the neighborhood. Our results pertain to both undirected and directed acyclic graphs, encompassing general, sparse covariance structures. To support our theoretical findings, we conduct an empirical investigation of this inconsistency by contrasting our outcomes with other commonly used information criteria through an extensive simulation study. Given that many algorithms designed to learn the structure of graphical models require hyperparameter selection, the precise calibration of this hyperparameter is paramount for accurately estimating the inherent structure. Consequently, our observations shed light on this widely recognized practical challenge.

**Hodge-Compositional Edge Gaussian Processes**

Maosheng Yang · Slava Borovitskiy · Elvin Isufi

We propose principled Gaussian processes (GPs) for modeling functions defined over the edge set of a simplicial 2-complex, a structure similar to a graph in which edges may form triangular faces. This approach is intended for learning flow-type data on networks where edge flows can be characterized by the discrete divergence and curl. Drawing upon the Hodge decomposition, we first develop classes of divergence-free and curl-free edge GPs, suitable for various applications. We then combine them to create Hodge-compositional edge GPs that are expressive enough to represent any edge function. These GPs facilitate direct and independent learning for the different Hodge components of edge functions, enabling us to capture their relevance during hyperparameter optimization. To highlight their practical potential, we apply them for flow data inference in currency exchange, ocean currents and water supply networks, comparing them to alternative models.

Cyclical MCMC is a novel MCMC framework recently proposed by Zhang et al. (2019) to address the challenge posed by high- dimensional multimodal posterior distributions like those arising in deep learning. The algorithm works by generating a nonhomogeneous Markov chain that tracks – cyclically in time – tempered versions of the target distribution. We show in this work that cyclical MCMC converges to the desired probability distribution in settings where the Markov kernels used are fast mixing, and sufficiently long cycles are employed. However in the far more common settings of slow mixing kernels, the algorithm may fail to produce samples from the desired distribution. In particular, in a simple mixture example with unequal variance we show by simulation that cyclical MCMC fails to converge to the desired limit. Finally, we show that cyclical MCMC typically estimates well the local shape of the target distribution around each mode, even when we do not have convergence to the target.

**Adaptive importance sampling for heavy-tailed distributions via $\alpha$-divergence minimization**

Thomas Guilmeau · Nicola Branchini · Emilie Chouzenoux · Victor Elvira

Adaptive importance sampling (AIS) algorithms are widely used to approximate expectations with respect to complicated target probability distributions. When the target has heavy tails, existing AIS algorithms can provide inconsistent estimators or exhibit slow convergence, as they often neglect the target’s tail behaviour. To avoid this pitfall, we propose an AIS algorithm that approximates the target by Student-t proposal distributions. We adapt location and scale parameters by matching the escort moments - which are defined even for heavy-tailed distributions - of the target and proposal. These updates minimize the $\alpha$-divergence between the target and the proposal, thereby connecting with variational inference. We then show that the $\alpha$-divergence can be approximated by a generalized notion of effective sample size and leverage this new perspective to adapt the tail parameter with Bayesian optimization. We demonstrate the efficacy of our approach through applications to synthetic targets and a Bayesian Student-t regression task on a real example with clinical trial data.

We develop a family of fast variational methods for sequential control in dynamic settings where an agent is incentivized to maximize information gain. We consider the case of optimal control in continuous nonlinear dynamical systems that prohibit exact evaluation of the mutual information (MI) reward. Our approach couples efficient message-passing inference with variational bounds on the MI objective under Gaussian projections. We also develop a Gaussian mixture approximation that enables exact MI evaluation under constraints on the component covariances. We validate our methodology in nonlinear systems with superior and faster control compared to standard particle-based methods. We show our approach improves the accuracy and efficiency of one-shot robotic learning with intrinsic MI rewards. Furthermore, we demonstrate that our method is applicable to a wider range of contexts, e.g., the active information acquisition problem.

**An Analytic Solution to Covariance Propagation in Neural Networks**

Oren Wright · Yorie Nakahira · José M. F. Moura

Uncertainty quantification of neural networks is critical to measuring the reliability and robustness of deep learning systems. However, this often involves costly or inaccurate sampling methods and approximations. This paper presents a sample-free moment propagation technique that propagates mean vectors and covariance matrices across a network to accurately characterize the input-output distributions of neural networks. A key enabler of our technique is an analytic solution for the covariance of random variables passed through nonlinear activation functions, such as Heaviside, ReLU, and GELU. The wide applicability and merits of the proposed technique are shown in experiments analyzing the input-output distributions of trained neural networks and training Bayesian neural networks.

**Think Global, Adapt Local: Learning Locally Adaptive K-Nearest Neighbor Kernel Density Estimators**

Kenny Olsen · Rasmus Hoeegh Lindrup · Morten Mørup

Kernel density estimation (KDE) is a powerful technique for non-parametric density estimation, yet practical use of KDE-based methods remains limited by insufficient representational flexibility, especially for higher-dimensional data.Contrary to KDE, K-nearest neighbor (KNN) density estimation procedures locally adapt the density based on the K-nearest neighborhood, but unfortunately only provide asymptotically correct density estimates. We present the KNN-KDE method introducing observation-specific kernels for KDE that are locally adapted through priors defined by the covariance of the K-nearest neighborhood, forming a fully Bayesian model with exact density estimates. We further derive a scalable inference procedure that infers parameters through variational inference by optimizing the predictive likelihood exploiting sparsity, batched optimization, and parallel computation for massive inference speedups.We find that KNN-KDE provides valid density estimates superior to conventional KDE and KNN density estimation on both synthetic and real data sets. We further observe that the bayesian KNN-KDE even outperforms recent neural density estimation procedures on two of the five considered real data sets. The KNN-KDE unifies conventional kernel and KNN density estimation providing a scalable, generic and accurate framework for density estimation.

**Sinkhorn Flow as Mirror Flow: A Continuous-Time Framework for Generalizing the Sinkhorn Algorithm**

Mohammad Reza Karimi · Ya-Ping Hsieh · Andreas Krause

Many problems in machine learning can be formulated as solving entropy-regularized optimal transport on the space of probability measures. The canonical approach involves the Sinkhorn iterates, renowned for their rich mathematical properties. Recently, the Sinkhorn algorithm has been recast within the mirror descent framework, thus benefiting from classical optimization theory insights. Here, we build upon this result by introducing a continuous-time analogue of the Sinkhorn algorithm. This perspective allows us to derive novel variants of Sinkhorn schemes that are robust to noise and bias. Moreover, our continuous-time dynamics offers a unified perspective on several recently discovered dynamics in machine learning and mathematics, such as the "Wasserstein mirror flow" of Deb et al. (2023) or the "mean-field Schrödinger equation" of Claisse et al. (2023).

**Near-optimal Per-Action Regret Bounds for Sleeping Bandits**

Manh Quan Nguyen · Nishant Mehta

We derive near-optimal per-action regret bounds for sleeping bandits, in which both the sets of available arms and their losses in every round are chosen by an adversary. In a setting with $K$ total arms and at most $A$ available arms in each round over $T$ rounds, the best known upper bound is $O(K\sqrt{TA\ln{K}})$, obtained indirectly via minimizing internal sleeping regrets. Compared to the minimax $\Omega(\sqrt{TA})$ lower bound, this upper bound contains an extra multiplicative factor of $K\ln{K}$. We address this gap by directly minimizing the per-action regret using generalized versions of EXP3, EXP3-IX and FTRL with Tsallis entropy, thereby obtaining near-optimal bounds of order $O(\sqrt{TA\ln{K}})$ and $O(\sqrt{T\sqrt{AK}})$. We extend our results to the setting of bandits with advice from sleeping experts, generalizing EXP4 along the way. This leads to new proofs for a number of existing adaptive and tracking regret bounds for standard non-sleeping bandits. Extending our results to the bandit version of experts that report their confidences leads to new bounds for the confidence regret that depends primarily on the sum of experts' confidences. We prove a lower bound, showing that for any minimax optimal algorithms, there exists an action whose regret is sublinear in $T$ but linear in the number of its active rounds.

**Generative Flow Networks as Entropy-Regularized RL**

Daniil Tiapkin · Nikita Morozov · Alexey Naumov · Dmitry Vetrov

The recently proposed generative flow networks (GFlowNets) are a method of training a policy to sample compositional discrete objects with probabilities proportional to a given reward via a sequence of actions. GFlowNets exploit the sequential nature of the problem, drawing parallels with reinforcement learning (RL). Our work extends the connection between RL and GFlowNets to a general case. We demonstrate how the task of learning a generative flow network can be efficiently redefined as an entropy-regularized RL problem with a specific reward and regularizer structure. Furthermore, we illustrate the practical efficiency of this reformulation by applying standard soft RL algorithms to GFlowNet training across several probabilistic modeling tasks. Contrary to previously reported results, we show that entropic RL approaches can be competitive against established GFlowNet training methods. This perspective opens a direct path for integrating reinforcement learning principles into the realm of generative flow networks.

Simulation-based inference has been popular for amortized Bayesian computation. It is typical to have more than one posterior approximation, from different inference algorithms, different architectures, or simply the randomness of initialization and stochastic gradients. With a consistency guarantee, we present a general posterior stacking framework to make use of all available approximations. Our stacking method is able to combine densities, simulation draws, confidence intervals, and moments, and address the overall precision, calibration, coverage, and bias of the posterior approximation at the same time. We illustrate our method on several benchmark simulations and a challenging cosmological inference task.

**Sequential Monte Carlo for Inclusive KL Minimization in Amortized Variational Inference**

Declan McNamara · Jackson Loper · Jeffrey Regier

For training an encoder network to perform amortized variational inference, the Kullback-Leibler (KL) divergence from the exact posterior to its approximation, known as the inclusive or forward KL, is an increasingly popular choice of variational objective due to the mass-covering property of its minimizer. However, minimizing this objective is challenging. A popular existing approach, Reweighted Wake-Sleep (RWS), suffers from heavily biased gradients and a circular pathology that results in highly concentrated variational distributions. As an alternative, we propose SMC-Wake, a procedure for fitting an amortized variational approximation that uses likelihood-tempered sequential Monte Carlo samplers to estimate the gradient of the inclusive KL divergence. We propose three gradient estimators, all of which are asymptotically unbiased in the number of iterations and two of which are strongly consistent. Our method interleaves stochastic gradient updates, SMC samplers, and iterative improvement to an estimate of the normalizing constant to reduce bias from self-normalization. In experiments with both simulated and real datasets, SMC-Wake fits variational distributions that approximate the posterior more accurately than existing methods.

A Bayesian coreset is a small, weighted subset of data that replaces the full dataset during inference in order to reduce computational cost. However, state of the art methods for tuning coreset weights are expensive, require nontrivial user input, and impose constraints on the model. In this work, we propose a new method---coreset MCMC---that simulates a Markov chain targeting the coreset posterior, while simultaneously updating the coreset weights using those same draws. Coreset MCMC is simple to implement and tune, and can be used with any existing MCMC kernel. We analyze coreset MCMC in a representative setting to obtain key insights about the convergence behaviour of the method. Empirical results demonstrate that coreset MCMC provides higher quality posterior approximations and reduced computational cost compared with other coreset construction methods. Further, compared with other general subsampling MCMC methods, we find that coreset MCMC has a higher sampling efficiency with competitively accurate posterior approximations.

**Identifiability of Product of Experts Models**

Manav Kant · Eric Ma · Andrei Staicu · Leonard Schulman · Spencer Gordon

Product of experts (PoE) are layered networks in which the value at each node is an AND (or product) of the values (possibly negated) at its inputs. These were introduced as a neural network architecture that can efficiently learn to generate high-dimensional data which satisfy many low-dimensional constraints—thereby allowing each individual expert to perform a simple task. PoEs have found a variety of applications in learning. We study the problem of identifiability of a product of experts model having a layer of binary latent variables, and a layer of binary observables that are iid conditional on the latents. The previous best upper bound on the number of observables needed to identify the model was exponential in the number of parameters. We show: (a) When the latents are uniformly distributed, the model is identifiable with a number of observables equal to the number of parameters (and hence best possible). (b) In the more general case of arbitrarily distributed latents, the model is identifiable for a number of observables that is still linear in the number of parameters (and within a factor of two of best-possible). The proofs rely on root interlacing phenomena for some special three-term recurrences.

In numerical integration, Bayesian quadrature (BQ) excels at producing estimates with quantified uncertainties, particularly in sparse data settings. However, its computational scalability and kernel learning capabilities have lagged behind modern advances in Gaussian process research. To bridge this gap, we recast the BQ posterior integral as a convolution operation, which enables efficient computation via fast Fourier transform of low-rank matrices. We introduce two new methods enabled by recasting BQ as a convolution: fast Fourier Bayesian quadrature and sparse spectrum Bayesian quadrature. These methods enhance the computational scalability of BQ and expand kernel flexibility, enabling the use of \textit{any} stationary kernel in the BQ setting. We empirically validate the efficacy of our approach through a range of integration tasks, substantiating the benefits of the proposed methodology.

**autoMALA: Locally adaptive Metropolis-adjusted Langevin algorithm**

Miguel Biron-Lattes · Nikola Surjanovic · Saifuddin Syed · Trevor Campbell · Alexandre Bouchard-Côté

Selecting the step size for the Metropolis-adjusted Langevin algorithm (MALA) is necessary in order to obtain satisfactory performance. However, finding an adequate step size for an arbitrary target distribution can be a difficult task and even the best step size can perform poorly in specific regions of the space when the target distribution is sufficiently complex. To resolve this issue we introduce autoMALA, a new Markov chain Monte Carlo algorithm based on MALA that automatically sets its step size at each iteration based on the local geometry of the target distribution. We prove that autoMALA has the correct invariant distribution, despite continual automatic adjustments of the step size. Our experiments demonstrate that autoMALA is competitive with related state-of-the-art MCMC methods, in terms of the number of log density evaluations per effective sample, and it outperforms state-of-the-art samplers on targets with varying geometries. Furthermore, we find that autoMALA tends to find step sizes comparable to optimally-tuned MALA when a fixed step size suffices for the whole domain.

This paper introduces AL$\ell_0$CORE, a new form of probabilistic non-negative tensor decomposition.AL$\ell_0$CORE is a Tucker decomposition that constrains the number of non-zero elements (i.e., the $\ell_0$-norm) of the core tensor to be at most $Q$.While the user dictates the total budget $Q$, the locations and values of the non-zero elements are latent variables allocated across the core tensor during inference.AL$\ell_0$CORE---i.e., allocated $\ell_0$-constrained core---thus enjoys both the computational tractability of canonical polyadic (CP) decomposition and the qualitatively appealing latent structure of Tucker.In a suite of real-data experiments, we demonstrate that AL$\ell_0$CORE typically requires only tiny fractions (e.g., 1\%) of the core to achieve the same results as Tucker at a correspondingly small fraction of the cost.

**Warped Diffusion for Latent Differentiation Inference**

Masahiro Nakano · Hiroki Sakuma · Ryo Nishikimi · Ryohei Shibue · Takashi Sato · Tomoharu Iwata · Kunio Kashino

This paper proposes a Bayesian nonparametric diffusion model with a black-box warping function represented by a Gaussian process to infer potential diffusion structures latent in observed data, such as differentiation mechanisms of living cells and phylogenetic evolution processes of media information. In general, the task of inferring latent differentiation structures is very difficult to handle due to two interrelated settings. One is that the conversion mechanism between hidden structure and often higher dimensional observations is unknown (and is a complex mechanism). The other is that the topology of the hidden diffuse structure itself is unknown. Therefore, in this paper, we propose a BNP-based strategy as a natural way to deal with these two challenging settings simultaneously. Specifically, as an extension of the Gaussian process latent variable model, we propose a model in which the black box transformation from latent variable space to observed data space is represented by a Gaussian process, and introduce a BNP diffusion model for the latent variable space. We show its application to the visualization of the diffusion structure of media information and to the task of inferring cell differentiation structure from single-cell gene expression levels.

**Differentiable Rendering with Reparameterized Volume Sampling**

Nikita Morozov · Denis Rakitin · Oleg Desheulin · Dmitry Vetrov · Kirill Struminsky

In view synthesis, a neural radiance field approximates underlying density and radiance fields based on a sparse set of scene pictures. To generate a pixel of a novel view, it marches a ray through the pixel and computes a weighted sum of radiance emitted from a dense set of ray points. This rendering algorithm is fully differentiable and facilitates gradient-based optimization of the fields. However, in practice, only a tiny opaque portion of the ray contributes most of the radiance to the sum. We propose a simple end-to-end differentiable sampling algorithm based on inverse transform sampling. It generates samples according to the probability distribution induced by the density field and picks non-transparent points on the ray. We utilize the algorithm in two ways. First, we propose a novel rendering approach based on Monte Carlo estimates. This approach allows for evaluating and optimizing a neural radiance field with just a few radiance field calls per ray. Second, we use the sampling algorithm to modify the hierarchical scheme proposed in the original NeRF work. We show that our modification improves reconstruction quality of hierarchical models, at the same time simplifying the training procedure by removing the need for auxiliary proposal network losses.

**Electronic Medical Records Assisted Digital Clinical Trial Design**

Xinrui Ruan · Jingshen Wang · Yingfei Wang · Waverly Wei

Randomized controlled trials (RCTs) are gold standards for assessing intervention efficacy. Yet, generalizing evidence from classical RCTs can be challenging and sometimes problematic due to their limited external validity under stringent eligibility criteria and inadequate statistical power resulting from limited sample sizes under budgetary constraints. "Digital clinical trial," which utilizes digital technology and electronic medical records (EMRs) to expand eligibility criteria and enhance data collection efficiency, offers a promising concept for solving the above-mentioned conundrums encountered in classical RCTs. In this paper, we propose two novel digital clinical trial design strategies assisted by EMRs collected from diverse patient populations. On the one hand, leveraging digital technologies, our design strategies adaptively modify both the eligibility criteria and treatment assignment mechanism to enhance data collection efficiency. As a result, evidence gathered from our design can possess greater statistical power. On the other hand, since EMRs capture diverse patient populations and provide large sample sizes, our design not only broadens the trial's eligibility criteria but also enhances its statistical power, enabling us to collect more generalizable evidence with boosted statistical power for evaluating intervention efficacy than classical RCTs. We demonstrate the validity and merit of the proposed designs with detailed theoretical investigation, simulation studies, and a synthetic case study.

**Sample-efficient neural likelihood-free Bayesian inference of implicit HMMs**

Sanmitra Ghosh · Paul Birrell · Daniela De Angelis

Likelihood-free inference methods based on neural conditional density estimation were shown to drastically reduce the simulation burden in comparison to classical methods such as ABC. When applied in the context of any latent variable model, such as a Hidden Markov model (HMM), these methods are designed to only estimate the parameters, rather than the joint distribution of the parameters and the hidden states. Naive application of these methods to a HMM, ignoring the inference of this joint posterior distribution, will thus produce an inaccurate estimate of the posterior predictive distribution, in turn hampering the assessment of goodness-of-fit. To rectify this problem, we propose a novel, sample-efficient likelihood-free method for estimating the high-dimensional hidden states of an implicit HMM. Our approach relies on learning directly the intractable posterior distribution of the hidden states, using an autoregressive-flow, by exploiting the Markov property. Upon evaluating our approach on some implicit HMMs, we found that the quality of the estimates retrieved using our method is comparable to what can be achieved using a much more computationally expensive SMC algorithm.

**Conformalized Semi-supervised Random Forest for Classification and Abnormality Detection**

Yujin Han · · Leying Guan

The Random Forests classifier, a widely utilized off-the-shelf classification tool, assumes training and test samples come from the same distribution as other standard classifiers. However, in safety-critical scenarios like medical diagnosis and network attack detection, discrepancies between the training and test sets, including the potential presence of novel outlier samples not appearing during training, can pose significant challenges. To address this problem, we introduce the Conformalized Semi-Supervised Random Forest (CSForest), which couples the conformalization technique Jackknife+aB with semi-supervised tree ensembles to construct a set-valued prediction $C(x)$. Instead of optimizing over the training distribution, CSForest employs unlabeled test samples to enhance accuracy and flag unseen outliers by generating an empty set. Theoretically, we establish CSForest to cover true labels for previously observed inlier classes under arbitrarily label-shift in the test data. We compare CSForest with state-of-the-art methods using synthetic examples and various real-world datasets, under different types of distribution changes in the test domain. Our results highlight CSForest's effective prediction of inliers and its ability to detect outlier samples unique to the test data. In addition, CSForest shows persistently good performance as the sizes of the training and test sets vary. Codes of CSForest are available at https://github.com/yujinhan98/CSForest.

We consider the contextual bandit problem where at each time, the agent only has access to a noisy version of the context and the error variance (or an estimator of this variance). This setting is motivated by a wide range of applications where the true context for decision-making is unobserved, and only a prediction of the context by a potentially complex machine learning algorithm is available. When the context error is non-vanishing, classical bandit algorithms fail to achieve sublinear regret. We propose the first online algorithm in this setting with sublinear regret guarantees under mild conditions. The key idea is to extend the measurement error model in classical statistics to the online decision-making setting, which is nontrivial due to the policy being dependent on the noisy context observations. We further demonstrate the benefits of the proposed approach in simulation environments based on synthetic and real digital intervention datasets.

**Discriminant Distance-Aware Representation on Deterministic Uncertainty Quantification Methods**

Jiaxin Zhang · Kamalika Das · Sricharan Kumar

Uncertainty estimation is a crucial aspect of deploying dependable deep learning models in safety-critical systems. In this study, we introduce a novel and efficient method for deterministic uncertainty estimation called Discriminant Distance-Awareness Representation (DDAR). Our approach involves constructing a DNN model that incorporates a set of prototypes in its latent representations, enabling us to analyze valuable feature information from the input data. By leveraging a distinction maximization layer over optimal trainable prototypes, DDAR can learn a discriminant distance-awareness representation. We demonstrate that DDAR overcomes feature collapse by relaxing the Lipschitz constraint that hinders the practicality of deterministic uncertainty methods (DUMs) architectures. Our experiments show that DDAR is a flexible and architecture-agnostic method that can be easily integrated as a pluggable layer with distance-sensitive metrics, outperforming state-of-the-art uncertainty estimation methods on multiple benchmark problems.

**SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits**

Subhojyoti Mukherjee · Qiaomin Xie · Josiah Hanna · Robert Nowak

In this paper, we study the problem of optimal data collection for policy evaluation in linear bandits. In policy evaluation, we are given a \textit{target} policy and asked to estimate the expected reward it will obtain when executed in a multi-armed bandit environment. Our work is the first work that focuses on such an optimal data collection strategy for policy evaluation involving heteroscedastic reward noise in the linear bandit setting. We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting with the knowledge of noise variances. This design minimizes the mean squared error (MSE) of the estimated value of the target policy and is termed the oracle design. Since the noise variance is typically unknown, we then introduce a novel algorithm, SPEED\ (\textbf{S}tructured \textbf{P}olicy \textbf{E}valuation \textbf{E}xperimental \textbf{D}esign), that tracks the oracle design and derive its regret with respect to the oracle design. We show that regret scales as $\widetilde{O}_{}(d^3 n^{-3/2})$ and prove a matching lower bound of $\Omega(d^2 n^{-3/2})$. Finally, we evaluate SPEED on a set of policy evaluation tasks and demonstrate that it achieves MSE comparable to an optimal oracle and much lower than simply running the target policy.

**Pessimistic Off-Policy Multi-Objective Optimization**

Shima Alizadeh · Aniruddha Bhargava · Karthick Gopalswamy · Lalit Jain · Branislav Kveton · Ge Liu

Multi-objective optimization is a class of optimization problems with multiple conflicting objectives. We study offline optimization of multi-objective policies from data collected by a previously deployed policy. We propose a pessimistic estimator for policy values that can be easily plugged into existing formulas for hypervolume computation and optimized. The estimator is based on inverse propensity scores (IPS), and improves upon a naive IPS estimator in both theory and experiments. Our analysis is general, and applies beyond our IPS estimators and methods for optimizing them.

**Leveraging PAC-Bayes Theory and Gibbs Distributions for Generalization Bounds with Complexity Measures**

Paul Viallard · Rémi Emonet · Amaury Habrard · Emilie Morvant · Valentina Zantedeschi

In statistical learning theory, a generalization bound usually involves a complexity measure imposed by the considered theoretical framework. This limits the scope of such bounds, as other forms of capacity measures or regularizations are used in algorithms. In this paper, we leverage the framework of disintegrated PAC-Bayes bounds to derive a general generalization bound instantiable with arbitrary complexity measures. One trick to prove such a result involves considering a commonly used family of distributions: the Gibbs distributions. Our bound stands in probability jointly over the hypothesis and the learning sample, which allows the complexity to be adapted to the generalization gap as it can be customized to fit both the hypothesis class and the task.

This paper studies offline reinforcement learning with linear function approximation in a setting with decision-theoretic, but not estimation sparsity.The structural restrictions of the data-generating process presume that the transitions factor into a sparse component that affects the reward and could affect additional exogenous dynamics that do not affect the reward.Although the minimally sufficient adjustment set for estimation of full-state transition properties depends on the whole state, the optimal policy and therefore state-action value function depends only on the sparse component: we call this causal/decision-theoretic sparsity.We develop a method for reward-filtering the estimation of the state-action value function to the sparse component by a modification of thresholded lasso in least-squares policy evaluation.We provide theoretical guarantees for our reward-filtered linear fitted-Q-iteration, with sample complexity depending only on the size of the sparse component.

**Stochastic Multi-Armed Bandits with Strongly Reward-Dependent Delays**

Yifu Tang · Yingfei Wang · Zeyu Zheng

There has been increasing interest in applying multi-armed bandits to adaptive designs in clinical trials. However, most literature assumes that a previous patient’s survival response of a treatment is known before the next patient is treated, which is unrealistic. The inability to account for response delays is cited frequently as one of the problems in using adaptive designs in clinical trials. More critically, the “delays” in observing the survival response are the same as the rewards rather than being external stochastic noise. We formalize this problem as a novel stochastic multi-armed bandit (MAB) problem with reward-dependent delays, where the delay at each round depends on the reward generated on the same round. For general reward/delay distributions with finite expectation, our proposed censored-UCB algorithm achieves near-optimal regret in terms of both problem-dependent and problem-independent bounds. With bounded or sub-Gaussian reward distributions, the upper bounds are optimal with a matching lower bound. Our theoretical results and the algorithms' effectiveness are validated by empirical experiments.

**Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition**

Long-Fei Li · Peng Zhao · Zhi-Hua Zhou

We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses in the bandit feedback setting. Specifically, we focus on linear mixture MDPs whose transition kernel is a linear mixture model. We propose a new algorithm that attains an $\tilde{\mathcal{O}}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability, where $d$ is the dimension of feature mappings, $S$ is the size of state space, $A$ is the size of action space, $H$ is the episode length and $K$ is the number of episodes. Our result strictly improves the previous best-known $\tilde{\mathcal{O}}(dS^2 \sqrt{K} + \sqrt{HSAK})$ result in Zhao et al. (2023a) since $H \leq S$ holds by the layered MDP structure. Our advancements are primarily attributed to (\romannumeral1) a new least square estimator for the transition parameter that leverages the visit information of all states, as opposed to only one state in prior work, and (\romannumeral2) a new self-normalized concentration tailored specifically to handle non-independent noises, originally proposed in the dynamic assortment area and firstly applied in reinforcement learning to handle correlations between different states.

**Learning the Pareto Set Under Incomplete Preferences: Pure Exploration in Vector Bandits**

Efe Mert Karagözlü · Yaşar Cahit Yıldırım · Çağın Ararat · Cem Tekin

We study pure exploration in bandit problems with vector-valued rewards, where the goal is to (approximately) identify the Pareto set of arms given incomplete preferences induced by a polyhedral convex cone. We address the open problem of designing sample-efficient learning algorithms for such problems. We propose Pareto Vector Bandits (PaVeBa), an adaptive elimination algorithm that nearly matches the gap-dependent and worst-case lower bounds on the sample complexity of $(\epsilon, \delta)$-PAC Pareto set identification. Finally, we provide an in-depth numerical investigation of PaVeBa and its heuristic variants by comparing them with the state-of-the-art multi-objective and vector optimization algorithms on several real-world datasets with conflicting objectives.

**Multi-Agent Learning in Contextual Games under Unknown Constraints**

Anna Maddux · Maryam Kamgarpour

We consider the problem of learning to play a repeated contextual game with unknown reward and unknown constraints functions. Such games arise in applications where each agent's action needs to belong to a feasible set, but the feasible set is a priori unknown. For example, in constrained multi-agent reinforcement learning, the constraints on the agents' policies are a function of the unknown dynamics and hence, are themselves unknown. Under kernel-based regularity assumptions on the unknown functions, we develop a no-regret, no-violation approach that exploits similarities among different reward and constraint outcomes. The no-violation property ensures that the time-averaged sum of constraint violations converges to zero as the game is repeated. We show that our algorithm referred to as c.z.AdaNormalGP, obtains kernel-dependent regret bounds, and the cumulative constraint violations have sublinear kernel-dependent upper bounds. In addition, we introduce the notion of constrained contextual coarse correlated equilibria (c.z.CCE) and show that $\epsilon$-c.z.CCEs can be approached whenever players follow a no-regret no-violation strategy. Finally, we experimentally demonstrate the effectiveness of c.z.AdaNormalGP on an instance of multi-agent reinforcement learning.

**Learning Unknown Intervention Targets in Structural Causal Models from Heterogeneous Data**

Yuqin Yang · Saber Salehkaleybar · Negar Kiyavash

We study the problem of identifying the unknown intervention targets in structural causal models where we have access to heterogeneous data collected from multiple environments. The unknown intervention targets are the set of endogenous variables whose corresponding exogenous noises change across the environments. We propose a two-phase approach which in the first phase recovers the exogenous noises corresponding to unknown intervention targets whose distributions have changed across environments. In the second phase, the recovered noises are matched with the corresponding endogenous variables. For the recovery phase, we provide sufficient conditions for learning these exogenous noises up to some component-wise invertible transformation. For the matching phase, under the causal sufficiency assumption, we show that the proposed method uniquely identifies the intervention targets. In the presence of latent confounders, the intervention targets among the observed variables cannot be determined uniquely. We provide a candidate intervention target set which is a superset of the true intervention targets. Our approach improves upon the state of the art as the returned candidate set is always a subset of the target set returned by previous work. Moreover, we do not require restrictive assumptions such as linearity of the causal model or performing invariance tests to learn whether a distribution is changing across environments which could be highly sample inefficient. Our experimental results show the effectiveness of our proposed algorithm in practice.

**Failures and Successes of Cross-Validation for Early-Stopped Gradient Descent in High-Dimensional Least Squares**

Pratik Patil · Yuchen Wu · Ryan Tibshirani

We analyze the statistical properties of generalized cross-validation (GCV) and leave-one-out cross-validation (LOOCV) applied to early-stopped gradient descent (GD) iterates in high-dimensional least squares regression. Surprisingly, our results show that GCV can be inconsistent for estimating the squared prediction risk, even under a well-specified linear model with isotropic design. In contrast, we prove that LOOCV converges uniformly along the GD trajectory to the prediction risk. Our theory holds under mild assumptions on the data distribution and does not require the underlying regression function to be linear. Furthermore, by suitably extending LOOCV, we construct consistent estimators for the entire prediction error distribution along the GD trajectory and for a wide class of its functionals. This in particular enables the construction of pathwise prediction intervals for the unknown response with asymptotically correct nominal coverage conditional on the training data.

**Interpretable Causal Inference for Analyzing Wearable, Sensor, and Distributional Data**

Srikar Katta · Harsh Parikh · Cynthia Rudin · Alexander Volfovsky

Many modern causal questions ask how treatments affect complex outcomes that are measured using wearable devices and sensors. Current analysis approaches require summarizing these data into scalar statistics (e.g., the mean), but these summaries can be misleading. For example, disparate distributions can have the same means, variances, and other statistics. Researchers can overcome the loss in information by instead representing the data as distributions. We develop an interpretable method for distributional data analysis that ensures trustworthy and robust decision making: Analyzing Distributional Data via Matching After Learning to Stretch (ADD MALTS). We (i) provide analytical guarantees of the correctness of our estimation strategy, (ii) demonstrate via simulation that ADD MALTS outperforms other distributional data analysis methods at estimating treatment effects, and (iii) illustrate ADD MALTS' ability to verify whether there is enough cohesion between treatment and control units within subpopulations to trustworthily estimate treatment effects. We demonstrate ADD MALTS' utility by studying the effectiveness of continuous glucose monitors in mitigating diabetes risks.

**Consistent and Asymptotically Unbiased Estimation of Proper Calibration Errors**

Teodora Popordanoska · Sebastian G. Gruber · Aleksei Tiulpin · Florian Buettner · Matthew Blaschko

Proper scoring rules evaluate the quality of probabilistic predictions, playing an essential role in the pursuit of accurate and well-calibrated models. Every proper score decomposes into two fundamental components -- proper calibration error and refinement -- utilizing a Bregman divergence.While uncertainty calibration has gained significant attention, current literature lacks a general estimator for these quantities with known statistical properties. To address this gap, we propose a method that allows consistent, and asymptotically unbiased estimation of all proper calibration errors and refinement terms. In particular, we introduce Kullback-Leibler calibration error, induced by the commonly used cross-entropy loss. As part of our results, we prove the relation between refinement and f-divergences, which implies information monotonicity in neural networks, regardless of which proper scoring rule is optimized.Our experiments validate empirically the claimed properties of the proposed estimator and suggest that the selection of a post-hoc calibration method should be determined by the particular calibration error of interest.

**Identification and Estimation of ``Causes of Effects'' using Covariate-Mediator Information**

Ryusei Shingaki · Manabu Kuroki

In this paper, we deal with the evaluation problem of "causes of effects" (CoE), which focuses on the likelihood that one event was the cause of another. To assess this likelihood, three types of probabilities of causation have been utilized: probability of necessity, probability of sufficiency, and probability of necessity and sufficiency. However, these usually cannot be estimated, even if "effects of causes" (EoC) is estimable from statistical data, regardless of how large the data is. To solve this problem, we propose novel identification conditions for CoE, using an intermediate variable together with covariate information. Additionally, we also propose a new method for estimating CoE that is applicable whenever they are identifiable through the proposed identification conditions.

**Sequential learning of the Pareto front for multi-objective bandits**

Élise Crepon · Aurélien Garivier · Wouter Koolen

We study the problem of sequential learning of the Pareto front in multi-objective multi-armed bandits. An agent is faced with $K$ possible arms to pull. At each turn she picks one, and receives a vector-valued reward. When she thinks she has enough information to identify the Pareto front of the different arm means, she stops the game and gives an answer. We are interested in designing algorithms such that the answer given is correct with probability at least $1-\delta$. Our main contribution is an efficient implementation of an algorithm achieving the optimal sample complexity when the risk $\delta$ is small. With $K$ arms in $d$ dimensions, $p$ of which are in the Pareto set, the algorithm runs in time $O(K p^d)$ per round.

**On the Impact of Overparameterization on the Training of a Shallow Neural Network in High Dimensions**

Simon Martin · Francis Bach · Giulio Biroli

We study the training dynamics of a shallow neural network with quadratic activation functions and quadratic cost in a teacher-student setup. In line with previous works on the same neural architecture, the optimization is performed following the gradient flow on the population risk, where the average over data points is replaced by the expectation over their distribution, assumed to be Gaussian. We first derive convergence properties for the gradient flow and quantify the overparameterization that is necessary to achieve a strong signal recovery. Then, assuming that the teachers and the students at initialization form independent orthonormal families, we derive a high-dimensional limit for the flow and show that the minimal overparameterization is sufficient for strong recovery. We verify by numerical experiments that these results hold for more general initializations.

**Minimax Excess Risk of First-Order Methods for Statistical Learning with Data-Dependent Oracles**

Kevin Scaman · Mathieu Even · Batiste Le Bars · Laurent Massoulié

In this paper, our aim is to analyse the generalization capabilities of first-order methods for statistical learning in multiple, different yet related, scenarios including supervised learning, transfer learning, robust learning and federated learning. To do so, we provide sharp upper and lower bounds for the minimax excess risk of strongly convex and smooth statistical learning when the gradient is accessed through partial observations given by a data-dependent oracle. This novel class of oracles can query the gradient with any given data distribution, and is thus well suited to scenarios in which the training data distribution does not match the target (or test) distribution. In particular, our upper and lower bounds are proportional to the smallest mean square error achievable by gradient estimators, thus allowing us to easily derive multiple sharp bounds in the aforementioned scenarios using the extensive literature on parameter estimation.

**On the Misspecification of Linear Assumptions in Synthetic Controls**

Achille Nazaret · claudia shi · David Blei

The synthetic control (SC) method is popular for estimating causal effects from observational panel data. It rests on a crucial assumption that we can write the treated unit as a linear combination of the untreated units. In practice, this assumption may not hold, and when violated, the resulting SC estimates are incorrect. This paper examines two questions: (1) How large can the misspecification error be? (2) How can we minimize it? First, we provide theoretical bounds to quantify the misspecification error. The bounds are comforting: small misspecifications induce small errors. With these bounds in hand, we develop new SC estimators specially designed to minimize misspecification error. The estimators are based on additional data about each unit. %, which is used to produce the SC weights. (E.g., if the units are countries, it might be demographic information about each.) We study our estimators on synthetic data; we find they produce more accurate causal estimates than standard SC. We then re-analyze the California tobacco program data of the original SC paper, now including additional data from the US census about per-state demographics. Our estimators show that the observations in the pre-treatment period lie within the bounds of misspecification error and that observations post-treatment lie outside of those bounds. This is evidence that our SC methods have uncovered a true effect.

**Multi-Agent Bandit Learning through Heterogeneous Action Erasure Channels**

Osama Hanna · Merve Karakas · Lin Yang · Christina Fragouli

Multi-Armed Bandit (MAB) systems are witnessing an upswing in applications within multi-agent distributed environments, leading to the advancement of collaborative MAB algorithms. In such settings, communication between agents executing actions and the primary learner making decisions can hinder the learning process. A prevalent challenge in distributed learning is action erasure, often induced by communication delays and/or channel noise. This results in agents possibly not receiving the intended action from the learner, subsequently leading to misguided feedback. In this paper, we introduce novel algorithms that enable learners to interact concurrently with distributed agents across heterogeneous action erasure channels with different action erasure probabilities. We illustrate that, in contrast to existing bandit algorithms, which experience linear regret, our algorithms assure sub-linear regret guarantees. Our proposed solutions are founded on a meticulously crafted repetition protocol and scheduling of learning across heterogeneous channels. To our knowledge, these are the first algorithms capable of effectively learning through heterogeneous action erasure channels. We substantiate the superior performance of our algorithm through numerical experiments, emphasizing their practical significance in addressing issues related to communication constraints and delays in multi-agent environments.

**Conditions on Preference Relations that Guarantee the Existence of Optimal Policies**

Jonathan Colaço Carr · Prakash Panangaden · Doina Precup

Learning from Preferential Feedback (LfPF) plays an essential role in training Large Language Models, as well as certain types of interactive learning agents. However, a substantial gap exists between the theory and application of LfPF algorithms. Current results guaranteeing the existence of optimal policies in LfPF problems assume that both the preferences and transition dynamics are determined by a Markov Decision Process. We introduce the Direct Preference Process, a new framework for analyzing LfPF problems in partially-observable, non-Markovian environments. Within this framework, we establish conditions that guarantee the existence of optimal policies by considering the ordinal structure of the preferences. We show that a decision-making problem can have optimal policies -- that are characterized by recursive optimality equations -- even when no reward function can express the learning goal. These findings underline the need to explore preference-based learning strategies which do not assume that preferences are generated by reward.

**Membership Testing in Markov Equivalence Classes via Independence Queries**

Jiaqi Zhang · Kirankumar Shiragur · Caroline Uhler

Understanding causal relationships between variables is a fundamental problem with broad impacts in numerous scientific fields. While extensive research have been dedicated to _learning_ causal graphs from data, its complementary concept of _testing_ causal relationships has remained largely unexplored. In our work, we take the initiative to formally delve into the _testing_ aspect of causal discovery. While _learning_ involves the task of recovering the Markov equivalence class (MEC) of the underlying causal graph from observational data, our _testing_ counterpart addresses a critical question: _Given a specific MEC, can we determine if the underlying causal graph belongs to it with observational data?_We explore constraint-based testing methods by establishing bounds on the required number of conditional independence tests. Our bounds are in terms of the size of the maximum clique ($s'$) and the size of the maximum undirected clique ($s$) of the given MEC. In the worst case, we show a lower bound of $\exp(\Omega(s))$ independence tests. We then give an algorithm that resolves the task with $\exp(O(s'))$ independence tests. Notably, our lower and upper bounds coincide when considering moral MECs ($s'=s$). Compared to _learning_, where algorithms often use a number of independence tests that is exponential in the maximum in-degree, our results show that _testing_ is a relatively more manageable task. In particular, it requires exponentially less independence tests in graphs featuring high in-degrees and small clique sizes.

**General Identifiability and Achievability for Causal Representation Learning**

Burak Varici · Emre Acartürk · Karthikeyan Shanmugam · Ali Tajer

This paper focuses on causal representation learning (CRL) under a general nonparametric latent causal model and a general transformation model that maps the latent data to the observational data. It establishes identifiability and achievability results using two hard uncoupled interventions per node in the latent causal graph. Notably, one does not know which pair of intervention environments have the same node intervened (hence, uncoupled). For identifiability, the paper establishes that perfect recovery of the latent causal model and variables is guaranteed under uncoupled interventions. For achievability, an algorithm is designed that uses observational and interventional data and recovers the latent causal model and variables with provable guarantees. This algorithm leverages score variations across different environments to estimate the inverse of the transformer and, subsequently, the latent variables. The analysis, additionally, recovers the identifiability result for two hard coupled interventions, that is when metadata about the pair of environments that have the same node intervened is known. This paper also shows that when observational data is available, additional faithfulness assumptions that are adopted by the existing literature are unnecessary.

**Proxy Methods for Domain Adaptation**

Katherine Tsai · Stephen Pfohl · Olawale Salaudeen · Nicole Chiou · Matt Kusner · Alexander D'Amour · Sanmi Koyejo · Arthur Gretton

We study the problem of domain adaptation under distribution shift, where the shift is due to a change in the distribution of an unobserved, latent variable that confounds both the covariates and the labels. In this setting, neither the covariate shift nor the label shift assumptions apply. Our approach to adaptation employs proximal causal learning, a technique for estimating causal effects in settings where proxies of unobserved confounders are available.We demonstrate that proxy variables allow for adaptation to distribution shift without explicitly recovering or modeling latent variables. We consider two settings, (i) Concept Bottleneck: an additional ``concept'' variable is observed that mediates the relationship between the covariates and labels; (ii) Multi-domain: training data from multiple source domains is available, where each source domain exhibits a different distribution over the latent confounder.We develop a two-stage kernel estimation approach to adapt to complex distribution shifts in both settings. In our experiments, we show that our approach outperforms other methods, notably those which explicitly recover the latent confounder.

**Self-Compatibility: Evaluating Causal Discovery without Ground Truth**

Philipp M. Faller · Leena C Vankadara · Atalanti Mastakouri · Francesco Locatello · Dominik Janzing

As causal ground truth is incredibly rare, causal discovery algorithms are commonly only evaluated on simulated data. This is concerning, given that simulations reflect preconceptions about generating processes regarding noise distributions, model classes, and more. In this work, we propose a novel method for falsifying the output of a causal discovery algorithm in the absence of ground truth. Our key insight is that while statistical learning seeks stability across subsets of data points, causal learning should seek stability across subsets of variables. Motivated by this insight, our method relies on a notion of compatibility between causal graphs learned on different subsets of variables. We prove that detecting incompatibilities can falsify wrongly inferred causal relations due to violation of assumptions or errors from finite sample effects. Although passing such compatibility tests is only a necessary criterion for good performance, we argue that it provides strong evidence for the causal models whenever compatibility entails strong implications for the joint distribution. We also demonstrate experimentally that detection of incompatibilities can aid in causal model selection.

**Benchmarking Observational Studies with Experimental Data under Right-Censoring**

Ilker Demirel · Edward De Brouwer · Zeshan Hussain · Michael Oberst · Anthony Philippakis · David Sontag

Drawing causal inferences from observational studies (OS) requires unverifiable validity assumptions; however, one can falsify those assumptions by benchmarking the OS with experimental data from a randomized controlled trial (RCT). A major limitation of existing procedures is not accounting for censoring, despite the abundance of RCTs and OSes that report right-censored time-to-event outcomes. We consider two cases where censoring time (1) is independent of time-to-event and (2) depends on time-to-event the same way in OS and RCT. For the former, we adopt a censoring-doubly-robust signal for the conditional average treatment effect (CATE) to facilitate an equivalence test of CATEs in OS and RCT, which serves as a proxy for testing if the validity assumptions hold. For the latter, we show that the same test can still be used even though unbiased CATE estimation may not be possible. We verify the effectiveness of our censoring-aware tests via semi-synthetic experiments and analyze RCT and OS data from the Women's Health Initiative study.

We introduce and study constrained Markov Decision Processes (cMDPs) with anytime constraints. An anytime constraint requires the agent to never violate its budget at any point in time, almost surely. Although Markovian policies are no longer sufficient, we show that there exist optimal deterministic policies augmented with cumulative costs. In fact, we present a fixed-parameter tractable reduction from anytime-constrained cMDPs to unconstrained MDPs. Our reduction yields planning and learning algorithms that are time and sample-efficient for tabular cMDPs so long as the precision of the costs is logarithmic in the size of the cMDP. However, we also show that computing non-trivial approximately optimal policies is NP-hard in general. To circumvent this bottleneck, we design provable approximation algorithms that efficiently compute or learn an arbitrarily accurate approximately feasible policy with optimal value so long as the maximum supported cost is bounded by a polynomial in the cMDP or the absolute budget. Given our hardness results, our approximation guarantees are the best possible under worst-case analysis.

Accurate estimation of conditional average treatment effects (CATE) is at the core of personalized decision making. While there is a plethora of models for CATE estimation, model selection is a non-trivial task, due to the fundamental problem of causal inference. Recent empirical work provides evidence in favor of proxy loss metrics with double robust properties and in favor of model ensembling. However, theoretical understanding is lacking. Direct application of prior theoretical works leads to suboptimal oracle model selection rates due to the non-convexity of the model selection problem. We provide regret rates for the major existing CATE ensembling approaches and propose a new CATE model ensembling approach based on $Q$-aggregation using the doubly robust loss. Our main result shows that causal $Q$-aggregation achieves statistically optimal oracle model selection regret rates of $\log(M)/n$ (with $M$ models and $n$ samples), with the addition of higher-order estimation error terms related to products of errors in the nuisance functions. Crucially, our regret rate does not require that any of the candidate CATE models be close to the truth. We validate our new method on many semi-synthetic datasets and also provide extensions of our work to CATE model selection with instrumental variables and unobserved confounding.

**Learning Policies for Localized Interventions from Observational Data**

Myrl Marmarelis · Fred Morstatter · Aram Galstyan · Greg Ver Steeg

A largely unaddressed problem in causal inference is that of learning reliable policies in continuous, high-dimensional treatment variables from observational data. Especially in the presence of strong confounding, it can be infeasible to learn the entire heterogeneous response surface from treatment to outcome. It is also not particularly useful, when there are practical constraints on the size of the interventions altering the observational treatments. Since it tends to be easier to learn the outcome for treatments near existing observations, we propose a new framework for evaluating and optimizing the effect of small, tailored, and localized interventions that nudge the observed treatment assignments. Our doubly robust effect estimator plugs into a policy learner that stays within the interventional scope by optimal transport. Consequently, the error of the total policy effect is restricted to prediction errors nearby the observational distribution, rather than the whole response surface.

**Improved Regret Bounds of (Multinomial) Logistic Bandits via Regret-to-Confidence-Set Conversion**

Junghyun Lee · Se-Young Yun · Kwang-Sung Jun

Logistic bandit is a ubiquitous framework of modeling users' choices, e.g., click vs.\ no click for advertisement recommender system. We observe that the prior works overlook or neglect dependencies in $S \geq \Vert \theta_\star \Vert_2$, where $\theta_\star \in \mathbb{R}^d$ is the unknown parameter vector, which is particularly problematic when $S$ is large, e.g., $S \geq d$. In this work, we improve the dependency on $S$ via a novel approach called {\it regret-to-confidence set conversion (R2CS)}, which allows us to construct a convex confidence set based on only the \textit{existence} of an online learning algorithm with a regret guarantee. Using R2CS, we obtain a strict improvement in the regret bound w.r.t. $S$ in logistic bandits while retaining computational feasibility and the dependence on other factors such as $d$ and $T$. We apply our new confidence set to the regret analyses of logistic bandits with a new martingale concentration step that circumvents an additional factor of $S$. We then extend this analysis to multinomial logistic bandits and obtain similar improvements in the regret, showing the efficacy of R2CS. While we applied R2CS to the (multinomial) logistic model, R2CS is a generic approach for developing confidence sets that can be used for various models, which can be of independent interest.

**Gibbs-Based Information Criteria and the Over-Parameterized Regime**

Haobo Chen · Gregory Wornell · Yuheng Bu

Double-descent refers to the unexpected drop in test loss of a learning algorithm beyond an interpolating threshold with over-parameterization, which is not predicted by information criteria in their classical forms due to the limitations in the standard asymptotic approach. We update these analyses using the information risk minimization framework and provide Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) for models learned by the Gibbs algorithm. Notably, the penalty terms for the Gibbs-based AIC and BIC correspond to specific information measures, i.e., symmetrized KL information and KL divergence.We extend this information-theoretic analysis to over-parameterized models by providing two different Gibbs-based BICs to compute the marginal likelihood of random feature models in the regime where the number of parameters $p$ and the number of samples $n$ tend to infinity, with $p/n$ fixed. Our experiments demonstrate that the Gibbs-based BIC can select the high-dimensional model and reveal the mismatch between marginal likelihood and population risk in the over-parameterized regime, providing new insights to understand double-descent.

**Think Before You Duel: Understanding Complexities of Preference Learning under Constrained Resources**

Rohan Deb · Aadirupa Saha · Arindam Banerjee

We consider the problem of reward maximization in the dueling bandit setup along with constraints on resource consumption. As in the classic dueling bandits, at each round the learner has to choose a pair of items from a set of $K$ items and observe a relative feedback for the current pair. Additionally, for both items, the learner also observes a vector of resource consumptions. The objective of the learner is to maximize the cumulative reward, while ensuring that the total consumption of any resource is within the allocated budget. We show that due to the relative nature of the feedback, the problem is more difficult than its bandit counterpart and that without further assumptions the problem is not learnable from a regret minimization perspective. Thereafter, by exploiting assumptions on the available budget, we provide an EXP3 based dueling algorithm that also considers the associated consumptions and show that it achieves an $\tilde{\mathcal{O}}\left(\big({\frac{OPT^{(b)}}{B}}+1\big)K^{1/3}T^{2/3}\right)$ regret, where $OPT^{(b)}$ is the optimal value and $B$ is the available budget. Finally, we provide numerical simulations to demonstrate the efficacy of our proposed method.

**Sum-max Submodular Bandits**

Stephen Pasteris · Alberto Rumi · Fabio Vitale · Nicolò Cesa-Bianchi

Many online decision-making problems correspond to maximizing a sequence of submodular functions.In this work, we introduce sum-max functions, a subclass of monotone submodular functions capturing several interesting problems, including best-of-$K$-bandits, combinatorial bandits, and the bandit versions on $M$-medians and hitting sets.We show that all functions in this class satisfy a key property that we call pseudo-concavity.This allows us to prove $\big(1 - \frac{1}{e}\big)$-regret bounds for bandit feedback in the nonstochastic setting of the order of $\sqrt{MKT}$ (ignoring log factors), where $T$ is the time horizon and $M$ is a cardinality constraint.This bound, attained by a simple and efficient algorithm, significantly improves on the $\widetilde{\mathcal{O}}\big(T^{2/3}\big)$ regret bound for online monotone submodular maximization with bandit feedback.We also extend our results to a bandit version of the facility location problem.

**Bounding Box-based Multi-objective Bayesian Optimization of Risk Measures under Input Uncertainty**

Yu Inatsu · Shion Takeno · Hiroyuki Hanada · Kazuki Iwata · Ichiro Takeuchi

In this study, we propose a novel multi-objective Bayesian optimization (MOBO) method to efficiently identify the Pareto front (PF) defined by risk measures for black-box functions under the presence of input uncertainty (IU). Existing BO methods for Pareto optimization in the presence of IU are risk-specific or without theoretical guarantees, whereas our proposed method addresses general risk measures and has theoretical guarantees. The basic idea of the proposed method is to assume a Gaussian process (GP) model for the black-box function and to construct high-probability bounding boxes for the risk measures using the GP model. Furthermore, in order to reduce the uncertainty of non-dominated bounding boxes, we propose a method of selecting the next evaluation point using a maximin distance defined by the maximum value of a quasi distance based on bounding boxes. As theoretical analysis, we prove that the algorithm can return an arbitrary-accurate solution in a finite number of iterations with high probability, for various risk measures such as Bayes risk, worst-case risk, and value-at-risk. We also give a theoretical analysis that takes into account approximation errors because there exist non-negligible approximation errors (e.g., finite approximation of PFs and sampling-based approximation of bounding boxes) in practice. We confirm that the proposed method performs as well or better than existing methods not only in the setting with IU but also in the setting of ordinary MOBO through numerical experiments.

**To Pool or Not To Pool: Analyzing the Regularizing Effects of Group-Fair Training on Shared Models**

Cyrus Cousins · I. Elizabeth Kumar · Suresh Venkatasubramanian

In fair machine learning, one source of performance disparities between groups is overfitting to groups with relatively few training samples.We derive group-specific bounds on the generalization error of welfare-centric fair machine learning that benefit from the larger sample size of the majority group.We do this by considering group-specific Rademacher averages over a restricted hypothesis class, which contains the family of models likely to perform well with respect to a fair learning objective (e.g., a power-mean).Our simulations demonstrate these bounds improve over a na\"ive method, as expected by theory, with particularly significant improvement for smaller group sizes.

**Causal Bandits with General Causal Models and Interventions**

Zirui Yan · Dennis Wei · Dmitriy Katz · Prasanna Sattigeri · Ali Tajer

This paper considers causal bandits (CBs) for the sequential design of interventions in a causal system. The objective is to optimize a reward function via minimizing a measure of cumulative regret with respect to the best sequence of interventions in hindsight. The paper advances the results on CBs in three directions. First, the structural causal models (SCMs) are assumed to be unknown and drawn arbitrarily from a general class $\mathcal{F}$ of Lipschitz-continuous functions. Existing results are often focused on (generalized) linear SCMs. Second, the interventions are assumed to be generalized soft with any desired level of granularity, resulting in an infinite number of possible interventions. The existing literature, in contrast, generally adopts atomic and hard interventions. Third, we provide general upper and lower bounds on regret. The upper bounds subsume (and improve) known bounds for special cases. The lower bounds are generally hitherto unknown. These bounds are characterized as functions of the (i) graph parameters, (ii) eluder dimension of the space of SCMs, denoted by $\mathrm{dim}(\mathcal{F})$, and (iii)~the covering number of the function space, denoted by $\mathrm{cn}(\mathcal{F})$. Specifically, the cumulative achievable regret over horizon $T$ is $\mathcal{O}(K d^{L-1}\sqrt{T\,\mathrm{dim}(\mathcal{F}) \log(\mathrm{cn}(\mathcal{F}))})$, where $K$ is related to the Lipschitz constants, $d$ is the graph’s maximum in-degree, and $L$ is the length of the longest causal path. The upper bound is further refined for special classes of SCMs (neural network, polynomial, and linear), and their corresponding lower bounds are provided.

**On the price of exact truthfulness in incentive-compatible online learning with bandit feedback: a regret lower bound for WSU-UX**

Ali Mortazavi · Junhao Lin · Nishant Mehta

In one view of the classical game of prediction with expert advice with binary outcomes, in each round, each expert maintains an adversarially chosen belief and honestly reports this belief. We consider a recently introduced, strategic variant of this problem with selfish (reputation-seeking) experts, where each expert strategically reports in order to maximize their expected future reputation based on their belief. In this work, our goal is to design an algorithm for the selfish experts problem that is incentive-compatible (IC, or \emph{truthful}), meaning each expert's best strategy is to report truthfully, while also ensuring the algorithm enjoys sublinear regret with respect to the expert with the best belief. Freeman et al. (2020) recently studied this problem in the full information and bandit settings and obtained truthful, no-regret algorithms by leveraging prior work on wagering mechanisms. While their results under full information match the minimax rate for the classical ("honest experts") problem, the best-known regret for their bandit algorithm WSU-UX is $O(T^{2/3})$, which does not match the minimax rate for the classical ("honest bandits") setting. It was unclear whether the higher regret was an artifact of their analysis or a limitation of WSU-UX. We show, via explicit construction of loss sequences, that the algorithm suffers a worst-case $\Omega(T^{2/3})$ lower bound. Left open is the possibility that a different IC algorithm obtains $O(\sqrt{T})$ regret. Yet, WSU-UX was a natural choice for such an algorithm owing to the limited design room for IC algorithms in this setting.

**Faster Recalibration of an Online Predictor via Approachability**

Princewill Okoroafor · Bobby Kleinberg · Wen Sun

Predictive models in ML need to be trustworthy and reliable, which often at the very least means outputting calibrated probabilities. This can be particularly difficult to guarantee in the online prediction setting when the outcome sequence can be generated adversarially. In this paper we introduce a technique using Blackwell’s approachability theorem for taking an online predictive model which might not be calibrated and transforming its predictions to calibrated predictions without much increase to the loss of the original model. Our proposed algorithm achieves calibration and accuracy at a faster rate than existing techniques (Kuleshov and Ermon, 2017) and is the first algorithm to offer a flexible tradeoff between calibration error and accuracy in the online setting. We demonstrate this by characterizing the space of jointly achievable calibration and regret using our technique.

**Theoretically Grounded Loss Functions and Algorithms for Score-Based Multi-Class Abstention**

Anqi Mao · Mehryar Mohri · Yutao Zhong

Learning with abstention is a key scenario where the learner can abstain from making a prediction at some cost. In this paper, we analyze the score-based formulation of learning with abstention in the multi-class classification setting. We introduce new families of surrogate losses for the abstention loss function, which include the state-of-the-art surrogate losses in the single-stage setting and a novel family of loss functions in the two-stage setting. We prove strong non-asymptotic and hypothesis set-specific consistency guarantees for these surrogate losses, which upper-bound the estimation error of the abstention loss function in terms of the estimation error of the surrogate loss. Our bounds can help compare different score-based surrogates and guide the design of novel abstention algorithms by minimizing the proposed surrogate losses. We experimentally evaluate our new algorithms on CIFAR-10, CIFAR-100, and SVHN datasets and the practical significance of our new surrogate losses and two-stage abstention algorithms. Our results also show that the relative performance of the state-of-the-art score-based surrogate losses can vary across datasets.

**Differentially Private Reward Estimation with Preference Feedback**

Sayak Ray Chowdhury · Xingyu Zhou · Nagarajan Natarajan

Learning from preference-based feedback has recently gained considerable traction as a promising approach to align generative models with human interests. Instead of relying on numerical rewards, the generative models are trained using reinforcement learning with human feedback (RLHF). These approaches first solicit feedback from human labelers typically in the form of pairwise comparisons between two possible actions, then estimate a reward model using these comparisons, and finally employ a policy based on the estimated reward model. An adversarial attack in any step of the above pipeline might reveal private and sensitive information of human labelers. In this work, we adopt the notion of \emph{label differential privacy} (DP) and focus on the problem of reward estimation from preference-based feedback while protecting privacy of each individual labelers. Specifically, we consider the parametric Bradley-Terry-Luce (BTL) model for such pairwise comparison feedback involving a latent reward parameter $\theta^* \in \mathbb{R}^d$. Within a standard minimax estimation framework, we provide tight upper and lower bounds on the error in estimating $\theta^*$ under both \emph{local} and \emph{central} models of DP. We show, for a given privacy budget $\epsilon$ and number of samples $n$, that the additional cost to ensure label-DP under local model is $\Theta \big(\frac{1}{ e^\epsilon-1}\sqrt{\frac{d}{n}}\big)$, while it is $\Theta\big(\frac{\sqrt{d}}{\epsilon n} \big)$ under the weaker central model. We perform simulations on synthetic data that corroborate these theoretical results.

**Identifying Confounding from Causal Mechanism Shifts**

Sarah Mameche · Jilles Vreeken · David Kaltenpoth

Causal discovery methods commonly assume that all data is independently and identically distributed (i.i.d.) and that there are no unmeasured confounding variables. In practice, neither is likely to hold, and detecting confounding in non-i.i.d. settings poses a significant challenge. Motivated by this, we explore how to discover confounders from data in multiple environments with causal mechanism shifts. We show that the mechanism changes of observed variables can reveal which variable sets are confounded. Based on this idea, we propose an empirically testable criterion based on mutual information, show under which conditions it can identify confounding, and introduce CoCo to discover confounders from data in multiple contexts. In our experiments, we show that CoCo works well on synthetic and real-world data.

In this paper, we address the problem of testing exchangeability of a sequence of random variables, $X_1, X_2,\cdots$. This problem has been studied under the recently popular framework of \emph{testing by betting}. But the mapping of testing problems to game is not one to one: many games can be designed for the same test. Past work established that it is futile to play single game betting on every observation: test martingales in the data filtration are powerless. Two avenues have been explored to circumvent this impossibility: betting in a reduced filtration (wealth is a test martingale in a coarsened filtration), or playing many games in parallel (wealth is an e-process in the data filtration). The former has proved to be difficult to theoretically analyze, while the latter only works for binary or discrete observation spaces. Here, we introduce a different approach that circumvents both drawbacks. We design a new (yet simple) game in which we observe the data sequence in pairs. Even though betting on individual observations is futile, we show that betting on pairs of observations is not. To elaborate, we prove that our game leads to a nontrivial test martingale, which is interesting because it has been obtained by shrinking the filtration very slightly. We show that our test controls type-1 error despite continuous monitoring, and is consistent for both binary and continuous observations, under a broad class of alternatives. Due to the shrunk filtration, optional stopping is only allowed at even stopping times: a relatively minor price. We provide a variety of simulations that align with our theoretical findings.

**Boundary-Aware Uncertainty for Feature Attribution Explainers**

Davin Hill · Aria Masoomi · Max Torop · Sandesh Ghimire · Jennifer Dy

Post-hoc explanation methods have become a critical tool for understanding black-box classifiers in high-stakes applications. However, high-performing classifiers are often highly nonlinear and can exhibit complex behavior around the decision boundary, leading to brittle or misleading local explanations. Therefore there is an impending need to quantify the uncertainty of such explanation methods in order to understand when explanations are trustworthy. In this work we propose the Gaussian Process Explanation unCertainty (GPEC) framework, which generates a unified uncertainty estimate combining decision boundary-aware uncertainty with explanation function approximation uncertainty. We introduce a novel geodesic-based kernel, which captures the complexity of the target black-box decision boundary. We show theoretically that the proposed kernel similarity increases with decision boundary complexity. The proposed framework is highly flexible; it can be used with any black-box classifier and feature attribution method. Empirical results on multiple tabular and image datasets show that the GPEC uncertainty estimate improves understanding of explanations as compared to existing methods.

**EM for Mixture of Linear Regression with Clustered Data**

Amirhossein Reisizadeh · Khashayar Gatmiry · Asuman Ozdaglar

Modern data-driven and distributed learning frameworks deal with diverse massive data generated by clients spread across heterogeneous environments. Indeed, data heterogeneity is a major bottleneck in scaling up many distributed learning paradigms. In many settings however, heterogeneous data may be generated in clusters with shared structures, as is the case in several applications such as federated learning where a common latent variable governs the distribution of all the samples generated by a client. It is therefore natural to ask how the underlying clustered structures in distributed data can be exploited to improve learning schemes. In this paper, we tackle this question in the special case of estimating $d$-dimensional parameters of a two-component mixture of linear regressions problem where each of $m$ nodes generates $n$ samples with a shared latent variable. We employ the well-known Expectation-Maximization (EM) method to estimate the maximum likelihood parameters from m batches of dependent samples each containing n measurements. Discarding the clustered structure in the mixture model, EM is knownto require $O(\log(mn/d))$ iterations to reach the statistical accuracy of $O(\sqrt{d/(mn)}$). In contrast, we show that if initialized properly, EM on the structured data requires only $O(1)$ iterations to reach the same statistical accuracy, as long as m grows up as $e^{o(n)}$. Our analysis establishes and combines novel asymptotic optimization and generalization guarantees for population and empirical EM with dependent samples, which may be of independent interest.

**Federated Experiment Design under Distributed Differential Privacy**

Wei-Ning Chen · Graham Cormode · Akash Bharadwaj · Peter Romov · Ayfer Ozgur

Experiment design has a rich history dating back over a century and has found many critical applications across various fields since then. The use and collection of users' data in experiments often involve sensitive personal information, so additional measures to protect individual privacy are required during data collection, storage, and usage. In this work, we focus on the rigorous protection of users' privacy (under the notion of differential privacy (DP)) while minimizing the trust toward service providers. Specifically, we consider the estimation of the average treatment effect (ATE) under DP, while only allowing the analyst to collect population-level statistics via secure aggregation, a distributed protocol enabling a service provider to aggregate information without accessing individual data. Although a vital component in modern A/B testing workflows, private distributed experimentation has not previously been studied.To achieve DP, we design local privatization mechanisms that are compatible with secure aggregation and analyze the utility, in terms of the width of confidence intervals, both asymptotically and non-asymptotically. We show how these mechanisms can be scaled up to handle the very large number of participants commonly found in practice.In addition, when introducing DP noise, it is imperative to cleverly split privacy budgets to estimate both the mean and variance of the outcomes and carefully calibrate the confidence intervals according to the DP noise. Last, we present comprehensive experimental evaluations of our proposed schemes and show the privacy-utility trade-offs in experiment design.

**Can Probabilistic Feedback Drive User Impacts in Online Platforms?**

Jessica Dai · Bailey Flanigan · Meena Jagadeesan · Nika Haghtalab · Chara Podimata

A common explanation for negative user impacts of content recommender systems is misalignment between the platform's objective and user welfare. In this work, we show that misalignment in the platform's objective is not the only potential cause of unintended impacts on users: even when the platform's objective is fully aligned with user welfare, the platform's learning algorithm can induce negative downstream impacts on users. The source of these user impacts is that different pieces of content may generate observable user reactions (feedback information) at different rates; these feedback rates may correlate with content properties, such as controversiality or demographic similarity of the creator, that affect the user experience. Since differences in feedback rates can impact how often the learning algorithm engages with different content, the learning algorithm may inadvertently promote content with certain such properties. Using the multi-armed bandit framework with probabilistic feedback, we examine the relationship between feedback rates and a learning algorithm's engagement with individual arms for different no-regret algorithms. We prove that no-regret algorithms can exhibit a wide range of dependencies: if the feedback rate of an arm increases, some no-regret algorithms engage with the arm more, some no-regret algorithms engage with the arm less, and other no-regret algorithms engage with the arm approximately the same number of times. From a platform design perspective, our results highlight the importance of looking beyond regret when measuring an algorithm's performance, and assessing the nature of a learning algorithm's engagement with different types of content as well as their resulting downstream impacts.

**Achieving Group Distributional Robustness and Minimax Group Fairness with Interpolating Classifiers**

Natalia Martinez · Martin Bertran · Guillermo Sapiro

Group distributional robustness optimization methods (GDRO) learn models that guarantee performance across a broad set of demographics. GDRO is often framed as a minimax game where an adversary proposes data distributions under which the model performs poorly; importance weights are used to mimic the adversarial distribution on finite samples. Prior work has show that applying GDRO with interpolating classifiers requires strong regularization to generalize to unseen data. Moreover, these classifiers are not responsive to importance weights in the asymptotic training regime. In this work we propose Bi-level GDRO, a provably convergent formulation that decouples the adversary's and model learner's objective and improves generalization guarantees. To address non-responsiveness of importance weights, we combine Bi-level GDRO with a learner that optimizes a temperature-scaled loss that can provably trade off performance between demographics, even on interpolating classifiers. We experimentally demonstrate the effectiveness of our proposed method on learning minimax classifiers on a variety of datasets. Code is available at github.com/MartinBertran/BiLevelGDRO.

**Pr{I}sing: Privacy-Preserving Peer Effect Estimation via {I}sing Model**

Abhinav Chakraborty · Anirban Chatterjee · Abhinandan Dalal

The Ising model, originally developed as a spin-glass model for ferromagnetic elements, has gained popularity as a network-based model for capturing dependencies in agents' outputs. Its increasing adoption in healthcare and the social sciences has raised privacy concerns regarding the confidentiality of agents' responses. In this paper, we present a novel $(\varepsilon,\delta)$-differentially private algorithm specifically designed to protect the privacy of individual agents' outcomes. Our algorithm allows for precise estimation of the natural parameter using a single network through an objective perturbation technique. Furthermore, we establish regret bounds for this algorithm and assess its performance on synthetic datasets and two real-world networks: one involving HIV status in a social network and the other concerning the political leaning of online blogs.

**Invariant Aggregator for Defending against Federated Backdoor Attacks**

Xiaoyang Wang · Dimitrios Dimitriadis · Sanmi Koyejo · Shruti Tople

Federated learning enables training high-utility models across several clients without directly sharing their private data. As a downside, the federated setting makes the model vulnerable to various adversarial attacks in the presence of malicious clients. Despite the theoretical and empirical success in defending against attacks that aim to degrade models' utility, defense against backdoor attacks that increase model accuracy on backdoor samples exclusively without hurting the utility on other samples remains challenging. To this end, we first analyze the failure modes of existing defenses over a flat loss landscape, which is common for well-designed neural networks such as Resnet (He et al., 2015) but is often overlooked by previous works. Then, we propose an invariant aggregator that redirects the aggregated update to invariant directions that are generally useful via selectively masking out the update elements that favor few and possibly malicious clients. Theoretical results suggest that our approach provably mitigates backdoor attacks and remains effective over flat loss landscapes. Empirical results on three datasets with different modalities and varying numbers of clients further demonstrate that our approach mitigates a broad class of backdoor attacks with a negligible cost on the model utility.

**Unveiling Latent Causal Rules: A Temporal Point Process Approach for Abnormal Event Explanation**

Yiling Kuang · Chao Yang · Yang Yang · Shuang Li

In high-stakes systems such as healthcare, it is critical to understand the causal reasons behind unusual events, such as sudden changes in patient's health. Unveiling the causal reasons helps with quick diagnoses and precise treatment planning. In this paper, we propose an automated method for uncovering ``if-then'' logic rules to explain observational events. We introduce {\it temporal point processes} to model the events of interest, and discover the set of latent rules to explain the occurrence of events. To achieve this goal, we employ an Expectation-Maximization (EM) algorithm. In the E-step, we calculate the posterior probability of each event being explained by each discovered rule. In the M-step, we update both the rule set and model parameters to enhance the likelihood function's lower bound. Notably, we will optimize the rule set in a {\it differential} manner. Our approach demonstrates accurate performance in both discovering rules and identifying root causes. We showcase its promising results using synthetic and real healthcare datasets.

**Identifying Spurious Biases Early in Training through the Lens of Simplicity Bias**

Yu Yang · Eric Gan · Gintare Karolina Dziugaite · Baharan Mirzasoleiman

Neural networks trained with (stochastic) gradient descent have an inductive bias towards learning simpler solutions. This makes them highly prone to learning spurious correlations in the training data, that may not hold at test time. In this work, we provide the first theoretical analysis of the effect of simplicity bias on learning spurious correlations. Notably, we show that examples with spurious features are provably separable based on the model’s output early in training. We further illustrate that if spurious features have a small enough noise-to-signal ratio, the network’s output on majority of examples is almost exclusively determined by the spurious features, leading to poor worst-group test accuracy. Finally,we propose SPARE, which identifies spurious correlations early in training, and utilizes importance sampling to alleviate their effect. Empirically, we demonstrate that SPARE outperforms state-of-the-art methods by up to 21.1\% in worst-group accuracy, while being up to 12x faster. We also show the applicability of SPARE, as a highly effective but lightweight method, to discover spurious correlations.

**Density-Regression: Efficient and Distance-aware Deep Regressor for Uncertainty Estimation under Distribution Shifts**

Ha Manh Bui · Anqi Liu

Morden deep ensembles technique achieves strong uncertainty estimation performance by going through multiple forward passes with different models. This is at the price of a high storage space and a slow speed in the inference (test) time. To address this issue, we propose Density-Regression, a method that leverages the density function in uncertainty estimation and achieves fast inference by a single forward pass. We prove it is distance aware on the feature space, which is a necessary condition for a neural network to produce high-quality uncertainty estimation under distribution shifts. Empirically, we conduct experiments on regression tasks with the cubic toy dataset, benchmark UCI, weather forecast with time series, and depth estimation under real-world shifted applications. We show that Density-Regression has competitive uncertainty estimation performance under distribution shifts with modern deep regressors while using a lower model size and a faster inference speed.

**The Relative Gaussian Mechanism and its Application to Private Gradient Descent**

Hadrien Hendrikx · Paul Mangold · Aurélien Bellet

The Gaussian Mechanism (GM), which consists in adding Gaussian noise to a vector-valued query before releasing it, is a standard privacy protection mechanism. In particular, given that the query respects some L2 sensitivity property (the L2 distance between outputs on any two neighboring inputs is bounded), GM guarantees Rényi Differential Privacy (RDP). Unfortunately, precisely bounding the L2 sensitivity can be hard, thus leading to loose privacy bounds. In this work, we consider a Relative L2 sensitivity assumption, in which the bound on the distance between two query outputs may also depend on their norm. Leveraging this assumption, we introduce the Relative Gaussian Mechanism (RGM), in which the variance of the noise depends on the norm of the output. We prove tight bounds on the RDP parameters under relative L2 sensitivity, and characterize the privacy loss incurred by using output-dependent noise. In particular, we show that RGM naturally adapts to a latent variable that would control the norm of the output. Finally, we instantiate our framework to show tight guarantees for Private Gradient Descent, a problem that naturally fits our relative L2 sensitivity assumption.

**Approximate Leave-one-out Cross Validation for Regression with l1 Regularizers**

Arnab Auddy · Haolin Zou · Kamiar Rahnama Rad · Arian Maleki

The out-of-sample error (OO) is the main quantity of interest in risk estimation and model selection. Leave-one-out cross validation (LO) offers a (nearly) distribution-free yet computationally demanding approach to estimate OO. Recent theoretical work showed that approximate leave-one-out cross validation (ALO) is an efficient estimate of LO (and OO) for generalized linear models with differentiable regularizers. For problems involving non-differentiable regularizers, despite significant empirical evidence, the theoretical understanding of ALO's error remains unknown. In this paper, we present a novel theory for a wide class of problems in the generalized linear model family with non-differentiable regularizers. We bound the error \(|{\rm ALO}-{\rm LO}|\) in terms of intuitive metrics such as the size of leave-\(i\)-out perturbations in active sets, sample size $n$, number of features $p$ and signal-to-noise ratio (SNR). As a consequence, for the elastic-net problem, we show that $|{\rm ALO}-{\rm LO}| \xrightarrow{p\rightarrow \infty} 0$ while $n/p$ and SNR remain bounded.

**Formal Verification of Unknown Stochastic Systems via Non-parametric Estimation**

Zhi Zhang · Chenyu Ma · Saleh Soudijani · Sadegh Soudjani

A novel data-driven method for formal verification is proposed to study complex systems operating in safety-critical domains. The proposed approach is able to formally verify discrete-time stochastic dynamical systems against temporal logic specifications only using observation samples and without the knowledge of the model, and provide a probabilistic guarantee on the satisfaction of the specification. We first propose the theoretical results for using non-parametric estimation to estimate an asymptotic upper bound for the \emph{Lipschitz constant} of the stochastic system, which can determine a finite abstraction of the system. Our results prove that the asymptotic convergence rate of the estimation is $O(n^{-\frac{1}{3+d}})$, where $d$ is the dimension of the system and n is the data scale. We then construct interval Markov decision processes using two different data-driven methods, namely non-parametric estimation and empirical estimation of transition probabilities, to perform formal verification against a given temporal logic specification. Multiple case studies are presented to validate the effectiveness of the proposed methods.

**Implicit Bias in Noisy-SGD: With Applications to Differentially Private Training**

Tom Sander · Maxime Sylvestre · Alain Durmus

Training Deep Neural Networks (DNNs) with small batches using Stochastic Gradient Descent (SGD) often results in superior test performance compared to larger batches. This implicit bias is attributed to the specific noise structure inherent to SGD. When ensuring Differential Privacy (DP) in DNNs' training, DP-SGD adds Gaussian noise to the clipped gradients. However, large-batch training still leads to a significant performance decrease, posing a challenge as strong DP guarantees necessitate the use of massive batches.Our study first demonstrates that this phenomenon extends to Noisy-SGD (DP-SGD without clipping), suggesting that the stochasticity, not the clipping, is responsible for this implicit bias, even with additional isotropic Gaussian noise. We then theoretically analyze the solutions obtained with continuous versions of Noisy-SGD for the Linear Least Square and Diagonal Linear Network settings. Our analysis reveals that the additional noise indeed amplifies the implicit bias.It suggests that the performance issues of private training stem from the same underlying principles as SGD, offering hope for improvements in large batch training strategies.

**SIFU: Sequential Informed Federated Unlearning for Efficient and Provable Client Unlearning in Federated Optimization**

Yann Fraboni · Martin Van Waerebeke · Kevin Scaman · Richard Vidal · Laetitia Kameni · Marco Lorenzi

Machine Unlearning (MU) is an increasingly important topic in machine learning safety, aiming at removing the contribution of a given data point from a training procedure. Federated Unlearning (FU) consists in extending MU to unlearn a given client’s contribution from a federated training routine. While several FU methods have been proposed, we currently lack a general approach providing formal unlearning guarantees to the FedAvg routine, while ensuring scalability and generalization beyond the convex assumption on the clients’ loss functions. We aim at filling this gap by proposing SIFU (Sequential Informed Federated Unlearning), a new FU method applying to both convex and non-convex optimization regimes. SIFU naturally applies to FedAvg without additional computational cost for the clients and provides formal guarantees on the quality of the unlearning task. We provide a theoretical analysis of the unlearning properties of SIFU, and practically demonstrate its effectiveness as compared to a panel of unlearning methods from the state-of-the-art.

**SVARM-IQ: Efficient Approximation of Any-order Shapley Interactions through Stratification**

Patrick Kolpaczki · Maximilian Muschalik · Fabian Fumagalli · Barbara Hammer · Eyke Hüllermeier

Addressing the limitations of individual attribution scores via the Shapley value (SV), the field of explainable AI (XAI) has recently explored intricate interactions of features or data points. In particular, extensions of the SV, such as the Shapley Interaction Index (SII), have been proposed as a measure to still benefit from the axiomatic basis of the SV. However, similar to the SV, their exact computation remains computationally prohibitive. Hence, we propose with SVARM-IQ a sampling-based approach to efficiently approximate Shapley-based interaction indices of any order. SVARM-IQ can be applied to a broad class of interaction indices, including the SII, by leveraging a novel stratified representation. We provide non-asymptotic theoretical guarantees on its approximation quality and empirically demonstrate that SVARM-IQ achieves state-of-the-art estimation results in practical XAI scenarios on different model classes and application domains.

**Differentially Private Conditional Independence Testing**

Iden Kalemaj · Shiva Kasiviswanathan · Aaditya Ramdas

Conditional independence (CI) tests are widely used in statistical data analysis, e.g., they are the building block of many algorithms for causal graph discovery. The goal of a CI test is to accept or reject the null hypothesis that $X \perp \!\!\! \perp Y \mid Z$, where $X \in \mathbb{R}, Y \in \mathbb{R}, Z \in \mathbb{R}^d$. In this work, we investigate conditional independence testing under the constraint of differential privacy. We design two private CI testing procedures: one based on the generalized covariance measure of Shah and Peters (2020) and another based on the conditional randomization test of Cand{\`e}s et al. (2016) (under the model-X assumption). We provide theoretical guarantees on the performance of our tests and validate them empirically. These are the first private CI tests with rigorous theoretical guarantees that work for the general case when $Z$ is continuous.

**Fair Machine Unlearning: Data Removal while Mitigating Disparities**

Alex Oesterling · Jiaqi Ma · Flavio Calmon · Himabindu Lakkaraju

The Right to be Forgotten is a core principle outlined by regulatory frameworks such as the EU's General Data Protection Regulation (GDPR). This principle allows individuals to request that their personal data be deleted from deployed machine learning models. While "forgetting" can be naively achieved by retraining on the remaining dataset, it is computationally expensive to do to so with each new request. As such, several machine unlearning methods have been proposed as efficient alternatives to retraining. These methods aim to approximate the predictive performance of retraining, but fail to consider how unlearning impacts other properties critical to real-world applications such as fairness. In this work, we demonstrate that most efficient unlearning methods cannot accommodate popular fairness interventions, and we propose the first fair machine unlearning method that can efficiently unlearn data instances from a fair objective. We derive theoretical results which demonstrate that our method can provably unlearn data and provably maintain fairness performance. Extensive experimentation with real-world datasets highlight the efficacy of our method at unlearning data instances while preserving fairness. Code is provided at https://github.com/AI4LIFE-GROUP/fair-unlearning.

**On the Effect of Key Factors in Spurious Correlation: A theoretical Perspective**

Yipei Wang · Xiaoqian Wang

Spurious correlations arise when irrelevant patterns in input data are mistakenly associated with labels, compromising the generalizability of machine learning models. While these models may be confident during the training stage, they often falter in real-world testing scenarios due to the shift of these misleading correlations. Current solutions to this problem typically involve altering the correlations or regularizing latent representations. However, while these methods show promise in experiments, a rigorous theoretical understanding of their effectiveness and the underlying factors of spurious correlations is lacking. In this work, we provide a comprehensive theoretical analysis, supported by empirical evidence, to understand the intricacies of spurious correlations. Drawing on our proposed theorems, we investigate the behaviors of classifiers when confronted with spurious features, and present our findings on how various factors influence these correlations and their impact on model performances, including the Mahalanobis distance of groups, and training/testing spurious correlation ratios. Additionally, by aligning empirical outcomes with our theoretical discoveries, we highlight the feasibility of assessing the degree of separability of intertwined real-world features. This research paves the way for a nuanced comprehension of spurious correlations, laying a solid theoretical groundwork that promises to steer future endeavors toward crafting more potent mitigation techniques.

**How does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability**

Jorge García-Carrasco · Alejandro Maté · Juan Carlos Trujillo

Transformer-based language models are treated as black-boxes because of their large number of parameters and complex internal interactions, which is a serious safety concern. Mechanistic Interpretability (MI) intends to reverse-engineer neural network behaviors in terms of human-understandable components. In this work, we focus on understanding how GPT-2 Small performs the task of predicting three-letter acronyms. Previous works in the MI field have focused so far on tasks that predict a single token. To the best of our knowledge, this is the first work that tries to mechanistically understand a behavior involving the prediction of multiple consecutive tokens. We discover that the prediction is performed by a circuit composed of 8 attention heads (${\sim}5\%$ of the total heads) which we classified in three groups according to their role. We also demonstrate that these heads concentrate the acronym prediction functionality. In addition, we mechanistically interpret the most relevant heads of the circuit and find out that they use positional information which is propagated via the causal mask mechanism. We expect this work to lay the foundation for understanding more complex behaviors involving multiple-token predictions.

**Learning Under Random Distributional Shifts**

Kirk Bansak · Elisabeth Paulson · Dominik Rothenhaeusler

Many existing approaches for generating predictions in settings with distribution shift model distribution shifts as adversarial or low-rank in suitable representations. In various real-world settings, however, we might expect shifts to arise through the superposition of many small and random changes in the population and environment. Thus, we consider a class of random distribution shift models that capture arbitrary changes in the underlying covariate space, and dense, random shocks to the relationship between the covariates and the outcomes. In this setting, we characterize the benefits and drawbacks of several alternative prediction strategies: the standard approach that directly predicts the long-term outcomes of interest, the proxy approach that directly predicts shorter-term proxy outcomes, and a hybrid approach that utilizes both the long-term policy outcome and (shorter-term) proxy outcome(s). We show that the hybrid approach is robust to the strength of the distribution shift and the proxy relationship. We apply this method to datasets in two high-impact domains: asylum-seeker placement and early childhood education. In both settings, we find that the proposed approach results in substantially lower mean-squared error than current approaches.

**Towards Costless Model Selection in Contextual Bandits: A Bias-Variance Perspective**

Sanath Kumar Krishnamurthy · Adrienne Propp · Susan Athey

Model selection in supervised learning provides costless guarantees as if the model that best balances bias and variance was known a priori. We study the feasibility of similar guarantees for cumulative regret minimization in the stochastic contextual bandit setting. Recent work [Marinov and Zimmert, 2021] identifies instances where no algorithm can guarantee costless regret bounds. Nevertheless, we identify benign conditions where costless model selection is feasible: gradually increasing class complexity, and diminishing marginal returns for best-in-class policy value with increasing class complexity. Our algorithm is based on a novel misspecification test, and our analysis demonstrates the benefits of using model selection for reward estimation. Unlike prior work on model selection in contextual bandits, our algorithm carefully adapts to the evolving bias-variance trade-off as more data is collected. In particular, our algorithm and analysis go beyond adapting to the complexity of the simplest realizable class and instead adapt to the complexity of the simplest class whose estimation variance dominates the bias. For short horizons, this provides improved regret guarantees that depend on the complexity of simpler classes.

**NoisyMix: Boosting Model Robustness to Common Corruptions**

N. Benjamin Erichson · Soon Hoe Lim · Winnie Xu · Francisco Utrera · Ziang Cao · Michael Mahoney

The robustness of neural networks has become increasingly important in real-world applications where stable and reliable performance is valued over simply achieving high predictive accuracy. To address this, data augmentation techniques have been shown to improve robustness against input perturbations and domain shifts. In this paper, we propose a new training scheme called NoisyMix that leverages noisy augmentations in both input and feature space to improve model robustness and in-domain accuracy. We demonstrate the effectiveness of NoisyMix on several benchmark datasets, including ImageNet-C, ImageNet-R, and ImageNet-P. Additionally, we provide theoretical analysis to better understand the implicit regularization and robustness properties of NoisyMix.

**On the Vulnerability of Fairness Constrained Learning to Malicious Noise**

Avrim Blum · Princewill Okoroafor · Aadirupa Saha · Kevin Stangl

We consider the vulnerability of fairness-constrained learning to small amounts of malicious noise in the training data. [Konstantinov and Lampert, 2021] initiated the study of this question and presented negative results showing there exist data distributions where for several fairness constraints, any proper learner will exhibit high vulnerability when group sizes are imbalanced. Here, we present a more optimistic view, showing that if we allow randomized classifiers, then the landscape is much more nuanced. For example, for Demographic Parity we show we can incur only a $\Theta(\alpha)$ loss in accuracy, where $\alpha$ is the malicious noise rate, matching the best possible even without fairness constraints. For Equal Opportunity, we show we can incur an $O(\sqrt{\alpha})$ loss, and give a matching $\Omega(\sqrt{\alpha})$ lower bound. In contrast, [Konstantinov and Lampert, 2021] showed for proper learners the loss in accuracy for both notions is $\Omega(1)$. The key technical novelty of our work is how randomization can bypass simple "tricks" an adversary can use to amplify his power. We also consider additional fairness notions including Equalized Odds and Calibration. For these fairness notions, the excess accuracy clusters into three natural regimes $O(\alpha)$, $O(\sqrt{\alpha})$, and $O(1)$. These results provide a more fine-grained view of the sensitivity of fairness-constrained learning to adversarial noise in training data.

**Private Learning with Public Features**

Walid Krichene · Nicolas Mayoraz · Steffen Rendle · Shuang Song · Abhradeep Thakurta · Li Zhang

We study a class of private learning problems in which the data is a join of private and public features. This is often the case in private personalization tasks such as recommendation or ad prediction, in which features related to individuals are sensitive, while features related to items (the movies or songs to be recommended, or the ads to be shown to users) are publicly available and do not require protection. A natural question is whether private algorithms can achieve higher utility in the presence of public features. We give a positive answer for multi-encoder models where one of the encoders operates on public features. We develop new algorithms that take advantage of this separation by only protecting certain sufficient statistics (instead of adding noise to the gradient). This method has a guaranteed utility improvement for linear regression, and importantly, achieves the state of the art on two standard private recommendation benchmarks, demonstrating the importance of methods that adapt to the private-public feature separation.

**Computing epidemic metrics with edge differential privacy**

George Li · Dung Nguyen · Anil Vullikanti

Metrics such as the outbreak size in an epidemic process on a network are fundamental quantities used in public health analyses.The datasets used in such models used in practice, e.g., the contact network and disease states, are sensitive in many settings.We study the complexity of computing epidemic outbreak size within a given time horizon, under edge differential privacy.These quantities have high sensitivity, and we show that giving algorithms with good utility guarantees is impossible for general graphs. To address these hardness results, we consider a smaller class of graphs with similar properties as social networks (called expander graphs) and give a polynomial-time algorithm with strong utility guarantees. Our results are the first to give any non-trivial guarantees for differentially private infection size estimation.

**Improving Robustness via Tilted Exponential Layer: A Communication-Theoretic Perspective**

Bhagyashree Puranik · Ahmad Beirami · Yao Qin · Upamanyu Madhow

State-of-the-art techniques for enhancing robustness of deep networks mostly rely on empirical risk minimization with suitable data augmentation. In this paper, we propose a complementary approach motivated by communication theory, aimed at enhancing the signal-to-noise ratio at the output of a neural network layer via neural competition during learning and inference. In addition to standard empirical risk minimization, neurons compete to sparsely represent layer inputs by maximization of a tilted exponential (TEXP) objective function for the layer. TEXP learning can be interpreted as maximum likelihood estimation of matched filters under a Gaussian model for data noise. Inference in a TEXP layer is accomplished by replacing batch norm by a tilted softmax, which can be interpreted as computation of posterior probabilities for the competing signaling hypotheses represented by each neuron. After providing insights via simplified models, we show, by experimentation on standard image datasets, that TEXP learning and inference enhances robustness against noise and other common corruptions, without requiring data augmentation. Further cumulative gains in robustness against this array of distortions can be obtained by appropriately combining TEXP with data augmentation techniques. The code for all our experiments is available at \url{https://github.com/bhagyapuranik/texp*for*robustness}.

**Non-vacuous Generalization Bounds for Adversarial Risk in Stochastic Neural Networks**

Waleed Mustafa · Philipp Liznerski · Antoine Ledent · Dennis Wagner · Puyu Wang · Marius Kloft

Adversarial examples are manipulated samples used to deceive machine learning models, posing a serious threat in safety-critical applications. Existing safety certificates for machine learning models are limited to individual input examples, failing to capture generalization to unseen data. To address this limitation, we propose novel generalization bounds based on the PAC-Bayesian and randomized smoothing frameworks, providing certificates that predict the model's performance and robustness on unseen test samples based solely on the training data. We present an effective procedure to train and compute the first non-vacuous generalization bounds for neural networks in adversarial settings. Experimental results on the widely recognized MNIST and CIFAR-10 datasets demonstrate the efficacy of our approach, yielding the first robust risk certificates for stochastic convolutional neural networks under the $L_2$ threat model. Our method offers valuable tools for evaluating model susceptibility to real-world adversarial risks.

**Pathwise Explanation of ReLU Neural Networks**

Seongwoo Lim · Won Jo · Joohyung Lee · Jaesik Choi

Neural networks have demonstrated a wide range of successes, but their ``black box" nature raises concerns about transparency and reliability. Previous research on ReLU networks has sought to unwrap these networks into linear models based on activation states of all hidden units. In this paper, we introduce a novel approach that considers subsets of the hidden units involved in the decision making path. This pathwise explanation provides a clearer and more consistent understanding of the relationship between the input and the decision-making process. Our method also offers flexibility in adjusting the range of explanations within the input, i.e., from an overall attribution input to particular components within the input. Furthermore, it allows for the decomposition of explanations for a given input for more detailed explanations. Our experiments demonstrate that the proposed method outperforms existing methods both quantitatively and qualitatively.

**Provable Mutual Benefits from Federated Learning in Privacy-Sensitive Domains**

Nikita Tsoy · Anna Mihalkova · Teodora Todorova · Nikola Konstantinov

Cross-silo federated learning (FL) allows data owners to train accurate machine learning models by benefiting from each others private datasets. Unfortunately, the model accuracy benefits of collaboration are often undermined by privacy defenses. Therefore, to incentivize client participation in privacy-sensitive domains, a FL protocol should strike a delicate balance between privacy guarantees and end-model accuracy. In this paper, we study the question of when and how a server could design a FL protocol provably beneficial for all participants. First, we provide necessary and sufficient conditions for the existence of mutually beneficial protocols in the context of mean estimation and convex stochastic optimization. We also derive protocols that maximize the total clients' utility, given symmetric privacy preferences. Finally, we design protocols maximizing end-model accuracy and demonstrate their benefits in synthetic experiments.

**Mitigating Underfitting in Learning to Defer with Consistent Losses**

Shuqi Liu · Yuzhou Cao · Qiaozhen Zhang · Lei Feng · Bo An

Learning to defer (L2D) allows the classifier to defer its prediction to an expert for safer predictions, by balancing the system's accuracy and extra costs incurred by consulting the expert. Various loss functions have been proposed for L2D, but they were shown to cause the underfitting of trained classifiers when extra consulting costs exist, resulting in degraded performance. In this paper, we propose a novel loss formulation that can mitigate the underfitting issue while remaining the statistical consistency. We first show that our formulation can avoid a common characteristic shared by most existing losses, which has been shown to be a cause of underfitting, and show that it can be combined with the representative losses for L2D to enhance their performance and yield consistent losses. We further study the regret transfer bounds of the proposed losses and experimentally validate its improvements over existing methods.

**Efficient Conformal Prediction under Data Heterogeneity**

Vincent Plassier · Nikita Kotelevskii · Aleksandr Rubashevskii · Fedor Noskov · Maksim Velikanov · Alexander Fishkov · Samuel Horvath · Martin Takac · Eric Moulines · Maxim Panov

Conformal prediction (CP) stands out as a robust framework for uncertainty quantification, which is crucial for ensuring the reliability of predictions. However, common CP methods heavily rely on the data exchangeability, a condition often violated in practice. Existing approaches for tackling non-exchangeability lead to methods that are not computable beyond the simplest examples. In this work, we introduce a new efficient approach to CP that produces provably valid confidence sets for fairly general non-exchangeable data distributions.We illustrate the general theory with applications to the challenging setting of federated learning under data heterogeneity between agents. Our method allows constructing provably valid personalized prediction sets for agents in a fully federated way. The effectiveness of the proposed method is demonstrated in a series of experiments on real-world datasets.

**Bandit Pareto Set Identification: the Fixed Budget Setting**

Cyrille Kone · Emilie Kaufmann · Laura Richert

We study a multi-objective pure exploration problem in a multi-armed bandit model. Each arm is associated to an unknown multi-variate distribution and the goal is to identify the distributions whose mean is not uniformly worse than that of another distribution: the Pareto optimal set. We propose and analyze the first algorithms for the \emph{fixed budget} Pareto Set Identification task. We propose Empirical Gap Elimination, a family of algorithms combining a careful estimation of the ``hardness to classify'' each arm in or out of the Pareto set with a generic elimination scheme. We prove that two particular instances, EGE-SR and EGE-SH, have a probability of error that decays exponentially fast with the budget, with an exponent supported by an information theoretic lower-bound. We complement these findings with an empirical study using real-world and synthetic datasets, which showcase the good performance of our algorithms.

**Restricted Isometry Property of Rank-One Measurements with Random Unit-Modulus Vectors**

Wei Zhang · Zhenni Wang

The restricted isometry property (RIP) is essential for the linear map to guarantee the successful recovery of low-rank matrices. The existing works show that the linear map generated by the measurement matrices with independent and identically distributed (i.i.d.) entries satisfies RIP with high probability. However, when dealing with non-i.i.d. measurement matrices, such as the rank-one measurements, the RIP compliance may not be guaranteed. In this paper, we show that the RIP can still be achieved with high probability, when the rank-one measurement matrix is constructed by the random unit-modulus vectors. Compared to the existing works, we first address the challenge of establishing RIP for the linear map in non-i.i.d. scenarios. As validated in the experiments, this linear map is memory-efficient, and not only satisfies the RIP but also exhibits similar recovery performance of the low-rank matrices to that of conventional i.i.d. measurement matrices.

Traditional supervised learning aims to learn an unknown mapping by fitting a function to a set of input-output pairs with a fixed dimension. The fitted function is then defined on inputs of the same dimension. However, in many settings, the unknown mapping takes inputs in any dimension; examples include graph parameters defined on graphs of any size and physics quantities defined on an arbitrary number of particles. We leverage a newly-discovered phenomenon in algebraic topology, called representation stability, to define equivariant neural networks that can be trained with data in a fixed dimension and then extended to accept inputs in any dimension. Our approach is black-box and user-friendly, requiring only the network architecture and the groups for equivariance, and can be combined with any training procedure. We provide a simple open-source implementation of our methods and offer preliminary numerical experiments.

**Multivariate Time Series Forecasting By Graph Attention Networks With Theoretical Guarantees**

Zhi Zhang · Weijian Li · Han Liu

Multivariate time series forecasting (MTSF) aims to predict future values of multiple variables based on past values of multivariate time series, and has been applied in fields including traffic flow prediction, stock price forecasting, and anomaly detection. Capturing the inter-dependencies among multiple series poses one significant challenge to MTSF. Recent works have considered modeling the correlated series as graph nodes and using graph neural network (GNN)-based approaches with attention mechanisms added to improve the test prediction accuracy, however, none of them have theoretical guarantees regarding the generalization error. In this paper, we develop a new norm-bounded graph attention network (GAT) for MTSF by upper-bounding the Frobenius norm of weights in each layer of the GAT model to enhance performance. We theoretically establish that the generalization error bound for our model is associated with various components of GAT models: the number of attention heads, the maximum number of neighbors, the upper bound of the Frobenius norm of the weight matrix in each layer, and the norm of the input features. Empirically, we investigate the impact of different components of GAT models on the generalization performance of MTSF on real data. Our experiment verifies our theoretical findings. We compare with multiple prior frequently cited graph-based methods for MTSF using real data sets and the experiment results show our method can achieve the best performance for MTSF. Our method provides novel perspectives for improving the generalization performance of MTSF, and our theoretical guarantees give substantial implications for designing graph-based methods with attention mechanisms for MTSF.

**Spectrum Extraction and Clipping for Implicitly Linear Layers**

Ali Ebrahimpour Boroojeny · Matus Telgarsky · Hari Sundaram

We show the effectiveness of automatic differentiation in efficiently and correctly computing and controlling the spectrum of implicitly linear operators, a rich family of layer types including all standard convolutional and dense layers. We provide the first clipping method which is correct for general convolution layers, and illuminate the representational limitation that caused correctness issues in prior work. We study the effect of the batch normalization layers when concatenated with convolutional layers and show how our clipping method can be applied to their composition. By comparing the accuracy and performance of our algorithms to the state-of-the-art methods, using various experiments, we show they are more precise and efficient and lead to better generalization and adversarial robustness. We provide the code for using our methods at https://github.com/Ali-E/FastClip.

**Preventing Arbitrarily High Confidence on Far-Away Data in Point-Estimated Discriminative Neural Networks**

Ahmad Rashid · Serena Hacker · Guojun Zhang · · Pascal Poupart

Discriminatively trained, deterministic neural networks are the de facto choice for classification problems. However, even though they achieve state-of-the-art results on in-domain test sets, they tend to be overconfident on out-of-distribution (OOD) data. For instance, ReLU networks---a popular class of neural network architectures---have been shown to almost always yield high confidence predictions when the test data are far away from the training set, even when they are trained with OOD data. We overcome this problem by adding a term to the output of the neural network that corresponds to the logit of an extra class, that we design to dominate the logits of the original classes as we move away from the training data. This technique provably prevents arbitrarily high confidence on far-away test data while maintaining a simple discriminative point-estimate training. Evaluation on various benchmarks demonstrates strong performance against competitive baselines on both far-away and realistic OOD data.

**Euclidean, Projective, Conformal: Choosing a Geometric Algebra for Equivariant Transformers**

Pim de Haan · Taco Cohen · Johann Brehmer

The Geometric Algebra Transformer (GATr) is a versatile architecture for geometric deep learning based on projective geometric algebra. We generalize this architecture into a blueprint that allows one to construct a scalable transformer architecture given any geometric (or Clifford) algebra. We study versions of this architecture for Euclidean, projective, and conformal algebras, all of which are suited to represent 3D data, and evaluate them in theory and practice. The simplest Euclidean architecture is computationally cheap, but has a smaller symmetry group and is not as sample-efficient, while the projective model is not sufficiently expressive. Both the conformal algebra and an improved version of the projective algebra define powerful, performant architectures.

**Causally Inspired Regularization Enables Domain General Representations**

Olawale Salaudeen · Sanmi Koyejo

Given a causal graph representing the data-generating process shared across different domains/distributions, enforcing sufficient graph-implied conditional independencies can identify domain-general (non-spurious) feature representations. For the standard input-output predictive setting, we categorize the set of graphs considered in the literature into two distinct groups: (i) those in which the empirical risk minimizer across training domains gives domain-general representations and (ii) those where it does not. For the latter case (ii), we propose a novel framework with regularizations, which we demonstrate are sufficient for identifying domain-general feature representations without a priori knowledge (or proxies) of the spurious features. Empirically, our proposed method is effective for both (semi) synthetic and real-world data, outperforming other state-of-the-art methods in average and worst-domain transfer accuracy.

**XB-MAML: Learning Expandable Basis Parameters for Effective Meta-Learning with Wide Task Coverage**

Jae-Jun Lee · SUNG WHAN YOON

Meta-learning, which pursues an effective initialization model, has emerged as a promising approach to handling unseen tasks. However, a limitation remains to be evident when a meta-learner tries to encompass a wide range of task distribution, e.g., learning across distinctive datasets or domains. Recently, a group of works has attempted to employ multiple model initializations to cover widely-ranging tasks, but they are limited in adaptively expanding initializations. We introduce XB-MAML, which learns expandable basis parameters, where they are linearly combined to form an effective initialization to a given task. XB-MAML observes the discrepancy between the vector space spanned by the basis and fine-tuned parameters to decide whether to expand the basis. Our method surpasses the existing works in the multi-domain meta-learning benchmarks and opens up new chances of meta-learning for obtaining the diverse inductive bias that can be combined to stretch toward the effective initialization for diverse unseen tasks.

**SDMTR: A Brain-inspired Transformer for Relation Inference**

Xiangyu Zeng · jie lin · Piao Hu · li zhihao · Tianxi Huang

Deep learning has seen a movement towards the concepts of modularity, module coordination and sparse interactions to fit the working principles of biological systems. Inspired by Global Workspace Theory and long-term memory system in human brain, both are instrumental in constructing biologically plausible artificial intelligence systems, we introduce the shared dual-memory Transformers (SDMTR)— a model that builds upon Transformers. The proposed approach includes the shared long-term memory and workspace with finite capacity in which different specialized modules compete to write information. Later, crucial information from shared workspace is inscribed into long-term memory through outer product attention mechanism to reduce information conflict and build a knowledge reservoir, thereby facilitating subsequent inference, learning and problem-solving. We apply SDMTR to multi-modality question-answering and reasoning challenges, including text-based bAbI-20k, visual Sort-of-CLEVR and triangle relations detection tasks. The results demonstrate that our SDMTR significantly outperforms the vanilla Transformer and its recent improvements. Additionally, visualization analyses indicate that the presence of memory positively correlates with model effectiveness on inference tasks. This research provides novel insights and empirical support to advance biologically plausible deep learning frameworks.

**Towards Convergence Rates for Parameter Estimation in Gaussian-gated Mixture of Experts**

Huy Nguyen · TrungTin Nguyen · Khai Nguyen · Nhat Ho

Originally introduced as a neural network for ensemble learning, mixture of experts (MoE) has recently become a fundamental building block of highly successful modern deep neural networks for heterogeneous data analysis in several applications of machine learning and statistics. Despite its popularity in practice, a satisfactory level of theoretical understanding of the MoE model is far from complete. To shed new light on this problem, we provide a convergence analysis for maximum likelihood estimation (MLE) in the Gaussian-gated MoE model. The main challenge of that analysis comes from the inclusion of covariates in the Gaussian gating functions and expert networks, which leads to their intrinsic interaction via some partial differential equations with respect to their parameters. We tackle these issues by designing novel Voronoi loss functions among parameters to accurately capture the heterogeneity of parameter estimation rates. Our findings reveal that the MLE has distinct behaviors under two complement settings of location parameters of the Gaussian gating functions, namely when all these parameters are non-zero versus when at least one among them vanishes. Notably, these behaviors can be characterized by the solvability of two different systems of polynomial equations. Finally, we conduct a simulation study to empirically verify our theoretical results.

**FairRR: Pre-Processing for Group Fairness through Randomized Response**

Joshua Ward · Xianli Zeng · Guang Cheng

The increasing usage of machine learning models in consequential decision-making processes has spurred research into the fairness of these systems. While significant work has been done to study group fairness in the in-processing and post-processing setting, there has been little that theoretically connects these results to the pre-processing domain. This paper extends recent fair statistical learning results and proposes that achieving group fairness in downstream models can be formulated as finding the optimal design matrix in which to modify a response variable in a Randomized Response framework. We show that measures of group fairness can be directly controlled for with optimal model utility, proposing a pre-processing algorithm called FairRR that yields excellent downstream model utility and fairness.

**Asymptotic Characterisation of the Performance of Robust Linear Regression in the Presence of Outliers**

Matteo Vilucchio · Emanuele Troiani · Vittorio Erba · FLORENT KRZAKALA

We study robust linear regression in high-dimension, when both the dimension $d$ and the number of data points $n$ diverge with a fixed ratio $\alpha=n/d$, and study a data model that includes outliers. We provide exact asymptotics for the performances of the empirical risk minimisation (ERM) using $\ell_2$-regularised $\ell_2$, $\ell_1$, and Huber losses, which are the standard approach to such problems. We focus on two metrics for the performance: the generalisation error to similar datasets with outliers, and the estimation error of the original, unpolluted function. Our results are compared with the information theoretic Bayes-optimal estimation bound. For the generalization error, we find that optimally-regularised ERM is asymptotically consistent in the large sample complexity limit if one perform a simple calibration, and compute the rates of convergence. For the estimation error however, we show that due to a norm calibration mismatch, the consistency of the estimator requires an oracle estimate of the optimal norm, or the presence of a cross-validation set not corrupted by the outliers. We examine in detail how performance depends on the loss function and on the degree of outlier corruption in the training set and identify a region of parameters where the optimal performance of the Huber loss is identical to that of the $\ell_2$ loss, offering insights into the use cases of different loss functions.