### Poster Session

## Poster Session 2

##### Auditorium 1 Foyer

**Learning Safety Constraints from Demonstrations with Unknown Rewards**

David Lindner · Xin Chen · Sebastian Tschiatschek · Katja Hofmann · Andreas Krause

We propose Convex Constraint Learning for Reinforcement Learning (CoCoRL), a novel approach for inferring shared constraints in a Constrained Markov Decision Process (CMDP) from a set of safe demonstrations with possibly different reward functions. While previous work is limited to demonstrations with known rewards or fully known environment dynamics, CoCoRL can learn constraints from demonstrations with different unknown rewards without knowledge of the environment dynamics. CoCoRL constructs a convex safe set based on demonstrations, which provably guarantees safety even for potentially sub-optimal (but safe) demonstrations. For near-optimal demonstrations, CoCoRL converges to the true safe set with no policy regret. We evaluate CoCoRL in gridworld environments and a driving simulation with multiple constraints. CoCoRL learns constraints that lead to safe driving behavior. Importantly, we can safely transfer the learned constraints to different tasks and environments. In contrast, alternative methods based on Inverse Reinforcement Learning (IRL) often exhibit poor performance and learn unsafe policies.

**Equivariant bootstrapping for uncertainty quantification in imaging inverse problems**

Marcelo Pereyra · Julian Tachella

Scientific imaging problems are often severely ill-posed, and hence have significant intrinsic uncertainty. Accurately quantifying the uncertainty in the solutions to such problems is therefore critical for the rigorous interpretation of experimental results as well as for reliably using the reconstructed images as scientific evidence. Unfortunately, existing imaging methods are unable to quantify the uncertainty in the reconstructed images in a manner that is robust to experiment replications. This paper presents a new uncertainty quantification methodology based on an equivariant formulation of the parametric bootstrap algorithm that leverages symmetries and invariance properties commonly encountered in imaging problems. Additionally, the proposed methodology is general and can be easily applied with any image reconstruction technique, including unsupervised training strategies that can be trained from observed data alone, thus enabling uncertainty quantification in situations where there is no ground truth data available. We demonstrate the proposed approach with a series of numerical experiments and through comparisons with alternative uncertainty quantification strategies from the state-of-the-art, such as Bayesian strategies involving score-based diffusion models and Langevin samplers. In all our experiments, the proposed method delivers remarkably accurate high-dimensional confidence regions and outperforms the competing approaches in terms of estimation accuracy, uncertainty quantification accuracy, and computing time.

**Deep Dependency Networks and Advanced Inference Schemes for Multi-Label Classification**

Shivvrat Arya · Yu Xiang · Vibhav Gogate

We present a unified framework called deep dependency networks (DDNs) that combines dependency networks and deep learning architectures for multi-label classification, with a particular emphasis on image and video data. The primary advantage of dependency networks is their ease of training, in contrast to other probabilistic graphical models like Markov networks. In particular, when combined with deep learning architectures, they provide an intuitive, easy-to-use loss function for multi-label classification. A drawback of DDNs compared to Markov networks is their lack of advanced inference schemes, necessitating the use of Gibbs sampling. To address this challenge, we propose novel inference schemes based on local search and integer linear programming for computing the most likely assignment to the labels given observations. We evaluate our novel methods on three video datasets (Charades, TACoS, Wetlab) and three image datasets (MS-COCO, PASCAL VOC, NUS-WIDE), comparing their performance with (a) basic neural architectures and (b) neural architectures combined with Markov networks equipped with advanced inference and learning techniques. Our results demonstrate the superiority of our new DDN methods over the two competing approaches.

**Equation Discovery with Bayesian Spike-and-Slab Priors and Efficient Kernels**

Da Long · Wei Xing · Aditi Krishnapriyan · Robert Kirby · Shandian Zhe · Michael Mahoney

Discovering governing equations from data is important to many scientific and engineering applications. Despite promising successes, existing methods are still challenged by data sparsity and noise issues, both of which are ubiquitous in practice. Moreover, state-of-the-art methods lack uncertainty quantification and/or are costly in training. To overcome these limitations, we propose a novel equation discovery method based on Kernel learning and BAyesian Spike-and-Slab priors (KBASS). We use kernel regression to estimate the target function, which is flexible, expressive, and more robust to data sparsity and noise. We combine it with a Bayesian spike-and-slab prior, an ideal Bayesian sparse distribution, for effective operator selection and uncertainty quantification. We develop an expectation-propagation expectation-maximization (EP-EM) algorithm for efficient posterior inference and function estimation. To overcome the computational challenge of kernel regression, we place the function values on a mesh and induce a Kronecker product construction, and we use tensor algebra to enable efficient computation and optimization. We show the advantages of KBASS on a list of benchmark ODE and PDE discovery tasks. The code is available at https://github.com/long-da/KBASS.

**Nonparametric Automatic Differentiation Variational Inference with Spline Approximation**

Yuda Shao · Shan Yu · Tianshu Feng

Automatic Differentiation Variational Inference (ADVI) is efficient in learning probabilistic models. Classic ADVI relies on the parametric approach to approximate the posterior. In this paper, we develop a spline-based nonparametric approximation approach that enables flexible posterior approximation for distributions with complicated structures, such as skewness, multimodality, and bounded support. Compared with widely-used nonparametric variational inference methods, the proposed method is easy to implement and adaptive to various data structures. By adopting the spline approximation, we derive a lower bound of the importance weighted autoencoder and establish the asymptotic consistency. Experiments demonstrate the efficiency of the proposed method in approximating complex posterior distributions and improving the performance of generative models with incomplete data.

Under losses which are potentially heavy-tailed, we consider the task of minimizing sums of the loss mean and standard deviation, without trying to accurately estimate the variance. By modifying a technique for variance-free robust mean estimation to fit our problem setting, we derive a simple learning procedure which can be easily combined with standard gradient-based solvers to be used in traditional machine learning workflows. Empirically, we verify that our proposed approach, despite its simplicity, performs as well as or better than even the best-performing candidates derived from alternative criteria such as CVaR or DRO risks on a variety of datasets.

**Best-of-Both-Worlds Algorithms for Linear Contextual Bandits**

Yuko Kuroki · Alberto Rumi · Taira Tsuchiya · Fabio Vitale · Nicolò Cesa-Bianchi

We study best-of-both-worlds algorithms for $K$-armed linear contextual bandits. Our algorithms deliver near-optimal regret bounds in both the adversarial and stochastic regimes, without prior knowledge about the environment. In the stochastic regime, we achieve the polylogarithmic rate $\frac{(dK)^2\poly\!\log(dKT)}{\Delta_{\min}}$, where $\Delta_{\min}$ is the minimum suboptimality gap over the $d$-dimensional context space. In the adversarial regime, we obtain either the first-order $\widetilde{\cO}(dK\sqrt{L^*})$ bound, or the second-order $\widetilde{\cO}(dK\sqrt{\Lambda^*})$ bound, where $L^*$ is the cumulative loss of the best action and $\Lambda^*$ is a notion of the cumulative second moment for the losses incurred by the algorithm. Moreover, we develop an algorithm based on FTRL with Shannon entropy regularizer that does not require the knowledge of the inverse of the covariance matrix, and achieves a polylogarithmic regret in the stochastic regime while obtaining $\widetilde\cO\big(dK\sqrt{T}\big)$ regret bounds in the adversarial regime.

**Fixed-Budget Real-Valued Combinatorial Pure Exploration of Multi-Armed Bandit**

Shintaro Nakamura · Masashi Sugiyama

We study the real-valued combinatorial pure exploration of the multi-armed bandit in the fixed-budget setting. We first introduce an algorithm named the Combinatorial Successive Assign (CSA) algorithm, which is the first algorithm that can identify the best action even when the size of the action class is exponentially large with respect to the number of arms. We show that the upper bound of the probability of error of the CSA algorithm matches a lower bound up to a logarithmic factor in the exponent. Then, we introduce another algorithm named the Minimax Combinatorial Successive Accepts and Rejects (Minimax-CombSAR) algorithm for the case where the size of the action class is polynomial, and show that it is optimal, which matches a lower bound. Finally, we experimentally compare the algorithms with previous methods and show that our algorithm performs better.

**Generalization Bounds for Label Noise Stochastic Gradient Descent**

Jung Eun Huh · Patrick Rebeschini

We develop generalization error bounds for stochastic gradient descent (SGD) with label noise in non-convex settings under uniform dissipativity and smoothness conditions. Under a suitable choice of semimetric, we establish a contraction in Wasserstein distance of the label noise stochastic gradient flow that depends polynomially on the parameter dimension $d$. Using the framework of algorithmic stability, we derive time-independent generalization error bounds for the discretized algorithm with a constant learning rate. The error bound we achieve scales polynomially with $d$ and with the rate of $n^{-2/3}$, where $n$ is the sample size. This rate is better than the best-known rate of $n^{-1/2}$ established for stochastic gradient Langevin dynamics (SGLD), which employs parameter-independent Gaussian noise, under similar conditions. Our analysis offers quantitative insights into the effect of label noise.

**Importance Matching Lemma for Lossy Compression with Side Information**

Truong Buu Phan · Ashish Khisti · Christos Louizos

We propose two extensions to existing importance sampling based methods for lossy compression. First, we introduce an importance sampling based compression scheme that is a variant of ordered random coding (Theis and Ahmed, 2022) and is amenable to direct evaluation of the achievable compression rate for a finite number of samples. Our second and major contribution is the *importance matching lemma*, which is a finite proposal counterpart of the recently introduced Poisson matching lemma (Li and Anantharam, 2021). By integrating with deep learning, we provide a new coding scheme for distributed lossy compression with side information at the decoder. We demonstrate the effectiveness of the proposed scheme through experiments involving synthetic Gaussian sources, distributed image compression with MNIST and vertical federated learning with CIFAR-10.

**Central Limit Theorem for Two-Timescale Stochastic Approximation with Markovian Noise: Theory and Applications**

Jie Hu · Vishwaraj Doshi · Do Young Eun

Two-timescale stochastic approximation (TTSA) is among the most general frameworks for iterative stochastic algorithms. This includes well-known stochastic optimization methods such as SGD variants and those designed for bilevel or minimax problems, as well as reinforcement learning methods like the family of gradient-based temporal difference (GTD) algorithms. In this paper, we conduct an in-depth asymptotic analysis of TTSA under controlled Markovian noise via a central limit theorem (CLT), uncovering the coupled dynamics of TTSA influenced by the underlying Markov chain, which has not been addressed by previous CLT results for TTSA with only martingale difference noise. Building upon our CLT, we extend the applicability of efficient sampling strategies from vanilla SGD to a wider TTSA context in distributed learning, thus broadening the scope of Hu et al. (2020). In addition, we leverage our CLT result to deduce the statistical properties of GTD algorithms with nonlinear function approximation using Markovian samples and show their identical asymptotic performance, a perspective not evident from current finite-time bounds.

**Transductive conformal inference with adaptive scores**

Ulysse Gazin · Gilles Blanchard · Etienne Roquain

Conformal inference is a fundamental and versatile tool that provides distribution-free guarantees. We consider the transductive setting where decisions are made for a test sample of $m$ new points, giving rise to a family of $m$ conformal $p$-values. While classical results only concern their marginal distribution, this paper shows that their joint distribution can be described with a Pólya urn model, which entails a concentration inequality for their empirical distribution function. These results hold for arbitrary exchangeable scores, including some adaptive ones that can use the covariates of the test sample. We demonstrate the usefulness of these general theoretical results by providing uniform guarantees for two machine learning tasks of current interest: interval prediction for transductive transfer learning and novelty detection based on two-class classification.
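For readers unfamiliar with the objects in this abstract, the following is a minimal sketch of how a family of conformal $p$-values is commonly computed from a calibration sample. It uses the textbook definition with fixed, non-adaptive scores; the function name `conformal_p_values` and the convention that larger scores are more atypical are assumptions of this illustration, not the paper's adaptive-score construction.

```python
def conformal_p_values(calibration_scores, test_scores):
    """Return one conformal p-value per test score.

    Assumption of this sketch: larger scores mean "more atypical".
    """
    n = len(calibration_scores)
    p_values = []
    for s in test_scores:
        # Count calibration scores at least as extreme as the test score.
        at_least_as_extreme = sum(1 for c in calibration_scores if c >= s)
        # The +1 terms account for the test point itself.
        p_values.append((1 + at_least_as_extreme) / (n + 1))
    return p_values


if __name__ == "__main__":
    print(conformal_p_values([1.0, 2.0, 3.0, 4.0], [2.5, 5.0]))
```

All $m$ values share the same calibration sample, which is why they are dependent and why their joint distribution is the natural object of study.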

**Learning Latent Partial Matchings with Gumbel-IPF Networks**

Hedda Cohen Indelman · Tamir Hazan

Learning to match discrete objects has been a central task in machine learning, often facilitated by a continuous relaxation of the matching structure. However, practical problems entail partial matchings due to missing correspondences, which pose difficulties to the one-to-one matching learning techniques that dominate the state-of-the-art. This paper introduces Gumbel-IPF networks for learning latent partial matchings. At the core of our method is the differentiable Iterative Proportional Fitting (IPF) procedure that biproportionally projects onto the transportation polytope of target marginals. Our theoretical framework also allows drawing samples from the temperature-dependent partial matching distribution. We investigate the properties of common-practice relaxations through the lens of biproportional fitting and introduce a new metric, the empirical prediction shift. Our method's advantages are demonstrated in experimental results on the semantic keypoints partial matching task on the Pascal VOC, IMC-PT-SparseGM, and CUB2001 datasets.
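As background for the IPF procedure named in this abstract, here is a minimal sketch of classical iterative proportional fitting on a small positive matrix. It only illustrates the alternating row and column rescaling toward target marginals; it is not the Gumbel-IPF network, it is not differentiable, and the function name `ipf` and its parameters are hypothetical.

```python
def ipf(matrix, row_targets, col_targets, iters=200):
    """Rescale a positive matrix until its marginals match the targets.

    Assumes sum(row_targets) == sum(col_targets) and strictly positive entries.
    """
    m = [row[:] for row in matrix]  # work on a copy
    for _ in range(iters):
        # Row step: scale each row so that it sums to its target.
        for i, row in enumerate(m):
            s = sum(row)
            m[i] = [v * row_targets[i] / s for v in row]
        # Column step: scale each column so that it sums to its target.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in m)
            for row in m:
                row[j] *= target / s
    return m


if __name__ == "__main__":
    fitted = ipf([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], [0.4, 0.6], [0.2, 0.3, 0.5])
    for row in fitted:
        print([round(v, 4) for v in row])
```

The result lies in the transportation polytope defined by the two target marginals, up to numerical tolerance.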

**Integrating Uncertainty Awareness into Conformalized Quantile Regression**

Raphael Rossellini · Rina Foygel Barber · Rebecca Willett

Conformalized Quantile Regression (CQR) is a recently proposed method for constructing prediction intervals for a response $Y$ given covariates $X$, without making distributional assumptions. However, existing constructions of CQR can be ineffective for problems where the quantile regressors perform better in certain parts of the feature space than others. The reason is that the prediction intervals of CQR do not distinguish between two forms of uncertainty: first, the variability of the conditional distribution of $Y$ given $X$ (i.e., aleatoric uncertainty), and second, our uncertainty in estimating this conditional distribution (i.e., epistemic uncertainty). This can lead to intervals that are overly narrow in regions where epistemic uncertainty is high. To address this, we propose a new variant of the CQR methodology, Uncertainty-Aware CQR (UACQR), that explicitly separates these two sources of uncertainty to adjust quantile regressors differentially across the feature space. Compared to CQR, our methods enjoy the same distribution-free theoretical coverage guarantees, while demonstrating in our experiments stronger conditional coverage properties in simulated settings and real-world data sets alike.

**On the Expected Size of Conformal Prediction Sets**

Guneet Singh Dhillon · George Deligiannidis · Tom Rainforth

While conformal predictors reap the benefits of rigorous statistical guarantees on their error frequency, the size of their corresponding prediction sets is critical to their practical utility. Unfortunately, there is currently a lack of finite-sample analysis and guarantees for their prediction set sizes. To address this shortfall, we theoretically quantify the expected size of the prediction sets under the split conformal prediction framework. As this precise formulation cannot usually be calculated directly, we further derive point estimates and high-probability interval bounds that can be empirically computed, providing a practical method for characterizing the expected set size. We corroborate the efficacy of our results with experiments on real-world datasets for both regression and classification problems.
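To make the quantity under study concrete, the sketch below shows the standard split conformal recipe for regression with absolute-residual scores, where the prediction set is an interval whose size is twice the calibrated threshold. The function names and the choice of score are assumptions of this illustration; the paper's size estimators and bounds are not reproduced here.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this level: the set is unbounded.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def prediction_interval(point_prediction, threshold):
    """Interval of all labels whose absolute residual is within the threshold."""
    return (point_prediction - threshold, point_prediction + threshold)


if __name__ == "__main__":
    scores = list(range(1, 20))  # hypothetical absolute residuals
    q = conformal_threshold(scores, alpha=0.1)
    low, high = prediction_interval(10.0, q)
    print(q, (low, high), high - low)
```

With this score the set size is the same for every test point, so its expectation is driven entirely by the randomness of the calibration sample.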

**Distributionally Robust Off-Dynamics Reinforcement Learning: Provable Efficiency with Linear Function Approximation**

Zhishuai Liu · Pan Xu

We study off-dynamics Reinforcement Learning (RL), where the policy is trained on a source domain and deployed to a distinct target domain. We aim to solve this problem via online distributionally robust Markov decision processes (DRMDPs), where the learning algorithm actively interacts with the source domain while seeking the optimal performance under the worst possible dynamics that is within an uncertainty set of the source domain's transition kernel. We provide the first study on online DRMDPs with function approximation for off-dynamics RL. We find that DRMDPs' dual formulation can induce nonlinearity, even when the nominal transition kernel is linear, leading to error propagation. By designing a $d$-rectangular uncertainty set using the total variation distance, we remove this additional nonlinearity and bypass the error propagation. We then introduce DR-LSVI-UCB, the first provably efficient online DRMDP algorithm for off-dynamics RL with function approximation, and establish a polynomial suboptimality bound that is independent of the state and action space sizes. Our work makes the first step towards a deeper understanding of the provable efficiency of online DRMDPs with linear function approximation. Finally, we substantiate the performance and robustness of DR-LSVI-UCB through different numerical experiments.

**Model-Based Best Arm Identification for Decreasing Bandits**

Sho Takemori · Yuhei Umeda · Aditya Gopalan

We study the problem of reliably identifying the best (lowest loss) arm in a stochastic multi-armed bandit when the expected loss of each arm is monotone decreasing as a function of its pull count. This models, for instance, scenarios where each arm itself represents an optimization algorithm for finding the minimizer of a common function, and there is a limited time available to test the algorithms before committing to one of them. We assume that the decreasing expected loss of each arm depends on the number of its pulls as an (inverse) polynomial with unknown coefficients. We propose two fixed-budget best arm identification algorithms -- one for the case of sparse polynomial decay models and the other for general polynomial models -- along with bounds on the identification error probability. We also derive algorithm-independent lower bounds on the error probability. These bounds are seen to be factored into the product of the usual problem complexity and the model complexity that only depends on the parameters of the model. This indicates that our methods can identify the best arm even when the budget is smaller. We conduct empirical studies of our algorithms to complement our theoretical findings.

**A/B Testing and Best-arm Identification for Linear Bandits with Robustness to Non-stationarity**

Zhihan Xiong · Romain Camilleri · Maryam Fazel · Lalit Jain · Kevin Jamieson

We investigate the fixed-budget best-arm identification (BAI) problem for linear bandits in a potentially non-stationary environment. Given a finite arm set $\mathcal{X}\subset\mathbb{R}^d$, a fixed budget $T$, and an unpredictable sequence of parameters $\left\lbrace\theta_t\right\rbrace_{t=1}^{T}$, an algorithm will aim to correctly identify the best arm $x^* := \arg\max_{x\in\mathcal{X}}x^\top\sum_{t=1}^{T}\theta_t$ with probability as high as possible. Prior work has addressed the stationary setting where $\theta_t = \theta_1$ for all $t$ and demonstrated that the error probability decreases as $\exp(-T /\rho^*)$ for a problem-dependent constant $\rho^*$. But in many real-world $A/B/n$ multivariate testing scenarios that motivate our work, the environment is non-stationary and an algorithm expecting a stationary setting can easily fail. For robust identification, it is well-known that if arms are chosen randomly and non-adaptively from a G-optimal design over $\mathcal{X}$ at each time then the error probability decreases as $\exp(-T\Delta^2_{(1)}/d)$, where $\Delta_{(1)} = \min_{x \neq x^*} (x^* - x)^\top \frac{1}{T}\sum_{t=1}^T \theta_t$. As there exist environments where $\Delta_{(1)}^2/ d \ll 1/ \rho^*$, we are motivated to propose a novel algorithm P1-RAGE that aims to obtain the best of both worlds: robustness to non-stationarity and fast rates of identification in benign settings. We characterize the error probability of P1-RAGE and demonstrate empirically that the algorithm indeed never performs worse than G-optimal design but compares favorably to the best algorithms in the stationary setting.

**Causal Discovery under Off-Target Interventions**

Davin Choo · Kirankumar Shiragur · Caroline Uhler

Causal graph discovery is a significant problem with applications across various disciplines. However, with observational data alone, the underlying causal graph can only be recovered up to its Markov equivalence class, and further assumptions or interventions are necessary to narrow down the true graph. This work addresses the causal discovery problem under the setting of stochastic interventions with the natural goal of minimizing the number of interventions performed. We propose the following stochastic intervention model which subsumes existing adaptive noiseless interventions in the literature while capturing scenarios such as fat-hand interventions and CRISPR gene knockouts: any intervention attempt results in an actual intervention on a random subset of vertices, drawn from a *distribution dependent on attempted action*. Under this model, we study the two fundamental problems in causal discovery of verification and search, and provide approximation algorithms with polylogarithmic competitive ratios along with some preliminary experimental results.

**Conformalized Deep Splines for Optimal and Efficient Prediction Sets**

Nathaniel Diamant · Ehsan Hajiramezanali · Tommaso Biancalani · Gabriele Scalia

Uncertainty estimation is critical in high-stakes machine learning applications. One effective way to estimate uncertainty is conformal prediction, which can provide predictive inference with statistical coverage guarantees. We present a new conformal regression method, Spline Prediction Intervals via Conformal Estimation (SPICE), that estimates the conditional density using neural-network-parameterized splines. We prove universal approximation and optimality results for SPICE, which are empirically reflected by our experiments. SPICE is compatible with two different efficient-to-compute conformal scores, one designed for size-efficient marginal coverage (SPICE-ND) and the other for size-efficient conditional coverage (SPICE-HPD). Results on benchmark datasets demonstrate SPICE-ND models achieve the smallest average prediction set sizes, including average size reductions of nearly 50% for some datasets compared to the next best baseline. SPICE-HPD models achieve the best conditional coverage compared to baselines. The SPICE implementation is made available.

**Optimal Exploration is no harder than Thompson Sampling**

Zhaoqi Li · Kevin Jamieson · Lalit Jain

Given a set of arms $\mathcal{Z}\subset \mathbb{R}^d$ and an unknown parameter vector $\theta_\ast\in\mathbb{R}^d$, the pure exploration linear bandits problem aims to return $\arg\max_{z\in \mathcal{Z}} z^{\top}\theta_{\ast}$, with high probability through noisy measurements of $x^{\top}\theta_{\ast}$ with $x\in \mathcal{X}\subset \mathbb{R}^d$. Existing (asymptotically) optimal methods require either a) potentially costly projections for each arm $z\in \mathcal{Z}$ or b) explicitly maintaining a subset of $\mathcal{Z}$ under consideration at each time. This complexity is at odds with the popular and simple Thompson Sampling algorithm for regret minimization, which just requires access to a posterior sampling and argmax oracle, and does not need to enumerate $\mathcal{Z}$ at any point. Unfortunately, Thompson sampling is known to be sub-optimal for pure exploration. In this work, we pose a natural question: is there an algorithm that can explore optimally and only needs the same computational primitives as Thompson Sampling? We answer the question in the affirmative. We provide an algorithm that leverages only sampling and argmax oracles and achieves an exponential convergence rate, with the exponent equal to the exponent of the optimal fixed allocation asymptotically. In addition, we show that our algorithm can be easily implemented and performs as well empirically as existing asymptotically optimal methods.
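The two computational primitives mentioned in this abstract, a posterior sampling oracle and an argmax oracle, can be illustrated with a single round of plain Thompson Sampling. This is a toy sketch, not the paper's exploration algorithm: it assumes an independent Gaussian posterior per coordinate, and it implements the argmax oracle by enumerating a small hypothetical arm list, which a real oracle would avoid.

```python
import random


def thompson_step(arms, posterior_mean, posterior_std, rng=random):
    """Pick an arm index using one posterior sample and one argmax call."""
    # Posterior sampling oracle (assumed diagonal Gaussian posterior).
    theta = [rng.gauss(m, s) for m, s in zip(posterior_mean, posterior_std)]

    # Argmax oracle, implemented here by brute force over the toy arm list.
    def value(arm):
        return sum(a * t for a, t in zip(arm, theta))

    return max(range(len(arms)), key=lambda i: value(arms[i]))


if __name__ == "__main__":
    arms = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
    print(thompson_step(arms, posterior_mean=[0.2, 0.9], posterior_std=[0.5, 0.5]))
```

The chosen index varies from call to call because it depends on the random posterior draw.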

**Sample-Efficient Personalization: Modeling User Parameters as Low Rank Plus Sparse Components**

Soumyabrata Pal · Prateek Varshney · Gagan Madan · Prateek Jain · Abhradeep Thakurta · Gaurav Aggarwal · Pradeep Shenoy · Gaurav Srivastava

Personalization of machine learning (ML) predictions for individual users/domains/enterprises is critical for practical recommendation systems. Standard personalization approaches involve learning a user/domain-specific *embedding* that is fed into a fixed global model, which can be limiting. On the other hand, personalizing/fine-tuning the model itself for each user/domain (a.k.a. meta-learning) has high storage/infrastructure cost. Moreover, rigorous theoretical studies of scalable personalization approaches have been very limited. To address the above issues, we propose a novel meta-learning style approach that models network weights as a sum of low-rank and sparse components. This captures common information from multiple individuals/users together in the low-rank part, while the sparse part captures user-specific idiosyncrasies. We then study the framework in the linear setting, where the problem reduces to that of estimating the sum of a rank-$r$ and a $k$-column sparse matrix using a small number of linear measurements. We propose a computationally efficient alternating minimization method with iterative hard thresholding, AMHT-LRS, to learn the low-rank and sparse parts. Theoretically, for the realizable Gaussian data setting, we show that AMHT-LRS solves the problem efficiently with nearly optimal sample complexity. Finally, a significant challenge in personalization is ensuring privacy of each user's sensitive data. We alleviate this problem by proposing a differentially private variant of our method that is also equipped with strong generalization guarantees.

**Robust SVD Made Easy: A fast and reliable algorithm for large-scale data analysis**

Sangil Han · Sungkyu Jung · Kyoowon Kim

The singular value decomposition (SVD) is a crucial tool in machine learning and statistical data analysis. However, it is highly susceptible to outliers in the data matrix. Existing robust SVD algorithms often sacrifice speed for robustness or fail in the presence of only a few outliers. This study introduces an efficient algorithm, called Spherically Normalized SVD, for robust SVD approximation that is highly insensitive to outliers, computationally scalable, and provides accurate approximations of singular vectors. The proposed algorithm achieves remarkable speed by utilizing only two applications of a standard reduced-rank SVD algorithm to appropriately scaled data, significantly outperforming competing algorithms in computation times. To assess the robustness of the approximated singular vectors and their subspaces against data contamination, we introduce new notions of breakdown points for matrix-valued input, including row-wise, column-wise, and block-wise breakdown points. Theoretical and empirical analyses demonstrate that our algorithm exhibits higher breakdown points compared to standard SVD and its modifications. We empirically validate the effectiveness of our approach in applications such as robust low-rank approximation and robust principal component analysis of high-dimensional microarray datasets. Overall, our study presents a highly efficient and robust solution for SVD approximation that overcomes the limitations of existing algorithms in the presence of outliers.

Finding the mean of sampled data is a fundamental task in machine learning and statistics. However, in cases where the data samples are graph objects, defining a mean is an inherently difficult task. We propose a novel framework for defining a graph mean via embeddings in the space of smooth graph signal distributions, where graph similarity can be measured using the Wasserstein metric. By finding a mean in this embedding space, we can recover a mean graph that preserves structural information. We establish the existence and uniqueness of the novel graph mean, and provide an iterative algorithm for computing it. To highlight the potential of our framework as a valuable tool for practical applications in machine learning, it is evaluated on various tasks, including k-means clustering of structured aligned graphs, classification of functional brain networks, and semi-supervised node classification in multi-layer graphs. Our experimental results demonstrate that our approach achieves consistent performance, outperforms existing baseline approaches, and improves the performance of state-of-the-art methods.

**Tight Verification of Probabilistic Robustness in Bayesian Neural Networks**

Ben Batten · Mehran Hosseini · Alessio Lomuscio

We introduce two algorithms for computing tight guarantees on the probabilistic robustness of Bayesian Neural Networks (BNNs). Computing robustness guarantees for BNNs is a significantly more challenging task than verifying the robustness of standard Neural Networks (NNs) because it requires searching the parameters' space for safe weights. Moreover, tight and complete approaches for the verification of standard NNs, such as those based on Mixed-Integer Linear Programming (MILP), cannot be directly used for the verification of BNNs because of the polynomial terms resulting from the consecutive multiplication of variables encoding the weights. Our algorithms efficiently and effectively search the parameters' space for safe weights by using iterative expansion and the network's gradient and can be used with any verification algorithm of choice for BNNs. In addition to proving that our algorithms compute tighter bounds than the SoA, we also evaluate our algorithms against the SoA on standard benchmarks, such as MNIST and CIFAR10, showing that our algorithms compute bounds up to 40% tighter than the SoA.

**Positivity-free Policy Learning with Observational Data**

Pan Zhao · Antoine Chambaz · Julie Josse · Shu Yang

Policy learning utilizing observational data is pivotal across various domains, with the objective of learning the optimal treatment assignment policy while adhering to specific constraints such as fairness, budget, and simplicity. This study introduces a novel positivity-free (stochastic) policy learning framework designed to address the challenges posed by the impracticality of the positivity assumption in real-world scenarios. This framework leverages incremental propensity score policies to adjust propensity score values instead of assigning fixed values to treatments. We characterize these incremental propensity score policies and establish identification conditions, employing semiparametric efficiency theory to propose efficient estimators capable of achieving rapid convergence rates, even when integrated with advanced machine learning algorithms. This paper provides a thorough exploration of the theoretical guarantees associated with policy learning and validates the proposed framework's finite-sample performance through comprehensive numerical experiments, ensuring the identification of causal effects from observational data is both robust and reliable.

**Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling**

Arman Adibi · Nicolò Dal Fabbro · Luca Schenato · Sanjeev Kulkarni · H. Vincent Poor · George Pappas · Hamed Hassani · Aritra Mitra

Motivated by applications in large-scale and multi-agent reinforcement learning, we study the non-asymptotic performance of stochastic approximation (SA) schemes with delayed updates under Markovian sampling. While the effect of delays has been extensively studied for optimization, the manner in which they interact with the underlying Markov process to shape the finite-time performance of SA remains poorly understood. In this context, our first main contribution is to show that under time-varying bounded delays, the delayed SA update rule guarantees exponentially fast convergence of the *last iterate* to a ball around the SA operator's fixed point. Notably, our bound is *tight* in its dependence on both the maximum delay $\tau_{max}$ and the mixing time $\tau_{mix}$. To achieve this tight bound, we develop a novel inductive proof technique that, unlike various existing delayed-optimization analyses, relies on establishing uniform boundedness of the iterates. As such, our proof may be of independent interest. Next, to mitigate the impact of the maximum delay on the convergence rate, we provide the first finite-time analysis of a delay-adaptive SA scheme under Markovian sampling. In particular, we show that the exponent of convergence of this scheme gets scaled down by $\tau_{avg}$, as opposed to $\tau_{max}$ for the vanilla delayed SA rule; here, $\tau_{avg}$ denotes the average delay across all iterations. Moreover, the adaptive scheme requires no prior knowledge of the delay sequence for step-size tuning. Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms, including TD learning, Q-learning, and stochastic gradient descent under Markovian sampling.

**Causal Modeling with Stationary Diffusions**

Lars Lorch · Andreas Krause · Bernhard Schölkopf

We develop a novel approach towards causal inference. Rather than structural equations over a causal graph, we learn stochastic differential equations (SDEs) whose stationary densities model a system's behavior under interventions. These stationary diffusion models do not require the formalism of causal graphs, let alone the common assumption of acyclicity. We show that in several cases, they generalize to unseen interventions on their variables, often better than classical approaches. Our inference method is based on a new theoretical result that expresses a stationarity condition on the diffusion's generator in a reproducing kernel Hilbert space. The resulting kernel deviation from stationarity (KDS) is an objective function of independent interest.

**Vector Quantile Regression on Manifolds**

Marco Pegoraro · Sanketh Vedula · Aviv Rosenberg · Irene Tallini · Emanuele Rodolà · Alex Bronstein

Quantile regression (QR) is a statistical tool for distribution-free estimation of conditional quantiles of a target variable given explanatory features. QR is limited by the assumption that the target distribution is univariate and defined on a Euclidean domain. Although the notion of quantiles was recently extended to multi-variate distributions, QR for multi-variate distributions on manifolds remains underexplored, even though many important applications inherently involve data distributed on, e.g., spheres (climate and geological phenomena), and tori (dihedral angles in proteins). By leveraging optimal transport theory and c-concave functions, we meaningfully define conditional vector quantile functions of high-dimensional variables on manifolds (M-CVQFs). Our approach allows for quantile estimation, regression, and computation of conditional confidence sets and likelihoods. We demonstrate the approach's efficacy and provide insights regarding the meaning of non-Euclidean quantiles through synthetic and real data experiments.
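As background for the univariate Euclidean setting that this abstract generalizes, the following minimal sketch shows classical quantile estimation: a $\tau$-quantile of a sample minimizes the average pinball (check) loss. This is not the paper's M-CVQF method; the function names `pinball_loss` and `empirical_quantile` are hypothetical and chosen only for illustration.

```python
def pinball_loss(q, ys, tau):
    """Average pinball (check) loss of candidate quantile q at level tau."""
    total = 0.0
    for y in ys:
        diff = y - q
        # Under-prediction is weighted by tau, over-prediction by (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(ys)


def empirical_quantile(ys, tau):
    """Return a sample point that minimizes the pinball loss at level tau."""
    return min(sorted(ys), key=lambda q: pinball_loss(q, ys, tau))


if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(empirical_quantile(sample, 0.5))
    print(empirical_quantile(sample, 0.9))
```

Searching over the sample points is enough here because the pinball loss is piecewise linear in `q`, so a minimizer always exists at one of the observations.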

**Convergence to Nash Equilibrium and No-regret Guarantee in (Markov) Potential Games**

Jing Dong · Baoxiang Wang · Yaoliang Yu

In this work, we study potential games and Markov potential games under stochastic cost and bandit feedback. We propose a variant of the Frank-Wolfe algorithm with sufficient exploration and recursive gradient estimation, which provably converges to the Nash equilibrium while attaining sublinear regret for each individual player. Our algorithm simultaneously achieves a Nash regret and a regret bound of $O(T^{4/5})$ for potential games, which matches the best available result, without using additional projection steps. Through carefully balancing the reuse of past samples and exploration of new samples, we then extend the results to Markov potential games and improve the best available Nash regret from $O(T^{5/6})$ to $O(T^{4/5})$. Moreover, our algorithm requires no knowledge of the game, such as the distribution mismatch coefficient, which provides more flexibility in its practical implementation. Experimental results corroborate our theoretical findings and underscore the practical effectiveness of our method.

**On Convergence in Wasserstein Distance and f-divergence Minimization Problems**

Cheuk Ting Li · Jingwei Zhang · Farzan Farnia

The zero-sum game in generative adversarial networks (GANs) for learning the distribution of observed data is known to reduce to the minimization of a divergence measure between the underlying and generative models. However, the current theoretical understanding of the role of the target divergence in the characteristics of GANs' generated samples remains largely inadequate. In this work, we aim to analyze the influence of the divergence measure on the local optima and convergence properties of divergence minimization problems in learning a multi-modal data distribution. We show a mode-seeking f-divergence, e.g. the Jensen-Shannon (JS) divergence in the vanilla GAN, could lead to poor locally optimal solutions missing some underlying modes. On the other hand, we demonstrate that the optimization landscape of 1-Wasserstein distance in Wasserstein GANs does not suffer from such suboptimal local minima. Furthermore, we prove that a randomly-initialized gradient-based optimization of the Wasserstein distance will, with high probability, capture all the existing modes. We present numerical results on standard image datasets, revealing the success of Wasserstein GANs compared to JS-GANs in avoiding suboptimal local optima under a mixture model.

**Near Optimal Adversarial Attacks on Stochastic Bandits and Defenses with Smoothed Responses**

Shiliang Zuo

I study adversarial attacks against stochastic bandit algorithms. At each round, the learner chooses an arm, and a stochastic reward is generated. The adversary strategically adds corruption to the reward, and the learner is only able to observe the corrupted reward at each round. Two sets of results are presented in this paper. The first set studies the optimal attack strategies for the adversary. The adversary has a target arm he wishes to promote, and his goal is to manipulate the learner into choosing this target arm $T - o(T)$ times. I design attack strategies against UCB and Thompson Sampling that spend only $\widehat{O}(\sqrt{\log T})$ cost. Matching lower bounds are presented, and the vulnerability of UCB, Thompson Sampling, and $\varepsilon$-greedy is exactly characterized. The second set studies how the learner can defend against the adversary. Inspired by literature on smoothed analysis and behavioral economics, I present two simple algorithms that achieve a competitive ratio arbitrarily close to 1.

**Learning Extensive-Form Perfect Equilibria in Two-Player Zero-Sum Sequential Games**

Martino Bernasconi · Alberto Marchesi · Francesco Trovò

Designing efficient algorithms for computing refinements of the Nash equilibrium (NE) in two-player zero-sum sequential games is of paramount importance, since the NE may prescribe sub-optimal actions off the equilibrium path. The extensive-form perfect equilibrium (EFPE) amends such a weakness by accounting for the possibility that players may make mistakes. This is crucial in the real world, which involves humans with bounded rationality, and it is also key in boosting superhuman agents for games like Poker. Nevertheless, there are only a few algorithms for computing NE refinements, which either lack convergence guarantees to exact equilibria or do not scale to large games. We provide the first efficient iterative algorithm that provably converges to an EFPE in two-player zero-sum sequential games. Our algorithm works by tracking a sequence of equilibria of regularized-perturbed games, using a procedure that is specifically tailored to converge in the last iterate to such equilibria. The procedure can be implemented efficiently by visiting the game tree, making our method computationally appealing. We also empirically evaluate our algorithm, showing that its strategies are much more robust to players' mistakes than those of state-of-the-art algorithms.

A scientist tests a continuous stream of hypotheses over time in the course of her investigation --- she does not test a predetermined, fixed number of hypotheses. The scientist wishes to make as many discoveries as possible while ensuring the number of false discoveries is controlled --- a well-recognized way of accomplishing this is to control the false discovery rate (FDR). Prior methods for FDR control in the online setting have focused on formulating algorithms when specific dependency structures are assumed to exist between the test statistics of each hypothesis. However, in practice, these dependencies often cannot be known beforehand or tested after the fact. Our algorithm, e-LOND, provides FDR control under arbitrary, possibly unknown, dependence. We show that our method is more powerful than existing approaches to this problem through simulations. We also formulate extensions of this algorithm to utilize randomization for increased power and for constructing confidence intervals in online selective inference.

**User-level Differentially Private Stochastic Convex Optimization: Efficient Algorithms with Optimal Rates**

Daogao Liu · Hilal Asi

We study differentially private stochastic convex optimization (DP-SCO) under user-level privacy, where each user may hold multiple data items. Existing work for user-level DP-SCO either requires super-polynomial runtime (Ghazi et al., 2023) or requires the number of users to grow polynomially with the dimensionality of the problem with additional strict assumptions (Bassily et al., 2023). We develop new algorithms for user-level DP-SCO that obtain optimal rates for both convex and strongly convex functions in polynomial time and require the number of users to grow only logarithmically in the dimension. Moreover, our algorithms are the first to obtain optimal rates for non-smooth functions in polynomial time. These algorithms are based on multiple-pass DP-SGD, combined with a novel private mean estimation procedure for concentrated data, which applies an outlier removal step before estimating the mean of the gradients.

**Non-Neighbors Also Matter to Kriging: A New Contrastive-Prototypical Learning**

Zhishuai Li · Yunhao Nie · Ziyue Li · Lei Bai · Yisheng Lv · Rui Zhao

Kriging aims to estimate the attributes of unseen geo-locations from observations in the spatial vicinity or physical connections. Existing works assume that neighbors' information offers the basis for estimating the unobserved target while ignoring non-neighbors. However, neighbors could also be quite different or even misleading, and the non-neighbors could still offer constructive information. To this end, we propose "Contrastive-Prototypical" self-supervised learning for Kriging (KCP): (1) The neighboring contrastive module coarsely pushes neighbors together and non-neighbors apart. (2) In parallel, the prototypical module identifies similar representations via exchanged prediction, such that it refines the misleading neighbors and recycles the useful non-neighbors from the neighboring contrast component. As a result, not all of the neighbors, and some of the non-neighbors, will be used to infer the target. (3) To learn general and robust representations, we design an adaptive augmentation module that encourages data diversity. A theoretical bound is derived for the proposed augmentation. Extensive experiments on real-world datasets demonstrate the superior performance of KCP compared to its peers, with 6\% improvements and exceptional transferability and robustness.

**Mind the GAP: Improving Robustness to Subpopulation Shifts with Group-Aware Priors**

Tim G. J. Rudner · Ya Shi Zhang · Andrew Gordon Wilson · Julia Kempe

Machine learning models often perform poorly under subpopulation shifts in the data distribution. Developing methods that allow machine learning models to better generalize to such shifts is crucial for safe deployment in real-world settings. In this paper, we develop a family of group-aware prior (GAP) distributions over neural network parameters that explicitly favor models that generalize well under subpopulation shifts. We design a simple group-aware prior that only requires access to a small set of data with group information and demonstrate that training with this prior yields state-of-the-art performance---even when only retraining the final layer of a previously trained non-robust model. Group-aware priors are conceptually simple, complementary to existing approaches, such as attribute pseudo labeling and data reweighting, and open up promising new avenues for harnessing Bayesian inference to enable robustness to subpopulation shifts.

**Privacy-Preserving Decentralized Actor-Critic for Cooperative Multi-Agent Reinforcement Learning**

Maheed Ahmed · Mahsa Ghasemi

Multi-agent reinforcement learning has a wide range of applications in cooperative settings, but ensuring data privacy among agents is a significant challenge. To address this challenge, we propose Privacy-Preserving Decentralized Actor-Critic (PPDAC), an algorithm that motivates agents to cooperate while maintaining their data privacy. Leveraging trajectory ranking, PPDAC enables the agents to learn a cooperation reward that encourages agents to account for other agents' preferences. Subsequently, each agent trains a policy that maximizes not only its local reward, as in independent actor-critic (IAC), but also the cooperation reward, hence increasing cooperation. Importantly, communication among agents is restricted to their ranking of trajectories that only include public identifiers without any private local data. Moreover, as an additional layer of privacy, the agents can perturb their rankings with the randomized response method. We evaluate PPDAC on the level-based foraging (LBF) environment and a coin-gathering environment. We compare with IAC and Shared Experience Actor-Critic (SEAC), which achieves SOTA results for the LBF environment. The results show that PPDAC consistently outperforms IAC. In addition, PPDAC outperforms SEAC in the coin-gathering environment and achieves similar performance in the LBF environment, all while providing better privacy.
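The randomized response method mentioned above can be illustrated on a single binary preference (for example, "trajectory A is ranked above trajectory B"). The sketch below is a generic textbook version, not PPDAC's actual perturbation scheme, which the abstract does not specify; the names `randomized_response`, `debias`, and the parameter `p_truth` are hypothetical.

```python
import random


def randomized_response(bit, p_truth, rng):
    """Report the true bit with probability p_truth, otherwise report its flip."""
    return bit if rng.random() < p_truth else 1 - bit


def debias(mean_reported, p_truth):
    """Unbiased estimate of the true mean of the bits from the reported mean.

    E[reported] = (1 - p_truth) + true_mean * (2 * p_truth - 1),
    so the relation is inverted here (requires p_truth != 0.5).
    """
    return (mean_reported - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)


if __name__ == "__main__":
    rng = random.Random(0)
    true_bits = [1] * 600 + [0] * 400
    reports = [randomized_response(b, 0.75, rng) for b in true_bits]
    print(debias(sum(reports) / len(reports), 0.75))
```

Each individual report is deniable, yet an aggregator can still recover population-level statistics; a lower `p_truth` gives more privacy at the cost of a noisier estimate.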

Contrastive Analysis (CA) deals with the discovery of what is common and what is distinctive of a target domain compared to a background one. This is of great interest in many applications, such as medical imaging. Current state-of-the-art (SOTA) methods are latent variable models based on VAE (CA-VAEs). However, they all either ignore important constraints or fail to enforce fundamental assumptions. This may lead to sub-optimal solutions where distinctive factors are mistaken for common ones (or vice versa). Furthermore, the generated images have a rather poor quality, typical of VAEs, decreasing their interpretability and usefulness. Here, we propose Double InfoGAN, the first GAN-based method for CA that leverages the high-quality synthesis of GAN and the separation power of InfoGAN. Experimental results on four visual datasets, from simple synthetic examples to complex medical images, show that the proposed method outperforms SOTA CA-VAEs in terms of latent separation and image quality. Datasets and code are available online.

**P-tensors: a General Framework for Higher Order Message Passing in Subgraph Neural Networks**

Andrew Hands · Tianyi Sun · Risi Kondor

Several recent papers have proposed increasing the expressiveness of graph neural networks by exploiting subgraphs or other topological structures. In parallel, researchers have investigated higher order permutation equivariant networks. In this paper we tie these two threads together by providing a general framework for higher order permutation equivariant message passing in subgraph neural networks. Our exposition hinges on so-called P-tensors, which provide a simple way to define the most general form of permutation equivariant message passing in this category of networks. We show that this paradigm can achieve state-of-the-art performance on benchmark molecular datasets.

**Testing Generated Distributions in GANs to Penalize Mode Collapse**

Yanxiang Gong · Zhiwei Xie · Mei Xie · Xin Ma

Mode collapse remains the primary unresolved challenge within generative adversarial networks (GANs). In this work, we introduce an innovative approach that supplements the discriminator by additionally enforcing the similarity between the generated and real distributions. We implement a one-sample test on the generated samples and employ the resulting test statistic to penalize deviations from the real distribution. Our method encompasses a practical strategy to estimate distributions, compute the test statistic via a differentiable function, and seamlessly incorporate test outcomes into the training objective. Crucially, our approach preserves the convergence and theoretical integrity of GANs, as the introduced constraint represents a requisite condition for optimizing the generator training objective. Notably, our method circumvents reliance on regularization or network modules, enhancing compatibility and facilitating its practical application. Empirical evaluations on diverse public datasets validate the efficacy of our proposed approach.

**Adaptive Compression in Federated Learning via Side Information**

Berivan Isik · Francesco Pase · Deniz Gunduz · Sanmi Koyejo · Tsachy Weissman · Michele Zorzi

The high communication cost of sending model updates from the clients to the server is a significant bottleneck for scalable federated learning (FL). Among existing approaches, state-of-the-art bitrate-accuracy tradeoffs have been achieved using stochastic compression methods -- in which client $n$ sends a sample from a client-only probability distribution $q_{\phi^{(n)}}$, and the server estimates the mean of the clients' distributions using these samples. However, such methods do not take full advantage of the FL setup where the server, throughout the training process, has side information in the form of a global distribution $p_{\theta}$ that is close to the client-only distribution $q_{\phi^{(n)}}$ in Kullback–Leibler (KL) divergence. In this work, we exploit this *closeness* between the clients' distributions $q_{\phi^{(n)}}$ and the side information $p_{\theta}$ at the server, and propose a framework that requires approximately $D_{KL}(q_{\phi^{(n)}} \| p_{\theta})$ bits of communication. We show that our method can be integrated into many existing stochastic compression frameworks to attain the same (and often higher) test accuracy with up to 82 times smaller bitrate than the prior work -- corresponding to 2,650 times overall compression.
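To give a feel for the quantity $D_{KL}(q_{\phi^{(n)}} \| p_{\theta})$ that the abstract cites as the approximate communication cost, the sketch below evaluates the KL divergence in bits for univariate Gaussians. The Gaussian form is an assumption made purely for illustration (the abstract does not state the distribution family), and `gaussian_kl_bits` is a hypothetical helper, not part of the authors' code.

```python
import math


def gaussian_kl_bits(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) in bits for univariate Gaussians q = N(mu_q, sigma_q^2), p = N(mu_p, sigma_p^2)."""
    kl_nats = (
        math.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
        - 0.5
    )
    return kl_nats / math.log(2.0)  # convert nats to bits


if __name__ == "__main__":
    # A client distribution close to the server's side information costs few bits.
    print(gaussian_kl_bits(0.1, 1.0, 0.0, 1.0))
    # A client distribution far from the side information costs more.
    print(gaussian_kl_bits(3.0, 1.0, 0.0, 1.0))
```

The closer the client distribution is to the server-side distribution, the smaller the divergence, which is the intuition behind exploiting side information to reduce the bitrate.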

**Compression with Exact Error Distribution for Federated Learning**

Mahmoud Hegazy · Rémi Leluc · Cheuk Ting Li · Aymeric Dieuleveut

Compression schemes have been extensively used in Federated Learning (FL) to reduce the communication cost of distributed learning. While most approaches rely on a bounded variance assumption of the noise produced by the compressor, this paper investigates the use of compression and aggregation schemes that produce a specific error distribution, e.g., Gaussian or Laplace, on the aggregated data. We present and analyze different aggregation schemes based on layered quantizers achieving exact error distribution. We provide different methods to leverage the proposed compression schemes to obtain compression-for-free in differential privacy applications. Our general compression methods can recover and improve standard FL schemes with Gaussian perturbations such as Langevin dynamics and randomized smoothing.

**Mechanics of Next Token Prediction with Self-Attention**

Yingcong Li · Yixiao Huang · Muhammed Ildiz · Ankit Singh Rawat · Samet Oymak

Transformer-based language models are trained on large datasets to predict the next token given an input sequence. Despite this simple training objective, they have led to revolutionary advances in natural language processing. Underlying this success is the self-attention mechanism. In this work, we ask: What does a single self-attention layer learn from next-token prediction? We show that training self-attention with gradient descent learns an automaton which generates the next token in two distinct steps: (1) Hard retrieval: Given an input sequence, self-attention precisely selects the high-priority input tokens associated with the last input token. (2) Soft composition: It then creates a convex combination of the high-priority tokens from which the next token can be sampled. Under suitable conditions, we rigorously characterize these mechanics through a directed graph over tokens extracted from the training data. We prove that gradient descent implicitly discovers the strongly-connected components (SCC) of this graph and self-attention learns to retrieve the tokens that belong to the highest-priority SCC available in the context window. Our theory relies on decomposing the model weights into a directional component and a finite component that correspond to the hard retrieval and soft composition steps, respectively. This also formalizes a related implicit bias formula conjectured in [Tarzanagh et al. 2023]. We hope that these findings shed light on how self-attention processes sequential data and pave the path toward demystifying more complex architectures.

**E(3)-Equivariant Mesh Neural Networks**

Thuan Trang · Khang Ngo · Daniel Levy · Thieu Ngoc Vo · Siamak Ravanbakhsh · Truong Son Hy

Triangular meshes are widely used to represent three-dimensional objects. As a result, many recent works have addressed the need for geometric deep learning on 3D meshes. However, we observe that the complexities in many of these architectures do not translate to practical performance, and simple deep models for geometric graphs are competitive in practice. Motivated by this observation, we minimally extend the update equations of E(n)-Equivariant Graph Neural Networks (EGNNs) (Satorras et al., 2021) to incorporate mesh face information and further improve it to account for long-range interactions through a hierarchy. The resulting architecture, Equivariant Mesh Neural Network (EMNN), outperforms other, more complicated equivariant methods on mesh tasks, with a fast run-time and no expensive preprocessing. Our implementation is available at https://github.com/HySonLab/EquiMesh.

**A Unified Framework for Discovering Discrete Symmetries**

Pavan Karjol · Rohan Kashyap · Aditya Gopalan · Prathosh A P

We consider the problem of learning a function respecting a symmetry from among a class of symmetries. We develop a unified framework that enables symmetry discovery across a broad range of subgroups including locally symmetric, dihedral and cyclic subgroups. At the core of the framework is a novel architecture composed of linear, matrix-valued and non-linear functions that expresses functions invariant to these subgroups in a principled manner. The structure of the architecture enables us to leverage multi-armed bandit algorithms and gradient descent to efficiently optimize over the linear and the non-linear functions, respectively, and to infer the symmetry that is ultimately learnt. We also discuss the necessity of the matrix-valued functions in the architecture. Experiments on image-digit sum and polynomial regression tasks demonstrate the effectiveness of our approach.

**Unsupervised Novelty Detection in Pretrained Representation Space with Locally Adapted Likelihood Ratio**

Amirhossein Ahmadian · Yifan Ding · Gabriel Eilertsen · Fredrik Lindsten

Detecting novelties given unlabeled examples of normal data is a challenging task in machine learning, particularly when the novel and normal categories are semantically close. Large deep models pretrained on massive datasets can provide a rich representation space in which the simple k-nearest neighbor distance works as a novelty measure. However, as we show in this paper, the basic k-NN method might be insufficient in this context due to ignoring the 'local geometry' of the distribution over representations as well as the impact of irrelevant 'background features'. To address this, we propose a fully unsupervised novelty detection approach that integrates the flexibility of k-NN with a locally adapted scaling of dimensions based on the 'neighbors of nearest neighbor' and computing a 'likelihood ratio' in pretrained (self-supervised) representation spaces. Our experiments with image data show the advantage of this method when off-the-shelf vision transformers (e.g., pretrained by DINO) are used as the feature extractor without any fine-tuning.
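The basic k-NN novelty measure that this abstract takes as its starting point can be sketched as follows: the score of a query is its distance to the k-th nearest representation among the normal examples. This shows only the plain baseline, not the locally adapted likelihood-ratio method proposed in the paper; `knn_novelty_score` is a hypothetical name, and the toy 2-D points stand in for pretrained feature vectors.

```python
import math


def knn_novelty_score(query, normal_points, k):
    """Distance from query to its k-th nearest neighbor among normal_points.

    Larger scores indicate that the query lies farther from the normal data.
    """
    if not 1 <= k <= len(normal_points):
        raise ValueError("k must be between 1 and the number of normal points")
    dists = sorted(math.dist(query, x) for x in normal_points)
    return dists[k - 1]


if __name__ == "__main__":
    normal = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    print(knn_novelty_score((0.5, 0.5), normal, k=2))  # close to the normal data
    print(knn_novelty_score((5.0, 5.0), normal, k=2))  # far from the normal data
```

Because this score uses one global Euclidean metric, it treats all feature dimensions equally, which is the limitation the abstract attributes to ignoring local geometry and background features.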

The advantage of Vision Transformers over CNNs is only fully manifested when trained over a large dataset, mainly due to the reduced inductive bias towards spatial locality within the transformer’s self-attention mechanism. In this work, we present a data-efficient vision transformer that does not rely on self-attention. Instead, it employs a novel generalization to multiple axes of the very recent Hyena layer. We propose several alternative approaches for obtaining this generalization and delve into their unique distinctions and considerations from both empirical and theoretical perspectives. The proposed Hyena N-D layer boosts the performance of various Vision Transformer architectures, such as ViT, Swin, and DeiT, across multiple datasets. Furthermore, in the small dataset regime, our Hyena-based ViT compares favorably with ViT variants from the recent literature that are specifically designed for solving the same challenge. Finally, we show that a hybrid approach that is based on Hyena N-D for the first layers in ViT, followed by layers that incorporate conventional attention, consistently boosts the performance of various vision transformer architectures. Our code is attached as supplementary.

The success of reinforcement learning (RL) crucially depends on effective function approximation when dealing with complex ground-truth models. Existing sample-efficient RL algorithms primarily employ three approaches to function approximation: policy-based, value-based, and model-based methods. However, in the face of model misspecification (a disparity between the ground-truth and optimal function approximators), it has been shown that policy-based approaches can be robust even when the policy function approximation is under a large *locally-bounded* misspecification error, with which the function class may exhibit an $\Omega(1)$ approximation error in specific states and actions, but remains small on average within a policy-induced state distribution. Yet it remains an open question whether similar robustness can be achieved with value-based and model-based approaches, especially with general function approximation. To bridge this gap, in this paper we present a unified theoretical framework for addressing model misspecification in RL. We demonstrate that, through meticulous algorithm design and sophisticated analysis, value-based and model-based methods employing general function approximation can achieve robustness under local misspecification error bounds. In particular, they can attain a regret bound of $\widetilde{O}\left(\mathrm{poly}(dH)\cdot(\sqrt{K} + K\cdot\zeta) \right)$, where $d$ represents the complexity of the function class, $H$ is the episode length, $K$ is the total number of episodes, and $\zeta$ denotes the local bound for misspecification error. Furthermore, we propose an algorithmic framework that can achieve the same order of regret bound without prior knowledge of $\zeta$, thereby enhancing its practical applicability.

**Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity**

Siddharth Joshi · Arnav Jain · Ali Payani · Baharan Mirzasoleiman

Contrastive Language-Image Pre-training (CLIP) on large-scale image-caption datasets learns representations that can achieve remarkable zero-shot generalization. However, such models require a massive amount of pre-training data. Improving the quality of the pre-training data has been shown to be much more effective in improving CLIP's performance than increasing its volume. Nevertheless, finding small subsets of training data that provably generalize best has remained an open question. In this work, we propose the first theoretically rigorous data selection method for CLIP. We show that subsets that closely preserve the cross-covariance of the images and captions of the full data provably achieve a superior generalization performance. Our extensive experiments on ConceptualCaptions3M and ConceptualCaptions12M demonstrate that subsets found by ClipCov achieve over 2.7x and 1.4x the accuracy of the next best baseline on ImageNet and its shifted versions. Moreover, we show that our subsets obtain 1.5x the average accuracy of the next best baseline across 11 downstream datasets. The code is available at: https://github.com/BigML-CS-UCLA/clipcov-data-efficient-clip.
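The notion of a subset that "closely preserves the cross-covariance of the images and captions" can be made concrete with a small sketch. It computes the empirical cross-covariance matrix between paired image and caption feature vectors and measures how far a subset's matrix is from that of the full data in Frobenius norm. This illustrates the quantity only, not the ClipCov selection algorithm; the function names and the toy 1-D features are hypothetical.

```python
import math


def cross_covariance(xs, ys):
    """Empirical cross-covariance matrix between paired vectors xs[i] and ys[i]."""
    n = len(xs)
    dx, dy = len(xs[0]), len(ys[0])
    mean_x = [sum(x[i] for x in xs) / n for i in range(dx)]
    mean_y = [sum(y[j] for y in ys) / n for j in range(dy)]
    return [
        [
            sum((x[i] - mean_x[i]) * (y[j] - mean_y[j]) for x, y in zip(xs, ys)) / n
            for j in range(dy)
        ]
        for i in range(dx)
    ]


def frobenius_gap(a, b):
    """Frobenius norm of the difference between two equally sized matrices."""
    return math.sqrt(sum((u - v) ** 2 for ra, rb in zip(a, b) for u, v in zip(ra, rb)))


if __name__ == "__main__":
    images = [[0.0], [1.0], [2.0], [3.0]]    # toy image features
    captions = [[0.0], [1.0], [2.0], [3.0]]  # toy caption features
    full = cross_covariance(images, captions)
    subset = cross_covariance(images[:2], captions[:2])
    print(frobenius_gap(full, subset))
```

A smaller gap means the subset's image-caption statistics are closer to those of the full dataset under this simple measure.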

**Efficient Low-Dimensional Compression of Overparameterized Models**

Soo Min Kwon · Zekai Zhang · Dogyoon Song · Laura Balzano · Qing Qu

In this work, we present a novel approach for compressing overparameterized models, developed through studying their learning dynamics. We observe that for many deep models, updates to the weight matrices occur within a low-dimensional invariant subspace. For deep linear models, we demonstrate that their principal components are fitted incrementally within a small subspace, and use these insights to propose a compression algorithm for deep linear networks that involves decreasing the width of their intermediate layers. We empirically evaluate the effectiveness of our compression technique on matrix recovery problems. Remarkably, by using an initialization that exploits the structure of the problem, we observe that our compressed network converges faster than the original network, consistently yielding smaller recovery errors. We substantiate this observation by developing a theory focused on deep matrix factorization. Finally, we empirically demonstrate how our compressed model has the potential to improve the utility of deep nonlinear models. Overall, our algorithm improves the training efficiency by more than 2x, without compromising generalization.

**Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees**

Alexia Jolicoeur-Martineau · Kilian Fatras · Tal Kachman

Tabular data is hard to acquire and is subject to missing values. This paper introduces a novel approach for generating and imputing mixed-type (continuous and categorical) tabular data utilizing score-based diffusion and conditional flow matching. In contrast to prior methods that rely on neural networks to learn the score function or the vector field, we adopt XGBoost, a widely used Gradient-Boosted Tree (GBT) technique. To test our method, we build one of the most extensive benchmarks for tabular data generation and imputation, containing 27 diverse datasets and 9 metrics. Through empirical evaluation across the benchmark, we demonstrate that our approach outperforms deep-learning generation methods in data generation tasks and remains competitive in data imputation. Notably, it can be trained in parallel using CPUs without requiring a GPU. Our Python and R code is available at https://github.com/SamsungSAILMontreal/ForestDiffusion.

**Adaptive Discretization for Event PredicTion (ADEPT)**

Jimmy Hickey · Ricardo Henao · Daniel Wojdyla · Michael Pencina · Matthew Engelhard

Recently developed survival analysis methods improve upon existing approaches by predicting the probability of event occurrence in each of a number of pre-specified (discrete) time intervals. By avoiding placing strong parametric assumptions on the event density, this approach tends to improve prediction performance, particularly when data are plentiful. However, in clinical settings with limited available data, it is often preferable to judiciously partition the event time space into a limited number of intervals well suited to the prediction task at hand. In this work, we develop Adaptive Discretization for Event PredicTion (ADEPT) to learn from data a set of cut points defining such a partition. We show that in two simulated datasets, we are able to recover intervals that match the underlying generative model. We then demonstrate improved prediction performance on three real-world observational datasets, including a large, newly harmonized stroke risk prediction dataset. Finally, we argue that our approach facilitates clinical decision-making by suggesting time intervals that are most appropriate for each task, in the sense that they facilitate more accurate risk prediction.

**Sequence Length Independent Norm-Based Generalization Bounds for Transformers**

Jacob Trauger · Ambuj Tewari

This paper provides norm-based generalization bounds for the Transformer architecture that do not depend on the input sequence length. We employ a covering number based approach to prove our bounds. We use three novel covering number bounds for the function class of bounded linear mappings to upper bound the Rademacher complexity of the Transformer. Furthermore, we show this generalization bound applies to the common Transformer training technique of masking and then predicting the masked word. We also run a simulated study on a sparse majority data set that empirically validates our theoretical findings.

**Exploring the Power of Graph Neural Networks in Solving Linear Optimization Problems**

Chendi Qian · Didier Chételat · Christopher Morris

Recently, machine learning, particularly message-passing graph neural networks (MPNNs), has gained traction in enhancing exact optimization algorithms. For example, MPNNs speed up solving mixed-integer optimization problems by imitating computationally intensive heuristics like strong branching, which entails solving multiple linear optimization problems (LPs). Despite the empirical success, the reasons behind MPNNs' effectiveness in emulating linear optimization remain largely unclear. Here, we show that MPNNs can simulate standard interior-point methods for LPs, explaining their practical success. Furthermore, we highlight how MPNNs can serve as a lightweight proxy for solving LPs, adapting to a given problem instance distribution. Empirically, we show that MPNNs solve LP relaxations of standard combinatorial optimization problems close to optimality, often surpassing conventional solvers and competing approaches in solving time.

**Accuracy-Preserving Calibration via Statistical Modeling on Probability Simplex**

Yasushi Esaki · Akihiro Nakamura · Keisuke Kawano · Ryoko Tokuhisa · Takuro Kutsuna

Classification models based on deep neural networks (DNNs) must be calibrated to measure the reliability of predictions. Some recent calibration methods have employed a probabilistic model on the probability simplex. However, these calibration methods cannot preserve the accuracy of pre-trained models, even those with a high classification accuracy. We propose an accuracy-preserving calibration method using the Concrete distribution as the probabilistic model on the probability simplex. We theoretically prove that a DNN model trained on cross-entropy loss has optimality as the parameter of the Concrete distribution. We also propose an efficient method that synthetically generates samples for training probabilistic models on the probability simplex. We demonstrate that the proposed method can outperform previous methods in accuracy-preserving calibration tasks using benchmarks.

**Efficient Neural Architecture Design via Capturing Architecture-Performance Joint Distribution**

Yue Liu · Ziyi Yu · Zitu Liu · Wenjie Tian

The relationship between architecture and performance is critical for improving the efficiency of neural architecture design, yet few efforts have been devoted to understanding this relationship, especially the architecture-performance joint distribution. In this paper, we propose the Semi-Supervised Generative Adversarial Networks Neural Architecture Design method (SemiGAN-NAD) to capture the architecture-performance joint distribution with few performance labels. It is composed of Bidirectional Transformer of Architecture and Performance (Bi-Arch2Perf) and Neural Architecture Conditional Generation (NACG). Bi-Arch2Perf is developed to learn the joint distribution of architecture and performance from bidirectional conditional distribution through the adversarial training of the discriminator, the architecture generator, and the performance predictor. Then, the incorporation of semi-supervised learning optimizes the construction of Bi-Arch2Perf by utilizing a large amount of architecture information without performance annotation in search space. Based on the learned bidirectional relationship, the performance of architecture is predicted by NACG in high-performance architecture space to efficiently discover promising neural architectures. The experimental results on NAS benchmarks demonstrate that SemiGAN-NAD achieves competitive performance with reduced evaluation time compared with the latest NAS methods. Moreover, the high-performance architecture signatures learned by Bi-Arch2Perf are also illustrated in our experiments.

**Analysis of Using Sigmoid Loss for Contrastive Learning**

Chungpa Lee · Joonhwan Chang · Jy-yong Sohn

Contrastive learning has been a prominent branch of self-supervised learning for several years. In particular, CLIP, which applies contrastive learning to large sets of captioned images, has garnered significant attention. Recently, SigLIP, a variant of CLIP, has been proposed, which uses the sigmoid loss instead of the standard InfoNCE loss. SigLIP achieves performance comparable to CLIP in a more efficient manner by eliminating the need for a global view. However, theoretical understanding of using the sigmoid loss in contrastive learning is underexplored. In this paper, we provide a theoretical analysis of using the sigmoid loss in contrastive learning, from the perspective of the geometric structure of learned embeddings. First, we propose the double-Constant Embedding Model (CCEM), a framework for parameterizing various well-known embedding structures by a single variable. Interestingly, the proposed CCEM is proven to contain the optimal embedding with respect to the sigmoid loss. Second, we mathematically analyze the optimal embedding minimizing the sigmoid loss for contrastive learning. The optimal embedding ranges from simplex equiangular-tight-frame to antipodal structure, depending on the temperature parameter used in the sigmoid loss. Third, our experimental results on synthetic datasets coincide with the theoretical results on the optimal embedding structures.
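
For readers unfamiliar with the sigmoid loss, the following is a small NumPy sketch of its pairwise form as commonly stated for SigLIP: every image-caption pair is treated as an independent binary classification, with label +1 for matching pairs and -1 otherwise. The function name and the default temperature `t` and bias `b` are illustrative assumptions, and the paper's own parameterization may differ.

```python
import numpy as np

def sigmoid_loss(u, v, t=1.0, b=0.0):
    """Pairwise sigmoid contrastive loss for paired embeddings u[i] <-> v[i].

    t (temperature) and b (bias) are illustrative defaults, not values from the paper.
    """
    logits = t * (u @ v.T) + b                 # similarity of every (i, j) pair
    labels = 2.0 * np.eye(len(u)) - 1.0        # +1 on the diagonal, -1 elsewhere
    # -log(sigmoid(z)) = log(1 + exp(-z)), computed stably with logaddexp
    return float(np.sum(np.logaddexp(0.0, -labels * logits)) / len(u))

# Two orthonormal embeddings paired with themselves.
u = np.eye(2)
loss = sigmoid_loss(u, u)
```

Because each pair contributes its own term, the loss needs no normalization over the whole batch, which is the "no global view" property mentioned in the abstract.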

**On The Temporal Domain of Differential Equation Inspired Graph Neural Networks**

Moshe Eliasof · Eldad Haber · Eran Treister · Carola-Bibiane Schönlieb

Graph Neural Networks (GNNs) have demonstrated remarkable success in modeling complex relationships in graph-structured data. A recent innovation in this field is the family of Differential Equation-Inspired Graph Neural Networks (DE-GNNs), which leverage principles from continuous dynamical systems to model information flow on graphs with built-in properties such as feature smoothing or preservation. However, existing DE-GNNs rely on first or second-order temporal dependencies. In this paper, we propose a neural extension to those pre-defined temporal dependencies. We show that our model, called TDE-GNN, can capture a wide range of temporal dynamics that go beyond typical first or second-order methods, and provide use cases where existing temporal models are challenged. We demonstrate the benefit of learning the temporal dependencies using our method rather than using pre-defined temporal dynamics on several graph benchmarks.

**Adaptive and non-adaptive minimax rates for weighted Laplacian-Eigenmap based nonparametric regression**

Zhaoyang Shi · Krishna Balasubramanian · Wolfgang Polonik

We show both adaptive and non-adaptive minimax rates of convergence for a family of weighted Laplacian-Eigenmap based nonparametric regression methods, when the true regression function belongs to a Sobolev space and the sampling density is bounded from above and below. The adaptation methodology is based on extensions of Lepski's method and is over both the smoothness parameter ($s\in\mathbb{N}_{+}$) and the norm parameter ($M>0$) determining the constraints on the Sobolev space. Our results extend the non-adaptive result in Green et al. (2023), established for a specific normalized graph Laplacian, to a wide class of weighted Laplacian matrices used in practice, including the unnormalized Laplacian and random walk Laplacian.

**Learning Dynamics in Linear VAE: Posterior Collapse Threshold, Superfluous Latent Space Pitfalls, and Speedup with KL Annealing**

Yuma Ichikawa · Koji Hukushima

Variational autoencoders (VAEs) face a notorious problem wherein the variational posterior often aligns closely with the prior, a phenomenon known as posterior collapse, which hinders the quality of representation learning. To mitigate this problem, an adjustable hyperparameter $\beta$ and a strategy for annealing this parameter, called KL annealing, are proposed. This study presents a theoretical analysis of the learning dynamics in a minimal VAE. It is rigorously proved that the dynamics converge to a deterministic process within the limit of large input dimensions, thereby enabling a detailed dynamical analysis of the generalization error. Furthermore, the analysis shows that the VAE initially learns entangled representations and gradually acquires disentangled representations. A fixed-point analysis of the deterministic process reveals that when $\beta$ exceeds a certain threshold, posterior collapse becomes inevitable regardless of the learning period. Additionally, the superfluous latent variables for the data-generative factors lead to overfitting of the background noise; this adversely affects both generalization and learning convergence. The analysis further unveiled that appropriately tuned KL annealing can accelerate convergence.

**Understanding Progressive Training Through the Framework of Randomized Coordinate Descent**

Rafał Szlendak · Elnur Gasanov · Peter Richtarik

We propose a Randomized Progressive Training algorithm (RPT) – a stochastic proxy for the well-known Progressive Training method (PT) (Karras et al., 2017). Originally designed to train GANs (Goodfellow et al., 2014), PT was proposed as a heuristic, with no convergence analysis even for the simplest objective functions. On the contrary, to the best of our knowledge, RPT is the first PT-type algorithm with rigorous and sound theoretical guarantees for general smooth objective functions. We cast our method into the established framework of Randomized Coordinate Descent (RCD) (Nesterov, 2012; Richtarik & Takac, 2014), for which (as a by-product of our investigations) we also propose a novel, simple and general convergence analysis encapsulating strongly-convex, convex and nonconvex objectives. We then use this framework to establish a convergence theory for RPT. Finally, we validate the effectiveness of our method through extensive computational experiments.

**On the Theoretical Expressive Power and the Design Space of Higher-Order Graph Transformers**

Cai Zhou · Rose Yu · Yusu Wang

Graph transformers have recently received significant attention in graph learning, partly due to their ability to capture more global interaction via self-attention. Nevertheless, while higher-order graph neural networks have been reasonably well studied, the exploration of extending graph transformers to higher-order variants is just starting. Both theoretical understanding and empirical results are limited. In this paper, we provide a systematic study of the theoretical expressive power of order-$k$ graph transformers and sparse variants. We first show that an order-$k$ graph transformer without additional structural information is less expressive than the $k$-Weisfeiler Lehman ($k$-WL) test despite its high computational cost. We then explore strategies to both sparsify and enhance the higher-order graph transformers, aiming to improve both their efficiency and expressiveness. Indeed, sparsification based on neighborhood information can enhance the expressive power, as it provides additional information about input graph structures. In particular, we show that a natural neighborhood-based sparse order-$k$ transformer model is not only computationally efficient, but also expressive -- as expressive as the $k$-WL test. We further study several other sparse graph attention models that are computationally efficient and provide their expressiveness analysis. Finally, we provide experimental results to show the effectiveness of the different sparsification strategies.

**Free-form Flows: Make Any Architecture a Normalizing Flow**

Felix Draxler · Peter Sorrenson · Lea Zimmermann · Armand Rousselot · Ullrich Köthe

Normalizing Flows are generative models that directly maximize the likelihood. Previously, the design of normalizing flows was largely constrained by the need for analytical invertibility. We overcome this constraint by a training procedure that uses an efficient estimator for the gradient of the change of variables formula. This enables any dimension-preserving neural network to serve as a generative model through maximum likelihood training. Our approach allows placing the emphasis on tailoring inductive biases precisely to the task at hand. Specifically, we achieve excellent results in molecule generation benchmarks utilizing E(n)-equivariant networks at greatly improved sampling speed. Moreover, our method is competitive in an inverse problem benchmark, while employing off-the-shelf ResNet architectures. We publish our code at https://github.com/vislearn/FFF.

**Monotone Operator Theory-Inspired Message Passing for Learning Long-Range Interaction on Graphs**

Justin Baker · Qingsong Wang · Martin Berzins · Thomas Strohmer · Bao Wang

Learning long-range interactions (LRI) between distant nodes is crucial for many graph learning tasks. Predominant graph neural networks (GNNs) rely on local message passing and struggle to learn LRI. In this paper, we propose DRGNN to learn LRI leveraging monotone operator theory. DRGNN contains two key components: (1) we use a full node similarity matrix beyond adjacency matrix -- drawing inspiration from the personalized PageRank matrix -- as the aggregation matrix for message passing, and (2) we implement message-passing on graphs using Douglas-Rachford splitting to circumvent prohibitive matrix inversion. We demonstrate that DRGNN surpasses various advanced GNNs, including Transformer-based models, on several benchmark LRI learning tasks arising from different application domains, highlighting its efficacy in learning LRI. Code is available at https://github.com/Utah-Math-Data-Science/PR-inspired-aggregation.

**Learning a Fourier Transform for Linear Relative Positional Encodings in Transformers**

Krzysztof Choromanski · Shanda Li · Valerii Likhosherstov · Kumar Avinava Dubey · Shengjie Luo · Di He · Yiming Yang · Tamas Sarlos · Thomas Weingarten · Adrian Weller

We propose a new class of linear Transformers called FourierLearner-Transformers (FLTs), which incorporate a wide range of relative positional encoding mechanisms (RPEs). These include regular RPE techniques applied for sequential data, as well as novel RPEs operating on geometric data embedded in higher-dimensional Euclidean spaces. FLTs construct the optimal RPE mechanism implicitly by learning its spectral representation. As opposed to other architectures combining efficient low-rank linear attention with RPEs, FLTs remain practical in terms of their memory usage and do not require additional assumptions about the structure of the RPE mask. Besides, FLTs allow for applying certain structural inductive bias techniques to specify masking strategies, e.g. they provide a way to learn the so-called local RPEs introduced in this paper and give accuracy gains as compared with several other linear Transformers for language modeling. We also thoroughly test FLTs on other data modalities and tasks, such as image classification, 3D molecular modeling, and learnable optimizers. To the best of our knowledge, for 3D molecular data, FLTs are the first Transformer architectures providing linear attention and incorporating RPE masking.

**Implicit Regularization in Deep Tucker Factorization: Low-Rankness via Structured Sparsity**

Kais HARIZ · Hachem Kadri · Stephane Ayache · Maher Moakher · Thierry Artieres

We theoretically analyze the implicit regularization of deep learning for tensor completion. We show that deep Tucker factorization trained by gradient descent induces a structured sparse regularization. This leads to a characterization of the effect of the depth of the neural network on the implicit regularization and provides a potential explanation for the bias of gradient descent towards solutions with low multilinear rank. Numerical experiments confirm our theoretical findings and give insights into the behavior of gradient descent in deep tensor factorization.

**Simulating weighted automata over sequences and trees with transformers**

Michael Rizvi-Martel · Maude Lizaire · Clara Lacroce · Guillaume Rabusseau

Transformers are ubiquitous models in the natural language processing (NLP) community and have shown impressive empirical successes in the past few years. However, little is understood about how they reason and the limits of their computational capabilities. These models do not process data sequentially, and yet outperform sequential neural models such as RNNs. Recent work has shown that these models can compactly simulate the sequential reasoning abilities of deterministic finite automata (DFAs). This leads to the following question: can transformers simulate the reasoning of more complex finite state machines? In this work, we show that transformers can simulate weighted finite automata (WFAs), a class of models which subsumes DFAs, as well as weighted tree automata (WTA), a generalization of weighted automata to tree structured inputs. We prove these claims formally and provide upper bounds on the size of the transformer models needed as a function of the number of states of the target automata. Empirically, we perform synthetic experiments showing that transformers are able to learn these compact solutions via standard gradient-based training.

**Functional Graphical Models: Structure Enables Offline Data-Driven Optimization**

Kuba Grudzien · Masatoshi Uehara · Sergey Levine · Pieter Abbeel

While machine learning models are typically trained to solve prediction problems, we might often want to use them for optimization problems. For example, given a dataset of proteins and their corresponding fluorescence levels, we might want to optimize for a new protein with the highest possible fluorescence. This kind of data-driven optimization (DDO) presents a range of challenges beyond those in standard prediction problems, since we need models that successfully predict the performance of new designs that are better than the best designs seen in the training set. It is not clear theoretically when existing approaches can even perform better than the naïve approach that simply selects the best design in the dataset. In this paper, we study how structure can enable sample-efficient data-driven optimization. To formalize the notion of structure, we introduce functional graphical models (FGMs) and show theoretically how they can provide for principled data-driven optimization by decomposing the original high-dimensional optimization problem into smaller sub-problems. This allows us to derive much more practical regret bounds for DDO, and the result implies that DDO with FGMs can achieve nearly optimal designs in situations where naïve approaches fail due to insufficient coverage of the offline data. We further present a data-driven optimization algorithm that infers the FGM structure itself, either over the original input variables or a latent variable representation of the inputs.

**From Coupled Oscillators to Graph Neural Networks: Reducing Over-smoothing via a Kuramoto Model-based Approach**

Tuan Nguyen · Hirotada Honda · Takashi Sano · Vinh NGUYEN · Shugo Nakamura · Tan Nguyen

We propose the Kuramoto Graph Neural Network (KuramotoGNN), a novel class of continuous-depth graph neural networks (GNNs) that employs the Kuramoto model to mitigate the over-smoothing phenomenon, in which node features in GNNs become indistinguishable as the number of layers increases. The Kuramoto model captures the synchronization behavior of non-linear coupled oscillators. Under the view of coupled oscillators, we first show the connection between the Kuramoto model and basic GNNs, and then show that the over-smoothing phenomenon in GNNs can be interpreted as phase synchronization in the Kuramoto model. The KuramotoGNN replaces this phase synchronization with frequency synchronization to prevent the node features from converging into each other while allowing the system to still reach a stable synchronized state. We experimentally verify the advantages of the KuramotoGNN over the baseline GNNs and existing methods in reducing over-smoothing on various graph deep learning benchmark tasks.
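
For reference, the classical all-to-all Kuramoto model can be written as follows. This is the standard textbook form, and the graph-based variant used by KuramotoGNN may differ in its coupling structure and notation.

$$
\dot{\theta}_i = \omega_i + \frac{K}{N} \sum_{j=1}^{N} \sin(\theta_j - \theta_i), \qquad i = 1, \dots, N
$$

Here $\theta_i$ is the phase of oscillator $i$, $\omega_i$ its natural frequency, and $K$ the coupling strength. Phase synchronization means that all $\theta_i$ coincide, whereas frequency synchronization only requires all $\dot{\theta}_i$ to agree, so the phases can remain distinct.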

**A Lower Bound and a Near-Optimal Algorithm for Bilevel Empirical Risk Minimization**

Mathieu DAGREOU · Thomas Moreau · Samuel Vaiter · Pierre Ablin

Bilevel optimization problems, which are problems where two optimization problems are nested, have more and more applications in machine learning. In many practical cases, the upper and the lower objectives correspond to empirical risk minimization problems and therefore have a sum structure. In this context, we propose a bilevel extension of the celebrated SARAH algorithm. We demonstrate that the algorithm requires $O((n+m)^{1/2}\epsilon^{-1})$ oracle calls to achieve $\epsilon$-stationarity with $n+m$ the total number of samples, which improves over all previous bilevel algorithms. Moreover, we provide a lower bound on the number of oracle calls required to get an approximate stationary point of the objective function of the bilevel problem. This lower bound is attained by our algorithm, making it optimal in terms of sample complexity.

**Fast 1-Wasserstein distance approximations using greedy strategies**

Guillaume Houry · Han Bao · Han Zhao · Makoto Yamada

Among numerous linear approximation methods proposed for optimal transport (OT), tree-based methods appear to be fairly reliable, notably for language processing applications. Inspired by these tree methods, we introduce several greedy heuristics aiming to compute even faster approximations of OT. We first explicitly establish the equivalence between greedy matching and optimal transport for tree metrics, and then we show that tree greedy matching can be reduced to greedy matching on a one-dimensional line. Next, we propose two new greedy-based algorithms in one dimension: the $k$-Greedy and 1D-ICT algorithms. This novel approach provides Wasserstein approximations with accuracy similar to the original tree methods on text datasets while being faster in practice. Finally, these algorithms are applicable beyond tree approximations: using sliced projections of the original data still provides fairly good accuracy while eliminating the need for embedding the data in a fixed and rigid tree structure. This property makes these approaches even more versatile than the original tree OT methods.
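
As background for the one-dimensional setting, the sketch below computes the exact 1-Wasserstein distance between two equal-size, uniformly weighted samples on a line by matching sorted points, and averages it over random projections to obtain a sliced estimate. This is the classical sorting-based computation, not the $k$-Greedy or 1D-ICT algorithms from the paper. The function names and the equal-size assumption are choices made for this illustration.

```python
import numpy as np

def wasserstein_1d(x, y):
    """Exact 1-Wasserstein distance between equal-size, uniformly weighted 1D samples."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes the same number of points on both sides")
    # In 1D, matching the i-th smallest x to the i-th smallest y is optimal.
    return float(np.mean(np.abs(x - y)))

def sliced_wasserstein(X, Y, n_proj=64, seed=0):
    """Average 1D distance over random unit directions (a sliced approximation)."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_proj, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return float(np.mean([wasserstein_1d(X @ d, Y @ d) for d in dirs]))

d = wasserstein_1d([0, 1, 3], [5, 4, 6])  # sorted pairs: (0,4), (1,5), (3,6)
```

Sorting costs $O(n \log n)$ per projection, which is why reductions to a line are attractive compared with solving a full transport problem.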

**Graph Partitioning with a Move Budget**

Mina Dalirrooyfard · Elaheh Fata · Majid Behbahani · Yuriy Nevmyvaka

In many real world networks, there already exists a (not necessarily optimal) $k$-partitioning of the network. Oftentimes, for such networks, one aims to find a $k$-partitioning with a smaller cut value by moving only a few nodes across partitions. The number of nodes that can be moved across partitions is often a constraint forced by budgetary limitations. Motivated by such real-world applications, we introduce and study the $r$-move $k$-partitioning problem, a natural variant of the Multiway cut problem. Given a graph, a set of $k$ terminals and an initial partitioning of the graph, the $r$-move $k$-partitioning problem aims to find a $k$-partitioning with the minimum-weighted cut among all the $k$-partitionings that can be obtained by moving at most $r$ non-terminal nodes to partitions different from their initial ones. Our main result is a polynomial time $3(r+1)$ approximation algorithm for this problem. We further show that this problem is $W[1]$-hard, and give an FPTAS for when $r$ is a small constant.

**AsGrad: A Sharp Unified Analysis of Asynchronous-SGD Algorithms**

Rustem Islamov · Mher Safaryan · Dan Alistarh

We analyze asynchronous-type algorithms for distributed SGD in the heterogeneous setting, where each worker has its own computation and communication speeds, as well as data distribution. In these algorithms, workers compute possibly stale and stochastic gradients associated with their local data at some iteration back in history and then return those gradients to the server without synchronizing with other workers. We present a unified convergence theory for non-convex smooth functions in the heterogeneous regime. The proposed analysis provides convergence for pure asynchronous SGD and its various modifications. Moreover, our theory explains what affects the convergence rate and what can be done to improve the performance of asynchronous algorithms. In particular, we introduce a novel asynchronous method based on worker shuffling. As a by-product of our analysis, we also demonstrate convergence guarantees for gradient-type algorithms such as SGD with random reshuffling and shuffle-once mini-batch SGD. The derived rates match the best-known results for those algorithms, highlighting the tightness of our approach. Finally, our numerical evaluations support theoretical findings and show the good practical performance of our method.

**Generalization Bounds of Nonconvex-(Strongly)-Concave Stochastic Minimax Optimization**

Siqi Zhang · Yifan Hu · Liang Zhang · Niao He

This paper studies the generalization performance of algorithms for solving nonconvex-(strongly)-concave (NC-SC/NC-C) stochastic minimax optimization measured by the stationarity of primal functions. We first establish algorithm-agnostic generalization bounds via uniform convergence between the empirical minimax problem and the population minimax problem. The sample complexities for achieving $\epsilon$-generalization are $\tilde{\mathcal{O}}(d\kappa^2\epsilon^{-2})$ and $\tilde{\mathcal{O}}(d\epsilon^{-4})$ for NC-SC and NC-C settings, respectively, where $d$ is the dimension of the primal variable and $\kappa$ is the condition number. We further study the algorithm-dependent generalization bounds via stability arguments of algorithms. In particular, we introduce a novel stability notion for minimax problems and build a connection between stability and generalization. As a result, we establish algorithm-dependent generalization bounds for stochastic gradient descent ascent (SGDA) and the more general sampling-determined algorithms (SDA).

**Lower-level Duality Based Reformulation and Majorization Minimization Algorithm for Hyperparameter Optimization**

He Chen · Haochen Xu · Rujun Jiang · Anthony Man-Cho So

Hyperparameter tuning is an important task of machine learning, which can be formulated as a bilevel program (BLP). However, most existing algorithms are not applicable for BLP with non-smooth lower-level problems. To address this, we propose a single-level reformulation of the BLP based on lower-level duality without involving any implicit value function. To solve the reformulation, we propose a majorization minimization algorithm that majorizes the constraint in each iteration. Furthermore, we show that the subproblems of the proposed algorithm for several widely-used hyperparameter tuning models can be reformulated into conic programs that can be efficiently solved by off-the-shelf solvers. We theoretically prove the convergence of the proposed algorithm and demonstrate its superiority through numerical experiments.

**Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems**

Nikita Puchkin · Eduard Gorbunov · Nickolay Kutuzov · Alexander Gasnikov

We consider stochastic optimization problems with heavy-tailed noise with structured density. For such problems, we show that it is possible to get faster rates of convergence than $O(K^{-2(\alpha - 1) / \alpha})$, when the stochastic gradients have finite $\alpha$-th moment, $\alpha \in (1, 2]$. In particular, our analysis allows the noise norm to have an unbounded expectation. To achieve these results, we stabilize stochastic gradients, using smoothed medians of means. We prove that the resulting estimates have negligible bias and controllable variance. This allows us to carefully incorporate them into clipped-SGD and clipped-SSTM and derive new high-probability complexity bounds in the considered setup.
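
As a point of reference, the plain median-of-means estimator can be sketched as below: split the samples into blocks, average each block, and take the coordinate-wise median of the block means. The paper uses a smoothed variant, which this sketch does not implement. The function name and the coordinate-wise median are assumptions made for illustration.

```python
import numpy as np

def median_of_means(samples, n_blocks):
    """Plain (unsmoothed) median-of-means estimate of the mean of `samples`.

    `samples` has shape (n,) or (n, d); leftover samples that do not fill
    a complete block are dropped.
    """
    samples = np.asarray(samples, dtype=float)
    block_size = len(samples) // n_blocks
    if block_size == 0:
        raise ValueError("need at least one sample per block")
    usable = samples[: block_size * n_blocks]
    blocks = usable.reshape(n_blocks, block_size, *samples.shape[1:])
    # Mean within each block, then the coordinate-wise median across blocks.
    return np.median(blocks.mean(axis=1), axis=0)

# A single extreme outlier corrupts only one block mean.
est = median_of_means([1, 1, 1, 1, 1, 1, 1, 1, 1000], n_blocks=3)
```

In this example the block means are 1, 1, and 334, so the median ignores the corrupted block, whereas the ordinary sample mean would be 112.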

**Adaptive Quasi-Newton and Anderson Acceleration Framework with Explicit Global (Accelerated) Convergence Rates**

Damien Scieur

Despite the impressive numerical performance of the quasi-Newton and Anderson/nonlinear acceleration methods, their global convergence rates have remained elusive for over 50 years. This study addresses this long-standing issue by introducing a framework that derives novel, adaptive quasi-Newton and nonlinear/Anderson acceleration schemes. Under mild assumptions, the proposed iterative methods exhibit explicit, non-asymptotic convergence rates that blend those of the gradient descent and Cubic Regularized Newton's methods. The proposed approach also includes an accelerated version for convex functions. Notably, these rates are achieved adaptively without prior knowledge of the function's parameters. The framework presented in this study is generic, and its special cases include algorithms such as Newton's method with random subspaces, finite differences, or lazy Hessian. Numerical experiments demonstrated the efficiency of the proposed framework, even compared to the l-BFGS algorithm with Wolfe line-search. The code used in the experiments is available at https://github.com/windows7lover/QN_With_Guarantees.

**Learning-Based Algorithms for Graph Searching Problems**

Adela DePavia · Erasmo Tani · Ali Vakilian

We consider the problem of graph searching with prediction recently introduced by Banerjee et al. (2023). In this problem, an agent starting at some vertex r has to traverse a (potentially unknown) graph G to find a hidden goal node g while minimizing the total distance traveled. We study a setting in which at any node v, the agent receives a noisy estimate of the distance from v to g. We design algorithms for this search task on unknown graphs. We establish the first formal guarantees on unknown weighted graphs and provide lower bounds showing that the algorithms we propose have optimal or nearly-optimal dependence on the prediction error. Further, we perform numerical experiments demonstrating that in addition to being robust to adversarial error, our algorithms perform well in typical instances in which the error is stochastic. Finally, we provide simpler performance bounds on the algorithms of Banerjee et al. (2023) for the case of searching on a known graph and establish new lower bounds for this setting.

**End-to-end Feature Selection Approach for Learning Skinny Trees**

Shibal Ibrahim · Kayhan Behdin · Rahul Mazumder

Joint feature selection and tree ensemble learning is a challenging task. Popular tree ensemble toolkits, e.g., Gradient Boosted Trees and Random Forests, support feature selection post-training based on feature importances, which are known to be misleading and can significantly hurt performance. We propose Skinny Trees: a toolkit for feature selection in tree ensembles, such that feature selection and tree ensemble learning occur simultaneously. It is based on an end-to-end optimization approach that considers feature selection in differentiable trees with Group L0-L2 regularization. We optimize with a first-order proximal method and present convergence guarantees of our algorithmic approach for a non-convex and non-smooth objective. Interestingly, dense-to-sparse regularization scheduling can lead to more expressive and sparser tree ensembles than the vanilla proximal method. On 15 synthetic and real-world datasets, Skinny Trees can achieve 1.5×-620× feature compression rates, leading to up to 10× faster inference over dense trees, without any loss in performance. Skinny Trees lead to better feature selection than many existing toolkits; e.g., in terms of AUC performance for a 25% feature budget, Skinny Trees outperforms LightGBM by 10.2% (up to 37.7%), and Random Forests by 3% (up to 12.5%).

**Enhancing Hypergradients Estimation: A Study of Preconditioning and Reparameterization**

Zhenzhang Ye · Gabriel Peyré · Daniel Cremers · Pierre Ablin

Bilevel optimization aims to optimize an outer objective function that depends on the solution to an inner optimization problem. It is routinely used in Machine Learning, notably for hyperparameter tuning. The conventional method to compute the so-called hypergradient of the outer problem is to use the Implicit Function Theorem (IFT). We study the error of the IFT method as a function of the error in solving the inner problem. We analyze two strategies to reduce this error: preconditioning the IFT formula and reparameterizing the inner problem. We give a detailed account of the impact of these two modifications on the error, highlighting the role played by higher-order derivatives of the functionals at stake. Our theoretical findings explain when super efficiency, namely reaching an error on the hypergradient that depends quadratically on the error on the inner problem, is achievable and compare the two approaches when this is impossible. Numerical evaluations on hyperparameter tuning for regression problems substantiate our theoretical findings.
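
For readers unfamiliar with the IFT hypergradient, the standard textbook form can be written as follows. The symbols $F$, $G$, $\theta$, $x^*(\theta)$ and $h$ are notation assumed here for illustration, not taken from the paper, and $G$ is assumed smooth and strongly convex in $x$. With outer objective $h(\theta) = F(\theta, x^*(\theta))$ and inner solution $x^*(\theta) = \arg\min_x G(\theta, x)$, differentiating the optimality condition $\nabla_x G(\theta, x^*(\theta)) = 0$ with respect to $\theta$ gives

$$
\frac{\partial x^*}{\partial \theta}(\theta) = -\left[\nabla^2_{xx} G(\theta, x^*(\theta))\right]^{-1} \nabla^2_{x\theta} G(\theta, x^*(\theta)),
$$

and the chain rule then yields

$$
\nabla h(\theta) = \nabla_\theta F(\theta, x^*(\theta)) - \nabla^2_{\theta x} G(\theta, x^*(\theta)) \left[\nabla^2_{xx} G(\theta, x^*(\theta))\right]^{-1} \nabla_x F(\theta, x^*(\theta)).
$$

In practice $x^*(\theta)$ is replaced by an approximate inner solution, and the formula evaluated at that approximation is in general only an estimate of $\nabla h(\theta)$.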

**Fairness in Submodular Maximization over a Matroid Constraint**

Marwa El Halabi · Jakub Tarnawski · Ashkan Norouzi-Fard · Thuy-Duong Vuong

Submodular maximization over a matroid constraint is a fundamental problem with various applications in machine learning. Some of these applications involve decision-making over datapoints with sensitive attributes such as gender or race. In such settings, it is crucial to guarantee that the selected solution is fairly distributed with respect to this attribute. Recently, fairness has been investigated in submodular maximization under a cardinality constraint in both the streaming and offline settings; however, the more general problem with a matroid constraint has only been considered in the streaming setting and only for monotone objectives. This work fills this gap. We propose various algorithms and impossibility results offering different trade-offs between quality, fairness, and generality.

**Submodular Minimax Optimization: Finding Effective Sets**

Loay Mualem · Ethan Elenberg · Moran Feldman · Amin Karbasi

Despite the rich existing literature about minimax optimization in continuous settings, only very partial results of this kind have been obtained for combinatorial settings. In this paper, we fill this gap by providing a characterization of submodular minimax optimization, the problem of finding a set (for either the min or the max player) that is effective against every possible response. We show when and under what conditions we can find such sets. We also demonstrate how minimax submodular optimization provides robust solutions for downstream machine learning applications such as (i) prompt engineering in large language models, (ii) identifying robust waiting locations for ride-sharing, (iii) kernelization of the difficulty of instances of the last setting, and (iv) finding adversarial images. Our experiments show that our proposed algorithms consistently outperform other baselines.

**Learning Sampling Policy to Achieve Fewer Queries for Zeroth-Order Optimization**

Zhou Zhai · Wanli Shi · Heng Huang · Yi Chang · Bin Gu

Zeroth-order (ZO) methods, which use the finite difference of two function evaluations (also called ZO gradient) to approximate the first-order gradient, have attracted much attention recently in machine learning because of their broad applications. The accuracy of the ZO gradient highly depends on how many finite differences are averaged, which is intrinsically determined by the number of perturbations randomly drawn from a distribution. Existing ZO methods try to learn a data-driven distribution for sampling the perturbations to improve the efficiency of ZO optimization (ZOO) algorithms. In this paper, we explore a new and parallel direction, i.e., learning an optimal sampling policy, instead of using a totally random strategy, to generate perturbations based on reinforcement learning (RL) techniques, which makes it possible to approximate the gradient with only two function evaluations. Specifically, we first formulate the problem of learning a sampling policy as a Markov decision process. Then, we propose our ZO-RL algorithm, i.e., using deep deterministic policy gradient, an actor-critic RL algorithm, to learn a sampling policy that can guide the generation of perturbed vectors in getting ZO gradients as accurately as possible. Importantly, the existing ZOO algorithms for learning a distribution can be plugged in to improve the exploration of ZO-RL. Experimental results with different ZO estimators show that our ZO-RL algorithm can effectively reduce the query complexity of ZOO algorithms and converge faster than existing ZOO algorithms, especially in the later stage of the optimization process.
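
The two-point finite-difference estimator mentioned in the abstract can be sketched as below. This is a minimal illustration under assumptions: the names `zo_gradient` and `averaged_zo_gradient`, the smoothing radius `mu`, and the Gaussian perturbations are choices made here, and the learned sampling policy of ZO-RL is not reproduced (directions are drawn at random, i.e., the baseline strategy).

```python
import random


def zo_gradient(f, x, direction, mu=1e-4):
    """Two-point ZO estimate along one direction u: ((f(x + mu*u) - f(x)) / mu) * u."""
    shifted = [xi + mu * ui for xi, ui in zip(x, direction)]
    scale = (f(shifted) - f(x)) / mu
    return [scale * ui for ui in direction]


def averaged_zo_gradient(f, x, num_directions, mu=1e-4, seed=0):
    """Average the two-point estimate over random Gaussian directions."""
    rng = random.Random(seed)
    dim = len(x)
    total = [0.0] * dim
    for _ in range(num_directions):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        g = zo_gradient(f, x, u, mu)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / num_directions for t in total]


def linear(v):
    """Toy objective with known gradient (3, -2)."""
    return 3.0 * v[0] - 2.0 * v[1]


# One well-chosen direction recovers one gradient coordinate from two evaluations.
print(zo_gradient(linear, [1.0, 2.0], [1.0, 0.0]))
# Random directions need many averaged finite differences to get close to (3, -2).
print(averaged_zo_gradient(linear, [1.0, 2.0], num_directions=20000))
```

The contrast between the two calls is the point: a good direction makes a single finite difference informative, whereas random directions rely on averaging.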

**Communication Compression for Byzantine Robust Learning: New Efficient Algorithms and Improved Rates**

Ahmad Rammal · Kaja Gruntkowska · Nikita Fedin · Eduard Gorbunov · Peter Richtarik

Byzantine robustness is an essential feature of algorithms for certain distributed optimization problems, typically encountered in collaborative/federated learning. These problems are usually huge-scale, implying that communication compression is also imperative for their resolution. These factors have spurred recent algorithmic and theoretical developments in the literature of Byzantine-robust learning with compression. In this paper, we contribute to this research area in two main directions. First, we propose a new Byzantine-robust method with compression, Byz-DASHA-PAGE, and prove that the new method has a better convergence rate (for non-convex and Polyak-Lojasiewicz smooth optimization problems) and a smaller neighborhood size in the heterogeneous case, and tolerates more Byzantine workers under over-parametrization than the previous method with SOTA theoretical convergence guarantees (Byz-VR-MARINA). Second, we develop the first Byzantine-robust method with communication compression and error feedback, Byz-EF21, along with its bi-directional compression version, Byz-EF21-BC, and derive the convergence rates of these methods for the non-convex and Polyak-Lojasiewicz smooth cases. We test the proposed methods and illustrate our theoretical findings in the numerical experiments.

**Risk Seeking Bayesian Optimization under Uncertainty for Obtaining Extremum**

Shogo Iwazaki · Tomohiko Tanabe · Mitsuru Irie · Shion Takeno · Yu Inatsu

Real-world black-box optimization tasks often focus on obtaining the best reward, which includes an intrinsic random quantity from uncontrollable environmental factors. For this problem, we formulate a novel risk-seeking optimization problem whose aim is to obtain the best possible reward within a fixed budget under uncontrollable factors. We consider two settings: (1) the environmental model setting, in which we can observe uncontrollable environmental variables that affect the observation as the input of a target function, and (2) the heteroscedastic model setting, in which no uncontrollable variables can be observed. We propose a novel Bayesian optimization method called kernel explore-then-commit (kernel-ETC) and provide the regret upper bound for both settings. We demonstrate the effectiveness of kernel-ETC through several numerical experiments, including a hyperparameter tuning task and a simulation function derived from real polymer synthesis data.

**Absence of spurious solutions far from ground truth: A low-rank analysis with high-order losses**

Ziye Ma · Ying Chen · Javad Lavaei · Somayeh Sojoudi

Matrix sensing problems exhibit pervasive non-convexity, plaguing optimization with a proliferation of suboptimal spurious solutions. Avoiding convergence to these critical points poses a major challenge. This work provides new theoretical insights that help demystify the intricacies of the non-convex landscape. In this work, we prove that under certain conditions, critical points sufficiently distant from the ground truth matrix exhibit favorable geometry by being strict saddle points rather than troublesome local minima. Moreover, we introduce the notion of higher-order losses for the matrix sensing problem and show that the incorporation of such losses into the objective function amplifies the negative curvature around those distant critical points. This implies that increasing the complexity of the objective function via high-order losses accelerates the escape from such critical points and acts as a desirable alternative to increasing the complexity of the optimization problem via over-parametrization. By elucidating key characteristics of the non-convex optimization landscape, this work makes progress towards a comprehensive framework for tackling broader machine learning objectives plagued by non-convexity.

**A 4-Approximation Algorithm for Min Max Correlation Clustering**

Holger Heidrich · Jannik Irmai · Bjoern Andres

We introduce a lower bounding technique for the min max correlation clustering problem and, based on this technique, a combinatorial 4-approximation algorithm for complete graphs. This improves upon the previous best known approximation guarantees of 5, using a linear program formulation (Kalhan et al., 2019), and 40, for a combinatorial algorithm (Davies et al., 2023). We extend this algorithm by a greedy joining heuristic and show empirically that it improves the state of the art in solution quality and runtime on several benchmark datasets.

**An Efficient Stochastic Algorithm for Decentralized Nonconvex-Strongly-Concave Minimax Optimization**

Lesi Chen · Haishan Ye · Luo Luo

This paper studies the stochastic nonconvex-strongly-concave minimax optimization over a multi-agent network. We propose an efficient algorithm, called Decentralized Recursive gradient descEnt Ascent Method (DREAM), which achieves the best-known theoretical guarantee for finding the $\epsilon$-stationary points. Concretely, it requires $\mathcal{O}(\min (\kappa^3\epsilon^{-3},\kappa^2 \sqrt{N} \epsilon^{-2} ))$ stochastic first-order oracle (SFO) calls and $\tilde{\mathcal{O}}(\kappa^2 \epsilon^{-2})$ communication rounds, where $\kappa$ is the condition number and $N$ is the total number of individual functions. Our numerical experiments also validate the superiority of DREAM over previous methods.

**Near-Optimal Convex Simple Bilevel Optimization with a Bisection Method**

Jiulin Wang · Xu Shi · Rujun Jiang

This paper studies a class of simple bilevel optimization problems where we minimize a composite convex function at the upper-level subject to a composite convex lower-level problem. Existing methods either provide asymptotic guarantees for the upper-level objective or attain slow sublinear convergence rates. We propose a bisection algorithm to find a solution that is $\epsilon_f$-optimal for the upper-level objective and $\epsilon_g$-optimal for the lower-level objective. In each iteration, the binary search narrows the interval by assessing inequality system feasibility. Under mild conditions, the total operation complexity of our method is $\mathcal{O}\left(\max\{\sqrt{L_{f_1}/\epsilon_f},\sqrt{L_{g_1}/\epsilon_g}\} \right)$. Here, a unit operation can be a function evaluation, gradient evaluation, or the invocation of the proximal mapping, $L_{f_1}$ and $L_{g_1}$ are the Lipschitz constants of the upper- and lower-level objectives' smooth components, and $\mathcal{O}$ hides logarithmic terms. Our approach achieves a near-optimal rate in unconstrained smooth or composite convex optimization when disregarding logarithmic terms. Numerical experiments demonstrate the effectiveness of our method.

**Multi-Level Symbolic Regression: Function Structure Learning for Multi-Level Data**

Kei Sen Fong · Mehul Motani

Symbolic Regression (SR) is an approach which learns a closed-form function relating the predictors to the outcome in a dataset. Datasets are often multi-level (MuL), meaning that certain features can be used to split data into groups for analysis (we refer to these features as levels). The advantage of viewing datasets as MuL is that we can exploit the high similarity of data within a group. SR is well-suited for MuL datasets, in which the learnt function structure serves as "shared information" between the groups while the learnt parameter values capture the unique relationships within each group. In this context, this paper makes three contributions: (i) We design an algorithm, Multi-level Symbolic Regression (MSR), which runs multiple parallel SR processes for each group and merges them to produce a single function structure. (ii) To tackle datasets that are not explicitly MuL, we develop a metric termed MLICC to select the best feature to serve as a level. (iii) We also release MSRBench, a database of MuL datasets (synthetic and real-world) which we developed and collated, that can be used to evaluate MSR. Our results and ablation studies demonstrate that MSR achieves a higher recovery rate and lower error on MSRBench compared to SOTA methods for SR and MuL datasets.

**Extragradient Type Methods for Riemannian Variational Inequality Problems**

Zihao Hu · Guanghui Wang · Xi Wang · Andre Wibisono · Jacob Abernethy · Molei Tao

In this work, we consider monotone Riemannian Variational Inequality Problems (RVIPs), which encompass both Riemannian convex optimization and minimax optimization as particular cases. In Euclidean space, the last-iterates of both the extragradient (EG) and past extragradient (PEG) methods converge to the solution of monotone variational inequality problems at a rate of $O\left(\frac{1}{\sqrt{T}}\right)$ (Cai et al., 2022). However, analogous behavior on Riemannian manifolds remains open. To bridge this gap, we introduce the Riemannian extragradient (REG) and Riemannian past extragradient (RPEG) methods. We demonstrate that both exhibit $O\left(\frac{1}{\sqrt{T}}\right)$ last-iterate convergence and $O\left(\frac{1}{{T}}\right)$ average-iterate convergence, aligning with observations in the Euclidean case. These results are enabled by judiciously addressing the holonomy effect so that additional complications in Riemannian cases can be reduced and the Euclidean proof inspired by the performance estimation problem (PEP) technique or the sum-of-squares (SOS) technique can be applied again.
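
For orientation, the Euclidean extragradient (EG) update that the Riemannian methods generalize can be sketched as follows. This is a minimal sketch, assuming a flat space and a toy bilinear saddle problem $\min_x \max_y xy$; the names `extragradient` and `bilinear_operator` are hypothetical, and the manifold operations needed by REG and RPEG are not shown.

```python
def extragradient(operator, z0, step, num_iters):
    """Euclidean extragradient: probe with a half step, then update using the probed operator value."""
    z = list(z0)
    for _ in range(num_iters):
        g = operator(z)
        z_half = [zi - step * gi for zi, gi in zip(z, g)]
        g_half = operator(z_half)
        z = [zi - step * gi for zi, gi in zip(z, g_half)]
    return z


def bilinear_operator(z):
    """Monotone operator F(x, y) = (y, -x) of the saddle problem min_x max_y x*y."""
    x, y = z
    return [y, -x]


# The unique solution of this toy problem is the origin.
z_final = extragradient(bilinear_operator, [1.0, 1.0], step=0.5, num_iters=200)
print(z_final)
```

On this toy problem the plain simultaneous update `z - step * F(z)` moves away from the origin at every step, whereas the extra probing step makes the iterates contract toward it.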

Natural gradient methods have been used to optimise the parameters of probability distributions in a variety of settings, often resulting in fast-converging procedures. Unfortunately, for many distributions of interest, computing the natural gradient has a number of challenges. In this work we propose a novel technique for tackling such issues, which involves reframing the optimisation as one with respect to the parameters of a surrogate distribution, for which computing the natural gradient is easy. We give several examples of existing methods that can be interpreted as applying this technique, and propose a new method for applying it to a wide variety of problems. Our method expands the set of distributions that can be efficiently targeted with natural gradients. Furthermore, it is fast, easy to understand, simple to implement using standard autodiff software, and does not require lengthy model-specific derivations. We demonstrate our method on maximum likelihood estimation and variational inference tasks.

**Lexicographic Optimization: Algorithms and Stability**

Jacob Abernethy · Robert Schapire · Umar Syed

A lexicographic maximum of a set $X \subseteq \mathbb{R}^n$ is a vector in $X$ whose smallest component is as large as possible, and subject to that requirement, whose second smallest component is as large as possible, and so on for the third smallest component, etc. Lexicographic maximization has numerous practical and theoretical applications, including fair resource allocation, analyzing the implicit regularization of learning algorithms, and characterizing refinements of game-theoretic equilibria. We prove that a minimizer in $X$ of the exponential loss function $L_c(x) = \sum_i \exp(-c x_i)$ converges to a lexicographic maximum of $X$ as $c \to \infty$, provided that $X$ is *stable* in the sense that a well-known iterative method for finding a lexicographic maximum of $X$ cannot be made to fail simply by reducing the required quality of each iterate by an arbitrarily tiny degree. Our result holds for both near and exact minimizers of the exponential loss, while earlier convergence results made much stronger assumptions about the set $X$ and only held for the exact minimizer. We are aware of no previous results showing a connection between the iterative method for computing a lexicographic maximum and exponential loss minimization. We show that every convex polytope is stable, but that there exist compact, convex sets that are not stable. We also provide the first analysis of the convergence rate of an exponential loss minimizer (near or exact) and discover a curious dichotomy: While the two smallest components of the vector converge to the lexicographically maximum values very quickly (at roughly the rate $\frac{\log n}{c}$), all other components can converge arbitrarily slowly.
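
The definition and the exponential loss $L_c$ can be tried out on a small finite candidate set. This is only a toy sketch: the candidate vectors and function names are made up for illustration, and it does not check the stability condition or reproduce the paper's analysis.

```python
import math


def lexicographic_max(candidates):
    """Pick the vector whose ascending-sorted components are lexicographically largest."""
    return max(candidates, key=lambda x: sorted(x))


def exp_loss(x, c):
    """Exponential loss L_c(x) = sum_i exp(-c * x_i)."""
    return sum(math.exp(-c * xi) for xi in x)


def exp_loss_minimizer(candidates, c):
    """Exact minimizer of L_c over a finite candidate set."""
    return min(candidates, key=lambda x: exp_loss(x, c))


# Illustrative candidate set (not from the paper).
X = [(1, 5, 1), (2, 2, 2), (2, 3, 2), (0, 9, 9)]

print(lexicographic_max(X))          # smallest component 2, then 2, then 3
print(exp_loss_minimizer(X, c=0.1))  # small c: the loss still rewards large components
print(exp_loss_minimizer(X, c=5.0))  # larger c: the smallest components dominate
```

For small `c` the loss minimizer on this set is `(0, 9, 9)`, while for `c = 5.0` it coincides with the lexicographic maximum `(2, 3, 2)`.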

**Escaping Saddle Points in Heterogeneous Federated Learning via Distributed SGD with Communication Compression**

Sijin Chen · Zhize Li · Yuejie Chi

We consider the problem of finding second-order stationary points in the optimization of heterogeneous federated learning (FL). Previous works in FL mostly focus on first-order convergence guarantees, which do not rule out the scenario of unstable saddle points. Meanwhile, it is a key bottleneck of FL to achieve communication efficiency without compromising the learning accuracy, especially when local data are highly heterogeneous across different clients. Given this, we propose a novel algorithm PowerEF-SGD that only communicates compressed information via a novel error-feedback scheme. To our knowledge, PowerEF-SGD is the first distributed and compressed SGD algorithm that provably escapes saddle points in heterogeneous FL without any data homogeneity assumptions. In particular, PowerEF-SGD improves to second-order stationary points after visiting first-order (possibly saddle) points, using additional gradient queries and communication rounds only of almost the same order required by first-order convergence, and the convergence rate shows a linear-speedup pattern in terms of the number of workers. Our theory improves/recovers previous results, while extending to much more tolerant settings on the local data. Numerical experiments are provided to complement the theory.

**Complexity of Single Loop Algorithms for Nonlinear Programming with Stochastic Objective and Constraints**

Ahmet Alacaoglu · Stephen Wright

We analyze the sample complexity of single-loop quadratic penalty and augmented Lagrangian algorithms for solving nonconvex optimization problems with functional equality constraints. We consider three cases, in all of which the objective is stochastic, that is, an expectation over an unknown distribution that is accessed by sampling. The nature of the equality constraints differs among the three cases: deterministic and linear in the first case, deterministic and nonlinear in the second case, and stochastic and nonlinear in the third case. Variance reduction techniques are used to improve the complexity. To find a point that satisfies $\varepsilon$-approximate first-order conditions, we require $\widetilde{O}(\varepsilon^{-3})$ complexity in the first case, $\widetilde{O}(\varepsilon^{-4})$ in the second case, and $\widetilde{O}(\varepsilon^{-5})$ in the third case. For the first and third cases, they are the first algorithms of "single loop" type that also use $O(1)$ samples at each iteration and still achieve the best-known complexity guarantees.

**Fixed-kinetic Neural Hamiltonian Flows for enhanced interpretability and reduced complexity**

Vincent Souveton · Arnaud Guillin · Jens Jasche · Guilhem Lavaux · Manon Michel

Normalizing Flows (NF) are Generative models which transform a simple prior distribution into the desired target. They however require the design of an invertible mapping whose Jacobian determinant has to be computable. Recently introduced, Neural Hamiltonian Flows (NHF) are Hamiltonian dynamics-based flows, which are continuous, volume-preserving and invertible and thus make for natural candidates for robust NF architectures. In particular, their similarity to classical Mechanics could lead to easier interpretability of the learned mapping. In this paper, we show that the current NHF architecture may still pose a challenge to interpretability. Inspired by Physics, we introduce a fixed-kinetic energy version of the model. This approach improves interpretability and robustness while requiring fewer parameters than the original model. We illustrate that on a 2D Gaussian mixture and on the MNIST and Fashion-MNIST datasets. Finally, we show how to adapt NHF to the context of Bayesian inference and illustrate the method on an example from cosmology.

**Online Learning of Decision Trees with Thompson Sampling**

Ayman Chaouki · Jesse Read · Albert Bifet

Decision Trees are prominent prediction models for interpretable Machine Learning. They have been thoroughly researched, mostly in the batch setting with a fixed labelled dataset, leading to popular algorithms such as C4.5, ID3 and CART. Unfortunately, these methods are heuristic in nature: they rely on greedy splits, offering no guarantees of global optimality and often leading to unnecessarily complex and hard-to-interpret Decision Trees. Recent breakthroughs addressed this suboptimality issue in the batch setting, but no such work has considered the online setting with data arriving in a stream. To this end, we devise a new Monte Carlo Tree Search algorithm, Thompson Sampling Decision Trees (TSDT), able to produce optimal Decision Trees in an online setting. We analyse our algorithm and prove its almost sure convergence to the optimal tree. Furthermore, we conduct extensive experiments to validate our findings empirically. The proposed TSDT outperforms existing algorithms on several benchmarks, all while presenting the practical advantage of being tailored to the online setting.

**An Impossibility Theorem for Node Embedding**

T. Mitchell Roddenberry · Yu Zhu · Santiago Segarra

With the increasing popularity of graph-based methods for dimensionality reduction and representation learning, node embedding functions have become important objects of study in the literature. In this paper, we take an axiomatic approach to understanding node embedding methods. Motivated by desirable properties of node embeddings for encoding the role of a node in the structure of a network, we first state three properties for embedding dissimilarity networks. We then prove that no node embedding method can satisfy all three properties at once, reflecting fundamental difficulties inherent to the task. Having identified these difficulties, we show that mild relaxations of these axioms allow for certain node embedding methods to be admissible.

**Improved Sample Complexity Analysis of Natural Policy Gradient Algorithm with General Parameterization for Infinite Horizon Discounted Reward Markov Decision Processes**

Washim Uddin Mondal · Vaneet Aggarwal

We consider the problem of designing sample efficient learning algorithms for infinite horizon discounted reward Markov Decision Process. Specifically, we propose the Accelerated Natural Policy Gradient (ANPG) algorithm that utilizes an accelerated stochastic gradient descent process to obtain the natural policy gradient. ANPG achieves $\mathcal{O}({\epsilon^{-2}})$ sample complexity and $\mathcal{O}(\epsilon^{-1})$ iteration complexity with general parameterization where $\epsilon$ defines the optimality error. This improves the state-of-the-art sample complexity by a $\log(\frac{1}{\epsilon})$ factor. ANPG is a first-order algorithm and unlike some existing literature, does not require the unverifiable assumption that the variance of importance sampling (IS) weights is upper bounded. In the class of Hessian-free and IS-free algorithms, ANPG beats the best-known sample complexity by a factor of $\mathcal{O}(\epsilon^{-\frac{1}{2}})$ and simultaneously matches their state-of-the-art iteration complexity.

**Learning Fair Division from Bandit Feedback**

Hakuei Yamada · Junpei Komiyama · Kenshi Abe · Atsushi Iwasaki

This work addresses learning online fair division under uncertainty, where a central planner sequentially allocates items without precise knowledge of agents’ values or utilities. Departing from conventional online algorithms, the planner here relies on noisy, estimated values obtained after allocating items. We introduce wrapper algorithms utilizing dual averaging, enabling gradual learning of both the type distribution of arriving items and agents’ values through bandit feedback. This approach enables the algorithms to asymptotically achieve optimal Nash social welfare in linear Fisher markets with agents having additive utilities. We also empirically verify the performance of the proposed algorithms across synthetic and empirical datasets.

**Optimal Transport for Measures with Noisy Tree Metric**

Tam Le · Truyen Nguyen · Kenji Fukumizu

We study the optimal transport (OT) problem for probability measures supported on a tree metric space. It is known that this OT problem (i.e., tree-Wasserstein (TW)) admits a closed-form expression, but depends fundamentally on the underlying tree structure over supports of input measures. In practice, however, the given tree structure may be perturbed due to noisy or adversarial measurements. To mitigate this issue, we follow the max-min robust OT approach, which considers the maximal possible distances between two input measures over an uncertainty set of tree metrics. In general, this approach is hard to compute, even for measures supported in one-dimensional space, due to its non-convexity and non-smoothness, which hinder its practical applications, especially for large-scale settings. In this work, we propose novel uncertainty sets of tree metrics through the lens of edge deletion/addition, which cover a diversity of tree structures in an elegant framework. Consequently, by building upon the proposed uncertainty sets, and leveraging the tree structure over supports, we show that the robust OT also admits a closed-form expression for fast computation, like its standard OT counterpart (i.e., TW). Furthermore, we demonstrate that the robust OT satisfies the metric property and is negative definite. We then exploit its negative definiteness to propose positive definite kernels and test them in several simulations on various real-world datasets on document classification and topological data analysis.
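
The closed form of the standard tree-Wasserstein distance sums, over every edge, the edge length times the absolute difference of the two measures' mass in the subtree below that edge. The sketch below computes this for a small rooted tree; the dictionary-based tree encoding and the name `tree_wasserstein` are assumptions made for illustration, and the robust variant proposed in the paper is not reproduced.

```python
def tree_wasserstein(parent, weight, mu, nu):
    """Standard TW distance on a rooted tree.

    parent[v] is the parent of node v (None for the root);
    weight[v] is the length of the edge between v and parent[v];
    mu and nu map nodes to probability mass (missing nodes carry zero mass).
    """
    nodes = list(parent)

    def depth(v):
        d = 0
        while parent[v] is not None:
            v = parent[v]
            d += 1
        return d

    # Net mass (mu - nu) sitting at each node, later accumulated over subtrees.
    subtree_diff = {v: mu.get(v, 0.0) - nu.get(v, 0.0) for v in nodes}
    # Visit deepest nodes first so each subtree total is complete before it is passed up.
    for v in sorted(nodes, key=depth, reverse=True):
        if parent[v] is not None:
            subtree_diff[parent[v]] += subtree_diff[v]
    return sum(
        weight[v] * abs(subtree_diff[v]) for v in nodes if parent[v] is not None
    )


# Toy star tree: root "r" with leaves "a" (edge length 1) and "b" (edge length 3).
parent = {"r": None, "a": "r", "b": "r"}
weight = {"a": 1.0, "b": 3.0}
# Moving one unit of mass from "a" to "b" travels along both edges: 1 + 3 = 4.
print(tree_wasserstein(parent, weight, {"a": 1.0}, {"b": 1.0}))
```

No optimization problem is solved here: the cost follows from a single pass over the edges, which is what makes the closed form attractive for large-scale use.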

**Probabilistic Calibration by Design for Neural Network Regression**

Victor Dheur · Souhaib Ben Taieb

Generating calibrated and sharp neural network predictive distributions for regression problems is essential for optimal decision-making in many real-world applications. To address the miscalibration issue of neural networks, various methods have been proposed to improve calibration, including post-hoc methods that adjust predictions after training and regularization methods that act during training. While post-hoc methods have shown better improvement in calibration compared to regularization methods, the post-hoc step is completely independent of model training. We introduce a novel end-to-end model training procedure called Quantile Recalibration Training, integrating post-hoc calibration directly into the training process without additional parameters. We also present a unified algorithm that includes our method and other post-hoc and regularization methods, as particular cases. We demonstrate the performance of our method in a large-scale experiment involving 57 tabular regression datasets, showcasing improved predictive accuracy while maintaining calibration. We also conduct an ablation study to evaluate the significance of different components within our proposed method, as well as an in-depth analysis of the impact of the base model and different hyperparameters on predictive accuracy.

**A Scalable Algorithm for Individually Fair k-Means Clustering**

MohammadHossein Bateni · Vincent Cohen-Addad · Alessandro Epasto · Silvio Lattanzi

We present a scalable algorithm for the individually fair ($p$,$k$)-clustering problem introduced by Jung et al. and Mahabadi et al. Given $n$ points $P$ in a metric space, let $\delta(x)$ for $x\in P$ be the radius of the smallest ball around $x$ containing at least $n / k$ points. A clustering is then called individually fair if it has centers within distance $\delta(x)$ of $x$ for each $x\in P$. While good approximation algorithms are known for this problem, no efficient practical algorithms with good theoretical guarantees have been presented. We design the first fast local-search algorithm that runs in $\tilde{O}(nk^2)$ time and obtains a bicriteria $(O(1), 6)$ approximation. Then we show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
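
The fairness radius $\delta(x)$ and the individual-fairness check defined above can be computed directly by brute force. This is a minimal sketch: the function names are hypothetical, Euclidean distance is assumed as the metric, each point is assumed to count toward its own ball, and the local-search algorithm of the paper is not reproduced.

```python
import math


def fair_radius(points, k):
    """delta(x): radius of the smallest ball around x holding at least n/k points (x included)."""
    needed = math.ceil(len(points) / k)
    radii = []
    for x in points:
        dists = sorted(math.dist(x, y) for y in points)
        radii.append(dists[needed - 1])
    return radii


def is_individually_fair(points, centers, k):
    """True if every point x has some center within distance delta(x)."""
    radii = fair_radius(points, k)
    return all(
        min(math.dist(x, c) for c in centers) <= r
        for x, r in zip(points, radii)
    )


# Two well-separated groups on a line (illustrative data only).
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
print(fair_radius(points, k=2))
print(is_individually_fair(points, centers=[(1.0,), (11.0,)], k=2))
print(is_individually_fair(points, centers=[(0.0,), (1.0,)], k=2))
```

Placing both centers in the left group leaves the right group without a nearby center, which is exactly the situation the individual-fairness constraint rules out. The brute-force radius computation costs quadratic time in the number of points and is meant only to make the definition concrete.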

**Offline Primal-Dual Reinforcement Learning for Linear MDPs**

Germano Gabbianelli · Gergely Neu · Matteo Papini · Nneka Okolo

Offline Reinforcement Learning (RL) aims to learn a near-optimal policy from a fixed dataset of transitions collected by another policy. This problem has attracted a lot of attention recently, but most existing methods with strong theoretical guarantees are restricted to finite-horizon or tabular settings. In contrast, few algorithms for infinite-horizon settings with function approximation and minimal assumptions on the dataset are both sample and computationally efficient. Another gap in the current literature is the lack of theoretical analysis for the average-reward setting, which is more challenging than the discounted setting. In this paper, we address both of these issues by proposing a primal-dual optimization method based on the linear programming formulation of RL. Our key contribution is a new reparametrization that allows us to derive low-variance gradient estimators that can be used in a stochastic optimization scheme using only samples from the behavior policy. Our method finds an $\varepsilon$-optimal policy with $O(\varepsilon^{-4})$ samples, while being computationally efficient for infinite-horizon discounted and average-reward MDPs with realizable linear function approximation and partial coverage. Moreover, to the best of our knowledge, this is the first theoretical result for average-reward offline RL.

**Learning multivariate temporal point processes via the time-change theorem**

Guilherme Augusto Zagatti · See Kiong Ng · Stéphane Bressan

Marked temporal point processes (TPPs) are a class of stochastic processes that describe the occurrence of a countable number of marked events over continuous time. In machine learning, the most common representation of marked TPPs is the univariate TPP coupled with a conditional mark distribution. Alternatively, we can represent marked TPPs as a multivariate temporal point process in which we model each sequence of marks interdependently. We introduce a learning framework for multivariate TPPs leveraging recent progress on learning univariate TPPs via time-change theorems to propose a deep-learning, invertible model for the conditional intensity. We rely neither on Monte Carlo approximation for the compensator nor on thinning for sampling. Therefore, we have a generative model that can efficiently sample the next event given a history of past events. Our models show strong alignment between the percentiles of the distribution expected from theory and the empirical ones.

**Model-based Policy Optimization under Approximate Bayesian Inference**

Chaoqi Wang · Yuxin Chen · Kevin Murphy

Model-based reinforcement learning (MBRL) algorithms present an exceptional potential to enhance sample efficiency within the realm of online reinforcement learning (RL). Nevertheless, a substantial proportion of prevalent MBRL algorithms fail to adequately address the dichotomy of exploration and exploitation. Posterior sampling reinforcement learning (PSRL) emerges as an innovative strategy adept at balancing exploration and exploitation, although its theoretical guarantees are contingent upon exact inference. In this paper, we show that adopting the same methodology as in exact PSRL can be suboptimal under approximate inference. Motivated by the analysis, we propose an improved factorization for the posterior distribution of policies by removing the conditional independence between the policy and data given the model. By adopting such a posterior factorization, we further propose a general algorithmic framework for PSRL under approximate inference and a practical instantiation of it. Empirically, our algorithm can surpass baseline methods by a significant margin on both dense-reward and sparse-reward tasks from the DeepMind control suite, OpenAI Gym and Metaworld benchmarks.

**Training a Tucker Model With Shared Factors: a Riemannian Optimization Approach**

Ivan Peshekhonov · Aleksei Arzhantsev · Maxim Rakhuba

Factorization of a matrix into a product of two rectangular factors is a classic tool in various machine learning applications. Tensor factorizations generalize this concept to more than two dimensions. In applications where some of the tensor dimensions have the same size or encode the same objects (e.g., knowledge graphs with entity-relation-entity 3D tensors), it can also be beneficial for the respective factors to be shared. In this paper, we consider the well-known Tucker tensor factorization and study its properties under the shared factor constraint. We call it a shared-factor Tucker factorization (SF-Tucker). Since sharing factors breaks polylinearity of classical tensor factorizations, common optimization schemes such as alternating least squares become inapplicable. Nevertheless, as we show in this paper, a set of fixed-rank SF-Tucker tensors preserves a Riemannian manifold structure. Therefore, we develop efficient algorithms for the main ingredients of Riemannian optimization on the SF-Tucker manifold and implement a Riemannian optimization method with momentum. We showcase the benefits of our approach on several machine learning tasks including knowledge graph completion and compression of neural networks.
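
To make the shared-factor constraint concrete, here is a minimal sketch of reconstructing a 3D tensor from a Tucker core in which the first and third modes use the same factor matrix, following the entity-relation-entity example above. The choice of which modes share a factor, the shapes, and the function name are assumptions for illustration; the Riemannian optimization itself is not shown.

```python
import numpy as np


def sf_tucker_reconstruct(core, entity_factor, relation_factor):
    """X[i, j, k] = sum_{a,b,c} core[a, b, c] * U[i, a] * V[j, b] * U[k, c],
    where U (entity_factor) is shared by modes 1 and 3."""
    return np.einsum(
        "abc,ia,jb,kc->ijk", core, entity_factor, relation_factor, entity_factor
    )


rng = np.random.default_rng(0)
n_entities, n_relations = 6, 4  # hypothetical sizes
rank_e, rank_r = 3, 2           # hypothetical Tucker ranks

core = rng.standard_normal((rank_e, rank_r, rank_e))
U = rng.standard_normal((n_entities, rank_e))
V = rng.standard_normal((n_relations, rank_r))

X = sf_tucker_reconstruct(core, U, V)

# Scaling the shared factor by 2 scales the tensor by 4, not 2:
# the map is quadratic in U, so it is no longer linear in each factor.
X_scaled = sf_tucker_reconstruct(core, 2.0 * U, V)
print(X.shape, np.allclose(X_scaled, 4.0 * X))
```

The quadratic dependence on the shared factor is one way to see why the problem is no longer a linear least-squares problem in each factor separately.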

**Don't Be Pessimistic Too Early: Look K Steps Ahead!**

Chaoqi Wang · Ziyu Ye · Kevin Murphy · Yuxin Chen

Offline reinforcement learning (RL) aims to train highly rewarding policies exclusively from existing data, showing great real-world impact. Pessimism, i.e., avoiding uncertain states or actions during decision making, has long been the main theme for offline RL. However, existing works often lead to overly conservative policies with rather sub-optimal performance. To tackle this challenge, we introduce the notion of *lookahead pessimism* within the model-based offline RL paradigm. Intuitively, while the classical pessimism principle asks to terminate whenever the RL agent reaches an uncertain region, our method allows the agent to use a lookahead set carefully crafted from the learned model, and to make a move based on properties of the lookahead set. Remarkably, we show that this enables learning a less conservative policy with a better performance guarantee. We refer to our method as Lookahead Pessimistic MDP (LP-MDP). Theoretically, we provide a rigorous analysis of the performance lower bound, which monotonically improves with the lookahead steps. Empirically, with the easy-to-implement design of LP-MDP, we demonstrate a solid performance improvement over baseline methods on widely used offline RL benchmarks.

**Multi-Resolution Active Learning of Fourier Neural Operators**

Shibo Li · Xin Yu · Wei Xing · Robert Kirby · Akil Narayan · Shandian Zhe

Fourier Neural Operator (FNO) is a popular operator learning framework. It not only achieves state-of-the-art performance in many tasks but is also highly efficient in training and prediction. However, collecting training data for the FNO can be a costly bottleneck in practice, because it often demands expensive physical simulations. To overcome this problem, we propose Multi-Resolution Active learning of FNO (MRA-FNO), which can dynamically select the input functions and resolutions to lower the data cost as much as possible while optimizing the learning efficiency. Specifically, we propose a probabilistic multi-resolution FNO and use ensemble Monte-Carlo to develop an effective posterior inference algorithm. To conduct active learning, we maximize a utility-cost ratio as the acquisition function to acquire new examples and resolutions at each step. We use moment matching and the matrix determinant lemma to enable tractable, efficient utility computation. Furthermore, we develop a cost annealing framework to avoid over-penalizing high-resolution queries at the early stage. The over-penalization is severe when the cost difference between the resolutions is significant, which often leaves active learning stuck at low-resolution queries with inferior performance. Our method overcomes this problem and applies to general multi-fidelity active learning and optimization problems. We show the advantage of our method in several benchmark operator learning tasks.
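
The matrix determinant lemma mentioned above is a standard identity: for an invertible $n \times n$ matrix $A$ and $n \times k$ matrices $U$ and $V$, $\det(A + UV^\top) = \det(I_k + V^\top A^{-1} U)\,\det(A)$. When $k \ll n$, this replaces a large determinant with a small one. The sketch below only checks the identity numerically; how it enters the utility computation is not specified in the abstract, and the sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 2  # hypothetical sizes: large base matrix, low-rank update

# A symmetric positive definite (hence invertible) base matrix.
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)

U = rng.standard_normal((n, k))
V = rng.standard_normal((n, k))

# Left-hand side: determinant of the updated n x n matrix.
lhs = np.linalg.det(A + U @ V.T)

# Right-hand side: only a k x k determinant involves the update.
small = np.eye(k) + V.T @ np.linalg.solve(A, U)
rhs = np.linalg.det(small) * np.linalg.det(A)

print(np.isclose(lhs, rhs))
```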

**Identifiable Feature Learning for Spatial Data with Nonlinear ICA**

Hermanni Hälvä · Jonathan So · Richard Turner · Aapo Hyvarinen

Recently, nonlinear ICA has surfaced as a popular alternative to the many heuristic models used in deep representation learning and disentanglement. An advantage of nonlinear ICA is that a sophisticated identifiability theory has been developed; in particular, it has been proven that the original components can be recovered under sufficiently strong latent dependencies. Despite this general theory, practical nonlinear ICA algorithms have so far been mainly limited to data with one-dimensional latent dependencies, especially time-series data. In this paper, we introduce a new nonlinear ICA framework that employs $t$-process (TP) latent components which apply naturally to data with higher-dimensional dependency structures, such as spatial and spatio-temporal data. In particular, we develop a new learning and inference algorithm that extends variational inference methods to handle the combination of a deep neural network mixing function with the TP prior, and employs the method of inducing points for computational efficacy. On the theoretical side, we show that such TP independent components are identifiable under very general conditions. Further, Gaussian Process (GP) nonlinear ICA is established as a limit of the TP Nonlinear ICA model, and we prove that the identifiability of the latent components at this GP limit is more restricted. Namely, those components are identifiable if and only if they have distinctly different covariance kernels. Our algorithm and identifiability theorems are explored on simulated spatial data and real world spatio-temporal data.

**Resilient Constrained Reinforcement Learning**

Dongsheng Ding · Zhengyan Huan · Alejandro Ribeiro

We study a class of constrained reinforcement learning (RL) problems in which multiple constraint specifications are not identified before training. It is challenging to identify appropriate constraint specifications due to the undefined trade-off between the reward maximization objective and the constraint satisfaction, which is ubiquitous in constrained decision-making. To tackle this issue, we propose a new constrained RL approach that searches for policy and constraint specifications together. This method adaptively relaxes the constraints according to a relaxation cost introduced in the learning objective. Since this feature mimics how ecological systems adapt to disruptions by altering operation, our approach is termed resilient constrained RL. Specifically, we provide a set of sufficient conditions that balance the constraint satisfaction and the reward maximization in the notion of resilient equilibrium, propose a tractable formulation of resilient constrained policy optimization that takes this equilibrium as an optimal solution, and advocate two resilient constrained policy search algorithms with non-asymptotic convergence guarantees on the optimality gap and constraint satisfaction. Furthermore, we demonstrate the merits and the effectiveness of our approach in computational experiments.

In this work, we propose a novel dimensionality reduction technique, *DiffRed*, which first projects the data matrix $A$ along the first $k_1$ principal components and the residual matrix $A^{*}$ (left after subtracting its rank-$k_1$ approximation) along $k_2$ Gaussian random vectors. We evaluate *M1*, the distortion of mean-squared pairwise distance, and *Stress*, the normalized value of RMS of distortion of the pairwise distances. We rigorously prove that *DiffRed* achieves a general upper bound of $O\left(\sqrt{\frac{1-p}{k_2}}\right)$ on *Stress* and $O\left(\frac{1-p}{\sqrt{k_2\,\rho(A^{*})}}\right)$ on *M1*, where $p$ is the fraction of variance explained by the first $k_1$ principal components and $\rho(A^{*})$ is the *stable rank* of $A^{*}$. These bounds are tighter than the currently known results for random maps. Our extensive experiments on a variety of real-world datasets demonstrate that *DiffRed* achieves near-zero *M1* and much lower values of *Stress* compared to well-known dimensionality reduction techniques. In particular, *DiffRed* can map a 6 million dimensional dataset to 10 dimensions with 54% lower *Stress* than PCA.
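
The two-stage projection described above can be sketched in a few lines of NumPy. This is an illustrative reading of the abstract, not the authors' implementation: centering the data, scaling the Gaussian matrix by $1/\sqrt{k_2}$, and the particular normalization used for the stress value are all assumptions, as are the function names.

```python
import numpy as np


def diffred_sketch(A, k1, k2, seed=0):
    """Project A onto its first k1 principal components, then project the
    residual onto k2 Gaussian random vectors, and concatenate both parts."""
    rng = np.random.default_rng(seed)
    A_centered = A - A.mean(axis=0)  # assumption: data is mean-centered
    _, _, Vt = np.linalg.svd(A_centered, full_matrices=False)
    pcs = Vt[:k1].T                  # (d, k1) principal directions
    Z1 = A_centered @ pcs            # coordinates along the first k1 PCs
    residual = A_centered - Z1 @ pcs.T  # what the rank-k1 part misses
    # Assumption: standard 1/sqrt(k2) scaling for a Gaussian random map.
    G = rng.standard_normal((A.shape[1], k2)) / np.sqrt(k2)
    Z2 = residual @ G
    return np.hstack([Z1, Z2])


def pairwise_distances(X):
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def stress(X, Z):
    """One common normalization: sqrt(sum (d - d')^2 / sum d^2) over pairs."""
    iu = np.triu_indices(X.shape[0], k=1)
    d_orig = pairwise_distances(X)[iu]
    d_low = pairwise_distances(Z)[iu]
    return np.sqrt(((d_orig - d_low) ** 2).sum() / (d_orig ** 2).sum())


rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))
Z = diffred_sketch(A, k1=2, k2=3)
print(Z.shape, stress(A, Z))
```

As a sanity check, setting `k1` to the full dimension leaves an essentially zero residual, so pairwise distances are preserved and the stress value is numerically zero.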

**Learning to Defer to a Population: A Meta-Learning Approach**

Dharmesh Tailor · Aditya Patra · Rajeev Verma · Putra Manggala · Eric Nalisnick

The learning to defer (L2D) framework allows autonomous systems to be safe and robust by allocating difficult decisions to a human expert. All existing work on L2D assumes that each expert is well-identified, and if any expert were to change, the system should be re-trained. In this work, we alleviate this constraint, formulating an L2D system that can cope with never-before-seen experts at test-time. We accomplish this by using a meta-learning architecture for the deferral function: given a small context set to identify the currently available expert, the model can quickly adapt its deferral policy. We also employ an attention mechanism that is able to look for points in the context set that are similar to a given test point, leading to an even more precise assessment of the expert's abilities. In the experiments, we demonstrate the usefulness of this architecture on image recognition, traffic sign detection, and skin lesion diagnosis benchmarks.

**On learning history-based policies for controlling Markov decision processes**

Gandharv Patil · Aditya Mahajan · Doina Precup

Reinforcement learning (RL) folklore suggests that methods of function approximation based on history, such as recurrent neural networks or state abstractions that include past information, outperform those without memory, because function approximation in Markov decision processes (MDP) can lead to a scenario akin to dealing with a partially observable MDP (POMDP). However, formal analysis of history-based algorithms has been limited, with most existing frameworks concentrating on features without historical context. In this paper, we introduce a theoretical framework to examine the behaviour of RL algorithms that control an MDP using feature abstraction mappings based on historical data. Additionally, we leverage this framework to develop a practical RL algorithm and assess its performance across various continuous control tasks.

**Length independent PAC-Bayes bounds for Simple RNNs**

Volodimir Mitarchuk · Clara Lacroce · Rémi Eyraud · Rémi Emonet · Amaury Habrard · Guillaume Rabusseau

While the practical interest of recurrent neural networks (RNNs) is attested, much remains to be done to develop a thorough theoretical understanding of their abilities, particularly in what concerns their learning capacities. A powerful framework to tackle this question is the one of PAC-Bayes theory, which allows one to derive bounds providing guarantees on the expected performance of learning models on unseen data. In this paper, we provide an extensive study on the conditions leading to PAC-Bayes bounds for non-linear RNNs that are independent of the length of the data. The derivation of our results relies on a perturbation analysis on the weights of the network. We prove bounds that hold for *$\beta$-saturated* and *DS $\beta$-saturated* SRNs, classes of RNNs we introduce to formalize saturation regimes of RNNs. The first regime corresponds to the case where the values of the hidden state of the SRN are always close to the boundaries of the activation functions. The second one, closely related to practical observations, only requires that it happens at least once in each component of the hidden state on a sliding window of a given size.

**On the estimation of persistence intensity functions and linear representations of persistence diagrams**

Weichen Wu · Jisu Kim · Alessandro Rinaldo

Persistence diagrams are one of the most popular types of data summaries used in Topological Data Analysis. The prevailing statistical approach to analyzing persistence diagrams is concerned with filtering out topological noise. In this paper, we adopt a different viewpoint and aim at estimating the actual distribution of a random persistence diagram, which captures both topological signal and noise. To that effect, Chazal et al. (2018) proved that, under general conditions, the expected value of a random persistence diagram is a measure admitting a Lebesgue density, called the persistence intensity function. In this paper, we are concerned with estimating the persistence intensity function and a novel, normalized version of it, called the persistence density function. We present a class of kernel-based estimators based on an i.i.d. sample of persistence diagrams and derive estimation rates in the supremum norm. As a direct corollary, we obtain uniform consistency rates for estimating linear representations of persistence diagrams, including Betti numbers and persistence surfaces. Interestingly, the persistence density function delivers stronger statistical guarantees.

We study a novel sequential decision-making setting, namely the dissimilarity bandits. At each round, the learner pulls an arm that provides a stochastic d-dimensional observation vector. The learner aims to identify the pair of arms with the maximum dissimilarity, where such an index is computed over pairs of expected observation vectors. We propose Successive Elimination for Dissimilarity (SED), a fixed-confidence best-pair identification algorithm based on sequential elimination. SED discards individual arms when there is statistical evidence that they cannot belong to a pair of most dissimilar arms and, thus, effectively exploits the structure of the setting by reusing the estimates of the expected observation vectors. We provide results on the sample complexity of SED, depending on {HP}, a novel index characterizing the complexity of identifying the pair of the most dissimilar arms. Then, we provide a sample complexity lower bound, highlighting the challenges of the identification problem for dissimilarity bandits, which is almost matched by our SED. Finally, we compare our approach over synthetically generated data and a realistic environmental monitoring domain against classical and combinatorial best-arm identification algorithms for the cases $d=1$ and $d>1$.

**Consistent Optimal Transport with Empirical Conditional Measures**

Piyushi Manupriya · Rachit Keerti Das · Sayantan Biswas · SakethaNath Jagarlapudi

Given samples from two joint distributions, we consider the problem of Optimal Transportation (OT) between them when conditioned on a common variable. We focus on the general setting where the conditioned variable may be continuous, and the marginals of this variable in the two joint distributions may not be the same. In such settings, standard OT variants cannot be employed, and novel estimation techniques are necessary. Since the main challenge is that the conditional distributions are not explicitly available, the key idea in our OT formulation is to employ kernelized-least-squares terms computed over the joint samples, which implicitly match the transport plan's marginals with the empirical conditionals. Under mild conditions, we prove that our estimated transport plans, as a function of the conditioned variable, are asymptotically optimal. For finite samples, we show that the deviation in terms of our regularized objective is bounded by $O(m^{-1/4})$, where $m$ is the number of samples. We also discuss how the conditional transport plan could be modelled using explicit probabilistic models as well as using implicit generative ones. We empirically verify the consistency of our estimator on synthetic datasets, where the optimal plan is analytically known. When employed in applications like prompt learning for few-shot classification and conditional-generation in the context of predicting cell responses to treatment, our methodology improves upon state-of-the-art methods.

**Horizon-Free and Instance-Dependent Regret Bounds for Reinforcement Learning with General Function Approximation**

Jiayi Huang · Han Zhong · Liwei Wang · Lin Yang

To tackle long planning horizon problems in reinforcement learning with general function approximation, we propose the first algorithm, termed UCRL-WVTR, that achieves a regret bound that is both *horizon-free*, since it eliminates the polynomial dependency on the planning horizon, and *instance-dependent*. The derived regret bound is deemed *sharp*, as it matches the minimax lower bound when specialized to linear mixture MDPs up to logarithmic factors. Furthermore, UCRL-WVTR is *computationally efficient* with access to a regression oracle. The achievement of such a horizon-free, instance-dependent, and sharp regret bound hinges upon (i) novel algorithm designs: weighted value-targeted regression and a high-order moment estimator in the context of general function approximation; and (ii) fine-grained analysis: a novel concentration bound of weighted non-linear least squares and a refined analysis which leads to the tight instance-dependent bound. We also conduct comprehensive experiments to corroborate our theoretical findings.

**Learning Adaptive Kernels for Statistical Independence Tests**

Yixin Ren · Yewei Xia · Hao Zhang · Jihong Guan · Shuigeng Zhou

We propose a novel framework for kernel-based statistical independence tests that enables adaptively learning parameterized kernels to maximize test power. Our framework can effectively address the pitfall inherent in the existing signal-to-noise ratio criterion by modeling the change of the null distribution during the learning process. Based on the proposed framework, we design a new class of kernels that can adaptively focus on the significant dimensions of variables to judge independence, which makes the tests more flexible than using simple kernels that are adaptive only in length-scale, and especially suitable for high-dimensional complex data. Theoretically, we demonstrate the consistency of our independence tests, and show that the non-convex objective function used for learning satisfies the L-smoothness condition, thus benefiting the optimization. Experimental results on both synthetic and real data show the superiority of our method. The source code and datasets are available at https://github.com/renyixin666/HSIC-LK.git.

**On the Nyström Approximation for Preconditioning in Kernel Machines**

Amirhesam Abedsoltan · Parthe Pandit · Luis Rademacher · Mikhail Belkin

Kernel methods are a popular class of nonlinear predictive models in machine learning. Scalable algorithms for learning kernel models need to be iterative in nature, but convergence can be slow due to poor conditioning. Spectral preconditioning is an important tool to speed up the convergence of such iterative algorithms for training kernel models. However, computing and storing a spectral preconditioner can be expensive, which can lead to large computational and storage overheads, precluding the application of kernel methods to problems with large datasets. A Nyström approximation of the spectral preconditioner is often cheaper to compute and store, and has demonstrated success in practical applications. In this paper we analyze the trade-offs of using such an approximated preconditioner. Specifically, we show that a sample of logarithmic size (as a function of the size of the dataset) enables the Nyström-based approximated preconditioner to accelerate gradient descent nearly as well as the exact preconditioner, while also reducing the computational and storage overheads.
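
The Nyström idea underlying the approximation discussed above is to reconstruct a large kernel matrix from its columns at a small set of landmark points: $K \approx K_{nm} K_{mm}^{-1} K_{mn}$. The sketch below shows this generic low-rank approximation only, not the preconditioner analyzed in the abstract; the Gaussian kernel, the bandwidth, the small `jitter` added for numerical stability, and the function names are assumptions.

```python
import numpy as np


def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of X and the rows of Y."""
    sq_dists = (
        (X ** 2).sum(axis=1)[:, None]
        + (Y ** 2).sum(axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))


def nystrom_kernel(X, landmark_idx, bandwidth=1.0, jitter=1e-8):
    """Nystrom approximation K_nm (K_mm + jitter*I)^{-1} K_mn of the kernel matrix."""
    landmarks = X[landmark_idx]
    K_nm = gaussian_kernel(X, landmarks, bandwidth)
    K_mm = gaussian_kernel(landmarks, landmarks, bandwidth)
    K_mm = K_mm + jitter * np.eye(len(landmark_idx))  # stabilizes the solve
    return K_nm @ np.linalg.solve(K_mm, K_nm.T)


def relative_error(K, K_hat):
    return np.linalg.norm(K - K_hat) / np.linalg.norm(K)


rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
K = gaussian_kernel(X, X)

few = rng.choice(100, size=5, replace=False)
err_few = relative_error(K, nystrom_kernel(X, few))
err_all = relative_error(K, nystrom_kernel(X, np.arange(100)))
print(f"5 landmarks: {err_few:.3e}, all points as landmarks: {err_all:.3e}")
```

Only the $n \times m$ and $m \times m$ blocks need to be formed and stored, which is where the savings in computation and memory come from when $m \ll n$.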

**Fitting ARMA Time Series Models without Identification: A Proximal Approach**

Yin Liu · Sam Davanloo Tajbakhsh

Fitting autoregressive moving average (ARMA) time series models requires model identification before parameter estimation. Model identification involves determining the order of the autoregressive and moving average components which is generally performed by inspection of the autocorrelation and partial autocorrelation functions or other offline methods. In this work, we regularize the parameter estimation optimization problem with a non-smooth hierarchical sparsity-inducing penalty based on two path graphs that allow performing model identification and parameter estimation simultaneously. A proximal block coordinate descent algorithm is then proposed to solve the underlying optimization problem efficiently. The resulting model satisfies the required stationarity and invertibility conditions for ARMA models. Numerical results supporting the proposed method are also presented.

**Unsupervised Change Point Detection in Multivariate Time Series**

Daoping Wu · Suhas Gundimeda · Shaoshuai Mou · Christopher Quinn

We consider the challenging problem of unsupervised change point detection in multivariate time series when the number of change points is unknown. Our method eliminates the user's need for careful parameter tuning, enhancing its practicality and usability. Our approach identifies time series segments with similar empirically estimated distributions, coupled with a novel greedy algorithm guided by the minimum description length principle. We provide theoretical guarantees and, through experiments on synthetic and real-world data, provide empirical evidence for its improved performance in identifying meaningful change points in practical settings.

**A Bayesian Learning Algorithm for Unknown Zero-sum Stochastic Games with an Arbitrary Opponent**

Mehdi Jafarnia · Rahul Jain · Ashutosh Nayyar

In this paper, we propose Posterior Sampling Reinforcement Learning for Zero-sum Stochastic Games (PSRL-ZSG), the first online learning algorithm that achieves a Bayesian regret bound of $\tilde{\mathcal{O}}(HS\sqrt{AT})$ in infinite-horizon zero-sum stochastic games with the average-reward criterion. Here $H$ is an upper bound on the span of the bias function, $S$ is the number of states, $A$ is the number of joint actions and $T$ is the horizon. We consider the online setting where the opponent cannot be controlled and can take any arbitrary time-adaptive history-dependent strategy. Our regret bound improves on the best existing regret bound of $\tilde{\mathcal{O}}(\sqrt[3]{DS^2AT^2})$ by Wei et al. (2017) under the same assumption and matches the theoretical lower bound in $T$.

**Near-Optimal Policy Optimization for Correlated Equilibrium in General-Sum Markov Games**

Yang Cai · Haipeng Luo · Chen-Yu Wei · Weiqiang Zheng

We study policy optimization algorithms for computing correlated equilibria in multi-player general-sum Markov Games. Previous results achieve a $\tilde{O}(T^{-1/2})$ convergence rate to a correlated equilibrium and an accelerated $\tilde{O}(T^{-3/4})$ convergence rate to the weaker notion of coarse correlated equilibrium. In this paper, we improve both results significantly by providing an uncoupled policy optimization algorithm that attains a near-optimal $\tilde{O}(T^{-1})$ convergence rate for computing a correlated equilibrium. Our algorithm is constructed by combining two main elements: (i) smooth value updates and (ii) the *optimistic-follow-the-regularized-leader* algorithm with the log barrier regularizer.

**Supervised Feature Selection via Ensemble Gradient Information from Sparse Neural Networks**

Kaiting Liu · Zahra Atashgahi · Ghada Sokar · Mykola Pechenizkiy · Decebal Constantin Mocanu

Feature selection algorithms aim to select a subset of informative features from a dataset to reduce the data dimensionality, consequently saving resource consumption and improving the model's performance and interpretability. In recent years, feature selection based on neural networks has become a new trend, demonstrating superiority over traditional feature selection methods. However, most existing methods use dense neural networks to detect informative features, which requires significant computational and memory overhead. In this paper, taking inspiration from the successful application of local sensitivity analysis on neural networks, we propose a novel resource-efficient supervised feature selection algorithm based on sparse multi-layer perceptron called "GradEnFS". By utilizing the gradient information of various sparse models from different training iterations, our method successfully detects the informative feature subset. We performed extensive experiments on nine classification datasets spanning various domains to evaluate the effectiveness of our method. The results demonstrate that our proposed approach outperforms the state-of-the-art methods in terms of selecting informative features while saving resource consumption substantially. Moreover, we show that using a sparse neural network for feature selection not only alleviates resource consumption but also has a significant advantage over other methods when performing feature selection on noisy datasets.

**Timing as an Action: Learning When to Observe and Act**

Helen Zhou · Audrey Huang · Kamyar Azizzadenesheli · David Childers · Zachary Lipton

In standard reinforcement learning setups, the agent receives observations and performs actions at evenly spaced intervals. However, in many real-world settings, observations are expensive, forcing agents to commit to courses of action for designated periods of time. Consider that doctors, after each visit, typically set not only a treatment plan but also a follow-up date at which that plan might be revised. In this work, we formalize the setup of timing-as-an-action. Through theoretical analysis in the tabular setting, we show that while the choice of delay intervals could be naively folded in as part of a composite action, these actions have a special structure and handling them intelligently yields statistical advantages. Taking a model-based perspective, these gains owe to the fact that delay actions do not add any parameters to the underlying model. For model estimation, we provide provable sample-efficiency improvements, and our experiments demonstrate empirical improvements in both healthcare simulators and classical reinforcement learning environments.
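
One way to read the model-based argument above: if committing to action $a$ for $k$ steps without observing means applying the same one-step dynamics $k$ times, then the $k$-step transition matrix is the matrix power $P_a^k$, so every delay reuses the same one-step parameters. The sketch below works under that assumption in a small tabular setting; the transition matrix, the sizes, and the parameter counts are illustrative rather than taken from the paper.

```python
import numpy as np


def delayed_transition(one_step, delay):
    """Transition matrix after holding one action for `delay` steps,
    assuming the same one-step dynamics apply at every step."""
    if delay < 1:
        raise ValueError("delay must be at least 1")
    return np.linalg.matrix_power(one_step, delay)


# Hypothetical one-step transition matrix for a single action (rows sum to 1).
P_a = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
    [0.0, 0.0, 1.0],
])
P_a_3 = delayed_transition(P_a, 3)
print(P_a_3)

# Free parameters of a tabular transition model.
n_states, n_actions, max_delay = 3, 2, 5
# Shared one-step model: the delay adds no parameters.
shared_params = n_actions * n_states * (n_states - 1)
# Naive composite actions (action, delay): a separate matrix per pair.
composite_params = max_delay * shared_params
print(shared_params, composite_params)
```

Treating each (action, delay) pair as an unrelated composite action multiplies the number of quantities to estimate by the number of delays, which gives some intuition for the statistical advantage claimed above.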

The multi-modal nature of neural loss landscapes is often considered to be the main driver behind the empirical success of deep ensembles. In this work, we probe this belief by constructing various "connected" ensembles which are restricted to lie in the same basin. Through our experiments, we demonstrate that increased connectivity indeed negatively impacts performance. However, when incorporating the knowledge from other basins implicitly through distillation, we show that the gap in performance can be mitigated by re-discovering (multi-basin) deep ensembles within a single basin. Thus, we conjecture that while the extra-basin knowledge is at least partially present in any given basin, it cannot be easily harnessed without learning it from other basins.

**Independent Learning in Constrained Markov Potential Games**

Philip Jordan · Anas Barakat · Niao He

Constrained Markov games offer a formal mathematical framework for modeling multi-agent reinforcement learning problems where the behavior of the agents is subject to constraints. In this work, we focus on the recently introduced class of constrained Markov Potential Games. While centralized algorithms have been proposed for solving such constrained games, the design of converging independent learning algorithms tailored for the constrained setting remains an open question. We propose an independent policy gradient algorithm for learning approximate constrained Nash equilibria: Each agent observes their own actions and rewards, along with a shared state. Inspired by the optimization literature, our algorithm performs proximal-point-like updates augmented with a regularized constraint set. Each proximal step is solved inexactly using a stochastic switching gradient algorithm. Notably, our algorithm can be implemented independently without a centralized coordination mechanism requiring turn-based agent updates. Under some technical constraint qualification conditions, we establish convergence guarantees towards constrained approximate Nash equilibria. We perform simulations to illustrate our results.

**Low-rank MDPs with Continuous Action Spaces**

Miruna Oprescu · Andrew Bennett · Nathan Kallus

Low-Rank Markov Decision Processes (MDPs) have recently emerged as a promising framework within the domain of reinforcement learning (RL), as they allow for probably approximately correct (PAC) learning guarantees while also incorporating ML algorithms for representation learning. However, current methods for low-rank MDPs are limited in that they only consider finite action spaces, and give vacuous bounds as $|\mathcal{A}| \to \infty$, which greatly limits their applicability. In this work, we study the problem of extending such methods to settings with continuous actions, and explore multiple concrete approaches for performing this extension. As a case study, we consider the seminal FLAMBE algorithm (Agarwal et al., 2020), which is a reward-agnostic method for PAC RL with low-rank MDPs. We show that, without any modifications to the algorithm, we obtain a similar PAC bound when actions are allowed to be continuous. Specifically, when the model for transition functions satisfies a Hölder smoothness condition w.r.t. actions, and either the policy class has a uniformly bounded minimum density or the reward function is also Hölder smooth, we obtain a polynomial PAC bound that depends on the order of smoothness.

Graph Laplacian learning, also known as network topology inference, is a problem of great interest to multiple communities. In Gaussian graphical models (GM), graph learning amounts to endowing covariance selection with the Laplacian structure. In graph signal processing (GSP), it is essential to infer the unobserved graph from the outputs of a filtering system. In this paper, we study the problem of learning Cartesian product graphs under Laplacian constraints. The Cartesian graph product is a natural way for modeling higher-order conditional dependencies and is also the key for generalizing GSP to multi-way tensors. We establish statistical consistency for the penalized maximum likelihood estimation (MLE) of a Cartesian product Laplacian, and propose an efficient algorithm to solve the problem. We also extend our method for efficient joint graph learning and imputation in the presence of structural missing values. Experiments on synthetic and real-world datasets demonstrate that our method is superior to previous GSP and GM methods.

**Efficient Reinforcement Learning for Routing Jobs in Heterogeneous Queueing Systems**

Neharika Jali · Guannan Qu · Weina Wang · Gauri Joshi

We consider the problem of efficiently routing jobs that arrive into a central queue to a system of heterogeneous servers. Unlike in homogeneous systems, a threshold policy that routes jobs to the slow server(s) when the queue length exceeds a certain threshold is known to be optimal for the one-fast-one-slow two-server system. But an optimal policy for the multi-server system is unknown and non-trivial to find. While Reinforcement Learning (RL) has been recognized to have great potential for learning policies in such cases, our problem has an exponentially large state space, rendering standard RL inefficient. In this work, we propose ACHQ, an efficient policy-gradient-based algorithm with a low-dimensional soft threshold policy parameterization that leverages the underlying queueing structure. We provide stationary-point convergence guarantees for the general case and, despite the low-dimensional parameterization, prove that ACHQ converges to an approximate global optimum for the special case of two servers. Simulations demonstrate an improvement in expected response time of up to ${\sim}30\%$ over the greedy policy that routes to the fastest available server.
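
To make the threshold idea concrete, here is a minimal sketch of a hard threshold rule next to one possible soft relaxation. The function names and the sigmoid form of the soft threshold are assumptions chosen for illustration; the abstract does not specify the ACHQ parameterization.

```python
import math


def hard_threshold_route(queue_length: int, threshold: int) -> bool:
    """Return True if a waiting job should be sent to the slow server.

    Implements the rule stated in the text: use the slow server only
    when the queue length exceeds the threshold.
    """
    return queue_length > threshold


def soft_threshold_prob(queue_length: int, threshold: float,
                        temperature: float = 1.0) -> float:
    """Probability of routing to the slow server.

    ASSUMPTION: a sigmoid relaxation of the hard rule, used here only to
    show how a single real-valued threshold can become a differentiable
    policy parameter. It is not taken from the paper.
    """
    return 1.0 / (1.0 + math.exp(-(queue_length - threshold) / temperature))


if __name__ == "__main__":
    for q in (0, 3, 10):
        print(q, hard_threshold_route(q, 3), round(soft_threshold_prob(q, 3.0), 3))
```

A differentiable relaxation of this kind is what lets a policy-gradient method adjust the threshold directly, rather than learning an action for every queue state.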

**Multi-resolution Time-Series Transformer for Long-term Forecasting**

Yitian Zhang · Liheng Ma · Soumyasundar Pal · Yingxue Zhang · Mark Coates

The performance of transformers for time-series forecasting has improved significantly. Recent architectures learn complex temporal patterns by segmenting a time-series into patches and using the patches as tokens. The patch size controls the ability of transformers to learn the temporal patterns at different frequencies: shorter patches are effective for learning localized, high-frequency patterns, whereas mining long-term seasonalities and trends requires longer patches. Inspired by this observation, we propose a novel framework, Multi-resolution Time-Series Transformer (MTST), which consists of a multi-branch architecture for simultaneous modeling of diverse temporal patterns at different resolutions. In contrast to many existing time-series transformers, we employ relative positional encoding, which is better suited for extracting periodic components at different scales. Extensive experiments on several real-world datasets demonstrate the effectiveness of MTST in comparison to state-of-the-art forecasting techniques.
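
As a rough illustration of patch-based tokenization at several resolutions, the sketch below splits one series into patches of different lengths. The helper `make_patches` and the chosen patch sizes are hypothetical; the actual MTST branches, embeddings, and positional encoding are not shown.

```python
def make_patches(series, patch_len, stride=None):
    """Split a 1-D sequence into patches (tokens) of length patch_len.

    With stride == patch_len (the default) the patches do not overlap.
    Trailing values that do not fill a whole patch are dropped.
    """
    stride = stride or patch_len
    last_start = len(series) - patch_len
    return [series[i:i + patch_len] for i in range(0, last_start + 1, stride)]


# One tokenization per resolution: short patches expose local,
# high-frequency structure, long patches expose slower trends.
series = list(range(16))
multi_res = {p: make_patches(series, p) for p in (2, 4, 8)}

if __name__ == "__main__":
    for patch_len, patches in multi_res.items():
        print(patch_len, len(patches), patches[0])
```

Each resolution yields a different number of tokens for the same input, which is why a multi-branch design processes them separately before combining the results.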

**Consistency of Dictionary-Based Manifold Learning**

Samson Koelle · Hanyu Zhang · Octavian-Vlad Murad · Marina Meila

We analyze a paradigm for interpretable Manifold Learning for scientific data analysis, whereby one parametrizes a manifold with $d$ smooth functions from a scientist-provided dictionary of meaningful, domain-related functions. When such a parametrization exists, we provide an algorithm for finding it based on sparse regression in the manifold tangent bundle, bypassing more standard, agnostic manifold learning algorithms. We prove conditions for the existence of such parameterizations in function space and the first end-to-end recovery results from finite samples. The method is demonstrated on both synthetic problems and with data from a real scientific domain.

**Probabilistic Modeling for Sequences of Sets in Continuous-Time**

Yuxin Chang · Alex Boyd · Padhraic Smyth

Neural marked temporal point processes have been a valuable addition to the existing toolbox of statistical parametric models for continuous-time event data. These models are useful for sequences where each event is associated with a single item (a single type of event or a "mark"), but such models are not suited to the practical situation where each event is associated with a set of items. In this work, we develop a general framework for modeling set-valued data in continuous-time, building on recurrent neural point process models. In addition, we develop inference methods that can use such models to answer probabilistic queries such as "the probability of item $A$ being observed before item $B$," conditioned on sequence history. Computing exact answers for such queries is generally intractable for neural models due to both the continuous-time nature of the problem setting and the combinatorially-large space of event types for each event. To address this, we propose a class of importance sampling methods and demonstrate orders-of-magnitude improvements in efficiency over direct sampling via systematic experiments with four real-world datasets. We also illustrate how to use this framework to perform model-selection using likelihoods that do not involve one-step-ahead prediction.

**Decentralized Multi-Level Compositional Optimization Algorithms with Level-Independent Convergence Rate**

Hongchang Gao

Stochastic multi-level compositional optimization problems cover many new machine learning paradigms, e.g., multi-step model-agnostic meta-learning, which require efficient optimization algorithms for large-scale data. This paper studies the decentralized stochastic multi-level optimization algorithm, which is challenging because the multi-level structure and decentralized communication scheme may make the number of levels significantly affect the order of the convergence rate. To this end, we develop two novel decentralized optimization algorithms to optimize the multi-level compositional optimization problem. Our theoretical results show that both algorithms can achieve the level-independent convergence rate for nonconvex problems under much milder conditions compared with existing single-machine algorithms. To the best of our knowledge, this is the first work that achieves the level-independent convergence rate under the decentralized setting. Moreover, extensive experiments confirm the efficacy of our proposed algorithms.

**Sampling-based Safe Reinforcement Learning for Nonlinear Dynamical Systems**

Wesley A Suttle · Vipul Sharma · Krishna Chaitanya Kosaraju · Sivaranjani Seetharaman · Ji Liu · Vijay Gupta · Brian Sadler

We develop provably safe and convergent reinforcement learning (RL) algorithms for control of nonlinear dynamical systems, bridging the gap between the hard safety guarantees of control theory and the convergence guarantees of RL theory. Recent advances at the intersection of control and RL follow a two-stage, safety filter approach to enforcing hard safety constraints: model-free RL is used to learn a potentially unsafe controller, whose actions are projected onto safe sets prescribed, for example, by a control barrier function. Though safe, such approaches lose any convergence guarantees enjoyed by the underlying RL methods. In this paper, we develop a single-stage, sampling-based approach to hard constraint satisfaction that learns RL controllers enjoying classical convergence guarantees while satisfying hard safety constraints throughout training and deployment. We validate the efficacy of our approach in simulation, including safe control of a quadcopter in a challenging obstacle avoidance problem, and demonstrate that it outperforms existing benchmarks.

**A General Theoretical Paradigm to Understand Learning from Human Preferences**

Mohammad Gheshlaghi Azar · Zhaohan Daniel Guo · Bilal Piot · Remi Munos · Mark Rowland · Michal Valko · Daniele Calandriello

The prevalent deployment of learning from human preferences through reinforcement learning (RLHF) relies on two important approximations: the first assumes that pairwise preferences can be substituted with pointwise rewards. The second assumes that a reward model trained on these pointwise rewards can generalize from collected data to out-of-distribution data sampled by the policy. Recently, Direct Preference Optimisation (DPO) has been proposed as an approach that bypasses the second approximation and learns a policy directly from collected data without the reward modelling stage. However, this method still heavily relies on the first approximation. In this paper we try to gain a deeper theoretical understanding of these practical algorithms. In particular we derive a new general objective called ${\Psi}$PO for learning from human preferences that is expressed in terms of pairwise preferences and therefore bypasses both approximations. This new general objective allows us to perform an in-depth analysis of the behavior of RLHF and DPO (as special cases of ${\Psi}$PO) and to identify their potential pitfalls. We then consider another special case for ${\Psi}$PO by setting $\Psi$ simply to Identity, for which we can derive an efficient optimisation procedure, prove performance guarantees and demonstrate its empirical superiority to DPO on some illustrative examples.

**Near-Interpolators: Rapid Norm Growth and the Trade-Off between Interpolation and Generalization**

Yutong Wang · Rishi Sonthalia · Wei Hu

We study the generalization capability of nearly-interpolating linear regressors: ${\beta}$'s whose training error $\tau$ is positive but small, i.e., below the noise floor. Under a random matrix theoretic assumption on the data distribution and an eigendecay assumption on the data covariance matrix ${\Sigma}$, we demonstrate that any near-interpolator exhibits rapid norm growth: for $\tau$ fixed, ${\beta}$ has squared $\ell_2$-norm $\mathbb{E}[\|{{\beta}}\|_{2}^{2}] = \Omega(n^{\alpha})$ where $n$ is the number of samples and $\alpha >1$ is the exponent of the eigendecay, i.e., $\lambda_i({\Sigma}) \sim i^{-\alpha}$. This implies that existing data-independent norm-based bounds are necessarily loose. On the other hand, in the same regime we precisely characterize the asymptotic trade-off between interpolation and generalization. Our characterization reveals that larger norm scaling exponents $\alpha$ correspond to worse trade-offs between interpolation and generalization. We verify empirically that a similar phenomenon holds for nearly-interpolating shallow neural networks.

**On the Generalization Ability of Unsupervised Pretraining**

Yuyang Deng · Junyuan Hong · Jiayu Zhou · Mehrdad Mahdavi

Recent advances in unsupervised learning have shown that unsupervised pre-training, followed by fine-tuning, can improve model generalization. However, a rigorous understanding of how the representation function learned on an unlabeled dataset affects the generalization of the fine-tuned model is lacking. Existing theoretical research does not adequately account for the heterogeneity of the distribution and tasks in the pre-training and fine-tuning stages. To bridge this gap, this paper introduces a novel theoretical framework that illuminates the critical factor influencing the transferability of knowledge acquired during unsupervised pre-training to the subsequent fine-tuning phase, ultimately affecting the generalization capabilities of the fine-tuned model on downstream tasks. We apply our theoretical framework to analyze the generalization bounds of two distinct scenarios: Context Encoder pre-training with deep neural networks and Masked Autoencoder pre-training with deep transformers, followed by fine-tuning on a binary classification task. Finally, inspired by our findings, we propose a novel regularization method during pre-training to further enhance the generalization of the fine-tuned model. Overall, our results contribute to a better understanding of the unsupervised pre-training and fine-tuning paradigm, and can shed light on the design of more effective pre-training algorithms.

Surrogate modeling and active subspaces have emerged as powerful paradigms in computational science and engineering. Porting such techniques to computational models in the social sciences brings into sharp relief their limitations in dealing with discontinuous simulators, such as Agent-Based Models, which have discrete outputs. Nevertheless, prior applied work has shown that surrogate estimates of active subspaces for such estimators can yield interesting results. But given that active subspaces are defined by way of gradients, it is not clear what quantity is being estimated when this methodology is applied to a discontinuous simulator. We begin this article by showing some pathologies that can arise when conducting such an analysis. This motivates an extension of active subspaces to discontinuous functions, clarifying what is actually being estimated in such analyses. We also conduct numerical experiments on synthetic test functions to compare Gaussian process estimates of active subspaces on continuous and discontinuous functions. Finally, we deploy our methodology on Flee, an agent-based model of refugee movement, yielding novel insights into which parameters of the simulation are most important across 8 displacement crises in Africa and the Middle East.

**Continual Domain Adversarial Adaptation via Double-Head Discriminators**

Yan Shen · Zhanghexuan Ji · Chunwei Ma · Mingchen Gao

Domain adversarial adaptation in a continual setting poses significant challenges due to the limitations of accessing previous source domain data. Despite extensive research in continual learning, adversarial adaptation cannot be effectively accomplished using only a small number of stored source domain data, a standard setting in memory replay approaches. This limitation arises from the erroneous empirical estimation of $\mathcal{H}$-divergence with few source domain samples. To tackle this problem, we propose a double-head discriminator algorithm by introducing an additional source-only domain discriminator trained solely on the source learning phase. We prove that by introducing a pre-trained source-only domain discriminator, the empirical estimation error of $\mathcal{H}$-divergence related adversarial loss is reduced from the source domain side. Further experiments on existing domain adaptation benchmarks show that our proposed algorithm achieves more than 2$\%$ improvement on all categories of target domain adaptation tasks while significantly mitigating the forgetting of the source domain.

**Mixture-of-Linear-Experts for Long-term Time Series Forecasting**

Ronghao Ni · Zinan Lin · Shuaiqi Wang · Giulia Fanti

Long-term time series forecasting (LTSF) aims to predict future values of a time series given the past values. The current state-of-the-art (SOTA) on this problem is attained in some cases by linear-centric models, which primarily feature a linear mapping layer. However, due to their inherent simplicity, they are not able to adapt their prediction rules to periodic changes in time series patterns. To address this challenge, we propose a Mixture-of-Experts-style augmentation for linear-centric models and propose Mixture-of-Linear-Experts (MoLE). Instead of training a single model, MoLE trains multiple linear-centric models (i.e., experts) and a router model that weighs and mixes their outputs. While the entire framework is trained end-to-end, each expert learns to specialize in a specific temporal pattern, and the router model learns to compose the experts adaptively. Experiments show that MoLE reduces forecasting error of linear-centric models, including DLinear, RLinear, and RMLP, in over 78\% of the datasets and settings we evaluated. By using MoLE, existing linear-centric models can achieve SOTA LTSF results in 68\% of the experiments that PatchTST reports and we compare to, whereas existing single-head linear-centric models achieve SOTA results in only 25\% of cases.
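
The mixing step described above, a router that weighs the outputs of several linear experts, can be sketched as follows. This is a forward pass only, with hypothetical names (`linear_expert`, `mole_forecast`) and router logits passed in directly; what the router takes as input and how everything is trained are assumptions left out of the sketch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_expert(weights, window):
    """Apply one linear map: weights has H rows of length L, window has length L."""
    return [sum(w * x for w, x in zip(row, window)) for row in weights]


def mole_forecast(experts, router_logits, window):
    """Gate-weighted sum of the experts' H-step forecasts.

    ASSUMPTION: router_logits are given directly; in a trained model they
    would be produced by a router network.
    """
    gates = softmax(router_logits)
    forecasts = [linear_expert(w, window) for w in experts]
    horizon = len(forecasts[0])
    return [sum(g * f[h] for g, f in zip(gates, forecasts)) for h in range(horizon)]


# Two toy experts over a length-3 window and a 2-step horizon:
# one repeats the last value, the other predicts the window mean.
last_value = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
window_mean = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]

if __name__ == "__main__":
    print(mole_forecast([last_value, window_mean], [0.0, 0.0], [1.0, 2.0, 3.0]))
```

With equal logits the output is the average of the two experts' forecasts; shifting the logits moves the forecast toward whichever expert the router favours.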

**Provable Policy Gradient Methods for Average-Reward Markov Potential Games**

Min Cheng · Ruida Zhou · P. R. Kumar · Chao Tian

We study Markov potential games under the infinite horizon average reward criterion. Most previous studies have been for discounted rewards. We prove that both algorithms based on independent policy gradient and independent natural policy gradient converge globally to a Nash equilibrium for the average reward criterion. To set the stage for gradient-based methods, we first establish that the average reward is a smooth function of policies and provide sensitivity bounds for the differential value functions, under certain conditions on ergodicity and the second largest eigenvalue of the underlying Markov decision process (MDP). We prove that three algorithms, policy gradient, proximal-Q, and natural policy gradient (NPG), converge to an $\epsilon$-Nash equilibrium with time complexity $O(\frac{1}{\epsilon^2})$, given a gradient/differential Q function oracle. When policy gradients have to be estimated, we propose an algorithm with $\tilde{O}(\frac{1}{\min_{s,a}\pi(a|s)\delta})$ sample complexity to achieve $\delta$ approximation error w.r.t. the $\ell_2$ norm. Equipped with the estimator, we derive the first sample complexity analysis for a policy gradient ascent algorithm, featuring a sample complexity of $\tilde{O}(1/\epsilon^5)$. Simulation studies are presented.

**A Cubic-regularized Policy Newton Algorithm for Reinforcement Learning**

Mizhaan Maniyar · Prashanth L.A. · Akash Mondal · Shalabh Bhatnagar

We consider the problem of control in the setting of reinforcement learning (RL), where model information is not available. Policy gradient algorithms are a popular solution approach for this problem and are usually shown to converge to a stationary point of the value function. In this paper, we propose two policy Newton algorithms that incorporate cubic regularization. Both algorithms employ the likelihood ratio method to form estimates of the gradient and Hessian of the value function using sample trajectories. The first algorithm requires an exact solution of the cubic regularized problem in each iteration, while the second algorithm employs an efficient gradient descent-based approximation to the cubic regularized problem. We establish convergence of our proposed algorithms to a second-order stationary point (SOSP) of the value function, which results in the avoidance of traps in the form of saddle points. In particular, the sample complexity of our algorithms to find an $\epsilon$-SOSP is $O(\epsilon^{-3.5})$, which is an improvement over the state-of-the-art sample complexity of $O(\epsilon^{-4.5})$.

**Efficient Active Learning Halfspaces with Tsybakov Noise: A Non-convex Optimization Approach**

Yinan Li · Chicheng Zhang

We study the problem of computationally and label efficient PAC active learning $d$-dimensional halfspaces with Tsybakov Noise (Tsybakov, 2004) under structured unlabeled data distributions. Inspired by Diakonikolas et al. (2020c), we prove that any approximate first-order stationary point of a smooth nonconvex loss function yields a halfspace with a low excess error guarantee. In light of the above structural result, we design a nonconvex optimization-based algorithm with a label complexity of $\tilde{O}(d (\frac{1}{\epsilon})^{\frac{8-6\alpha}{3\alpha-1}})$, under the assumption that the Tsybakov noise parameter $\alpha \in (\frac13, 1]$, which narrows down the gap between the label complexities of the previously known efficient passive or active algorithms (Diakonikolas et al., 2020b; Zhang and Li, 2021) and the information-theoretic lower bound in this setting.

**Consistent Hierarchical Classification with A Generalized Metric**

Yuzhou Cao · Lei Feng · Bo An

In multi-class hierarchical classification, a natural evaluation metric is the tree distance loss that takes the value of two labels' distance on the pre-defined tree hierarchy. This metric is motivated by the fact that its Bayes optimal solution is the deepest label on the tree whose induced superclass (subtree rooted at it) includes the true label with probability at least $\frac{1}{2}$. However, it can hardly handle the risk sensitivity of different tasks since its accuracy requirement for induced superclasses is fixed at $\frac{1}{2}$. In this paper, we first introduce a new evaluation metric that generalizes the tree distance loss, whose solution's accuracy constraint $\frac{1+c}{2}$ can be controlled by a penalty value $c$ tailored for different tasks: a higher $c$ indicates an emphasis on the prediction's accuracy and a lower one indicates an emphasis on specificity. Then, we propose a novel class of consistent surrogate losses based on an intuitive presentation of our generalized metric and its regret, which can be compatible with various binary losses. Finally, we theoretically derive the regret transfer bounds for our proposed surrogates and empirically validate their usefulness on benchmark datasets.
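
The decision rule stated above, predicting the deepest label whose subtree contains the true label with probability at least $\frac{1+c}{2}$, can be sketched on a toy hierarchy. The tree encoding (a `children` dict), the function names, and the example probabilities are assumptions for illustration; the class probabilities are taken as given rather than learned.

```python
def subtree_mass(node, children, prob):
    """Total probability of all labels in the subtree rooted at node."""
    return prob.get(node, 0.0) + sum(
        subtree_mass(ch, children, prob) for ch in children.get(node, [])
    )


def deepest_confident_node(root, children, prob, c=0.0):
    """Walk down from the root while some child subtree has mass >= (1 + c) / 2.

    For c > 0 at most one child can qualify, so the walk is unambiguous.
    For c == 0 two children can tie at exactly 1/2; this sketch then
    follows the first one listed.
    """
    threshold = (1.0 + c) / 2.0
    node = root
    while True:
        qualifying = [
            ch for ch in children.get(node, [])
            if subtree_mass(ch, children, prob) >= threshold
        ]
        if not qualifying:
            return node
        node = qualifying[0]


# Toy hierarchy: root -> {animal, vehicle}, animal -> {cat, dog}.
children = {"root": ["animal", "vehicle"], "animal": ["cat", "dog"]}
prob = {"cat": 0.45, "dog": 0.25, "vehicle": 0.30}

if __name__ == "__main__":
    print(deepest_confident_node("root", children, prob, c=0.0))
    print(deepest_confident_node("root", children, prob, c=0.6))
```

Raising `c` raises the required subtree mass, so the prediction retreats toward coarser labels; lowering it allows more specific predictions.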

**A/B testing under Interference with Partial Network Information**

Shiv Shankar · Ritwik Sinha · Yash Chandak · Saayan Mitra · Madalina Fiterau

A/B tests are often required to be conducted on subjects that might have social connections. Examples include experiments on social media, or medical and social interventions to control the spread of an epidemic. In such settings, the SUTVA assumption for randomized-controlled trials is violated due to network interference, or spill-over effects, as treatments to group A can potentially also affect the control group B. When the underlying social network is known exactly, prior works have demonstrated how to conduct A/B tests adequately to estimate the global average treatment effect (GATE). However, in practice, it is often impossible to obtain knowledge about the exact underlying network. In this paper, we present UNITE: a novel estimator that relaxes this assumption and can identify GATE while only relying on knowledge of the superset of neighbors for any subject in the graph. Through theoretical analysis and extensive experiments, we show that the proposed approach performs better in comparison to standard estimators.

In this work, we study the personalized federated $\mathcal{X}$-armed bandit problem, where the heterogeneous local objectives of the clients are optimized simultaneously in the federated learning paradigm. We propose the `PF-PNE` algorithm with a unique double elimination strategy, which safely eliminates the non-optimal regions while encouraging federated collaboration through biased but effective evaluations of the local objectives. The proposed `PF-PNE` algorithm is able to optimize local objectives with arbitrary levels of heterogeneity, and its limited communications protect the confidentiality of the client-wise reward data. Our theoretical analysis shows the benefit of the proposed algorithm over single-client algorithms. Experimentally, `PF-PNE` outperforms multiple baselines on both synthetic and real-life datasets.

We derive generic information-theoretic and PAC-Bayesian generalization bounds involving an arbitrary convex comparator function, which measures the discrepancy between the training loss and the population loss. The bounds hold under the assumption that the cumulant-generating function (CGF) of the comparator is upper-bounded by the corresponding CGF within a family of bounding distributions. We show that the tightest possible bound is obtained with the comparator being the convex conjugate of the CGF of the bounding distribution, also known as the Cramér function. This conclusion applies more broadly to generalization bounds with a similar structure. This confirms the near-optimality of known bounds for bounded and sub-Gaussian losses and leads to novel bounds under other bounding distributions.

It was recently observed that Elo ratings fail at preserving transitive relations among strategies and therefore cannot correctly extract the transitive component of a game. We provide a characterization of transitive games as a weak variant of ordinal potential games and show that Elo ratings actually do preserve transitivity when computed in the right space, using suitable invertible mappings. Leveraging this insight, we introduce a new game decomposition of an arbitrary game into transitive and cyclic components that is learnt using a neural network-based architecture and that prioritises capturing the sign pattern of the game, namely transitive and cyclic relations among strategies. We link our approach to the known concept of sign-rank, and evaluate our methodology using both toy examples and empirical data from real-world games.
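
For background, the sketch below shows the standard Elo win-probability model and the way an invertible map (here the log-odds) turns win probabilities back into additive rating differences. It illustrates the textbook Elo formula only; the mappings and the decomposition used in the work above are not specified in the abstract and are not reproduced here. The ratings are made-up values.

```python
import math


def elo_win_prob(r_i: float, r_j: float, scale: float = 400.0) -> float:
    """Standard Elo expected score of player i against player j."""
    return 1.0 / (1.0 + 10.0 ** (-(r_i - r_j) / scale))


def log_odds(p: float) -> float:
    """Invertible map from a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


if __name__ == "__main__":
    a, b, c = 1600.0, 1500.0, 1350.0  # hypothetical ratings
    ab = log_odds(elo_win_prob(a, b))
    bc = log_odds(elo_win_prob(b, c))
    ac = log_odds(elo_win_prob(a, c))
    # Under the Elo model, log-odds add up along a chain a > b > c.
    print(ab + bc, ac)
```

In probability space the relation between the three matchups is not additive, but after the log-odds map it is, which is the sense in which the choice of space matters for reading off transitive structure.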

**A Specialized Semismooth Newton Method for Kernel-Based Optimal Transport**

Tianyi Lin · Marco Cuturi · Michael Jordan

Kernel-based optimal transport (OT) estimators offer an alternative, functional estimation procedure to address OT problems from samples. Recent works suggest that these estimators are more statistically efficient than plug-in (linear programming-based) OT estimators when comparing probability measures in high-dimensions (Vacher et al., 2021). Unfortunately, that statistical benefit comes at a very steep computational price: because their computation relies on the short-step interior-point method (SSIPM), which comes with a large iteration count in practice, these estimators quickly become intractable w.r.t. sample size $n$. To scale these estimators to larger $n$, we propose a nonsmooth fixed-point model for the kernel-based OT problem, and show that it can be efficiently solved via a specialized semismooth Newton (SSN) method: We show, exploring the problem's structure, that the per-iteration cost of performing one SSN step can be significantly reduced in practice. We prove that our SSN method achieves a global convergence rate of $O(1/\sqrt{k})$, and a local quadratic convergence rate under standard regularity conditions. We show substantial speedups over SSIPM on both synthetic and real datasets.

**Maximum entropy GFlowNets with soft Q-learning**

Sobhan Mohammadpour · Emmanuel Bengio · Emma Frejinger · Pierre-Luc Bacon

Generative Flow Networks (GFNs) have emerged as a powerful tool for sampling discrete objects from unnormalized distributions, offering a scalable alternative to Markov Chain Monte Carlo (MCMC) methods. While GFNs draw inspiration from maximum entropy reinforcement learning (RL), the connection between the two has largely been unclear and seemingly applicable only in specific cases. This paper addresses the connection by constructing an appropriate reward function, thereby establishing an exact relationship between GFNs and maximum entropy RL. This construction allows us to introduce maximum entropy GFNs, which achieve the maximum entropy attainable by GFNs without constraints on the state space, in contrast to GFNs with uniform backward policy.

**Identifying Copeland Winners in Dueling Bandits with Indifferences**

Viktor Bengs · Björn Haddenhorst · Eyke Hüllermeier

We consider the task of identifying the Copeland winner(s) in a dueling bandits problem with ternary feedback. This is an underexplored but practically relevant variant of the conventional dueling bandits problem, in which, in addition to strict preference between two arms, one may observe feedback in the form of an indifference. We provide a lower bound on the sample complexity for any learning algorithm finding the Copeland winner(s) with a fixed error probability. Moreover, we propose POCOWISTA, an algorithm with a sample complexity that almost matches this lower bound, and which shows excellent empirical performance, even for the conventional dueling bandits problem. For the case where the preference probabilities satisfy a specific type of stochastic transitivity, we provide a refined version with an improved worst case sample complexity.

**Breaking isometric ties and introducing priors in Gromov-Wasserstein distances**

Pinar Demetci · Quang Huy TRAN · Ievgen Redko · Ritambhara Singh

Gromov-Wasserstein distance has many applications in machine learning due to its ability to compare measures across metric spaces and its invariance to isometric transformations. However, in certain applications, this invariant property can be too flexible, thus undesirable. Moreover, the Gromov-Wasserstein distance solely considers pairwise sample similarities in input datasets, disregarding the raw feature representations. We propose a new optimal transport formulation, called Augmented Gromov-Wasserstein (AGW), that allows for some control over the level of rigidity to transformations. It also incorporates feature alignments, enabling us to better leverage prior knowledge on the input data for improved performance. We first present theoretical insights into the proposed method. We then demonstrate its usefulness for single-cell multi-omic alignment tasks and heterogeneous domain adaptation in machine learning.

**Pure Exploration in Bandits with Linear Constraints**

Emil Carlsson · Debabrota Basu · Fredrik Johansson · Devdatt Dubhashi

We address the problem of identifying the optimal policy with a fixed confidence level in a multi-armed bandit setup, when *the arms are subject to linear constraints*. Unlike the standard best-arm identification problem which is well studied, the optimal policy in this case may not be deterministic and could mix between several arms. This changes the geometry of the problem which we characterize via an information-theoretic lower bound. We introduce two asymptotically optimal algorithms for this setting, one based on the Track-and-Stop method and the other based on a game-theoretic approach. Both these algorithms try to track an optimal allocation based on the lower bound and computed by a weighted projection onto the boundary of a normal cone. Finally, we provide empirical results that validate our bounds and visualize how constraints change the hardness of the problem.

**Multi-armed bandits with guaranteed revenue per arm**

Dorian Baudry · Nadav Merlis · Mathieu Molina · Hugo Richard · Vianney Perchet

We consider a Multi-Armed Bandit problem with covering constraints, where the primary goal is to ensure that each arm receives a minimum expected reward while maximizing the total cumulative reward. In this scenario, the optimal policy then belongs to some unknown feasible set. Unlike much of the existing literature, we do not assume the presence of a safe policy or a feasibility margin, which hinders the exclusive use of conservative approaches. Consequently, we propose and analyze an algorithm that switches between pessimism and optimism in the face of uncertainty. We prove both precise problem-dependent and problem-independent bounds, demonstrating that our algorithm achieves the best of the two approaches – depending on the presence or absence of a feasibility margin – in terms of constraint violation guarantees. Furthermore, our results indicate that playing greedily on the constraints actually outperforms pessimism when considering long-term violations rather than violations on a per-round basis.

**Constant or Logarithmic Regret in Asynchronous Multiplayer Bandits with Limited Communication**

Hugo Richard · Etienne Boursier · Vianney Perchet

Multiplayer bandits have recently garnered significant attention due to their relevance in cognitive radio networks. While the existing body of literature predominantly focuses on synchronous players, real-world radio networks, such as those in IoT applications, often feature asynchronous (i.e., randomly activated) devices. This highlights the need for addressing the more challenging asynchronous multiplayer bandits problem. Our first result shows that a natural extension of UCB achieves a minimax regret of $\mathcal{O}(\sqrt{T\log(T)})$ in the centralized setting. More significantly, we introduce Cautious Greedy, which uses $\mathcal{O}(\log(T))$ communications and whose instance-dependent regret is constant if the optimal policy assigns at least one player to each arm (a situation proven to occur when arm means are sufficiently close). Otherwise, the regret is, as usual, $\log(T)$ times the sum of some inverse sub-optimality gaps. We substantiate the optimality of Cautious Greedy through lower-bound analysis based on data-dependent terms. Therefore, we establish a strong baseline for asynchronous multiplayer bandits, at least with $\mathcal{O}(\log(T))$ communications.

**Robust Non-linear Normalization of Heterogeneous Feature Distributions with Adaptive Tanh-Estimators**

Felip Guimerà Cuevas · Helmut Schmid

Feature normalization is a crucial step in machine learning that scales numerical values to improve model effectiveness. Noisy or impure datasets can pose a challenge for traditional normalization methods as they may contain outliers that violate statistical assumptions, leading to reduced model performance and increased unpredictability. Non-linear Tanh-Estimators (TE) have been found to provide robust feature normalization, but their fixed scaling factor may not be appropriate for all distributions of feature values. This work presents a refinement to the TE that employs the Wasserstein distance to adaptively estimate the optimal scaling factor for each feature individually against a specified target distribution. The results demonstrate that this adaptive approach can outperform the current TE method in the literature in terms of convergence speed by enabling better initial training starts, thus reducing or eliminating the need to re-adjust model weights during early training phases due to inadequately scaled features. Empirical evaluation was done on synthetic data, standard toy computer vision datasets, and a real-world numeric tabular dataset.
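For orientation, the fixed-scale tanh-estimator that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration, not the authors' adaptive method: it assumes the form 0.5 * (tanh(scale * (x - mean) / std) + 1) with a fixed `scale` of 0.01, and it uses the plain sample mean and standard deviation as a simplification in place of robust location and scale estimates. The function name `tanh_estimator` is hypothetical.

```python
import math
import statistics


def tanh_estimator(values, scale=0.01):
    """Squash values into (0, 1) with a tanh centred on the sample mean.

    Assumption: plain mean / population std stand in for robust estimates.
    """
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0.0:
        # Degenerate feature: every value equals the mean, so map to the midpoint.
        return [0.5 for _ in values]
    return [0.5 * (math.tanh(scale * (v - mu) / sigma) + 1.0) for v in values]


feature = [1.0, 2.0, 3.0, 4.0, 1000.0]  # one extreme outlier
normalized = tanh_estimator(feature)
```

Because tanh saturates, the outlier cannot push the output outside (0, 1). With a small fixed `scale`, however, typical values end up tightly clustered around 0.5, which is the kind of mismatch a per-feature scaling factor is meant to address.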

**Faster Convergence with MultiWay Preferences**

Aadirupa Saha · Vitaly Feldman · Yishay Mansour · Tomer Koren

We address the problem of convex optimization with preference feedback, where the goal is to minimize a convex function given a weaker form of comparison queries. Each query consists of two points and the dueling feedback returns a (noisy) single-bit binary comparison of the function values of the two queried points. Here we consider the sign-function-based comparison feedback model and analyze the convergence rates with batched and multiway (argmin of a set of queried points) comparisons. Our main goal is to understand the improved convergence rates owing to parallelization in sign-feedback-based optimization problems. Our work is the first to study the problem of convex optimization with multiway preferences and analyze the optimal convergence rates. Our first contribution lies in designing efficient algorithms with a convergence rate of $\smash{\widetilde O}(\frac{d}{\min\{m,d\} \epsilon})$ for $m$-batched preference feedback, where the learner can query $m$ pairs in parallel. We next study an $m$-multiway comparison ('battling') feedback, where the learner gets to see the argmin feedback of an $m$-subset of queried points, and show a convergence rate of $\smash{\widetilde O}(\frac{d}{ \min\{\log m,d\}\epsilon })$. We show further improved convergence rates with an additional assumption of strong convexity. Finally, we also study the convergence lower bounds for batched preferences and multiway feedback optimization, showing the optimality of our convergence rates w.r.t. $m$.
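The single-pair sign-feedback model described above can be illustrated with a small sketch. It is not the authors' algorithm: it assumes noiseless feedback, a fixed step size, and a simple scheme that compares two points along a random direction and steps away from the worse one. All function names and parameter values (`eta`, `gamma`, `steps`) are hypothetical choices for illustration.

```python
import math
import random


def sign_comparison(f, a, b):
    """Return +1 if f(a) > f(b), else -1 (noiseless single-bit feedback)."""
    return 1 if f(a) > f(b) else -1


def random_unit_vector(dim, rng):
    """Draw a direction uniformly from the unit sphere via normalised Gaussians."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]


def sign_feedback_descent(f, x0, steps=2000, eta=0.05, gamma=0.01, seed=0):
    """Minimise f using only comparisons between pairs of queried points."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        u = random_unit_vector(len(x), rng)
        plus = [xi + gamma * ui for xi, ui in zip(x, u)]
        minus = [xi - gamma * ui for xi, ui in zip(x, u)]
        s = sign_comparison(f, plus, minus)
        # Step away from whichever of the two queried points was worse.
        x = [xi - eta * s * ui for xi, ui in zip(x, u)]
    return x


def quadratic(x):
    return sum(xi * xi for xi in x)


x_final = sign_feedback_descent(quadratic, [1.0, 1.0, 1.0])
```

The learner never sees function values, only which of two points is better. A batched variant would issue several such pairs per round and combine the returned signs before stepping.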

**Minimax optimal density estimation using a shallow generative model with a one-dimensional latent variable**

Hyeok Kyu Kwon · Minwoo Chae

A deep generative model yields an implicit estimator for the unknown distribution or density function of the observation. This paper investigates some statistical properties of the implicit density estimator pursued by VAE-type methods from a nonparametric density estimation framework. More specifically, we obtain convergence rates of the VAE-type density estimator under the assumption that the underlying true density function belongs to a locally Hölder class. Remarkably, a near minimax optimal rate with respect to the Hellinger metric can be achieved by the simplest network architecture, a shallow generative model with a one-dimensional latent variable.

**Delegating Data Collection in Decentralized Machine Learning**

Nivasini Ananthakrishnan · Stephen Bates · Michael Jordan · Nika Haghtalab

Motivated by the emergence of decentralized machine learning (ML) ecosystems, we study the delegation of data collection. Taking the field of contract theory as our starting point, we design optimal and near-optimal contracts that deal with two fundamental information asymmetries that arise in decentralized ML: uncertainty in the assessment of model quality and uncertainty regarding the optimal performance of any model. We show that a principal can cope with such asymmetry via simple linear contracts that achieve a $1-1/e$ fraction of the optimal utility. To address the lack of a priori knowledge regarding the optimal performance, we give a convex program that can adaptively and efficiently compute the optimal contract. We also analyze the optimal utility and linear contracts for the more complex setting of multiple interactions.

**Efficient Quantum Agnostic Improper Learning of Decision Trees**

Sagnik Chatterjee · Tharrmashastha SAPV · Debajyoti Bera

The agnostic setting is the hardest generalization of the PAC model since it is akin to learning with adversarial noise. In this paper, we give a poly $(n, t, 1/\epsilon)$ quantum algorithm for learning size $t$ decision trees over $n$-bit inputs with uniform marginal over instances, in the agnostic setting, without membership queries (MQ). This is the first algorithm (classical or quantum) for efficiently learning decision trees without MQ. First, we construct a quantum agnostic weak learner by designing a quantum variant of the classical Goldreich-Levin algorithm that works with strongly biased function oracles. Next, we show how to quantize the agnostic boosting algorithm by Kalai and Kanade (2009) to obtain the first efficient quantum agnostic boosting algorithm (that has a polynomial speedup over existing adaptive quantum boosting algorithms). We then use the quantum agnostic boosting algorithm to boost the weak quantum agnostic learner constructed previously to obtain a quantum agnostic learner for decision trees. Using the above framework, we also give quantum decision tree learning algorithms without MQ in weaker noise models.

We introduce a technique called graph fission which takes in a graph which potentially contains only one observation per node (whose distribution lies in a known class) and produces two (or more) independent graphs with the same node/edge set in a way that splits the original graph's information amongst them in any desired proportion. Our proposal builds on data fission/thinning, a method that uses external randomization to create independent copies of an unstructured dataset. We extend this idea to the graph setting where there may be latent structure between observations. We demonstrate the utility of this framework via two applications: inference after structural trend estimation on graphs and a model selection procedure we term "graph cross-validation".

**Meta Learning in Bandits within shared affine Subspaces**

Steven Bilaj · Sofien Dhouib · Setareh Maghsudi

We study the problem of meta-learning several contextual stochastic bandit tasks by leveraging their concentration around a low-dimensional affine subspace, which we learn via online principal component analysis to reduce the expected regret over the encountered bandits. We propose and theoretically analyze two strategies that solve the problem: one based on the principle of optimism in the face of uncertainty and the other via Thompson sampling. Our framework is generic and includes previously proposed approaches as special cases. Besides, the empirical results show that our methods significantly reduce the regret on several bandit tasks.

In this paper we develop a novel nonparametric framework to test the independence of two random variables $X$ and $Y$ with unknown respective marginals $H(dx)$ and $G(dy)$ and joint distribution $F(dxdy)$, based on Receiver Operating Characteristic (ROC) analysis and bipartite ranking. The rationale behind our approach relies on the fact that the independence hypothesis $\mathcal{H}_0$ is necessarily false as soon as the optimal scoring function related to the pair of distributions $(H\otimes G,\; F)$, obtained from a bipartite ranking algorithm, has a ROC curve that deviates from the main diagonal of the unit square. We consider a wide class of rank statistics encompassing many ways of deviating from the diagonal in the ROC space to build tests of independence. Beyond its great flexibility, this new method has theoretical properties that far surpass those of its competitors. Nonasymptotic bounds for the two types of testing errors are established. From an empirical perspective, the novel procedure we promote in this paper exhibits a remarkable ability to detect small departures, of various types, from the null assumption $\mathcal{H}_0$, even in high dimension, as supported by the numerical experiments presented here.

**Structured Transforms Across Spaces with Cost-Regularized Optimal Transport**

Othmane Sebbouh · Marco Cuturi · Gabriel Peyré

Matching a source to a target probability measure is often solved by instantiating a linear optimal transport (OT) problem, parameterized by a ground cost function that quantifies discrepancy between points. When these measures live in the same metric space, the ground cost often defaults to its distance. When instantiated across two different spaces, however, choosing that cost in the absence of aligned data is a conundrum. As a result, practitioners often resort to solving instead a quadratic Gromov-Wasserstein (GW) problem. We exploit in this work a parallel between GW and cost-regularized OT, the regularized minimization of a linear OT objective parameterized by a ground cost. We use this cost-regularized formulation to match measures across two different Euclidean spaces, where the cost is evaluated between transformed source points and target points. We show that several quadratic OT problems fall in this category, and consider enforcing structure in linear transforms (e.g. sparsity) by introducing structure-inducing regularizers. We provide a proximal algorithm to extract such transforms from unaligned data, and demonstrate its applicability to single-cell spatial transcriptomics/multiomics matching tasks.

**Federated Linear Contextual Bandits with Heterogeneous Clients**

Ethan Blaser · Chuanhao Li · Hongning Wang

The demand for collaborative and private bandit learning across multiple agents is surging due to the growing quantity of data generated from distributed systems. Federated bandit learning has emerged as a promising framework for private, efficient, and decentralized online learning. However, almost all previous works rely on strong assumptions of client homogeneity, i.e., all participating clients shall share the same bandit model; otherwise, they all would suffer linear regret. This greatly restricts the application of federated bandit learning in practice. In this work, we introduce a new approach for federated bandits for heterogeneous clients, which clusters clients for collaborative bandit learning under the federated learning setting. Our proposed algorithm achieves non-trivial sub-linear regret and communication cost for all clients, subject to the communication protocol under federated learning that at any time only one model can be shared by the server.

**Directional Optimism for Safe Linear Bandits**

Spencer Hutchinson · Berkay Turan · Mahnoosh Alizadeh

The safe linear bandit problem is a version of the classical stochastic linear bandit problem where the learner's actions must satisfy an uncertain constraint at all rounds. Due to its applicability to many real-world settings, this problem has received considerable attention in recent years. By leveraging a novel approach that we call directional optimism, we find that it is possible to achieve improved regret guarantees for both well-separated problem instances and action sets that are finite star convex sets. Furthermore, we propose a novel algorithm for this setting that improves on existing algorithms in terms of empirical performance, while enjoying matching regret guarantees. Lastly, we introduce a generalization of the safe linear bandit setting where the constraints are convex and adapt our algorithms and analyses to this setting by leveraging a novel convex-analysis based approach.

**TransFusion: Covariate-Shift Robust Transfer Learning for High-Dimensional Regression**

Zelin He · Ying Sun · Runze Li

The main challenge that sets transfer learning apart from traditional supervised learning is the distribution shift, reflected as the shift between the source and target models and that between the marginal covariate distributions. In this work, we tackle model shifts in the presence of covariate shifts in the high-dimensional regression setting. Specifically, we propose a two-step method with a novel fused regularizer that effectively leverages samples from source tasks to improve the learning performance on a target task with limited samples. A nonasymptotic bound is provided for the estimation error of the target model, showing the robustness of the proposed method to covariate shifts. We further establish conditions under which the estimator is minimax-optimal. Additionally, we extend the method to a distributed setting, allowing for a pretraining-finetuning strategy, requiring just one round of communication while retaining the estimation rate of the centralized version. Numerical tests validate our theory, highlighting the method's robustness to covariate shifts.

**Exploration via linearly perturbed loss minimisation**

David Janz · Shuai Liu · Alex Ayoub · Csaba Szepesvari

We introduce 'exploration via linearly perturbed loss minimisation' (ELPeLM), a randomised exploration method for structured stochastic bandit problems that works by solving for the minimiser of a linearly perturbed regularised negative log-likelihood function. We show that, for the case of generalised linear bandits, ELPeLM reduces to perturbed history exploration (PHE), where exploration is done by training on randomly perturbed rewards. In doing so, we provide a simple and clean explanation of when and why random reward perturbations give rise to good bandit algorithms. Our analysis suggests the use of data-dependent reward perturbations, not present in previous PHE-type methods, with which we are able to match the performance of Thompson-sampling-style parameter-perturbation methods, both in theory and in practice. Moreover, we show an example outside of generalised linear bandits where PHE leads to inconsistent estimates and thus linear regret, while ELPeLM remains a capable algorithm. While more principled, just like PHE, ELPeLM can be implemented in just a few lines of code.
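To make the idea of "training on randomly perturbed rewards" concrete, here is a minimal sketch for a plain K-armed Bernoulli bandit. It is neither ELPeLM nor a specific published PHE variant. It assumes that each round every arm's estimate is refit as the sample mean of its past rewards plus fresh Gaussian noise, and that the learner then acts greedily. The function name, the noise scale, and the horizon are hypothetical.

```python
import random


def perturbed_history_bandit(arm_means, horizon=1000, noise_scale=1.0, seed=0):
    """Run a reward-perturbation bandit and return the pull count of each arm."""
    rng = random.Random(seed)
    k = len(arm_means)
    history = [[] for _ in range(k)]  # observed rewards per arm
    pulls = [0] * k
    for t in range(horizon):
        if t < k:
            arm = t  # pull each arm once so every history is non-empty
        else:
            scores = []
            for rewards in history:
                # Refit the estimate on rewards with fresh noise added to each one.
                perturbed = [r + rng.gauss(0.0, noise_scale) for r in rewards]
                scores.append(sum(perturbed) / len(perturbed))
            arm = max(range(k), key=scores.__getitem__)  # greedy on perturbed fit
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        history[arm].append(reward)
        pulls[arm] += 1
    return pulls


pulls = perturbed_history_bandit([0.2, 0.8])
```

The noise averages out as an arm accumulates data, so exploration of an arm fades as its estimate becomes reliable. This sketch uses a fixed noise scale and does not model the data-dependent perturbations that the abstract mentions.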

**Proximal Causal Inference for Synthetic Control with Surrogates**

Jizhou Liu · Eric Tchetgen Tchetgen · Carlos Varjão

The synthetic control method (SCM) has become a popular tool for estimating causal effects in policy evaluation, where a single treated unit is observed. However, SCM faces challenges in accurately predicting post-intervention potential outcomes had, contrary to fact, the treatment been withheld, when the pre-intervention period is short or the post-intervention period is long. To address these issues, we propose a novel method that leverages post-intervention information, specifically time-varying correlates of the causal effect called "surrogates", within the synthetic control framework. We establish conditions for identifying model parameters using the proximal inference framework and apply the generalized method of moments (GMM) approach for estimation and inference about the average treatment effect on the treated (ATT). Interestingly, we uncover specific conditions under which exclusively using post-intervention data suffices for estimation within our framework. Through a synthetic experiment and a real-world application, we demonstrate that our method can outperform other synthetic control methods in estimating both short-term and long-term effects, yielding more accurate inferences.

**Oracle-Efficient Pessimism: Offline Policy Optimization In Contextual Bandits**

Lequn Wang · Akshay Krishnamurthy · Aleksandrs Slivkins

We consider offline policy optimization (OPO) in contextual bandits, where one is given a fixed dataset of logged interactions. While pessimistic regularizers are typically used to mitigate distribution shift, prior implementations thereof are either specialized or computationally inefficient. We present the first *general* oracle-efficient algorithm for pessimistic OPO: it reduces to supervised learning, leading to broad applicability. We obtain statistical guarantees analogous to those for prior pessimistic approaches. We instantiate our approach for both discrete and continuous actions and perform experiments in both settings, showing advantage over unregularized OPO across a wide range of configurations.

The SLOPE estimator has the particularity of having null components (sparsity) and components that are equal in absolute value (clustering). The number of clusters depends on the regularization parameter of the estimator. This parameter can be chosen as a trade-off between interpretability (with a small number of clusters) and accuracy (with a small mean squared error or a small prediction error). Finding such a compromise requires computing the solution path, that is, the function mapping the regularization parameter to the estimator. We provide in this article an algorithm to compute the solution path of SLOPE and show how it can be used to adjust the regularization parameter.

**Graph Pruning for Enumeration of Minimal Unsatisfiable Subsets**

Panagiotis Lymperopoulos · Liping Liu

Finding Minimal Unsatisfiable Subsets (MUSes) of boolean constraints is a common problem in infeasibility analysis of over-constrained systems. However, because of the exponential search space of the problem, enumerating MUSes is extremely time-consuming in real applications. In this work, we propose to prune formulas using a learned model to speed up MUS enumeration. We represent formulas as graphs and then develop a graph-based learning model to predict which part of the formula should be pruned. Importantly, the training of our model does not require labeled data. It does not even require training data from the target application because it extrapolates to data with different distributions. In our experiments we combine our model with existing MUS enumerators and validate its effectiveness in multiple benchmarks including a set of real-world problems outside our training distribution. The experiment results show that our method significantly accelerates MUS enumeration on average on these benchmark problems.

**Sharp error bounds for imbalanced classification: how many examples in the minority class?**

Anass Aghbalou · Anne Sabourin · François Portier

When dealing with imbalanced classification data, reweighting the loss function is a standard procedure for balancing the true positive and true negative rates within the risk measure. Despite significant theoretical work in this area, existing results do not adequately address a main challenge within the imbalanced classification framework, which is the negligible size of one class in relation to the full sample size and the need to rescale the risk function by a probability tending to zero. To address this gap, we present two novel contributions in the setting where the rare class probability approaches zero: (1) a non-asymptotic fast rate probability bound for constrained balanced empirical risk minimization, and (2) a consistent upper bound for balanced nearest neighbors estimates. Our findings provide a clearer understanding of the benefits of class-weighting in realistic settings, opening new avenues for further research in this field.

**Multi-Domain Causal Representation Learning via Weak Distributional Invariances**

Kartik Ahuja · Amin Mansouri · Yixin Wang

Causal representation learning has emerged as the center of action in causal machine learning research. In particular, multi-domain datasets present a natural opportunity for showcasing the advantages of causal representation learning over standard unsupervised representation learning. While recent works have taken crucial steps towards learning causal representations, they often lack applicability to multi-domain datasets due to over-simplifying assumptions about the data; e.g. each domain comes from a different single-node perfect intervention. In this work, we relax these assumptions and capitalize on the following observation: there often exists a subset of latents whose certain distributional properties (e.g., support, variance) remain stable across domains; this property holds when, for example, each domain comes from a multi-node imperfect intervention. Leveraging this observation, we show that autoencoders that incorporate such invariances can provably identify the stable set of latents from the rest across different settings.

**DeepFDR: A Deep Learning-based False Discovery Rate Control Method for Neuroimaging Data**

Taehyo Kim · Hai Shu · Qiran Jia · Mony de Leon

Voxel-based multiple testing is widely used in neuroimaging data analysis. Traditional false discovery rate (FDR) control methods often ignore the spatial dependence among the voxel-based tests and thus suffer from substantial loss of testing power. While recent spatial FDR control methods have emerged, their validity and optimality remain questionable when handling the complex spatial dependencies of the brain. Concurrently, deep learning methods have revolutionized image segmentation, a task closely related to voxel-based multiple testing. In this paper, we propose DeepFDR, a novel spatial FDR control method that leverages unsupervised deep learning-based image segmentation to address the voxel-based multiple testing problem. Numerical studies, including comprehensive simulations and Alzheimer's disease FDG-PET image analysis, demonstrate DeepFDR's superiority over existing methods. DeepFDR not only excels in FDR control and effectively diminishes the false nondiscovery rate, but also boasts exceptional computational efficiency highly suited for tackling large-scale neuroimaging data.
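As a point of reference for the "traditional FDR control methods" mentioned above, the sketch below implements the Benjamini-Hochberg step-up procedure. The abstract does not name a specific baseline, so choosing Benjamini-Hochberg is an assumption made here for illustration. It treats every test independently and ignores any spatial arrangement of the voxels. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected."""
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below its step-up threshold.
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


example_p = [0.041, 0.001, 0.216, 0.008, 0.039, 0.06, 0.074, 0.205, 0.212, 0.042]
decisions = benjamini_hochberg(example_p, alpha=0.05)
```

Each decision here depends only on the sorted list of p-values, never on which voxels are neighbours. Spatial methods aim to recover the information this discards.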

**Robust Sparse Voting**

Youssef Allouah · Rachid Guerraoui · Lê-Nguyên Hoang · Oscar Villemaud

Many applications, such as content moderation and recommendation, require reviewing and scoring a large number of alternatives. Doing so robustly is however very challenging. Indeed, voters' inputs are inevitably sparse: most alternatives are only scored by a small fraction of voters. This sparsity amplifies the effects of biased voters introducing unfairness, and of malicious voters seeking to hack the voting process by reporting dishonest scores. We give a precise definition of the problem of robust sparse voting, highlight its underlying technical challenges, and present a novel voting mechanism addressing the problem. We prove that, using this mechanism, no voter can have more than a small parameterizable effect on each alternative's score; a property we call Lipschitz resilience. We also identify conditions of voter comparability under which any unanimous preferences can be recovered, even when each voter provides sparse scores, on a scale that is potentially very different from any other voter's score scale. Proving these properties required us to introduce, analyze and carefully compose novel aggregation primitives which could be of independent interest.

Transfer learning plays a key role in modern data analysis when: (1) the target data are scarce but the source data are sufficient; (2) the distributions of the source and target data are heterogeneous. This paper develops an interpretable unified transfer learning model, termed UTrans, which can detect both transferable variables and source data. More specifically, we establish the estimation error bounds and prove that our bounds are lower than those with target data only. Besides, we propose a source detection algorithm based on hypothesis testing to exclude the nontransferable data. We evaluate and compare UTrans to the existing algorithms in multiple experiments. It is shown that UTrans attains much lower estimation and prediction errors than the existing methods, while preserving interpretability. We finally apply it to the US intergenerational mobility data and compare our proposed algorithms to the classical machine learning algorithms.

**Hidden yet quantifiable: A lower bound for confounding strength using randomized trials**

Piersilvio De Bartolomeis · Javier Abad Martinez · Konstantin Donhauser · Fanny Yang

In the era of fast-paced precision medicine, observational studies play a major role in properly evaluating new treatments in clinical practice. Yet, unobserved confounding can significantly compromise causal conclusions drawn from non-randomized data. We propose a novel strategy that leverages randomized trials to quantify unobserved confounding. First, we design a statistical test to detect unobserved confounding above a certain strength. Then, we use the test to estimate an asymptotically valid lower bound on the unobserved confounding strength. We evaluate the power and validity of our statistical test on several synthetic and semi-synthetic datasets. Further, we show how our lower bound can correctly identify the absence and presence of unobserved confounding in a real-world example.

**Distributionally Robust Quickest Change Detection using Wasserstein Uncertainty Sets**

Liyan Xie · Yuchen Liang · Venugopal V. Veeravalli

The problem of quickest detection of a change in the distribution of streaming data is considered. It is assumed that the pre-change distribution is known, while the only information about the post-change is through a (small) set of labeled data. This post-change data is used in a data-driven minimax robust framework, where an uncertainty set for the post-change distribution is constructed. The robust change detection problem is studied in an asymptotic setting where the mean time to false alarm goes to infinity. It is shown that the least favorable distribution (LFD) is an exponentially tilted version of the pre-change density and can be obtained efficiently. A Cumulative Sum (CuSum) test based on the LFD, which is referred to as the distributionally robust (DR) CuSum test, is then shown to be asymptotically robust. The results are extended to the case with multiple post-change uncertainty sets and validated using synthetic and real data examples.
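The CuSum recursion underlying the test can be sketched in a few lines. This is only the classical statistic: it assumes the post-change density is known (a unit-variance Gaussian with a shifted mean), whereas the distributionally robust version described above would plug in the least favorable distribution instead. The function names, the stream, and the threshold are hypothetical.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log-density of a normal distribution with the given mean and std."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - (x - mean) ** 2 / (2.0 * std * std)


def cusum_stopping_time(stream, pre_log_pdf, post_log_pdf, threshold):
    """Return the 0-based index where the CuSum statistic first reaches threshold.

    Returns None if the threshold is never reached.
    """
    stat = 0.0
    for t, x in enumerate(stream):
        # Log-likelihood ratio of post-change versus pre-change for this sample.
        llr = post_log_pdf(x) - pre_log_pdf(x)
        # Accumulate evidence for a change, resetting whenever it drops below zero.
        stat = max(0.0, stat + llr)
        if stat >= threshold:
            return t
    return None


def pre_change(x):
    return gaussian_log_pdf(x, 0.0, 1.0)


def post_change(x):
    return gaussian_log_pdf(x, 1.0, 1.0)


stream = [0.0] * 5 + [1.0] * 10  # the change happens at index 5
alarm = cusum_stopping_time(stream, pre_change, post_change, threshold=1.9)
```

Raising `threshold` lengthens the mean time to false alarm at the cost of a longer detection delay. This is the trade-off studied in the asymptotic regime the abstract describes.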

**Information-theoretic Analysis of Bayesian Test Data Sensitivity**

Futoshi Futami · Tomoharu Iwata

Bayesian inference is often used to quantify uncertainty. Several recent analyses have rigorously decomposed uncertainty in prediction by Bayesian inference into two types: the inherent randomness in the data generation process and the variability due to lack of data. Existing studies have analyzed these uncertainties from an information-theoretic perspective, assuming the model is well-specified and treating the model parameters as latent variables. However, such information-theoretic uncertainty analysis fails to account for a widely believed property of uncertainty known as sensitivity between test and training data. This means that if the test data is similar to the training data in some sense, the uncertainty will be smaller. In this work, we study such sensitivity using a new decomposition of uncertainty. Our analysis successfully defines such sensitivity using information-theoretic quantities. Furthermore, we extend the existing analysis of Bayesian meta-learning and show the novel sensitivities among tasks for the first time.

**Filter, Rank, and Prune: Learning Linear Cyclic Gaussian Graphical Models**

Soheun Yi · Sanghack Lee

Causal structures in the real world often exhibit cycles naturally due to equilibrium, homeostasis, or feedback. However, causal discovery from observational studies regarding cyclic models has not been investigated extensively because the underlying structure of a linear cyclic structural equation model (SEM) cannot be determined solely from observational data. Inspired by the Bayesian Information Criterion (BIC), we construct a score function that assesses both accuracy and sparsity of the structure to determine which linear Gaussian SEM is the best when only observational data is given. Then, we formulate a causal discovery problem as an optimization problem of the measure and propose the Filter, Rank, and Prune (FRP) method for solving it. We empirically demonstrate that our method outperforms competitive cyclic causal discovery baselines.

**Efficient Data Valuation for Weighted Nearest Neighbor Algorithms**

Jiachen T. Wang · Prateek Mittal · Ruoxi Jia

This work aims to address an open problem in data valuation literature concerning the efficient computation of Data Shapley for weighted $K$ nearest neighbor algorithm (WKNN-Shapley). By considering the accuracy of hard-label KNN with discretized weights as the utility function, we reframe the computation of WKNN-Shapley into a counting problem and introduce a quadratic-time algorithm, presenting a notable improvement from $O(N^K)$, the best result from existing literature. We develop a deterministic approximation algorithm that further improves computational efficiency while maintaining the key fairness properties of the Shapley value. Through extensive experiments, we demonstrate WKNN-Shapley's computational efficiency and its superior performance in discerning data quality compared to its unweighted counterpart.