Skip to yearly menu bar Skip to main content


Poster Session

Poster Session 1

Main Ballroom
Sat 2 May 7 a.m. PDT — 10 a.m. PDT
Abstract:
Chat is not available.

In this paper, we address the problem of sparse regression vector estimation in the presence of corrupted samples, with a particular focus on accurately identifying the support. Traditional methods, such as the Least Absolute Shrinkage and Selection Operator (LASSO), often fail in such scenarios, exhibiting inconsistency. To tackle this challenge, we propose a combinatorial, non-convex, and robust variant of LASSO framework, designed to enhance estimation accuracy under corruption. Our approach is supported by theoretical guarantees, which establish its reliability and robustness. Our method also handles corruption from heavy-tailed distributions, with only a few bounded moments. We validate our theoretical results through extensive experiments, comparing the performance of our method against the LASSO and its other robust variants. These comparisons highlight the efficacy of our framework, demonstrating its practical applicability in sparse regression tasks involving corrupted data.


10
Prior shift estimation for positive unlabeled data through the lens of kernel embedding

Jan Mielniczuk ⋅ Wojciech Rejchel ⋅ Paweł Teisseyre

We study estimation of a class prior for unlabeled target samples which possibly differs from that of source population. Moreover, it is assumed that the source data is partially observable: only samples from the positive class and from the whole population are available (PU learning scenario). We introduce a novel direct estimator of the class prior which avoids estimation of posterior probabilities in both populations and has a simple geometric interpretation. It is based on a distribution matching technique together with kernel embedding in Reproducing Kernel Hilbert Space and is obtained as an explicit solution to an optimisation task. We establish its asymptotic consistency as well as an explicit non-asymptotic bound on its deviation from the unknown prior, which is calculable in practice. We study finite sample behaviour for synthetic and real data and show that the proposal works consistently on par or better than its competitors.


100
Influence Attributions can be Systematically Altered by Model Manipulation

Chhavi Yadav ⋅ Ruihan Wu ⋅ Kamalika Chaudhuri

Influence Functions are a standard tool for attributing predictions to training data in a principled manner and are widely used in applications such as data valuation and fairness. In this work, we present realistic incentives to manipulate influence-based attributions and investigate whether these attributions can be \textit{systematically} altered by an adversary. We show that small systemic perturbations to models can indeed alter influence-based attributions \textit{as desired}. We work on logistic regression models trained on ResNet feature embeddings and standard tabular fairness datasets and provide efficient attacks with backward-friendly implementations. Our work raises questions on the reliability of influence-based attributions in adversarial circumstances.

Offline reinforcement learning is important in many settings with available observational data but the inability to deploy new policies online due to safety, cost, and other concerns. Many recent advances in causal inference and machine learning target estimation of ``causal contrast" functions such as CATE, which can adapt to potentially smoother structure. We develop a dynamic generalization of the R-learner (Nie et al 2021, Lewis and Syrgkanis 2021) for estimating and optimizing the difference of $Q^\pi$-functions, $Q^\pi(s,a)-Q^\pi(s,a_0)$, for potential discrete-valued actions $a,a_0$, which can be used to optimize multiple-valued actions without loss of generality. We leverage orthogonal estimation to improve convergence rates, even if $Q$ and behavior policy (so-called nuisance functions) converge at slower rates and prove consistency of policy optimization under a margin condition. The method can leverage black-box estimators of the $Q$-function and behavior policy to target estimation of a more structured $Q$-function contrast, and comprises of simple squared-loss minimization.


102
Beyond Pooling: Matching for Robust Generalization Under Data Heterogeneity

Ayush Roy ⋅ Rudrasis Chakraborty ⋅ Lav R. Varshney ⋅ Vishnu Lokhande

Pooling heterogeneous datasets across domains is a common strategy in representation learning, but naive pooling can amplify distributional asymmetries and yield biased estimators, especially in settings where zero-shot generalization is required. We propose a matching framework that selects samples relative to an adaptive centroid and iteratively refines the representation distribution. The double robustness and the propensity score matching for the inclusion of data domains make matching more robust than naive pooling and uniform subsampling by filtering out the confounding domains (the main cause of heterogeneity). Theoretical and empirical analyses show that, unlike naive pooling or uniform subsampling, matching achieves better results under asymmetric meta-distributions, which are also extended to non-Gaussian and multimodal real-world settings. Most importantly, we show that these improvements translate to zero-shot medical anomaly detection, one of the extreme forms of data heterogeneity and asymmetry. The code is available on Github.


103
Implicit Updates for Average-Reward Temporal Difference Learning

Hwanwoo Kim ⋅ Dongkyu Cho ⋅ Eric Laber

Temporal difference (TD) learning is a cornerstone of reinforcement learning. In the average-reward setting, standard TD($\lambda$) is highly sensitive to the choice of step-size and thus requires careful tuning to maintain numerical stability. We introduce average-reward implicit TD($\lambda$), which employs an implicit fixed point update to provide data-adaptive stabilization while preserving the per iteration computational complexity of standard average-reward TD($\lambda$). In contrast to prior finite-time analyses of average-reward TD($\lambda$), which impose restrictive step-size conditions, we establish finite-time error bounds for the implicit variant under substantially weaker step-size requirements. Empirically, average-reward implicit TD($\lambda$) operates reliably over a much broader range of step-sizes and exhibits markedly improved numerical stability. This enables more efficient policy evaluation and policy learning, highlighting its effectiveness as a robust alternative to average-reward TD($\lambda$).

Maximum inner product search is a fundamental problem in recommender systems, information retrieval, and machine learning. Recently proposed bandit-based approaches have achieved high scalability with respect to dimensionality and offer favorable precision-speedup trade-offs. However, the lengths of their confidence intervals are determined independently of the actual reward distributions, which can lead to search inefficiency in practice. In this paper, we propose a data-dependent bandit-based algorithm in which the lengths of the confidence intervals are adaptively adjusted based on observed samples. Theoretical analysis demonstrates that our algorithm guarantees $\delta$-correctness for a broad class of distributions, including all log-concave continuous distributions, and that the sample complexity can be reduced adaptively according to individual reward distributions. In experiments, our approach outperformed existing algorithms on both synthetic and real-world datasets.


105
Linear Convergence of the Frank-Wolfe Algorithm over Product Polytopes

Gabriele Iommazzo ⋅ David Martínez-Rubio ⋅ Francisco Criado ⋅ Elias Wirth ⋅ Sebastian Pokutta

We study the linear convergence of Frank-Wolfe algorithms over product polytopes. We analyze two condition numbers for the product polytope, namely the pyramidal width and the vertex-facet distance, based on the condition numbers of individual polytope components. As a result, for convex objectives that are $\mu$-Polyak-Łojasiewicz, we show linear convergence rates quantified in terms of the resulting condition numbers. We apply our results to the problem of approximately finding a feasible point in a polytope intersection in high-dimensions, and demonstrate the practical efficiency of our algorithms through empirical results.


106
On the Weight Density of L2-Regularized Linear Classification and Regression

He Zhe Lin ⋅ Zhi-Bao Lu ⋅ Sheng-Wei Chen ⋅ Cheng-Hung Liu ⋅ Chih-Jen Lin

For traditional linear models with the widely used $L_2$-regularizer, it is often assumed that the resulting models are dense. As a result, little attention has been paid to when the optimal solution for an $L_2$-regularized problem can actually be sparse. In this work, we rigorously prove that for $L_2$-regularized support vector classification/regression, the theoretical optimum can indeed be sparse when the data have sparse feature values. Surprisingly, we observe that some optimization methods fail to preserve this sparsity and instead produce fully dense numerical solutions, leading to unnecessary storage overhead. We explain this phenomenon through detailed analysis. In particular, we novelly show that certain coordinate descent methods naturally yields sparser numerical solutions compared to other optimization algorithms. By applying suitable algorithms that preserve numerical sparsity, the storage can be reduced by up to 50%, which is highly advantageous for large-scale industrial applications.


107
Rethinking Cross-Modal Fine-Tuning: Optimizing the Interaction Between Feature Alignment and Target Fitting

Trong Khiem Tran ⋅ Manh Dao ⋅ Phi Le Nguyen ⋅ Thao Nguyen Truong ⋅ Nghia Hoang

Adapting pre-trained models to unseen feature modalities has become increasingly important due to the growing need for cross-disciplinary knowledge integration. A key challenge here is how to align the representation of new modalities with the most relevant parts of the pre-trained model's representation space to enable accurate knowledge transfer. This requires combining feature alignment with target fine-tuning, but uncalibrated combinations can exacerbate misalignment between the source and target feature-label structures and reduce target generalization. Existing work however lacks a theoretical understanding of this critical interaction between feature alignment and target fitting. To bridge this gap, we develop a principled framework that establishes a provable generalization bound on the target error, which explains the interaction between feature alignment and target fitting through a novel concept of feature-label distortion. This bound offers actionable insights into how this interaction should be optimized for practical algorithm design. The resulting approach achieves significantly improved performance over state-of-the-art methods across a wide range of benchmark datasets.

Mixture proportion estimation (MPE) aims to estimate class priors from unlabeled data. This task is a critical component in weakly supervised learning such as PU learning, learning with label noise, and domain adaptation. Existing MPE methods rely on the irreducibility assumption or its variant for identifiability. In this paper, we propose novel assumptions based on conditional independence (CI) given the class label, which ensure identifiability even when irreducibility does not hold. We develop method of moments estimators under these assumptions and analyze their asymptotic properties. Furthermore, we present weakly-supervised kernel tests to validate the CI assumptions, which are of independent interest in applications such as causal discovery and fairness evaluation. Empirically, we demonstrate the improved performance of our estimators compared with existing methods and that our tests successfully control both type I and type II errors.


109
Neural Additive Experts: Context-Gated Experts for Controllable Model Additivity

Guangzhi Xiong ⋅ Sanchit Sinha ⋅ Aidong Zhang

The trade-off between interpretability and accuracy remains a core challenge in machine learning. Standard Generalized Additive Models (GAMs) offer clear feature attributions but are often constrained by their strictly additive nature, which can limit predictive performance. Introducing feature interactions can boost accuracy yet may obscure individual feature contributions. To address these issues, we propose Neural Additive Experts (NAEs), a novel framework that seamlessly balances interpretability and accuracy. NAEs employ a mixture of experts framework, learning multiple specialized networks per feature, while a dynamic gating mechanism integrates information across features, thereby relaxing rigid additive constraints. Furthermore, we propose targeted regularization techniques to mitigate variance among expert predictions, facilitating a smooth transition from an exclusively additive model to one that captures intricate feature interactions while maintaining clarity in feature attributions. Our theoretical analysis and experiments on synthetic data illustrate the model's flexibility, and extensive evaluations on real-world datasets confirm that NAEs achieve an optimal balance between predictive accuracy and transparent, feature-level explanations. The code is available at https://github.com/Teddy-XiongGZ/NAE.


11
RoseCDL: Robust and Scalable Convolutional Dictionary Learning for rare-event and anomaly detection

Jad Yehya ⋅ Mansour Benbakoura ⋅ Cédric Allain ⋅ Benoît Malézieux ⋅ Matthieu Kowalski ⋅ Thomas Moreau

Detecting rare events and anomalies in large-scale signals is essential in fields such as astronomy, physical simulations, and biomedical science. In many cases, this problem naturally decomposes into identifying common local patterns and detecting deviations that correspond to anomalies. Convolutional Dictionary Learning (CDL) is a powerful tool for modeling local structures, but its adoption for this task has been limited by computational demands and sensitivity to outliers. We introduce RoseCDL, a novel CDL algorithm designed for robust and scalable modeling of signal pattern distribution. RoseCDL leverages stochastic windowing for efficient training and incorporates inline outlier detection to enhance robustness. This enables unsupervised identification of anomalous and rare patterns in long signals based on the local reconstruction loss. Experiments on real-world datasets show that RoseCDL delivers improved detection accuracy and computational efficiency, making CDL practical for challenging detection tasks in large-scale signal analysis.

Diffusion models achieve strong generative performance but often rely on large datasets that may include sensitive content. This challenge is compounded by the models’ tendency to memorize training data, raising privacy concerns. SFBD (Lu et al., 2025) addresses this by training on corrupted data and using limited clean samples to capture local structure and improve convergence. However, its iterative denoising and fine-tuning loop requires manual coordination, making it burdensome to implement. We reinterpret SFBD as an alternating projection algorithm and introduce a continuous variant, SFBD flow, that removes the need for alternating steps. We further show its connection to consistency constraint-based methods, and demonstrate that its practical instantiation, Online SFBD, consistently outperforms strong baselines across benchmarks.


111
FedDuA: Doubly Adaptive Federated Learning

Shokichi Takakura ⋅ Seng Pei Liew ⋅ Satoshi Hasegawa

Federated learning is a distributed learning framework where clients collaboratively train a global model without sharing their raw data. FedAvg is a popular algorithm for federated learning, but it often suffers from slow convergence due to the heterogeneity of local datasets and anisotropy in the parameter space. In this work, we formalize the central server optimization procedure through the lens of mirror descent and propose a novel framework, called FedDuA, which adaptively selects the global learning rate based on both inter-client and coordinate-wise heterogeneity in the local updates. We prove that our proposed doubly adaptive step-size rule is minimax optimal and provide a convergence analysis for convex objectives. Although the proposed method does not require additional communication or computational cost on clients, extensive numerical experiments show that our proposed framework outperforms baselines in various settings and is robust to the choice of hyperparameters.


112
Active learning for stochastic contextual linear bandits

Emma Brunskill ⋅ Ishani Karmarkar ⋅ Zhaoqi Li

A key goal in stochastic contextual linear bandits is to efficiently learn a near-optimal policy. Prior algorithms for this problem learn a policy by strategically sampling actions and naively (passively) sampling contexts from the underlying context distribution. However, in many practical scenarios---including online content recommendation, survey research, and clinical trials---practitioners can actively sample or recruit contexts based on prior knowledge of the context distribution. Despite this potential for _active learning_, the role of strategic context sampling in stochastic contextual linear bandits is underexplored. We propose an algorithm that learns a near-optimal policy by strategically sampling rewards of context-action pairs. We prove _instance-dependent_ theoretical guarantees demonstrating that our active context sampling strategy can improve over the minimax rate by up to a factor of $\smash{\sqrt{d}}$, where $d$ is the linear dimension. We also show empirically that our algorithm reduces the number of samples needed to learn a near-optimal policy, in tasks such as warfarin dose prediction and joke recommendation.

Deep neural networks have attained remarkable success across diverse classification tasks. Recent empirical studies have shown that deep networks learn features that are linearly separable across classes. However, these findings often lack rigorous justifications, even under relatively simple settings. In this work, we address this gap by examining the linear separation capabilities of shallow nonlinear networks. Specifically, inspired by the low intrinsic dimensionality of image data, we model inputs as a union of low-dimensional subspaces (UoS) and demonstrate that a single nonlinear layer can transform such data into linearly separable sets. Theoretically, we show that this transformation occurs with high probability when using random weights and quadratic activations. Notably, we prove this can be achieved when the network width scales polynomially with the intrinsic dimension of the data rather than the ambient dimension. Experimental results corroborate these theoretical findings and demonstrate that similar linear separation properties hold in practical scenarios beyond our analytical scope. This work bridges the gap between empirical observations and theoretical understanding of the separation capacity of nonlinear networks, offering deeper insights into model interpretability and generalization.


114
Dyno-Net: A Dynamic Feature Extraction Model for Gastrointestinal Polyp Detection

Zijie Song ⋅ Jingjing Wan ⋅ Xianchun Meng ⋅ Qingye Hua ⋅ Wenjie Zhu ⋅ Bolun Chen ⋅ WEI SHAO

Gastrointestinal polyps are precursors to colorectal cancer, underscoring the need for accurate early detection. We propose Dyno-Net, a dynamic feature extraction framework integrating multi-scale fusion (DynoFPN), adaptive convolution (DynoConv), and boundary refinement (RefineDet_LSCSBD), achieving 23.5\% higher fusion efficiency, 17.8\% better detection of small/atypical polyps, and mean IoU improvement from 0.68 to 0.81. Experiments confirm superior accuracy and robustness over mainstream detectors, demonstrating Dyno-Net’s clinical utility.


115
Learning in Continuous State-Space MDPs for Network Inventory Management

Hansheng Jiang ⋅ Shunan Jiang ⋅ Zuo-Jun Shen

We consider online learning in infinite-horizon, average-cost Markov Decision Processes (MDPs) with multi-dimensional, continuous state spaces and censored feedback. Our model setting, motivated by network inventory management applications such as vehicle sharing, is characterized by complex, correlated state transitions and the absence of value function convexity, rendering standard analytical techniques for both MDPs and inventory control inapplicable. Our primary contribution is an integrated framework establishing and leveraging the Lipschitz property of the long-run average cost function. This insight allows us to analyze the problem through the lens of Lipschitz bandits, for which we design a provably efficient online learning algorithm that learns a near-optimal policy from censored demand data. We derive a high-probability regret bound of $O(T^{\frac{n}{n+1}} (\log T)^{\frac{1}{n+1}})$, where $n$ is the network size through customized concentration inequalities for cumulative costs in MDPs with state-dependent transitions. Furthermore, we devise a matching lower bound for this learning problem, which captures the inherent dimensionality challenge.


116
Topological Alignment of Shared Vision-Language Embedding Space

Junwon You ⋅ Kang Dasol ⋅ Jae-Hun Jung

Contrastive Vision-Language Models (VLMs) have demonstrated strong zero-shot capabilities. However, their cross-modal alignment remains biased toward English due to limited multilingual multimodal data. Recent multilingual extensions have alleviated this gap but enforce instance-level alignment while neglecting the global geometry of the shared embedding space. We address this problem by introducing ToMCLIP (Topological Alignment for Multilingual CLIP), a topology-aware framework aligning embedding spaces with topology-preserving constraints. The proposed method applies persistent homology to define a topological alignment loss and approximates persistence diagram with theoretical error bounds using graph sparsification strategy. This work validates the proposed approach, showing enhanced structural coherence of multilingual representations, higher zero-shot accuracy on the CIFAR-100, and stronger multilingual retrieval performance on the xFlickr&CO. Beyond VLMs, the proposed approach provides a general method for incorporating topological alignment into representation learning. Code is available at https://github.com/junwon0/ToMCLIP.git.


117
Where You Place the Norm Matters: From Prejudiced to Neutral Initializations

Emanuele Francazi ⋅ Francesco Pinto ⋅ Aurelien Lucchi ⋅ Marco Baity-Jesi

Normalization layers were introduced to stabilize and accelerate training, yet their influence is critical already at initialization, where they shape signal propagation and output statistics before parameters adapt to data. In practice, both which normalization to use and where to place it are often chosen heuristically, despite the fact that these decisions can qualitatively alter a model’s behavior. We provide a theoretical characterization of how normalization choice and placement (Pre-Norm vs. Post-Norm) determine the distribution of class predictions at initialization, ranging from unbiased (Neutral) to highly concentrated (Prejudiced) regimes. We show that these architectural decisions induce systematic shifts in the initial prediction regime, thereby modulating subsequent learning dynamics. By linking normalization design directly to prediction statistics at initialization, our results offer principled guidance for more controlled and interpretable network design, including clarifying how widely used choices such as BatchNorm vs. LayerNorm and Pre-Norm vs. Post-Norm shape behavior from the outset of training.

The bilevel variational inequality (BVI) problem is a broad framework covering optimal equilibrium selection and equilibrium problems with equilibrium constraints (EPECs). We propose Regularized Operator Extrapolation (R-OpEx), a single-loop first-order algorithm for smooth and nonsmooth BVIs with stochastic monotone operators. R-OpEx combines Tikhonov regularization with operator extrapolation, requires only one operator evaluation per iteration, and tracks a single sequence of iterates. We show that R-OpEx obtains an $\epsilon$-solution in $\mathcal{O}(\epsilon^{-4})$ iterations for nonsmooth stochastic BVIs. If the inner operator is smooth and stochastic, we show an improved complexity of $\mathcal{O}(\epsilon^{-2})$ for the outer level operator while maintaining $\mathcal{O}(\epsilon^{-4})$ complexity for the inner level operator. For a smooth deterministic inner level operator, the overall complexity reduces to $\mathcal{O}(\epsilon^{-2})$. Finally, we improve the complexities substantially when the outer level is strongly monotone. To our knowledge, this is the first work to establish such guarantees for nonsmooth stochastic BVIs. We validate our results through numerical studies.


119
Prior Knowledge Makes It Possible: From Sublinear Graph Algorithms to LLM Test-Time Methods

Avrim Blum ⋅ Daniel Hsu ⋅ Cyrus Rashtchian ⋅ Donya Saless

Test-time augmentation, such as Retrieval-Augmented Generation (RAG) or tool use, critically depends on an interplay between a model's parametric knowledge and externally retrieved information. However, the theoretical underpinnings of this relationship remain poorly understood. Specifically, it is not clear how much pre-training knowledge is required to answer queries with a small number of augmentation steps, which is a desirable property in practice. To address this question, we formulate multi-step reasoning as an $s$-$t$ connectivity problem on a knowledge graph. We represent a model's pre-training parametric knowledge as a partial, potentially noisy subgraph. We view augmentation as querying an oracle for true edges that augment the model's knowledge. From a technical point of view, we are the first to study graph query complexity when given partial prior knowledge. Then, we characterize the necessary and sufficient number of augmentation steps for the model to generate an accurate answer. One key result shows a phase transition: if the prior knowledge graph over $n$ vertices is disconnected into small components, then finding a path via augmentation is inefficient and requires $\Omega(\sqrt{n})$ queries. On the other hand, once the density of correct knowledge surpasses a threshold, forming a giant component, we can find paths with an expected constant number of queries. We also extend our analysis to verifier-based approaches, quantifying how the cost of test-time verification scales with the unreliability of the pre-trained knowledge.


12
Revisiting Social Welfare in Bandits: UCB is (Nearly) All You Need

Dhruv Sarkar ⋅ Nishant Pandey ⋅ Sayak Ray Chowdhury

Regret in stochastic multi-armed bandits traditionally measures the difference between the highest reward and either the arithmetic mean of accumulated rewards or the final reward. These conventional metrics often fail to address fairness among agents receiving rewards, particularly in settings where rewards are distributed across a population, such as patients in clinical trials. To address this, a recent body of work has introduced Nash regret, which evaluates performance via the geometric mean of accumulated rewards, aligning with the Nash social welfare function known for satisfying fairness axioms. To minimize Nash regret, existing approaches require specialized algorithm designs and strong assumptions, such as multiplicative concentration inequalities and bounded, non-negative rewards, making them unsuitable for even Gaussian reward distributions. We demonstrate that an initial uniform exploration phase followed by a standard Upper Confidence Bound (UCB) algorithm achieves near-optimal Nash regret, while relying only on additive Hoeffding bounds, and naturally extending to sub-Gaussian rewards. Furthermore, we generalize the algorithm to a broad class of fairness metrics called the $p$-mean regret, proving (nearly) optimal regret bounds uniformly across all $p$ values. This is in contrast to prior work, which made extremely restrictive assumptions on the bandit instances and even then achieved suboptimal regret bounds. Numerical simulations validate our method’s practical efficacy, broadening the accessibility of fairness in bandit algorithms.


120
From Guess2Graph: When and How Can Unreliable Experts Safely Boost Causal Discovery in Finite Samples?

Sujai Hiremath ⋅ Dominik Janzing ⋅ Philipp M. Faller ⋅ Patrick Bloebaum ⋅ Elke Kirschbaum ⋅ Shiva Kasiviswanathan ⋅ Kyra Gan

Causal discovery algorithms often perform poorly with limited samples. While integrating expert knowledge (including from LLMs) as constraints promises to improve performance, guarantees for existing methods require perfect predictions or uncertainty estimates, making them unreliable for practical use. We propose the Guess2Graph (G2G) framework, which uses expert guesses to guide the sequence of statistical tests rather than replacing them. This maintains statistical consistency while enabling performance improvements. We develop two instantiations of G2G: PC-Guess, which augments the PC algorithm, and gPC-Guess, a learning-augmented variant designed to better leverage high-quality expert input. Theoretically, both preserve correctness regardless of expert error, with gPC-Guess provably outperforming its non-augmented counterpart in finite samples when experts are "better than random". Empirically, both show monotonic improvement with expert accuracy, with gPC-Guess achieving significantly stronger gains.


121
Differentially Private Algorithms for the Stochastic Compositional Optimization Problem

Zhuanghua Liu ⋅ Weida Li ⋅ Xiaokui Xiao ⋅ Bryan Kian Hsiang Low

In this paper, we study the stochastic compositional optimization problem under the constraint of differential privacy. We first introduce two private algorithms: noisy stochastic compositional gradient descent (NSCGD) and the noisy stochastically corrected stochastic compositional gradient (NSCSC) method. We use the algorithmic stability approach to establish bounds on the excess population loss of both methods in strongly convex and convex cases. However, these methods require gradient computations that are super-linear in the number of training samples. To address this, we propose a class of output perturbation-based randomized algorithms by exploiting the stability of the compositional empirical risk minimizer under the privacy constraint. These algorithms achieve comparable excess population risk with significantly reduced gradient computations.

The rise of large-scale pretrained models has made it feasible to generate predictive or synthetic features at low cost, raising the question of how to incorporate such surrogate predictions into downstream decision-making. We study this problem in the setting of online linear contextual bandits, where contexts may be complex, nonstationary, and only partially observed. In addition to bandit data, we assume access to an auxiliary historical dataset containing fully observed contexts--common in practice since such data are collected without adaptive interventions. We propose PULSE-UCB, an algorithm that leverages pretrained models trained on the auxiliary data to impute missing features during online decision-making. We establish regret guarantees that decompose into a standard bandit term plus an additional component reflecting pretrained model quality. In the i.i.d. context case with Hölder-smooth missing features, PULSE-UCB achieves near-optimal performance, supported by matching lower bounds. Our results quantify how uncertainty in predicted contexts affects decision quality and how much historical data is needed to improve downstream learning.

In off‑policy evaluation (OPE) for partially observable Markov decision processes (POMDPs), an agent must infer hidden states from past observations, which exacerbates both the curse of horizon and the curse of memory in existing OPE methods. This paper introduces a novel covering analysis framework that exploits the intrinsic metric structure of the belief space (distributions over latent states) to relax traditional coverage assumptions. By focusing on the policies with stability property, we derive error bounds that mitigate exponential blow-ups in horizon and memory length. Our unified analysis technique applies to a broad class of OPE algorithms, yielding concrete error bounds and coverage requirements expressed in terms of belief space metrics rather than raw history coverage. We illustrate the improved sample efficiency of this framework via case studies: the double sampling Bellman error minimization algorithm, and the memory-based future-dependent value functions (FDVF). In both cases, our coverage definition based on the belief‐space metric yields tighter bounds.


124
Partially Lazy Gradient Descent for Smoothed Online Learning

Naram Mhaisen ⋅ George Iosifidis

We introduce \textsc{$k$-lazyGD}, an online learning algorithm that bridges the gap between greedy Online Gradient Descent (OGD, for $k=1$) and lazy GD/dual-averaging (for $k=T$), creating a spectrum between reactive and stable updates. We analyze this spectrum in Smoothed Online Convex Optimization (SOCO), where the learner incurs both hitting and movement costs. Our main contribution is establishing that laziness is possible without sacrificing hitting performance: we prove that \textsc{$k$-lazyGD} achieves the optimal dynamic regret $\mathcal{O}(\sqrt{(P_T+1)T})$ for any laziness slack $k$ up to $\Theta(\sqrt{T/P_T})$, where $P_T$ is the comparator path length. This result formally connects the allowable laziness to the comparator's shifts, showing that \textsc{$k$-lazyGD} can retain the inherently small movements of lazy methods without compromising tracking ability. We base our analysis on the Follow the Regularized Leader (FTRL) framework, and derive a matching lower bound. Since the slack depends on $P_T$, an ensemble of learners with various slacks is used, yielding a method that is provably stable when it can be, and agile when it must be.


125
Variance Reduction Methods Do Not Need to Compute Full Gradients: Improved Efficiency Through Shuffling

Daniil Medyakov ⋅ Gleb Molodtsov ⋅ Savelii Chezhegov ⋅ Alexey Rebrikov ⋅ Aleksandr Beznosikov

Stochastic optimization algorithms are widely used for machine learning with large-scale data. However, their convergence often suffers from non-vanishing variance. Variance Reduction (VR) methods, such as SVRG and SARAH, address this issue but introduce a bottleneck by requiring periodic full gradient computations. In this paper, we explore popular VR techniques and propose an approach that eliminates the necessity for expensive full gradient calculations. To avoid these computations and make our approach memory-efficient, we employ two key techniques: the shuffling heuristic and the concept of SAG/SAGA methods. For non-convex objectives, our convergence rates match those of standard shuffling methods, while under strong convexity, they demonstrate an improvement. We empirically validate the efficiency of our approach and demonstrate its scalability on large-scale machine learning tasks including image classification problem on CIFAR-10 and CIFAR-100 datasets.


126
On Relation-Aware Slicing in Cross-Domain Alignment

Dhruv Sarkar ⋅ Aprameyo Chakrabartty ⋅ Anish Chakrabarty ⋅ Swagatam Das

The Sliced Gromov-Wasserstein (SGW) distance, aiming to relieve the computational cost of solving a non-convex quadratic program that is the Gromov-Wasserstein distance, utilizes projecting directions sampled uniformly from unit hyperspheres. This slicing mechanism incurs unnecessary computational costs due to uninformative directions, which also affects the representative power of the distance. However, finding a more appropriate distribution over the projecting directions (slicing distribution) is often an optimization problem in itself that comes with its own computational cost. In addition, with more intricate distributions, the sampling itself may be expensive. As a remedy, we propose an optimization-free slicing distribution that provides fast sampling for the Monte Carlo approximation. We do so by introducing the Relation-Aware Projecting Direction (RAPD), effectively capturing the pairwise association of each of two pairs of random vectors, each following their ambient law. This enables us to derive the Relation-Aware Slicing Distribution (RASD), a location-scale law corresponding to sampled RAPDs. Finally, we introduce the RASGW distance and its variants, e.g., IWRASGW (Importance Weighted RASGW), which overcome the shortcomings experienced by SGW. We theoretically analyze its properties and substantiate its empirical prowess using extensive experiments on various alignment tasks.


127
FIELDING: Clustered Federated Learning with Data Drift

Minghao Li ⋅ Dmitrii Avdiukhin ⋅ Rana Shahout ⋅ Nikita Ivkin ⋅ Vladimir Braverman ⋅ Minlan Yu

Federated Learning (FL) trains deep models across edge devices without centralizing raw data. However, client heterogeneity slows down convergence and limits global model accuracy. Clustered FL (CFL) mitigates this by grouping clients with similar representations and training a separate model for each cluster. In practice, client data evolves over time -- a phenomenon we refer to as data drift -- which breaks cluster homogeneity and degrades performance. Data drift can take different forms depending on whether changes occur in the output values, the input features, or the relationship between them. We propose FIELDING, a CFL framework for handling diverse types of data drift with low overhead. FIELDING detects drift at individual clients and performs selective re-clustering to balance cluster quality and model performance, while remaining robust to varying levels of heterogeneity. Experiments show that FIELDING improves final model accuracy by 2.4–6.9% and achieves target accuracy 1.38x–3.10x faster than existing state-of-the-art CFL methods.

The classic secretary problem, a cornerstone of optimal stopping theory, assumes a random, immutable arrival order of candidates. While recent work has integrated machine-learned predictions to improve selection, the power to set the arrival order based on these predictions remains largely untapped. This paper introduces a novel framework for the secretary problem that leverages predictions for both valuation and strategic scheduling. We propose an algorithm that strategically controls the arrival time of the top-predicted candidate and dynamically adapts its hiring policy based on observed prediction accuracy. Our analysis shows that this approach achieves a worst-case competitive ratio of 0.229, surpassing the 0.215 bound of state-of-the-art algorithms that do not control ordering, bringing it closer to the upper bound of $1/e \approx 0.368$ while maintaining consistency guarantees. Furthermore, we demonstrate that our ordering framework can be adapted to improve fairness guarantees, doubling the success probability in a known fair algorithm from 1/16 to 1/8. Our results highlight that controlling the sequence is a powerful tool for building more robust and fair learning-augmented online algorithms.


13
DeepRV: Accelerating Spatiotemporal Inference with Pre-trained Neural Priors

Jhonathan Navott ⋅ Daniel Jenson ⋅ Seth Flaxman ⋅ Elizaveta Semenova

Gaussian Processes (GPs) provide a flexible and statistically principled foundation for modelling spatiotemporal phenomena, but their $\mathcal{O}(N^3)$ scaling makes them intractable for large datasets. Approximate methods such as variational inference (VI), inducing-point (sparse) GPs, low-rank kernel approximations (e.g., Nystrom methods and random Fourier features), and approximations such as INLA improve scalability but typically trade off accuracy, calibration, or modelling flexibility. We introduce DeepRV, a neural-network surrogate that replaces GP prior sampling, while closely matching full GP accuracy at inference including hyperparameter estimates, and reducing computational complexity to $\mathcal{O}(N^2)$, increasing scalability and inference speed. DeepRV serves as a drop-in replacement for GP prior realisations in e.g. MCMC-based probabilistic programming pipelines, preserving full model flexibility. Across simulated benchmarks, non-separable spatiotemporal GPs, and a real-world application to education deprivation in London (n = 4,994 locations), DeepRV achieves the highest fidelity to exact GPs while substantially accelerating inference. Code is provided in the dl4bi Python package, with all experiments run on a single consumer-grade GPU to ensure accessibility for practitioners.


130
Train Less, Infer Faster: Efficient Model Finetuning and Compression via Structured Sparsity

Jonathan Svirsky ⋅ Yehonathan Refael ⋅ Ofir Lindenbaum

Fully finetuning foundation language models (LMs) with billions of parameters is often impractical due to high computational costs, memory requirements, and the risk of overfitting. Although methods like low-rank adapters help address these challenges by adding small trainable modules to the frozen LM, they also increase memory usage and do not reduce inference latency. We uncover an intriguing phenomenon: sparsifying specific model rows and columns enables efficient task adaptation without requiring weight tuning. We propose a scheme for effective finetuning via sparsification using training stochastic gates, which requires minimal trainable parameters, reduces inference time, and removes 20--40\% of model parameters without significant accuracy loss. Empirical results show it outperforms recent finetuning baselines in efficiency and performance. Additionally, we provide theoretical guarantees for the convergence of this stochastic gating process, and show that our method admits a simpler and better-conditioned optimization landscape compared to LoRA. Our results highlight sparsity as a compelling mechanism for task-specific adaptation in LMs.

The analysis of neural representation has become an integral part of research aiming to better understand the inner workings of neural networks. While there are many different approaches to investigate neural representations, an important line of research has focused on doing so through the lens of intrinsic dimensions (IDs). Although this perspective has provided valuable insights and stimulated substantial follow-up research, important limitations of this approach have remained largely unaddressed. In this paper, we highlight a crucial discrepancy between theory and practice of IDs in neural representations, theoretically and empirically showing that common ID estimators are, in fact, not tracking the true underlying ID of the representation. We contrast this negative result with an investigation of the underlying factors that may drive commonly reported ID-related results on neural representation in the literature. Building on these insights, we offer a new perspective on ID estimation in neural representations.

Expand-and-sparsify representations are a class of theoretical models that capture sparse representation phenomena observed in the sensory systems of many animals. At a high level, these representations map an input $x \in \mathbb{R}^d$ to a much higher dimension $m \gg d$ via random linear projections before zeroing out all but the $k \ll m$ largest entries. The result is a $k$-sparse vector in $\\{0,1\\}^m$. We study the suitability of this representation for two fundamental statistical problems: density estimation and mode estimation. For density estimation, we show that a simple linear function of the expand-and-sparsify representation produces an estimator with minimax-optimal $\ell_{\infty}$ convergence rates. In mode estimation, we provide simple algorithms on top of our density estimator that recover single or multiple modes at optimal rates up to logarithmic factors under mild conditions.


133
SPIRE: Conditional Personalization for Federated Diffusion Generative Models

Kaan Ozkara ⋅ Ruida Zhou ⋅ Suhas Diggavi

Two defining characteristics of federated learning (FL) client data are distributional heterogeneity and small local sample sizes. These properties necessitate data efficient, and client specific adaptation rather than a one-size-fits-all model. Recent advances in diffusion models have revolutionized generative AI. However, their scale is too large for straightforward fine-tuning; making personalization difficult. To enable personalized diffusion generative models, we propose Shared‑backbone Personal Identity Representation Embeddings (SPIRE), a framework that casts per‑client diffusion based generation as conditional generation in FL. SPIRE factorizes the network into (i) a high‑capacity global backbone that learns a population‑level score function and (ii) lightweight, learnable client embeddings that encode local data statistics. This separation enables parameter‑efficient fine‑tuning that touches $<0.01\%$ of weights. We provide the first theoretical bridge between conditional diffusion training and maximum‑likelihood estimation in Gaussian‑mixture models. For a two‑component mixture we prove that gradient descent on the DDPM with respect to mixing weights loss recovers the optimal mixing weights and enjoys dimension‑free error bounds. Our analysis also hints at how client embeddings act as biases that steer a shared score network toward personalized distributions. Empirically, SPIRE matches or surpasses strong baselines during collaborative pre‑training, and vastly outperforms them when adapting to unseen/new clients—reducing Kernel Inception Distance while updating only hundreds of parameters. SPIRE further mitigates catastrophic forgetting and remains robust across fine-tuning learning‑rate and epoch choices.

In this paper, we study the problem of estimating the variance and covariance of datasets under differential privacy in the add-remove model. While estimation in the swap model has been extensively studied in the literature, the add-remove model remains less explored and more challenging, as the dataset size must also be kept private. To address this issue, we develop efficient mechanisms for variance and covariance estimation based on the Bezier mechanism, a novel moment-release framework that leverages Bernstein bases. We prove that our proposed mechanisms are minimax optimal in the high-privacy regime by establishing new minimax lower bounds. Moreover, beyond worst-case scenarios, we analyze instance-wise utility and show that the Bézier-based estimator consistently achieves better utility compared to alternative mechanisms. Finally, we demonstrate the effectiveness of the Bézier mechanism beyond variance and covariance estimation, showcasing its applicability to other statistical tasks.

Score-based query attacks pose a serious threat to deep learning models by crafting adversarial examples (AEs) using only black-box access to model output scores, iteratively optimizing inputs based on observed loss values. While recent runtime defenses attempt to disrupt this process via output perturbation, most either require access to model parameters or fail when attackers adapt their tactics. In this paper, we first reveal that even the state-of-the-art plug-and-play defense can be bypassed by adaptive attacks, exposing a critical limitation of existing runtime defenses. We then propose Dashed Line Defense (DLD), a plug-and-play post-processing method specifically designed to withstand adaptive query strategies. By introducing ambiguity in how the observed loss reflects the true adversarial strength of candidate examples, DLD prevents attackers from reliably analyzing and adapting their queries, effectively disrupting the AE generation process. We provide theoretical guarantees of DLD’s defense capability and validate its effectiveness through experiments on ImageNet, demonstrating that DLD consistently outperforms prior defenses—even under worst-case adaptive attacks—while preserving the model’s predicted labels.


137
Sparse Offline Reinforcement Learning with Corruption Robustness

Nam Tran ⋅ Andi Nika ⋅ Goran Radanovic ⋅ Long Tran-Thanh ⋅ Debmalya Mandal

We investigate robustness to strong data corruption in offline sparse reinforcement learning (RL). In our setting, an adversary may arbitrarily perturb a fraction of the collected trajectories from a high-dimensional but sparse Markov decision process, and our goal is to estimate a near-optimal policy. The main challenge is that, in the high-dimensional regime where the number of samples $N$ is smaller than the feature dimension $d$, exploiting sparsity is essential for obtaining non-vacuous guarantees but has not been systematically studied in offline RL. We analyse the problem under uniform coverage and sparse single-concentrability assumptions. While Least Square Value Iteration (LSVI), a standard approach for robust offline RL, performs well under uniform coverage, we show that integrating sparsity into LSVI is unnatural, and its analysis may break down due to overly pessimistic bonuses. To overcome this, we propose actor–critic methods with sparse robust estimator oracles, which avoid the use of pointwise pessimistic bonuses and provide the first non-vacuous guarantees for sparse offline RL under single-policy concentrability coverage. Moreover, we extend our results to the contaminated setting and show that our algorithm remains robust under strong contamination. Our results provide the first non-vacuous guarantees in high-dimensional sparse MDPs with single-policy concentrability coverage and corruption, showing that learning near-optimal policy remains possible in regimes where traditional robust offline RL techniques may fail.

The growing focus on distributed data and privacy has spurred the rise of Federated Learning (FL). Empirical studies show that, under equal resources, FL often underperforms centralized training, but the reasons behind this gap remain theoretically unclear. This lack of understanding leaves open whether FL is inherently inferior in generalization and how the gap might be closed. We address this by formulating FL as a server-based SGD optimization problem over distributed data and analyzing the generalization gap within the PAC-Bayesian framework. Our analysis derives non-vacuous bounds on this gap, showing that such a gap necessarily exists under equal resources and depends on training parameters. We further prove that the gap can be fully eliminated only by introducing new clients or adding new data to existing clients, with the latter being more efficient. In contrast, allowing FL to have advantages in other resources, such as larger models or more communication rounds, cannot close the gap. As a complementary analysis, we also confirm from a stability perspective that centralized FL holds a generalization advantage over decentralized FL, justifying our FL formulation choice. Extensive experiments across different model architectures and datasets validate our theory.

The average reward is a fundamental performance metric in reinforcement learning (RL) focusing on the long-run performance of an agent. Differential temporal difference (TD) learning algorithms are a major advance for average reward RL as they provide an efficient online method to learn the value functions associated with the average reward in both on-policy and off-policy settings. However, existing convergence guarantees require a local clock in learning rates tied to state visit counts, which practitioners do not use and does not extend beyond tabular settings. We address this limitation by proving the almost sure convergence of on-policy $n$-step differential TD for any $n$ using standard diminishing learning rates without a local clock. We then derive three sufficient conditions under which off-policy $n$-step differential TD also converges without a local clock. These results strengthen the theoretical foundations of differential TD and bring its convergence analysis closer to practical implementations.

Value-based reinforcement learning (RL) efficiently handles high-dimensional state spaces, but existing methods lack a principled method for hyperparameter tuning without online interaction, limiting use in safety-critical and data-scarce domains. We propose the Duality-based Residual Estimator (DRE), a simple offline validation metric for value-based offline RL. DRE is compatible with standard value-based Off-Policy Evaluation (OPE) and enables automatic hyperparameter selection, which is formalized through an adaptive extension of the Probably Approximately Correct (PAC) guarantee for Q-function selection. Our results address a key theoretical bottleneck toward fully offline value-based RL, which enables deployment without extensive online tuning.


140
An Indicator of Membership Inference Security in Post-Training Quantized Models

Eric AUBINAIS ⋅ Philippe Formont ⋅ Pablo Piantanida ⋅ Elisabeth Gassiat

Quantizing machine learning models has demonstrated its effectiveness in lowering memory and inference costs while maintaining performance levels comparable to those of the original models. In this work, we investigate the impact of quantization procedures on privacy in data-driven models, focusing on their vulnerability to membership inference attacks. Membership Inference Security (MIS) has recently been proposed to characterize the privacy of machine learning models against the most powerful (and possibly unknown) attacks. However, quantifying MIS appears to be computationally very difficult. In this paper, we propose a new MIS indicator for post-training quantization procedures of machine learning models that minimize an empirical loss. This new indicator is a byproduct of a theoretical asymptotic analysis of the MIS in this context. We also present a methodology for empirically estimating our MIS indicator. Using synthetic datasets and real-world data (in the context of drug discovery), we demonstrate the effectiveness of our approach in assessing and ranking the MIS of different quantizers.

Detecting out-of-distribution (OOD) data is critical for machine learning, be it for safety reasons or to enable open-ended learning. However, beyond mere detection, choosing an appropriate course of action typically hinges on the type of OOD data encountered. Unfortunately, the latter is generally not distinguished in practice, as modern OOD detection methods collapse distributional shifts into single scalar outlier scores. This work argues that scalar-based methods are thus insufficient for OOD data to be properly contextualized and prospectively exploited, a limitation we overcome with the introduction of DISC: Diffusion-based Statistical Characterization. DISC leverages the iterative denoising process of diffusion models to extract a rich, multi-dimensional feature vector that captures statistical discrepancies across multiple noise levels. Extensive experiments on image and tabular benchmarks show that DISC matches or surpasses state-of-the-art detectors for OOD detection and, crucially, also classifies OOD type, a capability largely absent from prior work. As such, our work enables a shift from simple binary OOD detection to a more granular detection.


142
Regularizing Extrapolation in Causal Inference

David Arbour ⋅ Harsh Parikh ⋅ Bijan Niknam ⋅ Elizabeth Stuart ⋅ Kara Rudolph ⋅ Avi Feller

Many common estimators in machine learning and causal inference are linear smoothers, where the prediction is a weighted average of the training outcomes. Some estimators, such as ordinary least squares and kernel ridge regression, allow for arbitrarily negative weights, which improve feature imbalance but often at the cost of increased dependence on parametric modeling assumptions and higher variance. By contrast, estimators like importance weighting and random forests (sometimes implicitly) restrict weights to be non-negative, reducing dependence on parametric modeling and variance at the cost of worse imbalance. In this paper, we propose a unified framework that directly penalizes the level of extrapolation, replacing the current practice of a hard non-negativity constraint with a soft constraint and corresponding hyperparameter. We derive a worst-case extrapolation error bound and introduce a novel ``bias-bias-variance'' tradeoff, encompassing biases due to feature imbalance, model misspecification, and estimator variance; this tradeoff is especially pronounced in high dimensions, when positivity is poor. We then develop an optimization procedure that regularizes this bound while minimizing imbalance and outline how to use this approach as a sensitivity analysis for dependence on parametric modeling assumptions. We demonstrate the effectiveness of our approach through synthetic experiments and a real-world application, involving the generalization of randomized controlled trial estimates to a target population of interest.

Recovering the random graph model from an observed collection of networks is known to present significant challenges in the setting, where the networks do not share a common node set and have different sizes. More specifically, the goal is the estimation of the graphon function that parametrizes the nonparametric exchangeable random graph model. Existing methods typically suffer from either limited accuracy or high computational complexity. We introduce a new histogram-based estimator with low algorithmic complexity that achieves high accuracy by jointly aligning the nodes of all graphs, in contrast to most conventional methods that order nodes graph by graph. Consistency results of the proposed graphon estimator are established. A numerical study shows that the proposed estimator outperforms existing methods in terms of accuracy, especially when the dataset comprises only small and variable-size networks. Moreover, the computing time of the new method is considerably shorter than that of other consistent methodologies. Additionally, when applied to a graph neural network classification task, the proposed estimator enables more effective data augmentation, yielding improved performance across diverse real-world datasets.


144
A Scalable Lift-and-Project Differentiable Approach For the Maximum Cut Problem

Ismail Alkhouri ⋅ Mian Wu ⋅ CUNXI YU ⋅ Jia Liu ⋅ Rongrong Wang ⋅ Alvaro Velasquez

We propose a scalable framework for solving the Maximum Cut (MaxCut) problem in large graphs using projected gradient ascent on quadratic objectives. Our approach is differentiable and leverages GPUs for gradient-based optimization. It is not a machine learning method and does not require training data. Starting from a continuous relaxation of the classical quadratic binary formulation, we present a parallelized strategy that explores multiple initialization vectors in batch. We analyze the relaxed objective, showing it is convex and has fixed-points corresponding to local optima—particularly at boundary points—highlighting a key challenge in non-convex optimization. To improve exploration, we introduce a lifted quadratic formulation that over-parameterizes the solution space. We also provide a theoretical characterization of these lifted fixed-points. Finally, we propose DECO, a dimension-alternating algorithm that switches between the unlifted and lifted formulations, combined with importance-based degree initialization and a population-based evolutionary hyper-parameter search. Experiments on diverse graph families show that our methods attain comparable or superior performance relative to recent neural networks and GPU-accelerated sampling approaches.


145
Information Hidden in Gradients of Regression with Target Noise

Arash Jamshidi ⋅ Katsiaryna Haitsiukevich ⋅ Kai Puolamäki

Second-order information---such as curvature or data covariance---is critical for optimisation, diagnostics, and robustness. However, in many modern settings, only the gradients are observable. We show that the gradients alone can reveal the Hessian, equalling the data covariance $\Sigma$ for the linear regression. Our key insight is a simple variance calibration: injecting Gaussian noise so that the total target noise variance equals the batch size ensures that the empirical gradient covariance closely approximates the Hessian, even when evaluated far from the optimum. We provide non-asymptotic operator-norm guarantees under sub-Gaussian inputs. We also show that without such calibration, recovery can fail by an $\Omega(1)$ factor. The proposed method is practical (a ``set target-noise variance to $n$’’ rule) and robust (variance $\mathcal{O}(n)$ suffices to recover $\Sigma$ up to scale). Applications include preconditioning for faster optimisation, adversarial risk estimation, and gradient-only training, for example, in distributed systems. We support our theoretical results with experiments on synthetic and real data.


146
Harnessing the Power of Reinforcement Learning for Adaptive MCMC

Congye Wang ⋅ Matthew Fisher ⋅ Heishiro Kanagawa ⋅ Wilson Ye Chen ⋅ Chris Oates

Sampling algorithms drive probabilistic machine learning, and recent years have seen an explosion in the diversity of tools for this task. However, the increasing sophistication of sampling algorithms is correlated with an increase in the tuning burden. There is now a greater need than ever to treat the tuning of samplers as a learning task in its own right. In a conceptual breakthrough, Wang et al. (2025) formulated Metropolis-Hastings as a Markov decision process, opening up the possibility for adaptive tuning using reinforcement learning (RL). Their emphasis was on theoretical foundations; realising the practical benefit of reinforcement learning Metropolis-Hastings (RLMH) was left for subsequent work. The purpose of this paper is twofold: First, we observe the surprising result that natural choices of reward, such as the acceptance rate, or the expected squared jump distance, provide insufficient signal for training RLMH. Instead, we propose a novel reward based on the contrastive divergence, whose superior performance in the context of RLMH is demonstrated. Second, we explore the potential of RLMH and present adaptive gradient-based samplers that balance flexibility of the Markov transition kernel with learnability of the associated RL task. A comprehensive simulation study using the posteriordb benchmark supports the practical effectiveness of RLMH.


147
Semi-Implicit Variational Inference via Kernelized Path Gradient Descent

Tobias Pielok ⋅ Bernd Bischl ⋅ David Rügamer

Semi-implicit variational inference (SIVI) is a powerful framework for approximating complex posterior distributions, but training with the Kullback–Leibler (KL) divergence can be challenging due to high variance and bias in high-dimensional settings. While current state-of-the-art score-based methods, particularly Kernel Semi-Implicit Variational Inference (K-SIVI), have been shown to also work in high dimensions, they can be "blind'' to isolated components and mixing proportions, especially in multi-modal distributions. In this work, we propose a kernelized KL divergence estimator that stabilizes training through nonparametric smoothing, effectively addressing the "blindness'' challenge. To further reduce the bias, we introduce an importance sampling correction. We provide a theoretical connection to the amortized version of the Stein variational gradient descent, which estimates the score gradient via Stein's identity, showing that both methods minimize the same objective, but our semi-implicit approach achieves lower gradient variance. In addition, our method's bias in function space is benign, leading to more stable and efficient optimization. Empirical results demonstrate that our method outperforms or matches state-of-the-art score matching methods in both performance and training efficiency.


148
Brenier Isotonic Regression

Han Bao ⋅ Amirreza Eshraghi ⋅ Yutong Wang

Isotonic regression (IR) is shape-constrained regression to maintain a univariate fitting curve non-decreasing, which has numerous applications including single-index models and probability calibration. When it comes to multi-output regression, the classical IR is no longer applicable because the monotonicity is not readily extendable. We consider a novel multi-output regression problem where a regression function is \emph{cyclically monotone}. Roughly speaking, a cyclically monotone function is the gradient of some convex potential. Whereas enforcing cyclic monotonicity is apparently challenging, we leverage the fact that Kantorovich's optimal transport (OT) always yields a cyclically monotone coupling as an optimal solution. This perspective naturally allows us to interpret a regression function and the convex potential as a link function in generalized linear models and Brenier's potential in OT, respectively, and hence we call this IR extension \emph{Brenier isotonic regression}. We demonstrate experiments with probability calibration and generalized linear models. In particular, IR outperforms many famous baselines in probability calibration robustly.


149
DISPO: Enhancing Training Efficiency and Stability in Reinforcement Learning for Large Language Model Mathematical Reasoning

Batuhan Karaman ⋅ Aditya Rawal ⋅ Mohammad Ghavamzadeh ⋅ Suhaila Shakiah ⋅ Arijit Biswas ⋅ Ruida Zhou

Reinforcement learning with verifiable rewards has emerged as a promising paradigm for enhancing the reasoning capabilities of large language models particularly in mathematics. Current approaches in this domain present a clear trade-off: PPO-style methods (e.g., GRPO/DAPO) offer training stability but exhibit slow learning trajectories due to their trust-region constraints on policy updates, while REINFORCE-style approaches (e.g., CISPO) demonstrate improved learning efficiency but suffer from performance instability as they clip importance sampling weights while still permitting non-zero gradients outside the trust-region. To address these limitations, we introduce DISPO, a simple yet effective REINFORCE-style algorithm that decouples the up-clipping and down-clipping of importance sampling weights for correct and incorrect responses, yielding four controllable policy update regimes. Through targeted ablations, we uncover how each regime impacts training: for correct responses, weights $>1$ increase the average token entropy (i.e., exploration) while weights $<1$ decrease it (i.e., distillation) - both beneficial but causing gradual performance degradation when excessive. For incorrect responses, overly restrictive clipping triggers sudden performance collapse through repetitive outputs (when weights $>1$) or vanishing response lengths (when weights $<1$). By separately tuning these four clipping parameters, DISPO maintains the exploration-distillation balance while preventing catastrophic failures, achieving 61.04\% on AIME'24 (vs.\ 55.42\% CISPO and 50.21\% DAPO) with similar gains across various benchmarks and models.

Two-tower models are widely used for applications involving learning similarities between pairs of entities, such as user-item pairs in recommender systems. These models are commonly trained using stochastic gradient methods. However, uniformly sampling data often leads to problematic batches that lack positive pairs, especially when positives are a minority of the dataset—a situation particularly common in similarity learning. Instead, a strategy known as in-batch sampling is widely adopted to ensure the presence of positive pairs and training efficiency. Nevertheless, in-batch sampling introduces its own issues, such as mistaking positives for negatives and oversampling popular pairs, resulting in significant performance degradation. In this work, we provide the first systematic analysis of these issues, showing that they all arise from the inconsistency between the expected objective under in-batch sampling and the full-data objective. We refer to this inconsistency as the bias of in-batch sampling. To validate our analysis, we design an unbiased batch loss and conduct rigorous experiments directly comparing unbiased and biased losses. The results provide strong empirical confirmation of our theoretical findings.


150
Conformal Robust Control of Linear Systems

Yash Patel ⋅ Sahana Rayan ⋅ Ambuj Tewari

End-to-end engineering design pipelines, in which designs are evaluated using concurrently defined optimal controllers, are becoming increasingly common in practice. To discover designs that perform well even under the misspecification of system dynamics, such end-to-end pipelines have now begun evaluating designs with a robust control objective in place of the nominal optimal control setup. Current approaches of specifying such robust control subproblems, however, rely on hand specification of perturbations anticipated to be present upon deployment or margin methods that ignore problem structure, resulting in a lack of theoretical guarantees and overly conservative empirical performance. We, instead, propose a novel methodology for LQR systems that leverages conformal prediction to specify such uncertainty regions in a data-driven fashion. Such regions have distribution-free coverage guarantees on the true system dynamics, in turn allowing for a probabilistic characterization of the regret of the resulting robust controller. We then demonstrate that such a controller can be efficiently produced via a novel policy gradient method that has convergence guarantees. We finally demonstrate the superior empirical performance of our method over alternate robust control specifications, such as $H_\infty$ and LQR with multiplicative noise, across a collection of engineering control systems.


151
Variance Constrained Distribution Alignment in Few-shot Models

Xiaohong Cai ⋅ Yi SUN ⋅ Zhaowen Lin ⋅ Tianwei Cai

Learning generative models from the limited samples remains challenging due to unstable estimation of class conditional representations. Such instability often leads to intra-class distribution drift and degraded generalization under few sample regimes. To address these challenges, we propose a method that can model class level latent distributions for flexible and efficient few shot synthesis. Specifically, each input is represented by a learnable conditional latent distribution. Metric based statistical modeling effectively disentangles latent variables, contracts intra-class variance, and enlarges inter-class margins while enforcing cross task distributional alignment. Furthermore, we provide a variance based generalization analysis, showing that controlling class conditional variance tightens generalization bounds under few sample regimes. Experiments on the benchmark datasets demonstrate that our method surpasses prior works in visual quality and diversity, highlighting the benefit of statistical alignment for robust few shot generative modeling.

The conditional average treatment effect (CATE) is a commonly targeted statistical parameter for measuring the effect of a treatment conditional on covariates. However, the CATE will fail to capture effects of treatments beyond differences in conditional expectations. Inspired by causal forests for CATE estimation, we develop a forest-based method to estimate the conditional kernel treatment effect (CKTE), based on the recently introduced Distributional Random Forest (DRF) algorithm. Adapting the splitting criterion of DRF, we show how one forest fit can be used to obtain a consistent and asymptotically normal estimator of the CKTE, as well as an approximation of its sampling distribution. This allows to study the difference in distribution between control and treatment group and thus yields a more comprehensive understanding of the treatment effect.


153
MLorc: Momentum Low-rank Compression for Memory Efficient Large Language Model Adaptation

Wei Shen ⋅ Yaxiang Zhang ⋅ Minhui Huang ⋅ Mengfan Xu ⋅ Jiawei Zhang ⋅ Cong Shen

With increasing size of large language models (LLMs), full-parameter fine-tuning imposes substantial memory demands. To alleviate this, we propose a novel memory-efficient training paradigm called Momentum Low-rank compression (MLorc). The key idea of MLorc is to compress and reconstruct the momentum of matrix parameters during training to reduce memory consumption. Compared to LoRA, MLorc avoids enforcing a fixed-rank constraint on weight update matrices and thus enables full-parameter learning. Compared to GaLore, MLorc directly compress the momentum rather than gradients, thereby better preserving the training dynamics of full-parameter fine-tuning. We provide a theoretical guarantee for its convergence under mild assumptions. Empirically, MLorc consistently outperforms other memory-efficient training methods, matches or even exceeds the performance of full fine-tuning at small ranks (e.g., $r=4$), and generalizes well across different optimizers, all while not compromising time or memory efficiency.

Relevant feature selection in high-dimensional settings is a central challenge in modern scientific research and decision-making. While many existing methods offer strong statistical guarantees, they are often computationally intractable in high-dimensional problems. To address this issue, we introduce a novel Laplace approximation method based on Le Cam’s one-step procedure, termed \textsf{OLAP}. This approach is specifically designed to alleviate computational burdens while maintaining statistical rigor. Under standard high-dimensional assumptions, we establish that \textsf{OLAP} achieves consistent variable selection. Moreover, the method yields a posterior distribution that can be efficiently explored in polynomial time via a simple Gibbs sampling algorithm. We demonstrate the effectiveness of OLAP through applications to logistic and Poisson regression models, using both simulated and real data.


155
A Geometric Approach to Optimal Experimental Design

Gavin Kerrigan ⋅ Christian Andersson Naesseth ⋅ Tom Rainforth

We introduce a novel geometric framework for optimal experimental design (OED). Traditional OED approaches, such as those based on mutual information, rely explicitly on probability densities, leading to restrictive invariance properties. To address these limitations, we propose the mutual transport dependence (MTD), a measure of statistical dependence grounded in optimal transport theory which provides a geometric objective for optimizing designs. Unlike conventional approaches, the MTD can be tailored to specific downstream estimation problems by choosing appropriate geometries on the underlying spaces. We demonstrate that our framework produces high-quality designs while offering a flexible alternative to standard information-theoretic techniques.

We introduce GKK+, a new design for multi-arm randomized controlled trials. Standard Bernoulli randomization is robust but often yields poor covariate balance, while existing restricted-randomness designs mainly address two-arm settings. GKK+ extends the Karmarkar–Karp (KK) differencing method to multiple arms. When covariates are smooth and well-behaved, GKK+ achieves an exponentially better covariate balance than the standard Bernoulli design while preserving sufficient randomness. GKK+ improves efficiency in estimating treatment effects and supports standard asymptotic inference. Simulations on synthetic and real datasets demonstrate improved balance and lower estimator variance compared to existing methods.


157
Refining Covariance Matrix Estimation in Stochastic Gradient Descent Through Bias Reduction

Ziyang Wei ⋅ Wanrong Zhu ⋅ Jingyang Lyu ⋅ Wei Biao Wu

We study online inference and asymptotic covariance estimation for the stochastic gradient descent (SGD) algorithm. While classical methods—such as plug-in and batch-means estimators—are available, they either require inaccessible second-order (Hessian) information or suffer from slow convergence. To address these challenges, we propose a novel, fully online de-biased covariance estimator that eliminates the need for second-order derivatives while significantly improving estimation accuracy. Our method employs a bias-reduction technique to achieve a convergence rate of $n^{(\alpha-1)/2}\sqrt{\log n}$, outperforming existing Hessian-free alternatives.


158
RL-finetuning LLMs from on- and off-policy data with a single algorithm

Yunhao Tang ⋅ Taco Cohen ⋅ David Zhang ⋅ Gabriel Synnaeve ⋅ Rémi Munos

We introduce a novel reinforcement learning algorithm (AGRO, for Any-Generation Reward Optimization) for finetuning Large Language Models. AGRO leverages the concept of response consistency, which states that the optimal policy satisfies a notion of consistency across any possible generation of the model. We derive algorithms that find optimal solutions via sample-based policy gradient and provide theoretical guarantees on their convergence. Our experiments demonstrate the effectiveness of AGRO in both on-policy and off-policy settings, showing improved performance on the MATH dataset over baseline methods.


159
Likelihood-Free Inference via Structured Score Matching

Haoyu Jiang ⋅ Yuexi Wang ⋅ Yun Yang

In many statistical problems, the data distribution is specified through a generative process for which the likelihood function is analytically intractable, yet inference on the associated model parameters remains of primary interest. We develop a likelihood-free inference framework that combines score matching with gradient-based optimization and bootstrap procedures to facilitate parameter estimation together with uncertainty quantification. The proposed methodology introduces tailored score-matching estimators for approximating likelihood score functions, and incorporates an architectural regularization scheme that embeds the statistical structure of log-likelihood scores to improve both accuracy and scalability. We provide theoretical guarantees and demonstrate the practical utility of the method through simulations and benchmark applications, where it performs favorably compared to existing approaches.

Policymakers often use recursive binary split rules to partition populations based on binary outcomes and target subpopulations whose probability of this adverse binary event exceeds a threshold. We call such problems Latent Probability Classification (LPC). Practitioners typically employ Classification and Regression Trees (CART) for LPC. We prove that, in the context of LPC, classic CART and the knowledge distillation method, in which the student model is a CART (referred to as KD-CART), are suboptimal. We propose Maximizing Distance Final Split (MDFS), which generates split rules that strictly dominate CART/KD-CART under the unique intersect assumption. Under this assumption, MDFS identifies the unique best split rule. Consequently, it targets more vulnerable subpopulations than CART/KD-CART, where ``more vulnerable'' is defined as a higher probability of the adverse binary event. To further relax the assumption, we propose Penalized Final Split (PFS) and weighted Empirical risk Final Split (wEFS). Through extensive simulation studies, we demonstrate that the proposed methods predominantly outperform CART/KD-CART using two risk metrics. When applied to real-world datasets, MDFS generates policies that target more vulnerable subpopulations than the CART/KD-CART.

Complex data are often represented as a graph, which in turn can often be viewed as a realisation of a random graph, such as an inhomogeneous random graph model (IRG). For general fast goodness-of-fit tests in high dimensions, kernelised Stein discrepancy (KSD) tests are a powerful tool. Here, we develop a KSD-type test for IRG models that can be carried out with a single observation of the network. The test applies to networks of any size, but is particularly relevant for small networks for which asymptotic tests are not warranted. We also provide theoretical guarantees.


161
Graphon Mixtures

Sevvandi Kandanaarachchi ⋅ Cheng Soon Ong

Social networks have a small number of large hubs, and a large number of small dense communities. We propose a generative model that captures both hub and dense structures. Based on recent results about graphons on line graphs, our model is a graphon mixture, enabling us to generate sequences of graphs where each graph is a combination of sparse and dense graphs. We propose a new condition on sparse graphs (the max-degree), which enables us to identify hubs. We show theoretically that we can estimate the normalized degree of the hubs, as well as estimate the graphon corresponding to sparse components of graph mixtures. We illustrate our approach on synthetic data and real-world networks, showing the benefits of explicitly modeling sparse graphs.


161
Practical and Efficient Rashomon Set Sampling for Model Interpretability

Sichao Li ⋅ Amanda Barnard ⋅ Quanling Deng

Explaining a single model can be misleading when many near-optimal models (a Rashomon set) yield different feature attributions. We frame this as a Rashomon set sampling problem and propose two practical axioms that any Rashomon sampler should satisfy: generalizability (meaning it must accept arbitrary reference models and loss functions) and Implementation Sparsity (meaning it should return a small, attribution‑diverse subset of valid models). These two axioms are not satisfied by most known attribution methods, which we consider to be a fundamental weakness. Building on these axioms, we propose an $\epsilon$-subgradient-based sampling framework and quantify effectiveness with Search Efficiency Ratio (SER) and Functional Explanation Range (FER). Experiments on a synthetic quadratic task and five real-world datasets show that our sampler achieves comparable or higher FER with up to $\\sim100\\times$ fewer models than exhaustive baselines such as TreeFARMS, while remaining agnostic to model class and loss. Even when the reference model is sub‑optimal in practice, the resulting attributions align with ground truth and accepted domain knowledge.


162
Root Cause Analysis of Outliers in Unknown Cyclic Graphs

Daniela Schkoda ⋅ Dominik Janzing

We study the propagation of outliers in cyclic causal graphs with linear structural equations, tracing them back to one or several "root cause" nodes. We show that it is possible to identify a short list of potential root causes provided that the perturbation is sufficiently strong and propagates according to the same structural equations as in the normal mode. This shortlist consists of the true root causes together with those of its parents lying on a cycle with the root cause. Notably, our method does not require prior knowledge of the causal graph and yields encouraging results on simulated data and real data from biology and cloud computing.

In this work, we investigate the generalization properties of random feature methods. Our analysis extends prior results for Tikhonov regularization to a broad class of spectral regularization techniques and further generalizes the setting to operator-valued kernels. This unified framework enables, for the first time, a rigorous theoretical analysis of neural operators and neural networks through the lens of the Neural Tangent Kernel (NTK). In particular, it allows us to establish optimal learning rates and provides a good understanding of how many neurons are required to achieve a given accuracy. Furthermore, we establish minimax rates in the well-specified case and also in the misspecified case, where the target is not contained in the reproducing kernel Hilbert space. These results sharpen and complete earlier findings for specific kernel algorithms.


164
Optimal Arm Elimination Algorithms for Combinatorial Bandits

Yuxiao Wen ⋅ Yanjun Han ⋅ Zhengyuan Zhou

Combinatorial bandits extend the classical bandit framework to settings where the learner selects multiple arms in each round, motivated by applications such as online recommendation and assortment optimization. While extensions of upper confidence bound (UCB) algorithms arise naturally in this context, adapting arm elimination methods has proved more challenging. We introduce a novel elimination scheme that partitions arms into three categories (confirmed, active, and eliminated), and incorporates explicit exploration to update these sets. We demonstrate the efficacy of our algorithm in two settings: the combinatorial multi-armed bandit with general graph feedback, and the combinatorial linear contextual bandit. Matching lower bounds are also provided. In both cases, our approach achieves near-optimal regret, whereas UCB-based methods can provably fail due to insufficient explicit exploration.

Graph clustering is a fundamental task in unsupervised learning with broad real-world applications. While spectral clustering methods for undirected graphs are well-established and guided by a minimum cut optimization consensus, their extension to directed graphs remains relatively underexplored due to the additional complexity introduced by edge directions. In this paper, we leverage statistical inference on stochastic block models to guide the development of a spectral clustering algorithm for directed graphs. Specifically, we study the maximum likelihood estimation under a widely used directed stochastic block model, and derive a global objective function that aligns with the underlying community structure. Building on its spectral relaxation, we propose two novel spectral clustering algorithms for directed graphs and establish theoretical guarantees for their misclustering error. Extensive experiments on synthetic and real-world datasets demonstrate significant performance gains over existing baselines.


166
Meta Sparse Principal Component Analysis

Imon Banerjee ⋅ Jean Honorio

We study the meta-learning for support recovery (i.e., non-zero coordinates of the eigenvectors) in high-dimensional Principal Component Analysis. We reduce the sufficient sample complexity in a novel task, with the information that is learned from auxiliary tasks, where a task is defined as a random Principal Component (PC) matrix with its own support. We pool data from all the tasks to execute an improper estimation of a single PC matrix, by maximising the $\ell_1$-regularised predictive covariance. With $m$ tasks for $p$-variate sub-Gaussian random vectors, we establish the sufficient sample complexity for each task to be of the order $O(\sqrt{m^{-1}\log p})$, with high probability. This is very relevant for meta-learning where there are many tasks $m = O(\log p)$, each with very few samples, i.e., $n = O(1)$, in an scenario where multi-task learning fails. For a novel task, we prove that the sufficient sample complexity of successful support recovery can be reduced to $O(\log |J|)$, under an additional constraint that the support of the novel task is a subset of the estimated support union ($J$) from the auxiliary tasks. This reduces the original sample complexity of $O(\log p)$ for learning a single task. Theoretical claims are validated with numerical simulations and the problem of true covariance estimation in brain-imaging and cancer genetics data sets are considered to validate the proposed methodology.

We introduce the first variance-aware algorithms for contextual dueling bandits that leverage shallow exploration strategies with neural networks for nonlinear utility approximation. A key theoretical challenge is the absence of a closed-form estimator, which led prior work to require an extremely large network width $m$ (i.e., $m = \widetilde{\Omega}(T^{14})$). We address this constraint with a novel analytical approach that combines iterative self-improvement with spectral analysis. Our analysis significantly reduces the network width requirement to $m = \widetilde{\Omega}(T^{6})$, and shows that our algorithms achieve a sublinear regret of $ \widetilde{\mathcal{O}}\left(d\sqrt{\sum_{t=1}^{T} \sigma_t^2} + \sqrt{dT}\right) $ under both UCB and TS frameworks. Empirical results show that the proposed algorithms are not only computationally efficient and exhibit sublinear regret in practical settings, but also achieve state-of-the-art performance on both synthetic and real-world tasks.


168
Off-policy Distributional Q($\lambda$): Distributional RL without Importance Sampling

Yunhao Tang ⋅ Mark Rowland ⋅ Rémi Munos ⋅ Bernardo Avila Pires ⋅ Will Dabney

We introduce off-policy distributional Q($\lambda$), a new addition to the family of off-policy distributional evaluation algorithms. Off-policy distributional Q($\lambda$) does not apply importance sampling for off-policy learning, which introduces intriguing interactions with signed measures. Such unique properties distributional Q($\lambda$) from other existing alternatives such as distributional Retrace. We characterize the algorithmic properties of distributional Q($\lambda$) and validate theoretical insights with tabular experiments. We show how distributional Q($\lambda$)-C51, a combination of Q($\lambda$) with the C51 agent, exhibits promising results on deep RL benchmarks.

Self-attention layers have become fundamental building blocks of modern deep neural networks, yet their theoretical understanding remains limited, particularly from the perspective of random matrix theory. In this work, we provide a rigorous analysis of the singular value spectrum of the attention matrix and establish the first Gaussian equivalence result for attention. In a natural regime where the inverse temperature remains of constant order, we show that the singular value distribution of the attention matrix is asymptotically characterized by a tractable linear model. We further demonstrate that the distribution of squared singular values deviates from the Marchenko–Pastur law, which has been believed in previous work. Our proof relies on two key ingredients: precise control of fluctuations in the normalization term and a refined linearization that leverages favorable Taylor expansions of the exponential. This analysis also identifies a threshold for linearization and elucidates why attention, despite not being an entrywise operation, admits a rigorous Gaussian equivalence in this regime.


17
Rate-optimal Design for Anytime Best Arm Identification

Junpei Komiyama ⋅ Kyoungseok Jang ⋅ Junya Honda

We consider the best arm identification problem, where the goal is to identify the arm with the highest mean reward from a set of $K$ arms under a limited sampling budget. This problem models many practical scenarios such as A/B testing. We consider a class of algorithms for this problem, which is provably minimax optimal up to a constant factor. This idea is a generalization of existing works in fixed-budget best arm identification, which are limited to a particular choice of risk measures. Based on the framework, we propose Almost Tracking, a closed-form algorithm that has a provable guarantee on the popular risk measure. Unlike existing algorithms, Almost Tracking does not require the total budget in advance nor does it need to discard a significant part of samples, which gives a practical advantage. Through experiments on synthetic and real-world datasets, we show that our algorithm outperforms existing anytime algorithms as well as fixed-budget algorithms. Our recommended algorithm for practitioners is found in the final section.

We provide the first proof of learning rate transfer with a multi-layer perceptron (MLP) parametrized with $\mu P$, a neural network parameterization designed to ``maximize'' feature learning in the infinite-width limit. We show that with $\mu P$, the optimal learning rate converges to a non-zero constant as width goes to infinity. In contrast, we show that this doesn't hold with other parametrizations such as Standard Parameterization (SP) and Neural Tangent Parametrization (NTP). We provide extensive empirical results validating our theoretical findings.

Diagonal linear networks (DLNs) are a tractable model that captures several nontrivial behaviors in neural network training, such as initialization-dependent solutions and incremental learning. These phenomena are typically studied in isolation, leaving the overall dynamics insufficiently understood. In this work, we present a unified analysis of various phenomena in the gradient flow dynamics of DLNs. Using Dynamical Mean-Field Theory (DMFT), we derive a low-dimensional effective process that captures the asymptotic gradient flow dynamics in high dimensions. Analyzing this effective process yields new insights into DLN dynamics, including loss convergence rates and their trade-off with generalization, and systematically reproduces many of the previously observed phenomena. These findings deepen our understanding of DLNs and demonstrate the effectiveness of the DMFT approach in analyzing high-dimensional learning dynamics of neural networks.


172
A projection-based framework for gradient-free and parallel learning

Andreas Bergmeister ⋅ Manish Krishan Lal ⋅ Stefanie Jegelka ⋅ Suvrit Sra

We present a feasibility-seeking approach to neural network training. This mathematical optimization framework is distinct from conventional gradient-based loss minimization and uses projection operators and iterative projection algorithms. We reformulate training as a large-scale feasibility problem: finding network parameters and states that satisfy local constraints derived from its elementary operations. Training then involves projecting onto these constraints, a local operation that can be parallelized across the network. We introduce PJAX, a JAX-based software framework that enables this paradigm. PJAX composes projection operators for elementary operations, automatically deriving the solution operators for the feasibility problems (akin to autodiff for derivatives). It inherently supports GPU/TPU acceleration, provides a familiar NumPy-like API, and is extensible. We train diverse architectures (MLPs, CNNs, RNNs) on standard benchmarks using PJAX, demonstrating its functionality and generality. Our results show that this approach is a compelling alternative to gradient-based training, with clear advantages in parallelism and the ability to handle non-differentiable operations.


173
Generalized and Optimal Straight-Through Estimators

James Hooper ⋅ Alexander Shekhovtsov

Modern ML models often utilize discrete components within their computational graphs, making training challenging. In such cases, approximate-chain-rule gradient estimators can be applied. They work reasonably well but are obtained by combining diverse rationales with ad-hoc choices. In this work, we propose a principled axiomatic approach to define a general family of gradient estimators and show that it subsumes many existing methods. Within this family, we derive optimal estimators with respect to a minimum variance criterion subject to interpretable bias-limiting constraints, addressing integer and one-hot categorical discrete variables. We empirically demonstrate that our estimator can achieve a better bias-variance trade-off than existing ones on synthetic problems and outperforms them on training variational auto-encoders with discrete latent variables.


174
Standard Acquisition Is Sufficient for Asynchronous Bayesian Optimization

Ben Riegler ⋅ James Odgers ⋅ Vincent Fortuin

Asynchronous Bayesian optimization is widely used for gradient-free optimization in domains with independent parallel experiments and varying evaluation times. Existing methods posit that standard acquisitions lead to redundant and repeated queries, proposing complex solutions to enforce diversity in queries. Challenging this fundamental premise, we show that methods, like the Upper Confidence Bound, can in fact achieve theoretical guarantees essentially equivalent to those of sequential Thompson sampling. A conceptual analysis of asynchronous Bayesian optimization reveals that existing works neglect intermediate posterior updates, which we find to be generally sufficient to avoid redundant queries. Further investigation shows that by penalizing busy locations, diversity-enforcing methods can over-explore in asynchronous settings, reducing their performance. Our extensive experiments demonstrate that simple standard acquisition functions match or outperform purpose-built asynchronous methods across synthetic and real-world tasks.


175
GiVA: Gradient-Informed Bases for Vector-Based Adaptation

Neeraj Gangwar ⋅ Rishabh Deshmukh ⋅ Michael Shavlovsky ⋅ Hancao Li ⋅ Vivek Mittal ⋅ Lexing Ying ⋅ Nickvash Kani

As model sizes continue to grow, parameter-efficient fine-tuning has emerged as a powerful alternative to full fine-tuning. While LoRA is widely adopted among these methods, recent research has explored vector-based adaptation methods due to their extreme parameter efficiency. However, these methods typically require substantially higher ranks than LoRA to match its performance, leading to increased training costs. This work introduces GiVA, a gradient-based initialization strategy for vector-based adaptation. It achieves training times comparable to LoRA and maintains the extreme parameter efficiency of vector-based adaptation. We evaluate GiVA across diverse benchmarks, including natural language understanding, natural language generation, and image classification. Experiments show that our approach consistently outperforms or achieves performance competitive with existing vector-based adaptation methods and LoRA while reducing rank requirements by a factor of eight ($8\times$).


176
PAC-Bayesian Bounds on Constrained $f$-Entropic Risk Measures

Hind Atbir ⋅ Farah Cherfaoui ⋅ Guillaume Metzler ⋅ Emilie Morvant ⋅ Paul Viallard

PAC generalization bounds on the risk, when expressed in terms of the expected loss, are often insufficient to capture imbalances between subgroups in the data. To tackle this limitation, we introduce a new family of risk measures, called constrained $f$-entropic risk measures, which enable finer control over distributional shifts and subgroup imbalances via $f$-divergences, and include the Conditional Value at Risk (CVaR), a well-known risk measure. We derive both classical and disintegrated PAC-Bayesian generalization bounds for this family of risks, providing the first disintegrated PAC-Bayesian guarantees beyond standard risks. Building on this theory, we design a self-bounding algorithm minimizing our bounds directly, yielding models with guarantees at the subgroup level. We empirically demonstrate the usefulness of our approach.


177
Rethinking Probabilistic Circuit Parameter Learning

Anji Liu ⋅ Zilei Shao ⋅ Guy Van den Broeck

Probabilistic Circuits (PCs) offer a computationally scalable framework for generative modeling, supporting exact and efficient inference of a wide range of probabilistic queries. While recent advances have significantly improved the expressiveness and scalability of PCs, effectively training their parameters remains a challenge. In particular, a widely used optimization method, full-batch Expectation-Maximization (EM), requires processing the entire dataset before performing a single update, making it ineffective for large datasets. Although empirical extensions to the mini-batch setting, as well as gradient-based mini-batch algorithms, converge faster than full-batch EM, they generally underperform in terms of final likelihood. We investigate this gap by establishing a novel theoretical connection between these practical algorithms and the general EM objective. Our analysis reveals a fundamental issue that existing mini-batch EM and gradient-based methods fail to properly regularize distribution changes, causing each update to effectively ``overfit'' the current mini-batch. Motivated by this insight, we introduce anemone, a new mini-batch EM algorithm for PCs. Anemone applies an implicit adaptive learning rate to each parameter, scaled by how much it contributes to the likelihood of the current batch. Across extensive experiments on language, image, and DNA datasets, anemone consistently outperforms existing optimizers in both convergence speed and final performance.

Tyler’s M-estimator is a popular method for robust covariance estimation. Although Tyler proposed a widely used fixed-point iteration to compute the estimator nearly four decades ago, the theoretical properties of the fixed-point iteration have remained elusive. In particular, no deterministic global convergence rate has been established, despite decades of research. In this paper, we resolve this longstanding question by interpreting the fixed-point iteration as an instance of the convex-concave procedure, and analyzing it through a novel combination of relative smoothness and Riemannian optimization. Our analysis requires us to navigate two distinct geometric frameworks: one induced by the Kullback--Leibler divergence, and the other one arising from the Riemannian geometry on the manifold of positive definite matrices. This interplay between the two geometries also allows us to prove that a regularized variant of the fixed-point iteration converges linearly. Beyond contributing to the theoretical understanding of Tyler’s M-estimator, our analysis demonstrates how the integration of Euclidean and Riemannian perspectives can lead to new insights into the design and analysis of optimization algorithms.


179
Support Basis: Fast Attention Beyond Bounded Entries

Maryam Aliakbarpour ⋅ Vladimir Braverman ⋅ Junze Yin ⋅ Haochen Zhang

Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks. However, the quadratic complexity of softmax attention remains a central bottleneck that limits their scalability. Alman and Song (NeurIPS 2023a; NeurIPS 2024a) proposed sub-quadratic time algorithms for attention inference and training, respectively, but they rely on the restrictive **bounded-entry assumption**. We show that this assumption rarely holds in practice, which significantly limits their applicability to modern LLMs. In this paper, we introduce **support-basis decomposition**, a new technique for accurate and efficient attention inference and training **without** the bounded-entry assumption. We empirically show that the entries of the query and key matrices exhibit sub-Gaussian behavior. Leveraging this widely observed property, we perform exact computation on sparse components and polynomial approximation on dense components. Without relying on restrictive assumptions, we theoretically show that our algorithm achieves sub-quadratic runtime while matching the approximation error of prior work, and we empirically validate its computational efficiency and downstream task performance. We further generalize our method to a multi-threshold setting that eliminates all distributional assumptions, providing the first theoretical justification for the empirical success of polynomial attention. Moreover, we show that softmax attention can be closely approximated by multiple polynomial attentions with significantly smaller $\ell_p$ error.


18
Improved Algorithms for Clustering with Noisy Distance Oracles

Pinki Pradhan ⋅ Anup Bhattacharya ⋅ Ragesh Jaiswal

Bateni *et al.* has recently introduced the *weak-strong distance oracle model* to study clustering problems in settings with limited distance information. Given query access to the strong-oracle and weak-oracle in the weak-strong oracle model, the authors design approximation algorithms for $k$-means and $k$-center clustering problems. In this work, we design algorithms with improved guarantees for $k$-means and $k$-center clustering problems in the weak-strong oracle model. The $k$-means++ algorithm is routinely used to solve $k$-means in settings where complete distance information is available. One of the main contributions of this work is to show that $k$-means++ algorithm can be adapted to work in the weak-strong oracle model using only a small number of strong-oracle queries, which is the critical resource in this model. In particular, our $k$-means++ based algorithm gives a constant approximation for $k$-means and uses $O(k^2 \log^2{n})$ strong-oracle queries. This improves on the algorithm of Bateni *et al.* that uses $O(k^2 \log^4n \log^2 \log n)$ strong-oracle queries for a constant factor approximation of $k$-means. For the $k$-center problem, we give a simple *ball-carving* based $6(1 + \epsilon)$-approximation algorithm that uses $O(k^3 \log^2{n} \log{\frac{\log{n}}{\epsilon}})$ strong-oracle queries. This is an improvement over the $14(1 + \epsilon)$-approximation algorithm of Bateni *et al.* that uses $O(k^2 \log^4{n} \log^2{\frac{\log{n}}{\epsilon}})$ strong-oracle queries. To show the effectiveness of our algorithms, we perform empirical evaluations on real-world datasets and show that our algorithms significantly outperform the algorithms of Bateni *et al.*


180
Gaussian Approximation and Multiplier Bootstrap for Stochastic Gradient Descent

Marina Sheshukova ⋅ Sergey Samsonov ⋅ Denis Belomestny ⋅ Eric Moulines ⋅ Qi-Man Shao ⋅ Zhuo-Song Zhang ⋅ Alexey Naumov

In this paper, we establish the non-asymptotic validity of the multiplier bootstrap procedure for constructing the confidence sets using the Stochastic Gradient Descent (SGD) algorithm. Under appropriate regularity conditions, our approach avoids the need to approximate the limiting covariance of Polyak-Ruppert SGD iterates, which allows us to derive approximation rates in convex distance of order up to $1/\sqrt{n}$. Notably, this rate can be faster than the one that can be proven in the Polyak-Juditsky central limit theorem. To our knowledge, this provides the first fully non-asymptotic bound on the accuracy of bootstrap approximations in SGD algorithms. Our analysis builds on the Gaussian approximation results for nonlinear statistics of independent random variables.


181
Learning Hyperparameters via a Data-Emphasized Variational Objective

Ethan Harvey ⋅ Mikhail Petrov ⋅ Michael Hughes

When training large models on limited data, avoiding overfitting is paramount. Common grid search or smarter search methods rely on expensive separate runs for each candidate hyperparameter, while carving out a validation set that reduces available training data. In this paper, we study gradient-based learning of hyperparameters via the evidence lower bound (ELBO) objective from Bayesian variational methods. This avoids the need for any validation set. We focus on scenarios where the model is over-parameterized for flexibility and the approximate posterior is chosen to be Gaussian with isotropic covariance for tractability, even though it cannot match the true posterior. In such scenarios, we find the ELBO prioritizes posteriors that match the prior, leading to severe underfitting. Instead, we recommend a data-emphasized ELBO that upweights the likelihood but not the prior. In Bayesian transfer learning of image and text classifiers, our method reduces the 88+ hour grid search of past work to under 3 hours while delivering comparable accuracy. We further demonstrate how our approach enables efficient yet accurate approximations of Gaussian processes with learnable lengthscale kernels.


182
On the Interplay of Priors and Overparametrization in Bayesian Neural Network Posteriors

Julius Kobialka ⋅ Emanuel Sommer ⋅ Chris Kolb ⋅ Juntae Kwon ⋅ Daniel Dold ⋅ David Rügamer

Bayesian neural network (BNN) posteriors are often considered impractical for inference, as symmetries fragment them, non-identifiabilities inflate dimensionality, and weight-space priors are seen as meaningless. In this work, we study how overparametrization and priors together reshape BNN posteriors and derive implications allowing us to better understand their interplay. We show that redundancy introduces three key phenomena that fundamentally reshape the posterior geometry: layer balancedness, weight distribution on equal-probability manifolds, and prior conformity. We validate our findings through extensive experiments with posterior sampling budgets that far exceed those of earlier works, and demonstrate how overparametrization induces structured, prior-aligned weight posterior distributions.

We study algorithms for construction of *composable coresets* for the task of *Determinant Maximization* under *partition constraint*. Given a point set $V \subset R^d$ that is partitioned into $s$ groups $V_1,\cdots, V_s$, and integers $k_1,...,k_s$, where $k=\sum_i k_i$, the goal is to pick $k_i$ points from group $V_i$ such that the overall determinant of the picked $k$ points is maximized. Determinant Maximization and its constrained variants have gained a lot of interest for modeling diversity, and have found applications in the context of data summarization. When the cardinality $k$ of the selected set is greater than the dimension $d$, we show a peeling algorithm that gives us a composable coreset of size $kd$ with a provably optimal approximation factor of $d^{O(d)}.$ When $k\leq d$, we show a simple coreset construction with optimal size and approximation factor. As a further application of our technique, we get a composable coreset for determinant maximization under the more general laminar matroid constraints, and a composable coreset for unconstrained determinant maximization in a previously unresolved regime. Our results generalize to all strongly Rayleigh distributions and to several other experimental design problems. As an application, we improve the runtime of the practical local-search based algorithm of [Anari-Vuong--COLT'22] for determinantal maximization under partition constraint from $O(n^{2^s}k^{2^s})$ to $O(n k^{2^s})$, making it only linear on the number of points $n$.


184
Local Inconsistency Resolution: The Interplay between Attention and Control in Probabilistic Models

Oliver Richardson ⋅ Abdessamad Kabid ⋅ Mandana Samiei ⋅ Mehran Shakerinava ⋅ Joseph Viviano ⋅ Ali Parviz ⋅ Yoshua Bengio

We present a generic algorithm for learning and approximate inference with an intuitive epistemic interpretation: iteratively focus on a subset of the model and resolve inconsistencies using the parameters under control. This framework, which we call Local Inconsistency Resolution (LIR) is built upon Probabilistic Dependency Graphs (PDGs), which provide a flexible representational foundation capable of capturing inconsistent beliefs. We show how LIR unifies and generalizes a wide variety of important algorithms in the literature, including the Expectation-Maximization (EM) algorithm, belief propagation, adversarial training, GANs, and GFlowNets. Each of these methods can be recovered as a specific instance of LIR by choosing a procedure to direct focus (attention and control). We implement this algorithm for discrete PDGs and study its properties on synthetically generated PDGs, comparing its behavior to the global optimization semantics of the full PDG.


185
Minimax Generalized Cross-Entropy

Kartheek Bondugula ⋅ Santiago Mazuelas ⋅ Aritz Pérez ⋅ Anqi Liu

Loss functions play a central role in supervised classification. Cross-entropy (CE) is widely used, whereas the mean absolute error (MAE) loss can offer robustness but is difficult to optimize. Interpolating between the CE and MAE losses, generalized cross-entropy (GCE) has recently been introduced to provide a trade-off between optimization difficulty and robustness. Existing formulations of GCE result in a non-convex optimization over classification margins that is prone to underfitting, leading to poor performances with complex datasets. In this paper, we propose a minimax formulation of generalized cross-entropy (MGCE) that results in a convex optimization over classification margins. Moreover, we show that MGCEs can provide an upper bound on the classification error. The proposed bilevel convex optimization can be efficiently implemented using stochastic gradient computed via implicit differentiation. Using benchmark datasets, we show that MGCE achieves strong accuracy, faster convergence, and better calibration, especially in the presence of label noise.


186
Hellinger Multimodal Variational Autoencoders

Huyen Vo ⋅ Isabel Valera

Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. Predominant methods aggregate unimodal inference distributions using either a product of experts (PoE), a mixture of experts (MoE), or their combinations to approximate the joint posterior. In this work, we revisit multimodal inference through the lens of probabilistic opinion pooling, an optimization-based approach. We start from Hölder pooling with $\alpha=0.5$, which corresponds to the unique symmetric member of the $\alpha$-divergence family, and derive a moment-matching approximation, termed Hellinger. We then leverage such an approximation to propose HELVAE, a multimodal VAE that avoids sub-sampling, yielding an efficient yet effective model that: (i) learns more expressive latent representations as additional modalities are observed; and (ii) empirically achieves better trade-offs between generative coherence and quality, outperforming state-of-the-art multimodal VAE models.


187
Dendrograms of Mixing Measures for Softmax-Gated Gaussian Mixture of Experts: Consistency Without Model Sweeps

TienHai Do ⋅ Trung Nguyen ⋅ TrungTin Nguyen ⋅ Nhat Ho ⋅ Binh Nguyen Thanh ⋅ Chris Drovandi

We develop a unified statistical framework for softmax-gated Gaussian mixture of experts (SGMoE) that addresses three long-standing obstacles in parameter estimation and model selection: (i) non-identifiability of gating parameters up to common translations, (ii) intrinsic gate-expert interactions that induce coupled differential relations in the likelihood, and (iii) the tight numerator-denominator coupling in the softmax-induced conditional density. Our approach introduces Voronoi-type loss functions aligned with the gate-partition geometry and establishes finite-sample convergence rates for the maximum likelihood estimator (MLE). In over-specified models, we reveal a link between the MLE's convergence rate and the solvability of an associated system of polynomial equations characterizing near-nonidentifiable directions. For model selection, we adapt dendrograms of mixing measures to SGMoE, yielding a consistent, sweep-free selector of the number of experts that attains pointwise-optimal parameter rates under overfitting while avoiding multi-size training. Simulations on synthetic data corroborate the theory, accurately recovering the expert count and achieving the predicted rates for parameter estimation while closely approximating the regression function. Under model misspecification (e.g., $\epsilon$-contamination), the dendrogram selection criterion is robust, recovering the true number of mixture components, while the Akaike information criterion, the Bayesian information criterion, and the integrated completed likelihood tend to overselect as sample size grows. On a maize proteomics dataset of drought-responsive traits, our dendrogram-guided SGMoE selects two experts, exposes a clear mixing-measure hierarchy, stabilizes the likelihood early, and yields interpretable genotype-phenotype maps, outperforming standard criteria without multi-size training.

Distributionally robust reinforcement learning (DR-RL) has recently gained significant attention as a principled approach that addresses discrepancies between training and testing environments. To balance robustness, conservatism, and computational traceability, the literature has introduced DR-RL models with SA-rectangular and S-rectangular adversaries. While most existing statistical analyses focus on SA-rectangular models, owing to their algorithmic simplicity and the optimality of deterministic policies, S-rectangular models more accurately capture distributional discrepancies in many real-world applications and often yield more effective robust randomized policies. In this paper, we study the empirical value iteration algorithm for divergence-based S-rectangular DR-RL and establish near-optimal sample complexity bounds of $\widetilde{O}(|\mathcal{S}||\mathcal{A}|(1-\gamma)^{-4}\varepsilon^{-2})$, where $\varepsilon$ is the target accuracy, $|\mathcal{S}|$ and $|\mathcal{A}|$ denote the cardinalities of the state and action spaces, and $\gamma$ is the discount factor. To the best of our knowledge, these are the first sample complexity results for divergence-based S-rectangular models that achieve optimal dependence on $|\mathcal{S}|$, $|\mathcal{A}|$, and $\varepsilon$ simultaneously. We further validate this theoretical dependence through numerical experiments on a robust inventory control problem and a theoretical worst-case example, demonstrating the fast learning performance of our proposed algorithm.


189
Beyond Johnson-Lindenstrauss: Uniform Bounds for Sketched Bilinear Forms

Rohan Deb ⋅ Qiaobo Li ⋅ Mayank Shrivastava ⋅ Arindam Banerjee

Uniform bounds on sketched inner products underpin several important computational and statistical results in machine learning and randomized algorithms, including the Johnson-Lindenstrauss (J-L) lemma, the Restricted Isometry Property (RIP), randomized sketching, etc. However, many modern analyses involve *sketched bilinear forms*, for which existing uniform bounds either do not apply or are not sharp on general sets. In this work, we develop a general framework to analyze such sketched bilinear forms, and derive uniform bounds in terms of geometric complexities of the associated sets. Our approach relies on *generic chaining* and introduces new techniques for handling suprema over pairs of sets. We further extend our results to (i) sketch matrices with conditionally independent entries, e.g., as in CountSketch and SRHT (Subsampled Randomized Hadamard Transform), and (ii) bilinear form involving a sum of $T$ sketch matrices, showing the deviation scales as $\sqrt{T}$. This unified analysis recovers known results such as the J-L lemma as special cases, while extending RIP guarantees. Using our new bounds, we give tighter convergence bounds for sketched federated learning, and develop sketched bandits whose regret depends on the geometric complexity of the action and parameter sets rather than the ambient dimension.

Stochastic Gradient Descent (SGD) is a cornerstone of machine learning, prized for its efficiency in large-scale optimization. This paper revisits SGD by introducing a general weighted averaging framework that significantly enhances its applicability. We establish asymptotic normality for a wide range of weighted averaged SGD solutions under minimal assumptions, providing a groundbreaking necessary condition for the central limit theorem in certain settings. This enables asymptotically valid online inference, empowering real-time confidence interval construction. Furthermore, we propose an adaptive averaging scheme, inspired by optimal weights for linear models, which achieves optimal superior non-asymptotic bounds. Our theoretical advances and empirical validations redefine SGD’s capabilities, offering transformative insights for statistical learning and optimization.


190
High Effort, Low Gain: Fundamental Limits of Active Learning for Linear Dynamical Systems

Nicolas Chatzikiriakos ⋅ Kevin Jamieson ⋅ Andrea Iannelli

We consider the problem of identifying an unknown linear dynamical system from a finite hypothesis class. In particular, we analyze the effect of the excitation input on the sample complexity of identifying the true system with high probability. To this end, we present sample complexity lower bounds that capture the choice of the selected excitation input. The sample complexity lower bound gives rise to a system-theoretic condition to determine the potential benefit of experiment design. Informed by the analysis of the sample complexity lower bound, we propose a persistency of excitation (PE) condition tailored to the considered setting, which we then use to establish sample complexity upper bounds. Notably, the PE condition is weaker than in the case of an infinite hypothesis class and allows analyzing different excitation inputs modularly. Crucially, the lower and upper bounds share the same dependency on key problem parameters. Finally, we leverage these insights to propose an active learning algorithm that sequentially excites the system optimally with respect to the current estimate, and provide sample complexity guarantees for the presented algorithm. Concluding simulations showcase the effectiveness of the proposed algorithm.


193
Distributed estimation and inference for semiparametric binary response models

Xi Chen, Wenbo Jing, Weidong Liu, Yichen Zhang

The development of modern technology has enabled data collection of unprecedented size, which poses new challenges to many statistical estimation and inference problems. This paper studies the maximum score estimator of a semi-parametric binary choice model under a distributed computing environment without pre-specifying the noise distribution. An intuitive divide-and-conquer estimator is computationally expensive and restricted by a non-regular constraint on the number of machines, due to the highly non-smooth nature of the objective function. We propose (1) a one-shot divide-and-conquer estimator after smoothing the objective to relax the constraint, and (2) a multi-round estimator to completely remove the constraint via iterative smoothing. We specify an adaptive choice of kernel smoother with a sequentially shrinking bandwidth to achieve the superlinear improvement of the optimization error over the multiple iterations. The improved statistical accuracy per iteration is derived, and a quadratic convergence up to the optimal statistical error rate is established. We further provide two generalizations to handle the heterogeneity of datasets and high-dimensional problems where the parameter of interest is sparse.


195
Conformal Prediction in Hierarchical Classification with Constrained Representation Complexity

Thomas Mortier ⋅ Alireza Javanmardi ⋅ Yusuf Sale ⋅ Eyke Hüllermeier ⋅ Willem Waegeman

Conformal prediction has emerged as a widely used framework for constructing valid prediction sets in classification and regression tasks. In this work, we extend the split conformal prediction framework to hierarchical classification, where prediction sets are commonly restricted to internal nodes of a predefined hierarchy, and propose two computationally efficient inference algorithms. The first algorithm returns internal nodes as prediction sets, while the second one relaxes this restriction. Using the notion of representation complexity, the latter yields smaller set sizes at the cost of a more general and combinatorial inference problem. Empirical evaluations on several benchmark datasets demonstrate the effectiveness of the proposed algorithms in achieving nominal coverage.


196
Consistent PCA and Spectral Clustering

Satoshi Hara ⋅ Yuichi Yoshida

Principal component analysis (PCA) and spectral clustering are representative methods for extracting and interpreting the inherent structure of data. However, if the output results significantly change upon the addition of new data points, it can lead to several issues such as instability in the downstream task or a lack of trust in the findings. To address these problems, we consider online variants of PCA and spectral clustering, and show that a natural subspace-preserving regularizer provides provable approximation and consistency guarantees. Here, an algorithm is said to have a high consistency if the output change, with respect to an appropriate distance metric, is small when new data points are added. We empirically confirm the superiority of the proposed methods using real-world data.

An important question in deep learning is how higher-order optimization methods affect generalization. In this work, we analyze a stochastic Gauss-Newton (SGN) method with Levenberg-Marquardt damping and mini-batch sampling for training overparameterized deep neural networks with smooth activations in a regression setting. Our theoretical contributions are twofold. First, we establish finite-time optimization bounds via a variable-metric analysis in parameter space, with explicit dependencies on the batch size, network width and depth. Second, we derive non-asymptotic generalization bounds for SGN using algorithmic stability in the overparameterized regime, characterizing the impact of curvature, batch size, and overparameterization on generalization performance. Our theoretical results identify a favorable generalization regime for SGN in which a larger minimum eigenvalue of the Gauss-Newton matrix along the optimization path, together with smaller batch sizes, yields tighter stability bounds.


198
On the hardness of Reinforcement Learning with Transition Lookahead

Corentin Pla ⋅ Hugo Richard ⋅ Marc Abeille ⋅ Nadav Merlis ⋅ Vianney Perchet

We study reinforcement learning (RL) with transition look-ahead, where the agent may observe which states would be visited upon playing any sequence of $\ell$ actions before deciding its course of action. While such predictive information can drastically improve the achievable performance, we show that using this information optimally comes at a potentially prohibitive computational cost. Specifically, we prove that optimal planning with one-step look-ahead ($\ell=1$) can be solved in polynomial time through a novel linear programming formulation. In contrast, for $\ell \geq 2$, the problem becomes NP-hard. Our results delineate a precise boundary between tractable and intractable cases for the problem of planning with transition look-ahead in reinforcement learning.


199
Towards Motion-aware Referring Image Segmentation

Chaeyun Kim ⋅ Seunghoon Yi ⋅ Yejin Kim ⋅ Yohan Jo ⋅ Joonseok Lee

Referring Image Segmentation (RIS) requires identifying objects from images based on textual descriptions. We observe that existing methods significantly underperform on motion-related queries compared to appearance-based ones, and propose to address this from both data and algorithmic perspectives. First, we introduce an efficient data augmentation scheme that extracts motion-centric phrases from original captions, exposing models to more motion expressions without additional annotations. Second, since the same object can be described differently depending on the context, we propose Multimodal Radial Contrastive Learning (MRaCL), performed on fused image-text embeddings rather than unimodal representations. For comprehensive evaluation, we introduce a new test split focusing on motion-centric queries, and introduce a new benchmark called M-Bench, where objects are distinguished primarily by actions. Extensive experiments show our method substantially improves performance on motion-centric queries across multiple RIS models, maintaining competitive results on appearance-based descriptions.


2
Scalable Learning of Multivariate Distributions via Coresets

Zeyu Ding ⋅ Katja Ickstadt ⋅ Nadja Klein ⋅ Alexander Munteanu ⋅ Simon Omlor

Efficient and scalable non-parametric or semi-parametric regression analysis and density estimation are of crucial importance to the fields of statistics and machine learning. However, available methods are limited in their ability to handle large-scale data. We address this issue by developing a novel coreset construction for multivariate conditional transformation models (MCTMs) to enhance their scalability and training efficiency. To the best of our knowledge, these are the first coresets for semi-parametric distributional models. Our approach yields substantial data reduction via importance sampling. It ensures with high probability that the log-likelihood remains within multiplicative error bounds of $(1\pm\varepsilon)$ and thereby maintains statistical model accuracy. Compared to conventional full-parametric models, where coresets have been incorporated before, our semi-parametric approach exhibits enhanced adaptability, particularly in scenarios where complex distributions and non-linear relationships are present, but not fully understood. To address numerical problems associated with normalizing logarithmic terms, we follow a geometric approximation based on the convex hull of input data. This ensures feasible, stable, and accurate inference in scenarios involving large amounts of data. Numerical experiments demonstrate substantially improved computational efficiency when handling large and complex datasets, thus laying the foundation for a broad range of applications within the statistics and machine learning communities.


20
Boltzmann Exploration for Heavy-Tailed Bandits

Hyeon-jun Park ⋅ Yoon-Sik Cho ⋅ Kyungjae Lee

We study the stochastic multi-armed bandit problem with heavy-tailed rewards, assuming only that each arm's reward distribution has a finite $p$-th moment for $p\in(1,2]$. Although prior work has proposed algorithms that are robust to heavy-tailed rewards, these methods do not admit closed-form action-selection probabilities. This hinders efficient offline evaluation and can introduce bias in inverse propensity weighting (IPW) estimators. We propose heavy Boltzmann exploration (H-BE), a Boltzmann-style randomized policy whose action-selection probabilities remain available in closed form under heavy-tailed noise. Theoretically, we show that H-BE achieves the minimax-optimal gap-independent regret bound $O(\nu^{\frac{1}{p}} K^{1-\frac{1}{p}} T^{\frac{1}{p}})$. It also attains the gap-dependent regret bound $O(\sum_{i:\Delta_i>0}{\log(T \Delta_i^{\frac{p}{p-1}}/K)}/{\Delta_i^{\frac{1}{p-1}}})$, where $\nu$ bounds the $p$-th moment, $K$ is the number of arms, $T$ is the horizon, and $\Delta_i$ is the suboptimality gap of arm $i$. Empirically, H-BE attains competitive cumulative regret relative to state-of-the-art baselines, while its explicit propensities enable more stable and efficient offline evaluation.


200
Scalable Spatiotemporal Inference with Biased Scan Attention Transformer Neural Processes

Daniel Jenson ⋅ Jhonathan Navott ⋅ Piotr Grynfelder ⋅ Mengyan Zhang ⋅ Makkunda Sharma ⋅ Elizaveta Semenova ⋅ Seth Flaxman

Neural Processes (NPs) are a rapidly evolving class of models designed to directly model the posterior predictive distribution of stochastic processes. While early architectures were developed primarily as a scalable alternative to Gaussian Processes (GPs), modern NPs tackle far more complex and data-hungry applications spanning geology, epidemiology, climate, and robotics. These applications have placed increasing pressure on the scalability of these models, with many architectures compromising accuracy for scalability. In this paper, we demonstrate that this trade-off is often unnecessary, particularly when modeling fully or partially translation-invariant processes. We propose a versatile new architecture, the Biased Scan Attention Transformer Neural Process (BSA-TNP), which introduces Kernel Regression Blocks (KRBlocks), group-invariant attention biases, and memory-efficient Biased Scan Attention (BSA). BSA-TNP is able to: (1) match or exceed the accuracy of the best models while often training in a fraction of the time, (2) exhibit translation invariance, enabling learning at multiple resolutions simultaneously, (3) transparently model processes that evolve in both space and time, (4) support high-dimensional fixed effects, and (5) scale gracefully, running inference on over 1M test points and 100K context points in under a minute on a single 24GB GPU. Code is provided as part of the \href{https://github.com/MLGlobalHealth/dl4bi}{\texttt{dl4bi}} package.


21
The Majority Vote Paradigm Shift: When Popular Meets Optimal

Antonio Purificato ⋅ Maria Sofia Bucarelli ⋅ Anil Nelakanti ⋅ Andrea Bacciu ⋅ Fabrizio Silvestri ⋅ Amin Mantrach

Reliably labelling data typically requires annotations from multiple human workers. However, humans are far from being perfect. Hence, it is a common practice to aggregate labels gathered from multiple annotators to make a more confident estimate of the true label. Among many aggregation methods, the simple and well-known Majority Vote (MV) selects the class label polling the highest number of votes. However, despite its importance, the optimality of MV’s label aggregation has not been extensively studied. We address this gap in our work by characterising the conditions under which MV achieves the theoretically optimal lower bound on label estimation error. Our results capture the tolerable limits on annotation noise under which MV can optimally recover labels for a given class distribution. This certificate of optimality provides a more principled approach to model selection for label aggregation as an alternative to otherwise inefficient practices that sometimes include higher experts, gold labels, etc., that are all marred by the same human uncertainty despite huge time and monetary costs. Experiments on both synthetic and real-world data corroborate our theoretical findings.


22
High-dimensional Learning with Noisy Labels

Aymane El Firdoussi ⋅ Mohamed El Amine Seddik

This paper provides theoretical insights into high-dimensional binary classification with class-conditional noisy labels. Specifically, we study the behavior of a linear classifier with a label noisiness aware loss function, when both the dimension of data $p$ and the sample size $n$ are large and comparable. Relying on random matrix theory by supposing a Gaussian mixture data model, the performance of the linear classifier when $p,n\to \infty$ is shown to converge towards a limit, involving scalar statistics of the data. Importantly, our findings show that the low-dimensional intuitions to handle label noise do not hold in high-dimension, in the sense that the optimal classifier in low-dimension dramatically fails in high-dimension. Based on our derivations, we design an optimized method that is shown to be provably more efficient in handling noisy labels in high dimensions. Our theoretical conclusions are further confirmed by experiments on real datasets, where we show that our optimized approach outperforms the considered baselines.


23
Split-Flows: Measure Transport and Information Loss Across Molecular Resolutions

Sander Hummerich ⋅ Ullrich Köthe ⋅ Tristan Bereau

By reducing resolution, coarse-grained models greatly accelerate molecular simulations, unlocking access to long-timescale phenomena, though at the expense of microscopic information. Recovering this fine-grained detail is essential for tasks that depend on atomistic accuracy, making backmapping a central challenge in molecular modeling. We introduce split-flows, a novel flow-based approach that reinterprets backmapping as a continuous-time measure transport across resolutions. Unlike existing generative strategies, split-flows establish a direct probabilistic link between resolutions, enabling expressive conditional sampling of atomistic structures and—for the first time—a tractable route to computing mapping entropies, an information-theoretic measure of the irreducible detail lost in coarse-graining. We demonstrate these capabilities on diverse molecular systems, including Chignolin, a lipid bilayer, and alanine dipeptide, highlighting split-flows as a principled framework for accurate backmapping and systematic evaluation of coarse-grained models.


24
FocusViT: Faithful Explanations for Vision Transformers via Gradient-Guided Layer-Skipping

Mohsin Ali ⋅ Haider Raza ⋅ John Gan ⋅ Muhammad Haris Khan

Vision Transformers (ViTs) have emerged as powerful alternatives to CNNs for various vision tasks, yet their token-based, attention-driven architecture makes interpreting their predictions challenging. Existing explainability methods, such as Grad-CAM and Attention Rollout, either fail to capture hierarchical semantic information or assume attention directly reflects importance, often leading to misleading explanations. We propose FocusViT, a novel explainability framework that integrates gradient-weighted attention attribution with validation-based, faithfulness-driven layer aggregation. By fusing attention maps with class-specific gradients and introducing per-head dynamic weighting, FocusViT highlights not only where the model attends but also how sensitive the prediction is to those attentions. Furthermore, our adaptive layer-skipping strategy ensures that only semantically meaningful layers contribute to the final explanation, enhancing both faithfulness and clarity. Extensive quantitative and qualitative evaluations on diverse benchmarks demonstrate that FocusViT improves over existing methods in faithfulness and sparsity, achieving competitive robustness and class sensitivity, and provides sharper, more reliable visual explanations for ViTs. The official implementation is publicly available at: https://github.com/game-sys/focusvit-aistats2026.git


25
Structural Alignment Improves Graph Test-Time Adaptation

Hans Hao-Hsun Hsu ⋅ Shikun Liu ⋅ Han Zhao ⋅ Pan Li

Graph-based learning excels at capturing interaction patterns in diverse domains like recommendation, fraud detection, and particle physics. However, its performance often degrades under distribution shifts, especially those altering network connectivity. Current methods to address these shifts typically require retraining with the source dataset, which is often infeasible due to computational or privacy limitations. We introduce Test-Time Structural Alignment (TSA), a novel algorithm for Graph Test-Time Adaptation (GTTA) that adapts a pretrained model to align graph structures during inference without the cost of retraining. Grounded in a theoretical understanding of graph data distribution shifts, TSA employs three synergistic strategies: uncertainty-aware neighborhood weighting to accommodate neighbor label distribution shifts, adaptive balancing of self-node and aggregated neighborhood representations based on their signal-to-noise ratio, and decision boundary refinement to correct residual label and feature shifts. Extensive experiments on synthetic and real-world datasets demonstrate TSA's consistent outperformance of both non-graph TTA methods and state-of-the-art GTTA baselines.

Transformer-based models, despite their promise for long-term time series forecasting (LTSF), suffer from an inherent low-pass filtering effect that limits their effectiveness. This issue arises due to undifferentiated propagation of frequency components across layers, causing a progressive attenuation of high-frequency information crucial for capturing fine-grained temporal variations. To address this limitation, we propose Dualformer, a principled dual-domain framework that rethinks frequency modeling from a layer-wise perspective. Dualformer introduces three key components: (1) a dual-branch architecture that concurrently models complementary temporal patterns in both time and frequency domains; (2) a hierarchical frequency sampling module that allocates distinct frequency bands to different layers, preserving high-frequency details in lower layers while modeling low-frequency trends in deeper layers; and (3) a periodicity-aware weighting mechanism that dynamically balances contributions from the dual branches based on the harmonic energy ratio of inputs, supported theoretically by a derived lower bound. This design enables structured frequency modeling and adaptive integration of time-frequency features, effectively preserving high-frequency information and enhancing generalization. Extensive experiments conducted on eight widely used benchmarks demonstrate Dualformer’s robustness and superior performance, particularly on heterogeneous or weakly periodic data.


27
Weighted quantization using MMD: From mean field to mean shift using gradient flows

Ayoub Belhadji ⋅ Daniel Sharp ⋅ Youssef Marzouk

Approximating a probability distribution using a set of particles is a fundamental problem in machine learning and statistics, with applications including clustering and quantization. Formally, we seek a weighted mixture of Dirac measures that best approximates the target distribution. While much existing work relies on the Wasserstein distance to quantify approximation errors, maximum mean discrepancy (MMD) has received comparatively less attention, especially when allowing for variable particle weights. We argue that a Wasserstein--Fisher--Rao gradient flow is well-suited for designing quantizations optimal under MMD. We show that a system of interacting particles satisfying a set of ODEs discretizes this flow. We further derive a new fixed-point algorithm called mean shift interacting particles (MSIP). We show that MSIP extends the classical mean shift algorithm, widely used for identifying modes in kernel density estimators. Moreover, we show that MSIP can be interpreted as preconditioned gradient descent and that it acts as a relaxation of Lloyd's algorithm for clustering. Our unification of gradient flows, mean shift, and MMD-optimal quantization yields algorithms that are more robust than state-of-the-art methods, as demonstrated via high-dimensional and multi-modal numerical experiments.


28
In-memory Training on Analog Devices with Limited Conductance States via Multi-tile Residual Learning

Jindan Li ⋅ Zhaoxian Wu ⋅ Gaowen Liu ⋅ Tayfun Gokmen ⋅ Tianyi Chen

Analog in-memory computing (AIMC) accelerators enable efficient deep neural network computation directly within memory using resistive crossbar arrays, where model parameters are represented by the conductance states of memristive devices. However, effective in-memory training typically requires at least 8-bit conductance states to match digital baselines. Realizing such fine-grained states is costly and often requires complex noise mitigation techniques that increase circuit complexity and energy consumption. In practice, many promising memristive devices such as ReRAM offer only about 4-bit resolution due to fabrication constraints, and this limited update precision substantially degrades training accuracy. To enable on-chip training with these limited-state devices, this paper proposes a \emph{multi-tile residual learning} framework that sequentially learns on multiple crossbar tiles to compensate the residual errors from low-precision weight updates. Our theoretical analysis shows that the optimality gap shrinks with the number of tiles and achieves a linear convergence rate. Experiments on standard image classification benchmarks demonstrate that our method consistently outperforms state-of-the-art in-situ analog training strategies under limited-state settings, while incurring only moderate hardware overhead as confirmed by our cost analysis.


29
Preference-based Conditional Treatment Effects and Policy Learning

Dovid Parnas ⋅ Mathieu Even ⋅ julie Josse ⋅ Uri Shalit

This paper introduces a preference-based framework for conditional treatment effect estimation and policy learning, by defining the Conditional Preference-based Treatment Effect (CPTE). CPTE only requires that different outcomes can be compared, as opposed to traditional conditional average treatment effects (CATE) that quantify the conditional magnitude of an intervention on a real-valued outcome. This allows for flexible modeling of heterogeneous effects even when outcomes are multivariate, ordinal, or preference-driven. We show how CPTE enables interpretable estimands, and propose estimation strategies based on matching, quantile regression, and distributional methods. Building on this, we develop efficient influence function–based estimators to correct plug-in biases and optimize policy values. Through synthetic and semi-synthetic experiments, we demonstrate that our approach produces personalized policies that are robust to outcome heterogeneity and align with ground-truth preferences, improving decision quality over CATE-based baselines.

Standard differential privacy imposes uniform privacy constraints across all features, overlooking the inherent distinction between sensitive and insensitive features in practice. In this paper, we introduce a relaxed definition of differential privacy that accounts for such heterogeneity, allowing certain features to be treated as insensitive even when correlated with sensitive ones. We propose a correlation-aware framework, CorrDP, which relaxes privacy for insensitive features while accounting for their correlations with sensitive features, with the correlations quantified using total variation distance. We design algorithms for differentially private empirical risk minimization (DP-ERM) under the CorrDP framework, incorporating distance-dependent noise into gradients for improved theoretical utility guarantees. When the correlation distance is unknown, we estimate it from the dataset and show that it achieves a comparable privacy-utility guarantee. We perform experiments on synthetic and real-world datasets and show that CorrDP-based DP-ERM algorithms consistently outperform the standard DP framework.


30
On the Expressivity of Selective State-Space Layers: A Multivariate Polynomial Approach

Edo Cohen-Karlik ⋅ Itamar Zimerman ⋅ Liane Galanti ⋅ Ido Atad ⋅ Amir Globerson ⋅ Lior Wolf

Recent advances in efficient sequence modeling have introduced selective state-space layers, a key component of the Mamba architecture, which have demonstrated remarkable success in a wide range of NLP and vision tasks. While Mamba’s empirical performance has matched or surpassed SoTA transformers on such diverse benchmarks, the theoretical foundations underlying its powerful representational capabilities remain less explored. In this work, we investigate the expressivity of selective state-space layers using multivariate polynomials, and prove that they surpass linear transformers in expressiveness. Consequently, our findings reveal that Mamba offers superior representational power over linear attention-based models for long-sequences, while not sacrificing their generalization. Our theoretical insights are validated by a comprehensive set of empirical experiments on various datasets.


31
Fair Clustering via Hierarchical Fair-Dirichlet Prior

Abhisek Chakraborty ⋅ Anirban Bhattacharya ⋅ Debdeep Pati

The advent of ML-driven decision-making has led to an increasing focus on algorithmic fairness. The widespread utility of clustering has naturally prompted proliferation of literature on fair clustering. A popular notion of fairness in clustering mandates the clusters to be balanced, i.e., each level of a protected attribute must be approximately equally represented in each cluster. In this article, we offer a novel model-based formulation of fair clustering, complementing the existing literature which is almost exclusively based on optimizing appropriate objective functions. We first rigorously define a notion of fair clustering in the population level and develop a Bayesian methodology equipped with a novel hierarchical prior specification that targets the population level objective by enforcing the notion of balance in the resulting clusters. In addition, we devise a scheme for principled performance evaluation of competing algorithms leveraging on a concrete notion of optimal recovery. An efficient collapsed Gibbs sampler is developed to sample from the posterior by integrating a novel scheme for non-uniform sampling from the space of binary matrices with fixed margin with a proposal guided by optimal transport. Superior empirical performance of the proposed methodology, compared to the state-of-the-art, is demonstrated across numerical experiments, benchmark data-sets, and gender-neutral fair clustering in distress analysis interview corpus.


32
Incorporating Expert Knowledge into Bayesian Causal Discovery of Mixtures of Directed Acyclic Graphs

Zachris Björkman ⋅ Jorge Loria ⋅ Sophie Wharrie ⋅ Samuel Kaski

Bayesian causal discovery benefits from prior information elicited from domain experts, and in heterogeneous domains any prior knowledge would be badly needed. However, so far prior elicitation approaches have assumed a single causal graph and hence are not suited to heterogeneous domains. We propose a causal elicitation strategy for heterogeneous settings, based on Bayesian experimental design (BED) principles, and a variational mixture structure learning (VaMSL) method—extending the earlier differentiable Bayesian structure learning (DiBS) method—to iteratively infer mixtures of causal Bayesian networks (CBNs). We construct an informative graph prior incorporating elicited expert feedback in the inference of mixtures of CBNs. Our proposed method successfully produces a set of alternative causal models (mixture components or clusters), and achieves an improved structure learning performance on heterogeneous synthetic data when informed by a simulated expert. Finally, we demonstrate that our approach is capable of capturing complex distributions in a breast cancer database.


33
Stochastic Bandits on Mixture Distributions: Metrics & Regret Bounds

Adit Jain ⋅ Sujay Bhatt ⋅ Alec Koppel

Multimodal reward distributions naturally arise in real-world applications such as targeted recommendations to heterogeneous sub-populations and selective unit-level interventions. These settings challenge standard mean or risk-based bandit approaches, requiring metrics that quantify the merit of mixture parameters without prior mode knowledge. We consider the bandit setting where the reward associated with an arm is sampled from a finite mixture of Gaussians, which is strictly more general than the unimodal setting. We consider ranking arms using functions of the mixture parameters and propose methods to minimize the cumulative regret with respect to the induced ranking. We show that the achievable pseudo-regret has a lower bound of the order $\Omega(\mathsf{T}^{1/2})$ and propose an explore and exploit based on expectation maximization (ETE-EM) algorithm which achieves a regret of $\widetilde{\mathsf{O}}(\mathsf{T}^{2/3})$. Further, we show that the modification of Thompson sampling (TS-EM) achieves a Bayes regret of $\widetilde{\mathsf{O}}(\mathsf{T}^{1/2})$. Experiments validate our approach in practice, where we benchmark against both algorithms designed for sub-Gaussian bandits and naive clustering-based extensions of empirical CDF methods, showing our approach achieves consistently lower regret across choice of metrics.


34
Differentially Private Clustering in Data Streams

Alessandro Epasto ⋅ Tamalika Mukherjee ⋅ Peilin Zhong

Clustering tasks such as k-means and k-median are central in unsupervised learning, and streaming algorithms for these tasks are widely used to handle large or evolving datasets. When applied in sensitive domains, however, such algorithms must also provide rigorous privacy guarantees. In this work, we provide the first differentially private (DP) algorithms for k-means and k-median clustering of d-dimensional Euclidean data points over a stream of length at most T, using space that is sublinear in T, in the continual release setting where the algorithm is required to output a clustering at every timestep. We achieve (1) an O(1)-multiplicative approximation with ~O(k^{1.5} poly(d, log T)) space and poly(k,d,log T) additive error, or (2) a (1+gamma)-multiplicative approximation with ~Ogamma(poly(k, 2^{Ogamma(d)}, log T)) space for any gamma>0, with additive error poly(k, 2^{O_gamma(d)}, log T). Our main technical contribution is a DP clustering framework for data streams that only requires an offline DP coreset or clustering algorithm as a blackbox.


35
Efficient Learning of Stationary Diffusions with Stein-type Discrepancies

Fabian Bleile ⋅ Sarah Lumpp ⋅ Mathias Drton

Learning a stationary diffusion amounts to estimating the parameters of a stochastic differential equation whose stationary distribution matches a target distribution. We build on the recently introduced kernel deviation from stationarity (KDS), which enforces stationarity by evaluating expectations of the diffusion's generator in a reproducing kernel Hilbert space. Leveraging the connection between KDS and Stein discrepancies, we introduce the Stein-type KDS (SKDS) as an alternative formulation. We prove that a vanishing SKDS guarantees alignment of the learned diffusion’s stationary distribution with the target. Furthermore, under broad parametrizations, SKDS is convex with an empirical version that is $\epsilon$-quasiconvex with high probability. Empirically, learning with SKDS attains comparable accuracy to KDS while substantially reducing computational cost, and yields improvements over the majority of competitive baselines.

Continual learning (CL) aims to train models on a sequence of tasks while retaining performance on previously learned ones. A core challenge in this setting is \textit{catastrophic forgetting}, where new learning interferes with past knowledge. Among various mitigation strategies, data-replay methods—where past samples are periodically revisited—are considered simple yet effective, especially when memory constraints are relaxed. However, the theoretical effectiveness of full data replay, where all past data is accessible during training, remains largely unexplored. In this paper, we present the first theoretical framework for analyzing full data-replay training in continual learning from a feature learning perspective. Adopting a multi-view data model, we identify the signal-to-noise ratio (SNR) as a critical factor affecting forgetting. Focusing on task-incremental binary classification across $M$ tasks, our analysis verifies two key conclusions: (1) forgetting can still occur under full replay when the cumulative noise from later tasks dominates the signal from earlier ones; and (2) with sufficient signal accumulation, data replay can recover earlier tasks-even if their initial learning was poor. Notably, we uncover a novel insight into task ordering: prioritizing higher-signal tasks not only facilitates learning of lower-signal tasks but also helps prevent catastrophic forgetting—highlighting the importance of order-aware replay strategies. We validate our theoretical findings through synthetic experiments that visualize the interplay between signal learning and noise memorization across varying SNRs and task correlation regimes.


37
On the Latent Information Geometry of the Grassmann Manifold

Lorenzo Cazzella ⋅ Soren Hauberg ⋅ Georgios Arvanitidis ⋅ Matteo Matteucci

Modeling linear subspaces and relations among them naturally arises in several applications in signal processing, computer vision, and system identification. In this paper, we investigate the latent information geometry of deep generative models that output linear subspaces. Such subspaces are members of the Grassmann manifold, which we model with a matrix Bingham distribution as a likelihood. We derive the Fisher-Rao metric on the statistical manifold of the matrix Bingham parameters, and propose pulling this back to the latent space to achieve uncertainty-aware and identifiable latent representations. We provide numerical results assessing the meaningfulness of the achieved latent subspace representations on a relevant vehicular wireless communications scenario.


38
FedCCA: Federated Canonical Correlation Analysis

Zhengquan Luo ⋅ Kai Fong Ernest Chong ⋅ Pengfei Wei ⋅ Changyou Chen ⋅ Peilin Zhao ⋅ Renmin Han ⋅ Chunlai Zhou ⋅ Yunlong Wang ⋅ Zhiqiang Xu

Canonical Correlation Analysis (CCA) is a key tool for cross-modal learning, but centralized solutions are impractical due to the heavy cost of high-dimensional covariance operations and the privacy sensitivity of distributed data. To address these challenges, we propose FedCCA, a federated framework that replaces explicit inverses and inner least-squares solves with a truncated von Neumann series, reducing matrix inversions to lightweight matrix–vector multiplications while retaining provable convergence. This series formulation not only improves efficiency, but also provides explicit and tunable control of truncation error, and its structure naturally splits into client-side multiplications and a server-side projection step, making it particularly suitable for federated deployment. Building on this foundation, we incorporate Gaussian differential privacy and derive practical upper and lower bounds on the required noise variance, which yield end-to-end $(\varepsilon,\delta)$ guarantees together with convergence stability. Empirical results on five datasets confirm that FedCCA achieves accuracy comparable to centralized CCA and consistently outperforms ALS/TALS baselines in both sub-optimality gap and convergence speed, all while maintaining rigorous privacy protection.


39
Causal Additive Models with Unobserved Causal Paths and Backdoor Paths

Thong Pham ⋅ Takashi Nicholas Maeda ⋅ Shohei Shimizu

Causal additive models provide a tractable yet expressive framework for causal discovery in the presence of hidden variables. When unobserved backdoor or causal paths exist between two variables, their causal relationship is often unidentifiable under existing theories. We establish sufficient conditions under which causal directions can be identified in many such cases. These conditions rely on new characterizations of regression sets to determine independence among regression residuals and conditional independencies among observed variables. Building on these results, we introduce a search algorithm that incorporates these innovations and prove its soundness and completeness. Empirical evaluations demonstrate its competitive performance against state-of-the-art methods.


4
Welfare-Centric Clustering

Claire Jie Zhang ⋅ Seyed Esmaeili ⋅ Jamie Morgenstern

Fair clustering has traditionally focused on ensuring equitable group representation or equalizing group-specific clustering costs. However, \citet{dickerson2024fair} recently showed that these fairness notions may yield undesirable or unintuitive clustering outcomes and advocated for a welfare-centric clustering approach that models the utilities of the groups. In this work, we model group utilities based on both distances and proportional representation and formalize two optimization objectives based on welfare-centric clustering: the Rawlsian (Egalitarian) objective and the Utilitarian objective. We introduce novel algorithms for both objectives and prove theoretical guarantees for them. Empirical evaluations on multiple real-world datasets demonstrate that our methods significantly outperform existing fair clustering baselines.


40
TAS-EGNN: Task-Aware Spectral Ego-Graphs for Efficient GNNs-Based Classification

Mebarka Allaoui ⋅ Rachid Hedjam ⋅ Sonia Gupta

Graph Neural Networks (GNNs) achieve strong accuracy but remain costly to train on large graphs and in resource-constrained settings. Coreset selection mitigates this by training on a compact, representative node subset, yet many existing methods rely on expensive spectral routines or bilevel and iterative optimizations. We propose a Task-Aware Spectral Ego-Graph Neural Network (TAS-EGNN) that scores nodes within lightweight ego-graphs by combining (i) local spectral complexity, (ii) predictive uncertainty, and (iii) supervised error signals, followed by a greedy coverage step to avoid redundancy. TAS-EGNN circumvents heavy optimization, using only local spectra (or moment proxies) and a single model forward pass to obtain task signals. We evaluate TAS-EGNN across three benchmark tasks: citation networks, social networks, and graph-based bank transaction fraud detection. The third task, in particular, underscores the algorithm’s effectiveness in anomaly detection for highly imbalanced settings. TAS-EGNN matches or surpasses state-of-the-art reduction baselines, across \emph{budgets} (i.e., the allowed size of the selected training subset, controlled via the coreset ratio), including condensation, coarsening, and ego-graph selection, while delivering substantial wall-clock and peak-memory savings. Time and memory profiling show that TAS-EGNN tracks the lower envelope among structure-aware methods and scales to large graphs, whereas several other works reach OOT/OOM. These results indicate that efficiently encoded task-aware structural priors enable robust, scalable coreset selection for both standard node classification and fraud detection. The source code will be available on GitHub.

Understanding causal relations between temporal variables is a central challenge in time series analysis, particularly when the full causal structure is unknown. Even when the full causal structure cannot be fully specified, experts often succeed in providing a high-level abstraction of the causal graph, known as a summary causal graph, which captures the main causal relations between different time series while abstracting away micro-level details. In this work, we present conditions that guarantee the orientability of micro-level edges between temporal variables given the background knowledge encoded in a summary causal graph and assuming having access to a faithful and causally sufficient distribution with respect to the true unknown graph. Our results provide theoretical guarantees for edge orientation at the micro-level, even in the presence of cycles or bidirected edges at the macro-level. These findings offer practical guidance for leveraging SCGs to inform causal discovery in complex temporal systems and highlight the value of incorporating expert knowledge to improve causal inference from observational time series data.


42
ADOPT: Additive Optimal Transport Regression

Wookyeong Song ⋅ Hans-Georg Müller

Regression models for responses $Y$ taking values in general metric spaces $(\mathcal{M}, d)$, with Euclidean predictors $X \in \mathbb{R}^p,$ has attracted growing interest in recent years. While additive regression is a powerful tool for enhancing interpretability and mitigating the curse of dimensionality in the presence of multivariate predictors, its direct extension is hindered by the absence of vector space operations in general metric spaces. We propose a novel framework for additive optimal transport regression, which incorporates additive structure through optimal geodesic transports. A key idea is to extend the notion of optimal transports in Wasserstein spaces to general geodesic metric spaces. This unified approach accommodates a wide range of responses, including probability distributions, symmetric positive definite (SPD) matrices with various metrics and spherical data. The practical utility of the method is illustrated with correlation matrices derived from resting state fMRI brain imaging data.


43
Noise-Free Dynamic Rank-Adaptation via Riemannian Methods in Federated Fine-Tuning

Zihan Zhou ⋅ Yang Zhou ⋅ Tianshi Che ⋅ Zeru Zhang ⋅ Jiaxiang Ren ⋅ Da Yan ⋅ Zhe Jiang ⋅ yelong shen ⋅ Ruoming Jin ⋅ Jianfeng Gao

Rank-adaptive low-rank adaptation (LoRA), a parameter-efficient fine-tuning (PEFT) technology, has achieved state-of-the-art performance in fine-tuning foundation models (FM). Directly transplanting the rank-adaptive LoRA methods from centralized learning to federated learning raises two critical issues: aggregation noise and rank drift. We presents Riemannian LoRA algorithm with adaptive rank for federated fine-tuning of foundation models (FFT-FM), RAFFT, which resolves both issues and significantly improves the computational cost. First, by utilizing Riemannian Procrustes analysis, we propose a Riemannian parameter matching method to avoid aggregation noise and ensure effective FFT-FM with rank-adaptive LoRA while cutting SVD cost by decomposing only low-dimensional $r \times r$ matrices, where $r$ is the rank parameter in the LoRA. We theoretically derive the equivalence between our RAFFT algorithm with rank-adaptive LoRA for the FFT-FM and the standard FFT-FM on the full parameter matrices based on FedAvg and verify the bounded error introduced by approximation. Second, by leveraging Riemannian manifold theory, we develop a Riemannian gradient descent (RGD) method to guarantee the local full parameter matrices on clients in the form of low-rank ones with fixed rank optimized by the server in each FFT-FM round, for alleviating the rank-drift issue to speed up the convergence of RAFFT. We theoretically demonstrate that the RGD optimization on the Riemannian manifold ensures the rank invariance during the local update process and the RGD optimization can converge in the FFT-FM context.

We study the $(\varepsilon, \delta)$-PAC policy identification problem in finite-horizon episodic Markov Decision Processes. Existing approaches provide finite-time guarantees for approximate settings ($\varepsilon>0$) but suffer from high computational cost, rendering them hard to implement, and also suffer from suboptimal dependence on $\log(1/\delta)$. We propose a randomized and computationally efficient algorithm for best policy identification that combines posterior sampling with an online learning algorithm to guide exploration in the MDP. Our method achieves asymptotic optimality in sample complexity, also in terms of posterior contraction rate, and runs in $O(S^2AH)$ per episode, matching standard model-based approaches. Unlike prior algorithms such as MOCA and PEDEL, our guarantees remain meaningful in the asymptotic regime and avoid sub-optimal polynomial dependence on $\log(1/\delta)$. Our results provide both theoretical insights and practical tools for efficient policy identification in tabular MDPs.

We study the problem of approximating an unnormalized target distribution using probability densities from an exponential family. Specifically, we establish convergence guarantees for a monotonic alpha-divergence minimization algorithm, which decreases the alpha-divergence at each iteration. To illustrate our theoretical results, we propose an implementable Sample Average Approximation algorithm that solves a discrete approximation of the original problem. Through a detailed analysis of the loss landscape and the algorithm's dynamics, we provide practical design guidelines which suffice to ensure its convergence.

We study the problem of generalizing an expert agent's behavior, provided through demonstrations, to new environments and/or additional constraints. Inverse Reinforcement Learning (IRL) offers a promising solution by seeking to recover the expert's underlying reward function, which, if used for planning in the new setting, would reproduce the desired behavior. However, IRL is inherently ill-posed: multiple reward functions, forming the so-called feasible set, can explain the same observed behavior. Since these rewards may induce different policies in the new setting, in the absence of additional information, a decision criterion is needed to select which policy to deploy. In this paper, we propose a novel, principled criterion that selects the "average" policy among those induced by the rewards in a certain bounded subset of the feasible set. Remarkably, we show that this policy can be obtained by planning with the reward centroid of that subset, for which we derive a closed-form expression. We then present a provably efficient algorithm for estimating this centroid using only an offline dataset of expert demonstrations. Finally, we conduct numerical simulations that illustrate the relationship between the expert's behavior and the behavior produced by our method.


47
On Barycenter Computation: Analyzing Semi-Unbalanced Optimal Transport-based Method on Bures-Wasserstein manifold.

Ngoc-Hai Nguyen ⋅ Le Dung ⋅ Hoang-Phi Nguyen ⋅ Tung Pham ⋅ Nhat Ho

We explore a robust version of the barycenter problem among $n$ centered Gaussian probability measures, termed Semi-Unbalanced Optimal Transport (SUOT)-based Barycenter, wherein the barycenter remains fixed while the others are relaxed using Kullback-Leibler divergence. We develop optimization algorithms on Bures-Wasserstein manifold, named the Exact Geodesic Gradient Descent and Hybrid Gradient Descent algorithms. While the Exact Geodesic Gradient Descent method is based on computing the exact closed form of the first-order derivative of the objective function of the barycenter along a geodesic on the Bures manifold, the Hybrid Gradient Descent method utilizes optimizer components when solving the SUOT problem to replace contaminated measures before applying the Riemannian Gradient Descent. We establish the theoretical convergence guarantees for both methods and demonstrate that the Exact Geodesic Gradient Descent algorithm attains a dimension-free convergence rate. This is a novel theoretical result for Riemannian Gradient Descent applicable to an expanded class of averaging functions.

Causal representation learning (CRL) aims to identify the underlying latent variables from high-dimensional observations, even when variables are dependent with each other. We study this problem for latent variables that follow a potentially degenerate Gaussian mixture distribution and that are only observed through the transformation via a piecewise affine mixing function. We provide a series of progressively stronger identifiability results for this challenging setting in which the probability density functions are ill-defined because of the potential degeneracy. For identifiability up to permutation and scaling, we leverage a sparsity regularization on the learned representation. Based on our theoretical results, we propose a two-stage method to estimate the latent variables by enforcing sparsity and Gaussianity in the learned representations. Experiments on synthetic and image data highlight our method’s effectiveness in recovering the ground-truth latent variables.

Few-shot learning requires models to generalize under limited supervision while remaining robust to distribution shifts. Existing Sinkhorn Distributionally Robust Optimization (DRO) methods provide theoretical guarantees but rely on a fixed reference distribution, which limits their adaptability. We propose a Prototype-Guided Distributionally Robust Optimization (PG-DRO) framework that learns class-adaptive priors from abundant base data via hierarchical optimal transport and embeds them into the Sinkhorn DRO formulation. This design enables few-shot information to be organically integrated into producing class-specific robust decisions that are both theoretically grounded and efficient, and further aligns the uncertainty set with transferable structural knowledge. Experiments show that PG-DRO achieves stronger robust generalization in few-shot scenarios, outperforming both standard learners and DRO baselines.

In recent years, deep neural networks have showcased their predictive power across a variety of tasks. Beyond natural language processing, the transformer architecture has proven efficient in addressing tabular data problems and challenges the previously dominant gradient-based decision trees in these areas. However, this predictive power comes at the cost of intelligibility: Marginal feature effects are almost completely lost in the black-box nature of deep tabular transformer networks. Alternative architectures that use the additivity constraints of classical statistical regression models can maintain intelligible marginal feature effects, but often fall short in predictive power compared to their more complex counterparts. To bridge the gap between intelligibility and performance, we propose an adaptation of tabular transformer networks designed to identify marginal feature effects. We provide theoretical justifications that marginal feature effects can be accurately identified, and our ablation study demonstrates that the proposed model efficiently detects these effects, even amidst complex feature interactions. To demonstrate the model's predictive capabilities, we compare it to several interpretable as well as black-box models and find that it can match black-box performances while maintaining intelligibility. The source code is available at https://github.com/OpenTabular/NAMpy.


50
Low-Rank Bias, Weight Decay, and Model Merging in Neural Networks

Ilja Kuzborskij ⋅ Yasin Abbasi-Yadkori

We explore the low-rank structure of the weight matrices in neural networks at the stationary points (limiting solutions of optimization algorithms) with $L2$ regularization (also known as weight decay). We show several properties of such deep neural networks, induced by $L2$ regularization. In particular, for a stationary point we show alignment of the parameters and the gradient, norm preservation across layers, and low-rank bias: properties previously known in the context of solutions of gradient descent/flow type algorithms. Experiments show that the assumptions made in the analysis only mildly affect the observations. In addition, we investigate a multitask learning phenomenon enabled by $L2$ regularization and low-rank bias. In particular, we show that if two networks are trained, such that the inputs in the training set of one network are approximately orthogonal to the inputs in the training set of the other network, the new network obtained by simply summing the weights of the two networks will perform as well on both training sets as the respective individual networks. We demonstrate this for shallow ReLU neural networks trained by gradient descent, as well as deep linear networks trained by gradient flow.

We study the decentralized multi-player multi-armed bandits (MMAB) problem under a no-sensing setting, where each player receives only their own reward and obtains no information about collisions. Each arm has an unknown capacity, and if the number of players pulling an arm exceeds its capacity, all players involved receive zero reward. This setting generalizes the classical unit-capacity model and introduces new challenges in coordination and capacity discovery under severe feedback limitations. We propose A-CAPELLA (Algorithm for Capacity-Aware Parallel Elimination for Learning and Allocation), a decentralized learning algorithm that achieves logarithmic regret in this generalized regime via protocol-driven coordination.

Stochastic natural gradient variational inference (NGVI) is a popular and efficient algorithm for Bayesian inference. Despite empirical success, the convergence of this method is still not fully understood. In this work, we define and study a projected stochastic NGVI when variational distributions form an exponential family. Stochasticity arises when either gradients are intractable expectations or large sums. We prove new non-asymptotic convergence results for combinations of constant or decreasing step sizes and constant or increasing sample/batch sizes. When all hyperparameters are fixed, NGVI is shown to converge geometrically to a neighborhood of the optimum, while we establish convergence to the optimum with rates of the form $\mathcal{O}\left(\frac{1}{T^{\rho}} \right)$, possibly with $\rho \geq 1$, for all other combinations of step size and sample/batch size schedules. These rates apply when the target posterior distribution is close in some sense to the considered exponential family. Our theoretical results extend existing NGVI and stochastic optimization results and provide more flexibility to adjust, in a principled way, step sizes and sample/batch sizes in order to meet speed, resources, or accuracy constraints.


53
Sparse Linear Bandits with Fixed Sparsity Support: Adversarial and Stochastic Regimes

Kyoungseok Jang ⋅ Nam Tran ⋅ Nicolò Cesa-Bianchi

We study sparse linear bandits in both adversarial and stochastic settings. While existing literature has extensively explored sparse linear bandits in the stochastic regime, the adversarial setting, particularly for general $l_p$-ball action sets $(p>1)$, remains poorly understood. Our work addresses this gap by showing that the curse of dimensionality in adversarial linear bandits can be broken under a natural fixed sparsity support assumption. Specifically, we design algorithms for the $l_\infty$- and $l_2$-balls that integrate sparsity support identification with the OSMD algorithm, achieving regret bounds $O(s\sqrt{T}\log T )$ and $O(\sqrt{sT}\log T )$, respectively. These results nearly match the optimal results when the sparsity support is known, and significantly improve upon the $ O(d\sqrt{T}) $ regret of algorithms ignoring sparsity. Furthermore, in the stochastic setting, we show how the geometry of the $l_p$-ball action set influences both exploration and regret. Our work highlights fundamental contrasts between adversarial and stochastic regimes, and establishes the first regret guarantees for sparse adversarial linear bandits beyond the $l_1$-ball action set.


54
On Computational Limits of FlowAR Models: Expressivity and Efficiency

Yang Cao ⋅ Chengyue Gong ⋅ Yekun Ke ⋅ Xiaoyu Li ⋅ Yingyu Liang ⋅ Zhizhou Sha ⋅ Zhenmei Shi ⋅ Zhao Song

The expressive power and computational complexity of deep visual generative models, such as flow-based and autoregressive (AR) models, have gained considerable interest for their wide-ranging applications in generative tasks. However, the theoretical characterization of their expressiveness through the lens of circuit complexity remains underexplored, particularly for the state-of-the-art architecture like FlowAR proposed by [Ren et al., 2024], which integrates flow-based and autoregressive mechanisms. This gap limits our understanding of their inherent computational limits and practical efficiency. In this study, we address this gap by analyzing the circuit complexity of the FlowAR architecture. We demonstrate that when the largest feature map produced by the FlowAR model has dimensions $n \times n \times c$, the FlowAR model is simulable by a family of threshold circuits $\mathsf{TC}^0$, which have constant depth $O(1)$ and polynomial width $\mathrm{poly}(n)$. This is the first study to rigorously highlight the limitations in the expressive power of FlowAR models. Furthermore, we identify the conditions under which the FlowAR model computations can achieve almost quadratic time. To validate our theoretical findings, we present efficient model variant constructions based on low-rank approximations that align with the derived criteria. Our work provides a foundation for future comparisons with other generative paradigms and guides the development of more efficient and expressive implementations.

Recovering high-dimensional signals from corrupted measurements is a central challenge in inverse problems. Recent advances in generative diffusion models have shown remarkable empirical success in providing strong data-driven priors, but rigorous recovery guarantees remain limited. In this work, we develop a theoretical framework for analyzing deterministic diffusion-based algorithms for inverse problems, focusing on a deterministic version of the algorithm proposed by Kadkhodaie &amp; Simoncelli \cite{kadkhodaie2021stochastic}. First, we show that when the underlying data distribution concentrates on a low-dimensional model set, the associated noise-convolved scores can be interpreted as time-varying projections onto such a set. This leads to interpreting previous algorithms using diffusion priors for inverse problems as generalized projected gradient descent methods with varying projections. When the sensing matrix satisfies a restricted isometry property over the model set, we can derive quantitative convergence rates that depend explicitly on the noise schedule. We apply our framework to two instructive data distributions: uniform distributions over low-dimensional compact, convex sets and low-rank Gaussian mixture models. In the latter setting, we can establish global convergence guarantees despite the nonconvexity of the underlying model set.


56
A Finite Time Analysis of Thompson Sampling for Bayesian Optimization with Preferential Feedback

Joseph Lazzaro ⋅ Davide Buffelli ⋅ Da-shan Shiu ⋅ Sattar Vakili

Preference feedback, in the form of pairwise comparisons rather than scalar scores, has seen increasing use in applications such as human-, laboratory-, and expert-in-the-loop design, as well as scientific discovery. We propose a Thompson Sampling (TS) approach to Bayesian optimization with preferential feedback that models comparisons using a monotone link on latent utility differences and leverages the dueling kernel induced by a base kernel. We provide a finite-time analysis showing that the performance of the proposed method matches that of standard TS for conventional Bayesian optimization with scalar feedback. The analysis exploits the anchor invariance of TS for challenger selection and introduces a double-TS pairing variant. We also demonstrate the performance of the method on both synthetic and real-world examples.

Conditional-independence-based discovery uses statistical tests to identify a graphical model that represents the independence structure of variables in a dataset. These test, however, can be unreliable and algorithms are sensitive to errors and violated assumptions. Often there are tests that were not used in the construction of the graph. In this work, we show that these redundant tests have the potential to detect or sometimes correct errors in the learned model. But we further show that not all tests contain this additional information and that such redundant tests have to be applied with care. Precisely, we argue that the conditional (in)dependence statements that hold for every probability distribution are unlikely to detect and correct errors - in contrast to those that follow only from graphical assumptions.


58
High-dimensional Level Set Estimation with Trust Regions and Double Acquisition Functions

Giang Ngo ⋅ Dat Phan Trong ⋅ Dang Nguyen ⋅ Sunil Gupta

Level set estimation (LSE) classifies whether an unknown function's value exceeds a specified threshold for given inputs, a fundamental problem in many real-world applications. In active learning settings with limited initial data, we aim to iteratively acquire informative points to construct an accurate classifier for this task. In high-dimensional spaces, this becomes challenging where the search volume grows exponentially with increasing dimensionality. We propose TRLSE, an algorithm for high-dimensional LSE, which identifies and refines regions near the threshold boundary with dual acquisition functions operating at both global and local levels. We provide a theoretical analysis of TRLSE's accuracy and show its superior sample efficiency against existing methods through extensive evaluations on multiple synthetic and real-world LSE problems.

Offline reinforcement learning (RL) agents often fail when deployed, as the gap between training datasets and real environments leads to unsafe behavior. To address this, we present SAS (Self-Alignment for Safety), a transformer-based framework that enables test-time adaptation in offline safe RL without retraining. In SAS, the main mechanism is self-alignment: at test time, the pretrained agent generates several imagined trajectories and selects those satisfying the Lyapunov condition. These feasible segments are then recycled as in-context prompts, allowing the agent to realign its behavior toward safety while avoiding parameter updates. In effect, SAS turns Lyapunov-guided imagination into control-invariant prompts, and its transformer architecture admits a hierarchical RL interpretation where prompting functions as Bayesian inference over latent skills. Across Safety Gymnasium and MuJoCo benchmarks, SAS consistently reduces cost and failure while maintaining or improving return.


6
Lloyd's $K$-Means Clustering Algorithm is Frank-Wolfe in Disguise

Michael Pokojovy ⋅ J. Marcus Jobe ⋅ Simon Lacoste-Julien

Lloyd's $K$-means algorithm, also known as naïve $K$-means, is a widely used *ad hoc* optimization heuristic, designed to minimize the sum of squared errors (SSE) across all $K$-partitions of a dataset via iterative cluster refinement. In this work, we establish a novel connection between Lloyd's algorithm and the Frank-Wolfe (FW) algorithm, a prominent first-order method for projection-free optimization. We demonstrate that Lloyd's algorithm is a special case of FW. Leveraging recent advances in FW methods for concave objectives, we derive a non-asymptotic $\mathcal{O}(1/t)$ convergence rate to a local minimum of the SSE objective. To account for empty clusters, an outcome possible under Lloyd's greedy assignment, we develop an FW variant for semismooth objectives while retaining the same convergence rate that is solely controlled by the initial SSE value. We illustrate our findings with a simulation study for spherical Gaussian mixtures and a real-world image segmentation dataset.


60
High-Dimensional Analysis of Bootstrap Ensemble Classifiers

Malik Tiomoko ⋅ Hamza Cherkaoui ⋅ Mohamed El Amine Seddik ⋅ Cosme Louart ⋅ Ekkehard Schnoor ⋅ Balázs Kégl

Bootstrap methods have long been the cornerstone of ensemble learning in machine learning. This paper presents a theoretical analysis of bootstrap techniques applied to the Least Square Support Vector Machine (LSSVM) ensemble in the context of large and growing sample sizes and feature dimensionalities. Using tools from Random Matrix Theory, we investigate the performance of this classifier that aggregates decision functions from multiple weak classifiers, each trained on different subsets of the data. We provide insights into the use of bootstrap methods in high-dimensional settings, enhancing our understanding of their impact. Based on these findings, we propose strategies to select the number of subsets and the regularization parameter that maximize the performance of the LSSVM. Empirical experiments on synthetic and real-world datasets validate our theoretical results.


61
LLMs Judging LLMs: A Simplex Perspective

Patrick Vossler ⋅ Fan Xia ⋅ Yifan Mai ⋅ Adarsh Subbaswamy ⋅ Jean Feng

Given the challenge of automatically evaluating free‐form outputs from large language models (LLMs), an increasingly common solution is to use LLMs themselves as the judging mechanism, without any gold-standard scores. Implicitly, this practice accounts for only sampling variability (aleatoric uncertainty) and ignores uncertainty about judge quality (epistemic uncertainty). While this is justified if judges are perfect, it is unclear when such an approach is (i) theoretically valid and (ii) practically robust. We study these questions for the task of ranking LLM candidates from a novel geometric perspective: for $M$-level scoring systems, both LLM judges and candidates can be represented as points on a $(M-1)$-dimensional probability simplex, where geometric concepts (e.g., triangle areas) correspond to key ranking concepts. This perspective yields intuitive theoretical conditions and visual proofs for when rankings are identifiable; for instance, we provide a formal basis for the ``folk wisdom'' that LLM judges are more effective with binary scoring ($M=2$) than with multi-level scoring ($M>2$). Leveraging the simplex, we design geometric Bayesian priors that encode epistemic uncertainty about judge quality and vary the priors to conduct sensitivity analyses. Experiments on LLM benchmarks show that rankings based solely on LLM judges are robust in many but not all datasets, underscoring both their widespread success and the need for caution. Our Bayesian method achieves substantially higher coverage rates than existing procedures, highlighting the importance of modeling epistemic uncertainty.


62
Convexified Message-Passing Graph Neural Networks

Saar Cohen ⋅ Noa Agmon ⋅ Uri Shaham

Graph Neural Networks (GNNs) are key tools for graph representation learning, demonstrating strong results across diverse prediction tasks. In this paper, we present Convexified Message-Passing Graph Neural Networks (CGNNs), a novel and general framework that combines the power of message-passing GNNs with the tractability of convex optimization. By mapping their nonlinear filters into a reproducing kernel Hilbert space, CGNNs transform training into a convex optimization problem, which projected gradient methods can solve both efficiently and optimally. Convexity further allows CGNNs' statistical properties to be analyzed accurately and rigorously. For two-layer CGNNs, we establish rigorous generalization guarantees, showing convergence to the performance of an optimal GNN. To scale to deeper architectures, we adopt a principled layer-wise training strategy. Experiments on benchmark datasets show that CGNNs significantly exceed the performance of leading GNN models, obtaining 10–40\% higher accuracy in most cases, underscoring their promise as a powerful and principled method with strong theoretical foundations. In rare cases where improvements are not quantitatively substantial, the convex models either slightly exceed or match the baselines, stressing their robustness and wide applicability. Though over-parameterization is often used to enhance performance in non-convex models, we show that our CGNNs yield shallow convex models that can surpass non-convex ones in accuracy and model compactness.


63
MDPs with a State Sensing Cost

Vansh Kapoor ⋅ Jayakrishnan Nair

In many practical sequential decision-making problems, tracking the state of the environment incurs a sensing/computation cost. In these settings, the agent's interaction with its environment includes the additional component of deciding \emph{when} to sense the state, in a manner that balances the value associated with optimal (state-specific) actions and the cost of sensing. We formulate this as an expected discounted cost Markov Decision Process (MDP), wherein the agent incurs an additional cost for sensing its next state, but has the option to take actions while remaining `blind' to the system state. We pose this problem as a classical discounted cost MDP with an expanded (countably infinite) state space. While computing the optimal policy for this MDP is intractable in general, we derive lower bounds on the optimal value function, which allow us to bound the suboptimality gap of any policy. We also propose a computationally efficient algorithm SPI, based on policy improvement, which in practice performs close to the optimal policy. Finally, we benchmark against the state-of-the-art via a numerical case study.


64
Feature Importance via Sets of Locally Performant Linear Models

Fatemeh Tohidian ⋅ Davin Hill ⋅ Aria Masoomi ⋅ Peter Castaldi ⋅ Jennifer Dy

Understanding the contribution of individual features to a model’s prediction is critical in applications such as medicine. While feature importance methods aim to quantify how much a feature contributes to a model’s accuracy, they often overlook heterogeneous patterns in the data and suffer from limited robustness. We propose $\ell\text{-MCR}$, a local feature importance method that identifies meaningful neighborhoods around a point of interest, regions where the model or data behavior is locally stable and interpretable. Within these neighborhoods, we estimate feature importance using Model Class Reliance (MCR), which offers robustness by considering the full set of near-optimal models. We also provide a consistency proof for reliably detecting such neighborhoods. Experiments on both synthetic and real-world datasets demonstrate that $\ell\text{-MCR}$ captures localized feature importance patterns that global approaches fail to detect.

In collaborative and distributed learning, Byzantine robustness reflects a major facet of optimization algorithms. Such distributed algorithms are often accompanied by transmitting a large number of parameters, so communication compression is essential for an effective solution. In this paper, we propose Byz-DM21, a novel Byzantine-robust and communication-efficient stochastic distributed learning algorithm. Our key innovation is a novel gradient estimator based on a double-momentum mechanism, integrating recent advancements in error feedback techniques. Using this estimator, we design both standard and accelerated algorithms that eliminate the need for large batch sizes while maintaining robustness against Byzantine workers. We prove that the Byz-DM21 algorithm has a smaller neighborhood size and converges to $\varepsilon$-stationary points in $\mathcal{O}(\varepsilon^{-4})$ iterations. To further enhance efficiency, we introduce a distributed variant called Byz-VR-DM21, which incorporates local variance reduction at each node to progressively eliminate variance from random approximations. We show that Byz-VR-DM21 provably converges to $\varepsilon$-stationary points in $\mathcal{O}(\varepsilon^{-3 })$ iterations. Additionally, we extend our results to the case where the functions satisfy the Polyak-Łojasiewicz condition. Finally, numerical experiments demonstrate the effectiveness of the proposed method.


66
Multi-Component VAE with Gaussian Markov Random Field

Fouad Oubari ⋅ Mohamed El Baha ⋅ Raphaël Meunier ⋅ Rodrigue Décatoire ⋅ Mathilde MOUGEOT

Multi-component datasets with intricate dependencies challenge current generative modeling techniques. Existing Multi-component Variational AutoEncoders rely on simplified aggregation strategies that compromise structural coherence across generated components. We introduce the Gaussian Markov Random Field Multi-Component Variational AutoEncoder, embedding Gaussian Markov Random Fields into both prior and posterior distributions to explicitly model cross-component relationships. This enables richer representation and faithful reproduction of complex interactions. Empirically, our model achieves state-of-the-art performance on a synthetic Copula dataset designed for intricate component relationships, competitive results on PolyMNIST, and significantly enhanced structural coherence on the real-world BIKED dataset, demonstrating its suitability for applications demanding robust multi-component coherence.


67
The Minimax Lower Bound of Kernel Stein Discrepancy Estimation

Jose Cribeiro-Ramallo ⋅ Agnideep Aich ⋅ Florian Kalinke ⋅ Ashit Baran Aich ⋅ Zoltan Szabo

Kernel Stein discrepancies (KSDs) have emerged as a powerful tool for quantifying goodness-of-fit over the last decade, featuring numerous successful applications. To the best of our knowledge, all existing KSD estimators with known rate achieve $\sqrt n$-convergence. In this work, we present two complementary results (with different proof strategies), establishing that the minimax lower bound of KSD estimation is $n^{-1/2}$ and settling the optimality of these estimators. Our first result focuses on KSD estimation on $\mathbb R^d$ with the Langevin-Stein operator; our explicit constant for the Gaussian base kernel indicates that the difficulty of KSD estimation may increase exponentially with the dimensionality $d$. Our second result settles the minimax lower bound for KSD estimation on general domains.

In randomized experiments, regression adjustment can improve the precision of average treatment effect (ATE) estimation using covariates without requiring a correctly specified outcome model. Although well studied in low-dimensional settings, its behavior in high-dimensional regimes, where the number of covariates $p$ may exceed the number of observations $n$, remains underexplored. Moreover, existing analyses are largely asymptotic, providing limited guidance for finite-sample inference. We develop a design-based, non-asymptotic framework for analyzing the regression-adjusted ATE estimator under complete randomization. This yields finite-sample-valid confidence intervals with explicit, instance-adaptive widths, even when $p > n$. While these intervals rely on oracle (population-level) quantities, we also outline data-driven envelopes computable from observed data. Our approach hinges on a refined swap sensitivity analysis of an estimator: stochastic fluctuation is controlled via a variance-adaptive Doob martingale and Freedman's inequality, and design bias is bounded by Stein's method of exchangeable pairs. The analysis elucidates how covariate geometry governs concentration and bias of the adjusted estimator, suggesting when and how regression adjustment can be effective.

In many scientific fields, key variables of interest are latent---either because they cannot be measured directly or because doing so is prohibitively expensive. As a result, researchers often rely on high-dimensional surrogate observations and must infer relationships among the unobserved quantities. In this work, we address a fundamental challenge: How can one estimate the covariance between variables that are not directly observable? We consider a model where each latent variable elicits high-dimensional observable covariates. Under our model, we propose a method that estimates several spiked covariances from the observed variables and then reconstructs the covariance matrix among the latent variables. Our estimator achieves quadratic-time complexity with respect to the number of latent variables and only requires the sample size to be logarithmic in the number of latent variables. As an immediate application, our procedure can be leveraged to recover the conditional independence structure among the latent variables, providing interpretable insights. Extensive synthetic experiments validate our theory, demonstrating accurate estimation in practice.


7
Sparse Linear Bandits with Blocking Constraints

Adit Jain ⋅ Soumyabrata Pal ⋅ Sunav Choudhary ⋅ Ramasuri Narayanam ⋅ Harshita Chopra ⋅ Vikram Krishnamurthy

We investigate the high-dimensional sparse linear bandits problem in a data-poor regime where the time horizon is much smaller than the ambient dimension and number of arms. We study the setting under the additional \textit{blocking constraint} where each unique arm can be pulled only once. The blocking constraint is motivated by practical applications in personalized content recommendation and identification of datapoints to improve annotation efficiency for complex learning tasks. With mild assumptions on the arms, our proposed online algorithm (\texttt{BSLB}) achieves a regret guarantee of $\widetilde{\mathsf{O}}((1+\beta_k)^2k^{\frac{2}{3}} \mathsf{T}^{\frac{2}{3}})$ where the parameter vector has an (unknown) relative tail $\beta_k$ - the ratio of $\ell_1$ norm of the top-$k$ and remaining entries of the parameter vector. To this end, we show novel offline statistical guarantees of the lasso estimator for the linear model that is robust to the sparsity modeling assumption. Finally, we propose a meta-algorithm (\texttt{C-BSLB}) based on corralling that does not need knowledge of optimal sparsity parameter $k$ at minimal cost to regret. Our experiments on multiple real-world datasets demonstrate the validity of our algorithms and theoretical framework.


70
Policy Testing in Markov Decision Processes

Kaito Ariu ⋅ Po-An WANG ⋅ Alexandre Proutiere ⋅ Kenshi Abe

We study the policy testing problem in discounted Markov decision processes (MDPs) in the fixed-confidence setting under a generative model with static sampling. The goal is to decide whether the value of a given policy exceeds a specified threshold while minimizing the number of samples. We first derive an instance-dependent lower bound that any reasonable algorithm must satisfy, characterized as the solution to an optimization problem with non-convex constraints. Guided by this formulation, we propose a new algorithm. While this design paradigm is common in pure exploration problems such as best-arm identification, the non-convex constraints that arise in MDPs introduce substantial difficulties. To address them, we reformulate the lower-bound problem by swapping the roles of the objective and the constraints, yielding an alternative problem with a non-convex objective but convex constraints. This reformulation admits an interpretation as a policy optimization task in a newly constructed {\it reversed MDP}. Leveraging recent advances in policy gradient methods, we solve this problem and design an asymptotically optimal policy testing algorithm. Beyond policy testing, our reformulation and reversed MDP view suggest extensions to other pure exploration tasks in MDPs, including policy evaluation and best policy identification.


71
Narrowing Action Choices with AI Improves Human Sequential Decisions

Eleni Straitouri ⋅ Stratis Tsirtsis ⋅ Ander Artola Velasco ⋅ Manuel Gomez Rodriguez

Recent work has shown that, in classification tasks, it is possible to design decision support systems that do not require human experts to understand when to cede agency to a classifier or when to exercise their own agency to achieve complementarity---experts using these systems make more accurate predictions than those made by the experts or the classifier alone. The key principle underpinning these systems reduces to adaptively controlling the level of human agency, by design. Can we use the same principle to achieve complementarity in sequential decision making tasks? In this paper, we answer this question affirmatively. We develop a decision support system that uses a pre-trained AI agent to narrow down the set of actions a human can take to a subset, and then asks the human to take an action from the action set. Along the way, we also introduce a bandit algorithm that leverages the smoothness properties of the action sets provided by our system to efficiently optimize the level of human agency. To evaluate our decision support system, we conduct a large-scale human subject study ($n = 1{,}600$) where participants play a wildfire mitigation game. We find that participants who play the game supported by our system outperform those who play on their own by $\sim$$30$\% and the AI agent used by our system by $>$$2$\%, even though the AI agent largely outperforms participants playing without support.


72
Provable Target Sample Complexity Improvements as Pre‑Trained Models Scale

Kazuto Fukuchi ⋅ Ryuichiro Hataya ⋅ Kota Matsui

Pre-trained models have become indispensable for efficiently building models across a broad spectrum of downstream tasks. The advantages of pre-trained models have been highlighted by empirical studies on scaling laws, which demonstrate that larger pre-trained models can significantly reduce the sample complexity of downstream learning. However, existing theoretical investigations of pre-trained models lack the capability to explain this phenomenon. In this paper, we provide a theoretical investigation by introducing a novel framework, caulking, inspired by parameter-efficient fine-tuning (PEFT) methods such as adapter-based fine-tuning, low-rank adaptation, and partial fine-tuning. Our analysis establishes that improved pre-trained models provably decrease the sample complexity of downstream tasks, thereby offering theoretical justification for the empirically observed scaling laws relating pre-trained model size to downstream performance, a relationship not covered by existing results.


73
Counterfactual Credit Guided Bayesian Optimization

Qiyu Wei ⋅ Haowei Wang ⋅ Richard Allmendinger ⋅ Mauricio Álvarez

Bayesian optimization has emerged as a prominent methodology for optimizing expensive black-box functions by leveraging Gaussian process surrogates, which focus on capturing the global characteristics of the objective function. However, in numerous practical scenarios, the primary objective is not to construct an exhaustive global surrogate, but rather to quickly pinpoint the global optimum. Due to the aleatoric nature of the sequential optimization problem and its dependence on the quality of the surrogate model and the initial design, it is restrictive to assume that all observed samples contribute equally to the discovery of the optimum in this context. In this paper, we introduce Counterfactual Credit Guided Bayesian Optimization (CCGBO), a novel framework that explicitly quantifies the contribution of individual historical observations through counterfactual credit. By incorporating counterfactual credit into the acquisition function, our approach can selectively allocate resources in areas where optimal solutions are most likely to occur. We prove that CCGBO retains sublinear regret. Empirical evaluations on various synthetic and real-world benchmarks demonstrate that CCGBO consistently reduces simple regret and accelerates convergence to the global optimum.


74
Deconfounding Scores and Representation Learning for Causal Effect Estimation with Weak Overlap

Oscar Clivio ⋅ Alexander D'Amour ⋅ Alexander Franks ⋅ David Bruns-Smith ⋅ Chris Holmes ⋅ Avi Feller

Overlap, also known as positivity, is a key condition for causal treatment effect estimation. Many popular estimators suffer from high variance and become brittle when features differ strongly across treatment groups. This is especially challenging in high dimensions: the curse of dimensionality can make overlap implausible. To address this, we propose a class of feature representations called deconfounding scores, which preserve both identification and the target of estimation; the classical propensity and prognostic scores are two special cases. We characterize the problem of finding a representation with better overlap as minimizing an overlap divergence under a deconfounding score constraint. We then derive closed-form expressions for a class of deconfounding scores under a broad family of generalized linear models with Gaussian features and show that prognostic scores are overlap-optimal within this class. We conduct extensive experiments to assess this behavior empirically.


75
On Kernel based Variational Autoencoders

Tian Qin ⋅ Wei-Min Huang

In this paper, we bridge Variational Autoencoders (VAEs) and kernel density estimations (KDEs) by approximating the posterior by the expectation of kernel density estimator and deriving a new lower bound of empirical log likelihood. The flexibility of KDEs provides a new perspective of controlling the KL-divergence term in original evidence lower bound (ELBO) which enriches the choice of the posterior and prior pairs in VAE. We show that the Epanechnikov kernel gives the tightest upper bound in controlling the KL-divergence under appropriate conditions in theory and develop a kernel-based VAE called Epanechnikov Variational Autoenocoder (EVAE). The implementation of EVAE is straightforward as Epanechnikov kernel lies in the ``location-scale'' family of distributions where reparametrization tricks can be applied directly. Compared with Gaussian kernel, Epanechnikov kernel has compact support which should make the generated sample less blurry. The flexibility of new lower bound of ELBO also enables us to employ a two-stage training strategy to treat reconstruction and generation separately, which is an analogue of the idea in VQ-VAE. Extensive experiments illustrate the potential of EVAE in image generation and the superiority of EVAE over vanilla VAE and other baseline models in the quality of reconstructed images, as measured by the FID score and Sharpness. We also carried out additional experiments about the application of EVAE in downstream classification tasks.

We study transformers' in-context learning of variable-length Markov chains (VOMCs), focusing on the finite-sample accuracy as the number of in-context examples increases. Compared to fixed-order Markov chains (FOMCs), learning VOMCs is substantially more challenging due to the additional structural learning component. The problem is naturally suited to a Bayesian formulation, where the context-tree weighting (CTW) algorithm, originally developed in the information theory community for universal data compression, provides an optimal solution. Empirically, we find that single-layer transformers fail to learn VOMCs in context, whereas transformers with two or more layers can succeed, with additional layers yielding modest but noticeable improvements. In contrast to prior results on FOMCs, attention-only networks appear insufficient for VOMCs. To explain these findings, we provide explicit transformer constructions: one with $D+2$ layers that can exactly implement CTW for VOMCs of maximum order $D$, and a simplified two-layer construction that uses partial information for approximate blending, shedding light on why two-layer transformers can perform well.

We study the finite-sample behavior of generalized posteriors defined via $f$-divergences, a broad class of posteriors that includes the standard Bayesian posterior along with most of its generalizations. Our main contribution is to extend the Langevin diffusion representation of the Bayesian posterior to this broader class. With this perspective, we obtain non-asymptotic posterior contraction rates for $f$-divergence-based posteriors by bounding the moments of their associated diffusion. Our results establish nearly optimal rates and clarify how different divergence choices influence posterior concentration. Finally, we illustrate the general framework with concrete examples.


78
Adaptive Replay Buffer for Offline-to-Online Reinforcement Learning

Chihyeon Song ⋅ Jaewoo Lee ⋅ Jinkyoo Park

Offline-to-Online Reinforcement Learning (O2O RL) faces a critical dilemma in balancing the use of a fixed offline dataset with newly collected online experiences. Standard methods, often relying on a fixed data-mixing ratio, struggle to manage the trade-off between early learning stability and asymptotic performance. To overcome this, we introduce the Adaptive Replay Buffer (ARB), a novel approach that dynamically prioritizes data sampling based on a lightweight metric we call 'on-policyness'. Unlike prior methods that rely on complex learning procedures or fixed ratios, ARB is designed to be learning-free and simple to implement, seamlessly integrating into existing O2O RL algorithms. It assesses how closely collected trajectories align with the current policy's behavior and assigns a proportional sampling weight to each transition within that trajectory. This strategy effectively leverages offline data for initial stability while progressively focusing learning on the most relevant, high-rewarding online experiences. Our extensive experiments on D4RL benchmarks demonstrate that ARB consistently mitigates early performance degradation and significantly improves the final performance of various O2O RL algorithms, highlighting the importance of an adaptive, behavior-aware replay buffer design. Our code is publicly available at https://github.com/song970407/ARB.

Variable importance (VI) methods are often used for hypothesis generation, feature selection, and scientific validation. In the standard VI pipeline, an analyst estimates VI for a single predictive model with only the observed features. However, the importance of a feature depends heavily on which other variables are included in the model, and essential variables are often omitted from observational datasets. Moreover, the VI estimated for one model is often not the same as the VI estimated for another equally-good model – a phenomenon known as the Rashomon Effect. We address these gaps by introducing UNobservables and Inference for Variable importancE using Rashomon SEts (UNIVERSE). Our approach adapts Rashomon sets – the sets of near-optimal models in a dataset – to produce bounds on the true VI even with missing features. We theoretically guarantee the robustness of our approach, show strong performance on semi-synthetic simulations, and demonstrate its utility in a credit risk task.

Given an integer $k\geq1$ and a set $P$ of $n$ points in $\mathbb{R}^d$, the classic $k$-PCA (Principal Component Analysis) approximates the affine \emph{$k$-subspace mean} of $P$, which is the $k$-dimensional affine linear subspace that minimizes its sum of squared Euclidean distances ($\ell_{2,2}$-norm) over the points of $P$, i.e., the mean of these distances. The \emph{$k$-subspace median} is the subspace that minimizes its sum of (non-squared) Euclidean distances ($\ell_{2,1}$-mixed norm), i.e., their median. The median subspace is usually more sparse and robust to noise/outliers than the mean, but also much harder to approximate since, unlike the $\ell_{z,z}$ (non-mixed) norms, it is non-convex for $k

We define the problem of linear Contextual Stochastic Shortest Path (CSSP), where at the beginning of each episode, the learner observes an adversarially chosen context that determines the MDP through a fixed but unknown linear function. The learner's objective is to reach a designated goal state with minimal expected cumulative loss, despite having no prior knowledge of the transition dynamics, loss functions, or the mapping from context to MDP. In this work, we propose LR-CSSP, an algorithm that achieves a regret bound of $\widetilde{O}(K^{2/3} d^{2/3} |S| |A|^{1/3} B_\star^2 T_\star \log (1/ \delta))$, where $K$ is the number of episodes, $d$ is the context dimension, $S$ and $A$ are the sets of states and actions respectively, $B_\star$ bounds the optimal cumulative loss and $T_\star$, unknown to the learner, bounds the expected time for the optimal policy to reach the goal. In the case where all costs exceed $\ell_{\min}$, LR-CSSP attains a regret of $\widetilde O(\sqrt{K \cdot d^2 |S|^3 |A| B_\star^3 \log(1/\delta)/\ell_{\min}})$. Unlike in contextual finite-horizon MDPs, where limited knowledge primarily leads to higher losses and regret, in the CSSP setting, insufficient knowledge can also prolong episodes and may even lead to non-terminating episodes. Our analysis reveals that LR-CSSP effectively handles continuous context spaces, while ensuring all episodes terminate within a reasonable number of time steps.


81
E-Scores for (In)Correctness Assessment of Generative Model Outputs

Guneet Singh Dhillon ⋅ Javier Gonzalez ⋅ Teodora Pandeva ⋅ Alicia Curth

While generative models, especially large language models (LLMs), are ubiquitous in today’s world, principled mechanisms to assess their (in)correctness are limited. Using the conformal prediction framework, previous works construct sets of LLM responses where the probability of including an incorrect response, or error, is capped at a user-defined tolerance level. However, since these methods are based on p-values, they are susceptible to p-hacking, i.e., choosing the tolerance level post-hoc can invalidate the guarantees. We therefore leverage e-values to complement generative model outputs with e-scores as measures of incorrectness. In addition to achieving the guarantees as before, e-scores further provide users with the flexibility of choosing data-dependent tolerance levels while upper bounding size distortion, a post-hoc notion of error. We experimentally demonstrate their efficacy in assessing LLM outputs under different forms of correctness: mathematical factuality and property constraints satisfaction.


82
Filter, Augment, Forecast: Online Data Selection for Robust Time Series Forecasting

Ege Onur Taga ⋅ Halil Gozeten ⋅ Kutay Tire ⋅ Rahul Dalvi ⋅ Reinhard Heckel ⋅ Samet Oymak

Data curation pipelines play a central role in training deep learning architectures, with their impact in time series forecasting still relatively underexplored. In this work, we propose Filter, Augment, Forecast (FAF): an online data curation strategy based on (1) data selection to filter out low-quality (e.g., noisy) examples and (2) augmentation of the remaining high-quality data. We use reference model-based filtering inspired by the reducible holdout loss selection (RHO-LOSS) from the language modeling literature. We identify limitations of RHO-LOSS under domain shifts common in time series and introduce the adaptive RHO method (AdaRho), which improves performance by updating the reference model during training. Using random matrix theory, we provide a statistical analysis that characterizes the role of the reference model, sample size, and noise statistics in data selection. FAF consistently improves forecasting accuracy across diverse architectures without modifying them, achieving state-of-the-art results.


83
Catoni-Style Change Point Detection for Regret Minimization in Piecewise-Stationary Heavy-Tailed Bandits

Gianmarco Genalti ⋅ Sujay Bhatt ⋅ Nicola Gatti ⋅ Alberto Maria Metelli

Regret minimization in stochastic non-stationary bandits gained popularity over the last decade, as it can model a broad class of real-world problems, from advertising to recommendation systems. Existing literature relies on various assumptions about the reward-generating process, such as Bernoulli or subgaussian rewards. However, in settings such as finance and telecommunications, heavy-tailed distributions naturally arise. In this work, we tackle the heavy-tailed piecewise-stationary bandit problem. Heavy-tailed bandits, introduced by Bubeck et al., 2013, operate on the minimal assumption that the finite absolute centered moments of maximum order $1+\epsilon$ are uniformly bounded by a constant $v<+\infty$, for some $\epsilon \in (0,1]$. We focus on the most popular non-stationary bandit setting, i.e., the piecewise-stationary setting, in which the mean of reward-generating distributions may change at unknown time steps. We provide a novel Catoni-style change-point detection strategy tailored for heavy-tailed distributions that relies on recent advancements in the theory of sequential estimation, which is of independent interest. We introduce Robust-CPD-UCB, which combines this change-point detection strategy with optimistic algorithms for bandits, providing its regret upper bound and an impossibility result on the minimum attainable regret for any policy. Finally, we validate our approach through numerical experiments on synthetic and real-world datasets.


84
Understanding Generalization in Node and Link Prediction

Antonis Vasileiou ⋅ Timo Stoll ⋅ Christopher Morris

Using message-passing graph neural networks (MPNNs) for node and link prediction is crucial in various scientific and industrial domains, which has led to the development of diverse MPNN architectures. Besides working well in practical settings, their ability to generalize beyond the training set remains poorly understood. While some studies have explored the generalization of MPNNs in graph-level prediction tasks, much less attention has been given to node- and link-level predictions. Existing works often rely on unrealistic i.i.d. assumptions, overlooking possible correlations between nodes or links, and assuming fixed aggregation and impractical loss functions while neglecting the influence of graph structure. In this work, we introduce a unified framework for analyzing the generalization properties of MPNNs in inductive and transductive node and link prediction settings, incorporating diverse architectural parameters and loss functions, and quantifying the influence of graph structure. Additionally, our proposed generalization framework can be applied beyond graphs to any classification task, regardless of whether it is inductive or transductive. Our empirical study supports our theoretical insights, deepening our understanding of MPNNs' generalization capabilities in these tasks.

Subspace inference for neural networks assumes that a subspace of their parameter space suffices to produce a reliable uncertainty quantification. In this work, we underpin the validity of this assumption by using low rank techniques. We derive an expression for a subspace model to a Bayesian inference scenario based on the Laplace approximation that is, in a certain sense, optimal given a specific dataset. We empirically show that a Laplace approximation constructed with a dimensionally reduced covariance matrix closely matches the full Laplace approximation obtained using the exact covariance matrix. Where feasible, this subspace model can serve as a baseline for benchmarking the performance of subspace models. In addition, we provide a scalable approximation of this subspace construction that is usable in practice and compare it to existing subspace models from the literature. In general, our approximation scheme outperforms previous work. Furthermore, we present a metric to qualitatively compare the approximation quality of different subspace models even if the exact Laplace approximation is unknown.

Central to the success of Transformers is the attention block, which effectively models global dependencies among input tokens associated to a dataset. However, we theoretically demonstrate that standard attention mechanisms in transformers often produce ill-conditioned matrices with large condition numbers. This ill-conditioning is a well-known obstacle for gradient-based optimizers, leading to inefficient training. To address this issue, we introduce preconditioned attention, a novel approach that incorporates a conditioning matrix into each attention head. Our theoretical analysis shows that this method significantly reduces the condition number of attention matrices, resulting in better-conditioned matrices that improve optimization. Conditioned attention serves as a simple drop-in replacement for a wide variety of attention mechanisms in the literature. We validate the effectiveness of preconditioned attention across a diverse set of transformer applications, including image classification, object detection, instance segmentation, long sequence modeling and language modeling.


88
Improving Coverage in Combined Prediction Sets with Weighted p-values

Gina Wong ⋅ Drew Prinster ⋅ Suchi Saria ⋅ Rama Chellappa ⋅ Anqi Liu

Conformal prediction quantifies the uncertainty of machine learning models by augmenting point predictions with valid prediction sets. For complex scenarios involving multiple trials, models, or data sources, conformal prediction sets can be aggregated to create a prediction set that captures the overall uncertainty, often improving precision. However, aggregating multiple prediction sets with individual $1-\alpha$ coverage inevitably weakens the overall guarantee, typically resulting in $1-2\alpha$ worst-case coverage. In this work, we propose a framework for the *weighted aggregation of prediction sets*, where weights are assigned to each prediction set based on their contribution. Our framework offers flexible control over how the sets are aggregated, achieving tighter coverage bounds that interpolate between the $1-2\alpha$ guarantee of the combined models and the $1-\alpha$ guarantee of an individual model depending on the distribution of weights. Importantly, our framework generalizes to data-dependent weights, as we derive a procedure for weighted aggregation that maintains finite-sample validity even when the weights depend on the data. This extension makes our framework broadly applicable to settings where weights are learned, such as mixture-of-experts (MoE), and we demonstrate through experiments in the MoE setting that our methods achieve adaptive coverage.

Multi-dimensional classification (MDC) extends multi-class and multi-label learning by predicting several class variables per instance. We revisit probabilistic MDC methods with mixed features (discrete and continuous), focusing on their strengths and limits for handling incomplete data at prediction time. We present theoretical results leading to a new probabilistic approach with efficient learning and prediction algorithms that address scalability and robustness issues. Experiments demonstrate its benefits in different missingness scenarios.


9
Happiness as a Measure of Fairness

Georg Pichler ⋅ Marco Romanelli ⋅ Pablo Piantanida

In this paper, we propose a novel fairness framework grounded in the concept of happiness, a measure of the utility each group gains from decision outcomes. By capturing fairness through this intuitive lens, we not only offer a more human-centered approach, but also one that is mathematically rigorous: In order to compute the optimal, fair post-processing strategy, only a linear program needs to be solved. This makes our method both efficient and scalable with existing optimization tools. Furthermore, it unifies and extends several well-known fairness definitions, and our empirical results highlight its practical strengths across diverse scenarios.

In this paper, we study the problem of distributed mean estimation with 1-bit communication constraints. We propose a mean estimator that is based on (randomized and sequentially-chosen) interval queries, whose 1-bit outcome indicates whether the given sample lies in the specified interval. Our estimator is $(\epsilon, \delta)$-PAC for all distributions with bounded mean ($-\lambda \le \mathbb{E}(X) \le \lambda $) and variance ($\mathrm{Var}(X) \le \sigma^2$) for some known parameters $\lambda$ and $\sigma$. We derive a sample complexity bound $\widetilde{O}\big( \frac{\sigma^2}{\epsilon^2}\log\frac{1}{\delta} + \log\frac{\lambda}{\sigma}\big)$, which matches the minimax lower bound for the unquantized setting up to logarithmic factors and the additional $\log\frac{\lambda}{\sigma}$ term that we show to be unavoidable. We also establish an adaptivity gap for interval-query based estimators: the best non-adaptive mean estimator is considerably worse than our adaptive mean estimator for large $\frac{\lambda}{\sigma}$. Finally, we give tightened sample complexity bounds for distributions with stronger tail decay, and present additional variants that (i) handle an unknown sampling budget (ii) adapt to the unknown true variance given (possibly loose) upper and lower bounds on the variance, and (iii) use only two stages of adaptivity at the expense of more complicated (non-interval) queries.


91
Projection-free Algorithms for Online Convex Optimization with Adversarial Constraints

Dhruv Sarkar ⋅ Aprameyo Chakrabartty ⋅ Subhamon Supantha ⋅ Palash Dey ⋅ Abhishek Sinha

We study a generalization of the Online Convex Optimization (OCO) framework with time-varying adversarial constraints. In this setting, at each round, the learner selects an action from a convex decision set $\mathcal{X}$, after which both a convex cost function and a convex constraint function are revealed. The objective is to design a computationally efficient learning policy that simultaneously achieves low regret with respect to the cost functions and low cumulative constraint violation (CCV) over a horizon of length $T$. A major computational bottleneck in standard OCO algorithms is the projection operation onto the decision set $\mathcal{X}$. However, for many structured decision sets, linear optimization can be performed efficiently. Motivated by this, we propose a \emph{projection-free} online conditional gradient (OCG)-based algorithm that requires only a single call to a linear optimization oracle over $\mathcal{X}$ per round. Our approach improves upon the state of the art for projection-free online learning with adversarial constraints, achieving $\tilde{O}(T^{3/4})$ bounds for both regret and CCV. Our algorithm is conceptually simple. It constructs a surrogate cost function as a nonnegative linear combination of the cost and constraint functions, and feeds these surrogate costs into a novel adaptive online conditional gradient subroutine introduced in this paper. We further extend our framework to the bandit setting, where we show that a new form of surrogate loss is necessary to properly handle bandit feedback—an issue overlooked in prior work. Finally, we develop an efficient Follow-the-Perturbed-Leader (FTPL)-based algorithm, particularly well-suited for online combinatorial optimization problems with discrete actions, which also achieves $O(T^{3/4})$ regret and CCV.


92
Structured Matrix Scaling for Multi-Class Calibration

Eugène Berta ⋅ David Holzmüller ⋅ Michael Jordan ⋅ Francis Bach

Post-hoc recalibration methods are widely used to ensure that classifiers provide faithful probability estimates. We argue that parametric recalibration functions based on logistic regression can be motivated from a simple theoretical setting for both binary and multi-class classification. This insight motivates the use of more expressive calibration methods beyond standard temperature scaling. For multi-class calibration however, a key challenge lies in the increasing number of parameters introduced by more complex models, often coupled with limited calibration data, which can lead to overfitting. Through extensive experiments, we demonstrate that the resulting bias-variance tradeoff can be effectively managed by structured regularization, robust preprocessing and efficient optimization. The resulting methods lead to substantial gains over existing logistic-based calibration techniques. We provide efficient and easy-to-use open-source implementations of our methods, making them an attractive alternative to common temperature, vector, and matrix scaling implementations.


93
Explanation Design in Strategic Learning: Sufficient Explanations That Induce Non-harmful Responses

Kiet Q. H. Vo ⋅ Siu Lun Chau ⋅ Masahiro Kato ⋅ Yixin Wang ⋅ Krikamol Muandet

We study the design of explanations in algorithmic decision-making with strategic agents---individuals who may modify their inputs in response to explanations of a decision maker's (DM's) predictive model. While the demand for algorithmic transparency has led much prior work to assume full model disclosure, in practice DMs typically provide only partial information via explanations, which can cause agents to misinterpret the model and take actions that unintentionally reduce their own utility. A central open question is therefore how DMs should communicate explanations that avoid harming strategic agents while still supporting their own goals, e.g., minimising predictive error. In this work, we analyse widely used explanation methods and establish a necessary condition to prevent explanations from inducing self-harming responses. Furthermore, we show that action recommendation-based explanations (ARexes), which encompass counterfactual explanations, are sufficient to induce all non-harmful responses. Under a conditional homogeneity assumption, this sufficiency extends to ARex-generating methods, echoing the revelation principle in information design. To demonstrate their practical utility, we introduce a simple learning procedure that jointly optimises the predictive model and the explanation-generating policy. Experiments on both synthetic and real-world tasks show that ARexes enable DMs to achieve high predictive performance while preserving agents' utility, offering a principled strategy for safe and effective partial model disclosure.

We study a structured bi-level optimization problem where the upper-level objective is a smooth function and the lower-level problem is policy optimization in a Markov decision process (MDP). The upper-level decision variable parameterizes the reward of the lower-level MDP, and the upper-level objective depends on the optimal induced policy. Existing methods for bi-level optimization and RL often require second-order information, impose strong regularization at the lower level, or inefficiently use samples through nested-loop procedures. In this work, we propose a single-loop, first-order actor-critic algorithm that optimizes the bi-level objective via a penalty-based reformulation. We introduce into the lower-level RL objective an attenuating entropy regularization, which enables asymptotically unbiased upper-level hyper-gradient estimation without solving the unregularized RL problem exactly. We establish the finite-time and finite-sample convergence of the proposed algorithm to a stationary point of the original, unregularized bi-level optimization problem through a novel lower-level residual analysis under a special type of Polyak–Lojasiewicz condition. We validate the performance of our method through experiments on a GridWorld goal position problem and on happy tweet generation through reinforcement learning from human feedback (RLHF).


95
Corruption Robust Thompson Sampling for Gaussian Bandits

Yinglun Xu ⋅ Zhiwei Wang ⋅ Gagandeep Singh

Thompson sampling is one of the most popular learning algorithms for online sequential decision-making problems and has rich real-world applications. However, traditional Thompson sampling algorithms are limited by the assumption that the rewards received are uncorrupted, which may not hold in real-world applications where adversarial reward poisoning exists. To make Thompson sampling more reliable, our goal is to make it robust against adversarial reward poisoning. Particularly, we consider a strong attack threat model where an adversary applies corruption after observing the agent's actions. The main challenge is that one can no longer compute the actual posteriors for the true reward, as the agent can only observe the rewards after corruption. In this work, we solve this problem by computing pseudo-posteriors that are less likely to be manipulated by the attack. Particularly, we focus on two popular settings: stochastic bandits and contextual linear bandits with priors as Gaussian distributions. We are the first to propose robust algorithms based on Thompson sampling for the two bandit settings in both cases where the agent is aware or unaware of the attacker's budget. We theoretically show that our algorithms guarantee near-optimal regret under any attack strategy.

Despite advances in the Transformer architecture, their effectiveness for long-term time series forecasting (LTSF) remains controversial. In this paper, we investigate the potential of integrating explicit periodicity modeling into the self-attention mechanism to enhance the performance of Transformerbased architectures for LTSF. Specifically, we propose PENGUIN, a simple yet effective periodic-nested group attention mechanism. Our approach introduces a periodic-aware relative attention bias to directly capture periodic structures and a grouped multi-query attention mechanism to handle multiple coexisting periodicities (e.g., daily and weekly cycles) within time series data. Extensive experiments across diverse benchmarks demonstrate that PENGUIN consistently outperforms both MLP-based and Transformer-based models. Code is available at https://github.com/ysygMhdxw/AISTATS2026_PENGUIN.


97
GeoTTER: Leveraging Local Geometry of Optimal Transport for Zero-Shot Classification

Wei-Yang Lee ⋅ Rudrasis Chakraborty ⋅ Vishnu Lokhande

We present GeoTTER, a novel framework that redefines optimal transport in the realm of zero-shot classification. Conventional methods often suffer from miscalibration and a lack of adaptability, as they rely on fixed cost matrices derived solely from pre-trained model embeddings. In contrast, GeoTTER addresses these limitations by incorporating two key techniques. First, to alleviate high-frequency label jaggedness (sample-level manifold jitter that assigns neighboring embeddings to different classes), GeoTTER integrates local geometric structure into the optimal transport formulation via graph-Laplacian smoothing, a technique grounded in spectral graph theory that enforces neighborhood consistency. Second, to correct coherent angular drift (a low-frequency orientation bias in which large groups of samples share the same angular offset from their true label prototypes), we fuse clustering-guided cost components with a globally adjusted transport cost, achieving a multi-objective optimization that respects both global distribution constraints and latent data structure. With a median improvement of +6.82\% compared to zero-shot and +2.13\% compared to OTTER, GeoTTER shows robust improvements across a diverse set of benchmarks. The code is available on \href{https://github.com/TeleViaBox/GeoTTER}{Github}.


98
Enhancing LLM Safety through a Theoretical Minimax Game Lens

Yihe Deng ⋅ Yu Yang ⋅ Junkai Zhang ⋅ Wei Wang ⋅ Bo Li

The rapid advancement of large language models (LLMs) necessitates effective mechanisms to ensure their responsible deployment by accurately distinguishing unsafe content from benign content. While substantial safety datasets are available in English, multilingual safety modeling remains underexplored due to limited open-source safety datasets in other languages. Even within English datasets, safe yet sensitive corner-case content is scarce, leading to shortcut learning by models and non-trivial false-positive rates. To mitigate these issues, we introduce a novel minimax reinforcement learning (RL) framework wherein a data generator and a classifier model co-evolve, facilitating the production of high-quality synthetic multilingual safety data. We theoretically formalize this interaction as a minimax game and rigorously demonstrate convergence to a Nash equilibrium. Empirical evaluations confirm that our synthetic data generation method significantly enhances the classifier model performance, enabling a substantially smaller model to surpass the state-of-the-art by nearly 10\% on English benchmarks while achieving 4.5$\times$ faster inference speed. These results establish a scalable and efficient methodology for synthetic data generation, advancing the development of safer and more robust multilingual LLM deployments.

Safe active learning (AL) is a sequential scheme for learning unknown systems while respecting safety constraints during data acquisition. Existing methods often rely on Gaussian processes (GPs) to model the task and safety constraints, requiring repeated GP updates and constrained acquisition optimization--incurring significant computations which are challenging for real-time decision-making. We propose amortized AL for regression and amortized safe AL, replacing expensive online computations with a pretrained neural policy. Inspired by recent advances in amortized Bayesian experimental design, we leverage GPs as pretraining simulators. We train our policy prior to the AL deployment on simulated nonparametric functions, using Fourier feature-based GP sampling and a differentiable acquisition objective that is safety-aware in the safe AL setting. At deployment, our policy selects informative and (if desired) safe queries via a single forward pass, eliminating GP inference and acquisition optimization. This leads to magnitudes of speed improvements while preserving learning quality. Our framework is modular and, without the safety component, yields fast unconstrained AL for time-sensitive tasks.


Differentially Private Minimum Spanning Tree in Euclidean Graphs

Zongrui Zou ⋅ Alessandro Epasto ⋅ Chenglin Fan ⋅ Rudrajit Das

We study differentially private approximation of minimum spanning trees (MST) and hierarchical clustering for Euclidean graph embeddings. Our algorithms achieve an optimal trade-off, providing a $(1+\eta)$-multiplicative approximation with $\tilde{O}(n/\eta^2)$ additive error under $\rho$-dist privacy. Furthermore, we establish a separation between Euclidean and general graphs by proving a lower bound of $\Omega(n^{1.5})$ additive error for general graphs under a similar privacy notion, demonstrating that better utility is indeed achievable for geometric data. Our algorithm can also be directly applied to clustering tasks based on specific MST algorithms, incurring only a minimal loss in the approximation guarantee compared to its non-private counterpart.