Skip to yearly menu bar Skip to main content


Poster Session

Poster Session 2

Main Ballroom
Sun 3 May 7 a.m. PDT — 10 a.m. PDT
Abstract:
Chat is not available.


1
Efficient Uncoupled Learning Dynamics with $\tilde{O}\left(T^{-1/4}\right)$ Last-Iterate Convergence in Bilinear Saddle-Point Problems over Convex Sets under Bandit Feedback

ARNAB MAITI ⋅ Claire Jie Zhang ⋅ Kevin Jamieson ⋅ Jamie Morgenstern ⋅ Ioannis Panageas ⋅ Lillian Ratliff

In this paper, we study last-iterate convergence of learning algorithms in bilinear saddle-point problems, a preferable notion of convergence that captures the day-to-day behavior of learning dynamics. We focus on the challenging setting where players select actions from compact convex sets and receive only bandit feedback. Our main contribution is the design of an uncoupled learning algorithm that guarantees last-iterate convergence to the Nash equilibrium with high probability. We establish a convergence rate of $\tilde{O}(T^{-1/4})$ up to polynomial factors in problem parameters. Crucially, our proposed algorithm is computationally efficient, requiring only an efficient linear optimization oracle over the players' compact action sets. The algorithm is obtained by combining techniques from experimental design and the classic Follow-The-Regularized-Leader (FTRL) framework, with a carefully chosen regularizer function tailored to the geometry of the action set of each learner.


10
Rank Lifting and Random Non-Linear Maps

Andrea Drago ⋅ Maria Sofia Bucarelli ⋅ Francesco Caso ⋅ Marius Michetti ⋅ Federico Siciliano ⋅ Fabrizio Silvestri ⋅ Luca Becchetti

Deep neural networks exhibit improved training and generalization performance as the number of parameters grows well beyond the size of the training set, contradicting classical intuitions about overfitting. In order to gain a better understanding of this “benign overparameterization”, we analyze the representational capacity of a random one-hidden-layer perceptron with Gaussian weights, no bias and threshold activations. More precisely, we investigate the following question: when does a hidden layer of dimension $n$ maps $k$ input vectors with pairwise angles at least $\theta$, to a full-rank activation matrix, thus ensuring that a simple linear classifier can perfectly fit those inputs in feature space? This problem has an immediate impact on memorization capacity at initialization and we frame it as a question about hyperplane arrangements on the unit sphere, and we prove new isoperimetric-like inequalities. This allows us to derive non-trivial lower bounds on the probability that a random embedding avoids the arrangement’s zero-measure regions. Our results show that once the hidden dimension exceeds a threshold (depending on $\theta$ and the input dimension), hidden representations are linearly independent with high probability. While the case we consider is challenging due to the sparsity of the solution space, this setting highlights crucial, underlying geometric problems and connections to related questions in spherical geometry and linear algebra.

Probabilities of causation are valuable concepts for explainable artificial intelligence (XAI) and personalized decision-making. Pearl (2009) defined the probabilities of causation from the viewpoint of "necessity", "sufficiency", and "necessity and sufficiency" in the context of structural causal models. In addition, Tian and Pearl (2000) and Kuroki and Cai (2011) provided the identification conditions of the probabilities of causation under the monotonicity assumption. However, these identification conditions are described based on "the joint probabilities of observed random variables" and/or "causal risks" without selection biases. Thus, they are not applicable to studies in the presence of confounding and selection biases. To address this problem, this paper provides novel identification conditions for the probabilities of causation by using (i) two proxy covariates and (ii) an instrumental variable and a proxy covariate. When the probabilities of causation can be evaluated through the proposed identification conditions, new plug-in estimators for these probabilities are presented. Finally, we illustrate the application of our results on a real-world dataset.

Fairness in machine learning has become a critical concern, particularly in high-stakes applications. Existing approaches often focus on achieving full fairness across all score ranges generated by predictive models, ensuring fairness in both high and low-scoring populations. However, this stringent requirement can compromise predictive performance and may not align with the practical fairness concerns of stakeholders. In this work, we propose a novel framework for building partially fair machine learning models, which enforce fairness within a specific score range of interest, such as the middle range where decisions are most contested, while maintaining flexibility in other regions. We introduce two statistical metrics to rigorously evaluate partial fairness within a given score range, such as the top 20%–40% of scores. To achieve partial fairness, we propose an in-processing method by formulating the model training problem as constrained optimization with difference-of-convex constraints, which can be solved by an inexact difference-of-convex algorithm (IDCA). We provide the complexity analysis of IDCA for finding a nearly KKT point. Through numerical experiments on real-world datasets, we demonstrate that our framework achieves high predictive performance while enforcing partial fairness where it matters most.

In recent years, the neural tangent kernel (NTK) and neural network Gaussian process kernel (NNGP) have given theoreticians tractable limiting cases of fully connected neural networks. However, the property of these kernels are poorly understood for activation functions other than powers of the ReLU. Our main contribution is a characterization of the RKHS of these kernels for activation functions whose only non-smoothness is at zero. This extends existing theory to numerous commonly used activation functions such as SELU, ELU, or LeakyReLU. Additionally, we analyze a broad set of special cases such as missing biases, two-layer networks, or polynomial activations. Our results show that a broad class of not infinitely smooth activations generate equivalent RKHSs at different network depths, depending only on the degree of the non-smoothness up to equivalence. On the other hand, the RKHS generated by polynomial activations depends on the network depth. Finally, we derive results for the smoothness of NNGP sample paths, characterizing the smoothness of infinitely wide neural networks at initialization.


103
CTRLS: Chain-of-Thought Reasoning via Latent State-Transition

Junda Wu ⋅ Yuxin Xiong ⋅ Xintong Li ⋅ Sheldon Yu ⋅ Zhengmian Hu ⋅ Tong Yu ⋅ Rui Wang ⋅ Xiang Chen ⋅ Jingbo Shang ⋅ Julian McAuley

Chain-of-thought (CoT) reasoning enables large language models (LLMs) to break down complex problems into interpretable intermediate steps, significantly enhancing model transparency and performance in reasoning tasks. However, conventional CoT methods rely on heuristic sampling without structured modeling of reasoning transitions, constraining their ability to systematically explore and discover diverse and effective reasoning trajectories. In this work, we introduce CTRLS, a framework that formulates CoT reasoning as a Markov decision process (MDP) with latent state transitions, enabling principled and state-aware exploration via distributional reinforcement learning. By modelling reasoning actions as explicit probability distributions in latent space, our approach explicitly models epistemic uncertainty, facilitating robust exploration of the reasoning space. As part of our framework, we introduce an on-policy reinforcement learning strategy incorporating epsilon-greedy exploration and entropy-based regularization to iteratively refine latent state transitions without requiring additional fine-tuning of the underlying LLM. Theoretical analyses provide evidence lower bounds (ELBO), theoretically grounding our transition-aware modeling of latent reasoning dynamics. Further experiments demonstrate improvements in reasoning accuracy, diversity, and exploration efficiency across benchmark reasoning tasks.

Online topic models are unsupervised algorithms to identify latent topics in data streams that continuously evolve over time. Although these methods naturally align with real-world scenarios, they have received considerably less attention from the community compared to their offline counterparts, due to specific additional challenges. To tackle these issues, we present SB-SETM, an innovative model extending the Embedded Topic Model (ETM) to process data streams by merging models formed on successive partial document batches. To this end, SB-SETM (i) leverages a truncated stick-breaking construction for the topic–per-document distribution, enabling the model to automatically infer from the data the appropriate number of active topics at each timestep; and (ii) introduces a merging strategy for topic embeddings based on a continuous formulation of optimal transport adapted to the high dimensionality of the latent topic space. Numerical experiments show SB-SETM outperforming baselines on simulated scenarios. We extensively test it on a real-world corpus of news articles covering the Russian–Ukrainian war throughout 2022–2023.


105
Linear Reasoning Vs. Proof by Cases: Obstacles for Large Language Models in FOL Problem Solving

Yuliang Ji ⋅ Fuchen Shen ⋅ Jian Wu ⋅ Qiujie Xie ⋅ Yue Zhang

To comprehensively evaluate the mathematical reasoning capabilities of Large Language Models (LLMs), researchers have introduced abundant mathematical reasoning datasets. However, most existing datasets primarily focus on linear reasoning, neglecting other parts such as proof by contradiction and proof by cases, which are crucial for investigating LLMs’ reasoning abilities. To address this limitation, we first introduce a novel first-order logic (FOL) dataset named PC-FOL, annotated by professional mathematicians, focusing on case-based reasoning problems. All instances in this dataset are equipped with a manually written natural language proof, clearly distinguishing it from conventional linear reasoning datasets. Our experimental results over leading LLMs demonstrate a substantial performance gap between linear reasoning and case-based reasoning problems. To further investigate this phenomenon, we provide a theoretical analysis grounded in graphical model, which provides an explanation for the observed disparity between the two types of reasoning problems. We hope this work can reveal the core challenges in the field of automated natural language mathematical proof generation, paving the way for future research.

We analyze the long term behavior of hyperbolic neural networks through subhomogeneous layer maps, focusing on stability, growth control, and robustness under stochastic perturbations. This work unifies the standard hyperbolic models via explicit isometries and Möbius operations, allowing statements to be transported across representations without loss of geometric meaning. Within this model invariant view, we study iterated, noise perturbed transformations and develop an ergodic theoretic framework that characterizes their asymptotic behavior, including conditions that promote stability and convergence of averaged iterates. Beyond theory, these insights inform practical design choices for training procedures that remain well-behaved in the presence of noise and avoid unbounded parameter growth, thereby supporting more reliable use of hyperbolic representations in hierarchical and graph structured learning tasks.


107
Minimizing Human Intervention in Online Classification

William Réveillard ⋅ Vasileios Saketos ⋅ Alexandre Proutiere ⋅ Richard Combes

Training or fine-tuning large language model (LLM)–based systems often requires costly human feedback, yet there is limited understanding of how to minimize such intervention while maintaining strong error guarantees. We study this problem for LLM-based classification systems in an active learning framework: an agent sequentially labels $d$-dimensional query embeddings drawn i.i.d. from an unknown distribution by either calling a costly expert or guessing with no feedback, with the goal of minimizing regret relative to an oracle with free expert access. When the horizon $T$ is at least exponential in the embedding dimension $d$, the geometry of the class regions can be learned. In this regime, we propose the Conservative Hull-based Classifier (CHC), which maintains convex hulls of expert-labeled queries and calls the expert when a query lands outside all known hulls. CHC attains $\mathcal{O}(\log^d T)$ regret in $T$ and is minimax optimal for $d=1$. Otherwise, the geometry cannot be reliably learned in general. We show that for queries drawn from a subgaussian mixture and $T \le e^d$, a Center-based Classifier (CC) achieves regret proportional to $N\log{N}$ where $N$ is the number of labels. To bridge these regimes, we introduce the Generalized Hull-based Classifier (GHC), a practical extension of CHC that enables more aggressive guessing via a tunable parameter. Our approach is validated on real-world question-answering datasets using state-of-the-art text embedding models.


109
Incentivizing Truthful Submissions in a Data Marketplace for Mean Estimation

Keran Chen ⋅ Alex Clinton ⋅ Kirthevasan Kandasamy

We study a data marketplace where a broker intermediates between buyers, who seek to estimate the mean $\mu$ of an unknown normal distribution $N(\mu, \sigma^2)$, and contributors, who can collect data from this distribution at a cost. The broker delegates data collection work to contributors, aggregates reported datasets, sells it to buyers, and redistributes revenue as payments to contributors. We aim to maximize welfare or profit under key constraints: individual rationality for buyers and contributors, incentive compatibility (contributors are incentivized to comply with data collection instructions and truthfully report the collected data), and budget balance (total contributor payments equals total revenue). We first compute welfare/profit-optimal prices under truthful reporting; however, to incentivize data collection and truthful data reporting, we adjust them based on discrepancies in contributors' reported data. This yields a Nash equilibrium (NE) where the two lowest-cost contributors collect all data. We complement this with two hardness results: $\mathcal{(i)}$ no nontrivial dominant-strategy incentive-compatible mechanism exists in this problem, and $\mathcal{(ii)}$ no mechanism outperforms ours in a NE.

Despite the enormous success of Hamiltonian Monte Carlo and related Markov Chain Monte Carlo (MCMC) methods, sampling often still represents the computational bottleneck in scientific applications. Availability of parallel resources can significantly speed up MCMC inference by running a large number of chains in parallel, each collecting a single sample. However, the parallel approach converges slowly if the chains are not initialized close to the target distribution (cold start). Theoretically this can be resolved by initially running MCMC without Metropolis-Hastings adjustment to quickly converge to the vicinity of the target distribution and then turn on adjustment to achieve fine convergence. However, no practical scheme uses this strategy, due to the difficulty of automatically selecting the step size during the unadjusted phase. We here develop Late Adjusted Parallel Sampler (LAPS), which is precisely such a scheme and is applicable out of the box. LAPS takes advantage of ensemble-based hyperparameter adaptation to estimate the bias at each iteration and converts it to the appropriate step size. We show that LAPS consistently and significantly outperforms ensemble adjusted methods such as MEADS or ChESS and the optimization-based initializer Pathfinder on a variety of standard benchmark problems. LAPS typically achieves two orders of magnitude lower wall-clock time than the corresponding sequential algorithms such as NUTS.


110
Kernel Treatment Effects with Adaptively Collected Data

Houssam Zenati ⋅ Bariscan Bozkurt ⋅ Arthur Gretton

Adaptive experiments improve efficiency by adjusting treatment assignments based on past outcomes, but this adaptivity breaks the i.i.d.\ assumptions that underpin classical asymptotics. At the same time, many questions of interest are distributional, extending beyond average effects. Kernel treatment effects (KTE) provide a flexible framework by representing interventional outcome distributions in an RKHS and comparing them via kernel distances. We present the first kernel-based framework for distributional inference under adaptive data collection. Our method combines doubly robust RKHS scores with a witness function learned on one fold, and performs inference on a second fold using a projected, sequentially normalized scalar statistic with valid type-I error. Experiments show that the resulting procedure is well calibrated and effective for both mean shifts and higher-moment differences, outperforming adaptive baselines limited to scalar effects.


111
AlphaFold's Bayesian Roots in Probability Kinematics

Thomas Hamelryck ⋅ Kanti Mardia

The seminal breakthrough of AlphaFold in protein structure prediction relied on a learned potential energy function parameterized by deep models, in contrast to its successors AlphaFold2 and AlphaFold3, which lack an explicit probabilistic interpretation. While AlphaFold’s potential was originally justified by heuristic analogy to physical potentials of mean force, we show that it can instead be understood as a principled instance of probability kinematics (PK), also known as Jeffrey conditioning, a generalization of Bayesian updating. This reinterpretation reveals that AlphaFold is a generalized Bayesian model that explicitly defines a posterior distribution over structures, providing a deeper explanation of its success and a foundation for future model design. To demonstrate this framework with precision, we introduce a tractable synthetic model in which an angular random walk prior is updated with distance-based evidence via PK, directly mirroring AlphaFold’s mechanism. This setting allows us to explore the probabilistic foundations of AlphaFold in a clear and interpretable way. Our work connects a landmark in protein structure prediction to a broader class of compositional deep generative models and points to new opportunities for principled probabilistic approaches.


112
Learning Physical Operators using Neural Operators

Vignesh Gopakumar ⋅ Ander Gray ⋅ Daniel Giles ⋅ Lorenzo Zanisi ⋅ Matt Kusner ⋅ Timo Betcke ⋅ Stanislas Pamela ⋅ Marc Deisenroth

Neural operators have emerged as promising surrogate models for solving partial differential equations (PDEs), but struggle to generalise beyond training distributions and are often constrained to a fixed temporal discretisation. This work introduces a physics-informed training framework that addresses these limitations by decomposing PDEs using operator splitting methods, training separate neural operators to learn individual non-linear physical operators while approximating linear operators with fixed finite-difference convolutions. This modular mixture-of-experts architecture enables generalisation to novel physical regimes by explicitly encoding the underlying operator structure. We formulate the modelling task as a neural ordinary differential equation (ODE) where these learned operators constitute the right-hand side, enabling continuous-in-time predictions through standard ODE solvers and implicitly enforcing PDE constraints. Demonstrated on incompressible and compressible Navier--Stokes equations, our approach achieves better convergence and superior performance when generalising to unseen physics. The method remains parameter-efficient, enabling temporal extrapolation beyond training horizons, and provides interpretable components whose behaviour can be verified against known physics.

Machine learning has become increasingly popular in informing data-driven policy-making. Policies influence behavior in individuals or populations, and ideally, through observational signals, policy-makers learn which policies are effective. However, in many settings, individual actions cannot be perfectly observed. This issue, known in economics as moral hazard, poses a significant challenge. In this work, we study the foundational multitasking principal–agent contract design problem and demonstrate how instrumental regression and the generalized method of moments (GMM) estimator can be used to estimate or learn a good contract. As a bonus result, we also give a uniformity characterization of the shape of the optimal contract.


115
Private Synthetic Graph Generation and Fused Gromov-Wasserstein Distance

Leoni Carla Wirth ⋅ Gholamali Aminian ⋅ Gesine Reinert

Networks are popular representations of complex data. In particular, differentially private synthetic networks are much in demand. Here, instead of starting from a network, we start with the complex data set itself and construct both a network representation and a corresponding synthetic network generator. We build a network model directly based on the underlying complex system data, capturing its structure and attributes. Using a random connection model, we devise an effective algorithmic approach for generating attributed synthetic networks which is $\epsilon$-differentially private at the vertex level, while preserving utility. We provide theoretical guarantees for the accuracy of the private synthetic networks using the fused Gromov-Wasserstein distance.


116
Momentum SVGD-EM for Accelerated Maximum Marginal Likelihood Estimation

Adam Rozzio ⋅ Rafael Athanasiades ⋅ O. Deniz Akyildiz

Maximum marginal likelihood estimation (MMLE) can be recast as the optimization of the so-called free energy. The celebrated expectation-maximisation (EM) procedure can then be interpreted as a coordinate-descent scheme in the space of parameters and measures. Recently, a significant body of work has adopted this perspective, leading to interacting particle algorithms for MMLE. In this paper, we propose an accelerated version of one such procedure, based on Stein variational gradient descent (SVGD), by introducing Nesterov momentum in both the parameter updates and in the space of probability measures. The resulting method, termed Momentum SVGD-EM, consistently accelerates convergence in terms of required iterations across various tasks of increasing difficulty, demonstrating effectiveness in both low- and high-dimensional settings.


117
Rate optimal learning of equilibria from data

Till Freihaut ⋅ Luca Viano ⋅ Emanuele Nevali ⋅ Volkan Cevher ⋅ Matthieu Geist ⋅ Giorgia Ramponi

We close open theoretical gaps in Multi-Agent Imitation Learning (MAIL) by characterizing the limits of non-interactive MAIL and presenting the first interactive algorithm with near-optimal sample complexity. In the non-interactive setting, we prove a statistical lower bound that identifies the \emph{all-policy deviation concentrability coefficient} as the fundamental complexity measure, and we show that Behavior Cloning (BC) is rate-optimal. For the interactive setting, we introduce a framework that combines reward-free reinforcement learning with interactive MAIL and instantiate it with an algorithm, \emph{\ours}. It improves the best previously known sample complexity from $\mathcal{O}(\varepsilon^{-8})$ to $\mathcal{O}(\varepsilon^{-2}),$ matching the dependence on $\varepsilon$ implied by our lower bound. Finally, we provide numerical results that support our theory and illustrate, in environments such as grid worlds, cases where Behavior Cloning fails to learn.


118
Hypergraph Neural Networks Accelerate MUS Enumeration

Hiroya Ijima ⋅ Koichiro Yawata

Enumerating Minimal Unsatisfiable Subsets (MUSes) is a fundamental task in constraint satisfaction problems (CSPs). Its major challenge is the exponential growth of the search space, which becomes particularly severe when satisfiability checks are expensive. Recent machine learning approaches reduce this cost for Boolean satisfiability problems but rely on explicit variable-constraint relationships, limiting their application domains. This paper proposes a domain-agnostic method to accelerate MUS enumeration using Hypergraph Neural Networks (HGNNs). The proposed method incrementally builds a hypergraph with constraints as vertices and MUSes enumerated until the current step as hyperedges, and employs an HGNN-based agent trained via reinforcement learning to minimize the number of satisfiability checks required to obtain an MUS. Experimental results demonstrate the effectiveness of our approach in accelerating MUS enumeration, showing that our method can enumerate more MUSes within the same satisfiability check budget compared to conventional methods.


119
Adversarial Robustness in One-Stage Learning-to-Defer

Yannis Montreuil ⋅ Letian Yu ⋅ Axel Carlier ⋅ Lai Xing Ng ⋅ Wei Ooi

Learning-to-Defer (L2D) enables hybrid decision-making by routing inputs either to a predictor or to external experts. While promising, L2D is highly vulnerable to adversarial perturbations, which can not only flip predictions but also manipulate deferral decisions. Prior robustness analyses focus solely on two-stage settings, leaving open the end-to-end (one-stage) case where predictor and allocation are trained jointly. We introduce the first framework for adversarial robustness in one-stage L2D, covering both classification and regression. Our approach formalizes attacks, proposes cost-sensitive adversarial surrogate losses, and establishes theoretical guarantees including $\mathcal{H}$, $(\mathcal{R }, \mathcal{F})$, and Bayes consistency. Experiments on benchmark datasets confirm that our methods improve robustness against untargeted and targeted attacks while preserving clean performance.

Accurate tree height estimation is vital for ecological monitoring and biomass assessment. We apply quantile regression to existing tree height estimation models based on satellite data to incorporate uncertainty quantification. Most current approaches for tree height estimation rely on point predictions, which limits their applicability in risk-sensitive scenarios. In this work, we show that, with minor modifications of a given prediction head, existing models can be adapted to provide statistically calibrated uncertainty estimates via quantile regression. Furthermore, we demonstrate how our results correlate with known challenges in remote sensing (e.g., terrain complexity, vegetation heterogeneity), indicating that the model is less confident in more challenging conditions.


120
From Hawkes Processes to Attention: Time-Modulated Mechanisms for Event Sequences

Xinzi Tan ⋅ Kejian Zhang ⋅ Junhan Yu ⋅ Doudou Zhou

Marked Temporal Point Processes (MTPPs) arise naturally in medical, social, commercial, and financial domains. However, existing Transformer-based methods mostly inject temporal information only via positional encodings, relying on shared or parametric decay structures, which limits their ability to capture heterogeneous and type-specific temporal effects. Inspired by this observation, we derive a novel attention operator called Hawkes Attention from the multivariate Hawkes process theory for MTPP, using learnable per-type neural kernels to modulate query, key and value projections, thereby replacing the corresponding parts in the traditional attention. Benefited from the design, Hawkes Attention unifies event timing and content interaction, learning both the time-relevant behavior and type-specific excitation patterns from the data. The experimental results show that our method achieves better performance compared to the baselines. In addition to the general MTPP, our attention mechanism can also be easily applied to specific temporal structures, such as time series forecasting.


121
Gradient-Flow SDEs Have Unique Transient Population Dynamics

Vincent Guan ⋅ Joseph Janssen ⋅ Nicolas Lanzetti ⋅ Antonio Terpin ⋅ Geoffrey Schiebinger ⋅ Elina Robeva

Identifying the drift and diffusion of an SDE from its population dynamics is a notoriously challenging task. Researchers in machine learning and single-cell biology have only been able to prove a partial identifiability result: for potential-driven SDEs, the gradient-flow drift can be identified from temporal marginals if the Brownian diffusivity is already known. Existing methods therefore assume that the diffusivity is known a priori, despite it being unknown in practice. We dispel the need for this assumption by providing a complete characterization of identifiability: the gradient-flow drift and Brownian diffusivity are jointly identifiable from temporal marginals if and only if the process is observed outside of equilibrium. Given this fundamental result, we propose nn-APPEX, the first Schrödinger Bridge–based inference method that can simultaneously learn the drift and diffusion of a gradient-flow SDE solely from observed marginals. Extensive experiments show that nn-APPEX's ability to adjust its diffusion estimate enables accurate inference, while previous Schrödinger Bridge methods obtain biased drift estimates due to their assumed, and likely incorrect, diffusion.


122
Time Series Forecasting with Hahn Kolmogorov-Arnold Networks

Zahidul Hasan ⋅ Abdessamad Ben Hamza ⋅ Nizar Bouguila

Recent Transformer- and MLP-based models have demonstrated strong performance in long-term time series forecasting, yet Transformers remain limited by their quadratic complexity and permutation-equivariant attention, while MLPs exhibit spectral bias. We propose HaKAN, a versatile model based on Kolmogorov-Arnold Networks (KANs), leveraging Hahn polynomial-based learnable activation functions and providing a lightweight and interpretable alternative for multivariate time series forecasting. Our model integrates channel independence, patching, a stack of Hahn-KAN blocks with residual connections, and a bottleneck structure comprised of two fully connected layers. The Hahn-KAN block consists of inter- and intra-patch KAN layers to effectively capture both global and local temporal patterns. Extensive experiments on various forecasting benchmarks demonstrate that our model consistently outperforms recent state-of-the-art methods, with ablation studies validating the effectiveness of its core components.


123
Undersmoothing Black-Box Models for Functional Estimation

Yue Yu ⋅ Debarghya Mukherjee ⋅ Moulinath Banerjee ⋅ Yaacov Ritov

We study functional estimation using black-box models through a model-agnostic undersmoothing framework. The proposed procedure \texttt{Rep} operates by augmenting the original dataset through replicating a proportion of samples multiple times, and subsequently applying the black-box algorithm to the augmented dataset. This construction automatically induces undersmoothing and reduces the functional estimation error. We provide several empirical demonstrations (including neural network based learners) showing that compared to the plug-in estimator, the proposed algorithm \texttt{Rep} improves the estimation accuracy of functional estimation without requiring explicit expressions for the associated influence functions. Furthermore, we develop a theoretical analysis in two representative settings, the Nadaraya–Watson estimator and the random feature model, establishing that replication provides explicit prescriptions for the replication proportion and number of copies, and yields optimal convergence rates for functional estimation. In the classical nonparametric regression setting, we extend \texttt{Rep} with a Lepski-style method that adapts to unknown structural features of the regression function.

In many science and industry settings, a central challenge is designing experiments under time and budget constraints. Bayesian Optimal Experimental Design (BOED) is a paradigm to pick maximally informative designs that has been widely applied to such problems. During training, BOED selects inputs according to a pre-determined acquisition criterion to target informativeness. During testing, the model learned during training encounters a naturally occurring distribution of test samples. This leads to an instance of covariate shift, where the train and test samples are drawn from different distributions (the training samples are not representative of the test distribution). Prior work has shown that in the presence of model misspecification, covariate shift amplifies generalization error. Our first contribution is to provide a mathematical analysis of generalization error in the presence of model misspecification, revealing that, beyond covariate shift, generalization error is also driven by a previously unidentified phenomenon we term error (de-)amplification. We then develop a new acquisition function that mitigates the effects of model misspecification by including terms for representativeness, informativeness, and de-amplification (R-IDeA). Our experimental results demonstrate that the proposed method performs better than methods that target only informativeness, only representativeness, or both.


126
Evaluation of Large Language Models via Coupled Token Generation

Nina Corvelo Benz ⋅ Stratis Tsirtsis ⋅ Eleni Straitouri ⋅ Ivi Chatzi ⋅ Ander Artola Velasco ⋅ Suhas Thejaswi ⋅ Manuel Gomez Rodriguez

State-of-the-art large language models rely on randomization to respond to a prompt. Consequently, a model may respond differently to the same prompt if asked multiple times. In this work, we argue that the evaluation and ranking of large language models should control for this randomization. Our starting point is the development of a causal model for coupled autoregressive generation, which allows different large language models to sample responses with the same source of randomness. Building upon our causal model, we first show that, on evaluations based on benchmark datasets, coupled autoregressive generation leads to the same conclusions as vanilla autoregressive generation but using provably fewer samples. However, we further show that, on evaluations based on pairwise comparisons, the two approaches can surprisingly lead to different rankings when comparing more than two models. To complement our theoretical results, we conduct experiments with several models from the $\texttt{Llama}$, $\texttt{Mistral}$ and $\texttt{Qwen}$ families. We find that, across multiple benchmark datasets, coupled autoregressive generation requires up to $75$\% fewer samples to reach the same conclusions as vanilla autoregressive generation. Further, we find that the win-rates derived from pairwise comparisons by a strong large language model to prompts from the LMSYS Chatbot Arena platform differ under coupled and vanilla autoregressive generation.


127
Training Latent Diffusion Models with Interacting Particle Algorithms

Tim Y. J. Wang ⋅ Juan Kuntz ⋅ O. Deniz Akyildiz

We introduce a novel particle-based algorithm for end-to-end training of latent diffusion models. We reformulate the training task as minimizing a free energy functional and obtain a gradient flow that does so. By approximating the latter with a system of interacting particles, we obtain the algorithm, which we underpin theoretically by providing error guarantees. The novel algorithm compares favorably in experiments with previous particle-based methods and variational inference analogues.


128
Optimal Query Allocation in Extractive QA with LLMs: A Learning-to-Defer Framework with Theoretical Guarantees

Yannis Montreuil ⋅ Yeo Heng ⋅ Axel Carlier ⋅ Lai Xing Ng ⋅ Wei Ooi

Large Language Models (LLMs) excel at generative language tasks but remain unreliable for structured prediction, particularly in extractive question answering (EQA), where success depends on precise span selection. These challenges are amplified in resource-constrained environments, such as mobile or embedded systems, where deploying high-capacity models is often infeasible. We propose a Learning-to-Defer framework that routes EQA queries across a pool of models with varying capabilities and costs to balance accuracy and efficiency. Our approach is grounded in statistical decision theory: we define a differentiable surrogate loss whose minimizer provably converges to the Bayes-optimal allocation policy. Experiments on SQuADv1, SQuADv2, and TriviaQA show that our method consistently improves the accuracy-efficiency trade-off relative to static baselines and prior routing heuristics. Overall, our framework provides a principled and scalable solution for EQA in both high-performance and on-device deployment settings.


129
On the Misinformation in a Statistical Experiment

Jake Callahan ⋅ Tommie Catanach

The principle that more informative experiments are always better is a cornerstone of Bayesian experimental design. This principle assumes the practitioner's model and inference are correct. In practice, both the data-generating model and the inferential approximation are inevitably misspecified, and we show that under these conditions the classical framework for comparing experiments breaks down. Designs ranked as most informative can become actively harmful, amplifying bias to produce confident but incorrect inferences. We demonstrate that the commonly-accepted axioms of experimental utility, such as Blackwell monotonicity, fail under misspecification, and that information measures proposed to handle it, like the Expected Generalized Information Gain (EGIG), do not obey these axioms. To resolve this, we propose a generalized axiomatic framework for robust Bayesian experimental design. We prove that EGIG satisfies our axioms as a criterion that penalizes inferential error, providing a principled foundation for its use in Bayesian experimental design. As a complementary approach, we derive a new measure that instead penalizes model error. Finally, we demonstrate our framework's utility across common modes of misspecification, showing it provides a reliable guide for experimental design where classical methods fail.

Evaluating joint probabilities of potential outcomes and observed variables, and their linear combinations, is a fundamental challenge in causal inference. This paper addresses the bounding and identification of these probabilities in settings with discrete treatment and discrete outcome. We propose new families of monotonicity assumptions and formulate the bounding problem as a linear programming problem. We further introduce a new monotonicity assumption specifically to achieve identification. Finally, we present numerical experiments to validate our methods and demonstrate their application using real-world datasets.


130
LatticeVision: Image to Image Networks for Modeling Non-Stationary Spatial Data

Antony Sikorski ⋅ Michael Ivanitskiy ⋅ Nathan Lenssen ⋅ Douglas Nychka ⋅ Daniel McKenzie

In many applications, we wish to fit a parametric statistical model to a small ensemble of spatially distributed random variables ('fields'). However, parameter inference using maximum likelihood estimation (MLE) is computationally prohibitive, especially for large, non-stationary fields. Thus, many recent works train neural networks to estimate parameters given spatial fields as input, sidestepping MLE completely. In this work we focus on a popular class of parametric, spatially autoregressive (SAR) models. We make a simple yet impactful observation; because the SAR parameters can be arranged on a regular grid, both inputs (spatial fields) and outputs (model parameters) can be viewed as images. Using this insight, we demonstrate that image-to-image (I2I) networks enable faster and more accurate parameter estimation for a class of non-stationary SAR models with unprecedented complexity.


131
Reconciling Communication Compression and Byzantine-Robustness in Distributed Learning

Diksha Gupta ⋅ Antonio Honsell ⋅ Chuan Xu ⋅ Nirupam Gupta ⋅ Giovanni Neglia

Distributed learning enables scalable model training over decentralized data, but remains hindered by Byzantine faults and high communication costs. While both challenges have been studied extensively in isolation, their interplay has received limited attention. Prior work has shown that naively combining communication compression with Byzantine-robust aggregation can severely weaken resilience to faulty nodes. The current state-of-the-art, Byz-DASHA-PAGE, leverages a momentum-based variance reduction scheme to counteract the negative effect of compression noise on Byzantine robustness. In this work, we introduce RoSDHB, a new algorithm that integrates classical Polyak momentum with a coordinated compression strategy. Theoretically, RoSDHB matches the convergence guarantee of Byz-DASHA-PAGE under the standard $(G, B)$-gradient dissimilarity model, but relies on milder assumptions. Empirically, RoSDHB demonstrates stronger robustness while achieving substantial communication savings compared to Byz-DASHA-PAGE.

We propose a novel approach that achieves constant activation memory usage during the training of convolutional neural networks (CNNs), addressing a key memory bottleneck in the backward pass. By reconstructing activations required for gradient matrix calculation through the proposed efficient convolution activation inversion (ECAI) rather than storing them in memory during forward pass, it becomes possible to maintain constant activation memory usage across convolution layers. We formulate the activation inversion problem as a set of $n$ systems of linear equations derived from forward convolution operations, and solve them with an accelerated method that achieves $\mathcal{O}(n^2)$ complexity. The proposed approach enables memory-constrained mobile, edge, and embedded devices to perform CNN training without a growth of activation memory over the model capacity while also enhancing training memory efficiency for large-sized images on commercial GPUs. The experimental results demonstrate that the proposed approach maintains constant activation memory by reusing a fixed memory space, improving memory efficiency without degradation in model accuracy. The memory savings achieved by the proposed method increase when using more convolution layers, potentially achieving near-zero activation (e.g., $30\times$ or more activation memory reduction in specific setups). The code implementation is available at an anonymous GitHub.

Bayesian inference for doubly-intractable pairwise exponential graphical models typically involves variations of the exchange algorithm or approximate Markov chain Monte Carlo (MCMC) samplers. However, existing methods for both classes of algorithms require either perfect samplers or sequential samplers for complex models, which are often either not available, or suffer from poor mixing, especially in high dimensions. We develop a method that does not require perfect or sequential sampling, and can be applied to both classes of methods: exact and approximate MCMC. The key to our approach is to utilize the tractable independence model underlying the intractable probabilistic graphical model for the purpose of constructing a finite sample unbiased Monte Carlo (and not MCMC) estimate of the Metropolis--Hastings ratio. This innovation turns out to be crucial for scalability in high dimensions. The method is demonstrated on the Ising model. Gradient-based alternatives to construct a proposal, such as Langevin and Hamiltonian Monte Carlo approaches, also arise as a natural corollary to our general procedure, and are demonstrated as well.


134
Evidence Estimation in Gaussian Graphical Models Using a Telescoping Block Decomposition of the Precision Matrix

Anindya Bhadra, Ksheera Sagar, David Rowe, Sayantan Banerjee and Jyotishka Datta

We study the bidding problem in multi-unit discriminatory (pay-as-bid) auctions from the perspective of a single bidder whose utility is given by value minus $\alpha$ times payment, where $\alpha \in [0,1]$ is the *cost-of-capital* parameter. The bidder aims to maximize cumulative utility over $T$ rounds subject to budget constraints. Even without budgets, the problem is nontrivial since the bidding space is exponential in $M$, the bidder’s maximum demand, and the valuation vector (treated as the context) varies over time. Leveraging the decomposition of the utility function across units, we develop polynomial-time learning algorithms based on directed acyclic graphs (DAGs) under both full-information and bandit feedback that achieve sublinear regret. Due to *complete cross learning over contexts*, where the utility under the observed context reveals counterfactual utilities, the regret bound in the bandit setting is independent of the number of contexts. With budgets, we design a primal-dual based algorithm: the DAG-based procedure performs the primal updates, while online gradient descent (OGD) is used for dual updates. This yields $\rho$-approximate sublinear regret, where $\rho \in (0,1]$ denotes the average per-unit, per-round budget.


136
Sharp Risk Bounds for Early-stopping in Gaussian Linear Regression

Tobias Wegel ⋅ Gil Kur ⋅ Patrick Rebeschini

We study early-stopped mirror descent (ESMD) for high-dimensional Gaussian linear regression over arbitrary convex bodies and design matrices, where the task is to minimize the in-sample mean squared error. Our main result shows that some of the sharpest risk bounds for the least squares estimator (LSE), based on the local Gaussian width, extend to ESMD. We derive sufficient conditions on the potential, expressed via the Minkowski functional, under which our result holds. These conditions allow us to construct new potentials and analyze existing ones. Our results then yield general sufficient conditions for minimax optimality of ESMD, provide a systematic comparison with the LSE, and establish the tightest known risk bound in the $\ell_1$-constrained setting.


137
A Bayesian Information-Theoretic Approach to Data Attribution

Dharmesh Tailor ⋅ Nicolò Felicioni ⋅ Kamil Ciosek

Training Data Attribution (TDA) seeks to trace model predictions back to influential training examples, enhancing interpretability and safety. We formulate TDA as a Bayesian information-theoretic problem: subsets are scored by the information loss they induce—the entropy increase at a query when removed. This criterion credits examples for resolving predictive uncertainty rather than label noise. To scale to modern networks, we approximate information loss using a Gaussian Process surrogate built from tangent features. We show this aligns with classical influence scores for single-example attribution while promoting diversity for subsets. For even larger-scale retrieval, we relax to an information-gain objective and add a variance correction for scalable attribution in vector databases. Experiments show competitive performance on counterfactual sensitivity, ground-truth retrieval and coreset selection, showing that our method scales to modern architectures while bridging principled measures with practice.


138
Variational inference via radial transport

Luca Ghafourpour ⋅ Sinho Chewi ⋅ Alessio Figalli ⋅ Aram-Alexandre Pooladian

In variational inference (VI), the practitioner approximates a high-dimensional distribution $\pi$ with a simple surrogate one, often a (product) Gaussian distribution. However, in many cases of practical interest, Gaussian distributions might not capture the correct radial profile of $\pi$, resulting in poor coverage. In this work, we approach the VI problem from the perspective of optimizing over these radial profiles. Our algorithm $\texttt{radVI}$ is a cheap, effective add-on to many existing VI schemes, such as Gaussian (mean-field) VI and Laplace approximation. We provide theoretical convergence guarantees for our algorithm, owing to recent developments in optimization over the Wasserstein space—the space of probability distributions endowed with the Wasserstein distance—and new regularity properties of radial transport maps in the style of Caffarelli (2000).


139
ConMeZO: Adaptive Descent-Direction Sampling for Gradient-Free Finetuning of Large Language Models

Lejs Deen Behric ⋅ Liang Zhang ⋅ Bingcong Li ⋅ Kiran Thekumparampil

Zeroth‑order or derivative-free optimization (MeZO) is an attractive strategy for finetuning large language models (LLMs) because it eliminates the memory overhead of backpropagation. However, it converges slowly due to the inherent curse of dimensionality when searching for descent directions in the high-dimensional parameter space of billion-scale LLMs. We propose ConMeZO, a novel zeroth‑order optimizer that accelerates convergence by adaptive directional sampling. Instead of drawing the direction uniformly at random, ConMeZO restricts the sampling to a cone centered around a momentum estimate. This concentrates the search in directions where the true gradient is more likely to lie and thus reduces the effect of high dimensions. We prove that ConMeZO achieves the same worst-case convergence rate as MeZO. Empirically, when finetuning LLMs on natural language tasks, ConMeZO is up to 2$\times$ faster than MeZO while retaining the low‑memory footprint of zeroth-order methods.


14
Adaptive Coverage Policies in Conformal Prediction

Etienne Gauthier ⋅ Francis Bach ⋅ Michael Jordan

Traditional conformal prediction methods construct prediction sets such that the true label falls within the set with a user-specified coverage level. However, poorly chosen coverage levels can result in uninformative predictions, either producing overly conservative sets when the coverage level is too high, or empty sets when it is too low. Moreover, the fixed coverage level cannot adapt to the specific characteristics of each individual example, limiting the flexibility and efficiency of these methods. In this work, we leverage recent advances in e-values and post-hoc conformal inference, which allow the use of data-dependent coverage levels while maintaining valid statistical guarantees. We propose to optimize an adaptive coverage policy by training a neural network using a leave-one-out procedure on the calibration set, allowing the coverage level and the resulting prediction set size to vary with the difficulty of each individual example. We support our approach with theoretical coverage guarantees and demonstrate its practical benefits through a series of experiments.


140
Optimistic Actor-Critic with Parametric Policies for Linear Markov Decision Processes

Max Lin ⋅ Reza Asad ⋅ Kevin Tan ⋅ Haque Ishfaq ⋅ Csaba Szepesvari ⋅ Sharan Vaswani

Although actor-critic methods have been successful in practice, their theoretical analyses have several limitations. Specifically, existing theoretical work either sidesteps the exploration problem by making strong assumptions or analyzes impractical methods with complicated algorithmic modifications. Moreover, the actor-critic methods analyzed for linear MDPs often employ natural policy gradient and construct "implicit" policies without explicit parameterization. Such policies are computationally expensive to sample from, making the environment interactions inefficient. To that end, we focus on the finite-horizon linear MDPs and propose an optimistic actor-critic framework that uses parametric log-linear policies. In particular, we introduce a tractable $\textit{logit-matching}$ regression objective for the actor. For the critic, we use approximate Thompson sampling via Langevin Monte Carlo to obtain optimistic value estimates. We prove that the resulting algorithm achieves $\widetilde{\mathcal{O}}(\epsilon^{-4})$ and $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity in the on-policy and off-policy setting, respectively. Our results match prior theoretical work in achieving the state-of-the-art sample complexity, while our algorithm is more aligned with practice.


141
Counterfactual Explanations via Latent Structure for Time Series Classification

Akihiro Yamaguchi ⋅ Shizuo Kaji ⋅ Kaname Matsue ⋅ Ryusei Shingaki

There is a growing need for explainability in time series classification. Counterfactual (CF) generation creates in-distribution synthetic instances that flip the prediction to a desired class. We propose CELT, a model-agnostic CF generation method for time-series classifiers, including non-differentiable and one-class models. In the development phase, CELT learns a structured latent space in which desired-class latent instances form clusters and other latent instances are pushed away. In addition, the design enables segment-wise, time-local edits. In the deployment phase, CELT efficiently generates CFs by editing a minimal number of time-local segments, guided by the learned structure. We formulate both phases as mathematically sound optimization problems that uniformly handle supervised and one-class classification, and we demonstrate effectiveness on UCR datasets.


142
ReTrack: Data Unlearning in Diffusion Models through Redirecting the Denoising Trajectory

Qitan Shi ⋅ Cheng Jin ⋅ Jiawei Zhang ⋅ Yuantao Gu

Diffusion models excel at generating high-quality, diverse images but also suffer from undesirable training data memorization, raising critical privacy and safety concerns. Data unlearning has emerged to mitigate this issue by removing the influence of specific data through fine-tuning rather than retraining from scratch. We propose ReTrack, a fast and effective data unlearning method for diffusion models. ReTrack employs importance sampling to construct a more efficient unbiased fine-tuning loss. This loss is further approximated by retaining only the dominant terms, thereby reducing computational cost. This yields an interpretable objective that redirects denoising trajectories toward the $k$-nearest neighbors, enabling efficient unlearning while preserving generative quality. Experiments on MNIST T-Shirt, CelebA-HQ, CIFAR-10, and Stable Diffusion show that ReTrack achieves state-of-the-art performance, striking the best trade-off between unlearning strength and generation quality preservation.


143
Thompson Sampling-like Algorithms for Stochastic Rising Bandits

Marco Fiandri ⋅ Alberto Maria Metelli ⋅ Francesco Trovò

Stochastic rising rested bandit (SRRB) is a setting where the arms' expected rewards increase as they are pulled. It models scenarios in which the performances of the options grow as an effect of an underlying learning process (e.g., online model selection). Even if the bandit literature provides specifically crafted algorithms based on upper-confidence bounds for such a setting, no study about Thompson sampling (TS)-like algorithms has been performed so far. The strong regularity of the expected rewards in the SRRB setting suggests that specific instances may be tackled effectively using adapted and sliding-window TS approaches. This work provides novel regret analyses for such algorithms in SRRBs, highlighting the challenges and providing new technical tools of independent interest. Our results allow us to identify under which assumptions TS-like algorithms succeed in achieving sublinear regret and which properties of the environment govern the complexity of the regret minimization problem when approached with TS. Furthermore, we provide a regret lower bound based on a complexity index we introduce. Finally, we conduct numerical simulations comparing TS-like algorithms with state-of-the-art approaches for SRRBs in synthetic and real-world settings.


144
The Riemannian Geometry Associated to Gradient Flows of Linear Convolutional Networks

El Mehdi Achour ⋅ Kathlén Kohn ⋅ Holger Rauhut

We study geometric properties of the gradient flow for learning deep linear convolutional networks. For linear fully connected networks, it has been shown recently that the corresponding gradient flow on parameter space can be written as a Riemannian gradient flow on function space (i.e., on the product of weight matrices) if the initialization satisfies a so-called balancedness condition. We establish that the gradient flow on parameter space for learning linear convolutional networks can be written as a Riemannian gradient flow on function space regardless of the initialization. This result holds for $D$-dimensional convolutions with $D \geq 2$, and for $D =1$ it holds if all so-called strides of the convolutions are greater than one. The corresponding Riemannian metric depends on the initialization.


145
Active Subspaces in Infinite Dimension

Poorbita Kundu ⋅ Nathan Wycoff

Active subspace analysis uses the leading eigenspace of the gradient's second moment to conduct supervised dimension reduction. In this article, we extend this methodology to real-valued functionals on Hilbert space. We define an operator which coincides with the active subspace matrix when applied to a Euclidean space. We show that many of the desirable properties of Active Subspace analysis extend directly to the infinite dimensional setting. We also propose a Monte Carlo procedure and discuss its convergence properties. Finally, we deploy this methodology to create visualizations as well as improve modeling and optimization on complex test problems.

In high-stakes AI applications, even a single action can cause irreparable damage. However, nearly all of sequential decision-making theory assumes that all errors are recoverable (e.g., by bounding rewards). Standard bandit algorithms that explore aggressively may cause irreparable damage when this assumption fails. Some prior work avoids irreparable errors by asking for help from a mentor, but a mentor may not always be available. In this work, we formalize a model of learning with unbounded rewards without a mentor as a two-action contextual bandit with an abstain option: at each round the agent observes an input and chooses either to abstain (always 0 reward) or to commit (execute a preexisting task policy). Committing yields rewards that are upper-bounded but can be arbitrarily negative, and the commit reward is assumed Lipschitz in the input. We propose a caution-based algorithm that learns when not to learn: it chooses a trusted region and commits only where the available evidence does not already certify harm. Under these conditions and i.i.d. inputs, we establish sublinear regret guarantees, theoretically demonstrating the effectiveness of cautious exploration for deploying learning agents safely in high-stakes environments.


147
Fast and Robust Simulation-Based Inference With Optimization Monte Carlo

Vasilis Gkolemis ⋅ Christos Diou ⋅ Michael Gutmann

Bayesian parameter inference for complex stochastic simulators is challenging due to intractable likelihood functions. Existing simulation-based inference methods often require large number of simulations and become costly to use in high-dimensional parameter spaces or in problems with partially uninformative outputs. We propose a new method for differentiable simulators that delivers accurate posterior inference with substantially reduced runtimes. Building on the Optimization Monte Carlo framework, our approach reformulates inference for stochastic simulators in terms of deterministic optimization problems. Gradient-based methods are then applied to efficiently navigate toward high-density posterior regions and avoid wasteful simulations in low-probability areas. A JAX-based implementation further enhances the performance through vectorization of key method components. Extensive experiments, including high-dimensional parameter spaces, uninformative outputs, multiple observations and multimodal posteriors show that our method consistently matches, and often exceeds, the accuracy of state-of-the-art approaches, while reducing the runtime by a substantial margin.


148
Deformed Decomposition for Non-negative Tensors

Kazu Ghalamkari ⋅ Petr Taborsky ⋅ Morten Mørup

Non-negative tensor factorization finds widespread use in numerous applications, however, its global optimization has been a long-standing challenge. As such, the Frobenius norm minimization even in the rank-1 setting is an NP-hard problem. We presently reformulate tensor decompositions using deformed algebra, and show that the best rank-1 approximation thereby reduces to a convex optimization problem for the rich χ-divergence family. Building on this foundation, we propose the deformed many-body approximation for non-negative tensors, which expands model capacity while maintaining global optimality by preserving the flatness of the model manifold. Introducing latent variables, for a subclass of χ-divergences, we further develop an Expectation-Maximization-based framework for the deformed extension of traditional low-rank approximations as iterative convex subproblems. We empirically demonstrate in tensor-based discrete density estimation that the deformed decompositions induce regularization and robustness against noise and mislabelled data. Beyond ordinary tensor algebra, our findings provide a factorization framework that enables us to leverage various divergences with convex rank-1 and many-body approximations.


149
On the optimal regret of collaborative personalized linear bandits

Bruce Huang ⋅ Ruida Zhou ⋅ Lin Yang ⋅ Suhas Diggavi

Stochastic linear bandits are a fundamental model for sequential decision making. Although well studied in the single-agent setting, many real-world scenarios involve multiple agents solving heterogeneous bandit problems, each with a different unknown parameter. This paper investigates the optimal regret achievable in collaborative personalized linear bandits. We derive an information-theoretic lower bound showing how the number of agents, the number of rounds, and the degree of heterogeneity jointly affect regret. We propose a two-stage collaborative algorithm that achieves the optimal regret. We model heterogeneity via a hierarchical Bayesian framework and introduces a novel information-theoretic technique for bounding regret. Our results offer a complete characterization of when and how collaboration helps with a optimal regret bound $\tilde{O}(d\sqrt{mn})$, $\tilde{O}(dm^{1-\gamma}\sqrt{n})$, $\tilde{O}(dm\sqrt{n})$ for the number of rounds $n$ in the range of $o \left( \frac{d}{m \sigma^2} \right)$, $\Theta \left( \frac{d}{m^{2\gamma} \sigma^2} \right)$ and $\omega \left( \frac{d}{\sigma^2}, \right)$ respectively, where $\sigma$ measures the level of heterogeneity, $m$ is the number of agents, and $\gamma\in[0, 1/2]$ is an absolute constant. In contrast, without collaboration achieves a regret bound $O(dm\sqrt{n})$ at best.


15
Lag Operator SSMs: A Geometric Framework for Structured State Space Modeling

Sutashu Tomonaga ⋅ Kenji Doya ⋅ Noboru Murata

Structured State Space Models (SSMs), which are at the heart of the recently popular Mamba architecture, are powerful tools for sequence modeling. However, their theoretical foundation relies on a complex, multi-stage process of continuous-time modeling and subsequent discretization, which can obscure intuition. We introduce a direct, first-principles framework for constructing discrete-time SSMs that is both flexible and modular. Our approach is based on a novel lag operator, which geometrically derives the discrete-time recurrence by measuring how the system's basis functions undergo what we call a domain expansion from one timestep to the next. The resulting state matrices are computed via a single inner product involving this operator, enabling a modular design space for creating novel SSMs by flexibly combining different basis functions and time-warping schemes. To validate our framework, we demonstrate that a specific instance exactly recovers the recurrence of the influential HiPPO model. Numerical simulations confirm our derivation, providing new theoretical tools for designing flexible and robust sequence models.


150
Bayesian Inverse Transition Learning: Learning Dynamics from Near-Optimal Trajectories

Leo Benac ⋅ Abhishek Sharma ⋅ Sonali Parbhoo ⋅ Finale Doshi-Velez

We consider the problem of estimating the transition dynamics from near-optimal expert trajectories in the context of offline model-based reinforcement learning. We develop a novel constraint-based method, Inverse Transition Learning, that treats the limited coverage of the expert trajectories as a feature: we use the fact that the expert is near-optimal to inform our estimate of. We integrate our constraints into a Bayesian approach. Across both synthetic environments and real healthcare scenarios like Intensive Care Unit (ICU) patient management in hypotension, we demonstrate not only significant improvements in decision-making, but that our posterior can inform when transfer will be successful.


151
Conservative Inference in Switchback Experiments

Jose Blanchet ⋅ Peter Glynn ⋅ Ramesh Johari ⋅ Linjia Wu ⋅ Wenqian Xing

Switchback experiments are widely used in dynamic systems---such as ridesharing platforms and online marketplaces---to evaluate interventions under interference. However, the standard pipeline of estimating the average treatment effect (ATE) with a difference-in-means (DM) estimator can exhibit systematic bias in dynamic settings with evolving system state, due to intertemporal dependence (``carryover effects"). In this paper, we study this bias in a continuous-time Markov chain model of switchback experiments with stochastically monotone dynamics and state-monotone rewards; these are reasonable representations of mean-reverting and auto-regressive systems. We show the DM estimator systematically underestimates the true ATE, because it targets an average of transient treatment effects rather than the ATE itself. Using the Ornstein-Uhlenbeck process as a tractable example, we derive closed-form expressions for bias and variance; this analysis shows that standard approaches overestimate the true variance. Taken together, these effects mean that standard switchback experiment analysis yields overly conservative inference. We validate our theory using a ride-sharing simulation with real-world calibration.


152
The Role of Causal Features in Strategic Classification for Robustness and Alignment

Antonio Gois ⋅ Sophia Gunluk ⋅ Nir Rosenfeld ⋅ Nidhi Hegde ⋅ Simon Lacoste-Julien ⋅ Dhanya Sridhar

In strategic classification, an institution (e.g., a bank) anticipates adaptation from users who change their features to increase utility in a classification task (e.g., loan repayment). Since a key challenge is the distribution shift induced by users, we turn to causal models, which have been shown to bound the worst-case out-of-distribution (OOD) risk, and establish several new results that link causality and strategic classification. First, we show that causal classification leads to optimal classification error after any sufficiently large adaptation, when the noise is bounded in a certain way. Second, when these assumptions do not hold, we show OOD cross-entropy risk of optimal classifiers decomposes into an OOD bias term and a term arising from not using all observable features, allowing us to understand when causal classifiers have an advantage. Finally, we show that the use of causal features can allow alignment of long-term incentives between institutions and users, contrasting with previous work that highlights social costs of such approaches. We validate our theory empirically on synthetic data, finding that our results predict behavior in practice.


153
Efficient Logistic Regression with Mixture of Sigmoids

Federico Di Gennaro ⋅ Saptarshi Chakraborty ⋅ Nikita Zhivotovskiy

This paper studies the Exponential Weights (EW) algorithm with an isotropic Gaussian prior for online logistic regression. We show that the near-optimal worst-case regret bound $O(d\log(Bn))$ for EW, established by \citet{kakade_ng_bayesianalg} against the best linear predictor of norm at most $B$, can be achieved with total worst-case computational complexity $\tilde O(B^3 n^5)$. This substantially improves on the $O(B^{18}n^{37})$ complexity of prior work achieving the same guarantee \citep{foster2018logistic}. Beyond efficiency, we analyze the large-$B$ regime under linear separability: after rescaling by $B$, the EW posterior converges as $B\to\infty$ to a standard Gaussian truncated to the version cone. Accordingly, the predictor converges to a solid-angle vote over separating directions and, on every fixed-margin slice of this cone, the mode of the corresponding truncated Gaussian is aligned with the hard-margin SVM direction. Using this geometry, we derive non-asymptotic regret bounds showing that once $B$ exceeds a margin-dependent threshold, the regret becomes independent of $B$ and grows only logarithmically with the inverse margin. Overall, our results show that EW can be both computationally tractable and geometrically adaptive in online classification.


154
An Evaluation of Cost Functions for Algorithmic Recourse

Eoin Kenny ⋅ Allan Anzagira ⋅ Tom Bewley ⋅ Freddy Lecue ⋅ Manuela Veloso

Algorithmic recourse is a field concerned with offering actionable recommendations to individuals who have received adverse outcomes from automated systems. Most recourse algorithms assume access to a cost function, which quantifies the effort involved in following these suggestions. However, to date, there has been no serious benchmarking of these functions both from a computational and human perspective. In this paper, we propose four metrics to evaluate whether currently popular cost functions in recourse satisfy the minimal requirements for meaningful distance calculations. In addition, we also propose extensions to current approaches using large-language models (LLMs) as surrogate human labellers, which are prompted with a cost-based desiderata. Experiments revealed that methods focused on the Bradley-Terry model perform best, but only when scaled up with our proposed LLM extensions, which would be the recommended choice in practice. We expect our insights to help practitioners in training and designing appropriate cost functions in the future.


155
Neuron Block Dynamics for XOR Classification with Zero-Margin

Guillaume Braun ⋅ Masaaki Imaizumi

The ability of neural networks to learn useful features through stochastic gradient descent (SGD) is a cornerstone of their success. Most theoretical analyses focus on regression or on classification tasks with a positive margin, where worst-case gradient bounds suffice. In contrast, we study zero-margin nonlinear classification by analyzing the Gaussian XOR problem, where inputs are Gaussian and the XOR decision boundary determines labels. In this setting, a non-negligible fraction of data lies arbitrarily close to the boundary, breaking standard margin-based arguments. Building on Glasgow’s (2024) analysis, we extend the study of training dynamics from discrete to Gaussian inputs and develop a framework for the dynamics of neuron blocks. We show that neurons cluster into four directions and that block-level signals evolve coherently, a phenomenon essential in the Gaussian setting where individual neuron signals vary significantly. Leveraging this block perspective, we analyze generalization without relying on margin assumptions, adopting an average-case view that distinguishes regions of reliable prediction from regions of persistent error. Numerical experiments confirm the predicted two-phase block dynamics and demonstrate their robustness beyond the Gaussian setting.


156
Hyperbolic Part-Whole Image Segmentation

Mikhail Vlasenko ⋅ Mina GhadimiAtigh ⋅ Pascal Mettes

Semantic segmentation typically focuses on pixel-level classification at the object level. Yet, objects naturally decompose into parts and subparts, mirroring human visual perception. In this work, we introduce a hyperbolic prototypical segmentation framework capable of simultaneously representing multiple granularity levels within a unified embedding space. Leveraging hyperbolic geometry's unique capacity to model hierarchies effectively, we propose to embed class prototypes within the Poincaré ball. We introduce a tree-aware prototype initialization strategy and a distortion-p loss that together yield improved hierarchical embeddings. Furthermore, we derive an optimized formulation of the hyperbolic distance function, enabling tractable inference for dense prediction tasks. A shared transformer encoder paired with separate hyperbolic heads allows efficient multi-level segmentation from a single model. Experiments on the recently introduced SubPartImageNet show that our approach (i) improves over the state-of-the-art, especially at the subpart and part levels, at a fraction of the number of parameters, (ii) enables zero-shot generalization, and (iii) allows for transfer from part- to object-level predictions without object-level supervision. All code is available at https://github.com/mikhail-vlasenko/hyp-segmentation.

We design an efficient algorithm that outputs tests for identifying predominantly homogeneous subcohorts of patients from large in-homogeneous datasets. Our theoretical contribution is a rounding technique, similar to that of Goemans and Wiliamson (1995), which approximates the optimal solution within a factor of $0.82$. As an application, we use our algorithm to trade-off sensitivity for specificity to systematically identify clinically interesting homogeneous subcohorts of patients in the RNA microarray data set for breast cancer from Curtis et al. (2012). One identified subcohort suggests a link between LXR over-expression and BRCA2 and MSH6 methylation levels for patients in that subcohort.


158
Provably Efficient Reinforcement Learning for Sparse Dynamical Systems with Non-Gaussian Noise

Davide Maran ⋅ Gianmarco Tedeschi ⋅ Enea Gusmeroli ⋅ Marcello Restelli

The recent development of sparse methods for identifying nonlinear dynamical systems has opened new avenues for efficient and interpretable model-based reinforcement learning (RL). In this work, we study online RL in environments where the system dynamics, modeled as $s'=f(s,a)+$ noise, is assumed to be sparse with respect to a big feature map, a structural idea inspired by the SINDy framework. We introduce an optimistic algorithm that combines online sparse regression with confidence set construction to guide exploration and planning. On the theoretical side, we provide the first regret bounds for sparse nonlinear dynamics, showing that regret scales with the sparsity level $d_0$. This result holds even when relaxing standard Gaussian noise assumptions by allowing a much more general, non-parametric, family of densities and when the model is misspecified. The algorithm achieving the regret bound is not computationally efficient, as it relies on a very computationally intensive online regression method. To bridge this gap, we propose a practical variant that draws inspiration from theoretical principles but incorporates more scalable components. We adopt SINDy for sparse system identification algorithm and couple it with SAC in a Dyna-style planning framework. Empirical results on classic continuous control tasks demonstrate the practical viability and robustness of our approach.


159
Conditional Flow Matching for Bayesian Posterior Inference

Percy Zhai ⋅ Sowon Jeong ⋅ Veronika Rockova

We propose a generative multivariate posterior sampler via flow matching. It offers a simple training objective, and does not require access to likelihood evaluation. The method learns a dynamic, block-triangular velocity field in the joint space of data and parameters, which results in a deterministic transport map from a source distribution to the desired posterior. The inverse map, named vector rank, is accessible by reversibly integrating the velocity over time. It is advantageous to leverage the dynamic design: proper constraints on the velocity yield a monotone map, which leads to a conditional Brenier map, enabling a fast and simultaneous generation of Bayesian credible sets whose contours correspond to level sets of Monge-Kantorovich data depth. Our approach is computationally lighter compared to GAN-based and diffusion-based counterparts, and is capable of capturing complex posterior structures. Finally, frequentist theoretical guarantee on the consistency of the recovered posterior distribution, and of the corresponding Bayesian credible sets, is provided.


16
Demystifying Transition Matching: When and Why It Can Beat Flow Matching

Jaihoon Kim ⋅ Rajarshi Saha ⋅ Youngsuk Park ⋅ Minhyuk Sung

Flow Matching (FM) underpins many state-of-the-art generative models, yet recent results indicate that Transition Matching (TM) can achieve higher quality with fewer sampling steps. This work answers the question of when and why TM outperforms FM. First, when the target is a unimodal Gaussian distribution, we prove that TM attains strictly lower KL divergence than FM for finite number of steps. The improvement arises from stochastic difference latent updates in TM, which preserve target covariance that deterministic FM underestimates. We then characterize convergence rates, showing that TM achieves faster convergence than FM under a fixed compute budget. Second, we extend the analysis to Gaussian mixtures and identify local–unimodality regimes in which the sampling dynamics approximate the unimodal case, where TM can outperform FM. The approximation error decreases as the minimal distance between component means increases, highlighting that TM is favored when the modes are well separated. However, when the target variance approaches zero, each TM update converges to the FM update, and the performance advantage of TM diminishes. In summary, we show that TM outperforms FM when the target distribution has well-separated modes and non-negligible variances. We validate our theoretical results with controlled experiments on Gaussian distributions, and extend the comparison to real-world applications in image and video generation.


160
DIVERSED: Relaxed Speculative Decoding via Dynamic Ensemble Verification

Ziyi Wang ⋅ Siva Rajesh Kasa ⋅ Ankith M S ⋅ SANTHOSH KASA ⋅ Jiaru Zou ⋅ Sumit Negi ⋅ Ruqi Zhang ⋅ Nan Jiang ⋅ Qifan Song

Speculative decoding is an effective technique for accelerating large language model inference by drafting multiple tokens in parallel. In practice, its speedup is often bottlenecked by a rigid verification step that strictly enforces the accepted token distribution to exactly match the target model. This constraint leads to the rejection of many plausible tokens, lowering the acceptance rate and limiting overall time speedup. To overcome this limitation, we propose DynamIc VErification RElaxed SpEculative Decoding (DIVERSED), a relaxed verification framework that improves time efficiency while preserving generation quality. DIVERSED: learns an ensemble-based verifier that blends the draft and target model distributions with a task-dependent and context-dependent weight. We provide theoretical justification for our approach and demonstrate empirically that DIVERSED achieves substantially higher inference efficiency compared to standard speculative decoding methods. Code is available at: \url{https://github.com/comeusr/diversed}.


161
Near-optimal Rank Adaptive Inference of High Dimensional Matrices

Frédéric Zheng ⋅ Yassir Jedra ⋅ Alexandre Proutiere

We address the problem of estimating a high-dimensional matrix from linear measurements, with a focus on designing optimal rank-adaptive algorithms. These algorithms infer the matrix by estimating its singular values and the corresponding singular vectors up to an effective rank, adaptively determined based on the data. We establish, for the first time, instance-specific lower bounds for the sample complexity of such algorithms. We uncover fundamental trade-offs in selecting the effective rank: balancing the precision of estimating a subset of singular values against the approximation cost incurred for the remaining ones. Our analysis identifies how the optimal effective rank depends on the matrix being estimated, the sample size, and the noise level. We propose an algorithm that combines a Least-Squares estimator with a universal singular value thresholding procedure. We provide finite-sample error bounds for this algorithm, that are tighter than those of existing rank-adaptive algorithms. Furthermore, our bounds nearly match the derived fundamental limits. Finally, we confirm experimentally that our algorithm outperforms existing rank-adaptive algorithms.


162
On the Convergence and Stability of Distributed Sub-model Training

Yuyang Deng ⋅ Fuli Qiao ⋅ Mehrdad Mahdavi

As learning models continue to grow in size, enabling on-device local training of these models has emerged as a critical challenge in federated learning. A popular solution is sub-model training, where the server only distributes randomly sampled sub-models to the edge clients, and clients only update these small models. However, those random sampling of sub-models may not give satisfying convergence performance. In this paper, observing the success of SGD with shuffling, we propose a distributed shuffled sub-model training, where the full model is partitioned into several sub-models in advance, and the server shuffles those sub-models, sends each of them to clients at each round, and by the end of local updating period, clients send back the updated sub-models, and server averages them. We establish the convergence rate of this algorithm. We also study the generalization of distributed sub-model training via stability analysis, and find that the sub-model training can improve the generalization via amplifying the stability of training process. The extensive experiments also validate our theoretical findings.


163
Best Student Paper
We Still Don’t Understand High-Dimensional Bayesian Optimization

Colin Doumont ⋅ Donney Fan ⋅ Natalie Maus ⋅ Jacob Gardner ⋅ Henry Moss ⋅ Geoff Pleiss

High-dimensional spaces have historically challenged Bayesian optimization (BO). Existing methods aim to overcome this curse of dimensionality by carefully encoding structural assumptions, from locality to sparsity to smoothness, into the optimization procedure. Surprisingly, we demonstrate that these approaches are outperformed by arguably the simplest method imaginable: Bayesian linear regression. After applying a geometric transformation to avoid boundary-seeking behaviour, Gaussian processes with linear kernels yield state-of-the-art performance on tasks with 60- to 6,000-dimensional search spaces. Linear models offer numerous advantages over their non-parametric counterparts: they afford closed-form acquisition function optimization, they yield asymptotically lower regret, and their computation scales linearly with data, a fact we exploit on molecular optimization tasks with >20,000 observations. Coupled with empirical and theoretical analyses, our results suggest the need to depart from past intuitions about BO methods in high-dimensional spaces.


164
Best Paper Award
EventFlow: Forecasting Temporal Point Processes with Flow Matching

Gavin Kerrigan ⋅ Kai Nelson ⋅ Padhraic Smyth

Continuous-time event sequences, in which events occur at irregular intervals, are ubiquitous across a wide range of industrial and scientific domains. The contemporary modeling paradigm is to treat such data as realizations of a temporal point process, and in machine learning it is common to model temporal point processes in an autoregressive fashion using a neural network. While autoregressive models are successful in predicting the time of a single subsequent event, their performance can degrade when forecasting longer horizons due to cascading errors and myopic predictions. We propose EventFlow, a non-autoregressive generative model for temporal point processes. The model builds on the flow matching framework in order to directly learn joint distributions over event times, side-stepping the autoregressive process. EventFlow is simple to implement and achieves a 20\%-53\% lower forecast error than the nearest baseline on standard TPP benchmarks while simultaneously using fewer model calls at sampling time.


165
Complexity-Aware Deep Symbolic Regression with Robust Risk-Seeking Policy Gradients

Zachary Bastiani ⋅ Mike Kirby ⋅ Jacob Hochhalter ⋅ Shandian Zhe

We propose a novel deep symbolic regression (DSR) approach to enhance the robustness and interpretability of data-driven mathematical expression discovery. Existing DSR methods are built on recurrent neural networks, solely guided by data fitness, and potentially meet tail barriers that can zero out the policy gradient, causing inefficient model updates. To address these issues, we design a decoder-only architecture that performs attention in the frequency domain and introduce a dual-indexed position encoding to conduct layer-wise generation. Second, we propose a Bayesian information criterion (BIC)-based reward function that can automatically adjust the trade-off between expression complexity and data fitness, without the need for explicit manual tuning. Third, we develop a ranking-based weighted policy update method that eliminates the tail barriers and enhances training effectiveness. Extensive benchmarks and systematic experiments demonstrate the advantages of our approach. We have released our implementation at https://github.com/ZakBastiani/CADSR.


166
Creator Incentives in Recommender Systems: A Cooperative Game-Theoretic Approach for Stable and Fair Collaboration in Multi-Agent Bandits

Ramakrishnan Krishnamurthy ⋅ Arpit Agarwal ⋅ Lakshmi Subramanian ⋅ Maximilian Nickel

We study a collaborative variant of the stochastic linear bandit problem in a multi-agent setting, motivated by real-world recommender systems where multiple content creators indirectly influence each other’s outcomes. In our formulation, agents interact with a shared environment but may choose to form coalitions, sharing observations to improve collective learning efficiency. We formalize this setup as a transferable utility (TU) cooperative game, where the value of a coalition is defined as the negative sum of cumulative regrets incurred by its members. This framework allows us to examine how algorithmic design and structural assumptions about agents---such as identical vs. heterogeneous action sets---affect collaboration incentives. We show that under some algorithmic conditions, the induced TU game exhibits desirable properties: for identical agents, the game is convex and admits a non-empty core containing the Shapley value, ensuring stable and equitable collaboration. For heterogeneous agents, we demonstrate core non-emptiness and propose a simple, implementable payoff mechanism that satisfies all but one Shapley value axioms. Experimental results on problem instances derived from MovieLens-100k dataset further illustrate how the empirical payout aligns and diverges from ideal cooperative outcome for different settings. Our results offer a principled lens for designing collaborative learning systems that are both effective and incentive-aligned.


167
Pure Exploration with Infinite Answers

Riccardo Poiani ⋅ Martino Bernasconi ⋅ Andrea Celli

We study pure exploration problems where the set of correct answers is possibly infinite, e.g., the regression of any continuous function of the means of the bandit. We derive an instance-dependent lower bound for these problems. By analyzing it, we discuss why existing methods (i.e., Sticky Track-and-Stop) for finite answer problems fail at being asymptotically optimal in this more general setting. Finally, we present a framework, Sticky-Sequence Track-and-Stop, which generalizes both Track-and-Stop and Sticky Track-and-Stop, and that enjoys asymptotic optimality. Due to its generality, our analysis also highlights special cases where existing methods enjoy optimality.


169
Representation Learning via Non-Contrastive Mutual Information

Zhaohan Daniel Guo ⋅ Bernardo Avila Pires ⋅ Khimya Khetarpal ⋅ Dale Schuurmans ⋅ Bo Dai

Labeling data is often very time consuming and expensive, leaving us with a majority of unlabeled data. Self-supervised representation learning methods such as SimCLR (Chen et al., 2020) or BYOL (Grill et al., 2020) have been very successful at learning meaningful latent representations from unlabeled image data, resulting in much more general and transferable representations for downstream tasks. Broadly, self-supervised methods fall into two types: 1) Contrastive methods, such as SimCLR; and 2) Non-Contrastive methods, such as BYOL. Contrastive methods are generally trying to maximize mutual information between related data points, so they need to compare every data point to every other data point, resulting in high variance, and thus requiring large batch sizes to work well. Non-contrastive methods like BYOL have much lower variance as they do not need to make pairwise comparisons, but are much trickier to implement as they have the possibility of collapsing to a constant vector. In this paper, we aim to develop a self-supervised objective that combines the strength of both types. We start with a particular contrastive method called the Spectral Contrastive Loss (HaoChen et al., 2021; Lu et al., 2024), and we convert it into a more general non-contrastive form; this removes the pairwise comparisons resulting in lower variance, but keeps the mutual information formulation of the contrastive method preventing collapse. We call our new objective the Mutual Information Non-Contrastive (MINC) loss. We test MINC by learning image representations on ImageNet (similar to SimCLR and BYOL) and show that it consistently improves upon the Spectral Contrastive loss baseline.


17
Corruption-robust Offline Multi-agent Reinforcement Learning from Human Feedback

Andi Nika ⋅ Debmalya Mandal ⋅ Parameswaran Kamalaruban ⋅ Adish Singla ⋅ Goran Radanovic

We consider robustness against data corruption in offline multi-agent reinforcement learning from human feedback (MARLHF) under a strong‐contamination model: given a dataset $D$ of trajectory–preference tuples (each preference being an $n$-dimensional binary label vector representing each of the $n$ agents’ preferences), an $\epsilon$-fraction of the samples may be arbitrarily corrupted. We model the problem using the framework of linear Markov games. First, under a \emph{uniform coverage} assumption—where every policy of interest is sufficiently represented in the clean (prior to corruption) data—we introduce a robust estimator that guarantees an $O(\epsilon^{1-o(1)})$ bound on the Nash‐equilibrium gap. Next, we move to the more challenging unilateral coverage setting, in which only a Nash equilibrium and its single‐player deviations are covered: here our proposed algorithm achieves an $O(\sqrt{\epsilon})$ Nash‐gap bound. Both of these procedures, however, suffer from intractable computation. To address this, we relax our solution concept to coarse correlated equilibria (CCE). Under the same unilateral‐coverage regime, we then derive a quasi-polynomial‐time algorithm whose CCE gap scales as $O(\sqrt{\epsilon})$. To the best of our knowledge, this is the first systematic treatment of adversarial data corruption in offline MARLHF.


170
Why is prompting hard? Understanding prompts on binary sequence predictors

Li Kevin Wenliang ⋅ Anian Ruoss ⋅ Jordi Grau-Moya ⋅ Marcus Hutter ⋅ Tim Genewein

Frontier models can be prompted or conditioned to do many tasks, but finding good prompts is not always easy, nor is understanding some performant prompts. We view prompting as finding the best conditioning sequence on a near-optimal sequence predictor. On numerous well-controlled experiments, we show that unintuitive optimal conditioning sequences can be better understood given the pretraining distribution, which is not usually available. Even using exhaustive search, reliably identifying optimal prompts for practical neural predictors can be surprisingly difficult. Popular prompting methods, such as using demonstrations from the targeted task, can be surprisingly suboptimal. Using the same empirical framework, we analyze optimal prompts on frontier models, revealing patterns similar to the binary examples and previous findings. Taken together, this work takes an initial step towards understanding optimal prompts, from a statistical and empirical perspective that complements research on frontier models.


171
On the Role of Depth in the Expressivity of RNNs

Maude Lizaire ⋅ Michael Rizvi-Martel ⋅ Éric Dupuis ⋅ Guillaume Rabusseau

The benefits of depth in feedforward neural networks (FNNs) are well known: composing multiple layers of linear transformations with nonlinear activations enables complex computations. While similar effects are expected in recurrent neural networks (RNNs), it remains unclear how depth interacts with recurrence to shape expressive power. Here, we formally show that depth increases RNNs’ memory capacity efficiently with respect to parameters, enhancing expressivity both by enabling more complex input transformations and improving the retention of past information. We extend our analysis to 2RNNs, a generalization of RNNs with multiplicative interactions between inputs and hidden states. Unlike RNNs, which remain linear without nonlinear activations, 2RNNs perform polynomial transformations whose maximal degree grows with depth. We further show that multiplicative interactions cannot, in general, be replaced by layerwise nonlinearities. Finally, we validate these insights empirically on synthetic and real-world tasks.

Although average gain optimality is a commonly adopted performance measure in Markov Decision Processes (MDPs), it is often too asymptotic. Further incorporating measures of immediate losses leads to the hierarchy of bias optimalities, all the way up to Blackwell optimality. In this paper, we investigate the problem of identifying policies of such optimality orders. To that end, for each order, we construct a learning algorithm with vanishing probability of error. Furthermore, we characterize the class of MDPs for which identification algorithms can stop in finite time. That class corresponds to the MDPs with a unique Bellman optimal policy, and does not depend on the optimality order considered. Lastly, we provide a tractable stopping rule that when coupled to our learning algorithm triggers in finite time whenever it is possible to do so.

In two-player zero-sum games, the learning dynamic based on optimistic Hedge achieves one of the best-known regret upper bounds among strongly-uncoupled learning dynamics. With an appropriately chosen learning rate, the social and individual regrets can be bounded by $O(\log(mn))$ in terms of the numbers of actions $m$ and $n$ of the two players. This study investigates the optimality of the dependence on $m$ and $n$ in the regret of optimistic Hedge. To this end, we begin by refining existing regret analysis and show that, in the strongly-uncoupled setting where the opponent's number of actions is known, both the social and individual regret bounds can be improved to $O(\sqrt{\log m \log n})$. We then show that the existing social regret bound as well as these new social and individual regret upper bounds cannot be further improved for optimistic Hedge by providing algorithm-dependent individual regret lower bounds. Importantly, these upper and lower bounds match exactly up to the constant factor in the leading term. Finally, building on these results, we improve the last-iterate convergence rate and the dynamic regret of a learning dynamic based on optimistic Hedge, and complement these results with algorithm-dependent dynamic regret lower bounds that match the improved bounds.


174
Parameter-Free Dynamic Regret for Unconstrained Linear Bandits

Alberto Rumi ⋅ Andrew Jacobsen ⋅ Nicolò Cesa-Bianchi ⋅ Fabio Vitale

We study dynamic regret minimization in unconstrained adversarial linear bandit problems. In this setting, a learner must minimize the cumulative loss relative to an arbitrary sequence of comparators $\boldsymbol{u}_1,\ldots,\boldsymbol{u}_T$ in $\mathbb{R}^d$, but receives only *point-evaluation feedback* on each round. We provide a simple approach to combining the guarantees of several bandit algorithms, allowing us to optimally adapt to the number of switches $S_T = \sum_t\mathbb{I}\\{\boldsymbol{u}\_t \neq \boldsymbol{u}\_{t-1}\\}$ of an arbitrary comparator sequence. In particular, we provide the *first* algorithm for linear bandits achieving the optimal regret guarantee of order $\mathcal{O}\big(\sqrt{d(1+S_T) T}\big)$ up to poly-logarithmic terms *without prior knowledge of $S_T$*, thus resolving a long-standing open problem.

The rapid growth of model sizes and training datasets has created a strong demand for test-time compute—the ability to perform inference without additional training. At the core of test-time compute is in-context learning, an emerging capability of large language models (LLMs) that enables them to perform statistical inference directly at test time. Recent progress has shed light on the mechanisms of in-context learning in statistical learning: language models can implement linear regression and classification by iteratively extracting features at test time. This naturally raises a broader question: can in-context learning extend beyond statistical learning to combinatorial problems and integer programs? We take a first step toward addressing this question by analyzing the fundamental task of sorting and the more general problem of discrete optimal transport. We prove that transformers with softmax self-attention layers can solve discrete optimal transport through in-context learning, when model parameters remain fixed and only the input length and distribution vary. Consequently, transformers can approximately sort lists of arbitrary length within a provable approximation factor. This sorting mechanism fundamentally differs from discrete sorting algorithms: transformers implicitly simulate a continuous optimization process through their internal feature-extraction dynamics.


177
LAMP: Extracting Local Decision Surfaces from Large Language Models

Ryan Chen ⋅ Youngmin Ko ⋅ Catherine Cho ⋅ Zeyu Zhang ⋅ Mauro Giuffrè ⋅ Sunny Chung ⋅ Dennis Shung ⋅ Bradly Stadie

We introduce LAMP (Local Attribution Mapping Probe), a method that shines light onto a black-box language model's decision surface and studies how reliably a model maps its stated reasons to its reported predictions by approximating a decision surface. LAMP treats the model's own self-reported explanations as a coordinate system and fits a locally linear surrogate that links those weights to the model's output. By doing so, it reveals how much the stated factors steer the model's decisions. We apply LAMP to three tasks: sentiment analysis, controversial-topic detection, and safety-prompt auditing. Across these tasks, LAMP reveals that many language models' locally approximated linear decision landscapes overall agree with human judgments on explanation quality and, on a clinical case‑file data set, align with expert assessments. Since LAMP operates without requiring access to model gradients, logits, or internal activations, it serves as a practical and lightweight framework for auditing proprietary language models, and enabling assessment of whether a model appears to behave consistently with the explanations it provides.


178
Beyond Binning: Soft Task Reformulation for Deep Regression

Lawrence Stewart ⋅ Francis Bach ⋅ Quentin Berthet

Whilst neural networks are powerful predictors, it has been observed and theoretically analyzed that training such models by minimizing the square loss can lead to suboptimal results on regression problems, where the targets are real-valued. In this work, we propose a novel method aimed at improving test-time performance of neural networks on regression tasks. Our method is based on casting this task in a different fashion, using a target encoder, and a prediction decoder, inspired by approaches in classification and clustering. We demonstrate our method on a wide range of real-world datasets.


179
An Information-Geometric Approach to Artificial Curiosity

Alexander Nedergaard ⋅ Pablo Morales

Learning in environments with sparse rewards remains a fundamental challenge in reinforcement learning. Artificial curiosity addresses this limitation through intrinsic rewards to guide exploration, however, the precise formulation of these rewards has remained elusive. Ideally, such rewards should depend on the agent's information about the environment, remaining agnostic to its representation---an invariance central to information geometry. Leveraging this, we show that information monotonicity and invariance under the agent-environment interaction uniquely constrains intrinsic rewards to strictly concave functions of the reciprocal occupancy. Requiring these rewards to yield a principled exploration-exploitation trade-off, via information geodesic interpolation on the occupancy manifold, effectively limits the candidates to those determined by a scalar parameter. Remarkably, special values of this parameter are found to correspond to count-based and maximum entropy exploration. This framework provides important constraints to the engineering of intrinsic reward while integrating foundational exploration methods into a single, cohesive model.


18
Statistical Inference for Explainable Boosting Machines

Haimo Fang ⋅ Kevin Tan ⋅ Jonathan Pipping ⋅ Giles Hooker

Explainable boosting machines (EBMs) are popular ``glass-box'' models that learn a set of univariate functions using boosting trees. These achieve explainability through visualizations of each feature’s effect. However, unlike linear model coefficients, uncertainty quantification for the learned univariate functions requires computationally intensive bootstrapping, making it hard to know which features truly matter. We provide an alternative using recent advances in statistical inference for gradient boosting, deriving methods for statistical inference as well as end-to-end theoretical guarantees. Using a moving average instead of a sum of trees (Boulevard regularization) allows the boosting process to converge to a feature-wise kernel ridge regression. This produces asymptotically normal predictions that achieve the minimax-optimal MSE for fitting Lipschitz GAMs with $p$ features of $O(p n^{-2/3})$, successfully avoiding the curse of dimensionality. We then construct prediction intervals for the response and confidence intervals for each learned univariate function with a runtime independent of the number of datapoints, enabling further explainability within EBMs.


181
Root cause discovery via permutations and Cholesky decomposition

Jinzhou Li, Benjamin B Chu, Ines F Scheller, Julien Gagneur, Marloes H Maathuis

This paper is concerned with estimating the column subspace of a low-rank matrix $\boldsymbol{X}^\star \in \mathbb{R}^{n_1\times n_2}$ from contaminated data. How to obtain optimal statistical accuracy while accommodating the widest range of signal-to-noise ratios (SNRs) becomes particularly challenging in the presence of heteroskedastic noise and unbalanced dimensionality (i.e., $n_2\gg n_1$). While the state-of-the-art algorithm $\textsf{HeteroPCA}$ emerges as a powerful solution for solving this problem, _x000D_ it suffers from ``the curse of ill-conditioning,'' namely, its performance degrades as the condition number of $\boldsymbol{X}^\star$ grows. In order to overcome this critical issue without compromising the range of allowable SNRs, we propose a novel algorithm, called $\textsf{Deflated-HeteroPCA}$, that achieves near-optimal and condition-number-free theoretical guarantees in terms of both $\ell_2$ and $\ell_{2,\infty}$ statistical accuracy. The proposed algorithm divides the spectrum of $\boldsymbol{X}^\star$ into well-conditioned and mutually well-separated subblocks, and applies $\textsf{HeteroPCA}$ to conquer each subblock successively. Further, an application of our algorithm and theory to two canonical examples --- the factor model and tensor PCA --- leads to remarkable improvement for each application.


184
Functional Integrative Bayesian Analysis of High-Dimensional Multiplatform Clinicogenomic Data

Rupam Bhattacharyya, Nicholas C. Henderson, and Veerabhadran Baladandayuthapani


186
Minimax Optimal Deep Neural Network Classifiers Under Smooth Decision Boundary

Tianyang HU, Ruiqi LIU, Zuofeng SHANG, Guang CHENG


187
Inference for Dispersion and Curvature of Random Objects

Wookyeong Song, Hans-Georg Müller

There are many open questions pertaining to the statistical analysis of random objects, which are increasingly encountered. A major challenge is the absence of linear operations in such spaces. A basic statistical task is to quantify statistical dispersion or spread. For two measures of dispersion for data objects in geodesic metric spaces, Fréchet variance and metric variance, we derive a central limit theorem (CLT) for their joint distribution. This analysis reveals that the Alexandrov curvature of the geodesic space determines the relationship between these two dispersion measures. This suggests a novel test for inferring the curvature of a space based on the asymptotic distribution of the dispersion measures. We demonstrate how this test can be employed to detect the intrinsic curvature of an unknown underlying space, which emerges as a joint property of the space and the underlying probability measure that generates the random objects. We investigate the asymptotic properties of the test and its finite-sample behavior for various data types, including distributional data and point cloud data. We illustrate the proposed inference for intrinsic curvature of random objects using gait synchronization data represented as symmetric positive definite matrices and energy compositional data on the sphere.


188
Graphical Model Inference with Erosely Measured Data

Lili Zheng, Genevera I. Allen

We investigate the Gaussian graphical model inference problem in a novel setting that we call erose measurements, referring to irregularly measured or observed data. For graphs, this results in different node pairs having vastly different sample sizes which frequently arises in data integration, genomics, neuroscience, and sensor networks. Existing works characterize the graph selection performance using the minimum pairwise sample size, which provides little insights for erosely measured data, and no existing inference method is applicable. We aim to fill in this gap by proposing the first inference method that characterizes the different uncertainty levels over the graph caused by the erose measurements, named GI-JOE (Graph Inference when Joint Observations are Erose). Specifically, we develop an edge-wise inference method and an affiliated FDR control procedure, where the variance of each edge depends on the sample sizes associated with corresponding neighbors. We prove statistical validity under erose measurements, thanks to careful localized edge-wise analysis and disentangling the dependencies across the graph. Finally, through simulation studies and a real neuroscience data example, we demonstrate the advantages of our inference methods for graph selection from erosely measured data.


189
High-Dimensional Spatial Autoregression with Latent Factors by Diversified Projections

Jiaxin Shi,Xuening Zhu,Jing Zhou,Baichen Yu &Hansheng Wang

We study one particular type of multivariate spatial autoregression (MSAR) model with diverging dimensions in both responses and covariates. This makes the usual MSAR models no longer applicable due to the high computational cost. To address this issue, we propose a factor-augmented spatial autoregression (FSAR) model. FSAR is a special case of MSAR but with a novel factor structure imposed on the high-dimensional random error vector. The latent factors of FSAR are assumed to be of a fixed dimension. Therefore, they can be estimated consistently by the diversified projections method, as long as the dimension of the multivariate response is diverging. Once the fixed-dimensional latent factors are consistently estimated, they are then fed back into the original SAR model and serve as exogenous covariates. This leads to a novel FSAR model. Thereafter, different components of the high-dimensional response can be modeled separately. To handle the high-dimensional feature, a smoothly clipped absolute deviation (SCAD) type penalized estimator is developed for each response component. We show theoretically that the resulting SCAD estimator is uniformly selection consistent, as long as the tuning parameter is selected appropriately. For practical selection of the tuning parameter, a novel BIC method is developed. Extensive numerical studies are conducted to demonstrate the finite sample performance of the proposed method.


19
Towards Sensitivity-Aware Language Models

Dren Fazlija ⋅ Iyiola E. Olatunji ⋅ Daniel Kudenko ⋅ Sandipan Sikdar

With LLMs increasingly deployed in corporate data management, it is crucial to ensure that these models do not leak sensitive information. In the context of corporate data management, the concept of sensitivity awareness has been introduced, enabling LLMs to adhere to predefined access rights rules. However, it remains unclear how sensitivity awareness relates to established notions of privacy, such as differential privacy (DP), thereby making it difficult to deploy meaningfully in real-world applications. In this work, we formalize the notion of sensitivity awareness and theoretically establish its connection to DP. Additionally, we develop a supervised fine-tuning recipe to make existing, four-bit quantized LLMs more sensitivity-aware. With a performance boost of up to 21.7%, the finetuned LLMs not only substantially improve over their baseline but also outperform other full-precision open-source and commercial models of similar size in achieving sensitivity awareness, demonstrating the effectiveness of our proposed approach. At the same time, our method also largely preserves the models' performance on other tasks, such as general instruction-following, mathematical, and common-sense reasoning.


191
Settling the sample complexity of model-based offline reinforcement learning

Gen Li, Laixi Shi, Yuxin Chen, Yuejie Chi, Yuting Wei


192
Clustering risk in Non-parametric Hidden Markov and I.I.D. Models

Elisabeth Gassiat, Ibrahim Kaddouri, Zacharie Naulet

We conduct an in-depth analysis of the Bayes risk of clustering in the context of Hidden Markov and i.i.d. models. In both settings, we identify the situations where this risk is comparable to the Bayes risk of classification and those where its minimizer, the Bayes clusterer, can be derived from the Bayes classifier. While we demonstrate that clustering based on the Bayes classifier does not always match the optimal Bayes clusterer, we show that this difference is primarily theoretical and that the Bayes classifier remains nearly optimal for clustering. A key quantity emerges, capturing the fundamental difficulty of both classification and clustering tasks. Furthermore, by leveraging the identifiability of HMMs, we establish bounds on the clustering excess risk of a plug-in Bayes classifier in the general nonparametric setting, offering theoretical justification for its widespread use in practice. Simulations further illustrate our findings.


193
Scalable community detection in massive networks via predictive assignment

Subhankar Bhadra, Marianna Pensky, Srijan Sengupta


194
A Consequentialist Critique of Binary Classification Evaluation: Theory, Practice, and Tools

Gerardo Flores ⋅ Alyssa Smith ⋅ Abigail Schiff ⋅ Julia Fukuyama ⋅ Ashia Wilson

Machine learning-supported decisions, such as ordering diagnostic tests or determining preventive custody, often rely on binary classification from probabilistic forecasts. A consequentialist perspective, long emphasized in decision theory, favors evaluation methods that reflect the quality of such forecasts under threshold uncertainty and varying prevalence, notably Brier scores and log loss. However, our empirical review of practices at major ML venues (ICML, FAccT, CHIL) reveals a dominant reliance on accuracy and AUC-ROC. To address this disconnect, we introduce a decision-theoretic framework mapping evaluation metrics to their appropriate use cases, along with a practical Python package, \texttt{briertools}, designed to make proper scoring rules more usable in real-world settings. Specifically, we implement a bounded-threshold variant of the Brier score and log loss that restricts evaluation to a practitioner-specified range of plausible cost ratios, rather than averaging over the full unit interval. We further contribute a theoretical reconciliation between the Brier score and decision curve analysis, directly addressing a longstanding critique by Assel et al (2017) regarding the clinical utility of proper scoring rules.


195
Multilayer Correlation Clustering

Atsushi Miyauchi ⋅ Florian Adriaens ⋅ Francesco Bonchi ⋅ Nikolaj Tatti

We establish Multilayer Correlation Clustering, a novel generalization of Correlation Clustering to the multilayer setting. In this model, we are given a series of inputs of Correlation Clustering (called layers) over the common set $V$ of $n$ elements. The goal is to find a clustering of $V$ that minimizes the $\ell_p$-norm ($p\geq 1$) of the multilayer-disagreements vector, which is defined as the vector (with dimension equal to the number of layers), each element of which represents the disagreements of the clustering on the corresponding layer. For this generalization, we first design an $O(L\log n)$-approximation algorithm, where $L$ is the number of layers. We then study an important special case of our problem, namely the problem with the so-called probability constraint. For this case, we first give an $(\alpha+2)$-approximation algorithm, where $\alpha$ is any possible approximation ratio for the single-layer counterpart. Furthermore, we design a $4$-approximation algorithm, which improves the above approximation ratio of $\alpha+2=4.5$ for the general probability-constraint case. Computational experiments using real-world datasets support our theoretical findings and demonstrate the practical effectiveness of our proposed algorithms.


196
Variational Grey-Box Dynamics Matching

Gurjeet Singh ⋅ Frantzeska Lavda ⋅ Giangiacomo Mercatali ⋅ Alexandros Kalousis

Deep generative models such as flow matching and diffusion models have shown great potential in learning complex distributions and dynamical systems, but often act as black-boxes, neglecting underlying physics. In contrast, physics-based simulation models described by ODEs/PDEs remain interpretable, but may have missing or unknown terms, unable to fully describe real-world observations. We bridge this gap with a novel grey-box method that integrates incomplete physics models directly into generative models. Our approach learns dynamics from observational trajectories alone, without ground-truth physics parameters, in a simulation-free manner that avoids scalability and stability issues of Neural ODEs. The core of our method lies in modelling a structured variational distribution within the flow matching framework, by using two latent encodings: one to model the missing stochasticity and multi-modal velocity, and a second to encode physics parameters as a latent variable with a physics-informed prior. Furthermore, we present an adaptation of the framework to handle second-order dynamics. Our experiments on representative ODE/PDE problems and real-world weather forecasting demonstrate that our method performs on par with or superior to fully data-driven approaches and previous grey-box baselines, while preserving the interpretability of the physics model. Our code is available at https://github.com/DMML-Geneva/VGB-DM.


197
Local Regression on Path Spaces with Signature Metrics

Christian Bayer ⋅ Davit Gogolashvili ⋅ Luca Pelizzari

We study nonparametric regression and classification for path-valued data. We introduce a functional Nadaraya-Watson estimator that combines the signature transform from rough path theory with local kernel regression. The signature transform provides a principled way to encode sequential data through iterated integrals, enabling direct comparison of paths in a natural metric space. Our approach leverages signature-induced distances within the classical kernel regression framework, achieving computational efficiency while avoiding the scalability bottlenecks of large-scale kernel matrix operations. We establish finite-sample convergence bounds demonstrating favorable statistical properties of signature-based distances compared to traditional metrics in infinite-dimensional settings. We propose robust signature variants that provide stability against outliers, enhancing practical performance. Applications to both synthetic and real-world data—including stochastic differential equation learning and time series classification—demonstrate competitive accuracy while offering significant computational advantages over existing methods.

Inverse problems for parameter identification in systems governed by partial differential equations (PDEs) are fundamental across numerous domains in science and engineering. While traditionally approached using classical numerical techniques, recent advancements have highlighted the potential of physics-informed neural networks (PINNs) in this context. However, incorporating principled uncertainty quantification (UQ) into PINN frameworks remains a challenge. To address this, we introduce flowPINNs, a probabilistic framework for UQ in PDE parameter inverse problems. The core idea of a flowPINN is to define a variational posterior that combines a normalising flow approximation to the distribution over the PDE parameters with a PINN that represents the corresponding PDE solution. This joint model enables efficient posterior inference by maximisation of the evidence lower bound (ELBO), thereby converting the inverse problem into a tractable optimisation task. Through a series of numerical experiments, we demonstrate that flowPINNs can yield improved predictive performance and more reliable uncertainty estimates compared to existing PINN-based UQ approaches.

Unsupervised learning of high-dimensional data is challenging due to irrelevant or noisy features obscuring underlying structures. It's common that only a few features, called the influential features, meaningfully define the clusters. Recovering these influential features is helpful in data interpretation and clustering. We propose i-IF-Learn, an iterative unsupervised framework that jointly performs feature selection and clustering. Our core innovation is an adaptive feature selection statistic that effectively combines pseudo-label supervision with unsupervised signals, dynamically adjusting based on intermediate label reliability to mitigate error propagation common in iterative frameworks. Leveraging low-dimensional embeddings (PCA or Laplacian eigenmaps) followed by $k$-means, i-IF-Learn simultaneously outputs influential feature subset and clustering labels. Numerical experiments on gene microarray and single-cell RNA-seq datasets show that i-IF-Learn significantly surpasses classical and deep clustering baselines. Furthermore, using our selected influential features as preprocessing substantially enhances downstream deep models such as DeepCluster, UMAP, and VAE, highlighting the importance and effectiveness of targeted feature selection.

This paper uses a minimum divergence framework to introduce a new way of calculating model weights that can be used to average probabilistic predictions from statistical and machine learning models. The method is general and can be applied regardless of whether the models under consideration are fit to data using frequentist, Bayesian, or some other fitting method. The proposed method is motivated in two different ways and is shown empirically to perform better than or on a par with standard model averaging methods, including model stacking and model averaging that relies on Akaike-style negative exponentiated model weighting, especially when the sample size is small. Our theoretical analysis explains why the method has a small-sample advantage.


20
Reinforcement Learning Using Known Invariances

Alexandru Cioba ⋅ Aya Kayal ⋅ Laura Toni ⋅ Sattar Vakili ⋅ Alberto Bernacchia

In many real-world reinforcement learning (RL) problems, the environment exhibits inherent symmetries that can be exploited to improve learning efficiency. This paper develops a theoretical and algorithmic framework for incorporating known group symmetries into kernel-based RL. We propose a symmetry-aware variant of optimistic least-squares value iteration (LSVI), which leverages invariant kernels to encode invariance in both rewards and transition dynamics. Our analysis establishes new bounds on the maximum information gain and covering numbers for invariant RKHSs, explicitly quantifying the sample efficiency gains from symmetry. Empirical results on a customized Frozen Lake environment and a 2D placement design problem confirm the theoretical improvements, demonstrating that symmetry-aware RL achieves significantly better performance than their standard kernel counterparts. These findings highlight the value of structural priors in designing more sample-efficient reinforcement learning algorithms.


200
Interpreting and Controlling Model Behavior via Constitutions for Atomic Concept Edits

Neha Kalibhat ⋅ Zi Wang ⋅ Prasoon Bajpai ⋅ Drew Proud ⋅ Wenjun Zeng ⋅ Been Kim ⋅ Mani Malek

We introduce a black-box interpretability framework that learns a verifiable constitution: a natural language summary of how changes to a prompt affect a model's specific behavior, such as its alignment, correctness, or adherence to constraints. Our method leverages atomic concept edits (ACEs), which are targeted operations that add, remove, or replace an interpretable concept in the input prompt. By systematically applying ACEs and observing the resulting effects on model behavior across various tasks, our framework learns a causal mapping from edits to predictable outcomes. This learned constitution provides deep, generalizable insights into the model. Empirically, we validate our approach across diverse tasks, including mathematical reasoning and text-to-image alignment, for controlling and understanding model behavior. We found that for text-to-image generation, GPT-Image tends to focus on grammatical adherence, while Imagen 4 prioritizes atmospheric coherence. In mathematical reasoning, distractor variables confuse GPT-5 but leave Gemini 2.5 models and o4-mini largely unaffected. Moreover, our results show that the learned constitutions are highly effective for controlling model behavior, achieving an average of $1.86$ times boost in success rate over methods that do not use constitutions.


201
Unified Causal Discovery and Missing Data Imputation

Osman Ali Mian ⋅ Jens Kleesiek ⋅ Michael Kamp

Causal discovery and data imputation are often treated separately, yet both face challenges when data are missing. Existing causal discovery methods discard incomplete samples, leading to significant information loss, while standard imputation approaches rely on spurious correlations that distort the underlying causal signal. We introduce LOGIC, a framework that performs causal discovery and causally consistent imputation simultaneously. While existing work directly makes the assumption that all source variables in the causal graph are observed, we establish a verifiable criterion for this assumption under MCAR and MAR missingness, using the Algorithmic Markov Condition postulate. Building on this, LOGIC proceeds layer by layer, identifying sources, recovering downstream relations, and imputing missing values, while explicitly declaring unknowns when imputation is unsupported rather than forcefully completing the data. This preserves causal reasoning even in challenging missingness regimes. Experiments on synthetic and real-world datasets demonstrate that LOGIC achieves better performance than state-of-the-art baselines in both structure discovery and imputation accuracy.


21
Unmixing Mean Embeddings for Domain Adaptation with Target Label Proportion

Alain Rakotomamonjy ⋅ Maxime Berar ⋅ Mokhtar Alaya

We introduce a novel approach to domain adaptation within the context of Learning from Label Proportions (LLP). We address the challenging scenario where labeled samples are available in the source domain, but only bags of unlabeled samples with their corresponding label proportions are accessible in the target domain. Our proposed method, bagMME (Bag Matching Mean Embeddings), tackles the distributional shift between domains by focusing on matching class-conditional distributions. A key contribution of bagMME is a simple yet effective unmixing strategy that leverages the target label proportions to estimate the target class-conditional mean embeddings. These estimated target means are then aligned with their corresponding source class-conditional means, thereby reducing the domain discrepancy. We theoretically demonstrate the soundness of our approach and its effectiveness in mitigating distributional shifts. Extensive experiments on various computer vision datasets showcase the superior performance of bagMME compared to state-of-the-art baselines. Our results highlight the critical role of incorporating target label proportions into the learning process for improved generalization on the target domain.


22
LLMPhy: Parameter-Identifiable Physical Reasoning Combining Large Language Models and Physics Engines

Anoop Cherian ⋅ Radu Corcodel ⋅ Siddarth Jain ⋅ Diego Romeres

Most learning-based approaches to complex physical reasoning sidestep the crucial problem of parameter identification (e.g., mass, friction) that governs scene dynamics—despite its importance in real-world applications such as collision avoidance and robotic manipulation. In this paper, we present LLMPhy, a black-box optimization framework that integrates large language models (LLMs) with physics simulators for physical reasoning. The core insight of LLMPhy is to bridge the textbook physical knowledge embedded in LLMs with the world models implemented in modern physics engines, enabling the construction of digital twins of input scenes via latent parameter estimation. Specifically, LLMPhy decomposes digital twin construction into two subproblems: (i) a continuous problem of estimating physical parameters and (ii) a discrete problem of estimating scene layout. For each subproblem, LLMPhy iteratively prompts the LLM to generate computer programs encoding parameter estimates, executes them in the physics engine to reconstruct the scene, and uses the resulting reconstruction error as feedback to refine the LLM’s predictions. As existing physical reasoning benchmarks rarely account for parameter identifiability, we introduce three new datasets designed to evaluate physical reasoning in zero-shot settings. Our results show that LLMPhy achieves state-of-the-art performance on our tasks, recovers physical parameters more accurately, and converges more reliably than prior black-box methods.


23
Tight Analysis of Decentralized SGD: a Markov Chain Perspective

Lucas Versini ⋅ Paul Mangold ⋅ Aymeric Dieuleveut

We propose a novel analysis of the Decentralized Stochastic Gradient Descent (DSGD) algorithm with constant step size, interpreting the iterates of the algorithm as a Markov chain. We show that DSGD converges to a stationary distribution, with its bias, to first order, decomposable into two components: one due to decentralization (growing with the graph's spectral gap and heterogeneity) and one due to stochasticity. Remarkably, the variance of local parameters is, at the first-order, inversely proportional to the number of agents, regardless of the network topology and even when clients' iterates are not averaged at the end. As a consequence of our analysis, we obtain non-asymptotic convergence bounds for clients' local iterates, confirming that DSGD has linear speed-up in the number of clients, and that the network topology only impacts higher-order terms.


24
Entropic Projection Alignment: Estimating, Explaining, and Improving Model Performance Under Distribution Shift

Salim I. Amoukou ⋅ Emanuele Albini ⋅ Tom Bewley ⋅ Saumitra Mishra ⋅ Manuela Veloso

We propose a unified framework for addressing three key challenges of distribution shift: (1) estimating a model’s performance on an unlabeled target domain, (2) explaining the shift by identifying the features responsible, and (3) improving the target domain performance. Our method, Entropic Projection Alignment (EPA), aligns the source distribution to the target by matching carefully selected moments while simultaneously minimising the KL divergence from the source. This formulation yields a unique closed-form solution for importance weights, achieving robustness through implicit variance control. Drawing on domain adaptation theory, we establish that moment matching is sufficient for reliable estimation and adaptation, avoiding the need for full density ratio recovery. Extensive experiments, together with strong theoretical guarantees, demonstrate that EPA consistently outperforms state-of-the-art baselines while offering substantial computational efficiency.


25
Tight Lower Bounds and Optimal Algorithms for Stochastic Nonconvex Optimization with Heavy-Tailed Noise

Adrien Fradin ⋅ Abdurakhmon Sadiev ⋅ Laurent Condat ⋅ Peter Richtarik

We study stochastic nonconvex optimization under heavy-tailed noise. In this setting, the stochastic gradients only have bounded p–th central moment ($p$–BCM) for some $p \in (1,2]$. Building on the foundational work of Arjevani et al. (2022) in stochastic optimization, we establish tight sample complexity lower bounds for all first-order methods under relaxed mean-squared smoothness ($q$-WAS) and $\delta$-similarity ($(q,\delta)$-S) assumptions, allowing any exponent $q\in[1,2]$ instead of the standard $q= 2$. These results substantially broaden the scope of existing lower bounds. To complement them, we show that Normalized Stochastic Gradient Descent with Momentum Variance Reduction (NSGD-MVR), a known algorithm, matches these bounds in expectation. Beyond expectation guarantees, we introduce a new algorithm, Double-Clipped NSGD-MVR, which allows the derivation of high-probability convergence rates under weaker assumptions than in previous works. Finally, for second-order methods with stochastic Hessians satisfying bounded $q$-th central moment assumptions for some exponent $q \in[1,2] $ (allowing $q\neq p$), we establish sharper lower bounds than previous works while improving over Sadiev et al. (2025) (where only $p=q$ is considered) and yielding stronger convergence exponents. Together, these results provide a nearly complete complexity characterization of stochastic nonconvex optimization in heavy-tailed regimes.


26
Neural Doubly Robust Proximal Causal Estimation

Ruolin Meng ⋅ Dhanajit Brahma ⋅ Ricardo Henao ⋅ Lawrence Carin Duke

We consider the challenging task of estimating treatment effects from observational data under the assumption that there are unobserved confounders. We employ the proximal causal estimation framework, that assumes access to control (proxy) measurements that contain information about unobserved confounders. We consider outcome and treatment bridges, which provide two distinct ways of estimating causal effects. We also consider a doubly-robust approach, based on combining the outcome and treatment bridges, which is robust in expectation to either (but not both) of the two bridge functions being misspecified. We present a new theoretical bound on the estimation accuracy of the treatment bridge, and we analyze the variance of the doubly-robust estimator. We investigate the impact of autoencoder-based regularization through an ablation study, finding that simpler models sometimes outperform more complex variants. Comparisons with state-of-the-art methods on synthetic and real-world data demonstrate the advantages of our approach.


27
Asymptotic optimality theory of confidence intervals of the mean

Vikas Deep ⋅ Achal Bassamboo ⋅ Sandeep Juneja

We address the classical problem of constructing confidence intervals (CIs) for the mean of a distribution, given $N$ i.i.d. samples, such that the CI contains the true mean with probability at least $1 - \delta$, where $\delta \in (0,1)$. We characterize three distinct learning regimes based on the minimum achievable limiting width of any CI as the sample size $N_\delta \to \infty$ and $\delta \to 0$. In the first regime, where $N_\delta$ grows slower than $\log(1/\delta)$, the limiting width of any CI equals the width of the distribution’s support, precluding meaningful inference. In the second regime, where $N_\delta$ scales as $\log(1/\delta)$, we precisely characterize the minimum limiting width, which depends on the scaling constant. In the third regime, where $N_\delta$ grows faster than $\log(1/\delta)$, complete learning is achievable, and the limiting width of the CI collapses to zero and CI converges to the true mean. We demonstrate that CIs derived from concentration inequalities based on Kullback-Leibler (KL) divergences achieve asymptotically optimal performance, attaining the minimum limiting width in both the sufficient and the complete learning regimes for distributions in three families: single-parameter exponential, bounded support and known bound on $(1+\epsilon)^{\rm th}$ moment. Additionally, these results extend to one-sided CIs, with the width notion adjusted appropriately. Finally, we generalize our findings to settings with random per-sample costs, motivated by practical applications such as stochastic simulators and cloud service selection. Instead of a fixed sample size, we consider a cost budget $C_\delta$, identifying analogous learning regimes and characterizing the optimal CI construction policy.


28
Injecting Measurement Information Yields a Fast and Noise-Robust Diffusion-Based Inverse Problem Solver

Jonathan Patsenker ⋅ Henry Li ⋅ Myeongseob Ko ⋅ Ruoxi Jia ⋅ Yuval Kluger

Diffusion models have been firmly established as principled zero-shot solvers for linear and nonlinear inverse problems, owing to their powerful image prior and iterative sampling algorithm. These approaches often rely on Tweedie's formula, which relates the diffusion variate $\mathbf{x}_t$ to the posterior mean $\mathbb{E} [\mathbf{x}_0 | \mathbf{x}_t]$, in order to guide the diffusion trajectory with an estimate of the final denoised sample $\mathbf{x}_0$. However, this does not consider information from the measurement $\mathbf{y}$, which must then be integrated downstream. In this work, we propose to estimate the conditional posterior mean $\mathbb{E} [\mathbf{x}_0 | \mathbf{x}_t, \mathbf{y}]$, which can be formulated as the solution to a lightweight, single-parameter maximum likelihood estimation problem. The resulting prediction can be integrated into any standard sampler, resulting in a fast and memory-efficient inverse solver. Our optimizer is amenable to a noise-aware likelihood-based stopping criteria that is robust to measurement noise in $\mathbf{y}$. We demonstrate comparable or improved performance against a wide selection of contemporary inverse solvers across multiple datasets and tasks.


29
Differentially Private E-Values

Daniel Csillag ⋅ Diego Mesquita

E-values have gained prominence as flexible tools for statistical inference and risk control, enabling anytime- and post-hoc-valid procedures under minimal assumptions. However, many applications fundamentally rely on sensitive data, which can be leaked through e-values. To ensure their safe release, we propose a general framework for differentially private e-values that transforms any non-private e-value into a differentially private one. Towards this end, we develop a novel biased multiplicative noise mechanism that ensures our differentially private e-values remain statistically valid. We show that our differentially private e-values attain strong statistical power, and are asymptotically as powerful as their non-private counterparts. Experiments across online risk monitoring, private healthcare, and conformal e-prediction demonstrate our approach's effectiveness and broad applicability.


3
Optimal Local Convergence Rates of Stochastic First-Order Methods under Local Alpha-PL

Mohammadsaeed Masiha ⋅ Saber Salehkaleybar ⋅ Niao He ⋅ Negar Kiyavash ⋅ Patrick Thiran

We study the local oracle complexity of stochastic first-order methods under a local $\alpha$–Polyak--Łojasiewicz ($\alpha$–PŁ) condition in a neighborhood of a target connected component $\mathcal M'$ of the local minimizer set. The parameter $\alpha\in[1,2]$ is the exponent of the gradient norm in the $\alpha$–PŁ inequality: $\alpha=2$ recovers the classical PŁ case, $\alpha=1$ corresponds to H\"older-type error bounds, and intermediate values interpolate between these regimes. Our performance criterion is the number of oracle queries required to output $\hat x$ with $F(\hat x)-l\le\varepsilon$, where $l:=F(y)$ for any $y\in\mathcal M'$. We work in a local regime where the algorithm is initialized near $\mathcal M'$ and, with high probability, its iterates remain in that neighborhood. We establish a lower bound $\Omega(\varepsilon^{-2/\alpha})$ for all stochastic first-order methods in this regime, and we obtain a matching upper bound $\mathcal O(\varepsilon^{-2/\alpha})$ for $1\le \alpha<2$ via a SARAH-type variance-reduced method with time-varying batch sizes and step sizes. Thus, for $1\le\alpha<2$, the optimal dependence on $\varepsilon$ is $\Theta(\varepsilon^{-2/\alpha})$. In the convex setting, assuming a local $\alpha$–PŁ condition on the $\varepsilon$-sublevel set, we further show a complexity lower bound $\widetilde{\Omega}(\varepsilon^{-2/\alpha})$ for reaching an $\varepsilon$-global optimum, matching the $\varepsilon$-dependence of known accelerated stochastic subgradient methods.


30
Sequential Off-Policy Learning with Logarithmic Smoothing

Maxime Haddouche ⋅ Otmane Sakhi

Off-policy learning enables training policies from logged interaction data. Most prior work considers the batch setting, where a policy is learned from data generated by a single behavior policy. In real systems, however, policies are updated and redeployed repeatedly, each time training on all previously collected data while generating new interactions for future updates. This sequential off-policy learning setting is common in practice but remains largely unexplored theoretically. In this work, we present and study a simple algorithm for sequential off-policy learning, combining Logarithmic Smoothing (LS) estimation with online PAC-Bayesian tools. We further show that a principled adjustment to LS improves performance and accelerates convergence under mild conditions. The algorithms introduced generalize previous work: they match state-of-the-art offline approaches in the batch case and substantially outperform them when policies are updated sequentially. Empirical evaluations highlight both the benefits of the sequential framework and the strength of the proposed algorithms.


31
Tractable Shapley Values and Interactions via Tensor Networks

Farzaneh Heidari ⋅ Chao Li ⋅ Guillaume Rabusseau

We show how to replace the $O(2^n)$ coalition enumeration over $n$ features behind Shapley values and Shapley-style interaction indices with a few-evaluation scheme on a tensor-network (TN) surrogate: TN-SHAP. The key idea is to represent a predictor’s local behavior as a factorized multilinear map, so that coalitional quantities become linear probes of a coefficient tensor. TN-SHAP replaces exhaustive coalition sweeps with just a small number of targeted evaluations to extract order$-k$ Shapley interactions. In particular, both order-1 (single-feature) and order-2 (pairwise) computations have cost $O\!\big(n\,\mathrm{poly}(\chi) + n^2\big)$, where $\chi$ is the TN’s maximal cut rank. We provide theoretical guarantees on the approximation error and tractability of TN-SHAP. On UCI datasets, our method matches enumeration on the fitted surrogate while reducing evaluation by orders of magnitude and achieves \textbf{25--1000$\times$} wall-clock speedups over KernelSHAP-IQ at comparable accuracy, while amortizing training across local cohorts.


32
Partial VOROS: A Cost-aware Performance Metric for Binary Classifiers with Precision and Capacity Constraints

Christopher Ratigan ⋅ Kyle Heuton ⋅ Carissa Wang ⋅ Lenore Cowen ⋅ Michael Hughes

The ROC curve is widely used to assess binary classifiers. Yet for some applications, such as alert systems for monitoring hospitalized patients, conventional ROC analysis cannot meet two key deployment needs: enforcing a constraint on precision to avoid false alarm fatigue and imposing an upper bound on the number of predicted positives to represent the capacity of hospital staff. The usual area under the curve metric also does not reflect asymmetric costs for false positives and false negatives. In this paper we address all three of these issues. First, we show how the subset of classifiers that meet precision and capacity constraints occupy a feasible region in ROC space. We establish the polygon-shaped geometry of this region. We then define the partial area of lesser classifiers, a performance metric that is monotonic with cost and only accounts for the feasible region. Averaging this area over a desired distribution for cost parameters results in the partial volume over the ROC surface, or partial VOROS. In experiments predicting mortality risk from vital sign history on several datasets, we show this cost-aware metric can outperform alternatives at ranking classifiers for in-hospital alerts.


33
Learnability with Partial Labels and Adaptive Nearest Neighbors

Nicolas A. Errandonea ⋅ Santiago Mazuelas ⋅ Jose A Lozano ⋅ Sanjoy Dasgupta

Prior work on partial labels learning (PLL) has shown that learning is possible even when each instance is associated with a bag of labels, rather than a single accurate but costly label. However, the necessary conditions for learning with partial labels remain unclear, and existing PLL methods are effective only in specific scenarios. In this work, we mathematically characterize the scenarios in which PLL is feasible. In addition, we present PL A-$k$NN, an adaptive nearest-neighbors algorithm for PLL that is effective in general scenarios and enjoys strong performance guarantees. Experimental results corroborate that PL A-$k$NN can outperform state-of-the-art methods in general PLL scenarios

We investigate the existence of a statistical-computational gap in multiple Gaussian graph alignment. We first generalize a previously established informational threshold from Vassaux and Massoulié (2025) to regimes where the number of observed graphs $p$ may also grow with the number of nodes $n$: when $p \leq O(n/\log(n))$, we recover the results from Vassaux and Massoulié (2025), and $p \geq \Omega(n/\log(n))$ corresponds to a regime where the problem is as difficult as aligning one single graph with some unknown "signal" graph. Moreover, when $p = \omega(n)$, the informational thresholds for partial and exact recovery no longer coincide, in contrast to the all-or-nothing phenomenon observed when $p=O(n)$. Then, we provide the first computational barrier in the low-degree framework for (multiple) Gaussian graph alignment. We prove that when the correlation $\rho$ is less than $1$, up to logarithmic terms, low degree non-trivial estimation fails. Our results suggest that the task of aligning $p$ graphs in polynomial time is as hard as the problem of aligning two graphs in polynomial time, up to logarithmic factors. These results characterize the existence of a statistical-computational gap and provide another example in which polynomial-time algorithms cannot handle complex combinatorial bi-dimensional structures.


35
Uncertainty Quantification for Named Entity Recognition via Conformal Prediction

Matthew Singer ⋅ Karl Pazdernik ⋅ Srijan Sengupta

Named Entity Recognition (NER) is a foundational component in many language tasks, such as knowledge graph construction, information extraction, and question answering. However, existing NER models typically output a single predicted label sequence without any quantification of uncertainty, leaving downstream applications vulnerable to cascading errors. We introduce a conformal prediction framework for NER that produces prediction sets over full label sequences with finite-sample coverage guarantees, serving an analogous role to confidence intervals in classical statistics. To tailor the general conformal prediction methodology to the NER application, we propose the use of Mondrian conformal prediction according to input length and language, hybrid probability-index nonconformity scores, and a modified RAPS procedure for sequence labeling. These techniques mitigate the problem of overly large prediction sets while maintaining valid coverage. Experiments on CoNLL++, CoNLL-Reduced, and WikiNEuRal benchmarks demonstrate that our methods consistently achieve the target confidence while producing efficient prediction sets across diverse base models. This work establishes a statistically principled approach to uncertainty-aware NER with direct benefits for downstream knowledge-driven NLP systems.

We study low-rank matrix completion in the \textit{semi-random model}, where each entry $(i, j)$ is observed independently with an unknown probability $p_{i,j} \ge p$, in contrast to the standard model with a uniform probability $p$. While prior work has shown that nonconvex approach succeeds in the semi-random model for exact observations [CG18], it remains unclear whether similar guarantees extend to more general observation model, such as noisy or one-bit measurements. In this paper, we give a unified framework for semi-random matrix recovery applicable to a broad family of observation models. Our approach builds on the preprocessing step of [CG18] to restore regularity conditions that are violated under adversarial sampling, and leverages the primal-dual framework of [ZWYG18] to obtain near-optimal recovery guarantees. As concrete corollaries, we show that for both noisy and one-bit matrix completion in the semi-random model, after the preprocessing step, every local minimum of the non-convex objective yields an approximate recovery of the ground-truth matrix.


38
Recency Biased Causal Attention for Time-series Forecasting

Kareem Hegazy ⋅ Michael Mahoney ⋅ N. Benjamin Erichson

Recency bias is a useful inductive prior for sequential modeling: it emphasizes nearby observations and can still allow longer-range dependencies. Standard Transformer attention lacks this property, relying on all-to-all interactions that overlook the causal and often local structure of temporal data. We propose a simple mechanism to introduce recency bias by reweighting attention scores with a smooth heavy-tailed decay. This adjustment strengthens local temporal dependencies without sacrificing the flexibility to capture broader and data-specific correlations. We show that recency-biased attention consistently improves sequential modeling, aligning Transformer more closely with the read–ignore–write operations of RNNs. Finally, we demonstrate that our approach achieves competitive and often superior performance on challenging time-series forecasting benchmarks.


39
Incoherence in Goal-Conditioned Autoregressive Models

Jacek Karwowski ⋅ Raymond Douglas

We investigate mathematically the notion of incoherence: a structural issue with reinforcement learning policies derived by naive goal-conditioning of autoregressive models. We focus on the process of re-training models on their own actions, that is, fine-tuning offline-learned policies with online RL. We prove that it decreases incoherence and leads to an improvement in return, and we aim to characterise the resulting trajectory of policies. By re-framing standard notions of control-as-inference and soft Q learning, we establish a three-way correspondence with two other ways of understanding the iterative re-training process: as folding the posterior into the reward and, in the deterministic case, as decreasing the temperature parameter; the correspondence has computational content via the training-inference trade-off. Through soft-conditioning generative models, we discuss the link between incoherence and the effective horizon of Laidlaw et al. (2024).

Bayesian optimization (BO) is effective for expensive black-box problems but remains challenging in high dimensions. We propose NeST-BO, a curvature-aware local BO method that targets a (modified) Newton step by jointly learning gradient and Hessian information with Gaussian process (GP) surrogates, and selecting evaluations via a one-step lookahead bound on the Newton-step error. We show that this bound contracts with batch size, so NeST-BO drives the step error to zero; in well-behaved neighborhoods it recovers the fast local convergence behavior of inexact/modified Newton methods, while standard safeguards support global convergence to stationary points. To improve scaling with problem dimension, we optimize the acquisition in low-dimensional embedded subspaces (random or learned), reducing the dominant cost of learning curvature from $O(d^2)$ to $O(m^2)$ with $m \ll d$ while preserving step targeting. Across high-dimensional synthetic and real-world problems, including cases with thousands of variables and unknown active subspaces, NeST-BO consistently yields faster convergence and better final values than state-of-the-art local and high-dimensional BO baselines.


40
Fact-Augmented Lookahead Planning for LLM Agents

Samuel Holt ⋅ Max Ruiz Luyten ⋅ Thomas Pouplin ⋅ Mihaela van der Schaar

Large Language Models (LLMs) are increasingly capable, but LLM agents still struggle to plan effectively in interactive, partially observable, long-horizon environments when search is unguided or recent history is insufficient. We introduce LWM-Planner, a fact-augmented lookahead planning framework that improves agent behavior purely through in-context learning. After each episode, the agent extracts task-critical atomic facts from its trajectories, validates candidates with a lightweight predictive-consistency filter (and optionally compresses them), and uses the resulting fact set to condition action proposal, single-step latent world-model simulation, and state-value estimation. Planning then proceeds via recursive, depth-limited lookahead over candidate trajectories conditioned on the accumulated facts and recent history, enabling online improvement without parameter updates. We provide abstraction-style motivation—treating facts as reducing state aliasing (proxy $\epsilon_{\mathrm{sim}}$) and fact-conditioned simulation as lowering one-step error (proxy $\delta_{\mathrm{model}}$)—without claiming formal guarantees. Empirically, on text FrozenLake variants, CrafterMini, and ALFWorld, the approach improves cumulative return over ReAct/Reflexion and search-only baselines, suggesting that additional test-time search is most useful when grounded by compact, experience-derived facts.

Factual reliability in domain-specific Large Language Models (LLMs) is paramount in high-stakes applications where incorrect outputs carry significant risks. Current detection methodologies often rely on expensive retrieval validation or labor-intensive manual annotation, creating substantial barriers to scalable deployment. To bridge the gap, we propose SiGHT, a self-supervised graph framework designed for efficient hallucination detection in specialized contexts. SiGHT introduces an automated training pipeline that leverages prompt strategies to synthesize plausible hallucinated content from structured knowledge, effectively eliminating the need for human labeling. By mapping texts to high-resolution word-level relational graphs, the framework employs a Graph Attention Network (GAT) to model fine-grained semantic dependencies and identify structural inconsistencies. Empirical evaluations on the MSMARCO-QnA and RAGTruth-QA benchmarks demonstrate that SiGHT achieves a 46.94% relative F1 gain over prior graph baselines. Notably, SiGHT remains competitive with state of the art detectors while utilizing only 0.03M parameters and incurring a minimal inference latency of 0.342 seconds per instance. Dominating the accuracy--efficiency frontier, SiGHT delivers a robust and scalable architecture for real-time hallucination monitoring in high-stakes specialized pipelines.


42
Time-Aware Synthetic Control

Saeyoung Rho ⋅ Cyrus Illick ⋅ Samhitha Narasipura ⋅ Alberto Abadie ⋅ Daniel Hsu ⋅ Vishal Misra

The synthetic control (SC) framework is widely used for observational causal inference with time-series panel data. Despite its success across diverse applications, existing SC methods typically treat pre-intervention time indices as exchangeable, meaning they may fail to exploit temporal structure when strong trends are present. We propose Time-Aware Synthetic Control (TASC), a method that addresses this limitation by adopting a state-space model with a constant trend component while preserving the low-rank structure of the signal. TASC uses the Kalman filter and the Rauch–Tung–Striebel smoother in two steps: it first fits a generative time-series model with expectation–maximization and then performs counterfactual inference. We evaluate TASC on simulated and real-world datasets spanning policy evaluation and sports prediction. Our results demonstrate that TASC offers advantages in settings with high observation noise and long prediction horizons.


43
Best Policy Learning from Trajectory Preference Feedback

Akhil Agnihotri ⋅ Rahul Jain ⋅ Deepak Ramachandran ⋅ Zheng Wen

Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful approach for aligning generative models, but its reliance on learned reward models makes it vulnerable to mis-specification and reward hacking. Preference-based Reinforcement Learning (PbRL) offers a more robust alternative by directly leveraging noisy binary comparisons over trajectories. We study the best policy identification problem in PbRL, motivated by post-training optimization of generative models, for example, during multi-turn interactions. Learning in this setting combines an offline preference dataset—potentially biased or out-of-distribution and collected from a rater of subpar 'competence'—with online pure exploration, making systematic online learning essential. To this end, we propose Posterior Sampling for Preference Learning ($\mathsf{PSPL}$), a novel algorithm inspired by Top-Two Thompson Sampling that maintains posteriors over the reward model and dynamics. We provide the first Bayesian simple regret guarantees for PbRL and introduce an efficient approximation that outperforms existing baselines on simulation and image generation benchmarks.


44
Graph Learning is Suboptimal in Causal Bandits

Mohammad Shahverdikondori ⋅ Jalal Etesami ⋅ Negar Kiyavash

We study regret minimization in causal bandits under causal sufficiency where the underlying causal structure is not known to the agent. Previous work has focused on identifying the reward’s parents and then applying classic bandit methods to them, or jointly learning the parents while minimizing regret. We investigate whether such strategies are optimal. Somewhat counterintuitively, our results show that learning the parent set is suboptimal. We do so by proving that there exist instances where regret minimization and parent identification are fundamentally conflicting objectives. We further analyze both the known and unknown parent set size regimes, establish novel regret lower bounds that capture the combinatorial structure of the action space. Building on these insights, we propose nearly optimal algorithms that bypass graph and parent recovery, demonstrating that parent identification is indeed unnecessary for regret minimization. Experiments confirm that there exists a large performance gap between our method and existing baselines in various environments.


45
From Transformers to State Spaces: GeoMamba-SE(3) for Fast and Accurate Molecular Learning

Jiayu Qin ⋅ Zhengquan Luo ⋅ Jian Chen ⋅ Xuhui Li ⋅ Jiayi Chen ⋅ Zhiqiang Xu

Transformers play an important role in molecular representation learning, enabling unsupervised learning from large scale unlabeled molecule datasets. However, existing Transformer based methods suffer from heavy training computation and slow inference. To accelerate the computation and relieve the burdensome pre-training, we propose a Mamba-based framework that leverages selective state space models to learn molecular representations more efficiently. Unlike conventional methods, our model, GeoMamba-SE(3), offers streamlined computation with linear-time complexity. However, naively applying Mamba to molecules struggles with SE(3) symmetry, representations can drift under rotations/translations—leading to chemically inconsistent features. To address this, we introduce a geometry and statistics aware design: (i) complete local frames at atoms by converting geometric vectors into scalar channels suitable for SSMs; (ii) multi-stream Mamba blocks are modulated by SE(3)-invariant scalars to preserve geometric stability; and (iii) we impose statistical symmetry constraints via orbit-kernel losses and invariant risk minimization, treating SE(3) actions and conformers as environments. This yields practical SE(3) stability without heavy high-order tensor representations. Experiments show that our method achieves new state-of-the-art performance benchmarks on the MoleculeNet datasets, while using only one-sixth of the training computation and 57\% less computation for inference.


46
Interpretable DNA Sequence Classification via Dynamic Feature Generation in Decision Trees

Nicolas HUYNH ⋅ Krzysztof Kacprzyk ⋅ Ryan Sheridan ⋅ David Bentley ⋅ Mihaela van der Schaar

The analysis of DNA sequences has become critical in numerous fields, from evolutionary biology to understanding gene regulation and disease mechanisms. While deep neural networks can achieve remarkable predictive performance, they typically operate as black boxes. Contrasting these black boxes, axis-aligned decision trees offer a promising direction for interpretable DNA sequence analysis, yet they suffer from a fundamental limitation: considering individual raw features in isolation at each split limits their expressivity, which results in prohibitive tree depths that hinder both interpretability and generalization performance. We address this challenge by introducing DEFT, a novel framework that adaptively generates high-level sequence features during tree construction. DEFT leverages large language models to propose biologically-informed features tailored to the local sequence distributions at each node and to iteratively refine them with a reflection mechanism. Empirically, we demonstrate that DEFT discovers human-interpretable and highly predictive sequence features across a diverse range of genomic tasks.


47
Structured Matching via Cost-Regularized Unbalanced Optimal Transport

Emanuele Pardini ⋅ Katerina Papagiannouli

Unbalanced optimal transport (UOT) provides a flexible way to match or compare nonnegative finite Radon measures. However, UOT requires a predefined ground transport cost, which may misrepresent the data’s underlying geometry. Choosing such a cost is particularly challenging when datasets live in heterogeneous spaces, often motivating practitioners to adopt Gromov–Wasserstein formulations. To address this challenge, we introduce cost-regularized unbalanced optimal transport (CR-UOT), a framework that allows the ground cost to vary while allowing mass creation and removal. We show that CR-UOT incorporates unbalanced Gromov–Wasserstein–type problems through families of inner-product costs parameterized by linear transformations, enabling the matching of measures (or point clouds) across Euclidean spaces. We develop algorithms for such CR-UOT problems using entropic regularization and demonstrate that this approach improves the alignment of heterogeneous single-cell omics profiles, especially when many cells lack direct matches.

Pooling is a core operation in convolutional neural networks (CNNs), enabling spatial reduction and hierarchical abstraction. However, standard methods such as max or average pooling operate locally and often fail to capture global context, leading to under- or over-estimation of features. This limits performance on tasks requiring both fine localization and holistic understanding. To address this, we propose Top-$k$\% Contextual Pooling (TkCP), a framework that preserves informative activations based on contextual importance. TkCP includes two variants: (1) Sparse Contextual Pooling, selecting top-$k$\% activations within local windows, and (2) Global Contextual Pooling, selecting top-$k$\% across the entire feature map. Given a kernel size and target resolution, TkCP deterministically sets the stride and reconstructs outputs without additional parameters. Experiments across classification, detection, tracking, segmentation, and generation show consistent improvements in accuracy and robustness. Additionally, TkCP enhances interpretability by tracing high-activation regions across layers.


49
Free Random Projection for In-Context Reinforcement Learning

Tomohiro Hayase ⋅ Benoit Collins ⋅ Nakamasa Inoue

Hierarchical inductive biases are hypothesized to promote generalizable policies in reinforcement learning, as demonstrated by explicit hyperbolic latent representations and architectures. Therefore, a more flexible approach is to have these biases emerge naturally from the algorithm. We introduce Free Random Projection, an input mapping grounded in free probability theory that constructs random orthogonal matrices where hierarchical structure arises inherently. The free random projection integrates seamlessly into existing in-context reinforcement learning frameworks by encoding hierarchical organization within the input space without requiring explicit architectural modifications. Empirical results on multi-environment benchmarks show that free random projection consistently outperforms the standard random projection, leading to improvements in generalization. Furthermore, analyses within linearly solvable Markov decision processes and investigations of the spectrum of kernel random matrices reveal the theoretical underpinnings of free random projection's enhanced performance, highlighting its capacity for effective adaptation in hierarchically structured state spaces.


5
Online Learning-to-Defer with Varying Experts

Yannis Montreuil ⋅ Hoang Dang ⋅ Maxime Meyer ⋅ Lai Xing Ng ⋅ Axel Carlier ⋅ Wei Ooi

Learning-to-Defer (L2D) methods route each query either to a predictive model or to external experts. While existing work studies this problem in batch settings, real-world deployments require handling streaming data, changing expert availability, and shifting expert distribution. We introduce the first online L2D algorithm for multiclass classification with bandit feedback and a dynamically varying pool of experts. Our method achieves regret guarantees of $O((n+n_e)T^{2/3})$ in general and $O((n+n_e)\sqrt{T})$ under a low-noise condition, where $T$ is the time horizon, $n$ the number of labels, and $n_e$ the number of distinct experts observed across rounds. The analysis builds on novel $\mathcal{H}$-consistency bounds for the online framework, combined with first-order methods for online convex optimization. Experiments on synthetic and real-world datasets demonstrate that our approach effectively extends standard Learning-to-Defer to settings with varying expert availability and reliability.


50
Personalized Incentive Alignment: Correcting Utility-Driven Selection Bias in A/B Tests

Jiachun Li ⋅ yang meng ⋅ David Simchi-Levi ⋅ Chonghuan Wang

Although A/B testing is a powerful tool for estimating the average treatment effect (ATE), it often proves impractical in social or commercial settings because ethical and business constraints induce participant non-compliance. For example, patients may refuse assignment to less promising therapies, and users may choose whether to adopt a newly released feature based on personal preferences. In this work, we posit that participants act to maximize individual incentives. To capture this behavior, we adopt a utility-based random choice model that explicitly characterizes the identification bias introduced by self-selection and the estimation instability caused by feature imbalance. We then demonstrate how heterogeneous incentives generate both selection bias and inflated variance. Building on these insights, we design an optimal incentive mechanism that equalizes preference distributions across treatment arms, thereby achieving a more balanced covariate profile, lower variance, and a sharper identified set with minimal bias. Finally, we propose an online learning framework that adaptively identifies the optimal incentive scheme during the experiment and produces valid treatment-effect estimates. We validate our theoretical results through both simulation studies and field experiments.


51
Scalable Model-Based Clustering with Sequential Monte Carlo

Connie Trojan ⋅ Pavel Myshkov ⋅ Paul Fearnhead ⋅ James Hensman ⋅ Tom Minka ⋅ Christopher Nemeth

In online clustering problems, there is often a large amount of uncertainty over possible cluster assignments that cannot be resolved until more data are observed. This difficulty is compounded when clusters follow complex distributions, as is the case with text data. Sequential Monte Carlo (SMC) methods give a natural way of representing and updating this uncertainty over time, but have prohibitive memory requirements for large-scale problems. We propose a novel SMC algorithm that decomposes clustering problems into approximately independent subproblems, allowing a more compact representation of the algorithm state. Our approach is motivated by the knowledge base construction problem, and we show that our method is able to accurately and efficiently solve clustering problems in this setting and others where traditional SMC struggles.


52
Loss Gaps Parity for Fairness in Heterogeneous Federated Learning

Brahim Erraji ⋅ Michaël Perrot ⋅ Aurélien Bellet

While clients may join federated learning to improve performance on data they rarely observe locally, they often remain self-interested, expecting the global model to perform well on their own data. This motivates an objective that ensures all clients achieve a similar \emph{loss gap}—the difference in performance between the global model and the best model they could train using only their local data. To this end, we propose EAGLE, a novel federated learning algorithm that explicitly regularizes the global model to minimize disparities in loss gaps across clients. Our approach is particularly effective in heterogeneous settings, where clients' optimal local models may be misaligned. Unlike existing methods that encourage loss parity, potentially degrading performance for many clients, EAGLE targets fairness in relative improvements. We provide theoretical convergence guarantees for EAGLE under non-convex loss functions, and characterize how its iterates perform relative to the standard federated learning objective using a novel heterogeneity measure. Empirically, we demonstrate that EAGLE reduces the disparity in loss gaps among clients by prioritizing those furthest from their local optimal loss, while maintaining competitive utility in both convex and non-convex cases compared to strong baselines.


53
The Rashomon Effect for Visualizing High-Dimensional Data

Yiyang Sun ⋅ Haiyang Huang ⋅ Gaurav Rajesh Parikh ⋅ Cynthia Rudin

Dimension reduction (DR) is inherently non-unique: multiple embeddings can preserve the structure of high-dimensional data equally well while differing in layout or geometry. In this paper, we formally define the Rashomon set for DR—the collection of `good' embeddings—and show how embracing this multiplicity leads to more powerful and trustworthy representations. Specifically, we pursue three goals. First, we introduce PCA-informed alignment to steer embeddings toward principal components, making axes interpretable without distorting local neighborhoods. Second, we design concept-alignment regularization that aligns an embedding dimension with external knowledge, such as class labels or user-defined concepts. Third, we propose a method to extract common knowledge across the Rashomon set by identifying trustworthy and persistent nearest-neighbor relationships, which we use to construct refined embeddings with improved local structure while preserving global relationships. By moving beyond a single embedding and leveraging the Rashomon set, we provide a flexible framework for building interpretable, robust, and goal-aligned visualizations.

In this paper, we provide the first investigation into adaptive combinatorial experimental design, focusing on the trade-off between regret minimization and statistical power in combinatorial multi-armed bandits (CMAB). While minimizing regret requires repeated exploitation of high-reward arms, accurate inference on reward gaps requires sufficient exploration of suboptimal actions. We formalize this trade-off through the concept of Pareto optimality and establish equivalent conditions for Pareto-efficient learning in CMAB. We consider two relevant cases under different information structures, i.e., full-bandit feedback and semi-bandit feedback, and propose two algorithms MixCombKL and MixCombUCB respectively for these two cases. We provide theoretical guarantees showing that both algorithms are Pareto optimal, achieving finite-time guarantees on both regret and estimation error of arm gaps. Our results further reveal that richer feedback significantly tightens the attainable Pareto frontier, with the primary gains arising from improved estimation accuracy under our proposed methods. Taken together, these findings establish a principled framework for adaptive combinatorial experimentation in multi-objective decision-making.


55
Mask-Conditional Conformal Prediction: Valid Uncertainty For All Missing Data Mechanisms

Jiarong Fan ⋅ Juhyun Park ⋅ Thi Vo ⋅ Nicolas Brunel

Conformal prediction (CP) offers a principled framework for uncertainty quantification, but it fails to guarantee mask-conditional coverage when faced with missing covariates. In addressing the heterogeneity induced by various missing patterns, Mask-Conditional Valid (MCV) Coverage has emerged as a more desirable property than Marginal Coverage. In this work, we adapt split CP to handle missing values by proposing a preimpute-mask-then-correct framework that can offer valid coverage. We show that our method provides guaranteed Marginal Coverage and Mask-Conditional Validity for general missing data mechanisms. A key component of our approach is a reweighted conformal prediction procedure that corrects the prediction sets after distributional imputation (multiple imputation) of the calibration dataset, making our method compatible with standard imputation pipelines. We derive two algorithms and prove that they achieve both marginal validity and MCV. We evaluate them on synthetic and real-world datasets. It reduces significantly the width of prediction intervals w.r.t standard MCV methods, while maintaining the target guarantees.


56
Local Causal Discovery for Statistically Efficient Causal Inference

Mátyás Schubert ⋅ Tom Claassen ⋅ Sara Magliacane

Causal discovery methods can identify valid adjustment sets for causal effect estimation for a pair of target variables, even when the underlying causal graph is unknown. Global causal discovery methods focus on learning the whole causal graph and therefore enable the recovery of optimal adjustment sets, i.e., sets with the lowest asymptotic variance, but they quickly become computationally prohibitive as the number of variables grows. Local causal discovery methods offer a more scalable alternative by focusing on the local neighborhood of the target variables, but are restricted to statistically suboptimal adjustment sets. In this work, we propose Local Optimal Adjustments Discovery (LOAD), a sound and complete causal discovery approach that combines the computational efficiency of local methods with the statistical optimality of global methods. First, LOAD identifies the causal relation between the targets and tests if the causal effect is identifiable by using only local information. If it is identifiable, it finds the possible descendants of the treatment and infers the optimal adjustment set as the parents of the outcome in a modified forbidden projection. Otherwise, it returns the locally valid parent adjustment sets. In our experiments on synthetic and realistic data LOAD outperforms global methods in scalability, while providing more accurate effect estimation than local methods.


57
Near-Optimal Clustering in Mixture of Markov Chains

Junghyun Lee ⋅ Yassir Jedra ⋅ Alexandre Proutiere ⋅ Se-Young Yun

We study the problem of clustering $T$ trajectories of length $H$, each generated by one of $K$ unknown ergodic Markov chains over a finite state space of size $S$. We derive an instance-dependent, high-probability lower bound on the clustering error rate, governed by the stationary-weighted KL divergence between transition kernels. We then propose a two-stage algorithm: Stage I applies spectral clustering via a new injective Euclidean embedding for ergodic Markov chains, a contribution of independent interest enabling sharp concentration results; Stage II refines clusters with a single likelihood-based reassignment step. We prove that our algorithm achieves near-optimal clustering error with high probability under reasonable requirements on $T$ and $H$. Preliminary experiments support our approach, and we conclude with discussions of its limitations and extensions.


58
GL-LowPopArt: A Nearly Instance-Wise Minimax-Optimal Estimator for Generalized Low-Rank Trace Regression

Junghyun Lee ⋅ Kyoungseok Jang ⋅ Kwang-Sung Jun ⋅ Milan Vojnovic ⋅ Se-Young Yun

We present GL-LowPopArt, a novel Catoni-style estimator for generalized low-rank trace regression. Building on LowPopArt (Jang et al., 2024), it employs a two-stage approach: nuclear norm regularization followed by matrix Catoni estimation. We establish state-of-the-art estimation error bounds, surpassing existing guarantees (Fan et al., 2019; Kang et al., 2022), and reveal a novel experimental design objective, GL(π). The key technical challenge is controlling bias from the nonlinear inverse link function, which we address with our two-stage approach. We prove a local minimax lower bound, showing that GL-LowPopArt enjoys instance-wise optimality up to the condition number of the ground-truth Hessian. Our method immediately achieves an improved Frobenius error guarantee for generalized linear matrix completion. We also introduce a new problem setting called bilinear dueling bandits, a contextualized version of dueling bandits with a general preference model. Using an explore-then-commit approach with GL-LowPopArt, we show an improved Borda regret bound over naïve vectorization (Wu et al., 2024).


59
Connectome-Guided Optimization for Deep Networks

Peilin He ⋅ Tananun Songdechakraiwut

The human brain is highly adaptive: its functional connectivity reconfigures on multiple timescales during cognition and learning, enabling flexible information processing. By contrast, artificial neural networks typically rely on manually-tuned learning-rate schedules or generic adaptive optimizers whose hyperparameters remain largely agnostic to a model's internal dynamics. In this paper, we propose Connectome-Guided Automatic Learning Rate (CG-ALR) that dynamically constructs a functional connectome of the neural network from neuron co-activations at each training iteration and adjusts learning rates online as this connectome reconfigures. This connectomics-inspired mechanism adapts step sizes to the network's dynamic functional organization, slowing learning during unstable reconfiguration and accelerating it when stable organization emerges. Our results demonstrate that principles inspired by brain connectomes can inform the design of adaptive learning rates in deep learning, with particularly consistent improvements over traditional SGD-based schedules and competitive performance against Adam-family scheduled baselines and recent adaptive methods.


6
Efficient Swap Regret Minimization in Combinatorial Bandits

Andreas Kontogiannis ⋅ Vasilis Pollatos ⋅ Panayotis Mertikopoulos ⋅ Ioannis Panageas

This paper addresses the problem of designing efficient no-swap regret algorithms for combinatorial bandits, where the number of actions $N$ is exponentially large in the dimensionality of the problem. In this setting, designing efficient no-swap regret translates to sublinear -- in horizon $T$ -- swap regret with polylogarithmic dependence on $N$. In contrast to the weaker notion of external regret minimization -- a problem which is fairly well understood in the literature -- achieving no-swap regret with a polylogarithmic dependence on $N$ has remained elusive in combinatorial bandits. Our paper resolves this challenge, by introducing a no-swap-regret learning algorithm with regret that scales polylogarithmically in $N$ and is tight for the class of combinatorial bandits. To ground our results, we also demonstrate how to implement the proposed algorithm efficiently -- that is, with a per-iteration complexity that also scales polylogarithmically in $N$ -- across a wide range of well-studied applications.


60
Multi-Metric Adaptive Experimental Design Under a Fixed Budget with Validation

Qining Zhang ⋅ Tanner Fiez ⋅ Yi Liu ⋅ Wenyang Liu

A/B tests in online experiments face statistical power challenges when testing multiple candidates simultaneously, while adaptive experimental designs (AED) alone fall short in inferring experiment statistics such as the average treatment effect, especially with many metrics (e.g., revenue, safety) and heterogeneous variances. This paper proposes a fixed-budget multi-metric AED framework with a two-phase structure: an adaptive exploration phase to identify the best treatment, and a validation phase with an A/B test to verify the treatment's quality and infer statistics. We propose SHRVar, which generalizes sequential halving (SH) with a novel relative-variance-based sampling and an elimination strategy built on reward $z$ values. It achieves a provable error probability that decreases exponentially, where the exponent $H_3$ generalizes the complexity measure for SH and SHVar with homogeneous and heterogeneous variances, respectively. Numerical experiments demonstrate its performance and robustness.


61
Three-operator splitting with stale gradients for faster non-linear optimal transport

Jacob Lindbäck ⋅ David Alvarez-Melis ⋅ Mikael Johansson

Scalable optimization for non-linear optimal transport (OT) poses unique challenges; it requires efficient memory management of large matrices, effective parallelization strategies suited for modern accelerators like GPUs, and theoretical guarantees that support practical implementation patterns. To address these challenges, we introduce a new algorithm based on three-operator splitting that reduces gradient computation costs by allowing gradient evaluations to run asynchronously and in parallel with other computations. Using monotone operator theory, we establish new convergence guarantees for this asynchronous adaptation and extend existing results to important non-convex problem classes, including Gromov–Wasserstein as a notable example. We validate our method through a series of experiments demonstrating improved accuracy and faster convergence for a broad range of problems


62
In-Context Function Learning in Large Language Models

Elif Akata ⋅ Konstantinos Voudouris ⋅ Vincent Fortuin ⋅ Eric Schulz

Large language models (LLMs) can learn from a few demonstrations provided at inference time. We study this in-context learning phenomenon through the lens of Gaussian processes (GPs). We build controlled experiments where models observe sequences of function samples drawn from known GP priors. We evaluate prediction error in relation to the number of demonstrations and compare against two principled references: (i) an empirical GP-regression learner that gives a lower bound on achievable error, and (ii) the expected error of a 1-nearest-neighbor (1-NN) rule, which gives a data-driven upper bound. Across model sizes, we find that LLM learning curves are strongly influenced by the function-generating kernels and approach the GP lower bound as the number of demonstrations increases. We then study the inductive biases of these models using a likelihood-based analysis. We find that LLM predictions are most likely under less smooth GP kernels. Finally, we explore whether post-training can shift these inductive biases and improve sample-efficiency on functions sampled from GPs with smoother kernels. We find that both reinforcement learning and supervised fine-tuning can effectively shift inductive biases in the direction of the training data. Together, our framework quantifies the extent to which LLMs behave like GP learners and provides tools for steering their inductive biases for continuous function learning tasks.


63
Hyperbolic Learning with Supervision from any Granularity

Mina GhadimiAtigh ⋅ Max van Spengler ⋅ Teng Long ⋅ Melika Ayoughi ⋅ Tejaswi Kasarla ⋅ Pascal Mettes

Supervised classification commonly follows a one-vs-rest paradigm where each sample belongs to one category from a set of independent classes. In real-world settings, classes are typically not independent, but organized hierarchically from coarse-grained to fine-grained. More pressingly, people naturally annotate at different levels of granularity, depending on their expertise, biases, or data quality. What should be the correct label of a picture of a bird? Is it \emph{animal}, \emph{bird}, \emph{albatross}, or \emph{Laysan albatross}? What if one annotator is an ornithologist and the other has little bird knowledge? Similarly, if two pictures of a \emph{Laysan albatross} differ in blurriness, we tend to annotate blurry ones more generically, as we are unsure of details that differentiate classes at the finest levels. Currently, many annotations are removed, ignored, or reassigned because they do not match the required granularity. Instead of viewing the world as a flat, independent collection of concepts, this paper strives to perform supervised learning with labels at any granularity. We propose a hyperbolic embedding space, where classes are hierarchically organized as prototypes. We introduce a coarse-to-fine Busemann approach, where images are optimized to the correct region of the hyperbolic embedding space by projecting their labels -- which can be as precise or generic as desired -- to ideal prototypes on the boundary of the Poincaré ball. Experiments show that our approach improves multi-granular classification and beats the current state-of-the-art, which views different granularities as independent, instead of a connected tree.

Deep equilibrium models (DEQs) achieve infinitely deep network representations without stacking layers by exploring fixed points of layer transformations in neural networks. Such models constitute an innovative approach that achieves performance comparable to state-of-the-art methods in many large-scale numerical experiments, despite requiring significantly less memory. However, DEQs face the challenge of requiring vastly more computational time for training and inference than conventional methods, as they repeatedly perform fixed-point iterations with no convergence guarantee upon each input. Therefore, this study explored an approach to improve fixed-point convergence and consequently reduce computational time by restructuring the model architecture to guarantee fixed-point convergence. Our proposed approach for image classification, Lipschitz multiscale DEQ, has theoretically guaranteed fixed-point convergence for both forward and backward passes by hyperparameter adjustment, achieving up to a 4.75$\times$ speedup in numerical experiments on CIFAR-10 at the cost of a minor drop in accuracy.


65
Hybrid Meta-Learners for Estimating Heterogeneous Treatment Effects

Zhongyuan Liang ⋅ Lars van der Laan ⋅ Ahmed Alaa

Estimating conditional average treatment effects (CATE) from observational data involves modeling decisions that differ from supervised learning, particularly concerning how to regularize model complexity. Previous approaches can be grouped into two primary meta-learner paradigms that impose distinct inductive biases. Indirect meta-learners first fit and regularize separate potential outcome (PO) models and then estimate CATE by taking their difference, whereas direct meta-learners construct and directly regularize estimators for the CATE function itself. Neither approach consistently outperforms the other across all scenarios: indirect learners perform well when the PO functions are simple, while direct learners outperform when the CATE is simpler than individual PO functions. In this paper, we introduce the Hybrid Learner (H-learner), a novel regularization strategy that interpolates between the direct and indirect regularizations depending on the dataset at hand. The H-learner achieves this by learning intermediate functions whose difference closely approximates the CATE without necessarily requiring accurate individual approximations of the POs themselves. We demonstrate that intentionally allowing suboptimal fits to the POs improves the bias-variance tradeoff in estimating CATE. Experiments conducted on semi-synthetic and real-world benchmark datasets illustrate that the H-learner consistently operates at the Pareto frontier, effectively combining the strengths of both direct and indirect meta-learners.


66
Formally Exploring Time-Series Anomaly Detection Evaluation Metrics

Dennis Wagner ⋅ Arjun Nair ⋅ Billy Franks ⋅ Justus Arweiler ⋅ Aparna Muraleedharan ⋅ Indra Jungjohann ⋅ Fabian Hartung ⋅ Andriy Balinskyy ⋅ Saurabh Varshneya ⋅ Mayank Ahuja ⋅ Nabeel Hussain Syed ⋅ Mayank Nagda ⋅ Philipp Liznerski ⋅ Steffen Reithermann ⋅ Maja Waldron ⋅ Sebastian Vollmer ⋅ Ralf Schulz ⋅ Torsten Katz ⋅ Stephan Mandt ⋅ Michael Bortz ⋅ Heike Leitte ⋅ Daniel Neider ⋅ Jakob Burger ⋅ Fabian Jirasek ⋅ Hans Hasse ⋅ Sophie Fellenz ⋅ Marius Kloft

Detecting anomalies in time series is vital to ensure safety and reliability in many real-world applications. Despite the staggering number of anomaly detection methods, it remains unclear which methods perform best, largely due to flawed evaluation practices. Without rigorous analysis, evaluations yield unintuitive or misleading comparisons. Existing evaluation metrics often focus on specifics and, therefore, fail to capture essential aspects of the anomaly detection task. In this work, we formalize the problem by introducing verifiable properties of evaluation metrics that individually reflect important aspects of anomaly detection in time series. By formalizing requirements and analyzing them systematically, we outline a theoretical framework for evaluating time-series anomaly detection that can support principled evaluations and reliable comparisons. We analyze 37 known metrics and prove that most satisfy only few and none satisfy all properties, explaining many observed inconsistencies in evaluations. To address this gap, we introduce a new flexible evaluation metric LARM that provably satisfies all properties. We illustrate the adaptability of this approach by refining the properties to satisfy stricter requirements and adapting LARM to these advanced properties yielding ALARM.

Function approximation with parametric, feature-based reward models is widely used to enable decision-making in bandits with large action spaces. While bandit learning is well understood in the case of little or no misspecification in the reward approximation, real-world applications can often involve significantly high model misspecification. We study whether optimal learning is still possible under arbitrary misspecification. We identify structural, instance-dependent conditions, determined jointly by the problem instance and model class, under which standard algorithms like $\epsilon$-greedy and LinUCB achieve sublinear regret, despite an arbitrarily large misspecification error in the traditional sense. These results contrast sharply with worst-case analyses that predict linear regret, and show that a broad class of instances remains robust to model error. Our findings offer a theoretical explanation for the empirical success of approximate value-based methods in complex environments.


68
Robust estimation of heterogeneous treatment effects in randomized trials leveraging external data

Rickard Karlsson ⋅ Piersilvio De Bartolomeis ⋅ Issa Dahabreh ⋅ Jesse Krijthe

Randomized trials are typically designed to detect average treatment effects but often lack the statistical power to uncover individual-level treatment effect heterogeneity, limiting their value for personalized decision-making. To address this, we propose the QR-learner, a model-agnostic learner that estimates conditional average treatment effects (CATE) within the trial population by leveraging external data from other trials or observational studies. The proposed method is robust: it can reduce the mean squared error relative to a trial-only CATE learner, and is guaranteed to recover the true CATE even when the external data are not aligned with the trial. Moreover, we introduce a procedure that combines the QR-learner with a trial-only CATE learner and show that it asymptotically matches or exceeds both component learners in terms of mean squared error. We examine the performance of our approach in simulation studies and apply the methods to a real-world dataset, demonstrating improvements in both CATE estimation and statistical power for detecting heterogeneous effects.


69
Accelerated Distributed Optimization with Compression and Error Feedback

Yuan Gao ⋅ Anton Rodomanov ⋅ Jeremy Rack ⋅ Sebastian Stich

Modern machine learning tasks often involve massive datasets and models, necessitating distributed optimization algorithms with reduced communication overhead. Communication compression, where clients transmit compressed updates to a central server, has emerged as a key technique to mitigate communication bottlenecks. However, the theoretical understanding of stochastic distributed optimization with contractive compression remains limited, particularly in conjunction with Nesterov acceleration---a cornerstone for achieving faster convergence in optimization. In this paper, we propose a novel algorithm, ADEF (Accelerated Distributed Error Feedback), which integrates Nesterov acceleration, contractive compression, error feedback, and gradient difference compression. We prove that ADEF achieves the first accelerated convergence rate for stochastic distributed optimization with contractive compression in the general convex regime. Numerical experiments validate our theoretical findings and demonstrate the practical efficacy of ADEF in reducing communication costs while maintaining fast convergence.


7
Minimax-Optimal Two-Sample Test with Sliced Wasserstein

Binh Thuan Tran ⋅ Nicolas Schreuder

We study the problem of nonparametric two-sample testing using the sliced Wasserstein (SW) distance. While prior theoretical and empirical work indicates that the SW distance offers a promising balance between strong statistical guarantees and computational efficiency, its theoretical foundations for hypothesis testing remain limited. We address this gap by proposing a permutation-based SW test and analyzing its performance. The test inherits finite-sample Type I error control from the permutation principle. Moreover, we establish non-asymptotic power bounds and show that the procedure achieves the minimax separation rate $n^{-1/2}$ with respect to the sliced Wasserstein distance over multinomial and bounded-support alternatives. This matches the optimal minimax rate $n^{-1/2}$ achieved by kernel-based tests with respect to the MMD, while leveraging the geometric structure of Wasserstein distances. Our analysis further quantifies the trade-off between the number of projections and statistical power. Finally, numerical experiments demonstrate that the test combines finite-sample validity with competitive power and scalability, and---unlike kernel-based tests, which require careful kernel tuning---it performs consistently well across all scenarios we consider.


70
FastRank: Fast Tensor Rank Approximation based on Spectral Energy

Konstantinos Bougiatiotis ⋅ Georgios Paliouras

Complex multi-dimensional data are often represented as tensors, analyzed through tensor decompositions. A central challenge is selecting the right number of components for the decomposition. In the Canonical Polyadic Decomposition (CPD), this means determining the canonical rank, which directly impacts decomposition quality. Existing methods typically estimate rank by repeatedly computing CPDs, an expensive process. We introduce $FastRank$, a theoretically grounded method that estimates rank without CPD computation. By applying Singular Value Decomposition (SVD) to a sum-reduced matrix of the tensor and analyzing its eigenspectrum, FastRank achieves over $1000\times$ speedup and surpasses state-of-the-art accuracy. We validate it using both synthetic and real data, including noisy settings, and highlight its scalability in knowledge graph completion, where prior methods fail due to computational limitations.


71
Information-Theoretic Error Bounds for Source Localization in Neural Sensing

Leighton Barnes ⋅ Yuxin Guo ⋅ Alex Dytso ⋅ Pulkit Grover

We formulate a point-source localization problem in $d$ dimensions, where a source inside the ball of radius $R$ emits a signal that is picked up by various sensors located at the surface of the ball. For $d=3$, this can model problems in neural sensing, where a net of electroencephalography (EEG) or magnetoencephalography (MEG) sensors try to locate the source of a distinct neural event such as a seizure. For a power law decay model with exponent $\alpha>0$ for the sensors, we obtain a lower bound on the minimax risk for localizing the source that is asymptotically $\frac{d^2\sigma^2R^{2\alpha+2}}{n\alpha^2PK}$ under mean-squared error loss, where $\sigma^2$ is the noise variance, $P$ is the signal power, $K$ is the number of sensors, and $n$ is the number of independent measurements. In the case $d\leq 2(\alpha+1)$ with uniformly distributed sensor locations, we then give a matching upper bound, including getting the exact constant correct, for the asymptotic minimax rate in a neighborhood of the origin. We show that there is a phase transition at $d=2(\alpha+2)$, above which a certain Fisher information quantity is minimized at the boundary of the ball, and below which it is minimized at the origin. At the critical dimension $d=2(\alpha+2)$, the Fisher information is constant throughout the entire parameter space. For the special case $d=3$, we supplement and compare this information-theoretic analysis with a simulated forward EEG model that uses a realistic head model derived from population-averaged magnetic resonance imaging data.

Offline change point detection tries to detect $\textit{time points}$ of distribution change in a given data sequence; and is now routinely used in signal processing, speech processing, climatology etc. Despite this broad applicability across economics, computer science, and planetary sciences, rigorous, nonparametric techniques for change point detection with non-independent and identically distributed (i.i.d.) datasets has remained elusive. This paper establishes such guarantees by proposing a non-parametric clustering algorithm which can accurately obtain the change points from a given Markovian dataset of length $n$. It does so by bridging together two different components of mathematical statistics; Rademacher complexities of Markov chains, and adaptive clustering via penalisation. Our first result uses recent advances in Rademacher complexities of regenerating Markov chains to derive a Dvoretzky Kiefer Wolfowitz (DKW) type inequality for the empirical distribution of the Markov chain. We then use this to show that an adaptive clustering algorithm recovers the correct change points for a Markovian sequence. We establish the tightness of our rates by showing that they essentially coincide with the best known rates for i.i.d. data. We end the paper by discussing the computational considerations of the problem.


73
Learning Geometry and Topology via Multi-Chart Flows

Hanlin Yu ⋅ Soren Hauberg ⋅ Marcelo Hartmann ⋅ Arto Klami ⋅ Georgios Arvanitidis

Real world data often lie on low-dimensional Riemannian manifolds embedded in high-dimensional spaces. This motivates learning degenerate normalizing flows that map between the ambient space and a low-dimensional latent space. However, if the manifold has a non-trivial topology, it can never be correctly learned using a single flow. Instead multiple flows must be `glued together'. In this paper, we first propose the general training scheme for learning such a collection of flows, and secondly we develop the first numerical algorithms for computing geodesics on such manifolds. Empirically, we demonstrate that this leads to highly significant improvements in topology estimation.


74
Lower Bounds for Public-Private Learning under Distribution Shift

Amrith Setlur ⋅ Pratiksha Thaker ⋅ Jonathan Ullman

The most effective differentially private machine learning algorithms in practice rely on an additional source of purportedly public data. This paradigm is most interesting when the two sources combine to be more than the sum of their parts. However, there are settings such as mean estimation where we have strong lower bounds, showing that when the two data sources have the same distribution, there is no complementary value to combining the two data sources. In this work we extend the known lower bounds for public-private learning to a setting where the two data sources exhibit significant distribution shift. Our results apply to both Gaussian mean estimation where the two distributions have different means, and to Gaussian linear regression where the two distributions exhibit parameter shift. We find that when the shift is small (relative to the desired accuracy), either public or private data must be sufficiently abundant to estimate the private parameter. Conversely, when the shift is large, public data provides no benefit.

We introduce CoPRIME (Contrastive Probabilistic Routing for IMbalanced tokens with ELBO-regularized mixture of experts), a probabilistic routing framework for multimodal representation learning that generalizes multimodal representation learning beyond vision-text by tackling the fundamental challenge of extreme token imbalance across modalities. This imbalanced-ness is particularly pronounced between spectrogram-tokenized audio and text. CoPRIME augments contrastive pretraining with an ELBO-regularized routing objective that jointly promotes 1) expert specialization, requiring experts to explain the tokens they receive, and 2) diverse utilization via KL regularization to a uniform prior. To stabilize routing, we further replace standard CoV-based regularizers with entropy-based importance and load losses, yielding smoother gradients and flexible, modality-aware routing without rigid uniformity constraints. On MOSEI and IEMOCAP datasets, CoPRIME achieves state-of-the-art zero- and few-shot emotion and sentiment results, outperforming dense Transformers and prior multimodal MoE variants while retaining the efficiency of sparse conditional computation. Ablations isolate the role of each loss and show that ELBO is the primary driver of stable specialization under modality imbalance, with entropy-based regularizers further improving convergence and utilization.

Real-world tables are messy: column headers are inconsistent, cells contain errors or missing values, and crucial information is scattered across multiple tables and documents. These issues cause even state-of-the-art language models to fail at seemingly simple questions. We present a robust framework for table understanding that explicitly handles these challenges through three coordinated mechanisms: structure-aware encoders that learn invariance to common corruptions, trainable slots that compress evidence to a fixed-size representation, and grounding modules that align each slot to supporting text passages. Unlike prior work that treats tables as flat text or relies on clean schemas, our approach maintains strong performance even under schema corruption and structural perturbations. Across eight benchmarks spanning question answering, fact verification, and text generation, we achieve the best performance among methods without external tools on five tasks and remain competitive with systems using much larger models or SQL executors. Under schema corruption and row/column permutations, our method degrades by less than 2 points, while baselines drop by up to 6-22 points, confirming that explicit denoising and grounding are essential for robust table understanding.

Ensuring fairness in machine learning models is critical, particularly in high-stakes domains where biased decisions can lead to serious societal consequences. However, existing preprocessing approaches generally lack transparent mechanisms for identifying which features are responsible for unfairness. This obscures the rationale behind data modifications. We introduce FairSHAP, a novel preprocessing framework that leverages Shapley value attribution to improve both individual and group fairness. FairSHAP identifies fairness-critical features in the training data using an interpretable measure of feature importance, and systematically modifies them through instance-level matching across sensitive groups. Our method effectively reduces discriminative risk (DR) with an instance-wise guarantee up to an interaction residual term, which is bounded under local matching, while simultaneously bounding the upper limit of demographic parity (DP), which in practice leads to its reduction. Experiments on multiple tabular datasets show that we achieve state-of-the-art or comparable performance across DR, DP, and equality of opportunity (EO) with minimal modifications, thereby preserving data fidelity. As a model-agnostic and transparent method, FairSHAP integrates seamlessly into existing machine learning pipelines and provides actionable insights into the sources of bias. Our code is available on https://github.com/ZhuMuMu0216/FairSHAP.

We study a correlated group testing model where $n$ items are infected according to a Markov chain, which creates bursty infection patterns. In the sparse infections regime, where the expected number of infections scales as $O(n^{\theta})$ with $\theta \in (0,1)$, we propose a non-adaptive testing strategy with an efficient decoding algorithm. Our approach outperforms an optimal yet computationally inefficient independent testing and decoding scheme (one that disregards correlation), under certain parameter regimes. At a high level, we use randomized block testing, where we first sample contiguous blocks of correlated items and then subsample items within selected blocks. Decoding then proceeds in two stages: a coarse elimination step to rule out items appearing in negative tests, followed by a fine thresholding step that declares an item infected if its test participation count exceeds a predefined threshold. Notably, when $\theta \to 0$, our method achieves asymptotically vanishing error while using a number of tests that is within a $1/\ln(2) \approx 1.44$ multiplicative factor of the fundamental entropy bound---a result that parallels the independent group testing setting. Further, we show that the number of tests reduces with an increase in the expected burst length of infected items, quantifying the advantage of exploiting correlation in test design.

Quasar-convex functions form a broad nonconvex class with applications to linear dynamical systems, generalized linear models, and Riemannian optimization, among others. Current nearly optimal algorithms work only in affine spaces due to the loss of one degree of freedom when working with general convex constraints. Obtaining an accelerated algorithm that makes nearly optimal $\bigotilde{1/(\gamma\sqrt{\epsilon})}$ first-order queries to a $\gamma$-quasar convex smooth function \emph{with constraints} was independently asked as an open problem in Martinez-Rubio (2022); Lezane, Langer and Koolen, (2024). In this work, we solve this question by designing an inexact accelerated proximal point algorithm that we implement using a first-order method achieving the aforementioned rate and, as a consequence, we improve the complexity of the accelerated geodesically Riemannian optimization solution in Martínez-Rubio (2022). We also analyze projected gradient descent and Frank-Wolfe algorithms in this constrained quasar-convex setting. To the best of our knowledge, our work provides the first analyses of first-order methods for quasar-convex smooth functions with general convex constraints.


8
Explore-then-Commit for Nonstationary Linear Bandits with Latent Dynamics

Sunmook Choi ⋅ Yahya Sattar ⋅ Yassir Jedra ⋅ Maryam Fazel ⋅ Sarah Dean

We study a nonstationary bandit problem where rewards depend on both actions and latent states, the latter governed by unknown linear dynamics. Crucially, the state dynamics also depend on the actions, resulting in tension between short-term and long-term rewards. We propose an explore-then-commit algorithm for a finite horizon $T$. During the exploration phase, random Rademacher actions enable estimation of the Markov parameters of the linear dynamics, which characterize the action-reward relationship. In the commit phase, the algorithm uses the estimated parameters to design an optimized action sequence for long-term reward. Our proposed algorithm achieves $\tilde{\mathcal{O}}(pT^{2/3})$ regret where $p$ is the action dimension. Our analysis handles two key challenges: learning from temporally correlated rewards, and designing action sequences with optimal long-term reward. We address the first challenge by providing near-optimal sample complexity and error bounds for system identification using bilinear rewards. We address the second challenge by proving an equivalence with indefinite quadratic optimization over a hypercube, a known NP-hard problem. We provide a sub-optimality guarantee for this problem, enabling our regret upper bound. Lastly, we propose a semidefinite relaxation with Goemans-Williamson rounding as a practical approach.

Using the bit string generation problem as a case study, we theoretically compare two standard methods for adapting large language models to new tasks. The first, referred to as *supervised fine-tuning*, involves training a new next token predictor on good generations. The second method, *Best-of-N*, trains a reward model to select good responses from a collection generated by an unaltered base model. If the learning setting is realizable, we find that supervised fine-tuning outperforms BoN through a better dependence on the response length in its rate of convergence. If realizability fails, then depending on the failure mode, BoN can enjoy a better rate of convergence in either $n$ or a rate of convergence with better dependence on the response length.


81
Moonwalk: Inverse-Forward Differentiation

Dmitrii Krylov ⋅ Armin Karamzade ⋅ Roy Fox

Backpropagation’s main limitation is its need to store intermediate activations (residuals) during the forward pass, which restricts the depth of trainable networks. This raises a fundamental question: can we avoid storing these activations? We address this by revisiting the structure of gradient computation. Backpropagation computes gradients through a sequence of vector–Jacobian products, an operation that is generally irreversible. The lost information lies in the cokernel of each layer’s Jacobian. We define submersive networks—networks whose layer Jacobians have trivial cokernels—in which gradients can be reconstructed exactly in a forward sweep without storing activations. For non-submersive layers, we introduce fragmental gradient checkpointing, which records only the minimal subset of residuals necessary to restore the cotangents erased by the Jacobian. Central to our approach is a novel operator, the vector–inverse-Jacobian product (vijp), which inverts gradient flow outside the cokernel. Our mixed-mode algorithm first computes input gradients with a memory-efficient backward pass, then reconstructs parameter gradients in a forward sweep that does not need to store activations. We implement this method, called Moonwalk, and show that it matches backpropagation’s runtime while training networks more than twice as deep under the same memory budget.


82
Towards Characterizing the Complexity of Riemannian Online Convex Optimization

Hibiki Fukushima ⋅ Hiroshi Hirai ⋅ Shinji Ito

Online Convex Optimization (OCO) over Riemannian manifolds raises fundamental questions about how geometry affects algorithmic performance. While Riemannian Online Gradient Descent (R-OGD) has been shown to achieve a regret upper bound of $O(DL\sqrt{\zeta T})$, where $\zeta$ depends on the manifold’s curvature, the tightness of this bound remained unclear. We first establish a matching lower bound of $\Omega(DL\sqrt{\zeta T})$ for R-OGD, valid for any predetermined step-size schedules and for certain types of adaptive step-size schedules. This shows that the worst-case regret of R-OGD is $\Theta(DL\sqrt{\zeta T})$, and that the effect of manifold curvature appears as a multiplicative factor of $\sqrt{\zeta}$ in the regret. In contrast to the Euclidean setting—where OGD is minimax optimal and regret bounds are independent of feedback models—this result reveals that geometry can substantially degrade the performance of first-order algorithms. We also analyze a Riemannian extension of Follow-the-Regularized-Leader, which we term R-FTRL, in the full-information setting. R-FTRL achieves a regret bound of $O(DL\sqrt{T})$, independent of the curvature. This complements recent curvature-independent guarantees for full-information methods obtained by different algorithmic approaches. Together with our lower bound for R-OGD, our results support a separation between first-order and full-information models in non-Euclidean settings, and highlight the subtle interactions between feedback structure, algorithm design, and geometry.

Systems of interacting continuous time Markov chains are a powerful model class, but inference is typically intractable in high-dimensional settings. Auxiliary information, such as noisy observations, is typically only available at discrete times, and incorporating it via a Doob's $h$-transform gives rise to an intractable posterior process that requires approximation. We introduce Latent Interacting Particle Systems, a model class parameterizing the generator of each Markov chain in the system. Our inference method involves estimating look-ahead functions (twist potentials) that anticipate future information, for which we introduce an efficient parameterization. We incorporate this approximation in a twisted Sequential Monte Carlo sampling scheme. We demonstrate the effectiveness of our approach on a challenging posterior inference task for a latent SIRS model on a graph, and on a neural model for wildfire spread dynamics trained on real data.


84
Set to Be Fair: Demographic Parity Constraints for Set-Valued Classification

Eyal Cohen ⋅ Christophe Denis ⋅ Mohamed Hebiri

Set-valued classification is used in multiclass settings where confusion between classes can occur and lead to misleading predictions. However, its application may amplify discriminatory bias motivating the development of set-valued approaches under fairness constraints. In this paper, we address the problem of set-valued classification under demographic parity and expected size constraints. We propose two complementary strategies: an oracle-based method that minimizes classification risk while satisfying both constraints, and a computationally efficient proxy that prioritizes constraint satisfaction. For both strategies, we derive closed-form expressions for the (optimal) fair set-valued classifiers and use these to build plug-in, data-driven procedures for empirical predictions. We establish distribution-free convergence rates for violations of the size and fairness constraints for both methods, and under mild assumptions we also provide excess-risk bounds for the oracle-based approach. Empirical results demonstrate the effectiveness of both strategies and highlight the efficiency of our proxy method.

Synthetic Minority Oversampling Technique (SMOTE) is a common rebalancing strategy for handling imbalanced tabular data sets. However, few works analyze SMOTE theoretically. In this paper, we derive several non-asymptotic upper bound on SMOTE density. From these results, we prove that SMOTE (with default parameter) tends to copy the original minority samples asymptotically. We confirm and illustrate empirically this first theoretical behavior on a real-world data-set. Furthermore, we prove that SMOTE density vanishes near the boundary of the support of the minority class distribution. We then adapt SMOTE based on our theoretical findings to introduce two new variants. These strategies are compared on $13$ tabular data sets with $10$ state-of-the-art rebalancing procedures, including deep generative and diffusion models. First, for most data sets, applying no rebalancing strategy is competitive in terms of predictive performances, would it be with LightGBM, tuned random forests or logistic regression. Second, when the imbalance ratio is artificially augmented, one of our two modifications of SMOTE leads to promising predictive performances compared to SMOTE and other strategies.

We model neural training as a classical multi-time map from controllable interventions---batch choices, augmentations, and optimizer micro-steps---to model predictions on a fixed probe set. On this basis, we introduce a simple, model-agnostic witness of training memory based on back-flow of distinguishability. In a controlled two-step protocol, we compare predictive distributions after one intervention versus two; a positive increase $\Delta_{\mathrm{BF}} = D_2 - D_1 > 0$, with $D\in\{\mathrm{TV}, \mathrm{JS}, \mathrm{H}\}$, certifies observable-level non-Markovianity. Across controlled SGD experiments, we observe consistent positive back-flow with tight bootstrap confidence intervals, stronger effects under higher momentum, larger batch overlap, and more micro-steps, and marked collapse under a \emph{causal break} that resets optimizer state. The witness is inexpensive, requires no architectural changes, and is robust across TV/JS/Hellinger. We position this as a measurement contribution: a practical diagnostic, and empirical evidence, that real training often deviates from the Markov idealization in ways that matter for optimizer behavior, data order, and schedule design.


87
Learning How Deep to Go: Self-Scaling Deep Reinforcement Learning

Michelangelo Vegliò ⋅ Marco Fantozzi ⋅ Antonio Di Cecco ⋅ Carlo Metta ⋅ Flora Angileri ⋅ Simone Treccani ⋅ Adrienne Macazar ⋅ Silvia Galfre' ⋅ Maurizio Parton ⋅ Francesco Morandin

Deep Reinforcement Learning (DRL) has achieved remarkable results in complex sequential decision-making tasks, often using very deep neural networks. However, these architectures incur substantial computational and energy costs, and selecting the optimal network depth in advance remains an open challenge. In this paper, we introduce SCALE-RL, a self-scaling DRL framework that dynamically adjusts its architectural depth during training, allowing the network to automatically adapt its depth to the task. Integrated into an AlphaZero-style pipeline for Othello, our approach matches the playing strength of the baseline agent while reducing network depth by 50\%. This process not only translates into substantial savings in computation and energy but also enhances model interpretability through the additive decomposition of decision-making across layers. Our results suggest that enabling DRL models to discover the complexity they require, rather than relying on fixed, over-parameterized architectures, makes it possible to develop more efficient, interpretable, and sustainable DRL agents.


88
Stationarity-Aware Causal Discovery in Time Series via Minimal Separating Sets

Shanyun Gao ⋅ Raghavendra Addanki ⋅ Tong Yu ⋅ Ryan Rossi ⋅ Qifan Song ⋅ Murat Kocaoglu

Discovering causal relationships from observational time series is a fundamental problem with broad applications in climate science, healthcare, and finance. Causal graphs with time-lagged structure capture the effects of underlying mechanisms over time. Under the causal stationarity assumption, these causal mechanisms remain consistent across time. Existing constraint-based methods leverage stationarity for conditional independence testing and reduce the problem to learning the parents of variables at the final time point, which can then be used to reconstruct the stationary graph. However, their separating set search strategy mimics the PC algorithm and does not take advantage of the stationary structure. We observe that the stationary graph structure and autoregressive edges impose many meaningful constraints on the separating sets between variables at different time lags. After characterizing the behavior of such separating sets, we propose a novel causal discovery algorithm that exploits this structure of minimal separating sets. Extensive evaluations on synthetic and real-world datasets demonstrate the robustness and accuracy of our method.


89
Computationally lightweight classifiers with frequentist bounds on predictions

Shreeram Murali ⋅ Cristian Rojas ⋅ Dominik Baumann

While both classical and neural network classifiers can achieve high accuracy, they fall short on offering uncertainty bounds on their predictions, making them unfit for safety-critical applications. Existing kernel-based classifiers that provide such bounds scale with $\mathcal O (n^{\sim3})$ in time, making them computationally intractable for large datasets. To address this, we propose a novel, computationally efficient classification algorithm based on the Nadaraya-Watson estimator, for whose estimates we derive frequentist uncertainty intervals. We evaluate our classifier on synthetically generated data and on electrocardiographic heartbeat signals from the MIT-BIH Arrhythmia database. We show that the method achieves competitive accuracy >96% at $\mathcal O(n)$ and $\mathcal O(\log n)$ operations, while providing actionable uncertainty bounds. These bounds can, e.g., aid in flagging low-confidence predictions, making them suitable for real-time settings with resource constraints, such as diagnostic monitoring or implantable devices.

Power transforms are popular parametric methods for making data more Gaussian-like, and are widely used as preprocessing steps in statistical analysis and machine learning. However, we find that direct implementations of power transforms suffer from severe numerical instabilities, which can lead to incorrect results or even crashes. In this paper, we provide a comprehensive analysis of the sources of these instabilities and propose effective remedies. We further extend power transforms to the federated learning setting, addressing both numerical and distributional challenges that arise in this context. Experiments on real-world datasets demonstrate that our methods are both effective and robust, substantially improving stability compared to existing approaches.


90
Where the Score Lives: A Wavelet View of Diffusion

Emma Finn ⋅ Binxu Wang ⋅ T. Keller ⋅ Demba Ba

Score-based generative models have had remarkable success over the last decade in generating a diverse set of visually plausible images. A variety of architectures including CNNs, U-Nets, and Transformers have been used as the score-approximation network in such diffusion modeling; however, to date, relatively little is known about how these architectural choices impact generative behavior. In this work, to provide insight into this area, we propose an analytically solvable parameterization of the score function using an expansion in a 2D orthogonal wavelet basis. In particular, we derive interpretable optimal score functions in terms of the moments of the data distribution. We use this parametrization to provide an architecture-agnostic, moment-based analysis that reveals which attributes of the data distribution tend to matter most for denoising. Our score machine is flexible enough to partially mimic the relevant inductive biases of multiple architectures, including U-Nets, and CNNs, taking a step towards understanding why different score architectures can exhibit distinct generative behavior. Since our score is solvable in terms of the moments of the data, we can begin to understand how the data distribution interacts with the score network to produce the behavior we observe in diffusion models.

The area under the ROC curve (AUC) is a key metric for classification tasks, valued for its robustness to class imbalance. Sparse models trained with $\ell_0$ constraints further enhance interpretability and generalization. Building on prior work that reformulates nonlinear AUC maximization as a pointwise compositional optimization problem, we revisit this formulation as the basis for addressing the black-box setting, where only function evaluations are available. A central challenge arises from integrating zeroth-order gradient estimation with hard-thresholding operators in the compositional framework, which has remained unresolved. To overcome this difficulty, we propose the Zeroth-Order Stochastic Compositional Hard-Thresholding (ZO-SCHT) algorithm, which, to the best of our knowledge, is the first method for black-box sparse AUC maximization. We establish that ZO-SCHT achieves linear convergence up to a tolerance bound under a fixed step size. Extensive experiments on both black-box sparse AUC maximization and black-box adversarial attack tasks demonstrate the effectiveness and versatility of our approach.


92
Tensor Gaussian Processes: Efficient Solvers for Nonlinear PDEs

Qiwei Yuan ⋅ Zhitong Xu ⋅ Yinghao Chen ⋅ Yiming Xu ⋅ Houman Owhadi ⋅ Shandian Zhe

Machine learning solvers for partial differential equations (PDEs) have attracted growing interest. However, most existing approaches, such as neural network solvers, rely on stochastic training, which is inefficient and typically requires a great many training epochs. Gaussian process (GP)/kernel-based solvers, while mathematical principled, suffer from scalability issues when handling large numbers of collocation points often needed for challenging or higher-dimensional PDEs. To overcome these limitations, we propose TGPS, a tensor-GP-based solver that introduces factor functions along each input dimension using one-dimensional GPs and combines them via tensor decomposition to approximate the full solution. This design reduces the task to learning a collection of one-dimensional GPs, substantially lowering computational complexity, and enabling scalability to massive collocation sets. For efficient nonlinear PDE solving, we use a partial freezing strategy and Newton's method to linerize the nonlinear terms. We then develop an alternating least squares (ALS) approach that admits closed-form updates, thereby substantially enhancing the training efficiency. We establish theoretical guarantees on the expressivity of our model, together with convergence proof and error analysis under standard regularity assumptions. Experiments on several benchmark PDEs demonstrate that our method achieves superior accuracy and efficiency compared to existing approaches. The code is released at \url{https://github.com/BayesianAIGroup/TGPSolve-NonLinear-PDEs}

Submodular maximization has become increasingly important in the fields of machine learning and data mining. For general submodular maximization without monotonicity, many previous analyses provide poor approximation guarantees, especially for submodular functions that are approximately monotone. To address this issue, the research community has proposed a continuous metric called the monotonicity ratio for submodular functions. The monotonicity ratio has been studied for submodular maximization under no constraint, a cardinality constraint, and a matroid constraint. However, the implications of using the monotonicity ratio for submodular maximization with a knapsack constraint remain unclear. Although a knapsack constraint can be regarded as a continuous extension of the cardinality constraint with non-uniform costs, the gap in analysis between these two constraints is substantial. In this paper, we analytically show that many previously proposed algorithms for monotone submodular maximization with a knapsack constraint can achieve improved approximation guarantees under partial monotonicity with a simple modification: enforcing positive marginal gain. In addition, we evaluate our proposed algorithms for two machine learning applications of movie recommendation and influence-and-exploit marketing, showing that our algorithms could achieve better empirical performance than state-of-the-art algorithms under partial monotonicity.


94
Transportability Without Graphs: A Bayesian Approach to Identifying s-Admissible Backdoor Sets

Konstantina Lelova ⋅ Gregory Cooper ⋅ Sofia Triantafillou

Transporting causal information across populations is a critical challenge in clinical decision-making. Causal modeling provides criteria for identifiability and transportability, but these require knowledge of the causal graph, which rarely holds in practice. We propose a Bayesian method that combines observational data from the target domain with experimental data from a different domain to identify s-admissible backdoor sets, which enable unbiased estimation of causal effects across populations, without requiring the causal graph. We prove that if such a set exists, we can always find one within the Markov boundary of the outcome, narrowing the search space, and we establish asymptotic convergence guarantees for our method. We develop a greedy algorithm that reframes transportability as a feature selection problem, selecting conditioning sets that maximize the marginal likelihood of experimental data given observational data. In simulated and semi-synthetic data, our method correctly identifies transportability bias, improves causal effect estimation, and performs favorably against alternatives.


95
Boosted GFlowNets: Improving Exploration via Sequential Learning

Pedro Dall’Antonia ⋅ Tiago Silva ⋅ Daniel Augusto de Souza ⋅ César Lincoln Mattos ⋅ Diego Mesquita

Generative Flow Networks (GFlowNets) are powerful samplers for compositional objects that, by design, sample proportionally to a given non-negative reward. Nonetheless, in practice, they often struggle to explore the reward landscape evenly: trajectories toward easy-to-reach regions dominate training, while hard-to-reach modes receive vanishing or uninformative gradients, leading to poor coverage of high-reward areas. We address this imbalance with Boosted GFlowNets, a method that sequentially trains an ensemble of GFlowNets, each optimizing a residual reward that compensates for the mass already captured by previous models. This residual principle reactivates learning signals in underexplored regions and, under mild assumptions, ensures a monotone non-degradation property: adding boosters cannot worsen the learned distribution and typically improves it. Empirically, Boosted GFlowNets achieve substantially better exploration and sample diversity on multimodal synthetic benchmarks and peptide design tasks, while preserving the stability and simplicity of standard trajectory-balance training.


96
Busemann Functions in the Wasserstein Space: Existence, Closed-Forms, and Applications to Slicing

Clément Bonet ⋅ Elsa Cazelles ⋅ Lucas Drumetz ⋅ Nicolas Courty

The Busemann function has recently found many interests in a variety of geometric machine learning problems, as it naturally defines projections onto geodesic rays of Riemannian manifolds and generalizes the notion of hyperplanes. As several sources of data can be conveniently modeled as probability distributions, it is natural to study this function in the Wasserstein space, which carries a rich formal Riemannian structure induced by Optimal Transport metrics. In this work, we investigate the existence and computation of Busemann functions in Wasserstein space, which admits geodesic rays. We establish closed-form expressions in two important cases: one-dimensional distributions and Gaussian measures. These results enable explicit projection schemes for probability distributions on $\mathbb{R}$, which in turn allow us to define novel Sliced-Wasserstein distances over Gaussian mixtures and labeled datasets. We demonstrate the efficiency of those original schemes on synthetic datasets as well as transfer learning problems.


97
Data Distribution Valuation Using Generalized Bayesian Inference

Ngoc Cuong Nguyen ⋅ Cuong V. Nguyen

We investigate the data distribution valuation problem, which aims to quantify the values of data distributions from their samples. This is a recently proposed problem that is related to but different from classical data valuation and can be applied to various applications. For this problem, we develop a novel framework called Generalized Bayes Valuation that utilizes generalized Bayesian inference with a loss constructed from transferability measures. This framework allows us to solve, in a unified way, seemingly unrelated practical problems, such as annotator evaluation and data augmentation. Using the Bayesian principles, we further improve and enhance the applicability of our framework by extending it to the continuous data stream setting. Our experiment results confirm the effectiveness and efficiency of our framework in different real-world scenarios.


98
Provable Accelerated Bayesian Optimization with Knowledge Transfer

Haitao Lin ⋅ Boxin Zhao ⋅ Mladen Kolar ⋅ Chong Liu

We study how to accelerate Bayesian optimization (BO) on a target task by transferring historical knowledge from related source tasks. Existing work on BO with knowledge transfer either lacks theoretical guarantees or achieves the same regret as BO in the non-transfer setting, $\tilde{\mathcal{O}}(\sqrt{T \gamma_f})$, where $T$ is the number of evaluations of the target function and $\gamma_f$ denotes its information gain. In this paper, we propose the DeltaBO algorithm, which builds a novel uncertainty-quantification approach on the difference function $\delta$ between the source and target functions, which are allowed to belong to different Reproducing Kernel Hilbert Spaces (RKHSs). Under mild assumptions, we prove that the regret of DeltaBO is of order $\tilde{\mathcal{O}}(\sqrt{T(T/N+\gamma_\delta)})$, where $N$ denotes the number of evaluations from source tasks and typically $N \gg T$. In many applications, source and target tasks are similar, which implies that $\gamma_\delta$ can be much smaller than $\gamma_f$. Empirical studies on both real-world hyperparameter-tuning tasks and synthetic functions show that DeltaBO outperforms other baseline methods and also verify our theoretical claims. Our code is available on GitHub.


99
Generalized Correlation Shifting for Lasso

Izuru Miyazaki ⋅ Hironori Fujisawa

The Lasso has been widely used in a high-dimensional setting, but its estimation accuracy may become inadequate when the covariates are highly correlated or when the number of covariates is extremely large. To overcome this problem, we propose a novel preconditioner that adaptively induces a low-rank structure in the design matrix. The proposed preconditioner achieves a higher probability of sign correctness under some conditions. We establish theoretical guarantees showing that our method dominates the standard Lasso, and we further demonstrate its superiority over the correlation shifting. To validate its practical effectiveness, we conducted numerical experiments on synthetic and semi-real datasets, and the proposed method presented better performance than existing methods.