Skip to yearly menu bar Skip to main content


Poster Session

Poster Session 3

Main Ballroom
Mon 4 May 7 a.m. PDT — 10 a.m. PDT
Abstract:
Chat is not available.


1
Batch-Adaptive Causal Annotations

Ezinne Nwankwo ⋅ Lauri Goldkind ⋅ Angela Zhou

Estimating the causal effects of interventions is crucial to policy and decision-making, yet outcome data are often missing or subject to non-standard measurement error. While ground-truth outcomes can sometimes be obtained through costly data annotation or follow-up, budget constraints typically allow only a fraction of the dataset to be labeled. We address this challenge by optimizing which data points should be sampled for outcome information in order to improve efficiency in average treatment effect estimation with missing outcomes. We derive a closed-form solution for the optimal batch sampling probability by minimizing the asymptotic variance of a doubly robust estimator for causal inference with missing outcomes. Motivated by our street outreach partners, we extend the framework to costly annotations of unstructured data, such as text or images in healthcare and social services. Across simulated and real-world datasets, including one of outreach interventions in homelessness services, our approach achieves substantially lower mean-squared error and recovers the AIPW estimate with fewer labels than existing baselines. In practice, we show that our method can match confidence intervals obtained with 361 random samples using only 90 optimized samples—saving 75% of the labeling budget.

State-of-the-art post-training pipelines for reasoning LLMs rely on bootstrapped reasoning loops: they sample many traces, score them, and reinforce the highest-scoring ones, typically by correctness. This can improve accuracy while still collapsing the distribution inside the correct set onto a narrow family of redundant strategies, reducing creative problem-solving. To diagnose this failure mode, we introduce Distributional Creative Reasoning (DCR), a variational framework that casts training as gradient flow on the simplex of reasoning traces. The framework yields three core results. First, a diversity-decay analysis shows that STaR-style rejection fine-tuning and exact mean-field GRPO amplify whichever correct trace is already larger, while DPO regresses pairwise correct-trace ratios toward the reference ratios. Second, it explains why entropy and KL can slow or tether collapse but do not reward semantically distinct correct strategies for being distinct, and how a creativity kernel supplies the missing relational term. Third, under mild conditions, the resulting dynamics converge to a unique, stable, and diverse equilibrium, yielding practical guidance for kernel and hyperparameter design. DCR thus offers a principled route to training reasoning LLMs that remain both correct and creative.


100
Generalization Bounds for Spectral GNNs via Fourier Domain Analysis

Vahan Martirosyan ⋅ Daniele Malitesta ⋅ Hugues Talbot ⋅ Jhony Giraldo ⋅ Fragkiskos Malliaros

Spectral graph neural networks learn graph filters, but their behavior with increasing depth and polynomial order is not well understood. We analyze these models in the graph Fourier domain, where each layer becomes an element-wise frequency update, separating the fixed spectrum from trainable parameters and making depth and order explicit. In this setting, we show that Gaussian complexity is invariant under the Graph Fourier Transform, which allows us to derive data-dependent, depth, and order-aware generalization bounds together with stability estimates. In the linear case, our bounds are tighter, and on real graphs, the data-dependent term correlates with the generalization gap across polynomial bases, highlighting practical choices that avoid frequency amplification across layers.


101
Efficient Model Performance Evaluation Using a Combination of Expert and Crowd-sourced Labels

Sam Corbett-Davies ⋅ Viet-An Nguyen ⋅ Udi Weinsberg

As models, particularly large language models (LLMs), are deployed on increasingly challenging tasks, correctly evaluating their performance is growing in importance and difficulty. Expert human labelers are high-quality but scarce and resource-intensive to obtain, while crowd-sourced labels are more readily accessible at scale but lower in quality. We propose Maven (Model And Voter EvaluatioN), a hierarchical Bayesian model that combines these two label sources to produce model performance estimates on binary tasks that are less biased than using crowd-sourced labels alone and have lower variance than using expert labels alone. By modeling the ranking of model scores, Maven is robust to a range of prediction distributions and achieves constant inference time regardless of dataset size. We validate our approach on both simulated and real-world data, and deploy it to measure production models at Meta.


102
Deep Polynomial Chaos Expansion

Johannes Exenberger ⋅ Sascha Ranftl ⋅ Robert Peharz

Polynomial chaos expansion (PCE) is a classical and widely used surrogate modeling technique in physical simulation and uncertainty quantification. By taking a linear combination of a set of basis polynomials - orthonormal with respect to the distribution of uncertain input parameters - PCE enables tractable inference of key statistical quantities such as (conditional) means, variances, covariances, and Sobol sensitivity indices, which are essential for understanding the modeled system and identifying influential parameters and their interactions. The applicability of PCE to high-dimensional problems is limited by poor scalability, as the number of basis functions grows exponentially with the number of parameters. In this paper, we address this challenge by combining PCE with ideas from tractable probabilistic circuits, resulting in deep polynomial chaos expansion (DeepPCE) - a deep generalization of PCE that scales effectively to high-dimensional input spaces. DeepPCE achieves predictive performance comparable to that of multilayer perceptrons (MLPs), while retaining PCE's ability to compute exact statistical inferences via simple forward passes. In contrast, such computations in MLPs require costly and often inaccurate approximations, such as Monte Carlo integration.


103
On Global Convergence Rates for Federated Softmax Policy Gradient under Heterogeneous Environments

Safwan Labbi ⋅ Paul Mangold ⋅ Daniil Tiapkin ⋅ Eric Moulines

We provide global convergence rates for vanilla and entropy-regularized federated softmax stochastic policy gradient ($\texttt{FedPG}$) with local training. We show that $\texttt{FedPG}$ converges to a near-optimal policy in terms of the average agent value, with a gap controlled by the level of heterogeneity. Remarkably, we obtain the first convergence rates for entropy-regularized policy gradient *with explicit constants*, leveraging a projection-like operator. Our results build upon a new analysis of federated averaging for non-convex objectives, based on the observation that the Łojasiewicz-type inequalities from the single-agent setting (Mei et al., 2020) do not hold for the federated objective. This uncovers a fundamental difference between single-agent and federated reinforcement learning: while single-agent optimal policies can be deterministic, federated objectives may inherently require stochastic policies.

In standard RL, the structure of the Markov Decision Process (e.g. state space) is known. In online model selection, a learner attempts to learn an optimal policy for an MDP knowing only that it belongs to one of $M >1$ model classes of varying complexity. Recent results have shown that this can be feasibly accomplished in episodic online RL. In this work, we propose $\textsf{MRBEAR}$, an online model selection algorithm for the average reward RL setting which is based on the idea of regret balancing and elimination. The regret of the algorithm is in $\tilde O(M C_{m*}^2 B_{m*}(T,\delta))$ where $C_{m*}$ represents the complexity of the simplest well-specified model class and $B_{m^*}(T,\delta)$ is its corresponding regret bound. This result shows that in average reward RL, the additional cost of model selection scales only linearly in $M$, the number of model classes. As an application, in a simultaneous general-sum repeated game, where the opponent follows a fixed unknown limited memory strategy, the learner can maximize its utility using $\textsf{MRBEAR}$. By proving a lower bound, we showed the learner's regret is tight in opponent's memory order. In addition, the algorithm's performance is demonstrated through experiments.


105
RamPINN: Recovering Raman Spectra From Coherent Anti-Stokes Spectra Using Embedded Physics

Sai Karthikeya Vemuri ⋅ Adithya Ashok Chalain Valapil ⋅ Tim Büchner ⋅ Joachim Denzler

Transferring the recent advancements in deep learning into scientific disciplines is hindered by the lack of the required large-scale datasets for training. We argue that in these knowledge-rich domains, the established body of scientific theory provides reliable inductive biases in the form of governing physical laws. We address the ill-posed inverse problem of recovering Raman spectra from noisy Coherent Anti-Stokes Raman Scattering (CARS) measurements, as the true Raman signal here is suppressed by a dominating non-resonant background. We propose RamPINN, a model that learns to recover Raman spectra from given CARS spectra. Our core methodological contribution is a physics-informed neural network that utilizes a dual-decoder architecture to disentangle resonant and non-resonant signals. This is done by enforcing the Kramers-Kronig causality relations via a differentiable Hilbert transform loss on the resonant and a smoothness prior on the non-resonant part of the signal. Trained entirely on synthetic data, RamPINN demonstrates strong zero-shot generalization to real-world experimental data, explicitly closing this gap and significantly outperforming existing baselines. Furthermore, we show that training with these physics-based losses alone, without access to any ground-truth Raman spectra, still yields competitive results. This work highlights a broader concept: formal scientific rules can act as a potent inductive bias, enabling robust, self-supervised learning in data-limited scientific domains.


106
LLM-as-a-Judge on a Budget

Aadirupa Saha ⋅ Aniket Wagde ⋅ Branislav Kveton

LLM-as-a-judge has emerged as a cornerstone technique for evaluating large language models by leveraging LLM reasoning to score prompt-response pairs. Since LLM judgments are stochastic, practitioners commonly query each pair multiple times to estimate mean scores accurately. This raises a critical challenge: given a fixed computational budget $B$, how to optimally allocate queries across $K$ prompt-response pairs to minimize estimation error? % We present a principled variance-adaptive approach leveraging multi-armed bandit theory and concentration inequalities. Our method dynamically allocates queries based on estimated score variances, focusing resources where uncertainty is highest. Our algorithm is shown to achieve a worst-case score-estimation error of $\tilde{O}\left(\sqrt{\frac{\sum_{i=1}^K \sigma_i^2}{B}}\right)$, $\sigma_i^2$ being the unknown score variance for pair $i \in [K]$ with near-optimal budget allocation. % Experiments on HelpSteer2 dataset demonstrate our method significantly outperforms uniform allocation, reducing worst-case estimation error given a fixed budget. % Our work establishes a theoretical foundation for efficient LLM evaluation with practical implications for AI safety, model alignment, and automated assessment at


107
Multiclass Local Calibration with the Jensen-Shannon Distance

Cesare Barbera ⋅ Lorenzo Perini ⋅ Giovanni De Toni ⋅ Andrea Passerini ⋅ Andrea Pugnana

Developing trustworthy Machine Learning (ML) models requires their predicted probabilities to be well-calibrated, meaning they should reflect true-class frequencies. Among calibration notions in multiclass classification, strong calibration is the most stringent, as it requires all predicted probabilities to be simultaneously calibrated across all classes. However, existing approaches to multiclass calibration lack a notion of distance among inputs, which makes them vulnerable to proximity bias: predictions in sparse regions of the feature space are systematically miscalibrated. In this work, we address this main shortcoming by introducing a local perspective on multiclass calibration. First, we formally define multiclass local calibration and establish its relationship with strong calibration. Second, we theoretically analyze the pitfalls of existing evaluation metrics when applied to multiclass local calibration. Third, we propose a practical method to enhance local calibration in Neural Networks, which enforces alignment between predicted probabilities and local estimates of class frequencies using the Jensen-Shannon distance. Finally, we empirically validate our approach against existing multiclass calibration techniques.


108
Private and Efficient Federated Statistical Learning

Jaemu Heo ⋅ Xiwen Feng ⋅ Jeonghun Kang ⋅ Taehwan Kim ⋅ Changgee Chang

Federated Learning (FL) enables collaborative model training across multiple sites while preserving data privacy and differential privacy (DP) provides a probabilistic framework to safeguard sensitive information when sharing output derived from data. While numerous DP-FL methods exist for large-scale prediction models, achieving DP, efficiency, and robustness in federated statistical learning remains a significant challenge. In this work, we propose a novel federated statistical learning framework that ensures efficient, robust, and privacy-preserving estimation. We introduce a new noising mechanism that encodes Fisher information along with the maximum likelihood estimate (MLE) by leveraging multiple noisy copies of the MLE. To calibrate noise effectively, we extend the smooth sensitivity to account for data-dependent correlations, ensuring strong DP guarantees while maintaining utility. Additionally, we develop INFEMBLER, an information-assembling algorithm that efficiently de-noises multiple noisy MLE copies using a hierarchical Bayesian model and an expectation-maximization (EM) algorithm. INFEMBLER significantly enhances estimation efficiency over existing methods and is inherently robust, providing estimates at least as reliable as those derived from local data alone, thereby preserving the benefits of FL. We establish its asymptotic properties and validate its effectiveness through experiments on both simulated and real datasets, demonstrating its superior statistical efficiency and robustness.


109
Learning Linear Regression with Low-Rank Tasks In-Context

Kaito Takanami ⋅ Takashi Takahashi ⋅ Yoshiyuki Kabashima

In-context learning (ICL) is a key building block of modern large language models, yet its theoretical mechanisms remain poorly understood. It is particularly mysterious how ICL operates in real-world applications where tasks have a structure. In this work, we address this problem by analyzing a linear attention model trained on low-rank regression tasks. Within this setting, we precisely characterize the distribution of predictions and the generalization error in the high-dimensional limit. Moreover, we find that statistical fluctuations in finite pre-training data induce an implicit regularization. Finally, we identify a sharp phase transition of the generalization error governed by task structure. These results provide a framework for understanding how transformers learn to learn the task structure.


11
Conformal Margin Risk Minimization: An Envelope Framework for Robust Learning under Label Noise

Yuanjie Shi ⋅ Peihong Li ⋅ Zijian Zhang ⋅ Jana Doppa ⋅ Yan Yan

Most methods for learning with noisy labels require privileged knowledge such as noise transition matrices, clean subsets or pretrained feature extractors, resources typically unavailable when robustness is most needed. We propose ***C**onformal **M**argin **R**isk **M**inimization (CMRM)*, a plug-and-play envelope framework that improves \emph{any} classification loss under label noise by adding a single quantile-calibrated regularization term, with no privileged knowledge or training pipeline modification. CMRM measures the confidence margin between the observed label and competing labels, and thresholds it with a conformal quantile estimated per batch to focus training on high-margin samples while suppressing likely mislabeled ones. We derive a learning bound for CMRM under arbitrary label noise requiring only mild regularity of the margin distribution. Across five base methods and six benchmarks with synthetic and real-world noise, CMRM consistently improves accuracy (up to $+3.39\\%$), reduces conformal prediction set size (up to $-20.44\\%$) and does not hurt under 0\% noise, showing that CMRM captures a method-agnostic uncertainty signal that existing mechanisms did not exploit.


110
TESLA: Taylor Expansion of Sinusoidal Learnable Activations

Daehwa Ko ⋅ JaeHyeon Kim ⋅ SeungHyun Ham ⋅ Jay Hoon Jung

The parity problem—deciding whether the number of ones in a binary vector is odd or even—remains challenging for standard neural networks due to linear inseparability and the need for global interactions. We propose TESLA, an activation defined as a learnable combination of sine and cosine terms, enabling explicit control over polynomial degree and selective amplification of high-order components. Theoretically, we show that constraining TESLA’s coefficients yields Lipschitz/Rademacher complexity bounds and shapes the training dynamics to emphasize higher-frequency structure. Empirically, on parity with input length $n=32$, TESLA attains strong generalization with 100K training samples ($\approx 0.002\%$ of the $2^{32}$ input space) and remains robust under heavy corruption, retaining high accuracy with up to 30\% label noise. We also compare against periodic and frequency-based baselines (SIREN, SNAKE, and Fourier feature embeddings) on parity and Forrelation. Beyond synthetic structure, TESLA delivers comparable performance on ImageNet-100, indicating that activation-level degree control transfers to more general vision workloads. Code: \url{https://github.com/KAU-QuantumAILab/TESLA}


111
Aggregation on Learnable Manifolds for Asynchronous Federated Optimisation

Archie Licudi ⋅ Anshul Thakur ⋅ Soheila Molaei ⋅ Danielle Belgrave ⋅ David Clifton

Asynchronous federated learning (FL) with heterogeneous clients faces two key issues: curvature-induced loss barriers encountered by standard linear parameter interpolation techniques (e.g. FedAvg) and interference from stale updates misaligned with the server’s current optimisation state. To alleviate these issues, we introduce a geometric framework that casts aggregation as curve learning in a Riemannian model space and decouples choice of update direction from staleness conflict resolution. Within this, we propose $\textbf{AsyncBezier}$, which replaces linear aggregation with low-degree polynomial (Bézier) trajectories to bypass loss barriers, and $\textbf{OrthoDC}$, which orthogonally projects delayed updates to reduce interference. We establish framework-level convergence guarantees covering each variant given simple assumptions on their components. On three datasets spanning general-purpose and healthcare domains, including LEAF Shakespeare and FEMNIST, our approach consistently improves accuracy and client fairness over strong asynchronous baselines; finally, we show that these gains are preserved even when other methods are allocated a higher local compute budget.


112
Adversarial Debiasing for Parameter Recovery

Luke Sanford ⋅ Megan Ayers ⋅ Matthew Gordon ⋅ Eliana Stone

Advances in machine learning and the increasing availability of high-dimensional data have led to the proliferation of social science research that uses the predictions of machine learning models as proxies for outcomes of interest. However, prediction errors from machine learning models can lead to bias in downstream estimation tasks, including regression. In this paper, we show how this bias can arise, propose a test for detecting bias, and demonstrate the use of an adversarial machine learning algorithm in order to generate predictions suitable for unbiased downstream estimation. Here, we focus on a setting where machine-learned predictions are the dependent variable in a regression. We conduct simulations and empirical exercises using ground truth and satellite data on forest cover in Africa. Using the predictions from a naive machine learning model leads to biased parameter estimates, while the predictions from the adversarial model recover the true coefficients. Our approach consistently matches or exceeds the performance of existing methods.


113
Randomized HyperSteiner: A Stochastic Delaunay Triangulation Heuristic for the Hyperbolic Steiner Minimal Tree

Aniss Aiman MEDBOUHI ⋅ Alejandro García-Castellanos ⋅ Giovanni Luca Marchetti ⋅ Daniel Pelt ⋅ Erik Bekkers ⋅ Danica Kragic

We study the problem of constructing Steiner Minimal Trees (SMTs) in hyperbolic space. Exact SMT computation is NP-hard, and existing hyperbolic heuristics such as HyperSteiner are deterministic and often get trapped in locally suboptimal configurations. We introduce Randomized HyperSteiner (RHS), a stochastic Delaunay triangulation heuristic that incorporates randomness into the expansion process and refines candidate trees via Riemannian gradient descent optimization. Experiments on synthetic data sets and a real-world single-cell transcriptomic data show that RHS outperforms Minimum Spanning Tree (MST), Neighbour Joining, and vanilla HyperSteiner (HS). In near-boundary configurations, RHS can achieve a 32\% reduction in total length over HS, demonstrating its effectiveness and robustness in diverse data regimes.


114
Empirically Calibrated Conditional Independence Tests

Milleno Pan ⋅ Antoine de Mathelin ⋅ Wesley Tansey

Conditional independence tests (CIT) are widely used for causal discovery and feature selection. Even with false discovery rate (FDR) control procedures, they often fail to provide frequentist guarantees in practice. We highlight two common failure modes: (i) in small samples, asymptotic guarantees for many CITs can be inaccurate and even correctly specified models fail to estimate the noise levels and control the error, and (ii) when sample sizes are large but models are misspecified, unaccounted dependencies skew the test's behavior and fail to return uniform p-values under the null. We propose Empirically Calibrated Conditional Independence Tests (ECCIT), a method that measures and corrects for miscalibration. For a chosen base CIT (e.g., GCM, HRT), ECCIT optimizes an adversary that selects features and response functions to maximize a miscalibration metric. ECCIT then fits a monotone calibration map that adjusts the base-test p-values in proportion to the observed miscalibration. Across empirical benchmarks on synthetic and real data, ECCIT achieves valid FDR with higher power than existing calibration strategies while remaining test agnostic. Code is available at https://github.com/tansey-lab/ECCIT.


115
MineGrad: Gradient Inversion Attacks on LoRA Fine-Tuning

Hasin Us Sami ⋅ Swapneel Sen ⋅ Basak Guler

Parameter-efficient fine-tuning (PEFT), such as low-rank adaptation (LoRA), has recently been adopted in federated learning to reduce communication and computation costs. In this setup, users download a pretrained model from the server prior to fine-tuning, and then fine-tune lightweight LoRA modules locally while keeping the pretrained model frozen, sharing only the gradients of the fine-tuning parameters with the server. Despite its growing popularity, robustness of federated fine-tuning against an adversarial server remains underexplored, where the server maliciously tampers with the training protocol to breach the privacy of users’ data. In this work, we investigate gradient inversion attacks on LoRA fine-tuning. We propose an analytical attack that enables a malicious server to recover private user data by leveraging a poisoned pretrained model and fine-tuning parameters. Our design embeds fine-tuning data within the shared gradients, to allow the server to analytically reconstruct user data. Unlike prior works, our attack is applicable to both language and vision tasks, does not rely on computationally expensive (adversarial) pretraining with public datasets or require the number of training tokens to be less than the rank of LoRA modules. Experimental results on both language and vision tasks demonstrate high-fidelity data recovery across multiple baselines, revealing several critical vulnerabilities.


116
Discrete State Diffusion Models: A Sample Complexity Perspective

Aadithya Srikanth ⋅ Mudit ⋅ Vaneet Aggarwal

Diffusion models have demonstrated remarkable performance in generating high-dimensional samples across domains such as vision, language, and the sciences. Although continuous-state diffusion models have been extensively studied both empirically and theoretically, discrete-state diffusion models, essential for applications involving text, sequences, and combinatorial structures, remain significantly less understood from a theoretical standpoint. In particular, all existing analyses of discrete-state models assume score estimation error bounds without studying sample complexity results. In this work, we present a principled theoretical framework for discrete-state diffusion, providing the first sample complexity bound of $\widetilde{\mathcal{O}}(\epsilon^{-2})$. Our structured decomposition of the score estimation error into statistical, approximation, optimization, and clipping components offers critical insights into how discrete-state models can be trained efficiently. This analysis addresses a fundamental gap in the literature and establishes the theoretical tractability and practical relevance of discrete-state diffusion models.


117
Optimized Projection-Free Algorithms for Online Learning: Construction and Worst-Case Analysis

Julien Weibel ⋅ Pierre Gaillard ⋅ Wouter Koolen ⋅ Adrien Taylor

This work studies and develops projection-free algorithms for online learning with linear optimization oracles (a.k.a.~Frank--Wolfe) for handling the constraint set, and for convex loss functions. More precisely, this work (i) shows how to exploit semidefinite programming to jointly design and analyze online Frank--Wolfe-type algorithms numerically in a variety of settings, (ii) leverages those design techniques to propose an improved (optimized) variant of an online Frank--Wolfe algorithm along with its conceptually simple potential-based proof, and (iii) extends this proof to its anytime version, which benefits from a similar $O(T^{3/4})$ regret rate without requiring knowledge of the time horizon $T$ in advance. We are not aware of other direct regret guarantees for an anytime version of online Frank--Wolfe without using the classical doubling trick. Based on the semidefinite technique, we conclude with strong numerical evidence suggesting that no pure online Frank--Wolfe algorithm within our model class can have a regret guarantee better than $O(T^{3/4})$ without additional assumptions, that the current algorithms do not have optimal constants, and that multiple linear optimization rounds do not generally help to obtain better regret bounds.


119
BOAT: Navigating The Sea of in Silico Predictors for Antibody Design via Multi-Objective Bayesian Optimization

Jackie Rao ⋅ Ferran Gonzalez ⋅ Leon Gerard ⋅ Alexandra Gessner

Antibody lead optimization is inherently a multi-objective challenge in drug discovery. Achieving a balance between different drug-like properties is crucial for the development of viable candidates, and this search becomes exponentially challenging as desired properties grow. The ever-growing zoo of sophisticated in silico tools for predicting antibody properties calls for an efficient joint optimization procedure to overcome resource-intensive sequential filtering pipelines. We present BOAT, a versatile Bayesian optimization framework for multi-property antibody engineering. Our 'plug-and-play' framework couples uncertainty-aware surrogate modeling with a genetic algorithm to jointly optimize various predicted antibody traits while enabling efficient exploration of sequence space. Through systematic benchmarking against genetic algorithms and newer generative learning approaches, we demonstrate competitive performance with state-of-the-art methods for multi-objective protein optimization. We identify clear regimes where surrogate-driven optimization outperforms expensive generative approaches and establish practical limits imposed by sequence dimensionality and oracle costs.


120
Tractable Uncertainty-Aware Meta-Learning

Young-Jin Park ⋅ Cesar Almecija ⋅ Apoorva Sharma ⋅ Navid Azizan

Meta-learning is a popular approach for learning new tasks with limited data by leveraging the commonalities among different tasks. However, meta-learned models can perform poorly when context data is too limited, or when data is drawn from an out-of-distribution (OoD) task. Especially in safety-critical settings, this necessitates an uncertainty-aware approach to meta-learning. In addition, the often multimodal nature of task distributions can pose unique challenges to meta-learning methods. To this end, we present LUMA, a meta-learning method for regression that (1) makes probabilistic predictions on in-distribution tasks efficiently, (2) is capable of detecting OoD context data, and (3) handles heterogeneous, multimodal task distributions effectively. The strength of our framework lies in its solid theoretical basis, enabling analytically tractable Bayesian inference on a linearized model for principled uncertainty estimation and robust generalization. We achieve this by adopting a probabilistic perspective and learning a parametric, tunable task distribution via Bayesian inference on a linearized neural network, leveraging Gaussian process theory. Moreover, we make our approach computationally tractable by leveraging a low-rank prior covariance learning scheme based on the Fisher Information Matrix. Our numerical analysis demonstrates that LUMA quickly adapts to new tasks and remains accurate even in low-data regimes, it effectively detects OoD tasks, and that both of these properties continue to hold for multimodal task distributions.


121
CAWI: Copula-Aligned Weight Initialization for Randomized Neural Networks

Mushir Akhtar ⋅ M. Tanveer ⋅ Mohd. Arshad

Randomized neural networks (RdNNs) enable efficient, backpropagation-free training by freezing randomly initialized input-to-hidden weights, which permits a closed-form solution for the output layer. However, conventional random initialization is blind to inter-feature dependence—ignoring correlations, asymmetries, and tail dependence in the data—which degrades conditioning and predictive performance. To the best of our knowledge, this limitation remains unaddressed in the RdNN literature. To close this gap, we propose CAWI (Copula-Aligned Weight Initialization), a framework that draws input-to-hidden weights from a data-fitted copula that matches empirical dependence, ensuring the frozen projections respect inter-feature dependence without sacrificing closed-form solution. CAWI (i) maps each feature to the unit interval using empirical CDFs, (ii) fits a multivariate copula that captures rank-based dependence among features, and (iii) samples each weight column $w_j$ from the fitted copula and applies a fixed inverse marginal transform to set scale. The objective, solver, and ``freeze-once'' paradigm remain unchanged; only the sampling law for $W$ becomes dependence-aware. For dependence modeling, we consider two copula families: elliptical (Gaussian, t) and Archimedean (Clayton, Frank, Gumbel). This enables CAWI to handle diverse dependence, including tail dependence. We evaluate CAWI across 83 diverse classification benchmarks (binary and multiclass) and two biomedical datasets, BreaKHis and the Schizophrenia dataset, using standard shallow and deep RdNN architectures. CAWI consistently delivers significant improvements in predictive performance over conventional random initialization. Codes are provided at \url{https://github.com/mtanveer1/CAWI}.


122
GRANITE: A Generalized Regional Framework for Identifying Agreement in Feature-Based Explanations

Julia Herbinger ⋅ Gabriel Laberge ⋅ Maximilian Muschalik ⋅ Yann Pequignot ⋅ Marvin N. Wright ⋅ Fabian Fumagalli

Feature-based explanation methods aim to quantify how features influence the model's behavior, either locally or globally, but different methods often disagree, producing conflicting explanations. This disagreement arises primarily from two sources: how feature interactions are handled and how feature dependencies are incorporated. We propose GRANITE, a generalized regional explanation framework that partitions the feature space into regions where interaction and distribution influences are minimized. This approach aligns different explanation methods, yielding more consistent and interpretable explanations. GRANITE unifies existing regional approaches, extends them to feature groups, and introduces a recursive partitioning algorithm to estimate such regions. We demonstrate its effectiveness on real-world datasets, providing a practical tool for consistent and interpretable feature explanations.


123
Adaptive A/B Testing under Nonstationary Dynamics using State-Space Models

Junzhe Shao ⋅ Waverly Wei ⋅ Jingshen Wang

A/B testing is central to evaluating how modifications to products, services, and user experiences impact user outcomes. Yet in practice, experiments rarely occur in stationary environments: seasonality, feature launches, and dynamically evolved user demographics make the underlying treatment effects shift over time. Conventional fixed-allocation designs fail to adapt to this nonstationarity, relying on static treatment allocations that potentially compromise estimation efficiency and lead to inefficient use of experimental resources. Response-adaptive randomization (RAR) design provides a natural alternative, adaptively allocating participants over time based on accrued information. In this work, we propose a methodology framework that addresses these challenges. On the one hand, we model period-level treatment arm means as autoregressive state-space processes and develop a Kalman smoother estimator for the time-averaged treatment effect that exploits temporal dependence. On the other hand, we propose an RAR design that accommodates nonstationarity by incorporating state uncertainty via predicted Kalman variances. Our theoretical analysis establishes asymptotic normality of both a naive and a smoother-based estimator, proves that the smoother strictly dominates the naive estimator in asymptotic variance under correct specification, compares relative efficiency, and enables the construction of anytime-valid confidence sequences for continuous monitoring. Simulation studies demonstrate that our method is significantly more efficient than a benchmark time-averaging estimator and fixed allocation strategy, particularly under treatment effect drift and variance imbalance.


124
Fundamental Limits for Weighted Empirical Approximations of Exponentially Tilted Distributions

Sarvesh Iyer ⋅ Himadri Mandal ⋅ Dhruman Gupta ⋅ Rushil Gupta ⋅ Agniv Bandyopadhyay ⋅ Achal Bassamboo ⋅ Sandeep Juneja ⋅ Varun Gupta

Generating samples from exponentially tilting a given distribution of random vectors when samples from the given distribution are available finds applications in fields such as finance and climate science and in the broad area of rare event simulation. In this article, we discuss the asymptotic efficiency of an estimator obtained by exponentially tilting the empirical distribution. We provide a sharp characterization of how much one can accurately tilt distributions given a certain number of samples. Our findings reveal a surprising dichotomy: While twisting unbounded distributions is a fundamentally hard task, for bounded distributions, one can accurately tilt by a large amount using much fewer samples.


125
Network Inversion for Extreme-Case Training-Like Data Reconstruction

Pirzada Suhail ⋅ Sunny Gupta ⋅ Amit Sethi

Machine learning models are often trained on proprietary or private datasets that cannot be openly shared. However, the trained model weights are frequently distributed under the assumption that sharing model parameters does not compromise the confidentiality or privacy of the training data. In this work, we challenge this assumption by presenting \textbf{Training-Like Data Reconstruction (TLDR)}, as a general-purpose and architecture-agnostic framework for reconstructing training data from a fully trained classifier. Our approach leverages network inversion techniques to recover data that closely resembles the original training samples by exploiting key properties of the classifier with respect to the training data, without requiring access to training dynamics, gradients, pre-trained models, auxiliary datasets, or unobvious priors. Operating in this extreme setting, we demonstrate successful reconstruction of samples with high similarity to the original training data from diverse classifier architectures highlighting critical privacy concerns associated with sharing model parameters. While prior work in this extreme setting has been limited to binary MLP classifiers trained on small datasets, our framework extends to multi-class classification tasks for models based on diverse architectures trained on significantly larger and more complex datasets. Furthermore, we provide quantitative evaluation using the Structural Similarity Index Measure (SSIM) to compare the reconstructed samples with the training samples.


126
Robust Federated Clustering under Heterogeneity and Adversaries

Martín Bravo ⋅ Sebastian Dalleiger

Clustering distributed and private data is an increasingly important task across domains that handle sensitive information, such as life sciences and clinical research. In federated settings, clustering faces three challenges: heterogeneous client data distributions, adversarial behavior, and strict privacy requirements. Existing approaches often exhibit significant performance degradation under these conditions and fail to return accurate solutions. To overcome these limitations, we introduce a novel federated clustering algorithm that combines client-level differential privacy with Byzantine-robust aggregation at the server, based on a novel efficient and robust clustering procedure. Our method comes with theoretical robustness guarantees, and through extensive experiments on synthetic and real-world data, we demonstrate that it produces high-quality clusters in just a few communication rounds, even in scenarios where state-of-the-art methods fail.


127
On the Neural Feature Ansatz for Deep Neural Networks

Edward Tansley ⋅ Estelle Massart ⋅ Coralia Cartis

Understanding feature learning is an important open question in establishing a mathematical foundation for deep neural networks. The Neural Feature Ansatz (NFA) states that after training, the Gram matrix of the first-layer weights of a deep neural network is proportional to some power $\alpha>0$ of the average gradient outer product (AGOP) of this network with respect to its inputs. Assuming gradient flow dynamics with balanced weight initialization, the NFA was proven to hold throughout training for two-layer linear networks with exponent $\alpha = 1/2$ (Radhakrishnan et al., 2024). We extend this result to networks with $L \geq 2$ layers, showing that the NFA holds with exponent $\alpha = 1/L$, thus demonstrating a depth dependency of the NFA. Furthermore, we prove that for unbalanced initialization, the NFA holds asymptotically through training if weight decay is applied. We also provide counterexamples showing that the NFA does not hold for some network architectures with nonlinear activations, even when these networks fit arbitrarily well the training data. We thoroughly validate our theoretical results through numerical experiments across a variety of optimization algorithms, weight decay rates and initialization schemes.

The vast majority of modern deep learning models are trained with momentum-based first-order optimizers. The momentum term governs the optimizer's memory by determining how much each past gradient contributes to the current convergence direction. Fundamental momentum methods, such as Nesterov Accelerated Gradient and the Heavy Ball method, as well as more recent optimizers such as AdamW and Lion, all rely on the momentum coefficient that is customarily set to $\beta = 0.9$ and kept constant during model training, a strategy widely used by practitioners, yet suboptimal. In this paper, we introduce an adaptive memory mechanism that replaces constant momentum with a dynamic momentum coefficient that is adjusted online during optimization. We derive our method by approximating the objective function using two planes: one derived from the gradient at the current iterate and the other obtained from the accumulated memory of the past gradients. To the best of our knowledge, such a proximal framework was never used for momentum-based optimization. Our proposed approach is novel, extremely simple to use, and does not rely on extra assumptions or hyperparameter tuning. We implement adaptive memory variants of both SGD and AdamW across a wide range of learning tasks, from simple convex problems to large-scale deep learning scenarios, demonstrating that our approach can outperform standard SGD and Adam with hand-tuned momentum coefficients. Finally, our work opens doors for new ways of inducing adaptivity in optimization.


129
A New Perspective on Minimum-Norm Interpolation Under Gaussian Covariates

Gil Kur ⋅ Zong Shang ⋅ Paul Simanjuntak ⋅ Guillaume Lecué ⋅ Reese Pathak

Minimum-Norm Interpolators (MNI) in overparameterized linear models have gained attention as a tractable framework for studying interpolation phenomena that resemble empirical observations in neural networks. Most prior work on these interpolators either exploits closed-form solutions when available or relies heavily on Gaussian comparison results, such as the convex Gaussian Min-Max Theorem (CGMT). In this paper, we introduce a new perspective on MNI under isotropic Gaussian covariates by leveraging tools from high-dimensional geometry. First, we obtain a ``localized'' bound on the MNI's shrinkage of the original ground truth that occurs under isotropic Gaussian covariates when the norm is in an isotropic position. Then, we prove a sharp bound on the Mean Squared Error (MSE) of the $\ell_1$-MNI, as obtained by Wang 22' via a geometric proof, which avoids invoking the CGMT and instead relies on the work of Fleury 12' on Gaussian polytopes.

We study a multi-agent multi-armed bandit problem (MA-MAB) in open systems, where multiple agents can enter and leave at any time and face multiple bandit problems to minimize the group-wise cumulative regret. To our knowledge, this is the first work to consider a dynamic set of agents that arrive and depart according to stochastic processes, systematically evolving over time. We also extend to a permissionless blockchain-based MA-MAB (PB-MA-MAB) problem, where agents may behave either honestly or maliciously depending on compliance with the mechanism, and malicious agents may disrupt honest ones. These formulations pose new challenges, as regret grows with the increasing number of agents. To this end, we design new UCB-based methodologies for both MA-MAB and PB-MA-MAB, introducing information-integration rules for existing agents and information-access mechanisms for new agents to fully leverage available information. We derive regret bounds for our algorithms and characterize the complexity of the formulation via regret lower bounds in both settings. We establish regret upper bounds of order $\max\{O(M_0), O(\log T), O(\tfrac{\log^2 T}{(M_0)})1_{\{\lambda > 0\}}\}$ (a significant improvement over the naïve bound $(M_0 + T)\log T$), where $M_0$ is the initial number of agents and $C$ reflects the arrival/departure rate. We also prove lower bounds of $O(\log T)$ and $O(M_0)$ for all consistent algorithms, and tighter bounds of $O(\log T + M_0)$ or $O(\log^2 T)$ for a subset including ours. These imply that our algorithm is nearly optimal in general and optimal in certain cases.


130
CoreSPECT: Enhancing Clustering Algorithms via an Interplay of Density and Geometry

Chandra Sekhar Mukherjee ⋅ Joonyoung Bae ⋅ Jiapeng Zhang

In this paper, we provide a novel perspective on the underlying structure of real-world data with ground-truth clustering via characterization of an abundantly observed yet often overlooked density–geometry correlation. We leverage this correlation to design CoreSPECT (Core Space Projection based Enhancement of Clustering Techniques), a general framework that improves the performance of generic clustering algorithms. Our framework boosts the performance of clustering algorithms by applying them to strategically selected regions, then extending the partial partition to a complete partition for the dataset using a novel neighborhood graph based multi-layer propagation procedure. We provide initial theoretical support of the functionality of our framework under the assumption of our model, and then provide large-scale real-world experiments on 20 datasets that include standard image datasets as well as genomics datasets. We observe two notable improvements. First, CoreSPECT improves the NMI of K-Means by 20 % on average, making it competitive (and in some cases surpassing) the state-of-the art manifold-based clustering algorithms, while being orders of magnitude faster. Secondly, our framework boosts the NMI of HDBSCAN by more than 100 % on average, making it competitive to the state-of-the-art in several cases without requiring the true number of clusters and hyper-parameter tuning. The overall ARI improvements are higher.


131
Eliciting Truthful Feedback for Preference-Based Learning via the VCG Mechanism

Leo Landolt ⋅ Anna Maddux ⋅ Andreas Schlaginhaufen ⋅ Saurabh Vaishampayan ⋅ Maryam Kamgarpour

We study resource allocation problems in which a central planner allocates resources among strategic agents with private cost functions in order to minimize a social cost, defined as an aggregate of the agents’ costs. This setting poses two main challenges: (i) the agents’ cost functions may be unknown to them or difficult to specify explicitly, and (ii) agents may misreport their costs strategically. To address these challenges, we propose an algorithm that combines preference-based learning with Vickrey–Clarke–Groves (VCG) payments to incentivize truthful reporting. Our algorithm selects informative preference queries via D-optimal design, estimates cost parameters through maximum likelihood, and computes VCG allocations and payments based on these estimates. In a one-shot setting, we prove that the mechanism is approximately truthful, individually rational, and efficient up to an error of $\tilde{\mathcal O}(K^{-1/2})$ for $K$ preference queries per agent. In an online setting, these guarantees hold asymptotically with sublinear regret at a rate of $\tilde{\mathcal O}(T^{2/3})$ after $T$ rounds. Finally, we validate our approach through a numerical case study on demand response in local electricity markets.


132
Dual Averaging Converges for Nonconvex Smooth Stochastic Optimization

Tuo Liu ⋅ El Mehdi Saad ⋅ Wojciech Kotlowski ⋅ Francesco Orabona

Dual averaging and gradient descent with their stochastic variants stand as the two canonical recipe books for first-order optimization: Every modern variant can be viewed as a descendant of one or the other. In the convex regime, these algorithms have been deeply studied, and we know that the two classes are essentially equivalent in terms of theoretical guarantees. On the other hand, in the non-convex setting, the situation is drastically different: While it is provable that SGD can minimize the gradient norm of non-convex smooth functions, no finite-time complexity guarantee for Stochastic Dual Averaging (SDA) was known in the same setting. In this paper, we close this gap by a reduction that views SDA as SGD applied to a sequence of implicitly regularized objectives. We show that a tuned SDA exhibits a rate of convergence $\mathcal{O}(1 / T + \sigma \log T/ \sqrt{T})$, similar to that of SGD under the same assumptions. To our best knowledge, this is the first complete convergence theory for dual averaging on non-convex smooth stochastic problems without restrictive assumptions, closing a long-standing open problem in the field. Beyond the base algorithm, we also discuss ADA-DA, a variant that marries SDA with AdaGrad's auto-scaling, which achieves the same rate without requiring knowledge of the noise variance.


133
Differentially Private and Federated Structure Learning in Bayesian Networks

Ghita Fassy El Fehri ⋅ Aurélien Bellet ⋅ Philippe Bastien

Learning the structure of a Bayesian network from decentralized data poses two major challenges: (i) ensuring rigorous privacy guarantees for participants, and (ii) avoiding communication costs that scale poorly with dimensionality. In this work, we introduce Fed-BNSL, a novel federated method for learning linear Gaussian Bayesian network structures that addresses both challenges. By combining differential privacy with greedy updates that target only a few relevant edges per participant, Fed-BNSL efficiently uses the privacy budget while keeping communication costs low. Our careful algorithmic design preserves model identifiability and enables accurate structure estimation. Experiments on synthetic and real datasets demonstrate that Fed-BNSL achieves utility close to non-private baselines while offering substantially stronger privacy and communication efficiency.


134
Adaptive Diffusion Guidance via Stochastic Optimal Control

Iskander Azangulov ⋅ Peter Potaptchik ⋅ Qinyu Li ⋅ Eddie Aamari ⋅ George Deligiannidis ⋅ Judith Rousseau

Classifier-Free Guidance (CFG) is a cornerstone of modern diffusion models, playing a pivotal role in conditional generation and enhancing the quality of unconditional samples. However, current approaches to CFG scheduling—determining the appropriate guidance weight—are largely heuristic and lack a solid theoretical foundation. This work addresses these limitations on two fronts. First, we provide a theoretical formalization that precisely characterizes the relationship between guidance strength and classifier confidence. Second, building on this insight, we introduce a stochastic optimal control framework that casts CFG scheduling as an adaptive optimization problem. In this formulation, guidance strength is not fixed but dynamically selected based on time, the current sample, and the conditioning class, either independently or in combination. By solving the resulting control problem, we establish a principled foundation for more effective guidance in diffusion models.

Replicable algorithms produce identical outputs with high probability when run on independent samples drawn from the same distribution, providing strong reproducibility guarantees for machine learning pipelines. We study replicability in machine learning in Vapnik's general learning setting, which encompasses stochastic optimization over convex and non-convex loss classes, establishing algorithms with near-optimal sample complexity across these settings. For general Lipschitz losses over a bounded parameter space, we show that the exponential mechanism combined with correlated sampling achieves optimal $O(1/\sqrt{n})$ excess risk with $\rho$-replicability guarantees, but at the cost of exponential runtime. For general Lipschitz losses, the exponential mechanism with correlated sampling achieves optimal $O(1/\sqrt{n})$ excess risk and $\rho$-replicability, but with exponential runtime. For strongly convex losses over a $d$-dimensional parameter space, empirical risk minimization (ERM) paired with randomized rounding achieves $\widetilde{O}(\sqrt{d}/(\rho\sqrt{n}))$ excess risk in polynomial time. For general convex losses, regularized ERM yields excess risk of $\widetilde{O}(n^{-1/4})$. We further extend our techniques to overparameterized neural networks in the Neural Tangent Kernel (NTK) regime. Taken together, our results provide evidence for a fundamental computational-statistical tradeoff in replicable learning, whereby optimal replicability requires exponential time while our polynomial-time algorithms incur a modest but provable statistical penalty.


136
Efficient Bilevel Optimization with KFAC-Based Hypergradients

Disen Liao ⋅ Felix Dangel ⋅ Yaoliang Yu

Bilevel optimization (BO) is widely applicable to many machine learning problems. Scaling BO, however, requires repeatedly computing hypergradients, which involves solving inverse Hessian-vector products (IHVPs). In practice, these operations are often approximated using crude surrogates such as one-step gradient unrolling or identity/short Neumann expansions, which discard curvature information. We build on implicit function theorem-based algorithms and propose to incorporate Kronecker-factored approximate curvature (KFAC), yielding curvature-aware hypergradients with a better performance efficiency trade-off than Conjugate Gradient (CG) or Neumann methods and consistently outperforming unrolling. We evaluate this approach across diverse tasks, including meta-learning and AI safety problems. On models up to BERT, we show that curvature information is valuable at scale, and KFAC can provide it with only modest memory and runtime overhead. Our implementation is available at \url{https://github.com/liaodisen/NeuralBo}.


137
A Polynomial-Time Approximation for Pairwise Fair $k$-Median Clustering

Sayan Bandyapadhyay ⋅ Eden Chlamtáč ⋅ Zachary Friggstad ⋅ Mahya Jamshidian ⋅ Yury Makarychev ⋅ Ali Vakilian

In this work, we study pairwise fair $k$-Median with $\ell \ge 2$ groups, where for every cluster $C$ and every group $i \in [\ell]$, the number of points in $C$ from group $i$ must be at most $t$ times the number of points in $C$ from any other group $j \in [\ell]$, for a given integer $t$. Only bi-criteria approximation and exponential-time algorithms follow for this problem from the prior work on fair clustering problems when $\ell > 2$. We present the first polynomial-time $O(k^2\cdot \ell \cdot t)$-approximation for this problem that does not violate the fairness constraints. We also implemented our algorithm on a variety of datasets to test the "price of fairness" achieved by our approach in real data, which turned out to be significantly smaller than the theoretical guarantee.


138
UniPROT: Uniform Prototype Selection via Partial Optimal Transport with Submodular Guarantees

Prateek Chanda ⋅ Prayas Agrawal ⋅ Karthik Gurumoorthy ⋅ Ganesh Ramakrishnan ⋅ Bamdev Mishra ⋅ Pratik Jawanpuria

Selecting prototypical examples from a source distribution to represent a target data distribution is a fundamental problem in machine learning. Existing subset selection methods often rely on implicit importance scores, which can be skewed towards majority classes and lead to low-quality prototypes for minority classes. We present UniPROT, a novel subset selection framework that minimizes the optimal transport (OT) distance between a uniformly weighted prototypical distribution and the target distribution. While intuitive, this formulation leads to a cardinality-constrained maximization of a \emph{super-additive} objective, which is generally intractable to approximate efficiently. To address this, we propose a principled reformulation of the OT marginal constraints, yielding a partial optimal transport-based submodular objective. We prove that this reformulation enables a greedy algorithm with a $(1-1/e)$ approximation guarantee relative to the original super-additive maximization problem. Empirically, we showcase that enforcing uniform prototype weights in UniPROT consistently improves minority-class representation in imbalanced classification benchmarks without compromising majority-class accuracy. In both finetuning and pretraining regimes for large language models under domain imbalance, UniPROT enforces uniform source contributions, yielding robust performance gains. Our results establish UniPROT as a scalable, theoretically grounded solution for uniform-weighted prototype selection. Our code is publicly available at GitHub: https://github.com/efficiency-learning/UniPROT

We investigate the challenging problem of adversarial multi-armed bandits operating under time-varying constraints, a scenario motivated by numerous real-world applications. To address this complex setting, we propose a novel primal-dual algorithm that extends online mirror descent through the incorporation of suitable gradient estimators and effective constraint handling. We provide theoretical guarantees establishing sublinear dynamic regret and sublinear constraint violation for our proposed policy. Our algorithm achieves state-of-the-art performance in terms of both regret and constraint violation. Empirical evaluations demonstrate the superiority of our approach.

The generalization error of a supervised statistical learning algorithm, defined as the difference between the population risk and the empirical risk, quantifies its ability to predict performance on previously unseen data. In this work, we analyze the generalization error under the heavy-tailed assumption on the loss function with respect to the data-generating distribution. Specifically, we derive uniform, information-theoretic, and PAC-Bayesian bounds on the generalization error under the assumption that the $(1+\epsilon)$-th moment of the loss function is bounded for $\epsilon\in(0,1]$. The generalization error is shown to have a convergence rate of $O(n^{-\epsilon/(1+\epsilon)})$ where $n$ is the number of training samples. Furthermore, we apply our results to study the generalization error of the Gibbs posterior and noisy iterative learning algorithms under the heavy-tailed assumption.


140
Amortized In-Context Mixed Effect Transformer Models: A Zero-Shot Approach for Pharmacokinetics

César Ali Ojeda Marin ⋅ Ramses Sanchez ⋅ Wilhelm Huisinga ⋅ Niklas Hartung

Accurate dose–response forecasting under sparse sampling is central to precision pharmacotherapy. We present the Amortized In-Context Mixed-Effect Transformer (AICMET) model, a transformer‑based latent‑variable framework that unifies mechanistic compartmental priors with amortized in‑context Bayesian inference. AICMET is pre‑trained on hundreds of thousands of synthetic pharmacokinetic trajectories with Ornstein-Uhlenbeck priors over the parameters of compartment models, endowing the model with strong inductive biases and enabling zero‑shot adaptation to new compounds. At inference time, the decoder conditions on the collective context of previously profiled trial participants, generating calibrated posterior predictions for newly enrolled patients after a few early drug concentration measurements. This capability collapses traditional model‑development cycles from weeks to hours while preserving some degree of expert modelling. Experiments across public datasets show that AICMET attains state‑of‑the‑art predictive accuracy and faithfully quantifies inter‑patient variability—outperforming both nonlinear mixed‑effects baselines and recent neural ODE variants.


142
Policy Learning with Abstention

Ayush Sawarni ⋅ Jikai Jin ⋅ Justin Whitehouse ⋅ Vasilis Syrgkanis

Policy learning algorithms are regularly leveraged in domains such as personalized medicine and advertising to develop individualized treatment regimes. One deficit of existing policy learning algorithms is that they do not adjust their decisions based on uncertainty in their predictions—that is, they fail to \textit{abstain}. To remedy this, we introduce a framework for \textit{policy learning with abstention}, in which policies that choose not to assign a treatment to some customers/patients receive a small additive reward on top of the value of a random guess. Building on empirical welfare maximization, we propose a two-stage learner that first identifies a set of near-optimal policies and then constructs an abstention class based on disagreements between the policies. We establish fast $O(1/n)$–type regret guarantees for the learned policy from offline data when the treatment propensity in the offline data is known, and we show how to extend these guarantees to the unknown–propensity case via a doubly robust (DR) objective. Further, we use our algorithm as a black box to obtain improved guarantees under margin conditions that go beyond realizability, which has been a standard assumption in prior work on policy learning with a margin. We also study links to distributionally robust policy learning—where abstention acts as a hedge against small shifts—and to safe policy improvement, where the objective is to improve upon a given baseline policy with high probability. We validate our theoretical findings through extensive synthetic experiments.

Phase retrieval is the classical problem of recovering a signal $x^* \in \mathbb{R}^n$ from its noisy phaseless measurements $y_i = \langle a_i, x^* \rangle^2 + \zeta_i$ (where $\zeta_i$ denotes noise, and $a_i$ is the sensing vector) for $i \in [m]$. The problem of phase retrieval has a rich history, with a variety of applications such as optics, crystallography, heteroscedastic regression, astrophysics, etc. A major consideration in algorithms for phase retrieval is \emph{robustness} against measurement errors. In recent breakthroughs in algorithmic robust statistics, efficient algorithms have been developed for several parameter estimation tasks such as mean estimation, covariance estimation, robust principal component analysis (PCA), etc. in the presence of heavy-tailed noise and adversarial corruptions. In this paper, we study efficient algorithms for robust phase retrieval with heavy-tailed noise when a constant fraction of both the measurements $y_i$ and the sensing vectors $a_i$ may be arbitrarily adversarially corrupted. For this problem, Buna and Rebeschini (AISTATS 2025) very recently gave an \emph{exponential} time algorithm with sample complexity $O(n \log n)$. Their algorithm needs a \emph{robust spectral initialization}, specifically, a robust estimate of the top eigenvector of a covariance matrix, which they deemed to be beyond known efficient algorithmic techniques (similar spectral initializations are a key ingredient of a large family of phase retrieval algorithms). In this work, we make a connection between robust spectral initialization and recent algorithmic advances in robust PCA, yielding the first polynomial-time algorithms for robust phase retrieval with both heavy-tailed noise and adversarial corruptions, in fact with near-linear (in $n$) sample complexity.


144
On the convergence and straightness of Rectified Flow

Vansh Bansal ⋅ Saptarshi Roy ⋅ Alessandro Rinaldo ⋅ Purnamrita Sarkar

Flow Matching has become a cornerstone of modern generative models like Stable Diffusion 3, largely due to the efficiency of its Rectified Flow (RF) variant. The success of RF hinges on iteratively learning straight trajectories, pushing generation towards fewer sampling steps. However, the theoretical link between path geometry and sampling efficiency has been under-explored. This paper fills this gap by introducing a novel \textit{Piecewise Straightness} parameter, $\gamma_{2,T}$. We establish the first Wasserstein convergence bound that explicitly links the discretization error of \textit{any} general flow-model to $\gamma_{2,T}$, proving that minimizing curvature is the key to achieving high-fidelity, one-step sampling. Building on this theory, we establish the first theoretical framework to analyze the straightness of RF. We begin by offering intuitive geometric arguments for simple cases before identifying sufficient conditions under which a single rectification step (1-RF) yields a perfectly straight or even a Monge optimal coupling. While whether these sufficient conditions are met depends on the problem geometry, they enable the first concrete proofs in this area. Critically, fulfilling these conditions makes the subsequent flow (2-RF) perfectly straight ($\gamma_{2,T}=0$). This eliminates the discretization error in our bound and makes flawless, single-step sampling possible.

In this paper, we study the finite-time behavior of the TD(0) temporal-difference method with linear function approximation (LFA). We consider on-policy independent and identically distributed (i.i.d.) samples, a constant learning step, and the Polyak-Juditsky averaging method. We establish a new convergence rate, for the Mean-Square Error (MSE) on the approximated function, that is (i) fast in the sense that it admits an optimal dependency in the number of iterations $k$ (i.e., of order $1/k$), (ii) is robust to ill-conditioning: it only depends on an initial error and model-independent constants and (iii) is sharp up to a multiplicative constant lower than $11$. In particular, it does not depend on the smallest eigenvalue of the uncentered covariance matrix of the linear parametrization, unlike all pre-existing $O(1/k)$ rates in the TD(0) literature. We also introduce PCTD(0), a variant of TD(0), which benefits from better convergence properties under an additional assumption of strong mixing on the Markov Chain.


146
From Restless to Contextual: A Thresholding Bandit Reformulation for Finite-horizon Improvement

Jiamin Xu ⋅ Ivan Nazarov ⋅ Aditya Rastogi ⋅ Africa Perianez Santiago ⋅ Kyra Gan

This paper addresses the poor finite-horizon performance of existing online restless bandit (RB) algorithms, which stems from the prohibitive sample complexity of learning a full Markov decision process (MDP) for each agent. We argue that superior finite-horizon performance requires rapid convergence to a high-quality policy. Thus motivated, we introduce a reformulation of online RBs as a budgeted thresholding contextual bandit, which simplifies the learning problem by encoding long-term state transitions into a scalar reward. We prove the first non-asymptotic optimality of an oracle policy for a simplified finite-horizon setting. We propose a practical learning policy under a heterogeneous-agent, multi-state setting, and show that it achieves a sublinear regret, achieving faster convergence than existing methods. This directly translates to higher cumulative reward, as empirically validated by significant gains over state-of-the-art algorithms in large-scale heterogeneous environments. The code is provided in github. Our work provides a new pathway for achieving practical, sample-efficient learning in finite-horizon RBs.


147
Functional Properties of the Focal-Entropy

Jaimin Shah ⋅ Martina Cardone ⋅ Alex Dytso

The focal-loss has become a widely used alternative to cross-entropy in class-imbalanced classification problems, particularly in computer vision. Despite its empirical success, a systematic information-theoretic study of the focal-loss remains incomplete. In this work, we adopt a distributional viewpoint and study the focal-entropy, a focal-loss analogue of the cross-entropy. Our analysis establishes conditions for finiteness, convexity, and continuity of the focal-entropy, and provides various asymptotic characterizations. We prove the existence and uniqueness of the focal-entropy minimizer, describe its structure, and show that it can depart significantly from the data distribution. In particular, we rigorously show that the focal-loss amplifies mid-range probabilities, suppresses high-probability outcomes, and, under extreme class imbalance, induces an over-suppression regime in which very small probabilities are further diminished. These results, which are also experimentally validated, offer a theoretical foundation for understanding the focal-loss and clarify the trade-offs that it introduces when applied to imbalanced learning tasks.


148
DRAUN: An Optimization-Agnostic Data Reconstruction Attack on Federated Unlearning

Hithem Lamri ⋅ Manaar Alam ⋅ Haiyan Jiang ⋅ Michail Maniatakos

Federated Unlearning (FU) enables clients to remove the influence of specific data from a collaboratively trained shared global model, addressing regulatory requirements such as GDPR and CCPA. However, this unlearning process introduces a new privacy risk: A malicious server may exploit unlearning updates to reconstructthe data requested for removal, a form of Data Reconstruction Attack (DRA). While DRAs for machine unlearning have been studied extensively in centralized Machine Learning-as-a-Service (MLaaS) settings, their applicability to FU remains unclear due to the decentralized, client-driven nature of FU. This work presents DRAUN, the first attack framework to reconstruct unlearned data in FU systems. DRAUN targets optimization-based unlearning methods, which are widely adopted for their efficiency. We theoretically demonstrate why existing DRAs targeting machine unlearning in MLaaS fail in FU and show how DRAUN overcomes these limitations. We validate our approach through extensive experiments on five datasets and five model architectures, evaluating its performance against five popular unlearning methods, effectively demonstrating that state-of-the-art FU methods remain vulnerable to DRAs.

We provide a rigorous random matrix theory analysis of spiked cross-covariance models where the signals across two high-dimensional data channels are partially aligned. These models are motivated by multi-modal learning and form the standard generative setting underlying Partial Least Squares (PLS), a widely used yet theoretically underdeveloped method. We show that the leading singular values of the sample cross-covariance matrix undergo a Baik–Ben Arous–Péché (BBP)-type phase transition, and we characterize the precise thresholds for the emergence of informative components. Our results yield the first sharp asymptotic description of the signal recovery capabilities of PLS in this setting, revealing a fundamental performance gap between PLS and the Bayes-optimal estimator. In particular, we identify the SNR and correlation regimes where PLS fails to recover any signal, despite detectability being possible in principle. These findings clarify the theoretical limits of PLS and provide guidance for the design of reliable multi-modal inference methods in high dimensions.


15
Parameter-Efficient Multi-Task Learning via Progressive Task-Specific Adaptation

Neeraj Gangwar ⋅ Anshuka Rangi ⋅ Rishabh Deshmukh ⋅ Holakou Rahmanian ⋅ Yesh Dattatreya ⋅ Nickvash Kani

Parameter-efficient fine-tuning methods have emerged as a promising solution for adapting pre-trained models to various downstream tasks. While these methods perform well in single-task learning, extending them to multi-task learning exacerbates common issues, such as task interference and negative transfer, due to the limited number of trainable parameters. To address these challenges, we introduce progressive task-specific multi-task adaptation, a novel parameter-efficient approach for multi-task learning. Our approach introduces adapter modules that are shared in early layers and become increasingly task-specific in later layers. Additionally, we propose a gradient-based approach for computing task similarity and use this measure to allocate similar tasks to the shared adapter modules. To evaluate our approach, we adapt Swin and Pyramid Vision Transformers on PASCAL and NYUD-v2. On both datasets, our approach outperforms prior parameter-efficient multi-task methods while using fewer trainable parameters.


150
Leveraging Machine-Learned Advice in Strategic Interactions with No-Regret Learners

Tinashe Handina ⋅ Tongxin Li ⋅ Kishan Panaganti ⋅ Eric Mazumdar ⋅ Adam Wierman

As machine learning becomes increasingly integrated into decision-making across domains, understanding how machine-learned advice can be leveraged in strategic environments is of growing importance. In this work, we study how an agent in a two-player repeated game can effectively utilize potentially imperfect advice when interacting with a no-regret learner (i.e., satisfying a no-external or no-swap regret condition). We characterize the advice landscape by introducing a pseudo-metric to quantify the usefulness of an advice instance. We demonstrate the pseudo-metric's applicability through two forms of advice: simulators and payoff matrix predictions. We then show how an optimizing player, equipped with correctness guarantees on the advice, could leverage simulators to compute approximate Stackelberg strategies more efficiently, reducing the interaction complexity traditionally required and illustrating the power of good advice. Finally, we extend our analysis to settings where the advice does not have any guarantee of correctness. We find that, in general, a player cannot simultaneously guarantee near Stackelberg performance when the advice is approximately accurate and a no-regret condition when the advice is inaccurate. We do show, however, that it is possible for an advice-aided player to weakly dominate their utility in some (coarse)-correlated equilibria.

Gradient-based iterative optimization methods are the workhorse of modern machine learning. They crucially rely on careful tuning of parameters like learning rate and momentum. However, one typically sets them using heuristic approaches without formal near-optimality guarantees. Recent work studies how to learn a good step-size in gradient descent. However, like most of the literature with theoretical guarantees for gradient-based optimization, their results rely on strong assumptions on the function class including convexity and smoothness which do not hold in typical applications. In this work, we develop novel analytical tools for provably tuning hyperparameters in gradient-based algorithms that apply to non-convex and non-smooth functions. We obtain matching sample complexity bounds for learning the step-size in gradient descent shown for smooth, convex functions in prior work (up to logarithmic factors) but for a much broader class of functions. Our analysis applies to gradient descent on neural networks with commonly used activation functions (including ReLU, sigmoid and tanh). We extend our framework to tuning multiple hyperparameters, including tuning the learning rate schedule, simultaneously tuning momentum and step-size, and pre-training the initialization vector. Our approach can be used to bound the sample complexity for minimizing both the validation loss as well as the number of gradient descent iterations.


152
On the Bias of Variational Resampling

Axel Finke ⋅ Oskar Kviman ⋅ Nicola Branchini ⋅ Victor Elvira

Variational resampling (VR) is a method for deterministically resampling the $N$ particles in sequential Monte Carlo (SMC) algorithms (also known as particle filters), by minimising the Kullback--Leibler divergence from the empirical measure of the $N$ weighted original particles to the empirical measure of $M$ unweighted resampled particles. The combination of VR with a weight transformation (called smoothing weights) has shown to often yield a smaller mean-square error (MSE) than standard resampling schemes in the literature. However, its bias has never been investigated. In this paper, we first show that VR incurs a weighting bias and a truncation bias. We then propose a mechanism to alleviate the weighting bias through an uneven weighting of the resampled particles. We also show that the truncation bias implies that the particle approximation of the target distribution is restricted to a region in which the unnormalised weights are larger than some threshold with high probability. We prove that this probability approaches $1$ if $M = \mathrm{O}(N)$ as $N \to \infty$. Finally, we empirically illustrate that the smaller MSE of VR observed in the literature may be attributable to an underestimation of uncertainty caused by the use of the smoothing weights.


153
Q-ShiftDP: A Differentially Private Parameter-Shift Rule for Quantum Machine Learning

Hoang Ngo ⋅ Nhat Hoang-Xuan ⋅ Quan Nguyen ⋅ Nguyen Do ⋅ Incheol Shin ⋅ My T. Thai

Quantum Machine Learning (QML) promises significant computational advantages, but preserving training data privacy remains challenging. Classical approaches like differentially private stochastic gradient descent (DP-SGD) add noise to gradients but fail to exploit the unique properties of quantum gradient estimation. In this work, we introduce the Differentially Private Parameter-Shift Rule (Q-ShiftDP), the first privacy mechanism tailored to QML. By leveraging the inherent boundedness and stochasticity of quantum gradients computed via the parameter-shift rule, Q-ShiftDP enables tighter sensitivity analysis and reduces noise requirements. We combine carefully calibrated Gaussian noise with intrinsic quantum noise to provide formal privacy and utility guarantees, and show that harnessing quantum noise further improves the privacy–utility trade-off. Experiments on benchmark datasets demonstrate that Q-ShiftDP consistently outperforms classical DP methods in QML.


156
Learning Equivariant Functions via Quadratic Forms

Pavan Karjol ⋅ Vivek Kashyap ⋅ Rohan Kashyap ⋅ Prathosh AP

In this study, we introduce a method for learning group equivariant functions by learning the associated quadratic form $x^TAx$ corresponding to the group from the data. Certain groups, known as generalised orthogonal groups, preserve a specific quadratic form, and we leverage this property to uncover the underlying symmetry group under the assumption that it is generalised orthogonal group. By utilising the corresponding unique symmetric matrix, we incorporate suitable inductive biases into the neural network architecture, leading to models that are both simplified and efficient. Our approach results in an invariant model that preserves norms, while the equivariant model is represented as a product of a norm-invariant model and a scale-invariant model, where the “product” refers to the group action. Moreover, we extend our framework to a more general setting where the function acts on tuples of input vectors via a diagonal group action. In this extension, the equivariant function is decomposed into an angular component extracted solely from the normalised first vector and a scale-invariant component that depends on the full Gram matrix of the tuple. This decomposition captures the inter-dependencies between multiple inputs while preserving the underlying group symmetry. We assess the effectiveness of our framework across multiple tasks, including polynomial regression, top quark tagging, and moment of inertia matrix prediction. Comparative analysis with baseline methods demonstrates that our model consistently excels in both discovering the underlying symmetry and efficiently learning the corresponding equivariant function.


158
Multi-Armed Sampling Problem and the End of Exploration

Mohammad Pedramfar ⋅ Siamak Ravanbakhsh

This paper introduces the framework of multi-armed sampling, which serves as the sampling counterpart to the optimization problem of multi-armed bandits. Our primary motivation is to rigorously examine the exploration-exploitation trade-off in the context of sampling. We systematically define plausible notions of regret for this framework and establish corresponding lower bounds. We then propose a simple algorithm that achieves near-optimal regret bounds. Our theoretical results suggest that, in contrast to optimization, sampling barely requires any exploration. To further connect our findings with those of multi-armed bandits, we define a continuous family of problems and associated regret measures that smoothly interpolate and unify multi-armed sampling and multi-armed bandit problems using a temperature parameter. We believe that the multi-armed sampling framework and our findings in this setting can play a foundational role in the study of sampling, including recent neural samplers, much like the role of multi-armed bandits in reinforcement learning. In particular, our work sheds light on the role of exploration (or lack thereof) and the convergence properties of algorithms for entropy-regularized reinforcement learning, fine-tuning of pretrained models and reinforcement learning with human feedback (RLHF).


159
Fast Private Adaptive Query Answering for Large Data Domains

Miguel Fuentes ⋅ Brett Mullins ⋅ Yingtai Xiao ⋅ Daniel Kifer ⋅ Cameron Musco ⋅ Daniel Sheldon

Privately releasing marginals of a tabular dataset is a foundational problem in differential privacy. However, state-of-the-art mechanisms suffer from a computational bottleneck when marginal estimates are reconstructed from noisy measurements. Recently, residual queries were introduced and shown to lead to highly efficient reconstruction in the batch query answering setting. We introduce new techniques to integrate residual queries into state-of-the-art adaptive mechanisms such as AIM. Our contributions include a novel conceptual framework for residual queries using multi-dimensional arrays, lazy updating strategies, and adaptive optimization of the per-round privacy budget allocation. Together these contributions reduce error, improve speed, and simplify residual query operations. We integrate these innovations into a new mechanism (AIM+GReM), which improves AIM by using fast residual-based reconstruction instead of a graphical model approach. Our mechanism is orders of magnitude faster than the original framework and demonstrates competitive error and greatly improved scalability.

We demonstrate how supervised learning can be decomposed into a two-stage procedure, where (1) all model parameters are selected in an unsupervised manner, and (2) the outputs y are added to the model, without changing the parameter values. This is achieved by a new model selection criterion that - in contrast to cross-validation - can be used also without access to y. For linear ridge regression, we bound the asymptotic out-of-sample risk of our method in terms of the optimal asymptotic risk. We also demonstrate that versions of linear and kernel ridge regression, smoothing splines, k-nearest neighbors, random forests, and neural networks, trained without access to y, perform similarly to their standard y-based counterparts. Hence, our results suggest that the difference between supervised and unsupervised learning is less fundamental than it may appear.


162
Scalable Policy Maximization Under Network Interference

Aidan Gleich ⋅ Eric Laber ⋅ Alexander Volfovsky

Many interventions, such as vaccines in clinical trials or coupons in online marketplaces, must be assigned sequentially without full knowledge of their effects. Multi-armed bandit algorithms have proven successful in such settings. However, standard independence assumptions fail when the treatment status of one individual impacts the outcomes of others, a phenomenon known as interference. We study optimal-policy learning under interference on large networks. Existing approaches to this problem require repeated observations of the same fixed network and struggle to scale in sample size beyond as few as fifteen connected units --- both limit applications. We show that common assumptions on the structure of interference enable a parsimonious linear parameterization of the reward function. We develop a scalable Thompson sampling algorithm that maximizes cumulative rewards on a $n$-node network while allowing for both nodes and edges to be sampled at each time period. We prove upper and lower bounds on Bayesian regret that imply near-optimality. Simulation experiments show that our algorithm learns quickly and outperforms existing methods. The results close a key scalability gap between causal inference methods for interference and practical bandit algorithms, enabling policy optimization in large-scale networked systems.

In Bayesian optimization, Thompson sampling selects the evaluation point by sampling from the posterior distribution over the objective function maximizer. Because this sampling problem is intractable for Gaussian process (GP) surrogates, the posterior distribution is typically restricted to fixed discretizations (i.e., candidate points) that become exponentially sparse as dimensionality increases. While previous works aim to increase candidate point density through scalable GP approximations, our orthogonal approach increases density by adaptively reducing the search space during sampling. Specifically, we introduce Adaptive Candidate Thompson Sampling (ACTS), which generates candidate points in subspaces guided by the gradient of a surrogate model sample. ACTS is a simple drop-in replacement for existing TS methods—including those that use trust regions or other local approximations—producing better samples of maxima and improved optimization across synthetic and real-world benchmarks.

Federated Learning faces significant challenges due to data heterogeneity, which manifests as Label Distribution Skew and label missingness. We propose Skew-Scarcity Disentanglement Indicator (SSDI), a novel metric that decomposes heterogeneity into two disentangled components: Label Distribution Skew (LDS) (quantity skew of present labels) and Label Coverage Deficiency (LCD) (deviation due to missing labels). Using a PAC-Bayesian framework, we derive a generalization bound indicating that Label Coverage Deficiency becomes the dominant risk factor as the number of clients increases, severely degrading accuracy on rare labels. Our study reveals that, for a fixed number of labels, increasing clients is a primary driver of per-label accuracy variance by exacerbating Label Coverage Deficiency. Moreover, a higher global missing rate intensifies this divergence effect and can precipitate severe performance breakdown at a lower critical threshold of clients. Experiments on vision benchmarks confirm that SSDI accurately captures the severity of performance divergence. The SSDI framework provides a principled tool for diagnosing heterogeneity and guiding targeted mitigation strategies. The code for the SSDI-controlled client-label matrix generation used in our experiments is available at https://github.com/wkzeng/SSDI.git.


165
Out-of-Distribution Generalization of In-Context Learning: A Low-Dimensional Subspace Perspective

Soo Min Kwon ⋅ Alec Xu ⋅ Can Yaras ⋅ Laura Balzano ⋅ Qing Qu

The transformer's emergent ability to perform in-context learning (ICL) has sparked a wide range of studies designed to understand its strengths and limitations. However, a theoretical understanding of when ICL can and cannot generalize beyond its pre-training data still remains unclear. This paper puts forth a minimal mathematical model that provably identifies when ICL can generalize out-of-distribution (OOD). By studying linear regression tasks parameterized with low-rank covariance matrices, we model distribution shifts as varying angles between subspaces and derive conditions under which a single-layer linear attention model interpolates across all angles. We show that if pre-training task vectors are drawn from a union of subspaces, transformers can generalize to all angle shifts—enabling ICL even in regions with zero probability mass in the training distribution. On the other hand, if the pre-training tasks are drawn from a single Gaussian, the test risk shows a non-negligible dependence on the angle, implying that ICL cannot generalize OOD. We empirically show that our results also hold for models such as GPT-2, and present experiments on how our observations extend to nonlinear function classes.


166
Active Measuring in Reinforcement Learning With Delayed Negative Effects

Daiqi Gao ⋅ Ziping Xu ⋅ Aseel Rawashdeh ⋅ Predrag Klasnja ⋅ Susan Murphy

Measuring states in reinforcement learning (RL) can be costly in real-world settings and may negatively influence future outcomes. We introduce the Actively Observable Markov Decision Process (AOMDP), where an agent not only selects control actions but also decides whether to measure the latent state. The measurement action reveals the true latent state but may have a negative delayed effect on the environment. We show that this reduced uncertainty enables sample-efficient learning and may increase the value of the optimal policy despite these costs. We formulate an AOMDP as a periodic partially observable MDP and propose an online RL algorithm based on belief states. To approximate the belief states, we further propose a sequential Monte Carlo method to jointly approximate the posterior of unknown static environment parameters and unobserved latent states. We evaluate the proposed algorithm in a digital health application, where the agent decides when to deliver digital interventions and when to assess users' psychological status through surveys.


167
Atlas-based Manifold Representations for Interpretable Riemannian Machine Learning

Ryan Robinett ⋅ Sophia Madejski ⋅ Kyle Ruark ⋅ Samantha Riesenfeld ⋅ Lorenzo Orecchia

Despite the popularity of the manifold hypothesis, current manifold-learning methods do not support machine learning directly on the latent $d$-dimensional data manifold, as they primarily aim to perform dimensionality reduction into $\mathbb{R}^D$, losing key manifold features when the embedding dimension $D$ approaches $d$. On the other hand, methods that directly learn the latent manifold as a differentiable atlas have been relatively underexplored. In this paper, we aim to give a proof of concept of the effectiveness and potential of atlas-based methods. To this end, we implement a generic data structure to maintain a differentiable atlas that enables Riemannian optimization over the manifold. We complement this with an unsupervised heuristic that learns a differentiable atlas from point cloud data. We experimentally demonstrate that this approach has advantages in terms of efficiency and accuracy in selected settings. Moreover, in a supervised classification task over the Klein bottle and in RNA velocity analysis of hematopoietic data, we showcase the improved interpretability and robustness of our approach.


168
Robust Learning of A Group DRO Neuron

Guyang Cao ⋅ Shuyao Li ⋅ Sushrut Karmalkar ⋅ Jelena Diakonikolas

We study the problem of learning a single neuron under standard squared loss in the presence of arbitrary label noise and group-level distributional shifts, for a broad family of covariate distributions. Our goal is to identify a "best-fit" neuron parameterized by ${\boldsymbol w}^{\star}$ that performs well under the most challenging reweighting of the groups. Specifically, we address a Group Distributionally Robust Optimization problem: given sample access to $K$ distinct distributions ${\mathcal p_{[1]}},\dots, {\mathcal p_{[K]}}$, we seek to approximate ${\boldsymbol w}^*$ that minimizes the worst-case objective over convex combinations of group distributions ${\boldsymbol \lambda} \in \Delta_K$, where the objective is $\sum_{i \in [K]}\lambda_{[i]},\mathbb E_{(\mathbf x,y)\sim{\mathcal p_{[i]}}}(\sigma(\boldsymbol w\cdot\boldsymbol x)-y)^2 - \nu d_f(\boldsymbol\lambda,\tfrac1K\boldsymbol1)$ and $d_f$ is an $f$-divergence that imposes (optional) penalty on deviations from uniform group weights, scaled by a parameter $\nu \geq 0$. We develop a computationally efficient primal-dual algorithm that outputs a vector $\widehat{\boldsymbol w}$ that is constant-factor competitive with ${\boldsymbol w}^{*}$ under the worst-case group weighting. Our analytical framework directly confronts the inherent nonconvexity of the loss function, providing robust learning guarantees in the face of arbitrary label corruptions and group-specific distributional shifts. The implementation of the dual extrapolation update motivated by our algorithmic framework shows promise on LLM pre-training benchmarks.

We develop a flexible feature selection framework based on deep neural networks that approximately controls the false discovery rate (FDR), a measure of Type-I error. The method applies to architectures whose first layer is fully connected. From the second layer onward, it accommodates multilayer perceptrons of arbitrary width and depth, convolutional and recurrent networks, attention mechanisms, residual connections, and dropout. The procedure also accommodates stochastic gradient descent with data-independent initializations and learning rates. To the best of our knowledge, this is the first work to provide a theoretical guarantee of FDR control for feature selection within such a general deep learning setting. Our analysis is built upon a multi-index data-generating model and an asymptotic regime in which the feature dimension $n$ diverges faster than the latent dimension $q^*$, while the sample size, the number of training iterations, the network depth, and hidden layer widths are left unrestricted. Under this setting, we show that each coordinate of the gradient-based feature-importance vector admits a marginal normal approximation, thereby supporting the validity of asymptotic FDR control. As a theoretical limitation, we assume $\boldsymbol{B}$-right rotational invariance of the design matrix, and we discuss broader generalizations. We also present numerical experiments that underscore the theoretical findings.


17
Clustering-Based Edge Augmentation for Minimizing the Kirchhoff Index

Prasanth Yalamanchili ⋅ Aditya Bhaskara

The Kirchhoff index ($\mathcal{K}_{G}$), defined as the sum of effective resistances over all pairs of nodes in a connected undirected graph $G$, is a fundamental metric for real-world networks. It corresponds to average power consumption in electrical circuits, average commute time of random walks, and more relevantly to optimization, is equal to $\text{Tr}(\mathcal{L}^{\dagger})$, where $\mathcal{L}$ is the graph Laplacian. In this paper, we study the problem of augmenting a given graph by adding $k$ edges to minimize the Kirchhoff index. The problem was introduced in a work of Ghosh, Boyd, and Saberi (2008), and is known to be NP-hard; the state-of-the-art algorithms mostly employ greedy heuristics and have very weak guarantees. We design novel algorithms and show bi-criteria approximation guarantees, i.e., the algorithm adds $c \cdot k$ edges and obtains an $\alpha$ factor approximation to the optimum objective value with $k$ edges. Specifically, an algorithm based on $k$-median clustering with penalties achieves $c=2$ and $\alpha = O(k)$. By using known submodularity ideas, we extend this to achieve $c=O(\log k)$ and $\alpha=(4+\epsilon)$. The problem corresponds to an augmentation version of the classic A-optimal experimental design problem in statistics. We also prove strong integrality gaps for the natural convex relaxation and demonstrate the performance of our algorithm on real and synthetic graphs.


170
Inverse-Free Sparse Variational Gaussian Processes

Stefano Cortinovis ⋅ Laurence Aitchison ⋅ Stefanos Eleftheriadis ⋅ Mark van der Wilk

Gaussian processes (GPs) offer appealing properties but are costly to train at scale. Sparse variational GP (SVGP) approximations reduce cost yet still rely on Cholesky decompositions of kernel matrices, ill-suited to low-precision, massively parallel hardware. While one can construct valid variational bounds that rely only on matrix multiplications (matmuls) via an auxiliary matrix parameter, optimising them with off-the-shelf first-order methods is challenging. We make the inverse-free approach practical by proposing a better-conditioned bound and deriving a matmul-only natural-gradient update for the auxiliary parameter, markedly improving stability and convergence. We further provide simple heuristics, such as step-size schedules and stopping criteria, that make the overall optimisation routine fit seamlessly into existing workflows. Across regression and classification benchmarks, we demonstrate that our method 1) serves as a drop-in replacement in SVGP-based models (e.g., deep GPs), 2) recovers similar performance to traditional methods, and 3) can be faster than baselines when well tuned.


171
Orthogonal Representation Learning for Estimating Causal Quantities

Valentyn Melnychuk ⋅ Dennis Frauen ⋅ Jonas Schweisthal ⋅ Stefan Feuerriegel

End-to-end representation learning has become a powerful tool for estimating causal quantities from high-dimensional observational data, but its efficiency remained unclear. Here, we face a central tension: End-to-end representation learning methods often work well in practice but lack asymptotic optimality in the form of the quasi-oracle efficiency. In contrast, two-stage Neyman-orthogonal learners provide such a theoretical optimality property but do not explicitly benefit from the strengths of representation learning. In this work, we step back and ask two research questions: (1) When do representations strengthen existing Neyman-orthogonal learners? and (2) Can a balancing constraint – commonly proposed technique in the representation learning literature – provide improvements to Neyman-orthogonality? We address these two questions through our theoretical and empirical analysis, where we introduce a unifying framework that connects representation learning with Neyman-orthogonal learners (namely, OR-learners). In particular, we show that, under the low-dimensional manifold hypothesis, the OR-learners can strictly improve the estimation error of the standard Neyman-orthogonal learners. At the same time, we find that the balancing constraint requires an additional inductive bias and cannot generally compensate for the lack of Neyman-orthogonality of the end-to-end approaches. Building on these insights, we offer guidelines for how users can effectively combine representation learning with the classical Neyman-orthogonal learners to achieve both practical performance and theoretical guarantees.


172
Panprediction: Optimal Predictions for Any Downstream Task and Loss

Sivaraman Balakrishnan ⋅ Nika Haghtalab ⋅ Daniel Hsu ⋅ Brian Lee ⋅ Eric Zhao

Machine learning is classically formulated as optimizing a model for a known task and loss function. However, an emerging paradigm instead views model training as extracting enough information from a dataset so that the model—after light post-processing—can minimize any loss on any downstream task. This paper formalizes a mathematical framework for this paradigm, which we call panprediction. Panprediction generalizes the problem of omniprediction (Gopalan et al., 2021) and sits upstream from the problem of multi-group learning (Rothblum and Yona, 2021), which respectively focus on predictions that generalize to many downstream losses or many downstream tasks, but not both. We show that panprediction admits a nearly lossless reduction to a statistically efficient notion of calibration, called step calibration. Using this reduction, we design deterministic and randomized panpredictors that can be learned with $\tilde{O}(1/\varepsilon^3)$ and $\tilde{O}(1/\varepsilon^2)$ samples, respectively. This improves the best known sample complexity guarantee of deterministic omniprediction, and matches the best known sample complexity guarantees of randomized omniprediction and both deterministic and randomized multi-group learning. Our results demonstrate that simultaneously minimizing infinitely many losses on infinitely many tasks can be as statistically easy as minimizing one loss on one task.


173
OEUVRE: OnlinE Unbiased Variance-Reduced loss Estimation

Kanad Pardeshi ⋅ Bryan Wilder ⋅ Aarti Singh

Online learning algorithms continually update their models as data arrive, making it essential to accurately estimate the expected loss at the current time step. The prequential method is an effective estimation approach which can be practically deployed in various ways. However, theoretical guarantees have previously been established under strong conditions on the algorithm, and practical algorithms have hyperparameters which require careful tuning. We introduce OEUVRE, an estimator that evaluates each incoming sample on the function learned at the current and previous time steps, recursively updating the loss estimate in constant time and memory. We use algorithmic stability, a property satisfied by many popular online learners, for optimal updates and prove consistency, convergence rates, and concentration bounds for our estimator. We design a method to adaptively tune OEUVRE's hyperparameters and test it across diverse online and stochastic tasks. We observe that OEUVRE matches or outperforms other estimators even when their hyperparameters are tuned with oracle access to ground truth.


174
On the Number of Conditional Independence Tests in Constraint-based Causal Discovery

Marc Franquesa Monés ⋅ Jiaqi Zhang ⋅ Caroline Uhler

Learning causal relations from observational data is a fundamental problem with wide-ranging applications across many fields. Constraint-based methods infer the underlying causal structure by performing conditional independence tests. However, existing algorithms such as the prominent PC algorithm need to perform a large number of independence tests, which in the worst case is exponential in the maximum degree of the causal graph. Despite extensive research, it remains unclear if there exist algorithms with better complexity without additional assumptions. Here, we establish an algorithm that achieves a better complexity of $p^{\mathcal{O}(s)}$ tests, where $p$ is the number of nodes in the graph and $s$ denotes the maximum undirected clique size of the underlying essential graph. Complementing this result, we prove that any constraint-based algorithm must perform at least $2^{\Omega(s)}$ conditional independence tests, establishing that our proposed algorithm achieves exponent-optimality up to a logarithmic factor in terms of the number of conditional independence tests needed. Finally, we validate our theoretical findings through simulations, on semi-synthetic gene-expression data, and real-world data, demonstrating the efficiency of our algorithm compared to existing methods in terms of number of conditional independence tests needed.


175
On the calibration of survival models with competing risks

Julie Alberge ⋅ Tristan Haugomat ⋅ Gaël Varoquaux ⋅ Judith Abécassis

In survival analysis, accurate probability estimates are essential for decision-making, particularly in the competing-risks setting where multiple events are considered. Recent work has focused on the calibration of these probabilities in the survival analysis setting. Yet calibration in the competing-risks setting is both under-explored and harder, because it applies to both probabilities across classes and across time. We show that existing calibration measures are not suited to the competing-risk setting and that recent models do not give well-behaved probabilities. Competing risks need a dedicated calibration framework. For this, we introduce two well-behaved calibration measures, and related methods to estimate, test, and correct -recalibration. We show that these calibration scores lead to a principled statistical framework: they are minimized for oracle estimators (i.e., both measures are proper); they reveal calibration errors in modern models, corrected by our recalibration methods that yield good probabilities while preserving discrimination.


176
Archetypal Graph Generative Models: Explainable and Identifiable Communities via Anchor-Dominant Convex Hulls

Nikolaos Nakis ⋅ Chrysoula Kosma ⋅ Panagiotis Promponas ⋅ Michail Chatzianastasis ⋅ Giannis Nikolentzos

Representation learning has been essential for graph machine learning tasks such as link prediction, community detection, and network visualization. Despite recent advances achieving high performance on these downstream tasks, little progress has been made toward self-explainable models. Understanding the patterns behind predictions is equally important, motivating recent interest in explainable machine learning. In this paper, we present GraphHull, an explainable generative model that represents networks using two levels of convex hulls. At the global level, the vertices of a convex hull are treated as archetypes, each corresponding to a pure community in the network. At the local level, each community is refined by a prototypical hull whose vertices act as representative profiles, capturing community-specific variation. This two-level construction yields clear multi-scale explanations: a node’s position relative to global archetypes and its local prototypes directly accounts for its edges. The geometry is well-behaved by design, while local hulls are kept disjoint by construction. To further encourage diversity and stability, we place principled priors, including determinantal point processes, and fit the model under MAP estimation with scalable subsampling. Experiments on real networks demonstrate the ability of GraphHull to recover multi-level community structure and to achieve competitive or superior performance on link prediction and community detection, while naturally providing interpretable predictions.

Detecting brain lesions as abnormalities observed in magnetic resonance imaging (MRI) is essential for diagnosis and treatment. In the search of abnormalities, such as tumors and malformations, radiologists may benefit from computer-aided diagnostics that use computer vision systems trained with machine learning to segment normal tissue from abnormal brain tissue. While supervised learning methods require annotated lesions, we propose a new unsupervised approach (Patch2Loc) that learns from normal patches taken from structural MRI. We train a neural network model to map a patch back to its spatial location within a slice of the brain volume. During inference, abnormal patches are detected by the anomaly score based on the error and variance of the location prediction. By applying the network in a convolutional manner, this generates a pixel-wise heatmap of anomalies providing finer-grained segmentation. We demonstrate the ability of our model to segment abnormal brain tissues by applying our approach to the detection of tumor tissues in MRI on T2-weighted images from BraTS2021 and MSLUB datasets and T1-weighted images from ATLAS and WMH datasets. We show that it outperforms the state-of-the art in unsupervised segmentation.


178
A Semi-Supervised Kernel Two-Sample Test

Gyumin Lee ⋅ Shubhanshu Shekhar ⋅ Ilmun Kim

We consider the problem of two-sample testing in a semi-supervised setting with abundant unlabeled covariate data. Standard two-sample tests neglect covariate information, which has the potential to significantly boost performance. However, incorporating covariates potentially breaks the exchangeability assumption under the null, which further complicates a calibration procedure. To address these issues, we propose a semi-supervised method that produces a test statistic with asymptotic normality, while effectively integrating additional information from covariates. Our test is straightforward to calibrate due to the asymptotic normality under the null and achieves asymptotic power that is often much higher than existing kernel tests without covariates. Furthermore, we formally show that the proposed method is consistent in power against fixed and local alternatives. Simulations confirm the practical and theoretical strengths of our approach.


179
Empirical PAC-Bayes Bounds for Markov Chains

Vahe Karagulyan ⋅ Pierre Alquier

The core of generalization theory was developed for independent observations. Some PAC and PAC-Bayes bounds are available for data that exhibit a temporal dependence. However, there are constants in these bounds that depend on properties of the data-generating process: mixing coefficients, mixing time, spectral gap... Such constants are unknown in practice. In this paper, we prove a new PAC-Bayes bound for Markov chains. This bound depends on a quantity called the \textit{pseudo-spectral gap}, $\gamma_{ps}$. The main novelty is that we can provide an empirical bound on $\gamma_{ps}$ when the state space is finite. Thus, we obtain the first fully empirical PAC-Bayes bound for Markov chains. This extends beyond the finite case, although this requires additional assumptions. On simulated experiments, the empirical version of the bound is essentially as tight as the one that depends on $\gamma_{ps}$.


18
Auditing Pay-Per-Token in Large Language Models

Ander Artola Velasco ⋅ Stratis Tsirtsis ⋅ Manuel Gomez Rodriguez

Millions of users rely on a market of cloud-based services to obtain access to state-of-the-art large language models. However, it has been very recently shown that the de facto pay-per-token pricing mechanism used by providers creates a financial incentive for them to strategize and misreport the (number of) tokens a model used to generate an output. In this paper, we develop an auditing framework based on martingale theory that enables a trusted third-party auditor who sequentially queries a provider to detect token misreporting. Crucially, we show that our framework is guaranteed to always detect token misreporting, regardless of the provider's (mis-)reporting policy, and not falsely flag a faithful provider as unfaithful with high probability. To validate our auditing framework, we conduct experiments across a wide range of (mis-)reporting policies using several large language models from the $\texttt{Llama}$, $\texttt{Gemma}$ and $\texttt{Ministral}$ families, and input prompts from a popular crowdsourced benchmarking platform. The results show that our framework detects an unfaithful provider after observing fewer than $\sim$$70$ reported outputs, while maintaining the probability of falsely flagging a faithful provider below $\alpha = 0.05$.


180
Longitudinal Flow Matching for Trajectory Modeling

Mohammad Mohaiminul Islam ⋅ Thijs Kuipers ⋅ Sharvaree Vadgama ⋅ Coen de Vente ⋅ Afsana Khan ⋅ Clara Sánchez ⋅ Erik Bekkers

Generative models for sequential data often struggle with sparsely sampled and high-dimensional trajectories, typically reducing the learning of dynamics to pairwise transitions. We propose \textit{Interpolative Multi-Marginal Flow Matching} (IMMFM), a framework that learns continuous stochastic dynamics jointly consistent with multiple observed time points. IMMFM employs a piecewise-quadratic interpolation path as a smooth target for flow matching and jointly optimizes drift and a data-driven diffusion coefficient, supported by a theoretical condition for stable learning. This design captures intrinsic stochasticity, handles irregular sparse sampling, and yields subject-specific trajectories. Experiments on synthetic benchmarks and real-world longitudinal neuroimaging datasets show that IMMFM outperforms existing methods in both forecasting accuracy and further downstream tasks.


182
DP-SPRT: Differentially Private Sequential Probability Ratio Tests

Thomas Michel ⋅ Debabrota Basu ⋅ Emilie Kaufmann

We revisit Wald's celebrated Sequential Probability Ratio Test for sequential tests of two simple hypotheses, under privacy constraints. We propose DP-SPRT, a wrapper that can be calibrated to achieve desired error probabilities and privacy constraints, addressing a significant gap in previous work. DP-SPRT relies on a private mechanism that processes a sequence of queries and stops after privately determining when the query results fall outside a predefined interval. This OutsideInterval mechanism improves upon naive composition of existing techniques like AboveThreshold, achieving a factor-of-2 privacy improvement and thus potentially benefiting other continual monitoring procedures. We prove generic upper bounds on the error and sample complexity of DP-SPRT that can accommodate various noise distributions based on the practitioner's privacy needs. We exemplify them in two settings: Laplace noise (pure Differential Privacy) and Gaussian noise (Rényi differential privacy). In the former setting, by providing a lower bound on the sample complexity of any $\epsilon$-DP test with prescribed type I and type II errors, we show that DP-SPRT is near optimal when both errors are small and the two hypotheses are close. Moreover, we conduct an experimental study revealing its good practical performance.


183
Scalable Utility-Aware Multiclass Calibration

Mahmoud Hegazy ⋅ Michael Jordan ⋅ Aymeric Dieuleveut

Ensuring that classifiers are well-calibrated, i.e., their predictions align with observed frequencies, is a minimal and fundamental requirement for classifiers to be viewed as trustworthy. Existing methods for assessing multiclass calibration often focus on specific aspects associated with prediction (e.g., top-class confidence, class-wise calibration) or utilize computationally challenging variational formulations. In this work, we study scalable \emph{evaluation} of multiclass calibration. To this end, we propose utility calibration, a general framework which measures the calibration error relative to a specific utility function that encapsulates the goals or decision criteria relevant to the end user. We demonstrate how this framework can unify and re-interpret several existing calibration metrics, particularly allowing for more robust versions of the top-class and class-wise calibration metrics, and going beyond such binarized approaches, towards assessing calibration for richer classes of downstream utilities.


184
Near Optimal Dropout-Robust Sortion

Maya Gambhir ⋅ Bailey Flanigan ⋅ Aaron Roth

Citizens' assemblies - small panels of citizens that convene to deliberate and make policy recommendations - often face the issue of panelists dropping out last-minute. These dropouts undermine two key goals: that the panel is (a) of a desired size, and (b) is descriptively representative of the population. This dropouts problem motivates the question: how can we choose the panel -or add extra participants to an existing panel - to ensure that after dropouts, the panel satisfied desiderata (a) and (b) to some guaranteed degree? The practical challenge is that panelists (or extras) must be selected before seeing who ultimately drops out. We model this problem as a minimax game: the minimizer chooses a panel (or extras); then, an adversary defines a randomization over dropouts from which the realized dropouts are drawn. The loss is then the deviation of the resulting panel from predefined descriptive representation targets. Our main contribution is an efficient loss-minimizing algorithm for selecting a panel (or extras), which achieves optimal expected loss even as we vary the adversary's power from worst case to average case. Our algorithm - which iteratively plays a projected gradient descent subroutine against a best-responder - also addresses a key issue left open by prior work on this problem: it allows us to control the selection probabilities with which we choose each potential panelist (or extra). We implement our algorithms and run them on datasets from real assemblies. We show robustness gains over previous algorithms, and we use our control over selection probabilities to offer the first exploration of trade-offs between randomness and representation in handling dropouts.


185
Beyond Real Data: Synthetic Data through the Lens of Regularization

Amitis Shidani ⋅ Tyler Farghly ⋅ Yang SUN ⋅ Habib Ganjgahi ⋅ George Deligiannidis

Synthetic data can improve generalization when real data is scarce, but excessive reliance may introduce distributional mismatches that degrade performance. In this paper, we present a learning-theoretic framework to quantify the trade-off between synthetic and real data. Our approach leverages algorithmic stability to derive generalization error bounds, characterizing the optimal synthetic-to-real data ratio that minimizes expected test error as a function of the Wasserstein distance between the real and synthetic distributions. We motivate our framework in the setting of kernel ridge regression with mixed data, offering a detailed analysis that may be of independent interest. Our theory predicts the existence of an optimal ratio, leading to a U-shaped behavior of test error with respect to the proportion of synthetic data. Empirically, we validate this prediction on CIFAR-10 and a clinical brain MRI dataset. Our theory extends to the important scenario of domain adaptation, showing that carefully blending synthetic target data with limited source data can mitigate domain shift and enhance generalization. We conclude with practical guidance for applying our results to both in-domain and out-of-domain scenarios.

We theoretically investigate the phenomena of generalization and memorization in diffusion models. Empirical studies suggest that these phenomena are influenced by model complexity and the size of the training dataset. In our experiments, we further observe that the number of noise samples per data sample ($m$) used during Denoising Score Matching (DSM) plays a significant and non-trivial role. We capture these behaviors and shed insights into their mechanisms by deriving asymptotically precise expressions for test and train errors of DSM under a simple theoretical setting. The score function is parameterized by random features neural networks, with the target distribution being $d$-dimensional Gaussian. We operate in a regime where the dimension $d$, number of data samples $n$, and number of features $p$ tend to infinity while keeping the ratios $\psi_n=\frac{n}{d}$ and $\psi_p=\frac{p}{d}$ fixed. By characterizing the test and train errors, we identify regimes of generalization and memorization as a function of $\psi_n,\psi_p$, and $m$. Our theoretical findings are consistent with the empirical observations.


187
Simplex-to-Euclidean Bijections for Categorical Flow Matching

Bernardo Williams ⋅ Victor Yeom-Song ⋅ Marcelo Hartmann ⋅ Arto Klami

We propose a method for learning and sampling from probability distributions supported on the simplex. Our approach maps the open simplex to Euclidean space via smooth bijections, leveraging the Aitchison geometry to define the mappings, and supports modeling categorical data by a Dirichlet interpolation that dequantizes discrete observations into continuous ones. This enables density modeling in Euclidean space through the bijection while still allowing exact recovery of the original discrete distribution. Compared to previous methods that operate on the simplex using Riemannian geometry or custom noise processes, our approach works in Euclidean space while respecting the Aitchison geometry, and achieves competitive performance on both synthetic and real-world data sets.

The manifold hypothesis suggests that the generalization performance of machine learning methods improves significantly when the intrinsic dimension of the input distribution's support is low. In the context of Kernel Ridge Regression (KRR), we investigate two alternative notions of intrinsic dimension. The first, denoted $d_\varrho$, is the upper Minkowski dimension defined with respect to the canonical metric induced by a kernel function $K$ on a domain $\Omega$. The second, denoted $d_K$, is the effective dimension, derived from the decay rate of Kolmogorov $n$-widths associated with $K$ on $\Omega$. Given a probability measure $\mu$ on $\Omega$, we analyze the relationship between these $n$-widths and eigenvalues of the integral operator $\phi \mapsto \int_\Omega K(\cdot,x)\phi(x)\,d\mu(x)$. We show that, for a fixed domain $\Omega$, the Kolmogorov $n$-widths characterize the worst-case eigenvalue decay across all probability measures $\mu$ supported on $\Omega$. These eigenvalues are central to understanding the generalization behavior of constrained KRR, enabling us to derive an excess error bound of order $\mathcal{O}(n^{-\frac{2+d_K}{2+2d_K} + \varepsilon})$ for any $\varepsilon > 0$, when the training set size $n$ is large. We also propose an algorithm that estimates upper bounds on the $n$-widths using only a finite sample from $\mu$. For distributions close to uniform, we prove that $\varepsilon$-accurate upper bounds on all $n$-widths can be computed with high probability using at most $\mathcal{O}\left(\varepsilon^{-d_\varrho}\log\frac{1}{\varepsilon}\right)$ samples, with fewer required for small $n$. Finally, we compute the effective dimension $d_K$ for various fractal sets and present additional numerical experiments. Our results show that, for kernels such as the Laplace kernel, the effective dimension $d_K$ can be significantly smaller than the Minkowski dimension $d_\varrho$, even though $d_K = d_\varrho$ provably holds on regular domains.

Sampling from Gibbs distributions and computing their log-partition function are fundamental tasks in statistics, machine learning, and statistical physics. While efficient algorithms are known for log-concave densities, the worst-case non-log-concave setting necessarily suffers from the curse of dimensionality. For many numerical problems, the curse of dimensionality can be alleviated when the target function is smooth, allowing the exponent in the rate to improve linearly with the number of available derivatives. Recently, it has been shown that similarly fast convergence rates can be achieved by efficient optimization algorithms. Since optimization can be seen as the low-temperature limit of sampling from Gibbs distributions, we pose the question of whether similarly fast convergence rates can be achieved for non-log-concave sampling. We first study the information-based complexity of the sampling and log-partition estimation problems and show that the optimal rates for sampling and log-partition computation are sometimes equal and sometimes faster than for optimization. We then analyze various polynomial-time sampling algorithms, including an extension of a recent promising optimization approach, and find that they sometimes exhibit interesting behavior but no near-optimal rates. Our results also give further insights into the relation between sampling, log-partition, and optimization problems.


19
Differentially Private Clipped-SGD: High-Probability Convergence with Arbitrary Clipping Level

Saleh Khah ⋅ Savelii Chezhegov ⋅ Shahrokh Farahmand ⋅ Samuel Horvath ⋅ Eduard Gorbunov

Gradient clipping is a fundamental tool in Deep Learning, improving the high-probability convergence of stochastic first-order methods like SGD, AdaGrad, and Adam under heavy-tailed noise, which is common in training large language models. It is also a crucial component of Differential Privacy (DP) mechanisms. However, existing high-probability convergence analyses typically require the clipping threshold to increase with the number of optimization steps, which is incompatible with standard DP mechanisms like the Gaussian mechanism. In this work, we close this gap by providing the first high-probability convergence analysis for DP-Clipped-SGD with a fixed clipping level, applicable to both convex and non-convex smooth optimization under heavy-tailed noise, characterized by a bounded central $\alpha$-th moment assumption, $\alpha \in (1,2]$. Our results show that, with a fixed clipping level, the method converges to a neighborhood of the optimal solution with a \emph{faster rate} than the existing ones. The neighborhood can be balanced against the noise introduced by DP, providing a refined trade-off between convergence speed and privacy guarantees.


190
SetPINNs: Set-based Physics-informed Neural Networks

Mayank Nagda ⋅ Phil Ostheimer ⋅ Thomas Specht ⋅ Frank Rhein ⋅ Fabian Jirasek ⋅ Stephan Mandt ⋅ Marius Kloft ⋅ Sophie Fellenz

Physics-Informed Neural Networks (PINNs) solve partial differential equations using deep learning. However, conventional PINNs perform pointwise predictions that neglect dependencies within a domain, which may result in suboptimal solutions. We introduce SetPINNs, a framework that effectively captures local dependencies. With a finite element-inspired sampling scheme, we partition the domain into sets to model local dependencies while simultaneously enforcing physical laws. We provide a rigorous theoretical analysis showing that SetPINNs yield unbiased, lower-variance estimates of residual energy and its gradients, ensuring improved domain coverage and reduced residual error. Extensive experiments on synthetic and real-world tasks show improved accuracy, efficiency, and robustness.


191
A Continuous Time Markov Chain Framework for Insertion Language Models

Dhruvesh Patel ⋅ Benjamin Rozonoyer ⋅ Soumitra Das ⋅ Tahira Naseem ⋅ Tim G. J. Rudner ⋅ Andrew McCallum

Sequence generation through insertion of tokens offers several advantages over left-to-right generation and mask-based generation—especially for planning tasks that require look-ahead, and for tasks that need to satisfy relative ordering constraints. Existing formulations of insertion-based generation have largely been ad-hoc. In this paper, we derive a diffusion-style denoising objective from first principles by formulating the noising process as a continuous-time Markov chain—defined over the space of variable-length sequences—that drops tokens uniformly with a time-dependent rate. We show that, with certain approximations, previous formulations of insertion language models can be viewed as special cases of this denoising framework. We propose new network parameterizations that explicitly model the rate matrix of the generative Markov chain, leading to principled sampling procedures. Through empirical evaluation on a synthetic planning task, we show that the proposed approach retains the benefits of insertion-based generation over left-to-right generation and masked diffusion models. In language modeling, our diffusion-based approach is competitive with left-to-right generation and masked diffusion models, while offering additional flexibility in sampling compared to existing insertion language models.

As generative models continue to evolve, detecting AI-generated images remains a critical challenge. While effective detection methods exist, they often lack formal interpretability and may rely on implicit assumptions about the nature of fake content, potentially limiting their robustness to distributional shifts. In this work, we introduce a rigorous, statistically grounded framework for fake image detection that focuses on producing a probability score interpretable with respect to the real-image population. Our method leverages the strengths of multiple existing detectors by combining strong training-free statistics. We compute $p$-values over a range of test statistics and aggregate them using classical statistical ensembling to assess alignment with the unified real-image distribution. This framework is generic, flexible, and training-free, making it well-suited for robust fake image detection across diverse and evolving settings.


193
The Good, the Bad, and the Sampled: a No-Regret Approach to Safe Online Classification

Tavor Baharav ⋅ Spyridon Konstantinos Dragazis ⋅ Aldo Pacchiano

We study sequential testing for a binary disease outcome when risk follows an unknown logistic model. At each round, the decision maker may either pay for a test revealing the true label or predict the outcome based on patient features and past data. The goal is to minimize costly tests while ensuring the misclassification rate stays below $\alpha$ with probability at least $1-\delta$. We propose a method that jointly estimates the logistic parameter $\theta^{\*}$ and the feature distribution, using a conservative threshold on the logistic score to decide when to test. We prove our procedure achieves the target error with high probability and requires only $\widetilde O(\sqrt{T})$ more tests than an oracle with full knowledge. This is the first no-regret guarantees for error-constrained logistic testing, with direct applications to medical screening. Simulations confirm the theory, showing efficient estimation of $\theta^{\*}$ with few excess tests.


194
Distribution free M-estimation

Felipe Areces ⋅ John Duchi

The basic question of delineating those statistical problems that are solvable without making any assumptions on the underlying data distribution has long animated statistics and learning theory. This paper characterizes when a convex M-estimation or stochastic optimization problem is solvable in such an assumption-free setting, providing a precise dividing line between solvable and unsolvable problems. The conditions we identify show that Lipschitz continuity of the loss being minimized is not necessary for distribution free minimization, and they are also distinct from classical characterizations of learnability in machine learning.

Diffusion models can learn rich representations during data generation, showing potential for Self-Supervised Learning (SSL), but they face a trade-off between generative quality and discriminative performance. Their iterative sampling also incurs substantial computational and energy costs, hindering industrial and edge AI applications. To address these issues, we propose the Flow Matching-based Foundation Model (FlowFM), which jointly trains a representation encoder and a conditional flow matching generator. This decoupled design achieves both high-fidelity generation and effective recognition. By using flow matching to learn a simpler velocity field, FlowFM accelerates and stabilizes training, improving its efficiency for representation learning. Experiments on wearable sensor data show FlowFM reduces training time by 50.4\% compared to a diffusion-based approach. On downstream tasks, FlowFM surpassed the state-of-the-art SSL method (SSL-Wearables) on all five datasets while achieving up to a 51.0x inference speedup and maintaining high generative quality.


196
VIPaint: Image Inpainting with Pre-Trained Diffusion Models via Variational Inference

Sakshi Agarwal ⋅ Gabriel Hope ⋅ Jimin Heo ⋅ Erik B. Sudderth

Diffusion probabilistic models learn to remove noise added during training, generating novel data (e.g., images) from Gaussian noise through sequential denoising. However, conditioning the generative process on corrupted or masked images is challenging. While various methods have been proposed for inpainting masked images with diffusion priors, they often fail to produce samples from the true conditional distribution, especially for large masked regions. Additionally, many can't be applied to latent diffusion models which have been demonstrated to generate high-quality images at a significantly lower computational cost. We propose a hierarchical variational inference algorithm that optimizes a non-Gaussian Markov approximation of the true diffusion posterior. Our VIPaint method outperforms existing approaches to inpainting, producing diverse high-quality imputations, while also being effective for other inverse problems like deblurring and superresolution.


197
Efficient Subgroup Analysis via Optimal Trees with Global Parameter Fusion

Zhongming Xie ⋅ Joseph Giorgio ⋅ Jingshen Wang

Identifying and making statistical inferences on differential treatment effects—commonly known as subgroup analysis in clinical research—is central to precision health. Subgroup analysis allows practitioners to pinpoint populations for whom a treatment is especially beneficial or protective, thereby advancing targeted interventions. Tree-based recursive partitioning methods are widely used for subgroup analysis due to their interpretability. Nevertheless, these approaches encounter significant limitations, including suboptimal partitions induced by greedy heuristics and overfitting from locally estimated splits, especially under limited sample sizes. To address these limitations, we propose a fused optimal causal tree method that leverages mixed-integer optimization (MIO) to facilitate precise subgroup identification. Our approach ensures globally optimal partitions and introduces a parameter-fusion constraint to facilitate information sharing across related subgroups. This design substantially improves subgroup discovery accuracy and enhances statistical efficiency. We provide theoretical guarantees by rigorously establishing out-of-sample risk bounds and comparing them with those of classical tree-based methods. Empirically, our method consistently outperforms popular baselines in simulations. Finally, we demonstrate its practical utility through a case study on the Health and Aging Brain Study–Health Disparities (HABS-HD) dataset, where our approach yields clinically meaningful insights.


198
KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity

Damien Lesens ⋅ Beheshteh T. Rakhshan ⋅ Guillaume Rabusseau

The Key–Value (KV) cache is central to the efficiency of transformer-based large language models (LLMs), storing previously computed vectors to accelerate inference. Yet, as sequence length and batch size grow, the cache becomes a major memory bottleneck. Prior compression methods typically apply low-rank decomposition to keys alone or attempt to jointly embed queries and keys, but both approaches neglect that attention fundamentally depends on their inner products. In this work, we prove that such strategies are sub-optimal for approximating the attention matrix. We introduce KQ-SVD, a simple and computationally efficient method that directly performs an optimal low-rank decomposition of the attention matrix via a closed-form solution. By targeting the true source of redundancy, KQ-SVD preserves attention outputs with higher fidelity under compression. Extensive evaluations on LLaMA and Mistral models demonstrate that our approach consistently delivers superior projection quality.


199
Robustness and Generalization in Uncertainty-Aware Message Passing Neural Networks

Alesia Chernikova ⋅ Moritz Laber ⋅ Narayan G. Sabhahit ⋅ Tina Eliassi-Rad

Existing theoretical guarantees for message passing neural networks (MPNNs) assume deterministic node features. We address a more realistic scenario where noise or finite measurement precision introduces uncertainties in node feature values. First, we quantify uncertainty by propagating the moments of node-feature distributions through the MPNN architecture. To propagate moments through activation functions, we use the Taylor expansion and the pseudo-Taylor polynomial expansion. We then use the resulting node embedding distributions to analytically derive probabilistic adversarial robustness certificates for node classification tasks against L2-bounded perturbations of node features. Second, we model node features as multivariate random variables and introduce Feature Convolution Distance $FCD_p$, a pseudometric based on the Wasserstein distance. $FCD_p$ corresponds to the discriminative power of MPNNs at the node level. We show that MPNNs are globally Lipschitz continuous functions with respect to the pseudometric $FCD_p$. Using the covering number of the resulting pseudometric space, which is a subset of the Wasserstein space, we derive generalization bounds for MPNNs with uncertainties in node features. Together, these two complementary approaches---moment propagation for adversarial robustness and $FCD_p$ on the subset of the Wasserstein space for generalization---establish a unified theoretical framework that comprehensively addresses MPNN reliability under node feature uncertainty.


2
Loss-Driven Bayesian Active Learning

Zhuoyue Huang ⋅ Freddie Bickford Smith ⋅ Tom Rainforth

The central goal of active learning is to gather data that maximises downstream predictive performance, but popular approaches have limited flexibility in customising this data acquisition to different downstream problems and losses. We propose a rigorous loss-driven approach to Bayesian active learning that allows data acquisition to directly target the loss associated with a given decision problem. In particular, we show how any loss can be used to derive a unique objective for optimal data acquisition. Critically, we then show that any loss taking the form of a weighted Bregman divergence permits analytic computation of a central component of its corresponding objective, making the approach applicable in practice. In regression and classification experiments with a range of different losses, we find our approach reduces test losses relative to existing techniques.


200
Improving Adaptive Moment Optimization via Preconditioner Diagonalization

Son Nguyen ⋅ Bo Liu ⋅ Lizhang Chen ⋅ qiang liu

Modern adaptive optimization methods, such as Adam and its variants, have emerged as the most widely used tools in deep learning over recent years. These algorithms offer automatic mechanisms for dynamically adjusting the update step based on estimates of gradient statistics. Compared to traditional algorithms like Stochastic Gradient Descent, these adaptive methods are typically more robust to model scale and hyperparameter tuning. However, the gradient statistics employed by these methods often do not leverage sufficient gradient covariance information, leading to suboptimal updates in certain directions of the parameter space and potentially slower convergence. In this work, we keep track of such covariance statistics in the form of a structured preconditioner matrix. Unlike other works, our approach does not apply direct approximations to estimate this matrix. We instead implement an invertible transformation that maps the preconditioner matrix into a new space where it becomes approximately diagonal. This enables a diagonal approximation of the preconditioner matrix in the transformed space, offering several computational advantages. Empirical results show that our approach can substantially enhance the convergence speed of the modern adaptive optimizers. Notably, for large language models like LLaMA, we can achieve a speedup of 2x compared to the baseline Adam. Additionally, our method can be integrated with memory-efficient optimizers like Adafactor to manage computational overhead.

SimCLR is a popular contrastive learning method for vision tasks, renowned for its ability to pre-train neural networks to learn efficient representations. Despite its empirical effectiveness, the theoretical understanding of SimCLR is still very limited, even in the simplest learning scenarios. In this paper, we introduce a theoretical case study of SimCLR. Specifically, we consider training a two-layer convolutional neural network (CNN) to learn a toy image data model that has been considered in a series of recent works. For this particular learning task, we precisely characterize the label complexity under which SimCLR pre-training followed by supervised fine-tuning achieves approximately zero training loss and almost optimal test loss. Notably, the label complexity for SimCLR pre-training is far less demanding compared to direct supervised training, especially when the signal-to-noise ratio in the data is low. Our analysis sheds light on the benefits of SimCLR in learning with fewer labels.


202
Differentially Private Linear Regression and Synthetic Data Generation with Statistical Guarantees

Shurong Lin ⋅ Aleksandra Slavkovic ⋅ Deekshith Bhoomireddy

In the social sciences, small- to medium-scale datasets are common, and linear regression is canonical. In privacy-aware settings, much work has focused on differentially private (DP) linear regression, but mostly on point estimation with limited attention to uncertainty quantification. Meanwhile, synthetic data generation (SDG) is increasingly important for reproducibility studies, yet current DP linear regression methods do not readily support it. Mainstream DP-SDG approaches either are tailored to discrete or discretized data, making them less suitable for analyses involving continuous variables, or rely on deep learning models that require large datasets, limiting their use for the smaller-scale data typical in social science. We propose a method for linear regression with valid inference under Gaussian DP. It includes a bias-corrected estimator with asymptotic confidence intervals (CIs) and a general SDG procedure such that the corresponding regression on the synthetic data matches our DP linear regression procedure. Our approach is effective in small- to moderate-dimensional settings. Experiments show that our method (1) improves accuracy over existing methods for DP linear regression, (2) provides valid CIs, and (3) produces more reliable synthetic data for downstream statistical and machine learning tasks than current DP synthesizers.

Mixture-of-Experts (MoE) architectures have emerged as a cornerstone of modern AI systems. In particular, MoEs route inputs dynamically to specialized experts, whose outputs are aggregated through weighted summation. Despite their widespread application, theoretical understanding of MoE training dynamics remains limited to either separate expert-router optimization or restrictive top-1 routing scenarios with carefully constructed datasets. This paper advances MoE theory by providing convergence guarantees for joint training of soft-routed MoE models with non-linear routers and experts in a student-teacher framework. We prove that, with moderate over-parameterization, the student network undergoes a feature learning phase, where the router's learning process are ``guided" by the experts, that recovers the teacher's parameters. Moreover, we show that a post-training pruning can effectively eliminate redundant neurons, followed by a provably convergent fine-tuning process that reaches global optimality. Our analysis brings novel insight in understanding the optimization landscape of the MoE architecture.


204
Regularizing attention scores with bootstrapping

Neo Christopher Chung ⋅ Maxim Laletin

Vision transformers (ViT) rely on attention mechanism to weigh input features, and therefore attention scores have naturally been considered as explanations for its decision-making process. However, attention scores are almost always non-zero, resulting in noisy attention maps and limiting interpretability. Can we quantify uncertainty measures of attention scores and obtain regularized attention scores? To this end, we consider attention scores of ViT in a statistical framework where, e.g., noise would lead to insignificant yet non-zero scores. Leveraging statistical learning techniques, we introduce the bootstrapping for attention scores which generates a baseline distribution of attention scores by resampling input features. Such a bootstrap distribution is then used to estimate significances and posterior probabilities of attention scores. In natural and medical images, the proposed Attention Regularization approach demonstrates a straightforward removal of spurious attention arising from noise, drastically improving shrinkage and sparsity. Quantitative evaluations are conducted using both simulation and real-world datasets. Our study highlights bootstrapping as a practical regularization tool when using attention scores as explanations for ViT.


205
HGT-FD: Hypergraph transformer for Fraud Detection

Yintao Cai ⋅ Yunjiong Liu ⋅ Zelong Yang ⋅ Shuyang Fang ⋅ Xiaoping Min

Graph-based fraud detection aims to identify anomalous patterns or fraudulent behaviors in graph-structured data, playing a crucial role across various domains. However, traditional models primarily focus on simple node-to-node message passing, which limits the integration of multi-node features and overlooks long-range dependencies in fraud detection. Hyperedge features in hypergraphs, as an integration of node features, can convey and aggregate multi-node characteristics, providing environmental information for the model to detect fraudulent nodes. To this end, we propose HGT-FD for hypergraph-based fraud detection. Specifically, we design a Hypergraph Transformer model that can directly employ hyperedge features for fraud detection, utilizing a co-attention mechanism to generate node representations. Furthermore, structural encoding and positional encoding are proposed to enhance the model's perception of hypergraph structures, enabling the model to capture more complex high-order structural relationships. Extensive experimental results on three fraud detection datasets demonstrate that the proposed method exhibits significant advantages over baselines in fraud detection.

We introduce BASTION (Bayesian Adaptive Seasonality and Trend DecompositION), a flexible Bayesian framework for decomposing time series into trend and multiple seasonality components. We cast the decomposition as a penalized nonparametric regression and establish formal conditions under which the trend and seasonal components are uniquely identifiable, an issue only treated informally in the existing literature. BASTION offers three key advantages over existing decomposition methods: (1) accurate estimation of trend and seasonality amidst abrupt changes, (2) enhanced robustness against outliers and time-varying volatility, and (3) robust uncertainty quantification. We evaluate BASTION against established methods, including TBATS, STR, and MSTL, using both simulated and real-world datasets. By effectively capturing complex dynamics while accounting for irregular components such as outliers and heteroskedasticity, BASTION delivers a more nuanced and interpretable decomposition. To support further research and practical applications, BASTION is available as an R package at \url{https://github.com/Jasoncho0914/BASTION}


207
PolarQuant: Vector Quantization with Polar Transformation

Insu Han ⋅ Praneeth Kacham ⋅ Amin Karbasi ⋅ Vahab Mirrokni ⋅ Amir Zandieh

Vector quantization is a prevalent technique for reducing the memory footprint in a wide range of computational problems, such as the training and deployment of deep learning models and vector search systems. We introduce PolarQuant, a novel online vector quantization method that leverages random preconditioning and polar transformation. Our approach efficiently transforms Euclidean vectors into polar coordinates using a recursive algorithm, and then quantizes the resulting angles. A key insight is that, following random preconditioning, the angles in the polar representation exhibit a tightly bounded, highly concentrated, and analytically computable distribution, independent of the input data. This nice distribution eliminates the need for explicit normalization and learned data-dependent quantization codebooks, steps that introduce significant memory and runtime overhead in traditional product/scalar quantization methods. By circumventing this data-dependent step, PolarQuant achieves substantial memory and runtime savings, making it highly suitable for online scenarios such as KV cache compression. The long-context evaluation demonstrates that PolarQuant compresses the KV cache by over 4.2X while achieving the best quality scores compared to the state-of-the-art methods.

We study sampling from posterior distributions in Bayesian linear inverse problems where $\mathbf{A}$, the parameters to observables operator, is computationally expensive. In many applications $\mathbf{A}$ can be factored in a manner that facilitates the construction of a cost-effective approximation $\widetilde{\mathbf{A}}$. In this framework, we introduce Latent-IMH, a sampling method based on the Metropolis-Hastings independence (IMH) sampler. \newimh{} first generates intermediate latent variables using the approximate $\widetilde{\mathbf{A}}$, and then refines them using the exact $\mathbf{A}$. Its primary benefit is that it shifts the computational cost to an offline phase. We theoretically analyze the performance of Latent-IMH using KL divergence and mixing time bounds. Using numerical experiments on several model problems, we show that, under reasonable assumptions, it outperforms state-of-the-art methods such as the No-U-Turn sampler (NUTS) in computational efficiency. In some cases Latent-IMH can be orders of magnitude faster than existing schemes.


209
Multi-Agent Lipschitz Bandits

Sourav Chakraborty ⋅ Amit Rege ⋅ Claire Monteleoni ⋅ Lijun Chen

We study the decentralized multi-player stochastic bandit problem over a continuous, Lipschitz-structured action space where hard collisions yield zero reward. Our objective is to design a communication-free policy that maximizes collective reward, while separating coordination costs from learning costs. We propose a modular protocol that first solves the multi-agent coordination problem by identifying and seating players on distinct, high-value regions via a novel maxima-directed search and then decouples the problem into $N$ independent single-player Lipschitz bandits. In the consensus regime, we obtain an end-to-end regret bound whose dominant learning term is \(\tilde{O}(T^{(d+1)/(d+2)})\), matching the single-player Lipschitz rate; the upfront coordination cost is horizon-independent at fixed confidence and only polylogarithmic in \(T\) in the expected-regret form. Under an additional public coverage/scheduling assumption for the epochic extension, we also obtain a gap-free \(\tilde{O}(T^{(d+1)/(d+2)})\) guarantee. We further derive a matching lower bound for the dominant learning term and extend the framework to general distance-threshold collision models.


21
Direct Preference Optimization With Unobserved Preference Heterogeneity

Keertana Chidambaram ⋅ Karthik Seetharaman ⋅ Vasilis Syrgkanis

Reinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values, typically by first learning a reward model from preference data which is then used to update the model with reinforcement learning. Recent alternatives such as Direct Preference Optimization (DPO) simplify this pipeline by directly optimizing on preferences. However, both approaches often assume uniform annotator preferences and rely on binary comparisons, overlooking two key limitations: the diversity of human evaluators and the limitations of pairwise feedback. In this work, we address both these issues. First, we connect preference learning in RLHF with the econometrics literature and show that binary comparisons are insufficient for identifying latent user preferences from finite user data and infinite users, while (even incomplete) rankings over three or more responses ensure identifiability. Second, we introduce methods to incorporate heterogeneous preferences into alignment algorithms. We develop an Expectation-Maximization adaptation of DPO that discovers latent annotator types and trains a mixture of LLMs accordingly. Then we propose an aggregation algorithm using a min-max regret fairness criterion to produce a single generative policy with equitable performance guarantees. Together, these contributions establish a theoretical and algorithmic framework for fairness and personalization for diverse users in generative model alignment.


210
Black-Box Optimization from Small Offline Datasets via Meta Learning with Synthetic Tasks

Azza Fadhel ⋅ The Hung Tran ⋅ Nghia Hoang ⋅ Jana Doppa

We consider the problem of offline black-box optimization, where the goal is to discover optimal designs (e.g., molecules or materials) from past experimental data. A key challenge in this setting is data scarcity: in many scientific applications, only small or poor-quality datasets are available, which severely limits the effectiveness of existing algorithms. Prior work has theoretically and empirically shown that performance of offline optimization algorithms depends on how well the surrogate model captures the optimization bias (i.e., ability to rank input designs correctly), which is challenging to accomplish with limited experimental data. This paper proposes {\em Surrogate Learning with Optimization Bias via Synthetic Task Generation} (\textsc{OptBias}), a meta-learning framework that directly tackles data scarcity. OptBias learns a reusable optimization bias by training on synthetic tasks generated from a Gaussian process, and then fine-tunes the surrogate model on the small data for the target task. Across diverse continuous and discrete offline optimization benchmarks, OptBias consistently outperforms state-of-the-art baselines in small data regimes. These results highlight OptBias as a robust and practical solution for offline optimization in realistic small data settings.


211
Amortized Structural Variational Inference

Shitao Fan ⋅ Carlos Misael Madrid Padilla ⋅ Yun Yang ⋅ Lizhen Lin

Variational inference (VI) is widely used for approximate Bayesian inference, but it can scale poorly and often requires re-optimization when new data arrive. Amortized variational inference (AVI) learns a global inference map, yet standard mean-field AVI can suffer from large variational and amortization gaps because of independence assumptions. We propose amortized structural variational inference (ASVI), which injects structural dependencies among latent variables through neural architectures that encode local neighborhood information. ASVI reduces both gaps while retaining scalability. Simulations and real-data experiments show that ASVI improves predictive accuracy and posterior fidelity over AVI, and matches structured VI at lower computational cost.


212
Uncovering Hidden Training Dynamics in Neural Networks via Inter-Sample Influence Graphs

Dylan Tai ⋅ Jiayin Zhang ⋅ Rohan Ghosh ⋅ Mehul Motani

Deep learning models are primarily trained through batchwise optimization, where each update can potentially be a tug‑of‑war among samples, shaping the overall trajectory of learning. Existing interpretability tools, most notably influence functions, have provided valuable insights into how individual training samples affect model predictions, primarily at test time. However, these methods were not intended to capture these inter-sample interactions that arise during training. Here, we ask a complementary question: How does optimizing the loss on one training sample affect the loss on the rest during learning? We introduce Influence Graphs (IGs), directed inter-sample graphs where each edge weight $w_{ij}$ quantifies how optimizing on sample $X_i$ influences the loss of sample $X_j$. We estimate these influences via simulated batch interventions and slope coefficients of loss changes, enabling scalable construction of IGs during training. We further define the {Mean-of-Mean In-Degree Influence (MMDI)} and prove it bounds generalization under practical assumptions. Empirically, MMDI correlates strongly with test accuracy in noisy-label settings, making it a useful diagnostic of model quality even before test metrics are available. Finally, we show that IGs reveal distinct, evolving training phases, offering a new lens on the dynamics of learning.

Low-precision execution can induce substantial forward discrepancies in Transformers even for fixed weights and input, yet these discrepancies are usually monitored only at the output and lack a layer-wise theoretical account. We develop a first-order decomposition of output mismatch into layer-local attention, LayerNorm, and residual-transport terms, and derive from it a practical causal risk estimator and a budgeted controller, Bound-Guided Selective Stabilization (BGSS). Controlled sweeps verify the predicted local sign, monotonicity, and transport structure. On GPT-2, the transport-aware combined predictor is positively correlated with FP32-reference mismatch in all 18 runs and improves over a no-transport ablation in 17/18 runs. Reference-patch attribution shows that the same score preserves useful layer ordering information (mean Spearman 0.362). In budget-matched mitigation, BGSS outperforms random same-budget control in onset events (10.67 vs. 11.67), final mismatch (0.001243 vs. 0.001284), and worst-case mismatch (0.00314 vs. 0.00849), while matching a risk-only same-budget controller on onset suppression and sharply reducing worst-case mismatch (0.00314 vs. 0.00571). These results support a theory-to-algorithm account of Transformer numerical fragility in which finite-precision risk can be analyzed, estimated, localized, and selectively stabilized.


214
One-Step Diffusion Samplers via Self-Distillation and Deterministic Flow

Pascal Jutras-Dube ⋅ Jiaru Zhang ⋅ Ziran Wang ⋅ Ruqi Zhang

Sampling from unnormalized target distributions is a fundamental yet challenging task in machine learning and statistics. Existing sampling algorithms typically require many iterative steps to produce high-quality samples, leading to high computational costs. We introduce one-step diffusion samplers which learn a step-conditioned ODE so that one large step reproduces the trajectory of many small ones via a state-space consistency loss. We further show that standard ELBO estimates in diffusion samplers degrade in the few-step regime because common discrete integrators yield mismatched forward/backward transition kernels. Motivated by this analysis, we derive a deterministic-flow (DF) importance weight for ELBO estimation without a backward kernel. To calibrate DF, we introduce a volume-consistency regularization that aligns the accumulated volume change along the flow across step resolutions. Our proposed sampler therefore achieves both sampling and stable evidence estimate in only one or few step. Across challenging synthetic and Bayesian benchmarks, it achieves competitive sample quality with orders‑of‑magnitude fewer network evaluations while maintaining robust ELBO estimates.

Pure exploration in bandits formalises multiple real-world problems, such as tuning hyper-parameters or conducting user studies to test a set of items, where different safety, resource, and fairness constraints on the decision space naturally appear. We study these problems as pure exploration in multi-armed bandits with unknown linear constraints, where the aim is to identify an *$r$-optimal and feasible policy* as fast as possible with a given level of confidence. First, we propose a Lagrangian relaxation of the sample complexity lower bound for pure exploration under constraints. Second, we leverage properties of convex optimisation in the Lagrangian lower bound to propose two computationally efficient extensions of Track-and-Stop and Gamified Explorer, namely LATS and LAGEX. Then, we propose a constraint-adaptive stopping rule, and while tracking the lower bound, use optimistic estimate of the feasible set at each step. We show that LAGEX achieves asymptotically optimal sample complexity upper bound, while LATS shows asymptotic optimality up to *novel* constraint-dependent constants. Finally, we conduct numerical experiments with different reward distributions and constraints that validate efficient performance of LATS and LAGEX.


23
SenTSR-Bench: Thinking with Injected Knowledge for Time-Series Reasoning

Zelin He ⋅ Boran Han ⋅ Xiyuan Zhang ⋅ Shuai Zhang ⋅ Haotian Lin ⋅ Qi Zhu ⋅ Haoyang Fang ⋅ Danielle Maddix ⋅ Abdul Fatir Ansari ⋅ Akash Chandrayan ⋅ Abhinav Pradhan ⋅ Bernie Wang ⋅ Matthew Reimherr

Time‐series diagnostic reasoning is essential for many applications, yet existing solutions face a persistent gap: general reasoning large language models (GRLMs) possess strong reasoning skills but lack the domain-specific knowledge to understand complex time-series patterns. Conversely, fine-tuned time-series LLMs (TSLMs) understand these patterns but lack the capacity to generalize reasoning for more complicated questions. To bridge this gap, we propose a hybrid knowledge-injection framework that injects TSLM-generated insights directly into GRLM's reasoning trace, thereby achieving strong time-series reasoning with in-domain knowledge. As collecting data for knowledge injection fine-tuning is costly, we further leverage a reinforcement learning-based approach with verifiable rewards (RLVR) to elicit knowledge-rich traces without human supervision, then transfer such an in-domain thinking trace into GRLM for efficient knowledge injection. We further release SenTSR-Bench, a multivariate time-series-based diagnostic reasoning benchmark collected from real-world industrial operations. Across SenTSR-Bench and other public datasets, our method consistently surpasses TSLMs by 9.1%–26.1% and GRLMs by 7.9%–22.4%, delivering robust, context-aware time-series diagnostic insights.


24
Visual Prompting Reimagined: The Power of Activation Prompts

Yihua Zhang ⋅ Hongkang Li ⋅ Yuguang Yao ⋅ Aochuan Chen ⋅ Shuai Zhang ⋅ Pin-Yu Chen ⋅ Meng Wang ⋅ Sijia Liu

Visual prompting (VP) has emerged as a popular method to repurpose large pretrained models for downstream vision tasks. Unlike many parameter-efficient finetuning (PEFT) techniques that modify model parameters, VP introduces a universal perturbation directly into the input data to facilitate task-specific finetuning while keeping the pretrained model intact. However, there exists a noticeable performance gap between VP and conventional finetuning methods, highlighting an unexplored realm in theory and practice to understand and advance VP to close its performance gap. Towards this end, we introduce a novel concept, termed activation prompt (AP), which extends the scope of input-level VP by enabling universal perturbations to be applied to activation maps within the intermediate layers of the model. With the aid of AP, we show that VP, by its input perturbation design, has intrinsic limitations in both performance and efficiency. By contrast, AP shares a natural connection to normalization tuning, e.g., batch normalization for convolutional neural networks (CNNs) and layer normalization for vision transformers (ViTs). This illuminates the reason behind the observed better accuracy of normalization tuning than VP in the literature. Furthermore, we show that the choice of prompting exhibits a distinct preference for layer depth, with conclusions varying significantly between CNNs and ViTs. We theoretically elucidate the rationale behind such preference by analyzing global features across layers. By conducting extensive experiments across 29 datasets and various model architectures, we provide a thorough performance analysis of AP, comparing it with VP and PEFT baselines. Our experimental results demonstrate that AP significantly surpasses the input-level VP in terms of both accuracy and efficiency, considering factors like time, parameters, memory usage, and throughout. These results further support our new insights into the incapabilities of VP and the capabilities of AP.


25
Optimistic Reinforcement Learning with Quantile Objectives

Mohammad Alipour-Vaezi ⋅ Huaiyang Zhong ⋅ Kwok-leung Tsui ⋅ Sajad Khodadadian

Reinforcement Learning (RL) has achieved tremendous success in recent years. However, the classical foundations of RL do not account for the risk sensitivity of the objective function, which is critical in various fields, including healthcare, finance, etc. A popular approach to incorporate risk sensitivity is to optimize a specific quantile of the cumulative reward distribution. In this paper, we develop UCB-QRL, an optimistic learning algorithm for the $\tau$-quantile objective in finite-horizon Markov decision processes (MDPs). UCB-QRL is an iterative algorithm in which, at each iteration, we first estimate the underlying transition probability and then optimize the quantile value function over a confidence ball around this estimate. Here, we show that UCB-QRL yields high-probability regret bounds $\mathcal O\left((2/\kappa)^HH\sqrt{SATH\log(2SATH/\delta)}\right)$ in the episodic setting with $S$ states, $A$ actions, $T$ episodes, and $H$ horizons. Here, $\kappa>0$ is a problem-dependent constant that captures the sensitivity of the underlying MDP's quantile value.

The vulnerabilities of deep neural networks against singularities have raised serious concerns regarding their deployment in the physical world. One of the most prominent and impactful physical-world adversarial perturbations is the attachment of patches to clean images, known as an adversarial patch attack. Similarly, natural noises such as Gaussian and Salt&Pepper are highly prevalent in the real world. The current research need arises from the above vulnerabilities and the lack of efforts to tackle these two singularities independently and, especially, in combination. In this research, we have, for the first time, combined these two prominent singularities and proposed a novel dataset. Using this dataset, we have conducted a benchmark study of singularity data-point detection using features from several convolutional neural networks. For classification, rather than the popular neural network-based parameter tuning, we have used traditional yet effective machine learning classifiers. The extensive experiments across various in- and out-of-distribution (OOD) singularities reveal several interesting findings about the effectiveness of classifiers and show that it is hard to defend against adversaries when they are treated independently, and inefficient classifiers are selected.


27
ConDiSim: Conditional Diffusion Models for Simulation-Based Inference

Mayank Nautiyal ⋅ Andreas Hellander ⋅ Prashant Singh

We present ConDiSim, a conditional diffusion model for simulation-based inference in complex systems with intractable likelihoods. ConDiSim leverages denoising diffusion probabilistic models to approximate posterior distributions, consisting of a forward process that adds Gaussian noise to parameters, and a reverse process learning to denoise, conditioned on observed data. This approach effectively captures complex dependencies and multi-modalities within posteriors. ConDiSim is evaluated across ten benchmark problems and two real-world test problems, where it demonstrates effective posterior approximation accuracy while maintaining computational efficiency and stability in model training. ConDiSim provides a robust and extensible framework for simulation-based inference, well suited to parameter estimation tasks that demand fast methods for handling noisy, time series observations.


28
Deep Feedback Models

David Calhas ⋅ Arlindo Oliveira

Deep Feedback Models (DFMs) are a new class of stateful neural networks that combine bottom up input with high level representations over time. This feedback mechanism introduces dynamics into otherwise static architectures, enabling DFMs to iteratively refine their internal state and mimic aspects of biological decision making. We model this process as a differential equation solved through a recurrent neural network, stabilized via exponential decay to ensure convergence. To evaluate their effectiveness, we measure DFMs under two key conditions: robustness to noise and generalization with limited data. In both object recognition and segmentation tasks, DFMs consistently outperform their feedforward counterparts, particularly in low data or high noise regimes. In addition, DFMs translate to medical imaging settings, while being robust against various types of noise corruption. These findings highlight the importance of feedback in achieving stable, robust, and generalizable learning. Code is available at github.com/DCalhas/deepfeedbackmodels.


29
WSBD: Freezing-Based Optimizer for Quantum Neural Networks

Christopher Kverne ⋅ Mayur Akewar ⋅ Yuqian Huo ⋅ Tirthak Patel ⋅ Janki Bhimani

The training of Quantum Neural Networks (QNNs) is hindered by the high computational cost of gradient estimation and the barren plateau problem, where optimization landscapes become intractably flat. To address these challenges, we introduce Weighted Stochastic Block Descent (WSBD), a novel optimizer with a dynamic, parameter-wise freezing strategy. WSBD intelligently focuses computational resources by identifying and temporarily freezing less influential parameters based on a gradient-derived importance score. This approach significantly reduces the number of forward passes required per training step and helps navigate the optimization landscape more effectively. Unlike pruning or layer-wise freezing, WSBD maintains full expressive capacity while adapting throughout training. Our extensive evaluation shows that WSBD converges on average 63.9\% faster than Adam for the popular ground-state-energy problem, an advantage that grows with QNN size. We provide a formal convergence proof for WSBD and show that parameter-wise freezing outperforms traditional layer-wise approaches in QNNs. Project page: https://github.com/Damrl-lab/WSBD-Stochastic-Freezing-Optimizer.


3
Auto-Regressive Masked Diffusion Models

Mahdi Karami ⋅ Ali Ghodsi

Masked diffusion models (MDMs) have emerged as a promising approach for language modeling, yet they face a performance gap compared to autoregressive models (ARMs) and require more training iterations. In this work, we present the Auto-Regressive Masked Diffusion (ARMD) model, an architecture designed to bridge this gap by unifying the training efficiency of autoregressive models with the strengths of diffusion-based learning. Our key insight is to interpret masked diffusion process as a block-wise causal model. This allows us to design a strictly causal, permutation-equivariant, attention-based architecture that computes all conditional probabilities across multiple denoising steps in a single, parallel forward pass. The resulting architecture supports efficient, autoregressive-style decoding and a progressive permutation training scheme, allowing the model to learn both canonical left-to-right and random token orderings. On standard language model- ing benchmarks, ARMD achieves state-of-the- art performance, outperforming established diffusion-based methods while requiring significantly fewer training steps.

Techniques for concept extraction, such as sparse autoencoders and transcoders, aim to extract high-level symbolic concepts from low-level nonsymbolic representations. When these extracted concepts are used for downstream tasks such as model steering and unlearning, it is essential to understand their guarantees, or lack thereof. In this work, we present a unified theoretical framework for unsupervised concept extraction, in which we frame the task of concept extraction as identifying a generative model. We present a general meta-theorem for identifiability, which reduces the problem of establishing identifiability guarantees to the problem of characterizing the intersection of two sets. As we demonstrate on a range of widely-used approaches, this meta-theorem substantially simplifies the task of proving such guarantees, thus paving the way for the development of new, principled approaches for concept extraction.


31
Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs

Hen Davidov ⋅ Shai Feldman ⋅ Gilad Freidkin ⋅ Yaniv Romano

We introduce time-to-unsafe-sampling, a novel safety measure for generative models, defined as the number of generations required by a large language model (LLM) to trigger an unsafe (e.g., toxic) response. While providing a new dimension for prompt-adaptive safety evaluation, quantifying time-to-unsafe-sampling is challenging: unsafe outputs are often rare in well-aligned models and thus may not be observed under any feasible sampling budget. To address this challenge, we frame this estimation problem as one of survival analysis. We build on recent developments in conformal prediction and propose a novel calibration technique to construct a lower predictive bound (LPB) on the time-to-unsafe-sampling of a given prompt with rigorous coverage guarantees. Our key technical innovation is an optimized sampling-budget allocation scheme that improves sample efficiency while maintaining distribution-free guarantees. Experiments on both synthetic and real data support our theoretical results and demonstrate the practical utility of our method for safety risk assessment in generative AI models.

Convex Markov Games (cMGs) were recently introduced as a broad class of multi-agent learning problems that generalize Markov games to settings where strategic agents optimize general utilities beyond additive rewards. While cMGs expand the modeling frontier, their theoretical foundations, particularly the structure of Nash equilibria (NE) and guarantees for learning algorithms, are not yet well understood. In this work, we address these gaps for an extension of cMGs, which we term General Utility Markov Games (GUMGs), capturing new applications requiring coupling between agents' occupancy measures. We prove that in GUMGs, Nash equilibria coincide with the fixed points of projected pseudo-gradient dynamics (i.e. first-order stationary points), enabled by a novel agent-wise gradient domination property. This insight also yields a simple proof of NE existence using Brouwer’s fixed-point theorem. We further show the existence of Markov perfect equilibria. Building on this characterization, we establish a policy gradient theorem for GUMGs and design a model-free policy gradient algorithm. For potential GUMGs, we establish iteration complexity guarantees for computing approximate-NE under exact gradients and provide sample complexity bounds in both the generative model and on-policy settings. Our results extend beyond prior work restricted to zero-sum cMGs, providing the first theoretical analysis of common-interest cMGs.

In this work, we establish the sufficient conditions under which nonlinear Canonical Correlation Analysis (CCA) recovers ground-truth latent factors up to an affine transformation. By transporting the analysis from the observation space to the source space, we extend classical statistical results on orthogonal polynomial expansions of bivariate distributions to representation learning, proving affine identifiability under specific distributional priors. We formally demonstrate that whitening is strictly necessary to ensure the boundedness and well-conditioning of the learned mappings. Furthermore, we bridge the gap between theory and practice by proving that ridge-regularized empirical CCA converges to its population counterpart in the finite-sample regime. Finally, our findings provide a rigorous theoretical foundation explaining the empirical success of recent correlation-based non-contrastive learning methods. Experiments on synthetic and rendered image datasets, alongside systematic ablations, validate the predicted recovery behavior and illustrate the failure modes that arise when the assumptions are violated.


35
A Gaussian Process View on Observation Noise and Initialization in Wide Neural Networks

Sergio Calvo Ordoñez ⋅ Jonathan Plenk ⋅ Richard Bergna ⋅ Alvaro Cartea ⋅ Jose Miguel Hernandez-Lobato ⋅ Konstantina Palla ⋅ Kamil Ciosek

Performing gradient descent in a wide neural network is equivalent to computing the posterior mean of a Gaussian Process with the Neural Tangent Kernel (NTK-GP), for a specific prior mean and with zero observation noise. However, existing formulations have two limitations: (i) observation noise, since the NTK-GP assumes noiseless targets, leading to misspecification on noisy data; (ii) the equivalence does not extend to arbitrary prior means, which are essential for well-specified models. To address (i), we introduce a regularizer into the training objective, showing its correspondence to incorporating observation noise in the NTK-GP. To address (ii), we propose a \textit{shifted network} that enables arbitrary prior means and allows obtaining the posterior mean with gradient descent on a single network, without ensembling or kernel inversion. We validate our results with experiments across datasets and architectures, showing that this approach removes key obstacles to the practical use of NTK-GP equivalence in applied Gaussian process modeling.


36
High-Probability Bounds for Heterogeneous Local Differential Privacy

Maryam Aliakbarpour ⋅ Alireza Fallah ⋅ Swaha Roy ⋅ Ria Stevens

We study statistical estimation under local differential privacy (LDP) when users may hold heterogeneous privacy levels and accuracy must be guaranteed with high probability. Departing from the common in-expectation analyses, and for one-dimensional and multi-dimensional mean estimation problems, we develop finite sample upper bounds in $\ell_2$-norm that hold with probability at least $1-\beta$. We complement these results with matching minimax lower bounds, establishing the optimality (up to constants) of our guarantees in the heterogeneous LDP regime. We further study distribution learning in $\ell_\infty$-distance, designing an algorithm with high-probability guarantees under heterogeneous privacy demands. Our techniques offer principled guidance for designing mechanisms in settings with user-specific privacy levels.


37
Accelerated Learning on Large-Scale Screens using Generative Library Models

Eli Weinstein ⋅ Andrei Slabodkin ⋅ Mattia Gollub ⋅ Elizabeth Wood

Biological machine learning is often bottlenecked by a lack of scaled data. One promising route to relieving data bottlenecks is through high-throughput screens, which can experimentally test the activity of $10^6-10^{12}$ protein sequences in parallel. In this article, we introduce algorithms to optimize high throughput screens for data creation and model training. We focus on the large-scale regime, where dataset sizes are limited by the cost of measurement and sequencing. We show that when active sequences are rare, we maximize information gain if we \textit{only} collect positive examples of active sequences, i.e. $x$ with $y>0$. We can correct for the missing negative examples using a generative model of the library, producing a consistent and efficient estimate of the true $p(y\mid x)$. We demonstrate this approach in simulation and on a large-scale screen of antibodies. Overall, co-design of experiments and inference lets us accelerate learning dramatically.


38
Recovery Guarantees for Continual Learning of Dependent Tasks: Memory, Data-Dependent Regularization, and Data-Dependent Weights

Liangzu Peng ⋅ Uday Kiran Reddy Tadipatri ⋅ Ziqing Xu ⋅ Eric Eaton ⋅ Rene Vidal

Continual learning (CL) is concerned with learning multiple tasks sequentially without forgetting previously learned tasks. Despite substantial empirical advances over recent years, the theoretical development of CL remains in its infancy. At the heart of developing CL theory lies the challenge that the data distribution varies across tasks, and we argue that properly addressing this challenge requires understanding this variation–dependency among tasks. To explicitly model task dependency, we consider nonlinear regression tasks and propose the assumption that these tasks are dependent in such a way that the data of the current task is a nonlinear transformation of previous data. With this model and under natural assumptions, we prove statistical recovery guarantees (more specifically, bounds on estimation errors) for several CL paradigms in practical use, including experience replay with data-independent regularization and data-independent weights that balance the losses of tasks, replay with data-dependent weights, and continual learning with data-dependent regularization (e.g., knowledge distillation). To the best of our knowledge, our bounds are informative in cases where prior work gives vacuous bounds.


39
TENDE: Transfer Entropy Neural Diffusion Estimation

Simon Pedro Galeano Munoz ⋅ Maurizio Filippone ⋅ Giulio Franzese ⋅ Mustapha Bounoua ⋅ Pietro Michiardi

Transfer entropy is a fundamental measure for quantifying directed information flow in time series, with applications spanning neuroscience, finance, and complex systems analysis. However, existing estimation methods suffer from the curse of dimensionality, require restrictive distributional assumptions, or need exponentially large datasets for reliable convergence. We address these limitations in the literature by proposing TENDE (Transfer Entropy Neural Diffusion Estimation), a novel approach that leverages score-based diffusion models to estimate transfer entropy through conditional mutual information. By learning score functions of the relevant conditional distributions, TENDE provides flexible, scalable estimation while making minimal assumptions about the underlying data-generating process. We demonstrate superior accuracy and robustness compared to existing neural estimators and other state-of-the-art approaches across synthetic benchmarks and real data.


4
$\epsilon$-Identifiability of Causal Quantities

Ang Li ⋅ Scott Mueller ⋅ Xin Shu ⋅ Judea Pearl

Identifying the effects of causes and causes of effects is vital in virtually every scientific field. Often, however, the needed probabilities may not be fully identifiable from the available data sources. This paper shows how approximate identifiability is still possible for several probabilities of causation. We term this $\epsilon\text{-identifiability}$ and demonstrate its usefulness in cases where the behavior of certain subpopulations can be restricted within sufficiently narrow bounds. In particular, we show how unidentifiable causal effects and counterfactual probabilities can be $\epsilon\text{-identified}$ when such allowances are made. Often, these allowances are easily measured and reasonably assumed. Finally, $\epsilon\text{-identifiability}$ is applied to the unit selection problem.


40
Calibrated Principal Component Regression

Yixuan Florence Wu ⋅ Yilun Zhu ⋅ Lei Cao ⋅ Naichen Shi

We propose a new method for statistical inference in generalized linear models. In the overparameterized regime, Principal Component Regression (PCR) reduces variance by projecting high-dimensional data to a low-dimensional principal subspace before fitting. However, PCR incurs truncation bias whenever the true regression vector has mass outside the retained principal components (PC). To mitigate the bias, we propose Calibrated Principal Component Regression (CPCR), which first learns a low-variance prior in the PC subspace and then calibrates the model in the original feature space via a centered Tikhonov step. CPCR leverages cross-fitting and controls the truncation bias by softening PCR's hard cutoff. Theoretically, we calculate the out-of-sample risk in the random matrix regime, which shows that CPCR outperforms standard PCR when the regression signal has non-negligible components in low-variance directions. Empirically, CPCR consistently improves prediction across multiple overparameterized problems. The results highlight CPCR's stability and flexibility in modern overparameterized settings.


41
Modeling Multi-Objective Tradeoffs with Monotonic Utility Functions

Edward Chen ⋅ Natalie Dullerud ⋅ Thomas Niedermayr ⋅ Elizabeth Kidd ⋅ Ransalu Senanayake ⋅ Pang Wei Koh ⋅ Sanmi Koyejo ⋅ Carlos Guestrin

Countless science and engineering applications in multi-objective optimization (MOO) necessitate that decision-makers (DMs) select a Pareto-optimal (PO) solution which aligns with their preferences. Evaluating individual solutions is often expensive, and the high-dimensional trade-off space makes exhaustive exploration of the full Pareto frontier (PF) infeasible. We introduce a novel, principled two-step process for obtaining a compact set of PO points that aligns with user preferences, which are specified a priori as general monotonic utility functions (MFs). Our process (1) densely samples the user's region of interest on the PF, then (2) sparsifies the results into a small, diverse set for the DM. We instantiate this framework with soft-hard functions (SHFs), an intuitive class of MFs that operationalizes the common expert heuristic of imposing soft and hard bounds. We provide extensive empirical validation of our framework instantiated with SHFs on diverse domains, including brachytherapy, engineering design, and large language models. For brachytherapy, our approach returns a compact set of points with over 3% greater SHF-defined utility than the next best approach. Among the other domains, our approach consistently leads in utility, as a final compact set of just 5 points captures over 99% of the utility offered by the entire dense set.


42
Explicit Density Approximation for Neural Implicit Samplers Using a Bernstein-Based Convex Divergence

José Manuel de Frutos ⋅ Pablo Martínez Olmos ⋅ Manuel Vázquez ⋅ Joaquín Míguez

Rank-based objectives such as the invariant statistical loss (ISL) are robust, likelihood-free tools for training implicit generative models. We propose \emph{dual-ISL}, obtained by interchanging the roles of the target $p$ and model density $\tilde p$ within ISL, which induces a \emph{convex} optimization problem over model densities. We show that the associated rank-based discrepancy $d_K$ is \emph{continuous} under weak and $L^1$ convergence and \emph{convex} in its first argument, properties not shared by classical divergences such as KL or Wasserstein distances. Additionally, we prove that $d_K$ admits an $L^2$ interpretation: it is the projection of the density ratio $q=p/\tilde p$ onto a Bernstein polynomial basis. This yields explicit truncation-error bounds, sharp convergence rates, and a closed-form expression for the truncated density approximation. To handle multivariate data, we further introduce a sliced dual-ISL via random one-dimensional projections that preserves both continuity and convexity. Empirically, across several benchmarks, dual-ISL delivers faster and smoother convergence than standard ISL and offers competitive, often superior, mode coverage relative to state-of-the-art implicit models (modern GAN baselines, including multi-critic setups), while providing an explicit density approximation.


43
On the Normalization of Confusion Matrices: Methods and Geometric Interpretations

Johan Erbani ⋅ Sonia Ben Mokhtar ⋅ Pierre-Edouard Portier ⋅ Elöd Egyed-Zsigmond ⋅ Diana Nurbakova

The confusion matrix is a standard tool for evaluating classifiers, providing a detailed view of model errors. In heterogeneous settings, its entries are influenced by two main factors: class similarity, reflecting how easily the model confuses certain classes, and distribution bias, stemming from imbalanced training or test distributions. Because confusion matrix values jointly reflect both factors, it is difficult to disentangle their individual effects. To address this issue, we introduce bi-normalization via Iterative Proportional Fitting, a generalization of row and column normalization. Unlike standard approaches, this method recovers the underlying structure of class similarity. By disentangling error sources, it enables a more precise diagnosis of model behavior and facilitates classifier improvement. We further establish connections between normalization, importance sampling, and class representations in the model’s latent space, thus offering a clearer interpretation of normalization schemes. Our implementation is publicly available.


44
How to Approximate Inference with Subtractive Mixture Models

Lena Zellinger ⋅ Nicola Branchini ⋅ Lennert De Smet ⋅ Victor Elvira ⋅ Nikolay Malkin ⋅ Antonio Vergari

Classical mixture models (MMs) are widely used tractable proposals for approximate inference settings such as variational inference (VI) and importance sampling (IS). Recently, mixture models with negative coefficients, called subtractive mixture models (SMMs), have been proposed as a potentially more expressive alternative. However, how to effectively use SMMs for VI and IS is still an open question as they do not provide latent variable semantics and therefore cannot use sampling schemes for classical MMs. In this work, we study how to circumvent this issue by designing several expectation estimators for IS and learning schemes for VI with SMMs, and we empirically evaluate them for distribution approximation. Finally, we discuss the additional challenges in estimation stability and learning efficiency that they carry and propose ways to overcome them. Code is available at https://github.com/april-tools/delta-vi.

While security research in Federated Learning (FL) has predominantly focused on protecting client data, the confidentiality of the model parameters themselves represents a critical and underexplored vulnerability. This work addresses model reconstruction attacks by passive eavesdroppers, a threat present in common update strategies like transmitting full models or model increments. To our knowledge, we are the first to repurpose dynamic uniform quantization as a dedicated defense for model confidentiality. Our lightweight, architecture-agnostic approach combines low-bit quantization with an adaptive clipping rule to thwart reconstruction attacks, even under warm adversary initialization. We provide theoretical guarantees establishing that our defense offers persistent, non-zero protection in both protocols. Across extensive experiments on CIFAR-10 and CIFAR-100, with up to 1000 clients in heterogeneous settings, our method reduces the adversary's test accuracy to near-random levels while maintaining global accuracy within 4\% of the unquantized baseline. Our findings establish that repurposing quantization is a simple yet highly effective strategy for securing the largely overlooked area of model confidentiality in FL.


46
Frequency-Based Hyperparameter Selection in Games

Aniket Sanyal ⋅ Baraah Sidahmed ⋅ Rebekka Burkholz ⋅ Tatjana Chavdarova

Learning in smooth games fundamentally differs from standard minimization due to rotational dynamics, which invalidate classical hyperparameter tuning strategies. Despite their practical importance, effective methods for tuning in games remain underexplored. A notable example is LookAhead (LA), which achieves strong empirical performance but introduces additional parameters that critically influence performance. We propose a principled approach to hyperparameter selection in games by leveraging frequency estimation of oscillatory dynamics. Specifically, we analyze oscillations both in continuous-time trajectories and through the spectrum of the discrete dynamics in the associated frequency-based space. Building on this analysis, we introduce Modal LookAhead (MoLA), an extension of LA that selects the hyperparameters adaptively to a given problem. We provide convergence guarantees and demonstrate in experiments that MoLA accelerates training in both purely rotational games and mixed regimes, all with minimal computational overhead.


48
ACE-KT: Cascaded Cognitive Modeling for Stage-wise Knowledge Tracing

Teng Guo ⋅ Yubin Xia ⋅ Jinsen Ke ⋅ Mingliang Hou ⋅ Jiaqi Zheng ⋅ Zitao Liu

Knowledge Tracing (KT) aims to predict students’ academic performance by modeling their knowledge mastery over time, based on their historical learning interactions. However, current KT models often oversimplify student interactions by treating them as standard time series rather than as cognitive processes. Consequently, modeling student learning as a process of cognitive transformation rather than as a mere sequence of time-stamped events remains a fundamental challenge in KT research. To address this issue, we propose ACE-KT (cAscaded Cognitive modEling for Knowledge Tracing), a novel framework inspired by cognitive process theory, which shifts the focus from purely sequential modeling to cognitive representation learning. Specifically, we design a cascaded cognitive framework inspired by human cognitive processes in three sequential stages: convolution-based rhythm perception module, Transformer encoder-based contextual structuring module, and cognitive integration module implemented via a selective structured state space model. Extensive experiments on five real-world datasets demonstrate that ACE-KT consistently outperforms 21 SOTA KT baselines, demonstrating its effectiveness. The source code is publicly available at our GitHub repository (https://github.com/AWord992/ACEKT.git).


49
Shift is Good: Mismatched Data Mixing Improves Test Performance

Marko Medvedev ⋅ Kaifeng Lyu ⋅ Zhiyuan Li ⋅ Nathan Srebro

We consider training and testing on mixture distributions with different training and test proportions. We show that in many settings, and in some sense generically, distribution shift can be beneficial, and test performance can improve due to mismatched training proportions, even if the components are unrelated and with no transfer between components. In a variety of scenarios, we identify the optimal training proportions and the extent to which such distribution shift can be beneficial. We show how the same analysis applies also to a compositional setting with differing distribution of component ``skills'' at training and test.

Positional encoding (PE) is a core architectural component of Transformers, yet its impact on the Transformer's generalization and robustness remains unclear. In this work, we provide the first generalization analysis for single-layer Transformer under in-context regression that explicitly accounts for a trainable PE module. Our result shows that PE systematically enlarges the generalization gap. Extending to the adversarial setting, we derive the adversarial Rademacher generalization bound. We find that the gap between models with and without PE is magnified under attack, demonstrating that PE amplifies the vulnerability of models. Our bounds are empirically validated by a simulation study. Together, this work establishes a new framework for understanding the clean and adversarial generalization in ICL with PE.

Probabilistic relaxations of graph cuts offer a differentiable alternative to spectral clustering, enabling end-to-end and online learning without eigendecompositions, yet prior work centered on RatioCut and lacked general guarantees and principled gradients. We present a unified probabilistic framework that covers a wide class of cuts, including Normalized Cut. Our framework provides tight analytic upper bounds on expected discrete cuts via integral representations and Gauss hypergeometric functions with closed-form forward and backward. Together, these results deliver a rigorous, numerically stable foundation for scalable, differentiable graph partitioning covering a wide range of clustering and contrastive learning objectives.


51
Deliberate-When-Needed: Flow-Reasoner for Neuro-Symbolic Continuous Thought

Wenjie Shen ⋅ Boyang Li ⋅ Chao Yang ⋅ Shuang Li

We present Flow-Reasoner, a Deliberate-When-Needed neuro-symbolic model that integrates continuous latent cognition with selective symbolic reasoning. The mental module is a latent state vector evolving smoothly under a first-order ordinary differential equation (ODE), capturing continuous thought that drifts and decays between interventions. The action module is a temporal point process whose intensities are modulated by symbolic rules. Crucially, reasoning is not constant: it is triggered only at irregular instants—when an observed action arrives or when a latent state crosses a threshold—at which point a bounded differentiable forward-chaining procedure updates beliefs and adjusts event likelihoods. Between these triggers, cognition evolves autonomously under the ODE without symbolic intervention. This design yields a model that (i) unifies continuous-time dynamics with selective logical reasoning, (ii) predicts both the type and timing of future actions, and (iii) produces concise rule traces that explain predictions. Empirical studies on synthetic benchmarks and real-world behavioral datasets demonstrate that Flow-Reasoner consistently outperforms strong temporal point process baselines, while providing interpretable, cognitively inspired explanations of decision dynamics. The code is publicly available at https://github.com/shennnnwj/flow-reasoner.


52
Finite-Time Analysis of Gradient Descent for Shallow Transformers

Enes Arda ⋅ Semih Cayci ⋅ Atilla Eryilmaz

Understanding why Transformers perform so well remains challenging due to their non-convex optimization landscape. In this work, we analyze a shallow Transformer with $m$ independent heads trained by projected gradient descent in the kernel regime. Our analysis reveals two main findings: (i) the width required for nonasymptotic guarantees scales only logarithmically with the sample size $n$, and (ii) the optimization error is independent of the sequence length $T$. This contrasts sharply with recurrent architectures, where the optimization error can grow exponentially with $T$. The trade-off is memory: to keep the full context, the Transformer's memory requirement grows with the sequence length. We validate our theoretical results numerically in a teacher–student setting and compare Transformers with recurrent architectures on an autoregressive task.


53
Predictive Deep Sets

Alex Hämäläinen ⋅ Sammie Katt ⋅ Samuel Kaski

Amortized meta-learning methods, such as neural processes, promise near-instantaneous inference on new labeled datasets encountered during downstream tasks. Recent adaptations of the transformer architecture have propelled these approaches to impressive performance in tasks like function estimation, parameter inference, and decision-making. Curiously, their success still primarily stems from the expressiveness of transformers, lacking a bias for modeling the functional structures between features and labels shared across datasets. We argue and show that this leads to training sample inefficiency and sub-optimal performance, and address this by introducing a novel set encoding technique called Predictive Deep Sets. Our approach exploits a strong bias towards functional structures by meta-learning an RKHS that captures domain-critical functional patterns, and by representing datasets as optimal fit functions within this space. Besides providing theoretical justification for this approach, we empirically demonstrate orders of magnitude increases in training data sample efficiency compared to strong baselines across various settings.


54
ProxRouter: Proximity-Weighted LLM Query Routing for Improved Robustness to Outliers

Shivam Patel ⋅ Neharika Jali ⋅ Ankur Mallick ⋅ Gauri Joshi

Large language model (LLM) query routers are critical to modern AI platforms as they seek to improve efficiency by assigning inference queries to accurate, yet low-cost models. Parametric routers typically use trained neural networks for LLM selection but suffer from retraining and maintenance overheads. Nonparametric routers are training-free, instead estimating LLM accuracy and cost via similarity between encodings of the input query and training set queries. However, like their parametric counterparts, nonparametric routers struggle to generalize to outlier queries, an issue exacerbated by limited diversity in training sets which are costly to expand and difficult to keep current with ever-evolving use cases. We propose ProxRouter, which applies an exponentially tilted aggregation mechanism to balance bias and variance in nonparametric routers, improving their robustness to outliers. Experiments show ProxRouter enhances outlier routing while preserving inlier performance with minimal overhead.


56
Active Measurement of Two-Point Correlations

Max Hamilton ⋅ Daniel Sheldon ⋅ Subhransu Maji

Two-point correlation functions (2PCF) are often used to characterize how points cluster together. In this work, we are interested in measuring the 2PCF among a large number of points, but restricted to a subset that satisfies some property of interest. An example comes from astronomy, where scientists measure the 2PCF of star clusters, which make up only a tiny subset of possible sources within a galaxy. This task typically requires careful labeling of sources to construct catalogs, which is time-consuming. We present a human-in-the-loop framework for efficient estimation of 2PCF of target sources. By leveraging a pre-trained classifier to guide sampling, our approach adaptively selects the most informative points for human labeling. After each annotation, it produces unbiased estimates of pair counts across multiple distance bins simultaneously. Compared to simple Monte Carlo approaches, our method achieves substantially lower variance while significantly reducing annotation effort. We introduce a novel unbiased estimator, sampling strategy, and confidence-interval construction that together enable scalable and statistically grounded measurement of two-point correlations in astronomy datasets.


57
Optimal Learning in Games under Delayed Feedback

Ruotong Zhuang ⋅ Taira Tsuchiya ⋅ Shinji Ito

Learning in games is a central topic in both learning theory and game theory, and learning dynamics based on online learning have made significant theoretical and practical advances in recent years. In particular, in two-player zero-sum games, it has been shown that using optimistic follow-the-regularized-leader (OFTRL) as the online learning algorithm allows us to upper bound the individual regret for each player by a constant independent of the time horizon. However, in realistic game scenarios, players are not always able to observe the outcomes of their interactions immediately. Motivated by this, very recently, the problem of learning from delayed feedback in games has been proposed, and it has been shown that by using a variant of OFTRL, one can achieve a social regret upper bound of $\tilde{O}(D^2 + 1)$ for a fixed delay time $D$. This study investigates the optimal dependence on the delay parameter $D$ in the setting of learning from delayed feedback in games. In particular, we show that a simple algorithm that runs $D+1$ independent copies of the standard OFTRL designed for the non-delayed setting achieves social and individual regret upper bounds of $\tilde{O}(D + 1)$, thereby improving the existing bounds by a factor of $D$. Moreover, we provide a matching lower bound: for any learning dynamic, there exists a payoff matrix such that the regret of every player is at least $\Omega(D + 1)$.


58
CONTEXTUAL RANKING AND MATCHING. OPTIMAL REGRET UNDER LST

Hafedh El Ferchichi ⋅ Vianney Perchet ⋅ Matthieu LERASLE

We address the problem of online matchmaking with contextual information. In each round, a perfect matching between a varying set of players -- with different strengths -- is selected, and the outcomes of the comparisons of the chosen pairs are observed. We assume that matching players incurs dissatisfaction proportional to the "strength gap", thereby incentivising the pairing of players with closely matched strengths. Additionally, we assume that the strength of each player can be inferred from some available contextual information through the contextualised linear stochastic transitivity model \textbf{(LST)}. We propose an algorithm that performs matchmaking by selecting pairs of maximum informativeness among admissible pairs and prove that its regret is optimal up to logarithmic factors.


59
Beyond the Ideal: Analyzing the Inexact Muon Update

Egor Shulgin ⋅ AlRashed ⋅ Peter Richtarik ⋅ Francesco Orabona

The Muon optimizer has rapidly emerged as a powerful, geometry-aware alternative to AdamW, demonstrating strong performance in large-scale training of neural networks. However, a critical theory-practice disconnect exists: Muon's efficiency relies on fast, approximate orthogonalization, while most theoretical analyses study idealized exact-SVD updates. This work moves beyond the ideal by providing a general analysis of the *inexact* orthogonalized update at Muon's core. We develop our analysis within the general framework of Linear Minimization Oracle (LMO)-based optimization, introducing a realistic additive error model to capture the inexactness of practical approximation schemes. Our analysis yields explicit bounds that quantify performance degradation as a function of the LMO inexactness/error, $\delta$. We reveal a fundamental coupling between this inexactness and the optimal step size and momentum: lower oracle precision requires a smaller step size but larger momentum parameter. These findings elevate the approximation procedure, such as the number of Newton-Schulz steps, from an implementation detail to a critical parameter that must be *co-tuned* with the learning schedule. NanoGPT experiments directly confirm the predicted coupling, with optimal learning rates clearly shifting as approximation precision changes.


6
Two mathematical models of knowledge distillation

Audrey Xie ⋅ Ludwig Schmidt ⋅ John Duchi

Many hypotheses compete to explain the successes of knowledge distillation. To help address this, we propose and analyze a mathematical model of distillation, which suggests that distillation's performance comes not from obtaining better models but from easier to optimize landscapes. For generalized linear models trained with stochastic gradient descent, we prove that distillation fits performant student models asymptotically more quickly than non-distilled models. In rank-1 matrix approximation, we characterize conditions on the target matrix under which gradient descent with distillation converges strictly faster than training on the supervised objective. The theory helps delineate the ways distillation provides benefits (i.e., in optimization speed, not in generalization), and experiments on real datasets corroborate the theoretical predictions.

High-dimensional and complex discrete distributions often exhibit multimodal behavior due to inherent discontinuities, posing significant challenges for sampling. Gradient-based discrete samplers, while effective, frequently become trapped in local modes when confronted with rugged or disconnected energy landscapes. This limitation makes it difficult for sampling methods to achieve adequate mixing and convergence in high-dimensional multimodal discrete spaces. To address these challenges, we propose Hyperbolic Secant-squared Gibbs-Sampling (HiSS), a novel family of sampling algorithms that integrates a Metropolis-within-Gibbs framework to enhance mixing efficiency. HiSS leverages a logistic convolution kernel to couple the discrete sampling variable with the continuous auxiliary variable in a joint distribution. This design ensures that the auxiliary variable encapsulates the true target distribution while facilitating easy transitions between distant and disconnected modes. We provide theoretical guarantees of convergence and demonstrate empirically that HiSS outperforms many popular alternatives on a wide variety of tasks, including Ising models, binary neural networks, and combinatorial optimization.

Disentanglement learning is central to understanding and reusing learned representations in variational autoencoders (VAEs). Although equivariance has been explored in this context, effectively exploiting it for disentanglement remains challenging. In this paper, we propose a novel method, called \textit{Multiple Invertible and Partial-Equivariant Transformation} (MIPE-Transformation), which integrates two main parts: (1) \textit{Invertible and Partial-Equivariant Transformation} (IPE-Transformation), guaranteeing an invertible latent-to–transformed-latent mapping while preserving partial input-to-latent equivariance in the transformed latent space; and (2) \textit{Exponential-Family Conversion} (EF-Conversion) to extend the standard Gaussian prior to an approximate exponential family via a learnable conversion. In experiments on the 3D Cars, 3D Shapes, and dSprites datasets, MIPE-Transformation improves the disentanglement performance of state-of-the-art VAEs.


62
Q-Learning with Shift-Aware Upper Confidence Bound in Non-Stationary Reinforcement Learning

Ha Manh Bui ⋅ Felix Parker ⋅ Kimia Ghobadi ⋅ Anqi Liu

We study the Non-Stationary Reinforcement Learning (RL) under distribution shifts in both finite-horizon episodic and infinite-horizon discounted Markov Decision Processes (MDPs). In the finite-horizon case, the transition functions may suddenly change at a particular episode. In the infinite-horizon setting, such changes can occur at an arbitrary time step during the agent's interaction with the environment. While the Q-learning Upper Confidence Bound algorithm (QUCB) can discover a proper policy during learning, due to the distribution shifts, this policy can exploit sub-optimal rewards after the shift happens. To address this issue, we propose Density-QUCB (DQUCB), a shift-aware Q-learning UCB algorithm, which uses a transition density function to detect distribution shifts, then leverages its likelihood to enhance the uncertainty estimation quality of Q-learning UCB, resulting in a balance between exploration and exploitation. Theoretically, we prove that our oracle DQUCB achieves a better regret guarantee than QUCB. Empirically, our DQUCB enjoys the computational efficiency of model-free RL and outperforms QUCB baselines by having a lower regret across RL tasks, as well as a COVID-19 patient hospital allocation task using a Deep-Q-learning architecture.


63
Counterfactually Fair Conformal Prediction

Ozgur Guldogan ⋅ Neeraj Sarna ⋅ Yuanyuan Li ⋅ Michael Berger

While counterfactual fairness of point predictors is well studied, its extension to prediction sets---central to fair decision-making under uncertainty---remains underexplored. On the other hand, conformal prediction (CP) provides efficient, distribution-free, finite-sample valid prediction sets, yet does not ensure counterfactual fairness. We close this gap by developing Counterfactually Fair Conformal Prediction (CF-CP) that produces counterfactually fair prediction sets. Through symmetrization of conformity scores across protected-attribute interventions, we prove that CF-CP results in counterfactually fair prediction sets while maintaining the marginal coverage property. Furthermore, we empirically demonstrate that on both synthetic and real datasets, across regression and classification tasks, CF-CP achieves the desired counterfactual fairness and meets the target coverage rate with minimal increase in prediction set size. CF-CP offers a simple, training-free route to counterfactually fair uncertainty quantification.

Harnessing the predictive capability of Markov process models requires propagating probability density functions (beliefs) through the model. For many existing models however, belief propagation is analytically infeasible, requiring approximation or sampling to generate predictions. This paper proposes a functional modeling framework leveraging sparse Sum-of-Squares forms for valid (conditional) density estimation. We show that such an architecture enables generalized simultaneous learning of basis functions and coefficients, while preserving analytical integrability. We study the theoretical underpinnings of the proposed model with respect to (i) representational capacity, (ii) analytical marginalization, and (iii) sparse parameter representation. In addition, we propose a training method that allows for exact adherence to the normalization and non-negativity constraints. Our results show that the proposed method achieves accuracy comparable to state-of-the-art approaches while requiring significantly less memory in low-dimensional spaces, and it further scales to 12D systems when existing methods fail beyond 2D.


65
Regularized $f-$Divergence Kernel Tests

Mónica Ribero ⋅ Antonin Schrab ⋅ Arthur Gretton

We propose a framework to construct practical kernel-based two-sample tests from the family of $f$-divergences. The test statistic is computed from the witness function of a regularized variational representation of the divergence, which we estimate using kernel methods. Aggregation is used to adapt the test over hyperparameters such as the kernel bandwidth and the regularization parameter. While our test covers a variety of $f$-divergences, we bring particular focus to the hockey-stick divergence, motivated by its applications to differential privacy auditing and machine unlearning evaluation. We provide theoretical guarantees for statistical test power across our family of $f-$divergence estimates. For two-sample testing, experiments demonstrate that different $f$-divergences are sensitive to different localized differences, illustrating the importance of leveraging diverse statistics. For machine unlearning, we propose a relative test that distinguishes true unlearning failures from safe distributional variations.

Approximate inference is central to Bayesian learning, with variational inference (VI) providing a scalable framework for posterior approximation. While mean-field VI often fails in high dimensions, the more refined Bethe approximation, equivalent to the Thouless-Anderson-Palmer (TAP) free energy in statistical physics, has long been conjectured to capture Bayes-optimal behavior. We prove that the TAP formula holds for Bayesian linear regression with a uniform spherical prior at all noise levels ($\Delta>0$), extending the result of Qiu and Sen (2022) in the high-noise regime. Our argument constructs a ridge regression functional that dominates the TAP free energy, yielding the first rigorous analysis of the global optimizer of the non-concave TAP functional for a planted inference model at an arbitrary noise level. This verifies that TAP, rather than mean-field, is the correct variational description in this setting.


67
Understanding SAM's Robustness to Noisy Labels through Gradient Down-weighting

Hoang-Chau Luong ⋅ Thuc Nguyen-Quang ⋅ Dat Tran ⋅ Minh-Triet Tran

Sharpness-Aware Minimization (SAM) was introduced to improve generalization by seeking flat minima, yet it also exhibits robustness to label noise, a phenomenon that remains only partially understood. Prior work has mainly attributed this effect to SAM’s tendency to prolong the learning of clean samples. In this work, we provide a complementary explanation by analyzing SAM at the element-wise level. We show that when noisy gradients dominate a parameter direction, their influence is reduced by the stronger amplification of clean gradients. This slows the memorization of noisy labels while sustaining clean learning, offering a more complete account of SAM’s robustness. Building on this insight, we propose SANER (Sharpness-Aware Noise-Explicit Reweighting), a simple variant of SAM that explicitly magnifies this down-weighting effect. Experiments on benchmark image classification tasks with noisy labels demonstrate that SANER significantly mitigates noisy-label memorization and improves generalization over both SAM and SGD. Moreover, since SANER is designed from the mechanism of SAM, it can also be seamlessly integrated into SAM-like variants, further boosting their robustness.

Transfer learning promises to reduce the high sample complexity of deep reinforcement learning (RL), yet existing methods struggle with domain shift between source and target environments. Policy distillation provides powerful tactical guidance but fails to transfer long-term strategic knowledge, while automaton-based methods capture task structure but lack fine-grained action guidance. We introduce Context-Aware Distillation with Experience-gated Transfer (CADENT), a framework that unifies strategic automaton-based knowledge with tactical policy-level knowledge into a coherent guidance signal. CADENT's key innovation is an experience-gated trust mechanism that dynamically weighs teacher guidance against the student's own experience at the state-action level, enabling graceful adaptation to target domain specifics. Across challenging environments, from sparse-reward grid worlds to continuous control tasks, CADENT achieves 40-60\% better sample efficiency than baselines while maintaining superior asymptotic performance, establishing a robust approach for adaptive knowledge transfer in RL.


69
Confidence-Guided Self-Training for Gradual Domain Adaptation

Akram Heidarizadeh ⋅ Akram Awad ⋅ HanQin Cai ⋅ George Atia

Domain adaptation addresses the challenge of distributional shift between a labeled source domain and an unlabeled target domain. In gradual domain adaptation (GDA), the shift is assumed to occur through a sequence of intermediate domains, enabling smoother adaptation. A popular approach in this setting is self-training, where a model iteratively generates pseudo-labels for unlabeled data. However, pseudo-labeling errors can accumulate across rounds, especially under large shift, undermining generalization. We develop a theoretical framework for self-training under gradual domain shift that explicitly quantifies and controls the pseudo-labeling error incurred at each round. Our first result is a modular generalization bound that decomposes the excess target risk into *coverage*, *pseudo-label error* $(\varepsilon_k)$ on the accepted set, domain shift, sample complexity, and regularization. Unlike prior bounds, our analysis separates the coverage penalty (due to rejecting inputs) from the pseudo-label error (controlled by confidence calibration or margin filtering, including Tsybakov-type noise via margin decay or calibration assumptions). We also provide the first theoretical justification for percentile (quantile) thresholding schemes used in practice: such schedules directly control coverage while tightening $\varepsilon_k$, yielding a principled coverage--noise tradeoff. Under mild conditions, both terms accumulate only logarithmically, leading to improved generalization. We validate these insights across multiple GDA benchmarks, using both observed and OT-generated intermediate domains.


7
Meta-probabilistic Modeling

Kevin Zhang ⋅ Yixin Wang

Probabilistic graphical models (PGMs) are widely used to discover latent structure in data, but their success hinges on selecting an appropriate model design. In practice, model specification is difficult and often requires iterative trial-and-error. This challenge arises because classical PGMs typically operate on individual datasets. In this work, we consider settings involving collections of related datasets and propose meta-probabilistic modeling (MPM) to learn the generative model structure itself. MPM uses a hierarchical formulation in which global components encode shared patterns across datasets, while local parameters capture dataset-specific latent structure. For scalable learning and inference, we derive a tractable VAE-inspired surrogate objective together with a bi-level optimization algorithm. Our methodology supports a broad class of expressive probabilistic models and has connections to existing architectures, such as Slot Attention. Experiments on object-centric representation learning and sequential text modeling demonstrate that MPM effectively adapts generative models to data while recovering meaningful latent representations.

We study fixed-confidence Best Arm Identification (BAI) in semiparametric bandits, where rewards are linear in arm features plus an unknown additive baseline shift. Unlike linear-bandit BAI, this setting requires orthogonalized regression, and its instance-optimal sample complexity has remained open. For the transductive setting, we establish an attainable instance-dependent lower bound characterized by the corresponding linear-bandit complexity on shifted features. We then propose a computationally efficient phase-elimination algorithm based on a new $\mathcal{X}\mathcal{Y}$-design for orthogonalized regression. Our analysis yields a nearly optimal high-probability sample-complexity upper bound, up to log factors and an additive $d^2$ term, and experiments on synthetic instances and the Jester dataset show clear gains over prior baselines.


71
Regression Descent: A Statistical Framework for Neural Network Optimization

Kamaljeet Singh ⋅ Nicolas Hengartner ⋅ Hao Zhang ⋅ Brian Bell ⋅ James Hyman

We present Regression Descent (RD), a novel optimization algorithm for training deep neural networks that reformulates each gradient step as a regression problem in the span of the Jacobian. By leveraging the implicit function theorem in over-parameterized settings where the number of parameters exceed observations $(p > n)$, we project the optimization onto an $n$-dimensional subspace, enabling the use of statistical techniques and potentially improved conditioning. Our key insight is that in the over-parameterized regime, meaningful parameter updates lie in the row space of the Jacobian matrix, allowing us to solve a lower-dimensional regression problem with explicit regularization control. We establish convergence guarantees for RD under standard smoothness assumptions, showing that it achieves a convergence rate of $O(1/k)$ for smooth non-convex objectives. The algorithm naturally handles the ill-conditioning common in neural network optimization through adaptive regularization and extends seamlessly to multi-output problems and mini-batch settings. Experimental results on Lorenz96, MNIST, and FMNIST datasets demonstrate that RD achieves up to 40\% faster convergence compared to SGD and Adam in terms of wall-clock time, with strong performance in the presence of activation function saturation. The computational overhead of solving $m \times m$ linear systems (where $m$ is the batch size) is offset by improved convergence properties and GPU-efficient operations. Our work opens new avenues for understanding neural network optimization through the lens of statistical regression, providing a practical algorithm for scenarios where standard gradient methods struggle.


72
SQuaT: Self-Supervised Knowledge Distillation via Student-Aware Quantized Teacher Features

Hyeon Jun Lee ⋅ Hyeonsik Jo ⋅ Jinwoo Chung ⋅ Jangho Kim

Quantization-Aware Training (QAT) enables the deployment of quantized models with minimal accuracy degradation. However, in practical scenarios, training labels are often unavailable due to privacy, copyright, or cost constraints. Knowledge Distillation (KD) is a common approach to address this challenge, but we observe that prior work combining QAT with KD suffers from a fundamental limitation: during distillation, the range mismatch between the teacher and the quantized student model induces an unattainable residual, resulting in an irreducible lower bound on the distillation loss. Motivated by this observation, we propose SQuaT (Student-Aware Quantized Teacher Features), a label-free QAT framework with KD that theoretically eliminates this lower bound by applying the student’s quantization parameters to quantize the teacher’s features during distillation. Through comprehensive experiments across diverse settings, we demonstrate that SQuaT consistently outperforms strong baselines, with particularly pronounced gains in extreme low-bit (e.g., 1- and 2-bit) settings. Furthermore, extensive evaluations across various model design choices show that our approach does not rely on specific architectural assumptions, making it broadly applicable across diverse architectures and quantization settings. The source code is available at https://github.com/lcdbsa522/SQuaT.

We show that minimizing the expected Wasserstein loss between empirical distributions can lead to biased parameter estimates in the finite-sample regime. Remarkably, such bias arises even in well-specified settings where both empirical distributions are drawn from the same parametric family: unlike maximum likelihood estimation—understood here as maximizing the expected log-likelihood—optimizing one parameter while fixing another fails to recover the true fixed value. We derive closed-form expressions for the expected Wasserstein loss in one dimension and, focusing on location–scale models, provide an analytic characterization of the bias. This analysis reveals that finite-sample bias occurs whenever the expected loss varies along the diagonal subspace where parameter values coincide, and we propose a simple correction scheme that removes this effect. We extend our analysis to misspecified models and the Sinkhorn divergence, demonstrating that finite-sample bias persists in more practical settings. Experiments on synthetic and real data confirm that stochastic optimization of Wasserstein-based objectives converges to biased solutions, and validate the effectiveness of the proposed correction scheme.


74
ZipMoE: A Theoretically-Grounded Mixture of Experts Approach forParameter-Efficient Deep Learning

Lin Chen ⋅ Kyriakos Axiotis ⋅ Gang Fu ⋅ Kaiyuan Wang ⋅ MohammadHossein Bateni ⋅ Vahab Mirrokni

The relentless growth of large language models (LLMs) presents formidable challenges for their training and deployment. To address this critical bottleneck, we introduce ZipMoE, a novel family of parameter-efficient building blocks inspired by the Mixture of Experts (MoE) paradigm. ZipMoE offers a modular and highly efficient substitute for conventional fully connected layers. We provide a rigorous theoretical analysis of ZipMoE's expressiveness, formally demonstrating its superior representational capacity over low-rank factorization. Furthermore, in a least squares regression setting, we prove that ZipMoE achieves a lower test error bound. Our empirical results—featuring comprehensive comparisons against low-rank, Monarch, and Kronecker methods—corroborate these theoretical findings. We demonstrate that ZipMoE consistently attains superior model quality under equivalent parameter or FLOP budgets, establishing it as a potent component for building efficient and powerful deep learning architectures.


75
It’s All In The (Exponential) Family: An Equivalence Between Maximum Likelihood Estimation and Control Variates For Sketching Algorithms

Keegan Kang ⋅ Kerong Wang ⋅ Ding Zhang ⋅ Rameshwar Pratap ⋅ Bhisham Verma ⋅ Benedict Wong

Maximum likelihood estimators (MLE) and control variate estimators (CVE) have been used in conjunction with known information across sketching algorithms and applications in machine learning. We prove that under certain conditions in an exponential family, an optimal CVE will achieve the same asymptotic variance as the MLE, giving an Expectation-Maximization (EM) algorithm for the MLE. Experiments show the EM algorithm is faster and numerically stable compared to other root finding algorithms for the MLE for the bivariate Normal distribution, and we expect this to hold across distributions satisfying these conditions. We show how the EM algorithm leads to reproducibility for algorithms using MLE / CVE, and demonstrate how the EM algorithm leads to finding the MLE when the CV weights are known.

We consider a class of optimization problems on the space of probability measures motivated by the mean-field approach to studying neural networks. Such problems can be solved by constructing continuous-time gradient flows that converge to the minimizer of the energy function under consideration, and then implementing discrete-time algorithms that approximate the flow. In this work, we focus on the Fisher-Rao gradient flow and we construct an interacting particle system that approximates the flow as its mean-field limit. We discuss the connection between the energy function, the gradient flow and the particle system and explain different approaches to smoothing out the energy function with an appropriate kernel in a way that allows for the particle system to be well-defined. We provide a rigorous proof of the existence and uniqueness of thus obtained kernelized flows, as well as a propagation of chaos result that provides a theoretical justification for using the corresponding kernelized particle systems as approximation algorithms in entropic mean-field optimization.

Breakthrough progress in vision-based navigation through unknown environments has been achieved by using multimodal large language models (MLLMs). These models can plan a sequence of motions by evaluating the current view at each time step against the task and goal given to the agent. However, current zero-shot Vision-and-Language Navigation (VLN) agents powered by MLLMs still tend to drift off course, halt prematurely, and achieve low overall success rates. We propose Three-Step Nav to counteract these failures with a three-view protocol: First, "look forward" to extract global landmarks and sketch a coarse plan. Then, "look now" to align the current visual observation with the next sub-goal for fine-grained guidance. Finally, "look backward" audits the entire trajectory to correct accumulated drift before stopping. Requiring no gradient updates or task-specific fine-tuning, our planner drops into existing VLN pipelines with minimal overhead. Three-Step Nav achieves state-of-the-art zero-shot performance on the R2R-CE and RxR-CE dataset. Our code is available at https://github.com/ZoeyZheng0/3-step-Nav.


78
Efficient Flow Matching using Latent Variables

Anirban Samaddar ⋅ Yixuan Sun ⋅ Viktor Nilsson ⋅ Sandeep Madireddy

Flow matching models have shown great potential in image generation tasks among probabilistic generative models. However, most flow matching models in the literature do not explicitly utilize the underlying clustering structure in the target data when learning the flow from a simple source distribution like the standard Gaussian. This leads to inefficient learning, especially for many high-dimensional real-world datasets, which often reside in a low-dimensional manifold. To this end, we present $\texttt{Latent-CFM}$, which provides efficient training strategies by conditioning on the features extracted from data using pretrained deep latent variable models. Through experiments on synthetic data from multi-modal distributions and widely used image benchmark datasets, we show that $\texttt{Latent-CFM}$ exhibits improved generation quality with significantly less training and computation than state-of-the-art flow matching models by adopting pretrained lightweight latent variable models. Beyond natural images, we consider generative modeling of spatial fields stemming from physical processes. Using a 2d Darcy flow dataset, we demonstrate that our approach generates more physically accurate samples than competing approaches. In addition, through latent space analysis, we demonstrate that our approach can be used for conditional image generation conditioned on latent features, which adds interpretability to the generation process.

Combining a Student-t Process prior with a Student-t likelihood yields a doubly robust regression model whose intractable posterior has prevented its practical use. We introduce the first tractable variational inference framework for this model. Leveraging the Student-t distribution's scale-mixture representation, we design a structured variational family that affords an analytic evidence lower bound. To overcome the non-conjugacy of this family, which precludes closed-form updates, we devise a novel projection-based optimization: we find the optimum in a simpler, factorized family and analytically project it back onto our structured one. The framework is extended to a scalable sparse, stochastic setting. Empirical results demonstrate strong performance, particularly in the full-batch setting, establishing this robust model as a practical and powerful tool.


8
An Illusion of Unlearning? Assessing Machine Unlearning Through Internal Representations

Yichen Gao ⋅ Altay Unal ⋅ Akshay Rangamani ⋅ Zhihui Zhu

While numerous machine unlearning (MU) methods have recently been developed with promising results in erasing the influence of forgotten data, classes, or concepts, they are also highly vulnerable—for example, simple fine-tuning can inadvertently reintroduce erased concepts. In this paper, we address this contradiction by examining the internal representations of unlearned models, in contrast to prior work that focuses primarily on output-level behavior. Our analysis shows that many state-of-the-art MU methods appear successful mainly due to a misalignment between last-layer features and the classifier—a phenomenon we call feature–classifier misalignment. In fact, hidden features remain highly discriminative, and simple linear probing can recover near-original accuracy. Assuming neural collapse in the original model, we further demonstrate that adjusting only the classifier can achieve negligible forget accuracy while preserving retain accuracy, and we corroborate this with experiments using classifier-only fine-tuning. Motivated by these findings, we propose MU methods based on a class-mean features (CMF) classifier, which explicitly enforces alignment between features and classifiers. Experiments on standard benchmarks show that CMF-based unlearning reduces forgotten information in representations while maintaining high retain accuracy, highlighting the need for faithful representation-level evaluation of MU.

Spatio-temporal event data---such as crime incidents or shared-mobility usage---are generated by human decisions. Yet most existing models focus on statistical dependencies in time and space, overlooking the cognitive and social factors that shape behavior. We argue that uncovering underlying preferences is essential, as they provide a structured link between observed event data and decision processes. We introduce a preference-driven framework that models event distributions through a two-stage ``consider--then--choose'' process: sparse gating captures limited attention, and utility functions guide selection within the consideration set. To capture heterogeneity, we employ a mixture-of-experts design that reveals distinct preference patterns across groups and contexts. The framework incorporates sparse structural design, and we analyze its theoretical properties by establishing approximation and generalization guarantees. Empirical studies on crime and bike-sharing datasets demonstrate competitive predictive accuracy while providing interpretable insights into behavioral drivers. By shifting the focus from counts to preferences, our approach offers a behaviorally grounded and socially meaningful perspective for modeling event data.

Supervised disentanglement, that is, learning interpretable nonlinear latent representations of a target data view informed by an auxiliary data view, is a central challenge in interpretable machine learning. We formulate this problem as a partially linear invertible canonical correlation analysis (PLiCCA). Specifically, given two data views, (i) complex data lying near a potentially high-dimensional manifold, and (ii) auxiliary high-dimensional multivariate data, PLiCCA learns latent variables for the complex view that are maximally correlated with sparse linear combinations of the auxiliary variables. In contrast to regression-based approaches to supervised disentanglement, the proposed method yields a latent embedding whose coordinates are explicitly ordered by their interpretability with respect to the auxiliary variables. We formalize the population PLiCCA problem and establish existence results. We then show a close theoretical connection between PLiCCA and conditional latent variable models, in particular conditional variational autoencoders and conditional normalizing flows, which enables practical estimation. We demonstrate our approach on brain imaging data, where PLiCCA is used to learn embeddings informed by auxiliary demographic, psychometric, and behavioral variables.


82
Exact Tensor Completion Beyond Isotropy and Invertibility

Li Ge ⋅ Lin Chen ⋅ Yudong Chen ⋅ Xue Jiang

In this work, a tensor completion problem is studied, which aims to perfectly recover the tensor from partial observations. The existing theoretical guarantee requires the involved transform to be orthogonal, which hinders its applications. In this paper, jumping out of the constraints of isotropy and invertibility for the first time, the theoretical guarantee of exact tensor completion with arbitrary linear transforms is established by directly operating the tensors in the transform domain. With the enriched choices of transforms, we theoretically disclose why slim transforms outperform their square counterparts, providing support for existing works on experimental excellence of slim transforms. Our model and analysis greatly enhance the flexibility of tensor completion, supported by extensive experimental results.


83
Learning Right Monotone Permutation Matrices for Neural Subsequence Search

Bhavya Kohli ⋅ Soutrik Sarangi ⋅ Aziz Shameem ⋅ Abir De

Subsequence retrieval seeks relevant segments in a large corpus given a short query. Existing pairwise metric-based methods are computationally intensive, hard to parallelize, and tied to domain-specific metrics. In this work, we introduce a neural framework that casts subsequence matching as end-to-end alignment with permutation matrices satisfying monotonicity used as differentiable approximate subsequence selectors. Our framework yields fixed-dimensional embeddings for variable-length inputs, and we prove these embeddings are compatible with standard Approximate Nearest Neighbor search methods such as Locality-sensitive hashing (LSH), enabling scalable retrieval. We also impose structural priors on admissible subsequences and integrate them directly into the scoring function. The approach is domain-agnostic and operates on pre-trained representations across modalities. Experiments on real-world datasets from two different domains show strong retrieval performance and substantial speedups, with high parallelism on GPU-accelerated hardware.


84
Non-Stationary Functional Bilevel Optimization

Jason Bohne ⋅ Ieva Petrulionytė ⋅ Michael Arbel ⋅ Julien Mairal ⋅ Pawel Polak

Functional bilevel optimization (FBO) provides a powerful framework for hierarchical learning in function spaces, yet current methods are limited to static offline settings and perform suboptimally in online, non-stationary scenarios. We propose SmoothFBO, the first algorithm for non-stationary FBO with both theoretical guarantees and practical scalability. SmoothFBO introduces a time-smoothed stochastic hypergradient estimator that reduces variance through a window parameter, enabling stable outer-loop updates with sublinear regret. Importantly, the classical parametric bilevel case is a special reduction of our framework, making SmoothFBO a natural extension to online, non-stationary settings. Empirically, SmoothFBO consistently outperforms existing FBO methods in non-stationary hyperparameter optimization and model-based reinforcement learning, demonstrating its practical effectiveness. Together, these results establish SmoothFBO as a general, theoretically grounded, and practically viable foundation for bilevel optimization in online, non-stationary scenarios.


85
Provably Efficient and Agile Randomized Q-Learning

He Wang ⋅ Xingyu Xu ⋅ Yuejie Chi

While Bayesian-based exploration often demonstrates superior empirical performance compared to bonus-based methods in model-based reinforcement learning (RL), its theoretical understanding remains limited for model-free settings. Existing provable algorithms either suffer from computational intractability or rely on stage-wise policy updates which reduce responsiveness and slow down the learning process. In this paper, we propose a novel variant of Q-learning algorithm, referred to as RandomizedQ, which integrates sampling-based exploration with agile, step-wise, policy updates, for episodic tabular RL. We establish a sublinear regret bound $\widetilde{O}(\sqrt{H^5SAT})$, where $S$ is the number of states, $A$ is the number of actions, $H$ is the episode length, and $T$ is the total number of episodes. In addition, we present a logarithmic regret bound $ O\left(\frac{H^6SA}{\Delta_{\min}}\log^5(SAHT)\right)$ when the optimal Q-function has a positive sub-optimality $\Delta_{\min}$. Empirically, RandomizedQ exhibits outstanding performance compared to existing Q-learning variants with both bonus-based and Bayesian-based exploration on standard benchmarks.


86
Monotone and Conservative Policy Iteration Beyond the Tabular Case

S R Eshwar ⋅ Gugan Chandrashekhar Mallika Thoppe ⋅ Ananyabrata Barua ⋅ Aditya Gopalan ⋅ Gal Dalal

We introduce Reliable Policy Iteration (RPI) and Conservative RPI (CRPI), variants of Policy Iteration (PI) and Conservative PI (CPI), that retain tabular guarantees under function approximation. RPI uses a novel Bellman-constrained optimization for policy evaluation. We show that RPI restores the textbook \textit{monotonicity} of value estimates and that these estimates provably \textit{lower-bound} the true return; moreover, their limit partially satisfies the \textit{unprojected} Bellman equation. CRPI shares RPI's evaluation, but updates policies conservatively by maximizing a new performance-difference \textit{lower bound} that explicitly accounts for function-approximation-induced errors. CRPI inherits RPI's guarantees and, crucially, admits per-step improvement bounds. In initial simulations, RPI and CRPI outperform PI and its variants. Our work addresses a foundational gap in RL: popular algorithms such as TRPO and PPO derive from tabular CPI yet are deployed with function approximation, where CPI's guarantees often fail-leading to divergence, oscillations, or convergence to suboptimal policies. By restoring PI/CPI-style guarantees for \textit{arbitrary} function classes, RPI and CRPI provide a principled basis for next-generation RL.


87
Retrieval Augmented Time Series Forecasting

Kutay Tire ⋅ Ege Onur Taga ⋅ Muhammed Emrullah Ildiz ⋅ Samet Oymak

Retrieval-augmented generation (RAG) is a central component of modern LLM systems, particularly in scenarios where up-to-date information is crucial for accurately responding to user queries or when queries exceed the scope of the training data. The advent of time-series foundation models (TSFM), such as Chronos or Moirai, and the need for effective zero-shot forecasting performance across various time-series domains motivates the question: Do the benefits of RAG similarly carry over to time series forecasting? In this paper, we advocate that the dynamic and event-driven nature of time-series data makes RAG a crucial component of TSFMs and introduce a principled RAG framework for time-series forecasting, called Retrieval Augmented Forecasting (RAF). Within RAF, we develop efficient strategies for retrieving related time-series examples and incorporating them into the forecast. Through experiments and mechanistic studies, we demonstrate that RAF improves the forecasting accuracy across diverse time series domains and TSFMs, with gains that are more pronounced for larger models.

Information-theoretic bounds on generalization are foundational to learning theory, yet their static form offers limited insight into the dynamic, iterative nature of modern optimization. This gap is addressed herein by developing a local theory of generalization based on Euclidean Information Theory, where each update is modeled as a perturbation vector. The resulting analysis shows that the change in the generalization gap is bounded by the expected squared norm of this vector, a quantity interpreted as the local generalization cost. The proposed local bound is proven to be the first-order approximation of classic global bounds, revealing their underlying differential structure. Within this framework, it is further revealed that the optimal update direction is mathematically equivalent to the natural gradient, offering an information-geometric justification for natural gradient descent. Finally, the overall theory is validated through experiments showing that the derived bound closely tracks training dynamics.


89
Breaking Data Symmetry is Needed For Generalization in Feature Learning Kernels

Marcel Bernal ⋅ Neil Mallinar ⋅ Mikhail Belkin

Grokking occurs when a model achieves high training accuracy but generalization to unseen test points happens long after that. This phenomenon was initially observed on a class of algebraic problems, such as learning modular arithmetic (Power et al., 2022). We study grokking on algebraic tasks in a class of feature learning kernels via the Recursive Feature Machine (RFM) algorithm (Radhakrishnan et al., 2024), which iteratively updates feature matrices through the Average Gradient Outer Product (AGOP) of an estimator in order to learn task-relevant features. Our main experimental finding is that generalization occurs only when a certain symmetry in the training set is broken. Furthermore, we empirically show that RFM generalizes by recovering the underlying invariance group action inherent in the data. We find that the learned feature matrices encode specific elements of the invariance group, explaining the dependence of generalization on symmetry.


9
Quantifying Epistemic Uncertainty in Diffusion Models

Aditi Gupta ⋅ Raphael Meyer ⋅ Yotam Yaniv ⋅ Elynn Chen ⋅ N. Benjamin Erichson

To ensure high quality outputs, it is important to quantify the epistemic uncertainty of diffusion models. Existing methods are often unreliable because they mix epistemic and aleatoric uncertainty. We introduce a method based on Fisher information that explicitly isolates epistemic variance, producing more reliable plausibility scores for generated data. To make this approach scalable, we propose FLARE (Fisher-Laplace Randomized Estimator), which approximates the Fisher information using a uniformly random subset of model parameters. Empirically, FLARE improves uncertainty estimation in synthetic time-series generation tasks, achieving more accurate and reliable filtering than other methods. Theoretically, we bound the convergence rate of our randomized approximation and provide analytic and empirical evidence that last-layer Laplace approximations are insufficient for this task.


90
Bayesian Fourier Features for Reduced Rank Gaussian Processes

Cristian A. Galvis-Florez ⋅ George Whittle ⋅ Michael A. Osborne ⋅ Simo Särkkä

Gaussian processes are probabilistic models used in machine learning and the physical sciences, although they are limited by cubic complexity in the number of training observations. To mitigate this problem, various low-rank kernel approximation methods, including Fourier feature methods, Hilbert space methods, and inducing point methods, have been developed. In this paper, we propose a novel Fourier feature approach leveraging Bayesian quadrature methods to construct reduced-rank approximations of the Gaussian process kernel. The new Bayesian Fourier feature framework also unifies many previously proposed low-rank methods, as they can be seen as instances of Bayesian quadrature-based approximations of Gaussian process kernels. Due to its probabilistic nature, the unified framework also enables the quantification of uncertainty in the approximation. Furthermore, the framework allows for the design of entirely new low-rank kernel approximations. We compare the performance of the proposed methods with other approaches across different kernel length scales. Our experimental results demonstrate that it outperforms other popular low-rank kernel approximation methods across a wide range of length scales.


91
PowerSoftmax: Towards Secure LLM Inference Over Encrypted Data

Itamar Zimerman ⋅ Allon Adir ⋅ Ehud Aharoni ⋅ Matan Avitan ⋅ Moran Baruch ⋅ Nir Drucker ⋅ Jenny Lerner ⋅ Ramy Masalha ⋅ Reut Moshe ⋅ Omri Soceanu

Modern cryptographic methods for implementing privacy-preserving LLMs such as HE require the LLMs to have a polynomial form. Forming such a representation is challenging because transformers include non-polynomial components, such as Softmax and layer normalization. Previous approaches have either directly approximated pre-trained models with large-degree polynomials, which are less efficient over HE, or replaced non-polynomial components with easier-to-approximate primitives before training, e.g., Softmax with pointwise attention. The latter approach might introduce scalability challenges. We present a new HE-friendly variant of self-attention that offers a stable form for training and is easy to approximate with polynomials for secure inference. Our work introduces the first polynomial LLMs over a billion parameters, exceeding the size of previous models by more than tenfold. The resulting models demonstrate reasoning and in-context learning (ICL) capabilities comparable to standard transformers of the same size, representing a breakthrough in the field. Finally, we provide a detailed latency breakdown for each computation over encrypted data, paving the way for further optimization, and explore the differences in inductive bias between models relying on our HE-friendly variant and standard transformers. Our code is attached as a supplement.


93
Adversary-Free Counterfactual Prediction via Information-Regularized Representations

Shiqin Tang ⋅ Rong Feng ⋅ Shuxin Zhuang ⋅ Youzhi Zhang ⋅ Hongzong LI

We study counterfactual prediction under assignment bias and propose a mathematically grounded, information-theoretic approach that removes treatment–covariate dependence without adversarial training. Starting from a bound that links the counterfactual–factual risk gap to mutual information, we learn a stochastic representation $Z$ that is predictive of outcomes while minimizing $I(Z;T)$. We derive a tractable variational objective that upper-bounds the information term and couples it with a supervised decoder, yielding a stable, provably motivated training criterion. The framework extends naturally to dynamic settings by applying the information penalty to sequential representations at each decision time. We evaluate the method on controlled numerical simulations and a real-world clinical dataset, comparing against recent state-of-the-art balancing, reweighting, and adversarial baselines. Across metrics of likelihood, counterfactual error, and policy evaluation, our approach performs favorably while avoiding the training instabilities and tuning burden of adversarial schemes.


94
Accelerating PDE Surrogates via RL-Guided Mesh Optimization

yang meng ⋅ Ruoxi Jiang ⋅ Zhuokai Zhao ⋅ Chong Liu ⋅ Rebecca Willett ⋅ Yuxin Chen

Deep learning–based surrogate models for parametric partial differential equations (PDEs) can deliver high-fidelity approximations but remain prohibitively data-hungry: training often requires thousands of fine-grid simulations, each incurring substantial computational cost. To address this challenge, we introduce RLMesh, an end-to-end framework for efficient surrogate training under limited simulation budget. The key idea is to use reinforcement learning (RL) to adaptively allocate mesh grid points non-uniformly within each simulation domain, focusing numerical resolution in regions most critical for accurate PDE solutions. A lightweight proxy model further accelerates RL training by providing efficient reward estimates without full surrogate retraining. Experiments on standard PDE benchmarks, including 1D Burgers’ equation and 2D Darcy flow, demonstrate that RLMesh achieves competitive accuracy to baselines but with substantially fewer simulation queries. These results show that solver-level spatial adaptivity can dramatically improve the efficiency of surrogate training pipelines, enabling practical deployment of learning-based PDE surrogates across a wide range of problems.

Locally stationary time series (LSTS) represent an essential modeling paradigm for capturing the nuanced dynamics inherent in time series data, whose statistical characteristics, including mean and variance, evolve smoothly over time. In this paper, we propose a conditional probability distribution estimator for LSTS through Nadaraya–Watson (NW) kernel smoothing. NW estimator leverages local kernel smoothing to approximate the conditional distribution of a response variable given its covariates. Under mild conditions, we establish optimal transport convergence guarantees to the proposed NW-based conditional probability estimator. These guarantees are initially proven in the univariate setting using the Wasserstein distance, and subsequently in a multivariate setting employing the sliced Wasserstein distance. To corroborate our theoretical findings, we conduct a wide range of numerical experiments to assess the convergence rates and showcase the practical relevance of the estimator in capturing intricate temporal dependencies in complex nonstationary phenomena.

As machine learning becomes deeply embedded in societal infrastructure, assessing the risks posed by these models has grown increasingly critical. Real-world deployment further complicates this assessment: model owners may apply strategic updates in response to dynamic environments (e.g., financial markets), potentially undermining key guarantees. We formalize this setting and address two goals: (i) accurately estimating a target auditing property-- such as group fairness-- using a minimal number of labeled samples; and (ii) characterizing the complexity of strategic updates by identifying the subset of admissible updates that preserve the property. To this end, we propose a generic algorithmic framework for efficient PAC auditing, powered by an Empirical Property Optimization (EPO) oracle. For statistical parity, we establish distribution-free audit bounds characterized by the SP dimension, a new combinatorial measure that captures the complexity of admissible strategic updates. Finally, we show that our framework naturally extends to other properties, including prediction error and robust risk.


98
Unsupervised Ensemble Learning Through Deep Energy-based Models

Ariel Maymon ⋅ Yanir Buznah ⋅ Uri Shaham

Unsupervised ensemble learning emerged to address the challenge of combining multiple learners' predictions without access to ground truth labels or additional data. This paradigm is crucial in scenarios where evaluating individual classifier performance or understanding their strengths is challenging due to limited information. We propose a novel deep energy-based method for constructing an accurate meta-learner using only the predictions of individual learners, potentially capable of capturing complex dependence structures between them. Our approach requires no labeled data, learner features, or problem-specific information, and has theoretical guarantees for when learners are conditionally independent. We demonstrate superior performance across diverse ensemble scenarios, including challenging mixture of experts settings. Our experiments span standard ensemble datasets and curated datasets designed to test how the model fuses expertise from multiple sources. These results highlight the potential of unsupervised ensemble learning to harness collective intelligence, especially in data-scarce or privacy-sensitive environments.


99
Value Gradient Sampler: Learning Invariant Value Functions for Equivariant Diffusion Sampling

Himchan Hwang ⋅ Hyeokju Jeong ⋅ Dong Shin ⋅ Che-Sang Park ⋅ Sehee Kweon ⋅ Sangwoong Yoon ⋅ Frank Park

We propose the Value Gradient Sampler (VGS), a diffusion sampler parameterized by value functions. VGS generates samples from an unnormalized target density (i.e., energy) by evolving randomly initialized particles along the gradient of the value function. In many sampling problems where the target density exhibits invariant symmetries, value functions provide a novel approach to leveraging invariant networks for sampling by inducing an equivariant gradient flow, without requiring more complex equivariant networks. The value networks are trained via temporal difference learning, which supports off-policy training and other established reinforcement learning (RL) techniques. By combining advanced RL methods with efficient invariant networks, VGS achieves both the highest sample quality and the fastest sampling speed among our baselines on the 55-particle Lennard-Jones system.