Poster
Keyue Jiang · Bohan Tang · Xiaowen Dong · Laura Toni
[ Hall A-E ]
Abstract
Inferring the graph structure from observed data is a key task in graph machine learning to capture the intrinsic relationships between data entities. While significant advancements have been made in learning the structure of homogeneous graphs, many real-world graphs exhibit heterogeneous patterns where nodes and edges have multiple types. This paper fills this gap by introducing the first approach for heterogeneous graph structure learning (HGSL). To this end, we first propose a novel statistical model for the data-generating process (DGP) of heterogeneous graph data, namely hidden Markov networks for heterogeneous graphs (H2MN). Then we formalize HGSL as a maximum a posteriori estimation problem parameterized by this DGP and derive an alternating optimization method to obtain a solution, together with a theoretical justification of the optimization conditions. Finally, we conduct extensive experiments on both synthetic and real-world datasets to demonstrate that our proposed method excels in learning structure on heterogeneous graphs in terms of edge type identification and edge weight recovery.
Poster
Albert Tseng · Tao Yu · Youngsuk Park
[ Hall A-E ]
Abstract
Low precision (LP) datatypes such as MXFP4 can accelerate matrix multiplications (GEMMs) and reduce training costs. However, directly using MXFP4 instead of BF16 during training significantly degrades model quality. In this work, we present the first near-lossless training recipe that uses MXFP4 GEMMs, which are $2\times$ faster than FP8 on supported hardware. Our key insight is to compute unbiased gradient estimates with stochastic rounding (SR), resulting in more accurate model updates. However, directly applying SR to MXFP4 can result in high variance from block-level outliers, harming convergence. To overcome this, we use the random Hadamard transform to theoretically bound the variance of SR. We train GPT models up to 6.7B parameters and find that our method induces minimal degradation over mixed-precision BF16 training. Our recipe computes $>1/2$ the training FLOPs in MXFP4, enabling an estimated speedup of $>1.3\times$ over FP8 and $>1.7\times$ over BF16 during backpropagation.
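The stochastic rounding at the heart of this recipe can be sketched in a few lines (a minimal NumPy illustration of unbiased rounding to a uniform grid; the paper's actual MXFP4 datatype uses block-scaled floating-point levels plus a Hadamard transform):

```python
import numpy as np

def stochastic_round(x, step=1.0, rng=None):
    """Round each value to a multiple of `step`, choosing the upper
    neighbor with probability equal to the fractional part, so that
    E[round(x)] == x (unbiased, unlike round-to-nearest)."""
    rng = np.random.default_rng() if rng is None else rng
    scaled = np.asarray(x, dtype=np.float64) / step
    lower = np.floor(scaled)
    prob_up = scaled - lower                  # fractional part = P(round up)
    up = rng.random(scaled.shape) < prob_up
    return (lower + up) * step

rng = np.random.default_rng(0)
samples = stochastic_round(np.full(100_000, 0.3), step=1.0, rng=rng)
print(samples.mean())  # close to 0.3 on average, despite rounding to {0, 1}
```

Round-to-nearest would map every 0.3 to 0.0 and systematically bias gradient accumulation; the stochastic version preserves the mean.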
Poster
Kaan Ozkara · Tao Yu · Youngsuk Park
[ Hall A-E ]
Abstract
As the parameters of Large Language Models (LLMs) have scaled to hundreds of billions, the demand for efficient training methods—balancing faster computation and reduced memory usage without sacrificing accuracy—has become more critical than ever. In recent years, various mixed precision strategies, which involve different precision levels for optimization components, have been proposed to increase training speed with minimal accuracy degradation. However, these strategies often require manual adjustments and lack theoretical justification. In this work, we leverage stochastic rounding (SR) to address the numerical errors of training with low-precision representations. We provide theoretical analyses of implicit regularization and convergence under the Adam optimizer when SR is utilized. With the insights from these analyses, we extend the previous BF16 + SR strategy to distributed settings, enhancing the stability and performance of large-scale training. Empirical results from pre-training models with up to 6.7B parameters, for the first time, demonstrate that our BF16-with-SR strategy outperforms (BF16, FP32) mixed precision strategies, achieving better validation perplexity, up to 1.54$\times$ higher throughput, and 30\% lower memory usage.
Poster
Dun Zeng · Zenglin Xu · Shiyu Liu · Yu Pan · Qifan Wang · Xiaoying Tang
[ Hall A-E ]
Abstract
Federated averaging (FedAvg) is the most fundamental algorithm in federated learning (FL). Previous theoretical results assert that the convergence and generalization of FedAvg degrade under heterogeneous clients. However, recent empirical results show that FedAvg can perform well in many real-world heterogeneous tasks. These results reveal an inconsistency between FL theory and practice that is not fully explained. In this paper, we show that common heterogeneity measures contribute to this inconsistency, based on rigorous convergence analysis. Furthermore, we introduce a new measure, \textit{client consensus dynamics}, and prove that \textit{FedAvg can effectively handle client heterogeneity when an appropriate aggregation strategy is used}. Building on this theoretical insight, we present a simple and effective FedAvg variant termed FedAWARE. Extensive experiments on three datasets and two modern neural network architectures demonstrate that FedAWARE ensures faster convergence and better generalization in heterogeneous client settings. Moreover, our results show that FedAWARE can significantly enhance the generalization performance of advanced FL algorithms when used as a plug-in module.
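For reference, the baseline FedAvg aggregation the paper builds on is simply a dataset-size-weighted average of client parameters (a minimal sketch; FedAWARE's aggregation strategy differs):

```python
import numpy as np

def fedavg_aggregate(client_params, client_sizes):
    """Standard FedAvg server step: weighted average of client
    parameter vectors, weights proportional to local dataset sizes."""
    weights = np.asarray(client_sizes, dtype=float)
    weights /= weights.sum()
    return np.average(np.stack(client_params), axis=0, weights=weights)

# Two heterogeneous clients: the larger one pulls the average toward it.
global_params = fedavg_aggregate(
    [np.array([1.0, 0.0]), np.array([0.0, 1.0])],
    client_sizes=[30, 10],
)
print(global_params)  # [0.75 0.25]
```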
Poster
Ignavier Ng · Shaoan Xie · Xinshuai Dong · Peter Spirtes · Kun Zhang
[ Hall A-E ]
Abstract
Causal representation learning aims to recover the latent causal variables and their causal relations, typically represented by directed acyclic graphs (DAGs), from low-level observations such as image pixels. A prevailing line of research exploits multiple environments and relies on assumptions about how data distributions change, such as single-node interventions, coupled interventions, or hard interventions, or on parametric constraints on the mixing function or the latent causal model, such as linearity. Despite the novelty and elegance of the resulting theory, these assumptions are often violated in real problems. Accordingly, we formalize a set of desiderata for causal representation learning that applies to a broader class of environments, referred to as general environments. Interestingly, we show that one can fully recover the latent DAG and identify the latent variables up to minor indeterminacies under a nonparametric mixing function and nonlinear latent causal models, such as additive (Gaussian) noise models or heteroscedastic noise models, by properly leveraging sufficient change conditions on the causal mechanisms up to third-order derivatives. These represent, to our knowledge, the first results to fully recover the latent DAG from general environments under nonparametric mixing. Notably, our results are stronger than many existing works, but require less restrictive assumptions about changing environments.
Poster
Uday Kiran Reddy Tadipatri · Ben Haeffele · Joshua Agterberg · Rene Vidal
[ Hall A-E ]
Abstract
We propose a general framework for deriving generalization bounds for parallel positively homogeneous neural networks--a class of neural networks whose input-output map decomposes as the sum of positively homogeneous maps. Examples of such networks include matrix factorization and sensing, single-layer multi-head attention mechanisms, tensor factorization, deep linear and ReLU networks, and more. Our general framework is based on linking the non-convex empirical risk minimization (ERM) problem to a closely related convex optimization problem over prediction functions, which provides a global, achievable lower bound to the ERM problem. We exploit this convex lower bound to perform generalization analysis in the convex space while controlling the discrepancy between the convex model and its non-convex counterpart. We apply our general framework to a wide variety of models, ranging from low-rank matrix sensing to structured matrix sensing, two-layer linear networks, two-layer ReLU networks, and single-layer multi-head attention mechanisms, achieving generalization bounds with a sample complexity that scales almost linearly with the network width.
Poster
Oğuz Kaan Yüksel · Nicolas Flammarion
[ Hall A-E ]
Abstract
Next-token prediction with cross-entropy loss is the objective of choice in sequence and language modeling. Despite its widespread use, there is a lack of theoretical analysis regarding the generalization of models trained using this objective. In this work, we provide an analysis of empirical risk minimization for sequential inputs generated by order-$k$ Markov chains. Assuming bounded and Lipschitz logit functions, our results show that in-sample prediction error decays optimally with the number of tokens, whereas out-of-sample error incurs an additional term related to the mixing properties of the Markov chain. These rates depend on the statistical complexity of the hypothesis class and can lead to generalization errors that do not scale exponentially with the order of the Markov chain---unlike classical $k$-gram estimators. Finally, we discuss the possibility of achieving generalization rates independent of mixing.
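The classical $k$-gram estimator mentioned above can be sketched as conditional next-token frequencies given the previous $k$ tokens (a minimal illustration, not the paper's ERM setup):

```python
from collections import Counter, defaultdict

def kgram_estimator(tokens, k):
    """Classical k-gram (order-k Markov) estimator: empirical
    next-token distribution conditioned on the previous k tokens."""
    counts = defaultdict(Counter)
    for i in range(k, len(tokens)):
        counts[tuple(tokens[i - k:i])][tokens[i]] += 1
    return {ctx: {tok: c / sum(nxt.values()) for tok, c in nxt.items()}
            for ctx, nxt in counts.items()}

# A deterministic order-1 chain alternating a -> b -> a -> ...
probs = kgram_estimator("abababab", k=1)
print(probs[("a",)])  # {'b': 1.0}
```

Note how the table is indexed by all observed length-$k$ contexts; its size can grow exponentially in $k$, which is exactly the regime where the abstract's contrast with classical $k$-gram estimators matters.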
Poster
Daiqi Gao · Hsin-Yu Lai · Predrag Klasnja · Susan Murphy
[ Hall A-E ]
Abstract
We consider reinforcement learning (RL) for a class of problems with bagged decision times. A bag contains a finite sequence of consecutive decision times. The transition dynamics are non-Markovian and non-stationary within a bag. All actions within a bag jointly impact a single reward, observed at the end of the bag. For example, in mobile health, multiple activity suggestions in a day collectively affect a user's daily commitment to being active. Our goal is to develop an online RL algorithm to maximize the discounted sum of the bag-specific rewards. To handle non-Markovian transitions within a bag, we utilize an expert-provided causal directed acyclic graph (DAG). Based on the DAG, we construct states as a dynamical Bayesian sufficient statistic of the observed history, which results in Markov state transitions within and across bags. We then formulate this problem as a periodic Markov decision process (MDP) that allows non-stationarity within a period. An online RL algorithm based on Bellman equations for stationary MDPs is generalized to handle periodic MDPs. We show that our constructed state achieves the maximal optimal value function among all state constructions for a periodic MDP. Finally, we evaluate the proposed method on testbed variants built from real data …
Poster
Fanqi Yan · Huy Nguyen · Le Dung · Pedram Akbarian Saravi · Nhat Ho
[ Hall A-E ]
Abstract
We conduct a convergence analysis of parameter estimation in the contaminated mixture of experts. This model is motivated by the prompt learning problem, where one utilizes prompts, which can be formulated as experts, to fine-tune a large-scale pre-trained model for learning downstream tasks. Two fundamental challenges emerge from the analysis: (i) the proportion of the prompt in the mixture with the pre-trained model may converge to zero during training, leading to a prompt vanishing issue; (ii) the algebraic interaction among parameters of the pre-trained model and the prompt, which can occur via certain partial differential equations, can decelerate prompt learning. In response, we introduce a distinguishability condition to control this parameter interaction. Additionally, we investigate various types of expert structures to understand their effects on the convergence behavior of parameter estimation. In each scenario, we provide comprehensive convergence rates of parameter estimation along with the corresponding minimax lower bounds. Finally, we run several numerical experiments to empirically justify our theoretical findings.
Poster
Ziyu Wang · Chris Holmes
[ Hall A-E ]
Abstract
Applications of large language models often involve the generation of free-form responses, in which case uncertainty quantification becomes challenging. This is due to the need to identify task-specific uncertainties (e.g., about the semantics), which appear difficult to define in general. This work addresses these challenges from the perspective of Bayesian decision theory, starting from the assumption that our utility is characterized by a similarity measure that compares a generated response with a hypothetical true response. We discuss how this assumption enables principled quantification of the model's subjective uncertainty and its calibration. We further derive a measure of epistemic uncertainty, based on a missing-data perspective and its characterization as an excess risk. The proposed methods can be applied to black-box language models. We illustrate the methods on question answering and machine translation tasks. Our experiments provide a principled evaluation of task-specific calibration, and demonstrate that epistemic uncertainty offers a promising deferral strategy for efficient data acquisition in in-context learning.
Poster
Xu Cai · Jonathan Scarlett
[ Hall A-E ]
Abstract
The optimization of black-box functions with noisy observations is a fundamental problem with widespread applications, and has been widely studied under the assumption that the function lies in a reproducing kernel Hilbert space (RKHS). This problem has been studied extensively in the stationary setting, and near-optimal regret bounds are known via developments in both upper and lower bounds. In this paper, we consider non-stationary scenarios, which are crucial for certain applications but are currently less well understood. Specifically, we provide the first algorithm-independent lower bounds, where the time variations are subject to a total variation budget with respect to some function norm. Under $\ell_{\infty}$-norm variations, our bounds are found to be close to an existing upper bound (Hong et al., 2023). Under RKHS norm variations, the upper and lower bounds are still reasonably close but with more of a gap, raising the interesting open question of whether significant improvements in the upper bound are possible.
Poster
Jung Yeon Park · Sujay Bhatt · Sihan Zeng · Lawson Wong · Alec Koppel · Sumitra Ganesh · Robin Walters
[ Hall A-E ]
Abstract
Equivariant neural networks have shown great success in reinforcement learning, improving sample efficiency and generalization when there is symmetry in the task. However, in many problems, only approximate symmetry is present, which makes imposing exact symmetry inappropriate. Recently, approximately equivariant networks have been proposed for supervised classification and modeling physical systems. In this work, we develop approximately equivariant algorithms in reinforcement learning (RL). We define approximately equivariant MDPs and theoretically characterize the effect of approximate equivariance on the optimal Q function. We propose novel RL architectures using relaxed group and steerable convolutions and experiment on several continuous control domains and stock trading with real financial data. Our results demonstrate that the approximately equivariant network performs on par with exactly equivariant networks when exact symmetries are present, and outperforms them when the domains exhibit approximate symmetry. As an added byproduct of these techniques, we observe increased robustness to noise at test time. Our code is available at https://github.com/jypark0/approx_equiv_rl.
Poster
Simon Vary · David Martínez-Rubio · Patrick Rebeschini
[ Hall A-E ]
Abstract
We study first-order algorithms that are uniformly stable for empirical risk minimization (ERM) problems that are convex and smooth with respect to $p$-norms, $p \geq 1$. We propose a black-box reduction method that, by employing properties of uniformly convex regularizers, turns an optimization algorithm for Hölder smooth convex losses into a uniformly stable learning algorithm with optimal statistical risk bounds on the excess risk, up to a constant factor depending on $p$. Achieving a black-box reduction for uniform stability was posed as an open question by Attia and Koren (2022), who solved the Euclidean case $p=2$. We explore applications that leverage non-Euclidean geometry in addressing binary classification problems.
Poster
Dheeraj Baby · Boran Han · Shuai Zhang · Cuixiong Hu · Bernie Wang · Yu-Xiang Wang
[ Hall A-E ]
Abstract
We study the well-motivated problem of online distribution shift in which the data arrive in batches and the distribution of each batch can change arbitrarily over time. Since the shifts can be large or small, abrupt or gradual, the length of the relevant historical data to learn from may vary over time, which poses a major challenge in designing algorithms that can automatically adapt to the best "attention span'' while remaining computationally efficient. We propose a meta-algorithm that takes any network architecture and any Online Learner (OL) algorithm as input and produces a new algorithm which provably enhances the performance of the given OL under non-stationarity. Our algorithm is efficient (it requires maintaining only $O(\log T)$ OL instances) and adaptive (it automatically chooses OL instances with the ideal "attention'' length at every timestamp). Experiments on various real-world datasets across text and image modalities show that our method consistently improves the accuracy of user specified OL algorithms for classification tasks. Key novel algorithmic ingredients include a multi-resolution instance design inspired by wavelet theory and a cross-validation-through-time technique. Both could be of independent interest.
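The $O(\log T)$ instance count can be pictured as one online learner per geometric lookback length (a hypothetical sketch; the paper's actual multi-resolution design is wavelet-inspired):

```python
import math

def lookback_windows(T):
    """One candidate 'attention span' per scale: geometric lengths
    1, 2, 4, ..., so only O(log T) online-learner instances need to
    be maintained over a horizon of T timestamps.
    (Hypothetical sketch of the multi-resolution instance design.)"""
    return [2 ** k for k in range(int(math.log2(T)) + 1)]

print(lookback_windows(100))  # [1, 2, 4, 8, 16, 32, 64]
```

A meta-algorithm can then weight these candidates online, so whichever span matches the current rate of drift dominates the prediction.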
Poster
Wonyoung Kim · Garud Iyengar · Assaf Zeevi
[ Hall A-E ]
Abstract
We consider Pareto front identification~(PFI) for linear bandits (PFILin), i.e., the goal is to identify a set of arms with undominated mean reward vectors when the mean reward vector is a linear function of the context. PFILin includes the best arm identification problem and multi-objective active learning as special cases. The sample complexity of our proposed algorithm is optimal up to a logarithmic factor. In addition, the regret incurred by our algorithm during the estimation is within a logarithmic factor of the optimal regret among all algorithms that identify the Pareto front. Our key contribution is a new estimator that in every round updates the estimate for the unknown parameter along \emph{multiple} context directions -- in contrast to the conventional estimator that only updates the parameter estimate along the chosen context. This allows us to use low-regret arms to collect information about Pareto optimal arms. Our key innovation is to reuse the exploration samples multiple times, in contrast to conventional estimators that use each sample only once. Numerical experiments demonstrate that the proposed algorithm successfully identifies the Pareto front while controlling the regret.
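The notion of an undominated arm can be made concrete with a plain Pareto-front computation on known mean vectors (an illustrative sketch; the algorithm itself must identify this front from noisy linear feedback):

```python
import numpy as np

def pareto_front(mean_rewards):
    """Indices of undominated arms: arm i is dominated if some arm j
    is at least as good in every objective and strictly better in one."""
    mu = np.asarray(mean_rewards, dtype=float)
    front = []
    for i in range(len(mu)):
        dominated = any(
            np.all(mu[j] >= mu[i]) and np.any(mu[j] > mu[i])
            for j in range(len(mu)) if j != i
        )
        if not dominated:
            front.append(i)
    return front

arms = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9], [0.4, 0.4]]
print(pareto_front(arms))  # [0, 1, 2] -- arm 3 is dominated by arm 1
```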
Poster
Marvin Pförtner · Jonathan Wenger · Jon Cockayne · Philipp Hennig
[ Hall A-E ]
Abstract
Kalman filtering and smoothing are the foundational mechanisms for efficient inference in Gauss-Markov models. However, their time and memory complexities scale prohibitively with the size of the state space. This is particularly problematic in spatiotemporal regression problems, where the state dimension scales with the number of spatial observations. Existing approximate frameworks leverage low-rank approximations of the covariance matrix. But since they do not model the error introduced by the computational approximation, their predictive uncertainty estimates can be overly optimistic. In this work, we propose a probabilistic numerical method for inference in high-dimensional Gauss-Markov models which mitigates these scaling issues. Our matrix-free iterative algorithm leverages GPU acceleration and crucially enables a tunable trade-off between computational cost and predictive uncertainty. Finally, we demonstrate the scalability of our method on a large-scale climate dataset.
Poster
Yujia Zheng · Yang Liu · Jiaxiong Yao · Yingyao Hu · Kun Zhang
[ Hall A-E ]
Abstract
Nearly all identifiability results in unsupervised representation learning, inspired by, e.g., independent component analysis, factor analysis, and causal representation learning, rely on assumptions of additive independent noise or noiseless regimes. In contrast, we study the more general case where noise can take arbitrary forms, depend on latent variables, and be non-invertibly entangled within a nonlinear function. We propose a general framework for identifying latent variables in nonparametric noisy settings. We first show that, under suitable conditions, the generative model is identifiable up to certain submanifold indeterminacies even in the presence of non-negligible noise. Furthermore, under structural or distributional variability conditions, we prove that latent variables of general nonlinear models are identifiable up to trivial indeterminacies. Based on the proposed theoretical framework, we have also developed corresponding estimation methods and validated them in various synthetic and real-world settings. Interestingly, our estimate of true GDP growth from alternative measurements suggests more insightful information about the economies than official reports. We expect our framework to provide new insight into how both researchers and practitioners deal with latent variables in real-world scenarios.
Poster
Juan Ramirez · Ignacio Hounie · Juan Elenter · Jose Gallego-Posada · Meraj Hashemizadeh · Alejandro Ribeiro · Simon Lacoste-Julien
[ Hall A-E ]
Abstract
We introduce Feasible Learning (FL), a sample-centric learning paradigm where models are trained by solving a feasibility problem that bounds the loss for each training sample. In contrast to the ubiquitous Empirical Risk Minimization (ERM) framework, which optimizes for average performance, FL demands satisfactory performance *on every individual data point*. Since any model that meets the prescribed performance threshold is a valid FL solution, the choice of optimization algorithm and its dynamics play a crucial role in shaping the properties of the resulting solutions. In particular, we study a primal-dual approach which dynamically re-weights the importance of each sample during training. To address the challenge of setting a meaningful threshold in practice, we introduce a relaxation of FL that incorporates slack variables of minimal norm. Our empirical analysis, spanning image classification, age regression, and preference optimization in large language models, demonstrates that models trained via FL can learn from data while displaying improved tail behavior compared to ERM, with only a marginal impact on average performance.
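The primal-dual dynamics can be sketched on a toy linear-regression instance (a hypothetical example: per-sample squared-loss constraints, with multiplier weights normalized here for step-size stability; not the paper's implementation):

```python
import numpy as np

def feasible_learning(X, y, eps=0.05, lr=0.05, dual_lr=0.1, steps=3000):
    """Primal-dual sketch of Feasible Learning for linear regression:
    instead of minimizing the average loss, require each per-sample
    squared loss to fall below eps. Each sample keeps a multiplier
    lam_i that grows while its constraint is violated and re-weights
    the primal gradient."""
    theta = np.zeros(X.shape[1])
    lam = np.ones(len(y))
    for _ in range(steps):
        resid = X @ theta - y
        losses = resid ** 2
        w = lam / max(lam.sum(), 1e-12)           # normalized weights
        theta -= lr * (X.T @ (2 * w * resid))     # primal descent
        lam = np.maximum(lam + dual_lr * (losses - eps), 0.0)  # dual ascent
    return theta, (X @ theta - y) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                         # realizable, so FL is feasible
theta, final_losses = feasible_learning(X, y)
print(final_losses.max())  # every per-sample constraint is satisfied
```

Samples with large violations accumulate large multipliers and dominate the update, which is the dynamic re-weighting the abstract describes.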
Poster
Xutong Zhao · Yaqi Xie
[ Hall A-E ]
Abstract
Cooperative multi-agent reinforcement learning (MARL) aims to coordinate multiple agents to achieve a common goal. A key challenge in MARL is credit assignment, which involves assessing each agent's contribution to the shared reward. Given the diversity of tasks, agents may perform different types of coordination, with rewards attributed to diverse and often overlapping agent subsets. In this work, we formalize the credit assignment level as the number of agents cooperating to obtain a reward, and address scenarios with multiple coexisting levels. We introduce a multi-level advantage formulation that performs explicit counterfactual reasoning to infer credits across distinct levels. Our method, Multi-level Advantage Credit Assignment (MACA), captures agent contributions at multiple levels by integrating advantage functions that reason about individual, joint, and correlated actions. Utilizing an attention-based framework, MACA identifies correlated agent relationships and constructs multi-level advantages to guide policy learning. Comprehensive experiments on challenging StarCraft v1 and v2 tasks demonstrate MACA's superior performance, underscoring its efficacy in complex credit assignment scenarios.
Poster
Michael Ito · Danai Koutra · Jenna Wiens
[ Hall A-E ]
Abstract
Homophily, as a measure, has been critical to increasing our understanding of graph neural networks (GNNs). However, to date this measure has only been analyzed in the context of static graphs. In our work, we explore homophily in dynamic settings. Focusing on graph convolutional networks (GCNs), we demonstrate theoretically that in dynamic settings, current GCN discriminative performance is characterized by the probability that a node's future label is the same as its neighbors' current labels. Based on this insight, we propose dynamic homophily, a new measure of homophily that applies in the dynamic setting. This new measure correlates with GNN discriminative performance and sheds light on how to potentially design more powerful GNNs for dynamic graphs. Leveraging a variety of dynamic node classification datasets, we demonstrate that popular GNNs are not robust to low dynamic homophily. Going forward, our work represents an important step towards understanding homophily and GNN performance in dynamic node classification.
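One way to picture such a measure (a hypothetical simplification, not the paper's exact definition) is to count how often a node's next label matches its neighbors' current labels:

```python
def dynamic_homophily(adj, labels_t, labels_next):
    """Fraction of directed edges (u, v) for which u's label at time
    t+1 matches v's label at time t. (A hypothetical simplification
    of a dynamic homophily measure for illustration.)"""
    matches = total = 0
    for u, row in enumerate(adj):
        for v, connected in enumerate(row):
            if connected:
                matches += labels_next[u] == labels_t[v]
                total += 1
    return matches / total

# Two connected nodes whose labels swap every step: statically
# heterophilous, yet perfectly dynamically homophilous.
adj = [[0, 1], [1, 0]]
print(dynamic_homophily(adj, labels_t=[0, 1], labels_next=[1, 0]))  # 1.0
```

The example shows why a static homophily score can be misleading for dynamic node classification: the graph looks heterophilous at any single snapshot, yet neighbors perfectly predict future labels.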
Poster
Alessio Russo · Alberto Maria Metelli · Marcello Restelli
[ Hall A-E ]
Abstract
We tackle average-reward infinite-horizon POMDPs with an unknown transition model but a known observation model, a setting that has previously been addressed in two limiting ways: (i) frequentist methods relying on suboptimal stochastic policies that assign a minimum probability to each action, and (ii) Bayesian approaches employing the optimal policy class but requiring strong assumptions about the consistency of the employed estimators. Our work removes these limitations by proving convenient estimation guarantees for the transition model and introducing an optimistic algorithm that leverages the optimal class of deterministic belief-based policies. We introduce modifications to existing estimation techniques, providing theoretical guarantees separately for each estimated action transition matrix. Unlike existing estimation methods, which are unable to use samples from different policies, we present a novel and simple estimator that overcomes this barrier. This new data-efficient technique, combined with the proposed $\textit{Action-wise OAS-UCRL}$ algorithm and a tighter theoretical analysis, leads to the first approach enjoying a regret guarantee of order $\mathcal{O}(\sqrt{T \log T})$ when compared against the optimal policy, thus improving over state-of-the-art techniques. Finally, the theoretical results are validated through numerical simulations showing the efficacy of our method against baseline methods.
Poster
Ji Won Park · Kyunghyun Cho
[ Hall A-E ]
Abstract
Many risk-sensitive applications require well-calibrated prediction sets over multiple, potentially correlated target variables, for which the prediction algorithm may report correlated errors. In this work, we aim to construct the conformal prediction set accounting for the joint correlation structure of the vector-valued non-conformity scores. Drawing from the rich literature on multivariate quantiles and semiparametric statistics, we propose an algorithm to estimate the $1-\alpha$ quantile of the scores, where $\alpha$ is the user-specified miscoverage rate. In particular, we flexibly estimate the joint cumulative distribution function (CDF) of the scores using nonparametric vine copulas and improve the asymptotic efficiency of the quantile estimate using its influence function. The vine decomposition allows our method to scale well to a large number of targets. As well as guaranteeing asymptotically exact coverage, our method yields desired coverage and competitive efficiency on a range of real-world regression problems, including those with missing-at-random labels in the calibration set.
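For contrast with the multivariate construction, the familiar univariate split-conformal quantile looks like this (a baseline sketch; the paper replaces this scalar quantile with a vine-copula estimate of a multivariate quantile of vector-valued scores):

```python
import numpy as np

def conformal_quantile(scores, alpha=0.1):
    """Finite-sample-corrected (1 - alpha) quantile of calibration
    non-conformity scores, as in standard split conformal prediction."""
    n = len(scores)
    level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, min(level, 1.0), method="higher")

rng = np.random.default_rng(0)
cal_scores = np.abs(rng.normal(size=1000))   # e.g. |y - y_hat| residuals
q = conformal_quantile(cal_scores, alpha=0.1)
coverage = (cal_scores <= q).mean()
print(coverage)  # at least 0.9 by construction
```

A prediction set for a new point is then, e.g., `{y : |y - y_hat| <= q}`; the paper's contribution is making the analogous quantile well-defined and efficient when the scores are vectors with correlated components.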
Poster
Ruicong Yao · Tim Verdonck · Jakob Raymaekers
[ Hall A-E ]
Abstract
Uncovering causal relationships in datasets that include both categorical and continuous variables is a challenging problem. The overwhelming majority of existing methods restrict their application to dealing with a single type of variable. Our contribution is a structural causal model designed to handle mixed-type data through a general function class. We present a theoretical foundation that specifies the conditions under which the directed acyclic graph underlying the causal model can be identified from observed data. In addition, we propose Mixed-type data Extension for Regression and Independence Testing (MERIT), enabling the discovery of causal connections in real-world classification settings. Our empirical studies demonstrate that MERIT outperforms its state-of-the-art competitor in causal discovery on relatively low-dimensional data.
Poster
Zongrui Zou · Jingcheng Liu · Jalaj Upadhyay
[ Hall A-E ]
Abstract
In this paper, we give almost linear time and space algorithms to sample from an exponential mechanism with an $\ell_1$-score function defined over an exponentially large non-convex set. As a direct result, given an $n$-vertex, $m$-edge graph $G$, we present the first $\widetilde{O}(m)$-time and $O(m)$-space algorithms for differentially privately outputting an $n$-vertex, $O(m)$-edge synthetic graph that approximates all the cuts and the spectrum of $G$. These are the first private algorithms for releasing synthetic graphs that nearly match this task's time and space complexity in the non-private setting while achieving the same (or better) utility as previous works in the more practical sparse regime. Additionally, our algorithms can be extended to private graph analysis under continual observation.
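Over a small explicit candidate set, the exponential mechanism itself is straightforward (a textbook sketch; the paper's contribution is sampling it over an exponentially large set in near-linear time and space):

```python
import numpy as np

def exponential_mechanism(scores, epsilon, sensitivity=1.0, rng=None):
    """Classic exponential mechanism: pick candidate i with probability
    proportional to exp(epsilon * score_i / (2 * sensitivity))."""
    rng = np.random.default_rng() if rng is None else rng
    logits = epsilon * np.asarray(scores, dtype=float) / (2 * sensitivity)
    p = np.exp(logits - logits.max())   # stabilized softmax
    p /= p.sum()
    return int(rng.choice(len(scores), p=p))

rng = np.random.default_rng(0)
draws = [exponential_mechanism([0.0, 10.0], epsilon=1.0, rng=rng)
         for _ in range(500)]
print(sum(draws) / 500)  # the high-score candidate wins almost always
```

Enumerating the candidate set is exactly what becomes infeasible when it is exponentially large, which is why a direct sampler like this does not scale to synthetic-graph release.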
Poster
Evgenia Rusak · Patrik Reizinger · Attila Juhos · Oliver Bringmann · Roland Zimmermann · Wieland Brendel
[ Hall A-E ]
Abstract
Prior theoretical work on contrastive learning (CL) via the InfoNCE loss showed that, under certain assumptions, the learned representations recover the ground-truth latent factors. We argue that these theories overlook crucial aspects of how CL is deployed in practice. Specifically, they either assume equal variance across all latents or that certain latents are kept invariant. However, in practice, positive pairs are often generated using augmentations such as strong cropping to just a few pixels. Hence, a more realistic assumption is that all latent factors change, with a continuum of variability across all factors. We introduce AnInfoNCE, a generalization of InfoNCE that can provably uncover the latent factors in this anisotropic setting, broadly generalizing previous identifiability results in CL. We validate our identifiability results in controlled experiments and show that AnInfoNCE increases the recovery of previously collapsed information in CIFAR10 and ImageNet, albeit at the cost of downstream accuracy. Finally, we discuss the remaining mismatches between theoretical assumptions and practical implementations.
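The standard InfoNCE loss that AnInfoNCE generalizes can be written compactly (a minimal NumPy sketch with in-batch negatives):

```python
import numpy as np

def info_nce(z_anchor, z_pos, temperature=0.5):
    """Standard InfoNCE: row i of z_pos is the positive for row i of
    z_anchor; all other rows in the batch serve as negatives."""
    z_a = z_anchor / np.linalg.norm(z_anchor, axis=1, keepdims=True)
    z_p = z_pos / np.linalg.norm(z_pos, axis=1, keepdims=True)
    logits = z_a @ z_p.T / temperature             # cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_prob).mean()               # cross-entropy on diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 4))
aligned = info_nce(z, z)                   # positives identical to anchors
shuffled = info_nce(z, rng.permutation(z)) # mismatched positives
print(aligned < shuffled)
```

AnInfoNCE modifies this objective so that the similarity need not treat all latent directions isotropically, matching the anisotropic variability assumption in the abstract.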
Poster
Giorgio Micali · Clement Lezane · Annika Betken
[ Hall A-E ]
Abstract
This article establishes a method to answer a finite set of linear queries on a given dataset while ensuring differential privacy. To achieve this, we formulate the corresponding task as a saddle-point problem, i.e., an optimization problem whose solution corresponds to a distribution minimizing the difference between answers to the linear queries based on the true distribution and answers from a differentially private distribution. Against this background, we establish two new algorithms for differentially private data release: the first is based on the differentially private Frank-Wolfe method; the second combines randomized smoothing with stochastic convex optimization techniques to solve the saddle-point problem. While previous works assess the accuracy of differentially private algorithms with reference to the empirical data distribution, a key contribution of our work is a more natural evaluation of the proposed algorithms' accuracy with reference to the true data-generating distribution.
Poster
Gefan Yang · Elizabeth Baker · Michael Severinsen · Christy Hipsley · Stefan Sommer
[ Hall A-E ]
Abstract
The diffusion bridge, which is a diffusion process conditioned on hitting a specific state within a finite period, has found broad applications in various scientific and engineering fields. However, simulating diffusion bridges for modeling natural data can be challenging due to both the intractability of the drift term and continuous representations of the data. Although several methods are available to simulate finite-dimensional diffusion bridges, infinite-dimensional cases remain underexplored. This paper presents a method that merges score-matching techniques with operator learning, enabling a direct approach to learning the infinite-dimensional bridge and achieving discretization-equivariant bridge simulation. We conduct a series of experiments, ranging from synthetic examples with closed-form solutions to the stochastic nonlinear evolution of real-world biological shape data. Our method demonstrates high efficacy, particularly due to its ability to adapt to any resolution without extra training.
Poster
Jakwang Kim · Jiyoung Park · Anirban Bhattacharya
[ Hall A-E ]
Abstract
There is growing interest in developing statistical estimators that achieve exponential concentration around a population target even when the data distribution has heavier than exponential tails. More recent activity has focused on extending such ideas beyond Euclidean spaces to Hilbert spaces and Riemannian manifolds. In this work, we show that such exponential concentration in the presence of heavy tails can be achieved over a broader class of parameter spaces called CAT($\kappa$) spaces, a very general class of metric spaces equipped with the minimal essential geometric structure for our purpose, while being sufficiently broad to encompass most typical examples encountered in statistics and machine learning. The key technique is to develop and exploit a general concentration bound for the Fr\'echet median in CAT($\kappa$) spaces. We illustrate our theory through a number of examples, and provide empirical support through simulation studies.
Poster
Tim G. J. Rudner · Xiang Pan · Yucen Li · Ravid Shwartz-Ziv · Andrew Wilson
[ Hall A-E ]
Abstract
Fine-tuning off-the-shelf pre-trained neural networks has become the default starting point for a wide range of challenging prediction tasks—especially in computer vision and natural language processing, where pre-trained models trained on millions or even billions of data points are publicly available and can be fine-tuned with a moderate compute budget. However, while fine-tuned models have been shown to significantly improve predictive performance compared to models trained from scratch, they can exhibit poor calibration and fail to reliably identify challenging distribution shifts. In this paper, we improve uncertainty quantification in fine-tuned models by constructing a data-driven uncertainty-aware fine-tuning prior that assigns high probability density to parameters that induce predictive functions with high uncertainty on input points that are meaningfully different from the data. We derive a tractable variational objective to perform approximate inference in models with data-driven uncertainty-aware priors and evaluate models fine-tuned with such priors on different transfer learning tasks. We show that fine-tuning with uncertainty-aware priors significantly improves calibration, selective prediction, and semantic shift detection on computer vision and natural language classification tasks.
Poster
Zhuorui Ye · Farzan Farnia
[ Hall A-E ]
Abstract
Saliency maps have been widely used to interpret the decisions of neural network classifiers and discover phenomena from their learned functions. However, standard gradient-based maps are frequently observed to be highly sensitive to the randomness of training data and the stochasticity in the training process. In this work, we study how the Gaussian smoothing step in the well-known Smooth-Grad algorithm affects the stability of gradient-based maps with respect to the randomness of training samples. We extend the algorithmic stability framework to gradient-based interpretation maps and prove bounds on the stability error of standard Simple-Grad, Integrated-Gradients, and Smooth-Grad saliency maps. Our theoretical results suggest that Gaussian smoothing boosts the stability of gradient-based maps against the randomness of training settings. On the other hand, we analyze the faithfulness of the Smooth-Grad maps to the original Simple-Grad maps and show that fidelity decreases under more intense Gaussian smoothing. We support our theoretical results by performing several numerical experiments on standard image datasets. Our empirical results confirm our hypothesis on the fidelity-stability trade-off in the application of Gaussian smoothing to gradient-based interpretation maps.
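To make the Smooth-Grad construction concrete, here is a minimal sketch (a toy scalar function and finite differences stand in for a trained network and backpropagation): Simple-Grad is the raw input gradient, and Smooth-Grad averages that gradient over Gaussian perturbations of the input.

```python
import numpy as np

def simple_grad(f, x, eps=1e-5):
    """Finite-difference input gradient of scalar f at x (stand-in for backprop)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

def smooth_grad(f, x, sigma=0.5, n_samples=2000, seed=0):
    """Smooth-Grad: average the Simple-Grad map over Gaussian-perturbed inputs."""
    rng = np.random.default_rng(seed)
    return np.mean([simple_grad(f, x + sigma * rng.standard_normal(x.shape))
                    for _ in range(n_samples)], axis=0)

# Toy "classifier score" with a high-frequency component that makes the raw
# gradient wildly sensitive to the exact input location.
f = lambda x: 0.5 * np.sin(10 * x[0]) + x[0] + 0.5 * x[1]
x = np.array([0.3, -0.2])
print(simple_grad(f, x))   # raw saliency, dominated by the oscillation
print(smooth_grad(f, x))   # smoothed saliency, near the stable trend [1, 0.5]
```

The smoothing averages out the oscillatory term, which is the intuition behind the stability gain the abstract analyzes; the fidelity cost is that fine-scale structure of the raw gradient is lost.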
Poster
Or Raveh · Junya Honda · Masashi Sugiyama
[ Hall A-E ]
Abstract
We introduce a multiplayer dueling bandit problem, which is tailored for distributed systems with only preference-based information available and is motivated by recent advancements in deep learning with human feedback. Compared to multiplayer bandits, this setting presents challenges in controlling the collaborative exploration of non-informative arm pairs. We demonstrate that the direct use of a Follow Your Leader black-box approach matches the asymptotic regret lower bound when utilizing known dueling bandit algorithms as a foundation. Additionally, we propose and analyze a message-passing fully distributed approach with a novel Condorcet-Winner recommendation protocol, resulting in expedited exploration in the nonasymptotic regime. Our experimental comparisons reveal that our multiplayer algorithms surpass single-player benchmark algorithms, underscoring their efficacy in addressing the nuanced challenges of multiplayer dueling bandits.
Poster
Corentin Pla · Maxime Vono · Hugo Richard
[ Hall A-E ]
Abstract
We consider the problem of mean estimation under user-level local differential privacy, where $n$ users contribute through their local pools of data samples. Previous work assumes that the number of data samples is the same across users. In contrast, we consider a more general and realistic scenario where each user $u \in [n]$ owns $m_u$ data samples drawn from some generative distribution $\mu$, with $m_u$ unknown to the statistician but drawn from a known distribution $M$ over $\mathbb{N}$. Based on a distribution-aware mean estimation algorithm, we establish an $M$-dependent upper bound on the worst-case risk over $\mu$ for the task of mean estimation. We then derive a lower bound. The two bounds are asymptotically matching up to logarithmic factors and reduce to known bounds when $m_u = m$ for every user $u$.
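For intuition, here is a baseline sketch (not the paper's distribution-aware estimator) of user-level local DP mean estimation: each user clips and privatizes its local mean with Laplace noise before the server averages, and the heterogeneous pool sizes $m_u$ are exactly the feature the paper's analysis accounts for.

```python
import numpy as np

def user_level_ldp_mean(user_samples, eps, lo=0.0, hi=1.0, seed=0):
    """Each user releases a Laplace-privatized local mean; the server averages.
    Replacing one user's entire pool moves their clipped mean by at most
    (hi - lo), so Laplace noise of scale (hi - lo)/eps gives eps-LDP at the
    user level (a standard baseline, not the paper's algorithm)."""
    rng = np.random.default_rng(seed)
    releases = []
    for samples in user_samples:  # user u holds m_u samples
        local_mean = np.clip(np.mean(samples), lo, hi)
        releases.append(local_mean + rng.laplace(scale=(hi - lo) / eps))
    return float(np.mean(releases))

# Heterogeneous pool sizes m_u, as in the paper's setting.
rng = np.random.default_rng(1)
users = [rng.uniform(0.2, 0.8, size=rng.integers(1, 50)) for _ in range(2000)]
print(user_level_ldp_mean(users, eps=2.0))  # close to the true mean 0.5
```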
Poster
Francois Bachoc · Tommaso Cesari · Roberto Colomboni
[ Hall A-E ]
Abstract
We study a contextual version of the repeated brokerage problem. In each interaction, two traders with private valuations for an item seek to buy or sell based on the price proposed by the learner—a broker—which is informed by some contextual information. The broker's goal is to maximize the traders' net utility—also known as the gain from trade—by minimizing regret compared to an oracle with perfect knowledge of the traders' valuation distributions. We assume that traders' valuations are zero-mean perturbations of the item's unknown current market value—which can change arbitrarily from one interaction to the next—and that similar contexts correspond to similar market prices. We analyze two feedback settings: full feedback, where after each interaction the traders' valuations are revealed to the broker, and limited feedback, where only transaction attempts are revealed. For both feedback types, we propose algorithms achieving tight regret bounds. We further strengthen our performance guarantees by providing a tight $1/2$-approximation result showing that the oracle that knows the traders' valuation distributions achieves at least $1/2$ of the gain from trade of the omniscient oracle that knows in advance the actual realized traders' valuations.
Poster
Mikkel Jordahn · Jonas Vestergaard Jensen · Mikkel Schmidt · Michael Andersen
[ Hall A-E ]
Abstract
Bayesian Neural Networks (BNNs) often improve model calibration and predictive uncertainty quantification compared to point estimators such as maximum-a-posteriori (MAP). Similarly, deep ensembles (DEs) are also known to improve calibration, and therefore, it is natural to hypothesize that deep ensembles of BNNs (DE-BNNs) should provide even further improvements. In this work, we systematically investigate this across a number of datasets, neural network architectures, and BNN approximation methods and surprisingly find that when the ensembles grow large enough, DEs consistently outperform DE-BNNs on in-distribution data. To shed light on this observation, we conduct several sensitivity and ablation studies. Moreover, we show that even though DE-BNNs outperform DEs on out-of-distribution metrics, this comes at the cost of decreased in-distribution performance. As a final contribution, we open-source the large pool of trained models to facilitate further research on this topic.
Poster
Yunyi Shen · Renato Berlinghieri · Tamara Broderick
[ Hall A-E ]
Abstract
Practitioners often aim to infer an unobserved population trajectory using sample snapshots at multiple time points. For example, given single-cell sequencing data, scientists would like to learn how gene expression changes over a cell’s life cycle. But sequencing any cell destroys that cell. So we can access data for any particular cell only at a single time point, but we have data across many cells. The deep learning community has recently explored using Schrödinger bridges (SBs) and their extensions in similar settings. However, existing methods either (1) interpolate between just two time points or (2) require a single fixed reference dynamic (often set to Brownian motion within SB). But learning piecewise from adjacent time points can fail to capture long-term dependencies. And practitioners are typically able to specify a model class for the reference dynamic but not the exact values of the parameters within it. So we propose a new method that (1) learns the unobserved trajectories from sample snapshots across multiple time points and (2) requires specification only of a class of reference dynamics, not a single fixed one. We demonstrate the advantages of our method on simulated and real data.
Poster
Lemin Kong · Xiangkun Hu · Tong He · David Wipf
[ Hall A-E ]
Abstract
Large language models in the past have typically relied on some form of reinforcement learning with human feedback (RLHF) to better align model responses with human preferences. However, because of oft-observed instabilities when implementing these RLHF pipelines, various reparameterization techniques have recently been introduced to sidestep the need for separately learning an RL reward model. Instead, directly fine-tuning for human preferences is achieved via the minimization of a single closed-form training objective, a process originally referred to as direct preference optimization (DPO). Although effective in certain real-world settings, we detail how the foundational role of DPO reparameterizations (and equivalency to applying RLHF with an optimal reward) may be obfuscated once inevitable optimization constraints are introduced during model training. This then motivates alternative derivations and analysis of DPO that remain intact even in the presence of such constraints. As initial steps in this direction, we re-derive DPO from a simple Gaussian estimation perspective, with strong ties to compressive sensing and classical constrained optimization problems involving noise-adaptive, concave regularization.
Poster
Kasimir Tanner · Matteo Vilucchio · Bruno Loureiro · Florent Krzakala
[ Hall A-E ]
Abstract
This work investigates adversarial training in the context of margin-based linear classifiers in the high-dimensional regime where the dimension $d$ and the number of data points $n$ diverge with a fixed ratio $\alpha = n / d$. We introduce a tractable mathematical model where the interplay between the data and adversarial attacker geometries can be studied, while capturing the core phenomenology observed in the adversarial robustness literature. Our main theoretical contribution is an exact asymptotic description of the sufficient statistics for the adversarial empirical risk minimiser, under generic convex and non-increasing losses for a Block Feature Model. Our results allow us to precisely characterise which directions in the data are associated with a higher generalisation/robustness trade-off, as defined by a robustness and a usefulness metric. This goes beyond previous models in the literature, which fail to capture a difference in performance between adversarially trained models in the high sample complexity regime. In particular, we unveil the existence of directions which can be defended without penalising accuracy. Finally, we show the advantage of defending non-robust features during training, identifying uniform protection as an inherently effective defence mechanism.
Poster
Xiaohui Tu · Yossiri Adulyasak · Nima Akbarzadeh · Erick Delage
[ Hall A-E ]
Abstract
We consider fair resource allocation in sequential decision-making environments modeled as weakly coupled Markov decision processes, where resource constraints couple the action spaces of $N$ sub-Markov decision processes (sub-MDPs) that would otherwise operate independently. We adopt a fairness definition using the generalized Gini function instead of the traditional utilitarian (total-sum) objective. After introducing a general but computationally prohibitive solution scheme based on linear programming, we focus on the homogeneous case where all sub-MDPs are identical. For this case, we show for the first time that the problem reduces to optimizing the utilitarian objective over the class of "permutation invariant" policies. This result is particularly useful as we can exploit efficient algorithms that optimize the utilitarian objective, such as Whittle index policies in restless bandits, to solve the problem with this fairness objective. For more general settings, we introduce a count-proportion-based deep reinforcement learning approach. Finally, we validate our theoretical findings with comprehensive experiments, confirming the effectiveness of our proposed method in achieving fairness.
Poster
Reza Asad · Reza Babanezhad · Issam Hadj Laradji · Nicolas Le Roux · Sharan Vaswani
[ Hall A-E ]
Abstract
Natural policy gradient (NPG) is a common policy optimization algorithm and can be viewed as mirror ascent in the space of probabilities. Recently, \citet{vaswani2021general} introduced a policy gradient method that corresponds to mirror ascent in the dual space of logits. We refine this algorithm, removing its need for a normalization across actions, and analyze the resulting method (referred to as SPMA). For tabular MDPs, we prove that SPMA with a constant step-size matches the linear convergence of NPG and achieves a faster convergence than constant step-size (accelerated) softmax policy gradient. To handle large state-action spaces, we extend SPMA to use a log-linear policy parameterization. Unlike that for NPG, generalizing SPMA to the linear function approximation (FA) setting does not require compatible function approximation. Unlike MDPO, a practical generalization of NPG, SPMA with linear FA only requires solving convex softmax classification problems. We prove that SPMA achieves linear convergence to the neighbourhood of the optimal value function. We extend SPMA to handle non-linear FA and evaluate its empirical performance on the MuJoCo and Atari benchmarks. Our results demonstrate that SPMA consistently achieves similar or better performance compared to MDPO, PPO and TRPO.
Poster
Jungeum Kim · Percy Zhai · Veronika Rockova
[ Hall A-E ]
Abstract
This paper develops a multivariate Bayesian posterior sampling method through generative quantile learning. Our method learns a mapping that can transform (spherically) uniform random vectors into posterior samples without adversarial training. We utilize Monge-Kantorovich depth in multivariate quantiles to directly sample from Bayesian credible sets, a unique feature not offered by typical posterior sampling methods. To enhance training of the quantile mapping, we design a neural network that automatically performs summary statistic extraction. This additional neural network structure has performance benefits, including support shrinkage (or posterior contraction) as the observation sample size increases. We demonstrate the usefulness of our approach on several examples where the absence of likelihood renders classical MCMC infeasible. Finally, we provide frequentist theoretical justifications for our quantile learning framework.
Poster
Christian Toth · Christian Knoll · Franz Pernkopf · Robert Peharz
[ Hall A-E ]
Abstract
The traditional two-stage approach to causal inference first identifies a *single* causal model (or equivalence class of models), which is then used to answer causal queries. However, this neglects any epistemic model uncertainty. In contrast, *Bayesian* causal inference does incorporate epistemic uncertainty into query estimates via Bayesian marginalisation (posterior averaging) over *all* causal models. While principled, this marginalisation over entire causal models, i.e., both causal structures (graphs) and mechanisms, poses a tremendous computational challenge. In this work, we address this challenge by decomposing structure marginalisation into the marginalisation over (i) causal orders and (ii) directed acyclic graphs (DAGs) given an order. We can marginalise the latter in closed form by limiting the number of parents per variable and utilising Gaussian Processes to model mechanisms. To marginalise over orders, we use a sampling-based approximation, for which we devise a novel auto-regressive distribution over causal orders (ARCO). Our method outperforms state-of-the-art in structure learning on simulated non-linear additive noise benchmarks, and yields competitive results on real-world data. Furthermore, we can accurately infer interventional distributions and average causal effects.
Poster
Lucile Ter-Minassian · Liran Szlak · Ehud Karavani · Chris Holmes · Yishai Shimoni
[ Hall A-E ]
Abstract
Modelling causal effects from observational data for deciding policy actions can benefit from being interpretable and transparent; both due to the high stakes involved and the inherent lack of ground truth labels to evaluate the accuracy of such models. To date, attempts at transparent causal effect estimation consist of applying post hoc explanation methods to black-box models, which are not interpretable. Here, we present BICauseTree: an interpretable balancing method that identifies clusters where natural experiments occur locally. Our approach builds on decision trees with a customized objective function to improve balancing and reduce treatment allocation bias. Consequently, it can additionally detect subgroups presenting positivity violations, exclude them, and provide a covariate-based definition of the target population we can infer from and generalize to. We evaluate the method's performance using synthetic and realistic datasets, explore its bias-interpretability tradeoff, and show that it is comparable with existing approaches.
Poster
Guillaume Braun · Minh Ha Quang · Masaaki Imaizumi
[ Hall A-E ]
Abstract
We investigate the problem of learning a Single Index Model (SIM) from anisotropic Gaussian inputs by training a neuron using vanilla Stochastic Gradient Descent (SGD). Our analysis shows that, unlike spherical SGD -- which is commonly used for theoretical analysis and requires estimating the covariance matrix $Q \in \mathbb{R}^{d \times d}$ -- vanilla SGD can naturally adapt to the covariance structure of the data without additional modifications. Our key theoretical contribution is a dimension-free upper bound on the sample complexity, which depends on $Q$, its alignment with the single index $w^*$, and the information exponent $k^*$. We complement this upper bound with a Correlated Statistical Query (CSQ) lower bound that matches the upper bound on average over $w^*$, although it is suboptimal in $k^*$. Finally, we validate and extend our theoretical findings through numerical simulations, demonstrating the practical effectiveness of vanilla SGD in this context.
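As a toy illustration of the setting (a tanh link and squared loss chosen for simplicity; this is not the paper's analysis), the following trains a single neuron with vanilla SGD on anisotropic Gaussian inputs and checks alignment with the planted index $w^*$. Nothing in the update uses or estimates $Q$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
# Anisotropic Gaussian inputs: x ~ N(0, Q) with Q = diag(q), spread eigenvalues.
q = np.linspace(0.1, 3.0, d)
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)

def sgd_single_neuron(steps=20_000, lr=0.02):
    """Vanilla SGD on a single tanh neuron with squared loss. The update never
    touches Q: no whitening, projection, or covariance estimation."""
    w = np.zeros(d)
    for _ in range(steps):
        x = np.sqrt(q) * rng.standard_normal(d)   # one fresh anisotropic sample
        pred = np.tanh(x @ w)
        err = pred - np.tanh(x @ w_star)
        w -= lr * err * (1 - pred ** 2) * x       # plain stochastic gradient
    return w

w = sgd_single_neuron()
align = abs(w @ w_star) / np.linalg.norm(w)
print(align)  # close to 1: plain SGD adapts to the anisotropy on its own
```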
Poster
Emanuele Marconato · Sebastien Lachapelle · Sebastian Weichwald · Luigi Gresele
[ Hall A-E ]
Abstract
We analyze identifiability as a possible explanation for the ubiquity of linear properties across language models, such as the vector difference between the representations of “easy” and “easiest” being parallel to that between “lucky” and “luckiest”. For this, we ask whether finding a linear property in one model implies that any model that induces the same distribution has that property, too. To answer that, we first prove an identifiability result to characterize distribution-equivalent next-token predictors, lifting a diversity requirement of previous results. Second, based on a refinement of relational linearity [Paccanaro and Hinton, 2001; Hernandez et al., 2024], we show how many notions of linearity are amenable to our analysis. Finally, we show that under suitable conditions, these linear properties either hold in all or in none of the distribution-equivalent next-token predictors.
Poster
Cyrille Kone · Marc Jourdan · Emilie Kaufmann
[ Hall A-E ]
Abstract
The problem of identifying the best answer among a collection of items with real-valued distributions is well understood. Despite its practical relevance for many applications, fewer works have studied its extension when multiple and potentially conflicting metrics are available to assess an item's quality. Pareto set identification (PSI) aims to identify the set of answers whose means are not uniformly worse than those of another answer. This paper studies PSI in the transductive linear setting with potentially correlated objectives. Building on posterior sampling in both the stopping and the sampling rules, we propose the \hyperlink{PSIPS}{PSIPS} algorithm that deals simultaneously with structure and correlation without paying the computational cost of existing oracle-based algorithms. From both a frequentist and a Bayesian perspective, \hyperlink{PSIPS}{PSIPS} is asymptotically optimal. We demonstrate its good empirical performance on real-world and synthetic instances.
Poster
Charles Margossian · Lawrence Saul
[ Hall A-E ]
Abstract
Given an intractable target density $p$, variational inference (VI) attempts to find the best approximation $q$ from a tractable family $\mathcal Q$. This is typically done by minimizing the exclusive Kullback-Leibler divergence, $\text{KL}(q||p)$. In practice, $\mathcal Q$ is not rich enough to contain $p$, and the approximation is misspecified even when it is a unique global minimizer of $\text{KL}(q||p)$. In this paper, we analyze the robustness of VI to these misspecifications when $p$ exhibits certain symmetries and $\mathcal Q$ is a location-scale family that shares these symmetries. We prove strong guarantees for VI not only under mild regularity conditions but also in the face of severe misspecifications. Namely, we show that (i) VI recovers the mean of $p$ when $p$ exhibits an even symmetry, and (ii) it recovers the correlation matrix of $p$ when in addition $p$ exhibits an elliptical symmetry. These guarantees hold for the mean even when $q$ is factorized and $p$ is not, and for the correlation matrix even when $q$ and $p$ behave differently in their tails. We analyze various regimes of Bayesian inference where these symmetries are useful idealizations, and we also investigate experimentally how VI behaves in their absence.
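A minimal numerical illustration of claim (i) (a 1D toy of our own, not the paper's general proof): for an even-symmetric, heavy-tailed Laplace target, the exclusive-KL-optimal Gaussian location sits at the target's mean even though the two densities behave very differently in their tails.

```python
import numpy as np

# Even-symmetric, heavy-tailed target p: standard Laplace. We fit the location
# of a Gaussian q = N(mu, 1) by minimizing the exclusive KL(q||p) on a grid,
# a toy stand-in for VI with a symmetric location-scale family.
xs = np.linspace(-12.0, 12.0, 4001)
dx = xs[1] - xs[0]
log_p = np.log(0.5) - np.abs(xs)  # log density of Laplace(0, 1)

def kl_q_p(mu, sigma=1.0):
    """Riemann-sum approximation of KL(q||p) for q = N(mu, sigma^2)."""
    q = np.exp(-0.5 * ((xs - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float(np.sum(q * (np.log(q + 1e-300) - log_p)) * dx)

mus = np.linspace(-3.0, 3.0, 121)
best_mu = mus[int(np.argmin([kl_q_p(m) for m in mus]))]
print(best_mu)  # ≈ 0: the KL minimizer recovers the mean of p
```

Here $q$ has Gaussian tails and $p$ Laplace tails, yet the fitted location matches the mean of $p$, consistent with the even-symmetry guarantee.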
Poster
Yingyu Liang · Zhizhou Sha · Zhenmei Shi · Zhao Song · Yufa Zhou
[ Hall A-E ]
Abstract
Previous work has demonstrated that attention mechanisms are Turing complete. More recently, it has been shown that a looped 9-layer Transformer can function as a universal programmable computer. In contrast, the multi-layer perceptron with $\mathsf{ReLU}$ activation ($\mathsf{ReLU}$-$\mathsf{MLP}$), one of the most fundamental components of neural networks, is known to be expressive; specifically, a two-layer neural network is a universal approximator given an exponentially large number of hidden neurons. However, it remains unclear whether a $\mathsf{ReLU}$-$\mathsf{MLP}$ can be made into a universal programmable computer using a practical number of weights. In this work, we provide an affirmative answer: a looped 23-layer $\mathsf{ReLU}$-$\mathsf{MLP}$ is capable of performing the basic necessary operations, functioning as a programmable computer more efficiently and effectively than a looped Transformer. This indicates that simple modules have stronger expressive power than previously expected and have not been fully explored. Our work provides insights into the mechanisms of neural networks and demonstrates that complex tasks, such as functioning as a programmable computer, do not necessarily require advanced architectures like Transformers.
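As a small concrete instance of ReLU-MLP expressivity (our own toy example, far simpler than the paper's 23-layer construction), a single hidden ReLU layer computes the basic operation max(a, b) exactly, via max(a, b) = ½(a + b) + ½|a − b| and |z| = relu(z) + relu(−z):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0)

# One hidden ReLU layer computing max(a, b) exactly:
# hidden units realize relu(a-b), relu(b-a), relu(a+b), relu(-a-b),
# and the output layer recombines them as 0.5*|a-b| + 0.5*(a+b).
W1 = np.array([[1, -1], [-1, 1], [1, 1], [-1, -1]], dtype=float)
w2 = np.array([0.5, 0.5, 0.5, -0.5])

def mlp_max(a, b):
    return w2 @ relu(W1 @ np.array([a, b], dtype=float))

print(mlp_max(3.0, -1.5))  # 3.0
print(mlp_max(-2.0, 7.0))  # 7.0
```

Exact selection operations like this are the kind of primitive that looped constructions compose into larger programs.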
Poster
Benedikt Fesl · Benedikt Böck · Florian Strasser · Michael Baur · Michael Joham · Wolfgang Utschick
[ Hall A-E ]
Abstract
Diffusion models (DMs) as generative priors have recently shown great potential for denoising tasks but lack theoretical understanding with respect to their mean square error (MSE) optimality. This paper proposes a novel denoising strategy inspired by the structure of the MSE-optimal conditional mean estimator (CME). The resulting DM-based denoiser can be conveniently employed using a pre-trained DM, being particularly fast by truncating reverse diffusion steps and not requiring stochastic re-sampling. We present a comprehensive (non-)asymptotic optimality analysis of the proposed diffusion-based denoiser, demonstrating polynomial-time convergence to the CME under mild conditions. Our analysis also derives a novel Lipschitz constant that depends solely on the DM’s hyperparameters. Further, we offer a new perspective on DMs, showing that they inherently combine an asymptotically optimal denoiser with a powerful generator, modifiable by switching re-sampling in the reverse process on or off. The theoretical findings are thoroughly validated with experiments based on various benchmark datasets.
Poster
Shyam Venkatasubramanian · Ali Pezeshki · Vahid Tarokh
[ Hall A-E ]
Abstract
We introduce a new approach to processing complex-valued data using DNNs consisting of parallel real-valued subnetworks with coupled outputs. Our proposed class of architectures, referred to as Steinmetz Neural Networks, incorporates multi-view learning to construct more interpretable representations in the latent space. Moreover, we present the Analytic Neural Network, which incorporates a consistency penalty that encourages analytic signal representations in the latent space of the Steinmetz neural network. This penalty enforces a deterministic and orthogonal relationship between the real and imaginary components. Using an information-theoretic construction, we demonstrate that the generalization gap upper bound posited by the analytic neural network is lower than that of the general class of Steinmetz neural networks. Our numerical experiments depict the improved performance and robustness to additive noise, afforded by our proposed networks on benchmark datasets and synthetic examples.
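To fix ideas, here is a bare-bones sketch of the parallel-subnetwork idea (random untrained weights, and a simplification of our own making rather than the paper's architecture): the real and imaginary parts of a complex signal flow through two real-valued MLP branches whose outputs are coupled back into one complex representation.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0)

class RealSubnet:
    """One real-valued branch: a small random-weight MLP (untrained sketch)."""
    def __init__(self, d_in, d_hidden, d_out):
        self.W1 = rng.standard_normal((d_hidden, d_in)) / np.sqrt(d_in)
        self.W2 = rng.standard_normal((d_out, d_hidden)) / np.sqrt(d_hidden)

    def __call__(self, x):
        return self.W2 @ relu(self.W1 @ x)

def steinmetz_forward(z, net_re, net_im):
    """Process a complex signal with two parallel real subnetworks, then
    couple their outputs into a single complex representation."""
    u, v = net_re(z.real), net_im(z.imag)  # two real-valued views of the signal
    return u + 1j * v                      # coupled complex-valued output

z = rng.standard_normal(8) + 1j * rng.standard_normal(8)
net_re, net_im = RealSubnet(8, 16, 4), RealSubnet(8, 16, 4)
print(steinmetz_forward(z, net_re, net_im).shape)  # (4,)
```

The analytic variant described in the abstract would additionally penalize the latent real and imaginary parts toward a Hilbert-transform (analytic-signal) relationship; that consistency penalty is omitted here.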
Poster
Yan Chen · Jose Blanchet · Krzysztof Dembczynski · Laura F. Nern · Aaron Flores
[ Hall A-E ]
Abstract
Downsampling, or under-sampling, is a technique utilized when training classification models on large and highly imbalanced datasets. We study optimal downsampling for imbalanced classification using generalized linear models (GLMs). We propose a pseudo maximum likelihood estimator and study its asymptotic normality in the context of increasingly imbalanced populations relative to an increasingly large sample size. We provide theoretical guarantees for the introduced estimator. Additionally, we compute the optimal downsampling rate using a criterion that balances statistical accuracy and computational efficiency. Our numerical experiments, conducted on both synthetic and empirical data, further validate our theoretical results and demonstrate that the introduced estimator outperforms commonly available alternatives.
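The flavor of the setting can be seen in a classic special case (logistic regression with the standard case-control intercept correction; the paper's pseudo-MLE and optimal-rate analysis go well beyond this sketch): downsample the majority class at rate r, fit the GLM, then shift the intercept by log r.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logistic(X, y, lr=0.1, steps=5000):
    """Plain logistic regression by gradient descent (intercept in w[0])."""
    Xb = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

# Imbalanced data: P(y=1|x) = sigmoid(-3 + 2x), so positives are rare.
n = 60_000
x = rng.standard_normal((n, 1))
y = (rng.random(n) < 1 / (1 + np.exp(-(-3 + 2 * x[:, 0])))).astype(float)

# Keep all positives, downsample negatives at rate r, then shift the intercept
# by log(r) to undo the induced sampling bias (the case-control correction).
r = 0.1
keep = (y == 1) | (rng.random(n) < r)
w = fit_logistic(x[keep], y[keep])
w[0] += np.log(r)
print(w)  # ≈ [-3, 2], recovering the true GLM despite downsampling
```

The correction works because downsampling negatives at rate r shifts the log-odds by exactly −log r while leaving the slope intact.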
Poster
Erfan Mirzaei · Andreas Maurer · Vladimir Kostic · Massimiliano Pontil
[ Hall A-E ]
Abstract
Learning from non-independent and non-identically distributed data poses a persistent challenge in statistical learning. In this study, we introduce data-dependent Bernstein inequalities tailored for vector-valued processes in Hilbert space. Our inequalities apply to both stationary and non-stationary processes and exploit the potential rapid decay of correlations between temporally separated variables to improve estimation. We demonstrate the utility of these bounds by applying them to covariance operator estimation in the Hilbert-Schmidt norm and to operator learning in dynamical systems, achieving novel risk bounds. Finally, we perform numerical experiments to illustrate the practical implications of these bounds in both contexts.
Poster
Steffen Schneider · Rodrigo González Laiz · Anastasiia Filippova · Markus Frey · Mackenzie Mathis
[ Hall A-E ]
Abstract
Gradient-based attribution methods aim to explain decisions of deep learning models but so far lack identifiability guarantees. Here, we propose a method to generate attribution maps with identifiability guarantees by developing a regularized contrastive learning algorithm trained on time-series data plus a new attribution method called Inverted Neuron Gradient (collectively named xCEBRA). We show theoretically that xCEBRA has favorable properties for identifying the Jacobian matrix of the data generating process. Empirically, we demonstrate robust approximation of zero vs. non-zero entries in the ground-truth attribution map on synthetic datasets, and significant improvements across previous attribution methods based on feature ablation, Shapley values, and other gradient-based methods. Our work constitutes a first example of identifiable inference of time-series attribution maps and opens avenues to a better understanding of time-series data, such as for neural dynamics and decision-processes within neural networks.
Poster
Thang Loi Nguyen · Loc Duong · Vo Duy
[ Hall A-E ]
Abstract
Feature Selection (FS) under domain adaptation (DA) is a critical task in machine learning, especially when dealing with limited target data. However, existing methods lack the capability to guarantee the reliability of FS under DA. In this paper, we introduce a novel method to statistically test the reliability of FS under DA, named SFS-DA (statistical FS-DA). The key strength of SFS-DA lies in its ability to control the false positive rate (FPR) below a pre-specified level $\alpha$ (e.g., 0.05) while maximizing the true positive rate. Compared to the literature on statistical FS, SFS-DA presents a unique challenge in addressing the effect of DA to ensure the validity of the inference on FS results. We overcome this challenge by leveraging the Selective Inference (SI) framework. Specifically, by carefully examining the FS process under DA, whose operations can be characterized by linear and quadratic inequalities, we prove that achieving FPR control in SFS-DA is indeed possible. Furthermore, we enhance the true detection rate by introducing a more strategic approach. Experiments conducted on both synthetic and real-world datasets robustly support our theoretical results, showcasing SFS-DA's superior performance.
Poster
Futoshi Futami
[ Hall A-E ]
Abstract
Bayesian inference is widely used in practice due to its ability to assess epistemic uncertainty (EU) in predictions. However, its computational complexity necessitates the use of approximation methods, such as variational inference (VI). When estimating EU within the VI framework, metrics such as the variance of the posterior predictive distribution and conditional mutual information are commonly employed. Despite their practical importance, these metrics lack comprehensive theoretical analysis. In this paper, we investigate these EU metrics by providing their novel relationship to excess risk, which allows for a convergence analysis based on PAC-Bayesian theory. Based on these analyses, we then demonstrate that some existing objective functions of VI regularize EU metrics in different ways leading to different performance in EU evaluation. Finally, we propose a novel objective function for VI that directly optimizes both prediction and EU under the PAC-Bayesian framework. Experimental results indicate that our algorithm significantly improves EU estimation compared to existing VI methods.
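For readers less familiar with the metrics being analyzed, this toy decomposition (a Beta "posterior" over a Bernoulli parameter, our own illustration) shows how the posterior predictive variance splits into an aleatoric and an epistemic term via the law of total variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Posterior samples of a Bernoulli parameter theta (e.g. drawn from a VI
# approximation): spread across samples is epistemic uncertainty, while the
# Bernoulli noise at any fixed theta is aleatoric.
thetas = rng.beta(2, 5, size=10_000)        # stand-in posterior over theta

aleatoric = np.mean(thetas * (1 - thetas))  # E[Var(y | theta)]
epistemic = np.var(thetas)                  # Var(E[y | theta])
total = aleatoric + epistemic               # variance of posterior predictive

p = thetas.mean()
print(total, p * (1 - p))  # law of total variance: the two agree
```

More data concentrates the posterior, shrinking the epistemic term while the aleatoric term persists; this is the behavior the EU metrics in the abstract are designed to capture.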
Poster
Lei You · Lele Cao · Mattias Nilsson · Bo Zhao · Lei Lei
[ Hall A-E ]
Abstract
Counterfactual explanations (CE) are the de facto method of providing insight and interpretability in black-box decision-making models by identifying alternative input instances that lead to different outcomes. This paper extends the concept of CE to a distributional context, broadening the scope from individual data points to entire input and output distributions, named distributional counterfactual explanation (DCE). In DCE, we take the stakeholder's perspective and shift focus to analyzing the distributional properties of the factual and counterfactual, drawing parallels to the classical approach of assessing individual instances and their resulting decisions. We leverage optimal transport (OT) to frame a chance-constrained optimization problem, aiming to derive a counterfactual distribution that closely aligns with its factual counterpart, substantiated by statistical confidence. Our proposed optimization method, Discount, strategically balances this confidence in both the input and output distributions. This algorithm is accompanied by an analysis of its convergence rate. The efficacy of our proposed method is substantiated through a series of quantitative and qualitative experiments, highlighting its potential to provide deep insights into decision-making models.
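The OT machinery at the core of DCE is easiest to see in one dimension, where the Wasserstein-1 distance between two equal-size empirical samples reduces to a sorted matching. This minimal sketch (function name ours, not from the paper) computes that transport cost:

```python
import numpy as np

def wasserstein1_1d(u, v):
    """W1 between two equal-size empirical 1-D samples.

    In 1-D the optimal transport plan matches points in sorted order, so W1
    is just the mean absolute difference of the order statistics.
    """
    return float(np.mean(np.abs(np.sort(u) - np.sort(v))))

# Shifting a sample by a constant c moves it exactly W1 = |c|.
d = wasserstein1_1d(np.array([0.0, 1.0]), np.array([1.0, 2.0]))
```

DCE works with full distributional counterfactuals under chance constraints; this only illustrates the transport cost that the Discount optimizer trades off against statistical confidence.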
Poster
Rilind Sahitaj · Paulius Sasnauskas · Yiğit Yalın · Debmalya Mandal · Goran Radanovic
[ Hall A-E ]
Abstract
Performative Reinforcement Learning (PRL) refers to a scenario in which the deployed policy changes the reward and transition dynamics of the underlying environment. In this work, we study multi-agent PRL by incorporating performative effects into Markov Potential Games (MPGs). We introduce the notion of a performatively stable equilibrium (PSE) and show that it always exists under a reasonable sensitivity assumption. We then provide convergence results for state-of-the-art algorithms used to solve MPGs. Specifically, we show that independent policy gradient ascent (IPGA) and independent natural policy gradient (INPG) converge to an approximate PSE in the best-iterate sense, with an additional term that accounts for the performative effects. Furthermore, we show that INPG asymptotically converges to a PSE in the last-iterate sense. As the performative effects vanish, we recover the convergence rates from prior work. For a special case of our game, we provide finite-time last-iterate convergence results for a repeated retraining approach, in which agents independently optimize a surrogate objective. We conduct extensive experiments to validate our theoretical findings.
Poster
Nicolas Nguyen · Imad Aouali · Andras Gyorgy · Claire Vernade
[ Hall A-E ]
Abstract
We study the problem of Bayesian fixed-budget best-arm identification (BAI) in structured bandits. We propose an algorithm that uses fixed allocations based on the prior information and the structure of the environment. We provide theoretical bounds on its performance across diverse models, including the first prior-dependent upper bounds for linear and hierarchical BAI. Our key contribution lies in introducing novel proof techniques that yield tighter bounds for multi-armed BAI compared to existing approaches. Our work provides new insights into Bayesian fixed-budget BAI in structured bandits, and extensive experiments demonstrate the consistent and robust performance of our method in practice across various settings.
Poster
Somnath Basu Roy Chowdhury · Kumar Avinava Dubey · Ahmad Beirami · Rahul Kidambi · Nicholas Monath · Amr Ahmed · Snigdha Chaturvedi
[ Hall A-E ]
Abstract
Concept erasure is the task of erasing information about a concept (e.g., gender or race) from a representation set while retaining the maximum possible utility -- information from original representations. Concept erasure is useful in several applications, such as removing sensitive concepts to achieve fairness and interpreting the impact of specific concepts on a model's performance. Previous concept erasure techniques have prioritized robustly erasing concepts over retaining the utility of the resultant representations. However, there seems to be an inherent tradeoff between erasure and retaining utility, making it unclear how to achieve perfect concept erasure while maintaining high utility. In this paper, we offer a fresh perspective toward solving this problem by quantifying the fundamental limits of concept erasure through an information-theoretic lens. Using these results, we investigate constraints on the data distribution and the erasure functions required to achieve the limits of perfect concept erasure. Empirically, we show that the derived erasure functions achieve the optimal theoretical bounds. Additionally, we show that our approach outperforms existing methods on a range of synthetic and real-world datasets using GPT-4 representations.
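As a concrete baseline for what "erasing a concept" means, consider the classic linear projection approach: remove the direction separating the class means of a binary concept. This is a standard mean-projection baseline for illustration, not the paper's optimal erasure function:

```python
import numpy as np

def erase_concept_direction(Z, c):
    """Project representations Z (n x d) onto the orthocomplement of the
    class-mean difference direction of binary concept labels c in {0, 1}."""
    d = Z[c == 1].mean(axis=0) - Z[c == 0].mean(axis=0)
    d = d / np.linalg.norm(d)
    return Z - np.outer(Z @ d, d)

rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 8))
c = (rng.random(200) < 0.5).astype(int)
Z[c == 1] += 3.0                      # plant a linearly decodable concept
Ze = erase_concept_direction(Z, c)
gap = np.linalg.norm(Ze[c == 1].mean(axis=0) - Ze[c == 0].mean(axis=0))
```

After the projection the class means coincide, so the mean-difference direction carries no concept signal; the paper's information-theoretic analysis characterizes when such erasure can be made perfect without destroying utility.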
Poster
Keyao Zhan · Puheng Li · Lei Wu
[ Hall A-E ]
Abstract
It was empirically observed in Entezari et al. (2021) that when accounting for the permutation invariance of neural networks, there is likely no loss barrier along the linear interpolation between two SGD solutions -- a phenomenon known as linear mode connectivity (LMC) modulo permutation. This phenomenon has sparked significant attention due to both its theoretical interest and practical relevance in applications such as model merging. In this paper, we provide a fine-grained analysis of this phenomenon for two-layer ReLU networks under a teacher-student setup. We show that as the student network width $m$ increases, the LMC loss barrier modulo permutation exhibits a double descent behavior. Particularly, when $m$ is sufficiently large, the barrier decreases to zero at a rate $O(m^{-1/2})$. Notably, this rate does not suffer from the curse of dimensionality and demonstrates how substantial permutation can reduce the LMC loss barrier. Moreover, we observe a sharp transition in the sparsity of GD/SGD solutions when increasing the learning rate and investigate how this sparsity preference affects the LMC loss barrier modulo permutation. Experiments on both synthetic and MNIST datasets corroborate our theoretical predictions and reveal a similar trend for more complex network architectures.
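The quantity being analyzed can be computed directly for tiny networks: the loss barrier along the linear path, minimized over hidden-unit permutations. This brute-force sketch (feasible only for very small widths; names ours) verifies that a permuted copy of a network has zero barrier modulo permutation:

```python
import itertools
import numpy as np

def forward(W1, w2, X):
    """Two-layer ReLU network: x -> relu(x W1^T) w2."""
    return np.maximum(X @ W1.T, 0.0) @ w2

def loss(W1, w2, X, y):
    return float(np.mean((forward(W1, w2, X) - y) ** 2))

def barrier_modulo_permutation(net_a, net_b, X, y, grid=11):
    """Smallest barrier max_t L(interp) - max(L(a), L(b)) over all
    hidden-unit permutations of net_b (brute force: tiny widths only)."""
    (W1a, w2a), m = net_a, net_a[0].shape[0]
    best = np.inf
    for perm in itertools.permutations(range(m)):
        W1b, w2b = net_b[0][list(perm)], net_b[1][list(perm)]
        path = max(
            loss((1 - t) * W1a + t * W1b, (1 - t) * w2a + t * w2b, X, y)
            for t in np.linspace(0.0, 1.0, grid)
        )
        ends = max(loss(W1a, w2a, X, y), loss(*net_b, X, y))
        best = min(best, path - ends)
    return best

rng = np.random.default_rng(0)
W1, w2 = rng.standard_normal((4, 3)), rng.standard_normal(4)
X = rng.standard_normal((32, 3))
y = forward(W1, w2, X)
# A permuted copy of the same network: barrier modulo permutation is zero.
p = [2, 0, 3, 1]
b = barrier_modulo_permutation((W1, w2), (W1[p], w2[p]), X, y)
```

The paper's analysis concerns how this barrier scales with width $m$; weight-matching heuristics replace the factorial search in practice.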
Poster
Slavomir Hanzely
[ Hall A-E ]
Abstract
In this paper, we propose the first sketch-and-project Newton method with a fast $O(1/k^2)$ global convergence rate for self-concordant functions. Our method, SGN, can be viewed in three ways: i) as a sketch-and-project algorithm projecting updates of the Newton method, ii) as a cubically regularized Newton method in the sketched subspaces, and iii) as a damped Newton method in the sketched subspaces. SGN inherits the best of all three worlds: the cheap iteration costs of the sketch-and-project methods, the state-of-the-art $O(1/k^2)$ global convergence rate of the full-rank Newton-like methods, and the algorithmic simplicity of the damped Newton methods. Finally, we demonstrate its comparable empirical performance to the baseline algorithms.
Poster
Salma Kharrat · Marco Canini · Samuel Horvath
[ Hall A-E ]
Abstract
This work addresses the challenges of data heterogeneity and communication constraints in decentralized federated learning (FL). We introduce decentralized personalized FL (DPFL), a bi-level optimization framework that enhances personalized FL by leveraging combinatorial relationships among clients, enabling fine-grained and targeted collaborations. By employing a constrained greedy algorithm, DPFL constructs a collaboration graph that guides clients in choosing suitable collaborators, enabling personalized model training tailored to local data while respecting a fixed and predefined communication and resource budget. Our theoretical analysis demonstrates that the proposed objective for constructing the collaboration graph yields superior or equivalent performance compared to any alternative collaboration structures, including pure local training. Extensive experiments across diverse datasets show that DPFL consistently outperforms existing methods, effectively handling non-IID data, reducing communication overhead, and improving resource efficiency in real-world decentralized FL scenarios. The code can be accessed at: https://github.com/salmakh1/DPFL.
Poster
Nikita Kalinin · Lukas Steinberger
[ Hall A-E ]
Abstract
In this paper, we study the problem of estimating the unknown mean $\theta$ of a unit variance Gaussian distribution in a locally differentially private (LDP) way. In the high-privacy regime ($\epsilon\le 1$), we identify an optimal privacy mechanism that minimizes the variance of the estimator asymptotically. Our main technical contribution is the maximization of the Fisher-Information of the sanitized data with respect to the local privacy mechanism $Q$. We find that the exact solution $Q_{\theta,\epsilon}$ of this maximization is the sign mechanism that applies randomized response to the sign of $X_i-\theta$, where $X_1,\dots, X_n$ are the confidential iid original samples. However, since this optimal local mechanism depends on the unknown mean $\theta$, we employ a two-stage LDP parameter estimation procedure which requires splitting agents into two groups. The first $n_1$ observations are used to consistently but not necessarily efficiently estimate the parameter $\theta$ by $\tilde{\theta}_{n_1}$. Then this estimate is updated by applying the sign mechanism with $\tilde{\theta}_{n_1}$ instead of $\theta$ to the remaining $n-n_1$ observations, to obtain an LDP and efficient estimator of the unknown mean.
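The two-stage procedure can be sketched end to end. Note two simplifications labeled in the comments: the pilot estimate below is a plain (non-private) mean for illustration, whereas the paper's first stage is itself an LDP estimator, and all names are ours:

```python
import math
import random
from statistics import NormalDist

def sign_mechanism(x, theta_tilde, eps, rng):
    """Randomized response on sign(x - theta_tilde):
    report the true sign with probability e^eps / (1 + e^eps)."""
    s = 1 if x > theta_tilde else -1
    p_keep = math.exp(eps) / (1.0 + math.exp(eps))
    return s if rng.random() < p_keep else -s

def two_stage_mean_estimate(samples, eps, rng):
    n1 = len(samples) // 2
    # Stage 1: pilot estimate (plain mean here for illustration only;
    # the paper uses an LDP pilot estimator).
    theta_tilde = sum(samples[:n1]) / n1
    # Stage 2: sanitize the signs of the remaining samples.
    p_keep = math.exp(eps) / (1.0 + math.exp(eps))
    sanitized = [sign_mechanism(x, theta_tilde, eps, rng) for x in samples[n1:]]
    freq = sum(1 for s in sanitized if s == 1) / len(sanitized)
    # Debias the randomized response to estimate q = P(X > theta_tilde).
    q = (freq - (1.0 - p_keep)) / (2.0 * p_keep - 1.0)
    q = min(max(q, 1e-9), 1.0 - 1e-9)
    # For X ~ N(theta, 1): q = Phi(theta - theta_tilde), so invert the CDF.
    return theta_tilde + NormalDist().inv_cdf(q)

rng = random.Random(0)
samples = [2.0 + rng.gauss(0.0, 1.0) for _ in range(20000)]
est = two_stage_mean_estimate(samples, 1.0, rng)
```

The update step recovers $\theta$ from $\tilde{\theta}_{n_1}$ via the Gaussian CDF relation $P(X > \tilde{\theta}_{n_1}) = \Phi(\theta - \tilde{\theta}_{n_1})$.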
Poster
Xin Liu
[ Hall A-E ]
Abstract
This paper studies the problem of preference-based stochastic linear contextual bandits with knapsack constraints (PbLinCBwK). We propose budget-aware optimistic and randomized exploration algorithms that achieve a regret of ${O}((\kappa+\frac{T\nu^*}{B})\sqrt{T}\log T)$ for any total budget $B=\Omega(\sqrt{T})$. The parameters $\kappa$ and $\frac{T\nu^*}{B}$ capture the effects of preference feedback and knapsack constraints, respectively. Our regret performance is near-optimal and matches the bound of LinCBwK under the mild condition $B=\Omega(\sqrt{T})$. To achieve these results, we view the process of budget consumption and stopping time as a Markov process and analyze it via the Lyapunov drift method, which translates into a strong regret guarantee. Experiments on synthetic PbLinCBwK instances and an online content moderation setting further justify the theoretical results.
Poster
Angeliki Giannou · Liu Yang · Tianhao Wang · Dimitris Papailiopoulos · Jason Lee
[ Hall A-E ]
Abstract
Transformer-based models have demonstrated remarkable in-context learning capabilities, prompting extensive research into their underlying mechanisms. Recent studies have suggested that Transformers can implement first-order optimization algorithms for in-context learning, and even second-order ones in the case of linear regression. In this work, we study whether Transformers can perform higher-order optimization methods beyond the case of linear regression. We establish that linear attention Transformers with ReLU layers can approximate second-order optimization algorithms for the task of logistic regression and achieve $\epsilon$ error with a number of layers that grows only logarithmically in $1/\epsilon$. Our results suggest the ability of the Transformer architecture to implement complex algorithms, beyond gradient descent.
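The second-order target that the Transformer construction approximates is Newton's method on the logistic loss. A plain reference implementation (not the Transformer construction itself; the small ridge term is our addition for numerical stability) looks like this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, steps=25, ridge=1e-3):
    """Newton's method on the ridge-stabilized logistic loss; y in {0, 1}.
    Each step solves H dw = -g with the exact Hessian H."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        p = sigmoid(X @ w)
        g = X.T @ (p - y) / n + ridge * w
        H = (X * (p * (1.0 - p))[:, None]).T @ X / n + ridge * np.eye(d)
        w = w - np.linalg.solve(H, g)
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
y = (rng.random(200) < sigmoid(X @ np.array([1.0, -2.0, 0.5]))).astype(float)
w = newton_logistic(X, y)
```

The quadratic local convergence of this iteration is what makes a logarithmic-in-$1/\epsilon$ layer count plausible: each emulated Newton step roughly squares the error.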
Poster
Chengrui Qu · Laixi Shi · Kishan Panaganti · Pengcheng You · Adam Wierman
[ Hall A-E ]
Abstract
Online reinforcement learning (RL) typically requires online interaction data to learn a policy for a target task, but collecting such data can be high-stakes. This prompts interest in leveraging historical data to improve sample efficiency. The historical data may come from outdated or related source environments with different dynamics. It remains unclear how to effectively use such data in the target task to provably enhance learning and sample efficiency. To address this, we propose a hybrid transfer RL (HTRL) setting, where an agent learns in a target environment while accessing offline data from a source environment with shifted dynamics. We show that -- without information on the dynamics shift -- general shifted-dynamics data, even with subtle shifts, does not reduce sample complexity in the target environment. However, focusing on HTRL with prior information on the degree of the dynamics shift, we design HySRL, a transfer algorithm that outperforms pure online RL with problem-dependent sample complexity guarantees. Finally, our experimental results demonstrate that HySRL surpasses the state-of-the-art pure online RL baseline.
Poster
Arnab Bhattacharyya · Weiming Feng · Piyush Srivastava
[ Hall A-E ]
Abstract
The total variation distance is a metric of central importance in statistics and probability theory. However, somewhat surprisingly, questions about computing it *algorithmically* appear not to have been systematically studied until very recently. In this paper, we contribute to this line of work by studying this question in the important special case of multivariate Gaussians. More formally, we consider the problem of approximating the total variation distance between two multivariate Gaussians to within an $\epsilon$-relative error. Previous works achieved a *fixed* constant relative error approximation via closed-form formulas. In this work, we give algorithms that, given any two $n$-dimensional Gaussians $D_1,D_2$ and any error bound $\epsilon > 0$, approximate the total variation distance $D := d_{TV}(D_1,D_2)$ to $\epsilon$-relative accuracy in $\mathrm{poly}(n,\frac{1}{\epsilon},\log \frac{1}{D})$ operations. The main technical tool in our work is a reduction that helps us extend the recent progress on computing the TV-distance between *discrete* random variables to our continuous setting.
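In one dimension the quantity being approximated is simple: $d_{TV} = \frac{1}{2}\int |p_1 - p_2|$. A direct numerical sketch for 1-D Gaussians (the paper's contribution is the much harder $n$-dimensional case with relative-error guarantees):

```python
from statistics import NormalDist

def tv_gaussians_1d(mu1, s1, mu2, s2, grid=20000, lo=-15.0, hi=15.0):
    """Approximate d_TV(N(mu1, s1^2), N(mu2, s2^2)) = 0.5 * int |p1 - p2|
    by the trapezoid rule on [lo, hi]."""
    d1, d2 = NormalDist(mu1, s1), NormalDist(mu2, s2)
    h = (hi - lo) / grid
    prev, total = abs(d1.pdf(lo) - d2.pdf(lo)), 0.0
    for i in range(1, grid + 1):
        cur = abs(d1.pdf(lo + i * h) - d2.pdf(lo + i * h))
        total += 0.5 * (prev + cur) * h
        prev = cur
    return 0.5 * total

tv = tv_gaussians_1d(0.0, 1.0, 1.0, 1.0)
# Sanity check: d_TV = 1 - overlap, and NormalDist.overlap computes the
# overlapping area of two normal densities in closed form.
ref = 1.0 - NormalDist(0.0, 1.0).overlap(NormalDist(1.0, 1.0))
```

Note that this grid approach gives absolute, not relative, accuracy; when $D$ is tiny a fixed grid loses all significant digits, which is one reason the $\log\frac{1}{D}$ dependence in the paper's bound is nontrivial.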
Poster
Alexander Koebler · Thomas Decker · Ingo Thon · Volker Tresp · Florian Buettner
[ Hall A-E ]
Abstract
We study the problem of monitoring machine learning models under gradual distribution shifts, where circumstances change slowly over time, often leading to unnoticed yet significant declines in accuracy. To address this, we propose Incremental Uncertainty-aware Performance Monitoring (IUPM), a novel label-free method that estimates performance changes by modeling gradual shifts using optimal transport. In addition, IUPM quantifies the uncertainty in the performance prediction and introduces an active labeling procedure to restore a reliable estimate under a limited labeling budget. Our experiments show that IUPM outperforms existing performance estimation baselines in various gradual shift scenarios and that its uncertainty awareness guides label acquisition more effectively compared to other strategies.
Poster
Dai Hai Nguyen · Tetsuya Sakurai · Hiroshi Mamitsuka
[ Hall A-E ]
Abstract
Variational Inference (VI) optimizes variational parameters to closely align a variational distribution with the true posterior, being approached through vanilla gradient descent in black-box VI or natural-gradient descent in natural-gradient VI. In this work, we reframe VI as the optimization of an objective that concerns probability distributions defined over a variational parameter space. Subsequently, we propose Wasserstein gradient descent for solving this optimization, where black-box VI and natural-gradient VI can be interpreted as special cases of the proposed Wasserstein gradient descent. To enhance the efficiency of optimization, we develop practical methods for numerically solving the discrete gradient flows. We validate the effectiveness of the proposed methods through experiments on synthetic and real-world datasets, supplemented by theoretical analyses.
Poster
Sebastian Pineda Arango · Pedro Mercado · Shubham Kapoor · Abdul Fatir Ansari · Lorenzo Stella · Huibin Shen · Hugo Sénétaire · Caner Turkmen · Oleksandr Shchur · Danielle Maddix · Michael Bohlke-Schneider · Bernie Wang · Syama Sundar Rangapuram
[ Hall A-E ]
Abstract
Covariates provide valuable information on external factors that influence time series and are critical in many real-world time series forecasting tasks. For example, in retail, covariates may indicate promotions or peak dates such as holiday seasons that heavily influence demand forecasts. Recent advances in pretraining large language model architectures for time series forecasting have led to highly accurate forecasters. However, the majority of these models do not readily use covariates as they are often specific to a certain task or domain. This paper introduces a new method to incorporate covariates into pretrained time series forecasting models. Our proposed approach incorporates covariate information into pretrained forecasting models through modular blocks that inject past and future covariate information, without necessarily modifying the pretrained model in consideration. In order to evaluate our approach, we introduce a benchmark composed of 32 different synthetic datasets with varying dynamics designed to assess the effectiveness of forecasting models with covariates. Extensive evaluations on both synthetic and real datasets show that our approach effectively incorporates covariate information into pretrained models, outperforming existing baselines.
Poster
Omer Noy Klein · Alihan Hüyük · Ron Shamir · Uri Shalit · Mihaela van der Schaar
[ Hall A-E ]
Abstract
Randomized Controlled Trials (RCTs) are the gold standard for evaluating the effect of new medical treatments. Treatments must pass stringent regulatory conditions in order to be approved for widespread use, yet even after the regulatory barriers are crossed, real-world challenges might arise: Who should get the treatment? What is its true clinical utility? Are there discrepancies in the treatment effectiveness across diverse and under-served populations? We introduce two new objectives for future clinical trials that integrate regulatory constraints and treatment policy value for both the entire population and under-served populations, thus answering some of the questions above in advance. Designed to meet these objectives, we formulate Randomize First Augment Next (RFAN), a new framework for designing Phase III clinical trials. Our framework consists of a standard randomized component followed by an adaptive one, jointly meant to efficiently and safely acquire and assign patients into treatment arms during the trial. Then, we propose strategies for implementing RFAN based on causal, deep Bayesian active learning. Finally, we empirically evaluate the performance of our framework using synthetic and real-world semi-synthetic datasets.
Poster
Cheng Jiang · Sitian Qian
[ Hall A-E ]
Abstract
Modern high-energy physics (HEP) experiments are increasingly challenged by the vast size and complexity of their datasets, particularly regarding large-scale point cloud processing and long sequences. In this study, to address these challenges, we explore the application of structured state space models (SSMs), presenting one of the first attempts to integrate locality-sensitive hashing into either a hybrid or pure Mamba model. Our results demonstrate that pure SSMs could serve as powerful backbones for HEP problems involving tasks for long sequence data with local inductive bias. By integrating locality-sensitive hashing into Mamba blocks, we achieve significant improvements over traditional backbones in key HEP tasks, surpassing them in inference speed and physics metrics while reducing computational overhead. In key tests, our approach demonstrated promising results, presenting a viable alternative to traditional transformer backbones by significantly reducing FLOPS while maintaining robust performance.
Poster
Zhu Wang · Praveen Raj Veluswami · Harsh Mishra · Sathya N. Ravi
[ Hall A-E ]
Abstract
First-order methods are widely employed for training neural networks that are used in practical applications. For classification of input features, Cross-Entropy based loss functions are often preferred since they are differentiable everywhere. Recent optimization results show that the convergence properties of first-order methods such as gradient descent are intricately tied to the separability of datasets and the induced loss landscape. We introduce Rooted Logistic Objectives (RLO) to improve practical convergence behavior with benefits for downstream tasks. We show that our proposed loss satisfies strict convexity properties and has better condition number properties that will benefit practical implementations. To evaluate our proposed RLO, we compare its performance on various classification benchmarks. Our results illustrate that the training procedure converges faster with RLO in many cases. Furthermore, on two downstream tasks, viz., post-training quantization and finetuning in quantized space, we show that it is possible to ensure lower performance degradation than state-of-the-art methods while using reduced precision for sequence prediction tasks in large language models.
Poster
Wenjing Chen · Victoria Crawford
[ Hall A-E ]
Abstract
Leveraging the intrinsic structure of submodular functions to design more sample-efficient algorithms for submodular maximization (SM) has gained significant attention in recent studies. In a number of real-world applications such as diversified recommender systems and data summarization, the submodular function exhibits additional linear structure. In this paper, we consider the problem of linear submodular maximization under bandit feedback in the pure-exploration setting, where the submodular objective function $f:2^U \rightarrow \mathbb{R}_{\ge 0}$ is defined as $f=\sum_{i=1}^d w_i F_{i}$. It is assumed that we have value oracle access to the functions $F_i$, but the coefficients $w_i$ are unknown, and $f$ can only be accessed via noisy queries. To harness the linear structure, we develop algorithms inspired by adaptive allocation algorithms for best-arm identification in linear bandits, with approximation guarantees arbitrarily close to the setting where we have value oracle access to $f$. Our approach efficiently leverages information from prior samples, offering a significant improvement in sample efficiency. Experimental results on both synthetic datasets and real-world datasets demonstrate the superior performance of our method compared to baseline algorithms, particularly in terms of sample efficiency.
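When the coefficients $w_i$ are known and exact value oracles are available, the underlying combinatorial routine is the standard greedy algorithm on $f=\sum_i w_i F_i$. A minimal sketch (the paper's algorithms replace the exact marginal gains below with adaptively allocated noisy estimates):

```python
def greedy_linear_submodular(ground, fs, w, k):
    """Greedy maximization of f(S) = sum_i w_i * F_i(S) subject to |S| <= k,
    assuming exact value oracles for the F_i and known weights w."""
    S = set()
    for _ in range(k):
        def gain(e):
            return sum(wi * (Fi(S | {e}) - Fi(S)) for wi, Fi in zip(w, fs))
        S.add(max(ground - S, key=gain))
    return S

# Toy coverage instance: F(S) = number of items covered by the chosen sets.
cover = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c"}}

def F(S):
    return len(set().union(*(cover[e] for e in S))) if S else 0

S = greedy_linear_submodular({0, 1, 2}, [F], [1.0], k=2)
```

For monotone submodular $f$, this greedy rule enjoys the classic $(1-1/e)$ approximation guarantee, which is the baseline the bandit algorithms aim to approach with noisy queries.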
Poster
Joseph Lazzaro · Ciara Pike-Burke
[ Hall A-E ]
Abstract
We study the piecewise constant bandit problem where the expected reward is a piecewise constant function with one change point (discontinuity) across the action space $[0,1]$ and the learner's aim is to locate the change point. Under the assumption of a fixed exploration budget, we provide the first non-asymptotic analysis of policies designed to locate abrupt changes in the mean reward function under bandit feedback. We study the problem under a large and small budget regime, and for both settings establish lower bounds on the error probability and provide algorithms with near matching upper bounds. Interestingly, our results show a separation in the complexity of the two regimes. We then propose a regime adaptive algorithm which is near optimal for both small and large budgets simultaneously. We complement our theoretical analysis with experimental results in simulated environments to support our findings.
Poster
Jie Xu · Zihan Wu · Cong Wang · Xiaohua Jia
[ Hall A-E ]
Abstract
To address the growing demand for privacy protection in machine learning, we propose an efficient and exact machine unlearning method for Large Models, called LMEraser. LMEraser takes a divide-and-conquer strategy with an adaptive prompt tuning mechanism to isolate data influence effectively. The training dataset is partitioned into public and private datasets. Public data are used to train the backbone of the model. Private data are clustered based on their diversity, and each cluster tunes a tailored prompt independently. This approach enables targeted unlearning by updating affected prompts, significantly reduces unlearning costs and maintains high model performance. Evaluations show that LMEraser reduces unlearning costs by 100 times compared to prior work without compromising model utility.
Poster
Ashwin Samudre · Mircea Petrache · Brian Nord · Shubhendu Trivedi
[ Hall A-E ]
Abstract
There has been much recent interest in designing neural networks (NNs) with relaxed equivariance, which interpolate between exact equivariance and full flexibility for consistent performance gains. In a separate line of work, structured parameter matrices with low displacement rank (LDR)---which permit fast function and gradient evaluation---have been used to create compact NNs, though primarily benefiting classical convolutional neural networks (CNNs). In this work, we propose a framework based on symmetry-based structured matrices to build approximately equivariant NNs with fewer parameters. Our approach unifies the aforementioned areas using Group Matrices (GMs), a forgotten precursor to the modern notion of regular representations of finite groups. GMs allow the design of structured matrices similar to LDR matrices, which can generalize all the elementary operations of a CNN from cyclic groups to arbitrary finite groups. We show GMs can also generalize classical LDR theory to general discrete groups, enabling a natural formalism for approximate equivariance. We test GM-based architectures on various tasks with relaxed symmetry and find that our framework performs competitively with approximately equivariant NNs and other structured matrix-based methods, often with one to two orders of magnitude fewer parameters.
Poster
Ruihan Xu · Yiping Lu
[ Hall A-E ]
Abstract
Iterative sketching and sketch-and-precondition are well-established randomized algorithms for solving large-scale over-determined linear least-squares problems. In this paper, we introduce a new perspective that interprets Iterative Sketching and Sketching-and-Precondition as forms of Iterative Refinement. We also examine the numerical stability of two distinct refinement strategies: iterative refinement and recursive refinement, which progressively improve the accuracy of a sketched linear solver. Building on this insight, we propose a novel algorithm, Sketched Iterative and Recursive Refinement (SIRR), which combines both refinement methods. SIRR aims to develop a stable randomized algorithm for least-squares problems, ensuring that the computed solution exactly solves a modified least-squares system, where the coefficient matrix deviates only slightly from the original matrix. To the best of our knowledge, SIRR is the first asymptotically fast, single-stage randomized least-squares solver that achieves both forward and backward stability.
Poster
Léo Dana · Muni Sreenivas Pydi · Yann Chevaleyre
[ Hall A-E ]
Abstract
Recent research has explored the memorization capacity of multi-head attention, but these findings are constrained by unrealistic limitations on the context size. We present a novel proof for language-based Transformers that extends the current hypothesis to any context size. Our approach improves upon the state of the art by achieving more effective exact memorization with an attention layer, while also introducing the concept of approximate memorization of distributions. Through experimental validation, we demonstrate that our proposed bounds more accurately reflect the true memorization capacity of language models, and provide a precise comparison with prior work.
Poster
Mathieu Bazinet · Valentina Zantedeschi · Pascal Germain
[ Hall A-E ]
Abstract
The sample compression theory provides generalization guarantees for predictors that can be fully defined using a subset of the training dataset and a (short) message string, generally defined as a binary sequence. Previous works provided generalization bounds for the zero-one loss, which is restrictive notably when applied to deep learning approaches. In this paper, we present a general framework for deriving new sample compression bounds that hold for real-valued unbounded losses. Using the Pick-To-Learn (P2L) meta-algorithm, which transforms the training method of any machine-learning predictor to yield sample-compressed predictors, we empirically demonstrate the tightness of the bounds and their versatility by evaluating them on random forests and multiple types of neural networks.
Poster
Zhaolin Ren · Runyu Zhang · Bo Dai · Na Li
[ Hall A-E ]
Abstract
Network Markov Decision Processes (MDPs), which are the de-facto model for multi-agent control, pose a significant challenge to efficient learning caused by the exponential growth of the global state-action space with the number of agents. In this work, utilizing the exponential decay property of network dynamics, we first derive scalable spectral local representations for multiagent reinforcement learning in network MDPs, which induces a network linear subspace for the local $Q$-function of each agent. Building on these local spectral representations, we design a scalable algorithmic framework for multiagent reinforcement learning in continuous state-action network MDPs, and provide end-to-end guarantees for the convergence of our algorithm. Empirically, we validate the effectiveness of our scalable representation-based approach on two benchmark problems, and demonstrate the advantages of our approach over generic function approximation approaches to representing the local $Q$-functions.
Poster
Marco Miani · Hrittik Roy · Soren Hauberg
[ Hall A-E ]
Abstract
Bayesian deep learning all too often underfits so that the Bayesian prediction is less accurate than a simple point estimate. Uncertainty quantification then comes at the cost of accuracy. For linearized models, the null space of the generalized Gauss-Newton matrix corresponds to parameters that preserve the training predictions of the point estimate. We propose to build Bayesian approximations in this null space, thereby guaranteeing that the Bayesian predictive does not underfit. We suggest a matrix-free algorithm for projecting onto this null space, which scales linearly with the number of parameters and quadratically with the number of output dimensions. We further propose an approximation that only scales linearly with parameters to make the method applicable to generative models. An extensive empirical evaluation shows that the approach scales to large models, including vision transformers with 28 million parameters. Code is available at: https://github.com/h-roy/projected-bayes
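The key object is the projection onto the null space of a Jacobian-like matrix, which in the dense small-scale case is a one-line least-squares identity, $v \mapsto v - J^+ J v$. A minimal sketch (the paper's contribution is performing this matrix-free at scale, for the generalized Gauss-Newton matrix):

```python
import numpy as np

def project_to_nullspace(J, v):
    """Orthogonal projection of v onto null(J).

    lstsq returns the minimum-norm solution of J x = J v, i.e. x = J^+ J v,
    the component of v in the row space of J; subtracting it leaves the
    null-space component, which does not change J v.
    """
    x = np.linalg.lstsq(J, J @ v, rcond=None)[0]
    return v - x

rng = np.random.default_rng(0)
J = rng.standard_normal((3, 10))   # wide matrix: a 7-dimensional null space
v = rng.standard_normal(10)
p = project_to_nullspace(J, v)
```

Parameter perturbations in this null space leave $Jv$ unchanged, which is the dense analogue of "preserving the training predictions of the point estimate."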
Poster
Homer Durand · Gherardo Varando · Nathan Mankovich · Gustau Camps-Valls
[ Hall A-E ]
Abstract
We propose a regularisation strategy of classical machine learning algorithms rooted in causality that ensures robustness against distribution shifts. Building upon the anchor regression framework, we demonstrate how incorporating a straightforward regularisation term into the loss function of classical multivariate analysis algorithms, such as (orthonormalized) partial least squares, reduced-rank regression, and multiple linear regression, enables out-of-distribution generalisation. Our framework allows users to efficiently verify the compatibility of a loss function with the regularisation strategy. Estimators for selected algorithms are provided, showcasing consistency and efficacy in synthetic and real-world climate science problems. The empirical validation highlights the versatility of anchor regularisation, emphasizing its compatibility with multivariate analysis approaches and its role in enhancing replicability while guarding against distribution shifts. The extended anchor framework advances causal inference methodologies, addressing the need for reliable out-of-distribution generalisation.
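Anchor regression itself admits a compact implementation: transform the data with $W = I + (\sqrt{\gamma}-1)P_A$, where $P_A$ projects onto the span of the anchor variables, then run ordinary least squares. This is a minimal dense sketch of the anchor transform (the paper extends the regularisation to PLS, reduced-rank regression, and related multivariate methods):

```python
import numpy as np

def anchor_regression(X, y, A, gamma):
    """Anchor-regularized linear regression: OLS after the anchor transform
    W = I + (sqrt(gamma) - 1) * P_A, with P_A the projection onto span(A).
    gamma = 1 recovers plain OLS; large gamma guards against shifts
    generated through the anchors."""
    P = A @ np.linalg.pinv(A)
    W = np.eye(len(y)) + (np.sqrt(gamma) - 1.0) * P
    return np.linalg.lstsq(W @ X, W @ y, rcond=None)[0]

rng = np.random.default_rng(0)
A = rng.standard_normal((80, 2))                     # anchors (exogenous)
X = rng.standard_normal((80, 3)) + A @ rng.standard_normal((2, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(80)
b_ols = anchor_regression(X, y, A, gamma=1.0)
b_anchor = anchor_regression(X, y, A, gamma=10.0)
```

The single extra term in the loss, $(\gamma - 1)\lVert P_A(y - Xb)\rVert^2$, is the "straightforward regularisation term" the abstract refers to.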
Poster
Shirshendu Chatterjee · Soumendu Sundar Mukherjee · Tamojit Sadhukhan
[ Hall A-E ]
Abstract
We consider the offline changepoint estimation problem in the context of multilayer stochastic block models. We develop an algorithm involving suitably chosen CUSUM statistics based on the adjacency matrices of the observed networks for estimating a single changepoint present in the input data. We provide rigorous theoretical guarantees on the performance of the proposed method when one or more of the following phenomena occur at the changepoint: (a) merging of communities, (b) splitting of communities, and (c) changes in the connection probabilities among the communities. We derive a lower bound on the minimax detectability threshold involving the relevant signal strength parameter and show that the proposed algorithm can estimate the changepoint consistently when the signal strength is above a small multiplicative factor times the minimax detectability threshold. We do not make any a priori assumption on the sparsity of the underlying networks and only require that the overall average degree goes to infinity. Via simulation experiments, we empirically show that the proposed algorithm works in regimes of signal strength where global network changepoint estimation algorithms that do not take into account the community structure, fail to estimate an existing changepoint correctly. Finally, we apply our algorithm to a series of …
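The scalar analogue of the CUSUM construction makes the idea concrete: scan every split point and take the argmax of the standardized difference of means. The paper applies this statistic to adjacency matrices of multilayer networks rather than to a scalar sequence; names below are ours:

```python
import numpy as np

def cusum_changepoint(x):
    """Estimate a single changepoint as the argmax over t of the CUSUM
    statistic sqrt(t * (n - t) / n) * |mean(x[:t]) - mean(x[t:])|."""
    n = len(x)
    stats = np.zeros(n)
    for t in range(1, n):
        stats[t] = np.sqrt(t * (n - t) / n) * abs(x[:t].mean() - x[t:].mean())
    return int(np.argmax(stats))

rng = np.random.default_rng(0)
# A mean shift of two noise standard deviations at index 150.
x = np.concatenate([rng.normal(0.0, 0.5, 150), rng.normal(1.0, 0.5, 150)])
cp = cusum_changepoint(x)
```

The weighting $\sqrt{t(n-t)/n}$ balances the variances of the two sample means, so the statistic peaks at the true changepoint when the signal strength is above the detectability threshold.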
Poster
Baris Askin · Pranay Sharma · Gauri Joshi · Carlee Joe-Wong
[ Hall A-E ]
Abstract
We study a federated version of multi-objective optimization (MOO), where a single model is trained to optimize multiple objective functions. MOO has been extensively studied in the centralized setting but is less explored in federated or distributed settings. We propose FedCMOO, a novel communication-efficient federated multi-objective optimization (FMOO) algorithm that improves the error convergence performance of the model compared to existing approaches. Unlike prior works, the communication cost of FedCMOO does not scale with the number of objectives, as each client sends a single aggregated gradient, obtained using randomized SVD (singular value decomposition), to the central server. We provide a convergence analysis of the proposed method for smooth non-convex objective functions under milder assumptions than in prior work. In addition, we introduce a variant of FedCMOO that allows users to specify a preference over the objectives in terms of a desired ratio of the final objective values. Through extensive experiments, we demonstrate the superiority of our proposed method over baseline approaches.
Poster
Ayoub Ghriss · Claire Monteleoni
[ Hall A-E ]
Abstract
We propose a novel approach for optimizing the graph ratio-cut by modeling the binary assignments as random variables. We provide an upper bound on the expected ratio-cut, as well as an unbiased estimate of its gradient, to learn the parameters of the assignment variables in an online setting. The resulting clustering obtained through our probabilistic approach (PRCut) outperforms the Rayleigh quotient relaxation of the combinatorial problem, its online learning extensions, and several widely used methods. We demonstrate that the PRCut clustering has high fidelity to the similarity measure and can perform as well as a supervised classifier when label-based similarities are provided. This novel approach can leverage out-of-the-box self-supervised representations to achieve competitive performance and serve as an evaluation method for the quality of these representations.
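For readers unfamiliar with the objective being relaxed, the ratio-cut of a binary partition can be computed directly from a weighted adjacency matrix. The sketch below (names and the toy graph are illustrative) evaluates the combinatorial objective that PRCut optimizes in expectation:

```python
import numpy as np

def ratio_cut(W, labels):
    """Ratio-cut of a binary partition: cut(A, B) * (1/|A| + 1/|B|).
    W is a symmetric weighted adjacency matrix; labels are 0/1 assignments."""
    labels = np.asarray(labels)
    a = labels == 0
    b = labels == 1
    cut = W[np.ix_(a, b)].sum()                 # total weight crossing the partition
    return cut * (1.0 / a.sum() + 1.0 / b.sum())

# Two clean 2-node clusters joined by a single weak edge of weight 0.1
W = np.array([
    [0.0, 1.0, 0.1, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])
print(ratio_cut(W, [0, 0, 1, 1]))  # the natural partition cuts only the weak edge
```

The natural partition scores 0.1, while any partition that splits a cluster scores much higher, which is exactly what a ratio-cut minimizer exploits.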
Poster
Aoran Zhang · Wenbin Zhou · Liyan Xie · Shixiang Zhu
[ Hall A-E ]
Abstract
Time series data are crucial across diverse domains such as finance and healthcare, where accurate forecasting and decision-making rely on advanced modeling techniques. While generative models have shown great promise in capturing the intricate dynamics inherent in time series, evaluating their performance remains a major challenge. Traditional evaluation metrics fall short due to the temporal dependencies and potential high dimensionality of the features. In this paper, we propose the REcurrent NeurAL (RENAL) Goodness-of-Fit test, a novel and statistically rigorous framework for evaluating generative time series models. By leveraging recurrent neural networks, we transform the time series into conditionally independent data pairs, enabling the application of a chi-square-based goodness-of-fit test to the temporal dependencies within the data. This approach offers a robust, theoretically grounded solution for assessing the quality of generative models, particularly in settings with limited time sequences. We demonstrate the efficacy of our method across both synthetic and real-world datasets, outperforming existing methods in terms of reliability and accuracy. Our method fills a critical gap in the evaluation of time series generative models, offering a tool that is both practical and adaptable to high-stakes applications.
Poster
Shengbo Wang · NIAN SI · Jose Blanchet · Zhengyuan Zhou
[ Hall A-E ]
Abstract
We explore the control of stochastic systems with potentially continuous state and action spaces, characterized by the state dynamics $X_{t+1} = f(X_t, A_t, W_t)$. Here, $X$, $A$, and $W$ represent the state, action, and exogenous random noise processes, respectively, with $f$ denoting a known function that describes state transitions. Traditionally, the noise process $\{W_t, t \geq 0\}$ is assumed to be independent and identically distributed, with a distribution that is either fully known or can be consistently estimated. However, the occurrence of distributional shifts, typical in engineering settings, necessitates the consideration of the robustness of the policy. This paper introduces a distributionally robust stochastic control paradigm that accommodates possibly adaptive adversarial perturbation to the noise distribution within a prescribed ambiguity set. We examine two adversary models: current-action-aware and current-action-unaware, leading to different dynamic programming equations. Furthermore, we characterize the optimal finite sample minimax rates for achieving uniform learning of the robust value function across continuum states under both adversary types, considering ambiguity sets defined by $f_k$-divergence and Wasserstein distance. Finally, we demonstrate the applicability of our framework across various real-world settings.
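As a toy illustration of the dynamics $X_{t+1} = f(X_t, A_t, W_t)$ and of why a shifted noise distribution matters, the sketch below rolls out a scalar linear system under a nominal noise model and under a perturbed one. The system, policy, and cost are invented for illustration; they are not the paper's setting:

```python
import numpy as np

def rollout(policy, noise_sampler, x0=0.0, horizon=50, seed=0):
    """Simulate X_{t+1} = f(X_t, A_t, W_t) for a toy scalar system
    f(x, a, w) = 0.9*x + a + w and return the accumulated quadratic cost."""
    rng = np.random.default_rng(seed)
    x, cost = x0, 0.0
    for _ in range(horizon):
        a = policy(x)
        w = noise_sampler(rng)
        cost += x**2 + 0.1 * a**2
        x = 0.9 * x + a + w
    return cost

policy = lambda x: -0.5 * x                      # a simple stabilizing feedback
nominal = lambda rng: rng.normal(0.0, 1.0)       # assumed noise model
shifted = lambda rng: rng.normal(0.5, 1.5)       # a perturbed distribution in an ambiguity set

print(rollout(policy, nominal), rollout(policy, shifted))
```

A policy tuned to the nominal model typically accrues a noticeably larger cost under the shifted distribution, which is the degradation a distributionally robust policy guards against.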
Poster
Filippo Palomba · Andrea Pugnana · Jose M. Alvarez · Salvatore Ruggieri
[ Hall A-E ]
Abstract
Deferring systems extend supervised Machine Learning (ML) models with the possibility to defer predictions to human experts. However, evaluating the impact of a deferring strategy on system accuracy is still an overlooked area. This paper fills this gap by evaluating deferring systems through a causal lens. We link the potential outcomes framework for causal inference with deferring systems, which allows us to identify the causal impact of the deferring strategy on predictive accuracy. We distinguish two scenarios. In the first one, we have access to both the human and ML model predictions for the deferred instances. Here, we can identify the individual causal effects for deferred instances and their aggregates. In the second one, only human predictions are available for the deferred instances. Here, we can resort to regression discontinuity design to estimate a local causal effect. We evaluate our approach on synthetic and real datasets for seven deferring systems from the literature.
Poster
Ossi Räisä · Antti Honkela
[ Hall A-E ]
Abstract
Recent studies have highlighted the benefits of generating multiple synthetic datasets for supervised learning, from increased accuracy to more effective model selection and uncertainty estimation. These benefits have clear empirical support, but the theoretical understanding of them is currently very limited. We seek to increase the theoretical understanding by deriving bias-variance decompositions for several settings of using multiple synthetic datasets, including differentially private synthetic data. Our theory yields a simple rule of thumb to select the appropriate number of synthetic datasets in the case of mean-squared error and Brier score. We investigate how our theory works in practice with several real datasets, downstream predictors and error metrics. As our theory predicts, multiple synthetic datasets often improve accuracy, while a single large synthetic dataset gives at best minimal improvement, showing that our insights are practically relevant.
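The mechanism behind the bias-variance argument can be seen in a minimal sketch: averaging predictors trained on independent synthetic datasets shrinks the variance component of the error. The "predictor" below is deliberately trivial (a sample mean) and the data generator is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu = 2.0

def fit_mean(y):
    """Trivial 'predictor': the sample mean of one (synthetic) dataset."""
    return y.mean()

def synthetic_dataset(n=20):
    """Stand-in for a synthetic data generator: a noisy copy of the real distribution."""
    return true_mu + rng.normal(0.0, 1.0, n)

# Ensemble over m synthetic datasets vs a single dataset of the same size
m = 50
preds = [fit_mean(synthetic_dataset()) for _ in range(m)]
ensemble_pred = np.mean(preds)
single_pred = fit_mean(synthetic_dataset())
print(abs(ensemble_pred - true_mu), abs(single_pred - true_mu))
```

Each individual predictor has variance of order 1/n; the ensemble of m predictors has variance of order 1/(mn), which is the variance-reduction term the decompositions formalize.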
Poster
Renpu Liu · Peng Wang · Donghao Li · Cong Shen · Jing Yang
[ Hall A-E ]
Abstract
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal technique for aligning artificial intelligence systems with human values, achieving remarkable success in fine-tuning large language models. However, existing RLHF frameworks often assume that human preferences are relatively homogeneous and can be captured by a single, unified reward model. This assumption overlooks the inherent diversity and heterogeneity across individuals, limiting the adaptability of RLHF to personalized scenarios and risking misalignments that can diminish user satisfaction and trust in AI systems. In this paper, we address these challenges by introducing Low-Rank Adaptation (LoRA) into the personalized RLHF framework. We apply LoRA in the parameter space of the aggregation of all personalized reward functions, thereby enabling efficient learning of personalized reward models from potentially limited local datasets. Our approach exploits potential shared structures among the local ground-truth reward models while allowing for individual adaptation, without relying on restrictive assumptions about shared representations as in prior works. We further establish sample complexity guarantees for our method. Theoretical analysis demonstrates the effectiveness of the proposed approach in capturing both shared and individual-specific structures within heterogeneous human preferences, addressing the dual challenge of personalization requirements and practical data constraints. Experimental results on real-world datasets …
Poster
Vishnu Teja Kunde · Vicram Rajagopalan · Chandra Shekhara Kaushik Valmeekam · Krishna Narayanan · Jean-Francois Chamberland · Dileep Kalathil · Srinivas Shakkottai
[ Hall A-E ]
Abstract
Pre-trained transformers exhibit the capability of adapting to new tasks through in-context learning (ICL), where they efficiently utilize a limited set of prompts without explicit model optimization. The canonical communication problem of estimating transmitted symbols from received observations can be modelled as an in-context learning problem: Received observations are essentially a noisy function of transmitted symbols, and this function can be represented by an unknown parameter whose statistics depend on an (also unknown) latent context. This problem, which we term in-context estimation (ICE), has significantly greater complexity than the extensively studied linear regression problem. The optimal solution to the ICE problem is a non-linear function of the underlying context. In this paper, we prove that, for a subclass of such problems, a single layer softmax attention transformer (SAT) computes the optimal solution of the above estimation problem in the limit of large prompt length. We also prove that the optimal configuration of such a transformer is indeed the minimizer of the corresponding training loss. Further, we empirically demonstrate the proficiency of multi-layer transformers in efficiently solving broader in-context estimation problems. Through extensive simulations, we show that solving ICE problems using transformers significantly outperforms standard approaches. Moreover, just with a few context …
Poster
Mingyu Kim · Jongwoo Ko · Mijung Park
[ Hall A-E ]
Abstract
Prompt learning is a popular fine-tuning method for vision-language models due to its efficiency. It requires a small number of additional learnable parameters while significantly enhancing performance on target tasks. However, most existing methods suffer from overfitting to fine-tuning data, yielding poor generalizability. To address this, we propose a new training objective function based on a Bayesian learning principle to balance adaptability and generalizability. We derive a prior over the logits, where the mean function is parameterized by the pre-trained model, while the posterior corresponds to the fine-tuned model. This objective establishes a balance by allowing the fine-tuned model to adapt to downstream tasks while remaining close to the pre-trained model. To avoid the overfitting issues of the standard softmax function, we adopt the one-vs-each softmax approximation along with its Pólya-Gamma augmentation (OVE-PG). We evaluate our method on several benchmark datasets and demonstrate that using the Bayesian principle for prompt learning is indeed a sensible choice. Code is available at https://github.com/ParkLabML/Bayesian_Principles_Improve_Prompt_Learning_In_Vision_Language_Models.
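The one-vs-each (OVE) approximation mentioned in the abstract bounds the softmax probability of a class by a product of pairwise sigmoids, $p_y \geq \prod_{j \neq y} \sigma(f_y - f_j)$ (Titsias, 2016). A minimal numeric check of the bound (the Pólya-Gamma augmentation itself is not shown, and the logits are made up):

```python
import numpy as np

def softmax_prob(logits, y):
    """Standard softmax probability of class y (numerically stabilized)."""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return p[y]

def ove_prob(logits, y):
    """One-vs-each lower bound on the softmax probability:
    prod over j != y of sigmoid(f_y - f_j)."""
    diffs = logits[y] - np.delete(logits, y)
    return float(np.prod(1.0 / (1.0 + np.exp(-diffs))))

logits = np.array([2.0, 0.5, -1.0])
print(softmax_prob(logits, 0), ove_prob(logits, 0))  # OVE never exceeds the softmax
```

Because each pairwise term is a Bernoulli-style likelihood, the bound decomposes into binary problems that admit the Pólya-Gamma trick used in the paper.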
Poster
Anand Jerry George · Nicolas Macris
[ Hall A-E ]
Abstract
We present a class of diffusion-based algorithms to draw samples from high-dimensional probability distributions given their unnormalized densities. Ideally, our methods can transport samples from a Gaussian distribution to a specified target distribution in finite time. Our approach relies on the stochastic interpolants framework to define a time-indexed collection of probability densities that bridge a Gaussian distribution to the target distribution. Subsequently, we derive a diffusion process that obeys the aforementioned probability density at each time instant. Obtaining such a diffusion process involves solving certain Hamilton-Jacobi-Bellman PDEs. We solve these PDEs using the theory of forward-backward stochastic differential equations (FBSDE) together with machine learning-based methods. Through numerical experiments, we demonstrate that our algorithm can effectively draw samples from distributions that conventional methods struggle to handle.
Poster
Kapilan Balagopalan · Kwang-Sung Jun
[ Hall A-E ]
Abstract
We propose a novel linear bandit algorithm called LinMED (Linear Minimum Empirical Divergence), which is a linear extension of the MED algorithm that was originally designed for multi-armed bandits. LinMED is a randomized algorithm that admits a closed-form computation of the arm sampling probabilities, unlike the popular randomized algorithm called linear Thompson sampling. Such a feature proves useful for off-policy evaluation, where unbiased evaluation requires accurately computing the sampling probability. We prove that LinMED enjoys a near-optimal regret bound of $d\sqrt{n}$ up to logarithmic factors, where $d$ is the dimension and $n$ is the time horizon. We further show that LinMED enjoys a $\frac{d^2}{\Delta}\left(\log^2(n)\right)\log\left(\log(n)\right)$ problem-dependent regret, where $\Delta$ is the smallest suboptimality gap. Our empirical study shows that LinMED has competitive performance with the state-of-the-art algorithms.
Poster
Marta Campi · Guillaume Staerman · Gareth Peters · Tomoko Masui
[ Hall A-E ]
Abstract
Functional Isolation Forest (FIF) is a recent state-of-the-art Anomaly Detection (AD) algorithm designed for functional data. It relies on a tree partition procedure where an abnormality score is computed by projecting each curve observation on a drawn dictionary through a linear inner product. The linear inner product and the dictionary are a priori choices that highly influence the algorithm's performance and might lead to unreliable results, particularly with complex datasets. This work aims to target such challenges by introducing Signature Isolation Forest, a novel class of AD algorithms leveraging the signature transform arising from rough path theory. Our objective is to remove the constraints imposed by FIF by proposing two algorithms that specifically target the linearity of the FIF inner product and the choice of the dictionary. We provide several numerical experiments, including a real-world application benchmark showing the relevance of our methods.
Poster
Zifan LIU · Xinran Li · Shibo Chen · Gen Li · Jiashuo Jiang · Jun Zhang
[ Hall A-E ]
Abstract
Reinforcement learning (RL) has proven to perform well and be versatile in inventory control (IC). However, further improvement of RL algorithms in the IC domain is impeded by two limitations of online experience. First, online experience is expensive to acquire in real-world applications. Given the low sample efficiency of RL algorithms, it would take extensive time to collect enough data and train the RL policy to convergence. Second, online experience may not reflect the true demand due to the lost-sales phenomenon typical in IC, which makes the learning process more challenging. To address the above challenges, we propose a training framework that combines reinforcement learning with feedback graph (RLFG) and intrinsically motivated exploration (IME) to boost sample efficiency. In particular, we first leverage the MDP structure inherent in lost-sales IC problems and design the feedback graph (FG) tailored to lost-sales IC problems to generate abundant side experiences aiding in RL updates. Then we conduct a rigorous theoretical analysis of how the designed FG reduces the sample complexity of RL methods. Guided by these insights, we design an intrinsic reward to direct the RL agent to explore the parts of the state-action space with more side experiences, further exploiting FG’s capability. Experimental results …
Poster
Francesco Pezzicoli · Valentina Ros · Francois Landes · Marco Baity-Jesi
[ Hall A-E ]
Abstract
Class imbalance (CI) is a longstanding problem in machine learning, slowing down training and reducing performance. Although empirical remedies exist, it is often unclear which ones work best and when, due to the lack of an overarching theory. We address a common case of imbalance, that of anomaly (or outlier) detection. We provide a theoretical framework to analyze, interpret and address CI. It is based on an exact solution of the teacher-student perceptron model, through replica theory. Within this framework, one can distinguish several sources of CI: either intrinsic, train or test imbalance. Our analysis reveals that, depending on the specific problem setting, one source or another might dominate. We further find that the optimal train imbalance is generally different from 50%, with a nontrivial dependence on the intrinsic imbalance, the abundance of data and on the noise in the learning. Moreover, there is a crossover between a small-noise training regime, where results are independent of the noise level, and a high-noise regime, where performance quickly degrades with noise. Our results challenge some of the conventional wisdom on CI and pave the way for integrated approaches to the topic.
Poster
Avinandan Bose · Simon Du · Maryam Fazel
[ Hall A-E ]
Abstract
We study the problem of representational transfer in offline Reinforcement Learning (RL), where a learner has access to episodic data from a number of source tasks collected a priori, and aims to learn a shared representation to be used in finding a good policy for a target task. Unlike in online RL where the agent interacts with the environment while learning a policy, in the offline setting there cannot be such interactions in either the source tasks or the target task; thus multi-task offline RL can suffer from incomplete coverage. We propose an algorithm to compute pointwise uncertainty measures for the learnt representation in low-rank MDPs, and establish a data-dependent upper bound for the suboptimality of the learnt policy for the target task. Our algorithm leverages the collective exploration done by source tasks to mitigate poor coverage at some points by a few tasks, thus overcoming the limitation of needing uniformly good coverage for a meaningful transfer by existing offline algorithms. We complement our theoretical results with empirical evaluation on a rich-observation MDP which requires many samples for complete coverage. Our findings illustrate the benefits of penalizing and quantifying the uncertainty in the learnt representation.
Poster
Vorapong Suppakitpaisarn · Donlapark Ponnoprat · Nicha Hirankarn · Quentin Hillebrand
[ Hall A-E ]
Abstract
The problem of counting subgraphs or graphlets under local differential privacy is an important challenge that has attracted significant attention from researchers. However, much of the existing work focuses on small graphlets like triangles or $k$-stars. In this paper, we propose a non-interactive, locally differentially private algorithm capable of counting graphlets of any size $k$. When $n$ is the number of nodes in the input graph, we show that the expected $\ell_2$ error of our algorithm is $O(n^{k - 1})$. Additionally, we prove that there exists a class of input graphs and graphlets of size $k$ for which any non-interactive counting algorithm incurs an expected $\ell_2$ error of $\Omega(n^{k - 1})$, demonstrating the optimality of our result. Furthermore, we establish that for certain input graphs and graphlets, any locally differentially private algorithm must have an expected $\ell_2$ error of $\Omega(n^{k - 1.5})$. Our experimental results show that our algorithm is more accurate than the classical randomized response method.
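The classical randomized response baseline mentioned at the end of the abstract perturbs each adjacency bit independently and then debiases the aggregate count. A minimal sketch of that baseline (Warner's mechanism with the standard $\varepsilon$-LDP keep probability; the toy sizes are illustrative):

```python
import numpy as np

def randomized_response_edges(adj_row, epsilon, rng):
    """Warner's randomized response on a node's adjacency bits:
    each bit is kept with probability e^eps / (1 + e^eps), flipped otherwise."""
    p_keep = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    flips = rng.random(adj_row.shape) > p_keep
    return np.where(flips, 1 - adj_row, adj_row)

def debias_count(noisy_count, n_bits, epsilon):
    """Unbiased estimate of the true number of 1-bits after randomized response:
    E[noisy] = k*(2p - 1) + (1 - p)*n, solved for k."""
    p = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return (noisy_count - (1 - p) * n_bits) / (2 * p - 1)

rng = np.random.default_rng(0)
row = np.zeros(10000, dtype=int)
row[:3000] = 1                                   # 3000 true edges
noisy = randomized_response_edges(row, epsilon=1.0, rng=rng)
print(debias_count(noisy.sum(), row.size, epsilon=1.0))
```

For larger graphlets the noise compounds across many bits, which is why the error scales as a polynomial in $n$ and why the paper's tailored algorithm matters.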
Poster
Daniele Bracale · Subha Maity · Yuekai Sun · Moulinath Banerjee
[ Hall A-E ]
Abstract
In numerous predictive scenarios, the predictive model affects the sampling distribution; for example, job applicants often meticulously craft their resumes to navigate through a screening system. Such shifts in distribution are particularly prevalent in social computing, yet strategies to learn these shifts from data remain remarkably limited. Inspired by a microeconomic model that adeptly characterizes agents' behavior within labor markets, we introduce a novel approach to learning the distribution shift. Our method is predicated on a \emph{reverse causal model}, wherein the predictive model instigates a distribution shift exclusively through a finite set of agents' actions. Within this framework, we employ a microfoundation model for the agents' actions and develop a statistically justified methodology to learn the distribution shift map, which we demonstrate to effectively minimize the performative prediction risk.
Poster
Meltem Tatlı · Arpan Mukherjee · Prashanth L.A. · Karthikeyan Shanmugam · Ali Tajer
[ Hall A-E ]
Abstract
This paper introduces a general framework for risk-sensitive bandits that integrates the notions of risk-sensitive objectives by adopting a rich class of {\em distortion riskmetrics}. The introduced framework subsumes the various existing risk-sensitive models. An important and hitherto unknown observation is that for a wide range of riskmetrics, the optimal bandit policy involves selecting a \emph{mixture} of arms. This is in sharp contrast to the convention in multi-armed bandit algorithms that there is generally a \emph{solitary} arm that maximizes the utility, whether purely reward-centric or risk-sensitive. This creates a major departure from the principles for designing bandit algorithms since there are uncountable mixture possibilities. The contributions of the paper are as follows: (i) it formalizes a general framework for risk-sensitive bandits, (ii) identifies standard risk-sensitive bandit models for which solitary arm selection is not optimal, and (iii) designs regret-efficient algorithms whose sampling strategies can accurately track optimal arm mixtures (when mixture is optimal) or the solitary arms (when solitary is optimal). The algorithms are shown to achieve a regret that scales according to $O((\log T/T )^{\nu})$, where $T$ is the horizon, and $\nu>0$ is a riskmetric-specific constant.
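A distortion riskmetric reweights the quantiles of a reward distribution through a distortion function $g$. The sketch below computes the standard empirical version $\rho_g(X) = \sum_i x_{(i)} \left(g(S_{i-1}) - g(S_i)\right)$ on sorted samples, recovering the mean and an upper-tail CVaR as special cases (the exact normalization used in the paper may differ):

```python
import numpy as np

def distortion_riskmetric(samples, g):
    """Empirical distortion riskmetric rho_g(X) = sum_i x_(i) * (g(S_{i-1}) - g(S_i)),
    where x_(1) <= ... <= x_(n) are sorted samples and S_i = (n - i)/n is the
    empirical survival function."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    surv = (n - np.arange(0, n + 1)) / n          # S_0 = 1, ..., S_n = 0
    weights = g(surv[:-1]) - g(surv[1:])
    return float(np.sum(weights * x))

mean_g = lambda u: u                              # g(u) = u recovers the mean
cvar_g = lambda u: np.minimum(u / 0.25, 1.0)      # upper-tail CVaR at level 0.25

x = np.array([1.0, 2.0, 3.0, 4.0])
print(distortion_riskmetric(x, mean_g))   # 2.5
print(distortion_riskmetric(x, cvar_g))   # 4.0, the mean of the best 25% of outcomes
```

Because $\rho_g$ is generally nonlinear in the reward distribution, mixing arms changes the distribution being distorted, which is the source of the paper's observation that mixtures can strictly beat any solitary arm.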
Poster
Tim Rensmeyer · Oliver Niggemann
[ Hall A-E ]
Abstract
Achieving robust uncertainty quantification for deep neural networks represents an important requirement in many real-world applications of deep learning, such as medical imaging, where it is necessary to assess the reliability of a neural network's prediction. Bayesian neural networks are a promising approach for modeling uncertainties in deep neural networks. Unfortunately, generating samples from the posterior distribution of neural networks is a major challenge. One significant advance in that direction would be the incorporation of adaptive step sizes, similar to modern neural network optimizers, into Markov chain Monte Carlo sampling algorithms without significantly increasing computational demand. Over the past years, several papers have introduced sampling algorithms with corresponding theorems stating that they achieve this property. In this paper, we demonstrate that these methods can have a substantial bias in the distribution they sample, even in the limit of vanishing step sizes and at full batch size. Furthermore, for most of the algorithms, we show that convergence to the correct distribution can be restored with a simple fix at the cost of increased computational demand.
Poster
Leonardo Martins Bianco · Christine Keribin · Zacharie Naulet
[ Hall A-E ]
Abstract
Community detection is a fundamental task in graph analysis, with methods often relying on fitting models like the Stochastic Block Model (SBM) to observed networks. While many algorithms can accurately estimate SBM parameters when the input graph is a perfect sample from the model, real-world graphs rarely conform to such idealized assumptions. Therefore, robust algorithms are crucial—ones that can recover model parameters even when the data deviates from the assumed distribution. In this work, we propose SubSearch, an algorithm for robustly estimating SBM parameters by exploring the space of subgraphs in search of one that closely aligns with the model’s assumptions. Our approach also functions as an outlier detection method, properly identifying nodes responsible for the graph’s deviation from the model and going beyond simple techniques like pruning high-degree nodes. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of our method.
Poster
Saptarshi Roy · Sunrit Chakraborty · Debabrota Basu
[ Hall A-E ]
Abstract
High dimensional sparse linear bandits serve as an efficient model for sequential decision-making problems (e.g. personalized medicine), where high dimensional features (e.g. genomic data) on the users are available, but only a small subset of them are relevant. Motivated by data privacy concerns in these applications, we study the joint differentially private high dimensional sparse linear bandits, where both rewards and contexts are considered as private data. First, to quantify the cost of privacy, we derive a lower bound on the regret achievable in this setting. To further address the problem, we design a computationally efficient bandit algorithm, **F**orgetfu**L** **I**terative **P**rivate **HA**rd **T**hresholding (FLIPHAT). Along with doubling of episodes and episodic forgetting, FLIPHAT deploys a variant of Noisy Iterative Hard Thresholding (N-IHT) algorithm as a sparse linear regression oracle to ensure both privacy and regret-optimality. We show that FLIPHAT achieves optimal regret in terms of privacy parameters, context dimension, and time horizon up to a linear factor in model sparsity in the problem independent case. We analyze the regret by providing a novel refined analysis of the estimation error of N-IHT, which is of independent interest.
Poster
Arthur Pignet · Chiara Regniez · John Klein
[ Hall A-E ]
Abstract
Despite the increasing demand for safer machine learning practices, the use of Uncertainty Quantification (UQ) methods in production remains limited. This limitation is exacerbated by the challenge of validating UQ methods in the absence of UQ ground truth. In classification tasks, when only a usual set of test data is at hand, several authors have suggested different metrics that can be computed from such test points to assess the quality of quantified uncertainties. This paper investigates such metrics and proves that they are theoretically well-behaved and actually tied to some uncertainty ground truth which is easily interpretable in terms of model prediction trustworthiness ranking. Equipped with those new results, and given the applicability of those metrics in the usual supervised paradigm, we argue that our contributions will help promote a broader use of UQ in deep learning.
Poster
Tom Kempton · Stuart Burrell · Connor Cheverall
[ Hall A-E ]
Abstract
Existing methods for the zero-shot detection of machine-generated text are dominated by three statistical quantities: log-likelihood, log rank, and entropy. As language models mimic the distribution of human text ever closer, this will limit our ability to build effective detection algorithms. To combat this, we introduce a method for detecting machine-generated text that is entirely agnostic of the generating language model. This is achieved by targeting a defect in the way that decoding strategies, such as temperature or top-k sampling, normalize conditional probability measures. This method can be rigorously theoretically justified, is easily explainable, and is conceptually distinct from existing methods for detecting machine-generated text. We evaluate our detector in the white and black box settings across various language models, datasets, and passage lengths. We also study the effect of paraphrasing attacks on our detector and the extent to which it is biased against non-native speakers. In each of these settings, the performance of our test is at least comparable to that of other state-of-the-art text detectors, and in some cases, we strongly outperform these baselines.
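The three statistics named at the start of the abstract are all simple functionals of a model's next-token distributions. The sketch below computes them from a toy probability table (array shapes and the averaging convention are assumptions; real detectors apply these to language-model outputs):

```python
import numpy as np

def detection_statistics(token_probs, observed_ids):
    """Per-token statistics used by zero-shot detectors: average log-likelihood,
    average log-rank of the observed token, and average predictive entropy.
    token_probs: (T, V) array of next-token distributions; observed_ids: length-T ints."""
    T = len(observed_ids)
    ll, lr, ent = 0.0, 0.0, 0.0
    for t, y in enumerate(observed_ids):
        p = token_probs[t]
        ll += np.log(p[y])
        rank = 1 + np.sum(p > p[y])              # rank 1 = most likely token
        lr += np.log(rank)
        ent += -np.sum(p * np.log(p))
    return ll / T, lr / T, ent / T

# Toy 3-token vocabulary over 2 positions
probs = np.array([[0.7, 0.2, 0.1],
                  [0.5, 0.3, 0.2]])
print(detection_statistics(probs, [0, 1]))
```

The paper's model-agnostic detector deliberately avoids these quantities, instead exploiting how decoding strategies renormalize the conditional distributions.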
Poster
Xinyang Liu · Hengrong Du · Wei Deng · Ruqi Zhang
[ Hall A-E ]
Abstract
Hutchinson estimators are widely employed in training divergence-based likelihoods for diffusion models to ensure optimal transport (OT) properties. However, this estimator often suffers from high variance and scalability concerns. To address these challenges, we investigate Hutch++, an optimal stochastic trace estimator for generative models, designed to minimize training variance while maintaining transport optimality. Hutch++ is particularly effective for handling ill-conditioned matrices with large condition numbers, which commonly arise when high-dimensional data exhibits a low-dimensional structure. To mitigate the need for frequent and costly QR decompositions, we propose practical schemes that balance frequency and accuracy, backed by theoretical guarantees. Our analysis demonstrates that Hutch++ leads to generations of higher quality. Furthermore, this method exhibits effective variance reduction in various applications, including simulations, conditional time series forecasts, and image generation.
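The contrast between the plain Hutchinson estimator and Hutch++ can be shown in a few lines: Hutch++ spends part of its matrix-vector budget on a low-rank sketch and runs Hutchinson only on the deflated remainder (Meyer et al., 2021). This is a minimal standalone sketch on a synthetic ill-conditioned PSD matrix, not the paper's training integration:

```python
import numpy as np

def hutchinson(matvec, n, m, rng):
    """Plain Hutchinson trace estimator with m Rademacher probes."""
    Z = rng.choice([-1.0, 1.0], size=(n, m))
    return np.sum(Z * matvec(Z)) / m

def hutchpp(matvec, n, m, rng):
    """Hutch++: spend m/3 probes on a range sketch of A, take the exact trace on
    that subspace, then run Hutchinson on the deflated remainder."""
    k = m // 3
    S = rng.choice([-1.0, 1.0], size=(n, k))
    Q, _ = np.linalg.qr(matvec(S))               # orthonormal basis for the sketch
    G = rng.choice([-1.0, 1.0], size=(n, k))
    G = G - Q @ (Q.T @ G)                        # project probes off the captured subspace
    return np.trace(Q.T @ matvec(Q)) + np.sum(G * matvec(G)) / k

rng = np.random.default_rng(0)
n = 200
eigs = 1.0 / (1 + np.arange(n))**2               # fast-decaying spectrum (low-rank structure)
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
A = (U * eigs) @ U.T
matvec = lambda X: A @ X

true = np.trace(A)
print(abs(hutchinson(matvec, n, 60, rng) - true), abs(hutchpp(matvec, n, 60, rng) - true))
```

On spectra like this one, the sketch captures most of the trace exactly, so the residual Hutchinson term has far smaller variance than the plain estimator at the same matvec budget.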
Poster
Avinandan Bose · Laurent Lessard · Maryam Fazel · Krishnamurthy Dvijotham
[ Hall A-E ]
Abstract
The rise of foundation models fine-tuned on human feedback from potentially untrusted users has increased the risk of adversarial data poisoning, necessitating the study of robustness of learning algorithms against such attacks. Existing research on provable certified robustness against data poisoning attacks primarily focuses on certifying robustness for static adversaries that modify a fraction of the dataset used to train the model before the training algorithm is applied. In practice, particularly when learning from human feedback in an online sense, adversaries can observe and react to the learning process and inject poisoned samples that optimize adversarial objectives better than when they are restricted to poisoning a static dataset once before the learning algorithm is applied. Indeed, it has been shown in prior work that online dynamic adversaries can be significantly more powerful than static ones. We present a novel framework for computing certified bounds on the impact of dynamic poisoning, and use these certificates to design robust learning algorithms. We give an illustration of the framework for the mean-estimation problem and binary classification problems and outline directions for extending this in further work.
Poster
Fabian Fumagalli · Maximilian Muschalik · Eyke Hüllermeier · Barbara Hammer · Julia Herbinger
[ Hall A-E ]
Abstract
Feature-based explanations, using perturbations or gradients, are a prevalent tool to understand decisions of black box machine learning models. Yet, differences between these methods still remain mostly unknown, which limits their applicability for practitioners. In this work, we introduce a unified framework for local and global feature-based explanations using two well-established concepts: functional ANOVA (fANOVA) from statistics, and the notion of value and interaction from cooperative game theory. We introduce three fANOVA decompositions that determine the influence of feature distributions, and use game-theoretic measures, such as the Shapley value and interactions, to specify the influence of higher-order interactions. Our framework combines these two dimensions to uncover similarities and differences between a wide range of explanation techniques for features and groups of features. We then empirically showcase the usefulness of our framework on synthetic and real-world datasets.
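The game-theoretic side of the framework rests on the Shapley value, which is exactly computable for small coalitions. A minimal sketch with an invented three-player game whose interaction term mimics a higher-order feature interaction:

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values by enumerating all coalitions (feasible for few players).
    `value` maps a frozenset of players to the coalition's worth."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for r in range(n):
            for S in combinations(others, r):
                # Shapley weight |S|! (n - |S| - 1)! / n! times the marginal contribution
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += w * (value(frozenset(S) | {i}) - value(frozenset(S)))
        phi[i] = total
    return phi

# Toy game: additive worths plus a pairwise interaction between players 0 and 1
def v(S):
    base = {0: 1.0, 1: 2.0, 2: 0.5}
    bonus = 3.0 if {0, 1} <= S else 0.0
    return sum(base[p] for p in S) + bonus

print(shapley_values([0, 1, 2], v))
```

The interaction bonus is split evenly between players 0 and 1 by symmetry, illustrating how Shapley-style measures attribute higher-order effects that a purely additive fANOVA term would miss.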
Poster
Xinxing Shi · Thomas Baldwin-McDonald · Mauricio Álvarez
[ Hall A-E ]
Abstract
Deep Gaussian Processes (DGPs) leverage a compositional structure to model non-stationary processes. DGPs typically rely on local inducing point approximations across intermediate GP layers. Recent advances in DGP inference have shown that incorporating global Fourier features from the Reproducing Kernel Hilbert Space (RKHS) can enhance the DGPs' capability to capture complex non-stationary patterns. This paper extends the use of these features to compositional GPs involving linear transformations. In particular, we introduce Ordinary Differential Equation (ODE)-based RKHS Fourier features that allow for adaptive amplitude and phase modulation through convolution operations. This convolutional formulation relates our work to recently proposed deep latent force models, a multi-layer structure designed for modelling nonlinear dynamical systems. By embedding these adjustable RKHS Fourier features within a doubly stochastic variational inference framework, our model exhibits improved predictive performance across various regression tasks.
Poster
Woojin Chae · Kihyuk (Ki) Hong · Yufan Zhang · Ambuj Tewari · Dabeen Lee
[ Hall A-E ]
Abstract
This paper proposes a computationally tractable algorithm for learning infinite-horizon average-reward linear mixture Markov decision processes (MDPs) under the Bellman optimality condition. Our algorithm for linear mixture MDPs achieves a nearly minimax optimal regret upper bound of $\widetilde{\mathcal{O}}(d\sqrt{\mathrm{sp}(v^*)T})$ over $T$ time steps where $\mathrm{sp}(v^*)$ is the span of the optimal bias function $v^*$ and $d$ is the dimension of the feature mapping. Our algorithm applies the recently developed technique of running value iteration on a discounted-reward MDP approximation with clipping by the span. We prove that the value iteration procedure, even with the clipping operation, converges. Moreover, we show that the associated variance term due to random transitions can be bounded even under clipping. Combined with the weighted ridge regression-based parameter estimation scheme, this leads to the nearly minimax optimal regret guarantee.
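The span-clipping step can be illustrated on a toy tabular MDP. This is a hedged sketch only: the paper works with linear mixture MDPs and a specific discounted-reward approximation, while the snippet below applies the same clipping idea to ordinary tabular value iteration, with illustrative inputs `P`, `r`, `gamma`, and `span_bound` that are not from the paper.

```python
import numpy as np

def clipped_value_iteration(P, r, gamma, span_bound, n_iters=500):
    """Discounted value iteration with clipping by the span.

    P: (A, S, S) transition tensor, P[a, s, s'] = Pr(s' | s, a).
    r: (S, A) reward matrix.
    After each Bellman backup, values are clipped so that
    sp(V) = max(V) - min(V) never exceeds span_bound.
    """
    S, A = r.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        # Q[s, a] = r[s, a] + gamma * sum_{s'} P[a, s, s'] * V[s']
        Q = r + gamma * np.tensordot(P, V, axes=([2], [0])).T
        V_new = Q.max(axis=1)
        # clip by the span: pull large values down toward min(V) + span_bound
        V = np.minimum(V_new, V_new.min() + span_bound)
    return V, Q.argmax(axis=1)
```

The point of the construction, in the paper's setting, is that clipping does not destroy convergence of the iteration, and the associated variance terms can still be controlled.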
Poster
Liran Nochumsohn · Hedi Zisling · Omri Azencot
[ Hall A-E ]
Abstract
Accurate forecasting of multivariate time series data is important in many engineering and scientific applications. Recent state-of-the-art works ignore the inter-relations between variates, applying their models to each variate independently. This raises several research questions related to proper modeling of multivariate data. In this work, we propose to view multivariate forecasting as a multi-task learning problem, facilitating the analysis of forecasting by considering the angle between task gradients and their balance. To do so, we analyze linear models to characterize the behavior of tasks. Our analysis suggests that tasks can be defined by grouping similar variates together, which we achieve via a simple clustering that depends on correlation-based similarities. Moreover, to balance tasks, we scale gradients with respect to their prediction error. Then, each task is solved with a linear model within our MTLinear framework. We evaluate our approach on challenging benchmarks in comparison to strong baselines, and we show it obtains on-par or better results on multivariate forecasting problems.
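The correlation-based grouping step can be sketched as follows. This is an illustrative approximation, not the paper's exact MTLinear procedure: the helper `group_variates` and the threshold `tau` are hypothetical, and each resulting group would then be fit with its own linear model, with gradients scaled by prediction error as described above.

```python
import numpy as np

def group_variates(X, tau=0.8):
    """Greedily group variates (columns of X) whose absolute pairwise
    correlation exceeds tau; returns one integer group label per variate."""
    C = np.abs(np.corrcoef(X, rowvar=False))
    d = X.shape[1]
    labels = np.full(d, -1)
    g = 0
    for i in range(d):
        if labels[i] >= 0:
            continue                       # already assigned to a group
        labels[i] = g
        for j in range(i + 1, d):
            if labels[j] < 0 and C[i, j] >= tau:
                labels[j] = g              # join the group of variate i
        g += 1
    return labels
```

Each group then defines one forecasting task, so strongly correlated variates share a model while weakly related ones are decoupled.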
Poster
Daniele Bracale · Subha Maity · Felipe Maia Polo · Seamus Seamus · Moulinath Banerjee · Yuekai Sun
[ Hall A-E ]
Abstract
Often in prediction tasks, the predictive model itself can influence the distribution of the target variable, a phenomenon termed *performative prediction*. This influence often stems from strategic actions taken by stakeholders with a vested interest in predictive models. A key challenge that hinders the widespread adoption of performative prediction in machine learning is that practitioners are generally unaware of the social impacts of their predictions. To address this gap, we propose a methodology for learning the distribution map that encapsulates the long-term impacts of predictive models on the population. Specifically, we model agents' responses as a cost-adjusted utility maximization problem and propose estimates for said cost. Our approach leverages optimal transport to align pre-model exposure (*ex ante*) and post-model exposure (*ex post*) distributions. We provide a rate of convergence for this proposed estimate and assess its quality through empirical demonstrations on a credit scoring dataset.
Poster
Jayoung Ryu · Charlotte Bunne · Luca Pinello · Aviv Regev · Romain Lopez
[ Hall A-E ]
Abstract
It is now possible to conduct large-scale perturbation screens with complex readout modalities, such as different molecular profiles or high-content cell images. While these open the way for systematic dissection of causal cell circuits, integrating such data across screens to maximize our ability to predict circuits poses substantial computational challenges, which have not been addressed. Here, we extend two Gromov-Wasserstein optimal transport methods to incorporate the perturbation label for cross-modality alignment. The obtained alignment is then employed to train a predictive model that estimates cellular responses to perturbations observed with only one measurement modality. We validate our method for the tasks of cross-modality alignment and cross-modality prediction in a recent multi-modal single-cell perturbation dataset. Our approach opens the way to unified causal models of cell biology.
Poster
Jian Xu · Shian Du · Junmei Yang · Xinghao Ding · Delu Zeng · John Paisley
[ Hall A-E ]
Abstract
Gaussian processes have been used to model the vector field of continuous dynamical systems, which are characterized by a probabilistic ordinary differential equation (GP-ODE). Bayesian inference for these models has been extensively studied and applied in tasks such as time series prediction. However, the use of standard GPs with basic kernels like squared exponential kernels has been common in GP-ODE research, limiting the model's ability to represent complex scenarios. To address this limitation, we introduce normalizing flows to reparameterize the ODE vector field, resulting in a data-driven prior distribution, thereby increasing flexibility and expressive power. We develop a variational inference algorithm that utilizes analytically tractable probability density functions of normalizing flows. We also apply normalizing flows to the posterior inference of GP-ODEs to resolve the issue of strong mean-field assumptions. By applying normalizing flows in these ways, our model improves accuracy and uncertainty estimates for Bayesian GP-ODEs. We validate the effectiveness of our approach on simulated dynamical systems and real-world human motion data, including time series prediction and missing data recovery tasks.
Poster
Zhi Zhang · Chris Chow · Yasi Zhang · Yanchao Sun · Haochen Zhang · Eric Jiang · Han Liu · Furong Huang · Yuchen Cui · Oscar Madrid
[ Hall A-E ]
Abstract
Lifelong reinforcement learning (RL) has been developed as a paradigm for extending single-task RL to more realistic, dynamic settings. In lifelong RL, the "life" of an RL agent is modeled as a stream of tasks drawn from a task distribution. We propose EPIC (Empirical PAC-Bayes that Improves Continuously), a novel algorithm designed for lifelong RL using PAC-Bayes theory. EPIC learns a shared policy distribution, referred to as the world policy, which enables rapid adaptation to new tasks while retaining valuable knowledge from previous experiences. Our theoretical analysis establishes a relationship between the algorithm's generalization performance and the number of prior tasks preserved in memory. We also derive the sample complexity of EPIC in terms of RL regret. Extensive experiments on a variety of environments demonstrate that EPIC significantly outperforms existing methods in lifelong RL, offering both theoretical guarantees and practical efficacy through the use of the world policy.
Poster
Zichong Wang · Nhat Hoang · Xingyu Zhang · Kevin Bello · Xiangliang Zhang · Sitharama Iyengar · Wenbin Zhang
[ Hall A-E ]
Abstract
Fair Graph Neural Networks (GNNs) have been extensively studied in graph-based applications. However, most approaches to fair GNNs assume the full availability of demographic information by default, which is often unrealistic due to legal restrictions or privacy concerns, leaving a noticeable gap in methods for addressing bias under such constraints. To this end, we propose a novel method for fair graph learning without demographic information. Our approach leverages a Bayesian variational autoencoder to infer missing demographic information and uses disentangled latent variables to separately capture demographics-related and label-related information, reducing interference when inferring demographic proxies. Additionally, we incorporate a fairness regularizer that enables measuring model fairness without demographics while optimizing the fairness objective. Extensive experiments on three real-world graph datasets demonstrate the proposed method's effectiveness in improving both fairness and utility.
Poster
Peter Jacobs · Anirban Bhattacharya · Debdeep Pati · Lekha Patel · Jeff Phillips
[ Hall A-E ]
Abstract
We study the estimation of Zipfian distributions under $L_1$ loss, and provide near minimax optimal bounds in several regimes. Specifically, we assume observations arrive from a known alphabet, and with a known decay rate parametrizing the Zipfian, but we do not know a priori which alphabet elements have larger probability than others. We present a novel Sort and Snap estimator, which uses the empirical proportions to sort the alphabet, and then snaps them to the associated term from the Zipfian distribution. For arbitrary decay rates and smaller alphabet sizes, as well as for large decay rates and large alphabet sizes, we show an exact or minor variant of this estimator is near minimax optimal and has exponential improvement over the standard empirical proportion estimator. However, for small decay rates and larger alphabet sizes a simulation study indicates the standard empirical proportion estimator is competitive with Sort and Snap procedures. In addition to providing nearly tight bounds for important high-dimensional estimation problems, we believe the Sort and Snap estimator, and its analysis, will have independent interest.
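A minimal sketch of the Sort and Snap idea (hypothetical function name; the paper also analyzes a minor variant): sort the alphabet symbols by empirical frequency, then replace each empirical proportion with the corresponding probability of the known-decay Zipfian.

```python
import numpy as np

def sort_and_snap(counts, decay):
    """Snap empirical proportions onto a Zipf(decay) distribution.

    counts: observed counts per alphabet symbol.
    decay:  known Zipfian decay rate a, with j-th largest probability
            proportional to j ** (-a).
    """
    k = len(counts)
    zipf = np.arange(1, k + 1, dtype=float) ** (-decay)
    zipf /= zipf.sum()                               # sorted decreasingly
    order = np.argsort(-np.asarray(counts, dtype=float))
    est = np.empty(k)
    est[order] = zipf     # j-th most frequent symbol gets the j-th Zipf mass
    return est
```

Since the output is always an exact Zipfian, the only source of error is mis-ranking symbols, which is what drives the exponential improvement over the empirical proportion estimator in the regimes described above.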
Poster
Reda Marzouk · Shahaf Bassan · Guy Katz · De la Higuera
[ Hall A-E ]
Abstract
Recent studies have examined the computational complexity of computing Shapley additive explanations (also known as SHAP) across various models and distributions, revealing their tractability or intractability in different settings. However, these studies primarily focused on a specific variant called Conditional SHAP, though many other variants exist and address different limitations. In this work, we analyze the complexity of computing a much broader range of such variants, including Conditional, Interventional, and Baseline SHAP, while exploring both local and global computations. We show that both local and global Interventional and Baseline SHAP can be computed in polynomial time for various ML models under Hidden Markov Model distributions, extending popular algorithms such as TreeSHAP beyond empirical distributions. On the downside, we prove intractability results for these variants over a wide range of neural networks and tree ensembles. We believe that our results emphasize the intricate diversity of computing Shapley values, demonstrating how their complexity is substantially shaped by the specific SHAP variant, the model type, and the distribution.
Poster
Arsenii Mustafin · Aleksei Pakharev · Alex Olshevsky · Yannis Paschalidis
[ Hall A-E ]
Abstract
We present a new geometric interpretation of Markov Decision Processes (MDPs) with a natural normalization procedure that allows us to adjust the value function at each state without altering the advantage of any action with respect to any policy. This advantage-preserving transformation of the MDP motivates a class of algorithms which we call *Reward Balancing*, which solve MDPs by iterating through these transformations, until an approximately optimal policy can be trivially found. We provide a convergence analysis of several algorithms in this class, in particular showing that for MDPs with unknown transition probabilities we can improve upon state-of-the-art sample complexity results.
Poster
Yuxin Wang · Botian Jiang · Yiran Guo · Quan Gan · David Wipf · Xuanjing Huang · Xipeng Qiu
[ Hall A-E ]
Abstract
Prior-Fitted Networks (PFNs) have recently been proposed to efficiently perform tabular classification tasks. Although they achieve good performance on small datasets, they encounter limitations with larger datasets. These limitations include significant memory consumption and increased computational complexity, primarily due to the impracticality of incorporating all training samples as inputs within these networks. To address these challenges, we investigate the fitting assumption for PFNs and input samples. Building on this understanding, we propose *BoostPFN*, designed to enhance the performance of these networks, especially for large-scale datasets. We also theoretically validate the convergence of BoostPFN, and our empirical results demonstrate that BoostPFN can outperform standard PFNs with the same number of training samples on large datasets and achieve a significant acceleration in training times compared to other established baselines in the field, including widely-used Gradient Boosting Decision Trees (GBDTs), deep learning methods and AutoML systems. High performance is maintained for datasets up to 50× the pre-training size of PFNs, substantially extending the limit of training samples. Through this work, we address the challenges of efficiently handling large datasets via PFN-based models, paving the way for faster and more effective training and prediction in tabular data classification.
Poster
Hao Yan · Keith Levin
[ Hall A-E ]
Abstract
Spectral estimators are fundamental in low-rank matrix models and arise throughout machine learning and statistics, with applications including network analysis, matrix completion and PCA. These estimators aim to recover the leading eigenvalues and eigenvectors of an unknown signal matrix observed subject to noise. While extensive research has addressed the statistical accuracy of spectral estimators under a variety of conditions, most previous work has assumed that the signal eigenvectors are incoherent with respect to the standard basis. This assumption typically arises because of suboptimal dependence on coherence in one or more concentration inequalities. Using a new matrix concentration result that may be of independent interest, we establish estimation error bounds for eigenvector and eigenvalue recovery whose dependence on coherence significantly improves upon prior work. Our results imply that coherence-free bounds can be achieved when the standard deviation of the noise is comparable to its Orlicz 1-norm (i.e., its subexponential norm). This matches known minimax lower bounds under Gaussian noise up to logarithmic factors.
Poster
Vincent Blot · Anastasios Angelopoulos · Michael Jordan · Nicolas Brunel
[ Hall A-E ]
Abstract
Science and technology have a growing need for effective mechanisms that ensure reliable, controlled performance from black-box machine learning algorithms. These performance guarantees should ideally hold conditionally on the input; that is, they should hold, at least approximately, no matter what the input is. However, beyond stylized discrete groupings such as ethnicity and gender, the right notion of conditioning can be difficult to define. For example, in problems such as image segmentation, we want the uncertainty to reflect the intrinsic difficulty of the test sample, but this may be difficult to capture via a conditioning event. Building on the recent work of Gibbs et al. [2023], we propose a methodology for achieving approximate conditional control of statistical risks—the expected value of loss functions—by adapting to the difficulty of test samples. Our framework goes beyond traditional conditional risk control based on user-provided conditioning events to the algorithmic, data-driven determination of appropriate function classes for conditioning. We apply this framework to various regression and segmentation tasks, enabling finer-grained control over model performance and demonstrating that by continuously monitoring and adjusting these parameters, we can achieve superior precision compared to conventional risk-control methods.
Poster
Juncheng Dong · Hao-Lun Hsu · Qitong Gao · Vahid Tarokh · Miroslav Pajic
[ Hall A-E ]
Abstract
Reinforcement learning (RL), while being the benchmark for policy formulation, often struggles to deliver robust solutions across varying scenarios, leading to marked performance drops under environmental perturbations. Traditional adversarial training, based on a two-player max-min game, is known to bolster the robustness of RL agents, but it faces challenges: first, the complexity of the worst-case optimization problem may induce *over-optimism*, and second, the choice of a specific set of potential adversaries might lead to *over-pessimism* by considering implausible scenarios. In this work, we first observe that these two challenges do not balance each other out. Thus, we propose to apply variational optimization to optimize over the worst-case distribution of the adversary instead of a single worst-case adversary. Moreover, to counteract over-optimism, we train the RL agent to maximize the lower quantile of the cumulative rewards under the worst-case adversary distribution. Our novel algorithm demonstrates a significant advancement over existing robust RL methods, corroborating the importance of the identified challenges and the effectiveness of our approach. To alleviate the computational overhead associated with the proposed approach, we also propose a simplified version with lower computational burden and only minimal performance degradation. Extensive experiments validate that our approaches consistently yield policies with superior robustness.
Poster
Badr-Eddine Chérief-Abdellatif
[ Hall A-E ]
Abstract
This article deals with robust inference for parametric copula models. Estimation using canonical maximum likelihood might be unstable, especially in the presence of outliers. We propose to use a procedure based on the maximum mean discrepancy (MMD) principle. We derive nonasymptotic oracle inequalities, consistency and asymptotic normality of this new estimator. In particular, the oracle inequality holds without any assumption on the copula family, and can be applied in the presence of outliers or under misspecification. Moreover, in our MMD framework, the statistical inference of copula models for which there exists no density with respect to the Lebesgue measure on $[0,1]^d$, such as the Marshall-Olkin copula, becomes feasible. A simulation study shows the robustness of our new procedures, especially compared to pseudo-maximum likelihood estimation. An R package implementing the MMD estimator for copula models is available. Supplementary materials for this article are available online.
Poster
Charles Riou
[ Hall A-E ]
Abstract
Bernstein's condition is a key assumption that guarantees fast rates in machine learning. For example, under this condition, the Gibbs posterior with prior $\pi$ has an excess risk in $\mathcal{O}(d_{\pi}/n)$, as opposed to $\mathcal{O}(\sqrt{d_{\pi}/n})$ in the general case, where $n$ denotes the number of observations and $d_{\pi}$ is a complexity parameter which depends on the prior $\pi$. In this paper, we examine the Gibbs posterior in the context of meta-learning, i.e., when learning the prior $\pi$ from $T$ previous tasks. Our main result is that Bernstein's condition always holds at the meta level, regardless of its validity at the observation level. This implies that the additional cost to learn the Gibbs prior $\pi$, which will reduce the term $d_{\pi}$ across tasks, is in $\mathcal{O}(1/T)$, instead of the expected $\mathcal{O}(1/\sqrt{T})$. We further illustrate how this result improves on the standard rates in three different settings: discrete priors, Gaussian priors and mixtures of Gaussian priors.
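The rate comparison can be summarized schematically as follows (constants suppressed; this is a paraphrase of the rates stated in the abstract, not the paper's exact theorem):

```latex
% Within-task excess risk of the Gibbs posterior with prior \pi:
\mathcal{E}_{\mathrm{task}}(\pi)
  = \mathcal{O}\!\left(\frac{d_{\pi}}{n}\right)
    \ \text{under Bernstein's condition},
  \qquad
  \mathcal{O}\!\left(\sqrt{\frac{d_{\pi}}{n}}\right)
    \ \text{in general}.

% Meta level: Bernstein's condition always holds, so the extra cost of
% learning the prior \pi over T tasks is O(1/T) rather than O(1/sqrt(T)):
\mathcal{E}_{\mathrm{meta}}
  \lesssim \inf_{\pi}\,\frac{d_{\pi}}{n} \;+\; \mathcal{O}\!\left(\frac{1}{T}\right).
```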
Poster
Shogo Iwazaki · Shion Takeno
[ Hall A-E ]
Abstract
This paper studies a non-stationary kernelized bandit (KB) problem, also called time-varying Bayesian optimization, where one seeks to minimize the regret under an unknown reward function that varies over time. In particular, we focus on a near-optimal algorithm whose regret upper bound matches the regret lower bound. For this goal, we show the first algorithm-independent regret lower bound for non-stationary KB with squared exponential and Matérn kernels, which reveals that an existing optimization-based KB algorithm with slight modification is near-optimal. However, this existing algorithm suffers from feasibility issues due to its huge computational cost. Therefore, we propose a novel near-optimal algorithm called restarting phased elimination with random permutation (R-PERP), which bypasses the huge computational cost. A key technical point is the simple permutation procedure over query candidates, which enables us to derive a novel tighter confidence bound tailored to non-stationary problems.
Poster
Ziliang Zhong · Xiang Pan · Qi Lei
[ Hall A-E ]
Abstract
Machine learning models can suffer from performance degradation when applied to new tasks due to distribution shifts. Feature representation learning offers a robust solution to this issue. However, a fundamental challenge remains in devising the optimal strategy for feature selection. Existing literature is somewhat paradoxical: some advocate for learning invariant features from source domains, while others favor more diverse features. For better understanding, we propose a statistical framework that evaluates the utilities of the features (i.e., how differently the features are used in each source task) based on the variance of their correlation to $y$ across different domains. Under our framework, we design and analyze a learning procedure consisting of learning content features (comprising both invariant and approximately shared features) from source tasks and fine-tuning them on the target task. Our theoretical analysis highlights the significance of learning approximately shared features—beyond strictly invariant ones—when distribution shifts occur. Our analysis also yields an improved population risk on target tasks compared to previous results. Inspired by our theory, we introduce ProjectionNet, a practical method to distinguish content features from environmental features via *explicit feature space control*, further consolidating our theoretical findings.
Poster
Clement Canonne · Themistoklis Gouleakis · Yuhao Wang · Qiping Yang
[ Hall A-E ]
Abstract
We consider the task of Gaussian mean testing, that is, of testing whether a high-dimensional vector perturbed by white noise has large magnitude, or is the zero vector. This question, originating from the signal processing community, has recently seen a surge of interest from the machine learning and theoretical computer science community, and is by now fairly well understood. What is much less understood, and the focus of our work, is how to perform this task under *truncation*: that is, when the observations (i.i.d. samples from the underlying high-dimensional Gaussian) are only observed when they fall in a given subset of the domain $\mathbb{R}^d$. This truncation model, previously studied in the context of *learning* (instead of *testing*) the mean vector, has a range of applications, in particular in Economics and Social Sciences. As our work shows, sample truncations affect the complexity of the testing task in a rather subtle and surprising way.
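For intuition, the untruncated version of the problem reduces to a chi-squared-type statistic. The snippet below is only this textbook baseline (with a hypothetical helper name), not the truncated-sample test the paper develops.

```python
import numpy as np

def mean_test_statistic(X):
    """Statistic for testing H0: mu = 0 from i.i.d. N(mu, I_d) samples.

    Under H0, n * ||x_bar||^2 is chi-squared with d degrees of freedom,
    so one rejects when it exceeds roughly d + O(sqrt(d)).
    """
    n, _ = X.shape
    xbar = X.mean(axis=0)
    return n * float(xbar @ xbar)
```

Truncation breaks this calibration: samples are only observed inside a subset of $\mathbb{R}^d$, so both the null distribution and the sample complexity of the statistic change, which is exactly the subtlety the paper analyzes.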
Poster
Damin Kühn · Michael Schaub
[ Hall A-E ]
Abstract
Optimal transport (OT) provides a robust framework for comparing probability distributions. Its effectiveness is significantly influenced by the choice of the underlying ground metric. Traditionally, the ground metric has either been (i) predefined, e.g., as a Euclidean metric, or (ii) learned in a supervised way, by utilizing labeled data to learn a suitable ground metric for enhanced task-specific performance. Yet, predefined metrics typically cannot account for the inherent structure and varying significance of different features in the data, and existing supervised ground metric learning methods often fail to generalize across multiple classes or are limited to distributions with shared supports. To address these issues, this paper introduces a novel approach for learning metrics for arbitrary distributions over a shared metric space. Our method provides a distance between individual points (samples) like a global metric, but requires only distribution-level class labels for training. The resulting learned global ground metric enables more accurate OT distances, which can significantly improve clustering and classification tasks. Further, we can create task-specific shared embeddings for elements (samples) from different distributions, including unseen data.
Poster
Paul Chang · Nasrulloh Loka · Daolang Huang · Ulpu Remes · Samuel Kaski · Luigi Acerbi
[ Hall A-E ]
Abstract
Amortized meta-learning methods based on pre-training have propelled fields like natural language processing and vision. Transformer-based neural processes and their variants are leading models for probabilistic meta-learning with a tractable objective. Often trained on synthetic data, these models implicitly capture essential latent information in the data-generation process. However, existing methods do not allow users to flexibly inject (condition on) and extract (predict) this probabilistic latent information at runtime, which is key to many tasks. We introduce the Amortized Conditioning Engine (ACE), a new transformer-based meta-learning model that explicitly represents latent variables of interest. ACE affords conditioning on both observed data and interpretable latent variables, the inclusion of priors at runtime, and outputs predictive distributions for discrete and continuous data and latents. We show ACE's practical utility across diverse tasks such as image completion and classification, Bayesian optimization, and simulation-based inference, demonstrating how a general conditioning framework can replace task-specific solutions.
Poster
Bo Chen · Xiaoyu Li · Yingyu Liang · Zhenmei Shi · Zhao Song
[ Hall A-E ]
Abstract
In-context learning has been recognized as a key factor in the success of Large Language Models (LLMs). It refers to the model's ability to learn patterns on the fly from provided in-context examples in the prompt during inference. Previous studies have demonstrated that the Transformer architecture used in LLMs can implement a single-step gradient descent update by processing in-context examples in a single forward pass. Recent work has further shown that, during in-context learning, a looped Transformer can implement multi-step gradient descent updates in forward passes. However, their theoretical results require an exponential number of in-context examples, $n = \exp(\Omega(T))$, where $T$ is the number of loops or passes, to achieve a reasonably low error. In this paper, we study the in-context learning of linear looped Transformers on linear vector generation tasks. We show that linear looped Transformers can implement multi-step gradient descent efficiently for in-context learning. Our results demonstrate that as long as the input data has a constant condition number, e.g., $n = O(d)$, the linear looped Transformers can achieve a small error by multi-step gradient descent during in-context learning. Furthermore, our preliminary experiments validate our theoretical analysis. Our findings reveal that the Transformer architecture possesses a stronger in-context learning …
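The multi-step gradient-descent behavior can be mimicked directly. The sketch below is not a Transformer: it simply runs the plain gradient-descent iterates that, per the result above, each loop of a linear looped Transformer can implement on the in-context least-squares problem (the learning rate `eta` and loop count `T` are hypothetical choices).

```python
import numpy as np

def looped_gd(X, y, T, eta):
    """T loops, each applying one gradient step on the in-context
    least-squares objective (1 / 2n) * ||X w - y||^2."""
    w = np.zeros(X.shape[1])
    for _ in range(T):
        w = w - eta * X.T @ (X @ w - y) / len(y)
    return w
```

With well-conditioned input data, as in the constant-condition-number assumption, the error contracts geometrically in $T$, which is the substance of the efficiency claim.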
Poster
Yuheng Ma · Ke Jia · Hanfang Yang
[ Hall A-E ]
Abstract
We initiate the study of locally differentially private (LDP) learning with public features. We define semi-feature LDP, where some features are publicly available while the remaining ones, along with the label, require protection under local differential privacy. Under semi-feature LDP, we demonstrate that the minimax convergence rate for non-parametric regression is significantly reduced compared to that of classical LDP. Then we propose HistOfTree, an estimator that fully leverages the information contained in both public and private features. Theoretically, HistOfTree reaches the minimax optimal convergence rate. Empirically, HistOfTree achieves superior performance on both synthetic and real data. We also explore scenarios where users have the flexibility to select features for protection manually. In such cases, we propose an estimator and a data-driven parameter tuning strategy, leading to analogous theoretical and empirical results.
Poster
Nishant Panda · Jehanzeb Chaudhry · Natalie Klein · James Carzon · Troy Butler
[ Hall A-E ]
Abstract
We derive local sensitivities of statistical quantities of interest with respect to model parameters in dynamical systems. Our main contribution is the extension of adjoint-based a posteriori analysis for differential operators of generic dynamical systems acting on states to the Liouville operator acting on probability densities of the states. This results in theoretically rigorous estimates of sensitivity and error for a broad class of computed quantities of interest while propagating uncertainty through dynamical systems. We also derive Monte Carlo-type estimators to make these estimates computationally tractable using spatio-temporal normalizing flows and exploiting the hyperbolic nature of the Liouville equation. Three examples demonstrate our method. First, for verification of the theoretical results, we use a 2D linear dynamical system with an initial multivariate Gaussian density. Then, we apply our method to the challenging task of propagating uncertainty in a double attractor system to illustrate sensitivities in bimodal distributions. Finally, we show that our method can provide sensitivities with respect to the parameters of Neural Ordinary Differential Equations (here, in the context of classification).
Poster
Mitsuhiro Fujikawa · Youhei Akimoto · Jun Sakuma · Kazuto Fukuchi
[ Hall A-E ]
Abstract
Transfer learning enhances prediction accuracy on a target distribution by leveraging data from a source distribution, demonstrating significant benefits in various applications. This paper introduces a novel dissimilarity measure that utilizes vicinity information, i.e., the local structure of data points, to analyze the excess error in classification under covariate shift, a transfer learning setting where marginal feature distributions differ but conditional label distributions remain the same. We characterize the excess error using the proposed measure and demonstrate faster or competitive convergence rates compared to previous techniques. Notably, our approach remains effective when the support non-containment assumption, which often appears in real-world applications, holds. Our theoretical analysis bridges the gap between current theoretical findings and empirical observations in transfer learning, particularly in scenarios with significant differences between source and target distributions.
Poster
Nima Akbarzadeh · Yossiri Adulyasak · Erick Delage
[ Hall A-E ]
Abstract
In restless multi-arm bandits, a central agent is tasked with optimally distributing limited resources across several bandits (arms), with each arm being a Markov decision process. In this work, we generalize the traditional restless multi-arm bandit problem with a risk-neutral objective by incorporating risk-awareness. We establish indexability conditions for the case of a risk-aware objective and provide a solution based on the Whittle index. In addition, we address the learning problem when the true transition probabilities are unknown by proposing a Thompson sampling approach and show that it achieves bounded regret that scales sublinearly with the number of episodes and quadratically with the number of arms. The efficacy of our method in reducing risk exposure in restless multi-arm bandits is illustrated through a set of numerical experiments in the contexts of machine replacement and patient scheduling applications under both planning and learning setups.
Poster
Nandi Schoots · Mattia Jacopo Villani · Niels uit de Bos
[ Hall A-E ]
Abstract
Kolmogorov-Arnold Networks are a new family of neural network architectures which hold promise for overcoming the curse of dimensionality and have interpretability benefits (Liu et al., 2024). In this paper, we explore the connection between Kolmogorov-Arnold Networks (KANs) with piecewise linear (univariate real) functions and ReLU networks. We provide completely explicit constructions to convert a piecewise linear KAN into a ReLU network and vice versa.
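The univariate piecewise-linear case of this equivalence rests on a classical identity: any continuous piecewise-linear function on the real line equals an affine term plus a weighted sum of ReLUs with kinks at its breakpoints. A minimal sketch of that identity (illustrative only; the function names and the $|x|$ example are not from the paper):

```python
def relu(x):
    return max(x, 0.0)

def pwl_as_relu(slopes, breakpoints, intercept):
    """Represent a continuous piecewise-linear function as an affine term plus ReLUs.

    slopes[i] is the slope on the i-th piece; breakpoints are the sorted kink
    locations. Each kink contributes (slope_next - slope_prev) * relu(x - kink).
    """
    def f(x):
        y = intercept + slopes[0] * x
        for s_prev, s_next, b in zip(slopes, slopes[1:], breakpoints):
            y += (s_next - s_prev) * relu(x - b)
        return y
    return f

# |x| has slopes (-1, +1) with a single kink at 0:
abs_f = pwl_as_relu(slopes=[-1.0, 1.0], breakpoints=[0.0], intercept=0.0)
```

The same bookkeeping, applied edge by edge, underlies explicit conversions between piecewise-linear univariate components and small ReLU subnetworks.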
Poster
Yatin Dandi · Luca Pesce · Hugo Cui · FLORENT KRZAKALA · Yue Lu · Bruno Loureiro
[ Hall A-E ]
Abstract
A key property of neural networks is their capacity to adapt to data during training. Yet, our current mathematical understanding of feature learning and its relationship to generalization remains limited. In this work, we provide a random matrix analysis of how fully-connected two-layer neural networks adapt to the target function after a single, but aggressive, gradient descent step. We rigorously establish the equivalence between the updated features and an isotropic spiked random feature model, in the limit of large batch size. For the latter model, we derive a deterministic equivalent description of the feature empirical covariance matrix in terms of certain low-dimensional operators. This allows us to sharply characterize the impact of training on the asymptotic feature spectrum, and in particular, provides a theoretical grounding for how the tails of the feature spectrum change with training. The deterministic equivalent further yields the exact asymptotic generalization error, shedding light on the mechanisms behind its improvement in the presence of feature learning. Our result goes beyond standard random matrix ensembles, and therefore we believe it is of independent technical interest. Different from previous work, our result holds in the challenging maximal learning rate regime, is fully rigorous and allows for finitely supported …
Poster
Qizhang Feng · Siva Rajesh Kasa · SANTHOSH KASA · Hyokun Yun · Choon Teo · Sravan Babu Bodapati
[ Hall A-E ]
Abstract
Large Language Models (LLMs) have seen widespread adoption due to their remarkable natural language capabilities. However, when deploying them in real-world settings, it is important to align LLMs to generate texts according to acceptable human standards. Methods such as Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) have enabled significant progress in refining LLMs using human preference data. However, the privacy concerns inherent in utilizing such preference data have yet to be adequately studied. In this paper, we investigate the vulnerability of LLMs aligned using two widely used methods, DPO and PPO, to membership inference attacks (MIAs). Our study has two main contributions: first, we theoretically motivate that DPO models are more vulnerable to MIA compared to PPO models; second, we introduce a novel reference-based attack framework specifically for analyzing preference data called PREMIA (**Pre**ference data **MIA**). Using PREMIA and existing baselines, we empirically show that DPO models have a relatively heightened vulnerability towards MIA.
Poster
Xiangyu Guo · Ricardo Henao
[ Hall A-E ]
Abstract
High-resolution spatial transcriptomics (ST) technologies can capture gene expression at the cellular level along with spatial information, but are limited in the number of genes that can be profiled. In contrast, single-cell RNA sequencing (SC) provides more comprehensive gene expression profiles but lacks spatial context. To bridge these gaps, existing methods typically focus on single-modality prediction tasks, leveraging complementary information from the other modality. Here, we propose an attention-based cross-modal framework that simultaneously imputes gene expression for ST and recovers spatial locations for SC, while also providing uncertainty estimates for the expression of the imputed genes. Our approach was evaluated on three real-world datasets, where it consistently outperformed state-of-the-art methods in spatial gene profile imputation. Moreover, our framework enhances latent embedding integration between the two modalities, resulting in more accurate spatial position estimates.
Poster
Saba Ahmadi · Siddharth Bhandari · Avrim Blum · Chen Dan · Prabhav Jain
[ Hall A-E ]
Abstract
We initiate the study of a new notion of adversarial loss which we call distributional adversarial loss. In this notion, we assume that, for each original example, the allowed adversarial perturbation set is a family of distributions, and the adversarial loss over each example is the maximum loss over all the associated distributions. The goal is to minimize the overall adversarial loss. We show generalization guarantees for our notion of adversarial loss. Our notion of adversarial loss contrasts with prior work, which considers a set of points, not distributions, as the perturbation set of each clean example. As an application of our approach, we show how to unify the two lines of work on randomized smoothing and robust learning in the PAC-learning setting and derive sample complexity bounds for randomized smoothing methods. Furthermore, we investigate the role of randomness in achieving robustness against adversarial attacks. We show a general derandomization technique that preserves the extent of a randomized classifier's robustness against adversarial attacks and show its effectiveness empirically.
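Concretely, the distributional adversarial loss of a point is the maximum, over a family of perturbation distributions, of the expected loss under each distribution. A toy Monte Carlo sketch under assumed one-dimensional Gaussian perturbation families (the classifier, thresholds, and sample sizes here are illustrative, not from the paper):

```python
import random
from statistics import mean

def distributional_adversarial_loss(loss, x, dists, n=2000, seed=0):
    """Max over a family of perturbation distributions of the expected loss.

    Each entry of `dists` is a sampler mapping an RNG to a perturbation delta;
    the expectation is approximated by n Monte Carlo draws per distribution.
    """
    rng = random.Random(seed)
    return max(mean(loss(x + d(rng)) for _ in range(n)) for d in dists)

# Toy example: 0/1 loss of a threshold classifier at 0.5, clean point x = 0.2.
loss = lambda z: 1.0 if z >= 0.5 else 0.0
dists = [
    lambda rng: rng.gauss(0.0, 0.1),  # small centered Gaussian perturbations
    lambda rng: rng.gauss(0.2, 0.1),  # shifted perturbation distribution
]
```

Point-set adversarial loss would instead take a max over individual perturbed points; here the max runs over whole distributions, which is what connects the notion to randomized smoothing.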
Poster
Chungpa Lee · Jongho Im · Joseph Kim
[ Hall A-E ]
Abstract
Mixup is a widely adopted data augmentation technique known for enhancing the generalization of machine learning models by interpolating between data points. Despite its success and popularity, limited attention has been given to understanding the statistical properties of the synthetic data it generates. In this paper, we delve into the theoretical underpinnings of mixup, specifically its effects on the statistical structure of synthesized data. We demonstrate that while mixup improves model performance, it can distort key statistical properties such as variance, potentially leading to unintended consequences in data synthesis. To address this, we propose a novel mixup method that incorporates a generalized and flexible weighting scheme, better preserving the original data’s structure. Through theoretical developments, we provide conditions under which our proposed method maintains the (co)variance and distributional properties of the original dataset. Numerical experiments confirm that the new approach not only preserves the statistical characteristics of the original data but also sustains model performance across repeated synthesis, alleviating concerns of model collapse identified in previous research.
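The variance distortion of standard mixup can be checked directly: with $\lambda \sim \mathrm{Beta}(\alpha,\alpha)$ and independently drawn pairs, the mixed point $\lambda x_i + (1-\lambda)x_j$ preserves the mean but shrinks the variance by a factor $\mathbb{E}[\lambda^2 + (1-\lambda)^2] < 1$. A quick numerical check of plain mixup (not the authors' corrected weighting scheme):

```python
import random

def mixup_pair(x1, x2, alpha=1.0):
    """Standard mixup: interpolate two samples with lam ~ Beta(alpha, alpha)."""
    lam = random.betavariate(alpha, alpha)
    return lam * x1 + (1 - lam) * x2

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(20000)]
synth = [mixup_pair(random.choice(data), random.choice(data)) for _ in range(20000)]

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)
```

For $\alpha = 1$ (uniform $\lambda$) the shrinkage factor is $2/3$, so repeated synthesis compounds the contraction, which is the model-collapse concern the abstract alludes to.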
Poster
Yujia Wu · Bo Yang · Elynn Chen · Yuzhou Chen · Zheshi Zheng
[ Hall A-E ]
Abstract
Graph classification in medical imaging and drug discovery requires accuracy and robust uncertainty quantification. To address this need, we introduce Conditional Prediction ROC (CP-ROC) bands, offering uncertainty quantification for ROC curves and robustness to distributional shifts in test data. Although developed for Tensorized Graph Neural Networks (TGNNs), CP-ROC is adaptable to general Graph Neural Networks (GNNs) and other machine learning models. We establish statistically guaranteed coverage for CP-ROC under a local exchangeability condition. This addresses uncertainty challenges for ROC curves in non-i.i.d. settings, ensuring reliability when test graph distributions differ from training data. Empirically, to establish local exchangeability for TGNNs, we introduce a data-driven approach to construct local calibration sets for graphs. Comprehensive evaluations show that CP-ROC significantly improves prediction reliability across diverse tasks. This method enhances uncertainty quantification efficiency and reliability for ROC curves, proving valuable for real-world applications with non-i.i.d. objects.
Poster
Leo Widmer · Jiawei Huang · Niao He
[ Hall A-E ]
Abstract
Incentive design is a popular framework for guiding agents' learning dynamics towards desired outcomes by providing additional payments beyond intrinsic rewards. However, most existing works focus on a finite, small set of agents or assume complete knowledge of the game, limiting their applicability to real-world scenarios involving large populations and model uncertainty. To address this gap, we study the design of steering rewards in Mean-Field Games (MFGs) with density-independent transitions, where both the transition dynamics and intrinsic reward functions are unknown. This setting presents non-trivial challenges, as the mediator must incentivize the agents to explore for its model learning under uncertainty, while simultaneously steering them to converge to desired behaviors without incurring excessive incentive payments. Assuming agents exhibit no(-adaptive)-regret behaviors, we contribute novel optimistic exploration algorithms. Theoretically, we establish sub-linear regret guarantees for the cumulative gaps between the agents' behaviors and the desired ones. In terms of the steering cost, we demonstrate that our total incentive payments incur only sub-linear excess, competing with a baseline steering strategy that stabilizes the target policy as an equilibrium. Our work presents an effective framework for steering agents' behaviors in large-population systems under uncertainty.
Poster
Wei Huang · Yuan Cao · Haonan Wang · Xin Cao · Taiji Suzuki
[ Hall A-E ]
Abstract
Graph neural networks (GNNs) have demonstrated remarkable capabilities in learning from graph-structured data, often outperforming traditional Multilayer Perceptrons (MLPs) in numerous graph-based tasks. Although existing works have demonstrated the benefits of graph convolution through Laplacian smoothing, expressivity or separability, there remains a lack of quantitative analysis comparing GNNs and MLPs from an optimization and generalization perspective. This study aims to address this gap by examining the role of graph convolution through feature learning theory. Using a signal-noise data model, we conduct a comparative analysis of the optimization and generalization between two-layer graph convolutional networks (GCNs) and their MLP counterparts. Our approach tracks the trajectory of signal learning and noise memorization in GNNs, characterizing their post-training generalization. We reveal that GNNs significantly prioritize signal learning, thus enlarging the regime of low test error relative to MLPs by a factor of $D^{q-2}$, where $D$ denotes a node's expected degree and $q$ is the power of the ReLU activation function with $q>2$. This finding highlights a substantial and quantitative discrepancy between GNNs and MLPs in terms of optimization and generalization, a conclusion further supported by our empirical simulations on both synthetic and real-world datasets.
Poster
Sudeep Salgia · Nikola Pavlovic · Yuejie Chi · Qing Zhao
[ Hall A-E ]
Abstract
We consider the problem of differentially private stochastic convex optimization (DP-SCO) in a distributed setting with $M$ clients, where each of them has a local dataset of $N$ i.i.d. data samples from an underlying data distribution. The objective is to design an algorithm to minimize a convex population loss using a collaborative effort across $M$ clients, while ensuring the privacy of the local datasets. In this work, we investigate the accuracy-communication-privacy trade-off for this problem. We establish matching converse and achievability results using a novel lower bound and a new algorithm for distributed DP-SCO based on Vaidya’s plane cutting method. Thus, our results provide a complete characterization of the accuracy-communication-privacy trade-off for DP-SCO in the distributed setting.
Poster
Daniil Tiapkin · Evgenii Chzhen · Gilles Stoltz
[ Hall A-E ]
Abstract
We consider the problem of learning in adversarial Markov decision processes (MDPs) with an oblivious adversary in a full-information setting. The agent interacts with an environment during $T$ episodes, each of which consists of $H$ stages, and each episode is evaluated with respect to a reward function that will be revealed only at the end of the episode. We propose an algorithm, called APO-MVP, that achieves a regret bound of order $\tilde{\mathcal{O}}(\mathrm{poly}(H)\sqrt{SAT})$, where $S$ and $A$ are the sizes of the state and action spaces, respectively. This result improves upon the best-known regret bound by a factor of $\sqrt{S}$, bridging the gap between adversarial and stochastic MDPs, and matching the minimax lower bound $\Omega(\sqrt{H^3SAT})$ as far as the dependencies in $S,A,T$ are concerned. The proposed algorithm and analysis completely avoid the typical tool given by occupancy measures; instead, it performs policy optimization based only on dynamic programming and on a black-box online linear optimization strategy run over estimated advantage functions, making it easy to implement. The analysis leverages two recent techniques: policy optimization based on online linear optimization strategies (Jonckheere et al., 2023) and a refined martingale analysis of the impact on values of estimating transitions kernels (Zhang et al., 2023).
Poster
Konstantin Kutzkov
[ Hall A-E ]
Abstract
Random walk based node embedding algorithms have attracted a lot of attention due to their scalability and ease of implementation. Previous research has focused on different walk strategies, optimization objectives, and embedding learning models. Inspired by observations on real data, we take a different approach and propose a new regularization technique. More precisely, the frequencies of node pairs generated by the skip-gram model on random walk node sequences follow a highly skewed distribution which causes learning to be dominated by a fraction of the pairs. We address the issue by designing an efficient sampling procedure that generates node pairs according to their smoothed frequency. Theoretical and experimental results demonstrate the advantages of our approach.
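The smoothing idea is reminiscent of word2vec's negative sampling, where raw frequencies $f$ are replaced by $f^{0.75}$ before normalizing, flattening a heavily skewed pair distribution. A generic sketch of frequency smoothing (the exponent and toy counts are illustrative, not the paper's exact procedure):

```python
import random
from collections import Counter

def smoothed_sampler(pair_counts, power=0.75, seed=0):
    """Sample pairs proportionally to count**power instead of raw count.

    With power < 1, very frequent pairs are down-weighted relative to their
    raw frequency, so learning is less dominated by a few head pairs.
    """
    rng = random.Random(seed)
    pairs = list(pair_counts)
    weights = [pair_counts[p] ** power for p in pairs]
    def sample():
        return rng.choices(pairs, weights=weights, k=1)[0]
    return sample

# A skewed toy distribution: one dominant pair and two rare ones.
counts = Counter({("a", "b"): 1000, ("a", "c"): 10, ("b", "c"): 10})
sample = smoothed_sampler(counts)
draws = Counter(sample() for _ in range(5000))
```

Under raw frequencies the dominant pair would be drawn about 98% of the time; smoothing with exponent 0.75 reduces this to roughly 94%, giving the rare pairs measurably more gradient signal.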
Poster
Sascha Xu · Sarah Mameche · Jilles Vreeken
[ Hall A-E ]
Abstract
Identifying causal relationships is a cornerstone task in science, but most data-driven methods offer ambiguous results or require restrictive assumptions. Recent work on the basis of information theory shows promising results across many domains, but leaves open how to provably identify causal graphs. Here, we develop a general information-theoretic framework called TOPIC for causal discovery in topological order. TOPIC is based on the universal measure of Kolmogorov complexity and is fully identifiable. We show that TOPIC's guarantees extend to both the i.i.d. and non-i.i.d. continuous settings. Our evaluations on continuous, time series, and interventional data show that TOPIC, using domain-specific approximations of Kolmogorov complexity, learns faithful topological orderings and frequently outperforms specialized methods.
Poster
Blaise Delattre · Paul Caillon · Quentin Barthélemy · Erwan Fagnou · Alexandre Allauzen
[ Hall A-E ]
Abstract
Randomized smoothing has become a leading approach for certifying adversarial robustness in machine learning models. However, a persistent gap remains between theoretical certified robustness and empirical robustness accuracy. This paper introduces a new framework that bridges this gap by leveraging Lipschitz continuity for certification and proposing a novel, less conservative method for computing confidence intervals in randomized smoothing. Our approach tightens the bounds of certified robustness, offering a more accurate reflection of model robustness in practice. Through rigorous experimentation, we show that our method improves the robust accuracy, narrowing the gap between empirical findings and previous theoretical results. We argue that investigating local Lipschitz constants and designing ad-hoc confidence intervals can further enhance the performance of randomized smoothing. These results pave the way for a deeper understanding of the relationship between Lipschitz continuity and certified robustness.
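For context, the standard Cohen-style certificate that such confidence-interval work refines: if the top class of the smoothed classifier has probability at least $\underline{p_A} > 1/2$ under Gaussian noise of scale $\sigma$, the prediction is certified in an $\ell_2$ ball of radius $R = \sigma\,\Phi^{-1}(\underline{p_A})$. A stdlib-only sketch (the crude normal-approximation lower bound below is illustrative, not the paper's tighter intervals):

```python
import math
from statistics import NormalDist

def certified_radius(sigma, p_lower):
    """Cohen-style l2 certified radius from a lower bound on the top-class probability."""
    if p_lower <= 0.5:
        return 0.0  # no certificate when the top class is not a confident majority
    return sigma * NormalDist().inv_cdf(p_lower)

def lower_confidence_bound(successes, n, z=2.576):
    """Crude normal-approximation lower bound on a Bernoulli mean (illustrative only;
    practical certification uses exact one-sided intervals such as Clopper-Pearson)."""
    p_hat = successes / n
    return max(0.0, p_hat - z * math.sqrt(p_hat * (1 - p_hat) / n) - z * z / n)
```

The conservatism of the interval feeds directly into the radius through $\Phi^{-1}$, which is why less conservative confidence intervals translate into larger certified radii.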
Poster
Nithish Suresh Babu · Ritesh Kumar · Shashank Vatedka
[ Hall A-E ]
Abstract
We study the problem of unbiased minimum mean squared error quantization of the $L_1$ ball, with applications to distributed mean estimation and federated learning. Inspired by quantization of probability distributions using types, we design a novel computationally efficient unbiased quantization scheme for vectors that lie within the $L_1$ ball. We also derive upper bounds on the worst-case mean squared error achieved by our scheme and show that this is order optimal. We then use this to design schemes, with time complexity polynomial in the dimension of the input vectors, for communication-efficient distributed mean estimation and distributed/federated learning, and demonstrate their effectiveness using simulations.
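The simplest unbiased quantizer is coordinate-wise stochastic rounding: snap each value to the grid point above or below with probabilities chosen so the expectation equals the input. A generic sketch of that primitive (not the paper's type-based construction for the $L_1$ ball):

```python
import math
import random

def stochastic_round(x, step=0.25, rng=None):
    """Unbiased rounding of x to a grid of spacing `step`: E[quantized] = x."""
    rng = rng or random
    lo = math.floor(x / step) * step
    frac = (x - lo) / step  # probability of rounding up to the next grid point
    return lo + step if rng.random() < frac else lo

random.seed(1)
samples = [stochastic_round(0.6) for _ in range(20000)]
```

Unbiasedness is what makes such quantizers compose well with averaging in distributed mean estimation: errors cancel across clients instead of accumulating.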
Poster
Yue Huang · Jiaojiao Zhang · Qing Ling
[ Hall A-E ]
Abstract
This paper explores locally differentially private distributed algorithms that solve non-convex empirical risk minimization problems. Traditional approaches often assume uniformly bounded stochastic gradients, which may not hold in practice. To address this issue, we propose differentially **Pri**vate **S**tochastic recursive **M**omentum with gr**A**dient clipping (PriSMA) that judiciously integrates clipping and momentum to enhance utility while guaranteeing privacy. Without assuming uniformly bounded stochastic gradients, given privacy requirement $(\epsilon,\delta)$, PriSMA achieves a learning error of $\tilde{\mathcal{O}}\big((\frac{\sqrt{d}}{\sqrt{M}N\epsilon})^\frac{2}{5}\big)$, where $M$ is the number of clients, $N$ is the number of data samples on each client and $d$ is the model dimension. This learning error bound is better than the state-of-the-art $\tilde{\mathcal{O}}\big((\frac{\sqrt{d}}{{\sqrt{M}N\epsilon}})^\frac{1}{3}\big)$ in terms of the dependence on $M$ and $N$.
Poster
Hamish Flynn · David Reeb
[ Hall A-E ]
Abstract
Confidence bounds are an essential tool for rigorously quantifying the uncertainty of predictions. They are a core component in many sequential learning and decision-making algorithms, with tighter confidence bounds giving rise to algorithms with better empirical performance and better performance guarantees. In this work, we use martingale tail inequalities to establish new confidence bounds for sequential kernel regression. Our confidence bounds can be computed by solving a conic program, although this bare version quickly becomes impractical, because the number of variables grows with the sample size. However, we show that the dual of this conic program allows us to efficiently compute tight confidence bounds. We prove that our new confidence bounds are always tighter than existing ones in this setting. We apply our confidence bounds to kernel bandit problems, and we find that when our confidence bounds replace existing ones, the KernelUCB (GP-UCB) algorithm has better empirical performance, a matching worst-case performance guarantee and comparable computational cost.
Poster
Bo Yuan · Jiaojiao Fan · Jiaming Liang · Yongxin Chen
[ Hall A-E ]
Abstract
We consider the problem of sampling from a target unnormalized distribution $\exp(-f(x))$ defined on $\mathbb{R}^d$ where $f(x)$ is smooth, but the smoothness parameter is unknown. As a key design parameter of Markov chain Monte Carlo (MCMC) algorithms, the step size is crucial for the convergence guarantee. Existing non-asymptotic analysis on MCMC with fixed step sizes indicates that the step size heavily relies on global smoothness. However, this choice does not utilize the local information and fails when the smoothness coefficient is hard to estimate. A tuning-free algorithm that adaptively updates its step size is highly desirable. In this work, we propose an \textbf{adaptive} proximal sampler that can utilize the local geometry to adjust step sizes and is guaranteed to converge to the target distribution. Experiments demonstrate the comparable or superior performance of our algorithm over various baselines.
Poster
Shuhui Zhu · Baoxiang Wang · Sriram Ganapathi Subramanian · Pascal Poupart
[ Hall A-E ]
Abstract
The partial alignment and conflict of autonomous agents lead to mixed-motive scenarios in many real-world applications. However, agents may fail to cooperate in practice even when cooperation yields a better outcome. One well-known reason for this failure comes from non-credible commitments. To facilitate commitments among agents for better cooperation, we define Markov Commitment Games (MCGs), a variant of commitment games, where agents can voluntarily commit to their proposed future plans. Based on MCGs, we propose a learnable commitment protocol via policy gradients. We further propose incentive-compatible learning to accelerate convergence to equilibria with better social welfare. Experimental results in challenging mixed-motive tasks demonstrate faster empirical convergence and higher returns for our method compared with its counterparts. Our code is available at https://github.com/shuhui-zhu/DCL.
Poster
Guojun Zhu · Sanguo Zhang · Mingyang Ren
[ Hall A-E ]
Abstract
Multi-source generative models have gained significant attention due to their ability to capture complex data distributions across diverse domains. However, existing approaches often struggle with limitations such as negative transfer and an over-reliance on large pre-trained models. To address these challenges, we propose a novel method that effectively handles scenarios with outlier source domains, while making weaker assumptions about the data, thus ensuring broader applicability. Our approach enhances robustness and efficiency, supported by rigorous theoretical analysis, including non-asymptotic error bounds and asymptotic guarantees. In the experiments, we validate our methods through numerical simulations and real-world data experiments, showcasing their practical effectiveness and adaptability.
Poster
Roman Malashin · Valeria Yachnaya · Alexandr Mullin
[ Hall A-E ]
Abstract
We investigate the training dynamics of deep classifiers by examining how hierarchical relationships between classes evolve during training. Through extensive experiments, we argue that the learning process in classification problems can be understood through the lens of label clustering. Specifically, we observe that networks tend to distinguish higher-level (hypernym) categories in the early stages of training, and learn more specific (hyponym) categories later. We introduce a novel framework to track the evolution of the feature manifold during training, revealing how the hierarchy of class relations emerges and refines across the network layers. Our analysis demonstrates that the learned representations closely align with the semantic structure of the dataset, providing a quantitative description of the clustering process. Notably, we show that in the hypernym label space, certain properties of neural collapse appear earlier than in the hyponym label space, helping to bridge the gap between the initial and terminal phases of learning. We believe our findings offer new insights into the mechanisms driving hierarchical learning in deep networks, paving the way for future advancements in understanding deep learning dynamics.
Poster
Monika Henzinger · A. R. Sricharan · Teresa Steiner
[ Hall A-E ]
Abstract
We study privately releasing column sums of a $d$-dimensional table with entries from a universe $\chi$ undergoing $T$ row updates, called histogram under continual release. Our mechanisms give better additive $\ell_\infty$-error than existing mechanisms for a large class of queries and input streams. Our first contribution is an output-sensitive mechanism in the insertions-only model ($\chi = \\{0,1\\}$) for maintaining (i) the histogram or (ii) queries that do not require maintaining the entire histogram, such as the maximum or minimum column sum, the median, or any quantiles. The mechanism has an additive error of $O(d\log^2 (dq^*)+\log T)$ whp, where $q^*$ is the maximum output value over all time steps on this dataset. The mechanism does not require $q^*$ as input. This breaks the $\Omega(d \log T)$ bound of prior work when $q^* \ll T$. Our second contribution is a mechanism for the turnstile model that admits negative entry updates ($\chi = \\{-1, 0,1\\}$). This mechanism has an additive error of $O(d \log^2 (dK) + \log T)$ whp, where $K$ is the number of times two consecutive data rows differ, and the mechanism does not require $K$ as input. This is useful when monitoring inputs that only vary under unusual circumstances. For $d=1$ this gives …
Poster
Jiaqi Sun · Yujia Zheng · Xinshuai Dong · Haoyue Dai · Kun Zhang
[ Hall A-E ]
Abstract
Knowledge graphs serve as critical resources supporting intelligent systems, but they can be noisy due to imperfect automatic generation processes. Existing approaches to noise detection often rely on external facts, logical rule constraints, or structural embeddings. These methods are often challenged by imperfect entity alignment, flexible knowledge graph construction, and overfitting on structures. In this paper, we propose to exploit the consistency between entity and relation type information for noise detection, resulting in a novel self-supervised knowledge graph denoising method that avoids those problems. We formalize \textit{type inconsistency} noise as triples that deviate from the majority with respect to type-dependent reasoning along the topological structure. Specifically, we first extract a compact representation of a given knowledge graph via an encoder that models the type dependencies of triples. Then, the decoder reconstructs the original input knowledge graph based on the compact representation. It is worth noting that, our proposal has the potential to address the problems of knowledge graph compression and completion, although this is not our focus. For the specific task of noise detection, the discrepancy between the reconstruction results and the input knowledge graph provides an opportunity for denoising, which is facilitated by the type consistency embedded in our method. …
Poster
Xuefeng GAO · Lingjiong Zhu
[ Hall A-E ]
Abstract
Score-based generative modeling with probability flow ordinary differential equations (ODEs) has achieved remarkable success in a variety of applications. While various fast ODE-based samplers have been proposed in the literature and employed in practice, the theoretical understanding of the convergence properties of the probability flow ODE is still quite limited. In this paper, we provide the first non-asymptotic convergence analysis for a general class of probability flow ODE samplers in 2-Wasserstein distance, assuming accurate score estimates and smooth log-concave data distributions. We then consider various examples and establish results on the iteration complexity of the corresponding ODE-based samplers. Our proof technique relies on spelling out explicitly the contraction rate for the continuous-time ODE and analyzing the discretization and score-matching errors by using synchronous coupling; the challenge in our analysis mainly arises from the inherent non-autonomy of the probability flow ODE and the specific exponential integrator that we study.
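The object of study can be made concrete in the one case where everything is closed-form: for Gaussian data under a variance-preserving forward process, the marginal score is exact, so the probability flow ODE can be integrated numerically and checked against the known answer. A sketch under these assumptions (constant $\beta$ and a plain Euler discretization, not the exponential integrator analyzed in the paper):

```python
import math

SIGMA0_SQ = 0.25  # variance of the (Gaussian) data distribution

def pf_ode_backward(x1, beta=2.0, steps=1000):
    """Integrate the probability flow ODE from t=1 back to t=0.

    The forward process is a VP diffusion with constant beta; for Gaussian data
    the marginal p_t = N(0, var_t) is known in closed form, so the exact score
    -x/var_t is used. This isolates the ODE discretization from score learning.
    """
    x, dt = x1, 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        var_t = SIGMA0_SQ * math.exp(-beta * t) + 1.0 - math.exp(-beta * t)
        score = -x / var_t
        drift = -0.5 * beta * x - 0.5 * beta * score  # probability flow drift
        x -= dt * drift  # Euler step, backward in time
    return x
```

Because the flow is linear in $x$ here, integrating backward multiplies each sample by roughly $\sigma_0/\sigma_1$, transporting $N(0,\sigma_1^2)$ noise to the $N(0,\sigma_0^2)$ target.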
Poster
Yulong Yang · Bowen Feng · Keqin Wang · Naomi Leonard · Adji Bousso Dieng · Christine Allen-Blanchette
[ Hall A-E ]
Abstract
From pedestrians to Kuramoto oscillators, interactions between agents govern how dynamical systems evolve in space and time. Discovering how these agents relate to each other has the potential to improve our understanding of the often complex dynamics that underlie these systems. Recent works learn to categorize relationships between agents based on observations of their physical behavior. These approaches model relationship categories as outcomes of a categorical distribution which is limiting and contrary to real-world systems, where relationship categories often intermingle and interact. In this work, we introduce a level of abstraction between the observable behavior of agents and the latent categories that determine their behavior. To do this, we learn a mapping from agent observations to agent preferences for a set of latent categories. The learned preferences and inter-agent proximity are integrated in a nonlinear opinion dynamics model, which allows us to naturally identify mutually exclusive categories, predict an agent's evolution in time, and control an agent's behavior. Through extensive experiments, we demonstrate the utility of our model for learning interpretable categories, and the efficacy of our model for long-horizon trajectory prediction.
Poster
Gintare Karolina Dziugaite · Dan Roy
[ Hall A-E ]
Abstract
We study the generalization properties of neural networks through the lens of data complexity. Recent work by Buzaglo et al. (2024) shows that random (nearly) interpolating networks generalize, provided there is a small "teacher" network that achieves small excess risk. We give a short single-sample PAC-Bayes proof of this result and an analogous "fast-rate" result for random samples from Gibbs posteriors. The resulting oracle inequality motivates a new notion of data complexity, based on the minimal size of a teacher network required to achieve any given level of excess risk. We show that polynomial data complexity gives rise to power laws connecting risk to the number of training samples, like in empirical neural scaling laws. By comparing the "scaling laws" resulting from our bounds to those observed in empirical studies, we provide evidence for lower bounds on the data complexity of standard benchmarks.
Poster
Richard Schwank · Andrew McCormack · Mathias Drton
[ Hall A-E ]
Abstract
Proposed in Hyvärinen (2005), score matching is a statistical estimation procedure that does not require computation of distributional normalizing constants. In this work we utilize the geometric median of means to develop a robust score matching procedure that yields consistent parameter estimates in settings where the observed data has been contaminated. A special appeal of the proposed method is that it retains convexity in exponential family models. The new method is therefore particularly attractive for non-Gaussian, exponential family graphical models where evaluation of normalizing constants is intractable. Support recovery guarantees for such models when contamination is present are provided. Additionally, support recovery is studied in numerical experiments and on a precipitation dataset. We demonstrate that the proposed robust score matching estimator performs comparably to the standard score matching estimator when no contamination is present but greatly outperforms this estimator in a setting with contamination.
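The median-of-means primitive behind such robust procedures (the geometric median generalizes it to vectors): split the sample into blocks, average each block, and take the median of the block means, so a handful of contaminated points can corrupt only a few blocks. A univariate sketch under an illustrative contamination model:

```python
import random
from statistics import mean, median

def median_of_means(xs, n_blocks=20):
    """Robust mean estimate: median of the per-block averages."""
    block = len(xs) // n_blocks
    block_means = [mean(xs[i * block:(i + 1) * block]) for i in range(n_blocks)]
    return median(block_means)

random.seed(0)
# 995 clean N(0, 1) samples plus 5 gross outliers.
data = [random.gauss(0.0, 1.0) for _ in range(995)] + [500.0] * 5
random.shuffle(data)
```

The plain sample mean is dragged far from zero by the outliers, while the median of means stays close, since at most 5 of the 20 blocks can be contaminated.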
Poster
Siddharth Vishwanath · Jonathan Hehir
[ Hall A-E ]
Abstract
We consider the problem of recovering latent information from graphs under $\varepsilon$-edge local differential privacy where the presence of relationships/edges between two users/vertices remains confidential, even from the data curator. For the class of generalized random dot-product graphs, we show that a standard local differential privacy mechanism induces a specific geometric distortion in the latent positions. Leveraging this insight, we show that consistent recovery of the latent positions is achievable by appropriately adjusting the statistical inference procedure for the privatized graph. Furthermore, we prove that our procedure is nearly minimax-optimal under local edge differential privacy constraints. Lastly, we show that this framework allows for consistent recovery of geometric and topological information underlying the latent positions, as encoded in their persistence diagrams. Our results extend previous work from the private community detection literature to a substantially richer class of models and inferential tasks.
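A standard $\varepsilon$-edge LDP primitive underlying such analyses is randomized response on each potential edge: report the true bit with probability $e^\varepsilon/(1+e^\varepsilon)$ and the flipped bit otherwise, then debias downstream statistics. A sketch of the mechanism with an unbiased edge-count estimator (the paper's contribution concerns the downstream spectral inference, not this primitive):

```python
import math
import random

def randomized_response(bit, eps, rng):
    """Report `bit` w.p. e^eps/(1+e^eps); otherwise flip it (eps-edge LDP)."""
    p_keep = math.exp(eps) / (1.0 + math.exp(eps))
    return bit if rng.random() < p_keep else 1 - bit

def debiased_edge_count(noisy_bits, eps):
    """Unbiased estimate of the true number of 1-bits from randomized responses."""
    p = math.exp(eps) / (1.0 + math.exp(eps))
    n = len(noisy_bits)
    return (sum(noisy_bits) - n * (1 - p)) / (2 * p - 1)

rng = random.Random(0)
eps = 1.0
true_bits = [1] * 300 + [0] * 700  # 300 edges among 1000 potential pairs
noisy = [randomized_response(b, eps, rng) for b in true_bits]
```

The flipping biases raw counts toward $n/2$; the affine correction above removes that bias, which is the kind of systematic distortion the paper's adjusted inference procedure accounts for in the latent positions.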
Poster
Asfandyar Azhar · Paul Thielen · Curtis Langlotz
[ Hall A-E ]
Abstract
We introduce MEDUSA (Medical Data Under Shadow Attacks), a novel hybrid model inversion framework that leverages gradient-based optimization and TCNNs to reconstruct high-fidelity medical images from model outputs in a gray-box setting. Unlike traditional attacks requiring full model details, MEDUSA uses surrogate shadow models trained on publicly available data, simulating limited-information scenarios often encountered in practice. Our approach shows that even with restricted access, quality image reconstructions are possible, raising serious privacy concerns for patient data. Contributions include demonstrating that a combination of gradient-based methods and TCNNs yields potent reconstructions, even with limited model access, and providing a detailed analysis of how different input configurations impact reconstruction quality. We also evaluate the reconstructions as viable training data, finding that they can approximate real images well enough to use for model training. Finally, we propose robust defensive mechanisms such as output vector truncation, Gaussian noise, and a new k-NN smearing technique to tackle privacy risks.
Poster
Imad Aouali · Victor-Emmanuel Brunel · David Rohde · Anna Korba
[ Hall A-E ]
Abstract
In interactive systems, actions are often correlated, presenting an opportunity for more sample-efficient off-policy evaluation (OPE) and learning (OPL) in large action spaces. We introduce a unified Bayesian framework to capture these correlations through structured and informative priors. In this framework, we propose sDM, a generic Bayesian approach for OPE and OPL, grounded in both algorithmic and theoretical foundations. Notably, sDM leverages action correlations without compromising computational efficiency. Moreover, inspired by online Bayesian bandits, we introduce Bayesian metrics that assess the average performance of algorithms across multiple problem instances, deviating from the conventional worst-case assessments. We analyze sDM in OPE and OPL, highlighting the benefits of leveraging action correlations. Empirical evidence showcases the strong performance of sDM.
Poster
Gianluca Drappo · Arnaud Robert · Marcello Restelli · Aldo Faisal · Alberto Maria Metelli · Ciara Pike-Burke
[ Hall A-E ]
Abstract
We study goal-conditioned Hierarchical Reinforcement Learning (HRL), where a high-level agent instructs sub-goals to a low-level agent. Under the assumption of a sparse reward function and known hierarchical decomposition, we propose a new algorithm to learn optimal hierarchical policies. Our algorithm takes a low-level policy as input and is flexible enough to work with a wide range of low-level policies. We show that when the algorithm that computes the low-level policy is optimistic and provably efficient, our HRL algorithm enjoys a regret bound which represents a significant improvement compared to previous results for HRL. Importantly, our regret upper bound highlights key characteristics of the hierarchical decomposition that guarantee that our hierarchical algorithm is more efficient than the best monolithic approach. We support our theoretical findings with experiments that underscore that our method consistently outperforms algorithms that ignore the hierarchical structure.
Poster
Atsutoshi Kumagai · Tomoharu Iwata · Hiroshi Takahashi · Taishi Nishiyama · Yasuhiro Fujiwara
[ Hall A-E ]
Abstract
Positive and unlabeled (PU) learning is a fundamental task in many applications, which trains a binary classifier from only PU data. Existing PU learning methods typically assume that training and test distributions are identical. However, this assumption is often violated due to distribution shifts, and identifying shift types such as covariate and concept shifts is generally difficult. In this paper, we propose a distribution shift adaptation method for PU learning that does not assume a shift type, using a few PU data from the test distribution together with PU data from the training distribution. Our method is based on importance weighting, which learns the classifier in a principled manner by minimizing an importance-weighted training risk that approximates the test risk. Although existing methods require positive and negative data in both distributions to perform importance weighting without assuming shift types, we theoretically show that it can be performed with only PU data in both distributions. Based on this finding, our neural network-based classifiers can be effectively trained by iterating between importance weight estimation and classifier learning. We show that our method outperforms various existing methods on seven real-world datasets.
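The importance-weighted training risk mentioned above can be written down in a few lines. This is a generic sketch of importance weighting only; estimating the weights from PU data is the paper's actual contribution and is not shown here, and the weight values below are hypothetical.

```python
import numpy as np

def importance_weighted_risk(losses, weights):
    # Empirical estimate of the test risk: each training-example loss
    # is reweighted by w(x) ~ p_test(x) / p_train(x), so that training
    # examples over-represented at test time count more.
    return float(np.mean(np.asarray(weights) * np.asarray(losses)))

losses = np.array([0.2, 1.0, 0.5])    # per-example training losses
weights = np.array([2.0, 0.5, 1.0])   # hypothetical density ratios
print(importance_weighted_risk(losses, weights))  # ≈ 0.467
```

Minimizing this quantity over classifier parameters, with weights re-estimated in alternation, is the iteration the abstract describes.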
Poster
Behnoosh Zamanlooy · Mario Diaz · Shahab Asoodeh
[ Hall A-E ]
Abstract
Local differential privacy (LDP) is increasingly employed in privacy-preserving machine learning to protect user data before sharing it with an untrusted aggregator. Most LDP methods assume that users possess only a single data record, which is a significant limitation since users often gather extensive datasets (e.g., images, text, time-series data) and frequently have access to public datasets. To address this limitation, we propose a locally private sampling framework that leverages both the private and public datasets of each user. Specifically, we assume each user has two distributions: $p$ and $q$ that represent their private and public datasets, respectively. The objective is to design a mechanism that generates a private sample approximating $p$ while simultaneously preserving $q$. We frame this objective as a minimax optimization problem using $f$-divergence as the utility measure. We fully characterize the minimax optimal mechanisms for general $f$-divergences provided that $p$ and $q$ are discrete distributions. Remarkably, we demonstrate that this optimal mechanism is universal across all $f$-divergences. Experiments validate the effectiveness of our minimax optimal mechanism compared to the state-of-the-art private sampler.
Poster
Krishna Kalagarla · Rahul Jain · Pierluigi Nuzzo
[ Hall A-E ]
Abstract
Constrained Markov decision process (CMDP) models are increasingly important in many applications with multiple objectives. When the model is unknown and must be learned online, it is desirable to ensure that the constraint is met, or at least that the violation is bounded over time. Recent literature has made progress on this very challenging problem, but existing approaches either rely on unsatisfactory assumptions, such as knowledge of a safe policy, or incur high cumulative regret. We propose the Safe-PSRL (posterior sampling-based RL) algorithm, which needs no such assumptions and yet performs very well, both in terms of theoretical regret bounds and empirically. The algorithm efficiently trades off exploration and exploitation using posterior sampling-based exploration, and yet provably suffers only bounded constraint violation using carefully crafted pessimism. We establish a sub-linear $\tilde{O}(H^{2.5}\sqrt{|S|^2|A|K})$ upper bound on the Bayesian objective regret along with a bounded, i.e., $\tilde{O}(1)$, constraint-violation regret over $K$ episodes for an $|S|$-state, $|A|$-action, and horizon-$H$ CMDP, which improves over state-of-the-art algorithms for the same setting.
Poster
Robert Busa-Fekete · Umar Syed
[ Hall A-E ]
Abstract
We present new algorithms for estimating and testing \emph{collision probability}, a fundamental measure of the spread of a discrete distribution that is widely used in many scientific fields. We describe an algorithm that satisfies $(\alpha, \beta)$-local differential privacy and estimates collision probability with error at most $\epsilon$ using $\tilde{O}\left(\frac{\log(1/\beta)}{\alpha^2 \epsilon^2}\right)$ samples for $\alpha \le 1$, which improves over previous work by a factor of $\frac{1}{\alpha^2}$. We also present a sequential testing algorithm for collision probability, which can distinguish between collision probability values that are separated by $\epsilon$ using $\tilde{O}(\frac{1}{\epsilon^2})$ samples, even when $\epsilon$ is unknown. Our algorithms have nearly the optimal sample complexity, and in experiments we show that they require significantly fewer samples than previous methods.
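For concreteness, the quantity being estimated, collision probability $\sum_i p_i^2$, admits a simple non-private plug-in estimator: the fraction of sample pairs that take the same value. This sketch is the textbook estimator only, not the paper's locally private or sequential algorithms.

```python
from collections import Counter

def collision_probability_estimate(samples):
    # Unbiased estimator of sum_i p_i^2: the fraction of the
    # n*(n-1)/2 sample pairs that collide (take the same value).
    n = len(samples)
    counts = Counter(samples)
    colliding_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return colliding_pairs / (n * (n - 1) // 2)

print(collision_probability_estimate([0, 1, 0, 1]))   # 2 of 6 pairs collide -> 1/3
print(collision_probability_estimate(["a", "a", "a"]))  # all pairs collide -> 1.0
```

The private and sequential algorithms in the paper refine this idea to work under $(\alpha, \beta)$-LDP and with unknown separation $\epsilon$.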
Poster
Nikola Pavlovic · Sudeep Salgia · Qing Zhao
[ Hall A-E ]
Abstract
We consider distributed kernel bandits where $N$ agents aim to collaboratively maximize an unknown reward function that lies in a reproducing kernel Hilbert space. Each agent sequentially queries the function to obtain noisy observations at the query points. Agents can share information through a central server, with the objective of minimizing regret accumulated over time $T$ and aggregated over agents. We develop the first algorithm that achieves the optimal regret order (as defined by centralized learning) with a communication cost that is sublinear in both $N$ and $T$. The key features of the proposed algorithm are uniform exploration at the local agents and shared randomness with the central server. Together with a sparse approximation of the GP model, these two key components make it possible to preserve the learning rate of the centralized setting at a diminishing rate of communication.
Poster
Benjamin Howson · Sarah Filippi · Ciara Pike-Burke
[ Hall A-E ]
Abstract
This paper studies the cooperative stochastic $k$-armed bandit problem, where $m$ agents collaborate to identify the optimal action. Rather than adapting a specific single-agent algorithm, we propose a general-purpose black-box reduction that extends any single-agent algorithm to the multi-agent setting. Under mild assumptions, we prove that our black-box approach preserves the regret guarantees of the chosen algorithm, and is capable of achieving minimax-optimality up to an additive graph-dependent term. Our method applies to various bandit settings, including heavy-tailed and duelling bandits, and those with local differential privacy. Empirically, it is competitive with or outperforms specialized multi-agent algorithms.
Poster
Zhiyong Wang · Jize Xie · Yi Chen · John C. S. Lui · Dongruo Zhou
[ Hall A-E ]
Abstract
We investigate the non-stationary stochastic linear bandit problem where the reward distribution evolves each round. Existing algorithms characterize the non-stationarity by the total variation budget $B_K$, defined as the summation of the changes in the consecutive feature vectors of the linear bandits over $K$ rounds. However, such a quantity only measures the non-stationarity with respect to the expectation of the reward distribution, which makes existing algorithms sub-optimal under the general non-stationary distribution setting. In this work, we propose algorithms that utilize the variance of the reward distribution as well as $B_K$, and show that they can achieve tighter regret upper bounds. Specifically, we introduce two novel algorithms: Restarted Weighted$\text{OFUL}^+$ and Restarted $\text{SAVE}^+$. These algorithms address the cases where the variance information of the rewards is known and unknown, respectively. Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary stochastic linear bandits under different settings. Experimental evaluations further validate the superior performance of our proposed algorithms over existing works.
Poster
Juliusz Ziomek · Masaki Adachi · Michael A. Osborne
[ Hall A-E ]
Abstract
Bayesian optimisation requires fitting a Gaussian process model, which in turn requires specifying a prior on the unknown black-box function; most of the theoretical literature assumes this prior is known. However, it is common to have more than one possible prior for a given black-box function, for example suggested by domain experts with differing opinions. In some cases, the type-II maximum likelihood estimator for selecting the prior enjoys a consistency guarantee, but this does not universally apply to all types of priors. If the problem is stationary, one could rely on the Regret Balancing scheme to conduct the optimisation, but in the case of time-varying problems, such a scheme cannot be used. To address this gap in existing research, we propose a novel algorithm, PE-GP-UCB, which is capable of solving time-varying Bayesian optimisation problems even without exact knowledge of the function's prior. The algorithm relies on the fact that either the observed function values are consistent with some of the priors, in which case it is easy to reject the wrong priors, or the observations are consistent with all candidate priors, in which case it does not matter which prior our model relies on. We provide a regret bound on the proposed …
Poster
Yigit Efe Erginbas · Thomas Courtade · Ramchandran Kannan
[ Hall A-E ]
Abstract
We consider an assortment selection and pricing problem in which a seller has $N$ different items available for sale. In each round, the seller observes a $d$-dimensional contextual preference information vector for the user, and offers to the user an assortment of $K$ items at prices chosen by the seller. The user selects at most one of the products from the offered assortment according to a multinomial logit choice model whose parameters are unknown. The seller observes which, if any, item is chosen at the end of each round, with the goal of maximizing cumulative revenue over a selling horizon of length $T$. For this problem, we propose an algorithm that learns from user feedback and achieves a revenue regret of order $\widetilde{\mathcal{O}}(d \sqrt{K T} / L_0 )$ where $L_0$ is the minimum price sensitivity parameter. We also obtain a lower bound of order $\Omega(d \sqrt{T}/ L_0)$ for the regret achievable by any algorithm.
Poster
Alessio Mazzetto · Reza Esfandiarpoor · Akash Singirikonda · Eli Upfal · Stephen Bach
[ Hall A-E ]
Abstract
We introduce an adaptive method with formal quality guarantees for weak supervision in a non-stationary setting. Our goal is to infer the unknown labels of a sequence of data by using weak supervision sources that provide independent noisy signals of the correct classification for each data point. This setting includes crowdsourcing and programmatic weak supervision. We focus on the non-stationary case, where the accuracy of the weak supervision sources can drift over time, e.g., because of changes in the underlying data distribution. Due to the drift, older data could provide misleading information to infer the label of the current data point. Previous work relied on a priori assumptions on the magnitude of the drift to decide how much data to use from the past. In contrast, our algorithm does not require any assumptions on the drift, and it adapts based on the input by dynamically varying its window size. In particular, at each step, our algorithm estimates the current accuracies of the weak supervision sources by identifying a window of past observations that guarantees a near-optimal minimization of the trade-off between the error due to the variance of the estimation and the error due to the drift. Experiments on synthetic …
Poster
Xiangyu Chang · Yingcong Li · Muti Kara · Samet Oymak · Amit Roy-Chowdhury
[ Hall A-E ]
Abstract
The in-context learning capabilities of modern language models have motivated a deeper mathematical understanding of sequence models. A line of recent work has shown that linear attention models can emulate projected gradient descent iterations to implicitly learn the task vector from the data provided in the context window. In this work, we consider a novel setting where the global task distribution can be partitioned into a union of conditional task distributions. We then examine the use of task-specific prompts and prediction heads for learning the prior information associated with the conditional task distribution using a one-layer attention model. Our results on the loss landscape show that task-specific prompts facilitate a covariance-mean decoupling, where prompt-tuning explains the conditional mean of the distribution whereas the variance is learned through in-context learning. Incorporating a task-specific head further aids this process by entirely decoupling the estimation of the mean and variance components. This covariance-mean perspective similarly explains how jointly training prompt and attention weights can provably help over fine-tuning after pretraining.
Poster
Haotian Sun · Antoine Moulin · Tongzheng Ren · Arthur Gretton · Bo Dai
[ Hall A-E ]
Abstract
We study the problem of causal effect estimation in the presence of unobserved confounders, focusing on two settings: instrumental variable (IV) regression with additional observed confounders, and proxy causal learning. Our approach uses a singular value decomposition of a conditional expectation operator combined with a saddle-point optimization method. In the IV regression setting, this can be viewed as a neural network generalization of the seminal approach due to \cite{darolles2011nonparametric}. Saddle-point formulations have recently gained attention because they mitigate the double-sampling bias and are compatible with modern function approximation methods. We provide experimental validation across various settings and show that our approach outperforms existing methods on common benchmarks.
Poster
Maryam Aliakbarpour · Konstantina Bairaktari · Adam Smith · Marika Swanberg · Jonathan Ullman
[ Hall A-E ]
Abstract
Model personalization allows a set of individuals, each facing a different learning task, to train models that are more accurate for each person than those they could develop individually. The goals of personalization are captured in a variety of formal frameworks, such as multitask learning and metalearning. Combining data for model personalization poses risks for privacy because the output of an individual's model can depend on the data of other individuals. In this work we undertake a systematic study of differentially private personalized learning. Our first main contribution is to construct a taxonomy of formal frameworks for private personalized learning. This taxonomy captures different formal frameworks for learning as well as different threat models for the attacker. Our second main contribution is to prove separations between the personalized learning problems corresponding to different choices. In particular, we prove a novel separation between private multitask learning and private metalearning.
Poster
NAICHANG KE · Ryogo Tanaka · Yoshinobu Kawahara
[ Hall A-E ]
Abstract
We consider an operator-based latent Markov representation of a stochastic nonlinear dynamical system, where the stochastic evolution of the latent state embedded in a reproducing kernel Hilbert space is described with the corresponding transfer operator, and develop a spectral method to learn this representation based on the theory of stochastic realization. The embedding may be learned simultaneously using reproducing kernels, for example, constructed with feed-forward neural networks. We also address the generalization of sequential state-estimation (Kalman filtering) in stochastic nonlinear systems, and of operator-based eigen-mode decomposition of dynamics, for the representation. Several examples with synthetic and real-world data are shown to illustrate the empirical characteristics of our methods, and to investigate the performance of our model in sequential state-estimation and mode decomposition.
Poster
Congye Wang · Wilson Ye Chen · Heishiro Kanagawa · Chris Oates
[ Hall A-E ]
Abstract
An informal observation, made by several authors, is that the adaptive design of a Markov transition kernel has the flavour of a reinforcement learning task. Yet, to date it has remained unclear how to exploit modern reinforcement learning technologies for adaptive MCMC. The aim of this paper is to set out a general framework, called *Reinforcement Learning Metropolis–Hastings*, that is theoretically supported and empirically validated. Our principal focus is on learning fast-mixing Metropolis–Hastings transition kernels, which we cast as deterministic policies and optimise via a policy gradient. Control of the learning rate provably ensures conditions for ergodicity are satisfied. The methodology is used to construct a gradient-free sampler that outperforms a popular gradient-free adaptive Metropolis–Hastings algorithm on $\approx$90% of tasks in the *PosteriorDB* benchmark.
Poster
Aaron Mishkin · Alberto Bietti · Robert Gower
[ Hall A-E ]
Abstract
We study level set teleportation, an optimization routine which tries to accelerate gradient descent (GD) by maximizing the gradient norm over a level set of the objective. While teleportation intuitively speeds up GD via bigger steps, current work lacks convergence theory for convex functions, guarantees for solving the teleportation operator, and even clear empirical evidence showing this acceleration. We resolve these open questions. For convex functions satisfying Hessian stability, we prove that GD with teleportation obtains a combined sub-linear/linear convergence rate which is strictly faster than GD when the optimality gap is small. This is in sharp contrast to the standard (strongly) convex setting, where teleportation neither improves nor worsens convergence. To evaluate teleportation in practice, we develop a projected-gradient method requiring only Hessian-vector products. We use this to show that gradient methods with access to a teleportation oracle outperform their standard versions on a variety of problems. We also find that GD with teleportation is faster than truncated Newton methods, particularly for non-convex optimization.
Poster
Chenhan Fu · guoming wang · Juncheng Li · Rongxing Lu · Siliang Tang
[ Hall A-E ]
Abstract
The decoding strategies widely used in large language models (LLMs) today are Top-$p$ Sampling and Top-$k$ Sampling, both methods situated between greedy decoding and random sampling. Inspired by the concept of loss aversion from prospect theory in behavioral economics, and by the endowment effect highlighted by Richard H. Thaler, winner of the 2017 Nobel Memorial Prize in Economic Sciences (in particular the principle that "the negative utility of an equivalent loss is approximately twice the positive utility of a comparable gain"), we develop a new decoding strategy called Loss Sampling. We demonstrate the effectiveness and validity of our method on several LLMs, including Llama-2, Llama-3 and Mistral. Our approach improves text quality by 4-30\% across four pure text tasks while maintaining diversity in text generation. Furthermore, we extend our method to multimodal large models and Beam Search, demonstrating the effectiveness and versatility of Loss Sampling with improvements ranging from 1-10\%.
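The two baselines the abstract names, Top-$p$ (nucleus) and Top-$k$ sampling, both truncate and renormalise the next-token distribution before sampling. A minimal sketch of these standard baselines follows; it does not show Loss Sampling itself, and the toy probability vector is illustrative.

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    # Nucleus sampling: keep the smallest set of highest-probability
    # tokens whose cumulative mass reaches p, then renormalise.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1  # smallest prefix covering p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def top_k_filter(probs, k=2):
    # Top-k sampling: keep only the k most probable tokens.
    keep = np.argsort(probs)[::-1][:k]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.25, 0.125, 0.125])
print(top_p_filter(probs, p=0.7))  # mass concentrated on the top two tokens
print(top_k_filter(probs, k=2))
```

Greedy decoding corresponds to $k=1$, while sampling from the unfiltered distribution is the fully random limit; both baselines interpolate between these extremes, as the abstract notes.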
Poster
Enea Monzio Compagnoni · Rustem Islamov · Frank Proske · Aurelien Lucchi
[ Hall A-E ]
Abstract
Distributed methods are essential for handling machine learning pipelines comprising large-scale models and datasets. However, their benefits often come at the cost of increased communication overhead between the central server and agents, which can become the main bottleneck, making training costly or even unfeasible in such systems. Compression methods such as quantization and sparsification can alleviate this issue. Still, their robustness to large and heavy-tailed gradient noise, a phenomenon sometimes observed in language modeling, remains poorly understood. This work addresses this gap by analyzing Distributed Compressed SGD (DCSGD) and Distributed SignSGD (DSignSGD) using stochastic differential equations (SDEs). Our results show that DCSGD with unbiased compression is more vulnerable to noise in stochastic gradients, while DSignSGD remains robust, even under large and heavy-tailed noise. Additionally, we propose new scaling rules for hyperparameter tuning to mitigate performance degradation due to compression. These findings are empirically validated across multiple deep learning architectures and datasets, providing practical recommendations for distributed optimization.
Poster
Baptiste Abélès · Gergely Neu · Eugenio Clerico
[ Hall A-E ]
Abstract
Traditional generalization results in statistical learning require a training data set made of independently drawn examples. Most of the recent efforts to relax this independence assumption have considered either purely temporal (mixing) dependencies, or graph-dependencies, where non-adjacent vertices correspond to independent random variables. Both approaches have their own limitations, the former requiring a temporal ordered structure, and the latter lacking a way to quantify the strength of inter-dependencies. In this work, we bridge these two lines of work by proposing a framework where dependencies decay with graph distance. We derive generalization bounds leveraging the online-to-PAC framework, by deriving a novel concentration result and introducing an online learning framework incorporating the graph structure. The resulting high-probability generalization guarantees depend on both the mixing rate and the graph's chromatic number.
Poster
Mengjing Wu · Junyu Xuan · Jie Lu
[ Hall A-E ]
Abstract
Classical parameter-space Bayesian inference for Bayesian neural networks (BNNs) suffers from several unresolved prior issues, such as knowledge encoding intractability and pathological behaviours in deep networks, which can lead to improper posterior inference. To address these issues, functional Bayesian inference has recently been proposed leveraging functional priors, such as the emerging functional variational inference. In addition to variational methods, stochastic gradient Markov Chain Monte Carlo (MCMC) is another scalable and effective inference method for BNNs to asymptotically generate samples from the true posterior by simulating continuous dynamics. However, existing MCMC methods perform solely in parameter space and inherit the unresolved prior issues, while extending these dynamics to function space is a non-trivial undertaking. In this paper, we introduce novel functional MCMC schemes, including stochastic gradient versions, based on newly designed diffusion dynamics that can incorporate more informative functional priors. Moreover, we prove that the stationary measure of these functional dynamics is the target posterior over functions. Our functional MCMC schemes demonstrate improved performance in both predictive accuracy and uncertainty quantification on several tasks compared to naive parameter-space MCMC and functional variational inference.
Poster
Alessandro Mastrototaro · Mathias Müller · Jimmy Olsson
[ Hall A-E ]
Abstract
General state-space models (SSMs) are widely used in statistical machine learning and are among the most classical generative models for sequential time-series data. SSMs, comprising latent Markovian states, can be subjected to variational inference (VI), but standard VI methods like the importance-weighted autoencoder (IWAE) lack functionality for streaming data. To enable online VI in SSMs when the observations are received in real time, we propose maximising an IWAE-type variational lower bound on the asymptotic contrast function, rather than the standard IWAE ELBO, using stochastic approximation. Unlike the recursive maximum likelihood method, which directly maximises the asymptotic contrast, our approach, called online sequential IWAE (OSIWAE), allows for online learning of both model parameters and a Markovian recognition model for inferring latent states. By approximating filter state posteriors and their derivatives using sequential Monte Carlo (SMC) methods, we create a particle-based framework for online VI in SSMs. This approach is more theoretically well-founded than recently proposed online variational SMC methods. We provide rigorous theoretical results on the learning objective and a numerical study demonstrating the method's efficiency in learning model parameters and particle proposal kernels.
Poster
Viacheslav Yusupov · Maxim Rakhuba · Evgeny Frolov
[ Hall A-E ]
Abstract
In this paper, we propose a new geometric approach for knowledge graph completion via low rank tensor approximation. We augment a pretrained and well-established Euclidean model based on a Tucker tensor decomposition with a novel hyperbolic interaction term. This correction enables more nuanced capturing of distributional properties in data better aligned with real-world knowledge graphs. By combining two geometries together, our approach improves expressivity of the resulting model achieving new state-of-the-art link prediction accuracy with a significantly lower number of parameters compared to the previous Euclidean and hyperbolic models.
Poster
Peihao Wang · Zhiwen Fan · Dejia Xu · Dilin Wang · Sreyas Mohan · Forrest Iandola · Rakesh Ranjan · YILEI LI · qiang liu · Zhangyang Wang · Vikas Chandra
[ Hall A-E ]
Abstract
Score distillation has emerged as one of the most prevalent approaches for text-to-3D asset synthesis. Essentially, score distillation updates 3D parameters by lifting and back-propagating scores averaged over different views. In this paper, we reveal that the gradient estimation in score distillation is inherently subject to high variance. Through the lens of variance reduction, the effectiveness of SDS and VSD can be interpreted as applications of various control variates to the Monte Carlo estimator of the distilled score. Motivated by this rethinking and based on Stein's identity, we propose a more general solution to reduce variance for score distillation, termed *Stein Score Distillation (SSD)*. SSD incorporates control variates constructed via Stein's identity, allowing for arbitrary baseline functions. This enables us to include flexible guidance priors and network architectures to explicitly optimize for variance reduction. In our experiments, the overall pipeline, dubbed *SteinDreamer*, is implemented by instantiating the control variate with a monocular depth estimator. The results show that SSD can effectively reduce the distillation variance and consistently improve visual quality for both object- and scene-level generation.
Poster
Alicja Rączkowska · Aleksandra Osowska-Kurczab · Jacek Szczerbiński · Kalina Jasinska-Kobus · Klaudia Nazarko
[ Hall A-E ]
Abstract
Label noise remains a challenge for training robust classification models. Most methods for mitigating label noise have been benchmarked using primarily datasets with synthetic noise. While the need for datasets with realistic noise distribution has partially been addressed by web-scraped benchmarks such as WebVision and Clothing1M, those benchmarks are restricted to the computer vision domain. With the growing importance of Transformer-based models, it is crucial to establish text classification benchmarks for learning with noisy labels. In this paper, we present AlleNoise, a new curated text classification dataset with real-world instance-dependent label noise, containing over 500,000 examples across approximately 5600 classes, complemented with a meaningful, hierarchical taxonomy of categories. The noise distribution comes from actual users of a major e-commerce marketplace, so it realistically reflects the semantics of human mistakes. In addition to the noisy labels, we provide human-verified clean labels, which help to get a deeper insight into the noise distribution, unlike web-scraped datasets typically used in the field. We demonstrate that a representative selection of established methods for learning with noisy labels is inadequate to handle such real-world noise. In addition, we show evidence that these algorithms do not alleviate excessive memorization. As such, with AlleNoise, we set a …
Poster
Argyrios Gerogiannis · Yu-Han Huang · Venugopal V. Veeravalli
[ Hall A-E ]
Abstract
We study the problem of Non-Stationary Reinforcement Learning (NS-RL) without prior knowledge about the system’s non-stationarity. A state-of-the-art, black-box algorithm, known as MASTER, is considered, with a focus on identifying the conditions under which it can achieve its stated goals. Specifically, we prove that MASTER's non-stationarity detection mechanism is not triggered for practical choices of horizon, leading to performance akin to a random restarting algorithm. Moreover, we show that the regret bound for MASTER, while being order optimal, stays above the worst-case linear regret until unreasonably large values of the horizon. To validate these observations, MASTER is tested for the special case of piecewise stationary multi-armed bandits, along with methods that employ random restarting, and others that use quickest change detection to restart. A simple, order optimal random restarting algorithm, that has prior knowledge of the non-stationarity is proposed as a baseline. The behavior of the MASTER algorithm is validated in simulations, and it is shown that methods employing quickest change detection are more robust and consistently outperform MASTER and other random restarting approaches.
Poster
Yingqian Cui · Pengfei He · Xianfeng Tang · Qi He · Chen Luo · Jiliang Tang · Yue Xing
[ Hall A-E ]
Abstract
Few-shot Chain-of-Thought (CoT) prompting has demonstrated strong performance in improving the reasoning capabilities of large language models (LLMs). While theoretical investigations have been conducted to understand CoT, the underlying transformer used in these studies isolates the CoT reasoning process into separate in-context learning steps (Stepwise ICL). In this work, we theoretically show that, compared to Stepwise ICL, the transformer gains better error correction ability and more accurate predictions if the reasoning from earlier steps (Coherent CoT) is integrated. Given that this coherent reasoning changes the behavior of the transformer, we further investigate the sensitivity of the transformer with Coherent CoT when the demonstration examples are corrupted at the inference stage. Our theoretical results indicate that the transformer is more sensitive to errors in intermediate reasoning steps than in the final outcome. Building upon this observation, we propose an improvement on CoT by incorporating both correct and incorrect reasoning paths in the demonstration. Our experiments validate the effectiveness of the proposed approach.
Poster
Lynn Chua · Badih Ghazi · Charlie Harrison · Pritish Kamath · Ravi Kumar · Ethan Leeman · Pasin Manurangsi · Amer Sinha · Chiyuan Zhang
[ Hall A-E ]
Abstract
We introduce the _Balls-and-Bins_ sampling for differentially private (DP) optimization methods such as DP-SGD. While it has been common practice to use some form of shuffling in DP-SGD implementations, privacy accounting algorithms have typically assumed that Poisson subsampling is used instead. Recent work by Chua et al. (2024), however, pointed out that shuffling-based DP-SGD can have a much larger privacy cost in practical regimes of parameters. In this work we show that Balls-and-Bins sampling achieves the "best of both" samplers: its implementation is similar to that of Shuffling, and models trained using DP-SGD with Balls-and-Bins sampling achieve utility comparable to those trained using DP-SGD with Shuffling at the same noise multiplier; yet Balls-and-Bins sampling enjoys similar-or-better privacy amplification than Poisson subsampling in practical regimes.
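The contrast between the two samplers can be shown in a few lines (an illustrative sketch under stated assumptions, not the paper's implementation): Poisson subsampling includes each example in each step's batch independently, while Balls-and-Bins throws each example into exactly one uniformly random step per epoch.

```python
import random
from collections import defaultdict

def poisson_batches(n, steps, q, rng):
    """Poisson subsampling: each example joins each step's batch
    independently with probability q, so an example may appear in zero
    or in many batches within an epoch."""
    return [[i for i in range(n) if rng.random() < q] for _ in range(steps)]

def balls_and_bins_batches(n, steps, rng):
    """Balls-and-Bins: each example (ball) is thrown into exactly one
    uniformly random step (bin).  Every example is used once per epoch,
    as with shuffling, while its step index is independent across
    examples -- the property behind the Poisson-like amplification."""
    bins = defaultdict(list)
    for i in range(n):
        bins[rng.randrange(steps)].append(i)
    return [bins[t] for t in range(steps)]

rng = random.Random(0)
bb = balls_and_bins_batches(100, 10, rng)
pb = poisson_batches(100, 10, 0.1, rng)
```

Note that Balls-and-Bins yields variable batch sizes, like Poisson subsampling, but never drops or duplicates an example within an epoch, like Shuffling.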
Poster
Ruofeng Yang · Bo Jiang · Shuai Li
[ Hall A-E ]
Abstract
Recently, variance exploding (VE) diffusion models have achieved state-of-the-art (SOTA) performance in two implementations: (1) the SDE-based implementation and (2) the probability flow ODE (PFODE) implementation. However, only a few works analyze the iteration complexity of VE-based models, and most focus on the SDE-based implementation under strong assumptions. In this work, we prove the first polynomial iteration complexity under the realistic bounded support assumption for these two implementations. For the SDE-based implementation, we explain why the current SOTA VE-based model performs better than previous VE models. After that, we provide an improved result under the linear subspace data assumption and explain the strong performance of VE models on manifold data. For the PFODE-based implementation, current results depend exponentially on problem parameters. Inspired by the previous predictor-corrector analysis framework, we propose the PFODE-Corrector algorithm and prove polynomial complexity for the basic algorithm with uniform stepsize. After that, we show that VE-based models are better suited to large stepsizes and propose an exponential-decay stepsize version of the algorithm to improve the results.
Poster
Zhaolu Liu · Mauricio Barahona · Robert Peach
[ Hall A-E ]
Abstract
Traditional measures based solely on pairwise associations often fail to capture the complex statistical structure of multivariate data. Existing approaches for identifying information shared among $d>3$ variables are frequently computationally intractable, asymmetric with respect to a target variable, or unable to account for all the ways in which the joint probability distribution can be factorised. Here we present a systematic framework based on lattice theory to derive higher-order information-theoretic measures for multivariate data. Our construction uses lattice and operator function pairs, whereby an operator function is applied over a lattice that represents the algebraic relationships among variables. We show that many commonly used measures can be derived within this framework, yet they fail to capture all interactions for $d>3$, either because they are defined on restricted sublattices, or because the use of the KL divergence as an operator function, a typical choice, leads to undesired disregard of groups of interactions. To fully characterise all interactions among $d$ variables, we introduce the Streitberg Information, which is defined over the full partition lattice and uses generalised divergences (beyond KL) as operator functions. We validate the Streitberg Information on synthetic data, and illustrate its application in detecting complex interactions among stocks, decoding …
Poster
Lucia Pezzetti · Stefano Favaro · Stefano Peluchetti
[ Hall A-E ]
Abstract
Bayesian Neural Networks represent a fascinating confluence of deep learning and probabilistic reasoning, offering a compelling framework for understanding uncertainty in complex predictive models. In this paper, we investigate the use of the preconditioned Crank-Nicolson algorithm and its Langevin version to sample from a reparametrised posterior distribution of the neural network's weights, as the widths grow larger. In addition to being robust in the infinite-dimensional setting, we prove that the acceptance probabilities of the proposed algorithms approach 1 as the width of the network increases, independently of any stepsize tuning. Moreover, we examine and compare how the mixing speeds of the underdamped Langevin Monte Carlo, the preconditioned Crank-Nicolson and the preconditioned Crank-Nicolson Langevin samplers are influenced by changes in the network width in some real-world cases. Our findings suggest that, in wide Bayesian Neural Networks configurations, the preconditioned Crank-Nicolson algorithm allows for a scalable and more efficient sampling of the reparametrised posterior distribution, as also evidenced by a higher effective sample size and improved diagnostic results compared with the other analysed algorithms.
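A minimal sketch of the preconditioned Crank-Nicolson sampler on a toy Gaussian target (function names and parameters here are illustrative; the paper applies the sampler to reparametrised BNN posteriors):

```python
import math
import random

def pcn_sample(loglik, dim, beta, n_iters, rng):
    """Preconditioned Crank-Nicolson MCMC under a standard Gaussian prior.
    The proposal x' = sqrt(1 - beta^2) * x + beta * xi (xi ~ N(0, I))
    preserves the prior, so the accept ratio involves only the
    log-likelihood -- the prior terms cancel exactly."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    lx = loglik(x)
    c = math.sqrt(1.0 - beta * beta)
    samples, accepts = [], 0
    for _ in range(n_iters):
        prop = [c * xi + beta * rng.gauss(0.0, 1.0) for xi in x]
        lp = loglik(prop)
        if math.log(rng.random()) < lp - lx:   # Metropolis-Hastings accept
            x, lx = prop, lp
            accepts += 1
        samples.append(list(x))
    return samples, accepts / n_iters

# Toy target: Gaussian likelihood centred at 1; with the N(0, 1) prior the
# posterior is N(0.5, 0.5) in each coordinate.
loglik = lambda x: -0.5 * sum((xi - 1.0) ** 2 for xi in x)
samples, acc_rate = pcn_sample(loglik, dim=3, beta=0.3,
                               n_iters=5000, rng=random.Random(0))
```

Because the proposal is prior-preserving, the acceptance probability does not collapse as `dim` grows, which is the dimension-robustness property the abstract leverages in the wide-network limit.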
Poster
Shaan Ul Haque · Siva Theja Maguluri
[ Hall A-E ]
Abstract
Motivated by engineering applications such as resource allocation in networks and inventory systems, we consider average-reward Reinforcement Learning with unbounded state space and reward function. Recent work by Murthy et al. (2024) studied this problem in the actor-critic framework and established finite sample bounds assuming access to a critic with certain error guarantees. We complement their work by studying Temporal Difference (TD) learning with linear function approximation and establishing finite-time bounds with the optimal sample complexity. These results are obtained using the following general-purpose theorem for non-linear Stochastic Approximation (SA). Suppose that one constructs a Lyapunov function for a non-linear SA satisfying a certain drift condition. Then, our theorem establishes finite-time bounds when this SA is driven by unbounded Markovian noise under suitable conditions. It serves as a black-box tool to generalize sample guarantees on SA from the i.i.d. or martingale difference case to potentially unbounded Markovian noise. The generality and the mild assumptions of the setup enable broad applicability of our theorem. We illustrate its power by studying two more systems: (i) We improve upon the finite-time bounds of Q-learning in Chen et al. (2024) by tightening the error bounds and also allowing for a larger class of behavior policies. (ii) …
Poster
Yang Li · Junier Oliva
[ Hall A-E ]
Abstract
Many real-world situations allow for the acquisition of additional relevant information when making decisions with limited or uncertain data. However, traditional RL approaches either require all features to be acquired beforehand (e.g. in an MDP) or regard part of them as missing data that cannot be acquired (e.g. in a POMDP). In this work, we consider RL models that may actively acquire features from the environment to improve decision quality and certainty, while automatically balancing the cost of the feature-acquisition process against the reward of the task-decision process. We propose the Active-Acquisition POMDP and identify two types of acquisition process for different application domains. To assist the agent in the actively-acquired partially-observed environment and alleviate the exploration-exploitation dilemma, we develop a model-based approach, where a deep generative model is utilized to capture the dependencies among the features and impute the unobserved ones. The imputations essentially represent the beliefs of the agent. Equipped with the dynamics model, we develop hierarchical RL algorithms to resolve both types of AA-POMDPs. Empirical results demonstrate that our approach achieves considerably better performance than existing POMDP-RL solutions.
Poster
Amir Joudaki · Thomas Hofmann
[ Hall A-E ]
Abstract
Understanding how neural networks transform input data across layers is fundamental to unraveling their learning and generalization capabilities. Although prior work has used insights from kernel methods to study neural networks, a global analysis of how the similarity between hidden representations evolves across layers remains underexplored. In this paper, we introduce a theoretical framework for the evolution of the kernel sequence, which measures the similarity between the hidden representations of two different inputs. Operating under the mean-field regime, we show that the kernel sequence evolves deterministically via a kernel map, which depends only on the activation function. By expanding the activation in Hermite polynomials and using their algebraic properties, we derive an explicit form for the kernel map and fully characterize its fixed points. Our analysis reveals that for nonlinear activations, the kernel sequence converges globally to a unique fixed point, which can correspond to orthogonal or similar representations depending on the activation and network architecture. We further extend our results to networks with residual connections and normalization layers, demonstrating similar convergence behaviors. This work provides new insights into the implicit biases of deep neural networks and how architectural choices influence the evolution of representations across layers.
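For ReLU, the kernel map has a well-known closed form (the normalized degree-1 arc-cosine kernel), so the convergence of the kernel sequence can be observed numerically. This is a hedged sketch of one special case, not the paper's general Hermite-expansion framework:

```python
import math

def relu_kernel_map(rho):
    """Mean-field kernel map for ReLU (normalized arc-cosine kernel of
    degree 1): maps the cosine similarity of two inputs' hidden
    representations at layer l to their similarity at layer l + 1."""
    rho = max(-1.0, min(1.0, rho))   # guard against rounding drift
    return (math.sqrt(1.0 - rho * rho)
            + rho * (math.pi - math.acos(rho))) / math.pi

def kernel_sequence(rho0, depth):
    """The kernel sequence rho_0, rho_1, ... across `depth` layers."""
    seq = [rho0]
    for _ in range(depth):
        seq.append(relu_kernel_map(seq[-1]))
    return seq

seq = kernel_sequence(0.2, 50)   # two moderately dissimilar inputs
```

For ReLU the sequence climbs monotonically toward the fixed point rho = 1, i.e. deep representations of distinct inputs become increasingly similar, matching the "similar representations" regime the abstract describes.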
Poster
Andi Zhang · Tim Xiao · Weiyang Liu · Robert Bamler · Damon Wischik
[ Hall A-E ]
Abstract
We revisit the likelihood ratio between a pretrained large language model (LLM) and its finetuned variant as a criterion for out-of-distribution (OOD) detection. The intuition behind this criterion is that the pretrained LLM has prior knowledge about OOD data owing to its large amount of training data, and once finetuned on the in-distribution data, the LLM has sufficient knowledge to distinguish the difference. Leveraging the power of LLMs, we show that the likelihood ratio can serve as an effective OOD detection criterion. Moreover, we apply the proposed LLM-based likelihood ratio to detect OOD questions in question-answering (QA) systems, which can be used to improve the performance of specialized LLMs for general questions. Given that likelihoods can be easily obtained from the loss functions of contemporary neural network frameworks, it is straightforward to implement this approach in practice with **three lines** of code. Since both pretrained LLMs and their various finetuned variants are widely available on platforms such as Hugging Face, our proposed criterion can be effortlessly incorporated for OOD detection without the need for further training. We conduct comprehensive evaluation across multiple settings, including far OOD, near OOD, spam detection, and QA scenarios, to demonstrate …
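As a toy illustration of the criterion (unigram models stand in for the pretrained and finetuned LLMs here; with real models the ratio is just the difference of their per-sequence log-losses), the score reduces to a log-likelihood subtraction:

```python
import math

def log_likelihood(model, tokens):
    """Log-likelihood under a toy unigram 'language model' -- a dict
    mapping token -> probability, standing in for an LLM in this sketch."""
    return sum(math.log(model.get(t, 1e-8)) for t in tokens)

def ood_score(pretrained, finetuned, tokens):
    """Likelihood-ratio criterion: log p_ft(x) - log p_pre(x).
    Low scores flag text as out-of-distribution for the finetuning data."""
    return log_likelihood(finetuned, tokens) - log_likelihood(pretrained, tokens)

# Hypothetical distributions: finetuning concentrated on medical text.
pretrained = {"the": 0.2, "dose": 0.1, "game": 0.1, "patient": 0.1, "score": 0.1}
finetuned  = {"the": 0.2, "dose": 0.3, "game": 0.01, "patient": 0.3, "score": 0.01}

in_dist  = ood_score(pretrained, finetuned, ["the", "patient", "dose"])
out_dist = ood_score(pretrained, finetuned, ["the", "game", "score"])
```

Text resembling the finetuning distribution scores high, text the finetuned model has drifted away from scores low; thresholding the score gives the OOD decision.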
Poster
Xin Liu · Weijia Zhang · Min-Ling Zhang
[ Hall A-E ]
Abstract
In survival analysis, subjects often face competing risks; for example, individuals with cancer may also suffer from heart disease or other illnesses, which can jointly influence the prognosis of risks and censoring. Traditional survival analysis methods often treat competing risks as independent and fail to accommodate the dependencies between different conditions. In this paper, we introduce HACSurv, a survival analysis method that learns Hierarchical Archimedean Copula structures and cause-specific survival functions from data with competing risks. HACSurv employs a flexible dependency structure using hierarchical Archimedean copulas to represent the relationships between competing risks and censoring. By capturing the dependencies between risks and censoring, HACSurv improves the accuracy of survival predictions and offers insights into risk interactions. Experiments on synthetic datasets demonstrate that our method can accurately identify the complex dependency structure and precisely predict survival distributions, whereas the compared methods exhibit significant deviations between their predictions and the true distributions. Experiments on multiple real-world datasets also demonstrate that our method achieves better survival prediction compared to previous state-of-the-art methods.
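A minimal sketch of the hierarchical Archimedean copula construction that HACSurv builds on, using the Clayton family (the parameters below are illustrative toy choices, not learned from data):

```python
def clayton(theta):
    """Bivariate Clayton copula C(u, v); theta > 0 controls dependence
    strength (theta -> 0 approaches independence, C(u, v) -> u * v)."""
    def C(u, v):
        return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)
    return C

def hierarchical(outer, inner):
    """Minimal hierarchical Archimedean copula: the inner copula couples
    (u, v), and the outer copula couples that group with a third margin w.
    For nested Clayton copulas, validity requires outer theta <= inner
    theta, which the toy parameters below satisfy."""
    def C(u, v, w):
        return outer(inner(u, v), w)
    return C

# Toy reading: two competing risks tightly coupled to each other and more
# loosely coupled to censoring -- parameters illustrative, not fitted.
C3 = hierarchical(outer=clayton(0.5), inner=clayton(4.0))
joint = C3(0.5, 0.5, 0.5)
```

The nesting lets closely related risks share a strong dependence parameter while remaining only weakly dependent on censoring, which is the structural flexibility the abstract attributes to hierarchical Archimedean copulas.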
Poster
Chenyang Li · Yingyu Liang · Zhenmei Shi · Zhao Song · Tianyi Zhou
[ Hall A-E ]
Abstract
In the evolving landscape of machine learning, a pivotal challenge lies in deciphering the internal representations harnessed by neural networks and Transformers. Building on recent progress toward comprehending how networks execute distinct target functions, our study embarks on an exploration of the underlying reasons behind networks adopting specific computational strategies. We direct our focus to the complex algebraic learning task of modular addition involving $k$ inputs. Our research presents a thorough analytical characterization of the features learned by stylized one-hidden layer neural networks and one-layer Transformers in addressing this task. A cornerstone of our theoretical framework is the elucidation of how the principle of margin maximization shapes the features adopted by one-hidden layer neural networks. Let $p$ denote the modulus, $D_p$ denote the dataset of modular arithmetic with $k$ inputs and $m$ denote the network width. We demonstrate that with a neuron count of $ m \geq 2^{2k-2} \cdot (p-1) $, these networks attain a maximum $ L_{2,k+1} $-margin on the dataset $ D_p $. Furthermore, we establish that each hidden-layer neuron aligns with a specific Fourier spectrum, integral to solving modular addition problems. By correlating our findings with the empirical observations of similar studies, we contribute to a deeper comprehension …
Poster
Kenghao Zheng · Zi Long · Shuxin Wang
[ Hall A-E ]
Abstract
Time series forecasting plays a critical role across various sectors, including economics, energy, transportation planning, and weather prediction. However, the inherent complexity and non-stationarity of real-world systems present considerable challenges for accurate modeling. Traditional approaches, often reliant on high-dimensional embeddings, tend to obscure multivariate relationships and suffer from performance limitations, particularly when dealing with intricate temporal patterns. To address these issues, we propose HAR-former, a Hybrid Transformer with an Adaptive Time-Frequency Representation Matrix, which combines the strengths of Multi-Layer Perceptrons (MLPs) and Transformers to process trend and seasonal components, respectively. HAR-former leverages a novel adaptive time-frequency representation matrix to bridge the gap between the time and frequency domains, allowing the model to capture both long-range dependencies and localized patterns. Extensive experimental evaluation on eight real-world benchmark datasets demonstrates that HAR-former significantly outperforms existing state-of-the-art (SOTA) methods, establishing it as a robust solution for complex time series forecasting tasks.
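The trend/seasonal split that HAR-former's two branches consume can be illustrated with a plain moving-average decomposition (a toy stand-in; the paper's adaptive time-frequency representation matrix is not reproduced here):

```python
import math

def decompose(series, window):
    """Split a series into a trend (centered moving average, shrinking
    the window at the edges) and a residual 'seasonal' component -- the
    kind of split routed to separate MLP and Transformer branches."""
    half = window // 2
    trend = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        trend.append(sum(series[lo:hi]) / (hi - lo))
    seasonal = [x - t for x, t in zip(series, trend)]
    return trend, seasonal

# Toy series: a 12-step seasonal cycle on top of a linear upward trend.
series = [math.sin(2.0 * math.pi * i / 12.0) + 0.05 * i for i in range(48)]
trend, seasonal = decompose(series, window=12)
```

The smooth trend component is cheap for an MLP to extrapolate, while the oscillatory residual carries the localized patterns a Transformer branch can attend over.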
Poster
Xingchi Li · Guanxun Li · Xianyang Zhang
[ Hall A-E ]
Abstract
Watermarking techniques embed statistical signals within content generated by large language models to help trace its source. Although existing methods perform well on long texts, their effectiveness significantly decreases for shorter texts. We introduce a statistical detection approach that improves the power of watermark detection, particularly in shorter texts. Our method leverages both the watermark key sequence and the next token probabilities (NTPs) to determine whether a text is generated by a large language model. We demonstrate the optimality of our approach and analyze its power properties. We also investigate an approach to estimating NTPs and extend our method to scenarios where texts face potential attacks such as substitutions, insertions, or deletions. We validate the effectiveness of our technique using texts generated by Meta-Llama-3-8B from Meta and Mistral-7B-v0.1 from Mistral AI, utilizing prompts extracted from Google's C4 dataset. In scenarios without attacks and with short text lengths, our method demonstrates approximately 65% power improvement compared to the baseline method on average. We release all code publicly at https://github.com/doccstat/llm-watermark-adaptive.
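A toy sketch of key-based watermark detection in the spirit of the abstract (this illustrates only a baseline Gumbel-trick statistic; the paper's NTP-weighted test and its handling of substitution/insertion/deletion attacks are not reproduced):

```python
import math
import random

def detection_stat(us):
    """Baseline watermark statistic: sum of -log(1 - u_i) over the
    key-derived uniforms attached to the generated tokens.  For human
    text each u_i is Uniform(0, 1), so the statistic is Gamma(n, 1)."""
    return sum(-math.log(1.0 - u) for u in us)

def gumbel_pick(probs, us):
    """Gumbel-trick watermarked decoding for one step: choose the token
    maximising u_t ** (1 / p_t), biasing generation toward high-u tokens."""
    return max(range(len(probs)), key=lambda t: us[t] ** (1.0 / probs[t]))

rng = random.Random(0)
n_tokens, vocab = 40, 20
null_scores, wm_scores = [], []
for _ in range(200):
    # Human text: token choices are independent of the key uniforms.
    null_scores.append(detection_stat([rng.random() for _ in range(n_tokens)]))
    # Watermarked text: each token is the Gumbel-trick argmax, so its own
    # uniform is stochastically larger than Uniform(0, 1).
    us_seq = []
    for _ in range(n_tokens):
        probs = [1.0 / vocab] * vocab          # flat NTPs for the toy
        us = [rng.random() for _ in range(vocab)]
        us_seq.append(us[gumbel_pick(probs, us)])
    wm_scores.append(detection_stat(us_seq))
```

Even at a short length of 40 tokens the two score populations separate cleanly in this toy; incorporating the actual NTPs, as the paper proposes, sharpens exactly this short-text regime.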