Poster
Samuel Tesfazgi · Leonhard Sprandl · Sandra Hirche
[ Hall A-E ]
Abstract
The practical deployment of learning-based autonomous systems would greatly benefit from tools that flexibly obtain safety guarantees in the form of certificate functions from data. While the geometrical properties of such certificate functions are well understood, synthesizing them using machine learning techniques still remains a challenge. To mitigate this issue, we propose a diffeomorphic function learning framework where prior structural knowledge of the desired output is encoded in the geometry of a simple surrogate function, which is subsequently augmented through an expressive, topology-preserving state-space transformation. Thereby, we achieve an indirect function approximation framework that is guaranteed to remain in the desired hypothesis space. To this end, we introduce a novel approach to construct diffeomorphic maps based on RBF networks, which facilitate precise, local transformations around data. Finally, we demonstrate our approach by learning diffeomorphic Lyapunov functions from real-world data and apply our method to different attractor systems.
Poster
Katherine Tieu · Dongqi Fu · Jun Wu · Jingrui He
[ Hall A-E ]
Abstract
In the era of foundation models, Out-of-Distribution (OOD) problems, i.e., the data discrepancy between the training environments and testing environments, hinder AI generalization. Further, relational data such as graphs violate the Independent and Identically Distributed (IID) condition, making the problem more challenging, and harder still when time is involved. Motivated by this, to realize robust invariant learning over temporal graphs, we investigate which components of temporal graphs are most invariant and representative with respect to labels. With the Information Bottleneck (IB) method, we propose an error-bounded Invariant Link Selector that can distinguish invariant components from variant components during the training process, making the deep learning model generalizable across different testing scenarios. Besides deriving a series of rigorous generalizable optimization functions, we also equip the training with task-specific loss functions, e.g., temporal link prediction, so that pre-trained models can solve real-world application tasks such as citation recommendation and merchandise recommendation, as demonstrated in our experiments against state-of-the-art (SOTA) methods. Our code is available at https://github.com/kthrn22/OOD-Linker
Poster
Linlin Yu · Kangshuo Li · Pritom Saha · Yifei Lou · Feng Chen
[ Hall A-E ]
Abstract
Accurate quantification of both aleatoric and epistemic uncertainties is essential when deploying Graph Neural Networks (GNNs) in high-stakes applications such as drug discovery and financial fraud detection, where reliable predictions are critical. Although Evidential Deep Learning (EDL) efficiently quantifies uncertainty using a Dirichlet distribution over predictive probabilities, existing EDL-based GNN (EGNN) models require modifications to the network architecture and retraining, failing to take advantage of pre-trained models. We propose a plug-and-play framework for uncertainty quantification in GNNs that works with pre-trained models without the need for retraining. Our Evidential Probing Network (EPN) uses a lightweight Multi-Layer Perceptron (MLP) head to extract evidence from learned representations, allowing efficient integration with various GNN architectures. We further introduce evidence-based regularization techniques, referred to as EPN-reg, to enhance the estimation of epistemic uncertainty with theoretical justifications. Extensive experiments demonstrate that the proposed EPN-reg achieves state-of-the-art performance in accurate and efficient uncertainty quantification, making it suitable for real-world deployment.
Poster
Tyler LaBonte · Kuo-Wei Lai · Vidya Muthukumar
[ Hall A-E ]
Abstract
Modern machine learning methods have recently demonstrated remarkable capability to generalize under task shift, where latent knowledge is transferred to a different, often more difficult, task under a similar data distribution. We investigate this phenomenon in an overparameterized linear regression setting where the task shifts from classification during training to regression during evaluation. In the zero-shot case, wherein no regression data is available, we prove that task shift is impossible in both sparse signal and random signal models for any Gaussian covariate distribution. In the few-shot case, wherein limited regression data is available, we propose a simple postprocessing algorithm which asymptotically recovers the ground-truth predictor. Our analysis leverages a fine-grained characterization of individual parameters arising from minimum-norm interpolation which may be of independent interest. Our results show that while minimum-norm interpolators for classification cannot transfer to regression a priori, they experience surprisingly structured attenuation which enables successful task shift with limited additional data.
Poster
Antonin Schrab · Ilmun Kim
[ Hall A-E ]
Abstract
We propose a general method for constructing robust permutation tests under data corruption. The proposed tests effectively control the non-asymptotic type I error under data corruption, and we prove their consistency in power under minimal conditions. This contributes to the practical deployment of hypothesis tests for real-world applications with potential adversarial attacks. For the two-sample and independence settings, we show that our kernel robust tests are minimax optimal, in the sense that they are guaranteed to be non-asymptotically powerful against alternatives uniformly separated from the null in the kernel MMD and HSIC metrics at some optimal rate (tight with matching lower bound). We point out that existing differentially private tests can be adapted to be robust to data corruption, and we demonstrate in experiments that our proposed tests achieve much higher power than these private tests. Finally, we provide publicly available implementations and empirically illustrate the practicality of our robust tests.
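As background for the permutation-testing machinery this abstract builds on, the following is a minimal sketch of a standard (non-robust) kernel two-sample permutation test using a Gaussian-kernel MMD statistic. The function names, the fixed bandwidth, and the biased MMD estimator are illustrative assumptions; the corruption-robust construction is the paper's contribution and is not shown here.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    # Pairwise Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)).
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

def mmd_stat(Z, n, bandwidth=1.0):
    # Biased MMD^2 estimate between the first n rows of Z and the rest.
    X, Y = Z[:n], Z[n:]
    return (gaussian_kernel(X, X, bandwidth).mean()
            + gaussian_kernel(Y, Y, bandwidth).mean()
            - 2 * gaussian_kernel(X, Y, bandwidth).mean())

def permutation_test(X, Y, num_perms=200, alpha=0.05, seed=0):
    # Standard permutation test: reject if the observed MMD^2 is large
    # relative to statistics computed under random relabelings.
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    n = len(X)
    observed = mmd_stat(Z, n)
    perm_stats = [mmd_stat(Z[rng.permutation(len(Z))], n)
                  for _ in range(num_perms)]
    # "+1" correction keeps the test exact at finite sample sizes.
    p = (1 + sum(s >= observed for s in perm_stats)) / (1 + num_perms)
    return p <= alpha, p
```

Under the null, the relabeled statistics are exchangeable with the observed one, which is what gives the non-asymptotic type I error control referred to in the abstract.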
Poster
James McInerney · Nathan Kallus
[ Hall A-E ]
Abstract
Uncertainty quantification in deep learning is crucial for safe and reliable decision-making in downstream tasks. Existing methods quantify uncertainty at the last layer or other approximations of the network which may miss some sources of uncertainty in the model. To address this gap, we propose an uncertainty quantification method for large networks based on variation due to regularization. Essentially, predictions that are more (less) sensitive to the regularization of network parameters are less (more, respectively) certain. This principle can be implemented by deterministically tweaking the training loss during the fine-tuning phase and reflects confidence in the output as a function of all layers of the network. We show that regularization variation (RegVar) provides rigorous uncertainty estimates that, in the infinitesimal limit, exactly recover the Laplace approximation in Bayesian deep learning. We demonstrate its success in several deep learning architectures, showing it can scale tractably with the network size while maintaining or improving uncertainty quantification quality. Our experiments across multiple datasets show that RegVar not only identifies uncertain predictions effectively but also provides insights into the stability of learned representations.
Poster
Inioluwa Raji · Lydia Liu
[ Hall A-E ]
Abstract
Automated decision systems (ADS) are broadly deployed to inform or support human decision-making across a wide range of consequential contexts. However, various context-specific details complicate the goal of establishing meaningful experimental evaluations for prediction-based interventions. Notably, specific experimental design decisions may induce cognitive biases in human decision makers, which could then significantly alter the observed effect sizes of the prediction intervention. In this paper, we formalize and investigate various models of human decision-making in the presence of a predictive model aid. We show that each of these behavioral models produces dependencies across decision subjects and results in the violation of existing assumptions, with consequences for treatment effect estimation. This work aims to further advance the scientific validity of intervention-based evaluation schemes for the assessment of ADS deployments.
Poster
Antonio Ribeiro · Thomas Schön · Dave Zachariah · Francis Bach
[ Hall A-E ]
Abstract
Adversarial training can be used to learn models that are robust against perturbations. For linear models, it can be formulated as a convex optimization problem. Compared to methods proposed in the context of deep learning, leveraging the optimization structure allows significantly faster convergence rates. Still, the use of generic convex solvers can be inefficient for large-scale problems. Here, we propose tailored optimization algorithms for the adversarial training of linear models, which render large-scale regression and classification problems more tractable. For regression problems, we propose a family of solvers based on iterative ridge regression and, for classification, a family of solvers based on projected gradient descent. The methods are based on extended variable reformulations of the original problem. We illustrate their efficiency in numerical examples.
Poster
Måns Magnusson · Jakob Torgander · Paul Bürkner · Lu Zhang · Bob Carpenter · Aki Vehtari
[ Hall A-E ]
Abstract
The general applicability and robustness of posterior inference algorithms is critical to widely used probabilistic programming languages such as Stan, PyMC, Pyro, and Turing.jl. When designing a new inference algorithm, whether it involves Monte Carlo sampling or variational approximation, the fundamental problem is evaluating its accuracy and efficiency across a range of representative target posteriors. To solve this problem, we propose posteriordb, a database of models and data sets defining target densities along with reference Monte Carlo draws. We further provide a guide to the best practices in using posteriordb for algorithm evaluation and comparison. To provide a wide range of realistic posteriors, posteriordb currently comprises 120 representative models with data, and has been instrumental in developing several inference algorithms.
Poster
Mahbod Majid · Rattana Pukdee · Vishwajeet Agrawal · Burak Varici · Pradeep Ravikumar
[ Hall A-E ]
Abstract
Self-supervised learning methods that mask parts of the input data and train models to predict the missing components have led to significant advances in machine learning. These approaches learn conditional distributions $p(x_T \mid x_S)$ simultaneously, where $x_S$ and $x_T$ are subsets of the observed variables. In this paper, we examine the core problem of when all these conditional distributions are consistent with some joint distribution, and whether common models used in practice can learn consistent conditionals. We explore this problem in two settings. First, for the complementary conditioning sets where $S \cup T$ is the complete set of variables, we introduce the concept of path consistency, a necessary condition for a consistent joint. Second, we consider the case where we have access to $p(x_T \mid x_S)$ for all subsets $S$ and $T$. In this case, we propose the concepts of autoregressive and swap consistency, which we show are necessary and sufficient conditions for a consistent joint. For both settings, we analyze when these consistency conditions hold and show that standard discriminative models \emph{may fail to satisfy them}. Finally, we corroborate via experiments that the proposed consistency measures can be used as proxies for evaluating the consistency of conditionals $p(x_T \mid x_S)$, …
Poster
Alex Chen · Philippe Chlenski · Kenneth Munyuza · Antonio Moretti · Christian Andersson Naesseth · Itsik Pe'er
[ Hall A-E ]
Abstract
Hyperbolic space naturally encodes hierarchical structures such as phylogenies (binary trees), where inward-bending geodesics reflect paths through least common ancestors, and the exponential growth of neighborhoods mirrors the super-exponential scaling of topologies. This scaling challenge limits the efficiency of Euclidean-based approximate Bayesian inference methods. Motivated by the geometric connections between trees and hyperbolic space, we develop novel hyperbolic extensions of two sequential search algorithms: Combinatorial and Nested Combinatorial Sequential Monte Carlo (\textsc{Csmc} and \textsc{Ncsmc}). Our approach introduces consistent and unbiased estimators, along with variational inference methods (\textsc{H-Vcsmc} and \textsc{H-Vncsmc}), which outperform their Euclidean counterparts. Empirical results demonstrate improved speed, scalability and performance in high-dimensional Bayesian phylogenetic inference tasks.
Poster
Zhirui Chen · P. N. Karthik · Yeow Meng Chee · Vincent Tan
[ Hall A-E ]
Abstract
We consider a multi-armed bandit setting with finitely many arms, in which each arm yields an $M$-dimensional vector reward upon selection. We assume that the reward of each dimension (a.k.a. {\em objective}) is generated independently of the others. The best arm for any given objective is the arm whose mean reward in that objective is largest. The end goal is to identify the best arm of {\em every} objective in the shortest (expected) time subject to an upper bound on the probability of error (i.e., fixed-confidence regime). We establish a problem-dependent lower bound on the limiting growth rate of the expected stopping time, in the limit of vanishing error probabilities. This lower bound, we show, is characterised by a max-min optimisation problem that is computationally expensive to solve at each time step. We propose an algorithm that uses the novel idea of {\em surrogate proportions} to sample the arms at each time step, eliminating the need to solve the max-min optimisation problem at each step. We demonstrate theoretically that our algorithm is asymptotically optimal. In addition, we provide extensive empirical studies to substantiate the efficiency of our algorithm. While existing works on pure exploration with multi-objective multi-armed bandits …
Poster
Daniel Marks · Dario Paccagnan
[ Hall A-E ]
Abstract
Generalisation bounds are crucial for providing data-driven models with performance and safety guarantees. In this respect, bounds that do not require a held-out test set are particularly valuable as they allow the use of all data for training. While many such bounds do not improve upon the train-test approach, which remains the gold standard, the P2L algorithm (Paccagnan et al., 2023) has shown great potential. However, P2L comes with limitations, including computational overhead, reliance on consistent data, and restriction to non-Bayesian settings. In this work, we overcome these challenges in general settings and employ the corresponding results to show that classical Gaussian process (GP) training procedures can be interpreted as instantiations of P2L, thus inheriting tight, self-certified bounds. Three contributions underpin these conclusions. First, we introduce early stopping in P2L, equipping it with a tight generalisation bound to reduce training costs and address the non-consistent case. Second, we adapt P2L to the Bayesian setting and demonstrate its equivalence to posterior updating in a hierarchical model. Third, we show that greedy subset-of-data GPs are special P2L instantiations. Numerical evidence shows that the resulting P2L bounds we obtain compare favourably with the train-test and PAC-Bayes approaches on various real-world datasets.
Poster
Rickmer Schulte · David Rügamer
[ Hall A-E ]
Abstract
Additive models (AMs) have sparked a lot of interest in machine learning recently, allowing the incorporation of interpretable structures into a wide range of model classes. Many commonly used approaches to fit a wide variety of potentially complex additive models build on the idea of boosting additive models. While boosted additive models (BAMs) work well in practice, certain theoretical aspects are still poorly understood, including general convergence behavior and what optimization problem is being solved when accounting for the implicit regularizing nature of boosting. In this work, we study the solution paths of BAMs and establish connections with other approaches for certain classes of problems. Along these lines, we derive novel convergence results for BAMs, which yield crucial insights into the inner workings of the method. While our results generally provide reassuring theoretical evidence for the practical use of BAMs, they also uncover some "pathologies" of boosting for certain additive model classes concerning their convergence behavior that require caution in practice. We empirically validate our theoretical findings through several numerical experiments.
Poster
Haoyang Hong · Ioanna Papanikolaou · Sonali Parbhoo
[ Hall A-E ]
Abstract
Mitigating shortcuts, where models exploit spurious correlations in training data, remains a significant challenge for improving generalization. Regularization methods have been proposed to address this issue by enhancing model generalizability. However, we demonstrate that these methods can sometimes overregularize, inadvertently suppressing causal features along with spurious ones. In this work, we analyze the theoretical mechanisms by which regularization mitigates shortcuts and explore the limits of its effectiveness. Additionally, we identify the conditions under which regularization can successfully eliminate shortcuts without compromising causal features. Through experiments on synthetic and real-world datasets, our comprehensive analysis provides valuable insights into the strengths and limitations of regularization techniques for addressing shortcuts, offering guidance for developing more robust models.
Poster
Haoye Lu · Spencer Szabados · Yaoliang Yu
[ Hall A-E ]
Abstract
In recent years, diffusion models have become the leading approach for distribution learning. This paper focuses on structure-preserving diffusion models (SPDM), a specific subset of diffusion processes tailored for distributions with inherent structures, such as group symmetries. We complement existing sufficient conditions for constructing SPDMs by proving complementary necessary ones. Additionally, we propose a new framework that considers the geometric structures affecting the diffusion process. Leveraging this framework, we design a structure-preserving bridge model that maintains alignment between the model’s endpoint couplings. Empirical evaluations on equivariant diffusion models demonstrate their effectiveness in learning symmetric distributions and modeling transitions between them. Experiments on real-world medical images confirm that our models preserve equivariance while maintaining high sample quality. We also showcase the practical utility of our framework by implementing an equivariant denoising diffusion bridge model, which achieves reliable equivariant image noise reduction and style transfer, irrespective of prior knowledge of image orientation.
Poster
Zhongxi Fang · Xun Su · Tomohisa Tabuchi · Jianming Huang · Hiroyuki Kasai
[ Hall A-E ]
Abstract
Multidimensional Scaling (MDS) is an essential technique in multivariate analysis, with Weighted MDS (WMDS) commonly employed for tasks such as dimensionality reduction and graph drawing. However, the optimization of WMDS poses significant challenges due to the highly non-convex nature of its objective function. Stress Majorization, a method classified under the Majorization-Minimization algorithm, is among the most widely used solvers for this problem because it guarantees non-increasing loss values during optimization, even with a non-convex objective function. Despite this advantage, Stress Majorization suffers from high computational complexity, specifically $\mathcal{O}(\max(n^3, n^2 p))$ per iteration, where $n$ denotes the number of data points, and $p$ represents the projection dimension, rendering it impractical for large-scale data analysis. To mitigate the computational challenge, we introduce StableMDS, a novel gradient descent-based method that reduces the computational complexity to $\mathcal{O}(n^2 p)$ per iteration. StableMDS achieves this computational efficiency by applying gradient descent independently to each point, thereby eliminating the need for costly matrix operations inherent in Stress Majorization. Furthermore, we theoretically ensure non-increasing loss values and optimization stability akin to Stress Majorization. These advancements not only enhance computational efficiency but also maintain stability, thereby broadening the applicability of WMDS to larger datasets.
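To make the complexity comparison concrete, the weighted stress objective $\sigma(X) = \sum_{i<j} w_{ij}\,(\lVert x_i - x_j\rVert - d_{ij})^2$ can be minimised by plain full-batch gradient descent at $\mathcal{O}(n^2 p)$ cost per iteration, the same order StableMDS targets. This is a generic sketch, not the StableMDS update itself; all names, the learning rate, and the numerical-stability constants are illustrative.

```python
import numpy as np

def weighted_stress(X, D, W):
    # sigma(X) = sum_{i<j} w_ij * (||x_i - x_j|| - d_ij)^2
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)
    iu = np.triu_indices(len(X), k=1)
    return (W[iu] * (dist[iu] - D[iu]) ** 2).sum()

def wmds_gradient_descent(D, W, p=2, lr=5e-3, iters=500, seed=0):
    # Full-batch gradient descent on the weighted stress; each iteration
    # costs O(n^2 p) since only pairwise differences are needed, avoiding
    # the O(n^3) matrix operations of Stress Majorization.
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    X = rng.normal(size=(n, p))
    for _ in range(iters):
        diff = X[:, None, :] - X[None, :, :]
        dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)
        np.fill_diagonal(dist, 1.0)      # avoid division by zero on i == j
        coef = W * (dist - D) / dist     # per-pair d(stress)/d(dist) factor
        np.fill_diagonal(coef, 0.0)
        grad = 2 * (coef[:, :, None] * diff).sum(axis=1)
        X -= lr * grad
    return X
```

Unlike this plain sketch, the abstract's method additionally guarantees non-increasing loss values, which generic gradient descent with a fixed step size does not.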
Poster
Mohammadreza Mousavi Kalan · Eitan Neugut · Samory Kpotufe
[ Hall A-E ]
Abstract
We consider the problem of transfer learning in outlier detection where target abnormal data is rare. While transfer learning has been considered extensively in traditional classification, the problem of transfer in outlier detection and more generally in imbalanced classification settings has received less attention. We propose a general algorithmic approach which is shown theoretically to yield strong guarantees w.r.t. a range of changes in abnormal distribution, and at the same time is amenable to practical implementation. We then investigate different instantiations of this general algorithmic approach, e.g., based on multi-layer neural networks, and show empirically that they significantly outperform natural extensions of transfer methods from traditional classification (which are the only solutions available at the moment).
Poster
Madhumitha Shridharan · Garud Iyengar
[ Hall A-E ]
Abstract
We consider a non-convex optimization formulation for learning the weighted adjacency matrix $W$ of a directed acyclic graph (DAG) that uses acyclicity constraints that are functions of $|W_{ij}|^\beta$, for $\beta \in \mathbb{N}$. State-of-the-art algorithms for this problem use gradient-based Karush-Kuhn-Tucker (KKT) optimality conditions which only yield useful search directions for $\beta =1$. Therefore, constraints with $\beta \geq 2$ have been ignored in the literature, and their empirical performance remains unknown. We introduce $\beta$-th Order Taylor Series Expansion Based Local Search ($\beta$-LS) which yields actionable descent directions for any $\beta \in \mathbb{N}$. Our empirical experiments show that $2$-LS obtains solutions of higher quality than $1$-LS, $3$-LS and $4$-LS. $2$-LSopt, an optimized version of $2$-LS, obtains high quality solutions significantly faster than the state of the art which uses $\beta=1$. Moreover, $2$-LSopt does not need any graph-size specific hyperparameter tuning. We prove that $\beta$-LSopt is guaranteed to converge to a Coordinate-Wise Local Stationary Point (Cst) for any $\beta \in \mathbb{N}$. If the objective function is convex, $\beta$-LSopt converges to a local minimum.
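For context, a standard smooth way to encode acyclicity as a function of $|W_{ij}|^\beta$ is the trace-exponential characterization $h(W) = \operatorname{tr}\!\big(\exp(|W|^{\circ\beta})\big) - d$, which vanishes exactly when $W$ encodes a DAG (powers of the nonnegative matrix $|W|^{\circ\beta}$ count weighted closed walks). This is a generic illustration of the constraint family, not the paper's $\beta$-LS search algorithm.

```python
import numpy as np
from scipy.linalg import expm

def acyclicity(W, beta=2):
    # h(W) = tr(exp(|W|^beta)) - d, elementwise power. Since
    # tr(exp(A)) = d + sum_k tr(A^k) / k! and tr(A^k) sums weighted
    # closed walks of length k, h(W) == 0 iff the digraph is acyclic.
    A = np.abs(W) ** beta
    return np.trace(expm(A)) - W.shape[0]
```

Any $\beta \in \mathbb{N}$ is admissible here; the abstract's point is that gradient-based KKT conditions only yield useful search directions for $\beta = 1$, whereas the proposed local search handles larger $\beta$.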
Poster
Ulysse Gazin · Ruth Heller · Etienne Roquain · Aldo Solari
[ Hall A-E ]
Abstract
In a split conformal framework with $K$ classes, a calibration sample of $n$ labeled examples is observed for inference on the label of a new unlabeled example. We explore the setting where a 'batch' of $m$ independent such unlabeled examples is given, and the goal is to construct a batch prediction set with $1-\alpha$ coverage. Unlike individual prediction sets, the batch prediction set is a collection of label vectors of size $m$, while the calibration sample consists of univariate labels. A natural approach is to apply the Bonferroni correction, which concatenates individual prediction sets at level $1-\alpha/m$. We propose a uniformly more powerful solution, based on specific combinations of conformal $p$-values that exploit the Simes inequality. We provide a general recipe for valid inference with any combinations of conformal $p$-values, and compare the performance of several useful choices. Intuitively, the pooled evidence of relatively 'easy' examples within the batch can help provide narrower batch prediction sets. Additionally, we introduce a more computationally intensive method that aggregates batch scores and can be even more powerful. The theoretical guarantees are established when all examples are independent and identically distributed (iid), as well as more generally when iid is assumed only conditionally within …
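To make the Bonferroni-versus-Simes comparison concrete, here is a minimal sketch of split-conformal $p$-values together with the two combination rules. The code is illustrative background, not the paper's full batch prediction-set construction; note that Simes rejects whenever Bonferroni does, but not conversely.

```python
import numpy as np

def conformal_pvalue(cal_scores, test_score):
    # Split-conformal p-value: rank of the test nonconformity score among
    # the n calibration scores (larger score = less conforming).
    n = len(cal_scores)
    return (1 + np.sum(cal_scores >= test_score)) / (n + 1)

def bonferroni_reject(pvals, alpha):
    # Reject the joint hypothesis if any p-value clears level alpha / m.
    return np.min(pvals) <= alpha / len(pvals)

def simes_reject(pvals, alpha):
    # Simes test: reject if the k-th smallest p-value is <= k * alpha / m
    # for some k; uniformly dominates the Bonferroni rule.
    p = np.sort(pvals)
    m = len(p)
    return bool(np.any(p <= alpha * np.arange(1, m + 1) / m))
```

For example, with $m = 2$ and $\alpha = 0.05$, the p-values $(0.04, 0.05)$ are rejected by Simes ($0.05 \le 2 \cdot 0.05 / 2$) but not by Bonferroni ($0.04 > 0.025$), which is the kind of extra power the batch prediction sets exploit.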
Poster
Zilong Deng · Simon Khan · Shaofeng Zou
[ Hall A-E ]
Abstract
In this work, we study the sample complexity of risk-sensitive Reinforcement Learning (RL) with a generative model, where we aim to maximize the Conditional Value at Risk (CVaR) with risk tolerance level $\tau$ at each step, named Iterated CVaR. We first build a connection between Iterated CVaR RL and $(s, a)$-rectangular distributionally robust RL with a CVaR-specific uncertainty set. We then develop nearly matching upper and lower bounds on the sample complexity for this problem. Specifically, we prove that a value-iteration-based algorithm, ICVaR-VI, achieves an $\epsilon$-optimal policy with at most $\tilde{\mathcal{O}}\left(\frac{SA}{(1-\gamma)^4\tau^2\epsilon^2}\right)$ samples, where $\gamma$ is the discount factor, and $S, A$ are the sizes of the state and action spaces. Furthermore, if $\tau \geq \gamma$, the sample complexity can be further improved to $\tilde{\mathcal{O}}\left( \frac{SA}{(1-\gamma)^3\epsilon^2} \right)$. We further show a minimax lower bound of ${\tilde{\mathcal{O}}}\left(\frac{(1-\gamma \tau)SA}{(1-\gamma)^4\tau\epsilon^2}\right)$. For a constant risk level $0<\tau\leq 1$, our upper and lower bounds match with each other, demonstrating the tightness and optimality of our analyses. We also investigate a limiting case with a small risk level …
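For reference, the CVaR objective at tolerance $\tau$ averages the worst $\tau$-fraction of outcomes, so $\tau = 1$ recovers the risk-neutral mean. A minimal empirical estimator (illustrative only, not the ICVaR-VI algorithm):

```python
import numpy as np

def empirical_cvar(rewards, tau):
    # CVaR at level tau of a reward sample: the average of the worst
    # (lowest) tau-fraction of outcomes; tau = 1 gives the plain mean.
    sorted_r = np.sort(rewards)
    k = max(1, int(np.ceil(tau * len(rewards))))
    return sorted_r[:k].mean()
```

Small $\tau$ focuses the objective on rare bad outcomes, which is why the sample-complexity bounds above degrade as $\tau \to 0$.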
Poster
Anshul Thakur · Soheila Molaei · Patrick Schwab · Danielle Belgrave · Kim Branson · David Clifton
[ Hall A-E ]
Abstract
Federated Learning (FL) involves a server aggregating local models from clients to compute a global model. However, this process can struggle to position the global model in low-loss regions of the parameter space for all clients, resulting in subpar convergence and inequitable performance across clients. This issue is particularly pronounced in non-IID settings, common in clinical contexts, where variations in data distribution, class imbalance, and training sample sizes result in client heterogeneity. To address this issue, we propose a mode connectivity-based FL framework that ensures the global model resides within the overlapping low-loss regions of all clients in the parameter space. This framework models the low-loss regions as non-linear mode connections between the current global and local models, and optimises to identify an intersection among these mode connections to define the new global model. This approach enhances training stability and convergence, yielding better and more equitable performance compared to standard FL frameworks like federated averaging. Empirical evaluations across multiple healthcare datasets demonstrate the benefits of the proposed framework.
Poster
Abdullah Tokmak · Kiran Krishnan · Thomas Schön · Dominik Baumann
[ Hall A-E ]
Abstract
Popular safe Bayesian optimization (BO) algorithms learn control policies for safety-critical systems in unknown environments. However, most algorithms make a smoothness assumption, which is encoded by a known bounded norm in a reproducing kernel Hilbert space (RKHS). The RKHS is a potentially infinite-dimensional space, and it remains unclear how to reliably obtain the RKHS norm of an unknown function. In this work, we propose a safe BO algorithm capable of estimating the RKHS norm from data. We provide statistical guarantees on the RKHS norm estimation, integrate the estimated RKHS norm into existing confidence intervals and show that we retain theoretical guarantees, and prove safety of the resulting safe BO algorithm. We apply our algorithm to safely optimize reinforcement learning policies on physics simulators and on a real inverted pendulum, demonstrating improved performance, safety, and scalability compared to the state-of-the-art.
Poster
Hue Dang · Matthew Wicker · Goetz Botterweck · Andrea Patane
[ Hall A-E ]
Abstract
We tackle the problem of computing guarantees for the robustness of neural networks against quantisation of their inputs, parameters and activation values. In particular, we pose the problem of bounding the worst-case discrepancy between the original neural network and all possible quantised ones parametrised by a given maximum quantisation diameter $\epsilon > 0$ over a finite dataset. To achieve this, we first reformulate the problem in terms of bilinear optimisation, which can be solved for provable bounds on the robustness guarantee. We then show how a quick scheme based on interval bound propagation can be developed and implemented during training so as to allow for the learning of neural networks robust against a continuous family of quantisation techniques. We evaluate our methodology on a variety of architectures on datasets such as MNIST, F-MNIST and CIFAR10. We demonstrate how non-trivial bounds on guaranteed accuracy can be obtained on several architectures and how quantisation robustness can be significantly improved through robust training.
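The interval-bound-propagation idea can be sketched for a single linear layer: any quantiser with diameter $\epsilon$ keeps the input inside an axis-aligned $\epsilon$-box, and that box is pushed through the layer in closed form. This generic sketch covers input quantisation only; the paper additionally handles parameter and activation quantisation, and all names here are illustrative.

```python
import numpy as np

def quantisation_box(x, eps):
    # Any quantiser with diameter eps maps x somewhere inside [x - eps, x + eps].
    return x - eps, x + eps

def ibp_linear(W, b, lower, upper):
    # Propagate the box [lower, upper] through y = W x + b:
    # centre goes through the layer exactly, radius through |W|.
    mid = (lower + upper) / 2
    rad = (upper - lower) / 2
    y_mid = W @ mid + b
    y_rad = np.abs(W) @ rad
    return y_mid - y_rad, y_mid + y_rad
```

Chaining such steps through every layer yields sound (if conservative) output bounds, which is what makes the scheme cheap enough to use inside the training loop.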
Poster
Xingzhi Sun · Danqi Liao · Kincaid MacDonald · Yanlei Zhang · Guillaume Huguet · Guy Wolf · Ian Adelstein · Tim G. J. Rudner · Smita Krishnaswamy
[ Hall A-E ]
Abstract
Rapid growth of high-dimensional datasets in fields such as single-cell RNA sequencing and spatial genomics has led to unprecedented opportunities for scientific discovery, but it also presents unique computational and statistical challenges. Traditional methods struggle with geometry-aware data generation, interpolation along meaningful trajectories, and transporting populations via feasible paths. To address these issues, we introduce Geometry-Aware Generative Autoencoder (GAGA), a novel framework that combines extensible manifold learning with generative modeling. GAGA constructs a neural network embedding space that respects the intrinsic geometries discovered by manifold learning and learns a novel warped Riemannian metric on the data space. This warped metric is derived from both the points on the data manifold and negative samples off the manifold, allowing it to characterize a meaningful geometry across the entire latent space. Using this metric, GAGA can uniformly sample points on the manifold, generate points along geodesics, and interpolate between populations across the learned manifold. GAGA shows competitive performance in simulated and real-world datasets, including a 30% improvement over SOTA in single-cell population-level trajectory inference.
Poster
David Huk · Mark Steel · Ritabrata Dutta
[ Hall A-E ]
Abstract
We propose reinterpreting copula density estimation as a discriminative task. Under this novel estimation scheme, we train a classifier to distinguish samples from the joint density from those of the product of independent marginals, recovering the copula density in the process. We derive equivalences between well-known copula classes and classification problems naturally arising in our interpretation. Furthermore, we show our estimator achieves theoretical guarantees akin to maximum likelihood estimation. By identifying a connection with density ratio estimation, we benefit from the rich literature and models available for such problems. Empirically, we demonstrate the applicability of our approach by estimating copulas of real and high-dimensional datasets, outperforming competing copula estimators in density evaluation as well as sampling.
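The discriminative recipe in the abstract can be sketched in a few lines: label joint samples 1, samples with independently shuffled columns 0, and read the density ratio off the classifier's odds. The quadratic feature map, the logistic-regression model, and all names are illustrative assumptions for 2-d Gaussian-type data, not the paper's estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def _feat(X):
    # Quadratic features so a *linear* classifier can represent a Gaussian
    # log-density ratio (an illustrative choice for 2-d data).
    return np.column_stack([X, X ** 2, X[:, [0]] * X[:, [1]]])

def copula_density_ratio(X, seed=0):
    # Density ratio estimation via classification: with balanced classes,
    # the odds P(joint | x) / P(product | x) estimate
    # p(x) / prod_j p_j(x_j), i.e. the dependence structure.
    rng = np.random.default_rng(seed)
    X_indep = np.column_stack([rng.permutation(col) for col in X.T])
    Z = np.vstack([X, X_indep])
    y = np.concatenate([np.ones(len(X)), np.zeros(len(X_indep))])
    clf = LogisticRegression(max_iter=1000).fit(_feat(Z), y)
    def ratio(x):
        p = clf.predict_proba(_feat(np.atleast_2d(x)))[:, 1]
        return p / (1 - p)
    return ratio
```

For positively correlated data, the estimated ratio is large where both coordinates move together and small where they disagree, matching the shape of the underlying copula density.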
Poster
Seyedeh Baharan Khatami · Harsh Parikh · Haowei Chen · Sudeepa Roy · Babak Salimi
[ Hall A-E ]
Abstract
Estimating causal effects in social network data presents unique challenges due to the presence of spillover effects and network-induced confounding. While much of the existing literature addresses causal inference in social networks, many methods rely on strong assumptions about the form of network-induced confounding. These assumptions often fail to hold in high-dimensional networks, limiting the applicability of such approaches. To address this, we propose a novel methodology that integrates graph machine learning techniques with the double machine learning framework, facilitating accurate and efficient estimation of both direct and peer effects in a single observational social network. Our estimator achieves semiparametric efficiency under mild regularity conditions, enabling consistent uncertainty quantification. Through extensive simulations, we demonstrate the accuracy, robustness, and scalability of our method. Finally, we apply the proposed approach to examine the impact of Self-Help Group participation on financial risk tolerance, highlighting its practical relevance.
Poster
Swetha Ganesh · Washim Uddin Mondal · Vaneet Aggarwal
[ Hall A-E ]
Abstract
We present two Policy Gradient-based algorithms with general parametrization in the context of infinite-horizon average reward Markov Decision Process (MDP). The first one employs Implicit Gradient Transport for variance reduction, ensuring an expected regret of the order $\tilde{\mathcal{O}}(T^{2/3})$. The second approach, rooted in Hessian-based techniques, ensures an expected regret of the order $\tilde{\mathcal{O}}(\sqrt{T})$. These results significantly improve the state-of-the-art $\tilde{\mathcal{O}}(T^{3/4})$ regret and achieve the theoretical lower bound. We also show that the average-reward function is approximately $L$-smooth, a result that was previously assumed in earlier works.
Poster
HAO ZHU · Daniel M. Steinberg · Piotr Koniusz
[ Hall A-E ]
Abstract
In this work, we present a novel theoretical framework for analyzing and modeling protein fitness landscapes using spectral graph theory. By representing the protein sequence space as a generalized Hamming graph and studying its spectral properties, we derive a set of powerful tools for quantifying the ruggedness, epistasis, and other key characteristics of the landscape. We prove strong approximation and sampling results, showing that the landscape can be efficiently learned and optimized from limited and noisy data. Building on this foundation, we introduce Propagational Convolutional Neural Networks (PCNNs), a new class of inductive surrogate oracles. We provide rigorous theoretical guarantees on the generalization and convergence properties of PCNNs, using techniques from the Neural Tangent Kernel framework. Extensive experiments on real-world protein engineering tasks demonstrate the superiority of PCNNs over state-of-the-art methods, achieving higher fitness and better generalization from limited data.
Poster
Andi Nika · Jonathan Nöther · Debmalya Mandal · Parameswaran Kamalaruban · Adish Singla · Goran Radanovic
[ Hall A-E ]
Abstract
We study data poisoning attacks in learning from human preferences. More specifically, we consider the problem of teaching/enforcing a target policy $\pi^\dagger$ by synthesizing preference data. We seek to understand the susceptibility of different preference-based learning paradigms to poisoned preference data by analyzing the number of samples required by the attacker to enforce $\pi^\dagger$. We first propose a general data poisoning formulation in learning from human preferences and then study it for two popular paradigms, namely: (a) reinforcement learning from human feedback (RLHF) that operates by learning a reward model using preferences; (b) direct preference optimization (DPO) that directly optimizes policy using preferences. We conduct a theoretical analysis of the effectiveness of data poisoning in a setting where the attacker is allowed to augment a pre-existing dataset and also study its special case where the attacker can synthesize the entire preference dataset from scratch. As our main results, we provide lower/upper bounds on the number of samples required to enforce $\pi^\dagger$. Finally, we discuss the implications of our results in terms of the susceptibility of these learning paradigms under such data poisoning attacks.
Poster
Yingqian Cui · Jie Ren · Pengfei He · Hui Liu · Jiliang Tang · Yue Xing
[ Hall A-E ]
Abstract
We present a theoretical analysis of the performance of transformers with softmax attention in in-context learning with linear regression tasks. While the existing theoretical literature predominantly focuses on providing convergence upper bounds to show that trained transformers with single-/multi-head attention can obtain a good in-context learning performance, our research centers on comparing the exact convergence of single- and multi-head attention more rigorously. We conduct an exact theoretical analysis to demonstrate that multi-head attention with a substantial embedding dimension performs better than single-head attention. When the number of in-context examples $D$ increases, the prediction loss under both single- and multi-head attention is in $O(1/D)$, and the loss for multi-head attention has a smaller multiplicative constant. In addition to the simplest data distribution setting, our technical framework for calculating the exact convergence further facilitates studying more scenarios, e.g., noisy labels, local examples, correlated features, and prior knowledge. We observe that, in general, multi-head attention is preferred over single-head attention. Our results verify the effectiveness of the design of multi-head attention in the transformer architecture.
Poster
Honghua Zhang · Benjie Wang · Marcelo Arenas · Guy Van den Broeck
[ Hall A-E ]
Abstract
Probabilistic circuits (PCs) are a unifying representation for probabilistic models that support tractable inference. Numerous applications of PCs, like controllable text generation, depend on the ability to efficiently multiply two circuits. Existing multiplication algorithms require that the circuits respect the same structure, i.e., that variable scopes decompose according to the same vtree. In this work, we propose and study the task of restructuring structured(-decomposable) PCs, that is, transforming a structured PC such that it conforms to a target vtree. We propose a generic approach for this problem and show that it leads to novel polynomial-time algorithms for multiplying circuits respecting different vtrees, as well as a practical depth-reduction algorithm that preserves structured decomposability. Our work opens up new avenues for tractable PC inference, suggesting the possibility of training with less restrictive PC structures while enabling efficient inference by changing their structures at inference time.
Poster
Krzysztof Kacprzyk · Mihaela van der Schaar
[ Hall A-E ]
Abstract
Symbolic regression (SR) is a machine learning approach aimed at discovering mathematical closed-form expressions that best fit a given dataset. Traditional complexity measures in SR, such as the number of terms or expression tree depth, often fail to capture the difficulty of specific analytical tasks a user might need to perform. In this paper, we introduce a new complexity measure designed to quantify the difficulty of conducting single-feature global perturbation analysis (SGPA)—a type of analysis commonly applied in fields like physics and risk scoring to understand the global impact of perturbing individual input features. We present a unified mathematical framework that formalizes and generalizes these established practices, providing a precise method to assess how challenging it is to apply SGPA to different closed-form equations. This approach enables the definition of novel complexity metrics and constraints directly tied to this practical analytical task. Additionally, we establish a reconstruction theorem, offering potential insights for developing future optimization techniques in SR.
Poster
Nikola Pavlovic · Sudeep Salgia · Qing Zhao
[ Hall A-E ]
Abstract
We consider the problem of contextual kernel bandits with stochastic contexts, where the underlying reward function belongs to a known Reproducing Kernel Hilbert Space (RKHS). We study this problem under the additional constraint of joint differential privacy, where the agent needs to ensure that the sequence of query points is differentially private with respect to both the sequence of contexts and rewards. We propose a novel algorithm that improves upon the state of the art and achieves an error rate of $\mathcal{O}\left(\sqrt{\dfrac{\gamma_T}{T}} + \dfrac{\gamma_T}{T \varepsilon}\right)$ after $T$ queries for a large class of kernel families, where $\gamma_T$ represents the effective dimensionality of the kernel and $\varepsilon > 0$ is the privacy parameter. Our results are based on a novel estimator for the reward function that simultaneously enjoys high utility along with low sensitivity to observed rewards and contexts, which is crucial to obtaining improved performance.
Poster
Samuele Fonio · Roberto Esposito · Marco Aldinucci
[ Hall A-E ]
Abstract
Non-Euclidean geometries have garnered significant research interest, particularly in their application to Deep Learning. Utilizing specific manifolds as embedding spaces has been shown to enhance neural network representational capabilities by aligning these spaces with the data's latent structure. In this paper, we focus on hyperbolic manifolds and introduce a novel framework, Hyperbolic Prototypical Entailment Cones (HPEC). The core innovation of HPEC lies in utilizing angular relationships, rather than traditional distance metrics, to more effectively capture the similarity between data representations and their corresponding prototypes. This is achieved by leveraging hyperbolic entailment cones, a mathematical construct particularly suited for embedding hierarchical structures in the Poincaré ball, along with a novel Backclip mechanism. Our experimental results demonstrate that this approach significantly enhances performance in high-dimensional embedding spaces. To substantiate these findings, we evaluate HPEC on four diverse datasets across various embedding dimensions, consistently surpassing state-of-the-art methods in Prototype Learning.
Poster
Flavio Giorgi · Cesare Campagnano · Fabrizio Silvestri · Gabriele Tolomei
[ Hall A-E ]
Abstract
Explainable Artificial Intelligence (XAI) has emerged as a critical area of research to unravel the opaque inner logic of (deep) machine learning models. Among the various XAI techniques proposed in the literature, counterfactual explanations stand out as one of the most promising approaches. However, these ``what-if'' explanations are frequently complex and technical, making them difficult for non-experts to understand and, more broadly, challenging for humans to interpret. To bridge this gap, in this work, we exploit the power of open-source Large Language Models to generate natural language explanations when prompted with valid counterfactual instances produced by state-of-the-art explainers for graph-based models. Experiments across several graph datasets and counterfactual explainers show that our approach effectively produces accurate natural language representations of counterfactual instances, as demonstrated by key performance metrics.
Poster
Daniel Dold · Julius Kobialka · Nicolai Palm · Emanuel Sommer · David Rügamer · Oliver Dürr
[ Hall A-E ]
Abstract
Understanding the structure of neural network loss surfaces, particularly the emergence of low-loss tunnels, is critical for advancing neural network theory and practice. In this paper, we propose a novel approach to directly embed loss tunnels into the loss landscape of neural networks. Exploring the properties of these loss tunnels offers new insights into their length and structure and sheds light on some common misconceptions. We then apply our approach to Bayesian neural networks, where we improve subspace inference by identifying pitfalls and proposing a more natural prior that better guides the sampling procedure.
Poster
Francesco Micheli · Efe Balta · Anastasios Tsiamis · John Lygeros
[ Hall A-E ]
Abstract
We address the challenge of sequential data-driven decision-making under context distributional uncertainty. This problem arises in numerous real-world scenarios where the learner optimizes black-box objective functions in the presence of uncontrollable contextual variables. We consider the setting where the context distribution is uncertain but known to lie within an ambiguity set defined as a ball in the Wasserstein distance. We propose a novel algorithm for Wasserstein Distributionally Robust Bayesian Optimization that can handle continuous context distributions while maintaining computational tractability. Our theoretical analysis combines recent results in self-normalized concentration in Hilbert spaces and finite-sample bounds for distributionally robust optimization to establish sublinear regret bounds that match state-of-the-art results. Through extensive comparisons with existing approaches on both synthetic and real-world problems, we demonstrate the simplicity, effectiveness, and practical applicability of our proposed method.
Poster
Yves Rychener · Daniel Kuhn · Yifan Hu
[ Hall A-E ]
Abstract
We investigate group fairness regularizers in federated learning, aiming to train a globally fair model in a distributed setting. Ensuring global fairness in distributed training presents unique challenges, as fairness regularizers typically involve probability metrics between distributions across all clients and are not naturally separable by client. To address this, we introduce a function-tracking scheme for the global fairness regularizer based on a Maximum Mean Discrepancy (MMD), which incurs a small communication overhead. This scheme seamlessly integrates into most federated learning algorithms while preserving rigorous convergence guarantees, as demonstrated in the context of FedAvg. Additionally, when enforcing differential privacy, the kernel-based MMD regularization enables straightforward analysis through a change of kernel, leveraging an intuitive interpretation of kernel convolution. Numerical experiments confirm our theoretical insights.
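The global fairness regularizer in the abstract above is built on the Maximum Mean Discrepancy between group-wise distributions, which is simple to compute directly. A minimal sketch of that underlying quantity (RBF kernel and hypothetical group scores; the paper's communication-efficient function-tracking scheme is not reproduced here):

```python
import numpy as np

def mmd2(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD with an RBF kernel."""
    def gram(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    return gram(x, x).mean() + gram(y, y).mean() - 2 * gram(x, y).mean()

rng = np.random.default_rng(0)
scores_g0 = rng.normal(0.0, 1.0, size=(200, 1))   # model scores, group 0
scores_g1 = rng.normal(1.0, 1.0, size=(200, 1))   # shifted scores, group 1

penalty = mmd2(scores_g0, scores_g1)   # large when group score distributions differ
matched = mmd2(scores_g0, rng.normal(0.0, 1.0, size=(200, 1)))
print(penalty, matched)
```

Adding such a penalty to the training loss pushes the model's score distributions across groups together; the federated difficulty the abstract addresses is that this quantity couples samples held by different clients.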
Poster
BEOMJUN KIM · Jaehwan Kim · Kangyeon Kim · Sunwoo Kim · Heejin Ahn
[ Hall A-E ]
Abstract
Evaluating dataset quality is an essential task, as the performance of artificial intelligence (AI) systems heavily depends on it. A traditional method for evaluating dataset quality involves training an AI model on the dataset and testing it on a separate test set. However, this approach requires significant computational time. In this paper, we propose a computationally efficient method for quantifying dataset quality. Specifically, our method measures how well the dataset covers the input probability distribution, ensuring that a high-quality dataset minimizes out-of-distribution inputs. We present a GPU-accelerated algorithm for approximately implementing the proposed method. We highlight three applications of our approach. First, it can evaluate the impact of data management practices, such as data cleaning and core set selection. We experimentally demonstrate that the quality assessment provided by our method strongly correlates with the traditional approach, achieving an $R^2 \geq 0.985$ in most cases while being 60-1200 times faster. Second, it can monitor the quality of continuously growing datasets with computation time proportional to the added data size. Finally, our method can estimate the performance of traditional methods for large datasets.
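One crude proxy for the coverage idea in the abstract above (not the paper's GPU-accelerated algorithm) is the fraction of samples from the input distribution that fall near some dataset point; a dataset that misses part of the input space scores lower:

```python
import numpy as np

def coverage_score(dataset, probe, eps=0.5):
    """Fraction of probe points (drawn from the input distribution) that lie
    within distance eps of some dataset point -- a crude coverage proxy."""
    d = np.linalg.norm(probe[:, None, :] - dataset[None, :, :], axis=-1)
    return (d.min(axis=1) <= eps).mean()

rng = np.random.default_rng(0)
probe = rng.normal(size=(500, 2))        # samples from the input distribution
full = rng.normal(size=(1000, 2))        # dataset covering the whole distribution
truncated = full[full[:, 0] > 0]         # dataset missing half the input space

print(coverage_score(full, probe), coverage_score(truncated, probe))
```

The distance threshold `eps` and the data here are illustrative assumptions; the pairwise-distance computation is exactly the kind of work a GPU implementation would accelerate.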
Poster
Christina Baek · Aditi Raghunathan · Zico Kolter
[ Hall A-E ]
Abstract
Under distribution shifts, deep networks exhibit a surprising phenomenon: in-distribution (ID) versus out-of-distribution (OOD) accuracy is often strongly linearly correlated across architectures and hyperparameters, accompanied by the same linear trend in ID versus OOD agreement between the predictions of any pair of such independently trained networks. The latter phenomenon, called ``agreement-on-the-line'', enables precise unlabeled OOD performance estimation of models. In this work, we discover that agreement-on-the-line emerges even in linear classifiers over Gaussian class conditional distributions. We provide theoretical guarantees for this phenomenon in classifiers optimized via randomly initialized gradient descent, approximated by linear interpolations between random vectors and the Bayes-optimal classifier. Next, we prove a lower bound on the residual of the correlation between ID versus OOD agreement that grows proportionally with the residual of accuracy. Real-world experiments on CIFAR10C shifts validate our findings and the broader relevance of our theoretical framework.
Poster
Ashkan Soleymani · Behrooz Tahmasebi · Stefanie Jegelka · Patrick Jaillet
[ Hall A-E ]
Abstract
While invariances naturally arise in almost any type of real-world data, no efficient and robust test exists for detecting them in observational data under arbitrarily given group actions. We tackle this problem by studying measures of invariance that can capture even negligible underlying patterns. Our first contribution is to show that, while detecting subtle asymmetries is computationally intractable, a randomized method can be used to robustly estimate closeness measures to invariance within constant factors. This provides a general framework for robust statistical tests of invariance. Despite the extensive and well-established literature, our methodology, to the best of our knowledge, is the first to provide statistical tests for general group invariances with finite-sample guarantees on Type II errors. In addition, we focus on kernel methods and propose deterministic algorithms for robust testing with respect to both finite and infinite groups, accompanied by a rigorous analysis of their convergence rates and sample complexity. Finally, we revisit the general framework in the specific case of kernel methods, showing that recent closeness measures to invariance, defined via group averaging, are provably robust, leading to powerful randomized algorithms.
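As a toy illustration of a group-averaging closeness measure like those discussed above, consider the sign-flip group acting on the real line: the mean-squared deviation of a function from its group average is zero exactly when the function is invariant (an assumed finite-group setup; not the paper's kernel-based tests or their finite-sample guarantees):

```python
import numpy as np

rng = np.random.default_rng(0)
group = [lambda x: x, lambda x: -x]      # Z_2 acting on the real line by sign flip

def deviation(f, samples):
    """Monte Carlo estimate of E[(f(x) - mean_g f(g.x))^2]; zero (on the
    sampled support) exactly when f is invariant under the group action."""
    vals = np.array([[f(g(x)) for g in group] for x in samples])
    return ((vals[:, 0] - vals.mean(axis=1)) ** 2).mean()

xs = rng.normal(size=200)
even = lambda x: x ** 2                  # invariant under the sign flip
odd = lambda x: x ** 3 + x               # clearly not invariant

print(deviation(even, xs), deviation(odd, xs))
```

For infinite or large groups, the inner mean over `group` is where the randomized and deterministic estimation schemes studied in the paper come in.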
Poster
Gil Kur · Aditya Guntuboyina
[ Hall A-E ]
Abstract
We study the task of estimating a log-concave density in $\mathbb{R}^d$ using the Maximum Likelihood Estimator, known as the log-concave MLE. We show that for every $d \geq 4$, the log-concave MLE attains an \emph{adaptive rate} when the negative logarithm of the underlying density is the maximum of $k$ affine functions, meaning that the estimation error for such a density is significantly lower than the minimax rate for the class of log-concave densities. Specifically, we prove that for such densities, the risk of the log-concave MLE is of order $c(k) \cdot n^{-\frac{4}{d}}$ in terms of the Hellinger squared distance. This result complements the work of Kim et al. (AoS 2018) and Feng et al. (AoS 2021), who addressed the cases $d = 1$ and $d \in \{2,3\}$, respectively. Our proof provides a unified and relatively simple approach for all $d \geq 1$, and is based on techniques from stochastic convex geometry and empirical process theory, which may be of independent interest.
Poster
Yuta Nakahara · Shota Saito · Naoki Ichijo · Koki Kazama · Toshiyasu Matsushima
[ Hall A-E ]
Abstract
Deterministic decision trees have difficulty evaluating uncertainty, especially for small samples. To solve this problem, we interpret the decision trees as stochastic models and consider prediction problems in the framework of Bayesian decision theory. Our models have three kinds of parameters: a tree shape, leaf parameters, and inner parameters. To make Bayesian optimal decisions, we have to calculate the posterior distribution of these parameters. Previously, two types of methods have been proposed. One marginalizes out the leaf parameters and samples the tree shape and the inner parameters by Metropolis-Hastings (MH) algorithms. The other marginalizes out both the leaf parameters and the tree shape based on a concept called meta-trees and approximates the posterior distribution for the inner parameters by a bagging-like method. In this paper, we propose a novel MH algorithm where the leaf parameters and the tree shape are marginalized out by using the meta-trees and only the inner parameters are sampled. Moreover, we update all the inner parameters simultaneously in each MH step. This algorithm accelerates the convergence and mixing of the Markov chain. We evaluate our algorithm on various benchmark datasets with other state-of-the-art methods. Further, our model provides a novel statistical evaluation of feature importance.
Poster
Ziyad Benomar · Vianney Perchet
[ Hall A-E ]
Abstract
The field of learning-augmented algorithms has gained significant attention in recent years. Using potentially inaccurate predictions, these algorithms must exhibit three key properties: consistency, robustness, and smoothness. In scenarios with stochastic predictions, a strong average-case performance is required. Typically, the design of such algorithms involves a natural tradeoff between consistency and robustness, and previous works aimed to achieve Pareto-optimal tradeoffs for specific problems. However, in some settings, this comes at the expense of smoothness. In this paper, we explore the tradeoffs between all the mentioned criteria and show how they can be balanced.
Poster
Amartya Sanyal · Yaxi Hu · Yaodong Yu · Yian Ma · Yixin Wang · Bernhard Schölkopf
[ Hall A-E ]
Abstract
Accuracy-on-the-line is a widely observed phenomenon in machine learning, where a model's accuracy on in-distribution (ID) and out-of-distribution (OOD) data is positively correlated across different hyperparameters and data configurations. But when does this useful relationship break down? In this work, we explore its robustness. The key observation is that noisy data and the presence of nuisance features can be sufficient to shatter the Accuracy-on-the-line phenomenon. In these cases, ID and OOD accuracy can become negatively correlated, leading to "Accuracy-on-the-wrong-line". This phenomenon can also occur in the presence of spurious (shortcut) features, which tend to overshadow the more complex signal (core, non-spurious) features, resulting in a large nuisance feature space. Moreover, scaling to larger datasets does not mitigate this undesirable behaviour and may even exacerbate it. We formally prove a lower bound on OOD error in a linear classification model, characterising the conditions on the noise and nuisance features for a large OOD error. We finally demonstrate this phenomenon across both synthetic and real datasets with noisy data and nuisance features.
Poster
Tathagata Sadhukhan · Manit Paul · Raaz Dwivedi
[ Hall A-E ]
Abstract
Nearest neighbor (NN) algorithms have been extensively used for missing data problems in recommender systems and sequential decision-making systems. Prior theoretical analysis has established favorable guarantees for NN when the underlying data is sufficiently smooth and the missingness probabilities are lower bounded. Here we analyze NN with non-smooth non-linear functions with vast amounts of missingness. In particular, we consider matrix completion settings where the entries of the underlying matrix follow a latent non-linear factor model, with the non-linearity belonging to a Hölder function class that is less smooth than Lipschitz. Our results establish the following favorable properties for a suitable two-sided NN: (1) the mean squared error (MSE) of NN adapts to the smoothness of the non-linearity, (2) under certain regularity conditions, the NN error rate matches the rate obtained by an oracle equipped with the knowledge of both the row and column latent factors, and finally (3) NN's MSE is non-trivial for a wide range of settings even when several matrix entries might be missing deterministically. We support our theoretical findings via extensive numerical simulations and a case study with data from a mobile health study, HeartSteps.
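A simplified sketch of a two-sided nearest-neighbor estimate — averaging observed entries whose row and column both look similar to the target's, with similarity measured on commonly observed entries — under a noiseless rank-1 model (illustrative thresholds and data; not the paper's exact estimator or assumptions):

```python
import numpy as np

def two_sided_nn(M, mask, i, j, eta=0.5):
    """Estimate entry (i, j) by averaging observed entries whose row looks
    similar to row i AND whose column looks similar to column j, with
    similarity computed on commonly observed entries."""
    n, m = M.shape

    def row_dist(a, b):
        common = mask[a] & mask[b]
        return ((M[a, common] - M[b, common]) ** 2).mean() if common.any() else np.inf

    def col_dist(a, b):
        common = mask[:, a] & mask[:, b]
        return ((M[common, a] - M[common, b]) ** 2).mean() if common.any() else np.inf

    rows = [a for a in range(n) if a != i and row_dist(a, i) <= eta]
    cols = [b for b in range(m) if b != j and col_dist(b, j) <= eta]
    vals = [M[a, b] for a in rows for b in cols if mask[a, b]]
    return np.mean(vals) if vals else np.nan

rng = np.random.default_rng(0)
u, v = rng.normal(size=(30, 1)), rng.normal(size=(1, 30))
M = u @ v                            # noiseless rank-1 latent factor matrix
mask = rng.random(M.shape) < 0.7     # roughly 70% of entries observed
est = two_sided_nn(M, mask, 0, 0)
print(est, M[0, 0])
```

Restricting to entries that are close in both the row and the column direction is what makes the estimator "two-sided"; a one-sided variant would use only row (or only column) neighbors.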
Poster
Zhi Zhang · Kyle Ritscher · Oscar Madrid
[ Hall A-E ]
Abstract
This paper introduces and analyzes quantile additive trend filtering, a novel approach to model the conditional quantiles of the response variable given multivariate covariates. Under the assumption that the true model is additive, and that the components are functions whose $r$th order weak derivatives have bounded total variation, our estimator is a constrained version of quantile trend filtering within additive models. The primary theoretical contributions are the error rates of our estimator in both fixed and growing input dimensions. In the fixed dimension case, we show that our estimator attains a rate that mirrors the non-quantile minimax rate for additive trend filtering, featuring the main term $n^{-2r/(2r+1)}$. For growing input dimension ($d$), our rate has an additional polynomial factor $d^{(2r+2)/(2r+1)}$. We propose a practical algorithm for implementing quantile additive trend filtering using dimension-wise backfitting. Experiments on both real data and simulations confirm our theoretical findings. We provide a public implementation of the algorithm at https://github.com/zzh237/QATF.
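The quantile side of such estimators rests on the pinball loss, whose minimizer over constants is the empirical $\tau$-quantile. A quick numerical check of that fact (illustrative data; the trend-filtering penalty and backfitting algorithm are not reproduced here):

```python
import numpy as np

def pinball(residual, tau):
    """Quantile (pinball) loss: tau-weighted under-prediction,
    (1 - tau)-weighted over-prediction."""
    return np.where(residual >= 0, tau * residual, (tau - 1) * residual)

# Minimizing mean pinball loss over a constant recovers the tau-quantile.
rng = np.random.default_rng(0)
y = rng.normal(size=10_000)
grid = np.linspace(-3, 3, 601)
losses = [pinball(y - c, 0.9).mean() for c in grid]
best = grid[int(np.argmin(losses))]
print(best, np.quantile(y, 0.9))
```

Replacing the constant with an additive function of the covariates and adding a total-variation penalty on discrete derivatives yields the kind of objective the abstract studies.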
Poster
Brendan Mallery · James Murphy · Shuchin Aeron
[ Hall A-E ]
Abstract
We consider synthesis and analysis of probability measures using the entropy-regularized Wasserstein-2 cost and its unbiased version, the Sinkhorn divergence. The synthesis problem consists of computing the barycenter, with respect to these costs, of reference measures given a set of coefficients belonging to the simplex. The analysis problem consists of finding the coefficients for the closest barycenter in the Wasserstein-2 distance to a given measure. Under the weakest assumptions on the measures thus far in the literature, we compute the derivative of the entropy-regularized Wasserstein-2 cost. We leverage this to establish a characterization of barycenters with respect to the entropy-regularized Wasserstein-2 cost as solutions that correspond to a fixed point of an average of the entropy-regularized displacement maps. This characterization yields a finite-dimensional, convex, quadratic program for solving the analysis problem when the measure being analyzed is a barycenter with respect to the entropy-regularized Wasserstein-2 cost. We show that these coefficients, as well as the value of the barycenter functional, can be estimated from samples with dimension-independent rates of convergence, and that barycentric coefficients are stable with respect to perturbations in the Wasserstein-2 metric. We employ the barycentric coefficients as features for classification of corrupted point cloud data, and show …
Poster
Hongni Wang · Junxi Zhang · Na Li · Linglong Kong · Bei Jiang · Xiaodong Yan
[ Hall A-E ]
Abstract
In healthcare and precision medicine, estimating optimal treatment regimes for right-censored data while ensuring fairness across ethnic subgroups is crucial but remains underexplored. The problem presents two key challenges: measuring heterogeneous treatment effects (HTE) under fairness constraints and dealing with censoring mechanisms. We propose a general framework for estimating HTE using nonparametric methods and integrating user-controllable fairness constraints to address these problems. Under mild regularization assumptions, our method is theoretically grounded, demonstrating the double robustness property of the HTE estimator. Using this framework, we demonstrate that optimal treatment strategies balance fairness and utility. Through extensive simulations and real-world data analysis, we uncover the potential of this method to guide the selection of treatment methods that are equitable and effective.
Poster
Yang Hu · Tianyi Chen · Na Li · Kai Wang · Bo Dai
[ Hall A-E ]
Abstract
Off-policy evaluation (OPE) is one of the most fundamental problems in reinforcement learning (RL) to estimate the expected long-term payoff of a given target policy with *only* experiences from another behavior policy that is potentially unknown. The distribution correction estimation (DICE) family of estimators has advanced the state of the art in OPE by breaking the *curse of horizon*. However, the major bottleneck of applying DICE estimators lies in the difficulty of solving the saddle-point optimization involved, especially with neural network implementations. In this paper, we tackle this challenge by establishing a *linear representation* of the value function and the stationary distribution correction ratio, *i.e.*, the primal and dual variables in the DICE framework, using the spectral decomposition of the transition operator. Such a primal-dual representation not only bypasses the non-convex non-concave optimization in vanilla DICE, thereby enabling a computationally efficient algorithm, but also paves the way for more efficient utilization of historical data. We highlight that our algorithm, **SpectralDICE**, is the first to leverage the linear representation of primal-dual variables that is both computation and sample efficient, the performance of which is supported by a rigorous theoretical sample complexity guarantee and a thorough empirical evaluation on various benchmarks.
Poster
Alessio Russo · Yichen Song · Aldo Pacchiano
[ Hall A-E ]
Abstract
We study the sample complexity of pure exploration in an online learning problem with a feedback graph. This graph dictates the feedback available to the learner, covering scenarios between full-information, pure bandit feedback, and settings with no feedback on the chosen action. While variants of this problem have been investigated for regret minimization, no prior work has addressed the pure exploration setting, which is the focus of our study. We derive an instance-specific lower bound on the sample complexity of learning the best action with fixed confidence, even when the feedback graph is unknown and stochastic, and present unidentifiability results for Bernoulli rewards. Additionally, our findings reveal how the sample complexity scales with key graph-dependent quantities. Lastly, we introduce TaS-FG (Track and Stop for Feedback Graphs), an asymptotically optimal algorithm, and demonstrate its efficiency across different graph configurations.
Poster
Qiaobo Li · Zixiang Chen · Yihe Deng · Yiwen Kou · Yuan Cao · Quanquan Gu
[ Hall A-E ]
Abstract
Representation learning, particularly multi-task representation learning, has gained widespread popularity in various deep learning applications, ranging from computer vision to natural language processing, due to its remarkable generalization performance. Despite its growing use, our understanding of the underlying mechanisms remains limited. In this paper, we provide a theoretical analysis elucidating why multi-task representation learning outperforms its single-task counterpart in scenarios involving over-parameterized two-layer convolutional neural networks trained by gradient descent. Our analysis is based on a data model that encompasses both task-shared and task-specific features, a setting commonly encountered in real-world applications. We also present experiments on synthetic and real-world data to illustrate and validate our theoretical findings.
Poster
Tingting Ni · Maryam Kamgarpour
[ Hall A-E ]
Abstract
We consider discounted infinite-horizon constrained Markov decision processes (CMDPs), where the goal is to find an optimal policy that maximizes the expected cumulative reward while satisfying expected cumulative constraints. Motivated by the application of CMDPs in online learning for safety-critical systems, we focus on developing a model-free and $\textit{simulator-free}$ algorithm that ensures $\textit{constraint satisfaction during learning}$. To this end, we employ the LB-SGD algorithm proposed by Usmanova et al. (2022), which utilizes an interior-point approach based on the log-barrier function of the CMDP. Under the commonly assumed conditions of relaxed Fisher non-degeneracy and bounded transfer error in policy parameterization, we establish the theoretical properties of the LB-SGD algorithm. In particular, unlike existing CMDP approaches that ensure policy feasibility only upon convergence, the LB-SGD algorithm guarantees feasibility throughout the learning process and converges to the $\varepsilon$-optimal policy with a sample complexity of $\tilde{\mathcal{O}}(\varepsilon^{-6})$. Compared to the state-of-the-art policy gradient-based algorithm, C-NPG-PDA, the LB-SGD algorithm requires an additional $\mathcal{O}(\varepsilon^{-2})$ samples to ensure policy feasibility during learning with the same Fisher non-degenerate parameterization.
Poster
Sobihan Surendran · Antoine Godichon-Baggioni · Sylvain Le Corff
[ Hall A-E ]
Abstract
Variational Autoencoders (VAE) are popular generative models used to sample from complex data distributions. Despite their empirical success in various machine learning tasks, significant gaps remain in understanding their theoretical properties, particularly regarding convergence guarantees. This paper aims to bridge that gap by providing non-asymptotic convergence guarantees for VAE trained using both Stochastic Gradient Descent and Adam algorithms. We derive a convergence rate of $\mathcal{O}(\log n / \sqrt{n})$, where $n$ is the number of iterations of the optimization algorithm, with explicit dependencies on the batch size, the number of variational samples, and other key hyperparameters. Our theoretical analysis applies to both Linear VAE and Deep Gaussian VAE, as well as several VAE variants, including $\beta$-VAE and IWAE. Additionally, we empirically illustrate the impact of hyperparameters on convergence, offering new insights into the theoretical understanding of VAE training.
Poster
Gilad Turok · Chirag Modi · Bob Carpenter
[ Hall A-E ]
Abstract
Hamiltonian Monte Carlo (HMC) is the mainstay of applied Bayesian inference for differentiable models. However, HMC still struggles to sample from hierarchical models that induce densities with multiscale geometry: a large step size is needed to efficiently explore low curvature regions while a small step size is needed to accurately explore high curvature regions. We introduce the delayed rejection generalized HMC (DR-G-HMC) sampler that overcomes this challenge by employing dynamic step size selection, inspired by differential equation solvers. In generalized HMC, each iteration does a single leapfrog step. DR-G-HMC sequentially makes proposals with geometrically decreasing step sizes upon rejection of earlier proposals. This simulates Hamiltonian dynamics that can adjust its step size along a (stochastic) Hamiltonian trajectory to deal with regions of high curvature. DR-G-HMC makes generalized HMC competitive by decreasing the number of rejections which otherwise cause inefficient backtracking and prevents directed movement. We present experiments to demonstrate that DR-G-HMC (1) correctly samples from multiscale densities, (2) makes generalized HMC methods competitive with the state of the art No-U-Turn sampler, and (3) is robust to tuning parameters.
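The delayed-rejection idea described in the abstract can be sketched as follows. This is an illustrative sketch only, assuming a simple Gaussian-style target: the plain Metropolis test below omits the delayed-rejection acceptance correction for rejected ("ghost") proposals and the partial momentum refresh that the actual DR-G-HMC sampler uses.

```python
import numpy as np

def leapfrog(q, p, grad_logp, eps, n_steps=1):
    """Leapfrog integration of Hamiltonian dynamics (one step, as in G-HMC)."""
    p = p + 0.5 * eps * grad_logp(q)
    for _ in range(n_steps - 1):
        q = q + eps * p
        p = p + eps * grad_logp(q)
    q = q + eps * p
    p = p + 0.5 * eps * grad_logp(q)
    return q, p

def dr_step(q, logp, grad_logp, eps0, max_tries=3, shrink=0.5, rng=None):
    """On rejection, retry the proposal with a geometrically smaller step size.

    Simplified: the true DR-G-HMC acceptance ratio must correct for the
    earlier rejected proposals; the bare Metropolis test here is illustrative.
    """
    rng = np.random.default_rng() if rng is None else rng
    p = rng.standard_normal(np.shape(q))
    h0 = logp(q) - 0.5 * float(p @ p)
    eps = eps0
    for _ in range(max_tries):
        q_new, p_new = leapfrog(q, p, grad_logp, eps)
        h1 = logp(q_new) - 0.5 * float(p_new @ p_new)
        if np.log(rng.uniform()) < h1 - h0:
            return q_new, True
        eps *= shrink  # geometric step-size decrease after each rejection
    return q, False
```

The geometric shrinkage is what lets a single sampler use large steps in flat regions while falling back to small steps in high-curvature regions.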
Poster
Haotian Ye · Himanshu Jain · Chong You · Ananda Theertha Suresh · Haowei Lin · James Zou · Felix Yu
[ Hall A-E ]
Abstract
In real-world applications of large language models, outputs are often required to be confined: selecting items from predefined product or document sets, generating phrases that comply with safety standards, or conforming to specialized formatting styles. To control the generation, constrained decoding has been widely adopted. However, existing prefix-tree-based constrained decoding is inefficient under GPU-based model inference paradigms, and it introduces unintended biases into the output distribution. This paper introduces Dynamic Importance Sampling for Constrained Decoding (DISC) with GPU-based Parallel Prefix-Verification (PPV), a novel algorithm that leverages dynamic importance sampling to achieve theoretically guaranteed asymptotic unbiasedness and overcomes the inefficiency of prefix-tree. Extensive experiments demonstrate the superiority of our method over existing methods in both efficiency and output quality. These results highlight the potential of our methods to improve constrained generation in applications where adherence to specific constraints is essential.
Poster
Abhishek Sharma · Leo Benac · Sonali Parbhoo · Finale Doshi-Velez
[ Hall A-E ]
Abstract
Within batch reinforcement learning, safe policy improvement seeks to ensure that the learned policy performs at least as well as the behavior policy that generated the dataset. The core challenge is seeking improvements while balancing risk when many state-action pairs may be infrequently visited. In this work, we introduce Decision Points RL (DPRL), an algorithm that restricts the set of state-action pairs (or regions for continuous states) considered for improvement. DPRL ensures high-confidence improvement in densely visited states (called `decision points') while still utilizing data from sparsely visited states by using them for trajectory-based value estimates. By selectively limiting the state-actions where the policy deviates from the behavior, we achieve tighter theoretical guarantees that depend only on the counts of frequently observed state-action pairs rather than on state-action space size. Our empirical results confirm DPRL provides both safety and performance improvements across synthetic and real-world applications.
Poster
Brian Cho · Dominik Meier · Kyra Gan · Nathan Kallus
[ Hall A-E ]
Abstract
In multi-armed bandits, reward maximization and pure exploration are often at odds with each other. The former focuses on exploiting arms with the highest means, while the latter may require constant exploration across all arms. In this work, we focus on good arm identification (GAI), a pure exploration objective that aims to label arms with means above a threshold as quickly as possible. We show that GAI can be efficiently solved by combining a reward-maximizing sampling algorithm with a novel nonparametric anytime-valid sequential test for labeling arm means. We begin by presenting the theoretical guarantees of our proposed sequential test. Under nonparametric assumptions, our test ensures strict error control and asymptotically achieves the minimax optimal e-power, a notion of power for anytime-valid tests. Building on this, we propose an algorithm for GAI by pairing regret-minimizing sampling schemes with our sequential test as a stopping criterion. We show that this approach achieves minimax optimal stopping times for labeling arms with means above a threshold, under an error probability constraint $\delta$. Our empirical results validate our approach beyond the minimax setting, reducing the expected number of samples for all stopping times by at least 50% across both synthetic and real-world settings.
Poster
Aleksandar Armacki · Shuhua Yu · Pranay Sharma · Gauri Joshi · Dragana Bajovic · Dusan Jakovetic · Soummya Kar
[ Hall A-E ]
Abstract
We study high-probability convergence in online learning, in the presence of heavy-tailed noise. To combat the heavy tails, a general framework of nonlinear SGD methods is considered, subsuming several popular nonlinearities like sign, quantization, component-wise and joint clipping. In our work the nonlinearity is treated in a black-box manner, allowing us to establish unified guarantees for a broad range of nonlinear methods. For symmetric noise and non-convex costs we establish convergence of gradient norm-squared, at a rate $\widetilde{\mathcal{O}}(t^{-1/4})$, while for the last iterate of strongly convex costs we establish convergence to the population optima, at a rate $\mathcal{O}(t^{-\zeta})$, where $\zeta \in (0,1)$ depends on noise and problem parameters. Further, if the noise is a (biased) mixture of symmetric and non-symmetric components, we show convergence to a neighbourhood of stationarity, whose size depends on the mixture coefficient, nonlinearity and noise. Compared to state-of-the-art works, which only consider clipping and require unbiased noise with bounded $p$-th moments, $p \in (1,2]$, we provide guarantees for a broad class of nonlinearities, without any assumptions on noise moments. While the rate exponents in state-of-the-art depend on noise moments and vanish as $p \rightarrow 1$, our exponents are constant and strictly better whenever $p < 6/5$ for non-convex …
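The black-box treatment of the nonlinearity can be sketched as follows. The function names, the Student-t noise model, and the step-size schedule are illustrative assumptions, not the paper's exact setup; the point is that the update only touches the gradient through a generic map `nl`.

```python
import numpy as np

# Examples of nonlinearities covered by the framework: each maps a
# stochastic gradient g to nl(g) before the update x <- x - lr * nl(g).
def sign_nl(g):
    return np.sign(g)                        # component-wise sign

def joint_clip(g, tau=1.0):
    n = np.linalg.norm(g)
    return g if n <= tau else (tau / n) * g  # joint (norm) clipping

def componentwise_clip(g, tau=1.0):
    return np.clip(g, -tau, tau)

def nonlinear_sgd(grad_fn, x0, nl, lr=0.1, steps=100, rng=None):
    """Nonlinear SGD with heavy-tailed gradient noise; `nl` is a black box."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for t in range(steps):
        # Student-t noise (df=3): finite variance but heavy tails.
        g = grad_fn(x) + rng.standard_t(df=3, size=x.shape)
        x = x - lr / np.sqrt(t + 1) * nl(g)  # assumed decaying step size
    return x
```

Swapping `joint_clip` for `sign_nl` or `componentwise_clip` requires no change to the update, which mirrors how the analysis covers all of them at once.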
Poster
Yatong Chen · Andrew Estornell · Yevgeniy Vorobeychik · Yang Liu
[ Hall A-E ]
Abstract
Individuals often aim to reverse undesired outcomes in interactions with automated systems, like loan denials, by either implementing system-recommended actions (recourse), or manipulating their features. While providing recourse benefits users and enhances system utility, it also provides information about the decision process that can be used for more effective strategic manipulation, especially when the individuals collectively share such information with each other. We show that this tension leads rational utility-maximizing systems to frequently withhold recourse, resulting in decreased population utility, particularly impacting sensitive groups. To mitigate these effects, we explore the role of recourse subsidies, finding them effective in increasing the provision of recourse actions by rational systems, as well as lowering the potential social cost and mitigating unfairness caused by recourse withholding.
Poster
Peyman Morteza
[ Hall A-E ]
Abstract
We develop a mathematical framework to address a broad class of metric and preference learning problems within a Hilbert space. We obtain a novel representer theorem for the simultaneous task of metric and preference learning. Our key observation is that the representer theorem for this task can be derived by regularizing the problem with respect to the norm inherent in the task structure. For the general task of metric learning, our framework leads to a simple and self-contained representer theorem and offers new geometric insights into the derivation of representer theorems for this task. In the case of Reproducing Kernel Hilbert Spaces (RKHSs), we illustrate how our representer theorem can be used to express the solution of the learning problems in terms of finite kernel terms similar to classical representer theorems. Lastly, our representer theorem leads to a novel nonlinear algorithm for metric and preference learning. We compare our algorithm against challenging baseline methods on real-world rank inference benchmarks, where it achieves competitive performance. Notably, our approach significantly outperforms vanilla ideal point methods and surpasses strong baselines across multiple datasets. Code available at: https://github.com/PeymanMorteza/Metric-Preference-Learning-RKHS
Poster
Florian Kalinke · Zoltan Szabo · Bharath Sriperumbudur
[ Hall A-E ]
Abstract
Kernel methods underpin many of the most successful approaches in data science and statistics, and they allow representing probability measures as elements of a reproducing kernel Hilbert space without loss of information. Recently, the kernel Stein discrepancy (KSD), which combines Stein's method with the flexibility of kernel techniques, gained considerable attention. Through the Stein operator, KSD allows the construction of powerful goodness-of-fit tests where it is sufficient to know the target distribution up to a multiplicative constant. However, the typical U- and V-statistic-based KSD estimators suffer from a quadratic runtime complexity, which hinders their application in large-scale settings. In this work, we propose a Nyström-based KSD acceleration---with runtime $\mathcal{O} \left(mn+m^3\right)$ for $n$ samples and $m\ll n$ Nyström points---show its $\sqrt{n}$-consistency with a classical sub-Gaussian assumption, and demonstrate its applicability for goodness-of-fit testing on a suite of benchmarks. We also show the $\sqrt{n}$-consistency of the quadratic-time KSD estimator.
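The Nyström idea behind the stated $\mathcal{O}(mn+m^3)$ runtime can be sketched with a generic Nyström feature map for a Gaussian kernel (an illustration of the approximation principle, not the paper's KSD-specific estimator):

```python
import numpy as np

def gauss_kernel(X, Y, bw=1.0):
    """Gaussian kernel matrix between row-wise point sets X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw ** 2))

def nystroem_features(X, m, bw=1.0, rng=None):
    """m-dimensional features phi with phi @ phi.T ~ K (the full Gram matrix).

    Cost: O(mn) kernel evaluations plus one O(m^3) eigendecomposition,
    versus O(n^2) for the exact n x n Gram matrix.
    """
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(X), size=m, replace=False)  # Nystrom landmark points
    K_mm = gauss_kernel(X[idx], X[idx], bw)          # m x m
    K_nm = gauss_kernel(X, X[idx], bw)               # n x m
    w, V = np.linalg.eigh(K_mm + 1e-10 * np.eye(m))
    w = np.clip(w, 1e-12, None)
    # phi @ phi.T = K_nm @ K_mm^{-1} @ K_nm.T, the Nystrom approximation.
    return K_nm @ V / np.sqrt(w)
```

Replacing the exact Gram matrix by such low-rank features is what breaks the quadratic runtime barrier of U- and V-statistic KSD estimators.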
Poster
Paulius Rauba · Qiyao Wei · Mihaela van der Schaar
[ Hall A-E ]
Abstract
We consider the problem of auditing *black-box* large language models (LLMs) to ensure they behave reliably when deployed in production settings, particularly in high-stakes domains such as legal, medical, and regulatory compliance. Existing approaches for LLM auditing often focus on isolated aspects of model behavior, such as detecting specific biases or evaluating fairness. We are interested in a more general question---can we understand how the outputs of black-box LLMs depend on *each input token*? There is a critical need to have such tools in real-world applications that rely on inaccessible API endpoints to language models. However, this is a highly non-trivial problem, as LLMs are stochastic functions (i.e., two outputs may differ by chance), while computing prompt-level gradients to approximate input sensitivity is infeasible. To address this, we propose Distribution-Based Sensitivity Analysis (DBSA), a lightweight model-agnostic procedure to evaluate the sensitivity of the output of a language model for each input token, without making any distributional assumptions about the LLM. DBSA is developed as a *practical tool* for practitioners, enabling quick, plug-and-play visual exploration of LLMs' reliance on specific input tokens. Through illustrative examples, we demonstrate how DBSA can enable users to inspect LLM inputs and find sensitivities that …
Poster
Shimeng Huang · Niklas Pfister · Jack Bowden
[ Hall A-E ]
Abstract
Observational genome-wide association studies are now widely used for causal inference in genetic epidemiology. To maintain privacy, such data is often only publicly available as summary statistics, and often studies for the endogenous covariates and the outcome are available separately. This has necessitated methods tailored to two-sample summary statistics. Current state-of-the-art methods modify linear instrumental variable (IV) regression---with genetic variants as instruments---to account for unmeasured confounding. However, since the endogenous covariates can be high dimensional, standard IV assumptions are generally insufficient to identify all causal effects simultaneously. We ensure identifiability by assuming the causal effects are sparse and propose a sparse causal effect two-sample IV estimator, spaceTSIV, adapting the spaceIV estimator by Pfister and Peters (2022) for two-sample summary statistics. We provide two methods, based on L0- and L1-penalization, respectively. We prove identifiability of the sparse causal effects in the two-sample setting and consistency of spaceTSIV. The performance of spaceTSIV is compared with existing two-sample IV methods in simulations. Finally, we showcase our methods using real proteomic and gene-expression data for drug-target discovery.
Poster
Safwan Labbi · Daniil Tiapkin · Lorenzo Mancini · Paul Mangold · Eric Moulines
[ Hall A-E ]
Abstract
In this paper, we present the Federated Upper Confidence Bound Value Iteration algorithm ($\texttt{Fed-UCBVI}$), a novel extension of the $\texttt{UCBVI}$ algorithm (Azar et al., 2017) tailored for the federated learning framework. We prove that the regret of $\texttt{Fed-UCBVI}$ scales as $\tilde O(\sqrt{H^3 |S| |A| T / M})$, with a small additional term due to heterogeneity, where $|S|$ is the number of states, $|A|$ is the number of actions, $H$ is the episode length, $M$ is the number of agents, and $T$ is the number of episodes. Notably, in the single-agent setting, this upper bound matches the minimax lower bound up to polylogarithmic factors, while in the multi-agent scenario, $\texttt{Fed-UCBVI}$ has linear speed-up. To conduct our analysis, we introduce a new measure of heterogeneity, which may hold independent theoretical interest. Furthermore, we show that, unlike existing federated reinforcement learning approaches, $\texttt{Fed-UCBVI}$'s communication complexity only marginally increases with the number of agents.
Poster
Ruijia Zhang · Siliang Zeng · Chenliang Li · Alfredo Garcia · Mingyi Hong
[ Hall A-E ]
Abstract
The goal of the inverse reinforcement learning (IRL) task is to identify the underlying reward function and the corresponding optimal policy from a set of expert demonstrations. Most algorithms with theoretical guarantees in IRL assume the reward has a linear structure. In this work, we want to extend our understanding of the IRL problem when the reward is parametrized by some neural network structures. Meanwhile, conventional IRL algorithms usually adopt a nested structure, thus exhibiting computational inefficiency, especially when the MDP is high-dimensional. We address this problem by proposing the first neural single-loop maximum likelihood algorithm. Due to the nonlinearity of neural network approximation, the previous global convergence result established in linear reward scenarios is no longer guaranteed. We give a nonasymptotic convergence analysis of our proposed neural algorithm by using the overparameterization of certain neural networks. However, it still remains unknown whether the proposed neural algorithm can identify the globally optimal reward and the corresponding optimal policy. Under some over-parameterized neural network structures, we provide affirmative answers to both questions. To our knowledge, this is the first IRL algorithm with a non-asymptotic convergence guarantee that identifies a provably global optimum within neural network settings.
Poster
Julianna Piskorz · Nicolás Astorga · Jeroen Berrevoets · Mihaela van der Schaar
[ Hall A-E ]
Abstract
Making treatment effect estimation actionable for personalized decision-making requires overcoming the costs and delays of acquiring necessary features. While many machine learning models estimate Conditional Average Treatment Effects (CATE), they mostly assume that _all_ relevant features are readily available at prediction time – a scenario that is rarely realistic. In practice, acquiring features, such as medical tests, can be both expensive and time-consuming, highlighting the need for strategies that select the most informative features for each individual, enhancing decision accuracy while controlling costs. Existing active feature acquisition (AFA) methods, developed for supervised learning, fail to address the unique challenges of CATE, such as confounding, overlap, and the structural similarities of potential outcomes under different treatments. To tackle these challenges, we propose specialised feature acquisition metrics and estimation strategies tailored to the CATE setting. We demonstrate the effectiveness of our methods through experiments on synthetic datasets designed to reflect common biases and data issues. In doing so, this work aims to bridge the gap between cutting-edge CATE estimation techniques and their practical, cost-efficient application in personalised treatment assignment.
Poster
Remi Khellaf · Aurélien Bellet · Julie Josse
[ Hall A-E ]
Abstract
We study Federated Causal Inference, an approach to estimate treatment effects from decentralized data across centers. We compare three classes of Average Treatment Effect (ATE) estimators derived from the Plug-in G-Formula, ranging from simple meta-analysis to one-shot and multi-shot federated learning, the latter leveraging the full data to learn the outcome model (albeit requiring more communication). Focusing on Randomized Controlled Trials (RCTs), we derive the asymptotic variance of these estimators for linear models. Our results provide practical guidance on selecting the appropriate estimator for various scenarios, including heterogeneity in sample sizes, covariate distributions, treatment assignment schemes, and center effects. We validate these findings with a simulation study.
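The simplest of the estimator classes mentioned above, meta-analysis, can be sketched as an inverse-variance-weighted combination of per-center difference-in-means estimates. This is an illustrative sketch under an RCT assumption, not the paper's exact Plug-in G-Formula estimators.

```python
import numpy as np

def center_ate(y, t):
    """Difference-in-means ATE and its variance estimate in one RCT center."""
    y1, y0 = y[t == 1], y[t == 0]
    ate = y1.mean() - y0.mean()
    var = y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0)
    return ate, var

def meta_analysis_ate(centers):
    """Inverse-variance-weighted combination of per-center ATEs.

    `centers` is a list of (outcomes, treatments) pairs; no individual-level
    data leaves a center, only (ate, var) summaries are shared.
    """
    ates, vars_ = zip(*(center_ate(y, t) for y, t in centers))
    w = 1.0 / np.asarray(vars_)
    return float(np.sum(w * np.asarray(ates)) / np.sum(w))
```

One-shot and multi-shot federated estimators instead share (partial) outcome-model fits, trading extra communication for lower variance under heterogeneity.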
Poster
Zijun Gao
[ Hall A-E ]
Abstract
Accurate heterogeneous treatment effect (HTE) estimation is essential for personalized recommendations, making it important to evaluate and compare HTE estimators. Due to missing counterfactuals, traditional evaluation methods are inapplicable. Current HTE assessment methods rely on additional estimation or matching on test data, whose uncertainty is often ignored, leading to incorrect HTE estimator selection. We propose incorporating uncertainty quantification into HTE estimator comparisons. In addition, we suggest shifting the focus to the estimation and inference of the relative error between methods rather than the absolute errors. Methodology-wise, we develop a relative error estimator based on the semi-parametric efficient influence function and establish the estimator's asymptotic distribution for inference. Compared to absolute error-based methods, the relative error estimator (1) is less sensitive to the errors in nuisance function estimation, satisfies a "global double robustness" property, and (2) its confidence intervals are often narrower, making it more powerful for determining the more accurate HTE estimator. Through extensive empirical study of simulated data and ACIC benchmark datasets, we show that the relative error-based method more effectively identifies the better HTE estimator with statistical confidence, even with limited test data or inaccurate nuisance estimators.
Poster
Petar Bevanda · Max Beier · Alexandre Capone · Stefan Sosnowski · Sandra Hirche · Armin Lederer
[ Hall A-E ]
Abstract
We propose a family of Gaussian processes (GP) for dynamical systems with linear time-invariant responses, which are nonlinear only in initial conditions. This linearity allows us to tractably quantify forecasting and representational uncertainty, simultaneously alleviating the challenge of computing the distribution of trajectories from a GP-based dynamical system and enabling a new probabilistic treatment of learning Koopman operator representations. Using a trajectory-based equivariance – which we refer to as Koopman equivariance – we obtain a GP model with enhanced generalization capabilities. To allow for large-scale regression, we equip our framework with variational inference based on suitable inducing points. Experiments demonstrate on-par and often better forecasting performance compared to kernel-based methods for learning dynamical systems.
Poster
Raphael Carpintero Perez · Sébastien da Veiga · Josselin Garnier · Brian Staber
[ Hall A-E ]
Abstract
In computational physics, machine learning has now emerged as a powerful complementary tool to explore efficiently candidate designs in engineering studies. Outputs in such supervised problems are signals defined on meshes, and a natural question is the extension of general scalar output regression models to such complex outputs. Changes between input geometries in terms of both size and adjacency structure in particular make this transition non-trivial. In this work, we propose an innovative strategy for Gaussian process regression where inputs are large and sparse graphs with continuous node attributes and outputs are signals defined on the nodes of the associated inputs. The methodology relies on the combination of regularized optimal transport, dimension reduction techniques, and the use of Gaussian processes indexed by graphs. In addition to enabling signal prediction, the main point of our proposal is to come with confidence intervals on node values, which is crucial for uncertainty quantification and active learning. Numerical experiments highlight the efficiency of the method to solve real problems in fluid dynamics and solid mechanics.
Poster
Saptarshi Chakraborty · Peter Bartlett
[ Hall A-E ]
Abstract
The field of unpaired image-to-image translation has undergone a significant transformation with the introduction of Generative Adversarial Networks (GANs), with CycleGAN and DiscoGAN as prominent variants. While these models show impressive empirical performance, their statistical properties are under-studied. In this paper, we propose a framework for analyzing the generalization error in cross-domain deep generative models. Our findings reveal that when provided with independent and identically distributed (i.i.d.) samples from two domains, the translation error, measured under the Wasserstein-1 loss, scales as $\tilde{\mathcal{O}} \left(\min(n, m)^{-1/\max(d,\tilde{d})}\right)$, provided that the true model possesses sufficient smoothness and the network sizes are chosen appropriately. Here, $n$ and $m$ represent the sizes of the sample sets, while $d$ and $\tilde{d}$ denote the dimensions of the respective data domains. Furthermore, we highlight the importance of a cycle loss term for ensuring distributional cycle consistency. Additionally, we provide insights into the relationship between the network size and the number of data points. Notably, as the true model exhibits greater smoothness, it suffices to work with smaller networks.
Poster
Jaehyun Park · Junyeop Kwon · Dabeen Lee
[ Hall A-E ]
Abstract
We study model-based reinforcement learning with non-linear function approximation where the transition function of the underlying Markov decision process (MDP) is given by a multinomial logit (MNL) model. We develop a provably efficient discounted value iteration-based algorithm that works for both infinite-horizon average-reward and discounted-reward settings. For average-reward communicating MDPs, the algorithm guarantees a regret upper bound of $\tilde{\mathcal{O}}(dD\sqrt{T})$ where $d$ is the dimension of feature mapping, $D$ is the diameter of the underlying MDP, and $T$ is the horizon. For discounted-reward MDPs, our algorithm achieves $\tilde{\mathcal{O}}(d(1-\gamma)^{-2}\sqrt{T})$ regret where $\gamma$ is the discount factor. Then we complement these upper bounds by providing several regret lower bounds. We prove a lower bound of $\Omega(d\sqrt{DT})$ for learning communicating MDPs of diameter $D$ and a lower bound of $\Omega(d(1-\gamma)^{-3/2}\sqrt{T})$ for learning discounted-reward MDPs with discount factor $\gamma$. Lastly, we show a regret lower bound of $\Omega(dH^{3/2}\sqrt{K})$ for learning $H$-horizon episodic MDPs with MNL function approximation where $K$ is the number of episodes, which improves upon the best-known lower bound for the finite-horizon setting.
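The MNL transition model underlying the results above can be sketched in a few lines: the probability of each candidate next state is a softmax over feature-parameter inner products. Names and shapes here are illustrative.

```python
import numpy as np

def mnl_transition_probs(phi_next, theta):
    """MNL transition model: P(s' | s, a) over the candidate next states.

    phi_next: (k, d) array of feature vectors phi(s, a, s') for the k
    candidate next states; theta: (d,) parameter the learner estimates.
    """
    logits = phi_next @ theta
    logits -= logits.max()   # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()
```

The regret bounds are then stated in terms of the feature dimension $d$ rather than the number of states, which is what makes the function approximation useful.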
Poster
Ricardo Baptista · Aram-Alexandre Pooladian · Michael Brennan · Youssef Marzouk · Jonathan Niles-Weed
[ Hall A-E ]
Abstract
Conditional simulation is a fundamental task in statistical modeling: Generate samples from the conditionals given finitely many data points from a joint distribution. One promising approach is to construct conditional Brenier maps, where the components of the map pushforward a reference distribution to conditionals of the target. While many estimators exist, few, if any, come with statistical or algorithmic guarantees. To this end, we propose a non-parametric estimator for conditional Brenier maps based on the computational scalability of \emph{entropic} optimal transport. Our estimator leverages a result of \citet{carlier2010knothe}, which shows that optimal transport maps under a rescaled quadratic cost asymptotically converge to conditional Brenier maps; our estimator is precisely the entropic analogue of these converging maps. We provide heuristic justifications for how to choose the scaling parameter in the cost as a function of the number of samples by fully characterizing the Gaussian setting. We conclude by comparing the performance of the estimator to other machine learning and non-parametric approaches on benchmark datasets and Bayesian inference problems.
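The two ingredients of such an estimator can be sketched in numpy: a plain Sinkhorn solver for entropic OT, and a rescaled quadratic cost in the spirit of Carlier et al. (2010), where the conditioning coordinate is up-weighted so the map becomes approximately conditional. The weighting scheme and parameter values below are illustrative assumptions, not the paper's tuned choices.

```python
import numpy as np

def sinkhorn(a, b, C, reg=1.0, n_iter=1000):
    """Entropic OT plan between histograms a, b for cost matrix C."""
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan diag(u) K diag(v)

def rescaled_cost(src, tgt, eps_t=1.0):
    """Quadratic cost with the conditioning coordinate weighted by 1/eps_t.

    For 2-d points (x, y): as eps_t -> 0 the OT map increasingly preserves
    x and transports y conditionally (a Knothe-type limit).
    """
    dx = (src[:, None, 0] - tgt[None, :, 0]) ** 2 / eps_t
    dy = (src[:, None, 1] - tgt[None, :, 1]) ** 2
    return dx + dy
```

The heuristic question studied in the abstract is how `eps_t` should shrink with the sample size so that the entropic plan tracks the conditional Brenier map.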
Poster
Hisham Husain · Julien Monteil
[ Hall A-E ]
Abstract
Latent variable collaborative filtering methods have been a standard approach to modelling user-click interactions due to their simplicity and effectiveness. However, there is limited work on analyzing the mathematical properties of these methods, in particular on preventing the overfitting towards the identity, and such methods typically utilize loss functions that overlook the geometry between items. In this work, we introduce a notion of generalization gap in collaborative filtering and analyze this with respect to latent collaborative filtering models. We present a geometric upper bound that gives rise to loss functions, and a way to meaningfully utilize the geometry of the items to improve recommendations. We show how these losses can be minimized and give the recipe for a new latent collaborative filtering algorithm, which we refer to as GeoCF, due to the geometric nature of our results. We then show experimentally that our proposed GeoCF algorithm can outperform all other existing methods on the Movielens20M and Netflix datasets, as well as two large-scale internal datasets. In summary, our work proposes a theoretically sound method which paves the way to better understand generalization of collaborative filtering at large.
Poster
Talal Alrawajfeh · Joonas Jälkö · Antti Honkela
[ Hall A-E ]
Abstract
Differential privacy (DP) provides robust privacy guarantees for statistical inference, but this can lead to unreliable results and biases in downstream applications. While several noise-aware approaches have been proposed which integrate DP perturbation into the inference, they are limited to specific types of simple probabilistic models. In this work, we propose a novel method for noise-aware approximate Bayesian inference based on stochastic gradient variational inference which can also be applied to high-dimensional and non-conjugate models. We also propose a more accurate evaluation method for noise-aware posteriors. Empirically, our inference method has similar performance to existing methods in the domain where they are applicable. Outside this domain, we obtain accurate coverages on high-dimensional Bayesian linear regression and well-calibrated predictive probabilities on Bayesian logistic regression with the UCI Adult dataset.
Poster
Calvin Osborne · Eliza O'Reilly
[ Hall A-E ]
Abstract
Random feature maps are used to decrease the computational cost of kernel machines in large-scale problems. The Mondrian kernel is one such example of a fast random feature approximation of the Laplace kernel, generated by a computationally efficient hierarchical random partition of the input space known as the Mondrian process. In this work, we study a variation of this random feature map by applying a uniform random rotation to the input space before running the Mondrian process to approximate a kernel that is invariant under rotations. We obtain a closed-form expression for the isotropic kernel that is approximated, as well as a uniform convergence rate of the uniformly rotated Mondrian kernel to this limit. To this end, we utilize techniques from the theory of stationary random tessellations in stochastic geometry and prove a new result on the geometry of the typical cell of the superposition of uniformly rotated Mondrian tessellations. Finally, we test the empirical performance of this random feature map on both synthetic and real-world datasets, demonstrating its improved performance over the Mondrian kernel on a dataset that is debiased from the standard coordinate axes.
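The uniform random rotation applied before running the Mondrian process can be sampled with a standard QR-based construction (a sketch of the rotation step only; the Mondrian feature construction itself is omitted and the function name is illustrative):

```python
import numpy as np

def haar_rotation(d, rng=None):
    """Sample a rotation uniformly (Haar measure on SO(d)) via QR."""
    rng = np.random.default_rng() if rng is None else rng
    A = rng.standard_normal((d, d))
    Q, R = np.linalg.qr(A)
    Q = Q * np.sign(np.diag(R))   # sign fix so Q is Haar-distributed on O(d)
    if np.linalg.det(Q) < 0:      # restrict from O(d) to SO(d)
        Q[:, 0] = -Q[:, 0]
    return Q

# The uniformly rotated Mondrian features of data X are then the ordinary
# (axis-aligned) Mondrian features computed on the rotated points X @ Q.T.
```

Averaging over the random rotation is what removes the dependence of the approximated kernel on the standard coordinate axes.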
Poster
Ojash Neopane · Aaditya Ramdas · Aarti Singh
[ Hall A-E ]
Abstract
Estimation of the Average Treatment Effect (ATE) is a core problem in causal inference with strong connections to Off-Policy Evaluation in Reinforcement Learning. This paper considers the problem of adaptively selecting the treatment allocation probability in order to improve estimation of the ATE. The majority of prior work on adaptive ATE estimation focuses on asymptotic guarantees, and in turn overlooks important practical considerations such as the difficulty of learning the optimal treatment allocation as well as hyper-parameter selection. Existing non-asymptotic methods are limited by poor empirical performance and exponential dependence on problem parameters. In order to address these gaps, we propose and analyze the Clipped Second Moment Tracking (ClipSMT) algorithm, a variant of an existing algorithm with strong asymptotic optimality guarantees, and provide finite sample bounds on its Neyman regret. Our analysis shows that, in the superpopulation setting, ClipSMT achieves exponential improvements in Neyman regret on two fronts: improving the dependence on $T$ from $O(\sqrt{T})$ to $O(\log T)$, as well as reducing the exponential dependence on problem parameters to a polynomial dependence---although the setting we consider is slightly less general. We conclude with simulations which show the marked improvement of ClipSMT over existing approaches.
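The idea of tracking second moments with a clipped allocation can be sketched as follows. This is an illustrative plug-in Neyman allocation with an assumed decaying clip schedule, not the paper's exact rule or constants.

```python
import numpy as np

def clipped_neyman_allocation(second_moments, t, clip0=0.1):
    """Illustrative clipped plug-in Neyman allocation.

    second_moments: length-2 array of estimated second moments of the
    control and treatment outcomes; t: current round. The plug-in
    allocation s1 / (s0 + s1) is clipped away from {0, 1} by a schedule
    that decays with t, so both arms keep being sampled while the
    estimates converge.
    """
    s0, s1 = np.sqrt(np.maximum(second_moments, 1e-12))
    pi = s1 / (s0 + s1)
    clip = clip0 / np.sqrt(t + 1)   # assumed decaying clip schedule
    return float(np.clip(pi, clip, 1 - clip))
```

The clipping is what tames the exponential dependence on problem parameters: without it, a badly initialized allocation can starve one arm of samples.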
Poster
Edmund Lau · Zach Furman · George Wang · Daniel Murfet · Susan Wei
[ Hall A-E ]
Abstract
The Local Learning Coefficient (LLC) is introduced as a novel complexity measure for deep neural networks (DNNs). Recognizing the limitations of traditional complexity measures, the LLC leverages Singular Learning Theory (SLT), which has long recognized the significance of singularities in the loss landscape geometry. This paper provides an extensive exploration of the LLC's theoretical underpinnings, offering both a clear definition and intuitive insights into its application. Moreover, we propose a new scalable estimator for the LLC, which is then effectively applied across diverse architectures including deep linear networks up to 100M parameters, ResNet image models, and transformer language models. Empirical evidence suggests that the LLC provides valuable insights into how training heuristics might influence the effective complexity of DNNs. Ultimately, the LLC emerges as a crucial tool for reconciling the apparent contradiction between deep learning's complexity and the principle of parsimony.
Poster
Alejandro de la Concha · Nicolas Vayatis · Argyris Kalogeratos
[ Hall A-E ]
Abstract
The multiple two-sample testing problem in a graph-structured setting is a common scenario in fields such as Spatial Statistics and Neuroscience. Each node $v$ in a fixed graph deals with a two-sample testing problem between two node-specific probability density functions, $p_v$ and $q_v$. The goal is to identify nodes where the null hypothesis $p_v = q_v$ should be rejected, under the assumption that connected nodes would yield similar test outcomes. We propose the non-parametric collaborative two-sample testing (CTST) framework that efficiently leverages the graph structure and minimizes the assumptions over $p_v$ and $q_v$. CTST integrates elements from f-divergence estimation, Kernel Methods, and Multitask Learning. We use synthetic experiments and a real sensor network detecting seismic activity to demonstrate that CTST outperforms state-of-the-art non-parametric statistical tests that apply at each node independently and hence disregard the geometry of the problem.
Poster
Arnab Bhattacharyya · Constantinos Daskalakis · Themistoklis Gouleakis · Yuhao Wang
[ Hall A-E ]
Abstract
We provide efficient algorithms for the problem of distribution learning from high-dimensional Gaussian data where in each sample, some of the variable values are missing. We suppose that the variables are {\em missing not at random (MNAR)}. The missingness model, denoted by $\mathbb{S}(\mathbf{y})$, is the function that maps any point $\mathbf{y}\in \mathbb{R}^d$ to the subset of its coordinates that are seen. In this work, we assume that it is known. We study the following two settings:
- [**Self-censoring**] An observation $\mathbf{x}$ is generated by first sampling the true value $\mathbf{y}$ from a $d$-dimensional Gaussian $\mathcal{N}(\mathbf{\mu}^*, \Sigma^*)$ with unknown $\mathbf{\mu}^*$ and $\Sigma^*$. For each coordinate $i$, there exists a set $S_i\subseteq \mathbb{R}$ such that $x_i=y_i$ if and only if $y_i\in S_i$. Otherwise, $x_i$ is missing and takes a generic value (e.g., ``?''). We design an algorithm that learns $\mathcal{N}(\mathbf{\mu}^*, \Sigma^*)$ up to TV distance $\varepsilon$, using $\textup{poly}(d, 1/\varepsilon)$ samples, assuming only that each pair of coordinates is observed with sufficiently high probability.
- [**Linear thresholding**] An observation $\mathbf{x}$ is generated by first sampling $\mathbf{y}$ from a $d$-dimensional Gaussian $\mathcal{N}(\mathbf{\mu}^*, \Sigma)$ with unknown $\mathbf{\mu}^*$ and known $\Sigma$, and then applying the missingness model $\mathbb{S}$ where $\mathbb{S}(\mathbf{y}) = \{i \in [d]: \mathbf{v}_i^T \mathbf{y} \leq b_i\}$ …
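The self-censoring mechanism is easy to simulate; the sketch below is a minimal illustration of the data-generating model as described in the abstract (the function name and the use of predicates to encode the sets $S_i$ are our choices):

```python
def self_censor(y, observed_in):
    """Sketch of the self-censoring missingness model: coordinate i of the
    true sample y is observed iff y[i] lies in the known set S_i, encoded
    here as a predicate; otherwise the entry takes the generic value "?"."""
    return [yi if observed_in[i](yi) else "?" for i, yi in enumerate(y)]
```

For example, with $S_1 = S_2 = (0, \infty)$, a draw $\mathbf{y} = (0.5, -2.0)$ is observed as $(0.5, \text{?})$; the learner must recover the Gaussian despite this value-dependent censoring.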
Poster
Tomoya Murata · Atsushi Nitanda · Taiji Suzuki
[ Hall A-E ]
Abstract
This study extends the problem setting of Invariant Risk Minimization (IRM) for Out-of-Distribution generalization to settings with unknown clustered environments. In this scenario, where a given set of environments exhibits an unknown clustered structure, our objective is to identify a single invariant feature extractor and per-cluster regressors (or classifiers) built on top of the feature extractor. To achieve this, we propose a new framework called Clustered IRM for simultaneously identifying the cluster structure and the invariant features. Our theoretical analysis demonstrates that the required number of training environments for such identification is only $O(d_\mathrm{sp} + K^2)$, where $d_\mathrm{sp}$ represents the dimensionality of the spurious features, and $K$ is the number of clusters. Numerical experiments validate the effectiveness of our proposed framework.
Poster
Diego Martinez-Taboada · Aaditya Ramdas
[ Hall A-E ]
Abstract
We present a sequential version of the kernelized Stein discrepancy goodness-of-fit test, which allows for conducting goodness-of-fit tests for unnormalized densities that are continuously monitored and adaptively stopped. That is, the sample size need not be fixed prior to data collection; the practitioner can choose whether to stop the test or continue to gather evidence at any time while controlling the false discovery rate. In stark contrast to related literature, we do not impose uniform boundedness on the Stein kernel. Instead, we exploit the potential boundedness of the Stein kernel at arbitrary point evaluations to define test martingales that give way to the subsequent novel sequential tests. We prove the validity of the test, as well as an asymptotic lower bound for the logarithmic growth of the wealth process under the alternative. We further illustrate the empirical performance of the test with a variety of distributions, including restricted Boltzmann machines.
Poster
Jonathan Geuter · Clément Bonet · Anna Korba · David Alvarez-Melis
[ Hall A-E ]
Abstract
Deep Equilibrium Models (DEQs) are a class of implicit neural networks that solve for a fixed point of a neural network in their forward pass. Traditionally, DEQs take sequences as inputs, but have since been applied to a variety of data. In this work, we present Distributional Deep Equilibrium Models (DDEQs), extending DEQs to discrete measure inputs, such as sets or point clouds. We provide a theoretically grounded framework for DDEQs. Leveraging Wasserstein gradient flows, we show how the forward pass of the DEQ can be adapted to find fixed points of discrete measures under permutation-invariance, and derive adequate network architectures for DDEQs. In experiments, we show that they can compete with state-of-the-art models in tasks such as point cloud classification and point cloud completion, while being significantly more parameter-efficient.
Poster
Junyu Cao · Ruijiang Gao · Esmaeil Keyvanshokooh
[ Hall A-E ]
Abstract
Human doctors frequently recommend actionable recourses that allow patients to modify their conditions to access more effective treatments. Inspired by such healthcare scenarios, we propose the Recourse Linear UCB (\textsf{RLinUCB}) algorithm, which optimizes both action selection and feature modifications by balancing exploration and exploitation. We further extend this to the Human-AI Linear Recourse Bandit (\textsf{HR-Bandit}), which integrates human expertise to enhance performance. \textsf{HR-Bandit} offers three key guarantees: (i) a warm-start guarantee for improved initial performance, (ii) a human-effort guarantee to minimize required human interactions, and (iii) a robustness guarantee that ensures sublinear regret even when human decisions are suboptimal. Empirical results, including a healthcare case study, validate its superior performance against existing benchmarks.
Poster
Michael Ito · Jiong Zhu · Dexiong Chen · Danai Koutra · Jenna Wiens
[ Hall A-E ]
Abstract
In this work, we theoretically demonstrate that current graph positional encodings (PEs) are not beneficial and could potentially hurt performance in tasks involving heterophilous graphs, where nodes that are close tend to have different labels. This limitation is critical as many real-world networks exhibit heterophily, and even highly homophilous graphs can contain local regions of strong heterophily. To address this limitation, we propose Learnable Laplacian Positional Encodings (LLPE), a new PE that leverages the full spectrum of the graph Laplacian, enabling them to capture graph structure on both homophilous and heterophilous graphs. Theoretically, we prove LLPE's ability to approximate a general class of graph distances and demonstrate its generalization properties. Empirically, our evaluation on 12 benchmarks demonstrates that LLPE improves accuracy across a variety of GNNs, including graph transformers, by up to 35% and 14% on synthetic and real-world graphs, respectively. Going forward, our work represents a significant step towards developing PEs that effectively capture complex structures in heterophilous graphs.
Poster
Han Bao · Shinsaku Sakaue
[ Hall A-E ]
Abstract
Inverse optimization aims to recover the unknown state in forward optimization after observing a state-outcome pair. This is relevant when we want to identify the underlying state of a system or to design a system with desirable outcomes. Whereas inverse optimization has been investigated from an algorithmic perspective over the past two decades, its formulation is intimately tied to the principal's subjective choice of a desirable state---indeed, this is crucial to make the inverse problem well-posed. We go beyond conventional inverse optimization by building upon prediction markets, where multiple agents submit their beliefs until converging to market equilibria. The market equilibria express the crowd consensus on a desirable state, effectively eschewing the subjective design. To this end, we derive a proper scoring rule for prediction market design in the context of inverse optimization.
Poster
Matthijs Ebbens · Nicole Funk · Jan Höckendorff · Christian Sohler · Vera Weil
[ Hall A-E ]
Abstract
We study the $k$-center problem in the context of individual fairness. Let $P$ be a set of $n$ points in a metric space and $r_x$ be the distance between $x \in P$ and its $\lceil n/k \rceil$-th nearest neighbor. The problem asks to optimize the $k$-center objective under the constraint that, for every point $x$, there is a center within distance $r_x$. We give bicriteria $(\beta,\gamma)$-approximation algorithms that compute clusterings such that every point $x \in P$ has a center within distance $\beta r_x$ and the clustering cost is at most $\gamma$ times the optimal cost. Our main contributions are a deterministic $O(n^2+ kn \log n)$ time $(2,2)$-approximation algorithm and a randomized $O(nk\log(n/\delta)+k^2/\varepsilon)$ time $(10,2+\varepsilon)$-approximation algorithm, where $\delta$ denotes the failure probability. For the latter, we develop a randomized sampling procedure to compute constant factor approximations for the values $r_x$ for all $x\in P$ in subquadratic time; we believe this procedure to be of independent interest within the context of individual fairness.
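The fairness radii $r_x$ have a direct (quadratic-time) definition, which the paper's sampling procedure approximates in subquadratic time. A minimal sketch of the exact computation, assuming neighbors exclude the point itself (conventions may differ):

```python
import math

def fairness_radii(points, k, dist):
    """Compute r_x for each x in points: the distance from x to its
    ceil(n/k)-th nearest neighbor, under an arbitrary metric dist."""
    n = len(points)
    m = math.ceil(n / k)
    radii = []
    for i, x in enumerate(points):
        # Sort distances to all other points; the (m-1)-index entry is the
        # m-th nearest neighbor distance.
        ds = sorted(dist(x, y) for j, y in enumerate(points) if j != i)
        radii.append(ds[m - 1])
    return radii
```

For six points on a line with $k = 2$ (so $m = 3$), the endpoint $0$ gets $r_x = 3$ while the interior point $2$ gets $r_x = 2$: interior points have closer "fair" neighborhoods, so they demand nearer centers.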
Poster
Jake Fawkes · Lucile Ter-Minassian · Desi Ivanova · Uri Shalit · Chris Holmes
[ Hall A-E ]
Abstract
Merging datasets across institutions is a lengthy and costly procedure, especially when it involves private information. Data hosts may therefore want to prospectively gauge which datasets are most beneficial to merge with, without revealing sensitive information. For causal estimation this is particularly challenging as the value of a merge depends not only on reduction in epistemic uncertainty but also on improvement in overlap. To address this challenge, we introduce the first \emph{cryptographically secure} information-theoretic approach for quantifying the value of a merge in the context of heterogeneous treatment effect estimation. We do this by evaluating the \emph{Expected Information Gain} (EIG) using multi-party computation to ensure that no raw data is revealed. We further demonstrate that our approach can be combined with differential privacy (DP) to meet arbitrary privacy requirements whilst preserving more accurate computation compared to DP alone. To the best of our knowledge, this work presents the first privacy-preserving method for dataset acquisition tailored to causal estimation. Code is publicly available: \url{https://github.com/LucileTerminassian/causal_prospective_merge}.
Poster
James Odgers · Ruby Sedgwick · Chrysoula Kappatou · Ruth Misener · Sarah Filippi
[ Hall A-E ]
Abstract
This work develops a Bayesian non-parametric approach to signal separation where the signals may vary according to latent variables. Our key contribution is to augment Gaussian Process Latent Variable Models (GPLVMs) for the case where each data point comprises the weighted sum of a known number of pure component signals observed across several input locations. Our framework allows arbitrary non-linear variations in the signals while being able to incorporate useful priors for the linear weights, such as including a sum-to-one constraint. Our contributions are particularly relevant to spectroscopy, where changing conditions may cause the underlying pure component signals to vary from sample to sample. To demonstrate the uses of our model we compare our method to alternatives on several simulated and real datasets.
Poster
Riccardo Poiani · Marc Jourdan · Emilie Kaufmann · Rémy Degenne
[ Hall A-E ]
Abstract
We study the fixed-confidence best-arm identification problem in unimodal bandits, in which the means of the arms increase with the index of the arm up to their maximum, then decrease. We derive two lower bounds on the stopping time of any algorithm. The instance-dependent lower bound suggests that due to the unimodal structure, only three arms contribute to the leading confidence-dependent cost. However, a worst-case lower bound shows that a linear dependence on the number of arms is unavoidable in the confidence-independent cost. We propose modifications of Track-and-Stop and a Top Two algorithm that leverage the unimodal structure. Both versions of Track-and-Stop are asymptotically optimal for one-parameter exponential families. The Top Two algorithm is asymptotically near-optimal for Gaussian distributions and we prove a non-asymptotic guarantee matching the worst-case lower bound. The algorithms can be implemented efficiently and we demonstrate their competitive empirical performance.
Poster
Tianyu Chen · Vansh Bansal · James Scott
[ Hall A-E ]
Abstract
Neural posterior estimation (NPE), a simulation-based computational approach for Bayesian inference, has shown great success in approximating complex posterior distributions. Existing NPE methods typically rely on normalizing flows, which approximate a distribution by composing many simple, invertible transformations. But flow-based models, while state of the art for NPE, are known to suffer from several limitations, including training instability and sharp trade-offs between representational power and computational cost. In this work, we demonstrate the effectiveness of conditional diffusions coupled with high-capacity summary networks for amortized NPE. Conditional diffusions address many of the challenges faced by flow-based methods. Our results show that, across a highly varied suite of benchmarking problems for NPE architectures, diffusions offer improved stability, superior accuracy, and faster training times, even with simpler, shallower models. Building on prior work on diffusions for NPE, we show that these gains persist across a variety of different summary network architectures. Code is available at https://github.com/TianyuCodings/cDiff.
Poster
Amartya Banerjee · Lee · Nir Sharon · Caroline Moosmüller
[ Hall A-E ]
Abstract
Capturing data from dynamic processes through cross-sectional measurements is seen in many fields, such as computational biology. Trajectory inference deals with the challenge of reconstructing continuous processes from such observations. In this work, we propose methods for B-spline approximation and interpolation of point clouds through consecutive averaging that is intrinsic to the Wasserstein space. Combining subdivision schemes with optimal transport-based geodesics, our methods carry out trajectory inference at a chosen level of precision and smoothness, and can automatically handle scenarios where particles undergo division over time. We prove linear convergence rates and rigorously evaluate our method on cell data characterized by bifurcations, merges, and trajectory splitting scenarios like *supercells*, comparing its performance against state-of-the-art trajectory inference and interpolation methods. The results not only underscore the effectiveness of our method in inferring trajectories but also highlight the benefit of performing interpolation and approximation that respect the inherent geometric properties of the data.
Poster
Masanari Kimura · Howard Bondell
[ Hall A-E ]
Abstract
The density ratio of two probability distributions is one of the fundamental tools in mathematical and computational statistics and machine learning, and it has a variety of known applications. Therefore, density ratio estimation from finite samples is a very important task, but it is known to be unstable when the distributions are distant from each other. One approach to address this problem is density ratio estimation using incremental mixtures of the two distributions. We geometrically reinterpret existing methods for density ratio estimation based on incremental mixtures. We show that these methods can be regarded as iterating on the Riemannian manifold along a particular curve between the two probability distributions. Making use of the geometry of the manifold, we propose to consider incremental density ratio estimation along generalized geodesics on this manifold. To achieve such a method requires Monte Carlo sampling along geodesics via transformations of the two distributions. We show how to implement an iterative algorithm to sample along these geodesics and show how changing the distances along the geodesic affects the variance and accuracy of the estimation of the density ratio.
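The telescoping identity behind incremental-mixture density ratio estimation can be verified directly. The sketch below only illustrates the identity along the mixture-geometry curve $m_\alpha = \alpha p + (1-\alpha) q$; in the actual methods each factor is *estimated* from samples, which is easier because adjacent mixtures are close (the function name and grid construction here are our own):

```python
def telescoped_ratio(p, q, x, alphas):
    """p(x)/q(x) as a telescoping product of ratios of successive mixture
    densities m_a(x) = a*p(x) + (1-a)*q(x), with m_0 = q and m_1 = p."""
    grid = [0.0] + sorted(alphas) + [1.0]
    ratio = 1.0
    for a0, a1 in zip(grid, grid[1:]):
        m0 = a0 * p(x) + (1 - a0) * q(x)
        m1 = a1 * p(x) + (1 - a1) * q(x)
        ratio *= m1 / m0  # each factor is a ratio of two nearby densities
    return ratio
```

Because each intermediate $m_{\alpha}$ sits between $p$ and $q$, every factor is a ratio of two close densities, which is exactly why stepping along a curve between the distributions stabilizes the estimate; the paper generalizes the choice of curve to geodesics on the manifold.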
Poster
Yasunari Hikima · Ken Kobayashi · Akinori Tanaka · Akiyoshi Sannai · Naoki Hamada
[ Hall A-E ]
Abstract
Multi-objective optimization aims to find a set of solutions that achieve the best trade-off among multiple conflicting objective functions. While various multi-objective optimization algorithms have been proposed so far, most of them aim to find finite solutions as an approximation of the Pareto set, which may not adequately capture the entire structure of the Pareto set, especially when the number of variables is large. To overcome this limitation, we propose a method to obtain a parametric hypersurface representing the entire Pareto set instead of a finite set of points. Since the Pareto set of an $M$-objective optimization problem typically forms an $(M-1)$-dimensional simplex, we use a Bézier simplex as a model to express the Pareto set. We then develop a stochastic gradient descent-based algorithm that updates the Bézier simplex model toward the Pareto set, introducing a preconditioning matrix to enhance convergence. Our convergence analysis demonstrated that the proposed algorithm outperforms naive stochastic gradient descent in terms of convergence rate. Furthermore, we validate the effectiveness of our method through various multi-objective optimization problem instances, including real-world problems.
Poster
Jiaqi Han · Mingjian Jiang · Yuxuan Song · Stefano Ermon · Minkai Xu
[ Hall A-E ]
Abstract
Preference optimization has made significant progress recently, with numerous methods developed to align language models with human preferences. This paper introduces $f$-divergence Preference Optimization ($f$-PO), a novel framework that generalizes and extends existing approaches. $f$-PO minimizes $f$-divergences between the optimized policy and the optimal policy, encompassing a broad family of alignment methods using various divergences. Our approach unifies previous algorithms like DPO and EXO, while offering new variants through different choices of $f$-divergences. We provide theoretical analysis of $f$-PO's properties and conduct extensive experiments on state-of-the-art language models using benchmark datasets. Results demonstrate $f$-PO's effectiveness across various tasks, achieving superior performance compared to existing methods on popular benchmarks such as AlpacaEval 2, Arena-Hard, MT-Bench, and Open LLM Leaderboard v2. Additionally, we present ablation studies exploring the impact of different $f$-divergences, offering insights into the trade-offs between regularization and performance in offline preference optimization. Our work contributes both practical algorithms and theoretical understanding to the field of language model alignment. Code is available at https://github.com/MinkaiXu/fPO.
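The unifying object in the abstract above is the $f$-divergence itself; different generators $f$ recover different alignment objectives. A minimal discrete sketch (the generator names and this toy setting are ours, not the paper's implementation):

```python
import math

def f_divergence(p, q, f):
    """Discrete f-divergence D_f(p || q) = sum_x q(x) * f(p(x) / q(x)),
    for a convex generator f with f(1) = 0."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q) if qi > 0)

# Two common generators: forward KL and reverse KL.
kl_gen = lambda t: t * math.log(t) if t > 0 else 0.0
rkl_gen = lambda t: -math.log(t)
```

With $f(t) = t \log t$ this reduces to $\mathrm{KL}(p \,\|\, q)$, the divergence implicitly minimized by DPO-style objectives; swapping in other generators yields the new variants the framework covers.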
Poster
Zexuan Sun · Garvesh Raskutti
[ Hall A-E ]
Abstract
As opaque black-box predictive models such as neural networks become more prevalent, the need to develop interpretations for these models is of great interest. The concept of $\textit{variable importance}$ is an interpretability measure that applies to any predictive model and assesses how much a variable or set of variables improves prediction performance. When the number of variables is large, estimating variable importance presents a significant challenge because re-training neural networks or other black-box algorithms requires significant additional computation. In this paper, we address this challenge for algorithms using gradient descent and gradient boosting (e.g. neural networks, gradient-boosted decision trees). By using the ideas of early stopping of gradient-based methods in combination with warm-start using the $\textit{dropout}$ method, we develop a scalable method to estimate variable importance for any algorithm that can be expressed as an $\textit{iterative kernel update equation}$. Importantly, we provide theoretical guarantees by using the theory for early stopping of kernel-based methods for neural networks with sufficiently large width and gradient-boosted decision trees that use symmetric trees as weak learners. We also demonstrate the efficacy of our methods through simulations and a real data example which illustrates the computational benefit of early stopping rather than fully re-training …
Poster
Tao Wen · Zihan Wang · Quan Zhang · Qi Lei
[ Hall A-E ]
Abstract
Deep learning models can suffer from severe performance degradation when relying on spurious correlations between input features and labels, performing well on training data but with poor prediction accuracy for minority groups. This problem arises especially when training data are limited or imbalanced. While most prior work focuses on learning invariant features (with consistent correlations to the label $y$), it overlooks the potential harm of spurious correlations between features. We hereby propose Elastic Representation (ElRep) to learn features by imposing Nuclear- and Frobenius-norm penalties on the representation from the last layer of a neural network. Similar to the elastic net, ElRep enjoys the benefits of learning important features without losing feature diversity. The proposed method is simple yet effective. It can be integrated into many deep learning approaches to mitigate spurious correlations and improve group robustness. Moreover, we theoretically show that ElRep has minimum negative impacts on in-distribution predictions. This is a remarkable advantage over approaches that prioritize minority groups at the cost of overall performance.
Poster
Sachindra Pasan Dissanayake · Faisal Hamman · Barproda Halder · Ilia Sucholutsky · Qiuyi Zhang · Sanghamitra Dutta
[ Hall A-E ]
Abstract
Knowledge distillation deploys complex machine learning models in resource-constrained environments by training a smaller student model to emulate internal representations of a complex teacher model. However, the teacher's representations can also encode nuisance or additional information not relevant to the downstream task. Distilling such irrelevant information can actually impede the performance of a capacity-limited student model. This observation motivates our primary question: What are the information-theoretic limits of knowledge distillation? To this end, we leverage Partial Information Decomposition to quantify and explain the transferred knowledge and knowledge left to distill for a downstream task. We theoretically demonstrate that the task-relevant transferred knowledge is succinctly captured by the measure of redundant information about the task between the teacher and student. We propose a novel multi-level optimization to incorporate redundant information as a regularizer, leading to our framework of Redundant Information Distillation (RID). RID leads to more resilient and effective distillation under nuisance teachers as it succinctly quantifies task-relevant knowledge rather than simply aligning student and teacher representations.
Poster
Kihyuk (Ki) Hong · Woojin Chae · Yufan Zhang · Dabeen Lee · Ambuj Tewari
[ Hall A-E ]
Abstract
We study the problem of infinite-horizon average-reward reinforcement learning with linear Markov decision processes (MDPs). The associated Bellman operator of the problem not being a contraction makes the algorithm design challenging. Previous approaches either suffer from computational inefficiency or require strong assumptions on dynamics, such as ergodicity, for achieving a regret bound of $\widetilde{\mathcal{O}}(\sqrt{T})$. In this paper, we propose the first algorithm that achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ regret with computational complexity polynomial in the problem parameters, without making strong assumptions on dynamics. Our approach approximates the average-reward setting by a discounted MDP with a carefully chosen discounting factor, and then applies an optimistic value iteration. We propose an algorithmic structure that plans for a nonstationary policy through optimistic value iteration and follows that policy until a specified information metric in the collected data doubles. Additionally, we introduce a value function clipping procedure for limiting the span of the value function for sample efficiency.
Poster
Zheyang Shen · Jeremias Knoblauch · Sam Power · Chris Oates
[ Hall A-E ]
Abstract
Deterministic mathematical models, such as those specified via differential equations, are a powerful tool to communicate scientific insight. However, such models are necessarily simplified descriptions of the real world. Generalised Bayesian methodologies have been proposed for inference with misspecified models, but these are typically associated with vanishing parameter uncertainty as more data are observed. In the context of a misspecified deterministic mathematical model, this has the undesirable consequence that posterior predictions become deterministic and certain, while being incorrect. Taking this observation as a starting point, we propose *Prediction-Centric Uncertainty Quantification*, where a mixture distribution based on the deterministic model confers improved uncertainty quantification in the predictive context. Computation of the mixing distribution is cast as a (regularised) gradient flow of the maximum mean discrepancy (MMD), enabling consistent numerical approximations to be obtained. Results are reported on both a toy model from population ecology and a real model of protein signalling in cell biology.
Poster
Fengxue Zhang · Thomas Desautels · Yuxin Chen
[ Hall A-E ]
Abstract
Multi-fidelity Bayesian optimization (MFBO) is a powerful approach that utilizes low-fidelity, cost-effective sources to expedite the exploration and exploitation of a high-fidelity objective function. Existing MFBO methods with theoretical foundations either lack justification for performance improvements over single-fidelity optimization or rely on strong assumptions about the relationships between fidelity sources to construct surrogate models and direct queries to low-fidelity sources. To mitigate the dependency on cross-fidelity assumptions while maintaining the advantages of low-fidelity queries, we introduce a random sampling and partition-based MFBO framework with deep kernel learning. This framework is robust to cross-fidelity model misspecification and explicitly illustrates the benefits of low-fidelity queries. Our results demonstrate that the proposed algorithm effectively manages complex cross-fidelity relationships and efficiently optimizes the target fidelity function.
Poster
Lucas Kook
[ Hall A-E ]
Abstract
Discovering causal relationships from observational data is a fundamental yet challenging task. Invariant causal prediction (ICP, Peters, Bühlmann, and Meinshausen) is a method for causal feature selection which requires data from heterogeneous settings and exploits that causal models are invariant. ICP has been extended to general additive noise models and to nonparametric settings using conditional independence tests. However, the latter often suffer from low power (or poor Type I error control) and additive noise models are not suitable for applications in which the response is not measured on a continuous scale, but reflects categories or counts. Here, we develop transformation-model (tram) based ICP, allowing for continuous, categorical, count-type, and uninformatively censored responses (these model classes, generally, do not allow for identifiability when there is no exogenous heterogeneity). As an invariance test, we propose tram-GCM based on the expected conditional covariance between environments and score residuals with uniform asymptotic level guarantees. For the special case of linear shift trams, we also consider tram-Wald, which tests invariance based on the Wald statistic. We provide an open-source 𝖱 package tramicp and evaluate our approach on simulated data and in a case study investigating causal features of survival in critically ill patients. Supplementary materials …
Poster
Yannick Eich · Bastian Alt · Heinz Koeppl
[ Hall A-E ]
Abstract
We propose a novel, tractable latent state inference scheme for Markov jump processes, for which exact inference is often intractable. Our approach is based on an entropic matching framework that can be embedded into the well-known expectation propagation algorithm. We demonstrate the effectiveness of our method by providing closed-form results for a simple family of approximate distributions and apply it to the general class of chemical reaction networks, which are a crucial tool for modeling in systems biology. Moreover, we derive closed-form expressions for point estimation of the underlying parameters using an approximate expectation maximization procedure. We evaluate our method across various chemical reaction networks and compare it to multiple baseline approaches, demonstrating superior performance in approximating the mean of the posterior process. Finally, we discuss the limitations of our method and potential avenues for future improvement, highlighting its promising direction for addressing complex continuous-time Bayesian inference problems.
Poster
Xiaoxue Han · Huzefa Rangwala · Yue Ning
[ Hall A-E ]
Abstract
Graph Neural Networks (GNNs) are susceptible to distribution shifts, creating vulnerability and security issues in critical domains. There is a pressing need to enhance the generalizability of GNNs on out-of-distribution (OOD) test data. Existing methods that target learning an invariant (feature, structure)-label mapping often depend on oversimplified assumptions about the data generation process, which do not adequately reflect the actual dynamics of distribution shifts in graphs. In this paper, we introduce a more realistic graph data generation model using Structural Causal Models (SCMs), allowing us to redefine distribution shifts by pinpointing their origins within the generation process. Building on this, we propose a causal decoupling framework, DeCaf, that independently learns unbiased feature-label and structure-label mappings. We provide a detailed theoretical framework that shows how our approach can effectively mitigate the impact of various distribution shifts. We evaluate DeCaf across both real-world and synthetic datasets that demonstrate different patterns of shifts, confirming its efficacy in enhancing the generalizability of GNNs. Our code is available at: https://github.com/hanxiaoxue114/DeCaf-GraphOOD.
Poster
Ha Manh Bui · Enrique Mallada · Anqi Liu
[ Hall A-E ]
Abstract
By leveraging the representation power of deep neural networks, neural upper confidence bound (UCB) algorithms have shown success in contextual bandits. To further balance the exploration and exploitation, we propose Neural-$\sigma^2$-LinearUCB, a variance-aware algorithm that utilizes $\sigma^2_t$, i.e., an upper bound of the reward noise variance at round $t$, to enhance the uncertainty quantification quality of the UCB, resulting in a regret performance improvement. We provide an oracle version for our algorithm characterized by an oracle variance upper bound $\sigma^2_t$ and a practical version with a novel estimation for this variance bound. Theoretically, we provide rigorous regret analysis for both versions and prove that our oracle algorithm achieves a better regret guarantee than other neural-UCB algorithms in the neural contextual bandits setting. Empirically, our practical method enjoys a similar computational efficiency, while outperforming state-of-the-art techniques by having a better calibration and lower regret across multiple standard settings, including on the synthetic, UCI, MNIST, and CIFAR-10 datasets.
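The core mechanism of a variance-aware linear UCB update can be sketched with a weighted ridge regression; this is a hypothetical simplification of the paper's estimator (function names, the weighting by $1/\sigma_t^2$, and the bonus form are our assumptions, and the neural feature map is omitted):

```python
import numpy as np

def vw_update(A, b, x, r, sigma2):
    """One variance-weighted ridge update: an observation (x, r) with noise
    variance sigma2 is down-weighted by 1/sigma2 in both the design matrix
    A and the response vector b."""
    w = 1.0 / sigma2
    A = A + w * np.outer(x, x)
    b = b + w * r * x
    return A, b

def ucb_score(A, b, x, beta=1.0):
    """Optimistic score: point estimate theta^T x plus an exploration bonus
    that shrinks in directions the data have already explored."""
    theta = np.linalg.solve(A, b)
    return float(theta @ x + beta * np.sqrt(x @ np.linalg.solve(A, x)))
```

Down-weighting noisy rounds tightens the confidence ellipsoid where observations are reliable, which is the source of the improved regret guarantee relative to variance-agnostic neural-UCB updates.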
Poster
Frederiek Wesel · Kim Batselier
[ Hall A-E ]
Abstract
In this paper we establish a new connection between Tensor Network (TN)-constrained kernel machines and Gaussian Processes (GPs). We prove that the outputs of Canonical Polyadic Decomposition (CPD) and Tensor Train (TT)-constrained kernel machines converge in the limit of large ranks to the same GP which we fully characterize, when specifying appropriate i.i.d. priors across their components. We show that TT-constrained models achieve faster convergence to the GP compared to their CPD counterparts for the same number of model parameters. The convergence to the GP occurs as the ranks tend to infinity, as opposed to the standard approach which introduces TNs as an additional constraint on the posterior. This implies that the newly established priors allow the models to learn features more freely as they necessitate infinitely more parameters to converge to a GP, which is characterized by a fixed learning representation and thus no feature learning. As a consequence, the newly derived priors yield more flexible models which can better fit the data, albeit at increased risk of overfitting. We demonstrate these considerations by means of two numerical experiments.
Poster
Shanyun Gao · Raghavendra Addanki · Tong Yu · Ryan Rossi · Murat Kocaoglu
[ Hall A-E ]
Abstract
Change point detection in time series aims to identify moments when the probability distribution of time series changes. It is widely applied in many areas, such as human activity sensing and medical science. In the context of multivariate time series, this typically involves examining the joint distribution of multiple variables: If the distribution of any one variable changes, the entire time series undergoes a distribution shift. However, in practical applications, we may be interested only in certain components of the time series, exploring abrupt changes in their distributions while accounting for the presence of other components. Here, assuming an underlying structural causal model that governs the time-series data generation, we address this task by proposing a two-stage non-parametric algorithm that first learns parts of the causal structure through constraint-based discovery methods, and then employs conditional relative Pearson divergence estimation to identify the change points. The conditional relative Pearson divergence quantifies the distribution difference between consecutive segments in the time series, while the causal discovery method allows a focus on the causal mechanism, facilitating access to independent and identically distributed (IID) samples. Theoretically, the typical assumption of samples being IID in conventional change point detection methods can be relaxed based on …
Poster
Fuqiang Liu · Sicong Jiang · Luis Miranda-Moreno · Seongjin Choi · Lijun Sun
[ Hall A-E ]
Abstract
Large Language Models (LLMs) have recently demonstrated significant potential in the field of time series forecasting, offering impressive capabilities in handling complex temporal data. However, their robustness and reliability in real-world applications remain under-explored, particularly concerning their susceptibility to adversarial attacks. In this paper, we introduce a targeted adversarial attack framework for LLM-based time series forecasting. By employing both gradient-free and black-box optimization methods, we generate minimal yet highly effective perturbations that significantly degrade the forecasting accuracy across multiple datasets and LLM architectures. Our experiments, which include models such as LLMTime with GPT-3.5, GPT-4, LLaMa, and Mistral, as well as TimeGPT and TimeLLM, show that adversarial attacks lead to much more severe performance degradation than random noise, and demonstrate the broad effectiveness of our attacks across different LLMs. The results underscore the critical vulnerabilities of LLMs in time series forecasting, highlighting the need for robust defense mechanisms to ensure their reliable deployment in practical applications. The code repository can be found at https://github.com/JohnsonJiang1996/AdvAttack_LLM4TS.
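The gradient-free, black-box setting can be illustrated with a plain random-search stand-in for the paper's optimizers (all names below are hypothetical): an attacker only needs query access to the forecaster.

```python
import random

def black_box_attack(series, forecast, horizon_truth, eps=0.05,
                     iters=200, seed=0):
    """Random-search attack sketch: probe small perturbations (at most
    eps per point) and keep whichever most increases the forecaster's
    squared error against held-out truth.  A crude stand-in for the
    paper's gradient-free / black-box optimizers."""
    rng = random.Random(seed)
    def err(xs):
        return sum((p - t) ** 2
                   for p, t in zip(forecast(xs), horizon_truth))
    best, best_err = list(series), err(series)
    for _ in range(iters):
        cand = [x + rng.uniform(-eps, eps) for x in series]
        e = err(cand)
        if e > best_err:
            best, best_err = cand, e
    return best, best_err
```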
Poster
Michael Lindon · Nathan Kallus
[ Hall A-E ]
Abstract
Motivated by monitoring the arrival of incoming adverse events such as customer support calls or crash events from users exposed to an experimental product change, we consider sequential hypothesis testing of continuous-time counting processes. Specifically, we provide a multivariate confidence process on the cumulative rates $(\Lambda^A_t, \Lambda^B_t)$ giving an anytime-valid coverage guarantee $\mathbb{P}[(\Lambda^A_t, \Lambda^B_t) \in C^\alpha_t \, \forall t >0] \geq 1-\alpha$. This provides a simultaneous confidence process on $\Lambda^A_t$, $\Lambda^B_t$, and their difference $\Lambda^B_t-\Lambda^A_t$, allowing each arm of the experiment and the difference between them to be safely monitored throughout the experiment. We extend our results by constructing a closed-form $e$-process for testing the equality of rates with a time-uniform Type-I error guarantee at a nominal $\alpha$. We characterize the asymptotic growth rate of the proposed $e$-process under the alternative and show that it has power 1 when the average rates of the two processes differ in the limit.
Poster
Tung Long Vuong
[ Hall A-E ]
Abstract
In recent years, deep discrete representation learning (DRL) has achieved significant success across various domains. Most DRL frameworks (e.g., the widely used VQ-VAE and its variants) have primarily focused on generative settings, where the quality of a representation is implicitly gauged by the fidelity of its generation. In fact, the goodness of a discrete representation remains ambiguously defined across the literature. In this work, we adopt a practical approach that examines DRL from a task-driven perspective. We propose a unified framework that explores the usefulness of discrete features in relation to downstream tasks, with generation naturally viewed as one possible application. In this context, the properties of discrete representations as well as the way they benefit certain tasks are also relatively understudied. We therefore provide an additional theoretical analysis of the trade-off between representational capacity and sample complexity, shedding light on how discrete representation utilization impacts task performance. Finally, we demonstrate the flexibility and effectiveness of our framework across diverse applications.
Poster
Chudi Zhong · Panyu Chen · Cynthia Rudin
[ Hall A-E ]
Abstract
Faithful explanations are essential for machine learning models in high-stakes applications. Inherently interpretable models are well-suited for these applications because they naturally provide faithful explanations by revealing their decision logic. However, model designers often need to keep these models proprietary to maintain their value. This creates a tension: we need models that are interpretable—allowing human decision-makers to understand and justify predictions, but not transparent, so that the model's decision boundary is not easily replicated by attackers. Shielding the model's decision boundary is particularly challenging alongside the requirement of completely faithful explanations, since such explanations reveal the true logic of the model for an entire subspace around each query point. This work provides an approach, FaithfulDefense, that creates model explanations for logical models that are completely faithful, yet reveal as little as possible about the decision boundary. FaithfulDefense is based on a maximum set cover formulation, and we provide multiple formulations for it, taking advantage of submodularity.
Poster
Jarren Briscoe · Garrett Kepler · Daryl DeFord · Assefaw Gebremedhin
[ Hall A-E ]
Abstract
Evaluating machine learning models is crucial not only for determining their technical accuracy but also for assessing their potential societal implications. While the potential for low-sample-size bias in algorithms is well known, we demonstrate the significance of sample-size bias induced by combinatorics in classification metrics. This revelation challenges the efficacy of these metrics in assessing bias with high resolution, especially when comparing groups of disparate sizes, which frequently arise in social applications. We provide analyses of the bias that appears in several commonly applied metrics and propose a model-agnostic assessment and correction technique. Additionally, we analyze counts of undefined cases in metric calculations, which can lead to misleading evaluations if improperly handled. This work illuminates the previously unrecognized challenge of combinatorics and probability in standard evaluation practices and thereby advances approaches for performing fair and trustworthy classification methods.
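A minimal combinatorial illustration of such sample-size bias (ours, not the paper's correction technique): the expected accuracy of whichever constant prediction happens to fit the sample better exceeds 1/2 at small n purely because of finite-sample fluctuations, and the excess shrinks as n grows.

```python
from math import comb

def expected_max_accuracy(n, p=0.5):
    """E[max(K, n-K)] / n for K ~ Binomial(n, p): the expected accuracy
    of the better of the two constant predictions on a sample of size n.
    Illustrates pure sample-size bias in a metric; an illustration only."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) * max(k, n - k) / n
               for k in range(n + 1))
```

For n = 2 this evaluates to 0.75, far above the population value of 0.5, even though the "classifier" ignores the data entirely; the metric itself carries the bias at small sample sizes.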
Poster
Ashwin Renganathan · Kade Carlson
[ Hall A-E ]
Abstract
Classical evolutionary approaches for multiobjective optimization are quite accurate but incur a lot of queries to the objectives; this can be prohibitive when objectives are expensive oracles. A sample-efficient approach to solving multiobjective optimization is via Gaussian process (GP) surrogates and Bayesian optimization (BO). Multiobjective Bayesian optimization (MOBO) involves the construction of an acquisition function which is optimized to acquire new observation candidates sequentially. This ``inner'' optimization can be hard due to various reasons: acquisition functions being nonconvex, nondifferentiable and/or unavailable in analytical form; batch sampling usually exacerbates these problems and the success of MOBO heavily relies on this inner optimization. This, ultimately, affects their sample efficiency. To overcome these challenges, we propose a Thompson sampling (TS) based approach ($q\texttt{POTS}$). Whereas TS chooses candidates according to the probability that they are optimal, $q\texttt{POTS}$ chooses candidates according to the probability that they are Pareto optimal. Instead of a hard acquisition function optimization, $q\texttt{POTS}$ solves a cheap multiobjective optimization on the GP posteriors with evolutionary approaches. This way we get the best of both worlds: accuracy of evolutionary approaches and sample-efficiency of MOBO. New candidates are chosen on the posterior GP Pareto frontier according to a maximin distance criterion. $q\texttt{POTS}$ is endowed …
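The core selection rule, choosing candidates by the probability that they are Pareto optimal, can be sketched with a single posterior draw. In the sketch below, independent Gaussians stand in for the joint GP posterior and all function names are illustrative:

```python
import random

def pareto_mask(points):
    """True for points not dominated by any other point (maximization)."""
    mask = []
    for i, p in enumerate(points):
        dominated = any(
            all(q[k] >= p[k] for k in range(len(p))) and
            any(q[k] > p[k] for k in range(len(p)))
            for j, q in enumerate(points) if j != i)
        mask.append(not dominated)
    return mask

def ts_pareto_candidates(post_means, post_stds, rng=random):
    """One Thompson step: draw a sample for every candidate from its
    posterior (independent Gaussians here, unlike the true joint GP
    posterior) and keep the candidates that are Pareto-optimal in it."""
    sample = [tuple(rng.gauss(m, s) for m, s in zip(ms, ss))
              for ms, ss in zip(post_means, post_stds)]
    return [i for i, keep in enumerate(pareto_mask(sample)) if keep]
```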
Poster
David Burt · Yunyi Shen · Tamara Broderick
[ Hall A-E ]
Abstract
Spatial prediction tasks are key to weather forecasting, studying air pollution impacts, and other scientific endeavors. Determining how much to trust predictions made by statistical or physical methods is essential for the credibility of scientific conclusions. Unfortunately, classical approaches for validation fail to handle mismatch between locations available for validation and (test) locations where we want to make predictions. This mismatch is often not an instance of covariate shift (as commonly formalized) because the validation and test locations are fixed (e.g., on a grid or at select points) rather than i.i.d. from two distributions. In the present work, we formalize a check on validation methods: that they become arbitrarily accurate as validation data becomes arbitrarily dense. We show that classical and covariate-shift methods can fail this check. We propose a method that builds from existing ideas in the covariate-shift literature, but adapts them to the validation data at hand. We prove that our proposal passes our check. And we demonstrate its advantages empirically on simulated and real data.
Poster
Debmalya Mandal · Andi Nika · Parameswaran Kamalaruban · Adish Singla · Goran Radanovic
[ Hall A-E ]
Abstract
We study data corruption robustness for reinforcement learning with human feedback (RLHF) in an offline setting. Given an offline dataset of pairs of trajectories along with feedback about human preferences, an $\varepsilon$-fraction of the pairs is corrupted (e.g., feedback flipped or trajectory features manipulated), capturing an adversarial attack or noisy human preferences. We aim to design algorithms that identify a near-optimal policy from the corrupted data, with provable guarantees. Existing theoretical works have separately studied the settings of corruption robust RL (learning from scalar rewards directly under corruption) and offline RLHF (learning from human feedback without corruption); however, they are inapplicable to our problem of dealing with corrupted data in the offline RLHF setting. To this end, we design novel corruption robust offline RLHF methods under various assumptions on the coverage of the data-generating distributions. At a high level, our methodology robustifies an offline RLHF framework by first learning a reward model along with confidence sets and then learning a pessimistic optimal policy over the confidence set. Our key insight is that learning the optimal policy can be done by leveraging an offline corruption-robust RL oracle in different ways (e.g., zero-order oracle or first-order oracle), depending on the data coverage assumptions. To …
Poster
Hung-Hsu Chou · Johannes Maly · Claudio Mayrink Verdun · Bernardo da Costa · Heudson Mirandola
[ Hall A-E ]
Abstract
Over the past years, there has been significant interest in understanding the implicit bias of gradient descent optimization and its connection to the generalization properties of overparametrized neural networks. Several works observed that when training linear diagonal networks on the square loss for regression tasks (which corresponds to overparametrized linear regression) gradient descent converges to special solutions, e.g., non-negative ones. We connect this observation to Riemannian optimization and view overparametrized GD with identical initialization as a Riemannian GD. We use this fact for solving non-negative least squares (NNLS), an important problem behind many techniques, e.g., non-negative matrix factorization. We show that gradient flow on the reparametrized objective converges globally to NNLS solutions, providing convergence rates also for its discretized counterpart. Unlike previous methods, we do not rely on the calculation of exponential maps or geodesics. We further show accelerated convergence using a second-order ODE, lending itself to accelerated descent methods. Finally, we establish the stability against negative perturbations and discuss generalization to other constrained optimization problems.
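The reparametrization at the core of the argument is easy to state: writing x = u ⊙ u and running plain gradient descent on u keeps the iterate nonnegative by construction. A toy sketch of the idea only (small dense system, pure Python; the paper additionally gives rates and an accelerated second-order variant):

```python
def nnls_via_overparam(A, b, lr=1e-3, steps=20000, init=0.5):
    """Nonnegative least squares via the reparametrization x = u * u:
    plain gradient descent on u with identical positive initialization,
    so x stays nonnegative throughout.  Illustrative sketch only."""
    n, d = len(A), len(A[0])
    u = [init] * d
    for _ in range(steps):
        x = [ui * ui for ui in u]
        resid = [sum(A[i][j] * x[j] for j in range(d)) - b[i]
                 for i in range(n)]
        # gradient of 0.5 * ||A x - b||^2 in x, then chain rule to u
        grad_x = [sum(A[i][j] * resid[i] for i in range(n))
                  for j in range(d)]
        u = [ui - lr * 2.0 * ui * g for ui, g in zip(u, grad_x)]
    return [ui * ui for ui in u]
```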
Poster
James Thornton · Louis Béthune · Ruixiang ZHANG · Arwen Bradley · Preetum Nakkiran · Shuangfei Zhai
[ Hall A-E ]
Abstract
Diffusion models may be formulated as a time-indexed sequence of energy-based models, where the score corresponds to the negative gradient of an energy function. As opposed to learning the score directly, an energy parameterization is attractive as the energy itself can be used to control generation via Monte Carlo samplers. Architectural constraints and training instability in energy parameterized models have so far yielded inferior performance compared to directly approximating the score or denoiser. We address these deficiencies by introducing a novel training regime for the energy function through distillation of pre-trained diffusion models. We further showcase the synergies between energy and score by casting the diffusion sampling procedure as a Feynman-Kac model with energy-weighted potentials. This formalism enables composition and low temperature sampling through sequential Monte Carlo.
Poster
Wee Chaimanowong · Ying Zhu
[ Hall A-E ]
Abstract
Permutation invariance is among the most common symmetries that can be exploited to simplify complex problems in machine learning. There has been a tremendous surge of research activities in building permutation invariant machine learning architectures. However, less attention is given to: (1) how to statistically test for the assumption of permutation invariance of coordinates in a random vector where the dimension is allowed to grow with the sample size; (2) how to estimate permutation invariant density functions; (3) how much ``smaller'' is the class of smooth functions with permutation invariance compared to that without permutation invariance. In this paper, we take a step back and examine these fundamental questions. In particular, our testing method is based on a sorting trick, and our estimation method is based on an averaging trick. These tricks substantially simplify the exploitation of permutation invariance. We also analyze the metric entropy of permutation invariant function classes and compare them with their counterparts without imposing permutation invariance.
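Both tricks are one-liners in spirit. A sketch, with the obvious caveat that averaging over all d! permutations is only feasible for small dimension (which is precisely what the sorting trick sidesteps):

```python
import itertools
import statistics

def symmetrize(f, x):
    """Averaging trick: make any estimate f permutation-invariant by
    averaging it over all coordinate permutations of x.  Exact but
    O(d!), hence only workable for small dimension d."""
    return statistics.fmean(f(list(p)) for p in itertools.permutations(x))

def sort_canonicalize(x):
    """Sorting trick: every permutation of x maps to the same sorted
    representative, so permutation-invariant quantities can be computed
    on it in O(d log d) instead of averaging over d! permutations."""
    return sorted(x)
```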
Poster
Junkyu Lee · Tian Gao · Elliot Nelson · Miao Liu · Debarun Bhattacharjya · Songtao Lu
[ Hall A-E ]
Abstract
Many practical reinforcement learning environments have a discrete factored action space that induces a large combinatorial set of actions, thereby posing significant challenges. Existing approaches leverage the regular structure of the action space and resort to a linear decomposition of Q-functions, which avoids enumerating all combinations of factored actions. In this paper, we consider Q-functions defined over a lower dimensional projected subspace of the original action space, and study the condition for the unbiasedness of decomposed Q-functions using causal effect estimation from the no unobserved confounder setting in causal statistics. This leads to a general scheme which we call action decomposed reinforcement learning that uses the projected Q-functions to approximate the Q-function in standard model-free reinforcement learning algorithms. The proposed approach is shown to improve sample complexity in a model-based reinforcement learning setting. We demonstrate improvements in sample efficiency compared to state-of-the-art baselines in online continuous control environments and a real-world offline sepsis treatment environment.
Poster
Prathamesh Dharangutte · Jie Gao · Ruobin Gong · Guanyang Wang
[ Hall A-E ]
Abstract
This work proposes a class of differentially private mechanisms for linear queries, in particular range queries, that leverages correlated input perturbation to simultaneously achieve unbiasedness, consistency, statistical transparency, and control over utility requirements in terms of accuracy targets expressed either in certain query margins or as implied by the hierarchical database structure. The proposed Cascade Sampling algorithm instantiates the mechanism exactly and efficiently. Our theoretical and empirical analysis demonstrates that we achieve near-optimal utility, effectively compete with other methods, and retain all the favorable statistical properties discussed earlier.
Poster
Jim Zhao · Nikita Doikov · Aurelien Lucchi
[ Hall A-E ]
Abstract
This paper addresses the optimization problem of minimizing non-convex continuous functions, a problem highly relevant in high-dimensional machine learning scenarios, particularly those involving over-parameterization. We analyze a randomized coordinate second-order method named SSCN, which can be interpreted as applying the cubic regularization of Newton's method in random subspaces. This approach effectively reduces the computational complexity associated with utilizing second-order information, making it applicable in higher-dimensional scenarios. Theoretically, we establish strong global convergence guarantees for non-convex functions to a stationary point, with interpolating rates for arbitrary subspace sizes and allowing inexact curvature estimation, starting from an arbitrary initialization. When increasing the subspace size, our complexity matches the $\mathcal{O}(\epsilon^{-3/2})$ rate of the full Newton's method with cubic regularization. Additionally, we propose an adaptive sampling scheme ensuring the exact convergence rate of $\mathcal{O}(\epsilon^{-3/2}, \epsilon^{-3})$ to a second-order stationary point, without requiring all coordinates to be sampled. Experimental results demonstrate substantial speed-ups achieved by SSCN compared to conventional first-order methods and other second-order subspace methods.
Poster
Yuying Duan · Gelei Xu · Yiyu Shi · Michael Lemmon
[ Hall A-E ]
Abstract
With the emerging application of Federated Learning (FL) in finance, hiring and healthcare, FL models are regulated to be fair, preventing disparities with respect to legally protected attributes such as race or gender. Two concepts of fairness are important in FL: global and local fairness. Global fairness addresses the disparity across the entire population and local fairness is concerned with the disparity within each client. Prior fair FL frameworks have improved either global or local fairness without considering both. Furthermore, while the majority of studies on fair FL focuses on binary settings, many real-world applications are multi-class problems. This paper proposes a framework that investigates the minimum accuracy lost for enforcing a specified level of global and local fairness in multi-class FL settings. Our framework leads to a simple post-processing algorithm that derives fair outcome predictors from the Bayesian optimal score functions. Experimental results show that our algorithm outperforms the current state of the art (SOTA) with regard to the accuracy-fairness tradeoffs, computational and communication costs. Codes are available at: https://github.com/papersubmission678/The-cost-of-local-and-global-fairness-in-FL.
Poster
Ziwei Su · Diego Klabjan
[ Hall A-E ]
Abstract
Stochastic simulation models are generative models that mimic complex systems to help with decision-making. The reliability of these models heavily depends on well-calibrated input model parameters. However, in many practical scenarios, only output-level data are available to learn the input model parameters, which is challenging due to the often intractable likelihood of the stochastic simulation model. Moreover, stochastic simulation models are frequently inexact, with discrepancies between the model and the target system. No existing methods can effectively learn and quantify the uncertainties of input parameters using only output-level data. In this paper, we propose to learn differentiable input parameters of stochastic simulation models using output-level data via kernel score minimization with stochastic gradient descent. We quantify the uncertainties of the learned input parameters using a frequentist confidence set procedure based on a new asymptotic normality result that accounts for model inexactness. The proposed method is evaluated on exact and inexact G/G/1 queueing models as well as a stochastic volatility model.
Poster
Xudong Sun · Nutan Chen · Alexej Gossmann · Yu Xing · Matteo Wohlrapp · Emilio Dorigatti · Carla Feistner · Felix Drost · Daniele Scarcella · Lisa Beer · Carsten Marr
[ Hall A-E ]
Abstract
A probabilistic graphical model is proposed, modeling the joint evolution of model parameters and multipliers with a hypervolume-based likelihood, promoting multi-objective descent in structural risk minimization. We address multi-objective model parameter optimization via a surrogate single-objective penalty loss with time-varying multipliers, equivalent to online scheduling of the loss landscape. The multi-objective descent goal is dispatched hierarchically into a series of constrained optimization sub-problems with shrinking bounds according to Pareto dominance. The bound serves as a setpoint for the low-level multiplier controller, which schedules loss landscapes via output feedback of each loss term. Our method forms a closed loop of the model parameter dynamics, circumvents the excessive memory requirements and extra computational burden of existing multi-objective deep learning methods, and is robust against controller hyperparameter variation, as demonstrated on domain generalization tasks.
Poster
Ambrus Tamás · Szabolcs Szentpéteri · Balázs Csanád Csáji
[ Hall A-E ]
Abstract
Stochastic multi-armed bandits (MABs) provide a fundamental reinforcement learning model to study sequential decision making in uncertain environments. The upper confidence bounds (UCB) algorithm gave birth to the renaissance of bandit algorithms, as it achieves near-optimal regret rates under various moment assumptions. Up until recently most UCB methods relied on concentration inequalities leading to confidence bounds which depend on moment parameters, such as the variance proxy, that are usually unknown in practice. In this paper, we propose a new distribution-free, data-driven UCB algorithm for symmetric reward distributions, which needs no moment information. The key idea is to combine a refined, one-sided version of the recently developed resampled median-of-means (RMM) method with UCB. We prove a near-optimal regret bound for the proposed anytime, parameter-free RMM-UCB method, even for heavy-tailed distributions.
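The classical median-of-means estimator that the refined RMM index builds on can be stated in a few lines. This is the textbook two-sided version, not the paper's resampled, one-sided variant:

```python
import statistics

def median_of_means(xs, k):
    """Split the sample into k equal blocks, average each block, and
    take the median of the block means.  Robust to heavy tails without
    any variance (proxy) parameter."""
    m = len(xs) // k
    block_means = [statistics.fmean(xs[i * m:(i + 1) * m])
                   for i in range(k)]
    return statistics.median(block_means)
```

A single extreme outlier can corrupt at most one block mean, so it moves the median of the block means far less than it moves the plain sample mean.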
Poster
Yan Yang · Bin Gao · Ya-xiang Yuan
[ Hall A-E ]
Abstract
Bilevel reinforcement learning (RL), which features intertwined two-level problems, has attracted growing interest recently. The inherent non-convexity of the lower-level RL problem is, however, an impediment to developing bilevel optimization methods. By employing the fixed point equation associated with the regularized RL, we characterize the hyper-gradient via fully first-order information, thus circumventing the assumption of lower-level convexity. This, remarkably, distinguishes our development of hyper-gradient from the general AID-based bilevel frameworks since we take advantage of the specific structure of RL problems. Moreover, we design both model-based and model-free bilevel reinforcement learning algorithms, facilitated by access to the fully first-order hyper-gradient. Both algorithms enjoy the convergence rate $\mathcal{O}\left(\epsilon^{-1}\right)$. To extend the applicability, a stochastic version of the model-free algorithm is proposed, along with results on its convergence rate and sampling complexity. In addition, numerical experiments demonstrate that the hyper-gradient indeed serves as an integration of exploitation and exploration.
Poster
Yue Xing
[ Hall A-E ]
Abstract
In recent years, studies such as \cite{carmon2019unlabeled,gowal2021improving} demonstrate that incorporating additional real or generated data with pseudo-labels can enhance adversarial training through a two-stage training framework. In this paper, we perform a theoretical analysis of the asymptotic behavior of this method in high-dimensional regression problems when using two-layer neural networks. We first derive the asymptotics of the two-stage training framework using linear regression as a preliminary. Then, we analyze the convergence of two-layer neural networks in the two-stage framework. The analysis considers two different regimes: in the first stage of the framework, it is a high-dimensional regime, and in the second stage, the sample size is much larger than the data dimension. To analyze adversarial training, we track the change of the adversarial attack, and reveal that training with two-layer neural networks gives a prediction performance similar to training a linear model with some particular $\mathcal{L}_2$ regularization corresponding to different regimes. To highlight our technical contribution, we are the first to investigate adversarial training in two-layer neural networks under moderate attack strength, which is different from most existing literature in vanishing attack strength.
Poster
Sandeep Nagar · Girish Varma
[ Hall A-E ]
Abstract
The inverse of an invertible convolution is an important operation that comes up in Normalizing Flows, Image Deblurring, etc. The naive algorithm for backpropagation of this operation using Gaussian elimination has running time $O(n^3)$ where $n$ is the number of pixels in the image. We give a fast parallel backpropagation algorithm with running time $O(\sqrt{n})$ for a square image and provide a GPU implementation of the same. Inverses of convolutions are usually used in the sampling pass of Normalizing Flows, making sampling slow. We propose instead to use the inverse of the convolution in the forward (image to latent vector) pass of the Normalizing Flow. Since the sampling pass is the inverse of the forward pass, it then uses convolutions only, resulting in efficient sampling times. We use our parallel backpropagation algorithm to optimize the inverse convolution layer, resulting in fast training times as well. We implement this approach in various Normalizing Flow backbones, resulting in our Inverse-Flow models. We benchmark Inverse-Flow on standard datasets and show significantly improved sampling times with similar bits per dimension compared to previous models.
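Why an invertible convolution can be inverted at all is easiest to see in 1-D, where a causal kernel makes the operation lower-triangular and forward substitution recovers the input sequentially. The paper's contribution is a parallel $O(\sqrt{n})$ backpropagation scheme for the 2-D case, which this sketch does not attempt:

```python
def causal_conv(x, k):
    """1-D causal convolution: y[i] = sum_j k[j] * x[i-j], with k[0] != 0."""
    return [sum(k[j] * x[i - j] for j in range(len(k)) if i - j >= 0)
            for i in range(len(x))]

def invert_causal_conv(y, k):
    """Invert by forward substitution: in matrix form the convolution
    is lower-triangular with k[0] on the diagonal, so each x[i] follows
    from y[i] and the already-recovered x[i-1], x[i-2], ..."""
    x = []
    for i, yi in enumerate(y):
        acc = sum(k[j] * x[i - j] for j in range(1, len(k)) if i - j >= 0)
        x.append((yi - acc) / k[0])
    return x
```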
Poster
Debmalya Mandal · Goran Radanovic
[ Hall A-E ]
Abstract
We study the setting of \emph{performative reinforcement learning} where the deployed policy affects both the reward, and the transition of the underlying Markov decision process. Prior work~\cite{MTR23} has addressed this problem under the tabular setting and established last-iterate convergence of repeated retraining with iteration complexity explicitly depending on the number of states. In this work, we generalize the results to \emph{linear Markov decision processes} which is the primary theoretical model of large-scale MDPs. The main challenge with linear MDP is that the regularized objective is no longer strongly convex and we want a bound that scales with the dimension of the features, rather than states which can be infinite. Our first result shows that repeatedly optimizing a regularized objective converges to a \emph{performatively stable policy}. In the absence of strong convexity, our analysis leverages a new recurrence relation that uses a specific linear combination of optimal dual solutions for proving convergence. We then tackle the finite sample setting where the learner has access to a set of trajectories drawn from the current policy. We consider a reparametrized version of the primal problem, and construct an empirical Lagrangian which is to be optimized from the samples. We show that, under a …
Poster
David Bosch · Ashkan Panahi
[ Hall A-E ]
Abstract
The Convex Gaussian Min-Max Theorem (CGMT) allows for the study of min-max optimization problems over bilinear Gaussian forms by instead considering an alternative optimization problem whose statistical properties are tied to that of the primary optimization. We prove a generalization of the CGMT to a family of problems in machine learning (ML) with correlated entries in the data matrix. This family includes various familiar examples of problems with shared weights or repeated features. In particular, we make use of our theorem to obtain asymptotically exact learning curves for regression with vector valued labels, regression with complex variables, and regression with convolution.
Poster
Yufeng Zhang · Fengzhuo Zhang · Zhuoran Yang · Zhaoran Wang
[ Hall A-E ]
Abstract
In-Context Learning (ICL) has been found effective across a wide range of applications, where Large Language Models (LLMs) learn to complete tasks from the examples in the prompt without tuning their parameters. In this work, we conduct a comprehensive study to understand ICL from a statistical perspective. First, we show that perfectly pretrained LLMs perform Bayesian Model Averaging (BMA) for ICL under a dynamic model of examples in the prompt. The average error analysis for ICL is then built for the perfectly pretrained LLMs with the analysis of BMA. Second, we demonstrate how the attention structure boosts the BMA implementation. With sufficient examples in the prompt, attention is proven to perform BMA under the Gaussian linear ICL model, which also motivates the explicit construction of the hidden concepts from the attention heads' values. Finally, we analyze the pretraining behavior of LLMs. The pretraining error is decomposed as the generalization error and the approximation error. The generalization error is upper bounded via the PAC-Bayes framework. Then the ICL average error of the pretrained LLMs is shown to be the sum of $O(T^{-1})$ and the pretraining error. In addition, we analyze the ICL performance of the pretrained LLMs with …
Poster
Diantong Li · Fengxue Zhang · Chong Liu · Yuxin Chen
[ Hall A-E ]
Abstract
Multi-objective Bayesian optimization has been widely adopted in scientific experiment design, including drug discovery and hyperparameter optimization. In practice, regulatory or safety concerns often impose additional thresholds on certain attributes of the experimental outcomes. Previous work has primarily focused on constrained single-objective optimization tasks or active search under constraints. Existing constrained multi-objective algorithms address the issue with heuristics and approximations, posing challenges to the analysis of sample efficiency. We propose a novel constrained multi-objective Bayesian optimization algorithm **COMBOO** that balances active learning of the level-set defined on multiple unknowns with multi-objective optimization within the feasible region. We provide both theoretical analysis and empirical evidence, demonstrating the efficacy of our approach on various synthetic benchmarks and real-world applications.
Poster
Mingyu Pu · Songhao Wang · Haowei Wang · Szu Hui Ng
[ Hall A-E ]
Abstract
Gaussian Process (GP) models are widely utilized as surrogate models in scientific and engineering fields. However, standard GP models are limited to continuous variables due to the difficulties in establishing correlation structures for categorical variables. To overcome this limitation, we introduce **WE**ighted Euclidean distance matrices **G**aussian **P**rocess (WEGP). WEGP constructs the kernel function for each categorical input by estimating the Euclidean distance matrix (EDM) among all categorical choices of this input. The EDM is represented as a linear combination of several predefined base EDMs, each scaled by a positive weight. The weights, along with other kernel hyperparameters, are inferred using a fully Bayesian framework. We analyze the predictive performance of WEGP theoretically. Numerical experiments validate the accuracy of our GP model, and by integrating WEGP into Bayesian Optimization (BO), we achieve superior performance on both synthetic and real-world optimization problems. The code is available at: https://github.com/pmy0124nus/WEGP.
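A rough sketch of the EDM-combination idea (the base matrices, weights, and the exponential distance-to-correlation map below are illustrative assumptions, not WEGP's exact construction, which infers the weights in a fully Bayesian way):

```python
import numpy as np

def categorical_kernel(base_edms, weights):
    # Combine predefined base EDMs with positive weights into one EDM,
    # then map distances to correlations with an exponential transform
    # (an illustrative choice of map).
    D = sum(w * B for w, B in zip(weights, base_edms))
    return np.exp(-D)

# Two hypothetical base EDMs over three categorical levels.
D_uniform = np.array([[0., 1., 1.],
                      [1., 0., 1.],
                      [1., 1., 0.]])   # all levels equidistant
D_ordinal = np.array([[0., 1., 4.],
                      [1., 0., 1.],
                      [4., 1., 0.]])   # ordinal-like squared distances

K = categorical_kernel([D_uniform, D_ordinal], weights=[0.5, 0.3])
```

Levels 0 and 2 end up less correlated than levels 0 and 1, because the ordinal base EDM places them further apart.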
Poster
Wenyuan Zhao · Haoyuan Chen · Tie Liu · Rui Tuo · Chao Tian
[ Hall A-E ]
Abstract
With the strengths of both deep learning and kernel methods like Gaussian Processes (GPs), Deep Kernel Learning (DKL) has gained considerable attention in recent years. From the computational perspective, however, DKL becomes challenging when the input dimension of the GP layer is high. To address this challenge, we propose the Deep Additive Kernel (DAK) model, which incorporates i) an additive structure for the last-layer GP; and ii) induced prior approximation for each GP unit. This naturally leads to a last-layer Bayesian neural network (BNN) architecture. The proposed method enjoys the interpretability of DKL as well as the computational advantages of BNN. Empirical results show that the proposed approach outperforms state-of-the-art DKL methods in both regression and classification tasks.
Poster
Saeyoung Rho · Andrew Tang · Noah Bergam · Rachel Cummings · Vishal Misra
[ Hall A-E ]
Abstract
In causal inference with observational studies, synthetic control (SC) has emerged as a prominent tool. SC has traditionally been applied to aggregate-level datasets, but more recent work has extended its use to individual-level data. Because individual-level datasets contain a much greater number of observed units, this shift introduces the curse of dimensionality to SC. To address this, we propose Cluster Synthetic Control (ClusterSC), based on the idea that there may exist groups of individuals whose behavior aligns within a group but diverges across groups. ClusterSC incorporates a clustering step to select only the donors relevant to the target. We provide theoretical guarantees on the improvements induced by ClusterSC, supported by empirical demonstrations on synthetic and real-world datasets. The results indicate that ClusterSC consistently outperforms classical SC approaches.
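The two-step idea can be sketched minimally as follows (a crude two-cluster Lloyd iteration plus least-squares SC weights; real SC typically constrains weights to a simplex, and both the clustering and the fit here are deliberate simplifications):

```python
import numpy as np

def cluster_synthetic_control(donors, target, n_iter=10):
    # Step 1: crude two-cluster Lloyd's algorithm on pre-treatment
    # trajectories, deterministically initialised at the first and last donor.
    centers = donors[[0, -1]].astype(float)
    for _ in range(n_iter):
        labels = np.argmin(((donors[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in (0, 1):
            if (labels == c).any():
                centers[c] = donors[labels == c].mean(0)
    # Step 2: fit SC weights using only donors in the target's cluster.
    target_cluster = int(np.argmin(((target - centers) ** 2).sum(-1)))
    relevant = donors[labels == target_cluster]
    w, *_ = np.linalg.lstsq(relevant.T, target, rcond=None)
    return relevant, w

# Five donors behave like the target, five do not.
donors = np.vstack([np.full((5, 4), 1.0), np.full((5, 4), 10.0)])
target = np.full(4, 1.0)
relevant, w = cluster_synthetic_control(donors, target)
```

Only the five matching donors survive the clustering step, and the fitted weights reconstruct the target exactly on this toy data.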
Poster
Jeongyeol Kwon · Luke Dotson · Yudong Chen · Qiaomin Xie
[ Hall A-E ]
Abstract
Previous studies on two-timescale stochastic approximation (SA) mainly focused on bounding mean-squared errors under diminishing stepsize schemes. In this work, we investigate \emph{constant} stepsize schemes through the lens of Markov processes, proving that the iterates of both timescales converge to a unique joint stationary distribution in the Wasserstein metric. We derive explicit geometric and non-asymptotic convergence rates, as well as the variance and bias introduced by constant stepsizes in the presence of Markovian noise. Specifically, with two constant stepsizes $\alpha < \beta$, we show that the biases scale linearly with both stepsizes as $\Theta(\alpha)+\Theta(\beta)$ up to higher-order terms, while the variance of the slower iterate (resp., faster iterate) scales only with its own stepsize as $O(\alpha)$ (resp., $O(\beta)$). Unlike previous work, our results require no additional assumptions such as $\beta^2 \ll \alpha$, nor any extra dependence on dimensions. These fine-grained characterizations allow tail-averaging and extrapolation techniques to reduce variance and bias, improving the mean-squared error bound to $O(\beta^4 + \frac{1}{t})$ for both iterates.
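On a toy linear problem, the constant-stepsize scheme and the variance-reducing effect of tail-averaging can be sketched as follows (the updates, i.i.d. noise model, and averaging window are illustrative choices, not the paper's Markovian-noise setting):

```python
import numpy as np

def two_timescale_sa(alpha, beta, a=2.0, T=20_000, seed=0):
    # Slow iterate x (stepsize alpha) tracks the unknown mean a from
    # noisy observations; fast iterate y (stepsize beta > alpha) tracks x.
    rng = np.random.default_rng(seed)
    x = y = 0.0
    xs, ys = [], []
    for _ in range(T):
        nx, ny = rng.normal(size=2)
        x += alpha * (a - x + nx)   # slow timescale
        y += beta * (x - y + ny)    # fast timescale
        xs.append(x)
        ys.append(y)
    # Tail-averaging (second half of the run) damps the O(alpha), O(beta)
    # stationary variance of the raw iterates.
    half = T // 2
    return np.mean(xs[half:]), np.mean(ys[half:])

x_bar, y_bar = two_timescale_sa(alpha=0.01, beta=0.1)
```

Both tail-averaged iterates land close to the target value $a = 2$ despite the constant stepsizes.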
Poster
Xiaoyan Hu · Ho-fung Leung · Farzan Farnia
[ Hall A-E ]
Abstract
Existing frameworks for evaluating and comparing generative models consider an offline setting, where the evaluator has access to large batches of data produced by the models. However, in practical scenarios, the goal is often to identify and select the best model using the fewest possible generated samples to minimize the costs of querying data from the sub-optimal models. In this work, we propose an online evaluation and selection framework to find the generative model that maximizes a standard assessment score among a group of available models. We view the task as a multi-armed bandit (MAB) and propose upper confidence bound (UCB) bandit algorithms to identify the model producing data with the best evaluation score that quantifies the quality and diversity of generated data. Specifically, we develop the MAB-based selection of generative models considering the Fréchet Distance (FD) and Inception Score (IS) metrics, resulting in the FD-UCB and IS-UCB algorithms. We prove regret bounds for these algorithms and present numerical results on standard image datasets. Our empirical results suggest the efficacy of MAB approaches for the sample-efficient evaluation and selection of deep generative models. The project code is available at [https://github.com/yannxiaoyanhu/dgm-online-eval](https://github.com/yannxiaoyanhu/dgm-online-eval).
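The MAB view can be sketched with a generic UCB loop over a scalar per-sample quality score; the score functions and exploration constant below are stand-ins for the FD/IS machinery, not the actual FD-UCB or IS-UCB algorithms:

```python
import numpy as np

def ucb_model_selection(models, T=2000, c=2.0, seed=0):
    # Each "model" is a callable returning one noisy per-sample quality
    # score (higher is better); UCB allocates queries toward the model
    # whose generated samples score best.
    rng = np.random.default_rng(seed)
    K = len(models)
    counts = np.zeros(K)
    sums = np.zeros(K)
    for t in range(T):
        if t < K:
            k = t  # query each model once
        else:
            ucb = sums / counts + np.sqrt(c * np.log(t + 1) / counts)
            k = int(np.argmax(ucb))
        sums[k] += models[k](rng)
        counts[k] += 1
    return counts

# Three hypothetical generators with mean sample scores 0.3, 0.5, 0.8.
models = [lambda rng, m=m: m + 0.1 * rng.normal() for m in (0.3, 0.5, 0.8)]
pulls = ucb_model_selection(models)
```

Most of the query budget is spent on the best-scoring model, which is the sample-efficiency property the abstract targets.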
Poster
Aldo Carranza · Susan Athey
[ Hall A-E ]
Abstract
We consider the problem of using observational bandit feedback data from multiple heterogeneous data sources to learn a personalized decision policy that robustly generalizes across diverse target settings. To achieve this, we propose a minimax regret optimization objective to ensure uniformly low regret under general mixtures of the source distributions. We develop a policy learning algorithm tailored to this objective, combining doubly robust offline policy evaluation techniques and no-regret learning algorithms for minimax optimization. Our regret analysis shows that this approach achieves the minimal worst-case mixture regret up to a moderated vanishing rate of the total data across all sources. Our analysis, extensions, and experimental results demonstrate the benefits of this approach for learning robust decision policies from multiple data sources.
Poster
Aya Kayal · Sattar Vakili · Laura Toni · Alberto Bernacchia
[ Hall A-E ]
Abstract
Reinforcement Learning (RL) problems are being considered under increasingly more complex structures. While tabular and linear models have been thoroughly explored, the analytical study of RL under non-linear function approximation, especially kernel-based models, has recently gained traction for their strong representational capacity and theoretical tractability. In this context, we examine the question of statistical efficiency in kernel-based RL within the reward-free RL framework, specifically asking: how many samples are required to design a near-optimal policy? Existing work addresses this question under restrictive assumptions about the class of kernel functions. We first explore this question assuming a generative model, then relax this assumption at the cost of increasing the sample complexity by a factor of $H$, the episode length. We tackle this fundamental problem using a broad class of kernels and a simpler algorithm compared to prior work. Our approach derives new confidence intervals for kernel ridge regression, specific to our RL setting, that may be of broader applicability. We further validate our theoretical findings through simulations.
Poster
Maryam Aliakbarpour · Syomantak Chaudhuri · Thomas Courtade · Alireza Fallah · Michael Jordan
[ Hall A-E ]
Abstract
Local Differential Privacy (LDP) offers strong privacy guarantees without requiring users to trust external parties. However, LDP applies uniform protection to all data features, including less sensitive ones, which degrades performance of downstream tasks. To overcome this limitation, we propose a Bayesian framework, Bayesian Coordinate Differential Privacy (BCDP), that enables feature-specific privacy quantification. This more nuanced approach complements LDP by adjusting privacy protection according to the sensitivity of each feature, enabling improved performance of downstream tasks without compromising privacy. We characterize the properties of BCDP and articulate its connections with standard non-Bayesian privacy frameworks. We further apply our BCDP framework to the problems of private mean estimation and ordinary least-squares regression. The BCDP-based approach obtains improved accuracy compared to a purely LDP-based approach, without compromising on privacy.
Poster
Paweł Teisseyre · Timo Martens · Jessa Bekker · Jesse Davis
[ Hall A-E ]
Abstract
Learning from positive and unlabeled data (PU learning) aims to train a binary classification model when only positive and unlabeled examples are available. Typically, learners assume that there is a labeling mechanism that determines which positive labels are observed. A particularly challenging setting arises when the observed positive labels are a biased sample from the positive distribution. Current approaches either require estimating the propensity scores, which are the instance-specific probabilities that a positive example's label will be observed, or make overly restrictive assumptions about the labeling mechanism. We make a novel assumption about the labeling mechanism which we show is more general than several commonly used existing ones. Moreover, the combination of our novel assumption and theoretical results from robust statistics can simplify the process of learning from biased PU data. Empirically, our approach offers superior predictive and run time performance compared to the state-of-the-art methods.
Poster
Chungpa Lee · Jeongheon Oh · Kibok Lee · Jy-yong Sohn
[ Hall A-E ]
Abstract
Supervised contrastive learning (SupCL) has emerged as a prominent approach in representation learning, leveraging both supervised and self-supervised losses. However, achieving an optimal balance between these losses is challenging; failing to do so can lead to class collapse, reducing discrimination among individual embeddings in the same class. In this paper, we present theoretically grounded guidelines for SupCL to prevent class collapse in learned representations. Specifically, we introduce the Simplex-to-Simplex Embedding Model (SSEM), a theoretical framework that models various embedding structures, including all embeddings that minimize the supervised contrastive loss. Through SSEM, we analyze how hyperparameters affect learned representations, offering practical guidelines for hyperparameter selection to mitigate the risk of class collapse. Our theoretical findings are supported by empirical results across synthetic and real-world datasets.
Poster
Sarah Alnegheimish · Zelin He · Matthew Reimherr · Akash Chandrayan · Abhinav Pradhan · Luca D'Angelo
[ Hall A-E ]
Abstract
With the widespread availability of sensor data across industrial and operational systems, we frequently encounter heterogeneous time series from multiple systems. Anomaly detection is crucial for such systems to facilitate predictive maintenance. However, most existing anomaly detection methods are designed for either univariate or single-system multivariate data, making them insufficient for these complex scenarios. To address this, we introduce M$^2$AD, a framework for unsupervised anomaly detection in multivariate time series data from multiple systems. M$^2$AD employs deep models to capture expected behavior under normal conditions, using the residuals as indicators of potential anomalies. These residuals are then aggregated into a global anomaly score through a Gaussian Mixture Model and Gamma calibration. We theoretically demonstrate that this framework can effectively address heterogeneity and dependencies across sensors and systems. Empirically, M$^2$AD outperforms existing methods in extensive evaluations by 21% on average, and its effectiveness is demonstrated on a large-scale real-world case study on 130 assets in Amazon Fulfillment Centers. Our code and results are available at https://github.com/sarahmish/M2AD.
Poster
Sharmila Duppala · Juan Luque · John Dickerson · Seyed Esmaeili
[ Hall A-E ]
Abstract
We study the canonical fair clustering problem where each cluster is constrained to have close to population-level representation of each group. Despite significant attention, the salient issue of having incomplete knowledge about the group membership of each point has been superficially addressed. In this paper, we consider a setting where the assigned group memberships are noisy. We introduce a simple noise model that requires a small number of parameters to be given by the decision maker. We then present an algorithm for fair clustering with provable \emph{robustness} guarantees. Our framework enables the decision maker to trade off between the robustness and the clustering quality. Unlike previous work, our algorithms are backed by worst-case theoretical guarantees. Finally, we empirically verify the performance of our algorithm on real world datasets and show its superior performance over existing baselines.
Poster
Rahil Morjaria · Saikiran Bulusu · Venkata Gandikota · Sidharth Jaggi
[ Hall A-E ]
Abstract
Group testing is the problem of identifying a small subset of defectives from a large set using as few binary tests as possible. In most current literature on group testing the binary test outcome is $1$ if the pool contains at least one defective, and $0$ otherwise. In this work we initiate the study of a generalized model of group testing that accommodates the physical effects of dilution of infected samples in large pools. In this model the binary test outcome is $1$ with probability $f(\rho)$, where $\rho$ is the density of the defectives in the test, and $f:[0,1]\rightarrow [0,1]$ is a given "test function" that models this dilution process. For a large class of test functions our results establish near-optimal sample complexity bounds, by providing information-theoretic lower bounds on the number of tests necessary to recover the set of defective items, and providing computationally efficient algorithms with sample complexities that match these lower bounds up to constant or logarithmic factors. Furthermore, using tools from real analysis, we extend our results to any "sufficiently well-behaved function" $f:[0,1]\rightarrow [0,1]$.
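The dilution model is easy to simulate; the particular test function $f(\rho)=\sqrt{\rho}$ below is an illustrative choice satisfying $f(0)=0$ (no false positives) and monotone growth in the defective density:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
defective = np.zeros(n, dtype=bool)
defective[[3, 42]] = True            # two defective items

f = lambda rho: np.sqrt(rho)         # illustrative dilution curve

def run_test(pool):
    # Binary outcome: 1 with probability f(rho), where rho is the
    # fraction of defectives in the pool.
    rho = defective[pool].mean()
    return rng.random() < f(rho)

# A pool with one defective out of ten fires with probability f(0.1) ~ 0.316.
pool = np.array([3, 0, 1, 2, 4, 5, 6, 7, 8, 9])
hit_rate = sum(run_test(pool) for _ in range(5000)) / 5000
clean_hits = sum(run_test(np.arange(10, 20)) for _ in range(100))
```

The classical noiseless model is recovered by the step function $f(\rho) = \mathbb{1}[\rho > 0]$.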
Poster
Zhiqun Zuo · Ding Zhu · Mahed Abroshan
[ Hall A-E ]
Abstract
This paper presents a post-processing algorithm for training fair neural network regression models that satisfy statistical parity, utilizing an explainable singular value decomposition (SVD) of the weight matrix. We propose a linear transformation of the weight matrix, whereby the singular values derived from the SVD of the transformed matrix directly correspond to the differences in the first and second moments of the output distributions across two groups. Consequently, we can convert the fairness constraints into constraints on the singular values. We analytically solve the problem of finding the optimal weights under these constraints. Experimental validation on various datasets demonstrates that our method achieves a similar or superior fairness-accuracy trade-off compared to the baselines, without using the sensitive attribute at inference time.
Poster
Shogo Iwazaki · Tomohiko Tanabe · Mitsuru Irie · Shion Takeno · Kota Matsui · Yu Inatsu
[ Hall A-E ]
Abstract
We study Bayesian optimization problems where observation of the objective function fails stochastically, e.g., synthesis failures in materials development. Although several heuristic methods have been proposed for this problem, they lack theoretical guarantees and sometimes deteriorate in practice. We propose two algorithms that trade off regret bounds against practical performance. The first is the first no-regret algorithm for this problem. The second shows superior practical performance; however, it requires a modification to obtain a no-regret guarantee, whose bound is slightly worse than that of the first algorithm. We demonstrate the effectiveness of our methods in numerical experiments, including a simulation function motivated by quasi-crystal synthesis.
Poster
Mengfan Xu · Diego Klabjan
[ Hall A-E ]
Abstract
The multi-armed bandit problem motivates methods with provable upper bounds on regret, and the corresponding lower bounds have also been extensively studied in this context. Recently, Multi-agent Multi-armed Bandit has gained significant traction in various domains, where individual clients face bandit problems in a distributed manner and the objective is the overall system performance, typically measured by regret. While efficient algorithms with regret upper bounds have emerged, limited attention has been given to the corresponding regret lower bounds, except for a recent lower bound for adversarial settings, which, however, leaves a gap with the best known upper bounds. To this end, we herein provide the first comprehensive study on regret lower bounds across different settings and establish their tightness. Specifically, when the graphs exhibit good connectivity properties and the rewards are stochastically distributed, we demonstrate a lower bound of order $O(\log T)$ for instance-dependent bounds and $\sqrt{T}$ for mean-gap independent bounds, both of which are tight. Assuming adversarial rewards, we establish a lower bound of $O(T^{\frac{2}{3}})$ for connected graphs, thereby bridging the gap between the lower and upper bounds in the prior work. We also show a linear regret lower bound when the graph is disconnected. These lower bounds are made possible through our newly constructed …
Poster
Da Long · Zhitong Xu · Qiwei Yuan · Yin Yang · Shandian Zhe
[ Hall A-E ]
Abstract
Fourier Neural Operator (FNO) is a powerful and popular operator learning method. However, FNO is mainly used for forward prediction, yet a great many applications rely on solving inverse problems. In this paper, we propose an invertible Fourier Neural Operator (iFNO) for jointly tackling the forward and inverse problems. We develop a series of invertible Fourier blocks in the latent channel space to share model parameters, exchange information, and mutually regularize the learning of the bi-directional tasks. We integrate a variational auto-encoder to capture the intrinsic structures within the input space and to enable posterior inference, so as to mitigate the ill-posedness, data shortage, and noise that are common in inverse problems. We propose a three-step process to combine the invertible blocks and the VAE component for effective training. Evaluations on seven benchmark forward and inverse tasks demonstrate the advantages of our approach. The code is available at https://github.com/BayesianAIGroup/iFNO.
Poster
Anup Rao · Peng Zhang
[ Hall A-E ]
Abstract
We investigate experimental design for randomized controlled trials (RCTs) with both equal and unequal treatment-control assignment probabilities. Our work makes progress on the connection between the distributional discrepancy minimization (DDM) problem introduced by Harshaw et al. (2024) and the design of RCTs. We make two main contributions: First, we prove that approximating the optimal solution of the DDM problem within a certain constant error is NP-hard. Second, we introduce a new Multiplicative Weights Update (MWU) algorithm for the DDM problem, which improves the Gram-Schmidt walk algorithm used by Harshaw et al. (2024) when assignment probabilities are unequal. Building on the framework of Harshaw et al. (2024) and our MWU algorithm, we then develop the MWU design, which reduces the worst-case mean-squared error in estimating the average treatment effect. Finally, we present a comprehensive simulation study comparing our design with commonly used designs.
Poster
Tobias Wegel · Filip Kovačević · Alexandru Tifrea · Fanny Yang
[ Hall A-E ]
Abstract
Simultaneously addressing multiple objectives is becoming increasingly important in modern machine learning. At the same time, data is often high-dimensional and costly to label. For a single objective such as prediction risk, conventional regularization techniques are known to improve generalization when the data exhibits low-dimensional structure like sparsity. However, it is largely unexplored how to leverage this structure in the context of multi-objective learning (MOL) with multiple competing objectives. In this work, we discuss how the application of vanilla regularization approaches can fail, and propose a two-stage MOL framework that can successfully leverage low-dimensional structure. We demonstrate its effectiveness experimentally for multi-distribution learning and fairness-risk trade-offs.
Poster
Vu Hoang · Hung Tran · Sunil Gupta · Vu Nguyen
[ Hall A-E ]
Abstract
Bayesian optimization (BO) is a leading method for optimizing expensive black-box functions and has been successfully applied across various scenarios. However, BO suffers from the curse of dimensionality, making it challenging to scale to high-dimensional problems. Existing work has adopted a variable selection strategy to select and optimize only a subset of variables iteratively. Although this approach can mitigate the high-dimensional challenge in BO, it still leads to sample inefficiency. To address this issue, we introduce a novel method that identifies important variables by estimating the length scales of Gaussian process kernels. Next, we construct an effective search region consisting of multiple subspaces and optimize the acquisition function within this region, focusing on only the important variables. We demonstrate that our proposed method achieves cumulative regret with a sublinear growth rate in the worst case while maintaining computational efficiency. Experiments on high-dimensional synthetic functions and real-world problems show that our method achieves state-of-the-art performance.
Poster
Yilin Xie · Shiqiang Zhang · Joel Paulson · Calvin Tsay
[ Hall A-E ]
Abstract
Bayesian optimization relies on iteratively constructing and optimizing an acquisition function. The latter turns out to be a challenging, non-convex optimization problem itself. Despite the relative importance of this step, most algorithms employ sampling- or gradient-based methods, which do not provably converge to global optima. This work investigates mixed-integer programming (MIP) as a paradigm for *global* acquisition function optimization. Specifically, our Piecewise-linear Kernel Mixed Integer Quadratic Programming (PK-MIQP) formulation introduces a piecewise-linear approximation for Gaussian process kernels and admits a corresponding MIQP representation for acquisition functions. The proposed method is applicable to uncertainty-based acquisition functions for any stationary or dot-product kernel. We analyze the theoretical regret bounds of the proposed approximation, and empirically demonstrate the framework on synthetic functions, constrained benchmarks, and a hyperparameter tuning task.
Poster
Kevin Luo · Yufan Li · Pragya Sur
[ Hall A-E ]
Abstract
Two key tasks in high-dimensional regularized regression are tuning the regularization strength for accurate predictions and estimating the out-of-sample risk. It is known that the standard approach — $k$-fold cross-validation — is inconsistent in modern high-dimensional settings. While leave-one-out and generalized cross-validation remain consistent in some high-dimensional cases, they become inconsistent when samples are dependent or contain heavy-tailed covariates. As a first step towards modeling structured sample dependence and heavy tails, we use right-rotationally invariant covariate distributions — a crucial concept from compressed sensing. In the proportional asymptotics regime where the number of features and samples grow comparably, which is known to better reflect the empirical behavior in moderately sized datasets, we introduce a new framework, ROTI-GCV, for reliably performing cross-validation under these challenging conditions. Along the way, we propose new estimators for the signal-to-noise ratio and noise variance. We conduct experiments that demonstrate the accuracy of our approach in a variety of synthetic and semi-synthetic settings.
Poster
Masahiro Kato · Shinji Ito
[ Hall A-E ]
Abstract
We investigate the \emph{linear contextual bandit problem} with independent and identically distributed (i.i.d.) contexts. In this problem, we aim to develop a \emph{Best-of-Both-Worlds} (BoBW) algorithm with regret upper bounds in both stochastic and adversarial regimes. We develop an algorithm based on \emph{Follow-The-Regularized-Leader} (FTRL) with Tsallis entropy, referred to as the $\alpha$-\emph{Linear-Contextual (LC)-Tsallis-INF}. We show that its regret is at most $O(\log(T))$ in the stochastic regime under the assumption that the suboptimality gap is uniformly bounded from below, and at most $O(\sqrt{T})$ in the adversarial regime. Furthermore, our regret analysis is extended to more general regimes characterized by the \emph{margin condition} with a parameter $\beta \in (1, \infty]$, which imposes a milder assumption on the suboptimality gap than in previous studies. We show that the proposed algorithm achieves $O\left(\log(T)^{\frac{1+\beta}{2+\beta}}T^{\frac{1}{2+\beta}}\right)$ regret under the margin condition.
Poster
Baozhen Wang · Xingye Qiao
[ Hall A-E ]
Abstract
In many real applications of statistical learning, collecting sufficiently many training data is often expensive, time-consuming, or even unrealistic. In this case, a transfer learning approach, which aims to leverage knowledge from a related source domain to improve the learning performance in the target domain, is more beneficial. There have been many transfer learning methods developed under various distributional assumptions. In this article, we study a particular type of classification problem, called conformal prediction, under a new distributional assumption for transfer learning. Classifiers under the conformal prediction framework predict a set of plausible labels instead of one single label for each data instance, affording a more cautious and safer decision. We consider a generalization of the covariate shift with posterior drift setting for transfer learning. Under this setting, we propose a weighted conformal classifier that leverages both the source and target samples, with a coverage guarantee in the target domain. Theoretical studies demonstrate favorable asymptotic properties. Numerical studies further illustrate the usefulness of the proposed method.
Poster
David Martínez-Rubio · Christophe Roux · Christopher Criscitiello · Sebastian Pokutta
[ Hall A-E ]
Abstract
In this work, we study optimization problems of the form $\min_x \max_y f(x, y)$, where $f(x, y)$ is defined on a product Riemannian manifold $\mathcal{M} \times \mathcal{N}$ and is $\mu_x$-strongly geodesically convex (g-convex) in $x$ and $\mu_y$-strongly g-concave in $y$, for $\mu_x, \mu_y \geq 0$. We design accelerated methods when $f$ is $(L_x, L_y, L_{xy})$-smooth and $\mathcal{M}$, $\mathcal{N}$ are Hadamard. To that aim we introduce new g-convex optimization results, of independent interest: we show global linear convergence for metric-projected Riemannian gradient descent and improve existing accelerated methods by reducing geometric constants. Additionally, we complete the analysis of two previous works applying to the Riemannian min-max case by removing an assumption about iterates staying in a pre-specified compact set.
Poster
Siyan Zhao · Daniel Israel · Guy Van den Broeck · Aditya Grover
[ Hall A-E ]
Abstract
During inference for transformer-based large language models (LLMs), prefilling is the computation of the key-value (KV) cache for input tokens in the prompt prior to autoregressive generation. For longer input prompt lengths, prefilling incurs a significant overhead on decoding time. In this work, we highlight the following pitfall of prefilling: for batches containing highly varying prompt lengths, significant computation is wasted by the standard practice of padding sequences to the maximum length. As LLMs increasingly support longer context lengths, potentially up to 10 million tokens, variations in prompt lengths within a batch become more pronounced. To address this, we propose prepacking, a simple yet effective method to optimize prefilling computation. To avoid redundant computation on pad tokens, prepacking combines prompts of varying lengths into a sequence and packs multiple sequences into a compact batch using a bin-packing algorithm. It then modifies the attention mask and positional encoding to compute multiple prefilled KV-caches for multiple prompts within a single sequence. On a standard curated dataset containing prompts of varying lengths, we obtain significant speed and memory efficiency improvements compared to the default padding-based prefilling computation within Huggingface, across a range of base model configurations and inference serving scenarios.
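The batching step can be sketched with first-fit-decreasing bin packing plus per-prompt position ids (the attention-mask modification is omitted; this is a simplified sketch, not the paper's or Huggingface's implementation):

```python
def prepack(prompt_lens, max_len):
    # First-fit-decreasing bin packing: place each prompt (longest first)
    # into the first packed sequence with enough remaining room.
    bins, loads = [], []
    for i in sorted(range(len(prompt_lens)), key=lambda i: -prompt_lens[i]):
        for b, load in enumerate(loads):
            if load + prompt_lens[i] <= max_len:
                bins[b].append(i)
                loads[b] += prompt_lens[i]
                break
        else:
            bins.append([i])
            loads.append(prompt_lens[i])
    return bins

def position_ids(bins, prompt_lens):
    # Positions restart at 0 for each prompt inside a packed sequence,
    # so every prompt sees its own positional encoding.
    return [[p for i in b for p in range(prompt_lens[i])] for b in bins]

lens = [7, 2, 5, 3, 1]
bins = prepack(lens, max_len=8)   # 3 packed sequences of at most 8 tokens,
pos = position_ids(bins, lens)    # versus 5 prompts each padded to length 7
```

On this toy batch, packing needs 24 token slots where padding to the maximum length would need 35.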
Poster
Daniel Csillag · Claudio Struchiner · Guilherme Goedert
[ Hall A-E ]
Abstract
When a machine learning model is deployed, its predictions can alter its environment, as better informed agents strategize to suit their own interests. With such alterations in mind, existing approaches to uncertainty quantification break. In this work we propose a new framework, Strategic Conformal Prediction, which is capable of robust uncertainty quantification in such a setting. Strategic Conformal Prediction is backed by a series of theoretical guarantees spanning marginal coverage, training-conditional coverage, tightness and robustness to misspecification, all of which hold in a distribution-free manner. Experimental analysis further validates our method, showing its remarkable effectiveness in the face of arbitrary strategic alterations, whereas other methods fail.
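For context, the standard split-conformal machinery that a strategic variant would build on looks as follows (this shows only the vanilla procedure on i.i.d. data, not the strategic correction):

```python
import numpy as np

def conformal_quantile(cal_scores, alpha):
    # Split conformal: the ceil((n+1)(1-alpha))/n empirical quantile of
    # the calibration nonconformity scores.
    n = len(cal_scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, q, method="higher")

rng = np.random.default_rng(0)
cal = np.abs(rng.normal(size=1000))      # residual-style scores
qhat = conformal_quantile(cal, alpha=0.1)
test = np.abs(rng.normal(size=5000))
coverage = float((test <= qhat).mean())  # close to 1 - alpha on i.i.d. data
```

The coverage guarantee above is exactly what breaks once agents strategically shift the test distribution, which is the failure mode the abstract addresses.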
Poster
Anica Kostic · Vincent Runge · Charles Truong
[ Hall A-E ]
Abstract
Time series analysis of non-Euclidean data is highly challenging and crucial for many real-world applications. We address the problem of detecting multiple changes in time series within these complex data spaces. Hadamard spaces, which encompass important data spaces like positive semidefinite matrices, certain Wasserstein spaces, and hyperbolic spaces, provide the right general framework to address this complexity. We propose a computationally efficient two-step iterative optimization algorithm called HOP (Hadamard Optimal Partitioning) that detects changes in the sequence of so-called Fréchet means. Under mild conditions, the proposed method consistently estimates the change point locations. HOP is highly versatile, accommodating structural assumptions such as cyclic patterns and epidemic settings, making it unique in the literature. We validate its performance in synthetic and real-world scenarios, including applications in human gait analysis using EMG data with low SNR and behavioral analysis of animal motion.
Poster
Anshul Thakur · Elena Gal · Soheila Molaei · Xiao Gu · Patrick Schwab · Danielle Belgrave · Kim Branson · David Clifton
[ Hall A-E ]
Abstract
This paper presents Adaptive Parameter Optimisation (APO), a novel framework for optimising shared models across multiple clinical tasks, addressing the challenges of balancing strict parameter sharing—often leading to task conflicts—and soft parameter sharing, which may limit effective cross-task information exchange. The proposed APO framework leverages insights from the lazy behaviour observed in over-parameterised neural networks, where only a small subset of parameters undergo any substantial updates during training. APO dynamically identifies and updates task-specific parameters while treating parameters associated with other tasks as protected, limiting their modification to prevent interference. The remaining unassigned parameters remain unchanged, embodying the lazy training phenomenon. This dynamic management of task-specific, protected, and unclaimed parameters across tasks enables effective information sharing, preserves task-specific adaptability, and mitigates gradient conflicts without enforcing a uniform representation. Experimental results across diverse healthcare datasets demonstrate that APO surpasses traditional information-sharing approaches, such as multi-task learning and model-agnostic meta-learning, in improving task performance.
Poster
Jake Fawkes · Michael O'Riordan · Athanasios Vlontzos · Oriol Corcoll · Ciarán Gilligan-Lee
[ Hall A-E ]
Abstract
Observational data is often readily available in large quantities, but can lead to biased causal effect estimates due to the presence of unobserved confounding. Recent works attempt to remove this bias by supplementing observational data with experimental data, which, when available, is typically on a smaller scale due to the time and cost involved in running a randomised controlled trial. In this work, we prove a theorem that places fundamental limits on this ``best of both worlds'' approach. Using the framework of impossible inference, we show that although it is possible to use experimental data to \emph{falsify} causal effect estimates from observational data, in general it is not possible to \emph{validate} such estimates. Our theorem proves that while experimental data can be used to detect bias in observational studies, without additional assumptions on the smoothness of the correction function, it can not be used to remove it. We provide a practical example of such an assumption, developing a novel Gaussian Process based approach to construct intervals which contain the true treatment effect with high probability, both inside and outside of the support of the experimental data. We demonstrate our methodology on both simulated and semi-synthetic datasets and make the \href{https://github.com/Jakefawkes/Obs_and_exp_data}{code …
Poster
Matthew Werenski · Brendan Mallery · Shuchin Aeron · James Murphy
[ Hall A-E ]
Abstract
We propose the linear barycentric coding model (LBCM) which utilizes the linear optimal transport (LOT) metric for analysis and synthesis of probability measures. We provide a closed-form solution to the variational problem characterizing the probability measures in the LBCM and establish equivalence of the LBCM to the set of 2-Wasserstein barycenters in the special case of compatible measures. Computational methods for synthesizing and analyzing measures in the LBCM are developed with finite sample guarantees. One of our main theoretical contributions is to identify an LBCM, expressed in terms of a simple family, which is sufficient to express all probability measures on the closed unit interval. We show that a natural analogous construction of an LBCM in 2 dimensions fails, and we leave it as an open problem to identify the proper extension in more than 1 dimension. We conclude by demonstrating the utility of LBCM for covariance estimation and data imputation.
Poster
Di Wu · Chengshuai Shi · Ruida Zhou · Cong Shen
[ Hall A-E ]
Abstract
Pure exploration is one of the fundamental problems in multi-armed bandits (MAB). However, existing works mostly focus on specific pure exploration tasks, without a holistic view of the general pure exploration problem. This work fills this gap by introducing a versatile framework to study pure exploration, with a focus on identifying the pairwise relationships between targeted arm pairs. Moreover, unlike existing works that only optimize the stopping time (i.e., sample complexity), this work considers that arms are associated with potentially different costs and aims to optimize the cumulative cost incurred during learning. Under the general framework of pairwise pure exploration with arm-specific costs, a performance lower bound is derived. Then, a novel algorithm, termed CAET (Cost-Aware Pairwise Exploration Task), is proposed. CAET builds on the track-and-stop principle with a novel design to handle the arm-specific costs, which can potentially be zero and thus represent a very challenging case. Theoretical analyses prove that the performance of CAET approaches the lower bound asymptotically. Special cases are further discussed, including an extension to regret minimization, which is another major focus of MAB. The effectiveness and efficiency of CAET are also verified through experimental results under various settings.
Poster
Taole Sha · Michael Zhang
[ Hall A-E ]
Abstract
Mixture-of-experts (MoE) models are popular methods in machine learning, since they can model heterogeneous behaviour across the space of the data using an ensemble collection of learners. These models are especially useful for modelling dynamic data, as time-dependent data often exhibit non-stationarity and heavy-tailed errors, which may be inappropriate to model with a typical single expert model. We propose a mixture of Student-$t$ processes with an adaptive structure for the covariance and noise behaviour of each mixture component. Moreover, we use a sequential Monte Carlo (SMC) sampler to perform online inference as data arrive in real time. We demonstrate the superiority of our proposed approach over other models on synthetic and real-world datasets, illustrating the necessity of the proposed method.
Poster
Moule Lin · Shuhao Guan · Weipeng Jing · Goetz Botterweck · Andrea Patane
[ Hall A-E ]
Abstract
While Bayesian Neural Networks (BNNs) offer a principled framework for uncertainty quantification in deep learning, their employment is still constrained by increased computational requirements and convergence difficulties when training very deep, state-of-the-art architectures. In this work, we reinterpret weight-sharing quantization techniques from a stochastic perspective in the context of training and inference with BNNs. Specifically, we leverage 2D-adaptive Gaussian distributions, Wasserstein distance estimations, and alpha-blending to encode the stochastic behavior of a BNN in a lower-dimensional, soft Gaussian representation. Through extensive empirical investigation, we demonstrate that our approach significantly reduces the computational overhead inherent in Bayesian learning by several orders of magnitude, enabling efficient Bayesian training of large-scale models, such as ResNet-101 and Vision Transformer (ViT). On various computer vision benchmarks—including CIFAR-10, CIFAR-100, and ImageNet1k—our approach compresses model parameters by approximately 50× and reduces model size by 75% while achieving accuracy and uncertainty estimates comparable to the state of the art.
Poster
Xianwen Deng · Yijun Wang · Zhi Xue
[ Hall A-E ]
Abstract
Source-free domain adaptation (SFDA) aims to adapt a source model, initially trained on a fully-labeled source domain, to an unlabeled target domain. Previous works assume that the statistics of Batch Normalization layers in the source model capture domain-specific knowledge and directly replace them with target domain-related statistics during training. However, our observations indicate that *source-like* samples in target data exhibit less deviation in the feature space of the source model when preserving the source domain-relevant statistics. In this paper, we propose co-training the source model with frozen Batch Normalization layers as part of the domain adaptation process. Specifically, we combine the source model and the target model to produce more robust pseudo-labels for *global* class clustering and to identify more precise neighbor samples for *local* neighbor clustering. Extensive experiments validate the effectiveness of our approach, showcasing its superiority over current state-of-the-art methods on three standard benchmarks. Our codes are available on https://github.com/SJTU-dxw/BN-SFDA.
Poster
Chengzhi Shi · Gözde Özcan · Miquel Sirera Perelló · Yuanyuan Li · Nina I. Shamsi · Stratis Ioannidis
[ Hall A-E ]
Abstract
We study pixel-wise regression problems with sparsely annotated images. Traditional regression methods based on mean squared error emphasize pixels with labels, leading to distorted predictions in unlabeled areas. To address this limitation, we introduce Neural Point Processes, a novel approach that combines 2D Gaussian Processes with neural networks to leverage spatial correlations between sparse labels on images. This approach offers two key advantages: it imposes smoothness constraints on the model output and enables conditional predictions when sparse labels are available at inference time. Empirical results on synthetic and real-world datasets demonstrate a substantial improvement in mean-squared error and $R^2$ scores, outperforming standard regression techniques. On the real-world dataset COWC, we achieve an $R^2$ of $0.769$ with $81$ out of $40,000$ ($0.2$%) points labeled, while standard regression loss (MSE) results in an $R^2$ of $0.060$.
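The conditioning mechanism the abstract describes can be illustrated with plain 2D Gaussian process regression on pixel coordinates. This is only a sketch of the GP half of the model (the actual method couples the GP with a neural network), and the locations, values, and lengthscale below are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    # Squared-exponential kernel on 2D pixel coordinates
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

# A few sparsely labeled pixel locations and their values (illustrative)
X_obs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y_obs = np.array([1.0, 2.0, 3.0])

# An unlabeled query pixel: prediction is conditioned on the sparse labels
X_query = np.array([[0.5, 0.5]])

K = rbf_kernel(X_obs, X_obs) + 1e-6 * np.eye(len(X_obs))  # jitter for stability
k_star = rbf_kernel(X_query, X_obs)
posterior_mean = k_star @ np.linalg.solve(K, y_obs)  # GP posterior mean
print(posterior_mean.shape)  # (1,)
```

The posterior mean interpolates smoothly between the sparse labels, which is the smoothness constraint the spatial correlation structure imposes on unlabeled regions.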
Poster
Seiyun Shin · Ilan Shomorony · Peter Macgregor
[ Hall A-E ]
Abstract
We propose a fast and dynamic algorithm for Density-Based Spatial Clustering of Applications with Noise (DBSCAN) that efficiently supports online updates. Traditional DBSCAN algorithms, designed for batch processing, become computationally expensive when applied to dynamic datasets, particularly in large-scale applications where data continuously evolves. To address this challenge, our algorithm leverages the Euler Tour Trees data structure, enabling dynamic clustering updates without the need to reprocess the entire dataset. This approach preserves a near-optimal accuracy in density estimation, as achieved by the state-of-the-art static DBSCAN method (Esfandiari et al., 2021). Our method achieves an improved time complexity of $O(d \log^3(n) + \log^4(n))$ for every data point insertion and deletion, where $n$ and $d$ denote the total number of updates and the data dimension, respectively. Empirical studies also demonstrate significant speedups over conventional DBSCAN in real-time clustering of dynamic datasets, while maintaining comparable or superior clustering quality.
Poster
Camille Castera · Peter Ochs
[ Hall A-E ]
Abstract
Towards designing learned optimization algorithms that are usable beyond their training setting, we identify key principles that classical algorithms obey but that have, up to now, not been used for Learning to Optimize (L2O). Following these principles, we provide a general design pipeline, taking into account data, architecture and learning strategy, and thereby enabling a synergy between classical optimization and L2O, resulting in a philosophy of Learning Optimization Algorithms. As a consequence, our learned algorithms perform well far beyond problems from the training distribution. We demonstrate the success of these novel principles by designing a new learning-enhanced BFGS algorithm and provide numerical experiments evidencing its adaptation to many settings at test time.
Poster
Chase Walker · Md Rubel Ahmed · Sumit Kumar Jha · Rickard Ewetz
[ Hall A-E ]
Abstract
Computer vision models can be explained by attributing the output decision to the input pixels. While effective methods for explaining convolutional neural networks have been proposed, these methods often produce low-quality attributions when applied to vision transformers (ViTs). State-of-the-art methods for explaining ViTs capture the flow of patch information using transition matrices. However, we observe that transition matrices alone are not sufficiently expressive to accurately explain ViT models. In this paper, we define a theoretical approach to creating explanations for ViTs called InFlow. The framework models the patch-to-patch information flow using a combination of transition matrices and patch embeddings. Moreover, we define an algebra for updating the transition matrices of series connected components, diverging paths, and converging paths in the ViT model. This algebra allows the InFlow framework to produce high quality attributions which explain ViT decision making. In experimental evaluation on ImageNet, with three models, InFlow outperforms six ViT attribution methods in the standard insertion, deletion, SIC and AIC metrics by up to 18%. Qualitative results demonstrate InFlow produces more relevant and sharper explanations. Code is publicly available at https://github.com/chasewalker26/InFlow-ViT-Explanation.
Poster
Adam Elmachtoub · Henry Lam · Haixiang Lan · Haofeng Zhang
[ Hall A-E ]
Abstract
Data-driven optimization aims to translate a machine learning model into decision-making by optimizing decisions on estimated costs. Such a pipeline can be conducted by fitting a distributional model which is then plugged into the target optimization problem. While this fitting can utilize traditional methods such as maximum likelihood, a more recent approach uses estimation-optimization integration that minimizes decision error instead of estimation error. Although intuitive, the statistical benefit of the latter approach is not well understood, yet it is important to guide the prescriptive usage of machine learning. In this paper, we dissect the performance comparisons between these approaches in terms of the amount of model misspecification. In particular, we show how the integrated approach offers a ``universal double benefit'' on the top two dominating terms of regret when the underlying model is misspecified, while the traditional approach can be advantageous when the model is nearly well-specified. Our comparison is powered by finite-sample tail regret bounds that are derived via new higher-order expansions of regrets and the leveraging of a recent Berry-Esseen theorem.
Poster
Antonio Gois · Mehrnaz Mofakhami · Fernando Santos · Simon Lacoste-Julien · Gauthier Gidel
[ Hall A-E ]
Abstract
Agents often have individual goals which depend on a group's actions. If agents trust a forecast of collective action and adapt strategically, such prediction can influence outcomes non-trivially, resulting in a form of performative prediction. This effect is ubiquitous in scenarios ranging from pandemic predictions to election polls, but existing work has ignored interdependencies among predicted agents. As a first step in this direction, we study a collective risk dilemma where agents dynamically decide whether to trust predictions based on past accuracy. As predictions shape collective outcomes, social welfare arises naturally as a metric of concern. We explore the resulting interplay between accuracy and welfare, and demonstrate that searching for stable accurate predictions can minimize social welfare with high probability in our setting. By assuming knowledge of a Bayesian agent behavior model, we then show how to achieve better trade-offs and use them for mechanism design.
Poster
Zixin Kuang · Meng-Fen Chiang · Wang-Chien Lee
[ Hall A-E ]
Abstract
Rationale learning aims to automatically uncover the underlying explanations for NLP predictions. Previous studies in rationale learning mainly focus on the relevance of independent tokens with the predictions without considering their marginal contribution and the collective readability of extracted rationales. Through an empirical analysis, we argue that the sufficiency, informativeness, and readability of rationales are essential for explaining diverse end-task predictions. Accordingly, we propose Shapley-value Guided Rationale Editor (SHARE), an unsupervised approach that refines editable rationales while predicting task outcomes. SHARE extracts a sequence of tokens as a rationale, providing a collective explanation that is sufficient, informative, and readable. SHARE is highly adaptable for tasks like sentiment analysis, claim verification, and question answering, and can integrate seamlessly with various language models to provide explainability. Extensive experiments demonstrate its effectiveness in balancing sufficiency, informativeness, and readability across diverse applications. Our code and datasets are available at https://github.com/zixinK/SHARE.
Poster
Kun Zhao · Jiayi Wang · Yifei Lou
[ Hall A-E ]
Abstract
This paper focuses on recovering an underlying matrix from its noisy partial entries, a problem commonly known as matrix completion. We delve into the investigation of a non-convex regularization, referred to as transformed $L_1$ (TL1), which interpolates between the rank and the nuclear norm of matrices through a hyper-parameter $a \in (0, \infty)$. While some literature adopts such regularization for matrix completion, it primarily addresses scenarios with uniformly missing entries and focuses on algorithmic advances. To fill in the gap in the current literature, we provide a comprehensive statistical analysis for the estimator from a TL1-regularized recovery model under general sampling distribution. In particular, we show that when $a$ is sufficiently large, the matrix recovered by the TL1-based model enjoys a convergence rate measured by the Frobenius norm, comparable to that of the model based on the nuclear norm, despite the challenges posed by the non-convexity of the TL1 regularization. When $a$ is small enough, we show that the rank of the estimated matrix remains a constant order when the true matrix is exactly low-rank. A trade-off between controlling the error and the rank is established through different choices of tuning parameters. The appealing practical performance of TL1 regularization is …
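The interpolation behavior of the TL1 penalty is easy to check numerically. Applied to a singular value $\sigma \ge 0$, a common form of the penalty is $\rho_a(\sigma) = (a+1)\sigma/(a+\sigma)$; the exact normalization used in the paper may differ, so treat this as an illustrative sketch.

```python
def tl1(sigma, a):
    """Transformed L1 penalty on a singular value sigma >= 0."""
    return (a + 1.0) * sigma / (a + sigma)

# Large a: the penalty approaches sigma itself, mimicking the nuclear norm
print(round(tl1(2.0, a=1e6), 4))   # ~2.0
# Small a: the penalty approaches 1 for any sigma > 0, mimicking the rank
print(round(tl1(2.0, a=1e-8), 4))  # ~1.0
# Zero singular values are never penalized
print(tl1(0.0, a=0.5))             # 0.0
```

This is the interpolation through the hyper-parameter $a$ that the abstract refers to: large $a$ recovers nuclear-norm-like behavior, while small $a$ counts nonzero singular values like the rank.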
Poster
Gen Li · Zhihan Huang · Yuting Wei
[ Hall A-E ]
Abstract
Consistency models, which were proposed to mitigate the high computational overhead during the sampling phase of diffusion models, facilitate single-step sampling while attaining state-of-the-art empirical performance. When integrated into the training phase, consistency models attempt to train a sequence of consistency functions capable of mapping any point at any time step of the diffusion process to its starting point. Despite the empirical success, a comprehensive theoretical understanding of consistency training remains elusive. This paper takes a first step towards establishing theoretical underpinnings for consistency models. We demonstrate that, in order to generate samples within $\varepsilon$ proximity to the target in distribution (measured by some Wasserstein metric), it suffices for the number of steps in consistency learning to exceed the order of $d^{5/2}/\varepsilon$, with $d$ the data dimension. Our theory offers rigorous insights into the validity and efficacy of consistency models, illuminating their utility in downstream inference tasks.
Poster
Cheng Peng · Stan Uryasev
[ Hall A-E ]
Abstract
This paper proposes a new approach to estimating the distribution of a response variable conditioned on factors. We model the conditional quantile function as a mixture (weighted sum) of basis quantile functions, with weights depending on these factors. The estimation problem is formulated as a convex optimization problem. The objective function is equivalent to the continuous ranked probability score (CRPS). This approach can be viewed as conducting quantile regressions for all confidence levels simultaneously while inherently avoiding quantile crossing. We use spline functions of factors as a primary example for the weight function. We prove an approximation property of the model. To address computational challenges, we propose a dimensionality reduction method using tensor decomposition and an alternating algorithm. Our approach offers flexibility, interpretability, tractability, and extendability. Numerical experiments demonstrate its effectiveness.
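The equivalence between the CRPS and quantile regression at all confidence levels can be checked numerically via the identity $\mathrm{CRPS}(Q, y) = 2\int_0^1 \rho_\tau\bigl(y - Q(\tau)\bigr)\,d\tau$, where $\rho_\tau$ is the pinball loss. The code below is an illustrative sketch of that identity, not the paper's estimator; `crps_from_quantiles` is a hypothetical name.

```python
import numpy as np

def pinball(u, tau):
    # Pinball (quantile) loss at confidence level tau
    return u * (tau - (u < 0))

def crps_from_quantiles(Q, y, n_grid=1000):
    # Approximate CRPS as twice the average pinball loss over levels,
    # discretizing the integral on a midpoint grid of quantile levels
    taus = (np.arange(n_grid) + 0.5) / n_grid
    return 2.0 * np.mean(pinball(y - Q(taus), taus))

# Degenerate forecast Q(tau) = 1: CRPS reduces to |y - 1|
print(round(crps_from_quantiles(lambda t: np.ones_like(t), y=3.0), 4))  # 2.0
# Uniform(0,1) forecast against y = 0: CRPS = 1/3
print(round(crps_from_quantiles(lambda t: t, y=0.0), 4))  # ~0.3333
```

Minimizing this objective over a weighted sum of basis quantile functions, with nonnegative weights, also keeps the fitted quantile function monotone in $\tau$, which is why quantile crossing is avoided by construction.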