Poster
Yarden Cohen · Alexandre Wu Navarro · Jes Frellsen · Richard Turner · Raziel Riemer · Ari Pakman
[ Hall A-E ]
Abstract
The need for regression models to predict circular values arises in many scientific fields. In this work we explore a family of expressive and interpretable distributions over circle-valued random functions related to Gaussian processes targeting two Euclidean dimensions conditioned on the unit circle. The probability model has connections with continuous spin models in statistical physics. Moreover, its density is very simple and has maximum entropy, unlike previous Gaussian process-based approaches, which use wrapping or radial marginalization. For posterior inference, we introduce a new Stratonovich-like augmentation that lends itself to fast Gibbs sampling. We argue that transductive learning in these models favors a Bayesian approach to the parameters and apply our sampling scheme to the Double Metropolis-Hastings algorithm. We present experiments applying this model to the prediction of (i) wind directions and (ii) the percentage of the running gait cycle as a function of joint angles.
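A minimal sketch of the general idea of circle-valued random functions derived from a two-dimensional Gaussian process (illustration only; the paper's model conditions the two Euclidean dimensions on the unit circle rather than projecting samples onto it):

```python
# Illustrative sketch only: draw two independent GP components over an input grid
# and map each 2D value to an angle, giving a circle-valued sample path.
import numpy as np

def rbf_kernel(x, lengthscale=0.3):
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
K = rbf_kernel(x) + 1e-8 * np.eye(len(x))
L = np.linalg.cholesky(K)

u = L @ rng.standard_normal(len(x))   # first Euclidean GP component
v = L @ rng.standard_normal(len(x))   # second Euclidean GP component
theta = np.arctan2(v, u)              # circle-valued function values in (-pi, pi]
print(theta[:5])
```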
Poster
Tong Yang · Jincheng Mei · Hanjun Dai · Zixin Wen · Shicong Cen · Dale Schuurmans · Yuejie Chi · Bo Dai
[ Hall A-E ]
Abstract
Recent advances in aligning large language models with human preferences have corroborated the growing importance of best-of-$N$ distillation (BOND). However, the iterative BOND algorithm is prohibitively expensive in practice due to its sample and computational inefficiency. This paper addresses the problem by revealing a unified game-theoretic connection between iterative BOND and self-play alignment, which unifies seemingly disparate algorithmic paradigms. Based on this connection, we establish a novel framework, \textbf{WIN} rate \textbf{D}ominance~(WIND), with a series of efficient algorithms for regularized win rate dominance optimization that approximates iterative BOND in the parameter space. We provide a provable sample efficiency guarantee for one of the WIND variants with the square loss objective. The experimental results confirm that our algorithm not only accelerates the computation, but also achieves superior sample efficiency compared to existing methods.
Poster
Emanuele Troiani · Yatin Dandi · Leonardo Defilippis · Lenka Zdeborova · Bruno Loureiro · Florent Krzakala
[ Hall A-E ]
Abstract
Multi-index models - functions which only depend on the covariates through a non-linear transformation of their projection on a subspace - are a useful benchmark for investigating feature learning with neural networks. This paper examines the theoretical boundaries of efficient learnability in this hypothesis class, focusing particularly on the minimum sample complexity required for weakly recovering their low-dimensional structure with first-order iterative algorithms, in the high-dimensional regime where the number of samples $n=\alpha d$ is proportional to the covariate dimension $d$. Our findings unfold in three parts: (i) first, we identify under which conditions a \textit{trivial subspace} can be learned with a single step of a first-order algorithm for any $\alpha>0$; (ii) second, in the case where the trivial subspace is empty, we provide necessary and sufficient conditions for the existence of an {\it easy subspace} consisting of directions that can be learned only above a certain sample complexity $\alpha>\alpha_c$. The critical threshold $\alpha_{c}$ marks the presence of a computational phase transition, in the sense that it is conjectured that no efficient iterative algorithm can succeed for $\alpha<\alpha_c$. In a limited but interesting set of really hard directions - akin to the parity problem - $\alpha_c$ is found …
Poster
Alex Kulesza · Ananda Theertha Suresh · Yuyan Wang
[ Hall A-E ]
Abstract
We derive the optimal differentially private additive noise mechanism for queries in $\mathbb{R}^d$ when sensitivity and error are defined by an arbitrary norm $||\cdot||_K$. The optimal mechanism is a generalization of the staircase mechanism, which is known to be optimal under the $\ell_1$ norm when $d \leq 2$; we extend the mechanism and its guarantee to arbitrary norms and dimensions, proving a conjecture of Geng et al. [2015] along the way. The generalized staircase mechanism we derive can be viewed as a refinement of the $K$-norm mechanism of Hardt and Talwar [2010], with improvements particularly evident in the low-privacy regime as $\epsilon \to \infty$. We show how to implement the generalized staircase mechanism efficiently, given an efficient algorithm for sampling the unit $K$-norm ball, and demonstrate that it significantly reduces error in realistic settings, including under non-standard norms that are common in practice, and across a range of error metrics.
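For context, the following sketch implements the classical $K$-norm mechanism of Hardt and Talwar that the generalized staircase mechanism refines, in the special case where $K$ is the Euclidean unit ball so that sampling is straightforward; the staircase refinement itself is not shown.

```python
# Sketch of the classical K-norm mechanism in the l2 case: noise density
# proportional to exp(-eps * ||z||_2 / Delta), sampled as a uniform direction
# times a Gamma-distributed radius.
import numpy as np

def l2_knorm_noise(d, eps, sensitivity=1.0, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    direction = rng.standard_normal(d)
    direction /= np.linalg.norm(direction)
    # radial density r^{d-1} exp(-eps * r / Delta)  ->  Gamma(shape=d, scale=Delta/eps)
    radius = rng.gamma(shape=d, scale=sensitivity / eps)
    return radius * direction

query_answer = np.array([3.0, -1.0, 0.5])
private_answer = query_answer + l2_knorm_noise(d=3, eps=1.0)
print(private_answer)
```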
Poster
Yang Yang · Kai Zhang · Ping-Shou Zhong
[ Hall A-E ]
Abstract
This paper focuses on testing conditional independence between two random variables ($X$ and $Y$) given a set of high-dimensional confounding variables ($Z$). The high dimensionality of these confounding variables presents a challenge, often resulting in inflated type-I errors or insufficient power in many existing tests. To address this issue, we leverage the power of Deep Neural Networks (DNNs) to handle complex, high-dimensional data while mitigating the curse of dimensionality. We propose a novel test procedure, DeepBET. First, a DNN is used on part of the data to estimate the conditional means of $X$ and $Y$ given $Z$. Then, binary expansion testing (BET) is applied to the prediction errors from the remaining data. Additionally, we implement a multiple-split procedure to further enhance the power of the test. DeepBET is computationally efficient and robust to the tuning parameters in DNNs. Interestingly, the DeepBET statistic converges at a root-$n$ rate despite the nonparametric and high-dimensional nature of the confounding effects. Our numerical results demonstrate that the proposed method controls type-I error under various scenarios and enhances both power and interpretability for conditional dependence when present, making it a robust alternative for testing conditional independence in high-dimensional settings. When applied to dry eye disease …
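A rough illustration of the split-and-test idea (not the full DeepBET procedure): regress $X$ and $Y$ on $Z$ with a small network on one half of the data, then apply a crude depth-one sign test, a simplistic stand-in for BET, to the residuals on the other half.

```python
# Rough illustration only: sample splitting, neural regression of X and Y on Z,
# then a depth-one sign-based independence test on the residuals.
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n, p = 2000, 20
Z = rng.standard_normal((n, p))
X = Z[:, 0] + 0.5 * rng.standard_normal(n)
Y = Z[:, 0] - Z[:, 1] + 0.5 * rng.standard_normal(n)   # X independent of Y given Z

fit, test = slice(0, n // 2), slice(n // 2, n)
mx = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500, random_state=0).fit(Z[fit], X[fit])
my = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500, random_state=0).fit(Z[fit], Y[fit])
rx = X[test] - mx.predict(Z[test])
ry = Y[test] - my.predict(Z[test])

# 2x2 contingency table of residual signs: a crude stand-in for binary expansion testing
table = np.histogram2d(rx > np.median(rx), ry > np.median(ry), bins=2)[0]
print("p-value:", chi2_contingency(table)[1])
```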
Poster
Ferdinando Fioretto · Diptangshu Sen · Juba Ziani
[ Hall A-E ]
Abstract
Networks in sectors like telecommunications and transportation often contain sensitive user data, requiring privacy-enhancing technologies during data release to ensure privacy. While Differential Privacy (DP) is recognized as the leading standard for privacy preservation, its use comes with new challenges, as the noise added for privacy introduces inaccuracies or biases. DP techniques have also been found to distribute these biases disproportionately across different populations, inducing fairness issues. This paper investigates the effects of DP on bias and fairness when releasing network edge weights. We specifically examine how these privacy measures affect decision-making tasks, such as computing shortest paths, which are crucial for routing in transportation and communications networks, and provide both theoretical insights and empirical evidence on the inherent trade-offs between privacy, accuracy, and fairness for network data release.
Poster
Yucheng Liu · Xiaodong Li
[ Hall A-E ]
Abstract
We investigate how to select the number of communities for weighted networks without full likelihood modeling. First, we propose a novel weighted degree-corrected stochastic block model (DCSBM), where the mean adjacency matrix is modeled in the same way as in the standard DCSBM, while the variance profile matrix is assumed to be related to the mean adjacency matrix through a given variance function. Our method of selecting the number of communities is based on a sequential testing framework. In each step, the weighted DCSBM is fitted via some spectral clustering method. A key component of our method is matrix scaling on the estimated variance profile matrix. The resulting scaling factors can be used to normalize the adjacency matrix, from which the test statistic is then obtained. Under mild conditions on the weighted DCSBM, our proposed procedure is shown to be consistent in estimating the true number of communities. Numerical experiments on both simulated and real-world network data demonstrate the desirable empirical properties of our method.
Poster
Sangil Han · Kyoowon Kim · Sungkyu Jung
[ Hall A-E ]
Abstract
In this paper, we explore the theoretical properties of subspace recovery using Winsorized Principal Component Analysis (WPCA), utilizing a common data transformation technique that caps extreme values to mitigate the impact of outliers. Despite the widespread use of winsorization in various tasks of multivariate analysis, its theoretical properties, particularly for subspace recovery, have received limited attention. We provide a detailed analysis of the accuracy of WPCA, showing that increasing the number of samples while decreasing the proportion of outliers guarantees the consistency of the sample subspaces from WPCA with respect to the true population subspace. Furthermore, we establish perturbation bounds that ensure the WPCA subspace obtained from contaminated data remains close to the subspace recovered from pure data. Additionally, we extend the classical notion of breakdown points to subspace-valued statistics and derive lower bounds for the breakdown points of WPCA. Our analysis demonstrates that WPCA exhibits strong robustness to outliers while maintaining consistency under mild assumptions. A toy example is provided to numerically illustrate the behavior of the upper bounds for perturbation bounds and breakdown points, emphasizing winsorization's utility in subspace recovery.
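A minimal sketch of the winsorize-then-PCA pipeline, here with coordinate-wise quantile capping as one common winsorization variant (the paper's exact scheme may differ):

```python
# Minimal sketch of Winsorized PCA: cap extreme values coordinate-wise at given
# quantiles, then run standard PCA on the transformed data.
import numpy as np

def winsorized_pca(X, lower=0.05, upper=0.95, n_components=2):
    lo = np.quantile(X, lower, axis=0)
    hi = np.quantile(X, upper, axis=0)
    Xw = np.clip(X, lo, hi)                      # winsorization step
    Xc = Xw - Xw.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:n_components]                     # estimated principal subspace

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
X[:10] *= 50.0                                   # a few gross outliers
print(winsorized_pca(X).shape)
```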
Poster
Mathieu Besançon · Sebastian Pokutta · Elias Wirth
[ Hall A-E ]
Abstract
We propose the pivoting meta algorithm (PM) to enhance optimization algorithms that generate iterates as convex combinations of vertices of a feasible region $C\subseteq \mathbb{R}^n$, including Frank-Wolfe (FW) variants. PM guarantees that the active set (the set of vertices in the convex combination) of the modified algorithm remains as small as $\dim(C)+1$, as stipulated by Carathéodory's theorem. PM achieves this by reformulating the active set expansion task into an equivalent linear program, which can be efficiently solved using a single pivot step akin to the primal simplex algorithm; the convergence rates of the original algorithms are maintained. Furthermore, we establish the connection between PM and active set identification, in particular showing under mild assumptions that PM applied to the away-step Frank-Wolfe algorithm or the blended pairwise Frank-Wolfe algorithm bounds the active set size by the dimension of the optimal face plus $1$. We provide numerical experiments to illustrate practicality and efficacy on active set size reduction.
Poster
Haolin Zou · Arnab Auddy · Kamiar Rahnama Rad · Arian Maleki
[ Hall A-E ]
Abstract
Despite a large and significant body of recent work focusing on the hyperparameter tuning of regularized models in the high dimensional regime, a theoretical understanding of this problem for non-differentiable penalties such as generalized LASSO and nuclear norm is missing. In this paper we resolve this challenge. We study the hyperparameter tuning problem in the proportional high dimensional regime where both the sample size $n$ and number of features $p$ are large, and $n/p$ and the signal-to-noise ratio (per observation) remain finite. To achieve this goal, we first provide finite-sample upper bounds on the expected squared error of leave-one-out cross-validation (LO) in estimating the out-of-sample risk. Building on this result, we establish the consistency of the hyperparameter tuning method that is based on minimizing LO's estimate. Our simulation results confirm the accuracy and sharpness of our theoretical results.
Poster
Arkapal Panda · Utpal Garain
[ Hall A-E ]
Abstract
A key challenge in calibrating Multi-Label Classification (MLC) problems is to consider the interdependencies among labels. To address this, in this research we propose an unbiased, differentiable, trainable calibration error estimator for MLC problems by using Copula. Unlike other methods for calibrating MLC tasks that focus on marginal calibration, this novel estimator takes label interdependencies into account and enables us to tackle the strictest notion of calibration, that is, canonical calibration. To design the estimator, we begin by leveraging the kernel trick to construct a continuous distribution from the discrete label space. Then we take a semiparametric approach to construct the estimator, where the marginals are modeled non-parametrically and the Copula is modeled parametrically. Theoretically, we show that our estimator is unbiased and converges to the true $L^p$ calibration error. We also use our estimator as a regularizer at the time of training and observe that it reduces calibration error on test datasets significantly. Experiments on a well-established dataset endorse our claims.
Poster
Chanwoo Chun · SueYeon Chung · Daniel Lee
[ Hall A-E ]
Abstract
Analyzing the structure of sampled features from an input data distribution is challenging when constrained by limited measurements in both the number of inputs and features. Traditional approaches often rely on the eigenvalue spectrum of the sample covariance matrix derived from finite measurement matrices; however, these spectra are sensitive to the size of the measurement matrix, leading to biased insights. In this paper, we introduce a novel algorithm that provides unbiased estimates of the spectral moments of the kernel integral operator in the limit of infinite inputs and features from finitely sampled measurement matrices. Our method, based on dynamic programming, is efficient and capable of estimating the moments of the operator spectrum. We demonstrate the accuracy of our estimator on radial basis function (RBF) kernels, highlighting its consistency with the theoretical spectra. Furthermore, we showcase the practical utility and robustness of our method in understanding the geometry of learned representations in neural networks.
Poster
Ayush Bharti · Daolang Huang · Samuel Kaski · Francois-Xavier Briol
[ Hall A-E ]
Abstract
Simulation-based inference (SBI) is the preferred framework for estimating parameters of intractable models in science and engineering. A significant challenge in this context is the large computational cost of simulating data from complex models, and the fact that this cost often depends on parameter values. We therefore propose *cost-aware SBI methods* which can significantly reduce the cost of existing sampling-based SBI methods, such as neural SBI and approximate Bayesian computation. This is achieved through a combination of rejection and self-normalised importance sampling, which significantly reduces the number of expensive simulations needed. Our approach is studied extensively on models from epidemiology to telecommunications engineering, where we obtain significant reductions in the overall cost of inference.
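The self-normalised importance sampling ingredient can be sketched as follows; this toy example only illustrates the reweighting step, not the cost-aware rejection scheme.

```python
# Sketch of self-normalised importance sampling: estimate an expectation under an
# unnormalised target while drawing from a cheap proposal q, reweighting the draws.
import numpy as np

rng = np.random.default_rng(0)

def unnormalised_target(theta):          # e.g. prior times approximate likelihood
    return np.exp(-0.5 * (theta - 2.0) ** 2)

def proposal_sample(n):                  # cheap-to-simulate proposal
    return rng.normal(0.0, 3.0, size=n)

def proposal_density(theta):
    return np.exp(-0.5 * (theta / 3.0) ** 2) / (3.0 * np.sqrt(2 * np.pi))

theta = proposal_sample(10_000)
w = unnormalised_target(theta) / proposal_density(theta)
w /= w.sum()                             # self-normalisation of the weights
posterior_mean = np.sum(w * theta)
print(posterior_mean)                    # close to 2.0
```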
Poster
Elisabeth Griesbauer · Claudia Czado · Arnoldo Frigessi · Ingrid Haff
[ Hall A-E ]
Abstract
We propose TVineSynth, a vine copula based synthetic tabular data generator, which is designed to balance privacy and utility, using the vine tree structure and its truncation to control the trade-off. Contrary to synthetic data generators that achieve differential privacy (DP) by globally adding noise, TVineSynth performs a controlled approximation of the estimated data generating distribution, so that it does not suffer from poor utility of the resulting synthetic data for downstream prediction tasks. TVineSynth introduces a targeted bias into the vine copula model that, combined with the specific tree structure of the vine, causes the model to zero out privacy-leaking dependencies while relying on those that are beneficial for utility. Privacy is here measured with membership (MIA) and attribute inference attacks (AIA). Further, we theoretically justify how the construction of TVineSynth ensures AIA privacy under a natural privacy measure for continuous sensitive attributes. When compared to competitor models, with and without DP, on simulated and on real-world data, TVineSynth achieves a superior privacy-utility balance.
Poster
Cyrille Kone · Emilie Kaufmann · Laura Richert
[ Hall A-E ]
Abstract
We study the Pareto Set Identification (PSI) problem in a structured multi-output linear bandit model. In this setting each arm is associated with a feature vector belonging to $\mathbb{R}^h$ and its mean vector in $\mathbb{R}^d$ linearly depends on this feature vector through a common unknown matrix $\Theta \in \mathbb{R}^{h \times d}$. The goal is to identify the set of non-dominated arms by adaptively collecting samples from the arms. We introduce and analyze the first optimal design-based algorithms for PSI, providing nearly optimal guarantees in both the fixed-budget and the fixed-confidence settings. Notably, we show that the difficulty of these tasks mainly depends on the sub-optimality gaps of $h$ arms only. Our theoretical results are supported by an extensive benchmark on synthetic and real-world datasets.
Poster
Margarida Campos · João Cálem · Sophia Sklaviadis · Mario Figueiredo · Andre Martins
[ Hall A-E ]
Abstract
Conformal prediction is a distribution-free framework for uncertainty quantification that replaces point predictions with sets, offering marginal coverage guarantees (i.e., ensuring that the sets contain the true label with a specified probability, in expectation). In this paper, we uncover a novel connection between conformal prediction and sparse "softmax-like" transformations, such as sparsemax and $\gamma$-entmax (with $\gamma> 1$), which assign nonzero probability only to some labels. We introduce new non-conformity scores for classification which make the calibration process correspond to the widely used temperature scaling method. At test time, applying these sparse transformations with the calibrated temperature leads to a support set (i.e., the set of labels with nonzero probability) that automatically inherits the coverage guarantees of conformal prediction. Through experiments on computer vision and text classification benchmarks, we demonstrate that the proposed method achieves competitive results in terms of coverage, efficiency, and adaptiveness compared to standard non-conformity scores based on softmax.
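A sketch of the test-time rule described above: apply sparsemax at a calibrated temperature and return its support as the prediction set. The temperature value below is a placeholder; in the method it comes from conformal calibration of the proposed non-conformity scores.

```python
# Sparsemax support set at a calibrated temperature (temperature value assumed here).
import numpy as np

def sparsemax(z):
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    cssv = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cssv
    k_max = k[support][-1]
    tau = (cssv[k_max - 1] - 1.0) / k_max
    return np.maximum(z - tau, 0.0)

logits = np.array([2.1, 1.9, 0.3, -1.0])
temperature = 1.7                                  # placeholder for the calibrated value
probs = sparsemax(logits / temperature)
prediction_set = np.nonzero(probs > 0)[0]          # labels with nonzero probability
print(probs, prediction_set)
```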
Poster
Elizabeth Baker · Moritz Schauer · Stefan Sommer
[ Hall A-E ]
Abstract
We propose a new algorithm for learning a bridged diffusion process using score-matching methods. Our method relies on reversing the dynamics of the forward process and using this to learn a score function, which, via Doob's $h$-transform, gives us a bridged diffusion process; that is, a process conditioned on an endpoint. In contrast to prior methods, ours learns the score term $\nabla_x \log p(t, x; T, y)$, for given $t, y$ directly, completely avoiding the need for first learning a time-reversal. We compare the performance of our algorithm with existing methods and see that it outperforms using the (learned) time-reversals to learn the score term. The code can be found at https://github.com/libbylbaker/forward_bridge.
Poster
Jeroen Berrevoets · Jakob Raymaekers · Mihaela van der Schaar · Tim Verdonck · Ruicong Yao
[ Hall A-E ]
Abstract
The introduction of the NOTEARS algorithm resulted in a wave of research on differentiable Directed Acyclic Graph (DAG) learning. Differentiable DAG learning transforms the combinatorial problem of identifying the DAG underlying a Structural Causal Model (SCM) into a constrained continuous optimization problem. Being differentiable, these problems can be solved using gradient-based tools which allow integration into other differentiable objectives. However, in contrast to classical constraint-based algorithms, the identifiability properties of differentiable algorithms are poorly understood. We illustrate that even in the well-known Linear Non-Gaussian Additive Model (LiNGAM), the current state-of-the-art methods do not identify the true underlying DAG. To address the issue, we propose NOTIME (*Non-combinatorial Optimization of Trace exponential and Independence MEasures*), the first differentiable DAG learning algorithm with *provable* identifiability guarantees under the LiNGAM by building on a measure of (joint) independence. With its identifiability guarantees, NOTIME remains invariant to normalization of the data on a population level, a property lacking in existing methods. NOTIME compares favourably against NOTEARS and other (scale-invariant) differentiable DAG learners, across different noise distributions and normalization procedures. Introducing the first identifiability guarantees to general LiNGAM is an important step towards practical adoption of differentiable DAG learners.
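For context, the trace-exponential acyclicity characterisation that underlies NOTEARS-style differentiable DAG learning (and is echoed in NOTIME's name) can be sketched as follows; NOTIME's independence measure is not shown.

```python
# Trace-exponential acyclicity penalty used in NOTEARS-style differentiable DAG
# learning: h(W) = tr(exp(W * W)) - d equals 0 iff the weighted graph W is a DAG.
import numpy as np
from scipy.linalg import expm

def acyclicity(W):
    d = W.shape[0]
    return np.trace(expm(W * W)) - d

dag = np.array([[0.0, 1.2], [0.0, 0.0]])     # edge 1 -> 2 only, acyclic
cyc = np.array([[0.0, 1.2], [0.7, 0.0]])     # edges 1 -> 2 and 2 -> 1, cyclic
print(acyclicity(dag), acyclicity(cyc))      # ~0 for the DAG, > 0 for the cycle
```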
Poster
Julien Aubert · Louis Köhler · Luc Lehéricy · Giulia Mezzadri · Patricia Reynaud-Bouret
[ Hall A-E ]
Abstract
Learning for animals or humans is the process that leads to behaviors better adapted to the environment. This process highly depends on the individual that learns and is usually observed only through the individual's actions. This article presents ways to use this individual behavioral data to find the model that best explains how the individual learns. We propose two model selection methods: a general hold-out procedure and an AIC-type criterion, both adapted to non-stationary dependent data. We provide theoretical error bounds for these methods that are close to those of the standard i.i.d. case. To compare these approaches, we apply them to contextual bandit models and illustrate their use on both synthetic and experimental learning data in a human categorization task.
Poster
Chirag Modi · Diana Cai · Lawrence Saul
[ Hall A-E ]
Abstract
Black-box variational inference (BBVI) scales poorly to high-dimensional problems when it is used to estimate a multivariate Gaussian approximation with a full covariance matrix. In this paper, we extend the _batch-and-match_ (BaM) framework for score-based BBVI to problems where it is prohibitively expensive to store such covariance matrices, let alone to estimate them. Unlike classical algorithms for BBVI, which use stochastic gradient descent to minimize the reverse Kullback-Leibler divergence, BaM uses more specialized updates to match the scores of the target density and its Gaussian approximation. We extend the updates for BaM by integrating them with a more compact parameterization of full covariance matrices. In particular, borrowing ideas from factor analysis, we add an extra step to each iteration of BaM---a _patch_---that projects each newly updated covariance matrix into a more efficiently parameterized family of diagonal plus low rank matrices. We evaluate this approach on a variety of synthetic target distributions and real-world problems in high-dimensional inference.
Poster
Francesco Bacchiocchi · Matteo Bollini · Matteo Castiglioni · Alberto Marchesi · Nicola Gatti
[ Hall A-E ]
Abstract
Stackelberg games (SGs) constitute the most fundamental and acclaimed models of strategic interactions involving some form of commitment. Moreover, they form the basis of more elaborate models of this kind, such as Bayesian persuasion and principal-agent problems. Addressing learning tasks in SGs and related models is crucial to operationalize them in practice, where model parameters are usually unknown. In this paper, we revise the sample complexity of learning an optimal strategy to commit to in SGs. We provide a novel algorithm that (i) does not require any of the limiting assumptions made by state-of-the-art approaches and (ii) deals with a trade-off between sample complexity and termination probability arising when the representation of the leader’s strategies has finite precision. Such a trade-off has been completely neglected by existing algorithms and, if not properly managed, it may result in them using exponentially-many samples. Our algorithm requires novel techniques, which also pave the way to addressing learning problems in other models with commitment ubiquitous in the real world.
Poster
Achraf Azize · Debabrota Basu
[ Hall A-E ]
Abstract
In a Membership Inference (MI) game, an attacker tries to infer whether a target point was included or not in the input of an algorithm. Existing works show that some target points are easier to identify, while others are harder. This paper explains the target-dependent hardness of membership attacks by studying the powers of the optimal attacks in a *fixed-target* MI game. We characterise the optimal advantage and trade-off functions of attacks against the empirical mean in terms of the Mahalanobis distance between the target point and the data-generating distribution. We further derive the impacts of two privacy defences, i.e. adding Gaussian noise and sub-sampling, and that of target misspecification on optimal attacks. As by-products of our novel analysis of the Likelihood Ratio (LR) test, we provide a new covariance attack which generalises and improves the scalar product attack. Also, we propose a new optimal canary-choosing strategy for auditing privacy in the white-box federated learning setting. Our experiments validate that the Mahalanobis score explains the hardness of *fixed-target* MI games.
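The Mahalanobis score referred to above can be sketched as follows, using an empirical mean and covariance as stand-ins for the data-generating distribution.

```python
# Mahalanobis distance between a target point and the data-generating distribution,
# estimated here from a sample via empirical mean and covariance.
import numpy as np

rng = np.random.default_rng(0)
data = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 2.0]], size=5000)
mu, cov = data.mean(axis=0), np.cov(data, rowvar=False)

def mahalanobis_score(target):
    diff = target - mu
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

print(mahalanobis_score(np.array([0.1, -0.2])))   # typical point: small score
print(mahalanobis_score(np.array([4.0, 6.0])))    # atypical point: large score
```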
Poster
Mátyás Schubert · Tom Claassen · Sara Magliacane
[ Hall A-E ]
Abstract
Causal discovery can be computationally demanding for large numbers of variables. If we only wish to estimate the causal effects on a small subset of target variables, we might not need to learn the causal graph for all variables, but only a small subgraph that includes the targets and their adjustment sets. In this paper, we focus on identifying causal effects between target variables in a computationally and statistically efficient way. This task combines causal discovery and effect estimation, aligning the discovery objective with the effects to be estimated. We show that definite non-ancestors of the targets are unnecessary to learn causal relations between the targets and to identify efficient adjustment sets. We sequentially identify and prune these definite non-ancestors with our Sequential Non-Ancestor Pruning (SNAP) framework, which can be used either as a preprocessing step to standard causal discovery methods, or as a standalone sound and complete causal discovery algorithm. Our results on synthetic and real data show that both approaches substantially reduce the number of independence tests and the computation time without compromising the quality of causal effect estimations.
Poster
Giora Simchoni · Saharon Rosset
[ Hall A-E ]
Abstract
We introduce copula-based neural networks (COPNN), a novel framework that extends beyond the limitations of Gaussian marginals for random effects in mixed models. COPNN integrates the flexibility of Gaussian copulas in capturing rich dependence structures with arbitrary marginal distributions, with the expressive power of deep neural networks (DNN), allowing it to model large non-Gaussian data in both regression and classification settings, while using batch learning and stochastic gradient descent. Unlike traditional linear and non-linear mixed models, which assume Gaussianity for random effects, COPNN leverages copulas to decouple the marginal distribution from the dependence structure, caused by spatial, temporal and high-cardinality categorical features. This is achieved by minimizing a batch negative log-likelihood (NLL) loss in the continuous case, and a batch negative pairwise log-likelihood in the binary case. We demonstrate COPNN’s effectiveness through extensive experiments on both simulated and real datasets. COPNN reduces NLL and MSE in the regression setting, and improves predictive accuracy in the classification setting, compared to previous state-of-the-art methods which integrate random effects into DNNs. Our real-world experiments, conducted on datasets from automotive pricing and retail traffic forecasting, further validate COPNN's ability to improve performance over traditional methods for dealing with high-cardinality categorical features.
Poster
Yuliang Ji · Jian Wu · Yuanzhe Xi
[ Hall A-E ]
Abstract
Deep neural networks have achieved substantial success across various scientific computing tasks. A pivotal challenge within this domain is the rapid and parallel approximation of matrix inverses, critical for numerous applications. Despite significant progress, there currently exists no universal neural-based method for approximating matrix inversion. This paper presents a theoretical analysis demonstrating the fundamental limitations of neural networks in developing a generalized matrix inversion model. We expand the class of Lipschitz functions to encompass a wider array of neural network models, thereby refining our theoretical approach. Moreover, we delineate specific conditions under which neural networks can effectively approximate matrix inverses. Our theoretical results are supported by experimental results from diverse matrix datasets, exploring the efficacy of neural networks in addressing the matrix inversion challenge.
Poster
Houssam Zenati · Judith Abécassis · Julie Josse · Bertrand Thirion
[ Hall A-E ]
Abstract
Uncovering causal mediation effects is of significant value to practitioners who aim to isolate treatment effects from potential mediator effects. We propose a double machine learning (DML) algorithm for mediation analysis that supports continuous treatments. To estimate the target mediated response curve, our method employs a kernel-based doubly robust moment function for which we prove asymptotic Neyman orthogonality. This allows us to obtain an asymptotic normality with nonparametric convergence rate while allowing for nonparametric or parametric estimation of the nuisance parameters. Subsequently, we derive an optimal bandwidth strategy along with a procedure to estimate asymptotic confidence intervals. Finally, to illustrate the benefits of our method, we provide a numerical evaluation of our approach on a simulation along with an application on medical real-world data to analyze the effect of glycemic control on cognitive functions.
Poster
Bariscan Bozkurt · Ben Deaner · Dimitri Meunier · Liyuan Xu · Arthur Gretton
[ Hall A-E ]
Abstract
We address the setting of Proxy Causal Learning (PCL), which has the goal of estimating causal effects from observed data in the presence of hidden confounding. Proxy methods accomplish this task using two proxy variables related to the latent confounder: a treatment proxy (related to the treatment) and an outcome proxy (related to the outcome). Two approaches have been proposed to perform causal effect estimation given proxy variables; however only one of these has found mainstream acceptance, since the other was understood to require density ratio estimation - a challenging task in high dimensions. In the present work, we propose a practical and effective implementation of the second approach, which bypasses explicit density ratio estimation and is suitable for continuous and high-dimensional treatments. We employ kernel ridge regression to derive estimators, resulting in simple closed-form solutions for dose-response and conditional dose-response curves, along with consistency guarantees. Our methods empirically demonstrate superior or comparable performance to existing frameworks on synthetic and real-world datasets.
Poster
Gabriel Moreira · Manuel Marques · Joao Costeira · Alexander Hauptmann
[ Hall A-E ]
Abstract
Learning image representations that capture rich semantic relationships remains a significant challenge. Existing approaches are either contrastive, lacking robust theoretical guarantees, or struggle to effectively represent the partial orders inherent to structured visual-semantic data. In this paper, we introduce a nuclear norm-based loss function, grounded in the same information theoretic principles that have proved effective in self-supervised learning. We present a theoretical characterization of this loss, demonstrating that, in addition to promoting class orthogonality, it encodes the spectral geometry of the data within a subspace lattice. This geometric representation allows us to associate logical propositions with subspaces, ensuring that our learned representations adhere to a predefined symbolic structure.
Poster
Xiusi Li · Sékou-Oumar Kaba · Siamak Ravanbakhsh
[ Hall A-E ]
Abstract
Causal representation learning (CRL) enhances machine learning models' robustness and generalizability by learning structural causal models associated with data-generating processes. We focus on a family of CRL methods that uses contrastive data pairs in the observable space, generated before and after a random, unknown intervention, to identify the latent causal model. Brehmer et al. (2022) showed that this is indeed possible, given that all latent variables can be intervened on individually. However, this is a highly restrictive assumption in many systems. In this work, we instead assume interventions on arbitrary subsets of latent variables, which is more realistic. We introduce a theoretical framework that calculates the degree to which we can identify a causal model, given a set of possible interventions, up to an abstraction that describes the system at a higher level of granularity.
Poster
Angel David Reyero Lobo · Alexis Ayme · Claire Boyer · Erwan Scornet
[ Hall A-E ]
Abstract
Supervised learning with missing data aims at building the best prediction of a target output based on partially-observed inputs. Major approaches to address this problem can be decomposed into $(i)$ impute-then-predict strategies, which first fill in the empty input components and then apply a unique predictor and $(ii)$ Pattern-by-Pattern (P-b-P) approaches, where a predictor is built on each missing pattern. In this paper, we theoretically analyze how three classical linear classifiers, namely perceptron, logistic regression and linear discriminant analysis (LDA), behave with Missing Completely At Random (MCAR) data, depending on the strategy (imputation or P-b-P) to handle missing values. We prove that both imputation and P-b-P approaches are ill-specified in a logistic regression framework, thus questioning the relevance of such approaches to handle missing data. The most favorable auspices to perform classification with missing data concern P-b-P LDA methods. We provide finite-sample bounds for the excess risk in this framework, even for high-dimensional settings or MNAR data. Experiments illustrate our theoretical findings.
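A minimal sketch of a Pattern-by-Pattern classifier, here with one LDA fit per missingness pattern using only the coordinates observed under that pattern (a simplified stand-in for the P-b-P LDA methods analysed above):

```python
# Pattern-by-Pattern (P-b-P) classification sketch: one LDA per missingness pattern,
# each trained on the coordinates observed under that pattern.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_pbp_lda(X, y):
    models = {}
    patterns = np.isnan(X)
    for pat in np.unique(patterns, axis=0):
        rows = (patterns == pat).all(axis=1)
        obs = ~pat
        if rows.sum() > 1 and obs.any() and len(np.unique(y[rows])) > 1:
            models[tuple(pat)] = LinearDiscriminantAnalysis().fit(X[rows][:, obs], y[rows])
    return models

def predict_pbp(models, x):
    pat = tuple(np.isnan(x))
    return models[pat].predict(x[~np.isnan(x)].reshape(1, -1))[0]

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan          # MCAR missingness
models = fit_pbp_lda(X, y)
print(predict_pbp(models, np.array([0.5, np.nan, -0.2])))
```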
Poster
Chenyang Li · Yingyu Liang · Zhenmei Shi · Zhao Song
[ Hall A-E ]
Abstract
The weighted low-rank approximation problem is a fundamental numerical linear algebra problem and has many applications in machine learning. Given an $n \times n$ weight matrix $W$ and an $n \times n$ matrix $A$, the goal is to find two low-rank matrices $U, V \in \mathbb{R}^{n \times k}$ such that the cost $\| W \circ (U V^\top - A) \|_F^2$ is minimized. Previous work has to pay $\Omega(n^2)$ time when matrices $A$ and $W$ are dense, e.g., having $\Omega(n^2)$ non-zero entries. In this work, we show that there is a certain regime in which, even if $A$ and $W$ are dense, we can still hope to solve the weighted low-rank approximation problem in almost linear $n^{1+o(1)}$ time.
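To make the objective concrete, the following sketch evaluates $\| W \circ (U V^\top - A) \|_F^2$ and takes one plain gradient step on $U$; it is only an illustration, not the almost-linear-time algorithm of the paper.

```python
# Weighted low-rank objective and a naive gradient step on U.
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 5
A = rng.standard_normal((n, n))
W = rng.random((n, n))
U = rng.standard_normal((n, k))
V = rng.standard_normal((n, k))

def cost(U, V):
    return np.linalg.norm(W * (U @ V.T - A), "fro") ** 2

grad_U = 2 * (W**2 * (U @ V.T - A)) @ V        # gradient of the cost w.r.t. U
U_next = U - 1e-3 * grad_U
print(cost(U, V), cost(U_next, V))             # the cost decreases after the step
```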
Poster
Sihan Zeng · Sujay Bhatt · Alec Koppel · Sumitra Ganesh
[ Hall A-E ]
Abstract
We consider discrete-time stationary mean field games (MFG) with unknown dynamics and design algorithms for finding the equilibrium with finite-time complexity guarantees. Prior solutions to the problem assume either the contraction of a mean field optimality-consistency operator or strict weak monotonicity, which may be overly restrictive. In this work, we introduce a new class of solvable MFGs, named the "fully herding class", which expands the known solvable class of MFGs and for the first time includes problems with multiple equilibria. We propose a direct policy optimization method, Accelerated Single-loop Actor Critic Algorithm for Mean Field Games (ASAC-MFG), that provably finds a global equilibrium for MFGs within this class, under suitable access to a single trajectory of Markovian samples. Different from the prior methods, ASAC-MFG is single-loop and single-sample-path. We establish the finite-time and finite-sample convergence of ASAC-MFG to a mean field equilibrium via new techniques that we develop for multi-time-scale stochastic approximation. We support the theoretical results with illustrative numerical simulations. When the mean field does not affect the transition and reward, an MFG reduces to a Markov decision process (MDP) and ASAC-MFG becomes an actor-critic algorithm for finding the optimal policy in average-reward MDPs, with a sample complexity matching the …
Poster
Sungee Hong · Zhengling Qi · Raymond K. W. Wong
[ Hall A-E ]
Abstract
We study distributional off-policy evaluation (OPE), the goal of which is to learn the distribution of the return for a target policy using offline data generated by a different policy. The theoretical foundation of much existing work relies on supremum-extended statistical distances such as the supremum-Wasserstein distance, which are hard to estimate. In contrast, we study the more manageable expectation-extended statistical distances and provide a novel theoretical justification of their validity for learning the return distribution. Based on this attractive property, we propose a new method called Energy Bellman Residual Minimizer (EBRM) for distributional OPE. We provide corresponding in-depth theoretical analyses. We establish a finite-sample error bound for the EBRM estimator under the realizability assumption. Furthermore, we introduce a variant of our method based on a multi-step extension which improves the error bound for non-realizable settings. Notably, unlike prior distributional OPE methods, the theoretical guarantees of our method do not require the completeness assumption.
Poster
Aramayis Dallakyan · Yang Ni
[ Hall A-E ]
Abstract
The discovery of causal relationships from observational data is very challenging. Many recent approaches rely on complexity or uncertainty concepts to impose constraints on probability distributions, aiming to identify specific classes of directed acyclic graph (DAG) models. In this paper, we introduce a novel identifiability criterion for DAGs that places constraints on the conditional variances of additive noise models. We demonstrate that this criterion extends and generalizes existing identifiability criteria in the literature that employ (conditional) variances as measures of uncertainty in (conditional) distributions. For linear structural equation models, we present a new algorithm that leverages the concept of weak majorization applied to the diagonal elements of the Cholesky factor of the covariance matrix to learn a topological ordering of variables. Through extensive simulations and the analysis of bank connectivity data, we provide evidence of the effectiveness of our approach in successfully recovering DAGs. The code for reproducing the results in this paper is available in Supplementary Materials.
Poster
Bassel Hamoud · Ilnura Usmanova · Kfir Yehuda Levy
[ Hall A-E ]
Abstract
We present the first theoretical guarantees for zero constraint violation in Online Convex Optimization (OCO) across all rounds, addressing dynamic constraint changes. Unlike existing approaches in constrained OCO, which allow for occasional safety breaches, we provide the first approach for maintaining strict safety under the assumption of gradually evolving constraints, namely, that the constraints change by at most a small amount between consecutive rounds. This is achieved through a primal-dual approach and Online Gradient Ascent in the dual space. We show that employing a dichotomous learning rate enables ensuring both safety, via zero constraint violation, and sublinear regret. Our framework marks a departure from previous work by providing the first provable guarantees for maintaining absolute safety in the face of changing constraints in OCO.
Poster
Irit Chelly · Roy Uziel · Oren Freifeld · Ari Pakman
[ Hall A-E ]
Abstract
Neural models for amortized probabilistic clustering yield samples of cluster labels given a set-structured input, while avoiding lengthy Markov chain runs and the need for explicit data likelihoods. Existing methods which label each data point sequentially, like the Neural Clustering Process, often lead to cluster assignments highly dependent on the data order. Alternatively, methods that sequentially create full clusters do not provide assignment probabilities. In this paper, we introduce GFNCP, a novel framework for amortized clustering. GFNCP is formulated as a Generative Flow Network with a shared energy-based parametrization of policy and reward. We show that the flow matching conditions are equivalent to consistency of the clustering posterior under marginalization, which in turn implies order invariance. GFNCP also outperforms existing methods in clustering performance on both synthetic and real-world data.
Poster
Jia Lin Hau · Erick Delage · Esther Derman · Mohammad Ghavamzadeh · Marek Petrik
[ Hall A-E ]
Abstract
In Markov decision processes (MDPs), quantile risk measures such as Value-at-Risk are a standard metric for modeling RL agents' preferences for certain outcomes. This paper proposes a new Q-learning algorithm for quantile optimization in MDPs with strong convergence and performance guarantees. The algorithm leverages a new, simple dynamic program (DP) decomposition for quantile MDPs. Compared with prior work, our DP decomposition requires neither known transition probabilities nor solving complex saddle point equations and serves as a suitable foundation for other model-free RL algorithms. Our numerical results in tabular domains show that our Q-learning algorithm converges to its DP variant and outperforms earlier algorithms.
Poster
Eliot Beyler · Francis Bach
[ Hall A-E ]
Abstract
In this paper, we derive variational inference upper bounds on the log-partition function of a pairwise Markov random field on the Boolean hypercube, based on quantum relaxations of the Kullback-Leibler divergence. We then propose an efficient algorithm to compute these bounds based on primal-dual optimization. An improvement of these bounds through the use of "hierarchies", similar to sum-of-squares (SoS) hierarchies, is proposed, and we present a greedy algorithm to select among these relaxations. We carry out extensive numerical experiments and compare with state-of-the-art methods for this inference problem.
Poster
Daniel Galperin · Ullrich Köthe
[ Hall A-E ]
Abstract
Good generative models should not only synthesize high quality data, but also utilize interpretable representations that aid human understanding of their behavior. However, it is difficult to measure objectively if and to what degree desirable properties of disentangled representations have been achieved. Inspired by the principle of independent mechanisms, we address this difficulty by introducing a novel set of tractable information-theoretic evaluation metrics. We demonstrate the usefulness of our metrics on illustrative toy examples and conduct an in-depth comparison of various normalizing flow architectures and $\beta$-VAEs on the EMNIST dataset. Our method allows sorting latent features by importance and assessing the amount of residual correlations of the resulting concepts. The most interesting finding of our experiments is a ranking of model architectures in terms of their inductive bias to converge to aligned and disentangled representations during training.
Poster
Alex Buna · Patrick Rebeschini
[ Hall A-E ]
Abstract
Recent progress in robust statistical learning has mainly tackled convex problems, like mean estimation or linear regression, with non-convex challenges receiving less attention. Phase retrieval exemplifies such a non-convex problem, requiring the recovery of a signal from only the magnitudes of its linear measurements, without phase (sign) information. While several non-convex methods, especially those involving the Wirtinger Flow algorithm, have been proposed for noiseless or mild noise settings, developing solutions for heavy-tailed noise and adversarial corruption remains an open challenge. In this paper, we investigate an approach that leverages robust gradient descent techniques to improve the Wirtinger Flow algorithm's ability to simultaneously cope with fourth moment bounded noise and adversarial contamination in both the inputs (covariates) and outputs (responses). We address two scenarios: known zero-mean noise and completely unknown noise. For the latter, we propose a preprocessing step that alters the problem into a new format that does not fit traditional phase retrieval approaches but can still be resolved with a tailored version of the algorithm for the zero-mean noise context.
Poster
Antonios Valkanas · Boris Oreshkin · Mark Coates
[ Hall A-E ]
Abstract
Online deep learning tackles the challenge of learning from data streams by balancing two competing goals: fast learning and deep learning. However, existing research primarily emphasizes deep learning solutions, which are more adept at handling the ''deep'' aspect than the ''fast'' aspect of online learning. In this work, we introduce an alternative paradigm through a hybrid multilearner approach. We begin by developing a fast online logistic regression learner, which operates without relying on backpropagation. It leverages closed-form recursive updates of model parameters, efficiently addressing the fast learning component of the online learning challenge. This approach is further integrated with a cascaded multilearner design, where shallow and deep learners are co-trained in a cooperative, synergistic manner to solve the online learning problem. We demonstrate that this approach achieves state-of-the-art performance on standard online learning datasets. We make our code available: https://github.com/AntonValk/MODL
Poster
Liyuan Xu · Arthur Gretton
[ Hall A-E ]
Abstract
We consider the problem of causal effect estimation with an unobserved confounder, where we observe a single proxy variable that is associated with the confounder. Although it has been shown that the recovery of an average causal effect is impossible in general from a single proxy variable, we show that causal recovery is possible if the outcome is generated deterministically. This generalizes existing work on causal methods with a single proxy variable to the continuous treatment setting. We propose two kernel-based methods for this setting: the first based on the two-stage regression approach, and the second based on a maximum moment restriction approach. We prove that both approaches can consistently estimate the causal effect, and we empirically demonstrate that we can successfully recover the causal effect on challenging synthetic benchmarks.
Poster
Lucas GNECCO HEREDIA · Matteo Sammut · Muni Sreenivas Pydi · Rafael Pinot · Benjamin Negrevergne · Yann Chevaleyre
[ Hall A-E ]
Abstract
Randomization as a means to improve the adversarial robustness of machine learning models has recently attracted significant attention. Unfortunately, much of the theoretical analysis so far has focused on binary classification, providing only limited insights into the more complex multiclass setting. In this paper, we take a step toward closing this gap by drawing inspiration from the field of graph theory. Our analysis focuses on discrete data distributions, allowing us to cast the adversarial risk minimization problems within the well-established framework of set packing problems. By doing so, we are able to identify three structural conditions on the support of the data distribution that are necessary for randomization to improve robustness. Furthermore, we are able to construct several data distributions where (contrarily to binary classification) switching from a deterministic to a randomized solution significantly reduces the optimal adversarial risk. These findings highlight the crucial role randomization can play in enhancing robustness to adversarial attacks in multiclass classification.
Poster
Grigor Bezirganyan · Sana Sellami · Laure Berti-Equille · Sébastien Fournier
[ Hall A-E ]
Abstract
Multimodal AI models are increasingly used in fields like healthcare, finance, and autonomous driving, where information is drawn from multiple sources or modalities such as images, text, audio, and video. However, effectively managing uncertainty—arising from noise, insufficient evidence, or conflicts between modalities—is crucial for reliable decision-making. Current uncertainty-aware machine learning methods leveraging, for example, evidence averaging or evidence accumulation underestimate uncertainties in high-conflict scenarios. Moreover, the state-of-the-art evidence averaging strategy is not order invariant and fails to scale to multiple modalities. To address these challenges, we propose a novel multimodal learning method with order-invariant evidence fusion and introduce a conflict-based discounting mechanism that reallocates uncertain mass when unreliable modalities are detected. We provide both theoretical analysis and experimental validation, demonstrating that unlike the previous work, the proposed approach effectively distinguishes between conflicting and non-conflicting samples based on the provided uncertainty estimates, and outperforms the previous models in uncertainty-based conflict detection.
Poster
Anna Bonnet · Maxime Sangnier
[ Hall A-E ]
Abstract
This paper addresses nonparametric estimation of nonlinear multivariate Hawkes processes, where the interaction functions are assumed to lie in a reproducing kernel Hilbert space (RKHS). Motivated by applications in neuroscience, the model allows complex interaction functions, in order to express exciting and inhibiting effects, but also a combination of both (which is particularly interesting to model the refractory period of neurons), and considers in return that conditional intensities are rectified by the ReLU function. The latter feature incurs several methodological challenges, for which workarounds are proposed in this paper. In particular, it is shown that a representer theorem can be obtained for approximated versions of the log-likelihood and the least-squares criteria. Based on it, we propose an estimation method, that relies on two common approximations (of the ReLU function and of the integral operator). We provide a bound that controls the impact of these approximations. Numerical results on synthetic data confirm this fact as well as the good asymptotic behavior of the proposed estimator. It also shows that our method achieves a better performance compared to related nonparametric estimation techniques and suits neuronal applications.
Poster
Colin Dirren · Mattia Bianchi · Panagiotis D. Grontas · John Lygeros · Florian Dorfler
[ Hall A-E ]
Abstract
We study the convex-concave bilinear saddle-point problem $\min_x \max_y f(x) + y^\top Ax - g(y)$, where both, only one, or none of the functions $f$ and $g$ are strongly convex, and suitable rank conditions on the matrix $A$ hold. The solution of this problem is at the core of many machine learning tasks. By employing tools from monotone operator theory, we systematically prove the contractivity (in turn, the linear convergence) of several first-order primal-dual algorithms, including the Chambolle–Pock method. Our approach results in concise proofs, and it yields new convergence guarantees and tighter bounds compared to known results.
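One of the analysed methods, the Chambolle–Pock iteration, can be sketched for the bilinear saddle-point problem above; in this sketch $f$ and $g$ are strongly convex quadratics so both proximal maps are closed form, and the step sizes are chosen so that $\tau\sigma\|A\|^2 < 1$.

```python
# Chambolle-Pock primal-dual iteration for min_x max_y f(x) + y^T A x - g(y),
# with f(x) = a/2 ||x||^2 and g(y) = b/2 ||y||^2 so both proximal maps are explicit.
import numpy as np

rng = np.random.default_rng(0)
m, n, a, b = 20, 30, 1.0, 1.0
A = rng.standard_normal((m, n))
tau = sigma = 0.9 / np.linalg.norm(A, 2)   # ensures tau * sigma * ||A||^2 < 1
theta = 1.0

x = np.zeros(n); x_bar = x.copy(); y = np.zeros(m)
for _ in range(500):
    y = (y + sigma * A @ x_bar) / (1.0 + sigma * b)    # prox step for sigma * g
    x_new = (x - tau * A.T @ y) / (1.0 + tau * a)      # prox step for tau * f
    x_bar = x_new + theta * (x_new - x)                # extrapolation
    x = x_new

print(np.linalg.norm(x), np.linalg.norm(y))            # saddle point here is (0, 0)
```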
Poster
Wenfu Xia · Fengpei Li · Ying Sun · Ziping Zhao
[ Hall A-E ]
Abstract
Covariance matrix estimation is a fundamental problem in multivariate data analysis, which becomes particularly challenging in high-dimensional settings due to the curse of dimensionality. To enhance estimation accuracy, structural regularization is often imposed on the precision matrix (the inverse covariance matrix) for covariance selection. In this paper, we study covariance selection in a distributed setting, where data is spread across a network of agents. We formulate the problem as a Gaussian maximum likelihood estimation problem with structural penalties and propose a novel algorithmic framework called NetGGM. Unlike existing methods that rely on a central coordinator, NetGGM operates in a fully decentralized manner with low computational complexity. We provide theoretical guarantees showing that NetGGM converges linearly to the global optimum while ensuring consensus among agents. Numerical experiments validate its convergence properties and demonstrate that it outperforms state-of-the-art methods in precision matrix estimation.
Poster
Paul Mangold · Alain Durmus · Aymeric Dieuleveut · Sergey Samsonov · Eric Moulines
[ Hall A-E ]
Abstract
In this paper, we present a novel analysis of $\texttt{FedAvg}$ with constant step size, relying on the Markov property of the underlying process. We demonstrate that the global iterates of the algorithm converge to a stationary distribution and analyze its resulting bias and variance relative to the problem's solution. We provide a first-order bias expansion in both homogeneous and heterogeneous settings. Interestingly, this bias decomposes into two distinct components: one that depends solely on stochastic gradient noise and another on client heterogeneity. Finally, we introduce a new algorithm based on the Richardson-Romberg extrapolation technique to mitigate this bias.
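The Richardson-Romberg idea can be illustrated generically: run the same first-order-biased procedure at step sizes $h$ and $h/2$ and combine the two results to cancel the leading bias term. The sketch below uses a forward-difference derivative estimate as the biased procedure, not FedAvg itself.

```python
# Generic Richardson-Romberg extrapolation: a forward difference has bias of order
# h, so 2 * D(h/2) - D(h) cancels the leading bias term.
import numpy as np

f, x = np.sin, 1.0
true_derivative = np.cos(x)

def forward_diff(h):
    return (f(x + h) - f(x)) / h              # bias ~ (h / 2) * f''(x)

h = 1e-2
d_h, d_h2 = forward_diff(h), forward_diff(h / 2)
extrapolated = 2 * d_h2 - d_h                 # first-order bias cancelled
print(abs(d_h - true_derivative))             # error of order 1e-3
print(abs(extrapolated - true_derivative))    # error several orders of magnitude smaller
```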
Poster
Daniel Williams · Leyang Wang · Qizhen Ying · Song Liu · Mladen Kolar
[ Hall A-E ]
Abstract
This paper addresses differential inference in time-varying parametric probabilistic models, like graphical models with changing structures. Instead of estimating a high-dimensional model at each time and estimating changes later, we directly learn the differential parameter, i.e., the time derivative of the parameter. The main idea is treating the time score function of an exponential family model as a linear model of the differential parameter for direct estimation. We use time score matching to estimate parameter derivatives. We prove the consistency of a regularized score matching objective and demonstrate the finite-sample normality of a debiased estimator in high-dimensional settings. Our methodology effectively infers differential structures in high-dimensional graphical models, verified on simulated and real-world datasets. The code reproducing our experiments can be found at: \url{https://github.com/Leyangw/tsm}.
Poster
Okan Koc · Alexander Soen · Chao-Kai Chiang · Masashi Sugiyama
[ Hall A-E ]
Abstract
Current machine learning systems are brittle in the face of distribution shifts (DS), where the target distribution that the system is tested on differs from the source distribution used to train the system. This problem of robustness to DS has been studied extensively in the field of domain adaptation. For deep neural networks, popular methods for unsupervised domain adaptation (UDA) are domain matching methods that try to align the marginal distributions in the feature or output space. The current theoretical understanding of these methods, however, is limited and existing theoretical frameworks are not precise enough to characterize their performance in practice. To this end, we derive new bounds based on optimal transport that analyze the UDA problem. Our new bound includes a term which we dub as entanglement, consisting of an expectation of Wasserstein distance between conditionals with respect to changing data distributions. Analysis of the entanglement term provides a novel perspective on the unoptimizable aspects of UDA. In various experiments with multiple models across several DS scenarios, we show that this term can be used to explain the varying performance of UDA algorithms.
Poster
Jordan Penn · Lee Gunderson · Gecia Bravo-Hermsdorff · Ricardo Silva · David Watson
[ Hall A-E ]
Abstract
Instrumental variables (IVs) are widely used to estimate causal effects in the presence of unobserved confounding between an exposure $X$ and outcome $Y$. An IV must affect $Y$ exclusively through $X$ and be unconfounded with $Y$. We present a framework for relaxing these assumptions with tuneable and interpretable "budget constraints". Our algorithm returns a feasible set of causal effects that can be identified exactly given perfect knowledge of observable covariance statistics. This feasible set might contain disconnected sets of possible solutions for the causal effect. We discuss conditions under which this set is sharp, i.e., contains all and only effects consistent with the background assumptions and the joint distribution of observable variables. Our method applies to a wide class of semiparametric models, and we demonstrate how its ability to select specific subsets of instruments confers an advantage over convex relaxations in both linear and nonlinear settings. We adapt our algorithm to form confidence sets that are asymptotically valid under a common statistical assumption from the Mendelian randomization literature. An accompanying R package, budgetIVr, is available from CRAN.
Poster
Vladimir Braverman · Prathamesh Dharangutte · Shreyas Pai · Vihan Shah · Chen Wang
[ Hall A-E ]
Abstract
We study the dynamic correlation clustering problem with *adaptive* edge label flips. In correlation clustering, we are given an $n$-vertex complete graph whose edges are labeled either $(+)$ or $(-)$, and the goal is to minimize the total number of $(+)$ edges between clusters and the number of $(-)$ edges within clusters. We consider the dynamic setting with adversarial robustness, in which the *adaptive* adversary can flip the label of an edge based on the current output of the algorithm. Our main result is a randomized algorithm that always maintains an $O(1)$-approximation to the optimal correlation clustering with $O(\log^{2}{n})$ amortized update time. Prior to our work, no algorithm with $O(1)$-approximation and $\text{polylog}{(n)}$ update time for the adversarially robust setting was known. We further validate our theoretical results with experiments on synthetic and real-world datasets, showing competitive empirical performance. Our main technical ingredient is an algorithm that maintains *sparse-dense decomposition* with $\text{polylog}{(n)}$ update time, which could be of independent interest.
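For readers unfamiliar with the objective, the sketch below computes the correlation clustering disagreement cost on a small signed complete graph; the random instance and the cluster labels are hypothetical, and nothing here reflects the paper's dynamic algorithm.

```python
# Toy instance of the correlation clustering objective: disagreements are
# (+) edges across clusters plus (-) edges inside clusters.
import itertools
import numpy as np

def disagreements(signs: np.ndarray, labels: np.ndarray) -> int:
    """signs[i, j] in {+1, -1} for i < j; labels[i] is the cluster of vertex i."""
    cost = 0
    for i, j in itertools.combinations(range(len(labels)), 2):
        same_cluster = labels[i] == labels[j]
        if signs[i, j] == +1 and not same_cluster:
            cost += 1
        elif signs[i, j] == -1 and same_cluster:
            cost += 1
    return cost

rng = np.random.default_rng(1)
n = 6
signs = np.triu(rng.choice([-1, 1], size=(n, n)), k=1)   # random signed complete graph
print(disagreements(signs, labels=np.array([0, 0, 0, 1, 1, 1])))
```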
Poster
Pedro Seber e Silva · Richard Braatz
[ Hall A-E ]
Abstract
N-glycosylation has many essential biological roles, and is important for biotherapeutics as it can affect drug efficacy, duration of effect, and toxicity. The prediction of N-glycosylation and other important biopharmaceutical production values has mostly been limited to mechanistic modeling. We present a residual hybrid modeling approach that integrates mechanistic modeling with machine learning to produce significantly more accurate predictions for N-glycosylation and bioproduction. For the largest dataset, the residual hybrid models achieve an average 736-fold reduction in testing prediction error. Furthermore, the residual hybrid models have lower prediction errors than the mechanistic models for all of the predicted variables in the datasets. We provide the automatic machine learning software used in this work, allowing reproduction and use of our software for other tasks.
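As a hedged illustration of the general residual-hybrid idea (not the authors' software), one can fit a machine-learning model to the residuals of a mechanistic model and add the learned correction back; the mechanistic formula and data-generating process below are invented for the example.

```python
# Schematic residual hybrid model: fit an ML model to the residuals of a
# mechanistic model and add the learned correction back. The mechanistic
# formula and the data below are placeholders for illustration only.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def mechanistic_model(x: np.ndarray) -> np.ndarray:
    return 2.0 * x[:, 0] + 0.5 * x[:, 1]          # placeholder first-principles model

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + np.sin(4 * X[:, 0]) + 0.05 * rng.normal(size=500)

residuals = y - mechanistic_model(X)              # what the mechanistic model misses
correction = GradientBoostingRegressor().fit(X, residuals)

y_hat = mechanistic_model(X) + correction.predict(X)   # hybrid prediction
print(f"hybrid training MSE: {np.mean((y - y_hat) ** 2):.4f}")
```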
Poster
Sarbojit Roy · Malik Shahid Sultan · Tania Vallejo · Leena Ibrahim · Hernando Ombao
[ Hall A-E ]
Abstract
Interpretable classification of time series presents significant challenges in high dimensions. Traditional feature selection methods in the frequency domain often assume sparsity in spectral density matrices (SDMs) or their inverses, which can be restrictive for real-world applications. We propose a model-based approach for classifying high-dimensional stationary time series by assuming sparsity in the difference between inverse SDMs. The estimators for model parameters possess consistency under fairly general conditions. Additionally, we introduce a method to screen the most discriminatory frequencies for classification, which exhibits the ${\it sure\ screening\ property}$. The novelty of our method lies in the interpretability of the model parameters, making it especially suitable for fields like neuroscience, where understanding differences in brain network connectivity across various states is crucial. The proposed approach is evaluated using a variety of simulated examples. We apply it to EEG and calcium imaging datasets to demonstrate its practical relevance.
Poster
Tomoharu Iwata · Atsutoshi Kumagai · Yasutoshi Ida
[ Hall A-E ]
Abstract
We propose a few-shot learning method for linear regression, which learns how to choose regularization weights from multiple tasks with different feature spaces, and uses the knowledge for unseen tasks. Linear regression is ubiquitous in a wide variety of fields. Although regularization weight tuning is crucial to performance, it is difficult when only a small amount of training data are available. In the proposed method, task-specific regularization weights are generated using a neural network-based model by taking a task-specific training dataset as input, where our model is shared across all tasks. For each task, linear coefficients are optimized by minimizing the squared loss with an L2 regularizer using the generated regularization weights and the training dataset. Our model is meta-learned by minimizing the expected test error of linear regression with the task-specific coefficients using various training datasets. In our experiments using synthetic and real-world datasets, we demonstrate the effectiveness of the proposed method on few-shot regression tasks compared with existing methods.
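The task-specific inner step can be made concrete with a short sketch: given per-feature regularization weights (fixed by hand here; in the proposed method they would be produced by the shared, meta-learned network), the linear coefficients have a closed form.

```python
# Inner problem only: closed-form task-specific coefficients for a given vector
# of per-feature regularization weights (hand-set here; the paper's weights
# would come from a meta-learned network shared across tasks).
import numpy as np

def weighted_ridge(X: np.ndarray, y: np.ndarray, lam: np.ndarray) -> np.ndarray:
    """Minimize ||X w - y||^2 + sum_j lam_j * w_j^2."""
    return np.linalg.solve(X.T @ X + np.diag(lam), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))                      # few-shot: only 10 examples
y = X @ np.array([1.0, -2.0, 0.0]) + 0.1 * rng.normal(size=10)
lam = np.array([0.1, 0.1, 10.0])                  # hypothetical generated weights
print(weighted_ridge(X, y, lam))
```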
Poster
Alexander Timans · Christoph-Nikolas Straehle · Kaspar Sakmann · Christian Andersson Naesseth · Eric Nalisnick
[ Hall A-E ]
Abstract
Multiple hypothesis testing (MHT) frequently arises in scientific inquiries, and concurrent testing of multiple hypotheses inflates the risk of Type-I errors or false positives, rendering MHT corrections essential. This paper addresses MHT in the context of conformal prediction, a flexible framework for predictive uncertainty quantification. Some conformal applications give rise to simultaneous testing, and positive dependencies among tests typically exist. We introduce max-rank, a novel correction that exploits these dependencies whilst efficiently controlling the family-wise error rate. Inspired by existing permutation-based corrections, max-rank leverages rank order information to improve performance and integrates readily with any conformal procedure. We establish its theoretical and empirical advantages over the common Bonferroni correction and its compatibility with conformal prediction, highlighting the potential to strengthen predictive uncertainty estimates.
Poster
Nicolas Menet · Jonas Hübotter · Parnian Kassraie · Andreas Krause
[ Hall A-E ]
Abstract
We consider the problem of computing the *probability of maximality* (PoM) of a Gaussian random vector, i.e., the probability for each dimension to be maximal. This is a key challenge in applications ranging from Bayesian optimization to reinforcement learning, where the PoM not only helps with finding an optimal action, but yields a fine-grained analysis of the action domain, crucial in tasks such as drug discovery. Existing techniques are costly, scaling polynomially in computation and memory with the vector size. We introduce LITE, the first approach for estimating Gaussian PoM with *almost-linear time and memory* complexity. LITE achieves SOTA accuracy on a number of tasks, while being in practice several orders of magnitude faster than the baselines. This also translates to a better performance on downstream tasks such as entropy estimation and optimal control of bandits. Theoretically, we cast LITE as entropy-regularized UCB and connect it to prior PoM estimators.
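For intuition about the estimand itself, here is the naive Monte Carlo baseline for PoM that LITE is designed to avoid; the mean and covariance are arbitrary illustrative values.

```python
# Naive Monte Carlo baseline for the probability of maximality (PoM) of a
# Gaussian vector; mean and covariance below are arbitrary illustrative values.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.0, 0.2, 0.5])
cov = np.array([[1.0, 0.3, 0.1],
                [0.3, 1.0, 0.2],
                [0.1, 0.2, 1.0]])

samples = rng.multivariate_normal(mu, cov, size=100_000)
pom = np.bincount(samples.argmax(axis=1), minlength=len(mu)) / len(samples)
print(pom)   # estimated probability of each dimension being the maximum
```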
Poster
Jaiden Fairoze · Guillermo Ortiz-Jimenez · Mel Vecerik · Somesh Jha · Sven Gowal
[ Hall A-E ]
Abstract
This work investigates the theoretical boundaries of creating publicly-detectable schemes to enable the provenance of watermarked imagery. Metadata-based approaches like C2PA provide unforgeability and public-detectability. ML techniques offer robust retrieval and watermarking. However, no existing scheme combines robustness, unforgeability, and public-detectability. In this work, we formally define such a scheme and establish its existence. Although theoretically possible, we find that at present, it is intractable to build certain components of our scheme without a leap in deep learning capabilities. We analyze these limitations and propose research directions that need to be addressed before we can practically realize robust and publicly-verifiable provenance.
Poster
Ross Irwin · Alessandro Tibo · Jon Paul Janet · Simon Olsson
[ Hall A-E ]
Abstract
Methods for jointly generating molecular graphs along with their 3D conformations have gained prominence recently due to their potential impact on structure-based drug design. Current approaches, however, often suffer from very slow sampling times or generate molecules with poor chemical validity. Addressing these limitations, we propose Semla, a scalable E(3)-equivariant message passing architecture. We further introduce an unconditional 3D molecular generation model, SemlaFlow, which is trained using equivariant flow matching to generate a joint distribution over atom types, coordinates, bond types and formal charges. Our model produces state-of-the-art results on benchmark datasets with as few as 20 sampling steps, corresponding to a two-order-of-magnitude speedup over the prior state of the art. Furthermore, we highlight limitations of current evaluation methods for 3D generation and propose new benchmark metrics for unconditional molecular generators. Finally, using these new metrics, we compare our model's ability to generate high quality samples against current approaches and further demonstrate SemlaFlow's strong performance.
Poster
Juntong Chen · Johannes Schmidt-Hieber · Claire Donnat · Olga Klopp
[ Hall A-E ]
Abstract
Graph Convolutional Networks (GCNs) have become a pivotal method in machine learning for modeling functions over graphs. Despite their widespread success across various applications, their statistical properties (e.g., consistency, convergence rates) remain ill-characterized. To begin addressing this knowledge gap, we consider networks for which the graph structure implies that neighboring nodes exhibit similar signals and provide statistical theory for the impact of convolution operators. Focusing on estimators based solely on neighborhood aggregation, we examine how two common convolutions—the original GCN and GraphSAGE convolutions—affect the learning error as a function of the neighborhood topology and the number of convolutional layers. We explicitly characterize the bias-variance-type trade-off incurred by GCNs as a function of the neighborhood size and identify specific graph topologies where convolution operators are less effective. Our theoretical findings are corroborated by synthetic experiments, and provide a starting point for a deeper quantitative understanding of convolutional effects in GCNs, toward rigorous guidelines for practitioners.
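The two convolutions under study can be written as single propagation steps on a dense adjacency matrix; the untrained, feature-transform-free layers below are only a schematic of the aggregation being analyzed, on a tiny hypothetical graph.

```python
# Schematic, untrained propagation steps (no feature transforms) for the two
# convolutions studied, written on a dense adjacency matrix.
import numpy as np

def gcn_step(A: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Original GCN aggregation: D^{-1/2} (A + I) D^{-1/2} X with self-loops."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return (d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]) @ X

def sage_step(A: np.ndarray, X: np.ndarray) -> np.ndarray:
    """GraphSAGE (mean variant): concatenate own signal with the neighborhood mean."""
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)
    return np.concatenate([X, (A @ X) / deg], axis=1)

A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)   # tiny star graph
X = np.array([[1.0], [2.0], [3.0]])                            # node signals
print(gcn_step(A, X).ravel())
print(sage_step(A, X))
```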
Poster
Yixin Yan · QIAO YANG · Ziping Zhao
[ Hall A-E ]
Abstract
Covariance matrix estimation is a fundamental problem in multivariate data analysis. In many situations, variables exhibit a positive linear dependency, i.e., a positive correlation. This paper tackles the challenge of estimating covariance matrices with positive correlations in high-dimensional settings. We propose a positive definite thresholding covariance estimation problem that includes nonconvex sparsity penalties and nonnegative correlation constraints. To address this problem, we introduce a multistage adaptive estimation algorithm based on majorization-minimization (MM). This algorithm progressively refines the estimates by solving a weighted $\ell_{1}$-regularized problem at each stage. Additionally, we present a comprehensive theoretical analysis that characterizes the estimation error associated with the estimates generated by the MM algorithm. The analysis reveals that the error comprises two components: the optimization error and the statistical error. The optimization error decreases to zero at a linear rate, allowing the proposed estimator to eventually reach the oracle statistical rate under mild conditions. Furthermore, we explore various extensions of the proposed estimation technique; in particular, we demonstrate that it can be extended to the correlation matrix estimation scenario. Our theoretical findings are supported by extensive numerical experiments conducted on both synthetic and real-world datasets.
Poster
Lingkai Kong · Yuanqi Du · Wenhao Mu · Kirill Neklyudov · Valentin De Bortoli · Dongxia Wu · Haorui Wang · Aaron Ferber · Yian Ma · Carla Gomes · Chao Zhang
[ Hall A-E ]
Abstract
Addressing real-world optimization problems becomes particularly challenging when analytic objective functions or constraints are unavailable. While numerous studies have addressed the issue of unknown objectives, limited research has focused on scenarios where feasibility constraints are not given explicitly. Overlooking these constraints can lead to spurious solutions that are unrealistic in practice. To deal with such unknown constraints, we propose to perform optimization within the data manifold using diffusion models. To constrain the optimization process to the data manifold, we reformulate the original optimization problem as a sampling problem from the product of the Boltzmann distribution defined by the objective function and the data distribution learned by the diffusion model. Depending on the differentiability of the objective function, we propose two different sampling methods. For differentiable objectives, we propose a two-stage framework that begins with a guided diffusion process for warm-up, followed by a Langevin dynamics stage for further correction. For non-differentiable objectives, we propose an iterative importance sampling strategy using the diffusion model as the proposal distribution. Comprehensive experiments on a synthetic dataset, six real-world black-box optimization datasets, and a multi-objective molecule optimization dataset show that our method achieves better or comparable performance with previous state-of-the-art baselines.
Poster
Juha Harviainen · Pekka Parviainen
[ Hall A-E ]
Abstract
Expert knowledge can greatly reduce the complexity of Bayesian network structure learning by constraining the search space. These constraints can come in the form of ancestral constraints that relate to the existence of paths between nodes. When the constraints are compiled into a directed acyclic graph, the complexity of learning with ancestral constraints is connected to the number of ideals of the constraint graph. First, we consider precedence constraints which define a partial order that the structure must obey. Taking the path cover number of the constraint graph as a parameter, we extend earlier results to the problems of sampling and weighted counting of network structures. We also consider the problems with related ancestral constraints which state that a node must or cannot be an ancestor of another. With positive ancestral constraints, we show that the problems are tractable under the additional assumption that the constraint graph has only a small number of incomparable edges. On the other hand, the optimization problem is NP-hard with negative ancestral constraints when the path cover number is at least two. Finally, we show that these problems become fixed-parameter tractable if the constraints are compatible with a subclass of partial orders called bucket orders.
Poster
Yi Fu · Anthony Tompkins · Yang Song · Maurice Pagnucco
[ Hall A-E ]
Abstract
Satisfiability (SAT) solvers based on techniques such as conflict driven clause learning (CDCL) have produced excellent performance on both synthetic and real world industrial problems. While these CDCL solvers only operate on a per-problem basis, graph neural network (GNN) based solvers bring new benefits to the field by allowing practitioners to exploit knowledge gained from previously solved problems to expedite solving of new SAT problems. However, one specific area that is often studied in the context of CDCL solvers, but largely overlooked in GNN solvers, is the relationship between graph-theoretic measures of structure in SAT problems and the generalisation ability of GNN solvers. To bridge the gap between structural graph properties (e.g., modularity, self-similarity) and the generalisability (or lack thereof) of GNN-based SAT solvers, we present StructureSAT: a curated dataset, along with code to further generate novel examples, containing a diverse set of SAT problems from well known problem domains. Furthermore, we utilise a novel splitting method that focuses on deconstructing the families into more detailed hierarchies based on their structural properties. With the new dataset, we aim to help explain problematic generalisation in existing GNN SAT solvers by exploiting knowledge of structural graph properties. We conclude with …
Poster
Parjanya Prashant · Seyedeh Baharan Khatami · Bruno Ribeiro · Babak Salimi
[ Hall A-E ]
Abstract
We consider the task of out-of-distribution (OOD) generalization, where the distribution shift is due to an unobserved confounder ($Z$) affecting both the covariates ($X$) and the labels ($Y$). This confounding introduces heterogeneity in the predictor, i.e., $P(Y \mid X) = E_{P(Z \mid X)}[P(Y \mid X,Z)]$, making traditional covariate and label shift assumptions unsuitable. OOD generalization differs from traditional domain adaptation in that it does not assume access to the covariate distribution ($X^\text{te}$) of the test samples during training. These conditions create a challenging scenario for OOD robustness: (a) $Z^\text{tr}$ is an unobserved confounder during training, (b) $P^\text{te}(Z) \neq P^\text{tr}(Z)$, (c) $X^\text{te}$ is unavailable during training, and (d) the predictive distribution depends on $P^\text{te}(Z)$. While prior work has developed complex predictors requiring multiple additional variables for identifiability of the latent distribution, we explore a set of identifiability assumptions that yield a surprisingly simple predictor using only a single additional variable. Our approach demonstrates superior empirical performance on several benchmark tasks.
Poster
Mingliang Ma · Abolfazl Safikhani
[ Hall A-E ]
Abstract
The objective of transfer learning is to enhance estimation and inference in a target dataset by leveraging knowledge gained from additional sources. Recent studies have explored transfer learning for independent observations in complex, high-dimensional models assuming sparsity, yet research on time series models remains limited. Our focus is on transfer learning for sequences of observations with temporal dependencies and a more intricate model parameter structure. Specifically, we investigate the vector autoregressive model (VAR), a widely recognized model for time series data, where the transition matrix can be deconstructed into a combination of a sparse matrix and a low-rank one. We propose a new transfer learning algorithm tailored for estimating high-dimensional VAR models characterized by low-rank and sparse structures. Additionally, we present a novel approach for selecting informative observations from auxiliary datasets. Theoretical guarantees are established, encompassing model parameter consistency, informative set selection, and the asymptotic distribution of estimators under mild conditions. The latter facilitates the construction of entry-wise confidence intervals for model parameters. Finally, we demonstrate the empirical efficacy of our methodologies through both simulated and real-world datasets.
Poster
Lyuzhou Chen · Taiyu Ban · Derui Lyu · Yijia Sun · Kangtao Hu · Xiangyu Wang · Huanhuan Chen
[ Hall A-E ]
Abstract
Causal discovery aims to infer a Directed Acyclic Graph (DAG) from observational data to represent causal relationships among variables. Traditional combinatorial methods search DAG spaces to identify optimal structures, while recent advances in continuous optimization improve this search process. However, integrating structural constraints informed by prior knowledge into these methods remains a substantial challenge. Existing methods typically integrate prior knowledge in a hard way, demanding precise information about causal relationships and struggling with erroneous priors. Such rigidity can lead to significant inaccuracies, especially when the priors are flawed. In response to these challenges, this work introduces the Edge Constraint Adaptive (ECA) method, a novel approach that softly represents the presence of edges, allowing for a differentiable representation of prior constraint loss. This soft integration can more flexibly adjust to both accurate and erroneous priors, enhancing both robustness and adaptability. Empirical evaluations demonstrate that our approach effectively leverages prior knowledge to improve causal structure accuracy while maintaining resilience against prior errors, thus offering significant advancements in the field of causal discovery.
Poster
Lorenz Kummer · Wilfried Gansterer · Nils Kriege
[ Hall A-E ]
Abstract
We investigate the vulnerability of Graph Neural Networks (GNNs) to bit-flip attacks (BFAs) by introducing an analytical framework to study the influence of architectural features, graph properties, and their interaction. The expressivity of GNNs refers to their ability to distinguish non-isomorphic graphs and depends on the encoding of node neighborhoods. We examine the vulnerability of neural multiset functions commonly used for this purpose and establish formal criteria to characterize a GNN's susceptibility to losing expressivity due to BFAs. This enables an analysis of the impact of homophily, graph structural variety, feature encoding, and activation functions on GNN robustness. We derive theoretical bounds for the number of bit flips required to degrade GNN expressivity on a dataset, identifying ReLU-activated GNNs operating on highly homophilous graphs with low-dimensional or one-hot encoded features as particularly susceptible. Empirical results using ten real-world datasets confirm the statistical significance of our key theoretical insights and offer actionable results to mitigate BFA risks in expressivity-critical applications.
Poster
Sheng Liu · Zihan Wang · Yuxiao Chen · Qi Lei
[ Hall A-E ]
Abstract
Reconstruction attacks and defenses are essential in understanding the data leakage problem in machine learning. However, prior work has centered around empirical observations of gradient inversion attacks, lacks theoretical grounding, and cannot disentangle the usefulness of defending methods from the computational limitation of attacking methods. In this work, we propose to view the problem as an inverse problem, enabling us to theoretically and systematically evaluate the data reconstruction attack. On various defense methods, we derive the algorithmic upper bound and the matching (in feature and architecture dimensions) information-theoretic lower bound on the reconstruction error for two-layer neural networks. To complement the theoretical results and investigate the utility-privacy trade-off, we define a natural evaluation metric of the defense methods with similar utility loss among the strongest attacks. We further propose a strong reconstruction attack that helps update some previous understanding of the strength of defense methods under our proposed evaluation metric.
Poster
Alexandre Perez-Lebel · Gaël Varoquaux · Sanmi Koyejo · Matthieu Doutreligne · Marine Le Morvan
[ Hall A-E ]
Abstract
Probabilistic classifiers are central for making informed decisions under uncertainty. Based on the maximum expected utility principle, optimal decision rules can be derived using the posterior class probabilities and misclassification costs. Yet, in practice only learned approximations of the oracle posterior probabilities are available. In this work, we quantify the excess risk (a.k.a. regret) incurred using approximate posterior probabilities in batch binary decision-making. We provide analytical expressions for miscalibration-induced regret ($R^{CL}$), as well as tight and informative upper and lower bounds on the regret of calibrated classifiers ($R^{GL}$). These expressions allow us to identify regimes where recalibration alone addresses most of the regret, and regimes where the regret is dominated by the grouping loss, which calls for post-training beyond recalibration. Crucially, both $R^{CL}$ and $R^{GL}$ can be estimated in practice using a calibration curve and a recent grouping loss estimator. In NLP experiments, we show that these quantities identify when the expected gain of more advanced post-training is worth the operational cost. Finally, we highlight the potential of multicalibration approaches as efficient alternatives to costlier fine-tuning approaches.
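As background for the regret analysis, the snippet below shows the standard cost-sensitive Bayes decision rule that serves as the reference: threshold the (approximate) posterior at $c_{FP}/(c_{FP}+c_{FN})$. The costs and probabilities are illustrative; this is the textbook rule, not the paper's estimators.

```python
# Standard cost-sensitive Bayes decision rule (background, not the paper's method):
# with false-positive cost c_fp and false-negative cost c_fn, the optimal rule
# thresholds the posterior P(Y=1 | x) at c_fp / (c_fp + c_fn).
import numpy as np

def bayes_decision(posterior_pos: np.ndarray, c_fp: float, c_fn: float) -> np.ndarray:
    threshold = c_fp / (c_fp + c_fn)
    return (posterior_pos >= threshold).astype(int)

probs = np.array([0.10, 0.35, 0.60, 0.90])        # (approximate) posteriors
print(bayes_decision(probs, c_fp=1.0, c_fn=4.0))  # threshold 0.2 -> [0 1 1 1]
```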
Poster
Gaëtan Serré · Argyris Kalogeratos · Nicolas Vayatis
[ Hall A-E ]
Abstract
In this paper, we present a deterministic particle-based method for global optimization of continuous Sobolev functions, called *Stein Boltzmann Sampling* (SBS). SBS initializes a number of particles representing candidate solutions uniformly, then uses the *Stein Variational Gradient Descent* (SVGD) algorithm to sequentially and deterministically move those particles in order to approximate a target distribution whose mass is concentrated around promising areas of the domain of the optimized function. The target is chosen to be a properly parametrized Boltzmann distribution. For the purpose of global optimization, we adapt the generic SVGD theoretical framework to address more general target distributions over a compact subset of $\mathbb{R}^d$, and we prove SBS's asymptotic convergence. In addition to the main SBS algorithm, we present two variants: SBS-PF, which includes a particle filtering strategy, and SBS-HYBRID, which uses SBS or SBS-PF as a continuation after other particle- or distribution-based optimization methods. A detailed comparison with state-of-the-art methods on benchmark functions demonstrates that SBS and its variants are highly competitive, while the combination of the two variants provides the best trade-off between accuracy and computational cost.
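Since SBS builds on SVGD, a minimal generic SVGD update with an RBF kernel is sketched below for a simple Boltzmann target; this is only the standard inner step, not the SBS, SBS-PF, or SBS-HYBRID procedures, and the bandwidth, step size, and target are arbitrary choices.

```python
# Generic SVGD update with an RBF kernel, applied to a simple Boltzmann target
# exp(-||x||^2 / (2 T)). Bandwidth and step size are arbitrary illustrative values.
import numpy as np

def svgd_step(x, grad_log_p, stepsize=0.05, bandwidth=1.0):
    """x: (n, d) particles; grad_log_p(x) returns the (n, d) score of the target."""
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]                 # diff[i, j] = x_i - x_j
    k = np.exp(-np.sum(diff ** 2, axis=-1) / bandwidth)  # RBF kernel matrix
    attract = k @ grad_log_p(x)                          # kernel-weighted scores
    repulse = (2.0 / bandwidth) * (k[:, :, None] * diff).sum(axis=1)
    return x + stepsize * (attract + repulse) / n

grad_log_p = lambda x, T=0.5: -x / T                     # score of the Boltzmann target
particles = np.random.default_rng(0).uniform(-3, 3, size=(200, 2))
for _ in range(300):
    particles = svgd_step(particles, grad_log_p)
print(particles.mean(axis=0), particles.var(axis=0))     # roughly mean 0, variance T
```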
Poster
firooz shahriari-mehr · Ashkan Panahi
[ Hall A-E ]
Abstract
We propose a novel decentralized convex optimization algorithm called ASY-DAGP, where each agent has its own distinct objective function and constraint set. Agents compute at different speeds, and their communication is delayed and directed. Employing local buffers, ASY-DAGP enhances asynchronous communication and is robust to challenging scenarios such as message failure. We validate these features through numerical experiments. By analyzing ASY-DAGP, we provide the first sublinear convergence rate for the above setup under mild assumptions. This rate depends on a novel characterization of delay profiles, which we term the delay factor. We calculate the delay factor for the well-known bounded delay profiles, providing new insights for these scenarios. Our analysis is conducted by introducing a novel approach tied to the celebrated PEP framework. Our approach does not require the design of Lyapunov functions and instead provides a novel view of optimization algorithms as linear systems.
Poster
Davin Hill · Joshua Bone · Aria Masoomi · Max Torop · Jennifer Dy
[ Hall A-E ]
Abstract
Explainability methods are often challenging to evaluate and compare. With a multitude of explainers available, practitioners must often compare and select explainers based on quantitative evaluation metrics. One particular differentiator between explainers is the diversity of explanations for a given dataset; i.e. whether all explanations are identical, unique and uniformly distributed, or somewhere between these two extremes. In this work, we define a complexity measure for explainers, globalness, which enables deeper understanding of the distribution of explanations produced by feature attribution and feature selection methods for a given dataset. We establish the axiomatic properties that any such measure should possess and prove that our proposed measure, Wasserstein Globalness, meets these criteria. We validate the utility of Wasserstein Globalness using image, tabular, and synthetic datasets, empirically showing that it both facilitates meaningful comparison between explainers and improves the selection process for explainability methods.
Poster
Spencer Hutchinson · Tianyi Chen · Mahnoosh Alizadeh
[ Hall A-E ]
Abstract
We study the problem of online convex optimization (OCO) under unknown linear constraints that are either static, or stochastically time-varying. For this problem, we introduce an algorithm that we term Optimistically Safe OCO (OSOCO) and show that it enjoys $\tilde{O}(\sqrt{T})$ regret and no constraint violation. In the case of static linear constraints, this improves on the previous best known $\tilde{O}(T^{2/3})$ regret under the same assumptions. In the case of stochastic time-varying constraints, our work supplements existing results that show $O(\sqrt{T})$ regret and $O(\sqrt{T})$ cumulative violation under more general convex constraints and a different set of assumptions. In addition to our theoretical guarantees, we also give numerical results that further validate the effectiveness of our approach.
Poster
Zijun Gao · Shu Ge · Jian Qian
[ Hall A-E ]
Abstract
Under the prevalent potential outcome model in causal inference, each unit is associated with multiple potential outcomes, of which at most one is observed, leading to many causal quantities being only partially identified. The inherent missing data issue echoes the multi-marginal optimal transport (MOT) problem, where marginal distributions are known, but how the marginals couple to form the joint distribution is unavailable. In this paper, we cast the causal partial identification problem in the framework of MOT with $K$ margins and $d$-dimensional outcomes and obtain the exact partially identified set. To estimate the partially identified set via MOT, we establish a convergence rate of the plug-in MOT estimator for the $\ell_2$ cost function stemming from the variance minimization problem and prove that it is minimax optimal for arbitrary $K$ and $d \le 4$. We also extend the convergence result to general quadratic objective functions. Numerically, we demonstrate the efficacy of our method over synthetic datasets and several real-world datasets where our proposal consistently outperforms the baseline by a significant margin (over 70\%). In addition, we provide efficient off-the-shelf implementations of MOT with general objective functions.
Poster
Alex Chebbah · Christian L. Vestergaard · jean-baptiste masson · Etienne Boursier
[ Hall A-E ]
Abstract
Entropy maximization and free energy minimization are general physics principles for modeling dynamic systems. Notable examples include modeling decision-making within the brain using the free-energy principle, optimizing the accuracy-complexity trade-off when accessing hidden variables with the information bottleneck principle (Tishby et al. 2000), and navigation in random environments using information maximization (Vergassola et al. 2007). Building on these principles, we propose a new class of bandit algorithms that maximize an approximation to the information of a key variable within the system. To this end, we develop an approximated, analytical physics-based representation of the entropy to forecast the information gain of each action and greedily choose the one with the largest information gain. This method yields strong performance in classical bandit settings. Motivated by its empirical success, we prove its asymptotic optimality for the multi-armed bandit problem with Gaussian rewards. Since it encompasses the system's properties in a single, global functional, this approach can be efficiently adapted to more complex bandit settings. This calls for further investigation of information maximization approaches for multi-armed bandit problems.
Poster
Ziqing Xu · Hancheng Min · Lachlan MacDonald · Jinqi Luo · Salma Tarmoun · Enrique Mallada · Rene Vidal
[ Hall A-E ]
Abstract
Despite the empirical success of Low-Rank Adaptation (LoRA) in fine-tuning pre-trained models, there is little theoretical understanding of how first-order methods with carefully crafted initialization adapt models to new tasks. In this work, we take the first step towards bridging this gap by theoretically analyzing the learning dynamics of LoRA for matrix factorization (MF) under gradient flow (GF), emphasizing the crucial role of initialization. For small initialization, we theoretically show that GF converges to a neighborhood of the optimal solution, with smaller initialization leading to lower final error. Our analysis shows that the final error is affected by the misalignment between the singular spaces of the pre-trained model and the target matrix, and reducing the initialization scale improves alignment. To address this misalignment, we propose a spectral initialization for LoRA in MF and theoretically prove that GF with small spectral initialization converges to the fine-tuning task with arbitrary precision. Numerical experiments from MF and image classification validate our findings.
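A toy numerical sketch of the two initializations (an assumed set-up, not the paper's gradient-flow analysis): with a small random initialization the adapter starts far from the target residual, whereas a spectral initialization taken from the top singular factors of the residual starts at its best rank-$r$ approximation.

```python
# Toy comparison of two initializations for a LoRA-style factorization
# W0 + B @ A approximating a target with a higher-rank shift (all values invented).
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 20, 15, 4
W0 = rng.normal(size=(m, n))                                        # "pre-trained" weights
W_target = W0 + rng.normal(size=(m, 2 * r)) @ rng.normal(size=(2 * r, n))

# Small (near-zero) random initialization.
B_small, A_small = 1e-3 * rng.normal(size=(m, r)), 1e-3 * rng.normal(size=(r, n))

# Spectral initialization: top-r singular factors of the residual W_target - W0.
U, s, Vt = np.linalg.svd(W_target - W0, full_matrices=False)
B_spec = U[:, :r] * np.sqrt(s[:r])
A_spec = np.sqrt(s[:r])[:, None] * Vt[:r]

for name, B, A in [("small", B_small, A_small), ("spectral", B_spec, A_spec)]:
    err = np.linalg.norm(W0 + B @ A - W_target)
    print(f"{name:8s} init residual: {err:.3f}")
```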
Poster
Feihu Huang · Chunyu Xuan · Xinrui Wang · Siqi Zhang · Songcan Chen
[ Hall A-E ]
Abstract
Minimax optimization has recently been widely applied in many machine learning tasks such as generative adversarial networks, robust learning and reinforcement learning. In this paper, we study a class of nonconvex-nonconcave minimax optimization problems with nonsmooth regularization, where the objective function is possibly nonconvex in the primal variable $x$, and is nonconcave and satisfies the Polyak-Lojasiewicz (PL) condition in the dual variable $y$. Moreover, we propose a class of enhanced momentum-based gradient descent ascent methods (i.e., MSGDA and AdaMSGDA) to solve these stochastic nonconvex-PL minimax problems. In particular, our AdaMSGDA algorithm can use various adaptive learning rates in updating the variables $x$ and $y$ without relying on any specific form. Theoretically, we prove that our methods attain the best known sample complexity of $\tilde{O}(\epsilon^{-3})$, requiring only one sample per iteration, for finding an $\epsilon$-stationary solution. Some numerical experiments on PL-game and Wasserstein-GAN demonstrate the efficiency of our proposed methods.
Poster
Charles Westphal · Stephen Hailes · Mirco Musolesi
[ Hall A-E ]
Abstract
In this paper, we introduce Partial Information Decomposition of Features (PIDF), a new paradigm for simultaneous data interpretability and feature selection. Contrary to traditional methods that assign a single importance value, our approach is based on three metrics per feature: the mutual information shared with the target variable, the feature’s contribution to synergistic information, and the amount of this information that is redundant. In particular, we develop a novel procedure based on these three metrics, which reveals not only how features are correlated with the target but also the additional and overlapping information provided by considering them in combination with other features. We extensively evaluate PIDF using both synthetic and real-world data, demonstrating its potential applications and effectiveness, by considering case studies from genetics and neuroscience.
Poster
Jiajun He · Wenlin Chen · Mingtian Zhang · David Barber · Jose Miguel Hernandez-Lobato
[ Hall A-E ]
Abstract
Training generative models to sample from unnormalized density functions is an important and challenging task in machine learning. Traditional training methods often rely on the reverse Kullback-Leibler (KL) divergence due to its tractability. However, the mode-seeking behavior of reverse KL hinders effective approximation of multi-modal target distributions. To address this, we propose to minimize the reverse KL along diffusion trajectories of both model and target densities. We refer to this objective as the reverse diffusive KL divergence, which allows the model to capture multiple modes. Leveraging this objective, we train neural samplers that can efficiently generate samples from the target distribution in one step. We demonstrate that our method enhances sampling performance across various Boltzmann distributions, including both synthetic multi-modal densities and n-body particle systems.
Poster
Ruiyang Jin · Zaiwei Chen · Yiheng Lin · Jie Song · Adam Wierman
[ Hall A-E ]
Abstract
Independent learning (IL) is a popular approach for achieving scalability in large-scale multi-agent systems, yet it typically lacks global convergence guarantees. In this paper, we study two representative algorithms—independent $Q$-learning and independent natural actor-critic—within both value-based and policy-based frameworks, and provide the first finite-sample analysis for approximate global convergence. Our results show that IL can achieve global convergence up to a fixed error arising from agent interdependence, which characterizes the fundamental limit of IL in achieving true global convergence. To establish these results, we develop a novel approach by constructing a separable Markov decision process (MDP) for convergence analysis and then bounding the gap caused by the model discrepancy between this separable MDP and the original one. Finally, we present numerical experiments using a synthetic MDP and an electric vehicle charging example to demonstrate our findings and the practical applicability of IL.
Poster
Shireen Kudukkil Manchingal · Muhammad Mubashar · Kaizheng Wang · Fabio Cuzzolin
[ Hall A-E ]
Abstract
Predictions of uncertainty-aware models are diverse, ranging from single point estimates (often averaged over prediction samples) to predictive distributions, to set-valued or credal-set representations. We propose a novel unified evaluation framework for uncertainty-aware classifiers, applicable to a wide range of model classes, which allows users to tailor the trade-off between accuracy and precision of predictions via a suitably designed performance metric. This makes possible the selection of the most suitable model for a particular real-world application as a function of the desired trade-off. Our experiments, concerning Bayesian, ensemble, evidential, deterministic, credal and belief function classifiers on the CIFAR-10, MNIST and CIFAR-100 datasets, show that the metric behaves as desired.
Poster
Minsu Kim · Sanghyeok Choi · Hyeonah Kim · Jiwoo Son · Jinkyoo Park · Yoshua Bengio
[ Hall A-E ]
Abstract
We present the Generative Flow Ant Colony Sampler (GFACS), a novel meta-heuristic method that hierarchically combines amortized inference and parallel stochastic search. Our method first leverages Generative Flow Networks (GFlowNets) to amortize a multi-modal prior distribution over combinatorial solution space that encompasses both high-reward and diversified solutions. This prior is iteratively updated via parallel stochastic search in the spirit of Ant Colony Optimization (ACO), leading to the posterior distribution that generates near-optimal solutions. Extensive experiments across seven combinatorial optimization problems demonstrate GFACS's promising performances.
Poster
Disha Hegde · Mohamed Adil · Jon Cockayne
[ Hall A-E ]
Abstract
Gaussian processes are notorious for scaling cubically with the size of the training set, preventing application to very large regression problems. Computation-aware Gaussian processes (CAGPs) tackle this scaling issue by exploiting probabilistic linear solvers to reduce complexity, widening the posterior with additional *computational* uncertainty due to reduced computation. However, the most commonly used CAGP framework results in (sometimes dramatically) conservative uncertainty quantification, making the posterior difficult to use in practice. In this work, we prove that if the utilised probabilistic linear solver is *calibrated*, in a rigorous statistical sense, then so too is the induced CAGP. We thus propose a new CAGP framework, CAGP-GS, based on using Gauss-Seidel iterations for the underlying probabilistic linear solver. CAGP-GS performs favourably compared to existing approaches when the test set is low-dimensional and few iterations are performed. We test the calibratedness on a synthetic problem, and compare the performance to existing approaches on a large-scale global temperature regression problem.
Poster
Yuxiong Gao · Wentao Li · Rong Chen
[ Hall A-E ]
Abstract
State-space models have been used in many applications, including econometrics, engineering, medical research, etc. The maximum likelihood estimation (MLE) of the static parameter of general state-space models is not straightforward because the likelihood function is intractable. It is popular to use the sequential Monte Carlo (SMC) method to perform gradient ascent optimisation in either offline or online fashion. One problem with existing online SMC methods for MLE is that the score estimators are inconsistent, i.e. the bias does not vanish with increasing particle size. In this paper, two SMC algorithms are proposed based on an importance sampling weight function to use each set of generated particles more efficiently. The first one is an offline algorithm that locally approximates the likelihood function using importance sampling, where the locality is adapted by the effective sample size (ESS). The second one is a semi-online algorithm that has a computational cost linear in the particle size and uses score estimators that are consistent. We study its consistency and asymptotic normality. Their computational superiority is illustrated in numerical studies for long time series.
Poster
Ben Aoki-Sherwood · Catherine Bregou · David Liben-Nowell · Kiran Tomlinson · Thomas Zeng
[ Hall A-E ]
Abstract
The widely used Plackett-Luce ranking model assumes that individuals rank items by making repeated choices from a universe of items. But in many cases the universe is too big for people to plausibly consider all options. In the choice literature, this issue has been addressed by supposing that individuals first sample a small consideration set and then choose among the considered items. However, inferring unobserved consideration sets (or item consideration probabilities) in this ``consider then choose'' setting poses significant challenges, because even simple models of consideration with strong independence assumptions are not identifiable, even if item utilities are known. We apply the consider-then-choose framework to top-$k$ rankings, where we assume rankings are constructed according to a Plackett-Luce model after sampling a consideration set. While item consideration probabilities remain non-identified in this setting, we prove that we can infer bounds on the relative sizes of consideration probabilities. Additionally, given a condition on the expected consideration set size and known item utilities, we derive absolute upper and lower bounds on item consideration probabilities. We also provide algorithms to tighten those bounds on consideration probabilities by propagating inferred constraints. Thus, we show that we can learn useful information about consideration probabilities despite not being …
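To fix ideas, the following generative sketch samples a top-$k$ ranking from a consider-then-choose model: items enter a consideration set independently, and a Plackett-Luce ranking is drawn over the considered items; the utilities and consideration probabilities are hypothetical values, not the paper's estimates.

```python
# Generative sketch of consider-then-choose top-k rankings: items enter the
# consideration set independently, then a Plackett-Luce ranking is drawn over
# the considered items. Utilities and consideration probabilities are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
utilities = np.array([2.0, 1.0, 0.5, 0.0, -1.0])
consider_p = np.array([0.9, 0.8, 0.6, 0.5, 0.3])

def sample_top_k(k: int) -> list:
    considered = np.flatnonzero(rng.random(len(utilities)) < consider_p)
    weights = np.exp(utilities[considered])
    pool = list(considered)
    ranking = []
    for _ in range(min(k, len(pool))):
        idx = rng.choice(len(pool), p=weights / weights.sum())  # Plackett-Luce choice
        ranking.append(int(pool.pop(idx)))
        weights = np.delete(weights, idx)
    return ranking

print([sample_top_k(3) for _ in range(3)])
```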
Poster
Sunmin Oh · Seungsu Han · Gunwoong Park
[ Hall A-E ]
Abstract
Much of science involves discovering and modeling causal relationships in nature. Significant progress has been made in developing statistical methods for representing and identifying causal knowledge from data using Linear Non-Gaussian Acyclic Models (LiNGAMs). Despite successes in learning LiNGAMs across various sample settings, the optimal sample complexity for high-dimensional LiNGAMs remains unexplored. This study establishes the optimal sample complexity for learning the structure of LiNGAMs under a sub-Gaussianity assumption. Specifically, it introduces a structure recovery algorithm using distance covariance that achieves the optimal sample complexity, $n = \Theta(d_{in} \log \frac{p}{d_{in}})$, without assuming faithfulness or a known indegree. The theoretical findings and superiority of the proposed algorithm compared to existing algorithms are validated through numerical experiments and real data analysis.
Poster
Rina Dechter · Annie Raichev · Jin Tian · Alexander Ihler
[ Hall A-E ]
Abstract
This paper focuses on the computational complexity of computing empirical plug-in estimates for causal effect queries. Given a causal graph and observational data, any identifiable causal query can be estimated from an expression over the observed variables, called the estimand. The estimand can then be evaluated by plugging in probabilities computed empirically from data. In contrast to conventional wisdom, which assumes that high-dimensional probabilistic functions will lead to exponential evaluation time, we show that estimand evaluation can be done efficiently, potentially in time linear in the data size, depending on the estimand's hypergraph. In particular, we show that both the $\it{treewidth}$ and $\it{hypertree width}$ of the estimand's structure bound the evaluation complexity, analogous to their role in bounding the complexity of inference in probabilistic graphical models. In settings with high-dimensional functions, the hypertree width often provides a more effective bound, since the empirical distributions are sparse.
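A textbook instance of such a plug-in estimand is back-door adjustment, evaluated below by summing only over the confounder configurations that actually occur in the data; the simulated data are illustrative and this is not the paper's algorithm.

```python
# Textbook plug-in estimand (back-door adjustment), evaluated by summing only
# over confounder configurations that actually occur in the simulated data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000
Z = rng.integers(0, 2, size=(n, 3))                              # observed confounders
X = rng.binomial(1, 1 / (1 + np.exp(-(Z.sum(axis=1) - 1.5))))
Y = rng.binomial(1, 1 / (1 + np.exp(-(1.5 * X + Z[:, 0] - 1.0))))
df = pd.DataFrame({"X": X, "Y": Y, "Z0": Z[:, 0], "Z1": Z[:, 1], "Z2": Z[:, 2]})

zcols = ["Z0", "Z1", "Z2"]
pz = df.groupby(zcols).size() / n                                # empirical P(z)
py_xz = df[df.X == 1].groupby(zcols)["Y"].mean()                 # empirical P(Y=1 | X=1, z)

# Plug-in estimate of P(Y=1 | do(X=1)) = sum_z P(Y=1 | X=1, z) P(z)
estimate = float((py_xz.reindex(pz.index).fillna(0.0) * pz).sum())
print(f"P(Y=1 | do(X=1)) ~= {estimate:.3f}")
```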
Poster
James Cheshire · Stephan Clemencon
[ Hall A-E ]
Abstract
In this article, bipartite ranking, a statistical learning problem involved in many applications and widely studied in the passive context, is approached in a much more general active setting than the discrete one previously considered in the literature. While the latter assumes that the conditional distribution is piecewise constant, the framework we develop in contrast permits dealing with continuous conditional distributions, provided that they fulfill a Hölder smoothness constraint. We first show that a naive approach based on discretisation at a uniform level, fixed a priori, and consisting in next applying the active strategy designed for the discrete setting, generally fails. Instead, we propose a novel algorithm, referred to as smooth-rank and designed for the continuous setting, which aims to minimise the distance between the ROC curve of the estimated ranking rule and the optimal one w.r.t. the $\sup$ norm. We show that, for a fixed confidence level $\epsilon>0$ and probability $\delta\in (0,1)$, smooth-rank is PAC$(\epsilon,\delta)$. In addition, we provide a problem-dependent upper bound on the expected sampling time of smooth-rank and establish a problem-dependent lower bound on the expected sampling time of any PAC$(\epsilon,\delta)$ algorithm. Beyond the theoretical analysis carried out, numerical results are …
Poster
Danial Davarnia · Mohammadreza Kiaghadi
[ Hall A-E ]
Abstract
Optimization problems with norm-bounding constraints appear in various applications, from portfolio optimization to machine learning, feature selection, and beyond. A widely used variant of these problems relaxes the norm-bounding constraint through Lagrangian relaxation and moves it to the objective function as a form of penalty or regularization term. A challenging class of these models uses the zero-norm function to induce sparsity in statistical parameter estimation models. Most existing exact solution methods for these problems use additional binary variables together with artificial bounds on variables to formulate them as a mixed-integer program in a higher dimension, which is then solved by off-the-shelf solvers. Other exact methods utilize specific structural properties of the objective function to solve certain variants of these problems, making them non-generalizable to other problems with different structures. An alternative approach employs nonconvex penalties with desirable statistical properties, which are solved using heuristic or local methods due to the structural complexity of those terms. In this paper, we develop a novel graph-based method to globally solve optimization problems that contain a generalization of norm-bounding constraints. This includes standard $\ell_p$-norms for $p \in [0, \infty)$ as well as nonconvex penalty terms, such as SCAD and MCP, as special cases. Our …
Poster
Kevin Rojas · Yixin Tan · Molei Tao · Yuriy Nevmyvaka · Wei Deng
[ Hall A-E ]
Abstract
The Momentum Schrödinger Bridge (mSB) \citep{mSB} has emerged as a leading method for accelerating generative diffusion processes and reducing transport costs. However, the lack of simulation-free properties inevitably results in high training costs and affects scalability. To obtain a trade-off between transport properties and scalability, we introduce variational Schrödinger momentum diffusion (VSMD), which employs linearized forward score functions (variational scores) to eliminate the dependence on simulated forward trajectories. Our approach leverages a multivariate diffusion process with adaptively transport-optimized variational scores. Additionally, we apply a critical-damping transform to stabilize training by removing the need for score estimations for both velocity and samples. Theoretically, we prove the convergence of samples generated with optimal variational scores and momentum diffusion. Empirical results demonstrate that VSMD efficiently generates anisotropic shapes while maintaining transport efficacy, outperforming overdamped alternatives, and avoiding complex denoising processes. Our approach also scales effectively to real-world data, achieving competitive results in time series and image generation, both in unconditional and conditional settings.
Poster
Nong Hieu · Jeremie Houssineau · Neil Chada · Emmanuel Delande
[ Hall A-E ]
Abstract
The special role of epistemic uncertainty in Machine Learning is now well recognised, and an increasing amount of research is focused on methods for dealing specifically with such a lack of knowledge. Yet, most often, a probabilistic representation is considered for both aleatoric and epistemic uncertainties, hence creating challenges in applications where decoupling these two types of uncertainty is necessary. In this work, we show that an alternative representation of epistemic uncertainty, based on possibility theory, maintains many of the convenient features of standard Bayesian inference while displaying specific behaviours and properties that closely match the ones of an intuitive notion of information. Our main contributions are: i) a general framework for jointly representing epistemic and aleatoric uncertainties, ii) a Bernstein-von Mises theorem for the analogue of Bayes' rule in possibility theory, iii) a version of the law of large numbers and of the central limit theorem for the associated variables, and iv) an analysis of the properties of the possibilistic maximum a posteriori. These results highlight that a dedicated and principled representation of epistemic uncertainty, that is compatible with standard Bayesian inference and preserves many of its strengths, is attainable.
Poster
Patrik Okanovic · Andreas Kirsch · Jannes Kasper · Torsten Hoefler · Andreas Krause · Nezihe Merve Gürel
[ Hall A-E ]
Abstract
We introduce MODEL SELECTOR, a framework for label-efficient selection of pretrained classifiers. Given a pool of unlabeled target data, MODEL SELECTOR samples a small subset of highly informative examples for labeling, in order to efficiently identify the best pretrained model for deployment on this target dataset. Through extensive experiments, we demonstrate that MODEL SELECTOR drastically reduces the need for labeled data while consistently picking the best or near-best performing model. Across 18 model collections on 16 different datasets, comprising over 1,500 pretrained models, MODEL SELECTOR reduces the labeling cost by up to 94.15% to identify the best model compared to the cost of the strongest baseline. Our results further highlight the robustness of MODEL SELECTOR in model selection, as it reduces the labeling cost by up to 72.41% when selecting a near-best model, whose accuracy is only within 1% of the best model.
Poster
Florian Hübler · Ilyas Fatkhullin · Niao He
[ Hall A-E ]
Abstract
Recent empirical evidence indicates that many machine learning applications involve heavy-tailed gradient noise, which challenges the standard assumptions of bounded variance in stochastic optimization. Gradient clipping has emerged as a popular tool to handle this heavy-tailed noise, as it achieves good performance both theoretically and practically. However, our current theoretical understanding of non-convex gradient clipping has three main shortcomings. First, the theory hinges on large, increasing clipping thresholds, which are in stark contrast to the small constant clipping thresholds employed in practice. Second, clipping thresholds require knowledge of problem-dependent parameters to guarantee convergence. Lastly, even with this knowledge, current sample complexity upper bounds for the method are sub-optimal in nearly all parameters. To address these issues and motivated by practical observations, we connect gradient clipping to its close relative --- Normalized SGD (NSGD) --- and study its convergence properties. First, we establish a parameter-free sample complexity for NSGD of $\mathcal{O}\left(\varepsilon^{-\frac{2p}{p-1}}\right)$ to find an $\varepsilon$-stationary point, only assuming a finite $p$-th central moment of the noise, $p\in(1,2]$. Furthermore, we prove the tightness of this result by providing a matching algorithm-specific lower bound. In the setting where all problem parameters are known, we show this complexity is improved to $\mathcal{O}\left(\varepsilon^{-\frac{3p-2}{p-1}}\right)$, …
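The two update rules being related can be stated in a few lines; the sketch below contrasts a clipped gradient step with an NSGD step on a single stochastic gradient (values chosen only for illustration).

```python
# The two updates being related, for a single stochastic gradient g:
# clipping rescales only when ||g|| exceeds the threshold c, while NSGD always
# takes a step of length lr in the gradient direction. Values are illustrative.
import numpy as np

def clipped_step(x, g, lr, c):
    scale = min(1.0, c / (np.linalg.norm(g) + 1e-12))
    return x - lr * scale * g

def normalized_step(x, g, lr):
    return x - lr * g / (np.linalg.norm(g) + 1e-12)

x = np.array([1.0, -2.0])
g = np.array([10.0, 5.0])            # heavy-tailed noise can make ||g|| very large
print(clipped_step(x, g, lr=0.1, c=1.0))
print(normalized_step(x, g, lr=0.1))
```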
Poster
Chen Xu · Xiuyuan Cheng · Yao Xie
[ Hall A-E ]
Abstract
Computing optimal transport (OT) for general high-dimensional data has been a long-standing challenge. Despite much progress, most of the efforts including neural network methods have been focused on the static formulation of the OT problem. The current work proposes to compute the dynamic OT between two arbitrary distributions $P$ and $Q$ by optimizing a flow model, where both distributions are only accessible via finite samples. Our method learns the dynamic OT by finding an invertible flow that minimizes the transport cost. The trained optimal transport flow subsequently allows for performing many downstream tasks, including infinitesimal density ratio estimation (DRE) and domain adaptation by interpolating distributions in the latent space. The effectiveness of the proposed model on high-dimensional data is demonstrated by strong empirical performance on OT baselines, image-to-image translation, and high-dimensional DRE.
Poster
Van Khoa NGUYEN · Maciej Falkiewicz · Giangiacomo Mercatali · Alexandros Kalousis
[ Hall A-E ]
Abstract
Traditional molecule generation methods often rely on sequence- or graph-based representations, which can limit their expressive power or require complex permutation-equivariant architectures. This paper introduces a novel paradigm for learning molecule generative models based on functional representations. Specifically, we propose Molecular Implicit Neural Generation (MING), a diffusion-based model that learns molecular distributions in the function space. Unlike standard diffusion processes in the data space, MING employs a novel functional denoising probabilistic process, which jointly denoises information in both the function's input and output spaces by leveraging an expectation-maximization procedure for latent implicit neural representations of data. This approach enables a simple yet effective model design that accurately captures underlying function distributions. Experimental results on molecule-related datasets demonstrate MING's superior performance and ability to generate plausible molecular samples, surpassing state-of-the-art data-space methods while offering a more streamlined architecture and significantly faster generation times. The code is available at https://github.com/v18nguye/MING.
Poster
Daniel Paulin · Peter Whalley · Neil Chada · Benedict Leimkuhler
[ Hall A-E ]
Abstract
We propose a scalable kinetic Langevin dynamics algorithm for sampling parameter spaces of big data and AI applications. Our scheme combines a symmetric forward/backward sweep over minibatches with a symmetric discretization of Langevin dynamics. For a particular Langevin splitting method (UBU), we show that the resulting Symmetric Minibatch Splitting-UBU (SMS-UBU) integrator has bias $\mathcal{O}(h^2 d^{1/2})$ in dimension $d>0$ with stepsize $h>0$, despite only using one minibatch per iteration, thus providing excellent control of the sampling bias as a function of the stepsize. We apply the algorithm to explore local modes of the posterior distribution of Bayesian neural networks (BNNs) and evaluate the calibration performance of the posterior predictive probabilities for neural networks with convolutional neural network architectures for classification problems on three different datasets (Fashion-MNIST, Celeb-A and chest X-ray). Our results indicate that BNNs sampled with SMS-UBU can offer significantly better calibration performance compared to standard methods of training and stochastic weight averaging.
Poster
Gilad Yehudai · Alon Cohen · Amit Daniely · Yoel Drori · Tomer Koren · Mariano Schain
[ Hall A-E ]
Abstract
We introduce a novel dynamic learning-rate scheduling scheme grounded in theory with the goal of simplifying the manual and time-consuming tuning of schedules in practice. Our approach is based on estimating the locally-optimal stepsize, guaranteeing maximal descent in the direction of the stochastic gradient of the current step. We first establish theoretical convergence bounds for our method within the context of smooth non-convex stochastic optimization. We then present a practical implementation of our algorithm and conduct systematic experiments across diverse datasets and optimization algorithms, comparing our scheme with existing state-of-the-art learning-rate schedulers. Our findings indicate that our method needs minimal tuning when compared to existing approaches, removing the need for auxiliary manual schedules and warm-up phases while achieving comparable performance with drastically reduced parameter tuning.
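A minimal sketch of the locally-optimal stepsize idea described above, using a one-dimensional quadratic model along the current stochastic gradient; the probe size, clipping range, and function names are my own assumptions, not the authors' estimator:

```python
import numpy as np

def locally_optimal_stepsize(f, x, g, probe=1e-2, eta_max=1.0):
    """Estimate the stepsize maximizing descent along -g for the current
    mini-batch loss f, via a one-dimensional quadratic model.

    Illustrative sketch of the 'locally-optimal stepsize' idea only;
    f, x, g, the probe size and the fallback rule are assumptions."""
    f0 = f(x)
    slope = -np.dot(g, g)                      # directional derivative along -g
    f_probe = f(x - probe * g)                 # one extra loss evaluation
    curv = (f_probe - f0 - slope * probe) / probe**2
    if curv <= 0:                              # locally non-convex: fall back
        return eta_max
    return float(np.clip(-slope / (2.0 * curv), 0.0, eta_max))

# toy usage on a quadratic loss, with the exact gradient standing in for a stochastic one
A = np.diag([1.0, 10.0])
loss = lambda x: 0.5 * x @ A @ x
x = np.array([1.0, 1.0])
g = A @ x
eta = locally_optimal_stepsize(loss, x, g)
print(eta, loss(x), loss(x - eta * g))
```

In a practical scheduler one would presumably recompute such an estimate per step from mini-batch losses, at the cost of one extra forward pass.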
Poster
Kuldeep S. Meel · Gunjan Kumar · Yash Pote
[ Hall A-E ]
Abstract
Given two distributions $\mathcal{P}$ and $\mathcal{Q}$ over a high-dimensional domain $\{0,1\}^n$, and a parameter $\varepsilon$, the goal of distance estimation is to determine the statistical distance between $\mathcal{P}$ and $\mathcal{Q}$, up to an additive tolerance $\pm \varepsilon$. Since exponential lower bounds (in $n$) are known for the problem in the standard sampling model, research has focused on richer query models where one can draw conditional samples. This paper presents the first polynomial query distance estimator in the conditional sampling model ($\mathsf{COND}$). We base our algorithm on the relatively weaker \textit{subcube conditional} sampling ($\mathsf{SUBCOND}$) oracle, which draws samples from the distribution conditioned on some of the dimensions. $\mathsf{SUBCOND}$ is a promising model for widespread practical use because it captures the natural behavior of discrete samplers. Our algorithm makes $\tilde{\mathcal{O}}(n^3/\varepsilon^5)$ queries to $\mathsf{SUBCOND}$.
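As a point of reference for what a $\mathsf{SUBCOND}$ query does, here is a toy subcube-conditional sampler built by rejection on top of an ordinary sampler; a native discrete sampler would condition directly, so this only illustrates the interface, not the paper's estimator:

```python
import random

def subcond_sample(sampler, fixed, max_tries=100_000):
    """Draw one sample from the distribution conditioned on a subcube.

    `sampler()` returns a tuple in {0,1}^n; `fixed` maps a subset of
    dimensions to required bit values.  Rejection sampling is only an
    illustrative stand-in for a native SUBCOND oracle (assumption)."""
    for _ in range(max_tries):
        x = sampler()
        if all(x[i] == b for i, b in fixed.items()):
            return x
    raise RuntimeError("conditioning event too rare for rejection sampling")

# toy usage: a product distribution over {0,1}^4
p = [0.2, 0.5, 0.7, 0.9]
sampler = lambda: tuple(int(random.random() < pi) for pi in p)
print(subcond_sample(sampler, fixed={0: 1, 3: 0}))
```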
Poster
Axel Roques · Samuel Gruffaz · Kyurae Kim · Alain Durmus · Laurent Oudre
[ Hall A-E ]
Abstract
Human physiological signals tend to exhibit both global and local structures: the former are shared across a population, while the latter reflect inter-individual variability. For instance, kinetic measurements of the gait cycle during locomotion present common characteristics, although idiosyncrasies may be observed due to biomechanical disposition or pathology. To better represent datasets with local-global structure, this work extends Convolutional Dictionary Learning (CDL), a popular method for learning interpretable representations, or dictionaries, of time-series data. In particular, we propose Personalized CDL (PerCDL), in which a local dictionary models local information as a personalized spatiotemporal transformation of a global dictionary. The transformation is learnable and can combine operations such as time-warping and rotation. Formal computational and statistical guarantees for PerCDL are provided and its effectiveness on synthetic and real human locomotion data is demonstrated.
Poster
Stefan Wahl · Armand Rousselot · Felix Draxler · Ullrich Köthe
[ Hall A-E ]
Abstract
Modeling distributions that depend on external control parameters is a common scenario in diverse applications like molecular simulations, where system properties like temperature affect molecular configurations. Despite the relevance of these applications, existing solutions are unsatisfactory as they require severely restricted model architectures or rely on energy-based training, which is prone to instability. We introduce TRADE, which overcomes these limitations by formulating the learning process as a boundary value problem. By initially training the model for a specific condition using either i.i.d.~samples or backward KL training, we establish a boundary distribution. We then propagate this information across other conditions using the gradient of the unnormalized density with respect to the external parameter. This formulation, akin to the principles of physics-informed neural networks, allows us to efficiently learn parameter-dependent distributions without restrictive assumptions. Experimentally, we demonstrate that TRADE achieves excellent results in a wide range of applications, ranging from Bayesian inference and molecular simulations to physical lattice models.
Poster
Nikolaos Nakis · Chrysoula Kosma · Giannis Nikolentzos · Michail Chatzianastasis · Iakovos Evdaimon · Michalis Vazirgiannis
[ Hall A-E ]
Abstract
Autoencoders based on Graph Neural Networks (GNNs) have garnered significant attention in recent years for their ability to learn informative latent representations of complex topologies, such as graphs. Despite the prevalence of Graph Autoencoders, there has been limited focus on developing and evaluating explainable neural-based graph generative models specifically designed for signed networks. To address this gap, we propose the Signed Graph Archetypal Autoencoder (SGAAE) framework. SGAAE extracts node-level representations that express node memberships over distinct extreme profiles, referred to as archetypes, within the network. This is achieved by projecting the graph onto a learned polytope, which governs its polarization. The framework employs the Skellam distribution for analyzing signed networks combined with relational archetypal analysis and GNNs. Our experimental evaluation demonstrates SGAAE's capability to successfully infer node memberships over underlying latent structures while extracting competing communities. Additionally, we introduce the 2-level network polarization problem and show how SGAAE is able to characterize such a setting. The proposed model achieves high performance in different tasks of signed link prediction across four real-world datasets, outperforming several baseline models. Finally, SGAAE allows for interpretable visualizations in the polytope space, revealing the distinct aspects of the network, as well as how nodes are …
Poster
Tim Weiland · Marvin Pförtner · Philipp Hennig
[ Hall A-E ]
Abstract
Mechanistic knowledge about the physical world is virtually always expressed via partial differential equations (PDEs). Recently, there has been a surge of interest in probabilistic PDE solvers---Bayesian statistical models mostly based on Gaussian process (GP) priors which seamlessly combine empirical measurements and mechanistic knowledge. As such, they quantify uncertainties arising from e.g. noisy or missing data, unknown PDE parameters or discretization error by design. Prior work has established connections to classical PDE solvers and provided solid theoretical guarantees. However, scaling such methods to large-scale problems remains a fundamental challenge primarily due to dense covariance matrices. Our approach addresses the scalability issues by leveraging the Markov property of many commonly used GP priors. It has been shown that such priors are solutions to stochastic PDEs (SPDEs) which when discretized allow for highly efficient GP regression through sparse linear algebra. In this work, we show how to leverage this prior class to make probabilistic PDE solvers practical, even for large-scale nonlinear PDEs, through greatly accelerated inference mechanisms. Additionally, our approach also allows for flexible and physically meaningful priors beyond what can be modeled with covariance functions. Experiments confirm substantial speedups and accelerated convergence of our physics-informed priors in nonlinear settings.
Poster
Behrooz Tahmasebi · Stefanie Jegelka
[ Hall A-E ]
Abstract
In learning with invariances (or symmetries), canonicalization is a widely used technique that projects data onto a smaller subset of the input space to reduce associated redundancies. The transformed dataset is then processed through a function from a designated function class to obtain the final invariant representation. Although canonicalization is often simple and flexible, both theoretical and empirical evidence suggests that the projection map can be discontinuous and unstable, which poses challenges for machine learning applications. However, the overall end-to-end representation can still remain continuous. Focusing on the importance of end-to-end regularity rather than the projection mapping itself, this paper explores the continuity and regularity of canonicalized models from a theoretical perspective. In a broad setting of input spaces and group actions, we establish necessary and sufficient conditions for the continuity or regularity of canonicalized models of any order, thereby characterizing the minimal conditions required for stability. To our knowledge, this represents the first comprehensive investigation into the end-to-end regularity of canonicalized models, offering critical insights into their design and application, as well as guidance for enhancing stability in practical settings.
Poster
Alex Chen · Qing Zhou
[ Hall A-E ]
Abstract
The assumption of independence between observations (units) in a dataset is prevalent across various methodologies for learning causal graphical models. However, this assumption often finds itself in conflict with real-world data, posing challenges to accurate structure learning. We propose a decorrelation-based approach for causal graph learning on dependent binary data, where the local conditional distribution is defined by a latent utility model with dependent errors across units. We develop a pairwise maximum likelihood method to estimate the covariance matrix for the dependence among the units. Then, leveraging the estimated covariance matrix, we develop an EM-like iterative algorithm to generate and de-correlate samples of the latent utility variables, which serve as de-correlated data. Any standard causal discovery method can be applied on the de-correlated data to learn the underlying causal graph. We demonstrate that the proposed de-correlation approach significantly improves the accuracy in causal graph learning, through numerical experiments on both synthetic and real-world datasets.
Poster
Michael Hellstern · Byol Kim · Zaid Harchaoui · Ali Shojaie
[ Hall A-E ]
Abstract
Spectral networks derived from multivariate time series data arise in many domains, from brain science to Earth science. Often, it is of interest to study how these networks change under different conditions. For instance, to better understand epilepsy, it would be interesting to capture the changes in the brain connectivity network as a patient experiences a seizure, using electroencephalography data. A common approach relies on estimating the networks in each condition and calculating their difference. Such estimates may behave poorly in high dimensions as the networks themselves may not be sparse in structure while their difference may be. We build upon this observation to develop an estimator of the difference in inverse spectral densities across two conditions. Using an $\ell_1$ penalty on the difference, consistency is established by only requiring the difference to be sparse. We illustrate the method on synthetic data experiments and on experiments with electroencephalography data.
Poster
Andrew Stirn · David Knowles
[ Hall A-E ]
Abstract
Widely used deep latent variable models (DLVMs), in particular Variational Autoencoders (VAEs), employ overly simplistic priors on the latent space. To achieve strong clustering performance, existing methods that replace the standard normal prior with a Gaussian mixture model (GMM) require defining the number of clusters to be close to the number of expected ground truth classes a-priori and are susceptible to poor initializations. We leverage VampPrior concepts (Tomczak and Welling, 2018) to fit a Bayesian GMM prior, resulting in the VampPrior Mixture Model (VMM), a novel prior for DLVMs. In a VAE, the VMM attains highly competitive clustering performance on benchmark datasets. Integrating the VMM into scVI (Lopez et al., 2018), a popular scRNA-seq integration method, significantly improves its performance and automatically arranges cells into clusters with similar biological characteristics.
Poster
Wenhao Li · Dan Qiao · Baoxiang Wang · Xiangfeng Wang · Wei Yin · Hao Shen · Bo Jin · Hongyuan Zha
[ Hall A-E ]
Abstract
The difficulty of appropriately assigning credit is particularly heightened in cooperative MARL with sparse reward, due to the concurrent time and structural scales involved. Automatic subgoal generation (ASG) has recently emerged as a viable MARL approach inspired by utilizing subgoals in intrinsically motivated reinforcement learning. However, end-to-end learning of complex task planning from sparse rewards without prior knowledge undoubtedly requires massive training samples. Moreover, the diversity-promoting nature of existing ASG methods can lead to the "over-representation" of subgoals, generating numerous spurious subgoals of limited relevance to the actual task reward and thus decreasing the sample efficiency of the algorithm. To address this problem and inspired by disentangled representation learning, we propose a novel "disentangled" decision-making method, Semantically Aligned task decomposition in MARL (SAMA), that prompts pretrained language models with chain-of-thought to suggest potential goals, provide suitable goal decomposition and subgoal allocation, and perform self-reflection-based replanning. Additionally, SAMA incorporates language-grounded MARL to train each agent's subgoal-conditioned policy. SAMA demonstrates considerable advantages in sample efficiency compared to state-of-the-art ASG methods, as evidenced by its performance on two challenging sparse-reward tasks, Overcooked and MiniRTS. The code is available at https://anonymous.4open.science/r/SAMA/.
Poster
Małgorzata Łazęcka · Ewa Szczurek
[ Hall A-E ]
Abstract
Integrating various data modalities brings valuable insights into underlying phenomena. Multimodal factor analysis (FA) uncovers shared axes of variation underlying different simple data modalities, where each sample is represented by a vector of features. However, FA is not suited for structured data modalities, such as text or single cell sequencing data, where multiple data points are measured for each sample and exhibit a clustering structure. To overcome this challenge, we introduce FACTM, a novel, multi-view and multi-structure Bayesian model that combines FA with correlated topic modeling and is optimized using variational inference. Additionally, we introduce a method for rotating latent factors to enhance interpretability with respect to binary features. On text and video benchmarks as well as real-world music and COVID-19 datasets, we demonstrate that FACTM outperforms other methods in identifying clusters in structured data, and integrating them with simple modalities via the inference of shared, interpretable factors.
Poster
Rohan Ghosh · Mehul Motani
[ Hall A-E ]
Abstract
Mutual information (MI) is widely employed as a measure of shared information between random variables. However, MI assumes unbounded computational resources—a condition rarely met in practice, where predicting a random variable $Y$ from $X$ must rely on finite resources. $\mathcal{V}$-information addresses this limitation by employing a predictive family $\mathcal{V}$ to emulate computational constraints, yielding a directed measure of shared information. Focusing on the mixed setting (continuous $X$ and discrete $Y$), here we highlight the upward bias of empirical $\mathcal{V}$-information, $\hat I_{\mathcal{V}}(X \rightarrow Y)$, even when $\mathcal{V}$ is low-complexity (e.g., shallow neural networks). To mitigate this bias, we introduce $\mathcal{V}$-Information Growth (VI-Growth), defined as $\hat I_{\mathcal{V}}(X \rightarrow Y) - \hat I_{\mathcal{V}}(X' \rightarrow Y')$, where $X', Y' \sim P_X P_Y$ represent independent variables. While VI-Growth effectively counters over-estimation, more complex predictive families may lead to under-estimation. To address this, we construct a sequence of predictive families $\mathcal{V}_1, \mathcal{V}_2, \ldots, \mathcal{V}$ of increasing complexity and compute the maximum of VI-Growth across these families, yielding the ordered VI-Growth (O-VIG). We provide theoretical results that justify this approach, showing that O-VIG is a provably tighter lower bound for the true $\mathcal{V}$-Information than empirical $\mathcal{V}$-Information itself, and exhibits stronger convergence properties than $\mathcal{V}$-Information. Empirically, O-VIG alleviates …
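A rough sketch of the quantities involved, with logistic regression as an assumed low-complexity predictive family and a label permutation standing in for the independent pair $(X', Y')$; the paper's families $\mathcal{V}_1, \mathcal{V}_2, \ldots$ and the O-VIG maximization are not reproduced here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def empirical_v_information(X, y):
    """hat I_V(X -> Y) with V = logistic regressors (an assumed, simple family)."""
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    h_cond = log_loss(y, clf.predict_proba(X))       # H_V(Y | X), in nats, on the same data
    p = np.bincount(y) / len(y)
    h_marg = -np.sum(p * np.log(p + 1e-12))          # best constant predictor (assumed in V)
    return h_marg - h_cond

def vi_growth(X, y, rng=np.random.default_rng(0)):
    """VI-Growth: subtract the V-information measured on an independent pair
    (X', Y'), obtained here by permuting the labels (illustrative sketch)."""
    y_perm = rng.permutation(y)
    return empirical_v_information(X, y) - empirical_v_information(X, y_perm)

# toy usage
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)
print(vi_growth(X, y, rng))
```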
Poster
Berfin Simsek · Amire Bendjeddou · Daniel Hsu
[ Hall A-E ]
Abstract
This work focuses on the gradient flow dynamics of a neural network model that uses correlation loss to approximate a multi-index function on high-dimensional standard Gaussian data. Specifically, the multi-index function we consider is a sum of neurons $f^*(x) = \sum_{j=1}^k \sigma^*(v_j^T x)$ where $v_1, ..., v_k$ are unit vectors, and $\sigma^*$ lacks the first and second Hermite polynomials in its Hermite expansion. It is known that, for the single-index case ($k=1$), overcoming the search phase requires polynomial time complexity. We first generalize this result to multi-index functions characterized by vectors in arbitrary directions. After the search phase, it is not clear whether the network neurons converge to the index vectors, or get stuck at a sub-optimal solution. When the index vectors are orthogonal, we give a complete characterization of the fixed points and prove that neurons converge to the nearest index vectors. Therefore, using $n \asymp k \log k$ neurons ensures finding the full set of index vectors with gradient flow with high probability over random initialization. When $v_i^T v_j = \beta \geq 0$ for all $i \neq j$, we prove the existence of a sharp threshold $\beta_c = c/(c+k)$ at which the fixed point that computes the average of the …
Poster
Ritwik Vashistha · Arya Farahi
[ Hall A-E ]
Abstract
As probabilistic models continue to permeate various facets of our society and contribute to scientific advancements, it becomes a necessity to go beyond traditional metrics such as predictive accuracy and error rates and assess their trustworthiness. Grounded in the competence-based theory of trust, this work formalizes the $\mathcal{I}$-trustworthy framework -- a novel framework for assessing the trustworthiness of probabilistic classifiers for inference tasks by linking conditional calibration to trustworthiness. To assess $\mathcal{I}$-trustworthiness, we use the local calibration error (LCE) and develop a method of hypothesis-testing. This method utilizes a kernel-based test statistic, Kernel Local Calibration Error (KLCE), to test local calibration of a probabilistic classifier. This study provides theoretical guarantees by offering convergence bounds for an unbiased estimator of KLCE. Additionally, we present a diagnostic tool designed to identify and measure biases in cases of miscalibration. The effectiveness of the proposed test statistic is demonstrated through its application to both simulated and real-world datasets. Finally, the LCE of related recalibration methods is studied, and we provide evidence of the insufficiency of existing methods to achieve $\mathcal{I}$-trustworthiness.
Poster
Yeqi Gao · Zhao Song · Junze Yin
[ Hall A-E ]
Abstract
Large language models (LLMs) have numerous real-life applications across various domains, such as natural language translation, sentiment analysis, language modeling, chatbots and conversational agents, creative writing, text classification, summarization, and generation. LLMs have shown great promise in improving the accuracy and efficiency of these tasks, and have the potential to revolutionize the field of natural language processing (NLP) in the years to come. The exponential function based attention unit is a fundamental element in LLMs. Several previous works have studied the convergence of exponential regression and softmax regression. In this paper, we propose an iterative algorithm to solve a rescaled version of the slightly different formulation of the softmax regression problem that arises in attention mechanisms of large language models. Specifically, we consider minimizing the squared loss between a certain function, which can be either the exponential function, hyperbolic sine function, or hyperbolic cosine function, and its inner product with a target $n$-dimensional vector $b$, scaled by the normalization term. This ``rescaled softmax regression'' differs from classical softmax regression in the location of the normalization factor. The efficiency and generalizability of this framework to multiple hyperbolic functions make it relevant for optimizing attention mechanisms. The analysis also leads to a corollary bounding solution changes …
Poster
Han Bao · Nontawat Charoenphakdee
[ Hall A-E ]
Abstract
Strict proper losses are fundamental loss functions inducing classifiers capable of estimating class probabilities. While practitioners have devised many loss functions, their properness is often unverified. In this paper, we identify several losses as improper, calling into question the validity of class probability estimates derived from their simplex-projected outputs. Nevertheless, we show that these losses are strictly proper composite with appropriate link functions, allowing predictions to be mapped into true class probabilities. We invent the calmness condition, which we prove suffices to identify that a loss has a strictly proper composite representation, and provide the general form of the inverse link. To further understand proper composite losses, we explore proper composite losses through the framework of property elicitation, revealing a connection between inverse link functions and Bregman projections. Numerical simulations are provided to demonstrate the behavior of proper composite losses and the effectiveness of the inverse link function.
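A standard worked example, not taken from the paper and in my own notation, of a strictly proper composite loss: the binary logistic loss with the logit link, whose inverse link (the sigmoid) maps real-valued scores back to true class probabilities:

```latex
% Logistic loss as a strictly proper composite loss with the logit link
% \psi(\eta) = \log\frac{\eta}{1-\eta}.  Notation is mine, not the paper's.
\[
\ell(\tilde{y}, v) \;=\; \log\bigl(1 + e^{-\tilde{y}\, v}\bigr), \qquad \tilde{y} \in \{-1, +1\},
\qquad
\psi^{-1}(v) \;=\; \frac{1}{1 + e^{-v}},
\]
\[
\arg\min_{v \in \mathbb{R}}\; \mathbb{E}\bigl[\ell(\tilde{Y}, v) \mid X = x\bigr]
\;=\; \psi\bigl(\Pr(\tilde{Y} = 1 \mid X = x)\bigr),
\]
% so applying the inverse link \psi^{-1} to the minimizing score recovers the
% conditional class probability, which is exactly what strict properness of the
% composite representation guarantees.
```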
Poster
Siu Lun Chau · Antonin Schrab · Arthur Gretton · Dino Sejdinovic · Krikamol Muandet
[ Hall A-E ]
Abstract
We introduce credal two-sample testing, a new hypothesis testing framework for comparing credal sets---convex sets of probability measures where each element captures aleatoric uncertainty and the set itself represents epistemic uncertainty that arises from the modeller's partial ignorance. Compared to classical two-sample tests, which focus on comparing precise distributions, the proposed framework provides a broader and more versatile set of hypotheses. This approach enables the direct integration of epistemic uncertainty, effectively addressing the challenges arising from partial ignorance in hypothesis testing. By generalising the two-sample test to compare credal sets, our framework enables reasoning for equality, inclusion, intersection, and mutual exclusivity, each offering unique insights into the modeller's epistemic beliefs. As the first work on nonparametric hypothesis testing for comparing credal sets, we focus on finitely generated credal sets derived from i.i.d. samples from multiple distributions—referred to as \emph{credal samples}. We formalise these tests as two-sample tests with nuisance parameters and introduce the first permutation-based solution for this class of problems, significantly improving upon existing methods. Our approach properly incorporates the modeller's epistemic uncertainty into hypothesis testing, leading to more robust and credible conclusions, with kernel-based implementations for real-world applications.
Poster
Kaan Ozkara · Bruce Huang · Ruida Zhou · Suhas Diggavi
[ Hall A-E ]
Abstract
Statistical heterogeneity of clients' local data is an important characteristic in federated learning, motivating personalized algorithms tailored to local data statistics. Though there has been a plethora of algorithms proposed for personalized supervised learning, discovering the structure of local data through personalized unsupervised learning is less explored. We initiate a systematic study of such personalized unsupervised learning by developing algorithms based on optimization criteria inspired by a hierarchical Bayesian statistical framework. We develop adaptive algorithms that discover the balance between using limited local data and collaborative information. We do this in the context of two unsupervised learning tasks: personalized dimensionality reduction (ADEPT-PCA and ADEPT-AE) and personalized diffusion models (ADEPT-DGM). We develop convergence analyses for our adaptive algorithms which illustrate the dependence on problem parameters (e.g., heterogeneity, local sample size). We also develop a theoretical framework for personalized diffusion models, which shows the benefits of collaboration even under heterogeneity. We finally evaluate our proposed algorithms using synthetic and real data, demonstrating the effective sample amplification for personalized tasks, induced through collaboration, despite data heterogeneity.
Poster
Sreejeet Maity · Aritra Mitra
[ Hall A-E ]
Abstract
One of the most basic problems in reinforcement learning (RL) is policy evaluation: estimating the long-term return, i.e., value function, corresponding to a given fixed policy. The celebrated Temporal Difference (TD) learning algorithm addresses this problem, and recent work has investigated finite-time convergence guarantees for this algorithm and variants thereof. However, these guarantees hinge on the reward observations being always generated from a well-behaved (e.g., sub-Gaussian) true reward distribution. Motivated by harsh, real-world environments where such an idealistic assumption may no longer hold, we revisit the policy evaluation problem from the perspective of *adversarial robustness*. In particular, we consider a Huber-contaminated reward model where an adversary can arbitrarily corrupt each reward sample with a small probability $\epsilon$. Under this observation model, we first show that the adversary can cause the vanilla TD algorithm to converge to any arbitrary value function. We then develop a novel algorithm called *Robust-TD* and prove that its finite-time guarantees match that of vanilla TD with linear function approximation up to a small $O(\epsilon)$ term that captures the effect of corruption. We complement this result with a minimax lower bound, revealing that such an additive corruption-induced term is unavoidable. To our knowledge, these results are the …
Poster
Son Nguyen · Lizhang Chen · Bo Liu · qiang liu
[ Hall A-E ]
Abstract
Modern deep learning heavily depends on adaptive optimizers such as Adam and its variants, which are renowned for their capacity to handle model scaling and streamline hyperparameter tuning. However, these algorithms typically experience high memory overhead caused by the accumulation of optimization states, leading to a critical challenge in training large-scale network models. In this study, we introduce a novel adaptive optimizer, H-Fac, which incorporates a memory-efficient factorization approach to address this challenge. By employing a rank-1 parameterization for both momentum and scaling parameter estimators, H-Fac reduces memory costs to a sublinear level while maintaining competitive performance across a wide range of architectures. We develop our algorithms based on principles derived from Hamiltonian dynamics, providing robust theoretical underpinnings in optimization dynamics and convergence guarantees. These optimization algorithms are designed to be both straightforward and adaptable, facilitating easy implementation in diverse settings.
Poster
Oren Yuval · Saharon Rosset
[ Hall A-E ]
Abstract
We present a methodology for model evaluation and selection in binary classification models in the presence of correlations in the data, where the sampling mechanism violates the i.i.d. assumption. Our methodology involves a formulation of the bias term between the standard Cross-Validation (CV) estimator and the mean generalization error, and practical data-based procedures to estimate this term. Consequently, we present the bias-corrected CV estimator, which is the standard CV estimate supplemented by an estimate of the bias term. This concept was introduced in the literature only in the context of a linear model with squared error loss as the criterion for prediction performance. Our suggested bias-corrected CV estimator can be applied to any learning model, including deep neural networks, and to standard criteria for prediction performance for classification tasks, including misclassification rate, cross-entropy and hinge loss. We demonstrate the applicability of the proposed methodology in various scenarios where the data contains complex correlation structures (such as clustered and spatial relationships) with synthetic data and real-world datasets, providing evidence that the bias-corrected CV estimator is better than the standard CV estimator.
Poster
Siyao Wang · Miles Lopes
[ Hall A-E ]
Abstract
Graph sparsification is a well-established technique for accelerating graph-based learning algorithms, which uses edge sampling to approximate dense graphs with sparse ones. Because the sparsification error is random and unknown, users must contend with uncertainty about the reliability of downstream computations. Although it is possible for users to obtain conceptual guidance from theoretical error bounds in the literature, such results are typically impractical at a numerical level. Taking an alternative approach, we propose to address these issues from a data-driven perspective by computing empirical error estimates. The proposed error estimates are highly versatile, and we demonstrate this in four use cases: Laplacian matrix approximation, graph cut queries, graph-structured regression, and spectral clustering. Moreover, we provide two theoretical guarantees for the error estimates, and explain why the cost of computing them is manageable in comparison to the overall cost of a typical graph sparsification workflow.
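For orientation, a toy version of the workflow the error estimates target: sparsify by i.i.d. weighted edge sampling with reweighting, then measure the Laplacian approximation error directly; the paper's data-driven error estimators themselves are not reproduced here, and all names are my own:

```python
import numpy as np

def sparsify_by_edge_sampling(edges, weights, m, rng=np.random.default_rng(0)):
    """Sample m edges i.i.d. proportionally to their weights and reweight them
    so the sparsified Laplacian is unbiased (illustrative sketch only)."""
    p = weights / weights.sum()
    idx = rng.choice(len(edges), size=m, p=p)
    sparse_w = {}
    for i in idx:
        sparse_w[i] = sparse_w.get(i, 0.0) + weights[i] / (m * p[i])
    return [(edges[i], w) for i, w in sparse_w.items()]

def laplacian(n, edge_weight_pairs):
    L = np.zeros((n, n))
    for (u, v), w in edge_weight_pairs:
        L[u, u] += w; L[v, v] += w
        L[u, v] -= w; L[v, u] -= w
    return L

# toy usage: a weighted cycle with chords
n = 30
edges = [(i, (i + 1) % n) for i in range(n)] + [(i, (i + 7) % n) for i in range(n)]
weights = np.ones(len(edges))
L_full = laplacian(n, list(zip(edges, weights)))
L_sparse = laplacian(n, sparsify_by_edge_sampling(edges, weights, m=40))
print("relative Laplacian error:",
      np.linalg.norm(L_sparse - L_full) / np.linalg.norm(L_full))
```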
Poster
Mikołaj Słupiński
[ Hall A-E ]
Abstract
In this paper, we propose a novel model called Recurrent Explicit Duration Switching Linear Dynamical Systems (REDSLDS) that incorporates recurrent explicit duration variables into the rSLDS model. We also propose an inference and learning scheme that involves the use of Pólya-gamma augmentation. We demonstrate the improved segmentation capabilities of our model on three benchmark datasets, including two quantitative datasets and one qualitative dataset.
Poster
Julia Herbinger
[ Hall A-E ]
Abstract
Global feature effect methods, such as partial dependence plots, provide an intelligible visualization of the expected marginal feature effect. However, such global feature effect methods can be misleading, as they do not represent local feature effects of single observations well when feature interactions are present. We formally introduce generalized additive decomposition of global effects (GADGET), which is a new framework based on recursive partitioning to find interpretable regions in the feature space such that the interaction-related heterogeneity of local feature effects is minimized. We provide a mathematical foundation of the framework and show that it is applicable to the most popular methods to visualize marginal feature effects, namely partial dependence, accumulated local effects, and Shapley additive explanations (SHAP) dependence. Furthermore, we introduce and validate a new permutation-based interaction detection procedure that is applicable to any feature effect method that fits into our proposed framework. We empirically evaluate the theoretical characteristics of the proposed methods based on various feature effect methods in different experimental settings. Moreover, we apply our introduced methodology to three real-world examples to showcase their usefulness.
Poster
Heishiro Kanagawa
[ Hall A-E ]
Abstract
We propose a kernel-based nonparametric test of relative goodness of fit, where the goal is to compare two models, both of which may have unobserved latent variables, such that the marginal distribution of the observed variables is intractable. The proposed test generalizes the recently proposed kernel Stein discrepancy (KSD) tests (Liu et al., 2016, Proceedings of the 33rd International Conference on Machine Learning, pp. 276–284; Chwialkowski et al., 2016, Proceedings of the 33rd International Conference on Machine Learning, pp. 2606–2615; Yang et al., 2018, Proceedings of the 35th International Conference on Machine Learning, pp. 5561–5570) to the case of latent variable models, a much more general class than the fully observed models treated previously. The new test, with a properly calibrated threshold, has a well-controlled type-I error. In the case of certain models with low-dimensional latent structures and high-dimensional observations, our test significantly outperforms the relative maximum mean discrepancy test, which is based on samples from the models and does not exploit the latent structure.
Poster
Cheng Li
[ Hall A-E ]
Abstract
In geostatistical problems with massive sample size, Gaussian processes can be approximated using sparse directed acyclic graphs to achieve scalable O(n) computational complexity. In these models, data at each location are typically assumed conditionally dependent on a small set of parents that usually include a subset of the nearest neighbours. These methodologies often exhibit excellent empirical performance, but the lack of theoretical validation leads to unclear guidance in specifying the underlying graphical model and sensitivity to graph choice. We address these issues by introducing radial-neighbour Gaussian processes, a class of Gaussian processes based on directed acyclic graphs in which directed edges connect every location to all of its neighbours within a predetermined radius. We prove that any radial-neighbour Gaussian process can accurately approximate the corresponding unrestricted Gaussian process in the Wasserstein-2 distance, with an error rate determined by the approximation radius, the spatial covariance function and the spatial dispersion of samples. We offer further empirical validation of our approach via applications on simulated and real-world data, showing excellent performance in both prior and posterior approximations to the original Gaussian process.
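A small sketch of the graph construction implied by the definition: after fixing an ordering of the locations, each location's parents are all earlier locations within a radius. The coordinate-based ordering and the radius value are my own choices for illustration, not the paper's recommendations:

```python
import numpy as np
from scipy.spatial import cKDTree

def radial_neighbour_parents(coords, radius):
    """Parent sets of a radial-neighbour DAG: each location's parents are all
    *earlier* locations (under a fixed ordering) within `radius`."""
    n = coords.shape[0]
    order = np.argsort(coords[:, 0])          # a simple coordinate-based ordering (assumption)
    rank = np.empty(n, dtype=int)
    rank[order] = np.arange(n)
    tree = cKDTree(coords)
    parents = [[] for _ in range(n)]
    for i in range(n):
        for j in tree.query_ball_point(coords[i], r=radius):
            if rank[j] < rank[i]:
                parents[i].append(j)
    return parents

# toy usage on random spatial locations
rng = np.random.default_rng(0)
coords = rng.uniform(size=(200, 2))
parents = radial_neighbour_parents(coords, radius=0.1)
print("average parent-set size:", np.mean([len(p) for p in parents]))
```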
Poster
Thomas Nagler
[ Hall A-E ]
Abstract
Thanks to their ability to capture complex dependence structures, copulas are frequently used to glue random variables into a joint model with arbitrary marginal distributions. More recently, they have been applied to solve statistical learning problems such as regression or classification. Framing such approaches as solutions of estimating equations, we generalize them in a unified framework. We can then obtain simultaneous, coherent inferences across multiple regression-like problems. We derive consistency, asymptotic normality, and validity of the bootstrap for corresponding estimators. The conditions allow for both continuous and discrete data as well as parametric, nonparametric, and semiparametric estimators of the copula and marginal distributions. The versatility of this methodology is illustrated by several theoretical examples, a simulation study, and an application to financial portfolio allocation. Supplementary materials for this article are available online.
Poster
Arindam Banerjee · Qiaobo Li · Yingxue Zhou
[ Hall A-E ]
Abstract
Generalization and optimization guarantees on the population loss often rely on uniform convergence based analysis, typically based on the Rademacher complexity of the predictors. The rich representation power of modern models has led to concerns about this approach. In this paper, we present generalization and optimization guarantees in terms of the complexity of the gradients, as measured by the Loss Gradient Gaussian Width (LGGW). First, we introduce generalization guarantees directly in terms of the LGGW under a flexible gradient domination condition. Second, we show that sample reuse in ERM does not make the empirical gradients deviate from the population gradients as long as the LGGW is small. Third, focusing on deep networks, we bound their single-sample LGGW in terms of the Gaussian width of the featurizer, i.e., the output of the last-but-one layer. To our knowledge, our generalization and optimization guarantees in terms of LGGW are the first results of its kind, and hold considerable promise towards quantitatively tight bounds for deep models.
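For reference, the standard definition of Gaussian width, together with a schematic (and entirely my own) notation for the gradient set whose width the abstract calls the LGGW; the paper's precise definition may differ, for instance in how it averages over samples:

```latex
% Gaussian width of a set T \subset \mathbb{R}^p (standard definition), applied
% schematically to a set of per-sample loss gradients over the parameter class.
\[
w(T) \;=\; \mathbb{E}_{g \sim \mathcal{N}(0, I_p)}\Bigl[\, \sup_{t \in T} \langle g, t \rangle \,\Bigr],
\qquad
T \;=\; \bigl\{\, \nabla_\theta \ell(\theta; z) \;:\; \theta \in \Theta \,\bigr\},
\]
% i.e. the LGGW measures how "wide" the collection of loss gradients is,
% rather than the width or Rademacher complexity of the predictor class itself.
```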
Poster
Abdullah Alchihabi · Hanping Zhang · Yuhong Guo
[ Hall A-E ]
Abstract
Reinforcement Learning (RL) has demonstrated remarkable success in solving sequential decision-making problems. However, in real-world scenarios, RL agents often struggle to generalize when faced with unseen actions that were not encountered during training. Some previous works on zero-shot action generalization rely on large datasets of action observations to capture the behaviors of new actions, making them impractical for real-world applications. In this paper, we introduce a novel zero-shot framework, Action Generalization from Limited Observations (AGLO). Our framework has two main components: an action representation learning module and a policy learning module. The action representation learning module extracts discriminative embeddings of actions from limited observations, while the policy learning module leverages the learned action representations, along with augmented synthetic action representations, to learn a policy capable of handling tasks with unseen actions. The experimental results demonstrate that our framework significantly outperforms state-of-the-art methods for zero-shot action generalization across multiple benchmark tasks, showcasing its effectiveness in generalizing to new actions with minimal action observations.
Poster
Ankur Nath · Alan Kuhnle
[ Hall A-E ]
Abstract
Modern instances of combinatorial optimization problems often exhibit billion-scale ground sets, which have many uninformative or redundant elements. In this work, we develop light-weight pruning algorithms to quickly discard elements that are unlikely to be part of an optimal solution. Under mild assumptions on the instance, we prove theoretical guarantees on the fraction of the optimal value retained and the size of the resulting pruned ground set. Through extensive experiments on real-world datasets for various applications, we demonstrate that our algorithm, QuickPrune, efficiently prunes over 90% of the ground set and outperforms state-of-the-art classical and machine learning heuristics for pruning.
Poster
Paula Cordero Encinar · Tobias Schroeder · Peter Yatsyshin · Andrew Duncan
[ Hall A-E ]
Abstract
Selecting cost-effective optimal sensor configurations for subsequent inference of parameters in black-box stochastic systems faces significant computational barriers. We propose a novel and robust approach, modelling the joint distribution over input parameters and solution with a joint energy-based model, trained on simulation data. Unlike existing simulation-based inference approaches, which must be tied to a specific set of point evaluations, we learn a functional representation of parameters and solution. This is used as a resolution-independent plug-and-play surrogate for the joint distribution, which can be conditioned over any set of points, permitting an efficient approach to sensor placement. We demonstrate the validity of our framework on a variety of stochastic problems, showing that our method provides highly informative sensor locations at a lower computational cost compared to conventional approaches.
Poster
Junyi ZHANG · Angelos Dassios · Zhong Chong · Qiufei Yao
[ Hall A-E ]
Abstract
The beta process is a widely used nonparametric prior in Bayesian machine learning. While various inference schemes have been developed for the beta process and related models, the current state-of-the-art method relies heavily on the stick-breaking representation with decreasing atom weights, which is available only for a special hyperparameter. In this paper, we introduce the truncated inverse-Lévy measure representation (TILe-Rep) that extends the decreasing atom weights representation of the beta process to general hyperparameters. The TILe-Rep fills the gap between the two stick-breaking representations in Teh et al. (2007) and Paisley et al. (2010). Moreover, it has a lower truncation error compared to other sequential representations of the beta process and potentially leads to the posterior consistency property of the Bayesian factor models. We demonstrate the usage of the TILe-Rep in the celebrated beta process factor analysis model and beta process sparse factor model.
Poster
Satish Keshri · Nazreen Shah · Ranjitha Prasad
[ Hall A-E ]
Abstract
The holy grail of machine learning is to enable Continual Federated Learning (CFL) to enhance the efficiency, privacy, and scalability of AI systems while learning from streaming data. The primary challenge of a CFL system is to overcome global catastrophic forgetting, wherein the accuracy of the global model trained on new tasks declines on the old tasks. In this work, we propose \emph{Continual Federated Learning with Aggregated Gradients} (C-FLAG), a novel replay-memory based federated strategy consisting of edge-based gradient updates on memory and aggregated gradients on the current data. We provide convergence analysis of the C-FLAG approach which addresses forgetting and bias while converging at a rate of $\mathcal{O}(1/\sqrt{T})$ over $T$ communication rounds. We formulate an optimization sub-problem that minimizes catastrophic forgetting, translating CFL into an iterative algorithm with adaptive learning rates that ensure seamless learning across tasks. We empirically show that C-FLAG outperforms several state-of-the-art baselines on both task and class-incremental settings with respect to metrics such as accuracy and forgetting.
Poster
Deep Chakraborty · Yann LeCun · Tim G. J. Rudner · Erik Learned-Miller
[ Hall A-E ]
Abstract
A number of different architectures and loss functions have been applied to the problem of self-supervised learning (SSL), with the goal of developing embeddings that provide the best possible pre-training for as-yet-unknown, lightly supervised downstream tasks. One of these SSL criteria is to maximize the entropy of a set of embeddings in some compact space. But the goal of maximizing the embedding entropy often depends—whether explicitly or implicitly—upon high dimensional entropy estimates, which typically perform poorly in more than a few dimensions. In this paper, we motivate an effective entropy maximization criterion (E2MC), defined in terms of easy-to-estimate, low-dimensional constraints. We demonstrate that using it to continue training an already-trained SSL model for only a handful of epochs leads to a consistent and, in some cases, significant improvement in downstream performance. We perform careful ablation studies to show that the improved performance is due to the proposed add-on criterion. We also show that continued pre-training with alternative criteria does not lead to notable improvements, and in some cases, even degrades performance.
Poster
Khimya Khetarpal · Zhaohan Daniel Guo · Bernardo Avila Pires · Yunhao Tang · Clare Lyle · Mark Rowland · Nicolas Heess · Diana Borsa · Arthur Guez · Will Dabney
[ Hall A-E ]
Abstract
Learning a good representation is a crucial challenge for reinforcement learning (RL) agents. Self-predictive algorithms jointly learn a latent representation and dynamics model by bootstrapping from future latent representations (BYOL). Recent work has developed theoretical insights into these algorithms by studying a continuous-time ODE model in the case of a fixed policy (BYOL-$\Pi$); this assumption is at odds with practical implementations, which explicitly condition their predictions on future actions. In this work, we take a step towards bridging the gap between theory and practice by analyzing an action-conditional self-predictive objective (BYOL-AC) using the ODE framework. Interestingly, we uncover that BYOL-$\Pi$ and BYOL-AC are related through the lens of variance. We unify the study of these objectives through two complementary lenses; a model-based perspective, where each objective is related to low-rank approximation of certain dynamics, and a model-free perspective, which relates the objectives to modified value, Q-value, and Advantage functions. This mismatch with the true value functions leads to the empirical observation (in both linear and deep RL experiments) that BYOL-$\Pi$ and BYOL-AC are either very similar in performance across many tasks or task-dependent.
Poster
Suqi Liu · Morgane Austern
[ Hall A-E ]
Abstract
We study the graph matching problem in the presence of vertex feature information using shallow graph neural networks. Specifically, given two graphs that are independent perturbations of a single random geometric graph with sparse binary features, the task is to recover an unknown one-to-one mapping between the vertices of the two graphs. We show that, under certain conditions on the sparsity and noise level of the feature vectors, a carefully designed two-layer graph neural network can, with high probability, recover the correct mapping between the vertices with the help of the graph structure. Additionally, we prove that our condition on the noise parameter is tight up to logarithmic factors. Finally, we compare the performance of the graph neural network to directly solving an assignment problem using the noisy vertex features and demonstrate that when the noise level is at least constant, this direct matching fails to achieve perfect recovery, whereas the graph neural network can tolerate noise levels growing as fast as a power of the size of the graph. Our theoretical findings are further supported by numerical studies as well as real-world data experiments.
Poster
Devansh Gupta · Meisam Razaviyayn · Vatsal Sharan
[ Hall A-E ]
Abstract
Differentially private zeroth-order optimization methods have recently gained popularity in private fine tuning of machine learning models due to their reduced memory requirements. Current approaches for privatizing zeroth-order methods rely on adding Gaussian noise to the estimated zeroth-order gradients. However, since the search direction in the zeroth-order methods is inherently random, researchers including Tang et al. (2024) and Zhang et al. (2024a) have raised an important question: is the inherent noise in zeroth-order estimators sufficient to ensure the overall differential privacy of the algorithm? This work settles this question for a class of oracle-based optimization algorithms where the oracle returns zeroth-order gradient estimates. In particular, we show that for a fixed initialization, there exist strongly convex objective functions such that running (Projected) Zeroth-Order Gradient Descent (ZO-GD) is not differentially private. Furthermore, we show that even with random initialization and without revealing intermediate iterates, the privacy loss in ZO-GD can grow superlinearly with the number of iterations when minimizing convex objective functions.
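A minimal sketch of the oracle in question: the standard two-point zeroth-order gradient estimate and a projection-free ZO-GD loop, with optional Gaussian noise added to the estimate (the privatization route the abstract questions). This illustrates the algorithm class being analyzed, not the paper's counterexample constructions; all names and constants are assumptions:

```python
import numpy as np

def zo_gradient(f, x, mu=1e-3, rng=np.random.default_rng(0)):
    """Two-point zeroth-order gradient estimate along a random direction."""
    u = rng.normal(size=x.shape)
    return (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u

def zo_gd(f, x0, steps=200, lr=0.05, sigma=0.0, rng=np.random.default_rng(1)):
    """Zeroth-order gradient descent; sigma > 0 adds Gaussian noise to the
    estimate (illustrative sketch, no projection step)."""
    x = x0.copy()
    for _ in range(steps):
        g = zo_gradient(f, x, rng=rng)
        if sigma > 0:
            g = g + rng.normal(scale=sigma, size=x.shape)
        x -= lr * g
    return x

# toy usage on a strongly convex quadratic
f = lambda x: 0.5 * np.sum((x - 1.0) ** 2)
print(zo_gd(f, np.zeros(5)))
```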
Poster
Shivvrat Arya · Tahrima Rahman · Vibhav Gogate
[ Hall A-E ]
Abstract
Our paper builds on the recent trend of using neural networks trained with self-supervised or supervised learning to solve the Most Probable Explanation (MPE) task in discrete graphical models. At inference time, these networks take an evidence assignment as input and generate the most likely assignment for the remaining variables via a single forward pass. We address two key limitations of existing approaches: (1) the inability to fully exploit the graphical model's structure and parameters, and (2) the suboptimal discretization of continuous neural network outputs. Our approach embeds model structure and parameters into a more expressive feature representation, significantly improving performance. Existing methods rely on standard thresholding, which often yields suboptimal results due to the non-convexity of the loss function. We introduce two methods to overcome discretization challenges: (1) an external oracle-based approach that infers uncertain variables using additional evidence from confidently predicted ones, and (2) a technique that identifies and selects the highest-scoring discrete solutions near the continuous output. Experimental results on various probabilistic models demonstrate the effectiveness and scalability of our approach, highlighting its practical impact.
Poster
Felix Jimenez · Matthias Katzfuss
[ Hall A-E ]
Abstract
For regression tasks, standard Gaussian processes (GPs) provide natural uncertainty quantification (UQ), while deep neural networks (DNNs) excel at representation learning. Deterministic UQ methods for neural networks have successfully combined the two and require only a single pass through the neural network. However, current methods necessitate changes to network training to address feature collapse, where unique inputs map to identical feature vectors. We propose an alternative solution, the deep Vecchia ensemble (DVE), which allows deterministic UQ to work in the presence of feature collapse, negating the need for network retraining. DVE comprises an ensemble of GPs built on hidden-layer outputs of a DNN, achieving scalability via Vecchia approximations that leverage nearest-neighbor conditional independence. DVE is compatible with pretrained networks and incurs low computational overhead. We demonstrate DVE's utility on several datasets and carry out experiments to understand the inner workings of the proposed method.
Poster
Haoming Yang · Ali Hasan · Vahid Tarokh
[ Hall A-E ]
Abstract
Regularizing continual learning techniques is important for anticipating algorithmic behavior under new realizations of data. We introduce a new approach to continual learning by imposing the properties of a parabolic partial differential equation (PDE) to regularize the expected behavior of the loss over time. This class of parabolic PDEs has a number of favorable properties that allow us to analyze the error incurred through forgetting and the error induced through generalization. Specifically, we do this through imposing boundary conditions where the boundary is given by a memory buffer. By using the memory buffer as a boundary, we can enforce long term dependencies by bounding the expected error by the boundary loss. Finally, we illustrate the empirical performance of the method on a series of continual learning tasks.
Poster
Iosif Lytras · Panayotis Mertikopoulos
[ Hall A-E ]
Abstract
Motivated by applications to deep learning which often fail standard Lipschitz smoothness requirements, we examine the problem of sampling from distributions that are not log-concave and are only weakly dissipative, with log-gradients allowed to grow superlinearly at infinity. In terms of structure, we only assume that the target distribution satisfies either a Log-Sobolev or a Poincaré inequality and a local Lipschitz smoothness assumption with modulus growing possibly polynomially at infinity. This set of assumptions greatly exceeds the operational limits of the "vanilla" ULA, making sampling from such distributions a highly involved affair. To account for this, we introduce a taming scheme which is tailored to the growth and decay properties of the target distribution, and we provide explicit non-asymptotic guarantees for the proposed sampler in terms of the KL divergence, total variation, and Wasserstein distance to the target distribution.
Poster
Wenjing Han · Yueming Wu · Xinwei Sun · Lingjing Hu · Yizhou Wang
[ Hall A-E ]
Abstract
In voxel-based neuroimaging disease prediction, it was recently found that in addition to lesion features, there exists another type of feature called "Procedural Bias", which is introduced during preprocessing and can further improve the prediction power. However, traditional sparse learning methods fail to simultaneously capture both types of features due to their heterogeneity in sparsity types. Specifically, the lesion features are spatially coherent and suffer from volumetric degeneration, while the procedural bias refers to enlarged voxels that are dispersedly distributed. In this paper, we propose a new method based on differential inclusion, which generates a sparse regularized solution path on a couple of parameters that are enforced with heterogeneous sparsity to capture lesion features and the procedural bias separately. Specifically, we employ Total Variation with a non-negative constraint for the parameter associated with degenerated and spatially coherent lesions; on the other hand, we impose $\ell_1$ sparsity with a non-positive constraint on the parameter related to enlarged and dispersedly distributed procedural bias. We theoretically show that our method enjoys model selection consistency and $\ell_2$ consistency in estimation. The utility of our method is demonstrated by improved prediction power and interpretability in the early prediction of Alzheimer's Disease.
Poster
Jiaru Zhang · Rui Ding · Qiang Fu · Huang Bojun · Zizhen Deng · Yang Hua · Haibing Guan · Shi Han · Dongmei Zhang
[ Hall A-E ]
Abstract
Causal discovery is a structured prediction task that aims to predict causal relations among variables based on their data samples. Supervised Causal Learning (SCL) is an emerging paradigm in this field. Existing Deep Neural Network (DNN)-based methods commonly adopt the “Node-Edge approach”, in which the model first computes an embedding vector for each variable-node, then uses these variable-wise representations to concurrently and independently predict for each directed causal-edge. In this paper, we first show that this architecture has some systematic bias that cannot be mitigated regardless of model size and data size. We then propose SiCL, a DNN-based SCL method that predicts a skeleton matrix together with a v-tensor (a third-order tensor representing the v-structures). According to the Markov Equivalence Class (MEC) theory, both the skeleton and the v-structures are *identifiable* causal structures under the canonical MEC setting, so predictions about skeleton and v-structures do not suffer from the identifiability limit in causal discovery, thus SiCL can avoid the systematic bias in Node-Edge architecture, and enable consistent estimators for causal discovery. Moreover, SiCL is also equipped with a specially designed pairwise encoder module with a unidirectional attention layer to model both internal and external relationships of pairs of nodes. Experimental results …
Poster
Qiran Dong · Paul Grigas · Vishal Gupta
[ Hall A-E ]
Abstract
Many applications require minimizing a family of optimization problems indexed by some hyperparameter $\lambda \in \Lambda$ to obtain an entire solution path. Traditional approaches proceed by discretizing $\Lambda$ and solving a series of optimization problems. We propose an alternative approach that parameterizes the solution path with a set of basis functions and solves a \emph{single} stochastic optimization problem to learn the entire solution path. Our method offers substantial complexity improvements over discretization. When using constant-step size SGD, the uniform error of our learned solution path relative to the true path exhibits linear convergence to a constant related to the expressiveness of the basis. When the true solution path lies in the span of the basis, this constant is zero. We also prove stronger results for special cases common in machine learning: When $\lambda \in [-1, 1]$ and the solution path is $\nu$-times differentiable, constant step-size SGD learns a path with $\epsilon$ uniform error after at most $O(\epsilon^{\frac{1}{1-\nu}} \log(1/\epsilon))$ iterations, and when the solution path is analytic, it only requires $O\left(\log^2(1/\epsilon)\log\log(1/\epsilon)\right)$. By comparison, the best-known discretization schemes in these settings require at least $O(\epsilon^{-1/2})$ discretization points (and even more gradient calls). Finally, we propose an adaptive variant of our method that …
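A toy instance of the proposed idea under assumptions of my own (ridge regression as the parametric family in $\lambda$, a monomial basis, uniform sampling of $\lambda$, and constant-step SGD); it only illustrates parameterizing the path with basis functions and solving a single stochastic problem:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 5, 6                          # samples, dimension, basis size
A = rng.normal(size=(n, d)); b = rng.normal(size=n)

def basis(lam):                               # simple monomial basis (an assumption)
    return np.array([lam ** j for j in range(k)])

def objective_grad(Theta, lam):
    """Gradient in Theta of f(x(lam), lam) = ||A x - b||^2 + lam ||x||^2,
    where the path is parameterized as x(lam) = Theta @ basis(lam)."""
    phi = basis(lam)
    x = Theta @ phi
    grad_x = 2.0 * A.T @ (A @ x - b) + 2.0 * lam * x
    return np.outer(grad_x, phi)              # chain rule through the basis

Theta = np.zeros((d, k))
for _ in range(20_000):                       # a single stochastic optimization over lam
    lam = rng.uniform(0.0, 1.0)
    Theta -= 1e-3 * objective_grad(Theta, lam)

# compare the learned path to the exact ridge solution at one lambda
lam = 0.3
x_exact = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)
print(np.linalg.norm(Theta @ basis(lam) - x_exact))
```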
Poster
Mayleen Cortez-Rodriguez · Matthew Eichhorn · Christina Yu
[ Hall A-E ]
Abstract
Estimating causal effects under interference is pertinent to many real-world settings. Recent work with low-order potential outcomes models uses a rollout design to obtain unbiased estimators that require no interference network information. However, the required extrapolation can lead to prohibitively high variance. To address this, we propose a two-stage experiment that selects a sub-population in the first stage and restricts treatment rollout to this sub-population in the second stage. We explore the role of clustering in the first stage by analyzing the bias and variance of a polynomial interpolation-style estimator under this experimental design. Bias increases with the number of edges cut in the clustering of the interference network, but variance depends on qualities of the clustering that relate to homophily and covariate balance. There is a tension between clustering objectives that minimize the number of cut edges versus those that maximize covariate balance across clusters. Through simulations, we explore a bias-variance trade-off and compare the performance of the estimator under different clustering strategies.
Poster
Yusuke Tanaka · Takaharu Yaguchi · Tomoharu Iwata · Naonori Ueda
[ Hall A-E ]
Abstract
Operator learning has received significant attention in recent years, with the aim of learning a mapping between function spaces. Prior works have proposed deep neural networks (DNNs) for learning such mappings, enabling the learning of solution operators of partial differential equations (PDEs). However, these works still struggle to learn dynamics that obey the laws of physics. This paper proposes Energy-consistent Neural Operators (ENOs), a general framework for learning solution operators of PDEs that follow the energy conservation or dissipation law, from observed solution trajectories. We introduce a novel penalty function inspired by the energy-based theory of physics for training, in which the functional derivative is calculated using automatic differentiation, allowing one to bias the outputs of the DNN-based solution operators toward appropriate energetic behavior without explicit PDEs. Experiments on multiple systems show that ENO outperforms existing DNN models in predicting solutions from data, especially in super-resolution settings.
Poster
Matthieu Carreau · Roi Naveiro · William Caballero
[ Hall A-E ]
Abstract
Research in adversarial machine learning (AML) has shown that statistical models are vulnerable to maliciously altered data. However, despite advances in Bayesian machine learning models, most AML research remains concentrated on classical techniques. Therefore, we focus on extending the white-box model poisoning paradigm to attack generic Bayesian inference, highlighting its vulnerability in adversarial contexts. A suite of attacks is developed that allows an attacker to steer the Bayesian posterior toward a target distribution through the strategic deletion and replication of true observations, even when only sampling access to the posterior is available. Analytic properties of these algorithms are proven and their performance is empirically examined in both synthetic and real-world scenarios. With relatively little effort, the attacker is able to substantively alter the Bayesian's beliefs and, by accepting more risk, they can mold these beliefs to their will. By carefully constructing the adversarial posterior, surgical poisoning is achieved such that only targeted inferences are corrupted and others are minimally disturbed.
Poster
Julie Alberge · Vincent Maladiere · Olivier Grisel · Judith Abécassis · Gaël Varoquaux
[ Hall A-E ]
Abstract
When dealing with right-censored data, where some outcomes are missing due to a limited observation period, survival analysis, also known as *time-to-event analysis*, focuses on predicting the time until an event of interest occurs. Multiple classes of outcomes lead to a classification variant: predicting the most likely event, a less explored area known as *competing risks*. Classic competing risks models couple architecture and loss, limiting scalability. To address these issues, we design a strictly proper censoring-adjusted separable scoring rule, allowing optimization on a subset of the data as each observation is evaluated independently. The loss estimates outcome probabilities and enables stochastic optimization for competing risks, which we use for efficient gradient boosting trees. **SurvivalBoost** not only outperforms 12 state-of-the-art models across several metrics on 4 real-life datasets, both in competing risks and survival settings, but also provides excellent calibration, the ability to predict across any time horizon, and computation times faster than existing methods.
Poster
Rafał Karczewski · Samuel Kaski · Markus Heinonen · Vikas Garg
[ Hall A-E ]
Abstract
Several generative models with elaborate training and sampling procedures have been proposed to accelerate structure-based drug design (SBDD); however, their empirical performance turns out to be suboptimal. We seek to better understand this phenomenon from both theoretical and empirical perspectives. Since most of these models apply graph neural networks (GNNs), one may suspect that they inherit the representational limitations of GNNs. We analyze this aspect, establishing the first such results for protein-ligand complexes. A plausible counterview may attribute the underperformance of these models to their excessive parameterizations, inducing expressivity at the expense of generalization. We investigate this possibility with a simple metric-aware approach that learns an economical surrogate for affinity to infer an unlabelled molecular graph and optimizes for labels conditioned on this graph and molecular properties. The resulting model achieves state-of-the-art results using 100x fewer trainable parameters and affords up to 1000x speedup. Collectively, our findings underscore the need to reassess and redirect the existing paradigm and efforts for SBDD. Code is available at https://github.com/rafalkarczewski/SimpleSBDD.
Poster
Abdellah Rahmani · Pascal Frossard
[ Hall A-E ]
Abstract
Understanding causal relationships in multivariate time series is essential for predicting and controlling dynamic systems in fields like economics, neuroscience, and climate science. However, existing causal discovery methods often assume stationarity, limiting their effectiveness when time series consist of sequential regimes, i.e., consecutive temporal segments with unknown boundaries and changing causal structures. In this work, we first introduce a framework to describe and model such time series. Then, we present CASTOR, a novel method that concurrently learns the Directed Acyclic Graph (DAG) for each regime while determining the number of regimes and their sequential arrangement. CASTOR optimizes the data log-likelihood using an expectation-maximization algorithm, alternating between assigning regime indices (expectation step) and inferring causal relationships in each regime (maximization step). We establish the identifiability of the regimes and DAGs within our framework. Extensive experiments show that CASTOR consistently outperforms existing causal discovery models in detecting different regimes and learning their DAGs across various settings, including linear and nonlinear causal relationships, on both synthetic and real-world datasets.
Poster
Virginie Loison · Guillaume Staerman · Thomas Moreau
[ Hall A-E ]
Abstract
Physiological signal analysis often involves identifying events crucial to understanding biological dynamics. Many methods have been proposed to detect them, from handcrafted and supervised approaches to unsupervised techniques. All these methods tend to produce spurious events, mainly as they detect each event independently. This work introduces UNHaP (Unmix Noise from Hawkes Processes), a novel approach addressing the joint learning of temporal structures in events and the removal of spurious detections. By treating the event detection output as a mixture of structured Hawkes and unstructured Poisson events, UNHaP efficiently unmixes these processes and estimates their parameters. This approach significantly enhances event distribution characterization while minimizing false detection rates on simulated and real data.
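A hedged sketch of the modeling assumption (the exponential kernel and all parameter values are illustrative stand-ins, not UNHaP's estimates): detections are treated as a superposition of a structured Hawkes process and an unstructured homogeneous Poisson process, so each detection can be attributed probabilistically to one of the two components.

```python
# Hedged sketch of a Hawkes + Poisson superposition (illustrative only).
import numpy as np

mu, alpha, beta = 0.2, 0.8, 2.0      # Hawkes baseline, excitation, decay (assumed values)
noise_rate = 0.5                     # homogeneous Poisson rate for spurious detections

def hawkes_intensity(t, past_events):
    """Intensity of the structured process with an exponential kernel."""
    past = np.asarray([s for s in past_events if s < t])
    return mu + alpha * np.sum(np.exp(-beta * (t - past)))

events = [0.5, 0.9, 2.3]             # previously detected (structured) events
t_query = 2.5
structured = hawkes_intensity(t_query, events)
p_structured = structured / (structured + noise_rate)
print(f"P(structured | event at t={t_query}) ~ {p_structured:.2f}")
```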
Poster
Abhinav Agrawal · Justin Domke
[ Hall A-E ]
Abstract
Normalizing flow-based variational inference (flow VI) is a promising approximate inference approach, but its performance remains inconsistent across studies. Numerous algorithmic choices influence flow VI's performance. We conduct a step-by-step analysis to disentangle the impact of some of the key factors: capacity, objectives, gradient estimators, number of gradient estimates (batch size), and step-sizes. Each step examines one factor while neutralizing others using insights from the previous steps and/or using extensive parallel computation. To facilitate high-fidelity evaluation, we curate a benchmark of synthetic targets that represent common posterior pathologies and allow for exact sampling. We provide specific recommendations for different factors and propose a flow VI recipe that matches or surpasses leading turnkey Hamiltonian Monte Carlo (HMC) methods.
Poster
Danial Dervovic · Michael Cashmore
[ Hall A-E ]
Abstract
Missing data in supervised learning is well-studied, but the specific issue of missing labels during model evaluation has been overlooked. Ignoring samples with missing values, a common solution, can introduce bias, especially when data is Missing Not At Random (MNAR). We propose a multiple imputation technique for evaluating classifiers using metrics such as precision, recall, and ROC-AUC. This method not only offers point estimates but also a predictive distribution for these quantities when labels are missing. We empirically show that the predictive distribution's location and shape are generally correct, even in the MNAR regime. Moreover, we establish that this distribution is approximately Gaussian and provide finite-sample convergence bounds. Additionally, a robustness proof is presented, confirming the validity of the approximation under a realistic error model.
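A minimal sketch of the multiple-imputation idea under simplifying assumptions (the metric is plain accuracy, the missingness is generated at random for the toy data, and the classifier's own scores double as the imputation model, which in practice would be a separately fitted probability model):

```python
# Minimal sketch (assumptions, not the paper's implementation): evaluate a
# classifier when some test labels are missing, by multiply imputing the
# missing labels and reporting the resulting distribution of the metric.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
p_hat = rng.uniform(0.05, 0.95, size=n)          # classifier scores on test points
y = (rng.uniform(size=n) < p_hat).astype(int)    # true labels (for simulation only)
observed = rng.uniform(size=n) < 0.6             # ~40% of labels are missing

def impute_once():
    """Draw one completed label vector from the imputation model P(y=1 | x)."""
    y_imp = y.copy()
    missing = ~observed
    y_imp[missing] = (rng.uniform(size=missing.sum()) < p_hat[missing]).astype(int)
    return y_imp

def accuracy(labels):
    return np.mean((p_hat > 0.5).astype(int) == labels)

draws = np.array([accuracy(impute_once()) for _ in range(200)])
print(f"imputed accuracy: {draws.mean():.3f} +/- {draws.std():.3f}")

cc = np.mean((p_hat[observed] > 0.5).astype(int) == y[observed])
print(f"complete-case estimate: {cc:.3f}")
```

The spread of the imputed draws is what provides the predictive distribution over the metric, rather than a single point estimate from the observed labels alone.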
Poster
Kaiqi Jiang · Wenzhe Fan · Mao Li · Xinhua Zhang
[ Hall A-E ]
Abstract
Fairness-aware classification models have gained increasing attention in recent years as concerns grow on discrimination against some demographic groups. Most existing models require full knowledge of the sensitive features, which can be impractical due to privacy, legal issues, and an individual’s fear of discrimination. The key challenge we will address is the group dependency of the unavailability, e.g., people of some age range may be more reluctant to reveal their age. Our solution augments general fairness risks with probabilistic imputations of the sensitive features, while jointly learning the group-conditionally missing probabilities in a variational auto-encoder. Our model is demonstrated effective on both image and tabular datasets, achieving an improved balance between accuracy and fairness.
Poster
Han Cui · Zhiyuan Yu · Jingbo Liu
[ Hall A-E ]
Abstract
Recently, Approximate Message Passing (AMP) has been integrated with stochastic localization (diffusion model) by providing a computationally efficient estimator of the posterior mean. Existing (rigorous) analysis typically proves the success of sampling for sufficiently small noise, but determining the exact threshold involves several challenges. In this paper, we focus on sampling from the posterior in the linear inverse problem, with an i.i.d. random design matrix, and show that the threshold for sampling coincides with that of posterior mean estimation. We prove convergence in smoothed KL divergence whenever the noise variance $\Delta$ is below $\Delta_{\rm AMP}$, which is the computation threshold for mean estimation introduced in (Barbier et al., 2020). We also show convergence in the Wasserstein distance under the same threshold assuming a dimension-free bound on the operator norm of the posterior covariance matrix, a condition strongly suggested by recent breakthroughs on operator norm bounds in similar replica symmetric systems. A key observation in our analysis is that phase transition does not occur along the sampling and interpolation paths assuming $\Delta<\Delta_{\rm AMP}$.
Poster
Yihan Zhou · Eric Price · Trung Nguyen
[ Hall A-E ]
Abstract
We address the problem of active logistic regression in the realizable setting. It is well known that active learning can require exponentially fewer label queries compared to passive learning, in some cases using $\log \frac{1}{\varepsilon}$ rather than $\mathrm{poly}(1/\varepsilon)$ samples to get error $\varepsilon$ larger than the optimum. We present the first algorithm that is polynomially competitive with the optimal algorithm on every input instance, up to factors polylogarithmic in the error and domain size. In particular, if any algorithm achieves sample complexity polylogarithmic in $\varepsilon$, so does ours. Our algorithm is based on efficient sampling and can be extended to learn a more general class of functions. We further support our theoretical results with experiments demonstrating performance gains for logistic regression compared to existing active learning algorithms.
Poster
Bailey Andrew · David Westhead · Luisa Cutillo
[ Hall A-E ]
Abstract
Multi-axis graphical modelling techniques allow us to perform network inference without making independence assumptions. This is done by replacing the independence assumption with a weaker assumption about the interaction between the axes; there are several choices for which assumption to use. In single-cell RNA sequencing data, genes may interact differently depending on whether they are expressed in the same cell, or in different cells. Unfortunately, current methods are not able to make this distinction. In this paper, we address this problem by introducing the strong product model for Gaussian graphical modelling.
Poster
Koshi Watanabe · Keisuke Maeda · Takahiro Ogawa · Miki Haseyama
[ Hall A-E ]
Abstract
Dimensionality reduction (DR) offers interpretable representations of complex high-dimensional data, and recent DR methods have leveraged hyperbolic geometry to obtain faithful low-dimensional embeddings of high-dimensional hierarchical relationships. However, existing methods are dependent on neighbor embedding, which frequently ruins the continuous nature of the hierarchical structures. This paper proposes hyperboloid Gaussian process latent variable models (hGP-LVMs) to embed high-dimensional hierarchical data while preserving the implicit continuity via nonparametric estimation. We adopt generative modeling using the GP, which provides effective hierarchical embedding and executes ill-posed hyperparameter tuning. This paper presents three variants of the proposed models that employ original point, sparse point, and Bayesian estimations, and we establish their learning algorithms by incorporating the Riemannian optimization and active approximation scheme of the GP-LVM. In addition, we employ the reparameterization trick for scalable learning of the latent variables in the Bayesian estimation method. The proposed hGP-LVMs were applied to several datasets, and the results demonstrate their ability to represent high-dimensional hierarchies in low-dimensional spaces.
Poster
Hao Liu · Junze Ye · Jose Blanchet · NIAN SI
[ Hall A-E ]
Abstract
We introduce ScoreFusion, a theoretically grounded method for fusing multiple pre-trained diffusion models that are assumed to generate from auxiliary populations. ScoreFusion is particularly useful for enhancing the generative modeling of a target population with limited observed data. Our starting point considers the family of KL barycenters of the auxiliary populations, which is proven to be an optimal parametric class in the KL sense, but difficult to learn. Nevertheless, by recasting the learning problem as score matching in denoising diffusion, we obtain a tractable way of computing the optimal KL barycenter weights. We prove a dimension-free sample complexity bound in total variation distance, provided that the auxiliary models are well-fitted for their own task and the auxiliary tasks combined capture the target well. The sample efficiency of ScoreFusion is demonstrated by learning handwritten digits. We also provide a simple adaptation of a Stable Diffusion denoising pipeline that enables sampling from the KL barycenter of two auxiliary checkpoints; on a portrait generation task, our method produces faces that enhance population heterogeneity relative to the auxiliary distributions.
Poster
Tatsuya Matsukawa · Tomohiro Shiraishi · Shuichi Nishino · Teruyuki Katsuoka · Ichiro Takeuchi
[ Hall A-E ]
Abstract
Auto Feature Engineering (AFE) plays a crucial role in developing practical machine learning pipelines by automating the transformation of raw data into meaningful features that enhance model performance. By generating features in a data-driven manner, AFE enables the discovery of important features that may not be apparent through human experience or intuition. On the other hand, since AFE generates features based on data, there is a risk that these features may be overly adapted to the data, making it essential to assess their reliability appropriately. Unfortunately, because most AFE problems are formulated as combinatorial search problems and solved by heuristic algorithms, it has been challenging to theoretically quantify the reliability of generated features. To address this issue, we propose a new statistical test for generated features by AFE algorithms based on a framework called selective inference. As a proof of concept, we consider a simple class of tree search-based heuristic AFE algorithms, and consider the problem of testing the generated features when they are used in a linear model. The proposed test can quantify the statistical significance of the generated features in the form of $p$-values, enabling theoretically guaranteed control of the risk of false findings.
Poster
Gunnar König · Eric Günther · Ulrike von Luxburg
[ Hall A-E ]
Abstract
In explainable machine learning, global feature importance methods try to determine how much each individual feature contributes to predicting the target variable, resulting in one importance score for each feature. But often, predicting the target variable requires interactions between several features (such as in the XOR function), and features might have complex statistical dependencies that allow one feature to be partially replaced by another. In commonly used feature importance scores, these cooperative effects are conflated with the features' individual contributions, making them prone to misinterpretations. In this work, we derive DIP, a new mathematical decomposition of individual feature importance scores that disentangles three components: the standalone contribution and the contributions stemming from interactions and dependencies. We prove that the DIP decomposition is unique and show how it can be estimated in practice. Based on these results, we propose a new visualization of feature importance scores that clearly illustrates the different contributions.
Poster
Henry Yuchi · Shixiang Zhu · Li Dong · Yigit Arisoy · Matthew Spencer
[ Hall A-E ]
Abstract
Modeling and analysis for event series generated by users of heterogeneous behavioral patterns are closely involved in our daily lives, including credit card fraud detection, online platform user recommendation, and social network analysis. The most commonly adopted approach to this task is to assign users to behavior-based categories and analyze each of them separately. However, this requires extensive data to fully understand the user behavior, presenting challenges in modeling newcomers without significant historical knowledge. In this work, we propose a novel discrete event prediction framework for new users with limited history, without needing to know the user's category. We treat the user event history as the "treatment" for future events and the user category as the key confounder. Thus, the prediction problem can be framed as counterfactual outcome estimation, where each event is re-weighted by its inverse propensity score. We demonstrate the improved performance of the proposed framework with a numerical simulation study and two real-world applications, including Netflix rating prediction and seller contact prediction for customer support at Amazon.
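The reweighting step can be illustrated with a toy inverse-propensity-weighting example; all variables, the binary "treatment" summary of the history, and the logistic propensity model below are illustrative stand-ins rather than the paper's framework.

```python
# Minimal sketch of inverse-propensity reweighting with a confounding category.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 5000
category = rng.integers(0, 3, size=n)                                 # latent user category (confounder)
history = (rng.uniform(size=n) < 0.2 + 0.25 * category).astype(int)   # "treatment" summary of the history
outcome = 0.5 * history + 0.3 * category + rng.normal(scale=0.1, size=n)

# Propensity model: P(history = 1 | category)
ps_model = LogisticRegression().fit(category.reshape(-1, 1), history)
ps = ps_model.predict_proba(category.reshape(-1, 1))[:, 1]
w = np.where(history == 1, 1.0 / ps, 1.0 / (1.0 - ps))                # inverse propensity weights

# IPW estimate of the effect of the history "treatment" on the outcome
ate = np.average(outcome[history == 1], weights=w[history == 1]) \
    - np.average(outcome[history == 0], weights=w[history == 0])
print(f"IPW-estimated effect: {ate:.3f} (true effect 0.5)")
```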
Poster
Maximilian Fleissner · Gautham Anil · Debarghya Ghoshdastidar
[ Hall A-E ]
Abstract
The NTK is a widely used tool in the theoretical analysis of deep learning, allowing us to look at supervised deep neural networks through the lens of kernel regression. Recently, several works have investigated kernel models for self-supervised learning, hypothesizing that these also shed light on the behavior of wide neural networks by virtue of the NTK. However, it remains an open question to what extent this connection is mathematically sound --- it is a common misconception that the kernel behavior of wide neural networks emerges irrespective of the loss function they are trained on. In this paper, we bridge the gap between the NTK and self-supervised learning, focusing on two-layer neural networks trained under the Barlow Twins loss. We prove that the NTK of Barlow Twins indeed becomes constant as the width of the network approaches infinity. Our analysis technique differs from previous works on the NTK and may be of independent interest. Overall, our work provides a first justification for the use of classic kernel theory to understand self-supervised learning of wide neural networks. Building on this result, we derive generalization error bounds for kernelized Barlow Twins and connect them to neural networks of …
Poster
N. Benjamin Erichson · Soon Hoe Lim · Michael Mahoney
[ Hall A-E ]
Abstract
In this paper, we present a novel approach to modeling long-term dependencies in sequential data by introducing a gated recurrent unit (GRU) with a weighted time-delay feedback mechanism. Our proposed model, named $\tau$-GRU, is a discretized version of a continuous-time formulation of a recurrent unit, where the dynamics are governed by delay differential equations (DDEs). We prove the existence and uniqueness of solutions for the continuous-time model and show that the proposed feedback mechanism can significantly improve the modeling of long-term dependencies. Our empirical results indicate that $\tau$-GRU outperforms state-of-the-art recurrent units and gated recurrent architectures on a range of tasks, achieving faster convergence and better generalization.
Poster
Ali Azizpour · Nicolas Zilberstein · Santiago Segarra
[ Hall A-E ]
Abstract
Graphons are continuous models that represent the structure of graphs and allow the generation of graphs of varying sizes. We propose Scalable Implicit Graphon Learning (SIGL), a scalable method that combines implicit neural representations (INRs) and graph neural networks (GNNs) to estimate a graphon from observed graphs. Unlike existing methods, which face important limitations like fixed resolution and scalability issues, SIGL learns a continuous graphon at arbitrary resolutions. GNNs are used to determine the correct node ordering, improving graph alignment. Furthermore, we characterize the asymptotic consistency of our estimator, showing that more expressive INRs and GNNs lead to consistent estimators. We evaluate SIGL in synthetic and real-world graphs, showing that it outperforms existing methods and scales effectively to larger graphs, making it ideal for tasks like graph data augmentation.
Poster
Son Luu · Zuheng Xu · Nikola Surjanovic · Miguel Biron-Lattes · Trevor Campbell · Alexandre Bouchard-Côté
[ Hall A-E ]
Abstract
The Hamiltonian Monte Carlo (HMC) algorithm is often lauded for its ability to effectively sample from high-dimensional distributions. In this paper we challenge the presumed domination of HMC for the Bayesian analysis of GLMs. By utilizing the structure of the compute graph rather than the graphical model, we reduce the time per sweep of a full-scan Gibbs sampler from $O(d^2)$ to $O(d)$, where $d$ is the number of GLM parameters. Our simple changes to the implementation of the Gibbs sampler allow us to perform Bayesian inference on high-dimensional GLMs that are practically infeasible with traditional Gibbs sampler implementations. We empirically demonstrate a substantial increase in effective sample size per unit time when comparing our Gibbs algorithms to state-of-the-art HMC algorithms. While Gibbs is superior in terms of dimension scaling, neither Gibbs nor HMC dominates the other: we provide numerical and theoretical evidence that HMC retains an edge in certain circumstances thanks to its advantageous condition number scaling. Interestingly, for GLMs of fixed data size, we observe that increasing dimensionality can stabilize or even decrease condition number, shedding light on the empirical advantage of our efficient Gibbs sampler.
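The speed-up comes from exploiting the compute graph; a plausible illustration of the kind of trick involved (not the authors' implementation) is to cache the linear predictor and the prior's sum of squares and update both incrementally when a single coordinate changes, avoiding a full recomputation per coordinate. The sketch below uses a Metropolis-within-Gibbs update on a logistic GLM for concreteness.

```python
# Hedged illustration of incremental per-coordinate updates in a GLM sampler.
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 200
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = rng.integers(0, 2, size=n)

def log_post(eta, ssq):
    # Logistic log-likelihood with logit eta, plus a standard normal prior (illustrative model).
    return np.sum(y * eta - np.logaddexp(0.0, eta)) - 0.5 * ssq

beta = np.zeros(d)
eta = X @ beta                      # cached linear predictor, maintained incrementally
ssq = 0.0                           # cached sum of squares of beta (for the prior)
for sweep in range(5):
    for j in range(d):
        prop = beta[j] + 0.5 * rng.normal()             # random-walk proposal for one coordinate
        eta_prop = eta + X[:, j] * (prop - beta[j])     # O(n) update, no full X @ beta recompute
        ssq_prop = ssq - beta[j] ** 2 + prop ** 2
        log_acc = log_post(eta_prop, ssq_prop) - log_post(eta, ssq)
        if np.log(rng.uniform()) < log_acc:
            beta[j], eta, ssq = prop, eta_prop, ssq_prop

print("posterior sample (first 5 coords):", np.round(beta[:5], 3))
```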
Poster
Ruichen Luo · Sebastian Stich · Samuel Horvath · Martin Takac
[ Hall A-E ]
Abstract
LocalSGD and SCAFFOLD are widely used methods in distributed stochastic optimization, with numerous applications in machine learning, large-scale data processing, and federated learning. However, rigorously establishing their theoretical advantages over simpler methods, such as minibatch SGD (MbSGD), has proven challenging, as existing analyses often rely on strong assumptions, unrealistic premises, or overly restrictive scenarios. In this work, we revisit the convergence properties of LocalSGD and SCAFFOLD under a variety of existing or weaker conditions, including gradient similarity, Hessian similarity, weak convexity, and Lipschitz continuity of the Hessian. Our analysis shows that (i) LocalSGD achieves faster convergence compared to MbSGD for weakly convex functions without requiring stronger gradient similarity assumptions; (ii) LocalSGD benefits significantly from higher-order similarity and smoothness; and (iii) SCAFFOLD demonstrates faster convergence than MbSGD for a broader class of non-quadratic functions. These theoretical insights provide a clearer understanding of the conditions under which LocalSGD and SCAFFOLD outperform MbSGD.
Poster
Yiming Zhang · Parthe Pandit
[ Hall A-E ]
Abstract
In this paper we derive a novel iterative algorithm for learning kernel machines. Our algorithm, $\textsf{AxlePro}$, extends the $\textsf{EigenPro}$ family of algorithms via momentum-based acceleration. $\textsf{AxlePro}$ can be applied to train kernel machines with arbitrary positive semidefinite kernels. We provide a convergence guarantee for the algorithm and demonstrate the speed-up of $\textsf{AxlePro}$ over competing algorithms via numerical experiments. Furthermore, we also derive a version of $\textsf{AxlePro}$ to train large kernel models over arbitrarily large datasets.
Poster
Albert Saiapin · Kim Batselier
[ Hall A-E ]
Abstract
Many approximations were suggested to circumvent the cubic complexity of kernel-based algorithms, allowing their application to large-scale datasets. One strategy is to consider the primal formulation of the learning problem by mapping the data to a higher-dimensional space using tensor-product structured polynomial and Fourier features. The curse of dimensionality due to these tensor-product features was effectively solved by a tensor network reparameterization of the model parameters. However, another important aspect of model training — identifying optimal feature hyperparameters — has not been addressed and is typically handled using the standard cross-validation approach. In this paper, we introduce the Feature Learning (FL) model, which addresses this issue by representing tensor-product features as a learnable Canonical Polyadic Decomposition (CPD). By leveraging this CPD structure, we efficiently learn the hyperparameters associated with different features alongside the model parameters using an Alternating Least Squares (ALS) optimization method. We demonstrate the effectiveness of the FL model through experiments on real data of various dimensionality and scale. The results show that the FL model can be consistently trained 3-5 times faster than a standard cross-validated model while achieving prediction quality on par with it.
Poster
Sahil Sidheekh · Pranuthi Tenali · Saurabh Mathur · Erik Blasch · Kristian Kersting · Sriraam Natarajan
[ Hall A-E ]
Abstract
We consider the problem of late multimodal fusion for discriminative learning. Motivated by noisy, multi-source domains that require understanding the reliability of each data source, we explore the notion of credibility in the context of multimodal fusion. We propose a combination function that uses probabilistic circuits (PCs) to combine predictive distributions over individual modalities. We also define a probabilistic measure to evaluate the credibility of each modality via inference queries over the PC. Our experimental evaluation demonstrates that our fusion method can reliably infer credibility while being competitive with the state-of-the-art.
Poster
Takahiro Kawashima · Hideitsu Hino
[ Hall A-E ]
Abstract
Positive and negative dependence are fundamental concepts that characterize the attractive and repulsive behavior of random subsets. Although some probabilistic models are known to exhibit positive or negative dependence, it is challenging to seamlessly bridge them with a practicable probabilistic model. In this study, we introduce a new family of distributions, named the discrete kernel point process (DKPP), which includes determinantal point processes and parts of Boltzmann machines. We also develop some computational methods for probabilistic operations and inference with DKPPs, such as calculating marginal and conditional probabilities and learning the parameters. Our numerical experiments demonstrate the controllability of positive and negative dependence and the effectiveness of the computational methods for DKPPs.
Poster
Marcus Häggbom · Morten Karlsmark · Joakim Andén
[ Hall A-E ]
Abstract
Microcanonical gradient descent is a sampling procedure for energy-based models allowing for efficient sampling of distributions in high dimension. It works by transporting samples from a high-entropy distribution, such as Gaussian white noise, to a low-energy region using gradient descent. We put this model in the framework of normalizing flows, showing how it can often overfit by losing an unnecessary amount of entropy in the descent. As a remedy, we propose a mean-field microcanonical gradient descent that samples several weakly coupled data points simultaneously, allowing for better control of the entropy loss while paying little in terms of likelihood fit. We study these models in the context of stationary time series and 2D textures.
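A minimal sketch of microcanonical gradient descent on an assumed toy quadratic energy (the paper's energies come from models of stationary time series and textures, and its contribution is the mean-field coupling that controls entropy loss, which is not shown here):

```python
# Minimal sketch: transport white-noise samples to a low-energy region by
# gradient descent on an energy function E(x). Toy quadratic energy assumed.
import numpy as np

rng = np.random.default_rng(4)
target_mean = np.array([2.0, -1.0])

def energy(x):
    return 0.5 * np.sum((x - target_mean) ** 2, axis=-1)

def energy_grad(x):
    return x - target_mean

# Start from a high-entropy distribution (Gaussian white noise) ...
samples = rng.normal(size=(1000, 2))
# ... and descend the energy for a fixed number of steps.
step = 0.1
for _ in range(50):
    samples -= step * energy_grad(samples)

print("mean energy after descent:", energy(samples).mean())
```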
Poster
Manan Saxena · Tinghua Chen · Justin Silverman
[ Hall A-E ]
Abstract
Many scientific fields collect longitudinal count compositional data. Each observation is a multivariate count vector, where the total counts are arbitrary, and the information lies in the relative frequency of the counts. Multiple authors have proposed Bayesian Multinomial Logistic-Normal Dynamic Linear Models (MLN-DLMs) as a flexible approach to modeling these data. However, adoption of these methods has been limited by computational challenges. This article develops an efficient and accurate approach to posterior state estimation, called Fenrir. Our approach relies on a novel algorithm for MAP estimation and an accurate approximation to a key posterior marginal of the model. As there are no equivalent methods against which we can compare, we also develop an optimized Stan implementation of MLN-DLMs. Our experiments suggest that Fenrir can be three orders of magnitude more efficient than Stan and can even be incorporated into larger sampling schemes for joint inference of model hyperparameters. Our methods are made available to the community as a user-friendly software library written in C++ with an R interface.
Poster
Vahid Shahverdi · Giovanni Luca Marchetti · Kathlén Kohn
[ Hall A-E ]
Abstract
We study convolutional neural networks with monomial activation functions. Specifically, we prove that their parameterization map is regular and is an isomorphism almost everywhere, up to rescaling the filters. By leveraging tools from algebraic geometry, we explore the geometric properties of the image in function space of this map -- typically referred to as the neuromanifold. In particular, we compute the dimension and the degree of the neuromanifold, which measure the expressivity of the model, and describe its singularities. Moreover, for a generic large dataset, we derive an explicit formula that quantifies the number of critical points arising in the optimization of a regression loss.
Poster
Leonel Rozo · Miguel González-Duque · Noémie Jaquier · Soren Hauberg
[ Hall A-E ]
Abstract
Latent variable models are powerful tools for learning low-dimensional manifolds from high-dimensional data. However, when dealing with constrained data such as unit-norm vectors or symmetric positive-definite matrices, existing approaches ignore the underlying geometric constraints or fail to provide meaningful metrics in the latent space. To address these limitations, we propose to learn Riemannian latent representations of such geometric data. To do so, we estimate the pullback metric induced by a Wrapped Gaussian Process Latent Variable Model, which explicitly accounts for the data geometry. This enables us to define geometry-aware notions of distance and shortest paths in the latent space, while ensuring that our model only assigns probability mass to the data manifold. This generalizes previous work and allows us to handle complex tasks in various domains, including robot motion synthesis and analysis of brain connectomes.
Poster
Peiyuan Zhang · Jiaye Teng · Jingzhao Zhang
[ Hall A-E ]
Abstract
This work studies the generalization error of gradient methods. More specifically, we focus on how training steps $T$ and step-size $\eta$ might affect generalization in smooth stochastic convex optimization (SCO) problems. Recent works show that in some cases longer training can hurt generalization. Our work reexamines this for smooth SCO and finds that the conclusion can be case-dependent. In particular, we first study SCO problems when the loss is \emph{realizable}, i.e. an optimal solution minimizes all the data points. Our work provides excess risk lower bounds for Gradient Descent (GD) and Stochastic Gradient Descent (SGD) and finds that longer training may not hurt generalization. In the short training scenario $\eta T = O(n)$ ($n$ is sample size), our lower bounds tightly match and certify the respective upper bounds. However, for the long training scenario where $\eta T = \Omega(n)$, our analysis reveals a gap between the lower and upper bounds, indicating that longer training does hurt generalization for realizable objectives. A conjecture is proposed that the gap can be closed by improving upper bounds, supported by analyses in two special instances. Moreover, besides the realizable setup, we also provide the first tight excess risk lower bounds for GD and SGD under the …
Poster
Sebastian Salazar · Michal Kucer · Yixin Wang · Emily Casleton · David Blei
[ Hall A-E ]
Abstract
This paper introduces posterior mean matching (PMM), a new method for generative modeling that is grounded in Bayesian inference. PMM uses conjugate pairs of distributions to model complex data of various modalities like images and text, offering a flexible alternative to existing methods like diffusion models. PMM models iteratively refine noisy approximations of the target distribution using updates from online Bayesian inference. PMM is flexible because its mechanics are based on general Bayesian models. We demonstrate this flexibility by developing specialized examples: a generative PMM model of real-valued data using the Normal-Normal model, a generative PMM model of count data using a Gamma-Poisson model, and a generative PMM model of discrete data using a Dirichlet-Categorical model. For the Normal-Normal PMM model, we establish a direct connection to diffusion models by showing that its continuous-time formulation converges to a stochastic differential equation (SDE). Additionally, for the Gamma-Poisson PMM, we derive a novel SDE driven by a Cox process, which is a significant departure from traditional Brownian motion-based generative models. PMMs achieve performance that is competitive with generative models for language modeling and image generation.
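A hedged sketch of the Normal-Normal conjugate building block that PMM-style updates rely on (this shows only the online posterior refinement ingredient, with assumed noise levels, not the full generative model):

```python
# Minimal sketch: online conjugate Normal-Normal updates of a posterior mean.
import numpy as np

rng = np.random.default_rng(5)
true_x = 1.7                   # the "clean" value the posterior should concentrate on
obs_var = 0.5                  # known observation noise variance (assumed)

mu, var = 0.0, 10.0            # prior over the unknown mean
for t in range(100):
    y = true_x + np.sqrt(obs_var) * rng.normal()   # noisy observation
    # Conjugate Normal-Normal posterior update
    precision = 1.0 / var + 1.0 / obs_var
    mu = (mu / var + y / obs_var) / precision
    var = 1.0 / precision

print(f"posterior mean {mu:.3f}, posterior sd {np.sqrt(var):.3f}")
```

In the generative setting described above, it is this kind of iteratively refined posterior mean, rather than a fixed target value, that drives the model's noisy approximations toward the data distribution.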
Poster
Pierre Marion · Anna Korba · Peter Bartlett · Mathieu Blondel · Valentin De Bortoli · Arnaud Doucet · Felipe Llinares-López · Courtney Paquette · Quentin Berthet
[ Hall A-E ]
Abstract
Sampling and automatic differentiation are both ubiquitous in modern machine learning. At its intersection, differentiating through a sampling operation, with respect to the parameters of the sampling process, is a problem that is both challenging and broadly applicable. We introduce a general framework and a new algorithm for first-order optimization of parameterized stochastic diffusions, performing jointly, in a single loop, optimization and sampling steps. This approach is inspired by recent advances in bilevel optimization and automatic implicit differentiation, leveraging the point of view of sampling as optimization over the space of probability distributions. We provide theoretical and experimental results showcasing the performance of our method.
Poster
Yiming Wang · Yuxuan Song · Yiqun Wang · Minkai Xu · Rui Wang · Hao Zhou · Wei-Ying Ma
[ Hall A-E ]
Abstract
Retrosynthesis poses a key challenge in biopharmaceuticals, aiding chemists in finding appropriate reactant molecules for given product molecules. With reactants and products represented as 2D graphs, retrosynthesis constitutes a conditional graph-to-graph (G2G) generative task. Inspired by advancements in discrete diffusion models for graph generation, we aim to design a diffusion-based method to address this problem. However, integrating a diffusion-based G2G framework while retaining essential chemical reaction template information presents a notable challenge. Our key innovation involves a multi-stage diffusion process. We decompose the retrosynthesis procedure to first sample external groups from the dummy distribution given products, then generate external bonds to connect products and generated groups. Interestingly, this generation process mirrors the reverse of the widely adopted semi-template retrosynthesis workflow, i.e., from reaction center identification to synthon completion. Based on these designs, we introduce Retrosynthesis Diffusion (RetroDiff), a novel diffusion-based method for the retrosynthesis task. Experimental results demonstrate that RetroDiff surpasses all semi-template methods in accuracy, and outperforms template-based and template-free methods in large-scale scenarios and molecular validity, respectively.
Poster
Didier Chételat · Joseph Cotnareanu · Rylee Thompson · Yingxue Zhang · Mark Coates
[ Hall A-E ]
Abstract
Large language models (LLMs) contain substantial factual knowledge which is commonly elicited by multiple-choice question-answering prompts. Internally, such models process the prompt through multiple transformer layers, building varying representations of the problem within their hidden states. Ultimately, however, only the hidden state corresponding to the final layer and token position is used to predict the answer label. In this work, we propose instead to learn a small separate neural network predictor module on a collection of training questions, which takes the hidden states from all the layers at the last temporal position as input and outputs predictions. In effect, such a framework disentangles the representational abilities of LLMs from their predictive abilities. On a collection of hard benchmarks, our method achieves considerable improvements in performance, sometimes comparable to supervised fine-tuning procedures, but at a fraction of the computational cost.
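A minimal sketch of the predictor-module idea with synthetic stand-in hidden states; the shapes, the MLP head, and the four-way answer space are assumptions for illustration rather than the paper's architecture.

```python
# Minimal sketch: train a small separate predictor on the concatenated
# last-token hidden states from all layers, instead of reading the answer
# off the final layer only. Synthetic stand-in data.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(6)
n_questions, n_layers, hidden_dim = 800, 12, 64
# Stand-in for hidden_states[layer][question, last_token, :] extracted from an LLM.
H = rng.normal(size=(n_questions, n_layers, hidden_dim))
labels = rng.integers(0, 4, size=n_questions)          # multiple-choice answer A-D

X = H.reshape(n_questions, n_layers * hidden_dim)      # concatenate across layers
split = 600
clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=300, random_state=0)
clf.fit(X[:split], labels[:split])
print("held-out accuracy:", clf.score(X[split:], labels[split:]))
```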
Poster
Shinsaku Sakaue · Han Bao · Taira Tsuchiya
[ Hall A-E ]
Abstract
This paper revisits the online learning approach to inverse linear optimization studied by Bärmann et al. (2017), where the goal is to infer an unknown linear objective function of an agent from sequential observations of the agent's input-output pairs. First, we provide a simple understanding of the online learning approach through its connection to online convex optimization of *Fenchel–Young losses*. As a byproduct, we present an offline guarantee on the *suboptimality loss*, which measures how well predicted objective vectors explain the agent's choices, without assuming the optimality of the agent's choices. Second, assuming that there is a gap between optimal and suboptimal objective values in the agent's decision problems, we obtain an upper bound independent of the time horizon $T$ on the sum of suboptimality and *estimate losses*, where the latter measures the quality of solutions recommended by predicted objective vectors. Interestingly, our gap-dependent analysis achieves a faster rate than the standard $O(\sqrt{T})$ regret bound by exploiting structures specific to inverse linear optimization, even though neither the loss functions nor their domains possess desirable properties, such as strong convexity.
Poster
Jong-Ik Park · Srinivasa Pranav · José Moura · Carlee Joe-Wong
[ Hall A-E ]
Abstract
Foundation models are now a major focus of leading technology organizations due to their ability to generalize across diverse tasks. Existing approaches for adapting foundation models to new applications often rely on Federated Learning (FL) and disclose the foundation model weights to clients when using it to initialize the global model. While these methods ensure client data privacy, they compromise model and information security. In this paper, we introduce Federated Learning Aggregation Biased by a Foundation Model (FedBaF), a novel method for dynamically integrating pre-trained foundation model weights during the FL aggregation phase. Unlike conventional methods, FedBaF preserves the confidentiality of the foundation model while still leveraging its power to train more accurate models, especially in non-IID and adversarial scenarios. Our comprehensive experiments use Pre-ResNet and foundation models like Vision Transformer to demonstrate that FedBaF not only matches, but often surpasses the test accuracy of traditional weight initialization methods by up to 11.4\% in IID and up to 15.8\% in non-IID settings. Additionally, FedBaF applied to a Transformer-based language model significantly reduced perplexity by up to 39.2\%.
Poster
Qingshi Sun · Nathan Justin · Andres Gomez · Phebe Vayanos
[ Hall A-E ]
Abstract
Logistic regression models are widely used in the social and behavioral sciences and in high-stakes domains, due to their simplicity and interpretability properties. At the same time, such domains are permeated by distribution shifts, where the distribution generating the data changes between training and deployment. In this paper, we study a distributionally robust logistic regression problem that seeks the model that will perform best against adversarial realizations of the data distribution drawn from a suitably constructed Wasserstein ambiguity set. Our model and solution approach differ from prior work in that we can capture settings where the likelihood of distribution shifts can vary across features, significantly broadening the applicability of our model relative to the state-of-the-art. We propose a graph-based solution approach that can be integrated into off-the-shelf optimization solvers. We evaluate the performance of our model and algorithms on numerous publicly available datasets. Our solution achieves a 408x speed-up relative to the state-of-the-art. Additionally, compared to the state-of-the-art, our model reduces average calibration error by up to 36.19% and worst-case calibration error by up to 41.70%, while increasing the average area under the ROC curve (AUC) by up to 18.02% and worst-case AUC by up to 48.37%.
Poster
Gergely Neu · Nneka Okolo
[ Hall A-E ]
Abstract
We study offline Reinforcement Learning in large infinite-horizon discounted Markov Decision Processes (MDPs) when the reward and transition models are linearly realizable under a known feature map. Starting from the classic linear-program formulation of the optimal control problem in MDPs, we develop a new algorithm that performs a form of gradient ascent in the space of feature occupancies, defined as the expected feature vectors that can potentially be generated by executing policies in the environment. We show that the resulting simple algorithm satisfies strong computational and sample complexity guarantees, achieved under the least restrictive data coverage assumptions known in the literature. In particular, we show that the sample complexity of our method scales optimally with the desired accuracy level and depends on a weak notion of coverage that only requires the empirical feature covariance matrix to cover a single direction in the feature space (as opposed to covering a full subspace). Additionally, our method can be implemented efficiently without requiring any computational oracles, and requires no prior knowledge of the coverage ratio (or even an upper bound on it), which altogether make it the strongest known algorithm for this setting to date.
Poster
Daniel Guzmán Olivares · Philipp Schmidt · Jacek Golebiowski · Artur Bekasov
[ Hall A-E ]
Abstract
Off-policy evaluation can leverage logged data to estimate the effectiveness of new policies in e-commerce, search engines, media streaming services, or automatic diagnostic tools in healthcare. However, the performance of baseline off-policy estimators like IPS deteriorates when the logging policy significantly differs from the evaluation policy. Recent work proposes sharing information across similar actions to mitigate this problem. In this work, we propose an alternative estimator that shares information across similar contexts using clustering. We study the theoretical properties of the proposed estimator, characterizing its bias and variance under different conditions. We also compare the performance of the proposed estimator and existing approaches in various synthetic problems, as well as a real-world recommendation dataset. Our experimental results confirm that clustering contexts improves estimation accuracy, especially in deficient information settings.
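A hedged sketch of the general idea (not the paper's estimator or its bias/variance analysis): importance weights are formed with logging propensities estimated at the level of context clusters rather than per individual context.

```python
# Minimal sketch: IPS-style off-policy value estimate with cluster-level
# logging propensities. All policies and data are illustrative stand-ins.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
n, n_actions, n_clusters = 5000, 5, 10
contexts = rng.normal(size=(n, 3))
logged_actions = rng.integers(0, n_actions, size=n)     # actions chosen by the logging policy
rewards = rng.uniform(size=n)

# Target policy to evaluate: here a uniform policy over actions (assumed for illustration).
target_probs = np.full((n, n_actions), 1.0 / n_actions)

clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(contexts)
# Empirical logging propensities per (cluster, action), with add-one smoothing.
counts = np.ones((n_clusters, n_actions))
for c, a in zip(clusters, logged_actions):
    counts[c, a] += 1
logging_probs = counts / counts.sum(axis=1, keepdims=True)

weights = target_probs[np.arange(n), logged_actions] / logging_probs[clusters, logged_actions]
print("clustered-IPS value estimate:", np.mean(weights * rewards))
```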
Poster
Bijan Mazaheri · Chandler Squires · Caroline Uhler
[ Hall A-E ]
Abstract
Heterogeneous data from multiple populations, sub-groups, or sources can be represented as a "mixture model" with a single latent class influencing all of the observed covariates. Heterogeneity can be resolved at different levels by grouping populations according to different notions of similarity. This paper proposes grouping with respect to the causal response of an intervention or perturbation on the system. This is distinct from previous notions, such as grouping by similar covariate values (e.g., clustering) or similar correlations between covariates (e.g., Gaussian mixture models). To solve the problem, we "synthetically sample" from a counterfactual distribution using higher-order multi-linear moments of the observable data. To understand how these "causal mixtures" fit in with more classical notions, we develop a hierarchy of mixture identifiability.
Poster
Yang Luo · Michael O'Neill
[ Hall A-E ]
Abstract
We develop a new class of self-tuning algorithms to solve a root-finding problem involving a Lipschitz continuous operator, with applications in convex optimization, minimax saddle point problems and variational inequalities. Our methods are adaptive to the unknown, problem-specific parameters, such as the Lipschitz constant and the variance of the stochastic operator. Unlike prior work, our approach does not rely on restrictive assumptions, such as a bounded domain, boundedness of the operator or a light-tailed distribution. We prove a $\tilde{\mathcal{O}}(N^{-1/2})$ average-iterate convergence rate of the restricted merit function under an affine noise assumption, matching the optimal rate up to log factors. In addition, we improve the convergence rate to $\mathcal{O}(N^{-1})$ under a strong growth condition, which characterizes many cutting-edge machine learning models, matching the optimal rate for the \textit{deterministic regime}. Finally, we illustrate the effectiveness of the proposed algorithms through numerical experiments on saddle point problems. Our results suggest that the adaptive step sizes automatically take advantage of the structure of the noise and yield improved convergence in certain settings, such as when the strong growth condition holds. To the best of our knowledge, this is the first method for root-finding problems under mild assumptions that adapts to …
Poster
Sanjeeb Dash · Joao Goncalves · Tian Gao
[ Hall A-E ]
Abstract
Acyclic directed mixed graphs (ADMGs) – graphs that contain both directed and bidirected edges but no directed cycles – are used to model causal and conditional independence relationships between a set of random variables in the presence of latent or unmeasured variables. Bow-free ADMGs, Arid ADMGs, and Ancestral ADMGs (AADMGs) are three widely studied classes of ADMGs, where each class is contained in the previously mentioned class. There are a number of published methods – primarily heuristic ones – to find score-maximizing AADMGs from data. Bow-free and Arid ADMGs can model certain equality restrictions – such as Verma constraints – between observed variables that maximal AADMGs cannot. In this work, we develop the first exact methods – based on integer programming – to find score-maximizing Bow-free and Arid ADMGs. Our methods work for data that follows a continuous Gaussian distribution and for scores that linearly decompose into the sum of scores of c-components of an ADMG. To improve scaling, we develop an effective linear-programming based heuristic that yields solutions with high parent set sizes and/or large districts. We show that our proposed algorithms obtain better scores than other state-of-the-art methods and return graphs that have excellent fits to data.
Poster
Alix Lhéritier · Maurizio Filippone
[ Hall A-E ]
Abstract
Mixture Density Networks (MDNs) can model arbitrarily complex mappings between inputs and mixture densities, enabling flexible conditional density estimation, at the risk of severe overfitting. A Bayesian approach can alleviate this problem by specifying a prior over the parameters of the neural network. However, these priors can be difficult to specify due to the lack of interpretability. We propose a novel neural network construction for conditional mixture densities that allows one to specify the prior in the predictive distribution domain. The construction is based on mapping the targets to the unit hypercube via a diffeomorphism, enabling the use of mixtures of Beta distributions. We prove that the prior predictive distributions are calibrated in the sense that they are equal to the unconditional density function defined by the diffeomorphism. Contrary to Bayesian Gaussian MDNs, which exhibit tied functional and distributional complexity, we show that our construction allows us to decouple them. We propose an extension that models correlations in the covariates via Gaussian copulas, potentially reducing the necessary number of mixture components. Our experiments show competitive performance on standard benchmarks with respect to the state of the art.
Poster
Eduard Tulchinskii · Daria Voronkova · Ilya Trofimov · Evgeny Burnaev · Serguei Barannikov
[ Hall A-E ]
Abstract
Topological methods for comparing weighted graphs are valuable in various learning tasks but often suffer from computational inefficiency on large datasets. We introduce RTD-Lite, a scalable algorithm that efficiently compares topological features, specifically connectivity or cluster structures at arbitrary scales, of two weighted graphs with one-to-one correspondence between vertices. By leveraging minimal spanning trees in auxiliary graphs, RTD-Lite captures topological discrepancies with O(n^2) time and memory complexity. This efficiency enables its application in tasks like dimensionality reduction and neural network training. Experiments on synthetic and real-world datasets demonstrate that RTD-Lite effectively identifies topological differences while significantly reducing computation time compared to existing methods. Moreover, integrating RTD-Lite into neural network training as a loss function component enhances the preservation of topological structures in learned representations. Our code is publicly available at https://github.com/ArGintum/RTD-Lite.
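A hedged sketch of the ingredient RTD-Lite builds on (the actual method compares cluster structures across all scales between two graphs on a shared vertex set; here we only build the two minimum spanning trees and compare their sorted edge weights as a crude discrepancy proxy):

```python
# Minimal sketch: compare the minimum spanning trees of two weighted graphs
# defined on the same vertex set (toy point clouds with Euclidean distances).
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(10)
n = 30
points_a = rng.normal(size=(n, 2))
points_b = points_a + 0.05 * rng.normal(size=(n, 2))   # perturbed copy, same vertices

def mst_weights(points):
    """Sorted edge weights of the Euclidean minimum spanning tree."""
    dists = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    mst = minimum_spanning_tree(dists).toarray()
    return np.sort(mst[mst > 0])

w_a, w_b = mst_weights(points_a), mst_weights(points_b)
print("topological discrepancy proxy:", np.abs(w_a - w_b).sum())
```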
Poster
Fabio Feser · Marina Evangelou
[ Hall A-E ]
Abstract
Tuning the regularization parameter in penalized regression models is an expensive task, requiring multiple models to be fit along a path of parameters. Strong screening rules drastically reduce computational costs by lowering the dimensionality of the input prior to fitting. We develop strong screening rules for group-based Sorted L-One Penalized Estimation (SLOPE) models: Group SLOPE and Sparse-group SLOPE. The developed rules are applicable to the wider family of group-based OWL models, including OSCAR. Our experiments on both synthetic and real data show that the screening rules significantly accelerate the fitting process. The screening rules make it accessible for group SLOPE and sparse-group SLOPE to be applied to high-dimensional datasets, particularly those encountered in genetics.
Poster
Adrien Corenflos · Zheng Zhao · Thomas Schön · Simo Särkkä · Jens Sjölund
[ Hall A-E ]
Abstract
Given an unconditional diffusion model targeting a joint model $\pi(x, y)$, using it to perform conditional simulation $\pi(x \mid y)$ is still largely an open question and is typically achieved by learning conditional drifts to the denoising SDE after the fact. In this work, we express \emph{exact} conditional simulation within the \emph{approximate} diffusion model as an inference problem on an augmented space corresponding to a partial SDE bridge. This perspective allows us to implement efficient and principled particle Gibbs and pseudo-marginal samplers marginally targeting the conditional distribution $\pi(x \mid y)$. Contrary to existing methodology, our methods do not introduce any additional approximation to the unconditional diffusion model aside from the Monte Carlo error. We showcase the benefits and drawbacks of our approach on a series of synthetic and real data examples.
Poster
Jonas Wahl · Jakob Runge
[ Hall A-E ]
Abstract
Assessing the accuracy of the output of causal discovery algorithms is crucial in developing and comparing novel methods. Common evaluation metrics such as the structural Hamming distance are useful for assessing individual links of causal graphs. However, many state-of-the-art causal discovery methods do not output single causal graphs, but rather their Markov equivalence classes (MECs) which encode all of the graph's separation and connection statements. In this work, we propose additional measures of distance that capture the difference in separations of two causal graphs, which link-based distances are not fit to assess. The proposed distances have low polynomial time complexity and are applicable to directed acyclic graphs (DAGs) as well as to maximal ancestral graphs (MAGs) that may contain bidirected edges. We complement our theoretical analysis with toy examples and empirical experiments that highlight the differences to existing comparison metrics.
Poster
Tuan Nguyen · Jay Barrett · Kwang-Sung Jun
[ Hall A-E ]
Abstract
We study the problem of estimating the \emph{value} of the largest mean among $K$ distributions via samples from them (rather than estimating \emph{which} distribution has the largest mean), which arises from various machine learning tasks including Q-learning and Monte Carlo Tree Search (MCTS). While there have been a few proposed algorithms, their performance analyses have been limited to their biases rather than a precise error metric. In this paper, we propose a novel algorithm called HAVER (Head AVERaging) and analyze its mean squared error. Our analysis reveals that HAVER has a compelling performance in two respects. First, HAVER estimates the maximum mean as well as the oracle who knows the identity of the best distribution and reports its sample mean. Second, perhaps surprisingly, HAVER exhibits even better rates than this oracle when there are many distributions near the best one. Both of these improvements are the first of their kind in the literature, and we also prove that the naive algorithm that reports the largest empirical mean does not achieve these bounds. Finally, we confirm our theoretical findings via numerical experiments where we implement HAVER in bandit, Q-learning, and MCTS algorithms. In these experiments, HAVER consistently outperforms the baseline methods, …
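A hedged sketch of the head-averaging idea (the paper's selection rule and threshold differ; the confidence width below is an assumption): rather than reporting the single largest empirical mean, average the arms whose empirical means fall close to the leader.

```python
# Minimal sketch of head averaging versus the naive largest-empirical-mean estimator.
import numpy as np

rng = np.random.default_rng(8)
true_means = np.array([1.0, 0.98, 0.95, 0.5, 0.2])
n_pulls = 200
samples = rng.normal(loc=true_means, scale=1.0, size=(n_pulls, len(true_means)))

emp_means = samples.mean(axis=0)
width = np.sqrt(2.0 * np.log(len(true_means) * n_pulls) / n_pulls)   # assumed confidence width
head = emp_means >= emp_means.max() - width                          # arms near the leader

naive_estimate = emp_means.max()                  # largest empirical mean (positively biased)
head_avg_estimate = emp_means[head].mean()        # head-averaged estimate
print(f"naive: {naive_estimate:.3f}, head average: {head_avg_estimate:.3f}, "
      f"truth: {true_means.max():.3f}")
```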
Poster
Teodor Rotaru · Panagiotis Patrinos · François Glineur
[ Hall A-E ]
Abstract
We investigate a difference-of-convex (DC) formulation where the second term is allowed to be weakly convex. We examine the precise behavior of a single iteration of the difference-of-convex algorithm (DCA), providing a tight characterization of the objective function decrease, distinguishing between six distinct parameter regimes. Our proofs, inspired by the performance estimation framework, are notably simplified compared to related prior research. We subsequently derive sublinear convergence rates for the DCA towards critical points, assuming at least one of the functions is smooth. Additionally, we explore the underexamined equivalence between proximal gradient descent (PGD) and DCA iterations, demonstrating how DCA, a parameter-free algorithm, without the need for a stepsize, serves as a tool for studying the exact convergence rates of PGD. Finally, we propose a method to optimize the DC decomposition to achieve optimal convergence rates, potentially transforming the subtracted function to become weakly convex.
Poster
ziqi Liu
[ Hall A-E ]
Abstract
Long-term time series forecasting is essential in areas like finance and weather prediction. Besides traditional methods that operate in the time domain, many recent models transform time series data into the frequency domain to better capture complex patterns. However, these methods often use filtering techniques to remove certain frequency signals as noise, which may unintentionally discard important information and reduce prediction accuracy. To address this, we propose the Frequency Decomposition Mixture-of-Experts (FreqMoE) model, which dynamically decomposes time series data into frequency bands, each processed by a specialized expert. A gating mechanism adjusts the importance of each expert's output based on frequency characteristics, and the aggregated results are fed into a prediction module that iteratively refines the forecast using residual connections. Our experiments demonstrate that FreqMoE outperforms state-of-the-art models, achieving the best performance on 51 out of 70 metrics across all tested datasets, while significantly reducing the number of required parameters to under 50k, providing notable efficiency advantages. Code is available at: https://github.com/sunbus100/FreqMoE-main
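A minimal sketch of the frequency-band mixture-of-experts idea described above, not the released FreqMoE implementation: split the input series into rFFT frequency bands, give each band to its own small expert, and mix the expert forecasts with a softmax gate driven by band energies. The class name, band boundaries, gate input, and the absence of the iterative residual refinement module are all assumptions made for brevity.

```python
import torch
import torch.nn as nn

class FreqBandMoE(nn.Module):
    """Illustrative frequency-band mixture of experts (sketch only)."""
    def __init__(self, seq_len, horizon, n_bands=4, hidden=64):
        super().__init__()
        self.seq_len, self.horizon, self.n_bands = seq_len, horizon, n_bands
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(seq_len, hidden), nn.ReLU(), nn.Linear(hidden, horizon))
            for _ in range(n_bands)
        )
        self.gate = nn.Linear(n_bands, n_bands)   # gate over per-band energies (assumption)

    def forward(self, x):                          # x: (batch, seq_len)
        spec = torch.fft.rfft(x, dim=-1)
        freqs = spec.shape[-1]
        edges = [round(b * freqs / self.n_bands) for b in range(self.n_bands + 1)]
        band_signals, band_energy = [], []
        for b in range(self.n_bands):
            mask = torch.zeros_like(spec)
            mask[..., edges[b]:edges[b + 1]] = 1.0               # keep one frequency band
            banded = torch.fft.irfft(spec * mask, n=self.seq_len, dim=-1)
            band_signals.append(banded)
            band_energy.append(banded.pow(2).mean(dim=-1))
        weights = torch.softmax(self.gate(torch.stack(band_energy, dim=-1)), dim=-1)
        preds = torch.stack([e(s) for e, s in zip(self.experts, band_signals)], dim=-1)
        return (preds * weights.unsqueeze(1)).sum(dim=-1)        # (batch, horizon)

model = FreqBandMoE(seq_len=96, horizon=24)
print(model(torch.randn(8, 96)).shape)   # torch.Size([8, 24])
```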
Poster
Fares Fourati · Salma Kharrat · Vaneet Aggarwal · Mohamed-Slim Alouini
[ Hall A-E ]
Abstract
Optimizing expensive, non-convex, black-box Lipschitz continuous functions presents significant challenges, particularly when the Lipschitz constant of the underlying function is unknown. Such problems often demand numerous function evaluations to approximate the global optimum, which can be prohibitive in terms of time, energy, or resources. In this work, we introduce Every Call is Precious (ECP), a novel global optimization algorithm that minimizes unpromising evaluations by strategically focusing on potentially optimal regions. Unlike previous approaches, ECP eliminates the need to estimate the Lipschitz constant, thereby avoiding additional function evaluations. ECP guarantees no-regret performance for infinite evaluation budgets and achieves minimax-optimal regret bounds within finite budgets. Extensive ablation studies validate the algorithm's robustness, while empirical evaluations show that ECP outperforms 10 benchmark algorithms—including Lipschitz, Bayesian, bandit, and evolutionary methods—across 30 multi-dimensional non-convex synthetic and real-world optimization problems, positioning ECP as a competitive approach for global optimization.
Poster
yuyang deng · Fuli Qiao · Mehrdad Mahdavi
[ Hall A-E ]
Abstract
Stochastic compositional minimax problems are prevalent in machine learning, yet established convergence results for this class of problems remain limited. In this paper, we propose a formal definition of the stochastic compositional minimax problem, which involves optimizing a minimax loss with a compositional structure either in the primal, the dual, or both the primal and dual variables. We introduce a simple yet effective algorithm, stochastically Corrected stOchastic gradient Descent Ascent (CODA), which is a primal-dual type algorithm with compositional correction steps, and establish its convergence rate in the aforementioned three settings. We also propose a variance-reduced variant, CODA+, which achieves the best-known rate on nonconvex-strongly-concave and nonconvex-concave compositional minimax problems. This work initiates the theoretical study of the stochastic compositional minimax problem in various settings and may inform modern machine learning scenarios such as domain adaptation or robust model-agnostic meta-learning.
Poster
Sagnik Chatterjee · MANUJ MUKHERJEE · Alhad Sethi
[ Hall A-E ]
Abstract
In this work, we give generalization bounds of statistical learning algorithms trained on samples drawn from a dependent data source both in expectation and with high probability, using the Online-to-Batch conversion paradigm. We show that the generalization error of statistical learners in the dependent data setting is equivalent to the generalization error of statistical learners in the i.i.d. setting up to a term that depends on the decay rate of the underlying mixing stochastic process, and is independent of the complexity of the statistical learner. Our proof techniques involve defining a new notion of stability of online learning algorithms based on Wasserstein distances, and employing "near-martingale" concentration bounds for dependent random variables to arrive at appropriate upper bounds for the generalization error of statistical learners trained on dependent data. Finally, we prove that the Exponential Weighted Averages (EWA) algorithm satisfies our new notion of stability, and instantiate our bounds using the EWA algorithm.
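For readers unfamiliar with the algorithm the bounds are instantiated with, here is a minimal sketch of the standard exponentially weighted averages scheme for prediction with expert advice. The learning-rate choice is a common textbook default, not a value taken from the paper.

```python
import numpy as np

def ewa(expert_losses, eta=None):
    """Exponentially weighted averages over K experts.

    expert_losses: (T, K) array of per-round losses in [0, 1].
    Returns the per-round mixture weights and the learner's cumulative loss.
    A standard rate eta = sqrt(8 ln K / T) is used when none is given.
    """
    T, K = expert_losses.shape
    if eta is None:
        eta = np.sqrt(8.0 * np.log(K) / T)
    cum = np.zeros(K)
    weights, learner_loss = [], 0.0
    for t in range(T):
        w = np.exp(-eta * cum)
        w /= w.sum()                            # normalize the exponential weights
        weights.append(w)
        learner_loss += w @ expert_losses[t]    # loss of the weighted average
        cum += expert_losses[t]
    return np.array(weights), learner_loss

rng = np.random.default_rng(1)
losses = rng.uniform(size=(500, 10))
losses[:, 3] *= 0.5                             # expert 3 is better on average
w, L = ewa(losses)
print(L, losses.sum(axis=0).min())              # learner loss vs. best expert in hindsight
```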
Poster
Tomoharu Iwata · Atsutoshi Kumagai
[ Hall A-E ]
Abstract
We propose neural network-based models for tensor completion in few-observation settings. The proposed model can meta-learn an inductive bias from multiple heterogeneous tensors without shared modes. Although many tensor completion methods have been proposed, existing methods cannot leverage knowledge across heterogeneous tensors, and their performance is low when only a small number of elements are observed. The proposed model encodes each element of a given tensor by considering information about other elements, while reflecting the tensor structure via a self-attention mechanism. The missing values are predicted by a tensor-specific linear projection from the encoded vectors. The proposed model is shared across different tensors, and it is meta-learned such that the expected tensor completion performance is improved using multiple tensors. Through experiments using synthetic and real-world tensors, we demonstrate that the proposed method achieves better performance than existing meta-learning and tensor completion methods.
Poster
Krzysztof Choromanski · Isaac Reid · Arijit Sehanobish · Kumar Avinava Dubey
[ Hall A-E ]
Abstract
We present the first linear time complexity randomized algorithms for unbiased approximation of the celebrated family of general random walk kernels (RWKs) for sparse graphs. This includes both labelled and unlabelled instances. The previous fastest methods for general RWKs were of cubic time complexity and not applicable to labelled graphs. Our method samples dependent random walks to compute novel graph embeddings in $\mathbb{R}^{d}$ whose dot product is equal to the true RWK in expectation. It does so without instantiating the direct product graph in memory, meaning we can scale to massive datasets that cannot be stored on a single machine. We derive exponential concentration bounds to prove that our estimator is sharp, and show that the ability to approximate general RWKs (rather than just special cases) unlocks efficient implicit graph kernel learning. Our method is up to **27×** faster than its counterparts for efficient computation on large graphs and scales to graphs **128×** bigger than the largest examples amenable to brute-force computation.
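To make the walk-sampling idea concrete, here is an illustrative sketch of turning sampled random walks on a sparse, labelled graph into count-vector embeddings whose dot products count co-occurring labelled walks, the quantity random walk kernels are built from, without ever forming the direct product graph. This is not the paper's unbiased estimator (which uses dependent walks and a specific weighting); all names and parameter choices below are assumptions for illustration.

```python
import numpy as np
from collections import Counter

def sample_walk_features(adj, labels, walk_len=4, n_walks=200, seed=0):
    """Sample random walks on a sparse adjacency list, record the label
    sequences they traverse, and return a sparse count vector (sketch only)."""
    rng = np.random.default_rng(seed)
    counts = Counter()
    nodes = list(adj)
    for _ in range(n_walks):
        v = nodes[rng.integers(len(nodes))]
        seq = [labels[v]]
        for _ in range(walk_len - 1):
            if not adj[v]:                       # dead end: stop the walk early
                break
            v = adj[v][rng.integers(len(adj[v]))]
            seq.append(labels[v])
        counts[tuple(seq)] += 1.0 / n_walks
    return counts

def dot(c1, c2):
    return sum(v * c2.get(k, 0.0) for k, v in c1.items())

# two tiny labelled graphs given as adjacency lists
g1 = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
l1 = {0: 'a', 1: 'b', 2: 'a'}
g2 = {0: [1], 1: [0, 2], 2: [1]}
l2 = {0: 'a', 1: 'a', 2: 'b'}
print(dot(sample_walk_features(g1, l1), sample_walk_features(g2, l2)))
```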
Poster
Elena Grigorescu · Young-San Lin · Maoyuan Song
[ Hall A-E ]
Abstract
*Learning-augmented algorithms* have been extensively studied in the computer science community recently, particularly in the context of online problems, in which machine-learning predictors can provide additional information about the future in order to overcome classical impossibility results. Such algorithms use *advice* prudently to improve the performance of classical algorithms, while ensuring robustness against inaccurate advice. In this paper, we present learning-augmented algorithmic frameworks for two fundamental optimization settings, extending and generalizing prior work. For *online packing with concave objectives*, we present a simple but overarching strategy that *switches* between the advice and the state-of-the-art online algorithm. For *online covering with convex objectives*, we greatly extend primal-dual methods for online convex covering programs and the previous learning-augmented framework for online covering linear programs from the literature to many new applications. We show that our algorithms break impossibility results when the advice is accurate, while maintaining performance comparable to state-of-the-art classical online algorithms even when the advice is erroneous.
Poster
Shpresim Sadiku · Moritz Wagner · Sai Ganesh Nagarajan · Sebastian Pokutta
[ Hall A-E ]
Abstract
We study the problem of finding optimal sparse, manifold-aligned counterfactual explanations for classifiers. Canonically, this can be formulated as an optimization problem with multiple non-convex components, including classifier loss functions and manifold alignment (or _plausibility_) metrics. The added complexity of enforcing _sparsity_, or shorter explanations, complicates the problem further. Existing methods often focus on specific models and plausibility measures, relying on convex $\ell_1$ regularizers to enforce sparsity. In this paper, we tackle the canonical formulation using the accelerated proximal gradient (APG) method, a simple yet efficient first-order procedure capable of handling smooth non-convex objectives and non-smooth $\ell_p$ (where $0 \leq p < 1$) regularizers. This enables our approach to seamlessly incorporate various classifiers and plausibility measures while producing sparser solutions. Our algorithm only requires differentiable data-manifold regularizers and supports box constraints for bounded feature ranges, ensuring the generated counterfactuals remain \emph{actionable}. Finally, experiments on real-world datasets demonstrate that our approach effectively produces sparse, manifold-aligned counterfactual explanations while maintaining proximity to the factual data and computational efficiency.
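The following is a minimal proximal-gradient sketch for the kind of sparse counterfactual search described above, under heavy simplifying assumptions: plain (non-accelerated) proximal gradient rather than APG, a linear-logistic classifier, the $\ell_0$ penalty (the $p=0$ endpoint of the $\ell_p$, $0 \leq p < 1$ family, whose prox is hard thresholding), box constraints, and no data-manifold or plausibility regularizer. All function and variable names are hypothetical.

```python
import numpy as np

def sparse_counterfactual(x0, w, b, target=1, lam=0.05, step=0.1,
                          n_iter=300, lower=0.0, upper=1.0):
    """Sketch: minimize  BCE(sigmoid(w @ x + b), target) + lam * ||x - x0||_0
    subject to lower <= x <= upper (box constraints keep x actionable)."""
    x = x0.copy()
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
        grad = (p - target) * w                      # gradient of the BCE loss in x
        z = x - step * grad                          # gradient step
        # prox of lam * ||x - x0||_0: keep a coordinate's change only if it is
        # large enough to pay the sparsity price (hard thresholding)
        d = z - x0
        d[d ** 2 < 2.0 * step * lam] = 0.0
        x = np.clip(x0 + d, lower, upper)            # project onto the box
    return x

rng = np.random.default_rng(2)
w, b = rng.normal(size=8), -1.0
x0 = rng.uniform(size=8)
x_cf = sparse_counterfactual(x0, w, b)
print(np.count_nonzero(x_cf - x0), 1 / (1 + np.exp(-(w @ x_cf + b))))
```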
Poster
Ananda Theertha Suresh · Andrew Thangaraj · Aditya Khandavally
[ Hall A-E ]
Abstract
Given the ease of creating synthetic data from machine learning models, new models can potentially be trained on synthetic data generated by previous models. This recursive training process raises concerns about its long-term impact on model quality. As models are recursively trained on data generated in previous rounds, their ability to capture the nuances of the original human-generated data may degrade. This is often referred to as model collapse. In this work, we ask how fast model collapse occurs for some well-studied distribution families under maximum likelihood (ML or near-ML) estimation during recursive training. Surprisingly, even for fundamental distributions such as discrete and Gaussian distributions, the exact rate of model collapse is unknown. We theoretically characterize the rate of collapse in these fundamental settings and complement the analysis with experimental evaluations. Our results show that for discrete distributions, the time to forget a symbol is approximately linear in the number of times it occurred in the original corpus, and for Gaussian models, the standard deviation shrinks to zero after roughly $n$ iterations, where $n$ is the number of samples at each iteration. Both of these findings imply that model forgetting, at least in these simple distributions …
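The Gaussian case is easy to simulate: at each round, draw $n$ samples from the current model and refit the mean and standard deviation by maximum likelihood. The toy run below only illustrates the qualitative shrinkage of the scale over rounds; the stated rate of roughly $n$ iterations is the paper's theoretical result, not something this snippet proves.

```python
import numpy as np

def recursive_gaussian_fit(n=100, n_rounds=500, seed=3):
    """Recursive training of a Gaussian: sample n points from the current
    model, refit (mu, sigma) by maximum likelihood, repeat."""
    rng = np.random.default_rng(seed)
    mu, sigma = 0.0, 1.0
    sigmas = []
    for _ in range(n_rounds):
        x = rng.normal(mu, sigma, size=n)
        mu, sigma = x.mean(), x.std()     # ML estimates become the next model
        sigmas.append(sigma)
    return np.array(sigmas)

s = recursive_gaussian_fit(n=100, n_rounds=500)
print(s[::100])   # the scale shrinks as rounds accumulate
```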
Poster
Markus Heinonen · Ba-Hien Tran · Michael Kampffmeyer · Maurizio Filippone
[ Hall A-E ]
Abstract
Introducing training-time augmentations is a key technique to enhance generalization and to prepare deep neural networks for test-time corruptions. Inspired by the success of generative diffusion models, we propose a novel approach of coupling data mollification, in the form of image noising and blurring, with label smoothing to align predicted label confidences with image degradation. The method is simple to implement, introduces negligible overheads, and can be combined with existing augmentations. We demonstrate improved robustness and uncertainty quantification on the corrupted image benchmarks of the CIFAR, TinyImageNet and ImageNet datasets.
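A hedged sketch of the coupling described above: noise each training image with a random severity and smooth its one-hot label toward uniform by the same severity, so that predicted confidence is encouraged to drop as the input degrades. The function name, the noise-only mollification (no blurring), and the linear confidence-degradation mapping are assumptions; the paper's exact schedule may differ.

```python
import torch
import torch.nn.functional as F

def mollify_batch(images, labels, num_classes, max_sigma=0.5):
    """Sketch of training-time data mollification coupled with label smoothing."""
    t = torch.rand(images.shape[0], 1, 1, 1)                 # per-image severity in [0, 1]
    noisy = images + t * max_sigma * torch.randn_like(images)
    one_hot = F.one_hot(labels, num_classes).float()
    # smooth the label toward uniform in proportion to the degradation severity
    smooth = (1.0 - t.view(-1, 1)) * one_hot + t.view(-1, 1) / num_classes
    return noisy, smooth                                      # train with soft-target cross-entropy

# usage inside a standard training step (soft targets need log-softmax + KL/CE)
imgs, ys = torch.randn(16, 3, 32, 32), torch.randint(0, 10, (16,))
x_moll, y_soft = mollify_batch(imgs, ys, num_classes=10)
print(x_moll.shape, y_soft.sum(dim=-1))   # soft labels still sum to 1
```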
Poster
Csaba Tóth · Masaki Adachi · Michael A. Osborne · Harald Oberhauser
[ Hall A-E ]
Abstract
The signature kernel is a kernel between time series of arbitrary length and comes with strong theoretical guarantees from stochastic analysis. It has found applications in machine learning, such as covariance functions for Gaussian processes. A strength of the underlying signature features is that they provide a structured, global description of a time series. However, this property can quickly become a curse when local information is essential and forgetting is required; so far this has only been addressed with ad-hoc methods such as slicing the time series into smaller segments. To overcome this, we propose a principled and data-driven approach by introducing a novel forgetting mechanism for signature features. This allows the model to dynamically adapt its observed context length to focus on more recent information. To achieve this, we revisit the recently introduced Random Fourier Signature Features and develop Random Fourier Decayed Signature Features (RFDSF) with Gaussian processes (GPs). The result is a scalable Bayesian time series forecasting algorithm with variational inference that processes and transforms a time series into a joint predictive distribution over the time steps in one pass using recurrence. For example, processing a sequence of length $10^4$ steps in less …
Poster
El Mahdi Chayti · Nikita Doikov · Martin Jaggi
[ Hall A-E ]
Abstract
We study stochastic second-order methods for solving general non-convex optimization problems. We propose using a special version of momentum to stabilize the stochastic gradient and Hessian estimates in Newton's method. We show that momentum provably improves the variance of stochastic estimates and allows the method to converge for any noise level. Using the cubic regularization technique, we prove a global convergence rate for our method on general non-convex problems to a second-order stationary point, even when using only a single stochastic data sample per iteration. This starkly contrasts with all existing stochastic second-order methods for non-convex problems, which typically require large batches. Therefore, we are the first to demonstrate global convergence of Stochastic Cubic Newton for batches of arbitrary size in the non-convex case. Additionally, we show improved speed on convex stochastic problems for our regularized Newton methods with momentum.
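A minimal sketch of the momentum-stabilization idea, under clearly stated simplifications: keep exponential moving averages of single-sample gradient and Hessian estimates and take a regularized Newton step with gradient-norm-dependent damping. The paper's Stochastic Cubic Newton instead solves a cubic-regularized subproblem with these averaged estimates; the damping rule, constants, and toy problem below are assumptions for illustration.

```python
import numpy as np

def momentum_newton(grad_fn, hess_fn, x0, steps=200, beta=0.9, reg=1.0):
    """Momentum-averaged stochastic gradient/Hessian estimates plus a
    damped Newton step (illustrative sketch, not the paper's method)."""
    x = x0.copy()
    g_bar = np.zeros_like(x0)
    H_bar = np.eye(len(x0))
    for _ in range(steps):
        g_bar = beta * g_bar + (1.0 - beta) * grad_fn(x)   # momentum on gradients
        H_bar = beta * H_bar + (1.0 - beta) * hess_fn(x)   # momentum on Hessians
        lam = reg * np.sqrt(np.linalg.norm(g_bar))          # cubic-style damping heuristic
        x = x - np.linalg.solve(H_bar + lam * np.eye(len(x)), g_bar)
    return x

# toy non-convex problem f(x) = (||x||^2 - 1)^2 with noisy single-sample oracles
rng = np.random.default_rng(4)
f_grad = lambda x: 4 * x * (x @ x - 1.0) + 0.5 * rng.normal(size=x.shape)
f_hess = lambda x: (8 * np.outer(x, x) + 4 * (x @ x - 1.0) * np.eye(len(x))
                    + 0.5 * rng.normal(size=(len(x), len(x))))
print(momentum_newton(f_grad, f_hess, np.array([2.0, -1.5])))
```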
Poster
Boning Zhang · Dongzhu Liu · Osvaldo Simeone · Guanchu Wang · Dimitrios Pezaros · Guangxu Zhu
[ Hall A-E ]
Abstract
To support real-world decision-making, it is crucial for models to be well-calibrated, i.e., to assign reliable confidence estimates to their predictions. Uncertainty quantification is particularly important in personalized federated learning (PFL), as participating clients typically have small local datasets, making it difficult to unambiguously determine optimal model parameters. Bayesian PFL (BPFL) methods can potentially enhance calibration, but they often come with considerable computational and memory requirements due to the need to track the variances of all the individual model parameters. Furthermore, different clients may exhibit heterogeneous uncertainty levels owing to varying local dataset sizes and distributions. To address these challenges, we propose LR-BPFL, a novel BPFL method that learns a global deterministic model along with personalized low-rank Bayesian corrections. To tailor the local model to each client's inherent uncertainty level, LR-BPFL incorporates an adaptive rank selection mechanism. We evaluate LR-BPFL across a variety of datasets, demonstrating its advantages in terms of calibration and accuracy, as well as computational and memory requirements. The code is available at https://github.com/Bernie0115/LR-BPFL.