Track: Oral Session 1: Bayesian & Probabilistic Modeling

Oral

Rethinking Probabilistic Circuit Parameter Learning

Anji Liu ⋅ Zilei Shao ⋅ Guy Van den Broeck

Probabilistic Circuits (PCs) offer a computationally scalable framework for generative modeling, supporting exact and efficient inference of a wide range of probabilistic queries. While recent advances have significantly improved the expressiveness and scalability of PCs, effectively training their parameters remains a challenge. In particular, a widely used optimization method, full-batch Expectation-Maximization (EM), requires processing the entire dataset before performing a single update, making it ineffective for large datasets. Although empirical extensions to the mini-batch setting, as well as gradient-based mini-batch algorithms, converge faster than full-batch EM, they generally underperform in terms of final likelihood. We investigate this gap by establishing a novel theoretical connection between these practical algorithms and the general EM objective. Our analysis reveals a fundamental issue: existing mini-batch EM and gradient-based methods fail to properly regularize distribution changes, causing each update to effectively "overfit" the current mini-batch. Motivated by this insight, we introduce anemone, a new mini-batch EM algorithm for PCs. Anemone applies an implicit adaptive learning rate to each parameter, scaled by how much it contributes to the likelihood of the current batch. Across extensive experiments on language, image, and DNA datasets, anemone consistently outperforms existing optimizers in both convergence speed and final performance. Code is available at https://github.com/liuanji/pc-arena.

Spotlight

Standard Acquisition Is Sufficient for Asynchronous Bayesian Optimization

Ben Riegler ⋅ James Odgers ⋅ Vincent Fortuin

Asynchronous Bayesian optimization is widely used for gradient-free optimization in domains with independent parallel experiments and varying evaluation times. Existing methods posit that standard acquisitions lead to redundant and repeated queries, proposing complex solutions to enforce diversity in queries. Challenging this fundamental premise, we show that standard acquisition methods, such as the Upper Confidence Bound, can in fact achieve theoretical guarantees essentially equivalent to those of sequential Thompson sampling. A conceptual analysis of asynchronous Bayesian optimization reveals that existing works neglect intermediate posterior updates, which we find to be generally sufficient to avoid redundant queries. Further investigation shows that by penalizing busy locations, diversity-enforcing methods can over-explore in asynchronous settings, reducing their performance. Our extensive experiments demonstrate that simple standard acquisition functions match or outperform purpose-built asynchronous methods across synthetic and real-world tasks.
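
For readers unfamiliar with the Upper Confidence Bound mentioned in the abstract above, the following is a minimal sketch of the textbook rule (posterior mean plus a scaled posterior standard deviation) over a finite candidate set. It is not the authors' code. The function names, the `beta` parameter, and the assumption that posterior means and standard deviations are supplied by some surrogate model are all illustrative choices.

```python
import math


def ucb_scores(means, stds, beta=2.0):
    """Textbook UCB score: posterior mean + sqrt(beta) * posterior std.

    `beta` is an assumed exploration weight, not a value from the paper.
    """
    return [m + math.sqrt(beta) * s for m, s in zip(means, stds)]


def next_query(candidates, means, stds, beta=2.0):
    """Return the candidate with the highest UCB score.

    In an asynchronous loop, `means` and `stds` would be recomputed from
    the latest posterior each time a worker returns a result.
    """
    scores = ucb_scores(means, stds, beta)
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]
```

In this sketch nothing penalizes locations that are still being evaluated. Diversity in the queries would have to come from refreshing `means` and `stds` as results arrive, which is the kind of intermediate posterior update the abstract refers to.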

Spotlight

Learning Hyperparameters via a Data-Emphasized Variational Objective

Ethan Harvey ⋅ Mikhail Petrov ⋅ Michael Hughes

When training large models on limited data, avoiding overfitting is paramount. Common grid search or smarter search methods rely on expensive separate runs for each candidate hyperparameter, while carving out a validation set that reduces available training data. In this paper, we study gradient-based learning of hyperparameters via the evidence lower bound (ELBO) objective from Bayesian variational methods. This avoids the need for any validation set. We focus on scenarios where the model is over-parameterized for flexibility and the approximate posterior is chosen to be Gaussian with isotropic covariance for tractability, even though it cannot match the true posterior. In such scenarios, we find the ELBO prioritizes posteriors that match the prior, leading to severe underfitting. Instead, we recommend a data-emphasized ELBO that upweights the likelihood but not the prior. In Bayesian transfer learning of image and text classifiers, our method reduces the 88+ hour grid search of past work to under 3 hours while delivering comparable accuracy. We further demonstrate how our approach enables efficient yet accurate approximations of Gaussian processes with learnable lengthscale kernels.
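
One plausible way to write down the contrast described in the abstract above is sketched here. The notation is assumed for illustration and is not taken from the paper: $\theta$ denotes model weights, $\eta$ the hyperparameters being learned, $q$ the approximate posterior, and $\kappa$ a likelihood weight whose value the abstract does not specify.

```latex
\begin{align}
\mathcal{L}_{\mathrm{ELBO}}(q, \eta)
  &= \mathbb{E}_{q(\theta)}\left[\log p(y \mid \theta)\right]
   - \mathrm{KL}\left(q(\theta) \,\|\, p_{\eta}(\theta)\right), \\
\mathcal{L}_{\mathrm{DE}}(q, \eta)
  &= \kappa \, \mathbb{E}_{q(\theta)}\left[\log p(y \mid \theta)\right]
   - \mathrm{KL}\left(q(\theta) \,\|\, p_{\eta}(\theta)\right),
   \qquad \kappa > 1 .
\end{align}
```

Under this reading, setting $\kappa = 1$ recovers the standard ELBO, while larger $\kappa$ makes the data-fit term count for more relative to the pull toward the prior.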

Spotlight

On the Interplay of Priors and Overparametrization in Bayesian Neural Network Posteriors

Julius Kobialka ⋅ Emanuel Sommer ⋅ Chris Kolb ⋅ Juntae Kwon ⋅ Daniel Dold ⋅ David Rügamer

Bayesian neural network (BNN) posteriors are often considered impractical for inference, as symmetries fragment them, non-identifiabilities inflate dimensionality, and weight-space priors are seen as meaningless. In this work, we study how overparametrization and priors together reshape BNN posteriors and derive implications allowing us to better understand their interplay. We show that redundancy introduces three key phenomena that fundamentally reshape the posterior geometry: layer balancedness, weight distribution on equal-probability manifolds, and prior conformity. We validate our findings through extensive experiments with posterior sampling budgets that far exceed those of earlier works, and demonstrate how overparametrization induces structured, prior-aligned weight posterior distributions.

Spotlight

Local Inconsistency Resolution: The Interplay between Attention and Control in Probabilistic Models

Oliver Richardson ⋅ Mandana Samiei ⋅ Mehran Shakerinava ⋅ Joseph Viviano ⋅ Abdessamad Kabid ⋅ Ali Parviz ⋅ Yoshua Bengio

We present a generic algorithm for learning and approximate inference with an intuitive epistemic interpretation: iteratively focus on a subset of the model and resolve inconsistencies using the parameters under control. This framework, which we call Local Inconsistency Resolution (LIR), is built upon Probabilistic Dependency Graphs (PDGs), which provide a flexible representational foundation capable of capturing inconsistent beliefs. We show how LIR unifies and generalizes a wide variety of important algorithms in the literature, including the Expectation-Maximization (EM) algorithm, belief propagation, adversarial training, GANs, and GFlowNets. In the last case, LIR actually suggests a more natural loss, which we demonstrate improves GFlowNet convergence. Each of these methods can be recovered as a specific instance of LIR by choosing a procedure to direct focus (attention and control). We implement this algorithm for discrete PDGs and study its properties on synthetically generated PDGs, comparing its behavior to the global optimization semantics of the full PDG.

Spotlight

Hellinger Multimodal Variational Autoencoders

Huyen Vo ⋅ Isabel Valera

Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. Predominant methods aggregate unimodal inference distributions using either a product of experts (PoE), a mixture of experts (MoE), or their combinations to approximate the joint posterior. In this work, we revisit multimodal inference through the lens of probabilistic opinion pooling, an optimization-based approach. We start from Hölder pooling with $\alpha=0.5$, which corresponds to the unique symmetric member of the $\alpha$-divergence family, and derive a moment-matching approximation, termed Hellinger. We then leverage such an approximation to propose HELVAE, a multimodal VAE that avoids sub-sampling, yielding an efficient yet effective model that: (i) learns more expressive latent representations as additional modalities are observed; and (ii) empirically achieves better trade-offs between generative coherence and quality, outperforming state-of-the-art multimodal VAE models.
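
To make the pooling idea in the abstract above concrete, here is a small sketch of power-mean (Hölder) pooling on discrete distributions, assuming the usual form in which the pooled distribution is proportional to $(\sum_i w_i p_i^{\alpha})^{1/\alpha}$. This is a toy illustration on a finite support, not HELVAE itself, which works with continuous latent distributions and a moment-matching approximation. The function name and arguments are hypothetical.

```python
def holder_pool(experts, weights, alpha=0.5):
    """Pool discrete distributions with a weighted power mean, then renormalize.

    experts: list of probability vectors over the same finite support.
    weights: nonnegative weights, one per expert (assumed to sum to 1).
    alpha:   power-mean exponent; 0.5 is the Hellinger case named in the abstract.
    """
    support_size = len(experts[0])
    raw = [
        sum(w * p[k] ** alpha for w, p in zip(weights, experts)) ** (1.0 / alpha)
        for k in range(support_size)
    ]
    total = sum(raw)
    return [r / total for r in raw]
```

With `alpha=0.5` the pooled mass at each point is the squared weighted average of the experts' square-root masses, so a point keeps some mass when at least one positively weighted expert supports it. That differs from a product of experts, where a single zero vetoes the point.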

Spotlight

Dendrograms of Mixing Measures for Softmax-Gated Gaussian Mixture of Experts: Consistency Without Model Sweeps

TienHai Do ⋅ Trung Nguyen ⋅ TrungTin Nguyen ⋅ Nhat Ho ⋅ Binh Nguyen Thanh ⋅ Chris Drovandi

We develop a unified statistical framework for softmax-gated Gaussian mixture of experts (SGMoE) that addresses three long-standing obstacles in parameter estimation and model selection: (i) non-identifiability of gating parameters up to common translations, (ii) intrinsic gate-expert interactions that induce coupled differential relations in the likelihood, and (iii) the tight numerator-denominator coupling in the softmax-induced conditional density. Our approach introduces Voronoi-type loss functions aligned with the gate-partition geometry and establishes finite-sample convergence rates for the maximum likelihood estimator (MLE). In over-specified models, we reveal a link between the MLE's convergence rate and the solvability of an associated system of polynomial equations characterizing near-nonidentifiable directions. For model selection, we adapt dendrograms of mixing measures to SGMoE, yielding a consistent, sweep-free selector of the number of experts that attains pointwise-optimal parameter rates under overfitting while avoiding multi-size training. Simulations on synthetic data corroborate the theory, accurately recovering the expert count and achieving the predicted rates for parameter estimation while closely approximating the regression function. Under model misspecification (e.g., $\epsilon$-contamination), the dendrogram selection criterion is robust, recovering the true number of mixture components, while the Akaike information criterion, the Bayesian information criterion, and the integrated completed likelihood tend to overselect as sample size grows. On a maize proteomics dataset of drought-responsive traits, our dendrogram-guided SGMoE selects two experts, exposes a clear mixing-measure hierarchy, stabilizes the likelihood early, and yields interpretable genotype-phenotype maps, outperforming standard criteria without multi-size training.

Spotlight

Laplace approximation for Bayesian variable selection via Le Cam's one-step procedure

Tianrui Hou ⋅ Aguemon Atchade

Relevant feature selection in high-dimensional settings is a central challenge in modern scientific research and decision-making. While many existing methods offer strong statistical guarantees, they are often computationally intractable in high-dimensional problems. To address this issue, we introduce a novel Laplace approximation method based on Le Cam’s one-step procedure, termed OLAP. This approach is specifically designed to alleviate computational burdens while maintaining statistical rigor. Under standard high-dimensional assumptions, we establish that OLAP achieves consistent variable selection. Moreover, the method yields a posterior distribution that can be efficiently explored in polynomial time via a simple Gibbs sampling algorithm. We demonstrate the effectiveness of OLAP through applications to logistic and Poisson regression models, using both simulated and real data.

Main Ballroom
