Moderators: Arno Solin · Matthew Hoffman
Tin Nguyen · Brian Trippe · Tamara Broderick
Markov chain Monte Carlo (MCMC) methods are often used in clustering since they guarantee asymptotically exact expectations in the infinite-time limit. In finite time, though, slow mixing often leads to poor performance. Modern computing environments offer massive parallelism, but naive implementations of parallel MCMC can exhibit substantial bias. In MCMC samplers of continuous random variables, Markov chain couplings can overcome bias. But these approaches depend crucially on paired chains meeting after a small number of transitions. We show that straightforward applications of existing coupling ideas to discrete clustering variables fail to meet quickly. This failure arises from the "label-switching problem": semantically equivalent cluster relabelings impede fast meeting of coupled chains. We instead consider chains as exploring the space of partitions rather than partitions' (arbitrary) labelings. Using a metric on the partition space, we formulate a practical algorithm using optimal transport couplings. Our theory confirms our method is accurate and efficient. In experiments ranging from clustering of genes or seeds to graph colorings, we show the benefits of our coupling in the highly parallel, time-limited regime.
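The label-switching problem described above can be illustrated with a small sketch. The abstract does not say which metric on partitions the authors use, so the pairwise co-clustering disagreement count below is an assumption chosen for simplicity, and the helper names `to_partition` and `pair_disagreement` are hypothetical. The example shows two label vectors that disagree at every position yet encode the same partition.

```python
from itertools import combinations

def to_partition(labels):
    """Convert a label vector into a label-free partition:
    a frozenset of blocks, each block a frozenset of item indices."""
    blocks = {}
    for item, label in enumerate(labels):
        blocks.setdefault(label, set()).add(item)
    return frozenset(frozenset(b) for b in blocks.values())

def pair_disagreement(labels_a, labels_b):
    """Count item pairs that are co-clustered in one labeling but not the other.
    This is an illustrative metric on partitions, not necessarily the paper's."""
    n = len(labels_a)
    return sum(
        (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
        for i, j in combinations(range(n), 2)
    )

x = [0, 0, 1, 1, 2]
y = [2, 2, 0, 0, 1]  # same clusters as x, with the labels permuted
z = [0, 0, 0, 1, 2]  # item 2 moved into the first cluster

# Comparing raw labels says x and y differ everywhere...
label_mismatch_xy = sum(a != b for a, b in zip(x, y))
# ...while comparing partitions says they are identical.
same_partition_xy = to_partition(x) == to_partition(y)
distance_xy = pair_disagreement(x, y)
distance_xz = pair_disagreement(x, z)
```

A coupling that tries to make two chains agree label by label would treat `x` and `y` as far apart, whereas a coupling defined on partitions would treat them as already met.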
Quan Zhou · Aaron Smith
Informed Markov chain Monte Carlo (MCMC) methods have been proposed as scalable solutions to Bayesian posterior computation on high-dimensional discrete state spaces, but theoretical results about their convergence behavior in general settings are lacking. In this article, we propose a class of MCMC schemes called informed importance tempering (IIT), which combine importance sampling and informed local proposals, and derive generally applicable spectral gap bounds for IIT estimators. Our theory shows that IIT samplers have remarkable scalability when the target posterior distribution concentrates on a small set. Further, both our theory and numerical experiments demonstrate that the informed proposal should be chosen with caution: the performance may be very sensitive to the shape of the target distribution. We find that the "square-root proposal weighting" scheme tends to perform well in most settings.
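To make the combination of informed local proposals and importance sampling concrete, here is a toy sketch on binary vectors. It is not taken from the paper: the always-accept move, the single-bit-flip neighborhood, the importance weight 1/Z(x), and the names `target`, `informed_proposal`, and `iit_estimate` are all assumptions of this illustration. With square-root weighting, the chain is reversible with respect to target(x) * Z(x), so weighting each visited state by 1/Z(x) recovers expectations under the target.

```python
import math
import random
from itertools import product

D = 3  # number of binary coordinates (toy size, chosen for illustration)

def target(x):
    """Hypothetical unnormalized target that concentrates near (1, 1, 0)."""
    return math.exp(2.0 * x[0] + 1.0 * x[1] - 1.5 * x[2])

def neighbors(x):
    """All states reachable by flipping exactly one coordinate."""
    return [x[:i] + (1 - x[i],) + x[i + 1:] for i in range(D)]

def informed_proposal(x):
    """Return the neighbors of x, their proposal probabilities under
    square-root weighting, and the normalizer Z(x)."""
    nbrs = neighbors(x)
    weights = [math.sqrt(target(y) / target(x)) for y in nbrs]
    z = sum(weights)
    return nbrs, [w / z for w in weights], z

def iit_estimate(f, x0, n_steps, seed=0):
    """Self-normalized importance estimate of E[f] under the target.
    Every proposed move is accepted; each visited state gets weight 1/Z(x)."""
    rng = random.Random(seed)
    x, num, den = x0, 0.0, 0.0
    for _ in range(n_steps):
        nbrs, probs, z = informed_proposal(x)
        num += f(x) / z
        den += 1.0 / z
        x = rng.choices(nbrs, weights=probs, k=1)[0]
    return num / den

def exact_mean(f):
    """Brute-force expectation, feasible only because the toy space has 2**D states."""
    states = list(product([0, 1], repeat=D))
    total = sum(target(s) for s in states)
    return sum(f(s) * target(s) for s in states) / total

estimate = iit_estimate(lambda s: s[0], (0, 0, 0), 20000)
truth = exact_mean(lambda s: s[0])
```

Comparing `estimate` with `truth` gives a quick sanity check on a space small enough to enumerate; the abstract's caution about sensitivity to the target's shape concerns the choice of weighting function, for which the square root is only one option.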
Alejandro Catalina · Paul Bürkner · Aki Vehtari
Projection predictive inference is a decision theoretic Bayesian approach that decouples model estimation from decision making. Given a reference model previously built including all variables present in the data, projection predictive inference projects its posterior onto a constrained space of a subset of variables. Variable selection is then performed by sequentially adding relevant variables until predictive performance is satisfactory. Previously, projection predictive inference has been demonstrated only for generalized linear models (GLMs) and Gaussian processes (GPs), where it showed superior performance to competing variable selection procedures. In this work, we extend projection predictive inference to support variable and structure selection for generalized linear multilevel models (GLMMs) and generalized additive multilevel models (GAMMs). Our simulated and real-world experiments demonstrate that our method can drastically reduce the model complexity required to reach reference predictive performance and achieve good frequency properties.
Kristy Choi · Chenlin Meng · Yang Song · Stefano Ermon
Density ratio estimation (DRE) is a fundamental machine learning technique for comparing two probability distributions. However, existing methods struggle in high-dimensional settings, as it is difficult to accurately compare probability distributions based on finite samples. In this work, we propose DRE-$\infty$, a divide-and-conquer approach to reduce DRE to a series of easier subproblems. Inspired by Monte Carlo methods, we smoothly interpolate between the two distributions via an infinite continuum of intermediate bridge distributions. We then estimate the instantaneous rate of change of the bridge distributions indexed by time (the "time score"), a quantity defined analogously to data (Stein) scores, with a novel time score matching objective. Crucially, the learned time scores can then be integrated to compute the desired density ratio. In addition, we show that traditional (Stein) scores can be used to obtain integration paths that connect regions of high density in both distributions, improving performance in practice. Empirically, we demonstrate that our approach performs well on downstream tasks such as mutual information estimation and energy-based modeling on complex, high-dimensional datasets.
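The integration step can be written as a short identity. As a sketch under assumed notation (none of these symbols appear in the abstract), let $p_t$ for $t \in [0, 1]$ denote the bridge distributions, with $p_0$ and $p_1$ the two distributions being compared, and let $s_\theta(x, t)$ be a hypothetical learned model of the time score. If $\log p_t(x)$ is differentiable in $t$, the fundamental theorem of calculus gives

$$
\log \frac{p_1(x)}{p_0(x)} = \int_0^1 \frac{\partial}{\partial t} \log p_t(x) \, dt \approx \int_0^1 s_\theta(x, t) \, dt .
$$

The equality is exact; the approximation is only as good as the learned time score and the numerical quadrature used for the one-dimensional integral.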