Oral Session 6: Generative Modeling (Diffusion / Flow / Temporal / Graph)
Main Ballroom
Moderator: Yingzhen Li
Beyond Real Data: Synthetic Data through the Lens of Regularization
Amitis Shidani ⋅ Tyler Farghly ⋅ Yang SUN ⋅ Habib Ganjgahi ⋅ George Deligiannidis
Synthetic data can improve generalization when real data is scarce, but excessive reliance may introduce distributional mismatches that degrade performance. In this paper, we present a learning-theoretic framework to quantify the trade-off between synthetic and real data. Our approach leverages algorithmic stability to derive generalization error bounds, characterizing the optimal synthetic-to-real data ratio that minimizes expected test error as a function of the Wasserstein distance between the real and synthetic distributions. We motivate our framework in the setting of kernel ridge regression with mixed data, offering a detailed analysis that may be of independent interest. Our theory predicts the existence of an optimal ratio, leading to a U-shaped behavior of test error with respect to the proportion of synthetic data. Empirically, we validate this prediction on CIFAR-10 and a clinical brain MRI dataset. Our theory extends to the important scenario of domain adaptation, showing that carefully blending synthetic target data with limited source data can mitigate domain shift and enhance generalization. We conclude with practical guidance for applying our results to both in-domain and out-of-domain scenarios.
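As a rough illustration of the mixed-data setting described in this abstract, the sketch below fits kernel ridge regression on a fixed real sample plus a growing number of synthetic points and reports test error for each mix. Everything in it is an assumption made for illustration and is not taken from the paper: the one-dimensional sine target, the phase-shifted labels standing in for distribution mismatch, the RBF kernel, and the names `sweep_synthetic`, `shift`, `lam`, and `gamma`.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise squared Euclidean distances between rows of A and rows of B.
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def krr_fit_predict(X_train, y_train, X_test, lam=1e-2, gamma=1.0):
    # Kernel ridge regression: solve (K + lam * n * I) alpha = y.
    n = len(X_train)
    K = rbf_kernel(X_train, X_train, gamma)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y_train)
    return rbf_kernel(X_test, X_train, gamma) @ alpha

def sweep_synthetic(n_real=20, synth_counts=(0, 20, 80, 320), shift=0.3, seed=0):
    rng = np.random.default_rng(seed)
    X_real = rng.uniform(-1, 1, (n_real, 1))
    y_real = np.sin(3.0 * X_real[:, 0]) + 0.1 * rng.standard_normal(n_real)
    X_test = rng.uniform(-1, 1, (500, 1))
    y_test = np.sin(3.0 * X_test[:, 0])
    errors = []
    for n_syn in synth_counts:
        X_syn = rng.uniform(-1, 1, (n_syn, 1))
        # Toy mismatch (assumption): synthetic labels come from a phase-shifted target.
        y_syn = np.sin(3.0 * X_syn[:, 0] + shift) + 0.1 * rng.standard_normal(n_syn)
        X = np.vstack([X_real, X_syn])
        y = np.concatenate([y_real, y_syn])
        pred = krr_fit_predict(X, y, X_test)
        errors.append(float(np.mean((pred - y_test) ** 2)))
    return errors

errors = sweep_synthetic()
```

Plotting `errors` against `synth_counts` for several values of `shift` is one way to explore how the mismatch level changes the best mix in this toy setup; whether a U shape appears here depends on the chosen constants.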
Archetypal Graph Generative Models: Explainable and Identifiable Communities via Anchor-Dominant Convex Hulls
Nikolaos Nakis ⋅ Chrysoula Kosma ⋅ Panagiotis Promponas ⋅ Michail Chatzianastasis ⋅ Giannis Nikolentzos
Representation learning has been essential for graph machine learning tasks such as link prediction, community detection, and network visualization. Despite recent advances achieving high performance on these downstream tasks, little progress has been made toward self-explainable models. Understanding the patterns behind predictions is equally important, motivating recent interest in explainable machine learning. In this paper, we present GraphHull, an explainable generative model that represents networks using two levels of convex hulls. At the global level, the vertices of a convex hull are treated as archetypes, each corresponding to a pure community in the network. At the local level, each community is refined by a prototypical hull whose vertices act as representative profiles, capturing community-specific variation. This two-level construction yields clear multi-scale explanations: a node’s position relative to global archetypes and its local prototypes directly accounts for its edges. The geometry is well-behaved by design, while local hulls are kept disjoint by construction. To further encourage diversity and stability, we place principled priors, including determinantal point processes, and fit the model under MAP estimation with scalable subsampling. Experiments on real networks demonstrate the ability of GraphHull to recover multi-level community structure and to achieve competitive or superior performance on link prediction and community detection, while naturally providing interpretable predictions.
Longitudinal Flow Matching for Trajectory Modeling
Mohammad Mohaiminul Islam ⋅ Thijs Kuipers ⋅ Sharvaree Vadgama ⋅ Coen de Vente ⋅ Afsana Khan ⋅ Clara Sánchez ⋅ Erik Bekkers
Generative models for sequential data often struggle with sparsely sampled and high-dimensional trajectories, typically reducing the learning of dynamics to pairwise transitions. We propose Interpolative Multi-Marginal Flow Matching (IMMFM), a framework that learns continuous stochastic dynamics jointly consistent with multiple observed time points. IMMFM employs a piecewise-quadratic interpolation path as a smooth target for flow matching and jointly optimizes drift and a data-driven diffusion coefficient, supported by a theoretical condition for stable learning. This design captures intrinsic stochasticity, handles irregular sparse sampling, and yields subject-specific trajectories. Experiments on synthetic benchmarks and real-world longitudinal neuroimaging datasets show that IMMFM outperforms existing methods in both forecasting accuracy and further downstream tasks.
Denoising Score Matching with Random Features: Insights on Diffusion Models From Precise Learning Curves
Anand Jerry George ⋅ Rodrigo Veiga ⋅ Nicolas Macris
We theoretically investigate the phenomena of generalization and memorization in diffusion models. Empirical studies suggest that these phenomena are influenced by model complexity and the size of the training dataset. In our experiments, we further observe that the number of noise samples per data sample ($m$) used during Denoising Score Matching (DSM) plays a significant and non-trivial role. We capture these behaviors and shed light on their mechanisms by deriving asymptotically precise expressions for the test and train errors of DSM under a simple theoretical setting. The score function is parameterized by random-features neural networks, with the target distribution being a $d$-dimensional Gaussian. We operate in a regime where the dimension $d$, number of data samples $n$, and number of features $p$ tend to infinity while keeping the ratios $\psi_n=\frac{n}{d}$ and $\psi_p=\frac{p}{d}$ fixed. By characterizing the test and train errors, we identify regimes of generalization and memorization as a function of $\psi_n,\psi_p$, and $m$. Our theoretical findings are consistent with the empirical observations.
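To make the role of $m$ concrete, here is a minimal sketch of an empirical DSM loss at a single noise level, with $m$ noise draws per data point and a random-features score model. It uses the standard DSM form for a noising step $x_t = a_t x + \sigma_t z$; the names `dsm_loss`, `random_features_score`, `a_t`, `sigma_t`, and the `tanh` activation are assumptions for illustration and may differ from the paper's exact setup.

```python
import numpy as np

def dsm_loss(score_fn, X, m, a_t, sigma_t, rng):
    # X: (n, d) data. Draw m independent noise vectors per data point.
    n, d = X.shape
    Z = rng.standard_normal((n, m, d))
    X_t = a_t * X[:, None, :] + sigma_t * Z            # noised samples, (n, m, d)
    S = score_fn(X_t.reshape(n * m, d)).reshape(n, m, d)
    # Standard DSM residual: sigma_t * s(x_t) + z vanishes for the conditional score.
    return float(np.mean(np.sum((sigma_t * S + Z) ** 2, axis=-1)))

def random_features_score(A, W):
    # Hypothetical random-features model: s(x) = A tanh(W x / sqrt(d)) / sqrt(p).
    p, d = W.shape
    return lambda X: np.tanh(X @ W.T / np.sqrt(d)) @ A.T / np.sqrt(p)

rng = np.random.default_rng(0)
n, d, p, m = 2000, 5, 50, 4
a_t, sigma_t = 0.6, 0.8                               # a_t**2 + sigma_t**2 == 1
X = rng.standard_normal((n, d))                        # Gaussian target N(0, I)

# With this schedule x_t ~ N(0, I), so the true marginal score is s(x) = -x.
loss_true = dsm_loss(lambda x: -x, X, m, a_t, sigma_t, rng)
loss_zero = dsm_loss(lambda x: np.zeros_like(x), X, m, a_t, sigma_t, rng)

W = rng.standard_normal((p, d))                        # fixed random first layer
A = np.zeros((d, p))                                   # trainable second layer (untrained)
loss_rf = dsm_loss(random_features_score(A, W), X, m, a_t, sigma_t, rng)
```

Only `A` would be trained in a random-features model; because the loss is quadratic in `A`, it can be minimized by ridge regression rather than gradient descent.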
Simplex-to-Euclidean Bijections for Categorical Flow Matching
Bernardo Williams ⋅ Victor Yeom-Song ⋅ Marcelo Hartmann ⋅ Arto Klami
We propose a method for learning and sampling from probability distributions supported on the simplex. Our approach maps the open simplex to Euclidean space via smooth bijections, leveraging the Aitchison geometry to define the mappings, and supports modeling categorical data by a Dirichlet interpolation that dequantizes discrete observations into continuous ones. This enables density modeling in Euclidean space through the bijection while still allowing exact recovery of the original discrete distribution. Compared to previous methods that operate on the simplex using Riemannian geometry or custom noise processes, our approach works in Euclidean space while respecting the Aitchison geometry, and achieves competitive performance on both synthetic and real-world data sets.
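One well-known smooth bijection between the open simplex and Euclidean space in Aitchison geometry is the additive log-ratio (ALR) transform. The sketch below shows it together with its inverse. It is offered only as a familiar example of this kind of map; the abstract does not say which bijection the authors use, and the function names are illustrative.

```python
import numpy as np

def alr(p):
    # Additive log-ratio: maps a point of the open simplex in R^D to R^(D-1),
    # using the last coordinate as the reference component.
    p = np.asarray(p, dtype=float)
    return np.log(p[..., :-1]) - np.log(p[..., -1:])

def alr_inverse(z):
    # Inverse map: append a zero logit for the reference component, then softmax.
    z = np.asarray(z, dtype=float)
    logits = np.concatenate([z, np.zeros(z.shape[:-1] + (1,))], axis=-1)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)

p = np.array([0.2, 0.3, 0.5])
z = alr(p)               # unconstrained point in R^2
p_back = alr_inverse(z)  # recovers p up to floating-point error
```

A density model can then be fit on `z` in Euclidean space and mapped back with `alr_inverse`; the change of variables contributes a Jacobian term that this sketch does not compute.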
A Continuous Time Markov Chain Framework for Insertion Language Models
Dhruvesh Patel ⋅ Benjamin Rozonoyer ⋅ Soumitra Das ⋅ Tahira Naseem ⋅ Tim G. J. Rudner ⋅ Andrew McCallum
Insertion Language Models (ILMs) offer several advantages over left-to-right generation and mask-based generation. However, existing formulations of insertion-based generation have largely been ad-hoc. In this paper, we derive a diffusion-style denoising objective for ILMs from first principles by formulating the noising process as a continuous-time Markov chain on the space of variable-length sequences. We show that previous formulations of ILMs can be viewed as special cases of this denoising framework. Through empirical evaluation on a synthetic planning task, we show that the proposed approach retains the benefits of insertion-based generation over left-to-right generation and masked diffusion models. In language modeling, our diffusion-based approach is competitive with left-to-right generation and masked diffusion models, while offering additional flexibility in sampling compared to existing insertion language models.
Explicit Density Approximation for Neural Implicit Samplers Using a Bernstein-Based Convex Divergence
José Manuel de Frutos ⋅ Pablo Martínez Olmos ⋅ Manuel Vázquez ⋅ Joaquín Míguez
Rank-based objectives such as the invariant statistical loss (ISL) are robust, likelihood-free tools for training implicit generative models. We propose dual-ISL, obtained by interchanging the roles of the target $p$ and model density $\tilde p$ within ISL, which induces a convex optimization problem over model densities. We show that the associated rank-based discrepancy $d_K$ is continuous under weak and $L^1$ convergence and convex in its first argument, properties not shared by classical divergences such as KL or Wasserstein distances. Additionally, we prove that $d_K$ admits an $L^2$ interpretation: it is the projection of the density ratio $q=p/\tilde p$ onto a Bernstein polynomial basis. This yields explicit truncation-error bounds, sharp convergence rates, and a closed-form expression for the truncated density approximation. To handle multivariate data, we further introduce a sliced dual-ISL via random one-dimensional projections that preserves both continuity and convexity. Empirically, across several benchmarks, dual-ISL delivers faster and smoother convergence than standard ISL and offers competitive, often superior, mode coverage relative to state-of-the-art implicit models (modern GAN baselines, including multi-critic setups), while providing an explicit density approximation.
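The link between rank statistics and Bernstein polynomials can be seen in a small one-dimensional sketch. Assume continuous densities and let $u=\tilde F(y)$ be the model CDF evaluated at a target sample $y$. The probability that exactly $k$ of $K$ model samples fall below $y$ is then the integral over $[0,1]$ of the Bernstein basis function $\binom{K}{k}u^k(1-u)^{K-k}$ against the density ratio written as a function of $u$. The code below evaluates these probabilities numerically; the function names and the midpoint-rule integration are illustrative choices, not the paper's implementation.

```python
from math import comb

def bernstein_basis(k, K, u):
    # Degree-K Bernstein basis polynomial b_{k,K}(u) on [0, 1].
    return comb(K, k) * u**k * (1.0 - u) ** (K - k)

def rank_probabilities(K, density_ratio, grid_size=2000):
    # P(rank = k) = integral of b_{k,K}(u) * q(u) du, via the midpoint rule.
    # `density_ratio` is q expressed as a function of u = F_model(y) (assumption).
    h = 1.0 / grid_size
    probs = []
    for k in range(K + 1):
        total = 0.0
        for i in range(grid_size):
            u = (i + 0.5) * h
            total += bernstein_basis(k, K, u) * density_ratio(u)
        probs.append(total * h)
    return probs

# When model and target coincide, q is identically 1 and the ranks are uniform.
uniform_probs = rank_probabilities(4, lambda u: 1.0)
# A mismatched example: q(u) = 2u is a valid density on [0, 1].
skewed_probs = rank_probabilities(4, lambda u: 2.0 * u)
```

Deviation of the rank probabilities from the uniform value $1/(K+1)$ is what a rank-based discrepancy measures in this picture.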
High-Performance Self-Supervised Learning by Joint Training of Flow Matching
Kosuke Ukita ⋅ Tsuyoshi Okita
Diffusion models can learn rich representations during data generation, showing potential for Self-Supervised Learning (SSL), but they face a trade-off between generative quality and discriminative performance. Their iterative sampling also incurs substantial computational and energy costs, hindering industrial and edge AI applications. To address these issues, we propose the Flow Matching-based Sensor Foundation Model (SenFlow), which jointly trains a representation encoder and a conditional flow matching generator. This decoupled design achieves both high-fidelity generation and effective recognition. By using flow matching to learn a simpler velocity field, SenFlow accelerates and stabilizes training, improving its efficiency for representation learning. Experiments on wearable sensor data show that SenFlow reduces training time by 50.4% compared to a diffusion-based approach. On downstream tasks, SenFlow surpasses the state-of-the-art SSL method on all five datasets while achieving up to a 51.0x inference speedup and maintaining high generative quality. The implementation code is available at https://github.com/Okita-Laboratory/SenFlow.