### Poster Session

## Poster Session 1

##### Auditorium 1 Foyer

**Learning to Solve the Constrained Most Probable Explanation Task in Probabilistic Graphical Models**

Shivvrat Arya · Tahrima Rahman · Vibhav Gogate

We propose a self-supervised learning approach for solving the following constrained optimization task in log-linear models or Markov networks. Let $f$ and $g$ be two log-linear models defined over the sets $X$ and $Y$ of random variables. Given an assignment $x$ to all variables in $X$ (evidence or observations) and a real number $q$, the constrained most-probable explanation (CMPE) task seeks to find an assignment $y$ to all variables in $Y$ such that $f(x, y)$ is maximized and $g(x, y) \leq q$. In our proposed self-supervised approach, given assignments $x$ to $X$ (data), we train a deep neural network that learns to output near-optimal solutions to the CMPE problem without requiring access to any pre-computed solutions. The key idea in our approach is to use first principles and approximate inference methods for CMPE to derive novel loss functions that seek to push infeasible solutions towards feasible ones and feasible solutions towards optimal ones. We analyze the properties of our proposed method and experimentally demonstrate its efficacy on several benchmark problems.

In the current era of vast data and transparent machine learning, it is essential for techniques to operate at a large scale while providing a clear mathematical comprehension of the internal workings of the method. Although there already exist interpretable semi-parametric regression methods for large-scale applications that take into account non-linearity in the data, the complexity of the models is still often limited. One of the main challenges is the absence of interactions in these models, which are left out for the sake of better interpretability but also due to impractical computational costs. To overcome this limitation, we propose a new approach using a factorization method to derive a highly scalable higher-order tensor product spline model. Our method allows for the incorporation of all (higher-order) interactions of non-linear feature effects while having computational costs proportional to a model without interactions. We further develop a meaningful penalization scheme and examine the induced optimization problem. We conclude by evaluating the predictive and estimation performance of our method.

**Fast Dynamic Sampling for Determinantal Point Processes**

Zhao Song · Junze Yin · Lichen Zhang · Ruizhe Zhang

n this work, we provide fast dynamic algorithms for repeatedly sampling from distributions characterized by Determinantal Point Processes (DPPs) and Nonsymmetric Determinantal Point Processes (NDPPs). DPPs are a very well-studied class of distributions on subsets of items drawn from a ground set of cardinality $n$ characterized by a symmetric $n \times n$ kernel matrix $L$ such that the probability of any subset is proportional to the determinant of its corresponding principal submatrix. Recent work has shown that the kernel symmetry constraint can be relaxed, leading to NDPPs, which can better model data in several machine learning applications. Given a low-rank kernel matrix ${\cal L}=L+L^\top\in \mathbb{R}^{n\times n}$ and its corresponding eigendecomposition specified by $\{\lambda_i, u_i \}_{i=1}^d$ where $d\leq n$ is the rank, we design a data structure that uses $O(nd)$ space and preprocesses data in $O(nd^{\omega-1})$ time where $\omega\approx 2.37$ is the exponent of matrix multiplication. The data structure can generate a sample according to DPP distribution in time $O(|E|^3\log n+|E|^{\omega-1}d^2)$ or according to NDPP distribution in time $O((|E|^3 \log n+ |E|^{\omega-1}d^2)(1+w)^d)$ for $E$ being the sampled indices and $w$ is a data-dependent parameter. This improves upon the space and preprocessing time over prior works, and achieves a state-of-the-art sampling time when the sampling set is relatively dense. At the heart of our data structure is an efficient sampling tree that can leverage batch initialization and fast inner product query simultaneously.

**Optimal Zero-Shot Detector for Multi-Armed Attacks**

Federica Granese · Marco Romanelli · Pablo Piantanida

This research delves into a scenario where a malicious actor can manipulate data samples using a multi-armed attack strategy, providing them with multiple ways to introduce noise into the data sample. Our central objective is to protect the data by detecting any alterations to the input. We approach this defensive strategy with utmost caution, operating in an environment where the defender possesses significantly less information compared to the attacker. Specifically, the defender is unable to utilize any data samples for training a defense model or verifying the integrity of the channel. Instead, the defender relies exclusively on a set of pre-existing detectors readily available "off the shelf." To tackle this challenge, we derive an innovative information-theoretic defense approach that optimally aggregates the decisions made by these detectors, eliminating the need for any training data. We further explore a practical use-case scenario for empirical evaluation, where the attacker possesses a pre-trained classifier and launches well-known adversarial attacks against it. Our experiments highlight the effectiveness of our proposed solution, even in scenarios that deviate from the optimal setup.

Data-driven approaches to predict-then-optimize decision-making problems seek to mitigate the risk of uncertainty region misspecification in safety-critical settings. Current approaches, however, suffer from considering overly conservative uncertainty regions, often resulting in suboptimal decision-making. To this end, we propose Conformal-Predict-Then-Optimize (CPO), a framework for leveraging highly informative, nonconvex conformal prediction regions over high-dimensional spaces based on conditional generative models, which have the desired distribution-free coverage guarantees. Despite guaranteeing robustness, such black-box optimization procedures alone inspire little confidence owing to the lack of explanation of why a particular decision was found to be optimal. We, therefore, augment CPO to additionally provide semantically meaningful visual summaries of the uncertainty regions to give qualitative intuition for the optimal decision. We highlight the CPO framework by demonstrating results on a suite of simulation-based inference benchmark tasks and a vehicle routing task based on probabilistic weather prediction.

**From Data Imputation to Data Cleaning --- Automated Cleaning of Tabular Data Improves Downstream Predictive Performance**

Sebastian Jäger · Felix Biessmann

The translation of Machine Learning (ML) research innovations to real-world applications and the maintenance of ML components are hindered by reoccurring challenges, such as reaching high predictive performance, robustness, complying with regulatory constraints, or meeting ethical standards. Many of these challenges are related to data quality and, in particular, to the lack of automation in data pipelines upstream of ML components. Automated data cleaning remains challenging since many approaches neglect the dependency structure of the data errors and require task-specific heuristics or human input for calibration. In this study, we develop and evaluate an application-agnostic ML-based data cleaning approach using well-established imputation techniques for automated detection and cleaning of erroneous values. To improve the degree of automation, we combine imputation techniques with conformal prediction (CP), a model-agnostic and distribution-free method to quantify and calibrate the uncertainty of ML models. Extensive empirical evaluations demonstrate that Conformal Data Cleaning (CDC) improves predictive performance in downstream ML tasks in the majority of cases. Our code is available on GitHub: \url{https://github.com/se-jaeger/conformal-data-cleaning}.

**On-Demand Federated Learning for Arbitrary Target Class Distributions**

Isu Jeong · Seulki Lee

We introduce On-Demand Federated Learning (On-Demand FL), which enables on-demand federated learning of a deep model for an arbitrary target data distribution of interest by making the best use of the heterogeneity (non-IID-ness) of local client data, unlike existing approaches trying to circumvent the non-IID nature of federated learning. On-Demand FL composes a dataset of the target distribution, which we call the composite dataset, from a selected subset of local clients whose aggregate distribution is expected to emulate the target distribution as a whole. As the composite dataset consists of a precise yet diverse subset of clients reflecting the target distribution, the on-demand model trained with exactly enough selected clients becomes able to improve the model performance on the target distribution compared when trained with off-target and/or unknown distributions while reducing the number of participating clients and federating rounds. We model the target data distribution in terms of class and estimate the class distribution of each local client from the weight gradient of its local model. Our experiment results show that On-Demand FL achieves up to 5\% higher classification accuracy on various target distributions just involving 9${\times}$ fewer clients with FashionMNIST, CIFAR-10, and CIFAR-100.

**Manifold-Aligned Counterfactual Explanations for Neural Networks**

Asterios Tsiourvas · Wei Sun · Georgia Perakis

We study the problem of finding optimal manifold-aligned counterfactual explanations for neural networks. Existing approaches that involve solving a complex mixed-integer optimization (MIP) problem frequently suffer from scalability issues, limiting their practical usefulness. Furthermore, the solutions are not guaranteed to follow the data manifold, resulting in unrealistic counterfactual explanations. To address these challenges, we first present a MIP formulation where we explicitly enforce manifold alignment by reformulating the highly nonlinear Local Outlier Factor (LOF) metric as mixed-integer constraints. To address the computational challenge, we leverage the geometry of a trained neural network and propose an efficient decomposition scheme that reduces the initial large, hard-to-solve optimization problem into a series of significantly smaller, easier-to-solve problems by constraining the search space to “live” polytopes, i.e., regions that contain at least one actual data point. Experiments on real-world datasets demonstrate the efficacy of our approach in producing both optimal and realistic counterfactual explanations, and computational traceability.

**On the (In)feasibility of ML Backdoor Detection as an Hypothesis Testing Problem**

Georg Pichler · Marco Romanelli · Divya Prakash Manivannan · Prashanth Krishnamurthy · Farshad khorrami · Siddharth Garg

We introduce a formal statistical definition for the problem of backdoor detection in machine learning systems and use it to analyze the feasibility of such problems, providing evidence for the utility and applicability of our definition. The main contributions of this work are an impossibility result and an achievability result for backdoor detection. We show a no-free-lunch theorem, proving that universal (adversary-unaware) backdoor detection is impossible, except for very small alphabet sizes. Thus, we argue, that backdoor detection methods need to be either explicitly, or implicitly adversary-aware. However, our work does not imply that backdoor detection cannot work in specific scenarios, as evidenced by successful backdoor detection methods in the scientific literature. Furthermore, we connect our definition to the probably approximately correct (PAC) learnability of the out-of-distribution detection problem.

**Simple and scalable algorithms for cluster-aware precision medicine**

Amanda M. Buch · Conor Liston · Logan Grosenick

AI-enabled precision medicine promises a transformational improvement in healthcare outcomes. However, training on biomedical data presents significant challenges as they are often high dimensional, clustered, and of limited sample size. To overcome these challenges, we propose a simple and scalable approach for cluster-aware embedding that combines latent factor methods with a convex clustering penalty in a modular way. Our novel approach overcomes the complexity and limitations of current joint embedding and clustering methods and enables hierarchically clustered principal component analysis (PCA), locally linear embedding (LLE), and canonical correlation analysis (CCA). Through numerical experiments and real-world examples, we demonstrate that our approach outperforms fourteen clustering methods on highly underdetermined problems (e.g., with limited sample size) as well as on large sample datasets. Importantly, our approach does not require the user to choose the desired number of clusters, yields improved model selection if they do, and yields interpretable hierarchically clustered embedding dendrograms. Thus, our approach improves significantly on existing methods for identifying patient subgroups in multiomics and neuroimaging data and enables scalable and interpretable biomarkers for precision medicine.

**HintMiner: Automatic Question Hints Mining From Q\&A Web Posts with Language Model via Self-Supervised Learning**

Zhenyu Zhang · yang jiudong

Users often need ask questions and seek answers online.The Question - Answering (QA) forums such as Stack Overflow cannot always respond to the questions timely and properly. In this paper, we propose HintMiner, a novel automatic question hints mining tool for users to help them find answers. HintMiner leverages the machine comprehension and sequence generation techniques to automatically generate hints for users' questions.It firstly retrieve many web Q\&A posts and then extract some hints from the posts using MiningNet that is built via a language model.Using the huge amount of online Q\&A posts, we design a self-supervised objective to train the MiningNet that is a neural encoder-decoder model based on the transformer and copying mechanisms.We have evaluated HintMiner on 60,000 Stack Overflow questions. The experiment results show that the proposed approach is effective. For example, HintMiner achieves an average BLEU score of 36.17\% and an average ROUGE-2 score of 36.29\%. Our tool and experimental data are publicly available at \url{https://github.com/zhangzhenyu13/HintMiner}.

**Enhancing In-context Learning via Linear Probe Calibration**

Momin Abbas · Yi Zhou · Parikshit Ram · Nathalie Baracaldo · Horst Samulowitz · Theodoros Salonidis · Tianyi Chen

In-context learning (ICL) is a new paradigm for natural language processing that utilizes Generative Pre-trained Transformer (GPT)-like models. This approach uses prompts that include in-context demonstrations to generate the corresponding output for a new query input. However, applying ICL in real cases does not scale with the number of samples, and lacks robustness to different prompt templates and demonstration permutations. In this paper, we first show that GPT-like models using ICL result in unreliable predictions based on a new metric based on Shannon entropy. Then, to solve this problem, we propose a new technique called the Linear Probe Calibration (LinC), a method that calibrates the model's output probabilities, resulting in reliable predictions and improved performance, while requiring only minimal additional samples (as few as five labeled data samples). LinC significantly enhances the ICL test performance of GPT models on various benchmark datasets, with an average improvement of up to 21\%, and up to a 50\% improvement in some cases, and significantly boosts the performance of PEFT methods, especially in the low resource regime. Moreover, LinC achieves lower expected calibration error, and is highly robust to varying label proportions, prompt templates, and demonstration permutations.

**TenGAN: Pure Transformer Encoders Make an Efficient Discrete GAN for De Novo Molecular Generation**

Chen Li · Yoshihiro Yamanishi

Deep generative models for de novo molecular generation using discrete data, such as the simplified molecular-input line-entry system (SMILES) strings, have attracted widespread attention in drug design. However, training instability often plagues generative adversarial networks (GANs), leading to problems such as mode collapse and low diversity. This study proposes a pure transformer encoder-based GAN (TenGAN) to solve these issues. The generator and discriminator of TenGAN are variants of the transformer encoders and are combined with reinforcement learning (RL) to generate molecules with the desired chemical properties. Besides, data augmentation of the variant SMILES is leveraged for the TenGAN training to learn the semantics and syntax of SMILES strings. Additionally, we introduce an enhanced variant of TenGAN, named Ten(W)GAN, which incorporates mini-batch discrimination and Wasserstein GAN to improve the ability to generate molecules. The experimental results and ablation studies on the QM9 and ZINC datasets showed that the proposed models generated highly valid and novel molecules with the desired chemical properties in a computationally efficient manner.

Motivated by the cost heterogeneity in experimentation across different alternatives, we study the Best Arm Identification with Resource Constraints (BAIwRC) problem. The agent aims to identify the best arm under resource constraints, where resources are consumed for each arm pull. We make two novel contributions. We design and analyze the Successive Halving with Resource Rationing algorithm (SH-RR). The SH-RR achieves a near-optimal non-asymptotic rate of convergence in terms of the probability of successively identifying an optimal arm. Interestingly, we identify a difference in convergence rates between the cases of deterministic and stochastic resource consumption.

**LEDetection: A Simple Framework for Semi-Supervised Few-Shot Object Detection**

Phi Vu Tran

Few-shot object detection (FSOD) is a challenging problem aimed at detecting novel concepts from few exemplars. Existing approaches to FSOD all assume abundant base labels to adapt to novel objects. This paper studies the new task of semi-supervised FSOD by considering a realistic scenario in which both base and novel labels are simultaneously scarce. We explore the utility of unlabeled data within our proposed label-efficient detection framework and discover its remarkable ability to boost semi-supervised FSOD by way of region proposals. Motivated by this finding, we introduce SoftER Teacher, a robust detector combining pseudo-labeling with consistency learning on region proposals, to harness unlabeled data for improved FSOD without relying on abundant labels. Rigorous experiments show that SoftER Teacher surpasses the novel performance of a strong supervised detector using only 10\% of required base labels, without catastrophic forgetting observed in prior approaches. Our work also sheds light on a potential relationship between semi-supervised and few-shot detection suggesting that a stronger semi-supervised detector leads to a more effective few-shot detector.

**Fusing Individualized Treatment Rules Using Secondary Outcomes**

Daiqi Gao · Yuanjia Wang · Donglin Zeng

An individualized treatment rule (ITR) is a decision rule that recommends treatments for patients based on their individual feature variables. In many practices, the ideal ITR for the primary outcome is also expected to cause minimal harm to other secondary outcomes. Therefore, our objective is to learn an ITR that not only maximizes the value function for the primary outcome, but also approximates the optimal rule for the secondary outcomes as closely as possible. To achieve this goal, we introduce a fusion penalty to encourage the ITRs based on different outcomes to yield similar recommendations. Two algorithms are proposed to estimate the ITR using surrogate loss functions. We prove that the agreement rate between the estimated ITR of the primary outcome and the optimal ITRs of the secondary outcomes converges to the true agreement rate faster than if the secondary outcomes are not taken into consideration. Furthermore, we derive the non-asymptotic properties of the value function and misclassification rate for the proposed method. Finally, simulation studies and a real data example are used to demonstrate the finite-sample performance of the proposed method.

**A White-Box False Positive Adversarial Attack Method on Contrastive Loss Based Offline Handwritten Signature Verification Models**

Zhongliang Guo · Weiye Li · Yifei Qian · Ognjen Arandjelovic · Lei Fang

In this paper, we tackle the challenge of white-box false positive adversarial attacks on contrastive loss based offline handwritten signature verification models. We propose a novel attack method that treats the attack as a style transfer between closely related but distinct writing styles. To guide the generation of deceptive images, we introduce two new loss functions that enhance the attack success rate by perturbing the Euclidean distance between the embedding vectors of the original and synthesized samples, while ensuring minimal perturbations by reducing the difference between the generated image and the original image. Our method demonstrates state-of-the-art performance in white-box attacks on contrastive loss based offline handwritten signature verification models, as evidenced by our experiments. The key contributions of this paper include a novel false positive attack method, two new loss functions, effective style transfer in handwriting styles, and superior performance in white-box false positive attacks compared to other white-box attack methods.

**When No-Rejection Learning is Consistent for Regression with Rejection**

Xiaocheng Li · Shang Liu · Chunlin Sun · Hanzhao Wang

Learning with rejection has been a prototypical model for studying the human-AI interaction on prediction tasks. Upon the arrival of a sample instance, the model first uses a rejector to decide whether to accept and use the AI predictor to make a prediction or reject and defer the sample to humans. Learning such a model changes the structure of the original loss function and often results in undesirable non-convexity and inconsistency issues. For the classification with rejection problem, several works develop consistent surrogate losses for the joint learning of the predictor and the rejector, while there have been fewer works for the regression counterpart. This paper studies the regression with rejection (RwR) problem and investigates a no-rejection learning strategy that uses all the data to learn the predictor. We first establish the consistency for such a strategy under the weak realizability condition. Then for the case without the weak realizability, we show that the excessive risk can also be upper bounded with the sum of two parts: prediction error and calibration error. Lastly, we demonstrate the advantage of such a proposed learning strategy with empirical evidence.

**Efficient Graph Laplacian Estimation by Proximal Newton**

Yakov Medvedovsky · Eran Treister · Tirza Routtenberg

The Laplacian-constrained Gaussian Markov Random Field (LGMRF) is a common multivariate statistical model for learning a weighted sparse dependency graph from given data. This graph learning problem can be formulated as a maximum likelihood estimation (MLE) of the precision matrix, subject to Laplacian structural constraints, with a sparsity-inducing penalty term. This paper aims to solve this learning problem accurately and efficiently.First, since the commonly used $\ell_1$-norm penalty is inappropriate in this setting and may lead to a complete graph, we employ the nonconvex minimax concave penalty (MCP), which promotes sparse solutions with lower estimation bias. Second, as opposed to existing first-order methods for this problem, we develop a second-order proximal Newton approach to obtain an efficient solver, utilizing several algorithmic features, such as using conjugate gradients, preconditioning, and splitting to active/free sets. Numerical experiments demonstrate the advantages of the proposed method in terms of both computational complexity and graph learning accuracy compared to existing methods.

**Simulation-Free Schrödinger Bridges via Score and Flow Matching**

Alexander Tong · Nikolay Malkin · Kilian Fatras · Lazar Atanackovic · Yanlei Zhang · Guillaume Huguet · Guy Wolf · Yoshua Bengio

We present simulation-free score and flow matching ([SF]$^2$M), a simulation-free objective for inferring stochastic dynamics given unpaired samples drawn from arbitrary source and target distributions. Our method generalizes both the score-matching loss used in the training of diffusion models and the recently proposed flow matching loss used in the training of continuous normalizing flows. [SF]$^2$M interprets continuous-time stochastic generative modeling as a Schrödinger bridge problem. It relies on static entropy-regularized optimal transport, or a minibatch approximation, to efficiently learn the SB without simulating the learned stochastic process. We find that [SF]$^2$M is more efficient and gives more accurate solutions to the SB problem than simulation-based methods from prior work. Finally, we apply [SF]$^2$M to the problem of learning cell dynamics from snapshot data. Notably, [SF]$^2$M is the first method to accurately model cell dynamics in high dimensions and can recover known gene regulatory networks from simulated data. Our code is available in the TorchCFM package at \url{https://github.com/atong01/conditional-flow-matching}.

**Federated Learning For Heterogeneous Electronic Health Records Utilising Augmented Temporal Graph Attention Networks**

Soheila Molaei · Anshul Thakur · Ghazaleh Niknam · Andrew Soltan · Hadi Zare · David Clifton

The proliferation of decentralised electronic healthcare records (EHRs) across medical institutions requires innovative federated learning strategies for collaborative data analysis and global model training, prioritising data privacy. A prevalent issue during decentralised model training is the data-view discrepancies across medical institutions that arises from differences or availability of healthcare services, such as blood test panels. The prevailing way to handle this issue is to select a common subset of features across institutions to make data-views consistent. This approach, however, constrains some institutions to shed some critical features that may play a significant role in improving the model performance. This paper introduces a federated learning framework that relies on augmented graph attention networks to address data-view heterogeneity. The proposed framework utilises an alignment augmentation layer over self-attention mechanisms to weigh the importance of neighbouring nodes when updating a node's embedding irrespective of the data-views. Furthermore, our framework adeptly addresses both the temporal nuances and structural intricacies of EHR datasets. This dual capability not only offers deeper insights but also effectively encapsulates EHR graphs' time-evolving nature. Using diverse real-world datasets, we show that the proposed framework significantly outperforms conventional FL methodology for dealing with heterogeneous data-views.

**Cross-model Mutual Learning for Exemplar-based Medical Image Segmentation**

Qing En · Yuhong Guo

Medical image segmentation typically demands extensive dense annotations for model training, which is both time-consuming and skill-intensive. To mitigate this burden, exemplar-based medical image segmentation methods have been introduced to achieve effective training with only one annotated image. In this paper, we introduce a novel Cross-model Mutual learning framework for Exemplar-based Medical image Segmentation (CMEMS), which leverages two models to mutually excavate implicit information from unlabeled data at multiple granularities. CMEMS can eliminate confirmation bias and enable collaborative training to learn complementary information by enforcing consistency at different granularities across models. Concretely, cross-model image perturbation based mutual learning is devised by using weakly perturbed images to generate high-confidence pseudo-labels, supervising predictions of strongly perturbed images across models. This approach enables joint pursuit of prediction consistency at the image granularity. Moreover, cross-model multi-level feature perturbation based mutual learning is designed by letting pseudo-labels supervise predictions from perturbed multi-level features with different resolutions, which can broaden the perturbation space and enhance the robustness of our framework. CMEMS is jointly trained using exemplar data, synthetic data, and unlabeled data in an end-to-end manner. Experimental results on two medical image datasets indicate that the proposed CMEMS outperforms the state-of-the-art segmentation methods with extremely limited supervision.

**Provable local learning rule by expert aggregation for a Hawkes network**

Sophie Jaffard · Samuel Vaiter · Alexandre Muzy · Patricia Reynaud-Bouret

We propose a simple network of Hawkes processes as a cognitive model capable of learning to classify objects. Our learning algorithm, named HAN for Hawkes Aggregation of Neurons, is based on a local synaptic learning rule based on spiking probabilities at each output node. We were able to use local regret bounds to prove mathematically that the network is able to learn on average and even asymptotically under more restrictive assumptions.

**Ethics in Action: Training Reinforcement Learning Agents for Moral Decision-making In Text-based Adventure Games**

Weichen Li · Rati Devidze · Waleed Mustafa · Sophie Fellenz

Reinforcement Learning (RL) has demonstrated its potential in solving goal-oriented sequential tasks. However, with the increasing capabilities of RL agents, ensuring morally responsible agent behavior is becoming a pressing concern. Previous approaches have included moral considerations by statically assigning a moral score to each action at runtime. However, these methods do not account for the potential moral value of future states when evaluating immoral actions. This limits the ability to find trade-offs between different aspects of moral behavior and the utility of the action. In this paper, we aim to factor in moral scores by adding a constraint to the RL objective that is incorporated during training, thereby dynamically adapting the policy function. By combining Lagrangian optimization and meta-gradient learning, we develop an RL method that is able to find a trade-off between immoral behavior and performance in the decision-making process.

**A Primal-Dual-Critic Algorithm for Offline Constrained Reinforcement Learning**

Kihyuk Hong · Yuhang Li · Ambuj Tewari

Offline constrained reinforcement learning (RL) aims to learn a policy that maximizes the expected cumulative reward subject to constraints on expected cumulative cost using an existing dataset. In this paper, we propose Primal-Dual-Critic Algorithm (PDCA), a novel algorithm for offline constrained RL with general function approximation. PDCA runs a primal-dual algorithm on the Lagrangian function estimated by critics. The primal player employs a no-regret policy optimization oracle to maximize the Lagrangian estimate and the dual player acts greedily to minimize the Lagrangian estimate. We show that PDCA finds a near saddle point of the Lagrangian, which is nearly optimal for the constrained RL problem. Unlike previous work that requires concentrability and a strong Bellman completeness assumption, PDCA only requires concentrability and realizability assumptions for sample-efficient learning.

**Solving General Noisy Inverse Problem via Posterior Sampling: A Policy Gradient Viewpoint**

Haoyue Tang · Tian Xie · Aosong Feng · Hanyu Wang · Chenyang Zhang · Yang Bai

Solving image inverse problems (e.g., super-resolution and inpainting) requires generating a high fidelity image that matches the given input (the low-resolution image or the masked image).By using the input image as guidance, we can leverage a pretrained diffusion generative model to solve a wide range of image inverse tasks without task specific model fine-tuning.To precisely estimate the guidance score function of the input image, we propose Diffusion Policy Gradient (DPG), a tractable computation method by viewing the intermediate noisy images as policies and the target image as the states selected by the policy.Experiments show that our method is robust to both Gaussian and Poisson noise degradation on multiple linear and non-linear inverse tasks, resulting into a higher image restoration quality on FFHQ, ImageNet and LSUN datasets.

**Safe and Interpretable Estimation of Optimal Treatment Regimes**

Harsh Parikh · Quinn Lanners · Zade Akras · Sahar Zafar · M Brandon Westover · Cynthia Rudin · Alexander Volfovsky

Recent advancements in statistical and reinforcement learning methods have contributed to superior patient care strategies.However, these methods face substantial challenges in high-stakes contexts, including missing data, stochasticity, and the need for interpretability and patient safety.Our work operationalizes a safe and interpretable approach for optimizing treatment regimes by matching patients with similar medical and pharmacological profiles.This allows us to construct optimal policies via interpolation.Our comprehensive simulation study demonstrates our method's effectiveness in complex scenarios.We use this approach to study seizure treatment in critically ill patients, advocating for personalized strategies based on medical history and pharmacological features.Our findings recommend reducing medication doses for mild, brief seizure episodes and adopting aggressive treatment strategies for severe cases, leading to improved outcomes.

**MIM-Reasoner: Learning with Theoretical Guarantees for Multiplex Influence Maximization**

Nguyen Do · Tanmoy Chowdhury · Chen Ling · Liang Zhao · My T. Thai

Multiplex influence maximization (MIM) asks us to identify a set of seed users such as to maximize the expected number of influenced users in a multiplex network. MIM has been one of central research topics, especially in nowadays social networking landscape where users participate in multiple online social networks (OSNs) and their influences can propagate among several OSNs simultaneously. Although there exist a couple combinatorial algorithms to MIM, learning-based solutions have been desired due to its generalization ability to heterogeneous networks and their diversified propagation characteristics. In this paper, we introduce MIM-Reasoner, coupling reinforcement learning with probabilistic graphical model, which effectively captures the complex propagation process within and between layers of a given multiplex network, thereby tackling the most challenging problem in MIM. We establish a theoretical guarantee for MIM-Reasoner as well as conduct extensive analyses on both synthetic and real-world datasets to validate our MIM-Reasoner's performance.

**Bayesian Online Learning for Consensus Prediction**

Sam Showalter · Alex Boyd · Padhraic Smyth · Mark Steyvers

Given a pre-trained classifier and multiple human experts, we investigate the task of online classification where model predictions are provided for free but querying humans incurs a cost. In this practical but under-explored setting, oracle ground truth is not available. Instead, the prediction target is defined as the consensus vote of all experts. Given that querying full consensus can be costly, we propose a general framework for online Bayesian consensus estimation, leveraging properties of the multivariate hypergeometric distribution. Based on this framework, we propose a family of methods that dynamically estimate expert consensus from partial feedback by producing a posterior over expert and model beliefs. Analyzing this posterior induces an interpretable trade-off between querying cost and classification performance. We demonstrate the efficacy of our framework against a variety of baselines on CIFAR-10H and ImageNet-16H, two large-scale crowdsourced datasets.

**BlockBoost: Scalable and Efficient Blocking through Boosting**

Thiago Ramos · Rodrigo Schuller · Alex Akira Okuno · Lucas Nissenbaum · Roberto Oliveira · Paulo Orenstein

As datasets grow larger, matching and merging entries from different databases has become a costly task in modern data pipelines. To avoid expensive comparisons between entries, blocking similar items is a popular preprocessing step. In this paper, we introduce BlockBoost, a novel boosting-based method that generates compact binary hash codes for database entries, through which blocking can be performed efficiently. The algorithm is fast and scalable, resulting in computational costs that are orders of magnitude lower than current benchmarks. Unlike existing alternatives, BlockBoost comes with associated feature importance measures for interpretability, and possesses strong theoretical guarantees, including lower bounds on critical performance metrics like recall and reduction ratio. Finally, we show that BlockBoost delivers great empirical results, outperforming state-of-the-art blocking benchmarks in terms of both performance metrics and computational cost.

**Estimating treatment effects from single-arm trials via latent-variable modeling**

Manuel Haußmann · Tran Le · Viivi Halla-aho · Samu Kurki · Jussi Leinonen · Miika Koskinen · Samuel Kaski · Harri Lähdesmäki

Randomized controlled trials (RCTs) are the accepted standard for treatment effect estimation but they can be infeasible due to ethical reasons and prohibitive costs. Single-arm trials, where all patients belong to the treatment group, can be a viable alternative but require access to an external control group. We propose an identifiable deep latent-variable model for this scenario that can also account for missing covariate observations by modeling their structured missingness patterns. Our method uses amortized variational inference to learn both group-specific and identifiable shared latent representations, which can subsequently be used for {\em (i)} patient matching if treatment outcomes are not available for the treatment group, or for {\em (ii)} direct treatment effect estimation assuming outcomes are available for both groups. We evaluate the model on a public benchmark as well as on a data set consisting of a published RCT study and real-world electronic health records. Compared to previous methods, our results show improved performance both for direct treatment effect estimation as well as for effect estimation via patient matching.

**ALAS: Active Learning for Autoconversion Rates Prediction from Satellite Data**

Maria Carolina Novitasari · Johannes Quaas · Miguel Rodrigues

High-resolution simulations, such as the ICOsahedral Non-hydrostatic Large-Eddy Model (ICON-LEM), provide valuable insights into the complex interactions among aerosols, clouds, and precipitation, which are the major contributors to climate change uncertainty. However, due to their exorbitant computational costs, they can only be employed for a limited period and geographical area. To address this, we propose a more cost-effective method powered by an emerging machine learning approach to better understand the intricate dynamics of the climate system. Our approach involves active learning techniques by leveraging high-resolution climate simulation as an oracle that is queried based on an abundant amount of unlabeled data drawn from satellite observations. In particular, we aim to predict autoconversion rates, a crucial step in precipitation formation, while significantly reducing the need for a large number of labeled instances. In this study, we present novel methods: custom fusion query strategies for labeling instances -- weight fusion (WiFi) and merge fusion (MeFi) -- along with active feature selection based on SHAP. These methods are designed to tackle real-world challenges -- in this case, climate change, with a specific focus on the prediction of autoconversion rates -- due to their simplicity and practicality in application.

**Dynamic Inter-treatment Information Sharing for Individualized Treatment Effects Estimation**

Vinod Kumar Chauhan · Jiandong Zhou · Ghadeer Ghosheh · Soheila Molaei · David Clifton

Estimation of individualized treatment effects (ITE) from observational studies is a fundamental problem in causal inference and holds significant importance across domains, including healthcare. However, limited observational datasets pose challenges in reliable ITE estimation as data have to be split among treatment groups to train an ITE learner. While information sharing among treatment groups can partially alleviate the problem, there is currently no general framework for end-to-end information sharing in ITE estimation. To tackle this problem, we propose a deep learning framework based on `\textit{soft weight sharing}' to train ITE learners, enabling \textit{dynamic end-to-end} information sharing among treatment groups. The proposed framework complements existing ITE learners, and introduces a new class of ITE learners, referred to as \textit{HyperITE}. We extend state-of-the-art ITE learners with \textit{HyperITE} versions and evaluate them on IHDP, ACIC-2016, and Twins benchmarks. Our experimental results show that the proposed framework improves ITE estimation error, with increasing effectiveness for smaller datasets.

**Shape Arithmetic Expressions: Advancing Scientific Discovery Beyond Closed-Form Equations**

Krzysztof Kacprzyk · Mihaela van der Schaar

Symbolic regression has excelled in uncovering equations from physics, chemistry, biology, and related disciplines. However, its effectiveness becomes less certain when applied to experimental data lacking inherent closed-form expressions. Empirically derived relationships, such as entire stress-strain curves, may defy concise closed-form representation, compelling us to explore more adaptive modeling approaches that balance flexibility with interpretability. In our pursuit, we turn to Generalized Additive Models (GAMs), a widely used class of models known for their versatility across various domains. Although GAMs can capture non-linear relationships between variables and targets, they cannot capture intricate feature interactions. In this work, we investigate both of these challenges and propose a novel class of models, Shape Arithmetic Expressions (SHAREs), that fuses GAM's flexible shape functions with the complex feature interactions found in mathematical expressions. SHAREs also provide a unifying framework for both of these approaches. We also design a set of rules for constructing SHAREs that guarantee transparency of the found expressions beyond the standard constraints based on the model's size.

**Attention-based Multi-instance Mixed Models**

Jan Engelmann · Alessandro Palma · Jakub Tomczak · Fabian Theis · Francesco Paolo Casale

Predicting patient features from single-cell data can unveil cellular states implicated in health and disease. Linear models and average cell type expressions are typically favored for this task for their efficiency and robustness, but they overlook the rich cell heterogeneity inherent in single-cell data. To address this gap, we introduce GMIL, a framework integrating Generalized Linear Mixed Models (GLMM) and Multiple Instance Learning (MIL), upholding the advantages of linear models while modeling cell-state heterogeneity. By leveraging predefined cell embeddings, GMIL enhances computational efficiency and aligns with recent advancements in single-cell representation learning. Our empirical results reveal that GMIL outperforms existing MIL models in single-cell datasets, uncovering new associations and elucidating biological mechanisms across different domains.

**On the Statistical Efficiency of Mean-Field Reinforcement Learning with General Function Approximation**

Jiawei Huang · Batuhan Yardim · Niao He

In this paper, we study the fundamental statistical efficiency of Reinforcement Learning in Mean-Field Control (MFC) and Mean-Field Game (MFG) with general model-based function approximation. We introduce a new concept called Mean-Field Model-Based Eluder Dimension (MF-MBED), which characterizes the inherent complexity of mean-field model classes. We show that a rich family of Mean-Field RL problems exhibits low MF-MBED. Additionally, we propose algorithms based on maximal likelihood estimation, which can return an $\epsilon$-optimal policy for MFC or an $\epsilon$-Nash Equilibrium policy for MFG. The overall sample complexity depends only polynomially on MF-MBED, which is potentially much lower than the size of state-action space. Compared with previous works, our results only require the minimal assumptions including realizability and Lipschitz continuity.

**Learning to Rank for Optimal Treatment Allocation Under Resource Constraints**

Fahad Kamran · Maggie Makar · Jenna Wiens

Current causal inference approaches for estimating conditional average treatment effects (CATEs) often prioritize accuracy. However, in resource constrained settings, decision makers may only need a ranking of individuals based on their estimated CATE. In these scenarios, exact CATE estimation may be an unnecessarily challenging task, particularly when the underlying function is difficult to learn. In this work, we study the relationship between CATE estimation and optimizing for CATE ranking, demonstrating that optimizing for ranking may be more appropriate than optimizing for accuracy in certain settings. Guided by our analysis, we propose an approach to directly optimize for rankings of individuals to inform treatment assignment that aims to maximize benefit. Our tree-based approach maximizes the expected benefit of the treatment assignment using a novel splitting criteria. In an empirical case-study across synthetic datasets, our approach leads to better treatment assignments compared to CATE estimation methods as measured by expected total benefit. By providing a practical and efficient approach to learning a CATE ranking, this work offers an important step towards bridging the gap between CATE estimation techniques and their downstream applications.

**Cousins Of The Vendi Score: A Family Of Similarity-Based Diversity Metrics For Science And Machine Learning**

Amey Pasarkar · Adji Bousso Dieng

Measuring diversity accurately is important for many scientific fields, including machine learning (ML), ecology, and chemistry. The Vendi Score was introduced as a generic similarity-based diversity metric that extends the Hill number of order $q=1$ by leveraging ideas from quantum statistical mechanics. Contrary to many diversity metrics in ecology, the Vendi Score accounts for similarity and does not require knowledge of the prevalence of the categories in the collection to be evaluated for diversity. However, the Vendi Score treats each item in a given collection with a level of sensitivity proportional to the item's prevalence. This is undesirable in settings where there is a significant imbalance in item prevalence. In this paper, we extend the other Hill numbers using similarity to provide flexibility in allocating sensitivity to rare or common items. This leads to a family of diversity metrics--Vendi scores with different levels of sensitivity controlled by the order $q$--that can be used in a variety of applications. We study the properties of the scores in a synthetic controlled setting where the ground truth diversity is known. We then test the utility of the Vendi scores in improving molecular simulations via Vendi Sampling. Finally, we use the scores to better understand the behavior of image generative models in terms of memorization, duplication, diversity, and sample quality.

**Contextual Bandits with Budgeted Information Reveal**

Kyra Gan · Esmaeil Keyvanshokooh · Xueqing Liu · Susan Murphy

Contextual bandit algorithms are commonly used in digital health to recommend personalized treatments. However, to ensure the effectiveness of the treatments, patients are often requested to take actions that have no immediate benefit to them, which we refer to as pro-treatment actions. In practice, clinicians have a limited budget to encourage patients to take these actions and collect additional information. We introduce a novel optimization and learning algorithm to address this problem. This algorithm effectively combines the strengths of two algorithmic approaches in a seamless manner, including 1) an online primal-dual algorithm for deciding the optimal timing to reach out to patients, and 2) a contextual bandit learning algorithm to deliver personalized treatment to the patient. We prove that this algorithm admits a sub-linear regret bound. We illustrate the usefulness of this algorithm on both synthetic and real-world data.

**Holographic Global Convolutional Networks for Long-Range Prediction Tasks in Malware Detection**

Mohammad Mahmudul Alam · Edward Raff · Stella Biderman · Tim Oates · James Holt

Malware detection is an interesting and valuable domain to work in because it has significant real-world impact and unique machine-learning challenges. We investigate existing long-range techniques and benchmarks and find that they're not very suitable in this problem area. In this paper, we introduce Holographic Global Convolutional Networks (HGConv) that utilize the properties of Holographic Reduced Representations (HRR) to encode and decode features from sequence elements. Unlike other global convolutional methods, our method does not require any intricate kernel computation or crafted kernel design.HGConv kernels are defined as simple parameters learned through backpropagation.The proposed method has achieved new SOTA results on Microsoft Malware Classification Challenge, Drebin, and EMBER malware benchmarks. With log-linear complexity in sequence length, the empirical results demonstrate substantially faster run-time by HGConv compared to other methods achieving far more efficient scaling even with sequence length $\geq 100,000$.

**Deep Learning-Based Alternative Route Computation**

Alex Zhai · Dee Guo · Sreenivas Gollapudi · Kostas Kollias · Daniel Delling

Algorithms for the computation of alternative routes in road networks power many geographic navigation systems. A good set of alternative routes offers meaningful options to the user of the system and can support applications such as routing that is robust to failures (e.g., road closures, extreme traffic congestion, etc.) and routing with diverse preferences and objective functions. Algorithmic techniques for alternative route computation include the penalty method, via-node type algorithms (which deploy bidirectional search and finding plateaus), and, more recently, electrical-circuit based algorithms. In this work we focus on the practically important family of via-node type algorithms and aim to produce high quality alternative routes for road networks using a novel deep learning-based approach that learns a representation of the underlying road network.We show that this approach can support natural objectives, such as the uniformly bounded stretch, that are difficult and computationally expensive to support through traditional algorithmic techniques. Moreover, we achieve this in a practical system based on the Customizable Route Planning (CRP) hierarchical routing architecture. Our training methodology uses the hierarchical partition of the graph and trains a model to predict which boundary nodes in the partition should be crossed by the alternative routes. We describe our methods in detail and evaluate them against previously studied baselines, showing quality improvements in the road networks of Seattle, Paris, and Bangalore.

**Uncertainty-aware Continuous Implicit Neural Representations for Remote Sensing Object Counting**

Siyuan Xu · Yucheng Wang · Mingzhou Fan · Byung-Jun Yoon · Xiaoning Qian

Many existing object counting methods rely on density map estimation~(DME) of the discrete grid representation by decoding extracted image semantic features from designed convolutional neural networks~(CNNs). Relying on discrete density maps not only leads to information loss dependent on the original image resolution, but also has a scalability issue when analyzing high-resolution images with cubically increasing memory complexity. Furthermore, none of the existing methods can offer reliable uncertainty quantification~(UQ) for the derived count estimates. To overcome these limitations, we design UNcertainty-aware, hypernetwork-based Implicit neural representations for Counting~(UNIC) to assign probabilities and the corresponding counting confidence over continuous spatial coordinates. We derive a sampling-based Bayesian counting loss function and develop the corresponding model training algorithm. UNIC outperforms existing methods on the Remote Sensing Object Counting~(RSOC) dataset with reliable UQ and improved interpretability of the derived count estimates. Our code is available at https://github.com/SiyuanXu-tamu/UNIC.

**Towards a Complete Benchmark on Video Moment Localization**

Jinyeong Chae · Donghwa Kim · Kwanseok Kim · Doyeon Lee · Sangho Lee · Seongsu Ha · Jonghwan Mun · Wooyoung Kang · Byungseok Roh · Joonseok Lee

In this paper, we propose and conduct a comprehensive benchmark on moment localization task, which aims to retrieve a segment that corresponds to a text query from a single untrimmed video. Our study starts from an observation that most moment localization papers report experimental results only on a few datasets in spite of availability of far more benchmarks. Thus, we conduct an extensive benchmark study to measure the performance of representative methods on widely used 7 datasets. Looking further into the details, we pose additional research questions and empirically verify them, including if they rely on unintended biases introduced by specific training data, if advanced visual features trained on classification task transfer well to this task, and if computational cost of each model pays off. With a series of these experiments, we provide multi-faceted evaluation of state-of-the-art moment localization models. Codes are available at \url{https://github.com/snuviplab/MoLEF}.

**SADI: Similarity-Aware Diffusion Model-Based Imputation for Incomplete Temporal EHR Data**

Zongyu Dai · Emily Getzen · Qi Long

Missing values are prevalent in temporal electronic health records (EHR) data and are known to complicate data analysis and lead to biased results. The current state-of-the-art (SOTA) models for imputing missing values in EHR primarily leverage correlations across time points and across features, which perform well when data have strong correlation across time points, such as in ICU data where high-frequency time series data are collected. However, this is often insufficient for temporal EHR data from non-ICU settings (e.g., outpatient visits for primary care or specialty care), where data are collected at substantially sparser time points, resulting in much weaker correlation across time points. To address this methodological gap, we propose the Similarity-Aware Diffusion Model-Based Imputation (SADI), a novel imputation method that leverages the diffusion model and utilizes information across dependent variables. We apply SADI to impute incomplete temporal EHR data and propose a similarity-aware denoising function, which includes a self-attention mechanism to model the correlations between time points, features, and similar patients. To the best of our knowledge, this is the first time that the information of similar patients is directly used to construct imputation for incomplete temporal EHR data. Our extensive experiments on two datasets, the Critical Path For Alzheimer’s Disease (CPAD) data and the PhysioNet Challenge 2012 data, show that SADI outperforms the current SOTA under various missing data mechanisms, including missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).

**The Effective Number of Shared Dimensions Between Paired Datasets**

Hamza Giaffar · Camille Rullán Buxó · Mikio Aoi

A number of recent studies have sought to understand the behavior of both artificial and biological neural networks by comparing representations across layers, networks and brain areas. Increasingly prevalent, too, are comparisons across modalities of data, such as neural network activations and training data or behavioral data and neurophysiological recordings. One approach to such comparisons involves measuring the dimensionality of the space shared between the paired data matrices, where dimensionality serves as a proxy for computational or representational complexity. Established approaches, including CCA, can be used to measure the number of shared embedding dimensions, however they do not account for potentially unequal variance along shared dimensions and so cannot measure effective shared dimensionality. We present a candidate measure for shared dimensionality that we call the effective number of shared dimensions (ENSD). The ENSD is an interpretable and computationally efficient model-free measure of shared dimensionality that can be used to probe shared structure in a wide variety of data types. We demonstrate the relative robustness of the ENSD in cases where data is sparse or low rank and illustrate how the ENSD can be applied in a variety of analyses of representational similarities across layers in convolutional neural networks and between brain regions.

**Auditing Fairness under Unobserved Confounding**

Yewon Byun · Dylan Sam · Michael Oberst · Zachary Lipton · Bryan Wilder

A fundamental problem in decision-making systems is the presence of inequity along demographic lines. However, inequity can be difficult to quantify, particularly if our notion of equity relies on hard-to-measure notions like risk (e.g., equal access to treatment for those who would die without it). Auditing such inequity requires accurate measurements of individual risk, which is difficult to estimate in the realistic setting of unobserved confounding. In the case that these unobservables “explain” an apparent disparity, we may understate or overstate inequity. In this paper, we show that one can still give informative bounds on allocation rates among high-risk individuals, even while relaxing or (surprisingly) even when eliminating the assumption that all relevant risk factors are observed. We utilize the fact that in many real-world settings (e.g., the introduction of a novel treatment) we have data from a period prior to any allocation, to derive unbiased estimates of risk. We apply our framework to a real-world setting of Paxlovid allocation to COVID-19 patients, finding that observed racial inequity cannot be explained by unobserved confounders of the same strength as important observed covariates.

**Sample Efficient Learning of Factored Embeddings of Tensor Fields**

Taemin Heo · Chandrajit Bajaj

Data tensors of orders 2 and greater are now routinely being generated. These data collections are increasingly huge and growing. Many scientific and medical data tensors are tensor fields (e.g., images, videos, geographic data) in which the spatial neighborhood contains important information. Directly accessing such large data tensor collections for information has become increasingly prohibitive. We learn approximate full-rank and compact tensor sketches with decompositive representations providing compact space, time and spectral embeddings of tensor fields. All information querying and post-processing on the original tensor field can now be achieved more efficiently and with customizable accuracy as they are performed on these compact factored sketches in latent generative space. We produce optimal rank-r sketchy Tucker decomposition of arbitrary order data tensors by building compact factor matrices from a sample-efficient sub-sampling of tensor slices. Our sample efficient policy is learned via an adaptable stochastic Thompson sampling using Dirichlet distributions with conjugate priors.

**Towards Generalizable and Interpretable Motion Prediction: A Deep Variational Bayes Approach**

Juanwu Lu · Wei Zhan · Masayoshi TOMIZUKA · Yeping Hu

Estimating the potential behavior of the surrounding human-driven vehicles is crucial for the safety of autonomous vehicles in a mixed traffic flow. Recent state-of-the-art achieved accurate prediction using deep neural networks. However, these end-to-end models are usually black boxes with weak interpretability and generalizability. This paper proposes the Goal-based Neural Variational Agent (GNeVA), an interpretable generative model for motion prediction with robust generalizability to out-of-distribution cases. For interpretability, the model achieves target-driven motion prediction by estimating the spatial distribution of long-term destinations with a variational mixture of Gaussians. We identify a causal structure among maps and agents' histories and derive a variational posterior to enhance generalizability. Experiments on motion prediction datasets validate that the fitted model can be interpretable and generalizable and can achieve comparable performance to state-of-the-art results.

**Density Uncertainty Layers for Reliable Uncertainty Estimation**

Yookoon Park · David Blei

Assessing the predictive uncertainty of deep neural networks is crucial for safety-related applications of deep learning. Although Bayesian deep learning offers a principled framework for estimating model uncertainty, the common approaches that approximate the parameter posterior often fail to deliver reliable estimates of predictive uncertainty. In this paper, we propose a novel criterion for reliable predictive uncertainty: a model's predictive variance should be grounded in the empirical density of the input. That is, the model should produce higher uncertainty for inputs that are improbable in the training data and lower uncertainty for inputs that are more probable. To operationalize this criterion, we develop the density uncertainty layer, a stochastic neural network architecture that satisfies the density uncertain criterion by design. We study density uncertainty layers on the UCI and CIFAR-10/100 uncertainty benchmarks. Compared to existing approaches, density uncertainty layers provide more reliable uncertainty estimates and robust out-of-distribution detection performance.

**Linear Convergence of Black-Box Variational Inference: Should We Stick the Landing?**

Kyurae Kim · Yian Ma · Jacob Gardner

We prove that black-box variational inference (BBVI) with control variates, particularly the sticking-the-landing (STL) estimator, converges at a geometric (traditionally called “linear”) rate under perfect variational family specification. In particular, we prove a quadratic bound on the gradient variance of the STL estimator, one which encompassesmisspecified variational families. Combined with previous works on the quadratic variance condition, this directly implies convergence of BBVI with the use of projectedstochastic gradient descent. For the projection operator, we consider a domain with triangular scale matrices, which the projection onto is computable in $\theta(d)$ time, where $d$ is the dimensionality of the target posterior. We also improve existing analysis on the regular closed-form entropy gradient estimators, which enables comparison against the STLestimator, providing explicit non-asymptotic complexity guarantees for both.

**Neural McKean Vlasov Processes: Distributional Dependence in Diffusion Models**

Haoming Yang · Ali Hasan · Yuting Ng · Vahid Tarokh

McKean-Vlasov stochastic differential equations (MV-SDEs) provide a mathematical description of the behavior of an infinite number of interacting particles by imposing a dependence on the particle density.These processes differ from standard It\^o-SDEs to the extent that MV-SDEs include distributional information in their individual particle parameterization.As such, we study the influence of explicitly including distributional information in the parameterization of the SDE.We first propose a series of semi-parametric methods for representing MV-SDEs, and then propose corresponding estimators for inferring parameters from data based on the underlying properties of the MV-SDE.By analyzing the properties of the different architectures and estimators, we consider their relationship to standard It\^o-SDEs and consider their applicability in relevant machine learning problems.We empirically compare the performance of the different architectures on a series of real and synthetic datasets for time series and probabilistic modeling.The results suggest that including the distributional dependence in MV-SDEs is an effective modeling framework for temporal data under an exchangeability assumption while maintaining strong performance for standard It\^o-SDE problems due to the richer class of probability flows associated with MV-SDEs.

We consider the problem of jointly learning row-wise and column-wise dependencies of matrix-variate observations, which are modelled separately by two precision matrices. Due to the complicated structure of Kronecker-product precision matrices in the commonly used matrix-variate Gaussian graphical models, a sparser Kronecker-sum structure was proposed recently based on the Cartesian product of graphs. However, existing methods for estimating Kronecker-sum structured precision matrices do not scale well to large scale datasets. In this paper, we introduce DNNLasso, a diagonally non-negative graphical lasso model for estimating the Kronecker-sum structured precision matrix, which outperforms the state-of-the-art methods by a large margin in both accuracy and computational time.

**Error bounds for any regression model using Gaussian processes with gradient information**

Rafael Savvides · Hoang Phuc Hau Luu · Kai Puolamäki

We provide an upper bound for the expected quadratic loss on new data for any regression model. We derive the bound by modelling the underlying function by a Gaussian process (GP). Instead of a single kernel or family of kernels of the same form, we consider all GPs with translation-invariant and continuously twice differentiable kernels having a bounded signal variance and prior covariance of the gradient. To obtain a bound for the expected posterior loss, we present bounds for the posterior variance and squared bias. The squared bias bound depends on the regression model used, which can be arbitrary and not based on GPs. The bounds scale well with data size, in contrast to computing the GP posterior by a Cholesky factorisation of a large matrix. More importantly, our bounds do not require strong prior knowledge as we do not specify the exact kernel form. We validate our theoretical findings by numerical experiments and show that the bounds have applications in uncertainty estimation and concept drift detection.

**Learning Granger Causality from Instance-wise Self-attentive Hawkes Processes**

Dongxia Wu · Tsuyoshi Ide · Georgios Kollias · Jiri Navratil · Aurelie Lozano · Naoki Abe · Yian Ma · Rose Yu

We address the problem of learning Granger causality from asynchronous, interdependent, multi-type event sequences. In particular, we are interested in discovering instance-level causal structures in an unsupervised manner. Instance-level causality identifies causal relationships among individual events, providing more fine-grained information for decision-making. Existing work in the literature either requires strong assumptions, such as linearity in the intensity function, or heuristically defined model parameters that do not necessarily meet the requirements of Granger causality. We propose Instance-wise Self-Attentive Hawkes Processes (ISAHP), a novel deep learning framework that can directly infer the Granger causality at the event instance level. ISAHP is the first neural point process model that meets the requirements of Granger causality. It leverages the self-attention mechanism of the transformer to align with the principles of Granger causality. We empirically demonstrate that ISAHP is capable of discovering complex instance-level causal structures that cannot be handled by classical models. We also show that ISAHP achieves state-of-the-art performance in proxy tasks involving type-level causal discovery and instance-level event type prediction.

**Looping in the Human: Collaborative and Explainable Bayesian Optimization**

Masaki Adachi · Brady Planden · David Howey · Michael A. Osborne · Sebastian Orbell · Natalia Ares · Krikamol Muandet · Siu Lun Chau

Like many optimizers, Bayesian optimization often falls short of gaining user trust due to opacity. While attempts have been made to develop human-centric optimizers, they typically assume user knowledge is well-specified and error-free, employing users mainly as supervisors of the optimization process. We relax these assumptions and propose a more balanced human-AI partnership with our Collaborative and Explainable Bayesian Optimization (CoExBO) framework. Instead of explicitly requiring a user to provide a knowledge model, CoExBO employs preference learning to seamlessly integrate human insights into the optimization, resulting in algorithmic suggestions that resonate with user preference. CoExBO explains its candidate selection every iteration to foster trust, empowering users with a clearer grasp of the optimization. Furthermore, CoExBO offers a no-harm guarantee, allowing users to make mistakes; even with extreme adversarial interventions, the algorithm converges asymptotically to a vanilla Bayesian optimization. We validate CoExBO's efficacy through human-AI teaming experiments in lithium-ion battery design, highlighting substantial improvements over conventional methods. Code is available https://github.com/ma921/CoExBO.

**VEC-SBM: Optimal Community Detection with Vectorial Edges Covariates**

Guillaume Braun · Masashi Sugiyama

Social networks are often associated with rich side information, such as texts and images. While numerous methods have been developed to identify communities from pairwise interactions, they usually ignore such side information. In this work, we study an extension of the Stochastic Block Model (SBM), a widely used statistical framework for community detection, that integrates vectorial edges covariates: the Vectorial Edges Covariates Stochastic Block Model (VEC-SBM). We propose a novel algorithm based on iterative refinement techniques and show that it optimally recovers the latent communities under the VEC-SBM. Furthermore, we rigorously assess the added value of leveraging edge's side information in the community detection process. We complement our theoretical results with numerical experiments on synthetic and semi-synthetic data.

**Theory-guided Message Passing Neural Network for Probabilistic Inference**

Zijun Cui · Hanjing Wang · Tian Gao · Kartik Talamadupula · Qiang Ji

Probabilistic inference can be tackled by minimizing a variational free energy through message passing. To improve performance, neural networks are adopted for message computation. Neural message learning is heuristic and requires strong guidance to perform well. In this work, we propose a {\em theory-guided message passing neural network} (TMPNN) for probabilistic inference. Inspired by existing work, we consider a generalized Bethe free energy which allows for a learnable variational assumption. Instead of using a black-box neural network for message computation, we utilize a general message equation and introduce a symbolic message function with semantically meaningful parameters. The analytically derived symbolic message function is seamlessly integrated into the MPNN framework, giving rise to the proposed TMPNN. TMPNN is trained using algorithmic supervision without requiring exact inference results. Leveraging the theory-guided symbolic function, TMPNN offers strengthened theoretical guarantees compared to conventional heuristic neural models. It presents a novel contribution by demonstrating its applicability to both MAP and marginal inference tasks, outperforming SOTAs in both cases. Furthermore, TMPNN provides improved generalizability across various graph structures and exhibits enhanced data efficiency.

Traditional approaches to variational inference rely on parametric families of variational distributions, with the choice of family playing a critical role in determining the accuracy of the resulting posterior approximation. Simple mean-field families often lead to poor approximations, while rich families of distributions like normalizing flows can be difficult to optimize and usually do not incorporate the known structure of the target distribution due to their black-box nature. To expand the space of flexible variational families, we revisit Variational Rejection Sampling (VRS) [Grover et al., 2018], which combines a parametric proposal distribution with rejection sampling to define a rich non-parametric family of distributions that explicitly utilizes the known target distribution. By introducing a low-variance reparameterized gradient estimator for the parameters of the proposal distribution, we make VRS an attractive inference strategy for models with continuous latent variables. We argue theoretically and demonstrate empirically that the resulting method--Reparameterized Variational Rejection Sampling (RVRS)--offers an attractive trade-off between computational cost and inference fidelity. In experiments we show that our method performs well in practice and that it is well suited for black-box inference, especially for models with local latent variables.

**Riemannian Laplace Approximation with the Fisher Metric**

Hanlin Yu · Marcelo Hartmann · Bernardo Williams Moreno Sanchez · Mark Girolami · Arto Klami

Laplace's method approximates a target density with a Gaussian distribution at its mode. It is computationally efficient and asymptotically exact for Bayesian inference due to the Bernstein-von Mises theorem, but for complex targets and finite-data posteriors it is often too crude an approximation. A recent generalization of the Laplace Approximation transforms the Gaussian approximation according to a chosen Riemannian geometry providing a richer approximation family, while still retaining computational efficiency. However, as shown here, its properties depend heavily on the chosen metric, indeed the metric adopted in previous work results in approximations that are overly narrow as well as being biased even at the limit of infinite data. We correct this shortcoming by developing the approximation family further, deriving two alternative variants that are exact at the limit of infinite data, extending the theoretical analysis of the method, and demonstrating practical improvements in a range of experiments.

**Beyond Bayesian Model Averaging over Paths in Probabilistic Programs with Stochastic Support**

Tim Reichelt · Luke Ong · Tom Rainforth

The posterior in probabilistic programs with stochastic support decomposes as a weighted sum of the local posterior distributions associated with each possible program path.We show that making predictions with this full posterior implicitly performs a Bayesian model averaging (BMA) over paths.This is potentially problematic, as BMA weights can be unstable due to model misspecification or inference approximations, leading to sub-optimal predictions in turn.To remedy this issue, we propose alternative mechanisms for path weighting: one based on stacking and one based on ideas from PAC-Bayes.We show how both can be implemented as a cheap post-processing step on top of existing inference engines.In our experiments, we find them to be more robust and lead to better predictions compared to the default BMA weights.

**Categorical Generative Model Evaluation via Synthetic Distribution Coarsening**

Florence Regol · Mark Coates

As we expect to see a rapid integration of generative models in our day to day lives, the development of rigorous methods of evaluation and analysis for generative models has never been more pressing.Multiple works have highlighted the shortcomings of widely used metrics and exposed how they fail to behave as expected in some settings. So far, the response has been to use a variety of metrics that target different desirable and interpretable properties such as fidelity, diversity, and authenticity, to obtain a clearer picture of a generative model’s capabilities. These methods mainly focus on ordinal data and they all suffer from the same unavoidable issues stemming from estimating quantities of high-dimensional data from a limited number of samples.We propose to take an alternative approach and to return to the synthetic data setting where the ground truth is explicit and known. We focus on nominal categorical data and introduce an evaluation method that can scale to the high-dimensional settings often encountered in practice. Our method involves successively binning the large space to obtain smaller probability spaces and coarser distributions where meaningful statistical estimates can be obtained. This allows us to provide probabilistic guarantees and sample complexities and we illustrate how our method can be applied to distinguish between the capabilities of several state-of-the-art categorical models.

**Data-Adaptive Probabilistic Likelihood Approximation for Ordinary Differential Equations**

Mohan Wu · Martin Lysy

Estimating the parameters of ordinary differential equations (ODEs) is of fundamental importance in many scientific applications. While ODEs are typically approximated with deterministic algorithms, new research on probabilistic solvers indicates that they produce more reliable parameter estimates by better accounting for numerical errors. However, many ODE systems are highly sensitive to their parameter values. This produces deep local maxima in the likelihood function -- a problem which existing probabilistic solvers have yet to resolve. Here we present a novel probabilistic ODE likelihood approximation, DALTON, which can dramatically reduce parameter sensitivity by learning from noisy ODE measurements in a data-adaptive manner. Our approximation scales linearly in both ODE variables and time discretization points, and is applicable to ODEs with both partially-unobserved components and non-Gaussian measurement models. Several examples demonstrate that DALTON produces more accurate parameter estimates via numerical optimization than existing probabilistic ODE solvers, and even in some cases than the exact ODE likelihood itself.

**Intrinsic Gaussian Vector Fields on Manifolds**

Daniel Robert-Nicoud · Andreas Krause · Slava Borovitskiy

Various applications ranging from robotics to climate science require modeling signals on non-Euclidean domains, such as the sphere. Gaussian process models on manifolds have recently been proposed for such tasks, in particular when uncertainty quantification is needed. In the manifold setting, vector-valued signals can behave very differently from scalar-valued ones, with much of the progress so far focused on modeling the latter. The former, however, are crucial for many applications, such as modeling wind speeds or force fields of unknown dynamical systems. In this paper, we propose novel Gaussian process models for vector-valued signals on manifolds that are intrinsically defined and account for the geometry of the space in consideration. We provide computational primitives needed to deploy the resulting Hodge-Matérn Gaussian vector fields on the two-dimensional sphere and the hypertori. Further, we highlight two generalization directions: discrete two-dimensional meshes and "ideal" manifolds like hyperspheres, Lie groups, and homogeneous spaces. Finally, we show that our Gaussian vector fields constitute considerably more refined inductive biases than the extrinsic fields proposed before.

**A Unifying Variational Framework for Gaussian Process Motion Planning**

Lucas Cosier · Rares Iordan · Sicelukwanda Zwane · Giovanni Franzese · James Wilson · Marc Deisenroth · Alexander Terenin · Yasemin Bekiroglu

To control how a robot moves, motion planning algorithms must compute paths in high-dimensional state spaces while accounting for physical constraints related to motors and joints, generating smooth and stable motions, avoiding obstacles, and preventing collisions. A motion planning algorithm must therefore balance competing demands, and should ideally incorporate uncertainty to handle noise, model errors, and facilitate deployment in complex environments. To address these issues, we introduce a framework for robot motion planning based on variational Gaussian processes, which unifies and generalizes various probabilistic-inference-based motion planning algorithms, and connects them with optimization-based planners. Our framework provides a principled and flexible way to incorporate equality-based, inequality-based, and soft motion-planning constraints during end-to-end training, is straightforward to implement, and provides both interval-based and Monte-Carlo-based uncertainty estimates. We conduct experiments using different environments and robots, comparing against baseline approaches based on the feasibility of the planned paths, and obstacle avoidance quality. Results show that our proposed approach yields a good balance between success rates and path quality.

**Efficiently Computable Safety Bounds for Gaussian Processes in Active Learning**

Jörn Tebbe · Christoph Zimmer · Ansgar Steland · Markus Lange-Hegermann · Fabian Mies

Active learning of physical systems must commonly respect practical safety constraints, which restricts the exploration of the design space. Gaussian Processes (GPs) and their calibrated uncertainty estimations are widely used for this purpose. In many technical applications the design space is explored via continuous trajectories, along which the safety needs to be assessed. This is particularly challenging for strict safety requirements in GP methods, as it employs computationally expensive Monte Carlo sampling of high quantiles.We address these challenges by providing provable safety bounds based on the adaptively sampled median of the supremum of the posterior GP. Our method significantly reduces the number of samples required for estimating high safety probabilities, resulting in faster evaluation without sacrificing accuracy and exploration speed. The effectiveness of our safe active learning approach is demonstrated through extensive simulations and validated using a real-world engine example.

**Subsampling Error in Stochastic Gradient Langevin Diffusions**

Kexin Jin · Chenguang Liu · Jonas Latz

The Stochastic Gradient Langevin Dynamics (SGLD) are popularly used to approximate Bayesian posterior distributions in statistical learning procedures with large-scale data. As opposed to many usual Markov chain Monte Carlo (MCMC) algorithms, SGLD is not stationary with respect to the posterior distribution; two sources of error appear: The first error is introduced by an Euler--Maruyama discretisation of a Langevin diffusion process, the second error comes from the data subsampling that enables its use in large-scale data settings. In this work, we consider an idealised version of SGLD to analyse the method's pure subsampling error that we then see as a best-case error for diffusion-based subsampling MCMC methods. Indeed, we introduce and study the Stochastic Gradient Langevin Diffusion (SGLDiff), a continuous-time Markov process that follows the Langevin diffusion corresponding to a data subset and switches this data subset after exponential waiting times. There, we show the exponential ergodicity of SLGDiff and that the Wasserstein distance between the posterior and the limiting distribution of SGLDiff is bounded above by a fractional power of the mean waiting time. We bring our results into context with other analyses of SGLD.

**Robust Offline Reinforcement Learning with Heavy-Tailed Rewards**

Jin Zhu · Runzhe Wan · Zhengling Qi · Shikai Luo · Chengchun Shi

This paper endeavors to augment the robustness of offline reinforcement learning (RL) in scenarios laden with heavy-tailed rewards, a prevalent circumstance in real-world applications. We propose two algorithmic frameworks, ROAM and ROOM, for robust off-policy evaluation and offline policy optimization (OPO), respectively. Central to our frameworks is the strategic incorporation of the median-of-means method with offline RL, enabling straightforward uncertainty estimation for the value function estimator. This not only adheres to the principle of pessimism in OPO but also adeptly manages heavy-tailed rewards. Theoretical results and extensive experiments demonstrate that our two frameworks outperform existing methods on the logged dataset exhibits heavy-tailed reward distributions. The implementation of the proposal is available at \url{https://github.com/Mamba413/ROOM}.

Quantization-aware training (QAT) and Knowledge Distillation (KD) are combined to achieve competitive performance in creating low-bit deep learning models. However, existing works applying KD to QAT require tedious hyper-parameter tuning to balance the weights of different loss terms, assume the availability of labeled training data, and require complex, computationally intensive training procedures for good performance. To address these limitations, this paper proposes a novel Self-Supervised Quantization-Aware Knowledge Distillation (SQAKD) framework. SQAKD first unifies the forward and backward dynamics of various quantization functions, making it flexible for incorporating various QAT works. Then it formulates QAT as a co-optimization problem that simultaneously minimizes the KL-Loss between the full-precision and low-bit models for KD and the discretization error for quantization, without supervision from labels. A comprehensive evaluation shows that SQAKD substantially outperforms the state-of-the-art QAT and KD works for a variety of model architectures. Our code is at: https://github.com/kaiqi123/SQAKD.git.

**Joint control variate for faster black-box variational inference**

Xi Wang · · Justin Domke

Black-box variational inference performance is sometimes hindered by the use of gradient estimators with high variance. This variance comes from two sources of randomness: Data subsampling and Monte Carlo sampling. While existing control variates only address Monte Carlo noise, and incremental gradient methods typically only address data subsampling, we propose a new "joint" control variate that jointly reduces variance from both sources of noise. This significantly reduces gradient variance, leading to faster optimization in several applications.

Empirical studies have demonstrated the effectiveness of (score-based) diffusion models in generating high-dimensional data, such as texts and images, which typically exhibit a low-dimensional manifold nature. These empirical successes raise the theoretical question of whether score-based diffusion models can optimally adapt to low-dimensional manifold structures. While recent work has validated the minimax optimality of diffusion models when the target distribution admits a smooth density with respect to the Lebesgue measure of the ambient data space, these findings do not fully account for the ability of diffusion models in avoiding the the curse of dimensionality when estimating high-dimensional distributions. This work considers two common classes of diffusion models: Langevin diffusion and forward-backward diffusion. We show that both models can adapt to the intrinsic manifold structure by showing that the convergence rate of the inducing distribution estimator depends only on the intrinsic dimension of the data. Moreover, our considered estimator does not require knowing or explicitly estimating the manifold. We also demonstrate that the forward-backward diffusion can achieve the minimax optimal rate under the Wasserstein metric when the target distribution possesses a smooth density with respect to the volume measure of the low-dimensional manifold.

**Neural Additive Models for Location Scale and Shape: A Framework for Interpretable Neural Regression Beyond the Mean**

Anton Thielmann · René-Marcel Kruse · Thomas Kneib · Benjamin Säfken

Deep neural networks (DNNs) have proven to be highly effective in a variety of tasks, making them the go-to method for problems requiring high-level predictive power. Despite this success, the inner workings of DNNs are often not transparent, making them difficult to interpret or understand. This lack of interpretability has led to increased research on inherently interpretable neural networks in recent years. Models such as Neural Additive Models (NAMs) achieve visual interpretability through the combination of classical statistical methods with DNNs.However, these approaches only concentrate on mean response predictions, leaving out other properties of the response distribution of the underlying data.We propose Neural Additive Models for Location Scale and Shape (NAMLSS), a modelling framework that combines the predictive power of classical deep learning models with the inherent advantages of distributional regression while maintaining the interpretability of additive models. Thecode is available at the following link: \url{https://github.com/AnFreTh/NAMpy}

**Diagonalisation SGD: Fast \& Convergent SGD for Non-Differentiable Models via Reparameterisation and Smoothing**

Dominik Wagner · Basim Khajwal · Luke Ong

It is well-known that the reparameterisation gradient estimator, which exhibits low variance in practice, is biased for non-differentiable models. This may compromise correctness of gradient-based optimisation methods such as stochastic gradient descent (SGD). We introduce a simple syntactic framework to define non-differentiable functions piecewisely and present a systematic approach to obtain smoothings for which the reparameterisation gradient estimator is unbiased. Our main contribution is a novel variant of SGD, Diagonalisation Stochastic Gradient Descent, which progressively enhances the accuracy of the smoothed approximation during optimisation, and we prove convergence to stationary points of the unsmoothed (original) objective. Our empirical evaluation reveals benefits over the state of the art: our approach is simple, fast, stable and attains orders of magnitude reduction in work-normalised variance.

**Tuning-Free Maximum Likelihood Training of Latent Variable Models via Coin Betting**

Louis Sharrock · Daniel Dodd · Christopher Nemeth

We introduce two new particle-based algorithms for learning latent variable models via marginal maximum likelihood estimation, including one which is entirely tuning-free. Our methods are based on the perspective of marginal maximum likelihood estimation as an optimization problem: namely, as the minimization of a free energy functional. One way to solve this problem is via the discretization of a gradient flow associated with the free energy. We study one such approach, which resembles an extension of Stein variational gradient descent, establishing a descent lemma which guarantees that the free energy decreases at each iteration. This method, and any other obtained as the discretization of the gradient flow, necessarily depends on a learning rate which must be carefully tuned by the practitioner in order to ensure convergence at a suitable rate. With this in mind, we also propose another algorithm for optimizing the free energy which is entirely learning rate free, based on coin betting techniques from convex optimization. We validate the performance of our algorithms across several numerical experiments, including several high-dimensional settings. Our results are competitive with existing particle-based methods, without the need for any hyperparameter tuning.

**Bayesian Semi-structured Subspace Inference**

Daniel Dold · David Rügamer · Beate Sick · Oliver Dürr

Semi-structured regression models enable the joint modeling of interpretable structured and complex unstructured feature effects. The structured model part is inspired by statistical models and can be used to infer the input-output relationship for features of particular importance. The complex unstructured part defines an arbitrary deep neural network and thereby provides enough flexibility to achieve competitive prediction performance. While these models can also account for aleatoric uncertainty, there is still a lack of work on accounting for epistemic uncertainty. In this paper, we address this problem by presenting a Bayesian approximation for semi-structured regression models using subspace inference. To this end, we extend subspace inference for joint posterior sampling from a full parameter space for structured effects and a subspace for unstructured effects. Apart from this hybrid sampling scheme, our method allows for tunable complexity of the subspace and can capture multiple minima in the loss landscape. Numerical experiments validate our approach's efficacy in recovering structured effect parameter posteriors in semi-structured models and approaching the full-space posterior distribution of MCMC for increasing subspace dimension. Further, our approach exhibits competitive predictive performance across simulated and real-world datasets.

**Structural perspective on constraint-based learning of Markov networks**

Tuukka Korhonen · Fedor Fomin · Pekka Parviainen

Markov networks are probabilistic graphical models that employ undirected graphs to depict conditional independence relationships among variables. Our focus lies in constraint-based structure learning, which entails learning the undirected graph from data through the execution of conditional independence tests. We establish theoretical limits concerning two critical aspects of constraint-based learning of Markov networks: the number of tests and the sizes of the conditioning sets. These bounds uncover an exciting interplay between the structural properties of the graph and the amount of tests required to learn a Markov network. The starting point of our work is that the graph parameter maximum pairwise connectivity, $\kappa$, that is, the maximum number of vertex-disjoint paths connecting a pair of vertices in the graph, is responsible for the sizes of independence tests required to learn the graph. On one hand, we show that at least one test with the size of the conditioning set at least $\kappa$ is always necessary. On the other hand, we prove that any graph can be learned by performing tests of size at most $\kappa$. This completely resolves the question of the minimum size of conditioning sets required to learn the graph. When it comes to the number of tests, our upper bound on the sizes of conditioning sets implies that every $n$-vertex graph can be learned by at most $n^{\kappa}$ tests with conditioning sets of sizes at most $\kappa$. We show that for any upper bound q on the sizes of the conditioning sets, there exist graphs with $O(nq)$ vertices that require at least $n^{\Omega(\kappa)}$ tests to learn. This lower bound holds even when the treewidth and the maximum degree of the graph are at most $\kappa+2$. On the positive side, we prove that every graph of bounded treewidth can be learned by a polynomial number of tests with conditioning sets of sizes at most $2*\kappa$.

**Variational Gaussian Process Diffusion Processes**

Prakhar Verma · Vincent Adam · Arno Solin

Diffusion processes are a class of stochastic differential equations (SDEs) providing a rich family of expressive models that arise naturally in dynamic modelling tasks. Probabilistic inference and learning under generative models with latent processes endowed with a non-linear diffusion process prior are intractable problems. We build upon work within variational inference, approximating the posterior process as a linear diffusion process, and point out pathologies in the approach. We propose an alternative parameterization of the Gaussian variational process using a site-based exponential family description. This allows us to trade a slow inference algorithm with fixed-point iterations for a fast algorithm for convex optimization akin to natural gradient descent, which also provides a better objective for learning model parameters.

**Scalable Meta-Learning with Gaussian Processes**

Petru Tighineanu · Lukas Grossberger · Paul Baireuther · Kathrin Skubch · Stefan Falkner · Julia Vinogradska · Felix Berkenkamp

Meta-learning is a powerful approach that exploits historical data to quickly solve new tasks from the same distribution. In the low-data regime, methods based on the closed-form posterior of Gaussian processes (GP) together with Bayesian optimization have achieved high performance. However, these methods are either computationally expensive or introduce assumptions that hinder a principled propagation of uncertainty between task models. This may disrupt the balance between exploration and exploitation during optimization. In this paper, we develop ScaML-GP, a modular GP model for meta-learning that is scalable in the number of tasks. Our core contribution is carefully designed multi-task kernel that enables hierarchical training and task scalability. Conditioning ScaML-GP on the meta-data exposes its modular nature yielding a test-task prior that combines the posteriors of meta-task GPs. In synthetic and real-world meta-learning experiments, we demonstrate that ScaML-GP can learn efficiently both with few and many meta-tasks.

**Prior-dependent analysis of posterior sampling reinforcement learning with function approximation**

Yingru Li · Zhiquan Luo

This work advances randomized exploration in reinforcement learning (RL) with function approximation modeled by linear mixture MDPs. We establish the first prior-dependent Bayesian regret bound for RL with function approximation; and refine the Bayesian regret analysis for posterior sampling reinforcement learning (PSRL), presenting an upper bound of $\tilde{\mathcal{O}}(d\sqrt{H^3 T \log T})$, where $d$ represents the dimensionality of the transition kernel, $H$ the planning horizon, and $T$ the total number of interactions. This signifies a methodological enhancement by optimizing the $\mathcal{O}(\sqrt{\log T})$ factor over the previous benchmark (Osband and Van Roy, 2014) specified to linear mixture MDPs. Our approach, leveraging a value-targeted model learning perspective, introduces a decoupling argument and a variance reduction technique, moving beyond traditional analyses reliant on confidence sets and concentration inequalities to formalize Bayesian regret bounds more effectively.

**Training Implicit Generative Models via an Invariant Statistical Loss**

José Manuel de Frutos · Pablo Olmos · Manuel Alberto Vazquez Lopez · Joaquín Míguez

Implicit generative models have the capability to learn arbitrary complex data distributions. On the downside, training requires telling apart real data from artificially-generated ones using adversarial discriminators, leading to unstable training and mode-dropping issues. As reported by Zahee et al. (2017), even in the one-dimensional (1D) case, training a generative adversarial network (GAN) is challenging and often suboptimal. In this work, we develop a discriminator-free method for training one-dimensional (1D) generative implicit models and subsequently expand this method to accommodate multivariate cases. Our loss function is a discrepancy measure between a suitably chosen transformation of the model samples and a uniform distribution; hence, it is invariant with respect to the true distribution of the data. We first formulate our method for 1D random variables, providing an effective solution for approximate reparameterization of arbitrary complex distributions. Then, we consider the temporal setting (both univariate and multivariate), in which we model the conditional distribution of each sample given the history of the process. We demonstrate through numerical simulations that this new method yields promising results, successfully learning true distributions in a variety of scenarios and mitigating some of the well-known problems that state-of-the-art implicit methods present.

**GmGM: a fast multi-axis Gaussian graphical model**

Bailey Andrew · David Westhead · Luisa Cutillo

This paper introduces the Gaussian multi-Graphical Model, a model to construct sparse graph representations of matrix- and tensor-variate data. We generalize prior work in this area by simultaneously learning this representation across several tensors that share axes, which is necessary to allow the analysis of multimodal datasets such as those encountered in multi-omics. Our algorithm uses only a single eigendecomposition per axis, achieving an order of magnitude speedup over prior work in the ungeneralized case. This allows the use of our methodology on large multi-modal datasets such as single-cell multi-omics data, which was challenging with previous approaches. We validate our model on synthetic data and five real-world datasets.

**Learning Sparse Codes with Entropy-Based ELBOs**

Dmytro Velychko · Simon Damm · Asja Fischer · Jörg Lücke

Standard probabilistic sparse coding assumes a Laplace prior, a linear mapping from latents to observables, and Gaussian observable distributions.We here derive a solely entropy-based learning objective for the parameters of standard sparse coding.The novel variational objective has the following features: (A) unlike MAP approximations, it uses non-trivial posterior approximations for probabilistic inference; (B) the novel objective is fully analytic; and (C) the objective allows for a novel principled form of annealing.The objective is derived by first showing that the standard ELBO objective converges to a sum of entropies, which matches similar recent results for generative models with Gaussian priors.The conditions under which the ELBO becomes equal to entropies are then shown to have analytic solutions, which leads to the fully analytic objective.Numerical experiments are used to demonstrate the feasibility of learning with such entropy-based ELBOs.We investigate different posterior approximations including Gaussians with correlated latents and deep amortized approximations.Furthermore, we numerically investigate entropy-based annealing which results in improved learning.Our main contributions are theoretical, however, and they are twofold: (1) we provide the first demonstration on how a recently shown convergence of the ELBO to entropy sums can be used for learning; and (2) using the entropy objective, we derive a fully analytic ELBO objective for the standard sparse coding generative model.

**Robust Approximate Sampling via Stochastic Gradient Barker Dynamics**

Lorenzo Mauri · Giacomo Zanella

Stochastic Gradient (SG) Markov Chain Monte Carlo algorithms (MCMC) are popular algorithms for Bayesian sampling in the presence of large datasets. However, they come with little theoretical guarantees and assessing their empirical performances is non-trivial. In such context, it is crucial to develop algorithms that are robust to the choice of hyperparameters and to gradients heterogeneity since, in practice, both the choice of step-size and behaviour of target gradients induce hard-to-control biases in the invariant distribution. In this work we introduce the stochastic gradient Barker dynamics algorithm (SGBD), extending the recently developed Barker MCMC scheme, a robust alternative to Langevin-based sampling algorithms, to the stochastic gradient framework. We characterize the impact of stochastic gradients on the Barker transition mechanism and develop a bias-corrected version that, under suitable assumptions, eliminates the error due to the gradient noise in the proposal. We illustrate the performance on a number of high-dimensional examples, showing that SGBD is more robust to hyperparameter tuning and to irregular behavior of the target gradients compared to the popular stochastic gradient Langevin dynamics algorithm.

**Probabilistic Integral Circuits**

Gennaro Gala · Cassio de Campos · Robert Peharz · Antonio Vergari · Erik Quaeghebeur

Continuous latent variables (LVs) are a key ingredient of many generative models, as they allow modelling expressive mixtures with an uncountable number of components.In contrast, probabilistic circuits (PCs) are hierarchical discrete mixtures represented as computational graphs composed of input, sum and product units.Unlike continuous LV models, PCs provide tractable inference but are limited to discrete LVs with categorical (i.e. unordered) states.We bridge these model classes by introducing probabilistic integral circuits (PICs), a new language of computational graphs that extends PCs with integral units representing continuous LVs.In the first place, PICs are symbolic computational graphs and are fully tractable in simple cases where analytical integration is possible.In practice, we parameterise PICs with light-weight neural nets delivering an intractable hierarchical continuous mixture that can be approximated arbitrarily well with large PCs using numerical quadrature.On several distribution estimation benchmarks, we show that such PIC-approximating PCs systematically outperform PCs commonly learned via expectation-maximization or SGD.

**Backward Filtering Forward Deciding in Linear Non-Gaussian State Space Models**

Yun-Peng Li · Hans-Andrea Loeliger

The paper considers linear state space models with non-Gaussian inputs and/or constraints. As shown previously, NUP representations (normal with unknown parameters) allow to compute MAP estimates in such models by iterating Kalman smoothing recursions. In this paper, we propose to compute such MAP estimates by iterating backward-forward recursions where the forward recursion amounts to coordinatewise input estimation. The advantages of the proposed approach include faster convergence, no “zero-variance stucking”, and easier control of constraint satisfaction. The approach is demonstrated with simulation results of exemplary applications including (i) regression with non-Gaussian priors or constraints on k-th order differences and (ii) control with linearly constrained inputs.

**Stochastic Approximation with Biased MCMC for Expectation Maximization**

Samuel Gruffaz · Kyurae Kim · Alain Durmus · Jacob Gardner

The expectation maximization (EM) algorithm is a widespread method for empirical Bayesian inference, but its expectation step (E-step) is often intractable. Employing a stochastic approximation scheme with Markov chain Monte Carlo (MCMC) can circumvent this issue, resulting in an algorithm known as MCMC-SAEM. While theoretical guarantees for MCMC-SAEM have previously been established, these results are restricted to the case where asymptotically unbiased MCMC algorithms are used. In practice, MCMC-SAEM is often run with asymptotically biased MCMC, for which the consequences are theoretically less understood. In this work, we fill this gap by analyzing the asymptotics and non-asymptotics of SAEM with biased MCMC steps, particularly the effect of bias. We also provide numerical experiments comparing the Metropolis-adjusted Langevin algorithm (MALA), which is asymptotically unbiased, and the unadjusted Langevin algorithm (ULA), which is asymptotically biased, on synthetic and real datasets. Experimental results show that ULA is more stable with respect to the choice of Langevin stepsize and can sometimes result in faster convergence.

**Mixed variational flows for discrete variables**

Gian C. Diluvi · Benjamin Bloem-Reddy · Trevor Campbell

Variational flows allow practitioners to learn complex continuous distributions, but approximating discrete distributions remains a challenge. Current methodologies typically embed the discrete target in a continuous space---usually via continuous relaxation or dequantization---and then apply a continuous flow. These approaches involve a surrogate target that may not capture the original discrete target, might have biased or unstable gradients, and can create a difficult optimization problem. In this work, we develop a variational flow family for discrete distributions without any continuous embedding. First, we develop a measure-preserving and discrete (MAD) invertible map that leaves the discrete target invariant, and then create a mixed variational flow (MAD Mix) based on that map. Our family provides access to i.i.d. sampling and density evaluation with virtually no tuning effort. We also develop an extension to MAD Mix that handles joint discrete and continuous models. Our experiments suggest that MAD Mix produces more reliable approximations than continuous-embedding flows while requiring orders of magnitude less compute.

**Minimizing Convex Functionals over Space of Probability Measures via KL Divergence Gradient Flow**

Rentian Yao · Linjun Huang · Yun Yang

Motivated by the computation of the non-parametric maximum likelihood estimator (NPMLE) and the Bayesian posterior in statistics, this paper explores the problem of convex optimization over the space of all probability distributions. We introduce an implicit scheme, called the implicit KL proximal descent (IKLPD) algorithm, for discretizing a continuous-time gradient flow relative to the Kullback--Leibler (KL) divergence for minimizing a convex target functional. We show that IKLPD converges to a global optimum at a polynomial rate from any initialization; moreover, if the objective functional is strongly convex relative to the KL divergence, for example, when the target functional itself is a KL divergence as in the context of Bayesian posterior computation, IKLPD exhibits globally exponential convergence. Computationally, we propose a numerical method based on normalizing flow to realize IKLPD. Conversely, our numerical method can also be viewed as a new approach that sequentially trains a normalizing flow for minimizing a convex functional with a strong theoretical guarantee.

**Surrogate Bayesian Networks for Approximating Evolutionary Games**

Vincent Hsiao · Dana Nau · Bobak Pezeshki · Rina Dechter

Spatial evolutionary games are used to model large systems of interacting agents. In earlier work, a method was developed using Bayesian Networks to approximate the population dynamics in these games. One of the advantages of the Bayesian Network modeling approach is that it is possible to smoothly adjust the size of the network to get more accurate approximations. However, scaling the method up can be intractable if the number of strategies in the evolutionary game increases. In this paper, we propose a new method for computing more accurate approximations by using surrogate Bayesian Networks. Instead of computing inference on larger networks directly, we perform inference on a much smaller surrogate network extended with parameters that exploit the symmetry inherent to the domain. We learn the parameters on the surrogate network using KL-divergence as the loss function. We illustrate the value of this method empirically through a comparison on several evolutionary games.

**Leveraging Ensemble Diversity for Robust Self-Training in the Presence of Sample Selection Bias**

Ambroise Odonnat · Vasilii Feofanov · Ievgen Redko

Self-training is a well-known approach for semi-supervised learning. It consists of iteratively assigning pseudo-labels to unlabeled data for which the model is confident and treating them as labeled examples. For neural networks, \texttt{softmax} prediction probabilities are often used as a confidence measure, although they are known to be overconfident, even for wrong predictions. This phenomenon is particularly intensified in the presence of sample selection bias, i.e., when data labeling is subject to some constraints. To address this issue, we propose a novel confidence measure, called $\mathcal{T}$-similarity, built upon the prediction diversity of an ensemble of linear classifiers. We provide the theoretical analysis of our approach by studying stationary points and describing the relationship between the diversity of the individual members and their performance. We empirically demonstrate the benefit of our confidence measure for three different pseudo-labeling policies on classification datasets of various data modalities. The code is available at https://github.com/ambroiseodt/tsim.

**Large-Scale Gaussian Processes via Alternating Projection**

Kaiwen Wu · Jonathan Wenger · Haydn Jones · Geoff Pleiss · Jacob Gardner

Training and inference in Gaussian processes (GPs) require solving linear systems with $n\times n$ kernel matrices. To address the prohibitive $\mathcal{O}(n^3)$ time complexity, recent work has employed fast iterative methods, like conjugate gradients (CG). However, as datasets increase in magnitude, the kernel matrices become increasingly ill-conditioned and still require $\mathcal{O}(n^2)$ space without partitioning. Thus, while CG increases the size of datasets GPs can be trained on, modern datasets reach scales beyond its applicability. In this work, we propose an iterative method which only accesses subblocks of the kernel matrix, effectively enabling mini-batching. Our algorithm, based on alternating projection, has $\mathcal{O}(n)$ per-iteration time and space complexity, solving many of the practical challenges of scaling GPs to very large datasets. Theoretically, we prove the method enjoys linear convergence. Empirically, we demonstrate its fast convergence in practice and robustness to ill-conditioning. On large-scale benchmark datasets with up to four million data points, our approach accelerates GP training and inference by speed-up factors up to $27\times$ and $72 \times$, respectively, compared to CG.

**Clustering Items From Adaptively Collected Inconsistent Feedback**

Shubham Gupta · Peter Staar · Christian de Sainte Marie

We study clustering in a query-based model where the learner can repeatedly query an oracle to determine if two items belong to the same cluster. However, these queries are costly and the oracle's responses are marred by inconsistency and noise. The learner's goal is to adaptively make a small number of queries and return the correct clusters for \emph{all} $n$ items with high confidence. We develop efficient algorithms for this problem using the sequential hypothesis testing framework. We derive high probability upper bounds on their sample complexity (the number of queries they make) and complement this analysis with an information-theoretic lower bound. In particular, we show that our algorithm for two clusters is nearly optimal when the oracle's error probability is a constant. Our experiments verify these findings and highlight a few shortcomings of our algorithms. Namely, we show that their sample complexity deviates from the lower bound when the error probability of the oracle depends on $n$. We suggest an improvement based on a more efficient sequential hypothesis test and demonstrate it empirically.

The importance of dealing with big data is further increasing, as machine learning (ML) systems obtain useful knowledge from big datasets. However, using all data is practically prohibitive because of the massive sizes of the datasets, so summarizing them by centers obtained from k-center clustering is a promising approach. We have two concerns here. One is fairness, because if the summary does not have some specific groups, subsequent applications may provide unfair results for the groups. The other is the presence of outliers, and if outliers dominate the summary, it cannot be useful. To overcome these concerns, we address the problem of fair k-center clustering with outliers. Although prior works studied the fair k-center clustering problem, they do not consider outliers. This paper yields a linear time algorithm that satisfies the fairness constraint of our problem and probabilistically guarantees the almost 3-approximation bound. Its empirical efficiency and effectiveness are also reported.

**Deep anytime-valid hypothesis testing**

Teodora Pandeva · Patrick Forré · Aaditya Ramdas · Shubhanshu Shekhar

We propose a general framework for constructing powerful, sequential hypothesis tests for a large class of nonparametric testing problems. The null hypothesis for these problems is defined in an abstract form using the action of two known operators on the data distribution. This abstraction allows for a unified treatment of several classical tasks, such as two-sample testing, independence testing, and conditional-independence testing, as well as modern problems, such as testing for adversarial robustness of machine learning (ML) models. Our proposed framework has the following advantages over classical batch tests: 1) it continuously monitors online data streams and efficiently aggregates evidence against the null, 2) it provides tight control over the type I error without the need for multiple testing correction, 3) it adapts the sample size requirement to the unknown hardness of the problem. We develop a principled approach of leveraging the representation capability of ML models within the testing-by-betting framework, a game-theoretic approach for designing sequential tests. Empirical results on synthetic and real-world datasets demonstrate that tests instantiated using our general framework are competitive against specialized baselines on several tasks.

**Understanding Generalization of Federated Learning via Stability: Heterogeneity Matters**

Zhenyu Sun · Xiaochun Niu · Ermin Wei

Generalization performance is a key metric in evaluating machine learning models when applied to real-world applications. Good generalization indicates the model can predict unseen data correctly when trained under a limited number of data. Federated learning (FL), which has emerged as a popular distributed learning framework, allows multiple devices or clients to train a shared model without violating privacy requirements. While the existing literature has studied extensively the generalization performances of centralized machine learning algorithms, similar analysis in the federated settings is either absent or with very restrictive assumptions on the loss functions. In this paper, we aim to analyze the generalization performances of federated learning by means of algorithmic stability, which measures the change of the output model of an algorithm when perturbing one data point. Three widely-used algorithms are studied, including FedAvg, SCAFFOLD, and FedProx, under convex and non-convex loss functions. Our analysis shows that the generalization performances of models trained by these three algorithms are closely related to the heterogeneity of clients' datasets as well as the convergence behaviors of the algorithms. Particularly, in the i.i.d. setting, our results recover the classical results of stochastic gradient descent (SGD).

**A General Algorithm for Solving Rank-one Matrix Sensing**

Lianke Qin · Zhao Song · Ruizhe Zhang

Matrix sensing has many real-world applications in science and engineering, such as system control, distance embedding, and computer vision. The goal of matrix sensing is to recover a matrix $A_\star \in \mathbb{R}^{n \times n}$, based on a sequence of measurements $(u_i,b_i) \in \mathbb{R}^{n} \times \mathbb{R}$ such that $u_i^\top A_\star u_i = b_i$. Previous work (Zhong et al., 2015) focused on the scenario where matrix $A_{\star}$ has a small rank, e.g. rank-$k$. Their analysis heavily relies on the RIP assumption, making it unclear how to generalize to high-rank matrices. In this paper, we relax that rank-$k$ assumption and solve a much more general matrix sensing problem. Given an accuracy parameter $\delta \in (0,1)$, we can compute $A \in \mathbb{R}^{n \times n}$ in $\widetilde{O}(m^{3/2} n^2 \delta^{-1} )$, such that $ |u_i^\top A u_i - b_i| \leq \delta$ for all $i \in [m]$. We design an efficient algorithm with provable convergence guarantees using stochastic gradient descent for this problem.

We study distributed schemes for high-dimensional sparse linear regression, based on orthogonal matching pursuit (OMP). Such schemes are particularly suited for settings where a central fusion center is connected to end machines, that have both computation and communication limitations. We prove that under suitable assumptions, distributed-OMP schemes recover the support of the regression vector with communication per machine linear in its sparsity and logarithmic in the dimension. Remarkably, this holds even at low signal-to-noise-ratios, where individual machines are unable to detect the support. Our simulations show that distributed-OMP schemes are competitive with more computationally intensive methods, and in some cases even outperform them.

**Making Better Use of Unlabelled Data in Bayesian Active Learning**

Freddie Bickford Smith · Adam Foster · Tom Rainforth

Fully supervised models are predominant in Bayesian active learning. We argue that their neglect of the information present in unlabelled data harms not just predictive performance but also decisions about what data to acquire. Our proposed solution is a simple framework for semi-supervised Bayesian active learning. We find it produces better-performing models than either conventional Bayesian active learning or semi-supervised learning with randomly acquired data. It is also easier to scale up than the conventional approach. As well as supporting a shift towards semi-supervised models, our findings highlight the importance of studying models and acquisition methods in conjunction.

**BOBA: Byzantine-Robust Federated Learning with Label Skewness**

Wenxuan Bao · Jun Wu · Jingrui He

In federated learning, most existing robust aggregation rules (AGRs) combat Byzantine attacks in the IID setting, where client data is assumed to be independent and identically distributed. In this paper, we address label skewness, a more realistic and challenging non-IID setting, where each client only has access to a few classes of data. In this setting, state-of-the-art AGRs suffer from selection bias, leading to significant performance drop for particular classes; they are also more vulnerable to Byzantine attacks due to the increased variation among gradients of honest clients. To address these limitations, we propose an efficient two-stage method named BOBA. Theoretically, we prove the convergence of BOBA with an error of the optimal order. Our empirical evaluations demonstrate BOBA's superior unbiasedness and robustness across diverse models and datasets when compared to various baselines.

**Autoregressive Bandits**

Francesco Bacchiocchi · Gianmarco Genalti · Davide Maran · Marco Mussi · Marcello Restelli · Nicola Gatti · Alberto Maria Metelli

Autoregressive processes naturally arise in a large variety of real-world scenarios, including stock markets, sales forecasting, weather prediction, advertising, and pricing. When facing a sequential decision-making problem in such a context, the temporal dependence between consecutive observations should be properly accounted for guaranteeing convergence to the optimal policy. In this work, we propose a novel online learning setting, namely, Autoregressive Bandits (ARBs), in which the observed reward is governed by an autoregressive process of order $k$, whose parameters depend on the chosen action. We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed. Then, we devise a new optimistic regret minimization algorithm, namely, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers sublinear regret of order $\tilde{O} ( \frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^2} )$, where $T$ is the optimization horizon, $n$ is the number of actions, and $\Gamma < 1$ is a stability index of the process. Finally, we empirically validate our algorithm, illustrating its advantages w.r.t. bandit baselines and its robustness to misspecification of key parameters.

**MINTY: Rule-based models that minimize the need for imputing features with missing values**

Lena Stempfle · Fredrik Johansson

Rule models are often preferred in prediction tasks with tabular inputs as they can be easily interpreted using natural language and provide predictive performance on par with more complex models. However, most rule models’ predictions are undefined or ambiguous when some inputs are missing, forcing users to rely on statistical imputation models or heuristics like zero imputation, undermining the interpretability of the models. In this work, we propose fitting concise yet precise rule models that learn to avoid relying on features with missing values and, therefore, limit their reliance on imputation at test time. We develop MINTY, a method that learns rules in the form of disjunctions between variables that act as replacements for each other when one or more is missing. This results in a sparse linear rule model, regularized to have small dependence on features with missing values, that allows a trade-off between goodness of fit, interpretability, and robustness to missing values at test time. We demonstrate the value of MINTY in experiments using synthetic and real-world data sets and find its predictive performance comparable or favorable to baselines, with smaller reliance on features with missing values.

**Graph Machine Learning through the Lens of Bilevel Optimization**

Amber Yijia Zheng · Tong He · Yixuan Qiu · Minjie Wang · David Wipf

Bilevel optimization refers to scenarios whereby the optimal solution of a lower-level energy function serves as input features to an upper-level objective of interest. These optimal features typically depend on tunable parameters of the lower-level energy in such a way that the entire bilevel pipeline can be trained end-to-end. Although not generally presented as such, this paper demonstrates how a variety of graph learning techniques can be recast as special cases of bilevel optimization or simplifications thereof. In brief, building on prior work we first derive a more flexible class of energy functions that, when paired with various descent steps (e.g., gradient descent, proximal methods, momentum, etc.), form graph neural network (GNN) message-passing layers; critically, we also carefully unpack where any residual approximation error lies with respect to the underlying constituent message-passing functions. We then probe several simplifications of this framework to derive close connections with non-GNN-based graph learning approaches, including knowledge graph embeddings, various forms of label propagation, and efficient graph-regularized MLP models. And finally, we present supporting empirical results that demonstrate the versatility of the proposed bilevel lens, which we refer to as BloomGML, referencing that BiLevel Optimization Offers More Graph Machine Learning. Our code is available at \url{https://github.com/amberyzheng/BloomGML}. Let graph ML bloom.

**Towards Achieving Sub-linear Regret and Hard Constraint Violation in Model-free RL**

Arnob Ghosh · Xingyu Zhou · Ness Shroff

We study the constrained Markov decision processes (CMDPs), in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected total value of a utility function. Existing approaches have primarily focused on \emph{soft} constraint violation, which allows compensation across episodes, making it easier to satisfy the constraints. In contrast, we consider a stronger \emph{hard} constraint violation metric, where only positive constraint violations are accumulated. Our main result is the development of the \emph{first model-free}, \emph{simulator-free} algorithm that achieves a sub-linear regret and a sub-linear hard constraint violation simultaneously, even in \emph{large-scale} systems. In particular, we show that $\tilde{\mathcal{O}}(\sqrt{d^3H^4K})$ regret and $\tilde{\mathcal{O}}(\sqrt{d^3H^4K})$ hard constraint violation bounds can be achieved, where $K$ is the number of episodes, $d$ is the dimension of the feature mapping, $H$ is the length of the episode. Our results are achieved via novel adaptations of the primal-dual LSVI-UCB algorithm, i.e., it searches for the dual variable that balances between regret and constraint violation within every episode, rather than updating it at the end of each episode. This turns out to be crucial for our theoretical guarantees when dealing with hard constraint violations.

**Asynchronous SGD on Graphs: a Unified Framework for Asynchronous Decentralized and Federated Optimization**

Mathieu Even · Anastasia Koloskova · Laurent Massoulié

Decentralized and asynchronous communications are two popular techniques to speedup communication complexity of distributed machine learning, by respectively removing the dependency over a central orchestrator and the need for synchronization. Yet, combining these two techniques together still remains a challenge. In this paper, we take a step in this direction and introduce Asynchronous SGD on Graphs (AGRAF SGD) --- a general algorithmic framework that covers asynchronous versions of many popular algorithms including SGD, Decentralized SGD, Local SGD, FedBuff, thanks to its relaxed communication and computation assumptions. We provide rates of convergence under much milder assumptions than previous decentralized asynchronous works, while still recovering or even improving over the best know results for all the algorithms covered.

**Scalable Algorithms for Individual Preference Stable Clustering**

Ron Mosenzon · Ali Vakilian

In this paper, we study the individual preference (IP) stability, which is an notion capturing individual fairness and stability in clustering. Within this setting, a clustering is $\alpha$-IP stable when each data point's average distance to its cluster is no more than $\alpha$ times its average distance to any other cluster. In this paper, we study the natural local search algorithm for IP stable clustering. Our analysis confirms a $O(\log n)$-IP stability guarantee for this algorithm, where $n$ denotes the number of points in the input. Furthermore, by refining the local search approach, we show it runs in an almost linear time, $\tilde{O}(nk)$.

**Orthogonal Gradient Boosting for Simpler Additive Rule Ensembles**

Fan Yang · Pierre Le Bodic · Michael Kamp · Mario Boley

Gradient boosting of prediction rules is an efficient approach to learn potentially interpretable yet accurate probabilistic models. However, actual interpretability requires to limit the number and size of the generated rules, and existing boosting variants are not designed for this purpose. Though corrective boosting refits all rule weights in each iteration to minimise prediction risk, the included rule conditions tend to be sub-optimal, because commonly used objective functions fail to anticipate this refitting. Here, we address this issue by a new objective function that measures the angle between the risk gradient vector and the projection of the condition output vector onto the orthogonal complement of the already selected conditions. This approach correctly approximates the ideal update of adding the risk gradient itself to the model and favours the inclusion of more general and thus shorter rules. As we demonstrate using a wide range of prediction tasks, this significantly improves the comprehensibility/accuracy trade-off of the fitted ensemble. Additionally, we show how objective values for related rule conditions can be computed incrementally to avoid any substantial computational overhead of the new method.

**Adaptive Experiment Design with Synthetic Controls**

Alihan Hüyük · Zhaozhi Qian · Mihaela van der Schaar

Clinical trials are typically run in order to understand the effects of a new treatment on a given population of patients. However, patients in large populations rarely respond the same way to the same treatment. This heterogeneity in patient responses necessitates trials that investigate effects on multiple subpopulations—especially when a treatment has marginal or no benefit for the overall population but might have significant benefit for a particular subpopulation. Motivated by this need, we propose Syntax, an exploratory trial design that identifies subpopulations with positive treatment effect among many subpopulations. Syntax is sample efficient as it (i) recruits and allocates patients adaptively and (ii) estimates treatment effects by forming synthetic controls for each subpopulation that combines control samples from other subpopulations. We validate the performance of Syntax and provide insights into when it might have an advantage over conventional trial designs through experiments.

**Online non-parametric likelihood-ratio estimation by Pearson-divergence functional minimization**

Alejandro de la Concha Duarte · Nicolas Vayatis · Argyris Kalogeratos

Quantifying the difference between two probability density functions, $p$ and $q$, using available data, is a fundamental problem in Statistics and Machine Learning. A usual approach for addressing this problem is the likelihood-ratio estimation (LRE) between $p$ and $q$, which -to our best knowledge- has been investigated mainly for the offline case. This paper contributes by introducing a new framework for online non-parametric LRE (OLRE) for the setting where pairs of iid observations $(x_t \sim p, x'_t \sim q)$ are observed over time. The non-parametric nature of our approach has the advantage of being agnostic to the forms of $p$ and $q$. Moreover, we capitalize on the recent advances in Kernel Methods and functional minimization to develop an estimator that can be efficiently updated at every iteration. We provide theoretical guarantees for the performance of the OLRE method along with empirical validation in synthetic experiments.

**Scalable Learning of Item Response Theory Models**

Susanne Frick · Amer Krivošija · Alexander Munteanu

Item Response Theory (IRT) models aim to assess latent abilities of $n$ examinees along with latent difficulty characteristics of $m$ test items from categorical data that indicates the quality of their corresponding answers. Classical psychometric assessments are based on a relatively small number of examinees and items, say a class of $200$ students solving an exam comprising $10$ problems. More recent global large scale assessments such as PISA, or internet studies, may lead to significantly increased numbers of participants. Additionally, in the context of Machine Learning where algorithms take the role of examinees and data analysis problems take the role of items, both $n$ and $m$ may become very large, challenging the efficiency and scalability of computations. To learn the latent variables in IRT models from large data, we leverage the similarity of these models to logistic regression, which can be approximated accurately using small weighted subsets called coresets. We develop coresets for their use in alternating IRT training algorithms, facilitating scalable learning from large data.

**Corruption-Robust Offline Two-Player Zero-Sum Markov Games**

Andi Nika · Debmalya Mandal · Adish Singla · Goran Radanovic

We study data corruption robustness in offline two-player zero-sum Markov games. Given a dataset of realized trajectories of two players, an adversary is allowed to modify an $\epsilon$-fraction of it. The learner's goal is to identify an approximate Nash Equilibrium policy pair from the corrupted data. We consider this problem in linear Markov games under different degrees of data coverage and corruption. We start by providing an information-theoretic lower bound on the suboptimality gap of any learner. Next, we propose robust versions of the Pessimistic Minimax Value Iteration algorithm (Zhong et al., 2022), both under coverage on the corrupted data and under coverage only on the clean data, and show that they achieve (near)-optimal suboptimality gap bounds with respect to $\epsilon$. We note that we are the first to provide such a characterization of the problem of learning approximate Nash Equilibrium policies in offline two-player zero-sum Markov games under data corruption.

**Quantized Fourier and Polynomial Features for more Expressive Tensor Network Models**

Frederiek Wesel · Kim Batselier

In the context of kernel machines, polynomial and Fourier features are commonly used to provide a nonlinear extension to linear models by mapping the data to a higher-dimensional space. Unless one considers the dual formulation of the learning problem, which renders exact large-scale learning unfeasible, the exponential increase of model parameters in the dimensionality of the data caused by their tensor-product structure prohibits to tackle high-dimensional problems. One of the possible approaches to circumvent this exponential scaling is to exploit the tensor structure present in the features by constraining the model weights to be an underparametrized tensor network.In this paper we quantize, i.e. further tensorize, polynomial and Fourier features. Based on this feature quantization we propose to quantize the associated model weights, yielding quantized models. We show that, for the same number of model parameters, the resulting quantized models have a higher bound on the VC-dimension as opposed to their non-quantized counterparts, at no additional computational cost while learning from identical features.We verify experimentally how this additional tensorization regularizes the learning problem by prioritizing the most salient features in the data and how it provides models with increased generalization capabilities. We finally benchmark our approach on large regression task, achieving state-of-the-art results on a laptop computer.

**Fair Soft Clustering**

Rune D. Kjærsgaard · Pekka Parviainen · Saket Saurabh · MADHUMITA KUNDU · Line Clemmensen

Scholars in the machine learning community have recently focused on analyzing the fairness of learning models, including clustering algorithms. In this work we study fair clustering in a probabilistic (soft) setting, where observations may belong to several clusters determined by probabilities. We introduce new probabilistic fairness metrics, which generalize and extend existing non-probabilistic fairness frameworks and propose an algorithm for obtaining a fair probabilistic cluster solution from a data representation known as a fairlet decomposition. Finally, we demonstrate our proposed fairness metrics and algorithm by constructing a fair Gaussian mixture model on three real-world datasets. We achieve this by identifying balanced micro-clusters which minimize the distances induced by the model, and on which traditional clustering can be performed while ensuring the fairness of the solution.

**Gaussian process regression with Sliced Wasserstein Weisfeiler-Lehman graph kernels**

Raphaël Carpintero Perez · Sébastien Da Veiga · Josselin Garnier · Brian Staber

Supervised learning has recently garnered significant attention in the field of computational physics due to its ability to effectively extract complex patterns for tasks like solving partial differential equations, or predicting material properties. Traditionally, such datasets consist of inputs given as meshes with a large number of nodes representing the problem geometry (seen as graphs), and corresponding outputs obtained with a numerical solver. This means the supervised learning model must be able to handle large and sparse graphs with continuous node attributes. In this work, we focus on Gaussian process regression, for which we introduce the Sliced Wasserstein Weisfeiler-Lehman (SWWL) graph kernel. In contrast to existing graph kernels, the proposed SWWL kernel enjoys positive definiteness and a drastic complexity reduction, which makes it possible to process datasets that were previously impossible to handle. The new kernel is first validated on graph classification for molecular datasets, where the input graphs have a few tens of nodes. The efficiency of the SWWL kernel is then illustrated on graph regression in computational fluid dynamics and solid mechanics, where the input graphs are made up of tens of thousands of nodes.

**MMD-based Variable Importance for Distributional Random Forest**

Clément Bénard · Jeffrey Näf · julie Josse

Distributional Random Forest (DRF) is a flexible forest-based method to estimate the full conditional distribution of a multivariate output of interest given input variables. In this article, we introduce a variable importance algorithm for DRFs, based on the well-established drop and relearn principle and MMD distance. While traditional importance measures only detect variables with an influence on the output mean, our algorithm detects variables impacting the output distribution more generally. We show that the introduced importance measure is consistent, exhibits high empirical performance on both real and simulated data, and outperforms competitors. In particular, our algorithm is highly efficient to select variables through recursive feature elimination, and can therefore provide small sets of variables to build accurate estimates of conditional output distributions.

**Better Batch for Deep Probabilistic Time Series Forecasting**

Zhihao Zheng · Seongjin Choi · Lijun Sun

Deep probabilistic time series forecasting has gained attention for its ability to provide nonlinear approximation and valuable uncertainty quantification for decision-making. However, existing models often oversimplify the problem by assuming a time-independent error process and overlooking serial correlation. To overcome this limitation, we propose an innovative training method that incorporates error autocorrelation to enhance probabilistic forecasting accuracy. Our method constructs a mini-batch as a collection of D consecutive time series segments for model training. It explicitly learns a time-varying covariance matrix over each mini-batch, encoding error correlation among adjacent time steps. The learned covariance matrix can be used to improve prediction accuracy and enhance uncertainty quantification. We evaluate our method on two different neural forecasting models and multiple public datasets. Experimental results confirm the effectiveness of the proposed approach in improving the performance of both models across a range of datasets, resulting in notable improvements in predictive accuracy.

**Adaptive Parametric Prototype Learning for Cross-Domain Few-Shot Classification**

Marzi Heidari · Abdullah Alchihabi · Qing En · Yuhong Guo

Cross-domain few-shot classification induces a much more challenging problem than its in-domain counterpart due to the existence of domain shifts between the training and test tasks. In this paper, we develop a novel Adaptive Parametric Prototype Learning (APPL) method under the meta-learning convention for cross-domain few-shot classification. Different from existing prototypical few-shot methods that use the averages of support instances to calculate the class prototypes, we propose to learn class prototypes from the concatenated features of the support set in a parametric fashion and meta-learn the model by enforcing prototype-based regularization on the query set. In addition, we fine-tune the model in the target domain in a transductive manner using a weighted-moving-average self-training approach on the query instances. We conduct experiments on multiple cross-domain few-shot benchmark datasets. The empirical results demonstrate that APPL yields superior performance to many state-of-the-art cross-domain few-shot learning methods.

**Online Calibrated and Conformal Prediction Improves Bayesian Optimization**

Shachi Shailesh Deshpande · Charles Marx · Volodymyr Kuleshov

Accurate uncertainty estimates are important in sequential model-based decision-making tasks such as Bayesian optimization. However, these estimates can be imperfect if the data violates assumptions made by the model (e.g., Gaussianity). This paper studies which uncertainties are needed in model-based decision-making and in Bayesian optimization, and argues that uncertainties can benefit from calibration—i.e., an 80\% predictive interval should contain the true outcome 80\% of the time. Maintaining calibration, however, can be challenging when the data is non-stationary and depends on our actions. We propose using simple algorithms based on online learning to provably maintain calibration onnon-i.i.d. data, and we show how to integrate these algorithms in Bayesian optimization with minimal overhead. Empirically, we find that calibrated Bayesian optimization converges to better optima in fewer steps, and we demonstrate improved performance on standard benchmark functions and hyperparameter optimization tasks.

**Offline Policy Evaluation and Optimization Under Confounding**

Chinmaya Kausik · Yangyi Lu · Kevin Tan · Maggie Makar · Yixin Wang · Ambuj Tewari

Evaluating and optimizing policies in the presence of unobserved confounders is a problem of growing interest in offline reinforcement learning.Using conventional methods for offline RL in the presence of confounding can not only lead to poor decisions and poor policies, but also have disastrous effects in critical applications such as healthcare and education.We map out the landscape of offline policy evaluation for confounded MDPs, distinguishing assumptions on confounding based on whether they are memoryless and on their effect on the data-collection policies.We characterize settings where consistent value estimates are provably not achievable, and provide algorithms with guarantees to instead estimate lower bounds on the value.When consistent estimates are achievable, we provide algorithms for value estimation with sample complexity guarantees.We also present new algorithms for offline policy improvement and prove local convergence guarantees.Finally, we experimentally evaluate our algorithms on both a gridworld environment and a simulated healthcare setting of managing sepsis patients.In gridworld, our model-based method provides tighter lower bounds than existing methods, while in the sepsis simulator, we demonstrate the effectiveness of our method and investigate the importance of a clustering sub-routine.

**Asynchronous Randomized Trace Estimation**

Vasileios Kalantzis · Shashanka Ubaru · Chai Wah Wu · Georgios Kollias · Lior Horesh

Randomized trace estimation is a popular technique to approximate the trace of an implicitly-defined matrix $A$ by averaging the quadratic form $x'Ax$ across several samples of a random vector $x$. This paper focuses on the application of randomized trace estimators on asynchronous computing environments where the quadratic form $x'Ax$ is computed partially by observing only a random row subset of $A$ for each sample of the random vector $x$. Our asynchronous framework treats the number of rows, as well as the row subset observed for each sample, as random variables, and our theoretical analysis establishes the variance of the randomized estimator for Rademacher and Gaussian samples. We also present error analysis and sampling complexity bounds for the proposed asynchronous randomized trace estimator. Our numerical experiments illustrate that the asynchronous variant can be competitive even when a small number of rows is updated per each sample.

**FedFisher: Leveraging Fisher Information for One-Shot Federated Learning**

Divyansh Jhunjhunwala · Shiqiang Wang · Gauri Joshi

Standard federated learning (FL) algorithms typically require multiple rounds of communication between the server and the clients, which has several drawbacks, including requiring constant network connectivity, repeated investment of computational resources, and susceptibility to privacy attacks. One-Shot FL is a new paradigm that aims to address this challenge by enabling the server to train a global model in a single round of communication. In this work, we present FedFisher, a novel algorithm for one-shot FL that makes use of Fisher information matrices computed on local client models, motivated by a Bayesian perspective of FL. First, we theoretically analyze FedFisher for two-layer over-parameterized ReLU neural networks and show that the error of our one-shot FedFisher global model becomes vanishingly small as the width of the neural networks and amount of local training at clients increases. Next, we propose practical variants of FedFisher using the diagonal Fisher and K-FAC approximation for the full Fisher and highlight their communication and compute efficiency for FL. Finally, we conduct extensive experiments on various datasets, which show that these variants of FedFisher consistently improve over competing baselines.

**Feasible $Q$-Learning for Average Reward Reinforcement Learning**

Ying Jin · Ramki Gummadi · Zhengyuan Zhou · Jose Blanchet

Average reward reinforcement learning (RL) provides a suitable framework for capturing the objective (i.e.\ long-run average reward) for continuing tasks, where there is often no natural way to identify a discount factor.However, existing average reward RL algorithms with sample complexity guarantees are not feasible, as they take as input the (unknown) mixing time of the Markov decision process (MDP). In this paper, we make initial progress towards addressing this open problem. We design a feasible average-reward $Q$-learning framework that requires no knowledge of any problem parameter as input. Our framework is based on discounted $Q$-learning, while we dynamically adapt the discount factor (and hence the effective horizon) to progressively approximate the average reward. In the synchronous setting, we solve three tasks: (i) learn a policy that is $\epsilon$-close to optimal, (ii) estimate optimal average reward with $\epsilon$-accuracy, and (iii) estimate the bias function (similar to $Q$-function in discounted case) with $\epsilon$-accuracy. We show that with carefully designed adaptation schemes, (i) can be achieved with $\tilde{O}(\frac{SA t_{\mathrm{mix}}^{8}}{\epsilon^{8}})$ samples, (ii) with $\tilde{O}(\frac{SA t_{\mathrm{mix}}^5}{\epsilon^5})$ samples, and (iii) with $\tilde{O}(\frac{SA B}{\epsilon^9})$ samples, where $t_\mathrm{mix}$ is the mixing time, and $B > 0$ is an MDP-dependent constant. To our knowledge, we provide the first finite-sample guarantees that are polynomial in $S, A, t_{\mathrm{mix}}, \epsilon$ for a feasible variant of $Q$-learning.That said, the sample complexity bounds have tremendous room for improvement, which we leave for the community's best minds. Preliminary simulations verify that our framework is effective without prior knowledge of parameters as input.

**Smoothness-Adaptive Dynamic Pricing with Nonparametric Demand Learning**

Zeqi Ye · Hansheng Jiang

We study the dynamic pricing problem where the demand function is nonparametric and H\"older smooth, and we focus on adaptivity to the unknown H\"older smoothness parameter $\beta$ of the demand function. Traditionally the optimal dynamic pricing algorithm heavily relies on the knowledge of $\beta$ to achieve a minimax optimal regret of $\widetilde{O}(T^{\frac{\beta+1}{2\beta+1}})$. However, we highlight the challenge of adaptivity in this dynamic pricing problem by proving that no pricing policy can adaptively achieve this minimax optimal regret without knowledge of $\beta$. Motivated by the impossibility result, we propose a self-similarity condition to enable adaptivity. Importantly, we show that the self-similarity condition does not compromise the problem's inherent complexity since it preserves the regret lower bound $\Omega(T^{\frac{\beta+1}{2\beta+1}})$. Furthermore, we develop a smoothness-adaptive dynamic pricing algorithm and theoretically prove that the algorithm achieves this minimax optimal regret bound without the prior knowledge $\beta$.

**Sample Complexity Characterization for Linear Contextual MDPs**

JUNZE DENG · Yuan Cheng · Shaofeng Zou · Yingbin Liang

Contextual Markov decision processes (CMDPs) describe a class of reinforcement learning problems in which the transition kernels and reward functions can change over time with different MDPs indexed by a context variable. While CMDPs serve as an important framework to model many real-world applications with time-varying environments, they are largely unexplored from theoretical perspective. In this paper, we study CMDPs under two linear function approximation models: Model I with context-varying representations and common linear weights for all contexts; and Model II with common representations for all contexts and context-varying linear weights. For both models, we propose novel model-based algorithms and show that they enjoy guaranteed $\epsilon$-suboptimality gap with desired polynomial sample complexity. In particular, instantiating our result for the first model to the tabular CMDP improves the existing result by removing the reachability assumption. Our result for the second model is the first-known result for such a type of function approximation models. Comparison between our results for the two models further indicates that having context-varying features leads to much better sample efficiency than having common representations for all contexts under linear CMDPs.

**Queuing dynamics of asynchronous Federated Learning**

Louis Leconte · Matthieu Jonckheere · Sergey Samsonov · Eric Moulines

We study asynchronous federated learning mechanisms with nodes having potentially different computational speeds. In such an environment, each node is allowed to work on models with potential delays and contribute to updates to the central server at its own pace. Existing analyses of such algorithms typically depend on intractable quantities such as the maximum node delay and do not consider the underlying queuing dynamics of the system. In this paper, we propose a non-uniform sampling scheme for the central server that allows for lower delays with better complexity, taking into account the closed Jackson network structure of the associated computational graph. Our experiments clearly show a significant improvement of our method over current state-of-the-art asynchronous algorithms on image classification problems.

**Learning Populations of Preferences via Pairwise Comparison Queries**

Gokcan Tatli · Yi Chen · Ramya Korlakai Vinayak

Ideal point based preference learning using pairwise comparisons of type "Do you prefer a or b?" has emerged as a powerful tool for understanding how we make preferences. Existing preference learning approaches assume homogeneity and focus on learning preference on average over the population or require a large number of queries per individual to localize individual preferences. However, in practical scenarios with heterogeneous preferences and limited availability of responses, these approaches are impractical. Therefore, we introduce the problem of learning the distribution of preferences over a population via pairwise comparisons using only one response per individual. Due to binary answers from comparison queries, we focus on learning the mass of the underlying distribution in the regions created by the intersection of bisecting hyperplanes between queried item pairs. We investigate this fundamental question in both 1-D and higher dimensional settings with noiseless response to comparison queries. We show that the problem is identifiable in 1-D setting and provide recovery guarantees. We show that the problem is not identifiable for higher dimensional settings in general and establish sufficient condition for identifiability. We propose using a regularized recovery, and provide guarantees on the total variation distance between the true mass and the learned distribution. We validate our findings through simulations and experiments on real datasets. We also introduce a new dataset for this task collected on a real crowdsourcing platform.

**Distributionally Robust Model-based Reinforcement Learning with Large State Spaces**

Shyam Sundhar Ramesh · Pier Giuseppe Sessa · Yifan Hu · Andreas Krause · Ilija Bogunovic

Three major challenges in reinforcement learning are the complex dynamical systems with large state spaces, the costly data acquisition processes, and the deviation of real-world dynamics from the training environment deployment. To overcome these issues, we study distributionally robust Markov decision processes with continuous state spaces under the widely used Kullback–Leibler, chi-square, and total variation uncertainty sets. We propose a model-based approach that utilizes Gaussian Processes and the maximum variance reduction algorithm to efficiently learn multi-output nominal transition dynamics, leveraging access to a generative model (i.e., simulator). We further demonstrate the statistical sample complexity of the proposed method for different uncertainty sets. These complexity bounds are independent of the number of states and extend beyond linear dynamics, ensuring the effectiveness of our approach in identifying near-optimal distributionally-robust policies. The proposed method can be further combined with other model-free distributionally robust reinforcement learning methods to obtain a near-optimal robust policy. Experimental results demonstrate the robustness of our algorithm to distributional shifts and its superior performance in terms of the number of samples needed.

**A Neural Architecture Predictor based on GNN-Enhanced Transformer**

Xunzhi Xiang · Kun Jing · Jungang Xu

Neural architecture performance predictor is an efﬁcient approach for architecture estimation in Neural Architecture Search (NAS). However, existing predictors based on Graph Neural Networks (GNNs) are deﬁcient in modeling long-range interactions between operation nodes and prone to the problem of over-smoothing, which limits their ability to learn neural architecture representation. Furthermore, some Transformer-based predictors use simple position encodings to improve performance via self-attention mechanism, but they fail to fully exploit the subgraph structure information of the graph. To solve this problem, we propose a novel method to enhance the graph representation of neural architectures by combining GNNs and Transformer blocks. We evaluate the effectiveness of our predictor on NAS-Bench-101 and NAS-bench-201 benchmarks, the discovered architecture on DARTS search space achieves an accuracy of 97.61\% on CIFAR-10 dataset, which outperforms traditional position encoding methods such as adjacency and Laplacian matrices. The code of our work is available at \url{https://github.com/GNET}.

**Robust Data Clustering with Outliers via Transformed Tensor Low-Rank Representation**

Tong Wu

Recently, tensor low-rank representation (TLRR) has become a popular tool for tensor data recovery and clustering, due to its empirical success and theoretical guarantees. However, existing TLRR methods consider Gaussian or gross sparse noise, inevitably leading to performance degradation when the tensor data are contaminated by outliers or sample-specific corruptions. This paper develops an outlier-robust tensor low-rank representation (OR-TLRR) method that provides outlier detection and tensor data clustering simultaneously based on the t-SVD framework. For tensor observations with arbitrary outlier corruptions, OR-TLRR has provable performance guarantee for exactly recovering the row space of clean data and detecting outliers under mild conditions. Moreover, an extension of OR-TLRR is proposed to handle the case when parts of the data are missing. Finally, extensive experimental results on synthetic and real data demonstrate the effectiveness of the proposed algorithms. We release our code at \url{https://github.com/twugithub/2024-AISTATS-ORTLRR}.

**Regret Bounds for Risk-sensitive Reinforcement Learning with Lipschitz Dynamic Risk Measures**

Hao Liang · Zhiquan Luo

We study finite episodic Markov decision processes incorporating dynamic risk measures to capture risk sensitivity. To this end, we present two model-based algorithms applied to \emph{Lipschitz} dynamic risk measures, a wide range of risk measures that subsumes spectral risk measure, optimized certainty equivalent, and distortion risk measures, among others. We establish both regret upper bounds and lower bounds. Notably, our upper bounds demonstrate optimal dependencies on the number of actions and episodes while reflecting the inherent trade-off between risk sensitivity and sample complexity. Our approach offers a unified framework that not only encompasses multiple existing formulations in the literature but also broadens the application spectrum.

**CAD-DA: Controllable Anomaly Detection after Domain Adaptation by Statistical Inference**

Vo Nguyen Le Duy · Hsuan-Tien Lin · Ichiro Takeuchi

We propose a novel statistical method for testing the results of anomaly detection (AD) under domain adaptation (DA), which we call CAD-DA---controllable AD under DA. The distinct advantage of the CAD-DA lies in its ability to control the probability of misidentifying anomalies under a pre-specified level $\alpha$ (e.g., 0.05). The challenge within this DA setting is the necessity to account for the influence of DA to ensure the validity of the inference results. We overcome the challenge by leveraging the concept of Selective Inference to handle the impact of DA. To our knowledge, this is the first work capable of conducting a valid statistical inference within the context of DA. We evaluate the performance of the CAD-DA method on both synthetic and real-world datasets.

**Multitask Online Learning: Listen to the Neighborhood Buzz**

Juliette Achddou · Nicolò Cesa-Bianchi · Pierre Laforgue

We study multitask online learning in a setting where agents can only exchange information with their neighbors on an arbitrary communication network. We introduce MT-CO\textsubscript{2}OL, a decentralized algorithm for this setting whose regret depends on the interplay between the task similarities and the network structure. Our analysis shows that the regret of MT-CO\textsubscript{2}OL is never worse (up to constants) than the bound obtained when agents do not share information. On the other hand, our bounds significantly improve when neighboring agents operate on similar tasks. In addition, we prove that our algorithm can be made differentially private with a negligible impact on the regret. Finally, we provide experimental support for our theory.

**Time to Cite: Modeling Citation Networks using the Dynamic Impact Single-Event Embedding Model**

Nikolaos Nakis · Abdulkadir Celikkanat · Louis Boucherie · Sune Lehmann · Morten Mørup

Understanding the structure and dynamics of scientific research, i.e., the science of science (SciSci), has become an important area of research in order to address imminent questions including how scholars interact to advance science, how disciplines are related and evolve, and how research impact can be quantified and predicted. Central to the study of SciSci has been the analysis of citation networks. Here, two prominent modeling methodologies have been employed: one is to assess the citation impact dynamics of papers using parametric distributions, and the other is to embed the citation networks in a latent space optimal for characterizing the static relations between papers in terms of their citations. Interestingly, citation networks are a prominent example of single-event dynamic networks, i.e., networks for which each dyad only has a single event (i.e., the point in time of citation). We presently propose a novel likelihood function for the characterization of such single-event networks. Using this likelihood, we propose the Dynamic Impact Single-Event Embedding model (DISEE). The DISEE model characterizes the scientific interactions in terms of a latent distance model in which random effects account for citation heterogeneity while the time-varying impact is characterized using existing parametric representations for assessment of dynamic impact. We highlight the proposed approach on several real citation networks finding that DISEE well reconciles static latent distance network embedding approaches with classical dynamic impact assessments.

**Extended Deep Adaptive Input Normalization for Preprocessing Time Series Data for Neural Networks**

Marcus September · Francesco Sanna Passino · Leonie Goldmann · Anton Hinel

Data preprocessing is a crucial part of any machine learning pipeline, and it can have a significant impact on both performance and training efficiency. This is especially evident when using deep neural networks for time series prediction and classification: real-world time series data often exhibit irregularities such as multi-modality, skewness and outliers, and the model performance can degrade rapidly if these characteristics are not adequately addressed. In this work, we propose the EDAIN (Extended Deep Adaptive Input Normalization) layer, a novel adaptive neural layer that learns how to appropriately normalize irregular time series data for a given task in an end-to-end fashion, instead of using a fixed normalization scheme. This is achieved by optimizing its unknown parameters simultaneously with the deep neural network using back-propagation. Our experiments, conducted using synthetic data, a credit default prediction dataset, and a large-scale limit order book benchmark dataset, demonstrate the superior performance of the EDAIN layer when compared to conventional normalization methods and existing adaptive time series preprocessing layers.

**Classifier Calibration with ROC-Regularized Isotonic Regression**

Eugène Berta · Francis Bach · Michael Jordan

Calibration of machine learning classifiers is necessary to obtain reliable and interpretable predictions, bridging the gap between model outputs and actual probabilities. One prominent technique, isotonic regression (IR), aims at calibrating binary classifiers by minimizing the cross entropy with respect to monotone transformations. IR acts as an adaptive binning procedure that is able to achieve a calibration error of zero but leaves open the issue of the effect on performance. We first prove that IR preserves the convex hull of the ROC curve---an essential performance metric for binary classifiers. This ensures that a classifier is calibrated while controlling for over-fitting of the calibration set. We then present a novel generalization of isotonic regression to accommodate classifiers with $K$-classes. Our method constructs a multidimensional adaptive binning scheme on the probability simplex, again achieving a multi-class calibration error equal to zero. We regularize this algorithm by imposing a form of monotony that preserves the $K$-dimensional ROC surface of the classifier. We show empirically that this general monotony criterion is effective in striking a balance between reducing cross entropy loss and avoiding over-fitting of the calibration set.

**RL in Markov Games with Independent Function Approximation: Improved Sample Complexity Bound under the Local Access Model**

Junyi FAN · Yuxuan Han · JIALIN ZENG · Jian-Feng Cai · Yang Wang · Yang Xiang · Jiheng Zhang

Efficiently learning equilibria with large state and action spaces in general-sum Markov games while overcoming the curse of multi-agency is a challenging problem.Recent works have attempted to solve this problem by employing independent linear function classes to approximate the marginal $Q$-value for each agent.However, existing sample complexity bounds under such a framework have a suboptimal dependency on the desired accuracy $\varepsilon$ or the action space. In this work, we introduce a new algorithm, Lin-Confident-FTRL, for learning coarse correlated equilibria (CCE) with local access to the simulator, i.e., one can interact with the underlying environment on the visited states. Up to a logarithmic dependence on the size of the state space, Lin-Confident-FTRL learns $\epsilon$-CCE with a provable optimal accuracy bound $O(\epsilon^{-2})$and gets rids of the linear dependency on the action space, while scaling polynomially with relevant problem parameters (such as the number of agents and time horizon). Moreover, our analysis of Linear-Confident-FTRL generalizes the virtual policy iteration technique in the single-agent local planning literature, which yields a new computationally efficient algorithm with a tighter sample complexity bound when assuming random access to the simulator.

**Multiclass Learning from Noisy Labels for Non-decomposable Performance Measures**

Mingyuan Zhang · Shivani Agarwal

There has been much interest in recent years in learning good classifiers from data with noisy labels. Most work on learning from noisy labels has focused on standard loss-based performance measures. However, many machine learning problems require using non-decomposable performance measures which cannot be expressed as the expectation or sum of a loss on individual examples; these include for example the H-mean, Q-mean and G-mean in class imbalance settings, and the Micro F1 in information retrieval. In this paper, we design algorithms to learn from noisy labels for two broad classes of multiclass non-decomposable performance measures, namely, monotonic convex and ratio-of-linear, which encompass all the above examples. Our work builds on the Frank-Wolfe and Bisection based methods of Narasimhan et al. (2015). In both cases, we develop noise-corrected versions of the algorithms under the widely studied family of class-conditional noise models. We provide regret (excess risk) bounds for our algorithms, establishing that even though they are trained on noisy data, they are Bayes consistent in the sense that their performance converges to the optimal performance w.r.t. the clean (non-noisy) distribution. Our experiments demonstrate the effectiveness of our algorithms in handling label noise.

**Sketch In, Sketch Out: Accelerating both Learning and Inference for Structured Prediction with Kernels**

Tamim El Ahmad · Luc Brogat-Motte · Pierre Laforgue · Florence d'Alché-Buc

Leveraging the kernel trick in both the input and output spaces, surrogate kernel methods are a flexible and theoretically grounded solution to structured output prediction. If they provide state-of-the-art performance on complex data sets of moderate size (e.g., in chemoinformatics), these approaches however fail to scale. We propose to equip surrogate kernel methods with sketching-based approximations, applied to both the input and output feature maps. We prove excess risk bounds on the original structured prediction problem, showing how to attain close-to-optimal rates with a reduced sketch size that depends on the eigendecay of the input/output covariance operators. From a computational perspective, we show that the two approximations have distinct but complementary impacts: sketching the input kernel mostly reduces training time, while sketching the output kernel decreases the inference time. Empirically, our approach is shown to scale, achieving state-of-the-art performance on benchmark data sets where non-sketched methods are intractable.

**Efficient Model-Based Concave Utility Reinforcement Learning through Greedy Mirror Descent**

Bianca Marin Moreno · Margaux Bregere · Pierre Gaillard · Nadia Oudjane

Many machine learning tasks can be solved by minimizing a convex function of an occupancy measure over the policies that generate them. These include reinforcement learning, imitation learning, among others. This more general paradigm is called the Concave Utility Reinforcement Learning problem (CURL). Since CURL invalidates classical Bellman equations, it requires new algorithms. We introduce MD-CURL, a new algorithm for CURL in a finite horizon Markov decision process. MD-CURL is inspired by mirror descent and uses a non-standard regularization to achieve convergence guarantees and a simple closed-form solution, eliminating the need for computationally expensive projection steps typically found in mirror descent approaches. We then extend CURL to an online learning scenario and present Greedy MD-CURL, a new method adapting MD-CURL to an online, episode-based setting with partially unknown dynamics. Like MD-CURL, the online version Greedy MD-CURL benefits from low computational complexity, while guaranteeing sub-linear or even logarithmic regret, depending on the level of information available on the underlying dynamics.

**GRAWA: Gradient-based Weighted Averaging for Distributed Training of Deep Learning Models**

Tolga Dimlioglu · Anna Choromanska

We study distributed training of deep learning models in time-constrained environments. We propose a new algorithm that periodically pulls workers towards the center variable computed as a weighted average of workers, where the weights are inversely proportional to the gradient norms of the workers such that recovering the flat regions in the optimization landscape is prioritized. We develop two asynchronous variants of the proposed algorithm that we call Model-level and Layer-level Gradient-based Weighted Averaging (resp. MGRAWA and LGRAWA), which differ in terms of the weighting scheme that is either done with respect to the entire model or is applied layer-wise. On the theoretical front, we prove the convergence guarantee for the proposed approach in both convex and non-convex settings. We then experimentally demonstrate that our algorithms outperform the competitor methods by achieving faster convergence and recovering better quality and flatter local optima. We also carry out an ablation study to analyze the scalability of the proposed algorithms in more crowded distributed training environments. Finally, we report that our approach requires less frequent communication and fewer distributed updates compared to the state-of-the-art baselines.

**A Doubly Robust Approach to Sparse Reinforcement Learning**

Wonyoung Kim · Garud Iyengar · Assaf Zeevi

We propose a new regret minimization algorithm for episodic sparse linear Markov decision process (SMDP) where the state-transition distribution is a linear function of observed features.The only previously known algorithm for SMDP requires the knowledge of the sparsity parameter and oracle access to an unknown policy.We overcome these limitations by combining the doubly robust method that allows one to use feature vectors of \emph{all} actions with a novel analysis technique that enables the algorithm to use data from all periods in all episodes.The regret of the proposed algorithm is $\tilde{O}(\sigma^{-1}_{\min}s_{\star} H \sqrt{N})$, where $\sigma_{\min}$ denotes the restrictive the minimum eigenvalue of the average Gram matrix of feature vectors, $s_\star$ is the sparsity parameter, $H$ is the length of an episode, and $N$ is the number of rounds.We provide a lower regret bound that matches the upper bound to logarithmic factors on a newly identified subclass of SMDPs.Our numerical experiments support our theoretical results and demonstrate the superior performance of our algorithm.

**Analysis of Kernel Mirror Prox for Measure Optimization**

Pavel Dvurechenskii · Jia-Jie Zhu

By choosing a suitable function space as the dual to the non-negative measure cone, we study in a unified framework a class of functional saddle-point optimization problems, which we term the Mixed Functional Nash Equilibrium (MFNE), that underlies several existing machine learning algorithms, such as implicit generative models, distributionally robust optimization (DRO), and Wasserstein barycenters. We model the saddle-point optimization dynamics as an interacting Fisher-Rao-RKHS gradient flow when the function space is chosen as a reproducing kernel Hilbert space (RKHS). As a discrete time counterpart, we propose a primal-dual kernel mirror prox (KMP) algorithm, which uses a dual step in the RKHS, and a primal entropic mirror prox step. We then provide a unified convergence analysis of KMP in an infinite-dimensional setting for this class of MFNE problems, which establishes a convergence rate of $O(1/N)$ in the deterministic case and $O(1/\sqrt{N})$ in the stochastic case, where $N$ is the iteration counter. As a case study, we apply our analysis to DRO, providing algorithmic guarantees for DRO robustness and convergence.

**Online Learning in Contextual Second-Price Pay-Per-Click Auctions**

Mengxiao Zhang · Haipeng Luo

We study online learning in contextual pay-per-click auctions where at each of the $T$ rounds, the learner receives some context along with a set of ads and needs to make an estimate on their click-through rate (CTR) in order to run a second-price pay-per-click auction. The learner's goal is to minimize her regret, defined as the gap between her total revenue and that of an oracle strategy that always makes perfect CTR predictions. We first show that $\sqrt{T}$-regret is obtainable via a computationally inefficient algorithm and that it is unavoidable since our algorithm is no easier than the classical multi-armed bandit problem. A by-product of our results is a $\sqrt{T}$-regret bound for the simpler non-contextual setting, improving upon a recent work of [Feng et al., 2023] by removing the inverse CTR dependency that could be arbitrarily large. Then, borrowing ideas from recent advances on efficient contextual bandit algorithms, we develop two practically efficient contextual auction algorithms: the first one uses the exponential weight scheme with optimistic square errors and maintains the same $\sqrt{T}$-regret bound, while the second one reduces the problem to online regression via a simple epsilon-greedy strategy, albeit with a worse regret bound. Finally, we conduct experiments on a synthetic dataset to showcase the effectiveness and superior performance of our algorithms.

**Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients**

Chris Cundy · Rishi Desai · Stefano Ermon

As reinforcement learning techniques are increasingly applied to real-world decision problems, attention has turned to how these algorithms use potentially sensitive information.We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions.We give examples of how this setting covers real-world problems in privacy for sequential decision-making.We solve this problem in the policy gradients framework by introducing a regularizer based on the mutual information (MI) between the sensitive state and the actions.We develop a model-based stochastic gradient estimator for optimization of privacy-constrained policies. We also discuss an alternative MI regularizer that serves as an upper bound to our main MI regularizer and can be optimized in a model-free setting, and a powerful direct estimator that can be used in an environment with differentiable dynamics.We contrast previous work in differentially-private RL to our mutual-information formulation of information disclosure.Experimental results show that our training method results in policies that hide the sensitive state, even in challenging high-dimensional tasks.

**Information Theoretically Optimal Sample Complexity of Learning Dynamical Directed Acyclic Graphs**

Mishfad Shaikh Veedu · Deepjyoti Deka · Murti Salapaka

In this article, the optimal sample complexity of learning the underlying interactions or dependencies of a Linear Dynamical System (LDS) over a Directed Acyclic Graph (DAG) is studied. We call such a DAG underlying an LDS as dynamical DAG (DDAG). In particular, we consider a DDAG where the nodal dynamics are driven by unobserved exogenous noise sources that are wide-sense stationary (WSS) in time but are mutually uncorrelated, and have the same power spectral density (PSD). Inspired by the static DAG setting, a metric and an algorithm based on the PSD matrix of the observed time series are proposed to reconstruct the DDAG. It is shown that the optimal sample complexity (or length of state trajectory) needed to learn the DDAG is $n=\Theta(q\log(p/q))$, where $p$ is the number of nodes and $q$ is the maximum number of parents per node. To prove the sample complexity upper bound, a concentration bound for the PSD estimation is derived, under two different sampling strategies. A matching min-max lower bound using generalized Fano’s inequality also is provided, thus showing the order optimality of the proposed algorithm. The codes used in the paper are available at \url{https://github.com/Mishfad/Learning-Dynamical-DAGs}

**Local Causal Discovery with Linear non-Gaussian Cyclic Models**

Haoyue Dai · Ignavier Ng · Yujia Zheng · Zhengqing Gao · Kun Zhang

Local causal discovery is of great practical significance, as there are often situations where the discovery of the global causal structure is unnecessary, and the interest lies solely on a single target variable. Most existing local methods utilize conditional independence relations, providing only a partially directed graph, and assume acyclicity for the ground-truth structure, even though real-world scenarios often involve cycles like feedback mechanisms. In this work, we present a general, unified local causal discovery method with linear non-Gaussian models, whether they are cyclic or acyclic. We extend the application of independent component analysis from the global context to independent subspace analysis, enabling the exact identification of the equivalent local directed structures and causal strengths from the Markov blanket of the target variable. We also propose an alternative regression-based method in the particular acyclic scenarios. Our identifiability results are empirically validated using both synthetic and real-world datasets.

We present a new adaptive algorithm for learning discrete distributions under distribution drift. In this setting, we observe a sequence of independent samples from a discrete distribution that is changing over time, and the goal is to estimate the current distribution. Since we have access to only a single sample for each time step, a good estimation requires a careful choice of the number of past samples to use. To use more samples, we must resort to samples further in the past, and we incur a drift error due to the bias introduced by the change in distribution. On the other hand, if we use a small number of past samples, we incur a large statistical error as the estimation has a high variance. We present a novel adaptive algorithm that can solve this trade-off without any prior knowledge of the drift. Unlike previous adaptive results, our algorithm characterizes the statistical error using data-dependent bounds. This technicality enables us to overcome the limitations of the previous work that require a fixed finite support whose size is known in advance and that cannot change over time. Additionally, we can obtain tighter bounds depending on the complexity of the drifting distribution, and also consider distributions with infinite support.

Resampling methods such as the bootstrap have proven invaluable in the field of machine learning. However, the applicability of traditional bootstrap methods is limited when dealing with large streams of dependent data, such as time series or spatially correlated observations. In this paper, we propose a novel bootstrap method that is designed to account for data dependencies and can be executed online, making it particularly suitable for real-time applications.This method is based on an autoregressive sequence of increasingly dependent resampling weights.We prove the theoretical validity of the proposed bootstrap scheme under general conditions. We demonstrate the effectiveness of our approach through extensive simulations and show that it provides reliable uncertainty quantification even in the presence of complex data dependencies. Our work bridges the gap between classical resampling techniques and the demands of modern data analysis, providing a valuable tool for researchers and practitioners in dynamic, data-rich environments.

**Achieving Fairness through Separability: A Unified Framework for Fair Representation Learning**

Taeuk Jang · Hongchang Gao · Pengyi Shi · Xiaoqian Wang

Fairness is a growing concern in machine learning as state-of-the-art models may amplify social prejudice by making biased predictions against specific demographics such as race and gender. Such discrimination raises issues in various fields such as employment, criminal justice, and trust score evaluation. To address the concerns, we propose learning fair representation through a straightforward yet effective approach to project intrinsic information while filtering sensitive information for downstream tasks. Our model consists of two goals: one is to ensure that the latent data from different demographic groups is non-separable (i.e., make the latent data distribution independent of the sensitive feature to improve fairness); the other is to maximize the separability of latent data from different classes (i.e., maintain the discriminative power of data for the sake of the downstream tasks like classification). Our method adopts a non-zero-sum adversarial game to minimize the distance between data from different demographic groups while maximizing the margin between data from different classes. Moreover, the proposed objective function can be easily generalized to multiple sensitive attributes and multi-class scenarios as it upper bounds popular fairness metrics in these cases. We provide theoretical analysis of the fairness of our model and validate w.r.t.\ both fairness and predictive performance on benchmark datasets.

**Is this model reliable for everyone? Testing for strong calibration**

Jean Feng · Alexej Gossmann · Romain Pirracchio · Nicholas Petrick · Gene Pennello · Berkman Sahiner

In a well-calibrated risk prediction model, the average predicted probability is close to the true event rate for any given subgroup. Such models are reliable across heterogeneous populations and satisfy strong notions of algorithmic fairness. However, the task of auditing a model for strong calibration is well-known to be difficult -- particularly for machine learning (ML) algorithms -- due to the sheer number of potential subgroups. As such, common practice is to only assess calibration with respect to a few predefined subgroups. Recent developments in goodness-of-fit testing offer potential solutions but are not designed for settings with weak signal or where the poorly calibrated subgroup is small, as they either overly subdivide the data or fail to divide the data at all. We introduce a new testing procedure based on the following insight: if we can reorder observations by their expected residuals, there should be a change in the association between the predicted and observed residuals along this sequence if a poorly calibrated subgroup exists. This lets us reframe the problem of calibration testing into one of changepoint detection, for which powerful methods already exist. We begin with introducing a sample-splitting procedure where a portion of the data is used to train a suite of candidate models for predicting the residual, and the remaining data are used to perform a score-based cumulative sum (CUSUM) test. To further improve power, we then extend this adaptive CUSUM test to incorporate cross-validation, while maintaining Type I error control under minimal assumptions. Compared to existing methods, the proposed procedure consistently achieved higher power in simulation studies and more than doubled the power when auditing a mortality risk prediction model.

**Pixel-wise Smoothing for Certified Robustness against Camera Motion Perturbations**

Hanjiang Hu · Zuxin Liu · Linyi Li · Jiacheng Zhu · DING ZHAO

Deep learning-based visual perception models lack robustness when faced with camera motion perturbations in practice. The current certification process for assessing robustness is costly and time-consuming due to the extensive number of image projections required for Monte Carlo sampling in the 3D camera motion space. To address these challenges, we present a novel, efficient, and practical framework for certifying the robustness of 3D-2D projective transformations against camera motion perturbations. Our approach leverages a smoothing distribution over the 2D-pixel space instead of in the 3D physical space, eliminating the need for costly camera motion sampling and significantly enhancing the efficiency of robustness certifications. With the pixel-wise smoothed classifier, we are able to fully upper bound the projection errors using a technique of uniform partitioning in camera motion space. Additionally, we extend our certification framework to a more general scenario where only a single-frame point cloud is required in the projection oracle. Through extensive experimentation, we validate the trade-off between effectiveness and efficiency enabled by our proposed method. Remarkably, our approach achieves approximately 80\% certified accuracy while utilizing only 30\% of the projected image frames.

**Emergent specialization from participation dynamics and multi-learner retraining**

Sarah Dean · Mihaela Curmei · Lillian Ratliff · Jamie Morgenstern · Maryam Fazel

Numerous online services are data-driven: the behavior of users affects the system's parameters, and the system's parameters affect the users' experience of the service, which in turn affects the way users may interact with the system.For example, people may choose to use a service only for tasks that already works well, or they may choose to switch to a different service.These adaptations influence the ability of a system to learn about a population of users and tasks in order to improve its performance broadly.In this work, we analyze a class of such dynamics---where users allocate their participation amongst services to reduce the individual risk they experience, and services update their model parameters to reduce the service's risk on their current user population.We refer to these dynamics as \emph{risk-reducing}, which cover a broad class of common model updates including gradient descent and multiplicative weights.For this general class of dynamics, we show that asymptotically stable equilibria are always segmented, with sub-populations allocated to a single learner.Under mild assumptions, the utilitarian social optimum is a stable equilibrium.In contrast to previous work, which shows that repeated risk minimization can result in representation disparity and high overall loss with a single learner (Hashimoto et al., 2018; Miller et al., 2021), we find that repeated myopic updates with multiple learners lead to better outcomes.We illustrate the phenomena via a simulated example initialized from real data.

Interpretability is crucial for doctors, hospitals, pharmaceutical companies and biotechnology corporations to analyze and make decisions for high stakes problemsthat involve human health. Tree-based methods have been widely adopted for survival analysis due to their appealing interpretablility and their ability to capture complex relationships. However, most existing methods to produce survival trees rely on heuristic (or greedy) algorithms, which risk producing sub-optimal models. We present a dynamic-programming-with-bounds approach that finds provably-optimal sparse survival tree models, frequently in only a few seconds.

**Explanation-based Training with Differentiable Insertion/Deletion Metric-aware Regularizers**

Yuya Yoshikawa · Tomoharu Iwata

The quality of explanations for the predictions made by complex machine learning predictors is often measured using insertion and deletion metrics, which assess the faithfulness of the explanations, i.e., how accurately the explanations reflect the predictor's behavior.To improve the faithfulness, we propose insertion/deletion metric-aware explanation-based optimization (ID-ExpO), which optimizes differentiable predictors to improve both the insertion and deletion scores of the explanations while maintaining their predictive accuracy.Because the original insertion and deletion metrics are non-differentiable with respect to the explanations and directly unavailable for gradient-based optimization, we extend the metrics so that they are differentiable and use them to formalize insertion and deletion metric-based regularizers.Our experimental results on image and tabular datasets show that the deep neural network-based predictors that are fine-tuned using ID-ExpO enable popular post-hoc explainers to produce more faithful and easier-to-interpret explanations while maintaining high predictive accuracy.The code is available at https://github.com/yuyay/idexpo.

**Online Distribution Learning with Local Privacy Constraints**

Jin Sima · Changlong Wu · Olgica Milenkovic · Wojciech Szpankowski

We study the problem of online conditional distribution estimation with \emph{unbounded} label sets under local differential privacy. The problem may be succinctly stated as follows. Let $\mathcal{F}$ be a distribution-valued function class with an unbounded label set. Our aim is to estimate an \emph{unknown} function $f\in \mathcal{F}$ in an online fashion. More precisely, at time $t$, given a sample ${\mathbf{x}}_t$, we generate an estimate of $f({\mathbf{x}}_t)$ using only a \emph{privatized} version of the true \emph{labels} sampled from $f({\mathbf{x}}_t)$. The objective is to minimize the cumulative KL-risk of a finite horizon $T$. We show that under $(\epsilon,0)$-local differential privacy for the labels, the KL-risk equals $\tilde{\Theta}(\frac{1}{\epsilon}\sqrt{KT}),$ up to poly-logarithmic factors, where $K=|\mathcal{F}|$. This result significantly differs from the $\tilde{\Theta}(\sqrt{T\log K})$ bound derived in Wu et al., (2023a) for \emph{bounded} label sets. As a side-result, our approach recovers a nearly tight upper bound for the hypothesis selection problem of Gopi et al., (2020), which has only been established for the \emph{batch} setting.

**The Risks of Recourse in Binary Classification**

Hidde Fokkema · Damien Garreau · Tim van Erven

Algorithmic recourse provides explanations that help users overturn an unfavorable decision by a machine learning system. But so far very little attention has been paid to whether providing recourse is beneficial or not. We introduce an abstract learning-theoretic framework that compares the risks (i.e., expected losses) for classification with and without algorithmic recourse. This allows us to answer the question of when providing recourse is beneficial or harmful at the population level. Surprisingly, we find that there are many plausible scenarios in which providing recourse turns out to be harmful, because it pushes users to regions of higher class uncertainty and therefore leads to more mistakes. We further study whether the party deploying the classifier has an incentive to strategize in anticipation of having to provide recourse, and we find that sometimes they do, to the detriment of their users. Providing algorithmic recourse may therefore also be harmful at the systemic level. We confirm our theoretical findings in experiments on simulated and real-world data. All in all, we conclude that the current concept of algorithmic recourse is not reliably beneficial, and therefore requires rethinking.

**Quantifying Uncertainty in Natural Language Explanations of Large Language Models**

Sree Harsha Tanneru · Chirag Agarwal · Himabindu Lakkaraju

Large Language Models (LLMs) are increasingly used as powerful tools for several high-stakes natural language processing (NLP) applications. Recent prompting works claim to elicit intermediate reasoning steps and key tokens that serve as proxy explanations for LLM predictions. However, there is no certainty whether these explanations are reliable and reflect the LLM’s behavior. In this work, we make one of the first attempts at quantifying the uncertainty in explanations of LLMs. To this end, we propose two novel metrics --- Verbalized Uncertainty and Probing Uncertainty --- to quantify the uncertainty of generated explanations. While verbalized uncertainty involves prompting the LLM to express its confidence in its explanations, probing uncertainty leverages sample and model perturbations as a means to quantify the uncertainty. Our empirical analysis of benchmark datasets reveals that verbalized uncertainty is not a reliable estimate of explanation confidence. Further, we show that the probing uncertainty estimates are correlated with the faithfulness of an explanation, with lower uncertainty corresponding to explanations with higher faithfulness. Our study provides insights into the challenges and opportunities of quantifying uncertainty in LLM explanations, contributing to the broader discussion of the trustworthiness of foundation models.

**Better Representations via Adversarial Training in Pre-Training: A Theoretical Perspective**

Yue Xing · Xiaofeng Lin · Qifan Song · Yi Xu · Belinda Zeng · Guang Cheng

Pre-training is known to generate universal representations for downstream tasks in large-scale deep learning such as large language models. Existing literature, e.g., Kim et al. (2020), empirically observe that the downstream tasks can inherit the adversarial robustness of the pre-trained model. We provide theoretical justifications for this robustness inheritance phenomenon. Our theoretical results reveal that feature purification plays an important role in connecting the adversarial robustness of the pre-trained model and the downstream tasks in two-layer neural networks. Specifically, we show that (i) with adversarial training, each hidden node tends to pick only one (or a few) feature; (ii) without adversarial training, the hidden nodes can be vulnerable to attacks. This observation is valid for both supervised pre-training and contrastive learning. With purified nodes, it turns out that clean training is enough to achieve adversarial robustness in downstream tasks.

**Effect of Ambient-Intrinsic Dimension Gap on Adversarial Vulnerability**

Rajdeep Haldar · Yue Xing · Qifan Song

The existence of adversarial attacks on machine learning models imperceptible to a human is still quite a mystery from a theoretical perspective. In this work, we introduce two notions of adversarial attacks: natural or on-manifold attacks, which are perceptible by a human/oracle, and unnatural or off-manifold attacks, which are not. We argue that the existence of the off-manifold attacks is a natural consequence of the dimension gap between the intrinsic and ambient dimensions of the data. For 2-layer ReLU networks, we prove that even though the dimension gap does not affect generalization performance on samples drawn from the observed data space, it makes the clean-trained model more vulnerable to adversarial perturbations in the off-manifold direction of the data space. Our main results provide an explicit relationship between the $\ell_2,\ell_{\infty}$ attack strength of the on/off-manifold attack and the dimension gap.

**Fast and Adversarial Robust Kernelized SDU Learning**

Yajing Fan · wanli shi · Yi Chang · Bin Gu

SDU learning, a weakly supervised learning problem with only pairwise similarities, dissimilarities data points and unlabeled data available, has many practical applications. However, it is still lacking in defense against adversarial samples, and its learning process can be expensive. To address this gap, we propose a novel adversarial training framework for SDU learning.Our approach reformulates the conventional minimax problem as an equivalent minimization problem based on the kernel perspective, departing from traditional confrontational training methods. Additionally, we employ the random gradient method and random features to accelerate the training process. Theoretical analysis shows that our method can converge to a stationary point at a rate of $\mathcal{O}(1/T^{1/4})$.Our experimental results show that our algorithm is superior to other adversarial training methods in terms of generalization, efficiency and scalability against various adversarial attacks.

**Uncertainty Matters: Stable Conclusions under Unstable Assessment of Fairness Results**

Ainhize Barrainkua · Paula Gordaliza · Jose A Lozano · Novi Quadrianto

Recent studies highlight the effectiveness of Bayesian methods in assessing algorithm performance, particularly in fairness and bias evaluation. We present Uncertainty Matters, a multi-objective uncertainty-aware algorithmic comparison framework. In fairness focused scenarios, it models sensitive group confusion matrices using Bayesian updates and facilitates joint comparison of performance (e.g., accuracy) and fairness metrics (e.g., true positive rate parity). Our approach works seamlessly with common evaluation methods like K-fold cross-validation, effectively addressing dependencies among the K posterior metric distributions. The integration of correlated information is carried out through a procedure tailored to the classifier's complexity. Experiments demonstrate that the insights derived from algorithmic comparisons employing the Uncertainty Matters approach are more informative, reliable, and less influenced by particular data partitions. Code for the paper is publicly available at \url{https://github.com/abarrainkua/UncertaintyMatters}.

**Analyzing Explainer Robustness via Probabilistic Lipschitzness of Prediction Functions**

Zulqarnain Khan · Davin Hill · Aria Masoomi · Joshua Bone · Jennifer Dy

Machine learning methods have significantly improved in their predictive capabilities, but at the same time they are becoming more complex and less transparent. As a result, explainers are often relied on to provide interpretability to these black-box prediction models. As crucial diagnostics tools, it is important that these explainers themselves are robust. In this paper we focus on one particular aspect of robustness, namely that an explainer should give similar explanations for similar data inputs. We formalize this notion by introducing and defining explainer astuteness, analogous to astuteness of prediction functions. Our formalism allows us to connect explainer robustness to the predictor's probabilistic Lipschitzness, which captures the probability of local smoothness of a function. We provide lower bound guarantees on the astuteness of a variety of explainers (e.g., SHAP, RISE, CXPlain) given the Lipschitzness of the prediction function. These theoretical results imply that locally smooth prediction functions lend themselves to locally robust explanations. We evaluate these results empirically on simulated as well as real datasets.

**Certified private data release for sparse Lipschitz functions**

Konstantin Donhauser · Johan Lokna · Amartya Sanyal · March Boedihardjo · Robert Hönig · Fanny Yang

As machine learning has become more relevant for everyday applications, a natural requirement is the protection of the privacy of the training data. When the relevant learning questions are unknown in advance, or hyper-parameter tuning plays a central role, one solution is to release a differentially private synthetic data set that leads to similar conclusions as the original training data. In this work, we introduce an algorithm that enjoys fast rates for the utility loss for sparse Lipschitz queries. Furthermore, we show how to obtain a certificate for the utility loss for a large class of algorithms.

**Analysis of Privacy Leakage in Federated Large Language Models**

Minh Vu · Truc Nguyen · Tre' Jeter · My T. Thai

With the rapid adoption of Federated Learning (FL) as the training and tuning protocol for applications utilizing Large Language Models (LLMs), recent research highlights the need for significant modifications to FL to accommodate the large-scale of LLMs. While substantial adjustments to the protocol have been introduced as a response, comprehensive privacy analysis for the adapted FL protocol is currently lacking.To address this gap, our work delves into an extensive examination of the privacy analysis of FL when used for training LLMs, both from theoretical and practical perspectives. In particular, we design two active membership inference attacks with guaranteed theoretical success rates to assess the privacy leakages of various adapted FL configurations. Our theoretical findings are translated into practical attacks, revealing substantial privacy vulnerabilities in popular LLMs, including BERT, RoBERTa, DistilBERT, and OpenAI's GPTs, across multiple real-world language datasets. Additionally, we conduct thorough experiments to evaluate the privacy leakage of these models when data is protected by state-of-the-art differential privacy (DP) mechanisms.

```
Machine learning models are widely applied in various fields. Stakeholders often use post-hoc feature importance methods to better understand the input features' contribution to the models' predictions. The interpretation of the importance values provided by these methods is frequently based on the relative order of the features (their ranking) rather than the importance values themselves. Since the order may be unstable, we present a framework for quantifying the uncertainty in global importance values. We propose a novel method for the post-hoc interpretation of feature importance values that is based on the framework and pairwise comparisons of the feature importance values. This method produces simultaneous confidence intervals for the features' ranks, which include the ``true'' (infinite sample) ranks with high probability, and enables the selection of the set of top-k important features.
```

**Taming False Positives in Out-of-Distribution Detection with Human Feedback**

Harit Vishwakarma · Heguang Lin · Ramya Korlakai Vinayak

Robustness to out-of-distribution (OOD) samples is crucial for the safe deployment of machine learning models in the open world. Recent works have focused on designing scoring functions to quantify OOD uncertainty. Setting appropriate thresholds for these scoring functions for OOD detection is challenging as OOD samples are often unavailable up front. Typically, thresholds are set to achieve a desired true positive rate (TPR), e.g., $95\%$ TPR. However, this can lead to very high false positive rates (FPR), ranging from 60 to 96\%, as observed in the Open-OOD benchmark. In safety critical real-life applications, e.g., medical diagnosis, controlling the FPR is essential when dealing with various OOD samples dynamically. To address these challenges, we propose a mathematically grounded OOD detection framework that leverages expert feedback to \emph{safely} update the threshold on the fly. We provide theoretical results showing that it is guaranteed to meet the FPR constraint at all times while minimizing the use of human feedback. Another key feature of our framework is that it can work with any scoring function for OOD uncertainty quantification. Empirical evaluation of our system on synthetic and benchmark OOD datasets shows that our method can maintain FPR at most $5\%$ while maximizing TPR.

**On the Privacy of Selection Mechanisms with Gaussian Noise**

Jonathan Lebensold · Doina Precup · Borja Balle

Report Noisy Max and Above Threshold are two classical differentially private (DP) selection mechanisms. Their output is obtained by adding noise to a sequence of low-sensitivity queries and reporting the identity of the query whose (noisy) answer satisfies a certain condition. Pure DP guarantees for these mechanisms are easy to obtain when Laplace noise is added to the queries. On the other hand, when instantiated using Gaussian noise, standard analyses only yield approximate DP guarantees despite the fact that the outputs of these mechanisms lie in a discrete space. In this work, we revisit the analysis of Report Noisy Max and Above Threshold with Gaussian noise and show that, under the additional assumption that the underlying queries are bounded, it is possible to provide pure ex-ante DP bounds for Report Noisy Max and pure ex-post DP bounds for Above Threshold. The resulting bounds are tight and depend on closed-form expressions that can be numerically evaluated using standard methods. Empirically we find these lead to tighter privacy accounting in the high privacy, low data regime. Further, we propose a simple privacy filter for composing pure ex-post DP guarantees, and use it to derive a fully adaptive Gaussian Sparse Vector Technique mechanism. Finally, we provide experiments on mobility and energy consumption datasets demonstrating that our Sparse Vector Technique is practically competitive with previous approaches and requires less hyper-parameter tuning.

**Operationalizing Counterfactual Metrics: Incentives, Ranking, and Information Asymmetry**

Serena Wang · Stephen Bates · P Aronow · Michael Jordan

From the social sciences to machine learning, it is well documented that metrics do not always align with social welfare. In healthcare, Dranove et al. (2003) showed that publishing surgery mortality metrics actually harmed sicker patients by increasing provider selection behavior. Using a principal-agent model, we analyze the incentive misalignments that arise from such average treated outcome metrics, and show that the incentives driving treatment decisions would align with maximizing total patient welfare if the metrics (i) accounted for counterfactual untreated outcomes and (ii) considered total welfare instead of averaging over treated patients. Operationalizing this, we show how counterfactual metrics can be modified to behave reasonably in patient-facing ranking systems. Extending to realistic settings when providers observe more about patients than the regulatory agencies do, we bound the decay in performance by the degree of information asymmetry between principal and agent. In doing so, our model connects principal-agent information asymmetry with unobserved heterogeneity in causal inference.

**Solving Attention Kernel Regression Problem via Pre-conditioner**

Zhao Song · Junze Yin · Lichen Zhang

Attention mechanism is the key to large language models, and attention matrix serves as an algorithmic and computational bottleneck for such a scheme. In this paper, we define two problems, motivated by designing fast algorithms for \emph{proxy} of attention matrix and solving regressions against them. Given an input matrix $A\in \mathbb{R}^{n\times d}$ with $n\gg d$ and a response vector $b$, we first consider the matrix exponential of the matrix $A^\top A$ as a proxy, and we in turn design algorithms for two types of regression problems: $\min_{x\in \mathbb{R}^d}\|(A^\top A)^jx-b\|_2$ and $\min_{x\in \mathbb{R}^d}\|A(A^\top A)^jx-b\|_2$ for any positive integer $j$. Studying algorithms for these regressions is essential, as matrix exponential can be approximated term-by-term via these smaller problems. The second proxy is applying exponential entrywise to the Gram matrix, denoted by $\exp(AA^\top)$ and solving the regression $\min_{x\in \mathbb{R}^n}\|\exp(AA^\top)x-b \|_2$. We call this problem the \emph{attention kernel regression} problem, as the matrix $\exp(AA^\top)$ could be viewed as a kernel function with respect to $A$. We design fast algorithms for these regression problems, based on sketching and preconditioning. We hope these efforts will provide an alternative perspective of studying efficient approximation of attention matrices.

**Thompson Sampling Itself is Differentially Private**

Tingting Ou · Rachel Cummings · Marco Avella

In this work we first show that the classical Thompson sampling algorithm for multi-arm bandits is differentially private as-is, without any modification. We provide per-round privacy guarantees as a function of problem parameters and show composition over $T$ rounds; since the algorithm is unchanged, existing $O(\sqrt{NT\log N})$ regret bounds still hold and there is no loss in performance due to privacy. We then show that simple modifications -- such as pre-pulling all arms a fixed number of times, increasing the sampling variance -- can provide tighter privacy guarantees. We again provide privacy guarantees that now depend on the new parameters introduced in the modification, which allows the analyst to tune the privacy guarantee as desired. We also provide a novel regret analysis for this new algorithm, and show how the new parameters also impact expected regret. Finally, we empirically validate and illustrate our theoretical findings in two parameter regimes and demonstrate that tuning the new parameters substantially improve the privacy-regret tradeoff.

**Fair Supervised Learning with A Simple Random Sampler of Sensitive Attributes**

Jinwon Sohn · Qifan Song · Guang Lin

As the data-driven decision process becomes dominating for industrial applications, fairness-aware machine learning arouses great attention in various areas. This work proposes fairness penalties learned by neural networks with a simple random sampler of sensitive attributes for non-discriminatory supervised learning. In contrast to many existing works that critically rely on the discreteness of sensitive attributes and response variables, the proposed penalty is able to handle versatile formats of the sensitive attributes, so it is more extensively applicable in practice than many existing algorithms. This penalty enables us to build a computationally efficient group-level in-processing fairness-aware training framework. Empirical evidence shows that our framework enjoys better utility and fairness measures on popular benchmark data sets than competing methods. We also theoretically characterize estimation errors and loss of utility of the proposed neural-penalized risk minimization problem.

**DAGnosis: Localized Identification of Data Inconsistencies using Structures**

Nicolas HUYNH · Jeroen Berrevoets · Nabeel Seedat · Jonathan Crabbé · Zhaozhi Qian · Mihaela van der Schaar

Identification and appropriate handling of inconsistencies in data at deployment time is crucial to reliably use machine learning models. While recent data-centric methods are able to identify such inconsistencies with respect to the training set, they suffer from two key limitations: (1) suboptimality in settings where features exhibit statistical independencies, due to their usage of compressive representations and (2) lack of localization to pin-point why a sample might be flagged as inconsistent, which is important to guide future data collection. We solve these two fundamental limitations using directed acyclic graphs (DAGs) to encode the training set's features probability distribution and independencies as a structure. Our method, called DAGnosis, leverages these structural interactions to bring valuable and insightful data-centric conclusions. DAGnosis unlocks the localization of the causes of inconsistencies on a DAG, an aspect overlooked by previous approaches. Moreover, we show empirically that leveraging these interactions (1) leads to more accurate conclusions in detecting inconsistencies, as well as (2) provides more detailed insights into why some samples are flagged.

**Interpretability Guarantees with Merlin-Arthur Classifiers**

Stephan Wäldchen · Kartikey Sharma · Berkant Turan · Max Zimmer · Sebastian Pokutta

We propose an interactive multi-agent classifier that provides provable interpretability guarantees even for complex agents such as neural networks.These guarantees consist of lower bounds on the mutual information between selected features and the classification decision.Our results are inspired by the Merlin-Arthur protocol from Interactive Proof Systems and express these bounds in terms of measurable metrics such as soundness and completeness.Compared to existing interactive setups, we rely neither on optimal agents nor on the assumption that features are distributed independently. Instead, we use the relative strength of the agents as well as the new concept of Asymmetric Feature Correlation which captures the precise kind of correlations that make interpretability guarantees difficult.We evaluate our results on two small-scale datasets where high mutual information can be verified explicitly.

**Tackling the XAI Disagreement Problem with Regional Explanations**

Gabriel Laberge · Yann Batiste Pequignot · Mario Marchand · Foutse Khomh

The XAI Disagreement Problem concerns the fact that various explainability methods yield different local/global insights on model behavior. Thus, given the lack of ground truth in explainability, practitioners are left wondering ``Which explanation should I believe?''. In this work, we approach the Disagreement Problem from the point of view of Functional Decomposition (FD). First, we demonstrate that many XAI techniques disagree because they handle feature interactions differently. Secondly, we reduce interactions locally by fitting a so-called FD-Tree, which partitions the input space into regions where the model is approximately additive. Thus instead of providing global explanations aggregated over the whole dataset, we advocate reporting the FD-Tree structure as well as the regional explanations extracted from its leaves. The beneficial effects of FD-Trees on the Disagreement Problem are demonstrated on toy and real datasets.

**Sparse and Faithful Explanations Without Sparse Models**

Yiyang Sun · Zhi Chen · Vittorio Orlandi · Tong Wang · Cynthia Rudin

Even if a model is not globally sparse, it is possible for decisions made from that model to be accurately and faithfully described by a small number of features. For instance, an application for a large loan might be denied to someone because they have no credit history, which overwhelms any evidence towards their creditworthiness. In this work, we introduce the Sparse Explanation Value (SEV), a new way of measuring sparsity in machine learning models. In the loan denial example above, the SEV is 1 because only one factor is needed to explain why the loan was denied. SEV is a measure of decision sparsity rather than overall model sparsity, and we are able to show that many machine learning models -- even if they are not sparse -- actually have low decision sparsity, as measured by SEV. SEV is defined using movements over a hypercube, allowing SEV to be defined consistently over various model classes, with movement restrictions reflecting real-world constraints. Our algorithms reduce SEV without sacrificing accuracy, providing sparse and completely faithful explanations, even without globally sparse models.

**Enhancing Distributional Stability among Sub-populations**

Jiashuo Liu · Jiayun Wu · Jie Peng · xiaoyu wu · Yang Zheng · Bo Li · Peng Cui

Enhancing the stability of machine learning algorithms under distributional shifts is at the heart of the Out-of-Distribution (OOD) Generalization problem. Derived from causal learning, recent works of invariant learning pursue strict invariance with multiple training environments. Although intuitively reasonable, strong assumptions on the availability and quality of environments are made to learn the strict invariance property. In this work, we come up with the ``distributional stability" notion to mitigate such limitations. It quantifies the stability of prediction mechanisms among sub-populations down to a prescribed scale. Based on this, we propose the learnability assumption and derive the generalization error bound under distribution shifts. Inspired by theoretical analyses, we propose our novel stable risk minimization (SRM) algorithm to enhance the model's stability w.r.t. shifts in prediction mechanisms (Y|X-shifts). Experimental results are consistent with our intuition and validate the effectiveness of our algorithm. The code can be found at https://github.com/LJSthu/SRM.

**Agnostic Multi-Robust Learning using ERM**

Saba Ahmadi · Avrim Blum · Omar Montasser · Kevin Stangl

A fundamental problem in robust learning is asymmetry: a learner needs to correctly classify every one of exponentially-many perturbations that an adversary might make to a test-time natural example. In contrast, the attacker only needs to find one successful perturbation. Xiang et al.[2022] proposed an algorithm that in the context of patch attacks for image classification, reduces the effective number of perturbations from an exponential to a polynomial number of perturbations and learns using an ERM oracle. However, to achieve its guarantee, their algorithm requires the natural examples to be robustly realizable. This prompts the natural question; can we extend their approach to the non-robustly-realizable case where there is no classifier with zero robust error?Our first contribution is to answer this question affirmatively by reducing this problem to a setting in which an algorithm proposed by Feige et al. [2015] can be applied, and in the process extend their guarantees. Next, we extend our results to a multi-group setting and introduce a novel agnostic multi-robust learning problem where the goal is to learn a predictor that achieves low robust loss on a (potentially) rich collection of subgroups.

**Imposing Fairness Constraints in Synthetic Data Generation**

Mahed Abroshan · Andrew Elliott · Mohammad Mahdi Khalili

In several real-world applications (e.g., online advertising, item recommendations, etc.) it may not be possible to release and share the real dataset due to privacy concerns. As a result, synthetic data generation (SDG) has emerged as a promising solution for data sharing. While the main goal of private SDG is to create a dataset that preserves the privacy of individuals contributing to the dataset, the use of synthetic data also creates an opportunity to improve fairness. Since there often exist historical biases in the datasets, using the original real data for training can lead to an unfair model. Using synthetic data, we can attempt to remove such biases from the dataset before releasing the data. In this work, we formalize the definition of fairness in synthetic data generation and provide a general framework to achieve fairness. Then we consider two notions of counterfactual fairness and information filtering fairness and show how our framework can be used for these definitions.

**Joint Selection: Adaptively Incorporating Public Information for Private Synthetic Data**

Miguel Fuentes · Brett Mullins · Ryan McKenna · Gerome Miklau · Daniel Sheldon

Mechanisms for generating differentially private synthetic data based on marginals and graphical models have been successful in a wide range of settings. However, one limitation of these methods is their inability to incorporate public data. Initializing a data generating model by pre-training on public data has shown to improve the quality of synthetic data, but this technique is not applicable when model structure is not determined a priori. We develop the mechanism JAM-PGM, which expands the adaptive measurements framework to jointly select between measuring public data and private data. This technique allows for public data to be included in a graphical-model-based mechanism. We show that JAM-PGM is able to outperform both publicly assisted and non publicly assisted synthetic data generation mechanisms even when the public data distribution is biased.

**DHMConv: Directed Hypergraph Momentum Convolution Framework**

WENBO ZHAO · ZITONG MA · Zhe Yang

Due to its capability to capture high-order information, the hypergraph model has shown greater potential than the graph model in various scenarios.Real-world entity relations frequently involve directionality, in order to express high-order information while capturing directional information in relationships, we present a directed hypergraph spatial convolution framework that is designed to acquire vertex embeddings of directed hypergraphs.The framework characterizes the information propagation of directed hypergraphs through two stages: hyperedge information aggregation and hyperedge information broadcasting.During the hyperedge information aggregation stage, we optimize the acquisition of hyperedge information using attention mechanisms.In the hyperedge information broadcasting stage, we leverage a directed hypergraph momentum encoder to capture the directional information of directed hyperedges.Experimental results on five publicly available directed graph datasets of three different categories demonstrate that our proposed DHMConv outperforms various commonly used graph and hypergraph models.

**Estimation of partially known Gaussian graphical models with score-based structural priors**

Martín Sevilla · Antonio Marques · Santiago Segarra

We propose a novel algorithm for the support estimation of partially known Gaussian graphical models that incorporates prior information about the underlying graph. In contrast to classical approaches that provide a point estimate based on a maximum likelihood or maximum a posteriori approach using (simple) priors on the precision matrix, we consider a prior on the graph and rely on annealed Langevin diffusion to generate samples from the posterior distribution. Since the Langevin sampler requires access to the score function of the underlying graph prior, we use graph neural networks to effectively estimate the score from a graph dataset (either available beforehand or generated from a known distribution). Numerical experiments in different setups demonstrate the benefits of our approach.

**Data-Driven Online Model Selection With Regret Guarantees**

Chris Dann · Claudio Gentile · Aldo Pacchiano

We consider model selection for sequential decision making in stochastic environments with bandit feedback, where a meta-learner has at its disposal a pool of base learners, and decides on the fly which action to take based on the policies recommended by each base learner. Model selection is performed by regret balancing but, unlike the recent literature on this subject, we do not assume any prior knowledge about the base learners like candidate regret guarantees; instead, we uncover these quantities in a data-driven manner. The meta-learner is therefore able to leverage the *realized* regret incurred by each base learner for the learning environment at hand (as opposed to the *expected* regret), and single out the best such regret. We design two model selection algorithms operating with this more ambitious notion of regret and, besides proving model selection guarantees via regret balancing, we experimentally demonstrate the compelling practical benefits of dealing with actual regrets instead of candidate regret bounds.

**Monitoring machine learning-based risk prediction algorithms in the presence of performativity**

Jean Feng · Alexej Gossmann · Gene Pennello · Nicholas Petrick · Berkman Sahiner · Romain Pirracchio

Performance monitoring of machine learning (ML)-based risk prediction models in healthcare is complicated by the issue of performativity: when an algorithm predicts a patient to be at high risk for an adverse event, clinicians are more likely to administer prophylactic treatment and alter the very target that the algorithm aims to predict. A simple approach is to ignore performativity and monitor only the untreated patients, whose outcomes remain unaltered. In general, ignoring performativity may inflate Type I error because (i) untreated patients disproportionally represent those with low predicted risk, and (ii) changes in the clinician's trust in the ML algorithm and the algorithm itself can induce complex dependencies that violate standard assumptions. Nevertheless, we show that valid inference is still possible when monitoring \textit{conditional} rather than marginal performance measures under either the assumption of conditional exchangeability or time-constant selection bias. Finally, performativity can vary over time and induce nonstationarity in the data, which presents challenges for monitoring. To this end, we introduce a new score-based cumulative sum (CUSUM) monitoring procedure with dynamic control limits. Through extensive simulation studies, we study applications of the score-based CUSUM and how it is affected by various factors, including the efficiency of model updating procedures and the level of clinician trust. Finally, we apply the procedure to detect calibration decay of a risk model during the COVID-19 pandemic.

**Directed Hypergraph Representation Learning for Link Prediction**

ZITONG MA · WENBO ZHAO · Zhe Yang

Link prediction is a critical problem in network structure processing. With the prevalence of deep learning, graph-based learning pattern in link prediction has been well-proven to successfully apply. However, existing representation-based computing paradigms retain some lack in processing complex networks: most methods only consider low-order pairwise information or eliminate the direction message, which tends to obtain a sub-optimal representation. To tackle the above challenges, we propose using directed hypergraph to model the real world and design a directed hypergraph neural network framework for data representation learning. Specifically, our work can be concluded into two sophisticated aspects: (1) We define the approximate Laplacian of the directed hypergraph, and further formulate the convolution operation on the directed hypergraph structure, solving the issue of the directed hypergraph structure representation learning. (2) By efficiently learning complex information from directed hypergraphs to obtain high-quality representations, we develop a framework DHGNN for link prediction on directed hypergraph structures. We empirically show that the merit of DHGNN lies in its ability to model complex correlations and encode information effectively of directed hypergraphs. Extensive experiments conducted on multi-field datasets demonstrate the superiority of the proposed DHGNN over various state-of-the-art approaches.