A Bayesian Information-Theoretic Approach to Data Attribution
Abstract
Training Data Attribution (TDA) seeks to trace model predictions back to influential training examples, enhancing interpretability and safety. We formulate TDA as a Bayesian information-theoretic problem: subsets are scored by the information loss they induce, i.e., the entropy increase at a query when they are removed. This criterion credits examples for resolving predictive uncertainty rather than for fitting label noise. To scale to modern networks, we approximate information loss using a Gaussian Process surrogate built from tangent features. For even larger-scale retrieval, we relax the information-loss objective and add a variance correction, enabling scalable attribution in vector databases. Our method recovers classical influence scores for single-example attribution while promoting diversity when attributing to subsets. Experiments demonstrate competitive performance on counterfactual sensitivity and ground-truth retrieval benchmarks, showing that our method scales to modern architectures and bridges principled information-theoretic measures with practice.