Influence Attributions Can Be Systematically Altered by Model Manipulation
Abstract
Influence functions are a standard tool for attributing a model's predictions to its training data in a principled manner and are widely used in applications such as data valuation and fairness. In this work, we describe realistic incentives for manipulating influence-based attributions and investigate whether an adversary can alter these attributions \textit{systematically}. We show that small, systematic perturbations to a model can indeed alter influence-based attributions as desired. We demonstrate this on logistic regression models trained on ResNet feature embeddings and on standard tabular fairness datasets, and we provide efficient attacks with backward-friendly implementations. Our work raises questions about the reliability of influence-based attributions under adversarial conditions. Code is available at \url{https://post-acceptance}.
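For concreteness, a minimal sketch of the standard influence-function approximation the abstract refers to (the notation here, $L$, $\hat\theta$, $H_{\hat\theta}$, $z$, $z_{\text{test}}$, is ours and is not defined in the abstract itself): the influence of upweighting a training point $z$ on the loss at a test point $z_{\text{test}}$, for a model with empirical risk minimizer $\hat\theta$, is estimated as
\[
\mathcal{I}(z, z_{\text{test}}) \;=\; -\,\nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top}\, H_{\hat\theta}^{-1}\, \nabla_\theta L(z, \hat\theta),
\qquad
H_{\hat\theta} \;=\; \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat\theta),
\]
where $H_{\hat\theta}$ is the Hessian of the training loss. An attack of the kind described above would perturb $\hat\theta$ so that these scores change in a targeted way.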