Skip to yearly menu bar Skip to main content


Online Defense Strategies for Reinforcement Learning Against Adaptive Reward Poisoning

Andi Nika · Adish Singla · Goran Radanovic

Auditorium 1 Foyer 162

Abstract: We consider the problem of defense against reward-poisoning attacks in reinforcement learning and formulate it as a game in $T$ rounds between a defender and an adaptive attacker in an adversarial environment. To address this problem, we design two novel defense algorithms. First, we propose \textit{Exp3-DARP}, a defense algorithm that uses Exp3 as a hyperparameter learning subroutine, and show that it achieves order-optimal $\tilde{\Theta}(T^{1/2})$ bounds on our notion of regret with respect to a defense that always picks the optimal parameter in hindsight. We show that the order of $T$ in the bounds cannot be improved when the reward arrival process is adversarial, even if the feedback model of the defense is stronger. However, assuming that the environment is stochastic, we propose \textit{OMDUCB-DARP} that uses estimates of costs as proxies to update the randomized strategy of the learner and are able to substantially improve the bounds proportional to how smoothly the attacker's strategy changes. Furthermore, we show that weaker types of defense, that do not take into account the attack structure and the poisoned rewards, suffer linear regret with respect to a defender that always selects the optimal parameter in hindsight when faced with an adaptive attacker that uses a no-regret algorithm to learn the behavior of the defense. Finally, we support our theoretical results with experimental evaluations on three different environments, showcasing the efficiency of our methods.

Live content is unavailable. Log in and register to view live content