Poster
Exploration via linearly perturbed loss minimisation
David Janz · Shuai Liu · Alex Ayoub · Csaba Szepesvari
MR1 & MR2 - Number 86
We introduce 'exploration via linearly perturbed loss minimisation' (ELPeLM), a randomised exploration method for structured stochastic bandit problems that works by solving for the minimiser of a linearly perturbed regularised negative log-likelihood function. We show that, for generalised linear bandits, ELPeLM reduces to perturbed-history exploration (PHE), where exploration is performed by training on randomly perturbed rewards. In doing so, we provide a simple and clean explanation of when and why random reward perturbations give rise to good bandit algorithms. Our analysis suggests the use of data-dependent reward perturbations, not present in previous PHE-type methods, with which we match the performance of Thompson-sampling-style parameter-perturbation methods, both in theory and in practice. Moreover, we give an example outside of generalised linear bandits where PHE leads to inconsistent estimates and thus linear regret, while ELPeLM remains a capable algorithm. Although more principled, ELPeLM, just like PHE, can be implemented in only a few lines of code.
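
As a rough illustration of the "few lines of code" claim, below is a minimal sketch of the ELPeLM idea for a logistic (generalised linear) bandit: minimise the regularised negative log-likelihood with an added linear perturbation, then act greedily on the resulting estimate. The function name, the fixed Gaussian perturbation scale `a`, and the use of scipy's L-BFGS-B solver are illustrative assumptions rather than details from the poster; in particular, the data-dependent perturbations analysed in the paper would replace the fixed-scale noise used here. The reduction to PHE comes from the fact that, for a generalised linear model with canonical link, perturbing the observed rewards changes the negative log-likelihood only by a term that is linear in the parameter.

```python
import numpy as np
from scipy.optimize import minimize


def elpelm_choose_arm(arms, X, y, lam=1.0, a=1.0, rng=None):
    """Choose an arm by minimising a linearly perturbed regularised
    negative log-likelihood and acting greedily on the resulting estimate.

    arms : (K, d) array of candidate arm feature vectors
    X    : (t, d) array of previously played feature vectors
    y    : (t,) array of observed binary rewards
    lam  : ridge regularisation strength (illustrative choice)
    a    : perturbation scale (a data-dependent scale would replace this)
    """
    rng = np.random.default_rng() if rng is None else rng
    d = arms.shape[1]
    # Linear perturbation vector; a simple Gaussian stand-in here.
    z = a * rng.standard_normal(d)

    def perturbed_loss(theta):
        logits = X @ theta
        # Regularised logistic negative log-likelihood ...
        nll = np.sum(np.logaddexp(0.0, logits) - y * logits)
        reg = 0.5 * lam * theta @ theta
        # ... perturbed by the linear term <z, theta>.
        return nll + reg - z @ theta

    theta_hat = minimize(perturbed_loss, np.zeros(d), method="L-BFGS-B").x
    # Greedy action with respect to the perturbed parameter estimate.
    return int(np.argmax(arms @ theta_hat))
```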