Parameter-Free Dynamic Regret for Unconstrained Linear Bandits
Alberto Rumi · Andrew Jacobsen · Nicolò Cesa-Bianchi · Fabio Vitale
Abstract
We study dynamic regret minimization in online learning with an oblivious adversary and bandit feedback. In this setting, a learner must minimize the cumulative loss relative to an arbitrary sequence of comparators $\boldsymbol{u}_1, \ldots, \boldsymbol{u}_T$ in $\mathbb{R}^d$, but receives only *point-evaluation feedback* on each round. We provide a simple approach to combining the guarantees of several bandit algorithms, allowing us to design algorithms which optimally adapt to the number of switches $S_T = \sum_t \mathbb{I}\{\boldsymbol{u}_t \neq \boldsymbol{u}_{t-1}\}$ of an arbitrary comparator sequence. In particular, we provide the *first* algorithm for linear bandits which obtains the optimal regret guarantee of order $\mathcal{O}\big(\sqrt{(1 + S_T)\,T}\big)$ up to poly-logarithmic terms *without prior knowledge of $S_T$*, resolving a long-standing open problem.
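To make the switch count concrete, the following is a minimal illustrative sketch (not from the paper) of computing $S_T = \sum_t \mathbb{I}\{\boldsymbol{u}_t \neq \boldsymbol{u}_{t-1}\}$ for a given comparator sequence; the function name `num_switches` and the example sequence are assumptions for illustration only.

```python
import numpy as np

def num_switches(comparators):
    """Count rounds t >= 2 at which the comparator u_t differs from u_{t-1}.

    `comparators` is a list of vectors u_1, ..., u_T in R^d.
    """
    return sum(
        not np.array_equal(u_prev, u_t)
        for u_prev, u_t in zip(comparators, comparators[1:])
    )

# Hypothetical sequence of T = 5 comparators in R^2 with two switches
# (at rounds 3 and 5).
us = (
    [np.array([1.0, 0.0])] * 2
    + [np.array([0.0, 1.0])] * 2
    + [np.array([1.0, 1.0])]
)
print(num_switches(us))  # 2
```

A static comparator sequence has $S_T = 0$, in which case the stated bound recovers the familiar $\mathcal{O}(\sqrt{T})$ rate for fixed-comparator regret.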