Keywords: [ Learning Theory and Statistics ] [ Decision Processes and Bandits ]

Abstract:
Combinatorial bandit models and algorithms are used in many sequential decision-making tasks ranging from item list recommendation to influence maximization. Typical algorithms proposed for combinatorial bandits, including combinatorial UCB (CUCB) and combinatorial Thompson sampling (CTS) do not exploit correlations between base arms during the learning process. Moreover, their regret is usually analyzed under independent base arm outcomes. In this paper, we use Gaussian Processes (GPs) to model correlations between base arms. In particular, we consider a combinatorial bandit model with probabilistically triggered arms, and assume that the expected base arm outcome function is a sample from a GP. We assume that the learner has access to an exact computation oracle, which returns an optimal solution given expected base arm outcomes, and analyze the regret of Combinatorial Gaussian Process Upper Confidence Bound (ComGP-UCB) algorithm for this setting. Under (triggering probability modulated) Lipschitz continuity assumption on the expected reward function, we derive ($O( \sqrt{m T \log T \gamma_{T, \boldsymbol{\mu}}^{PTA}})$) $O(m \sqrt{\frac{T \log T}{p^*}})$ upper bounds for the regret of ComGP-UCB that hold with high probability, where $m$ denotes the number of base arms, $p^*$ denotes the minimum non-zero triggering probability, and $\gamma_{T, \boldsymbol{\mu}}^{PTA}$ denotes the pseudo-information gain. Finally, we show via simulations that when the correlations between base arm outcomes are strong, ComGP-UCB significantly outperforms CUCB and CTS.