From Restless to Contextual: A Thresholding Bandit Reformulation for Finite-horizon Improvement
Abstract
This paper addresses the poor finite-horizon performance of existing online restless bandit (RB) algorithms, which stems from the prohibitive sample complexity of learning a full Markov decision process (MDP) for each agent. We argue that superior finite-horizon performance requires rapid convergence to a high-quality policy. Motivated by this observation, we reformulate online RBs as a budgeted thresholding contextual bandit problem, which simplifies learning by encoding long-term state transitions into a scalar reward. We establish the first non-asymptotic optimality guarantee for an oracle policy in a simplified finite-horizon setting. We then propose a practical learning policy for the heterogeneous-agent, multi-state setting and show that it attains sublinear regret, converging faster than existing methods. This faster convergence translates directly into higher cumulative reward, as validated empirically by significant gains over state-of-the-art algorithms in large-scale heterogeneous environments. Our work provides a new pathway toward practical, sample-efficient learning in finite-horizon RBs.