Multi-Agent Lipschitz Bandits
Sourav Chakraborty ⋅ Amit Rege ⋅ Claire Monteleoni ⋅ Lijun Chen
Abstract
We study the decentralized multi-player stochastic bandit problem over a continuous, Lipschitz-structured action space where hard collisions yield zero reward. Our objective is to design a communication-free policy that maximizes collective reward, while separating coordination costs from learning costs. We propose a modular protocol that first solves the multi-agent coordination problem by identifying and seating players on distinct, high-value regions via a novel maxima-directed search and then decouples the problem into $N$ independent single-player Lipschitz bandits. In the consensus regime, we obtain an end-to-end regret bound whose dominant learning term is \(\tilde{O}(T^{(d+1)/(d+2)})\), matching the single-player Lipschitz rate; the upfront coordination cost is horizon-independent at fixed confidence and only polylogarithmic in \(T\) in the expected-regret form. Under an additional public coverage/scheduling assumption for the epochic extension, we also obtain a gap-free \(\tilde{O}(T^{(d+1)/(d+2)})\) guarantee. We further derive a matching lower bound for the dominant learning term and extend the framework to general distance-threshold collision models.
Successful Page Load