

Stochastic Multi-Armed Bandits with Strongly Reward-Dependent Delays

Yifu Tang · Yingfei Wang · Zeyu Zheng

MR1 & MR2 - Number 25
Sat 4 May 6 a.m. PDT — 8:30 a.m. PDT


There has been increasing interest in applying multi-armed bandits to adaptive designs in clinical trials. However, most of the literature assumes that a previous patient's survival response to a treatment is known before the next patient is treated, which is unrealistic. The inability to account for response delays is frequently cited as one of the obstacles to using adaptive designs in clinical trials. More critically, the "delays" in observing the survival response are the rewards themselves rather than external stochastic noise. We formalize this problem as a novel stochastic multi-armed bandit (MAB) problem with reward-dependent delays, where the delay at each round depends on the reward generated in that same round. For general reward/delay distributions with finite expectation, our proposed censored-UCB algorithm achieves near-optimal regret in terms of both problem-dependent and problem-independent bounds. When the reward distributions are bounded or sub-Gaussian, the upper bounds are optimal with a matching lower bound. Empirical experiments validate our theoretical results and the algorithm's effectiveness.
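To illustrate the setting described above, the following is a minimal sketch (not the authors' exact censored-UCB algorithm) of a UCB-style policy in a simulated bandit where each pull's delay equals its reward, so feedback for a pull made at round s with reward r only becomes observable at round s + r. The exponential survival-time distributions, the confidence-radius constant, and the function name `censored_ucb_sketch` are illustrative assumptions, not from the paper.

```python
import math
import random

def censored_ucb_sketch(means, horizon, seed=0):
    """Sketch of a UCB policy under reward-dependent delays.

    Each arm's reward is an exponential "survival time" with the given
    mean, and the delay before observing that reward equals the reward
    itself (the strongly reward-dependent delay structure). UCB indices
    are computed only from feedback that has already arrived.
    """
    rng = random.Random(seed)
    K = len(means)
    pending = []            # (pull_round, arm, reward) not yet observed
    obs_sum = [0.0] * K     # sum of observed rewards per arm
    obs_cnt = [0] * K       # number of observed rewards per arm
    pulls = [0] * K
    for t in range(horizon):
        # Reveal feedback whose delay (== reward) has elapsed.
        still_pending = []
        for (s, a, r) in pending:
            if s + r <= t:
                obs_sum[a] += r
                obs_cnt[a] += 1
            else:
                still_pending.append((s, a, r))
        pending = still_pending
        # Play each arm once first, then maximize the UCB index.
        if t < K:
            arm = t
        else:
            def index(a):
                if obs_cnt[a] == 0:
                    return float('inf')  # optimism for unobserved arms
                mean = obs_sum[a] / obs_cnt[a]
                return mean + math.sqrt(2.0 * math.log(t + 1) / obs_cnt[a])
            arm = max(range(K), key=index)
        reward = rng.expovariate(1.0 / means[arm])  # survival time
        pulls[arm] += 1
        pending.append((t, arm, reward))
    return pulls

```

Note that higher-reward arms also incur longer delays here, so their feedback arrives more slowly; this coupling is exactly what distinguishes the reward-dependent-delay setting from bandits with external delay noise.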
