Finite-Sample Regret Bound for Distributionally Robust Offline Tabular Reinforcement Learning

Zhengqing Zhou · Qinxun Bai · Zhengyuan Zhou · Linhai Qiu · Jose Blanchet · Peter Glynn

Keywords: [ Algorithms ] [ Reinforcement Learning ] [ Information Theory ] [ Representation Learning ] [ Algorithms -> Clustering; Theory ]

Abstract: While reinforcement learning has witnessed tremendous success recently in a wide range of domains, robustness--or the lack thereof--remains an important issue that remains inadequately addressed. In this paper, we provide a distributionally robust formulation of offline learning policy in tabular RL that aims to learn a policy from historical data (collected by some other behavior policy) that is robust to the future environment arising as a perturbation of the training environment. We first develop a novel policy evaluation scheme that accurately estimates the robust value (i.e. how robust it is in a perturbed environment) of any given policy and establish its finite-sample estimation error. Building on this, we then develop a novel and minimax-optimal distributionally robust learning algorithm that achieves $O_P\left(1/\sqrt{n}\right)$ regret, meaning that with high probability, the policy learned from using $n$ training data points will be $O\left(1/\sqrt{n}\right)$ close to the optimal distributionally robust policy. Finally, our simulation results demonstrate the superiority of our distributionally robust approach compared to non-robust RL algorithms.