Online Sparse Reinforcement Learning

Botao Hao · Tor Lattimore · Csaba Szepesvari · Mengdi Wang

Keywords: [ Applications ] [ Fairness, Accountability, and Transparency ] [ Reinforcement Learning ]

Abstract: We investigate the hardness of online reinforcement learning in sparse linear Markov decision process (MDP), with a special focus on the high-dimensional regime where the ambient dimension is larger than the number of episodes. Our contribution is two-fold. First, we provide a lower bound showing that linear regret is generally unavoidable, even if there exists a policy that collects well-conditioned data. Second, we show that if the learner has oracle access to a policy that collects well-conditioned data, then a variant of Lasso fitted Q-iteration enjoys a regret of $O(N^{2/3})$ where $N$ is the number of episodes.

Chat is not available.