Skip to yearly menu bar Skip to main content


Provably Efficient Reinforcement Learning via Surprise Bound

Hanlin Zhu · Ruosong Wang · Jason Lee

Auditorium 1 Foyer 56

Abstract: Value function approximation is important in modern reinforcement learning (RL) problems especially when the state space is (infinitely) large. Despite the huge empirical success, the theoretical understandings of function approximation have mainly been focused on the tabular setting and the linear setting, and the theory of general value function approximation is still not well established. In this paper, we propose a provably efficient RL algorithm (both computationally and statistically) with general value function approximation. We show that if the value functions can be approximated by a function class $\mathcal{F}$ which satisfies a certain closedness guarantee, our algorithm achieves an $\widetilde{O}(\text{poly}(\iota H)\sqrt{T})$ regret bound where $\iota$ is the product of the surprise bound and log-covering numbers, $H$ is the planning horizon, $K$ is the number of episodes and $T = HK$ is the total number of steps the agent interacts with the environment. Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting. Moreover, our algorithm only needs to solve $O(H\log K)$ empirical risk minimization (ERM) problems, which is far more efficient than previous algorithms that need to solve ERM problems for $\Omega(HK)$ times.

Live content is unavailable. Log in and register to view live content