

Don't Be Pessimistic Too Early: Look K Steps Ahead!

Chaoqi Wang · Ziyu Ye · Kevin Murphy · Yuxin Chen

MR1 & MR2 - Number 29
Fri 3 May 8 a.m. PDT — 8:30 a.m. PDT


Offline reinforcement learning (RL) aims to train highly rewarding policies exclusively from pre-collected data, and has shown great real-world impact. Pessimism, i.e., avoiding uncertain states or actions during decision making, has long been the main theme of offline RL. However, existing methods often yield overly conservative policies with rather sub-optimal performance. To tackle this challenge, we introduce the notion of lookahead pessimism within the model-based offline RL paradigm. Intuitively, while the classical pessimism principle asks the agent to terminate whenever it reaches an uncertain region, our method allows the agent to use a lookahead set carefully crafted from the learned model, and to make a move based on properties of the lookahead set. Remarkably, we show that this enables learning a less conservative policy with a better performance guarantee. We refer to our method as the Lookahead Pessimistic MDP (LP-MDP). Theoretically, we provide a rigorous analysis of the performance lower bound, which monotonically improves with the number of lookahead steps. Empirically, with the easy-to-implement design of LP-MDP, we demonstrate a solid performance improvement over baseline methods on widely used offline RL benchmarks.
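The core idea of the abstract can be illustrated with a toy sketch. This is not the paper's LP-MDP construction; it is a minimal illustration under assumed ingredients: a tabular learned model (a dict mapping each state to its possible successors), a hypothetical visit-count uncertainty test, and a termination rule. Classical pessimism terminates as soon as the current state is uncertain; the lookahead variant terminates only if every state reachable within k model steps is uncertain too.

```python
def is_uncertain(state, visit_counts, threshold=5):
    """Toy uncertainty test (an assumption, not the paper's criterion):
    a state is uncertain if the offline dataset rarely visited it."""
    return visit_counts.get(state, 0) < threshold

def lookahead_set(state, model, k):
    """All states reachable from `state` within k steps under the
    learned model, including `state` itself."""
    frontier, seen = {state}, {state}
    for _ in range(k):
        frontier = {s2 for s in frontier for s2 in model.get(s, [])} - seen
        seen |= frontier
    return seen

def should_terminate(state, model, visit_counts, k):
    """k = 0 recovers classical pessimism: stop at any uncertain state.
    With k > 0, stop only if the entire k-step lookahead set is
    uncertain, i.e. no nearby well-covered state can be reached."""
    if not is_uncertain(state, visit_counts):
        return False
    return all(is_uncertain(s, visit_counts)
               for s in lookahead_set(state, model, k))
```

For example, if state 'a' is unvisited but leads in two model steps to a well-covered state 'c', the k=0 rule terminates at 'a' while the k=2 rule continues, matching the intuition that larger lookahead yields a less conservative policy.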
