AISTATS Poster Dueling RL: Reinforcement Learning with Trajectory Preferences

Aadirupa Saha · Aldo Pacchiano · Jonathan Lee

Auditorium 1 Foyer 50

Abstract: We consider the problem of preference-based reinforcement learning (PbRL), where, unlike traditional reinforcement learning (RL), an agent receives feedback only as 1-bit (0/1) preferences over a pair of trajectories instead of absolute rewards for them. The success of the traditional reward-based RL framework crucially depends on how accurately a system designer can express an appropriate reward function, which is often a non-trivial task. The main novelty of our framework is the ability to learn from preference-based trajectory feedback, which eliminates the need to hand-craft numeric reward models. This paper sets up a formal framework for the PbRL problem with non-Markovian rewards, where the trajectory preferences are encoded by a generalized linear model of dimension $d$. Assuming the transition model is known, we propose an algorithm with a regret guarantee of $\widetilde{\mathcal{O}}\left( SH d \log (T / \delta) \sqrt{T} \right)$. We further extend this algorithm to the case of unknown transition dynamics and provide an algorithm with regret $\widetilde{\mathcal{O}}\left( (\sqrt{d} + H^2 + |\mathcal{S}|)\sqrt{dT} + \sqrt{|\mathcal{S}||\mathcal{A}|TH} \right)$. To the best of our knowledge, our work is one of the first to give tight regret guarantees for the preference-based RL problem with trajectory preferences.
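
As a rough illustration of the feedback model described in the abstract, the sketch below draws 1-bit preferences between two trajectories from a generalized linear model. It assumes a logistic link, which is only one possible instance of a generalized linear model, and the names `w` (a weight vector of dimension $d$), `phi_a`, and `phi_b` (trajectory feature vectors) are hypothetical and do not come from the abstract.

```python
import math
import random


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def preference_probability(w, phi_1, phi_2):
    """Probability that trajectory 1 is preferred over trajectory 2.

    Assumption: a logistic link applied to <w, phi_1 - phi_2>, where
    phi_1 and phi_2 are d-dimensional trajectory feature vectors.
    """
    score = sum(wi * (a - b) for wi, a, b in zip(w, phi_1, phi_2))
    return sigmoid(score)


def sample_preference(w, phi_1, phi_2, rng):
    """Draw the 1-bit feedback: 1 if trajectory 1 is preferred, else 0."""
    return 1 if rng.random() < preference_probability(w, phi_1, phi_2) else 0


# Hypothetical example with d = 3.
w = [1.0, -0.5, 0.25]
phi_a = [0.8, 0.1, 0.4]
phi_b = [0.2, 0.6, 0.4]

p = preference_probability(w, phi_a, phi_b)
rng = random.Random(0)
bit = sample_preference(w, phi_a, phi_b, rng)
```

The learner in this setting only ever observes values like `bit`, never the underlying score, which is why no numeric reward function has to be specified by hand.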
