

Poster

Infinite-Horizon Reinforcement Learning with Multinomial Logit Function Approximation

Mingyu Kim · Dabeen Lee


Abstract: We study model-based reinforcement learning with non-linear function approximation where the transition function of the underlying Markov decision process (MDP) is given by a multinomial logit (MNL) model. We develop a provably efficient discounted value iteration-based algorithm that works for both the infinite-horizon average-reward and discounted-reward settings. For average-reward communicating MDPs, the algorithm guarantees a regret upper bound of $\tilde{\mathcal{O}}(dD\sqrt{T})$, where $d$ is the dimension of the feature mapping, $D$ is the diameter of the underlying MDP, and $T$ is the horizon. For discounted-reward MDPs, our algorithm achieves $\tilde{\mathcal{O}}(d(1-\gamma)^{-2}\sqrt{T})$ regret, where $\gamma$ is the discount factor. We then complement these upper bounds with several regret lower bounds. We prove a lower bound of $\Omega(d\sqrt{DT})$ for learning communicating MDPs of diameter $D$ and a lower bound of $\Omega(d(1-\gamma)^{-3/2}\sqrt{T})$ for learning discounted-reward MDPs with discount factor $\gamma$. Lastly, we show a regret lower bound of $\Omega(dH^{3/2}\sqrt{K})$ for learning $H$-horizon episodic MDPs with MNL function approximation, where $K$ is the number of episodes, which improves upon the best-known lower bound for the finite-horizon setting.
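In MNL function approximation of the kind the abstract describes, transition probabilities are typically modeled as a softmax over feature-parameter inner products. The minimal sketch below illustrates only that general form; the feature map `phi` and parameter `theta` are hypothetical placeholders, not quantities taken from this page.

```python
import numpy as np

def mnl_transition_probs(features, theta):
    """MNL transition model: P(s' | s, a) proportional to exp(phi(s, a, s')^T theta).

    features: (num_next_states, d) array whose rows are phi(s, a, s') for each candidate s'.
    theta: (d,) parameter vector (unknown to the learner and estimated from data).
    """
    logits = features @ theta
    logits -= logits.max()          # shift for numerical stability; softmax is unchanged
    weights = np.exp(logits)
    return weights / weights.sum()  # normalize over candidate next states

# Hypothetical usage: d = 4 features, 3 reachable next states.
rng = np.random.default_rng(0)
phi = rng.normal(size=(3, 4))       # phi(s, a, s') for each candidate s'
theta = rng.normal(size=4)          # illustrative parameter
print(mnl_transition_probs(phi, theta))  # probabilities summing to 1
```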
