

Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition

Long-Fei Li · Peng Zhao · Zhi-Hua Zhou

MR1 & MR2 - Number 26
Sat 4 May 6 a.m. PDT — 8:30 a.m. PDT

Abstract: We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses in the bandit feedback setting. Specifically, we focus on linear mixture MDPs, whose transition kernel is a linear mixture model. We propose a new algorithm that attains an $\tilde{\mathcal{O}}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability, where $d$ is the dimension of the feature mappings, $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the episode length, and $K$ is the number of episodes. Our result strictly improves upon the previous best-known $\tilde{\mathcal{O}}(dS^2 \sqrt{K} + \sqrt{HSAK})$ bound of Zhao et al. (2023a), since $H \leq S$ holds by the layered MDP structure. Our advancements are primarily attributed to (i) a new least-squares estimator for the transition parameter that leverages the visit information of all states, as opposed to only one state in prior work, and (ii) a new self-normalized concentration inequality, tailored to handle non-independent noises, which was originally proposed in the dynamic assortment area and is here first applied in reinforcement learning to handle correlations between different states.
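In a linear mixture MDP, the transition kernel has the form $P(s' \mid s, a) = \langle \phi(s' \mid s, a), \theta^* \rangle$ for an unknown parameter $\theta^* \in \mathbb{R}^d$. As background for the estimator discussed in (i), the sketch below shows the standard ridge-regression (value-targeted) route to estimating such a parameter from noisy linear responses. This is only an illustrative baseline under synthetic data, not the paper's refined estimator, which further exploits the visit information of all states; all variable names here are hypothetical.

```python
import numpy as np

# Illustrative sketch (not the paper's algorithm): in a linear mixture MDP,
# P(s'|s,a) = <phi(s'|s,a), theta*>. A common way to estimate theta* is
# ridge regression on value-targeted features x_t = sum_{s'} phi(s'|s_t,a_t) V(s')
# with noisy scalar targets y_t = V(s'_t).

def ridge_estimate(X, y, lam=1.0):
    """Closed-form ridge solution: argmin_theta ||X theta - y||^2 + lam ||theta||^2."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)        # regularized Gram matrix
    return np.linalg.solve(A, X.T @ y)   # solve (X^T X + lam I) theta = X^T y

# Synthetic check: recover a known parameter from noisy linear responses.
rng = np.random.default_rng(0)
theta_star = np.array([0.5, -0.2, 0.3])
X = rng.normal(size=(500, 3))            # stand-in for value-targeted features
y = X @ theta_star + 0.01 * rng.normal(size=500)
theta_hat = ridge_estimate(X, y, lam=0.1)
```

With enough samples, `theta_hat` concentrates around `theta_star`; the self-normalized concentration inequality in (ii) is what controls this estimation error when the noise terms are not independent across states.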
