AISTATS Poster Order-Optimal Regret with Novel Policy Gradient Approaches in Infinite-Horizon Average Reward MDPs


Swetha Ganesh · Washim Mondal · Vaneet Aggarwal


Abstract: We present two Policy Gradient-based algorithms with general parametrization in the context of infinite-horizon average reward Markov Decision Processes (MDPs). The first one employs Implicit Gradient Transport for variance reduction, ensuring an expected regret of the order $\tilde{\mathcal{O}}(T^{2/3})$. The second approach, rooted in Hessian-based techniques, ensures an expected regret of the order $\tilde{\mathcal{O}}(\sqrt{T})$. These results significantly improve the state-of-the-art $\tilde{\mathcal{O}}(T^{3/4})$ regret and achieve the theoretical lower bound. We also show that the average-reward function is approximately $L$-smooth, a property that was assumed in earlier works.
