

Poster

Order-Optimal Regret with Novel Policy Gradient Approaches in Infinite-Horizon Average Reward MDPs

Swetha Ganesh · Washim Mondal · Vaneet Aggarwal


Abstract: We present two Policy Gradient-based algorithms with general parametrization in the context of infinite-horizon average-reward Markov Decision Processes (MDPs). The first employs Implicit Gradient Transport for variance reduction, ensuring an expected regret of order $\tilde{O}(T^{2/3})$. The second, rooted in Hessian-based techniques, ensures an expected regret of order $\tilde{O}(\sqrt{T})$. These results significantly improve on the state-of-the-art $\tilde{O}(T^{3/4})$ regret, and the latter matches the theoretical lower bound. We also show that the average-reward function is approximately $L$-smooth, a property that earlier works assumed without proof.
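For intuition, here is a hedged sketch of the two variance-reduction ideas the abstract names, written in the generic forms known from the policy-gradient literature (Implicit Gradient Transport in the style of Arnold et al., 2019, and Hessian-aided momentum in the spirit of Shen et al., 2019). The function names, step sizes, averaging weights, and the `grad_oracle`/`hvp_oracle` interfaces are illustrative assumptions, not the paper's actual algorithms.

```python
import numpy as np

# Hedged sketch: grad_oracle(theta) returns a sampled policy-gradient
# estimate at theta, and hvp_oracle(theta, v) returns a sampled
# Hessian-vector product. Both are assumed interfaces for illustration.

def igt_update(theta, theta_prev, d_prev, t, grad_oracle, lr=0.01):
    """Implicit-Gradient-Transport-style step.

    The stochastic gradient is queried at an extrapolated point so that
    the running average d implicitly transports past gradients to the
    current iterate, reducing variance without storing old trajectories.
    """
    eta = 1.0 / (t + 1)                                     # averaging weight
    y = theta + ((1.0 - eta) / eta) * (theta - theta_prev)  # extrapolated point
    d = (1.0 - eta) * d_prev + eta * grad_oracle(y)         # transported average
    return theta + lr * d, d                                # gradient *ascent*


def hessian_aided_update(theta, theta_prev, d_prev, hvp_oracle, lr=0.01):
    """Hessian-aided momentum step.

    The change in the gradient between consecutive iterates is approximated
    by a single Hessian-vector product along the displacement, so the
    recursive estimator d tracks the true gradient with reduced variance.
    Periodic fresh gradient estimates (not shown) keep the recursion anchored.
    """
    delta = theta - theta_prev
    d = d_prev + hvp_oracle(theta, delta)                   # d += H(theta) @ delta
    return theta + lr * d, d


if __name__ == "__main__":
    # Toy sanity check on the surrogate J(theta) = -||theta||^2 / 2, whose
    # noisy gradient is -theta + noise; ascent should drive theta toward 0.
    rng = np.random.default_rng(0)
    grad = lambda th: -th + 0.1 * rng.standard_normal(th.shape)
    theta_prev = np.ones(3)
    theta = np.ones(3)
    d = grad(theta)
    for t in range(1, 500):
        new_theta, d = igt_update(theta, theta_prev, d, t, grad)
        theta_prev, theta = theta, new_theta
    print(theta)  # close to the maximizer at the origin
```

The appeal of both estimators, and plausibly the reason they appear here, is that each reuses information from the previous iterate (an extrapolated gradient query or a Hessian-vector product) instead of large fresh batches, which is what drives the variance reduction behind the improved regret rates.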
