On the Global Optimum Convergence of Momentum-based Policy Gradient

Yuhao Ding · Junzi Zhang · Javad Lavaei

[ Abstract ]
Mon 28 Mar 10:15 a.m. PDT — 11:45 a.m. PDT

Abstract: Policy gradient (PG) methods are popular and efficient for large-scale reinforcement learning due to their relative stability and incremental nature. In recent years, the empirical success of PG methods has led to the development of a theoretical foundation for these methods. In this work, we generalize this line of research by establishing the first set of global convergence results of stochastic PG methods with momentum terms, which have been demonstrated to be efficient recipes for improving PG methods. We study both the soft-max and the Fisher-non-degenerate policy parametrizations, and show that adding a momentum term improves the global optimality sample complexities of vanilla PG methods by $\tilde{\mathcal{O}}(\epsilon^{-1.5})$ and $\tilde{\mathcal{O}}(\epsilon^{-1})$, respectively, where $\epsilon>0$ is the target tolerance. Our results for the generic Fisher-non-degenerate policy parametrizations also provide the first single-loop and finite-batch PG algorithm achieving an $\tilde{O}(\epsilon^{-3})$ global optimality sample complexity. Finally, as a by-product, our analyses provide general tools for deriving the global convergence rates of stochastic PG methods, which can be readily applied and extended to other PG estimators under the two parametrizations.

Chat is not available.