Faster Rates, Adaptive Algorithms, and Finite-Time Bounds for Linear Composition Optimization and Gradient TD Learning

Anant Raj · Pooria Joulani · Andras Gyorgy · Csaba Szepesvari



Gradient temporal difference (GTD) algorithms are provably convergent policy evaluation methods for off-policy reinforcement learning. Despite much progress, proper tuning of the stochastic approximation methods used to solve the resulting saddle point optimization problem requires the knowledge of several (unknown) problem-dependent parameters. In this paper we apply adaptive step-size tuning strategies to greatly reduce this dependence on prior knowledge, and provide algorithms with adaptive convergence guarantees. In addition, we use the underlying refined analysis technique to obtain new O(1/T) rates that do not depend on the strong-convexity parameter of the problem, and also apply to the Markov noise setting, as well as the unbounded i.i.d. noise setting.

Chat is not available.