

Sample Complexity of Policy-Based Methods under Off-Policy Sampling and Linear Function Approximation

Zaiwei Chen · Siva Theja Maguluri

Abstract: In this work, we study policy-based methods for solving the reinforcement learning problem, where off-policy sampling and linear function approximation are employed for policy evaluation, and various policy update rules (including natural policy gradient) are considered for policy improvement. To solve the policy evaluation sub-problem in the presence of the deadly triad, we propose a generic algorithmic framework of multi-step TD-learning with generalized importance sampling ratios, which includes two specific algorithms: the $\lambda$-averaged $Q$-trace and the two-sided $Q$-trace. The generic algorithm is single time-scale, has provable finite-sample guarantees, and overcomes the high variance issue in off-policy learning. For policy improvement, we provide a universal analysis that establishes geometric convergence of various policy update rules, which leads to an overall $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity.
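To make the policy evaluation ingredients concrete, the following is a minimal sketch of a single off-policy TD step for a linear $Q$-function estimate, with the importance sampling ratio clipped from both sides. It is an illustration only and not the $\lambda$-averaged or two-sided $Q$-trace update from the paper: the function names, the clipping bounds `lower` and `upper`, the one-step (rather than multi-step) target, and the step size are all assumptions made for this example.

```python
from typing import List, Sequence


def clipped_ratio(target_prob: float, behavior_prob: float,
                  lower: float, upper: float) -> float:
    """Importance sampling ratio pi(a|s) / mu(a|s), clipped to [lower, upper].

    Clipping bounds the variance of the ratio at the cost of some bias.
    The bounds are hypothetical tuning parameters for this sketch.
    """
    if behavior_prob <= 0.0:
        raise ValueError("behavior policy must assign positive probability "
                         "to the sampled action")
    return max(lower, min(upper, target_prob / behavior_prob))


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def td_step(w: Sequence[float], phi: Sequence[float],
            phi_next: Sequence[float], reward: float, rho_next: float,
            gamma: float = 0.9, alpha: float = 0.1) -> List[float]:
    """One semi-gradient TD step for Q(s, a) ~ phi(s, a)^T w.

    rho_next reweights the bootstrapped value of the next state-action
    pair, which was sampled from the behavior policy.
    """
    delta = reward + gamma * rho_next * dot(phi_next, w) - dot(phi, w)
    return [wi + alpha * delta * xi for wi, xi in zip(w, phi)]


# Example: the raw ratio 0.9 / 0.1 = 9 is clipped down to the upper bound.
rho = clipped_ratio(target_prob=0.9, behavior_prob=0.1, lower=0.5, upper=2.0)
w_new = td_step(w=[0.0, 0.0], phi=[1.0, 0.0], phi_next=[0.0, 1.0],
                reward=1.0, rho_next=rho)
```

With an unclipped ratio the correction is unbiased for the target policy but can have very large variance when the behavior policy rarely takes the sampled action; bounding the ratio is one common way to trade bias for variance.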
