Monotone and Conservative Policy Iteration Beyond the Tabular Case
Abstract
We introduce Reliable Policy Iteration (RPI) and Conservative RPI (CRPI), variants of Policy Iteration (PI) and Conservative PI (CPI) that retain tabular guarantees under function approximation. RPI replaces Bellman-error–based policy evaluation with a Bellman-constrained optimization. We prove that RPI restores the textbook monotonicity of value estimates and that these estimates lower-bound the true return. Their limit partially satisfies the unprojected Bellman equation, underscoring RPI’s alignment with RL foundations. For CRPI, we prove a performance-difference lower bound that accounts for function-approximation errors and approximate advantages, and we update policies by maximizing this bound. Our work addresses a foundational gap in RL: popular algorithms such as TRPO and PPO derive from tabular CPI yet are deployed with function approximation, where CPI’s guarantees often fail, leading to divergence, oscillations, or convergence to suboptimal policies. By restoring PI/CPI’s guarantees for arbitrary function classes, RPI provides a principled basis for robust, next-generation RL.