Poster
Decision-Point Guided Safe Policy Improvement
Lingkai Kong · Sarah Filippi · Alihan Hüyük
In batch reinforcement learning, safe policy improvement seeks to ensure that the learned policy performs at least as well as the behavior policy that generated the dataset. The core challenge is to seek improvement while managing risk when many state-action pairs may be infrequently visited. In this work, we introduce Decision Points RL (DPRL), an algorithm that restricts the set of state-action pairs (or regions, for continuous states) considered for improvement. DPRL ensures high-confidence improvement in densely visited states (called 'decision points') while still utilizing data from sparsely visited states through trajectory-based value estimates. By selectively limiting the state-action pairs where the policy deviates from the behavior policy, we achieve tighter theoretical guarantees that depend only on the counts of frequently observed state-action pairs rather than on the size of the state-action space. Our empirical results confirm that DPRL provides both safety and performance improvements across synthetic and real-world applications.
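The abstract above describes deviating from the behavior policy only at densely visited state-action pairs. The sketch below illustrates that idea in a tabular setting; it is not the authors' implementation. The function name `improve_policy`, the `count_threshold` parameter, and the use of simple Monte Carlo return estimates in place of the paper's high-confidence, trajectory-based estimates are all illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

def improve_policy(dataset, behavior_policy, count_threshold=50):
    """Return a policy that deviates from `behavior_policy` only at
    densely visited (state, action) pairs ("decision points").

    `dataset` is assumed to be a list of (state, action, return_to_go)
    tuples extracted from behavior-policy trajectories (illustrative format).
    """
    counts = defaultdict(int)
    returns = defaultdict(list)
    for s, a, g in dataset:
        counts[(s, a)] += 1
        returns[(s, a)].append(g)

    # Default everywhere: follow the behavior policy.
    new_policy = dict(behavior_policy)

    for s in {s for (s, _) in counts}:
        # Only actions visited often enough are trusted candidates for deviation.
        candidates = [a for (s2, a) in counts
                      if s2 == s and counts[(s, a)] >= count_threshold]
        if not candidates:
            continue

        # Crude Monte Carlo value estimates (stand-in for the paper's
        # trajectory-based, high-confidence estimates).
        q_hat = {a: np.mean(returns[(s, a)]) for a in candidates}
        best_a = max(q_hat, key=q_hat.get)
        b_a = behavior_policy.get(s)

        # Deviate only when the densely visited alternative looks strictly better.
        if b_a in q_hat and q_hat[best_a] > q_hat[b_a]:
            new_policy[s] = best_a

    return new_policy
```

In this toy version, safety comes from defaulting to the behavior policy and restricting deviation to well-supported decision points; the paper's guarantees rest on counts at those points rather than on the full state-action space.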