RL-finetuning LLMs from on- and off-policy data with a single algorithm
Yunhao Tang · Taco Cohen · David Zhang · Gabriel Synnaeve · Rémi Munos
Abstract
We introduce a novel reinforcement learning algorithm (AGRO, for Any-Generation Reward Optimization) for finetuning Large Language Models. AGRO leverages the concept of response consistency, which states that the optimal policy satisfies a notion of consistency across any possible generation of the model. We derive algorithms that find optimal solutions via sample-based policy gradient and provide theoretical guarantees on their convergence. Our experiments demonstrate the effectiveness of AGRO in both on-policy and off-policy settings, showing improved performance on the MATH dataset over baseline methods.