Learning to Choose or Choosing to Learn: Best-of-N vs. Supervised Fine-Tuning for Bit String Generation
Seamus Seamus · Vinod Raman · Unique Subedi · Yuekai Sun
Abstract
Using the bit string generation problem as a case study, we theoretically compare two standard methods for adapting large language models to new tasks. The first, *supervised fine-tuning*, trains a new next-token predictor on good generations. The second, *Best-of-N* (BoN), trains a reward model to select good responses from a collection generated by an unaltered base model. If the learning setting is realizable, we find that supervised fine-tuning outperforms BoN through a better dependence on the response length in its rate of convergence. If realizability fails, then depending on the failure mode, BoN can enjoy either a better rate of convergence in $n$ or a rate of convergence with better dependence on the response length.
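To make the Best-of-N selection step concrete, the following is a minimal sketch under stated assumptions, not the construction analyzed in the paper. The uniform `base_model`, the agreement-count `reward`, and the `target` string are all hypothetical stand-ins: in the setting described above the reward model is learned from data, whereas here it is fixed by hand purely for illustration.

```python
import random


def base_model(length, rng):
    """Hypothetical base model: samples each bit independently and uniformly."""
    return [rng.randint(0, 1) for _ in range(length)]


def reward(bits, target):
    """Hypothetical reward (assumed, not learned): number of positions matching target."""
    return sum(b == t for b, t in zip(bits, target))


def best_of_n(num_candidates, length, target, rng):
    """Draw candidates from the unaltered base model and return the highest-reward one."""
    candidates = [base_model(length, rng) for _ in range(num_candidates)]
    return max(candidates, key=lambda bits: reward(bits, target))


rng = random.Random(0)
target = [1, 0, 1, 1, 0, 0, 1, 0]  # illustrative target bit string

single = best_of_n(1, len(target), target, rng)   # one draw, no real selection
best = best_of_n(64, len(target), target, rng)    # selection among 64 draws

print("single draw reward:", reward(single, target))
print("best-of-64 reward: ", reward(best, target))
```

The base model is never updated in this sketch; all adaptation comes from the selection step. Supervised fine-tuning would instead change the sampler itself, which is the contrast the abstract draws.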