Optimal Query Allocation in Extractive QA with LLMs: A Learning-to-Defer Framework with Theoretical Guarantees
Abstract
Large Language Models (LLMs) excel at generative language tasks but remain unreliable for structured prediction, particularly in extractive question answering (EQA), where success depends on precise span selection. These challenges are amplified in resource-constrained environments, such as mobile or embedded systems, where deploying high-capacity models is often infeasible. We propose a Learning-to-Defer framework that routes EQA queries across a pool of models with varying capabilities and costs to balance accuracy and efficiency. Our approach is grounded in statistical decision theory: we define a differentiable surrogate loss whose minimizer provably converges to the Bayes-optimal allocation policy. Experiments on SQuADv1, SQuADv2, and TriviaQA show that our method consistently improves the accuracy-efficiency trade-off relative to static baselines and prior routing heuristics. Overall, our framework provides a principled and scalable solution for EQA in both high-performance and on-device deployment settings.
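The allocation rule sketched in the abstract can be illustrated with a minimal example: route each query to the model that minimizes expected error plus a cost penalty. This is only an illustrative sketch, not the paper's actual method; the model names, accuracy estimates, cost values, and the trade-off weight `lam` are all hypothetical assumptions.

```python
# Illustrative sketch of a cost-aware allocation rule: choose the expert
# minimizing (expected error) + lam * (inference cost). All names and
# numbers below are assumed for illustration only.
from dataclasses import dataclass

@dataclass
class Expert:
    name: str
    est_accuracy: float  # assumed probability of a correct span (per-query estimate in practice)
    infer_cost: float    # assumed normalized inference cost (e.g., latency or FLOPs)

def allocate(experts, lam):
    """Pick the expert minimizing expected error plus lam-weighted cost."""
    return min(experts, key=lambda e: (1.0 - e.est_accuracy) + lam * e.infer_cost)

experts = [
    Expert("small-on-device", est_accuracy=0.72, infer_cost=0.1),
    Expert("mid-size",        est_accuracy=0.83, infer_cost=0.4),
    Expert("large-LLM",       est_accuracy=0.90, infer_cost=1.0),
]

# A small cost weight favors the large model; a large weight favors
# the cheap on-device model.
print(allocate(experts, lam=0.05).name)  # → large-LLM
print(allocate(experts, lam=0.5).name)   # → small-on-device
```

In the framework itself, the per-query accuracy estimates would come from a learned deferral function trained with the differentiable surrogate loss, rather than being fixed constants as in this sketch.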