Efficient Model Performance Evaluation Using a Combination of Expert and Crowd-sourced Labels
Abstract
As models, particularly large language models (LLMs), are deployed on increasingly challenging tasks, correctly evaluating their performance is growing in both importance and difficulty. Labels from expert humans are high quality but scarce and resource-intensive to obtain, while crowd-sourced labels are readily available at scale but lower in quality. We propose Maven (Model And Voter EvaluatioN), a hierarchical Bayesian model that combines these two label sources to produce model performance estimates on binary tasks that are less biased than estimates from crowd-sourced labels alone and have lower variance than estimates from expert labels alone. By modeling the ranking of model scores, Maven is robust to a range of prediction distributions and achieves constant inference time regardless of dataset size. We validate our approach on both simulated and real-world data, and deploy it to evaluate production models at Meta.