Efficient model performance evaluation using a combination of expert and crowd-sourced labels
Abstract
As models, particularly large language models (LLMs), are deployed on increasingly challenging tasks, evaluating their performance correctly is growing in both importance and difficulty. Expert human labelers produce high-quality labels but are scarce and expensive, while crowd-sourced labels are cheaper at scale but lower in quality. This paper proposes Maven (Model and Voter EvaluatioN), a hierarchical Bayesian model that combines the two label sources to produce estimates of model performance on binary tasks that are less biased than estimates from crowd-sourced labels alone and lower in variance than estimates from high-quality labels alone. By modeling the ranking induced by model predictions rather than their raw values, our approach is robust to a wide range of prediction distributions and achieves constant inference time regardless of dataset size. The Maven model also enables imputation of missing high-quality labels, allowing a comprehensive suite of performance metrics to be estimated. We validate our approach on both simulated data and production models at a major technology company. For one production model, Maven's estimate of performance matched the variance of the expert-only estimate with an indistinguishable point estimate, while reducing labeling cost by 42\%. These results show that Maven is a practical solution for cost-effective, high-quality model evaluation at scale.
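As a loose, simplified illustration of the core intuition (not Maven's actual hierarchical Bayesian model), the sketch below shows how a small set of expert labels can debias an estimate built from cheap, noisy labels: crowd accuracy is estimated on the expert-labeled overlap, then the model-crowd agreement rate on all items is inverted via a method-of-moments correction. All constants and variable names here are hypothetical simulation parameters, not quantities from the paper.

```python
import random

random.seed(0)

N = 20000          # total items (assumed simulation size)
N_EXPERT = 1000    # items with expert ("gold") labels
P_MODEL = 0.85     # true model accuracy -- unknown in practice
Q_CROWD = 0.75     # crowd labeler accuracy -- unknown in practice

# Simulate binary ground truth, model predictions, and crowd labels.
truth = [random.random() < 0.5 for _ in range(N)]
model = [t if random.random() < P_MODEL else (not t) for t in truth]
crowd = [t if random.random() < Q_CROWD else (not t) for t in truth]

# Step 1: estimate crowd accuracy q on the expert-labeled subset,
# treating expert labels as ground truth.
q_hat = sum(crowd[i] == truth[i] for i in range(N_EXPERT)) / N_EXPERT

# Step 2: observed model-crowd agreement rate on ALL items.
# This naive estimate is biased toward 0.5 by crowd noise.
a_hat = sum(model[i] == crowd[i] for i in range(N)) / N

# Step 3: invert  a = p*q + (1-p)*(1-q)  to debias the estimate.
p_hat = (a_hat + q_hat - 1) / (2 * q_hat - 1)

print(f"naive crowd-agreement estimate: {a_hat:.3f}")
print(f"debiased estimate:              {p_hat:.3f} (true accuracy {P_MODEL})")
```

The debiased estimate uses all N items rather than only the expert-labeled subset, which is the source of the variance reduction; the Bayesian treatment in the paper additionally propagates uncertainty and supports imputation of the missing high-quality labels.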