Efficient Model Performance Evaluation Using a Combination of Expert and Crowd-sourced Labels
Abstract
As models, particularly large language models (LLMs), are deployed on increasingly challenging tasks, correctly evaluating their performance is growing in both importance and difficulty. Labels from expert humans are high quality but scarce and resource-intensive to obtain, while crowd-sourced labels are readily available at scale but lower in quality. We propose Maven (Model And Voter EvaluatioN), a hierarchical Bayesian model that combines these two label sources to produce model performance estimates on binary tasks that are less biased than estimates from crowd-sourced labels alone and have lower variance than estimates from expert labels alone. By modeling the ranking of model scores, Maven is robust to a range of prediction distributions and achieves constant inference time regardless of dataset size. We validate our approach on both simulated and real-world data, and deploy it to evaluate production models at Meta.