A Consequentialist Critique of Binary Classification Evaluation: Theory, Practice, and Tools
Abstract
Machine learning-supported decisions, such as ordering diagnostic tests or determining preventive custody, often rely on binary classification from probabilistic forecasts. A consequentialist perspective, long emphasized in decision theory, favors evaluation methods that reflect the quality of such forecasts under threshold uncertainty and varying prevalence, notably Brier scores and log loss. However, our empirical review of practices at major ML venues (ICML, FAccT, CHIL) reveals a dominant reliance on accuracy and AUC-ROC. To address this disconnect, we introduce a decision-theoretic framework mapping evaluation metrics to their appropriate use cases, along with a practical Python package, \texttt{briertools}, designed to make proper scoring rules more usable in real-world settings. Specifically, we implement bounded-threshold variants of the Brier score and log loss that restrict evaluation to a practitioner-specified range of plausible cost ratios, rather than averaging over the full unit interval. We further contribute a theoretical reconciliation between the Brier score and decision curve analysis, directly addressing a longstanding critique by Assel et al. (2017) regarding the clinical utility of proper scoring rules.
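To make the bounded-threshold idea concrete, the sketch below illustrates one way such a score could be computed: averaging the cost-weighted misclassification loss over decision thresholds restricted to a practitioner-specified range, which (over the full unit interval) recovers the usual Brier score up to a factor of two via its Schervish-style integral representation. This is a minimal illustration under our own assumptions, not the \texttt{briertools} API; the function name \texttt{bounded\_brier} and the grid-based numerical integration are hypothetical choices made here for exposition.

\begin{verbatim}
import numpy as np

def bounded_brier(y_true, y_prob, c_lo=0.0, c_hi=1.0, n_grid=1001):
    """Average cost-weighted misclassification loss over thresholds
    c in [c_lo, c_hi] (c corresponds to the cost ratio c / (1 - c)).

    With c_lo=0 and c_hi=1 this equals half the usual Brier score,
    reflecting its integral representation over decision thresholds.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    thresholds = np.linspace(c_lo, c_hi, n_grid)

    losses = []
    for c in thresholds:
        predict_pos = y_prob >= c
        # False positives cost c, false negatives cost (1 - c).
        loss = np.mean(c * predict_pos * (1 - y_true)
                       + (1 - c) * (~predict_pos) * y_true)
        losses.append(loss)
    # Average (rather than sum) over the threshold range so that scores
    # computed on different ranges remain on a comparable scale.
    return float(np.mean(losses))

# Example: restrict evaluation to plausible thresholds 0.05-0.25.
y = np.array([0, 1, 1, 0, 1])
p = np.array([0.1, 0.8, 0.55, 0.3, 0.9])
print(bounded_brier(y, p))              # full range; ~ Brier score / 2
print(bounded_brier(y, p, 0.05, 0.25))  # bounded-threshold variant
\end{verbatim}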