Poster
A Computation-Efficient Method of Measuring Dataset Quality based on the Coverage of the Dataset
Michael Brennan · Jaehwan Kim · Danqi Liao · Mingyu Kim
Abstract:
Evaluating dataset quality is an essential task, as the performance of artificial intelligence (AI) systems depends heavily on it. The traditional method of evaluating dataset quality is to train an AI model on the dataset and test it on a separate test set, but this approach requires significant computation time. In this paper, we propose a computationally efficient method for quantifying dataset quality. Specifically, our method measures how well the dataset covers the input probability distribution, on the premise that a high-quality dataset leaves few inputs out of distribution. We present a GPU-accelerated algorithm that approximately implements the proposed method. We highlight three applications of our approach. First, it can evaluate the impact of data-management practices such as data cleaning and core-set selection; we experimentally demonstrate that its quality assessments correlate strongly with those of the traditional approach, achieving R² ≥ 0.985 in most cases while being 60 to 1,200 times faster. Second, it can monitor the quality of a continuously growing dataset with computation time proportional to the size of the added data. Finally, it can estimate the performance that the traditional approach would report on large datasets.
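The abstract does not spell out the coverage computation, so the sketch below is only an illustration of the general idea, not the authors' algorithm: a distribution-coverage proxy can be estimated on the GPU by checking what fraction of reference samples (a draw from the input distribution) fall inside the k-nearest-neighbor balls of the dataset points. The function name `coverage_score`, the use of PyTorch, and the k-NN radius heuristic are all assumptions made for this example.

import torch

def coverage_score(dataset: torch.Tensor, reference: torch.Tensor, k: int = 5) -> float:
    """Illustrative coverage proxy (an assumption, NOT the paper's algorithm).

    Returns the fraction of reference samples covered by the k-NN balls of
    the dataset points; a higher score suggests the dataset leaves fewer
    inputs out of distribution.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    data = dataset.to(device)
    ref = reference.to(device)

    # k-NN radius of each dataset point, estimated within the dataset itself.
    dists = torch.cdist(data, data)              # (n, n) pairwise distances
    radii = dists.kthvalue(k + 1, dim=1).values  # k+1 skips the zero self-distance

    # A reference point is "covered" if it lies inside any point's k-NN ball.
    ref_dists = torch.cdist(ref, data)           # (m, n)
    covered = (ref_dists <= radii.unsqueeze(0)).any(dim=1)
    return covered.float().mean().item()

Under this sketch, monitoring a growing dataset only requires distances involving the newly added points, which is consistent with the abstract's claim that the monitoring cost scales with the size of the added data.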