

Poster

From Data Imputation to Data Cleaning --- Automated Cleaning of Tabular Data Improves Downstream Predictive Performance

Sebastian Jäger · Felix Biessmann

Multipurpose Room 2 - Number 102

Abstract:

The translation of Machine Learning (ML) research innovations to real-world applications and the maintenance of ML components are hindered by recurring challenges, such as reaching high predictive performance and robustness, complying with regulatory constraints, or meeting ethical standards. Many of these challenges are related to data quality and, in particular, to the lack of automation in data pipelines upstream of ML components. Automated data cleaning remains challenging since many approaches neglect the dependency structure of the data errors and require task-specific heuristics or human input for calibration. In this study, we develop and evaluate an application-agnostic ML-based data cleaning approach using well-established imputation techniques for automated detection and cleaning of erroneous values. To improve the degree of automation, we combine imputation techniques with conformal prediction (CP), a model-agnostic and distribution-free method to quantify and calibrate the uncertainty of ML models. Extensive empirical evaluations demonstrate that Conformal Data Cleaning (CDC) improves predictive performance in downstream ML tasks in the majority of cases. Our code is available on GitHub: https://github.com/se-jaeger/conformal-data-cleaning
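
As a rough illustration of how conformal prediction can turn an imputation model into an error detector, the sketch below applies split conformal prediction to a single numeric column. It is not the authors' implementation (the repository linked above holds that). The function name `conformal_clean_column`, the least-squares model standing in for the imputation model, and the absolute-residual nonconformity score are all assumptions made for this example.

```python
import numpy as np


def conformal_clean_column(X_train, y_train, X_cal, y_cal, X_new, y_new, alpha=0.1):
    """Hypothetical helper: flag and replace suspicious values in one numeric column.

    X_* hold the other columns of the table, y_* the column being cleaned.
    Assumes the training and calibration splits are (mostly) error-free.
    """

    def design(X):
        # Add an intercept column to the feature matrix.
        return np.column_stack([np.ones(len(X)), X])

    # Stand-in "imputation model": ordinary least squares (an assumption).
    coef, *_ = np.linalg.lstsq(design(X_train), y_train, rcond=None)

    def predict(X):
        return design(X) @ coef

    # Nonconformity scores on the held-out calibration split.
    scores = np.sort(np.abs(y_cal - predict(X_cal)))
    n = len(scores)

    # Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    threshold = scores[k - 1] if k <= n else np.inf

    # A value is flagged when it falls outside the conformal interval
    # [prediction - threshold, prediction + threshold].
    y_hat = predict(X_new)
    is_flagged = np.abs(y_new - y_hat) > threshold

    # Flagged values are replaced by the model's prediction; others are kept.
    cleaned = np.where(is_flagged, y_hat, y_new)
    return cleaned, is_flagged
```

With exchangeable, error-free data, roughly a fraction `alpha` of correct values would still be flagged, so `alpha` trades off missed errors against unnecessary replacements. Categorical columns would need a classifier and prediction sets rather than intervals, which this sketch does not cover.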
