Poster

BlockBoost: Scalable and Efficient Blocking through Boosting

Thiago Ramos ⋅ Rodrigo Schuller ⋅ Alex Akira Okuno ⋅ Lucas Nissenbaum ⋅ Roberto Oliveira ⋅ Paulo Orenstein

2024 Poster


Abstract

As datasets grow larger, matching and merging entries from different databases has become a costly task in modern data pipelines. To avoid expensive comparisons between entries, blocking similar items is a popular preprocessing step. In this paper, we introduce BlockBoost, a novel boosting-based method that generates compact binary hash codes for database entries, through which blocking can be performed efficiently. The algorithm is fast and scalable, resulting in computational costs that are orders of magnitude lower than current benchmarks. Unlike existing alternatives, BlockBoost comes with associated feature importance measures for interpretability, and possesses strong theoretical guarantees, including lower bounds on critical performance metrics like recall and reduction ratio. Finally, we show that BlockBoost delivers strong empirical results, outperforming state-of-the-art blocking benchmarks in terms of both performance metrics and computational cost.
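To make the two metrics named in the abstract concrete, the following is a minimal sketch of blocking on binary hash codes and of how recall and reduction ratio are commonly computed. It is not the BlockBoost algorithm: the codes are hard-coded, hypothetical values standing in for whatever a learned hash function would produce, and the sketch assumes deduplication within a single collection with exact-match blocking on the full code. All function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def block_by_code(codes):
    """Group record ids by their binary hash code (exact-match blocking)."""
    blocks = defaultdict(list)
    for record_id, code in codes.items():
        blocks[code].append(record_id)
    return blocks


def candidate_pairs(blocks):
    """Return all unordered pairs of records that share a block."""
    pairs = set()
    for members in blocks.values():
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def recall(candidates, true_matches):
    """Fraction of true matching pairs that survive blocking."""
    return len(candidates & true_matches) / len(true_matches)


def reduction_ratio(candidates, n_records):
    """Fraction of all possible pairs that blocking avoids comparing."""
    total_pairs = n_records * (n_records - 1) // 2
    return 1 - len(candidates) / total_pairs


# Hypothetical 4-bit codes for six records (assumed, not produced by BlockBoost).
codes = {0: "0101", 1: "0101", 2: "1100", 3: "1100", 4: "0011", 5: "1110"}
# Hypothetical ground truth: pairs of records that refer to the same entity.
true_matches = {(0, 1), (2, 3), (4, 5)}

blocks = block_by_code(codes)
pairs = candidate_pairs(blocks)

print("candidate pairs:", sorted(pairs))
print("recall:", recall(pairs, true_matches))
print("reduction ratio:", reduction_ratio(pairs, len(codes)))
```

In this toy example only two of the fifteen possible pairs are compared, so the reduction ratio is high, but the true match (4, 5) is lost because its two records received different codes. This is the trade-off between recall and reduction ratio that a blocking method has to manage.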
