

Poster

BlockBoost: Scalable and Efficient Blocking through Boosting

Thiago Ramos · Rodrigo Schuller · Alex Akira Okuno · Lucas Nissenbaum · Roberto Oliveira · Paulo Orenstein

MR1 & MR2 - Number 124

Abstract:

As datasets grow larger, matching and merging entries from different databases has become a costly task in modern data pipelines. To avoid expensive comparisons between entries, blocking similar items is a popular preprocessing step. In this paper, we introduce BlockBoost, a novel boosting-based method that generates compact binary hash codes for database entries, through which blocking can be performed efficiently. The algorithm is fast and scalable, resulting in computational costs that are orders of magnitude lower than current benchmarks. Unlike existing alternatives, BlockBoost comes with associated feature importance measures for interpretability, and possesses strong theoretical guarantees, including lower bounds on critical performance metrics like recall and reduction ratio. Finally, we show empirically that BlockBoost outperforms state-of-the-art blocking benchmarks in terms of both performance metrics and computational cost.
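
To make the blocking setup concrete, the following is a minimal sketch of hash-based blocking and the two metrics named in the abstract. It is not the BlockBoost algorithm: the binary codes are hard-coded toy values rather than learned by boosting, records come from a single list for simplicity, and all names (`block_by_code`, `candidate_pairs`, `recall`, `reduction_ratio`) are hypothetical. Recall is taken as the fraction of true matching pairs that share a block, and reduction ratio as one minus the fraction of all possible pairs that remain as candidates, which are the usual definitions in record linkage.

```python
from collections import defaultdict
from itertools import combinations


def block_by_code(codes):
    """Group record ids by identical binary hash code (assumed precomputed)."""
    blocks = defaultdict(list)
    for record_id, code in codes.items():
        blocks[code].append(record_id)
    return blocks


def candidate_pairs(blocks):
    """Return all unordered pairs of record ids that share a block."""
    pairs = set()
    for members in blocks.values():
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def recall(candidates, true_matches):
    """Fraction of true matching pairs kept as candidates after blocking."""
    return len(candidates & true_matches) / len(true_matches)


def reduction_ratio(candidates, n_records):
    """One minus the share of all possible pairs that still need comparing."""
    total_pairs = n_records * (n_records - 1) // 2
    return 1 - len(candidates) / total_pairs


# Toy data: six records with hypothetical 4-bit codes and known true matches.
codes = {0: "0110", 1: "0110", 2: "1001", 3: "1001", 4: "1111", 5: "0110"}
true_matches = {(0, 1), (2, 3), (4, 5)}

blocks = block_by_code(codes)
cands = candidate_pairs(blocks)
rec = recall(cands, true_matches)
rr = reduction_ratio(cands, len(codes))
print(sorted(cands), rec, rr)
```

In this toy example only 4 of the 15 possible pairs are compared, at the cost of missing the match `(4, 5)`, whose records received different codes. This is the trade-off between recall and reduction ratio that a blocking method has to manage.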
