Efficient entity resolution for large heterogeneous information spaces
We have recently witnessed an enormous growth in the volume of structured and semi-structured data sets available on the Web. An important prerequisite for using and combining such data sets is the detection and merge of information that describes the same real-world entities, a task known as Entity Resolution. To make this quadratic task efficient, blocking techniques are typically employed. However, the high dynamics, loose schema binding, and heterogeneity of (semi-)structured data, impose new challenges to entity resolution. Existing blocking approaches become inapplicable because they rely on the homogeneity of the considered data and a-priory known schemata. In this paper, we introduce a novel approach for entity resolution, scaling it up for large, noisy, and heterogeneous information spaces. It combines an attribute-agnostic mechanism for building blocks with intelligent block processing techniques that boost blocks with high expected utility, propagate knowledg e about identified matches, and preempt the resolution process when it gets too expensive. Our extensive evaluation on real-world, large, heterogeneous data sets verifies that the suggested approach is both effective and efficient. Copyright 2011 ACM.