

The authors improved the efficiency of the blocking graph by assigning weights to the edges between linked nodes, producing a weighted blocking graph, and then removing record pairs with low weights at a small cost in recall. The results presented by the authors showed that meta-blocking can improve the efficiency of processing the overlapping blocks generated in the indexing step of the ER process. To improve the efficiency of meta-blocking further, Efthymiou et al. [56] proposed a parallelized variation of the meta-blocking approach based on MapReduce [46]. A supervised meta-blocking approach was proposed in [119] to enhance the performance of the unsupervised meta-blocking from [118]. This approach replaces the edge weights with feature vectors to generate a generalized blocking graph. The authors proposed a set of generic features that combine a low extraction cost with high discriminatory power (the ability to distinguish between records).
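The idea of weighting and pruning a blocking graph can be illustrated with a minimal sketch. It assumes one particular weighting scheme (the number of blocks two records share) and one particular pruning rule (discard edges below the mean weight); both are choices made for this illustration and are not necessarily the schemes used in the cited work. The block contents and the function names `weighted_blocking_graph` and `prune_edges` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def weighted_blocking_graph(blocks):
    """Map each record pair to the number of blocks the two records share.

    Assumption: the shared-block count is used as the edge weight.
    """
    weights = defaultdict(int)
    for records in blocks.values():
        # Each pair of records co-occurring in a block adds 1 to its edge.
        for a, b in combinations(sorted(set(records)), 2):
            weights[(a, b)] += 1
    return dict(weights)


def prune_edges(weights):
    """Keep only edges whose weight is at least the mean edge weight.

    Assumption: the mean weight serves as the pruning threshold.
    """
    if not weights:
        return {}
    threshold = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= threshold}


# Hypothetical overlapping blocks: blocking key -> record identifiers.
blocks = {
    "smith": ["r1", "r2", "r3"],
    "john": ["r1", "r2"],
    "sydney": ["r1", "r2", "r4"],
    "2000": ["r3", "r4"],
}

graph = weighted_blocking_graph(blocks)
candidates = prune_edges(graph)
```

In this toy input the blocks imply six distinct record pairs, but only the pair (r1, r2), which shares three blocks, survives pruning. The five pairs that co-occur in a single block are dropped, which is where the small loss in recall can arise.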

Meta-blocking approaches are designed to reduce the number of record pairs (that is, the number of comparisons) produced from static overlapping blocks, which are used in batch-processing ER. To use meta-blocking with dynamic indexes in real time, the blocking graph itself has to be dynamic so that it can accommodate the changing blocks generated by the indexing techniques used in real-time ER.

3.5 Techniques for Real-Time Entity Resolution

Existing ER techniques focus mainly on improving the accuracy and efficiency of the ER process. However, the majority of these techniques are aimed at off-line entity matching (using batched algorithms) of static databases.

The first query-time ER approach was based on collective classification [16]. The idea behind this approach is to use only a subset of the records in a database for resolving queries, by extracting the records related to a query and then resolving the query using only these records. Although this approach can improve matching quality, experiments showed that an average time of 31.28 sec was needed on a database with 831,991 records. Thus, this approach is not suitable for real-time ER, nor is it scalable to large databases, since it is computationally expensive.

Christen et al. [35] proposed a similarity-aware inverted indexing technique that is suitable for real-time ER. The main idea behind this technique is to pre-calculate the similarities between attribute values that are in the same block. These pre-calculated similarities are stored in main memory to be used later in the query matching process. Avoiding similarity calculations at query time significantly reduces the time required for matching a query record. This technique was shown to be two orders of magnitude faster than standard blocking [62, 87], which makes it suitable for real-time ER. However, this technique is static: once the index data structures are created, new records and attribute values cannot be added to the index.
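A minimal sketch of the pre-calculation idea is given below. It is not the data structure from [35]; it only illustrates how similarities between values in the same block can be computed once at build time and looked up at query time. The blocking key (first character), the similarity function (`difflib.SequenceMatcher`), the single-attribute records, and the fallback for unseen query values are all assumptions made for this illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_key(value):
    # Hypothetical blocking key: the first character of the value.
    return value[:1].lower()


def similarity(a, b):
    # Stand-in string similarity in [0, 1]; any comparison function would do.
    return SequenceMatcher(None, a, b).ratio()


class SimilarityAwareIndex:
    """Static index built once from {record id: attribute value}."""

    def __init__(self, records):
        self.record_index = defaultdict(list)  # value -> record ids
        self.block_index = defaultdict(set)    # block key -> distinct values
        self.sim_index = defaultdict(dict)     # value -> {other value: sim}
        for rec_id, value in records.items():
            self.record_index[value].append(rec_id)
            if len(self.record_index[value]) > 1:
                continue  # value already indexed, nothing new to compare
            key = block_key(value)
            # Pre-calculate similarities with all values in the same block.
            for other in self.block_index[key]:
                s = similarity(value, other)
                self.sim_index[value][other] = s
                self.sim_index[other][value] = s
            self.block_index[key].add(value)

    def query(self, value):
        """Return (record id, similarity) pairs, most similar first."""
        if value in self.record_index:
            # Known value: look up stored similarities, compute nothing.
            sims = dict(self.sim_index.get(value, {}))
            sims[value] = 1.0
        else:
            # Assumption: unseen values are compared at query time only,
            # without being added to the (static) index.
            block = self.block_index.get(block_key(value), ())
            sims = {v: similarity(value, v) for v in block}
        matches = [(r, s) for v, s in sims.items()
                   for r in self.record_index[v]]
        return sorted(matches, key=lambda m: m[1], reverse=True)


# Hypothetical single-attribute records.
index = SimilarityAwareIndex(
    {"r1": "smith", "r2": "smyth", "r3": "smith", "r4": "jones"})
result = index.query("smith")
```

For a value that is already indexed, such as "smith", the query performs only dictionary look-ups, which is the source of the speed-up described above. The sketch also makes the static limitation visible: nothing in `query` updates the three dictionaries.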

Another real-time ER approach that is also designed to work with static databases was proposed by Dey et al. [48]. It is based on using a matching tree to limit the amount of communication required for matching records between disparate databases held at different locations, where a matching decision can be made without the need to compare all attribute values between records. This approach was shown to reduce the communication overhead without affecting the matching quality.

Ioannou et al. [85], on the other hand, proposed an approach that provides ER in real time for dynamic RDF databases. Their method is based on using links between the entities in a database combined with a probabilistic database for resolving entities. The approach uses existing ER techniques to find possible matches of a query, and instead of using these possible matches to make an off-line resolution decision, it stores the possible matches along with a probability weight in a dynamic index data structure. This stored information is then used at query time to perform ER in real time. The approach is reported to have an average time of 70 ms for a query record on a database of 51,222 records. This query time is almost constant and does not increase when the database gets larger.

Another dynamic ER approach, proposed by Whang and Garcia-Molina [158], allows matching rules to evolve over time as new records become available. This approach uses materialized ER results (sets of records that have been classified as matches) to avoid redundant work, and it does not require running the ER process from scratch. The authors report that this rule evolution approach can be faster than the naive approach by up to several orders of magnitude [158].

Whang et al. [159] proposed a pay-as-you-go ER technique that can be used with real-time ER. The authors build their technique on top of the indexing step (before the record comparison and classification steps). This approach proposes the use of Hints to give information on records that are likely to refer to the same real-world entity. The aim is to order the candidate record pairs (that are generated in the indexing step) by the likelihood of a match. Then, in the comparison step, the records that are more likely to be a match are compared first. The results in [159] showed that using Hints before the comparison step improves the ER process by finding the majority of matching records within a fraction of the total runtime, at the cost of an overhead in time and space. The authors also proposed a trade-off between this time overhead and the benefit of using Hints.
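The ordering idea can be sketched as follows. This is an illustration under stated assumptions, not the Hints of [159]: the cheap likelihood estimate (Jaccard overlap of word tokens), the more expensive comparison (`difflib.SequenceMatcher`), the comparison budget, and the function names are all hypothetical choices.

```python
from difflib import SequenceMatcher


def jaccard(a, b):
    """Cheap match-likelihood estimate: Jaccard overlap of word tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def resolve_with_budget(records, candidate_pairs, budget, threshold=0.8):
    """Compare at most `budget` pairs, most promising pairs first."""
    # The "hint" here is simply the candidate list sorted by the estimate.
    ordered = sorted(
        candidate_pairs,
        key=lambda p: jaccard(records[p[0]], records[p[1]]),
        reverse=True,
    )
    matches = []
    for a, b in ordered[:budget]:
        # Stand-in for the expensive, detailed record comparison.
        if SequenceMatcher(None, records[a], records[b]).ratio() >= threshold:
            matches.append((a, b))
    return matches


# Hypothetical records and candidate pairs from an indexing step.
records = {
    "r1": "john smith sydney",
    "r2": "john smyth sydney",
    "r3": "mary jones perth",
    "r4": "peter brown darwin",
}
pairs = [("r3", "r4"), ("r1", "r3"), ("r1", "r2"), ("r2", "r4")]

found = resolve_with_budget(records, pairs, budget=1)
```

With a budget of a single comparison, the ordered list places the pair (r1, r2) first, so the one true match is found. Processing `pairs` in its original order would have spent that comparison on (r3, r4). The sorting step is the time overhead mentioned above, and it pays off only when the detailed comparison is considerably more expensive than the estimate.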

More recently, Rezig et al. [133] proposed a general ER framework for on-line query matching that is based on iterative caching. The idea of their approach is to de-duplicate and cache a set of frequently requested records that are obtained from different Web databases using sampling. These de-duplicated record pairs are used for future reference when new queries arrive. When a new query arrives, it is matched jointly with the cached record pairs and is then added to the cache. The results presented by the authors showed that their approach provides a fast and effective ER framework that can be employed in on-line settings.

The techniques reviewed above provide general ER frameworks that can be used in real time. Nevertheless, none of these techniques (except the indexing technique proposed in [35]) focuses on the indexing step of the ER process. However, as discussed in Sections 2.2.2 and 2.3.2, we believe that indexing is a vital step for the real-time ER process, and creating new indexing techniques that are particularly designed to work with real-time ER allows conducting the real-time ER process using