Limitations and Future Work - Learning Expressive Linkage Rules for Entity Matching using Genet

In what follows, we discuss current limitations of the presented approaches and promising directions for future work.

7.2.1 Collective Matching

The learning algorithms that have been proposed in this thesis generate linkage rules that match pairs of entities independently of each other. Thus, matching data sets using the generated linkage rules does not exploit relationships between different entities. For instance, if a pair of citations in a bibliographic data set is identified as match, this does not influence the matching decision concerning other citations. The promise of collective approaches is to exploit relationships between different types of entities in order to improve the overall matching result. For example, if the first author of two citations is found to be matching together with the citations themselves, this information can be used to infer that other pairs of citations that refer to the same author are more likely to match as well. A number of previously proposed collective approaches have been discussed in Section 3.4.4.

On the Cora citation data set, collective approaches outperformed previ- ous approaches that are based on learning linkage rules, but achieved worse results that GenLink. On this data set, the superior performance of GenLink, compared to other supervised approaches, mostly depends on the inclusion of transformations into the linkage rule, as discussed in Section 3.5.4. However, the discussed previously proposed collective approaches do not learn data transformations. As the Cora data set is rather noisy and contains many typographical errors1_{, combining the learning of linkage rules that include}

transformations with collective matching is promising.

In the following, we discuss how GenLink could be extended for collective matching. Currently when matching citations, a linkage rule is learned that includes comparisons for a set of properties. For instance, a linkage rule for matching citations may include a comparison of the author names that includes data transformations for normalizing author names, a distance measure that is used for comparing two author names and a distance threshold. When multiple comparisons are aggregated using a weighted average aggregation, each comparison may also specify a weight that specifies the influence of a specific comparison to the overall similarity score.

Collective approaches typically assume that the data sets that are to be matched provide multiple types of entities [Christen, 2012]. For instance, a citation data set includes entity types such as citations, authors, and venues. On data sets that provide multiple types of entities, GenLink could be used to learn multiple linkage rules concurrently, one linkage rule for each type of entity. In order to support collective approaches in GenLink, the linkage rule representation could be extended to include relationships to linkage rules that match related types of entities. Instead of including a comparison of author names into a linkage rule for matching citations, the linkage rule could include a pointer to the corresponding linkage rule for matching authors, at the same place. The pointer would fulfil the role of specifying that matching decisions of two types of entities are interdependent, e.g., matching two authors influences the matching of citations that refer to these authors. The weight would fulfill the task of specifying the influence of a given relation to the overall score. After linkage rules have been learned for each type of entity that is present in the data set, a collective entity matching approach could be used to execute all rules concurrently.

As collective approaches do not conduct matching decisions independently, but take a global view of the data set, performance can be an issue, although recent efforts are targeted at improving the matching performance on large data sets [Rastogi et al., 2011].

7.2.2 Distance Measures and Transformations

In our experiments, we have only used a limited number of distance measures and data transformations. As the evaluation on the Cora data set in Section 3.5.4 showed, the performance could be further improved by imple- menting more distance measures and data transformations. In particular, transformations that remove frequent stop words or normalize person names could be improve the matching performance. In addition, the use of hybrid distance measures that combine the advantages of character-based measures and token-based measures as described in Section 2.3.3 could further improve the accuracy.

7.2.3 Learning of Functions Parameters

GenLink generates a linkage rule by building an operator tree from a set of predefined operators. For each operator it also learns a number of parameters. For instance, for each comparison operator it learns a suitable distance measure, an appropriate distance threshold as well as the operator weight. In the same way, it finds suitable aggregation functions for aggregation operators and transformation functions for transformation operators. Currently, distance measures, aggregation functions and transformation functions are selected from a user-defined list of functions. If a function needs to be configured (e.g., a transformation function may be configured to filter specific stop-words), the respective configuration needs to be supplied by the user.

In order to avoid this, GenLink could be extended to learn function parameters. As GenLink already achieves a high accuracy on all employed evaluation data sets, this idea has not been explored further so far.

7.2.4 Population Seeding

Instead of starting with a population of random linkage rules, GenLink generates an initial population that only includes linkage rules that compare properties with similar values. For our experiments, two values have been considered similar if the Levenshtein distance between both values after they have been normalized was one or less. The complete heuristic that we used to seed the population has been described in Section 3.3.1.

While the employed seeding heuristic worked well on all evaluation data sets, it could fail if matching entities in different data sets use values that are not detected as similar. For instance, if values in different data sets use different units of measurements the current strategy is likely to fail on de- tecting both values as similar. In order to cover cases that are now missed,

the heuristic for identifying properties with similar values could be extended in two ways: The normalization of the values could be improved by applying more advanced transformations. In addition, instead of using solely the Lev- enshtein distance for comparing values for similarity, all distance measures that have been configured to be used by GenLink for learning linkage rules could be used for this purpose.

7.2.5 Query Strategy

ActiveGenLink uses a query strategy that reduces user involvement by selecting link candidates which yield a high information gain. We proposed a query strategy that outperforms the query-by-vote-entropy strategy by selecting a link that is as different as possible from any already labeled reference link. By this, it aims to distribute the labeled links uniformly across the similarity space. Future work may focus on further improving the distribution of the link candidates, which are chosen by the query strategy for manual labeling. For this purpose, clustering algorithms could be employed in order to select link candidates from the unlabeled pool that are in the center of diverse cluster of similar unlabeled link candidates. The query strategy could favor cluster with many link candidates over smaller clusters in order to increase the number of links that are covered by the selected link candidate.

In document Learning Expressive Linkage Rules for Entity Matching using Genetic Programming (Page 196-200)