4.3 Previous Work on Active Learning
4.4.4 Comparison of Different Query Strategies
In this section, we compare the performance of the proposed query strategy with two other strategies:
• Random: Selects a random link from the unlabeled pool for labeling (baseline).
• Entropy: Selects a link according to the query-by-vote-entropy strategy.
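As a rough illustration of the entropy baseline, the sketch below computes the vote entropy of a committee's match/non-match votes for each candidate link and selects the link on which the committee disagrees most. The committee representation (a list of boolean votes per link) and the names vote_entropy and select_query are assumptions made for this example only; how the committee is formed in the experiments is not described here.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy (in bits) of the committee's votes on one candidate link.

    votes: one label per committee member (e.g. True = match,
    False = non-match). Returns 0.0 when all members agree.
    """
    total = len(votes)
    entropy = 0.0
    for count in Counter(votes).values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


def select_query(pool_votes):
    """Return the identifier of the unlabeled link with maximal vote entropy.

    pool_votes: dict mapping a link identifier to its list of committee votes.
    """
    return max(pool_votes, key=lambda link: vote_entropy(pool_votes[link]))


# Hypothetical pool of three unlabeled links and a committee of four members.
pool = {
    "link_a": [True, True, True, True],    # full agreement
    "link_b": [True, True, False, False],  # even split
    "link_c": [True, True, True, False],
}
query = select_query(pool)
```

In this example the evenly split link is queried, since a label for it resolves the largest disagreement among the committee members.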
                Random          Entropy         Our Approach
Cora            0.604 (0.222)   0.841 (0.041)   0.917 (0.055)
Restaurant      0.568 (0.195)   0.888 (0.029)   0.993 (0.002)
SiderDrug.      0.309 (0.189)   0.666 (0.007)   0.795 (0.044)
NewYorkTimes    0.467 (0.174)   0.756 (0.080)   0.809 (0.039)
LinkedMDB       0.774 (0.235)   0.948 (0.035)   0.988 (0.005)
DBpediaDrug.    0.654 (0.146)   0.902 (0.076)   0.953 (0.011)
Table 4.10: Query Strategy: F-measure after 10 iterations
In all cases, the query-by-vote-entropy strategy as well as our proposed query strategy outperformed the random baseline. Our approach outperforms the query-by-vote-entropy strategy on all data sets. For the restaurant data set, our approach already achieved the maximum F-measure that can be achieved with GenLink on this data set, as shown in Table 4.5.
Table 4.11 compares the test F-measure after labeling 20 links. For the restaurant data set, the query-by-vote-entropy strategy also reaches the maximum F-measure. For the remaining data sets, our approach outperforms the query-by-vote-entropy strategy.
                Random          Entropy         Our Approach
Cora            0.762 (0.176)   0.938 (0.026)   0.945 (0.024)
Restaurant      0.707 (0.185)   0.994 (0.001)   0.993 (0.001)
SiderDrug.      0.615 (0.191)   0.926 (0.035)   0.954 (0.043)
NewYorkTimes    0.543 (0.182)   0.741 (0.102)   0.859 (0.084)
LinkedMDB       0.885 (0.209)   0.973 (0.125)   0.998 (0.003)
DBpediaDrug.    0.788 (0.156)   0.973 (0.007)   0.989 (0.003)
Table 4.11: Query Strategy: F-measure after 20 iterations
4.5 Summary
In this chapter, we presented the third main contribution of this thesis: the ActiveGenLink learning algorithm. ActiveGenLink is an algorithm for learning linkage rules interactively using active learning and genetic programming. ActiveGenLink learns a linkage rule by asking the user to confirm or reject a number of link candidates, which are actively selected by the algorithm. ActiveGenLink lowers the required level of expertise, as the task of generating the linkage rule is automated by the genetic programming algorithm while the user only has to verify a set of link candidates. The proposed query strategy reduces user involvement by selecting the most informative link candidates for verification by the user. ActiveGenLink employs the GenLink algorithm for learning linkage rules and is thus capable of learning linkage rules with the same expressivity, i.e., it chooses which properties to compare, it chooses appropriate distance measures, aggregation functions, and thresholds, and it chooses data transformations, which are applied to normalize data prior to comparison.
In our experiments, ActiveGenLink outperformed state-of-the-art unsupervised approaches after manually labeling a few link candidates. In addition, ActiveGenLink usually required the user to label fewer than 50 link candidates in order to generate linkage rules with the same performance as the supervised GenLink algorithm achieves on the entire set of reference links. The proposed query strategy required the user to label fewer links than the query-by-vote-entropy strategy.
Chapter 5
Execution of Linkage Rules
The growing number and size of available data sets demand efficient methods for entity matching. A number of indexing methods have been proposed to improve the efficiency of entity matching by dismissing definitive non-matches prior to comparison and thereby reducing the number of required entity comparisons [Elmagarmid et al., 2007]. Unfortunately, many indexing methods may lead to a decrease of recall due to false dismissals [Draisbach and Naumann, 2009]. Therefore, increasing the efficiency usually represents a trade-off between reducing the execution time of the entity matching task on the one hand and retaining its effectiveness by avoiding a significant decrease of recall on the other hand.
While the previous chapters have been concerned with algorithms for learning linkage rules that are represented using the model introduced in Section 2.5, the practical value of this linkage rule representation also depends on the availability of efficient methods for executing learned linkage rules. In this chapter, we propose a data flow to efficiently execute linkage rules using a multidimensional indexing approach that guarantees that no false dismissals, and thus no decrease of recall, can occur. The proposed indexing approach is called MultiBlock and constitutes the fourth main contribution of this thesis. The basic idea of MultiBlock is to map entities to a multidimensional index that preserves the distances of the entities, i.e., similar entities will be located near each other in the index space. While standard blocking techniques block in one dimension, MultiBlock concurrently blocks by multiple properties using multiple dimensions, which significantly increases its efficiency.
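The idea of blocking in several dimensions at once can be sketched as follows. This is not the MultiBlock implementation; it is a minimal grid index under the assumption that every compared property is numeric and that two entities can only match if they differ by at most a fixed threshold in each property. The names cell_of, build_index, and candidate_pairs, as well as the example data, are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product


def cell_of(values, thresholds):
    """Map an entity's property values to a grid cell (one axis per property)."""
    return tuple(math.floor(v / t) for v, t in zip(values, thresholds))


def build_index(entities, thresholds):
    """Index entities by grid cell; each entity is indexed independently."""
    index = defaultdict(list)
    for entity_id, values in entities.items():
        index[cell_of(values, thresholds)].append(entity_id)
    return index


def candidate_pairs(source, target, thresholds):
    """Yield (source_id, target_id) pairs from the same or adjacent cells.

    The cell width on each axis equals the threshold of that property, so two
    values that differ by at most the threshold fall into the same or a
    neighboring cell. No pair within all thresholds is therefore dismissed.
    """
    target_index = build_index(target, thresholds)
    offsets = list(product((-1, 0, 1), repeat=len(thresholds)))
    for source_id, values in source.items():
        cell = cell_of(values, thresholds)
        for offset in offsets:
            neighbor = tuple(c + o for c, o in zip(cell, offset))
            for target_id in target_index.get(neighbor, ()):
                yield source_id, target_id


# Hypothetical movie entities described by (release year, runtime in minutes).
source = {"s1": (1999, 120), "s2": (1950, 90)}
target = {"t1": (2000, 123), "t2": (1999, 200), "t3": (1951, 86)}
pairs = set(candidate_pairs(source, target, thresholds=(1, 5)))
```

With these example values, only two of the six possible pairs remain as candidates, because a pair has to be close in both dimensions at the same time. Blocking by release year alone would also keep the pair of s1 and t2, whose runtimes differ by 80 minutes.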
We further propose a distributed data flow for executing linkage rules efficiently on a cluster of machines. For this purpose, we employ the MultiBlock approach to segment the data sets into partitions, which can be executed on remote machines. MultiBlock enables the parallel indexing of entities on different machines as it generates indices for each entity independently and does not require any global preprocessing. In order to scale to large data sets, we apply the MapReduce paradigm for distributing and executing the partitions on multiple machines.
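A single-process sketch of how such a data flow could be phrased in MapReduce terms is given below. It is an illustration under assumptions, not the data flow proposed in this chapter: the functions map_entity, shuffle, reduce_block, and run are hypothetical, the block keys of each entity are assumed to be given, and is_match stands in for the evaluation of a linkage rule.

```python
from collections import defaultdict


def map_entity(dataset, entity_id, block_keys):
    """Map step: emit one (block key, tagged entity) record per block of the entity.

    Only the entity itself is needed, so entities can be mapped on different
    machines without any global preprocessing.
    """
    for key in block_keys:
        yield key, (dataset, entity_id)


def shuffle(records):
    """Group records by block key, as a MapReduce framework does between the steps."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return groups


def reduce_block(values, is_match):
    """Reduce step: compare the source and target entities that share a block."""
    sources = [entity for dataset, entity in values if dataset == "source"]
    targets = [entity for dataset, entity in values if dataset == "target"]
    for s in sources:
        for t in targets:
            if is_match(s, t):
                yield s, t


def run(source_blocks, target_blocks, is_match):
    """Simulate the map, shuffle, and reduce phases in one process."""
    records = []
    for entity_id, keys in source_blocks.items():
        records.extend(map_entity("source", entity_id, keys))
    for entity_id, keys in target_blocks.items():
        records.extend(map_entity("target", entity_id, keys))
    links = set()
    for values in shuffle(records).values():
        links.update(reduce_block(values, is_match))
    return links


# Hypothetical block keys; is_match accepts every pair that shares a block.
source_blocks = {"s1": [(0, 0)], "s2": [(5, 5)]}
target_blocks = {"t1": [(0, 0), (0, 1)], "t2": [(5, 5)], "t3": [(9, 9)]}
links = run(source_blocks, target_blocks, is_match=lambda s, t: True)
```

Each block can be reduced independently of all others, which is what allows the partitions to be distributed across machines. Collecting the links in a set removes duplicates that arise when two entities share more than one block.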
5.1 Scalability Challenges
When executing linkage rules on a local or distributed system, three challenges hinder scalability:
Quadratic Execution Time: Evaluating linkage rules for all pairs of entities scales quadratically.
Parallel Execution: The increasing hardware parallelism demands data flows that can be parallelized in order to utilize multiple computation cores at the same time.
Memory Constraints: Data sets may exceed the size of the available memory and thus cannot be held in memory at once.
In the following paragraphs, we describe each of these challenges and state how the proposed workflow accounts for each of them.