This section introduces the basic data flow that we propose to execute linkage rules efficiently. Figure 5.1 illustrates the overall data flow of executing a linkage rule, which will be detailed in the following paragraphs. The execution of a linkage rule begins by retrieving all entities from the two data sources. Each retrieved entity is indexed and written to the cache. From the cache, pairs of entities that are potential matches are generated from the index. For each pair, the linkage rule is evaluated and a link is generated between each pair of entities that is found matching.
5.2.1 Indexing
Figure 5.1: Data Flow for executing linkage rules.
Footnote 1: The working set of a process is the data which must be held in memory at once for
Since the evaluation of the linkage rule is computationally expensive, the goal of the indexing phase is to dismiss definitive non-matches prior to comparison. This is achieved by assigning a set of indices to each entity by which later phases can identify definitive non-matches. We represent an index as a vector of natural numbers, i.e., as an element from the set $\mathbb{N}^n$. The overall data flow is independent of a concrete indexing function. Given a linkage rule l, a suitable indexing function assigns a set of indices to each entity from the input data sets A and B (footnote 2):
$\mathit{index}_l : A \cup B \rightarrow \mathcal{P}(\mathbb{N}^n)$
A suitable indexing function further adheres to the property that entities which are matches according to the linkage rule share at least one index:
$l(e_1, e_2) \geq 0.5 \implies \mathit{index}_l(e_1) \cap \mathit{index}_l(e_2) \neq \emptyset$
given two entities e1 and e2.
Footnote 2: $\mathcal{P}(\mathbb{N}^n)$ denotes the power set of $\mathbb{N}^n$.
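The property that matching entities share at least one index can be illustrated with a minimal Python sketch. The entities, the toy linkage rule, and the toy indexing function below are hypothetical and chosen only for illustration; they are not the MultiBlock indexing described in Section 5.3.

```python
from itertools import product

# Hypothetical entities as (identifier, year) tuples; not taken from the original work.
A = [("a1", 1990), ("a2", 2001)]
B = [("b1", 1991), ("b2", 2010)]


def linkage_rule(e1, e2):
    """Toy rule: similarity decreases linearly with the year distance.

    A distance of 0 yields 1.0, a distance of 1 yields 0.5, and a
    distance of 2 or more yields 0.0.
    """
    return max(0.0, 1.0 - abs(e1[1] - e2[1]) / 2.0)


def index(entity):
    """Toy indexing function returning a set of 1-dimensional index vectors.

    Each entity receives the index of its own year and of the following
    year, so two entities whose years differ by at most 1 share an index.
    """
    year = entity[1]
    return {(year,), (year + 1,)}


def satisfies_property(entities_a, entities_b):
    """Check that every pair with similarity >= 0.5 shares at least one index."""
    return all(
        index(e1) & index(e2)
        for e1, e2 in product(entities_a, entities_b)
        if linkage_rule(e1, e2) >= 0.5
    )
```

In this sketch, `satisfies_property(A, B)` holds because the only matching pair, `a1` and `b1`, shares the index `(1991,)`.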
In the context of this work, we propose the MultiBlock approach for indexing. The basic idea of MultiBlock is to generate an index for each entity with the goal of assigning the same index to entities that are potential matches and a different index to entities that are definitive non-matches. Section 5.3 will describe MultiBlock in detail.
5.2.2 Caching
After an entity has been retrieved and indexed, it is written to a cache. Based on their index, entities are distributed into blocks. The idea is that entities that share the same index are written to the same block. The number of blocks can be configured and is usually smaller than the number of possible indices. Thus, entities with different indices might end up in the same block. Entities are assigned a number of blocks based on the following function (footnote 3):
$\mathit{block}_l(e) = \{ \mathit{flatten}(b) \bmod \mathit{numBlocks} \mid b \in \mathit{index}_l(e) \}$
Blocks that are bigger than a configured maximum size are further split into partitions. In our experiments, we set the maximum number of blocks to 100 and split blocks into multiple partitions if they exceeded a size of 10,000 entities. These parameters worked well for data sets of different sizes.
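The block assignment and the splitting into partitions can be sketched in Python as follows. This is only an illustration under stated assumptions: the text requires merely that the flattening function be injective, so the Cantor pairing used here is one possible choice and not necessarily the one used in our implementation; it assumes that all index vectors have the same length n ≥ 1. The names `cantor_pair`, `blocks`, and `split_into_partitions` are hypothetical.

```python
from functools import reduce

NUM_BLOCKS = 100             # number of blocks used in the experiments
MAX_PARTITION_SIZE = 10_000  # maximum number of entities per partition


def cantor_pair(a, b):
    """Cantor pairing function, a bijection from pairs of naturals to naturals."""
    return (a + b) * (a + b + 1) // 2 + b


def flatten(index_vector):
    """Flatten an index vector into a scalar by folding the pairing function.

    Assumption: all vectors have the same length n >= 1, which makes the
    fold injective.
    """
    return reduce(cantor_pair, index_vector)


def blocks(indices, num_blocks=NUM_BLOCKS):
    """Map the set of indices of one entity to the set of its block numbers."""
    return {flatten(vector) % num_blocks for vector in indices}


def split_into_partitions(block_entities, max_size=MAX_PARTITION_SIZE):
    """Split the entity list of one block into partitions of at most max_size."""
    return [
        block_entities[i:i + max_size]
        for i in range(0, len(block_entities), max_size)
    ]
```

Because the block number is taken modulo the number of blocks, two different indices may map to the same block, which matches the observation that entities with different indices might end up in the same block.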
Each pair of partitions from the same block is now sent to the next phase in order to generate the comparison pairs. As a pair of partitions can be held in memory, the comparison pairs can be generated efficiently, as explained in the next section. On a distributed system, a pair of partitions can also be sent to another machine for matching.
Figure 5.2 shows an example cache.
Figure 5.2: Example cache.
Footnote 3: $\mathit{flatten} : \mathbb{N}^n \rightarrow \mathbb{N}$ flattens an index vector into a scalar value. Every injective function is suitable, that is, every function which preserves the distinctness of the indices.
5.2.3 Generating Comparison Pairs
In order to generate all comparison pairs for which the linkage rule is evaluated, we select all pairs of partitions from the same block. For each of these partition pairs, a comparison pair is generated for each pair of entities which share the same index. More formally, for a pair of partitions $P_a$ and $P_b$, the comparison pairs are generated according to:
$\{ (e_a, e_b) \mid i_a = i_b,\ i_a \in \mathit{index}_l(e_a),\ i_b \in \mathit{index}_l(e_b),\ e_a \in P_a,\ e_b \in P_b \}$
These pairs are then evaluated using the linkage rule to compute the exact similarity and determine the actual links.
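The generation of comparison pairs for one pair of partitions can be sketched in Python as follows. The sketch assumes that entities are hashable and that the indexing function is passed in as a callable. Building an inverted index over one partition is an implementation choice of this sketch, not something prescribed by the definition above. The entity names and indices in the usage example are hypothetical.

```python
from collections import defaultdict


def comparison_pairs(partition_a, partition_b, index):
    """Return all pairs (e_a, e_b) whose index sets have a common element."""
    # Inverted index: index vector -> entities of partition_b that carry it.
    by_index = defaultdict(list)
    for e_b in partition_b:
        for i in index(e_b):
            by_index[i].append(e_b)

    # A set avoids duplicates when two entities share more than one index.
    pairs = set()
    for e_a in partition_a:
        for i in index(e_a):
            for e_b in by_index.get(i, ()):
                pairs.add((e_a, e_b))
    return pairs


# Hypothetical indices of four entities, given as sets of 1-dimensional vectors.
indices = {
    "a1": {(1,), (2,)},
    "a2": {(7,)},
    "b1": {(2,), (3,)},
    "b2": {(9,)},
}
pairs = comparison_pairs(["a1", "a2"], ["b1", "b2"], indices.__getitem__)
```

In the usage example, only `a1` and `b1` share an index, so `pairs` contains the single pair `("a1", "b1")` and the linkage rule is evaluated once instead of four times.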
Figure 5.3 illustrates how two caches are compared.
Figure 5.3: Example cache
5.2.4 Matching
The matching phase evaluates the linkage rule for each comparison pair. For each pair of entities for which the similarity according to the linkage rule is above a certain threshold, a link is generated.
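A minimal Python sketch of the matching phase is given below. It assumes that the linkage rule is a callable returning a similarity value and uses the threshold of 0.5 from the indexing property in Section 5.2.1 as a default; the representation of a link as a triple of source entity, target entity, and similarity is an assumption of this sketch.

```python
def match(pairs, linkage_rule, threshold=0.5):
    """Evaluate the linkage rule on each comparison pair and collect links.

    A link is represented as (source entity, target entity, similarity).
    """
    links = []
    for e_a, e_b in pairs:
        similarity = linkage_rule(e_a, e_b)
        if similarity >= threshold:
            links.append((e_a, e_b, similarity))
    return links
```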
5.2.5 Filtering
In many data sources the assumption can be made that there are no duplicates inside a single data source itself, i.e., for each real-world object the data source contains no more than one entity. In that case, generating more than one link between two data sources that all share the same source entity but target different entities in the other data source means that at least one link is incorrect, as the transitive closure of the links would imply that both target entities are referring to the same real-world entity.
In order to handle cases like this, a link limit can be supplied. The link limit defines the maximum number of links originating from a single entity. Only the n highest-rated links per source data item will remain after the filtering. If no limit is provided, all links per entity will be returned.
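The link limit can be sketched in Python as follows. The sketch assumes that a link is a triple of source entity, target entity, and similarity; the function name `filter_links` and this representation are hypothetical.

```python
import heapq
from collections import defaultdict


def filter_links(links, link_limit=None):
    """Keep only the link_limit highest-rated links per source entity.

    If no limit is given, all links are returned unchanged.
    """
    if link_limit is None:
        return list(links)

    # Group the links by their source entity.
    by_source = defaultdict(list)
    for link in links:
        by_source[link[0]].append(link)

    # Select the highest-rated links of each source entity by similarity.
    filtered = []
    for source_links in by_source.values():
        filtered.extend(
            heapq.nlargest(link_limit, source_links, key=lambda link: link[2])
        )
    return filtered
```

With a link limit of 1, a source entity that was linked to two target entities keeps only the link with the higher similarity, which reflects the assumption that the target data source contains no duplicates.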