Baseline Approaches - Indexing techniques for real-time entity resolution

As described in Chapter 1, in this thesis we focus on the indexing step of the ER process. In the next four chapters we propose four different approaches that are categorized into indexing and learning:

1. Indexing: In Chapters 5, 6 and 7 we propose three dynamic indexing techniques that work with real-time ER. As discussed in Chapter 3, most of the indexing techniques currently used in the ER process are static and designed solely for batch matching algorithms that work offline. This made it difficult to select a baseline indexing technique that supports the query-based matching required for real-time ER. We therefore use a q-gram indexing (QGI) technique, which we derived by modifying the static indexing techniques from [11, 107] to produce a dynamic index that can be used with real-time ER, as described below. We compare this technique with our proposed dynamic indexes.

The Q-Gram Indexing (QGI) technique is a q-gram based inverted index [11, 34, 107] that converts the attribute values of each record in a data set into a list of q-grams (sub-strings of length q). Each unique q-gram becomes a key


RecID   Firstname   Q-grams (q = 2)
r1      peter       ['pe', 'et', 'te', 'er']
r2      smith       ['sm', 'mi', 'it', 'th']
r3      pedro       ['pe', 'ed', 'dr', 'ro']
r4      pete        ['pe', 'et', 'te']

QGI key   Records
pe        r1, r3, r4
et        r1, r4
te        r1, r4
er        r1
sm        r2
mi        r2
it        r2
th        r2
ed        r3
dr        r3
ro        r3

Query record: r4. Candidate records = {r1, r3}

Figure 4.2: The Q-gram index (QGI) that is used as a baseline in Chapter 9.

in the inverted index, and its corresponding value is the list of all records in the data set that have this q-gram in their attribute values. To match a query record against the q-gram inverted index, its attribute values are converted into a q-gram list, and the query is then compared only with records that share a certain number of q-grams with it. The approach returns a list of all records whose Jaccard-based similarity [33] with the query record is greater than the minimum similarity threshold.
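The index construction and lookup described above can be sketched in a few lines of Python. This is a minimal illustration and not the thesis implementation: the class name `QGramIndex`, its method names, and the choice to index a single string attribute are assumptions made for the example, and thresholding is left to the caller.

```python
from collections import defaultdict


def qgrams(value, q=2):
    """Return the overlapping substrings of length q, in order."""
    return [value[i:i + q] for i in range(len(value) - q + 1)]


class QGramIndex:
    """Hypothetical dynamic q-gram inverted index over one string attribute."""

    def __init__(self, q=2):
        self.q = q
        self.index = defaultdict(set)  # q-gram -> ids of records containing it
        self.grams = {}                # record id -> set of its q-grams

    def insert(self, rec_id, value):
        """Add a record to every inverted-index list keyed by one of its q-grams."""
        grams = set(qgrams(value, self.q))
        self.grams[rec_id] = grams
        for gram in grams:
            self.index[gram].add(rec_id)

    def query(self, rec_id, value):
        """Insert the query record, then score every record sharing a q-gram."""
        self.insert(rec_id, value)
        grams = self.grams[rec_id]
        candidates = set()
        for gram in grams:
            candidates |= self.index[gram]
        candidates.discard(rec_id)  # a record is not its own candidate
        # Jaccard similarity: shared q-grams divided by all distinct q-grams.
        return {
            cand: len(grams & self.grams[cand]) / len(grams | self.grams[cand])
            for cand in candidates
        }


idx = QGramIndex(q=2)
for rec_id, name in [("r1", "peter"), ("r2", "smith"), ("r3", "pedro")]:
    idx.insert(rec_id, name)

similarities = idx.query("r4", "pete")
for cand in sorted(similarities):
    print(cand, round(similarities[cand], 2))
```

With the records of Figure 4.2, the query for r4 scores only r1 and r3, because r2 shares no q-gram with 'pete'. Keeping a per-record q-gram set alongside the inverted lists is a design choice of this sketch that avoids re-tokenizing candidates at query time.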

Figure 4.2 illustrates an example of the QGI approach. Assume that r4 is a query record that must be matched against the existing index in the figure using an overall similarity threshold of t = 0.75. First, the 'Firstname' attribute value of r4 ('pete') is converted into the q-grams ['pe', 'et', 'te']. Then, the record identifier r4 of the query record is inserted into the inverted index by adding it to the record lists of all keys that occur in its q-gram list (i.e. 'pe', 'et', and 'te'). The list of candidate records (C) that will be compared in detail with the query record is generated by taking all records that have at least one q-gram in common with the query record. In the example in Figure 4.2, only r1 and r3 share q-grams with the query record r4. These records are then compared in detail with the query record r4 using the Jaccard-based similarity measure:

$$\mathrm{sim}_{jaccard}(q_j, r_i) = \frac{|Qgram(q_j) \cap Qgram(r_i)|}{|Qgram(q_j) \cup Qgram(r_i)|} \tag{4.4}$$

The Jaccard similarities between the query record r4 and the candidate records in C = {r1, r3} are calculated as follows: sim_jaccard(r4, r1) = 3/4 = 0.75, and sim_jaccard(r4, r3) = 1/6 ≈ 0.17. A record is considered a match if sim_jaccard(qj, ri)


2. Learning: In Chapter 8 we propose an unsupervised learning algorithm that automatically selects optimal blocking keys for building indexes that can be used in real-time ER. We compare our proposed blocking key learning algorithm with the following baseline:

The Fisher DisJunctive (FDJ) technique is an unsupervised algorithm for learning blocking keys to be used with indexing techniques [89]. We selected the recently proposed FDJ approach as our baseline since it was shown in [89] to outperform two of the major supervised blocking key learning approaches, proposed by Bilenko et al. [17] and Michelson and Knoblock [110]. This baseline algorithm consists of two phases. In the first phase, the algorithm generates a weakly labeled training data set, using a TF-IDF weighting scheme to calculate the similarity between record pairs (rx, ry) ∈ R as follows. A lower threshold l and an upper threshold u, with 0 < l < u < 1, are used to generate the training data sets. Record pairs (rx, ry) that have a TF-IDF similarity value sim(rx, ry) below l are labeled as negative matches, and all pairs that have a TF-IDF value above u are labeled as positive matches (how TF-IDF values are calculated is described in more detail in Chapter 8).
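The weak labelling step of the first phase can be illustrated with a short Python sketch. It rests on several assumptions that are not taken from the thesis or from [89]: records are plain strings split on whitespace, the IDF weight is log(1 + N/df), and pair similarity is the cosine of the two TF-IDF vectors. The function names are hypothetical. Only the thresholding rule (below l is negative, above u is positive, anything in between is left unlabeled) follows the description above.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(records):
    """Map each record id to a {token: tf * idf} vector (assumed weighting)."""
    counts = {rid: Counter(text.lower().split()) for rid, text in records.items()}
    n = len(records)
    # Document frequency: in how many records each token occurs.
    df = Counter(tok for c in counts.values() for tok in c)
    return {
        rid: {tok: tf * math.log(1 + n / df[tok]) for tok, tf in c.items()}
        for rid, c in counts.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
        math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def weak_labels(records, lower, upper):
    """Label pairs below `lower` as negative and pairs above `upper` as positive."""
    vectors = tfidf_vectors(records)
    positives, negatives = [], []
    for a, b in combinations(sorted(records), 2):
        sim = cosine(vectors[a], vectors[b])
        if sim > upper:
            positives.append((a, b))
        elif sim < lower:
            negatives.append((a, b))
        # Pairs with lower <= sim <= upper stay unlabeled.
    return positives, negatives


records = {
    "r1": "peter smith sydney",
    "r2": "peter smith sydney",
    "r3": "maria garcia perth",
}
positives, negatives = weak_labels(records, lower=0.2, upper=0.8)
print("positive pairs:", positives)
print("negative pairs:", negatives)
```

Enumerating all pairs is quadratic in the number of records, so this form is only suitable for small illustrative data sets.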

In the second phase, the FDJ algorithm uses the generated labeled training data sets to learn the optimal blocking keys using a Fisher discrimination criterion [55]. The Fisher score is used to rank the candidate blocking keys, and the optimal blocking key (the key with the highest Fisher score) is then selected. The FDJ algorithm only considers key coverage (defined as the number of record pairs that evaluate to the same key value) when calculating Fisher scores and selecting optimal blocking keys. More details about generating the training data sets and the calculation of key coverage can be found in Chapter 8.
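To make the ranking step concrete, the following Python sketch scores candidate blocking keys with the textbook two-class Fisher criterion, (mean_pos - mean_neg)^2 / (var_pos + var_neg). This formula is an assumption: the exact criterion used by FDJ is defined in [89] and Chapter 8 and may differ. The feature for each pair (1 if both records produce the same key value, else 0), the candidate key functions, and all names are likewise hypothetical.

```python
from statistics import fmean, pvariance


def fisher_score(pos, neg, eps=1e-9):
    """Textbook two-class Fisher criterion for a single feature (assumed form).

    eps keeps the score finite when both classes have zero variance.
    """
    return (fmean(pos) - fmean(neg)) ** 2 / (pvariance(pos) + pvariance(neg) + eps)


def rank_blocking_keys(records, positives, negatives, candidate_keys):
    """Return (key name, score) pairs sorted from highest to lowest score."""
    scores = {}
    for name, key in candidate_keys.items():
        # Feature value per pair: 1.0 if both records evaluate to the same key.
        pos = [float(key(records[a]) == key(records[b])) for a, b in positives]
        neg = [float(key(records[a]) == key(records[b])) for a, b in negatives]
        scores[name] = fisher_score(pos, neg)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


records = {"r1": "peter", "r2": "pete", "r3": "pedro", "r4": "smith"}
positives = [("r1", "r2")]                              # weakly labeled matches
negatives = [("r1", "r4"), ("r2", "r4"), ("r1", "r3")]  # weakly labeled non-matches
candidate_keys = {
    "first2": lambda value: value[:2],  # first two characters
    "first3": lambda value: value[:3],  # first three characters
}

ranking = rank_blocking_keys(records, positives, negatives, candidate_keys)
best_key = ranking[0][0]
print("best key:", best_key)
```

In this toy data the three-character prefix separates the labeled pairs perfectly, while the two-character prefix also groups the non-matching pair (r1, r3), so the longer prefix ranks first.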
