9.3 Indexing Techniques

9.3.1 Scalability

We compare the scalability of the different indexing techniques on the NC data set. The CCA data sets are not used for scalability testing because the QGI technique has been shown to be slow on very large data sets. We measure the average time required to insert a single record into an index data structure, and the average time required to resolve a single query record, as the index grows. 10% of the NC data set is used to build the indexes, and the remaining records are treated as query records (each query record is inserted into the index upon arrival).
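The measurement protocol described above can be sketched as a small timing harness. This is only an illustration under stated assumptions: `DictIndex` is a hypothetical toy index with exact-match lookup, and `run_benchmark`, `build_fraction` and `batch` are names introduced here, not part of the evaluated implementations. Any object offering `insert` and `query` methods could be plugged in instead.

```python
import time
from statistics import mean


class DictIndex:
    """Toy stand-in index (hypothetical): exact-match lookup on a key."""

    def __init__(self):
        self.blocks = {}

    def insert(self, rec_id, key):
        self.blocks.setdefault(key, []).append(rec_id)

    def query(self, key):
        return list(self.blocks.get(key, []))


def run_benchmark(index, records, build_fraction=0.1, batch=1000):
    """Return (insert_ms, query_ms): per-batch average times in milliseconds.

    records is a list of (record id, key) pairs in arrival order.
    """
    n_build = int(len(records) * build_fraction)
    insert_times, query_times = [], []

    # Build phase: the first fraction of the records is only inserted.
    for rec_id, key in records[:n_build]:
        t0 = time.perf_counter()
        index.insert(rec_id, key)
        insert_times.append((time.perf_counter() - t0) * 1000.0)

    # Query phase: each remaining record is queried, then inserted,
    # so the index keeps growing while it is being queried.
    for rec_id, key in records[n_build:]:
        t0 = time.perf_counter()
        index.query(key)
        query_times.append((time.perf_counter() - t0) * 1000.0)

        t0 = time.perf_counter()
        index.insert(rec_id, key)
        insert_times.append((time.perf_counter() - t0) * 1000.0)

    def batch_avg(times):
        # Average consecutive measurements so trends over index size are visible.
        return [mean(times[i:i + batch]) for i in range(0, len(times), batch)]

    return batch_avg(insert_times), batch_avg(query_times)
```

Plotting the two returned lists against the number of inserted records would give curves of the kind shown in Figure 9.1, although absolute timings depend entirely on the index implementation and the hardware.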

• DySimII: We use the Double-Metaphone [33] encoding function to encode the attribute values when building the index, as described in Section 5.3.2.

• DySNI: We use a concatenation of the 'Surname' and 'Firstname' attributes as a sorting key (SK) to build the index data structure, and we use the similarity-based adaptive window approach to generate the set of candidate records, with a similarity threshold of θ = 0.8, where 0 ≤ θ ≤ 1. Building the index and generating the set of candidate records are described in detail in Sections 6.3.2 and 6.4.3 respectively. Note that we did not use single attribute values as SKs because they are not suitable for real-time ER (based on the results presented in Chapter 7).

• F-DySNI(2): We build two trees in the index data structure. For the first tree we use the 'Firstname+Surname' concatenation as a SK (the same SK used for DySNI), and for the second tree we use the 'Surname+Firstname' concatenation. We use the same window approach and similarity threshold as in DySNI to generate the set of candidate records.

Figure 9.1: Plots (a) and (b) show the average time (in ms, log scale) required to insert a single record into the index, plotted against the record insertion number (0 to 8M), for the compared indexing techniques. The results are split over two plots to improve readability: plot (a) shows DySimII and QGI, and plot (b) shows F-DySNI(3), F-DySNI(2) and DySNI. Plot (c) illustrates the average time (in ms, log scale) required to query the growing index. The results for the QGI technique are not complete because the technique was slow and it was not feasible to finish running the experiment. The full NC data set (described in Section 4.5) is used to build the indexes (M = million).

• F-DySNI(3): We build three trees in the index data structure. For the first and second trees we use the same SKs as in F-DySNI(2), and for the third tree we use the 'Firstname+City' concatenation as a SK. We use the same window approach and similarity threshold as in DySNI to generate the set of candidate records.

• QGI: We use q-grams of length q = 2 to convert the attribute values of each record in the data set into a list of q-grams, and each unique q-gram becomes a key in the inverted index. How the index operates is explained in detail in Section 4.3.
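To make the QGI configuration concrete, the following is a minimal sketch of a generic q-gram inverted index with q = 2. It is not the implementation detailed in Section 4.3: the class `QGramIndex`, the unpadded q-gram extraction, and the ranking of candidates by the number of shared q-grams are all assumptions made for illustration.

```python
from collections import defaultdict


def qgrams(value, q=2):
    """Return the overlapping q-grams of a string (lower-cased, no padding)."""
    value = value.lower()
    if len(value) < q:
        return [value] if value else []
    return [value[i:i + q] for i in range(len(value) - q + 1)]


class QGramIndex:
    """Inverted index: each unique q-gram is a key pointing to record ids."""

    def __init__(self, q=2):
        self.q = q
        self.postings = defaultdict(set)

    def insert(self, rec_id, value):
        for gram in set(qgrams(value, self.q)):
            self.postings[gram].add(rec_id)

    def candidates(self, value):
        """Return {record id: number of unique q-grams shared with value}."""
        counts = defaultdict(int)
        for gram in set(qgrams(value, self.q)):
            for rec_id in self.postings.get(gram, ()):
                counts[rec_id] += 1
        return dict(counts)


index = QGramIndex(q=2)
index.insert(1, "smith")
index.insert(2, "smyth")
index.insert(3, "jones")
print(index.candidates("smith"))  # record 1 shares 4 bigrams, record 2 shares 2
```

In a sketch like this, a query touches one posting list per q-gram of the query value, and posting lists for common q-grams grow with the index, which is one plausible reason for query times to rise as more records are inserted.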

Figure 9.1 illustrates the scalability results for all compared techniques. Note that the average query times for the QGI technique (in plot (c)) are not complete because it was not feasible to run the ER process for the full 8 million records: the technique was slow and required high query times (an average of around 1.5 seconds for the first 1.5 million records).

Plots (a) and (b) in Figure 9.1 present the average insertion time required by all compared techniques (we split the results over two plots to improve readability). The results show that the average insertion times for all compared techniques are almost constant, that is, they are not affected by the growing size of the index data structure. The average insertion times range from 0.05 to 0.4 milliseconds (ms). DySimII achieves an average insertion time of around 0.4 ms, F-DySNI(3) around 0.1 ms, F-DySNI(2) around 0.08 ms, and both DySNI and QGI around 0.05 ms. The results presented in Figure 9.1 (a) and (b) confirm that, for the compared techniques, inserting a record into the index data structure scales to large data sets.

The average query times achieved by all compared techniques are presented in Figure 9.1 (c). The plot shows that the average query times for all techniques increase sub-linearly as the index becomes larger. However, the QGI technique has a high average query time (around 1.5 seconds), which makes it unsuitable for real-time ER and not scalable to large data sets.

The fastest technique is DySNI, which achieves an average query time of 1.15 ms, followed by F-DySNI(2) and F-DySNI(3), which achieve average query times of 1.9 ms and 15.7 ms respectively. Query times increase with the number of trees in the index data structure because having more trees increases the number of candidate records, which in turn increases the average query time. The slowest of the proposed techniques is DySimII, with an average query time of 225 ms (although it is the slowest, it is still fast enough to be used for real-time ER).
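The effect of adding trees on the candidate set can be illustrated with a small sketch. It rests on several assumptions that are not taken from the thesis: a sorted Python list stands in for each tree, `difflib.SequenceMatcher` stands in for the similarity function, the window is expanded in both directions while neighbouring sorting keys have similarity of at least θ to the query key, and the field names `first`, `sur` and `city` are hypothetical. The actual index and window generation are those described in Sections 6.3.2 and 6.4.3.

```python
import bisect
from difflib import SequenceMatcher


def sim(a, b):
    """Stand-in string similarity in [0, 1] (assumption, not the thesis's)."""
    return SequenceMatcher(None, a, b).ratio()


class SortedKeyIndex:
    """Sorted list of (sorting key, record id) pairs; stand-in for one tree."""

    def __init__(self, key_fn):
        self.key_fn = key_fn
        self.entries = []

    def insert(self, rec_id, record):
        bisect.insort(self.entries, (self.key_fn(record), rec_id))

    def candidates(self, record, theta=0.8):
        key = self.key_fn(record)
        pos = bisect.bisect_left(self.entries, (key,))
        found = set()
        i = pos - 1  # expand the window to the left while keys stay similar
        while i >= 0 and sim(key, self.entries[i][0]) >= theta:
            found.add(self.entries[i][1])
            i -= 1
        i = pos  # expand the window to the right while keys stay similar
        while i < len(self.entries) and sim(key, self.entries[i][0]) >= theta:
            found.add(self.entries[i][1])
            i += 1
        return found


class MultiTreeIndex:
    """Several sorted-key indexes; candidates are the union over all of them."""

    def __init__(self, key_fns):
        self.trees = [SortedKeyIndex(fn) for fn in key_fns]

    def insert(self, rec_id, record):
        for tree in self.trees:
            tree.insert(rec_id, record)

    def candidates(self, record, theta=0.8):
        found = set()
        for tree in self.trees:
            found |= tree.candidates(record, theta)
        return found


# Sorting keys mirroring the configurations described earlier.
first_sur = lambda r: r["first"] + r["sur"]
sur_first = lambda r: r["sur"] + r["first"]
first_city = lambda r: r["first"] + r["city"]

data = {
    1: {"first": "john", "sur": "smith", "city": "sydney"},
    2: {"first": "jon", "sur": "smith", "city": "sydney"},
    3: {"first": "john", "sur": "brown", "city": "sydney"},
    4: {"first": "mary", "sur": "jones", "city": "perth"},
}
two_trees = MultiTreeIndex([first_sur, sur_first])
three_trees = MultiTreeIndex([first_sur, sur_first, first_city])
for rec_id, rec in data.items():
    two_trees.insert(rec_id, rec)
    three_trees.insert(rec_id, rec)

query = {"first": "john", "sur": "smith", "city": "sydney"}
print(sorted(two_trees.candidates(query)))    # [1, 2]
print(sorted(three_trees.candidates(query)))  # [1, 2, 3]
```

In this toy data the third sorting key pulls in record 3, which neither name-based key retrieves, so the three-tree index returns a larger candidate set that would then need to be compared with the query record.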

The presented results confirm that the proposed indexing techniques are suitable for real-time ER (where query records need to be matched in sub-second times) and are scalable to large data sets. The effectiveness and efficiency of the compared techniques are evaluated next.
