
6.4 Generating the Window of Neighboring Nodes

6.4.4 Duplicate-Based Adaptive Window (DySNI-d)

This third adaptive approach is based on [54]. The authors used an adaptive window whose size grows or shrinks based on the number of classified matches found within the window. The window slides over the static array from the first to the last record in the index, so that records across the whole data set are matched. The more matches are found, the larger the window becomes. However, if no matches, or only a small number of matches, are found for a query record in the window, this approach assumes that there are no more matches further away (because all records are sorted alphabetically according to a sorting key, similar records are likely to be close to each other), and therefore there is no need to increase the size of the window.

84 Dynamic Sorted Neighborhood Index for Real-Time Entity Resolution

Figure 6.6: The set of candidate records generated using the duplicate-based adaptive window approach (DySNI-d) described in Section 6.4.4 using an expansion threshold of σ = 0.6. [Figure content: index nodes `percysmith' (N1), `paulsmith' (N2), `abbybond' (N4), `pedrosmith' (N5), `robinstevens' (N3), `petersmith' (N6) and `sallytaylor' (N7), linked by next and prev pointers; initial window size w = 2; query record r10; candidate records r1, r2, r3, r4, r6, r7, r8, r9.]

In our approach (shown in Algorithm 6.4), we adaptively expand a window on each side of the query tree node based on the following steps.

When a query record arrives, and after it is inserted into the index data structure, a window of initial fixed size w ≥ 1 is generated. Note that this differs from the initial window size w = 0 of the similarity-based adaptive approach described before. A window size w ≥ 1 is required because the duplicate-based approach needs to compare candidate records with the query record to obtain a set of matching and non-matching record pairs. The query record is compared with all candidate records that are in the initial window. Records that have a similarity above a certain threshold with the query record are classified as matches, all others as non-matches. Assume that the number of classified matches is m out of a total of c candidate records compared with the query record, and that the expansion threshold (expansion ratio) is σ [54]. A window is expanded to the next tree node if the following holds:

$$\frac{m}{c} \geq \sigma \qquad (6.1)$$

In the same way as the similarity-based adaptive approach expands the window in each direction independently, the duplicate-based approach also calculates Equation 6.1 independently in the forward (next) and backward (prev) direction (as shown in Algorithm 6.4).
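The classification step and the expansion test of Equation 6.1 can be sketched in a few lines of Python. This is only an illustration: the function names, the example similarity scores and the match threshold of 0.8 are hypothetical, and the guard for an empty window (c = 0) is an assumption that the text does not discuss.

```python
from typing import Sequence


def count_matches(similarities: Sequence[float], threshold: float) -> int:
    """Count candidates whose similarity with the query record is above the threshold."""
    return sum(1 for s in similarities if s > threshold)


def should_expand(m: int, c: int, sigma: float) -> bool:
    """Equation 6.1: expand the window while m / c >= sigma."""
    if c == 0:
        # Assumption: a window without candidates never triggers an expansion.
        return False
    return m / c >= sigma


# Hypothetical similarity scores for three candidates in one direction.
scores = [0.92, 0.85, 0.40]
m = count_matches(scores, threshold=0.8)   # two scores are above 0.8
expand = should_expand(m, len(scores), sigma=0.6)
```

Because the test is evaluated separately for the next and the previous direction, a sketch like this would be called once per direction with that direction's own m and c.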


Algorithm 6.4: DySNI-d - generateWin(qj, Nqj, w, S, σ)

Input:
- Query record: qj
- Query node: Nqj
- Initial window size: w ≥ 1
- Similarity functions: S
- Expansion threshold: σ

Output:
- Candidate record set: C

// Window expansion in the (next) direction
C := Nqj.I                                  // Add record ids I within Nqj to C
c_n := getNextCandidates(Nqj, w)            // Get candidates from the next w nodes to Nqj
C := C ∪ c_n                                // Add candidates from the next w nodes to C
c_next := |c_n|                             // The number of candidates in the next direction
m_next := getNumMatches(c_n, qj, S)         // The number of matches in the next direction
next_nd := Nq(j+w).next                     // Get the next node after w neighboring nodes from Nqj (in the next direction)
while m_next / c_next ≥ σ do:               // Expand the window in the (next) direction while condition is true
    C := C ∪ next_nd.I                      // Add record ids I of next_nd to C
    next_nd := next_nd.next                 // Get the node in the next direction of next_nd
    m_next := getNumMatches(next_nd, qj, S) // Get the number of matches in the new next_nd
    c_next := getNumCandidates(next_nd)     // Get the number of candidates in the new next_nd

// Window expansion in the (previous) direction
c_p := getPrevCandidates(Nqj, w)            // Get candidates from the previous w nodes to Nqj
C := C ∪ c_p                                // Add candidates from the previous w nodes to C
c_prev := |c_p|                             // The number of candidates in the prev direction
m_prev := getNumMatches(c_p, qj, S)         // The number of matches in the previous direction
prev_nd := Nq(j−w).prev                     // Get the previous node before w neighboring nodes from Nqj (in the previous direction)
while m_prev / c_prev ≥ σ do:               // Expand the window in the (previous) direction while condition is true
    C := C ∪ prev_nd.I                      // Add record ids I of prev_nd to C
    prev_nd := prev_nd.prev                 // Get the node in the previous direction of prev_nd
    m_prev := getNumMatches(prev_nd, qj, S) // Get the number of matches in the new prev_nd
    c_prev := getNumCandidates(prev_nd)     // Get the number of candidates in the new prev_nd
Return C
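Algorithm 6.4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: the tree is reduced to a doubly linked list of nodes in sorted key order, `is_match` is a hypothetical stand-in for comparing a candidate with the query record using the similarity functions S, the query record itself is left out of C, and the `None` and empty-window guards are additions that stop the expansion at either end of the index. The record-to-node assignment in the usage lines is read from Figure 6.6 and is partly assumed.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass(eq=False)
class Node:
    key: str                      # sorting key value
    ids: List[str]                # record ids I stored in this node
    prev: Optional["Node"] = field(default=None, repr=False)
    next: Optional["Node"] = field(default=None, repr=False)


def link(nodes: List[Node]) -> List[Node]:
    """Connect nodes (already in sorted key order) with prev/next pointers."""
    for a, b in zip(nodes, nodes[1:]):
        a.next, b.prev = b, a
    return nodes


def generate_win(query_id: str, query_node: Node, w: int,
                 is_match: Callable[[str], bool], sigma: float) -> Set[str]:
    # C := Nqj.I (assumption: the query record itself is not a candidate)
    candidates = {rid for rid in query_node.ids if rid != query_id}
    # The same steps run independently in the next and the previous direction.
    for step in (lambda n: n.next, lambda n: n.prev):
        node = step(query_node)
        window_ids: List[str] = []
        for _ in range(w):        # collect the initial window of w nodes
            if node is None:
                break
            window_ids.extend(node.ids)
            node = step(node)
        candidates.update(window_ids)
        c = len(window_ids)
        m = sum(1 for rid in window_ids if is_match(rid))
        # Expand while Equation 6.1 holds; guards are assumptions of this sketch.
        while node is not None and c > 0 and m / c >= sigma:
            candidates.update(node.ids)
            node = step(node)
            if node is None:
                break
            m = sum(1 for rid in node.ids if is_match(rid))
            c = len(node.ids)
    return candidates


# Index of Figure 6.6 in sorted key order (record placement partly assumed).
nodes = link([
    Node("abbybond", ["r5"]), Node("paulsmith", ["r2"]),
    Node("pedrosmith", ["r4", "r9"]), Node("percysmith", ["r1"]),
    Node("petersmith", ["r7", "r10"]), Node("robinstevens", ["r3"]),
    Node("sallytaylor", ["r6", "r8"]),
])
assumed_matches = {"r1", "r4"}    # assumption: all other records are non-matches
C = generate_win("r10", nodes[4], 2, assumed_matches.__contains__, 0.6)
print(sorted(C))  # ['r1', 'r2', 'r3', 'r4', 'r6', 'r7', 'r8', 'r9']
```

As in the listing, the first expansion in a direction is decided by the ratio of the initial window, while each later expansion is decided by the ratio of the node that would be added next.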

Let us use the example in Figure 6.6, and assume that the initial window size is w = 2, the expansion threshold is σ = 0.6, and r10 is the query record. With w = 2, the previous window will initially include two tree nodes (`percysmith' and `pedrosmith') and the next window will also include two nodes (`robinstevens' and `sallytaylor'). The query node has one candidate record r7 (which is not included in the expansion ratio calculation), the window into the previous direction has three candidate records {r1, r4, r9}, and the window into the next direction also has three candidate records {r3, r6, r8}. Therefore, c = 3 in both directions.

In the forward (next) direction, the window cannot expand because the last node in the tree, N7, is already included in the initial window. However, the decision on whether the previous window needs to be expanded depends on the number of matches found in the window, according to Equation 6.1. Based on the full example records in Figure 6.1, assume that both r1 and r4 are matching records (so the number of matching records in the previous window is m = 2). Because 2/3 ≥ σ, the window expands in the backward (previous) direction to include N2. The expansion process continues until m/c < σ, which is reached after node N2 has been included in the window. Therefore, the final set of candidate records is C = {r1, r2, r3, r4, r6, r7, r8, r9}.

Figure 6.7: Example of the similarity-based dynamic sorted neighborhood index (SimDySNI) for the ten records from the table in Figure 6.1. The sorting key values are the concatenation of `Firstname' and `Surname' values, and the window size for pre-calculation of similarities is set as w = 2. The pre-calculated similarities are generated using the Jaro-Winkler similarity function [33]. [Figure content: index nodes N1 to N7 with their records, next and prev pointers, a list of pre-calculated neighbor similarities per node, and query record r10.]