Data Sets - Indexing techniques for real-time entity resolution

To evaluate different aspects of our proposed approaches, we used both real and synthetic data sets. Table 4.1 summarizes these data sets.

Table 4.1: Data sets summary. `No. Records' is the number of records in a data set. `No. Duplicates' is the number of records that are represented more than once in a data set. `No. Entities' is the number of real-world entities that exist in the data set. `Ave Duplicates' represents the average number of duplicates per entity in a data set. Note that each entity in a data set can be represented by one or more records.

Data set        Provenance        No. Records   No. Duplicates   No. Entities   Ave Duplicates
NC              Real              7,997,234     150,089          7,847,145      3.0
CCA-1           Real              689,928       CNF              CNF            CNF
CCA-3           Real              2,064,823     CNF              CNF            CNF
CCA-10          Real              6,900,163     CNF              CNF            CNF
CCA-30          Real              20,708,303    CNF              CNF            CNF
Cora            Real              1,295         1,183            112            23.3
DBLP/ACM        Real              4,910         2,224            2,686          1.5
OZ-(1,2,3,4)    Real (modified)   345,876       172,938          172,938        3.5
Febrl-5         Synthetic         100,000       80,000           20,000         5.0
Febrl-10        Synthetic         100,000       90,000           10,000         10.0
Febrl-20        Synthetic         100,000       95,000           5,000          20.0

CNF: Confidential (unless specified otherwise).

• NC data set (available from ftp://alt.ncsbe.gov/data/): a large real voter registration data set from the US state of North Carolina (NC) that contains the names, addresses, and ages of around 8 million voters, as well as their voter registration numbers (the attributes used are `Firstname', `Surname', `City', and `Zipcode'). Each record has a time-stamp attached, which corresponds to the date a voter originally registered or the date any of their details changed. This data set therefore contains realistic temporal information about a large number of people. We identified 142,673 individuals with two records, 3,566 with three, and 92 with four records in this data set. This data set is used for scalability evaluation.

• CCA data set: a confidential commercial data set that contains the names and addresses of tens of millions of individuals, as well as a log file of query records against this data set. To evaluate the scalability of our proposed approaches, we generated four subsets of different sizes by randomly selecting records from the full CCA data set. The first subset (CCA-1) contains 689,928 data set records and 50,190 query records, the second subset (CCA-3) contains 2,064,823 data set records and 151,343 query records, the third subset (CCA-10) contains 6,900,163 data set records and 504,226 query records, and the last subset (CCA-30) contains 20,708,303 data set records and 1,513,233 query records. The larger subsets contain roughly 3, 10, and 30 times as many records as CCA-1, respectively.

• OZ-x data sets: We generated four data sets with various corruption ratios using the GeCo data generator and corrupter [149], in order to investigate how different levels of data quality in attribute values affect matching quality. Each of the four data sets contains 345,876 records of personal details (`Firstname', `Surname', `Suburb', and `Postcode') selected randomly from a clean Australian telephone directory, and modified by adding duplicate records with randomly corrupted attribute values based on typing, scanning, and OCR errors, or phonetic variations. The `x' refers to the number of corrupted attributes in the data set, ranging from OZ-1 to OZ-4. For example, in OZ-1 the added duplicates have been corrupted by modifying only one attribute, while in OZ-4 the added duplicates have been corrupted by modifying all four attributes in a record. Each entity is represented on average by 3.5 duplicates. These data sets are used to evaluate how different levels of noise (i.e. different data quality) in a data set affect the performance of the proposed approaches.

• Febrl data sets: We generated three fully synthetic data sets using the Febrl data generator [37], specifying the average number of records per entity (person). Each of the three data sets contains 100,000 records consisting of name and address attributes. In the first data set (named Febrl-5) each entity is on average represented by 5 records (with a maximum of 8 records per entity), in the second data set (named Febrl-10) each entity is on average represented by 10 records (with a maximum of 15 records per entity), and in the third data set (named Febrl-20) each entity is on average represented by 20 records (with a maximum of 30 records per entity). Records were generated by first creating an `original' record for an entity, and then applying various modifications, such as keyboard edits, phonetic and OCR modifications, and setting values to missing, to generate `duplicate' records. These data sets are used to evaluate the effect of different numbers of duplicates in a data set on the proposed approaches.

• Cora and DBLP/ACM [96]: Both are real-world bibliographic gold standard data sets that are commonly used in ER research. Cora has 1,295 records and 112 entities (authors), while DBLP/ACM has 2,616/2,294 records (4,910 in total) and 2,686 entities (authors). For both data sets, we used the following attributes to conduct the ER process: `authors', `title', `venue', and `year'. These data sets are used to evaluate the proposed blocking/sorting key selection algorithm, as they are commonly used in this area.
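In every row of Table 4.1 where the counts are given, `No. Duplicates' equals `No. Records' minus `No. Entities'. The following is a minimal sketch under that reading; the function name `summarize`, the toy input, and the use of per-record entity identifiers are assumptions made for illustration and are not taken from the thesis. The `Ave Duplicates' column is left out because the table alone does not fix how it is computed.

```python
from collections import Counter


def summarize(entity_ids):
    """Return Table 4.1-style counts from one entity identifier per record."""
    sizes = Counter(entity_ids)  # entity id -> number of records for that entity
    n_records = len(entity_ids)
    n_entities = len(sizes)
    return {
        "records": n_records,
        "entities": n_entities,
        # Records beyond the first one of each entity.
        "duplicates": n_records - n_entities,
        "max_records_per_entity": max(sizes.values(), default=0),
    }


# Hypothetical toy input: six records that refer to three entities.
stats = summarize(["e1", "e1", "e1", "e2", "e2", "e3"])

# (records, duplicates, entities) copied from the non-confidential table rows.
TABLE_ROWS = {
    "NC": (7_997_234, 150_089, 7_847_145),
    "Cora": (1_295, 1_183, 112),
    "DBLP/ACM": (4_910, 2_224, 2_686),
    "OZ-(1,2,3,4)": (345_876, 172_938, 172_938),
    "Febrl-5": (100_000, 80_000, 20_000),
    "Febrl-10": (100_000, 90_000, 10_000),
    "Febrl-20": (100_000, 95_000, 5_000),
}
consistent = all(r - e == d for r, d, e in TABLE_ROWS.values())
```

The final check confirms that the relationship holds for all seven rows with published counts.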

4.6 Summary

In this chapter we described the framework that is used in evaluating the proposed approaches. We provided details on the evaluation measures, baselines, record pair comparison, implementation environment, and data sets we will use. The next four chapters describe the proposed approaches. We will empirically evaluate these approaches using the evaluation framework presented in this chapter.

Chapter 5

Dynamic Similarity-Aware Inverted Index for Real-Time Entity Resolution

As described in Chapter 3, there is a need for blocking-based indexing techniques that work with real-time ER. In this chapter we propose a dynamic blocking-based indexing technique that supports query-based matching in real-time. In Section 5.2 we summarize the notation that we use in this chapter. We then describe our proposed approaches in Sections 5.3 and 5.4. In Section 5.5 we analyze the proposed approaches by estimating the number of comparisons required to match query records, and in Section 5.6 we describe the experimental evaluation. Finally, we summarize our findings in Section 5.7.

5.1 Introduction

Blocking-based indexing techniques are commonly used in entity resolution (ER) [33] to reduce the search space by grouping similar records together using a blocking key criterion. However, as described in Chapter 3, most existing indexing techniques are static and only work with traditional ER, where two or more data sets are matched off-line using batch processing algorithms. Such indexing techniques cannot be used with real-time ER, where a stream of query records needs to be matched against an existing data set in real-time.
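The search-space reduction achieved by static blocking can be illustrated with a small sketch. The blocking key used here (the first two letters of the surname), the record layout, and all function names are hypothetical choices for illustration only; they are not the keys or code used in this thesis.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first two letters of the surname, lower-cased.
    return record["surname"][:2].lower()


def build_blocks(records):
    """Group record identifiers by their blocking key value."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    return blocks


def candidate_pairs(blocks):
    """Only records that share a block are compared with each other."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = [
    {"id": 1, "surname": "Smith"},
    {"id": 2, "surname": "Smyth"},
    {"id": 3, "surname": "Smithe"},
    {"id": 4, "surname": "Jones"},
    {"id": 5, "surname": "Johnson"},
]
blocks = build_blocks(records)
pairs = candidate_pairs(blocks)
all_pairs = len(records) * (len(records) - 1) // 2  # exhaustive comparison
```

With these five toy records, blocking yields two blocks and 4 candidate pairs instead of the 10 pairs an exhaustive comparison would need. The index is built once over a fixed data set, which is the batch setting the paragraph above describes as unsuitable for a stream of queries.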

In this chapter we propose a dynamic indexing technique that works with real-time ER on dynamic data sets. Our proposed technique is based on the similarity-aware inverted index proposed in [35]. We first propose a dynamic inverted index (named DySimII) that is updated after every query record by adding arriving query records into the index data structures, which keeps the index up-to-date at all times. Because this is a memory-based solution, it is important that the full index can fit into available memory. This is challenging for large data sets; therefore, we propose a frequency-based alteration (named DySimII-F) where we reduce the size of the index by inserting only the most frequent attribute values into the index data structures. The following sub-sections describe the proposed approaches in more detail.
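The query-then-insert behaviour described above can be sketched as follows. This is a simplified illustration and not the DySimII data structure itself: the class name, the single-attribute records, the first-letter blocking key, the 0.75 threshold, and the use of `difflib.SequenceMatcher` as the similarity function are all assumptions made for this example.

```python
from collections import defaultdict
from difflib import SequenceMatcher


class DynamicInvertedIndex:
    """Toy in-memory index that grows with every query record."""

    def __init__(self, min_sim=0.75):
        self.postings = defaultdict(set)  # attribute value -> record identifiers
        self.blocks = defaultdict(set)    # blocking key value -> attribute values
        self.min_sim = min_sim

    @staticmethod
    def _bkv(value):
        # Hypothetical blocking key: first character of the value.
        return value[:1].lower()

    def insert(self, rec_id, value):
        self.postings[value].add(rec_id)
        self.blocks[self._bkv(value)].add(value)

    def query(self, rec_id, value):
        """Match a query record against its block, then add it to the index."""
        matches = {}
        for other in self.blocks[self._bkv(value)]:
            score = SequenceMatcher(None, value, other).ratio()
            if score >= self.min_sim:
                for rid in self.postings[other]:
                    matches[rid] = max(score, matches.get(rid, 0.0))
        self.insert(rec_id, value)  # the index is up-to-date for the next query
        return sorted(matches.items(), key=lambda kv: (-kv[1], kv[0]))


index = DynamicInvertedIndex()
index.insert("r1", "smith")
index.insert("r2", "smyth")
index.insert("r3", "jones")
first = index.query("q1", "smith")
second = index.query("q2", "smith")
```

In this run, `first` ranks `r1` ahead of `r2`, and `second` additionally returns `q1`, because the first query record was inserted into the index before the second query arrived.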


Table 5.1: Summary of the main notation used in this chapter.

R            A data set of records about known entities
A            A set of attributes {a_1, a_2, ..., a_|A|} for each r_i ∈ R
Q            A stream of query records
C            A list of candidate records for a query q_j
D            An inverted index or disk-based data set table
M_qj         The set of all records in R that belong to the same entity as a query q_j
r_i          A record in R
r_i.id       Unique identifier for r_i
r_i.eid      Entity identifier for r_i
q_j          A query record in Q
q_j.id       Unique identifier for q_j
q_j.eid      Entity identifier for q_j
n            The size of data set R
sim(., .)    A function used to calculate the similarity between two values (0 ≤ sim(., .) ≤ 1)
BK           A blocking key that is used to partition the records in R into blocks of similar records
BKV          The blocking key value of an attribute r_i.a_h ∈ R
b            The number of blocks generated using a certain BK
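As an illustration of a function that satisfies the constraint 0 ≤ sim(., .) ≤ 1, the sketch below computes the Jaccard coefficient over character bigrams. This particular measure, the helper name `qgrams`, and the parameter `q` are assumptions chosen for the example; the notation above does not prescribe a specific similarity function.

```python
def qgrams(value, q=2):
    """Return the set of character q-grams of a lower-cased string."""
    value = value.lower()
    if len(value) < q:
        return {value} if value else set()
    return {value[i:i + q] for i in range(len(value) - q + 1)}


def sim(a, b, q=2):
    """Jaccard similarity of the q-gram sets of a and b, always in [0, 1]."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a and not grams_b:
        return 1.0  # two empty values are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


same = sim("smith", "smith")
close = sim("smith", "smyth")   # shares the bigrams "sm" and "th"
apart = sim("smith", "jones")   # no bigram in common
```

Identical values score 1, values with no shared bigram score 0, and `smith` versus `smyth` share 2 of 6 distinct bigrams, which gives 1/3.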
