Real-Time Entity Resolution - Indexing techniques for real-time entity resolution

how well it can adapt to being used with large data sets [36]. To evaluate the scalability of an ER technique, run time, the number of comparisons, and reduction ratio are commonly used [36, 113]. More details about evaluation measures are provided in Chapter 4.

2.3 Real-Time Entity Resolution

The traditional ER process (described in Section 2.2) is generally based on batch processing and runs off-line to match multiple static data sets. However, most businesses and organizations are moving on-line where they have to provide their services to their customers instantly. In addition, the data sets used by organizations are dynamic and are modified constantly. Organizations increasingly have to deal with a stream of query records that include information about entities in their dynamic data sets. These query records have to be matched with existing data sets and resolved on-line using ER techniques. This process is referred to as real-time ER [33, 57, 113].

2.3.1 Real-Time Entity Resolution Overview

Similarity search approaches [50, 66, 74] aim at identifying similar entities within unstructured data (such as social media, web pages, and email). Such approaches are not suitable for identifying entities in structured data sets [33]. Moreover, traditional ER techniques (as described in Section 2.2) are designed to match multiple data sets with each other, which means that when matching two data sets, each record in the first data set potentially has to be compared with all records in the second data set. Such traditional ER techniques are not suitable for real-time ER; therefore, new techniques that are tailored for real-time ER are required. Real-time ER can be defined as a query-based data matching process where a stream of query records has to be resolved against records in one or more existing data sets in sub-second time [33, 57, 113] (as illustrated in Figure 2.3). A formal definition of the real-time ER problem is given in Section 1.2.

Real-time ER is required in many applications. One example application that illustrates real-time ER on dynamic data sets is a credit bureau. A credit bureau is responsible for maintaining a large data set that contains the credit history of individuals and businesses. If a person applies to a bank for a loan, the bank sends the customer’s details to a credit bureau and requests a credit check. The credit bureau has to match the customer’s details against its data set and, if a match is found, it sends the customer’s credit history back to the bank. The bank’s decision to approve the loan depends on the credit report received from the credit bureau. This process is done in real-time, where the bank receives the response instantly.

Other examples include government social services, which need to identify individuals on the spot even if their social security number is not available; police officers, who need to identify suspect individuals within seconds when they conduct an identity check using the suspect’s personal details; and on-line product comparison sites, where the site has to eliminate duplicate results in real-time from the result list based on the queries entered by customers.

Figure 2.3: The number of comparisons required for query-based matching in the real-time ER process. (A query record q is compared with the records in data set R; number of comparisons = |R|.)
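The comparison count shown in Figure 2.3 can be illustrated with a minimal sketch of query-based matching without any index. The function names, the sample records, and the use of Python's `difflib.SequenceMatcher` as a stand-in comparison function are assumptions made for illustration only; they are not taken from the cited work.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Approximate string similarity in [0, 1] (stand-in comparison function)."""
    return SequenceMatcher(None, a, b).ratio()


def resolve_naive(query: str, data_set: list[str]) -> tuple[list[tuple[float, str]], int]:
    """Compare the query with every record in R and count the comparisons."""
    comparisons = 0
    scored = []
    for record in data_set:
        scored.append((similarity(query, record), record))
        comparisons += 1
    # Rank candidates from most to least similar.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored, comparisons


# Hypothetical data set R and query record q.
R = ["john smith", "jon smith", "mary jones", "peter miller"]
ranked, n = resolve_naive("john smyth", R)
print(n == len(R))   # every query costs |R| comparisons
print(ranked[0][1])  # best-ranked record
```

Because each query costs |R| comparisons, the time per query grows linearly with the size of the data set, which is why an index that reduces the search space is needed to reach sub-second response times on large data sets.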

The real-time ER process is challenging. On top of the challenges of traditional ER discussed in Section 2.1.4, real-time ER approaches have to be very efficient, since query records have to be matched in sub-second time. In addition, ER approaches have to take into consideration the fact that real-world data are dynamic and that they grow and change constantly. The steps in the real-time ER process are the same as in traditional ER; however, each step has to be designed so that the query matching process can be completed in sub-second time. The following section describes how the different steps of the ER process can be conducted in real-time.

2.3.2 Real-Time Entity Resolution Process

For real-time ER, the various steps in the ER process (described in Section 2.2) have to consider the fact that a query record has to be resolved in real-time. The following is a description of how each step in the ER process can be carried out to facilitate query-based matching with dynamic data sets in real-time.

1. Cleaning and standardization: Queries are usually generated using data entry forms that consumers fill in and submit to an organization. Controlling the quality of the received queries is important for completing the matching process effectively. A query with low quality data affects the matching results and leads to low quality output. Therefore, an organization should control the quality of received queries as much as possible. This involves two main procedures.

The first is controlling the data entry process before the query is submitted. This can be achieved by keeping the typed-in values to a minimum to reduce possible errors and variations in the query records. For example, this includes using drop-down menus, check boxes, and radio buttons instead of text fields whenever possible. If a text field is used, a validation check on the entered values should be conducted before allowing the submission of the query. This could include checking for empty fields, automatic spelling checks, and consistency checks between entered values.

The second is performing cleaning and standardization on the submitted query using the same model that the organization uses for cleaning its existing data set. The cleaned and standardized query is then ready to be matched with records within the existing data set. Cleaning and standardizing a single query record before matching it with the existing data set is generally not computationally expensive.

2. Indexing: Real-time ER requires dynamic indexing techniques that can be updated constantly and can resolve a stream of query records in real-time. The index is built using existing data sets, and can then be used to resolve arriving query records. When a query record arrives, and after it is cleaned (in the previous step), it is inserted into the index to be resolved. A ranked list of results is then sent to the requesting user. This is different from traditional ER, where multiple data sets have to be matched and an index is built only once for the matched data sets. If the data sets change, these changes will not be included in the index unless it is re-built from scratch (off-line). In real-time ER for dynamic data sets, indexing has to be fast enough to facilitate retrieving records from the index in sub-second time. Moreover, the indexing technique should be able to add new records to the existing index to facilitate working with dynamic data sets. Most available indexing techniques work with batch processing (off-line) on static data sets, and are not suitable for real-time ER with dynamic data sets.

3. Comparison: The comparison step is similar in both traditional (off-line) ER and real-time (on-line) ER. In traditional ER, a pair-wise comparison has to be conducted for all records that are within the same block, while for real-time ER, a query has to be compared with all records that are in the same block as the query record. The comparison step is carried out using approximate comparison functions (which are usually expensive). For real-time ER, we must always consider the complexity of the chosen comparison function. Since speed is vital, comparison functions with lower computational complexity should be used.

4. Classification: We discussed earlier that the classification models used for classifying record pairs into matches, non-matches, and potential matches are based on the characteristics of the existing records (i.e. attribute values, frequencies of attribute values, dependencies between attributes, or relationships between attributes). Simple classification models (like threshold-based, rule-based, and probabilistic classification models) that do not require additional information about the matched data to be generated are more suitable for the real-time ER process. On the other hand, models that require generating new information about the matched data, such as re-training the model or re-building relational models (e.g. supervised and collective classification approaches), are less suitable for real-time ER.

However, to use such complex classification approaches with real-time ER, any additional information about the data can be re-generated off-line on a regular basis, while the matching process can be conducted on-line in real-time. For example, if a supervised classification model is used, and the data set has changed through the addition of records with new characteristics, the classification model has to be re-trained on the new data to be able to continue classifying effectively. This training can be done off-line, and the newly trained classification model can then be used in real-time within the ER process.

In summary, traditional ER techniques are designed to work with static data sets using batch processing algorithms. Such approaches are not suitable for conducting the ER process in real-time on dynamic data sets. Therefore, techniques that can handle query matching in real-time using dynamic data sets are required. Indexing is the main step in the ER process that affects the efficiency of real-time ER techniques, as it reduces the search space. Fast and dynamic indexing techniques are required to enable conducting the ER process in real-time.
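The four steps described in this section can be put together in a minimal sketch. It assumes a simple blocking-key inverted index that accepts inserts at any time, a stand-in comparison function from Python's `difflib`, and a two-threshold classifier. The attribute names, the blocking key, the threshold values, and all function and class names are hypothetical choices for illustration; this is not one of the indexing techniques from the cited work.

```python
from collections import defaultdict
from difflib import SequenceMatcher

ATTRIBUTES = ("first_name", "surname", "postcode")  # hypothetical schema


def standardize(record: dict[str, str]) -> dict[str, str]:
    """Step 1: lowercase each value and collapse surrounding/repeated whitespace."""
    return {k: " ".join(record[k].lower().split()) for k in ATTRIBUTES}


def blocking_key(record: dict[str, str]) -> str:
    """Hypothetical blocking key: first two surname letters plus postcode."""
    return record["surname"][:2] + record["postcode"]


class DynamicIndex:
    """Step 2: inverted index (blocking key -> record ids) that supports inserts."""

    def __init__(self) -> None:
        self.records: dict[str, dict[str, str]] = {}
        self.blocks: dict[str, list[str]] = defaultdict(list)

    def insert(self, rec_id: str, record: dict[str, str]) -> None:
        clean = standardize(record)
        self.records[rec_id] = clean
        self.blocks[blocking_key(clean)].append(rec_id)

    def candidates(self, clean_record: dict[str, str]) -> list[str]:
        return list(self.blocks.get(blocking_key(clean_record), []))


def compare(a: dict[str, str], b: dict[str, str]) -> float:
    """Step 3: average approximate similarity over all attributes."""
    scores = [SequenceMatcher(None, a[k], b[k]).ratio() for k in ATTRIBUTES]
    return sum(scores) / len(scores)


def classify(score: float, upper: float = 0.9, lower: float = 0.7) -> str:
    """Step 4: threshold-based classification (threshold values are assumptions)."""
    if score >= upper:
        return "match"
    return "potential match" if score >= lower else "non-match"


def resolve(index: DynamicIndex, query_id: str, query: dict[str, str]):
    """Resolve one query against its block only, then add it to the index."""
    clean = standardize(query)
    scored = sorted(
        ((compare(clean, index.records[c]), c) for c in index.candidates(clean)),
        reverse=True,
    )
    index.insert(query_id, query)  # keep the index up to date
    return [(rec_id, round(score, 3), classify(score)) for score, rec_id in scored]


index = DynamicIndex()
index.insert("r1", {"first_name": "John", "surname": "Smith", "postcode": "2600"})
index.insert("r2", {"first_name": "Joan", "surname": "Smythe", "postcode": "2600"})
index.insert("r3", {"first_name": "Mary", "surname": "Jones", "postcode": "2611"})

results = resolve(index, "q1", {"first_name": " Jon ", "surname": "SMITH", "postcode": "2600"})
for row in results:
    print(row)
```

In this sketch the query is compared only with the two records that share its blocking key rather than with all three records, and the query is added to the index after it is resolved so that later queries can be matched against it. A blocking key this simple would miss matches whose surname or postcode contains an error in the key characters, which is one reason more robust indexing techniques are needed in practice.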
