A Two-Step Classification Approach to
Unsupervised Record Linkage
Peter Christen
Department of Computer Science, The Australian National University Canberra ACT 0200, Australia
Email:[email protected]
Abstract
Linking or matching databases is becoming increasingly important in many data mining projects, as linked data can contain information that is not available otherwise, or that would be too expensive to collect manually. A main challenge when linking large databases is the classification of the compared record pairs into matches and non-matches. In traditional record linkage, classification thresholds have to be set either manually or using an EM-based approach. More recently developed classification methods are mainly based on supervised machine learning techniques and thus require training data, which is often not available in real world situations or has to be prepared manually. In this paper, a novel two-step approach to record pair classification is presented. In a first step, example training data of high quality is generated automatically, and then used in a second step to train a supervised classifier. Initial experimental results on both real and synthetic data show that this approach can outperform traditional unsupervised clustering, and even achieve linkage quality almost as good as fully supervised techniques.
Keywords: Data linkage, data matching, deduplication, entity resolution, clustering, support vector machines, quality measures.
1 Introduction
With many businesses, government organisations and research projects collecting large amounts of data, techniques that allow efficient processing, analysing and mining of massive databases have in recent years attracted interest from both academia and industry. Increasingly, data from various sources has to be linked, matched and aggregated in order to improve data quality, or to enrich existing data with additional information. Similarly, detecting and removing duplicate records that relate to the same entity within one database is often required in the data pre-processing step of many data mining projects. The aim of such linkages and deduplications is to match and aggregate all records that relate to the same entity, such as a patient, a customer, a business, a product description, a publication, or a genome sequence.
Record or data linkage and deduplication can be used to improve data quality and integrity (Winkler 2004), to allow re-use of existing data sources for new studies, and to reduce costs and efforts in data acquisition. In the health sector, for example, linked data might contain information that is needed to improve health policies (Kelman et al. 2002), and that traditionally has been collected with time consuming and expensive survey methods. Statistical agencies routinely link census data for further analysis (Gill 2001), while businesses often deduplicate their databases to compile mailing lists or link them for collaborative e-Commerce projects. Within taxation offices and departments of social security, record linkage is used to identify people who register for assistance multiple times or who work and collect unemployment benefits. Another application of current interest is the use of record linkage in crime and terror detection. Security agencies and crime investigators increasingly rely on the ability to quickly access files for a particular individual (Wang et al. 2006), which may help to prevent crimes and terror by early intervention.
Copyright © 2007, Australian Computer Society, Inc. This paper appeared at the Sixth Australasian Data Mining Conference (AusDM 2007), Gold Coast, Australia. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 70. Peter Christen, Paul Kennedy, Jiuyong Li, Inna Kolyshkina and Graham Williams, Ed. Reproduction for academic, not-for profit purposes permitted provided this text is included.
The problem of finding similar entities does not only apply to records that refer to persons. In bioinformatics, record linkage can help to find genome sequences in large data collections that are similar to a new, unknown sequence at hand. Increasingly important is the removal of duplicates in the results returned by Web search engines and automatic text indexing systems, where copies of documents (for example bibliographic citations) have to be identified and filtered out before being presented to the user (Bhattacharya and Getoor 2007). Finding and comparing consumer products from several online stores is another application of growing interest (Bilenko et al. 2005). As product descriptions are often slightly different, matching them becomes difficult.
If unique entity identifiers (or keys) are available in all databases to be linked, then the problem of linking at the entity level becomes trivial: a simple database join is all that is required. However, in most cases no unique keys are shared by all databases, and more sophisticated linkage techniques need to be applied. These techniques can be broadly classified into deterministic, probabilistic, and modern approaches (Christen and Goiser 2007, Winkler 2006).
A general schematic outline of the record linkage process is given in Figure 1. As most real-world data collections contain noisy, incomplete and incorrectly formatted information, data cleaning and standardisation are important pre-processing steps for successful record linkage, and also before data can be loaded into data warehouses or used for further mining (Rahm and Do 2000). A lack of good quality data can be one of the biggest obstacles to successful record linkage and deduplication (Clarke 2004). The main task of data cleaning and standardisation is the conversion of the raw input data into well defined, consistent forms, as well as the resolution of inconsistencies in the way information is represented and encoded (Churches et al. 2002).
[Diagram omitted; components shown: Database A, Database B, Cleaning and standardisation (one per database), Blocking / Indexing, Field comparison, Weight vector classification, Matches, Non-matches, Possible matches, Clerical review, Evaluation.]
Figure 1: General record linkage process. The output of the blocking step are candidate record pairs, while the comparison step produces weight vectors with numerical similarity weights.
If two databases, A and B, are to be linked, potentially each record from A has to be compared with all records from B. The total number of potential record pair comparisons thus equals the product of the size of the two databases, |A| × |B|, with | · | denoting the number of records in a database. Similarly, when deduplicating a database, A, the total number of potential record pair comparisons is |A| × (|A| − 1)/2, as each record potentially has to be compared to all others. The performance bottleneck in a record linkage or deduplication system is usually the expensive detailed comparison of fields (or attributes) between pairs of records (Baxter et al. 2003, Christen and Goiser 2007), making it infeasible to compare all pairs when the databases are large. Assuming there are no duplicate records in the databases (i.e. one record in database A can only match to one record in database B, and vice versa), then the maximum number of true matches corresponds to the number of records in the smaller database. Therefore, while the computational efforts increase quadratically, the number of potential true matches only increases linearly when linking larger databases. This also holds for deduplication, where the number of duplicate records is always less than the number of records in a database.
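As a small worked example of these counts, the following Python sketch evaluates both formulas for the sizes of the two real data sets reported in Table 1 (Census: 449 and 392 records; Restaurant: 864 records). The function names are illustrative only and are not part of Febrl.

```python
def linkage_comparisons(size_a, size_b):
    """Potential record pair comparisons when linking A and B: |A| x |B|."""
    return size_a * size_b


def dedup_comparisons(size_a):
    """Potential record pair comparisons when deduplicating A: |A| x (|A| - 1) / 2."""
    return size_a * (size_a - 1) // 2


# Census linkage (449 and 392 records) and Restaurant deduplication (864 records).
census_pairs = linkage_comparisons(449, 392)
restaurant_pairs = dedup_comparisons(864)

# Without duplicates inside each database, the true matches of a linkage
# cannot exceed the size of the smaller database.
census_max_matches = min(449, 392)

print(census_pairs, restaurant_pairs, census_max_matches)
```

Even for these small data sets, the number of potential comparisons is several hundred times larger than the maximum number of true matches.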
To reduce the large amount of potential record pair comparisons, record linkage methods employ some form of indexing or filtering techniques, collectively known as blocking (Baxter et al. 2003): a single record attribute or a combination of attributes, often called the blocking key, is used to split the databases into blocks. All records that have the same value in the blocking key will be inserted into one block, and candidate record pairs are then generated only from records within the same block. These candidate pairs are compared using a variety of comparison functions applied to one or more (or a combination of) record attributes. These functions can be as simple as an exact string or a numerical comparison, can take variations and typographical errors into account (Cohen et al. 2003, Christen 2006), or can be as complex as a distance comparison based on look-up tables of geographic locations (longitudes and latitudes).
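The blocking idea described above can be sketched in a few lines of Python. This is a minimal illustration for a deduplication, not the blocking implementation used in Febrl: the records, the postcode values and the blocking key (first letter of the surname plus postcode) are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: (record id, given name, surname, postcode).
records = [
    ("r1", "christine", "smith", "2600"),
    ("r2", "christina", "smith", "2600"),
    ("r3", "bob", "obrian", "2602"),
    ("r4", "robert", "bryce", "2602"),
    ("r5", "kate", "smyth", "2600"),
]


def blocking_key(record):
    """Illustrative blocking key: first letter of the surname plus the postcode."""
    _, _, surname, postcode = record
    return surname[0] + postcode


def candidate_pairs(records):
    """Return the record id pairs whose records share a blocking key value."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record[0])
    pairs = []
    for ids in blocks.values():
        # Pairs are only generated from records within the same block.
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(len(pairs), "candidate pairs instead of", all_pairs)
```

With this key, only the three records in block `s2600` are compared with each other. Records `r3` and `r4` end up in different blocks and are therefore never compared, which shows how the choice of blocking key can also remove pairs that might be true matches.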
Each comparison returns a numerical similarity value (called matching weight), often in normalised form. Two attribute values that are equal, therefore, would have a similarity of 1, while the similarity of two completely different values would be 0. Attribute values that are somewhat similar would have a similarity value somewhere between 0 and 1. As illustrated in Figure 2, a vector (called weight vector) is formed for each compared record pair containing all the matching weights calculated by the different comparison functions. These weight vectors are then used to classify record pairs into matches, non-matches, and possible matches, depending upon the decision model used (Christen and Goiser 2007, Fellegi and Sunter 1969, Gu and Baxter 2006). Record pairs that were removed by the blocking process are classified as non-matches without being compared explicitly.
R1: Christine  Smith    42  Main    Street
R2: Christina  Smith    42  Main    St
R3: Bob        O’Brian  11  Smith   Rd
R4: Robert     Bryce    12  Smythe  Road
WV(R1,R2): 0.9  1.0  1.0  1.0  0.9
WV(R1,R3): 0.0  0.0  0.0  0.0  0.0
WV(R1,R4): 0.0  0.0  0.5  0.0  0.0
WV(R2,R3): 0.0  0.0  0.0  0.0  0.0
WV(R2,R4): 0.0  0.0  0.5  0.0  0.0
WV(R3,R4): 0.7  0.4  0.5  0.7  0.9
Figure 2: Four example records (made of given name and surname; and street number, name and type attributes) and the corresponding weight vectors resulting from the comparisons of these records.
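The construction of a weight vector can be sketched as follows, using the four records of Figure 2. Note the assumption: the comparison function below is the `ratio()` of Python's standard `difflib.SequenceMatcher`, used purely as a stand-in for an approximate string comparator. It is not the comparator used to produce Figure 2, so the resulting weights will generally differ from the values shown there.

```python
from difflib import SequenceMatcher

# The four records of Figure 2, split into the five compared attributes:
# given name, surname, street number, street name, street type.
records = {
    "R1": ("Christine", "Smith", "42", "Main", "Street"),
    "R2": ("Christina", "Smith", "42", "Main", "St"),
    "R3": ("Bob", "O'Brian", "11", "Smith", "Rd"),
    "R4": ("Robert", "Bryce", "12", "Smythe", "Road"),
}


def similarity(value_a, value_b):
    """Normalised similarity in [0, 1]; 1.0 means the two values are equal.

    Stand-in comparator (difflib ratio), chosen only for illustration.
    """
    return SequenceMatcher(None, value_a.lower(), value_b.lower()).ratio()


def weight_vector(rec_a, rec_b):
    """Compare two records attribute by attribute and collect the weights."""
    return tuple(similarity(a, b) for a, b in zip(rec_a, rec_b))


wv_12 = weight_vector(records["R1"], records["R2"])
wv_13 = weight_vector(records["R1"], records["R3"])
print(wv_12)
print(wv_13)
```

Equal attribute values (surname, street number and street name of R1 and R2) receive a weight of exactly 1.0, while values without any common characters (street numbers 42 and 11) receive 0.0.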
Two records that have the same values in all their attributes will with high likelihood refer to the same entity, as it is very unlikely that two entities have the same values in all their attributes. The weight vector calculated when comparing such a pair of records will have matching weights of 1 in all vector elements. On the other hand, weight vectors that have 0 or very low similarity values in all their elements are with high likelihood the result of a comparison of two records that refer to different entities, as it is highly unlikely that two records that refer to the same entity have different values in all their record attributes. For example, even if a woman changes her surname and her address when she gets married, her date of birth and her maiden name will stay the same.
From this it follows that it is often easy to classify with high accuracy record pairs that are very similar as matches, and pairs that are very dissimilar as non-matches. On the other hand, it is much more difficult to classify pairs that have some similar and some dissimilar attribute values. This is illustrated in Figure 2, where records R1 and R2 are very similar, with only two minor differences in the given name and street type attributes (which usually are taken care of in the data cleaning and standardisation step (Churches et al. 2002)), and thus very likely refer to the same person. On the other hand, records R3 and R4 differ more from each other, and it is not obvious if they refer to the same person.
Based on the above observations, it is possible to automatically extract training examples (weight vectors) from the set of all weight vectors that with high likelihood correspond to true matches or true non-matches, and to then use these weight vectors to train a supervised classifier. From the six weight vectors shown in Figure 2, WV(R1,R2) can be used as a training example for matches, while WV(R1,R3) and WV(R2,R3), and possibly even WV(R1,R4) and WV(R2,R4), can be used for non-matches.
This two-step approach to automated record pair classification, which has been inspired by similar approaches that were developed for text classification (Basu et al. 2002, Liu et al. 2003, Nigam et al. 2000, Yu et al. 2002), is presented in more detail in Section 3, and evaluated experimentally in Section 4. First, in the following section, an overview of related research is presented. Conclusions and an outlook to future work are then given in Section 5.
2 Related Work
The classical probabilistic record linkage approach, as developed by Fellegi and Sunter (1969), has been improved in recent years mainly through application of the expectation-maximisation (EM) algorithm for better parameter estimation in record pair classification (Winkler 2000), and through the use of approximate string comparisons to calculate partial agreement weights when attribute values have typographical variations (Christen 2006, Winkler 2006).
In the late 1990s researchers started to explore the use of techniques originating in machine learning, data mining, artificial intelligence, information retrieval and database research to improve the linkage process. Many of these approaches are based on supervised learning techniques and assume that training data is available (i.e. record pairs with known true match and true non-match status). However, such training examples are often not available in real world situations, or have to be prepared manually (an expensive and time consuming process).
One supervised approach is to learn distance measures for approximate string comparisons, such as the costs for character inserts, deletes and substitutions for edit-distance (Bilenko and Mooney 2003, Cohen et al. 2003), with the aim to adapt similarity computations to a particular data domain. Decision tree induction (Elfeky et al. 2002, Neiling 2005, Tejada et al. 2002) and support vector machines (SVM) (Nahm et al. 2002) are two popular supervised machine learning techniques that have been employed successfully for record pair classification. These techniques usually achieve better linkage quality compared to unsupervised methods.
In (Elfeky et al. 2002), three approaches to record pair classification are described; the first based on supervised decision trees, the second using unsupervised k-means clustering (with three clusters, one each for matches, possible matches and non-matches), and the third being a hybrid approach that combines the first two to overcome the problem of lack of training data. In this hybrid approach, a subset of weight vectors is clustered in a first step (again into matches, possible matches and non-matches), and the match and non-match clusters are then used as training data for a supervised classifier in a second step. Both the fully supervised and hybrid approach outperformed the clustering approach in experimental studies.
Active learning is another approach, aimed at reducing the amount of training data required. In (Sarawagi and Bhamidipaty 2002), a system is described that presents a difficult-to-classify record pair to a user for manual classification. After such a pair is classified manually, it is added to the training set and the classifiers are trained. This process is repeated until all record pairs are successfully classified. The authors reported that manually classifying less than 100 training pairs using their approach provided better results than a fully supervised approach that used 7,000 randomly selected examples. A similar approach has been presented in (Tejada et al. 2002), where a committee of decision trees is used to learn a set of rules that describe linkages.
Unsupervised clustering techniques have been investigated both for improved blocking (Cohen and Richman 2002, McCallum et al. 2000) and for automatic record pair classification (Elfeky et al. 2002). The k-means clustering algorithm has been used in (Gu and Baxter 2006) to group weight vectors into matches and non-matches (i.e. k = 2). In this approach, a user can identify a ‘fuzzy’ region in the middle between the two cluster centroids where the difficult-to-classify record pairs are located. These pairs will then be given to the user for manual clerical review. Using synthetic data, it was shown that this approach can significantly reduce the number of record pairs that have to be reviewed manually, while keeping high linkage quality. In (Goiser and Christen 2006), the clustering techniques k-means and farthest-first were compared with supervised decision tree induction on both synthetic and real data sets. Surprisingly, the simple farthest-first technique achieved results comparable to decision trees.
Another area where unsupervised techniques have been explored in recent years is entity resolution of relational data based on relational clustering (Bhattacharya and Getoor 2007). While the techniques described so far assume that only similarities between attribute values of record pairs are available for classification, in relational data the entities have additional relational information that can be used to improve the quality of entity resolution. Relational information includes, for example, census databases that contain a family relationship attribute (with values such as ‘married to’, ‘dependent of’, or ‘parent of’); or bibliographic data where, besides the name of a paper, a publication record also contains a list of authors. Two author names in different publications that have several co-authors in common in other publications will more likely refer to the same real person compared to an author with the same name that has different co-authors. Experimental results (Bhattacharya and Getoor 2007) on various data sets have shown that collective relational entity resolution outperforms non-relational entity resolution that is based on record pair similarities only. However, there are still many situations in the real world where no relational data is available, and this paper concentrates on improving the unsupervised classification of such non-relational data.
The two-step approach presented here has been inspired by similar approaches to text classification, where often only a small number of labeled positive examples and a very large number of unlabeled examples are available. The aim is then to learn a binary classifier from these positive and unlabeled examples. In (Yu et al. 2002), the PEBL approach is presented, which is based on iteratively training an SVM using the positive and a selected set of strong negative examples. More unlabeled examples are included into the negative training set as the trained classifier becomes more accurate, until all unlabeled examples are classified. A comparison of different approaches to learning from positive and unlabeled examples is given in (Liu et al. 2003). The techniques compared were PEBL, Naïve Bayes classification, Rocchio text classification in combination with SVM, and an EM based approach (called S-EM) that uses ‘spy’ documents, positive examples that are inserted into the set of unlabeled documents to better model their distributions (Liu et al. 2002). A new approach, that uses a biased SVM formulation, is then proposed that achieved better classification results than all previous methods (Liu et al. 2003).
In a related text classification scenario, only small numbers of both positive and negative labeled training examples, as well as a large number of unlabeled examples, are available. In (Nigam et al. 2000), a combination of the EM and Naïve Bayes classifiers is presented. Training is started using only the labeled data, and then iteratively refined using the unlabeled examples. The experimental results presented showed that this approach was able to reduce classification errors by up to 30%.
Also related to the work presented here is semi-supervised clustering (Basu et al. 2002), which is based on the idea of using a small amount of labeled data to initialise the cluster centroids, for example for k-means, rather than using random centroid initialisation. Experimental results discussed in (Basu et al. 2002) show that this can significantly improve cluster quality. In the area of record linkage, such an approach can be taken for classifying weight vectors, by initialising two cluster centroids, one to the exact similarity values (matches) and the other to total dissimilarity values (non-matches). Such a clustering approach will be compared to other classification techniques in Section 4 below.
3 Two-step Record Pair Classification
The idea behind the approach presented in this paper is based on the following two assumptions. First, the weight vectors generated in the comparison step that have exact or high similarity values in all their vector elements were with high likelihood produced when two records were compared that refer to the same entity, as it is very unlikely that two different entities have high similarities in all their attributes. Second, weight vectors with mostly low similarity values were with high likelihood produced when two records were compared that refer to different entities, as it is highly unlikely that two records that refer to the same entity have different values in all their attributes.
Thus, the hypothesis investigated in this paper is that it is possible to select in a first step weight vectors as training examples that with high likelihood correspond to either true matches or true non-matches, and to then use these examples in a second step to train a supervised classifier. This paper concentrates on the first step, and presents and evaluates several approaches to automatically select training examples. Combined, these two steps will allow fully automated, unsupervised record pair classification, without the need to know the true match and non-match status of the weight vectors produced in the comparison step.
3.1 Step 1: Selection of Training Examples
There are two main approaches to selecting training examples: threshold-based and nearest-based. As illustrated in Figure 1, pairs of records that were generated in the blocking step are compared using $d$ comparison functions (with $d \ge 1$), resulting in a set $W$ of weight vectors $w_i$ ($1 \le i \le |W|$) of length $d$ containing matching weights (similarity values), with $|\cdot|$ denoting the number of elements in a set. It is assumed that all comparison functions return normalised similarity values between 1.0 (exact similarity) and 0.0 (total dissimilarity), i.e. $0.0 \le w_i[j] \le 1.0$, $1 \le j \le d$, $\forall w_i \in W$. The weight vector that contains exact similarities in all its vector elements is denoted by $m$ (i.e. $m[j] = 1.0$, $1 \le j \le d$), and the weight vector that contains total dissimilarities only by $n$ (i.e. $n[j] = 0.0$, $1 \le j \le d$).
The aim of the training example selection process is to choose weight vectors from $W$ that with very high likelihood correspond to true matches and true non-matches, respectively, and to insert them into two sets, the match example training set, $W_M$, and the non-match example training set, $W_N$. Generally, only a fraction of all weight vectors will be selected for training, and thus it is expected that $(|W_M| + |W_N|) \ll |W|$. In the following, the two approaches to training example selection are presented in more detail.
3.1.1 Threshold-based Selection
In this approach, one threshold for matches, $t_m$ (with $0.0 < t_m < 1.0$), and one for non-matches, $t_n$ (with $0.0 < t_n < 1.0$), are used to select weight vectors that have all their similarity values either within $t_m$ of the exact match value (1.0) or within $t_n$ of the total dissimilarity value (0.0). More formally, the match and non-match example sets $W_M$ and $W_N$ are formed according to:
$$W_M = \{\, w_i \in W : (m[j] - w_i[j]) \le t_m,\ 1 \le j \le d \,\},$$
$$W_N = \{\, w_i \in W : (n[j] + w_i[j]) \le t_n,\ 1 \le j \le d \,\}.$$
Depending upon the values of $t_m$ and $t_n$, there is the possibility that a weight vector could be included into both training example sets $W_M$ and $W_N$. In such a situation, this weight vector will be removed from both $W_M$ and $W_N$, as it cannot be a good quality training example for both matches and non-matches. For example, this would happen when $t_m = t_n = 0.6$ for a weight vector which has all similarity values set to 0.5, i.e. $w_i[j] = 0.5$, $1 \le j \le d$.
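The threshold-based selection, including the removal of weight vectors that qualify for both sets, can be sketched as follows. This is a minimal illustration of the two set definitions above rather than the Febrl implementation; the function name and the use of Python tuples for weight vectors are assumptions, and the example vectors are those of Figure 2.

```python
def threshold_select(W, t_m, t_n):
    """Select match and non-match training examples using thresholds.

    A vector is a match example if every element is within t_m of 1.0,
    and a non-match example if every element is within t_n of 0.0.
    Vectors that satisfy both conditions are removed from both sets.
    """
    match_idx = {i for i, w in enumerate(W) if all(1.0 - x <= t_m for x in w)}
    non_match_idx = {i for i, w in enumerate(W) if all(0.0 + x <= t_n for x in w)}
    ambiguous = match_idx & non_match_idx
    W_M = [W[i] for i in sorted(match_idx - ambiguous)]
    W_N = [W[i] for i in sorted(non_match_idx - ambiguous)]
    return W_M, W_N


# The six weight vectors of Figure 2.
W = [
    (0.9, 1.0, 1.0, 1.0, 0.9),
    (0.0, 0.0, 0.0, 0.0, 0.0),
    (0.0, 0.0, 0.5, 0.0, 0.0),
    (0.0, 0.0, 0.0, 0.0, 0.0),
    (0.0, 0.0, 0.5, 0.0, 0.0),
    (0.7, 0.4, 0.5, 0.7, 0.9),
]
W_M, W_N = threshold_select(W, t_m=0.3, t_n=0.3)

# The overlap case from the text: t_m = t_n = 0.6 and a vector of all 0.5.
overlap_M, overlap_N = threshold_select([(0.5,) * 5], t_m=0.6, t_n=0.6)
print(W_M, W_N, overlap_M, overlap_N)
```

With both thresholds set to 0.3, only WV(R1,R2) is selected as a match example and the two all-zero vectors as non-match examples; the all-0.5 vector in the second call is dropped from both sets.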
3.1.2 Nearest-based Selection
Rather than using thresholds, in this approach the weight vectors closest to $m$ are selected into $W_M$, and the weight vectors closest to $n$ into $W_N$. More formally, if $x_m$ and $x_n$ (with $x_m > 0$ and $x_n > 0$) are the number of weight vectors to be selected into $W_M$ and $W_N$, respectively, and the distance between two weight vectors is calculated using the Manhattan distance as $\mathrm{dist}(w_i, w_k) = \sum_{j=1}^{d} |w_i[j] - w_k[j]|$, then the training example sets are formed according to:
$$W_M = \{\, w_i \in W,\ w_k \notin W_M : \mathrm{dist}(m, w_i) < \mathrm{dist}(m, w_k) \,\},$$
$$W_N = \{\, w_i \in W,\ w_k \notin W_N : \mathrm{dist}(w_i, n) < \mathrm{dist}(w_k, n) \,\},$$
with $x_m = |W_M|$ and $x_n = |W_N|$.
There are two variations of how the $x_m$ and $x_n$ nearest vectors can be chosen. First, they can be selected regardless of whether some of them contain the same values in all of their vector elements. For example, there might be a number of weight vectors that contain only exact match values (i.e. that are equal to $m$) if there are pairs of records that are exact matches, i.e. that have the same values in all compared attributes. Similarly, as illustrated in Figure 2, there will be a large number of weight vectors that only contain total dissimilarity values (i.e. weight vectors that are equal to $n$). In the worst case, the weight vectors selected into $W_M$ will all be equal to $m$ and the weight vectors selected into $W_N$ will all be equal to $n$. This situation would not be very useful for training the classifier in step two. Thus, in order to make sure weight vectors with different values are selected, the $x_m$ and $x_n$ nearest unique vectors can be inserted into the sets $W_M$ and $W_N$ of training examples. These two variations will be referred to as non-unique and unique nearest in the experimental results presented in Section 4 below.
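Both the non-unique and the unique variation of the nearest-based selection can be sketched with a sort on the Manhattan distance. This is an illustrative sketch only: the function names are hypothetical, ties are broken by input order (the text does not specify a tie-breaking rule), and no check is made that the two selected sets are disjoint, which could matter when $x_m + x_n$ approaches $|W|$.

```python
def manhattan(u, v):
    """Manhattan distance between two weight vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))


def nearest_select(W, x_m, x_n, unique=False):
    """Select the x_m vectors nearest to m and the x_n vectors nearest to n.

    With unique=True, duplicate weight vectors are collapsed first, so that
    the selected examples all differ from each other.
    """
    d = len(W[0])
    m = (1.0,) * d  # exact similarity in all elements
    n = (0.0,) * d  # total dissimilarity in all elements
    # dict.fromkeys keeps the first occurrence of each vector, in input order.
    pool = list(dict.fromkeys(W)) if unique else list(W)
    W_M = sorted(pool, key=lambda w: manhattan(m, w))[:x_m]
    W_N = sorted(pool, key=lambda w: manhattan(w, n))[:x_n]
    return W_M, W_N


# The six weight vectors of Figure 2 (as hashable tuples).
W = [
    (0.9, 1.0, 1.0, 1.0, 0.9),
    (0.0, 0.0, 0.0, 0.0, 0.0),
    (0.0, 0.0, 0.5, 0.0, 0.0),
    (0.0, 0.0, 0.0, 0.0, 0.0),
    (0.0, 0.0, 0.5, 0.0, 0.0),
    (0.7, 0.4, 0.5, 0.7, 0.9),
]
non_unique_M, non_unique_N = nearest_select(W, x_m=1, x_n=2)
unique_M, unique_N = nearest_select(W, x_m=1, x_n=2, unique=True)
print(non_unique_N)
print(unique_N)
```

In the non-unique variation both non-match examples are the all-zero vector, which is the uninformative situation described above; the unique variation selects the all-zero vector and WV(R1,R4) instead.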
A second variation in the nearest-based approach is how to choose the values of $x_m$ and $x_n$. Both can be set to the same value, resulting in a balanced classification problem that has the same number of match and non-match training examples. However, as discussed in Section 1 earlier, the number of true non-matches in the set of weight vectors generated by the blocking and comparison steps will likely be much larger than the number of true matches, because the number of true matches is usually limited by the size of the smaller data set. Classifying the weight vector set $W$ is therefore an imbalanced classification problem, and this should be reflected in the number of training examples provided to the classifier in step two. An estimation of the ratio of matches to non-matches, $r$, can be calculated based on the number of records in both data sets, $|A|$ and $|B|$, and the number of weight vectors $|W|$:
$$r = \frac{\min(|A|, |B|)}{|W| - \min(|A|, |B|)}. \qquad (1)$$
The number of weight vectors selected into the match examples training set $W_M$ will therefore usually be smaller than the number of vectors selected into the non-match examples training set $W_N$. In the experiments presented in Section 4 below, the results for this variation will be shown in two separate tables.
Data set     Number of    Task            Pairs          Reduction   Number of weight     Ratio of true matches
             records                      completeness   ratio       vectors (i.e. |W|)   to true non-matches
Census       449 + 392    Linkage         1.000          0.988       2,093                1 / 5.40
Restaurant   864          Deduplication   1.000          0.713       106,875              1 / 953.24
DS-Gen       1,000        Deduplication   0.957          0.995       2,475                1.13 / 1
DS-Gen       2,500        Deduplication   0.940          0.997       9,878                1 / 2.06
DS-Gen       5,000        Deduplication   0.953          0.997       35,491               1 / 4.48
DS-Gen       10,000       Deduplication   0.948          0.997       132,532              1 / 9.32
Table 1: Data sets used in experiments. See Section 4.1 for more details.
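Equation (1) can be evaluated directly from the figures in Table 1. The following sketch does so for the Census linkage (449 and 392 records, 2,093 weight vectors); the function name is illustrative only.

```python
def estimated_match_ratio(size_a, size_b, num_weight_vectors):
    """Estimate r, the ratio of matches to non-matches, following Equation (1).

    The number of matches is estimated by the size of the smaller data set,
    and all remaining weight vectors are assumed to be non-matches.
    """
    est_matches = min(size_a, size_b)
    return est_matches / (num_weight_vectors - est_matches)


# Census data set from Table 1: |A| = 449, |B| = 392, |W| = 2,093.
r = estimated_match_ratio(449, 392, 2093)
print("estimated ratio: 1 / %.2f" % (1.0 / r))
```

For the Census data this gives an estimate of roughly 1 / 4.34, compared with the true ratio of 1 / 5.40 listed in Table 1, so the estimate is only an approximation of the actual imbalance.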
3.2 Step 2: Classification of Record Pairs
Once example training data for matches, $W_M$, and non-matches, $W_N$, has been selected, any binary classifier can be trained on them, followed by the classification of the weight vectors that have not been selected as training examples, i.e. $W_T = W \setminus (W_M \cup W_N)$. In this paper, a support vector machine (SVM) classifier will be used, as this technique can handle high-dimensional data and has been shown to be robust to noisy data. The use of other classifiers, such as decision trees, is possible and will be investigated as part of future work.
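The overall shape of step two can be sketched as follows. Important assumption: to keep the example self-contained, the classifier below is a simple nearest-centroid rule (Manhattan distance to the mean of each training set), used as a stand-in for "any binary classifier". It is not the linear kernel SVM used in this paper, and its decisions may differ from those of an SVM. The training sets are hand-picked from Figure 2 for illustration.

```python
def manhattan(u, v):
    """Manhattan distance between two weight vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))


def centroid(vectors):
    """Element-wise mean of a non-empty list of weight vectors."""
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))


def two_step_classify(W, W_M, W_N):
    """Train on W_M and W_N, then classify W_T = W minus (W_M union W_N).

    Stand-in classifier: a vector is labelled 'match' if it is closer to the
    centroid of W_M than to the centroid of W_N, and 'non-match' otherwise.
    """
    c_match, c_non_match = centroid(W_M), centroid(W_N)
    training = set(W_M) | set(W_N)
    W_T = [w for w in W if w not in training]
    return [
        (w, "match" if manhattan(w, c_match) < manhattan(w, c_non_match) else "non-match")
        for w in W_T
    ]


# The six weight vectors of Figure 2, with hand-picked training examples.
W = [
    (0.9, 1.0, 1.0, 1.0, 0.9),
    (0.0, 0.0, 0.0, 0.0, 0.0),
    (0.0, 0.0, 0.5, 0.0, 0.0),
    (0.0, 0.0, 0.0, 0.0, 0.0),
    (0.0, 0.0, 0.5, 0.0, 0.0),
    (0.7, 0.4, 0.5, 0.7, 0.9),
]
W_M = [(0.9, 1.0, 1.0, 1.0, 0.9)]
W_N = [(0.0, 0.0, 0.0, 0.0, 0.0)]
results = two_step_classify(W, W_M, W_N)
for vector, label in results:
    print(vector, label)
```

With this stand-in, the two WV(·,R4) vectors containing a single 0.5 are labelled non-matches and WV(R3,R4) is labelled a match. This is only what the simple rule produces on this toy input; as noted in Section 1, the true status of the pair (R3, R4) is not obvious.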
One important issue that is also left for future work is that the example training data generated automatically in the first step will be linearly separable, as the two training sets only contain examples that are either close to the exact match value or close to the total dissimilarity value. Thus, there will be a ‘gap’ between the match and non-match training examples. Similar to the inclusion of ‘spy’ documents in the S-EM approach (Liu et al. 2002), adding randomly sampled weight vectors from this ‘gap’ into the training example sets should improve the overall classification accuracy. This idea is currently being implemented and results will be reported elsewhere.
4 Experimental Evaluation
In this section, the different approaches to automatically select training examples for matches and non-matches will be compared with three other classification methods. The first is a linear kernel SVM that uses all weight vectors and their match status for supervised classification (10-fold cross validation results are reported). The second is the standard k-means clustering approach using Euclidean distance and with two clusters (one for the matches and one for the non-matches), with the cluster centroids initialised to the exact match vector m and total dissimilarity vector n, respectively. The third is an ‘optimal threshold’ classifier that has access to the match status of all weight vectors, and that emulates an optimal probabilistic approach (Fellegi and Sunter 1969). It sums each weight vector into a single matching weight (i.e. it generates 1-dimensional weight vectors), and then finds the optimal classification threshold using these matching weights that minimises the number of false matches and false non-matches.
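The ‘optimal threshold’ baseline can be sketched as an exhaustive search over candidate thresholds on the summed weights. This is a minimal sketch under stated assumptions: a pair is classified as a match when its summed weight is greater than or equal to the threshold, the candidate thresholds are the observed sums (plus infinity, meaning no pair is a match), and the lowest threshold wins in case of ties. The true match status used below is hypothetical, since Figure 2 does not state it.

```python
import math


def optimal_threshold(W, is_match):
    """Find the threshold on summed weights that minimises classification errors.

    Returns (threshold, errors), where errors counts false matches plus
    false non-matches when classifying sum(w) >= threshold as a match.
    """
    sums = [sum(w) for w in W]
    best_threshold, best_errors = None, None
    for t in sorted(set(sums)) + [math.inf]:
        errors = sum((s >= t) != truth for s, truth in zip(sums, is_match))
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold, best_errors


# The six weight vectors of Figure 2, with a hypothetical true match status.
W = [
    (0.9, 1.0, 1.0, 1.0, 0.9),
    (0.0, 0.0, 0.0, 0.0, 0.0),
    (0.0, 0.0, 0.5, 0.0, 0.0),
    (0.0, 0.0, 0.0, 0.0, 0.0),
    (0.0, 0.0, 0.5, 0.0, 0.0),
    (0.7, 0.4, 0.5, 0.7, 0.9),
]
is_match = [True, False, False, False, False, True]
threshold, errors = optimal_threshold(W, is_match)
print(threshold, errors)
```

Because this classifier needs the true match status of every weight vector, it serves as an upper bound for comparison rather than as a practical unsupervised method.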
All techniques described here were implemented in the Febrl (Christen et al. 2004) open source record linkage system,¹ which is written in the Python programming language. For SVM classification the PyML² Python module was used, which is based on the libsvm library (Chang and Lin 2001). The default linear kernel SVM method from PyML was chosen in all experiments presented here. Further experiments using SVMs with non-linear kernels and other classification approaches are planned for future work. All reported experiments were carried out on a Dell Optiplex GX280 with an Intel Pentium 3 GHz CPU and 2 Gigabytes of main memory, running Linux 2.6.20 (Ubuntu 7.04 Feisty Fawn) and using Python 2.5.1.
4.1 Data Sets and Linkage Setup
The proposed approaches were evaluated using both real and synthetic data sets, which are summarised in Table 1. Two small real data sets were taken from the SecondString toolkit,³ while artificial data sets of various sizes were created using the Febrl data set generator (Christen 2005). This generator works by first creating a number of original records based on frequency tables containing real world names (given name and surname) and addresses (street number, name and type; postcode; suburb and state name), followed by the random generation of duplicates of these records based on modifications (like inserting, deleting or substituting characters, and swapping, removing, inserting, splitting or merging words), also based on real error characteristics. All data sets generated for this paper contained 60% original and 40% duplicate records, with up to nine duplicates for one original record (the number of duplicates created per original record is ‘Zipf’ distributed), and with a maximum of three modifications per attribute and a maximum of ten modifications per record.
A standard blocking approach (Baxter et al. 2003) was used for all experiments reported here, with the blocking keys being combinations of name, address and postcode values. For the record pair comparison step, the Winkler (Christen 2006, Winkler 2004) approximate string comparator (commonly employed in record linkage for name comparisons) was used on the name and address attributes. Additionally, for the Census and synthetic data sets, character difference comparisons were used on the zipcode, postcode, street number and state abbreviation attributes.
The pairs completeness measure shown in Table 1 is the number of true matched record pairs generated by a blocking technique divided by the total number of true matched pairs (Christen and Goiser 2007). It measures how effective a blocking technique is in generating true matched record pairs. Pairs completeness corresponds to the recall measure as used in information retrieval. The reduction ratio measure, rr, is one minus the number of record pairs generated by the blocking process divided by the number of all possible record pairs. For a linkage between two data sets, A and B, rr = 1.0 − |W|/(|A| × |B|) (with W the set of weight vectors generated in the comparison step), while for a deduplication rr = 1.0 − 2|W|/(|A| × (|A| − 1)). The more record pairs are removed by a blocking technique the higher the reduction ratio value becomes. However, reduction ratio does not take the quality of the generated candidate record pairs into account (how many are true matches or not). The ratio of matches to non-matches shown in Table 1 refers to the corresponding number of weight vectors in W.

1 http://febrl.sourceforge.net
2 http://pyml.sourceforge.net
3 http://secondstring.sourceforge.net

Data set        Train set   Threshold
                            0.1     0.3     0.5     0.7     0.9
Census          WM          0       100     96.2    73.4    67.9
                WN          0       0       100     100     100
Restaurant      WM          100     98.5    4.5     0.19    0.2
                WN          0       0       100     100     100
DS-Gen 1,000    WM          0       100     100     100     100
                WN          100     100     100     99.0    86.1
DS-Gen 2,500    WM          100     100     100     99.8    99.7
                WN          100     100     100     99.4    92.0
DS-Gen 5,000    WM          100     100     100     98.0    96.5
                WN          100     100     100     99.7    96.3
DS-Gen 10,000   WM          100     100     100     95.5    93.6
                WN          99.2    99.7    100     99.9    98.3

Table 2: Quality of threshold-based training example selection. WM denotes the match example training set, and WN the non-match example training set. All result values are given as percentages.

4.2 Quality Measures
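The two blocking measures defined above for Section 4.1, pairs completeness and reduction ratio, can be written down directly from their formulas. The sketch below is only an illustration; the function names and the representation of record pairs as a Python set are assumptions.

```python
def pairs_completeness(candidates, true_matches):
    """Fraction of all true matched pairs that survive blocking."""
    return len(candidates & true_matches) / len(true_matches)


def reduction_ratio_linkage(num_candidates, size_a, size_b):
    """rr = 1.0 - |W| / (|A| * |B|) for a linkage of data sets A and B."""
    return 1.0 - num_candidates / (size_a * size_b)


def reduction_ratio_dedup(num_candidates, size_a):
    """rr = 1.0 - 2|W| / (|A| * (|A| - 1)) for a deduplication of A."""
    return 1.0 - 2.0 * num_candidates / (size_a * (size_a - 1))
```

For example, deduplicating five records gives ten possible pairs; if blocking keeps two of them, the reduction ratio is 0.8, regardless of whether those two pairs are true matches.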
The quality of the training examples selected in step one (shown in Tables 2, 3 and 4) is given as the percentage of correctly selected weight vectors, i.e. the percentage of true matches in the match examples training set WM, and the percentage of true non-matches in the non-match examples training set WN.
Due to the usually imbalanced distribution of matches and non-matches in the weight vector set W, the commonly used accuracy measure is not suitable for assessing the quality of record linkage (Christen and Goiser 2007). The large number of non-matches would dominate the accuracy measure and yield results that are too optimistic. Instead, the F-measure, the harmonic mean of precision P and recall R, is used for measuring classifier quality: F = 2PR/(P + R), with P = TP/(TP + FP) and R = TP/(TP + FN). TP is the number of true positives (true matched record pairs classified as matches), TN the number of true negatives (true non-matched record pairs classified as non-matches), FN the number of false negatives (true matched record pairs classified as non-matches), and FP the number of false positives (true non-matched record pairs classified as matches).
4.3 Results and Discussion
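The argument above for preferring the F-measure over accuracy (Section 4.2) can be checked with a small worked example. The sketch below implements the formulas as stated; the function names and the counts in the usage note are made up for illustration and are not results from the experiments.

```python
def precision_recall_f(tp, fp, fn):
    """Return (P, R, F) with F = 2PR / (P + R); zero when undefined."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def accuracy(tp, tn, fp, fn):
    """Fraction of all record pairs that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)
```

With hypothetical counts TP = 50, FP = 50, FN = 50 and TN = 9,850, accuracy is 0.99 although only half of the true matches were found and only half of the classified matches are correct, so precision, recall and F-measure are all 0.5.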
Tables 2, 3 and 4 show the quality of the training examples selected using the different approaches discussed in Section 3.1. As can be seen clearly, in many cases the selected weight vectors are of very high quality, i.e. they are all (or almost all) true matches and true non-matches. The threshold-based approach is problematic, as threshold values that are set too low will possibly result in no weight vectors being selected into a training set. The nearest-based approach overcomes this problem, especially the imbalanced nearest-based selection, which produced very good quality training data in almost all experiments. In the balanced nearest-based approach, it seems that too many weight vectors are selected into the match example training set WM, resulting in a loss of its quality, as with increasing training set size more false matches (false positives) will be selected due to the imbalanced numbers of true matches and non-matches in the weight vector set W.

Data set        Train set   Non-unique nearest        Unique nearest
                            1%      5%      10%       1%      5%      10%
Census          WM          100     100     81.8      100     100     79.9
                WN          100     100     100       100     100     100
Restaurant      WM          9.8     2.0     1.0       5.6     1.1     0.59
                WN          100     100     100       100     100     100
DS-Gen 1,000    WM          100     100     100       100     100     100
                WN          100     96.7    95.5      100     95.9    95.5
DS-Gen 2,500    WM          100     100     100       100     100     100
                WN          99.0    98.4    98.3      99.0    98.4    98.2
DS-Gen 5,000    WM          100     100     99.0      100     100     99.0
                WN          100     99.8    99.5      99.7    99.7    99.6
DS-Gen 10,000   WM          100     99.0    75.4      100     98.6    74.1
                WN          100     99.8    99.7      99.8    99.8    99.7

Table 3: Quality of balanced nearest-based training example selection.

Data set        Train set   Non-unique nearest        Unique nearest
                            1%      5%      10%       1%      5%      10%
Census          WM          100     100     100       100     100     100
                WN          100     100     100       100     100     100
Restaurant      WM          100     100     90.8      100     76.7    58.6
                WN          100     100     100       100     100     100
DS-Gen 1,000    WM          100     100     100       100     100     100
                WN          100     96.7    95.5      100     95.9    95.5
DS-Gen 2,500    WM          100     100     100       100     100     100
                WN          99.0    98.4    98.3      99.0    98.4    98.2
DS-Gen 5,000    WM          100     100     100       100     100     100
                WN          100     99.8    99.5      99.7    99.7    99.6
DS-Gen 10,000   WM          100     100     100       100     100     100
                WN          99.9    99.8    99.7      99.8    99.8    99.7

Table 4: Quality of imbalanced nearest-based training example selection.
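To make the nearest-based selection and its quality measure more concrete, the following sketch selects the weight vectors closest to an exact match vector (all similarities 1) and to a total dissimilarity vector (all similarities 0). The details are assumptions, since Section 3.1 is the authoritative description: Euclidean distance, similarity values in [0, 1], and training set sizes passed in by the caller (equal sizes for a balanced selection, different sizes for an imbalanced one). The unique versus non-unique distinction is not modelled.

```python
import math


def nearest_based_selection(weight_vectors, num_match, num_non_match):
    """Return (match indices, non-match indices) of the selected examples."""
    dim = len(weight_vectors[0])
    exact_match = [1.0] * dim   # assumed exact match vector
    total_dissim = [0.0] * dim  # assumed total dissimilarity vector
    indices = range(len(weight_vectors))
    by_match = sorted(
        indices, key=lambda i: math.dist(weight_vectors[i], exact_match))
    by_non_match = sorted(
        indices, key=lambda i: math.dist(weight_vectors[i], total_dissim))
    match_idx = by_match[:num_match]
    chosen = set(match_idx)
    # A vector already chosen as match example is not reused as non-match.
    non_match_idx = [i for i in by_non_match if i not in chosen][:num_non_match]
    return match_idx, non_match_idx


def selection_quality(selected, true_status, wanted):
    """Percentage of selected vectors whose true match status equals `wanted`."""
    if not selected:
        return 0.0
    correct = sum(1 for i in selected if true_status[i] == wanted)
    return 100.0 * correct / len(selected)
```

In this sketch the loss of quality discussed above would appear as soon as `num_match` exceeds the number of true matches in W, because the remaining places in WM can then only be filled with true non-matches.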
For the Census and Restaurant data sets, the 0% values for the threshold-based approach in Table 2 indicate that each of the record pairs compared had similar or equal values in at least one of the compared record attributes, while for the larger synthetic data sets there were record pairs that had no similar attribute values at all. This means that the blocking step for the synthetic data sets could be improved, as record pairs that have no similar attribute values at all clearly should not be compared using the computationally expensive comparison functions.
As can be seen from Table 5 and Figures 3, 4 and 5, using the automatically selected training example sets for classification produced results of a wide variety, ranging from almost as good as the supervised optimal threshold and SVM classifiers, to F-measure values much lower than those of k-means clustering. With most data sets, the linear SVM classifier achieved the best F-measure results, outperforming the optimal threshold classifier. Of the two-step approaches, the threshold-based approach seems to be very sensitive to the chosen threshold value, while with the nearest-based approach, imbalanced training set size outperforms balanced training set size in most cases, often achieving significantly better results than k-means clustering. For the balanced nearest-based approach, there seems to be a general trend that smaller training set size results in better classification quality compared to larger training sets. No such trend is visible for the imbalanced nearest-based approach. There is also no clear advantage or disadvantage for using unique or non-unique nearest-based selection of training examples.

Classification         Data sets
approach               Census   Restaurant   DS-Gen   DS-Gen   DS-Gen   DS-Gen
                                             1,000    2,500    5,000    10,000
Optimal threshold      0.784    0.839        0.900    0.846    0.813    0.787
SVM                    0.785    0.466        0.944    0.917    0.884    0.829
K-means clustering     0.434    0.002        0.802    0.814    0.763    0.213
Threshold-0.1          0.000    0.000        0.000    0.844    0.808    0.750
Threshold-0.3          0.000    0.000        0.857    0.809    0.735    0.655
Threshold-0.5          0.187    0.001        0.711    0.609    0.527    0.751
Threshold-0.7          0.171    0.704        0.826    0.816    0.744    0.492
Threshold-0.9          0.149    0.379        0.779    0.681    0.688    0.458
Nearest-1%, NU, B      0.566    0.001        0.854    0.821    0.728    0.500
Nearest-5%, NU, B      0.643    0.001        0.865    0.815    0.573    0.199
Nearest-10%, NU, B     0.317    0.001        0.840    0.757    0.344    0.080
Nearest-1%, U, B       0.566    0.002        0.863    0.834    0.741    0.515
Nearest-5%, U, B       0.500    0.002        0.861    0.813    0.582    0.203
Nearest-10%, U, B      0.271    0.002        0.841    0.757    0.348    0.087
Nearest-1%, NU, IB     0.567    0.044        0.865    0.815    0.780    0.738
Nearest-5%, NU, IB     0.644    0.012        0.851    0.823    0.805    0.751
Nearest-10%, NU, IB    0.410    0.002        0.830    0.817    0.797    0.739
Nearest-1%, U, IB      0.568    0.008        0.858    0.806    0.781    0.757
Nearest-5%, U, IB      0.511    0.005        0.849    0.822    0.807    0.756
Nearest-10%, U, IB     0.388    0.003        0.832    0.815    0.799    0.747

Table 5: F-measure classification results. ‘U’ and ‘NU’ denote the unique and non-unique weight vector selection, respectively, and ‘B’ and ‘IB’ the balanced and imbalanced training set size selection. Note that ‘Optimal threshold’ and ‘SVM’ are supervised classification techniques, while all other approaches are unsupervised.
These initial results indicate that the proposed two-step approach to automatic record pair classification is feasible and can achieve linkage quality almost as good as fully supervised classification. Specifically, the nearest-based selection of match and non-match training example sets can automatically generate training data of high quality. However, more investigation is needed for the second step of the proposed approach, namely on how best to use the generated training example sets for classification. More experiments using different data sets and additional classifiers have to be conducted in order to validate the general applicability of the proposed approach to a wide range of data with different characteristics.
5 Conclusions and Future Work
In this paper, a novel two-step approach to record pair classification has been presented that aims to automate the record linkage process. This approach combines an automatic selection of training examples, which with high likelihood are either true matches or true non-matches, with a traditional supervised classifier. Initial experiments on a range of data sets showed promising results, in certain cases achieving F-measure values almost as good as a fully supervised linear SVM classifier, and generally better than an unsupervised k-means clustering approach.
There are two main extensions to the basic approach presented here that will be investigated in the near future. First, rather than only using a classifier once in the second step to classify all weight vectors in WT, an iterative approach, similar to PEBL (Yu et al. 2002), will be explored. The basic idea is that in each iteration the strongest classified matches and non-matches in WT, i.e. the weight vectors furthest away from the decision boundary, will be added to the training sets WM and WN. This process is repeated until a certain stopping criterion is fulfilled.
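One possible shape of such an iterative refinement is sketched below. It is not an implementation of PEBL or of the planned extension: a simple nearest-centroid rule stands in for the SVM, the signed difference of the two centroid distances stands in for the distance from the decision boundary, and the stopping criterion (no vectors left, nothing added, or a maximum number of iterations) is an assumption, as are all names and parameters.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of weight vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def iterative_refinement(w_match, w_non_match, w_rest, step=1, max_iter=10):
    """Grow the two training sets with the most confidently classified vectors."""
    w_match, w_non_match, w_rest = list(w_match), list(w_non_match), list(w_rest)
    for _ in range(max_iter):
        if not w_rest:
            break
        c_match, c_non = centroid(w_match), centroid(w_non_match)

        def score(vector):
            # Positive: closer to the match centroid; negative: closer to
            # the non-match centroid. Larger magnitude means more confident.
            return math.dist(vector, c_non) - math.dist(vector, c_match)

        ranked = sorted(range(len(w_rest)),
                        key=lambda i: score(w_rest[i]), reverse=True)
        new_match = [i for i in ranked[:step] if score(w_rest[i]) > 0]
        new_non = [i for i in ranked[-step:] if score(w_rest[i]) < 0]
        if not new_match and not new_non:
            break  # nothing could be classified with confidence
        w_match += [w_rest[i] for i in new_match]
        w_non_match += [w_rest[i] for i in new_non]
        moved = set(new_match) | set(new_non)
        w_rest = [v for i, v in enumerate(w_rest) if i not in moved]
    return w_match, w_non_match, w_rest
```

The `step` parameter controls how many vectors are moved into each training set per iteration; a small value keeps the additions conservative at the cost of more iterations.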
Second, similar to the inclusion of ‘spy’ documents in the S-EM approach to partially supervised text classification (Liu et al. 2002), adding randomly sampled weight vectors from the ‘gap’ between the selected matches and non-matches into the training example sets should improve the overall classification accuracy. Random sampling should be conducted such that weight vectors closer to the exact match vector m are more likely to be included in the set of match training examples, WM, while weight vectors that contain mainly dissimilarity values should more likely be included in the set of non-match training examples, WN. This idea is currently being implemented and results will be reported elsewhere.
Additionally, the effects of using different approximate string comparison techniques (Christen 2006) on the proposed approach will also be investigated.
Other future work will include a scalability and complexity analysis, as well as timing measurements, of the proposed approach. Given that only a fraction of all weight vectors are selected into the two training sets in step one, the time required to train a classifier in the second step of the proposed approach should be small compared to using all weight vectors for supervised classification or clustering.
6 Acknowledgements
This work is supported by an Australian Research Council (ARC) Linkage Grant LP0453463 and partially funded by the New South Wales Department of Health. The author would like to thank Paul Thomas for proof-reading this paper.
Figure 3: Precision results for all data sets and classification methods.
Figure 4: Recall results for all data sets and classification methods.
Figure 5: F-measure results for all data sets and classification methods.

References
Basu, S., Banerjee, A. & Mooney, R.J. (2002), Semi-supervised clustering by seeding, in ‘International Conference on Machine Learning’ (ICML’02), Sydney, Australia, pp. 19–26.
Baxter, R., Christen, P. & Churches, T. (2003), A comparison of fast blocking methods for record linkage, in ‘ACM SIGKDD Workshop on Data Cleaning, Record Linkage and Object Consolidation’, Washington DC, pp. 25–27.
Bhattacharya, I. & Getoor, L. (2007), ‘Collective entity resolution in relational data’, ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 1, no. 1.
Bilenko, M. & Mooney, R.J. (2003), Adaptive duplicate detection using learnable string similarity measures, in ‘ACM International Conference on Knowledge Discovery and Data Mining’ (SIGKDD’03), Washington DC, pp. 39–48.
Bilenko, M., Basu, S. & Sahami, M. (2005), Adaptive product normalization: Using online learning for record linkage in comparison shopping, in ‘IEEE International Conference on Data Mining’ (ICDM’05), Houston, Texas, pp. 58–65.
Chang, C.-C. & Lin, C.-J. (2001), LIBSVM: a library for support vector machines, manual. Department of Computer Science, National Taiwan University. Software available at: http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Christen, P., Churches, T. & Hegland, M. (2004), Febrl – A parallel open source data linkage system, in ‘Pacific-Asia Conference on Knowledge Discovery and Data Mining’ (PAKDD’04), Sydney, Springer LNAI 3056, pp. 638–647.
Christen, P. (2005), Probabilistic data generation for deduplication and data linkage, in ‘International Conference on Intelligent Data Engineering and Automated Learning’ (IDEAL’05), Brisbane, Springer LNCS 3578, pp. 109–116.
Christen, P. (2006), A comparison of personal name matching: techniques and practical issues, in ‘Workshop on Mining Complex Data’ (MCD), held at IEEE ICDM’06, Hong Kong.
Christen, P. & Goiser, K. (2007), Quality and complexity measures for data linkage and deduplication, in F. Guillet & H. Hamilton, eds, ‘Quality Measures in Data Mining’, Springer Studies in Computational Intelligence, vol. 43, pp. 127–151.
Churches, T., Christen, P., Lim, K. & Zhu, J.X. (2002), ‘Preparation of name and address data for record linkage using hidden Markov models’, BioMed Central Medical Informatics and Decision Making, vol. 2, no. 9.
Clarke, D.E. (2004), ‘Practical introduction to record linkage for injury research’, Injury Prevention, vol. 10, pp. 186–191.
Cohen, W.W. & Richman, J. (2002), Learning to match and cluster large high-dimensional data sets for data integration, in ‘ACM International Conference on Knowledge Discovery and Data Mining’ (SIGKDD’02), Edmonton, pp. 475–480.
Cohen, W.W., Ravikumar, P. & Fienberg, S.E. (2003), A comparison of string distance metrics for name-matching task, in ‘IJCAI-03 workshop on information integration on the Web’ (IIWeb-03), Acapulco, pp. 73–78.
Elfeky, M.G., Verykios, V.S. & Elmagarmid, A.K. (2002), TAILOR: A record linkage toolbox, in ‘International Conference on Data Engineering’ (ICDE’02), San Jose, pp. 17–28.
Fellegi, I.P. & Sunter, A.B. (1969), ‘A theory for record linkage’, Journal of the American Statistical Association, vol. 64, no. 328, pp. 1183–1210.
Gill, L. (2001), ‘Methods for automatic record matching and linking and their use in national statistics’, National Statistics Methodology Series, no. 25, National Statistics, London.
Goiser, K. & Christen, P. (2006), Towards automated record linkage, in ‘Australasian Data Mining Conference’ (AusDM’06), Sydney, Conferences in Research and Practice in Information Technology (CRPIT), vol. 61, pp. 23–31.
Gu, L. & Baxter, R. (2006), Decision models for record linkage, in ‘Selected Papers from AusDM’, Springer LNCS 3755, pp. 146–160.
Kelman, C.W., Bass, J. & Holman, D. (2002), ‘Research use of linked health data – A best practice protocol’, Aust NZ Journal of Public Health, vol. 26, pp. 251–255.
Liu, B., Lee, W.S., Yu, P.S. & Li, X. (2002), Partially supervised classification of text documents, in ‘International Conference on Machine Learning’ (ICML’02), Sydney, Australia, pp. 387–394.
Liu, B., Dai, Y., Li, X., Lee, W.S. & Yu, P.S. (2003), Building text classifiers using positive and unlabeled examples, in ‘IEEE International Conference on Data Mining’ (ICDM’03), Melbourne, Florida, pp. 179–186.
McCallum, A., Nigam, K. & Ungar, L.H. (2000), Efficient clustering of high-dimensional data sets with application to reference matching, in ‘ACM International Conference on Knowledge Discovery and Data Mining’ (SIGKDD’00), Boston, pp. 169–178.
Nahm, U.Y., Bilenko, M. & Mooney, R.J. (2002), Two approaches to handling noisy variation in text mining, in ‘ICML’02 workshop on text learning’ (TextML’02), Sydney, Australia, pp. 18–27.
Neiling, M. (2005), Identification of real-world objects in multiple databases, in ‘29th Annual Conference of the Gesellschaft für Klassifikation e.V.’, University of Magdeburg, pp. 63–74.
Nigam, K., McCallum, A.K., Thrun, S. & Mitchell, T. (2000), ‘Text classification from labeled and unlabeled documents using EM’, Machine Learning, vol. 39, no. 2, pp. 103–134.
Rahm, E. & Do, H.H. (2000), ‘Data cleaning: Problems and current approaches’, IEEE Data Engineering Bulletin, vol. 23, no. 4, pp. 3–13.
Sarawagi, S. & Bhamidipaty, A. (2002), Interactive deduplication using active learning, in ‘ACM International Conference on Knowledge Discovery and Data Mining’ (SIGKDD’02), Edmonton, Canada, pp. 269–278.
Tejada, S., Knoblock, C.A. & Minton, S. (2000), Learning domain-independent string transformation weights for high accuracy object identification, in ‘ACM International Conference on Knowledge Discovery and Data Mining’ (SIGKDD’02), Edmonton, Canada, pp. 350–359.
Wang, G., Chen, H., Xu, J.J. & Atabakhsh, H. (2006), ‘Automatically detecting criminal identity deception: An adaptive detection algorithm’, IEEE Transactions on Systems, Man and Cybernetics (Part A), vol. 36, no. 5, pp. 988–999.
Winkler, W.E. (2000), ‘Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage’, Technical report RR2000/05, US Bureau of the Census.
Winkler, W.E. (2004), ‘Methods for evaluating and creating data quality’, Information Systems, Elsevier, vol. 29, no. 7, pp. 531–550.
Winkler, W.E. (2006), ‘Overview of record linkage and current research directions’, Technical report RR2006/02, US Bureau of the Census.
Yu, H., Han, J. & Chang, K.C.C. (2002), PEBL: Positive example based learning for Web page classification using SVM, in ‘ACM International Conference on Knowledge Discovery and Data Mining’ (SIGKDD’02), Edmonton, Canada, pp. 239–248.