Conclusion and future work - Aspects of record linkage

The approach described in this chapter can be used in record linkage with practical recall and precision properties in a computationally efficient way. Because the model performs normalization on individual records, there is no trade-off between computational efficiency and recall. The method produces a substantial number of links with high edit distance, which is desirable for any record linkage procedure. The accuracy of the method can be attributed to the fact that a comparison of cores is a more informed string similarity measure than traditional edit distance. The core of a name represents the elements that are actually important for the identity of that name, based on training data. This provides a conceptual foundation for the method, while in traditional edit distance all characters are considered equal.

There have been various extensions and adaptations of edit distance that address the relative importance of characters, such as Soundex (position and grouping of characters), Jaro-Winkler distance (prefix matching), or weighted Levenshtein distance (learning common edit operations from data). The current work is an attempt to build a model that can cover all these aspects, and to deduce from data what the most important aspects of strings (in this case names) actually are.

The contribution of this work consists of three aspects: a novel, morphologically motivated model of name variation; computational efficiency and high recall in discovering links with small edit distance; and the additional discovery of a significant number of links with large edit distance within practical levels of precision.

The method has been evaluated on the domain of historical archives in the Netherlands. However, the method itself is not restricted to the Dutch language or to historical data. Provided that training data on name variation is available, the method can be applied to various other domains.

In future work the training data can be chosen to be more specific and the feature set can be expanded in order to improve precision and recall. Bootstrapping can be developed to increase the use of information contained in the data.

Internal variant mining

This chapter describes an approach to discover name variants based on automatically derived record matches. The chapter is based on the paper Learning name variants from true person resolution [15].

6.1 Introduction

Variation in person names is a key aspect of record linkage. The issue of variation can be addressed using string similarity measures, as described in Chapters 1, 2 and 4. Alternatively, names can be mapped to some kind of base representation, which results in a binary classification of variants and non-variants. In Chapter 5 a model is developed to compute a base form using a set of features. In the current chapter a match-oriented approach is used to find name variant pairs. These pairs can either be used directly in matching or as an intermediate step in the derivation of base representations. The approach is based on the concept of excess information: if a subset of the information contained in a pair of records is sufficient to establish a match between the two records, then the remaining information can be used to derive domain knowledge. Applied to the current dataset this means that if two records that both contain n person names have n−1 names in common, then the nth name is a variant. Figure 6.1 provides an example of this concept.

marriage, 06-06-1858             death, 13-09-1882

bride                            deceased
  First name   Johanna             First name   Johanna
  Last name    Endt                Last name    Endt

father bride                     father deceased
  First name   Gerrit              First name   Gerrit
  Last name    Endt                Last name    Endt

mother bride                     mother deceased
  First name   Dorothea            First name   Doortje
  Last name    Kerbert             Last name    Kerbert

Figure 6.1: Example of a name variant derived from excess information.

In this example all names, except for the first name of the mother, are equal. This is sufficient to assume a match between these two certificates. The first name of the mother is used to derive the true variant Dorothea–Doortje. The Levenshtein edit distance between the variant names is 4, which means that the proportional edit distance relative to word length is around 0.5. Both the absolute and the relative distance would likely exceed any reasonable acceptance threshold for name variants, so a distance-based approach would reject this pair. The Jaro-Winkler distance is 0.85, which is a borderline value for variant acceptance. However, using the excess information approach this true variant will be discovered without considering any string similarity threshold.
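The Levenshtein figures quoted for Dorothea–Doortje can be checked with a short sketch. The function names below are illustrative and not taken from the original work, and dividing by the length of the longer name is an assumed normalization; the text only states that the distance is taken relative to word length.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def relative_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer name length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(levenshtein("Dorothea", "Doortje"))        # 4
print(relative_distance("Dorothea", "Doortje"))  # 0.5
```

With a typical threshold of one or two edits, this pair would be rejected, which is why the match-oriented approach is needed to recover it.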

This chapter is structured as follows: in Section 6.2 the basic approach is presented, in Section 6.3 several post-processing procedures are discussed to filter incorrect name pairs, in Section 6.4 an evaluation of the approach is provided and Section 6.5 concludes.

6.2 Approach

The general idea of the name variant discovery procedure, as described above, is to extract name variants from accepted record matches. To obtain correct variant pairs, the source matches should be highly accurate. The matches used in the current approach are based on exact matching, which is generally assumed to be accurate. The source data consists of birth, marriage and death certificates from the Genlias dataset. The three certificate types all contain the relation child-parents: newborn child with parents in birth certificates, bridegroom or bride with respective parents in marriage certificates, and deceased with parents in death certificates. In many cases the age or birth date of the child is also provided. Matches between certificates can therefore be based on combinations of three people, as illustrated in Figure 6.1. Each person has a first name and a last name, for a total of six names. However, since the last name of the child is generally the same as the last name of the father, five distinct names can be extracted from each record. Note that a person can have multiple first names or occasionally multiple last names, which are initially considered as a single name containing whitespace.

Following the excess information approach as outlined above, a match is based on four out of five names being exactly equal, with an additional check on year of birth. The fifth name, which is not equal, must be one of the parent names, to prevent sibling names from being treated as variants. The additional check requires that the year of birth differs by at most one year between certificates. In many cases the age is listed instead of the year of birth, in which case the value is derived from the certificate date and the age. This derivation is approximate, because the exact value depends on whether or not the (unknown) birthday of the individual in the certificate year has already taken place at the certificate date. However, the margin of one year can accommodate both situations.
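The matching rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the original work: the `Record` layout, the field names and the helper functions are hypothetical, and the ages in the example are invented because Figure 6.1 does not list them.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Five distinct names per record; the child's last name is assumed equal to the father's.
NAME_FIELDS = ("child_first", "father_first", "father_last", "mother_first", "mother_last")
PARENT_FIELDS = NAME_FIELDS[1:]  # only parent names may differ


@dataclass(frozen=True)
class Record:  # hypothetical record layout
    child_first: str
    father_first: str
    father_last: str
    mother_first: str
    mother_last: str
    certificate_year: int
    birth_year: Optional[int] = None
    age: Optional[int] = None


def estimated_birth_year(record: Record) -> Optional[int]:
    """Use the listed birth year, else approximate it as certificate year minus age."""
    if record.birth_year is not None:
        return record.birth_year
    if record.age is not None:
        return record.certificate_year - record.age
    return None


def variant_pair(a: Record, b: Record) -> Optional[Tuple[str, str, str]]:
    """Return (field, name_a, name_b) if exactly one parent name differs, else None."""
    year_a, year_b = estimated_birth_year(a), estimated_birth_year(b)
    if year_a is None or year_b is None or abs(year_a - year_b) > 1:
        return None
    differing = [f for f in NAME_FIELDS if getattr(a, f) != getattr(b, f)]
    if len(differing) != 1 or differing[0] not in PARENT_FIELDS:
        return None
    field = differing[0]
    return field, getattr(a, field), getattr(b, field)


# Names from Figure 6.1; the ages are invented for illustration only.
marriage = Record("Johanna", "Gerrit", "Endt", "Dorothea", "Kerbert", 1858, age=24)
death = Record("Johanna", "Gerrit", "Endt", "Doortje", "Kerbert", 1882, age=48)
print(variant_pair(marriage, death))  # ('mother_first', 'Dorothea', 'Doortje')
```

Rejecting a differing child name is what keeps siblings (same parents, different first name) from being reported as variants.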

The assumption behind this approach is that an exact match on four out of five names generates accurate matches. To test this assumption, the difference between these matches and fully matching records (five out of five names and year of birth) is examined. If only a single match is expected for a record, and this match can be found using fully exact matching, removal of one of the names should not generate additional matches for this record. This test cannot be used if the number of expected matches is unknown, e.g., for birth-marriage pairs (see also Section 3.1). However, other pairs can be used, such as birth-death, for which at most one link is expected. Using fully exact matching on five names, 1,107,162 matches are found, which accounts for around 25% of all birth certificates. Removing one of the four parent names results in a very low number of 85 incorrect additional matches (0.008%). This indicates that, if five out of five exactly matching names combined with a matching year of birth provide accurate matches, then four out of five exactly matching names combined with a matching year of birth also provide accurate matches.

[...] duplicate matches (two or more birth certificates matching a single death certificate or vice versa), which is 0.2%. A low percentage of duplicate matches indicates that an exact match on five out of five names is indeed accurate, especially considering that part of the duplicate matches are in fact source duplicates, i.e., the same birth recorded in different municipalities, certificates containing corrected or additional information recorded alongside the original certificate, digitization duplicates, et cetera. Some of the duplicates are actually incorrect, however. An example is two children from the same parents with the same first name, born in January and November of the same year, presumably because the first child died, which made the first name available for the next child. Such an error is visible because of the duplicate match, and the incorrect match can be identified using the exact date and possibly additional death certificates. This type of error can also occur for single matches, which means that for every match all available information should be checked, and even then errors can remain undetected. Although the assumption that this type of exact matching is accurate cannot be fully confirmed, exact matches have a very high likelihood of being correct and the resulting set of matches is internally highly consistent.

A matching procedure has been performed on all certificates using the presented set-up, i.e., the year of birth is equal, four out of five names are equal and the fifth name is different. For the total of 14.8 million civil certificates this procedure resulted in 804,470 name pairs. Details are provided in Table 6.1. The table shows a difference between first names and family names: variation in family names occurs more often than variation in first names, but first name variants are repeated more often (5-6 times on average, compared to 2 times for family name variants).

name type             total pairs   unique pairs
male first names          183,050         31,885
female first names        246,519         48,684
family names              374,901        177,258
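The repetition averages mentioned in the text follow directly from the counts in the table. The short sketch below recomputes them; the variable names are illustrative only.

```python
# (total pairs, unique pairs) per name type, copied from the table above.
pair_counts = {
    "male first names": (183_050, 31_885),
    "female first names": (246_519, 48_684),
    "family names": (374_901, 177_258),
}

# Average number of times each unique variant pair is observed.
repetition = {name: total / unique for name, (total, unique) in pair_counts.items()}
total_pairs = sum(total for total, _ in pair_counts.values())

for name, ratio in repetition.items():
    print(f"{name}: {ratio:.1f}")
print(total_pairs)  # 804470
```

The ratios come out at roughly 5.7 and 5.1 for male and female first names and 2.1 for family names, and the totals add up to the 804,470 pairs reported in the text.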
