6.3.1 Choice of blocking variables
Blocking makes the comparison of two datasets efficient by reducing the number of record pairs that have to be compared. For example, linking a dataset of 100,000 records with another of 1,000,000 records without blocking would require 100,000 x 1,000,000 (100 billion) comparisons. Blocking cuts this down by comparing only those records that match exactly on the specified blocking variable. In effect, the comparison space is reduced to the record pairs that have the potential to match, as determined by the blocking variables.
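The reduction in comparison pairs can be illustrated with a minimal Python sketch. The function names and the toy blocking keys (a birth year) are hypothetical and chosen only for illustration; records with a missing key are assumed to fall into no block.

```python
from collections import Counter


def full_comparisons(n_a, n_b):
    """Number of record pairs when every record is compared with every other."""
    return n_a * n_b


def blocked_comparisons(keys_a, keys_b):
    """Number of record pairs when only records sharing a blocking key are compared.

    keys_a and keys_b hold the blocking-key value of each record in the two
    files; None marks a missing value and is left out of every block.
    """
    counts_a = Counter(k for k in keys_a if k is not None)
    counts_b = Counter(k for k in keys_b if k is not None)
    # Within one block, every file A record is paired with every file B record.
    return sum(n * counts_b[key] for key, n in counts_a.items())


# Toy example: blocking on a (hypothetical) birth-year field.
keys_a = ["1980", "1980", "1975"]
keys_b = ["1980", "1975", "1975", "1990"]
print(full_comparisons(len(keys_a), len(keys_b)))
print(blocked_comparisons(keys_a, keys_b))
```

In the toy example the three-by-four cross product is replaced by the pairs inside the 1980 and 1975 blocks only; the 1990 record is never compared.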
In choosing the blocking variable, the analyst aims to keep the blocks small enough to reduce the number of comparison pairs efficiently, yet big enough to avoid missing true matching record pairs.31 For instance, if the analyst blocks on gender, two huge blocks are created, resulting in an inefficiently large number of comparisons to perform. On the other hand, blocking on a numeric identifier produces numerous mini-blocks – perhaps as many blocks as there are records in the datasets. A problem then arises when the blocking variable contains an error or a missing value: the two matching records fall into different blocks, are never compared, and the match is missed. The following methods have been employed successfully at Statistics NZ to design blocks of good quality and size.
One technique is to choose a variable that has a good number of values (which overcomes the problem with the gender variable in the example above) and a fairly uniform distribution, so that the blocks are of uniform size. Blocks of uniform size are desired because the number of comparison pairs “generated by any blocking method depends on the number of blocks (the method) generates and (the resulting blocks’) sizes. Very large blocks have therefore dominant effects on the efficiency of the blocking methods”.32
It is also desirable to have a blocking variable that has a high reliability value, in order to avoid the scenario of two matching records failing to be in the same block, with no chance of being linked.
One approach is to “keep the block sizes as small as possible and compensate for errors in blocking by running multiple passes”.33 This technique is achieved by using multiple blocking variables in the different passes to overcome block size problems (very large blocks may heavily slow down the linkage software or could even cause it to crash) and data errors.
Essentially, each time a pass is run the links found are kept, and another pass with new blocks and new comparison pairs is performed on the remaining unlinked records. Because the blocks and comparison pairs are new, true matches missed in an earlier pass get another chance of being linked.
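The multi-pass procedure can be sketched in Python as follows. This is an illustration only: the record layout (dictionaries with an "id" key), the function name run_passes and the is_link decision function are all hypothetical, and the sketch accepts the first acceptable candidate in a block, whereas linkage software would normally rank candidates by composite weight.

```python
from collections import defaultdict


def run_passes(file_a, file_b, blocking_fields_per_pass, is_link):
    """Link two files in several passes, each with its own blocking fields.

    Links found in a pass are kept; only the records still unlinked are
    carried into the next pass.
    """
    links = []
    remaining_a, remaining_b = list(file_a), list(file_b)
    for fields in blocking_fields_per_pass:
        # Build the blocks for file B; records with a missing key join no block.
        blocks = defaultdict(list)
        for rec_b in remaining_b:
            key = tuple(rec_b.get(f) for f in fields)
            if None not in key:
                blocks[key].append(rec_b)
        linked_b = set()
        unlinked_a = []
        for rec_a in remaining_a:
            key = tuple(rec_a.get(f) for f in fields)
            match = None
            if None not in key:
                for rec_b in blocks.get(key, []):
                    if rec_b["id"] not in linked_b and is_link(rec_a, rec_b):
                        match = rec_b
                        break
            if match is None:
                unlinked_a.append(rec_a)
            else:
                links.append((rec_a["id"], match["id"]))
                linked_b.add(match["id"])
        remaining_a = unlinked_a
        remaining_b = [r for r in remaining_b if r["id"] not in linked_b]
    return links, remaining_a, remaining_b


file_a = [
    {"id": "a1", "surname": "SMITH", "birth_year": 1980, "forename": "ANN"},
    {"id": "a2", "surname": "JONES", "birth_year": 1975, "forename": "BOB"},
]
file_b = [
    {"id": "b1", "surname": "SMITH", "birth_year": 1980, "forename": "ANN"},
    {"id": "b2", "surname": "JONSE", "birth_year": 1975, "forename": "BOB"},
]


def same_person(a, b):
    # Hypothetical stand-in for the weight-based link decision.
    return a["forename"] == b["forename"] and a["birth_year"] == b["birth_year"]


links, left_a, left_b = run_passes(
    file_a, file_b, [("surname",), ("birth_year",)], same_person
)
print(links)
```

The misspelt surname in record b2 prevents a link in the first pass (blocking on surname), but the pair is brought together in the second pass, which blocks on year of birth.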
Truncated fields can also be used to mitigate the effects of erroneous encoding when blocking, in addition to using phonetic coding and variables which are thought to be reliable.
For instance, because the SOUNDEX codes for the surnames ‘William’ and ‘Williams’ are different, a new variable containing a truncated form of the surname could be considered. This new field, together with other fields, could produce new matches which otherwise might have been missed.
31 Baxter R, Christen P, Churches T (2003). “A Comparison of Fast Blocking Methods for Record Linkage”, CMIS Technical Report 03/139, First Workshop on Data Cleaning, Record Linkage and Object Consolidation, KDD 2003, Washington DC.
32 Gu L, Baxter R (2004). “Adaptive Filtering for Efficient Record Linkage”, SIAM International Conference on Data Mining Conference Proceedings, Florida.
33 Ascential Software (2002). Integrity SuperMATCH Concepts and Reference Guide Version 4.0, 5–17.
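The ‘William’/‘Williams’ case can be reproduced with a short Python sketch. The soundex function below follows the commonly published American Soundex rules and is written for illustration; it is not the implementation used by any particular linkage package. The truncated_key helper and its default length of four characters are hypothetical.

```python
# Letter-to-digit table of the American Soundex scheme.
_CODES = {}
for _digit, _letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for _ch in _letters:
        _CODES[_ch] = str(_digit)


def soundex(name):
    """Return the four-character Soundex code of a name ('' if it has no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        # H and W do not separate two letters that share a code; vowels do.
        if ch not in "HW":
            prev = digit
    return (code + "000")[:4]


def truncated_key(name, length=4):
    """Hypothetical blocking key: the first `length` letters of the name."""
    return name.upper()[:length]


print(soundex("William"), soundex("Williams"))
print(truncated_key("William"), truncated_key("Williams"))
```

The two surnames receive different Soundex codes because of the final ‘s’, so they would fall into different blocks, whereas the truncated key places them in the same block.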
Event dates, birth dates separated into month, day and year, forenames, and surnames (or their corresponding phonetic codes) are good blocking variables. Unique identification numbers, although potentially erroneous or missing, partition the files into a large number of sets. Unless there is rigorous control of the issue and recording of identifiers, the recommendation is to use unique identifiers as blocking variables in the first pass, with other matching variables to verify the link. Other variables would be used to block in subsequent passes.
Sparsely populated fields are not good for blocking purposes, since records with missing values will remain unblocked and ineligible for potential linking.
6.3.2 Choice of linking variables
Practically all the variables common to the two datasets undergoing integration can be used for linking. In doing so, redundancies in the information imparted by the related variables may be helpful in reducing matching errors, provided the errors are not highly correlated or functionally dependent.34 It should be noted, however, that linkage software does not necessarily compute correlations. Moreover, it is not advisable to have highly correlated linking variables in the same pass, as they increase the composite weight without providing additional discrimination between record pairs which should link and those which should not.
Usually, however, only a subset of the variables common to the datasets is used for linking.
Gill (2001) suggests using six groups of variables, with a combination of variables coming from the different groups to be used for linking. The six groups are:
Group 1: Proper names, which rarely change over a person’s lifetime (except possibly for a woman’s surname) (eg forenames, initials, surnames)
Group 2: Non-name personal characteristics, which rarely change over a lifetime (eg date of birth, sex)
Group 3: Socio-demographic variables that may have several changes over a lifetime (eg address, marital status)
Group 4: Variables collected for special registers (eg occupation, date of injury, diagnosis)
Group 5: Variables used for family record linkage (eg surnames in Group 1 plus other surnames, birth weight)
Group 6: Arbitrarily allocated numbers that identify the record (eg IRD number).
Gill notes that it is common practice to choose and combine linking variables from Groups 1, 2, 3 and 6.
When choosing linking variables, spelling errors, phonetic coding choice (SOUNDEX codes for William and Williams are different) and the like may affect the classification of the variables as either ‘agreeing’ or ‘disagreeing’. A quick run-through of the problems with some common variables and what has been done in practice to increase their reliability will also help the analyst in selecting the linking (and blocking) variables.
34 Gu L, Baxter R, Vickers D, Rainsford C(2003). “Record Linkage: Current Practice and Future Directions”, CMIS Technical Report No 03/83, CSIRO Mathematical and Information Sciences, Canberra.
Surnames: May be prone to changes, as in marriage and divorce. The order of use of the surnames in some ethnic groups may be different. Surnames may be prone to spelling variations resulting from erroneous transcription. A phonetically coded surname may be used to reduce the effect of transcription/spelling errors. A surname array (different surname fields merged into one) may also be used to handle multiple surnames. The arrays are then compared using some comparison function (see section 3.3) to make allowance for misspellings.
Forenames: Possess many of the same problems as surnames. Modernised name versions and nicknames are possibly used in some documents, while the formal forename is used in others. Forenames may be prone to transcription/spelling errors. Sometimes only forename initials, instead of full forenames, are available from the dataset. An array of the initials may be created as a new variable.
Sex: Generally reliable, if collected at all. Sex, however, has a low discriminatory power in distinguishing between a match and a non-match.
Birth date: There may be differences in the format (eg European v American format, although this should have been handled during the standardisation phase). Birth month and birth day are usually more reliable than the birth year. Gill suggests some tolerance when using the birth year, as this is more prone to error than the month or day of birth.
Age: May be used with some tolerance, like year of birth. When available together with the birth date, a data check can be performed to see if these two agree.
Address: As with dates there can be format problems, but the field can be standardised. The standardisation process can be laborious. (When done in SAS, for instance, rule sets can be used to standardise addresses in a relatively straightforward manner.) This is a good field for confirming matches, but could be poor for cases of disagreements, as the person might have changed address. When not used as a linking variable, the unlinked records from each of the two datasets being integrated can be sorted according to address, and then a comparison of the sorted files can be made to check if some matching records have failed to link.
Experience at Statistics NZ across various data integration projects shows that, generally, the standardised forms of most of the above variables are reliable.
6.3.3 Commonly used comparison functions for linking variables
Each field that is used for linking (and thus compared) will have an agreement or a disagreement weight (some positive/negative value, respectively). The field weight takes the full agreement weight if the fields completely agree (see Chapter 5). Field agreement or disagreement, however, need not be exact. With the use of comparison functions, partial agreements are possible. Below are some commonly used comparison functions available in QualityStage,35 the software package currently used by Statistics NZ for its data integration projects.
ABS_DIFF: The absolute difference comparison. It compares the difference between two numeric values. As an example, assume the field being compared is age and the tolerance specified is “5” (ie plus or minus five years in the ages would still be considered to match). If the age in the first dataset is 24 and the age in the other is 28, since the difference is within the allowed tolerance, the full agreement weight is assigned to the field age for this particular comparison pair. If the age in the second dataset is 30, however, since the difference is beyond the tolerance, the field weight for age would be the full disagreement weight. Note that unlike several of the comparison functions below, the ABS_DIFF does not assign partial agreement weights.
35 QualityStage (2003). Match Concepts and Reference Guide Version 7.0, Chapter 5, 1–34.
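The all-or-nothing behaviour of ABS_DIFF in the age example can be mimicked with a few lines of Python. This is an illustrative sketch, not the QualityStage implementation; the function name and the weights 4.0 and -3.0 are hypothetical.

```python
def abs_diff_weight(value_a, value_b, tolerance, agree_weight, disagree_weight):
    """Full agreement weight within the tolerance, full disagreement weight beyond it."""
    if abs(value_a - value_b) <= tolerance:
        return agree_weight
    return disagree_weight


# Ages 24 and 28 are within the tolerance of 5; ages 24 and 30 are not.
print(abs_diff_weight(24, 28, 5, agree_weight=4.0, disagree_weight=-3.0))
print(abs_diff_weight(24, 30, 5, agree_weight=4.0, disagree_weight=-3.0))
```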
CHAR: The character-by-character comparison. A mismatch in any character of the fields being compared results in the assignment of the full disagreement weight.
DATE8: The comparison that allows tolerance in dates. At least one of two tolerance parameters has to be specified. If only the first tolerance parameter is set, say “2”, the analyst has allowed up to two days difference between the two dates compared. Unlike ABS_DIFF and CHAR, however, a partial agreement weight is assigned to the field if the difference is within the prescribed tolerance. If only the first parameter is specified and the value entered is “2”, a one-day difference between the dates compared reduces the agreement weight by one-third of the weight range (agreement weight – disagreement weight). A two-day difference cuts the agreement weight by two-thirds of the range. A three-day difference merits the assignment of the full disagreement weight.
If two parameters are specified, the first parameter is the number of days tolerated when variable B > variable A, while the second parameter is the number of days tolerated when variable B < variable A.
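The DATE8 weights described above can be sketched in Python as follows. The interpolation rule, a reduction of d/(tolerance + 1) of the weight range for a difference of d days, is inferred from the worked example with a tolerance of “2” and is an assumption about how the software behaves. The function name, the parameter names and the weights 6.0 and -3.0 are hypothetical, as is the assumption that a single tolerance applies in both directions.

```python
from datetime import date


def date8_weight(date_a, date_b, agree_weight, disagree_weight,
                 tol_b_later, tol_b_earlier=None):
    """Prorated weight for two dates compared with a tolerance in days.

    tol_b_later applies when date B is after date A, tol_b_earlier when it
    is before; if only one tolerance is given it is used in both directions.
    """
    if tol_b_earlier is None:
        tol_b_earlier = tol_b_later
    diff = (date_b - date_a).days
    tolerance = tol_b_later if diff >= 0 else tol_b_earlier
    gap = abs(diff)
    if gap > tolerance:
        return disagree_weight
    weight_range = agree_weight - disagree_weight
    # Assumed rule: each day of difference removes 1/(tolerance + 1) of the range.
    return agree_weight - weight_range * gap / (tolerance + 1)


base = date(2000, 1, 10)
for day in (10, 11, 12, 13):
    print(day, date8_weight(base, date(2000, 1, day), 6.0, -3.0, 2))
```

With a weight range of 9, the one-day and two-day differences lose 3 and 6 points respectively, and the three-day difference receives the full disagreement weight, in line with the example in the text.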
MULT_EXACT: The comparison function used to allow the agreement of free-form text when the order of the words does not matter and where there may be missing or erroneous words.
It is similar to comparing arrays where the individual words are the array elements. The string of characters to be compared from each of File A and File B must be specified.
MULT_UNCERT: The comparison function identical to the MULT_EXACT comparison function, except for a parameter of uncertainty, which must be specified and is based on how similar the two comparison strings are. A higher value is given to identical strings; a lower value to strings that are almost certainly different. Weights are linearly proportioned between the full agreement weight and disagreement weights, depending on how close the score is to the specified threshold. A score outside the threshold is given the full disagreement weight.
NAME_UNCERT: The comparison allowing truncated fields. For example, the field in one dataset has the name Albert, while the other dataset has Al. If CHAR is used, the fields will not match (total disagreement). With NAME_UNCERT, the comparison will use the shorter length (truncation) of the two names and will not compare characters after that length. In the example above, the two names are considered to fully agree. A parameter, the minimum threshold, must be specified where the value given is based on how similar the two strings are. Weights are linearly proportioned between the full agreement weight and disagreement weights, depending on how close the score is to the specified threshold. A score outside the threshold is given the full disagreement weight.
PRORATED: Like the ABS_DIFF, the PRORATED comparison function is for comparing numeric fields. The prorated comparison allows numeric fields to disagree by an absolute amount set by an additional parameter. For example, if the parameter was 15 and the absolute value of the difference in the field values is greater than 15, the disagreement weight would be assigned to the comparison. If the difference were zero, the full agreement weight would be assigned. Any difference between 0 and 15 would receive a weight prorated according to the size of the difference. A difference of eight would receive a weight exactly between the agreement and disagreement weight.
Two additional arguments can be specified if it matters whether the difference is positive or negative. The first argument is the tolerance if the value on file B is greater than the value on file A. The second argument is the tolerance if the value on file A is greater than the value on file B.
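A minimal Python sketch of the PRORATED comparison is given below. The proration rule, a reduction of d/(parameter + 1) of the weight range for a difference d, is an assumption chosen because it reproduces the statement that a difference of eight with a parameter of 15 falls exactly midway between the two weights; the software's actual rule may differ. The function name, parameter names and weights are hypothetical.

```python
def prorated_weight(value_a, value_b, agree_weight, disagree_weight,
                    tol_b_greater, tol_a_greater=None):
    """Prorated weight for two numeric values.

    tol_b_greater is the tolerance when the file B value is the larger one,
    tol_a_greater when the file A value is larger; a single tolerance is
    applied in both directions if only one is given.
    """
    if tol_a_greater is None:
        tol_a_greater = tol_b_greater
    diff = value_b - value_a
    tolerance = tol_b_greater if diff >= 0 else tol_a_greater
    gap = abs(diff)
    if gap > tolerance:
        return disagree_weight
    weight_range = agree_weight - disagree_weight
    # Assumed rule: the reduction grows linearly and reaches the full range
    # one unit beyond the tolerance.
    return agree_weight - weight_range * gap / (tolerance + 1)


print(prorated_weight(100, 100, 6.0, -3.0, 15))
print(prorated_weight(100, 108, 6.0, -3.0, 15))
print(prorated_weight(100, 116, 6.0, -3.0, 15))
```

With an agreement weight of 6.0 and a disagreement weight of -3.0, the difference of eight receives 1.5, the midpoint of the two weights.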
UNCERT: A character comparison which allows partial weight assignments like the NAME_UNCERT. The weight assigned is based on the difference between the two strings compared as a function of the string length, the number of character transpositions, unassigned insertions, deletions or replacement of characters (recall that NAME_UNCERT is only for truncated names). A parameter, the minimum threshold, must be specified where the value given is based on how similar or not the two strings are. Weights are linearly proportioned between the full agreement weight and disagreement weights depending on how close the score is to the specified threshold. A score outside the threshold is given the full disagreement weight.
6.3.4 The m and u probabilities
The m and u probabilities can be defined in two different ways. Global m and u probabilities assume that the probability is constant across all values of the variable. Value-specific m and u probabilities may differ from one value of the variable to another.
Global u probabilities are used if it is assumed that the distribution of possible values within the field is (nearly) uniform. In practice, the linkage software may automatically estimate value-specific u probabilities to reflect the actual distribution of variable values in the dataset.
Value-specific m probabilities may be used for fields where some values are more reliable than others. However, global m probabilities are generally used, as it is to be expected that the values in a field are affected in the same way by the things that make a field reliable or unreliable (mode of collection, maintenance practices etc).
The m probability is the probability that the fields agree given that the record pair is a match. It is a reflection of how reliable the field is, as it is computed as 1 minus the error rate of the field. Because not all fields are equally reliable, it is expected that m probabilities for different fields will vary. In practice, the error rates are generally not accurately known. Initially, when no estimates of the m probabilities are available, the following may be used:
• For most fields, 0.9
• For very important fields, 0.999
• For moderately important fields, 0.95
• For fields with poor reliability, 0.8 or less.
Setting a high m probability value for a field forces a high penalty for disagreement in that field.
Statistics NZ experience across various data integration projects shows that the standardised variables sex, name, surname and birth date have good m probability values. Variables such as address, ethnicity and phone number have been observed to generally have a lower reliability. This is not to say that these fields are unreliable in the absolute sense. Experience has shown that variables which have been collected and maintained carefully by the source agencies have good m values (are reliable), whereas variables that are of less importance to source agencies – that is, those not necessary to support their core operational requirements – tend to be less reliable. Where the law requires an event to be reported within a prescribed short period of time, event dates have proven to be reliable fields.
While there have been some theoretical approaches to modelling the m values (eg Winkler, 1988),36 at Statistics NZ an iterative approach has been used. The first linking is done using an estimate for m based on what is known broadly of the importance of the variable, or from previous experience, as above. A new m value is then estimated from the values for data that has been linked. The m probability may be estimated by dividing the number of times the field values agree in a comparison by the number of times the value participated in a comparison (excluding in the computation of the m probability the records with missing entries for the field of interest). This should be done when the analyst has a certain degree of confidence that most of the good links have been captured.
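The re-estimation step can be sketched in Python as below. The record layout (pairs of dictionaries), the function name estimate_m and the treatment of None or an empty string as a missing entry are assumptions made for the illustration.

```python
def estimate_m(linked_pairs, field):
    """Estimate the m probability of a field from record pairs already linked.

    Pairs in which either record has a missing entry for the field are left
    out of both the numerator and the denominator.
    """
    compared = 0
    agreed = 0
    for rec_a, rec_b in linked_pairs:
        value_a, value_b = rec_a.get(field), rec_b.get(field)
        if value_a in (None, "") or value_b in (None, ""):
            continue
        compared += 1
        if value_a == value_b:
            agreed += 1
    # With no usable pairs there is nothing to estimate from.
    return agreed / compared if compared else None


linked_pairs = [
    ({"sex": "F"}, {"sex": "F"}),
    ({"sex": "M"}, {"sex": "M"}),
    ({"sex": "M"}, {"sex": "F"}),
    ({"sex": None}, {"sex": "F"}),
]
print(estimate_m(linked_pairs, "sex"))
```

In the toy data the field agrees in two of the three pairs where both values are present, so the estimate is 2/3; the pair with a missing entry is ignored.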
The u probability is the probability that the fields agree given that the record pair is not a match. This is a reflection of how likely things are to agree by chance. Assuming a uniform distribution for the values a field may take, u is estimated by 1/n, where n is the number of field values. For example, it may be estimated that the u probability for gender is 1/2, as the variable gender takes two possible values. Similarly, for the variable month of birth, the u probability may be estimated as 1/12.
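The following Python sketch shows the 1/n estimate of u and how m and u combine into field weights. The weight formulas used, log2(m/u) for agreement and log2((1 - m)/(1 - u)) for disagreement, are the standard Fellegi-Sunter form; it is assumed here, not confirmed by this section, that they correspond to the weights defined in Chapter 5. The function names are hypothetical.

```python
import math


def global_u(n_values):
    """Global u probability for a field with n equally likely values."""
    return 1 / n_values


def agreement_weight(m, u):
    """Standard Fellegi-Sunter agreement weight (assumed form)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Standard Fellegi-Sunter disagreement weight (assumed form)."""
    return math.log2((1 - m) / (1 - u))


u_gender = global_u(2)   # two possible values
u_month = global_u(12)   # twelve possible months of birth

for m in (0.8, 0.9, 0.999):
    print(m, agreement_weight(m, u_month), disagreement_weight(m, u_month))
```

Under these formulas, a field with many values (small u) earns a larger agreement weight than a field such as gender, and raising m makes the disagreement weight more strongly negative, which is the high penalty for disagreement noted earlier in this section.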