
Weights

In document Data Integration Manual (Page 50-54)

Once the weights have been calculated, the next step is to decide which records are links and which are non-links, based on the evidence of the weights.

5.7.1 Distribution

In a typical integration project there are hundreds of thousands of records and millions of possible pairings. Most of those record pairs do not refer to the same entity, so many more non-links than links are created. The distribution of the weights is therefore bimodal, with one group of pairs at low weights and another at high weights, as in the following figure:

[Figure: Distribution of Composite Weights Across All Possible Comparison Pairs. Horizontal axis: weights (-150 to 100). Vertical axis: number of comparison pairs. Curves: Non-Matches, Matches, Observed.]

(Note: the ‘observed’ line is the distribution actually observed, which has been offset slightly to make it easier to distinguish from the other distributions.)

5.7.2 Cut-off thresholds

Once the weights have been calculated, upper and lower thresholds are established. The upper threshold is the weight above which every record pair is determined to be a link. There is usually only one link per record, so other possible pairings can either be ignored or considered duplicate records. The lower threshold is the weight below which every record pair is determined to be a non-link.

[Figure: Distribution of Composite Weights and Threshold Cut-offs. Horizontal axis: weights (-150 to 100). Vertical axis: number of comparison pairs. Marked regions: Non-links, Lower Cut-off, Upper Cut-off, Links.]

The problem is that, although record pairs are in reality either true matches or true non-matches, the picture is not so clear in data integration, where the data are imperfect or insufficient. Some true matches have low weights because of data errors or similar problems, just as some true non-matches are given high weights for the same reasons.

Theory exists on the best way to determine the threshold levels (as outlined in Fellegi and Sunter, 1969), but in practice it is up to the person working on the data integration project to decide where the cut-off thresholds go. This is usually done by reviewing record pairs near a likely cut-off point and judging how well the weights separate the pairings there. More detail about the impact of setting particular threshold levels is given in Chapter 6, below.

5.7.3 Clerical review

If the link and non-link thresholds are the same, this divides the set of record pairs cleanly into two sets. However, if they are not, then the record pairs with weights in between the two limits are in the ‘clerical review’ area. In this area, the human operator decides which record pairs are links and which are non-links.

With some statistical integration software, it might not be possible to assess record pairs in a clerical review area. In this case, it is necessary to make the link and non-link thresholds the same.
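The three-way decision described above can be sketched in a few lines of Python. This is only an illustration, not the method of any particular integration software: the function name `classify` and the threshold values in the usage lines are hypothetical, and the treatment of a weight exactly equal to a threshold (counted here as a link at the upper threshold) is an assumption made so that equal thresholds split the pairs cleanly into two sets.

```python
def classify(weight: float, lower: float, upper: float) -> str:
    """Assign a record pair to 'link', 'non-link' or 'clerical review'.

    Assumption: a weight exactly equal to the upper threshold counts as
    a link, so that lower == upper leaves no clerical review area.
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if weight >= upper:
        return "link"
    if weight < lower:
        return "non-link"
    # Weights between the two thresholds go to a human operator.
    return "clerical review"


# Hypothetical thresholds of -10 (lower) and 40 (upper).
decisions = [classify(w, lower=-10, upper=40) for w in (-80, 5, 65)]
```

Setting `lower` and `upper` to the same value reproduces the case where no clerical review is possible: every pair is then either a link or a non-link.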

5.8 Blocking

As mentioned above, there is likely to be a very large number of records to compare.

Comparing 1,000 records with 1,000 records means that 1,000,000 comparisons are made. If only 1,000 of those record pairs are matches, the remaining 999,000 record pairs are non-matches, which should be determined to be non-links.

[Figure: two files of 1,000 records each. 1,000 x 1,000 = 1,000,000. Total Comparisons = 1,000,000.]

To reduce the number of comparisons made and focus on the records that are more likely to be matches, the records can be filtered first so that only certain records are considered in comparison to each other.

This filtering is called ‘blocking’, and is done by selecting variables to ‘block on’. Only records that agree on the values of those variables are compared to each other. For example, if sex is chosen as a blocking variable, only records with the same value of sex are compared to each other. This cuts out about half of the comparisons required. If month of birth is chosen, the number of comparisons decreases by a factor of about 12. Choosing both sex and month of birth means that about 1/24th of the comparisons are made. These figures assume that records are spread roughly evenly across the values of the blocking variables.

The following diagram illustrates the reduction in comparisons for the case where there are five equal-sized blocks on each file.

[Figure: each file of 1,000 records is split into 5 blocks of 200 records. Each block is compared only with the corresponding block on the other file: 200 x 200 = 40,000 comparisons per block, for 5 blocks. Total Comparisons = 200,000.]
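The arithmetic of the five-block example can be checked with a short Python sketch. The function `count_comparisons` and the `block` field are hypothetical names chosen for illustration, and the records are synthetic, built so that each file falls into five equal-sized blocks.

```python
from collections import Counter


def count_comparisons(file_a, file_b, key=None):
    """Count the record pairs that would be compared.

    With no blocking key, every record in file_a is compared with every
    record in file_b. With a key, only records sharing the same key
    value are compared.
    """
    if key is None:
        return len(file_a) * len(file_b)
    counts_a = Counter(key(rec) for rec in file_a)
    counts_b = Counter(key(rec) for rec in file_b)
    # A Counter returns 0 for key values that appear on one file only.
    return sum(n * counts_b[value] for value, n in counts_a.items())


# Synthetic files: 1,000 records each, five equal-sized blocks of 200.
file_a = [{"id": i, "block": i % 5} for i in range(1000)]
file_b = [{"id": i, "block": i % 5} for i in range(1000)]

unblocked = count_comparisons(file_a, file_b)                       # 1,000,000
blocked = count_comparisons(file_a, file_b, lambda r: r["block"])   # 200,000
```

With real data the blocks are rarely equal in size, so the reduction depends on how the values of the blocking variable are distributed.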

In the example introduced in section 5.4, above, with the John Black/Jon Block record pair, if sex were used as a blocking variable, the two records would still be compared. If year of birth were used as a blocking variable, they would not be compared.

5.9 Passes

In the example above, if year of birth were chosen to block on, the two records would not be compared. If this were the only round of comparisons run, these records would never be compared. However, more than one round can be run.

A ‘pass’ is an iteration of record linkage using a combination of blocking variables and matching variables. In a data integration project, more than one pass is used to block the file in different ways, to allow for different variable comparisons, and to allow for errors in the blocking variables. For example, one pass might block on year of birth, and match on name and sex. Another pass might block on sex, and match on name and address.
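A minimal Python sketch of how several passes widen the set of compared pairs is given here. It covers only the blocking side of a pass, not the matching or weighting. The function names, field names and record values are all hypothetical: the two records are modelled on the John Black/Jon Block pair, with an assumed shared sex and assumed differing years of birth.

```python
def candidate_pairs(file_a, file_b, block_on):
    """Yield (id_a, id_b) for records that agree on all blocking variables."""
    index = {}
    for rec in file_b:
        index.setdefault(tuple(rec[v] for v in block_on), []).append(rec)
    for rec in file_a:
        for other in index.get(tuple(rec[v] for v in block_on), []):
            yield (rec["id"], other["id"])


def run_passes(file_a, file_b, passes):
    """Collect the pairs compared in at least one pass."""
    compared = set()
    for block_on in passes:
        compared.update(candidate_pairs(file_a, file_b, block_on))
    return compared


# Hypothetical values: same sex, different recorded year of birth.
file_a = [{"id": "A1", "name": "John Black", "sex": "M", "year_of_birth": 1970}]
file_b = [{"id": "B1", "name": "Jon Block", "sex": "M", "year_of_birth": 1971}]

one_pass = run_passes(file_a, file_b, [["year_of_birth"]])
two_passes = run_passes(file_a, file_b, [["year_of_birth"], ["sex"]])
```

In this sketch the pair is missed when year of birth is the only blocking variable, and picked up once a second pass blocks on sex.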

The number of passes used should reflect how well the record linkage process is working. If only a small number of links are created in each pass, then multiple passes might be needed. On the other hand, this might indicate that there isn’t enough information to give high-quality links and more passes won’t help. It is up to the user to decide.

6 Record Linkage in Practice

Summary

This chapter focuses on the practical application of record linkage. It also covers things that can go wrong, and discusses what makes the output of a record linkage exercise ‘fit for use’.
