
4. Chapter Four: Improving Quality of Ethnic Codes in HES

4.4 Record Linkage Method

4.4.1 Introduction to the Record Linkage Method

The record linkage method links different datasets through variables they have in common, such as name, sex and date of birth, and is a reliable way to restore missing data. Linking to other datasets can also enrich the data of interest with additional variables. Two main record linkage methods are available: deterministic matching and probabilistic matching. Deterministic matching is an exact matching method that links datasets by a unique identifier. However, for reasons of confidentiality, the data may not contain such a high-quality identifier, which frequently makes this method impossible. Probabilistic matching is used when a combination of information about the same person (such as age, sex and date of birth), rather than a unique identifier, is available in both datasets (Aspinall and Jacobson, 2007).
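The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration: the field names (`person_id`, `name`, `sex`, `date_of_birth`) and the two-out-of-three agreement threshold are assumptions made for the example, and the equal-weight agreement count is a crude stand-in for probabilistic matching, which in practice weights each field by how informative agreement on it is.

```python
# Illustrative only: field names and the agreement threshold are assumptions,
# not taken from any of the datasets discussed in the text.

def deterministic_match(rec_a, rec_b, key="person_id"):
    """Exact match: link two records only if both carry the same unique identifier."""
    return rec_a.get(key) is not None and rec_a.get(key) == rec_b.get(key)


def agreement_score(rec_a, rec_b, fields=("name", "sex", "date_of_birth")):
    """Count how many quasi-identifying fields agree between two records."""
    return sum(
        1 for f in fields
        if rec_a.get(f) is not None and rec_a.get(f) == rec_b.get(f)
    )


def simple_probabilistic_match(rec_a, rec_b, threshold=2):
    """Crude stand-in for probabilistic matching: link when enough fields agree.

    Equal weights are used only to keep the sketch short.
    """
    return agreement_score(rec_a, rec_b) >= threshold


a = {"person_id": None, "name": "A. Smith", "sex": "F", "date_of_birth": "1950-02-01"}
b = {"person_id": None, "name": "Ann Smith", "sex": "F", "date_of_birth": "1950-02-01"}

print(deterministic_match(a, b))         # no shared identifier, so no exact link
print(simple_probabilistic_match(a, b))  # sex and date of birth agree
```

In this example the two records cannot be linked exactly because neither carries an identifier, but they agree on two of the three fields and so pass the assumed threshold.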

Previous research has used record linkage to link administrative records to surveys and to other administrative data. For example, in the English Longitudinal Study, the 1971-2001 Censuses have been linked together, along with other vital events, for 1% of the population of England and Wales on the basis of individual personal details (Blackwell et al., 2003). The University of Oxford has linked the HES data to mortality data for England from 1998/99 to the present, and a series of papers on mortality rates after hospital admission for myocardial infarction, stroke and diabetes has been published from this linkage (Roberts and Goldacre, 2003; Roberts et al., 2004; Goldacre et al., 2004). Other examples include linking the Pupil Level Annual School Census with the National Pupil Database, the Millennium Cohort Study and the Avon Longitudinal Study of Parents and Children. However, few studies in the UK validate ethnicity codes or supply additional ethnic information in the data; an exception is the linkage between the Scottish 2001 Census and the Scottish NHS Community Health Index used in the Scottish Longitudinal Study (Aspinall and Jacobson, 2007).

In this study, deterministic (exact) matching is used to improve the quality of ethnicity codes in the HES. Unlike other applications of record linkage, which involve different datasets, this deterministic matching relies only on the HES historical data itself. The underlying idea is to restore missing ethnicity codes by linking admissions with valid ethnicity codes to the same person’s historical admissions without valid ethnicity codes, using the person’s unique HESID. The matching rests on two characteristics of the HES data: the unique HESID and historical readmissions. First, each episode in the HES is assigned a HESID, which is generated by matching records for the same patient using a combination of NHS Number, local patient identifier, sex and date of birth. If two episodes have the same HESID, they are believed to belong to the same patient (HESonline). Because it uniquely identifies a patient across all data years, the HESID can act as the unique identifier in the deterministic matching method. Second, the HES is a data warehouse containing all NHS hospital admissions across many years. From 1998/1999, over 12 million records have been added to the HES per year, and from 2003/2004 this number has exceeded 14 million. During these data years, one person may be admitted to hospital more than once, and these admissions can be identified by the patient’s unique HESID. If any of the historical admissions belonging to one person carries a valid ethnicity code, all of that person’s missing ethnicity codes can be replaced by this valid code.
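The restoration step described above can be sketched as follows. This is a minimal illustration, not the procedure actually run on the HES: the record layout, the field names `hesid`, `year` and `ethnicity`, and the use of `None` for a missing or invalid code are all assumptions made for the example. Patients with more than one distinct valid code are deliberately left untouched here.

```python
from collections import defaultdict

# Hypothetical episode records; "hesid" and "ethnicity" are illustrative
# field names, and None stands for a missing or invalid ethnicity code.
episodes = [
    {"hesid": "P1", "year": "1999/00", "ethnicity": None},
    {"hesid": "P1", "year": "2004/05", "ethnicity": "Indian"},
    {"hesid": "P2", "year": "2001/02", "ethnicity": None},
    {"hesid": "P3", "year": "2003/04", "ethnicity": "White"},
    {"hesid": "P3", "year": "2005/06", "ethnicity": None},
]


def restore_ethnicity(episodes):
    """Fill missing codes from the same patient's other episodes.

    Only patients with exactly one distinct valid code are filled; patients
    with conflicting codes are left unchanged in this sketch.
    """
    # Collect the set of valid codes seen for each patient identifier.
    valid_codes = defaultdict(set)
    for ep in episodes:
        if ep["ethnicity"] is not None:
            valid_codes[ep["hesid"]].add(ep["ethnicity"])

    restored = []
    for ep in episodes:
        codes = valid_codes.get(ep["hesid"], set())
        new_ep = dict(ep)  # copy so the input records are not modified
        if ep["ethnicity"] is None and len(codes) == 1:
            new_ep["ethnicity"] = next(iter(codes))
        restored.append(new_ep)
    return restored


restored = restore_ethnicity(episodes)
for ep in restored:
    print(ep["hesid"], ep["year"], ep["ethnicity"])
```

In this toy data the earlier episode of patient P1 and the later episode of patient P3 are filled from the same patient's valid code, while P2, who has no valid code in any episode, stays missing.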

The potential of this method is that, as data quality improves, more and more episodes will carry valid and accurate ethnicity codes, which can be used to restore previously missing ethnicity information. There is evidence that some trusts that previously had the worst ethnicity coding have achieved almost complete coding in subsequent years (Mindell et al., 2007). It is therefore reasonable to believe that not only will the data quality of future years improve, but the quality of ethnicity codes in the historical data will improve as well. In addition, by linking historical admissions across years, it is possible to trace individuals’ vital medical information and events over the life course, such as birth, diagnoses, operations and death. The HES data could itself serve as a longitudinal dataset on health, as long as hospital admissions continue to be recorded.

This method is based on the assumption that each patient has only one valid ethnicity code across the whole of the HES data. This is true for 98.5 per cent of all patients with at least one valid ethnicity code. They have been assigned to the only ethnic group recorded in their historical admissions, which is a true match for their records. However, inconsistent valid ethnicity coding has been observed for some patients. About 1.5 per cent of all patients with at least one valid ethnicity code have been recorded under multiple ethnic groups in their historical admissions, and 98 per cent of these patients have two different valid ethnic groups. There are several possible reasons.

a) Ethnicity classification reason. The 2001 Census for England introduced a new ethnicity classification system. However, HES continued to accept the old codes as well as the new codes for the 2001/02 and 2002/03 data years. There is some inconsistency between the old and new ethnicity coding, particularly for people of mixed ethnicity. As there was no “Mixed” group in the old coding system, people who described themselves as mixed under the new classification in later years had to choose a single ethnic group under the old system, which leaves more than one valid code for them.

b) Organizational or institutional reason. Although it is mandatory for NHS hospital trusts to collect ethnicity information about patients, some trusts did not do this well in the early years. Staff might record a patient’s ethnicity by guessing, or might simply assign the patient to the ‘Other’ ethnic group. There is some evidence that white patients have been coded as ‘Other’ in some trusts (HESonline, 2004b). More recently, as ethnicity monitoring has received more attention, patients who were previously assigned to the ‘Other’ group are more likely to have been assigned to the ethnic group that represents them. These patients therefore usually have two valid ethnic groups in their records: ‘Other’ and another valid ethnicity.

c) Personal reason. Some people find it difficult to describe their ethnicity, especially those with mixed origins. They might sometimes describe themselves as belonging to the Mixed group, and at other times prefer one of their origins. In addition, some people from minority ethnic groups might be reluctant in some cases to describe themselves as belonging to a minority, and so might not provide their true ethnicity.

Given the above possible reasons, criteria based on patients’ historical admissions have been set to assign a ‘most likely’ ethnic group to patients with multiple ethnic groups recorded.

1). If a certain valid ethnic group accounts for more than 80 per cent of all the records with valid ethnicity codes, this person is more likely to belong to this ethnic group.

2). If no single ethnic group accounts for more than 80 per cent of all the records with valid ethnicity codes, and ‘Mixed’ is among the valid ethnic groups, this person is more likely to belong to the ‘Mixed’ population.

3). If no single ethnic group accounts for more than 80 per cent of all the records with valid ethnicity codes, and ‘Other’ is among the valid ethnic groups, this person is more likely to belong to the other valid ethnic group than to the ‘Other’ group. (About 98 per cent of the patients with more than one valid ethnic group have only two different valid ethnic groups.)

4). If a patient’s ethnicity code distribution does not meet any of the above criteria, the most recently recorded valid ethnic group is assigned to this person, since the most recent HES data are generally more accurate and of better quality than earlier data.
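The four criteria can be read as a decision cascade, sketched below under stated assumptions. The function name `most_likely_ethnicity`, the single illustrative labels "Mixed" and "Other", and the chronological ordering of the input list are assumptions made for this example. The text does not say what criterion 3 should do when ‘Other’ appears alongside two or more other groups; this sketch assumes such cases fall through to criterion 4.

```python
from collections import Counter


def most_likely_ethnicity(valid_codes, threshold=0.8):
    """Pick a 'most likely' group from one patient's valid ethnicity codes.

    valid_codes: the patient's valid codes in chronological order (oldest
    first). The labels "Mixed" and "Other" are illustrative placeholders.
    """
    if not valid_codes:
        return None  # no valid code to assign

    counts = Counter(valid_codes)
    total = len(valid_codes)

    # Criterion 1: one group makes up more than 80 per cent of valid records.
    group, n = counts.most_common(1)[0]
    if n / total > threshold:
        return group

    # Criterion 2: otherwise prefer 'Mixed' if it was ever recorded.
    if "Mixed" in counts:
        return "Mixed"

    # Criterion 3: if 'Other' appears alongside exactly one other group,
    # prefer that other group (assumption: ambiguous cases fall through).
    if "Other" in counts:
        others = [g for g in counts if g != "Other"]
        if len(others) == 1:
            return others[0]

    # Criterion 4: fall back to the most recently recorded valid group.
    return valid_codes[-1]


print(most_likely_ethnicity(["White"] * 9 + ["Other"]))
print(most_likely_ethnicity(["White", "Mixed", "White"]))
print(most_likely_ethnicity(["Other", "White"]))
print(most_likely_ethnicity(["White", "Indian"]))
```

The four calls exercise criteria 1 to 4 in turn. Ordering the checks this way means a dominant group always wins before the ‘Mixed’ and ‘Other’ preferences are considered, which matches the order in which the criteria are stated.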

As this record linkage method is based on historical hospital admissions with valid ethnicity codes, it is possible that people who are generally sicker, and who therefore have more historical hospital admissions, are more likely to have been recorded with a valid ethnicity code. The cardiovascular disease admissions with invalid ethnicity codes belonging to these people are thus more likely to be restored with valid ethnicity codes, which might introduce bias. However, it seems reasonable to assume that this will be true of all ethnic groups, and therefore that it will not bias comparisons between ethnic groups.

In document Exploring ethnic inequalities in cardiovascular disease using Hospital Episode Statistics (Page 136-140)