Chapter 3: Data sources and variable definitions
3.1. Data sources
This study uses data from two electronic health record (EHR) databases: the UK Clinical Practice Research Datalink (CPRD) and the Hospital Episode Statistics (HES) database.
3.1.1. The Clinical Practice Research Datalink
3.1.1.1. Overview
CPRD is a large computerised database of anonymised patient records from UK primary care. At the time this research was carried out, over 600 general practices contributed data on more than 12 million patients, of whom over 5 million were currently registered. Approximately 7% of the UK population was represented, making CPRD one of the largest sources of electronic primary care data for research in the world. Enrolled practices use a specific information technology (IT) system, called Vision, which records information as coded and free-text data. Practices agree to participate; however, individual patients may opt out upon request. Over 98% of the UK population is registered at a general practice, and studies have shown that patients in CPRD are broadly representative of the UK population in terms of age, sex and ethnicity.85
The general practitioner (GP) in the UK acts as the gatekeeper of primary care and specialist referrals; the majority of patients therefore seek care initially from their GP for health-related issues. This system results in a rich source of patient-level health data, including all consultations, diagnoses, prescriptions, tests, immunisations, referrals to hospital and hospitalisations. Unlike many administrative databases, particularly those collected for insurance purposes, CPRD also contains some lifestyle and anthropometric data, such as smoking behaviour and body mass index (BMI).
3.1.1.2. Data structure and coding
CPRD release new database builds on a monthly basis. For this thesis the January 2012 build was used, except for the descriptive study of antiviral use (chapter 5), in which the June 2011 build was used. The data are split into several files; Table 1 below describes the main CPRD file types used for this study, along with their contents. The patient-level files can be linked using a unique patient identifier, present in each file. The last three characters of this identifier also constitute a unique practice identifier, for linkage to the practice-level data.
Table 1: Description and contents of file types available and utilised in this thesis
File type | File contents
Patient | Patient-level demographic details, including: year of birth, gender, registration status (acceptable/unacceptable), death date, transfer out date.
Practice | Practice-level data, including: geographical region, 'up to standard' date (the date from which CPRD have classed the practice data as being of sufficient quality for research), last data collection date (for the practice).
Consultation | Patient-level data on consultations with the GP: date of consultation, type of consultation and duration of consultation.
Clinical | Patient-level data on clinical events, including: date of clinical event, diagnosis given or symptom recorded.
Additional Clinical Details | Additional detail regarding clinical events, such as number of cigarettes smoked per day and test results. The file is split by entity type, each of which relates to a specific type of data. There are 460 different entity types in total; for example, entity type one records information on blood pressure.
Referral | Patient-level data on referrals to specialist services, including: date of referral, diagnosis given, method of referral, referral specialty, urgency of referral.
Therapy | Patient-level data on drug prescriptions and apparatus, including: date of prescription, CPRD product code for the prescription.
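The linkage between patient-level and practice-level files described above can be sketched as follows. This is a minimal illustration only: the field names (`patid`, `region`) and the identifier values are hypothetical, and identifiers are assumed to be held as strings so that any leading zeros in the three-character practice suffix are preserved.

```python
from typing import Dict, List


def practice_id(patient_id: str) -> str:
    """Return the practice identifier: the last three characters of the patient identifier."""
    if len(patient_id) <= 3:
        raise ValueError("patient identifier too short to contain a practice suffix")
    return patient_id[-3:]


def attach_practice(patients: List[dict], practices: Dict[str, dict]) -> List[dict]:
    """Join each patient record to its practice record via the derived practice identifier."""
    linked = []
    for patient in patients:
        pid = practice_id(patient["patid"])
        # Practices missing from the practice file are reported as None rather than dropped.
        linked.append({**patient, "pracid": pid, "practice": practices.get(pid)})
    return linked


# Hypothetical example records (not real CPRD data).
patients = [{"patid": "1001042"}, {"patid": "2057007"}]
practices = {"042": {"region": "Region A"}, "007": {"region": "Region B"}}
linked = attach_practice(patients, practices)
```

Deriving the practice identifier from the patient identifier, rather than storing it separately, keeps the patient-level files self-describing for linkage purposes.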
GPs in the UK record medical and non-medical events using the hierarchical Read code classification system.86 This system covers a range of areas including symptoms, diagnoses and administrative processes. The Read code hierarchy is organised into chapters and subchapters, with the initial characters of a code representing high-level categories and subsequent characters specifying further detail on the event. Prescriptions are recorded using the Multilex product dictionary, and include pharmaceuticals, drug appliances and devices. CPRD have translated these Read and Multilex codes into medical and product codes respectively, and created dictionaries that can be easily searched. For this thesis, version 1.3.2 of the dictionaries was used.
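Because the hierarchy is encoded in the leading characters of a code, a code list for a condition can be assembled by prefix matching. The sketch below illustrates the idea under two assumptions that are not stated in the text: codes are fixed-length strings padded on the right with "." characters, and the example codes are illustrative placeholders with no clinical meaning attached here.

```python
from typing import Iterable, List


def is_descendant(code: str, ancestor: str, pad: str = ".") -> bool:
    """True if `code` sits below `ancestor` in a prefix-based hierarchy."""
    stem = ancestor.rstrip(pad)  # drop padding to obtain the meaningful prefix
    return code != ancestor and code.startswith(stem)


def codes_under(ancestor: str, dictionary: Iterable[str]) -> List[str]:
    """Return the ancestor code itself plus every more detailed code beneath it."""
    return [c for c in dictionary if c == ancestor or is_descendant(c, ancestor)]


# Placeholder dictionary entries, for illustration only.
dictionary = ["H3...", "H33..", "H330.", "H3300", "H34..", "G30.."]
selected = codes_under("H33..", dictionary)
```

In practice a prefix search of this kind would only produce a candidate list, which would then need clinical review before use.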
Defining follow-up for individual patients
Although records go back many years, CPRD recommend starting follow-up at the latest of the practice ‘up to standard’ date and the date the patient first registered at the practice. They also recommend ending follow-up at the earliest of the following: the date the patient died, the date the patient transferred out of the practice, or the date data were last collected from the practice.
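The follow-up rule above reduces to a maximum over the start dates and a minimum over the end dates. The following is a minimal sketch of that rule; the function and parameter names are hypothetical, and returning `None` when the window is empty (end before start) is an assumption made here rather than a documented CPRD convention.

```python
from datetime import date
from typing import Optional, Tuple


def follow_up_window(
    uts_date: date,
    registration_date: date,
    last_collection_date: date,
    death_date: Optional[date] = None,
    transfer_out_date: Optional[date] = None,
) -> Optional[Tuple[date, date]]:
    """Return (start, end) of follow-up, or None if the patient has no eligible follow-up."""
    # Start: latest of the practice 'up to standard' date and the patient's registration date.
    start = max(uts_date, registration_date)
    # End: earliest of death, transfer out and last collection; unrecorded dates are ignored.
    candidates = [death_date, transfer_out_date, last_collection_date]
    end = min(d for d in candidates if d is not None)
    if end < start:
        return None  # assumption: an empty window means the patient contributes no follow-up
    return start, end


# Hypothetical patient: registered after the practice became up to standard, later transferred out.
window = follow_up_window(
    uts_date=date(2000, 1, 1),
    registration_date=date(2003, 6, 15),
    last_collection_date=date(2012, 1, 10),
    transfer_out_date=date(2010, 3, 31),
)
```

For this hypothetical patient, follow-up would run from the registration date to the transfer out date.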
3.1.1.3. Data Quality and Validity of Information
CPRD data undergo checks before release to ensure that they meet certain standards.87 These checks occur at the practice and patient level. Practices are assessed and labelled ‘up to standard’ when the practice is considered to have continuous, high-quality data that are fit for research. The practice must meet various criteria, such as a minimum referral rate per 100 patients. Patient-level data are marked as “acceptable” for research where certain criteria are met, such as: 1) age at end of follow-up is below 115 years; and 2) year of birth is recorded.
Guidance documents for practices contributing to CPRD ask GPs to record various aspects of a patient’s medical details, including: all significant clinical events from a patient’s history at registration and as they occur, indications for prescriptions, all known hospitalisations and cause of death. However, data completeness may vary substantially over time, by population or by type of data.
Recording of certain data types has been improved by the introduction of the Quality and Outcomes Framework (QoF), which encourages recording of key data items through an incentivised payment programme for GPs.88 QoF was introduced in 2004 and sets out a series of data fields for collection, such as the BMI status of diabetes patients and the delivery of services to patients with severe mental health conditions. The completeness of data for factors included in QoF increased following the introduction of this programme.
Diagnostic validity is generally considered to be an advantage of CPRD. A systematic review of studies validating a variety of disease diagnoses found that the median proportion of CPRD-defined cases with a confirmed diagnosis (the positive predictive value (PPV) of a recorded diagnosis) was 89%.89 However, the PPVs ranged from 24% to 100% for individual diseases, and the review acknowledged that validation studies were limited by their size and were frequently restricted to specific populations. Furthermore, the same systematic review noted that negative predictive values (NPV) are very rarely assessed in CPRD validation studies, due to the financial implications of sampling a vast number of patients without the diagnostic codes of interest. The lack of information on the NPV is an acknowledged weakness of CPRD data.
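For reference, the two predictive values discussed above can be computed from validation counts as sketched below. The function names and the example counts are hypothetical and are not taken from the review; the formulas are the standard definitions, PPV = TP / (TP + FP) and NPV = TN / (TN + FN).

```python
def ppv(true_positives: int, false_positives: int) -> float:
    """Proportion of patients with the diagnostic code whose diagnosis is confirmed."""
    coded = true_positives + false_positives
    if coded == 0:
        raise ValueError("no patients with the diagnostic code were sampled")
    return true_positives / coded


def npv(true_negatives: int, false_negatives: int) -> float:
    """Proportion of patients without the diagnostic code who truly do not have the disease."""
    uncoded = true_negatives + false_negatives
    if uncoded == 0:
        raise ValueError("no patients without the diagnostic code were sampled")
    return true_negatives / uncoded


# Hypothetical validation sample: 100 coded patients, of whom 89 are confirmed.
example_ppv = ppv(true_positives=89, false_positives=11)
```

The NPV calculation requires a sample of patients without the code, which is the costly step that validation studies rarely undertake.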
3.1.2. Linked Hospital Episode Statistics
3.1.2.1. Overview
A subset of English CPRD practices participates in a linkage scheme, in which related datasets are linked to CPRD. One of these datasets is HES, a secondary care database of hospital admissions from all National Health Service (NHS) trusts in England. Patients in CPRD are linked to HES data using deterministic matching (where all or some identifiers are required to match exactly) on a combination of the patient’s NHS number, gender and partial date of birth. The research in this thesis uses hospitalisation data from April 1997 to March 2012, during which time 375 of 497 (75%) English practices participated in the linkage scheme. Linked HES data contain information on inpatient admissions only (limited outpatient data did not become available until September 2014). HES contains comprehensive diagnostic information; however, prescription data are not currently available.
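Deterministic matching of this kind can be sketched as an exact join on a composite key. The example below is illustrative only and is not the linkage algorithm actually used: it assumes the strictest rule (all three identifiers must match exactly), represents the partial date of birth as a year-month string, accepts only unambiguous one-to-one matches, and uses hypothetical field names and made-up identifier values.

```python
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str, str]


def link_key(record: dict) -> Key:
    """Composite key: NHS number, gender and partial date of birth."""
    return (record["nhs_number"], record["gender"], record["dob_partial"])


def deterministic_link(cprd: Iterable[dict], hes: Iterable[dict]) -> Dict[str, str]:
    """Map CPRD patient identifiers to HES identifiers where the key matches exactly."""
    hes_index: Dict[Key, List[str]] = {}
    for rec in hes:
        hes_index.setdefault(link_key(rec), []).append(rec["hes_id"])
    links: Dict[str, str] = {}
    for rec in cprd:
        candidates = hes_index.get(link_key(rec), [])
        if len(candidates) == 1:  # assumption: ambiguous matches are left unlinked
            links[rec["patid"]] = candidates[0]
    return links


# Made-up records for illustration.
cprd = [
    {"patid": "1001042", "nhs_number": "NHS-A", "gender": "F", "dob_partial": "1950-03"},
    {"patid": "2057007", "nhs_number": "NHS-B", "gender": "M", "dob_partial": "1962-11"},
]
hes = [
    {"hes_id": "H1", "nhs_number": "NHS-A", "gender": "F", "dob_partial": "1950-03"},
    {"hes_id": "H2", "nhs_number": "NHS-B", "gender": "M", "dob_partial": "1962-12"},
]
links = deterministic_link(cprd, hes)
```

Here only the first patient links, because the second differs on partial date of birth; a rule that required only some identifiers to match would treat that case differently.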
3.1.2.2. Data structure and coding
The structure of HES data is shown in Figure 1 below. HES data are divided into “hospitalisations”, each of which relates to a stay in hospital. Each hospitalisation comprises one or more “episodes”, an episode being a period during which a patient is under the care of a particular consultant. Within an episode, a patient has a primary diagnosis and up to 20 further secondary diagnoses; the primary diagnosis field usually relates to the reason the patient was admitted. The data are provided to researchers as files relating to hospitalisations or episodes.
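The nesting of hospitalisations, episodes and diagnosis fields can be represented with two small record types. This is a sketch of the structure as described in the text, not of the actual HES file layout: the class and field names are hypothetical, the diagnosis values are placeholders, and treating the first episode's primary diagnosis as the admission reason reflects the "usually" in the description rather than a guarantee.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List

MAX_SECONDARY = 20  # upper limit on secondary diagnoses, as stated in the text


@dataclass
class Episode:
    """A period under the care of one consultant."""
    start: date
    end: date
    primary_diagnosis: str
    secondary_diagnoses: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if len(self.secondary_diagnoses) > MAX_SECONDARY:
            raise ValueError("too many secondary diagnoses for one episode")


@dataclass
class Hospitalisation:
    """A stay in hospital, made up of one or more episodes."""
    patient_id: str
    episodes: List[Episode]

    def __post_init__(self) -> None:
        if not self.episodes:
            raise ValueError("a hospitalisation needs at least one episode")
        self.episodes.sort(key=lambda e: e.start)  # keep episodes in time order

    def admission_diagnosis(self) -> str:
        """Primary diagnosis of the first episode (usually the reason for admission)."""
        return self.episodes[0].primary_diagnosis


stay = Hospitalisation(
    patient_id="1001042",
    episodes=[
        Episode(date(2010, 1, 5), date(2010, 1, 9), "CODE_B", ["CODE_C"]),
        Episode(date(2010, 1, 2), date(2010, 1, 5), "CODE_A"),
    ],
)
```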
Clinical diagnoses in HES data are coded using the International Classification of Diseases, tenth revision (ICD-10) system, developed by the World Health Organisation (WHO). It is organised into chapters, themed on particular medical areas, covering diagnoses and procedures. ICD-10 codes are made up of 6 or 7 digits; the first three digits indicate the medical category, the next three give information on the location, severity or aetiology, and the 7th digit is optional.
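Figure 1: HES data structure
[Figure: a timeline (axis labelled “Time”) of hospitalisations, each divided into consecutive episodes (Episode 1, Episode 2, ...); each episode has a 1° diagnosis field (usually the reason the patient was admitted) and 2° to 20° diagnosis fields.]
Note: “hospitalisations” relate to a stay in hospital and “episodes” relate to a time period for which a patient is under the care of a particular consultant.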
3.1.2.3. Data quality
HES data are collected during a patient’s hospital stay, and are processed to allow hospitals to be paid for the care they deliver. HES data are also designed to enable secondary use, that is, use for non-clinical purposes such as research. Trained clinical coders input ICD-10 codes from unstructured, hand-written clinical notes. Data input by clinical coders are then sent to a data warehouse. At pre-arranged time-points in the year, HES extracts a copy of the data and carries out validation and data cleaning. Each variable is subject to a set of cleaning rules; for example, each patient’s date of birth must lie between 1/1/1885 and the last day of the period being processed, and where invalid codes appear, they are overwritten with the code for “Unknown”.90
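The two cleaning rules mentioned above can be illustrated with a short sketch. This is not the HES cleaning code itself: the function names are hypothetical, the bounds on date of birth are assumed to be inclusive, and the `UNKNOWN` sentinel is a placeholder for whatever value the real system uses for “Unknown”.

```python
from datetime import date
from typing import Optional, Set

EARLIEST_DOB = date(1885, 1, 1)  # lower bound given in the text
UNKNOWN = "UNKNOWN"              # placeholder sentinel, not the real HES value


def dob_is_valid(dob: Optional[date], period_end: date) -> bool:
    """Date of birth must lie between 1/1/1885 and the last day of the processing period."""
    return dob is not None and EARLIEST_DOB <= dob <= period_end


def clean_code(code: str, valid_codes: Set[str]) -> str:
    """Overwrite any code that is not in the valid set with the 'Unknown' sentinel."""
    return code if code in valid_codes else UNKNOWN


# Hypothetical processing period ending 31 March 2012.
period_end = date(2012, 3, 31)
checks = [
    dob_is_valid(date(1950, 3, 14), period_end),  # plausible date of birth
    dob_is_valid(date(1884, 12, 31), period_end),  # before the lower bound
    dob_is_valid(date(2012, 4, 1), period_end),    # after the period end
]
```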
Errors and omissions within HES data are acknowledged to occur;91,92 in 2013/14 an audit of 8,990 episodes of care from 50 NHS trusts compared case notes to clinical codes, and estimated the error rate of clinical codes in admitted patient care data at 10.8%.93 The errors may be entirely incorrect codes, or codes lacking detail of the clinical event.94 The main reasons for errors are thought to be incomplete paper records and the lack of involvement from front-line clinicians in the coding process.91 The impact of this level of coding inaccuracy on epidemiological research is difficult to quantify and is likely to vary according to the hospital specialty and the study question. Despite this, data quality is improving over time and is considered sufficiently robust for health research.95,96