Chapter 3: Data sources
3.7 General Practice Research Database
As mentioned in Chapter 1, section 1.5.7, the GPRD was superseded by the Clinical Practice Research Datalink (CPRD) in April 2012. However, as the GPRD was in operation at the time of the project, this thesis refers to the GPRD and not the CPRD throughout. The GPRD has been used extensively for epidemiological and healthcare research. It is well validated and renowned for its representative coverage of the UK population.
3.7.1 Data coverage
Data provided by the GPRD for this research were from the October 2010 build of the GPRD database. For this time period, the GPRD contained data for 12.1 million patients.130 This number included the up-to-research-standard records of 4.87 million currently actively registered patients, and 5.77 million inactive patients who either died or were transferred out of the participating practice.130
3.7.2 Strengths and weaknesses of GPRD data
3.7.2.1 Strengths of the database
Wide international use of the database with validation for many diseases, conditions and treatments,192-195 including comparison with HES.192
Population coverage - approximately 8% of UK population and over 590 GP practices.
Detailed datasets of clinical and non-clinical data.
Linkage available with disease registries, secondary care and death data.
Data have received “preliminary cleaning” by GPRD to ensure they meet “research standard”.
3.7.2.2 Weaknesses of the database
Only general practices using the Vision computer system can participate in GPRD.130
Voluntary participation by practices with pay incentive (10p per patient per year).130
Limited free linked data available under an MRC licence for academic institutions (correct at the time of application for GPRD data; the MRC licence has since expired and a new arrangement for access to data should be made with the CPRD via the Independent Scientific Advisory Committee, ISAC).130
3.7.3 GPRD for monitoring adverse events
Patient harm associated with drugs or other forms of treatment in general practice has been well investigated using GPRD data.192,195-197 Yet fewer studies have taken advantage of the longitudinal nature of the database to explore non-drug-related AEs.197,198
3.7.4 Dataset for this project
Data were obtained under the Data Linkage Scheme. Integrated hospital admissions data from Hospital Episode Statistics (HES), central mortality data from the Office for National Statistics (ONS) and social deprivation by Index of Multiple Deprivation (IMD) 2007 scores were included in the dataset. It was therefore possible to conduct a detailed exploration of the relationships between potential risk factors, AEs, and other patient outcomes.
3.7.5 Data cleaning
The raw dataset contained records for 100,000 patients who were registered at 584 participating GP practices during the study period (1st January 1999 to 31st December 2008).
Basic cleaning of the dataset removed the records of:
1. Patients with invalid sex field, n=3.
2. Patients missing valid clinical, medical or consultation data, n=404.
3. Patients without valid Read coded fields, n=2,047.
4. Patients missing a registration date or year of birth, or whose first registration date at the GP practice was after the date of their first ever recording in the computer system, n=3.
5. Patients residing outside of England, n=18,328.
6. Patients who did not have any consultations (in any location, with any type of staff) during the study period, n=4,452.
Once cleaned, data for 74,763 patients registered at 457 practices remained. Further cleaning was carried out for the analyses reported in Chapters 6 to 8. The results of this data preparation are reported in the respective chapters. In the next sub-sections, I describe nuances of the dataset that are worthy of note and that had implications for the analyses.
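The exclusion counts listed above can be checked against the reported total with a short tally. The following is a minimal sketch only. It assumes the six exclusions were applied as mutually exclusive steps (no patient counted twice), which is consistent with the reported figures but is not stated explicitly, and the variable and function names are illustrative.

```python
# Exclusion counts as reported in the numbered list above.
RAW_PATIENTS = 100_000
exclusions = [
    ("invalid sex field", 3),
    ("missing valid clinical, medical or consultation data", 404),
    ("no valid Read coded fields", 2_047),
    ("missing or inconsistent registration date or year of birth", 3),
    ("residing outside of England", 18_328),
    ("no consultations during the study period", 4_452),
]


def remaining_after(raw, steps):
    """Return the patients left after subtracting every exclusion count.

    Assumption: the steps are mutually exclusive, so counts simply add up.
    """
    return raw - sum(count for _, count in steps)


remaining = remaining_after(RAW_PATIENTS, exclusions)
print(remaining)  # 74763, matching the cleaned total reported in the text
```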
3.7.6 Registration period
In the GPRD dataset, unique patient identifiers are GP practice-dependent; new patient identifiers are assigned to patients when they join a practice. Thus, it is not possible to track patients who transfer out of one practice and who then register with other practices. Given this artefact of the dataset, only the first registration period of each patient at their current GP practice was included in analyses.
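The restriction to the first registration period can be illustrated as follows. This is a sketch under stated assumptions, not the code used for the thesis: the record layout (patient identifier, start date, end date), the identifiers and the dates are all hypothetical, and "first" is taken to mean the period with the earliest start date.

```python
from datetime import date

# Hypothetical registration records: (patient_id, start, end).
registrations = [
    ("p001", date(2005, 1, 15), date(2008, 12, 31)),
    ("p001", date(2001, 3, 1), date(2003, 6, 30)),
    ("p002", date(1999, 7, 20), date(2008, 12, 31)),
]


def first_registration_periods(records):
    """Keep, for each patient identifier, the period with the earliest start."""
    first = {}
    for patient_id, start, end in records:
        if patient_id not in first or start < first[patient_id][0]:
            first[patient_id] = (start, end)
    return first


first_periods = first_registration_periods(registrations)
for patient_id, (start, end) in sorted(first_periods.items()):
    print(patient_id, start.isoformat(), end.isoformat())
```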
3.7.7 Ethnicity
Data on patients’ ethnic classification were only available through the linked HES data, i.e. only patients who had an admission record also had valid ethnicity data. Consequently, ethnicity was recorded for approximately a quarter of the patients in the original raw dataset (24,307/100,000 patients). Ethnicity data were provided in 13 categories (including a category for "data not entered"). Due to small numbers and to improve consistency when comparing results, I aggregated the ethnicity groups into 6 categories that correspond with the current ethnicity categories used by HES and ONS.189,199
3.7.8 Referrals
The recording of referrals in the GPRD dataset was poor.
3.7.9 Social deprivation
Only 35,207/100,000 patients in the raw GPRD dataset had a valid Index of Multiple Deprivation (IMD) score. A code for missing deprivation status was created for analyses. Deprivation was measured by population-weighted quintiles provided by GPRD and derived from IMD scores.
3.7.10 Data on admissions
Admissions are reliably recorded in English general practice and have been used in prior primary care studies.197,200 Nevertheless, completeness of admission information can be improved by linkage with secondary care data. In the GPRD dataset, diagnoses on admission and date of admission from linked HES data improved the accuracy of estimates on hospitalisations associated with safety incidents occurring in non-acute care.
3.7.11 Recording of death
As with the linked admissions data, the causes and date of death provided through linked ONS data enabled more accurate estimates of patient outcomes that occur after AEs. Within the core GPRD dataset, death data are reportedly well recorded and derived using an in-house algorithm.201 The linked ONS central mortality data are extracted mainly from death certificates.202 During cleaning of the dataset, I discovered that the GPRD and ONS death fields in the obtained dataset did not fully match. However, the discrepancies were few in number. For example, 30 records with valid ONS death data were missing the date and causes of death in the corresponding GPRD fields. Nevertheless, these records did contain a date of death from HES or ONS data. There were also 8 records where the date of death in the GPRD and ONS fields did not match. The difference in the recorded date of death ranged between 1 and 40 days, with the date in the ONS-derived field preceding the date in the GPRD field. These differences may be attributed to variation in data processing between GP practices participating in the GPRD and the ONS, with the ONS providing absolute recording of deaths.
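A comparison of this kind can be sketched as below. The records, identifiers and field names are hypothetical and chosen only to reproduce the range of discrepancies described (1 to 40 days, ONS date first); this is an illustration of the check, not the procedure actually used.

```python
from datetime import date


def death_date_gap(gprd_date, ons_date):
    """Days by which the GPRD date follows the ONS date; None if either is missing."""
    if gprd_date is None or ons_date is None:
        return None
    return (gprd_date - ons_date).days


# Hypothetical records with a death date from each source.
records = [
    {"id": "a", "gprd": date(2006, 5, 11), "ons": date(2006, 5, 10)},
    {"id": "b", "gprd": date(2007, 2, 9), "ons": date(2006, 12, 31)},
    {"id": "c", "gprd": None, "ons": date(2005, 8, 2)},
    {"id": "d", "gprd": date(2004, 1, 1), "ons": date(2004, 1, 1)},
]

mismatched = {}    # records where both dates exist but differ
missing_gprd = []  # records with an ONS date but no GPRD date
for record in records:
    gap = death_date_gap(record["gprd"], record["ons"])
    if gap is None:
        if record["gprd"] is None and record["ons"] is not None:
            missing_gprd.append(record["id"])
    elif gap != 0:
        mismatched[record["id"]] = gap

print(mismatched)    # {'a': 1, 'b': 40}
print(missing_gprd)  # ['c']
```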
3.7.12 Data fields not used
A variable for life events was derived for each patient based on whether there had ever been Read codes indicating divorce, bereavement, homelessness or unemployment in their records. Place of residence was also derived from the “Residence Types” code in the GPRD data, which was used to generate a binary flag to indicate whether patients lived alone. Data were too poorly populated for all three variables for them to be included in the analyses.
3.8 Other data sources
Together with the three main datasets, further data were obtained from several publicly available data sources. I now describe each of the additional datasets in relation to the analyses conducted.
3.8.1 Index of Multiple Deprivation (IMD), 2007
The Index of Multiple Deprivation (IMD) measures socio-economic deprivation across seven domains: “income deprivation”, “employment deprivation”, “health deprivation and disability”, “education, skills, and training deprivation”, “barriers to housing and services”, “living environment deprivation” and “crime”.203 Higher IMD scores indicate greater deprivation.
In the GPRD dataset, deprivation scores for patients derived using the IMD for 2007 were provided and applied in Chapters 6 to 8. For analyses using the HES standalone dataset (Chapter 8), the IMD scores for patients’ place of residence and GP practices were mapped by postcode. IMD scores by postcode had previously been created by a colleague at DFU, whereby IMD scores by Lower Super Output Area (LSOA) were mapped to postcodes using a postcode to geography level lookup table.
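The two-step mapping from LSOA-level scores to postcodes can be sketched as a simple lookup join. All values below are made up for illustration (the postcodes, LSOA codes and scores are not real data), and the function name is hypothetical; the actual lookup was prepared by a colleague and its implementation is not described here.

```python
# Hypothetical IMD 2007 scores keyed by LSOA code.
imd_by_lsoa = {"E01000001": 12.4, "E01000002": 35.9}

# Hypothetical postcode-to-geography lookup (postcode -> LSOA code).
lsoa_by_postcode = {
    "ZZ1 1AA": "E01000001",
    "ZZ2 2BB": "E01000002",
    "ZZ3 3CC": "E01099999",  # LSOA with no score in the table above
}


def imd_by_postcode(postcode_lookup, lsoa_scores):
    """Attach an IMD score to each postcode via its LSOA; None when unmatched."""
    return {
        postcode: lsoa_scores.get(lsoa)
        for postcode, lsoa in postcode_lookup.items()
    }


scores = imd_by_postcode(lsoa_by_postcode, imd_by_lsoa)
unmatched = sorted(pc for pc, score in scores.items() if score is None)
print(scores["ZZ1 1AA"], unmatched)
```

Returning None for unmatched postcodes, rather than dropping them, keeps missing deprivation visible so that it can be coded explicitly in later analyses.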
3.8.2 NHS Information Centre
In Chapter 7 (Emergency admissions for diabetic hyperglycaemic emergencies), comparisons were made between the study results and nationally reported data on admission rates. These national data were obtained from the NHS Information Centre (in conjunction with QRESEARCH) and the National Diabetes Audit (NDA).204-206 In Chapter 8, I use data on the number of full-time equivalent (FTE) GPs in 2010, excluding GP retainers and registrars. These data were previously obtained from the NHS Information Centre by a colleague in PCPH for departmental use.207 These data were available by age group, sex and country of primary medical qualification.
3.8.3 National Statistics Postcode Directory (NSPD)
For the analyses in Chapter 8, the rural/urban classifications for patients’ place of residence and GP practices were defined using the 2010 National Statistics Postcode Directory (NSPD), from the ONS.208 Classifications were available at the LSOA level, which were then mapped to the corresponding postcodes of patients’ homes and GP practices using the online GeoConvert tool from the Census Dissemination Unit at the University of Manchester.209 The three categories used were:
Urban >10K;
Town and fringe and village; and
Hamlet and isolated dwellings.
3.8.4 Quality and Outcomes Framework data
The Quality and Outcomes Framework (QOF) was first implemented in England in the 2004/05 financial year. This is a voluntary performance-related payment system for NHS GP practices.210 It enables comparisons to be made on the quality and delivery of health services, using points-based indicators within four domains (clinical, organisational, patient experience and additional services).210 Higher scores indicate better performance, with a maximum attainable score per practice of 1,000 points. Annual results are publicly available at national, local and practice levels.210 QOF data for the most recent financial year available (2010/11) contain data on 134 indicators, with data collected from 8,245 GP practices for over 55 million patients in England (99.7% of registered patients).211
In Chapter 8, five QOF measures were mapped to practices using the unique identifier code assigned to each practice. The overall practice performance, two cancer indicators and two patient experience of access indicators were assessed by averaging each indicator score over the years of the study (the patient experience indicators were only available for the latter two years of the study period). These data were downloaded from the NHS Information Centre for Health and Social Care’s website for the three years covering 1st April 2007 to 31st March 2010 (from 1st April 2008 for the two patient experience measures).212
The cancer indicators were:
CANCER 01 – “register of patients with a diagnosis of cancer excluding non-melanotic skin cancers from 1st April 2003”, and
CANCER 03 – “percentage of patients with cancer who have been diagnosed within the last 18 months and have had a patient review recorded as occurring within 6 months of the practice receiving confirmation of the diagnosis”.212
The patient experience of access indicators were:
Patient Experience 07 – “percentage of patients who were able to obtain a consultation with a GP within 2 working days”, and
Patient Experience 08 – “percentage of patients who were able to book an appointment with a GP more than 2 days ahead”.212
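The averaging of each QOF measure over the available years, described above, can be sketched as follows. The practice code, indicator labels and scores are hypothetical, and the data layout is an assumption made for illustration. The point shown is that each measure is averaged over only the years for which it was available (three years for overall performance, two for the patient experience measures).

```python
from statistics import mean

# Hypothetical scores keyed by (practice code, measure), then financial year.
qof_scores = {
    ("P0001", "overall"): {"2007/08": 950.0, "2008/09": 970.0, "2009/10": 990.0},
    ("P0001", "PE07"): {"2008/09": 80.0, "2009/10": 90.0},
}


def average_scores(scores):
    """Average each practice-level measure over the years it was recorded."""
    return {key: mean(by_year.values()) for key, by_year in scores.items()}


averages = average_scores(qof_scores)
print(averages[("P0001", "overall")])  # 970.0
print(averages[("P0001", "PE07")])     # 85.0
```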