

4.3.12 Variables created

A number of additional variables were created during the data work-up stage; these are described below.

4.3.12.1 Patient index date

For patients with thrombocytosis, the index date was the date of their first raised platelet count within the study timeframe. This was supplied as a variable in the raw data file from the CPRD. For patients with a normal platelet count, using the date of their first normal platelet count within the study timeframe as the index date could have introduced bias if it fell a long time before or after the index date of their matched thrombocytosis counterpart. Therefore, the index date assigned to each patient with a normal platelet count was the date of their platelet count nearest in time to the index date of their matched thrombocytosis counterpart. Because this index date was sometimes later than the date used to define patients for inclusion in the cohort, 374 (3.7%) of these ‘normal’ patients actually had thrombocytosis at their index date and were excluded.
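The assignment of index dates to normal platelet count patients can be illustrated with a minimal Python sketch. This is not the code used in the study, where the data work-up was carried out in Stata. The function name, the example dates, and the rule for breaking ties (the earlier date is chosen) are assumptions made for illustration only.

```python
from datetime import date


def nearest_test_date(test_dates, matched_index_date):
    """Return the platelet test date closest in time to the matched index date.

    Ties are broken in favour of the earlier date; this tie-break rule is an
    assumption and is not specified in the text.
    """
    if not test_dates:
        raise ValueError("patient has no platelet counts in the study timeframe")
    return min(test_dates, key=lambda d: (abs((d - matched_index_date).days), d))


# Hypothetical platelet test dates for one normal platelet count patient.
normal_patient_tests = [date(2005, 3, 1), date(2006, 7, 15), date(2008, 1, 10)]
# Hypothetical index date of the matched thrombocytosis patient.
matched_case_index = date(2006, 9, 1)

index_date = nearest_test_date(normal_patient_tests, matched_case_index)
print(index_date)
```

A patient whose platelet count at the selected date met the definition of thrombocytosis would then be excluded, as described above.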

The two cohorts were age matched: the patients in each matched pair were the same age on the index date of the thrombocytosis patient. Because the index date of each normal platelet count patient was selected separately, it could differ from that of their matched counterpart, which may have resulted in a difference in median age at index date between the two cohorts. Bias may have been introduced to the study if the index dates of the normal platelet count patients were much later in time (and consequently, when the patients were older) than those of the thrombocytosis patients. The time difference between the index dates of thrombocytosis and normal platelet count patients was therefore estimated in the analysis.

4. Thrombocytosis as an early marker of cancer

4.3.12.2 Date variables

Stata deals with dates numerically by converting all day-month-year formatted dates into a number: the number of days that have passed since 1st January 1960. In the raw data provided by the CPRD, date variables are given as the number of days that have passed since the patient registered with their practice. As patients have been registered for different lengths of time (some have been registered with the same practice since birth whereas others have moved recently), this coding format made comparisons between patients and variables difficult. Dealing with dates was further complicated by the fact that cancer registry dates are given as a month and year only, to protect anonymity, so the first of the month was arbitrarily assigned to all diagnosis dates. To achieve consistency across all patients and data sources, new date variables were created for index date, all cancer diagnosis dates, and the date of any other symptoms. All dates were recorded as the number of days that had passed since 1st January 1900. This enabled easy comparison of dates.
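The date harmonisation described above can be sketched in Python as follows. The sketch is illustrative rather than a reproduction of the Stata code used. It assumes that each patient's registration date is available as a calendar date, which is needed to convert a 'days since registration' value; all function names are hypothetical.

```python
from datetime import date, timedelta

# Reference date stated in the text for the harmonised date variables.
EPOCH = date(1900, 1, 1)


def days_since_epoch(d):
    """Number of days that have passed between the reference date and d."""
    return (d - EPOCH).days


def from_registration_offset(registration_date, days_after_registration):
    """Convert a 'days since registration' value to days since the reference date.

    Assumes the registration date itself is known as a calendar date.
    """
    event_date = registration_date + timedelta(days=days_after_registration)
    return days_since_epoch(event_date)


def from_registry_month_year(year, month):
    """Registry dates carry no day, so the first of the month is assigned."""
    return days_since_epoch(date(year, month, 1))


# Hypothetical example: an event 10 days after registration on 1 January 2000,
# and a registry diagnosis recorded as June 2005.
print(from_registration_offset(date(2000, 1, 1), 10))
print(from_registry_month_year(2005, 6))
```

Once every date is expressed on the same scale, differences between events reduce to simple subtraction.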

Most patients diagnosed with cancer had a record of this in both the CPRD and the cancer registry, and in this case the first recorded date was taken as the date of diagnosis. Some patients had cancer recorded in the CPRD but not in the cancer registry, and vice versa. For these patients, the date of the record that was present was taken as the date of diagnosis. An exploration and comparison of cancer recording in the CPRD and the cancer registry is presented in Chapter 5, including an analysis of cases recorded in one source and missing from the other.
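The rule for choosing a diagnosis date from the two sources can be expressed as a short Python sketch. The function name and the use of `None` to represent a missing record are assumptions for illustration; dates are taken to be on a common 'days since' scale.

```python
def diagnosis_day(cprd_day, registry_day):
    """Choose the diagnosis date from the two sources.

    Both present: the earlier date is used.
    One present: that date is used.
    Neither present: None (no cancer diagnosis recorded).
    """
    recorded = [d for d in (cprd_day, registry_day) if d is not None]
    return min(recorded) if recorded else None


# Hypothetical values on a common 'days since' scale.
print(diagnosis_day(38500, 38473))  # both sources present
print(diagnosis_day(38500, None))   # CPRD only
print(diagnosis_day(None, None))    # no record in either source
```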

4.3.12.3 Age variables

To protect anonymity, only the month and year of birth are provided by the CPRD. The first of the month was assigned as the day of birth for all patients. Date of birth was converted to the number of days that had passed since 1st January 1900, and then subtracted from the ‘days passed since’ variables for index date, diagnosis date, and the date of other recorded symptoms to determine age at each of these dates. A categorical age variable was also created with 10-year age brackets. Initially this included 40-49 years, 50-59 years, 60-69 years, 70-79 years, 80-89 years, and 90 and over, but there were too few people in the final category, so a single 80 and over group was used instead.
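The age calculation and banding can be sketched in Python as below. The division by 365.25 to convert days to years and the handling of ages under 40 (returned as `None`) are assumptions; the text does not state how either was handled. Function names are hypothetical.

```python
from datetime import date

EPOCH = date(1900, 1, 1)  # reference date stated in the text


def age_in_years(birth_year, birth_month, event_day):
    """Age at an event, given the event as days since the reference date.

    The day of birth is set to the first of the month, as in the text.
    Dividing by 365.25 to obtain years is an assumption.
    """
    birth_day = (date(birth_year, birth_month, 1) - EPOCH).days
    return (event_day - birth_day) / 365.25


def age_band(age_years):
    """Assign the final 10-year age bands, with a single '80 and over' group."""
    if age_years < 40:
        return None  # assumption: the bands described in the text start at 40
    if age_years >= 80:
        return "80 and over"
    lower = int(age_years // 10) * 10
    return f"{lower}-{lower + 9}"


# Hypothetical patient born in June 1950 with an index date of 1 June 2010.
index_day = (date(2010, 6, 1) - EPOCH).days
age = age_in_years(1950, 6, index_day)
print(round(age, 1), age_band(age))
```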

4.3.12.4 Smoking status

Smoking status and behaviour are recorded using a number of variables in the CPRD, and evidence of the relationship between smoking status and platelet count is mixed (Butkiewicz et al., 2006; Green et al., 1992; Sloan et al., 2015; Suwansaksri et al., 2004). Smoking status is defined in the CPRD as current, past, or never. Additional variables record the type of smoking (pipe, cigarette etc.) and the number or amount smoked per day. A binary ‘ever smoked’ variable (with a ‘missing’ option) was created from these raw variables. Patients with a current or past smoking code were coded as 1 (ever smoked) and patients who had never smoked were coded as 0. Where smoking status was missing for a patient but data were available on their smoking habits (type or frequency), the patient was classified as having ever smoked. Only patients who were coded as having never smoked in the raw CPRD smoking variable were classed as having never smoked.
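The derivation of the ‘ever smoked’ variable can be summarised in a minimal Python sketch. The function name, the string labels for smoking status, and the use of `None` for missing values are assumptions for illustration and do not reflect the coding of the raw CPRD variables.

```python
def ever_smoked(status, has_habit_data=False):
    """Derive the binary 'ever smoked' variable.

    status: 'current', 'past', 'never', or None when smoking status is missing.
    has_habit_data: True if type or frequency of smoking is recorded.
    Returns 1 (ever smoked), 0 (never smoked), or None (missing).
    """
    if status in ("current", "past"):
        return 1
    if status == "never":
        # Only an explicit 'never' code counts as never smoked.
        return 0
    if status is None and has_habit_data:
        # Missing status but recorded smoking habits: classed as ever smoked.
        return 1
    return None


print(ever_smoked("past"))
print(ever_smoked("never"))
print(ever_smoked(None, has_habit_data=True))
print(ever_smoked(None))
```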

4.3.12.5 Patient symptoms

One of the limitations of this study is that the reasons for ordering patients' blood tests are unknown. Blood tests are ordered in general practice for a variety of reasons: in response to symptoms, or in asymptomatic patients as part of routine monitoring or health checks. If cancer was already suspected in patients with thrombocytosis when blood tests were ordered, then the usefulness of thrombocytosis as a clinical prompt of suspected cancer is limited. In an attempt to address this, the symptoms reported by all patients in the month prior to their index date were compared for those with thrombocytosis and those with a normal platelet count. To do this, patients’ electronic records were searched for all medcodes recorded in the 28 days before the index date blood test. The 100 most common medcodes in each group were listed. After excluding ‘organisational’ codes, the 10 most common symptom codes in the two groups were compared. Eight of these appeared in both cohorts.
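The steps used to compare recorded symptoms can be sketched in Python as follows. The record layout, the function name, the code labels, and the decision to count codes recorded on the index date itself are all assumptions made for illustration; real medcodes are numeric identifiers.

```python
from collections import Counter


def top_symptom_codes(records, index_days, organisational,
                      window_days=28, shortlist=100, keep=10):
    """List the most common symptom codes recorded before the index date.

    records: iterable of (patient_id, medcode, event_day) tuples.
    index_days: dict mapping patient_id to index date (same 'days since' scale).
    organisational: set of codes to exclude as non-symptom codes.
    """
    counts = Counter()
    for patient_id, medcode, event_day in records:
        index_day = index_days.get(patient_id)
        if index_day is None:
            continue
        days_before = index_day - event_day
        # Assumption: the index date itself (0 days before) is included.
        if 0 <= days_before <= window_days:
            counts[medcode] += 1
    # Shortlist the most common codes, then drop organisational codes.
    most_common = [code for code, _ in counts.most_common(shortlist)]
    symptoms = [code for code in most_common if code not in organisational]
    return symptoms[:keep]


# Hypothetical records for each cohort, with illustrative code labels.
thrombocytosis_records = [(1, "cough", 95), (1, "admin", 99), (2, "cough", 190),
                          (2, "fatigue", 150), (2, "fatigue", 199)]
normal_records = [(3, "cough", 290), (3, "back pain", 295), (3, "cough", 299)]

top_thrombocytosis = top_symptom_codes(thrombocytosis_records, {1: 100, 2: 200}, {"admin"})
top_normal = top_symptom_codes(normal_records, {3: 300}, {"admin"})

# Codes appearing in the top lists of both cohorts.
shared = set(top_thrombocytosis) & set(top_normal)
print(top_thrombocytosis, top_normal, shared)
```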

4.3.12.6 Geographical region

The geographical region in which the patient’s CPRD practice was based was supplied in the raw CPRD data. The regions were tabulated, and the number and percentage of patients in each region were reported. Whilst the CPRD is commonly cited as a ‘geographically representative sample’, there may be some variation in the proportion of patients in each region.


4.3.13 Outcome variables