Chapter 2: The Data, Ethical and Legal Frame

2. Technical issues

2.3.1 Pre-Processing of the Collected Data

2.3.1.3 Data Cleaning

Once all the data was inserted into the final database ‘RHub’, inspection and cleaning of the contents of the data variables could begin. The size of the dataset at this time was immense. The ORU.Pathology (blood results) table alone consisted of over 1.1 billion rows, and the CDS.APC table had over 16 million rows. However, this was the ‘uncleaned’ data, and was currently unfit for use in my analysis. The steps I performed to ‘clean’ this data were:

1. The first task was to check the quality of the linked identifiers. Frequently, when a patient arrives in hospital, a new hospital number (MRN) is assigned to them even though they have been admitted or registered on the hospital systems before. Thus, a patient may have more than one MRN. The resultant problem is that all the patient’s historic data (for example, blood results) becomes inaccessible when the new MRN is used to search for it. To remedy this situation in my dataset, I searched for multiple MRNs that were linked to the same NHS Number (I had requested the hashed MRN and NHS numbers for each patient). For each of these, I then checked whether the date of birth (month and year of birth) and sex matched; if so, all these identifiers were consolidated into one, and the old identifiers deleted. This was checked across both the CDS.APC and ORU.Pathology tables. If there were clashes, such as linked MRNs and NHS numbers with different birth dates or sexes associated with them, then these data were deleted. Eventually, a final identifier list was created and updated across all the tables.

2. For each of the variables listed in Tables 2.1 and 2.2, I then checked in turn how much of the data within each variable conformed to the expected attributes. All data that did not perfectly match the expected attributes were either converted to match these where possible (for example, ethnic category ‘British’ changed to ‘A’), or deleted. When data were missing for a variable, the ORU.Pathology and demographics tables were checked, and any values found there were used.

3. Admission and discharge dates prior to 1 January 2004 or after 1 October 2016 were also deleted, as were rows of data where either of these values was missing, or where the discharge date was before the admission date. In addition, all rows of data where the discharge date was >183 days (more than six months) from the admission date were also deleted. This was because I found lengths of stay of five years or more for hundreds of patients, many with the same discharge date (implying that a default discharge date had been inserted when none was present; this is poor but common practice in databases that have been designed not to accept empty values). I acknowledge that this approach does exclude genuine patients with lengths of stay of greater than six months, but these patients form a tiny proportion of the total number of admissions.

4. Date of birth parameters were also further pruned. The remit of the study was to investigate only adult patients; however, as the dataset spans over a decade, patients who are now adults may have been admitted when they were younger than eighteen years old. Although these admissions would not be investigated directly, the past medical history of adult patients is crucially important in understanding a patient’s condition and their likelihood of deterioration. Thus, to comply with the remit of my project, but also to retain past medical history data, I deleted the data for all patients born on or after September 1995; eighteen years before the start of my project. Thus, only adult patients, according to the approved data ethics for my project, were included in the investigation. However, I also took the additional step of including only patients who were eighteen years or over on the date of admission to hospital in my subsequent analyses (Chapters 3 to 5). Most NHS Trusts were not able to perform this filtering of the data prior to transfer.

5. For each blood result test, multiple dates and times were available: i.e. for sample collected, sample received in laboratory, and result validated. However, all three categories were sparsely populated, with ‘validated datetime’ being the most frequently populated. Therefore, I used ‘validated datetime’ for my analyses. Out-of-range values were also deleted.

6. For the ‘TestNames’ variable, there were thousands of unique blood result tests, and sorting and categorising all of these would not have been possible during the course of my investigation. Therefore, I focussed on the most common blood tests performed in hospital, specifically full blood count (FBC), urea and electrolytes (U’s and E’s), and albumin. Each of these tests was represented under multiple names. Customised searches were carried out on common variations of each of the possible names, and when it was confirmed that they did indeed represent the test in question, all the differing names were changed to the database standard. All other tests were left untouched for ‘cleaning’ at a later date. I also received result data from samples other than blood; however, I analysed only venous or arterial blood results.

7. One would expect the units in which common blood tests are reported to be the same; however, this was not the case. For each blood test, I confirmed the range of its actual results along with the pairing of its units, and if they conformed to reported quality standards, approved them. However, when the ranges and units did not match, further investigation was carried out, and where possible the result values were converted to the appropriate units.

8. Finally, I also carried out a range of additional quality checks for the distribution of the data.
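
The date rules in steps 3 and 4 lend themselves to a compact illustration. The following is a minimal sketch in Python, not the code used for the thesis: the function and constant names are hypothetical, the study window is assumed to run from 1 January 2004 to 1 October 2016 inclusive, and, because only month and year of birth were available, the age check assumes a patient counts as eighteen only once a full extra month has elapsed.

```python
from datetime import date

# Thresholds taken from steps 3 and 4; all names here are illustrative.
STUDY_START = date(2004, 1, 1)
STUDY_END = date(2016, 10, 1)      # assumed upper bound of the study window
MAX_STAY_DAYS = 183                # stays longer than this are dropped
BIRTH_CUTOFF = (1995, 9)           # (year, month): born on/after is excluded


def keep_admission(admitted, discharged, birth_year, birth_month):
    """Return True if an admission row survives the date rules."""
    # Rows with a missing admission or discharge date are dropped.
    if admitted is None or discharged is None:
        return False
    # The discharge date cannot precede the admission date.
    if discharged < admitted:
        return False
    # Both dates must fall inside the study window (assumed inclusive).
    if admitted < STUDY_START or discharged > STUDY_END:
        return False
    # Very long stays are treated as placeholder discharge dates.
    if (discharged - admitted).days > MAX_STAY_DAYS:
        return False
    # Patients born on or after September 1995 are excluded entirely.
    if (birth_year, birth_month) >= BIRTH_CUTOFF:
        return False
    return True


def is_adult_at_admission(admitted, birth_year, birth_month):
    """Conservative age check using only month and year of birth."""
    months = (admitted.year - birth_year) * 12 + (admitted.month - birth_month)
    # Assumption: in the birthday month itself the day is unknown, so the
    # patient is only counted as eighteen from the following month.
    return months > 18 * 12
```

Here `keep_admission` corresponds to the row deletions, while `is_adult_at_admission` corresponds to the additional restriction applied in the later analyses. On tables of the size described, equivalent conditions would more plausibly be expressed as database queries than as row-by-row Python.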

Apart from the specific data cleaning tasks described above, I also deleted all blood results for which there were no matching patient identifiers in the CDS.APC dataset. These blood tests were probably performed by the hospital laboratory on patients who had never been admitted to hospital, and were thus not relevant for my analyses. Duplicate blood results were also deleted.
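
As an illustration of these two deletions, the sketch below keeps only blood results whose patient identifier appears in the admissions data, and keeps each result once. It is a hedged example rather than the thesis code: the field names (`patient_id`, `test_name`, `validated_at`, `value`) are hypothetical, and a duplicate is assumed to mean a row that repeats all four of those fields.

```python
def drop_orphans_and_duplicates(blood_results, admitted_ids):
    """Keep results for admitted patients only, dropping exact repeats."""
    seen = set()
    kept = []
    for row in blood_results:
        # Orphan: no matching patient identifier in the admissions data.
        if row["patient_id"] not in admitted_ids:
            continue
        # Assumed duplicate key: same patient, test, timestamp and value.
        key = (
            row["patient_id"],
            row["test_name"],
            row["validated_at"],
            row["value"],
        )
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept


# Illustrative rows only; these are not real data.
example_results = [
    {"patient_id": "p1", "test_name": "Albumin",
     "validated_at": "2010-03-01T09:00", "value": 41.0},
    {"patient_id": "p1", "test_name": "Albumin",
     "validated_at": "2010-03-01T09:00", "value": 41.0},  # exact repeat
    {"patient_id": "p9", "test_name": "Albumin",
     "validated_at": "2010-03-02T10:00", "value": 38.0},  # never admitted
]
cleaned = drop_orphans_and_duplicates(example_results, admitted_ids={"p1"})
```

The first occurrence of each result is the one retained, and the input order is preserved.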

The size of the dataset following these extensive changes was, for CDS.APC, ~8 million rows; and for ORU.Pathology, ~500 million rows. Further details of the data are provided in each of the analysis chapters (Chapters 3 to 5).

In document ML-EWS: Machine Learning Early Warning System. The application of machine learning to predict in-hospital patient deterioration (Page 71-73)