Database Preparation and Filtering - Methodology and Framework

Chapter 4: Methodology and Framework

4.4 Database Preparation and Filtering

Administrative datasets are heterogeneous and complex. Because the dataset is collected over several years and across different healthcare facilities, we should first inspect its attributes and formatting, look for inconsistencies and correct them before beginning any analysis. The dataset should then be organised into a structured database suitable for our analysis. We also need to consider privacy, given the sensitive nature of the data. The dataset must therefore be thoroughly inspected, sanitised and organised before the analysis begins, which involves several steps. This section discusses the main components of the database preparation step.

4.4.1 De-identification

Healthcare datasets may contain sensitive information, including social security or Medicare numbers, patients’ names, home addresses and exact dates of birth. Before beginning the analysis, we should therefore take adequate measures to ensure that the data is properly sanitised. For the most part, our research framework does not require information that may be deemed sensitive (e.g., patient’s name, address or Medicare number); the only exception is the date of birth, which we need to calculate the patient’s age, an important risk factor for most chronic diseases. However, we do not need to know ages to the day, so we can remove the day and month from the date of birth and keep only the birth year. This significantly reduces the risk of re-identification. Next, names and postcodes are stripped from the dataset, as they are not required. Names are replaced by randomly generated unique IDs that identify patients, because data about the same patient is spread over several database entities. While removing personal information, one should also consider provisions for linking these dispersed records (discussed previously in Section 3.4.1) with other data obtained from different sources. If the dataset may need to be linked, personal identifiers should be removed in a way that still allows it to be linked with other datasets later on. Our framework has no linkage requirements; therefore, this issue did not arise.
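The de-identification steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the framework: the field names (`name`, `address`, `postcode`, `medicare_no`, `date_of_birth`), the ISO date format and the use of name plus date of birth to recognise the same patient across records are all hypothetical. Because the mapping from patient to ID is discarded after use, the sketch suits only the case where no later linkage is required.

```python
import secrets

# Fields treated as sensitive in this sketch (hypothetical column names).
SENSITIVE_FIELDS = {"name", "address", "postcode", "medicare_no", "date_of_birth"}


def deidentify(records):
    """Return de-identified copies of the input records.

    Each record is assumed to be a dict with a 'name' and a
    'date_of_birth' given as an ISO 'YYYY-MM-DD' string.
    """
    pseudonym = {}  # same person -> same random ID across all records
    used = set()    # IDs already issued, to guarantee uniqueness
    cleaned = []
    for rec in records:
        person = (rec["name"], rec["date_of_birth"])
        if person not in pseudonym:
            new_id = secrets.token_hex(8)
            while new_id in used:  # regenerate on the (unlikely) collision
                new_id = secrets.token_hex(8)
            used.add(new_id)
            pseudonym[person] = new_id
        # Keep only non-sensitive fields, then add the ID and birth year.
        safe = {k: v for k, v in rec.items() if k not in SENSITIVE_FIELDS}
        safe["patient_id"] = pseudonym[person]
        safe["birth_year"] = int(rec["date_of_birth"][:4])
        cleaned.append(safe)
    return cleaned
```

If linkage with other datasets were required, the random IDs would have to be replaced by a scheme that can be reproduced or looked up under controlled conditions, which this sketch deliberately does not attempt.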

4.4.2 Database organisation

After the data is properly sanitised, the framework focuses on the organisation and structure of the data. In most cases, healthcare data comes in an electronic format (e.g., a database), and the dataset is normally structured according to the owner organisation’s standard. However, this may still not be enough to apply our methods to the dataset directly, without some preliminary reorganisation. One potential set of problems with the received dataset concerns integrity. For example, the same data may be present in multiple records across different data tables. One possible reason for this duplication is that hospital data can be recorded in several stages. A patient file is maintained during the hospital stay, in which different doctors can enter their detailed diagnostic and clinical notes; doctors may also refer to medical tests, whose reports are attached to the file. The patient file is often maintained on paper until a clinical coder prepares the HCP data from it. The billing department also keeps track of the medical items for which the patient or the insurer is billed. Health insurers may therefore receive the data in two different forms: the patient’s HCP data, and the claim data for the same patient from the billing department. Both forms may be transmitted to the researchers, resulting in data duplication.

Another potential problem with a dataset is discrepancy: part of the admission information for a patient may be available in one record and another part of the same admission information in another record, depending on how the data was recorded and structured. If we kept the original database structure intact, we would need to run the methods on multiple tables to obtain the full set of information. We should therefore reorganise the dataset before the analysis and verify that data integrity is maintained.
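One way to address both duplication and discrepancy is to merge all records that describe the same admission into a single record. The Python sketch below illustrates the idea under assumptions that are not taken from the dataset itself: records are dicts, an admission is identified by the hypothetical pair `patient_id` and `admission_date`, and diagnosis codes are held in a `codes` set. When two records disagree on a field, the sketch simply keeps the first value seen; a real pipeline would need an explicit rule for such conflicts.

```python
def consolidate(records):
    """Merge partial or duplicated records of the same admission.

    Records are grouped by (patient_id, admission_date), an assumed
    admission key. Code sets are unioned; for every other field the
    first non-missing value encountered is kept.
    """
    merged = {}
    for rec in records:
        key = (rec["patient_id"], rec["admission_date"])
        target = merged.setdefault(key, {"codes": set()})
        target["codes"].update(rec.get("codes", ()))
        for field, value in rec.items():
            if field == "codes" or value is None:
                continue
            target.setdefault(field, value)  # first value wins
    return list(merged.values())
```

For example, an HCP record and a claim record for the same admission would collapse into one record that carries the union of their codes together with the fields that only one of them supplied.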

The SQL-based relational data structure is adequate for the framework. The database design should strictly enforce primary key–foreign key relationships to ensure consistency. For example, a patient can have multiple admissions, whereas each admission should be associated with exactly one patient. Also, no two patients can have the same identifier. These relationships can easily be implemented in SQL-based database designs by making the patient identifier the primary key of the patient table, and creating a foreign key in the admission table that must point to that primary key. In this way, the integrity of the patient–admission relation cannot be breached, even accidentally. Further, the main entities (patient, provider, admission, treatment and DRG) should be logically separated in the database by placing them in separate tables. This database design also supports complex queries, either directly in the database client engine or through the framework. As part of the analysis methods, we often need to perform such queries in order to run complex analyses, and the relational design will significantly improve performance and, moreover, ensure data security.
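The patient–admission relationship described above can be illustrated with a small SQLite database created from Python. This is a sketch only: the table and column names are hypothetical, SQLite stands in for whichever SQL engine is actually used, and only two of the five entities are shown. Note that SQLite enforces foreign keys only when the `foreign_keys` pragma is switched on for the connection.

```python
import sqlite3


def make_db():
    """Create an in-memory database with patient and admission tables."""
    conn = sqlite3.connect(":memory:")
    # SQLite checks foreign keys only if this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE patient (
            patient_id TEXT PRIMARY KEY,      -- no two patients share an ID
            birth_year INTEGER
        );
        CREATE TABLE admission (
            admission_id   INTEGER PRIMARY KEY,
            patient_id     TEXT NOT NULL      -- exactly one patient per admission
                           REFERENCES patient(patient_id),
            admission_date TEXT NOT NULL
        );
        """
    )
    return conn


def add_admission(conn, patient_id, admission_date):
    """Insert an admission; return False if it violates a constraint."""
    try:
        conn.execute(
            "INSERT INTO admission (patient_id, admission_date) VALUES (?, ?)",
            (patient_id, admission_date),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this schema, an admission that refers to a patient identifier absent from the patient table is rejected by the database itself, so the check does not depend on the analysis code remembering to perform it.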

From the above discussion, we can discern clear reasons why it can be necessary to reformat or restructure the original dataset. We have already discussed the logical database structure for the framework in the previous chapter (see Section 3.5) in great detail. Therefore, in the next section, we move on to discuss assessing the data.

4.4.3 Preliminary assessment

This is the final step in the three-step process of preparing the dataset. In this step we focus on assessing the coding quality of the data and, based on that assessment, removing any records that do not have sufficient information to be considered in the analysis. This keeps the data as free of noise as possible; noise can otherwise affect the overall performance and accuracy of the framework. In assessing the data, we look for several data characteristics across the entities, as follows:

• All patients should have a sufficient duration of time represented in their records over which we can trace their admission histories. Including patients with insufficient duration may introduce outliers and noise in the disease progression network. The exact upper and lower bounds of the duration depend on the quality and nature of the dataset, as well as the pathophysiology of the specific chronic disease that the framework is analysing. Data sourced from health insurers can include the patient’s joining date and termination date (the absence of a termination date may indicate that the patient is still a member); these can be used to calculate the duration of a patient’s record, as we can assume that all admission information during that time was sent to the insurer. Sometimes joining and termination dates are omitted or obfuscated for privacy reasons. In that case, we can use the duration between the first and last admissions. For the part of the framework where we construct the baseline network from chronic disease (i.e., T2D) patients, we examine the period of their record up to the point at which they are first diagnosed. Therefore, the effective duration is calculated from the joining date (if available) or the first admission date until the date of the admission in which the first chronic disease code appeared.

• Some patients may have medical conditions that require recurrent admissions. For example, a patient may have a physical injury requiring frequent admissions for dressing. Alternatively, a patient may need regular medical services (e.g., dialysis). As a result, their comorbidity information will be recorded each time they are admitted. Allowing these records can lead to overestimation of their comorbidities, and thus may introduce bias. Therefore, we should set a threshold that allows patients a certain maximum number of admissions per year.

• There should also be a minimum threshold in terms of admission numbers to be considered in the framework. As the framework initially constructs a disease progression network for an individual, having a very small number of admissions will not give reliable information on how the diseases progressed. Also, theoretically we need at least two admissions for each patient, each of which should have valid disease codes, in order to depict transitions over time.

• Some diagnosis or treatment codes that do not affect the progression or onset of the chronic disease should be excluded from the framework. For example, accidental or physical injuries that are not related in any way to the chronic disease should be omitted. Codes related to consultation with a GP and general diagnostic tests (e.g., full blood count) do not carry significance for our task of predicting chronic disease. In addition, some specific physical conditions or attributes present in the admission record may not make sense if considered alone. Therefore, those conditions or attributes should also be excluded from the analysis. For example, if conditions like fever, vomiting or vertigo are present during admission, they are recorded in the HCP data; however, these conditions do not contribute to the framework, and hence are put in the exclusion list.
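The four criteria above can be combined into a single eligibility check, sketched here in Python for the baseline (chronic disease) cohort. Everything specific in it is an assumption made for illustration: the threshold defaults (`min_years`, `min_admissions`, `max_per_year`) are placeholders rather than values from the framework, the code labels such as "FEVER" and "T2D" stand in for real diagnosis codes, admissions are counted per calendar year, and each admission is a dict with a `date` and a `codes` set.

```python
from collections import Counter
from datetime import date

# Placeholder labels for codes on the exclusion list (not real codes).
EXCLUDED_CODES = {"FEVER", "VOMITING", "VERTIGO"}


def effective_years(admissions, chronic_codes, joining_date=None):
    """Years from joining date (or first admission) to first chronic code.

    Returns None if no admission carries a chronic disease code.
    """
    ordered = sorted(admissions, key=lambda a: a["date"])
    start = joining_date or ordered[0]["date"]
    for adm in ordered:
        if chronic_codes & adm["codes"]:
            return (adm["date"] - start).days / 365.25
    return None


def is_eligible(admissions, chronic_codes, joining_date=None,
                min_years=2.0, min_admissions=2, max_per_year=12):
    """Apply the exclusion list, admission thresholds and duration check."""
    # Strip excluded codes, then keep admissions that still have a code.
    valid = [
        {**a, "codes": a["codes"] - EXCLUDED_CODES} for a in admissions
    ]
    valid = [a for a in valid if a["codes"]]
    if len(valid) < max(min_admissions, 1):
        return False
    # Reject patients with too many admissions in any calendar year.
    per_year = Counter(a["date"].year for a in admissions)
    if max(per_year.values(), default=0) > max_per_year:
        return False
    years = effective_years(valid, chronic_codes, joining_date)
    return years is not None and years >= min_years
```

In practice the thresholds would be tuned to the dataset and to the chronic disease under study, as noted in the first criterion.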

In document Predicting the Risk of Chronic Disease: A Framework Based on Graph Theory and Social Network Analysis (Page 120-125)