Data cleaning and preprocessing are essential in any comprehensive data analytics study. They matter even more in healthcare analytics, especially when real-world EMR data are involved, because the data are captured and stored in different clinical/hospital settings and for reasons other than data analytics [107]. Hence, in this study, data preparation was taken very seriously.
Data Preprocessing
The data used for this research were obtained from the Cerner Corporation’s Health Facts data warehouse, a comprehensive, relational repository of real-world, de-identified, and HIPAA-compliant patient data. A simplified conceptual data diagram of the Cerner Health Facts data warehouse is presented in Figure 4.1.
Processing and analyzing large EMR datasets involves various challenges. Jagadish, et al. [108] classify these challenges into five categories: data acquisition; information extraction and cleaning; data integration, aggregation, and representation; modeling and analysis; and interpretation. Given the large number of variables in the data warehouse, we spent a significant amount of time understanding the purpose and relevance of each variable. An even more demanding step in preparing the dataset for final analysis was aggregating the records at the patient level and integrating patients’ comorbid conditions. Hence, information extraction and cleaning, together with data integration, aggregation, and representation, constituted the majority of our data preprocessing efforts.
The nature of EMR data posed yet another difficulty to this study. Because EMR data are collected for purposes other than performing data analytics, they suffer from multiple deficiencies. First, since EMR data are collected from several facilities around the country, they lack integrity and consistency. For instance, different units or even different names might be used in different hospitals. Second, missing or incomplete data, which are endemic in EMR data, need to be addressed. Third, outliers and other data entry errors are prevalent in EMR data. We describe the approach we used in the data preparation step to address these challenges later in this section.
[Figure 4.1 entities and keys: Encounter (Encounter_ID, Patient_ID, Hospital_ID); Patient (Patient_ID); Lab Procedure (Lab_Procedure_ID, Encounter_ID); Clinical Event (Clinical_Event_ID, Encounter_ID); Hospital (Hospital_ID); Diagnosis (Diagnosis_ID, Encounter_ID)]
Figure 4.1- A simplified conceptual data model for Cerner Health Facts data warehouse
For the purpose of this research, we extracted data of more than 1.4 million unique diabetic patients from approximately 5.3 million visits. Since the number of variables collected from different data tables was rather large (300+), we needed to take many data selection, aggregation, and preparation steps.
First, data from all tables (e.g., encounter, patient, lab procedure, clinical event, etc.) for diabetes diagnosis and all associated complications, such as diabetic neuropathy, nephropathy, and retinopathy, were extracted. The table that included the lab procedure data was very important in this regard. The first dataset extracted from this table for diabetic patients included more than 800 different lab procedures. This primary dataset was very sparse, since not every patient had all of these lab results. We dropped those lab procedures that lacked sufficient results in the data. After taking several data cleaning steps, 88 lab procedures remained. Because EMR data are collected from hundreds of facilities across the United States, different names may be used for the same lab procedures. We consulted with clinical experts and merged identical lab procedures into one variable. As a result of this step, 58 lab procedures remained in our dataset.
The lab procedure table contained a column labeled “lab_procedure_name” that included all lab procedures for individual visits (encounters). We transposed this table so that each lab procedure
had its own column. This increased the number of columns in the lab procedure table from 35 to about 100. Since every patient at each visit (hospital stay) could have multiple results for the same lab procedure, we retained the last result as consultations with physicians and clinical experts suggested these values could be considered the stable condition for a patient. Moreover, because our focus was on developing a CDSS for the early detection of retinopathy, we selected each patient’s first chronological visit to increase the validity and generalizability of our findings. In the next step, we used table keys (i.e., “Patient_ID,” “Encounter_ID,” and “Diagnosis_ID”) to join data from multiple tables into a single table that included lab results, demographic data, and diagnosis data, with each record representing an individual diabetic patient. The resulting table included data from over 300,000 unique patients. Figure 4.2 depicts different data preparation steps in our study.
[Figure 4.2 (data preparation steps): Raw Data; Data Preprocessing 1 (Lab Procedure): selecting populated lab tests, aggregating same lab tests with different names, transposing the table; Data Preprocessing 2 (Other Tables): cleaning, selecting, transforming; Data Preprocessing 3: merging tables, selecting first visits; Prepared Data]
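The transposition, last-result, and first-visit steps described above can be illustrated with a minimal sketch. It is not the procedure actually run against the data warehouse: the tuple layouts, the `result_time` and `admit_time` fields, and the sample lab names and values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def pivot_last_lab_results(rows):
    """Turn long lab rows into one dict of lab columns per encounter.

    rows: iterable of (encounter_id, lab_procedure_name, result_time, value).
    When an encounter has several results for the same lab procedure,
    only the most recent one is kept.
    """
    latest = {}
    for encounter_id, lab_name, result_time, value in rows:
        key = (encounter_id, lab_name)
        if key not in latest or result_time >= latest[key][0]:
            latest[key] = (result_time, value)
    wide = defaultdict(dict)
    for (encounter_id, lab_name), (_, value) in latest.items():
        wide[encounter_id][lab_name] = value
    return dict(wide)


def first_encounter_per_patient(encounters):
    """encounters: iterable of (patient_id, encounter_id, admit_time)."""
    first = {}
    for patient_id, encounter_id, admit_time in encounters:
        if patient_id not in first or admit_time < first[patient_id][0]:
            first[patient_id] = (admit_time, encounter_id)
    return {patient: enc for patient, (_, enc) in first.items()}


# Hypothetical toy rows (times are plain integers for simplicity).
lab_rows = [
    ("E1", "HbA1c", 1, 8.1),
    ("E1", "HbA1c", 2, 7.4),       # later result for the same lab and visit
    ("E1", "Creatinine", 1, 1.0),
    ("E2", "HbA1c", 1, 6.9),
]
encounters = [("P1", "E2", 20), ("P1", "E1", 10)]

wide = pivot_last_lab_results(lab_rows)
first = first_encounter_per_patient(encounters)
# One record per patient: the lab columns of that patient's first visit.
patient_table = {p: wide.get(e, {}) for p, e in first.items()}
```

Keeping the two steps as separate functions mirrors the order in the text: the lab table is reshaped first, and the patient-level selection is applied when the tables are joined.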
Modeling Procedure
To develop the predictive models, we employed logistic regression, decision tree, random forest, and artificial neural network techniques. Building each of these models was preceded by various data manipulation steps, including transforming variables to approximate a normal distribution, replacing and filtering extreme data points, and applying different imputation methods. At the end, we compared the results. The modeling procedure is shown in Figure 4.3.
The target variable was a binary variable, where 0 denoted no diabetic retinopathy diagnosis and 1 denoted diabetic retinopathy diagnosis. Diabetic patients without a retinopathy diagnosis were included as the control group. The final dataset was highly imbalanced. Imbalanced data arise in several real-world problems, such as fraud detection, oil-spill detection, and medical applications (Kubat, et al. [109], Rao, et al. [110], Chan, et al. [111]). The majority class in the dataset was diabetic patients without retinopathy (95%), and our class of interest, diabetics with retinopathy (5%), was the minority class. The main challenge in analyzing imbalanced datasets is that most standard machine learning techniques perform poorly in identifying the minority class of the target variable [112]. Therefore, a balanced dataset is necessary to develop predictive models with high accuracy. Since there was a reasonable number of retinopathy patients in the minority class (about 15,000 patients), we created a balanced dataset by randomly under-sampling the majority class. The next step was to partition the data into training and validation datasets to objectively assess the different model types. In the following section, we provide a brief description of each of the modeling techniques used in this study.
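Random under-sampling followed by a training/validation partition can be sketched as follows. This is a minimal illustration, not the tooling used in the study: the record layout, the `retinopathy` field name, the 70/30 split fraction, and the fixed seed are all assumptions made for the example.

```python
import random


def undersample_majority(records, label_of, seed=0):
    """Keep every minority record (label 1) and an equally sized random
    sample, drawn without replacement, of majority records (label 0)."""
    rng = random.Random(seed)
    minority = [r for r in records if label_of(r) == 1]
    majority = [r for r in records if label_of(r) == 0]
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


def train_validation_split(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and cut it into two partitions.
    The 0.7 default is an assumption; the text does not state the ratio."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy data with 5% positives, echoing the class ratio reported above.
records = [{"id": i, "retinopathy": 1 if i < 50 else 0} for i in range(1000)]
balanced = undersample_majority(records, lambda r: r["retinopathy"])
train, valid = train_validation_split(balanced)
```

The sketch assumes the positive class is the smaller one; if it were not, `rng.sample` would raise an error rather than silently over-sample.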
Modeling Techniques
Logistic regression: Logistic regression is a classic statistical model. This method is capable of predicting and classifying categorical variables, but is mostly used for binary variables [113]. It is an extension of linear regression, but instead of modeling a continuous value, binary logistic regression models the log odds of an event, rather than the occurrence of the event itself, as a linear function of the predictors.
Selection methods are often used to construct an optimal regression equation using a large number of predictors. Three statistical regression methods of variable selection are forward selection, backward elimination, and stepwise selection. The training in forward selection starts with an empty equation and adds predictors one at a time starting with the most significant predictor. Selection ends when all remaining predictors fail to meet the specified F-to-enter value. Backward elimination training starts with all predictors and removes, one at a time, the least significant predictors. Elimination ends when all remaining predictors fail to meet the specified F-to-remove value. The stepwise method is a variation of the above methods. It starts with an
empty model, and after each step in which a predictor is added based on the F-to-enter value, it evaluates predictors in the model against the specified significance level. Those that fall below this level are removed. In this study, we applied the stepwise method. The binary logistic regression equation is shown in Equation 4.1 and Equation 4.2.
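$$\operatorname{logit}[P(x)] = \ln\left[\frac{P(x)}{1 - P(x)}\right] = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n \tag{4.1}$$

$$P(x) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n}} = \frac{1}{1 + e^{-\operatorname{logit}[P(x)]}} \tag{4.2}$$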
𝑃(𝑥) is the probability that the target variable belongs to a specific category (in our study, a patient has retinopathy) and 𝛽𝑖 is the coefficient of the 𝑖th predictor.
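As a small numerical check of Equations 4.1 and 4.2, the sketch below computes the predicted probability from a linear predictor and confirms that applying the logit recovers that predictor. The coefficient and input values are hypothetical and are not estimates from this study.

```python
import math


def logit(p):
    """Log odds of a probability p, as on the left side of Equation 4.1."""
    return math.log(p / (1.0 - p))


def predict_probability(betas, x):
    """Equation 4.2: betas[0] is the intercept, betas[1:] pair with x."""
    z = betas[0] + sum(b * xi for b, xi in zip(betas[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


betas = [-1.0, 0.8, 0.5]   # hypothetical beta_0, beta_1, beta_2
x = [1.0, 2.0]             # hypothetical predictor values x_1, x_2
p = predict_probability(betas, x)   # linear predictor is 0.8
recovered = logit(p)                # should equal the linear predictor
```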
[Figure 4.3 elements: Prepared Data; Data Preprocessing (imputation (tree/mean), under-sampling the majority class); Variable Transformation (Max Normal); Extreme Points Replacement; Using Data As It Is; Training Dataset and Validation Dataset; Model Development (logistic regression, decision tree, random forest, neural network); Trained Models; Model Evaluation (area under ROC, accuracy, sensitivity, specificity)]
Figure 4.3- Modeling procedure
Artificial Neural Network: The artificial neural network (ANN) is a widely used model in healthcare analytics. ANNs can be defined as “massively parallel processors, which tend to preserve experimental knowledge and enable their further use” [114]. One of the advantages of the neural network model is its ability to handle highly complex problem structures with non-linear relationships among variables. A limitation of this method, however, is its high sensitivity to model parameters (i.e., structure/architecture of the model, learning rate, number of layers and neurons in each layer, etc.) [115]. Figure 4.4 exhibits a simple two-layer perceptron network. In this example, there are three inputs and two neurons in the hidden layer. There is a transfer function for the output layer and for each neuron in the hidden layer. In this study, we used two-layer perceptron networks with hyperbolic tangent transfer functions in the hidden layer and a soft-max transfer function in the output layer (see Equations 4.3, 4.4, and 4.5). We also used the conjugate-gradient optimization technique to optimize the network. For more details about neural network design, we refer readers to Hagan, et al. [116].
$$a_1 = f_1(w_{11} x_1 + w_{21} x_2 + w_{31} x_3 + b_1) \tag{4.3}$$

$$a_2 = f_2(w_{12} x_1 + w_{22} x_2 + w_{32} x_3 + b_2) \tag{4.4}$$

$$y = f(w_1 a_1 + w_2 a_2 + b) \tag{4.5}$$
Figure 4.4- Two-layer perceptron neural networks
In these equations, 𝑥1, 𝑥2, and 𝑥3 are input variables; 𝑤𝑖𝑗 is the weight of input 𝑖 for neuron 𝑗; 𝑏𝑗 and 𝑎𝑗 are the bias and output of neuron 𝑗, respectively; 𝑤𝑗 is the weight of hidden neuron 𝑗 in the output layer and 𝑏 is the output-layer bias; 𝑓1 and 𝑓2 are the transfer functions for the hidden layer; 𝑓 is the transfer function of the output layer; and 𝑦 is the output of the network. In this study, we developed neural network models in two settings. In the first setting, we fed all of the variables into the neural network models, but in the second setting, we only used the variables that were selected through the stepwise method in logistic regression.
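The forward pass in Equations 4.3 to 4.5 can be sketched for the three-input, two-hidden-neuron example. The weights and inputs below are hypothetical. The hidden transfer functions are hyperbolic tangents, as in the study; for the single output 𝑦 of Equation 4.5 the sketch uses a logistic function, which is an assumption made here on the grounds that a two-class soft-max reduces to a logistic function of the difference between the two class scores.

```python
import math


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a two-layer perceptron.

    w_hidden[i][j] is the weight of input i for hidden neuron j,
    matching the w_ij indexing of Equations 4.3 and 4.4.
    """
    a = []
    for j in range(len(b_hidden)):
        net = sum(w_hidden[i][j] * x[i] for i in range(len(x))) + b_hidden[j]
        a.append(math.tanh(net))            # hidden transfer functions f1, f2
    net_out = sum(w_out[j] * a[j] for j in range(len(a))) + b_out
    y = 1.0 / (1.0 + math.exp(-net_out))    # output transfer function f (assumed logistic)
    return a, y


x = [0.5, -1.0, 2.0]                            # hypothetical inputs x1, x2, x3
w_hidden = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
b_hidden = [0.0, 0.1]
w_out = [1.0, -1.0]
b_out = 0.0
a, y = forward(x, w_hidden, b_hidden, w_out, b_out)
```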
Decision Tree: Decision tree is a method that recursively partitions the data based on a predictor [117]. The training process in this method starts at the root node (i.e., all the records and
predictors). The tree is built by splitting the records at each stage (i.e., each node) according to the best cut-off value of a predictor. There are several criteria to select the best split. In this study,
we used Pearson’s 𝜒2 p-value and the Gini index. Pearson’s 𝜒2 p-value measures the level of separation achieved by the split. To calculate this measure, consider a 2 × 2 contingency table for the split. Columns represent the branch directions and rows specify the target variable (0 or 1). The 𝜒2 value is calculated as in Equation 4.6.
$$\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \tag{4.6}$$
In this equation, 𝑂𝑖𝑗 is the observed frequency in row 𝑖 and column 𝑗, and 𝐸𝑖𝑗 is the expected frequency in row 𝑖 and column 𝑗. The p-value of the 𝜒2 statistic is then calculated. The smaller the p-value, the better the split, that is, the higher the level of separation.
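A minimal sketch of Equation 4.6 for a candidate split follows. It assumes the expected frequencies are derived from the row and column totals in the usual way, applies no continuity correction, and assumes no row or column total is zero. For a 2 × 2 table the statistic has one degree of freedom, so the p-value can be obtained from the complementary error function. The counts are hypothetical.

```python
import math


def chi_square_2x2(observed):
    """Pearson chi-square statistic and p-value for a 2 x 2 table.

    observed[i][j]: count for target level i (rows) in branch j (columns).
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


separating = [[40, 10], [10, 40]]     # hypothetical split that separates well
uninformative = [[25, 25], [25, 25]]  # hypothetical split with no separation
chi2_good, p_good = chi_square_2x2(separating)
chi2_bad, p_bad = chi_square_2x2(uninformative)
```

The first table yields a large statistic and a very small p-value, while the second yields a statistic of zero and a p-value of one, which matches the rule that smaller p-values indicate better splits.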
The Gini index shows the level of purity achieved by the split. As defined here, Gini is the probability that two randomly selected members of a population belong to the same class. For a pure population, this index would be 1. The Gini index in each leaf of a split is calculated as in Equation 4.7, where 𝑝1 and 𝑝2 are the proportions of the two target levels in that leaf.

$$Gini = p_1^2 + p_2^2 \tag{4.7}$$

Then, the Gini score of the split is calculated as in Equation 4.8, where 𝑤𝑙𝑒𝑓𝑡 and 𝑤𝑟𝑖𝑔ℎ𝑡 are the proportions of the records in each leaf.

$$Gini_{score} = w_{left}\, Gini_{left} + w_{right}\, Gini_{right} \tag{4.8}$$
The higher the Gini score, the higher the level of purity achieved by the split. Although the decision tree method is easy to understand, especially for those without knowledge of theories underlying data mining methods, one of its major drawbacks is that data partitioning may result in one leaf comprised of few data points, precluding any useful information from that portion of the data [115].
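The purity form of the Gini index in Equations 4.7 and 4.8 can be checked on two extreme splits. The leaf counts are hypothetical, and the sketch assumes that neither leaf is empty.

```python
def gini_purity(n_class0, n_class1):
    """Equation 4.7: probability that two random members share a class."""
    n = n_class0 + n_class1
    p1, p2 = n_class0 / n, n_class1 / n
    return p1 ** 2 + p2 ** 2


def gini_score(left, right):
    """Equation 4.8: leaf purities weighted by each leaf's share of records.

    left and right are (count of class 0, count of class 1) per leaf.
    """
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * gini_purity(*left) + (n_right / n) * gini_purity(*right)


pure = gini_score((50, 0), (0, 50))      # each leaf holds a single class
mixed = gini_score((25, 25), (25, 25))   # each leaf is an even mixture
```

A perfectly separating split scores 1, and a split that leaves both leaves evenly mixed scores 0.5, the minimum for a binary target under this definition.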
Random Forest: Random forest could be considered an extension of decision tree. This method develops multiple smaller trees that classify each member of the sample data. The final predicted
class for a particular sample member is determined using a voting mechanism based on the prediction of all trees [118]. Each tree in the random forest uses a subset of records and variables. Random sampling with replacement is used for building each tree. In this study, after examining several scenarios developed by altering model characteristics, we used 60% of the training data and the square root of the number of variables to build each tree. Several advantages can be enumerated for random forest. Besides high accuracy, this method provides a variable importance metric that can be used for identifying important risk factors. Random forest can also handle datasets with a large number of variables [16].
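The per-tree sampling and the voting mechanism described above can be sketched without fitting any trees. The function names and the toy predictions are hypothetical. The sketch draws one variable subset per tree, as the description above suggests; many random forest implementations instead draw a fresh subset at every split, so this is a simplification.

```python
import math
import random
from collections import Counter


def draw_tree_inputs(n_records, n_variables, record_fraction=0.6, seed=0):
    """Pick the records and variables used to grow a single tree."""
    rng = random.Random(seed)
    n_rows = round(record_fraction * n_records)
    # Records are sampled with replacement, so indices may repeat.
    rows = [rng.randrange(n_records) for _ in range(n_rows)]
    # Variables are sampled without replacement: sqrt of the total count.
    n_vars = max(1, int(math.sqrt(n_variables)))
    variables = rng.sample(range(n_variables), n_vars)
    return rows, variables


def majority_vote(tree_predictions):
    """Final class for one sample: the class predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rows, variables = draw_tree_inputs(n_records=1000, n_variables=64)
predicted = majority_vote([1, 0, 1, 1, 0])   # hypothetical votes of 5 trees
```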
Predictive Model Sets
In this research, four different sets of predictive models were developed (see Figure 4.5). The first set, called the basic models, encompassed models that were developed using lab procedures and demographic data of diabetic patients. In the second set, models were built on lab procedures, demographics, and comorbidity data. These models are called comorbid models. The third set, dubbed over-sampled models, consisted of models built on data that were over-sampled with the synthetic minority over-sampling technique (SMOTE). The fourth set included ensemble models that were developed based on the outputs of individual classifiers.
Basic Models: In this set of models, we used the data compiled during the data preparation phase. We call this dataset “basic data” as it only included the demographic data and lab results of the diabetic patients.
Models Based on Comorbid Data: The second set of predictive models was based on the comorbidity information. To develop these models, comorbidity data were added to the basic data through several data preparation steps. In these models, we considered the existence of other diabetes-related complications to predict diabetic retinopathy. The following complications were included in our analyses: neuropathy, nephropathy, peripheral circulatory disorders, hyperosmolarity, diabetes-related coma, and other specified diabetes-related conditions. To prepare the comorbid dataset, we performed several steps on the primary data table, which consisted of the list of patients, their complications (diagnosis codes), and their demographic and lab data. Since each complication of a patient generated a different record in the database, we extracted all records in which the diagnosis was one of the aforementioned complications and saved them in separate tables. Next, we merged these tables by patient ID and added a binary variable for each complication. Therefore, for each patient, in addition to the demographic and lab data, we added information about their other co-existing complications. After these steps, the dataset was ready for the development of the predictive models.
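The construction of the binary complication variables can be sketched as below. The data layout, the feature names, and the shortened complication list are hypothetical; in practice the complications would be identified from diagnosis codes rather than from plain labels.

```python
# Shortened, illustrative list; the study used six complication groups.
COMPLICATIONS = ["neuropathy", "nephropathy", "hyperosmolarity"]


def add_comorbidity_flags(basic_data, diagnosis_records):
    """Add one 0/1 column per complication to each patient's record.

    basic_data: {patient_id: {feature_name: value}}
    diagnosis_records: iterable of (patient_id, complication), with one
    record per complication, as in the source tables.
    """
    diagnosed = {}
    for patient_id, complication in diagnosis_records:
        diagnosed.setdefault(patient_id, set()).add(complication)

    comorbid_data = {}
    for patient_id, features in basic_data.items():
        row = dict(features)   # copy so the basic data stay unchanged
        patient_complications = diagnosed.get(patient_id, set())
        for complication in COMPLICATIONS:
            row[complication] = 1 if complication in patient_complications else 0
        comorbid_data[patient_id] = row
    return comorbid_data


basic_data = {"P1": {"age": 61, "HbA1c": 7.4}, "P2": {"age": 48, "HbA1c": 6.9}}
diagnosis_records = [("P1", "neuropathy"), ("P1", "nephropathy")]
comorbid_data = add_comorbidity_flags(basic_data, diagnosis_records)
```

Patients with no matching diagnosis records receive zeros for every flag, so the comorbid dataset keeps the same patients as the basic data.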
[Figure 4.5 elements: Predictive Models; Set 1: Models based on Lab and Demographic Data (Basic Data); Set 2: Models based on Basic and Comorbidity Data; Set 3: Models based on Over-sampled Data; Set 4 (Ensemble Models)]
Figure 4.5- Predictive model sets
Models Based on Over-Sampled Data: In the previous two sets of models, we carried out random under-sampling of the majority class to create a balanced dataset. One obvious limitation of under-sampling is the possibility of losing important information about the majority class by removing part of the data [76]. The other available approach to creating a balanced dataset is to over-sample the minority class. Numerous over-sampling methods have been proposed in recent years, and among them synthetic data generation for the minority class is one of the best performing. In this approach, new examples of the minority class are generated using different techniques until the desired degree of class balance is reached. SMOTE [78] is one of the best-known methods in this regard. In this method, synthetic data points are generated on the line joining each minority sample and any/all of its 𝑘 minority class nearest neighbors (the minority class samples with the smallest Euclidean distance from the original sample). Consider 𝑥 a minority class sample, and 𝑥𝑖 one of its 𝑘 minority class nearest neighbors. The new data point is generated as in Equation 4.9,
$$x_{new} = (1 - \delta)\,x + \delta\,x_i \tag{4.9}$$
where 𝛿 is a random number in the interval [0, 1]. Figure 4.6 depicts the synthetic data generation process.
Figure 4.6 - Synthetic Minority Over-Sampling Technique (SMOTE)
The number of 𝑘 nearest neighbors to be used depends on the amount of over-sampling required. For instance, if we want to increase the minority class by 300%, 𝑘 would be equal to 3. We can enumerate several advantages for this method. First, it requires no information other than the dataset itself [119]. Further, since it is a preprocessing method, over-sampled data can be used in any classification technique with good performance on balanced data [120]. Finally, by
generating synthetic minority data, as opposed to simply replicating existing minority data, the minority region can be generalized and overfitting, a limitation of replication, can be avoided [85].
In this study, we considered 5 neighbors to increase the size of the minority class by 10 times, which means we generated two synthetic data points on each line connecting a minority class sample and each of its five nearest neighbors. Rather than over-sampling the minority class up to the level of the majority class, we increased the size of the minority class to some extent and then under-sampled the majority class to reach a balanced dataset. This is consistent with the study by Chawla, et al. [78], which showed that the combination of SMOTE and under-sampling of the majority class performs better than plain under-sampling. For simplicity, from this point to the end of this chapter, we call models in Set 1 “basic models”, models in Set 2 “comorbid models”, and models in Set 3 “over-sampled models”.
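The interpolation in Equation 4.9, applied with five neighbors and two synthetic points per neighbor, can be sketched as follows. This is an illustrative brute-force version, not the implementation used in the study: the function names and toy points are hypothetical, features are assumed to be numeric, and the neighbor search compares every pair of minority samples, which would be slow on a large dataset.

```python
import math
import random


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def smote(minority, k=5, points_per_neighbor=2, seed=0):
    """Generate synthetic minority samples by interpolation (Equation 4.9).

    For each minority sample x, the k nearest minority neighbors are
    found, and points_per_neighbor new points are placed at random
    positions on the segment between x and each neighbor.
    """
    rng = random.Random(seed)
    synthetic = []
    for idx, x in enumerate(minority):
        others = minority[:idx] + minority[idx + 1:]
        neighbors = sorted(others, key=lambda o: euclidean(x, o))[:k]
        for x_i in neighbors:
            for _ in range(points_per_neighbor):
                delta = rng.random()   # random weight between 0 and 1
                synthetic.append(
                    tuple((1 - delta) * a + delta * b for a, b in zip(x, x_i))
                )
    return synthetic


# Hypothetical two-feature minority samples.
minority = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0), (2.0, 2.0), (1.5, 0.2), (0.2, 1.8)]
synthetic = smote(minority)
```

With six minority samples, the defaults produce ten synthetic points per original sample. Because every synthetic point lies on a segment between two real minority samples, none falls outside the range of the observed minority values.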
Ensemble Models
In this study, we developed a new ensemble approach, called the confidence margin ensemble. We assessed the performance of the confidence margin ensemble in comparison to four other existing ensemble techniques: simple average, weighted average, voting-based, and random forest. We