3.3 Data Cleaning & Feature Selection
3.3.2 Feature Selection
After data cleaning, we apply the following two feature selection techniques to assess which features are most helpful in constructing clinical prediction models.
• Random Forest (RF) [35, 97]: this is an ensemble classifier based on randomized decision trees that provides different feature importance measures, which can be visualized by the Gini index scores [137, 71]. This feature importance score provides a relative ranking of the features and can be used as a general indicator of feature relevance. Here we run a random forest classifier on all features whose missing rate is below 50% and select useful features based on their Gini index scores.
• Logistic Regression (LR) [63]: this uses maximum-likelihood estimation to compute the coefficients for all features, which can be used to rank them based on their relative importance.
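To make the Gini-based ranking more concrete, the following minimal sketch filters features by missing rate and scores each remaining feature by the largest Gini impurity decrease that a single threshold split on it achieves. This is only an illustration under stated assumptions: a random forest averages such impurity decreases over many randomized trees, whereas the sketch uses one split per feature, and the function names, the feature names and the toy data are all hypothetical.

```python
from collections import Counter

def missing_rate(values):
    """Fraction of entries that are None (treated here as missing)."""
    return sum(v is None for v in values) / len(values)

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_gini_decrease(values, labels):
    """Largest impurity decrease over all threshold splits on one feature."""
    pairs = [(v, y) for v, y in zip(values, labels) if v is not None]
    parent = gini([y for _, y in pairs])
    best = 0.0
    # Try every observed value except the largest as a "<= t" threshold.
    for t in sorted({v for v, _ in pairs})[:-1]:
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        best = max(best, parent - child)
    return best

def rank_features(table, labels, max_missing=0.5):
    """Keep features with missing rate < max_missing and rank them by score."""
    scores = {
        name: best_gini_decrease(column, labels)
        for name, column in table.items()
        if missing_rate(column) < max_missing
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up toy data: label 1 = fast progression, 0 = slow progression.
labels = [1, 1, 1, 0, 0, 0]
table = {
    "feature_a": [1.0, 2.0, 1.5, 3.0, 3.5, 4.0],      # separates the classes
    "feature_b": [5.0, 1.0, 3.0, 5.0, 1.0, 3.0],      # uninformative
    "feature_c": [None, None, None, None, 2.0, 1.0],  # mostly missing, dropped
}
ranking = rank_features(table, labels)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setting the feature that separates the two classes receives the highest score, and the feature with too many missing values never enters the ranking.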
Table 3.1: The ALS Functional Rating Scale
The ALS Functional Rating Scale Features
1. Speech
2. Salivation
3. Swallowing
4. Handwriting
5. Cutting food and handling utensils
6. Dressing and hygiene
7. Turning in bed and adjusting bed clothes
8. Walking
9. Climbing stairs
...

As a classifier, the random forest method performs an implicit feature selection, using a small subset of important variables for the classification, which leads to its superior performance on high-dimensional data. Logistic regression, in contrast, works well when the number of features is limited (e.g., fewer than 100), since it is easy to calculate the coefficient values for all features. Thus, in order to further demonstrate our feature selection methods, we conduct studies using the following datasets:
• The ALS clinical data from the PRO-ACT (Pooled Resource Open-access ALS Clinical Trials) [1] dataset, which consists of more than 10,700 patients’ records with 6318 features including demographics, discriminated ALSFRS features [43], medical and family history, respiratory measurements and other general lab data. For the ALS dataset, we extract features for progression and survival rate predictions.
• The RHC dataset [2] contains data of 5735 patients with 62 attributes, which include not only the patients’ characteristics (e.g., age, sex, education, income, medical insurance) but also various lab test results that describe the severity of patients’ conditions. For the RHC dataset, we extract features for survival rate prediction.
• The STAR*D dataset [122]: STAR*D is a seven-year study involving over 4000 patients that aims to identify the most effective treatment or combination of treatments for patients diagnosed with non-psychotic Major Depressive Disorder (MDD). For the STAR*D dataset, we extract features for predicting disease relapse.
Disease Progression Prediction
Table 3.2: Feature Selection for Progression Prediction

| Scheme | Prediction | Feature Selection | Names of Features |
|---|---|---|---|
| UglyDuckling | ALS Slope | | Onset-delta, Trunk, Q1-Speech, Phosphorus, Q5-Cutting, Leg |
| Our | ALS Slope | Random Forest | Q1-Speech, Q3-Swallowing, Weight, ALSFRS-Total |
• Static features: contain values from patients’ profiles, e.g., time of onset, first symptoms, gender.
• Temporal features: contain functional (ALSFRS) measures [43], body weight, lab test results, etc., whose values change over time.
However, since no one knows exactly which features (patient characteristics) are more important for ALS disease progression, we conduct feature selection experiments to assess which features are more helpful in constructing ALS clinical prediction models.
Multiple feature selection algorithms based on information gain, random forest, etc. were designed using the ALS dataset during the Prize4Life challenge. For example, in order to predict ALS slopes, the UglyDuckling team [5] selects features based on information entropy, where the information gain is computed for all variables. The GuanLab team [6] also selects features based on ALSFRS measures and Forced Vital Capacity (FVC) surveys, where FVC is the volume of gas that can be forcibly blown out after full inspiration, measured in liters.
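For reference, information gain for a discrete feature can be computed as the entropy of the labels minus the expected entropy of the labels after grouping by the feature value. The sketch below shows this generic definition only; it is not the UglyDuckling team's code, and the discretized "fast"/"slow" labels, the variable names and the toy values are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """H(labels) - sum_v p(v) * H(labels | feature = v) for a discrete feature."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Made-up toy data: progression discretized into two classes (an assumption).
progression = ["fast", "fast", "slow", "slow"]
onset_site = ["bulbar", "bulbar", "limb", "limb"]   # hypothetical feature
gender = ["F", "M", "F", "M"]                       # hypothetical feature

gain_onset = information_gain(onset_site, progression)
gain_gender = information_gain(gender, progression)
print(gain_onset, gain_gender)
```

A feature whose values split the patients into pure groups attains the maximum gain, while a feature that is independent of the labels attains a gain of zero, so sorting features by this quantity yields a ranking.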
In our work, we choose to focus on patients’ ALS functional rating scale (ALSFRS) features (Table 3.1) since they are essential elements of the ALS clinical trials. The ALSFRS features reflect the physical functions of ALS patients in carrying out activities of daily living (i.e., how well patients speak, swallow, etc.). Based on the prediction results obtained using the random forest classification method, we have selected the three ALSFRS features with the highest Gini index scores, namely “Q1-Speech”, “Q3-Swallowing” and “ALSFRS-Total”, where “Q1-Speech” and “Q3-Swallowing” are the evaluations of the functional change (e.g., speech, swallowing) of patients over time.
While ALSFRS plays an important role in the diagnosis of ALS, other factors should also be considered so as to improve the prediction accuracy. Thus, we also include the “weight” feature, since people affected by ALS tend to lose weight. This may be caused by several factors: (i) they often have difficulty swallowing, (ii) they burn more calories than unaffected people, and (iii) cells in their intestines may have difficulty extracting nutrients from food. The selected features are shown in Table 3.2, where for “ALSFRS-Total”, “Q1-Speech” and “Q3-Swallowing” we have used both their minimum and maximum values as additional features. Note that the logistic regression method is not used for ALS slope prediction, since this dataset contains many features (more than 100) and it is therefore inefficient to extract useful features using this method.

Table 3.3: Feature Selection for Survival Rate Prediction

| Scheme | Prediction | Feature Selection | Names of Features |
|---|---|---|---|
| GuanLab | ALS Survival | | Age, ALSFRS-Total, FVC1, FVC-percent, Q3-Swallowing, Weight |
| Our | ALS Survival | Random Forest | FVC, FVC-percent, FVC-percent1, Age, ALSFRS-Total, Onset-delta |
| Our | ALS Survival | Logistic Regression | FVC, Q8-walking, Q10-respiratory, Age, ALSFRS-Total, Creatinine |
| Our | RHC Survival | Random Forest | Cat1, Death, Swang1, Gender, Race, Hrt1, Card, Ca, Age, Meanbp1 |
| Our | RHC Survival | Logistic Regression | Cat1, Death, Swang1, Gender, Race, Ninsclas, income, Ca, Age, Meanbp1 |
| Our | Star*D Survival | Random Forest | gender, mdswch1, hwl, ctswch1, menop, mdsch2ct, mdswch2a, mdaug2a |
| Our | Star*D Survival | Logistic Regression | gender, mdswch1, mdaug1, ctswch1, ctaug1, mdsch2ct, mdswch2, mdaug2 |
Survival Rate Prediction
(1) For ALS Survival Rate Prediction
• We use a random forest classifier to select the six features with the highest Gini index values, namely “FVC”, “FVC-percent”, “FVC-percent1”, “Age”, “ALSFRS-Total” and “Onset-delta”, to generate the learning model for ALS survival rate prediction. Here “Onset-delta” is the time between disease onset and the first time the patient was tested in a trial, “FVC-percent” is the percentage of normal lung function (exhalation is gentle and not forced), and “FVC-percent1” is the percentage of the volume of air forcefully exhaled in one second.
• We also use logistic regression analysis to select useful features, which include “ALSFRS-Total”, “FVC”, “Age”, “Q8-Walking”, “Q10-Respiratory” and “Creatinine”. Here “Creatinine” in the blood reflects both the amount of muscle a person has and their kidney function, while “Q8-Walking” and “Q10-Respiratory” are the evaluations of the functional change (e.g., walking, breathing) of patients over time.
The selected features are shown in Table 3.3, where for the “ALSFRS-Total”, “FVC”, “FVC-percent”, “FVC-percent1”, “Q8-Walking” and “Q10-Respiratory” features, we have used both their minimum and maximum values as additional features.
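One way to derive the minimum and maximum values of a temporal feature is to collapse each patient's per-visit measurements into two summary columns. The sketch below illustrates this under the assumption that visits are stored as a list of dictionaries per patient and that missing measurements are either absent or None; the data layout, the `_min`/`_max` naming and the toy values are hypothetical and not taken from the datasets.

```python
def min_max_features(visits, names):
    """Collapse per-visit measurements into <name>_min / <name>_max per patient."""
    summary = {}
    for patient_id, records in visits.items():
        row = {}
        for name in names:
            # Ignore visits where the measurement is absent or None.
            observed = [r[name] for r in records if r.get(name) is not None]
            row[f"{name}_min"] = min(observed) if observed else None
            row[f"{name}_max"] = max(observed) if observed else None
        summary[patient_id] = row
    return summary

# Made-up toy data: two patients with a varying number of visits.
visits = {
    "p1": [
        {"ALSFRS-Total": 38, "FVC": 3.1},
        {"ALSFRS-Total": 35, "FVC": 2.9},
        {"ALSFRS-Total": 31, "FVC": None},
    ],
    "p2": [{"ALSFRS-Total": 40}],
}
summary = min_max_features(visits, ["ALSFRS-Total", "FVC"])
for patient_id, row in summary.items():
    print(patient_id, row)
```

Summarizing in this way gives every patient a fixed-length feature vector regardless of how many visits were recorded, which is what a random forest or logistic regression model expects as input.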
(2) For RHC Survival Rate Prediction
• We select 10 features based on the results of the random forest classification method, where Gini index scores are computed for all variables. The selected features include age, gender, cat1 (primary disease category), ca (no cancer, localized cancer, metastatic cancer), meanbp1 (mean blood pressure), hrt1 (heart rate), swang1 (right heart catheterization performed within first 24 hours), death (estimation of the probability of surviving 180 days after admission), race (black, white, other) and card (cardiovascular diagnosis).
• We also use logistic regression analysis to select important features including age, sex, race, years of education, income, swang1, ninsclas (type of medical insurance: private, Medicare, Medicaid, private and Medicare, Medicare and Medicaid, or none), cat1, ca, death and meanbp1.
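The coefficient-based ranking used with logistic regression can be sketched as follows: fit the model by maximizing the log-likelihood, then order the features by the absolute value of their coefficients. This is a minimal illustration and not the procedure used in this work. Standardizing the features before comparing coefficients, the plain gradient-ascent optimizer, the learning rate, and the toy data (including the "uninformative" feature) are all assumptions.

```python
import math

def standardize(column):
    """Rescale a column to zero mean and unit variance."""
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / len(column)
    std = math.sqrt(var) or 1.0
    return [(v - mean) / std for v in column]

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Maximize the log-likelihood by batch gradient ascent; returns (bias, weights)."""
    n, d = len(X), len(X[0])
    b, w = 0.0, [0.0] * d
    for _ in range(epochs):
        grad_b, grad_w = 0.0, [0.0] * d
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = yi - p                      # gradient of the log-likelihood
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        b += lr * grad_b / n
        w = [wj + lr * gj / n for wj, gj in zip(w, grad_w)]
    return b, w

# Made-up toy data: outcome 1 becomes more likely with age; the second
# feature is constructed to be unrelated to the outcome.
names = ["Age", "uninformative"]
age = [45, 50, 55, 60, 65, 70, 75, 80]
other = [1, 0, 1, 0, 0, 1, 0, 1]
y = [0, 0, 0, 1, 0, 1, 1, 1]

columns = [standardize(age), standardize(other)]
X = [list(row) for row in zip(*columns)]
bias, weights = fit_logistic(X, y)

# Rank features by the magnitude of their standardized coefficients.
ranking = sorted(zip(names, weights), key=lambda kv: abs(kv[1]), reverse=True)
for name, coef in ranking:
    print(f"{name}: {coef:+.3f}")
```

Because every feature receives its own coefficient, this approach remains practical only while the number of features is moderate, which is consistent with restricting logistic regression to the smaller feature sets.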