Evaluation Strategy - Tackling Data Imbalance


5.1.3 Evaluation Strategy

Machine learning models can be evaluated in several ways, depending on how the problem is specified. Two widely used methods are stratified cross-validation and randomly splitting the data into training, validation and test sets. However, in deep learning on very large datasets, such as those in computer vision, cross-validation is less common because the dataset contains millions of training examples, which makes model evaluation and hyperparameter tuning computationally infeasible. In our case, because the dataset is comparatively small and we wanted to assess how well the learned model performs for each driver, we employed Leave-One-Subject-Out Cross-Validation (LOOCV). In this method, the model is trained from scratch with the data of one driver held out as the validation set and the data of the remaining drivers used as the training set (see Figure 5.3). This process is repeated for every driver, and performance metrics are calculated on the held-out driver’s data. The average validation scores are then used for hyperparameter tuning. LOOCV also better reflects model performance on unseen drivers, because in each training cycle the network never sees the data of the held-out driver used for validation. However, because the validation results of this procedure guide hyperparameter optimization, there is a considerable risk that the network overfits the dataset. To make sure the network generalizes well, we used the 20% hold-out test set for the final evaluation of the optimal models found via LOOCV.
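The splitting scheme described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in this work: the names `leave_one_subject_out`, `cross_validate` and `fit_and_score` are hypothetical, and `fit_and_score` stands in for whatever routine trains a fresh model on the training indices and returns a validation score. Libraries such as scikit-learn offer a comparable splitter (`LeaveOneGroupOut`).

```python
from collections import defaultdict
from statistics import mean


def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, val_indices), one fold per subject."""
    by_subject = defaultdict(list)
    for idx, subject in enumerate(subject_ids):
        by_subject[subject].append(idx)
    for held_out in sorted(by_subject):
        val_idx = by_subject[held_out]
        # Every sample from every other subject goes into the training set.
        train_idx = sorted(
            i for subject, idxs in by_subject.items() if subject != held_out for i in idxs
        )
        yield held_out, train_idx, val_idx


def cross_validate(subject_ids, fit_and_score):
    """Train once per held-out subject and average the validation scores.

    fit_and_score(train_idx, val_idx) is a hypothetical callback that trains a
    model from scratch and returns a single validation score.
    """
    scores = {}
    for held_out, train_idx, val_idx in leave_one_subject_out(subject_ids):
        scores[held_out] = fit_and_score(train_idx, val_idx)
    return scores, mean(scores.values())


if __name__ == "__main__":
    # Toy data: six samples recorded from three drivers (IDs are illustrative).
    drivers = [2, 2, 3, 3, 3, 4]
    for held_out, train_idx, val_idx in leave_one_subject_out(drivers):
        print(f"validate on driver {held_out}: train={train_idx} val={val_idx}")
```

Because the split is made by driver rather than by sample, no data from the validation driver can leak into training, which is the property the procedure relies on.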

5.1.4 Performance Measures

We evaluated model performance using two widely used metrics, F-score (Sokolova and Lapalme, 2009) and Cohen’s Kappa (Wood, 2007). Unlike accuracy, which has no intrinsic knowledge of the class distribution, these measures take data imbalance into account. Accuracy alone gives misleading results on imbalanced datasets when a model simply predicts the majority class most of the time.

Chapter 5. Experiments and Discussion

Figure 5.1: Class label distribution by driver after applying SMOTE. [Bar chart; x-axis: Driver (2, 3, 4, 6, 7, 18, 21, 24, 25, 26, 29); y-axis: Frequency (0 to 40000); series: Under-aroused, Normal, Over-aroused.] The over-aroused and normal minority classes are oversampled by 25% and 50% respectively to balance the dataset. Table A2 in the Appendix shows which classes are oversampled for each driver, based on the number of data points.

Figure 5.2: Overall class label distribution after applying SMOTE. [Bar chart; x-axis: Class Label (Under-aroused, Normal, Over-aroused); y-axis: Frequency (0 to 400000).] The graph shows the combined class label distribution of the eleven drivers after applying SMOTE and Tomek Links to oversample the minority classes.

Figure 5.3: Illustration of leave-one-subject-out cross-validation. [Diagram; rows: Iteration 1, Iteration 2, Iteration 3, ..., Iteration N; each row marks a single validation set among the total number of subjects.] The data of each subject is treated as one fold, where N is the total number of drivers. The model is trained on N−1 folds and evaluated on the N-th held-out fold, which contains one driver’s own data. This process is used for hyperparameter tuning or for finding the optimal architecture. The 20% hold-out test set is used for the final evaluation to determine model generalizability.

F-score

F-score combines precision (positive predictive value) and recall (sensitivity). Precision is the fraction of instances assigned to a class that actually belong to that class. Recall, on the other hand, is the fraction of the actual instances of a class that are correctly identified, without taking false positives into account. In the multi-class setting, we first calculate the F-score for each class independently and then average the per-class values to obtain a single number representing model performance over all classes. The formulae given in Equation 5.1 were used to calculate the weighted F-score, which weights each class by its number of true instances.

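\[
\begin{aligned}
\text{Precision: } p_c &= \frac{tp}{tp + fp} \\
\text{Recall: } r_c &= \frac{tp}{tp + fn} \\
\text{F-score: } f_c &= \frac{2 \times p_c \times r_c}{p_c + r_c} \\
\text{Weighted F-score: } F &= \frac{\sum_{c=1}^{N_C} w_c \times f_c}{\sum_{c=1}^{N_C} w_c}
\end{aligned}
\tag{5.1}
\]

Here tp, fp and fn denote the numbers of true positives, false positives and false negatives, respectively, for class c. Likewise, N_C is the total number of classes and w_c is the weight of class c, i.e. the total number of instances that actually belong to class c.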
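As a worked illustration of the weighted F-score, the following sketch computes the per-class and weighted values from two label lists. It assumes the standard definition of recall, tp / (tp + fn), and takes the class weight to be the number of true instances of the class. The function name `weighted_f_score` and the toy labels are illustrative only.

```python
def weighted_f_score(y_true, y_pred):
    """Return (weighted F-score, per-class F-scores) for two equal-length label lists."""
    per_class = {}
    weighted_sum = 0.0
    total_weight = 0
    for c in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # Guard against empty denominators (class never predicted or never hit).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        weight = tp + fn  # number of true instances of class c
        per_class[c] = f
        weighted_sum += weight * f
        total_weight += weight
    return weighted_sum / total_weight, per_class


if __name__ == "__main__":
    # Toy labels, not data from the experiments.
    y_true = ["under", "under", "under", "normal", "normal", "over"]
    y_pred = ["under", "under", "normal", "normal", "normal", "normal"]
    score, per_class = weighted_f_score(y_true, y_pred)
    print(per_class, round(score, 3))
```

In this toy case the rare "over" class is never predicted, so its F-score is zero and it pulls the weighted score down in proportion to its share of the data, whereas plain accuracy would not single it out.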


Kappa

Kappa measures how well the classifier performed compared to how well it would have performed simply by chance. More formally, Cohen’s Kappa is a measure of the overall agreement between two raters classifying items into a given set of k categories. The formula for Kappa is given by Equation 5.2, where p_ii is the proportion of examples that both raters classify into category i, p_i+ is the proportion of examples that rater A assigns to category i, and p_+i is the proportion assigned to category i by rater B. The denominator is a normalizing factor that makes K equal to 1 under perfect agreement. The kappa statistic takes a minimum value of −1 in the case of complete disagreement and a maximum of 1 for perfect agreement.
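The kappa statistic can be computed directly from a confusion matrix, as in the sketch below. It assumes that the two "raters" are the ground-truth labels (rows) and the classifier's predictions (columns); the function name `cohens_kappa` and the example matrices are illustrative and not taken from the experiments.

```python
def cohens_kappa(confusion):
    """Cohen's kappa for a square confusion matrix given as a list of rows of counts."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: sum of the diagonal proportions p_ii.
    observed = sum(confusion[i][i] for i in range(k)) / total
    # Marginal proportions p_i+ (rows, rater A) and p_+i (columns, rater B).
    row_marginals = [sum(confusion[i]) / total for i in range(k)]
    col_marginals = [sum(confusion[i][j] for i in range(k)) / total for j in range(k)]
    # Agreement expected by chance: sum over i of p_i+ * p_+i.
    expected = sum(r * c for r, c in zip(row_marginals, col_marginals))
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    print(cohens_kappa([[5, 0], [0, 5]]))      # perfect agreement
    print(cohens_kappa([[0, 5], [5, 0]]))      # complete disagreement
    print(cohens_kappa([[25, 25], [25, 25]]))  # agreement at chance level
```

A classifier that always predicts the majority class can score high accuracy on imbalanced data, but its observed agreement is then no better than the chance term, so its kappa is 0.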

\[
K = \frac{\sum_i p_{ii} - \sum_i p_{i+} \times p_{+i}}{1 - \sum_i p_{i+} \times p_{+i}} \tag{5.2}
\]

5.1.5 Implementation

TensorFlow was used to implement all the deep neural network models because of the flexibility and efficiency it provides for implementing complex architectures (Abadi et al., 2016). In addition, it has a strong open-source community and well-established support for deployment on both cloud and mobile devices. For the oversampling methods, the open-source library “imbalanced-learn” (Lemaître et al., 2017) was used. It provides implementations of several popular sampling techniques and is compatible with widely used machine learning libraries and different versions of Python.

5.1.6 Baselines

Our plain baselines are a 4-layer feed-forward neural network and a denoising autoencoder used for unsupervised pre-training, in which the decoder was replaced with 2 additional layers for supervised learning (see Section 4.2). These baseline models were included to motivate the use of convolutional and recurrent neural networks for supervised sequential learning. The baseline models do not take the sequential nature of physiological signals into account: every sample is treated independently of the others, which does not hold for sequences, where nearby points depend strongly on each other. Similarly, these models have no concept of weight sharing, which is at the core of more sophisticated neural network architectures and significantly reduces the number of parameters in the model. Nevertheless, the denoising autoencoder baseline can itself be seen as an improvement over a typical feed-forward neural network.
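To make the point about weight sharing concrete, the following sketch compares the parameter count of a fully connected layer with that of a one-dimensional convolutional layer on the same input. All sizes (300 time steps, 4 channels, 128 units, kernel size 5) are hypothetical values chosen for illustration and do not correspond to the architectures used in this work.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


def conv1d_params(kernel_size, in_channels, n_filters):
    """Weights plus biases of a 1-D convolution.

    The same kernel is reused at every time step, so the count does not
    depend on the length of the input sequence.
    """
    return kernel_size * in_channels * n_filters + n_filters


# Hypothetical input: 300 time steps of 4 physiological channels.
TIME_STEPS, CHANNELS, UNITS = 300, 4, 128

dense = dense_params(TIME_STEPS * CHANNELS, UNITS)
conv = conv1d_params(kernel_size=5, in_channels=CHANNELS, n_filters=UNITS)

if __name__ == "__main__":
    print(f"dense layer: {dense} parameters")
    print(f"conv1d layer: {conv} parameters")
```

Under these assumed sizes the dense layer needs a separate weight for every input position, while the convolutional layer needs only one small kernel per filter.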


5.2 Results

5.2.1 Validation of Baseline

Result 1: The DAE baseline performed slightly better than the NN model with sigmoid activation.

We first evaluate the baseline models, since their classification performance serves as the reference against which the more complex architectures are compared. The cross-validation results for our baseline models (i.e. the feed-forward neural network and the denoising autoencoder), trained on physiological signal segments of different window sizes, are presented in Tables 5.1 and 5.3. The plain neural network reached an average validation F-score of 0.75 and a Kappa of 0.53 for a window size of 30 seconds. Likewise, pre-training with the denoising autoencoder, followed by two additional layers for supervised training, achieved an F-score of 0.76 and a Kappa of 0.55 for the 30-second window. These results show some improvement over the first baseline (i.e. the 4-layer neural network), but it is important to note that the improvement could be due to the additional layers as well as to the pre-training. The key takeaway is that models with sufficient capacity that use pre-trained weights for supervised classification can outperform shallow models with randomly initialized weights.

To determine the effect of the non-linear activation function on model performance, we evaluated the tanh and softsign non-linearities in addition to sigmoid for the optimal window size of 30 seconds. The baseline model architecture is preserved; only the activation function of each layer is changed. We found softsign to work reasonably well compared to the others on our dataset (see Table 5.2). Furthermore, the optimal baseline models were evaluated on the 20% hold-out test set, where the neural network model achieves an F-score of 0.75 and a Kappa of 0.53. Likewise, the denoising autoencoder model reaches an F-score of 0.76 and a Kappa of 0.56 on the test set. The confusion matrices for the validation and test sets, generated by combining the individual matrices of each driver, are shown in Figures 5.4 and 5.5. Both baseline models performed poorly at detecting the over-aroused state. For all later comparisons, we use the optimal results achieved by the baseline models for the 30-second window size.
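For reference, the three activation functions compared here can be written out as follows. This is a plain-Python sketch of their standard definitions, not the TensorFlow code used in the experiments; the sigmoid is written in a numerically stable form to avoid overflow for large negative inputs.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, output in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # safe: x < 0, so exp(x) cannot overflow
    return e / (1.0 + e)


def tanh(x):
    """Hyperbolic tangent, output in (-1, 1), saturates exponentially."""
    return math.tanh(x)


def softsign(x):
    """Softsign, x / (1 + |x|), output in (-1, 1), saturates polynomially."""
    return x / (1.0 + abs(x))


if __name__ == "__main__":
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(f"x={x:+.1f}  sigmoid={sigmoid(x):.3f}  tanh={tanh(x):+.3f}  softsign={softsign(x):+.3f}")
```

Softsign and tanh are both zero-centred, but softsign approaches its limits more slowly, so its gradients shrink less abruptly for large inputs.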

Table 5.1: Baseline - Neural Network Results.

Window Size (secs.)   Validation F-score   Validation Kappa   Test F-score    Test Kappa
10                    0.749 ± 0.134        0.517 ± 0.226      0.757 ± 0.135   0.526 ± 0.218
30                    0.757 ± 0.121        0.531 ± 0.214      0.757 ± 0.132   0.535 ± 0.217
60                    0.740 ± 0.148        0.517 ± 0.259      0.739 ± 0.156   0.522 ± 0.264
90                    0.747 ± 0.130        0.538 ± 0.215      0.760 ± 0.136   0.568 ± 0.218


Table 5.2: Evaluation of non-linear activation functions with Neural Network.

Activation Function   Validation F-score   Validation Kappa   Test F-score    Test Kappa
tanh                  0.768 ± 0.124        0.533 ± 0.213      0.779 ± 0.125   0.561 ± 0.193
softsign              0.779 ± 0.136        0.563 ± 0.233      0.775 ± 0.133   0.562 ± 0.205

Table 5.3: Baseline - Denoising Autoencoder Results.

Window Size (secs.)   Validation F-score   Validation Kappa   Test F-score    Test Kappa
10                    0.736 ± 0.168        0.524 ± 0.255      0.748 ± 0.156   0.538 ± 0.237
30                    0.762 ± 0.130        0.558 ± 0.219      0.763 ± 0.138   0.563 ± 0.231
60                    0.763 ± 0.112        0.549 ± 0.199      0.761 ± 0.121   0.553 ± 0.195
90                    0.729 ± 0.203        0.548 ± 0.260      0.727 ± 0.189   0.529 ± 0.253

In document Deep physiological arousal detection in a driving simulator (Page 57-62)