Classification accuracy and real-time adaptation


6.4.2 Classification accuracy and real-time adaptation

To assess the real-time performance of our classification models, the thresholds used during training to divide self-reported arousal and valence into two classes were also used for testing. However, participants in this study reported significantly lower mean levels of arousal (t=6.22, p<.001) and valence (t=3.99, p<.001) than participants in our previous study. The mean self-reported arousal was 0.87 in the previous study and 0.75 in the current study. Similarly, the mean self-reported valence was 0.85 in the previous study and 0.76 in the current study. This suggests that participants in this study used a wider range of values when reporting their affective states than participants in Study 2. One session from each of two participants was removed from this analysis due to technical problems with the sensors and the Affective Slider. Using these thresholds, the arousal classes comprised 220 low and 65 high arousal data points. Valence had a similar distribution, with 214 negative and 71 positive data points.
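The labelling step described above can be sketched as follows. This is a minimal illustration, not the study's actual code: the arousal threshold of 0.87 is the value quoted later in this section, the treatment of a score exactly at the threshold as "high" is an assumption, and the function names and example scores are hypothetical.

```python
from collections import Counter

# Arousal threshold quoted later in this section (low is "<.87").
AROUSAL_THRESHOLD = 0.87


def label_arousal(score, threshold=AROUSAL_THRESHOLD):
    """Map a self-reported score in [0, 1] to a binary class.

    Assumption: a score exactly at the threshold counts as "high".
    """
    return "low" if score < threshold else "high"


def class_distribution(scores, threshold=AROUSAL_THRESHOLD):
    """Count how many scores fall into each class."""
    return Counter(label_arousal(s, threshold) for s in scores)


# Hypothetical self-reports, for illustration only.
example_scores = [0.68, 0.75, 0.92, 0.97, 0.60]
print(class_distribution(example_scores))

# Share of the majority (low arousal) class in the reported test data.
majority_share = 220 / (220 + 65)
print(f"low-arousal share: {majority_share:.2f}")
```

With the reported counts, low arousal makes up about 77% of the test data, which is why accuracy alone is a weak summary for this dataset.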

Two SVM classifiers were used, one for arousal and one for valence classification. The arousal model was trained with 35 features extracted from the HMD and HR sensors, whereas the valence model was trained with 10 features from the HMD and EMG sensors. Arousal classification achieved an accuracy of 41% and valence classification 42%, both below chance level (50%). Table 6.1 shows the confusion matrices of arousal and valence classification. These

1 The Bonferroni correction is used to avoid Type I errors when making multiple comparisons. It is calculated by dividing the significance level (in this case 0.05) by the number of tests that are being performed.

Table 6.1: Confusion matrices of arousal and valence classification in real time.

Arousal
                Reported arousal
Prediction      Low      High
Low             51       7
High            169      58

Valence Reported

Table 6.2: Accuracies of arousal and valence classification in each session for all participants playing the adaptive version.

Participant   Affect. dimension   Session 1   Session 2   Session 3   Mean
1             Arousal             -           .67         .67         .67
1             Valence             -           .4          .4          .4
2             Arousal             .53         .67         .4          .53
2             Valence             .4          .53         .4          .44
3             Arousal             .67         .07         .2          .31
3             Valence             .53         .67         .13         .44
4             Arousal             .07         0           .13         .07
4             Valence             .53         .2          .33         .35
5             Arousal             .93         .93         .8          .88
5             Valence             .13         .2          .33         .22
6             Arousal             .4          .2          .13         .24
6             Valence             .8          .53         .6          .64
7             Arousal             .27         .13         -           .2
7             Valence             .53         .4          -           .47

results evidence poor classification performance, which led to erroneous real-time adaptation decisions by the affect-based decision layer, although the performance-based decision layer worked as expected. Because of the class imbalance (see Table 6.1), an additional analysis examined the precision and recall of both arousal and valence classification. Precision is the proportion of data points classified as relevant that were actually relevant, whereas recall is the proportion of relevant data points that were correctly classified as relevant. Arousal showed a good recall of 0.89 but a much lower precision of 0.25. Valence classification performed worse, with a recall of 0.31 and a precision of 0.16. F1 scores, which combine precision and recall, were also computed: 0.40 for arousal and 0.21 for valence. These results demonstrate poor real-time performance of the classification models. Participants in the current study differed from those in the previous study (training stage) and reported significantly lower levels of arousal and valence, which could explain the imbalanced distribution of classes in the testing dataset.
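As a worked check of these definitions, the sketch below recomputes precision, recall and F1 for arousal from the counts in Table 6.1. It assumes that high arousal is the positive ("relevant") class, which the text does not state explicitly; the function and variable names are illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Arousal counts from Table 6.1.
# Assumption: "high arousal" is treated as the positive class.
tp = 58   # predicted high, reported high
fp = 169  # predicted high, reported low
fn = 7    # predicted low, reported high

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under this assumption the recall is about 0.89 and the F1 score about 0.40, in line with the values reported above; the precision comes out at roughly 0.255, close to the reported 0.25.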

Table 6.2 presents the arousal and valence classification accuracy for each participant in the adaptive version in every session. The best mean arousal accuracy was 88%, for participant 5. This participant reported high levels of arousal (>.92) throughout the study, which were successfully detected. The worst mean arousal accuracy was 7%, for participant 4, whose mean self-reported arousal was 0.68 (SD: .1). Using the thresholds applied during training, her self-reported arousal was always labelled as low (<.87) but was wrongly classified as high arousal. Similar results were found for participant 3 in sessions 2 and 3.

Valence classification showed a similar pattern. Participant 6 had the best mean valence accuracy, at 64%. The overall mean valence reported by this participant was 0.63 (SD: .19), which was labelled as negative valence and successfully classified by the model. In contrast, participant 5 always self-reported high levels of positive valence (Mean: 0.97; SD: .07), but this was mostly predicted as negative valence, giving the lowest mean classification accuracy (22%). These results, together with the confusion matrices (see Table 6.1), indicate that the arousal and valence models mainly predicted high arousal and negative valence. Accordingly, Table 6.1 shows that most of the self-reported high arousal was successfully recognised, whereas most of the self-reported low arousal was misclassified as high.
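The per-participant comparison can be reproduced from the session accuracies in Table 6.2. The sketch below is illustrative only: the values are transcribed from the table, `None` marks a removed session, and averaging over all retained sessions is an assumption about how the overall accuracies were obtained.

```python
from statistics import mean

# Per-session accuracies from Table 6.2; None marks a removed session.
arousal = {
    1: [None, 0.67, 0.67],
    2: [0.53, 0.67, 0.40],
    3: [0.67, 0.07, 0.20],
    4: [0.07, 0.00, 0.13],
    5: [0.93, 0.93, 0.80],
    6: [0.40, 0.20, 0.13],
    7: [0.27, 0.13, None],
}
valence = {
    1: [None, 0.40, 0.40],
    2: [0.40, 0.53, 0.40],
    3: [0.53, 0.67, 0.13],
    4: [0.53, 0.20, 0.33],
    5: [0.13, 0.20, 0.33],
    6: [0.80, 0.53, 0.60],
    7: [0.53, 0.40, None],
}


def participant_means(table):
    """Mean accuracy per participant, skipping removed sessions."""
    return {
        p: mean([v for v in sessions if v is not None])
        for p, sessions in table.items()
    }


def overall_mean(table):
    """Mean accuracy over all retained sessions of all participants."""
    return mean([v for s in table.values() for v in s if v is not None])


arousal_means = participant_means(arousal)
valence_means = participant_means(valence)

best_arousal = max(arousal_means, key=arousal_means.get)
worst_arousal = min(arousal_means, key=arousal_means.get)
best_valence = max(valence_means, key=valence_means.get)
worst_valence = min(valence_means, key=valence_means.get)

print("arousal best/worst participant:", best_arousal, worst_arousal)
print("valence best/worst participant:", best_valence, worst_valence)
print(f"overall arousal: {overall_mean(arousal):.2f}")
print(f"overall valence: {overall_mean(valence):.2f}")
```

Averaged this way, the session accuracies give roughly 0.41 for arousal and 0.42 for valence, consistent with the overall accuracies reported earlier in this section.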

Although the arousal and valence classification did not work well overall, the adaptation worked better for some participants, depending on individual factors such as motivation. For example, participants 3 and 5, whose motivation was to be the best player, played more aggressively, moving their head and hands very abruptly. The machine learning algorithms mostly classified this as frustration (high arousal and negative valence), which caused the adaptation engine to sustain or reduce the difficulty level, keeping these participants in the easiest levels (1-4). Due to individual differences in preferences or motivations to play video games, participants can experience video games differently or have different playing styles [142]. One of the participants, who had never experienced VR before, reported having a positive experience when interviewed, even though the game was sometimes too difficult for her. This may be explained by the excitement of trying VR for the first time, known as the novelty effect [96]. These individual differences present important challenges for the design of generic adaptation methods and subject-independent machine learning models.

In document Games 4 VRains: Affective Gaming for Working Memory Training in Virtual Reality (Page 113-115)