Subjective cognitive workload rating - Classification using a randomized task order

9.2 Classification using a randomized task order

9.3.2 Subjective cognitive workload rating

Since the working-memory tasks were conducted as block design the subjects were asked for a cognitive workload rating (“1” being too easy up to “7” being too hard) after each level of difficulty. For the math word problems, the participants had to rate their cognitive workload after each trial. The average ratings are stated in Figure 9.2.

N-back task

For the n-back task, the mean cognitive workload rating increases from 3.25 for the easy, to 5.33 for the medium and to 6.25 for the difficult condition. All combination of ratings differed significantly (p < 0.05, paired, t-test).

9 Task order effect in cross-task classification

Reading span task

The average rating values for the reading-span tasks differed significantly (for all three combinations: p < 0.05, paired, t-test), with 2.75 for level 1, 3.67 for level 2 and 5.58 for level 3.

Realistic learning tasks

For the arithmetic task, the mean cognitive workload rating was 2.43 for the easy, 3.23 for the medium and 3.39 for the difficult condition. During algebra tasks participants rated level 1 with 2.43, level 2 with 3.23 and level 3 with 3.39. For both word math problems, level 1 vs. level 2, as well as level 1 vs. level 3 differed significantly (p < 0.05, paired, t-test).

9.4 Classification results

The within-task classification results based on objective labeling and subjective cognitive workload labeling are stated in section 9.4.1, Table 9.1. The cross-task classification results using objective labeling and subjective cognitive workload labeling are shown in section 9.4.2, Table 9.2 and Table 9.3.

9.4.1 Within-task classification

As in section 7.7.1 the classifier was trained and tested using EEG data recorded during the same tasks (datasets B, C, D and E). Furthermore a 10-fold cross validation was used to estimate the accuracy of the trained SVM to guarantee that training and test data do not overlap. This section will firstly present the within-task results with objective labeling, followed by the within-task results using subjective cognitive workload labeling (see Table 9.1).

Objective labeling

In the case of objective labeling, the arithmetic task reached the highest accuracy over all subjects with 76.79 %. As shown in Table 9.1 the reading span task had an accuracy of 73.35 % over all subjects, whereas the n-back task performance over all subjects reached 72.69 %. The algebra task reached lower performance accuracies, the average accuracy was 63.62 %. The within-task classification worked well in almost all cases and reached significant classification accuracies (p < 0.05, permutation test) in average, except the algebra task (p > 0.05, permutation test). Remarkable are the classification results from subject 5, since they are smaller than 50 % for all tasks, except the arithmetic task. Re- moving the channels Fz and Pz the classification accuracies increased significantly for this subject (on average 67.43 %). Thus, it can be assumed, that the poor results are based on artifacts, measured at these electrodes.

9.4 Classification results

Table 9.1:Within-task classification accuracies in % for a two-class problem for all subjects, with objective as well as subjective cognitive workload labeling. The bold values represent the significant accuracies (p < 0.05, permutation test). Row 2 exhibits the datasets used for classifier training, whereas row 3 represents the datasets for classifier testing. B = n-back, C = reading span, D = arithmetic, E = algebra. “−” indicates when a rearrangement of trials in two classes was not possible.

Within-task Classification

Objective Labeling Subjective Workload Labeling

Training B C D E B C D E Test B C D E B C D E Sub 1 73.24 60.26 90.00 75 − − − − Sub 2 80.88 79.73 73.33 85.71 − − − 72.73 Sub 3 79.17 91.03 89.47 50 − − − 60 Sub 4 74.65 67.95 76.47 73.68 − − 58.62 48.28 Sub 5 47.06 47.95 78.95 44.44 − − 64.29 57.14 Sub 6 62.5 73.08 84.21 47.37 − − − − Sub 7 94.44 53.95 52.94 72.22 − − 60 61.54 Sub 8 79.17 82.43 80.00 55.00 − − − 41.38 Sub 9 62.9 78.57 89.47 53.33 − − 68.97 70.83 Sub 10 76.39 87.18 85.00 75.00 − − − 73.33 Sub 11 66.2 85.71 90.00 66.67 − − 73.33 60.71 Sub 12 75.71 72.37 31.58 65 − − 65.52 − Mean 72.69 73.35 76.79 63.62 − − − −

Subjective cognitive workload labeling

Compared to chapter 8, the subjective cognitive workload ratings for class labeling were not so promising in the current study. In some cases the rearrangement of trials in two classes (easy vs. difficult) led to an unbalanced set of data, no efficient classifier training was possible and no significant results could be reached. In the arithmetic tasks subject 2, 3, 6, 8 and 9 were affected by this problem and in the algebra task subject 12. Furthermore, subject 1 and 6 rated each task with the same difficulty value. Thus, a rearrangement of the trials in two classes was not possible for these subjects (see Table 9.1). As in the objective labeling method the arithmetic task reached higher classification accuracies over all subjects (65.12 %) compared to the algebra task (60.66 %). In nearly all applied cases within-task classification worked well, but a significant classification accuracy (p < 0.05, permutation test) could only be reached for one subject in the algebra task.

9 Task order effect in cross-task classification

Table 9.2:Cross-task classification accuracies in % for a two-class problem for all subjects, with objective feature labeling. The bold accuracies represent the significant accuracies (p < 0.05, permutation test). Row 2 states the datasets used for classifier training, whereas row 3 presents the datasets used for classifier testing. B = n-back, C = reading span, D = arithmetic, E = algebra, BC = combination of 2 working-memory tasks and DE = combination of 2 complex learning tasks. In each cross-task classification combination, level 1 vs. level 3 was classified.

Cross-task Classification - Objective Labeling

Train B C BC Test D E D E D E DE Sub 1 25 35 45 30 35 40 43 Sub 2 60 64 60 36 53 36 48 Sub 3 84 40 47 45 79 40 49 Sub 4 94 53 35 32 65 37 53 Sub 5 53 50 53 56 53 56 54 Sub 6 74 58 32 74 68 63 58 Sub 7 59 67 59 50 65 33 43 Sub 8 75 45 50 40 60 55 60 Sub 9 89 33 79 33 95 27 65 Sub 10 55 70 70 60 65 70 65 Sub 11 60 33 65 39 75 39 45 Sub 12 53 60 42 40 53 35 46 Mean 65.1 50.7 53.1 44.5 63.8 44.2 52.3 9.4.2 Cross-task classification

Similar to the previous study (see section 7.7.1) a cross-task classification was accom- plished. A SVM was trained by using dataset B, C, or BC and then tested on EEG data from independent complex learning tasks (dataset D, E and DE). This section will firstly present the within-task results with objective labeling (see Table 9.2), followed by the within-task results using subjective cognitive workload labeling (see Table 9.3). In general, predicting cognitive workload during complex learning tasks based on classifiers trained with working-memory tasks seems to be challenging.

Objective labeling

As it is known from previous chapters 7 and 8 cross-task classification of diverse working- memory tasks is possible. Thus, the combination of working-memory tasks for classifier training and complex learning tasks for classifier testing is the most interesting combination for an adaptive learning environment. Therefore, data recorded during solving a n-back task or a reading span task (B, C and BC) was utilized for classifier training, whereas data recorded during complex learning tasks (D, E and DE) were used for classifier testing. The cross-task classification results based on objective labeling are shown in Table 9.2.

9.4 Classification results

As in the results part of section 7.7.2, n-back reached best results for workload prediction in complex arithmetic tasks (B-D), with 65 %. For this train-test combination, as well as for BC-D best classification accuracies could be achieved on individual basis, with 94 % and 95 %. Further, 7 out of 12 subjects, as well as 8 out of 12 reached a significant classification performance (p < 0.05, permutation test) on individual basis for these train-test combinations. For n-back and algebra data as train-test combination (B-E) a mean accuracy of 50.7 % was achieved. 4 out of 20 subjects reached significant classification accuracies (p < 0.05, permutation test) on individual basis for this cross-task classification. The cross-task classification using reading-span for training and arithmetic respectively algebra tasks for testing (C-D, resp., C-E) resulted in 53.1 % and 44.5 % classification accuracies. For C-D only 4 out of 12 subjects, whereas for C-E merely 1 out of 12 subjects achieved a significant classifier performance (p < 0.05, permutation test) on individual basis. The combination of both working-memory tasks resulted in lower performance rates than training with a single working-memory task (on average: BC-D =63.8 %, BC-E =44.2 % and BC-DE =52.3 %). For the train-test combination BC-E 2 out of 12 subjects, whereas for BC-DE 3 out of 12 subjects achieved a significant classifier performance (p < 0.05, permutation test). None of the classification accuracies averaged over all subjects were significant (p > 0.05, permutation test).

Subjective cognitive workload labeling

Using the subjective cognitive load labeling, the classifier was trained on dataset B, C or BC and evaluated on dataset D, E, or DE. The diverse tasks differed substantially with regard to the level of workload they induced. Based on the method stated in section 8.2.2, the trials were rearranged in two classes (easy vs. difficult). For some subjects, this reordering led to an unbalanced set of data, so that no efficient classifier training was possible and no significant results could be reached. This was the case for subjects 2, 3, 8, 10 and 12 in the arithmetic task. Furthermore, subject 1 and 6 rated each task with the same difficulty value, whereas, a rearrangement of trials in two classes was not possible. The results for cross-task classification of the %ERS/ERD values are stated in Table 9.3. As in chapter 8 no average values will be reported, as mean accuracies were based on different numbers of subjects for each task, which cannot be compared. The majority of classification combinations led to no significant classifications with accuracies around chance level (p > 0.05, permutation test). Overall it can be recognized, that the classifier trained on n-back data reached higher classification accuracies, than SVMs trained on data recorded during solving a reading span task. The train-test combination B-D reached stable classification accuracies on individual basis, with a maximum of 72 %. Further, 4 out of 5 subjects reached a significant classification performance (p < 0.05, permutation test) on individual basis for these train-test combination. The cross-task classification using n-back for classifier training and algebra tasks for testing (B-E) resulted in accuracies of 70 % on individual basis. Over all subjects lower performance rates were reached, merely 3 out of 10 subjects achieved a significant classifier performance (p < 0.05, permutation test). For the train-test combination BC-D and BC-E, 2 out of 12 subjects achieved a significant classifier performance (p < 0.05, permutation test). On individual basis a workload prediction with an accuracy up to

9 Task order effect in cross-task classification

Table 9.3:Cross-task classification accuracies in % for a two-class problem for all subjects, with subjective cognitive workload labeling. The bold accuracies represent the significant accuracies (p < 0.05, permutation test). Row 2 reports the datasets used for classifier training. Row 3 exhibit the datasets used for classifier testing. B = n-back, C = reading span, D = arithmetic, E = algebra, BC = combination of 2 working-memory tasks and DE = combination of 2 complex learning tasks. “−” indicates when a rearrangement of trials in two classes was not possible. In each cross-task classification combination, level 1 vs. level 3 was classified.

Cross-task Classification - Subjective Cognitive Workload Labeling

Train B C BC Test D E D E D E DE Sub 1 − − − − − − − Sub 2 − 64 − 55 − 68 50 Sub 3 − 47 − 43 − 50 61 Sub 4 72 48 55 55 69 52 66 Sub 5 61 71 54 57 64 71 61 Sub 6 − − − − − − − Sub 7 68 62 56 54 56 50 47 Sub 8 − 56 − 52 − 59 53 Sub 9 66 33 59 42 59 29 58 Sub 10 − 43 − 57 − 57 53 Sub 11 67 54 53 50 57 54 55 Sub 12 − 79 − 48 − 59 55

69 % and 71 % was possible. The cross-task classification using the combination of n-back and reading span data for classifier training and the dataset from both complex learning tasks for classifier testing (BC-DE) resulted in low performance rates. Only 1 subject out of 10 reached a significant classifier performance (p < 0.05, permutation test) with 66 % classification accuracy. Reading span was worst suited for workload prediction in complex learning tasks. For the C-D and C-E cross-task classification, no significant performance (p > 0.05, permutation test) could be reached on individual basis. Further, the classifier performance was around chance level (< 60 %) for all subjects. Rearranging the trials into two classes based on subjective cognitive workload ratings led to unbalanced classes. As in the previous chapter 8, the AUC was calculated for the most important combinations (BC-D and BC-E). The AUC is a more stable and reliable performance measurement for unbalanced data. According to the findings in chapter 8 the AUC results were not significant for all subjects (AUC < 0.65, BC-D : ∅ = 0.48, BC-E : ∅ = 0.44). The classification accuracies and the AUC-values are not significant for the results current study. This might indicate that cross-task classification using these type of working-memory tasks, as well as these complex learning tasks, is challenging.

9.5 Discussion

The study examines, if the task order induces an effect for cross-task classification. A pseudo-randomized order for task presentation was used to ensure, classification performances are based on changes in workload only and not due to order effects (see chapter 7). As in chapter 7, the accuracy using cross-task classification is low. The combination of the n-back and the reading span task for classifier calibration led to best classification accuracies for workload prediction in complex learning tasks on individual basis. But compared to the results of chapter 8, a decrease of the classification accuracy can be noticed, by using the subjective cognitive workload rating. These results led to the assumption, that there might be a strong order effect in the data, which might bias the classification. Thus, the precise classification results are mainly based on non-stationarities over time.

Presentation order of tasks

The main problem in a real learning setting is handling learning effects. Two materials with identical degree of difficulty will impose a different amount of cognitive workload at different times during learning. Additionally, certain materials will be too complex to be understood at the beginning of a learning phase, while they might be quite easy at a later point in the learning phase. This implies, a randomized task order for classifier testing cannot be realized in a real learning setting, because the instructional materials need to be presented in an increasing order of difficulty. Confounding task difficulty and presentation time is inevitable for learning materials. To avoid these confounds in future applications, methods for covariate shift adaptation which alleviate non-stationarities [159, 160] can help to improve classification accuracies.

9.6 Conclusion

In conclusion, the task order of the presented learning material has an effect on the classifier performance. As the presentation of a fixed order of task-difficulty is necessary in real-world learning environments, cross-task classification seem to be too prone to con- foundations. Using cross-task classification in future studies, working-memory tasks, as well as complex learning tasks, have to be modified in their complexity to invoke similar patterns in EEG. Furthermore, generalizable classification methods should be used, which are less susceptible to non-stationarities induced by time effects.

10 Cross-subject workload prediction

As the cross-task approach, presented in chapter 7, leads to classification performances around chance level a cross-subject workload prediction based on linear regression will be introduced in this chapter. The possibility to calibrate a generalized regression model independently from each subject is an essential benefit for realizing an adaptive learning environment. The results presented in the following sections are partially published in [100].

10.1 Study design

An introduction to the participants data, the EEG recordings, the task design, as well as the procedure will be given in the following sections.

10.1.1 Participants and EEG recordings

A total of 10 subjects, 4 male and 6 female aged between 17 and 32 (median age 26) were recruited for this EEG experiment. A set of 29 active electrodes (actiCap, BrainProducts GmbH), attached to the scalp and placed according to the extended International Electrode 10 - 20 Placement System [36] (FPz, AFz, F7, F3, Fz, F4, F8, FT7, FC3, FCz, FC4, FT8, T7, C3, Cz, C4, T8, CPz, P7, P3, Pz, P4, P8, PO7, POz, PO8, O1, Oz, O2), was used to record EEG signals. Three additional electrodes were utilized to record EOG signals; two placed horizontally at the outer canthus of the left and right eye to measure horizontal eye movements and one placed in the middle of the forehead between the eyes to measure vertical eye movements. EOG and EEG signals were amplified by two 16- channel biosignal amplifier systems (g.USBamp, g.tec). The sampling rate was 512 Hz and the impedance of each electrode was < 5 kΩ. EEG data was high-pass filtered at 0.1 Hz and low-pass filtered at 60 Hz during the recording. Furthermore, a notch-filter was applied between 48 Hz − 52 Hz to filter power line noise.

10.1.2 Paradigm

Each subject had to solve 240 addition tasks with an increasing level of difficulty while EEG was measured. The math problems were presented in blocks with six different degrees of difficulty. While the first block consisted of 90 easy problems, the following five blocks with increasing difficulty consisted of 30 math problems each. Depending on the level of difficulty, two numbers between one and four digits had to be added, including a different amount of carry effects. Each trial, in which an addition task should be solved, consisted of three phases (see Figure 10.1). First, the calculation phase occurred where

10 Cross-subject workload prediction

Fixation

cross Fixationcross

Calculation task 5000 S 5000 5000 TimeS(ms) 240STrials IA max.S 3500 Response 1500 ISI

Figure 10.1:Schematic flow of the addition task. The grey line indicates the activation interval (IA), followed by a black dashed line representing the response phase. The white line indicates the inter-stimulus interval (ISI). The parts with the stripped surface indicate shortened time windows for better visualization.

the problem to be solved was shown for 5 sec. Subsequently, subjects had 3.5 sec to type in their result, followed by an inter-trial interval of 1.5 sec, resulting in a total length of approximately 42 min. To avoid the classifier being based on perceptual-motor confounds, the time windows used for analyzing EEG data should not contain motor events. As typ- ing in the answer leads to motoric artifacts, only the calculation phase was used for EEG analysis.

Measure of task difficulty

As postulated from Kantowitz [161], increasing task difficulty always increases workload, since by definition, an increase of task difficulty demands additional workload capacity. Further, an increase of workload is characterized by changes in the theta-, alpha- and beta- frequency bands [51] in the EEG. The presented math problems in this study varied in degrees of difficulty as measured by the information content (Q) of an arithmetic task, according to Thomas [162]. The paper and pencil tasks under which Q was derived were designed to reflect typical mental arithmetic performance, by individually adding the single digits and including the carry in the next single digit addition. Given an arithmetic problem x+ y =?, with xi being the i-th digit of x and yi being the i-th digit of y, the difficulty of

the problem x + y will be described by the information content Q(x + y). Q(x + y) is calculated by separating the multi-digit problem into corresponding single-digit problems and calculating the sum of all single digit problems, taking occurring carries into account.

Q(x + y) = Q(x1+ y1) + Q(x2+ y2+ c2) + . . . + Q(xn+ yn+ cn) (10.1)

where x1 is the rightmost digit of number x and xn the leftmost digit of number x. cn

indicates the carry, coming from the single digit addition before (xn−1+ yn−1). For an

addition problem with two single-digit numbers including no carry effect, the Q-value is generally calculated as follows:

Q(x1+ y1) = log(x1+ y1+ res(x1+ y1)) (10.2)

In document EEG workload prediction in a closed-loop learning environment (Page 105-115)