Performance measurements for evaluating prediction methods

2.6 Classification methods used for EEG data

2.6.4 Performance measurements for evaluating prediction methods

To compare the performance of diverse classification methods, a wide range of performance measures is available. This section introduces all metrics that are used in the following studies.


Correlation coefficient

Pearson's correlation coefficient (CC) is used to observe the statistical relationship between two variables X and Y. The correlation coefficient applied to samples can be derived from the population Pearson's correlation coefficient:

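\begin{align}
\rho_{X,Y} &= \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} \tag{2.25}\\
&= \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} \tag{2.26}\\
&= \frac{E(XY) - E(X)\,E(Y)}{\sqrt{E(X^2) - E(X)^2}\,\sqrt{E(Y^2) - E(Y)^2}} \tag{2.27}
\end{align}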

where cov is the covariance, $\sigma_X$ is the standard deviation of X, $\mu_X$ is the mean of X and E is the expected value. The CC of two datasets $\{x_1, \ldots, x_n\}$ and $\{y_1, \ldots, y_n\}$, containing n values each, is calculated by substituting the covariances and variances in (2.27) with their estimates on the data points $x_i$ and $y_i$. Therefore, the CC, mostly given as r, is defined as follows:

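$$r = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\,\sqrt{n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}} \tag{2.28}$$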

The CC ranges from −1 to 1 and can thus be positive as well as negative. If the CC is close to 0, there is no linear correlation between the actual and the predicted variables. A CC of 1 means the two variables correlate perfectly, whereas a CC of −1 indicates a perfect negative linear relationship.
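The sample formula (2.28) can be illustrated with a minimal Python sketch. The function name `pearson_r` and the example values are assumptions chosen for illustration only; they are not taken from the studies.

```python
import math


def pearson_r(x, y):
    """Sample correlation coefficient r, following Eq. (2.28)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    # Note: the denominator is zero if x or y is constant; r is undefined then.
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical example data
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [2.0, 4.0, 6.0, 8.0, 10.0]  # increases linearly with `actual`
inverted = [5.0, 4.0, 3.0, 2.0, 1.0]    # decreases linearly with `actual`

r_pos = pearson_r(actual, predicted)  # approximately  1
r_neg = pearson_r(actual, inverted)   # approximately -1
```

Because r only measures the linear relationship, the first example yields a value of approximately 1 although the predicted values are twice as large as the actual ones.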

Squared Correlation Coefficient

The squared correlation coefficient ($r^2$) is used for performance estimation and feature selection [95, 96] and can be regarded as a correlation measure. The $r^2$ value describes the proportion of the variance of a feature that is explained by the class membership. It lies between 0 and 1, where 0 stands for no correlation and 1 for perfect correlation. Used as a performance estimate, it makes it possible to compare diverse prediction methods.
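As a rough illustration, the $r^2$ value of a single feature can be computed as the squared correlation between the feature values and the class labels. The sketch below assumes a binary problem with labels coded as 0 and 1; the function name and the data are hypothetical.

```python
def squared_correlation(feature, labels):
    """Squared correlation between feature values and numeric class labels."""
    n = len(feature)
    mean_f = sum(feature) / n
    mean_l = sum(labels) / n
    cov = sum((f - mean_f) * (l - mean_l) for f, l in zip(feature, labels))
    var_f = sum((f - mean_f) ** 2 for f in feature)
    var_l = sum((l - mean_l) ** 2 for l in labels)
    return cov ** 2 / (var_f * var_l)


labels = [0, 0, 0, 1, 1, 1]
informative = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]    # differs clearly between classes
uninformative = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]  # identical in both classes

r2_informative = squared_correlation(informative, labels)      # close to 1
r2_uninformative = squared_correlation(uninformative, labels)  # 0
```

In a feature selection setting, features with a high $r^2$ value would be preferred over features with a value near 0.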

Root Mean Square Error

The Root Mean Square Error (RMSE) is the square root of the mean squared error (MSE). The MSE is a measure of how close a fitted line is to the data points. The differences are squared so that negative values do not cancel positive values and so that data points far from the fitted line are penalized more strongly than those lying almost on it. The smaller the MSE, the closer the fit is to the data.

2 Fundamentals for developing EEG-adaptive learning environments

The MSE is calculated as follows:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2 \tag{2.29}$$

where $x_i$ and $y_i$ are data points from the datasets $\{x_1, \ldots, x_n\}$ and $\{y_1, \ldots, y_n\}$ containing n values.

Calculating the square root of the MSE leads to the RMSE, described as follows:

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2} \tag{2.30}$$

The RMSE can thus be interpreted as the typical distance of a data point from the fitted line, measured along a vertical line.
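Equations (2.29) and (2.30) translate directly into code. The following sketch uses hypothetical function names and example values.

```python
import math


def mse(x, y):
    """Mean squared error, following Eq. (2.29)."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def rmse(x, y):
    """Root mean square error, following Eq. (2.30)."""
    return math.sqrt(mse(x, y))


# Hypothetical example data
predicted = [2.0, 4.0, 6.0, 8.0]
actual = [1.0, 5.0, 6.0, 10.0]

mse_value = mse(predicted, actual)    # (1 + 1 + 0 + 4) / 4 = 1.5
rmse_value = rmse(predicted, actual)  # sqrt(1.5)
```

The single deviation of 2 contributes four times as much to the MSE as each deviation of 1, which illustrates the stronger penalty for distant data points.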

Global deviation

The global deviation (GD) is an additional performance measure for observing the statistical relationship between actual values and the corresponding predicted values. GD is defined as the squared average difference [97]:

$$\mathrm{GD}(X, Y) = \left( \frac{1}{n} \sum_{i=1}^{n} (x_i - y_i) \right)^2 \tag{2.31}$$

where $y_i$ denotes the actual value at time instance i and $x_i$ is the corresponding predicted value. The smaller the GD value, the smaller the bias error of the prediction. In contrast to the GD, the RMSE and the CC also capture noise. Furthermore, among these measures the GD is the only one that allows a reasonable estimation of the prediction bias.
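The difference between GD and RMSE can be made concrete with a small sketch. It assumes that GD is the squared mean of the signed differences, as in (2.31); the function names and the data are hypothetical.

```python
import math


def global_deviation(predicted, actual):
    """Squared mean of the signed differences, following Eq. (2.31)."""
    n = len(predicted)
    mean_difference = sum(p - a for p, a in zip(predicted, actual)) / n
    return mean_difference ** 2


def rmse(predicted, actual):
    """Root mean square error, following Eq. (2.30)."""
    n = len(predicted)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


actual = [2.0, 2.0, 2.0, 2.0]
noisy_unbiased = [1.0, 3.0, 1.0, 3.0]  # errors of -1 and +1 cancel on average
biased = [3.0, 3.0, 3.0, 3.0]          # constant offset of +1

gd_noisy = global_deviation(noisy_unbiased, actual)  # 0.0
gd_biased = global_deviation(biased, actual)         # 1.0
rmse_noisy = rmse(noisy_unbiased, actual)            # 1.0
rmse_biased = rmse(biased, actual)                   # 1.0
```

Both predictions have the same RMSE, but only the prediction with the constant offset has a non-zero GD. In this example, the RMSE does not distinguish noise from bias, whereas the GD responds to the bias only.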

Accuracy

The classification accuracy (Acc) is widely used as a performance measure when the number of classes and the duration of a trial are constant.

$$\mathrm{Acc} = \frac{\text{\# of correctly classified trials}}{\text{\# of total trials}} \tag{2.32}$$

Acc is a straightforward measure but has a limitation: the classification accuracy of less frequent classes has a smaller impact on the overall value.
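The limitation for unbalanced classes can be illustrated with a short sketch. The class proportions and the "always predict the majority class" classifier are assumptions made for the example.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of correctly classified trials, following Eq. (2.32)."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def per_class_accuracy(true_labels, predicted_labels):
    """Accuracy computed separately for every class in true_labels."""
    result = {}
    for c in set(true_labels):
        pairs = [(t, p) for t, p in zip(true_labels, predicted_labels) if t == c]
        result[c] = sum(t == p for t, p in pairs) / len(pairs)
    return result


# Hypothetical unbalanced dataset: 90 trials of class 0, 10 trials of class 1
true_labels = [0] * 90 + [1] * 10
always_majority = [0] * 100  # classifier that ignores the rare class

overall = accuracy(true_labels, always_majority)            # 0.9
by_class = per_class_accuracy(true_labels, always_majority)  # {0: 1.0, 1: 0.0}
```

Although the rare class is never detected, the overall Acc is still 0.9, because the 10 misclassified trials carry little weight.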

Cross-validation

Cross-validation is a split-validation technique in which a dataset is partitioned such that one subset is not used for model training but is reserved for evaluating the classification performance, resulting in a single entry of accuracy statistics. k-fold cross-validation divides the data into k subsets; k − 1 subsets are used for classifier training and the retained subset is used for classifier validation. This procedure is repeated k times so that every subset is used once for testing and k − 1 times for training. The final Acc result is reported as the average over all folds [98].

Figure 2.9: Three ROC curves representing an excellent, a good and a useless classifier for a binary classification problem, with the respective AUC values.
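The k-fold cross-validation procedure can be sketched as follows. The nearest-mean classifier, the one-dimensional feature and all function names are assumptions made to keep the example self-contained; they are not the classifiers used in the studies.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_nearest_mean(features, labels):
    """Toy classifier: store the mean feature value of every class."""
    means = {}
    for c in set(labels):
        values = [f for f, l in zip(features, labels) if l == c]
        means[c] = sum(values) / len(values)
    return means


def predict_nearest_mean(means, feature):
    """Assign the class whose mean is closest to the feature value."""
    return min(means, key=lambda c: abs(feature - means[c]))


def cross_validate(features, labels, k=5, seed=0):
    """Return the mean accuracy over all folds and the per-fold accuracies."""
    fold_accuracies = []
    for test_idx in k_fold_indices(len(features), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(features)) if i not in held_out]
        model = train_nearest_mean([features[i] for i in train_idx],
                                   [labels[i] for i in train_idx])
        correct = sum(predict_nearest_mean(model, features[i]) == labels[i]
                      for i in test_idx)
        fold_accuracies.append(correct / len(test_idx))
    return sum(fold_accuracies) / len(fold_accuracies), fold_accuracies


# Hypothetical, well-separated one-dimensional data
features = [i / 10 for i in range(10)] + [2 + i / 10 for i in range(10)]
labels = [0] * 10 + [1] * 10

mean_acc, fold_accs = cross_validate(features, labels, k=5)
```

Every trial appears in exactly one test fold, so each trial contributes to the reported average exactly once.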

Bootstrapping

To provide a more robust statement of Acc, an approach based on confidence intervals can be used, e.g., the non-parametric bootstrapping method [98]. This method can be used to assess variations in the estimated model accuracy. For bootstrapping, no assumption is made regarding the populations of the input variables. Given a training set $D$ of size n, bootstrapping generates m new training sets $D'$, each of size n, by sampling from $D$ uniformly and with replacement. Sampling with replacement is done multiple times to estimate the mean, variability and variance of the model outputs. Finally, the m models are combined by averaging the classification output.
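The resampling step can be sketched in a few lines. As a simplifying assumption, the example below does not retrain m models. Instead, it resamples the per-trial outcomes of a single classifier (1 = correct, 0 = incorrect) to obtain a percentile confidence interval for Acc. The function name, the number of resamples and the data are hypothetical.

```python
import random


def bootstrap_accuracy_interval(correct_flags, m=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the accuracy.

    correct_flags holds 1 for a correctly classified trial and 0 otherwise.
    Each of the m resamples has the same size n as the original data and is
    drawn uniformly with replacement.
    """
    rng = random.Random(seed)
    n = len(correct_flags)
    estimates = []
    for _ in range(m):
        resample = rng.choices(correct_flags, k=n)  # sampling with replacement
        estimates.append(sum(resample) / n)
    estimates.sort()
    lower = estimates[int((alpha / 2) * m)]
    upper = estimates[int((1 - alpha / 2) * m) - 1]
    mean = sum(estimates) / m
    return mean, lower, upper


# Hypothetical result: 80 of 100 trials classified correctly
correct_flags = [1] * 80 + [0] * 20
mean_acc, lower, upper = bootstrap_accuracy_interval(correct_flags)
```

The width of the interval between `lower` and `upper` indicates how strongly the estimated accuracy would vary across datasets of the same size.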

ROC-Curve

When the number of trials per class is unbalanced, the receiver operating characteristic (ROC) is a more accepted measure for evaluating the behavior of a classifier than Acc. The ROC curve illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the false positive rate (FPR) on the x-axis against the true positive rate (TPR) on the y-axis.

$$\mathrm{TPR} = \frac{TP}{TP + FN} \tag{2.33}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{2.34}$$

TP denotes true positives, FN false negatives, FP false positives and TN true negatives. To compare diverse classification models and to estimate the performance of a classifier, the Area Under the Curve (AUC) is calculated. The AUC is the area between the ROC curve and the x-axis (see Figure 2.9), which can be calculated as the definite integral of the curve. A classifier that separates the classes perfectly has an AUC equal to 1. Conversely, a classifier that separates the classes no better than random guessing has an AUC close to 0.5. Thus, the higher the AUC value, the better the performance of the classifier.
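The construction of the ROC curve and the AUC can be sketched as follows. The example assumes that the classifier outputs a continuous score per trial, that labels are coded as 0 and 1, and that the integral is approximated with the trapezoidal rule. The function names and the data are hypothetical.

```python
def roc_curve(labels, scores):
    """Return the FPR and TPR values obtained by lowering the threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Trials with identical scores share one threshold and one ROC point.
        if i == len(ranked) - 1 or ranked[i + 1][0] != score:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve, approximated with the trapezoidal rule."""
    return sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2
               for i in range(len(fpr) - 1))


labels = [0, 0, 1, 1]
auc_perfect = auc(*roc_curve(labels, [0.1, 0.2, 0.8, 0.9]))   # 1.0
auc_partial = auc(*roc_curve(labels, [0.1, 0.4, 0.35, 0.8]))  # 0.75
auc_constant = auc(*roc_curve(labels, [0.5, 0.5, 0.5, 0.5]))  # 0.5
```

A classifier that assigns the same score to every trial cannot separate the classes, and its ROC curve reduces to the diagonal with an AUC of 0.5.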

3 State of the art in workload classification and EEG-based tutoring systems

The state of the art in workload detection methods, as well as in classification methods used for workload detection and prediction, is reported in the following sections. This chapter is partially based on [8, 99, 100].

3.1 EEG-based measurement of workload during learning

As reported in section 2.3.5, EEG has widely been used for determining workload in individuals. Gevins and colleagues [50] investigated the influence of task difficulty on EEG signals and workload. In another study, Antonenko and Niederhauser [28] revealed differences in theta and alpha frequencies when reading hypertext with and without link previews. Gerlic and Jausovec [30] found that learning about planets from spoken text combined with music and pictorial information (i.e., high workload) led to alpha-desynchronization in temporal and occipital electrodes. When learning from written text alone (i.e., low workload), alpha-desynchronization occurred in frontal and central electrodes. However, because of the complex learning materials used for experimentation, it remains unclear in all of these studies whether the observed EEG differences between more and less demanding learning materials are really due to differences in workload or whether they are mostly artifacts of perceptual or motor differences between the experimental conditions. These perceptual-motor confounds seem inevitable when standard EEG power analysis is used to compare realistic learning materials of varying difficulty, instead of comparing more controlled experimental tasks without perceptual-motor confounds.

To summarize the widespread literature, the reported findings suggest that alpha and theta frequencies are stable indicators of cognitive performance and are useful as parameters for detecting different workload states. All reported studies were analyzed offline. Furthermore, it is unclear whether the classifications are really based on differences in cognitive workload or on some of the perceptual-motor confounds of the different instructional conditions. Therefore, modified analysis and classification methods are necessary to answer the remaining questions.


In document EEG workload prediction in a closed-loop learning environment (Page 38-44)