Chapter V Discussion
2. Parameter tuning and test results
For both classification and regression, the first step was to determine how the feature processing pipeline affects performance. It turned out that in both cases much higher detection rates can be obtained with no smoothing applied, regardless of whether it is a median filter, a moving average window or the additional sliding window used in the initial classifier tests. While less smoothing is beneficial for higher true positive rates, this comes at the expense of much lower hold times, which effectively turns the state-driven BCI into an event-driven one. And because of the very short hold time, the dwell time was unsuccessful in improving performance. This is also due to the non-overlapped short time windows from which band power features are extracted. In other works, the duration of the window is typically longer, generally up to 2.5 seconds, and the windows are overlapped. This is a form of smoothing in itself and potentially allows the dwell time to be effective.
Unlike the dwell time, refractory periods were found to significantly increase detection rates, although any improvement came at the cost of significantly more false positives on the test set. It seems that the larger the reduction of sample FPR is in cross-validation, the larger the event FPR will be on the test set. This observation is presented in Table 9, page 77, where we show that the relative decrease of sample FPR in cross-validation is indeed proportional to the relative increase of event FPR on the test set. Crucially, this relation is similar for all subjects, which seems
86
to indicate that in this context, refractory periods have almost no generalization ability on new data. This is because of the large differences in experimental protocol between the training set and the test set. While in the former all imaginary movements last four seconds, with four seconds breaks between them, the duration of both IC and NC periods is variable in the independent test set. The optimal refractory period is thus inherently adapted to the temporal structure of the data, as it crucially depends on the average duration of NC periods. Thus, the choice of an appropriate duration of the refractory period should not be based on the performance obtained on data with a fixed temporal structure, as is the case of our calibration set; otherwise, the improvements will not generalize to new data with different characteristics.
Both the dwell time and the refractory period are useful in decreasing false positive rate. Hence, it turns out that there are two possibilities to use these tools. The first possibility is to use a fixed threshold and determine the configuration for which the difference between true and false
positive rate is maximized. This is the approach of Townsend et al [43], who first introduced dwell
and refractory periods for BCI. The second possibility is to fix the false positive rate, rather than the threshold, and then similarly find the dwell and refractory periods which maximize the difference between true and false positive rates. This is the approach used in this study. The former possibility might not improve detection rates, as the threshold of the classifier is still the same, but will most likely reduce false activations; the latter possibility is guaranteed to improve the true positive rate (provided that the FP rate can be decreased), but the false positive rate remains the same. Which approach generalizes better on an independent test set remains a mystery, because only the second possibility was investigated in this study. But indeed, our results show that the improvements in true positive rate not only hold quite well on the test set, but are comparatively larger than those found in cross-validation.
For classification, we have also investigated the influence of the selection of training intervals. While the moderate improvements found in cross-validation were insignificant on the test set, the results offer a valuable insight: for three-state classification in self-paced motor imagery BCI, the selection of training intervals for one IC state influences the detection rates obtained for the other one. If the same interval would have been optimal for both IC states, the largest values of the average detection rates from Figure 34 (page 65) would have been found on the diagonal, yet this was not the case. This conclusion should be taken into account for future research, considering that in most BCI experiments the same intervals are considered for both classes. One possible reason why this is not commonly investigated is the computational inefficiency of cross-validating the classifier for all possible configurations. While this is still feasible with simple models such as discriminant classifiers, it would not be appropriate for more complex ones such as support vector machines or hidden Markov models. An alternative could be a filter approach, where instead of cross-validating a classifier for each set of parameters, statistical measures of the data would be calculated and optimized.
Overall, linear regression was found to offer lower performance than classification. Unlike classification, which tries to separate the data belonging to different classes according to some criterion, the least squares regression used here is crucially dependent on the target values. Minimizing the mean squared error, the optimization criterion of the Wiener filter, is also not a wise choice given the vague relation between mean squared error and self-paced performance. It
was shown that intuitive but improper target values can degrade performance significantly [60].
However, taking the one-dimensional projection found by LDA as the targets for the Wiener filter did not improve performance but actually decreased it. The genetic algorithm managed to improve cross-validation performance, but as usual, this came at the expense of higher event false positive rate on the test set. This brute force approach to optimization was possible because of the simple and computationally inexpensive regression model.
87
The final attempt to increase performance was by linearly combining the output of several classifiers and that of the Wiener filter with a genetic algorithm. Although four different optimization criteria were tested, results were unsatisfactory and any improvement in true positive rate also increased the event false positive rate on the test set. This might be caused by the fact that all classifiers were trained with the same features, extracted from the same intervals within the trial, and the same processing steps were applied for each of them. These parameters were found optimal for the three-state LDA, but there is no guarantee that they also work best for the other classifiers. Therefore, there is the possibility that performance could have been improved by using different pipelines, individually optimal for each classifier. Also, one could argue that instead of genetic algorithms, meta-classifiers could have been used, with the additional benefit of potentially determining non-linear relations between the scores of the base classifiers. We did not properly investigate this possibility, but some preliminary tests were performed with LDA, QDA and decision trees used as meta-classifiers. The results were not consistent though, and additional difficulties arose, such as which base classifiers should be used for training the meta-classifier. The genetic algorithm approach was thus preferred, partly because it offered the choice of specifically optimizing different performance metrics.