Comparison to the Human Recognition Performance

5. A Static Baseline Approach

5.4. Evaluation

5.4.3. Comparison to the Human Recognition Performance

In this section, the classification results of the best-performing features, the mean AAM feature vectors, are compared to the human recognition results, more specifically to the results for the only-face/full-time context condition, because this condition best matches the information the automatic classification is based on. For the sake of convenience, these classification accuracies are shown again in Tab. 5.12.

| features | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | mean | SD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| m-AAM – all | 76 | 83 | 80 | 95 | 84 | 57 | 62 | 74 | 66 | 71 | 88 | 76.0 | 11.5 |
| m-AAM – success | 67 | 82 | 89 | 90 | 81 | 60 | 52 | 69 | 25 | 75 | 83 | 70.3 | 19.2 |
| m-AAM – failure | 83 | 83 | 67 | 100 | 88 | 54 | 70 | 81 | 87 | 67 | 91 | 79.2 | 13.3 |
| human – all | 82 | 75 | 85 | 92 | 68 | 73 | 94 | 67 | 78 | 95 | 92 | 82.0 | 19.1 |
| human – success | 91 | 66 | 84 | 89 | 61 | 70 | 91 | 52 | 66 | 95 | 93 | 78.1 | 21.2 |
| human – failure | 73 | 84 | 86 | 95 | 75 | 75 | 98 | 82 | 91 | 95 | 91 | 86.0 | 16.1 |

Table 5.12.: Classification accuracies for the mean AAM features and the human recognition performance. For each person, the classification accuracy for all scenes, only success scenes, and only failure scenes is shown, as well as the mean accuracy and standard deviation over all persons. Please refer to Sec. 5.4.3.

The average classification accuracy of the mean AAM features (76.0%) is notably lower than the average human performance (82.0%), although the difference is not statistically significant.⁶ However, we assume that the high variance in performance across persons, paired with the comparatively low number of persons, is the reason why significance cannot be confirmed, and that the classification accuracies of the automatic approach are in fact systematically lower than the human ones, not just by chance. Then again, the human recognition performance was evaluated on a subset of 88 videos only, while the automatic classification used all available videos. When evaluated on this subset only, the performance of the mean AAM features is comparable to the human one: 83.0% for all videos (SD 10.1), 75.0% for success videos (SD 19.4), and 90.0% for failure videos (SD 12.6). These 88 videos were chosen randomly (see Sec. 3.5.1). It may be that, by chance, these 88 videos are in some general sense “easier” to classify than the database average, but the performance gain of the mean AAM features on this subset may just as well be due to chance; the data at hand do not allow a conclusive answer to this question (intuitively, we suspect the latter).
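The significance statement above can be checked roughly from the per-person accuracies in Tab. 5.12. The following is a minimal sketch, assuming an unpaired Student's t-test with pooled variance (the text does not say which t-test variant was used). The function name `pooled_t_statistic` and the hard-coded critical value are illustrative choices, not part of the original evaluation.

```python
import math
import statistics

# Per-person accuracies (in %) for "all" scenes, copied from Tab. 5.12.
M_AAM_ALL = [76, 83, 80, 95, 84, 57, 62, 74, 66, 71, 88]
HUMAN_ALL = [82, 75, 85, 92, 68, 73, 94, 67, 78, 95, 92]


def pooled_t_statistic(a, b):
    """Student's t statistic for two independent samples (pooled variance).

    Assumption: an unpaired test; the source does not state the variant.
    """
    n_a, n_b = len(a), len(b)
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (statistics.mean(b) - statistics.mean(a)) / std_err


mean_aam = statistics.mean(M_AAM_ALL)
sd_aam = statistics.stdev(M_AAM_ALL)
t_stat = pooled_t_statistic(M_AAM_ALL, HUMAN_ALL)
dof = len(M_AAM_ALL) + len(HUMAN_ALL) - 2

# Tabulated two-tailed critical value of Student's t for alpha = 0.2, dof = 20.
T_CRIT_ALPHA_0_2_DOF_20 = 1.325

print(f"m-AAM mean = {mean_aam:.1f}, SD = {sd_aam:.1f}")
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")
print("p > 0.2" if abs(t_stat) < T_CRIT_ALPHA_0_2_DOF_20 else "p <= 0.2")
```

Under these assumptions, the mean and sample standard deviation of the m-AAM row reproduce the tabulated 76.0 and 11.5, and the t statistic stays below the critical value, which is consistent with the reported p > 0.2. The Wilcoxon rank-sum test is not reproduced here.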

There are some commonalities between the human and the automatic recognition performances:

• on average, failure scenes were easier to classify than success scenes

• the variance for success scenes is higher than for failure scenes

• the variance of the classification accuracy (depending on the subject) is high in general

Nevertheless, there is no significant correlation at all between the classification accuracies for the individual persons (Spearman correlation, ρ ≈ 0.04, p > 0.9). However, this question can also be examined in more detail, namely not on person level but on video level. On person level, the average classification accuracies of the 11 subjects are compared, whereas on video level the individual classification results for all 88 videos are compared. To this end, the classification results of the 11 observing subjects⁷ were binarized for each video by setting the result to 1 if more than half of the subjects classified the video correctly, and to 0 otherwise. This binarization makes the human results compatible with those of the automatic recognition, which yields only one binary value (correct or incorrect classification) per video. It turned out that there is a weak but close to significant correlation between these classification results on the 88 videos (Spearman correlation, ρ ≈ 0.2, p < 0.06). Thus, measured on video level, the human observers and the automatic classification tended, to some (weak) extent, to make similar classification errors.

⁶ p > 0.2 for both a two-tailed t-test and a Wilcoxon rank-sum test.

⁷ There were 44 observing subjects, who were distributed over the four context conditions, resulting in 11 observing subjects for each context condition; these are not to be confused with the 11 subjects shown in the videos.
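Both correlation analyses can be sketched in a few lines. The person-level part uses the "all" rows of Tab. 5.12. The video-level data are purely hypothetical toy values (six videos instead of 88), since the per-video results are not given in the text. All function names are illustrative, and Spearman's ρ is computed here as the Pearson correlation of average ranks, which is an assumption about how ties were handled.

```python
import math


def average_ranks(values):
    """Rank values ascending (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def binarize_by_majority(observer_results):
    """Return 1 if more than half of the observers were correct, else 0."""
    return 1 if 2 * sum(observer_results) > len(observer_results) else 0


# Person level: per-person accuracies for "all" scenes from Tab. 5.12.
m_aam_all = [76, 83, 80, 95, 84, 57, 62, 74, 66, 71, 88]
human_all = [82, 75, 85, 92, 68, 73, 94, 67, 78, 95, 92]
rho_person = spearman_rho(m_aam_all, human_all)

# Video level: HYPOTHETICAL toy data, one list of 11 observer results per video
# (1 = classified correctly, 0 = classified incorrectly).
correct_counts = [9, 5, 6, 11, 2, 7]
observer_results = [[1] * c + [0] * (11 - c) for c in correct_counts]
human_binary = [binarize_by_majority(video) for video in observer_results]
# HYPOTHETICAL automatic results for the same six videos.
automatic_binary = [1, 0, 0, 1, 1, 1]
rho_video = spearman_rho(human_binary, automatic_binary)

print(f"person level: rho = {rho_person:.2f}")
print(f"video level (toy data): rho = {rho_video:.2f}")
```

With the table values, the person-level coefficient comes out at roughly 0.04, in line with the value reported above. The toy video-level result has no relation to the reported ρ ≈ 0.2 and only illustrates the binarization step. The p-values are not computed in this sketch.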

5.5. Conclusion

We investigated the person-specific automatic recognition of FCSs in terms of valence, using SVMs as classifiers and AAMs, GEFs, and raw images as features. Although they have been shown to yield good results on other facial analysis problems, the GEF features performed worse than the AAM features and also the raw image features in our evaluations. The good performance of the raw images, compared to the AAMs, suggests that the video parts with large out-of-plane head rotations, which are a main cause of AAM fitting failures, also convey useful information and should be considered for the interpretation. In general, the achieved classification accuracies are rather low for a two-class problem, especially for the success class. A main problem is the apparently low ratio of interclass to intraclass variance on frame level.

The best performance was achieved by the mean AAM feature vectors, yielding an average classification accuracy of 76.0%, which is still lower than the average human performance of 82.0%. When evaluated only on the subset of videos that was judged by the human subjects, the classification accuracy increased to 83.0%. However, we regard the classification performance on the whole dataset as the more important performance measure. As with the human classification, the variance of the recognition performance across persons was very high in general and for success scenes in particular. On average, failure scenes were somewhat easier to classify than success scenes.

An investigation of the surprisingly good performance of the mean feature vectors, compared to the majority voting over frames, indicated that using discriminative subsequences of the videos for the classification is a promising direction for further investigation. This assumption is confirmed by a visual inspection of the videos, which furthermore suggests that the temporal dynamics of the displayed FCSs are important for their recognition. Both issues were neglected by the simple static classification approach presented in this chapter. Thus, in the next chapter, we investigate a more sophisticated, dynamic recognition approach that addresses them. The classification accuracies of the static approach serve as the baseline against which the dynamic approach is compared.

In document Facial Communicative Signals: valence recognition in task-oriented human-robot Interaction (Page 97-101)