5.2 Experiments
5.3.2 Success in Raven-Bongard
In the Raven-Bongard experiment, we explored the possibility of building a predictive model of collaboration outcomes from eye movements (Nüssli et al., 2009). We used machine learning algorithms to predict success in solving the task from dual-gaze indicators, possibly combined with speech indicators.
Method
We defined four gaze features to represent how subjects looked at the cells composing the problems. We identified every sequence of at least 3 fixations containing at least one back-and-forth movement between one cell and another. We defined the comparisons feature as the proportion of all fixations that belong to such sequences. A related feature is the comparison intensity: for each comparison sequence, we computed the number of transitions between the two cells, and we then averaged this number over all comparison sequences. To obtain the gaze dispersion feature, we computed the standard deviation of a vector that represents the 9 cells on the screen and contains the aggregated fixation time for each cell. Finally, we computed the cosine between the two subjects' vectors of aggregated fixation times to obtain a dual-gaze feature called gaze divergence. Examples of these features are shown in figure 5.7.
Figure 5.7: Illustration of gaze features. The top-left picture depicts one subject (blue) making an intense comparison between the upper-left cell and the middle cell, and the other subject (red) making a weak comparison between the upper-right cell and the middle-right cell. The top-right picture shows a dispersed subject (blue) and a non-dispersed one (red). The bottom images illustrate high gaze divergence (left) and low gaze divergence (right).
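The four gaze features described above can be illustrated with a minimal sketch. It is not the implementation used in the experiment. Several points are assumptions made for illustration: a comparison sequence is taken to be a maximal run of fixations that stays on two cells and switches between them at least twice; the population standard deviation is used for dispersion; cells are indexed 0 to 8; and all function names are hypothetical.

```python
import math
from statistics import mean, pstdev

N_CELLS = 9  # number of cells on the screen, as stated in the text


def comparison_runs(cells):
    """Return (start, end, transitions) for each maximal run of fixations
    that stays on two cells and switches between them at least twice.
    ASSUMPTION: this is one possible reading of 'comparison sequence'."""
    runs = []
    n = len(cells)
    start = 0
    while start < n:
        seen = {cells[start]}
        end = start + 1
        # Extend the run while it still involves at most two distinct cells.
        while end < n and (cells[end] in seen or len(seen) < 2):
            seen.add(cells[end])
            end += 1
        transitions = sum(
            1 for k in range(start + 1, end) if cells[k] != cells[k - 1]
        )
        if transitions >= 2:  # A -> B -> A: one back-and-forth, >= 3 fixations
            runs.append((start, end, transitions))
        if end >= n:
            break
        # The next candidate run starts where the final streak on the last
        # visited cell begins, so neighbouring comparisons may share a cell.
        nxt = end - 1
        while nxt > start + 1 and cells[nxt - 1] == cells[end - 1]:
            nxt -= 1
        start = nxt
    return runs


def comparison_features(cells):
    """Return (comparisons, comparison intensity) for one subject."""
    runs = comparison_runs(cells)
    if not runs:
        return 0.0, 0.0
    covered = set()
    for start, end, _ in runs:
        covered.update(range(start, end))
    # Proportion of fixations inside comparison runs, mean transitions per run.
    return len(covered) / len(cells), mean(t for _, _, t in runs)


def dwell_vector(fixations):
    """Aggregate fixation time per cell from (cell, duration) pairs."""
    totals = [0.0] * N_CELLS
    for cell, duration in fixations:
        totals[cell] += duration
    return totals


def gaze_dispersion(totals):
    """Standard deviation of the per-cell fixation times (population SD assumed)."""
    return pstdev(totals)


def gaze_cosine(totals_a, totals_b):
    """Cosine between the two subjects' vectors; the text uses this value
    as the gaze divergence feature."""
    dot = sum(x * y for x, y in zip(totals_a, totals_b))
    norm = math.sqrt(sum(x * x for x in totals_a)) * math.sqrt(
        sum(y * y for y in totals_b)
    )
    return dot / norm if norm else 0.0
```

For example, `comparison_features([0, 4, 0, 4, 8, 2, 5, 2, 1])` finds two comparison sequences under this reading, one between cells 0 and 4 and one between cells 2 and 5.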
The recorded audio was automatically labeled, second by second, as speech or no speech according to a simple threshold test on the volume. From this, we extracted two simple features: the speech time, i.e. the proportion of time spent speaking, and the speech time asymmetry, i.e. the difference in speech time between subjects.
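A minimal sketch of these two speech features follows. It assumes one volume value per second for each subject. The threshold value is hypothetical, and taking the absolute difference for the asymmetry is an assumption, since the text does not say whether the difference is signed.

```python
def label_speech(volumes, threshold):
    """Label each second as speech (True) or no speech (False).
    ASSUMPTION: one volume value per second, compared with a fixed threshold."""
    return [v > threshold for v in volumes]


def speech_time(labels):
    """Proportion of seconds labeled as speech."""
    return sum(labels) / len(labels) if labels else 0.0


def speech_time_asymmetry(labels_a, labels_b):
    """Difference in speech time between the two subjects.
    ASSUMPTION: the absolute difference is used."""
    return abs(speech_time(labels_a) - speech_time(labels_b))


# Made-up volumes for illustration only; 0.5 is a hypothetical threshold.
subject_a = label_speech([0.1, 0.6, 0.7, 0.2], threshold=0.5)
subject_b = label_speech([0.9, 0.8, 0.7, 0.1], threshold=0.5)
```

With these made-up values, subject A speaks for half of the window and subject B for three quarters, giving an asymmetry of 0.25.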
We used these features to train two types of classifiers: a binary decision tree and a Naive Bayesian classifier. These two types of classifier were chosen because the resulting model can be analyzed to gain insight into how the prediction is made. The predicted variable was the outcome of the problem: solved or unsolved. The algorithms were fed with gaze and speech features computed over one-minute time windows, and the time of the window was also used as a predictor. However, for each problem we discarded the last minute before the solution was announced, in order to avoid the effect of speech that may be due to the explanation of the solution. We used a 10-fold cross-validation procedure to assess prediction performance. We ran three prediction experiments: one predicting success independently of the task, and two predicting success in each of the two classes of problems separately.
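The windowing and evaluation procedure can be sketched as follows, under stated assumptions. Windows are aligned to the start of the problem and partial windows are dropped. The folds are formed by plain shuffling, since the text does not say whether the cross-validation was stratified or grouped by pair. The majority-class `fit` and `predict` functions are hypothetical placeholders standing in for the decision tree and Naive Bayesian classifiers.

```python
import random
from collections import Counter


def one_minute_windows(duration_s, window_s=60):
    """(start, end) of each full window, in seconds, after discarding the
    last minute before the solution was announced.
    ASSUMPTION: windows start at t = 0 and partial windows are dropped."""
    usable = max(duration_s - window_s, 0)
    return [(t, t + window_s) for t in range(0, usable - window_s + 1, window_s)]


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def cross_validate(X, y, fit, predict, k=10):
    """Proportion of instances correctly predicted over the k held-out folds."""
    correct = 0
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in test)
    return correct / len(X)


# Hypothetical placeholder classifier: always predicts the majority class.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, x):
    return model


# Made-up instances for illustration: 15 "solved" (1) and 5 "unsolved" (0).
X = list(range(20))
y = [1] * 15 + [0] * 5
accuracy = cross_validate(X, y, fit_majority, predict_majority, k=10)
```

A problem solved after 300 seconds yields four windows here, covering seconds 0 to 240. The placeholder classifier reaches 75% accuracy on the made-up data simply by predicting the majority class, which is why accuracy alone is not informative on an unbalanced task.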
Results
Table 5.2: Results of the machine learning algorithms for both problem classes combined and for each class separately. The values in parentheses are the corresponding kappa scores.
                             Speech      Gaze        Both
Both problem classes
  Naive Bayesian classifier  77% (45%)   74% (35%)   78% (50%)
  J48 binary decision tree   86% (65%)   68% (10%)   79% (51%)
Raven problems only
  Naive Bayesian classifier  78% (56%)   68% (32%)   78% (56%)
  J48 binary decision tree   91% (81%)   68% (34%)   91% (81%)
Bongard problems only
  Naive Bayesian classifier  76% (29%)   77% (37%)   76% (34%)
  J48 binary decision tree   75% (21%)   77% (37%)   75% (25%)
The prediction results are presented in table 5.2. We report both the percentage of correct predictions and, because this is not a balanced classification task, the kappa statistic, which can be read as the percentage of correctness above chance level. First, it is very interesting to note that we can obtain good results (50% above chance level) when predicting success in both tasks at the same time. This suggests that there may exist some patterns in gaze and speech that are partially task independent. We can also see that, at this level, speech features play a larger role than gaze: models using only gaze features are the worst for both algorithm types. However, for the Naive Bayesian classifier, adding gaze slightly improves performance compared to speech only, indicating that gaze can still play a role. The results for Raven problems only are surprisingly high, reaching about 80% above chance level with 91% of instances correctly classified. Moreover, these results are explained by speech features alone. For Bongard problems the situation is the opposite, although the results are much lower than for Raven and even lower than for both classes combined. This suggests that the correct predictions made on both problem classes taken together are essentially explained by correct predictions on Raven problems. It is nevertheless interesting to note that the best prediction performance for Bongard problems is explained mainly by gaze features, as the best models are obtained by taking only gaze features, without speech features.
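To make the role of the kappa statistic concrete, the following sketch computes Cohen's kappa as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The labels are made up for illustration and are not data from the experiment. The function name is hypothetical.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between predictions and outcomes,
    corrected for the agreement expected by chance."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    # Observed agreement: proportion of correct predictions.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the marginal proportions, summed over labels.
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 0.0


# Made-up unbalanced outcomes: 8 solved (1) and 2 unsolved (0).
outcomes = [1] * 8 + [0] * 2
always_solved = [1] * 10  # a trivial predictor that ignores the features

kappa_trivial = cohen_kappa(outcomes, always_solved)
kappa_perfect = cohen_kappa(outcomes, outcomes)
```

The trivial predictor is correct 80% of the time on these made-up labels, yet its kappa is 0 because it does no better than chance, whereas a perfect predictor obtains a kappa of 1.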
We also tried to make predictions using only the data from the first two minutes of solving; the results were either similar to those presented or slightly (3 or 4%) lower. These results are even more interesting because they suggest that it could be possible to detect after one or two minutes whether the pair will succeed. We also tried to predict whether success would occur in the following minute. In this case, the results are clearly lower than the previous ones, but they are still high enough to be worth considering: we obtain kappa scores of 40% (instead of 50%) for both problem classes combined. This suggests that there exist phases in the solving process that are distinguishable using gaze patterns, which is consistent with the results found using usual statistical methods.
Discussion
As we have seen, we can predict collaborative problem solving outcomes, up to a certain point, using only raw measures of speech and dual-gaze features. Moreover, we may be able to predict the moment of resolution one minute before it happens, which suggests a certain ability to detect phases in the collaborative process. These results have interesting implications, as they tend to show that it could be possible to build gaze-sensitive applications, possibly combined with simple automatic speech analysis, in order to provide meaningful feedback to users. Of course, these results must be interpreted with care, as the number of subjects is low.