4.3 Evaluating gesture controllers
4.3.3 Comparing the two evaluations
4.3.3.1 Yuck responses
In the second experiment, we remedied these faulty behaviors by adding extra rules to our gesture controllers. For example, to avoid immobility during periods of poor external stimulation, the gaze controller automatically and randomly loops over the last two regions of interest when the delay since the last fixation exceeds 3 s. With this rule, the number of yucks at timestamps 10, 11, 12, 13, 16, 18, 19 and 24 is significantly reduced, as shown in Figure 4.13.
However, this randomization should not be uniformly distributed and should favor the subject's face, since participants still complain about the robot's lack of engagement with the human subject (e.g. around peaks 11, 12 and 13 in the counting task). This problem will be addressed by systematically adding the subject's face to the current attention stack and favoring this region of interest in the gaze distribution.
In the first evaluation, yucks such as those occurring at timestamps 2, 3 and 5 were due to the wait-motion-done setting. In the redesign, these faulty behaviors were removed by disabling the wait-motion-done option, which discards any new command as long as the current gesture has not reached its target to within a given precision. We assumed that viewers are able to decode intentions and accept the interruption of the robot's movements. This policy is efficient: the yuck responses at landmarks 2, 3 and 5 are significantly reduced in the second evaluation. The yucks at landmarks 14 and 15 were repaired by forcing the closing gesture at the end of phonation.
Although many of the faulty behaviors are suppressed, several faulty detections remain, and some new yucks emerge from the background. These notably concern the absence of expressiveness, e.g. emphatic responses to the subject's embarrassment or head nodding normally associated with incentives, respectively cued by yellow vs. cyan extrema.
We compared the probability distributions of yucking for the first vs. second assessments. We further distinguished the subjects who performed both assessments. The average yucking frequency is respectively 0.013, 0.007 and 0.007 yucks/s for the three groups. The difference in average yucking frequency between the first and second assessments is statistically significant, whether or not subjects participated in both experiments. This shows that we succeeded in resolving some of the faulty multimodal behaviors, since the average yucking probability is roughly halved between the two evaluation sessions.
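The fallback rule and the proposed bias toward the subject's face can be sketched as follows. This is only an illustration under stated assumptions: the function name, the `FACE` label and the `face_bias` parameter are hypothetical and do not come from the actual controller; only the 3 s threshold and the "last two regions of interest" rule are taken from the text.

```python
import random

FIXATION_TIMEOUT_S = 3.0   # idle delay quoted in the text
FACE = "subject_face"      # hypothetical label for the subject's face region


def fallback_gaze_target(recent_rois, idle_s, face_bias=1.0, rng=random):
    """Pick a fallback region of interest when the gaze has been idle too long.

    recent_rois: regions of interest, oldest first (assumed representation).
    idle_s:      seconds elapsed since the last fixation.
    face_bias:   1.0 reproduces the uniform rule of the second experiment;
                 values above 1.0 add the face and favor it (proposed fix).
    Returns None when no fallback is needed.
    """
    if idle_s <= FIXATION_TIMEOUT_S or not recent_rois:
        return None
    candidates = list(recent_rois[-2:])          # the last two regions
    if face_bias > 1.0 and FACE not in candidates:
        candidates.append(FACE)                  # face always available
    weights = [face_bias if roi == FACE else 1.0 for roi in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

With `face_bias=1.0` the draw is uniform over the last two regions; a larger value shifts the gaze distribution toward the face without removing the other candidates.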
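The effect of the wait-motion-done option can be illustrated with a minimal sketch. It assumes a single scalar joint position and a hypothetical `precision` tolerance; the class and attribute names are illustrative and are not those of the robot's actual controller.

```python
from dataclasses import dataclass


@dataclass
class GestureController:
    """Toy one-dimensional controller (illustrative assumption)."""
    wait_motion_done: bool = False
    precision: float = 0.01   # hypothetical tolerance around the target
    position: float = 0.0
    target: float = 0.0

    def motion_done(self) -> bool:
        # The gesture is done once the position is within the tolerance.
        return abs(self.target - self.position) <= self.precision

    def command(self, new_target: float) -> bool:
        """Request a new target; return True if the command is accepted."""
        if self.wait_motion_done and not self.motion_done():
            return False          # command discarded, as in the first evaluation
        self.target = new_target  # ongoing gesture is interrupted and retargeted
        return True
```

With `wait_motion_done=True`, a command issued while a gesture is still in progress is silently dropped; with the option disabled, the new target replaces the old one immediately.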
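As a quick check of the reduction factor, the ratio can be computed directly from the rounded frequencies reported above. The variable names are illustrative; the two values are the only inputs taken from the text.

```python
# Average yucking frequencies reported in the text (yucks/s).
rate_first = 0.013
rate_second = 0.007

# Factor by which the average yucking frequency is reduced.
reduction_factor = rate_first / rate_second
print(f"reduction factor: {reduction_factor:.2f}")
```

The ratio obtained from these rounded figures is slightly below two, which is consistent with describing the frequency as roughly halved.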
Figure 4.13 Comparing the yucking probability as a function of time for the first vs. second assessment by the subjects.
Figure 4.14 Comparing subjective ratings according to conditions (same conventions as Figure 4.13).
[Figure panels: (a) Robot adapts; (b) Feels relaxed]
4.3.3.2 Subjective ratings
We also compared subjective ratings from the first vs. second assessment (see Figure 4.14). While the new behavioral score results in an effective decrease of the yuck responses and the rating of decent behavior improves, most other off-line subjective ratings degrade. Likelihood ratio tests comparing the combined multinomial model RATINGS ∼ SEX+SESSION+EXPOSURE with the reduced models RATINGS ∼ SEX+SESSION, RATINGS ∼ SEX+EXPOSURE and RATINGS ∼ EXPOSURE+SESSION show that SEX contributes significantly to the ratings of questions 1 and 3, as shown in Figure 4.15 (females being less convinced by the robot's adaptation capabilities but more relaxed than males). In addition, the second version contributes significantly to two ratings, i.e. feel relaxed (p < 0.02) and pleasant interaction (p < 0.09). This means that people felt more relaxed and rated the robot as more friendly in the first evaluation.
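The likelihood ratio test used to compare the combined model with each reduced model can be sketched as follows. This is a generic illustration, not the analysis script behind the reported p-values: the function names are hypothetical, the log-likelihoods must come from models fitted elsewhere, and the test statistic is assumed to follow a chi-square distribution whose degrees of freedom equal the number of parameters dropped.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    # Closed-form starting points: df = 2 (even case) or df = 1 (odd case).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(loglik_full, loglik_reduced, df):
    """Return (statistic, p-value) for two nested models.

    df is the number of parameters removed from the full model, e.g. the
    coefficients attached to SEX when testing its contribution.
    """
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, chi2_sf(stat, df)
```

A small p-value indicates that the removed factor (SEX, SESSION or EXPOSURE) explains a significant part of the ratings that the reduced model cannot capture.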
4.3.3.3 Comments
In the free comments, some raters of the first evaluation campaign mentioned the rather directive style of our female interviewer and the absence of emotional vocal and facial displays from our SAR, e.g. laughs and smiles. While most raters of the second evaluation campaign underline the quality of the gaze behavior, the majority criticize the poverty of emotional displays: "robot without human warmth!", "why robots never smile?", "[the robot] does not react to humor", "in its behavior, I sometimes felt boredom or weariness", etc. It seems that the increased behavioral quality and appropriateness also raised the participants' expectations. As they have the impression that the robot is reactive, aware of the situation and monitors the interaction task in an appropriate way, they can allocate more attentional resources to the social and emotional aspects of the interactive behavior.
These critical reviews concur with Masahiro Mori's statements [Mor70] about the uncanny valley, or perhaps more likely with the uncanny cliff hypothesis [Bar+07], which postulates that the likability of robots may evolve on an uncanny cliff without necessarily falling into the valley. The challenge is then to maintain performative, socio-communicative and emotional behaviors at the same level of acceptability.
Moreover, these experiments show the limits of HHI-to-HRI transfer learning: multimodal behaviors exhibited by human tutors may not be fully acceptable when performed by a SAR. Here, while the neuropsychologist is quite entitled to concentrate on her score sheet while the subject is performing a counting task or trying to retrieve an item from memory, such casual behavior is associated with carelessness and coldness when it comes from a SAR. This has to be confirmed by asking our web subjects to rate, using the same methodology, the behaviors of a panel of human neuropsychologists. Cormons et al. [CDP16] are notably comparing the behaviors of practitioners as a function of their curriculum: neuropsychologists, clinicians vs. speech therapists.
Goetz et al. [GK02] found that robots with playful behaviors are usually rated more positively than robots with a serious personality, but people followed the instructions of serious robots.