5.4 Benets of the Wizard of Oz approach
5.4.3 Evaluating the Immersive Teleoperation system
5.4.3.3 Illustrative Examples of Evaluating Data
1Figure 5.11 illustrates eyes and head movements of a participant performing the clicking- beaming evaluation in the three conditions. In the xed head condition, the gazes move with higher amplitude (gaze-1 and gaze-2) and the time for nishing the test is the largest one from the three conditions due to the large saccade errors. In the xed robot eyes condition, the head just seems to move to one position (see Figure 5.11 (b) so that the participant can see the items from the 2 sources (tablet and desktop screen) in the head-mounted display screens. At that time, he tries to do his best to just move his eyes to see the displayed item and click. Therefore, the duration time for this test become shortest. Finally, with movable both of eyes-head conditions, the participant move both head and eye at the same time. Because
(a) Schematic modal of a beaming evaluation
(b) A target word, as seen on the screen (c) The list of words on the virtual tablet, where the target word should be found and clicked
5.4. Benets of the Wizard of Oz approach 121
Figure 5.9 Distributed distance between two consecutive items
Figure 5.10 The items are sorted and merged to exhibit a balanced order
of latency between the participant eyes movement and the display from robot camera, the participant tend to move slower his/her eyes so that the head will contribute more largely in gazing targets as is the case when human gaze with their own head/eyes.
The evaluation system is ready to perform: in near future, we will have more data to analyze and validate the Beaming system performances before its use to collect interactive data.
5.4.4 Comments
The evaluation framework focused on pilot's point-of-view, which measures the eective of the Beaming system. In future, we also can measure acceptance of the Beaming system in
(a) Eye movements (b) Head movements
Figure 5.11 Gaze and head movement of a subject in clicking-beaming evaluation
subject's point-of-view, in which participants interacting directly with the robot controlled by a professional human pilot.
5.5 Conclusion
In this chapter, we discuss our perspective on how to build autonomous robots that can perform the short-term interactions. For now, the robot has been able to replicate the PTT scenario, in which actions are generated from interactive models as well as from interactive data. We discuss our strategy to build perception modules and evaluate required modules (gesture controllers and interactive models) so that the robot can collaborate with human naturally and eectively in future.
For RL/RI scenario, a basic autonomous robot is built with rule-based models to drive gaze, arm, speech while backchannels are generated by interactive models; and a speech recog- nition system is used to provide inputs for the interactive models. We also presented the challenges we are facing to achieve a more mature autonomous system with higher social levels.
Most of the challenges are concerned by interactive data. In order to progress against these limitations, we propose to apply immersive teleoperation system, which are integrated with Finite State Machine to collect naturally interactive data for HRI.
The immersive teleoperation system allows a pilot (e.g. a neuro-psychologist) to interact with human subject through the robot body. In fact, the human pilot can perceive the scene
5.5. Conclusion 123 through robot eye cameras and ear microphones. At the same time, (s)he can drive the robot verbal and nonverbal behaviors such as head, gaze and arm movement. Therefore, the pilot can control robot to adapt its behaviors when interacting with dierent subjects. We also present how to evaluate the immersive teleoperation system before using it to collect the interactive data.
In fact, with the immersive teleoperation system, the interactive data could be more natural for teaching the robot how to perform social interaction, with more variability. However, it still has diculties in scaling from human perception abilities to the robot abilities. The pilot would have diculties at the same time interacting with human subjects and care what the robot can learn. We will discuss a semi-autonomous robot which shares control between human pilot and the robot in the next chapter. Therefore it can reduce loading of the human pilot and enable the pilot to focus on the robot perception limitations.
Chapter 6
Conclusions and Perspectives
6.1 Conclusions
This thesis main concern/goal is about a humanoid robot that can interact naturally with humans through verbal and co-verbal behaviors. We propose here a learning framework that enables a humanoid robot to learn multimodal interactive behaviors from human coaches, which we detail and use with 2 short-term interactions and dedicated behavior models, with a nal demo implementation on the iCub robot.
The learning framework includes three main parts: (1) collecting interactive data of mul- tiple modalities (speech, head, arm, gaze, etc.) from human-human interaction (HHI) and process the data to get events and features that can be scaled to HRI; then (2) build mul- timodal interactive behavioral models, which can capture both inter- vs. intra-coordinations between the modalities of interactions (e.g. speech, arm, gaze, head motions, etc.) to gen- erate actions from perception streams; nally (3) make gesture controllers to execute actions generated from the models.
We focused on short-term interactions in order to collect more easily enough interactive data to train interactive models. We selected here two short-term interactive scenarios, which our robot will be involved:
Collaborative performative task The rst scenario is Put That There (PTT), in which the robot will play the role of an instructor who instructs manipulator to move cubes. This scenario requires the robot to perform precise coordination between speech and other non-verbal behaviors (arm pointing, gaze attention, etc.). The intra-coordinations concerns the time relations between triggered events of the robot's speech and its gaze and arm movements. In contrast, the inter-personal coordinations concern the robot's sensitivity to the actions of its human partner (here, manipulator's arm movements). By these multimodal interactions, the manipulator is expected to better perceive the robot's instructions than ones given by unimodal interaction (e.g. just by speech). Therefore, the collaborative task can be performed more eciently.
Interview The second task is a neuro-psychological test, namely a Selective Reminding Test (French RL/RI). In this task, the robot plays the role of a neuro-psychologist who interviews elderly people to diagnose potential Alzheimer disease. This task requires the robot to perform not only adequate verbal and nonverbal coordinations but also mimic
professional skills of the psychologist to keep engagement of the subjects and encourage them to foster their memory. In particular, the right choice and timing of backchannels play an important role.
Multimodal data collected during HHI experiments are then processed to get useful fea- tures for training multimodal interactive behavioral models. We compared performance of Long-Short Term Memory (LSTM), a Deep Learning method, with that of statistical models, previously developped in the laboratory, to construct interactive models. The LSTM-based techniques are used to generate discrete events (gaze, arm, interaction units for the PTT scenario, backchannels for the RL/RI scenario) as well as continuous variables (head mo- tions for the PTT scenario). Compared with the statistical methods, LSTM generates not only better prediction performance but also provides better coordinations between multimodal streams (measured by Coordination Histogram). These good results are explained by the ability of LSTM to capture long-term time dependencies, which provide decision layers with relevant contextual information. For head motion generation, we proposed a cascaded LSTM model, which rst uses one LSTM layer to predict gaze as an intermediate task and a second LSTM to generate head motion. By this way, we explicitly provide a causal relation between head and gaze.
We then build gesture controllers that are used to execute action events generated from the interactive models. We also propose an online evaluation framework to de- tect faulty behaviors of gesture controllers, and helps improving the models. The online evaluation requires participants to continuously watch a video from the viewpoint of a sub- ject interacting with the robot and just press a key (ENTER key) to signal faulty behaviors. Time distribution of these yuck responses exhibits clear maxima that cue elementary faulty behaviors that system devloppers can then handle. We evaluated the gesture controllers with action events directly driven by HHI data in two versions of the robot: (1) an initial version; (2) a corrected version that benets from the correction of faulty behaviors identied bythe rst assesment. We found that the corrected version eectively reduces the number of yuck responses. This conrms the eectiveness of the framework for detecting the robot's faulty behaviors. By analyzing the post-hoc questionnaires and the free comments given at the end of the test, we found that people seem to have higher expectations when the robot is less faulty. For example, in the second experiment, participants usually comment "why does the robot not smile? or it is better if the robot give a joke, which did not usually happen in the rst experiment. In future, the same procedure can be used to actually rate the autonomous robot with fully functional interactive models.