6.2 The Curious Robot: Exploring Salient Objects

While the previously described Home-Tour focuses on the question of how mixed initiative facilitates learning, the Curious Robot scenario addresses the question of how mixed initiative facilitates the interaction itself. From a technical perspective, the scenario advanced the refinement of the Task State Protocol by providing new use cases. The Task State Protocol was also applied more systematically in this scenario, and it was identified and investigated as a general coordination principle for the first time [Lüt11].

6.2.1 Scenario Description

The second scenario is an interactive object learning and manipulation scenario in which a humanoid robot explores objects that are interesting or salient to it, assisted by a human tutor. Experiences with the original Home-Tour (which, in contrast to the iteration described above, relied mainly on the human’s initiative) show that untrained users require a significant amount of prior instruction to complete the task [LHRS09], because the robot’s interaction model is not immediately obvious. As a result, their behaviors and interaction strategies vary enormously, which makes it almost impossible for a system to cope with them.


In the Curious Robot scenario, the interaction strategy consequently focuses on the question of how mixed initiative can structure the interaction for the users and thus make their behavior more predictable. In particular, the robot asks the user for the labels of objects and how to grasp them. By asking about objects on its own initiative, instead of leaving it to the user to demonstrate them, the robot provides guidance within the interaction, from which inexperienced users in particular can benefit. It also communicates what is interesting to it, which inexperienced users might not be aware of. Not least, using robot initiative to determine the objects to learn bypasses the error-prone visual analysis of human demonstration behavior. If both the label and the appropriate grasping technique are known for an object, the robot grasps it on its own initiative and puts it away. The user, in turn, can trigger or abort a grasping action at any time. In contrast to the Home-Tour scenario and the later Curious Flobi scenario, however, users cannot demonstrate objects on their own initiative. They can test the robot and check its knowledge by asking test questions about learnt objects and the grip technique appropriate for a specific object. An overview of the robot’s interaction capabilities is given in table 6.2.

| Initiative | Situation | Example dialog |
|---|---|---|
| Robot | Asking for label | R: What is that? ⟨pointing⟩ |
| | | H: This is a banana. |
| | Asking for grip | R: How can I grasp the banana? |
| | | H: With the power grasp. |
| | Grasping | R: I am going to grasp the banana. |
| | | R: I start grasping now. |
| | | R: ⟨grasping⟩ |
| | | R: OK! |
| Human | Grasping instruction | H: Grasp the apple! |
| | | R: OK. I start grasping now. |
| | | R: ⟨grasping⟩ |
| | | R: OK! |
| | Interrupting system | H: Stop! |
| | | R: OK, I’ll stop. ⟨stops grasping⟩ |
| | | R: OK! |
| | Test questions | H: How would you grasp the apple? |
| | | R: With the power grasp. |
| | | H: What objects do you already know? |
| | | R: I know the apple and the banana. |
| | | H: What objects are present on the table? |
| | | R: Two apples and one lemon. |

Table 6.2: Example dialogs in the Curious Robot scenario.

6.2.2 System Overview

The platform used for the Curious Robot scenario consists of two Mitsubishi robot arms fixed to the ceiling, with a left and a right Shadow robot hand attached, combined with an anthropomorphic robot torso in the background that serves as interaction partner. The setup is shown in figure 6.4. Sensors not visible in the figure are an overhead camera and a headset microphone.

Figure 6.4: The Curious Robot setup.

The software system is composed of three subsystems for speech and dialog management, visual analysis, and motor activities. Initially, the Sunshine Dialog system was used for dialog management, but in the course of the development process it was transparently replaced by the Moonlight system, as already mentioned in section 5.2. The subsystem for visual analysis generates the robot initiative, e.g. asking for objects, based on visual bottom-up saliency. Based on the saliency of the region an object is located in, and on the context information known about the object (i.e. label and appropriate grip type), the vision subsystem proposes an interaction goal that the robot pursues: “acquire label”, “acquire grip type”, or “grasp”. The interaction goals are initiated as a task that is executed by the dialog system, if the dialog situation permits this.
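The goal proposal described above can be sketched as follows. This is a minimal illustration, not the vision subsystem's actual implementation: the `ObjectHypothesis` type, the function names, and the rule of picking the single most salient region are assumptions made for the example. Only the three goal names and the dependence on label and grip type come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ObjectHypothesis:
    """Hypothetical record for one detected object region."""
    region_id: int
    saliency: float              # bottom-up saliency; the scale is an assumption
    label: Optional[str] = None  # None until the tutor has named the object
    grip: Optional[str] = None   # None until the tutor has given a grip type


def propose_goal(obj: ObjectHypothesis) -> str:
    """Map the known context of an object to one of the three goals."""
    if obj.label is None:
        return "acquire label"
    if obj.grip is None:
        return "acquire grip type"
    return "grasp"


def next_interaction_goal(
    objects: List[ObjectHypothesis],
) -> Optional[Tuple[int, str]]:
    """Assumed policy: attend to the most salient region first."""
    if not objects:
        return None
    target = max(objects, key=lambda o: o.saliency)
    return target.region_id, propose_goal(target)
```

In this sketch, an unlabeled but highly salient object yields “acquire label” before any already known object is grasped. Whether the proposed goal is actually pursued remains the decision of the dialog system.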

The motor subsystem controls grasping and performs pick-and-place operations using three basic grasp prototypes. Operations can be triggered via a task interface.

Thus, the coordination between the three subsystems relies exclusively on the Task State Protocol. In the course of the scenario development, the task life-cycle was extended by the states cancel, cancel accepted/failed, update, update accepted/failed and intermediate_result, and a dedicated toolkit for requesting and monitoring tasks was developed [LHS+10]. Regarding task coordination, the temporally extended grasping action provided an interesting use case. For the first time, tasks were split into subtasks: the interaction goal “grasp”, proposed as a dialog task by the visual analysis, is executed by the dialog system, which in turn initiates a grasp task for the action subsystem. Also for the first time, interleaving subdialogs were realized by allowing the user to ask test questions during an ongoing grasping action. Based on the newly introduced task states, it also became possible to cancel an ongoing action.
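To make the extended life-cycle more concrete, the following sketch models a task as a small state machine and replays a grasp subtask that the user cancels. It is an illustration under stated assumptions, not the toolkit from [LHS+10]: the base states (initiated, accepted, rejected, completed, failed), the transition table, and the names `Task` and `demo_cancelled_grasp` are hypothetical. Only the cancel, update and intermediate_result states are taken from the text.

```python
from enum import Enum
from typing import Optional


class TaskState(Enum):
    # Base states: assumed for this sketch, not listed in this section.
    INITIATED = "initiated"
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    COMPLETED = "completed"
    FAILED = "failed"
    # States named in the text as extensions of the life-cycle.
    CANCEL = "cancel"
    CANCEL_ACCEPTED = "cancel accepted"
    CANCEL_FAILED = "cancel failed"
    UPDATE = "update"
    UPDATE_ACCEPTED = "update accepted"
    UPDATE_FAILED = "update failed"
    INTERMEDIATE_RESULT = "intermediate_result"


S = TaskState
# States in which the task is assumed to keep running.
RUNNING = {S.ACCEPTED, S.INTERMEDIATE_RESULT, S.UPDATE_ACCEPTED,
           S.UPDATE_FAILED, S.CANCEL_FAILED}
FROM_RUNNING = {S.COMPLETED, S.FAILED, S.CANCEL, S.UPDATE,
                S.INTERMEDIATE_RESULT}

# Assumed transition table; terminal states have no entry.
TRANSITIONS = {
    S.INITIATED: {S.ACCEPTED, S.REJECTED},
    S.CANCEL: {S.CANCEL_ACCEPTED, S.CANCEL_FAILED},
    S.UPDATE: {S.UPDATE_ACCEPTED, S.UPDATE_FAILED},
}
TRANSITIONS.update({state: FROM_RUNNING for state in RUNNING})


class Task:
    """A task with an optional parent, so tasks can be split into subtasks."""

    def __init__(self, name: str, parent: Optional["Task"] = None) -> None:
        self.name = name
        self.parent = parent
        self.state = S.INITIATED
        self.history = [S.INITIATED]

    def advance(self, new_state: TaskState) -> None:
        """Move to new_state, or raise if the assumed table forbids it."""
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(
                f"{self.name}: {self.state.value} -> {new_state.value} not allowed")
        self.state = new_state
        self.history.append(new_state)

    def is_terminal(self) -> bool:
        return self.state not in TRANSITIONS


def demo_cancelled_grasp() -> Task:
    """Dialog task 'grasp' spawns a motor subtask that the user interrupts."""
    dialog_task = Task("grasp (dialog task)")
    dialog_task.advance(S.ACCEPTED)
    motor_task = Task("grasp (motor subtask)", parent=dialog_task)
    for step in (S.ACCEPTED, S.INTERMEDIATE_RESULT, S.CANCEL, S.CANCEL_ACCEPTED):
        motor_task.advance(step)
    return motor_task
```

Keeping the allowed transitions in one table makes it easy to see why a running grasp can still accept a cancel request, while a task that has already ended cannot be advanced any further.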

6.2.3 Evaluation: A Video Study

The scenario was evaluated by means of a video study. Ten participants who had no prior experience with the system were asked to watch a video in which a person interacted with the system. The video was stopped at preset times, and the participants were asked what they would do in this situation. The questions were always asked after the robot had acted, but before the person in the video reacted to it, to guarantee an unbiased answer. The evaluation originally aimed to explore several aspects: the expectations raised by the robot’s appearance, how users interpret a faulty situation (where, for example, the robot points at an empty spot), how they would recover from such situations, and the effectiveness of the robot’s guidance. The description below focuses on the last of these aspects; a more detailed description of the study has been published in [LPS+09].

To investigate the effectiveness of the robot’s guidance, two contrasting situations were compared, in which the robot’s behavior did or did not call for a specific user reaction. The first situation was immediately after the robot’s label query (“What is that?”), while the second one was after the grip query (“How should I grasp the ...?”). The first question can be answered very intuitively by simply naming the object, but the second question is somewhat confusing because it does not specify which aspect of grasping it refers to. This difference is well reflected in the results of the study, shown in table 6.3. The participants’ replies to the first question were very consistent: only three constructions were used, all of them slight variations of each other. In contrast, the second question was much more open and accordingly yielded more variation in user behavior. Answers referred to five fundamentally different aspects of grasping, and there were also many variations in the specific wording. Moreover, in the guided situation, participants answered more quickly (5 seconds vs. 19 seconds, measured from the end of the question to the end of the answer) and fewer of them required clarification from the experimenter (1 vs. 5 participants) than in the situation where the robot provides less guidance.

As an aside, the robot’s grip query was also intended to reveal how subjects intuitively describe grasping. It was observed that 7 out of 10 participants, perhaps unconsciously, complemented their verbal description of grasping with a gesture. This suggests that demonstrating an action is more natural than describing it. Consequently, this feature was added in a later iteration of the Curious Robot scenario, in which grasping may also be demonstrated non-verbally using a data glove [LPH+10].

| Situation | Answer or aspect described | % of participants |
|---|---|---|
| “What is that?” | “That is a ...” | 70% |
| | “a ...” | 20% |
| | “a yellow ...” | 10% |
| “How should I grasp the ...?” | Effector position relative to object | 30% |
| | Trajectory of effector | 20% |
| | Fingers to use | 40% |
| | Force to use | 30% |
| | Grasp point on object | 20% |

Table 6.3: Replies after system initiative.
