7.4 Evaluation: A PARADISE-style User Study

7.4.1 User Study Setup

Of the 32 participants who took part, we used the recordings of 28 (14 male, 14 female); the remaining four were excluded due to technical problems during their trials. Most participants had been recruited at a university event for the general public and thus represented a wide age range, with a mean age of 33.5 years (minimum 21, maximum 79).

On a scale from 1 (none) to 6 (lots of), the average rating for knowledge of computers was 5.07, of speech systems 2.52, of robot systems 1.96 and of programming experience 2.26. Participants were compensated for taking part in the experiment.

Instructions

In order to study natural demonstration behavior, participants received as little instruction as possible. They received written instructions, specifying that they were to engage in interaction with the robot Flobi, and that Flobi was supposed to learn object labels during interaction. They were also advised to check that the robot had actually learned the labels. It was, however, not specified how they were to present and check objects. They were told that they should interact with the robot as long as they wished, with 5-10 minutes recommended as a guideline. Also, they were informed that they could begin the interaction by greeting the robot, and end the interaction by saying goodbye. In addition, participants were advised not to be discouraged by speech recognition problems, and that they could repeat or rephrase their utterance in such cases. Last, an emergency phrase (“Restart”) was provided. The interactions were in German. No other person was present in the room during the interaction. A translated instruction hand-out can be found in Appendix E.

Wizard control

As described in section 7.3, the system is not fully autonomous, but contains two WOz components: reference resolution and ROI selection. In the study, the experimenter first instructed the participants, then left the room and took the role of the wizard. The wizard control station was located in an adjacent room, where the robot’s field of view was displayed on a computer screen. The wizard’s tasks were to identify the objects that were referred to and to mark them in a graphical user interface. Moreover, in conditions C2 and C3 (which will be described below) the wizard had to trigger robot initiative by marking, in the graphical user interface, the objects the robot should ask for. The participants were not told that the system was partially controlled by the experimenter.

Objective measures

A wide range of objective measures was collected, most of which were derived from system logs. For each component, the relevant event notifications were logged, such as speech recognition results, text-to-speech output, dialog pattern state changes as well as object recognition and reference resolution tasks. These log data allow a detailed reconstruction of the interaction. The data was also annotated manually based on the video material to capture inappropriate robot utterances and the correctness of the robot’s answers to test questions. In total, we used 28 of these measures for evaluation (see table 7.5). As proposed in the PARADISE framework [WLKA97], we divided them into the categories dialog efficiency, dialog quality and task success. Technically, the interactional aspects of system performance (i.e. the dialog quality and dialog efficiency measures) are calculated mainly based on information related to the Interaction Patterns, whereas the Task State Protocol provides information at task level (i.e. the task success measures).

The dialog efficiency measures capture the rapidity of the interaction and include, for example, the duration of the interaction, the number of user and robot utterances within a certain time unit, the mean length of user utterances, and the number of object learning episodes within a certain time unit. The dialog quality measures address the smoothness of the interaction. We considered, for example, gaps, overlaps, repairs and label corrections. The task success measures concentrate on the outcome of the interaction with respect to object learning. Among others, we measured the proportion of successful reference resolution and object learning tasks, the proportion of correct robot answers to test questions, and the user’s out-of-capability utterances.
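As an illustration of how such efficiency measures might be derived from logged events, the following is a minimal sketch. The `Utterance` record, the speaker labels and the toy log are assumptions made for this example; they do not reflect the actual log format of the system.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Hypothetical log record; the real system logs are not specified here."""
    speaker: str    # "user" or "robot" (assumed labels)
    start_s: float  # start time in seconds since the interaction began
    text: str


def efficiency_measures(log, duration_s):
    """Compute a few simple dialog efficiency measures from an utterance log."""
    minutes = duration_s / 60.0
    user = [u for u in log if u.speaker == "user"]
    robot = [u for u in log if u.speaker == "robot"]
    # Utterance length is approximated by the number of whitespace-separated words.
    words = [len(u.text.split()) for u in user]
    return {
        "duration_s": duration_s,
        "user_utts_per_min": len(user) / minutes,
        "robot_utts_per_min": len(robot) / minutes,
        "mean_user_utt_len": sum(words) / len(words) if words else 0.0,
    }


# Made-up example log, for illustration only.
log = [
    Utterance("user", 0.0, "Hello Flobi"),
    Utterance("robot", 1.5, "Hello"),
    Utterance("user", 4.0, "This is a lemon"),
    Utterance("robot", 6.0, "Lemon, okay"),
]
measures = efficiency_measures(log, duration_s=120.0)
```

Normalizing counts by the interaction duration keeps sessions of different lengths comparable, which matters here because participants were free to interact as long as they wished.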

Subjective measures

In addition to the objective measures described above, we collected subjective measures based on a questionnaire that the participants were asked to complete after the interaction with Flobi had finished. We attempted to rely on standardized questionnaires as far as possible. In this regard, a trade-off had to be found between validated but rather generic questions and more informative but non-validated ones.

The questionnaire consisted of 50 items, which we aggregated into seven category measures. The first four categories, dialog efficiency, task success, cooperativeness and usability, refer to the interaction itself. They contain questions to assess the participants’ impression of dialog efficiency and task success, how cooperative they felt the robot was during the interaction, and how they rated the usability of the system. The interaction-oriented items are roughly based on the evaluation of the COMIC dialog system [WFOB05], which we adapted for our specific scenario. The remaining three categories, likeability, perceived intelligence and animacy, address the participants’ impression of the robot. They were adopted from the standardized Godspeed questionnaire¹ [BCK08].

In addition, the questionnaire included five single (summarizing) questions, targeting the overall impression of ease, efficiency, clarity, pleasantness and understandability of the interaction. All questions were answered on a six-point Likert scale. The complete questionnaire can be found in Appendix F.
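The aggregation of items into category measures could be sketched as follows. The text does not state how items were combined, so averaging is an assumption here, and the item identifiers and the item-to-category mapping are purely hypothetical.

```python
from statistics import mean

# Hypothetical mapping of questionnaire items to categories; the real
# assignment of the 50 items is given in the questionnaire itself.
CATEGORIES = {
    "dialog_efficiency": ["q1", "q2"],
    "likeability": ["q3", "q4", "q5"],
}


def category_scores(responses, categories=CATEGORIES):
    """Aggregate six-point Likert responses into one mean score per category.

    Assumption: a category measure is the mean of its items.
    """
    for item, value in responses.items():
        if not 1 <= value <= 6:
            raise ValueError(f"{item}: {value} is outside the six-point scale")
    return {
        name: mean(responses[item] for item in items)
        for name, items in categories.items()
    }


# Made-up answers of a single participant.
responses = {"q1": 4, "q2": 5, "q3": 6, "q4": 5, "q5": 4}
scores = category_scores(responses)
```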

1 However, we skipped the categories anthropomorphism and perceived safety, as we considered them irrelevant for the scenario at hand.

Performance functions

The objective and subjective measures were related to each other by means of a PARADISE-style evaluation. This evaluation method uses stepwise multiple linear regression to predict subjective measures, such as user satisfaction, from several objective performance dimensions, such as task success, dialog quality, or dialog efficiency (cf. chapter 2.3.1). The performance functions that result from this analysis supply answers to questions such as: Which are the relevant factors that contribute to user satisfaction? Which components need to be optimized in future iterations of the system? The results are, to a certain extent, generalizable to similar systems.
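To make the idea of a performance function more concrete, the following is a simplified sketch of forward stepwise regression on z-score-normalized measures. It is not the analysis used in the study: predictors are added greedily as long as R² improves by at least `min_gain`, an assumed threshold that stands in for the significance tests of a proper stepwise procedure. All names and the toy data are hypothetical.

```python
def zscores(values):
    """Normalize values to zero mean and unit (sample) standard deviation."""
    n = len(values)
    m = sum(values) / n
    sd = (sum((v - m) ** 2 for v in values) / (n - 1)) ** 0.5
    return [(v - m) / sd for v in values]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting.

    Assumes a is non-singular, i.e. no two predictors are perfectly collinear.
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(columns, y):
    """Least-squares fit without intercept (all inputs are z-scored)."""
    k = len(columns)
    xtx = [[sum(p * q for p, q in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * q for p, q in zip(columns[i], y)) for i in range(k)]
    w = solve(xtx, xty)
    pred = [sum(w[i] * columns[i][t] for i in range(k)) for t in range(len(y))]
    ss_res = sum((p - t) ** 2 for p, t in zip(pred, y))
    ss_tot = sum(t ** 2 for t in y)  # y has zero mean after z-scoring
    return w, 1 - ss_res / ss_tot


def forward_stepwise(measures, target, min_gain=0.05):
    """Greedily add the predictor that raises R^2 most; stop on small gains."""
    y = zscores(target)
    z = {name: zscores(vals) for name, vals in measures.items()}
    selected, weights, best_r2 = [], [], 0.0
    while len(selected) < len(z):
        candidates = []
        for name in z:
            if name not in selected:
                w, r2 = fit([z[s] for s in selected + [name]], y)
                candidates.append((r2, name, w))
        r2, name, w = max(candidates)
        if r2 - best_r2 < min_gain:
            break
        selected.append(name)
        weights, best_r2 = w, r2
    return dict(zip(selected, weights)), best_r2


# Made-up data: satisfaction depends linearly on task success only.
measures = {
    "task_success": [0.2, 0.4, 0.5, 0.7, 0.9, 1.0],
    "repairs": [3, 1, 4, 1, 5, 2],
}
satisfaction = [2.0, 3.0, 3.5, 4.5, 5.5, 6.0]
weights, r2 = forward_stepwise(measures, satisfaction)
```

In this toy example only `task_success` enters the model. The resulting weights play the role of the coefficients of a performance function: because all measures are normalized, their magnitudes can be compared directly.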

The PARADISE approach originally suggests the Kappa coefficient as a measure of task success. The Kappa coefficient can be used to measure how many of the concepts were transmitted correctly during an interaction (cf. chapter 2.3.1). It is suitable for classical information-seeking domains, but does not appropriately cover the complex task structure of an action-oriented robotic scenario. Hence, the Kappa coefficient was replaced by the objective task success measures described above.
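For reference, the Kappa coefficient that was replaced here can be sketched as follows. The sketch assumes the usual PARADISE formulation, in which agreement is read off a confusion matrix and chance agreement is estimated from its column sums; the matrix values are made up.

```python
def kappa(confusion):
    """Kappa = (P(A) - P(E)) / (1 - P(E)) for a square confusion matrix.

    Assumption: chance agreement P(E) is estimated from the column sums,
    as in the PARADISE formulation.
    """
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    # P(A): proportion of cases on the diagonal (actual agreement).
    p_agree = sum(confusion[i][i] for i in range(size)) / total
    # P(E): agreement expected by chance.
    col_sums = [sum(row[j] for row in confusion) for j in range(size)]
    p_chance = sum((t / total) ** 2 for t in col_sums)
    return (p_agree - p_chance) / (1 - p_chance)


# Made-up confusion matrix for two attribute values.
confusion = [[45, 5],
             [5, 45]]
k = kappa(confusion)
```

With 90 of 100 values on the diagonal and a chance agreement of 0.5, this yields a Kappa of 0.8. The measure presupposes that task success can be expressed as a fixed set of attribute values, which is what makes it a poor fit for the object learning scenario.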

Moreover, in contrast to the original PARADISE method, user satisfaction was not assessed by a single target variable, but broken down into the different subjective measures described above, such as ease or efficiency of the interaction. These rather abstract concepts were further broken down into several items that are easier for users to assess. For example, the intuitiveness of interaction was assessed by questions like “I found the last object easier to teach than the first one”, and the dialog quality was assessed by asking about the appropriateness of the robot’s utterances regarding content and timing. Additionally, the summarizing items described above, which ask for the users’ overall impression, were used directly as target variables.

Between-subjects factor

Moreover, we were interested in the influence of the robot taking initiative. The degree of task initiative of the robot was varied as a three-level between-subjects factor:

• Condition C1 (User Initiative) allows for user initiative only.

• Condition C2 (Mixed Initiative) allows for both user and robot initiative.

• Condition C3 (Structured Initiative) is identical to C2, except that the robot additionally yields initiative explicitly.

More specifically, in condition C1 the only way to teach the robot objects was to demonstrate the objects one after another. In condition C2, learning was performed in mixed initiative. The robot asked for an object label on its own initiative at the beginning of the interaction. In later stages of the interaction, the robot would also ask for object labels, provided that no other interaction episode was ongoing. In condition C3, the robot yielded initiative explicitly after having asked for two object labels at the start of the interaction (“You can show me something, too”). The differences between the groups regarding subjective and objective measures were evaluated by means of an analysis of variance (ANOVA).
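The group comparison can be illustrated with a minimal one-way ANOVA that computes the F statistic for a single measure across the three conditions. This is a sketch only: the ratings below are made up, and the p-value, which requires the F distribution (available, for example, through `scipy.stats.f_oneway`), is not computed.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a one-way between-subjects ANOVA."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the individual values around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


# Made-up ratings for one subjective measure in conditions C1, C2 and C3.
c1 = [1, 2, 3]
c2 = [2, 3, 4]
c3 = [3, 4, 5]
f, df_b, df_w = one_way_anova([c1, c2, c3])
```

For the toy data this gives F(2, 6) = 3.0. In the study, such a test would be run separately for each subjective and objective measure, with the condition as the grouping factor.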

7.4.2 Results