
6.4 Analysing Wizard Performance

6.4.4 Results

Looking at the results of the first study (i.e. Study A), we can see a certain adaptation to our new wizard interface as well as to the task. Over the span of 17 experiment sessions there was a trend towards decreased selection time for response utterances as well as faster filtering of domain data, which resulted in quicker dialogue turns (cf. Figure 6.13). Furthermore, the wizard reported relying on the notification mechanism, which was particularly useful in one case where the experiment set-up experienced connection problems. On the other hand, the study also exhibited a certain humanisation of the responses over time. For example, during the first experiment sessions the wizard used dedicated confirmation utterances to acknowledge customer input. Over time, however, this behaviour changed and the use of ‘OK’ as a way of confirmation increased. Figure 6.14 shows the trend of those two parameters throughout the course of the 17 experiment trials. Looking at the two tested modes (i.e. text and speech) separately, we furthermore find that in speech-based interaction the wizard seemed to use ‘OK’ more often (max=8, median=4) and confirmation utterances slightly less often (max=3, median=1) than in text-based interaction (‘OK’: max=6, median=3; confirmation utterances: max=4, median=1).

Figure 6.14 – Types of utterances used to confirm customer (i.e. test participant) input over the course of 17 experiment trials. [Plot not reproduced; x-axis: experiment trials 1 to 17; y-axis: type of utterances used; series: dedicated confirmation utterance, ‘OK’ to confirm.]
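The per-mode maxima and medians reported above can be derived from simple per-trial counts. The following is a minimal sketch, assuming each trial is stored as a record with a mode and the two counts; the record layout and the key names `mode`, `ok` and `confirmation` are hypothetical and not taken from the study's logs.

```python
from statistics import median


def summarise(counts):
    """Return the maximum and median of a list of per-trial counts."""
    return {"max": max(counts), "median": median(counts)}


def summarise_by_mode(trials):
    """Group per-trial counts by interaction mode and summarise each group.

    `trials` is a list of dicts with the hypothetical keys
    'mode', 'ok' and 'confirmation'.
    """
    by_mode = {}
    for trial in trials:
        bucket = by_mode.setdefault(trial["mode"], {"ok": [], "confirmation": []})
        bucket["ok"].append(trial["ok"])
        bucket["confirmation"].append(trial["confirmation"])
    return {
        mode: {kind: summarise(values) for kind, values in kinds.items()}
        for mode, kinds in by_mode.items()
    }
```

Reporting the median rather than the mean keeps a single unusually long session from dominating the summary, which matters with only 17 trials split across two modes.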

The challenge of consistently simulating machine-like behaviour was also apparent in cases where the wizard responded to input that seemed unlikely to be understood by a machine. In general, estimating the threshold between what can realistically be processed by a system and what should instead trigger an error recovery routine can be seen as a major challenge when applying the method, particularly if consistency needs to be achieved over the span of many experiment sessions. Peissner et al. [2001] recommend the use of probabilistic errors to simulate more realistic behaviour. In our case, we observed that the wizard tried to solve this issue by always using the same set of utterances to start as well as to end a dialogue (including simulated recognition errors). For the main part of the conversation, however, this usage of a strict set of utterances was not applicable.
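To illustrate what probabilistic errors could look like in a wizard interface, the following sketch replaces a wizard response with a simulated recognition error at a fixed probability. This is only one possible reading of the recommendation; the function name `maybe_inject_error`, the error wording and the idea of a single fixed error rate are assumptions made for illustration, not details taken from Peissner et al. or from our tool.

```python
import random

# Hypothetical wording of a simulated recognition error.
ERROR_UTTERANCE = "Sorry, I did not understand that."


def maybe_inject_error(response, error_rate, rng):
    """Return ERROR_UTTERANCE with probability `error_rate`, else `response`.

    `rng` is a random.Random instance, so a session can be made
    reproducible by seeding it.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must lie between 0 and 1")
    # rng.random() is in [0, 1): a rate of 0 never triggers, a rate of 1 always does.
    return ERROR_UTTERANCE if rng.random() < error_rate else response
```

Leaving the decision to a random draw instead of the wizard's judgement keeps the simulated error behaviour comparable across sessions, at the cost of occasionally rejecting input that a real system would have handled.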

Comparing some of these results from Study A with the second study (i.e. Study B), we find an interesting difference in wizard behaviour. During the first trials of Study B the wizard’s response time was stable, even slightly decreasing (not significant), which points to better wizard performance due to task and interface habituation, similar to what was already observed in Study A. After some time, however, it started increasing again (cf. Figure 6.15, line ‘IQR of the measured response times’). With respect to this result it is important to point out that, unlike Study A, this second study was less structured and depended more on a test participant’s performance. That is, in Study A the dialogue and its possible system utterances (i.e. the utterances the wizard was able to use) were entirely pre-defined. Study B, however, enabled the wizard to interact with a participant in a chat-like fashion. While some of the utterances were still pre-defined and sent by clicking a button, we implemented the possibility to give context-specific feedback, i.e. feedback that could be adapted to the unpredictable performance of a test participant pronouncing a word or sentence in English. To do so, the wizard was able to choose a pre-defined utterance and change it so that the feedback fitted the result produced by the pronunciation analysis. In cases where changing a pre-defined utterance was too cumbersome, the wizard could alternatively type the complete feedback and send it.

If we look at how the wizard used this feature over time, we see that, except for the second trial, where the test participant had problems operating the client interface and the wizard therefore needed to send numerous clarification utterances, the number of utterances processed in this way stayed more or less consistent (mean=12.54, mode=11, median=11) throughout the course of all 13 experiment trials. Analysing the content of the generated utterances more closely, however, we see that the average number of words used started increasing with the seventh trial (cf. Figure 6.15, line ‘Average number of words produced in an utterance’). While we are unsure why the wizard changed his feedback style after the sixth session (Note: As the experiment was used to develop a corpus of feedback utterances, the goal of the wizard could have been to test an increased number of possibilities), the increasing number of typed words is likely the cause of the increase in response time. The connection between those two parameters becomes even more apparent if we calculate the 90th percentile of the measured response time (Note: We use the 90th percentile in order to clean the data of the exceptionally long response times that sometimes occurred due to unforeseen participant actions; e.g. a participant took a long time to respond to a wizard response) and compare it with the average number of words produced in an utterance (cf. Figure 6.15, lines ‘90th percentile of the measured response time’ and ‘Average number of words produced in an utterance’). A Pearson correlation test showed a highly significant positive linear correlation between these two variables: r(11)=0.9469, p=9.336e-07.
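The analysis described above (a 90th percentile of response times per trial, correlated with the average number of words per utterance) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the procedure actually used in the study; in particular, the ‘inclusive’ quantile method is an assumption, since the text does not state how the percentile was interpolated, and all function names are hypothetical.

```python
import math
import statistics


def percentile_90(values):
    """90th percentile of one trial's response times.

    quantiles(n=10) returns nine cut points; the last one is the
    90th percentile. The 'inclusive' method is an assumption.
    """
    return statistics.quantiles(values, n=10, method="inclusive")[-1]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic and degrees of freedom for testing r against zero."""
    df = n - 2
    return r * math.sqrt(df / (1 - r * r)), df
```

For the reported r=0.9469 over 13 trials, `t_statistic` gives 11 degrees of freedom, matching the r(11) notation, and a t value of about 9.77. The p-value then follows from the t distribution with 11 degrees of freedom, which is not part of the standard library and is therefore left out of this sketch.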

What we see here is an example that shows how one can successfully adapt to the wizard task (Note: This was the first time our wizard was engaged in this role) as well as to its operation interface, leading to measurable performance improvements over time. Yet, we also see that this type of familiarisation can lead to changing wizard behaviour. Producing longer utterances, as we observed in our example, points to a certain humanisation of wizard responses that comes with growing experience. From a participant’s perspective, longer and more complex utterances let a system appear more intelligent and can therefore create a false impression of the product’s potential. In cases where the perceived system performance changes throughout the duration of an interaction, this might furthermore lead to frustration.

Figure 6.15 – IQR of the measured response time per trial, average number of words produced in an utterance per trial, and 90th percentile of the measured response time per trial. While initially the response time decreases, more words per utterance reverse the effect later on. [Plot not reproduced; x-axis: experiment trials 1 to 13; y-axis: seconds and number of words.]

If we look at the variation of produced utterance lengths within a trial, we do see that the average standard deviation in the first six trials is slightly lower (mean SD=3.12) than in the subsequent ones (mean SD=4.23), indicating a greater lack of consistency in the later trials. However, this figure is highly influenced by the actual performance of individual test participants, and so, after having analysed the relevant content in more detail, we believe that participants did not experience any change in system behaviour. Yet between experiment trials we see a difference, where the first instances resembled a less ‘sophisticated’ system than the later ones. The results of the questionnaires that were given to participants after they had finished a test suggest that they did not experience a difference, as the feedback quality was consistently rated with 4 or 5 on a five-point Likert item ranging from 1 ‘not enough’ to 5 ‘very good’.

However, while for a single test participant the intelligence of a system (beyond a certain threshold) might not matter, it can very well influence the results of a study and therefore highlights an important challenge of simulation.
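The within-trial variation of utterance lengths discussed above (mean SD of the first six trials versus the subsequent ones) can be computed along the following lines. This is a sketch under stated assumptions: words are counted by splitting on whitespace, the sample standard deviation is used, and each trial is given as a plain list of typed utterances. None of these choices are confirmed by the text, and the function names are hypothetical.

```python
from statistics import fmean, stdev


def word_count(utterance):
    """Number of whitespace-separated words in an utterance."""
    return len(utterance.split())


def within_trial_sd(utterances):
    """Sample standard deviation of word counts within one trial.

    Requires at least two utterances per trial.
    """
    return stdev([word_count(u) for u in utterances])


def mean_sd(trials):
    """Average of the within-trial standard deviations over several trials."""
    return fmean([within_trial_sd(trial) for trial in trials])
```

Given a list `trials` of the 13 trials in order, `mean_sd(trials[:6])` and `mean_sd(trials[6:])` would correspond to the two figures compared in the text.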

In the following, another consistency challenge, namely the time a wizard gives test participants to read a text utterance, is discussed as part of a meta-analysis conducted over the observed WOZ experiments.

In document Supporting Wizard of Oz experimentation for language technology applications (Page 123-126)