Main Experiment

2.3 ANNOTATIONS & METRICS

2.3.3 Main Experiment

For the Main corpus, we first show how we partitioned the population into a high and a low split based on metrics of interest (2.3.3.1). We then identify a particular subset of interest in this corpus (2.3.3.2). We also computed two other metrics (2.3.3.3 and 2.3.3.4). Note that these metrics, the subset and the splitting procedure are not specific to this corpus and can be reproduced in any other ITSPOKE corpus. We only needed them for the analyses that compare the three conditions from the Main experiment: R, NM and PI.

2.3.3.1 High-low splits

It is common practice in tutoring research to investigate the relationship between users' aptitudes and their performance while working with the system (e.g. McNamara and Kintsch, 1996; VanLehn et al., 2007; VanLehn et al., 2005; Ward and Litman, 2006). Several studies have shown that the treatment condition can produce effects only on specific subsets of the population and that, in some cases, the treatment has opposite effects depending on the subset. Typically, the subsets are generated by splitting the user population at the mean or median of the aptitude metric, which yields a high subset and a low subset. We used a mean split in this work.

One of the aptitudes measured in this study is the initial physics knowledge (the PRE score, see 2.3.1.1). The PRE Split was generated by splitting users at the mean PRE score in the Main Experiment (Mean PRE = 12.5). Table 6 shows the number of users in each condition for each low/high split. Details for the other aptitude metric, the working memory span, are available in Appendix A.4.3.

Table 6. Number of users in each PRE split subset

             PRE Split
Condition    Low    High
R            15     10
PI           16     11
NM           16     11
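The mean split described above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the function name `mean_split`, the user identifiers and the example scores are hypothetical, and the handling of users who score exactly at the mean (assigned to the low subset here) is an assumption, since the text does not specify it.

```python
from statistics import mean


def mean_split(scores):
    """Label each user 'low' or 'high' relative to the mean of the scores.

    `scores` maps a user id to an aptitude score (e.g. the PRE score).
    Returns the labels and the threshold that was used.
    """
    threshold = mean(scores.values())
    # Assumption: a score exactly equal to the mean falls in the low subset.
    labels = {
        user: ("high" if score > threshold else "low")
        for user, score in scores.items()
    }
    return labels, threshold


# Hypothetical PRE scores for four users (not data from the corpus).
example_pre = {"u1": 8, "u2": 11, "u3": 14, "u4": 17}
example_labels, example_threshold = mean_split(example_pre)
```

A median split would follow the same pattern with `statistics.median` in place of `mean`.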

2.3.3.2 RELNLG Subset

In Section 2.3.1.2 we mentioned that the NLG score is less stable for users with a higher PRE. Thus, in our learning-related analyses we will also investigate what happens if we eliminate these users. But what is a good cutoff threshold? We decided to use the average POST score in the R condition (19.1 in Table 5). From a learning perspective, users with an initial knowledge (PRE score) higher than the average post-instruction knowledge (POST score) are less interesting: they already know more than the population will achieve on average. Since in this experiment the instruction was not tailored to user essays (see Section 2.2.4), it is very likely that the system discussed with these users many things they already knew. As another argument, even if these users did not work with the system, they would likely achieve a higher than average POST score, since the pretest and posttest are isomorphic. Other studies remove users with a perfect PRE score (e.g. Chi and VanLehn, 2008).

The remaining subset (i.e. users with a PRE score of 19 or less) is called the RELNLG subset (reliable NLG). The subset contains only 22 of the 25 R users, 25 of the 27 NM users and 25 of the 27 PI users, for a total of 72 users. As we will see, all of our learning-related results become clearer on this subset.

Note that the PRE Split was not recomputed for the RELNLG subset (i.e. we did not split based on the mean PRE of the RELNLG subset). As a result, when looking at the PRE Split, all the low pretester subsets remain the same; the removed students come out of the high pretester subsets.
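The interaction between the RELNLG filter and the PRE Split can be illustrated with a short sketch. The cutoff of 19.1 comes from the text; the function name, user identifiers and scores are hypothetical, and ties at the mean are again assumed to fall in the low subset.

```python
from statistics import mean

RELNLG_CUTOFF = 19.1  # average POST score in the R condition, per the text


def relnlg_with_original_split(pre_scores, cutoff=RELNLG_CUTOFF):
    """Keep users whose PRE score does not exceed the cutoff.

    The low/high label of each kept user is computed against the mean of
    the FULL population, i.e. the split is not recomputed on the subset.
    """
    full_population_mean = mean(pre_scores.values())
    kept = {}
    for user, pre in pre_scores.items():
        if pre > cutoff:
            continue  # excluded from the RELNLG subset
        # Assumption: a score exactly at the mean counts as low.
        kept[user] = "high" if pre > full_population_mean else "low"
    return kept


# Hypothetical PRE scores (not data from the corpus); the mean is 13.
example_pre = {"a": 6, "b": 10, "c": 14, "d": 22}
example_subset = relnlg_with_original_split(example_pre)
```

In this toy example only user `d` is removed, and it comes out of the high subset, which mirrors the observation above that the low pretester subsets are unchanged.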

2.3.3.3 Dialogue time

We define the dialogue time as the amount of time users spend conversing with the system in each problem. More precisely, the dialogue time is the duration between the start time of the first tutor turn in the dialogue for a given problem and the end time of the last tutor turn in that dialogue. Note that we could have used the total time spent on each problem. However, besides the dialogue time, this duration also includes the time the user spent reading the problem and typing the initial essay and its revision. In addition, users were told they could take breaks during the essay part. Thus, the total time spent on each problem is a less reliable measure.

We compute the dialogue time for each problem (P1Time-P5Time) and a total dialogue time as the sum of the five problem dialogue times (TotalTime).
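As a rough sketch of this computation, assume each turn is logged with a problem number, a speaker, and start and end timestamps in seconds. The `Turn` record, its field names and the example timestamps are hypothetical and do not reflect the actual ITSPOKE log format.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    problem: int   # problem number, 1 to 5
    speaker: str   # "tutor" or "user"
    start: float   # start time in seconds
    end: float     # end time in seconds


def dialogue_times(turns):
    """Return per-problem dialogue times and their sum.

    For each problem, the dialogue time runs from the start of the first
    tutor turn to the end of the last tutor turn.
    """
    spans = {}
    for turn in turns:
        if turn.speaker != "tutor":
            continue
        first, last = spans.get(turn.problem, (turn.start, turn.end))
        spans[turn.problem] = (min(first, turn.start), max(last, turn.end))
    per_problem = {p: last - first for p, (first, last) in spans.items()}
    return per_problem, sum(per_problem.values())


# Hypothetical log for two problems (not data from the corpus).
example_turns = [
    Turn(1, "tutor", 0.0, 10.0),
    Turn(1, "user", 11.0, 13.0),
    Turn(1, "tutor", 14.0, 30.0),
    Turn(2, "tutor", 100.0, 112.0),
    Turn(2, "user", 113.0, 115.0),
    Turn(2, "tutor", 116.0, 140.0),
]
example_per_problem, example_total = dialogue_times(example_turns)
```

Here the per-problem values play the role of P1Time and P2Time, and the sum corresponds to TotalTime. The gap between the two problems (essay reading and typing) is excluded by construction.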

The dialogue time is influenced by two factors: the recognition performance and user correctness. Since in the Main experiment the dialogue was not tailored to the user's initial essay, all users went through the same instruction for each problem. Deviations from this plan are only due to speech errors (i.e. rejections and timeouts) and/or incorrect answers (e.g. an incorrect answer will engage a remediation subdialogue for certain questions). Thus, the dialogue time is a good measure of the overall correctness of users and of how well they are recognized by the system. The latter is in turn a result of the ASR performance and the users' ability to adapt their answers to what the system expects them to say (especially for correct answers).

The time users spend with a system is an important metric, as researchers strive for learning efficiency when designing tutoring systems: deliver increased learning as fast as possible. When no improvements in learning over a baseline are observed, achieving the same amount of learning in a shorter time span is still a positive result (e.g. VanLehn et al., 2007).

2.3.3.4 Number of system turns

We define the number of system turns as the total number of turns the system has uttered in the dialogue for a problem. Note that this number also includes system turns that deal with speech errors (i.e. repetition of the last turn for timeouts and “Could you please repeat that” turns for rejections). Also, a tutor turn can include one or more goals (e.g. for some incorrect user answers, the next system turn will include a correction of the incorrect answer and the next question).

We compute the metric for each problem (P1Tut-P5Tut) and as a sum over all problems (TutTotal).
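Counting system turns can be sketched in a few lines, assuming the log is available as a sequence of (problem, speaker) pairs. That representation, the function name and the example data are hypothetical.

```python
from collections import Counter


def system_turn_counts(turns):
    """Return the number of tutor turns per problem and the overall sum.

    `turns` is an iterable of (problem, speaker) pairs. Repeated or
    "please repeat" system turns are counted like any other tutor turn.
    """
    counts = Counter(problem for problem, speaker in turns if speaker == "tutor")
    return dict(counts), sum(counts.values())


# Hypothetical log: in problem 2 the tutor repeats a question after a timeout,
# so two tutor turns occur back to back.
example_turns = [
    (1, "tutor"), (1, "user"), (1, "tutor"), (1, "user"),
    (2, "tutor"), (2, "tutor"), (2, "user"),
]
example_counts, example_total = system_turn_counts(example_turns)
```

The per-problem counts correspond to P1Tut and P2Tut in this toy log, and the sum corresponds to TutTotal.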

The difference between the dialogue time metric (2.3.3.3) and this metric is that the number of system turns ignores the duration of the system turns and the duration of the user turns⁸, where duration depends on the number of words in these turns and the speaking rate. In addition, the number of turns stays the same regardless of the correctness of the user answer for questions that do not require remediation subdialogues. In general, an increase in the number of turns is due to extra remediation dialogue. In these cases, the metric incorporates the size of the remediation dialogue in system turns. Rejections due to speech problems can also increase the number of turns through additional system turns that handle these rejections (e.g. “Could you please repeat that?” system turns). Another phenomenon that increases the number of system turns is timeouts (i.e. when users do not answer the question in the allotted time). In such cases, the system simply repeats the question. Note that the number of user turns is similar to the number of system turns, since the interaction follows a question-answer format where the system asks questions and the user has to answer.

⁸ The duration of the user turns represents a very small proportion of the total dialogue time in ITSPOKE (about 6%). In the Main experiment, the average total dialogue time is 2550 seconds, of which 147 seconds is the average total user turn duration.

In document Applications of Discourse Structure for Spoken Dialogue Systems (Page 43-46)
