2.3 ANNOTATIONS & METRICS
2.3.1 Common annotations & metrics
2.3.1.1 Test scores
Both tests (pretest and posttest) contain 26 multiple-choice questions and 4 essay questions. For each test we calculate a score (PRE score for the pretest, POST score for the posttest) defined as the number of correctly[7] answered multiple-choice questions. The 4 essay questions are not used because scoring them requires manual annotation of answer correctness (see VanLehn et al. (2007) for alternatives to scoring essay questions).
Table 5 shows the average and the standard deviation of the PRE/POST scores for our corpora. Most PRE averages are similar, with the exception of S05PR, which has a higher average. Although POST scores cannot be directly compared across experiments, they fall in a similar range, with the exception of the NMPrelim experiment (a direct effect of users working through only 2 of the 5 problems). Interestingly, POST scores in the Main experiment are not affected by limiting the instruction to walkthrough dialogues and disabling the essay interpretation component.
[7] Note that since the test interface allowed users to go back and review their answers, the answer with the
Table 5. PRE/POST scores for all corpora: average (standard deviation)
Experiment  Condition  PRE         POST
F03         F03        12.5 (4.4)  17.9 (4.6)
S05         S05SYN     12.5 (4.4)  18.2 (4.2)
S05         S05PR      13.5 (3.3)  18.6 (3.5)
NMPrelim    F          12.8 (4.8)  15.2 (4.6)
NMPrelim    S          12.0 (3.6)  14.6 (3.6)
Main        R          12.6 (4.2)  19.1 (3.1)
Main        PI         12.5 (4.4)  19.4 (2.8)
Main        NM         12.6 (4.1)  18.8 (3.9)
2.3.1.2 Learning metrics
The primary performance metric for tutoring systems is learning due to interaction with the system. Other metrics are also important but secondary to learning (e.g. user satisfaction, see Section 2.2.3.1; dialogue efficiency, see Sections 2.3.3.3 and 2.3.3.4). There are several ways of measuring learning. All use the PRE/POST scores but differ in the perspective they offer.
The simplest way to measure learning is the POST score. Ideally, we would like to see all users achieve a perfect test score. This metric disregards the fact that users come in with different levels of knowledge. As a result, improvement is not measured: users with the same POST score are treated identically regardless of how high or low their PRE score was. For example, two users who reach a POST of 20 are treated the same whether they started with a PRE of 6 or of 19.
To account for the PRE score, other metrics look in various ways at the distance between the PRE and the POST score. Learning gain is defined as the arithmetic difference between the POST score and the PRE score. However, this metric assigns the same amount of learning to users who improve from a PRE of 6 to a POST of 7 and to users who improve from a PRE of 19 to a POST of 20. In addition, it suffers from a “ceiling effect”: users with a higher PRE score have less room to improve than users with a lower PRE score (e.g. users with a PRE of 20 have a maximum learning gain of 6, while users with a PRE of 6 have a maximum learning gain of 20). For these reasons, we will not use this metric in our analyses.
The Normalized Learning Gain (NLG) fixes the learning gain issues by normalizing the distance between PRE and POST to the distance between PRE and the maximum POST. In our case NLG = (POST - PRE)/(26 - PRE). When PRE/POST scores are measured as a fraction of the maximum score (e.g. in our case, PRE divided by 26), NLG is defined as (posttest - pretest)/(1 - pretest). In effect, this metric measures the improvement relative to the maximum possible improvement: an NLG of 0.0 means no improvement, an NLG of 0.5 means the user is half-way there, and an NLG of 1.0 means maximum improvement.
However, even NLG has an important drawback: stability issues for higher PRE scores. More specifically, the higher the PRE score, the more sensitive the NLG score is to small variations in the POST score. For example, if a student starts with a PRE of 22 and achieves a POST score of 25, the NLG is 0.75. However, if the student had missed one of the 25 correctly answered questions (e.g. lost concentration on that problem or was misled by the problem text), the NLG would drop to 0.50. Thus, one small mistake turns a high learner into a medium learner (the average NLG is around 0.5 in most corpora). The same is not true of users with a lower PRE score. One way to address this issue is to eliminate users with higher PRE scores from the analyses (see Section 2.3.3.2).
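The two gain metrics and the stability issue can be illustrated with a minimal Python sketch. The function names are our own, and the low-PRE comparison (PRE of 6) is an illustrative case rather than a value from the corpora; the maximum score of 26 is the number of multiple-choice questions in each test.

```python
MAX_SCORE = 26  # number of multiple-choice questions in each test


def learning_gain(pre: int, post: int) -> int:
    """Arithmetic difference between the POST and PRE scores."""
    return post - pre


def normalized_learning_gain(pre: int, post: int, max_score: int = MAX_SCORE) -> float:
    """NLG = (POST - PRE) / (max_score - PRE)."""
    if pre >= max_score:
        # The denominator is zero: a perfect PRE leaves no room to improve.
        raise ValueError("NLG is undefined when PRE equals the maximum score")
    return (post - pre) / (max_score - pre)


# High PRE: one missed question moves NLG by 0.25.
print(normalized_learning_gain(22, 25))  # 0.75
print(normalized_learning_gain(22, 24))  # 0.5

# Low PRE (illustrative): one missed question moves NLG by only 0.05.
print(normalized_learning_gain(6, 16))   # 0.5
print(normalized_learning_gain(6, 15))   # 0.45
```

With a PRE of 22 each question is worth 1/4 of the remaining room, whereas with a PRE of 6 it is worth 1/20, which is why the same single mistake has such different effects.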
When comparing a control condition and an experimental condition in terms of NLG, a metric commonly reported in pedagogical studies is the effect size (e.g. (Bloom, 1984)). Effect size is defined as (average NLG experimental - average NLG control)/(standard deviation NLG control) and measures the improvement offered by the experimental condition. An effect size of 1.0 corresponds to approximately one letter grade of improvement. An effect size of 2.0 for adult tutoring in replacement of classroom instruction (Bloom, 1984) has been the main catalyst behind work on computer-based tutors.
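As a minimal sketch, the effect size definition above can be computed as follows. The function name and the NLG values are made up for illustration and do not come from our corpora. The use of the sample standard deviation (statistics.stdev) is an assumption; the definition above does not say whether the sample or the population form is intended.

```python
from statistics import mean, stdev
from typing import Sequence


def effect_size(nlg_experimental: Sequence[float], nlg_control: Sequence[float]) -> float:
    """(mean NLG experimental - mean NLG control) / std. dev. of NLG control.

    Assumption: sample standard deviation (n - 1 in the denominator).
    """
    return (mean(nlg_experimental) - mean(nlg_control)) / stdev(nlg_control)


# Hypothetical per-user NLG values, for illustration only.
control = [0.2, 0.4, 0.6]
experimental = [0.5, 0.7, 0.9]
print(round(effect_size(experimental, control), 2))  # 1.5
```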
As another alternative to measuring learning, we can apply the ANCOVA test and use the adjusted posttest scores (posttest scores that account for the pretest score). Details will be discussed in Section 2.4.4.
To test if certain phenomena are associated with learning, correlations and partial correlations are typically used. Details will be discussed in Section 2.4.2.
2.3.1.3 User turn transcripts
During the interaction with ITSPOKE, there is a spoken dialogue between the system and the user: the system asks questions and the user answers. Two transcripts of each user turn are available: the system transcript and the manual transcript. The system transcript is obtained by running the user speech through the Automated Speech Recognition (ASR) component, Sphinx II. The top recognition hypothesis is used. This transcript is interpreted semantically by the Why2-Atlas backend to determine the appropriate system response.
Because the system transcript is not perfect, a human annotator transcribed all user turns after each experiment. The annotator used a web interface to listen to the student speech and transcribe its content. Non-linguistic events (e.g. laughs, coughs, sighs, background noise, etc.) were ignored when transcribing.
Figure 6 shows an example of the system transcript (ASR lines) and the human transcript (STD lines).
2.3.1.4 Correctness
ITSPOKE uses the Why2-Atlas backend to drive the conversation. The Why2-Atlas backend has a semantic interpretation component which identifies concepts in the user input. Based on which concepts are present and on the authored tutoring information, Why2-Atlas decides whether additional statements or questions are needed before moving on to the next question. A deterministic procedure that uses the output of the semantic interpretation component and the authored tutoring information was developed to assign one of 3 correctness labels to any user input: Correct, Partially Correct and Incorrect. The system can ask the student to provide multiple pieces of information in her answer (e.g. the question “Try to name the forces acting on the packet. Please, specify their directions.” asks for both the names of the forces and their directions). If the student answer is correct and contains all pieces of information, it is labeled as Correct (e.g. “gravity, down”). The Partially Correct label is used for turns where part of the answer is correct but the rest is either incorrect (e.g. “gravity, up”) or omitted (e.g. “gravity”). Turns that are completely incorrect (e.g. “no forces”) are labeled as Incorrect.
Depending on the user turn transcript (2.3.1.3) that is fed to the semantic interpretation component, two versions of correctness can be automatically computed: ASR correctness (ASEM) and transcript correctness (TSEM). An additional correctness label “Unable to Answer” was created to automatically mark turns where the user used either variations of “I don’t know” or simply did not say anything.
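The labeling scheme can be sketched in Python as below. This is only an illustration of the four labels, not the actual procedure: the real procedure operates on the output of the Why2-Atlas semantic interpretation component, whereas this sketch assumes a hypothetical, simplified representation in which both the expected answer and the student answer are dictionaries mapping a piece of information to its value. The function name and the dictionary keys are our own.

```python
from typing import Dict, Optional


def correctness_label(expected: Dict[str, str], answer: Optional[Dict[str, str]]) -> str:
    """Assign a correctness label under the simplified representation.

    expected: required pieces of information and their correct values.
    answer:   pieces found in the student turn; None or empty when the
              student said nothing or a variation of "I don't know".
    """
    if not answer:
        return "Unable to Answer"
    # Count required pieces that are present and have the correct value.
    n_correct = sum(1 for piece, value in expected.items() if answer.get(piece) == value)
    if n_correct == len(expected):
        return "Correct"
    if n_correct > 0:
        # Some pieces are right; the rest are wrong or omitted.
        return "Partially Correct"
    return "Incorrect"


expected = {"force": "gravity", "direction": "down"}
print(correctness_label(expected, {"force": "gravity", "direction": "down"}))  # Correct
print(correctness_label(expected, {"force": "gravity", "direction": "up"}))    # Partially Correct
print(correctness_label(expected, {"force": "gravity"}))                       # Partially Correct
print(correctness_label(expected, {"force": "none"}))                          # Incorrect
print(correctness_label(expected, None))                                       # Unable to Answer
```

Running the same function on the concepts extracted from the system transcript and from the manual transcript would yield the ASEM and TSEM labels, respectively.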
Figure 6 shows the two versions of correctness (ASEM and TSEM labels). When speech recognition problems cause the system and human transcripts to differ enough, the ASEM and TSEM labels can be different (e.g. Figure 6, STD1).
2.3.1.5 Speech recognition problems (SRP)
Three types of SRP have been annotated in the corpus: Rejections, ASR Misrecognitions and Semantic Misrecognitions. Rejections occur when ITSPOKE is not confident enough in the recognition hypothesis; it then discards the current recognition and asks the student to repeat (e.g. Figure 6, STD3). When ITSPOKE recognizes something different from what the student actually said (i.e. the human transcript differs from the system transcript, 2.3.1.3) but was confident in its recognition hypothesis, we call this an ASR Misrecognition (e.g. Figure 6, STD1,2).
Semantic accuracy is more relevant for dialogue evaluation, as it does not penalize for word errors that are unimportant to overall utterance interpretation. For ITSPOKE, the semantic interpretation is defined in terms of correctness (2.3.1.4). We define Semantic Misrecognition as cases where ITSPOKE was confident in its recognition hypothesis and the correctness interpretation of the system transcript (ASEM) is different from the correctness interpretation of the manual transcript (TSEM) (e.g. Figure 6, STD1).
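The three SRP definitions can be summarized in a short Python sketch. The Turn record, its field names and the example turn are hypothetical. Comparing transcripts after lowercasing and collapsing whitespace is an assumption, since the text does not specify how transcript differences are determined. A turn can carry both misrecognition labels, as STD1 in Figure 6 does.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    asr_transcript: str    # system transcript (top ASR hypothesis)
    human_transcript: str  # manual transcript
    rejected: bool         # True when the system was not confident and asked to repeat
    asem: str              # correctness label computed from the system transcript
    tsem: str              # correctness label computed from the manual transcript


def _normalize(text: str) -> str:
    # Assumption: case and extra whitespace are not treated as differences.
    return " ".join(text.lower().split())


def srp_labels(turn: Turn) -> List[str]:
    """Return the speech recognition problems present in a user turn."""
    if turn.rejected:
        return ["Rejection"]
    labels = []
    # Both misrecognition types require a confident (non-rejected) recognition.
    if _normalize(turn.asr_transcript) != _normalize(turn.human_transcript):
        labels.append("ASR Misrecognition")
    if turn.asem != turn.tsem:
        labels.append("Semantic Misrecognition")
    return labels


# Made-up turn: a word error that also changes the correctness label.
turn = Turn("gravity up", "gravity down", False, "Partially Correct", "Correct")
print(srp_labels(turn))  # ['ASR Misrecognition', 'Semantic Misrecognition']
```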