5.0 USER SIMULATION FOR TWO DIALOG SYSTEM DEVELOPMENT

5.1.1 The dialog strategy learning task for ITSPOKE

In this section, we introduce our task of learning a new dialog strategy to handle student certainty in the ITSPOKE system (introduced in Section 3.1). We assess the quality of different user simulations by comparing the quality of the dialog strategies learned from each simulated corpus. A user simulation is considered better if the dialog strategy trained on the simulated corpus it generates is better.

The current ITSPOKE system[17] can only respond to the correctness of a student’s utterances; the system thus ignores other underlying information, for example certainty, which is believed to provide useful information for the tutor. In a previous study (Forbes-Riley and Litman, 2005), each student utterance was manually annotated as certain, uncertain, neutral, or mixed based on both lexical and prosodic information. In this study, we use a two-way classification of certainty: certain (cert) and not-certain (ncert), where we collapse uncertain, neutral, and mixed into ncert to balance our data. In addition, each student utterance is automatically judged as correct (c) or incorrect (ic) by the system and kept in the system’s logs. Percent incorrectness (ic%) is also automatically calculated and logged. Recall that in Section 4.2.1 we manually clustered tutor questions into 20 clusters. Therefore, each tutor utterance is associated with a cluster (e.g., 3rdLaw). An example coded dialog is shown in Table 27.

[17] Here we refer to the ITSPOKE system that was available at the time of our study in 2007. A new version of the system that handles uncertainty is now available (Forbes-Riley et al., 2008).

Recall that our user simulations work on the word level by using the student answers in the human corpus as the candidate answers for the simulated students (Section 3.1). Here, we simulate student certainty in a very simple way: when a simulation model selects an utterance, it outputs the certainty label originally coded with that utterance.

In the s05pre corpus we collected with the ITSPOKE system (described in Section 3.1), the strength of the tutor’s minimal feedback (defined below) is strongly correlated with the percentage of student certainty (chi-square test, p<0.01). Strong Feedback (SF) is when the tutor clearly states whether the student’s answer is correct or incorrect (e.g., “This is great!”); Weak Feedback (WF) is when the tutor does not comment on the correctness of a student’s answer or gives slightly negative feedback such as “well”. Therefore, we want to develop a new dialog strategy that manipulates the strength of the tutor’s minimal feedback in order to maximize students’ overall certainty over the entire dialog. We keep the other parts of the tutor feedback (e.g., explanations, questions) unchanged, so the system’s original design, which aims to maximize the percentage of correct student answers, is still utilized. A sample coded dialog is shown in Table 27.

Table 27. Sample Coded Dialog

ITSPOKE1: Do you recall what Newton's third law says? [3rdLaw]

Student1: Force equals mass times acceleration. [ic, c%=0, ncert]

ITSPOKE2: Well, Newton's third law says … If you hit the wall harder, is the force of your fist acting on the wall greater or less? [3rdLaw, WF]

MDP configuration. Recall from Section 2.3 that an MDP has four main components: states, actions, a policy, and a reward function. In this study, the actions allowed in each dialog state are SF and WF; the policy we are trying to learn specifies, for every state, whether the tutor should give SF or WF in order to maximize users’ percent certainty in the dialog. In the experiments in Sections 5.1.2, 5.1.3, and 5.1.4, we use a simple state space representation (referred to as SSR1), which is described by the correctness of the current student turn and the percent incorrectness so far. The reward function (referred to as RF1) assigns +100 to every dialog that has a percent certainty higher than the median of the training corpus, and -100 to every dialog that has a percent certainty below the median. Another state space representation, SSR2, and another reward function, RF2, are introduced in the experiments in Section 5.1.3 to explore the influence of different MDP configurations on the quality of the learned dialog strategy. Other MDP parameter settings are the same as described in (Tetreault et al., 2006).
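To make the configuration concrete, the following is a minimal Python sketch of SSR1, the two actions, and RF1. It is an illustration under stated assumptions, not the implementation used in this study: the two-way bucketing of percent incorrectness, its 50% threshold, the value returned for a dialog exactly at the median, and all identifier names are hypothetical, since the text above does not specify them.

```python
from dataclasses import dataclass

# The two tutor actions available in every dialog state.
ACTIONS = ("SF", "WF")  # Strong Feedback, Weak Feedback


@dataclass(frozen=True)
class SSR1State:
    correctness: str  # "c" or "ic" for the current student turn
    ic_bucket: str    # discretized percent incorrectness so far


def bucket_ic(ic_percent, threshold=50.0):
    """Assumed discretization: the text does not say how ic% is binned."""
    return "high" if ic_percent > threshold else "low"


def make_state(correctness, ic_percent):
    """Build an SSR1 state from the logged turn features."""
    return SSR1State(correctness, bucket_ic(ic_percent))


def rf1_reward(percent_certainty, training_median, tie_reward=0):
    """RF1: +100 above the training median, -100 below it.

    tie_reward is an assumption; the text does not cover dialogs
    whose percent certainty equals the median.
    """
    if percent_certainty > training_median:
        return 100
    if percent_certainty < training_median:
        return -100
    return tie_reward


state = make_state("ic", 100.0)
reward = rf1_reward(percent_certainty=60.0, training_median=50.0)
```

Because the reward is assigned once per dialog, it would attach to the final transition of each dialog when the MDP is estimated from a corpus.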

Evaluating learned strategies

We learn dialog strategies from the simulated dialog corpora. In our experiments, we run a user simulation with the dialog system 8,000[18] times to simulate 8,000 users, each of whom completes 5 dialogs with the system, generating a simulated corpus of 40,000 dialogs in total. We use the dialog strategy learned from a simulated corpus to represent the quality of the simulated corpus. There are different ways to evaluate the learned dialog strategy. One way is to implement the learned strategy in the original system and then test the effectiveness of the new system in maximizing student certainty. In our study, we introduce an evaluation measure (referred to as EM1) that evaluates the new dialog strategy by counting the number of dialogs that would be assigned +100 according to RF1. A policy is considered better if it increases the number of dialogs that will be assigned +100. Similar to previous studies (e.g., Schatzmann et al., 2007b; Lemon et al., 2006), we test the new dialog system with a user simulation that can generate behaviors similar to those of human users, i.e., the CLU model, since it is the most human-like user simulation we built for the ITSPOKE system (shown in Section 4.1). In our experiments, we simulate a group of 100 CLU simulated users, each of whom interacts with the ITSPOKE system to generate 5 dialogs, to create an evaluation corpus that is of comparable size to that of (Schatzmann et al., 2005b). The baseline of EM1 is 250, since half of the 500 dialogs will be assigned +100 using a median split.

[18] We empirically discovered that the dialog strategies learned from simulated corpora of this size (8,000*5=40,000 dialogs) are stable.
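The counting behind EM1 can be sketched as follows. This is a minimal illustration, assuming that each evaluation dialog is available as a list of per-turn certainty labels and that percent certainty is the share of student turns labeled cert; the function names and the toy data are hypothetical.

```python
def percent_certainty(turn_labels):
    """Percentage of student turns labeled "cert" in one dialog."""
    certain_turns = sum(label == "cert" for label in turn_labels)
    return 100.0 * certain_turns / len(turn_labels)


def em1(eval_dialogs, training_median):
    """Count the evaluation dialogs that RF1 would score +100,
    i.e. those whose percent certainty exceeds the training median."""
    return sum(
        percent_certainty(dialog) > training_median
        for dialog in eval_dialogs
    )


# Toy evaluation corpus of three dialogs (invented for illustration).
toy_dialogs = [
    ["cert", "cert"],    # 100% certain
    ["cert", "ncert"],   # 50% certain
    ["ncert", "ncert"],  # 0% certain
]
toy_score = em1(toy_dialogs, training_median=50.0)
```

With the 500-dialog evaluation corpus described above, a policy would be judged against the baseline of 250 by comparing its count with that number.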

In the experiments in Sections 5.1.2 and 5.1.3, we implement the learned dialog strategies in the original dialog system and evaluate them using EM1. Since we use another reward function, RF2, in Section 5.1.3, we also introduce a corresponding evaluation measure, EM2, in that section. However, since first implementing a new dialog strategy and then testing it can be a complicated process, Williams and Young (2007b) use Expected Cumulated Reward (ECR) as an estimate of the quality of the dialog strategy learned by reinforcement learning (in our case, with an MDP). Tetreault et al. (2007) further introduce an approach for constructing a confidence interval for ECR, which gives a sense of how reliable the learned dialog strategy is. We use ECR to evaluate learned dialog strategies in Section 5.1.4.
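As a rough illustration of how an ECR-style estimate can be obtained, the sketch below runs value iteration on a toy MDP and then weights the value of each start state by the probability that a dialog starts there. It assumes that ECR is this start-state-weighted sum of state values; the toy states, transition probabilities, rewards, discount factor, and function names are all invented for illustration and are not taken from our experiments.

```python
def value_iteration(states, actions, trans, gamma=0.9, tol=1e-9):
    """Compute optimal state values.

    trans maps (state, action) to a list of
    (probability, next_state, reward) triples; states with no
    entries are treated as terminal and keep value 0.
    """
    values = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q_values = [
                sum(p * (r + gamma * values[s2]) for p, s2, r in trans[(s, a)])
                for a in actions
                if (s, a) in trans
            ]
            if not q_values:  # terminal state
                continue
            best = max(q_values)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


def expected_cumulative_reward(values, start_probs):
    """Weight each start state's value by its start probability."""
    return sum(p * values[s] for s, p in start_probs.items())


# Toy model: two student-correctness states plus a terminal state.
toy_states = ["c", "ic", "end"]
toy_actions = ["SF", "WF"]
toy_trans = {
    ("c", "SF"): [(1.0, "end", 100)],
    ("c", "WF"): [(1.0, "end", -100)],
    ("ic", "SF"): [(1.0, "end", -100)],
    ("ic", "WF"): [(0.5, "c", 0), (0.5, "end", -100)],
}
toy_values = value_iteration(toy_states, toy_actions, toy_trans)
toy_ecr = expected_cumulative_reward(toy_values, {"c": 0.5, "ic": 0.5})
```

In this toy model the value of state c is 100 and the value of state ic is -5, so the estimate is 0.5 * 100 + 0.5 * (-5) = 47.5.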

Previous studies (e.g., Paek, 2006) have pointed out that the MDP configuration has a strong impact on the quality of learned dialog strategies. Therefore, the quality of the learned dialog strategies depends not only on the quality of the user simulations, but also on the MDP configuration. Since our focus here is to compare the effectiveness of different simulated corpora in the dialog strategy learning task, ideally we would like to factor out the impact of different MDP configurations by experimenting with different state space representations and different reward functions. In Section 5.1.4, we apply another approach to factor out the impact of the MDP configuration. We introduce an evaluation measure that compares the simulated corpora directly by calculating the transitional probabilities that are represented in the corpora. The transitional probability distribution in a user corpus has a direct impact on the quality of the dialog strategy trained on the corpus. Therefore, we can expect to see differences in the dialog strategies learned from user corpora with different transitional probability distributions, although the differences we observe do not tell us which strategy is better and which is worse.
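The transitional probabilities mentioned above can be estimated from a corpus by relative frequency. The following is a minimal sketch, assuming each dialog is logged as a sequence of (state, action, next state) triples; the tuple layout, the function name, and the toy corpus are illustrative only, and the comparison measure itself is the one described in Section 5.1.4.

```python
from collections import Counter, defaultdict


def transition_probabilities(dialogs):
    """Estimate P(next_state | state, action) by relative frequency.

    dialogs is a list of dialogs, each a list of
    (state, action, next_state) triples.
    """
    counts = defaultdict(Counter)
    for dialog in dialogs:
        for state, action, next_state in dialog:
            counts[(state, action)][next_state] += 1
    return {
        state_action: {
            next_state: n / sum(next_counts.values())
            for next_state, n in next_counts.items()
        }
        for state_action, next_counts in counts.items()
    }


# Toy corpus of two short dialogs (invented for illustration).
toy_corpus = [
    [("c", "SF", "c"), ("c", "SF", "ic")],
    [("c", "SF", "c"), ("ic", "WF", "c")],
]
toy_probs = transition_probabilities(toy_corpus)
```

Two corpora generated by different user simulations would yield two such tables, and differences between them indicate that the strategies learned from the corpora may differ.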

Table 28 summarizes the MDP configurations and evaluation measures for the experiments in Section 5.1, as explained above.

Table 28. Summary of Experimental Configurations

Experiments Configuration Section

In document User Simulation for Spoken Dialog System Development (Page 79-83)