A General Evaluation Framework to Assess Spoken Language Dialogue Systems: Experience with Call Center Agent Systems


Marcela Charfuelán, Cristina Esteban López

Jose Relaño Gil, Ma. Carmen Rodríguez, Luis Hernández Gómez

Dep. SSR ETSIT-UPM Ciudad Universitaria Madrid (Spain) [email protected]

Speech Technology Group, Telefónica Investigación y Desarrollo, S.A. C. Emilio Vargas, 6 28043 Madrid (Spain)

Abstract

In this paper we present our experience evaluating two prototypes of call-center agent systems. We describe the general framework we used for the collection and annotation of dialogue evaluation databases. We also present our results using the well-known PARADISE framework to derive a system performance function. The relative importance of different cost measurements for the SLDS prototypes under evaluation is also discussed.

1. Introduction

As Spoken Language Dialogue Systems (SLDSs) become more and more attractive for a wide range of applications, there is an increasing demand for standardized benchmarks to test and compare their performance. The Spoken Language community has made significant progress towards this goal (Walker et al., 1998; Price et al., 1992; Minker, 1998), and most of the proposals for spoken dialogue evaluation are based on the use of information from properly designed evaluation dialogue corpora. Generally these corpora are extracted from log files while the evaluated system is running, and no specific or standardized annotation procedures are used to represent the relevant information, though some proposals are presented in (DARPA, 1999; Isard et al., 1998; Dybkjoer et al., 1998).

Our experience with SLDS evaluation is related to the creation of call center agents based on a spoken dialogue system (Alvarez et al., 1996; Relaño et al., 1999) developed at Telefónica I+D in Spain: one called “ATOS system”, whose domain is mainly telephone functions, and the other called “Voice PORTAL system”, whose domain is basically access to information by telephone. These agents were developed as prototypes, for which we designed an experimental evaluation procedure to organize the information collected in log files during an actual evaluation. This experimental evaluation framework has already been presented in (Charfuelán et al., 2000); it can be summarized in three main aspects: an annotation scheme, an annotation tool and automatic extraction of dialogue metrics from annotated corpora.

After collecting our dialogue evaluation databases, we extract from them metrics and statistics such as the average number of user and system turns, the number of tasks completed, user satisfaction, etc. These metrics were used to calculate a predictive performance function of the system, as proposed in the PARADISE framework (Walker et al., 1998; Bonneau-Maynard et al., 2000).

First, in Section 2 the evaluation framework is briefly reviewed. In Section 3 we describe the dialogue system characteristics, architecture and functionality common to the two prototypes. Section 4 presents details of the evaluation environment and Section 5 describes the dialogue evaluation databases obtained. Finally, Section 6 contains an analysis of the results after applying the PARADISE framework, and conclusions are drawn in Section 7.

2. Overview of the Evaluation Framework

2.1. Annotation scheme

As shown in Figure 1, we follow a two-step annotation process, at utterance level and at dialogue level. At utterance level we perform two complementary tasks: processing logged information (for example, the system response and the recognizer’s and parser’s outputs) and adding all manual objective or subjective information (such as the transcription of user utterances, or whether the recognizer’s or parser’s outputs correctly captured the task-related information in the utterance). The dialogue level is more global; here, information related to the dialogue structure is added (for example, segments of dialogue corresponding to the starting and ending points of a particular SLDS task, or error recovery segments). Once this dialogue structure mark-up is completed, a set of simple automatic procedures is applied to obtain dialogue metrics and statistics, as we will show in Section 5. The output of the complete process is stored in annotated XML files.

Figure 1: Block diagram of the global annotation process

2.2. Annotation Methodology and Tools

The annotation methodology combines manual procedures carried out by a human annotator with automatic processing of annotated data. At utterance level we needed easy access to the audio speech file, so we developed an annotation tool for this level that we call ULAT (Utterance Level Annotation Tool). This tool supports:

Manual transcription of user’s turns having a controlled access to the audio file.

Automatic extraction of information related to system turns, recognizer’s and parser’s outputs, and subjective information of the user from log files and external information files.

Inclusion of subjective information from a human evaluator or annotator, for example, whether or not the user’s concept (dialogue act) is lost after the speech recognizer and parser analysis.

The inputs to the ULAT tool are: a recorded audio file of the dialogue, a log file provided by the system and an external information file (questionnaires). The annotator only has to mark a section of the speech wave corresponding to a turn, listen to it and check whether the information that the ULAT tool presents for the turn is correct. The beginning and end samples are obtained by the tool and recorded automatically in the output file.

At dialogue level we can use different XML tools, because the files generated by the ULAT tool are in XML format. For example, we have used the MATE Workbench (MATE, 1998) to annotate tasks (dialogue segments) and add the attributes of correctness, completion and user satisfaction to these segments. After that we used the LT XML (XML, 1999) tools and the developer’s tool-kit (a C-based API) to extract evaluation metrics and statistics from the XML databases.
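The extraction of per-task metrics from the annotated XML files can be illustrated with a minimal sketch. It uses Python's standard `xml.etree.ElementTree` module rather than the LT XML C API used in this work, and the element and attribute names (`dialogue`, `task`, `turn`, `completed`, `satisfaction`, `concept_ok`) are hypothetical: the actual annotation scheme is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated dialogue; element and attribute names are
# illustrative assumptions, not the real ULAT/MATE output format.
SAMPLE = """<dialogue id="d01">
  <task name="T11" completed="yes" satisfaction="6">
    <turn speaker="system"/>
    <turn speaker="user" concept_ok="yes"/>
    <turn speaker="system"/>
    <turn speaker="user" concept_ok="no"/>
  </task>
  <task name="T12" completed="no" satisfaction="4">
    <turn speaker="system"/>
    <turn speaker="user" concept_ok="yes"/>
  </task>
</dialogue>"""


def task_metrics(xml_text):
    """Return one dictionary of simple dialogue metrics per annotated task."""
    root = ET.fromstring(xml_text)
    rows = []
    for task in root.iter("task"):
        turns = task.findall("turn")
        user_turns = [t for t in turns if t.get("speaker") == "user"]
        correct = sum(1 for t in user_turns if t.get("concept_ok") == "yes")
        rows.append({
            "task": task.get("name"),
            "turns": len(turns),                          # TT
            "completed": task.get("completed") == "yes",  # contributes to TC
            "satisfaction": float(task.get("satisfaction")),  # US
            # Percentage of correct concepts over user turns (PCC).
            "pcc": 100.0 * correct / len(user_turns) if user_turns else 0.0,
        })
    return rows


for row in task_metrics(SAMPLE):
    print(row)
```

Averaging these rows per task type would give the kind of per-task means reported in Section 5.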

3. Dialogue System Characteristics

The platform used to develop the call center agent prototypes “ATOS system” and “Voice PORTAL system” is based on the following main modules:

Natural Language Speech Recognizer.

Semantic Parser.

Dialogue Management Module.

Text-to-Speech System.

The Natural Language Speech Recognition module has a vocabulary of 4500 words with 355 first names and 983 family names. It uses a language model based on trigrams working on parts-of-speech clustering. The Speech Recognition module uses context-dependent triphones represented through Hidden Markov Models.

The Semantic Parser we use is a Spanish version of the PHOENIX parser developed at MIT, which can be described as a frame-based concept-spotting semantic parser.

The Dialogue Manager is rule-based and its design is based on a collaborative dialogue model. According to the classification of Dialogue Systems proposed by J. Allen [1], it can be described as a system with topic-based performance capabilities, adaptive single task, a minimal pair clarification/correction dialogue manager and fixed mixed-initiative.

[1] This classification was presented in the Tutorial: Dialogue Modelling by J. Allen, University of Rochester, in

Finally, the TTS system we use is the Spanish TTS developed by Telefónica Investigación y Desarrollo. It is a diphone-based system and includes rule-based prosodic modeling and LPC-based speech synthesis.

4. Evaluation Environment

4.1. Scenarios and Constraints

The evaluation of the prototypes was carried out by selecting a set of different telephone functions or tasks, with different names or numbers as data constraints. The ATOS system was designed to execute twelve different tasks, while the Voice PORTAL could execute six.

Table 1: Scenarios and Data Constraints in ATOS system

| Task | Data |
|---|---|
| T11 Phone Call | Full name |
| T12 Phone Call | Phone number |
| T13 Phone Call | Extension number |
| T41 Ask for information | “Electronic mail” and a full name |
| T42 Ask for information | “Office number” and a full name |
| T51 Multi-conference | Two full names |
| T52 Multi-conference | Two phone numbers |
| T53 Multi-conference | Two extension numbers |
| T61 Collect call | Full name |
| T62 Collect call | Phone number |
| T71 Change password | Old and new password |
| T81 Send a message | Full name |

Table 2: Scenarios and Data Constraints in PORTAL system

| Task | Data |
|---|---|
| T1 Phone Call | Full name |
| T2 Where buy something | “Music” or “books” |
| T3 Ask for information | “Electronic mail” or “Office” and a full name |
| T4 News information | “National”, “International”, “Sports”, “Cultural”, “Weather” and “Society” |
| T5 Change password | Old and new password |
| T6 Send a message | Full name |

All full names are taken from a list or database. The data in quotation marks (keywords) are information that the system expects. This does not mean that users have to say them as if they were commands; users are encouraged to use them within their own natural-language expressions.

A population of 30 subjects, all novice users of the ATOS system, was selected for the first field trial; they executed 179 tasks (approx. 3 hours 44 minutes of recording). For the second field trial, with the Voice PORTAL system, the population consisted of 17 novice users, who executed 50 tasks (approx. 50 minutes 18 seconds of recording). Every subject involved in the evaluation process was previously instructed in the basic functionality of the system and in the evaluation procedure.

For every telephone call the dialogue system generated and stored the following information data:

One speech audio file using 8-bit mu-law samples and a sampling frequency of 8 kHz. The whole dialogue was recorded in a single channel.

An ASCII log file that included information of the dialogue system as it was working: the text of every system turn and the outputs of the recognizer and semantic parser for each user turn.

Since each task in ATOS was simple, the testing procedure consisted of a sequence of six different functions executed by each subject. In this way, almost fifteen repetitions of each function or task were obtained. The same strategy was used for the Voice PORTAL system, the only difference being that in the latter each telephone call to the system was intended for more than one task.

4.2. User Satisfaction Measure

This subjective metric was obtained through a survey made at the end of the tests. Each user was asked to complete a short questionnaire about the system.

We prepared two questionnaires, one for the evaluation of each task and another for the global behaviour of the system. The questions for each task were:

Could you complete the task? (Yes/No)

How well did the system carry out this task? (1 to 10, 1 very bad and 10 excellent)

The questions for the global evaluation were (1 very low or bad, 10 high or excellent):

Level of comprehension of the system prompts.

Frequency with which you could not follow the dialogue.

Level to which the system understood you.

Degree to which the system became slow in its response time.

Give us a score (1 to 10) for the global evaluation of the system.

Finally, we averaged the answers for each task to obtain a user satisfaction score for each one. The question about task completion was also useful during the annotation stage to verify whether the user had actually completed the task (otherwise this metric might not be objective at all).


4.3. PARADISE Framework

PARADISE (PARAdigm for DIalogue System Evaluation) was proposed by Walker et al. (1998) as a general framework for evaluating and comparing the performance of spoken dialogue agents. This framework uses methods from decision theory to combine a disparate set of performance measures into a single performance evaluation function. The objective of the PARADISE structure is to maximize user satisfaction by maximizing task success and minimizing costs (efficiency measures and qualitative measures). The performance equation is estimated using multivariate linear regression, which provides a different weight for each parameter in the performance equation. These weights give us an idea of the relative contribution of the success and cost factors to user satisfaction.
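For reference, the general form of the performance function in Walker et al. (1998) can be written as follows. This is the formulation from that work, not an equation stated in this paper, and the regressions reported in Section 6 use the authors' own choice of predictor factors.

```latex
\mathrm{Performance} = \alpha \cdot \mathcal{N}(\kappa) \;-\; \sum_{i=1}^{n} w_i \cdot \mathcal{N}(c_i)
```

Here $\kappa$ is the task success measure, $c_i$ are the cost measures, $\alpha$ and $w_i$ are the weights estimated by the regression, and $\mathcal{N}$ is a Z score normalization that puts all factors on the same scale.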

The success at achieving the information requirements of the task is measured with the Kappa coefficient:

$$\kappa = \frac{P(A) - P(E)}{1 - P(E)}$$

where $P(A)$ is the proportion of times that a task has been successfully completed and $P(E)$ is the proportion of times that a task is successful by chance. (We have given $P(A)$ and $P(E)$ a slightly different meaning from the examples given in Walker et al. (1998).)
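As a small illustration of the Kappa coefficient, the sketch below assumes the standard form $\kappa = (P(A) - P(E)) / (1 - P(E))$. The function name and the input values are illustrative only and are not taken from the evaluation data.

```python
def kappa(p_success, p_chance):
    """Kappa coefficient from the observed success proportion P(A)
    and the proportion expected by chance P(E)."""
    if not 0.0 <= p_chance < 1.0:
        raise ValueError("p_chance must be in [0, 1)")
    return (p_success - p_chance) / (1.0 - p_chance)


# A task that always succeeds gets kappa = 1; a task that succeeds no more
# often than chance gets kappa = 0; below chance, kappa becomes negative.
print(kappa(1.0, 0.5))
print(kappa(0.5, 0.5))
print(kappa(0.25, 0.5))
```

A negative value therefore indicates that the task was completed less often than would be expected by chance.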

5. Dialogue Evaluation Databases

Following the two-stage annotation process at utterance and dialogue levels, we obtained two XML dialogue databases: EvalAtos and EvalPortal. Table 3 shows the performance metrics extracted from the EvalAtos database, processed to obtain the mean values for each kind of task. The first column shows the scenario and data constraints in each case. The mean dialogue metric values for the ATOS system were: 179 tasks executed (11.6 turns on average), of which 125 were successfully completed (69.3%) and 54 were not completed. The percentage of correct concepts (PCC) was 66.8%. We count a concept as correct when the parser could extract the name of the function or the data from the recognized phrase for a user turn. As will be discussed below, it is important to note that the PCC is not a precise reflection of the percentage of word recognition, which in this case is 73.6%.

Table 4 shows the performance metrics of the PORTAL system. The first column shows the scenario in each case; as noted earlier, the data constraints here are somewhat different from those of the previous database: they are more precise and less variable than a full name or a number. The mean dialogue metric values were: 50 tasks executed, of which 44 were completed successfully and 6 were not completed. The percentage of correct concepts here was 94.58%, although the percentage of word recognition was 65.32%.

6. Analysis of Evaluation Results

We have carried out a PARADISE analysis of the two systems described, based on the information extracted from the databases. Our objectives were to estimate a performance function for user satisfaction (US) and to compare the influence of different cost factors in both systems.

Table 3: Performance metrics for a PARADISE case study, “ATOS system”: US = User Satisfaction, $\kappa$ = Kappa coefficient, TT = Number of turns for each task, PCC = Percentage of Correct Concepts, PWR = Percentage of Word Recognition, TC = Percentage of tasks completed.

| Task | US | $\kappa$ | TT | PCC | PWR | TC |
|---|---|---|---|---|---|---|
| T11 Call (name) | 5.92 | 0.30 | 8.6 | 61.7 | 68.6 | 71.4 |
| T12 Call (phone number) | 6.14 | 0.22 | 8.0 | 65.2 | 75.9 | 64.2 |
| T13 Call (extension) | 7.42 | 0.46 | 7.4 | 75.1 | 75.3 | 92.8 |
| T42 Ask for information about office number | 6.28 | 0.30 | 8.0 | 68.4 | 71.1 | 71.4 |
| T51 Multi-conference (two names) | 5.93 | 0.25 | 13.2 | 72.8 | 78.4 | 66.6 |
| T53 Multi-conference (two extensions) | 6.33 | 0.42 | 16.2 | 75.3 | 75.1 | 86.6 |
| T41 Ask for information about electronic mail | 5.68 | 0.27 | 8.4 | 67.3 | 69.1 | 68.7 |
| T61 Collect call (name) | 5.87 | 0.38 | 10.2 | 70.4 | 78.5 | 81.2 |
| T62 Collect call (phone number) | 5.25 | 0.27 | 13.8 | 60.5 | 74.9 | 68.7 |
| T71 Change password | 5.50 | 0.20 | 16.8 | 72.7 | 79.5 | 62.5 |
| T81 Send a message | 4.62 | 0.33 | 16.6 | 55.2 | 71.7 | 75.0 |
| T52 Multi-conference (two phone numbers) | 3.53 | -1.16 | 11.6 | 57.5 | 65.3 | 23.0 |
| GLOBAL PERFORMANCE | 5.71 | 0.18 | 11.6 | 66.8 | 73.6 | 69.3 |

Table 4: Performance metrics for a PARADISE case study, “Voice PORTAL system”: US = User Satisfaction, $\kappa$ = Kappa coefficient, TT = Number of turns for each task, PCC = Percentage of Correct Concepts, PWR = Percentage of Word Recognition, TC = Percentage of tasks completed.

| Task | US | $\kappa$ | TT | PCC | PWR | TC |
|---|---|---|---|---|---|---|
| T1 Call (person name) | 8.16 | 1.0 | 10.3 | 93.93 | 58.6 | 100 |
| T2 Buy music or books | 9.20 | 1.0 | 7.0 | 96.0 | 68.0 | 100 |
| T3 Ask information about e-mail of a person | 9.00 | 1.0 | 15.7 | 96.4 | 66.8 | 100 |
| T4 News information | 9.07 | 1.0 | 7.3 | 98.0 | 53.9 | 100 |
| T5 Change user password | 5.08 | 0 | 22.7 | 85.9 | 69.6 | 50 |
| T6 Send a message (person name) | 9.40 | 1.0 | 10.9 | 97.1 | 74.7 | 100 |
| GLOBAL PERFORMANCE | 8.32 | 0.83 | 12.34 | 94.58 | 65.32 | 91.66 |

For this we calculated the kappa coefficient ($\kappa$) and selected the following cost measures (or predictor factors): average number of turns for each task (TT), percentage of correct concepts (PCC), percentage of word recognition (PWR), and percentage of completed tasks (TC). These are shown in Tables 3 and 4. To perform the regressions these values have to be normalized, because the different factors are on different scales; each factor $x$ is therefore normalized to its Z score:

$$\mathcal{N}(x) = \frac{x - \bar{x}}{\sigma_x}$$

where $\bar{x}$ is the mean and $\sigma_x$ is the standard deviation of $x$.

Several mathematical programs can be used to calculate the regressions. These generally give additional information about the results, for example the standard error of prediction (p in our regressions), which gives an idea of how significant the factors used are. Another important piece of information is the multiple correlation coefficient ($R^2$ in our regressions), which gives an idea of the contribution of the combined factors to the variance of US.

The method we used to select the regression model or equation is Forward Selection (Walpole & Myers, 1992). The procedure is basically to follow a sequence of regressions starting with individual factors. Each time, the most significant factor (the one giving the greatest $R^2$) is selected and combined again with the others, until the most significant combination is obtained.
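The Forward Selection procedure can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the program used for the regressions reported here: the stopping rule (`min_gain`), the use of the population standard deviation for the Z scores, and the toy data are all assumptions introduced for the example.

```python
from statistics import mean, pstdev


def zscore(values):
    """Normalize values to zero mean and unit (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def solve(a, b):
    """Solve a*w = b by Gaussian elimination with partial pivoting.
    Fails with ZeroDivisionError if the factors are perfectly collinear."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        w[i] = (m[i][n] - sum(m[i][j] * w[j] for j in range(i + 1, n))) / m[i][i]
    return w


def fit(columns, y):
    """Least-squares weights and R^2 for z-scored data (no intercept needed)."""
    gram = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]
    rhs = [sum(p * t for p, t in zip(ci, y)) for ci in columns]
    w = solve(gram, rhs)
    pred = [sum(wk * col[i] for wk, col in zip(w, columns)) for i in range(len(y))]
    ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
    ss_tot = sum(t ** 2 for t in y)  # y is z-scored, so its mean is zero
    return w, 1.0 - ss_res / ss_tot


def forward_selection(factors, target, min_gain=0.01):
    """Greedily add the factor that most increases R^2; stop when the
    improvement falls below min_gain (a hypothetical stopping rule)."""
    z = {name: zscore(vals) for name, vals in factors.items()}
    y = zscore(target)
    chosen, best_r2 = [], 0.0
    while len(chosen) < len(z):
        trials = []
        for name in z:
            if name not in chosen:
                _, r2 = fit([z[k] for k in chosen + [name]], y)
                trials.append((r2, name))
        r2, name = max(trials)
        if r2 - best_r2 < min_gain:
            break
        chosen.append(name)
        best_r2 = r2
    return chosen, best_r2


# Toy data (not from the paper): the target depends on x1 and x2 only.
FACTORS = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [1, -1, 1, -1, 1, -1],
    "x3": [3, 1, 4, 1, 5, 9],
}
TARGET = [3 * p + q for p, q in zip(FACTORS["x1"], FACTORS["x2"])]

print(forward_selection(FACTORS, TARGET))
```

On the toy data the procedure selects `x1` first and then `x2`, and leaves out `x3` because it adds nothing to the explained variance. Note that two perfectly correlated factors cannot be entered together in this sketch, since the normal equations become singular.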

The estimations made for the ATOS system are the following:

US = 0.572*TC + 0.453*PCC (p=0.000065) ($R^2$=81.07%)

US = 0.608*TC + 0.539*PCC - 0.166*PWR (p=0.00037) ($R^2$=82.63%)

US = 0.421*TC + 0.459*PCC + 0.158*$\kappa$ (p=0.00051) ($R^2$=81.41%)

US = 0.559*TC + 0.434*PCC - 0.351*TT (p=0.000005) ($R^2$=93.37%)

US = 0.542*TC + 0.395*PCC - 0.378*TT + 0.073*PWR (p=0.00004) ($R^2$=93.60%)

Initially the most significant factor was TC, which combined with PCC accounted for 81.07% of the variance in US, with a prediction error of p=0.000065.

When we combined these two factors with the others (PWR, $\kappa$ and TT) we observed the following: TT made the most significant contribution to the variance of US ($R^2$=93.37%), which is to be expected: when the number of turns in a task becomes large, even if the dialogue system can recover from word recognition errors, the greater the number of turns, the lower the user satisfaction.

With respect to the other prediction factors (PWR and $\kappa$), they seem to be less significant, but the negative contribution of PWR is somewhat strange and could give the erroneous idea that the lower the recognition rate, the higher the user satisfaction. As we can see in the last regression, which involves all the factors (except $\kappa$, which is highly correlated with TC), the PWR factor contributes to the variance but with a very small positive weight compared with the others. This means that in this system PCC is more important than PWR. In other words, US could be improved at the same level of recognition if a robust parser is used.
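To show how such a performance function is applied, the sketch below evaluates the ATOS equation US = 0.559*TC + 0.434*PCC - 0.351*TT on the per-task values of Table 3. It assumes Z scores computed with the population standard deviation over the twelve tasks, which the text does not specify, so the resulting numbers are normalized scores for illustration only and not values on the 1 to 10 questionnaire scale.

```python
from statistics import mean, pstdev

# (TC, PCC, TT) per task, copied from Table 3.
TABLE3 = {
    "T11": (71.4, 61.7, 8.6),
    "T12": (64.2, 65.2, 8.0),
    "T13": (92.8, 75.1, 7.4),
    "T42": (71.4, 68.4, 8.0),
    "T51": (66.6, 72.8, 13.2),
    "T53": (86.6, 75.3, 16.2),
    "T41": (68.7, 67.3, 8.4),
    "T61": (81.2, 70.4, 10.2),
    "T62": (68.7, 60.5, 13.8),
    "T71": (62.5, 72.7, 16.8),
    "T81": (75.0, 55.2, 16.6),
    "T52": (23.0, 57.5, 11.6),
}

# Weights of TC, PCC and TT in the fourth ATOS regression.
WEIGHTS = (0.559, 0.434, -0.351)


def zscores(values):
    """Z score normalization (population standard deviation is an assumption)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def predicted_us(table, weights):
    """Weighted sum of the normalized factors for every task in the table."""
    names = list(table)
    columns = [zscores([table[n][k] for n in names]) for k in range(len(weights))]
    return {
        n: sum(w * col[i] for w, col in zip(weights, columns))
        for i, n in enumerate(names)
    }


for task, score in sorted(predicted_us(TABLE3, WEIGHTS).items(), key=lambda kv: -kv[1]):
    print(f"{task}: {score:+.2f}")
```

Tasks with a high completion rate, a high percentage of correct concepts and few turns come out at the top of the ranking, which is the behaviour the weights describe.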

The estimations made for the Voice PORTAL system are the following:

US = 1.013*PCC + 0.122*PWR (p=0.000125) ($R^2$=98.18%)

US = 0.650*PCC + 0.140*PWR + 0.387*$\kappa$ (p=0.00024) ($R^2$=99.6%)

US = 0.892*PCC + 0.156*PWR - 0.156*TT (p=0.0012) ($R^2$=98.82%)

US = 0.650*PCC + 0.140*PWR + 0.387*TC (p=0.00024) ($R^2$=99.6%)

US = 0.624*PCC + 0.158*PWR + 0.338*TC - 0.092*TT (p=0.0028) ($R^2$=99.80%)

For the PORTAL system the most significant factor was PCC, which combined with PWR accounted for 98.18% of the variance in US, with a prediction error of p=0.000125. However, the weight of PCC is greater than that of PWR; this could be explained by the fact that this system depends more on a few specific keywords (“National”, “Sports”, “books”, etc.) than on longer sequences of words such as full names or telephone numbers.

When we combined these two factors with the others ($\kappa$, TT and TC) we observed the following: TC and $\kappa$ do not seem to contribute much to the variance. Here again we could say that these two factors are correlated. But comparing PWR and TC, we can observe that the weight of TC is greater than that of PWR, which confirms that the semantic parser and the dialogue management module performed very well in spite of the level of word recognition.

After all these regressions we find that one of the most important predictors of user satisfaction in both systems was the percentage of correct concepts (PCC), and one of the least important was the percentage of word recognition (PWR). This can be explained because, even though the PWR of the second system (65.3%) is lower than that of the first (73.6%), its PCC is much better (94.5% compared with 66.8%) and it contributes more to user satisfaction. This also shows that the way we extract the concept information (function and data) from the recognized phrase is more important than the exact word recognition of the entire phrase. Task complexity is also important. For example, in the ATOS system a concept could be formed by a name, a surname and even a number in the same utterance, so exact recognition of all the words is very important to infer the concept and data. In the Voice PORTAL system, on the other hand, the exact recognition of a few keywords, possibly surrounded by non-keywords, is enough for the system to infer immediately what kind of function is requested.

7. Conclusion

In this paper we have presented a general framework for dialogue annotation in the context of the evaluation of Spoken Language Agents. To examine the viability of the proposed coding scheme and annotation tools, they have been tested while evaluating two real prototypes of call-center agents, using different simple dialogue metrics under the PARADISE framework.

From the analysis of the experimental results we can say that the PARADISE methodology is useful for describing two apparently similar systems that behave differently in the field. More work is necessary, however, especially to determine which factors are correlated and to verify in future trials the predictive performance of the equations obtained.

Another notable conclusion of this work is the important role of the percentage of correct concepts compared with the percentage of word recognition. For future evaluations, more emphasis will therefore be necessary during the annotation stage on what we have called “concepts” (Antoine et al., 2000), because a more detailed classification and annotation could provide us with more information about the performance of the system.

References

ALVAREZ J., GIL J. C., CASAS C. C. & MERINO D. T. (1996). The natural language processing module for a voice assisted operator at Telefónica I+D. In ICSLP ’96, Philadelphia, USA.

ANTOINE J.-Y., SIROUX J., CAELEN J., VILLANEAU J., GOULIAN J. & AHAFHAF M. (2000). Obtaining predictive results with an objective evaluation of spoken dialogue systems: experiments with the DCR assessment paradigm. In Proceedings of Second International Conference on Language Resources and Evaluation LREC-2000, Athens Greece.

BONNEAU-MAYNARD H., DEVILLERS L. & ROSSET S. (2000). Predictive performance of dialog systems. In Proceedings of Second International Conference on Language Resources and Evaluation LREC-2000, Athens Greece.

CHARFUELÁN M., RELAÑO GIL J., RODRÍGUEZ M. C., TAPIAS D. & GÓMEZ L. H. (2000). Dialogue annotation for language system evaluation. In Proceedings of Second International Conference on Language Resources and Evaluation LREC-2000, Athens Greece.


DYBKJOER L., BERNSEN N. O., CARLSON R., CHASE L., DAHLBACK N., FAILENSCHMID K., HEID U., HEISTERKAMP P., JONSSON A., KAMP H., KARLSSON I., V. KUPPEVELT J., LAMEL L., PARAUBEK P. & WILLIAMS D. (1998). The DISC approach to spoken language systems development and evaluation. In Proceedings of First International Conference on Language Resources and Evaluation, LREC-1998, Granada Spain.

ISARD A., MCKELVIE D. & THOMPSON H. S. (1998). Towards a minimal standard for dialogue transcripts: A new SGML architecture for the HCRC Map Task corpus. In Proceedings of International Conference on Spoken Language Processing, Australia.

MATE (1998). Mate. project overview. http://mate.nis.sdu.dk/.

MINKER W. (1998). Evaluation methodologies for interactive speech systems. In Proceedings of First International Conference on Language Resources and Evaluation, LREC-1998, Granada Spain.

RELAÑO GIL J., TAPIAS D., RODRÍGUEZ M. C., CHARFUELÁN M. & GÓMEZ L. H. (1999). Robust and flexible mixed-initiative dialogue for telephone services. In Ninth Conference of the European Chapter of the Association for Computational Linguistics, Bergen, Norway: Proceedings of EACL ’99.

PRICE P., HIRSCHMAN L., SHRIBERG E. & WADE E. (1992). Subject-based evaluation measures for interactive spoken language systems. In DARPA Proceedings of Speech and Natural Language Workshop.

WALKER M., LITMAN D. J., KAMM C. A. & ABELLA A. (1998). Evaluating spoken dialogue agents with PARADISE: Two case studies. Computer Speech and Language, 12, 317–347.

WALPOLE R. E. & MYERS R. H. (1992). Probabilidad y Estadística. México: McGraw-Hill / Interamericana.

XML L. (1999). Language technology group, lt xml version 1.1. http://www.ltg.ed.ac.uk/software/xml/index.html.