Evaluation of Systems and Methodologies - Virtual Doctor: An Intelligent Human-Computer Dialogu

In this section, the evaluation of the methodologies that were presented previously is provided. In order to conduct this evaluation, a number of features were selected. These features are presented in Table 3-5.

Table 3-5: Evaluation features.

Feature Description

Speech recognition and synthesis (F1)

The methodology is able to recognize and to generate speech as input and output, respectively.

Multimodal I/O support (F2)

The methodology can support several input and output modalities.

Adaptability to user’s needs (F3)

The methodology supports user-initiative at the beginning of or during the interaction.

Human-like dialogue capability (F4)

The dialogue resembles the natural course of a dialogue between humans.

Learning capability (F5)

The methodology is able to learn different concepts from the interaction with the user.

Friendliness (F6) The methodology is user-friendly.

Originality (F7) A novel methodology.

Model complexity

(F8) The complexity of the methodology.

Availability (F9) The ability to obtain and/or to implement the methodology based on the description that is provided.

Prototype (F10) The methodology has been implemented at the experimental stage [60].

Released product (F11)

The methodology has been implemented as a commercial product [60].

Robustness (F12) The methodology’s ability to cope with incorrect input.

Correctness and

reliability (F13) The methodology produces correct and accurate results.

Scalability (F14) The methodology’s ability to be used for other applications, in new environments, etc.

The selection of the features that are depicted in Table 3-5 was based on the characteristics that a dialogue system must meet, in order for it to be considered mature enough for real-life scenarios. More specifically, general features that correspond to some properties that are desirable for any computer system, like robustness (F12) and scalability

(F14) are among the ones that were selected. Moreover, a computer system should also comply with certain requirements that are anticipated by the end-users, if it is to be used in everyday life. For example, a system should be able to adapt to the needs of the user, as well as to be user-friendly. Hence, features like adaptability to user’s needs (F3) and friendliness (F6) were selected, respectively. Furthermore, since the reviewed systems are dialogue systems, a number of features was selected to correspond to the nature and the appropriate qualities of these systems. For instance, the ability to recognize and synthesize speech, as well as to produce human-like dialogue, are among the desirable traits of a dialogue system. Thus, features like speech recognition and synthesis (F1) and human-like dialogue capability (F4) were also selected for the evaluation process. Additionally, features such as originality (F7) and model complexity (F8) were chosen to highlight the importance of some design parameters to the developers of a dialogue system. Finally, in order to demonstrate the development stage of a reviewed system, prototype (F10) and released product (F11) were also among the selected evaluation features.

In order to conduct the evaluation of the methodologies that are presented in Table 3-4, a score between 1 and 5 was assigned for each methodology and each one of the features that are presented in Table 3-5. A score equal to 5 means that a methodology performs very well for that specific feature, while a score equal to 1 means the exact opposite. For example, a methodology that is assigned a score equal to 5 for robustness (F12) is more robust than one that is assigned a score equal to 3 for that same feature.

However, the reverse approach was followed for model complexity (F8), since a less complex model would be preferred to a more complex one that performs the exact same tasks. Therefore, a score equal to 1 indicates a very complex methodology, while a score

equal to 5 is an indication of a simpler model. Lastly, a score equal to 1 was given if there was not enough information regarding a feature for a methodology. An illustration of the scores that were assigned for each methodology is provided in Table 3-6.

Table 3-6: Evaluation scores of the reviewed dialogue systems.

Reviewed

In addition to the scores that were assigned for each feature and each methodology, and with the aim to indicate the perspective of the involved parties in the use and the

development of dialogue systems, i.e., the user and the developer, a weight was assigned to each feature. These weights reveal the significance that every feature has for the user and the developer and are illustrated in Table 3-7.

Table 3-7: Feature weights.

As it can be observed in Table 3-7, two weights between 1 and 5 were assigned to each one of the evaluation features. The first of these weights corresponds to the significance of a feature to the user of the system, while the second one indicates its importance to the developer of the system; an average weight was also calculated for each feature. For each one of the evaluation features, a weight that is equal to 5 indicates that the related feature is of great importance to the corresponding involved party, while a weight equal to 1 indicates the opposite. Thus, originality (F7), for example, is really

important to the developers of a system, as they are interested in presenting a novel methodology. On the contrary, the users of a system are mostly interested in its performance and it is insignificant to them whether the methodology in which this system is based is novel or not.

The final step of this evaluation is the production of a maturity score for each one of the reviewed systems and methodologies that were presented previously. In order to produce this maturity score, both the scores that were assigned to each one of the reviewed dialogue systems and the feature weights, which were presented in Table 3-7, need to be taken into consideration. Therefore, the maturity score was calculated using the following equation, similarly to [1].

𝑀 =^∑

∑ (3-1)

In equation (3-1), Mi indicates the maturity score for methodology i, wj represents the weight for feature j and fij the score that was assigned for methodology i and feature j.

Since there are three weights for each one of the evaluation features, taking into consideration the user’s, the developer’s and the average perspective, there are three maturity scores produced, based on each one of the aforementioned perspectives. The calculated maturity scores are illustrated in Table 3-8.

Table 3-8: Maturity scores of the reviewed dialogue systems.

Reviewed

With the intention of providing a graphical illustration of the maturity scores that are presented in Table 3-8, the final results for the maturity evaluation for each methodology are also depicted in Figure 3-3. These results are also presented separately for answering, semi-dialogue and full-dialogue systems in Figures 3-4, 3-5 and 3-6, respectively.

Figure 3-3: Maturity scores of the reviewed dialogue systems.

Figure 3-4: Maturity scores of the reviewed answering systems.

Figure 3-5: Maturity scores of the reviewed semi-dialogue systems.

Figure 3-6: Maturity scores of the reviewed full-dialogue systems.

Table 3-8 and Figures 3-3, 3-4, 3-5 and 3-6 illustrate the maturity scores that were obtained for the reviewed dialogue systems. In particular, there are three maturity scores for each system; the first two are based on the user’s and the developer’s perspective, respectively, while the third one is the average maturity score.

As it can be observed in Table 3-8 and in Figures 3-3, 3-4, 3-5 and 3-6, none of the reviewed dialogue systems obtained a maturity score close to the maximum one, which is equal to 5. However, there are some systems that reached a relatively high score. These mainly include the four commercial answering systems that were presented in paragraph 3.2.1, as well as a small number of research prototypes that also reached a score close to the maximum attained one. Therefore, it can be inferred that the majority of the mature dialogue systems are commercial products. Nonetheless, as it has been mentioned previously, these systems belong to the class of answering systems, indicating that there is still a scarcity of, and accordingly, a need for mature full-dialogue systems.

Furthermore, by examining the evaluation results, it can be inferred that there are systems that are more mature based on the user’s perspective but, at the same time, less mature according to the developer’s perspective. This is due to the fact that the users and the developers consider different features to be more important than others. For instance, while developers are interested in presenting a novel methodology, users are more fascinated by a user-friendly system. As a result, there are systems that obtained a high user maturity score and, at the same time, a low developer maturity score, or vice versa.

This is an indication of the fact that the number of mature dialogue systems, which fulfill the requirements that are desirable by both their users and their developers, is still insufficient.

Finally, some of the low maturity scores can also be justified by the fact that several dialogue systems obtained really low scores for specific features. In order to demonstrate this fact, the arithmetic mean and the mode of each evaluation feature, for all of the reviewed dialogue systems, are presented in Table 3-9 and in Figure 3-7.

Table 3-9: Arithmetic mean and mode of each evaluation feature.

Feature Arithmetic

Mean Mode

Speech recognition and synthesis (F1) 4.6 5

Multimodal I/O support (F2) 2.4 1

Adaptability to user’s needs (F3) 2.43 2

Human-like dialogue capability (F4) 2.83 2

Learning capability (F5) 1.33 1

Friendliness (F6) 2.47 2

Originality (F7) 2.7 2

Model complexity (F8) 3.03 3

Availability (F9) 3.17 4

Prototype (F10) 4.87 5

Released product (F11) 1.67 1

Robustness (F12) 2.37 1

Correctness and reliability (F13) 2.53 2

Scalability (F14) 3.53 3

Figure 3-7: Arithmetic mean and mode of each evaluation feature.

Based on the results that are illustrated in Table 3-9 and in Figure 3-7, it can be inferred that the reviewed dialogue systems failed to obtain a high score in some evaluation features, which indicates some of their weaknesses. More specifically, the majority of the reviewed dialogue systems obtained a score below 2.5 for multimodal I/O support (F2), adaptability to user’s needs (F3) and friendliness (F6), which shows that they are not capable of supporting multiple input and output modalities, and adapting to the user’s needs, while, at the same time, they are not considered to be very user-friendly. Moreover,

the average score for human-like dialogue capability (F4) is slightly above 2.5, while most of the reviewed dialogue systems obtained a score equal to 2 for that particular feature.

This proves that the majority of the reviewed dialogue systems do not produce a human-like dialogue. Furthermore, the average score for learning capability (F5) was 1.33, while only 20% of the reviewed dialogue systems obtained a score higher than 1, and only one of them achieved the highest possible score. Therefore, it is safe to conclude that one of the greatest weaknesses of the reviewed dialogue systems is their ability to learn different concepts from their interaction with the user. Additionally, the acquired scores for robustness (F12) and correctness and reliability (F13) were approximate to 2.5, with a mode equal to 1 and 2, respectively. As a result, most of the reviewed dialogue systems do not appear to be able to either cope with incorrect input, or to produce correct and accurate results. Finally, the low score that was acquired for released product (F11) indicates that the majority of the reviewed dialogue systems are still in the development stage and have not yet been released as commercial products. All in all, the results that were presented previously are an indication of the limitations, the challenges, as well as the future research efforts that need to take place in the development of human-computer dialogue systems.

3.4 Conclusions

In the previous paragraphs, a review and a classification scheme for dialogue systems were presented. As a result of this literature review, there were three classes of dialogue systems proposed, including answering, semi-dialogue and full-dialogue systems.

Subsequently, the reviewed systems and methodologies were evaluated according to the user’s and the developer’s perspective. Nevertheless, the motivation for this evaluation was

not to compare each methodology with the rest of them, but rather to indicate some weak points and disadvantages in their development. Through this process, a number of issues and challenges that need to be unraveled were designated, so that dialogue systems can actually become useful and suitable for real-life scenarios. All things considered, the survey that was presented in the previous paragraphs suggests that the field of human-computer dialogue systems is still an area open for further research efforts.

4 The Virtual Doctor Dialogue System

4.1 Introduction

Dialogue systems, which are also known as conversational systems or conversational agents, are a new type of HCI systems that has arisen and has been the subject of keen research and market interest over the past few decades. These systems are computer systems that interact with humans by using a variety of input and output modalities, and they can be integrated into various other computer systems and devices.

As it was mentioned in the literature review of human-computer dialogue systems that was presented in Chapter 3, dialogue systems have had a huge commercial success, which is evident by the popularity of systems such as Google Now and Siri. Furthermore, numerous prototypes have been developed as part of research efforts in the area of dialogue systems and HCI. For instance, the SimCoach project incorporates anthropomorphic avatars to interact with and provide support to humans who suffer from a psychological burden [54]. BEETLE II, on the other hand, includes lectures and exercises that can be used for educational purposes [55], while a telephone-based dialogue system that provides

farmers in India with information regarding the prices of agricultural products is presented in [27]. Therefore, it could be inferred that a variety of human-computer dialogue systems have been are still being developed to cover a wide spectrum of applications.

This dissertation focuses on the modeling and the development of an intelligent human-computer dialogue system that will be part of the general VDr system that has been presented previously. As it has already been mentioned, the general VDr system’s goal is to extract a patient’s measurable and non-measurable pathological symptoms and, subsequently, provide a reliable diagnosis to the patient’s physician. In order to achieve this goal, a wearable health monitoring device is used to measure the patient’s measurable pathological parameters and to provide their measurable pathological symptoms. However, the patient’s non-measurable symptoms are also required for the prognosis process, since they are essential to the health condition of a human and they could be an indication of potential pathological conditions. Thus, a VDr human-computer dialogue system has been proposed and developed as part of the general VDr system to address this issue.

The rest of this chapter provides a description of the characteristics and the components of both the implemented VDr dialogue system, as well as of dialogue systems, in general.

In document Virtual Doctor: An Intelligent Human-Computer Dialogue System for Quick Response to People in Need (Page 63-76)