Acceptability measure - Methods and Metrics in Evaluation of Question Generation

Chapter 6: Methods and Metrics in Evaluation of Question Generation

6.6 Discussion

6.6.2 Acceptability measure

The question generation strategies have been evaluated using two ontologies as a source of question keywords. Three question categories, namely definition, concept completion, and comparison, were generated from AQGen, and the acceptability of each QG strategy for constructing these question categories was evaluated. The acceptability of the generated questions is evaluated using five question deficiencies (QD1 to QD5) on a 3-point scale: 1 = YES, 2 = No option, and 3 = NO. Thus, the closer the mean score is to 1, the better the acceptability measure for that question.
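As a minimal sketch of how a Mean(SD) entry on this scale can be computed, the snippet below summarises one set of ratings. The ratings are invented purely for illustration, and the use of the sample standard deviation is an assumption; the text does not state which variant was used.

```python
from statistics import mean, stdev

# Hypothetical ratings from five experts for one question deficiency (QD).
# Scale: 1 = YES, 2 = No option, 3 = NO. These values are invented for
# illustration and are not taken from the evaluation data.
ratings = [1, 1, 2, 1, 1]


def summarise(scores):
    """Return (mean, sample standard deviation), rounded to two decimals."""
    return round(mean(scores), 2), round(stdev(scores), 2)


m, sd = summarise(ratings)
print(f"{m:.2f}({sd:.2f})")  # prints 1.20(0.45)
```

A mean close to 1 means that most experts answered YES for that deficiency.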

Results revealed that the total mean rating for all QDs favoured the rating YES, which indicates agreement with the QDs, as presented in Table 6.29. The results of this evaluation confirm that the QG strategies are able to generate feasible questions.

Table 6.29: Overall mean rating for each QD

Ontology   QD1 Mean(SD)   QD2 Mean(SD)   QD3 Mean(SD)   QD4 Mean(SD)   QD5 Mean(SD)
ontoCN     1.17(0.33)     1.27(0.30)     1.36(0.42)     1.34(0.34)     1.31(0.29)
ontoOS     1.25(0.41)     1.06(0.17)     1.08(0.16)     1.09(0.17)     1.10(0.21)

Although the results showed that the proposed question generation strategies are able to generate acceptable questions, some questions were still rated as less acceptable. Amongst the reasons are:

A. Grammatical error

The grammatical errors were recorded as a missing or incorrect article (a, an) in the question sentence. An incorrect article might lead to a wrong interpretation of the question. It is, therefore, essential to have correct articles in the question sentence. Although not many questions were rated as having an incorrect article, it does affect the QD1 rating score. Examples of these questions are “What is interrupt handler?” and “What is direct memory access?”, which should be written as “What is an interrupt handler?” and “What is direct memory access?”

The results for the OS subject show a mean rating of M = 1.66 in Table 6.17, indicating that the generated questions lean towards being rated as having a grammatical error. The grammatical error indicated in the expert’s feedback was an inappropriately used or missing article (a, an) in the question. The grammatical error was identified in the following questions:

• What is a secondary storage structure?
• What is a device driver?
• What is direct memory access?
• What is disk scheduling?

Based on the questions above, the affected question templates are of the definition type, namely “What is [X]?” and “What is [X] in [SC]?”. The grammatical error was recorded as being due to the missing article in the generated questions. By comparison, in ontoCNS the representation of the object properties follows the nouns they are connected with. For example, when the noun starts with a vowel letter, such as “input device”, the object property is set to “is-an” instead of “is-a”. Therefore, the grammatical error about articles is resolved. Alternatively, the question template should be modified to capture information about articles, such as “What [art] [X]?”, with [art] representing the article. One simple way to resolve this problem is to use a simple algorithm that checks the first word of [X]: if it starts with the letter a, e, i, o, or u, then [art] is set to ‘is-an’; otherwise [art] is set to ‘is-a’. Although this suggestion is not the best solution, exploring a better solution for correcting articles in the question sentence might be of interest for future work.
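The vowel check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and rendering ‘is-an’ as “is an” when filling the template is our own choice rather than something specified for AQGen. The rule looks at spelling, not pronunciation, so terms such as “user interface” or “hour” would still receive the wrong article.

```python
VOWELS = frozenset("aeiou")


def choose_article_property(term: str) -> str:
    """Return 'is-an' if the term starts with a vowel letter, else 'is-a'."""
    stripped = term.strip()
    if not stripped:
        raise ValueError("term must not be empty")
    return "is-an" if stripped[0].lower() in VOWELS else "is-a"


def fill_definition_template(term: str) -> str:
    """Fill the 'What [art] [X]?' template (hypothetical helper).

    The hyphenated value of [art] is rendered with spaces so that the
    question reads as an ordinary sentence.
    """
    art = choose_article_property(term).replace("-", " ")
    return f"What {art} {term.strip()}?"


print(fill_definition_template("interrupt handler"))  # What is an interrupt handler?
print(fill_definition_template("device driver"))      # What is a device driver?
```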

As the questions are generated from question templates that we derived from established and credible textbook revision questions, and they follow a predefined structure, we did not treat grammar checking as the highest priority in the investigation. We invited experts from the computer science domain who have experience in learning or teaching the Operating System and Computer Network subjects. However, the result in Table 6.17 for the OS subject shows that a small percentage of experts evaluated the questions as having a grammatical error. This is due to the background of one expert, who learned and taught in a non-English medium. Later, we sent the set of questions to a proof-reader to validate the accuracy of the grammar used, and the outcome shows that the grammar of the questions is appropriate.

Another grammatical error is the use of lowercase letters for some concept names that are supposed to be capitalised, such as TCP/IP.

B. The use of an abbreviation

On average, one expert for each set of questions suggested writing the concept name in full rather than using an abbreviation. Examples of questions affected by this are:

• What uses an FTP?
• What is an application layer in the TCP/IP suite?
• Explain the term OSI Model.

C. Inappropriate question context

It is assumed that all questions would be in context since the key terms were extracted from the course ontology. However, the results showed that several questions were rated as not within the context of the subject being evaluated, such as “Discuss what a presentation is?” and “Explain the term infrared”. This might be due to judging each question alone without considering the whole context of the question set. Consider the following question setting:

TUTORIAL 1
Computer Network & Security
Answer all questions.

1. Discuss what a presentation is.
2. Explain the term infrared.

TUTORIAL 2
Answer all questions.

1. Discuss what a presentation is.
2. Explain the term infrared.

The setting in Tutorial 1 provides context for all questions through the subject title “Computer Network & Security”, whereas the setting in Tutorial 2 does not indicate any context for the listed questions.

D. Question not useful

The result in Table 6.22 indicates that some of the definition questions generated for QGS1 were rated as not useful. The affected template is “What is [X]?”. The response from the respondent indicates that questions with grammatical errors and less relevant questions are marked as not useful. The negative result here appeared only for the CNS subject.

E. The inappropriate label of object properties

An inappropriate word used for an object property affects the meaning of the question. For example, the object property ‘belongs-to’ in the triple ‘802.11.g belongs-to a WLAN’ should be labelled ‘is-the-category-of’, so that the generated question would be “What is the category of WLAN?” rather than “What belongs to WLAN?”. The label changes the meaning of the sentence and hence affects the ratings for QD1, QD2, QD3, and QD4. Therefore, appropriate names for object properties are as important as concept names in providing semantically correct representations.
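The effect of the object property label on the question wording can be illustrated with a small sketch. The `Triple` type, the function name, and the “What [OP] [Y]?” template used here are assumptions made for the illustration; they are not taken from the AQGen implementation.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A subject / object-property / object statement from the ontology."""
    subject: str
    object_property: str
    obj: str


def question_from_property(triple: Triple) -> str:
    """Build 'What [OP] [Y]?' by turning the hyphenated label into words."""
    phrase = triple.object_property.replace("-", " ")
    return f"What {phrase} {triple.obj}?"


original = Triple("802.11.g", "belongs-to", "WLAN")
# Same triple with only the object property relabelled.
relabelled = original._replace(object_property="is-the-category-of")

print(question_from_property(original))    # What belongs to WLAN?
print(question_from_property(relabelled))  # What is the category of WLAN?
```

Only the label differs between the two triples, yet the resulting questions ask for different things.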

F. Inappropriate use of question word

The result presented in Table 6.24 shows that the question word used in this strategy is ‘What’, which was identified as inappropriate by 2 out of 5 experts. One of the experts suggested that the following questions should use the question word ‘List’ rather than ‘What’:

What is a type of Network Operating System?

What is a layer of TCP/IP?

When the question word ‘List’ is used, it needs to indicate the number of items for each question asked. For example, three (3) in the following questions indicate the number of items required for that question:

List three (3) types of Network Operating System?

List three (3) layers of TCP/IP?

Another expert suggested using ‘Identify’ instead of ‘What’. The inappropriate question word used in these questions therefore affects the rating score for QD5.

G. Ambiguous concept names

The result in Table 6.18 shows that some questions were said to be not understandable because a general keyword was used, for example, “What is hybrid”. The keyword ‘hybrid’ on its own does not give any meaning to the question. Only two questions were recorded as not useful in the questionnaires: “What transmits a data?” and “What is meant by the term ssh-2 in ssh?”. Both ssh-2 and data are individuals, which might not be suitable for a definition question, while the first question was also recorded as resting on a wide assumption or being too general.

6.6.3 Mean rating comparison between question generation

In document Ontologies for automatic question generation (Page 130-135)