2.5 Text Summarisation

2.5.4 Evaluation measures for Text Summarisation

Steinberger and Ježek [98] presented a taxonomy of summary evaluation measures, categorising them as either (i) intrinsic or (ii) extrinsic. Intrinsic evaluation is directed at the analysis and comparison of the generated summary with the original document or with a summary generated by a human. Extrinsic evaluation is directed at determining how useful a summary is with respect to a certain domain. Intrinsic evaluation is then subdivided into (i) text quality and (ii) content evaluation. Text quality evaluation requires that summaries should: (i) be grammatically correct, (ii) be non-redundant, (iii) present referential clarity, (iv) have structure and (v) be coherent. Content evaluation, on the other hand, is more quantitative and is further subdivided into two categories: (i) co-selection and (ii) content-based. Co-selection considers: (i) precision, (ii) recall, (iii) F-score and (iv) relative utility. Content-based evaluation considers: (i) cosine similarity, (ii) unit overlap, (iii) longest common subsequence, (iv) n-gram matching, (v) Pyramids and (vi) Latent Semantic Analysis (LSA) measures. Extrinsic evaluation consists of three measures: (i) document classification, (ii) information retrieval and (iii) question answering.
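Some of the quantitative content evaluation measures listed above can be illustrated with a short sketch. The code below uses the standard textbook definitions of precision, recall and F-score (co-selection) and of cosine similarity and longest common subsequence (content-based); it is not necessarily the exact formulation given in [98]. The function names, the representation of a summary as a set of selected sentence identifiers, and the use of raw term frequencies are assumptions made for illustration only.

```python
from collections import Counter
import math


def co_selection_scores(system, reference):
    """Precision, recall and F-score over selected sentence identifiers.

    Assumption: each summary is given as a collection of identifiers of
    the sentences it selected from the source document.
    """
    system, reference = set(system), set(reference)
    overlap = len(system & reference)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(reference) if reference else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two raw term-frequency vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lcs_length(tokens_a, tokens_b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(tokens_b) + 1)
    for x in tokens_a:
        current = [0]
        for j, y in enumerate(tokens_b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]
```

Co-selection compares which sentences were chosen, so it only applies to extractive summaries; the content-based functions operate on tokens and can therefore also compare a generated summary with a human-written one that shares no complete sentences with it.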

The labels used with respect to the text summarisation output factors proposed by Afantenos [2] (see Subsection 2.5.1) can also be used to evaluate summaries. Summaries should be: (i) complete, (ii) accurate and (iii) coherent. As suggested in [2], there is still no general consensus among the research community on the criteria that can best be used to evaluate a summary, since summarisation has a subjective aspect whereby a generated summary could be considered to be of good quality by some people but not by others. In some cases domain experts have produced “gold” summaries which may be used as benchmarks. However, three problems can be identified: (i) domain experts may not agree on the characteristics that a “gold” summary must have, (ii) it may be resource-intensive to generate such “gold” summaries because many domain experts would be required to agree on the criteria to follow, and (iii) it will be time-consuming to generate such summaries because, even if the documents to be summarised are short, a considerable number of them must be manually summarised so that an extensive set of “gold” summaries can be produced.

With respect to the research presented in this thesis, the generated summaries were evaluated by both domain experts and the author of this thesis in terms of completeness, and by considering the taxonomy of summary evaluation measures presented in [98]. The intrinsic evaluation measures taken into account focused on the quality of the text: (i) grammaticality, (ii) non-redundancy, (iii) referential clarity, (iv) structure, (v) completeness, (vi) accuracy and (vii) coherence. The only extrinsic evaluation measure used was a quantitative one, namely document classification; in other words, how well the classification methods performed.

2.6 Summary

In summary, this chapter has presented a literature review of the approaches most closely related to the research topics covered by this thesis: (i) Questionnaire Data Mining, (ii) Text Classification and (iii) Text Summarisation. The significance of the research presented in this thesis was explained, as well as its location in the landscape of the three aforementioned areas of research. An overview of Questionnaire Data Mining approaches was presented, as well as the most relevant approaches depending on the part of the questionnaire they were aimed at: (i) tabular data, (ii) free text and (iii) tabular data and free text combined. In the context of text classification a review was presented of: (i) feature selection techniques, (ii) relevant approaches and (iii) evaluation measures. Finally, with respect to text summarisation the following were reviewed: (i) the categorisation of the text summarisation techniques, (ii) text summarisation approaches, (iii) text summarisation approaches that use text classification methods and (iv) evaluation measures. The following chapter, Chapter 3, will consider the nature and the preprocessing of the data sets used.

Chapter 3

Evaluation Data Sets and Data Preprocessing

3.1 Introduction

In this chapter the nature of the data sets used to evaluate the work presented later in this thesis is described. Three data sets were considered: (i) SAVSNET (Small Animal Veterinary Surveillance Network), (ii) OHSUMED and (iii) Reuters-21578. Of these data sets the SAVSNET data was the most significant, as it provided the main motivation for the work described in this thesis, in that it is a questionnaire data set (the other two data sets, as will be seen, are more traditional text data sets). The chapter introduces these data sets and also describes the associated data preprocessing, an important step in the context of knowledge extraction in general, and text summarisation in particular. As explained in Chapter 1, the focus of this thesis is to generate summaries through a process of classification. The quality of the generated summaries thus depends on the quality of the generated classifiers, which in turn depends on the nature of the training sets and, of course, the nature of the class labels assigned to the documents in them. Free text documents can be related to single or multiple class labels; in the latter case the labels can be structured in a hierarchical manner.

Obtaining questionnaire data sets for research purposes is not a straightforward task because of the confidentiality of the data gathered. The data sets used to evaluate the text classification methods used to generate text summaries, presented later in this thesis, were as follows:

• SAVSNET

  - SAVSNET-840-4-FT
  - SAVSNET-840-4-TD+FT
  - SAVSNET-971-3-FT
  - SAVSNET-971-3-TD+FT
  - SAVSNET-917-4H

• OHSUMED

  - OHSUMED-CA-3187-3H (Cardiovascular Abnormalities)
  - OHSUMED-AD-3393-3H (Animal Diseases)

• Reuters-21578

  - Reuters-21578-LOC-2327-2H (Locations)
  - Reuters-21578-COM-2327-2H (Commodities)

The naming convention used with respect to the above is as follows. Each name comprises three or four elements: (i) title; (ii) sub-title (may be omitted); (iii) total number of records; and (iv) number of classes, or the number of levels (H) in the case of hierarchical data sets. A hierarchical data set is one that features a hierarchy of class labels. In the context of hierarchical data sets the indicator L will be used to indicate the data and class label pairings at a particular level. In the particular cases of the SAVSNET-840-4 and the SAVSNET-971-3 data sets, which contain both tabular data and free text, the indicator FT will be used where the data set contains only free text, and the indicator TD+FT will be used if it contains tabular data and free text. The OHSUMED and the Reuters-21578 data sets consist only of free text. Thus we have the SAVSNET data set SAVSNET-840-4-FT, which contains 840 records, 4 class labels and only free text. Alternatively, we have the OHSUMED data set OHSUMED-CA-3187-3H, which contains 3,187 records arranged into a hierarchy of three levels. Regarding the Reuters-21578 data sets, the “21578” indicates the total number of documents in the original data set and is included in the name along with the actual number of documents in the variation of the data set. Note that in the case of the OHSUMED data sets the number of records in each level is not the same: the first level consists of all the documents and the lower levels consist of subsets of the documents in the first level. In some cases, for the purpose of summary generation, the class labels associated with a particular data set at a particular level in a hierarchy are treated independently. In this case the last element of the data set name indicates the level, for example SAVSNET-917-1L, OHSUMED-CA-2570-2L or Reuters-21578-LOC-2327-2L.
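The naming convention can be made concrete with a small parser. The following is a minimal sketch only: the regular expression, the `DataSetName` class and the `parse_data_set_name` function are hypothetical names introduced here for illustration, and the pattern treats "Reuters-21578" as a single title on the assumption, stated above, that "21578" is part of the original data set's name.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed pattern: title, optional sub-title, record count, then a number
# that is a class count (no suffix), a number of levels (H) or a level (L),
# optionally followed by the FT / TD+FT content indicator.
_NAME = re.compile(
    r"^(?P<title>Reuters-21578|[A-Za-z]+)"
    r"(?:-(?P<subtitle>[A-Z]+))?"
    r"-(?P<records>\d+)"
    r"-(?P<count>\d+)(?P<kind>[HL])?"
    r"(?:-(?P<content>FT|TD\+FT))?$"
)


@dataclass(frozen=True)
class DataSetName:
    title: str
    subtitle: Optional[str]
    records: int
    classes: Optional[int]  # set for flat data sets only
    levels: Optional[int]   # set when the name ends in "H"
    level: Optional[int]    # set when the name ends in "L"
    content: Optional[str]  # "FT", "TD+FT" or None


def parse_data_set_name(name: str) -> DataSetName:
    """Split a data set name into the elements of the naming convention."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised data set name: {name!r}")
    count = int(match["count"])
    kind = match["kind"]
    return DataSetName(
        title=match["title"],
        subtitle=match["subtitle"],
        records=int(match["records"]),
        classes=count if kind is None else None,
        levels=count if kind == "H" else None,
        level=count if kind == "L" else None,
        content=match["content"],
    )
```

For instance, `parse_data_set_name("OHSUMED-CA-3187-3H")` yields the title `OHSUMED`, the sub-title `CA`, 3187 records and three hierarchy levels, matching the reading of that name given above.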

Thus, from the above, five hierarchical data sets are used in this research: (i) SAVSNET-917-4H, (ii) OHSUMED-CA-3187-3H, (iii) OHSUMED-AD-3393-3H, (iv) Reuters-21578-LOC-2327-2H and (v) Reuters-21578-COM-2327-2H. The SAVSNET-917-4H data set features a four-level hierarchy, the OHSUMED data sets feature three-level hierarchies and the Reuters-21578 data sets feature two-level hierarchies. It should also be noted that the OHSUMED and the Reuters-21578 data sets are not questionnaire data sets. However, these data sets were used with respect to the work described in this thesis to evaluate how well the proposed classification-based summarisation techniques generate text summaries from types of textual data other than questionnaire data. The hierarchical data sets are only applicable with respect to the hierarchical method presented in Chapter 7; for the other three methods considered in this thesis each level of the hierarchy in the data sets is considered as an independent data set, hence the naming of the data sets indicating the levels to which they belong in their respective hierarchy. The data sets will be described in further detail later in this chapter.
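The idea of treating each level of a hierarchy as an independent data set can be sketched as follows. This is an illustrative assumption about the data representation, not the procedure actually used in the thesis: each record is taken to be a pair of a document text and a label path running from the first level downwards, and the function name `split_by_level` is hypothetical.

```python
from collections import defaultdict


def split_by_level(records):
    """Turn a hierarchical data set into one flat data set per level.

    Assumption: each record is (text, label_path), where label_path is a
    tuple of class labels ordered from level 1 downwards. A document
    without a label at some level is simply absent from that level's
    data set, so lower levels hold subsets of the level-1 documents.
    """
    per_level = defaultdict(list)
    for text, label_path in records:
        for level, label in enumerate(label_path, start=1):
            per_level[level].append((text, label))
    return dict(per_level)


# Toy records with placeholder labels (not taken from any real data set).
toy = [
    ("doc one", ("A", "A1")),
    ("doc two", ("B",)),
    ("doc three", ("A", "A2", "A2x")),
]
levels = split_by_level(toy)
```

With the toy records above, level 1 contains all three documents while levels 2 and 3 contain two and one respectively, mirroring the observation that the number of records need not be the same at every level.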

The remainder of this chapter is organised as follows. Section 3.2 presents a general discussion of the preprocessing and representation of the textual and tabular elements of questionnaires, as well as free text in general. The individual data sets (SAVSNET, OHSUMED and Reuters-21578) used in the experiments that were carried out to evaluate the proposed approaches presented in this research, as well as their preprocessing, are then described in detail in Sections 3.3, 3.4 and 3.5 respectively. A summary of the chapter is presented in Section 3.6.
