Methodology - On the use of text classification methods for text summarisation

A schematic illustration of the summarisation approach using standard classification techniques is presented in Figure 4.1. As the figure shows, the input consists of data sets containing text documents, which may also include tabular data (as in the case of questionnaire data). The approach comprises four main stages: (i) preprocessing and representation of the input data, (ii) feature selection, (iii) classification and (iv) summarisation. The first stage (shown in Figure 4.1 in the area labelled “(1) Pre-processing”) was comprehensively described in a generic manner in Chapter 3. As also explained in Chapter 3, two feature selection techniques were used, (i) TF-IDF in conjunction with Chi-squared and (ii) TF-IDF in conjunction with CFS, and thus these are not discussed further here. The last two stages are the most significant with respect to the approach presented in this chapter and are described in detail in the following two subsections.

4.2.1 Classification

As shown in the area labelled “(3) Text Classification” in Figure 4.1, a classifier is applied to the input data. How this classifier is best generated was one of the objectives of the work described in this chapter. Experiments, reported later in this chapter, were conducted using seven standard classification techniques:

• TFPC
• C4.5
• SMO
• LibSVM
• Naive Bayes
• K-Nearest Neighbour
• RIPPER

These techniques were selected because (i) their usage has been widely reported in the literature and (ii) they display a variety of methods of operation. Each of the adopted classifier generators was introduced in Chapter 2. Once the classifiers have been applied to the data sets, they may be evaluated. With respect to the work described in this thesis, the following two-step classifier generation procedure was adopted:

1. For each classification technique in question, train using stratified Ten-fold Cross Validation (TCV).

2. For each generated classifier evaluate using five evaluation measures: (i) overall accuracy expressed as a percentage, (ii) Area Under the ROC Curve (AUC), (iii) precision, (iv) sensitivity/recall and (v) specificity. Of these, accuracy and AUC were considered to be the most relevant; precision, sensitivity/recall and specificity are presented so as to provide a broader insight into the effectiveness of the individual classifiers. (Each of these measures was described in Chapter 2).
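The two steps above can be illustrated with a minimal sketch in plain Python. It is not the implementation used in this thesis: the function names (`stratified_folds`, `binary_measures`, `auc`) are hypothetical, the fold assignment is a simple round-robin approximation of stratification, and the measures assume a two-class (one class versus the rest) view of the problem.

```python
from collections import defaultdict


def stratified_folds(labels, k=10):
    """Assign document indices to k folds so that each fold has roughly
    the same class proportions as the whole data set (assumed scheme)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the members of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def binary_measures(y_true, y_pred, positive):
    """Accuracy (as a percentage), precision, sensitivity/recall and
    specificity, computed from the confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a denominator is empty.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": 100.0 * ratio(tp + tn, len(pairs)),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


def auc(y_true, scores, positive):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive document receives the higher score; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        return 0.0
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In such a sketch, each fold would in turn be held out for testing while the remaining nine are used for training, and the five measures would be averaged over the ten runs.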

The classifier generation process that produced the best classification results was then further used to evaluate the summarisation techniques described in Chapters 5 to 7.

4.2.2 Summary Generation

As shown in Figure 4.1 in the area labelled “(4) Text Summarisation”, the classes assigned to unseen documents as a result of applying a selected classifier were used to generate the desired summaries by prepending or appending “canned” text to individual class labels, and in consequence increasing the readability of the extracted meaning. Simple rules of the form:

if <CLASS NAME> then <PREPEND/APPEND> <DOMAIN-SPECIFIC TEXT>

were established by recourse to domain experts. These rules were then used to prepend or append domain-specific text to generated class labels (names). Some example rules are presented in Table 4.1 where the last rule is treated as a “catch all” default rule. Note that, with respect to Table 4.1, in some cases, as in the case of class1 and class3, the same text may be appended or prepended.

Table 4.1: Example of rules for generating summaries.

if (class1) then (prepend) (“This document was about”)
if (class2) then (append) (“was the main topic of this document”)
if (class3) then (prepend) (“This document was about”)
if (class4) then (append) (“was the main topic of this document”)
if (noClass) then (append) (“This document was about <domain area>”)

Algorithm 1 shows how the names of the classes that are assigned to text documents, after the application of a classifier, are coupled with domain-specific text according to rules of the above form. The input to the algorithm is: (i) a class label c_i associated with a particular document and (ii) a set of n rules R = {r_1, r_2, ..., r_n}. Domain-specific text is prepended/appended to the name of the class according to the rules in R. In the rare case where no class was assigned, default domain-specific text is used as defined by rule r_n. The output is a text summarisation which is enclosed by quotation marks.

Data: action, c_i, R = {r_1, r_2, ..., r_n}
Result: text summary of the original document which includes c_i and prepended/appended text

s ← null string
for all r_i in R from i = 1 to i = n−1 do
    if r_i.antecedent = c_i then
        if action = prepend then
            s ← c_i with r_i.consequent prepended
        else if action = append then
            s ← c_i with r_i.consequent appended
end
if s = null string then
    if action = prepend then
        s ← c_i with r_n.consequent prepended
    else if action = append then
        s ← c_i with r_n.consequent appended
return s

Algorithm 1: Prepending/appending text to the name of a class.
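The rule-application procedure of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the thesis implementation: the `Rule` class and the `generate_summary` function are hypothetical names, the action (prepend or append) is stored with each rule as in Table 4.1 instead of being passed as a separate input, and the last rule in the sequence is assumed to be the default rule r_n.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Rule:
    antecedent: str  # class name, or "noClass" for the default rule
    action: str      # "prepend" or "append" (assumed to be stored per rule)
    consequent: str  # domain-specific "canned" text


def _apply(rule: Rule, label: str) -> str:
    """Join the class label and the canned text in the order the rule asks for."""
    if rule.action == "prepend":
        parts = [rule.consequent, label]
    else:
        parts = [label, rule.consequent]
    # Skip an empty label so that the default text stands alone.
    return " ".join(part for part in parts if part)


def generate_summary(class_label: Optional[str], rules: Sequence[Rule]) -> str:
    """Return a quoted summary for one document's class label."""
    *specific, default = rules  # requires at least one rule (the default)
    label = class_label or ""
    for rule in specific:  # rules r_1 .. r_(n-1)
        if rule.antecedent == label:
            # Assumes antecedents are unique, so the first match is used.
            return f'"{_apply(rule, label)}"'
    # No rule fired: fall back to the default rule r_n.
    return f'"{_apply(default, label)}"'


# The example rules of Table 4.1.
RULES = [
    Rule("class1", "prepend", "This document was about"),
    Rule("class2", "append", "was the main topic of this document"),
    Rule("class3", "prepend", "This document was about"),
    Rule("class4", "append", "was the main topic of this document"),
    Rule("noClass", "append", "This document was about <domain area>"),
]
```

With these rules, `generate_summary("class1", RULES)` yields `"This document was about class1"`, `generate_summary("class2", RULES)` yields `"class2 was the main topic of this document"`, and `generate_summary(None, RULES)` yields the default text `"This document was about <domain area>"`, each enclosed in quotation marks.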

As was previously highlighted in Chapter 1, in this thesis text summarisation is conceived as a form of text classification, in that the classes assigned to text documents are viewed as an indication of the main ideas of the original free text but in a coherent and reduced form. Therefore, it is suggested that summaries of the forms (i) domain-specific text + name of class and (ii) name of class + domain-specific text are relevant, especially where the text is unstructured and contains features such as those found in the free text part of questionnaires (misspelled words, poor grammar, and abbreviations and acronyms related to a specific domain).
