8.3 Materials and Methods
11.3.3 Classifier Design
Multi-stage or cascading learning [26] [27] is a special case of ensemble learning. As the name suggests, several classifiers C1, C2, ..., Cn−1, Cn are staged serially so that Cn learns not only from the attributes of the training instances but also from the class distributions of these instances provided by Cn−1 (e.g., if Cn−1 is a decision tree, the class distributions are probability values of class membership for the training instances). In contrast to this multi-stage learning, voting or stacking ensembles are multi-expert learning methods. Multi-stage learning methods are often as fast and as good as ensemble learners but only require simple learners in their cascades.
To classify summary sentences, we have designed a two-stage learner: our first stage involves a k-Nearest Neighbour learner [28] while our second stage is a Naïve Bayes learner [29]. There are several reasons for choosing these learners for our two-stage classification. First, both learners are stable: a small change in the training data rarely affects performance. Second, we are interested in exploiting the strengths of both discriminative (k-Nearest Neighbour) and generative (Naïve Bayes) learners. Third, both learners are simple as their objective functions are based on probabilities. Last but not least, both learners, especially Naïve Bayes, perform well in text-based classification. We have used the implementations and the default parameter values of these learners found in the Weka machine learning toolkit [30].
[Figure 11.2 plot: training error (%) and validation error (%) against the value of k]
Figure 11.2: The learning curve of the k-Nearest Neighbour classifier on the CAST-30 dataset used to determine the optimum value of k.
The overall learning is evaluated using a stratified 10-fold cross validation. Treating each dataset independently, the values for the 87 stylometric attributes are computed for each sentence. Then, each dataset is randomly divided into 10 equal-sized stratified sets. Stratification means that the + and − classes in each set are represented in approximately the same proportion as in the full dataset. One set is used for evaluation and the remaining sets for construction of a k-Nearest Neighbour classifier (stage 1 of 2). This classifier then generates two probability values (one for each class) for each instance in the evaluation set. This cross-validation process is then repeated until each of the 10 sets is used exactly once as the validation data for the k-Nearest Neighbour classifier. Now, in addition to the 87 stylometric attributes, each instance in the dataset has been assigned two more attributes. So, each instance is now represented by 89 attributes and each has the human-assigned class attribute. Using these attributes, a Naïve Bayes classifier then generates models in a stratified 10-fold cross validation (stage 2 of 2). We report the average values for the 10 folds of the measures described in Section 11.3.4.
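The first stage of the procedure above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the Weka implementation used in this work: the function names are hypothetical, the classes are assumed to be labelled "+" and "-", the distance is assumed to be Euclidean, and the class probabilities are taken as simple unweighted neighbour fractions. The sketch assigns each instance to a stratified fold and appends the two out-of-fold k-Nearest Neighbour probabilities to its attribute vector.

```python
import math
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=10, seed=0):
    """Return a fold index per instance, keeping class proportions similar in every fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    fold_of = [0] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled instances of one class round-robin over the folds.
        for position, i in enumerate(indices):
            fold_of[i] = position % n_folds
    return fold_of

def knn_class_probs(train_X, train_y, x, k):
    """Fractions of the k nearest training neighbours in the '+' and '-' classes."""
    nearest = sorted(range(len(train_X)), key=lambda j: math.dist(train_X[j], x))[:k]
    p_pos = sum(1 for j in nearest if train_y[j] == "+") / len(nearest)
    return [p_pos, 1.0 - p_pos]

def augment_with_stage1(X, y, k=15, n_folds=10, seed=0):
    """Append out-of-fold k-NN class probabilities (stage 1) to every instance."""
    fold_of = stratified_folds(y, n_folds, seed)
    augmented = [None] * len(X)
    for fold in range(n_folds):
        train = [i for i in range(len(X)) if fold_of[i] != fold]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        for i in range(len(X)):
            if fold_of[i] == fold:
                # The instance is scored by a classifier that never saw it.
                augmented[i] = list(X[i]) + knn_class_probs(train_X, train_y, X[i], k)
    return augmented
```

With 87 input attributes per sentence, each row returned by `augment_with_stage1` would hold 89 values; these rows, together with the class labels, are what a second-stage Naïve Bayes learner would be trained and evaluated on.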
Finding a good value of k for a k-Nearest Neighbour classifier is important since too low a value of k usually generates a low bias-high variance classifier which may overfit. On the other hand, too high a value of k may generate a high bias-low variance classifier which may underfit the data. To find a suitable value of k, we have examined the bias-variance tradeoff using a learning curve where training and validation error rates of the k-Nearest Neighbour classifiers are plotted by varying the value of k from 1 to 29; only the odd numbers in this range have been considered. As expected, the learning curve in Figure 11.2 illustrates that low k values generate low bias-high variance classifiers for the CAST-30 dataset: the classifiers have very low training error but comparatively high validation error. However, according to the curve, we can expect to get a smooth decision boundary with k = 15. The k values for the remaining datasets are obtained in a similar way; we have not included their learning curves here, but they can be found elsewhere (see footnote 2).
2 http://cogenglab.csd.uwo.ca/additionalmaterial/summary/text-summary-learning-curves-2014.
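The learning curve described above could be tabulated as follows. This is a minimal sketch in plain Python with hypothetical function names, assuming Euclidean distance, a simple majority vote and a single training/validation split; it is not the Weka procedure used to produce Figure 11.2. It returns the training and validation error (in %) for each odd k from 1 to 29, leaving the choice of k to inspection of the resulting curve.

```python
import math

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training instances (Euclidean distance).

    With two classes and an odd k, the vote cannot end in a tie.
    """
    nearest = sorted(range(len(train_X)), key=lambda j: math.dist(train_X[j], x))[:k]
    votes = [train_y[j] for j in nearest]
    return max(set(votes), key=votes.count)

def error_rate(train_X, train_y, X, y, k):
    """Percentage of instances in (X, y) misclassified by k-NN built on the training data."""
    wrong = sum(1 for x, label in zip(X, y)
                if knn_predict(train_X, train_y, x, k) != label)
    return 100.0 * wrong / len(X)

def learning_curve(train_X, train_y, val_X, val_y, ks=range(1, 30, 2)):
    """Return (k, training error %, validation error %) for each candidate k."""
    return [(k,
             error_rate(train_X, train_y, train_X, train_y, k),
             error_rate(train_X, train_y, val_X, val_y, k))
            for k in ks]
```

For k = 1 the training error is zero whenever the training points are distinct, because every instance is its own nearest neighbour; this is the low bias-high variance end of the curve.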
                  Actual +               Actual −
Prediction +      True Positive (TP)     False Positive (FP)
Prediction −      False Negative (FN)    True Negative (TN)

Table 11.3: Confusion matrix for the summary sentence classification problem.
11.3.4 Evaluation Measures
To summarize the performance of the classifiers, we have used a wide variety of standard evaluation measures. The measures include precision, recall, F-score, accuracy, false positive rate, false negative rate, area under curve (AUC), and the Matthews correlation coefficient. Noting that all of the datasets we have used suffer from a high class imbalance ratio (see Table 11.1), we have selected these measures because all of them except accuracy can deal with the class imbalance problem.
To understand the measures, refer to the confusion matrix shown in Table 11.3. The precision of classification for a class is the fraction of the instances predicted to belong to that class that actually belong to it. In our case, it is the number of correct predictions for the summary sentence class divided by the total number of summary sentence predictions (Eq. 11.1). The recall (or true positive rate), on the other hand, is the fraction of relevant instances (for a class) that are correctly classified, which in our case is the number of correct predictions for the summary sentence class divided by the number of summary sentences in the dataset (Eq. 11.2). The F-score is the harmonic mean of the precision and the recall and summarizes the two in a single value (Eq. 11.3).
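\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{11.1} \]

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{11.2} \]

\[ \text{F-score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{11.3} \]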
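As a small worked illustration of Eqs. 11.1 to 11.3, the following Python sketch counts the four cells of the confusion matrix from lists of actual and predicted labels and derives the three measures from them. The function names and the "+" label are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than part of the definitions.

```python
def confusion_counts(actual, predicted, positive="+"):
    """Return (TP, FP, FN, TN) with the summary sentence class as the positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision(tp, fp):
    """Eq. 11.1; 0.0 is an assumed convention when nothing is predicted positive."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Eq. 11.2; 0.0 is an assumed convention when there are no positive instances."""
    return tp / (tp + fn) if tp + fn else 0.0

def f_score(p, r):
    """Eq. 11.3: harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if p + r else 0.0

# Illustrative labels only, not taken from any dataset in this chapter.
actual    = ["+", "+", "+", "-", "-", "-", "-", "-"]
predicted = ["+", "+", "-", "+", "-", "-", "-", "-"]
tp, fp, fn, tn = confusion_counts(actual, predicted)   # 2, 1, 1, 4
p, r = precision(tp, fp), recall(tp, fn)               # both 2/3
```

Here two of the three predicted summary sentences are correct and two of the three actual summary sentences are found, so precision, recall and F-score all equal 2/3.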
The accuracy of a method represents the fraction of its overall classifications (for both the + and the − class in our case) that are correct (Eq. 11.4). However, for datasets with a high class imbalance ratio, this measure is not appropriate because it does not reflect misclassification costs and has a strong bias to favour the majority class [31]. Nevertheless, we have reported classification accuracy since it has been reported by some contemporary studies [9].
\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \tag{11.4} \]
The false positive rate (FPR) is the fraction of negative instances that are misclassified. In our case, it is the number of sentences misclassified as a summary sentence divided by the total number of sentences that are not summary sentences (Eq. 11.5). Similarly, the false negative rate (FNR) is the fraction of positive instances that are misclassified. Interpreting Eq. 11.6 for our task, it is the number of misclassified summary sentences divided by the total number of summary sentences.
\[ \mathrm{FPR} = \frac{FP}{FP + TN} \tag{11.5} \]

\[ \mathrm{FNR} = \frac{FN}{FN + TP} \tag{11.6} \]
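The bias of accuracy towards the majority class can be made concrete with a short Python sketch of Eqs. 11.4 to 11.6. The counts below are invented for illustration only: a hypothetical dataset with 10 summary sentences and 90 other sentences, and a degenerate classifier that labels every sentence as −. The function names and the 0.0 convention for empty denominators are assumptions.

```python
def accuracy(tp, fn, fp, tn):
    """Eq. 11.4: fraction of all classifications that are correct."""
    total = tp + fn + fp + tn
    return (tp + tn) / total if total else 0.0

def false_positive_rate(fp, tn):
    """Eq. 11.5: fraction of negative instances that are misclassified."""
    return fp / (fp + tn) if fp + tn else 0.0

def false_negative_rate(fn, tp):
    """Eq. 11.6: fraction of positive instances that are misclassified."""
    return fn / (fn + tp) if fn + tp else 0.0

# Hypothetical majority-class classifier: every sentence is predicted as "-".
tp, fn, fp, tn = 0, 10, 0, 90
acc = accuracy(tp, fn, fp, tn)      # 0.9, although no summary sentence is found
fpr = false_positive_rate(fp, tn)   # 0.0
fnr = false_negative_rate(fn, tp)   # 1.0, every summary sentence is missed
```

The accuracy of 0.9 looks respectable, yet the false negative rate of 1.0 shows that the classifier is useless for the summary sentence class.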
In addition, we report the area under curve (AUC), which is a single scalar value representation of a method's receiver operating characteristic (ROC) performance. A ROC curve plots the true positive rate (Eq. 11.2) against the false positive rate (Eq. 11.5) for a binary classification method. The value of AUC will always be between 0 and 1.0. A random method will have an AUC of 0.50 and no realistic method should have an AUC less than this. In practice, this measure discriminates well and is often a good choice when a general measure of predictiveness is desired for data with a high imbalance ratio [32].
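One way to compute the AUC without drawing the ROC curve relies on its rank interpretation: the AUC equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The Python sketch below uses this pairwise formulation; the function name and the "+" label are hypothetical, and it is not necessarily how the toolkit used in this work computes the value.

```python
def auc(scores, labels, positive="+"):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for s, label in zip(scores, labels) if label == positive]
    neg = [s for s, label in zip(scores, labels) if label != positive]
    # A positive scored above a negative counts 1, a tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = ["+", "+", "-", "-"]
perfect = auc([0.9, 0.8, 0.4, 0.3], labels)   # every positive outranks every negative
mixed = auc([0.9, 0.3, 0.8, 0.4], labels)     # only half of the pairs are ordered correctly
```

The pairwise loop is quadratic in the number of instances, which is acceptable for a sketch; the function assumes that both classes are present.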
Our last measure to gauge the classification performance is the Matthews correlation coefficient (MCC) [33]. This correlation measure takes both positives and negatives into account (Eq. 11.7) and is a highly regarded measure when the class sizes vary significantly. In essence, MCC reports the correlation between the actual classes and the predicted classifications made by a method. Its value is always between −1 and +1, where +1 represents a perfect method, 0 represents a random method and −1 refers to a method that totally disagrees with the actual class membership.
\[ \mathrm{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{11.7} \]
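Eq. 11.7 translates directly into code. In the Python sketch below, the function name is hypothetical and returning 0 when the denominator vanishes (for example, when a method predicts a single class only) is an assumed convention, since Eq. 11.7 is undefined in that case.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient (Eq. 11.7) from confusion matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention for a degenerate confusion matrix
    return (tp * tn - fp * fn) / denominator

perfect = mcc(tp=5, fp=0, fn=0, tn=5)     # +1: predictions agree with every actual class
inverted = mcc(tp=0, fp=5, fn=5, tn=0)    # -1: predictions disagree with every actual class
majority = mcc(tp=0, fp=0, fn=10, tn=90)  # 0 by the assumed convention
```

Unlike accuracy, the measure gives no credit to the hypothetical majority-class predictor in the last line, which illustrates why it is preferred when the class sizes differ greatly.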