Cross-validation - Empirical Implementations and Experimental Results

Chapter 4 Empirical Implementations and Experimental Results

4.2 Cross-validation

The ultimate aim of classification models in a real-life application is to predict the class of some unknown instances based on their observed attributes. However, in an experimental environment, and as stated by the CRISP-DM model, the performance of the developed classifier must be measured before it can be deployed [27], [134].

It has been reported that the performance of a classification model is measured in order to address the following points [135]:

a. To identify the most suitable model for a given task.

b. To anticipate the model performance when deployed.

c. To prove that the developed model meets the objectives for which it has been developed.

The basic principle of the classifier cross-validation process is to test the developed classification model using a testing data set that has not been used during the classifier training phase. Since the classes of the instances in the testing data set are known in advance, the performance of the classifier is determined by counting how often the developed classifier predicts the correct or incorrect instance class. The output of the cross-validation process is a two-dimensional matrix known as the confusion matrix [134], [135].

Figure 4-1 Confusion Matrix Structure [134]

Figure 4-1 shows that the confusion matrix contains four values, each of which captures a certain aspect of performance. The confusion matrix values are interpreted as follows [135]:

a. True positive (TP): this is the number of instances that have a positive value in the test data set and are predicted to have a positive value by the classifier.

b. True negative (TN): this is the number of instances that have a negative value in the test data set and are predicted to have a negative value by the classifier.

c. False positive (FP): this is the number of instances that have a negative value in the test data set and are predicted to have a positive value by the classifier.

d. False negative (FN): this is the number of instances that have a positive value in the test data set and are predicted to have a negative value by the classifier.
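The four counts defined above can be illustrated with a minimal Python sketch. The function name `confusion_counts`, the `ConfusionCounts` container and the convention that the label `1` marks the positive class are assumptions made for this illustration only; they are not taken from the thesis implementation.

```python
from typing import NamedTuple, Sequence


class ConfusionCounts(NamedTuple):
    """Hypothetical container for the four confusion matrix values."""
    tp: int
    tn: int
    fp: int
    fn: int


def confusion_counts(actual: Sequence[int],
                     predicted: Sequence[int],
                     positive: int = 1) -> ConfusionCounts:
    """Count TP, TN, FP and FN for a binary task.

    `actual` holds the known classes of the test instances and `predicted`
    holds the classifier output for the same instances, in the same order.
    Any label other than `positive` is treated as negative (an assumption).
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1   # positive in the test data, predicted positive
        elif a != positive and p != positive:
            tn += 1   # negative in the test data, predicted negative
        elif a != positive and p == positive:
            fp += 1   # negative in the test data, predicted positive
        else:
            fn += 1   # positive in the test data, predicted negative
    return ConfusionCounts(tp, tn, fp, fn)


if __name__ == "__main__":
    actual = [1, 1, 1, 0, 0, 0, 0, 1]
    predicted = [1, 0, 1, 0, 1, 0, 0, 1]
    print(confusion_counts(actual, predicted))
```

The four counts always sum to the number of test instances, which gives a quick sanity check on any confusion matrix.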

The confusion matrix not only provides detailed information about the predicted results, but also forms the basis for calculating other performance measures [135]. The following points cover the measures used in this thesis and calculated based on the confusion matrix:

a. Classification accuracy: this can take values in the range between 0 and 1; higher accuracy indicates a better performance. Equation (4.1) explains the calculation process for classification accuracy [135].

$$\text{Classification accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4.1}$$

b. Precision: this measures how confident we can be that an instance classified as positive by the developed classifier is actually positive in the testing data. It takes values in the range between 0 and 1. Higher precision indicates a better performance. Equation (4.2) shows the calculation of classifier precision [135].

$$\text{Precision} = \frac{TP}{TP + FP} \tag{4.2}$$

c. Recall: this measures how confident we can be that all positive instances in the testing data set have been found by the proposed model. It takes values in the range between 0 and 1. A higher recall value indicates a better classification performance. Equation (4.3) shows the calculation of recall [135].

$$\text{Recall} = \frac{TP}{TP + FN} \tag{4.3}$$

d. F1 measure: this combines precision and recall into a single measure (their harmonic mean), which serves as an alternative to the misclassification rate. Equation (4.4) defines the F1 measure calculation [135].

$$\text{F1 measure} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{4.4}$$

e. Average class accuracy: the classification accuracy defined in equation (4.1) (above) can misjudge the classifier performance if the tested data set is imbalanced. Hence, the average class accuracy was used to overcome this issue. Average class accuracy is defined in equation (4.5) (below) [135].

$$\text{Average class accuracy} = \frac{1}{|\mathrm{levels}(t)|} \sum_{l \in \mathrm{levels}(t)} \text{recall}_l \tag{4.5}$$

Where $\mathrm{levels}(t)$ refers to the set of levels that the target feature $t$ can take; $|\mathrm{levels}(t)|$ is the size of that set; and $\text{recall}_l$ is the recall value obtained by the model for level $l$.

f. Average class accuracy (harmonic mean): the average class accuracy defined in equation (4.5) (above) uses the arithmetic mean. However, some researchers prefer the harmonic mean, which emphasises the effect of smaller values and therefore produces a more realistic measure of how a model is performing. Equation (4.6) (below) defines how the harmonic mean accuracy is measured.

$$\text{Average class accuracy}_{HM} = \frac{1}{\dfrac{1}{|\mathrm{levels}(t)|} \displaystyle\sum_{l \in \mathrm{levels}(t)} \dfrac{1}{\text{recall}_l}} \tag{4.6}$$

Although various approaches are available to create the test data set, recent research indicates that the 10-fold cross-validation approach has been widely used [134]. In this approach, the available data are divided into 10 equal-sized partitions; in each run, one partition is used as test data while the other nine partitions are used as training data. This process is repeated 10 times, until every partition has been used once as testing data. The overall performance of the prediction model is the aggregation of the model performance across the runs. Figure 4-2 (below) presents the k-fold process in the form of pseudo code [134], [135].
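As a brief aside on the measures themselves, equations (4.1) to (4.6) can be sketched in Python as follows. All function names are hypothetical, and returning 0.0 when a denominator is zero (or when any class recall is zero in the harmonic mean) is a convention assumed for this sketch rather than something stated in the thesis.

```python
from typing import Sequence


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Equation (4.1): share of all test instances classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp: int, fp: int) -> float:
    """Equation (4.2); 0.0 if nothing was predicted positive (assumed)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Equation (4.3); 0.0 if there are no positive instances (assumed)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_measure(p: float, r: float) -> float:
    """Equation (4.4): harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def average_class_accuracy(recalls: Sequence[float]) -> float:
    """Equation (4.5): arithmetic mean of the per-level recall values."""
    return sum(recalls) / len(recalls)


def average_class_accuracy_hm(recalls: Sequence[float]) -> float:
    """Equation (4.6): harmonic mean of the per-level recall values."""
    if any(r == 0 for r in recalls):
        return 0.0  # assumed convention: a zero recall drives the mean to 0
    return len(recalls) / sum(1 / r for r in recalls)


if __name__ == "__main__":
    # Imbalanced binary example: 90 positive and 10 negative test instances.
    tp, fn, tn, fp = 90, 0, 2, 8
    per_level = [recall(tp, fn), recall(tn, fp)]  # positive level, negative level
    print("accuracy:", accuracy(tp, tn, fp, fn))
    print("average class accuracy:", average_class_accuracy(per_level))
    print("average class accuracy (HM):", average_class_accuracy_hm(per_level))
```

In this made-up example the plain accuracy is high because the positive class dominates, while both average class accuracy measures are much lower, and the harmonic mean is the lowest of the three because it weights the poorly recalled negative level more heavily. For the negative level, recall is computed by treating that level as the class of interest, which is why `recall(tn, fp)` is used.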

Figure 4-2 K-fold Cross-validation Process [134]
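The k-fold procedure can also be sketched in plain Python. This is an illustrative sketch and not a reproduction of the pseudo code in Figure 4-2: the names `k_fold_indices`, `cross_validated_accuracy` and `fit_majority_class` are hypothetical, the shuffling seed is an assumption, and pooling the correct predictions over all folds is only one possible reading of "aggregation". The majority-class baseline merely stands in for a real classifier.

```python
import random
from typing import Any, Callable, Iterator, List, Sequence, Tuple

# A "fit" function trains on (features, labels) and returns a predictor.
Predictor = Callable[[Any], Any]
Fit = Callable[[List[Any], List[Any]], Predictor]


def k_fold_indices(n_instances: int, k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k runs."""
    if not 2 <= k <= n_instances:
        raise ValueError("k must be between 2 and the number of instances")
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at
    # most one when n_instances is not an exact multiple of k.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_validated_accuracy(features: Sequence[Any], labels: Sequence[Any],
                             fit: Fit, k: int = 10, seed: int = 0) -> float:
    """Train on k-1 folds, test on the held-out fold, and pool the results."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(labels), k, seed):
        predict = fit([features[j] for j in train_idx],
                      [labels[j] for j in train_idx])
        correct += sum(predict(features[j]) == labels[j] for j in test_idx)
    # Every instance is tested exactly once, so divide by the data set size.
    return correct / len(labels)


def fit_majority_class(train_features: List[Any],
                       train_labels: List[Any]) -> Predictor:
    """Stand-in classifier: always predict the most frequent training label."""
    majority = max(set(train_labels), key=train_labels.count)
    return lambda instance: majority


if __name__ == "__main__":
    labels = [1] * 70 + [0] * 30      # made-up, imbalanced labels
    features = list(range(100))       # placeholder attribute values
    print(cross_validated_accuracy(features, labels, fit_majority_class))
```

Because each instance appears in exactly one test fold, no instance is ever used for both training and testing within the same run, which is the property the cross-validation principle described earlier relies on.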

Having discussed the performance measures taken to compare the proposed SAHBN model with the existing Bayesian-based classification algorithms, the next section explains the data sets created in the human ageing case studies.


In document Semantically aware hierarchical Bayesian network model for knowledge discovery in data : an ontology based framework (Page 96-99)