Performance Evaluation - New techniques for Arabic document classification

In data mining, evaluating the accuracy of machine learning algorithms is an essential step. To classify given data, a set of training data is used to build the classification model. To estimate the accuracy of the resulting classifier, two common approaches, hold-out and cross-validation, are used to assess its ability to predict the correct class or category of unseen instances.

2.11.1 Hold out

In the hold out method, the available data is randomly split into two disjoint sets, a training set and a test set. Usually two thirds of the data are retained for training and the remaining third is used for testing. A problem may arise when one of the classes is not represented in the training portion of the data. This problem is solved using what is called stratified hold out, in which the split is made so that each set contains instances from all classes of the data. In other words, all classes are represented in both data sets [24, 25].
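As an illustration only, a stratified hold out split can be sketched in Python as follows. The function name `stratified_holdout`, the toy labels, and the rule of keeping at least one instance of each class on both sides are assumptions made for this sketch, not part of the cited method descriptions.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=1/3, seed=0):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Assumption: a class with two or more instances keeps at least one
        # instance in the training set and one in the test set.
        n_test = min(len(idxs) - 1, max(1, round(len(idxs) * test_fraction)))
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical toy labels: 6 documents of one class, 3 of another.
labels = ["sport"] * 6 + ["economy"] * 3
train_idx, test_idx = stratified_holdout(labels)
print("train:", train_idx)
print("test: ", test_idx)
```

With these toy labels, two thirds of each class go to the training set and one third to the test set, so neither class can be missing from either side.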

2.11.2 K-fold cross validation

In this method, the data is randomly split into K subsets, or folds, of equal size. The procedure is repeated K times: in each run, one fold is used for testing and the remaining K - 1 folds are used for training. The overall error is then estimated by averaging the K errors. Usually, the stratified version of K fold cross validation is used to ensure that all given classes are represented in all folds [23, 24, 25].
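The procedure can be sketched in Python as below. This is a minimal, non-stratified version; the helper names and the majority-class stand-in classifier are hypothetical and serve only to make the example runnable.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k folds over n instances."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validated_error(data, labels, k, train_fn, predict_fn):
    """Train and test k times, then average the k per-fold error rates."""
    errors = []
    for train_idx, test_idx in k_fold_indices(len(data), k):
        model = train_fn([data[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        wrong = sum(predict_fn(model, data[i]) != labels[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return sum(errors) / k

# Hypothetical stand-in classifier: always predicts the majority training class.
def train_majority(X, y):
    return max(set(y), key=y.count)

def predict_majority(model, x):
    return model

data = list(range(10))
labels = ["a"] * 7 + ["b"] * 3
err = cross_validated_error(data, labels, k=5,
                            train_fn=train_majority,
                            predict_fn=predict_majority)
print("estimated error:", err)
```

Any real classifier can be substituted by passing its own training and prediction functions.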

2.11.3 Leave One Out Cross Validation (LOOCV)

Leave one out cross validation is similar to K fold cross validation. The difference is that the number of folds is equal to the number of instances in the dataset. This means that at each run, there is only one instance in the test set. The advantage of the LOOCV technique is that it avoids random sampling, and almost all of the data takes part in training in every run; however, this method is computationally costly, because the classifier must be trained once per instance [24, 25].
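A minimal Python sketch of LOOCV is given below. The one-nearest-neighbour classifier on a single numeric feature and the toy data are hypothetical and are used only so that the loop can be run.

```python
def loocv_error(data, labels, train_fn, predict_fn):
    """Run N train/test rounds, each holding out exactly one instance."""
    n = len(data)
    wrong = 0
    for i in range(n):
        train_X = data[:i] + data[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train_fn(train_X, train_y)  # one training run per instance
        wrong += predict_fn(model, data[i]) != labels[i]
    return wrong / n

# Hypothetical stand-in classifier: 1-nearest-neighbour on one numeric feature.
def train_1nn(X, y):
    return list(zip(X, y))

def predict_1nn(model, x):
    return min(model, key=lambda pair: abs(pair[0] - x))[1]

data = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
labels = ["a", "a", "a", "b", "b", "b"]
err = loocv_error(data, labels, train_1nn, predict_1nn)
print("LOOCV error:", err)
```

No random number generator appears anywhere in the loop, which reflects the absence of random sampling, while the N calls to the training function reflect the computational cost.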

2.11.4 Error rate

Error rate is the percentage of misclassified instances in a given test set. Consider a test set D consisting of N instances, of which r are misclassified by a classifier. The accuracy of the classifier in correctly predicting the classes of the instances in D can be estimated as follows [23, 24]:

$$\mathrm{Acc} = 1 - \frac{r}{N} \tag{2.25}$$

For a more reliable estimate, the normal distribution is used to place a margin around the accuracy. Provided the dataset is not small, the margin is given by:

$$P = z \sqrt{\frac{\mathrm{Acc}\,(1 - \mathrm{Acc})}{N}} \tag{2.26}$$

where z is the standard normal value for the chosen confidence level (for example, z = 1.96 for 95%). The accuracy is in the range:

$$\mathrm{Acc} \pm P \tag{2.27}$$

The disadvantage of the error rate method is that it ignores the cost of wrong predictions, which is important in machine learning. This problem can be avoided using the F-measure [24].
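As a worked illustration, the estimate and its margin can be computed as in the sketch below. It assumes that accuracy is taken as the fraction of correctly classified instances and that z = 1.96 (a 95% confidence level); the function name and the counts are hypothetical.

```python
import math

def accuracy_interval(n_instances, n_misclassified, z=1.96):
    """Return (accuracy, margin) using the normal approximation."""
    # Assumption: accuracy is the fraction of correctly classified instances.
    acc = 1 - n_misclassified / n_instances
    margin = z * math.sqrt(acc * (1 - acc) / n_instances)
    return acc, margin

# Hypothetical test set: 1000 instances, 100 of them misclassified.
acc, margin = accuracy_interval(n_instances=1000, n_misclassified=100)
print(f"accuracy = {acc:.3f} +/- {margin:.3f}")
```

Because the margin shrinks with the square root of N, quadrupling the test set size only halves the width of the range.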

2.11.5 F1-measure

F-measure is widely used in the information retrieval field and is calculated from two measures, precision and recall. In this context, consider the documents in the test set and a category A. The classifier predicts a category for each document, and these predictions fall into four classes with respect to category A [24, 25].

True Positives (TP): the set of documents that are in category A, and were correctly predicted to be in category A.

True Negatives (TN): the set of documents that are not in category A, and were predicted to be in a different category than A.

False Positives (FP): the set of documents that were predicted to be in category A, but in fact they are of a different category.

False Negatives (FN): the set of documents that were predicted not to be in category A, but are actually in category A.

Precision is the proportion of predicted category A documents that were correctly predicted.

$$\mathrm{Precision} = \frac{|TP|}{|TP| + |FP|} \tag{2.28}$$

Recall is the proportion of actual category A documents that were correctly predicted.

$$\mathrm{Recall} = \frac{|TP|}{|TP| + |FN|} \tag{2.29}$$

The F-measure is the harmonic mean of precision (p) and recall (r).

$$\text{F-measure} = 2\,\frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} = \frac{2\,|TP|}{2\,|TP| + |FP| + |FN|} \tag{2.30}$$
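The three measures can be computed for one category as in the following Python sketch. The function name, the toy label lists, and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
def prf_for_category(actual, predicted, category):
    """Return (precision, recall, f_measure) with respect to one category."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == category and p == category for a, p in pairs)
    fp = sum(a != category and p == category for a, p in pairs)
    fn = sum(a == category and p != category for a, p in pairs)
    # Assumption: a measure is reported as 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f_measure

# Hypothetical actual and predicted categories for six documents.
actual    = ["A", "A", "A", "B", "B", "C"]
predicted = ["A", "A", "B", "A", "A", "C"]
precision, recall, f_measure = prf_for_category(actual, predicted, "A")
print(precision, recall, f_measure)
```

Here |TP| = 2, |FP| = 2 and |FN| = 1, so precision is 2/4, recall is 2/3 and the F-measure is 4/7, which equals the harmonic mean of the first two values.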

2.11.6 Confusion Matrix

This is a simple way to view the performance of classification algorithms. Consider a problem of two classes; using the confusion matrix, the actual and predicted classes of the test set instances can be displayed as in Table 2.4 [24, 25]:

| Actual \ Predicted | Yes | No |
|---|---|---|
| Yes | TP | FN |
| No | FP | TN |

Table 2.4: Confusion Matrix of two classes

The accuracy and error of the classifier are calculated as follows:

$$\mathrm{error} = \frac{FP + FN}{TP + FP + TN + FN} \tag{2.31}$$

$$\mathrm{accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{2.32}$$
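A short Python sketch of filling the four cells of a two-class confusion matrix and deriving accuracy and error from them is given below. The function name, the "yes"/"no" labels and the toy data are hypothetical.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count the four cells of a two-class confusion matrix."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == positive:
            counts["TP" if p == positive else "FN"] += 1
        else:
            counts["FP" if p == positive else "TN"] += 1
    return counts

# Hypothetical actual and predicted classes for eight test instances.
actual    = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
predicted = ["yes", "yes", "no", "yes", "no", "no", "no", "no"]

c = confusion_counts(actual, predicted)
total = sum(c.values())
accuracy = (c["TP"] + c["TN"]) / total
error = (c["FP"] + c["FN"]) / total
print(c, accuracy, error)
```

Accuracy and error always sum to one, since every test instance falls into exactly one of the four cells.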

2.11.7 T-test

The T-test is a statistical method used to assess the difference between the means of two sets. In machine learning, the T-test can be applied to compare the performance of two learning techniques and to determine whether the difference between their mean accuracies is statistically significant [23].
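One common way to apply the test, sketched below in Python, is a paired T-test on the accuracies that two techniques obtain on the same folds. The choice of the paired variant, the per-fold accuracies, and the function name are assumptions for illustration; the critical value 2.262 is the standard two-sided table value for a 0.05 significance level with 9 degrees of freedom.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Return the t statistic of the paired differences between two score lists."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))

# Hypothetical per-fold accuracies of two techniques on the same 10 folds.
acc_a = [0.91, 0.89, 0.93, 0.90, 0.92, 0.88, 0.91, 0.94, 0.90, 0.92]
acc_b = [0.88, 0.87, 0.90, 0.89, 0.90, 0.86, 0.89, 0.91, 0.88, 0.89]

t = paired_t_statistic(acc_a, acc_b)
T_CRITICAL = 2.262  # two-sided, alpha = 0.05, 9 degrees of freedom
significant = abs(t) > T_CRITICAL
print(f"t = {t:.2f}, significant = {significant}")
```

If the absolute value of t exceeds the critical value, the difference between the two mean accuracies is regarded as statistically significant at the chosen level.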
