Evaluation and Interpretation - Statistical Text Mining (STM)

Chapter 2 Research Methodology

2.4 Statistical Text Mining (STM)

2.4.3 Evaluation and Interpretation

Numerous evaluation statistics can be used to evaluate and interpret the performance of a model. The results of these statistics determine whether the model is judged effective.

Confusion Matrix

Before any of these statistics can be applied, a confusion matrix needs to be created. This matrix shows the number of documents the model classified correctly and incorrectly. The center of Figure 6 shows the four possible results of a binary classification problem. If the gold standard classification and the model's predicted classification match, the result is either a True Positive (TP) or a True Negative (TN). If the classifications do not match, the result is one of the two false outcomes. A False Positive (FP) occurs when the gold standard classification is negative but the model predicts positive; that is, the case is falsely classified as positive. These are Type I errors. Conversely, a False Negative (FN) occurs when the gold standard classification is positive but the model predicts negative; that is, the case is falsely classified as negative. These are Type II errors.

Figure 6: Confusion Matrix and Evaluation Metrics
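The four outcomes described above can be tallied directly from paired gold standard and predicted labels. The following is a minimal sketch, not the procedure used in the dissertation; the function name `confusion_counts`, the label encoding (1 for positive, 0 for negative), and the sample labels are assumptions made for illustration.

```python
def confusion_counts(gold, predicted, positive=1):
    """Count TP, FP, TN, and FN for a binary classification task.

    gold and predicted are equal-length sequences of labels; `positive`
    is the label of the class of interest (assumed encoding: 1).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for g, p in zip(gold, predicted):
        if p == positive:
            # Predicted positive: correct (TP) or a Type I error (FP).
            counts["TP" if g == positive else "FP"] += 1
        else:
            # Predicted negative: a Type II error (FN) or correct (TN).
            counts["FN" if g == positive else "TN"] += 1
    return counts


# Hypothetical labels for eight documents.
gold = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(gold, predicted))
# {'TP': 3, 'FP': 1, 'TN': 3, 'FN': 1}
```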

Evaluation Metrics

From these four classification results, several statistics can be calculated to help evaluate and interpret the model. Figure 6 shows the statistics and the formulas for calculating them around the outside of the chart. Table 8 lists the evaluation metrics used and a short explanation of each.

Traditional NLP research does not report specificity or NPV (Sokolova and Lapalme, 2009). The reason is that this type of research typically has no cases classified as true negatives (TN). Consider a grammatical NLP pipeline whose purpose is to extract terms, evaluated against a human-annotated gold standard. Terms the pipeline finds that match the human annotation are true positives (TP). Terms it finds that the human annotation did not are false positives (FP). Terms in the human annotation that it fails to find are false negatives (FN). True negatives would be terms that neither the pipeline nor the human annotator found, and such terms cannot be counted because nothing identifies them. Because of this, specificity and NPV are typically excluded from NLP research. Those statistics are, however, reported in STM research and will be reported here.
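The absence of a TN count in term extraction can be seen by treating the gold standard and the pipeline output as sets. This is an illustrative sketch only; the function name `term_extraction_counts` and the example terms are hypothetical and do not come from the study's data.

```python
def term_extraction_counts(gold_terms, extracted_terms):
    """Return (TP, FP, FN) for a term-extraction task.

    There is no TN: terms found by neither the annotator nor the
    pipeline are not enumerated anywhere, so they cannot be counted.
    """
    gold = set(gold_terms)
    extracted = set(extracted_terms)
    tp = len(gold & extracted)   # found by both
    fp = len(extracted - gold)   # found only by the pipeline
    fn = len(gold - extracted)   # found only by the human annotator
    return tp, fp, fn


# Hypothetical annotations for a single document.
gold_terms = {"chest pain", "fever", "cough"}
extracted_terms = {"fever", "cough", "headache"}
print(term_extraction_counts(gold_terms, extracted_terms))
# (2, 1, 1)
```

With only these three counts available, precision, recall, and F-Measure can be computed, whereas specificity and NPV cannot.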

Many different metrics are used to evaluate the performance of a classification model; among them, accuracy is the most widely used (Sokolova and Lapalme, 2009). Depending on how the model is to be evaluated, some measures work better than others. A measure is invariant to a particular change in the confusion matrix if its value does not change when that change occurs. For example, if the FN count were reduced and the TN count were increased by the same amount, precision would not change to reflect the change in counts, because precision depends only on TP and FP.
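The invariance example above can be checked numerically. The sketch below uses the standard definitions of precision, recall, and accuracy; the counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def precision(tp, fp, fn, tn):
    # Depends only on TP and FP; fn and tn are accepted but unused.
    return tp / (tp + fp)


def recall(tp, fp, fn, tn):
    return tp / (tp + fn)


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical confusion matrices: five cases move from FN to TN.
before = dict(tp=40, fp=10, fn=20, tn=30)
after = dict(tp=40, fp=10, fn=15, tn=35)

for metric in (precision, recall, accuracy):
    print(f"{metric.__name__}: {metric(**before):.3f} -> {metric(**after):.3f}")
# precision: 0.800 -> 0.800
# recall: 0.667 -> 0.727
# accuracy: 0.700 -> 0.750
```

Precision is unchanged by the shift, while recall and accuracy both respond to it.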

Table 8: Evaluation Metrics

| Statistic | Definition |
| --- | --- |
| Sensitivity (Recall) | Proportion of actual positives which are correctly identified as such. This measures the lack of missed classifications. |
| Specificity | Proportion of negatives which are correctly identified. |
| Accuracy | Proportion of correctly identified of the total identified. It is a balance of Precision and Recall. |
| Negative Predictive Value (NPV) | Proportion of actual negatives identified from the total negatives identified. |
| Precision (PPV) | Proportion of actual positives identified from the total positives identified. This measures the lack of false positives. |
| F-Measure | Weighted average of the precision and recall. |
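The metrics in Table 8 can be computed from the four confusion matrix counts. The sketch below uses the standard textbook formulas, which are assumed (not confirmed from the source) to match those shown in Figure 6. The function name `evaluation_metrics`, the `beta` weighting parameter, and the counts are hypothetical; with `beta=1` the F-Measure is the harmonic mean of precision and recall.

```python
def evaluation_metrics(tp, fp, fn, tn, beta=1.0):
    """Compute the Table 8 metrics from confusion matrix counts.

    Assumes every denominator is non-zero; a production version
    would need to handle empty rows or columns of the matrix.
    """
    sensitivity = tp / (tp + fn)          # recall
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    npv = tn / (tn + fn)
    precision = tp / (tp + fp)            # PPV
    b2 = beta ** 2
    f_measure = (1 + b2) * precision * sensitivity / (b2 * precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "npv": npv,
        "precision": precision,
        "f_measure": f_measure,
    }


# Hypothetical counts for 100 documents.
metrics = evaluation_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# sensitivity: 0.667
# specificity: 0.750
# accuracy: 0.700
# npv: 0.600
# precision: 0.800
# f_measure: 0.727
```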

As mentioned above, traditional NLP does not have a count of TNs. This explains why F-Measure is reported in that type of classification task: its calculation uses only TP, FP, and FN, so it can be computed without a TN count. Accuracy, on the other hand, uses all four counts in its calculation. Accuracy is not a good evaluator in a task where one class is very small and the other very large. For example, if a task's data split is 2% for the positive class (the class of interest) and 98% for the negative class, the model could simply classify everything as negative and achieve 98% accuracy, which would be deceiving. Accuracy is an acceptable measure when it is compared against a baseline. The data split for the data set used in the three studies in this dissertation is approximately 23% for the class of interest and 77% for the negative class. Because the class of interest is not very small, accuracy is the measure that will ultimately be used to compare the models to baseline and to decide which model moves forward through the process. Other evaluation metrics will be reported as well but will not be used in the selection of a model. Accuracy is invariant in two situations: first, when the positives and negatives are exchanged, and second, when there is a uniform change in the positives and negatives. These two situations are not of concern for this research because the size of the data set is constant and the models are not being compared to models created from other data sets.
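The accuracy that a trivial majority-class classifier achieves under each split can be sketched as follows. The function name `majority_baseline_accuracy` and the document counts are hypothetical; the counts assume a set of 1,000 documents purely to illustrate the 2%/98% and approximately 23%/77% splits described above.

```python
def majority_baseline_accuracy(n_positive, n_negative):
    """Accuracy of a classifier that always predicts the larger class."""
    return max(n_positive, n_negative) / (n_positive + n_negative)


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical 1,000-document data sets.
skewed = majority_baseline_accuracy(n_positive=20, n_negative=980)
study_like = majority_baseline_accuracy(n_positive=230, n_negative=770)
print(f"2%/98% split baseline:   {skewed:.2f}")      # 0.98
print(f"23%/77% split baseline:  {study_like:.2f}")  # 0.77

# The always-negative classifier on the skewed set finds no positives:
# TP = 0 and FN = 20, so sensitivity is 0 despite 98% accuracy.
sensitivity = 0 / (0 + 20)
print(f"sensitivity on skewed set: {sensitivity:.2f}")  # 0.00

# Exchanging positives and negatives swaps TP with TN and FP with FN,
# which leaves accuracy unchanged.
print(accuracy(tp=40, fp=10, fn=20, tn=30) == accuracy(tp=30, fp=20, fn=10, tn=40))
# True
```

Under the 23%/77% split, a model must exceed roughly 77% accuracy to improve on the majority-class baseline.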


In document Combining Natural Language Processing and Statistical Text Mining: A Study of Specialized Versus Common Languages (Page 52-55)