Evaluating Information Extraction Systems

2.6 Evaluation Measures

2.6.1 Evaluating Information Extraction Systems

When evaluating an information extraction system, the system's output is usually compared to gold standard information, which is assumed to be correct and is used as ground truth during the evaluation process. In the context of named entity recognition and normalization systems, this ground truth usually consists of annotations created by human linguists (Nadeau and Sekine, 2007). In the following, we present the so-called confusion matrix, which classifies the decisions of an information extraction system into different groups of correct and incorrect decisions. Then, we present the frequently used measures of precision, recall, and f-score as well as the accuracy measure.

In this section, we only briefly described UIMA's basics, which are required for understanding the tools and document processing pipelines described later. For further information about UIMA, we refer to Ferrucci and Lally (2004a,b) and http://uima.apache.org/ [last accessed April 8, 2014].

2 Context of the Work & Basic Concepts

                                 gold standard (ground truth)
                                 positive      negative
system prediction    positive    TP            FP
                     negative    FN            TN

Table 2.3: General confusion matrix.

Confusion Matrix

Information extraction tasks can often be considered as specific sequential tagging and classification tasks (Weiss et al., 2005: p.132). The confusion matrix (also called contingency table or contingency matrix) can be used to describe a system's errors compared to a gold standard (Manning and Schütze, 2003: p.268). In the following, we assume that in a gold standard, single instances, e.g., tokens, are manually annotated either as being a specific entity or as not being an entity. The system that is to be evaluated performs this task automatically, i.e., it extracts entities of that type. Using the confusion matrix, all decisions of the information extraction system can then be assigned to one of the following four classes of a binary classification (Manning and Schütze, 2003: p.268):

• true positive: An instance, which is annotated by the system, is also annotated in the gold standard.
• true negative: An instance, which is not annotated by the system, is also not annotated in the gold standard.
• false positive: An instance, which is annotated by the system, is not annotated in the gold standard.
• false negative: An instance, which is not annotated by the system, is annotated in the gold standard.

In Table 2.3, the confusion matrix is depicted. Based on the four categories, the widely used evaluation measures of precision and recall can be calculated.
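The four classes can be illustrated with a minimal Python sketch. The function name `confusion_counts` and the token-level labels are hypothetical and chosen only for illustration; the sketch assumes that gold standard and system output are aligned lists of boolean labels (True meaning "annotated as entity").

```python
def confusion_counts(gold, predicted):
    """Count TP, FP, FN, and TN for two aligned lists of boolean labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for g, p in zip(gold, predicted):
        if p and g:
            counts["TP"] += 1   # annotated by system and in gold standard
        elif p and not g:
            counts["FP"] += 1   # annotated by system only
        elif g:
            counts["FN"] += 1   # annotated in gold standard only
        else:
            counts["TN"] += 1   # annotated by neither
    return counts


# Hypothetical labels for a six-token sentence.
gold = [True, True, False, False, True, False]
predicted = [True, False, True, False, True, False]
print(confusion_counts(gold, predicted))
# {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2}
```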

Precision, Recall, and F-score

Precision p (Equation 2.1; see, e.g., Manning and Schütze, 2003: p.268) is defined as the ratio of instances correctly marked as positive by the system (TP) to all instances marked as positive by the system (TP+FP), with 0 ≤ p ≤ 1. If all instances marked as positive by the system are correct, then precision equals 1. In contrast, if all instances marked as positive by the system are incorrectly marked, then precision equals 0.

\[ \text{precision } p = \frac{\text{true positives (TP)}}{\text{true positives (TP)} + \text{false positives (FP)}} \tag{2.1} \]

Recall r (Equation 2.2; see, e.g., Manning and Schütze, 2003: p.269) is defined as the ratio of instances correctly marked as positive by the system (TP) to all instances that should be marked as positive, i.e., to all instances marked as positive in the gold standard (TP+FN). As for precision, the range of recall is between 0 and 1 (0 ≤ r ≤ 1). Recall equals 0 if none of the instances that should be marked as positive are marked as positive by the system, while recall equals 1 if all instances that should be marked as positive are also marked as positive by the system.


\[ \text{recall } r = \frac{\text{true positives (TP)}}{\text{true positives (TP)} + \text{false negatives (FN)}} \tag{2.2} \]
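Equations 2.1 and 2.2 translate directly into code. The following is a minimal sketch with hypothetical counts; returning 0.0 when a denominator is zero is an assumed convention here, not something prescribed by the definitions.

```python
def precision(tp, fp):
    """Equation 2.1: TP / (TP + FP)."""
    marked_positive = tp + fp
    # Assumed convention: 0.0 if the system marks nothing as positive.
    return tp / marked_positive if marked_positive > 0 else 0.0


def recall(tp, fn):
    """Equation 2.2: TP / (TP + FN)."""
    gold_positive = tp + fn
    # Assumed convention: 0.0 if the gold standard contains no positives.
    return tp / gold_positive if gold_positive > 0 else 0.0


# Hypothetical counts: 8 correct extractions, 2 spurious ones, 8 missed ones.
print(precision(tp=8, fp=2))  # 0.8
print(recall(tp=8, fn=8))     # 0.5
```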

Obviously, there is a trade-off between precision and recall. Marking all instances as positive results in a recall of 1, while marking only a single instance correctly as positive results in a precision of 1. Depending on the ratio of positive and negative instances in the gold standard, the other measure (precision and recall, respectively) would be rather low if these strategies were applied. Once a system has reached a certain level of precision and recall, an increase of one of the measures usually involves a decrease of the other measure. Thus, the goal is often to find a good trade-off between precision and recall. To determine what “good” is, the fβ-score (also called fβ-measure) can be calculated (Equation 2.3; see, e.g., Manning et al., 2008: p.156). The fβ-score is the weighted harmonic mean of precision and recall.

\[ f_\beta\text{-score } f_\beta = \frac{(1 + \beta^2) \times \text{precision } (p) \times \text{recall } (r)}{\beta^2 \times \text{precision } (p) + \text{recall } (r)} \tag{2.3} \]

Depending on the choice of β, precision and recall can be weighted differently. Frequently used values for β are 0.5, 1, and 2. The f0.5-score weights precision twice as much as recall, while the f2-score weights recall twice as much as precision. Most frequently used is the f1-score, which calculates the balanced harmonic mean (Equation 2.4). Thus, it is often also referred to as f-score or f-measure.

\[ f_1\text{-score } f_1 = \frac{2 \times \text{precision } (p) \times \text{recall } (r)}{\text{precision } (p) + \text{recall } (r)} \tag{2.4} \]
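The effect of β can be seen in a short sketch of Equation 2.3. The function name `f_beta` and the example values are hypothetical; returning 0.0 for a zero denominator is again an assumed convention.

```python
def f_beta(p, r, beta=1.0):
    """Equation 2.3: weighted harmonic mean of precision p and recall r."""
    beta_sq = beta * beta
    denominator = beta_sq * p + r
    # Assumed convention: 0.0 if both precision and recall are 0.
    return (1 + beta_sq) * p * r / denominator if denominator > 0 else 0.0


# Hypothetical system with higher precision than recall.
p, r = 0.8, 0.5
for beta in (0.5, 1.0, 2.0):
    print(f"f{beta}: {f_beta(p, r, beta):.3f}")
```

Because precision is higher than recall in this example, the f0.5-score is the largest of the three values and the f2-score the smallest. With β = 1, the function reduces to Equation 2.4.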

Note that in the context of named entity recognition and normalization (cf. Section 2.2), the measures precision, recall, and f-score can be calculated for the extraction subtask or for the full task of extraction and normalization. In the first case, an entity is considered a true positive (TP) if it is extracted by the system and marked in the gold standard. In the latter case, an entity is only considered a true positive if it is extracted by the system, marked in the gold standard, and additionally normalized correctly. Calculating precision, recall, and f-score for the extraction subtask as well as for the full task of extraction and normalization makes it possible to interpret the evaluation results of information extraction systems and to compare different systems in a meaningful way.
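The two evaluation settings can be contrasted in a minimal sketch. All names and data are hypothetical: entities are represented as dictionaries mapping a character span to a normalized value, and spans must match exactly. The sketch further assumes that, in the full task, an extracted but incorrectly normalized entity counts against both precision and recall; other conventions are possible.

```python
def prf(tp, n_system, n_gold):
    """Precision, recall, and f1-score from TP and the two entity totals."""
    p = tp / n_system if n_system else 0.0
    r = tp / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def evaluate(gold, system):
    """Score the extraction subtask and the full task separately."""
    shared = gold.keys() & system.keys()          # extracted and in gold
    tp_extraction = len(shared)
    tp_full = sum(1 for span in shared if gold[span] == system[span])
    return {
        "extraction": prf(tp_extraction, len(system), len(gold)),
        "extraction+normalization": prf(tp_full, len(system), len(gold)),
    }


# Hypothetical annotations: (start, end) span -> normalized value.
gold = {(0, 10): "2014-04-08", (25, 34): "2014-04", (50, 54): "2013"}
system = {(0, 10): "2014-04-08", (25, 34): "2014-05", (70, 75): "2012"}
for task, scores in evaluate(gold, system).items():
    print(task, [round(s, 3) for s in scores])
```

In this example, two of the three system entities match gold standard spans, but only one of them is also normalized correctly, so all three measures drop from 2/3 to 1/3 when moving from the extraction subtask to the full task.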

Accuracy

An additional way to evaluate the quality of an information extraction system is to calculate the accuracy (Equation 2.5; see, e.g., Manning and Schütze, 2003: p.269). The difference between precision and accuracy is that precision deals only with a system's decisions about those instances marked as positive by the system, i.e., only true positives (TP) and false positives (FP) are considered. In contrast, accuracy measures the correctness of all decisions, independent of whether instances are marked as positive or negative by the system. Thus, accuracy is calculated as the ratio of correct decisions to all decisions.

\[ \text{accuracy} = \frac{\text{correct decisions}}{\text{all decisions}} = \frac{TP + TN}{TP + FP + FN + TN} \tag{2.5} \]
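As a minimal sketch, accuracy as the ratio of correct decisions to all decisions can be computed as follows. The counts are hypothetical: a corpus of 10,000 tokens with 100 entity tokens, and a system that marks nothing at all as positive.

```python
def accuracy(tp, fp, fn, tn):
    """Ratio of correct decisions (TP + TN) to all decisions."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total > 0 else 0.0


# Hypothetical system that never marks any token as an entity.
tp, fp, fn, tn = 0, 0, 100, 9900
print(accuracy(tp, fp, fn, tn))              # 0.99
print(tp / (tp + fn) if tp + fn else 0.0)    # recall: 0.0
```

Although this system finds no entity at all, its accuracy is 0.99 because the true negatives dominate all other counts.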

Note that for some applications, accuracy is the typically used evaluation measure, e.g., for tokenization and sentence splitting applications (Tomanek et al., 2007). However, for information extraction systems, precision, recall, and f-score are typically used because in many situations the class of true negatives “is huge and dwarfs all the other numbers” (Manning and Schütze, 2003: p.269) resulting in high accuracy values independent of how well the class of positives (e.g., entities) is handled by the system. Nevertheless, accuracy is sometimes also reported. For instance, in the context of named entity extraction and normalization, the subtask of normalization is sometimes evaluated based on the accuracy measure. Note, however, that the set of all decisions then does not contain decisions about all entities but only about those extracted by the system (Equation 2.6). Thus, the class of false negatives only contains extracted entities not correctly normalized by the system.

\[ \text{accuracy} = \frac{\text{correctly normalized entities}}{\text{correctly extracted entities}} \tag{2.6} \]

The normalization quality of two systems with different recall values in the extraction task should not be directly compared based on the accuracy value without considering the recall of the extraction. A system can achieve a higher accuracy score than another system although the latter normalizes more entities correctly. For example, assume that system A correctly extracts only one entity and also normalizes this entity correctly. Furthermore, assume that system B correctly extracts all entities in a data set and normalizes all of them except one correctly. Then, 1 = accuracy(system A) > accuracy(system B).
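The comparison of systems A and B can be reproduced with a few lines of Python. The data set size of 10 gold standard entities is a hypothetical choice; the function follows Equation 2.6, and the recall of the full task is computed as correctly normalized entities over all gold standard entities.

```python
def normalization_accuracy(correctly_normalized, correctly_extracted):
    """Equation 2.6: accuracy over correctly extracted entities only."""
    if correctly_extracted == 0:
        return 0.0  # assumed convention for systems that extract nothing
    return correctly_normalized / correctly_extracted


n_gold = 10  # hypothetical number of entities in the data set

# System A: extracts one entity and normalizes it correctly.
acc_a = normalization_accuracy(1, 1)
recall_a = 1 / n_gold

# System B: extracts all entities and normalizes all but one correctly.
acc_b = normalization_accuracy(9, 10)
recall_b = 9 / n_gold

print(f"A: accuracy={acc_a:.2f}, full-task recall={recall_a:.2f}")
print(f"B: accuracy={acc_b:.2f}, full-task recall={recall_b:.2f}")
```

System A reaches the higher accuracy (1.0 versus 0.9), while system B reaches the far higher recall on the full task (0.9 versus 0.1).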

Due to this behavior, we prefer to evaluate both subtasks of named entity extraction and normalization using the measures precision, recall, and f-score. However, sometimes, accuracy is calculated for the normalization subtask as in the temporal tagging task of the TempEval-2 competition (Verhagen et al., 2010) for evaluating the normalization performance of so-called temporal taggers (cf. Section 3.6.2).

In document Domain-sensitive Temporal Tagging for Event-centric Information Retrieval (Page 43-46)