2.3 Evaluation Measures and Approaches
2.3.2 Evaluating Named Entity Recognition
The output of NER systems is usually compared to the output of human linguists. The evaluation goal is to determine a score for the system based on this comparison. There are many different methods to calculate this score. To evaluate systems automatically, human experts have to create annotated texts with the correct solutions. For example, let us assume that a human expert created the following markup (Nadeau, 2007):
Unlike <PERSON>Robert</PERSON>, <PERSON>John Briggs Jr</PERSON> contacted <ORGANIZATION>Wonderful Stockbrockers Inc</ORGANIZATION> in <LOCATION>New York</LOCATION> and instructed them to sell all his shares in <ORGANIZATION>Acme</ORGANIZATION>.
Let us also assume that an NER system created the following markup (Nadeau, 2007) for the same text:
<LOCATION>Unlike</LOCATION> Robert, <ORGANIZATION>John Briggs Jr</ORGANIZATION> contacted Wonderful <ORGANIZATION>Stockbrockers</ORGANIZATION> Inc <DATE>in New York</DATE> and instructed them to sell all his shares in <ORGANIZATION>Acme</ORGANIZATION>.
The only correct match between the correct solution and the named entity recognition system output is <ORGANIZATION>Acme</ORGANIZATION>; all other markups are errors.
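The comparison above can be sketched in a few lines of Python. This is a minimal illustration, not part of the cited work: it assumes that both markups are represented as (start token, end token, type) triples over a whitespace tokenization with punctuation split off, and the names GOLD, SYSTEM, and exact_matches are hypothetical.

```python
# Tokens (assumed tokenization, punctuation split off):
# 0 Unlike, 1 Robert, 2 ",", 3 John, 4 Briggs, 5 Jr, 6 contacted,
# 7 Wonderful, 8 Stockbrockers, 9 Inc, 10 in, 11 New, 12 York, ...,
# 21 in, 22 Acme, 23 "."
# Each annotation is (start_token, end_token_exclusive, entity_type).
GOLD = {
    (1, 2, "PERSON"),          # Robert
    (3, 6, "PERSON"),          # John Briggs Jr
    (7, 10, "ORGANIZATION"),   # Wonderful Stockbrockers Inc
    (11, 13, "LOCATION"),      # New York
    (22, 23, "ORGANIZATION"),  # Acme
}
SYSTEM = {
    (0, 1, "LOCATION"),        # Unlike
    (3, 6, "ORGANIZATION"),    # John Briggs Jr
    (8, 9, "ORGANIZATION"),    # Stockbrockers
    (10, 13, "DATE"),          # in New York
    (22, 23, "ORGANIZATION"),  # Acme
}


def exact_matches(gold, system):
    """Return the annotations whose boundaries and type both agree."""
    return gold & system


matches = exact_matches(GOLD, SYSTEM)
print(sorted(matches))  # [(22, 23, 'ORGANIZATION')]
```

Representing annotations as hashable triples makes the comparison a plain set intersection; only the Acme annotation survives it.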
Error Types
In classification tasks, it is often possible to determine the true positives, false positives, et cetera (see Section 2.4), but in NER it can help to be more precise about these classes. For example, two false positives are not necessarily equally wrong. Consider a system that had to tag person names in text and tagged “A good start” and “Jim Carrey was” as persons. Obviously, the first occurrence is entirely wrong. The second occurrence, however, must also be considered wrong, although the system only failed to find the correct right-hand boundary and mistakenly tagged the word “was” too. In the previous example (see Section 2.3.2), we can see five different errors an NER system can make (Manning, 2006). The errors are shown and explained in Table 2.5 (Nadeau, 2007).
Correct Solution | System Output | Error
Unlike | <LOCATION>Unlike</LOCATION> | The system tagged an entity where none exists.
<PERSON>Robert</PERSON> | Robert | The system failed to tag an entity.
<PERSON>John Briggs Jr</PERSON> | <ORGANIZATION>John Briggs Jr</ORGANIZATION> | The system tagged the entity, but classified it incorrectly.
<ORGANIZATION>Wonderful Stockbrockers Inc</ORGANIZATION> | <ORGANIZATION>Stockbrockers</ORGANIZATION> | The system tagged the entity, but the boundaries are incorrect.
<LOCATION>New York</LOCATION> | <DATE>in New York</DATE> | The system found an entity, but classified it incorrectly and chose incorrect boundaries.
Table 2.5: Named Entity Recognition Error Types (Nadeau, 2007)
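The five error types in Table 2.5 can be told apart mechanically once annotations carry explicit spans. The following Python sketch is one possible way to do this and is not taken from the cited sources: it assumes token-offset triples (start, exclusive end, type), assumes that each system annotation overlaps at most one annotation of the correct solution, and uses hypothetical label strings and function names.

```python
# Annotations as (start_token, end_token_exclusive, entity_type);
# the token offsets are an assumption of this sketch.
GOLD = [(1, 2, "PERSON"), (3, 6, "PERSON"), (7, 10, "ORGANIZATION"),
        (11, 13, "LOCATION"), (22, 23, "ORGANIZATION")]
SYSTEM = [(0, 1, "LOCATION"), (3, 6, "ORGANIZATION"), (8, 9, "ORGANIZATION"),
          (10, 13, "DATE"), (22, 23, "ORGANIZATION")]


def overlaps(a, b):
    """True if the two token spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def classify_errors(gold, system):
    """Pair each system annotation with an overlapping gold one and label it."""
    results, matched_gold = [], set()
    for sys_ann in system:
        # Simplification: take the first overlapping gold annotation.
        partner = next((g for g in gold if overlaps(g, sys_ann)), None)
        if partner is None:
            results.append((None, sys_ann, "spurious entity"))
            continue
        matched_gold.add(partner)
        same_span = partner[:2] == sys_ann[:2]
        same_type = partner[2] == sys_ann[2]
        if same_span and same_type:
            label = "correct"
        elif same_span:
            label = "wrong type"
        elif same_type:
            label = "wrong boundaries"
        else:
            label = "wrong type and boundaries"
        results.append((partner, sys_ann, label))
    # Gold annotations that no system annotation touched were missed.
    for g in gold:
        if g not in matched_gold:
            results.append((g, None, "missed entity"))
    return results


for gold_ann, sys_ann, label in classify_errors(GOLD, SYSTEM):
    print(gold_ann, sys_ann, label)
```

On the running example, this yields one instance of each error type from the table plus the single correct annotation for Acme.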
Because the error types can be weighted in many different ways for evaluation purposes, three main evaluation methods have evolved over the years:
Exact-match Evaluation
The exact-match evaluation is the simplest method. It does not take the different error types into account: an assignment counts as correct only if both its boundaries and its classification are correct. The precision and recall are calculated as explained in Section 2.3.2. The final score for the NER system is a micro-averaged F value (MAF). The NER system from the example in Section 2.3.2 would get the following scores according to Equations 2.1 and 2.2:
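\[ \text{Precision} = \frac{\text{Correct}}{\text{Assigned}} = \frac{1}{5} = 20\,\% \]
\[ \text{Recall} = \frac{\text{Correct}}{\text{Possible}} = \frac{1}{5} = 20\,\% \]
\[ \text{MAF} = 20\,\% \]
MUC Evaluation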
The MUC evaluation method takes all five errors from Table 2.5 into account and scores a system along two axes: the TYPE axis and the TEXT axis. If an entity was classified correctly (regardless of the boundaries, provided the spans overlap), TYPE is counted as correct. If an entity was found with the correct boundaries (regardless of its type), TEXT is counted as correct. For both axes, three measures are used: the number of possible entities, called “POS”, the number of entities actually assigned by the system, referred to as “ACT”, and the number of correct answers by the system, called “COR”. MUC also uses the MAF as the final score for the NER system. Like the usual F value, the micro-averaged F value is the harmonic mean of precision and recall. Using the example from Section 2.3.2, we can calculate the MUC score for the system as follows according to Equations 2.1 and 2.2:
Correct = COR = 4 (2 times TYPE correct, 2 times TEXT correct)
Assigned = ACT = 10 (5 times TYPE assigned, 5 times TEXT assigned)
Possible = POS = 10 (5 times TYPE, 5 times TEXT)
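\[ \text{Precision} = \frac{\text{Correct}}{\text{Assigned}} = \frac{4}{10} = 40\,\% \]
\[ \text{Recall} = \frac{\text{Correct}}{\text{Possible}} = \frac{4}{10} = 40\,\% \]
\[ \text{MAF} = 40\,\% \]
ACE Evaluation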
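For reference, the exact-match and MUC scores derived above can be reproduced with a short Python sketch. It is an illustration under stated assumptions rather than an official scorer: annotations are token-offset triples (start, exclusive end, type), TYPE is counted as correct only when the system span overlaps a span of the same type in the correct solution, and all function names are hypothetical.

```python
# Annotations as (start_token, end_token_exclusive, entity_type);
# the token offsets are an assumption of this sketch.
GOLD = [(1, 2, "PERSON"), (3, 6, "PERSON"), (7, 10, "ORGANIZATION"),
        (11, 13, "LOCATION"), (22, 23, "ORGANIZATION")]
SYSTEM = [(0, 1, "LOCATION"), (3, 6, "ORGANIZATION"), (8, 9, "ORGANIZATION"),
          (10, 13, "DATE"), (22, 23, "ORGANIZATION")]


def overlaps(a, b):
    """True if the two token spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def prf(correct, assigned, possible):
    """Precision, recall, and their harmonic mean (F value)."""
    p = correct / assigned if assigned else 0.0
    r = correct / possible if possible else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def exact_match_scores(gold, system):
    """Count only annotations with identical boundaries and type."""
    correct = len(set(gold) & set(system))
    return prf(correct, len(system), len(gold))


def muc_scores(gold, system):
    """Score the TYPE and TEXT axes separately, then pool the counts."""
    type_cor = sum(
        any(overlaps(g, s) and g[2] == s[2] for g in gold) for s in system)
    text_cor = sum(
        any(g[:2] == s[:2] for g in gold) for s in system)
    cor = type_cor + text_cor   # COR
    act = 2 * len(system)       # ACT: one TYPE and one TEXT slot per answer
    pos = 2 * len(gold)         # POS: one TYPE and one TEXT slot per entity
    return prf(cor, act, pos)


print(["{:.0%}".format(x) for x in exact_match_scores(GOLD, SYSTEM)])
# ['20%', '20%', '20%']
print(["{:.0%}".format(x) for x in muc_scores(GOLD, SYSTEM)])
# ['40%', '40%', '40%']
```

Pooling the TYPE and TEXT counts before dividing is what makes the result a micro-average: each axis contributes in proportion to the number of slots it covers.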
The ACE evaluation assigns weights to each entity type. For example, if an NER system correctly classifies an organization, it gets one point, whereas it only gets 0.5 points for correctly tagging and classifying a person. Additionally, a cost value is set for the errors “false alarm”, “missed entity”, and “wrong type”. The weights and costs are set for all types and their subtypes, making ACE the most customizable evaluation procedure. The final evaluation score is called the Entity Detection and Recognition Value (EDR) and is calculated as 100 % minus the accumulated penalties (costs). The actual EDR for our example from Section 2.3.2 depends on the values chosen for weights and costs. This dependence is also a major drawback of the ACE evaluation, since results obtained with different weight and cost settings are difficult to compare. Moreover, the complex formula for the EDR complicates the analysis of errors (Marrero et al., 2009).