
Evaluating Named Entity Recognition

2.3 Evaluation Measures and Approaches

2.3.2 Evaluating Named Entity Recognition

The output of NER systems is usually compared to the output of human linguists. The evaluation goal is to determine a score for the system based on this comparison. There are many different methods to calculate this score. To evaluate systems automatically, human experts have to create annotated texts with the correct solutions. For example, let us assume that a human expert created the following markup (Nadeau, 2007):

Unlike <PERSON>Robert</PERSON>, <PERSON>John Briggs Jr</PERSON> contacted <ORGANIZATION>Wonderful Stockbrockers Inc</ORGANIZATION> in <LOCATION>New York</LOCATION> and instructed them to sell all his shares in <ORGANIZATION>Acme</ORGANIZATION>.

Let us also assume that an NER system created the following markup (Nadeau, 2007) for the same text:

<LOCATION>Unlike</LOCATION> Robert, <ORGANIZATION>John Briggs Jr</ORGANIZATION> contacted Wonderful <ORGANIZATION>Stockbrockers</ORGANIZATION> Inc <DATE>in New York</DATE> and instructed them to sell all his shares in <ORGANIZATION>Acme</ORGANIZATION>.

The only correct match between the correct solution and the named entity recognition system output is <ORGANIZATION>Acme</ORGANIZATION>; all other markups are errors.
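The comparison above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the helper `parse_entities` and the regular expression `TAG` are hypothetical names introduced here, and the sketch assumes that tags are never nested and that both markups contain exactly the same text once the tags are removed.

```python
import re

# Matches an opening or closing tag such as <PERSON> or </PERSON>.
TAG = re.compile(r"<(/?)([A-Z]+)>")

def parse_entities(markup):
    """Return a set of (start, end, label) character spans.

    Offsets refer to the text with all tags stripped, so that the
    expert markup and the system markup can be compared directly.
    Assumes tags are well-formed and not nested.
    """
    entities = set()
    plain_len = 0   # length of the tag-free text seen so far
    pos = 0         # current position in the marked-up string
    start = None    # start offset of the currently open entity
    for m in TAG.finditer(markup):
        plain_len += m.start() - pos
        pos = m.end()
        if m.group(1):                      # closing tag
            entities.add((start, plain_len, m.group(2)))
        else:                               # opening tag
            start = plain_len
    return entities

gold_markup = (
    "Unlike <PERSON>Robert</PERSON>, <PERSON>John Briggs Jr</PERSON> contacted "
    "<ORGANIZATION>Wonderful Stockbrockers Inc</ORGANIZATION> in "
    "<LOCATION>New York</LOCATION> and instructed them to sell all his "
    "shares in <ORGANIZATION>Acme</ORGANIZATION>."
)
system_markup = (
    "<LOCATION>Unlike</LOCATION> Robert, <ORGANIZATION>John Briggs Jr"
    "</ORGANIZATION> contacted Wonderful <ORGANIZATION>Stockbrockers"
    "</ORGANIZATION> Inc <DATE>in New York</DATE> and instructed them to "
    "sell all his shares in <ORGANIZATION>Acme</ORGANIZATION>."
)

gold = parse_entities(gold_markup)
system = parse_entities(system_markup)

# An exact match requires identical boundaries and an identical label.
exact = gold & system
plain = TAG.sub("", gold_markup)
for start, end, label in exact:
    print(label, plain[start:end])
```

Representing each entity as a (start, end, label) triple makes the strict comparison a plain set intersection; "John Briggs Jr" has the same boundaries in both markups but a different label, so it does not count as a match.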

Error Types

In classification tasks, it is often possible to determine the true positives, false positives, et cetera (see 2.4), but in NER it can help to be more precise about these classes. For example, two false positives are not necessarily equally wrong. Consider a system that had to tag person names in text and tagged “A good start” and “Jim Carrey was” as persons. Obviously, the first occurrence is entirely wrong. The second occurrence, however, must also be considered wrong, although the system only failed to find the correct right-hand boundary and mistakenly tagged the word “was” too. In the previous example (see Section 2.3.2), we can see five different errors an NER system can make (Manning, 2006). The errors are shown and explained in Table 2.5 (Nadeau, 2007).

| Correct Solution | System Output | Error |
|---|---|---|
| Unlike | <LOCATION>Unlike</LOCATION> | The system tagged an entity where none exists. |
| <PERSON>Robert</PERSON> | Robert | The system failed to tag an entity. |
| <PERSON>John Briggs Jr</PERSON> | <ORGANIZATION>John Briggs Jr</ORGANIZATION> | The system tagged the entity, but classified it incorrectly. |
| <ORGANIZATION>Wonderful Stockbrockers Inc</ORGANIZATION> | <ORGANIZATION>Stockbrockers</ORGANIZATION> | The system tagged the entity, but the boundaries are incorrect. |
| <LOCATION>New York</LOCATION> | <DATE>in New York</DATE> | The system found an entity, but classified it incorrectly and chose incorrect boundaries. |

Table 2.5: Named Entity Recognition Error Types (Nadeau, 2007)
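The five error types can be told apart mechanically once both annotations are available as spans. The following sketch is only one possible way to do this and makes several assumptions: entities are given as hypothetical (start, end, label) token spans over the example sentence (punctuation ignored), each system entity overlaps at most one expert entity, and the category names are shorthand invented here rather than terms from the cited sources.

```python
# Token indices: Unlike(0) Robert(1) John(2) Briggs(3) Jr(4) contacted(5)
# Wonderful(6) Stockbrockers(7) Inc(8) in(9) New(10) York(11) ... Acme(21)
gold = [
    (1, 2, "PERSON"),
    (2, 5, "PERSON"),
    (6, 9, "ORGANIZATION"),
    (10, 12, "LOCATION"),
    (21, 22, "ORGANIZATION"),
]
system = [
    (0, 1, "LOCATION"),
    (2, 5, "ORGANIZATION"),
    (7, 8, "ORGANIZATION"),
    (9, 12, "DATE"),
    (21, 22, "ORGANIZATION"),
]

def overlaps(a, b):
    """True if the half-open token ranges of a and b share a token."""
    return a[0] < b[1] and b[0] < a[1]

def classify(gold, system):
    """Return (gold_entity, system_entity, kind) triples."""
    results = []
    matched_gold = set()
    for s in system:
        # Simplification: take the first overlapping expert entity.
        partner = next((g for g in gold if overlaps(s, g)), None)
        if partner is None:
            results.append((None, s, "spurious entity"))
            continue
        matched_gold.add(partner)
        same_span = s[:2] == partner[:2]
        same_type = s[2] == partner[2]
        if same_span and same_type:
            kind = "correct"
        elif same_span:
            kind = "wrong type"
        elif same_type:
            kind = "wrong boundaries"
        else:
            kind = "wrong type and boundaries"
        results.append((partner, s, kind))
    # Expert entities that no system entity touched were missed.
    for g in gold:
        if g not in matched_gold:
            results.append((g, None, "missed entity"))
    return results

for g, s, kind in classify(gold, system):
    print(f"{kind}: gold={g} system={s}")
```

On the example, this yields one instance of each error type from Table 2.5 plus the single correct entity, "Acme".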

Because the error types can be weighted in many different ways for evaluation purposes, three main evaluation methods have evolved over the years:

Exact-match Evaluation

The exact-match evaluation is the simplest method. It does not take the different error types into account: an assignment is correct only if both the boundaries and the classification are correct. The precision and recall are calculated as explained in Section 2.3.2. The final score for the NER system is a micro-averaged F value (MAF). The NER system from the example in Section 2.3.2 would get the following scores according to Equations 2.1 and 2.2:

• Precision = Correct / Assigned = 1/5 = 20 %
• Recall = Correct / Possible = 1/5 = 20 %
• MAF = 20 %

MUC Evaluation

The MUC evaluation method takes all five errors from Table 2.5 into account and scores a system along two axes: the TYPE and the TEXT axis. If an entity was classified correctly (regardless of the boundaries), the TYPE is assigned correct. If an entity was found with the correct boundaries (regardless of its type), the TEXT is assigned correct. For both axes, three measures are used: the number of possible entities, called “POS”, the number of entities actually assigned by the system, referred to as “ACT”, and the number of correct answers by the system, called “COR”. MUC also uses the MAF as the final score for the NER system. Like the usual F value, the micro-averaged F value is the harmonic mean of precision and recall. Using the example from Section 2.3.2, we can calculate the MUC score for the system as follows according to Equations 2.1 and 2.2:

• Correct = COR = 4 (2 times TYPE correct, 2 times TEXT correct)
• Assigned = ACT = 10 (5 times TYPE assigned, 5 times TEXT assigned)
• Possible = POS = 10 (5 times TYPE, 5 times TEXT)
• Precision = Correct / Assigned = 4/10 = 40 %
• Recall = Correct / Possible = 4/10 = 40 %
• MAF = 40 %

ACE Evaluation

The ACE evaluation assigns weights to each entity type. For example, if an NER system correctly classifies an organization it gets one point, whereas it only gets 0.5 points for correctly tagging and classifying a person. Additionally, a cost value is set for the errors “false alarm”, “missed entity”, and “wrong type”. The weights and costs are set for all types and their subtypes, making ACE the most customizable evaluation procedure. The final evaluation score is called Entity Detection and Recognition Value (EDR) and is calculated as 100 % minus the accumulated penalties (costs). The actual EDR for our example from Section 2.3.2 depends on the values chosen for the weights and costs. This dependence is also a major drawback of the ACE evaluation, since results obtained under different weight and cost settings might be difficult to compare. Moreover, the complex formula for the EDR complicates the analysis of errors (Marrero et al., 2009).
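The exact-match and MUC calculations for the running example can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not an official scorer: entities are hypothetical (start, end, label) token spans (punctuation ignored), the function names are invented here, TYPE is counted as correct when a system entity overlaps an expert entity with the same label, and TEXT is counted as correct when the boundaries coincide. ACE is left out because its weights and costs are not specified in the text.

```python
# Token indices: Unlike(0) Robert(1) John(2) Briggs(3) Jr(4) contacted(5)
# Wonderful(6) Stockbrockers(7) Inc(8) in(9) New(10) York(11) ... Acme(21)
gold = [
    (1, 2, "PERSON"),
    (2, 5, "PERSON"),
    (6, 9, "ORGANIZATION"),
    (10, 12, "LOCATION"),
    (21, 22, "ORGANIZATION"),
]
system = [
    (0, 1, "LOCATION"),
    (2, 5, "ORGANIZATION"),
    (7, 8, "ORGANIZATION"),
    (9, 12, "DATE"),
    (21, 22, "ORGANIZATION"),
]

def overlaps(a, b):
    """True if the half-open token ranges of a and b share a token."""
    return a[0] < b[1] and b[0] < a[1]

def f_value(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def exact_match_scores(gold, system):
    """Count only entities whose boundaries and label both agree."""
    correct = len(set(gold) & set(system))
    precision = correct / len(system)
    recall = correct / len(gold)
    return precision, recall, f_value(precision, recall)

def muc_scores(gold, system):
    """Score the TYPE and TEXT axes separately and pool the counts."""
    type_correct = sum(
        1 for s in system
        if any(overlaps(s, g) and s[2] == g[2] for g in gold)
    )
    text_correct = sum(
        1 for s in system
        if any(s[:2] == g[:2] for g in gold)
    )
    cor = type_correct + text_correct
    act = 2 * len(system)   # each system entity is judged on both axes
    pos = 2 * len(gold)     # each expert entity counts on both axes
    precision = cor / act
    recall = cor / pos
    return precision, recall, f_value(precision, recall)

for name, scorer in [("exact", exact_match_scores), ("MUC", muc_scores)]:
    p, r, f = scorer(gold, system)
    print(f"{name}: precision={p:.0%} recall={r:.0%} MAF={f:.0%}")
```

Under these assumptions the sketch gives 20 % on all three measures for exact match and 40 % for MUC, in line with the figures above. The MUC score is higher because partially correct entities such as "John Briggs Jr" (correct TEXT) and "Stockbrockers" (correct TYPE) each earn credit on one axis.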