

2.9 Evaluation Methodologies

2.9.2 Difficulties of corpus-based evaluations

To evaluate a literature-mining method, its output is either compared to a gold standard or manually inspected by an expert. Leser and Hakenberg [159] name several factors that make evaluation based on manually annotated corpora difficult.

Few available Gold Standards. The number of accessible test corpora is limited. Section 2.6.3 describes the most important corpora accessible today. Table 2.21 shows that the available corpora mainly focus on protein and gene entities in text. Only the FetchProt corpus provides marked-up annotations of Gene Ontology terms. Unfortunately, the number of different terms and the amount of manual annotation do not allow for evaluating a Term Recognition algorithm. The GENIA corpus provides enough annotations for an evaluation. However, its 33 ontology terms are not sufficient for evaluating an ontology with thousands of terms. The terms of the GENIA ontology can only be mapped to the Gene Ontology or MeSH on a very high level.

In Doms [67], an evaluation set of 100 documents marked up with GO terms was used, but it was not created by domain experts. The freely accessible GOA database provides a large number of manually curated associations between PubMed identifiers and GO terms which characterize genes and proteins. However, the textual evidence comes from the full texts, which are often not freely accessible. In Doms [67], the evaluation was executed on the freely available abstracts. Recall could therefore not be measured realistically, as many mentions of GO terms do not appear in the abstract. Precision also appeared reduced on this dataset because the system did not annotate only GO terms characterizing proteins: the non-protein-describing GO terms were counted as false positives despite being correctly identified.
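The effect described above can be illustrated with a minimal sketch. The document identifier, the term identifiers (GO:A, GO:B, GO:C) and the function name are hypothetical placeholders chosen for illustration; they are not taken from Doms [67] or from GOA.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two sets of (document, term) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold standard: curated from the full text and restricted to
# GO terms that characterize proteins. GO:B is mentioned only in the full text.
gold = {("PMID1", "GO:A"), ("PMID1", "GO:B")}

# Hypothetical system output on the abstract only: GO:A matches the gold
# standard; GO:C is a correct mention that does not characterize a protein.
predicted = {("PMID1", "GO:A"), ("PMID1", "GO:C")}

precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

In this toy setting, GO:B lowers recall although it cannot be found in the abstract, and GO:C lowers precision although it was recognized correctly.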

The BioCreAtIvE workshop provided the participants with associations of GO terms with text passages from PubMed articles. It is the best available Gold Standard for GO term markups in full texts of PubMed articles. A drawback is that the markups are not complete and reflect only annotations for selected proteins. Relevant mentions of other GO terms were not evaluated by the human curators.

Low annotator agreement. Leser and Hakenberg [159] point out that the markup of entities in text depends greatly on the curators. Curators might miss correct annotations because no biologist knows thousands of proteins including their synonyms and abbreviations. Ideally, the curation process is executed by several experts on the same texts. In this case an inter-annotator agreement can be measured, which marks a natural upper bound on the quality achievable by automatic systems. Inter-annotator agreement is reported to lie between 75 and 90% for gene and protein names [92, 3]. Clear curation guidelines improve the curation consistency [168].
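One common way to quantify inter-annotator agreement for entity markup is a pairwise F-measure over the two annotators' span sets. The sketch below assumes this measure and exact span matching; the cited studies may use a different definition, and the spans are invented for illustration.

```python
def pairwise_f_measure(spans_a, spans_b):
    """Symmetric F-measure between two sets of (start, end, label) spans.

    Only exactly matching spans count as agreement (an assumption).
    """
    spans_a, spans_b = set(spans_a), set(spans_b)
    if not spans_a and not spans_b:
        return 1.0
    shared = len(spans_a & spans_b)
    return 2 * shared / (len(spans_a) + len(spans_b))


# Hypothetical annotations of the same text by two curators.
curator_a = {(0, 5, "protein"), (10, 20, "protein"), (30, 35, "gene")}
curator_b = {(0, 5, "protein"), (12, 20, "protein"), (30, 35, "gene"), (40, 44, "gene")}

agreement = pairwise_f_measure(curator_a, curator_b)
print(round(agreement, 3))
```

Here the two curators share two of their three and four spans respectively; the span starting at offset 10 versus 12 counts as a disagreement under exact matching.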

Identifying GO terms in full-text articles seems to produce high disagreement between curators. Results reported in Camon et al. [43] indicate that there is a 39% chance of curators exactly interpreting the text and selecting the same GO term, a 43% chance that they will extract a term from a different lineage, and a 19% chance that they will annotate a term from the same GO lineage. Counting exact and same-lineage matches together, this amounts to a curator agreement of only 58%. This is due to the fact that curators are taught to annotate according to their individual level of confidence. Nevertheless, 72% of the time curators recalled all possible valid GO terms from the articles.
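The three outcome categories can be made concrete with a small sketch. It assumes that "same lineage" means one term is an ancestor of the other; the toy ontology and the term pairs are hypothetical and are not data from Camon et al. [43].

```python
from collections import Counter

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "protein binding": ["binding"],
    "catalytic activity": ["root"],
    "kinase activity": ["catalytic activity"],
}


def ancestors(term):
    """Return all ancestors of a term, excluding the term itself."""
    seen = set()
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def compare(term_a, term_b):
    """Classify a pair of curator choices as exact, same or different lineage."""
    if term_a == term_b:
        return "exact"
    if term_a in ancestors(term_b) or term_b in ancestors(term_a):
        return "same_lineage"
    return "different_lineage"


# Hypothetical pairs of terms chosen by two curators for the same evidence.
pairs = [
    ("protein binding", "protein binding"),
    ("protein binding", "binding"),
    ("protein binding", "kinase activity"),
    ("kinase activity", "catalytic activity"),
]
counts = Counter(compare(a, b) for a, b in pairs)

# Lenient agreement counts exact and same-lineage matches together,
# analogous to 39% + 19% = 58% in the figures above.
lenient_agreement = (counts["exact"] + counts["same_lineage"]) / len(pairs)
print(counts, lenient_agreement)
```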

Taking into account that human expert curators disagree considerably depending on the task, there is a high risk that any system reporting a performance above the curator agreement is overfitted to the particular gold standard.

Problem severity depends on entity type. The severity varies between entity classes, e.g. the best F-measure for protein and gene name recognition in the second BioCreAtIvE workshop was 85%. Recognition performance on ontology concepts is frequently lower, ranging from 25 to 80% [57, 50, 272].

The problem is severe for Ontology Term Recognition because each concept in the ontology forms its own entity class. Another problem is that most concepts defined in ontologies, e.g. those of the OBO Foundry, are not covered by any training corpus. Hersh et al. [117] report much lower performance for the GO term subtask (50-60%) than for the other subtasks.

Fuzzy markup boundaries. Another difficulty is the ambiguity of where an entity begins and ends. Alex et al. [3] report that most disagreements concern the markup ranges. Depending on the guidelines, organism names may or may not be part of a protein name, e.g. human growth hormone or human IGF-II protein. It also depends on the guidelines how conjunctions are tagged in the corpus, e.g. CSN subunit 4,5,6 or CSN subunit 4,5,6. See also section 2.6.3 on inline vs. standoff annotations. Standoff annotations can express conjunctions better than inline annotations. To make systems comparable, the annotations can be normalized to the sentence level. The problem of markup boundaries is less severe for ontology-based literature search, as the automatic annotations are normalized to sentence or abstract level.
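Normalization to the sentence level can be sketched as follows. The example sentence, the character offsets and the helper names are assumptions made for illustration, and each annotation is assigned to the sentence containing its start offset.

```python
import bisect


def sentence_index(sentence_starts, offset):
    """Return the index of the sentence that contains a character offset."""
    return bisect.bisect_right(sentence_starts, offset) - 1


def to_sentence_level(annotations, sentence_starts):
    """Map (start, end, label) spans to (sentence index, label) pairs."""
    return {
        (sentence_index(sentence_starts, start), label)
        for start, _end, label in annotations
    }


text = "The human growth hormone was measured. Levels were high."
sentence_starts = [0, 39]  # character offsets at which each sentence begins

# Two hypothetical guidelines: with and without the organism name.
with_organism = [(4, 24, "protein")]     # "human growth hormone"
without_organism = [(10, 24, "protein")]  # "growth hormone"

same_spans = set(with_organism) == set(without_organism)
same_sentences = (
    to_sentence_level(with_organism, sentence_starts)
    == to_sentence_level(without_organism, sentence_starts)
)
print(same_spans, same_sentences)
```

The two markups differ at the character level but agree once reduced to the sentence in which the entity occurs.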

Bias toward specific tasks. The systems evaluated in the BioCreAtIvE workshop assumed that database curators favor high precision to support their task. On the other hand, Alex et al. [3] state that some curators prefer high recall. Also, when analyzing large amounts of text to identify protein interaction networks, high recall might be favorable if humans evaluate the networks, since humans naturally tolerate variations of names. When comparing systems, the specific task each system was designed for has to be taken into account.

In the case of ontology-based literature search, high recall is favorable because articles are filtered using general terms. An intolerant recognition strategy might fail to associate a document with any term in the ontology. In this case the document becomes invisible.