2.6 Biomedical text corpora
2.6.3 Annotation corpora
Linguists have created a number of annotated corpora to address problems of natural language understanding such as text segmentation, coreference resolution, named entity recognition and word sense disambiguation. The discipline of statistical natural language processing uses quantitative methods to solve difficult problems in language understanding. Texts with long sentences, highly ambiguous terminology and complex structures are especially difficult for NLP systems to parse and fully understand. Stochastic, probabilistic and statistical methods are used to overcome those problems, which are currently believed not to be solvable by parsing with formal grammars.
Manually annotated corpora differ in various aspects. Besides corpus size, language coverage, free availability and subject domain, some more technical aspects can be compared. One group of corpora was built for general research on natural language [169, 235, 170] and covers, e.g., news language, poetry and technical speech. Other corpora focus on specific research domains and are biased toward scientific language [193, 167, 252]. General corpora can be used to reveal domain-specific terminology and linguistic structures in such domain-specific corpora. Table 2.19 compares publication year, corpus size, subject domain and corpus composition of 15 freely available corpora. Some corpora are built from other corpora; the last column in the table lists which corpora include other corpora.
Other differences between corpora are of a more technical nature. A corpus can be encoded in a simple text file that uses tab or newline characters to separate words and annotations. In contrast, XML-based corpora define an encoding scheme with a document type definition (DTD). As an alternative to proprietary annotation schemes, international standards for linguistic annotation are being developed [131].
Two widely used standards for document annotation are DocBook and the Text Encoding Initiative (TEI). TEI was originally conceived as a means of representing previously existing documents rather than creating them from scratch. DocBook was originally conceived as a publishing DTD for converting between digital and print versions of newly authored documents. Figure 2.8 shows an example of a DocBook document and Figure 2.9 an example of a TEI document. Neither approach offers specialized vocabulary; both instead allow their grammars to be extended. Scalable Vector Graphics (SVG) and MathML, for mathematical formulas, are examples of vocabularies for specialized document content. Thompson and McKelvie [266] compare inline [6, 193] and standoff [167, 252] annotations used in corpora. While inline annotations are inserted into the text, standoff annotations are kept separate from the original text and refer to character positions in it. With standoff annotations, overlapping annotation hierarchies can exist in parallel. The original text may be distributed separately from the annotations, e.g. when it is subject to copyright conditions. A disadvantage of standoff annotations is that errors are introduced if the original document is changed without the external annotations being updated.
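The difference between the two annotation styles can be illustrated with a minimal Python sketch. The annotation labels and the tuple layout (start offset, end offset, label) are assumptions made for this illustration and are not taken from any of the cited corpora. The sketch shows two partially overlapping standoff annotations, which a single inline XML hierarchy could not express, and the way offsets silently go wrong once the text is edited.

```python
from typing import List, Tuple

# (start offset, end offset, label); the labels are hypothetical.
Annotation = Tuple[int, int, str]


def annotated_spans(text: str, annotations: List[Annotation]) -> List[Tuple[str, str]]:
    """Resolve standoff annotations to (label, covered text) pairs."""
    return [(label, text[start:end]) for start, end, label in annotations]


text = "Ras and its GTPase activating proteins"

# The standoff layer is stored apart from the text and addresses it by
# character offsets. The second and third spans overlap without nesting.
standoff = [
    (0, 3, "protein"),
    (12, 38, "protein_family"),
    (8, 18, "possessive_phrase"),
]

resolved = annotated_spans(text, standoff)

# If the text changes but the offsets are not maintained, the same
# annotations point at the wrong characters.
edited = "The " + text
broken = annotated_spans(edited, standoff)
```

Here `resolved[0]` is `("protein", "Ras")`, whereas `broken[0]` is `("protein", "The")`, which is the kind of error described above for unmaintained standoff annotations.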
Annotated corpora differ in the level of structural annotation made by humans. Table 2.21 compares manually annotated corpora with respect to the structural annotations made by curators.
Sentence segmentation is the task of finding the boundaries of sentences in a text. It is an important initial step in structuring text, as there are more intrasentential than intersentential linguistic relations [178]. The task is not entirely trivial, as sentences may embed other sentences. Delimiting characters may be ambiguous, e.g. a colon or semicolon may or may not separate sentences. Punctuation in abbreviations may be misinterpreted as a sentence boundary.
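The abbreviation problem can be made concrete with a small Python sketch. It is not the method of any cited system; the abbreviation list and the example sentence are assumptions chosen for illustration. A naive splitter breaks after every full stop, while the second function re-joins pieces that end in a known abbreviation.

```python
import re

# Illustrative abbreviation list; a real system would need a far larger one.
ABBREVIATIONS = ("e.g.", "i.e.", "Fig.", "et al.", "vs.")


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [piece for piece in re.split(r"(?<=[.!?])\s+", text) if piece]


def abbreviation_aware_split(text):
    """Like naive_split, but re-join pieces that end in a known abbreviation."""
    sentences = []
    buffer = ""
    for piece in naive_split(text):
        buffer = f"{buffer} {piece}".strip()
        if not buffer.endswith(ABBREVIATIONS):
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences


sample = "Src was phosphorylated (Fig. 5C). Fak was not."
naive = naive_split(sample)
aware = abbreviation_aware_split(sample)
```

On the sample, the naive splitter returns three pieces because it treats "Fig." as a sentence end, while the abbreviation-aware variant returns the two intended sentences. The heuristic has an obvious limitation: a sentence that genuinely ends in a listed abbreviation is wrongly merged with the following one.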
Tokenization identifies the basic units of a sentence, e.g. words, punctuation symbols and brackets. In the biomedical domain tokenization is difficult, as many names of chemicals contain brackets and punctuation that can be confused with sentence punctuation. Examples are:
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V4.1//EN">
<book>
  <bookinfo>
    <title>An Example Book</title>
    <author>
      <firstname>Your first name</firstname>
      <surname>Your surname</surname>
      <affiliation>
        <address><email>[email protected]</email></address>
      </affiliation>
    </author>
    <abstract>
      <para>If your book has an abstract then it should go here.</para>
    </abstract>
  </bookinfo>
  <preface>
    <title>Preface</title>
    <para>Your book may have a preface...</para>
  </preface>
  <chapter>
    <title>My first chapter</title>
    <para>This is the first chapter in my book.</para>
    <sect1>
      <title>My first section</title>
      <para>This is the first section in my book.</para>
    </sect1>
  </chapter>
</book>
Figure 2.8: An example of a DocBook document.
<div type="abstract">
  <head>Retinoic acid downmodulates erythroid differentiation and
    <term ana="SEM-94.000">GATA1 expression</term> in
    <term ana="SEM-94.001">purified adult-progenitor culture</term>.
  </head>
  <p>
    <s>
      In
      <cl ana="SEM-94.011 SEM-94.012"
          function="(OR SEM-94.011 SEM-94.012)">
        <term ana="SEM-94.013">clonogenetic fetal calf serum</term>
        <term ana="SEM-94.014">-supplemented (FCS+)</term>
        or
        <term ana="SEM-94.015">-nonsupplemented (FCS-)</term>
        <term ana="SEM-94.016">culture</term>
      </cl>
      treated with saturating levels of
      <term ana="SEM-94.018">interleukin-3</term>
      (<term ana="SEM-94.019">IL-3</term>)
      <term ana="SEM-94.020">granulocyte-macrophage colony-stimulating factor</term> ...
    </s>
    ...
• (Na+ + K+)ATPase
• 2,3,7, 8-tetrachlorodibenzo-p-dioxin
• 2-(4-acetoxyphenyl)-2-chloro N-methyl-ethylammonium
• 2-amino-6-methyldipyrido(1,2-a:3’,2’-d)imidazole
• 2,2’,4,5,5’-Cl5
• 1, 2-bis (o-aminophenoxy) ethane N, N, N’, N’-tetraaceticacid tetra(acetomethoxyl) ester
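Why such names are hard to tokenize can be sketched in a few lines of Python. The two tokenizers below are deliberately simplistic stand-ins written for this illustration, not the tokenizers of any cited system. Neither of them keeps the first example name together as a single token.

```python
import re


def punctuation_tokenize(text):
    """Naive tokenizer: every bracket or punctuation character is its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


def whitespace_tokenize(text):
    """Even simpler tokenizer: split on whitespace only."""
    return text.split()


name = "(Na+ + K+)ATPase"
by_punctuation = punctuation_tokenize(name)
by_whitespace = whitespace_tokenize(name)
```

The punctuation-based tokenizer breaks the name into eight tokens and the whitespace tokenizer into three, so both would need domain-specific rules or a trained model to recover the chemical name as one unit.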
Part-of-speech tagging adds grammatical information to each token. In English, for example, the word “lead” may be a noun or a verb. Text corpora are manually annotated with grammatical forms in order to train machine learning algorithms to identify the correct grammatical form. Most words are tagged as nouns, verbs or adjectives. Table 2.22 shows a complete list of the tags defined for the Penn Treebank corpus. Rule-based taggers (Brill) use such corpora to iteratively learn linguistic rules that minimize errors with respect to the manually annotated tags. Statistical taggers use such corpora to learn the probability of a POS tag given the preceding words.
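The statistical idea can be sketched with relative frequencies over a toy corpus. The three tagged sentences below are invented for this illustration and only borrow Penn Treebank style tag names. The estimate conditions on the preceding word, as described above, whereas practical taggers commonly condition on the preceding tags and apply smoothing.

```python
from collections import Counter, defaultdict

# Toy hand-tagged sentences, invented for illustration only.
TAGGED = [
    [("the", "DT"), ("lead", "NN"), ("was", "VBD"), ("toxic", "JJ")],
    [("they", "PRP"), ("lead", "VBP"), ("the", "DT"), ("study", "NN")],
    [("the", "DT"), ("lead", "NN"), ("pipe", "NN")],
]


def train(tagged_sentences):
    """Count how often each tag occurs for a word after a given previous word."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        previous = "<s>"  # sentence-start marker
        for word, tag in sentence:
            counts[(previous, word)][tag] += 1
            previous = word
    return counts


def tag_probability(counts, previous, word, tag):
    """Relative-frequency estimate of P(tag | previous word, word)."""
    observed = counts.get((previous, word), Counter())
    total = sum(observed.values())
    return observed[tag] / total if total else 0.0


counts = train(TAGGED)
p_noun_after_the = tag_probability(counts, "the", "lead", "NN")
p_noun_after_they = tag_probability(counts, "they", "lead", "NN")
```

In this toy corpus "lead" is always a noun after "the" and never a noun after "they", so the two estimates are 1.0 and 0.0. With realistic data the estimates would be less extreme and unseen contexts would require smoothing.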
Anaphora resolution is the detection of expressions that refer to another expression. An example is “Ras and its GTPase activating proteins”, where “its” refers to “Ras”.
Acronyms are abbreviations that are formed using the initial components in a phrase or name. An example is “a patient with the acquired immune deficiency syndrome (AIDS)”.
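A simple heuristic for finding such definitions is to compare a parenthesized upper-case string with the initial letters of the words directly preceding it. The following Python sketch implements only this heuristic; the function name is hypothetical and the approach is an assumption made for illustration, not the method of a cited system.

```python
import re


def find_acronym_definitions(text):
    """Return (long form, acronym) pairs in which the parenthesized acronym
    equals the initial letters of the words directly preceding it."""
    pairs = []
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        acronym = match.group(1)
        words = text[:match.start()].split()
        candidate = words[-len(acronym):]
        initials = "".join(word[0].upper() for word in candidate)
        if initials == acronym:
            pairs.append((" ".join(candidate), acronym))
    return pairs


sentence = "a patient with the acquired immune deficiency syndrome (AIDS)"
definitions = find_acronym_definitions(sentence)
```

For the example sentence the heuristic recovers the pair of "acquired immune deficiency syndrome" and "AIDS". It would miss forms such as "interleukin-3 (IL-3)", where the acronym contains digits and hyphens and does not consist of word-initial letters only.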
Synonyms are different words with identical or similar meaning. Examples in the biomedical domain are programmed cell death (synonym: apoptosis) and Ganglion Cysts (synonym: Myxoid Cyst).
Chunks are non-overlapping sequences of words forming a phrase. Noun phrases might consist of a noun and its adjective and can be linked by prepositions. Other types of phrases marked up in corpora are verbal phrases, predicate adjectival phrases and subordinate clause markers.
Treebanking annotates the syntactic structure of a sentence. An example from Penn Treebank WSJ is shown in figure 2.10. Noun, verb and prepositional phrases are marked up using a scheme defined in Marcus et al. [170].
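Bracketed treebank annotations of this kind can be read into nested lists with a short recursive parser. The Python sketch below assumes balanced brackets and that every bracket opens with a label; the function names are hypothetical. The input is a fragment of the Penn Treebank example sentence.

```python
def parse_brackets(text):
    """Parse a bracketed string into nested lists: [label, child, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def parse_node():
        nonlocal position
        position += 1  # skip the opening bracket
        node = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                node.append(parse_node())
            else:
                node.append(tokens[position])
                position += 1
        position += 1  # skip the closing bracket
        return node

    return parse_node()


def leaves(node):
    """Collect the words below a node, skipping the node label itself."""
    words = []
    for child in node[1:]:
        words.extend(leaves(child) if isinstance(child, list) else [child])
    return words


def constituents(node, label):
    """List the word sequences of all constituents carrying the given label."""
    found = []
    if node and node[0] == label:
        found.append(" ".join(leaves(node)))
    for child in node[1:]:
        if isinstance(child, list):
            found.extend(constituents(child, label))
    return found


fragment = "(NP (NP warriors) (VP-1 blown ashore (ADVP (NP 375 years) ago)))"
tree = parse_brackets(fragment)
noun_phrases = constituents(tree, "NP")
```

For the fragment, `noun_phrases` contains the complete phrase and the two embedded noun phrases "warriors" and "375 years". Unbalanced input raises an `IndexError`, which is a deliberate simplification of this sketch.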
Table 2.21 compares manually annotated corpora relevant for general and biomedical natural language processing. Column 3 lists the types of entities annotated in each corpus. The most prominent in the biomedical domain are gene and protein names. Other concepts frequently annotated in biomedical corpora are species, protein structures, residues, substances and measurements. Biomedical ontologies exist that contain each of these concepts. However, very few corpora annotate text with ontology concepts at all.
((S
    (NP Battle-tested industrial managers here)
    always
    (VP buck up
        (NP nervous newcomers)
        (PP with
            (NP the tale
                (PP of
                    (NP (NP the
                            (ADJP first
                                  (PP of
                                      (NP their countrymen)))
                            (S (NP *)
                               to
                               (VP visit (NP Mexico))))
                        ,
                        (NP (NP a boatload
                                (PP of
                                    (NP (NP warriors)
                                        (VP-1 blown ashore
                                              (ADVP (NP 375 years) ago)))))
                            (VP-1 *pseudo-attach*))))))))
 .)
Figure 2.10: A sentence with treebanking information. Sentences (S), noun phrases (NP), verb phrases (VP)
GENIA Ohta et al. [193] define their own small ontology and have set up guidelines for curators on how to annotate its concepts. GENIA Corpus Version 3.0x consists of 2000 abstracts. The base abstracts were selected from the search results for the keywords (MeSH terms) Human, Blood Cells and Transcription Factors. Figure 2.11 shows the GENIA ontology. Only the leaf concepts of the ontology are annotated by the curators. The corpus is encoded in XML, marking sentence boundaries, term boundaries, term classifications, semi-structured coordinated clauses and recovered ellipsis in terms. All abstracts and their titles have been marked up with biologically meaningful terms, and these terms have been semantically annotated with descriptors from the ontology.
FetchProt Franzén and Oppenheimer [89] focus on a very small subset of the Gene Ontology and annotate GO concepts related to tyrosine kinase, namely tyrosine kinase activity, transmembrane receptor protein tyrosine kinase activity, non-membrane spanning protein tyrosine kinase activity and Janus kinase activity. The corpus consists of freely accessible documents, mostly concerned with the description of lab experiments on tyrosine kinase activity. Some documents not concerned with such experiments were included to test the FetchProt system for its ratio of false positives. The corpus also contains mentions of mutations of proteins related to tyrosine kinase activity. The text files were produced by saving the PDF files as text. The plain-text content of an article is interspersed with annotations (XML-style tags) surrounding specific semantic elements. Figure 2.12 shows a FetchProt annotation made for the PubMed document with the PMID 9582366. An entire paragraph describing a non-membrane spanning protein tyrosine kinase activity was marked.
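Annotations of this XML-style form can be read with a standard XML parser once a fragment is isolated from the surrounding plain text. The Python sketch below uses the element and attribute names visible in Figure 2.12; the enclosed text is abridged here (marked with [...]) and the function name and the returned dictionary layout are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Fragment modelled on Figure 2.12; the text content is abridged with [...].
FRAGMENT = """
<function prid="p1" fun_id="f1" go_id="GO:0004715">
  <evidence prid="p1" fun_id="f1" ev_id="e1" type="006">
    <experiment prid="p1" fun_id="f1" ev_id="e1">
      We next examined whether Src or Fak catalyze the tyrosine
      phosphorylation of SHPS-1 in vitro. [...]
    </experiment>
    <result prid="p1" fun_id="f1" ev_id="e1">
      Incubation of a GST fusion protein [...] resulted in the
      phosphorylation of both GST-SHPS-1-cyto (Fig. 5A) and Src (Fig. 5C).
    </result>
  </evidence>
</function>
"""


def read_function(xml_text):
    """Extract one record per evidence element of a function annotation."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for evidence in root.iter("evidence"):
        records.append({
            "go_id": root.get("go_id"),
            "evidence_id": evidence.get("ev_id"),
            "evidence_type": evidence.get("type"),
            # Collapse the line breaks and indentation inside the elements.
            "experiment": " ".join((evidence.findtext("experiment") or "").split()),
            "result": " ".join((evidence.findtext("result") or "").split()),
        })
    return records


records = read_function(FRAGMENT)
```

Each record links the GO identifier of the function element to the experiment and result passages, which is the information needed to use the paragraph as evidence for a GO concept.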
None of the freely available manually annotated corpora provide a large enough test set of mentions for the Gene Ontology. GENIA and FetchProt annotate concepts related
Figure 2.11: The GENIA ontology (Version 3.01) is a concept tree. The GENIA corpus is only annotated with the leaf concepts printed bold in this figure.
<function prid="p1" fun_id="f1" go_id="GO:0004715">
  <evidence prid="p1" fun_id="f1" ev_id="e1" type="006">
    <experiment prid="p1" fun_id="f1" ev_id="e1">
      We next examined whether Src or Fak catalyze the tyrosine phosphorylation of SHPS-1 in vitro. Incubation of a GST fusion protein containing the cytoplasmic domain of SHPS-1 (GST-SHPS-1-cyto) with Src kinase immunoprecipitated from Rat-1 cells resulted in the phosphorylation of both GST-SHPS-1-cyto (Fig. 5A) and Src (Fig. 5C).
    </experiment>
    <result prid="p1" fun_id="f1" ev_id="e1">
      Incubation of a GST fusion protein containing the cytoplasmic domain of SHPS-1 (GST-SHPS-1-cyto) with Src kinase immunoprecipitated from Rat-1 cells resulted in the phosphorylation of both GSTSHPS-1-cyto (Fig. 5A) and Src (Fig. 5C).
    </result>
  </evidence>
</function>
Figure 2.12: An XML fragment of a structural template for PMID: 9582366 annotated manually for FetchProt. The experiment describes a non-membrane spanning protein tyrosine kinase activity (GO:0004715).
to concepts in GO. Using the two corpora requires mapping the annotated concepts to actual concepts of the current version of GO. FetchProt used at least one outdated term, which was changed in a more recent version of GO. The GENIA concepts are much more general than most concepts known from GO. For a subset of the GENIA concepts a rough mapping is possible. Another issue is that GENIA does not apply the True Path Rule known from GO. The non-leaf nodes in the GENIA ontology can be seen as hierarchical modifiers of the leaf concepts; the ancestors are not concepts in the same sense as the leaf nodes. In the GENIA corpus only the leaves are annotated by curators. With only 4 nodes, the part of GO used in FetchProt is rather small and therefore cannot serve as a test corpus for ontology-based term identification. In contrast to the currently available text corpora, there exist database curation projects which make use of larger portions of ontologies.
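The True Path Rule states that an annotation with a concept implies an annotation with all of its ancestors. A minimal Python sketch of this propagation follows. The hierarchy uses the four concept names mentioned for FetchProt, but the parent links are an assumption made for illustration, and the sketch allows only one parent per concept, whereas GO itself is a directed acyclic graph in which a concept can have several parents.

```python
# Toy is-a hierarchy (child -> parent); the links are assumed for illustration.
PARENT = {
    "transmembrane receptor protein tyrosine kinase activity":
        "tyrosine kinase activity",
    "non-membrane spanning protein tyrosine kinase activity":
        "tyrosine kinase activity",
    "Janus kinase activity":
        "non-membrane spanning protein tyrosine kinase activity",
}


def ancestors(concept):
    """Return all ancestors of a concept, nearest first."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def propagate(annotations):
    """Apply the True Path Rule: add every ancestor of each annotated concept."""
    closed = set()
    for concept in annotations:
        closed.add(concept)
        closed.update(ancestors(concept))
    return closed


propagated = propagate({"Janus kinase activity"})
```

Under this rule a single annotation with "Janus kinase activity" yields three annotated concepts. In GENIA the corresponding propagation would not be meaningful, because its non-leaf nodes act as modifiers rather than as concepts in their own right.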