Part I Text Mining
3.7 Chapter Summary
Most biological objects, such as genes and proteins, are assigned several names or identifiers. For applications such as manual literature search, automated text mining, named entity identification, gene/protein annotation, and linking of information from different data sources, it is necessary to know the names and symbols for a given gene or protein. Various organism-specific or general public databases organize knowledge about genes and proteins. These databases can be used for deriving gene and protein name dictionaries, which represent a means for compiling identifiers, symbols, and names for genes and proteins.
In this chapter, a method to automatically derive gene and protein name dictionaries from public databases has been presented. An automatic curation procedure leads to high-quality gene and protein name dictionaries. The resulting dictionaries form the basis for the gene and protein name identification approaches presented in the next chapter (Chapter 4).
The detailed analysis of gene and protein name dictionaries revealed important differences between the databases the dictionaries were derived from as well as between the organism nomenclatures (Fundel and Zimmer, 2006). The number of genes/proteins and synonyms covered in individual databases varies significantly for a given organism. All dictionaries show an important yet varying degree of within-dictionary ambiguity. The between-dictionary ambiguity reflects the degree of relationship of the organisms. The degree of ambiguity of gene names with common English words and domain-related non-gene terms varies for the respective organisms and reflects the nomenclature guidelines. Despite considerable efforts of co-curation, the overlap of synonyms in different data sources is rather moderate. Combining data from several databases yields gene and protein name dictionaries that contain considerably more names in actual use than dictionaries obtained from individual data sources. Curation increases the size and decreases the ambiguity of the dictionaries. Hierarchical synonym dictionaries provide a means to recover gene or protein groups, or genes/proteins that are not fully specified in a text. A method has been presented that generates hierarchical synonym dictionaries by application of heuristic rules to the standard gene and protein name dictionaries (Friedel, 2003; Donner, 2003).
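The notion of within-dictionary ambiguity can be illustrated with a minimal sketch. It assumes that a dictionary maps each identifier to a set of synonyms and that a synonym counts as ambiguous when it is listed for more than one identifier; the exact definition used in the analysis is not restated in this summary, and the function names and toy entries below are hypothetical.

```python
from collections import defaultdict


def ambiguous_synonyms(dictionary):
    """Return the synonyms that are listed for more than one identifier.

    `dictionary` maps an identifier to a set of synonyms (assumed layout).
    """
    ids_by_synonym = defaultdict(set)
    for gene_id, synonyms in dictionary.items():
        for synonym in synonyms:
            ids_by_synonym[synonym].add(gene_id)
    return {s: ids for s, ids in ids_by_synonym.items() if len(ids) > 1}


def ambiguity_rate(dictionary):
    """Fraction of distinct synonyms that point to several identifiers."""
    all_synonyms = {s for synonyms in dictionary.values() for s in synonyms}
    if not all_synonyms:
        return 0.0
    return len(ambiguous_synonyms(dictionary)) / len(all_synonyms)


# Toy dictionary with invented identifiers and names.
toy = {
    "G1": {"abc1", "foo"},
    "G2": {"abc1", "bar"},
    "G3": {"baz"},
}

print(ambiguous_synonyms(toy))
print(ambiguity_rate(toy))
```

Applying the same count to the union of two organism dictionaries, with identifiers prefixed by organism, would give a comparable measure of between-dictionary ambiguity under the same assumption.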
Dictionaries for non-gene and non-protein objects (here: organisms, body parts, tissues, cell types/cell lines, and diseases) and abbreviations, as well as an interaction term list, have been compiled. These are useful for context filtering and analysis, inter-dictionary disambiguation, and interaction extraction, respectively.
The gene and protein name dictionaries obtained from the combination of several data sources and subjected to the curation procedure are publicly available via several tools.
LiMB (Güttler, 2006) makes use of the synonym dictionaries for text mining and focuses on user-friendly visualization of results. The ProThesaurus and BeThesaurus web services support automatic querying. The ProTag client applications enable users to query the synonym dictionaries via the web services from within Microsoft Office applications and from a standalone tool (Szugat et al., 2005). The ProThesaurus Wiki enables users to query synonym dictionaries, search MEDLINE and the Internet with synonyms, and update the Wiki content.
In the next chapter (Chapter 4), methods for high quality gene and protein identification
Gene and Protein Name Identification
Scientific articles are one of the main sources for biomedical information; an important part of biological knowledge is only available in free text. Due to the enormous amount of literature, it becomes necessary to exploit texts and extract information automatically. The identification of gene and protein names is one of the most important tasks in biomedical text mining. Often, this is required as a preprocessing step; for example, when aiming at involved information extraction, relation detection, or integration of text data with data from other sources.
In this chapter, three modular systems for gene and protein name identification are presented (Figure 4.1); all of them rely on synonym dictionaries, which can be generated by the methods described in the previous chapter (Section 3.2).
The exact matching approach (Figure 4.1a, Section 4.2, Fundel et al. (2005a)) directly
evaluates the applied synonym dictionary. Various postfilters can be applied for increasing precision. This approach has been developed together with Joannis Apostolakis and Daniel Güttler.
The ProMiner system (Figure 4.1b, Section 4.3, Hanisch et al. (2005)) builds on the tool ProMiner (Hanisch et al., 2003), which identifies gene and protein names by approximate matching of synonyms. Here, it has been extended by specific preprocessing of the synonym dictionary and by postfiltering; thus, it has been adapted to the naming conventions of yeast, mouse, and fly. The extension is joint work with Daniel Hanisch, Theo Mevissen, and Juliane Fluck.
The combined system (Figure 4.1c, Section 4.4, Fundel and Zimmer (2007)) integrates the matching results of exact matching and ProMiner and implements inter-dictionary and intra-dictionary disambiguation.
All three approaches have been evaluated in the BioCreAtIvE challenge evaluations (Sec-
4.1 Introduction and Literature Review
Gene and protein name identification is concerned with finding occurrences of a gene or protein name in a text and returning the corresponding unique identifier for the gene or protein together with the detected text fragment. Most genes/proteins have multiple names; gene names show high variability, are often ambiguous, and can overlap with English words (see Section 3.3). Therefore, gene name identification is a difficult task. Gene and protein names of different organisms vary significantly. Accordingly, names of different organisms exhibit varying degrees of difficulty for text mining.
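The task definition above can be made concrete with a small sketch of dictionary-based exact matching. It is not the system described in Section 4.2; it only assumes a dictionary that maps identifiers to synonyms, matches synonyms case-sensitively at word boundaries, and returns the identifier together with the detected fragment and its offsets. All identifiers, names, and function names are hypothetical.

```python
import re


def find_genes(text, dictionary):
    """Return (identifier, fragment, start, end) for each synonym found.

    `dictionary` maps an identifier to a set of synonyms (assumed layout).
    An ambiguous synonym yields one hit per identifier it is listed for.
    """
    synonym_to_ids = {}
    for gene_id, synonyms in dictionary.items():
        for synonym in synonyms:
            synonym_to_ids.setdefault(synonym, set()).add(gene_id)
    if not synonym_to_ids:
        return []

    # Try longer synonyms first so that "Abc1 kinase" wins over "Abc1".
    ordered = sorted(synonym_to_ids, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    # The lookarounds prevent matches inside longer tokens such as "Xyz22".
    pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")

    hits = []
    for match in pattern.finditer(text):
        fragment = match.group(0)
        for gene_id in sorted(synonym_to_ids[fragment]):
            hits.append((gene_id, fragment, match.start(), match.end()))
    return hits


# Toy dictionary and sentence with invented names.
toy_dictionary = {"YFG1": {"Abc1", "Abc1 kinase"}, "YFG2": {"Xyz2"}}
toy_text = "Abc1 kinase binds Xyz2 but not Xyz22."

for hit in find_genes(toy_text, toy_dictionary):
    print(hit)
```

Because an ambiguous synonym produces several hits for the same fragment, a sketch like this makes visible why postfiltering and disambiguation steps are needed after matching.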
Several approaches have been proposed to tackle gene and protein name identification. Machine learning methods, such as support vector machines and hidden Markov models (HMMs) (Bunescu et al., 2003), or HMMs for gene mention tagging followed by normalization according to various filters (Morgan et al., 2004), have been applied successfully. Other methods focus on linguistic techniques (e.g., Tanabe and Wilbur (2002)).
Numerous methods make use of gene and protein name dictionaries, which can be extracted from databases, ontologies, and other data sources (Ono et al., 2001; Hanisch et al., 2003; Koike and Takagi, 2004; Tsuruoka and Tsujii, 2004).
Some methods rely on the combination of dictionaries and linguistic methods, such as ProtScan, which combines a dictionary-based approach and a specialized tokenization approach (Egorov et al., 2004). BLAST (Altschul et al., 1997), a tool for DNA and protein sequence comparison, has also been used for matching gene and protein names against texts (Krauthammer et al., 2000). An overview of biological named entity extraction and a lexical matching exercise that illustrates specific problems of fly gene synonyms has been presented by Hirschman et al. (2002a).
Disambiguation, that is, identification of the correct meaning of an expression out of a set of possible alternatives, plays an important role in gene name identification. Several studies concerning disambiguation of abbreviations as well as words or even longer expressions have been presented. Most of them are based on machine learning, which requires annotated training data (e.g., Hatzivassiloglou et al. (2001); Liu et al. (2002b)).
The BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) challenge evaluation (Hirschman et al., 2005b) is a community-wide effort for evaluating text mining and information extraction systems applied to the domain of biomedical literature. The first challenge was conducted in 2004 and consisted of three tasks. Task 1A evaluated the recognition of gene and protein names in texts. Task 1B was set up to assess the ability of automated systems to identify names of genes and gene products and normalize them by associating a unique identifier with each gene/gene product (Hirschman et al., 2005a). The focus was on yeast, mouse, and fly. Task 2 contained several subtasks that concerned the assignment of GO terms based on text analysis.
The second BioCreAtIvE challenge was organized in 2006. Amongst other tasks, this challenge evaluated human gene name normalization.
The selected organisms are of high general interest: they are among the most intensively studied organisms experimentally. Yeast, mouse, and fly are frequently used as model organisms to elucidate pathways and molecular interactions that might play a role in human diseases. As many scientific publications deal with these organisms, a reliable gene/protein name identification method would be a significant advance for information retrieval and extraction.
Clearly, the BioCreAtIvE challenge represents substantial progress for the domain, as it enables researchers to evaluate their systems on a blind prediction basis and on an independent test set. The challenge and the provided data sets make it possible to compare approaches. Yet, the first evaluations also suffered from limitations; for example, the test sets of 250–262 abstracts are still small compared to the over 16 million citations in MEDLINE, and the annotations are questionable in a number of cases.
[Figure 4.1 (workflow diagrams): (a) Exact Matching Approach, (b) ProMiner Approach, (c) Combined Approach]
Figure 4.1: Workflow of the gene name identification systems applied for the BioCreAtIvE gene normalization tasks. All systems make use of extensively curated synonym dictionaries. The systems search synonyms against texts and return, for each found gene or protein, the unique identifier together with the detected text fragment.
The goal of this work was to develop methods for gene name identification that are applicable to large collections of text, achieve good performance with balanced recall and precision, and can be customized to meet the requirements imposed by specific organism nomenclatures.
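The following sketch shows one way recall and precision can be computed for a gene normalization task. It assumes that both the system output and the gold standard are given as sets of (document, identifier) pairs; the F-measure is added here as a common summary of the two values and is not taken from the text above. The function name and the toy data are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and F1 over sets of (document, identifier) pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example with invented document and gene identifiers.
predicted = {("doc1", "G1"), ("doc1", "G2"), ("doc2", "G3")}
gold = {("doc1", "G1"), ("doc2", "G3"), ("doc2", "G4"), ("doc3", "G5")}

p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this toy case two of three predictions are correct and two of four gold pairs are found, so precision exceeds recall; a system with balanced recall and precision would bring the two values closer together.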
To this end, the systems described in the following have been set up. The BioCreAtIvE challenges were an ideal scenario to evaluate the approaches. With the exact matching approach applied in the first challenge, the quality of the applied synonym dictionaries could be demonstrated. For ProMiner, the parameters were investigated, and the system was thus adapted for application to yeast, mouse, and fly synonyms. By using comparable synonym dictionaries, the exact matching approach could be compared with the ProMiner approach in terms of recall and precision, as well as runtime and ease of use. In the second challenge, the combined system has been applied. Here, the focus was on the evaluation of the proposed approach for disambiguation.