Part I Text Mining
2.3 Text Mining in Bioinformatics
The technical means for conducting large-scale biological experiments involving thousands of genes and proteins are established, and such experiments are routinely performed. Their interpretation remains an important problem. For example, gene expression can be measured for large gene sets, and subsets of genes with correlated expression patterns can be extracted by statistical analysis and clustering of the measured data. Yet similar expression patterns do not necessarily imply involvement in the same biological process, and functional relationships cannot be determined from cluster data alone.
Published literature contains information that can be used for a more detailed analysis. Given the amount of data to be analyzed, it becomes tedious or even impossible to read and analyze all published literature dealing with all genes returned by a crude data analysis of such experiments. Here, text mining helps by automatically preselecting documents and analyzing the information they contain. For reviews on text mining in the biomedical domain, see Shatkay and Feldman (2003) and Jensen et al. (2006).
Major bioinformatics conferences such as Intelligent Systems for Molecular Biology (ISMB) and Pacific Symposium on Biocomputing (PSB) responded to the growing interest in text mining in bioinformatics by publishing papers on the topic since the early 1990s and by devoting entire sessions to the field since the late 1990s. Both the NLP and the bioinformatics domains have sound scientific achievements to build on, and the effort of bringing them together is certainly useful for both. As this integration is an ongoing process, and the underlying domains also continue to evolve, significant scientific progress is expected in the upcoming years.
Examples of specific literature mining tasks in bioinformatics are:
• Extraction of keywords and functional annotation of proteins (Andrade and Valencia, 1998)
• Generating gene summaries (e.g. sequence information, phenotypes, interactions) (Ling et al., 2006)
• Predicting the subcellular localization of proteins (Stapley et al., 2002), in conjunction with protein sequence based features (Hoglund et al., 2006)
• Annotation of enzyme classes with disease-related information (Hofmann and Schomburg, 2005)
• Assisting BLAST searches (Chang and Lin, 2001)
• Detection of remote homologs by combination of PSI-BLAST and analysis of Swiss-Prot annotations (MacCallum et al., 2000)
• Discovering protein similarity (Sarkar and Rindflesch, 2002)
• Assisting microarray data interpretation (Masys, 2001; Masys et al., 2001)
• Pathway discovery (Blaschke et al., 1999; Krauthammer et al., 2002)
• Nucleic acid and peptide sequence identification in texts (Wren et al., 2005b)
• Discovery of themes or gene groups with similar functionality within gene lists (Pehkonen et al., 2005)
• Ranking of documents by their relevance with respect to gene queries (Sehgal and Srinivasan, 2006) or Swiss-Prot medical annotation (Dobrokhotov et al., 2003)
• Generation of hypotheses for the explanation of experimental or clinical data (Swanson, 1986; Smalheiser and Swanson, 1998; Weeber et al., 2001; Srinivasan and Libbus, 2004)
Most of these specific tasks require a mapping between entities and articles containing information about these entities. Manually curated public databases generally contain references to the articles from which the curated information was obtained. These references can be used to generate a mapping between entities and articles. Alternatively, named entity identification can be applied to generate more comprehensive mappings.
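The reference-based mapping described above can be illustrated with a minimal sketch. It assumes that the curated database has already been exported as pairs of entity identifier and cited article identifier; the record format, the identifiers, and the function name build_mappings are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical curated records: (entity identifier, cited article identifier).
# All identifiers are invented for illustration only.
curated_references = [
    ("GENE_A", "article_1"),
    ("GENE_A", "article_2"),
    ("GENE_B", "article_2"),
    ("GENE_C", "article_3"),
]


def build_mappings(references):
    """Return the entity-to-articles and article-to-entities mappings."""
    entity_to_articles = defaultdict(set)
    article_to_entities = defaultdict(set)
    for entity, article in references:
        entity_to_articles[entity].add(article)
        article_to_entities[article].add(entity)
    return dict(entity_to_articles), dict(article_to_entities)


entity_to_articles, article_to_entities = build_mappings(curated_references)

# Articles that the curated records link to GENE_A.
print(sorted(entity_to_articles["GENE_A"]))
# Entities that the curated records link to article_2.
print(sorted(article_to_entities["article_2"]))
```

A mapping derived from named entity identification could be stored in the same two dictionaries, with the pairs produced by a tagger instead of by database curators.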
Biomedical Language
Biomedical language differs significantly from common English. This is largely due to the descriptive nature of the biomedical sciences and the large number of technical terms. A standardization effort has emerged only during the last decades. Biomedical language is characterized by synonymy; that is, most biological objects and concepts are represented by more than one term. For example, genes and proteins have five to ten synonyms on average. Ambiguity also occurs frequently; that is, a term often refers to several entities or concepts. Abbreviations and acronyms (i.e. abbreviations that are formed by combining the first, and sometimes other, letters of the principal words) are very common in biomedical texts. They are frequently defined by authors as needed and are especially prone to ambiguity. Multi-word units play an important role in describing biomedical concepts, and spelling variants occur frequently. Generally, it is difficult to map biomedical text to standard ontologies or thesauri in an automated way because of the numerous spelling variants and nested terms.
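Synonymy, ambiguity, and spelling variants can be made concrete with a small dictionary lookup. The following is only a sketch under simplifying assumptions: the thesaurus entries are invented, and the normalization handles nothing beyond case, hyphens, and whitespace, which is far less than a real term-mapping system would need.

```python
import re
from collections import defaultdict

# Hypothetical thesaurus: entity identifier -> known names (synonyms).
# Entities and names are invented for illustration only.
thesaurus = {
    "ENTITY_1": ["Abc-1", "ABC1", "alpha binding component 1"],
    "ENTITY_2": ["ABC 1", "Xyz"],
}


def normalize(term):
    """Collapse simple spelling variants: case, hyphens, and whitespace."""
    return re.sub(r"[\s\-]+", "", term.lower())


def build_index(entries):
    """Invert the thesaurus: normalized term -> set of entity identifiers."""
    index = defaultdict(set)
    for entity, names in entries.items():
        for name in names:
            index[normalize(name)].add(entity)
    return dict(index)


index = build_index(thesaurus)

# Synonymy: several names lead to the same entity.
# Ambiguity: a normalized term that leads to more than one entity.
ambiguous_terms = {term for term, entities in index.items() if len(entities) > 1}

print(sorted(index[normalize("abc-1")]))
print(sorted(ambiguous_terms))
```

In this toy thesaurus the variants "Abc-1", "ABC1", and "ABC 1" collapse to the same normalized term, which then points to two different entities. Normalization therefore improves the coverage of spelling variants at the price of additional ambiguity.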
Biomedical language is constantly changing. As new entities are discovered, new terms are introduced for them. Sometimes, names are changed when more knowledge about individual entities has accumulated. For example, when it becomes evident that a protein forms part of a family, the protein is usually assigned the family name and a specific subtype identifier such as a number, letter, or Greek letter.
Interestingly, the amount of information available in the literature for individual genes follows an extreme power law distribution, and the impact of a gene in the scientific literature is not correlated with its centrality in protein-interaction networks (Hoffmann and Valencia, 2003a,b).
Scientific articles in the biomedical domain are generally structured according to a well-defined schema: An article contains a title and an abstract. The full-text article additionally contains the following sections: Introduction, Materials and Methods, Results, Discussion, and Conclusions. The information content and the occurrences of gene symbols and names vary between the different text sections (Shah et al., 2003): Abstracts contain the highest ratio of keywords to total words. Besides the abstract, the Introduction and Discussion are appropriate places to search for gene and protein names and interactions. The Methods section generally differs most from all other sections; it is best suited for finding technical data, measurements, and chemicals, and least suited for finding genes, proteins, and interactions between them. Information density is highest in abstracts, but information coverage is much greater in full texts than in abstracts, with the highest information coverage in the Results section, and 30–40% of the information mentioned in each section is unique to that section (Schuemie et al., 2004).
The difference between common English and biomedical language is also reflected in the performance of information extraction approaches: Recall and precision for identifying person, organization, and location names in news stories have been reported in the range of 93–95%, while the values for identifying biological names are in the 75–80% range (Hirschman et al., 2002a). Possible explanations for this divergence, besides the ones mentioned above, are (1) the small number of shared training and test sets for setting up systems and measuring progress in the biomedical domain, (2) experience, which is considerably more limited for text mining in the biomedical domain than in the news domain, and (3) the task definition. In contrast to news articles, annotation of biomedical text requires profound background knowledge and thus needs to be done by expert scientists, who often perceive the linguistic task as somewhat artificial. Biomedical text annotations are often debatable, and annotation guidelines are sometimes unclear, which results in lower inter-annotator agreement.
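For reference, the recall and precision figures quoted above are usually computed by comparing the names found by a system with a manually annotated gold standard. The sketch below assumes exact matching of (document, start offset, end offset) spans; the spans and the function name precision_recall are hypothetical, and the cited evaluations may have used different matching criteria.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall for exactly matching annotation spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: fraction of predicted names that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of gold-standard names that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotations as (document, start offset, end offset) spans.
gold_spans = {("doc1", 0, 4), ("doc1", 10, 15), ("doc2", 3, 8), ("doc2", 20, 26)}
predicted_spans = {("doc1", 0, 4), ("doc2", 3, 8), ("doc2", 30, 35)}

precision, recall = precision_recall(predicted_spans, gold_spans)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because the gold standard itself depends on the annotators, low inter-annotator agreement limits how meaningful such scores can be for biomedical names.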