Steps and Tools of Text Mining in Biomedical Field

(1)

RESEARCH ARTICLE

Steps and Tools of Text Mining in Biomedical Field

Stamatios Ramkumar, Rohit Marriwala

Research of Clinical and Metabolic Genetics, Medical Studies University of Victoria

Abstract: The steps of text mining in biomedical field and the methods used in its each step were described with stress laid on the tools used in each step of text mining in order to promote text mining in biomedical field.

Keywords: Biomedicine; Text mining; Tool; Named entity recognition; Information extraction; Knowledge discovery Introduction

Biomedical text mining has entered the practical application stage from concept introduction and Prospect prospect. More and more biomedical professionals begin to use text mining tools to solve practical biological and even clinical problems.

In short, Text Mining (TM) is automatic analysis and processing of text. A more popular definition is to use com-puters to automatically extract and correlate information from different text resources to discover new, previously un-known information and display its implied meaning. The facts and hypotheses formed in text mining should also be verified by experiments[1]. This paper introduces the main steps and corresponding tools of text mining, including in-formation retrieval, named entity recognition, inin-formation extraction, knowledge discovery and visual expression.

1. Information retrieval (InformationRetrieval, IR)

The first step of text mining is to retrieve text resources related to a particular topic. Generally, the method of que-rying keywords or subject words in bibliographic literature database is adopted. PubMed is the most commonly used literature database for text mining in biomedical field, because MED-LINE database is free, provides rich programming interface and is indexed with MeSH subject words. At present, the tools and software of text mining based on MED-LINE database emerge in endlessly, which has become an important resource in the field of information retrieval. The tool software of text mining also enhances and optimizes these database retrieval functions, such as mapping synonyms to unified concepts using controlled thesaurus, further analyzing and classifying the literature records in the retrieval results, and extracting sentences containing retrieval concepts. In addition to research papers, it should also be noted that patents, medical records, reports from drug administrations, epidemiological surveillance reports, health-related blogs, and text from websites can be used as samples for text mining[2,3].

2. Named entity recognition (Named entity recognition, NER)

The so-called named entity is the entity with specific meaning in the text, mainly including person name, place name, organization name, proper noun and so on. Named entities can also be defined as one or a set of keywords that explicitly state a thing or concept. Named entity recognition is to match the keywords found in the text with the specific concepts in the text. Named entity recognition not only completes the task of identifying these entities, but also unifies the entity names that represent the same biological concept. If the gene can be identified not only by the gene symbol in a text, but also by its synonyms or previous names, the same drug can also be identified by different common names,

This is an open-access article distributed under the terms of the Creative Commons Attribution Unported License

(http://creativecommons.org/licenses/by-nc/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

(2)

2 | Stamatios Ramkumar et al. Biomed Communication

trade names and synonyms in the text. Text mining tools or databases typically highlight identified terms and provide links to primitive concepts, such as the IHOP server that recognizes protein names and MESH words and highlights them in sentences extracted from text. Nowadays, the main methods of named entity recognition include three algo-rithms based on vocabulary, ontology and machine learning.

2.1 Recognition method based on word list

Most text mining systems are based on controlled thesaurus, which organizes keywords according to topic or clas-sification. For example, Whatizit prefabricates dictionaries using terms from genes, proteins, pathways, drugs, and dis-eases collected in advance from various databases to recognize named entities in any text. CoPub has developed a glos-sary of liver pathology terms to annotate the drugs involved in gene expression experiments in toxicological studies.

2.2 Ontology based recognition method

Ontology (ontology) can be understood as a more advanced controlled vocabulary. The biggest difference between ontology and thesaurus is that the definitions of concepts and key words describing these concepts are more formal, and the relationships and rules between concepts are specified in particular. Ontology can be used for text mining because it uses formal structures and classifies domain-specific information, such as biological pathways or diseases. Younesi and others have developed a biomarker ontology specifically for retrieving biomarker knowledge from the literature. The concepts in the ontology involve various categories of biomarker research, such as clinical management, diagnosis and prognosis, and statistics. They used this physical examination to identify biomarkers for non-small cell lung cancer and neurodegenerative diseases. Ontology can help expand and reconstruct keyword retrieval strategies, and significantly improve the image retrieval system by using the tree structure information of MeSH subject words. Many specialized ontologies are freely available, and BioPortal for Biomedical Ontologies[11] collects a large number of such ontologies for download and browse. Many text mining software install ontology in their own systems to build retrieval strategies, visual representation and classification results.

2.3 Machine learning based recognition method

Specific algorithms include hidden Markov models (HMM), maximum entropy Markov models (MEMM), condi-tional random fields (CRF) and support vector machines (SVM). Before starting the real named entity recognition, it is necessary to train rigorously in the training data set which can represent the structured annotations of the actual data set. The text mining system based on machine learning can recognize the chemical entity in the text[12], or combine the rule and dictionary algorithm to recognize the biological name in the text[13], and extract the biological name. Information on cancer staging in health records is used to assist clinical decision making[14].

3. Information Extraction (IE)

After information retrieval and named entity recognition, special algorithms are used to discover the ship between concepts in text. By linking concepts together, additional information is added to the concept, resulting in valuable knowledge that can be analyzed later. At present, the main methods of extracting knowledge from text include co-occurrence analysis and natural language processing (NLP).

3.1 co occurrence analysis method

The theoretical basis of co-occurrence analysis is that if two concepts often appear together in the same text, then there is a semantic correlation between the two concepts. For example, retinol-binding protein 4 (RBP4) and insulin resistance often co-occur in MEDLINE abstracts, indicating a functional link between genes and diseases. In fact, the key words in many texts are not functional. In order to correct this error, co-occurrence analysis uses the frequency of each of the two keywords to score to show the importance of the relationship between the keywords, so as to screen out the important functional links. Co-occurrence analysis is easy to implement in named entity recognition, but it can not point out the relationship between concepts. Therefore, compared with natural and language processing methods, the recall rate is very high, but the precision rate is very low.

(3)

Biomed Communication Volume 2 Issue 2 | 2018 | 3

3.2 Natural Language Processing method

Natural Language Processing uses computers to process natural language, that is, human language. Natural lan-guage processing (NLP) is based on prior knowledge of linguistic structure and the expertise of biological information expression in literature. Natural language method can discover the subject-predicate-object ternary structure in the text, such as gene A suppressor gene B, or gene C related to disease G, and so on. Therefore, compared with co-occurrence analysis, phrase-based natural language processing (NLP) often has higher accuracy, but it often extracts predefined conceptual relationships. The general Natural Language Processing method consumes computer resources more than co occurrence analysis. In addition, the specific relationship obtained through the training set is often limited by the size and quality of the training set data, which is beyond the capability of discovering more relationships. On the other hand, the open information extraction method does not depend on specific verbs and nouns, but focuses on how the relation-ship is defined in the text, so it can extract an unlimited number of relations.

In practical research, the mixed method is often used, that is, the co-occurrence analysis method is used to discover the relationship between concepts, and then the natural language processing method is used to discover the nature of the specific relationship. MEDLINE uses natural language methods to discover relationships between concepts. Users can input source or target concepts in the interface, and define interactions they want to understand, such as blockade, bind-ing, regulation, and so on. The system returns a description of the concepts in the MEDLINE abstract to the user, which can help the user obtain whether there is an interaction between the drug (source) and the drug target (target), such as "what binds to IP-10?", but can also use many more general terms, such as "what causes rheumatoid arthritis?[16].

4. Knowledge Discovery (KD)

If we follow the definition of text mining, strictly speaking, most text mining systems can only be regarded as in-formation extraction systems, that is, extracting the relationships between published concepts and sorting them. Never-theless, these systems systematically gather known facts together and, if used iteratively, derive new relationships be-tween keywords based on known facts from the literature. Literature-Based Knowledge Discovery (LBD) was pro-posed by D. Swanson in 1986. The main principle of LBD is called ABC pattern: if A and C are two keywords that nev-er appear in the same text at the same time, but they both refnev-er to the same set of keywords B, then Thnev-ere is an indirect connection between A and C. After Swanson first suggested in 1986 that fish oil might be beneficial to patients with Raynaud's disease, he carried out a series of studies to prove the validity of ABC rules, such as migraine and magnesi-um deficiency, arginine intake and sermagnesi-um levels of somatostatin[17–19]. The ABC model is relatively simple and easy to understand in concept, but the implementation of the requirements are more stringent, through strict statistics can be found to be effective.

Predict and estimate the false positive rate.

There are two kinds of tools for document knowledge discovery based on ABC model: open and closed. The earli-est tools followed Swanson's ideas, using closed mode. Users first identified keywords A and C, and explored whether they could find concepts B that linked A and C. The earliest development of Ar- rowsmith such as Swanson is in this way[20]. ChemoText linked chemicals to disease according to MEDLINE abstracts[21], and researchers used this large database to discover new migraine-related compounds.

In an open-mode tool, the user first determines a target concept A and a possible mediation concept B, and the sys-tem automatically returns all possible related concepts C. Bitola uses open and closed coexistence, while CoPub uses open discovery.

With the further development of ABC discovery patterns, a method of keyword concept profiling has emerged. For example, for a gene, all the keywords found in the literature that have some relationship with the gene (such as co-occurrence relationship) are listed as if the concept spectrum of the gene is drawn. According to the concept spec-trum of genes, the related concepts can be clustered into classes by using common clustering techniques. For example, if a drug has a similar conceptual spectrum with a disease, there may be a biological correlation between them; if the related concepts clustered together never appear in a text, they may represent new knowledge and should be given

(4)

spe-4 | Stamatios Ramkumar et al. Biomed Communication

cial attention. The Anni software successfully predicted cell types and signal transduction pathways using microarray data using the concept of conceptual spectrum[24–27].

5. Visualization (Visualization)

The final step of text mining is to visualize the extracted knowledge, to interpret the mining results quickly and correctly, and to guide researchers to construct new hypothesis to start the next experimental study.

Most retrieval systems can highlight concepts identified by named entities in text, and some also provide links be-tween concepts and source documents to provide more relevant information. Relationships bebe-tween concepts can be expressed in tables that filter and rank concepts by score and can be used to carry out in-depth experiments with genes, drugs, or diseases[28–30].

5.1 Document network (Literature Network)

Most text mining tools aim at displaying links between concepts found in text, so people first think of using docu-ment networks to represent mining results. String, CoPub, PubViz, and PubNet all generate networks, in which nodes are concepts and edges are built through literature. In general, the edges of a network can be linked to documents link-ing the two concepts, and the nodes can be linked to databases that provide the conceptual information. In particular, we will mention a semantic network technology that integrates other scientific data into the network to enrich the literature network[31]. OpenPhacts integrates clinical, biological and chemical data into pharmaceutical entities to help drug dis-covery[32]. Leach and others developed the Hanalyzer system, which presupposes a biological network of more than 8, 000 mouse genes, matches the findings in the literature with ontology terminology definitions and connects them to multiple gene expression databases[33]. Using this network, they analyzed transcriptome data from mouse maxillofacial development.

Document networks can draw links between two concepts in one space and visualize multiple relationships tween concepts, thus revealing indirect relationships between concepts, helping to understand new relationships be-tween genes and the relationship bebe-tween unknown genes and diseases, and querying clustering based on networks. Tools found these biologically relevant gene clusters in these literature networks.

5.2 Word cloud (Word Cloud)

Word clouds are mainly used to display words that appear frequently in text, and word sizes that appear frequently in word clouds are larger. This technology is especially useful for understanding what concepts will appear in text. For example, in MEDLINE, we retrieved all the abstracts of the three microorganisms, Streptococcus thermophilus, Chla-mydia trachomatis and Mycobacterium tuberculosis, removed the common words, retained the meaningful words and counted them, and used the word cloud to express the most common words[34]. The word clouds of 3 species of micro-biological literature show the micro-biological properties of these microorganisms and their applications. For example, the term "yoghurt", "Italian cheese", "exopolysaccharide production" in the word cloud of Streptococcus thermophilus ac-curately indicates the important role of Streptococcus thermophilus in dairy production, while the word cloud of Chla-mydia trachomatis indicates that ChlaChla-mydia trachomatis is associated with ChlaChla-mydia infection, which can lead to "in-fertility" and "salpingitis". It also suggests Chlamydia transmission routes, such as sex, contact and prevention of con-doms.

Word cloud technology does not offset because it does not use a preset set of concepts or ontologies, and because it comes from rough raw text that shows new ideas and understands specific concepts, it is possible to use the word cloud to get a preliminary overview of the domain before starting a more specialized study.

In summary, the main steps of text mining include information retrieval, named entity recognition, information ex-traction, knowledge discovery and visual expression. Researchers have developed corresponding database and software tools for each step, and provided free use for peers to carry out research. The problems that need to be noticed in the use include understanding the nature of the data and the Applicable Analysis methods, mastering the basic principles of specific algorithms, keeping in mind the boundaries of data mining and analysis, fully communicating with

(5)

profession-Biomed Communication Volume 2 Issue 2 | 2018 | 5

als, and carefully putting forward conclusions and hypotheses.

References

1. Hearst MA. Untangling Text Data Mining. Proceedings of the 37th Annual Meeting of the Association for Compu-tational Linguistics (ACL-99) pages3-10 Park, MD [C].

2. Masic Io Milinovic K. On-line Biomedical Databases-the Bast Source for Quick Search of the Scientific Infor-mation in the Bio- medicine [J]. Acta Informatica Medica 2012; 20(2): 72-84.

3. Falagas ME, Giannopoulou KP, Issaris EA, et al. World Databases of Summaries of Articles in the Biomedical fields [J]. Arch Intern Med 2007; 167(11): 1204-1206.

4. Frijters R, Verhoeven S, Alkema W, et al. Literature-based Com-pound Profiling: Application to toxicogenomics [J]. Pharmacog-enomics 2007; 8(11): 1521-1534.

5. Leuren WW, Toonen EJ, Verhoeven set al. Identification of New Biomarker Candidates for Glucocorticoid Induced Insulin Resist- ance Using Literature Mining [J] . BioData Min 2013; 6(1): 2.

6. Hoffmann R. Valencia A. A Gene Network for Navigating the Lit- erature [J] . Nat Genet 2004; 36(7): 664. Bioin-format- ics 2008; 24(2): 296-298.

7. Zhang Yanxia. Google: A new model of library network services [J]. Library Journal 2008; (8): 47-49.

8. Fleuren WW, Verhoeven S, Frijters R, et al. CoPub Update: Co-Pub 5. 0 a Text Mining System to Answer Biologi-cal Questions [J]. Nucleic Acids Res 2011; (39) 450-454.

9. Younesi ller M ü ller al. Mining Biomarker Informa- tion in Biomedical Literature J. BMC Med Inform Decis Mak 2012; 12(1): 148.

10. Crespo Azc á rate Mata V á zquez Jang Mana L ó pez M. Improving Image Retrieval Effectiveness Via Query Expansion Using MeSH hierarchical Structure [J]. J Am Med Inform Assoc 2013; 20(6): 1014-1020.

11. Whetzel PL. NCBO Technology: Powering Semantically Aware Applications [J]. J Biomed Semantics 2013; 4 Suppl 1: S8.

12. Eltyeb Cheminf, 2014, 6 N. Chemical Named Entities Recognition: A Re- view on Approaches and Applications [J] . J Cheminf 2014; 6(1): 17.

13. Naderi N K appler CJ, et al. OrganismTagger: Detec- tion, Normalization and Grounding of Organism Entities in Bio- medical Documents [J] . Bioinformatics 2011; 27(19): 2721-2729.

14. Martinez Pitson Gn MacKinlay al. Cross-hospital Porta- bility of Information Extraction of Cancer Staging In-formation [J]. Artif Intell Med 2014; 62(1): 11-21.

15. Alako BT, Veldhoven Baal van Baal set al. CoPub Mapper: Min- ing MEDLINE Based on Search Term Co-publication [J] . BMC Bioinformatics 2005; 6(1): 1-15.

16. Ohta Tu Matsuzaki Tu Okazaki al. Medie and Info- pubmed: 2010 Update [J] . BMC Bioinformatics 2010; 11(Suppl 5): 7.

17. Swanson DR. Fish oil, Raynaud's Syndrome, and Undiscovered Public Knowledge [J] . Perspecting Biol Med 1986; 30(1): 7-18.

18. Swanson DR. Migraine and Magnesium: Eleven Neglected Con- nections [J]. Perspect Biol Med 1988; 31(4): 526-557.

19. Swanson DR. Somatomedin C and Arginine: Implicit Connections between mutually isolated literatures [J]. Per-spect Biol Med 1990; 33(2): 157-1866.

20. Smalheiser NR, Swanson DR. Using ARROWSMITH: A Com-puter-assisted Approach to Formulating and As-sessing Scientific Hypotheses [J]. Comput Methods Programs Biomed 1998; 57(3): 149-153.

21. Frijters Vugt van Vugt Mimmer R et al. Literature Mining for the Discovery of Hidden Connections between Drugs, Genes and Dis- eases [J]. PLoS Comput Biol 2010; 6(9): 655-664.

22. Baker NC, Hemminger BM. Mining Connections between Chemi- cals, Proteins, and Diseases Extracted from Medline Annotations [J]. J Biomed Inform 2010; 43(4): 510-519.

23. Hristovski Dendare Jn. Peterlin Bn Dzeroski S. Supporting Discov- ery in Medicine by Association Rule Mining in MEDLINE and UMLS [J] . Medinfon 2001; 10(2): 1344-1348.

24. Patra Rao Saha SK. A Kernel-based Approach for Biomedical Named Entityrecognition [J]. ScientificWorldJour-nal 2013; 2013: 950796.

25. Hettne KM, Boorsma A, van Dartel DA, et al. Next- generation text-mining Mediated Generation of Chemical Response-specific gene Sets for Interpretation of Gene Expression Data [J]. BMC Med Genomics 2013; 6(1): 2. 26. Jelier RJelier Schuemie MJ, Roes PJ, et al. Literature-basedconcept Profiles for Gene Annotation: The Issue of

Weighting [J]. Int J Med Inform 2008; 77(5): 354-362.

27. Jelier MJ, Veldhoven al. Anni 2. 0: a Multi- purpose Text-mining Tool for the Life Sciences [J] . Genome Bi- ol 2008; 9(6): R96.

28. Fleuren WW, Toonen EJ, Verhoeven set al. Identification of New Biomarker Candidates for glucocorticoid In-duced Insulin Resistance Using Literature Mining [J] . BioData Min 2013; 6(1): 2.

29. Toonen EJ, Fleuren WW, N ssander al. Prednisolone-induced Changes in Gene-expression Profiles in Healthyvolunteers [J]. Pharmacogenomics 2011; 12(7): 985-998.

(6)

Adi-6 | Stamatios Ramkumar et al. Biomed Communication

pocytes [J]. Arch Physiol Biochem 2013; 119(2): 52-64.

31. Harrow / Filsell WU Wooll al. Towards Virtual Knowl- edge Broker Services for semantic Integration of Life Science Lit- erature and Data Sources [J] . Drug DiscovToday 2013; 18(9-10): 42 8-434.

32. Williams AJ, Harland Ln Groth Pao et al. Open PHACTS: Se- manticinteroperability for Drug Discovery [J]. Drug Discov To- day 2012; 17(21-22): 1188-1198.

33. Leach SM, Tipney al. Biomedical Discovery Ac- celeration, with Applications to Craniofacial Development [J]. PLoS Comput Biol 2009; 5(3): el000215

34. Fleuren W S M: Text Mining and Information Extraction for the Life Sciences: An Enhanced Science Approach [J] . Bioin format- icsn cmls cbmicd and physical Biology 2013; 221.

35. Riker AI, Enkemann SA, Fodstad al. The Gene Expression Profiles of Primary and Metastatic Melanoma Yields a Transition Point of Tumor Progression and Metastasis [J]. BMC Med Ge- nomics 2008; 1( 1): 13. 1