Text mining data curation - Context-specific subcellular localization prediction: Leveraging pr

Text mining, also known as text data mining or knowledge discovery in text, is the process of extracting previously unknown, understandable, potentially useful and practical patterns or knowledge from large collections of unstructured text data. It combines techniques from data mining, machine learning, natural language processing, information retrieval, and knowledge management [66, 67]. Numerous text mining techniques and tools have been applied in the life sciences, e.g. mapping of genes to diseases and drug discovery [68, 69]; in social media, e.g. recommendation systems on Facebook and Twitter [70]; and in business intelligence, e.g. analysis of customer satisfaction [71].

2.8 Text mining data curation

A generic overview of the text mining process, as summarized by Rebholz-Schuhmann et al. [68], is illustrated in Figure 2.7. Larger text-analytical approaches typically include:

1. Information retrieval. The tasks include data selection, document retrieval, classification, and feature extraction. They generally convert the documents into intermediate forms that are suitable for different mining purposes.

2. Information extraction. These tasks are the central part of a text mining system. Information extraction comprises the identification of entities, such as genes or diseases, as well as the identification of complex relationships between those entities, including protein-protein interactions and gene-disease associations. It relies on algorithms such as clustering, association rule discovery, trend analysis, pattern discovery and other knowledge discovery algorithms.

3. Post-processing. These tasks manipulate the data or knowledge coming from the information extraction step, including the evaluation, selection, interpretation, and visualization of knowledge. The extracted scientific facts can then be used either to populate databases directly or to assist the work of curation teams. The text mining results are also used to suggest hypotheses, which can in turn shape or plan experiments that validate or disprove them.
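The three stages above can be illustrated with a minimal sketch. It is not the pipeline of Rebholz-Schuhmann et al. [68]: the toy corpus, the protein and location dictionaries, and the function names `retrieve`, `extract`, and `post_process` are all hypothetical, and sentence-level co-occurrence stands in for the more elaborate extraction algorithms listed in step 2.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical toy corpus and dictionaries (assumptions, not from the source).
DOCUMENTS = [
    "TP53 accumulates in the nucleus after DNA damage.",
    "Stock prices rose sharply on Monday.",
    "AKT1 is recruited to the plasma membrane, while TP53 stays in the nucleus.",
]
PROTEINS = {"TP53", "AKT1"}
LOCATIONS = {"nucleus", "plasma membrane", "cytoplasm"}


def retrieve(documents, vocabulary):
    """Information retrieval: keep documents that mention any term of interest."""
    terms = [t.lower() for t in vocabulary]
    return [d for d in documents if any(t in d.lower() for t in terms)]


def find_terms(sentence, dictionary):
    """Dictionary-based entity recognition (whole-word, case-insensitive)."""
    found = []
    for term in sorted(dictionary):
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            found.append(term)
    return found


def extract(documents):
    """Information extraction: protein/location pairs co-occurring in a sentence."""
    pairs = []
    for doc in documents:
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            proteins = find_terms(sentence, PROTEINS)
            locations = find_terms(sentence, LOCATIONS)
            pairs.extend(product(proteins, locations))
    return pairs


def post_process(pairs, min_support=1):
    """Post-processing: count evidence and keep pairs seen at least min_support times."""
    counts = Counter(pairs)
    return {pair: n for pair, n in counts.items() if n >= min_support}


relevant = retrieve(DOCUMENTS, PROTEINS | LOCATIONS)
candidates = post_process(extract(relevant))
for (protein, location), n in sorted(candidates.items()):
    print(protein, location, n)
```

Plain co-occurrence over-generates: the third sentence also yields the pair (AKT1, nucleus), which the text does not assert. Raising `min_support` to 2 keeps only (TP53, nucleus) in this toy corpus, which hints at why a post-processing and curation step follows extraction.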

Fig. 2.6 Co-expression network inference pipeline. Figure reprinted from Serin et al. [55].


Fig. 2.7 Pipeline of text mining solution. Figure reprinted from Rebholz-Schuhmann et al. [68].

Chapter 3 Overview of protein subcellular localization prediction

Understanding the SCLs of proteins has always been essential for discovering novel protein functions and the primary mechanisms of the cell. We are interested in the following questions. Where is a protein localized in the cell? Is an MLP present in several compartments simultaneously, or exclusively in one at a time? How does a protein move from one SCC to another? Under which biological context is the translocation of a protein induced? How does the protein interact with other molecules in the cell? How can SCL data be collected and accessed, how can they be analyzed, and how can they be interpreted along with protein functions? In this chapter, we discuss the current state of protein SCL analysis and prediction.

3.1 Access to the protein SCL data

3.1.1 Experimental data

Conventional wet-lab experiments are used to determine the SCLs of proteins and also serve as the gold standard for validating SCLs. Several wet-lab approaches for the systematic analysis of protein SCLs have been developed.

• Initial research was done with specific staining and light microscopy. Closer scrutiny of micrometer- and nanometer-sized subcellular structures was later enabled by the rise of electron microscopy, which illuminated the complexity of organelles and their various positions within the cell [72].

• Quantitative mass-spectrometric readouts allow the identification of proteins with similar distribution profiles across fractionation gradients [73–75] or of enzyme-mediated proximity-labeled proteins in cells [75–77]. These techniques make it possible to understand what each component is doing at the molecular level.

• Imaging-based approaches enable the exploration of the subcellular distribution of proteins in situ in a single cell and have the advantage of also effectively identifying single-cell variability and multi-organelle localization. Imaging-based approaches can be performed using tagged proteins [78] or affinity reagents [28]. For example, an immunofluorescence (IF) based approach can be combined with confocal microscopy to investigate the spatial distribution of each protein at high resolution.

Finally, genetics, in all its forms, has allowed us to dissect the structure and function of subcellular compartments by selective disruption of individual cell components. These experimental data can be retrieved from databases such as the Human Protein Atlas (HPA) [28], LocDB [79] and ENCODE [80].
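The idea behind the mass-spectrometry bullet above, grouping proteins by the similarity of their distribution profiles across fractionation gradients, can be sketched as follows. This is only an illustration under stated assumptions: the abundance values, the protein names, and the choice of Pearson correlation against marker profiles are hypothetical and are not taken from the cited studies [73–75].

```python
import math

# Hypothetical relative abundances across five gradient fractions (illustrative only).
MARKERS = {
    "mitochondrion": [0.05, 0.10, 0.50, 0.30, 0.05],
    "cytosol": [0.60, 0.25, 0.10, 0.03, 0.02],
}
QUERY_PROFILES = {
    "protein_A": [0.04, 0.12, 0.48, 0.31, 0.05],
    "protein_B": [0.55, 0.30, 0.10, 0.03, 0.02],
}


def pearson(x, y):
    """Pearson correlation between two equally long, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def assign_compartment(profile, markers):
    """Return the marker compartment whose profile correlates best with the query."""
    scores = {name: pearson(profile, ref) for name, ref in markers.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


assignments = {
    name: assign_compartment(profile, MARKERS)
    for name, profile in QUERY_PROFILES.items()
}
for name, (compartment, r) in assignments.items():
    print(f"{name}: {compartment} (r = {r:.2f})")
```

A single best-matching compartment is a simplification. A protein that resides in several compartments would show a mixed profile, so a realistic analysis would need to report more than one assignment.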

3.1.2 Knowledge-bases of protein SCLs

However, not all experimental data are collected and accessible in databases. A vast amount of data is scattered across the scientific literature, where it was reported for various purposes. Such data need to be integrated, interpreted, standardized and enriched from the literature and numerous other resources into a knowledge base. The UniProt Knowledgebase (UniProtKB) is a leading resource for comprehensive curation of the experimental data in the literature, and does this in a mutually beneficial collaboration with other specialized resources. Literature-based expert curation in UniProtKB provides high-quality information for experimentally characterized proteins in a standardized and structured way, using widely accepted controlled vocabularies and ontologies. Other knowledge bases of protein SCLs include the Reactome Knowledgebase [40], the Human Protein Reference Database [81] and Gene Ontology annotations [82].

3.1.3 Limitations

Experimentally determining the SCLs of a protein can be a laborious, expensive and time-consuming task. Manually annotating a protein, and in particular identifying the massive amount of SCL data from heterogeneous sources, is likewise a challenging and low-throughput task.

In document Context-specific subcellular localization prediction: Leveraging protein interaction networks and scientific texts (Page 34-41)