Academic year: 2022

A Mutual Conditional Probability based document ranking model using Map-Reduce Framework

1 K.S.S. Joseph Sastry, 2 Dr. M. Sree Devi

Dept. of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Guntur, A.P, India.

Abstract

As the volume of biomedical text grows exponentially, automatic indexing becomes increasingly important. However, existing approaches do not distinguish central (or core) concepts from concepts mentioned only in passing. The sheer size of the biomedical literature creates a new challenge for researchers trying to find and rank gene-based disease documents in historical databases. Moreover, because multilevel gene documents suffer from noisy and duplicate features, it is difficult to rank and summarize the relevant phrases within multiple document sets. Most text ranking and prediction algorithms are sensitive to noise, outliers, high dimensionality and uncertainty. In this proposed approach, a novel biomedical document ranking and prediction model is implemented to find interesting patterns in high-dimensional biomedical datasets. Experimental results show that the proposed probabilistic ranking model achieves higher accuracy than traditional text ranking and prediction models.

1. Introduction

Machine learning based text classification allows a classifier to learn a set of rules, or a decision criterion, from a set of labelled data annotated by an expert. This approach scales better and reduces the cost of classifying documents compared to a system that relies on manual input alone. Most research in machine learning based document classification has focused on binary classifiers, that is, building a classifier from a set of positive and negative examples to determine a document's membership in a class. When a corpus contains documents that belong to multiple classes, the usual approach has been to build a separate binary classifier for each class and then aggregate the results of the individual binary classifiers. Classification can take many forms, from fully automated systems with limited human intervention [1] to semi-automatic systems that employ a hybrid human-machine approach.
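The one-binary-classifier-per-class aggregation described above can be sketched as follows. The keyword-counting "classifiers" and the class names here are toy stand-ins for real trained models, used only to show how per-class binary decisions are combined:

```python
# One-vs-rest aggregation sketch: a separate binary scorer per class,
# with the highest-scoring class winning. The keyword lists and the
# score() function are illustrative stand-ins for trained classifiers.

def make_binary_scorer(keywords):
    """Return a toy binary 'classifier' that scores a document by
    how many class-specific keywords it contains."""
    def score(doc_tokens):
        return sum(1 for t in doc_tokens if t in keywords)
    return score

# One binary scorer per class (hypothetical classes/keywords).
scorers = {
    "genetics": make_binary_scorer({"gene", "mutation", "allele"}),
    "oncology": make_binary_scorer({"tumor", "cancer", "carcinoma"}),
}

def classify(doc):
    tokens = doc.lower().split()
    # Aggregate the per-class binary decisions: pick the best score.
    return max(scorers, key=lambda c: scorers[c](tokens))

print(classify("a somatic gene mutation study"))  # -> genetics
```

In a real system each scorer would be a trained probabilistic classifier; the aggregation step (take the best-scoring class) is the part the paragraph above describes.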

Different evolutionary approaches such as genetic algorithms, rough sets and SVMs are used to classify document sets from a large corpus. Genetic algorithms are applied to find complex patterns and classification rules in huge datasets [2][3]. In some hybrid approaches, genetic algorithms are integrated with decision tree schemes to generate an optimized decision tree.

Classification models such as Naïve Bayes, combined with ensemble decision tree models, namely CART, C4.5, Bayesian trees and random forests, are used for document classification and feature extraction [4][5]. These studies concluded that no single traditional model can handle uncertainty in document prediction over a large attribute set.

Big Data involves large volumes of data that must be processed quickly and continuously. Big Data is a collection of partially structured and unstructured data, where 92% of its content [6] consists of medical, spatial, web, biomedical and traditional structured data. Conventional data warehouses assume that data are precise, clean, consistent and complete. In big data, however, input data contain uncertain, erroneous, imprecise and incomplete elements, so data are typically processed on virtual machines, independent of the hardware and software specifications of the clients' systems [7].

These virtual machines are simple to configure and easier to integrate with different cloud storage systems and can also be merged with other tools (e.g.: Hadoop) for additional computation. Some commonly used cloud computation systems are Google Compute Engine, Windows Azure, Amazon EC2 and Rackspace.


An advanced three-layer biomedical framework has been implemented to cluster sets of documents [7]. This framework is based on a multi-layer neural structure of neighbourhood peers.

Many overlay peers, each acting as the representative object of its lower neighbourhood, are clustered to form higher-level clusters. The basic limitation of this model is selecting an optimal threshold for a dynamically sized overlay network; it is also very hard to balance the structure size against the peer documents. A parallel model has been implemented to cluster multiple document collections [8]. The key issue is finding document clusters automatically in a large text corpus, since comparing documents in a high-dimensional vector space is very costly. The algorithm tries to minimize distance computations against a small set of training documents, called pivots.

The authors used a parallel algorithm to optimize a complex data structure that affords efficient indexing, searching and sorting. Traditional probability estimation techniques such as Naïve Bayes, Markov models and Bayesian models [9] are used to find the highest probability estimation variance among genes and their related disease sets in biomedical document collections. Classification is the process of finding and extracting the main contextual meaning of gene or disease patterns from distributed document sources, and it has become an integral part of day-to-day activities in domains such as cloud services, forums, social networking and medical repositories.

Haiyan et al. [10] proposed a pattern mining and classification model for disease prediction.

They used a microarray dataset to rank pathways and to find disease-related patterns in a limited data size. They used a random forest classification model to filter and classify the correlated disease patterns in microarray datasets. This model requires high computational storage and memory for large datasets.

Sezin et al. [11] proposed a novel gene-based disease pattern model using metagraph construction. In this model, protein-protein interactions and biological gene keywords are extracted to find the disease patterns using a metagraphical model. The model is limited to small datasets with a maximum of 10k instances.

Hong-Dong et al. [12] proposed a new gene selection and disease classification model using a phase diagram approach. They used different microarray gene datasets for disease prediction and classification. The PHADIA method is efficient for microarray datasets with a limited instance space.

Andrea Mesa et al. [13] developed a model to detect gene pattern sequences in biomedical repositories. They used the GenBank dataset as training data to find the correlation between two sequences. The proposed hidden Markov model also generates two or three states to represent the gene sequences using the GenBank database. The model handles only two gene sequences; as the size of each sequence or the number of sequences increases, it becomes inefficient for ranking.

2. Proposed Model

1) Data Preparation Phase

In the data preparation phase, user-specific PubMed documents are extracted along with the disease types.

In this phase, 1 million gene terms are extracted to find the relevant documents in the PubMed repository. NLP and text mining approaches are applied to gene-disease clinical documents to find the relevant contextual features. The pseudocode for gene-based PubMed data preparation in the Hadoop Mapper phase is given below:
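The pre-processing used in this phase (tokenization, stopword removal, stemming) can be sketched as below. The stopword list and the crude suffix-stripping stemmer are illustrative stand-ins, not the exact resources used in the paper:

```python
import re

# Minimal sketch of the data-preparation step: tokenize an abstract,
# remove stopwords, and apply a crude suffix-stripping stemmer.

STOPWORDS = {"the", "of", "and", "in", "a", "to", "is", "are", "for"}

def tokenize(text):
    # Lowercase and keep alphanumeric runs as tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    # Naive suffix stripping (a stand-in for a real stemmer).
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(abstract):
    return [stem(t) for t in tokenize(abstract) if t not in STOPWORDS]

print(preprocess("The BRCA1 gene is mutated in breast tumors"))
# -> ['brca1', 'gene', 'mutat', 'breast', 'tumor']
```

A production pipeline would substitute a real stemmer and a domain stopword list, but the shape of the step is the same.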

Figure 1: Proposed Model (pipeline: biomedical database → gene field extraction → gene synonym and biokey extraction → Medline abstract extraction → gene-based document extraction → document pre-processing → gene-based document ranking)

Input: User query, Disease list, Gene DB
Output: Gene-Disease based document sets.
Procedure: Map(TextList DiseaseList, Text Gene-DB, Text Query)
  Connection con = PubMed(Url);
  If (con != null) then
    Load GeneDB = getGeneNames();
    Load DiseaseList = getDiseaseList();
  Else
    Check connection;
  End if
  For each user-defined disease category in DC[] do
    If (DC[i]) then
      getDocSet[] = PubMed(DC[i]);
      For (i = 0; i < getDocSet.length; i++) do
        Apply Stemming(getDocSet);
        Apply Stopword_removal(getDocSet);
        Tokens[] = Tokenization(getDocSet);
        For (j = 0; j < Tokens.length; j++) do
          For (k = 0; k < geneDB.length; k++) do
            If (Tokens[j] == geneDB[k]) then
              GeneDlist[j] = Tokens[j];
            End if
          Done
        Done
        Add GeneDocs(GeneDlist);
      Done
    End if
  Done
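Outside of Hadoop, the Map step can be sketched in plain Python as a generator that emits (gene, doc_id) pairs for every token found in the gene dictionary. GENE_DB and the sample documents here are hypothetical stand-ins for the real gene database and PubMed abstracts:

```python
# Hadoop-streaming-style mapper sketch: for each document, emit
# (gene, doc_id) for every token present in the gene dictionary.

GENE_DB = {"tp53", "brca1", "kras"}  # hypothetical gene dictionary

def mapper(doc_id, text):
    for token in text.lower().split():
        if token in GENE_DB:
            yield (token, doc_id)

docs = {
    "d1": "TP53 mutations in lung cancer",
    "d2": "BRCA1 and TP53 interaction study",
}

pairs = sorted(p for doc_id, text in docs.items()
               for p in mapper(doc_id, text))
print(pairs)  # -> [('brca1', 'd2'), ('tp53', 'd1'), ('tp53', 'd2')]
```

In a real Hadoop job the framework would shuffle these pairs so that a reducer receives all document ids per gene; the sort here stands in for that shuffle.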

Algorithm 1: Somatic Gene Data Extraction
Input: Gene dataset CD, synonym training dataset SD.
Step 1: Read COSMIC database CD.
Step 2: Somatic genes SG[] = null;
  For each record r in CD do
    SG[i] = FindGeneField(CD[j]);  // j is the gene field index in the gene dataset
  Done
Step 3: Filter the somatic genes based on cancer type.
  For each gene g_i in SG[] do
    If (g_i = True) then G[i] = g_i End if
  Done
Step 4: Find the similarity between each somatic gene and the synonym geneset SD.
  For each gene g_m in G[] do
    Somatic synonym geneset SSG[] = Sim(g_m, SD)

    Sim(g_m, SD_j) = Max{ Prob(g_m | SD_j) / (|SD| · Prob(g_m)) },  m = 1, 2, ..., |G|,  j = 1, 2, ..., |SD|

    Add SSGD[m] = {Sim(g_m, SD_j), g_m, SD_j}
  Done
Step 5: Assign the somatic synonym genes as bio keyterms.
  For each biomedical document from Medline do
    For each sg_i in SSGD[] do
      If (Sim(g_m, SD_j) ≥ T) then
        Bio key terms BKT[i] = {g_m, SD_j}
      End if
    End for
  End for

Algorithm 2: Gene-Disease Document Extraction in the Medline Repository
Input: Biokey terms BKT[] as somatic genes g_m and synonym genes SD.
Procedure:
Step 1: Connect to the Medline webservice using the biokey terms as arguments.
Step 2: Connection c = Medline();
  If (c != null) then
    For each biokey term k_i in BKT[] do
      Extract Medline documents using the biokey BKT[i]
      Biomedical documents BD[] = C(k_i)
    Done
  End if
Step 3: Pre-process each document in BD[].
  For each document d_i in BD[] do
    Dt_i = Tokenize(d_i)
    Sdt_i = Stemming(Dt_i)
    PBD[] = Stopword_removal(Sdt_i)
  Done
Step 4: Compute somatic gene term based document feature ranking on the pre-processed biomedical documents.
  For each pre-processed biomedical document PBD[] do
    FBD[i] = F_rank(PBD[i], BKT[j]);
  Done

where:
TP(i, j) is the number of documents in PBD[i] that contain the biokey term BKT[j];
FP(i, j) is the number of documents not in PBD[i] that contain BKT[j];
FN(i, j) is the number of documents in PBD[i] that do not contain BKT[j];
TN(i, j) is the number of documents not in PBD[i] that do not contain BKT[j].

rank1(fg_{i,j}) = fg_{i,j} · TP(i, j) / (TP(i, j) + FN(i, j)),  where fg_{i,j} = 1 − e^(−min(TP(i, j), FN(i, j)) / 2)

rank2(fg_{i,j}) = fg_{i,j} · FP(i, j) / (TN(i, j) + FP(i, j)),  where fg_{i,j} = 1 − e^(−min(FP(i, j), TN(i, j)) / 2)

F_rank(PBD[i], BKT[j]) = log( (rank1 / max{rank1, rank2}) · Prob(BKT[j] | PBD[i]) + (rank2 / max{rank1, rank2}) · Prob(BKT[j] | PBD[i]) )
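The contingency counts TP, FP, FN, TN and an exponential damping factor of the form 1 − e^(−min(·,·)/2) can be sketched as below. The exact way the paper combines these into F_rank is partly garbled in the source, so treat this as an illustrative sketch of the building blocks, with made-up documents:

```python
import math

# Sketch of the per-(document-set, term) contingency counts and the
# damping factor used in the feature-ranking step. Documents are
# modelled as sets of tokens; the data is hypothetical.

def contingency(docs_in_class, docs_out_class, term):
    tp = sum(term in d for d in docs_in_class)    # in class, has term
    fn = len(docs_in_class) - tp                  # in class, lacks term
    fp = sum(term in d for d in docs_out_class)   # out of class, has term
    tn = len(docs_out_class) - fp                 # out of class, lacks term
    return tp, fp, fn, tn

def damp(a, b):
    # fg = 1 - e^(-min(a, b)/2): shrinks the rank when either count is tiny.
    return 1.0 - math.exp(-min(a, b) / 2.0)

docs_in = [{"tp53", "cancer"}, {"tp53", "lung"}, {"brca1"}]
docs_out = [{"diabetes"}, {"tp53", "insulin"}]

tp, fp, fn, tn = contingency(docs_in, docs_out, "tp53")
print(tp, fp, fn, tn)  # -> 2 1 1 1
```

From these counts, rank1 would weight TP against TP + FN and rank2 would weight FP against TN + FP, each scaled by the damping factor.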

3. Experimental Results

All experiments were carried out on the PubMed repository and gene databases. A list of human diseases and their associated gene documents was extracted from PubMed using the proposed feature selection and ranking model. Top-ranked biomedical human disease documents are extracted using the user-selected disease and gene ranking model. To find the specific disease associated with gene synonym documents, we implemented a web-based ranking framework for the human gene-to-disease ranking model. The list of biomedical genes associated with a specific disease and their ranking scores is shown in Table 2. A total of 1 million disease documents from the PubMed repository were processed against the given disease list and its associated biomedical genes. Experiments were performed on a Hadoop framework in a cloud environment.

Table 3.2: Runtime (secs) comparison of Medline preprocessing models on Medline datasets

Medline size   NER     SVDD    NMF     BioNER   Proposed Model
#25            198.3   189.5   218.9   239.9    159.1
#50            267.7   243.2   324.8   318.7    178.3
#75            406.7   389.2   419.1   448.9    278.2
#100           578.2   589.7   572.1   593.6    304.7
#125           723.3   683.5   729.5   679.3    387.1

Table 3.2 shows the runtime comparison of the proposed model against the existing models in seconds. From the table, it is clear that the proposed model has lower time complexity than the traditional models.

Table 3.3: Average ranking score of the proposed ranking measure versus traditional ranking measures in the Hadoop framework

Ranking Measure      Diabetes   Leukemia   Cancer   Brain Disease
Chi-square Rank      0.76       0.69       0.83     0.719
Mutual Information   0.83       0.879      0.9      0.858
Probabilistic Rank   0.792      0.87       0.848    0.869
Proposed Model       0.92       0.939      0.919    0.957

Table 3.3 shows the average ranking score of the proposed ranking measure against the traditional ranking approaches using MapReduce programming. From the table, it is observed that the proposed ranking algorithm achieves a higher average ranking score than the existing approaches.

Table 3.4: Proposed gene-disease based ranking scores on different disease types

Figure 3.5: Average ranking score for different disease types (Diabetes, Leukemia, Cancer, Brain Neoplasms), comparing Chi-square Rank [10], Mutual Information [11], Probabilistic Rank [12] and the Proposed Model.

Figure 3.5 shows the average ranking score of the proposed ranking measure against the existing ranking models in the Hadoop framework. From the figure, it is observed that the proposed ranking model achieves a higher average ranking score than the existing models.

4. Conclusion

The size of the biomedical literature creates a new challenge for researchers trying to find and rank gene-based disease documents in historical databases. Since multilevel gene documents suffer from noisy and duplicate features, it is difficult to rank and summarize the relevant phrases within multiple document sets. Most text ranking and prediction algorithms are sensitive to noise, outliers, high dimensionality and uncertainty. In this proposed approach, a novel biomedical document ranking and prediction model was implemented to find interesting patterns in high-dimensional biomedical datasets. Experimental results showed that the proposed probabilistic ranking model achieves higher accuracy than traditional text ranking and prediction models.

References

[1] Wang, Shiping, and Han Wang. "Unsupervised feature selection via low-rank approximation and structure learning." Knowledge-Based Systems 124 (2017): 70-79.

[2] Sun, Huaining, and Xuegang Hu. "Attribute selection for decision tree learning with class constraint." Chemometrics and Intelligent Laboratory Systems 163 (2017): 16-23.

[3] Deniz, Ayça, et al. "Robust multiobjective evolutionary feature subset selection algorithm for binary classification using machine learning techniques." Neurocomputing 241 (2017): 128-146.

[4] A. Weinberg and M. Last, "Selecting a representative decision tree from an ensemble of decision-tree models for fast big data classification," Journal of Big Data, vol. 6, no. 1, 2019.

[5] Tsai, Chih-Fong, Wei-Chao Lin, and Shih-Wen Ke. "Big data mining with parallel computing: A comparison of distributed and MapReduce methodologies." Journal of Systems and Software 122 (2016): 83-92.

[6] S. Jayanthi and S. Prema, "Web Document Clustering and Visualization Results of Semantic Web Search Engine Using V-Ranking," International Journal of Computer Theory and Engineering, pp. 463-467, 2011.

[7] Yen, Show-Jane, and Yue-Shi Lee. "Cluster-based under-sampling approaches for imbalanced data distributions." Expert Systems with Applications 36.3 (2009): 5718-5727.

[8] Motiur Rehman, Hidayat Ullah Khan, and Shaik Sajeed, "Bunching and Parallel Empowering Techniques for Hadoop Document Framework," International Journal of Engineering Research and, vol. 6, no. 02, 2017.

[9] M. Satish, "Document Clustering with Map Reduce using Hadoop Framework," International Journal on Recent and Innovation Trends in Computing and Communication, vol. 3, no. 1, pp. 409-413, 2015.

[10] Khan, Aurangzeb, et al. "A review of machine learning algorithms for text-documents classification." Journal of Advances in Information Technology 1.1 (2010): 4-20.

[11] C. Li and S. Park, "An efficient document classification model using an improved back propagation neural network and singular value decomposition," Expert Systems with Applications, vol. 36, no. 2, pp. 3208-3215, 2009.

[12] J. Zheng, Y. Guo, C. Feng, and H. Chen, "A Hierarchical Neural-Network-Based Document Representation Approach for Text Classification," Mathematical Problems in Engineering, vol. 2018, pp. 1-10, 2018.

[13] J. Zhong and X. Yi, "Chinese electronic health record analysis for recommending medical treatment solutions with NLP and unsupervised learning," MATEC Web of Conferences, vol. 189, p. 10007, 2018.

[14] A. Sarker, D. Molla, and C. Paris, "Extractive summarisation of medical documents," Australasian Medical Journal, vol. 05, no. 09, 2012.
