Identifying Disease-Treatment Relations Using Machine Learning Approach

Procedia Computer Science 87 (2016) 306–315

Peer-review under responsibility of the Organizing Committee of ICRTCSE 2016 doi: 10.1016/j.procs.2016.05.166

2016 International Conference on Computational Science

#1 Keerrthega M.C.*, #2 Ms. D. Thenmozhi

#1 Student, SSN College of Engineering, Chennai-603110

#2 Assistant Professor, SSN College of Engineering, Chennai-603110

Abstract

Identifying disease-treatment relations makes it possible to determine what disease a person suffers from and what treatment is appropriate for that person. The semantic relation tags Cure, Prevent and Sideeffects help to establish the relationship between a disease and a treatment. Methodologies such as co-occurrence analysis, rule-based methods and statistical methods have been used for disease-treatment relation identification. Machine learning, however, is widely used in applications such as protein-protein interaction, extraction of medical knowledge and health care. We propose a machine learning approach based on SMO classification, which uses features drawn from medical papers and medical abstracts. Our approach identifies the relation classes disease-treat, cure, prevent and sideeffects. Performance is measured by accuracy, precision, recall and F-measure.

Keywords: Natural Language Processing; Machine Learning; SMO Classification

1. Introduction

Disease-treatment relation identification determines what type of disease-treatment relation exists in a sentence. The identified relation is one of three semantic relations, namely cure, prevent and sideeffects. The public can use it to learn which of these relations holds between a disease and a treatment. It is useful to health care providers, private clinics, hospitals, doctors and common people.

2. Literature Survey

A. Semantic Relations in Bioscience text

for information extraction. Medical abstracts are extracted through text classification. Semantic lexicons of words labelled with semantic classes allow associations to be drawn between words, which helps in extracting the sentences related to the query. The Naive Bayes (NB) algorithm is used to extract semantic relations such as Gene-Protein from Medline abstracts.

B. Learning to extract relations from Medline

In this paper [5], individual sentences are treated as features and processed by the Naive Bayes classifier. Each feature is considered part of the positive training set. Words are extracted from Medline abstracts using Naive Bayes and the CNB algorithm (Complement Naive Bayes classification). The approach uses a bag of words during classification but no natural language processing, and as a result its performance degrades.

C. Extraction of Disease-Treatment relations from Biomedical Sentences

In this paper, the dataset is annotated with 8 semantic relations between diseases and treatments, and Hidden Markov models and maximum entropy models are used to perform both entity recognition and relation identification. The representations used are parts of speech (POS), phrases and terms from MeSH (Medical Subject Headings).

D. Biomedical Language Processing: What’s beyond PubMed

This paper [12] applies natural language processing to biomedical text. The system takes the name of a disease, parses the user's statement using natural language processing, and returns the solution stored in a database for that disease; it does not, however, diagnose the disease.

E. Hybrid Machine Learning Implementation for classifying Disease-Treatment relations in Short texts

In this paper, selection techniques are used to identify the most suitable words as features, through sentence selection and relation identification. Relation extraction in the medical literature focuses on biomedical tasks such as subcellular location, gene-disorder association, and diseases and drugs. The data sets used in these biomedical tasks consist of short texts.

3. Proposed System

The proposed system identifies disease-treatment relations and classifies which type of relation, namely cure, prevent or sideeffects, exists in a sentence from Biotextc abstracts, using a supervised approach. The architecture of the proposed system:

Informative: information about the disease or treatment.

Non-informative: no information about the disease or treatment.

Semantic relations: Cure, Prevent, SideEffects.

The modules of the proposed system are described below:

(i) Sentence Selection

(ii) Feature extraction
    a) BOW Extraction
    b) Syntactic Representation
    c) Metamap Representation

(iii) Classification

3.1 Sentence Selection

Sentence selection identifies the informative sentences in Medline abstracts. Informative sentences are recognised by the tags DISTREAT, CURE, PREVENT and SIDEEFFECTS: if a sentence is marked up with these tags, it is considered informative. Example:

Cure:

<DIS> Obesity <\DIS> is a clinical problem and use of <TREAT> dexfenfluramine hydrochloride <\TREAT> for weight reduction.

Prevent:

Immunogenicity of <DISPREV> hepatitis B <\DISPREV> <TREATPREV> vaccine <\TREATPREV> in term and preterm infants.

Sideeffects:

All eyes that had <TREAT SIDE EFF> optic capture without vitrectomy <\TREAT SIDE EFF> remained clear, but after 6 months, four of five developed <DIS SIDE EFF> opacification <\DIS SIDE EFF>

Algorithm: Three semantic relation tags for identifying informative and non-informative sentences

Disease-Treatment Relations:

Input: Sentences from Medline abstracts

Output: Set of informative sentences and set of non-informative sentences

for each sentence do
    if the sentence carries any of the three semantic relation tags then
        add the sentence to the informative set
    else
        add the sentence to the non-informative set
    end
end
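The selection rule above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the tag names are taken from the three examples in this section, while the regular expression, the requirement that both the disease tag and the treatment tag of a relation be present, and the function names are assumptions made for the sketch.

```python
import re

# Tag pairs taken from the Cure / Prevent / Sideeffects examples above.
RELATION_TAGS = {
    "CURE": ("DIS", "TREAT"),
    "PREVENT": ("DISPREV", "TREATPREV"),
    "SIDEEFFECTS": ("DIS SIDE EFF", "TREAT SIDE EFF"),
}


def has_tag(sentence, tag):
    """Return True if the sentence contains the opening marker <tag>.

    Assumption: whitespace inside the angle brackets is optional, as in
    the extracted examples ("< DIS >" as well as "<DIS>").
    """
    body = r"\s*".join(re.escape(part) for part in tag.split())
    return re.search(r"<\s*" + body + r"\s*>", sentence) is not None


def relation_of(sentence):
    """Return the relation whose disease and treatment tags both occur, else None."""
    for label, (disease_tag, treatment_tag) in RELATION_TAGS.items():
        if has_tag(sentence, disease_tag) and has_tag(sentence, treatment_tag):
            return label
    return None


def select_sentences(sentences):
    """Split sentences into (informative, non_informative) lists."""
    informative, non_informative = [], []
    for sentence in sentences:
        if relation_of(sentence) is not None:
            informative.append(sentence)
        else:
            non_informative.append(sentence)
    return informative, non_informative
```

Calling `select_sentences` on a list that mixes tagged and untagged sentences returns the two sets named in the algorithm; `relation_of` additionally reports which of the three relations a tagged sentence carries.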

3.2 Feature Extraction

The feature extraction module is composed of three types:

a) BOW Extraction

b) Syntactic Representation

c) Metamap Representation

BOW Extraction

Steps:

1. The terms are extracted from the sentences and stop words such as is, are, by, and, the are removed.

2. Lemmatization is applied, keeping the meaningful terms, namely Diabetes, Lungcancer, disease, Blurredvision and Chemotherapy.

3. Terms that occur three or more times are listed as features; here these are Diabetes and Lungcancer.

Example:

Sentence 1: "Diseases namely lung cancer and Diabetes"

Sentence 2: "The diseases Lungcancer and Diabetes are curable at the initial stage."

Sentence 3: "Lungcancer is treated by radiotherapy and chemotherapy."

Sentence 4: "Blurred vision and frequent urination are the symptoms of Diabetes."

From the above sentences, only 2 features are extracted, namely Lungcancer and Diabetes, each of which occurs 3 times.
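The three BOW steps and the four example sentences can be reproduced with a short Python sketch. It is an illustration under stated assumptions: the stop-word set holds only the five words named in step 1, and the small `PHRASES` and `LEMMAS` tables are hypothetical stand-ins for a real lemmatizer, introduced so that "lung cancer" and "Lungcancer" count as the same term.

```python
import re
from collections import Counter

# Step 1: the five stop words named in the text.
STOP_WORDS = {"is", "are", "by", "and", "the"}

# Step 2 (assumption): hand-made normalisation tables in place of a lemmatizer.
PHRASES = {"lung cancer": "lungcancer", "blurred vision": "blurredvision"}
LEMMAS = {"diseases": "disease"}


def tokenize(sentence):
    """Lowercase, merge known phrases, drop stop words, map tokens to base forms."""
    text = sentence.lower()
    for phrase, merged in PHRASES.items():
        text = text.replace(phrase, merged)
    tokens = re.findall(r"[a-z]+", text)
    return [LEMMAS.get(tok, tok) for tok in tokens if tok not in STOP_WORDS]


def bow_features(sentences, min_count=3):
    """Step 3: keep terms seen at least min_count times; return (vocab, binary matrix)."""
    tokenized = [tokenize(s) for s in sentences]
    counts = Counter(tok for toks in tokenized for tok in toks)
    vocab = [term for term, n in counts.items() if n >= min_count]
    matrix = [[int(term in toks) for term in vocab] for toks in tokenized]
    return vocab, matrix


SENTENCES = [
    "Diseases namely lung cancer and Diabetes",
    "The diseases Lungcancer and Diabetes are curable at the initial stage.",
    "Lungcancer is treated by radiotherapy and chemotherapy.",
    "Blurred vision and frequent urination are the symptoms of Diabetes.",
]

vocab, matrix = bow_features(SENTENCES)
```

With the four sentences above, `vocab` holds the two terms `lungcancer` and `diabetes`, and each row of `matrix` marks which of them occurs in the corresponding sentence.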

Sentence   Lungcancer   Diabetes
1          1            1
2          1            1
3          1            0
4          0            1

Syntactic Representation

The Genia Tagger analyses English sentences and outputs the base forms, POS tags, chunk tags and named entity tags. It is intended for biomedical texts such as Medline abstracts and Biotextc abstracts.

Example: Sentence: Diseases namely lungcancer and Diabetes are curable. Treatment of lung cancer are radiotherapy and chemotherapy.
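To show how the syntactic representation might be consumed downstream, the sketch below reads tagger output into records with the five fields named above (word, base form, POS tag, chunk tag, named entity tag). It assumes one token per line with tab-separated fields. The sample rows are hand-written for illustration and are not actual tagger output, and `TaggedToken` and `parse_tagger_output` are hypothetical names.

```python
from typing import List, NamedTuple


class TaggedToken(NamedTuple):
    word: str   # surface form
    base: str   # base form (lemma)
    pos: str    # part-of-speech tag
    chunk: str  # chunk tag
    ne: str     # named entity tag


def parse_tagger_output(text: str) -> List[TaggedToken]:
    """Parse tab-separated tagger lines into TaggedToken records, skipping blank lines."""
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}: {line!r}")
        tokens.append(TaggedToken(*fields))
    return tokens


# Hand-written rows for illustration only; NOT real tagger output.
SAMPLE = "\n".join([
    "Diseases\tdisease\tNNS\tB-NP\tO",
    "namely\tnamely\tRB\tB-ADVP\tO",
    "lungcancer\tlungcancer\tNN\tB-NP\tO",
])

tokens = parse_tagger_output(SAMPLE)
base_forms = [tok.base for tok in tokens]
```

The base forms and POS tags collected this way could then be added to the BOW features as further columns of the feature vector.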

Metamap Representation

Metamap is a tool created by NLM that maps free text to medical concepts. The text is processed through a series of modules that give a ranked list of all possible concept candidates for a particular noun phrase.

3.3 Classification

Classification identifies the characteristics of several features from medical papers and medical abstracts and determines what type of disease-treatment relation exists in them.

• Training and testing

• LibSVM

• SMO
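The abstract names accuracy, precision, recall and F-measure as the performance measures for the training and testing step. A minimal sketch of how these can be computed from gold and predicted relation labels is given below; the function names and the toy label lists are assumptions for illustration, not results from the paper.

```python
def accuracy(gold, predicted):
    """Fraction of sentences whose predicted label equals the gold label."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def precision_recall_f(gold, predicted, label):
    """Precision, recall and F-measure for one relation label."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)  # true positives
    fp = sum(g != label and p == label for g, p in pairs)  # false positives
    fn = sum(g == label and p != label for g, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy labels for illustration only; not data from the paper.
gold = ["CURE", "CURE", "PREVENT", "SIDEEFFECTS"]
predicted = ["CURE", "PREVENT", "PREVENT", "SIDEEFFECTS"]

acc = accuracy(gold, predicted)
cure_scores = precision_recall_f(gold, predicted, "CURE")
```

Per-label scores like `cure_scores` can be averaged over the three relations when a single figure per classifier is needed.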

4. Workflow of Proposed System

Feature extraction is done by BOW extraction: the Biotextc dataset text file is converted into CSV format. The CSV file contains the labels and the disease type. The original dataset text file contained 233 sentences, from which 44 feature vectors were extracted. Training and testing of the classifiers is done with the Weka tool. Weka is a collection of machine learning algorithms for data mining tasks. The CSV file is loaded in the preprocess step, where the features are generated. For classification, the feature vectors in the CSV file are used and the function option is chosen, under which the LibSVM and SMO supervised approaches are listed.

4.1 Biotextc Dataset
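The conversion of the dataset into a CSV file of feature vectors, as described in the workflow, can be sketched as follows. The column layout (one column per BOW term, followed by a final `class` column holding the relation label) is an assumption made for this sketch; the paper does not give the exact layout. The labels in the usage lines are hypothetical.

```python
import csv
import io


def write_feature_csv(vocab, matrix, labels, out):
    """Write a header row of term names plus 'class', then one row per sentence."""
    writer = csv.writer(out)
    writer.writerow(list(vocab) + ["class"])
    for row, label in zip(matrix, labels):
        writer.writerow(list(row) + [label])


# Illustrative input: two BOW terms, two sentences, hypothetical labels.
buffer = io.StringIO()
write_feature_csv(
    ["lungcancer", "diabetes"],
    [[1, 1], [1, 0]],
    ["CURE", "PREVENT"],
    buffer,
)
csv_lines = buffer.getvalue().splitlines()
```

To produce a file on disk instead of an in-memory buffer, the same function can be called with a file opened via `open(path, "w", newline="")`.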

The proposed work retains only the informative sentences and removes uninformative sentences from medical articles in a pipelined manner. The system saves time for users, especially doctors, who can easily learn about a disease, its treatment and its symptoms, and can analyse the various treatments associated with a disease. BOW extraction was evaluated with several classifiers, namely LibSVM, RBF Network and Multilayer Perceptron; RBF Network shows the same level of performance, but the SMO classifier achieves higher accuracy and F-measure. The system will also be useful to common users who want to learn about a disease in a simple manner. The method can be applied in various health care domains.

[Figure: Precision and F-measure (scale 0 to 1) for LibSVM, Multilayer Perceptron, RBF Network and SMO]

References

[1] B. Rosario and M.A. Hearst, “Semantic Relations in Bioscience Text,” Proc. 42nd Ann. Meeting on Assoc. for Computational Linguistics, vol. 430

[2] R. Bunescu and R. Mooney, “A Shortest Path Dependency Kernel for Relation Extraction,” Proc. Conf. Human Language Technology and Empirical Methods in Natural Language Processing (HLT/EMNLP), pp. 724-731, 2005.

[3] R. Bunescu, R. Mooney, Y. Weiss, B. Schölkopf, and J. Platt, “Subsequence Kernels for Relation Extraction,” Advances in Neural Information Processing Systems, vol. 18, pp. 171-178, 2006.

[4] A.M. Cohen, W.R. Hersh, and R.T. Bhupatiraju, “Feature Generation, Feature Selection, Classifiers, and Conceptual Drift for Biomedical Document Triage,” Proc. 13th Text Retrieval Conf. (TREC), 2004.

[5] M. Craven, “Learning to Extract Relations from Medline,” Proc. Assoc. for the Advancement of Artificial Intelligence, 1999.

[6] I. Donaldson et al., “PreBIND and Textomy: Mining the Biomedical Literature for Protein-Protein Interactions Using a

[7] O. Frunza and D. Inkpen, “Textual Information in Predicting Functional Properties of the Genes,” Proc. Workshop Current Trends in Biomedical Natural Language Processing (BioNLP) in conjunction with Assoc. for Computational Linguistics (ACL ’08), 2008.

[8] R. Gaizauskas, G. Demetriou, P.J. Artymiuk, and P. Willett, “Protein Structures and Information Extraction from Biological Texts: The PASTA System,” Bioinformatics, vol. 19, no. 1, pp. 135-143, 2003.

[9] C. Giuliano, L. Alberto, and R. Lorenza, “Exploiting Shallow Linguistic Information for Relation Extraction from Biomedical Literature,” Proc. 11th Conf. European Chapter of the Assoc. for Computational Linguistics, 2006.

[10] J. Ginsberg, H. Mohebbi Matthew, S.P. Rajan, B. Lynnette, S.S. Mark, and L. Brilliant, “Detecting Influenza Epidemics Using Search Engine Query Data,” Nature, vol. 457, pp. 1012-1014, Feb. 2009.

[11] M. Goadrich, L. Oliphant, and J. Shavlik, “Learning Ensembles of First-Order Clauses for Recall-Precision Curves: A Case Study in Biomedical Information Extraction,” Proc. 14th Int’l Conf. Inductive Logic Programming, 2004.

[12] L. Hunter and K.B. Cohen, “Biomedical Language Processing: What’s beyond PubMed?” Molecular Cell, vol. 21-5, pp. 589-594, 2006.

[13] L. Hunter, Z. Lu, J. Firby, W.A. Baumgartner Jr., H.L. Johnson, P.V. Ogren, and K.B. Cohen, “OpenDMAP: An Open Source, Ontology-Driven Concept Analysis Engine, with Applications to Capturing Knowledge Regarding Protein Transport, Protein Interactions and Cell-Type-Specific Gene Expression,” BMC Bioinformatics, vol. 9, article no. 78, Jan. 2008.

[14] T.K. Jenssen, A. Laegreid, J. Komorowski, and E. Hovig, “A Literature Network of Human Genes for High-Throughput Analysis of Gene Expression,” Nature Genetics, vol 2005