Procedia Computer Science 87 ( 2016 ) 306 – 315
1877-0509 © 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of the Organizing Committee of ICRTCSE 2016 doi: 10.1016/j.procs.2016.05.166
2016 International Conference on Computational Science
Identifying Disease-Treatment Relations using Machine Learning Approach
#1 Keerrthega.M.C*, #2 Ms.D.Thenmozhi
#1 Student, SSN College of Engineering, Chennai-603110
#2 Assistant Professor, SSN College of Engineering, Chennai-603110
Abstract
Identifying disease-treatment relations makes it possible to find what disease a person suffers from and what treatment is appropriate for that person. The semantic relation tags, namely Cure, Prevent and Sideeffects, help to determine the relationship between a disease and a treatment. Many methodologies, such as co-occurrence analysis, rule-based methodologies and statistical methods, are used in disease-treatment relation identification. Machine learning, however, is widely used in applications such as protein-protein interaction, extraction of medical knowledge and the health care field. We propose a machine learning approach based on SMO classification, which draws its features from medical papers and medical abstracts. Our approach identifies the relation classes disease-treat, cure, prevent and sideeffects. Performance is measured by Accuracy, Precision, Recall and F-measure.
Keywords: Natural Language Processing; Machine Learning; SMO Classification
1.Introduction
Identifying the disease-treatment relation means determining what type of relation between a disease and a treatment is expressed in a sentence. The identified relation belongs to one of three semantic relations, namely cure, prevent and sideeffects. Disease-treatment relation identification lets the public learn about these types of relations between diseases and treatments, and it is useful to health care providers, private clinics, hospitals, doctors and common people.
2.Literature Survey
A. Semantic Relations in Bioscience text
for information extraction. Medical abstracts are extracted through text classification. Semantic lexicons, in which words are labelled with semantic classes, allow associations to be drawn between words, which helps in extracting the sentences relevant to a query. A Naive Bayes (NB) algorithm is used to extract semantic relations such as Gene-Protein from Medline abstracts.
B. Learning to extract relations from Medline
In this paper [5], individual sentences are treated as features and processed by a Naive Bayes classifier. Each feature is considered as a positive training set. Words are extracted from Medline abstracts using Naive Bayes and the CNB (Complement Naive Bayes) classification algorithm. The approach uses a bag-of-words representation during classification but no natural language processing, which degrades the quality of the output.
C. Extraction of Disease-Treatment relations from Biomedical Sentences
In this paper, the dataset is annotated with 8 semantic relations between diseases and treatments, and Hidden Markov models and maximum entropy models are used to perform both entity recognition and relation identification. The representation techniques are based on Parts Of Speech (POS), phrases and terms from MeSH (Medical Subject Headings).
D. Biomedical Language Processing: What’s beyond PubMed?
In this paper [12], natural language processing is used to process biomedical text. The system takes the name of a disease, parses the user's statement using natural language processing and returns the solution stored in a database for that disease, but it does not diagnose the disease.
E. Hybrid Machine Learning Implementation for classifying Disease-Treatment relations in Short texts
In this paper, selection techniques are used to identify the most suitable words as features, through sentence selection and relation identification. Relation extraction in the medical literature focuses on biomedical tasks such as subcellular location, gene-disorder association, and diseases and drugs. The data sets used in these biomedical tasks consist of short texts.
3.Proposed System
The proposed system identifies disease-treatment relations and classifies which type of relation, namely cure, prevent or sideeffects, exists in a sentence from Biotextc abstracts, using a supervised approach. The architecture of the proposed system:
Informative: information about the disease or treatment.
Non-informative: no information about the disease or treatment.
Semantic relations: Cure, Prevent, SideEffects.
The modules of the proposed system are described below:
(i) Sentence Selection
(ii) Feature Extraction
  • BOW Extraction
  • Syntactic Representation
  • Metamap Representation
(iii) Classification
3.1 Sentence Selection
Sentence selection identifies the informative sentences in Medline abstracts. Informative sentences are recognized by the tag representations DISTREAT, CURE, PREVENT and SIDEEFFECTS: if a sentence is annotated with these tags, it is considered informative. Example:
Cure:
<DIS> Obesity <\DIS> is a clinical problem and use of <TREAT> dexfenuramine hydrochloride <\TREAT> for weight reduction.
Prevent:
Immunogenicity of <DISPREV> hepatitis B <\DISPREV> <TREATPREV> vaccine <\TREATPREV> in term and preterm infants.
Sideeffects:
All eyes that had <TREAT SIDE EFF> optic capture without vitrectomy <\TREAT SIDE EFF> remained clear, but after 6 months, four of five developed <DIS SIDE EFF> opacification <\DIS SIDE EFF>
Disease-Treatment Relations:
Input: Sentences from Medline abstracts
Output: Set of informative sentences and set of non-informative sentences
for each sentence do
  if the sentence contains any of the three semantic relation tags then
  | add the sentence to the informative set
  else
  | add the sentence to the non-informative set
  end
end
Algorithm: Three semantic relation tags for identifying informative and non-informative sentences
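The selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tag names are taken from the three examples in this section, the backslash-style closing tags follow the way the tags are printed here, and the helper names find_relation and select_sentences are hypothetical rather than part of the proposed system.

```python
import re

# Disease/treatment tag pairs per relation, as they appear in the examples above.
RELATION_TAGS = {
    "Cure": ("DIS", "TREAT"),
    "Prevent": ("DISPREV", "TREATPREV"),
    "SideEffects": ("DIS SIDE EFF", "TREAT SIDE EFF"),
}

# Matches opening tags only, e.g. "<DIS>" or "< TREAT SIDE EFF >".
# Closing tags start with a backslash and are therefore skipped.
OPENING_TAG = re.compile(r"<\s*([A-Z][A-Z ]*?)\s*>")


def find_relation(sentence):
    """Return the relation whose disease and treatment tags both occur, else None."""
    opening = set(OPENING_TAG.findall(sentence))
    for relation, (disease_tag, treatment_tag) in RELATION_TAGS.items():
        if disease_tag in opening and treatment_tag in opening:
            return relation
    return None


def select_sentences(sentences):
    """Split sentences into an informative and a non-informative list."""
    informative, non_informative = [], []
    for sentence in sentences:
        if find_relation(sentence) is not None:
            informative.append(sentence)
        else:
            non_informative.append(sentence)
    return informative, non_informative
```

Requiring both the disease tag and the treatment tag is a design choice of this sketch; a looser rule that accepts either tag alone would only change the condition inside find_relation.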
3.2 Feature Extraction
The feature extraction module comprises three representations:
a) BOW Extraction
b) Syntactic Representation
c) Metamap Representation
BOW Extraction
Steps:
1. The terms are extracted from the sentences and stop words such as is, are, by, and, the are removed.
2. Lemmatization is applied, keeping the meaningful terms, namely Diabetes, Lungcancer, disease, Blurredvision and Chemotherapy.
3. Terms that occur three or more times are listed as features, here Diabetes and Lungcancer.
Example:
Sentence 1: "Diseases namely lung cancer and Diabetes"
Sentence 2: "The diseases Lungcancer and Diabetes are curable at the initial stage."
Sentence 3: "Lungcancer is treated by radiotherapy and chemotherapy."
Sentence 4: "Blurred vision and frequent urination are the symptoms of Diabetes."
From the above sentences, only 2 features are extracted, namely Lungcancer and Diabetes, each of which occurs 3 times.
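The three BOW steps can be traced on the four example sentences with a short Python sketch. It is a minimal illustration, not the authors' implementation: the stop-word list is the one given in step 1, while the LEMMAS table (a toy stand-in for a real lemmatizer), the merging of "lung cancer" into one term and the function names are assumptions made for this example.

```python
import re
from collections import Counter

STOP_WORDS = {"is", "are", "by", "and", "the"}  # stop words listed in step 1
LEMMAS = {"diseases": "disease"}                 # toy lemmatizer (assumption)
MIN_FREQ = 3                                     # frequency threshold from step 3

SENTENCES = [
    "Diseases namely lung cancer and Diabetes",
    "The diseases Lungcancer and Diabetes are curable at the initial stage.",
    "Lungcancer is treated by radiotherapy and chemotherapy.",
    "Blurred vision and frequent urination are the symptoms of Diabetes.",
]


def tokenize(sentence):
    """Lowercase, merge 'lung cancer', drop stop words, then lemmatize."""
    text = sentence.lower().replace("lung cancer", "lungcancer")
    words = re.findall(r"[a-z]+", text)
    return [LEMMAS.get(w, w) for w in words if w not in STOP_WORDS]


def bow_features(sentences, min_freq=MIN_FREQ):
    """Return the frequent terms and a binary sentence-by-term matrix."""
    tokens = [tokenize(s) for s in sentences]
    counts = Counter(t for sent in tokens for t in sent)
    # Counter keeps first-seen order, so features appear in reading order.
    features = [t for t, c in counts.items() if c >= min_freq]
    matrix = [[1 if f in sent else 0 for f in features] for sent in tokens]
    return features, matrix


features, matrix = bow_features(SENTENCES)
print(features)  # ['lungcancer', 'diabetes']
print(matrix)    # [[1, 1], [1, 1], [1, 0], [0, 1]]
```

Each row of the matrix records whether a feature occurs in the corresponding sentence, which is the binary representation used for the two extracted features.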
Sentence   Lungcancer   Diabetes
1          1            1
2          1            1
3          1            0
4          0            1
Syntactic Representation
The Genia Tagger analyzes English sentences and outputs the base forms, POS tags, chunk tags and named entity tags. It is designed for biomedical texts such as Medline abstracts and Biotextc abstracts.
Example:
Sentence: Diseases namely lungcancer and Diabetes are curable. Treatments of lung cancer are radiotherapy and chemotherapy.
Metamap Representation
Metamap is a tool created by NLM that maps free text to medical concepts. The text is processed through a series of modules that produce a ranked list of all possible concept candidates for a particular noun phrase.
3.3 Classification
Classification identifies the characteristics of the features extracted from medical papers and medical abstracts and determines which type of disease-treatment relation exists in them.
• Training and testing
• LibSVM
• SMO
4.Workflow of Proposed System
Feature extraction is performed by BOW extraction: the Biotextc dataset text file is converted into CSV format. The CSV file contains the labels and the disease type. The original dataset text file contains 233 sentences, from which 44 feature vectors were extracted. Training and testing of the classifiers are done with the Weka tool. Weka is a collection of machine learning algorithms for data mining tasks. The CSV file is loaded in the preprocess step, where the features are generated. In the classification step, with the feature vectors from the CSV file loaded, the functions option is chosen, where the LibSVM and SMO supervised approaches are listed.
4.1 Biotextc Dataset
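As a rough illustration of the CSV form of the dataset mentioned in the workflow, the sketch below writes binary BOW feature vectors and their relation labels as CSV text with a header row. The column name "relation", the function name to_csv and the sample labels are assumptions for this example; the actual column layout of the file used in the experiments is not given in the text.

```python
import csv
import io


def to_csv(feature_names, vectors, labels):
    """Return CSV text: a header row, then one row per sentence with its label."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # The class label goes in the last column (hypothetical column name).
    writer.writerow(list(feature_names) + ["relation"])
    for vector, label in zip(vectors, labels):
        writer.writerow(list(vector) + [label])
    return buffer.getvalue()


# Two illustrative rows; the labels are made up for the example.
print(to_csv(["lungcancer", "diabetes"], [[1, 1], [1, 0]], ["Cure", "Prevent"]))
```

The resulting text can be saved to a file with a .csv extension and then loaded in the preprocess step.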
The proposed work retains only the informative sentences and removes uninformative sentences from medical articles in a pipelined manner. The system helps users, especially doctors, to save time: they can easily find out about a disease, its treatment and its symptoms, and can analyze the various treatments associated with a disease. BOW extraction was evaluated with several classifiers, namely LibSVM, RBF Network and Multilayer Perceptron; RBF Network reaches the same level on the performance measures, but the SMO classifier achieves higher accuracy and F-measure. The system will also be useful to common users who want to learn about a disease in a simpler manner, and the method can be applied in various health care domains.
[Figure: Precision and F-measure (scale 0 to 1) for LibSVM, Multilayer Perceptron, RBF Network and SMO]
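For reference, the performance measures named in this paper (Accuracy, Precision, Recall and F-measure) can be computed from gold and predicted relation labels as in the sketch below. It uses the standard definitions for a single relation treated as the positive class; the function name evaluate and the sample labels are hypothetical and do not reproduce any result reported here.

```python
def evaluate(gold, predicted, positive):
    """Compute accuracy plus precision, recall and F-measure for one class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


# Made-up labels, only to show the call.
gold = ["Cure", "Cure", "Prevent", "SideEffects"]
predicted = ["Cure", "Prevent", "Prevent", "SideEffects"]
print(evaluate(gold, predicted, positive="Cure"))
```

Calling evaluate once per relation and averaging the results gives macro-averaged scores, should a single figure per classifier be needed.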
References
[1] B. Rosario and M.A. Hearst, “Semantic Relations in Bioscience Text,” Proc. 42nd Ann. Meeting on Assoc. for Computational Linguistics, vol. 430
[2] R. Bunescu and R. Mooney, “A Shortest Path Dependency Kernel for Relation Extraction,” Proc. Conf. Human Language Technology and Empirical Methods in Natural Language Processing (HLT/EMNLP), pp. 724-731, 2005.
[3] R. Bunescu, R. Mooney, Y. Weiss, B. Schölkopf, and J. Platt, “Subsequence Kernels for Relation Extraction,” Advances in Neural Information Processing Systems, vol. 18, pp. 171-178, 2006.
[4] A.M. Cohen, W.R. Hersh, and R.T. Bhupatiraju, “Feature Generation, Feature Selection, Classifiers, and Conceptual Drift for Biomedical Document Triage,” Proc. 13th Text Retrieval Conf. (TREC), 2004.
[5] M. Craven, “Learning to Extract Relations from Medline,” Proc. Assoc. for the Advancement of Artificial Intelligence, 1999.
[6] I. Donaldson et al., “PreBIND and Textomy: Mining the Biomedical Literature for Protein-Protein Interactions Using a
[7] O. Frunza and D. Inkpen, “Textual Information in Predicting Functional Properties of the Genes,” Proc. Workshop Current Trends in Biomedical Natural Language Processing (BioNLP) in conjunction with Assoc. for Computational Linguistics (ACL ’08), 2008.
[8] R. Gaizauskas, G. Demetriou, P.J. Artymiuk, and P. Willett, “Protein Structures and Information Extraction from Biological Texts: The PASTA System,” Bioinformatics, vol. 19, no. 1, pp. 135-143, 2003.
[9] C. Giuliano, L. Alberto, and R. Lorenza, “Exploiting Shallow Linguistic Information for Relation Extraction from Biomedical Literature,” Proc. 11th Conf. European Chapter of the Assoc. for Computational Linguistics, 2006.
[10] J. Ginsberg, H. Mohebbi Matthew, S.P. Rajan, B. Lynnette, S.S. Mark, and L. Brilliant, “Detecting Influenza Epidemics Using Search Engine Query Data,” Nature, vol. 457, pp. 1012-1014, Feb. 2009.
[11] M. Goadrich, L. Oliphant, and J. Shavlik, “Learning Ensembles of First-Order Clauses for Recall-Precision Curves: A Case Study in Biomedical Information Extraction,” Proc. 14th Int’l Conf. Inductive Logic Programming, 2004.
[12] L. Hunter and K.B. Cohen, “Biomedical Language Processing: What’s beyond PubMed?” Molecular Cell, vol. 21-5, pp. 589-594, 2006.
[13] L. Hunter, Z. Lu, J. Firby, W.A. Baumgartner Jr., H.L. Johnson, P.V. Ogren, and K.B. Cohen, “OpenDMAP: An Open Source, Ontology-Driven Concept Analysis Engine, with Applications to Capturing Knowledge Regarding Protein Transport, Protein Interactions and Cell-Type-Specific Gene Expression,” BMC Bioinformatics, vol. 9, article no. 78, Jan. 2008.
[14] T.K. Jenssen, A. Laegreid, J. Komorowski, and E. Hovig, “A Literature Network of Human Genes for High-Throughput Analysis of Gene Expression,” Nature Genetics, vol 2005