Sparse Deep Learning Framework for Disease Analysis


Ms. K. Remya1, Dr. T. Senthil Prakash2, Mr. S. Ramesh3, Ms. P. C. Saritha4
II Year M.E (CSE)1, Professor & HOD2, Assistant Professor3, II Year M.E (CSE)4

Shree Venkateshwara Hi-Tech Engg. College, Gobi, Tamilnadu, India1,2,3,4

Abstract

Web-based health services build on the infrastructure provided by the Internet community and fall into two main types: reputable portals and community based health services. Reputable portals provide disease and symptom information to patients. Community based health services are interactive and support Question and Answer (QA) communication between patients and doctors. Decision support systems are constructed on top of this communication to assist decision making. Vocabulary mismatch and limited information availability are the key issues in QA based remote health services, which use disease and symptom details in the decision making process. A sparse deep learning method is built with local learning and global learning schemes. Feature identification and signature identification are carried out during learning, and the learned model is then used to perform the disease inference tasks.

1. Introduction

Information technologies are transforming the way healthcare services are delivered, from patients passively embracing their doctors’ orders to patients actively seeking online information that concerns their health. This trend is confirmed by a national survey conducted by the Pew Research Center in January 2013, which reported that one in three American adults had gone online to figure out a medical condition in the 12 months preceding the report.

To better cater to health seekers, a growing number of community based healthcare services have emerged, including HealthTap, HaoDF and WebMD. They disseminate personalized health knowledge and connect patients with doctors worldwide via question answering [1], [2]. These forums are attractive to both professionals and health seekers. Professionals are able to increase their reputation among colleagues and patients, strengthen their practical knowledge through interactions with other renowned doctors, and possibly attract new patients. For patients, these systems provide nearly instant and trusted answers, especially for complex and sophisticated problems. A tremendous number of medical records have accumulated in their repositories, and in most circumstances users can directly locate good answers by searching these record archives, rather than waiting for the experts’ responses or browsing through a list of potentially relevant documents from the Web.


For example, “heart attack” and “myocardial disorder” are employed by different experts to refer to the same medical diagnosis. It was shown that the inconsistency of community generated health data greatly hindered data exchange, management and integrity. Even worse, it was reported that users had encountered big challenges in reusing the archived content due to the incompatibility between their search terms and those accumulated medical records [5]. Automatically coding the medical records with standardized terminologies is highly desired. It leads to a consistent interoperable way of indexing, storing and aggregating across specialties and sites. In addition, it facilitates the medical record retrieval via bridging the vocabulary gap between queries and archives.

It is worth mentioning that several efforts have already been dedicated to automatically mapping medical records to terminologies [6]. Most of these efforts focused on hospital generated health data or health provider released sources, utilizing either isolated or loosely coupled rule-based and machine learning approaches [3]. Compared to these kinds of data, the emerging community generated health data is more colloquial, and its inconsistency, complexity and ambiguity pose challenges for data access and analytics. Further, most previous work simply utilizes an external medical dictionary to code the medical records rather than considering corpus-aware terminologies. This reliance on independent external knowledge may bring in inappropriate terminologies. Constructing a corpus-aware terminology vocabulary that prunes the terminologies irrelevant to a specific dataset and narrows down the candidates is the tough issue we are facing. In addition, the variety of heterogeneous cues was often not adequately exploited simultaneously. A robust integrated framework that draws on the strengths of various resources and models is still needed.

2. Related Work

The authority-oriented expert finding methods are based on link analysis of the ask-answer relation between users in the rating matrix. User authority is ranked with conventional webpage ranking algorithms and their variations [7]. For example, Bouguessa et al. choose the experts to answer questions based on the number of best answers provided by users, which is an in-degree-based method. Jurczyk and Agichtein construct a user-to-user graph from past question answering activities and employ a HITS-based method to rank user authority. Zhu et al. [13] measure the category relevance of questions and rank user authority in an extended category link graph. Sung et al. [9] infer the expertise of new users by propagating the expertise of old users through commonly used words.


et al. model the expertise of the users based on the rating of the comments in community question answering system.
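
The link-analysis idea behind the in-degree and HITS-based rankings mentioned above can be illustrated with a minimal sketch. It assumes a directed graph with one edge from each asker to each user who answered that asker's question; the function name `hits_authority` and the toy user identifiers are hypothetical and are not taken from the cited works.

```python
from collections import defaultdict

def hits_authority(edges, iterations=50):
    """Rank users by HITS authority on an asker -> answerer graph.

    edges: iterable of (asker, answerer) pairs, one per answered question.
    Returns a dict mapping each user to an L2-normalised authority score.
    """
    users = {u for edge in edges for u in edge}
    out_links = defaultdict(list)   # asker -> users who answered them
    in_links = defaultdict(list)    # answerer -> users they answered
    for asker, answerer in edges:
        out_links[asker].append(answerer)
        in_links[answerer].append(asker)

    hub = {u: 1.0 for u in users}
    auth = {u: 1.0 for u in users}
    for _ in range(iterations):
        # Authority: sum of hub scores of the askers this user answered.
        auth = {u: sum(hub[a] for a in in_links[u]) for u in users}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {u: v / norm for u, v in auth.items()}
        # Hub: sum of authority scores of the users who answered this asker.
        hub = {u: sum(auth[a] for a in out_links[u]) for u in users}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {u: v / norm for u, v in hub.items()}
    return auth

# Hypothetical toy data: three patients ask, two doctors answer.
edges = [("p1", "dr_a"), ("p2", "dr_a"), ("p3", "dr_a"), ("p3", "dr_b")]
scores = hits_authority(edges)
ranking = sorted(scores, key=scores.get, reverse=True)

# The simpler in-degree baseline just counts answers per user.
in_degree = {u: sum(1 for _, ans in edges if ans == u) for u in scores}
```

In this toy graph both approaches place `dr_a` first; they can diverge on larger graphs because HITS also weighs who asked the questions a user answered.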

Unlike the previous studies, we formulate the problem of expert finding from the viewpoint of missing value estimation, which can be solved via matrix completion. We also employ the social relationship of users to improve the performance of missing value estimation.
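
As a rough illustration of treating expert finding as missing value estimation, the sketch below fills the unobserved entries of a small user-by-question rating matrix with a low-rank factorization fitted by stochastic gradient descent. This is only one simple way to perform matrix completion; it omits the social-relationship term, and the function name, hyperparameters and toy ratings are all assumptions made for the example.

```python
import random

def complete_matrix(ratings, rank=2, steps=2000, lr=0.02, reg=0.01, seed=0):
    """Fill missing entries (None) of a user x question rating matrix.

    Fits ratings[i][j] ~ dot(P[i], Q[j]) on the observed entries only,
    using stochastic gradient descent with L2 regularisation.
    """
    rng = random.Random(seed)
    n_users, n_items = len(ratings), len(ratings[0])
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    observed = [(i, j, r) for i, row in enumerate(ratings)
                for j, r in enumerate(row) if r is not None]
    for _ in range(steps):
        for i, j, r in observed:
            err = r - sum(p * q for p, q in zip(P[i], Q[j]))
            for k in range(rank):
                p_ik, q_jk = P[i][k], Q[j][k]
                P[i][k] += lr * (err * q_jk - reg * p_ik)
                Q[j][k] += lr * (err * p_ik - reg * q_jk)
    return [[sum(p * q for p, q in zip(P[i], Q[j])) for j in range(n_items)]
            for i in range(n_users)]

# Hypothetical ratings: rows are users, columns are questions, None is unknown.
ratings = [
    [5, 4, None, 1],
    [4, None, 1, 1],
    [1, 1, 5, None],
    [None, 1, 4, 5],
]
completed = complete_matrix(ratings)

# Mean absolute error on the entries that were actually observed.
observed = [(i, j, r) for i, row in enumerate(ratings)
            for j, r in enumerate(row) if r is not None]
mae = sum(abs(completed[i][j] - r) for i, j, r in observed) / len(observed)
```

The values predicted at the `None` positions can then be read as estimated answer quality for user and question pairs that were never observed.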

3. Disease Inference from Health-Related Questions

The greying of society, escalating costs of healthcare and burgeoning computer technologies are together driving more consumers to spend longer time online to explore health information. One survey reported that 59 percent of U.S. adults had explored the internet as a diagnostic tool in 2012. Another survey reported that in 2013 the average U.S. consumer spent close to 52 hours annually online to find wellness knowledge, while visiting the doctor only three times per year. These findings have heightened the importance of online health resources as springboards to facilitate patient-doctor communication.

The currently prevailing online health resources can be roughly divided into two categories. One is the reputable portals run by official sectors, renowned organizations, or other professional health providers. They disseminate up-to-date health information by releasing the most accurate, well-structured, and formally presented health knowledge on various topics. WebMD and MedlinePlus are typical examples. The other category is the community-based health services, such as HealthTap and HaoDF. They offer interactive platforms where health seekers can anonymously ask health-oriented questions while doctors provide knowledgeable and trustworthy answers. Fig. 3.1 illustrates one question answer (QA) example. The community-based health services have several intrinsic limitations. First of all, it is very time consuming for health seekers to get their posted questions resolved.

Fig. 3.1. An Illustrative Example of a QA Pair from Community-based Health Services (HealthTap)

The time could vary from hours to days. Second, doctors have to cope with an ever-expanding workload, which leads to decreased enthusiasm and efficiency. Taking HealthTap as an example, as of January 2014 it had gathered 50 thousand doctors and accumulated more than 1.1 billion answers, i.e., on average each doctor has replied online approximately 23 thousand times since its foundation in 2010. Third, qualitative replies are conditioned on doctors’ expertise, experience and time, which may result in diagnosis conflicts among multiple doctors and low disease coverage by individual doctors. It is thus highly desirable to develop automatic and comprehensive wellness systems that can instantly answer all-round questions of health seekers and alleviate the doctors’ workload.


possible diseases of their manifested signals. The former two genres usually involve exact disease names and the expected sub-topics or sub-problems of the given diseases, such as treatments and the side effects of specific medications. They can be answered automatically and precisely, either by directly matching the questions against the archived repositories or by syntactic information extraction from the structured health portals. The existing automatic question answering techniques are applicable here. The third genre conveys parts of the health seekers’ demographic information, physical and mental symptoms, and medical histories; these health seekers do not know what conditions they might have and expect the doctors to offer some sort of online diagnosis. If the diseases are correctly inferred, these questions naturally reduce to the first genre. A robust disease inference approach is therefore the key to breaking the barrier to automatic wellness systems.

Little research has been dedicated to disease inference in the community-based health services. Disease inference differs from assigning topics or tags to short questions, where the topics or tags are direct summarizations of the given data instances and may explicitly appear in the questions. Disease inference, by contrast, is a reasoning consequence of the given question. This task is nontrivial for the following reasons. First, the vocabulary gap between diverse health seekers makes the data more inconsistent than other formats of health data. For example, “shortness of breath” and “breathless” are used by different health seekers to refer to the same concept, “dyspnea”. Second, health seekers describe their problems in short questions, containing 14.5 terms per question on average. This incompleteness hinders effective similarity estimation based on shared contexts. Third, medical attributes such as age, gender and symptoms are highly correlated, and it is not unusual for them to appear as compact patterns that signal the health problems.
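
The vocabulary gap described above can be made concrete with a small sketch. It maps colloquial expressions to a shared concept and compares question similarity before and after the mapping. The lookup table `TERM_MAP` is a hand-written stand-in built from the example in the text; a real system would draw on a medical terminology resource, and the helper names are hypothetical.

```python
import re

# Hypothetical lookup table, for illustration only.
TERM_MAP = {
    "shortness of breath": "dyspnea",
    "breathless": "dyspnea",
}

def normalize_terms(question, term_map=TERM_MAP):
    """Return the set of standardized concepts mentioned in a question."""
    text = question.lower()
    found = set()
    for phrase, concept in term_map.items():
        # Whole-word match so that partial words are not picked up.
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            found.add(concept)
    return found

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

q1 = "I get shortness of breath when climbing stairs"
q2 = "Feeling breathless after a short walk"

# Similarity on raw word tokens versus on normalized concepts.
raw_overlap = jaccard(set(q1.lower().split()), set(q2.lower().split()))
concept_overlap = jaccard(normalize_terms(q1), normalize_terms(q2))
```

The two questions share no surface tokens, yet they describe the same symptom once both are expressed in terminology space, which is why a terminology-based representation helps similarity estimation on short questions.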

This paper aims to build a disease inference scheme that is able to automatically infer the possible diseases of the given questions in community based health services. We first analyze and categorize the information needs of health seekers. As a byproduct, we differentiate the questions that require disease inference from other kinds. It is worth emphasizing that large-scale data often leads to an explosion of the feature space under n-gram representations, especially for inconsistent community generated data. To avoid this problem, we utilize medical terminologies to represent our data. Our scheme builds a novel deep learning model comprising two components, as demonstrated in Fig. 3.2. The first globally mines the latent medical signatures.

Fig. 3.2. The Illustrative Process Of Our Sparsely Connected Deep Learning Construction

They are compact patterns of

Fine-tuning with a small set of labeled disease samples fits our model to specific disease inference. Different from conventional deep learning algorithms, the number of hidden nodes in each layer of our model is automatically determined and the connections between two adjacent layers are sparse, which makes it faster. Extensive experiments on a real-world dataset labeled by online doctors were conducted to validate our scheme.
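
To show what sparse connections between adjacent layers can look like in practice, here is a minimal forward-pass sketch. It assumes that each hidden node stands for one signature and is wired only to the input features in that signature, so the number of hidden nodes equals the number of signatures. The function name, the sigmoid activation and the toy weights are assumptions, not details given in the text.

```python
import math

def sparse_layer_forward(x, signatures, weights, biases):
    """Forward pass through one sparsely connected layer.

    x          : dict mapping input feature name -> activation
    signatures : list of tuples; signatures[h] lists the inputs wired to node h
    weights    : list of dicts; weights[h][feature] is the connection weight
    biases     : list of floats, one per hidden node
    """
    hidden = []
    for members, w, b in zip(signatures, weights, biases):
        # Only the signature's members contribute; every other
        # input-to-hidden connection is absent, not merely zero-weighted.
        z = b + sum(w[f] * x.get(f, 0.0) for f in members)
        hidden.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid activation
    return hidden

# Hypothetical signatures over five terminology features.
signatures = [("dyspnea", "chest pain"), ("fever", "cough", "fatigue")]
weights = [{f: 1.0 for f in members} for members in signatures]
biases = [-1.0, -1.0]

x = {"dyspnea": 1.0, "chest pain": 1.0}
hidden = sparse_layer_forward(x, signatures, weights, biases)

n_inputs = len({f for members in signatures for f in members})
n_sparse_connections = sum(len(members) for members in signatures)
n_dense_connections = n_inputs * len(signatures)
```

Here the sparse layer uses 5 connections where a fully connected layer over the same nodes would use 10, which is the source of the speed advantage claimed for sparse wiring.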

The main contributions of this work are threefold:

1) To the best of our knowledge, this is the first work on automatic disease inference in the community-based health services. Distinguished from the conventional sporadic efforts that generally focus on only a single or a few diseases based on hospital generated records with structured fields, our scheme benefits from the volume of unstructured community generated data and is capable of handling various kinds of diseases effectively.

2) It investigates and categorizes the information needs of health seekers in the community-based health services and mines the signatures of their generated data.

3) It proposes a sparsely connected deep learning scheme to infer various kinds of diseases. This scheme is pre-trained with pseudo-labeled data and further strengthened by fine-tuning with online doctor labeled data.

4. Problem Statement

Community based health services support automatic disease inference for online health seekers. Question and Answer (QA) sessions suffer from the vocabulary gap and incomplete information. Correlated medical concepts and a limited number of high quality training samples affect the inference results. Diseases and symptoms are collected and used in the QA based health analysis tasks. A deep learning scheme is applied to infer the possible diseases from QA data. The global learning component mines the discriminant medical signatures from raw features. In local learning, the raw features and their signatures are fed into the input layer and the hidden layer. A sparsely connected deep learning scheme is applied to infer various kinds of diseases. The following drawbacks are identified in the existing system.

• Discriminant feature identification is not supported

• Medical terminology relationship analysis is not provided

• Feature priority factors are not considered

• Limited inference accuracy levels

5. Disease Analysis using Sparse Deep Learning Framework

The sparse deep learning scheme is enhanced to extract discriminant features from health data. A medical terminology based Ontology is used in the inference estimation process. Feature analysis is carried out with weight values derived from conceptual relationships. Question and Answer (QA) data are evaluated with symptom priority levels.

The disease inference estimation scheme is constructed to analyze the questions and answers in online health services. A medical domain based Ontology is adapted to identify the disease inferences. Feature selection and categorization operations are integrated with the system. The system is partitioned into three major modules: Question and Answer Sessions, Tag Analysis and Deep Learning Process. The Question and Answer (QA) Sessions module performs data preprocessing. Tags are identified and categorized in the Tag Analysis module. Features and signatures are identified in the Deep Learning Process module.

5.1. Question and Answer Sessions

The Question and Answer (QA) data sets are collected from online health services. The QA data values are transferred into the database. Questions, answers and tags are extracted from the data sets. The data sets are labeled with category information.
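
A minimal sketch of this preprocessing step is given below. It assumes each collected QA pair arrives as a dictionary with question, answer, tags and category fields, and stores the pairs in an in-memory SQLite table. The table layout, field names, sample records and category labels are all hypothetical.

```python
import sqlite3

def store_qa_records(records):
    """Load QA records into an in-memory SQLite database and return it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE qa ("
        "id INTEGER PRIMARY KEY, question TEXT, answer TEXT, "
        "tags TEXT, category TEXT)"
    )
    conn.executemany(
        "INSERT INTO qa (question, answer, tags, category) VALUES (?, ?, ?, ?)",
        [(r["question"], r["answer"], ",".join(r["tags"]), r["category"])
         for r in records],
    )
    conn.commit()
    return conn

# Hypothetical sample records with made-up category labels.
records = [
    {"question": "I feel breathless and have chest pain. What could it be?",
     "answer": "These symptoms need prompt evaluation by a doctor.",
     "tags": ["dyspnea", "chest pain"],
     "category": "symptom description"},
    {"question": "What are the side effects of this medication?",
     "answer": "Please check the leaflet and ask your pharmacist.",
     "tags": ["medication"],
     "category": "treatment"},
]

conn = store_qa_records(records)
total = conn.execute("SELECT COUNT(*) FROM qa").fetchone()[0]
symptom_rows = conn.execute(
    "SELECT question, tags FROM qa WHERE category = ?",
    ("symptom description",),
).fetchall()
```

Storing the category label alongside each pair lets later modules select only the questions that require disease inference.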

5.2. Tag Analysis


tag analysis. Features and their associated tag labels are stored in the database.

5.3. Deep Learning Process

Pseudo labeled data and doctor labeled data are analyzed in the learning process. Signatures are identified from the raw features under the global learning process. Input layers and hidden layers are updated with features and signatures. The layers are used in the inference identification process.
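
One simple way to picture signature identification is frequent co-occurrence mining over the terminology features of each question. The sketch below keeps every pair of terms that appears together in at least `min_support` questions and uses the raw features as the input layer and the mined pairs as the hidden layer. This is a deliberately simplified stand-in for the global learning step, and the function name, threshold and toy data are assumptions.

```python
from collections import Counter
from itertools import combinations

def mine_signatures(term_sets, min_support=2):
    """Return term pairs that co-occur in at least `min_support` questions."""
    counts = Counter()
    for terms in term_sets:
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return sorted(pair for pair, c in counts.items() if c >= min_support)

# Hypothetical questions, already mapped to terminology features.
questions = [
    {"dyspnea", "chest pain", "fatigue"},
    {"dyspnea", "chest pain"},
    {"fever", "cough"},
    {"fever", "cough", "fatigue"},
]

signatures = mine_signatures(questions)

# Raw features form the input layer; mined signatures form the hidden layer.
input_layer = sorted(set().union(*questions))
hidden_layer = signatures
```

Because the hidden layer is derived from the data, its size follows from the number of signatures found and does not need to be fixed in advance.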

6. Conclusion

Online health services are deployed to provide remote medical assistance. Automatic disease inference estimation is carried out using Question and Answer (QA) based diagnosis details. The sparse deep learning scheme is improved with Ontology support. A discriminant feature identification mechanism is used to upgrade the inference estimation process. Online health services are constructed to perform inference identification using Community based Question and Answer (CQA) details. Concept and term relationships are used to provide vocabulary support for the Question and Answer sessions. Inference identification accuracy is improved with discriminant features and feature priority levels. Disease diagnosis is improved with overlapped feature analysis.

REFERENCES

[1] L. Nie, M. Akbari, T. Li, and T.-S. Chua, “A joint local-global approach for medical terminology assignment,” in Proc. Int. ACM SIGIR Conf., 2014.

[2] L. Nie, T. Li, M. Akbari, and T.-S. Chua, “Wenzher: Comprehensive vertical search for healthcare domain,” in Proc. Int. ACM SIGIR Conf., 2014, pp. 1245–1246.

[3] Liqiang Nie, Yi-Liang Zhao, Mohammad Akbari, Jialie Shen and Tat-Seng Chua, “Bridging the Vocabulary Gap between Health Seekers and Healthcare Knowledge”, IEEE Transactions On Knowledge And Data Engineering, Vol. 27, No. 2, February 2015.

[4] S. H. Hashemi, M. Neshati, and H. Beigy, “Expertise retrieval in bibliographic network: A topic dominance learning approach,” in Proc. 22nd ACM Int. Conf. Inform. Knowl. Manage., 2013.

[5] G. Zuccon, B. Koopman, A. Nguyen, D. Vickers, and L. Butt, “Exploiting medical hierarchies for concept-based information retrieval,” in Proc. Australasian Document Comput. Symp., 2012, pp. 111–114.

[6] E. J. M. Lauría and A. D. March, “Combining Bayesian text classification and shrinkage to automate healthcare coding: A data quality analysis,” J. Data Inf. Quart., vol. 2, no. 3, p. 13, 2011.

[7] Zhou Zhao, Lijun Zhang, Xiaofei He and Wilfred Ng, “Expert Finding for Question Answering via Graph Regularized Matrix Completion”, IEEE Transactions On Knowledge And Data Engineering, Vol. 27, No. 4, April 2015.

[8] F. Riahi, Z. Zolaktaf, M. Shafiei, and E. Milios, “Finding expert users in community question answering,” in Proc. 21st Int Conf. Companion World Wide Web, 2012.

[9] J. Sung, J.-G. Lee and U. Lee, “Booming up the long tails: Discovering potentially contributive users in community-based question answering services,” in Proc. 7th Int. AAAI Conf. 2013.

[10] T. Zhao, N. Bian, C. Li, and M. Li, “Topic-level expert modeling in community question answering,” in Proc. SDM, 2013.

[11] F. Xu, Z. Ji, and B. Wang, “Dual role model for question recommendation in community question answering,” in Proc. SIGIR, 2012.