Using a Combination of Methodologies for Improving Medical
Information Retrieval Performance
Hoda Forghani Raissi
A THESIS SUBMITTED TO
THE FACULTY OF GRADUATE STUDIES
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
MASTER OF ARTS
GRADUATE PROGRAM IN
INFORMATION SYSTEMS AND TECHNOLOGY
YORK UNIVERSITY
TORONTO, ONTARIO
September 2013
ABSTRACT
This thesis presents three approaches to improve the current state of Medical Information Retrieval. At the time of this writing, the health industry is experiencing a massive change as technology is introduced into all aspects of health delivery. The work in this thesis involves adapting established concepts from the field of Information Retrieval to the field of Medical Information Retrieval. In particular, we apply subtype filtering, ICD-9 codes, query expansion, and re-ranking methods in order to improve retrieval on medical texts. The first method applies association rule mining and cosine similarity measures. The second method applies subtype filtering and the Apriori algorithm. The third method uses ICD-9 codes in order to improve retrieval accuracy. Overall, we show that the current state of medical information retrieval has substantial room for improvement. Our first two methods do not show significant improvements, while our
Acknowledgements
This work could not have been done without the help and support of my very kind friends and mentors, who were always there to help and support me. Words cannot describe the gratitude I have for them. Nevertheless, this section serves to thank the special people who helped me complete my thesis.
First of all, I want to show my great respect and thanks to Professor Jimmy Huang for his encouragement and passionate academic support. Professor Huang's kindness and support constantly motivate me to work hard and achieve my academic goals.
Furthermore, I would like to thank Professor Yang, also on my supervisory committee, for her help and support. Finally, I would like to thank my external examiner, Professor Song, for her time and consideration.
I also want to thank Dr. Mariam Daoud for her help and support. None of this would have been possible without her guidance. I would also like to greatly thank my friend Karl Aeen for his continued support and guidance.
Apart from the academic support, I would like to thank my friends and family who were there for me through thick and thin. My best friend, and a younger sister at heart, Atanaz, was one of the reasons for my successes over the past few years. Her family was also very helpful in providing me with an easy transition for my new life in Canada.
Finally, I want to thank the most important people in my life; these people gave me all they had, emotionally and financially, through love and kindness to help me get to where I am today: my lovely parents, Zari Pirjedi and Reza Forghani Raissi; and my beautiful sister, Mona. I am who I am and where I am today because of my family's love.
Table of Contents
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
1 Introduction
1.1 Background
1.2 Motivation
1.2.1 E-Health Initiatives
1.3 Problem Definition
1.3.1 Query Model
1.3.2 Medical Dataset Model
1.4 Contribution
1.4.1 Query Expansion
1.4.2 Methods Proposed
1.5 Findings
1.5.1 Method 1: APFP-Cosine Algorithm
1.5.2 Method 2: Sub-AP Algorithm
1.5.3 Method 3: ICD9-Top Algorithm
1.6 Thesis Structure
2 Literature Review
2.1 Re-ranking based on Similarity Calculation
2.2 Query Expansion
2.2.1 Automatic Query Expansion
2.2.2 Manual Query Expansion
2.2.3 Association Rule Mining Algorithms
2.3 Weighting Models
3 APFP-Cosine Method
3.1 APFP-Cosine Methodology Steps
3.1.1 Indexing Based on Using UMLS and BioLabeler
3.1.2 Using Association Rule Mining for Query Expansion
3.1.3 Vectors of Queries
3.1.4 TF.IDF Scoring
3.1.5 Vectors of Reports
3.1.6 Cosine Similarity
3.1.7 Re-ranking based on cosine similarity score
3.2 Algorithm
4.1 Sub-AP Methodology Steps
4.1.1 Indexed Collection
4.1.2 Subtypes Filtering
4.1.3 Using Top Ranked Results
4.1.4 Finding Most Relevant Concepts
4.1.5 Apriori Algorithm Scores
4.1.6 Re-ranking and Combining Scores
4.2 Algorithm
5 ICD9-Top
5.1 Methodology Steps in Details
5.1.1 Baseline
5.1.2 Gathering ICD-9 Codes
5.1.3 Gathering Top Documents
5.1.4 Finding Associated ICD-9 codes
5.1.5 Ranking ICD-9 Codes
5.1.6 Finding The Top ICD-9 Codes
5.1.7 ICD-9 Codes Descriptions
5.1.8 Weighting Terms
5.1.9 Query Expansion
5.1.10 Retrieval
5.2 Algorithm
6 Experimental Settings
6.1 Information Retrieval
6.1.1 Standard Information Retrieval Methods
6.1.2 Terrier Information Retrieval Method
6.2 TF-IDF Retrieval Method
6.3 Perl
6.5 Evaluation Parameters
6.5.1 Recall
6.5.2 Precision
6.5.3 Precision at n
6.5.4 bpref
6.5.5 MAP
6.5.6 R-precision
7 Results
7.1 APFP-Cosine
7.1.1 AP-Cosine Results
7.1.2 FP-Cosine Results
7.2 Sub-AP
7.3 ICD9-Top
7.3.1 ICD9-Top20
7.3.2 ICD9-Top50
8 Analysis and Discussion
8.1 APFP-Cosine Discussion
8.2 Sub-AP Discussion
8.3 ICD9-Top Discussion
8.4 Improved Methods Compared to the Baseline
8.5 The Significance Test
9 Conclusion and Future Work
9.1 APFP-Cosine Method
9.2 AP-Sub Method
9.4 Future Work
A. Scripts for Methodologies
A.1 Generating Report's Vector
A.2 Generating Query's Vector
A.3 Cosine Similarity
A.4 Subtype Filtering
A.5 Perl Program for Re-ranking
A.6 Reading Reports in order to find the ICD-9 Codes
A.7 Finding ICD-9 Codes Description
A.8 Weight for the ICD-9 Codes Descriptions for Each Query
A.9 Python Program for Stemming and Remove Stop Words
B. Topics
B.1 Topics of Medical TREC 2011
List of Tables
Table 1: Sample of ICD-9 codes
Table 2: Partial List of Stop Words
Table 3: Different Tags of Subtypes
Table 4: APCosine Method in compare with Baseline
Table 5: FP-Cosine Method in compare with Baseline
Table 6: Comparison Disorders Subtype Results with baseline
Table 7: Comparison Procedure Subtype Results with baseline
Table 8: Comparison ICD9-Top20 Method with Baseline results
Table 9: Comparison ICD9-Top50 Method with Baseline results
Table 10: The results of significant test for ICD9-Top20 (Map Values)
Table 11: The results of significant test for ICD9-Top20 (R-prec Values)
Table 12: The results of significant test for ICD9-Top20 (bpref Values)
Table 13: The results of significant test for ICD9-Top20 (P5 Values)
Table 14: The results of significant test for ICD9-Top20 (P10 Values)
Table 16: The results of significant test for ICD9-Top50 (R-prec Values)
Table 17: The results of significant test for ICD9-Top50 (bpref Values)
Table 18: The results of significant test for ICD9-Top50 (P5 Values)
Table 19: The results of significant test for ICD9-Top50 (P10 Values)
Table 20: The results of significant test for Sub-AP method based on the Top 10 added concepts in α = 0.1 for Procedure subtype (MAP Values)
Table 21: The results of significant test for Sub-AP method based on the Top 10 added concepts in α = 0.1 for Procedure subtype in (R-prec Values)
Table 22: The results of significant test for Sub-AP method based on the Top 10 added concepts in α = 0.1 for Procedure subtype in (bpref Values)
Table 23: The results of significant test for Sub-AP method based on the Top 10 added concepts in α = 0.1 for Procedure subtype in (P5 Values)
Table 24: The results of significant test for Sub-AP method based on the Top 10 added concepts in α = 0.1 for Procedure subtype in (P10 Values)
List of Figures
Figure 1: Sample Report
Figure 2: Re-ranking Method Diagram
Figure 3: Re-ranking based on the Cosine Similarity Algorithm
Figure 4: Subtypes Method's Diagram
Figure 5: Sub-AP Algorithm
Figure 6: ICD-9 Methodology Graph
Figure 7: Sample Report along with the tags
Figure 8: Terrier Components
Figure 9: ICD-9 Codes Frequency for Top 20 Relevant reports
Figure 10: ICD-9 Codes Frequency for Top 50 Relevant reports
Figure 11: Sample Lines of index
Figure 12: ICD9-Top Algorithm
Figure 13: Standard IR Algorithm
Figure 14: Inverted Index Algorithm
Figure 15: Terrier Source Code to Calculate TF.IDF
Figure 16: MAP Values for AP-Cosine Method
Figure 18: bpref Values for AP-Cosine Method
Figure 19: P5 Values for AP-Cosine Method
Figure 20: P10 Values for AP-Cosine Method
Figure 21: MAP Values for FP-Cosine Method
Figure 22: R-precision Values for FP-Cosine Method
Figure 23: bpref Values for FP-Cosine Method
Figure 24: P5 Values for FP-Cosine Method
Figure 25: P10 Values for FP-Cosine Method
Figure 26: Map Values for Top 10, Top 20 and Top 40 added Concepts for Sub-AP based on α for Disorders
Figure 27: R-precision Values for Top 10, Top 20 and Top 40 added Concepts for Sub-AP based on α for Disorders
Figure 28: bpref Values for Top 10, Top 20 and Top 40 added Concepts for Sub-AP based on α for Disorders
Figure 29: P5 Values for Top 10, Top 20 and Top 40 added Concepts for Sub-AP based on α for Disorders
Figure 30: P10 Values for Top 10, Top 20 and Top 40 added Concepts for Sub-AP based on α for Disorders
Figure 31: Map Values for Top 10, Top 20 and Top 40 added Concepts for Sub-AP based on α For Procedure
Figure 32: R-precision Values for Top 10, Top 20 and Top 40 added Concepts for Sub-AP based on α For Procedure
Figure 33: bpref Values for Top 10, Top 20 and Top 40 added Concepts for Sub-AP based on α For Procedure
Figure 34: P5 Values for Top 10, Top 20 and Top 40 added Concepts for Sub-AP based on α For Procedure
Figure 35: P10 Values for Top 10, Top 20 and Top 40 added Concepts for Sub-AP based on α For Procedure
Figure 36: Comparison MAP values for ICD9-TOP20 method and Baseline
Figure 37: Comparison R-precision values for ICD9-TOP20 method and Baseline
Figure 38: Comparison Bpref values for ICD9-TOP20 method and Baseline
Figure 39: Comparison P5 values for ICD9-TOP20 method and Baseline
Figure 40: Comparison P10 values for ICD9-TOP20 method and Baseline
Figure 41: Comparison of Evaluation Measures Between ICD9-Top20 and Baseline for All the Queries
Figure 42: Comparison MAP values for ICD9-TOP50 method and Baseline
Figure 43: Comparison R-precision values for ICD9-TOP50 method and Baseline
Figure 44: Comparison bpref values for ICD9-TOP50 method and Baseline
Figure 45: Comparison P5 values for ICD9-TOP50 method and Baseline
Figure 46: Comparison P10 values for ICD9-TOP50 method and Baseline
Figure 47: Comparison of Evaluation Measures Between ICD9-Top20 and Baseline for All the Queries
Figure 48: Comparison of all Evaluation Measures Between Sub-AP, ICD9-Top20 and Baseline
1 Introduction
1.1 Background
Information Retrieval is the act of retrieving specific information that is required from a set of resources, or a collection of documents. In other words, it is the act of separating relevant information from irrelevant information in a collection of data. A great example of this in action is Google's search engine. Through this service, people are able to use search terms, or queries, in order to retrieve the relevant information they are looking for. Google's data collection is massive, and the whole point of the search engine is to separate relevant data from irrelevant data. This is information retrieval at its heart.
In the field of Information Retrieval, there are many great challenges to be solved, and many organizations exist to aid in solving them. TREC, which stands for Text REtrieval Conference (TREC, http://trec.nist.gov), is a major conference that provides standardization for researchers in the field of Information Retrieval. It provides complete datasets and benchmark information in order to give Information Retrieval researchers a standard baseline for their work. Once a year, TREC releases queries and documents that researchers use in order to perform research. Furthermore, TREC provides a gold standard that is created by extracting the relevant reports for each topic (query). The judges are chosen from the medical community, or from general participants of the TREC Medical Track who have significant medical knowledge; normally, the chosen participants possess a PhD in the life sciences [13].
One of the main motivations for the existence of TREC is the collection of documents that it provides to researchers. A "collection of documents" is the raw data on which researchers test novel Information Retrieval methods. TREC sets the standard for document collections and performance baselines. In the early days of Information Retrieval research, document collections were typically very small. During the 1990s, collections were only as big as a few megabytes, or about as much information as is present in a 20-page essay [36]. Thus, they did not represent real-world document collections. The second problem was that sharing such collections was very difficult; researchers were not able to efficiently share their data amongst themselves. Furthermore, most companies had their own databases but were not interested in sharing them with other researchers. Hence, TREC was designed to help industrial, academic and government researchers work and compete together efficiently in a standardized system. This way, researchers can correctly compare the accuracy and speed of their Information Retrieval algorithms. TREC defines itself not only as an annual competition, but also as a way to share ideas and techniques for performing successful Information Retrieval [22]. The ultimate goal of TREC is to advance the field of Information Retrieval.
1.2 Motivation
Information Retrieval systems generally take a query, or a set of queries, as input and produce as output the part of the data that is most relevant to that query. Medical Information Retrieval is a subset of Information Retrieval that is designed and optimized specifically for medical data. Medical literature can be one of the hardest kinds of text to perform information retrieval on, because of issues such as synonyms and the very long keywords that are common in the medical field. Yet medical text retrieval is one of the most essential and critical parts of Information Retrieval. It is extremely important because it has the potential to reduce the spread of diseases and ultimately save lives. Furthermore, it is needed to find the health-related information that doctors and medical experts require, sometimes very urgently. Therefore, Medical Information Retrieval requires not only impeccable accuracy, but also very high speed. Medical text information retrieval can be broken down into three major steps: a) identifying information sources, b) using those sources to retrieve relevant information and, in the end, c) analyzing, understanding and using the information in order to deliver the correct diagnoses [13]. In short, the purpose of medical text information retrieval is to help decide which treatment or which medicine to prescribe; this decision can be made by the patient, the patient's family, or the health care professional.
Various factors can affect health-related decisions. Most important is that medical decisions be based on evidence and facts. The field of medical information retrieval hopes to discover that evidence and those facts.
1.2.1 E-Health Initiatives
In Canada, e-health has become a huge movement that is tasked with converting almost all medical records and data to an electronic format. The 2012 study commissioned by Canada Health Infoway (https://www.infoway-inforoute.ca/) found that the use of Electronic Medical Records helped save taxpayers $177 million in the year 2012 alone [31]. There are many more initiatives to store health records in an electronic format. Every province in Canada has at least one initiative to bring ambulatory, pharmaceutical, and general health records to an electronic format. All this wealth of information will require state-of-the-art IT infrastructure in order to handle the data. Amongst the IT requirements, Medical Information Retrieval systems will have a very high priority. As more and more information is collected, medical professionals will need
daunting task, and the field of Medical Information Retrieval has never been so important
in the history of Canada.
There are also several initiatives around the world that aim to advance the medical field by transitioning it to the electronic age. The GSMA mHealth Program (www.gsma.com) reports that the global eHealth market is valued at around $160 billion, with a growth rate of up to 16% [21]. This shows the need for better and improved healthcare IT systems in the near future.
1.3 Problem Definition
My work relies on the subset of the TREC dataset that is related to the medical field³, also known as the TREC Medical Records Track⁴. The main goal of my research is to improve the relevancy of medical data retrieval from a set of documents based on a medical query given by the user. The medical query can consist of one or more words⁵.
The problem faced today in the field of Medical Information Retrieval is the low accuracy of search results. This is mainly due to the very specific and unusual requirements of medical terms; in general, medical search terms are too complicated to be processed by regular information retrieval methods. As such, the field of Medical Information Retrieval aims to design information retrieval solutions specific to the medical field.
3 The TREC datasets are divided into different "tracks". The particular track I will be working with is the Medical Records Track. The goal of the Medical Records Track is "to foster research on providing content-based access to the free-text fields of electronic medical records."
4 http://trec.nist.gov/tracks.html
5 In this context, the queries, or "words", need not be actual English words appearing in the dictionary. Any number of letters grouped together can form a single "word".
1.3.1 Query Model
The query model in Medical Information Retrieval is distinctive. In general-purpose search (e.g., Google), results are matched against the exact query terms. For instance, searching "York University" in Google will return results that match "york", "university", or both "york" and "university". This is usually the ideal behaviour for non-medical datasets. In medical datasets, however, relevant results need not match the query terms directly. For instance, if the medical query is "heart disease", then the results must include not only results that match "heart" and "disease", but also results that mention actual types of heart disease (e.g., cardiovascular disease, cardiac dysrhythmias). Furthermore, there might be a need to include related diseases as well. Finally, there might also be a need to include other diseases or illnesses that might contribute to heart disease (e.g., atherosclerosis, hypertension). These use cases show that regular searching algorithms are not sufficient for the field of Medical Information Retrieval.
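The contrast between literal term matching and the concept-aware matching described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used in this thesis: the `RELATED_CONCEPTS` table and both function names are hypothetical, and the table simply reuses the diseases named in the paragraph above.

```python
# Hypothetical lookup table (an assumption for illustration only):
# a query phrase mapped to related medical concepts.
RELATED_CONCEPTS = {
    "heart disease": [
        "cardiovascular disease",
        "cardiac dysrhythmias",
        "atherosclerosis",
        "hypertension",
    ],
}


def literal_match(query, document):
    """Return True if any query term appears as a whole word in the document."""
    doc_terms = set(document.lower().split())
    return any(term in doc_terms for term in query.lower().split())


def concept_match(query, document):
    """Return True on a literal match or if a related concept phrase occurs."""
    if literal_match(query, document):
        return True
    text = document.lower()
    return any(phrase in text for phrase in RELATED_CONCEPTS.get(query.lower(), []))


report = "Patient has a long history of hypertension and atherosclerosis."
print(literal_match("heart disease", report))  # False
print(concept_match("heart disease", report))  # True
```

The report never mentions "heart" or "disease", so a literal matcher misses it, while the concept-aware matcher retrieves it through the related terms.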
1.3.2 Medical Dataset Model
Another distinctive characteristic of Medical Information Retrieval is the design of the medical dataset. Generally speaking, and in the case of the TREC medical dataset, the medical dataset is divided into hundreds of thousands of patient visit records. These records are not grouped together for a single patient; a patient can have multiple records, covering each of their multiple visits to a medical professional. These files are usually in the XML⁶ format, which makes the job of the searching algorithm a little easier. Figure 1 shows a sample report from the TREC Medical Records Track dataset. The XML tags in this sample report include PROCEDURE, COMPLICATIONS, and DESCRIPTION OF OPERATION.
PROCEDURES:
TITLE OF OPERATION: IRRIGATION AND DEBRIDEMENT OF LEFT KNEE. ANESTHESIA: General.
COMPLICATIONS: None.
PREOPERATIVE DIAGNOSIS(ES): SEPTIC ARTHRITIS, LEFT KNEE. POSTOPERATIVE DIAGNOSIS(ES): SEPTIC ARTHRITIS, LEFT KNEE.
HISTORY WITH INDICATIONS: The patient is a **AGE[in 60s]-year-old female with a history of end-stage renal disease and hemodialysis with vasculopathy who by history, examination, and laboratory studies had a septic arthritis of the left knee. Preoperatively, I spoke to the patient at great length. I spoke to her and her daughter about the risks and benefits of surgical intervention. We talked about complications of anesthesia, septic arthritis, continued pain, neurovascular surgery, need for future surgeries, soft tissue complications etc. I explained to that irrigation and debridement of septic arthritis is indicated and we talked about this at great length. After thorough a discussion about the risks and benefits of surgery, the patient gave informed consent.
DESCRIPTION OF OPERATION: The patient was identified as the patient. She was taken to the operating room where she was placed supine on a table. Anesthesia had attempted to place a block; however, this did not work and therefore she needed to be intubated. After successful intubation, a nonsterile tourniquet was carefully placed high in the left thigh. The left leg was then prepped and draped in the usual sterile fashion while making sure to isolate the left foot on which she had surgery a few days prior. The leg was elevated for 120 seconds and then the tourniquet was inflated. A small, approximately 5 cm parapatellar arthrotomy was performed sharply with a knife. This was taken down into the joint sharply. Immediately significant amount of cloudy-looking fluid came out of the knee. This was sent for culture. After evacuating the fluid, the knee was pulse irrigated with 3 L of solution. After this, we reexamined the knee. There was no further sign of purulence. The skin bleeders were coagulated. Again, 3 more liters of pulse irrigation were used to clean out the knee. After successfully accomplishing this, the arthrotomy was closed with 0 Vicryl in a watertight fashion. The skin was then closed carefully with interrupted 3-0 nylon sutures. A sterile consisting of Xeroform, 4x4's, Webril, and Ace wrap were applied.
The patient was awakened from anesthesia. Earlier the tourniquet had been deflated prior to closure.
There were no complications during this procedure. She was brought to the PACU in a stable condition.
Figure 1: Sample Report
6 XML stands for Extensible Markup Language. It is a markup language that uses tags to divide a document into its different sections, so that a human-readable document also becomes machine-readable.
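To illustrate why tagged records make the searching algorithm's job easier, the sketch below parses one XML report into named fields using Python's standard library. It is a hedged example only: the element names `report`, `visit_id`, `type`, and `text` are hypothetical and are not taken from the actual TREC files, whose layout may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout (an assumption); the real TREC files may use
# different tag names and nesting.
SAMPLE = """
<report>
  <visit_id>V001</visit_id>
  <type>OPERATIVE</type>
  <text>PROCEDURES: IRRIGATION AND DEBRIDEMENT OF LEFT KNEE. COMPLICATIONS: None.</text>
</report>
"""


def parse_report(xml_string):
    """Turn one XML report into a plain dictionary mapping tag name to text."""
    root = ET.fromstring(xml_string.strip())
    return {child.tag: (child.text or "").strip() for child in root}


record = parse_report(SAMPLE)
print(record["visit_id"])  # V001
```

Once each report is a dictionary, an indexer can treat individual sections separately instead of working on one undifferentiated block of text.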
The Medical TREC challenge also provides topics, or queries. These topics were created by physicians who were also students in the Oregon Health & Science University (OHSU) Biomedical Informatics Graduate Program. Their goal was to develop a number of topics that each matched a reasonable number of visits (more than a few but fewer than several hundred) in the record set [13]. In Medical TREC 2011, they first chose 35 queries and then reduced the set to 34. These were the queries that best fit the collection.
1.4 Contribution
The contribution of this thesis is a novel method for medical information retrieval that improves the relevancy of search results by 3% to 20%. My approach adapts existing Information Retrieval concepts and methods to the field of Medical Information Retrieval.
My approach uses a combination of ICD-9 codes, query expansion methods, subtype filtering, and re-ranking methods in order to find the most relevant results in a large dataset. ICD-9 codes are standardized codes for all the different types of diseases and medical terms used by medical professionals (http://icd9cm.chrisendres.com/). Some examples of ICD-9 codes and their descriptions are shown in Table 1.
Table 1: Sample of ICD-9 codes
1.4.1 Query Expansion
Query Expansion is defined as a method of "reformulating" an existing query, in order to improve the relevancy of the search results. For instance, if a given query is "heart sickness", the query expansion algorithm might reformulate the query into "heart cardiac disease" in the hopes of improving the relevancy of the information retrieval.
In my work, I use query expansion by extracting key terms from a set of already retrieved documents to reformulate the initial query and, ultimately, to improve the accuracy of the search results. The set of already retrieved documents is produced by the current baseline information retrieval system (i.e., Terrier).
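The idea of reformulating a query from already retrieved documents can be sketched as below. This is a simplified illustration, not the method evaluated in this thesis: it selects expansion terms by raw frequency, and the stop-word list, the function name `expand_query`, and the sample documents are all assumptions made for the example.

```python
from collections import Counter

# Small illustrative stop-word list (an assumption, not the list used in the thesis).
STOP_WORDS = {"the", "of", "a", "and", "with", "was", "for", "in", "to"}


def expand_query(query, top_documents, num_terms=3):
    """Append the most frequent new, non-stop-word terms from the top documents."""
    query_terms = query.lower().split()
    counts = Counter()
    for doc in top_documents:
        for term in doc.lower().split():
            if term not in STOP_WORDS and term not in query_terms:
                counts[term] += 1
    new_terms = [term for term, _ in counts.most_common(num_terms)]
    return query_terms + new_terms


# Stand-ins for the top documents returned by a baseline retrieval run.
top_docs = [
    "cardiac arrest treated with cardiac massage",
    "history of cardiac disease and hypertension",
]
print(expand_query("heart sickness", top_docs, num_terms=1))
# ['heart', 'sickness', 'cardiac']
```

Here "cardiac" occurs three times in the top documents and is therefore the first term added to the original query.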
Subtype filtering is a method that keeps only the concepts belonging to specific subtypes of medical concepts. The subtypes that we considered are the most important ones, namely disease and procedure. These are the subtypes that appear as a question or statement in the queries. This kind of filtering can be useful for discarding unnecessary concepts that can lower the precision of the retrieved information.
Another way of using the information retrieved by the baseline to improve the results is re-ranking, which applies similarity measures between the results and the query in order to re-rank the results based on a new weighting model.
1.4.2 Methods Proposed
In total, I propose three different methods for improving Medical Information Retrieval. The first method (Chapter 3) is a re-ranking method based on the cosine similarity⁹ between the queries and the data in the retrieved medical reports. This method involves applying the Apriori¹⁰ and the FP-Growth¹¹ algorithms in order to expand the initial given query. Using cosine similarity, and the concepts produced by the Apriori and the FP-Growth algorithms, the retrieval process is run again, this time with an expanded query. The process completes with a set of document results that have been re-ranked.
9 Cosine similarity is defined as the similarity of two vectors, measured by the cosine of the angle between them. In other words, it measures how similar two sets of information are with respect to a single subject matter.
10 The Apriori algorithm applies association rule mining to a dataset. This algorithm is further explained in section 3.1.2.1.
11 The FP-Growth algorithm applies association rule mining to a dataset. FP stands for frequent pattern. This algorithm is further explained in section 3.1.2.2.
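The re-ranking step of the first method can be sketched as follows. This is a minimal sketch under simplifying assumptions: it builds vectors from raw term counts rather than the TF.IDF weights used in Chapter 3, it takes the expanded query as given, and the function names and sample reports are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(query_terms, reports):
    """Order report ids by descending cosine similarity to the query."""
    query_vec = Counter(query_terms)
    scored = [
        (cosine_similarity(query_vec, Counter(text.lower().split())), report_id)
        for report_id, text in reports.items()
    ]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [report_id for _, report_id in ranked]


reports = {
    "r1": "knee pain",
    "r2": "septic arthritis of the left knee",
}
print(rerank(["septic", "arthritis", "knee"], reports))  # ['r2', 'r1']
```

Report r2 shares all three query terms and scores about 0.71, while r1 shares only one term and scores about 0.41, so r2 is ranked first.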
The second method (Chapter 4) uses query reformulation with subtype filtering and the Apriori algorithm in order to improve medical information retrieval. This method uses the documents retrieved by the baseline information retrieval system in order to find the subtypes within them. In turn, using the Apriori algorithm, the subtypes are ranked based on the index; the highest ranked subtypes are then used to reformulate the original query.
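The combination of subtype filtering and Apriori can be sketched as follows. This is a generic, textbook-style Apriori over small in-memory data, offered only as an illustration and not as the scripts used in this thesis; the subtype tags ("disorder", "procedure", "spatial"), the function names, and the sample concepts are assumptions.

```python
from itertools import combinations


def filter_by_subtype(concepts, allowed=("disorder", "procedure")):
    """Keep only the concept names whose subtype tag is in `allowed`."""
    return [name for name, subtype in concepts if subtype in allowed]


def apriori(transactions, min_support):
    """Return {itemset: support count} for all itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = {frozenset([item]) for item in items}
    frequent = {}
    size = 1
    while current:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Join step: build candidates one item larger from the survivors.
        size += 1
        candidates = {a | b for a in survivors for b in survivors if len(a | b) == size}
        # Prune step: every smaller subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in survivors for s in combinations(c, size - 1))
        }
    return frequent


# Hypothetical (concept, subtype) pairs extracted from three top-ranked reports.
docs = [
    [("septic arthritis", "disorder"), ("debridement", "procedure"), ("left", "spatial")],
    [("septic arthritis", "disorder"), ("debridement", "procedure")],
    [("septic arthritis", "disorder"), ("hemodialysis", "procedure")],
]
transactions = [filter_by_subtype(d) for d in docs]
frequent = apriori(transactions, min_support=2)
print(frequent[frozenset(["septic arthritis", "debridement"])])  # 2
```

Concepts whose subtype is filtered out ("left") never enter the mining step, and concepts that appear in fewer than two reports ("hemodialysis") are dropped by the support threshold.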
The third method (Chapter 5) applies ICD-9 codes and query expansion techniques in order to improve relevancy in medical information retrieval. This method uses the descriptions of the top-ranked ICD-9 codes found in the baseline results in order to expand the queries. The expanded queries are then re-run in order to improve the accuracy of the retrieved documents.
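The core of the third method can be sketched as below: count how often each ICD-9 code occurs in the top baseline reports, keep the most frequent codes, and append their descriptions to the query. This is a simplified illustration that omits the term-weighting step of Chapter 5; the function names and the sample report data are assumptions, and the two code descriptions are included purely as examples.

```python
from collections import Counter


def top_icd9_codes(top_reports, num_codes=2):
    """Rank ICD-9 codes by the number of top reports that carry them."""
    counts = Counter(code for codes in top_reports for code in set(codes))
    return [code for code, _ in counts.most_common(num_codes)]


def expand_with_descriptions(query, codes, descriptions):
    """Append the description text of each selected code to the query."""
    extra = [descriptions[code] for code in codes if code in descriptions]
    return " ".join([query] + extra)


# Illustrative data (an assumption): ICD-9 codes attached to three top reports.
top_reports = [
    ["428.0", "401.9"],
    ["428.0"],
    ["428.0", "401.9", "250.00"],
]
descriptions = {
    "428.0": "congestive heart failure unspecified",
    "401.9": "unspecified essential hypertension",
}

codes = top_icd9_codes(top_reports, num_codes=2)
print(codes)  # ['428.0', '401.9']
print(expand_with_descriptions("heart disease", codes, descriptions))
# heart disease congestive heart failure unspecified unspecified essential hypertension
```

The expanded query would then be submitted to the retrieval system again in place of the original query.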
1.5 Findings
For each of the three methods I have performed, the findings are explained below:
1.5.1 Method 1: APFP-Cosine Algorithm
In this method, we found the results to be negative. In the best case, this method lowered accuracy by 26% to 71%. We believe that the reason for this extreme regression in accuracy is that applying cosine similarity is not a suitable solution for the case of medical information retrieval. These results show the lack of relationship between related medical concepts.
1.5.2 Method 2: Sub-AP Algorithm
In this method, the results showed improvements in all the evaluation measures. The improvements show an increase in accuracy from 3% to 17%. These results are strong and show the power of applying subtype filtering in the domain of Medical Information Retrieval.
1.5.3 Method 3: ICD9-Top Algorithm
In this method, applying ICD-9 code descriptions had a highly positive impact on our evaluation measures. This method produced the largest improvement among the proposed methods: it showed an increase in accuracy of 7% to 20%. These results are very promising and highlight the power of using ICD-9 codes in the domain of Medical Information Retrieval systems.
1.6 Thesis Structure
In the next chapter, we review the literature related to this research work and illustrate how the work in this research project differs from previous research work. Chapter 3 describes our first proposed method, which is based on cosine similarity measures; we call this method the APFP-Cosine method. Chapter 4 describes our second proposed method, which is based on the subtypes of the concepts from the retrieved results; we call this method the Sub-AP method. Our third and final proposed method is explained in Chapter 5. This method is based on the combination of query expansion techniques and ICD-9 codes from the top retrieved results; we call this method ICD9-Top. Chapter 6 presents our evaluation measures, which are largely based on the TREC evaluation measures. The rest of the thesis covers the results and the discussion (Chapter 7 and Chapter 8). Chapter 9 presents the conclusion and future work related to the proposed methodologies.
2 Literature Review
2.1 Re-ranking based on Similarity Calculation
In [29], the authors proposed a new scoring method in order to re-rank the documents returned by an information retrieval system (e.g., a search engine). This scoring method maps each retrieved report to a single patient visit, and each visit is scored according to the number of retrieved reports that point to it: the more reports retrieved for a visit, the higher that visit's score in the re-ranking.
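The visit-level scoring idea summarized above can be sketched in a few lines. This is only our reading of that description, not the code of [29]; the function name and the report-to-visit mapping are hypothetical.

```python
from collections import Counter


def rank_visits(retrieved_reports, report_to_visit):
    """Score each visit by the number of retrieved reports that map to it."""
    scores = Counter(
        report_to_visit[report]
        for report in retrieved_reports
        if report in report_to_visit
    )
    # most_common() returns (visit, score) pairs, highest score first.
    return scores.most_common()


# Hypothetical mapping from report ids to visit ids.
report_to_visit = {"r1": "v1", "r2": "v2", "r3": "v1", "r4": "v1"}
print(rank_visits(["r1", "r2", "r3", "r4"], report_to_visit))
# [('v1', 3), ('v2', 1)]
```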
Another interesting step that the researchers took was to increase the size of the index. This is a pre-computation step that is done in hopes of bettering the results of the retrieved documents. The researchers expanded the index database by using ICD-9 codes in order to better understand the medical records, and increase the knowledge base of the index.
2.2 Query Expansion
Over recent decades, the volume of medical literature has increased dramatically. With the general goal of improving medical information retrieval methods, several different types of search engines have been designed for medical text retrieval; each search engine uses a different method in order to improve the retrieval accuracy. Apart from the general limitations and challenges of information retrieval, in Medical Information Retrieval the problem is further compounded by the fact that many users of medical information retrieval systems might not have enough domain knowledge. Therefore, it becomes increasingly difficult to understand the user's query and to provide the results he/she is actually looking for.
This particular problem of "lack of domain knowledge" has been partially solved by a technique called Query Expansion. This technique was proposed by the authors in [42], [20], [40], [46], [7], [27] and [23]. Query Expansion is the process of re-creating the seed query in order to improve the information retrieval results. The process of re-creating the seed query usually involves adding terms to the query, based on the initial retrieved results from the initial seed query. Using Query Expansion, the authors in [38], have shown an improvement of 20-30%. This is an immense improvement that is seldom seen in Information Retrieval research nowadays.
Query expansion becomes especially important when there is a word mismatch, that is, when users and authors use different words to describe the same concept. When a query contains a commonly used word, the query will match more documents, and possibly incorrect ones, due to the word mismatch problem. As Xu has stated, this word mismatch problem becomes very serious when the queries are shorter and spontaneous; shorter queries have a higher chance of containing terms that co-occur in a larger set of documents [16]. Nowadays, because of the Internet, shorter queries are more common than longer queries (e.g., casual users using Google). Mismatch has been a big problem in the Information Retrieval field for a long time.
2.2.1 Automatic Query Expansion
There are two major types of Query Expansion techniques: automatic and manual [27].
Automatic techniques have a major advantage over manual ones, such as relevance feedback and manual thesauri, because they require no effort from the user. Automatic Query Expansion techniques can be divided into two categories, global and local [16].
Global Automatic Query Expansion techniques need corpus-wide statistics, which can take a large amount of computer resources (e.g., time, memory, energy). For instance, we would need the co-occurrence data values for all possible pairs of terms in a collection. If the collection contains ten thousand documents (a very small collection by today's standards), then there are about 3,124,998,800,000 total pairs of words¹³. If we were to store just one byte of information for each pair of words, then we would need at least 12 terabytes of storage¹⁴.
On the other hand, Local Automatic Query Expansion techniques only process a small number of top-ranked documents that are retrieved for that specific query. A local technique may use some global statistics, such as the document frequency of a term, which should be cheap to store and calculate. In this technique, the source of expansion terms is the set of top-ranked documents. In other words, for Local Automatic Query Expansion techniques, we first run the regular Information Retrieval system and then use the top-ranked results as a starting point for the query expansion.
The simplest local technique is local feedback, which assumes the top-ranked documents are relevant and uses the standard relevance feedback for query expansion procedures.
The relevance feedback process, which was first introduced in the mid-1960s, is a controlled automatic process for query reformulation (i.e., query expansion) [14]. There is another similar technique, which was proposed earlier in [41]. Here, the information from the top-ranked documents is used to estimate the probability of query terms in the relevant set for that specific query. These terms are usually the most frequent terms in the top-ranked documents, excluding stop words. Stop words are words that are filtered out before or after the processing of natural language data (text). Stop words can cause problems when searching for phrases that include them. Table 2 shows a partial list of Stop Words.
Lately, research shows that local feedback generally has a good impact on retrieval performance. However, some have shown that it can degrade the retrieval performance if the top-ranked documents are not relevant to the query [20]. In this case, a small problem in the retrieval process is compounded by the assumption that the initial set of retrieved documents truly contains the most relevant documents.
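Local feedback, as described above, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: documents are plain strings split on whitespace, the stop word set is a small subset of Table 2, and the function name `expand_query` and its parameters `top_docs` and `n_terms` are hypothetical.

```python
from collections import Counter

# A small subset of the stop words in Table 2 (illustrative only).
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "it", "its",
              "this", "those", "to", "was", "we", "were"}


def expand_query(query_terms, ranked_docs, top_docs=10, n_terms=10):
    """Add the most frequent non-stop-word terms of the top-ranked
    documents to the query (local feedback).

    Assumption: the top-ranked documents are treated as relevant.
    """
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        for token in doc.lower().split():
            if token not in STOP_WORDS and token not in query_terms:
                counts[token] += 1
    new_terms = [term for term, _ in counts.most_common(n_terms)]
    return list(query_terms) + new_terms


if __name__ == "__main__":
    docs = ["chest pain and fever", "fever and cough was chest", "fever"]
    print(expand_query(["pain"], docs, top_docs=3, n_terms=2))
```

If the top-ranked documents are not relevant, the added terms will drift away from the user's intent, which is the degradation risk noted in [20].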
The idea of using the initial top-ranked documents for query retrieval improvement was first proposed in 1977. In [35], term clustering had been used on the top-ranked retrieved documents, and the results were used for query expansion. Term Clustering is a simple process of grouping together the most commonly appearing terms in a set of documents.
a           it            these
about       its           they
again       itself        this
all         just          those
almost      (unreadable)  through
also        from          thus
although    made          to
always      mainly        upon
among       make          use
an          may           used
and         mg            using
another     might         various
any         m             very
are         my            was
as          most          we
at          mostly        were
Table 2: Partial List of Stop Words
There are many more Automatic Query Expansion techniques. In [32], three different strategies are proposed for Automatic Query Expansion: Synonym-based, Topic model-based and Predication-based. In the first method (Synonym-based), the authors used a few selected UMLS source vocabularies and included lexical variants in the expanded queries. The UMLS is the Unified Medical Language System; it includes a set of files and software that brings together many health and medical vocabularies and standards into one database, in order to enable interoperability among computer systems. In the second method (Topic model-based), the authors added related terms based on a topic model trained on 100,000 clinical documents. The third and final method is the Predication-based expansion technique, which uses a large predication database extracted from global medical literature by a natural language processing (NLP) system called SemRep. The authors in [32] showed that all three methods resulted in improvements of up to 23% in comparison with the baseline.
In [6], another method for applying query expansion techniques to data from the TREC Medical Track dataset was proposed. The authors suggested using two external sources: the UMLS database and Cengage Learning's collection of medical reference encyclopedias. The research showed an improvement of 6% in some of the evaluation measures.
Furthermore, the authors in [18] proposed two new approaches for query expansion. The first approach is the Default Query Expansion, which uses terms from the top-ranked documents. In this method, the authors collected the most frequent terms from the top 10 documents; in this case, 10 terms are added to each query. The second approach is called Concept Extraction; in this method, the authors annotated the queries with concepts from the UMLS database using the MetaMap system. MetaMap is a widely available program that provides access to the concepts of the UMLS Metathesaurus¹⁵ in medical text. MetaMap was developed in order to improve medical text retrieval, especially the retrieval of MEDLINE¹⁶/PubMed¹⁷ citations [1].
The authors in [18] used the MetaMap concept list and the short descriptions of the concepts to perform query expansion. Their results showed that the second approach had slight improvements in precision. However, the performance decreased in the general case, when compared to the default query expansion technique.
2.2.2 Manual Query Expansion
In [37], researchers tried both manual and automatic query expansion techniques in conjunction with the addition of ICD-9 codes, based on the original queries, using the MeSH¹⁸ and MetaMap databases. In the results, the manual run performed well, but the automatic run did not perform particularly well.
¹⁵ Metathesaurus is a National Cancer Institute browser containing different biomedical vocabularies, including the International Classification of Diseases for Oncology.
¹⁶ MEDLINE (Medical Literature Analysis and Retrieval System Online) is a bibliographic database of life sciences and biomedical information.
¹⁷ PubMed is a free database accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics.
In [38], the authors used ICD-9 codes in order to expand the queries. The authors collected the ICD-9 code definitions and their relationships from the UMLS database, and then added the newly learned definitions to the original queries. For the experiments, the authors performed three separate runs, each executing the expanded queries against a different part of the documents¹⁹. These methods improved their runs in comparison with the baseline.
2.2.3 Association Rule Mining Algorithms
Association rule mining algorithms have been used within query expansion techniques in many different ways. Association rule mining consists of two phases: Frequent Itemset Discovery and Rule Generation. Rule Generation uses the mined frequent itemsets in order to generate rules. This step is trivial and takes very little time, only 1-2% of the time of the entire information retrieval process. Thus, association rule mining algorithms focus on finding frequent itemsets. A frequent itemset is defined as an itemset whose support is higher than the user-specified minimum support.
¹⁸ Medical Subject Headings (MeSH) is a comprehensive controlled vocabulary for the purpose of indexing journal articles and books in the life sciences; it can also serve as a thesaurus that facilitates searching.
In the field of Information Retrieval, the two most popular frequent itemset mining techniques are Eclat and Apriori. Eclat is a depth-first search with set intersection at its heart. In Eclat, for each item, we store a list of transaction IDs, or TIDs, in a vertical layout. The advantage of the Eclat algorithm is that it has very fast support counting. However, the disadvantage is that the intermediary TID lists can become too massive to hold in memory [30]. The Apriori algorithm [34], on the other hand, uses a breadth-first approach and data structures such as a Hash Tree in order to count itemsets efficiently. Apriori's "bottom-up" approach and pruning techniques make it the most viable and popular choice for frequent itemset mining. Researchers have also proposed the DHP [19], PARTITION [1] and CD [33] algorithms, and many more algorithms have been proposed or used in other work [3]. The best characteristics of the Apriori algorithm are its simplicity, its performance, and its scalability to large data sets.
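To make the vertical layout of Eclat concrete, the following minimal Python sketch stores one TID set per item and obtains the support of a longer itemset by intersecting TID sets during a depth-first search. It is an illustrative sketch and not the implementation used in the cited work; the function name `eclat` and the use of an absolute support count are assumptions.

```python
from collections import defaultdict


def eclat(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets.

    min_support is an absolute count (an assumption of this sketch).
    """
    # Vertical layout: item -> set of transaction IDs containing it.
    tidlists = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists[item].add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # Depth-first search over itemsets that share the same prefix.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix | {item}
            frequent[itemset] = len(tids)
            suffix = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids  # support via set intersection
                if len(shared) >= min_support:
                    suffix.append((other, shared))
            if suffix:
                extend(itemset, suffix)

    start = sorted(
        ((item, tids) for item, tids in tidlists.items()
         if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    extend(frozenset(), start)
    return frequent


if __name__ == "__main__":
    data = [{"fever", "cough"}, {"fever", "cough", "pain"},
            {"fever", "pain"}, {"cough"}]
    for itemset, count in eclat(data, min_support=2).items():
        print(sorted(itemset), count)
```

The intermediate `shared` sets are the TID lists that, on a large collection, can grow too large to hold in memory.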
Another powerful and popular algorithm in the data mining field is FP-Growth, which finds frequent itemsets without using Candidate Generation. Candidate Generation is the "bottom-up" technique used by the Apriori algorithm, whereby each term is considered as a candidate for one of the available item sets. It has been shown that Candidate Generation is very time-consuming and memory-consuming, and hence the FP-Growth algorithm achieves much better performance.
2.3 Weighting Models
In query expansion methods, there are different types of weighting models that can be used. In [45], the authors presented an alternative way to apply query expansion: attaching weights to query terms based on each term's distribution among the various categories. The authors presented the normalized entropy (NE) method to determine the special category for each term, and derived two supervised weighting schemes. Using the TREC dataset, they showed that the schemes that incorporate the traditional IDF perform significantly better on queries that contain more than a few specific terms, and achieve competitive results on short and well-defined queries.
3 APFP-Cosine Method
The TREC Medical track dataset contains a corpus of about 101 thousand anonymous medical records from the University of Pittsburgh NLP²⁰ Repository. The collection consists of one month of structured reports from multiple hospitals and includes nine types of reports from different departments in those hospitals. The medical report types are Radiology reports, History and Physicals reports, Consultation reports, Emergency Department reports, Progress Note reports, Discharge Summary reports, Operative reports, Surgical Pathology reports, and Cardiology reports. Each of these medical reports can be mapped to one of 17,265 patient visits. A patient visit is an individual stay at a hospital by a patient. Each report belongs to a single patient; however, many reports can belong to a single patient [29].
In this research, we developed a new methodology based on Association Rule Mining and cosine similarity measures. We wanted to find similarity measures between the assigned queries and the medical reports in order to improve the retrieval accuracy, compared to the baseline results.
The Apriori and FP-Growth algorithms, in combination with cosine similarity, are the foundations of the APFP-Cosine methodology.
The entire APFP-Cosine Re-ranking process is shown graphically and in detail in Figure 2.
3.1 APFP-Cosine Methodology Steps
3.1.1 Indexing Based on Using UMLS and BioLabeler
As we mentioned earlier, the initial queries contain the sentences and words that describe the medical conditions, treatments, diseases, or procedures that a user is searching for. In order to proceed to the next step, we first need to perform conceptual indexing²¹ for the queries and the reports. The resulting index is based on UMLS and an online biomedical text mining tool named BioLabeler²², which associates UMLS concepts with the data in any given text [25].
The produced index file contains rows and columns of data that will be used for our weighting algorithms. Each row represents one of the concepts found in a report, and the columns give extra information about that concept, in this order: report ID, concept, number of occurrences of the concept in that specific report, number of reports that include the concept, number of occurrences of the concept in the whole collection, and the type of the concept (for example, a disorder, a procedure, or anatomy). This index file provides the parameters that are needed for the weight calculations for each concept in the next step.
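The column layout of the index file can be illustrated with a short Python sketch. The on-disk format is not specified in this text, so the sketch assumes one tab-separated row per line; the class name `IndexRow`, the field names, and the concept identifier in the example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IndexRow:
    report_id: str
    concept: str
    count_in_report: int       # occurrences of the concept in this report
    reports_with_concept: int  # number of reports that include the concept
    count_in_collection: int   # occurrences in the whole collection
    concept_type: str          # e.g. DISO, PROC, ANAT


def parse_index_line(line, sep="\t"):
    """Parse one row of the index file (assumed tab-separated)."""
    (report_id, concept, in_report,
     n_reports, in_collection, concept_type) = line.rstrip("\n").split(sep)
    return IndexRow(report_id, concept, int(in_report),
                    int(n_reports), int(in_collection), concept_type)


if __name__ == "__main__":
    # "C0000001" is a made-up identifier used only for illustration.
    print(parse_index_line("R1\tC0000001\t3\t120\t450\tDISO\n"))
```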
²¹ Conceptual Indexing is the act of indexing a collection of documents based on a set of concepts.
3.1.2 Using Association Rule Mining for Query Expansion
Association rule mining is a well-known method for finding the relationships among different variables in very large data collections. The goal is to find strong relationships between terms in a data collection. Association rule mining is used in various applications, including web mining, intrusion detection, continuous production and bioinformatics.
3.1.2.1 Apriori Algorithm
The Apriori algorithm is one of the most effective algorithms for dividing a large data collection into smaller sets of related terms. These sets of closely related terms are called frequent item sets. The Apriori algorithm is used to look, or to mine, for the frequent item sets of boolean association rules²³. This algorithm can be divided into two sub-algorithms. First, we retrieve all item sets that have a support²⁴ greater than the minimum; these are called the frequent item sets. Second, using the previously found frequent item sets, we generate all the association rules. At this point, for each frequent item set A, all non-empty subsets a of A are found; if support(A)/support(a) ≥ minimum confidence, then the association rule a → (A − a) is generated. In other words, we would like to exploit the association rules that have a high confidence, specifically a confidence level that is not below the user-specified level [24].
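The two sub-algorithms can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this thesis: support is computed as a fraction of transactions, the threshold is applied as "greater than or equal to", and the function names `apriori` and `generate_rules` are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all frequent item sets.

    Support is the fraction of transactions containing the item set.
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    level = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while level:
        current = {}
        for candidate in level:
            s = support(candidate)
            if s >= min_support:
                current[candidate] = s
        frequent.update(current)
        k += 1
        # Join step: combine frequent (k-1)-sets into k-set candidates.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        level = {c for c in candidates
                 if all(frozenset(sub) in current
                        for sub in combinations(c, k - 1))}
    return frequent


def generate_rules(frequent, min_confidence):
    """Emit rules a -> (A - a) with support(A)/support(a) >= threshold."""
    rules = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for subset in combinations(sorted(itemset), size):
                antecedent = frozenset(subset)
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent,
                                  confidence))
    return rules


if __name__ == "__main__":
    data = [{"fever", "cough"}, {"fever", "cough", "pain"},
            {"fever", "pain"}, {"cough"}]
    freq = apriori(data, min_support=0.5)
    print(generate_rules(freq, min_confidence=0.9))
```

Because every subset of a frequent item set is itself frequent, the lookup `frequent[antecedent]` in the rule generation step always succeeds.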
²³ Boolean association rules state that each item in the dataset is considered to be either part of an itemset, or not part of it at all.
3.1.2.2 FP-Growth
The FP-Growth algorithm is another algorithm that is used to find itemsets, but without using Candidate Generation. Candidate Generation is a "bottom-up" technique used by the Apriori algorithm, whereby each term is considered as a candidate for one of the available item sets. It turns out that applying Candidate Generation is very time-consuming, and hence the FP-Growth algorithm has a much better performance. The FP-Growth algorithm is a tree-based approach that uses a divide-and-conquer strategy. Behind the scenes, the FP-Growth algorithm uses a special data structure named the Frequent Pattern Tree, or FP-Tree, which preserves the itemset association data and provides quick access to it. Specifically, the FP-Growth algorithm works as follows:
1. Reduce the collection database to represent frequent itemsets, with the usage of an FP-tree data structure.
2. Next, we divide the reduced database into several conditional databases; each database represents a single frequent pattern.
3. Finally, we mine each database individually.
By following this algorithm, the FP-Growth algorithm reduces the costs of searching for patterns. This is done by recursively searching for shorter patterns, then concatenating them into the longer patterns. The tree-approach of the FP-Growth algorithm allows it to use memory efficiently, while also providing a quick response time when looking for patterns. This provides good selectivity.
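The first of the three steps, compressing the collection into an FP-Tree, can be sketched in Python as follows. The sketch covers only the tree construction; the conditional databases and the recursive mining of steps 2 and 3 are omitted. The names `FPNode` and `build_fp_tree` and the absolute threshold `min_count` are assumptions made for illustration.

```python
from collections import Counter


class FPNode:
    """One node of the FP-Tree: an item with a count and child links."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    """Build an FP-Tree from the transactions (step 1 only).

    Returns the root node and the frequent item counts.
    """
    # First pass: count items and keep only the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_count}

    root = FPNode(None)
    # Second pass: insert each transaction with its frequent items sorted
    # by descending frequency, so that common prefixes share one path.
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent


if __name__ == "__main__":
    data = [["fever", "cough"], ["fever", "cough", "pain"],
            ["fever", "pain"], ["cough"]]
    tree, freq = build_fp_tree(data, min_count=2)
    print(freq, {k: v.count for k, v in tree.children.items()})
```

Sharing prefixes is what keeps the tree compact: transactions that begin with the same frequent items only increment counts along an existing path.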
3.1.2.3 Query Expansion
The query expansion step is possible after the execution of the Apriori and the FP-Growth algorithms. The top 100 reports (along with their associated concepts) from the baseline retrieval method are given to the algorithms. The concepts are divided into semantic groups. The semantic groups we used for these runs are DISO (Disorders), PROC (Procedures), PHYS (Physiology), CHEM (Chemicals and Drugs) and ANAT (Anatomy). In the results, we disregarded general concepts such as CONC (Concepts and Ideas), GEOG (Geographic Areas) and LIVB (Living Beings).
The Apriori and FP-Growth algorithms produce the weights for each of the concepts. We only considered concepts that had a support ≥ 0.6. Our experiments confirmed that the optimal value for the support threshold is 0.6. The selected concepts are then added to the original concepts of the initial query, thereby forming a new, expanded query.
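The selection of expansion concepts can be illustrated with a minimal Python sketch. It assumes that the support values and the semantic group of each concept are already available as dictionaries; the function name `expand_query_concepts` and the example data are hypothetical.

```python
# Semantic groups kept for expansion, as listed above.
ALLOWED_GROUPS = {"DISO", "PROC", "PHYS", "CHEM", "ANAT"}


def expand_query_concepts(query_concepts, concept_support, concept_group,
                          min_support=0.6):
    """Append concepts with support >= min_support that belong to an
    allowed semantic group to the original query concepts."""
    expanded = list(query_concepts)
    # Visit concepts from the highest support to the lowest.
    ordered = sorted(concept_support.items(),
                     key=lambda pair: (-pair[1], pair[0]))
    for concept, sup in ordered:
        if (sup >= min_support
                and concept_group.get(concept) in ALLOWED_GROUPS
                and concept not in expanded):
            expanded.append(concept)
    return expanded


if __name__ == "__main__":
    support = {"fever": 0.8, "cough": 0.6, "toronto": 0.9, "rash": 0.3}
    groups = {"fever": "DISO", "cough": "DISO",
              "toronto": "GEOG", "rash": "DISO"}
    print(expand_query_concepts(["pneumonia"], support, groups))
```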
3.1.3 Vectors of Queries
The 34 given queries²⁵, which are the only clues to finding the relevant information in the large corpus, are analyzed in order to find the UMLS concepts. Each query is then represented as a group of concepts with weights that are derived from their support values. The definition of Support is provided in Equation 1.
$$ Sup_C = \frac{N_t}{N} $$
Sup_C = Support value for concept C
N_t = Total number of reports that include C
N = Total number of reports in the collection
Equation 1: Support Formula
Presenting concepts with their respective support values allows us to create a vector of concepts that is associated with each query. This vector can be defined as shown in Equation 2:
$$ V_{QUERY} = \big( (C_1, S_1), (C_2, S_2), \ldots, (C_n, S_n) \big) $$
C = Concept
S = Support
Equation 2: Query Vector
3.1.4 TF.IDF Scoring
TF.IDF, or Term Frequency - Inverse Document Frequency, is a scoring method that indicates how relevant a single term is in relation to the entire document, or corpus. For our method, the TF.IDF scores can be obtained through the index file, which was produced in the first few steps by Terrier. For this step, we use the TF.IDF method to produce a score for each concept in each of the medical reports; using this information, a vector of concepts, along with their TF.IDF values, is generated for each medical report in the entire corpus. The calculation of TF.IDF is shown in Equation 3.
$$ (TF.IDF)_{concept} = TF \times IDF = TF \times \log \frac{N}{N_t} $$
TF = Concept frequency in the specific report
N = Total number of reports = 17011
N_t = Total number of reports that contain this specific concept
Equation 3: TF.IDF for each concept of each report
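As a worked illustration of Equation 3, the following Python sketch computes the TF.IDF weight of one concept and builds a report vector from per-concept counts. The base of the logarithm is not stated in the text, so the natural logarithm is assumed here; the function names `tf_idf` and `report_vector` are hypothetical.

```python
import math

N_REPORTS = 17011  # N in Equation 3


def tf_idf(tf, reports_with_concept, n_reports=N_REPORTS):
    """TF.IDF = TF * log(N / N_t); natural logarithm assumed."""
    return tf * math.log(n_reports / reports_with_concept)


def report_vector(concept_counts, document_frequency, n_reports=N_REPORTS):
    """Map each concept of one report to its TF.IDF weight.

    concept_counts: concept -> frequency in this report (TF)
    document_frequency: concept -> number of reports containing it (N_t)
    """
    return {concept: tf_idf(tf, document_frequency[concept], n_reports)
            for concept, tf in concept_counts.items()}


if __name__ == "__main__":
    print(tf_idf(2, 10, n_reports=100))  # 2 * ln(10)
    print(report_vector({"fever": 2}, {"fever": 10}, n_reports=100))
```

A concept that appears in every report receives a weight of zero, since log(N / N) = 0.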
3.1.5 Vectors of Reports
All the values that have been mentioned in 3.1.4 can be generated by using data from the index file. The result of this calculation is a vector that contains all the concepts and their weights. Each report's vector can be defined as shown in Equation 4.
$$ V_{REPORT} = \big( (C_1, TF.IDF_1), (C_2, TF.IDF_2), \ldots, (C_m, TF.IDF_m) \big) $$
C = Concept
TF.IDF = from Equation 3
Equation 4: Report Vector
3.1.6 Cosine Similarity
The last step of this re-ranking method is to calculate the re-ranking score for each report²⁶.
We calculate the similarity value for each pair of vectors (VQUERY, VREPORT).
There are several different types of similarity measurements in the vector space model²⁷.
In the vector space model, we will need to use similarity measurements that are based on the inner product of the vectors. There are several types of vector similarity measurements, such as Jaccard, Dice and Cosine [10].
The Jaccard Similarity Coefficient, also known as the Tanimoto Similarity, is a similarity ratio defined over bitmaps of a fixed-size vector. This ratio is basically the number of common bits, divided by the number of bits set in either sample. In our research, as we mentioned in 3.1.3 and 3.1.5, the vectors of our reports and our queries are not of a fixed size. Further, since our vectors contain weight measurements of the terms, it is not possible to use this similarity measurement; the Jaccard model accepts only 0 or 1 inside the vectors. Therefore, this measurement model was discarded as a possible choice for our similarity measurements.
The Dice similarity measurement is also known as Dice's coefficient or the Sørensen index. For two vectors of keywords, the Dice coefficient is defined as twice the shared information over the sum of the cardinalities of the vectors. This measure is not suitable for our purposes, since the cardinalities of the vectors are something we are not considering.
The Cosine Similarity measurement is the most popular choice for vector similarity measurements in Information Retrieval. In the vector space model, we can simply use the angle between the vectors as the measure of their divergence. Then, in order to have a numerical similarity, the angle value is converted by applying the cosine calculation. This way, we end up with a numerical value for the similarity between two vectors. This method is especially useful in Information Retrieval, since identical vectors receive a similarity measurement of 1 and orthogonal vectors receive a similarity measurement of 0. As an alternative, we can also use the dot product (inner product). However, the problem with the dot-product calculation is that it takes into account the length of the vectors, something that we want to avoid in Information Retrieval [39].
The Cosine Similarity values are calculated from the two vectors described in sections 3.1.3 and 3.1.5; this is shown in Equation 5.
$$ Score(Report_n) = Cosine(V_Q, V_{Report_n}) = \frac{\sum_i Sup(C_i) \times TF.IDF(C_i, Report_n)}{\lVert V_Q \rVert \times \lVert Doc_n \rVert} $$
Equation 5: Cosine Similarity
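Equation 5 and the subsequent ordering of reports can be illustrated with a minimal Python sketch. It assumes that the query vector maps concepts to support values and that each report vector maps concepts to TF.IDF weights, both as dictionaries; the function names `cosine_similarity` and `rerank` are hypothetical.

```python
import math


def cosine_similarity(query_vec, report_vec):
    """Cosine of the angle between two sparse concept vectors."""
    # Only concepts present in both vectors contribute to the numerator.
    dot = sum(weight * report_vec.get(concept, 0.0)
              for concept, weight in query_vec.items())
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    norm_r = math.sqrt(sum(w * w for w in report_vec.values()))
    if norm_q == 0 or norm_r == 0:
        return 0.0
    return dot / (norm_q * norm_r)


def rerank(query_vec, report_vecs):
    """Return report IDs ordered by decreasing cosine similarity."""
    return sorted(report_vecs,
                  key=lambda rid: cosine_similarity(query_vec,
                                                    report_vecs[rid]),
                  reverse=True)


if __name__ == "__main__":
    query = {"fever": 1.0, "cough": 1.0}
    reports = {"r1": {"rash": 1.0},
               "r2": {"fever": 2.0, "cough": 2.0},
               "r3": {"fever": 1.0}}
    print(rerank(query, reports))
```

Dividing by the two vector norms is what removes the effect of vector length, which is the reason the plain dot product was not used.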
3.1.7 Re-ranking based on Cosine Similarity score
The final step in this APFP-Cosine method is to re-rank the reports based on the cosine similarity values generated in the previous step. The Cosine Similarity scores are calculated for all the reports in the entire corpus. After completing the above-mentioned steps, we feed the re-ranked reports back into Terrier in order to perform the retrieval evaluation measurements. This step is further described in Chapter 6.
3.2 Algorithm
1  List collection = Tree.data();
2  Array queries = initialQueries();
3  collection.performUMLS.BioLabeler();
4  queries.performUMLS.BioLabeler();
5  List collectionIndexed = collection.performIndex();
6  Array expandedQueryAP = queries.performApriori();
7  Array expandedQueryFP = queries.performFPGrowth();
8  Array reportVector = RV.vector(collectionIndexed);
9  // Below block for FPGrowth method
10 Array queryVectorFP = QV.vector(expandedQueryFP);
11 // Below called CosineSimilarity Class
12 List rankedReportsFP = CS.rank(queryVectorFP, reportVector);
13 // results contains the output of running Terrier
14 List resultsFP = Terrier.evaluate(rankedReportsFP);
15 // Below block for Apriori method
16 Array queryVectorAP = QV.vector(expandedQueryAP);
17 // Below called CosineSimilarity Class
18 List rankedReportsAP = CS.rank(queryVectorAP, reportVector);
19 // results contains the output of running Terrier
20 List resultsAP = Terrier.evaluate(rankedReportsAP);
Figure 3: Re-ranking based on the Cosine Similarity Algorithm
Figure 3 shows the detailed algorithm for the APFP-Cosine re-ranking method. In the first 5 lines, the collection and the queries are indexed based on UMLS concepts. The next two lines perform the Apriori and the FP-Growth algorithms on the queries, in order to perform query expansion for this method. Line 8 extracts the report vector from the database index. Since we perform the re-ranking and the cosine similarity measures separately, lines 9-14 represent the steps for the FP-Growth algorithm, while lines 15-20 represent the steps for the Apriori algorithm; in each of these blocks, we first extract the expanded query vector and then rank the reports based on this newly formulated query. The results are then stored in a variable of type List (line 14 and line 20).
4 Sub-AP Method
The TREC Medical Track data records contain many different types of medical terms and concepts. Finding the most valuable ones is a challenge. Solving this challenge will allow us to respond correctly to the users' queries.
In this method, two of the most important subtypes (Disorders and Procedures) are selected and filtered from the indexed collection. These two subtypes were chosen because they were the most common subtypes among the queries. This matter is further discussed in 4.1.2. After gathering the concepts of the desired subtypes, the Apriori algorithm is applied to obtain the weights of these concepts, which are then used to expand the queries and retrieve the results.
Figure 4: Subtypes Method's Diagram
4.1 Sub-AP Methodology Steps
4.1.1 Indexed Collection
As we mentioned in Chapter 3, the collection consists of just over 17,000 visits. A Visit groups all the reports that belong to a single hospital stay by a patient. In this collection, more than one report can belong to the same patient.
As mentioned in Chapter 3, the initial stage involves creating two files: 1) an index of the entire corpus and 2) the baseline results from an information retrieval system, such as Terrier. These two files can be based either on the reports or on the Visits that contain these reports. In the final stages of this method, the evaluation (gold standard) is done via the reports, and not the Visits. Therefore, we need to map our results (Visits) to the form that is suitable for our evaluation (reports). The mapping file for determining which reports belong to which Visits is available from TREC.
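The mapping from Visits back to reports can be sketched in Python as follows. The layout of the TREC mapping file is not described in this text, so the sketch assumes one "report ID, visit ID" pair per line, separated by whitespace, and it assumes that each report inherits the score of its Visit; the function names are hypothetical.

```python
from collections import defaultdict


def load_mapping(lines):
    """Build visit -> [reports] from lines of 'report_id visit_id'
    (assumed layout of the mapping file)."""
    visit_to_reports = defaultdict(list)
    for line in lines:
        report_id, visit_id = line.split()
        visit_to_reports[visit_id].append(report_id)
    return visit_to_reports


def visits_to_reports(ranked_visits, visit_to_reports):
    """Expand a ranked list of (visit, score) pairs into (report, score)
    pairs, keeping the visit order (score inheritance is an assumption)."""
    ranked_reports = []
    for visit_id, score in ranked_visits:
        for report_id in visit_to_reports.get(visit_id, []):
            ranked_reports.append((report_id, score))
    return ranked_reports


if __name__ == "__main__":
    mapping = load_mapping(["r1 v1", "r2 v1", "r3 v2"])
    print(visits_to_reports([("v2", 0.9), ("v1", 0.4)], mapping))
```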
4.1.2 Subtypes Filtering
This step involves filtering the subtypes (PROC, DISO) from the initial index. The index file contains all the visits' concepts, along with the specification of their types and their visit IDs. The subtypes are shown in Table 3.
ACTI    Activities & Behaviors
ANAT    Anatomy
CHEM    Chemicals & Drugs
CONC    Concepts & Ideas
DEVI    Devices
DISO    Disorders
GENE    Genes & Molecular Sequences
GEOG    Geographic Areas
LIVB    Living Beings
OBJC    Objects
OCCU    Occupations
ORGA    Organizations
PHEN    Phenomena
PHYS    Physiology
PROC    Procedures
Based on the medical definitions of the queries, we are able to locate the subtypes that best match the queries. The most important subtypes for us are the Procedures and the Disorders (Diseases) subtypes.
We use a Java program (A.4) to filter the desired subtypes from the entire index file. The new, smaller index file contains all the concepts from the visits that fall under the DISO and PROC subtypes.
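The filtering step can be illustrated in Python as follows. This is a stand-in sketch and not the Java program of A.4; it assumes that each index row is a (visit ID, concept, subtype) tuple, and the function name `filter_subtypes` is hypothetical.

```python
def filter_subtypes(rows, keep=("DISO", "PROC")):
    """Keep only the index rows whose subtype is in `keep`.

    Each row is assumed to be a (visit_id, concept, subtype) tuple.
    """
    wanted = set(keep)
    return [row for row in rows if row[2] in wanted]


if __name__ == "__main__":
    index_rows = [
        ("v1", "pneumonia", "DISO"),
        ("v1", "lung", "ANAT"),
        ("v2", "biopsy", "PROC"),
        ("v2", "hospital", "ORGA"),
    ]
    print(filter_subtypes(index_rows))
```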
4.1.3 Using Top Ranked Results
We used the top-ranked baseline retrieval results in order to find the desirable concepts. For the purposes of this thesis project, we simply used the top 50 results; it is also possible to use a different number of top-ranked results. We chose 50 because this number of visits is not so large that it yields irrelevant concepts, and not so small that it yields too few concepts.
4.1.4 Finding Most Relevant Concepts
The next step is to find the concepts associated with these top relevant visits. A Java program was used to find the concepts associated with the top-ranked visits that were gathered in the previous steps.
4.1.5 Apriori Algorithm Scores
The Apriori algorithm can calculate the weight of the subtype-related concepts. The output consists of each concept and the weight calculated for it by the Apriori algorithm. This combination of (concept, weight) will be used in the next step.
4.1.6 Re-ranking and Combining Scores
Perl is very suitable for the re-ranking step, which is based on the weighted queries. The prepared query, with the concepts and the associated weights from 4.1.5, is run through a Perl program. In order to generate a new score for each document and apply the re-ranking step, we used a score combination technique whereby we combine the new score from the new query with the old baseline score. This score is calculated as shown in Equation 6.
$$ NewScore(Doc) = \alpha \times Score(Doc) + (1 - \alpha) \times BaselineScore(Doc) $$
Equation 6: Score Combination
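Equation 6 can be illustrated with a short Python sketch (a stand-in for the Perl program, not a reproduction of it). The value of α is not given in this section, so the default of 0.5 is only a placeholder; treating a missing score as 0 is also an assumption, and the function name `combine_scores` is hypothetical.

```python
def combine_scores(new_scores, baseline_scores, alpha=0.5):
    """Linear combination of the expanded-query score and the baseline
    score for each document, returned in decreasing order of score.

    alpha=0.5 is a placeholder; missing scores are treated as 0.
    """
    docs = set(new_scores) | set(baseline_scores)
    combined = {
        doc: alpha * new_scores.get(doc, 0.0)
        + (1 - alpha) * baseline_scores.get(doc, 0.0)
        for doc in docs
    }
    # Sort by score (highest first), then by document ID for stable ties.
    return sorted(combined.items(), key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    new = {"d1": 1.0, "d2": 0.0}
    baseline = {"d1": 0.0, "d2": 1.0, "d3": 0.5}
    print(combine_scores(new, baseline, alpha=0.25))
```

With α close to 1 the ranking follows the expanded query, and with α close to 0 it falls back to the baseline ranking.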
4.2 Algorithm
As previously mentioned, this method is based on using the subtype-related concepts as a filter for choosing the specific relevant concepts; these concepts are later used in creating an expanded query. The step-by-step algorithm for this method is shown in Figure 5.
In the first 5 lines, the collection and the query are indexed based on UMLS concepts.