3.5 Stability of the ranking
3.5.1 Dependency on the part-of-speech tagger
Changing the implementation and model of the part-of-speech tagger does not significantly change the extracted noun phrases or the resulting rankings.
Experiment
For the 28 PubMed queries listed in Table 3.9, term rankings were generated following the pipeline presented in Figure 3.1. The experiment was repeated with two configurations using different part-of-speech taggers, namely the
• TNT part-of-speech tagger (Brants, 2000), trained on the Penn Treebank Corpus (English, newspaper text (Wall Street Journal), 1,200,000 tokens, 96.7% average accuracy, 0.13 standard deviation), and the
• LINGPIPE part-of-speech tagger (Carpenter, 2009), trained on the MedPost corpus (Smith et al., 2004).
Besides the ranking itself, special attention was paid to the number of terms extracted with one configuration but not the other, since different part-of-speech tags lead to different noun phrases and hence to different candidate terms.
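The comparison of the two configurations can be illustrated with a minimal sketch. It assumes that each configuration yields a ranked list of unique terms (best term first). The function name compare_term_sets and the toy rankings are hypothetical and not taken from the experiment.

```python
def compare_term_sets(ranking_a, ranking_b):
    """Count shared and configuration-specific terms of two ranked term lists."""
    terms_a, terms_b = set(ranking_a), set(ranking_b)
    shared = terms_a & terms_b
    return {
        "overlap": len(shared),             # terms found by both configurations
        "only_a": len(terms_a - terms_b),   # terms found with A but not with B
        "only_b": len(terms_b - terms_a),   # terms found with B but not with A
        "total_a": len(terms_a),
        "total_b": len(terms_b),
    }


# Hypothetical toy rankings (best term first), not data from the experiment.
ranking_a = ["blood pressure", "hypertension", "patient", "systolic blood pressure"]
ranking_b = ["blood pressure", "patient", "hypertension", "pressure"]

counts = compare_term_sets(ranking_a, ranking_b)
print(counts)
```

In this toy case the compound term "systolic blood pressure" is found with only one configuration, while the other configuration yields only the single-word term "pressure".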
Results
Single examples: Figure 3.8 shows, for the three example domains Blood Pressure, Obesity, and Insulin Resistance, how the ranking is affected by switching from one part-of-speech tagging method to the other. The experiment was repeated with 50, 100, 500, and 2000 PubMed abstracts containing the words “Blood Pressure”, “Obesity”, or “Insulin Resistance”. For these examples the ranking is relatively stable: the distance between a term's ranks under the two part-of-speech taggers is mostly below 2. The numeric results for the examples are listed in Table 3.14. While the change in rank appears negligible among the top-ranked terms, the table shows that a considerable number of terms occur in only one of the two rankings, i.e. the total number of missing terms is high (columns not overlap (A) and not overlap (B)).
The change of the part-of-speech tagger does influence whether a term is predicted as a candidate term at all.
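Table 3.14 also reports a Spearman correlation between the two rankings. The text does not state how terms missing from one ranking or tied ranks were handled, so the following is only a sketch under explicit assumptions: the correlation is computed over the shared terms only, the shared terms are re-ranked from 1 to n within each list, and there are no ties. The function name spearman_on_shared_terms is hypothetical.

```python
def spearman_on_shared_terms(ranking_a, ranking_b):
    """Spearman rank correlation restricted to terms present in both rankings.

    Assumption: each ranking lists unique terms, best term first (no ties).
    """
    shared = set(ranking_a) & set(ranking_b)
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared terms")
    # Re-rank the shared terms 1..n, preserving their order in each list.
    pos_a = {t: i for i, t in enumerate((t for t in ranking_a if t in shared), start=1)}
    pos_b = {t: i for i, t in enumerate((t for t in ranking_b if t in shared), start=1)}
    d_squared = sum((pos_a[t] - pos_b[t]) ** 2 for t in shared)
    # Classical formula for rankings without ties.
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical toy rankings: "x" and "y" each occur in one ranking only.
rho = spearman_on_shared_terms(["a", "b", "c", "d", "x"], ["a", "c", "b", "d", "y"])
print(round(rho, 3))
```

Restricting the correlation to shared terms means that a high value says nothing about missing terms, which is why the overlap counts are needed alongside it.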
Summary over 28 experiments: Figures 3.9 and 3.10 show summary plots over all 28 experiments for document sets of 50, 100, 500, 1000, and 2000 PubMed abstracts. For each ranked term created using the TNT part-of-speech tagger (x-axis), the plot illustrates the change in rank when the TNT part-of-speech tagger is replaced by the LINGPIPE part-of-speech tagger (y-axis). The differences in rank have been accumulated and are visualized using gray shadings in a hexagon plot. The darker a hexagon, the more agreement was observed between the 28 experiments. The results are shown for the top 25 and top 100 ranked terms.
Missing terms: In the summarised visualisation of the 28 experiments, the proportion of missing terms among the top 25 ranked terms decreases from 5.6% to 3.6% on average as more documents are used as the basis for term generation. For the top 100 ranked terms, the proportion of missing terms decreases from 6.6% to 4.8%.
[Figure 3.8: panels for Blood Pressure, Obesity, and Insulin Resistance; 50, 100, 500, and 2000 documents]
Fig. 3.8. Change in rank (selected samples): TNT vs. LINGPIPE part-of-speech tagger.
Examples for rankings based on 50, 100, 500, and 2000 PubMed abstracts. The plot illustrates for each ranked term (x-axis) the change in rank when the TNT part-of-speech tagger is replaced by the LINGPIPE part-of-speech tagger (y-axis). Results are shown for the top 25 ranked terms. Terms missing in the ranking are plotted with negative distance in red. Terms whose difference in rank is below 2 ± (rank ∗ 5%) are plotted in black, all others in blue. The threshold is illustrated by the gently inclined blue line.
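The colour coding described in the caption can be sketched as follows. The caption gives the threshold as 2 ± (rank ∗ 5%). The sketch assumes this means an absolute rank difference below 2 + 0.05 · rank, which matches the gently inclined line, but this reading is an assumption. The function name classify_rank_changes and the labels are hypothetical.

```python
def classify_rank_changes(ranking_a, ranking_b, top_k=25, base=2.0, rel=0.05):
    """Label each top-k term of ranking A by its rank change in ranking B.

    Assumed threshold: |rank_b - rank_a| < base + rel * rank_a.
    Returns tuples (term, rank_a, difference or None, label).
    """
    rank_b = {term: r for r, term in enumerate(ranking_b, start=1)}
    result = []
    for rank_a, term in enumerate(ranking_a[:top_k], start=1):
        if term not in rank_b:
            result.append((term, rank_a, None, "missing"))   # red in the plot
            continue
        diff = abs(rank_b[term] - rank_a)
        threshold = base + rel * rank_a
        label = "stable" if diff < threshold else "shifted"  # black vs. blue
        result.append((term, rank_a, diff, label))
    return result


# Hypothetical toy rankings with placeholder term names.
ranking_a = ["t1", "t2", "t3", "t4", "t5"]
ranking_b = ["t1", "t3", "t2", "t9", "t8", "t7", "t4"]
labels = [row[3] for row in classify_rank_changes(ranking_a, ranking_b)]
print(labels)
```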
[Figure 3.9: panels for the top 25 and top 100 ranked terms; 50, 100, and 500 documents]
Fig. 3.9. Change in rank (summary): TNT vs. LINGPIPE part-of-speech tagger (part 1).
Summary plot of term generation results on the basis of 50, 100, and 500 documents. The plot illustrates for each ranked term (x-axis) the change in rank (“below or at rank”) when the TNT part-of-speech tagger is replaced by the LINGPIPE part-of-speech tagger (y-axis). The experiment covers term generation results from all 28 PubMed queries listed in Table 3.9. The darker a hexagon, the more agreement was observed throughout all experiments. Results are shown for the top 25 and top 100 ranked terms.
[Figure 3.10: panels for the top 25 and top 100 ranked terms; 1000 and 2000 documents]
Fig. 3.10. Change in rank (summary): TNT vs. LINGPIPE part-of-speech tagger (part 2).
Summary plot of term generation results on the basis of 1000 and 2000 documents. The plot illustrates for each ranked term (x-axis) the change in rank (“below or at rank”) when the TNT part-of-speech tagger is replaced by the LINGPIPE part-of-speech tagger (y-axis). The experiment covers term generation results from all 28 PubMed queries listed in Table 3.9. The darker a hexagon, the more agreement was observed throughout all experiments. Results are shown for the top 25 and top 100 ranked terms.
Difference in rank: Independently of the number of documents, 90% of the terms show a shift in rank below 2 when the part-of-speech tagger is changed (top 25). The more text is used for term generation (tested here in the range from 50 to 2000 abstracts), the smaller the average difference in rank. For the top 100 ranked terms, the 90% borderline was 13, 20, 8, 9, and 13 for 50, 100, 500, 1000, and 2000 documents, respectively.
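The text does not specify how the 90% borderline was computed. One plausible reading, sketched here as an assumption, is the smallest rank difference that is not exceeded by 90% of the top-k terms (nearest-rank percentile), with terms missing from the second ranking left out. The function names rank_differences and borderline are hypothetical.

```python
import math


def rank_differences(ranking_a, ranking_b, top_k):
    """Absolute rank differences for the top-k terms of A that also occur in B."""
    rank_b = {term: r for r, term in enumerate(ranking_b, start=1)}
    return [
        abs(rank_b[term] - rank_a)
        for rank_a, term in enumerate(ranking_a[:top_k], start=1)
        if term in rank_b  # assumption: missing terms are excluded
    ]


def borderline(diffs, share=0.9):
    """Nearest-rank percentile: smallest value covering `share` of the differences."""
    if not diffs:
        raise ValueError("no rank differences given")
    ordered = sorted(diffs)
    index = math.ceil(share * len(ordered)) - 1
    return ordered[index]


# Hypothetical rank differences for ten terms, not data from the experiment.
example_diffs = [0, 0, 1, 0, 1, 2, 0, 1, 0, 13]
print(borderline(example_diffs))
```

A single strongly shifted term (here a difference of 13) does not move the borderline as long as it lies in the upper 10%, which is why this measure is less sensitive to outliers than the average difference.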
| query string | result size | overlap | not overlap (A) | not overlap (B) | total (A) | total (B) | Spearman correlation (A,B) |
|---|---|---|---|---|---|---|---|
| “Blood Pressure” AND 2007[pdat] | 50 | 1992 | 489 | 466 | 2481 | 2458 | 0.97 |
| “Insulin Resistance” AND 2007[pdat] | 50 | 2034 | 478 | 477 | 2512 | 2511 | 0.95 |
| obesity AND 2007[pdat] | 50 | 1740 | 420 | 423 | 2160 | 2163 | 0.98 |
| “Blood Pressure” AND 2007[pdat] | 100 | 3618 | 895 | 841 | 4513 | 4459 | 0.97 |
| “Insulin Resistance” AND 2007[pdat] | 100 | 3646 | 876 | 838 | 4522 | 4484 | 0.96 |
| obesity AND 2007[pdat] | 100 | 3437 | 938 | 910 | 4375 | 4347 | 0.95 |
| “Blood Pressure” AND 2007[pdat] | 500 | 15310 | 4138 | 3776 | 19448 | 19084 | 0.96 |
| “Insulin Resistance” AND 2007[pdat] | 500 | 14705 | 4194 | 3814 | 18899 | 18519 | 0.95 |
| obesity AND 2007[pdat] | 500 | 14890 | 4307 | 4056 | 19197 | 18946 | 0.96 |
| “Blood Pressure” AND 2007[pdat] | 1000 | 28340 | 8294 | 7318 | 36634 | 35657 | 0.96 |
| “Insulin Resistance” AND 2007[pdat] | 1000 | 25686 | 7512 | 6695 | 33198 | 32383 | 0.95 |
| obesity AND 2007[pdat] | 1000 | 27189 | 8108 | 7444 | 35297 | 34632 | 0.96 |
| “Blood Pressure” AND 2007[pdat] | 2000 | 50529 | 15576 | 13428 | 66105 | 63957 | 0.95 |
| “Insulin Resistance” AND 2007[pdat] | 2000 | 45284 | 13906 | 12242 | 59190 | 57529 | 0.95 |
| obesity AND 2007[pdat] | 2000 | 46905 | 14229 | 12821 | 61134 | 59729 | 0.96 |

Table 3.14. Dependency of the term generation on the choice of the part-of-speech tagger. Evaluation results comparing the rankings obtained in the term generation experiments when changing the part-of-speech tagger for the example domains Blood Pressure, Obesity, and Insulin Resistance. Over all experiments an average overlap of 79% (±0.015) was observed.

The experiment suggests that the agreement between the rankings is especially high in the topmost segment. This can be explained as follows. Single-word terms occur more frequently in texts and depend less on the correctness of the assigned part-of-speech tag. Single-word terms will in general appear higher in the ranking than compound words. The extraction of single-word terms is also less vulnerable to variations in the assigned part-of-speech tags, as only one tag must match the pattern for candidate terms (noun phrase pattern). The extraction of compound words, on the other hand, requires the whole sequence of words to follow the noun phrase pattern.
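The argument can be made concrete with a small sketch. The actual noun phrase pattern of the pipeline is not given in this section, so the sketch assumes a simplified, hypothetical pattern over Penn Treebank style tags: a run of adjectives and nouns that ends in a noun. The function name extract_candidates and the tagged sentences are likewise illustrative only.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
ADJ_TAGS = {"JJ"}


def extract_candidates(tagged_tokens):
    """Return maximal adjective/noun runs that end in a noun (assumed pattern)."""
    candidates, run = [], []

    def flush():
        # A candidate must end in a noun, so drop trailing adjectives.
        while run and run[-1][1] not in NOUN_TAGS:
            run.pop()
        if run:
            candidates.append(" ".join(word for word, _ in run))
        run.clear()

    for word, tag in tagged_tokens:
        if tag in NOUN_TAGS or tag in ADJ_TAGS:
            run.append((word, tag))
        else:
            flush()  # any other tag ends the current run
    flush()
    return candidates


# The same hypothetical sentence tagged by two taggers; the second one
# assigns a deviating tag to "blood".
tagged_1 = [("systolic", "JJ"), ("blood", "NN"), ("pressure", "NN"), ("rises", "VBZ")]
tagged_2 = [("systolic", "JJ"), ("blood", "VB"), ("pressure", "NN"), ("rises", "VBZ")]

print(extract_candidates(tagged_1))
print(extract_candidates(tagged_2))
```

With the first tagging the compound "systolic blood pressure" is extracted. A single deviating tag in the second tagging breaks the compound, while the single-word term "pressure" is still found.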