
3.4 Evaluation of the quality of generated terms

3.4.4 Quality of terms in dependence of the scoring method

In Alexopoulou et al. (2008) we evaluated several automatic term recognition methods and manually decided on the relevance of terms in the given domain of lipoprotein metabolism. In total, 1,197 terms were identified as relevant in the LMO Benchmark (Section 3.3). In this experiment, these terms are used to measure the quality of the rankings obtained for several configurations of the term generation pipeline. Later, it is investigated experimentally how much the background knowledge in the form of a global reference corpus (Section 3.5.2) and the model used for scoring terms (Section 3.5.3) influence the results. Additionally, it was evaluated whether the performance can be increased by a meta analysis. For the meta analysis, the mean probability of occurrence of a term in the local document set is used for ranking, where the probabilities are calculated on the one hand from the occurrence counts obtained from PubMed and on the other hand from those obtained from the Google n-gram source, as explained in Section 6.3.2 (Global Frequency Revisions).

In the graphs in Figure 3.5 the performance of the different configurations of the term generation pipeline is shown by plotting rank-wise mean precision for the extraction of terminology on lipoprotein metabolism. Six methods have been defined:

• Frequency – Term ranking with frequency of occurrence in the analysed document set.

• PubMedTFIDF – Term ranking with tf-idf, where the document frequency was derived from PubMed abstracts.

• GoogleTFIDF – Term ranking with tf-idf, where the document frequency was derived from the Google Web 1T 5-gram Version 1 (Brants and Franz, 2006).

• PubMedPVALUE – Term ranking with the probability of occurrence, where the conditional probability is estimated with the probability of a term's occurrence in PubMed.

• GooglePVALUE – Term ranking with the probability of occurrence, where the conditional probability is estimated with the probability of a term's occurrence according to the Google Web 1T 5-gram Version 1 (Brants and Franz, 2006).

• MetaPVALUE – Term ranking with the joint probability of GooglePVALUE and PubMedPVALUE.
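The configurations listed above differ only in how a term's local count is weighed against background statistics. The following Python sketch illustrates that idea under stated assumptions. The exact formulas belong to Section 3.5.3 and are not reproduced here, so the tf-idf variant and the hypergeometric tail probability below are common textbook forms and not necessarily the ones used in the pipeline. All function names, parameter names, and counts are hypothetical. The sketch assumes that a lower probability of occurrence means a more domain-specific term, and it combines the two probabilities by their mean, as described for the meta analysis.

```python
import math

def tfidf(local_count, docs_with_term, total_docs):
    """Textbook tf-idf: local frequency times log inverse document frequency."""
    # The +1 avoids division by zero for terms unseen in the reference corpus (assumption).
    return local_count * math.log(total_docs / (1 + docs_with_term))

def hypergeom_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric X: the chance of drawing the term at least
    k times in a local sample of n tokens when it occurs K times among N tokens."""
    total = math.comb(N, n)
    upper = min(n, K)
    favourable = sum(math.comb(K, i) * math.comb(N - K, n - i)
                     for i in range(k, upper + 1))
    return favourable / total

def meta_probability(p_pubmed, p_google):
    """Mean of the two probabilities (assumed combination rule)."""
    return (p_pubmed + p_google) / 2.0

# Toy counts (invented for illustration): both terms occur 8 times in 100 local tokens,
# so plain frequency cannot separate them.
local = {"lipoprotein": 8, "study": 8}
global_counts = {"lipoprotein": 50, "study": 2000}  # occurrences among 10,000 reference tokens

p_values = {t: hypergeom_tail(local[t], 100, global_counts[t], 10_000) for t in local}
ranking = sorted(p_values, key=p_values.get)  # lowest probability first
print(ranking)
```

In this toy setting the rare reference-corpus term is ranked first although both terms have the same local frequency, which is the effect the background statistics are meant to achieve.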

For the graph in Figure 3.5(a), terms have been generated for 143 experiments. For each of 28 PubMed queries, 50, 100, 500, 1000, and 2000 abstracts have been retrieved and terms have been generated. Additionally, three manually created document sets related to the topic have been integrated in the analysis. To avoid a biased analysis, the experiment has been repeated on a large scale by issuing queries for 811 concept labels in the LMO. The results are shown in Figure 3.5(b).

Results: Mean precision for generated terms from PubMed abstracts retrieved for 28 manually defined domain related PubMed queries

Details on the scoring and frequency assignment can be found in Section 6.3.3 (Scoring Revisions) and Section 6.3.2 (Global Frequency Revisions).

For each rank, the graph in Figure 3.5(a) shows the overall percentage of predicted candidate terms at this rank or higher that are contained in the 1,197 manually curated terms. For example, a precision of 0.5 at rank 4 means that half (286) of the terms ranked 1–4 in the 143 experiments (572 in total, i.e. 143 × 4) are known relevant terms in the domain. The following observations can be made:
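The rank-wise mean precision described here can be sketched in a few lines of Python. This is a minimal illustration, assuming that the precision is pooled over all experiments (hits in the top k of every ranking divided by the number of terms in those top-k lists). The function name and the toy rankings are hypothetical and not taken from the benchmark.

```python
def mean_precision_at_rank(rankings, relevant, k):
    """Share of all terms ranked 1..k, pooled over all experiments,
    that are contained in the curated set of relevant terms."""
    hits = 0
    total = 0
    for ranking in rankings:
        top_k = ranking[:k]
        hits += sum(1 for term in top_k if term in relevant)
        total += len(top_k)
    return hits / total if total else 0.0

# Toy data (invented): two experiments with four ranked candidate terms each.
relevant = {"fatty acid", "lipoprotein", "cholesterol"}
rankings = [
    ["acid", "fatty acid", "lipoprotein", "cell"],
    ["cholesterol", "pressure", "lipoprotein", "study"],
]

for k in range(1, 5):
    print(k, round(mean_precision_at_rank(rankings, relevant, k), 3))
```

Plotting this value for every k yields one curve per configuration, which is how a graph of this kind can be read.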

[Figure 3.5: two line plots titled "Mean Precision", showing precision (y-axis, tick labels 30.0 to 77.5) against rank (x-axis, 0 to 52) for the methods MetaPVal, GooglePVal, GoogleTfidf, PubMedPVal, PubMedTfidf, and Frequency. Panel (a): terms generated based on text retrieved for 28 selected PubMed queries. Panel (b): terms generated based on text retrieved for PubMed queries for 811 existing LMO terms.]

Fig. 3.5. Quality of generated terms in the lipoprotein metabolism domain. The mean precision is shown for the retrieval of terminology from lipoprotein metabolism related PubMed abstracts retrieved based on PubMed queries for (a) 28 selected representative terms (Section 3.3) and (b) all 811 concept labels of lipoprotein metabolism ontology (LMO) terms. Only 13 terms standing for general concepts in the LMO (e.g., human, long-lived person, or adolescent) have been excluded from the analysis. The retrieved terminology was compared against the manually created LMO.

• Top 1: For the top-ranked term, the precision is lower than for the top 2 or top 3. This is because for terms like “blood pressure” the term “pressure”, or for “fatty acid” the term “acid”, is the most frequently used term but is not contained in the ontology.

• Top 3: The graph shows that GooglePVALUE makes the most correct predictions, with 77–81% precision for the top ranks 1, 2, and 3. GoogleTFIDF (75–78% precision) and MetaPVALUE (76–78% precision) show similar results. PubMedPVALUE follows just below with 73–76%, making more mistakes at ranks 1 and 3. PubMedTFIDF achieves 68–72% precision. All methods with background knowledge clearly outperform the Frequency method, which makes correct predictions for only 51–56% of the terms.

• Top 10: MetaPVALUE performs best, followed by GooglePVALUE and PubMedPVALUE, all in the range of 72–74% mean precision. GoogleTFIDF reaches 68% and PubMedTFIDF 65% mean precision. In the top 10, too, all methods using corpus statistics clearly outperform the Frequency method, which makes correct predictions for only 45% of the terms at rank 10.

• Top 50: MetaPVALUE and PubMedPVALUE perform best with on average 25 domain relevant terms within the top 50 terms (50%). GooglePVALUE follows with 48%, PubMedTFIDF and GoogleTFIDF with 43%, and Frequency with 36%.

Results: Mean precision for generated terms from PubMed abstracts retrieved for all LMO terms

In the previous experiment, the 28 queries were selected as queries likely to retrieve relevant terms. To ensure that the selection of the best-performing configuration is not biased towards those queries, the experiment has been repeated for all but 13 labels of terms defined within the LMO. The 13 terms with general meanings have been excluded because they are not specifically relevant for the domain “lipoprotein metabolism”. The excluded terms are middle-aged adult, middle-aged, hl, long-lived experimentee, long-lived person, long-lived population, young, young adult, enzyme, newborn, human, experimentee, and person.

The obtained precision is lower, but the relative performance of the term generation configurations remains approximately the same.

Results: Pairwise comparison of chosen global corpus statistics and chosen statistical measure

Figure 3.6 compares the joint probability approach using PubMed and Google corpus statistics. Figure 3.7 shows the comparison between the tf-idf-based and probability-based methods. Figure 3.6(a) and Figure 3.6(b) show that for this example the first 27 extracted terms were relevant terms. This indicates that once the document set contains relevant terms, the method ranks them high. It also shows that, while PubMedPVALUE failed to predict relevant terms at ranks 18, 26, and 27 and GooglePVALUE at ranks 21, 26, and 27, the combination MetaPVALUE led in this case to a better ranking without these non-relevant predictions.

(a) Example where the MetaPVALUE outperforms GooglePVALUE

(b) Example where the MetaPVALUE outperforms PubMedPVALUE

(c) Comparison of PubMedPVALUE and GooglePVALUE

Fig. 3.6. Pairwise comparison of the quality of term generation in dependence of the reference corpus statistics. The F-measure (shaded: average (rank-based) F-measure using average precision) is shown for ranges up to rank 50. Corpus statistics obtained from Google-indexed web sites, PubMed abstracts, or a meta-approach combining both are compared. For this test set, Google and PubMed corpus statistics perform equally well. The meta-approach combining both probabilities performs slightly better than the single measures. Term candidates have been generated from abstracts retrieved via the PubMed query “lipoprotein metabolism” AND 2006[pdat]. The resulting terminology was compared against the manually created Lipoprotein Metabolism Ontology.
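As a reading aid for the F-measure curves, the following Python sketch computes the standard F-measure at a rank cutoff k as the harmonic mean of precision and recall over the top k terms. It is an illustration under assumptions: the function name and toy data are hypothetical, and the shaded average (rank-based) variant mentioned in the caption is not reproduced because its definition is not given in this section.

```python
def f_measure_at_rank(ranking, relevant, k):
    """Harmonic mean of precision and recall over the top-k ranked terms."""
    top_k = ranking[:k]
    hits = sum(1 for term in top_k if term in relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)

# Toy data (invented for illustration).
relevant = {"lipoprotein", "cholesterol", "triglyceride", "apolipoprotein"}
ranking = ["lipoprotein", "cell", "cholesterol", "study"]

for k in range(1, 5):
    print(k, round(f_measure_at_rank(ranking, relevant, k), 3))
```

Because recall can only grow with k while precision typically falls, the F-measure curve rises as long as newly added ranks keep contributing relevant terms.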

(a) Example where the PubMedPVALUE outperforms PubMedTFIDF

(b) Example where the GooglePVALUE method outperforms GoogleTFIDF

(c) Comparison of PubMedTFIDF and GoogleTFIDF

Fig. 3.7. Pairwise comparison of performance for term generation in dependence of the used statistical measure. The F-measure (shaded: average (rank-based) F-measure using average precision) is shown for ranges up to rank 50. Relevance ranking using tf-idf is compared against the one using the true conditional probability. The computationally less expensive method TFIDF is compared for the different corpus statistics. The approximation TFIDF performs worse than the true probability (PVALUE). TFIDF performs better with PubMed than with Google corpus statistics. Term candidates have been generated from abstracts retrieved via the PubMed query “lipoprotein metabolism” AND 2006[pdat]. The resulting terminology was compared against the manually created Lipoprotein Metabolism Ontology.

Summary of results

Scoring with the underlying probabilistic model performs better than the simplification tf-idf, and both clearly outperform simple frequency measures in achieving a ranking of domain-relevant vocabulary. The graphs in Figure 3.5 show that the differences between the probability-based methods are not significant. While Google corpus statistics lead to better terms in the top 5, PubMed-based corpus statistics were beneficial for the ranks from approximately 5 to 10. GoogleTFIDF, while very good for the top 3 ranked terms, performs weaker for terms ranked 4 to 50.

rank   background knowledge   statistical measure   method
5      Google n-grams         p-value               GooglePVALUE
10     Meta                   p-value               MetaPVALUE
25     Meta                   p-value               MetaPVALUE
50     Meta/PubMed            p-value               MetaPVALUE/PubMedPVALUE

Table 3.13. Summary of best global corpus statistics at different ranks. The best-performing background knowledge and scoring method are shown for the top 5, top 10, top 25, and top 50 terms.

The method GooglePVALUE proves to be the best choice in terms of quality. In terms of runtime, PubMedTFIDF is the best choice, as the calculation of tf-idf is more efficient than the calculation of exact conditional probabilities (see Section 3.5.3 (Hypergeometric distribution)), and the corpus covers the required terminology but is significantly smaller. The PubMed corpus statistics are easier to handle and faster to access. In the domain of lipoprotein metabolism, the combination of the probabilities in the MetaPVALUE method does not lead to significantly better results.

The hypothesis “Biomedical terminology including single word terms can be ranked better when weighting terms in contrast to large domain specific reference corpus.” associated with research question 1 can be confirmed after the performed analysis. Surprisingly, ranking against a sufficiently large general reference corpus leads to similar or even better results than ranking against a domain-specific reference corpus. In the biomedical domain, this justifies the use of one huge source for stable word and phrase frequencies, such as the Google Web 1T 5-gram Version 1 (Brants and Franz, 2006) corpus, and makes the methodology domain independent.