Here we report on the naive baseline approach, which exploits the most frequent sense in Wikipedia. As the initial setting, we fix k = 1 and run the baseline method against the gold standard⁴. The baseline approach applies the most frequent mapping to the OIE terms and generates a set of mappings M in which each OIE term is mapped to at most one DBpedia instance. In Figure 9, we show the precision and recall values obtained. The figure covers both the Nell and the Reverb relations that were sampled for evaluation purposes (Section 7.2). Precision and recall vary across the predicates: lakeinstate has the highest precision for Nell, while is a city in and originated in have the highest precision values for Reverb. Averaged across all predicates, the baseline method achieved a precision of 82.78% and a recall of 81.31% for Nell, giving an F1 score of 81.8%. For Reverb, the precision was 77.72% and the recall was 90.60%, which gives an F1 score of 84.24%.
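The evaluation described above can be illustrated with a minimal sketch. It assumes the usual set-based definitions of precision, recall and F1 over (term, instance) pairs, which this section does not spell out. The anchor counts, the gold standard and the function names `most_frequent_mapping` and `precision_recall_f1` are hypothetical and are not taken from the actual experimental setup.

```python
from collections import Counter

# Toy anchor statistics (hypothetical numbers): how often a surface form
# refers to each candidate instance.
ANCHOR_COUNTS = {
    "obama": Counter({"Barack_Obama": 90, "Michelle_Obama": 10}),
    "paris": Counter({"Paris": 80, "Paris,_Texas": 20}),
    "springfield": Counter({"Springfield,_Illinois": 40,
                            "Springfield,_Massachusetts": 35}),
}

# Toy gold standard (hypothetical): the correct instance for each term.
GOLD = {
    ("obama", "Barack_Obama"),
    ("paris", "Paris"),
    ("springfield", "Springfield,_Massachusetts"),
    ("nyc", "New_York_City"),
}


def most_frequent_mapping(terms, anchor_counts):
    """Map each term to its single most frequent candidate (k = 1).

    Terms without any candidate stay unmapped, so every term is mapped
    to at most one instance.
    """
    mapping = set()
    for term in terms:
        counts = anchor_counts.get(term)
        if counts:
            instance, _ = counts.most_common(1)[0]
            mapping.add((term, instance))
    return mapping


def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 over (term, instance) pairs."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


M = most_frequent_mapping(["obama", "paris", "springfield", "nyc"], ANCHOR_COUNTS)
precision, recall, f1 = precision_recall_f1(M, GOLD)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy run, the term nyc has no candidate and stays unmapped, which lowers recall but not precision, whereas the wrong top-1 choice for springfield lowers both.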
[Figure 10 plot. Panels: (I) NELL, (II) ReVerb. Recall axis from 0.6 to 1.0; legend: Recall1, Recall2, Recall5, Recall10. NELL relations: actorstarredinmovie, agentcollaborateswithagent, animalistypeofanimal, athleteledsportsteam, bankbankincountry, bookwriter, citylocatedinstate, companyalsoknownas, lakeinstate, personleadsorganization, teamplaysagainstteam, weaponmadeincountry. ReVerb relations: is a city in, is a registered trademark of, is a suburb of, is in, is located in, is part of, is the capital of, is the home of, located in, originated in, stands for, was born in.]
Figure 10: rec@k variation for k = 1, 2, 5, 10 for (I) Nell and (II) Reverb.
As the next step, we investigated the effect of k on the top-k candidates for the OIE terms. We ran the baseline with k set to 2, 5 and 10. In Figure 10, we show the values for rec@2, rec@5 and rec@10 compared to rec@1 (the recall values reported in Figure 9). For all the sampled relations we observe a similar trend: a rise in the recall scores, followed by eventual saturation beyond a certain value of k. Considering more candidates with increasing k gives every term a better chance of being matched correctly, which explains the rise in recall for lower values of k. For some relations, such as bookwriter, recall is still increasing at k = 10. However, for most of the predicates the values tend to saturate after rec@5. This indicates that, beyond a certain k, any further increase in k does not alter the correct mappings, since our algorithm has typically already provided a match within the top-1 or top-2 candidates.
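The rise and saturation of rec@k discussed above can be reproduced on toy data with the following sketch. It assumes that a gold mapping counts as recovered when the correct instance appears anywhere among the top-k ranked candidates of its term. The ranked lists, the gold pairs and the function name `recall_at_k` are hypothetical.

```python
# Toy ranked candidate lists (hypothetical), most frequent sense first.
RANKED = {
    "obama": ["Barack_Obama", "Michelle_Obama"],
    "springfield": ["Springfield,_Illinois",
                    "Springfield,_Massachusetts",
                    "Springfield,_Missouri"],
    "dune": ["Dune_(film)", "Dune_(band)", "Dune_(novel)"],
}

# Toy gold standard (hypothetical); "nyc" has no candidates at all.
GOLD = {
    ("obama", "Barack_Obama"),
    ("springfield", "Springfield,_Massachusetts"),
    ("dune", "Dune_(novel)"),
    ("nyc", "New_York_City"),
}


def recall_at_k(ranked, gold, k):
    """Fraction of gold pairs whose instance is among the top-k candidates."""
    if not gold:
        return 0.0
    hits = sum(1 for term, instance in gold
               if instance in ranked.get(term, [])[:k])
    return hits / len(gold)


recalls = {k: recall_at_k(RANKED, GOLD, k) for k in (1, 2, 5, 10)}
for k, value in recalls.items():
    print(f"rec@{k} = {value:.2f}")
```

On this toy input, recall grows from k = 1 to k = 5 and then stays flat, because no candidate list holds a correct instance below rank 3 and one term has no candidates at all.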
As the final analysis, we wanted to capture a generalized trend in the behavior across all the relations. In Figure 11, we analyze our baseline algorithm for increasing values of k. For each setting of k, we compute the average precision, recall and F1 scores and plot them for both data sets. The most interesting aspect of this experiment was the variation in recall: we expected it to increase with increasing values of k, and we observed this trend for both data sets.
[Figure 11 plot. Panels: (I) NELL, (II) ReVerb. x-axis: top-K; y-axis: micro-average scores; legend (variable): Precision, Recall, F1.]
Figure 11: prec@k, rec@k and F1 for (I) Nell and (II) Reverb.
Furthermore, we were interested in the value of k beyond which moving to k + 1 adds only a negligible number of correct mappings to M. For Nell, we attained the best F1 score of 82% at k = 1, and the recall values tend to saturate after k = 5. Similarly, for Reverb we attained the maximum F1 of 84.24% at k = 1. An important aspect revealed by this result is the gradual saturation of the recall scores beyond k = 5, which marks a recall saturation point for our experimental setup. Intuitively, we were not able to increase the number of correct mappings any further beyond k = 5. In other words, beyond k = 5 the chance of coming across the correct mapping falls drastically: if it is there, it is mostly within the top-5.
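One way to make the notion of a recall saturation point concrete is sketched below: it picks the smallest evaluated k after which no larger k improves recall by more than a tolerance. This is an assumed operationalization, not the procedure used in the experiments. The recall values are made-up illustrative numbers, not the measured ones, and the names `saturation_point` and `tol` are hypothetical.

```python
def saturation_point(recall_by_k, tol=0.01):
    """Return the smallest k whose recall is within `tol` of every larger k.

    recall_by_k maps each evaluated k to its recall score.
    """
    if not recall_by_k:
        raise ValueError("recall_by_k must not be empty")
    ks = sorted(recall_by_k)
    for i, k in enumerate(ks):
        later = ks[i + 1:]
        # For the largest k there is nothing later, so it always qualifies.
        if all(recall_by_k[j] - recall_by_k[k] < tol for j in later):
            return k


# Made-up recall scores for illustration only (not the reported results).
example = {1: 0.81, 2: 0.86, 5: 0.90, 10: 0.905}
k_star = saturation_point(example, tol=0.01)
print(f"recall saturates at k = {k_star}")
```

The tolerance decides what counts as negligible: a looser value moves the saturation point to a smaller k, a stricter one pushes it towards the largest k evaluated.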
However, there are ways to further improve the recall of our method, for instance by means of string similarity techniques such as Levenshtein edit distance. A similarity threshold (say, as high as 95%) could then be tuned to consider entities which only partially match a given term. Another alternative would be to look for sub-string matches for the terms with middle and last names of persons. For instance, hussein_obama can have a possible match if terms like