7.7 Results: Methods comparison
9.1.1 Likelihood Estimate
The idea proposed in the previous section is promising but has a minor complication. The SPARQL query may return multiple values for d_p or just a single value. Hence, to decide on the mapping p → d_p, we need to assess the correctness of a particular mapping. A simple scheme to quantify the likelihood of a KB relation being a matching candidate for an OIE relation is a frequency count. In particular, we want to estimate the likelihood of every possible mapping of p to one or more d_p. As an initial step, a naive frequency count of the mappings can give us a likelihood estimate. For instance, if the Nell property bookwriter is mapped to dbo:author in k out of n cases and to dbo:writer in (n-k) out of n cases, then the likelihood scores for these two cases can be given as follows:
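$$\mathrm{likelihood}(\mathit{bookwriter} \rightarrow \mathtt{dbo{:}author}) = \frac{k}{n}, \qquad \mathrm{likelihood}(\mathit{bookwriter} \rightarrow \mathtt{dbo{:}writer}) = \frac{n-k}{n}$$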
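The naive frequency count can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `naive_likelihood`, the input format (one pair per observed mapping), and the toy observations are assumptions made for this example and are not part of the original description.

```python
from collections import Counter, defaultdict


def naive_likelihood(observed_mappings):
    """Estimate likelihood(p -> d_p) as a relative frequency.

    observed_mappings: iterable of (oie_relation, kb_property) pairs,
    one pair per observed mapping (hypothetical input format).
    """
    counts = defaultdict(Counter)
    for oie_relation, kb_property in observed_mappings:
        counts[oie_relation][kb_property] += 1

    likelihood = {}
    for oie_relation, prop_counts in counts.items():
        n = sum(prop_counts.values())  # all mappings seen for this relation
        likelihood[oie_relation] = {
            kb_property: k / n for kb_property, k in prop_counts.items()
        }
    return likelihood


# Toy data (made up): n = 4 observations, k = 3 of them map to dbo:author.
toy = [("bookwriter", "dbo:author")] * 3 + [("bookwriter", "dbo:writer")]
scores = naive_likelihood(toy)
```

With these toy counts, dbo:writer receives the score (n-k)/n = 1/4 purely because it was observed less often, regardless of whether the mapping is semantically sound.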
These likelihood scores can be compared against a threshold that denotes the acceptance limit. Selecting the candidates above the threshold would then give us a simple solution. However, this approach suffers from two major drawbacks. First, a conceptually similar property (as in this case) might be eliminated for lack of sufficient evidence (a low likelihood score). For instance, (n-k) can be very small in the above expression merely because little evidence is available, even though the mapping is semantically not wrong. Second, a correct threshold has to be found.
2 http://www.w3.org/TR/owl-ref/#ObjectProperty-def
9.1 methodology 95
We propose an alternative way of finding the mapping. This improved approach incorporates the type information of the mapped DBpedia instances. For each OIE triple in the input, let the subject s be mapped to d_s and the object o be mapped to d_o. Using the publicly available DBpedia endpoint, we collect the types of these mapped DBpedia instances. For easy reference, we denote them as dom(p) and ran(p), respectively. This notation should not be read as "domain of p" and "range of p". Rather, the concept types of the mapped subject and object are referred to as the likely domain and range of the OIE relation p. Note that querying for an analogous KB triple of the form d_p(d_s, d_o) may return no triple at all, and may also return multiple possibilities, as shown in Example 3. Hence, we identify and differentiate three cases:
I a single possible value is returned for d_p (Case I in Example 3)
II multiple values for d_p are returned (Case II in Example 3)
III an empty set is returned, indicating the absence of any d_p. This can happen if there is no such triple in DBpedia or if the mapped instances are wrong in the first place.
We observe that, depending on the result set returned, multiple interpretations are possible. All these variations can be represented in a single unified form using tuples. Case (I) and Case (II) are given a unified representation by framing discrete association tuples as
$$\langle\, p,\; d_p,\; dom(p),\; ran(p) \,\rangle \tag{12}$$
In Example 3 we present the different possible cases and their representations as association tuples. Observe that we no longer consider the occurrence of d_p alone, but also its domain and range associations. We intend to achieve two major goals with this variation: (1) a finer-grained view of the relation matches, and (2) exploiting the KB ontology to ease the relation mapping task.
Example 3.
96 rule based approach
Case I: Single value.
dbo:city(db:Helsinki_Airport, db:Helsinki)
After transformation into an association:
airportincity, dbo:city, dbo:Airport, dbo:Place

Case II: Multiple values for airportincity(vnukovo, moscow)
dbo:city(db:Vnukovo_International_Airport, db:Moscow) and
dbo:location(db:Vnukovo_International_Airport, db:Moscow)
After transformation into associations:
airportincity, dbo:city, dbo:Airport, dbo:Place
airportincity, dbo:location, dbo:Airport, dbo:Place

Case III: No values. There is nothing to transform in this case.
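The transformation used in Example 3 can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `to_associations` and its argument layout are hypothetical, and the KB properties and instance types are assumed to have already been retrieved from the DBpedia endpoint (no query is issued here).

```python
def to_associations(oie_relation, kb_properties, subject_type, object_type):
    """Build association tuples (p, d_p, dom(p), ran(p)) for one OIE triple.

    kb_properties: list of KB properties returned for the mapped instance
    pair; it holds one element in Case I, several in Case II, and is empty
    in Case III. All names here are illustrative, not the authors' code.
    """
    return [
        (oie_relation, kb_property, subject_type, object_type)
        for kb_property in kb_properties
    ]


# Case I: a single property was returned.
case_1 = to_associations("airportincity", ["dbo:city"],
                         "dbo:Airport", "dbo:Place")

# Case II: two properties were returned, giving two associations.
case_2 = to_associations("airportincity", ["dbo:city", "dbo:location"],
                         "dbo:Airport", "dbo:Place")

# Case III: nothing was returned, so there is nothing to transform.
case_3 = to_associations("airportincity", [], "dbo:Airport", "dbo:Place")
```

Treating Case I as a one-element list is what lets Cases I and II share a single representation.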
The positive impact of such a translation is that both Case (I) and Case (II) mentioned above share one representation. Intuitively, for Case (II) in particular, if multiple properties are possible, then each of them is equally likely to hold true. The set of all associations thus formed for p is denoted as A_p. As a subsequent step, we apply an association rule [LC01] of the form
$$p \Rightarrow dom(p) \wedge ran(p) \tag{13}$$
on A_p. This means that if the OIE relation is p, then the type of the mapped DBpedia subject instance is dom(p) and the type of the object instance is ran(p). Intuitively, Expression 13 makes it evident that the likelihood of a particular OIE relation is influenced by the types of the mapped subject and object instances. This reinforces our claim that the better the quality of the instance matches, the better the relation matching.
Once expressed as an association rule, we can compute the confidence of each rule, denoted as conf. It is the frequency of co-occurrence of dom(p) and ran(p) whenever p occurs. For some rule i and relation p, the confidence is denoted as conf_p^i and defined as
$$\mathrm{conf}_p^i = \mathrm{conf}\big(p \Rightarrow dom(p) \wedge ran(p)\big) = \frac{\mathrm{count}\big(p \wedge dom(p) \wedge ran(p)\big)}{\mathrm{count}(p)} = \frac{\mathrm{count}\big(p \wedge dom(p) \wedge ran(p)\big)}{|A_p|} \tag{14}$$
The definition of confidence is taken from [LC01]. Referring to Table 11,
$$\mathrm{conf}_{agentcreated}^{3} = \frac{\mathrm{count}(agentcreated,\, Person,\, Book)}{|A_{agentcreated}|}$$
The table does not explicitly show the conf column, but the associations presented are used for the confidence score. The index 3 denotes the i-th association for the particular relation, i.e., agentcreated in this case. Note that the computation is performed on the association set A_p; hence, count(p) is the size of A_p. Intuitively, finding the confidence of a particular rule reduces to counting the joint occurrences of the relation and its associated domain and range over the whole set of associations. Note that the count function covers not just the joint occurrences of a particular p with its associated DBpedia domain and range values, but also those with sub-classes of the domain and range. The rationale is that if we observe an association like agentcreated ⇒ (Person, Book), then any other association like agentcreated ⇒ (Scientist, Book) should also be considered as evidence for the former association. Scientist, being a sub-class of Person in the DBpedia ontology, is not a different association but a more specific case. Needless to say, the count of the latter is not affected by the count of the former. This technique is an improvement over the previously mentioned naive count-based approach, which did not exploit the inherent hierarchical structure of the KB. Finally, each association is assigned a confidence of conf_p^i.
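The sub-class-aware count behind the confidence score can be sketched as below. This is a hedged illustration rather than the original implementation: the function names, the `parent_of` dictionary (a toy, acyclic child-to-parent map standing in for the DBpedia ontology), and the four toy associations are all assumptions made for this example.

```python
def is_subclass_or_same(cls, ancestor, parent_of):
    """Walk up the toy class hierarchy from cls, looking for ancestor."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = parent_of.get(cls)  # None once the top of the toy hierarchy is reached
    return False


def confidence(associations, dom, ran, parent_of):
    """conf of the rule p => dom AND ran over the association set A_p.

    associations: list of (p, d_p, dom, ran) tuples for ONE relation p.
    An association counts as evidence if its domain and range equal the
    given dom and ran, or are sub-classes of them.
    """
    if not associations:
        raise ValueError("A_p must not be empty")
    matching = sum(
        1
        for (_p, _dp, d, r) in associations
        if is_subclass_or_same(d, dom, parent_of)
        and is_subclass_or_same(r, ran, parent_of)
    )
    return matching / len(associations)  # count(p) is |A_p|


# Toy hierarchy and associations (made up for illustration).
parent_of = {"Scientist": "Person"}
A_agentcreated = [
    ("agentcreated", "dbo:author", "Person", "Book"),
    ("agentcreated", "dbo:author", "Scientist", "Book"),
    ("agentcreated", "dbo:author", "Person", "Book"),
    ("agentcreated", "dbo:creator", "Organisation", "Software"),
]
conf_person_book = confidence(A_agentcreated, "Person", "Book", parent_of)
conf_scientist_book = confidence(A_agentcreated, "Scientist", "Book", parent_of)
```

In this toy set the (Scientist, Book) association adds evidence for (Person, Book), while the count for (Scientist, Book) itself is not increased by the (Person, Book) associations.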
Our initial analysis of the data sets revealed that OIE relation instances often had varying numbers of KB counterparts: some had quite a lot of KB assertions, while others had extremely few. This motivated us to quantify the notion of mappability of a particular OIE relation p, i.e., the degree to which a particular OIE property can be mapped. It is denoted by K_p and defined as
$$K_p = \frac{\sum_{j=1}^{|T_p|} C(j)}{|T_p|} \tag{15}$$
where
$$C(j) = \begin{cases} 1, & \text{if there is at least one property mapping for } p \text{ in } T_p^j \\ 0, & \text{otherwise} \end{cases}$$
We introduced T_p, which is simply the set of all OIE relation instances for the given relation p. Note that C(j) in the formula above is counted as one whether p has a single possible mapping (Case (I)) or multiple possibilities (Case (II)). Example 4 presents a simple scenario to show the computation of K_p. However, the actual value is 0.55, as shown in Table 11. Also note the use of T_p in the expression for K_p above. We do not require A_p here, since this factor determines the mapping degree and does not look into domain/range associations. It can be easily and accurately determined with the set T_p. A_p would also work, but it would provide false signals, as illustrated in Example 5. The intuition is that an OIE relation being mapped in multiple ways does not necessarily enhance its mapping degree. Hence, we use T_p for this step and A_p for evidence collection and confidence calculation.
Example 4. Let us assume we have only ten relation instances for the Nell property airportincity, each of the form airportincity(*,*). This makes |T_p| = 10. Assume that 8 of them have been mapped (a mixture of Case (I) and Case (II) above), meaning there was a counterpart assertion in DBpedia after the Nell subjects and objects were mapped to DBpedia instances, and that 2 have no such counterpart assertion, i.e., could not be mapped (Case (III) above). Hence, in this toy scenario, K_airportincity = 8/10.
Example 5.
Let us consider just 2 instances in the set T_islocatedin:
{is located in(Kendall, Miami-Dade County), is located in(Kendall College, Chicago)}.
Using the instance mapping techniques and subsequently finding the analogous DBpedia assertions, we obtain the set A_islocatedin:
{(is located in, n.a, n.a, n.a), (is located in, dbo:campus, dbo:University, dbo:City), (is located in, dbo:city, dbo:University, dbo:City)}.
Using T_islocatedin for computing K_islocatedin would give 1/2, since only the second instance can actually be mapped. Using A_islocatedin instead, we would have K_islocatedin = 2/3, which is higher than the former value.
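The computations in Examples 4 and 5 can be reproduced with a short sketch. The function name `mapping_degree` and the list-of-lists input format are assumptions made for illustration. The split of the eight mapped instances of Example 4 into five Case I and three Case II instances is also made up, since the example only states that they are a mixture.

```python
def mapping_degree(mappings_per_instance):
    """K_p: fraction of OIE instances in T_p with at least one KB property.

    mappings_per_instance: one list of KB properties per OIE instance in
    T_p; an empty list means no counterpart was found (Case III).
    """
    if not mappings_per_instance:
        raise ValueError("T_p must not be empty")
    # C(j) is 1 for Case I and Case II alike, and 0 for Case III.
    mapped = sum(1 for props in mappings_per_instance if props)
    return mapped / len(mappings_per_instance)


# Example 4: ten instances, eight mapped, two unmapped.
example_4 = (
    [["dbo:city"]] * 5                    # Case I (assumed split)
    + [["dbo:city", "dbo:location"]] * 3  # Case II (assumed split)
    + [[]] * 2                            # Case III
)
k_example_4 = mapping_degree(example_4)

# Example 5 over T_p: the first instance is unmapped, the second has two mappings.
example_5_T = [[], ["dbo:campus", "dbo:city"]]
k_from_T = mapping_degree(example_5_T)

# Example 5 over A_p: one entry per association tuple, None standing for n.a.
example_5_A = [None, "dbo:campus", "dbo:city"]
k_from_A = sum(1 for d_p in example_5_A if d_p is not None) / len(example_5_A)
```

Counting over A_p inflates the score because the second instance contributes two association tuples, which is the false signal Example 5 points out.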
In an attempt to define a likelihood score, we identified two closely related factors. First, the conf_p value, which considers the evidence observed in the input data set. Second, K_p, which defines the degree to which a relation can actually be mapped. We do not want to map relations with low K_p, even if there is high confidence for them.
We combine K_p and conf_p^i to define a factor called τ (tau) for each association rule i, defined as
$$\tau_p^i = \frac{1 - K_p}{\mathrm{conf}_p^i}, \quad \forall i \in A_p \tag{16}$$
This equation combines the confidence scores (Equation 14) and the mapping degree (Equation 15). It is a unified way to quantify the goodness of a particular rule for a particular p having mapping factor K_p. In Table 11 we present 4 example relations, 2 from Reverb (are in, grew up in) and 2 from Nell (airportincity, agentcreated), with the actual values of each of the major notations introduced so far. These values are not the complete set of tuples, but only a snippet of them. A low-confidence association with low K_p gives a high τ_p^i, as seen with τ^3_airportincity in the table, whereas a more confident association with high K_p minimizes the ratio and hence lowers the τ value (τ^5_airportincity in Table 11). The last column in the table is a learnt threshold for the maximum allowed value of τ. We discuss the learning scheme in Section 9.1.2. The idea behind combining confidence and mapping factor was to account for the inherent differences across OIE data sets and across relations within a data set. This is a generalized scoring scheme that considers these two important aspects of the input data set.
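The behaviour of τ can be illustrated with a small sketch of Equation 16. The function names, the input values, and the threshold of 1.0 are hypothetical and chosen only for this example. The actual thresholds are learnt as described in Section 9.1.2, and the filter below simply assumes that an association is accepted when its τ does not exceed the threshold.

```python
def tau(k_p, conf):
    """tau = (1 - K_p) / conf for one association rule (Equation 16)."""
    if not 0.0 <= k_p <= 1.0:
        raise ValueError("K_p must lie in [0, 1]")
    if not 0.0 < conf <= 1.0:
        raise ValueError("conf must lie in (0, 1]")
    return (1.0 - k_p) / conf


def accepted(scored_associations, threshold):
    """Keep associations whose tau does not exceed the threshold.

    scored_associations: list of (association, tau_value) pairs.
    threshold: maximum allowed tau (a made-up value in this sketch).
    """
    return [assoc for assoc, t in scored_associations if t <= threshold]


# Made-up values: weak evidence and low mappability versus strong
# evidence and high mappability.
weak = tau(k_p=0.2, conf=0.1)
strong = tau(k_p=0.8, conf=0.9)

kept = accepted([("weak rule", weak), ("strong rule", strong)], threshold=1.0)
```

A lower τ therefore indicates a better rule: both a higher K_p and a higher confidence shrink the ratio.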