
10.2 Graph Based Clustering

10.2.1 Similarity Metrics

We introduce a similarity measure $sim(r_i, r_j)$ which captures the degree of similarity between any two property phrases $r_i$, $r_j$ ($i \neq j$). In defining the similarity, we looked into two major aspects:

• exploit the input data set to extract evidence that allows us to quantify the similarity between $r_i$ and $r_j$.

• exploit the rich semantics available via external sources to define a similarity function.

For the former requirement, we have the set of relation instances for the two relations in question. These can be exploited to define a similarity coefficient; in particular, we use the overlap similarity $ov(r_i, r_j)$ for our task. Let $f_{r_i}$ and $f_{r_j}$ be the sets of relation instances for the relational phrases $r_i$ and $r_j$, respectively. If $n$ denotes the number of instance pairs whose subject and object terms both occur in both sets, then the overlap similarity is, in our context, defined by

$$ov(r_i, r_j) = \frac{n}{\min(|f_{r_i}|, |f_{r_j}|)} \tag{21}$$

One can also consider other measures, such as the Jaccard coefficient, for this similarity. We opted for the overlap coefficient because it generally yields higher similarity scores, as Example 6 illustrates.

Example 6. Let us consider two relations $r_1$ and $r_2$. For each of them, we define a few relation instances as follows:

r1(a1, b1)    r2(a1, b1)
r1(a2, b2)    r2(a2, b2)
r1(a3, b3)    r2(a, b)
r1(a4, b4)

Clearly, $n = 2$, since just two pairs, (a1, b1) and (a2, b2), are common to the two relation instance sets. Moreover, $|f_{r_1}| = 4$ and $|f_{r_2}| = 3$.

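Using the Jaccard coefficient, the score would have been $\frac{n}{|f_{r_1} \cup f_{r_2}|} = \frac{2}{5} = 0.4$, since the union of the two sets contains five distinct pairs.

With the overlap similarity, however, the score is $\frac{n}{\min(|f_{r_1}|, |f_{r_2}|)} = \frac{2}{3} \approx 0.67$.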
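The two coefficients compared in Example 6 can be reproduced with a few lines of Python. This is only a minimal sketch: the function names and the representation of relation instances as sets of (subject, object) tuples are assumptions made for illustration, not part of the original system. Note that the set union of the two instance sets contains five distinct pairs, so a set-based Jaccard coefficient evaluates to 2/5 here.

```python
def overlap_coefficient(f_i, f_j):
    """Return n / min(|f_i|, |f_j|), with n the number of shared pairs."""
    if not f_i or not f_j:
        return 0.0  # assumption: an empty instance set yields similarity 0
    n = len(f_i & f_j)
    return n / min(len(f_i), len(f_j))


def jaccard_coefficient(f_i, f_j):
    """Return |f_i intersect f_j| / |f_i union f_j|."""
    union = f_i | f_j
    if not union:
        return 0.0
    return len(f_i & f_j) / len(union)


# Relation instances from Example 6, stored as (subject, object) pairs.
f_r1 = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
f_r2 = {("a1", "b1"), ("a2", "b2"), ("a", "b")}

print(round(overlap_coefficient(f_r1, f_r2), 3))  # 0.667
print(round(jaccard_coefficient(f_r1, f_r2), 3))  # 0.4
```

Either way, the overlap coefficient is the larger of the two, because its denominator is the size of the smaller set rather than the size of the union.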

The example shows that overlap similarity tends to give higher scores. This is important for us since our goal is to find a similarity score between two relational phrases. In typical open information extraction, the sizes of the relation instance sets are not of prime importance. For instance, if the extraction produced just one instance of "is the writer of" and 1 million instances of "is the author of", the Jaccard similarity would rate these two phrases as highly dissimilar, even though they are semantically similar and should receive a high score. This would defeat our purpose. The overlap similarity avoids exactly this scenario and better suits our use case.
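As a worked illustration of this scenario, assume (this is not stated in the text) that the single extracted instance of "is the writer of" also occurs among the one million instances of "is the author of", so that $n = 1$ and the union of the two sets has $10^6$ elements. Under that assumption the two coefficients differ by six orders of magnitude:

```latex
\[
\mathrm{jaccard}(r_i, r_j) = \frac{n}{|f_{r_i} \cup f_{r_j}|} = \frac{1}{10^{6}} = 10^{-6},
\qquad
ov(r_i, r_j) = \frac{n}{\min(|f_{r_i}|, |f_{r_j}|)} = \frac{1}{\min(1, 10^{6})} = 1 .
\]
```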

Measuring similarity with overlap coefficients captures relations that share instances as arguments, but relations often do not share any. We need something more sophisticated to determine whether relations such as "are essential in" and "is vital in" are similar even when they have no instances in common. We use WordNet [Mil95] as our lexical reference for computing similarities in these more complicated cases. In particular, we used a similarity API⁵ which internally uses a hybrid approach involving statistics from a large corpus [KHY+14, HKF+13]. The RESTful [RR07] API allows us to retrieve the score for a given pair of relation phrases $r_i$ and $r_j$. We denote this score as $wn(r_i, r_j)$.

Eventually, we form a linearly weighted combination of these two scores to define our intra-node affinity score $sim(r_i, r_j)$, which is given as

$$sim(r_i, r_j) = \beta \cdot ov(r_i, r_j) + (1 - \beta) \cdot wn(r_i, r_j) \tag{22}$$

where $\beta$ is a weighting factor ($0 \le \beta \le 1$). In Section 11.2.2, we present an empirical analysis of the choice made for $\beta$ and discuss its effect on the overall clustering task. Applying the measure to the phrases $r_i$ = "is the capital of" and $r_j$ = "is the capital city of", for example, we obtain the score 0.719 with $wn(r_i, r_j) = 0.829$ and $ov(r_i, r_j) = 0.585$ if we set $\beta = 0.45$.
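The weighted combination and the numeric example above can be checked with a short sketch. The function name `combined_similarity` and the parameter name `beta` for the weighting factor are labels chosen for this illustration; the two input scores are taken directly from the text rather than recomputed.

```python
def combined_similarity(ov, wn, beta):
    """Linearly combine the overlap score and the WordNet-based score."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * ov + (1.0 - beta) * wn


# Scores reported in the text for "is the capital of" / "is the capital city of".
score = combined_similarity(ov=0.585, wn=0.829, beta=0.45)
print(round(score, 3))  # 0.719
```

With a weight of 0.45 on the overlap score, the WordNet-based score contributes slightly more than half of the final value, which is why the result lies closer to 0.829 than to 0.585.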

Figure 14: (a) A weighted undirected graph representing similarities between nodes. (b) The same graph showing the transition probabilities. The directed edges represent the probability of moving to the connecting node. Note that the bold values add up to 1. p1: is a village in; p2: is a town in; p3: is a suburb of; p4: currently resides in; p5: currently lives in; nodes of the same color are eventually in the same cluster.

Additionally, one can use an n-gram⁶ overlap similarity as a further measure. The given phrases can be split into 2- or 3-grams and a cosine similarity score can be computed. This works well for phrases that are similar token-wise, for instance "is located in"⁷ and "is city located in". However, it gives a low score for a pair of relations with a similar meaning but low token overlap, like "spouse of" and "married to". Moreover, there is a wide range of other similarity measures one could use for the purpose, but we chose the WordNet API and the overlap. The former is a one-step solution which considers distributional semantics, statistical coverage and token similarities but not the instance overlap. This is nicely complemented by the overlap score, which in turn misses the semantics captured by the WordNet-based score. Hence, in our opinion, the above similarity measure is a simple yet comprehensive choice for the task: we consider two broad aspects which determine pattern similarities and linearly combine them into one unified score.

⁵ http://swoogle.umbc.edu/SimService/
⁶ A sequence of n items, where items can be letters, words, syllables, etc.
⁷ Using characters, a 2-gram would look like [is, _l], [_l, oc], [oc, at], . . .
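The n-gram measure mentioned in this section can be sketched as follows. This is an illustrative implementation under stated assumptions, not the one used in the work: it uses sliding character n-grams with spaces written as underscores, which is one common convention and may differ from the exact tokenization the authors had in mind, and the function names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(phrase, n):
    """Count sliding character n-grams; spaces are written as '_'."""
    text = phrase.lower().replace(" ", "_")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_cosine(phrase_a, phrase_b, n=2):
    """Cosine similarity between the n-gram count vectors of two phrases."""
    a = char_ngrams(phrase_a, n)
    b = char_ngrams(phrase_b, n)
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: phrases shorter than n have similarity 0
    return dot / (norm_a * norm_b)


# Token-wise similar phrases versus semantically similar ones with no overlap.
print(ngram_cosine("is located in", "is city located in"))
print(ngram_cosine("spouse of", "married to"))
```

The first pair shares most of its character 2-grams and therefore scores high, whereas "spouse of" and "married to" share none and score zero, which is the weakness of a purely token-based measure described above.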

In document Automated Knowledge Base Extension Using Open Information (Page 130-132)