Results: Cluster Based Method - Automated Knowledge Base Extension Using Open Information

In Chapter 10 we introduced the different clustering schemes. Here, we evaluate them and discuss the effects of some of the parameters introduced there. First, we define the notion of a "good" cluster by presenting an intrinsic clustering quality measure (Section 11.2.1). Second, we use this measure to choose the optimal parameter values (Section 11.2.2), in particular the graph clustering parameter (inflation factor φ, introduced in Section 10.2.2) and the linear weight aggregation factor (β, introduced in Section 10.2.1).

11.2.1 Metric

We define a cluster quality score that considers two factors: intra-cluster and inter-cluster sparseness. For a set of clusters, C = {c_1, ..., c_|C|}, we measure the cluster outputs in terms of a quality measure [RL99, DJ79], denoted by S and defined as

\[ S = \left( \frac{\sum_{c_i \in C} \mathrm{comp}(c_i)}{\mathrm{iso}(C)} \right)^{-1} \]

where comp(c_i) denotes compactness and is defined as

\[ \mathrm{comp}(c_i) = \min\big(\mathrm{sim}(r_i, r_j)\big); \quad \forall\, r_i, r_j \in c_i \]

The authors of [RL99] refer to this measure as "Separated Clusters", since it evaluates a clustering by the separation of its clusters relative to their compactness. Intuitively, comp(c_i) measures how tightly any two arbitrary phrases r_i and r_j are connected in cluster c_i by looking at the minimum pairwise score between all elements in c_i. Note that comp(c_i) is defined only if a cluster has at least two elements; otherwise, we set it to zero.

The metric iso(C) measures isolation. It is defined as


It denotes how sparsely the elements are distributed across the clusters in C. Ideally, for a good clustering scheme, every cluster c_i should contain very similar elements, i.e. high compactness, and there should be very low similarity between elements across different clusters, i.e. low isolation. This tends to make S low for good clustering schemes.
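The score can be illustrated with a minimal Python sketch. The definition of comp follows the text above. The equation for iso(C) is not reproduced in this section, so the sketch assumes it to be the maximum similarity between two phrases from different clusters, which matches the description of isolation but is an assumption. The function names, the toy phrases and their similarity values are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def compactness(cluster, sim):
    """comp(c_i): minimum pairwise similarity within one cluster (0 for singletons)."""
    if len(cluster) < 2:
        return 0.0
    return min(sim(a, b) for a, b in combinations(cluster, 2))


def isolation(clusters, sim):
    """ASSUMED iso(C): maximum similarity between phrases of different clusters."""
    return max(
        (sim(a, b) for c1, c2 in combinations(clusters, 2) for a in c1 for b in c2),
        default=0.0,
    )


def quality_score(clusters, sim):
    """S = (sum_i comp(c_i) / iso(C))^-1, computed as iso(C) / sum_i comp(c_i)."""
    total_comp = sum(compactness(c, sim) for c in clusters)
    if total_comp == 0:
        return float("inf")  # convention of this sketch: no compact cluster at all
    return isolation(clusters, sim) / total_comp


# Hypothetical similarity scores for four relational phrases (illustration only).
TOY_SIM = {
    frozenset(("is born in", "was born in")): 0.9,
    frozenset(("is capital of", "capital city of")): 0.8,
    frozenset(("is born in", "is capital of")): 0.1,
    frozenset(("is born in", "capital city of")): 0.2,
    frozenset(("was born in", "is capital of")): 0.1,
    frozenset(("was born in", "capital city of")): 0.05,
}


def toy_sim(a, b):
    return TOY_SIM[frozenset((a, b))]


good = [["is born in", "was born in"], ["is capital of", "capital city of"]]
bad = [["is born in", "is capital of"], ["was born in", "capital city of"]]

print("S(good) =", quality_score(good, toy_sim))
print("S(bad)  =", quality_score(bad, toy_sim))
```

Under these assumptions, the grouping that keeps similar phrases together receives the lower score, in line with the remark that S is low for good clustering schemes.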

11.2.2 Parameter Search

In our cluster based workflows, we used two parameters: β, the weighting factor for the pairwise relation similarity scores, and φ, the inflation factor for the Markov clustering. In this section we present a principled way of choosing the optimal values for these parameters. We vary β from 0 to 1.0 in steps of 0.1. Each of these settings yields different pairwise scores for our set of relational phrases, so for every β we create a separate similarity file, which serves as input to the Markov clustering routine. There, we let the inflation φ vary from 2 to 40 in steps of 1. We also tried φ = 1, but the routine did not converge within a finite amount of time.

Furthermore, we chose 40 as the upper bound after observing that the cluster size did not vary much for φ > 30, so a maximum of φ = 40 was enough to capture the saturation trend. In total, we executed the Markov clustering routine 11 × 39 = 429 times (11 values of β times 39 values of φ). Each run resulted in a configuration that allocated the elements to clusters. These configurations differ from one another, and one particular combination of β and φ yields the best configuration. To make that judgement, we applied the metric S from Section 11.2.1 to each individual cluster configuration. The result is depicted in Figure 18(a). For a given φ, the scores obtained for all values of β are plotted along the y-axis. Recall that the lower the value of S for a configuration, the better the cluster formation. The figure shows a valley around 13 ≤ φ ≤ 16. We fitted a smoothed curve over these data points, which represents the general variation. Detailed analysis of this range revealed that the lowest score of S was obtained for φ = 14; this is expanded over the β values in Figure 18(b). Once φ was chosen, it was simple to pinpoint the β giving the best clustering. We attained it at β = 0.4.
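The parameter sweep described above can be sketched as a plain grid search. The sketch below only enumerates the 11 × 39 settings and keeps the one with the lowest score. The callable cluster_and_score is a hypothetical placeholder for "build the similarity file for β, run Markov clustering with inflation φ, compute S"; none of those steps is implemented here. The toy_score function is purely synthetic, with an arbitrarily placed minimum, and does not reproduce the reported results.

```python
from itertools import product

# Parameter ranges as described in the text.
BETAS = [round(0.1 * i, 1) for i in range(11)]  # 0.0, 0.1, ..., 1.0
PHIS = list(range(2, 41))                       # 2, 3, ..., 40


def grid_search(cluster_and_score, betas=BETAS, phis=PHIS):
    """Score every (beta, phi) pair and return the pair with the lowest S.

    cluster_and_score(beta, phi) is a hypothetical callable that runs one
    clustering configuration and returns its quality score S.
    """
    scores = {(b, p): cluster_and_score(b, p) for b, p in product(betas, phis)}
    (best_beta, best_phi), best_s = min(scores.items(), key=lambda kv: kv[1])
    return best_beta, best_phi, best_s, scores


def toy_score(beta, phi):
    """Synthetic stand-in for S; its minimum location is arbitrary, not a result."""
    return (beta - 0.7) ** 2 + ((phi - 5) / 40) ** 2


best_beta, best_phi, best_s, scores = grid_search(toy_score)
print(len(scores), "configurations evaluated")
print("lowest synthetic score at beta =", best_beta, "and phi =", best_phi)
```

Taking the global minimum over the grid is one way to read the procedure; the text first narrows φ down from the valley in Figure 18(a) and then selects β, which leads to the same pair whenever the lowest score is unique.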

[Figure 18, four panels: (a) Cluster Quality Score, S against Inflation, φ; (b) Quality Scores, S against beta, β; (c) #Clusters against Inflation, φ; (d) Cluster Quality Score, S (log base 10) against Inflation, φ, with legend entries naive, markov and k-medoid.]

Figure 18: (a) Variation of cluster quality, S, with inflation, φ. For a given inflation value, all the corresponding values for β are plotted and a trend line is fitted to capture the overall behavior. Comparison of the Markov clustering based scheme with a naive medoid based scheme. (b) Variation of cluster scores with β for φ = 14, where the minimum is attained. (c) Number of clusters depending on φ. (d) Comparison of the Markov, k-medoid and a naive clustering scheme with respect to the cluster quality scores. [DMS15]

We were also interested in how the cluster sizes change over the same set of variations. This is captured in Figure 18(c). Instead of the score S, we plot the cluster size (marked as #Clusters) for the same set of parameter values. It clearly reveals a trend, represented by the smoothed curve, that is consistent with Figure 18(a): a steady improvement phase (2 ≤ φ ≤ 13), optimality, then deterioration (16 ≤ φ ≤ 22) and eventually saturation (φ > 22).

11.2.3 Clustering Techniques

We compared the performance of the Markov clustering approach with two other clustering variants. The first one is a naive clustering. It selects k random relational phrases from an input set (in our case, k < 500) and assigns each of the remaining phrases (i.e. 500 - k) to the closest of these k seed relational phrases.
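The naive scheme can be sketched in a few lines of Python. The sketch assumes that "closest" means "highest pairwise similarity" and that the input phrases are unique; the function name, the Jaccard token overlap used as a stand-in similarity and the example phrases are all hypothetical and not taken from the experiments.

```python
import random


def naive_clustering(phrases, k, sim, rng=None):
    """Pick k random phrases and attach every other phrase to its most similar one.

    ASSUMPTIONS: phrases are unique, and the closest randomly chosen phrase is
    the one with the highest similarity; ties go to the earlier chosen phrase.
    """
    rng = rng or random.Random()
    seeds = rng.sample(phrases, k)
    clusters = {seed: [seed] for seed in seeds}
    for phrase in phrases:
        if phrase in clusters:  # the randomly chosen phrases are already placed
            continue
        nearest = max(seeds, key=lambda seed: sim(phrase, seed))
        clusters[nearest].append(phrase)
    return list(clusters.values())


def jaccard(a, b):
    """Stand-in similarity: token overlap between two phrases."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


phrases = [
    "is born in", "was born in", "born in",
    "is capital of", "capital city of", "is the capital of",
]
clusters = naive_clustering(phrases, k=2, sim=jaccard, rng=random.Random(0))
for cluster in clusters:
    print(cluster)
```

Because the k starting phrases are drawn at random, the result depends on the draw, and an unlucky draw can place two near-synonymous phrases in different clusters.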
