Linear Regression - Automated Knowledge Base Extension Using Open Information

Model Fit: In this part we work out the error calculation for the linear model

we employ for our relation matching step. Given a set of data points, D in a 2- dimensional feature space we can estimate the linear dependency between the two variable using a linear regression model. In our context, D contains the point { . . . , (Kp, ⌧minp ), . . . }, for every particular OIE relations p. Hence, the data set can

be represented as, 8p 2 Tp,

D= (Kp, ⌧minp ) | Kp2 R>0; 06 Kp6 100 (17)

We select a subset, D0 _{(⇢ D), for which we try to model with a linear regression}

line. The only difference the subset has is that every data point has Kp value

greater than a particular threshold. The choice is made based on the linear trend as observed in Figure17and16. For this subset of data points, fitting a regression line necessarily means to find a straight line passing through the set of points in a way such that it is the best fitting line. Best usually is defined as minimizing the sum of the squared errors. Hence, the dependent variable ⌧min

p can be predicted

by the linear formulation with Kpas,

ˆ⌧p= ✓0⇤ Kp+ ✓1 (18)

where ˆ⌧pis an estimate of ⌧minp and ✓0, ✓1 are the parameters. The sum of errors

is given by, error= |D0| X p (ˆ⌧p- ⌧minp )2 (19)

which is simply the sum of the squares of the difference in the actual value of ⌧min

p and its estimated values for all the points in the data set. Intuitively, if the

actual and estimated values are identical, that signifies zero error at that point and if that is true for all the data points, then it is the perfect fit. Or, geometri-

9.2 linear regression 103

cally, all the points lie on the regression line. It is evident from Equation18 that the parameters play an important role, ✓0 determines the slope of the regression

line and ✓1 gives the intercept. The choice of these two parameters can be easily

solved by gradient descent algorithm, which involves altering the values of the two parameters iteratively, such that the cost function, the error (in Equation19) is minimized in every step. However, to avoid falling into local minima, stochastic gradient decent may help. We used the standard java math libraries from Apache foundation to estimate these parameters4_.

Mean Squared Error (MSE): Once the linear model is generated, we can use it to

compute the error for any arbitrary data points. The sum of errors would give us a sense of the goodness of fit for the linear model. While calculating the error, we now also consider the data points which were not considered previously during model fitting. Thus the MSE is given by,

MSE= 1 |D|⇤ |D| X i [(✓0⇤ Kp+ ✓1)i- ⌧i]2 (20)

For any arbitrary point i, the estimated value is given by (✓0⇤ Kp+ ✓1)iwhile the

actual value is ⌧iwhich readily allows us to compute the error. Thus a lower MSE

indicates better model fitting. We chose MSE as an error measure, however, one can also choose Mean absolute Error (MAE) as an alternate performance indicator. Is is to be noted that the index i runs over D0 _{while model fitting but for error}

calculation, it uses the whole data set. In Figure17, we vary the threshold value gradually starting from as low as 1% and moving till 50%. For each threshold, a new linear model is learnt, and fitted on the data set. This gives varying values MSE for varying threshold. We observed that, extreme low and high threshold led to worse MSE. Finally, we select the threshold values which gives us the lowest MSE.

10

C L U S T E R B A S E D A P P R O A C H

In this chapter we outline an alternative relation mapping approach. The idea in this chapter is to have a natural grouping of the relations which are semantically synonymous. Once a natural grouping is achieved, it can be safely assumed that they all bear similar semantics. Such a group is referred to as a cluster of relational phrases. In this regard, we perform two clustering methodologies. First, cluster only the OIE relations without any additional seed inputs from KB relations. The clusters thus created can be mapped to one or more KB relations. This is very analogous to the rule based method proposed in the previous chapter, the only difference being, now we treat clusters of relations as a unit instead of single relations. This is a rational choice to group synonymous phrases into a cluster which can then be treated as one collective unit. The importance of this particular scheme is more applicable to Reverb than to Nell. The former data set often contains relation phrases which are natural language textual phrases and it is a natural choice to bundle them together. While the relations in Nell are necessarily not textual phrases but normalized into a form unique to Nell’s own internal hierarchical structure1_{. Second, cluster together both the OIE and KB relations} with seeds from the later. This forms a group with a collection of OIE relations with one or more semantically similar KB relations. We explore these two methodologies and design our system workflows around these.

For computing these clusters, we explore and analyze two different schemes: graph-based and vector-based. For the graph-based method, we use Markov clustering (Section10.2) in which nodes represent relational phrases and edges weigh the degree of similarity of the connected phrases. As a vector-based scheme we opt for k-mediod (Section10.1) where every relational phrase is represented as a vector in feature space. Although the clustering technique is more applicable to the Reverb data set, our algorithm makes no basic assumption of the underlying data sets and is oblivious to any specific structure within the relations.

1 http://rtw.ml.cmu.edu/rtw/kbbrowser/

106 cluster based approach

We briefly outline the major differences between the graph-based and vector- based schemes, before getting into the methodology specifications. Consider a set of items which needs to be grouped in sets of closely related items. Vector- based clustering necessarily tries to represent each data points (or items) in a high dimensional feature space, typically dimension greater than 2. The item (data) representation is usually made of from a vector of features. Hence the name vector-based clustering. We will discuss these details in Section 10.1. In contrast, a graph-based scheme purely trusts on the pairwise affinity scores between two data points and the assumption that the elements are distinct nodes in a graph. Hence, the former has the onus of representing data points as a vector while the later has the onus of defining a similarity function between items.

10.1 vector based clustering

Given a set of n data points, the k-medoid clustering [JH10] algorithm initializes kmedoids which act as central points. Every other data point is assigned closest to one of these medoids. This scheme iteratively toggles between two steps, ini- tialization of a medoid and assignment of the data points nearest to the chosen medoid. After the first iteration of assignment phase, the new mediod is calcu- lated by minimizing the cost2 _{for the configuration. The new mediod giving the} best configuration (least cost) is chosen and the re-assignment is repeated. The algorithm runs till convergence, i.e. no more newer medoids can be found leading to a lower cost. This is a stable configuration with a set of k disjoint clusters. This method is similar to the k-means algorithm [HW79] but with the only difference that for each iteration, the actual data points form the new medoids, unlike the means of data points as done in k-means clustering. We choose k-medoid scheme over its sibling, due to its better robustness to noise and outliers [Vel12].

In our setting, every OIE property is represented as a vector of n-elements. We refer to the elements as features and likewise the representative vector as a feature vector. In particular, we use two feature spaces, one for domains and the other for ranges for the OIE relations. They are exactly identical just that the features are different. The need for two spaces will be clarified shortly. For simplicity, consider

2 Cost is usually computed as the sum of all the pairwise Manhattan distances between the points. One can use other measures as well.

10.1 vector based clustering 107

three Reverb relations: is situated in, is a region in and is a part of. Let each has exactly two instances. We present the instances for one of the relations here:

is a region in (Asturias, Spain) is a region in (Amsterdam, Nederland)

We describe the details with one relation and its instances, the description is the same for the other relations as well. With a refined instance mapping already computed, the mapping for the subject terms in the two relation instances above are: {. . . , (Asturias,db:Asturias), (Amsterdam,db:Amsterdam), . . . }. We present the mappings in the form of mapping pairs as defined by M (Expression1). The dots denote the existence of the object mappings as well as the other mappings from rest of the relation instances. Once correctly mapped, it is not hard to find the domain of the relation, is a region in. In this case it isdbo:Settlementanddbo:City (respective instance types of db:Asturias and db:Amsterdam). Analogously, the other two of the remaining OIE relations would have similar vector representation which are presented below:

is situated in ⌘ <dbo:Settlement,dbo:City, 0 > is a region in ⌘ <dbo:Settlement,dbo:City, 0 >

is a part of ⌘ < 0,dbo:City,dbo:Album>

In this example, the domain space spans over three axes:dbo:Settlement,dbo:City anddbo:Album. Hence, the relation is a region in spans over 2 out of the 3 axes in the domain feature space. We see that the feature vector is of length 3 and each feature is either present or absent. For instance, "is a part of " hasdbo:Settlement feature absent, hence marked 0. This is an over simplified example with 3 dimensions only to make the intuition transparent. In reality we have extremely high dimensions for a given property. In our implementation, we use integer valued ids to identify the DBpedia classes and also maintain an inverse lookup table to map the ids back to its respective textual representations. Ideally, we had extremely long vectors, whose length was typically in the order of the input data size. Generally, a feature vector for some relation p would look like:

p ⌘ < (1 = 0), (2 = 0), (3 = 0), (4 = 1), · · · >

The feature element (1=0) means the DBpedia class indexed "1" (e.g.dbo:City) is not a feature for this relation, similarly (4=1) means the class indexed with "4" is a

108 cluster based approach p1 p2 p3 p4 p5 p6 d1 DomainSpace < 1, 0 > r1 RangeSpace < 1, 0, 0 > d1 < 1, 0 > r1 < 1, 0, 0 > d1 < 1, 0 > r1 < 1, 0, 0 > d1 < 1, 0 > r2 < 0, 1, 0 > d2 < 0, 1 > r3 < 0, 0, 1 > d2 < 0, 1 > r3 < 0, 0, 1 >

Figure 13: A schematic representation of the k-medoid based clustering technique using the OIE relation vector representation in two different feature spaces. Adjacent to every domain and range values, the feature vector is written, for instance, p₄_{⌘ < 1, 0 > in the domain space.}

feature. It is obvious that such feature vectors can be extremely sparse with lots of zeros. We used the sparse vector data structure3_{for this purpose which maintains} a extremely compact form of the above vector with just the existing features.

We maintain a similar feature space for the range as well. The broad idea is to have a projection of a given relation in two feature spaces; cluster based on one projection (domain features) and then refine further with the other projection (range features). This is not a strict order, one can also start from the range space with identical outcomes. This is schematically represented in Figure13, where we present six properties with their respective associated domain and ranges. We ob- serve that, domain space is two-dimensional with d1 and d2 as axes, while range

space is three-dimensional with axes r1, r2 and r3. Thus, in the range space, p2

has a representation of <1, 0, 0>. We first cluster relations on the domain space (represented by thick rectangles) giving us two clusters {p1, p2, p3, p4} and {p5,

p6}. Now, considering the projections in range space as well we find further sub-

clusters (represented by dotted rectangles), finally leading to three clusters {p1,

p2, p3}, {p4} and {p5, p6}.

3 http://lenskit.org/. This project specializes in developing tools for recommender systems, how-

In document Automated Knowledge Base Extension Using Open Information (Page 122-129)