

4.1.1 Statistical Relational Learning for KG Completion

A knowledge graph can be defined probabilistically, in the sense that facts need not be restricted to the binary domain of true or unknown. Instead, facts can be modeled as random variables whose probability of being true can take any value in the range [0, 1]. More precisely, each fact or triple $\langle h, r, t \rangle$, with head entity $h$, relation $r$, and tail entity $t$, is defined to follow a Bernoulli distribution:

$$\mathcal{K}_{h,r,t} \sim \mathrm{Ber}(p_{h,r,t})$$

where $p_{h,r,t}$ is determined by a scoring function $f$ that is proportional to the likelihood of the triple being true with respect to a set of parameters $\theta$, which either weight observed features or represent latent features of the respective triple. Given these preliminaries, one can define the task of statistical KG completion as an extension of link prediction that is concerned with inferring missing links in the overall knowledge graph.

Problem of KG Completion. The KG can be represented as an adjacency tensor $T$ (see Figure 4.1), formally:

$$T_{h,r,t} = \begin{cases} 1, & \text{if } \langle h, r, t \rangle \in \mathcal{K} \\ 0, & \text{else} \end{cases}$$

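Assuming triples are independent given their parameters $\theta$, KG completion can then be defined as computing the maximum likelihood estimate:

$$\hat{T} = \arg\max_{T} \prod_{h \in \mathcal{E}} \prod_{r \in \mathcal{R}} \prod_{t \in \mathcal{E}} \mathrm{Ber}\left(T_{h,r,t} \mid f(\theta_{h,r,t})\right)$$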

This can be seen as calculating a likelihood over all possible worlds, defined over the combinations in $\mathcal{E} \times \mathcal{R} \times \mathcal{E}$. For realistic KGs this is a huge space, yet only a small fraction of the triples are likely to be true. Hence, a relational learning model should exploit this sparseness and find a lower-dimensional representation.
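The adjacency tensor and the independent-Bernoulli likelihood can be illustrated with a minimal NumPy sketch. The toy entities, relations, and triples are invented for illustration, and the helper names `build_adjacency_tensor` and `bernoulli_log_likelihood` are hypothetical, not part of any referenced implementation.

```python
import numpy as np

# Hypothetical toy KG; all names below are made up for illustration.
entities = ["alice", "bob", "acme"]
relations = ["knows", "works_for"]
triples = [
    ("alice", "knows", "bob"),
    ("alice", "works_for", "acme"),
    ("bob", "works_for", "acme"),
]

ent_idx = {e: i for i, e in enumerate(entities)}
rel_idx = {r: i for i, r in enumerate(relations)}


def build_adjacency_tensor(triples, ent_idx, rel_idx):
    """Return T with T[h, r, t] = 1 if <h, r, t> is in the KG, else 0."""
    T = np.zeros((len(ent_idx), len(rel_idx), len(ent_idx)), dtype=np.int8)
    for h, r, t in triples:
        T[ent_idx[h], rel_idx[r], ent_idx[t]] = 1
    return T


def bernoulli_log_likelihood(T, P, eps=1e-12):
    """Log-likelihood of T when every cell is an independent Bernoulli with probability P."""
    P = np.clip(P, eps, 1.0 - eps)  # avoid log(0)
    return float(np.sum(T * np.log(P) + (1 - T) * np.log(1.0 - P)))


T = build_adjacency_tensor(triples, ent_idx, rel_idx)
density = T.sum() / T.size  # fraction of the |E| x |R| x |E| cells set to 1
```

Even in this tiny example only 3 of the 18 cells are non-zero, which hints at the sparseness that a relational learning model is expected to exploit.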

One of the first KG completion approaches was the tensor factorization-based RESCAL model [NTK11]. It allows a concise definition of the KG completion problem using the notion of a KG as a probabilistic tensor. RESCAL computes a bilinear model as an approximation $\hat{T}$ of the original tensor:

$$\hat{T}_{h,r,t} = f(\theta_{h,r,t}) = \mathbf{h}^{\top} W_r \mathbf{t}$$

where the parameters of the entities $h$, $t$ and relations $r$ occurring in $\mathcal{K}$ are embedded in a low-dimensional, say $d$-dimensional, vector space. The entity vectors $\mathbf{h}, \mathbf{t} \in \mathbb{R}^d$ together form the entity parameter matrix $W_E$. The relation parameters define the core tensor consisting of matrices $W_r \in \mathbb{R}^{d \times d}$. Hence, relating to the above definitions, $\theta = \{W_E\} \cup \{W_r\}_{r \in \mathcal{R}}$. These parameters are typically referred to as latent representations or simply embeddings. The approximated tensor now holds a probability-like score for every possible fact in the KG.
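As a rough illustration of the bilinear form, the following NumPy sketch scores a single triple and then fills the whole approximated tensor in one call. The sizes, the random initialization, and the function names `rescal_score` and `rescal_all_scores` are assumptions made for this example. No training is performed, and storing entity vectors as rows of `W_E` is a layout choice of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_relations, d = 4, 2, 3  # toy sizes, chosen arbitrarily

W_E = rng.normal(size=(n_entities, d))      # one d-dimensional row per entity
W_R = rng.normal(size=(n_relations, d, d))  # one d x d matrix W_r per relation


def rescal_score(h, r, t):
    """Bilinear score h^T W_r t for the index triple (h, r, t)."""
    return float(W_E[h] @ W_R[r] @ W_E[t])


def rescal_all_scores():
    """Approximated tensor whose entry [h, r, t] equals h^T W_r t."""
    return np.einsum("hi,rij,tj->hrt", W_E, W_R, W_E)


T_hat = rescal_all_scores()
```

Materializing `T_hat` is only feasible for toy sizes. For realistic KGs one would score selected triples or mini-batches rather than all of $\mathcal{E} \times \mathcal{R} \times \mathcal{E}$.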

Another major family of models are the so-called vector translation-based models, such as TransE, which have been among the forerunners in the representation learning domain. In the TransE model [BUWY13], given $\mathcal{K}$, such an $f$ relies on the distance or similarity between the vectors of entities and relations. TransE follows the intuition that there is a linear relation $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$ for triples; hence the scoring function is defined as a dissimilarity measure (e.g., the $\ell_2$-norm):

$$f(h, r, t) = \lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert_2^2$$

This means that translating entity $h$ with relation $r$ should end up close to its tail entity $t$ in the latent $d$-dimensional space. To prevent overfitting, the parameter vectors in TransE are normalized to unit norm after each mini-batch, i.e., $\forall e \in \mathcal{E}: \lVert \mathbf{e} \rVert = 1$.
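A small sketch of the translation-based score and the unit-norm projection, using hand-picked two-dimensional vectors. The function names `transe_score` and `normalize_rows` and the example vectors are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np


def transe_score(h, r, t):
    """Squared L2 dissimilarity ||h + r - t||_2^2; lower means more plausible."""
    diff = h + r - t
    return float(np.dot(diff, diff))


def normalize_rows(E):
    """Rescale every entity vector (row of E) to unit L2 norm."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    return E / np.maximum(norms, 1e-12)  # guard against zero vectors


h = np.array([1.0, 0.0])
r = np.array([0.0, 1.0])
t_good = np.array([1.0, 1.0])   # exactly h + r, so the dissimilarity is 0
t_bad = np.array([-1.0, 0.0])   # far from h + r

score_good = transe_score(h, r, t_good)  # 0.0
score_bad = transe_score(h, r, t_bad)    # (2, 1) . (2, 1) = 5.0
```

Because $f$ is a dissimilarity here, a plausible triple receives a low value, which matters when the score is plugged into a ranking loss.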

Interestingly, many of these representation learning model families have been shown to be effectively trained by using a ranking loss with the objective that true triples should be ranked before false/unknown ones according to the scoring function. This learning objective is formulated as minimizing a margin-based ranking loss:

$$L_{\mathcal{K}} = \sum_{(h,r,t) \in \mathcal{K}} \; \sum_{(h',r,t') \in \mathcal{N}} \max\left(0,\; \gamma + f(h, r, t) - f(h', r, t')\right) \qquad (4.1)$$

where $h, r, t$ are observed in $\mathcal{K}$ and $h', r, t'$ are sampled from $\mathcal{N}$, which is a set of negative examples, i.e., presumably false triples not contained in $\mathcal{K}$. This loss is minimized when the true triples outscore the false ones by a constant margin $\gamma$. In practice, training is done using mini-batches of $\mathcal{K}$ instead of iterating over all triples, together with stochastic gradient descent (SGD), since this introduces more variance in the embedding parameter updates and can prevent early convergence to local optima. RESCAL, as well as its simpler variant DistMult, can also be formulated within this negative-sampling framework of weight updates, instead of using closed-form alternating least squares (ALS) updates.
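The margin-based ranking loss can be sketched for one mini-batch as follows. This is a simplified illustration: it assumes $f$ is a dissimilarity (lower is better, as in TransE) and pairs each positive triple with exactly one sampled negative, rather than summing over every positive-negative combination. The function name `margin_ranking_loss` and the example scores are hypothetical.

```python
import numpy as np


def margin_ranking_loss(f_pos, f_neg, gamma=1.0):
    """Sum of max(0, gamma + f_pos - f_neg) over paired positive/negative scores.

    Assumes f is a dissimilarity, so a true triple should score *lower*
    than its corrupted counterpart by at least the margin gamma.
    """
    f_pos = np.asarray(f_pos, dtype=float)
    f_neg = np.asarray(f_neg, dtype=float)
    return float(np.sum(np.maximum(0.0, gamma + f_pos - f_neg)))


# Made-up scores for a mini-batch of two positive/negative pairs.
f_pos = [0.2, 1.5]
f_neg = [2.0, 1.0]
loss = margin_ranking_loss(f_pos, f_neg, gamma=1.0)
# Pair 1: max(0, 1 + 0.2 - 2.0) = 0    (margin already satisfied)
# Pair 2: max(0, 1 + 1.5 - 1.0) = 1.5  (margin violated)
```

Only pairs that violate the margin contribute to the loss, so in a gradient-based setup they are also the only pairs that produce parameter updates.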

Note that the closed-form solution of RESCAL corresponds to adopting the closed-world assumption (CWA), since every unobserved fact is considered to be false. In terms of negative sampling, different assumptions can be made, such as the local closed-world assumption (LCWA). Under the LCWA, for a particular subject-predicate pair $h$ and $r$, it is assumed that any $\langle h, r, \cdot \rangle$ that is not observed in the KG is indeed false and can be used as a negative sample [NMTG15].
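One way to draw negatives in the spirit of the LCWA as described here is to corrupt the tail of an observed triple and keep only tails that are unobserved for the same head and relation. The sketch below uses integer ids and a hypothetical helper `lcwa_negatives`; it is an illustration under these assumptions, not the sampling procedure of any particular library.

```python
import random


def lcwa_negatives(triple, known_triples, entities, k=1, rng=None):
    """Sample up to k corrupted triples (h, r, t') with (h, r, t') not observed.

    `triple` is an observed (h, r, t); `known_triples` is the set of all
    observed triples; `entities` lists the candidate tail entities.
    """
    rng = rng or random.Random()
    h, r, _ = triple
    candidates = [e for e in entities if (h, r, e) not in known_triples]
    chosen = rng.sample(candidates, min(k, len(candidates)))
    return [(h, r, e) for e in chosen]


# Toy KG with integer ids: head 0 has two observed tails for relation 0.
known = {(0, 0, 1), (0, 0, 2), (3, 1, 0)}
all_entities = [0, 1, 2, 3]
negatives = lcwa_negatives((0, 0, 1), known, all_entities, k=5, rng=random.Random(42))
```

Here only the tails 0 and 3 are unobserved for the pair (0, 0), so at most two negatives can be returned even though `k=5` was requested.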

4.2 Existing Background-enhanced KG Completion
