Statistical Relational Learning for KG Completion


4.1.1 Statistical Relational Learning for KG Completion

A knowledge graph can be defined probabilistically, in the sense that facts need not be restricted to the binary domain of true or unknown. Instead, each fact can be modeled as a random variable whose probability of being true lies in the range $[0, 1]$. More precisely, each fact or triple $\langle h, r, t \rangle$, with head entity $h$, relation $r$, and tail entity $t$, is defined to follow a Bernoulli distribution:

$$K_{h,r,t} \sim \mathrm{Ber}(p_{h,r,t})$$

where the parameter $p_{h,r,t}$ is given by $f(\theta_{h,r,t})$, and $f$ is a scoring function that is proportional to the likelihood of the triple being true with respect to a set of parameters $\theta$ that either weight observed features or represent latent features of the respective triple. Given these preliminaries, one can define the task of statistical KG completion as an extension to link prediction that is concerned with inferring missing links in the overall knowledge graph.

Problem of KG Completion. The KG $\mathcal{K}$ is represented as an adjacency tensor $T$ (see Figure 4.1), formally:

$$T_{h,r,t} = \begin{cases} 1, & \text{if } \langle h, r, t \rangle \in \mathcal{K} \\ 0, & \text{else} \end{cases}$$
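As an illustration, the adjacency tensor can be built from a set of triples as in the following minimal sketch. The entity and relation names, the index dictionaries, and the helper `adjacency_tensor` are assumptions made for this example and do not come from the source.

```python
# Toy vocabulary and facts; all names are illustrative assumptions.
entities = ["pump", "motor", "plant"]
relations = ["part_of", "drives"]
triples = {("pump", "part_of", "plant"), ("motor", "drives", "pump")}

ent_idx = {e: i for i, e in enumerate(entities)}
rel_idx = {r: i for i, r in enumerate(relations)}


def adjacency_tensor(triples, ent_idx, rel_idx):
    """Return T as nested lists indexed T[h][r][t], with 1 for observed triples."""
    n, m = len(ent_idx), len(rel_idx)
    T = [[[0] * n for _ in range(m)] for _ in range(n)]
    for h, r, t in triples:
        T[ent_idx[h]][rel_idx[r]][ent_idx[t]] = 1
    return T


T = adjacency_tensor(triples, ent_idx, rel_idx)

# The tensor has |E| * |R| * |E| cells, of which only the observed triples are 1.
num_cells = len(entities) * len(relations) * len(entities)
num_true = sum(v for plane in T for row in plane for v in row)
```

Even in this toy case only 2 of the 18 cells are non-zero, and the gap widens quickly because the number of cells grows quadratically with the number of entities.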

Assuming triples are independent given their parameters θ, the KG completion can then be defined as computing the maximum likelihood estimate:

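$$\hat{T} = \operatorname*{arg\,max}_{T} \prod_{h \in \mathcal{E}} \prod_{r \in \mathcal{R}} \prod_{t \in \mathcal{E}} \mathrm{Ber}\big(T_{h,r,t} \mid f(\theta_{h,r,t})\big)$$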
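Under the independence assumption the product over all cells becomes a sum of log-terms, which is how such a likelihood is usually evaluated numerically. The sketch below assumes that the Bernoulli parameters are already available as a tensor `P` of probabilities; the function name, the clamping constant `eps`, and the toy values are assumptions made for this example.

```python
import math


def bernoulli_log_likelihood(T, P, eps=1e-12):
    """Sum of log Ber(T[h][r][t] | P[h][r][t]) over all cells, assuming independence."""
    total = 0.0
    for T_h, P_h in zip(T, P):
        for T_hr, P_hr in zip(T_h, P_h):
            for x, p in zip(T_hr, P_hr):
                p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
                total += x * math.log(p) + (1 - x) * math.log(1.0 - p)
    return total


# Two entities, one relation: a 2 x 1 x 2 tensor with a single true triple.
T = [[[0, 1]], [[0, 0]]]
P_good = [[[0.1, 0.9]], [[0.1, 0.1]]]  # high probability on the true cell
P_bad = [[[0.9, 0.1]], [[0.9, 0.9]]]   # high probability on false cells

ll_good = bernoulli_log_likelihood(T, P_good)
ll_bad = bernoulli_log_likelihood(T, P_bad)
```

Parameters that place high probability on the observed triple and low probability elsewhere yield the larger log-likelihood, so `ll_good` exceeds `ll_bad`.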

This can be seen as calculating a likelihood over all possible worlds, i.e. over all combinations in $\mathcal{E} \times \mathcal{R} \times \mathcal{E}$. For realistic KGs this is a huge space; however, only a small fraction of triples are likely to be true. Hence, a relational learning model should exploit this sparseness and find a lower-dimensional representation.

One of the first KG completion approaches was the tensor factorization-based RESCAL model [NTK11]. It allows a concise definition of the KG completion problem using the notion of a KG as a probabilistic tensor. RESCAL computes a bilinear model as an approximation $\hat{T}$ of the original tensor:

$$\hat{T}_{h,r,t} = f(\theta_{h,r,t}) = \mathbf{h}^\top W_r \mathbf{t}$$

where the parameters of the entities $h$, $t$ and relations $r$ occurring in $\mathcal{K}$ are embedded in a low-dimensional, say $d$-dimensional, vector space. The entity vectors $\mathbf{h}, \mathbf{t} \in \mathbb{R}^d$ are collected in the entity parameter matrix $W_E$. The relation parameters define the core tensor consisting of matrices $W_r \in \mathbb{R}^{d \times d}$. Hence, relating to the above definitions, $\theta = \{W_E\} \cup \{W_r\}_{r \in \mathcal{R}}$. These parameters are typically referred to as latent representations or simply embeddings. The approximated tensor now holds a probability-like score for every possible fact in the KG.
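The bilinear score can be written out directly for small dimensions. The following is a minimal sketch with $d = 2$; the function name `rescal_score` and the hand-picked vectors and matrix are assumptions chosen for illustration, not learned parameters.

```python
def rescal_score(h, W_r, t):
    """Bilinear score h^T W_r t for d-dimensional h, t and a d x d matrix W_r."""
    d = len(h)
    return sum(h[i] * W_r[i][j] * t[j] for i in range(d) for j in range(d))


# Hand-picked 2-dimensional embeddings (illustrative only).
h = [1.0, 0.0]
t = [0.0, 1.0]
W_r = [[0.0, 2.0],
       [0.0, 0.0]]

score_forward = rescal_score(h, W_r, t)   # h^T W_r t
score_backward = rescal_score(t, W_r, h)  # t^T W_r h
```

Because $W_r$ is a full matrix and need not be symmetric, swapping head and tail generally changes the score, which lets the model represent directed relations.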

Another major family of models are the so-called vector translation-based models, such as TransE, which were among the forerunners in the representation learning domain. In the TransE model [BUWY13], $f$ relies on the distance or similarity between the vectors of entities and relations. Intuitively, TransE assumes a linear relation $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$ for triples in $\mathcal{K}$; hence the scoring function is defined as a dissimilarity measure (e.g. based on the $\ell_2$-norm):

$$f(h, r, t) = \lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert_2^2$$

This means that translating entity $h$ with relation $r$ should end up close to its tail entity $t$ in the latent $d$-dimensional space. In order to prevent overfitting, the magnitudes of parameters in TransE are normalized to unit-norm vectors after each mini-batch, i.e. $\forall e \in \mathcal{E}: \lVert \mathbf{e} \rVert = 1$.
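The translation intuition and the unit-norm constraint can be sketched as follows. The helper names `transe_score` and `normalize` and the toy vectors are assumptions made for this example.

```python
import math


def transe_score(h, r, t):
    """Squared L2 dissimilarity ||h + r - t||_2^2; lower means more plausible."""
    return sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t))


def normalize(v):
    """Rescale v to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Toy 2-dimensional embeddings (illustrative only).
h = normalize([3.0, 4.0])
r = [0.2, -0.2]
t_close = [0.8, 0.6]    # approximately h + r
t_far = [-0.6, -0.8]    # on the opposite side of the unit circle

score_close = transe_score(h, r, t_close)
score_far = transe_score(h, r, t_far)
```

A tail that lies near $\mathbf{h} + \mathbf{r}$ receives a dissimilarity close to zero, whereas a distant tail receives a large one.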

Interestingly, many of these representation learning model families have been shown to train effectively with a ranking loss, whose objective is that true triples should be ranked before false/unknown ones according to the scoring function. This learning objective is formulated as minimizing a margin-based ranking loss:

$$\mathcal{L}_{\mathcal{K}} = \sum_{(h,r,t) \in \mathcal{K}} \; \sum_{(h',r,t') \in \mathcal{N}} \max\big(0,\; \gamma + f(h, r, t) - f(h', r, t')\big) \tag{4.1}$$

where $(h, r, t)$ are observed in $\mathcal{K}$ and $(h', r, t')$ are sampled from $\mathcal{N}$, a set of negative examples, i.e. presumably false triples not contained in $\mathcal{K}$. This loss is minimized when the true triples are separated from the false ones by a constant margin $\gamma$; for a dissimilarity $f$ such as that of TransE, each negative triple must score at least $\gamma$ higher than its positive counterpart. In practice, training uses mini-batches of $\mathcal{K}$ instead of iterating over all triples, together with stochastic gradient descent (SGD), since this introduces more variance in the embedding parameter updates and can prevent early convergence to local optima. RESCAL, as well as its simpler variant DistMult, can also be formulated within this negative sampling framework of weight updating, instead of closed-form alternating least squares updates.
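The loss in Eq. (4.1) can be sketched for a single mini-batch as follows. This is a simplified illustration under several assumptions: each positive triple is paired with exactly one negative rather than summing over all of $\mathcal{N}$, negatives are produced by replacing the head or the tail with a random entity, and the TransE dissimilarity is used as $f$. The helper names and the toy embeddings are hypothetical.

```python
import random


def transe_score(h, r, t):
    """Squared L2 dissimilarity ||h + r - t||_2^2 (lower is more plausible)."""
    return sum((a + b - c) ** 2 for a, b, c in zip(h, r, t))


def corrupt(triple, entities, known, rng):
    """Replace head or tail with a random entity until the result is not in `known`."""
    h, r, t = triple
    while True:
        e = rng.choice(entities)
        candidate = (e, r, t) if rng.random() < 0.5 else (h, r, e)
        if candidate not in known:
            return candidate


def margin_ranking_loss(positives, negatives, ent, rel, gamma=1.0):
    """Eq. (4.1) restricted to the given (positive, negative) pairs."""
    loss = 0.0
    for (h, r, t), (h2, _, t2) in zip(positives, negatives):
        pos = transe_score(ent[h], rel[r], ent[t])
        neg = transe_score(ent[h2], rel[r], ent[t2])
        loss += max(0.0, gamma + pos - neg)
    return loss


# Toy embeddings in which a + likes = b holds exactly (illustrative only).
entities = ["a", "b", "c"]
ent = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [-1.0, 0.0]}
rel = {"likes": [1.0, 0.0]}
known = {("a", "likes", "b")}

rng = random.Random(0)
batch = list(known)
negatives = [corrupt(p, entities, known, rng) for p in batch]
loss = margin_ranking_loss(batch, negatives, ent, rel, gamma=1.0)
```

In this toy setting every possible corruption has a dissimilarity of at least 1, so with $\gamma = 1$ the margin is already satisfied and the loss is zero; a larger margin would produce a positive loss and hence a gradient signal.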

Note that the closed-form solution of RESCAL corresponds to taking the closed-world assumption (CWA), since every unobserved fact is considered to be false. In terms of negative sampling, different assumptions can be made, such as the local closed-world assumption (LCWA). Under the LCWA, for a particular subject-predicate pair $(h, r)$ it is assumed that any triple $\langle h, r, \cdot \rangle$ that is not observed in the KG is indeed false and can be used as a negative sample [NMTG15].
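The difference between the two assumptions can be made concrete by enumerating the negatives each one permits. The sketch below assumes that the LCWA is applied only to subject-predicate pairs that occur in at least one observed triple; the function names and the toy triples are hypothetical.

```python
def cwa_negatives(triples, entities, relations):
    """CWA: every unobserved triple in E x R x E counts as false."""
    observed = set(triples)
    return {(h, r, t)
            for h in entities for r in relations for t in entities
            if (h, r, t) not in observed}


def lcwa_negatives(triples, entities):
    """LCWA: for each observed (h, r) pair, every unobserved tail counts as false."""
    observed = set(triples)
    pairs = {(h, r) for h, r, _ in observed}
    return {(h, r, t)
            for h, r in pairs for t in entities
            if (h, r, t) not in observed}


# Toy KG (illustrative only).
entities = ["a", "b", "c"]
relations = ["likes", "knows"]
triples = {("a", "likes", "b"), ("a", "likes", "c"), ("b", "knows", "a")}

cwa = cwa_negatives(triples, entities, relations)
lcwa = lcwa_negatives(triples, entities)
```

Here the CWA yields 15 negatives, whereas the LCWA yields only 3, because nothing is assumed about pairs such as `("c", "likes")` for which no tail has been observed.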

4.2 Existing Background-enhanced KG Completion
