Lecture 1: Relational Data & Embedding Models

(1)

İsmail İlkan Ceylan Advanced Topics in Machine Learning, University of Oxford 18.01.2021

Relational Learning

1

(2)

Course Organisation

(3)

Course Organisation

2

Advanced Topics in Machine Learning: Research-oriented and somewhat less conventional

(4)

Course Organisation

2

course.

(5)

Course Organisation

2

course.

Themes of the lectures are:

(6)

Course Organisation

2

course.

• Relational Learning: 8 lectures by İsmail İlkan Ceylan and 1 guest lecture.

(7)

Course Organisation

2

course.

• Bayesian Machine Learning: 8 lectures by Atılım Güneş Baydın and 1 guest lecture.

(8)

Course Organisation

2

course.

For both themes, there will be a live-streamed guest lecture towards the end of the term. Please follow the announcements on the course website.

Assessment: Through a reproducibility challenge, as detailed in the assessment form available from

(9)

Course Organisation

2

course.

For both themes, there will be a live-streamed guest lecture towards the end of the term. Please follow the announcements on the course website.

Assessment: Through a reproducibility challenge, as detailed in the assessment form available from

the course webpage. More details will follow in due course.

Practicals: There are 6 practicals planned. These practicals will provide you the necessary technical

(10)

Course Structure: Relational Learning

(11)

Course Structure: Relational Learning

3

(12)

Course Structure: Relational Learning

3

The lectures cover some selected (and mostly recent) topics in relational learning. There are various subﬁelds of relational learning which are not covered in this course.

(13)

Course Structure: Relational Learning

3

(14)

Course Structure: Relational Learning

3

The (pre-recorded) lectures for relational learning are organised as follows: • Knowledge graphs and embedding models (2 lectures)

(15)

Course Structure: Relational Learning

3

• Graph neural networks (6 lectures)

(16)

Course Structure: Relational Learning

3

• Graph neural networks (6 lectures)

There will be dedicated oﬃce hours (from Week 4, onwards) for your questions. Please follow the announcements regarding this.

(17)

Overview of the Lecture

(18)

Overview of the Lecture

• _{Relational data}

(19)

Overview of the Lecture

• _{Knowledge graphs}

(20)

Overview of the Lecture

• _{Knowledge graph embedding models}

(21)

Overview of the Lecture

• _{Knowledge graph embedding models} • _{Model expressiveness}

(22)

Overview of the Lecture

• _{Model inductive capacity & inference patterns}

(23)

Overview of the Lecture

• _{Model inductive capacity & inference patterns} • _{Empirical evaluation: Datasets and metrics}

(24)

Overview of the Lecture

• _{Model inductive capacity & inference patterns} • _{Empirical evaluation: Datasets and metrics}

• _Summary

(25)

Relational Data

(26)

Relational Data

6

(27)

Relational Data

7

(28)

Relational Data

8

(29)

Relational Data

9

(30)

Relational Data

10

Knowledge Graphs, as graph-structured data models, storing relations (e.g., isFriendOf) between

(31)

Knowledge Graphs

(32)

Knowledge Graphs

(33)

Knowledge Graphs

12

(34)

Knowledge Graphs

12

Nairobi Kenya

(35)

(36)

(37)

(38)

(39)

Knowledge Graphs

12 Nairobi Kenya capitalOf Africa locatedIn neighbourOf Ethiopia locatedIn Somalia locatedIn cityIn • We consider a relational vocabulary that consists of a ﬁnite set

of entities, and a ﬁnite set of relations.

(40)

Knowledge Graphs

E

R

• A fact is an expression of the form , where and .

(41)

Knowledge Graphs

E

R

r(h, t)

r ∈ R,

h, t ∈ E

• Following a common convention, we refer to as the head and as the tail entity in a fact . In the literature, such facts are sometimes denoted as triples of the form , i.e., as “subject, predicate, object” triples.

h

t

r(h, t)

(42)

Knowledge Graphs

E

R

r(h, t)

r ∈ R,

h, t ∈ E

h

t

r(h, t)

(h, r, t)

• A knowledge graph (KG) is a set of facts over and .

Alternatively, we can view a KG as a directed, labelled multigraph over nodes and edges .

G

E

R

(43)

Knowledge Graphs

E

R

r(h, t)

r ∈ R,

h, t ∈ E

h

t

r(h, t)

(h, r, t)

• A knowledge graph (KG) is a set of facts over and .

Alternatively, we can view a KG as a directed, labelled multigraph over nodes and edges .

G

E

R

G = (E, R)

E

R

• We sometimes write to denote the set of all possible facts over and .

U

(44)

Why Knowledge Graphs?

(45)

Why Knowledge Graphs?

13

(46)

Why Knowledge Graphs?

13

• _{KGs provide means for storing, processing, and managing}_{structured data}_{, and are part of} modern information technologies.

(47)

Why Knowledge Graphs?

13

• _{KGs can be used for}_reasoning_{(in conjunction with ontologies), and for query answering, i.e.,} “Who has co-authored a paper with Marie Curie and Pierre Curie?”

(48)

Why Knowledge Graphs?

13

• _{KGs pose (or, relate to) various challenges in AI & machine learning:}

(49)

Why Knowledge Graphs?

13

(50)

Why Knowledge Graphs?

13

• _{How to automatically}_construct_{KGs (e.g., relation extraction, open information extraction)?} • _{How to}_populate_{an existing KG with new facts (e.g., KG completion)?}

(51)

Why Knowledge Graphs?

13

(52)

Why Knowledge Graphs?

13

• _{How to}_{improve/personalise}_{information systems using KGs (e.g., recommender systems)?} • _{How to}_learn_{on top of KGs, while complying with the existing knowledge?}

(53)

Why Knowledge Graphs?

13

• _{How to}_{improve/personalise}_{information systems using KGs (e.g., recommender systems)?} • _{How to}_learn_{on top of KGs, while complying with the existing knowledge?}

(54)

Knowledge Graph Completion

(55)

Knowledge Graph Completion

14

(56)

Knowledge Graph Completion

14

Problem: KGs are typically highly incomplete, which makes their downstream use more challenging. For example, 71% of individuals in Freebase lack a connection to a place of birth.

Question: Can we automatically ﬁnd new facts for our KG,

(57)

Knowledge Graph Completion

14

solely based on the existing information in the KG?

Task: Given a KG , the task of knowledge graph

completion is to predict facts that are missing from .

G

(58)

Knowledge Graph Completion

14 Nairobi Kenya capitalOf Africa neighbourOf Ethiopia locatedIn Somalia locatedIn cityIn

solely based on the existing information in the KG?

Task: Given a KG , the task of knowledge graph

completion is to predict facts that are missing from .

G

(59)

Inspiration from Word Vector Representations

15

“The word representations computed using neural networks are very interesting because the learned vectors explicitly encode many linguistic regularities and patterns.

Somewhat surprisingly, many of these patterns can be represented as linear translations.

For example, the result of a vector

calculation vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector.”

(60)

Inspiration from Word Vector Representations

15

“The word representations computed using neural networks are very interesting because the learned vectors explicitly encode many linguistic regularities and patterns.

Somewhat surprisingly, many of these patterns can be represented as linear translations.

For example, the result of a vector

calculation vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector.”

(Mikolov et. al, 2013)

Figure 2 (Mikolov et. al, 2013): 2-dimensional PCA projection of the 1000-dimensional Skip-gram vectors of countries and their capital cities. The ﬁgure illustrates ability of the model to automatically

(61)

Knowledge Graph Completion

Problem: KGs are typically highly incomplete, which makes their downstream use more challenging. For example, 71% of

individuals in Freebase lack a connection to a place of birth.

Question: Can we automatically ﬁnd new facts for our KG, solely

based on the existing information in the KG?

Task: Given a KG , the task of knowledge graph completion is to predict facts that are missing from .

G

(62)

Knowledge Graph Completion

G

(63)

Knowledge Graph Completion

G

Intuition: Real-world data lies in low dimensional manifolds, so if existing facts in a KG exhibit common patterns then one can embed them into low-dimensional vector-spaces and use them to predict new facts.

(64)

Knowledge Graph Embedding

Models

(65)

KG Embedding Models

(66)

KG Embedding Models

18

Most of the existing approaches can be described in term of the following criteria: (i) Model representation: How are the entities and relations represented?

(ii) Scoring function: How is the likelihood of a fact to be true deﬁned?

(67)

KG Embedding Models

18

Most of the existing approaches can be described in term of the following criteria: (i) Model representation: How are the entities and relations represented?

(ii) Scoring function: How is the likelihood of a fact to be true deﬁned?

(iii) Loss function: What is the objective function to be minimised?

Well-known families of models classiﬁed in terms of model representation:

• Translational models: Embed entities as points in vector space, and model relations as translations operating on the embeddings of the entities.

• Bilinear models: Embed entities and relations into vector space, and model relations as a bilinear product between entity and relation embeddings.

(68)

KG Embedding Models: Basic Idea

(69)

KG Embedding Models: Basic Idea

19

Train a KG Embedding Model M

G

Score all facts

M

_{𝗌𝖼𝗈𝗋𝖾}

:: U ↦ ℝ

(70)

KG Embedding Models: Basic Idea

19

G

Score all facts

M

:: U ↦ ℝ

True fac_ts

(71)

KG Embedding Models: Basic Idea

19

G

Score all facts

M

:: U ↦ ℝ

True fac_ts Negative Facts? N = Negat ive fact s

(72)

KG Embedding Models: Basic Idea

19

G

Score all facts

M

:: U ↦ ℝ

True fac_ts

Problem: KGs typically store only positive information, and so encode only the facts that are true. There are

no real negative examples to train with!

Negative Facts?

N = Negat

ive fact s

(73)

Negative Sampling

(74)

Negative Sampling

20

(75)

Negative Sampling

20

(76)

Negative Sampling

20

The idea is to corrupt true facts, and then use some of these corrupted facts as negative examples. Given a KG over entities and relations , every fact in is called a

G

E

R

G

true fact.

A corrupted fact is obtained by replacing only the head (resp., only the tail) entity in a true fact in with an

entity in . Formally, for a true fact , we deﬁne the set of all corrupted facts as:

G

(77)

Negative Sampling

20

G

E

R

G

true fact.

G

E

r(h, t) ∈ G

(78)

Negative Sampling

20

G

E

R

G

true fact.

G

E

r(h, t) ∈ G

C

r(h,t)

= {r(e, t) ∣ e ≠ h ∈ E, r(e, t) ∉ G} ∪ {r(h, e) ∣ e ≠ t ∈ E, r(h, e) ∉ G} .

A negative fact for a given true fact , is a fact randomly sampled from . The set of negative facts

sampled for a given true fact is denoted as .

r(h, t)

C

r(h,t)

(79)

Negative Sampling

20

G

E

R

G

true fact.

G

E

r(h, t) ∈ G

C

r(h,t)

= {r(e, t) ∣ e ≠ h ∈ E, r(e, t) ∉ G} ∪ {r(h, e) ∣ e ≠ t ∈ E, r(h, e) ∉ G} .

r(h, t)

C

r(h,t)

r(h, t)

N

r(h,t)

(80)

Negative Sampling

20

G

E

R

G

true fact.

G

E

r(h, t) ∈ G

C

r(h,t)

= {r(e, t) ∣ e ≠ h ∈ E, r(e, t) ∉ G} ∪ {r(h, e) ∣ e ≠ t ∈ E, r(h, e) ∉ G} .

r(h, t)

C

r(h,t)

r(h, t)

N

r(h,t)

Sampling from corrupted facts and using these as negative facts is standard in the literature, and various sampling techniques are used, e.g., uniform sampling, adversarial sampling, etc.

(81)

Model Expressiveness

(82)

Model Expressiveness

(83)

Model Expressiveness

22

A KGC model is fully expressive if, for any given disjoint sets of true and false facts over a vocabulary (i.e., the ground truth of a set of facts), there exists a parameter conﬁguration for such that

accurately classiﬁes all the given facts.

M

(84)

Model Expressiveness

22

M

(85)

Model Expressiveness

22

M

Intuitively, a fully expressive model can capture any ground truth of a given set of facts. Conversely, a model that is not fully expressive can fail to ﬁt its training set properly, and thus can underﬁt.

(86)

Model Expressiveness

22

M

It is therefore desirable to have full expressivity, as we would our model to ﬁt the dataset reasonably well, regardless of the complexity of the dataset.

(87)

Model Expressiveness

22

M

It is therefore desirable to have full expressivity, as we would our model to ﬁt the dataset reasonably well, regardless of the complexity of the dataset.

Would theoretical inexpressivity surface in practice?

(88)

Model Inductive Capacity

(89)

Model Inductive Capacity

(90)

Model Inductive Capacity

24

(91)

Model Inductive Capacity

24

Model inductive capacity is the generalisation capacity of a model, i.e., the quality of the predictions of the model over incomplete datasets.

Full expressiveness does not necessarily correlate with inductive capacity: Fully expressive models can merely memorise training data and generalise poorly. It is important to develop models that are jointly fully

(92)

Model Inductive Capacity

24

expressive and have a strong inductive capacity.

(93)

Model Inductive Capacity

24

How can model inductive capacity be studied?

Inference patterns are speciﬁcations of logical properties that may exist in a KG, which, if learned, enable

(94)

Model Inductive Capacity

24

further principled inferences from existing KG facts. Inference patterns are a common means to formally analyse the generalisation ability of KGC systems.

(95)

Model Inductive Capacity

24

further principled inferences from existing KG facts. Inference patterns are a common means to formally analyse the generalisation ability of KGC systems.

One well-known example inference pattern is symmetry: A relation is symmetric if, for any choice of entities

e

₁

, e

₂

∈ E

, whenever a fact

r(e

₁

, e

₂

)

holds, then so does

r(e

₂

r ∈ R

, e

₁

)

.

As a result, if a model learns a symmetry pattern for a relation , then it can infer facts in the symmetric closure of , thus providing a strong inductive bias.

(96)

Inference Patterns

(97)

Inference Patterns

25

An inference pattern speciﬁes a logical property over a KG, which means that such patterns can be formalised using logical rules. To formalise this, let us extend our relational vocabulary over and with a set of

variables. A ﬁrst-order atom is an expression of the form , where , and .

E

R

V

(98)

Inference Patterns

25

E

R

V

r(x

_i

, x

_j

)

r ∈ R

x

_i

, x

_j

∈ V

A Boolean combination of ﬁrst-order atoms is deﬁned inductively using logical constructors , e.g.,

and are Boolean combinations of ﬁrst-order

atoms.

¬, ∧ , ∨

(99)

Inference Patterns

25

E

R

V

r(x

_i

, x

_j

)

r ∈ R

x

_i

, x

_j

∈ V

For the purposes of this lecture, we are interested in universally quantiﬁed ﬁrst-order rules of the form: ,

with . The semantics of such universally quantified first-order rules is that of first-order logic, restricted to a finite domain (as the set of entities is finite).

∀x

₁

…x

_k

ϕ(x

₁

…x

_k

) ⇒ ψ(x

₁

…x

_l

)

k ≥ l

E

A Boolean combination of ﬁrst-order atoms is deﬁned inductively using logical constructors , e.g.,

and are Boolean combinations of ﬁrst-order

atoms.

¬, ∧ , ∨

(100)

Inference Patterns

(101)

Inference Patterns

26

We can express the symmetry inference pattern for a relation , in the form of such a logical rule as follows:

,

which holds if and only if the relation is symmetric, i.e., the rule is invalidated if there exists two entities , where is true, but is not.

r ∈ R

∀x, y r(x, y) ⇒ r(y, x)

r

(102)

Inference Patterns

26

We can express the symmetry inference pattern for a relation , in the form of such a logical rule as follows:

,

which holds if and only if the relation is symmetric, i.e., the rule is invalidated if there exists two entities , where is true, but is not.

r ∈ R

∀x, y r(x, y) ⇒ r(y, x)

r

e

₁

, e

₂

∈ E

r(e

₁

, e

₂

)

r(e

₂

, e

₁

)

Similarly, we can express that the relations are the inverse of each other in terms of two rules: ,

.

In this case, we will use the standard abbreviation and write .

r

₁

, r

₂

∈ R

∀x, y r

₁

(x, y) ⇒ r

₂

(y, x)

∀x, y r

₂

(x, y) ⇒ r

₁

(y, x)

(103)

Inference Patterns

27

Inference pattern Inference rule

Symmetry Anti-symmetry Inversion Composition Hierarchy Intersection Mutual exclusion

∀x, y r(x, y) ⇒ r(y, x)

∀x, y r(x, y) ⇒ ¬r(y, x)

∀x, y r

₁

(x, y) ⇔ r

₂

(y, x)

∀x, y, z r

₁

(x, y) ∧ r

₂

(y, z) ⇒ r

₃

(x, z)

∀x, y r

₁

(x, y) ⇒ r

₂

(x, y)

∀x, y r

₁

(x, y) ∧ r

₂

(x, y) ⇒ r

₃

(x, y)

∀x, y r

₁

(x, y) ⇒ ¬r

₂

(x, y)

(104)

Inference Patterns

27

Inference pattern Inference rule

Symmetry Anti-symmetry Inversion Composition Hierarchy Intersection Mutual exclusion

∀x, y r(x, y) ⇒ r(y, x)

∀x, y r(x, y) ⇒ ¬r(y, x)

∀x, y r

₁

(x, y) ⇔ r

₂

(y, x)

∀x, y, z r

₁

(x, y) ∧ r

₂

(y, z) ⇒ r

₃

(x, z)

∀x, y r

₁

(x, y) ⇒ r

₂

(x, y)

∀x, y r

₁

(x, y) ∧ r

₂

(x, y) ⇒ r

₃

(x, y)

∀x, y r

₁

(x, y) ⇒ ¬r

₂

(x, y)

List of inference patterns commonly used in the literature and the corresponding logical rules. It is assumed that r₁ ≠ r₂ ≠ r₃.

(105)

Empirical Evaluation

(106)

Empirical Evaluation: Ranking

(107)

Empirical Evaluation: Ranking

29

(108)

Empirical Evaluation: Ranking

29

The most common empirical evaluation task for KGE methods is based on entity ranking. The knowledge graph is partitioned into a set of

G

training ( ),

G

_tr validation ( ), and

G

_v test facts (

G

_test).

(109)

Empirical Evaluation: Ranking

29

G

training ( ),

G

_v test facts (

G

_test).

Given a test fact

r(h, t) ∈ G

_test, we consider the following sets:

_{r(h, _) = {r(h, e) ∣ e ∈ E, r(h, e) ∉ G}

r(_, t) = {r(e, t) ∣ e ∈ E, r(e, t) ∉ G

tr

∪ G

v

∪ G

test

} ∪ {r(h, t)},

(110)

Empirical Evaluation: Ranking

29

G

training ( ),

G

_v test facts (

G

_test).

Given a test fact

r(h, t) ∈ G

_{r(h, _) = {r(h, e) ∣ e ∈ E, r(h, e) ∉ G}

r(_, t) = {r(e, t) ∣ e ∈ E, r(e, t) ∉ G

tr

∪ G

v

∪ G

test

} ∪ {r(h, t)},

tr

∪ G

v

∪ G

test

} ∪ {r(h, t)} .

Importantly, all facts that occur in the training, validation, or test data are ﬁltered out from these sets

(except the test fact itself). This is to ensure that other facts known to be true do not aﬀect the ranking. This is the so-called ﬁltered evaluation which has become standard practice in experimental evaluation

(111)

Empirical Evaluation: Ranking

29

G

training ( ),

G

_v test facts (

G

_test).

Given a test fact

r(h, t) ∈ G

_{r(h, _) = {r(h, e) ∣ e ∈ E, r(h, e) ∉ G}

r(_, t) = {r(e, t) ∣ e ∈ E, r(e, t) ∉ G

tr

∪ G

v

∪ G

test

} ∪ {r(h, t)},

tr

∪ G

v

∪ G

test

} ∪ {r(h, t)} .

(Bordes et al., 2013).

(112)

Empirical Evaluation: Ranking

29

G

training ( ),

G

_v test facts (

G

_test).

Given a test fact

r(h, t) ∈ G

_{r(h, _) = {r(h, e) ∣ e ∈ E, r(h, e) ∉ G}

r(_, t) = {r(e, t) ∣ e ∈ E, r(e, t) ∉ G

tr

∪ G

v

∪ G

test

} ∪ {r(h, t)},

tr

∪ G

v

∪ G

test

} ∪ {r(h, t)} .

(Bordes et al., 2013).

Every fact in these sets is ranked in accordance to the scoring function of the model in descending order. The rank of the entity relative to the facts , denoted , is the rank of the fact

in ; similarly, the rank of the entity relative to the facts , denoted , is the rank of the fact in .

e

r(_, t)

rank(e ∣ r(_, t))

r(e, t)

r(_, t)

e

r(h, _)

rank(e ∣ r(h, _))

(113)

Empirical Evaluation: Metrics

(114)

Empirical Evaluation: Metrics

30

Mean rank (MR) is the average rank of true facts against their corrupted counterparts:

1

2 ∣ G_test ∣ _r(h,t)∈G∑

test

(115)

Empirical Evaluation: Metrics

30

1

2 ∣ G_test ∣ _r(h,t)∈G∑

test

(rank(h ∣ r(_, t)) + rank(t ∣ r(h, _)))

(116)

Empirical Evaluation: Metrics

30

Hits@ is the proportion of true facts with rank at most :

, where is the indicator function that returns , if is true, and , otherwise.

k

1

2 ∣ G_test ∣ _r(h,t)∈G∑

test (1(rank(h ∣ r(_, t)) ≤ k) + 1(rank(t ∣ r(h, _)) ≤ k))

1(c)

1 c

0

1

2 ∣ G_test ∣ _r(h,t)∈G∑

test

(rank(h ∣ r(_, t)) + rank(t ∣ r(h, _)))

(117)

Empirical Evaluation: Datasets

(118)

Empirical Evaluation: Datasets

31

FB15k (Bordes et al., 2013): A subset of Freebase (Bollacker et al., 2008), where a large part of the test facts can be directly inferred via an inverse relation , which makes the inversion pattern very prominent (Toutanova & Chen, 2015). Other patterns on FB15k are symmetry/antisymmetry and

composition patterns.

(119)

Empirical Evaluation: Datasets

31

r(x, y)

r′

(y, x)

(120)

Empirical Evaluation: Datasets

31

r(x, y)

r′

(y, x)

FB15K-237 (Toutanova & Chen, 2015): A subset of FB15k , where inverse relations are deleted. The prominent patterns are composition and symmetry/antisymmetry patterns.

(121)

Empirical Evaluation: Datasets

31

r(x, y)

r′

(y, x)

WN18 (Bordes et al., 2013): A subset of WordNet (Miller, 1995), featuring lexical relations between words. This dataset has also many inverse relations, and the main inference patterns are symmetry/ antisymmetry and inversion.

(122)

Empirical Evaluation: Datasets

31

r(x, y)

r′

(y, x)

WN18 (Bordes et al., 2013): A subset of WordNet (Miller, 1995), featuring lexical relations between words. This dataset has also many inverse relations, and the main inference patterns are symmetry/ antisymmetry and inversion.

WN18RR (Dettmers et al., 2017): A subset of WN18, where inverse relations are deleted. The prominent inference patterns are symmetry/antisymmetry and composition.

(123)

Empirical Evaluation: Datasets

32

Dataset |E| |R| Training facts Validation facts Test facts

FB15K-237 14,541 237 272,115 17,535 20,466

WN18RR 40,943 11 86,835 3,034 3,034

(124)

Empirical Evaluation: Datasets

32

Dataset |E| |R| Training facts Validation facts Test facts

FB15K-237 14,541 237 272,115 17,535 20,466

WN18RR 40,943 11 86,835 3,034 3,034

YAGO3-10 123,182 37 1,079,040 5,000 5,000

(125)

Summary

(126)

Summary

• _{Relational data}_{is prominent in real-world applications!}

(127)

Summary

• Discussed KG embedding models through the lens of the KG completion task.

(128)

Summary

• _{The families of}_{translational}_,_bilinear_{, and}_{neural models}_{are brieﬂy discussed.}

(129)

Summary

• _Established_{evaluation criteria}_{for diﬀerent models:}

(130)

Summary

• _Model_{expressiveness}

(131)

Summary

• Model inductive capacity and inference patterns

(132)

Summary

• _{Empirical evaluation:}_{Datasets and metrics}

(133)

Summary

• _{Empirical evaluation:}_{Datasets and metrics}

• _{We have}_not_{introduced/evaluated any speciﬁc model: Next lecture!}

(134)

34

References

• _{A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling} multi-relational data. NIPS, 2013.

• _{K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database} for structuring human knowledge. MOD, 2008.

• _{T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and} their compositionality. NIPS, 2013.

• _{G. A. Miller. Wordnet: a lexical database for english. Communications of the ACM, 1995.}

• T. Dettmers, P. Minervini, P. Stenetorp, S. Riedel. Convolutional 2D knowledge graph embeddings. AAAI, 2018. • _{K. Toutanova and D. Chen. Observed versus latent features for knowledge base and text inference. Proc. of the}

3rd Workshop on Continuous Vector Space Models and their Compositionality, 2015.