Neural Sequence Models - Multi-relational Representation Learning and Knowledge Acquisition

This subsection introduces a variety of neural sequence encoding techniques, which are the basis of our work on knowledge acquisition from sequence data.

2.3.1 The Convolution Layer with Pooling

We use X = [v1,v2, ...,vl] to denote an input vector sequence that corresponds to the input se-

quence or the outputs of a previous neural layer. A convolution layer applies a weight-sharing kernel Mc ∈ Rh×k to generate a k-dimension latent vector h

(1)

t from a window vt:t+h−1 of the input vector sequenceX:

h(1)_t = Conv(vt:t+h−1) =Mcvt:t+h−1+bc

for whichhis the kernel size, andbcis a bias vector. The convolution layer applies the kernel as a

sliding window to produce a sequence of latent vectorsH(1) _{= [}_h(1) 1 ,h

(1) 2 , ...,h

(1)

l−h+1], where each latent vector combines the local features from each h-gram of the input sequence. The n-max- pooling mechanism is applied to every consecutive n-length subsequence (i.e., non-overlapped n-strides) of the convolution outputs, which takes the maximum value along each dimensionj by h(2)_i,j = max(h(1)_i_:_n₊_i₋₁_,j). The purpose of this mechanism is to discretize the convolution results, and preserve the most significant features within each n-stride (Chen et al., 2018b; Hashemifar et al., 2018; Kim, 2014). By definition, this mechanism divides the size of processed features by n.

2.3.2 GRU and Residual GRU

The Gated Recurrent Unit model (GRU) represents an alternative to the Long-short-term Memory network (LSTM) (Cho et al., 2014), which consecutively characterizes the sequential information without using separated memory cells (Dhingra et al., 2016). Each unit consists of two types of gates to track the state of the sequence, i.e. the reset gate rt and the update gate zt. Given the

embeddingvtof an incoming item, GRU updates the hidden stateh

(3)

t of the sequence as a linear

combination of the previous state h(3)_t₋₁ and the candidate state h˜(3)_t of a new item vt, which is

calculated as below. h(3)_t = GRU(vt) = zth˜ (3) t + (1−zt)h (3) t−1 zt=σ Mzvt+Nzh (3) t−1+bz ˜ h(3)_t = tanhMsvt+rt(Nsh (3) t−1) +bs rt =σ Mrvt+Nrh (3) t−1 +br .

The symboldenotes the element-wise multiplication. The update gateztbalances the informa-

tion of the previous sequence and the new item, where capitalized M∗ and N∗ denote different weight matrices,b∗ denote bias vectors, and σ is the sigmoid function. The candidate state h˜(3)_t is calculated similarly to those in a traditional recurrent unit, and the reset gate rt controls how

much information of the past sequence contributes to h˜(3)_t . Note that GRU generally performs comparably to LSTM in sequence encoding tasks, but is less complex and requires much fewer computational resources for training.

A bidirectional GRU layer characterizes the sequential information in two directions. It con- tains the forward encoding process−−−→GRUthat reads the input vector sequenceX = [v1,v2, ...,vl]

fromv1tovl, and a backward encoding process←−−−GRUthat reads in the opposite direction. The en-

coding results of both processes are concatenated for each input itemvt, i.e.h

(4)

t = BiGRU(vt) =

[−−−→GRU(vt),←−−−GRU(vt)].

The residual mechanism passes on an identity mapping of the GRU inputs to its output side through a residual shortcut (He et al., 2016). By adding the forwarded input values to the outputs, the corresponding neural layer is only required to capture the difference between the input and output values. This mechanism aims at improving the learning process of non-linear neural layers by increasing the sensitivity of the optimization gradients (He et al., 2016; Kim et al., 2016), as

well as preventing the model from the vanishing gradient problem. It has been widely deployed in deep learning architectures for various tasks of image recognition (He et al., 2016), document classification (Conneau et al., 2017b) and speech recognition (Zhang et al., 2017). In our deep RCNN in Chapter 8, the bidirectional GRU is incorporated with the residual mechanism, and will pass on the following outputs to its subsequent neural network layer:

h(5)_t = ResGRU(vt) = [−−−→GRU(vt) +vt,←−−−GRU(vt) +vt]

In our development, we have found that the residual mechanism is able to drastically simplify the training process, and largely decreases the epochs of parameter updates for the model to converge. 2.3.3 Self-attentive Encoder

The self-attention mechanism (Conneau et al., 2017a) seeks to capture the overall meaning of the input sequence unevenly from each encoded item. One layer of self-attention is calculated as below, ut= tanh Mah (3) t +ba at= exp u>_t uX P wm∈Xexp (u > muX) h(6)_t =_|X_|atut

where ut thereof is the intermediary latent representation of the GRU output h(3)t , and uX =

tanh(Mah

(3)

X +ba)is the intermediary latent representation of the last GRU outputh

(3)

X that can be

seen as a high-level representation of the entire input sequence. By measuring the similarity of each ut with uX, the normalized attention weight at forh

(3)

t is produced through a softmax function,

which highlights an input that contributes more significantly to the overall meaning. Note that a scalar_|X_|(the length of the input sequence) is multiplied along withattoutto obtain the weighted

of the sequence is calculated as the average of the last attention layerE(2)(X) = _|_X1_|P|X|

t=1ath

(6)

t .

2.3.4 Linear Bag-of-words

The much simpler Linear Bag-of-words (BOW) encoder (Hill et al., 2016; Xie et al., 2016) is real- ized by the sum of projected word embeddings of the input sentence, i.e.E(3)₍_S_{) =}P|S|

CHAPTER 3

Transfer Embeddings of Multilingual Knowledge Graphs

In this chapter, we propose the first method that learns transferred embeddings across multiple knowledge graphs. This work is presented in the stage of learning to represent multilingual knowledge graphs (Chen et al., 2017b,d), although it is easily extended to other domains.

In document Multi-relational Representation Learning and Knowledge Acquisition (Page 40-44)