Recursive Neural Tensor Networks - MoL 2017 28: Understanding Generalization: Learning Quant

While the tRNN model of the last section globally accounts for linguistic structure, by performing a hierarchical forward computation analogous to a parse tree of the input sentence, local composition at the tree nodes remains additive. As discussed previously, this type of function seems inadequate for representing composition that involves function words, such as adjectives, transitive verbs or negation.

The syntactic structure of composition in the tRNN suggests that the composition process is defined by a function of the constituent vectors, in analogy to the functions of formal semantics. In reality, tRNN composition input is a concatenation of constituents, which is equivalent to the addition of two independent mappings over the arguments. This representation seems to contradict our understanding of functional application in compositional models of semantics.
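The equivalence between a linear map on the concatenated input and the sum of two independent maps can be checked numerically. The following is a minimal sketch, assuming toy values with d = 2; the names `W_x` and `W_y` (the left and right column blocks of W) are introduced here for illustration only and do not appear in the text.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def vec_add(u, v):
    """Element-wise sum of two vectors of equal length."""
    return [a + b for a, b in zip(u, v)]

# Hypothetical toy values with d = 2; W has shape d x 2d.
W = [[1.0, 2.0, 3.0, 4.0],
     [0.5, -1.0, 0.0, 2.0]]
x = [1.0, 2.0]
y = [3.0, -1.0]
d = len(x)

# Additive composition on the concatenated input [x; y].
# Note: `x + y` on Python lists is concatenation, not vector addition.
concat_result = matvec(W, x + y)

# Split W column-wise: W_x acts only on x, W_y acts only on y.
W_x = [row[:d] for row in W]
W_y = [row[d:] for row in W]
split_result = vec_add(matvec(W_x, x), matvec(W_y, y))
```

Both results coincide, so no entry of x is ever multiplied with an entry of y: the two arguments contribute independently, which is the sense in which this composition is additive.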

A possible alternative is the recursive matrix-vector model of Socher et al. (2012), where each word is represented by a matrix and a vector, and composition is defined by symmetric functional application of the constituents. The requirement that words can act as functions on other words is satisfied here, but due to the added word representations, model complexity quickly grows with vocabulary size. A preferable solution seems to be employing higher-order tensors, allowing for an adequate representation of argument structure, while still using a single, global composition function.

Tensor Layer by Slices. With the above considerations in mind, the recursive neural tensor network (tRNTN) is proposed in Socher et al. (2013a,b). The previous tRNN composition function of Eq. (4.3) is augmented by a third-order tensor term, and composition is given by the sum of a linear map and a bilinear map of layer input. Defined by tensor slices, the tRNTN composition function is then:⁴

$$z_{(x,y)} = f\left(x^{\top} V^{(1:d)} y + W[x; y] + b\right) \tag{4.4}$$

where $x, y \in \mathbb{R}^d$ is the composition input, and $V \in \mathbb{R}^{d \times d \times d}$ is a full-rank third-order tensor defining a bilinear map from $\mathbb{R}^d$ input spaces to an $\mathbb{R}^d$ output space. The remaining terms are identical to those of Eq. (4.3).

$V^{(i)}$, the $i$-th of $d$ slices of $V$, determines the $i$-th component of $z_V \in \mathbb{R}^d$, the output vector of the tensor term, as follows:

$$z_{V,i} = x^{\top} V^{(i)} y$$

We can think of the tRNTN layer output $z$ as resulting from three $d$-dimensional terms: the tensor output $z_V$ defined above; $z_W$, the linear combination of concatenated input arguments parametrized by $W$; and the constant term $z_b = b$. Summing, and applying the nonlinearity, yields the layer output,

$$z = f(z_V + z_W + z_b)$$
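The slice definition can be sketched in a few lines of pure Python. This is an illustrative sketch under stated assumptions, not the implementation used for the experiments: the function names are hypothetical, the toy parameters are arbitrary, and `tanh` is assumed as the nonlinearity f purely for the example.

```python
import math

def bilinear(x, S, y):
    """Scalar x^T S y for a single d x d slice S."""
    return sum(x[j] * S[j][k] * y[k]
               for j in range(len(x)) for k in range(len(y)))

def rntn_compose_slices(x, y, V, W, b, f=math.tanh):
    """Slice form: z = f(z_V + z_W + z_b), with z_V built one slice at a time."""
    d = len(x)
    xy = x + y                                   # concatenation [x; y]
    z_V = [bilinear(x, V[i], y) for i in range(d)]                    # tensor term
    z_W = [sum(w * a for w, a in zip(W[i], xy)) for i in range(d)]    # linear term
    z_b = b                                                           # constant term
    return [f(z_V[i] + z_W[i] + z_b[i]) for i in range(d)]

# Hypothetical toy parameters with d = 2.
V = [[[1.0, 0.0], [0.0, 1.0]],    # slice V^(1)
     [[0.0, 1.0], [1.0, 0.0]]]    # slice V^(2)
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
b = [0.5, -0.5]
x = [1.0, 2.0]
y = [3.0, -1.0]

# Identity "nonlinearity" exposes the summed terms before f is applied.
pre_activation = rntn_compose_slices(x, y, V, W, b, f=lambda t: t)
z = rntn_compose_slices(x, y, V, W, b)
```

With these values the tensor term is [1, 5], the linear term is [1, -1], and the bias is [0.5, -0.5], so `pre_activation` is [2.5, 3.5] and `z` is its element-wise tanh.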

⁴ We define the tensor function of Socher et al. (2013a) and Bowman et al. (2015), which differs from the tensor of Socher et al. (2013b), where input consists of concatenated child vectors, i.e. $[x; y]$ in

Matricized Tensor Layer. Rather than using the definition by slices of Eq. (4.4), we will usually describe the tRNTN in matricized form, which allows for a more intuitive exposition of the multiplicative interaction of arguments. We can then equivalently define the tRNTN composition as follows:

$$z_{(x,y)} = f\left(V_m (x \otimes y)_v + W[x; y] + b\right) \tag{4.5}$$

where $V_m$ is the matricization of the third-order tensor $V \in \mathbb{R}^{d \times d \times d}$, and $(x \otimes y)_v$ the vectorization of the tensor (or Kronecker) product of input vectors $x$ and $y$. All other terms are identical to Eq. (4.4).

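A more detailed view of the matricized tensor $V_m$, and the vectorized Kronecker product of input vectors $x, y$, is given by:

$$
V_m (x \otimes y)_v =
\begin{bmatrix}
v^{(1)}_{1,1} & \cdots & v^{(1)}_{1,d} & \cdots & v^{(1)}_{d,1} & \cdots & v^{(1)}_{d,d} \\
\vdots & & & \ddots & & & \vdots \\
v^{(d)}_{1,1} & \cdots & v^{(d)}_{1,d} & \cdots & v^{(d)}_{d,1} & \cdots & v^{(d)}_{d,d}
\end{bmatrix}
\begin{bmatrix}
x_1 y_1 \\ \vdots \\ x_1 y_d \\ \vdots \\ x_d y_1 \\ \vdots \\ x_d y_d
\end{bmatrix}
$$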

Here, tensor $V$ is represented by its mode-3 matricization, defined in Appendix A.0.3, resulting in the $d \times d^2$ matrix $V_m$. The Kronecker product of input $x, y$ is given in vectorized form, arranged as a $d^2 \times 1$ column vector. We denote here by superscript indices $v^{(1)} \ldots v^{(q)}$ the slicing index of the mode-3 matricization. These indices also allow us to connect the slice definition with the matricized definition: in the matricized view, the $d$ slices of Eq. (4.4) are vectorized, and vertically stacked row-wise to form matrix $V_m$. Composition of $V_m$ with the $d^2 \times 1$ column vector $(x \otimes y)_v$ is given by matrix multiplication, and yields a $d$-dimensional vector, the tensor term of Eq. (4.5), which is equivalent to the tensor term of the slice definition.
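The equivalence of the two definitions can be verified on a small example. The sketch below is an illustration under assumptions: each slice is flattened row by row, a convention chosen here to match the column order of the displayed matrix (the formal definition is the one in Appendix A.0.3), and all function names and toy values are hypothetical.

```python
def kron_vec(x, y):
    """Vectorized Kronecker product: [x1*y1, ..., x1*yd, ..., xd*y1, ..., xd*yd]."""
    return [a * b for a in x for b in y]

def matricize(V):
    """Flatten each d x d slice row by row into one row of a d x d^2 matrix."""
    return [[entry for row in S for entry in row] for S in V]

def tensor_term_matricized(x, y, V):
    """Tensor term as a matrix-vector product of V_m with the Kronecker vector."""
    V_m = matricize(V)
    xy_v = kron_vec(x, y)
    return [sum(v * a for v, a in zip(row, xy_v)) for row in V_m]

def tensor_term_slices(x, y, V):
    """Tensor term computed slice by slice as x^T V^(i) y."""
    d = len(x)
    return [sum(x[j] * S[j][k] * y[k] for j in range(d) for k in range(d))
            for S in V]

# Hypothetical toy tensor with d = 2 (two 2 x 2 slices).
V = [[[1, 2], [3, 4]],
     [[0, 1], [-1, 0]]]
x = [1, 2]
y = [3, -1]

V_m = matricize(V)                       # shape d x d^2 = 2 x 4
from_matrix = tensor_term_matricized(x, y, V)
from_slices = tensor_term_slices(x, y, V)
```

Here `kron_vec(x, y)` is [3, -1, 6, -2], and both routes give the same tensor term, [11, -7]. The matricized form makes the multiplicative interaction explicit: every product of an entry of x with an entry of y receives its own weight in each row of `V_m`.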

In the previous definitions we described composition of constituents belonging to $\mathbb{R}^d$ spaces, for a composition output in $\mathbb{R}^d$. We can in principle define composition over arbitrary dimensions, e.g. by some tensor $V \in \mathbb{R}^{r \times (p \times q)}$.⁵ In order to allow for composition of arbitrary constituents, we will generally demand that their dimensions match, thus constraining the mapping to $\mathbb{R}^{r \times (p \times p)}$. For recursively defined composition, such as in the tree-structured networks described here, we demand additionally that input and output dimensions match, thus constraining the mapping further to the case of our definitions, $V \in \mathbb{R}^{p \times (p \times p)}$.⁶
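The dimension constraints can be made concrete with a small sketch. It assumes a hypothetical helper `bilinear_map` that stores the tensor as r slices of size p × q; the toy values are arbitrary. The general case maps a pair of vectors to a vector of a different size, whereas the cubic case returns a vector that can itself serve as a child at the next tree node.

```python
def bilinear_map(x, y, V):
    """Apply V (r slices, each p x q) to x in R^p and y in R^q; returns a vector in R^r."""
    p, q = len(x), len(y)
    for S in V:
        assert len(S) == p and all(len(row) == q for row in S), "each slice must be p x q"
    return [sum(x[j] * S[j][k] * y[k] for j in range(p) for k in range(q))
            for S in V]

# General case: r = 1, p = 2, q = 3. The output lives in R^1, so it cannot
# be fed back in as an argument of the same map.
V_general = [[[1, 0, 2],
              [0, 1, 0]]]
out_general = bilinear_map([1, 2], [1, 1, 1], V_general)

# Recursive case: r = p = q = 2. The output has the same size as the inputs,
# so it can be composed again with another vector of that size.
V_cubic = [[[1, 0], [0, 1]],
           [[0, 1], [1, 0]]]
parent = bilinear_map([1, 2], [3, -1], V_cubic)
grandparent = bilinear_map(parent, [1, 0], V_cubic)
```

Only the tensor term is shown; the linear term and bias of Eq. (4.5) are subject to the same matching requirement on input and output dimensions.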

⁵ Parentheses and index order here mark, purely for convenience, a mapping from $\mathbb{R}^p \times \mathbb{R}^q$ to $\mathbb{R}^r$.

⁶ Note that the models of Chapter 5 also employ the functions of Eqs. (4.3) and (4.5) in a non-recursive

[Figure 4.4: tree diagram. Labels recoverable from the extraction: leaf phrases "not all pets" and "not walk"; each node applies f to the output of the matricized tensor V, of size 3 × 9.]

Figure 4.4: Global structure and tensor composition of tRNTN forward pass. The syntactic structure of the tRNTN forward pass is identical to the tRNN one, but local composition is multiplicative rather than additive. Node input in orange is given by the Kronecker product of child vectors. The (matricized) tensor V is applied to this vector by matrix multiplication. The graph only depicts calculations of the tensor term; the full function of Eq. (4.5) corresponds to summing green vectors at each node of Figs. 4.3 and 4.4, and adding the (omitted) bias term.

In document MoL 2017 28: Understanding Generalization: Learning Quantifiers and Negation with Neural Tensor Networks (Page 49-51)
