
3.2 Learning Morphology-Aware Embeddings

3.2.1 Network Architecture

Our network is designed to learn embeddings for MCWs. To this end: i) we take as input a cubic data structure that contains the embeddings of the lemmas and affixes of the n preceding and m following words around the target word; ii) we apply a 3D convolution over the cube to combine its elements; iii) we apply further functions and transformations (such as a non-linearity) to the convolution result and reshape it into a vector; and iv) finally, we pass this vector to a Softmax layer in order to predict the target word. To implement this pipeline we propose the following architecture.

The first layer of our architecture is a lookup table which includes embeddings for lemmas, suffixes, and prefixes. The lookup table is treated as a matrix of network parameters whose values are updated during training. From each sentence in the training set, one word is randomly selected as the target word. During training, we process sentences multiple times, so different words are selected as the target word. All words of the training sentence are decomposed into subunits (each word is decomposed into 1 lemma, 1 prefix and 1 suffix). Based on the selected target word, the corresponding lemma and affix embeddings of the context words are retrieved from the lookup table and placed in the cube. In our implementation, we use a window of 10 words around the target word, namely n = m = 5. This means that if the target word is the fifth word in a sentence, the first plane of the input cube includes prefix embeddings for word_i where 0 ≤ i ≤ 10 and i ≠ 5. Similarly, the second and the third planes include suffix and lemma embeddings for the same set of context words. n and m are hyperparameters of the model; we set both to 5 to make our work comparable to others (see Section 3.3 for more details). The lookup table is a matrix with |V| rows and d columns, where V is the vocabulary set and d is the embedding size (also detailed in Section 3.3).
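As a rough illustration of how the input cube could be assembled, the sketch below builds a 3 × (n+m) × d nested list from three toy lookup tables. The function name build_cube, the dictionary-based word decomposition, the "<none>" affix label and the zero "<pad>" vector used for positions outside the sentence are all assumptions made for this example; the text does not specify how sentence boundaries are handled.

```python
import random

PLANES = ("prefix", "suffix", "lemma")  # plane order described in the text


def build_cube(words, target_idx, tables, n=5, m=5):
    """Return a 3 x (n+m) x d nested list of context embeddings.

    words:  list of dicts with "prefix", "suffix" and "lemma" keys (hypothetical format)
    tables: one dict per subunit type, mapping subunit -> embedding vector
    """
    # n positions before and m positions after the target, target excluded.
    positions = [i for i in range(target_idx - n, target_idx + m + 1)
                 if i != target_idx]
    cube = []
    for part in PLANES:
        table = tables[part]
        plane = []
        for i in positions:
            if 0 <= i < len(words):
                plane.append(table[words[i][part]])
            else:
                # Assumption: positions outside the sentence get a padding vector.
                plane.append(table["<pad>"])
        cube.append(plane)
    return cube


# Toy setup: an 11-word sentence with embedding size d = 4.
d = 4
rng = random.Random(0)
sentence = [{"prefix": "<none>", "lemma": f"lemma{i}", "suffix": "<none>"}
            for i in range(11)]
tables = {part: {"<pad>": [0.0] * d} for part in PLANES}
for word in sentence:
    for part in PLANES:
        if word[part] not in tables[part]:
            # Toy random vectors; in the network these would be trained parameters.
            tables[part][word[part]] = [rng.uniform(-0.1, 0.1) for _ in range(d)]

cube = build_cube(sentence, target_idx=5, tables=tables)
print(len(cube), len(cube[0]), len(cube[0][0]))  # 3 10 4
```

With target_idx=5 the context positions are 0 to 10 without 5, which matches the example given in the text.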

In the second layer the multi-plane convolution function is applied. The input data is a 3-plane cube, which the convolution module changes into a denser structure with 6 planes, i.e. the input instance at each step, with shape 3 × (n+m) × d, is transformed into a data structure with 6 × w_out × d_out dimensions, where w_out = ⌊((m+n) − w_F)/2⌋ and d_out = ⌊(d − h_F)/2⌋. In our setting w_F = h_F = 5. We empirically found these values to give the best trade-off between training time and network accuracy. We also apply max-pooling to the 6-plane convolution result, where each plane is segmented into 2 × 2 windows whose maximum values are selected (one maximum value from each of those 4-cell windows).
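A quick worked instance of the size formula, using the stated settings n = m = 5 and w_F = h_F = 5. The helper name conv_pool_shape is made up for illustration, and the function only evaluates the formula as written; an actual convolution library may round or count border positions slightly differently.

```python
def conv_pool_shape(n, m, d, w_f=5, h_f=5, planes=6):
    """Evaluate w_out = floor(((m+n) - w_F) / 2) and d_out = floor((d - h_F) / 2)."""
    w_out = ((m + n) - w_f) // 2
    d_out = (d - h_f) // 2
    return planes, w_out, d_out


# Embedding size 200 and embedding size 50, the two sizes mentioned in this section.
print(conv_pool_shape(5, 5, 200))  # (6, 2, 97)
print(conv_pool_shape(5, 5, 50))   # (6, 2, 22)
```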

The next layer applies non-linearity, where we transform each cell of the 6-plane data structure with rectifier units. For the purpose of generalization and preventing over-fitting, we also placed a dropout layer with p = 0.3 after the non-linear layer. Srivastava et al. (2014) extensively discussed the advantages of using rectifier+dropout layers. Up to this layer we have a data structure with several planes. We unfold the planes and reshape them all into a single vector. The vector is passed through another rectifier+dropout layer and is mapped to a 200-dimensional vector.
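The rectifier, dropout, reshaping and resizing steps might look roughly like the following plain-Python sketch. The helper names, the toy 6 × 2 × 3 input, the use of "inverted" dropout (rescaling the surviving values at training time) and the ordering of the last step (rectifier, dropout, then an affine map to 200 dimensions) are assumptions for illustration, not details confirmed by the text.

```python
import random


def relu(x):
    return x if x > 0.0 else 0.0


def dropout(values, p, rng, training=True):
    """Inverted dropout (an assumed variant): drop with probability p, rescale survivors."""
    if not training:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in values]


def flatten(planes):
    """Unfold a planes x rows x cols nested list into one flat vector."""
    return [v for plane in planes for row in plane for v in row]


def linear(vec, weights, bias):
    """Map vec to len(bias) outputs with a plain affine transformation."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


rng = random.Random(0)
# Toy stand-in for the pooled 6-plane convolution output (6 x 2 x 3 here).
pooled = [[[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(2)]
          for _ in range(6)]

# Flatten, then rectifier and dropout (p = 0.3). Flattening first gives the same
# result as flattening last, because both operations act element-wise.
vec = dropout([relu(v) for v in flatten(pooled)], p=0.3, rng=rng)

# Second rectifier+dropout, then resize to a 200-dimensional vector.
vec = dropout([relu(v) for v in vec], p=0.3, rng=rng)
weights = [[rng.uniform(-0.1, 0.1) for _ in vec] for _ in range(200)]
bias = [rng.uniform(-0.1, 0.1) for _ in range(200)]
hidden = linear(vec, weights, bias)
print(len(vec), len(hidden))  # 36 200
```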

All cells of the final vector are processed by a hierarchical Softmax (HSMX) function to produce the probability distribution over classes (words). Softmax is a very expensive function in terms of time and space complexity. To deal with this problem, we used HSMX (Morin and Bengio, 2005), which first finds the correct word cluster and then looks for the correct word within the cluster. Similar to Kim et al. (2016), we pick the number of clusters K = ⌈√|V|⌉ and randomly split V into mutually exclusive and collectively exhaustive subsets V_1, ..., V_K of (approximately) equal size. HSMX in our setting is formulated as in (3.3):

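\[
P(w_t = j \mid C) = \frac{\exp(h_t \cdot w^{k} + b^{k})}{\sum_{k'=1}^{K} \exp(h_t \cdot w^{k'} + b^{k'})} \times \frac{\exp(h_t \cdot w_k^{j} + a_k^{j})}{\sum_{j' \in V_k} \exp(h_t \cdot w_k^{j'} + a_k^{j'})} \tag{3.3}
\]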

where, similar to the regular Softmax function (Equation (3.1)), w_t is the target word, C is the context (in our case the cube) and h_t is the output of the last layer just before HSMX. The first term is the probability of picking the cluster k and the second is the probability of selecting the word j given the cluster k. With the regular Softmax layer the network processes 2,500,000 tokens in ~9 hours, whereas with HSMX this is reduced to ~1.5 hours (both on GPUs).
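The two-term factorisation in (3.3) can be sketched in plain Python as follows. This is a minimal illustration under assumed names (make_clusters, hsmx_prob, cluster_w, cluster_b, word_w, word_a) and toy sizes, not the implementation used in the experiments; the parameters are random rather than trained.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_clusters(vocab_size, rng):
    """Randomly split word ids into K = ceil(sqrt(|V|)) near-equal clusters."""
    k = math.ceil(math.sqrt(vocab_size))
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return [ids[c::k] for c in range(k)]


def hsmx_prob(h, word, clusters, cluster_w, cluster_b, word_w, word_a):
    """P(word | context) = P(cluster k | h) * P(word | cluster k, h)."""
    k = next(c for c, members in enumerate(clusters) if word in members)
    cluster_scores = [dot(h, w) + b for w, b in zip(cluster_w, cluster_b)]
    p_cluster = softmax(cluster_scores)[k]
    members = clusters[k]
    word_scores = [dot(h, word_w[j]) + word_a[j] for j in members]
    p_word = softmax(word_scores)[members.index(word)]
    return p_cluster * p_word


rng = random.Random(0)
vocab_size, dim = 10, 8  # toy sizes, chosen only for the demonstration


def rand_vec(size):
    return [rng.uniform(-0.1, 0.1) for _ in range(size)]


clusters = make_clusters(vocab_size, rng)
cluster_w = [rand_vec(dim) for _ in clusters]
cluster_b = rand_vec(len(clusters))
word_w = [rand_vec(dim) for _ in range(vocab_size)]
word_a = rand_vec(vocab_size)
h = rand_vec(dim)

# The probabilities of all words still sum to one.
total = sum(hsmx_prob(h, j, clusters, cluster_w, cluster_b, word_w, word_a)
            for j in range(vocab_size))
print(len(clusters), round(total, 6))  # 4 1.0
```

Scoring one target word this way needs K cluster scores plus roughly |V|/K word scores, instead of |V| scores for the regular Softmax.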

The network is trained using SGD and back-propagation (Rumelhart et al., 1988). All parameters of the model are randomly initialized from a uniform distribution in the range [−0.1, 0.1]. Filters, weights, bias values and embeddings are all network parameters which are tuned during training. The lemma/affix embedding size is 50 for the English experiment (see Section 3.3) and 200 for the other experiments. We use the negative log-likelihood criterion to compute the cost. More formally, we wish to maximize the average log-probability in our network, as in (3.4):

\[
\frac{1}{J} \sum_{j=1}^{J} \log p(w_j \mid C_j^i) \tag{3.4}
\]

where J is the number of all words in the training corpus, and w_j is the target word whose context information is represented by the cube C_j^i. One w_j can have different context cubes, which are indexed by the superscript i. Figure 3.2 shows the network architecture.
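To make the relation between the objective in (3.4) and the negative log-likelihood cost concrete, the short sketch below computes both from a list of probabilities that a model assigned to the correct target words. The function names and the three probability values are invented for the example.

```python
import math


def average_log_prob(target_probs):
    """(1/J) * sum of log p(w_j | C_j^i) over the J training targets."""
    return sum(math.log(p) for p in target_probs) / len(target_probs)


def nll_cost(target_probs):
    """Negative log-likelihood cost: minimizing it maximizes the average log-probability."""
    return -average_log_prob(target_probs)


probs = [0.5, 0.25, 0.125]  # made-up probabilities of three target words
print(round(average_log_prob(probs), 4))  # -1.3863
print(round(nll_cost(probs), 4))          # 1.3863
```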

[Figure 3.2 pipeline, as labelled in the diagram: lookup table (L_i, P_i, S_i) → 3D convolution + max-pooling → non-linearity + dropout + reshaping → non-linearity + dropout + resizing → hierarchical Softmax over vocabulary clusters V_1, V_2, …, V_k → target word.]

Figure 3.2: Network Architecture.

In document Machine translation of morphologically rich languages using deep neural networks (Page 73-76)