
5.4 Study 5: Animate / Inanimate

5.4.2 Materials

We simulate the performance of the English speakers using the dataset from Williams (2005, Experiment 1), which uses 24 nouns split evenly between animate and inanimate as seen in Fig. 5.2a. Using the embeddings matrix from §3.4, we generate training data by pairing the indices of the columns of the word vectors used in the experiments with one-hot vectors (localist representations) corresponding to the novel determiners. As explained above, given the index of the word in the input layer, the activation of the representation layer (i.e., the dot product between the input and the embeddings matrix) will be equal to that of the neural embedding with the same index formed in Chapter 3. We train and evaluate the performance of the model using the train-test split by Williams (2005) (see §C.1).
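The lookup described above can be illustrated with a minimal sketch. The vocabulary, the embedding values, and the helper names (`one_hot`, `representation`, `column`, `training_pair`) are all hypothetical and chosen only for illustration; they are not taken from the original simulations. The sketch assumes, as in the text, that each word corresponds to one column of the embeddings matrix.

```python
# Toy embeddings matrix: one COLUMN per word (values are made up).
vocab = ["dog", "cat", "table", "clock"]
embeddings = [
    [0.9, 0.8, 0.1, 0.2],  # embedding dimension 0
    [0.1, 0.3, 0.7, 0.9],  # embedding dimension 1
    [0.5, 0.4, 0.6, 0.5],  # embedding dimension 2
]


def one_hot(index, size):
    """Localist vector: 1.0 at `index`, 0.0 everywhere else."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def representation(index):
    """Activation of the representation layer: matrix times one-hot input."""
    x = one_hot(index, len(vocab))
    return [sum(w * xi for w, xi in zip(row, x)) for row in embeddings]


def column(index):
    """The stored embedding of the word at `index`."""
    return [row[index] for row in embeddings]


def training_pair(word, determiner_index, n_determiners):
    """Pair a word index with a one-hot determiner target (hypothetical helper)."""
    return vocab.index(word), one_hot(determiner_index, n_determiners)


# The dot product with a one-hot vector simply selects the word's column.
print(representation(2) == column(2))
```

Because every entry of the one-hot input except one is zero, the dot product reduces to selecting a single column, which is why feeding a word index reproduces the embedding learnt earlier.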

For the Chinese speakers, we follow the same procedure as above, substituting the embedding matrix with the Mandarin Chinese one formed in Chapter 4. The experimental stimuli and procedure matched those of Williams (2005), so no additional changes were made.

5.4.3 Results and discussion

We evaluate the performance of our model in this and any subsequent experiments on three grounds: first, whether the generalisation gradients of the network retain the patterns observed in the behavioural data; second, whether the predictions of the network for the ‘correct’ epoch match those reported in the studies; and, finally, how the hidden layer responds to the training input. This last point is of particular importance as it can help us understand the solution that the network has found to classify the training data. For example, the network might achieve high levels of generalisation without using the intended semantic distinction. Examining the activation of the hidden layer then helps us understand which regions of the input the network considers to be relevant.

Figure 5.2 shows a two-dimensional projection of the stimuli used in the two animacy experiments. Fig. 5.2a plots the distributional vectors of the words used by Williams (2005), whereas Fig. 5.2b plots the distributional vectors of the same words when trained on a Chinese corpus. For illustration purposes, we substitute the Chinese characters with their equivalent English translations. Both distributional matrices come from the simulations in Chapters 3 and 4, and the points are colour-coded for animacy (i.e., ‘green’ for animate and ‘orange’ for inanimate words). Despite some minor inconsistencies in the Chinese embeddings, in both cases the problem should be relatively easy for the model as the stimuli are linearly separable (i.e., one can draw a line that demarcates the two groups). However, we note at this point that the primary goal of the network is not to distinguish between animate and inanimate concepts but to learn to judge the alternative groupings of already learnt associations. For example, given the configuration of the words in the semantic space, if a determiner were seen with bear, snake, and monkey, would the learner be more inclined to generalise to bee or to book? While the problem is simplified if the network has knowledge of concrete semantic features, the critical point is to associate those features with the relevant determiners. In other words, even if the network ‘understands’ the difference between animate and inanimate concepts in general, it does not mean that it will associate those features with the correct determiners.
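The notion of linear separability used above can be made concrete with a small check. The sketch below is not part of the original simulations: it uses the classic perceptron rule, which is guaranteed to reach zero training errors in finitely many updates if and only if a separating line exists. The two-dimensional points are invented toy values, and `perceptron_separates` is a hypothetical helper name.

```python
def perceptron_separates(points, labels, max_epochs=1000):
    """Return True if the perceptron rule finds a separating line.

    `labels` must be +1 or -1. For separable data the rule converges;
    for non-separable data it keeps making errors until `max_epochs`.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified (or on the boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                errors += 1
        if errors == 0:
            return True
    return False


# Toy 2-D "projections": +1 = animate, -1 = inanimate (made-up coordinates).
toy_points = [(1.0, 2.0), (1.5, 2.5), (2.0, 3.0), (2.0, 0.5), (3.0, 1.0), (2.5, 0.0)]
toy_labels = [1, 1, 1, -1, -1, -1]

# An XOR-like layout that no single line can split.
xor_points = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
xor_labels = [1, 1, -1, -1]

print(perceptron_separates(toy_points, toy_labels))  # separable layout
print(perceptron_separates(xor_points, xor_labels))  # non-separable layout
```

A return value of False within `max_epochs` is only conclusive for clearly non-separable layouts such as the XOR-like one; for borderline data a larger budget, or a different test, would be needed.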

Turning now to the network’s performance on the test sets, Fig. 5.3 plots the performance of the network on the two datasets (the English and the Chinese) both overall and broken down by semantic category (animate vs inanimate). Figure 5.3a plots the overall generalisation performance of the network on the English test set. We observe that the network plateaus after a few dozen epochs at 55% accuracy (our point estimate) and does not improve thereafter. This accuracy rate is somewhat lower than the one reported in the behavioural study (59%) but still on the same scale. Given the t-SNE solutions in Fig. 5.2, this result suggests that either the problem of re-associating the already paired determiners is ‘unlearnable’ or the stimuli used in these experiments cannot yield perfect generalisation.

Figure 5.2 Two-dimensional projections of the stimuli used in the animacy experiments, colour-coded as Animate or Inanimate. (a) Projection of the words used in the English experiment (Williams, 2005); (b) projection of the words used in the experiments with speakers of Chinese (Chen et al., 2011, Experiment 1). We translate the words into their English equivalents only for illustration purposes; for more details on the Chinese datasets see Appendix C. Words labelled in panel (a): picture, pig, lion, table, cat, mouse, monkey, cup, stool, book, bird, box, plate, bear, vase, snake, cushion, fly, television, cow, sofa, clock, dog, bee. Words labelled in panel (b): picture, pig, lion, table, cat, mouse, monkey, cup, stool, book, bird, plate, bear, vase, snake, box, television, cow, sofa, clock, dog, bee.

In the first case, the problem is that the network identifies distinct neuronal ensembles in the input for each determiner. This behaviour causes problems during testing as there is no overlap between the words that belong to the same category but were paired with distinct determiners. In the second case, the network discovers the relevant regions but associates them with each determiner through weak connections, so that the residual activation during testing is small. In other words, the network expresses uncertainty by assigning small weights between the relevant regions of the input and the determiners. When probed with an already paired word, the activation of the other grammatical alternative is higher than the rest but still quite low overall. We examine these two possibilities below, when we evaluate the activation of the hidden layer. For the Chinese speakers, the network exhibits performance similar to English, although the point estimate is now equal to the behavioural data (.56 for the network, .56±10 in the behavioural data).¹⁰

¹⁰ We obtain this estimate from the responses based on unconscious structural knowledge (Chen et al., 2011, p. 1754).

Figure 5.3 Generalisation gradients for the two animacy experiments. (a) Generalisation performance (English); (b) generalisation performance (Chinese); both panels plot the activation (0 to 1) of the grammatical and ungrammatical alternatives by epoch (0 to 2000). (c) Activation of the hidden layer (English); (d) activation of the hidden layer (Chinese); both panels show epochs 1, 100, 1000 and 2000, with animate and inanimate words marked. Fig. 5.3a plots the by-epoch performance of the model on the English dataset, whereas Fig. 5.3b provides a similar view for the Chinese. Figs. 5.3c and 5.3d plot the activation of the hidden layer when probed with the stimuli of the training sets. Concretely, we feed the network all the stimuli without performing any weight updates and extract the output of the hidden layer. Subsequently, we lower the dimensionality of this output using the t-SNE algorithm described above.

We finally turn to the activation of the hidden layers. We noted above that we could attribute the limited generalisation performance of the network to either the network not being able to discover the regions of interest in the input vector or that these regions are weakly associated with the relevant output units (i.e., the determiners). We explore these two alternatives by recording the activity of the hidden layer when presented with the words that the network encounters during training. Subsequently, we reduce the dimensionality of these representations for visualisation using the t-SNE algorithm. If the network can distinguish between the two groups in the hidden layer, it means that it has abstracted the relevant regions from the input and the low performance can be attributed to the weak connections. If, on the other hand, the network is unable to distinguish between semantic categories in the hidden layer, it points to the direction that the neural embeddings bias the network towards alternative, but partially consistent, solutions.
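The probing step described above (a forward pass with no weight updates, followed by reading off the hidden layer) can be sketched as follows. This is an illustration under stated assumptions rather than the original implementation: the logistic activation, the layer sizes, the weights, and the toy embeddings are all invented, and `probe` and `hidden_activations` are hypothetical helper names. The subsequent t-SNE projection is omitted here because it requires an external library.

```python
import copy
import math


def sigmoid(z):
    """Logistic activation (an assumption; the source does not specify it here)."""
    return 1.0 / (1.0 + math.exp(-z))


def hidden_activations(embedding, hidden_weights, hidden_bias):
    """Output of each hidden unit for one input embedding."""
    return [
        sigmoid(sum(w * e for w, e in zip(unit_weights, embedding)) + b)
        for unit_weights, b in zip(hidden_weights, hidden_bias)
    ]


def probe(words, embeddings, hidden_weights, hidden_bias):
    """Forward pass only: nothing is trained and no weight is modified."""
    return {
        word: hidden_activations(embeddings[word], hidden_weights, hidden_bias)
        for word in words
    }


# Toy inputs (made-up values): two-dimensional embeddings, three hidden units.
toy_embeddings = {"dog": [0.9, 0.1], "table": [0.1, 0.9]}
toy_weights = [[2.0, -2.0], [-2.0, 2.0], [0.0, 0.0]]
toy_bias = [0.0, 0.0, 0.0]

weights_before = copy.deepcopy(toy_weights)
acts = probe(["dog", "table"], toy_embeddings, toy_weights, toy_bias)

for word, activation in acts.items():
    print(word, [round(a, 3) for a in activation])
```

Each word yields one vector of hidden activations; stacking these vectors gives the matrix that a dimensionality-reduction method such as t-SNE would then project to two dimensions for plotting.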

Figure 5.3c shows the activity of the hidden layer when probed with the training patterns from the English dataset. We see that the network discovers the neuronal ensembles that signal animacy from the input as it distributes the concepts according to the given semantic distinction in its latent space. Although we do not test for that directly, the low performance in the test set, in this case, should be due to the weak connections between the hidden layer and the output. In other words, if an animate word takes one of two determiners the network avoids accentuating the relevant regions as these might activate the wrong determiner during training. In the Chinese experiment (Fig. 5.3d), on the other hand, while we still cannot preclude the above possibility, the network appears to have problems in identifying the relevant regions from the input as animate and inanimate concepts are not linearly separable.

These two contrasting results suggest that semantic knowledge, although helpful, is not necessarily needed to achieve above-chance generalisation in these tasks. The two networks achieve the same level of performance without following the same semantic rule. In the case of English, the simulated ‘learners’ unconsciously separate their input by its semantics (or, at least, a semantic-like distinction), while in the case of Chinese they do not. Presumably, in the case of Chinese, the distributional input provided a better alternative to the model, one not based solely on semantic grounds. It has to be noted that this result is independent of whether the t-SNE algorithm clusters concepts by animacy. That is, even though the t-SNE algorithm can identify the difference between the two semantic categories, during training the model does not consider this dimension to be the most predictive one.

In document Understanding Semantic Implicit Learning through distributional linguistic patterns: A computational perspective (Page 159-163)