Chapter 4: Decoding the internal representation of a predictive machine and testing its relation with
4.3. Decoding the pattern of activation in the LSTM internal and output layers
4.3.1. Sanity checks and methods
Before using this LSTM network model to characterize the spatiotemporal dynamics of neural activity, I explored the nature of information processing in the two hidden layers using a number of linguistic models capturing different aspects of the computations involved in human speech comprehension, as illustrated in Chapter 3. These models are the full-context and verb-alone constraint models, together with a model of the lexical semantic information of an input word, all of which were tested against the brain data and reported in Chapter 3. In this way, I hoped to gain a better understanding of how the network processes an incrementally unfolding sequence of words in a sentence and to construct more specific neurocognitive hypotheses for the different layers of the network. Before going into detail, however, one key property of the LSTM network, the recurrence of a theme, can readily be seen from the simple sanity check below.
To illustrate that the network is capable of retaining and applying a recurrent theme when making predictions, I used it to generate a sentence from a given fragment (a simple continuation study). Each word after the fragment was sampled from the network's output prediction, and the sampled word was appended to the fragment in order to sample the next word, until the end of the sentence, in the following way:
“The local politician emphasised that …”
“The local politician emphasised that the …”
“The local politician emphasised that the issue …”
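The continuation procedure above can be sketched in a few lines of Python. This is a minimal illustration only, not the code used in this thesis: the function names, the `</s>` end-of-sentence token and the toy lookup table standing in for the LSTM softmax output are all assumptions made for the example.

```python
import random

def sample_next(probs, rng):
    """Draw one word from a {word: probability} distribution."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

def continue_fragment(fragment, predict, rng, end_token="</s>", max_words=50):
    """Extend a fragment one sampled word at a time.

    `predict` is any callable mapping the words so far to a
    {word: probability} distribution over the next word.
    """
    words = list(fragment)
    for _ in range(max_words):
        next_word = sample_next(predict(words), rng)
        if next_word == end_token:
            break
        # The sampled word becomes part of the context for the next step.
        words.append(next_word)
    return words

# Hypothetical stand-in for the network: the next word depends only on the
# last word here, whereas the LSTM conditions on the whole preceding context.
TOY_TABLE = {
    "that": {"the": 1.0},
    "the": {"issue": 1.0},
    "issue": {"</s>": 1.0},
}

def toy_predict(words):
    return TOY_TABLE.get(words[-1], {"</s>": 1.0})

fragment = ["The", "local", "politician", "emphasised", "that"]
sentence = continue_fragment(fragment, toy_predict, random.Random(0))
print(" ".join(sentence))
```

Because each sampled word is fed back as input, any thematic information that the predictor keeps about the earlier context can shape every later word, which is the property the sanity check is meant to expose.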
For different fragments, the following sentences were generated (a given fragment is marked in bold and a recurring theme is underlined):
“The local politician emphasised that the issue was the result of political manipulation of the press and the public interest.”
“The bank manager acknowledged the mistake and notified the FDIC as soon as possible.”
“The duty solicitor concluded that the claim was not only invalid but also in breach of Article 14 of the European Convention on Human Rights.”
“The graduate student applied to a university to find out which university he was interested in and then went to a job fair.”
From these LSTM-generated sentences, we can see that the theme of the subject in a given fragment (highlighted in bold) recurs throughout the sentence, as indicated by the underlined text. This shows that the network is capable of holding the necessary thematic information in its memory so that it can associate the recurring theme in the later part of the sentence with the subject. Again, this is the main advantage of using the LSTM architecture, which is designed to address the vanishing gradient problem in recurrent layers (see Chapter 2).
To examine in more detail which linguistic properties are activated in the internal representation of the model at each point in a sentence, I compared the similarity pattern of the internal state at every incremental sequence of words with that of 7 different models of interest, described below. These models capture a variety of linguistic properties of incremental computations at five adjacent points in a sentence, starting from the subject noun up to a point including the complement noun. For example, in the sentence “The young man fled the army when the fighting began”, the five points corresponded to the incremental word sequences ending in “man”, “fled”, “the”, “army” and “when”. The models of interest were: the full-context and verb-alone subcategorization frame (syntax) constraint models (see 2.5.1); the verb-alone WordNet-MDL model capturing the VALEX lexical constraint in the WordNet conceptual space (see 2.5.2(a)); the full-context and verb-alone LDA topic models capturing the co-occurrence relation between a verb and a following noun specifically in a direct object frame (see 2.5.2(b)); and the subject noun and verb DM models published by Baroni & Lenci (2010), which capture the general co-occurrence properties of a word (see 2.5.2(b)).
Comparing the similarity patterns involved creating a set of RDMs of the LSTM internal activation (see section 3.2) at the five points mentioned above. In section 3.2.1, I described a number of distance metrics and the properties of each. Here, I used Euclidean distance as the default metric to compare the representational geometry of the activation vectors across the 1,024 hidden processing units (or neurons) between different trials. This metric is highly sensitive to the exact amplitude of each processing unit, which is the key information for generating an output prediction via the weighted combination across the processing units in the softmax layer (see 3.2). In contrast, cosine distance was used to model the similarity pattern of the softmax layer, which consists of nearly 800,000 units, each of which reflects the prediction strength for a particular word in the LSTM vocabulary. The reason for using cosine distance here was to disregard the absolute probability differences across the ~800,000 types (many of which are not in the human vocabulary) while still taking the overall covariance into account. These LSTM RDMs were compared with each of the model RDMs using Spearman’s correlation, as described in 3.2.1. The results are shown in the figures below (Figures 4-1 and 4-2).
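The comparison described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the analysis code used in this thesis: the function names are hypothetical, the toy activation vectors are invented, and a real analysis would normally rely on a numerical library rather than lists.

```python
import math
from itertools import combinations

def euclidean(u, v):
    """Euclidean distance: sensitive to the amplitude of every unit."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """Cosine distance: ignores the overall scale of each vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def rdm_vector(patterns, distance):
    """Vectorized upper triangle of an RDM: one distance per pair of trials."""
    pairs = combinations(range(len(patterns)), 2)
    return [distance(patterns[i], patterns[j]) for i, j in pairs]

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = mean_rank
        start = end + 1
    return result

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman's correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Invented activation patterns for four trials (rows) over three units.
hidden_states = [[0.1, 0.9, 0.3], [0.2, 0.8, 0.4], [0.9, 0.1, 0.5], [0.7, 0.2, 0.6]]
# Invented model patterns for the same four trials.
model_patterns = [[1.0, 0.0], [0.9, 0.2], [0.1, 1.0], [0.3, 0.8]]

lstm_rdm = rdm_vector(hidden_states, euclidean)
model_rdm = rdm_vector(model_patterns, cosine_distance)
print(spearman(lstm_rdm, model_rdm))
```

The two metrics differ in what they discard: doubling a vector leaves its cosine distance to the original at zero but changes its Euclidean distance, which is why the former suits the softmax output and the latter the hidden-unit amplitudes.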
4.3.2. Results
Figure 4-1: A correlation plot of the first Hidden Layer (HL0) with 7 different models of interest at the five adjacent points in a sentence described in the main text. Each line in the plot reflects the correlation time-course associated with a particular model indicated in the legend. The error bars show 95% confidence interval calculated as $\tanh\left(\tanh^{-1}(\rho) \pm 1.96/\sqrt{N-3}\right)$, where $\rho$ is a ranked correlation coefficient and $N$ is the total number of elements in the vectorized RDMs. The inverse hyperbolic tangent (Fisher) transformation on $\rho$ renders the sampling distribution to be approximately normal with the standard error of $1/\sqrt{N-3}$ and the
Figure 4-2: A correlation plot of the second hidden layer (HL1). Other annotation details are the same as in Figure 4-1.
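The confidence interval defined in the caption of Figure 4-1 can be computed as in the sketch below. This is an illustration of the formula only; the function name `fisher_ci` and the example values of the correlation and of N are assumptions and are not taken from the reported results.

```python
import math

def fisher_ci(rho, n, z=1.96):
    """Confidence interval for a correlation via the Fisher transformation.

    rho: rank correlation coefficient, strictly between -1 and 1
    n:   number of elements in the vectorized RDMs (must exceed 3)
    z:   normal quantile; 1.96 gives a 95% interval
    """
    standard_error = 1.0 / math.sqrt(n - 3)
    centre = math.atanh(rho)  # inverse hyperbolic tangent of rho
    lower = math.tanh(centre - z * standard_error)
    upper = math.tanh(centre + z * standard_error)
    return lower, upper

# Hypothetical example values, chosen only to show the call.
lower, upper = fisher_ci(0.3, 103)
print(lower, upper)
```

The interval is symmetric on the transformed scale but not around the correlation itself once it is mapped back, and it always stays within the range from -1 to 1.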
From Figure 4-1, we can see that the models reflecting the semantic properties of the input word at each point show the greatest fit. For example, at the point when the subject noun is revealed (“The young man”), the semantics of “man” is strongly activated and shows the greatest fit, which declines to the poorest fit as soon as the following verb “fled” is revealed (light green). A similar pattern was observed for the semantics of “fled”, which declined immediately after the function word “the” was revealed (dark green). From these results, we can infer that the role of the first internal layer HL0 is to activate the semantic information of the input word and to project this information to HL1 for the further predictive processing described below.
Next, Figure 4-2 shows a largely different pattern of results. Although the peak effects for the semantics of the subject noun and the verb occurred as they were being heard, the peak effect of the verb semantics did not decline even when the function word in the verb’s complement was heard (dark green). Further, the correlations between the constraint models and the HL1 state were generally stronger than in HL0: the syntactic constraint was consistently activated at the point of the verb (light and dark blue), whereas the semantic constraint was activated later, at the point of the complement function word (orange, light pink and purple). As expected, these constraint effects on the complement phrase declined once the actual complement was revealed. These patterns of results suggest that the information processing in HL1 involves computing and activating constraints on the various linguistic properties of the upcoming continuation, including both syntax and semantics.
To investigate the information encoded in the output layer of the LSTM, I took exactly the same approach, constructing an RDM from the output vector and comparing it with the 7 different models of interest used in Chapter 3 (see Figure 4-3). Interestingly, the results showed a different pattern from those for the internal states. First, the two syntactic constraint models showed strong correlations when the verb was heard whereas the semantic models did not, indicating that the LSTM lexical prediction after the verb mainly determined likely syntactic frames, assigning high probability values to a number of function words. Second, neither the subject noun semantics nor the verb semantics showed strong correlations at the point when the respective words were revealed; only the verb semantics model showed a strong peak, at the point of the function word in the complement phrase, in conjunction with the other semantic constraint models. This means that the similarity pattern of the semantics of verbs was strongly related to that of the LSTM lexical prediction for the complement content word, implying the importance of the verb in determining the semantics of its complement. This finding is particularly informative because it suggests that the prediction of the complement noun is strongly determined by the verb semantics, which showed a higher correlation than the full-context semantic constraint model in orange (see below for further discussion).
Figure 4-3: A correlation plot of the softmax output layer. Other annotation details are the same as in Figure 4-1.