Sequential vs hierarchical processing accounts of human speech comprehension

Chapter 4: Decoding the internal representation of a predictive machine and testing its relation with

4.1. Sequential vs hierarchical processing accounts of human speech comprehension

A recent research article in Nature Neuroscience again sheds light on the importance of syntax in understanding speech (Ding, Melloni, Zhang, Tian & Poeppel, 2016). In a cross-linguistic study of Chinese and English, the authors found clear peaks in the neural response at the frequencies corresponding to the presentation rates of different linguistic levels. In particular, they observed three clear peaks in the frequency spectrum of the neural responses, at 1 Hz (the sentence presentation rate), 2 Hz (the phrase presentation rate) and 4 Hz (the syllable presentation rate), when listeners heard sentences consisting of two phrases, each containing two syllables, with one syllable presented every 250 ms (e.g. “new plans gave hope”, consisting of an NP and a VP). This pattern of results, however, was not observed when listeners did not understand the language. For example, the cortical activity of English speakers listening to Chinese stimuli showed entrainment only to the syllabic rhythm at 4 Hz. Consistent with other neuroimaging studies of artificial syntax showing that statistical cues are not necessary to trigger neural tracking of the structure in a sequence, they interpreted these results as evidence for cortical tracking of hierarchical structures in a sentence, supporting the claim that the brain can form representations at various syntactic levels based solely on rules (Ding, Melloni, Tian & Poeppel, 2017).
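
The relation between the presentation rates and the spectral peaks described above can be illustrated with a minimal sketch. It is not the analysis of Ding et al. (2016): the function names, the 100 Hz sampling rate, the 20 s duration and the idealized response (a sum of unit-amplitude cosines, one per tracked level) are all assumptions introduced here for illustration only.

```python
import numpy as np

def presentation_rates(syllable_ms=250, syllables_per_phrase=2, phrases_per_sentence=2):
    """Rates (Hz) at which each linguistic unit recurs in the stimulus."""
    syllable_hz = 1000.0 / syllable_ms
    phrase_hz = syllable_hz / syllables_per_phrase
    sentence_hz = phrase_hz / phrases_per_sentence
    return {"sentence": sentence_hz, "phrase": phrase_hz, "syllable": syllable_hz}

def idealized_response(tracked_hz, duration_s=20, fs=100):
    """Sum of unit-amplitude cosines, one per tracked rate.
    This is an assumption for illustration, not a model of real MEG data."""
    t = np.arange(int(duration_s * fs)) / fs
    return sum(np.cos(2 * np.pi * f * t) for f in tracked_hz)

def spectral_peaks(signal, fs=100, threshold=0.5):
    """Frequencies whose normalized DFT amplitude exceeds the threshold."""
    amplitude = np.abs(np.fft.rfft(signal)) * 2 / len(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
    return [float(f) for f in freqs[amplitude > threshold]]

rates = presentation_rates()
# A listener who understands the language: all three levels are tracked.
native = spectral_peaks(idealized_response(rates.values()))
# A listener who does not: only the syllabic rhythm is tracked.
non_native = spectral_peaks(idealized_response([rates["syllable"]]))
print(rates, native, non_native)
```

With a 250 ms syllable, the syllable, phrase and sentence rates work out to 4 Hz, 2 Hz and 1 Hz, which is why peaks at exactly these frequencies are taken as markers of tracking at each level.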


Nevertheless, an obvious question is how well these results explain natural speech comprehension in real-life environments. Nobody speaks at a constant rate in real-life communication. Hence, although humans may be capable of tracking the structure of a sentence based solely on their syntactic knowledge, this does not mean that such tracking is necessary for understanding a spoken sentence. In fact, how a sentence is processed more likely depends on its syntactic complexity, such that a listener’s syntactic knowledge may become useful as a confirmatory process involving grammatical analysis of syntactically complex sentences.

Moreover, the pattern of results in Ding et al. (2016) was replicated by Frank & Yang (2018) using a word-level statistical model based on the Skipgram architecture (see Mikolov, Chen, Corrado & Dean, 2013), which knows nothing about such grammatical rules. For each simulated participant, they concatenated the N-dimensional (Skipgram) column vectors (where N was randomly sampled for each participant with mean = 300 and SD = 25) into a matrix in which each row represents the time-course of simulated MEG samples for a particular dimension. Each MEG sample was simulated such that the column vector contains only Gaussian noise (mean = 0 and SD = 0.5) until t milliseconds after word onset, with the actual information (signal) becoming available only after t milliseconds (the time-point t most plausibly reflects the word’s uniqueness point). Gaussian noise was also added to the signal to reflect the noise in MEG data. A power spectrum was then computed for each row using the discrete Fourier transform (DFT), which quantifies the amplitude of the sinusoid at each frequency contained in the row vector, and the power spectra were averaged across the N dimensions. This replication of the original results in Ding et al. (2016) suggests that the cortical tracking of syllabic, phrasal and sentential rhythms can be explained by lexical information without applying grammatical rules. It also raises the possibility that word-level statistics are sufficient to trigger tracking of local phrases, just as they can trigger the learning of syntactic rules (Seidenberg et al., 2002).
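
The simulation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Frank & Yang (2018). In particular, the word vectors are toy stand-ins (a shared prototype per word category plus small word-specific jitter) rather than trained Skipgram vectors, and the sampling rate (100 Hz), the onset delay t (50 ms), the number of sentences (40) and all function names are hypothetical. Because every toy sentence repeats the same category sequence, the resulting spectrum may also contain power at other harmonics of the sentence rate.

```python
import numpy as np

rng = np.random.default_rng(0)

FS = 100          # assumed sampling rate in Hz (hypothetical)
WORD_MS = 250     # one monosyllabic word every 250 ms, as in the text
NOISE_SD = 0.5    # SD of the Gaussian noise, as in the text

def toy_vectors(categories, n_dims):
    """Stand-in for Skipgram vectors (assumption): words of the same
    category share a prototype plus small word-specific jitter."""
    prototypes = {c: rng.normal(size=n_dims) for c in dict.fromkeys(categories)}
    columns = [prototypes[c] + 0.1 * rng.normal(size=n_dims) for c in categories]
    return np.stack(columns, axis=1)            # shape (n_dims, n_words)

def simulate_participant(vectors, onset_delay_ms):
    """Rows are vector dimensions, columns are simulated time samples."""
    per_word = WORD_MS * FS // 1000             # samples per word
    delay = onset_delay_ms * FS // 1000         # samples before the signal appears
    signal = np.repeat(vectors, per_word, axis=1)
    within_word = np.arange(signal.shape[1]) % per_word
    signal[:, within_word < delay] = 0.0        # noise only until t ms after onset
    return signal + rng.normal(0.0, NOISE_SD, size=signal.shape)

def mean_power_spectrum(data):
    """DFT power per row, averaged across the N dimensions."""
    power = np.abs(np.fft.rfft(data, axis=1)) ** 2
    freqs = np.fft.rfftfreq(data.shape[1], d=1.0 / FS)
    return freqs, power.mean(axis=0)

n_dims = int(round(rng.normal(300, 25)))        # N sampled per participant
sentence = ["ADJ", "NOUN", "VERB", "NOUN"]      # e.g. "new plans gave hope"
vectors = toy_vectors(sentence * 40, n_dims)    # 40 concatenated sentences
meg = simulate_participant(vectors, onset_delay_ms=50)
freqs, power = mean_power_spectrum(meg)
```

In this toy setting the only source of sentence-rate and phrase-rate periodicity is the similarity between vectors of words from the same category; no grammatical rule is applied at any step.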

In the light of Occam’s razor, cognitive science pursues parsimonious models as descriptive accounts of cognitive processing in humans. The logic is that if both a simpler and a more complex model explain a particular cognitive phenomenon, the simpler model is favoured unless the complex model performs significantly better in explaining the phenomenon. Assuming that abstraction requires additional cognitive operations, a non-abstracted model based on word-level statistics should be favoured (see Frank &


the lexical information captured by distributional models is already abstracted, reflecting the syntactic and semantic category information of an input (just like the topic models described in Chapter 2) without engaging syntactic knowledge. However, if such an abstracted representation is obtained through years of experience, the lexical information is likely to be represented in processing dimensions optimized through experience, without requiring further explicit computational operations for abstraction. This could enable the brain to track the hierarchical structure of simple sentences commonly used in daily conversation based on the lexical information.

Following on from this debate, I use a connectionist model designed to process lexical information in a distributional format in order to generate an accurate prediction of the upcoming word. This connectionist framework provides a transparent predictive machine whose internal state, and its relation to the output response, can be investigated directly at any point in a sentence. Compared to a human brain consisting of billions of neurons (or information processing units), such a predictive machine is much simpler in architecture, with fewer processing units and a much more straightforward representation. How similar incremental speech processing is in a human brain and in a state-of-the-art predictive machine is an interesting question that has not been thoroughly investigated in the literature. By decoding the linguistic properties activated in the internal state of a well-trained machine and relating the pattern of activation to the temporal dynamics of neural activity using RSA, this chapter identifies a number of brain regions showing a pattern of activity similar to that of the machine at the time when the multi-level linguistic constraints are activated (see Chapter 3). Further, by modelling the spatiotemporal characteristics of the activity pattern in each ROI using the output prediction of the machine, I evaluate the prediction in the light of the incremental computations in humans during speech comprehension. If the computation involved in generating the constraint in the brain is based purely (or partly) on combining the distributional properties of an input word with an internal representation of the preceding context, without any explicit engagement of syntactic knowledge, a significant correlation is expected between the representational geometries of the internal state of the model and of the brain, at similar times and in similar regions to those shown in Chapter 3.
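
The comparison of representational geometries mentioned above can be sketched for a single time-point and region as follows. This is a minimal illustration of representational similarity analysis (RSA), not the pipeline used in this thesis: the data are random placeholders, and the choice of correlation distance, the Spearman comparison of the upper triangles, and all names and array sizes are assumptions made here for illustration.

```python
import numpy as np

def rdm(patterns):
    """Representational dissimilarity matrix: 1 - Pearson r between
    the activity patterns (rows) of every pair of items."""
    return 1.0 - np.corrcoef(patterns)

def upper_triangle(matrix):
    """Off-diagonal upper-triangle entries, i.e. each item pair once."""
    i, j = np.triu_indices_from(matrix, k=1)
    return matrix[i, j]

def spearman(x, y):
    """Spearman correlation via ranks. Assumes no ties, which holds
    almost surely for continuous-valued dissimilarities."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

def rsa_score(model_states, brain_patterns):
    """Second-order similarity between model and brain geometries."""
    return spearman(upper_triangle(rdm(model_states)),
                    upper_triangle(rdm(brain_patterns)))

rng = np.random.default_rng(0)
n_words = 40
model_states = rng.normal(size=(n_words, 50))    # hypothetical internal states
mixing = rng.normal(size=(50, 80))               # hypothetical model-to-sensor map
related_brain = model_states @ mixing + 0.5 * rng.normal(size=(n_words, 80))
unrelated_brain = rng.normal(size=(n_words, 80)) # shares no structure with the model

print(rsa_score(model_states, related_brain))
print(rsa_score(model_states, unrelated_brain))
```

Because only the dissimilarity structure is compared, the model states and the brain patterns do not need to have the same dimensionality; a region whose geometry resembles that of the model should yield a reliably higher score than one that does not.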


4.2. Connectionist models in a parallel distributed processing (PDP) framework

In document Neurobiology of incremental speech comprehension (Page 132-135)