Chapter 2: Computational modelling of the incremental processing of a sentence
2.1. Bayesian Belief Updating (BBU)
Incremental speech processing involves using the available information from the context to constrain an upcoming input (which can be a word, a phrase, a sentence, etc.) and, once that input is heard, integrating it into the prior context in order to constrain the subsequent input more accurately. This cycle continues until the speaker ends their message. This conceptual description of
incremental speech processing fits well within the Bayesian framework of language
comprehension. The motivation for this framework originates from Bayes’ theorem, which describes the probability of an event based on prior information and knowledge related to the event. A simple mathematical description of Bayes’ theorem is as follows:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \tag{1}$$
where $A$ is a target variable and $B$ is a context variable on which the target $A$ is conditioned. As a simple application to language processing, suppose that a listener hears an adjective-noun phrase like ‘yellow banana’. The goal is to model the listener’s internal beliefs about ‘banana’ given the preceding adjective ‘yellow’. By simply substituting ‘banana’ for $A$ and ‘yellow’ for $B$, we obtain the following:
$$P(\text{"banana"}_{t} \mid \text{"yellow"}_{t-1}) = \frac{P(\text{"yellow"}_{t-1} \mid \text{"banana"}_{t})\,P(\text{"banana"}_{t})}{P(\text{"yellow"}_{t-1})} \tag{2}$$
where $t$ and $t-1$ indicate the relative position of each word in the phrase. The goal is to model the posterior $P(\text{"banana"} \mid \text{"yellow"})$, which describes the probability of ‘banana’ given ‘yellow’. This expression is already useful because it makes explicit the mapping between the goal (the posterior) and the prior. The prior $P(\text{"banana"})$ describes the listener’s beliefs about the target ‘banana’ (i.e. the subjective probability of ‘banana’ alone) before the context ‘yellow’ is known. The likelihood $P(\text{"yellow"} \mid \text{"banana"})$ then evaluates the context ‘yellow’ against the listener’s prior beliefs about the target ‘banana’. The evidence
$P(\text{"yellow"})$ works as a context normaliser whose practical role is explained in Footnote 1 in Chapter 1. The concept of belief updating is reflected in the shift from a prior to a posterior at any given cycle, until the posterior converges to the delta distribution (probability 1 for the target and 0 otherwise). From a modelling perspective, this Bayesian approach provides useful insight into how prediction may change and develop as new words unfold incrementally in a sentence.
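As a worked illustration of equation (2), the following minimal Python sketch computes a posterior over a few candidate nouns after hearing ‘yellow’. The candidate set and all probability values are hypothetical numbers chosen for illustration only; they are not estimates from any corpus or from this thesis.

```python
def posterior(prior, likelihood):
    """Apply Bayes' theorem over a discrete set of candidate targets.

    prior:      dict mapping target -> P(target)
    likelihood: dict mapping target -> P(context | target)
    Returns a dict mapping target -> P(target | context).
    """
    # Numerator of Bayes' theorem for each candidate target.
    joint = {t: likelihood[t] * prior[t] for t in prior}
    # Evidence P(context): the normaliser, summed over all candidates.
    evidence = sum(joint.values())
    return {t: j / evidence for t, j in joint.items()}


# Hypothetical prior beliefs about the upcoming noun (assumed values).
prior = {"banana": 0.2, "car": 0.5, "sky": 0.3}
# Hypothetical likelihoods P("yellow" | noun) (assumed values).
likelihood_yellow = {"banana": 0.6, "car": 0.1, "sky": 0.05}

post = posterior(prior, likelihood_yellow)
for noun, p in post.items():
    print(f"P({noun!r} | 'yellow') = {p:.3f}")
```

With these assumed numbers, the belief in ‘banana’ rises from 0.2 to roughly 0.65 once ‘yellow’ is taken into account, which is the shift from prior to posterior described above.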
Another important aspect of this approach is that it models the cyclical development of prediction in sentence and discourse comprehension. Suppose that we are modelling the listener’s syntactic prediction of a complement structure in the sentence ‘The intrepid child found the picture’. For illustration purposes, I assume that the subject NP ‘The intrepid child’ is independent of the following complement structure, so that the complement structure is constrained entirely by the verb ‘found’ in the preceding context. It is then possible to track changes in prediction as follows (Figure 2-1):
Figure 2-1: A simplistic visual illustration of belief updating about the complement syntactic structure across different cycles in time. SCF = subcategorization frame.
In Figure 2-1, Cycle 1 describes the process of incorporating the main verb ‘found’ into the prediction. Cycle 2 shows that this verb-informed prediction becomes a new prior that constrains the syntactic frames. Because a direct object structure is confirmed by the determiner ‘the’, the prediction cycle ends at Cycle 2 in this example, and the prior facilitates the integration of the direct object structure into the sentence. Hence, by tailoring the prediction more specifically to the up-to-date context, this Bayesian model promotes more rapid and accurate integration of the target frame (direct object). It is worth noting that any posterior at
the end cycle (Cycle 2 in this example) converges to a delta distribution, and the process of belief updating becomes conceptually equivalent to integrating the target into the context. In practice, the ‘target’ refers to a specific property (e.g. semantic meaning or grammatical category) of a particular linguistic unit (e.g. a word, a phrase or a clause) that appears after the context.
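The cyclical updating illustrated in Figure 2-1 can be sketched in a few lines of Python, where the posterior of one cycle is reused as the prior of the next. The frame labels and every probability value below are hypothetical and are not taken from Figure 2-1. Following the simplification made in the text, the determiner ‘the’ is assumed to be compatible only with the direct object frame.

```python
def update(prior, likelihood):
    """One cycle of belief updating: posterior is proportional to likelihood * prior."""
    unnormalised = {h: prior[h] * likelihood.get(h, 0.0) for h in prior}
    evidence = sum(unnormalised.values())
    if evidence == 0:
        raise ValueError("The input is impossible under the current beliefs.")
    return {h: v / evidence for h, v in unnormalised.items()}


# Hypothetical subcategorization frames (SCFs) with a flat initial prior.
frames = ["direct_object", "sentential_complement", "prepositional_phrase"]
beliefs = {f: 1.0 / len(frames) for f in frames}

# Cycle 1: incorporate the verb 'found' (assumed values of P(verb | frame)).
lik_found = {"direct_object": 0.70,
             "sentential_complement": 0.25,
             "prepositional_phrase": 0.05}
after_verb = update(beliefs, lik_found)

# Cycle 2: the posterior from Cycle 1 is now the prior. The determiner 'the'
# is assumed to confirm the direct object frame and rule out the others.
lik_the = {"direct_object": 1.0,
           "sentential_complement": 0.0,
           "prepositional_phrase": 0.0}
after_det = update(after_verb, lik_the)

print("After 'found':", after_verb)
print("After 'the':  ", after_det)
```

Under these assumptions the final posterior places all of its probability on the direct object frame, which is the delta distribution that marks the end of the prediction cycle.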
As shown in (2) and Figure 2-1, incremental speech comprehension proceeds by updating beliefs each time an input (here, the verb) that constrains the target (here, the SCF) is heard.
However, as already discussed in Chapter 1, prediction in speech processing is not limited to words but spans a variety of linguistic levels, from perception (phonological-lexical) to cognition (syntax-semantics). Psycholinguistic accounts based on the Fodorian modular theory (Fodor, 1983) claim that the processing streams are organized into separate, autonomous modules (Frazier, 1987). Other accounts propose jointly interacting streams (Marslen-Wilson, 1975; Altmann & Steedman, 1988). In this section, I briefly review a recent generative framework proposed by Kuperberg (2016) from a Bayesian perspective. Kuperberg’s framework claims that listeners infer the underlying cause of the observed inputs from a set of hierarchically organized representations (an internal generative model). These representations are the ones that best explain the statistical properties of the observed inputs, given the listeners’ beliefs about the message that the speaker is trying to convey. The beliefs propagate down to lower levels to tailor the representations by generating probabilistic predictions before the new input is processed. Predictions in these various domains interact hierarchically: for example, predictions about the semantic meanings or syntactic structures of possible continuations could influence predictions about candidate words, which could, in turn, affect the expected sequences of phonemes. Once the new input is heard, these probabilistic predictions are
evaluated against the bottom-up evidence in order to update the listeners’ prior beliefs. This top-down prediction scheme facilitates the processing of an input word in a sentence, and the input, in turn, enables flexible updating of the multi-level constraints through bottom-up projections. This process is illustrated in simplified form in Figure 2-2 below.
Figure 2-2: Incremental speech processing of a simple direct object sentence ‘The giant crocodile attacked the wildebeest’ in the light of the BBU generative framework (Kuperberg, 2016). This describes the role played by each input (i.e. a subject noun phrase, a verb and a complement noun phrase) in constructing the event representation (i.e. a message) in a predictive processing framework. Blue arrows indicate ‘prediction’ and orange arrows indicate ‘update’ or ‘integration’.
The problem now reduces to characterizing the two kinds of arrow in Figure 2-2: prediction and update. Under the view of prediction as a graded, probabilistic phenomenon (see Kuperberg & Jaeger, 2016), the conditional probability distribution over the upcoming input directly represents the information used to predict that input (i.e. the constraints). It is also important to quantify the certainty of beliefs, because the strength of top-down prediction depends on the certainty with which the beliefs are held (Kuperberg, 2016). Lastly, the difficulty of updating reflects the proportion of variance in the constraints (referred to as ‘pruned
probability mass’ in Levy, 2008, p. 1131) that cannot be explained by the bottom-up input, the so-called ‘prediction error’. The human language system aims to minimize this prediction error through an iterative process of predicting and updating throughout a sentence, and it eventually obtains converged representations at various levels, each of which best explains the observed sentence. The ways to characterize prediction and to quantify certainty and error are described in the following sections.
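To make the three quantities concrete, the sketch below uses common information-theoretic measures: Shannon entropy as an index of (un)certainty, and surprisal, pruned probability mass and Kullback-Leibler divergence as indices of prediction error. These are illustrative choices and are not necessarily the measures adopted in the following sections. The candidate words and their probabilities are hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy in bits: higher values mean less certain beliefs."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def surprisal(dist, observed):
    """Surprisal in bits of the observed input under the prediction."""
    return -math.log2(dist[observed])


def kl_divergence(post, prior):
    """KL divergence D(post || prior) in bits: the size of the belief update."""
    return sum(q * math.log2(q / prior[h]) for h, q in post.items() if q > 0)


# Hypothetical prediction about the object noun after
# 'The giant crocodile attacked the ...' (assumed values).
prediction = {"wildebeest": 0.5, "zebra": 0.3, "tourist": 0.2}
observed = "wildebeest"

# After the input is heard, the beliefs collapse to a delta distribution.
updated = {h: 1.0 if h == observed else 0.0 for h in prediction}

uncertainty = entropy(prediction)              # certainty of the prediction
error_bits = surprisal(prediction, observed)   # error as surprisal
pruned_mass = 1.0 - prediction[observed]       # mass on candidates ruled out
update_size = kl_divergence(updated, prediction)

print(f"entropy = {uncertainty:.3f} bits")
print(f"surprisal = {error_bits:.3f} bits")
print(f"pruned probability mass = {pruned_mass:.2f}")
print(f"KL(updated || prediction) = {update_size:.3f} bits")
```

In this special case, where the updated beliefs form a delta distribution on the observed word, the KL divergence equals the surprisal of that word, so the two error measures coincide.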
Kuperberg’s BBU framework is a variant of the ‘predictive coding’ framework (Friston, 2005, 2008), which has drawn significant attention in the field of cognitive and perceptual neuroscience. As stated in Kuperberg and Jaeger (2016), “Hierarchical predictive coding in the brain takes the principles of the hierarchical generative framework to an extreme by proposing that the flow of bottom-up information from primary sensory cortices to higher level association cortices constitutes only the prediction error, that is, only information that has not already been ‘explained away’ by predictions that have propagated down from higher level cortices…”. This specific neurobiological hypothesis from the predictive coding account has been tested and corroborated in a series of behavioural and neuroimaging studies of speech perception (Sohoglu, Peelle, Carlyon & Davis, 2012, 2014; Sohoglu & Davis, 2016). These studies consistently reported reduced activity in the superior temporal gyrus (STG) when the speech input (target) was more expected, supporting the claim that the brain is sensitive to the mismatch (error) between expected and actual input.