

Chapter 2: Computational modelling of the incremental processing of a sentence

2.4. Quantifying the “degree” in prediction: the information-theoretic framework

Under the view of prediction as a probabilistic phenomenon, a constraint can be expressed as a probability distribution. Such a distribution captures the various possibilities with different degrees of expectation, and it can be compared with the distributions associated with other linguistic contexts to illuminate how the processing state of a system changes as a function of prediction. However, we can ask a more fundamental question: is the constraint useful? It is plausible that the human language system is flexible enough to utilize the constraint only when it is sufficiently informative. If the constraint is not very informative, there is little point in changing the processing state. Information theory (Shannon, 1948) offers a way to quantify the amount of information contained in a constraint expressed as a probability distribution, and thereby provides an answer to the above question.

One of the key measures in information theory is “entropy”, which quantifies how much uncertainty is involved in the value of a random variable or the outcome of a random process. Entropy is measured in bits (the common currency of information theory, given a base-2 logarithm) and is defined as the expected value of the negative logarithm of the probability mass function (PMF):

\[ H(Y) = E[-\log P(Y)] = -\sum_{i=1}^{N} P(y_i) \log P(y_i) \tag{20} \]

where 𝑌 is a random variable with 𝑁 possible outcomes. Taking the logarithm of a probability is useful because it makes the computation additive for independent sources: for example, if the entropy of a fair coin toss is 1 bit, the entropy of 𝑚 independent tosses is simply 𝑚 bits. For the same reason, the logarithm is commonly adopted when maximizing a likelihood or posterior in many of the statistical optimization algorithms described throughout this thesis.

To make the interpretation more concrete, consider a coin toss. The entropy (uncertainty) is at its maximum if the coin is fair (i.e. the distribution is uniform), because knowing that the coin is fair does not help a system make a correct prediction at all. However, if the coin is unfair such that one outcome is more probable than the other, knowing the actual probabilities associated with these outcomes clearly improves the prediction (and the entropy becomes lower). Using entropy as a model of human speech comprehension allows researchers to test the hypothesis that entropy is tracked incrementally throughout the speech, such that prediction only occurs when the constraint is informative (i.e. when the entropy is low). In the context of incremental speech comprehension, the constraint entropy tends to decrease as more words are heard in a sentence, because the constraint often becomes more informative with richer context. This tendency is known as entropy reduction, an important descriptive property of incremental speech comprehension (Hale, 2006).
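The coin example can be checked numerically. The following is a minimal sketch of (20) in Python, assuming a base-2 logarithm (so that entropy is in bits) and the convention that zero-probability outcomes contribute nothing; the function name `entropy` and the biased-coin probabilities are illustrative choices, not values taken from this thesis.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution, as in (20)."""
    # Outcomes with zero probability are skipped (0 * log 0 is taken as 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)


fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]  # hypothetical bias, for illustration only

h_fair = entropy(fair_coin)
h_biased = entropy(biased_coin)

# Joint distribution over m independent fair tosses:
# 2**m equally likely sequences, each with probability 0.5**m.
m = 3
m_tosses = [0.5 ** m] * (2 ** m)
h_m = entropy(m_tosses)

print(f"fair coin:     {h_fair:.3f} bits")
print(f"biased coin:   {h_biased:.3f} bits")
print(f"{m} fair tosses: {h_m:.3f} bits")
```

Under these assumptions the fair coin yields 1 bit, the biased coin yields less than 1 bit, and 𝑚 independent fair tosses yield 𝑚 bits, which illustrates the additivity described above.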

Whereas entropy describes the degree of uncertainty within a probability distribution, cross-entropy measures the expected number of bits needed to predict an upcoming linguistic unit when an estimated distribution is used instead of the true distribution. As a result, the cross-entropy is never lower than the entropy: using the estimated constraint can only require extra bits relative to using the true constraint, and the two are equal only when the estimate matches the true distribution (in the context of incremental speech comprehension, the estimated and the true constraints refer to the prior and the posterior of the belief updating system as illustrated in Figure 2-1). The cross-entropy consists of two terms: the entropy of the true constraint (the minimum number of bits required for prediction) and the KL-divergence between the true and estimated constraints (the extra bits required for prediction when an estimated distribution is used):


\[ H(Y, O) = H(Y) + D_{KL}(Y \| O) = -\sum_{i=1}^{N} P(y_i) \log P(o_i) \tag{21} \]

where 𝑂 is the estimated distribution of 𝑌 (see (10)). As described above, the cross-entropy is a common error function in neural networks with a softmax activation in the output layer, where the softmax output is the estimate of the true distribution. If the true distribution is a delta distribution (i.e. a label), the cross-entropy becomes equivalent to surprisal.
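The decomposition in (21) can be verified on a small example. The sketch below assumes base-2 logarithms and an estimated distribution that assigns non-zero probability wherever the true distribution does; the two three-outcome distributions and the function names are hypothetical and chosen only for illustration.

```python
import math


def entropy(p):
    """H(Y): entropy of the true distribution, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """D_KL(Y || O): extra bits incurred by using q in place of p.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(p, q):
    """H(Y, O): expected bits when outcomes follow p but are predicted with q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


true_dist = [0.7, 0.2, 0.1]  # hypothetical P(y_i)
estimate = [0.5, 0.3, 0.2]   # hypothetical P(o_i)

h = entropy(true_dist)
d = kl_divergence(true_dist, estimate)
ce = cross_entropy(true_dist, estimate)

print(f"H(Y)       = {h:.4f} bits")
print(f"D_KL(Y||O) = {d:.4f} bits")
print(f"H(Y, O)    = {ce:.4f} bits")
```

In this sketch the cross-entropy equals the sum of the entropy and the KL-divergence up to floating-point error, and it collapses to the entropy when the estimate is identical to the true distribution.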

Computing the entropy of the constraint enables us to quantify how informative it is for predicting an upcoming input. This metric could be the basis for deciding whether or not to utilize the constraint. Can we then also quantify the effect of prediction on processing the upcoming input? This is another critical question, and answering it could support prediction as a core speech processing mechanism in humans. Conceptually, it is not difficult to formulate a model that addresses the question: how unexpected is the outcome given the prediction? This can be quantified by any distance or divergence function between the prediction 𝑂 and the outcome 𝑌. In the information-theoretic setting, we use the forward KL-divergence between these two distributions: 𝐷𝐾𝐿(𝑌||𝑂). If the outcome 𝑌 is a label representing the target word being heard, 𝑌 consists of 1 for the target and 0 for all other words that were considered in the prediction 𝑂. The effect of prediction on processing the target can then be formulated as:

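\[ D_{KL}(Y \| O) = \sum_{i=1}^{N} P(y_i) \log \frac{P(y_i)}{P(o_i)} = \sum_{i=1}^{N} \begin{cases} 1 \cdot \log \frac{1}{P(o_i)} & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} = -\log P(o_j) \tag{22} \]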

where 𝑗 is the index of the target word in the distribution. This simplified form is known as “surprisal”, reflecting how difficult it is to process the target with respect to the given context (i.e. if the target 𝑜𝑗 is strongly predicted such that 𝑃(𝑜𝑗) is high, − log 𝑃(𝑜𝑗) is consequently low, and vice versa). Using the same logic, it is possible to model the belief (prediction) updating process as each word incrementally unfolds in a sentence (see the multicycle BBU framework in 2.1). The degree of update is simply the KL-divergence between the constraints before and after taking a new input into account. If a new input does not affect the state of belief at all, the constraint does not change and the divergence is zero. If it does affect the state of belief, the degree of update is quantified under this formulation. From here on, I refer to any metric that represents “how different the target linguistic unit is with respect to the prior constraint” as constraint error (hence, this is not a term that describes the quality of the constraint), and surprisal is a particular way of representing the constraint error using KL-divergence.
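Both quantities described above, surprisal and the degree of belief update, can be illustrated with the same KL-divergence function. The sketch below is a minimal example under stated assumptions: logarithms are base 2, the candidate set has three words with made-up probabilities, and the belief update is computed as the divergence of the posterior from the prior, following the convention that the posterior plays the role of the true constraint. None of the numbers come from the models in this thesis.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def surprisal(prediction, j):
    """-log P(o_j): the simplified form in (22) for a target at index j."""
    return -math.log2(prediction[j])


# Hypothetical prediction P(o_i) over three candidate words.
prediction = [0.6, 0.3, 0.1]
target = 1  # index j of the word actually heard

# The outcome Y as a label: 1 for the target, 0 elsewhere.
label = [1.0 if i == target else 0.0 for i in range(len(prediction))]

s = surprisal(prediction, target)
d = kl_divergence(label, prediction)
print(f"surprisal      = {s:.4f} bits")
print(f"D_KL(Y || O)   = {d:.4f} bits")

# Belief update: divergence between the constraint after (posterior)
# and before (prior) a new input; both distributions are hypothetical.
prior = [0.6, 0.3, 0.1]
posterior = [0.2, 0.7, 0.1]
update = kl_divergence(posterior, prior)
no_update = kl_divergence(prior, prior)
print(f"belief update  = {update:.4f} bits")
print(f"no new evidence = {no_update:.4f} bits")
```

With a label as the outcome, the KL-divergence and the surprisal agree up to floating-point error, and an input that leaves the constraint unchanged produces an update of zero.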

Referring back to the cross-entropy (21), which is often used as a loss function in training neural networks (10): if the posterior distribution 𝑃(𝑌) is simply a label indicating a target, the KL-divergence simplifies to (22) and the posterior entropy 𝐻(𝑌) becomes 0 because there is no uncertainty. With the 𝑗th response being the target, (21) therefore reduces directly to (22). This is why the cross-entropy is known as a generalized metric of surprisal and is commonly used as a loss (error) function in many training algorithms.
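The step from (21) to (22) can be written out explicitly. The following short derivation uses only the definitions in (20) to (22), with 𝑌 taken to be a label such that 𝑃(𝑦𝑗) = 1, plus the usual convention 0 log 0 = 0, which is an assumption added here for completeness:

```latex
\begin{aligned}
H(Y)    &= -\sum_{i=1}^{N} P(y_i) \log P(y_i) = -1 \cdot \log 1 = 0, \\
H(Y, O) &= H(Y) + D_{KL}(Y \| O) = D_{KL}(Y \| O), \\
H(Y, O) &= -\sum_{i=1}^{N} P(y_i) \log P(o_i) = -\log P(o_j).
\end{aligned}
```

The second and third lines together give the same expression as (22), so for a labelled target the cross-entropy loss and the surprisal of the target coincide.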

It has long been claimed that the subjective experience of stimulus intensity is proportional to the logarithm of the actual objective intensity (see Appendix 3 for the Weber-Fechner law, which motivates the logarithm as a psychophysical mapping function). In line with this claim, a recent psycholinguistic study revealed that reading time is logarithmically related to the objective prediction derived from a corpus-based computational model (Smith & Levy, 2013). The surprisal metric has been applied in the fields of psycho- and neuro-linguistics and has shown that humans are indeed sensitive to prediction error during language comprehension, providing evidence for prediction as a core mechanism of incremental speech comprehension. See Levy (2008) for theoretical descriptions of information-theoretic metrics, Smith & Levy (2013) for the logarithmic approximation of human reading time, Frank et al. (2013, 2015) for the application of surprisal to modelling electroencephalography (EEG) data during sentence reading, and Willems et al. (2015) for the application of surprisal to modelling fMRI data during sentence listening. In this thesis, the information-theoretic (logarithmic) models are central to the univariate analysis of neural response amplitude, consistent with the abundant applications of the surprisal metric in the psycho- and neuro-linguistic literature (Roark, Bachrach, Cardenas & Pallier, 2009; Frank & Bod, 2011; Fossum & Levy, 2012; Smith & Levy, 2013; Monsalve, Frank & Vigliocco, 2012; Frank et al., 2013, 2015; Willems et al., 2015).