General Discussion for Studies 1–4

The primary goal of the four studies presented above was to find the best approximation of the underlying organisation of concepts in semantic memory. In other words, we sought to determine how semantic information can be represented in the absence of conscious activation of semantic knowledge. Because semantic information can be represented in multiple ways, we tested several kinds of models, each of which makes different predictions about the structure of semantic memory. The three primary classes of models we considered were (a) association norms obtained from humans, (b) a large semantic database (WordNet), and (c) word representations formed by exploiting the statistical information inherent in the linguistic environment. Across four studies of reaction times in different semantic tasks, we find that projecting contextual information about each word into a high-dimensional space accounts for the patterns in the behavioural datasets. Concretely, these studies examine reaction times in semantic priming tasks directly (Study 1), reported priming effects from studies of mediated priming (Study 2), and the effect of semantic neighbourhood density (as measured in these models) on lexical decision reaction times in English (Study 3) and in Chinese, Dutch, French, and Malay (Study 4).
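To make the notion of semantic neighbourhood density concrete, the following is a minimal sketch. It assumes, purely for illustration, that density is the mean cosine similarity between a word and its k most similar words in the model's space; the toy vectors, the value of k, and the function name `neighbourhood_density` are hypothetical and are not taken from the studies above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def neighbourhood_density(word, vectors, k=2):
    """Mean cosine similarity between `word` and its k most similar words (assumed measure)."""
    sims = sorted(
        (cosine(vectors[word], vec) for other, vec in vectors.items() if other != word),
        reverse=True,
    )
    top = sims[:k]
    return sum(top) / len(top)


# Hypothetical 3-d vectors, hand-picked for illustration only.
vectors = {
    "dog": [1.0, 0.9, 0.1],
    "cat": [0.9, 1.0, 0.1],
    "wolf": [1.0, 0.8, 0.2],
    "tax": [0.0, 0.1, 1.0],
}

dense = neighbourhood_density("dog", vectors)   # close neighbours: cat, wolf
sparse = neighbourhood_density("tax", vectors)  # no close neighbours in this toy space
```

Under this assumed measure, a word such as ‘dog’ sits in a denser neighbourhood than ‘tax’, and a value of this kind is what could enter a regression on lexical decision reaction times.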

Apart from the main result, another important outcome of the above is the inability of dense representations containing concept-concept relations to capture semantic priming effects. There are a few reasons why this might be the case. Firstly, semantic priming effects are quite hard to obtain without some degree of association between the two words (see Lucas, 2000, for a thorough review of the topic). In this vein, it is not surprising that WordNet fails to provide a significant predictor in the regression models. Secondly, we need to consider what information these representations contain. We have already argued that, one way or another, DSM representations carry information about the distribution of each word in linguistic usage. WordNet, by contrast, contains rich information about the target concept’s locus in a domain independent of language. Retrieving that sort of information from memory in a speeded naming task is, however, a more laborious process than exploiting the surface statistical regularities provided by the language. Consistent with this, WordNet fits improve significantly in the longer SOA condition. This interaction suggests that such ‘deep’ semantic information might require more time to retrieve than surface-level regularities (see also Till et al., 1988, for a thorough exploration of the time course of semantic priming).

However, why do the neural embeddings work better in these tasks? On one level, we argue that this happens because they can model both syntagmatic and paradigmatic relations (§§ 2.3.1 and 3.6). This property places them closer to BEAGLE, which is the only model that combines both sources of information, although it performs worse than the embeddings. We argue that this mismatch in performance is a result of the curse of dimensionality, from which count models (such as BEAGLE) suffer, and which predictive models trivialise. In short, the problem relates to how the model assigns probabilities to novel contexts. By definition, for any DSM the ideal representation $\hat{w}$ will be the one that maximises the following quantity:

$$\hat{w} = \arg\max_{\theta} \sum_{j \in C_i} P(w_i \mid c_j, \theta),$$

where $w_i$ is the target word, $C_i$ the contexts in which it can be encountered, and $\theta$ the model parameters. The non-neural models estimate this probability by counting the occurrences of $w_i$ in all contexts of $C_i$, whether $C_i$ is defined as the documents in which $w_i$ appears or as a window around every occurrence of $w_i$. The problem with this approach is that there exist contexts in which the occurrence of a word would be legal, although not ‘seen’ in a particular corpus. A consequence of this is that the representations formed by these models will be further away from $\hat{w}$, as they cannot account for such contexts.
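The zero-count problem described above can be illustrated with a minimal sketch. It assumes a two-sentence toy corpus, a symmetric window of two tokens, and a simple relative-frequency estimate; none of these choices corresponds to the exact estimator of BEAGLE or of any other model discussed here, and the function names are hypothetical.

```python
from collections import Counter

# Toy corpus (hypothetical); each sentence is a list of tokens.
corpus = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "brown", "dog"],
]


def window_counts(sentences, target, window=2):
    """Count how often each word occurs within `window` tokens of `target`."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            # Add every token in the window except the target occurrence itself.
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts


def relative_frequency(counts, context):
    """Share of the target's co-occurrences that involve `context` (assumed estimator)."""
    total = sum(counts.values())
    return counts[context] / total if total else 0.0


brown = window_counts(corpus, "brown")
p_fox = relative_frequency(brown, "fox")    # seen context: non-zero estimate
p_wolf = relative_frequency(brown, "wolf")  # legal but unseen context: exactly zero
```

Because ‘wolf’ never co-occurs with ‘brown’ in this corpus, any estimate built purely from counts treats that context as impossible, however plausible it is linguistically.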

Neural embeddings, on the other hand, overcome this problem by treating $P(w \mid c)$ as a continuous probability distribution instead of a discrete one. The consequence of this is that even if $w$ has not been encountered in a particular $c$, the neural network would be able to impute a non-zero value, instead of directly discarding $c$. For example, consider that we encounter the word ‘brown’ in the corpus in the following context: ‘the quick brown fox’. To estimate the representation for the word ‘brown’, the model would have to learn the parameters that maximise the following quantities: $\sigma(e_{\text{brown}} \cdot c_j)\ \forall j \in \{\text{the}, \text{quick}, \text{fox}\}$. Since the vectors at this stage are not normalised, each dot product lies in the interval $(-\infty, \infty)$; hence, passing this value through the sigmoid function maps it into $(0, 1)$, with training pushing it towards 1 (if $c_j$ is a true context word) or towards 0 (if $c_j$ is just noise) (see §2.3.1 for more).¹⁴ Consider now that the model encounters the unseen sentence ‘The quick brown wolf’; models based on counting would assign zero probability to this context. However, the neural embeddings would still consider this a high-probability context for the word ‘brown’, as the only thing that is different is the dot product $e_{\text{brown}} \cdot c_{\text{wolf}}$. The advantage of this approach is that if $\cos(c_{\text{wolf}}, c_{\text{fox}}) \approx 1$ (i.e., if the words ‘wolf’ and ‘fox’ are closely related), then the model can go beyond the given contexts of ‘brown’ and generalise to ‘unseen’ contexts.
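The generalisation argument can be made concrete with a small numerical sketch. The three-dimensional vectors below are hand-picked and hypothetical; they do not come from any trained model, and the unrelated context word ‘taxes’ is introduced only for contrast.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Hypothetical, unnormalised vectors chosen by hand for illustration.
e_brown = [2.0, 1.0, 0.0]         # target-word vector
contexts = {
    "fox": [1.5, 1.0, 0.2],       # context seen with 'brown'
    "wolf": [1.4, 1.1, 0.3],      # never seen with 'brown', but close to 'fox'
    "taxes": [-1.5, -1.0, 0.5],   # unrelated 'noise' context
}

# Score each candidate context word for the target 'brown'.
scores = {word: sigmoid(dot(e_brown, c)) for word, c in contexts.items()}
sim_wolf_fox = cosine(contexts["wolf"], contexts["fox"])
```

In this toy setting the score for the unseen context ‘wolf’ stays close to that of the seen context ‘fox’ because their context vectors are nearly parallel, whereas the unrelated context receives a score near zero.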

The performance of the neural embeddings on the task is also encouraging in the context of the discussion in §1.3. We argued there that if connectionism provides an account of human learning processes, then neural embeddings might be relevant to how the semantic system is ‘bootstrapped’. Such neural networks learn semantic-like information only by exploiting statistical patterns in the input using simple (and general) mechanisms. In this way, the ‘semantic’ knowledge we are interested in here is nothing more than knowledge of the statistical regularities in the environment learnt during language processing. That is not, of course, to say that other sorts of semantic knowledge are not present in the human semantic system. This latter point is why we use the word ‘bootstrap’ above instead of saying that this is how the semantic system is structured. In §7.2, we explore this notion further, testing how this tacit knowledge of statistics can be used to learn ‘deeper’ semantic relations.

We argue in §1.3.1 that there are numerous ways of representing meaning, and finding the most appropriate description of it has been called the ‘holy grail’ of a variety of scientific disciplines (Jackendoff, 2002; Kiela, Bulat, Verő & Clark, 2016). The studies presented above suggest that the distributional patterns of words in the language can account for the priming effects obtained in behavioural experiments. However, such semantic distributional patterns do not exist solely in linguistic contexts. Perceptual information from visual (Berzak, Barbu, Harari, Katz & Ullman, 2015), auditory (Kiela & Clark, 2015), and olfactory (Kiela, Bulat & Clark, 2015a) routes contributes significantly to creating richer semantic representations. Perhaps, therefore, a complete account of the representations used in these studies would be a fusion of the possible routes that contribute to meaning (see, e.g., the present Shallice, 1988).

The last question that remains from the present chapter regards the value of these representations (i.e., what they are ‘good’ for). After all, as we have remarked numerous times so far, for any task, the optimal representations will have to be ‘tuned’ to that task to capture any nuances in the data. Following this argument, the above exploration can give us, at best, the optimal representations for tasks of semantic priming, which is not what we set out to explore in the beginning. In §3.1, we note that there is a close connection between tasks of semantic priming and those of semantic implicit learning. In both cases, participants have to make semantic decisions without consciously activating their semantic knowledge (or, at least, without considering it relevant for the task). From this, we argued that whatever influence semantic knowledge has on these tasks must be exerted from the distribution of concepts in the semantic space rather than ad-hoc conscious categorisations. Chapter 5 explores this notion further, using these representations as input to computational models that simulate the behaviour of participants in SIL tasks.

Distributional Semantics Approach to Implicit Language Learning

5.1 Introduction

The results obtained in Chapters 3 and 4 present the encouraging view that the distributional patterns of words capture elements of the organisation of concepts in the semantic space. In §3.2, we argue why semantic priming effects give a better view of how human semantic memory is organised. We reach the above conclusion by comparing the predictions of DSMs against human performance in tasks of semantic priming and lexical decision. We then generalise these results beyond English by testing whether the same predictions hold in Chinese, Dutch, French, and Malay. Furthermore, in §3.1 we argue why the representations used in modelling tasks of semantic priming are relevant to modelling tasks of semantic implicit learning. In this section, we extend the above findings, arguing that the distributional patterns of words can also predict what can be learned implicitly in the tasks presented in §1.5. Since DSMs can capture elements of the semantic space of speakers of different languages, we also examine datasets of SIL tasks using English and Chinese as the main language.

In document Understanding Semantic Implicit Learning through distributional linguistic patterns: A computational perspective (Page 139-143)