
5.8 Discussion for Studies 5–8

5.8.1 A theory of Semantic Implicit Learning

The results obtained in this chapter suggest that semantic implicit learning need not be considered semantic at all. The patterns observed in the behavioural datasets can be recovered from the distributional characteristics of the stimuli used in the experiments rather than from abstract semantic features. Indeed, looking at the 2AFC tasks alone (i.e., excluding Leung & Williams, 2014, which uses a speeded reaction time task), we achieve a strong correlation (ρ = .86, p < .001) between the model estimates and the behavioural performance (Fig. 5.10). Notably, we achieve this fit to human performance without directly fitting our models to the human results. In other words, the models are not trained to match human performance in the SIL tasks, but to predict semantic priming. We only fit their estimates post-experimentally, and even then the generalisation gradients do not show much variance (they usually level off after a few dozen epochs).
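As a minimal sketch of how a rank correlation of this kind can be computed, the snippet below implements Spearman's coefficient in plain Python. It assumes that ρ above denotes Spearman's rank correlation, which the text does not state explicitly. The two lists are hypothetical placeholder values, not the data behind Fig. 5.10, and the function names are illustrative.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical placeholder values (NOT the thesis data): one entry per
# experimental condition, model estimate vs. mean human accuracy.
model_estimates = [0.52, 0.61, 0.58, 0.70, 0.66]
human_accuracy = [0.50, 0.57, 0.59, 0.68, 0.63]

rho = spearman_rho(model_estimates, human_accuracy)
print(f"rho = {rho:.2f}")
```

Because the coefficient depends only on ranks, it asks whether conditions that the model finds easier are also easier for participants, without requiring the two scales to match.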

Linguistic relativity?

Insofar as we consider the semantic representations constructed by the DSMs as a proxy for the true semantic representations, it might follow that speakers of different languages conceptualise the world around them slightly differently. However, classic work in cognitive psychology has determined that, despite surface linguistic differences, humans conceptualise the world in a very similar way. As we have argued above (§1.5), in the domain of colour perception, Berlin & Kay (1969), using data from 20 different languages, identified a ‘universal’ evolutionary pattern in colour naming. For example, all languages contain terms for black and white; subsequently, the order in which colours acquire a specific term in the language is red, green, yellow and so on. Similarly, Eleanor Rosch (Heider, 1972; Heider & Olivier, 1972) found that even though the Dani people in Papua New Guinea lacked terms for any colour other than dark and light (cf. black and white), they were able to categorise objects by colours for which they had no word. These results led Rosch to assert that it is not the structure of each language that determines conceptual organisation but a pressure for efficiency, on the one hand, and common world knowledge, on the other (but see some recent results from Cibelli, Xu, Austerweil, Griffiths & Regier, 2016).

In this light, our proposal that specific distributional patterns might give rise to different conceptual structures in speakers of different languages might seem problematic. Indeed, in the limit, this proposal predicts that speakers of different languages will have distinct semantic spaces and process incoming input differently. While this topic is quite contentious in cognitive psychology (Brody, Gumperz & Levinson, 1998) and any proposal would be met with severe criticism from the other side, we find two middle-ground solutions to this problem. Firstly, as we noted above, although speakers of different languages largely share a semantic space, minor effects attributable to distributional knowledge can still be found. Secondly, in §7.2, we outline a study in which we transform distributional representations into semantic representations containing information about semantic relations. Under this account, distributional knowledge might simply provide a different starting point for speakers of different languages, which is then refined through exposure to the physical world and ends up in a similar semantic space.

A role for phonology

As noted at the beginning of this chapter, our method does not take into account the phonological representations of the words in question. This is because we encode each word as a one-hot vector (i.e., a localist representation), making it orthogonal to every other word. While we argued that this should not be an issue in the present context, where we look only for the contribution of distributional semantics to the implicit learning tasks, it remains an open question whether a model that looked only at phonological information would be able to explain the results. One can imagine that upon encountering ‘gi dog’, ‘gi drill’, ‘gi dark’ a participant might use such cues to limit their search space and arrive at an ad hoc categorisation such as ‘things that start with the letter d’ (see Tenenbaum, 1999, for an example in number learning). From a modelling perspective there are various ways to deal with this issue. Firstly, following connectionist models of word reading (Seidenberg & McClelland, 1989), one can imagine that the hidden layer receives its input from two distinct streams (one semantic and one phonological) and that its output activation pattern depends on a (non-)linear combination of the two. Secondly, one could use phonologically motivated input vectors in which each unit represents a particular phonological unit (McClelland & Elman, 1986), or input vectors sampled from different distributions (Jones & Mewhort, 2007). In the behavioural domain, there have been recent contributions on the interaction between semantic and phonological knowledge (Ouyang, Boroditsky & Frank, 2016). While these are reasonable objections and deserve a more rigorous examination, in the present chapter we are only interested in the contribution of semantic knowledge in such tasks, leaving an exploration of phonological effects for §6.2.
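The orthogonality point can be made concrete with a small sketch. The code below is illustrative only and is not the model used in this chapter: the four-word vocabulary, the `onset_vector` helper and the use of the initial letter as a crude stand-in for phonological form are all assumptions introduced here. It shows that one-hot codes give every pair of distinct words zero overlap, whereas adding a second input stream lets ‘dog’ and ‘drill’ share a feature.

```python
def one_hot(index, size):
    """Localist code: a single active unit, all others zero."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def dot(u, v):
    """Dot product; zero means the two vectors are orthogonal."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy vocabulary (assumption, not the experimental stimuli).
vocabulary = ["dog", "drill", "dark", "cat"]
localist = {w: one_hot(i, len(vocabulary)) for i, w in enumerate(vocabulary)}

# Hypothetical second stream: one unit per initial letter, used here as a
# crude stand-in for a phonological onset feature.
onsets = sorted({w[0] for w in vocabulary})


def onset_vector(word):
    return one_hot(onsets.index(word[0]), len(onsets))


def two_stream_input(word):
    """Concatenate both streams; a hidden layer would combine them."""
    return localist[word] + onset_vector(word)


print(dot(localist["dog"], localist["drill"]))                  # localist overlap
print(dot(two_stream_input("dog"), two_stream_input("drill")))  # shared onset
print(dot(two_stream_input("dog"), two_stream_input("cat")))    # no shared onset
```

Concatenating the streams at the input is only one simple option; the two-stream proposal in the text leaves open how the hidden layer combines them.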

Is this the whole picture?

A related question concerns the limits of the learnability of such systems. Could this mean that any semantic distinction can be learnt? Potentially yes, but with considerable constraints. Firstly, the introductory discussions for every study assert that the semantic distinction has to be somehow reflected in linguistic usage. This property quickly constrains the space of possibly learnable systems, as there is a limited number of ways words can be combined in a language to yield different distributional patterns. However, the results given here are consistent with the idea that novel semantic distinctions can be formed. Since the network constantly learns from its input, if we transform the input in such a way as to highlight novel semantic distinctions, these should, in theory, be learnable. Recall at this point the discussion in §3.1, where Barsalou (1983) recognises the possibility that, through usage, ad hoc categories (e.g., things to take in the event of fire) come to constitute a more natural category for a particular speaker. Our results support this notion, as repeated sentences in a corpus that contain the objects one takes in the case of fire will bring the corresponding concepts closer together in the semantic space, resulting in faster processing and learning.

Secondly, the results from the English speakers in §5.7 show that the semantic distinction to be learnt has to preclude more specific and more intuitive hypotheses, as these might bias the learner away from the ‘true’ distinction. Tenenbaum & Griffiths (2001) explain this behaviour in terms of Bayesian priors. In short, the learner will prefer the hypothesis that (a) fits the data best, and (b) is more intuitively probable. While how we define intuition (or prior probabilities in the Bayesian context) is a contentious issue in cognitive science (e.g., the discussion between Bowers & Davis, 2012 and Griffiths et al., 2012), for the English speakers ‘being long’ might be a less probable categorisation than ‘being an insect’, even though both might fit a portion of the data in the same way. Considering the correlational structure of concepts in the world (Rosch, 1978), it might be hard to construct such an artificial category in which the subclusters are not more probable.
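The trade-off between fit and intuitive plausibility can be illustrated with Bayes' rule, under which the posterior for a hypothesis is proportional to its likelihood times its prior. The sketch below is a toy illustration rather than the model of Tenenbaum & Griffiths (2001); the prior and likelihood values are hypothetical numbers chosen for the example, and `posterior` is an illustrative helper.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over a finite hypothesis set: P(h | d) is proportional
    to P(d | h) * P(h), normalised so the posteriors sum to one."""
    unnormalised = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalised.values())
    return {h: value / total for h, value in unnormalised.items()}


# Hypothetical values (assumptions for illustration only).
# Both hypotheses fit the observed examples equally well ...
likelihoods = {"is an insect": 0.5, "is long": 0.5}
# ... but one of them is assumed to be intuitively more probable.
priors = {"is an insect": 0.7, "is long": 0.3}

post = posterior(priors, likelihoods)
for hypothesis, probability in post.items():
    print(f"{hypothesis}: {probability:.2f}")
```

When the likelihoods are equal, the posterior simply reproduces the prior, so the more intuitive hypothesis wins even though the data do not favour it.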

Do all the ‘learners’ learn in the same way?

The final question we examine in this section is whether all the simulated participants follow the same learning path during each experiment. Here we do not examine how individual differences stemming from L1 biases might affect participants’ framing of the computational problem. In the applied linguistics literature, this sort of individual difference concerns whether participants notice the relevant variables (animacy and the determiner) or not.¹² In the present case, we explore differences in performance that stem from different initialisations of the network weights, the randomised order of presentation, and the dropout procedure outlined in §5.3.1. All these factors can be considered as noise during the experiment. Human learners might perceive the input noisily; they might forget or misremember a particular example. We therefore ask whether the learners converge to the same output despite these differences, or whether random factors determine their performance.
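The three sources of noise listed above can be sketched as follows. This is a schematic illustration, not the procedure of §5.3.1: the function name, the weight range, the number of hidden units and the dropout rate are all assumptions, and the dropout mask is sampled once here whereas dropout would normally resample it at every training step.

```python
import random


def simulate_learner_noise(seed, items, n_hidden=4, dropout_rate=0.5):
    """Draw the random factors that distinguish one simulated learner
    from another (all parameter values are illustrative assumptions)."""
    rng = random.Random(seed)
    # 1. Initial weights differ from learner to learner.
    weights = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
    # 2. Each learner sees the training items in a different order.
    order = items[:]
    rng.shuffle(order)
    # 3. Dropout silences a random subset of hidden units (0 = dropped).
    dropout_mask = [0 if rng.random() < dropout_rate else 1
                    for _ in range(n_hidden)]
    return {"weights": weights, "order": order, "dropout_mask": dropout_mask}


# Hypothetical training items (placeholders, not the experimental stimuli).
training_items = ["item_1", "item_2", "item_3", "item_4", "item_5"]

learner_a = simulate_learner_noise(seed=1, items=training_items)
learner_b = simulate_learner_noise(seed=2, items=training_items)

print(learner_a["order"], learner_a["dropout_mask"])
print(learner_b["order"], learner_b["dropout_mask"])
```

Fixing the seed makes each simulated learner reproducible, while varying it across learners yields the kind of between-learner variability examined in this section.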

Figure 5.11 plots the error and generalisation gradients of five simulated learners from the Paciorek & Williams (2015, High similarity) dataset. Interestingly, the results show that not all participants achieve the same level of performance in the experiments. Figure 5.11a plots the summed cross-entropy error for each participant by semantic distinction. Looking at the gradients, we see that some participants find it easier to learn either concrete or abstract concepts, whereas others might not learn at all. These error patterns also seem to extend to the generalisation rates in the testing phase. Figure 5.11b plots the unnormalised activation of the grammatical alternatives for abstract and concrete concepts. These data show that the participants can generalise to the extent that they managed to ‘learn’ the training set. Learner #4, for example, did not learn anything during training and performed completely at chance during testing (recall that these are the unnormalised activations). Participant #1, on the other hand, generalises better to concrete concepts, which they managed to learn better during training.

¹²An important side note here is that even if participants notice the relevant variables, the learning can still be considered implicit as they do not notice the relation between them.

These results highlight the fact that even when participants frame the task in an optimal way (i.e., retrodiction), achieving good levels of generalisation is still quite hard. Many issues can arise, such as errors during retrieval or the effect of randomisation, that might prevent participants from achieving high performance. In §7.3 we discuss the latter to some extent in the light of curriculum learning. In short, curriculum learning (Tsvetkov, Faruqui, Ling, MacWhinney & Dyer, 2016; also see Elman, 1993, for an early connectionist overview) assumes that there is an optimal path during training that aids learning. The idea is that if there is a target rule to be learnt, receiving random examples might prevent the model/learner from abstracting the relevant information. If, on the other hand, the training stimuli are administered in a way that enables abstraction of the relevant information, then the learner can achieve higher levels of generalisation. Our results are consistent with this view, which can potentially be explored further in future research to optimise the learners’ input.
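The contrast between random presentation and a curriculum can be sketched in a few lines. This is a deliberately simple reading of the idea (present easier examples first) and not the method of Tsvetkov et al. (2016) or Elman (1993); the example labels, the difficulty scores and both function names are hypothetical.

```python
import random


def random_order(examples, seed):
    """Baseline: examples are presented in a shuffled order."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    return shuffled


def curriculum_order(examples, difficulty):
    """Curriculum: examples are presented from easiest to hardest,
    according to an assumed difficulty score."""
    return sorted(examples, key=difficulty)


# Hypothetical (label, difficulty) pairs; how difficulty should be defined
# for the stimuli in this chapter is left open here.
examples = [("example_a", 0.9), ("example_b", 0.2),
            ("example_c", 0.6), ("example_d", 0.4)]

print(random_order(examples, seed=0))
print(curriculum_order(examples, difficulty=lambda pair: pair[1]))
```

Training the same network on both orderings and comparing the resulting generalisation gradients would be one way to test whether the order of presentation accounts for part of the between-learner variability.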