
2.4.1 Multiword Expressions

Since the primary unit in distributional semantics models is a space-delimited string, multiword expressions (MWEs) such as ‘sim card’ raise a unique problem. Although the constituent parts of the expression (here ‘sim’ and ‘card’) have representations of their own, there is no representation for the concept of the ‘sim card’. Intuitively, this is wrong: a ‘sim card’ (the card used in GSM phones) is qualitatively different from the ‘red card’ given in football matches, and neither is the same as the concept ‘card’ in ‘After the meeting she left him her card’. A challenge for distributional semantics models, therefore, is to widen their scope beyond the space-delimited sequence of characters and to identify and represent concepts that need more than one word to be expressed.

At the heart of Formal Semantics (Montague, 1970) is the principle that the meaning of a sentence (or a phrase in this instance) can be derived by a rule-governed combination of its constituents. In other words, we can combine these space-delimited atomic units into phrases and sentences using a productive set of rules. Assuming we know what these rules entail, this would provide a helpful framework within which we could derive semantic representations for phrases and sentences automatically. There have been a few attempts to bridge the well-studied field of formal semantics with distributional models (Beltagy, Chau, Boleda, Garrette, Erk & Mooney, 2013; Garrette, Erk & Mooney, 2014), mostly by enriching logical forms with distributional representations.

An alternative is to identify a set of algebraic operations that can be applied to the semantic representations (i.e. the vectors as derived above) as a proxy for different semantic phenomena such as compositionality, negation, quantification and so on. Various authors have explored this method in detail (Mitchell & Lapata, 2008, 2009, 2010; Polajnar, Rimell & Clark, 2014, 2015), yielding promising results for Compositional Distributional Semantics. In this strand of research, vector addition seems to yield the best results for combining words into phrases. Mikolov, Yih & Zweig (2013d) have independently reached a similar conclusion. In the present thesis, while we acknowledge the importance of the rule-based approach, we use vector addition as a way to extract semantic representations for multiword units. We also considered concatenating the elements of the expressions to form unigrams such as ‘simcard’. Considering that the number of multiword units used in the behavioural experiments was small (only 32 MWEs), this method could be feasible. However, the frequency of the bigrams (such as ‘id card’) was quite low in the corpora, yielding uninformative vectors (as shown by similarity tests). Moreover, this method masks the fact that, for example, ‘sim card’ and ‘id card’ are both cards in some sense, and participants might be aware of this fact during the experiment.
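The additive composition described above can be sketched in a few lines of Python. This is a minimal illustration only, not the code used in the thesis: the three-dimensional vectors are invented toy values, and the names `add_vectors`, `cosine` and `toy_vectors` are hypothetical.

```python
import math

def add_vectors(u, v):
    """Compose two word vectors by element-wise addition."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    return [a + b for a, b in zip(u, v)]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "sim":  [0.9, 0.1, 0.0],
    "id":   [0.1, 0.9, 0.0],
    "card": [0.2, 0.2, 0.9],
}

# Multiword units are represented as the sum of their constituents.
sim_card = add_vectors(toy_vectors["sim"], toy_vectors["card"])
id_card = add_vectors(toy_vectors["id"], toy_vectors["card"])

# The shared constituent 'card' pulls the two composed vectors together.
print(cosine(sim_card, id_card) > cosine(toy_vectors["sim"], toy_vectors["id"]))
```

With these toy values, the two composed vectors are more similar to each other than ‘sim’ and ‘id’ are on their own, which is the property that concatenation into unigrams such as ‘simcard’ would hide.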

2.4.2 Corpus choice and parameter spaces

All the above models can be trained using any linguistic corpus. For our English simulations, we chose the British National Corpus (BNC) as a qualitatively balanced and diverse alternative to the commonly used Usenet and Touchstone Applied Science Associates (TASA) corpora (see §4.3.2 for other languages). One advantage of the BNC is that it is large enough (ca. 100 million words), yet costly operations such as Singular Value Decomposition can still be completed in a relatively short time.¹⁰ The BNC comprises 4049 marked-up texts; it is a mixture of written texts from a variety of domains (comprising 90% of the corpus) and a smaller spoken component (ca. 10 million words). To make this corpus more DSM-friendly while incurring as little information loss as possible, we followed the standard practice in the field (Manning & Schütze, 1999), as well as suggestions by Rohde et al. (2006), in performing a series of clean-up steps. Specifically, we performed the following steps:

1. removal of XML markup from the BNC files
2. removal of all punctuation marks
3. removal of words over 20 characters in length
4. conversion to lower case
5. automatic spelling correction
6. splitting of hyphenated words

¹⁰Using the publicly available SVDLIBC library and Intel’s Math Kernel Library, training an LSA model on the BNC with 200 dimensions took approximately 1 h, but with 700 dimensions ca. 2.5 days.

We performed steps 1 to 3 and step 6 using a set of custom regular expressions. The details of the spelling correction algorithm are given in Rohde et al. (2006), and the implementation was by Peter Norvig.¹¹ The total number of word types, after discarding all words that appeared five times or fewer in the corpus, was 126097.
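The regular-expression steps and the frequency cut-off could look roughly like the following Python sketch. The actual expressions used for the BNC are not given in the text, so the patterns, the function names (`clean_text`, `build_vocabulary`) and the order of operations are assumptions. Spelling correction (step 5) is left out, and lower-casing (step 4) is included only for completeness.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")            # step 1: XML markup (assumed pattern)
HYPHEN_RE = re.compile(r"(?<=\w)-(?=\w)")  # step 6: word-internal hyphens
PUNCT_RE = re.compile(r"[^\w\s]|_")        # step 2: punctuation marks
MAX_WORD_LENGTH = 20                       # step 3: length cut-off

def clean_text(raw):
    """Apply steps 1-4 and 6 to a raw string and return a list of tokens."""
    text = TAG_RE.sub(" ", raw)
    # Hyphens are split before punctuation removal (an assumed ordering).
    text = HYPHEN_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    tokens = text.lower().split()
    # Discard words over 20 characters in length.
    return [t for t in tokens if len(t) <= MAX_WORD_LENGTH]

def build_vocabulary(tokens, min_count=6):
    """Keep word types seen at least min_count times (drops five or fewer)."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= min_count}

sample = '<s n="1"><w>A</w> well-known SIM-card, honestly!</s>'
print(clean_text(sample))
```

Replacing punctuation with a space rather than deleting it is a design choice made here to avoid gluing adjacent words together; it also splits contractions, which may or may not match the original pipeline.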

We obtained LSA, BEAGLE and COALS semantic vectors using the publicly available S-Space package.¹² As a normalisation procedure (where applicable), we used term frequency-inverse document frequency, as described in Table 2.3. Singular Value Decomposition was carried out using the SVDLIBC library by Doug Rohde.¹³ We also obtained Random Indexing vectors using a custom implementation.¹⁴ We obtained our neural embeddings using a custom version of the word2vec tool. Finally, for the HAL simulations, we used the HiDEx package (Shaoul & Westbury, 2010), which is a configurable implementation of HAL allowing for more control over the original parameters used by Lund & Burgess (1996). The parameter spaces explored for all these models are described in §B.2.
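To make the normalisation step concrete, the sketch below applies one common tf-idf variant to a small term-by-document count matrix. The exact formula used in the thesis is the one given in Table 2.3, which is not reproduced here, so the weighting below (term frequency as a proportion of document length, idf as log(N/df)) is an assumption, and the function name `tfidf` is hypothetical. The subsequent Singular Value Decomposition is not shown.

```python
import math

def tfidf(counts):
    """Weight a term-by-document count matrix (one row per term).

    Assumed variant: tf = count / document length, idf = log(N / df),
    where N is the number of documents and df the number of documents
    in which the term occurs.
    """
    n_docs = len(counts[0])
    doc_lengths = [sum(row[j] for row in counts) for j in range(n_docs)]
    weighted = []
    for row in counts:
        df = sum(1 for c in row if c > 0)
        idf = math.log(n_docs / df) if df else 0.0
        weighted.append([
            (c / doc_lengths[j]) * idf if doc_lengths[j] else 0.0
            for j, c in enumerate(row)
        ])
    return weighted

# Toy counts: 3 terms (rows) in 3 documents (columns), for illustration only.
toy_counts = [
    [2, 0, 1],
    [1, 1, 1],   # occurs in every document, so its idf is zero
    [0, 3, 0],
]
toy_weights = tfidf(toy_counts)
```

Under this variant, a term that occurs in every document receives a weight of zero throughout, whereas a term concentrated in a single document is weighted most heavily.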

¹¹Available at http://norvig.com/spell-correct.html.

¹²Available at https://github.com/fozziethebeat/S-Space/

¹³Available at http://tedlab.mit.edu/~dr/SVDLIBC/

¹⁴Available at https://github.com/dimalik/random_indexing

Discovering the unconscious representations

3.1 Introduction

The implicit learning phenomena introduced in §1.5 involve recovering the semantically motivated, underlying grammatical system that generated the set of stimuli and making generalisations from it. For now, let us assume that simpler explanations based on surface regularities, such as morpho-phonological patterns, cannot explain the generalisation gradients (we test this assumption in §6.2). Developing computational descriptions of implicit learning, therefore, involves using appropriate semantic representations that capture the effects observed in the behavioural studies. The appropriateness of representations in any context is far from a straightforward issue; take, for example, the experiment by Williams (2005), which we outlined in §1.5. If we construct semantic representations such that they only reflect a single semantic feature [±animacy], then any model would exhibit perfect generalisation. Hummel & Holyoak (2003) consider this one of the more severe issues in cognitive modelling, stating that for cognitive modelling to be a ‘truly’ scientific enterprise there should be a principled way of deriving representations as, otherwise, the modeller can bias the results in their favour.¹
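The point about a single [±animacy] feature can be illustrated with a toy simulation. The sketch below is purely hypothetical: the nouns, the labels `form_A` and `form_B`, and the nearest-prototype model are invented for illustration and are not the stimuli or the model of Williams (2005). It shows only that, once the representation already encodes the relevant distinction, even a trivial learner generalises perfectly to unseen items.

```python
def train_prototypes(examples):
    """Average the feature vectors seen with each label."""
    sums, counts = {}, {}
    for vector, label in examples:
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, x in enumerate(vector):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc]
            for label, acc in sums.items()}

def predict(prototypes, vector):
    """Return the label whose prototype is closest to the vector."""
    def sq_dist(prototype):
        return sum((a - b) ** 2 for a, b in zip(prototype, vector))
    return min(prototypes, key=lambda label: sq_dist(prototypes[label]))

# Hypothetical one-feature encoding: 1.0 = [+animate], 0.0 = [-animate].
animacy = {
    "lion": [1.0], "bee": [1.0], "monkey": [1.0],
    "cup": [0.0], "sofa": [0.0], "clock": [0.0],
}

training = [
    (animacy["lion"], "form_A"), (animacy["bee"], "form_A"),
    (animacy["cup"], "form_B"), (animacy["sofa"], "form_B"),
]
prototypes = train_prototypes(training)

# Items never seen in training are still classified without error.
held_out = {"monkey": "form_A", "clock": "form_B"}
correct = sum(predict(prototypes, animacy[w]) == y for w, y in held_out.items())
accuracy = correct / len(held_out)
```

Because the input feature coincides with the category to be learned, the accuracy on held-out items says nothing about the learner, which is why the choice of representation needs an independent justification.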

¹In their view, Hummel & Holyoak (2003, p. 247) consider any hand-coded representations problematic. While we agree in principle with this assertion, we do not consider representations such as those derived from the McRae norms as hand-coded in the present context. Although they do involve hand-coding, their scale and coverage render them appropriate descriptions of semantic memory.

In §1.3.1 we underlined that not all semantic representations are equivalent, as they encode both qualitatively and quantitatively different sorts of information. Word association norms provide sparse information on how words are recalled based on free association experiments. WordNet, on the other hand, provides dense information on how concepts are related based on their hierarchical relations. Not only can different models contain different semantic information, but we also observe this effect within the same model: Landauer & Dumais (1997) note that in LSA increasing the dimensionality of the vectors can lead to a lower fit with the behavioural results (we observe a similar effect in §3.4). Such parameters might therefore need tuning, depending on the needs of the behavioural dataset, to provide better results. As such, they are left free and fitted on a particular dataset.

The objective of the present chapter is to outline and find a principled way of deriving appropriate semantic representations for modelling tasks of semantic implicit learning. On the face of it, this seems like a trivial task: assuming we can construct a computational model for the SIL tasks (see §5.3.1), we can use representations derived from different methods as input to the model and compare its generalisation patterns to the behavioural data. However, there are two problems with this approach. Firstly, there are only a few semantic implicit learning experiments, and they use different manipulations and cover tasks, making such a meta-analysis harder to perform. Secondly, the number of stimuli in these experiments is quite small, which increases the chances of the model overfitting the data.

We find the solution to these issues by making the further assumption that participants do not, at least consciously, activate their semantic knowledge during these tasks. An implication of this hypothesis is that whatever effect we observe should be a function of how words are organised by default in the mind. Take the semantic priming paradigm (Meyer & Schvaneveldt, 1971) as an example. It has been well established that processing a word has a greater facilitative effect on the subsequent processing of a semantically related word (e.g., claw→cat) than on that of an unrelated one (e.g., calendar→cat). As we will explain below, the automaticity of this effect has led researchers to believe that semantic memory is structured in such a way that, by default, ‘evidence’ and ‘trace’ are placed closer together. We can now contrast this behaviour with arbitrary tasks of categorisation (e.g., Barsalou, 1983), where participants are asked to find concepts relating to a particular scenario (e.g., ‘things to take with you in the case of fire’). While humans can carry out the task, giving responses as diverse as ‘children’, ‘dog’ and ‘blanket’, their reaction times as well as the variability in the responses prompt us to think that this is not how concepts are organised in the mind.² To this end, in this chapter, we focus on deriving the best semantic representations based on tasks of semantic priming, in the hope that they will let us better model semantic implicit learning.
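One simple way to relate a set of semantic vectors to the priming contrast above is to check whether a target is closer to its related prime than to an unrelated one. The sketch below is an assumption about how such a comparison might be set up, not the evaluation procedure of this chapter; the vectors are invented toy values and `relatedness_advantage` is a hypothetical helper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def relatedness_advantage(vectors, target, related_prime, unrelated_prime):
    """Similarity to the related prime minus similarity to the unrelated one."""
    related = cosine(vectors[related_prime], vectors[target])
    unrelated = cosine(vectors[unrelated_prime], vectors[target])
    return related - unrelated

# Toy vectors, invented for illustration only.
toy = {
    "claw":     [0.8, 0.5, 0.1],
    "cat":      [0.9, 0.4, 0.0],
    "calendar": [0.0, 0.2, 0.9],
}

advantage = relatedness_advantage(toy, "cat", "claw", "calendar")
```

A positive value means the representation places the related pair closer together, in line with the direction of the behavioural priming effect; how well the size of such differences tracks reaction-time data is an empirical question for each model.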

²This is not to say that ad hoc categories cannot become ‘common’ ones as frequency plays a major role in their formation (e.g., someone consistently sets their house on fire, so they form the corresponding category) (Barsalou, 1983, p. 244). However, the frequency argument only supports our thesis as within distributional models of semantics words are distributed in space according to the frequency of co-occurrence.

In document Understanding Semantic Implicit Learning through distributional linguistic patterns: A computational perspective (Page 78-83)