Method - Understanding Semantic Implicit Learning through distributional linguistic patterns: A

The methodological details outlined here concern the studies described in §§ 3.4, 4.4 and 4.5.

3.3.1 Model Selection

A drawback of the above models is that they introduce a substantial number of (potentially interacting) hyperparameters that need to be tuned. Hyperparameter tuning is a laborious task and, if done improperly, it risks settling at some local optimum (i.e., a region of the parameter space where changing just one or two parameter values would not improve the fit). To illustrate how laborious this task can be, consider that the neural embeddings model we use (word2vec) has 8 hyperparameters.⁶ Even if each hyperparameter could take only one of two values (i.e., were a binary variable), the number of potential parameter sets (and consequently models) would be $2^8 = 256$.⁷ Since 2 is the lower bound on the number of different values a variable can take (most of the parameters are real numbers), the number of possible parameter sets to consider grows prohibitively large for grid search (i.e., iterating over the possible parameter sets in a ‘loop’). To alleviate most of this problem, we use Bayesian Optimisation (Snoek, Larochelle & Adams, 2012) to find the best parameter set, a technique that has been found very useful in finding the parameters of DSMs (e.g., Alikaniotis et al., 2016).
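The growth of the grid can be made concrete with a short sketch. It assumes only what the text states (8 hyperparameters); the number of values tried per hyperparameter is a free choice, and the values 3 and 5 below are purely illustrative.

```python
from itertools import product

# Assumption taken from the text: word2vec is tuned over 8 hyperparameters.
# Their actual names and ranges are given in §B.2 and are not reproduced here.
N_HYPERPARAMETERS = 8


def grid_size(values_per_hyperparameter: int, n: int = N_HYPERPARAMETERS) -> int:
    """Number of parameter sets a full grid search would have to train."""
    return values_per_hyperparameter ** n


# Enumerating the binary grid explicitly gives the same count as the formula.
binary_grid = list(product([0, 1], repeat=N_HYPERPARAMETERS))

for k in (2, 3, 5):
    print(k, "values per hyperparameter:", grid_size(k), "models")
```

Even a modest grid of five values per hyperparameter already requires several hundred thousand models, which is why exhaustive search is impractical here.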

⁶ Simulation details can be found in §B.2.

⁷ To put this number into context, training one DSM in the present setting takes on average one hour, so doing a full pass over all 256 possible models would take about 11 days.

The Bayesian Optimisation algorithm constructs a generative probabilistic model for the parameter space $Z$ and then exploits this model to find regions in the space which maximise some internal ‘fit’ function (i.e., $p(y \mid x, Z)$, where $y$ is the model ‘fit’, $x$ is the parameter vector and $Z$ the parameter space). One added advantage of Bayesian Optimisation is that the choice of the next step does not rely only on the last evaluation (as in Markov Chain Monte Carlo algorithms) but on all the previous steps. The expense of the added computation, however, is mitigated by the speed with which Bayesian Optimisation (BO) converges to the best possible results. Since each DSM takes significant time to run, BO provides a way to explore many parameter sets in the shortest amount of time.
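The idea that every new evaluation is chosen in light of all previous ones can be sketched in a few lines. The following is a minimal illustration, not the configuration used in the experiments: it assumes a Gaussian-process surrogate with a squared-exponential kernel, an upper-confidence-bound acquisition rule, a single hyperparameter on a fixed candidate grid, and a cheap toy function (`toy_fit`, hypothetical) standing in for training a DSM.

```python
import math


def rbf(a, b, length_scale=0.2):
    """Squared-exponential kernel between two scalar parameter values."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def posterior(xs, ys, x_new, noise=1e-6):
    """Gaussian-process posterior mean and standard deviation at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * w for k, w in zip(k_star, solve(K, ys)))
    var = rbf(x_new, x_new) - sum(k * v for k, v in zip(k_star, solve(K, k_star)))
    return mean, math.sqrt(max(var, 0.0))


def bayes_opt(objective, candidates, initial, n_steps=8, kappa=2.0):
    """Choose each next point from a surrogate fitted to ALL past evaluations."""
    xs = list(initial)
    ys = [objective(x) for x in xs]
    for _ in range(n_steps):
        remaining = [c for c in candidates if c not in xs]

        def ucb(c):
            mean, std = posterior(xs, ys, c)
            return mean + kappa * std  # favour high predicted fit or high uncertainty

        x_next = max(remaining, key=ucb)
        xs.append(x_next)
        ys.append(objective(x_next))
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], xs


def toy_fit(x):
    """Hypothetical stand-in for 'train a DSM and return its fit'."""
    return -(x - 0.3) ** 2


candidates = [i / 20 for i in range(21)]
best_x, best_y, history = bayes_opt(toy_fit, candidates, initial=[0.0, 1.0])
print(best_x, best_y)
```

The surrogate is refitted at every step, which is the added computation mentioned above; it pays off only because each real evaluation (an hour of training) is far more expensive than the surrogate update.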

What remains is to define an objective for the Bayesian Optimisation algorithm (that is, what it will maximise). Let us take a step back and consider the task for a moment. If we are to model reaction times or priming effects (i.e., $RT_{unrelated} - RT_{related}$, where RT is the reaction time), we want to find a set of predictor variables, such as word frequency, orthographic length, etc., that maximise the fit of the regression model to the data. Hutchison et al. (2008) performed hierarchical linear regression, entering variables relating to either prime or target characteristics or some measure of their relation (e.g., forward association strength as obtained by the Nelson norms, see §1.3.1). Using this set of variables they were able to derive a set of predictors that best predict semantic priming effects at the item level (the same set of predictors was included in the SPP dataset). A simple proposal could, therefore, be to include those predictors which are predictive of priming effects, adding the covariate for the semantic model used. Alternatively, we can add only the semantic model covariate to the model and then assess the fit.

Both of the above proposals face problems. Firstly, if we ignore the rest of the variables when we construct the objective function, the best parameters for the DSM might turn out to be a function of some other ‘simpler’ covariate such as word frequency. To see how this might be possible, consider the word vectors in Table 2.1, before any normalisation procedure. The magnitude of each vector is an approximate function of the frequency of the corresponding word. If, therefore, we leave the normalisation procedure as a parameter to fit, we might end up with a ‘hard-to-beat’ model that has merely learned a way to estimate word frequency. Secondly, if we include the variables used by Hutchison et al. (2008) together with the semantic model estimates, we might use a more complicated model than warranted, giving rise to misleading (overfitted) $R^2$ values.

We mitigate the above issues by performing variable selection on the dataset before training the semantic models. Concretely, using several psycholinguistic variables to be detailed below (cf. §3.3.2), we perform several feature selection procedures to find the best baseline model for the naming task. Once we derive the best possible model, we extract the design matrix (i.e., the predictors) and attach the estimates of the semantic model. Three questions remain unanswered from the above: 1) how do we eliminate variables? 2) what do we mean by ‘fit’ in the present context? and 3) how do we decide on the best possible model?

Regarding the first question, we avoid using stepwise regression, which, although favoured by researchers in the field (Buchanan et al., 2001), can give rise to erroneous beta values and is biased towards returning high $R^2$ values (Tibshirani, 2011). To this end, we chose to use lasso regression, which constrains the $\ell_1$ norm of the beta values (the regressor weights) to be lower than some threshold $s$, solving:

$$\arg\min_{A} \sum (y - A \cdot X^{\top})^2 \quad \text{s.t.} \quad \lVert A \rVert_1 \le s \qquad (3.1)$$

where $y$ is the dependent variable (the priming effects), $A$ the vector of regressor weights, $X$ the matrix of independent variables (the design matrix), $\lVert A \rVert_1$ the $\ell_1$ norm of the weights, and $s$ a threshold value constraining the magnitude of the weights. The advantage of this approach is that by choosing a low value for $s$ (using cross-validation we determined $s = 0.1$), irrelevant beta values are going to be very close to zero, effectively being cancelled out.
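How the $\ell_1$ constraint cancels out weak predictors can be illustrated with a small coordinate-descent sketch. Note the swap: the code solves the penalised (Lagrangian) form of the lasso, with a penalty weight `lam`, rather than the constrained form with threshold $s$ given in the text; the two forms correspond for a matching pair of values, but `lam=0.1` below is only illustrative and is not the same quantity as $s = 0.1$. The data are hypothetical.

```python
def soft_threshold(value, lam):
    """Shrink value towards zero by lam; return exactly 0.0 inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1 / 2n) * sum((y - X @ beta)^2) + lam * ||beta||_1.

    X is a list of rows. No intercept is fitted, so the columns of X and
    the values of y are assumed to be centred.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove every predictor's contribution except j's.
            partial = [
                y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                for i in range(n)
            ]
            rho = sum(X[i][j] * partial[i] for i in range(n)) / n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / scale
    return beta


# Hypothetical toy data: y = 2 * column 0 + 0.05 * column 1.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.05, -1.95, 1.95, -2.05]

beta = lasso_coordinate_descent(X, y, lam=0.1)
print(beta)  # the weak second predictor is set to zero
```

The strong predictor survives with a slightly shrunken weight, while the weak one is removed entirely; this is the variable-elimination behaviour the selection step relies on.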

Regarding the remaining two questions, we can say the following. Firstly, by fit we mean the coefficient of determination ($R^2$) given by the model with the selected variables and the DSM estimates on the validation set. Because lasso depends on cross-validation, deriving an $R^2$ value directly from this model is a non-trivial procedure. We find the $R^2$ value by re-training a standard ordinary least squares regression model with the same predictors as the ones retained by lasso and computing the $R^2$ value there. While this might seem confusing (we train one model only to re-train another with the same predictors), Belloni & Chernozhukov (2013) have shown that this is an acceptable procedure for lasso and performs “just as well” (i.e., using lasso only to zero some coefficients, not as a regression model). Secondly, we determine the best possible model by comparing the baseline model and the model with the similarity estimates.
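The refit-and-compare step can be sketched as follows. This is a simplified illustration under stated assumptions: the lasso weights, the item data and the similarity column are all hypothetical, and $R^2$ is computed in-sample rather than on a validation set as in the text.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def ols_r_squared(X, y):
    """Fit ordinary least squares with an intercept; return the in-sample R^2."""
    rows = [[1.0] + list(row) for row in X]
    p = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def select_columns(X, keep):
    """Keep only the columns whose lasso weight was non-zero."""
    return [[row[j] for j in keep] for row in X]


# Hypothetical item-level data: word length, a nuisance variable, log frequency.
X = [
    [3.0, 1.0, 2.1],
    [5.0, 0.0, 3.0],
    [4.0, 1.0, 1.5],
    [6.0, 0.0, 2.7],
    [7.0, 1.0, 1.9],
    [5.0, 0.0, 3.3],
]
y = [12.0, 30.0, 8.0, 25.0, 20.0, 38.0]      # hypothetical priming effects (ms)
similarity = [0.2, 0.7, 0.1, 0.5, 0.4, 0.9]  # hypothetical semantic-model estimates

# Pretend lasso returned these weights: the nuisance column was zeroed out.
lasso_beta = [0.8, 0.0, 0.3]
keep = [j for j, b in enumerate(lasso_beta) if b != 0.0]

baseline = select_columns(X, keep)
augmented = [row + [s] for row, s in zip(baseline, similarity)]
r2_base = ols_r_squared(baseline, y)
r2_aug = ols_r_squared(augmented, y)
print(round(r2_base, 3), round(r2_aug, 3))
```

Because in-sample $R^2$ can never decrease when a predictor is added, a held-out comparison (as described in the text) is what makes the gain from the similarity estimates meaningful.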

3.3.2 Baselines

Common psycholinguistic measures

A major problem in psycholinguistic research is crafting stimuli sets that vary only in one dimension (Cutler, 1981; Hutchison et al., 2008). Since most psycholinguistic studies involve a factorial design, or at least a design where the relevant comparison is made between separate groups (e.g., high vs. low frequency words, semantically related vs. unrelated), balancing the stimuli such that nuisance variables cannot explain the variance is of utmost importance. For example, the orthographic frequency of the target word influences semantic priming (Becker, 1979). This should not be surprising as, one way or the other, semantic priming involves lexical access, which is affected by frequency. Failing either to control for this effect (by crafting balanced lists) or to account for it in the analysis might lead to erroneous conclusions about what caused the faster RT in the experiment.

There are many variables related to the characteristics of either the prime word or the target that might influence the RT. To name a few, prime or target length, regularity, consistency, bigram frequency, onset, orthographic neighbourhood, meaningfulness, and concreteness (for a more detailed list, see Hutchison et al., 2008) can either facilitate or inhibit lexical access, exhibiting either a positive or negative correlation with RT. As outlined above, including these factors in the model and testing their significance helps us get a better estimate of the influence of semantic similarity in priming. The factors we test for during the lasso procedure are the ones included in the SPP dataset: (a) bigram frequency, (b) word length, (c) word frequency, (d) number of orthographic neighbours, (e) part-of-speech and (f) nature of the relation between the prime and the target (synonyms, antonyms etc.).

As seen in §A.4, the variables we end up with are the word length (Prime_Length) and the log-transformed word frequency (Prime_LogSubFreq) for the prime words, and the word length (Target_Length), the log-transformed word frequency (Target_LogSubFreq) and the number of orthographic neighbours (Target_OrthoN)⁸ for the target words. A linear regression model with these predictors achieves a baseline $R^2 = 0.323$, which is very close to the variance explained by the baseline model in Mandera et al. (2017) ($R^2 = 0.312$). Given that regression models usually achieve a low fit with the data (e.g., Buchanan et al., 2001; Hutchison et al., 2008), this model provides a competitive baseline for our experiments.

WordNet

We also offer two semantic baseline measures with which we can compute the semantic similarity between the prime and the target: WordNet and Association Norms. Measuring semantic similarity in WordNet is tantamount to measuring the distance between two nodes in the graph. Concretely, we want to find the path $P = (v_1, v_2, \ldots, v_n) \in V \times V \times \cdots \times V$ from word $w_1$ to word $w_2$ by minimising $n$. While this is a very straightforward approach and many efficient path minimisation algorithms exist, it quickly faces the issue pointed out by Resnik (1995): there is an underlying assumption that the distances between the nodes in the graph are uniform. Consider the representation of dog in Fig. 1.4; intuitively, the distance between dog and another canine such as wolf should be shorter than that between the concepts insectivore and pet (both being sister nodes under animal.n.01). However, because both pairs are sister nodes (i.e., they are subsumed by the same ancestor) their path distance is the same.
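The uniform-distance problem can be reproduced on a toy graph. The taxonomy below is a hypothetical simplification built for illustration; it is not the actual WordNet hierarchy shown in Fig. 1.4, and the shortest path is found with a plain breadth-first search.

```python
from collections import deque

# Hypothetical toy taxonomy (child -> parent); NOT the actual WordNet graph.
PARENT = {
    "insectivore": "animal",
    "pet": "animal",
    "canine": "animal",
    "dog": "canine",
    "wolf": "canine",
}


def neighbours(node):
    """Nodes one edge away: all children plus the parent, if there is one."""
    linked = [child for child, parent in PARENT.items() if parent == node]
    if node in PARENT:
        linked.append(PARENT[node])
    return linked


def path_length(start, goal):
    """Number of edges on the shortest path, found by breadth-first search."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two nodes are not connected


print(path_length("dog", "wolf"))          # sister nodes deep in the tree
print(path_length("insectivore", "pet"))   # sister nodes near the root
```

Both pairs are two edges apart, so a pure edge count cannot express the intuition that dog and wolf are the closer pair.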

One set of approaches to overcoming this problem takes into account the depth of the concepts in question in the hierarchy. The reasoning behind this is that concepts ‘deeper’ in the taxonomy are more closely related than those higher up (Sussna, 1993). The approaches taken by Leacock & Chodorow (1998) and Wu & Palmer (1994), therefore, introduce a normalisation element in the path distance calculation that takes into account the depth of the concepts in the hierarchy. The second set of approaches (Jiang & Conrath, 1997; Lin, 1998; Resnik, 1995) involves incorporating corpus statistics in the similarity function. In short, these methods include information-theoretic criteria to capture the probability of encountering $w_1$ given $w_2$.
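As an illustration of depth normalisation, the sketch below computes the Wu & Palmer measure in its commonly stated form, $2 \cdot depth(lcs) / (depth(c_1) + depth(c_2))$, where $lcs$ is the deepest concept subsuming both. The toy taxonomy is hypothetical, and counting the root as depth 1 is an assumed convention.

```python
# Hypothetical toy taxonomy (child -> parent); NOT the actual WordNet graph.
PARENT = {
    "insectivore": "animal",
    "pet": "animal",
    "canine": "animal",
    "dog": "canine",
    "wolf": "canine",
}


def ancestors(node):
    """Chain from the node up to the root, starting with the node itself."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def depth(node):
    """Depth counted in nodes, so the root has depth 1 (assumed convention)."""
    return len(ancestors(node))


def lowest_common_subsumer(a, b):
    """Deepest concept that subsumes both a and b."""
    above_b = set(ancestors(b))
    for node in ancestors(a):  # walks upwards, so the first hit is the deepest
        if node in above_b:
            return node
    return None


def wu_palmer(a, b):
    """Depth-normalised similarity in (0, 1]; higher means more similar."""
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wu_palmer("dog", "wolf"))
print(wu_palmer("insectivore", "pet"))
```

Although both pairs are sister nodes, the pair whose common ancestor lies deeper in the hierarchy (dog and wolf) now receives the higher similarity.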

Comparing the possible methods to capture semantic similarity from WordNet, Budanitsky & Hirst (2006) found the algorithms proposed by Jiang & Conrath (1997) and Leacock & Chodorow (1998) to provide the best fit on two behavioural tasks of similarity ratings. In what follows, we examine all of the above metrics in the context of semantic priming.

As we remarked above, WordNet includes concept-concept relations (see §1.3.1) instead of word-word relations, as the association norms and the DSMs do. The problem in the present context is that there is not necessarily a 1:1 relationship between the word in the SPP and the concept referred to by the WordNet synset. Take the word cow, for example; the first two synsets tagged as cow are defined as “female of domestic cattle: ‘moo-cow’ is a child’s term” and “mature female of mammals of which the male is called ‘bull’”. Automatically choosing the intended concept could be done using WordNet’s internal sorting mechanism that arranges synsets by frequency. However, this can quickly prove problematic, as the first synset to appear for the word table has the definition “a set of data arranged in rows and columns”, while the related primes in the SPP for the target word table are chair and seat. Hand-picking the synsets so as to intuitively match the WordNet definition to the intended use in the SPP is both a laborious and potentially biasing task. We mitigate this problem by implementing the following solution: for any two related (in the SPP) words we select the two synsets that maximise the similarity metric, while for the unrelated words we choose two synsets at random. The reason for the first decision is rooted in models such as the spreading activation theory, in that, given the activation of the prime word, the target word that is going to be activated is the one that stands closest to the prime. This method was also used by Budanitsky & Hirst (2006) and in the original studies introducing the above similarity metrics. We consider the second decision to be a safer option in the present context than implementing the same solution for the unrelated words, as initial results showed that method to be biased by spurious correlations.
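The selection rule can be sketched as a small function. Everything concrete below is hypothetical: the sense labels, the similarity scores and the `toy_similarity` lookup merely stand in for WordNet synsets and for whichever similarity metric is being evaluated.

```python
import random
from itertools import product


def pick_synsets(prime_senses, target_senses, similarity, related, rng=random):
    """Choose one (prime sense, target sense) pair.

    Related pairs take the sense pair that maximises the similarity metric;
    unrelated pairs take a sense pair drawn uniformly at random.
    """
    pairs = list(product(prime_senses, target_senses))
    if related:
        return max(pairs, key=lambda pair: similarity(*pair))
    return rng.choice(pairs)


# Hypothetical sense inventories and similarity scores, for illustration only.
CHAIR = ["chair.furniture", "chair.person"]
TABLE = ["table.data", "table.furniture"]
SCORES = {
    ("chair.furniture", "table.data"): 0.10,
    ("chair.furniture", "table.furniture"): 0.80,
    ("chair.person", "table.data"): 0.15,
    ("chair.person", "table.furniture"): 0.05,
}


def toy_similarity(a, b):
    """Look up a made-up similarity score for two sense labels."""
    return SCORES[(a, b)]


related_pair = pick_synsets(CHAIR, TABLE, toy_similarity, related=True)
unrelated_pair = pick_synsets(CHAIR, TABLE, toy_similarity, related=False,
                              rng=random.Random(0))
print(related_pair, unrelated_pair)
```

For the related pair the furniture senses are selected even though the data sense of table may be listed first, which is the behaviour the frequency-based ordering fails to deliver.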

Word Association Norms

Apart from WordNet similarity metrics, we use the University of South Florida Free Association Norms to obtain similarity ratings for the prime-target pairs. We have already described the procedure to obtain Free Association Norms in §1.3.1 and the relation of those norms to the SPP in §3.4.1. Since the forward association strength was used to derive the word pairs, we would expect it to exhibit some negative correlation with the reaction times. That is, the higher the association strength, the lower the reaction time needed to process the target word. In the semantic priming task we are reviewing below, the correlation between the forward association strength and the prime-target pair is $r = -.09$, $p = 0$. While this estimate might seem small, we

their analysis Hutchison (2003a) found that the Forward Association Strength (FAS) covariate was significant in their model while the LSA estimate was not.

Together with the forward association strength we also obtain similarities from the association norm representations. Given a cue word $w$, its associates $A$ and an associates vocabulary $V$, we form a sparse vector $\mathbf{w} \in \mathbb{R}^{|V|}$ such that all its elements are zero except for those that exist in its associates set: $\forall i \in V,\ i \in A \Rightarrow w_i = 1$. Alternatively, the value of the element can be a function of the relationship between the two words. We also explore the use of forward association strength as an alternative function. Because of the sparsity of the vector, we perform dimensionality reduction via SVD (§2.2.1) to obtain lower-dimensional representations which, hopefully, contain richer information. For the experiments reported here we derive vectors of size 10, 30, 50, 100, 200, and 300. Due to the small number of possible hyperparameter combinations for both WordNet and the Association Norms, we do not use the Bayesian Optimiser on these models as we can derive the estimates directly.
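The construction of the associate vectors can be sketched as follows. The cue words, associates and strengths are hypothetical stand-ins for the norms, cosine is assumed as the similarity function, and the SVD step is omitted because it would require a linear algebra library.

```python
import math

# Hypothetical cue -> {associate: forward association strength} entries.
ASSOCIATES = {
    "dog": {"cat": 0.6, "bone": 0.2, "bark": 0.1},
    "wolf": {"dog": 0.3, "howl": 0.3, "bark": 0.1},
    "table": {"chair": 0.7, "food": 0.1},
}

# The associates vocabulary V: every word produced as an associate.
VOCAB = sorted({word for assoc in ASSOCIATES.values() for word in assoc})


def binary_vector(cue):
    """w_i = 1 if vocabulary item i is an associate of the cue, else 0."""
    return [1.0 if word in ASSOCIATES[cue] else 0.0 for word in VOCAB]


def weighted_vector(cue):
    """Alternative weighting: w_i is the forward association strength."""
    return [ASSOCIATES[cue].get(word, 0.0) for word in VOCAB]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


print(cosine(binary_vector("dog"), binary_vector("wolf")))
print(cosine(binary_vector("dog"), binary_vector("table")))
print(cosine(weighted_vector("dog"), weighted_vector("wolf")))
```

Cue words that share no associates receive a similarity of exactly zero in this raw space, which is one motivation for the dimensionality reduction described above.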
