Results - Three experimental studies - The Processing of Lexical Sequences

1.6 Three experimental studies

2.2.3 Results

The participants understood the task that I asked of them. The mean rating for the nonsense n-grams was µ = 1.35, σ = 0.2, a very low rating on a scale of 1 to 7. The mean rating for all the sensible n-grams was µ = 3.83, σ = 1.07. The nonsense n-gram ratings were removed from the rest of the analyses. I measured rater reliability using the intra-class correlation coefficient (ICC; Shrout & Fleiss, 1979). For every sub-group of subjects who rated the same set of 32 items, the ICC was greater than 0.37 and its 95% confidence interval did not include 0, showing consistency in item ratings across participants.
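As a rough illustration of how an intra-class correlation can be computed, the following Python sketch implements the one-way random-effects form, ICC(1,1), described by Shrout and Fleiss (1979). The text does not say which of the Shrout and Fleiss forms was used, so the choice of ICC(1,1), the function name icc_1_1 and the small ratings matrix are all assumptions made for illustration only.

```python
def icc_1_1(ratings):
    """One-way random-effects ICC(1,1).

    `ratings` is a list of rows, one row per item and one column per rater.
    Assumes (for illustration) that every item has the same number of ratings.
    """
    n = len(ratings)       # number of items
    k = len(ratings[0])    # number of ratings per item
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Between-items mean square
    bms = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    # Within-items mean square
    wms = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    return (bms - wms) / (bms + (k - 1) * wms)


# Hypothetical ratings: 6 items (rows) rated by 4 raters (columns).
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(f"ICC(1,1) = {icc_1_1(ratings):.2f}")
```

An ICC near 0 would indicate that raters agree no more than chance would predict, and an ICC of 1 would indicate that all raters gave identical ratings to every item.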

[...] corpus frequency of the n-grams, I had to see whether the internal n-gram frequencies were helping to drive the subjective ratings. This task is complicated by the fact that all of these frequencies are highly inter-correlated, so entering all the predictors simultaneously into a regression model could lead me to mis-evaluate their importance. Principal Component Analysis (PCA), chosen by Matthews and Bannard (2010) to reduce the multicollinearity of the component frequencies of their 4-grams, was considered as a potential way to reduce multicollinearity in this experiment. A disadvantage of PCA is that the orthogonal components it produces can be extremely difficult to interpret in terms of the original variables.

To sidestep the problem of multicollinearity while properly assessing which predictors are most relevant, I made use of random forests. Random forests are a type of recursive partitioning algorithm for performing nonparametric regression with a large number of predictors (Breiman, 2001). They are a powerful type of Classification and Regression Tree (CART) method, and since they make no assumptions about the types of relationships between variables, they have been found to be superior to multiple regression in predicting performance on various tasks (Finch et al., 2011). To understand which of my predictors were important, I measured the conditional importance of each variable in a random forest model and then used only the most important predictors in my regression models.

A method for performing this type of conditional importance analysis has been described by Strobl, Malley, and Tutz (2009) in this way: variable importance is assessed by permuting the data in each predictor variable in turn and then testing the model with the permuted variable and the remaining non-permuted variables, until all the variables have been permuted. The prediction accuracy of each inference tree in the forest decreases substantially if the permuted variable was involved in predicting the response. The difference in prediction accuracy before and after permuting a variable, averaged over all trees, is one measure of variable importance, the marginal permutation importance. An improvement on this unconditional permutation importance measure is the conditional permutation importance (Strobl, Boulesteix, Kneib, Augustin, & Zeileis, 2008), in which the permutation is carried out conditionally, using the partitions that arise from the recursive partitioning in the random forest as a conditioning grid. This conditional variable importance is less susceptible to preferring correlated predictor variables and takes into account both main effects and interactions.

In all of my analyses I used the R package party (Strobl et al., 2009). I ran my random forests with several different starting values for the pseudo-random number generator, and I report the results below only after confirming that the ranking of conditional importance did not depend on the starting value. The results of my analysis are shown in Figure 2.1 and can be summarized as follows:

• For 2-grams, the whole n-gram and second word frequencies were important.

• For 3-grams, the whole n-gram frequency was important, with a smaller contribution from the third 2-gram's frequency. Interestingly, the third 2-gram frequency, bf3, is the frequency with which the first and third words appear together, which I call a split-gram.

• For 4-grams, the whole n-gram, the first 2-gram and the second 3-gram frequencies were important.

• For 5-grams, the first 4-gram and the whole n-gram frequencies were important.
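The permutation idea behind these importance scores can be sketched in a few lines of Python. This is a simplified illustration and not the analysis reported here: it computes the marginal (unconditional) permutation importance only, it uses a fixed stand-in prediction function in place of a random forest, and the data, the function names and the n_repeats parameter are all hypothetical. The conditional importance used in the analysis additionally permutes each predictor within a grid defined by the other predictors, which this sketch does not attempt.

```python
import random


def mse(predict, X, y):
    """Mean squared prediction error of `predict` on rows X with targets y."""
    return sum((predict(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in MSE after shuffling each predictor column in turn."""
    rng = random.Random(seed)
    baseline = mse(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between predictor j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(predict, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical data: three predictors, of which only the first two matter.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [2.0 * row[0] + 0.5 * row[1] + rng.gauss(0, 0.1) for row in X]


def predict(row):
    """Stand-in for a fitted model; it ignores the third predictor."""
    return 2.0 * row[0] + 0.5 * row[1]


importance = permutation_importance(predict, X, y)
for name, value in zip(["x0", "x1", "x2"], importance):
    print(f"{name}: mean increase in MSE = {value:.3f}")
```

A predictor that the model does not use at all receives an importance of zero, because shuffling it leaves every prediction unchanged.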

[Figure 2.1 appears here: four dot plots, one each for 2-grams, 3-grams, 4-grams and 5-grams, showing the mean decrease in accuracy for each frequency predictor.]

Figure 2.1: Importance of predictors in a random forest model of mean item rating in Experiment 1. After creating random forest models, I calculated the relative importance of all of the log-transformed n-gram frequency variables in predicting mean subjective frequency ratings, adjusted for correlations between predictor variables (both for the main effects and the interactions). The names of the frequencies are abbreviated in the following manner: 2, 3 and 4-grams are assigned the letters b, t, and q. The abbreviation tf2 stands for Second Trigram Frequency. A full description of all these abbreviations is given in [...]

Was n-gram frequency helpful in predicting my outcome variable? Using the variables identified by the random forest analysis, I created linear models for each size of n-gram with and without the whole n-gram frequency in each model and then performed a model comparison. I compared the Akaike Information Criterion (AIC; Akaike, 1974) of all the models to determine which one had the best fit. The AIC is a measure of the quality of a model that incorporates both the goodness of fit and the number of free parameters in the model. Models that fit the data better with fewer parameters are given a lower AIC. The absolute value of the AIC is therefore not important; rather, the difference between two AIC values shows which model is better, and by how much. The results of these comparisons of nested models are shown in Table 2.1.

Table 2.1: Regression Model Comparisons for Experiment 1. Two models for predicting the mean subjective frequency ratings of n-grams are given for each size of n-gram, with the first model nested within the second. Models marked with an asterisk (*) were the best models for each size of n-gram. ∆df denotes the change in the number of free parameters between the two models being compared.

Model                                 AIC   ∆df   χ²      p
2-grams: n-gram freq only             330
2-grams: n-gram freq and wf2 *        317   1     15.98   0.00001
3-grams: n-gram freq only             212
3-grams: n-gram freq and bf3 *        206   1      8.07   0.00566
4-grams: n-gram freq only             154
4-grams: n-gram freq, bf1 & tf2 *     145   2      6.81   0.00200
5-grams: qf1 only *                   172
5-grams: qf1 and n-gram freq          174   1      0.29   0.59329
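To make the logic of comparing two nested models by AIC concrete, here is a minimal Python sketch. It uses the standard definition AIC = 2k - 2 ln L, where k is the number of free parameters and ln L is the maximized log-likelihood. The log-likelihood values and parameter counts are hypothetical, and the likelihood-ratio statistic is shown only as one common way of testing nested models; the text does not state how the χ² values in Table 2.1 were obtained.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def compare_nested(loglik_small, k_small, loglik_large, k_large):
    """Compare a smaller model with a larger model that contains it.

    Returns (delta_df, lr_chi2, delta_aic), where a positive delta_aic
    means the larger model has the lower (better) AIC.
    """
    delta_df = k_large - k_small
    lr_chi2 = 2 * (loglik_large - loglik_small)  # likelihood-ratio statistic
    delta_aic = aic(loglik_small, k_small) - aic(loglik_large, k_large)
    return delta_df, lr_chi2, delta_aic


# Hypothetical values: adding one predictor raises ln L from -100 to -95.
delta_df, lr_chi2, delta_aic = compare_nested(-100.0, 3, -95.0, 4)
print(f"delta df = {delta_df}, LR chi2 = {lr_chi2:.2f}, delta AIC = {delta_aic:.2f}")
```

Under these definitions the AIC difference equals the likelihood-ratio statistic minus twice the number of added parameters, which shows how an extra predictor must improve the fit by enough to pay for its added parameter.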

The picture for the relationship between objective and subjective frequency for n-grams is more complicated than the one for words described by Balota et al. (2001); it involves much more than a linear relationship with the meaningfulness of words or with their simple whole-form corpus frequency. Effects of the internal n-gram frequencies also came into play. In this section, I will report regression effect sizes using Cohen's f², a measure of effect size appropriate for regression models. Cohen (1988) suggested that effect sizes of 0.02, 0.15, and 0.35 should be considered small, medium, and large, respectively. Each model was re-fit 1000 times with bootstrapped replicates, giving a distribution of f² values. I then calculated the 95% confidence interval of the effect size from this distribution, reported below.

For 2-grams, the subjective frequency ratings were predicted by both the 2-gram's frequency (f² = 0.45, 95% CI 0.32-0.56) and the second word's frequency (f² = 0.07, 95% CI 0.02-0.13). This result could imply a recency effect: the frequency of the last word read had more impact on the rating than that of the first word.

For the 3-grams, the whole n-gram's frequency had the largest effect size (f² = 0.45, 95% CI 0.27-0.59) and there was a weak effect of the split-gram (f² = 0.05, 95% CI 0.01-0.14).

For the 4-grams, a more complicated model was the best fitting. The whole n-gram frequency had the largest effect (f² = 0.34, 95% CI 0.17-0.51), followed by a weak effect of the first bigram (f² = 0.08, 95% CI 0.01-0.19) and an unreliable effect of the second trigram (f² = 0.03, 95% CI 0-0.11).

For the 5-grams, the addition of the whole n-gram frequency did not improve the model, so the simpler model prevailed. This simpler model showed a strong effect of the first 4-gram's frequency (f² = 0.27, 95% CI 0.14-0.43).
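The effect-size procedure described above can be sketched in Python as follows. Cohen's f² for a predictor is (R² of the full model - R² of the model without that predictor) / (1 - R² of the full model), which reduces to R² / (1 - R²) when the model has a single predictor. The sketch makes several assumptions that are not taken from the text: it uses a one-predictor regression, simulated data, a percentile interval, and hypothetical function names.

```python
import random


def r_squared(x, y):
    """R-squared of a simple linear regression of y on x (squared Pearson r)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def cohen_f2(r2_full, r2_reduced=0.0):
    """Cohen's f2 for the predictor(s) that separate the two models."""
    return (r2_full - r2_reduced) / (1.0 - r2_full)


def bootstrap_f2_ci(x, y, n_boot=1000, level=0.95, seed=0):
    """Percentile confidence interval for f2 from resampled (x, y) pairs."""
    rng = random.Random(seed)
    n = len(x)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        values.append(cohen_f2(r_squared(xs, ys)))
    values.sort()
    low = values[int((1 - level) / 2 * n_boot)]
    high = values[int((1 + level) / 2 * n_boot) - 1]
    return low, high


# Simulated data standing in for log frequency (x) and mean rating (y).
rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(100)]
y = [0.6 * xi + rng.gauss(0, 1) for xi in x]

f2 = cohen_f2(r_squared(x, y))
low, high = bootstrap_f2_ci(x, y)
print(f"f2 = {f2:.2f}, 95% CI {low:.2f}-{high:.2f}")
```

With several predictors, the same resampling loop would refit the full and reduced models on each bootstrap sample and pass both R² values to cohen_f2.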

In all the analyses above, the amount of multicollinearity between the predictors was reasonable (in all models, κ < 8).

Finally, I noted that Balota et al. (2001) had found that the group of words with the highest subjective frequency ratings had a strong relationship between objective and subjective frequency, and that the opposite was true for the words with the lowest subjective frequency ratings. I replicated this result: I performed a median split on all of the items based on their average subjective frequency rating, and calculated a bootstrapped Pearson correlation with corpus frequency for each of the two groups. The magnitude of the correlation with frequency was larger for the set of items with the higher subjective frequency ratings: for the upper half, r(177) = 0.22, 95% CI 0.19-0.48, and for the lower half, r(176) = 0.11, 95% CI 0.01-0.19.
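The median-split comparison can be illustrated with a short Python sketch. The items here are randomly generated placeholders rather than the experimental data, the function names are hypothetical, the bootstrap step is omitted for brevity, and the handling of items that fall exactly on the median (assigned to the lower half) is an assumption, since the text does not describe it.

```python
import random
from math import sqrt
from statistics import median


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def median_split(items):
    """Split (rating, log_frequency) pairs at the median rating.

    Items at or below the median go to the lower half (an assumption).
    """
    cut = median([rating for rating, _ in items])
    lower = [item for item in items if item[0] <= cut]
    upper = [item for item in items if item[0] > cut]
    return lower, upper


# Placeholder items: (mean subjective rating, log corpus frequency).
rng = random.Random(7)
items = [(rng.uniform(1, 7), rng.gauss(0, 1)) for _ in range(40)]

lower, upper = median_split(items)
for label, half in (("lower", lower), ("upper", upper)):
    ratings = [rating for rating, _ in half]
    freqs = [freq for _, freq in half]
    print(f"{label} half: n = {len(half)}, r = {pearson_r(ratings, freqs):.2f}")
```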
