2.4 Experiment 3
2.4.5 Results: Accuracy
The overall accuracy with which the participants identified the higher frequency n-gram was above chance. I used bootstrapped confidence intervals to assess the accuracy, and found that the mean accuracy for all subjects on all items was 0.6 (95% CI: 0.58-0.62) for the 2-grams, 0.62 (95% CI: 0.6-0.64) for the 3-grams, 0.57 (95% CI: 0.55-0.6) for the 4-grams, and 0.56 (95% CI: 0.54-0.58) for the 5-grams. Because Experiment 3 has the same large number of multi-collinear predictors as Experiment 1, I wanted to see which frequency components contributed the most before attempting to model the accuracy data. I therefore investigated the relative conditional importance of all the frequency ratio variables in predicting mean accuracy, using the same random forest methodology described in the analysis of Experiment 1. The relative importance of the predictor variables is shown in Figure 2.5. Before presenting a more formal statistical analysis, I begin with an informal summary of the results of this analysis:
[Figure 2.5 plot: four panels (2-grams, 3-grams, 4-grams, 5-grams), each showing the frequency ratio predictors (e.g. wf1Ratio, bf1Ratio, tf1Ratio, NgramRatio) against the mean decrease in accuracy of model.]
Figure 2.5: Conditional importance of predictors in a random forest model for accuracy in Experiment 3. After creating random forest models, I calculated the relative conditional importance of all of the n-gram frequency variables in predicting mean accuracy, adjusted for correlations between predictor variables, both for the main effects and interactions.
• For 3-grams, the whole n-gram frequency ratio was important.
• For 4-grams, the second 3-gram’s frequency ratio was the most important. The first trigram, the first word, and a split 3-gram, tf4, had weak influence. The whole 4-gram frequency ratio was not a strong predictor.
• For 5-grams, the first 3-gram was a strong predictor along with a split 3-gram, tf8, made up of the first, second and fifth words of the n-gram. The first 4-gram was a slightly weaker predictor. The whole 5-gram frequency ratio was not a strong predictor.
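The bootstrapped confidence intervals on mean accuracy reported at the start of this section can be illustrated with a minimal sketch. It assumes a percentile bootstrap that resamples individual trials; the text does not say which bootstrap variant or resampling unit (trials, subjects, or items) was actually used. The data and the function name `bootstrap_ci` are hypothetical and serve only to show the procedure.

```python
import random

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement n_boot times and record each resample's mean.
    boot_means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_boot)
    )
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(values) / n, lower, upper

# Hypothetical trial-level responses (1 = correct, 0 = incorrect);
# these are NOT the experimental data.
responses = [1] * 600 + [0] * 400

est, lo, hi = bootstrap_ci(responses)
print(f"mean accuracy = {est:.2f}, 95% CI: {lo:.2f}-{hi:.2f}")
```

An interval whose lower bound lies above 0.5 corresponds to the "above chance" reading for a two-alternative judgement.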
Table 2.6: Accuracy GLME Model Comparisons for Experiment 3. All models contain crossed random effects for subjects and items. Models in bold type were the best models for each type of n-gram. All models include a random intercept for each item and a random slope for the effect of the frequency on each subject.
Model                               AIC    ∆df   χ2      p
2-grams: Position                   3330
2-grams: Position + N-gram Ratio    3327   1     5.42    0.020
3-grams: Position                   2387
3-grams: Position + N-gram Ratio    2372   1     16.47   0.000
4-grams: No fixed effects           2286
4-grams: tf1 Ratio + tf2 Ratio      2282   2     7.92    0.019
5-grams: No fixed effects           2381
5-grams: tf8 Ratio + tf1 Ratio      2378   2     7.21    0.027
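The p-values in Table 2.6 can be checked by hand if one assumes that the χ2 column is a likelihood-ratio statistic comparing each model with the simpler model listed above it, which is the usual reading of such a table but is an assumption here. The sketch below uses the closed-form upper-tail probabilities of the chi-square distribution for 1 and 2 degrees of freedom, together with the identity that the AIC difference between nested models equals χ2 minus twice the added degrees of freedom. The helper name `lrt_p_value` is hypothetical.

```python
import math

def lrt_p_value(chi2, df):
    """Upper-tail chi-square probability (closed forms for df = 1 or 2 only)."""
    if df == 1:
        # P(X > x) for one degree of freedom equals P(|Z| > sqrt(x)).
        return math.erfc(math.sqrt(chi2 / 2))
    if df == 2:
        # Chi-square with two degrees of freedom is exponential with mean 2.
        return math.exp(-chi2 / 2)
    raise ValueError("closed form implemented only for df = 1 or 2")

# (label, chi-square statistic, added degrees of freedom) copied from Table 2.6
rows = [
    ("2-grams", 5.42, 1),
    ("3-grams", 16.47, 1),
    ("4-grams", 7.92, 2),
    ("5-grams", 7.21, 2),
]

for label, chi2, df in rows:
    p = lrt_p_value(chi2, df)
    aic_drop = chi2 - 2 * df  # AIC improvement implied by the test statistic
    print(f"{label}: p = {p:.3f}, implied AIC drop = {aic_drop:.2f}")
```

Under that assumption the computed p-values agree with the table to three decimal places, and the implied AIC drops are within rounding of the differences between the AIC values listed for each pair of models.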
Next, I used generalized linear mixed effects models (Baayen et al., 2008) to understand the relationship between the stimuli and the trial-level accuracy of the participants’ judgements, using the most important variables found in each of the random forest models. Just as in my analysis of the data from the single word experiment, Experiment 2, all of my models included a random intercept for each item crossed with a random slope for the effect of the frequency ratio on each subject. Stimulus position was included only in the 2- and 3-gram models, as it did not improve the fit of the models for the other n-gram lengths. The comparison of these models is shown in Table 2.6. The model comparison shows that the ability of the models to predict trial-level accuracy improved when the appropriate frequency ratios were added. I also compared the models shown with other models that included predictors such as trial number and all the individual word frequencies; these are omitted from the model comparison table because they uniformly fit worse than the models shown. Since the individual word frequencies did not improve the fit of any of the accuracy models for any of the n-gram types in the experiment, this method of matching pairs of n-grams was successful in preventing lexical frequency cues from influencing the participants’ relative frequency judgements.
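As a sketch of the model structure described in this section, and not a statement of the exact specification that was fitted, a 2- or 3-gram accuracy model with a by-item random intercept and a by-subject random slope for the frequency ratio can be written as below. The notation is introduced here for illustration only: y_{si} is 1 when subject s judges item i correctly, Ratio_i is the n-gram frequency ratio of item i, and b_i and u_s are the random terms. Normally distributed random effects are the conventional assumption and are not stated in the text.

```latex
\[
\mathrm{logit}\,\Pr(y_{si} = 1)
  = \beta_0 + \beta_1\,\mathrm{Position}_{si} + \beta_2\,\mathrm{Ratio}_i
    + b_i + u_s\,\mathrm{Ratio}_i,
\qquad
b_i \sim \mathcal{N}(0, \sigma_b^2),\quad
u_s \sim \mathcal{N}(0, \sigma_u^2)
\]
```

For the 4- and 5-gram models in Table 2.6, the Position term is absent and the single ratio is replaced by the two trigram ratios listed there; the text does not specify which of those ratios carries the by-subject random slope.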