4.2 Linear regression
Measures of lexical processing and subjective frequency estimates for 4512 monomorphemic and monosyllabic English words.
rts = read.table("DATA/rts.txt",T)
Most variables are summarized in Table 3.1, except for the three dependent variables: RTld (reaction time in visual lexical decision), RTna (reaction time in word naming) and Fami (subjective frequency estimate), which are available in Balota et al. [1999] and Spieler and Balota [1998]. In rts, reaction times are already log transformed.
rts[1:4,]
#   X     RTld     RTna Fami   Word   Age Wcat     CelS       Fdif       Vf
#1  1 6.543754 6.145044 2.37    doe young    N 3.912023  1.0216510 1.386294
#2  2 6.397596 6.246882 4.43  whore young    N 4.521789  0.3504830 1.386294
#3  3 6.304942 6.143756 5.60 stress young    N 6.505784  2.0893560 1.609438
#4  4 6.424221 6.131878 3.87   pork young    N 5.017280 -0.5263339 1.945910
#      Dent    Ient      NsyS     NsyC Len Ncou     Bigr     InBi spelV spelN
#1  0.14144 0.02114 0.6931472 0.000000   3    8 7.036333 12.02268    10    41
#2  0.42706 0.94198 1.0986123 0.000000   5    5 9.537878 12.59780    20  2619
#3  0.06197 1.44339 2.4849066 1.945910   6    0 9.883931 13.30069    10   806
#4  0.43035 0.00000 1.0986123 2.639057   4    8 8.309180 12.07807     5   793
#   phonV phonN friendsV friendsN ffV ffN fbV   fbN NounFreq VerbFreq Missing
#1     41  6889        8       26   1  14  32  6862       49        0 present
#2     38 17602       20     2619   0   0  18 14983      142        0 present
#3     13  1141       10      806   0   0   3   335      565      473 present
#4      6    45        4       33   1 760   2    12      150        0 present
Figure 4.3: Cost-complexity pruned CART tree for the realization of the recipient in English clauses (through NP or PP) (data courtesy Joan Bresnan and Anna Cueni).
We need to transform various variables as they are decidedly non-normal:
rts$spelN = log(rts$spelN+1)
rts$phonN = log(rts$phonN+1)
rts$friendsN = log(rts$friendsN+1)
# rts$ffN, rts$ffV terribly distributed, even after log,
# perhaps factorize ?
rts$ffNonzero = as.numeric(rts$ffN > 0)
rts$fbV = log(rts$fbV+1)
rts$fbN = log(rts$fbN+1)
rts$ffV = log(rts$ffV+1)
rts$ffN = log(rts$ffN+1)
rts$NounFreq = log(rts$NounFreq+1)
rts$VerbFreq = log(rts$VerbFreq+1)
rts$NVratio = rts$NounFreq-rts$VerbFreq
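To see why the log(x+1) transform is used here, the following minimal sketch applies it to simulated right-skewed counts. The data and the skewness() helper are assumptions made for illustration only; they are not part of rts or of any library used in this chapter.

```r
# Simulated right-skewed counts that include zeros (hypothetical data, not from rts)
set.seed(1)
counts <- floor(rlnorm(1000, meanlog = 3, sdlog = 1.5))

# Simple moment-based skewness (helper defined here for illustration)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Adding 1 before taking logs keeps zero counts defined: log(0 + 1) = 0
logged <- log(counts + 1)

skewness(counts)   # strongly positive: long right tail
skewness(logged)   # much closer to 0
```

The same transform can be written as log1p(counts), which is numerically safer for very small values.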
We need to assess collinearity. We create the data matrix, and look at the condition number as defined in Belsley et al. [1980], using the function collin.fnc, the code of which can be found in FUNCTIONS/cap4.q. There are data for both young and old subjects; we look only at the young age group.
items = rts[rts$Age=="young",] # the group of young subjects
items = items[items$Missing=="present",]
source("FUNCTIONS/cap4.q")
collin.fnc(items[,c(19:24,27,28)])$cnumber
[1] 207.6198
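The code of collin.fnc is not reproduced here, so the following is only a sketch of the general idea behind a condition number in the spirit of Belsley et al.: scale the columns of the data matrix (including an intercept column) to unit length and take the ratio of the largest to the smallest singular value. The function name cond.number and the simulated data are assumptions for illustration; collin.fnc may differ in its details.

```r
# Hypothetical helper: condition number of a predictor matrix.
# Assumption: intercept column added, columns scaled to unit length (not centered).
cond.number <- function(X) {
  X <- cbind(Intercept = 1, as.matrix(X))
  X <- sweep(X, 2, sqrt(colSums(X^2)), "/")   # scale each column to unit length
  d <- svd(X)$d                               # singular values
  max(d) / min(d)
}

set.seed(1)
a <- rnorm(200)
independent <- cbind(a, b = rnorm(200))                 # unrelated predictors
collinear   <- cbind(a, b = a + rnorm(200, sd = 0.01))  # b is nearly a copy of a

cond.number(independent)   # small
cond.number(collinear)     # large
```

Nearly redundant columns push the smallest singular value towards zero, which is what inflates the ratio.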
The condition number of 207.6 is horribly high; it should be below 15. This is not surprising given the strong correlations between all the predictors (compare Figure 4.4). The Design library is excellent for regression; we load it and use its variable clustering algorithm to look at the correlational structure of the predictors:
library(Design)
plot(varclus(as.matrix(items[, c(8:30, 32, 33)])))
So we have these 10 highly correlated measures of orthographic consistency; let's orthogonalize them using principal components analysis.
Figure 4.4: Hierarchical clustering of the predictors obtained with varclus(); the scale shows the squared Spearman rank correlation (Spearman ρ²), running from 1.0 to 0.0.
> items.pca = prcomp(items[,c(19:28)], center=T, scale.=T)
> pvars = items.pca$sdev^2/sum(items.pca$sdev^2)
> ndims = sum(pvars>0.05)
> sum((items.pca$sdev^2/sum(items.pca$sdev^2))[1:ndims])
[1] 0.9269
> ndims
[1] 4
> items.pca$rotation[,1:4] -> xx
> x = as.data.frame(xx)
> x[sort.list(x$PC4),] # PC4 captures tokens (N) versus types (V)
               PC1         PC2         PC3         PC4
friendsN 0.3718694 -0.28289007  0.07148132 -0.44928966
spelN    0.3881210 -0.22627576 -0.16553869 -0.40495399
phonN    0.4069421  0.18048815  0.07507764 -0.34997103
fbN      0.2444671  0.52547079  0.06847653 -0.05996177
ffN      0.1056966  0.06472324 -0.66663278  0.05153449
fbV      0.2491062  0.52677423  0.06610245  0.10484388
ffV      0.0927819  0.04683986 -0.67005837  0.13199384
phonV    0.3892776  0.22303485  0.13663841  0.37879208
friendsV 0.3407555 -0.35233530  0.19816428  0.38419662
spelV    0.3690849 -0.31987200 -0.03841247  0.43117866
And we add these PCs to items:
items$PC1 = items.pca$x[,1]
items$PC2 = items.pca$x[,2]
items$PC3 = items.pca$x[,3]
items$PC4 = items.pca$x[,4]
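The point of replacing the original measures by principal components is that the component scores are mutually uncorrelated. A minimal sketch on simulated data (the three measures m1, m2 and m3 are hypothetical and only illustrate the mechanics of prcomp):

```r
set.seed(1)
n <- 500
latent <- rnorm(n)

# Three hypothetical, strongly correlated measures driven by one latent variable
X <- data.frame(m1 = latent + rnorm(n, sd = 0.3),
                m2 = latent + rnorm(n, sd = 0.3),
                m3 = latent + rnorm(n, sd = 0.3))

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of variance accounted for by each component
pvars <- pca$sdev^2 / sum(pca$sdev^2)
round(pvars, 3)

# The component scores in pca$x are uncorrelated (off-diagonal entries are 0)
round(cor(pca$x), 10)
```

Because the scores are uncorrelated, entering them as predictors does not add to the collinearity of the model, at the price of components that can be harder to interpret than the original measures.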
This data frame is also available as DATA/rts.items.txt. We make a data distribution object that the Design library uses when plotting effects.
items=read.table("DATA/rts.items.txt",T)
We will make use of the Design library that comes with Harrell [2001], an excellent book on regression.
library(Design)
library(Hmisc)
When you start working with data using Design, it is useful to first make an object that summarizes the distribution of your data. The current datadist object is set using the options command.
items.dd = datadist(items)
options(datadist='items.dd')
We use the ols() function for 'ordinary least squares' regression, which is much better than the standard lm() function that we introduced earlier. Our first model takes lexical decision latencies as the dependent variable, and seeks to model them as a linear combination of the other variables. In order to keep collinearity somewhat under control, we do not use variables that are variants of each other, such as family size and derivational entropy.
The collinearity of the set of variables that we consider is still very high,
collin.fnc(items, c(8, 9, 11, 12, 14, 15, 16, 17, 33, 34:37))$cnumber # 70
so we need to check carefully later whether our model is reasonably robust.
items.ols = ols(RTld~PC1+PC2+PC3+PC4+ Len+Bigr+
Wcat+ CelS+Fdif+NVratio+Dent+Ient+NsyC, data=items)
anova(items.ols)
Analysis of Variance          Response: RTld
Factor     d.f. Partial SS   MS          F      P
PC1           1  0.027506143 0.027506143   4.71 0.0302
PC2           1  0.002089036 0.002089036   0.36 0.5500
PC3           1  0.015762311 0.015762311   2.70 0.1007
PC4           1  0.003885736 0.003885736   0.66 0.4150
Len           1  0.023347310 0.023347310   3.99 0.0458
Bigr          1  0.044314419 0.044314419   7.58 0.0059
Wcat          1  0.056669613 0.056669613   9.69 0.0019
CelS          1  4.959025503 4.959025503 848.30 <.0001
Fdif          1  0.247675623 0.247675623  42.37 <.0001
NVratio       1  0.190007853 0.190007853  32.50 <.0001
Dent          1  0.049328103 0.049328103   8.44 0.0037
Ient          1  0.396822468 0.396822468  67.88 <.0001
NsyC          1  0.112721489 0.112721489  19.28 <.0001
REGRESSION   13 11.726697648 0.902053665 154.31 <.0001
ERROR      2219 12.971941640 0.005845850
Of the PCs for orthographic consistency, only the first is significant, so we drop PC2 to PC4. We also consider the possibility that some predictors have non-linear effects, using restricted cubic splines:
items.ols = ols(RTld~PC1+Len+Bigr+
Wcat+rcs(CelS)+rcs(Fdif)+NVratio+
rcs(Dent,3)+Ient+rcs(NsyC), data=items, x=T, y=T)
anova(items.ols)
Analysis of Variance          Response: RTld
Factor          d.f. Partial SS  MS          F      P
PC1                1  0.02393750 0.023937499   4.44 0.0353
Len                1  0.02169801 0.021698008   4.02 0.0450
Bigr               1  0.05852876 0.058528758  10.85 0.0010
Wcat               1  0.04164103 0.041641031   7.72 0.0055
CelS               4  5.50533065 1.376332663 255.19 <.0001
 Nonlinear         3  0.70641383 0.235471277  43.66 <.0001
Fdif               4  0.40997178 0.102492944  19.00 <.0001
 Nonlinear         3  0.06037631 0.020125438   3.73 0.0108
NVratio            1  0.03855496 0.038554957   7.15 0.0076
Dent               2  0.13940175 0.069700877  12.92 <.0001
 Nonlinear         1  0.09494349 0.094943495  17.60 <.0001
Ient               1  0.15225267 0.152252673  28.23 <.0001
NsyC               4  0.10497023 0.026242557   4.87 0.0007
 Nonlinear         3  0.04331896 0.014439654   2.68 0.0456
TOTAL NONLINEAR   10  1.06305843 0.106305843  19.71 <.0001
REGRESSION        20 12.76845350 0.638422675 118.37 <.0001
ERROR           2212 11.93018579 0.005393393
We now have only significant predictors. Note that the anova summary also tells us whether the non-linear components are significant.
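The test for a non-linear component amounts to asking whether a spline fit improves on a straight line. The sketch below illustrates this with base R only, using natural cubic splines from the splines package (ns) as a stand-in for rcs(); both are cubic splines constrained to be linear in the tails, but they use a different basis, so this is not the Design implementation. The simulated data are hypothetical. The four degrees of freedom for CelS, Fdif and NsyC in the table above are consistent with five knots, which is what df = 4 requests here.

```r
library(splines)   # ships with R

set.seed(1)
n <- 500
x <- runif(n, 0, 10)
y <- 6.6 - 0.1 * log(x + 1) + rnorm(n, sd = 0.02)   # simulated curved effect

fit.linear <- lm(y ~ x)
fit.spline <- lm(y ~ ns(x, df = 4))   # natural cubic spline with 4 d.f.

# F test for the non-linear component (3 d.f.): does the spline
# improve on the straight line?
nonlinear.test <- anova(fit.linear, fit.spline)
nonlinear.test
```

The straight line is a special case of the spline model, so the two fits are nested and the F test is legitimate.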
We have done rather rigorous variable selection. So we use bootstrap validation to check whether we have done something sensible:
validate(items.ols,bw=T,B=200)
...
Frequencies of Numbers of Factors Retained
 7  8  9 10
11 23 74 92

           index.orig    training        test      optimism
R-square  0.516969917 0.520893823 0.511475061  0.0094187618
MSE       0.005342672 0.005287432 0.005403449 -0.0001160167
Intercept 0.000000000 0.000000000 0.043590926 -0.0435909265
Slope     1.000000000 1.000000000 0.993226182  0.0067738177

          index.corrected   n
R-square      0.507551155 200
MSE           0.005458688 200
Intercept     0.043590926 200
Slope         0.993226182 200
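The logic behind the optimism and index.corrected columns can be sketched in base R. This is a simplified illustration on simulated data, not the code of validate(): it uses lm() instead of ols(), it corrects only the R-square, and it omits the backward elimination step that bw=T repeats in every bootstrap run. The helper rsq() and all variable names are assumptions.

```r
set.seed(1)
n <- 100
noise <- matrix(rnorm(n * 5), n, 5,
                dimnames = list(NULL, paste0("noise", 1:5)))
dat <- data.frame(x1 = rnorm(n), noise)     # one real predictor, five noise predictors
dat$y <- 0.5 * dat$x1 + rnorm(n)

# R-square of a fitted model evaluated on an arbitrary data set
rsq <- function(fit, data) {
  pred <- predict(fit, newdata = data)
  1 - sum((data$y - pred)^2) / sum((data$y - mean(data$y))^2)
}

full <- lm(y ~ ., data = dat)
index.orig <- rsq(full, dat)

B <- 200
optimism <- replicate(B, {
  boot <- dat[sample(n, replace = TRUE), ]  # bootstrap sample
  fit  <- lm(y ~ ., data = boot)
  rsq(fit, boot) - rsq(fit, dat)            # training index minus test index
})

index.corrected <- index.orig - mean(optimism)
c(index.orig = index.orig, optimism = mean(optimism),
  index.corrected = index.corrected)
```

A model fitted to a bootstrap sample does better on that sample than on the original data; the average size of this gap estimates how much the original R-square overstates the fit.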
The corrected index for the R-square is not much lower than the original index, so there is little overfitting, good. Let’s plot the partial effects:
par(mfrow=c(3,4))
plot(items.ols, ylim=c(6.3,6.6))
par(mfrow=c(1,1))
plot(items.ols, Fdif=NA) # plots only Fdif
Note that by plotting all effects on the same scale, we get a good visual impression of the effect sizes. Frequency has the greatest effect. Interestingly, the more typical a word is of written language (greater Fdif), the longer it takes to respond to it in visual lexical decision. This points to the primacy of familiarity with spoken language.
For a regression model to be valid and trustworthy for prediction (within the intervals of the predictors used), we need to check whether the model did a decent job. One diagnostic is the distribution of the residuals, which should be approximately normal. We can check this with
qqnorm(resid(items.ols))
qqline(resid(items.ols))
This looks reasonable enough, as shown in the upper left panel of Figure 4.6, but there is a weird inflection upwards at the right hand side. Probably, the model is having difficulties with the extremely high decision times. A Shapiro-Wilk test, moreover, is very negative about this distribution being normal (but it tends to be rather picky . . . ):
shapiro.test(resid(items.ols))
        Shapiro-Wilk normality test
data:  resid(items.ols)
W = 0.9947, p-value = 3.866e-07
So let’s look at the distribution of reaction times, and make a new data frame without the outliers:
plot(sort(items$RTld))
abline(h=6.23)
abline(h=6.7)
items2 = items[items$RTld > 6.23 & items$RTld < 6.7,]
nrow(items) - nrow(items2)
[1] 42 # data points removed on total of 2233, roughly 2%
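How a small number of extreme values affects the Shapiro-Wilk statistic can be illustrated on simulated data. In the sketch below the vector rt is hypothetical (a normal bulk of log latencies plus a few very slow responses), and the test is applied to the latencies themselves rather than to model residuals; only the cutoffs 6.23 and 6.7 are taken from the text.

```r
set.seed(1)
rt <- c(rnorm(2000, mean = 6.45, sd = 0.08),   # bulk of the simulated log latencies
        runif(40, min = 6.8, max = 7.0))       # a few extremely slow responses

full.test <- shapiro.test(rt)

trimmed   <- rt[rt > 6.23 & rt < 6.7]          # same cutoffs as in the text
trim.test <- shapiro.test(trimmed)

# W moves closer to 1 once the extreme values are removed
c(full = unname(full.test$statistic), trimmed = unname(trim.test$statistic))

length(rt) - length(trimmed)                   # number of data points removed
```

Note that shapiro.test() accepts at most 5000 observations, which is no restriction for the present data.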
Let’s now refit our model:
items2.ols = ols(RTld~PC1+ Len+Bigr+
rcs(CelS)+rcs(Fdif)+NVratio+
rcs(Dent)+Ient+rcs(NsyC)+
Wcat, data=items2)
Figure: Partial effects of the predictors on RTld, plotted on a common vertical scale from 6.30 to 6.60 (panels: PC1, Len, CelS, Fdif, NVratio, Dent, Ient, NsyC, Wcat).
anova(items2.ols)
Analysis of Variance          Response: RTld
Factor          d.f. Partial SS  MS          F      P
PC1                1  0.01213802 0.012138022   2.54 0.1110
Len                1  0.02274202 0.022742018   4.76 0.0292
Bigr               1  0.05235859 0.052358592  10.96 0.0009
CelS               4  4.39116958 1.097792394 229.89 <.0001
 Nonlinear         3  0.60878333 0.202927777  42.50 <.0001
Fdif               4  0.34335007 0.085837518  17.98 <.0001
 Nonlinear         3  0.06929259 0.023097529   4.84 0.0023
NVratio            1  0.02537945 0.025379450   5.31 0.0212
Dent               4  0.10719001 0.026797503   5.61 0.0002
 Nonlinear         3  0.07752807 0.025842692   5.41 0.0010
Ient               1  0.11648041 0.116480412  24.39 <.0001
NsyC               4  0.10414574 0.026036434   5.45 0.0002
 Nonlinear         3  0.04173393 0.013911311   2.91 0.0332
Wcat               1  0.03785084 0.037850837   7.93 0.0049
TOTAL NONLINEAR   12  0.91089015 0.075907513  15.90 <.0001
REGRESSION        22 10.63547856 0.483430843 101.24 <.0001
ERROR           2167 10.34814537 0.004775332
PC1 is no longer significant; apparently its effect depended on a few extreme values. For the rest, nothing changed much (you can check this by plotting the new model). So we could throw PC1 out as a predictor. If we now check the residuals with a qqnorm plot (see Figure 4.6), we get a better distribution, and even the Shapiro-Wilk test is reasonably satisfied:
shapiro.test(resid(items2.ols))
        Shapiro-Wilk normality test
data:  resid(items2.ols)
W = 0.9982, p-value = 0.01731
The lower right panel of Figure 4.6 shows the residuals plotted against the fitted values. There is a little heteroskedasticity (the residuals fan out slightly for high fitted values), but not so much as to be a worry. Nevertheless, there is a hint of a small problem with fitting the highest reaction times (compare the little warning from shapiro.test()).
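A residuals-against-fitted plot of this kind, together with a crude numeric check for fanning out, can be sketched as follows. The data are simulated so that the noise grows with the predictor, and lm() stands in for ols(); neither the variable names nor the rank-correlation check come from the analysis above.

```r
set.seed(1)
n <- 1000
x <- runif(n, 0, 1)
y <- 6.3 + 0.3 * x + rnorm(n, sd = 0.02 + 0.08 * x)   # noise grows with x (simulated)

fit <- lm(y ~ x)

# Residuals against fitted values: a funnel shape signals heteroskedasticity
plot(fitted(fit), resid(fit), xlab = "fitted values", ylab = "residuals")
abline(h = 0, lty = 2)

# Crude check: do the absolute residuals increase with the fitted values?
fan <- cor(fitted(fit), abs(resid(fit)), method = "spearman")
fan
```

A rank correlation close to zero suggests a roughly constant spread; a clearly positive value suggests that the residuals fan out for higher fitted values.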