
4.2 Linear regression

We use measures of lexical processing and subjective frequency estimates for 4512 monomorphemic and monosyllabic English words:

rts = read.table("DATA/rts.txt",T)

Most variables are summarized in Table 3.1, except for the three dependent variables: RTld (reaction time in visual lexical decision), RTna (reaction time in word naming) and Fami (subjective frequency estimate), which are available at Balota et al. [1999], Spieler and Balota [1998]. In rts, reaction times are already log transformed.

rts[1:4,]
#  X     RTld     RTna Fami   Word   Age Wcat     CelS       Fdif       Vf
#1 1 6.543754 6.145044 2.37    doe young    N 3.912023  1.0216510 1.386294
#2 2 6.397596 6.246882 4.43  whore young    N 4.521789  0.3504830 1.386294
#3 3 6.304942 6.143756 5.60 stress young    N 6.505784  2.0893560 1.609438
#4 4 6.424221 6.131878 3.87   pork young    N 5.017280 -0.5263339 1.945910
#     Dent    Ient      NsyS     NsyC Len Ncou     Bigr     InBi spelV spelN
#1 0.14144 0.02114 0.6931472 0.000000   3    8 7.036333 12.02268    10    41
#2 0.42706 0.94198 1.0986123 0.000000   5    5 9.537878 12.59780    20  2619
#3 0.06197 1.44339 2.4849066 1.945910   6    0 9.883931 13.30069    10   806
#4 0.43035 0.00000 1.0986123 2.639057   4    8 8.309180 12.07807     5   793
#  phonV phonN friendsV friendsN ffV ffN fbV   fbN NounFreq VerbFreq Missing
#1    41  6889        8       26   1  14  32  6862       49        0 present
#2    38 17602       20     2619   0   0  18 14983      142        0 present
#3    13  1141       10      806   0   0   3   335      565      473 present
#4     6    45        4       33   1 760   2    12      150        0 present

[Figure 4.3: Cost-complexity pruned CART tree for the realization of the recipient in English clauses (through NP or PP) (data courtesy Joan Bresnan and Anna Cueni).]

We need to transform various variables as they are decidedly non-normal:

rts$spelN = log(rts$spelN+1)
rts$phonN = log(rts$phonN+1)
rts$friendsN = log(rts$friendsN+1)
# rts$ffN, rts$ffV terribly distributed, even after log,
# perhaps factorize ?
rts$ffNonzero = as.numeric(rts$ffN > 0)
rts$fbV = log(rts$fbV+1)
rts$fbN = log(rts$fbN+1)
rts$ffV = log(rts$ffV+1)
rts$ffN = log(rts$ffN+1)
rts$NounFreq = log(rts$NounFreq+1)
rts$VerbFreq = log(rts$VerbFreq+1)
rts$NVratio = rts$NounFreq-rts$VerbFreq
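To see why the log(x+1) transform helps, here is a minimal sketch on simulated data. The counts below are invented stand-ins for a variable such as spelN, and skewness() is a hypothetical helper, not a function used elsewhere in this chapter.

```r
# Simulated right-skewed counts (hypothetical data, not the rts data set).
set.seed(1)
counts <- round(rlnorm(2000, meanlog = 5, sdlog = 1.5))

# Simple moment-based skewness; values near 0 indicate symmetry.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Adding 1 keeps zero counts finite, because log(0 + 1) = 0.
logged <- log(counts + 1)

skew.raw <- skewness(counts)
skew.log <- skewness(logged)
print(c(raw = skew.raw, logged = skew.log))
```

The logged variable should be far closer to symmetric than the raw counts, which is what we want before entering it into a linear model.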

We need to assess collinearity. We create the data matrix and look at the condition number as defined in Belsley et al. [1980], using the function collin.fnc, the code of which can be found in FUNCTIONS/cap4.q. There are data for both young and old subjects; we look only at the young age group.

items = rts[rts$Age=="young",] # the group of young subjects
items = items[items$Missing=="present",]
source("FUNCTIONS/cap4.q")
collin.fnc(items[,c(19:24,27,28)])$cnumber
[1] 207.6198
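For readers without access to FUNCTIONS/cap4.q, the idea behind a condition number can be sketched in base R. This is not the code of collin.fnc: cond.number is a hypothetical helper that follows the usual recipe (add an intercept column, scale every column to unit length, take the ratio of the largest to the smallest singular value), and the data are simulated.

```r
# Hypothetical helper: condition number of a predictor matrix.
cond.number <- function(X) {
  X <- cbind(Intercept = 1, as.matrix(X))
  X <- sweep(X, 2, sqrt(colSums(X^2)), "/")   # scale columns to unit length
  d <- svd(X)$d                               # singular values
  max(d) / min(d)
}

set.seed(2)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)                    # unrelated to x1
x3 <- x1 + rnorm(n, sd = 0.01)    # almost a copy of x1

k.indep  <- cond.number(cbind(x1, x2))
k.collin <- cond.number(cbind(x1, x3))
print(c(independent = k.indep, collinear = k.collin))
```

Two unrelated predictors give a condition number close to 1, while two nearly identical predictors drive it far up.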

This condition number is horribly high; it should be below 15. This is not surprising given the strong correlations between all the predictors (compare Figure 4.4). The Design library is excellent for regression; we load it and use its variable clustering algorithm to look at the correlational structure of the predictors:

library(Design)

plot(varclus(as.matrix(items[, c(8:30, 32, 33)])))

So we have these 10 highly correlated measures of orthographic consistency. Let’s orthogonalize them using principal components analysis.

[Figure 4.4: Hierarchical clustering of the predictors, as produced by plot(varclus(...)). Vertical axis: Spearman ρ², from 1.0 to 0.0. Leaves: phonN, spelN, friendsN, phonV, spelV, friendsV, Fdif, Ncou, Len, Bigr, ffV, ffN, ffNonzero, fbV, fbN, Ient, VerbFreq, NVratio, InBi, Vf, Dent, NsyS, NsyC, CelS, NounFreq.]

> items.pca = prcomp(items[,c(19:28)], center=T, scale.=T)
> pvars = items.pca$sdev^2/sum(items.pca$sdev^2)
> ndims=sum(pvars>0.05)
> sum((items.pca$sdev^2/sum(items.pca$sdev^2))[1:ndims])
[1] 0.9269
> ndims
[1] 4
> items.pca$rotation[,1:4] -> xx
> x=as.data.frame(xx)
> x[sort.list(x$PC4),] # PC4 captures tokens (N) versus types (V)
               PC1         PC2         PC3         PC4
friendsN 0.3718694 -0.28289007  0.07148132 -0.44928966
spelN    0.3881210 -0.22627576 -0.16553869 -0.40495399
phonN    0.4069421  0.18048815  0.07507764 -0.34997103
fbN      0.2444671  0.52547079  0.06847653 -0.05996177
ffN      0.1056966  0.06472324 -0.66663278  0.05153449
fbV      0.2491062  0.52677423  0.06610245  0.10484388
ffV      0.0927819  0.04683986 -0.67005837  0.13199384
phonV    0.3892776  0.22303485  0.13663841  0.37879208
friendsV 0.3407555 -0.35233530  0.19816428  0.38419662
spelV    0.3690849 -0.31987200 -0.03841247  0.43117866

And we add these PCs to items:

items$PC1 = items.pca$x[,1]

items$PC2 = items.pca$x[,2]

items$PC3 = items.pca$x[,3]

items$PC4 = items.pca$x[,4]
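The following sketch repeats the same steps on simulated data, to show what the orthogonalization buys us. The six columns are invented for illustration (four noisy copies of one latent variable plus two unrelated columns); only prcomp() and the 5% threshold are taken from the analysis above.

```r
set.seed(3)
n <- 300
latent <- rnorm(n)
# Four noisy copies of one latent variable plus two unrelated columns
# (hypothetical data, not the rts measures).
X <- data.frame(
  a = latent + rnorm(n, sd = 0.3),
  b = latent + rnorm(n, sd = 0.3),
  c = latent + rnorm(n, sd = 0.3),
  d = latent + rnorm(n, sd = 0.3),
  e = rnorm(n),
  f = rnorm(n)
)

X.pca <- prcomp(X, center = TRUE, scale. = TRUE)
pvars <- X.pca$sdev^2 / sum(X.pca$sdev^2)   # proportion of variance per PC
ndims <- sum(pvars > 0.05)                  # keep PCs explaining more than 5%

# The retained scores are uncorrelated with each other.
scores <- X.pca$x[, 1:ndims, drop = FALSE]
max.offdiag <- max(abs(cor(scores)[upper.tri(diag(ndims))]))
print(round(pvars, 3))
print(max.offdiag)
```

The original columns a to d are strongly correlated, but the principal component scores that replace them are mutually uncorrelated, so they can enter a regression model without inflating the condition number.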

This data frame is also available as DATA/rts.items.txt.

items=read.table("DATA/rts.items.txt",T)

We will make use of the Design library that comes with Harrell [2001], an excellent book on regression.

library(Design)
library(Hmisc)

When you start working with data using Design, it is useful to first make an object that summarizes the distribution of your data; the library uses this object when plotting effects. The current datadist object is set using the options command.

items.dd = datadist(items)
options(datadist='items.dd')

We use the ols() function for 'ordinary least squares' regression, which is much better than the standard lm() function that we introduced earlier. Our first model takes lexical decision latencies as the dependent variable, and seeks to model it as a linear combination of the other variables. In order to keep collinearity somewhat under control, we do not use variables that are variants of each other, such as family size and derivational entropy.

The collinearity of the set of variables that we consider is still very high,

collin.fnc(items, c(8, 9, 11, 12, 14, 15, 16, 17, 33, 34:37))$cnumber
70

so we need to check carefully later whether our model is reasonably robust.

items.ols = ols(RTld~PC1+PC2+PC3+PC4+ Len+Bigr+
  Wcat+ CelS+Fdif+NVratio+Dent+Ient+NsyC, data=items)
anova(items.ols)

Analysis of Variance          Response: RTld

Factor     d.f. Partial SS   MS          F      P
PC1           1  0.027506143 0.027506143   4.71 0.0302
PC2           1  0.002089036 0.002089036   0.36 0.5500
PC3           1  0.015762311 0.015762311   2.70 0.1007
PC4           1  0.003885736 0.003885736   0.66 0.4150
Len           1  0.023347310 0.023347310   3.99 0.0458
Bigr          1  0.044314419 0.044314419   7.58 0.0059
Wcat          1  0.056669613 0.056669613   9.69 0.0019
CelS          1  4.959025503 4.959025503 848.30 <.0001
Fdif          1  0.247675623 0.247675623  42.37 <.0001
NVratio       1  0.190007853 0.190007853  32.50 <.0001
Dent          1  0.049328103 0.049328103   8.44 0.0037
Ient          1  0.396822468 0.396822468  67.88 <.0001
NsyC          1  0.112721489 0.112721489  19.28 <.0001
REGRESSION   13 11.726697648 0.902053665 154.31 <.0001
ERROR      2219 12.971941640 0.005845850

Of the PCs for orthographic consistency, only the first is significant, so we chuck PC2–4 out. We also consider the possibility that some predictors have non-linear effects, which we model using restricted cubic splines:

items.ols = ols(RTld~PC1+Len+Bigr+
  Wcat+rcs(CelS)+rcs(Fdif)+NVratio+
  rcs(Dent,3)+Ient+rcs(NsyC), data=items, x=T, y=T)
anova(items.ols)

Analysis of Variance          Response: RTld

Factor          d.f. Partial SS  MS          F      P
PC1                1  0.02393750 0.023937499   4.44 0.0353
Len                1  0.02169801 0.021698008   4.02 0.0450
Bigr               1  0.05852876 0.058528758  10.85 0.0010
Wcat               1  0.04164103 0.041641031   7.72 0.0055
CelS               4  5.50533065 1.376332663 255.19 <.0001
 Nonlinear         3  0.70641383 0.235471277  43.66 <.0001
Fdif               4  0.40997178 0.102492944  19.00 <.0001
 Nonlinear         3  0.06037631 0.020125438   3.73 0.0108
NVratio            1  0.03855496 0.038554957   7.15 0.0076
Dent               2  0.13940175 0.069700877  12.92 <.0001
 Nonlinear         1  0.09494349 0.094943495  17.60 <.0001
Ient               1  0.15225267 0.152252673  28.23 <.0001
NsyC               4  0.10497023 0.026242557   4.87 0.0007
 Nonlinear         3  0.04331896 0.014439654   2.68 0.0456
TOTAL NONLINEAR   10  1.06305843 0.106305843  19.71 <.0001
REGRESSION        20 12.76845350 0.638422675 118.37 <.0001
ERROR           2212 11.93018579 0.005393393

We now have only significant predictors. Note that the anova summary also tells us whether the non-linear components are significant.
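The logic of the Nonlinear rows can be sketched without the Design library. The example below is an illustration under stated assumptions, not the computation performed by anova() on an ols object: it uses ns() from the splines package (a natural cubic spline, the same family of curves as a restricted cubic spline) and simulated data in which the effect of a frequency-like predictor levels off.

```r
library(splines)   # ships with R; ns() builds a natural cubic spline basis

set.seed(4)
n    <- 1000
freq <- runif(n, 0, 12)    # invented stand-in for a log frequency measure
# The simulated effect flattens out for higher values of freq.
rt   <- 6.6 - 0.25 * log(freq + 1) + rnorm(n, sd = 0.07)

fit.lin <- lm(rt ~ freq)                  # linear effect only
fit.ns  <- lm(rt ~ ns(freq, df = 4))      # 4 d.f.: linear part plus 3 non-linear

# F test for the 3 extra degrees of freedom, i.e. for non-linearity.
nonlin <- anova(fit.lin, fit.ns)
print(nonlin)
p.nonlinear <- nonlin[["Pr(>F)"]][2]
```

A small p-value for the model comparison plays the same role as a significant Nonlinear row: the straight line is not enough to describe the effect.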

We have done rather rigorous variable selection. So we use bootstrap validation to check whether we have done something sensible:

validate(items.ols,bw=T,B=200)
...
Frequencies of Numbers of Factors Retained
 7  8  9 10
11 23 74 92

           index.orig    training        test      optimism
R-square  0.516969917 0.520893823 0.511475061  0.0094187618
MSE       0.005342672 0.005287432 0.005403449 -0.0001160167
Intercept 0.000000000 0.000000000 0.043590926 -0.0435909265
Slope     1.000000000 1.000000000 0.993226182  0.0067738177

          index.corrected   n
R-square      0.507551155 200
MSE           0.005458688 200
Intercept     0.043590926 200
Slope         0.993226182 200
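To make the columns of this table less mysterious, here is a minimal base R sketch of an optimism-corrected R-square. It is an assumption-laden illustration rather than the algorithm of validate(): the data are simulated, rsq() is a hypothetical helper, and the backward elimination requested with bw=T is left out.

```r
set.seed(5)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 0.5 * d$x1 + 0.3 * d$x2 + rnorm(n)    # x3 is pure noise

# Hypothetical helper: R-square of a fitted model evaluated on given data.
rsq <- function(fit, data) {
  1 - sum((data$y - predict(fit, newdata = data))^2) /
      sum((data$y - mean(data$y))^2)
}

fit.orig   <- lm(y ~ x1 + x2 + x3, data = d)
index.orig <- rsq(fit.orig, d)

B <- 200
optimism <- replicate(B, {
  boot     <- d[sample(n, replace = TRUE), ]  # resample rows with replacement
  fit.boot <- lm(y ~ x1 + x2 + x3, data = boot)
  rsq(fit.boot, boot) - rsq(fit.boot, d)      # training minus test
})

index.corrected <- index.orig - mean(optimism)
print(c(orig = index.orig, optimism = mean(optimism),
        corrected = index.corrected))
```

Each bootstrap model is scored on its own sample (training) and on the original data (test); the average difference is the optimism, which is subtracted from the original index. The same arithmetic links the index.orig, optimism and index.corrected columns in the validate() output above.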

The corrected index for the R-square is not much lower than the original index, so there is little overfitting, good. Let’s plot the partial effects:

par(mfrow=c(3,4))
plot(items.ols, ylim=c(6.3,6.6))
par(mfrow=c(1,1))
plot(items.ols, Fdif=NA) # plots only Fdif

Note that by plotting all effects on the same scale, we get a good visual impression of the effect sizes. Frequency has the greatest effect. Interestingly, the more often a word is typical of written language (greater Fdif), the longer it takes to respond to in visual lexical decision. This points to the primacy of familiarity of spoken language.

For a regression model to be valid and trustworthy for prediction (within the intervals of the predictors used), we need to check whether the model did a decent job. One diagnostic is the distribution of the residuals, which should be approximately normal. We can check this with

qqnorm(resid(items.ols))
qqline(resid(items.ols))

This looks reasonable enough, as shown in the upper left panel of Figure 4.6, but there is a weird inflection upwards at the right hand side. Probably, the model is having difficulties with the extremely high decision times. A Shapiro-Wilk test, moreover, is very negative about this distribution being normal (but it tends to be rather picky . . . ):

shapiro.test(resid(items.ols))

Shapiro-Wilk normality test
data: resid(items.ols)
W = 0.9947, p-value = 3.866e-07
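How strongly a handful of extreme values can drive this test is easy to demonstrate on simulated residuals. The numbers below are invented: 2% of the values are made much too large, mimicking a few very slow responses, and the cut-offs are chosen by hand.

```r
set.seed(6)
n <- 2000
# Mostly normal residuals plus a small share of large positive values
# (simulated, not the residuals of items.ols).
res <- c(rnorm(n - 40, sd = 0.07), runif(40, 0.2, 0.4))

p.all <- shapiro.test(res)$p.value

# Trim with fixed cut-offs and test again.
trimmed   <- res[res > -0.2 & res < 0.2]
p.trimmed <- shapiro.test(trimmed)$p.value

print(c(all = p.all, trimmed = p.trimmed,
        removed = length(res) - length(trimmed)))

if (interactive()) {   # quantile-quantile plot of the untrimmed residuals
  qqnorm(res)
  qqline(res)
}
```

With the extreme values included, the test rejects normality decisively; after trimming, the evidence against normality is much weaker.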

So let’s look at the distribution of reaction times, and make a new data frame without the outliers:

plot(sort(items$RTld))
abline(h=6.23)
abline(h=6.7)
items2 = items[items$RTld > 6.23 & items$RTld < 6.7,]
nrow(items) - nrow(items2)
[1] 42 # data points removed on total of 2233, roughly 2%

Let’s now refit our model:

items2.ols = ols(RTld~PC1+ Len+Bigr+
  rcs(CelS)+rcs(Fdif)+NVratio+
  rcs(Dent)+Ient+rcs(NsyC)+
  Wcat, data=items2)

[Figure: partial effects on RTld, one panel per predictor (PC1, Len, CelS, Fdif, NVratio, Dent, Ient, NsyC, Wcat), all with a vertical axis from 6.30 to 6.60.]

anova(items2.ols)

Analysis of Variance          Response: RTld

Factor          d.f. Partial SS  MS          F      P
PC1                1  0.01213802 0.012138022   2.54 0.1110
Len                1  0.02274202 0.022742018   4.76 0.0292
Bigr               1  0.05235859 0.052358592  10.96 0.0009
CelS               4  4.39116958 1.097792394 229.89 <.0001
 Nonlinear         3  0.60878333 0.202927777  42.50 <.0001
Fdif               4  0.34335007 0.085837518  17.98 <.0001
 Nonlinear         3  0.06929259 0.023097529   4.84 0.0023
NVratio            1  0.02537945 0.025379450   5.31 0.0212
Dent               4  0.10719001 0.026797503   5.61 0.0002
 Nonlinear         3  0.07752807 0.025842692   5.41 0.0010
Ient               1  0.11648041 0.116480412  24.39 <.0001
NsyC               4  0.10414574 0.026036434   5.45 0.0002
 Nonlinear         3  0.04173393 0.013911311   2.91 0.0332
Wcat               1  0.03785084 0.037850837   7.93 0.0049
TOTAL NONLINEAR   12  0.91089015 0.075907513  15.90 <.0001
REGRESSION        22 10.63547856 0.483430843 101.24 <.0001
ERROR           2167 10.34814537 0.004775332

PC1 is now worthless: apparently it depended on a few extreme values. For the rest, nothing changed much (you can check this by plotting the new model). So we could throw PC1 out as a predictor. If we now check the residuals with a qqnorm plot (see Figure 4.6), we get a better distribution, and even the Shapiro-Wilk test is reasonably satisfied:

shapiro.test(resid(items2.ols))

Shapiro-Wilk normality test
data: resid(items2.ols)
W = 0.9982, p-value = 0.01731

The lower right panel of Figure 4.6 shows the residuals plotted against the fitted values. There is a little heteroskedasticity (the residuals fan out slightly for high fitted values), but not enough to be a worry. Nevertheless, there is a hint of a small problem with respect to fitting the highest reaction times (compare the little warning from shapiro.test()).
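A rough numerical companion to this plot is to check whether the absolute residuals grow with the fitted values. The sketch below uses simulated data in which the error variance increases with the predictor; the rank correlation is one simple choice of check, offered as an assumption rather than as a diagnostic used in the analysis above.

```r
set.seed(7)
n <- 1000
x <- runif(n, 0, 10)
# The error standard deviation grows with x, so the residuals fan out.
y <- 6.3 + 0.03 * x + rnorm(n, sd = 0.02 + 0.01 * x)

fit <- lm(y ~ x)

# Rank correlation between the size of the residuals and the fitted values.
fan <- cor.test(abs(resid(fit)), fitted(fit), method = "spearman")
print(fan$estimate)

if (interactive()) {   # residuals against fitted values
  plot(fitted(fit), resid(fit), xlab = "fitted values", ylab = "residuals")
  abline(h = 0)
}
```

A clearly positive correlation means that the residuals spread out as the fitted values increase; a value near zero is what we hope to see for a well-behaved model.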

In document Exploratory data analysis. An introduction to R for the language sciences (Page 137-146)