A numerical vector and a factor - Basic statistical tests

Elementary statistics with R

2.2 Basic statistical tests


> mat = log(mat+1) # take logs for visualization
> persp(kde2d(mat[,1], mat[,2], n=50),
+   phi=30, theta=20, d=10, col="lightblue",
+   shade=0.75, box=T, border=NA, ltheta=-100, expand=0.5,
+   xlab="log X", ylab="log Y", zlab="density")
> mtext("bivariate lognormal-Poisson", 3, 1)

The lower panels of Figure 2.14 illustrate two empirical densities. The left panel concerns the phonological similarity space of 4171 Dutch word forms with four phonemes.

For each of these words, we calculated its phonological neighborhood size: the number of four-phoneme words that differ from it in only one phoneme. For each word, we also calculated the rank of that word in its neighborhood. (If the word was the most frequent word in its neighborhood, its rank was 1, etc.) After removal of words with no neighbors and log transforms, we obtain a density that is clearly not strictly bivariate normal, but that might perhaps be treated as bivariate normal for the purposes of a regression model.

The lower right panel of Figure 2.14 presents the frequencies of 4633 Dutch monomorphemic noun stems in the singular and plural form. This distribution has the same kind of shape as that of the lognormal-Poisson variate in the upper right.

2.2.4 A numerical vector and a factor

The lm() function can also be used to do a t-test for the difference between two group means, under the assumption that the variances of the two groups are the same. In order to test whether the mean frequency differs between plants and animals in the items data set, we make Frequency depend on Class:

> summary(lm(Frequency ~ Class, data = items))
...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.5122     0.1386  25.348  < 2e-16 ***
Classplant    0.8547     0.2108   4.055 0.000117 ***

[Figure 2.14, four density surfaces. Upper panels: "bivariate standard normal" and "bivariate lognormal-Poisson" (axes log X, log Y, density). Lower panels: "lexical neighbors and rank" (axes log types, log rank, density) and "singular and plural frequency" (axes log Fsg, log Fpl, density).]

Figure 2.14: Random samples of a bivariate standard normal and a lognormal-Poisson variate (upper panels). The lower left panel shows the joint distribution of phonological neighborhood size and rank in the neighborhood for 4-phoneme Dutch wordforms, the lower right panel shows the joint distribution for singular and plural frequency for monomorphemic Dutch nouns.

What lm() does internally with the factor Class is to code its levels into numeric vectors. Because Class has only two levels, one numeric vector suffices, and because animal precedes plant in the alphabet, animal is mapped onto 0 and plant is mapped onto 1. When we look at the coefficients, we find two of them, as usual. In this example, the intercept represents the mean for animals (the level mapped onto 0, or, in jargon, the level mapped onto the intercept). In other words, the level animal is treated as the default. Special measures have to be taken for the non-default case, in this case the level plant. When dealing with a word for a plant, 0.8547 has to be added to the intercept to obtain the mean for the plants. Compare this with what the t-test tells us (when we force it to treat the variances as equal):

> t.test(animals$Frequency, plants$Frequency, var.equal = TRUE)

        Two Sample t-test

data:  animals$Frequency and plants$Frequency
t = -4.0548, df = 79, p-value = 0.0001168
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.2742408 -0.4351257
sample estimates:
mean of x mean of y
 3.512174  4.366857
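The correspondence between the two analyses can be checked on any data set. The following is a minimal sketch on simulated data: the data frame toy and its values are hypothetical stand-ins, not the items data. It shows the 0/1 coding that lm() builds for a two-level factor, and that the intercept, the Classplant coefficient, and its t value agree with the group means and with the equal-variance t-test (up to sign, because t.test() subtracts in the order animal minus plant).

```r
# Hypothetical stand-in for the items data; all values are simulated.
set.seed(1)
toy <- data.frame(
  Class = factor(rep(c("animal", "plant"), times = c(12, 9))),
  Frequency = c(rnorm(12, mean = 3.5), rnorm(9, mean = 4.4))
)

# Treatment coding: animal (alphabetically first) is coded 0, plant is coded 1.
head(model.matrix(~ Class, data = toy))

fit <- lm(Frequency ~ Class, data = toy)
group.means <- tapply(toy$Frequency, toy$Class, mean)

# Intercept = mean of the default level; Classplant = difference in means.
coef(fit)
group.means

# Equal-variance t-test on the same data (formula interface of t.test()).
tt <- t.test(Frequency ~ Class, data = toy, var.equal = TRUE)
lm.t <- summary(fit)$coefficients["Classplant", "t value"]
c(lm = lm.t, t.test = unname(tt$statistic))  # same magnitude, opposite sign
```

The degrees of freedom also agree: both analyses use the number of observations minus two.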

The function t.test() is restricted to two vectors, but the lm() function can be applied to a factor with more than two levels. By way of example, consider the dataset dverbs, available in the DATA directory, which contains detailed information about the lexical properties of 286 Dutch verbs. One of these lexical variables, nSynV, is the number of verbal synsets in which a given verb appears in the Dutch WordNet. Another variable, Aux, specifies what the appropriate auxiliary for the perfect tense is for that verb. Dutch has two auxiliaries, zijn ('be') and hebben ('have'), and verbs subcategorize as to whether they select only zijn, only hebben, or both (depending on the aspect of the clause). We can test whether the number of synsets varies with auxiliary by modeling nSynV as a function of Aux using the lm() function, but we then use a different function for summarizing the outcome, anova():

> dverbs.lm = lm(nSynV ~ Aux, data = dverbs)
> anova(dverbs.lm)
Analysis of Variance Table

Response: nSynV
           Df  Sum Sq Mean Sq F value    Pr(>F)
Aux         2  117.80   58.90  7.6423 0.0005859 ***
Residuals 282 2173.43    7.71

The anova() function reports the results of an F-test, which yielded an F-value of 7.64. For 2 and 282 degrees of freedom, this is highly significant (compare 1-pf(7.6423, 2, 282)). What this test tells us is that there are significant differences in the mean number of synsets for the three kinds of verbs, but it does not tell us which of the three possible pairwise differences between the means are involved. These three means can be obtained with the function tapply(), the use of which will be explained below.

> tapply(dverbs$nSynV, dverbs$Aux, mean)
  hebben     zijn  zijnheb
3.466981 4.066667 5.068966
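As a side check on the F-test reported by anova(), the following sketch recomputes the F-value and its p-value from the numbers printed in the analysis of variance table. The variable names are illustrative only. Because the mean squares in the table are rounded, the recomputed F-value matches the reported one only approximately.

```r
# Numbers copied from the printed analysis of variance table.
ms.aux <- 58.90   # Mean Sq for Aux
ms.res <- 7.71    # Mean Sq for Residuals (rounded in the printout)

# The F-value is the ratio of the two mean squares.
f.approx <- ms.aux / ms.res
f.approx          # close to the reported 7.6423

# Upper-tail probability of the F distribution with 2 and 282 df.
p.value <- 1 - pf(7.6423, 2, 282)
p.value

# Equivalent, and numerically safer for very small p-values.
p.upper <- pf(7.6423, 2, 282, lower.tail = FALSE)
```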

Some information as to which of these means are really different can be gleaned from the summary:

> summary(dverbs.lm)
...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.4670     0.1907  18.183  < 2e-16 ***
Auxzijn       0.5997     0.7417   0.808 0.419488
Auxzijnheb    1.6020     0.4114   3.894 0.000123 ***

In order to understand the role of the coefficients, you should realize that the auxiliary level hebben is mapped onto the intercept. The t-test for the intercept tells us that 3.47 is unlikely to be zero, which is not of interest to us now. The coefficient of 0.5997, which applies when the Aux factor has the level zijn, tells us that you have to add 0.5997 to the mean of hebben to get the mean of zijn, and the corresponding t-test (p = 0.42) gives us no reason to believe that these two means differ. The last coefficient specifies that in order to get from the mean for hebben to the mean for zijnheb, you have to add 1.6020, and the associated t-test tells us that a difference this large would be surprising if the two means were in fact equal. There is one comparison that is left out in this example (zijn versus zijnheb), and when a factor has more than three levels, there will be more comparisons that do not appear in the table of coefficients. This is because the table lists only those pairwise comparisons that involve the default level, the level that is mapped onto the intercept.
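With the coefficients reported for dverbs.lm, the arithmetic is 3.4670 + 0.5997 = 4.0667 for zijn and 3.4670 + 1.6020 = 5.0690 for zijnheb, in agreement with the group means. One way to obtain the comparison that is missing from the table is to change the default level with relevel(), a standard R function that this section does not itself use. The sketch below does this on simulated data: the data frame toy, the column Aux2, and all values are hypothetical and are not the dverbs data.

```r
# Hypothetical three-level data; the counts are simulated, not from dverbs.
set.seed(2)
toy <- data.frame(
  Aux = factor(rep(c("hebben", "zijn", "zijnheb"), times = c(30, 8, 15))),
  nSynV = c(rpois(30, 3), rpois(8, 4), rpois(15, 5))
)

fit <- lm(nSynV ~ Aux, data = toy)
means <- tapply(toy$nSynV, toy$Aux, mean)

# Intercept plus each coefficient recovers the three group means.
recovered <- coef(fit)[1] + c(0, coef(fit)[-1])
rbind(recovered = as.numeric(recovered), means = as.numeric(means))

# Make zijn the default level, so that the table of coefficients
# now contains the zijn versus zijnheb comparison.
toy$Aux2 <- relevel(toy$Aux, ref = "zijn")
fit2 <- lm(nSynV ~ Aux2, data = toy)
summary(fit2)$coefficients
```

Refitting with a different default level changes only which comparisons are displayed. The fitted values and the overall F-test stay the same.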

There is reason for worry that the above model may be invalid. If we inspect the variances for the three levels,

> tapply(dverbs$nSynV, dverbs$Aux, var)
   hebben      zijn   zijnheb
 5.994165 18.066667 11.503932

we find that they are quite different. Moreover, there are also substantial differences in the number of observations for each level: the verbs that select zijn only are a distinct minority. It makes sense, therefore, to check whether a test that does not depend on the normality assumptions of the lm() function provides further support for the observed difference. The test we use here is the Kruskal-Wallis rank sum test:

> kruskal.test(dverbs$nSynV, dverbs$Aux)

        Kruskal-Wallis rank sum test

data:  dverbs$nSynV and dverbs$Aux
Kruskal-Wallis chi-squared = 11.7206, df = 2, p-value = 0.002850

This non-parametric test supports the hypothesis that the number of synsets is not distributed in the same way across the three kinds of verbs.
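Because the Kruskal-Wallis test uses only the ranks of the observations, its outcome does not change under a monotone transformation such as a log transform. The sketch below illustrates this on simulated data (the data frame toy and its values are hypothetical, not the dverbs data), and also shows the formula interface of kruskal.test() as an alternative to passing the two vectors separately.

```r
# Hypothetical three-level data; the counts are simulated, not from dverbs.
set.seed(3)
toy <- data.frame(
  Aux = factor(rep(c("hebben", "zijn", "zijnheb"), times = c(30, 8, 15))),
  nSynV = c(rpois(30, 3), rpois(8, 4), rpois(15, 5))
)

# Two equivalent ways of calling the test.
kw.vectors <- kruskal.test(toy$nSynV, toy$Aux)
kw.formula <- kruskal.test(nSynV ~ Aux, data = toy)

# A strictly increasing transformation leaves the ranks, and hence the
# test statistic, unchanged.
kw.log <- kruskal.test(log(nSynV + 1) ~ Aux, data = toy)

c(vectors = unname(kw.vectors$statistic),
  formula = unname(kw.formula$statistic),
  logged  = unname(kw.log$statistic))
```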

In document Exploratory data analysis. An introduction to R for the language sciences (Page 91-95)