Elementary statistics with R
2.2 Basic statistical tests
2.2.1 Tests for single vectors
Distribution tests
Baayen and Lieber [1997] studied the frequency distributions of several Dutch derivational prefixes. The frequencies of 985 words with the prefix ver- are available in the data frame ver.dat.txt in the DATA directory. Let’s load this dataset, logarithmically transform the frequencies, and plot the estimated probability density function:
> ver = read.table("DATA/ver.dat.txt", header = TRUE)
> ver$Frequency = log(ver$Frequency)
> plot(density(ver$Frequency))
As you can see in Figure 2.8, this frequency distribution is bimodal: There are two clear peaks. The question that arises here is what kind of distribution this might be.
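The same steps can be tried without the data file. The sketch below uses simulated values, not the ver- frequencies: it draws a hypothetical two-component mixture, estimates its density, and looks for local maxima of the estimated curve. The mixture parameters and the names x, d and peaks are assumptions made for illustration only.

```r
set.seed(1)

# Simulated "log frequencies" (an assumption, not the ver- data):
# one cluster of values around 2 and one around 7.
x <- c(rnorm(500, mean = 2, sd = 0.7), rnorm(500, mean = 7, sd = 1))

d <- density(x)   # kernel density estimate with the default bandwidth
plot(d)           # two humps should be visible in the plot

# A local maximum is a grid point where the slope of the estimated
# density changes from positive to negative.
peaks <- which(diff(sign(diff(d$y))) == -2) + 1
d$x[peaks]        # locations of the estimated modes
```

Note that small bumps in the tails can show up as additional local maxima, so the plot remains the more reliable guide to how many clear peaks there are.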
As many statistical techniques presuppose a normal distribution, we can ask the more specific question of whether this might be a normal distribution with an odd shape due to chance. There are two tests that are convenient for answering this question. The simplest to use is the Shapiro-Wilk test for normality:
> shapiro.test(ver$Frequency)
Shapiro-Wilk normality test
data: ver$Frequency
W = 0.9022, p-value = < 2.2e-16
Figure 2.8: Estimated probability density function of the log frequencies of words with the Dutch prefix ver-.
This test makes use of a specific test statistic W. Values of W close to 1 are expected for a normal distribution, and the probability that W is as small as 0.9022 under chance conditions for a normal distribution is vanishingly small. In other words, we can safely reject the idea that the distribution of log frequency is normal.
A second test that might be used is the Kolmogorov-Smirnov one-sample test. Its first argument is the observed vector of values, its second argument is the name of the cumulative distribution function that we want to compare our observed vector with. As we are considering a normal distribution, this second argument is pnorm. The remaining arguments are the corresponding parameters, in this case, the mean and standard deviation which we estimate from the frequency vector:
> ks.test(ver$Frequency, "pnorm",
+ mean(ver$Frequency), sd(ver$Frequency))
This test produces a test statistic D, the largest distance between the cumulative distribution of the observed values and that of the reference distribution. Again, a value of D as large as the observed one is very unlikely under the assumption that we are dealing with a normal distribution. The call also produces a warning:
Warning message:
cannot compute correct p-values with ties in:
ks.test(ver$Frequency, "pnorm",
mean(ver$Frequency), sd(ver$Frequency))
The warning message arises because there are ties (observations with the same value) in our data. This test presupposes that the input vector is continuous, and in a continuous distribution ties are, strictly speaking, impossible. The reason that we have ties in our data is that word frequency counts are discrete, even though the probabilities of words that we try to estimate with our frequency counts are continuous. A workaround to silence this annoying warning is to add a little bit of noise to the frequency vector with the function jitter() in order to break the ties:
> ver$Frequency[1:5]
[1] 5.541264 5.993961 4.343805 0.000000 7.056175
> jitter(ver$Frequency[1:5])
[1] 5.5179064 6.0002591 4.2696683 0.0373808 6.9965528
> ks.test(jitter(ver$Frequency), "pnorm",
+ mean(ver$Frequency), sd(ver$Frequency))
With the ties broken, the test runs without the warning.
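To get a feel for how the two normality tests behave, it can help to run them on data whose distribution is known. The following is a minimal sketch on simulated vectors, not on the ver- frequencies; the sample sizes, means and standard deviations are assumptions chosen for illustration. It relies on the documented behaviour of ks.test(), whose second argument names a cumulative distribution function such as pnorm.

```r
set.seed(2)

# Simulated data (assumptions, not the ver- frequencies).
normal_x  <- rnorm(300, mean = 5, sd = 2)
bimodal_x <- c(rnorm(150, mean = 2, sd = 0.7),
               rnorm(150, mean = 8, sd = 0.7))

# Shapiro-Wilk: W close to 1 is compatible with normality,
# a small W is evidence against it.
sw_normal  <- shapiro.test(normal_x)
sw_bimodal <- shapiro.test(bimodal_x)

# Kolmogorov-Smirnov: "pnorm" names the cumulative distribution function;
# its parameters (mean, sd) follow as further arguments.
ks_bimodal <- ks.test(bimodal_x, "pnorm", mean(bimodal_x), sd(bimodal_x))

# Rounding creates ties, much as counts do; jitter() breaks them again.
tied_x      <- round(bimodal_x, 1)
ks_jittered <- ks.test(jitter(tied_x), "pnorm", mean(tied_x), sd(tied_x))

c(W_normal   = unname(sw_normal$statistic),
  W_bimodal  = unname(sw_bimodal$statistic),
  D_bimodal  = unname(ks_bimodal$statistic),
  D_jittered = unname(ks_jittered$statistic))
```

The bimodal sample should yield a clearly smaller W than the normal sample, and jittering should change D only slightly, since the added noise is small compared with the spread of the data.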
If you have a vector of counts, the question may arise whether the probabilities of the things counted are all equal. For instance, suppose that we compile a table of the 10 most frequent function words in the introduction of this book
> intro = rev(sort(table(read.table("DATA/intro.txt", header = TRUE))))[1:10]
> intro
 the   to   of    a  you   is   in  and that data
  44   39   33   31   28   28   23   23   18   17
and ask ourselves whether the probabilities of these words are all the same. We can test this with a chi-squared test using the function chisq.test():
> chisq.test(intro)
Chi-squared test for given probabilities
data: intro
X-squared = 23.9577, df = 9, p-value = 0.004369
Unsurprisingly, the chi-squared test produces a test statistic, X-squared, that follows a χ²-distribution, in this case with 9 degrees of freedom. You can check that the p-value reported in the above summary can also be calculated using 1-pchisq(23.9577, 9). What this test shows for this example is that the 10 most frequent function words do not have equal probabilities: their counts are not uniformly distributed.
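The test statistic itself is easy to compute by hand, which makes clear what chisq.test() does when no probabilities are given. The sketch below types in the ten counts from the table above instead of reading DATA/intro.txt; the variable names (counts, expected, x_squared, deg_free, p_value) are chosen here for illustration.

```r
# Counts copied from the table of the ten most frequent words.
counts <- c(44, 39, 33, 31, 28, 28, 23, 23, 18, 17)
names(counts) <- c("the", "to", "of", "a", "you",
                   "is", "in", "and", "that", "data")

# Under equal probabilities, every word has the same expected count.
expected <- rep(sum(counts) / length(counts), length(counts))

# Sum of squared deviations from the expected counts,
# each scaled by the expected count.
x_squared <- sum((counts - expected)^2 / expected)
deg_free  <- length(counts) - 1
p_value   <- 1 - pchisq(x_squared, deg_free)

round(c(X.squared = x_squared, df = deg_free), 4)
p_value
```

The value of x_squared agrees with the X-squared reported by chisq.test(intro), and p_value is the same upper tail probability that the function reports as its p-value.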
Tests for the mean
The question may arise whether the mean of a vector of observations has a particular value. Consider, for instance, the acoustic length of the vowel in the Dutch prefix ont-. A phonetically rather uninformed hypothesis would be that since there are three phonemes, the mean length of the vowel should be one third of the mean length of the prefix as a whole. Pluymaekers et al. [2004] studied the temporal properties of this prefix and its segments, and their data are available as data.ont.txt in the DATA directory. Let’s load these data into R, and test this hypothesis by first calculating the mean length of the prefix, and then using the function t.test():
> ont = read.table("DATA/data.ont.txt", header = TRUE)
> meanLengthPrefix = mean(ont$lengteprefix)
> t.test(ont$lengteprefixklinker, mu = meanLengthPrefix/3)
One Sample t-test
data: ont$lengteprefixklinker
t = 6.3651, df = 101, p-value = 5.797e-09
alternative hypothesis: true mean is not equal to 0.04960906
95 percent confidence interval:
0.05860697 0.06675495
sample estimates:
mean of x
0.06268096
In this example, we used the t.test() function to carry out a one sample t-test. The test statistic of the t-test is t, which follows a t-distribution with, in this case, 101 degrees of freedom. The p-value given in the summary is easy to calculate yourself, using
> 2 * (1 - pt(abs(6.3651), 101))
[1] 5.796028e-09
Note that R carries out a two-tailed test by default. If you need a one-tailed test, you have to add the option alternative="less" or alternative="greater". It is clear that the observed mean length of the vowel differs significantly from (and is larger than) one third of the mean length of the prefix.
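The t-value itself can also be reconstructed from the sample mean, the sample standard deviation and the sample size. The following sketch does this for a simulated vector of durations, not for the ont- data; the sample size, the simulated mean and standard deviation, and the hypothesised mean mu are assumptions for illustration.

```r
set.seed(3)

# Simulated vowel durations in seconds (an assumption, not the ont- data).
x  <- rnorm(102, mean = 0.056, sd = 0.02)
mu <- 0.05   # hypothesised mean

# t is the distance between sample mean and hypothesised mean,
# expressed in units of the standard error of the mean.
t_value  <- (mean(x) - mu) / (sd(x) / sqrt(length(x)))
deg_free <- length(x) - 1

p_two_sided <- 2 * (1 - pt(abs(t_value), deg_free))  # two-tailed
p_greater   <- 1 - pt(t_value, deg_free)             # one-tailed, "greater"

res         <- t.test(x, mu = mu)
res_greater <- t.test(x, mu = mu, alternative = "greater")

c(manual = t_value, t.test = unname(res$statistic))
c(manual = p_two_sided, t.test = res$p.value)
c(manual = p_greater, t.test = res_greater$p.value)
```

When t is positive, the one-tailed p-value for alternative="greater" is half the two-tailed p-value, which is why the choice between the two has to be made before looking at the data.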
The t-test is a valid test for data that are more or less normally distributed. It should not be used for variables with skewed distributions. For such variables, the one sample Wilcoxon test, implemented in the function wilcox.test(), should be used. We leave it as an exercise to the reader to check that in our example vowel length is approximately normally distributed. Note that when we apply the Wilcoxon test, we obtain a p-value that is somewhat larger (although still quite small) compared to that of the t-test.
> wilcox.test(ont$lengteprefixklinker, mu = meanLengthPrefix/3)
Wilcoxon signed rank test with continuity correction
data: ont$lengteprefixklinker
V = 4212, p-value = 1.216e-07
alternative hypothesis: true mu is not equal to 0.04960906
This is usually the case when the p-values of these two tests are compared. For normal random variables, wilcox.test() is slightly less good at detecting surprise than t.test(), but it still does a good job when the t-test is inapplicable.
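How much the two tests differ for normal data can be explored with a small simulation. The sketch below is an illustration under stated assumptions: 500 simulated samples of 20 normally distributed observations whose true mean lies half a standard deviation away from the hypothesised value. These numbers, and the names n_sim, n, mu0 and shift, are arbitrary choices made here.

```r
set.seed(4)

n_sim <- 500   # number of simulated samples (assumption)
n     <- 20    # observations per sample (assumption)
mu0   <- 0     # hypothesised mean
shift <- 0.5   # true mean lies this far from mu0 (assumption)

reject <- replicate(n_sim, {
  x <- rnorm(n, mean = mu0 + shift)   # normal data, null hypothesis is false
  c(t        = t.test(x, mu = mu0)$p.value < 0.05,
    wilcoxon = wilcox.test(x, mu = mu0)$p.value < 0.05)
})

# Proportion of simulated samples in which each test rejects the
# (false) null hypothesis at the 5% level.
rowMeans(reject)
```

The two proportions estimate the power of each test in this particular setting; with other sample sizes or shifts the gap between them will differ.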
When you have two vectors of observations, it is important to distinguish between independent vectors and paired vectors. In the case of independent vectors, the observations in the one vector are not linked in a systematic way to the observations in the other vector. Consider, for instance, sampling 100 words at random from a frequency list compiled for corpus A, and then sampling another 100 words at random from a frequency list compiled for corpus B. The two vectors of frequencies contain independent observations, and can be compared in various ways in order to address differences in frequency of use between the two corpora. As an example of paired observations, consider the case in which a specific list of 100 word types is compiled, with for each word type its frequency in corpus A and its frequency in corpus B. The observations in the two vectors are now paired: the frequencies are tied, pairwise, to a given word. For such paired vectors, more powerful tests are available. In what follows, we first discuss tests for independent vectors. We then proceed to the case of paired vectors.
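Why pairing matters can be previewed with simulated data. The sketch below is only an illustration: it invents log frequencies for 100 hypothetical word types in two corpora, with a small systematic difference between the corpora, and compares a t-test that ignores the pairing with one that uses it through the paired argument of t.test(). All numbers and the names word_effect, freq_a and freq_b are assumptions.

```r
set.seed(5)

# Simulated log frequencies for 100 word types (assumptions, not real data).
# Each word has its own baseline frequency, shared by both corpora.
word_effect <- rnorm(100, mean = 4, sd = 2)
freq_a <- word_effect + rnorm(100, sd = 0.3)
freq_b <- word_effect + 0.15 + rnorm(100, sd = 0.3)   # small shift in corpus B

independent <- t.test(freq_a, freq_b)                 # ignores the pairing
paired      <- t.test(freq_a, freq_b, paired = TRUE)  # uses the pairing

c(independent = independent$p.value, paired = paired$p.value)

# The paired test is a one sample test on the pairwise differences.
one_sample <- t.test(freq_a - freq_b)
c(paired = unname(paired$statistic), one_sample = unname(one_sample$statistic))
```

Because the large differences between words cancel out in the pairwise differences, the paired test can detect a small shift that the test for independent vectors is likely to miss.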