Distribution tests - Trellis graphics - Baayen2008 analyzing linguistic data with R draft pdf


4.1.1 Distribution tests

It is often useful to know what kind of distribution characterizes one’s data. For instance, since many statistical procedures assume that vectors are normally distributed, it is often necessary to ascertain whether a vector of values is indeed approximately normally distributed. Sometimes, the shape of a distribution is itself of theoretical interest.

By way of example, consider Baayen and Lieber [1997], who studied the frequency distributions of several Dutch derivational prefixes. The frequencies of 985 words with the prefix ver- are available in the data set ver. We plot the estimated density with

> plot(density(ver$Frequency))

As can be seen in the left panel of Figure 4.1, we have a highly skewed distribution with a few high-frequency outliers and most of the probability mass squashed against the vertical axis. It makes sense, therefore, to logarithmically transform these frequencies, in order to remove at least some of the skewness.

> ver$Frequency = log(ver$Frequency)
> plot(density(ver$Frequency))

Figure 4.1: Estimated probability density functions for the Dutch prefix ver-. Left panel: density against frequency; right panel: density against log frequency.

The result is shown in the right panel of Figure 4.1. We now have a bimodal frequency distribution with two clear peaks. The question that arises here is what kind of distribution this might be. Could the logged frequencies follow a normal distribution that happens to have a second bump due to chance?

There are several ways to pursue this question. Let’s first consider visualization by means of a quantile-quantile plot. We graph the quantiles of the standard normal distribution (displayed on the horizontal axis) against the quantiles of the empirical distribution (displayed on the vertical axis). If the empirical distribution is normal (irrespective of mean or variance), its quantiles should be a linear function of those of the standard normal, and the quantile-quantile plot should produce a straight line. The left panel of Figure 4.2 provides an example for 985 random numbers from a normal distribution with mean 4 and standard deviation 3.

> qqnorm(rnorm(length(ver$Frequency), 4, 3))
> abline(v = qnorm(0.025), col = "grey")
> abline(h = qnorm(0.025, 4, 3), col = "grey")

The theoretical and empirical values for the 2.5% percentage points are shown by means of grey lines. The horizontal axis shows the values of the standard normal, ordered from small to large. Around −1.96, 2.5% of the data points have been graphed, and around +1.96, 97.5% of the data points have been covered. The vertical axis shows the quantiles of the random numbers. In this case, 2.5% of the data points have been covered by the time you have reached the value −1.87. Whenever you compare the largest values observed for a given percentage of the ordered data, you will find that the points always lie very near the same line.

Figure 4.2: Quantile-quantile plots (theoretical quantiles on the horizontal axis, sample quantiles on the vertical axis) for a sample of 985 normal(4,3)-distributed random numbers (left) and for the logged frequencies of 985 Dutch derived words with the prefix ver- (right).
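The straight-line behaviour of a quantile-quantile plot for normal data can be checked numerically. The following is a minimal sketch, not part of the original analysis: the grid of probabilities `p` is an arbitrary choice made for illustration. It shows that the quantiles of a normal(4, 3) distribution are a linear function of the standard normal quantiles, with the mean as intercept and the standard deviation as slope.

```r
# Probabilities at which to compare quantiles (arbitrary grid, for illustration).
p <- c(0.025, 0.25, 0.5, 0.75, 0.975)

# Quantiles of the standard normal and of a normal(4, 3) distribution.
z <- qnorm(p)
q <- qnorm(p, mean = 4, sd = 3)

# The two sets of quantiles are related by the straight line q = 4 + 3 * z.
line_ok <- isTRUE(all.equal(q, 4 + 3 * z))
print(line_ok)

# Theoretical positions of the two grey reference lines drawn with abline().
v_line <- qnorm(0.025)        # vertical line on the standard normal axis
h_line <- qnorm(0.025, 4, 3)  # horizontal line on the sample axis
print(round(c(v_line, h_line), 2))
```

For a random sample, the plotted points scatter around this line rather than falling exactly on it, which is why the empirical 2.5% point need not coincide exactly with the theoretical one.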

When we make a quantile-quantile plot for the logged frequencies of words with the Dutch prefix ver-, we obtain a weirdly shaped graph, as shown in the right panel of Figure 4.2.

> qqnorm(ver$Frequency)

The lowest log frequency, zero, represents 27.8% of the words, and this shows up as a horizontal bar of points in the graph. It is clear that we are not dealing with a normal distribution.

Instead of visualizing the distribution, we can make use of two tests. The simplest to use is the SHAPIRO-WILK TEST FOR NORMALITY:

> shapiro.test(ver$Frequency)
Shapiro-Wilk normality test
data: ver$Frequency
W = 0.9022, p-value = < 2.2e-16

This test makes use of a specific test statistic W, which is close to 1 for normally distributed data. The probability of obtaining a value of W as far below 1 as the one observed here, under chance conditions for a normal distribution, is vanishingly small. We can safely reject the null hypothesis that the log-transformed frequencies of words with ver- follow a normal distribution.
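To get a feel for how the test behaves, it can be run on simulated data. The sketch below is an illustration under stated assumptions, not a reanalysis of the ver data: the vector `spiked` is a made-up stand-in that mimics the logged frequencies only in having roughly 28% exact zeros, and all distribution parameters are invented.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

n <- 985
# Hypothetical stand-in for the logged frequencies: about 28% exact zeros,
# the rest drawn from a normal distribution (made-up parameters).
n_zero <- round(0.278 * n)
spiked <- c(rep(0, n_zero), rnorm(n - n_zero, mean = 4, sd = 1.5))

# A genuinely normal sample of the same size, for comparison.
normal_sample <- rnorm(n, mean = 4, sd = 3)

sw_spiked <- shapiro.test(spiked)
sw_normal <- shapiro.test(normal_sample)

# W lies between 0 and 1; values well below 1 speak against normality.
print(c(spiked = unname(sw_spiked$statistic),
        normal = unname(sw_normal$statistic)))
print(c(spiked = sw_spiked$p.value, normal = sw_normal$p.value))
```

The spike of zeros alone should be enough to pull W clearly below the value obtained for the normal sample.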


A second test that can be used is the KOLMOGOROV-SMIRNOV ONE-SAMPLE TEST. Its first argument is the observed vector of values; its second argument is the name of the cumulative distribution function that we want to compare our observed vector with. As we are considering a normal distribution here, this second argument is pnorm. The remaining arguments are the corresponding parameters, in this case the mean and standard deviation, which we estimate from the (log-transformed) frequency vector:

> ks.test(ver$Frequency, "pnorm",
+   mean(ver$Frequency), sd(ver$Frequency))
One-sample Kolmogorov-Smirnov test
data: ver$Frequency
D = 0.1493, p-value < 2.2e-16
alternative hypothesis: two.sided
Warning message:
cannot compute correct p-values with ties

This test produces a test statistic D that is so large that it is very unlikely to arise under the assumption that we are dealing with a normal distribution.

The warning message arises because there are TIES (observations with the same value) in our data. This test presupposes that the input vector is continuous, and in a continuous distribution ties are, strictly speaking, impossible. The reason that we have ties in our data is that word frequency counts are discrete, even though the probabilities of words that we try to estimate with our frequency counts are continuous. A workaround to silence this warning is to add a little bit of noise to the frequency vector with the function jitter(), breaking the ties:

> ver$Frequency[1:5]
[1] 5.541264 5.993961 4.343805 0.000000 7.056175
> jitter(ver$Frequency[1:5])
[1] 5.5179064 6.0002591 4.2696683 0.0373808 6.9965528
> ks.test(jitter(ver$Frequency), "pnorm",
+   mean(ver$Frequency), sd(ver$Frequency))
One-sample Kolmogorov-Smirnov test
data: jitter(ver$Frequency)
D = 0.1493, p-value < 2.2e-16
alternative hypothesis: two.sided
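The effect of ties and of jitter() can be reproduced on a small simulated vector. This is a sketch under assumptions: the counts are hypothetical Poisson draws (not the ver frequencies), chosen only because logs of small integers are bound to contain ties. The exact wording of the warning depends on the version of R.

```r
set.seed(2)  # arbitrary seed, only for reproducibility

# Hypothetical logged counts: logs of small integers, so ties are unavoidable.
counts <- rpois(200, lambda = 3) + 1
x <- log(counts)

# Capture the warning that ks.test() issues when the input has ties
# (NA is returned if no warning is raised).
tie_warning <- tryCatch(
  { ks.test(x, "pnorm", mean(x), sd(x)); NA_character_ },
  warning = function(w) conditionMessage(w)
)
print(tie_warning)

# jitter() adds a small amount of uniform noise, which breaks the ties.
xj <- jitter(x)
print(anyDuplicated(x) > 0)   # are there ties in the raw vector?
print(anyDuplicated(xj) > 0)  # are there ties after jittering?

# As in the text, the mean and sd are still estimated from the original vector.
ks_jit <- ks.test(xj, "pnorm", mean(x), sd(x))
print(ks_jit$statistic)
```

Because the added noise is small relative to the spacing of the values, the test statistic for the jittered vector should stay close to the one for the original vector.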

When dealing with a vector of counts, we may face the question of whether the probabilities of the things counted are all essentially the same. For instance, the most frequent words in an earlier version of the introduction to this book are

> intro = c(75, 68, 45, 40, 39, 39, 38, 33, 24, 24)
> names(intro) = c("the", "to", "of", "you", "is", "a",
+   "and", "in", "that", "data")
> intro
 the   to   of  you   is    a  and   in that data
  75   68   45   40   39   39   38   33   24   24


Are the probabilities of these words (as estimated by their frequencies) essentially the same? We can investigate this with a CHI-SQUARED TEST:

> chisq.test(intro)
Chi-squared test for given probabilities
data: intro
X-squared = 59.7294, df = 9, p-value = 1.512e-09

Unsurprisingly, the chi-squared test produces a test statistic named X-squared that follows a χ²-distribution, in this case with 9 degrees of freedom. (You can check that the p-value reported in this summary equals 1 - pchisq(59.7294, 9).) What this test shows is that the 10 most frequent words do not all have the same probability (frequency). The range of values is just too large. By contrast, the counts in the following vector

> x = c(37, 21, 26, 30, 23, 26, 41, 26, 37, 33)

are much more similar, and the chi-squared test is no longer significant:

> chisq.test(x)
Chi-squared test for given probabilities
data: x
X-squared = 13.5333, df = 9, p-value = 0.1399
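The statistic reported by chisq.test() for the word counts can also be computed by hand, which makes explicit what is being tested. The sketch below assumes the default null hypothesis of equal probabilities for all categories; the variable names `expected`, `x_squared`, `df` and `p_value` are introduced here for illustration only.

```r
intro <- c(75, 68, 45, 40, 39, 39, 38, 33, 24, 24)
names(intro) <- c("the", "to", "of", "you", "is", "a",
                  "and", "in", "that", "data")

# Under the null hypothesis every word has the same probability,
# so each expected count is the total divided by the number of words.
expected <- rep(sum(intro) / length(intro), length(intro))

# Pearson's statistic: sum of (observed - expected)^2 / expected.
x_squared <- sum((intro - expected)^2 / expected)
df <- length(intro) - 1

# Upper-tail probability of the chi-squared distribution.
p_value <- pchisq(x_squared, df, lower.tail = FALSE)

print(round(x_squared, 4))  # 59.7294
print(df)                   # 9
print(signif(p_value, 4))
```

Using lower.tail = FALSE rather than 1 - pchisq(...) gives the same quantity but avoids loss of precision when the p-value is very small.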
