Elementary statistics with R
2.1.2 Continuous distributions
Continuous random variables assume values that are real numbers. Real numbers have the mathematical property that there are infinitely many of them in any interval. This has a far-reaching consequence for probabilities. Consider a random variable that assumes any real value in the interval [0,1] with equal probability. Since there is an infinite number of values in this interval, the probability of any specific value between 0 and 1 is infinitely small, and effectively zero. Fortunately, we can handle probabilities for intervals of real values. In the case of the present example, for instance, the probability of a value in the interval [0,0.5] is equal to the probability of a value in the interval [0.5,1]: both probabilities are 0.5. This property of continuous random variables has consequences for how we can plot their density functions. For the discrete distributions in the preceding section, we were able to plot a vertical line representing the probability for each possible value of the random variable. This is not possible for continuous random variables, as the individual probabilities are zero. Instead, we plot a continuous curve, as shown in Figure 2.5 for the most important continuous random variable, the normal random variable.
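The interval probabilities of the uniform example above can be checked numerically. The following is a minimal sketch, assuming R's built-in punif() for the cumulative probabilities of a uniform random variable on [0,1] (a function not otherwise used in this section); the variable names are chosen for illustration only.

```r
# Probability of a value in [0, 0.5] and in [0.5, 1] for a uniform
# random variable on [0, 1]; punif() gives cumulative probabilities.
p.lower = punif(0.5) - punif(0)
p.upper = punif(1) - punif(0.5)
p.lower
p.upper

# An interval of zero width has probability zero, which is why single
# values of a continuous random variable have probability zero.
p.point = punif(0.3) - punif(0.3)
p.point
```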
Consider the upper left panel, which shows the normal distribution in its simplest form: the case in which its two parameters, the mean µ and the standard deviation σ, are 0 and 1 respectively. This distribution is known as the standard normal distribution. The mean is represented by a vertical dashed line, which meets the curve of the probability density function where it reaches its maximum. The dotted horizontal line segment represents the standard deviation, the parameter that controls the width of the function. We can shift the curve to the left or right by changing the mean, as shown in the right panels, in which the mean is increased from zero to four. We can make the curve narrower or broader by changing the standard deviation, as shown in the bottom panels, where the standard deviation is 0.5 instead of 1.0.
[Figure 2.5: four panels titled normal(0,1), normal(4,1), normal(0,0.5) and normal(4,0.5); horizontal axis x, vertical axis density.]
Figure 2.5: Probability density functions for four normally distributed random variables.
The upper left panel of Figure 2.5 was produced as follows.
> x = seq(-4, 4, 0.1)
> y = dnorm(x)
> plot(x, y, type = "l", xlab = "x", ylab = "density")
> mtext("normal(0, 1)", 3, 1)
> abline(v = 0, lty = 2)
> lines(c(-1, 0), rep(dnorm(-1), 2), lty = 2)
We first defined an interval of values for the standard normal random variable. We then used dnorm() to calculate the values of the standard normal density function. Note that we called dnorm() without further arguments: if you do not specify mean and standard deviation explicitly, dnorm(), and also pnorm(), qnorm(), and rnorm() assume that the mean is zero and the standard deviation is 1. Observe, furthermore, how we used lines() to plot a line segment: the first argument of lines() specifies the x-coordinates, the second argument the y-coordinates.
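The same steps carry over to the other panels once the mean and standard deviation are passed to dnorm() explicitly. The function below is a sketch only: normalplot.fnc is a hypothetical helper that does not appear in the text, and the plotting range of four units on either side of the mean is an assumption read off the axes of Figure 2.5.

```r
# Hypothetical helper (not from the text): draw a normal density with
# the mean marked by a dashed vertical line and the standard deviation
# by a dashed horizontal segment, following the code above.
normalplot.fnc = function(mu = 0, sigma = 1) {
  x = seq(mu - 4, mu + 4, 0.1)
  y = dnorm(x, mu, sigma)
  plot(x, y, type = "l", xlab = "x", ylab = "density")
  mtext(paste("normal(", mu, ", ", sigma, ")", sep = ""), 3, 1)
  abline(v = mu, lty = 2)
  # segment from one standard deviation below the mean up to the mean
  lines(c(mu - sigma, mu), rep(dnorm(mu - sigma, mu, sigma), 2), lty = 2)
  invisible(y)
}

# Example calls for a 2 x 2 layout of four panels:
# par(mfrow = c(2, 2))
# normalplot.fnc(0, 1);   normalplot.fnc(4, 1)
# normalplot.fnc(0, 0.5); normalplot.fnc(4, 0.5)
```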
Figure 2.6 shows the cumulative distribution (upper left) and the quantile function (upper right) for a standard normal random variable. The cumulative distribution was produced by pnorm(), and the quantile function by qnorm().
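The two functions are each other's inverse, which a small check makes concrete. This is a minimal sketch; the three values in x are arbitrary choices for illustration.

```r
# pnorm() maps a value to its cumulative probability; qnorm() maps a
# cumulative probability back to the corresponding value (the quantile).
x = c(-1.5, 0, 1.5)
p = pnorm(x)
p
qnorm(p)   # recovers x, up to floating-point rounding
```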
The lower left panel of Figure 2.6 illustrates how you can calculate the probability that a standard normal random variable has a value between -1 and 0, using pnorm(). Since pnorm() gives the cumulative probability, the area to the left of the dashed vertical line (and under the curve) represents the probability of a value in the interval (−∞, 0]. This is too much: we need the area highlighted with dark grey, without the area highlighted in light grey. What we need to do, therefore, is subtract the light grey area from the area to the left of the dashed line. We can do this in R as follows:
> pnorm(0) - pnorm(-1)
[1] 0.3413447
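That this difference is indeed an area under the density curve can be verified by numerical integration. The sketch below assumes R's general-purpose integrate() function, which is not used in the text; it is offered as a cross-check rather than as the recommended way of obtaining such probabilities.

```r
# Area under the standard normal density between -1 and 0,
# obtained by numerical integration of dnorm().
area = integrate(dnorm, lower = -1, upper = 0)$value
area

# The same probability as a difference of cumulative probabilities.
diff.cum = pnorm(0) - pnorm(-1)
all.equal(area, diff.cum)
```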
The final panel of Figure 2.6 returns to the probability density function. The shaded areas in the tails of the distribution each represent a probability of 0.025. In other words, the shaded areas highlight the 5% of extreme values in the distribution. The remaining area under the curve that is not shaded represents the 95% of values that are not 'extreme' (given the rather arbitrary cutoff point of 5% for being extreme).
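The boundaries of the shaded tails can be obtained with qnorm(). The following sketch, with illustrative variable names, computes the two cutoff points and confirms the probabilities of the tails and of the unshaded central region.

```r
# Cutoff points that leave a probability of 0.025 in each tail.
cutoffs = qnorm(c(0.025, 0.975))
cutoffs          # approximately -1.96 and 1.96

# Probability in each tail, and in the unshaded central region.
left.tail = pnorm(cutoffs[1])
right.tail = 1 - pnorm(cutoffs[2])
central = pnorm(cutoffs[2]) - pnorm(cutoffs[1])
c(left.tail, right.tail, central)
```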
The lower right panel of Figure 2.6 was produced with the following code.
> x = seq(-3, 3, 0.01)
> plot(x, dnorm(x), type = "l")
> abline(h = 0)
> x1 = seq(-3, qnorm(0.025), 0.005)
> y1 = dnorm(x1, 0, 1)
> polygon(c(x1, rev(x1)), c(rep(0, length(x1)), rev(y1)),
+   col = "lightgrey")
> x1 = seq(qnorm(0.975), 3, 0.005)
> y1 = dnorm(x1, 0, 1)
> polygon(c(x1, rev(x1)), c(rep(0, length(x1)), rev(y1)),
+   col = "lightgrey")
[Figure 2.6: four panels, in order: pnorm(x) against x, qnorm(p) against p, pnorm(x) against x, and dnorm(x) against x.]
Figure 2.6: Cumulative distribution function, quantile function, and probability density function for the standard normal distribution.
The production of the density curve should now be easy to follow. What is new is how to produce the shaded areas. This is accomplished with the polygon() function, which takes a sequence of X and Y coordinates, connects the corresponding points, and fills the area(s) enclosed with a specified color, if so instructed. For the left tail, we therefore specified the X-coordinates from left to right, and then from right to left. The corresponding Y-coordinates are all the zeros necessary to get from -3 to -1.96 (qnorm(0.025)), and then the Y-coordinates of the density in reverse order to get back to where we began.
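The same recipe works for any interval, for instance the area between -1 and 0 discussed earlier. The function below is a sketch under stated assumptions: shade.fnc is a hypothetical helper that is not part of the text, and the step size of 0.005 and the default color are simply carried over from, or chosen to resemble, the code above.

```r
# Hypothetical helper (not from the text): shade the area under the
# standard normal density between the values a and b with polygon().
shade.fnc = function(a, b, col = "darkgrey") {
  xs = seq(a, b, 0.005)
  ys = dnorm(xs)
  # bottom edge from left to right, then the density curve back again
  polygon(c(xs, rev(xs)), c(rep(0, length(xs)), rev(ys)), col = col)
  invisible(list(x = xs, y = ys))
}

# Example call, after drawing the density curve:
# x = seq(-3, 3, 0.01)
# plot(x, dnorm(x), type = "l")
# shade.fnc(-1, 0)
```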
A nice property of the normal distribution is that it is very easy to transform a normal random variable with mean µ ≠ 0 and σ ≠ 1 into a standard normal random variable with mean µ = 0 and σ = 1. Here is a function that does this for all the values brought together in a vector v:
> std.fnc = function(v) {
+   return((v - mean(v)) / sd(v))
+ }
What we do here is subtract the mean, and then divide by the standard deviation. We have already encountered the function mean(); the function sd() provides one's best guess of the standard deviation σ for the vector of observations. To see how this works in practice, we will use the function for generating normally distributed random numbers, rnorm():
> x=rnorm(10, 3, 0.1)
> x
[1] 2.959219 3.002300 2.892054 3.070360 3.081620 2.950157
[7] 3.018608 3.043321 2.863275 2.944465
> xsd = std.fnc(x)
> xsd
[1] -0.3180837  0.2695757 -1.2342677  1.1979545  1.3515511
[6] -0.4417013  0.4920236  0.8291266 -1.6268356 -0.5193433
Note that a normal random variable with mean 3 and a small standard deviation of 0.1 is unlikely to have values below zero. In fact, it is very unlikely to have values more than 3 standard deviations (0.3) from the mean (3). When we standardize the random numbers in x, we obtain random numbers that are nicely centered around zero, and that show a much greater spread around the new mean of zero. But now, it is very unlikely to observe values that are more than 3 units away from the mean.
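That the standardized numbers have mean zero and standard deviation one can be confirmed directly. The sketch below repeats std.fnc() so that it runs on its own, uses an arbitrary seed (so the random numbers differ from those shown above), and assumes R's built-in scale() as an independent point of comparison.

```r
# The function from the text, repeated so that this sketch is self-contained.
std.fnc = function(v) {
  return((v - mean(v)) / sd(v))
}

set.seed(1)                  # arbitrary seed, for reproducibility only
x = rnorm(10, 3, 0.1)
xsd = std.fnc(x)

mean(xsd)                    # zero, up to floating-point rounding
sd(xsd)                      # one

# scale() is a built-in alternative; it returns a one-column matrix,
# so we convert its result to a plain vector before comparing.
all.equal(xsd, as.vector(scale(x)))
```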
In the past, the standard normal distribution was especially important for the end user as it was only for the standard normal distribution that tables were available of the cumulative distribution function. In order to use these tables, one had to standardize first. In R, this is no longer necessary. We can use pnorm() with the mean and standard deviation of our choice,
> pnorm(0, 1, 3) - pnorm(-1, 1, 3)
[1] 0.1169488
or we can standardize first, and then drop mean and standard deviation from pnorm().
> pnorm(-1/3) - pnorm(-2/3)
[1] 0.1169488
In both cases, the outcome is the same.
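The standardized boundaries -1/3 and -2/3 follow from subtracting the mean and dividing by the standard deviation. A minimal sketch that spells out this step, with illustrative variable names:

```r
# Probability that a normal random variable with mean 1 and standard
# deviation 3 falls between -1 and 0, computed in two ways.
mu = 1
sigma = 3
lower = -1
upper = 0

# Directly, passing mean and standard deviation to pnorm().
p.direct = pnorm(upper, mu, sigma) - pnorm(lower, mu, sigma)

# After standardizing the two boundaries: (value - mean) / sd.
z.lower = (lower - mu) / sigma    # -2/3
z.upper = (upper - mu) / sigma    # -1/3
p.std = pnorm(z.upper) - pnorm(z.lower)

c(p.direct, p.std)
```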
The square of the standard deviation is known as the variance. Compare
> v = rnorm(20, 4, 2)
> sd(v)
[1] 2.113831
> sqrt(var(v))
[1] 2.113831
The variance is a measure for how much the values diverge from the mean. At first blush, one might think a measure averaging divergences from the mean would do a sensible job, but this mean is zero (here, up to floating-point rounding error), because the positive and negative divergences cancel out:
> mean(v - mean(v))
[1] -5.32907e-16
The variance is defined as a kind of average of the squared divergences from the mean,
> sum((v - mean(v))^2)/(length(v) - 1)
[1] 4.46828
> var(v)
[1] 4.46828
where we divide, for technical reasons, not by the number of elements in the vector but by that number minus one. This definition of the variance of a random variable is not specific to normal random variables, even though the mathematical expression for the variance varies from distribution to distribution. For instance, the variance of a Poisson random variable is equal to its mean λ, and the variance of a binomial random variable is n · p · (1 − p).
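The two closing claims can be checked by simulation. The following is a minimal sketch, assuming R's random number generators rpois() and rbinom(); the seed, the sample size and the parameter values are arbitrary choices made only for illustration.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only

# Poisson: the variance should be close to the mean lambda.
lambda = 4
pois = rpois(100000, lambda)
c(mean(pois), var(pois))

# Binomial: the variance should be close to n * p * (1 - p).
n = 10
p = 0.3
binom = rbinom(100000, n, p)
c(var(binom), n * p * (1 - p))
```

With samples this large, the estimated variances should differ from the theoretical values only in the second decimal place or beyond.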
A distribution that is closely related to the normal distribution is the t-distribution, which has one parameter known as the degrees of freedom. This parameter controls the thickness of the tails of the distribution, as illustrated in the upper left panel of Figure 2.7. The grey line represents the standard normal distribution, the solid line a t-distribution with 2 degrees of freedom, and the dashed line a t-distribution with 5 degrees of freedom. As the degrees of freedom increase, the probability density function becomes more and more similar to that of the standard normal. For 30 or more degrees of freedom, the curves are already very similar, and for more than 100 degrees of freedom, they are virtually indistinguishable. The t-distribution plays an important role in many statistical tests, and we will use it frequently in the remainder of this book. R makes the by now familiar four functions available for this distribution: dt(), pt(), qt() and rt(). Of these functions, the cumulative distribution function is the one we will use most. Here, we use it to illustrate the greater thickness of the tails of the t-distribution:
> pnorm(-3, 0, 1)
[1] 0.001349898
> pt(-3, 2)
[1] 0.04773298
Note that the probability of observing extreme values (values less than -3 in this example) is greater for the t-distribution.
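How quickly the tails thin out as the degrees of freedom increase can be seen by repeating this comparison for several values. A minimal sketch, in which the particular degrees of freedom are chosen to match those mentioned in the text:

```r
# Probability of a value below -3 for t-distributions with increasing
# degrees of freedom, compared with the standard normal.
dfs = c(2, 5, 30, 100)
t.tails = pt(-3, dfs)
names(t.tails) = paste("df =", dfs)
t.tails
pnorm(-3)
```

The tail probabilities decrease with the degrees of freedom and approach, but stay above, the value for the standard normal.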
There are two other continuous probability distributions that are crucial for many statistical tests: the F-distribution and the χ²-distribution. The F-distribution has two parameters, referred to as 'degrees of freedom 1' and 'degrees of freedom 2'. The upper right panel of Figure 2.7 shows the probability density function of the F-distribution for 4 different combinations of degrees of freedom. The ratio of two variances is F-distributed, and a question that often arises is whether the variance in the numerator is so much larger than the variance in the denominator that we have reason to be surprised. For instance, if this ratio is 6, then, depending on the degrees of freedom involved with the two variances (an issue that we will discuss in more detail below), the probability of a value at least this large may be small (surprise) or large (no surprise).
> 1 - pf(6, 1, 1)
[1] 0.2467517
> 1 - pf(6, 20, 8)
[1] 0.006905409
Here, pf() is the cumulative distribution function, which gives the probability of a ratio less than or equal to 6. Consequently, the probability of 6 or more is 1 minus this cumulative probability. This probability gives the proportion of values that are at least as extreme as the observed value of 6. The smaller this proportion, the more reason there is for surprise.
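The upper-tail probability can also be requested directly, and it can be compared with simulated variance ratios. The sketch below rests on two assumptions not stated in the text so far: that the lower.tail argument of pf() is used to obtain the upper tail, and that the degrees of freedom of a sample variance equal the sample size minus one, so that samples of 21 and 9 normal random numbers correspond to 20 and 8 degrees of freedom. The seed and the number of simulation runs are arbitrary.

```r
# Upper-tail probability for an F-value of 6 with 20 and 8 degrees of
# freedom, computed in two equivalent ways.
p1 = 1 - pf(6, 20, 8)
p2 = pf(6, 20, 8, lower.tail = FALSE)
c(p1, p2)

# Simulation check: ratios of the variances of two independent normal
# samples of sizes 21 and 9 (20 and 8 degrees of freedom).
set.seed(1)                       # arbitrary seed, for reproducibility only
ratios = replicate(20000, var(rnorm(21)) / var(rnorm(9)))
mean(ratios >= 6)                 # should be close to p1
```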
The lower panels of Figure 2.7 show the probability density functions for three χ²-distributions. The χ²-distribution has a single parameter, which is also referred to as its degrees of freedom. The lower left panel shows the density function for 1 degree of freedom, the lower right panel gives the densities for 5 (solid line) and 10 (dashed line) degrees of freedom. As in the case of the F-distribution, the χ²-distribution is used to assess whether an observed value is surprisingly high.
> 1 - pchisq(4, 1)
[Figure 2.7: four panels titled t-distributions, F-distributions, chi-squared(1) distribution and chi-squared distributions; horizontal axis x, vertical axis density.]
Figure 2.7: Probability density functions. Upper left: t-distributions with 2 (solid line) and 5 (dashed line) degrees of freedom, and the standard normal (grey line). Upper right: F-distributions with 5,5 (black, solid line), 2,1 (grey, dashed line), 5,1 (grey, solid line) and 10,10 (black, dashed line) degrees of freedom. Lower left: a χ²-distribution with 1 degree of freedom. Lower right: χ²-distributions with 5 (solid line) and 10 (dashed line) degrees of freedom.
Note that an observed value is less likely to be extreme when the number of degrees of freedom is large. This is not surprising given the lower panels of Figure 2.7. For 1 degree of freedom, a 4 is already pretty extreme. But for 5 degrees of freedom, a 4 is more or less in the center of the distribution, and for 10 degrees of freedom, it is in fact a rather low value instead of a very high value.
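The comparison described here can be sketched by evaluating the upper-tail probability of the value 4 for the three degrees of freedom shown in Figure 2.7. The code relies on pchisq() accepting a vector of degrees of freedom; the variable names are illustrative.

```r
# Upper-tail probability of the value 4 for chi-squared distributions
# with 1, 5 and 10 degrees of freedom.
dfs = c(1, 5, 10)
upper.tail = 1 - pchisq(4, dfs)
names(upper.tail) = paste("df =", dfs)
round(upper.tail, 3)   # 0.046, 0.549, 0.947
```

For 1 degree of freedom fewer than 5% of the values are 4 or larger, whereas for 10 degrees of freedom nearly 95% are.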