2.3 Inferring Population Parameters from Sample Parameters
Thus far, we have focused on statistics that describe a sample in various ways. A sample, however, is usually only a subset of the population. Given the statistics of a sample, what can we infer about the corresponding population parameters? If the sample is small or if the population is intrinsically highly variable, there is not much we can say about the population. However, if the sample is large, there is reason to hope that the sample statistics are a good approximation to the population parameters. We now quantify this intuition.
Our point of departure is the central limit theorem, which states that the sum of n independent random variables, for large n, is approximately normally distributed (see Section 1.7.5). Suppose that we collect a set of m samples, each with n elements, from some population. (In the rest of the discussion, we will assume that n is large enough that the central limit theorem applies.) If the elements of each sample are independently and randomly selected from the population, we can treat the sum of the elements of each sample as the sum of n independent and identically distributed random variables $X_1, X_2, \ldots, X_n$. That is, the first element of the sample is the value assumed by the random variable $X_1$, the second element is the value assumed by the random variable $X_2$, and so on. From the central limit theorem, the sum of these random variables is normally distributed. The mean of each sample is the sum divided by a constant, so the mean of each sample is also normally distributed. This fact allows us to determine a range of values where, with high confidence, the population mean can be expected to lie.
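To make this more concrete, refer to Figure 2.3 and consider sample 1. The mean of this sample is $\bar{x}_1 = \frac{1}{n}\sum_{i=1}^{n} x_{1i}$. Similarly, $\bar{x}_2 = \frac{1}{n}\sum_{i=1}^{n} x_{2i}$, and, in general, $\bar{x}_k = \frac{1}{n}\sum_{i=1}^{n} x_{ki}$, where $x_{ki}$ denotes the ith element of sample k.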
Define the random variable $\bar{X}$ as taking on the values $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_m$. The distribution of $\bar{X}$ is called the sampling distribution of the mean. From the central limit theorem, $\bar{X}$ is approximately normally distributed. Moreover, if the elements are drawn from a population with mean $\mu$ and variance $\sigma^2$, we have already seen that $E(\bar{X}) = \mu$ (Equation 2.6) and $V(\bar{X}) = \sigma^2/n$ (Equation 2.9). These are, therefore, the
parameters of the corresponding normal distribution, or $\bar{X} \sim N(\mu, \sigma^2/n)$. Of course, we do not know the true values of $\mu$ and $\sigma^2$.
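A short simulation can make the sampling distribution of the mean more tangible. The sketch below is illustrative only: the exponential population (mean 1, variance 1), the values of n and m, and the helper name `sample_means` are assumptions chosen for the demonstration, not taken from the text.

```python
import random
import statistics

def sample_means(n, m, seed=0):
    """Draw m samples of n elements each from an exponential population
    with rate 1 (mean 1, variance 1) and return the m sample means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(m)
    ]

n, m = 50, 5000            # assumed sizes, chosen only for illustration
means = sample_means(n, m)

mean_of_means = statistics.fmean(means)
var_of_means = statistics.pvariance(means)

print("mean of the sample means:    ", mean_of_means)   # should be close to 1
print("variance of the sample means:", var_of_means)    # should be close to 1/n
print("population variance / n:     ", 1.0 / n)
```

Although the population here is far from normal, the sample means cluster around the population mean with a variance close to the population variance divided by n, as the central limit theorem suggests.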
If we know $\sigma^2$, we can estimate a range of values in which $\mu$ will lie, with high probability, as follows. For any normally distributed random variable $Y \sim N(\mu_Y, \sigma_Y^2)$, we know that 95% of the probability mass lies within 1.96 standard deviations of its mean and that 99% of the probability mass lies within 2.576 standard deviations of its mean. So, for any value y:
$P(\mu_Y - 1.96\sigma_Y < y < \mu_Y + 1.96\sigma_Y) = 0.95$ (EQ 2.16)
The left and right endpoints of this range are called the critical values at the 95% confidence level: An observation will lie beyond the critical values, assuming that the true mean is $\mu_Y$, in less than 5% of observed samples (or 1% at the 99% confidence level). This can be rewritten as
$P(|\mu_Y - y| < 1.96\sigma_Y) = 0.95$ (EQ 2.17)
Therefore, from symmetry of the absolute value:
$P(y - 1.96\sigma_Y < \mu_Y < y + 1.96\sigma_Y) = 0.95$ (EQ 2.18)
In other words, given any value y drawn from a normal distribution whose mean is $\mu_Y$, we can estimate a range of values where $\mu_Y$ must lie with high probability (i.e., 95% or 99%). This is called the confidence interval for $\mu_Y$.
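The multipliers 1.96 and 2.576 can be checked numerically. The following is a minimal sketch using Python's standard `statistics.NormalDist`; the helper name `normal_critical_value` is an assumption introduced here for illustration.

```python
from statistics import NormalDist

def normal_critical_value(confidence):
    """Return z such that P(-z < Z < z) = confidence for Z ~ N(0, 1)."""
    tail = (1.0 - confidence) / 2.0           # probability mass in each tail
    return NormalDist().inv_cdf(1.0 - tail)   # quantile of the standard normal

z95 = normal_critical_value(0.95)
z99 = normal_critical_value(0.99)
print(round(z95, 3), round(z99, 3))   # prints: 1.96 2.576
```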
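We just saw that $\bar{X} \sim N(\mu, \sigma^2/n)$. Therefore, given the sample mean $\bar{x}$:
$P\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$ (EQ 2.19)
and
$P\left(\bar{x} - 2.576\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 2.576\frac{\sigma}{\sqrt{n}}\right) = 0.99$ (EQ 2.20)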
Assuming that we knew the sample mean and $\sigma^2$, this allows us to compute the range of values where the population mean will lie with 95% or 99% confidence.
Note that a confidence interval is constructed from the observations in such a way that there is a known probability, such as 95% or 99%, of it containing the population parameter of interest. It is not the population parameter that is the random variable; the interval itself is the random variable.
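One way to see that the interval, not the parameter, is random is to repeat the experiment many times with a fixed population mean and count how often the computed interval contains it. The sketch below assumes a normal population with known standard deviation; the parameter values and the helper name `coverage` are hypothetical choices for illustration.

```python
import math
import random
import statistics

def coverage(mu, sigma, n, trials, z=1.96, seed=1):
    """Fraction of trials in which the interval
    x_bar +/- z * sigma / sqrt(n) contains the fixed population mean mu."""
    rng = random.Random(seed)
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        # A fresh sample gives a fresh sample mean, hence a fresh interval.
        x_bar = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        if x_bar - half_width < mu < x_bar + half_width:
            hits += 1
    return hits / trials

fraction = coverage(mu=1.0, sigma=2.0, n=30, trials=5000)
print(fraction)   # expected to be close to 0.95
```

The population mean never changes across trials; only the interval moves, and roughly 95% of the intervals should cover the mean.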
The situation is graphically illustrated in Figure 2.4. Here, we assume that the population is normally distributed with mean 1. The variance of the sampling distribution (i.e., of $\bar{X}$) is $\sigma^2/n$, so it has a narrower spread than the population, with the spread decreasing as we increase the number of elements in the sample. A randomly chosen sample happens to have a mean of 2.0. This mean is the value assumed by a random variable $\bar{X}$ whose distribution is the sampling distribution of the mean and whose expected value, as estimated from this single sample, is 2. (If we were to take more samples, their means would converge toward the true mean of the sampling distribution.) The double-headed arrow around $\bar{X}$ in the figure indicates a confidence interval in which the population mean must lie with high probability.
In almost all practical situations, we do not know $\sigma^2$. All is not lost, however. Recall that an unbiased estimator for $\sigma^2$ is $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ (Equation 2.15). Assuming that this estimator is of good quality (in practice, this is true when n > ~20), $\bar{X} \sim N\left(\mu, \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n(n-1)}\right)$. Therefore, when n is sufficiently large, we can still compute the confidence interval in which the population mean lies with high probability.
EXAMPLE 2.5: CONFIDENCE INTERVALS
Consider the data values in Table 2.1. What are the 95% and 99% confidence intervals in which the population mean lies?
Solution:
We will ignore the fact that n = 17 < 20, so that the central limit theorem is not likely to apply. Pressing on, we find that the sample mean is 2.88. We compute $\sum_{i=1}^{n}(x_i - \bar{x})^2$ as 107.76. Therefore, the variance of the sampling distribution of the mean is estimated as 107.76/(17 * 16) = 0.396, and the standard deviation of this distribution is estimated as its square root: 0.63. Using the value of $\pm 1.96\sigma$ for the 95% confidence interval and $\pm 2.576\sigma$ for the 99% confidence interval, where $\sigma$ here is the estimated standard deviation of the sampling distribution, the 95% confidence interval is [2.88 – 1.96 * 0.63, 2.88 + 1.96 * 0.63] = [1.65, 4.11], and the 99% confidence interval is [1.26, 4.50].
Figure 2.4 Population and sample mean distributions
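The arithmetic of Example 2.5 can be reproduced from the summary statistics quoted in the solution. This is a sketch only: the data of Table 2.1 are not repeated here, so the code starts from n, the sample mean, and the sum of squared deviations as given, and the variable and function names are illustrative.

```python
import math

# Summary statistics quoted in Example 2.5 (not recomputed from Table 2.1).
n = 17
sample_mean = 2.88
sum_sq_dev = 107.76          # sum of (x_i - sample_mean)^2

# Estimated variance and standard deviation of the sampling distribution.
var_of_mean = sum_sq_dev / (n * (n - 1))
sd_of_mean = math.sqrt(var_of_mean)

def interval(center, sd, critical):
    """Symmetric interval center +/- critical * sd."""
    return (center - critical * sd, center + critical * sd)

ci95 = interval(sample_mean, sd_of_mean, 1.96)
ci99 = interval(sample_mean, sd_of_mean, 2.576)

print(round(var_of_mean, 3), round(sd_of_mean, 2))
print([round(v, 2) for v in ci95])
print([round(v, 2) for v in ci99])
```

Because the code carries the unrounded standard deviation through the calculation, its endpoints may differ from hand calculations with 0.63 in the last digit.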
Because $\bar{X}$ is normally distributed with mean $\mu$ and variance $\sigma^2/n$, $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ is a N(0,1) variable, also called the standard Z variable. In practice, when n > 20, we can substitute $m_2 \cdot \frac{n}{n-1}$ as an estimate for $\sigma^2$ when computing the standard Z variable.
In the preceding, we have assumed that n is large, so that the central limit theorem applies. In particular, we have made the simplifying assumption that the estimated variance of the sampling distribution of the mean is identical to the actual variance of the sampling distribution. When n is small, this can lead to underestimating the variance of this distribution. To correct for this, we have to reexamine the distribution of the normally distributed standard random variable $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$, which we actually estimate as the random variable $\frac{\bar{X} - \mu}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 / (n(n-1))}}$. The latter variable is not normally distributed. Instead, it is called the standard t variable and is distributed according to the t distribution with n – 1 degrees of freedom (a parameter of the distribution). The salient feature of the t distribution is that, unlike the normal distribution, its shape varies with the degrees of freedom, with its shape for n > 20 becoming nearly identical to the normal distribution.
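To see how the t distribution approaches the normal distribution as the degrees of freedom grow, one can compare critical values directly. The sketch below avoids external libraries by integrating the t density numerically (Simpson's rule) and inverting the result by bisection; the function names and the numerical settings are assumptions made for this illustration, and a statistics library would normally be used instead.

```python
import math
from statistics import NormalDist

def t_pdf(t, dof):
    """Density of the t distribution with dof degrees of freedom."""
    log_c = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
             - 0.5 * math.log(dof * math.pi))
    return math.exp(log_c - (dof + 1) / 2 * math.log1p(t * t / dof))

def t_cdf(t, dof, steps=2000):
    """CDF for t >= 0: 0.5 plus Simpson's-rule integral of the density on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, dof) + t_pdf(t, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    return 0.5 + total * h / 3

def t_critical_value(confidence, dof):
    """Find c with P(-c < T < c) = confidence by bisection."""
    target = 1.0 - (1.0 - confidence) / 2.0
    lo, hi = 0.0, 1.0
    while t_cdf(hi, dof) < target:   # expand the bracket until it contains c
        hi *= 2.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if t_cdf(mid, dof) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

z95 = NormalDist().inv_cdf(0.975)
for dof in (4, 16, 20, 100):
    print(dof, round(t_critical_value(0.95, dof), 3), "normal:", round(z95, 3))
```

The t critical values are always larger than the normal value of 1.96, which widens the confidence interval for small samples, and the gap shrinks as the degrees of freedom increase.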
How does this affect the computation of confidence intervals? Given a sample, we proceed to compute the estimate of the mean as $\bar{x}$ as before. However, to compute the, say, 95% confidence interval, we need to change our procedure slightly. We have to find the range of values such that the probability mass under the t distribution (not the normal distribution) centered at that mean and with variance $\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n(n-1)}$ is 0.95. Given the degrees of freedom, which is simply n – 1, we can look this up in a standard t table. Then, we can state with 95% confidence that the population mean lies in this range.
EXAMPLE 2.6: CONFIDENCE INTERVALS FOR SMALL SAMPLES
Continuing with the sample in Table 2.1, we will now use the t distribution to compute confidence intervals. The estimated standard deviation of the sampling distribution of the mean is 0.63 (Example 2.5). Since n = 17, this corresponds to a t distribution with 16 degrees of freedom. We find from the standard t table that a standard t variable with 16 degrees of freedom reaches the 0.025 probability level at 2.12, so that there is a total of 0.05 probability mass beyond 2.12 times the standard deviation on the two sides of the mean. Therefore, the 95% confidence interval is [2.88 – (2.12 * 0.63), 2.88 + (2.12 * 0.63)] = [1.54, 4.22]. Compare this to the 95% confidence interval of [1.65, 4.11] obtained using the normal distribution in Example 2.5. Similarly, the t distribution reaches the 0.005 probability level at 2.921, leading to the 99% confidence interval of [2.88 – (2.921 * 0.63), 2.88 + (2.921 * 0.63)] = [1.04, 4.72], compared to [1.26, 4.50].
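The comparison in Example 2.6 can be laid out side by side. The sketch below takes the critical values exactly as quoted in the two examples (2.12 and 2.921 from the t table for 16 degrees of freedom, 1.96 and 2.576 for the normal distribution) along with the rounded standard deviation 0.63; the dictionary and function names are illustrative.

```python
# Values quoted in Examples 2.5 and 2.6.
sample_mean = 2.88
sd_of_mean = 0.63            # rounded estimate used in the hand calculation

t_critical = {0.95: 2.12, 0.99: 2.921}   # t table, 16 degrees of freedom
z_critical = {0.95: 1.96, 0.99: 2.576}   # standard normal

def interval(center, sd, critical):
    """Symmetric interval center +/- critical * sd."""
    return (center - critical * sd, center + critical * sd)

t_ci = {c: interval(sample_mean, sd_of_mean, t_critical[c]) for c in t_critical}
z_ci = {c: interval(sample_mean, sd_of_mean, z_critical[c]) for c in z_critical}

for c in (0.95, 0.99):
    t_lo, t_hi = t_ci[c]
    z_lo, z_hi = z_ci[c]
    print(f"{c:.0%}: t = [{t_lo:.2f}, {t_hi:.2f}]  normal = [{z_lo:.2f}, {z_hi:.2f}]")
```

At both confidence levels the t-based interval is wider than the normal-based one, reflecting the extra uncertainty from estimating the variance with only 17 observations.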
So far, we have focused on estimating the population mean and variance and have computed the range of values in which the population mean is expected to lie with high probability, obtained by studying the sampling distribution of the mean.
We can obtain corresponding confidence intervals for the population variance by studying the sampling distribution of the variance. It can be shown that if the population is normally distributed, this sampling distribution is the $\chi^2$ distribution (discussed in Section 2.4.7). However, this confidence interval is rarely derived in practice, and so we will omit the details of this result.