Statistical Inference II: Interval Estimation, Hypothesis Testing,
2.1 Confidence Intervals
In practice, the statistics calculated from samples, such as the sample average $\bar{X}$, variance $s^2$, standard deviation $s$, and others reviewed in the previous chapter, are used to estimate population parameters. For example, the sample average $\bar{X}$ is used as an estimator for the population mean $\mu_x$, the sample variance $s^2$ is an estimate of the population variance $\sigma^2$, and so on. Recall from Section 1.6 that desirable or “good” estimators satisfy four important properties: unbiasedness, efficiency, consistency, and sufficiency. However, regardless of the properties an estimator satisfies, estimates will vary across samples, and there is at least some probability that an estimate will differ from the population parameter it is meant to estimate. Unlike the point estimators reviewed in the previous chapter, the focus here is on interval estimates. Interval estimates allow inferences to be drawn about a population by providing an interval, a lower and upper boundary, within which an unknown parameter will lie with a prespecified level of confidence. The logic behind an interval estimate is that an interval calculated using sample data contains the true population parameter with some level of confidence (the long-run proportion of times that the true population parameter is contained in the interval). Such intervals are called confidence intervals (CIs) and can be constructed for an array of levels of confidence. The lower value is called the lower confidence limit (LCL) and the upper value the upper confidence limit (UCL). For a given sample, the wider a confidence interval, the more confident the researcher is that it contains the population parameter (overall confidence is relatively high). In contrast, a relatively narrow confidence interval is less likely to contain the population parameter (overall confidence is relatively low).
All the parametric methods presented in the first four sections of this chapter make specific assumptions about the probability distributions of sample estimators, or make assumptions about the nature of the sampled populations. In particular, the assumption of an approximately normally distributed population (and sample) is usually made. As such, it is imperative that these assumptions, or requirements, be checked prior to applying the methods. When the assumptions are not met, the nonparametric statistical methods provided in Section 2.5 are more appropriate.
2.1.1 Confidence Interval for $\mu$ with Known $\sigma^2$
The central limit theorem (CLT) suggests that whenever a sufficiently large random sample is drawn from any population with mean $\mu$ and standard deviation $\sigma$, the sample mean $\bar{X}$ is approximately normally distributed with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. It can easily be verified that the standard normal random variable $Z = (\bar{X} - \mu)/(\sigma/\sqrt{n})$ has a 0.95 probability of lying in the range $[-1.96, 1.96]$ (see Table C.1 in Appendix C). A probability statement regarding Z is given as
$$P\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95. \qquad (2.1)$$
With some basic algebraic manipulation the probability statement of Equation 2.1 can be written in a different, yet equivalent form:
$$0.95 = P\left(-\frac{1.96\,\sigma}{\sqrt{n}} \le \bar{X} - \mu \le \frac{1.96\,\sigma}{\sqrt{n}}\right) = P\left(\bar{X} - \frac{1.96\,\sigma}{\sqrt{n}} \le \mu \le \bar{X} + \frac{1.96\,\sigma}{\sqrt{n}}\right). \qquad (2.2)$$
Equation 2.2 reveals that, with a large number of intervals computed from different random samples drawn from the population, the proportion of values of $\bar{X}$ for which the interval $(\bar{X} - 1.96\,\sigma/\sqrt{n},\ \bar{X} + 1.96\,\sigma/\sqrt{n})$ captures $\mu$ is 0.95. This interval is called the 95% confidence interval estimator of $\mu$. A shortcut notation for this interval is
$$\bar{X} \pm 1.96\,\frac{\sigma}{\sqrt{n}}. \qquad (2.3)$$
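The long-run reading of the 95% interval can be checked with a small simulation. The sketch below is illustrative only: the function name `coverage`, the sample size of 50, the number of trials, and the seed are assumptions, and the population is taken to be normal with the mean and standard deviation borrowed from the speed example.

```python
import random
from math import sqrt

def coverage(mu=58.86, sigma=5.5, n=50, trials=10_000, z=1.96, seed=1):
    """Fraction of simulated samples whose interval
    X-bar +/- z * sigma / sqrt(n) contains the true mean mu."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    half_width = z * sigma / sqrt(n)   # sigma is treated as known
    hits = 0
    for _ in range(trials):
        # Draw one sample of size n from the assumed normal population.
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / trials

print(f"empirical coverage: {coverage():.3f}")
```

The printed fraction should sit close to 0.95. It will not equal 0.95 exactly because only a finite number of samples is drawn.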
Obviously, probabilities other than 95% can be used. For example, a 90% confidence interval is
$$\bar{X} \pm 1.645\,\frac{\sigma}{\sqrt{n}}.$$
In general, any confidence level can be used in estimating the confidence intervals. The confidence level is $1 - \alpha$, and $Z_{\alpha/2}$ is the value of Z such that the area in each of the tails under the standard normal curve is $\alpha/2$. Using this notation, the confidence interval estimator of $\mu$ can be written as
$$\bar{X} \pm Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}. \qquad (2.4)$$
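As an alternative to reading a printed table, the critical value for a given confidence level can be obtained from the inverse of the standard normal distribution function. The following is a minimal sketch using Python's standard library; the helper name `z_critical` is an assumption made for illustration.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Value z such that the area in each tail of the standard
    normal curve is alpha / 2, where alpha = 1 - confidence."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_critical(level):.3f}")
```

For the 90% and 95% levels this gives the multipliers 1.645 and 1.96 used in the text.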
Because the confidence level is the complement of the risk that the confidence interval fails to include the actual value of $\mu$, it generally ranges between 0.90 and 0.99, reflecting 10% and 1% levels of risk of not including the true population parameter, respectively.
Example 2.1
A 95% confidence interval is desired for the mean vehicular speed on Indiana roads (see Example 1.1 for more details). First, the assumption of normality is checked; if this assumption is satisfied we can proceed with the analysis. The sample size is n = 1296, and the sample mean is $\bar{X}$ = 58.86. Suppose a long history of prior studies has shown the population standard deviation to be $\sigma$ = 5.5. Using Equation 2.4, the confidence interval can be obtained:
$$\bar{X} \pm 1.96\,\frac{\sigma}{\sqrt{n}} = 58.86 \pm 1.96\,\frac{5.5}{\sqrt{1296}} = 58.86 \pm 0.30 = [58.56,\ 59.16].$$
The result indicates that the 95% confidence interval for the unknown population parameter $\mu$ has lower and upper bounds of 58.56 and 59.16. Intervals constructed in this way would contain the true and unknown population parameter about 95 times out of 100, on average. The confidence interval is rather “tight,” meaning that the range of possible values is relatively small. This is a result of the low assumed standard deviation (or variability in the data) of the population examined, together with the large sample size. The 90% confidence interval, using the same standard deviation, is [58.60, 59.11], and the 99% confidence interval is [58.46, 59.25]. As the confidence interval becomes wider, there is greater and greater confidence that the interval contains the true unknown population parameter.
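The trade-off between confidence level and interval width can be reproduced numerically. The sketch below applies the known-variance interval to the figures of Example 2.1 at three confidence levels. The function name `z_interval` is illustrative, and the critical values come from the standard library rather than a printed table, so the last digit may differ slightly from hand calculations based on rounded inputs.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence=0.95):
    """Known-variance interval: xbar +/- z_{alpha/2} * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

# Figures from Example 2.1: n = 1296, sample mean 58.86, sigma = 5.5.
for level in (0.90, 0.95, 0.99):
    lcl, ucl = z_interval(58.86, 5.5, 1296, level)
    print(f"{level:.0%} CI: [{lcl:.2f}, {ucl:.2f}]  width = {ucl - lcl:.2f}")
```

Each increase in the confidence level widens the interval, because a larger multiplier is applied to the same standard error $\sigma/\sqrt{n}$.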
2.1.2 Confidence Interval for the Mean with Unknown Variance
In the previous section, a procedure was discussed for constructing confidence intervals around the mean of a normal population when the variance of the population is known. In the majority of practical sampling situations, however, the population variance is rarely known and is instead estimated from the data. When the population variance is unknown and the population is normally distributed, a $(1 - \alpha)100\%$ confidence interval for $\mu$ is given by
$$\bar{X} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}}, \qquad (2.5)$$
where $s$ is the square root of the estimated variance ($s^2$), and $t_{\alpha/2}$ is the value of the t distribution with $n - 1$ degrees of freedom (for a discussion of the t distribution, see Appendix A).
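To make the role of the estimated standard deviation concrete, the following sketch computes the interval of Equation 2.5 from raw observations. The ten speed values are hypothetical and are not taken from the Indiana data. The function name `t_interval` is likewise illustrative, and the critical value 2.262 is the tabulated $t_{0.025}$ for 9 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

def t_interval(sample, t_crit):
    """Unknown-variance interval: xbar +/- t_{alpha/2} * s / sqrt(n).
    t_crit must be the t value for len(sample) - 1 degrees of freedom."""
    n = len(sample)
    xbar = mean(sample)
    s = stdev(sample)                  # sample standard deviation (n - 1 divisor)
    half_width = t_crit * s / sqrt(n)
    return xbar - half_width, xbar + half_width

# Hypothetical speeds in mph, for illustration only.
speeds = [55.2, 61.0, 58.4, 63.1, 57.7, 54.9, 60.3, 59.8, 56.5, 62.2]
lcl, ucl = t_interval(speeds, t_crit=2.262)   # 95% level, 9 degrees of freedom
print(f"95% CI for the mean: [{lcl:.2f}, {ucl:.2f}]")
```

With only ten observations the t multiplier (2.262) is noticeably larger than the normal value of 1.96, which widens the interval to reflect the extra uncertainty from estimating the standard deviation.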
Example 2.2
Continuing with the previous example, a 95% confidence interval for the mean speed on Indiana roads is computed, assuming that the population variance is not known, and instead the estimate s = 4.41 obtained from the data is used. The sample size is n = 1296, and the sample mean is $\bar{X}$ = 58.86. Using Equation 2.5, the confidence interval can be obtained as
$$\bar{X} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}} = 58.86 \pm 1.96\,\frac{4.41}{\sqrt{1296}} = [58.61,\ 59.10].$$
Interestingly, inspection of probabilities associated with the t distribution (see Table C.2 in Appendix C) shows that the t distribution converges to the standard normal distribution as $n \to \infty$. Although the t distribution is the correct distribution to use whenever the population variance is unknown, when the sample size is sufficiently large the standard normal distribution can be used as an adequate approximation to the t distribution.
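The convergence of the t distribution to the standard normal can be illustrated numerically. The sketch below is one possible approach that relies only on the standard library: it integrates the t density with Simpson's rule and inverts the resulting distribution function by bisection. The function names, the step count, and the search bracket are assumptions made for illustration; a statistical library would normally be used instead.

```python
from math import exp, lgamma, log, pi
from statistics import NormalDist

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_c - (df + 1) / 2 * log(1 + x * x / df))

def t_cdf(x, df, steps=1000):
    """CDF for x >= 0: 0.5 plus a Simpson's-rule integral of the density on [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_quantile(p, df):
    """Upper quantile (p > 0.5) located by bisection on the CDF."""
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for df in (5, 30, 100, 1295):
    print(f"df = {df:5d}: t_0.025 = {t_quantile(0.975, df):.3f}")
print(f"normal     : z_0.025 = {NormalDist().inv_cdf(0.975):.3f}")
```

The critical value shrinks toward the normal value of 1.96 as the degrees of freedom grow. With 1295 degrees of freedom, as in Example 2.2, the two agree to two decimal places.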
2.1.3 Confidence Interval for a Population Proportion
Sometimes, interest centers on a qualitative (nominal scale) variable, rather than a quantitative (interval or ratio scale) variable. There might be interest in the relative frequency of some characteristic in a population such as, for example, the proportion of people in a population who are transit users. In such cases, the estimator $\hat{p}$ of the population proportion $p$ has an approximately normal distribution provided that n is sufficiently large ($np \ge 5$ and $nq \ge 5$, where $q = 1 - p$). The mean of the sampling distribution of $\hat{p}$ is the population proportion $p$ and the standard deviation is $\sqrt{pq/n}$.
A large sample $(1 - \alpha)100\%$ confidence interval for the population proportion, $p$, is given by
$$\hat{p} \pm Z_{\alpha/2}\,\sqrt{\frac{\hat{p}\hat{q}}{n}}, \qquad (2.6)$$
where the estimated sample proportion, $\hat{p}$, is equal to the number of “successes” in the sample divided by the sample size, n, and $\hat{q} = 1 - \hat{p}$.
Example 2.3
A transit planning agency wants to estimate, at a 95% confidence level, the share of transit users in the daily commute “market” (that is, the percentage of commuters using transit). A random sample of 100 commuters is obtained and it is found that 28 people in the sample are transit users. By using Equation 2.6, a 95% confidence interval for $p$ is calculated as
$$\hat{p} \pm Z_{\alpha/2}\,\sqrt{\frac{\hat{p}\hat{q}}{n}} = 0.28 \pm 1.96\,\sqrt{\frac{(0.28)(0.72)}{100}} = 0.28 \pm 0.088 = [0.192,\ 0.368].$$
Thus, the agency is 95% confident that the share of transit users in the daily commute market lies between 19.2 and 36.8%.
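The calculation in Example 2.3 can be sketched in a few lines. The function name `proportion_interval` is illustrative. Because the population proportion is unknown, the sample-size check below substitutes the sample proportion for $p$, which is an assumption of this sketch rather than a rule stated in the text.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, n, confidence=0.95):
    """Large-sample interval: p_hat +/- z_{alpha/2} * sqrt(p_hat * q_hat / n)."""
    p_hat = successes / n
    q_hat = 1.0 - p_hat
    # Assumed check: use p_hat in place of the unknown p for np >= 5 and nq >= 5.
    if n * p_hat < 5 or n * q_hat < 5:
        raise ValueError("sample too small for the normal approximation")
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * sqrt(p_hat * q_hat / n)
    return p_hat - half_width, p_hat + half_width

# Example 2.3: 28 transit users in a random sample of 100 commuters.
lcl, ucl = proportion_interval(28, 100)
print(f"95% CI for p: [{lcl:.3f}, {ucl:.3f}]")
```

For the survey figures above this prints bounds of 0.192 and 0.368, in agreement with the hand calculation.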
2.1.4 Confidence Interval for the Population Variance
In many situations, in traffic safety research for example, interest centers on the population variance (or a related measure such as the population standard deviation). As a specific example, vehicle speeds contribute to crash probability, with an important factor being the variability in speeds on the roadway. Speed variance, measured as differences in travel speeds on a roadway, relates to crash frequency in that a larger variance in speed between vehicles correlates with a larger frequency of crashes, especially for crashes involving two or more vehicles (Garber, 1991). Large differences in speeds result in an increase in the frequency with which motorists pass one another, increasing the number of opportunities for multivehicle crashes. Clearly, vehicles traveling the same speed in the same direction do not overtake one another; therefore, they cannot collide as long as the same speed is maintained (for additional literature on the topic of speeding and crash probabilities, covering both the United States and abroad, the interested reader should consult FHWA, 1995, 1998, and TRB, 1998).
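A $(1 - \alpha)100\%$ confidence interval for $\sigma^2$, assuming the population is normally distributed, is given by
$$\left[\frac{(n-1)s^2}{\chi^2_{\alpha/2}},\ \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}\right], \qquad (2.7)$$
where $\chi^2_{\alpha/2}$ is the value of the $\chi^2$ distribution with $n - 1$ degrees of freedom. The area in the right-hand tail of the distribution beyond $\chi^2_{\alpha/2}$ is $\alpha/2$, while the area in the left-hand tail of the distribution below $\chi^2_{1-\alpha/2}$ is also $\alpha/2$. The chi-square distribution is described in Appendix A, and the table of probabilities associated with the chi-square distribution is provided in Table C.3 of Appendix C.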
Example 2.4
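A 95% confidence interval for the variance of speeds on Indiana roads is desired. With a sample size of 100 and a variance of 19.51 mph$^2$, and using the values from the $\chi^2$ table (Appendix C, Table C.3), one obtains $\chi^2_{\alpha/2}$ = 129.56 and $\chi^2_{1-\alpha/2}$ = 74.22. Thus, the 95% confidence interval is given as
$$\left[\frac{(n-1)s^2}{\chi^2_{\alpha/2}},\ \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}\right] = \left[\frac{99(19.51)}{129.56},\ \frac{99(19.51)}{74.22}\right] = [14.91,\ 26.02].$$
The speed variance is, with 95% confidence, between 14.91 and 26.02. Again, the units of the variance in speed are in mph$^2$.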
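The interval of Equation 2.7 is straightforward to evaluate once the two chi-square values have been read from the table. The sketch below takes those table values as inputs rather than computing them. The function and parameter names are illustrative, and the figures are those quoted in Example 2.4.

```python
def variance_interval(s2, n, chi2_upper, chi2_lower):
    """Interval for the population variance:
    [(n - 1) s^2 / chi2_{alpha/2}, (n - 1) s^2 / chi2_{1 - alpha/2}].
    chi2_upper is the right-tail table value, chi2_lower the left-tail one."""
    df = n - 1
    return df * s2 / chi2_upper, df * s2 / chi2_lower

# Table values as quoted in Example 2.4.
lcl, ucl = variance_interval(19.51, 100, chi2_upper=129.56, chi2_lower=74.22)
print(f"95% CI for the variance: [{lcl:.2f}, {ucl:.2f}] mph^2")
# Taking square roots of both bounds gives an interval for the standard deviation.
print(f"95% CI for the standard deviation: [{lcl ** 0.5:.2f}, {ucl ** 0.5:.2f}] mph")
```

Note that the interval is not symmetric about the sample variance, because the chi-square distribution is skewed to the right.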