Confidence Intervals - Statistical Inference II: Interval Estimation, Hypothesis Testing,

2.1 Confidence Intervals

In practice, the statistics calculated from samples, such as the sample average $\bar{X}$, variance $s^2$, standard deviation $s$, and others reviewed in the previous chapter, are used to estimate population parameters. For example, the sample average $\bar{X}$ is used as an estimator for the population mean $\mu_x$, the sample variance $s^2$ is an estimate of the population variance $\sigma^2$, and so on. Recall from Section 1.6 that desirable or “good” estimators satisfy four important properties: unbiasedness, efficiency, consistency, and sufficiency. However, regardless of the properties an estimator satisfies, estimates will vary across samples, and there is at least some probability that an estimate will differ from the population parameter it is meant to estimate.

Unlike the point estimators reviewed in the previous chapter, the focus here is on interval estimates. Interval estimates allow inferences to be drawn about a population by providing an interval, a lower and upper boundary, within which an unknown parameter will lie with a prespecified level of confidence. The logic behind an interval estimate is that an interval calculated using sample data contains the true population parameter with some level of confidence (the long-run proportion of times that the true population parameter is contained in the interval). Such intervals are called confidence intervals (CIs) and can be constructed for an array of levels of confidence. The lower value is called the lower confidence limit (LCL) and the upper value the upper confidence limit (UCL). For a given sample, the wider a confidence interval, the more confident the researcher is that it contains the population parameter (overall confidence is relatively high). In contrast, a relatively narrow confidence interval computed from the same sample is less likely to contain the population parameter (overall confidence is relatively low).

All the parametric methods presented in the first four sections of this chapter make specific assumptions about the probability distributions of sample estimators, or make assumptions about the nature of the sampled populations. In particular, the assumption of an approximately normally distributed population (and sample) is usually made. As such, it is imperative that these assumptions, or requirements, be checked prior to applying the methods. When the assumptions are not met, the nonparametric statistical methods provided in Section 2.5 are more appropriate.

2.1.1 Confidence Interval for $\mu$ with Known $\sigma^2$

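The central limit theorem (CLT) suggests that whenever a sufficiently large random sample is drawn from any population with mean $\mu$ and standard deviation $\sigma$, the sample mean $\bar{X}$ is approximately normally distributed with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. The standardized sample mean $Z = (\bar{X} - \mu)/(\sigma/\sqrt{n})$ is therefore approximately standard normal. It can easily be verified that this standard normal random variable $Z$ has a 0.95 probability of falling in the range of values $[-1.96, 1.96]$ (see Table C.1 in Appendix C). A probability statement regarding $Z$ is given as

$$P\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95. \qquad (2.1)$$

With some basic algebraic manipulation the probability statement of Equation 2.1 can be written in a different, yet equivalent form:

$$0.95 = P\left(-1.96\frac{\sigma}{\sqrt{n}} \le \bar{X} - \mu \le 1.96\frac{\sigma}{\sqrt{n}}\right) = P\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right). \qquad (2.2)$$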

Equation 2.2 reveals that, with a large number of intervals computed from different random samples drawn from the population, the proportion of values of $\bar{X}$ for which the interval $\left(\bar{X} - 1.96\,\sigma/\sqrt{n},\ \bar{X} + 1.96\,\sigma/\sqrt{n}\right)$ captures $\mu$ is 0.95. This interval is called the 95% confidence interval estimator of $\mu$. A shortcut notation for this interval is

$$\bar{X} \pm 1.96\frac{\sigma}{\sqrt{n}}. \qquad (2.3)$$
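The long-run reading of Equation 2.2 can be checked by simulation. The sketch below is illustrative only: the population values (`mu=60.0`, `sigma=5.5`), the sample size, the number of replications, and the function name `coverage_rate` are all assumptions chosen for the demonstration, not values from the text. It draws repeated normal samples, builds the interval of Equation 2.3 for each, and reports the share of intervals that contain $\mu$.

```python
import math
import random
from statistics import NormalDist, fmean


def coverage_rate(mu, sigma, n, n_samples, level=0.95, seed=0):
    """Fraction of simulated samples whose z-interval contains mu."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for level = 0.95
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(n_samples):
        # Draw one sample of size n and compute its mean.
        x_bar = fmean([rng.gauss(mu, sigma) for _ in range(n)])
        # Count the interval as a "hit" when it captures the true mean.
        if x_bar - half_width <= mu <= x_bar + half_width:
            hits += 1
    return hits / n_samples


# Hypothetical population and sampling plan (assumed for illustration).
rate = coverage_rate(mu=60.0, sigma=5.5, n=50, n_samples=2000)
print(f"share of intervals containing mu: {rate:.3f}")
```

With a few thousand replications the reported share should sit close to 0.95, with some simulation noise; it describes the procedure over many samples rather than any single interval.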

Obviously, probabilities other than 95% can be used. For example, a 90% confidence interval is

$$\bar{X} \pm 1.645\frac{\sigma}{\sqrt{n}}.$$

In general, any confidence level can be used in estimating the confidence intervals. The confidence level is $(1 - \alpha)100\%$, and $Z_{\alpha/2}$ is the value of $Z$ such that the area in each of the tails under the standard normal curve is $\alpha/2$. Using this notation, the confidence interval estimator of $\mu$ can be written as

$$\bar{X} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}. \qquad (2.4)$$

Because the confidence level $1 - \alpha$ is the complement of the risk $\alpha$ that the confidence interval fails to include the actual value of $\mu$, it generally ranges between 0.90 and 0.99, reflecting 10% and 1% levels of risk of not including the true population parameter, respectively.
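The critical value $Z_{\alpha/2}$ in Equation 2.4 can be looked up in a table or computed directly. As a minimal sketch, the helper below (the name `z_critical` is an assumption for illustration) uses the standard normal inverse CDF from Python's standard library to obtain the value that leaves an area of $\alpha/2$ in the upper tail.

```python
from statistics import NormalDist


def z_critical(level):
    """Z_{alpha/2} for a two-sided interval with confidence level 1 - alpha."""
    alpha = 1.0 - level
    # The upper-tail area alpha/2 corresponds to the (1 - alpha/2) quantile.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: Z = {z_critical(level):.3f}")
```

The 90% and 95% levels give the familiar values of about 1.645 and 1.96 used in the text.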

Example 2.1

A 95% confidence interval is desired for the mean vehicular speed on Indiana roads (see Example 1.1 for more details). First, the assumption of normality is checked; if this assumption is satisfied we can proceed with the analysis. The sample size is n = 1296, and the sample mean is $\bar{X}$ = 58.86. Suppose a long history of prior studies has shown the population standard deviation as $\sigma$ = 5.5. Using Equation 2.4, the confidence interval can be obtained:

$$\bar{X} \pm 1.96\frac{\sigma}{\sqrt{n}} = 58.86 \pm 1.96\frac{5.5}{\sqrt{1296}} = 58.86 \pm 0.30 = [58.56, 59.16].$$

The result indicates that the 95% confidence interval for the unknown population parameter $\mu$ consists of lower and upper bounds of 58.56 and 59.16. This suggests that intervals constructed in this way would contain the true and unknown population parameter about 95 times out of 100, on average. The confidence interval is rather “tight,” meaning that the range of possible values is relatively small. This is a result of the low assumed standard deviation (or variability in the data) of the population examined, together with the large sample size. The 90% confidence interval, using the same standard deviation, is [58.60, 59.11], and the 99% confidence interval is [58.46, 59.25]. As the confidence level increases and the interval becomes wider, there is greater and greater confidence that the interval contains the true unknown population parameter.
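The calculation in Example 2.1 can be reproduced with a few lines of code. This is a sketch under the example's stated inputs (n = 1296, sample mean 58.86, assumed known $\sigma$ = 5.5); the function name `z_interval` is hypothetical and simply applies Equation 2.4.

```python
import math
from statistics import NormalDist


def z_interval(x_bar, sigma, n, level=0.95):
    """Confidence interval for the mean when sigma is known (Equation 2.4)."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    half_width = z * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


# Inputs quoted in Example 2.1.
for level in (0.90, 0.95, 0.99):
    lcl, ucl = z_interval(x_bar=58.86, sigma=5.5, n=1296, level=level)
    print(f"{level:.0%} CI: [{lcl:.2f}, {ucl:.2f}]  width = {ucl - lcl:.2f}")
```

Because the inputs are rounded, the computed limits may differ from the quoted ones in the last decimal place. The printed widths make the trade-off visible: raising the confidence level widens the interval.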

2.1.2 Confidence Interval for the Mean with Unknown Variance

In the previous section, a procedure was discussed for constructing confidence intervals around the mean of a normal population when the variance of the population is known. In the majority of practical sampling situations, however, the population variance is rarely known and is instead estimated from the data. When the population variance is unknown and the population is normally distributed, a $(1 - \alpha)100\%$ confidence interval for $\mu$ is given by

$$\bar{X} \pm t_{\alpha/2}\frac{s}{\sqrt{n}}, \qquad (2.5)$$

where $s$ is the square root of the estimated variance ($s^2$), and $t_{\alpha/2}$ is the value of the t distribution with $n - 1$ degrees of freedom (for a discussion of the t distribution, see Appendix A).

Example 2.2

Continuing with the previous example, a 95% confidence interval for the mean speed on Indiana roads is computed, assuming that the population variance is not known, and instead an estimate, s = 4.41, is obtained from the same data as before. The sample size is n = 1296, and the sample mean is $\bar{X}$ = 58.86. Using Equation 2.5, the confidence interval can be obtained as

$$\bar{X} \pm t_{\alpha/2}\frac{s}{\sqrt{n}} = 58.86 \pm 1.96\frac{4.41}{\sqrt{1296}} = [58.61, 59.10].$$

Interestingly, inspection of probabilities associated with the t distribution (see Table C.2 in Appendix C) shows that the t distribution converges to the standard normal distribution as $n \to \infty$. Although the t distribution is the correct distribution to use whenever the population variance is unknown, when sample size is sufficiently large the standard normal distribution can be used as an adequate approximation to the t distribution.
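The convergence of the t distribution to the standard normal can be seen numerically. Python's standard library has no t quantile function, so the sketch below approximates one by integrating the t density with Simpson's rule and inverting by bisection; the helper names (`t_pdf`, `t_cdf`, `t_critical`, `t_interval`) are assumptions for illustration, and in practice a statistics library (for example `scipy.stats.t.ppf`) would normally be used instead.

```python
import math


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0: 0.5 plus Simpson's rule on [0, x]."""
    h = x / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(level, df):
    """t_{alpha/2}: the value leaving area alpha/2 in the upper tail."""
    target = 1 - (1 - level) / 2
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < target:  # expand until the target is bracketed
        hi *= 2
    for _ in range(60):            # bisection on the bracket
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def t_interval(x_bar, s, n, level=0.95):
    """Confidence interval for the mean with unknown variance (Equation 2.5)."""
    half_width = t_critical(level, n - 1) * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


# Critical values shrink toward the normal value of about 1.96 as df grows.
for df in (5, 30, 120, 1295):
    print(f"df = {df:5d}: t = {t_critical(0.95, df):.4f}")

# Inputs quoted in Example 2.2.
print(t_interval(x_bar=58.86, s=4.41, n=1296))
```

For the 1295 degrees of freedom of Example 2.2 the critical value is already very close to 1.96, which is why the normal value is used in that calculation.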

2.1.3 Confidence Interval for a Population Proportion

Sometimes, interest centers on a qualitative (nominal scale) variable, rather than a quantitative (interval or ratio scale) variable. There might be interest in the relative frequency of some characteristic in a population such as, for example, the proportion of people in a population who are transit users. In such cases, the population proportion, $p$, is estimated by the sample proportion $\hat{p}$, which has an approximate normal distribution provided that n is sufficiently large ($np \ge 5$ and $nq \ge 5$, where $q = 1 - p$). The mean of the sampling distribution of $\hat{p}$ is the population proportion $p$ and the standard deviation is $\sqrt{pq/n}$.

A large sample $(1 - \alpha)100\%$ confidence interval for the population proportion, $p$, is given by

$$\hat{p} \pm Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}, \qquad (2.6)$$

where the estimated sample proportion, $\hat{p}$, is equal to the number of “successes” in the sample divided by the sample size, n, and $\hat{q} = 1 - \hat{p}$.

Example 2.3

A transit planning agency wants to estimate, at a 95% confidence level, the share of transit users in the daily commute “market” (that is, the percentage of commuters using transit). A random sample of 100 commuters is obtained and it is found that 28 people in the sample are transit users. By using Equation 2.6, a 95% confidence interval for $p$ is calculated as

$$\hat{p} \pm Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} = 0.28 \pm 1.96\sqrt{\frac{0.28(0.72)}{100}} = 0.28 \pm 0.088 = [0.192, 0.368].$$

Thus, the agency is 95% confident that the share of transit users in the daily commute ranges from 19.2% to 36.8%.
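The proportion interval of Equation 2.6 is straightforward to compute. The sketch below uses the counts from Example 2.3; the function name `proportion_interval` is hypothetical, and checking the large-sample condition with $\hat{p}$ in place of the unknown $p$ is an assumption made for the illustration.

```python
import math
from statistics import NormalDist


def proportion_interval(successes, n, level=0.95):
    """Large-sample confidence interval for a proportion (Equation 2.6)."""
    p_hat = successes / n
    q_hat = 1.0 - p_hat
    # Assumption: the n*p >= 5 and n*q >= 5 rule is checked with the
    # sample estimates, since the population proportion is unknown.
    if n * p_hat < 5 or n * q_hat < 5:
        raise ValueError("sample too small for the normal approximation")
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    half_width = z * math.sqrt(p_hat * q_hat / n)
    return p_hat - half_width, p_hat + half_width


# Example 2.3: 28 transit users in a random sample of 100 commuters.
lcl, ucl = proportion_interval(successes=28, n=100)
print(f"95% CI for p: [{lcl:.3f}, {ucl:.3f}]")
```

Raising an error for small samples is one possible design choice; a caller could equally switch to an exact method in that case.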

2.1.4 Confidence Interval for the Population Variance

In many situations, in traffic safety research for example, interest centers on the population variance (or a related measure such as the population standard deviation). As a specific example, vehicle speeds contribute to crash probability, an important factor being the variability in speeds on the roadway. Speed variance, measured as differences in travel speeds on a roadway, relates to crash frequency in that a larger variance in speed between vehicles correlates with a larger frequency of crashes, especially for crashes involving two or more vehicles (Garber, 1991). Large differences in speeds result in an increase in the frequency with which motorists pass one another, increasing the number of opportunities for multivehicle crashes. Clearly, vehicles traveling the same speed in the same direction do not overtake one another; therefore, they cannot collide as long as the same speed is maintained (for additional literature on the topic of speeding and crash probabilities, covering both the United States and abroad, the interested reader should consult FHWA, 1995, 1998, and TRB, 1998).

A $(1 - \alpha)100\%$ confidence interval for $\sigma^2$, assuming the population is normally distributed, is given by

$$\left[\frac{(n-1)s^2}{\chi^2_{\alpha/2}},\ \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}\right], \qquad (2.7)$$

where $\chi^2_{\alpha/2}$ is the value of the $\chi^2$ distribution with $n - 1$ degrees of freedom that leaves an area of $\alpha/2$ in the right-hand tail of the distribution, while $\chi^2_{1-\alpha/2}$ is the value that leaves an area of $\alpha/2$ in the left-hand tail. The chi-square distribution is described in Appendix A, and the table of probabilities associated with the chi-square distribution is provided in Table C.3 of Appendix C.

Example 2.4

A 95% confidence interval for the variance of speeds on Indiana roads is desired. With a sample size of 100 and a variance of 19.51 mph$^2$, and using the values from the $\chi^2$ table (Appendix C, Table C.3), one obtains $\chi^2_{\alpha/2}$ = 129.56 and $\chi^2_{1-\alpha/2}$ = 74.22. Thus, the 95% confidence interval is given as

$$\left[\frac{99(19.51)}{129.56},\ \frac{99(19.51)}{74.22}\right] = [14.91, 26.02].$$

The speed variance is, with 95% confidence, between 14.91 and 26.02. Again, the units of the variance in speed are in mph$^2$.
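Equation 2.7 reduces to two divisions once the chi-square values are in hand. In the sketch below the critical values are passed in as arguments, taken from the table values quoted in Example 2.4, because Python's standard library has no chi-square quantile function; the function name `variance_interval` and its parameter names are assumptions for illustration.

```python
def variance_interval(s2, n, chi2_upper, chi2_lower):
    """Confidence interval for the population variance (Equation 2.7).

    chi2_upper is the chi-square value with area alpha/2 to its right,
    chi2_lower the value with area alpha/2 to its left.
    """
    df = n - 1
    # Dividing by the larger critical value gives the lower limit.
    return df * s2 / chi2_upper, df * s2 / chi2_lower


# Inputs quoted in Example 2.4 (variance in mph^2, table values for 95%).
lcl, ucl = variance_interval(s2=19.51, n=100,
                             chi2_upper=129.56, chi2_lower=74.22)
print(f"95% CI for the variance: [{lcl:.2f}, {ucl:.2f}] mph^2")
print(f"95% CI for the std. dev.: [{lcl ** 0.5:.2f}, {ucl ** 0.5:.2f}] mph")
```

Taking square roots of the limits, as in the last line, gives the corresponding interval for the standard deviation. Unlike the intervals for the mean and the proportion, this interval is not symmetric about the sample variance.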

In document Statistical and Econometric Methods for Transportation Data Analysis by S P Washington (Page 34-39)