
Unit 7: Hypothesis Testing and Confidence Intervals II

QBA 201 – Summer 2013

Instructor: Michael Malcolm

7.1: Tests and confidence intervals for proportions

7.2: Tests and confidence intervals for difference of means

7.3: Tests and confidence intervals for difference of proportions

7.4: Small sample sizes


7.1: Tests and confidence intervals for proportions

Recall the procedures for a hypothesis test involving a population mean. The general procedure for hypothesis testing is:

1. State the hypotheses.
2. State the rejection region.
3. Compute the test statistic.
4. Form conclusion by comparing the test statistic and the rejection region.

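Right-sided test:

1. $H_0: \mu = \mu_0$ versus $H_a: \mu > \mu_0$
2. RR: $z > z_{\alpha}$
3. $z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
4. Form conclusion.

Left-sided test:

1. $H_0: \mu = \mu_0$ versus $H_a: \mu < \mu_0$
2. RR: $z < -z_{\alpha}$
3. $z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
4. Form conclusion.

Two-sided test:

1. $H_0: \mu = \mu_0$ versus $H_a: \mu \neq \mu_0$
2. RR: $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$
3. $z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
4. Form conclusion.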

A $1 - \alpha$ confidence interval for the population mean $\mu$ is given by:

$\left[\bar{x} - z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}\right]$

The expression $\sigma/\sqrt{n}$ that shows up in both the test statistic and the confidence interval is known as the standard error. It is the standard deviation of the sample mean.

It turns out that many hypothesis tests and confidence intervals can be derived using exactly the same “recipe”. The only trick is to get the standard error right. In this section we apply the recipe to hypothesis tests and confidence intervals for proportions.

For example, suppose we want to study the true proportion of residents of some population who intend to vote for a particular candidate. The true population proportion is 𝑝. The sample proportion 𝑝̂ is the measured proportion of people in our sample who intend to vote for the candidate.

The hypotheses are set up the same way as in the case of a sample mean. Our null hypothesis is that the true population proportion $p$ is equal to some hypothesized value $p_0$. We can test against alternatives that $p$ is greater than, less than, or not equal to the hypothesized value. The appropriate tests and confidence intervals are given below.

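Right-sided test:

1. $H_0: p = p_0$ versus $H_a: p > p_0$
2. RR: $z > z_{\alpha}$
3. $z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$
4. Form conclusion.

Left-sided test:

1. $H_0: p = p_0$ versus $H_a: p < p_0$
2. RR: $z < -z_{\alpha}$
3. $z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$
4. Form conclusion.

Two-sided test:

1. $H_0: p = p_0$ versus $H_a: p \neq p_0$
2. RR: $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$
3. $z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$
4. Form conclusion.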

A $1 - \alpha$ confidence interval for the population proportion $p$ is given by:

$\left[\hat{p} - z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}},\ \hat{p} + z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}\right]$

The only important thing to note is that the standard error for hypothesis testing is calculated by assuming the null value $p_0$, i.e. $\sqrt{p_0(1-p_0)/n}$. For the confidence interval, we rely on the sample proportion, i.e. the standard error is calculated as $\sqrt{\hat{p}(1-\hat{p})/n}$.

One warning is that these tests are unreliable for values of $p$ very close to 0 or 1 unless the sample size is very large. Testing or estimating something involving a true population proportion $p = 0.001$ is obviously very difficult if the sample size is something like $n = 100$. But inferences when $p = 0.3$ are much less problematic.

For our first example, suppose that a company advertises that only 5% of batteries that it produces are defective. A consumer testing agency wants to test this claim at a significance level of $\alpha = 0.01$. In order to test the claim, the agency takes a random sample of 300 batteries and it finds 22 defective batteries.

For this test, a one-sided alternative is appropriate since the agency is presumably interested in investigating whether the proportion of defectives is higher than claimed. Note that the sample proportion $\hat{p}$ to be used for testing is $\hat{p} = \dfrac{22}{300} = 0.0733$.

We follow the steps for a right-sided hypothesis test as follows:

1. $H_0: p = 0.05$ versus $H_a: p > 0.05$
2. RR: $z > z_{0.01} = 2.326$
3. $z = \dfrac{0.0733 - 0.05}{\sqrt{0.05(1-0.05)/300}} = 1.85$
4. Since $z$ is not in the rejection region, we do not reject the null hypothesis.

The agency’s sample does not provide sufficient evidence that the proportion of defective batteries is higher than the claimed level of 0.05.

The p-value for these data can be calculated using the same technique that we derived in unit 6.4. Since the alternative is a right-sided alternative, the p-value is calculated from the test statistic as:

𝑃(𝑧 > 1.85) = 0.0322

The p-value for these data 𝑝 = 0.0322 confirms that the null hypothesis could not be rejected at significance level 𝛼 = 0.01. However, notice that it could have been rejected if the agency had used a significance level 𝛼 = 0.05. Recall that the p-value is the lowest level of significance at which the null hypothesis can be rejected.
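The battery example can be checked numerically. The following is a minimal sketch rather than part of the original notes: it assumes Python 3.8 or later and uses `statistics.NormalDist` from the standard library for the normal quantile and tail probability. The function name `one_proportion_z_test` is hypothetical. Because the code carries unrounded values, its p-value comes out close to, but not exactly equal to, the table lookup at $z = 1.85$.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0, alpha):
    """Right-sided large-sample z-test for a proportion (H_a: p > p0)."""
    p_hat = successes / n
    # Under the null, the standard error uses p0, not p_hat.
    se = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    z_crit = NormalDist().inv_cdf(1 - alpha)   # z_alpha
    p_value = 1 - NormalDist().cdf(z)          # P(Z > z)
    return z, z_crit, p_value

z, z_crit, p_value = one_proportion_z_test(22, 300, 0.05, 0.01)
reject = z > z_crit
print(f"z = {z:.2f}, critical value = {z_crit:.3f}, "
      f"p-value = {p_value:.4f}, reject = {reject}")
```

Changing the last argument from 0.01 to 0.05 lowers the critical value to about 1.645, which illustrates why the same data would lead to rejection at the 5% level.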

We can also apply the technique of unit 6.5 to estimate the power of this test. Recall that the power of a test is defined against a particular alternative. For example, suppose that the true proportion of defectives is actually 𝑝 = 0.07, so that the null should be rejected. What is the power of the agency’s testing procedure for rejecting the null in this case?

We first compute the rejection region explicitly. Since the rejection region for the agency’s test is $z > 2.326$, we need to find the particular values of $\hat{p}$ that lead to rejection. Plugging in the value of the z-statistic:

$\dfrac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} > 2.326$

$\dfrac{\hat{p} - 0.05}{\sqrt{0.05(1-0.05)/300}} > 2.326 \Rightarrow \hat{p} > 0.0793$

The power of the test is now the probability that $\hat{p}$ will actually fall into this rejection region when the true proportion of defectives is actually $p = 0.07$. Using the Central Limit Theorem:

$P(\hat{p} > 0.0793 \mid p = 0.07) = P\left(z > \dfrac{0.0793 - 0.07}{\sqrt{0.07(1-0.07)/300}}\right) = P(z > 0.63) = 0.2643$

Notice that these calculations use the alternative value 𝑝 = 0.07 for the standard error, which is appropriate since what we are doing is to calculate the probability that 𝑝̂ will fall in the rejection region when this alternative is true.

Thus, this test is quite low-powered against the alternative that 𝑝 = 0.07. In this case, even though the null 𝑝 = 0.05 should be rejected, our test is only able to reject the null 26.43% of the time.
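The two-step power calculation (find the cutoff for $\hat{p}$ under the null, then find the chance of exceeding it under the alternative) can be sketched as follows. This is an illustrative sketch, not the author's code: the function name `power_right_sided` is hypothetical, and `statistics.NormalDist` (Python 3.8+) is assumed. The result differs from 0.2643 only in the later decimals because nothing is rounded along the way.

```python
from math import sqrt
from statistics import NormalDist

def power_right_sided(p0, p_alt, n, alpha):
    """Power of the right-sided z-test for a proportion against p = p_alt."""
    std_normal = NormalDist()
    # Smallest sample proportion that leads to rejection (null standard error).
    se_null = sqrt(p0 * (1 - p0) / n)
    cutoff = p0 + std_normal.inv_cdf(1 - alpha) * se_null
    # Chance of exceeding the cutoff when p_alt is true (alternative standard error).
    se_alt = sqrt(p_alt * (1 - p_alt) / n)
    return cutoff, 1 - std_normal.cdf((cutoff - p_alt) / se_alt)

cutoff, power = power_right_sided(0.05, 0.07, 300, 0.01)
print(f"reject when p_hat > {cutoff:.4f}; power = {power:.4f}")
```

Calling the function with a larger `n` or a larger `alpha` shows how either change raises the power against the same alternative.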

Finally, a 95% confidence interval for the true population proportion of defectives 𝑝 would be calculated as follows:

$\left[0.0733 - 1.96\sqrt{\dfrac{0.0733(1-0.0733)}{300}},\ 0.0733 + 1.96\sqrt{\dfrac{0.0733(1-0.0733)}{300}}\right] = [0.0438, 0.1028]$

With respect to polling, confidence intervals for proportions are frequently given a “margin of error” interpretation. For example, suppose prior to the 2012 election that a polling company called 2500 US voters and asked whether they intended to vote for Obama. 1302 of the respondents indicated that they intended to vote for Obama, so the sample proportion can be calculated as $\hat{p} = \dfrac{1302}{2500} = 0.5208$.

We can find a 95% confidence interval as follows:

$0.5208 \pm 1.96\sqrt{\dfrac{0.5208(1-0.5208)}{2500}} = 0.5208 \pm 0.0196 = [0.5012, 0.5404]$

A pollster might say in this case that his estimate for the proportion of voters who intend to vote for Obama is 52.08% with a margin of error 1.96%. This basically means that the confidence interval extends 0.0196 in both directions of the point estimate 𝑝̂ = 0.5208.
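The margin-of-error reading of the interval can be reproduced with a few lines of code. The sketch below is illustrative only: the helper name `proportion_confidence_interval` is hypothetical, and it assumes Python 3.8+ for `statistics.NormalDist`. It returns the point estimate, the margin, and both endpoints, and is applied to the polling data and, for comparison, to the battery sample from earlier in this section.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Large-sample CI for a proportion; returns (p_hat, margin, low, high)."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # z_{alpha/2}
    # For the interval, the standard error uses the sample proportion.
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin, p_hat - margin, p_hat + margin

p_hat, margin, low, high = proportion_confidence_interval(1302, 2500)
print(f"poll: {p_hat:.4f} +/- {margin:.4f} -> [{low:.4f}, {high:.4f}]")

# Battery sample: 22 defectives out of 300.
_, _, b_low, b_high = proportion_confidence_interval(22, 300)
print(f"batteries: [{b_low:.4f}, {b_high:.4f}]")
```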

Suppose the pollster wants to get the margin of error down to 1.5%. The necessary sample size can be derived using the formula from unit 6.7. Recall that this formula gives $n = \left(\dfrac{z_{\alpha/2}\,\sigma}{M}\right)^2$. We use the standard deviation estimate for Bernoulli random variables $\sigma = \sqrt{\hat{p}(1 - \hat{p})}$:

$n = \left(\dfrac{1.96\sqrt{0.5208(1-0.5208)}}{0.015}\right)^2 = 4261.06$
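The same sample-size arithmetic can be sketched in code. This is a minimal illustration with a hypothetical function name, `sample_size_for_margin`; it uses the rounded critical value 1.96, as in the calculation above, and only the standard library.

```python
from math import ceil, sqrt

def sample_size_for_margin(p_hat, margin, z=1.96):
    """Sample size n at which z * sigma / sqrt(n) equals the target margin."""
    sigma = sqrt(p_hat * (1 - p_hat))   # Bernoulli standard deviation estimate
    return (z * sigma / margin) ** 2

n_exact = sample_size_for_margin(0.5208, 0.015)
n_needed = ceil(n_exact)   # a fractional respondent is not possible, so round up
print(f"n = {n_exact:.2f}, rounded up to {n_needed}")
```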

Rounding up, a sample size of 4262 is needed in order for the pollster to lower his margin of error to 1.5%. Note that we used 𝑝̂ in this formula. If you don’t have a starting estimate, then 𝑝̂ =


EXERCISES

1. A local law enforcement agency claimed that fewer than 50% of store owners actually turn shoplifters over to police. A random sample of 40 store owners indicated that only 24 of them turn shoplifters over to police.

a. Is there enough evidence to conclude at 𝛼 = 0.05 significance that the law enforcement agency’s claim is correct?

b. What is the p-value for the test?

c. Calculate a 95% confidence interval for the proportion of store owners who turn shoplifters over to police.


7.2: Tests and confidence intervals for difference of means

Rather than testing whether a mean is equal to some hypothesized value, we may sometimes be interested in testing whether the means of two different populations are different from each other. For example, we might want to test whether the mean salaries of men and women who work in some particular industry are significantly different from each other. In other words, we are comparing the means of two different populations.

Here, the null hypothesis is that the difference between the two population means is equal to some specified value. We can have both one-sided and two-sided alternatives, stating that the difference in means is higher than, lower than, or not equal to this specified value.

The test follows the same setup as earlier tests, as long as the standard error is calculated correctly.

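Right-sided test:

1. $H_0: \mu_1 - \mu_2 = D_0$ versus $H_a: \mu_1 - \mu_2 > D_0$
2. RR: $z > z_{\alpha}$
3. $z = \dfrac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$
4. Form conclusion.

Left-sided test:

1. $H_0: \mu_1 - \mu_2 = D_0$ versus $H_a: \mu_1 - \mu_2 < D_0$
2. RR: $z < -z_{\alpha}$
3. $z = \dfrac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$
4. Form conclusion.

Two-sided test:

1. $H_0: \mu_1 - \mu_2 = D_0$ versus $H_a: \mu_1 - \mu_2 \neq D_0$
2. RR: $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$
3. $z = \dfrac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$
4. Form conclusion.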

The idea is to take random samples from both populations. The random sample from the first population consists of $n_1$ observations and the random sample from the second population consists of $n_2$ observations. We then record the sample mean for each random sample, $\bar{x}_1$ for the sample from the first population and $\bar{x}_2$ for the sample from the second population. As usual, the test statistic technically depends on the true population variances $\sigma_1^2$ and $\sigma_2^2$, but since these are usually unknown, in practice we substitute the sample variances $s_1^2$ and $s_2^2$.

The usual case is that we are testing for equality of two means. In other words, the normal case is that the hypothesized difference is 𝐷₀ = 0. You might occasionally be interested in testing for some other value for the difference – for example, you might want to know whether the mean difference in the life of two batteries is more than 3 months in order to justify a cost difference. But the most frequently encountered case is simply to test whether two means are equal to each other.

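Right-sided test ($D_0 = 0$):

1. $H_0: \mu_1 - \mu_2 = 0$ versus $H_a: \mu_1 - \mu_2 > 0$
2. RR: $z > z_{\alpha}$
3. $z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$
4. Form conclusion.

Left-sided test ($D_0 = 0$):

1. $H_0: \mu_1 - \mu_2 = 0$ versus $H_a: \mu_1 - \mu_2 < 0$
2. RR: $z < -z_{\alpha}$
3. $z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$
4. Form conclusion.

Two-sided test ($D_0 = 0$):

1. $H_0: \mu_1 - \mu_2 = 0$ versus $H_a: \mu_1 - \mu_2 \neq 0$
2. RR: $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$
3. $z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$
4. Form conclusion.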

For the case where the null is that the two means are equal, the right-sided test is relevant when you want to test the hypothesized alternative that $\mu_1 > \mu_2$. The left-sided test is relevant when you want to test the hypothesized alternative that $\mu_1 < \mu_2$. The two-sided test is relevant when you have no direction in mind and simply want to test whether the two are different.

A $1 - \alpha$ confidence interval for the difference $\mu_1 - \mu_2$ is given by:

$\left[(\bar{x}_1 - \bar{x}_2) - z_{\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}},\ (\bar{x}_1 - \bar{x}_2) + z_{\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}\right]$

As an example, suppose that an environmental researcher wants to study the level of water pollution in two different locations on a river to determine whether they differ. He takes a random sample of 30 readings from the first location and a random sample of 35 readings from the second location. Among the samples from the first location, the mean pollution level is 1.65 ppm with a standard deviation of 0.26 ppm. Among the samples from the second location, the mean pollution level is 1.43 ppm with a standard deviation of 0.22 ppm.

We want to test at 𝛼 = 0.05 whether the two mean pollution levels differ from each other.

Since the researcher does not specify a direction for the alternative, we use a two-sided test. Proceeding through the steps:

1. $H_0: \mu_1 - \mu_2 = 0$ versus $H_a: \mu_1 - \mu_2 \neq 0$
2. RR: $z < -1.96$ or $z > 1.96$
3. Test statistic: $z = \dfrac{1.65 - 1.43}{\sqrt{\dfrac{0.26^2}{30} + \dfrac{0.22^2}{35}}} = 3.65$
4. The test statistic falls in the rejection region, so we can reject the null hypothesis.


We can form a 95% confidence interval for the difference between the pollution levels at the two sites using the formula given above.

$\left[(1.65 - 1.43) - 1.96\sqrt{\dfrac{0.26^2}{30} + \dfrac{0.22^2}{35}},\ (1.65 - 1.43) + 1.96\sqrt{\dfrac{0.26^2}{30} + \dfrac{0.22^2}{35}}\right]$
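The pollution example, including the endpoints of this interval, can be evaluated numerically. The sketch below is an illustration under stated assumptions rather than the author's method: the function name `two_mean_z` is hypothetical, `statistics.NormalDist` (Python 3.8+) supplies the critical value, and the sample standard deviations are substituted for the unknown population values as described above.

```python
from math import sqrt
from statistics import NormalDist

def two_mean_z(mean1, sd1, n1, mean2, sd2, n2, d0=0.0, confidence=0.95):
    """Large-sample z statistic and confidence interval for mu1 - mu2."""
    diff = mean1 - mean2
    # Sample standard deviations stand in for the unknown population values.
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)
    z = (diff - d0) / se
    z_half = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # z_{alpha/2}
    return z, (diff - z_half * se, diff + z_half * se)

z, (low, high) = two_mean_z(1.65, 0.26, 30, 1.43, 0.22, 35)
print(f"z = {z:.2f}, 95% CI = [{low:.4f}, {high:.4f}]")
```

An interval that lies entirely above zero is consistent with the two-sided test rejecting the null of equal means at the 5% level.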


EXERCISES

1. APGAR scores are a 1-10 measure of a newborn’s health and alertness at birth. A researcher is interested in studying whether babies of mothers who smoke while pregnant have lower APGAR scores than babies of mothers who do not smoke. To study this question, the researcher takes a random sample of 35 newborns of mothers who smoke while pregnant. He finds a mean APGAR of 7.80 and a standard deviation of 1.73 among these newborns. He then takes a random sample of 86 newborns of mothers who did not smoke while pregnant. He finds a mean APGAR of 8.48 and a standard deviation of 0.97 among these newborns. Test the hypothesis that the researcher is interested in at a significance level of 𝛼 = 0.05.

2. A random sample of 48 men with new CPA certifications showed a starting salary of $80,168 and a standard deviation of $8000. At the same time, a random sample of 39 women with new CPA certifications showed a starting salary of $70,754 and a standard deviation of $6000.

a. Is there enough evidence to conclude that men are paid more than women? Use a significance level of 𝛼 = 0.01.


7.3: Tests and confidence intervals for difference of proportions

Unit 7.1 dealt with testing whether a proportion was equal to some hypothesized value. But we might be interested in knowing whether proportions from two populations are different from each other. For example, people who live in one area might be exposed to some kind of pollutant, and a researcher might be interested in knowing whether a higher proportion of people from this area contract cancer than people from some other area.

The setup is virtually the same as testing for a difference in two population means. We take a sample of size 𝑛₁ from the first population and record the sample proportion 𝑝̂₁ from this population. We then take a sample of size 𝑛₂ from the second population and record the sample proportion 𝑝̂₂ from this population. The procedures are as follows.

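Right-sided test:

1. $H_0: p_1 - p_2 = D_0$ versus $H_a: p_1 - p_2 > D_0$
2. RR: $z > z_{\alpha}$
3. $z = \dfrac{(\hat{p}_1 - \hat{p}_2) - D_0}{\sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}}$
4. Form conclusion.

Left-sided test:

1. $H_0: p_1 - p_2 = D_0$ versus $H_a: p_1 - p_2 < D_0$
2. RR: $z < -z_{\alpha}$
3. $z = \dfrac{(\hat{p}_1 - \hat{p}_2) - D_0}{\sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}}$
4. Form conclusion.

Two-sided test:

1. $H_0: p_1 - p_2 = D_0$ versus $H_a: p_1 - p_2 \neq D_0$
2. RR: $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$
3. $z = \dfrac{(\hat{p}_1 - \hat{p}_2) - D_0}{\sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}}$
4. Form conclusion.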

Note that the procedure is almost identical to the procedure for testing the difference between two means; you just need to calculate the standard error properly. As usual, the standard errors technically depend on the true proportions 𝑝₁ and 𝑝₂, but since these are unknown in practice we substitute the sample proportions 𝑝̂₁ and 𝑝̂₂.

The usual case is just to test whether two proportions are different from each other, so we use $D_0 = 0$ in most cases. In other words, the null hypothesis is $p_1 = p_2$. The right-sided alternative tests against the alternative that $p_1 > p_2$, and the left-sided alternative tests against the alternative that $p_1 < p_2$.

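Right-sided test ($D_0 = 0$):

1. $H_0: p_1 - p_2 = 0$ versus $H_a: p_1 - p_2 > 0$
2. RR: $z > z_{\alpha}$
3. $z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}}$
4. Form conclusion.

Left-sided test ($D_0 = 0$):

1. $H_0: p_1 - p_2 = 0$ versus $H_a: p_1 - p_2 < 0$
2. RR: $z < -z_{\alpha}$
3. $z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}}$
4. Form conclusion.

Two-sided test ($D_0 = 0$):

1. $H_0: p_1 - p_2 = 0$ versus $H_a: p_1 - p_2 \neq 0$
2. RR: $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$
3. $z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}}$
4. Form conclusion.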

For the case where the null is $p_1 - p_2 = 0$, some textbooks take the null that the two proportions are equal as given and form a “pooled” proportion:

$\bar{p} = \dfrac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}$

This is essentially the weighted average of the two sample proportions. The idea is that this is the “best” estimate of the common proportion, taking as given the null that the two are equal. This pooled proportion is then substituted in the standard error, so that the test statistic is calculated as:

$z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{\bar{p}(1-\bar{p})}{n_1} + \dfrac{\bar{p}(1-\bar{p})}{n_2}}}$

This technique does not work for the case where we are testing 𝑝₁− 𝑝₂ = 𝐷₀, with 𝐷₀ ≠ 0. Nevertheless, it is good practice for the case where the null is 𝑝₁− 𝑝₂ = 0.

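A $1 - \alpha$ confidence interval for the true difference $p_1 - p_2$ is given by:

$\left[(\hat{p}_1 - \hat{p}_2) - z_{\alpha/2}\sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}},\ (\hat{p}_1 - \hat{p}_2) + z_{\alpha/2}\sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\right]$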

As an example, suppose that a researcher is interested in testing whether Hondas or Toyotas are more likely to need major repairs within two years of purchase. The researcher takes a sample of 400 Honda owners and 500 Toyota owners. Within the first two years, 53 of the Hondas needed major repairs, while 78 of the Toyotas needed major repairs. We want to test at a significance level 𝛼 = 0.10 whether the two are significantly different.

Note that the sample proportions are:

$\hat{p}_1 = \dfrac{53}{400} = 0.1325$

$\hat{p}_2 = \dfrac{78}{500} = 0.1560$

For hypothesis testing, we will use the pooled proportion:

$\bar{p} = \dfrac{400 \cdot 0.1325 + 500 \cdot 0.1560}{400 + 500} = 0.1456$

1. $H_0: p_1 - p_2 = 0$ versus $H_a: p_1 - p_2 \neq 0$
2. RR: $z < -1.645$ or $z > 1.645$
3. Test statistic: $z = \dfrac{0.1325 - 0.1560}{\sqrt{\dfrac{0.1456(1-0.1456)}{400} + \dfrac{0.1456(1-0.1456)}{500}}} = -0.99$
4. The test statistic does not fall in the rejection region, so we do not reject the null hypothesis.

In this case, there is not enough evidence to conclude that the proportion of cars needing repairs in the first two years is different between the two companies. Indeed, using procedures to compute p-values for a two-sided test, the p-value for this test is 2 ⋅ 𝑃(𝑧 < −0.99) = 0.3222.
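The pooled test for the Honda and Toyota data can be sketched in code. This is an illustrative sketch with a hypothetical function name, `pooled_two_proportion_z`, assuming Python 3.8+ for `statistics.NormalDist`. Since no intermediate value is rounded, the p-value agrees with 0.3222 only to about two decimal places.

```python
from math import sqrt
from statistics import NormalDist

def pooled_two_proportion_z(x1, n1, x2, n2):
    """Two-sided z-test of H0: p1 = p2 using the pooled proportion."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    pooled = (n1 * p1_hat + n2 * p2_hat) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) / n1 + pooled * (1 - pooled) / n2)
    z = (p1_hat - p2_hat) / se
    p_value = 2 * NormalDist().cdf(-abs(z))   # both tails
    return pooled, z, p_value

pooled, z, p_value = pooled_two_proportion_z(53, 400, 78, 500)
# Two-sided test at alpha = 0.10 compares |z| with z_{0.05}, about 1.645.
reject_at_10_percent = abs(z) > NormalDist().inv_cdf(0.95)
print(f"pooled = {pooled:.4f}, z = {z:.2f}, p-value = {p_value:.4f}")
```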

Finally, a 95% confidence interval for the true difference in proportions would be:

$(0.1325 - 0.1560) \pm 1.96\sqrt{\dfrac{0.1325(1-0.1325)}{400} + \dfrac{0.1560(1-0.1560)}{500}}$
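The endpoints of this interval can be evaluated with a short script. The sketch below is illustrative only: the function name `two_proportion_ci` is hypothetical, and `statistics.NormalDist` (Python 3.8+) is assumed. Note that, unlike the pooled test, the interval uses the separate sample proportions in its standard error.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ci(x1, n1, x2, n2, confidence=0.95):
    """Large-sample confidence interval for p1 - p2 (unpooled standard error)."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    z_half = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # z_{alpha/2}
    diff = p1_hat - p2_hat
    return diff - z_half * se, diff + z_half * se

low, high = two_proportion_ci(53, 400, 78, 500)
contains_zero = low < 0 < high
print(f"95% CI for p1 - p2: [{low:.4f}, {high:.4f}], contains 0: {contains_zero}")
```

An interval that contains zero is in line with the test's failure to reject the null of equal proportions.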


EXERCISES

1. In 2009, a magazine conducted a survey in which 92% of married men said that they would vote for a woman as President. However, in 1975, a similar poll showed that only 73% would have voted for a woman. Suppose that the 2009 survey consisted of 2000 observations and the 1975 survey consisted of 1500 observations.

a. Is there enough evidence to conclude at the $\alpha = 0.05$ level of significance that a higher proportion of married men were willing to vote for a woman as President in 2009 than in 1975?

b. What is the p-value for the hypothesis test given in (a)?


7.4: Small Sample Sizes

All of our hypothesis tests and confidence intervals so far have relied on an asymptotic normal distribution for the test statistic. For example, the cutoff values for hypothesis tests and the endpoints for the confidence intervals use the 𝑧 distribution. The reason for this is the central limit theorem, which we encountered in unit 5.3. The sample average from any distribution can be approximated by a normal distribution as long as the sample size is sufficiently large.

Recall the idea from unit 5 that the exact sampling distribution of the sample mean depends upon the distribution of the population from which it was drawn, but as the sample size gets larger, the distribution of the sample mean approaches a normal distribution. But, again, this is a limiting result and the approximation is only good for large sample sizes.

How large is large? The normal rule of thumb is that sample sizes of about 30 are large enough for the central limit theorem to provide a good approximation. So, any time the sample size is larger than about 30, you can apply the tests and confidence intervals given in the previous section.[1]

What about small sample sizes? That is, what if the sample size is not large enough to apply the central limit theorem and use the normal distribution for our hypothesis tests and confidence intervals? The basic answer is that you can’t really say anything general. Because the sampling distribution in the small sample size case depends on the exact form of the population distribution, there is no general procedure for testing hypotheses and constructing confidence intervals.

However, for one special case, we can do something. If the population distribution from which the sample is drawn is a normal distribution, then the exact distribution of the normalized sample mean, using the sample standard deviation, is given by the t-distribution. So, in fact we can implement hypothesis testing and confidence intervals for small sample sizes if the population from which the sample is drawn obeys a normal distribution.

How can we check whether the population distribution is normal? There are tests you can do, but practically the quickest way is to just do a quick plot of the data and see whether it appears to be basically symmetric and without any serious outliers. The sensitivity to the assumptions depends on how small the sample size is. If the sample size is tiny, then the validity of hypothesis tests and confidence intervals is very sensitive to the normality assumption. For larger sample sizes, the t-distribution is fairly robust in the sense that it works well as long as the distribution is reasonably close to normal, i.e. free from very serious skew and/or outliers.

[1] One warning about this was already discussed earlier. For tests involving proportions, you need very large sample sizes when $p$ is very close to 0 or 1.

To summarize the basic principles for implementing hypothesis testing and confidence intervals with various sample sizes:

• If the sample size is large ($n > 30$ or so), then the central limit theorem applies, and you should use the large-sample tests based on the z-distribution covered in previous sections.

• If the sample size is small ($n < 30$ or so) and the population from which the sample is drawn is normal, you should use the small-sample tests described in this section.

• If the sample size is small ($n < 30$ or so) and the population from which the sample is drawn is not normal, then there is nothing you can do. You don’t have enough information to do any reliable statistical inference.

Note that the small-sample tests and confidence intervals based on the t-distribution should never be used for testing a sample proportion. In this case, the population is a series of 0/1 observations (i.e. the condition is either true or false) which by definition do not obey a normal distribution. So the small-sample inferences discussed in this section do not make sense for tests and confidence intervals involving proportions.

For small-sample inference involving the value of a mean, the implementation is virtually the same as the implementation for the large-sample case, with one exception. Hypothesis tests and confidence intervals for the large-sample case are based on the limiting z-distribution, which is a good approximation of the sampling distribution only in the case of large sample sizes. But for the small sample case we instead use the t-distribution, which is the exact sampling distribution when the population from which the sample is drawn is normal.

For small-sample hypothesis tests and confidence intervals involving the value of a mean, we use the t-distribution with 𝑛 − 1 degrees of freedom, where 𝑛 is the sample size. For purposes of completeness, the procedures for the hypothesis tests and confidence interval are below. They are identical to the large-sample case except for the use of the t-distribution in place of the z-distribution.

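Right-sided test:

1. $H_0: \mu = \mu_0$ versus $H_a: \mu > \mu_0$
2. RR: $t > t_{\alpha}$
3. $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$
4. Form conclusion.

Left-sided test:

1. $H_0: \mu = \mu_0$ versus $H_a: \mu < \mu_0$
2. RR: $t < -t_{\alpha}$
3. $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$
4. Form conclusion.

Two-sided test:

1. $H_0: \mu = \mu_0$ versus $H_a: \mu \neq \mu_0$
2. RR: $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$
3. $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$
4. Form conclusion.

A $1 - \alpha$ confidence interval for the population mean $\mu$ is given by:

$\left[\bar{x} - t_{\alpha/2}\dfrac{s}{\sqrt{n}},\ \bar{x} + t_{\alpha/2}\dfrac{s}{\sqrt{n}}\right]$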

As an example, suppose that customers rate airports on a scale of 1-10, and that these ratings are known to be approximately normally distributed. A researcher surveys 12 people at random from Amsterdam’s Schiphol Airport about their customer satisfaction and obtains a sample mean 𝑥̅ = 7.75 and a sample standard deviation 𝑠 = 1.215. The researcher is interested in testing whether there is sufficient evidence to conclude that the true mean rating of customers at Schiphol Airport exceeds 7. We want to test at the 𝛼 = 0.05 level of significance.

Applying the steps for a right-sided alternative as given above:

1. $H_0: \mu = 7$ versus $H_a: \mu > 7$
2. RR: $t > 1.796$ (Note that this is read from the line with $n - 1 = 12 - 1 = 11$ degrees of freedom).
3. $t = \dfrac{7.75 - 7}{1.215/\sqrt{12}} = 2.14$
4. Since $t$ falls in the rejection region, we reject the null hypothesis.

In this case, we have enough evidence to conclude at the 5% level of significance that the true mean rating for Schiphol Airport does exceed 7.

What is the p-value for the test? We don’t have the whole distribution, but if we look at the line in the t-table for 11 degrees of freedom, observe that 𝑡_0.05 = 1.796 and that 𝑡_0.025 = 2.201. Since our calculated t-statistic for this test is 𝑡 = 2.14, notice that the test rejects the null hypothesis for 𝛼 = 0.05 but would not have fallen in the rejection region for 𝛼 = 0.025. Thus, we know that the p-value is somewhere in the interval 0.025 < 𝑝 < 0.05, since the p-value is the lowest level of significance for which the null hypothesis can be rejected.

A 95% confidence interval for the true mean rating for the airport is:

$\left[7.75 - 2.201\dfrac{1.215}{\sqrt{12}},\ 7.75 + 2.201\dfrac{1.215}{\sqrt{12}}\right] = [6.98, 8.52]$
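The airport example can be reproduced in a few lines. Python's standard library has no t-distribution, so this sketch simply hard-codes the two table values quoted above for 11 degrees of freedom; the constant and function names are hypothetical, and the code is an illustration rather than a general-purpose routine.

```python
from math import sqrt

# Critical values quoted in the text for 11 degrees of freedom (from a t-table).
T_05_DF11 = 1.796    # t_{0.05}
T_025_DF11 = 2.201   # t_{0.025}

def one_sample_t(x_bar, s, n, mu0):
    """t statistic for H0: mu = mu0, plus the standard error s / sqrt(n)."""
    se = s / sqrt(n)
    return (x_bar - mu0) / se, se

t, se = one_sample_t(7.75, 1.215, 12, 7)
reject = t > T_05_DF11                      # right-sided test at alpha = 0.05
low, high = 7.75 - T_025_DF11 * se, 7.75 + T_025_DF11 * se
print(f"t = {t:.2f}, reject = {reject}, 95% CI = [{low:.2f}, {high:.2f}]")
```

The one-sided test uses $t_{0.05}$ while the 95% interval uses $t_{0.025}$, which is why the interval can include 7 even though the right-sided test rejects $\mu = 7$.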

For small-sample inference on the difference between two means, the procedures for hypothesis testing and confidence intervals are basically the same as the procedures for the large-sample case, again replacing asymptotic values from the z-distribution with exact values from the t-distribution. In this case, the relevant t-distribution is that with $n_1 + n_2 - 2$ degrees of freedom.

One difference is that these tests and confidence intervals use a “pooled” variance estimator, which is calculated as follows:

$s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$

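The procedures for a hypothesis test are as follows:

Right-sided test:

1. $H_0: \mu_1 - \mu_2 = D_0$ versus $H_a: \mu_1 - \mu_2 > D_0$
2. RR: $t > t_{\alpha}$
3. $t = \dfrac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$
4. Form conclusion.

Left-sided test:

1. $H_0: \mu_1 - \mu_2 = D_0$ versus $H_a: \mu_1 - \mu_2 < D_0$
2. RR: $t < -t_{\alpha}$
3. $t = \dfrac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$
4. Form conclusion.

Two-sided test:

1. $H_0: \mu_1 - \mu_2 = D_0$ versus $H_a: \mu_1 - \mu_2 \neq D_0$
2. RR: $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$
3. $t = \dfrac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$
4. Form conclusion.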

Again, the normal case is testing whether two population means are equal. That is, we test whether the difference is 𝐷₀ = 0.

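A $1 - \alpha$ confidence interval for the true difference $\mu_1 - \mu_2$ is given by:

$\left[(\bar{x}_1 - \bar{x}_2) - t_{\alpha/2}\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)},\ (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2}\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}\right]$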

For example, suppose that wait times to speak to a reservation agent when calling major airlines are known to be normally distributed. A marketing company randomly placed 22 calls to Delta, waiting an average of 2.5 minutes with a standard deviation of 0.8 minutes. The company also randomly placed 20 calls to Southwest, waiting an average of 2.1 minutes with a standard deviation of 1.1 minutes. The company is interested in testing at a significance level of 𝛼 = 0.05 whether there is a difference in mean waiting times at the two companies.

The pooled variance estimate is:

$s_p^2 = \dfrac{(22-1)0.8^2 + (20-1)1.1^2}{22 + 20 - 2} = 0.9108$

We can now apply the steps for a two-sided test as given above:

1. $H_0: \mu_1 - \mu_2 = 0$ versus $H_a: \mu_1 - \mu_2 \neq 0$
2. RR: $t < -2.021$ or $t > 2.021$ (Note that this is read from the line with $n_1 + n_2 - 2 = 22 + 20 - 2 = 40$ degrees of freedom).
3. $t = \dfrac{(2.5 - 2.1) - 0}{\sqrt{0.9108\left(\dfrac{1}{22} + \dfrac{1}{20}\right)}} = 1.36$
4. Since $t$ does not fall in the rejection region, we cannot reject the null hypothesis.

The data gathered by the marketing firm does not provide sufficient evidence to conclude that the true mean waiting times for calls placed to the two airlines are actually different.

For the p-value, note that 𝑡_0.05= 1.684 and 𝑡_0.10 = 1.303. Our calculated test statistic 𝑡 = 1.36 falls between the two. However, since it is a two-sided test, we have to double these. Thus, the p-value is somewhere in the interval 0.10 < 𝑝 < 0.20 (because the rejection region has to be symmetric on both sides). This is not very good evidence for rejecting the null hypothesis. The p-value gives the lowest level of significance at which the null hypothesis can be rejected.

A 95% confidence interval for the true mean difference 𝜇₁− 𝜇₂ is given by:

$\left[0.4 - 2.021\sqrt{0.9108\left(\dfrac{1}{22} + \dfrac{1}{20}\right)},\ 0.4 + 2.021\sqrt{0.9108\left(\dfrac{1}{22} + \dfrac{1}{20}\right)}\right] = [-0.1959, 0.9959]$
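The airline example can be checked with the sketch below. It is an illustration under stated assumptions: the names `pooled_variance` and `pooled_t` are hypothetical, the critical value 2.021 is hard-coded from the t-table value quoted above (the standard library has no t-distribution), and pooling the two sample variances treats both populations as sharing a common variance.

```python
from math import sqrt

T_025_DF40 = 2.021   # t_{0.025} with 40 degrees of freedom, as quoted in the text

def pooled_variance(s1, n1, s2, n2):
    """Pooled variance estimate from two sample standard deviations."""
    return ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)

def pooled_t(mean1, s1, n1, mean2, s2, n2, d0=0.0):
    """t statistic for H0: mu1 - mu2 = d0, plus its standard error."""
    sp2 = pooled_variance(s1, n1, s2, n2)
    se = sqrt(sp2 * (1 / n1 + 1 / n2))
    return ((mean1 - mean2) - d0) / se, se

sp2 = pooled_variance(0.8, 22, 1.1, 20)
t, se = pooled_t(2.5, 0.8, 22, 2.1, 1.1, 20)
reject = abs(t) > T_025_DF40                 # two-sided test at alpha = 0.05
diff = 2.5 - 2.1
low, high = diff - T_025_DF40 * se, diff + T_025_DF40 * se
print(f"sp^2 = {sp2:.4f}, t = {t:.2f}, reject = {reject}, "
      f"95% CI = [{low:.4f}, {high:.4f}]")
```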

One final methodological note – If the line for the relevant degrees of freedom is not shown on the t-table you are using, it is standard practice to read from the line with the next-lowest number of degrees of freedom. For example, if you need to use the t-distribution with 37 degrees of freedom (which is not given on the table), you should instead use the line for the t-distribution with 35 degrees of freedom.
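The "next-lowest row" rule can be written as a small lookup. In this sketch the list of tabulated rows is hypothetical (printed tables differ), as is the function name `table_row_for`; only the rule itself comes from the note above.

```python
from bisect import bisect_right

# Hypothetical set of degrees-of-freedom rows printed on a t-table.
TABLE_DF_ROWS = [*range(1, 31), 35, 40, 50, 60, 120]

def table_row_for(df, rows=TABLE_DF_ROWS):
    """Return the largest tabulated df that does not exceed the requested df."""
    index = bisect_right(rows, df)
    if index == 0:
        raise ValueError("df is below the smallest row on the table")
    return rows[index - 1]

print(table_row_for(37))   # falls back to the 35-df row
print(table_row_for(40))   # exact match, uses the 40-df row
```

Reading from a row with fewer degrees of freedom gives a slightly larger critical value, so the rule errs on the side of not rejecting.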


EXERCISES

For the exercises below, you can assume that the populations from which the samples are drawn are normally distributed.

1. Environmental regulations specify that the mean level of some toxin in fish be lower than 102 ppm. A field worker takes a sample of 5 fish and obtains the following readings:

{99, 102, 94, 99, 95}.

a. Is there enough evidence to conclude at significance level 𝛼 = 0.05 that the regulated standard is being met?

b. What is the p-value for the test in (a)?

c. Form a 99% confidence interval for the true mean toxicity level.


7.5: Tests and Confidence Intervals for Variances

The previous sections have dealt with hypothesis tests and confidence intervals for means and proportions. In some circumstances, we may also be interested in inference involving variances. For example, a company might produce parts on its machines and specify that the variance in the part’s weight should not exceed 0.001 grams. That is, the company is not interested in studying the mean weight of the parts, but rather is interested in studying the variability in the parts it produces. In this section, we will cover testing whether a variance is equal to a particular null value. In the next section, we will cover comparing variances of two different populations.

The central limit theorem told us that, for large enough sample sizes, the asymptotic distribution of the sample mean was always approximately normal, regardless of the population from which it was drawn. Unfortunately, there is no similar result for sample variances. That is, even in large samples, the distribution of the sample variance will always depend upon the distribution of the population from which the sample is drawn. Thus, there simply are no general results for inference dealing with variances, even for large sample sizes.

However, in the special case where the population distribution is normal, it is known that the scaled sample variance $(n-1)s^2/\sigma^2$ obeys a $\chi^2$ distribution. This is read “chi-squared distribution”. To emphasize, inferences in this section dealing with variances are only valid for the case where the population distribution is normal. Even for large samples, the $\chi^2$ distribution describes the distribution of the sample variance only for the case where the population from which the data are generated is normal. In this specific case, we can develop hypothesis tests and confidence intervals as given below.

The details of the test are shown below. The null is that the true population variance is equal to some null value $\sigma_0^2$. We can then implement a hypothesis test of this null hypothesis against right-sided, left-sided or two-sided alternatives. We use the sample variance $s^2$ to construct a test statistic, and we compare this test statistic against our rejection region, based on the $\chi^2$ distribution with $n - 1$ degrees of freedom, tables of which are easily accessible.

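Right-sided test:

1. $H_0: \sigma^2 = \sigma_0^2$ versus $H_a: \sigma^2 > \sigma_0^2$
2. RR: $\chi^2 > \chi^2_{\alpha}$
3. $\chi^2 = \dfrac{(n-1)s^2}{\sigma_0^2}$
4. Form conclusion.

Left-sided test:

1. $H_0: \sigma^2 = \sigma_0^2$ versus $H_a: \sigma^2 < \sigma_0^2$
2. RR: $\chi^2 < \chi^2_{1-\alpha}$
3. $\chi^2 = \dfrac{(n-1)s^2}{\sigma_0^2}$
4. Form conclusion.

Two-sided test:

1. $H_0: \sigma^2 = \sigma_0^2$ versus $H_a: \sigma^2 \neq \sigma_0^2$
2. RR: $\chi^2 > \chi^2_{\alpha/2}$ or $\chi^2 < \chi^2_{1-\alpha/2}$
3. $\chi^2 = \dfrac{(n-1)s^2}{\sigma_0^2}$
4. Form conclusion.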

(22)

Because the 𝜒² distribution is not symmetric, some care is needed when choosing cutoff values for test statistics. When we use the z-distribution, for example, the cutoff value for an upper-tailed test with 𝛼 = 0.05 significance is 𝑧_0.05 = 1.645. This is the z-statistic that cuts off the top 5% of the distribution. But since the distribution is symmetric, the critical value for a lower-tailed test cuts off the lower 5% of the distribution. This is 𝑧_0.95 = −𝑧_0.05 = −1.645. Similarly, the cutoff values for a 2-tailed test with 5% significance level are 𝑧_0.025 = 1.96 and 𝑧_0.975 = −𝑧_0.025 = −1.96.
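The symmetry of the z cutoffs can be checked numerically. The following is a minimal sketch using `statistics.NormalDist` from Python's standard library; the helper name `upper_cutoff` is an illustrative assumption and does not come from these notes.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def upper_cutoff(tail_area):
    """Return the z value that has `tail_area` of the distribution to its right."""
    return STANDARD_NORMAL.inv_cdf(1.0 - tail_area)


z_05 = upper_cutoff(0.05)    # cuts off the top 5%
z_95 = upper_cutoff(0.95)    # cuts off the bottom 5%
z_025 = upper_cutoff(0.025)  # cuts off the top 2.5%
z_975 = upper_cutoff(0.975)  # cuts off the bottom 2.5%

print(round(z_05, 3), round(z_95, 3))
print(round(z_025, 2), round(z_975, 2))
```

Because the normal distribution is symmetric about zero, each lower cutoff is simply the negative of the matching upper cutoff.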

However, if the distribution were not symmetric about zero, then it would not be true in general that 𝑧_0.95 = −𝑧_0.05. For example, suppose we are using the 𝜒² distribution with 10 degrees of freedom. To test against a right-sided alternative at significance level 𝛼 = 0.05, the relevant rejection region is 𝜒² > 18.3 since 𝜒²_0.05 = 18.3. However, to test against a left-sided alternative at the same significance level, the rejection region is 𝜒² < 3.94 since 𝜒²_0.95 = 3.94. In other words, this is the critical value that chops off the bottom 5% of the distribution. To test against a two-sided alternative at the same significance level, the relevant rejection region is 𝜒² > 20.5 or 𝜒² < 3.25 since 𝜒²_0.025 = 20.5 and 𝜒²_0.975 = 3.25.
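The cutoffs quoted above for 10 degrees of freedom can be sanity-checked without a table. The sketch below relies on a closed form that holds only when the degrees of freedom are even: with 2k degrees of freedom, the area to the right of x is e^(−x/2) times the sum of (x/2)^i / i! for i = 0, …, k − 1. The function name `chi2_upper_tail_even_df` is an illustrative assumption, and the sketch is not a general-purpose 𝜒² routine.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """Area to the right of x under a chi-squared curve; valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half_x = x / 2.0
    number_of_terms = df // 2
    partial_sum = sum(half_x ** i / math.factorial(i) for i in range(number_of_terms))
    return math.exp(-half_x) * partial_sum


# Cutoffs quoted in the text for 10 degrees of freedom,
# keyed by the subscript (the area to the right of the cutoff).
table_cutoffs = {0.05: 18.3, 0.95: 3.94, 0.025: 20.5, 0.975: 3.25}

tail_areas = {
    subscript: chi2_upper_tail_even_df(cutoff, 10)
    for subscript, cutoff in table_cutoffs.items()
}

for subscript, area in tail_areas.items():
    cutoff = table_cutoffs[subscript]
    print(f"subscript {subscript}: area to the right of {cutoff} is {area:.4f}")
```

Each computed area should come out close to its subscript, with small differences because the table values are rounded. Note that the lower cutoffs are not the negatives of the upper cutoffs, unlike the z case.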

A 1 − 𝛼 confidence interval for the true population variance is given by:

[(𝑛 − 1)𝑠² / 𝜒²_{𝛼/2} , (𝑛 − 1)𝑠² / 𝜒²_{1−𝛼/2}]

As an example, suppose that fill measurements in soda cans are known to be normally distributed. An inspector is interested in testing whether there is evidence that the variance in fill measurements is less than 0.01 (in squared ounces). He takes a random sample of 10 cans and finds that the sample variance is 0.0016. We want to use these data to test the inspector’s hypothesis at a significance level 𝛼 = 0.05. We implement the steps for a left-sided hypothesis test as given above:

1. 𝐻₀: 𝜎² = 0.01 versus 𝐻𝑎: 𝜎² < 0.01

2. RR: 𝜒² < 3.33 (Note that this is read from the line with 𝑛 − 1 = 10 − 1 = 9 degrees of freedom and using 𝜒²_{1−𝛼} = 𝜒²_0.95).

3. 𝜒² = (10 − 1) ⋅ 0.0016 / 0.01 = 1.44

4. Since 𝜒² = 1.44 < 3.33 falls in the rejection region, we can reject the null hypothesis.
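The arithmetic in the soda can example can be reproduced with a short sketch. The critical value 3.33 is taken from the table as quoted above rather than computed, and the function and variable names are illustrative assumptions.

```python
def chi2_variance_statistic(sample_size, sample_variance, null_variance):
    """Test statistic (n - 1) * s^2 / sigma_0^2 for a test about one variance."""
    return (sample_size - 1) * sample_variance / null_variance


sample_size = 10
sample_variance = 0.0016
null_variance = 0.01
critical_value = 3.33  # table value for 9 degrees of freedom, lower 5% cutoff

statistic = chi2_variance_statistic(sample_size, sample_variance, null_variance)

# Left-sided test: reject when the statistic falls below the lower cutoff.
reject_null = statistic < critical_value

print(round(statistic, 2), reject_null)
```

For a right-sided test the comparison would flip to `statistic > critical_value`, with the critical value read from the upper tail of the table instead.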

A 95% confidence interval for the true variance in fill measurements is given by:

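[(10 − 1) ⋅ 0.0016 / 19.0 , (10 − 1) ⋅ 0.0016 / 2.70] = [0.000758, 0.005333]

Here 𝜒²_0.025 = 19.0 and 𝜒²_0.975 = 2.70 are read from the line with 9 degrees of freedom.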
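The interval endpoints can be reproduced as follows. This is a sketch under the assumption that the two 𝜒² cutoffs (19.0 and 2.70, for 9 degrees of freedom) are read from a table; the function name `variance_confidence_interval` is illustrative.

```python
def variance_confidence_interval(sample_size, sample_variance, chi2_upper, chi2_lower):
    """Confidence interval for a population variance, assuming a normal population.

    chi2_upper is the cutoff with area alpha/2 to its right;
    chi2_lower is the cutoff with area 1 - alpha/2 to its right.
    """
    numerator = (sample_size - 1) * sample_variance
    # Dividing by the larger cutoff gives the lower endpoint.
    return numerator / chi2_upper, numerator / chi2_lower


lower, upper = variance_confidence_interval(
    sample_size=10, sample_variance=0.0016, chi2_upper=19.0, chi2_lower=2.70
)

print(f"[{lower:.6f}, {upper:.6f}]")  # prints [0.000758, 0.005333]
```

The interval is not centered on the sample variance 0.0016, which reflects the asymmetry of the 𝜒² distribution.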

EXERCISES

1. A company produces machined engine parts that are supposed to have a diameter variance that is no greater than 0.0002 inches. A random sample of 10 parts gave a sample variance of 0.0003. Is this enough evidence to reject the null hypothesis at a 5% level of significance?

2. An experimenter was convinced that the variability in his measuring equipment yielded a variance of 4, but 16 measurements resulted in a sample variance of 6.1.

a. Determine whether there is enough evidence to reject the experimenter’s claim at significance level 𝛼 = 0.05.

b. What is the p-value associated with this test?

7.6: Tests for Equality of Variances

In the previous section, we tested whether a true population variance took on a particular value. In this section, we will compare two population variances and test whether the two variances are equal or whether there is evidence that the variances are different.

As with the previous section, these tests are valid only under the assumption that the population distributions are both normal. Even in large samples, these tests are specifically for comparing variances of normally distributed populations. There is no generally valid distribution to use for inference when dealing with non-normal populations. However, for normally distributed populations with equal variances, we use the fact that the sampling distribution for the ratio of two sample variances is known to obey the F-distribution.

The testing procedures are as given below. The idea is that we are testing whether the true variances from two different populations 𝜎₁² and 𝜎₂² are equal to each other. We can test against the right-sided alternative 𝜎₁² > 𝜎₂², the left-sided alternative 𝜎₁² < 𝜎₂² or against the two-sided alternative 𝜎₁² ≠ 𝜎₂². The test statistic is based on the observed sample variances from the two populations 𝑠₁² and 𝑠₂².

Right-Sided

1. 𝐻₀: 𝜎₁² = 𝜎₂² versus 𝐻𝑎: 𝜎₁² > 𝜎₂²

2. RR: 𝐹 > 𝐹_𝛼

3. 𝐹 = 𝑠₁² / 𝑠₂²

4. Form conclusion.

Left-Sided

1. 𝐻₀: 𝜎₁² = 𝜎₂² versus 𝐻𝑎: 𝜎₁² < 𝜎₂²

2. RR: 𝐹 > 𝐹_𝛼

3. 𝐹 = 𝑠₂² / 𝑠₁²

4. Form conclusion.

Two-Sided

1. 𝐻₀: 𝜎₁² = 𝜎₂² versus 𝐻𝑎: 𝜎₁² ≠ 𝜎₂²

2. RR: 𝐹 > 𝐹_{𝛼/2}

3. 𝐹 = (larger sample variance) / (smaller sample variance)

4. Form conclusion.

The relevant F-distribution to use for the critical values depends on the numerator and the denominator degrees of freedom. Whichever sample variance is in the numerator of the test statistic, the “numerator degrees of freedom” is the sample size used in calculating this sample variance minus one. Whichever sample variance is in the denominator of the test statistic, the “denominator degrees of freedom” is the sample size used in calculating this sample variance minus one.

For example, suppose that a manager wants to test the variability in production levels at two different feed plants. Production levels are known to be normally distributed. He records daily production levels at the two plants. At the first plant, he records production levels for 13 days and observes a mean production level of 26.3 tons with a variance of 67.24 tons. At the second plant, he records production levels for 18 days, and observes a mean production level of 19.7 tons with a variance of 22.09 tons. Is there sufficient evidence to conclude at significance level 𝛼 = 0.05 that the variability in production levels differs between the two plants?

Since no direction is specified for the alternative, it is appropriate to use a two-sided alternative. Following the procedures outlined above:

1. 𝐻₀: 𝜎₁² = 𝜎₂² versus 𝐻𝑎: 𝜎₁² ≠ 𝜎₂²

2. RR: 𝐹 > 2.82 (Note that this is read from the table giving critical values for 𝐹_{𝛼/2} = 𝐹_0.025. The larger sample variance will appear in the numerator of our test statistic, and from this first plant we have 13 observations. The smaller sample variance will appear in the denominator of our test statistic, and from this second plant we have 18 observations. Thus, the relevant critical value uses 12 degrees of freedom in the numerator and 17 degrees of freedom in the denominator).

3. 𝐹 = 67.24 / 22.09 = 3.04

4. Since the test statistic falls in the rejection region, we reject the null hypothesis.

Thus, there is sufficient evidence for the manager to reject the null hypothesis that the variability in production levels at the two plants is equal in favor of the alternative that the variability differs between the two plants.
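The bookkeeping for the two-sided test (which variance goes on top, and which degrees of freedom go with it) can be sketched as below. The critical value 2.82 is the table value quoted above for 12 and 17 degrees of freedom; the function name `two_sided_f_statistic` is an illustrative assumption.

```python
def two_sided_f_statistic(variance_a, size_a, variance_b, size_b):
    """Return (F, numerator df, denominator df) with the larger variance on top."""
    if variance_a >= variance_b:
        return variance_a / variance_b, size_a - 1, size_b - 1
    return variance_b / variance_a, size_b - 1, size_a - 1


# Plant 1: 13 days, sample variance 67.24. Plant 2: 18 days, sample variance 22.09.
f_statistic, df_numerator, df_denominator = two_sided_f_statistic(67.24, 13, 22.09, 18)

critical_value = 2.82  # table value for F_0.025 with (12, 17) degrees of freedom
reject_null = f_statistic > critical_value

print(round(f_statistic, 2), df_numerator, df_denominator, reject_null)
```

Swapping the order of the two plants in the call gives the same statistic and degrees of freedom, since the function always places the larger sample variance in the numerator.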

EXERCISES

1. Suppose that inflation rates are known to be normally distributed. A country hires a new central bank president, and some people are concerned that inflation rates are becoming more volatile. A random sample of 6 monthly inflation rates before the new president took over showed a mean of 2.4167 and a variance of 2.0618 (in percentage terms). A random sample of 5 monthly inflation rates taken after the new president took over showed a mean of 4.36 and a variance of 8.8381. Is this enough evidence to conclude at significance level 𝛼 = 0.01 that the variance in inflation rates is higher under the new president than it was before the new president took over? What can you say about the p-value of the test?

2. The closing prices of two common stocks were recorded for a period of 16 days. Stock prices are known to be normally distributed over short periods of time. The standard deviation in closing prices for the first stock was 1.24 but the variance in closing prices for the second stock was 1.72. Is this enough evidence to conclude at significance level

𝛼 = 0.05 that there is a difference in variability in closing prices of the two stocks? What