Unit 6: Hypothesis Testing and Confidence Intervals I
QBA 201 – Summer 2013
Instructor: Michael Malcolm
6.1: The logic of hypothesis testing
6.2: One-sided tests
6.3: Two-sided tests
6.4: The p-value
6.5: Calculating power
6.1: The logic of hypothesis testing
The essence of the scientific method is to make assertions and then to use available information to test whether these assertions are correct. We often say that scientific propositions should be “testable” in the sense that they can either be confirmed or refuted based on data.
In statistics, the conventional setup is to start with a null hypothesis and an alternative hypothesis. The null hypothesis (denoted 𝐻0) is what is assumed to be true before the test. On the other hand, the alternative hypothesis (denoted 𝐻𝑎) is the proposition for which the researcher is attempting to find support.
For example, suppose that the average IQ score for students in a school is known to be 110. But a researcher is investigating whether some new, innovative method of teaching can raise IQ for students in a particular class above 110. In this case, the null hypothesis is that the average IQ for students in the new classroom is 110 (the starting assumption), whereas the alternative hypothesis is that the average IQ for students in the new classroom is higher than 110 (the proposition for which the researcher is trying to find support).
The principle of hypothesis testing is to start by assuming that the null hypothesis is true, and then apply a test based on the evidence to determine whether the null hypothesis should be rejected in favor of the alternative hypothesis. A note about terminology – if there is not enough evidence to reject the null hypothesis in favor of the alternative hypothesis, it is technically a mistake to say that we “accept” the null hypothesis. Rather, we simply say that we do not reject the null hypothesis.
Now, either the true state of the world is that the null hypothesis is true or that the null hypothesis is false.
If 𝐻0 is true and our test does not reject it, then this is the correct result. But if 𝐻0 is true, and our test mistakenly rejects it, this is called a Type I error.
If 𝐻0 is false and our test rejects it, then this is the correct result. But if 𝐻0 is false and our test mistakenly does not reject it, this is called a Type II error.
We can summarize the terminology in the table below.
                          Test: Do not reject 𝑯𝟎     Test: Reject 𝑯𝟎
Truth: 𝑯𝟎 is correct      Correct result              Type I error
Truth: 𝑯𝟎 is false        Type II error               Correct result
In the example above, the null hypothesis is that the average IQ in the new classroom is equal to 110, whereas the alternative hypothesis is that the average IQ in the new classroom is higher than 110. What we would do is to apply some test and then determine whether there is enough evidence to reject the null hypothesis in favor of the alternative, i.e. to conclude that the average IQ in the new classroom is actually higher than 110.
Suppose the truth is that the null is correct and that the students’ average IQ in the new classroom is equal to 110. If our test mistakenly rejects this null hypothesis, then this is a Type I error.
On the other hand, suppose the truth is that the null is false and that students’ average IQ in the new classroom is higher than 110. If our test fails to reject the null hypothesis, then this is a Type II error.
As an additional example, consider a criminal trial. The null hypothesis is that the accused person is innocent. The alternative hypothesis is that the accused person is guilty. What the court does is to look at the evidence to determine whether there is enough evidence of guilt to reject the null hypothesis in favor of the alternative hypothesis, and then convict the accused person if this is the case. In this context, convicting an innocent person is a Type I error, while releasing a guilty person is a Type II error.
Any testing using statistics and evidence will be susceptible to potential mistakes. For example, we might administer a test to students, the students have good luck on the exam, and we conclude that the program does raise the students’ average IQ above 110 even if the truth is that it does not (Type I error). It is also possible that the program did raise the students’ average IQ above 110, but our test does not provide enough evidence to support rejecting the null (Type II error).
The probability of Type I error is denoted 𝛼. This is the probability that our test will reject the null hypothesis when the null hypothesis is actually true.
The probability of Type II error is denoted 𝛽. This is the probability that our test will fail to reject the null hypothesis when the null hypothesis is actually false.
We can now define the concepts of size and power.
The size of a test is 𝛼
The power of a test is 1 − 𝛽
In brief, size is the probability of rejecting a correct null, whereas power is the probability of rejecting a false null. An ideal test would thus have a size of zero and a power of one.
Of course, a test that attains size zero and power one is impossible in practice. In fact, an important observation is that there is typically a tradeoff between size and power. Consider our IQ example from earlier. We wanted to test to determine whether we could reject the null that the students’ average IQ is equal to 110 in favor of the alternative that it is greater than 110.
If the test is very difficult and strict, then the size will be close to zero. In other words, there is very little chance that the students’ average IQ will be equal to 110 but our test concludes that it is higher than 110. On the other hand, the power may also be low. If the test is too difficult, then even if their average IQ is greater than 110, the test might not be able to tell us because it is too difficult.
If we make the test easier, we can see the tradeoff between size and power. The power will be higher because it is more likely that students will pass the test if their average IQ is greater than 110. However, if the test is too easy then the size will also be high. In other words, even if the students’ average IQ is actually equal to 110, there is a chance that the students might pass the test anyway, which would cause us to mistakenly conclude that their average IQ is higher than 110.
Applied to the court example, if the court requires a large amount of solid evidence to convict people, then not many innocent people will be convicted, but at the same time a lot of guilty people may go free (low size but also low power). If courts are willing to convict people with less evidence, then most guilty people will be convicted but at the same time many innocent people may be convicted also (high power but also high size). There is always a tradeoff between size and power.
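The tradeoff can be made concrete with a small numerical sketch for the IQ example. It assumes, purely for illustration, that the class average is approximately normally distributed, with standard deviation 9.3, a sample of 64 students, and a true mean of 113 when the null is false. None of these values is given in this section, and the cutoffs and the helper name `size_and_power` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Illustrative assumptions, not taken from this section.
MU_0, MU_A, SIGMA, N = 110, 113, 9.3, 64
STD_ERR = SIGMA / sqrt(N)  # standard deviation of the sample mean

def size_and_power(cutoff):
    """Return (size, power) for the rule 'reject H0 when the sample mean exceeds cutoff'."""
    size = 1 - NormalDist(MU_0, STD_ERR).cdf(cutoff)   # P(reject | H0 true)
    power = 1 - NormalDist(MU_A, STD_ERR).cdf(cutoff)  # P(reject | H0 false)
    return size, power

# A stricter (higher) cutoff lowers the size, but it lowers the power too.
for cutoff in (111.0, 112.0, 113.0, 114.0):
    size, power = size_and_power(cutoff)
    print(f"cutoff {cutoff:5.1f}: size = {size:.4f}, power = {power:.4f}")
```

Raising the cutoff plays the role of a stricter court: fewer false rejections, but also fewer correct ones.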
EXERCISES
1. The mayor of a city is concerned about carbon monoxide pollution. Currently, the concentration of carbon monoxide is 12 parts per million (ppm). The mayor proposes to institute tolls on roads to determine whether this is a good way to lower carbon monoxide pollution.
a. What are the appropriate null and alternative hypotheses for this test?
b. Say in English what it means when the null hypothesis cannot be rejected.
c. Say in English what it means when the null hypothesis is rejected.
d. Say in English what a Type I error is in this situation. What are the consequences?
e. Say in English what a Type II error is in this situation. What are the consequences?
6.2: One-sided tests
The statistical implementation of a hypothesis test uses the method of contradictions. What we do is to assume that the null hypothesis is true, and then determine whether the observed data are extremely unlikely when the null hypothesis is true. If the observed data are extremely unlikely when the null hypothesis is true, then we can reject the null hypothesis. If the observed data are not too improbable when the null is true, then we do not reject the null hypothesis.
Thus, implementing a hypothesis test is actually just an application of the central limit theorem. Let us go through a test in detail for our IQ example from the previous section.
The null hypothesis is that the mean IQ in the new class is 110, whereas the alternative hypothesis is that the mean IQ in the new class is greater than 110.
𝐻0: 𝜇 = 110 𝐻𝑎: 𝜇 > 110
Note importantly that the null and alternative hypotheses are functions of the population parameter 𝜇, which is unknown. This is what we are trying to make inferences about. It is very wrong to state the hypotheses in terms of sample statistics like 𝑥̅ because these are measured based on observed data. Inference is about the unknown parameter.
We give an IQ test to a random sample of students in the new class and record the sample mean 𝑥̅. Using the central limit theorem, if we assume that the true mean is 𝜇 = 110, then the distribution of the sample mean 𝑥̅ is approximately normal and centered at 𝜇 = 110. If the actual observation of 𝑥̅ is highly unlikely under this distribution, then we reject the null hypothesis.
The upper tail of this distribution, cut off so that it contains probability 0.05, is called the rejection region. We will reject the null hypothesis when the sample mean falls in this region. The probability of a sample mean in this range is only 0.05 if the null hypothesis is actually true. Thus, observing a sample mean in this range is enough evidence for us to conclude that the null hypothesis is false and that the true mean IQ is actually greater than 110.
The probability 0.05 is the size of the test. This is the probability that the sample mean will fall into the rejection region even when the null hypothesis 𝜇 = 110 is true. This is the probability of a Type I error and is set before the test begins.
Using the z-transformation, it is easier to express this rejection region in terms of z-statistics. Using the inverse z-table, we can immediately see that 𝑃𝑟(𝑧 > 1.645) = 0.05.
Using the central limit theorem, if the null hypothesis is true, then 𝑥̅ is normally distributed with mean 𝜇0 = 110 and standard deviation 𝜎/√𝑛. Thus, the relevant z-statistic is:
𝑧 = (𝑥̅ − 𝜇0) / (𝜎 / √𝑛)
With this in mind, our testing procedure is actually quite simple. We take a random sample of students from the new class and find their sample mean IQ 𝑥̅. We then construct the z-statistic and determine whether it falls in our rejection region.
Suppose we have a random sample of 𝑛 = 64 students from this class. We want to test the null that the true mean IQ is 𝜇 = 110 versus the alternative that 𝜇 > 110. In our random sample, the sample mean IQ is 𝑥̅ = 112 and the standard deviation is 𝑠 = 9.3.
The z-statistic is:
𝑧 = (𝑥̅ − 𝜇0) / (𝜎 / √𝑛) = (112 − 110) / (9.3 / √64) = 1.72
Since our rejection region is any 𝑧 > 1.645, our conclusion is that we can reject the null hypothesis. We have sufficient evidence to reject the null that 𝜇 = 110 in favor of the alternative that the true mean IQ for the new classroom is actually 𝜇 > 110.
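The same test can be checked numerically. The following is a minimal sketch using Python's standard library; the variable names are illustrative, and the sample standard deviation 𝑠 is used in place of 𝜎, as in the example.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 110     # mean under the null hypothesis
x_bar = 112    # observed sample mean
s = 9.3        # sample standard deviation, standing in for sigma
n = 64         # sample size
alpha = 0.05   # size of the test

z_crit = NormalDist().inv_cdf(1 - alpha)   # cutoff with 0.05 probability above it
z = (x_bar - mu_0) / (s / sqrt(n))         # test statistic

print(f"z = {z:.2f}, critical value = {z_crit:.3f}")
print("reject H0" if z > z_crit else "do not reject H0")
```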
The most common size for a hypothesis test is 𝛼 = 0.05. The cutoff value that a z-statistic exceeds with probability only 0.05 is 1.645, so the rejection region is 𝑧 > 1.645. We write this succinctly as 𝑧0.05 = 1.645.
But sometimes you might want to implement a statistical test with a different size. For example, if instead of 𝛼 = 0.05 you wanted the probability of committing a Type I error (i.e. rejecting the null when it is true) to be only 𝛼 = 0.01, then the correct cutoff value for the z-statistic would be 𝑧0.01 = 2.326. Cutoff values for other commonly used sizes can be read directly from the inverse z-table in a similar manner. The lower the level of 𝛼, the stricter the test, i.e. it becomes harder to reject the null hypothesis.
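For readers without an inverse z-table at hand, the cutoff values can be reproduced in software. This sketch uses `NormalDist.inv_cdf` from Python's standard library; the helper name `z_alpha` is an illustrative choice, not notation from these notes.

```python
from statistics import NormalDist

def z_alpha(alpha):
    """Cutoff with probability alpha to its right under the standard normal."""
    return NormalDist().inv_cdf(1 - alpha)

# Smaller alpha gives a larger cutoff, i.e. a stricter test.
for alpha in (0.10, 0.05, 0.025, 0.01):
    print(f"alpha = {alpha:5.3f}: cutoff = {z_alpha(alpha):.3f}")
```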
Let us put this all together and summarize how to conduct a hypothesis test.
1. Set up the null and alternative hypotheses.
2. Formulate the rejection region.
3. Calculate the test statistic.
4. Compare the test statistic to the rejection region and formulate a conclusion.
A statistician would emphasize that it is important to do (2) before (3). Step (2) gives the rule for rejecting the null hypothesis. It’s important to decide on the rule before looking at the data in step (3). We don’t look at the data and then use this to decide what our rule should be.
For a test involving a sample mean along the lines that we have conducted in this section, the particular steps for conducting a test with size 𝛼 are:
1. 𝐻0: 𝜇 = 𝜇0 versus 𝐻𝑎: 𝜇 > 𝜇0
2. Rejection region: 𝑧 > 𝑧𝛼
3. Test statistic: 𝑧 = (𝑥̅ − 𝜇0) / (𝜎 / √𝑛)
4. Compare (2) and (3) to make a conclusion.
We can easily extend this to the case where we are interested in testing the alternative that 𝜇 is less than the hypothesized value 𝜇0. For instance, we might be interested in testing whether a pollution control program has reduced the level of pollution readings in a town. Since the normal distribution is symmetric, everything is the same except that our rejection region is in the lower tail.
Outlining the steps explicitly for a left-sided alternative hypothesis:
1. 𝐻0: 𝜇 = 𝜇0 versus 𝐻𝑎: 𝜇 < 𝜇0
2. Rejection region: 𝑧 < −𝑧𝛼
3. Test statistic: 𝑧 = (𝑥̅ − 𝜇0) / (𝜎 / √𝑛)
4. Compare (2) and (3) to make a conclusion.
Individuals filing federal tax returns had an average refund of $1056. A researcher suggests that people who file their returns during the last five days have a refund that is lower than average. He takes a random sample of 400 people who filed income returns during the last five days and finds that the sample mean refund is $910 and the standard deviation is $1600. Test the researcher’s hypothesis at the 𝛼 = 0.01 level.
Using the steps to implement a hypothesis test:
1. 𝐻0: 𝜇 = 1056 versus 𝐻𝑎: 𝜇 < 1056
2. RR: 𝑧 < −2.326
3. Test statistic: 𝑧 = (910 − 1056) / (1600 / √400) = −1.825
4. Since 𝑧 is not in the rejection region, we do not reject 𝐻0.
Thus, in this case, there is not enough evidence at the 𝛼 = 0.01 level to support the researcher’s conclusion that the mean refund for late filers is less than the average refund of $1056.
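The left-sided procedure can be sketched as a small function. The name `left_sided_z_test` and its return format are illustrative choices, not part of these notes; the sample standard deviation is again used in place of 𝜎.

```python
from math import sqrt
from statistics import NormalDist

def left_sided_z_test(x_bar, mu_0, s, n, alpha):
    """Test H0: mu = mu_0 against Ha: mu < mu_0 at size alpha.

    Returns the test statistic, the cutoff -z_alpha, and the decision.
    """
    z_crit = -NormalDist().inv_cdf(1 - alpha)   # rejection region is z < z_crit
    z = (x_bar - mu_0) / (s / sqrt(n))
    return z, z_crit, z < z_crit

# Tax refund example: n = 400, sample mean 910, s = 1600, alpha = 0.01.
z, z_crit, reject = left_sided_z_test(910, 1056, 1600, 400, 0.01)
print(f"z = {z:.3f}, cutoff = {z_crit:.3f}, reject H0: {reject}")
```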
EXERCISES
1. A marketing research company bases charges to a client on the assumption that their surveys can be conducted in a mean time of 15 minutes or less. If a longer mean survey time is necessary, then a premium rate is applied. A sample of 35 surveys shows a sample mean of 17 minutes and a standard deviation of 4 minutes. Using a significance level of 𝛼 = 0.01, is the higher rate justified?
6.3: Two-sided tests
In the previous section, we hypothesized that the mean was equal to a particular value and tested the alternative that it was higher than this value or the alternative that it was lower than this value. But sometimes you might simply be interested in whether a particular target is met or not. That is, the null hypothesis is that the mean is equal to some value and the alternative is that it is not equal to this value.
For example, suppose that a process control system requires that some manufactured component have a mean length of 280 micrometers, and you are interested in testing whether there is evidence that this standard is not being met. In this case, you are concerned about error either because the actual length is too high or too low.
Thus, the hypotheses are:
𝐻0: 𝜇 = 280 𝐻𝑎: 𝜇 ≠ 280
This is called a two-sided test, since we reject the null hypothesis when there is evidence that the mean is higher or lower than the hypothesized value.
In general, for a test with size 𝛼, the appropriate rejection region is to reject the null hypothesis whenever 𝑧 > 𝑧𝛼/2 or 𝑧 < −𝑧𝛼/2.
The other details of the test are the same as the details for a one-sided test. The only thing that is different is that the rejection region is symmetric on both sides. Here is a summary of the test.
1. 𝐻0: 𝜇 = 𝜇0 versus 𝐻𝑎: 𝜇 ≠ 𝜇0
2. Rejection region: 𝑧 > 𝑧𝛼/2 or 𝑧 < −𝑧𝛼/2
3. Test statistic: 𝑧 = (𝑥̅ − 𝜇0) / (𝜎 / √𝑛)
4. Compare (2) and (3) to make a conclusion.
For the example given at the beginning of the section, suppose we measure the lengths of a random sample of 36 components and we obtain a sample mean 𝑥̅ = 278.5 and a standard deviation 𝑠 = 12. Is this enough evidence to conclude that the required average specification 𝜇 = 280 is not being satisfied? Use a test with size 𝛼 = 0.05.
Let us conduct a hypothesis test:
1. 𝐻0: 𝜇 = 280 versus 𝐻𝑎: 𝜇 ≠ 280
2. Rejection region: 𝑧 > 1.96 or 𝑧 < −1.96
3. Test statistic: 𝑧 = (278.5 − 280) / (12 / √36) = −0.75
4. Since 𝑧 is not in the rejection region, we cannot reject the null hypothesis.
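The two-sided steps above can be sketched in a few lines. The variable names are illustrative; the only change from the one-sided case is that 𝛼 is split across both tails, so the test statistic is compared in absolute value with 𝑧𝛼/2.

```python
from math import sqrt
from statistics import NormalDist

mu_0, x_bar, s, n, alpha = 280, 278.5, 12, 36, 0.05

z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # alpha/2 in each tail
z = (x_bar - mu_0) / (s / sqrt(n))
reject = abs(z) > z_crit                       # z > z_crit or z < -z_crit

print(f"z = {z:.2f}, rejection region: |z| > {z_crit:.2f}, reject H0: {reject}")
```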
EXERCISES
1. The mean charitable contribution for American taxpayers is $1075. A researcher is investigating whether a change in demographics caused any change in the mean contribution. She takes a random sample of 200 taxpayers and finds that their mean contribution level is $1223, with a standard deviation of $840. Test the researcher’s hypothesis at a 1% level of significance.
6.4: The p-value
One weakness of the concept of hypothesis testing is that the conclusion is a simple yes/no. Either there exists enough evidence to reject the null hypothesis at the specified level of significance or there does not.
However, we might sometimes want to give an answer that provides more detail. For example, if the null hypothesis is not rejected, are we close to being able to reject at the appropriate level of significance or is the evidence very far away from what would allow us to reject the null hypothesis? If the null hypothesis is rejected, is it a “close call” or is the evidence extremely far away from what would be the case if the null hypothesis is true? The p-value provides the detail needed for answering questions like this.
Let us revisit an example from section 6.2. We were interested in testing whether the mean IQ score for students exposed to a new classroom technique was 𝜇 = 110 versus the alternative that the new program was effective and that 𝜇 > 110. We took a random sample of 𝑛 = 64 students and obtained a sample mean of 𝑥̅ = 112 and a standard deviation of 𝑠 = 9.3. This was enough evidence to reject the null hypothesis at a significance level 𝛼 = 0.05, with the steps being as follows.
1. 𝐻0: 𝜇 = 110 versus 𝐻𝑎: 𝜇 > 110
2. Rejection region: 𝑧 > 1.645
3. Test statistic: 𝑧 = (112 − 110) / (9.3 / √64) = 1.72
4. Since 𝑧 falls in the rejection region, we reject 𝐻0 in favor of 𝐻𝑎.
However, suppose that we had tested at a significance level 𝛼 = 0.01. In other words, the test was stricter in the sense that we needed more evidence to reject the null. In that case, you can see from the inverse z-table that the rejection region would have been 𝑧 > 2.326. In this case, the test statistic would not have fallen in the rejection region. So, although the sample provides enough evidence to reject the null hypothesis at a significance level 𝛼 = 0.05, it does not provide enough evidence to reject the null hypothesis at a significance level 𝛼 = 0.01.
This might be useful information to provide. If the researcher simply tells readers that the null hypothesis is rejected at 𝛼 = 0.05, then the reader has no idea whether the evidence is extremely strong and that the null would have been rejected at lower significance levels, or whether the evidence is only borderline.
The p-value of a test is the smallest significance level at which the null hypothesis can be rejected. In the example above, the null hypothesis can be rejected when the significance level is 𝛼 = 0.05 but it cannot be rejected when the significance level is 𝛼 = 0.01, so the p-value lies between these two levels. Thus, extremely low p-values close to zero provide strong evidence for rejecting the null hypothesis. For example, if the p-value is 𝑝 = 0.0034, then the null hypothesis can be rejected at all normal levels of significance such as 𝛼 = 0.05 and 𝛼 = 0.01.
For the example given in this section, we know that the p-value is somewhere between 0.01 and 0.05, since the test statistic 𝑧 = 1.72 was in the rejection region for 𝛼 = 0.05 (which is 𝑧 > 1.645) but was not in the rejection region for 𝛼 = 0.01 (which is 𝑧 > 2.326).
Computing the p-value exactly is easy. Since the test statistic is 𝑧 = 1.72, we can read from the normal table that 𝑃(𝑧 > 1.72) = 0.0427. Thus, this test statistic would be captured in the rejection region for any 𝛼 > 0.0427, which is exactly the definition of the p-value. So, for this test, the p-value is 𝑝 = 0.0427. There is enough evidence to reject the null hypothesis as long as the significance level is 𝛼 > 0.0427, but not lower than that. Thus, we can reject the null at the 5% level of significance but not at the 1% level of significance.
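The link between the p-value and the rejection region can be checked numerically. In this sketch the z-statistic is rounded to two decimals, as it would be when reading a z-table; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()                        # standard normal distribution
z = round((112 - 110) / (9.3 / sqrt(64)), 2)     # 1.72, rounded as in a z-table
p_value = 1 - std_normal.cdf(z)                  # area to the right of z

print(f"p-value = {p_value:.4f}")
for alpha in (0.10, 0.05, 0.01):
    z_crit = std_normal.inv_cdf(1 - alpha)
    # "z is in the rejection region" and "p-value < alpha" are the same decision.
    print(f"alpha = {alpha:.2f}: z > cutoff is {z > z_crit}, p < alpha is {p_value < alpha}")
```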
The way to think about the p-value informally is that it is the probability of seeing data at least as extreme as the observed data if the null hypothesis is true. For the example above, when the null hypothesis is true and 𝜇 = 110, there was a 4.27% chance of seeing a sample mean IQ of 𝑥̅ = 112 or higher. Thus, we can conclude at the 5% level that the null hypothesis is false, but we cannot make this conclusion at the 1% level, since a sample mean of 𝑥̅ = 112 or higher occurs 4.27% of the time even when the null is true.
For a left-sided test (i.e. when the alternative hypothesis is 𝜇 < 𝜇0), the idea is the same, but you need to find the probability to the left of the z-statistic since the rejection region is on the lower end. For example, recall from unit 6.2 the following example.
Individuals filing federal tax returns had an average refund of $1056. A researcher suggests that people who file their returns during the last five days have a refund that is lower than average. He takes a random sample of 400 people who filed income returns during the last five days and finds that the sample mean refund is $910 and the standard deviation is $1600. Find the p-value for this hypothesis test.
The hypotheses are 𝐻0: 𝜇 = 1056 versus 𝐻𝑎: 𝜇 < 1056. The z-statistic for our test is:
𝑧 = (910 − 1056) / (1600 / √400) = −1.825 ≈ −1.83
Using the z-table, 𝑃(𝑧 < −1.83) = 0.0336, which is our p-value. This agrees with our solution in that section that we could not reject the null hypothesis at 𝛼 = 0.01. The p-value 𝑝 = 0.0336 means that the null hypothesis can be rejected for significance levels higher than 0.0336, but not lower. Informally, we can say that if the true mean refund were $1056 (the null hypothesis), then the probability of observing a sample mean of $910 or lower is 3.36%.
For a two-sided test, note that the rejection region has to fall symmetrically on both sides. Recall the example from unit 6.3. We wanted to test whether the mean specification was 𝜇 = 280 versus the alternative that 𝜇 ≠ 280. We took a sample with 36 components, and obtained 𝑥̅ = 278.5 and 𝑠 = 12. The z-statistic for testing these hypotheses is:
𝑧 = (278.5 − 280) / (12 / √36) = −0.75
In computing the p-value, note that the rejection region would have to fall symmetrically on both sides in order to capture 𝑧 = −0.75. Thus, the p-value is 𝑃(𝑧 < −0.75) + 𝑃(𝑧 > 0.75). But since the normal distribution is symmetric, this can be calculated as 2𝑃(𝑧 < −0.75). Thus, the p-value for this test is 𝑝 = 2 ⋅ 0.2266 = 0.4532. This means that the data provide no evidence for rejecting the null hypothesis. It cannot be rejected for any conventional levels of significance like 𝛼 = 0.05. Informally, even if the null is correct and the true mean is 𝜇 = 280, there is still a 45.32% chance of observing a sample mean that is as far away from 280 as our sample mean is. Thus, there is not a strong case for rejecting the null hypothesis.
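The three p-value calculations in this section differ only in which tail area is taken. One way to sketch them together is shown below; the function name `p_value` and the labels "greater", "less" and "two-sided" are illustrative choices, not notation from these notes.

```python
from statistics import NormalDist

def p_value(z, alternative):
    """p-value of a z-test given the test statistic and the form of Ha."""
    std_normal = NormalDist()
    if alternative == "greater":      # Ha: mu > mu_0, area to the right of z
        return 1 - std_normal.cdf(z)
    if alternative == "less":         # Ha: mu < mu_0, area to the left of z
        return std_normal.cdf(z)
    if alternative == "two-sided":    # Ha: mu != mu_0, both tails beyond |z|
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")

print(f"IQ example:        {p_value(1.72, 'greater'):.4f}")
print(f"Tax refund example: {p_value(-1.83, 'less'):.4f}")
print(f"Component example:  {p_value(-0.75, 'two-sided'):.4f}")
```

Software carries more decimal places than a printed z-table, so the last digit can differ slightly from the table-based values in the text.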
We will summarize below the method for computing p-values. If we let the calculated z-statistic for the hypothesis test be 𝑧̃, then the p-value is calculated as such:
For a right-sided test with alternative hypothesis 𝜇 > 𝜇0: 𝑝 = 𝑃(𝑧 > 𝑧̃)
For a left-sided test with alternative hypothesis 𝜇 < 𝜇0: 𝑝 = 𝑃(𝑧 < 𝑧̃)
For a two-sided test with alternative hypothesis 𝜇 ≠ 𝜇0: 𝑝 = 2𝑃(𝑧 > |𝑧̃|)
EXERCISES
The following problems are from units 6.2 and 6.3. For each example, compute and interpret the p-value for the hypothesis test given.
1. A marketing research company bases charges to a client on the assumption that their surveys can be conducted in a mean time of 15 minutes or less. If a longer mean survey time is necessary, then a premium rate is applied. A sample of 35 surveys shows a sample mean of 17 minutes and a standard deviation of 4 minutes. You are interested in whether the true mean time is greater than 15 minutes.
2. The mean selling price for new one-family houses in a particular town is $181,900. But a sample of 40 houses from a particular neighborhood showed a mean selling price of $166,400 and a standard deviation of $33,500. You are interested in whether the mean selling price in this neighborhood is lower than the average.
3. The mean charitable contribution for American taxpayers is $1075. A researcher is investigating whether a change in demographics caused any change in the mean contribution. She takes a random sample of 200 taxpayers and finds that their mean contribution level is $1223, with a standard deviation of $840. You are interested in whether the change in demographics changed the mean contribution level.
6.5: Calculating power
Our analysis in the previous sections dealt only with Type I error. That is, we set the significance level of the test so that there was a low probability of rejecting the null hypothesis when the null hypothesis is actually true. This significance level 𝛼 is the size of the test.
But what about the accuracy of the test on the other end? That is, if the null hypothesis is actually false, what is the probability that our test will actually be able to reject it? This is the power of a statistical test.
The simple answer is that it depends on the particular choice of the alternative. For example, suppose we are testing the null that 𝜇 = 110 versus the alternative that 𝜇 > 110. If the true mean IQ for students in the new classroom setting has risen to 𝜇𝑎 = 140, then the power of our test will be very high. In other words, if the IQ score really has risen this much, then the likelihood is very high that the students’ average score on the test will be high enough to reject the null hypothesis.
On the other hand, if the true mean IQ has risen only to 𝜇𝑎 = 111, then the power of our test will be quite low. Even though the null should be rejected since the alternative 𝜇 > 110 is true, it will be more difficult to pick this up with our test since the students’ IQ is only 𝜇𝑎 = 111. In other words, the students might not score high enough on the test to allow us to reject the null hypothesis.
To apply this question to the IQ example, suppose that the new classroom technique actually raises the IQ of the students to 𝜇𝑎 = 113. What is the power of the test? In other words, what is the probability that we will be able to reject the null hypothesis?
The null hypothesis is 𝜇 = 110 and the alternative is 𝜇 > 110. Since the test is conducted at a significance level of 𝛼 = 0.05, recall that the rejection region is 𝑧 > 1.645. Our test had 𝑛 = 64 and the standard deviation was estimated to be 𝑠 = 9.3. With this in mind, what are the specific values of 𝑥̅ that would lead to rejection of the null hypothesis with our test at 𝛼 = 0.05? We can simply substitute in the definition of the test statistic:
𝑧 > 1.645
(𝑥̅ − 110) / (9.3 / √64) > 1.645
𝑥̅ > 111.91
If the true mean IQ has risen to 𝜇𝑎 = 113, then we can use the central limit theorem to determine the probability that the sample mean will fall into the rejection region above.
𝑃(𝑥̅ > 111.91 | 𝜇 = 113) = 𝑃(𝑧 > (111.91 − 113) / (9.3 / √64))
= 𝑃(𝑧 > −0.94) = 0.8264
So, when the true mean IQ is 𝜇𝑎 = 113 then the power of our statistical test is 0.8264. In words, there is an 82.64% chance when 𝜇𝑎 = 113 that the sample mean IQ will fall into the rejection region that allows us to (correctly) reject the null hypothesis that 𝜇 = 110.
But what if the true mean IQ of students in the new classroom is actually 𝜇𝑎 = 130? In this case, the probability that the sample mean 𝑥̅ will fall in the rejection region is:
𝑃(𝑥̅ > 111.91 | 𝜇 = 130) = 𝑃(𝑧 > (111.91 − 130) / (9.3 / √64))
= 𝑃(𝑧 > −15.56) ≈ 1
In this case, when the true mean IQ of students is 𝜇𝑎 = 130, then we are almost 100% certain that the sample mean on the test that they take will put us in the rejection region that allows us to reject the null hypothesis that 𝜇 = 110.
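The two power calculations above follow the same two steps: translate the rejection region into a cutoff for 𝑥̅, then find the probability of exceeding that cutoff when the true mean is 𝜇𝑎. A minimal sketch, with the illustrative name `power_right_sided`:

```python
from math import sqrt
from statistics import NormalDist

def power_right_sided(mu_0, mu_a, sigma, n, alpha):
    """Power of the size-alpha test of H0: mu = mu_0 vs Ha: mu > mu_0 when mu = mu_a."""
    std_err = sigma / sqrt(n)
    # Step 1: smallest sample mean that lands in the rejection region.
    x_crit = mu_0 + NormalDist().inv_cdf(1 - alpha) * std_err
    # Step 2: probability of exceeding it when the true mean is mu_a.
    return 1 - NormalDist(mu_a, std_err).cdf(x_crit)

for mu_a in (111, 113, 130):
    print(f"true mean {mu_a}: power = {power_right_sided(110, mu_a, 9.3, 64, 0.05):.4f}")
```

Because the code does not round 𝑧 to two decimals, the result for 𝜇𝑎 = 113 differs slightly in the third decimal place from the table-based value 0.8264.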
The logic is the same to calculate power for two-tailed tests, but it is important to include both tails of the rejection region in the calculations.
To revisit the example from section 6.3, recall that the null was that the mean component length is 𝜇 = 280 as given in the specifications and the alternative was that the mean is 𝜇 ≠ 280. For a test with 𝛼 = 0.05, the rejection region for this test was 𝑧 > 1.96 or 𝑧 < −1.96. The sample size for the test is 𝑛 = 36 and the estimate for the standard deviation is 𝑠 = 12.
Suppose that the machine is not calibrated properly, and that the mean component length is actually 𝜇𝑎 = 275. What is the power of the test for rejecting the null 𝜇 = 280 in favor of the alternative 𝜇 ≠ 280?
𝑧 > 1.96 or 𝑧 < −1.96
(𝑥̅ − 280) / (12 / √36) > 1.96 or (𝑥̅ − 280) / (12 / √36) < −1.96
𝑥̅ > 283.92 or 𝑥̅ < 276.08
The power of the test against the alternative that 𝜇𝑎 = 275 is determined by finding the probability that 𝑥̅ will fall into the rejection region given above when the true mean is 𝜇𝑎 = 275.
Note that we need to add the probability of being in both rejection regions.
𝑃(𝑥̅ > 283.92 | 𝜇 = 275) + 𝑃(𝑥̅ < 276.08 | 𝜇 = 275)
= 𝑃(𝑧 > (283.92 − 275) / (12 / √36)) + 𝑃(𝑧 < (276.08 − 275) / (12 / √36))
= 𝑃(𝑧 > 4.46) + 𝑃(𝑧 < 0.54) ≈ 0 + 0.7054
= 0.7054
In this case, when the machine is actually calibrated to 𝜇𝑎 = 275, the probability that our testing procedure will lead to a sample mean that allows us to reject the null hypothesis is 0.7054.
To emphasize one more time, the power of a test depends upon the particular value of the alternative that we consider. For testing 𝜇 = 280 against 𝜇 ≠ 280, the power of our test against the alternative that the true mean is 𝜇𝑎 = 275 is different from its power against the alternative that the true mean is 𝜇𝑎 = 250. The farther the true mean is from the hypothesized mean, the higher the power of the test.
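For the two-sided component example, the dependence of power on the true mean can be sketched as follows. The function name `power_two_sided` and the extra alternative 𝜇𝑎 = 279 are illustrative additions; the values 275 and 250 come from the discussion above.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sided(mu_0, mu_a, sigma, n, alpha):
    """Power of the size-alpha test of H0: mu = mu_0 vs Ha: mu != mu_0 when mu = mu_a."""
    std_err = sigma / sqrt(n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    upper = mu_0 + z_crit * std_err   # reject when the sample mean is above this
    lower = mu_0 - z_crit * std_err   # or below this
    sampling = NormalDist(mu_a, std_err)   # distribution of the sample mean when mu = mu_a
    return (1 - sampling.cdf(upper)) + sampling.cdf(lower)

for mu_a in (279, 275, 250):
    print(f"true mean {mu_a}: power = {power_two_sided(280, mu_a, 12, 36, 0.05):.4f}")
```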
EXERCISES
1. A marketing research company bases charges to a client on the assumption that their surveys can be conducted in a mean time of 15 minutes or less. If a longer mean survey time is necessary, then a premium rate is applied. The standard deviation in survey administration time is 4 minutes. You are interested in testing whether the mean administration time is greater than 15 minutes. The size of your test is 𝛼 = 0.01.
a. If you take a random sample of 35 surveys, what is the power of this test for testing against the alternative that the mean survey administration time is actually 16 minutes?
b. What would the answer to (a) be if your sample size were 1000 surveys instead of only 35 surveys?
2. The mean selling price for new one-family houses in a particular town is $181,900 and the standard deviation in selling prices is $33,500. But you are interested in testing at a 5% level of significance whether the mean selling price for houses in a particular neighborhood is lower than this mean. Suppose you take a sample of 40 houses.
a. What is the power of your test against the alternative that the mean selling price is $175,000?
b. What is the power of your test against the alternative that the mean selling price is $150,000?
6.6: Determining the sample size
From the previous unit, we can see that tests with an acceptable significance level might be low-powered. In other words, if the size of a test is set so that there is only an 𝛼 = 0.05 probability of rejecting the null when it is true, the difficulty is that there might be a low probability of rejecting the null hypothesis even when it is false. This is particularly true when the null and the particular alternative are fairly close to each other. For example, if we are testing the null that the true mean IQ for students in a new classroom is 𝜇 = 110 and their mean IQ is actually 𝜇𝑎 = 112, it might not be very likely that our test is able to detect this difference with statistical significance.
The basic solution for raising the power of a test is to increase the sample size. Distinguishing a mean IQ 𝜇 = 110 from an alternative mean 𝜇𝑎 = 112 in a reliable way is a lot easier if the sample size is 𝑛 = 1000 than if the sample size is 𝑛 = 20. Luckily, there is a very simple formula for determining the sample size needed to attain a certain size and power in a statistical test.
Recall from unit 6.1 that 𝛼 measures the probability of Type I error and that 𝛽 measures the probability of Type II error. The power of a test is 1 − 𝛽. For example, if we want a test to have 90% power, then the required probability of Type II error is 𝛽 = 0.10.
The sample size needed for a one-sided hypothesis test with size 𝛼 and power 1 − 𝛽, with null hypothesis 𝜇0 against the particular alternative 𝜇𝑎 is as follows:
$$n = \frac{(z_\alpha + z_\beta)^2 \sigma^2}{(\mu_0 - \mu_a)^2}$$
For a two-sided hypothesis test, the necessary sample size is:
$$n = \frac{(z_{\alpha/2} + z_\beta)^2 \sigma^2}{(\mu_0 - \mu_a)^2}$$
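A brief sketch of where the one-sided formula comes from may help. It assumes the upper-tailed case 𝜇𝑎 > 𝜇0 with 𝜎 known, writes 𝑍 for a standard normal variable, and uses the usual rejection rule (reject when the z-statistic exceeds 𝑧𝛼). This is a standard textbook argument offered as an illustration, not a derivation taken from these notes.

```latex
\begin{align*}
\text{Reject } H_0 \text{ when} \quad
  & \bar{x} > \mu_0 + z_\alpha \frac{\sigma}{\sqrt{n}} \\
\text{Power at } \mu_a: \quad
  & P\!\left(\bar{x} > \mu_0 + z_\alpha \frac{\sigma}{\sqrt{n}} \,\middle|\, \mu = \mu_a\right)
    = P\!\left(Z > z_\alpha - \frac{(\mu_a - \mu_0)\sqrt{n}}{\sigma}\right) = 1 - \beta \\
\Rightarrow \quad
  & z_\alpha - \frac{(\mu_a - \mu_0)\sqrt{n}}{\sigma} = -z_\beta \\
\Rightarrow \quad
  & n = \frac{(z_\alpha + z_\beta)^2 \sigma^2}{(\mu_0 - \mu_a)^2}
\end{align*}
```

The two-sided version replaces 𝑧𝛼 with 𝑧𝛼/2. It is an approximation, since it ignores the small chance of rejecting in the tail opposite the alternative.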
Going back to the IQ example from earlier sections, recall that we were testing 𝜇 = 110 against the alternative that 𝜇 > 110. The estimate for the standard deviation was 𝑠 = 9.3. We had a sample size of 𝑛 = 64 and tested the claim at a significance level of 𝛼 = 0.05. In the previous unit, we determined that, if the true mean IQ were 𝜇𝑎 = 113, then the power of the test was 0.8264 for testing against this particular alternative. In other words, there was an 82.64% chance of rejecting the null hypothesis if the mean IQ had actually been increased to 113.
Suppose the person funding the research project tells you that this is too low, and he wants you to sample enough students so that the power is 0.9. In other words, he wants a 90% chance of rejecting the null when it is false. We can compute the required sample size as given below. Note that 𝛼 = 0.05 is given and 𝛽 = 0.10 is the desired probability of Type II error for the test.
$$n = \frac{(z_\alpha + z_\beta)^2 \sigma^2}{(\mu_0 - \mu_a)^2} = \frac{(1.645 + 1.282)^2 \cdot 9.3^2}{(110 - 113)^2} = 82.33$$
We always round these up, so 𝑛 = 83 is our answer. In order to attain the desired 90% power for the test, we need to increase our sample size to 83 students.
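The calculation above can be reproduced with a short Python sketch. The function name `required_sample_size` and its parameters are illustrative choices rather than anything defined in these notes, and the critical values come from the standard library's `statistics.NormalDist` instead of a z-table, so the unrounded intermediate value differs slightly from the hand calculation.

```python
import math
from statistics import NormalDist


def required_sample_size(mu0, mu_a, sigma, alpha, power, two_sided=False):
    """Smallest n giving the requested size and power for a z-test on a mean."""
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    # Upper-tail critical value: z_alpha (one-sided) or z_{alpha/2} (two-sided).
    tail = alpha / 2 if two_sided else alpha
    z_alpha = std_normal.inv_cdf(1 - tail)
    # z_beta leaves probability beta = 1 - power in the upper tail.
    z_beta = std_normal.inv_cdf(power)
    n_exact = (z_alpha + z_beta) ** 2 * sigma ** 2 / (mu0 - mu_a) ** 2
    return math.ceil(n_exact)  # always round up to a whole observation


# IQ example from the text: mu0 = 110, mu_a = 113, s = 9.3, alpha = 0.05, power = 0.90
n_one_sided = required_sample_size(110, 113, 9.3, alpha=0.05, power=0.90)
n_two_sided = required_sample_size(110, 113, 9.3, alpha=0.05, power=0.90, two_sided=True)
print(n_one_sided, n_two_sided)
```

The one-sided call returns 83, matching the answer above. The two-sided call returns a larger value, because 𝑧𝛼/2 is larger than 𝑧𝛼.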
To understand why the formula makes sense, note that as 𝛼 and 𝛽 fall (that is, as we demand a more accurate test), 𝑧𝛼 and 𝑧𝛽 rise, which makes the necessary sample size 𝑛 rise as well. As the variability 𝜎² rises, we need a larger sample size. This makes sense: it is more difficult to do reliable estimation when there is a lot of inherent variability. Finally, as the distance between the null value and the alternative value |𝜇0 − 𝜇𝑎| rises, the required sample size falls. As we discussed in the previous section, it is easier statistically to detect that the null hypothesis is false when the true mean is far from the null value.
EXERCISES
1. A researcher is testing whether the mean fuel efficiency in new SUVs meets the new environmental requirement 𝜇 = 10 versus the alternative that 𝜇 < 10. Assume that the standard deviation is 5. He tests this hypothesis at a significance level of 𝛼 = 0.05. What sample size should the researcher use in order to attain 90% power against the alternative that 𝜇 = 9?
6.7: Confidence intervals
The sample mean 𝑥̅ provides a point estimate for the parameter 𝜇 in the sense that it represents a good guess (unbiased, consistent and efficient) using the data about the true value of 𝜇.
The difficulty is that point estimates do not capture the level of uncertainty that is inherent in parameter estimation. Are we fairly certain that 𝜇 is close to 𝑥̅, or is our estimate less reliable? For questions like this, it can be useful to give an interval estimate for the parameter 𝜇, which provides information about the precision of the estimate.
A 1 − 𝛼 confidence interval for a population mean provides a formula that, on average, captures the true population mean with probability 1 − 𝛼. For example, a formula for a 95% confidence interval is constructed so that it contains the true population mean with probability 0.95. Here, 0.95 is called the confidence level.
In order to construct a confidence interval, note that for the standard normal distribution, there is probability 1 − 𝛼 that 𝑧 will fall in the interval [−𝑧𝛼/2, 𝑧𝛼/2]. For example, there is probability 1 − 𝛼 = 1 − 0.05 = 0.95 that 𝑧 will fall in the interval [−𝑧0.025, 𝑧0.025] = [−1.96, 1.96], since this interval excludes the 2.5% of probability mass contained in each of the left and right tails. Writing formally:
$$P\left(-z_{\alpha/2} \le z \le z_{\alpha/2}\right) = 1 - \alpha$$
Now, using the central limit theorem, as long as the sample size 𝑛 is sufficiently large, the z-statistic 𝑧 = (𝑥̅ − 𝜇)/(𝜎/√𝑛) follows a standard normal distribution. Thus, we can substitute it into the probability statement above:
$$P\left(-z_{\alpha/2} \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right) = 1 - \alpha$$
Multiplying through the inequality by 𝜎/√𝑛 and rearranging gives:
$$P\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$$
But this is exactly what we wanted. By construction, the interval [𝑥̅ − 𝑧𝛼/2 (𝜎/√𝑛), 𝑥̅ + 𝑧𝛼/2 (𝜎/√𝑛)] captures the true population mean 𝜇 with probability 1 − 𝛼.
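90% confidence interval:
$$\left[\bar{x} - z_{0.05}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{0.05}\frac{\sigma}{\sqrt{n}}\right] = \left[\bar{x} - 1.645\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.645\frac{\sigma}{\sqrt{n}}\right]$$
95% confidence interval:
$$\left[\bar{x} - z_{0.025}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{0.025}\frac{\sigma}{\sqrt{n}}\right] = \left[\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right]$$
99% confidence interval:
$$\left[\bar{x} - z_{0.005}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{0.005}\frac{\sigma}{\sqrt{n}}\right] = \left[\bar{x} - 2.576\frac{\sigma}{\sqrt{n}},\ \bar{x} + 2.576\frac{\sigma}{\sqrt{n}}\right]$$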
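The critical values 1.645, 1.96 and 2.576 used in these intervals can be checked numerically. The sketch below is one way to do so, assuming Python's standard library `statistics.NormalDist`; the helper name `z_critical` is a hypothetical choice made for this illustration.

```python
from statistics import NormalDist


def z_critical(confidence):
    """Return z_{alpha/2} for a two-sided interval at the given confidence level."""
    alpha = 1 - confidence
    # z_{alpha/2} leaves probability alpha/2 in the upper tail of the standard normal.
    return NormalDist().inv_cdf(1 - alpha / 2)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_critical(level):.3f}")
```

Rounded to three decimals, these agree with the table values used in the formulas above.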
One important point about these formulas is that, although the formulas technically depend on the population standard deviation 𝜎, this is generally unknown in practice, and so we substitute the sample standard deviation 𝑠 in its place. As long as the sample size is large enough, this is not a problem since 𝑠 is a consistent estimator of 𝜎.
A second point is to observe that as the level of confidence increases, our confidence intervals get wider. This makes good intuitive sense. If we want to be 99% sure that the true population mean is contained in our interval, the interval is going to have to be wide. But, if we only want to be 90% sure that the interval captures the mean, then we can get away with a narrower interval. This shows that higher confidence levels are not necessarily “better”. Although there is a higher probability that the interval will contain the true 𝜇, a confidence interval with an extremely high confidence level may be so wide that it is practically uninformative. For example, I am almost 100% sure that the average IQ among a group of schoolchildren is in the interval [50,150], but this doesn’t really tell me very much. If I can live with a 95% confidence level, I might be able to get the interval narrowed down to [108,113], which is much more informative.
Our confidence interval for the true population mean is 𝑥̅ ± 𝑧𝛼/2 (𝜎/√𝑛). This “window” around the sample mean of 𝑧𝛼/2 (𝜎/√𝑛) is sometimes called the margin of error. For example, you might hear that a survey estimates that the mean MPG for new sports utility vehicles is 10 ± 1.6, where 1.6 is the margin of error. This is basically a confidence interval, and the reported margin of error is relative to whatever confidence level the researchers use for the confidence interval.
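If we want the margin of error to equal a target value 𝑀, we can solve for the required sample size:
$$z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = M \ \Rightarrow\ n = \left(\frac{z_{\alpha/2}\,\sigma}{M}\right)^2$$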
As usual, this should always be rounded up in order to estimate the necessary sample size.
To illustrate what we have learned, suppose we take a random sample of 𝑛 = 85 households and we find that the sample mean debt level for the households is 𝑥̅ = 5900 with a sample standard deviation 𝑠 = 3058. Let us construct a 95% confidence interval for the true mean level of debt held by households in this population. From above, the formula is [𝑥̅ − 1.96 (𝑠/√𝑛), 𝑥̅ + 1.96 (𝑠/√𝑛)], and so substituting given values, our confidence interval is:
$$\left[5900 - 1.96\,\frac{3058}{\sqrt{85}},\ 5900 + 1.96\,\frac{3058}{\sqrt{85}}\right] \approx [5250,\ 6550]$$
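As a cross-check on the arithmetic, the interval can be computed with a few lines of Python. This is a minimal sketch: the function name `mean_confidence_interval` is hypothetical, it uses 𝑠 in place of 𝜎 as the text does, and it takes the critical value from `statistics.NormalDist` rather than the rounded 1.96, so the endpoints differ from the hand calculation by a few cents.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(x_bar, s, n, confidence=0.95):
    """Large-sample z interval for a population mean, using s in place of sigma."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    margin = z * s / math.sqrt(n)  # margin of error
    return x_bar - margin, x_bar + margin


# Household debt example from the text: n = 85, x_bar = 5900, s = 3058
low, high = mean_confidence_interval(5900, 3058, 85, confidence=0.95)
print(f"[{low:.2f}, {high:.2f}]")
```

Passing `confidence=0.99` to the same function gives the wider interval for a higher confidence level.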
There is one technical point to make here. It is wrong to interpret the 95% confidence interval [5249.89, 6550.11] by saying that the probability that 𝜇 is in this interval is 0.95. The population mean 𝜇 has a particular value, and so the given interval either contains 𝜇 or it does not.
To clarify the point, the interval [𝑥̅ − 1.96 (𝑠/√𝑛), 𝑥̅ + 1.96 (𝑠/√𝑛)] is designed so that 95% of the time, this interval will capture 𝜇. So a better statement is that, with repeated samples, approximately 95% of confidence intervals constructed using sample data would contain the true population mean 𝜇. However, for a given sample, once we plug numbers into the formula, it is meaningless to say that there is a 95% chance that 𝜇 is contained in some particular interval since it either is or it is not (and we don’t know which).
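The repeated-sampling interpretation can be illustrated with a small simulation. The sketch below rests on assumptions made purely for illustration: it pretends the population is normal with a known mean of 5900 and standard deviation of 3058 (in practice 𝜇 is unknown), draws many samples of size 85, and counts how often the 95% interval contains 𝜇. The function name `coverage_rate` and the fixed seed are hypothetical choices.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def coverage_rate(mu, sigma, n, confidence=0.95, trials=2000, seed=0):
    """Fraction of simulated samples whose z interval contains the true mean mu."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    hits = 0
    for _ in range(trials):
        # Draw one sample of size n from the assumed normal population.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = mean(sample)
        margin = z * stdev(sample) / math.sqrt(n)
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / trials


rate = coverage_rate(mu=5900, sigma=3058, n=85)
print(f"Share of intervals containing mu: {rate:.3f}")
```

The share should come out close to 0.95. Each individual interval either contains 𝜇 or does not; the 95% describes the procedure across samples.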
Now, if we increased the confidence level to 99%, our confidence interval would be:
$$\left[5900 - 2.576\,\frac{3058}{\sqrt{85}},\ 5900 + 2.576\,\frac{3058}{\sqrt{85}}\right] \approx [5046,\ 6754]$$
As expected, this is wider than the 95% confidence interval. If we want to increase the confidence level for our interval estimate, we have to be willing to tolerate guesses which aren’t quite as precise.
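Finally, suppose we wanted a 99% confidence level, but we also wanted to reduce the margin of error for our estimate to $500. The required sample size can be calculated as follows:
$$n = \left(\frac{z_{\alpha/2}\,\sigma}{M}\right)^2 = \left(\frac{2.576 \cdot 3058}{500}\right)^2 = 248.21$$
Rounding up, we would need a sample of 249 households.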
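The same margin-of-error calculation can be written as a short Python sketch. The function name `sample_size_for_margin` is a hypothetical choice, and 𝑠 = 3058 stands in for 𝜎 as in the text. Because the critical value comes from `statistics.NormalDist` rather than the rounded 2.576, the unrounded value of 𝑛 differs slightly from 248.21, but the rounded-up answer is the same.

```python
import math
from statistics import NormalDist


def sample_size_for_margin(sigma, margin, confidence):
    """Smallest n whose z-interval margin of error is at most `margin`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    return math.ceil((z * sigma / margin) ** 2)  # always round up


# Household debt example: s = 3058, target margin M = 500, 99% confidence
n_needed = sample_size_for_margin(3058, 500, 0.99)
print(n_needed)
```

Halving the target margin roughly quadruples the required sample size, since 𝑀 enters the formula squared.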
EXERCISES
1. A marketing researcher wants to study the average amount of money spent on dinner by patrons of a particular seafood restaurant chain. He samples 49 customers and obtains a mean expenditure of $34.80 with a standard deviation of $5.
a. Give a 95% confidence interval for the mean expenditure level.
b. What is the margin of error?
c. How many customers should he sample to lower the margin of error to $0.50?
2. Nielsen Media Research reports that mean weekly television viewing time, based on a random sample of 180 American households, is 7.75 hours, with a standard deviation of 3.45 hours.