Chapter 11: Testing a Claim
Section 11.1 & 11.2: Significance Tests: The Basics
Significance Testing (Hypothesis Testing)
Confidence intervals are a common type of statistical inference, used when the goal is to estimate a population parameter from sample data. Another form of statistical inference is significance testing (also known as hypothesis testing). Significance tests are used to assess whether the data provide enough evidence for some claim about the population. The theory behind significance testing is very similar to that of confidence intervals in its use of the central limit theorem. Significance tests follow a specific procedure, and every step of it is important.
The theory behind hypothesis testing:
Example: After a heated debate over grade inflation with his class, Mr. Ruggieri claims that teachers at DHS tend to inflate their grades in order to keep students from complaining. The class disagrees and thinks that grades are not inflated and that the average GPA in the school is only 3.0. In order to test Mr. Ruggieri’s claim of grade inflation, an SRS of 40 DHS students was taken and the average GPA from that sample was found to be $\bar{x} = 3.3$ (let’s assume that grades at DHS have a standard deviation of $\sigma = 0.9$). Is this enough evidence to suggest that the average DHS grade is in fact above 3.0 (thus constituting grade inflation)?
Solution: Keep in mind that our sample is only one of many possible samples. The sampling distribution of all sample means will be approximately normal (due to the CLT) with a mean of $\mu_{\bar{x}} = \mu$ and a standard deviation of $\sigma_{\bar{x}} = \sigma/\sqrt{n} = 0.9/\sqrt{40} \approx 0.142$. So if we were to take another sample of the same size, our $\bar{x}$ might be different. The question therefore is this: If, in fact, the mean GPA at DHS is 3.0 (as the students claim) and the standard deviation is 0.9, how likely are we to get an $\bar{x}$ of 3.3 or larger? Is the likelihood so small that we can’t attribute it to chance variation? Or is 3.3 a “reasonable” sample mean to get when the true mean is 3.0?
We can answer this question by sketching the distribution of sample means below and finding the shaded area. This shaded area represents the probability that we would get a mean as large as 3.3 (or larger):
$$z = \frac{3.3 - 3.0}{0.9/\sqrt{40}} = 2.108$$
$$P(\bar{x} \ge 3.3) = P(Z \ge 2.108) = 0.0175$$
This means that if the true mean GPA really were 3.0, there would be less than a 1 in 50 chance of getting a sample mean of 3.3 or larger. Because this probability is so small, it suggests that the true population mean GPA is in fact not 3.0, but some higher number.
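The arithmetic above can be checked with a few lines of Python. This is only a sketch using the standard library's `statistics.NormalDist` (Python 3.8 or later); the variable names are chosen here for illustration and are not part of the original notes.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 3.0      # mean GPA claimed by the class
sigma = 0.9     # population standard deviation, assumed known
n = 40          # sample size
x_bar = 3.3     # observed sample mean

standard_error = sigma / sqrt(n)       # standard deviation of the sample mean
z = (x_bar - mu_0) / standard_error    # number of standard errors x_bar lies above mu_0
p_upper = 1 - NormalDist().cdf(z)      # P(Z >= z): the shaded right-tail area

print(f"standard error = {standard_error:.4f}")
print(f"z = {z:.3f}")
print(f"P(Z >= z) = {p_upper:.4f}")
```

The printed z and tail area should agree with the table-based values of 2.108 and 0.0175, up to rounding.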
Setting up a formal hypothesis test:
Step 1: Set up your hypotheses based on what you think about the population and identify the parameter of interest.
$H_0: \mu = 3.0$    $H_a: \mu > 3.0$
$\mu$ = average GPA of DHS students
NOTE: The null hypothesis is always stated in the form $\mu = \mu_0$, and the alternative hypothesis will be stated in one of three ways: $\mu > \mu_0$, $\mu < \mu_0$, or $\mu \ne \mu_0$.
Step 2: Make sure the conditions are met.
1.) SRS – Given in the problem OK
2.) Normality – the sample size is greater than 30, so the CLT kicks in OK
3.) Independence – there are more than 400 students at DHS OK
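The three checks in Step 2 can be written as a small helper. This is a sketch under assumptions: the function name `check_conditions` is made up for illustration, the independence check uses the rule of thumb that the population should be at least 10 times the sample size (which is what the "more than 400 students" check for a sample of 40 appears to reflect), and the enrollment of 1000 is a made-up number, not a figure from the notes.

```python
def check_conditions(n, population_size, is_random_sample, population_is_normal=False):
    """Report whether each condition for a z test on a mean looks satisfied."""
    return {
        "random": is_random_sample,
        # A large sample lets the central limit theorem justify the normal model.
        "normality": population_is_normal or n >= 30,
        # Sampling without replacement: population at least 10 times the sample.
        "independence": population_size >= 10 * n,
    }

# Example 1: SRS of 40 students. The enrollment of 1000 is hypothetical.
result = check_conditions(n=40, population_size=1000, is_random_sample=True)
print(result)
print("all conditions met:", all(result.values()))
```

A helper like this only checks the numbers. Whether the sample really was random still has to come from how the data were collected.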
The Null Hypothesis states that the effect you are looking for is not present in the population. Since the effect we are looking for is “grade inflation”, the effect not being there would imply that $\mu = 3.0$.
The Alternate Hypothesis states our suspicion about the population. Our suspicion is that there is grade inflation, which would imply that $\mu > 3.0$.
Step 3: Calculate the test statistic and the P-value. The test statistic is the statistic that estimates the parameter in the hypothesis. In our case the test statistic, $\bar{x}$, was calculated to be 3.3. Sometimes the z statistic for $\bar{x}$ is also called the test statistic; in our case that would be
$$z = \frac{3.3 - 3.0}{0.9/\sqrt{40}} = 2.108$$
Is the value of this test statistic far from the parameter value stated in your null hypothesis? If so, then there is good evidence that your null hypothesis is false and enough evidence to support your “hunch” about the parameter (stated as your alternative hypothesis). How far is “far”? Usually, if the probability of getting an outcome at least as extreme as the one you got from your sample, assuming the null hypothesis is true, is smaller than some pre-determined level (called an α level), then your result is considered “far”. This would imply that a result as extreme as yours is unlikely to have occurred by chance alone, and therefore you can “reject the null hypothesis” in favor of the alternate hypothesis. In our case we are asking: what is $P(\bar{x} \ge 3.3)$? We should also sketch the curve and shade the appropriate region:
[Sketch: normal curve of sample means centered at 3.0, with the area to the right of 3.3 shaded]
We answer this question by finding the z-statistic for this value and then finding the probability associated with our alternate hypothesis using our table (or calculator):
$$z = \frac{3.3 - 3.0}{0.9/\sqrt{40}} = 2.108$$
$$P(\bar{x} \ge 3.3) = P(Z \ge 2.108) = 0.0175$$
Most α levels are set at 0.05, meaning that a sample result this extreme would occur by chance only 5% of the time if the null hypothesis were true (α levels are related to confidence levels in that an α level of 0.05, with a two-sided alternative, corresponds to a confidence level of 95%). Let’s use an α level of 0.05 for our example. Since our p-value of 0.0175 is less than the α level, we say that our results are “statistically significant at the 0.05 α level” and that we can “reject the null hypothesis”.
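The decision rule can be stated two equivalent ways: compare the p-value with α, or compare the z statistic with the critical value that cuts off a tail of area α. The sketch below shows both for the GPA example. It assumes Python 3.8 or later for `statistics.NormalDist`, and the variable names are illustrative only.

```python
from statistics import NormalDist

alpha = 0.05        # significance level, chosen before looking at the data
z = 2.108           # z statistic from the GPA example
p_value = 0.0175    # right-tail probability from the GPA example

# Rule 1: compare the p-value with alpha.
reject_by_p_value = p_value < alpha

# Rule 2: compare z with the critical value that cuts off an upper tail of area alpha.
z_critical = NormalDist().inv_cdf(1 - alpha)
reject_by_critical_value = z > z_critical

print(f"critical value = {z_critical:.3f}")
print("reject H0 (p-value rule):", reject_by_p_value)
print("reject H0 (critical-value rule):", reject_by_critical_value)
```

For a one-sided test the two rules always give the same answer, because the p-value is below α exactly when z lies beyond the critical value.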
Step 4: State your conclusion in plain English based on what you set out to study. In our case our conclusion would be stated as follows: “Our sample data suggests that the DHS GPA is greater than 3.0 indicating that there might be some grade inflation”.
Example 2: Worried about his prospects for the prom, Malcolm claims that girls at DHS are a bit snobby and that the average number of girls that a guy must ask to the prom before getting a “yes” is 4. Doug disagrees. He doesn’t think the girls are that snobby and believes that the average number of girls that a guy must ask before getting a positive response is less than 4. An SRS of 50 junior and senior guys from last year found that the average number of girls that a guy asked to the prom was 3.4. Assuming the population standard deviation is $\sigma = 2$, is there enough evidence to support Doug’s claim (at the level $\alpha = 0.05$)?
Step 1:
$H_0: \mu = 4.0$    $H_a: \mu < 4.0$
$\mu$ = the average number of girls that a guy needs to ask to the prom before getting a positive answer.
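Step 2: Conditions (requirements)
1) SRS O.K. (given)
2) Normality O.K. Sample size is 50, which is greater than 30, so the CLT applies
3) Independence – we need more than 500 junior and senior guys in the population; the problem does not say, so we assume this holds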
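Step 3: The test statistic is $\bar{x} = 3.4$.
The z statistic for this $\bar{x}$ is
$$z = \frac{3.4 - 4}{2/\sqrt{50}} = -2.121$$
$$P(\bar{x} \le 3.4) = P(Z \le -2.121) = 0.01696$$
The p-value is 0.01696.
[Sketch: normal curve of sample means centered at 4, with the area to the left of 3.4 shaded]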
Step 4: Because the p-value is less than our α level of 0.05 we can reject the null hypothesis and conclude that there is enough evidence to support Doug’s claim that Darien girls aren’t snobby and that the number of girls a guy needs to ask to the prom before getting a positive answer is less than 4. (Note: Our data is not sufficient to reject Ho at the α=.01 level.)
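The same calculation can be wrapped in a reusable function that handles all three forms of the alternative hypothesis. This is a sketch, not a standard library routine: the name `z_test` and its `alternative` argument are chosen here for illustration, and it assumes σ is known, as in the examples.

```python
from math import sqrt
from statistics import NormalDist

def z_test(x_bar, mu_0, sigma, n, alternative):
    """One-sample z test. alternative is 'less', 'greater', or 'two-sided'."""
    z = (x_bar - mu_0) / (sigma / sqrt(n))
    lower_tail = NormalDist().cdf(z)
    if alternative == "less":
        p_value = lower_tail
    elif alternative == "greater":
        p_value = 1 - lower_tail
    elif alternative == "two-sided":
        p_value = 2 * min(lower_tail, 1 - lower_tail)
    else:
        raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")
    return z, p_value

# Example 2: Ha is mu < 4, so the p-value is the lower-tail area.
z, p_value = z_test(x_bar=3.4, mu_0=4.0, sigma=2.0, n=50, alternative="less")
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
for alpha in (0.05, 0.01):
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"alpha = {alpha}: {decision}")
```

Running the loop at both levels shows why the choice of α matters: the same data lead to rejecting the null hypothesis at 0.05 but not at 0.01.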
Example 3: Frank has been sensing that his car is not driving right. He takes his car to the mechanic, who does some testing on the ignition timing. In order for Frank’s car to run at optimum efficiency, the spark plugs need to fire, on average, every 1.3 seconds. Assume that the standard deviation of all spark plug firings is known to be $\sigma = 0.5$ seconds. The mechanic suspects that the spark plugs in Frank’s car are not firing at this optimal interval. The mechanic took a random sample of 30 spark plug firings from Frank’s car and got the following data:
1.0 1.1 0.8 1.7 0.9 1.3 1.2 1.5 1.3 0.8
0.6 1.3 1.1 1.2 0.7 1.9 2.0 1.1 1.3 1.4
1.0 1.2 0.9 0.4 1.3 1.2 1.4 1.0 1.3 1.3
Based on this data, can we say that Frank’s car problems stem from spark plug timing (at the level $\alpha = 0.05$)?
Step 1:
$H_0: \mu = 1.3$    $H_a: \mu \ne 1.3$
$\mu$ = the average time (in seconds) between spark plug firings in Frank’s car
Step 2: Conditions (requirements)
1) SRS O.K. (given)
2) Normality O.K. Sample size is $\ge 30$
3) Independence O.K. more than 300 spark plug firings in population
Step 3: Calculate the test statistic.
$\bar{x} = 1.173$
$$z = \frac{1.173 - 1.3}{0.5/\sqrt{30}} = -1.39$$
Since the alternate hypothesis in this case is a ≠ situation, there is a higher burden of proof: instead of the rejection region being the extreme end of just one tail of the normal curve with an area of 0.05, the rejection region is now split between two tails, each of which has an area of α/2, or 0.025.
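p-value = $P(Z < -1.39 \text{ or } Z > 1.39) = 0.082 + 0.082 = 0.164$
[Sketch: standard normal curve centered at 0, with the area to the left of −1.39 and the area to the right of 1.39 shaded; each shaded area is 0.082]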
Step 4:
Because the p-value is greater than our α level of 0.05 we cannot reject the null hypothesis. Therefore, there is not enough evidence to support the mechanic’s claim that the timing in the spark plugs is not the optimal 1.3 seconds.
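The spark plug example can be reproduced directly from the 30 data values. The sketch below computes the sample mean, the z statistic, the two-sided p-value, and the two-tailed critical value. It assumes Python 3.8 or later, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist, mean

firing_times = [
    1.0, 1.1, 0.8, 1.7, 0.9, 1.3, 1.2, 1.5, 1.3, 0.8,
    0.6, 1.3, 1.1, 1.2, 0.7, 1.9, 2.0, 1.1, 1.3, 1.4,
    1.0, 1.2, 0.9, 0.4, 1.3, 1.2, 1.4, 1.0, 1.3, 1.3,
]
mu_0 = 1.3     # optimal firing interval under H0
sigma = 0.5    # population standard deviation, given as known
alpha = 0.05

n = len(firing_times)
x_bar = mean(firing_times)
z = (x_bar - mu_0) / (sigma / sqrt(n))

# Two-sided alternative: count both tails at least as far from 0 as z.
p_value = 2 * NormalDist().cdf(-abs(z))
# Equivalent check: each tail of the rejection region has area alpha / 2.
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

print(f"n = {n}, x_bar = {x_bar:.3f}, z = {z:.2f}")
print(f"two-sided p-value = {p_value:.3f}")
print(f"critical values = +/-{z_critical:.2f}")
print("reject H0:", p_value < alpha)
```

Because the code keeps the unrounded mean and z, its p-value comes out near 0.165 rather than exactly 0.164, which was found by rounding z to −1.39 first. The conclusion is the same either way: the p-value is well above 0.05, and |z| is smaller than the two-tailed critical value.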
Final Comment: Comparison to the American Judicial System
A good way to think of significance tests is to compare them to the judicial system. In the American judicial system we must first assume innocence and then try to prove guilt. In court, we cannot prove innocence: either we have evidence "beyond a reasonable doubt" to 'prove' guilt, or we declare the person "not guilty". The same is true in statistics. With a hypothesis test, whether one-tailed or two-tailed, the null hypothesis is our assumption of innocence. If we wish to determine ‘guilt’ (the alternative hypothesis), we must first assume innocence (the null hypothesis). Either we have enough evidence to reject the null hypothesis (reject innocence) at a given level of significance, or we do not have enough evidence (fail to reject innocence). We can never “accept” the null hypothesis, since that is not what we set out to prove. Just as in court, when a jury fails to reject innocence (the null hypothesis), it doesn’t say the person is found innocent; it says the person is found “not guilty”.