Chapter 14: Inference for Distributions of Categorical Variables: Chi-Square Procedure
Section 14.1: Test for Goodness of Fit
In the previous chapter we compared two proportions. Some situations, however, call for a comparison of more than two proportions. Using the z-test for two proportions would be cumbersome there, because we would need to conduct a separate test for every pair of proportions. Instead we use what is called the chi-square (χ²) test statistic. This test statistic can compare multiple proportions at once to tell us whether a frequency distribution fits a particular pattern (the “goodness of fit” test) or whether categorical variables are related (the “test for independence” in a two-way table).
χ² Goodness of Fit Test
Example: In 1992 a group of researchers conducted a study to see whether there was a relationship between month of birth and achievement in soccer (this is a real study). They surveyed all players on the teams competing in the 1990 World Cup and got the following data:
Birthdays    Quarter 1 (Aug. – Oct.)   Quarter 2 (Nov. – Jan.)   Quarter 3 (Feb. – Apr.)   Quarter 4 (May – July)
Frequency    150                       138                       140                       100
The researchers wanted to see whether the birthdays of soccer players were evenly distributed across the four quarters. If they were, we would conclude that players born at certain times of the year are no more likely than others to end up in top-level soccer.
Conduct a hypothesis test at the alpha = 0.05 level.
Step 1: State your null and alternative hypotheses
Ho: p1 = p2 = p3 = p4 = 0.25
Ha: Ho is not true (at least one p is not equal to 0.25)
Where:
p1 = Proportion of soccer players’ birthdays in the first quarter
p2 = Proportion of soccer players’ birthdays in the second quarter
p3 = Proportion of soccer players’ birthdays in the third quarter
p4 = Proportion of soccer players’ birthdays in the fourth quarter
Expand the table above to compare actual counts with expected counts:
If our null hypothesis is correct, then we would expect 0.25 of all players to fall in each of the categories (Q1, Q2, Q3, Q4). Since there are a total of 528 players, we would expect 0.25 · 528 = 132 players to be born in each quarter.
Birthdays    Quarter 1 (Aug. – Oct.)   Quarter 2 (Nov. – Jan.)   Quarter 3 (Feb. – Apr.)   Quarter 4 (May – July)
Frequency    150                       138                       140                       100
Expected     132                       132                       132                       132
[Figure: Proportion of Soccer Player Births. Horizontal axis: Quarter (Q1 to Q4); vertical axis: proportion of birthdates (0 to 0.3).]
Step 2: Assumptions:
1) To conduct a “goodness of fit” hypothesis test, the expected count in each cell must be ≥ 5. In our case that assumption is met, since every expected count is 132.
2) The data come from a single SRS or from several independent SRSs from one or several populations. Or, the data are from an entire population.
Step 3a: Test Statistic
The test statistic for the goodness of fit test is:
$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

For our data:

$$\chi^2 = \frac{(150-132)^2}{132} + \frac{(138-132)^2}{132} + \frac{(140-132)^2}{132} + \frac{(100-132)^2}{132}$$

$$= 2.45 + 0.273 + 0.485 + 7.758 = 10.966$$
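As a cross-check on the hand calculation, here is a minimal Python sketch of the same arithmetic. The variable names are illustrative only and are not part of the original notes. Because it does not round the individual terms, the total comes out near 10.97 rather than the 10.966 obtained from summing rounded terms.

```python
# Observed birthday counts by quarter, taken from the table in this section.
observed = [150, 138, 140, 100]

# Under Ho, each quarter holds a proportion of 0.25 of the players.
null_proportions = [0.25, 0.25, 0.25, 0.25]

n = sum(observed)                              # total number of players
expected = [p * n for p in null_proportions]   # expected count per quarter

# Each category contributes (observed - expected)^2 / expected.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(contributions)

print("expected counts:", expected)
print("contributions:  ", [round(c, 3) for c in contributions])
print("chi-square:     ", round(chi_square, 3))
```

The fourth quarter contributes by far the largest term, which shows where the observed counts depart most from the even split.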
Step 3b: Calculate the p-value.
The statistic χ² has approximately what statisticians call the χ² distribution with (k − 1) degrees of freedom, where k is the number of categories in the problem. In our case k = 4, so there are 3 degrees of freedom. The χ² distribution has the following properties:
1) The total area under the curve is 1.
2) Each curve starts at zero on the x-axis, increases to a peak, and then decreases toward the x-axis asymptotically.
3) Each curve is skewed to the right and approaches a normal curve as the d.f. increase.
There are two ways to find the p-value:
1) Table – Very similar to the t-table. Go down the left column to the correct d.f. Then find the χ² value in the body of the table (or the two numbers between which your χ² statistic lies). The matching probability range in the top row of the table is your p-value range.
In our case, with 3 d.f., our χ² test statistic lies between 9.84 and 11.34, which corresponds to a p-value between 0.01 and 0.02.
2) Calculator – χ²cdf(lower bound, upper bound, d.f.)
The lower bound will always be the χ² test statistic, and the upper bound will be a large number like 10,000. In our case:
χ²cdf(10.966, 10000, 3) = 0.01191
Step 4: Conclusion
Since our p-value is less than the alpha level of 0.05, we reject the null hypothesis in favor of the alternative and conclude that the birthdays of World Cup soccer players are not evenly distributed among the four quarters.
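The p-value and the decision can also be reproduced without a table or calculator. The sketch below uses only Python's standard library and the known closed-form right-tail area of the χ² distribution for whole-number degrees of freedom. The function name chi2_upper_tail is a hypothetical helper written for this illustration; statistical packages such as SciPy provide an equivalent (for example scipy.stats.chi2.sf).

```python
import math

def chi2_upper_tail(x, df):
    """Area to the right of x under the chi-square curve, for integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        # Even d.f.: exp(-x/2) * sum of (x/2)^i / i!  for i = 0 .. df/2 - 1
        series = sum(half ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half) * series
    # Odd d.f.: erfc(sqrt(x/2)) + exp(-x/2) * sum of (x/2)^(i - 1/2) / Gamma(i + 1/2)
    series = sum(half ** (i - 0.5) / math.gamma(i + 0.5)
                 for i in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * series

chi_square = 10.966   # test statistic from Step 3a
k = 4                 # number of categories (quarters)
df = k - 1            # degrees of freedom
alpha = 0.05

p_value = chi2_upper_tail(chi_square, df)
reject_null = p_value < alpha

print(f"d.f. = {df}, p-value = {p_value:.5f}, reject Ho: {reject_null}")
```

The result agrees with the calculator value of about 0.0119 and falls inside the table range of 0.01 to 0.02.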
Try this Example
Section 14.2: Inference for Two-Way Tables
χ² Test for Independence
Often we want to determine whether two or more categorical variables are related to each other in some way. For example, is there a relationship between one’s religion and one’s political party affiliation? Is there a relationship between the sport a DHS student participates in and his/her affinity for a particular movie genre? To measure the relationship between two quantitative variables we can do a correlation analysis, but when the variables are categorical (as opposed to quantitative) we cannot use correlation; we need to use a χ² analysis.
Example: Is there a relationship between gender and political leanings at DHS? An SRS of 90 DHS students was asked to categorize themselves as liberal, moderate or conservative. A two-way table with the resulting data is given below:
           Liberal   Moderate   Conservative   Total
Males      16        10         6              32
Females    20        22         16             58
Total      36        32         22             90
Do these data provide evidence of an association between political philosophy and gender at DHS? (alpha = 0.01)
Step 1: State the null and alternative hypotheses
Ho: There is no relationship between gender and political affiliation at DHS. Or you can state it this way: Gender and political affiliation are independent
Ha: There is a relationship between gender and political affiliation at DHS.
Or you can state it this way: Gender and political affiliation are not independent.
Construct a table of expected values:
You can use the following general rule for finding the expected count for each cell of a two-way table:
The expected count for each cell of a two-way table is:

$$\text{expected count} = \frac{\text{row total} \cdot \text{column total}}{\text{table total}}$$

So for liberal females this would be:

$$\frac{36 \cdot 58}{90} = 23.2$$

For moderate females the expected count is:

$$\frac{32 \cdot 58}{90} = 20.62$$

For conservative females the expected count is:

$$\frac{22 \cdot 58}{90} = 14.18$$
Let's create a new table with the observed and expected values for each cell:
                     Liberal   Moderate   Conservative   Total
Males     Observed   16        10         6              32
          Expected   12.8      11.4       7.8
Females   Observed   20        22         16             58
          Expected   23.2      20.62      14.18
Total                36        32         22             90
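The expected-count rule (row total times column total, divided by table total) is easy to apply to every cell at once. The following is a minimal Python sketch of that rule for the DHS data; the dictionary layout and variable names are choices made for this illustration only.

```python
# Observed counts from the two-way table (columns: Liberal, Moderate, Conservative).
columns = ["Liberal", "Moderate", "Conservative"]
observed = {
    "Males":   [16, 10, 6],
    "Females": [20, 22, 16],
}

row_totals = {row: sum(counts) for row, counts in observed.items()}
col_totals = [sum(counts[j] for counts in observed.values())
              for j in range(len(columns))]
table_total = sum(row_totals.values())

# expected count = row total * column total / table total
expected = {
    row: [row_totals[row] * col_totals[j] / table_total
          for j in range(len(columns))]
    for row in observed
}

for row, values in expected.items():
    print(row, [round(v, 2) for v in values])
```

The printed values match the Expected rows of the table once rounded (the table shows the male moderate and conservative counts to one decimal place).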
Step 2: Assumptions
1) Since all our expected cell counts are ≥ 5, we may proceed.
2) The data are an SRS from the population of DHS students.
Step 3a: Calculate the test statistic
$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

$$= \frac{(16-12.8)^2}{12.8} + \frac{(10-11.4)^2}{11.4} + \frac{(6-7.8)^2}{7.8} + \frac{(20-23.2)^2}{23.2} + \frac{(22-20.62)^2}{20.62} + \frac{(16-14.18)^2}{14.18}$$

$$= 0.8 + 0.17 + 0.42 + 0.44 + 0.09 + 0.23 = 2.15$$
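The same sum can be checked with a short Python sketch. It recomputes the expected counts without rounding, so the total comes out near 2.16 instead of the hand-calculated 2.15. The difference is due only to rounding in the expected counts and in the individual terms. Variable names are illustrative.

```python
# Observed counts: rows are Males, Females; columns are Liberal, Moderate, Conservative.
observed = [
    [16, 10, 6],
    [20, 22, 16],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
table_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / table_total   # unrounded expected count
        chi_square += (obs - exp) ** 2 / exp                # this cell's contribution

print(round(chi_square, 3))
```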
Step 3b: Calculate the p-value
The degrees of freedom for a two-way table χ² test = (# of rows – 1)(# of columns – 1)
In our case d.f. = (2 – 1)(3 – 1) = 2
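The notes do not carry the p-value calculation through for this example, so the following is only a sketch of how it could be done. It relies on the fact that, for exactly 2 degrees of freedom, the right-tail area of the χ² distribution is exp(−χ²/2). The statistic 2.15 and alpha = 0.01 come from the example; the variable names are illustrative.

```python
import math

chi_square = 2.15              # hand-calculated test statistic from Step 3a
rows, cols = 2, 3              # the table has 2 rows and 3 columns (totals excluded)
df = (rows - 1) * (cols - 1)   # degrees of freedom
alpha = 0.01

# For exactly 2 d.f., the area to the right of x is exp(-x / 2).
p_value = math.exp(-chi_square / 2)
reject_null = p_value < alpha

print(f"d.f. = {df}, p-value = {p_value:.4f}, reject Ho: {reject_null}")
```

Under these assumptions the p-value is about 0.34, well above alpha = 0.01, so these data would not give grounds to reject Ho.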
Step 4: Conclusion
Chapter 14 Summary
1.) The chi-square test for homogeneity of populations (Goodness of Fit)
Independent SRSs are drawn from each of several populations and each observation is classified according to a categorical variable of interest.
The null hypothesis is that the distribution of this categorical variable is the same for all populations.
The alternate hypothesis is that the distribution is not the same for all populations.
One common use is to compare several population proportions.
2.) The chi-square test of association/independence
A single SRS is drawn from a population, and observations are classified according to two categorical variables.
The null hypothesis is that there is no relationship between the row variable and the column variable.
The alternate hypothesis is that there is a relationship between the row variable and the column variable.
Other key words that can replace “relationship” are “association” and “independence”
The expected count in any cell of a two-way table when Ho is true is

$$\text{expected count} = \frac{\text{row total} \times \text{column total}}{n}$$

The chi-square statistic is

$$\chi^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$

with (r-1)(c-1) degrees of freedom.
Conditions are 1. SRS