

Chapter 14: Inference for Distributions of Categorical Variables: Chi-Square Procedure

Section 14.1: Test for Goodness of Fit

In the previous chapter we compared two proportions. Some situations, however, call for a comparison of more than two proportions. Using the z-test for two proportions would be cumbersome there, because we would need to conduct a separate test for every pair of proportions. In such situations we use the chi-square (χ²) test statistic. This test statistic can compare several proportions at once to tell us whether a frequency distribution fits a particular pattern (“goodness of fit test”) or whether categorical variables are related (“test for independence in a two-way table”).

χ² Goodness of Fit Test

Example: In 1992 a group of researchers conducted a study to see if there was a relationship between month of birth and achievement in soccer (this is a real study). They surveyed all players on the teams competing in the 1990 World Cup and obtained the following data:

Birthdays   Quarter 1 (Aug. – Oct.)   Quarter 2 (Nov. – Jan.)   Quarter 3 (Feb. – Apr.)   Quarter 4 (May – July)
Frequency   150                       138                       140                       100

The researchers wanted to see whether the birthdays of soccer players were evenly distributed across the four quarters. If they were, we would conclude that being born at a particular time of year does not predispose someone to go into soccer.

Conduct a hypothesis test at the alpha = 0.05 level.

Step 1: State your null and alternative hypotheses

H₀: p₁ = p₂ = p₃ = p₄ = 0.25

Hₐ: H₀ is not true (at least one p does not equal 0.25)

Where:

p₁ = proportion of soccer players’ birthdays in the first quarter
p₂ = proportion of soccer players’ birthdays in the second quarter
p₃ = proportion of soccer players’ birthdays in the third quarter
p₄ = proportion of soccer players’ birthdays in the fourth quarter


Expand the table above to compare actual counts with expected counts:

If our null hypothesis is correct, then we would expect 0.25 of all players to fall in each of the categories (Q1, Q2, Q3, Q4). Since there are a total of 528 players, we would expect 0.25 · 528 = 132 players to be born in each quarter.

Birthdays   Quarter 1 (Aug. – Oct.)   Quarter 2 (Nov. – Jan.)   Quarter 3 (Feb. – Apr.)   Quarter 4 (May – July)
Frequency   150                       138                       140                       100
Expected    132                       132                       132                       132
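The expected counts in the table can be reproduced with a few lines of Python. This is only a sketch: the variable names (`observed`, `null_proportion`, `expected`) are illustrative and not part of the original notes, and it assumes the equal-proportion null hypothesis stated in Step 1.

```python
# Observed birthday counts by quarter, copied from the table above.
observed = {"Q1": 150, "Q2": 138, "Q3": 140, "Q4": 100}

# Under H0 every quarter is assumed to hold the same share of birthdays.
null_proportion = 0.25

# Total number of players surveyed.
total = sum(observed.values())

# Expected count in each quarter = hypothesized proportion * total.
expected = {quarter: null_proportion * total for quarter in observed}

for quarter in observed:
    print(quarter, "observed:", observed[quarter], "expected:", expected[quarter])
```

If the hypothesized proportions were not all equal, `null_proportion` would become a dictionary with one value per quarter.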

[Figure: Proportion of Soccer Player Births. Bar chart of the proportion of birthdates (vertical axis, 0 to 0.3) for each quarter (Q1 to Q4).]

Step 2: Assumptions:

1) To conduct a “goodness of fit” hypothesis test, we must make sure that the expected count in each cell is ≥ 5. In our case that assumption is met.

2) The data come from a single SRS, or from several independent SRSs from one or several populations. Alternatively, the data are from an entire population.

Step 3a: Test Statistic

The test statistic for the goodness of fit test is:

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$


$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \frac{(150-132)^2}{132} + \frac{(138-132)^2}{132} + \frac{(140-132)^2}{132} + \frac{(100-132)^2}{132}$$

$$\approx 2.45 + 0.273 + 0.485 + 7.758 = 10.966$$
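The same sum can be checked in Python. This is a minimal sketch and the function name `chi_square_statistic` is made up for illustration. Because the code does not round the individual terms, it gives roughly 10.97 rather than the 10.966 obtained above from rounded terms; the difference is only rounding.

```python
# Observed and expected counts for the four quarters, from the table above.
observed = [150, 138, 140, 100]
expected = [132, 132, 132, 132]

def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Contribution of each quarter to the statistic.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
statistic = chi_square_statistic(observed, expected)

print([round(c, 3) for c in contributions])
print(round(statistic, 2))
```

Printing the individual contributions shows that Quarter 4 accounts for most of the statistic.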

Step 3b: Calculate the p-value.

The χ² statistic has approximately what statisticians call the χ² distribution with (k − 1) degrees of freedom, where k is the number of categories in the problem (in our case k = 4). The χ² distribution has the following properties:

1) The total area under the curve is 1

2) Each curve (for d.f. greater than 2) starts at zero on the horizontal axis, increases to a peak, and then decreases toward the horizontal axis asymptotically.

3) Each curve is skewed to the right and approaches a normal curve as the d.f. increases.


There are two ways to find the p-value:

1) Table – Very similar to the t-table. Go down the left column to the correct d.f. Then find your χ² value in the body of the table (or the two numbers between which your χ² statistic lies). The matching probability range in the top row of the table is your p-value range.

In our case, with 3 d.f., our χ² test statistic lies between 9.84 and 11.34, which corresponds to a p-value between 0.01 and 0.02.

2) Calculator – χ²cdf(lowerbound, upperbound, d.f.)

In our case the lower bound will always be the χ² test statistic, and the upper bound will be a large number like 10,000.

χ²cdf(10.966, 10000, 3) = 0.01191
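For readers without a graphing calculator, the same tail area can be approximated with Python's standard library. The sketch below is not the calculator's algorithm. It relies on a closed form that holds only for 3 degrees of freedom, and the function name `chi2_upper_tail_df3` is hypothetical.

```python
import math

def chi2_upper_tail_df3(x):
    """Upper-tail area P(X > x) for a chi-square variable with exactly 3 d.f."""
    # Closed form valid only for 3 degrees of freedom:
    #   P(X > x) = erfc(sqrt(x / 2)) + sqrt(2x / pi) * exp(-x / 2)
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# Test statistic from Step 3a.
statistic = 10.966
p_value = chi2_upper_tail_df3(statistic)

print(round(p_value, 5))
```

The result should agree closely with the calculator value of 0.01191. For other degrees of freedom a statistics library would be the more practical choice.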

Step 4: Conclusion

Since our p-value is less than the alpha level of 0.05, we reject the null hypothesis in favor of the alternative and conclude that the birthdays of World Cup soccer players are not evenly distributed across the four quarters.

Try this Example


Section 14.2: Inference for Two-Way Tables

χ² Test for Independence

Often we want to determine whether two or more categorical variables are related to each other in some way. For example, is there a relationship between one’s religion and one’s political party affiliation? Is there a relationship between the sport a DHS student participates in and his/her affinity for a particular movie genre? To determine the relationship between two variables we can do a correlation analysis, but if we have more than two variables and the variables are categorical (as opposed to quantitative), we cannot use correlation; we need to use a χ² analysis.

Example: Is there a relationship between male and female political leanings at DHS? An SRS of 90 DHS students was asked to categorize themselves as liberal, moderate or conservative. A two-way table with the resulting data is given below:

          Liberal   Moderate   Conservative   Total
Males     16        10         6              32
Females   20        22         16             58
Total     36        32         22             90

Do these data provide evidence of an association between political philosophy and gender at DHS? (alpha = 0.01)

Step 1: State the null and alternative hypotheses

H₀: There is no relationship between gender and political affiliation at DHS. Or you can state it this way: gender and political affiliation are independent.

Hₐ: There is a relationship between gender and political affiliation at DHS. Or you can state it this way: gender and political affiliation are not independent.

Construct a table of expected values:

You can use the following general rule for finding the expected count for each cell of a two-way table:

$$\text{expected count} = \frac{\text{row total} \cdot \text{column total}}{\text{table total}}$$

So for liberal females this would be:

$$\frac{36 \cdot 58}{90} = 23.2$$

For moderate females the expected count is:

$$\frac{32 \cdot 58}{90} = 20.62$$

For conservative females the expected count is:

$$\frac{22 \cdot 58}{90} = 14.18$$

Let's create a new table with the observed and expected values for each cell:

                     Liberal   Moderate   Conservative   Total
Males     Observed   16        10         6              32
          Expected   12.8      11.4       7.8
Females   Observed   20        22         16             58
          Expected   23.2      20.62      14.18
Total                36        32         22             90
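The expected counts in this table follow directly from the row and column totals. As a rough sketch (the dictionary layout and variable names are illustrative choices, not from the notes), they can be recomputed like this:

```python
# Observed counts from the two-way table above.
observed = {
    "Males":   {"Liberal": 16, "Moderate": 10, "Conservative": 6},
    "Females": {"Liberal": 20, "Moderate": 22, "Conservative": 16},
}

columns = ["Liberal", "Moderate", "Conservative"]

# Marginal totals.
row_totals = {row: sum(cells.values()) for row, cells in observed.items()}
col_totals = {col: sum(observed[row][col] for row in observed) for col in columns}
table_total = sum(row_totals.values())

# Expected count = row total * column total / table total.
expected = {
    row: {col: row_totals[row] * col_totals[col] / table_total for col in columns}
    for row in observed
}

for row in observed:
    print(row, {col: round(expected[row][col], 2) for col in columns})
```

The unrounded values (for example 11.38 for moderate males) differ slightly from the table, which rounds some entries to one decimal place.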

Step 2: Assumptions

1) Since all our expected cell counts are ≥ 5, we may proceed.

2) Data are an SRS of the population of DHS students

Step 3a: Calculate the test statistic

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \frac{(16-12.8)^2}{12.8} + \frac{(10-11.4)^2}{11.4} + \frac{(6-7.8)^2}{7.8} + \frac{(20-23.2)^2}{23.2} + \frac{(22-20.62)^2}{20.62} + \frac{(16-14.18)^2}{14.18}$$

$$= 0.8 + 0.17 + 0.42 + 0.44 + 0.09 + 0.23 = 2.15$$
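The arithmetic above can be checked with a short Python sketch. It deliberately uses the rounded expected counts from the table, so it should land on the same 2.15. The list layout and names are assumptions made for illustration.

```python
# (observed, expected) pairs for the six cells, using the rounded
# expected counts exactly as they appear in the table above.
cells = [
    (16, 12.8), (10, 11.4), (6, 7.8),      # Males: Liberal, Moderate, Conservative
    (20, 23.2), (22, 20.62), (16, 14.18),  # Females: Liberal, Moderate, Conservative
]

# Each cell contributes (observed - expected)^2 / expected.
contributions = [(o - e) ** 2 / e for o, e in cells]
statistic = sum(contributions)

print([round(c, 2) for c in contributions])
print(round(statistic, 2))
```

Using unrounded expected counts instead would shift the total slightly, to about 2.16.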

Step 3b: Calculate the p-value

The degrees of freedom for a two-way table χ² test = (# of rows – 1)(# of columns – 1)

In our case d.f = (2-1)(3-1) = 2
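The notes do not record the p-value for this example. As a hedged sketch, it can be approximated in Python using the fact that, for exactly 2 degrees of freedom, the upper-tail area of the χ² distribution is exp(−x/2). The function name `chi2_upper_tail_df2` is hypothetical, and the statistic 2.15 is taken from Step 3a.

```python
import math

# Degrees of freedom for a table with 2 rows and 3 columns.
n_rows, n_cols = 2, 3
df = (n_rows - 1) * (n_cols - 1)

def chi2_upper_tail_df2(x):
    """Upper-tail area P(X > x) for a chi-square variable with exactly 2 d.f."""
    # Closed form valid only for 2 degrees of freedom.
    return math.exp(-x / 2)

# Test statistic from Step 3a.
statistic = 2.15
p_value = chi2_upper_tail_df2(statistic)

print(df, round(p_value, 3))
```

With these inputs the tail area comes out near 0.34, which can then be compared with the alpha level of 0.01 given in the problem.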


Step 4: Conclusion


Chapter 14 Summary

1.) The chi-square test for homogeneity of populations (Goodness of Fit)

• Independent SRSs are drawn from each of several populations and each observation is classified according to a categorical variable of interest.
• The null hypothesis is that the distribution of this categorical variable is the same for all populations.
• The alternate hypothesis is that the distribution is not the same for all populations.
• One common use is to compare several population proportions.

2.) The chi-square test of association/independence

• A single SRS is drawn from a population, and observations are classified according to two categorical variables.
• The null hypothesis is that there is no relationship between the row variable and the column variable.
• The alternate hypothesis is that there is a relationship between the row variable and the column variable.
• Other key words that can replace “relationship” are “association” and “independence”.

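The expected count in any cell of a two-way table when H₀ is true is

$$\text{expected count} = \frac{\text{row total} \times \text{column total}}{n}$$

The chi-square statistic is

$$\chi^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$

with (r − 1)(c − 1) degrees of freedom.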

Conditions are 1. SRS