The Chi-Square Test. STAT E-50 Introduction to Statistics


The Chi-square test is a nonparametric test that is used to compare experimental results with theoretical models. That is, we will be comparing observed frequencies with expected frequencies. In a hypothesis test, the expected frequencies are those we would expect if the null hypothesis of our test is true.

The formula is

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O represents the observed frequency and E represents the expected frequency. The value of df depends on the type of test you are performing.
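As a minimal sketch of the statistic described here, the sum of (O - E)²/E over all categories can be written as a short Python function. The function name `chi_square_statistic` and the counts used to call it are illustrative assumptions; they do not come from the lecture examples.

```python
def chi_square_statistic(observed, expected):
    """Return the sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts, for illustration only (not from the lecture examples).
observed = [18, 22, 20]
expected = [20, 20, 20]
print(chi_square_statistic(observed, expected))
```

Each term is divided by its own expected frequency, so a given difference O - E counts for more in a category where few observations were expected.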

The Chi-Square Distribution

The χ² distribution is

• nonnegative

• not symmetrical; it is skewed to the right

• a family of distributions, with a separate distribution for each number of degrees of freedom.

The Chi-Square Test for Goodness of Fit

The goodness-of-fit test compares the distribution of observed outcomes for a single categorical variable to the expected outcomes predicted by a probability model. This test involves one sample and one variable.

Assumptions and Conditions:

• Counted data condition: be sure that the data is counts, or frequencies

• Independence assumption, checked with the Randomization condition

• Sample size assumption, checked with the Expected cell frequency condition: each expected frequency is at least 5

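The Chi-square test is one-sided: the null hypothesis is rejected only for large values of χ², so the rejection region is the right tail of the distribution, beyond the critical value χ²(df, α).

[Figure: χ² distribution curve, with 0 and the critical value χ²(df, α) marked on the horizontal axis]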
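Because the test is one-sided, the P-value is the area under the χ² curve to the right of the test statistic. The sketch below computes that area using only the Python standard library. The function name `chi2_upper_tail` is an assumption made for illustration, and the closed-form recurrence it relies on applies only to whole-number degrees of freedom.

```python
import math


def chi2_upper_tail(x, df):
    """Upper-tail area P(X >= x) for a chi-square variable with integer df.

    With y = x / 2, the area equals Q(df / 2, y), built from
    Q(1/2, y) = erfc(sqrt(y)), Q(1, y) = exp(-y), and the recurrence
    Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / gamma(a + 1).
    """
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    # Step a up by 1 until it reaches df / 2.
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Area to the right of 5.991 with 2 degrees of freedom (close to 0.05).
print(round(chi2_upper_tail(5.991, 2), 3))
```

A small upper-tail area means the observed frequencies are far from the expected ones, which is evidence against the null hypothesis.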

Automobile insurance is much more expensive for teenage drivers than for older drivers. To justify this cost difference, insurance companies claim that the younger drivers are much more likely to be involved in costly accidents. To test this claim, a researcher obtains information about registered drivers from the Department of Motor Vehicles and selects a sample of 300 accident reports from the police department.

The DMV reports the percentage of registered drivers in each age category as reported below. The number of accident reports is also shown. Does this data indicate that accidents occur with the same distribution as the ages of the drivers?

H0: The distribution of the ages of drivers involved in accidents is the same as the distribution of the ages of registered drivers.

Ha: The distribution of the ages of drivers involved in accidents is not the same as the distribution of the ages of registered drivers.

Check the conditions:

• Counted data condition: the data is the number of accident reports in each age category

• Randomization condition

• Expected cell frequency condition: each expected frequency ≥ 5

Age          Percent of   Number of accidents   Expected   O - E   (O - E)²   (O - E)²/E
             drivers      (observed)
Under 20     16            68                    48         20      400        8.33
20 - 29      28            92                    84          8       64         .76
30 or over   56           140                   168        -28      784        4.67
n =                       300                   300                           13.76

Note: Σ observed = Σ expected, and Σ(O - E) = 0

Every expected frequency is at least 5 (the smallest is 48), so the expected cell frequency condition is met.

Specify the sampling distribution model and the test you will use.

Since the conditions are met, we will use a Chi-square model with 2 degrees of freedom, and do a Chi-square goodness-of-fit test.

$$\chi^2 = \sum \frac{(O - E)^2}{E}, \quad \text{with } df = k - 1$$

χ² = 13.76   df = 3 - 1 = 2   P-value: P < .005

Statistical conclusion: Since the p-value is small, reject the null hypothesis.

Conclusion in context: The data indicates that the distribution of ages of drivers involved in accidents is not the same as the distribution of ages of the drivers in the population.
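The hand calculation for the accident data can be checked with a short Python sketch. The variable names are illustrative assumptions. The P-value line uses the fact that, for exactly 2 degrees of freedom, the area to the right of x under the χ² curve is exp(-x/2); that shortcut does not carry over to other values of df.

```python
import math

# Data from the example: percent of registered drivers and observed accidents.
percent_of_drivers = {"Under 20": 16, "20 - 29": 28, "30 or over": 56}
observed = {"Under 20": 68, "20 - 29": 92, "30 or over": 140}

n = sum(observed.values())

# Expected count = n times the proportion claimed by the null hypothesis.
expected = {age: n * pct / 100 for age, pct in percent_of_drivers.items()}

# Contribution of each age group: (O - E)^2 / E.
contributions = {
    age: (observed[age] - expected[age]) ** 2 / expected[age] for age in observed
}

chi_square = sum(contributions.values())
df = len(observed) - 1

# Closed form for the upper-tail area, valid only when df = 2.
p_value = math.exp(-chi_square / 2)

for age in observed:
    print(age, observed[age], expected[age], round(contributions[age], 2))
print("chi-square =", round(chi_square, 2), "df =", df, "p =", round(p_value, 3))
```

The statistic rounds to 13.76 and the P-value to .001, which is consistent with P < .005 from the table.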

Using SPSS for a Goodness of Fit Test

If you have the expected proportions:

1. Create a numeric variable with a width of 1 and no decimal places for the categories. Code the values of this variable as follows:

In the Values column, click on the box with the three dots:

You will then see the Value Labels dialog box. Since there are three categories of ages, enter the values 1, 2, and 3 as coding variables:

Enter the value "1" and code it as "under 20". (You do not have to use quotation marks; they will be added by SPSS.)

Then click on Add and you will see the results:

Continue adding all categories, one at a time, and then click on OK.

You will see the results in the Values column in Variable View.

2. Create a numeric variable with no decimal places for the observed frequencies.

Then, for each category, enter the coded value:

As you enter each value you will see a drop-down box. If you click on it, you can choose from the list of labels. However, if you just move to the next column, you will see the category name associated with the coded value.

You can then enter the observed frequency for that category.

Repeat this until all observed frequencies have been entered:

3. Weight the cases using the observed frequencies.

4. Now select

> Analyze > Nonparametric Tests > Legacy Dialogs > Chi-Square

5. Select the variable with the observed frequencies as the Test Variable.

In the Expected Values box, select Values:

6. Enter the expected percents (as decimals) one at a time, and click on Add until all have been entered:

(Note that you also have the option to choose All categories equal if that is appropriate.)

7. After the last value has been entered, click on OK. You should see a table showing the observed and expected frequencies and a table with the results of the Chi-square test:

These results show that χ² = 13.762, and p = .001

count
         Observed N   Expected N   Residual
68        68           48.0         20.0
92        92           84.0          8.0
140      140          168.0        -28.0
Total    300

Test Statistics
              count
Chi-Square    13.762^a
df            2
Asymp. Sig.   .001

a. 0 cells (.0%) have expected frequencies less than 5. The minimum expected cell frequency is 48.0.

The Chi-Square Test for Homogeneity

In a test for homogeneity, we compare observed distributions for several groups to see if there are differences among the respective populations.

The central issue is whether the category proportions are the same for all of the populations. The test involves several samples but only one variable.

The article “Relationship of Health Behaviors to Alcohol and Cigarette Use by College Students” (J. of College Student Development (1992)) included data on drinking behavior for independently chosen random samples of male and female students similar to the data shown below.

Does there appear to be a gender difference with respect to drinking behavior?

                                   Gender
Drinking level (drinks per week)   Men    Women
None                               140    186
Low (1 - 7)                        478    661
Moderate (8 - 24)                  300    173
High (25 or more)                   63     16

Assumptions and Conditions:

• Counted data condition: be sure that the data is counts, or frequencies

• Independence assumption, checked with the Randomization condition: needed if you want to generalize from the data to a population

• Sample size assumption, checked with the Expected cell frequency condition: each expected frequency is at least 5

H0: The proportions of the four drinking levels are the same for males and for females

Ha: The proportions of the four drinking levels are not the same for males and for females

Check the conditions: Counted data condition, Randomization condition, and Expected cell frequency condition, where each expected frequency is

$$E = \frac{(\text{row total})(\text{column total})}{n}$$

Specify the sampling distribution model and the test you will use.

df = (R - 1)(C - 1) = (4 - 1)(2 - 1) = (3)(1) = 3

The conditions are met, so we will use a Chi-square model with 3 degrees of freedom, and do a Chi-square test of homogeneity.

Fill in the row and column totals, then calculate the expected frequencies for each cell, using

$$E = \frac{(\text{row total})(\text{column total})}{n}$$

For example, the expected frequency for men in the None category is (326)(981)/2017 = 158.56. Expected frequencies are shown in parentheses:

                                   Gender
Drinking level (drinks per week)   Men             Women           Total
None                               140 (158.56)    186 (167.44)     326
Low (1 - 7)                        478 (553.97)    661 (585.03)    1139
Moderate (8 - 24)                  300 (230.05)    173 (242.95)     473
High (25 or more)                   63 (38.42)      16 (40.58)       79
Total                              981             1036            2017

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

χ² = 2.17 + 2.06 + 10.418 + 9.865 + 21.27 + 20.14 + 15.73 + 14.89 = 96.54

χ² = 96.54   df = 3   P-value: p < .005

Statistical conclusion: p is small, so the null hypothesis is rejected.

Conclusion in context: The data does indicate a gender difference with respect to drinking behavior.
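To see where the size of χ² comes from in the drinking example, the following Python sketch prints the contribution (O - E)²/E of every cell. The variable names are illustrative assumptions. Because it keeps the expected counts unrounded, its total comes out near 96.53 instead of the hand-calculated 96.54.

```python
levels = ["None", "Low (1 - 7)", "Moderate (8 - 24)", "High (25 or more)"]
groups = ["Men", "Women"]
observed = [
    [140, 186],
    [478, 661],
    [300, 173],
    [63, 16],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_square = 0.0
for i, level in enumerate(levels):
    for j, group in enumerate(groups):
        e = row_totals[i] * col_totals[j] / n       # expected count
        contribution = (observed[i][j] - e) ** 2 / e
        chi_square += contribution
        print(f"{level:18s} {group:6s} {contribution:7.3f}")

# Degrees of freedom for an R x C table: (R - 1)(C - 1).
df = (len(levels) - 1) * (len(groups) - 1)
print("chi-square =", round(chi_square, 3), "df =", df)
```

The Moderate and High rows account for most of the total, so those are the drinking levels where men and women differ most from what the null hypothesis predicts.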

Using SPSS for a Test for Homogeneity

1. Create a string variable for each of the categories, and a numeric variable for the observed frequencies. Be sure to make the columns wide enough ("columns" in Variable View).

Then enter the values of these two variables:

2. Weight the cases using the observed frequencies.

(> Data > Weight Cases…)

3. Select > Analyze > Descriptive Statistics > Crosstabs…

Select one variable as the row variable and the other as the column variable.

Click on Statistics… and then on Chi-square.

Click on the Cells button, and select Observed and Expected in the Cell Display window. Then click on Continue.

Click on Display clustered bar charts to produce the graph shown in the results.

Click on Continue and then click on OK.

Your output should include a table showing the observed and expected frequencies:

gender * level Crosstabulation

                                  level
                                  high    low      moderate   none     Total
gender   female   Count           16      661      173        186      1036
                  Expected Count  40.6    585.0    242.9      167.4    1036.0
         male     Count           63      478      300        140      981
                  Expected Count  38.4    554.0    230.1      158.6    981.0
Total             Count           79      1139     473        326      2017
                  Expected Count  79.0    1139.0   473.0      326.0    2017.0

and a table with the results of your Chi-square test:

These results show that χ² = 96.526, and p = .000 (SPSS displays .000 when the p-value is less than .0005)

Chi-Square Tests
                     Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square   96.526^a   3    .000
Likelihood Ratio     98.966     3    .000
N of Valid Cases     2017

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 38.42.

Here is the graph that represents the results:

The Chi-Square Test for Independence

In a test for independence, we investigate association between two categorical variables in a single population. There is one sample, but there are two variables.

• Counted data condition

• Randomization condition: needed if you want to generalize from the data to a population

• Expected cell frequency condition

The table shown below was constructed using data in the article “Television Viewing and Physical Fitness in Adults” (Research Quarterly for Exercise and Sport (1990)). The author hoped to determine whether time spent watching television is associated with cardiovascular fitness.

Subjects were asked about their television viewing time (per day, rounded to the nearest hour) and were classified as physically fit if they scored in the excellent or very good category on a step test.

             Fitness Level
TV Group     Physically Fit   Not Physically Fit
0             35              147
1 - 2        101              629
3 - 4         28              222
5 or more      4               34

H0: Fitness and TV viewing are independent

Ha: Fitness and TV viewing are not independent

Check the conditions:

• Counted data condition

• Randomization condition

• Expected cell frequency condition

Specify the sampling distribution model and the test you will use.

df = (R - 1)(C - 1) = (4 - 1)(2 - 1) = (3)(1) = 3

Since the conditions are met, we will use a Chi-square model with 3 degrees of freedom, and do a Chi-square test for independence.

Find the row and column totals, then calculate the expected frequencies, using

$$E = \frac{(\text{row total})(\text{column total})}{n}$$

Expected frequencies are shown in parentheses:

             Fitness Level
TV Group     Physically Fit   Not Physically Fit   Total
0             35 (25.48)      147 (156.52)          182
1 - 2        101 (102.20)     629 (627.80)          730
3 - 4         28 (35.00)      222 (215.00)          250
5 or more      4 (5.32)        34 (32.68)            38
Total        168              1032                  1200

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

χ² = 3.557 + .579 + .014 + .002 + 1.4 + .228 + .328 + .053 = 6.161

χ² = 6.161   df = 3   P-value: p > .10

Statistical conclusion: Since the p-value is large, we cannot reject the null hypothesis.

Conclusion in context: There is not enough evidence to conclude that time spent watching television is associated with cardiovascular fitness.
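The test for independence on the television data can be reproduced end to end with the Python sketch below. The variable names are illustrative assumptions. The P-value line uses a closed form for the upper-tail area that holds only for 3 degrees of freedom: erfc(√(x/2)) + √(2x/π)·exp(-x/2).

```python
import math

# Rows: TV groups 0, 1 - 2, 3 - 4, 5 or more.
# Columns: Physically Fit, Not Physically Fit.
observed = [
    [35, 147],
    [101, 629],
    [28, 222],
    [4, 34],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell: (row total)(column total) / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Expected cell frequency condition: every expected count is at least 5.
smallest_expected = min(min(row) for row in expected)

# Upper-tail area for df = 3 only (closed form).
p_value = math.erfc(math.sqrt(chi_square / 2)) + math.sqrt(
    2 * chi_square / math.pi
) * math.exp(-chi_square / 2)

print("smallest expected count =", round(smallest_expected, 2))
print("chi-square =", round(chi_square, 3), "df =", df, "p =", round(p_value, 3))
```

The smallest expected count is 5.32, so the expected cell frequency condition is only just met, and the P-value of about .104 is consistent with p > .10.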

Using SPSS for a Test for Independence

Follow the instructions for a Chi-Square test for homogeneity. You may define two string variables for the categories and one numeric variable for the counts, or you may choose to use coding for one or both of the variables representing the categories.

Then enter the frequencies as before:

Weight the cases by counts, and then use

> Analyze > Descriptive Statistics > Crosstabs…

Select one variable as the row variable and the other as the column variable.

Click on Statistics… and then on Chi-square.

Click on the Cells button, and select Observed and Expected in the Cell Display window.

Click on Display clustered bar charts to produce the graph shown in the results.

Then click on Continue and on OK.

SPSS output:

TVGroup * Fitness Crosstabulation

                                     Fitness
                                     Fit      Not Fit   Total
TVGroup   0           Count          35       147       182
                      Expected Count 25.5     156.5     182.0
          1-2         Count          101      629       730
          3-4         Count          28       222       250
          5 or more   Count          4        34        38
Total                 Count          168      1032      1200
                      Expected Count 168.0    1032.0    1200.0

These results show that χ² = 6.161 and p = .104

Chi-Square Tests
                     Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square   6.161^a   3    .104
Likelihood Ratio     5.930     3    .115
N of Valid Cases     1200

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 5.32.

Here is the graph that supports these results:

1. A health professional selected a random sample of 100 patients from each of four major hospital emergency rooms to see if the major reasons for emergency room visits (accident, illegal activity, illness, other) are the same in all four hospitals.

This is an example of

a. A goodness-of-fit test

b. A test for homogeneity

c. A test for independence

2. An urban economist wants to determine whether the region of the United States a resident lives in is related to his level of education. He randomly selects 1800 US residents and asks them to report their level of education and the region of the US in which they live.

The economist is using

a. A goodness-of-fit test

b. A test for homogeneity

c. A test for independence

3. As part of a class project, a student asked a random sample of students about their preferred soft drink: Pepsi, Coke, or 7-Up, to determine whether these three drinks were equally preferred by students.

The student should use

a. A goodness-of-fit test

b. A test for homogeneity

c. A test for independence

88

3. As part of a class project, a student asked a random sample of students about their preferred soft drink: Pepsi, Coke, or 7-Up, to determine whether these three drinks were equally preferred by students.

The student should use

a. A goodness-of-fit test