Bivariate Statistics Session 2: Measuring Associations – Chi-Square Test

Features Of The Chi-Square Statistic

The chi-square test is non-parametric. That is, it does not assume that the variables follow a particular distribution (such as the normal distribution). For this reason, it is typically used with data measured at the nominal or ordinal level.

Pearson’s Chi-Square (χ²) is the most popular of the non-parametric statistics.

The chi-square (χ²) test is used to assess the relationship between 2 nominal or ordinal variables.

It is a very general statistical test that can be used whenever we wish to evaluate whether frequencies that have been empirically obtained differ significantly from those that would be expected on the basis of chance or theoretical expectations. In a contingency table, this means exploring how the categories of the ‘row variable’ are distributed across the categories of the ‘column variable’. A statistically significant chi-square test indicates that the rows and columns of the contingency table are dependent, that is, that the differences between observed and expected cell frequencies (cell: a field in the table) are too large to be attributed to chance or randomness. A non-significant chi-square test implies that the differences in cell frequencies may be random.

The basic idea of the chi-square statistic is to compare the observed distribution of frequencies with the expected distribution of frequencies. The chi-square test assesses whether the observed association between the variables could be due to chance. The test starts from the assumption that there is no association between the variables in the contingency table (remember the null hypothesis: no association between 2 variables).

Assumptions of the Chi-Square Test:

Required Level of Measurement:

- The chi-square statistic requires 2 nominal (or ordinal) variables.

Postulates of the chi-square test:

1) Random sample

2) Mutually exclusive categories

3) Expected frequencies must all be > 1

4) No more than 20% of cells in the contingency table should have an expected frequency < 5.

If these conditions are not satisfied, the results of the chi-square test may be unreliable.
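Postulates 3 and 4 can be checked mechanically once the expected frequencies are known. The following is a minimal sketch in plain Python, not part of the original handout; the function names are hypothetical, the expected frequencies use the row total * column total / n formula given later in this handout, and the second table is made up purely to show a failing check.

```python
def expected_frequencies(observed):
    """Expected count per cell: row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def check_expected_counts(observed):
    """Apply postulates 3 and 4 to the expected frequencies of a table."""
    cells = [e for row in expected_frequencies(observed) for e in row]
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return {
        "all_above_1": min(cells) > 1,
        "share_below_5": share_below_5,
        "ok": min(cells) > 1 and share_below_5 <= 0.20,
    }


# Fear-of-crime counts from Example 1 (rows: male, female; columns: yes, no).
fear_of_crime = [[290, 364], [795, 200]]
# Made-up sparse table, used only to illustrate a failing check.
sparse = [[1, 2], [3, 10]]

print(check_expected_counts(fear_of_crime))
print(check_expected_counts(sparse))
```

Postulates 1 and 2 (random sample, mutually exclusive categories) concern the study design and cannot be verified from the counts alone.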

Contingency Tables

The basis of any chi-square test is always a table with frequency counts in the cells. Depending on the number of rows and columns, the table is usually referred to as an N (number of rows) x M (number of columns) table. The simplest version is a 2x2 contingency table.

Contingency table: frequencies of 2 variables presented in one table. All categories of the first variable appear in rows, and all the categories of a second variable in columns. You also obtain a joint frequency for each cell, and totals (for both rows and columns).

Cells: Fields in the table.
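To make the definition concrete, the sketch below tallies paired observations into a contingency table with row, column and grand totals. It is an illustration in plain Python under stated assumptions: the function name is hypothetical and the six responses are made up, not taken from any survey.

```python
from collections import Counter


def contingency_table(pairs, row_levels, col_levels):
    """Count joint frequencies and compute row, column and grand totals."""
    counts = Counter(pairs)
    table = [[counts[(r, c)] for c in col_levels] for r in row_levels]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return table, row_totals, col_totals, sum(row_totals)


# Made-up (gender, answer) responses, for illustration only.
responses = [
    ("male", "yes"), ("male", "no"), ("male", "no"),
    ("female", "yes"), ("female", "yes"), ("female", "no"),
]
table, row_totals, col_totals, n = contingency_table(
    responses, ["male", "female"], ["yes", "no"]
)
print(table, row_totals, col_totals, n)
```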

Example 1: 2x2 Table

In the fear of crime survey we found that women are more likely than men to say that they “go to certain areas only if accompanied by others”.

In certain areas, I only go in the company of others

            Yes      No    Total
Male        290     364      654
Female      795     200      995
Total      1085     564     1649

Example 2: 2x3 Table

Suppose a survey asked whether people are in favour of introducing the Euro. It found the following answers by political party preference. We might want to know:

Does preference for the Euro vary by political preference?

How strong is the relationship?

                         Political Preference
Pro Euro introduction    Labour    Tory    Liberal    Row total
yes                         120      40         30          190
no                           60      60         20          140
Column total                180     100         50          330

Calculating A Chi-Square Test

How can we know that the differences above are systematic, and not due to chance?

We compare our observed values to the values we would expect to see by chance alone if the null hypothesis were true.

Observed frequencies: distribution of variables in the sample

Expected frequencies: theoretical frequencies that would be obtained if there were no association between the variables (that is, if the null hypothesis were true). Expected frequencies are computed as follows:

$$\text{Expected cell frequency} = \frac{\text{Row total} \times \text{Column total}}{\text{Total number of observations}}$$
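As an illustration, the expected frequencies for the Euro table in Example 2 can be computed as follows. This is a minimal sketch in plain Python; the function name is hypothetical and the resulting values are computed here, they are not stated in the handout.

```python
def expected_frequencies(observed):
    """Expected count per cell: row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Example 2 counts (rows: yes, no; columns: Labour, Tory, Liberal).
euro = [[120, 40, 30], [60, 60, 20]]
expected = expected_frequencies(euro)
for row in expected:
    print([round(e, 2) for e in row])
```

Note that the expected frequencies keep the same row and column totals as the observed table; only the distribution inside the table changes.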

The general formula for computing the chi-square statistic is:

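$$\chi^2 = \sum \frac{(\text{Observed frequency} - \text{Expected frequency})^2}{\text{Expected frequency}}$$

The sum is taken over all cells of the contingency table.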

Degrees of Freedom (df)

Statisticians use the term “degrees of freedom” to describe the number of values in the final calculation of a statistic that are “free to vary”. For a contingency table, the degrees of freedom are computed by multiplying the number of rows minus one by the number of columns minus one. The formula is:

df = (# of rows - 1) * (# of columns - 1)
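The chi-square formula and the degrees of freedom can be combined in a few lines. The sketch below, in plain Python with a hypothetical function name, applies both to the Euro table from Example 2; the handout does not report a result for that table, so the value printed here is this sketch's own computation.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n  # expected frequency
            chi2 += (obs - exp) ** 2 / exp           # cell contribution
    df = (len(observed) - 1) * (len(col_totals) - 1)
    return chi2, df


# Example 2 counts (rows: yes, no; columns: Labour, Tory, Liberal).
euro = [[120, 40, 30], [60, 60, 20]]
chi2, df = chi_square(euro)
print(round(chi2, 2), df)
```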

Basic steps underlying the computation of a Chi-square statistic:

Step 1: Observing some distribution of frequencies in the cells of a table (the “observed” frequencies) and computing the sum of each row and column.

Step 2: Computing the frequencies one would expect in each cell “by chance” (the “expected” frequencies: row total * column total / total number of observations)

Step 3: Comparing the “observed” to the “expected” frequencies (observed frequency – expected frequency)

Step 4: Computing the chi-square value (see formula above)

A Step-by-Step Guide to Computing a 2x2 Chi-Square Statistic

Starting point:

Observed Distribution

                   Certain areas only with others
                    Yes      No     Sum
Gender  Males       290     364     654
        Females     795     200     995
        Sum        1085     564    1649

Step 1: Compute Expected Cell Frequencies

Formula (for each cell): row total * column total / n

                    Yes        No      Sum
Gender  Males     430.32    223.68     654
        Females   654.68    340.32     995
        Sum         1085       564    1649

Computing expected cell frequencies:

Males, yes: (654 * 1085) / 1649 = 430.32
Females, yes: (995 * 1085) / 1649 = 654.68
Males, no: (654 * 564) / 1649 = 223.68
Females, no: (995 * 564) / 1649 = 340.32

Step 2: Compute difference

Observed - Expected

                     Yes        No
Gender  Males     -140.32    140.32
        Females    140.32   -140.32

Males yes: 290 – 430.32 = -140.32
Females yes: 795 – 654.68 = 140.32
Males no: 364 – 223.68 = 140.32
Females no: 200 – 340.32 = -140.32

Step 3: Compute (Difference Squared)/Expected

                    Yes       No
Gender  Males     45.75    88.02
        Females   30.07    57.85

Males yes: (-140.32)² / 430.32 = 45.75
Females yes: (140.32)² / 654.68 = 30.07
Males no: (140.32)² / 223.68 = 88.02
Females no: (-140.32)² / 340.32 = 57.85

Step 4: Sum of all Cell chi-Squares

45.75 + 30.07 + 88.02 + 57.85 = 221.69 ≈ 221.7

χ² = 221.7
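The four steps can be reproduced with a short script, which is a convenient way to check the hand calculation. This is a minimal sketch in plain Python with a hypothetical function name; it works with unrounded expected frequencies, so individual cells can differ in the second decimal from a calculation that uses the rounded difference 140.32.

```python
def cell_contributions(observed):
    """Per-cell (observed - expected)^2 / expected for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    result = []
    for i, row in enumerate(observed):
        contributions = []
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            contributions.append((obs - exp) ** 2 / exp)
        result.append(contributions)
    return result


# Observed counts (rows: males, females; columns: yes, no).
observed = [[290, 364], [795, 200]]
cells = cell_contributions(observed)
chi2 = sum(sum(row) for row in cells)

for row in cells:
    print([round(c, 2) for c in row])
print(round(chi2, 1))
```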

Testing for Significance

To test for significance, we check in a standard table (χ² distribution table) whether, for the obtained value of χ² and the given number of degrees of freedom, the association between the variables is statistically significant, i.e. whether the differences between observed and expected frequencies are too large to have been caused by chance. The standard level of significance (alpha) used is .05.

In the example above, with one degree of freedom ((Rows-1)*(Columns-1)), the critical value of χ² for α=.05 is 3.84. Since our χ² value (221.7) exceeds the critical value at α=.05, we reject the null hypothesis and conclude that there is a significant association between these two variables.

Moreover, if we look at the critical values at α=.01 and α=.001, we see that our χ² = 221.7 exceeds these values (6.63 and 10.83, respectively), too. This indicates that the association between these two variables is highly significant, and we report it as p < .001.
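Instead of looking up critical values in a table, the p-value can be computed directly. The sketch below is deliberately limited to 1 and 2 degrees of freedom, where the upper tail of the χ² distribution has a simple closed form (the complementary error function for df = 1, an exponential for df = 2); the function name is hypothetical, and larger tables would need a statistics library.

```python
import math


def chi_square_p_value(chi2, df):
    """Upper-tail probability of the chi-square distribution (df 1 or 2 only)."""
    if df == 1:
        # A chi-square with 1 df is a squared standard normal variable.
        return math.erfc(math.sqrt(chi2 / 2))
    if df == 2:
        return math.exp(-chi2 / 2)
    raise ValueError("this sketch only covers df = 1 and df = 2")


# Worked 2x2 example: chi-square = 221.7 with 1 degree of freedom.
p = chi_square_p_value(221.7, 1)
print(p < 0.001)

# The critical values quoted in the text correspond to the usual alpha levels.
for critical in (3.84, 6.63, 10.83):
    print(critical, round(chi_square_p_value(critical, 1), 3))
```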

Just like other statistical tests, the chi-square test is sensitive to sample size. The larger the sample, the more likely it is that you will reject the null hypothesis. In other words, the chi-square test is more likely to be statistically significant with larger sample sizes, even if the association between the two variables is weak.
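A small numerical experiment illustrates this. In the sketch below (plain Python, hypothetical function name, made-up counts), the same weak pattern of 52% versus 48% is observed once with 200 cases and once with 5000 cases; the proportions are identical, but χ² grows in proportion to the sample size and crosses the critical value of 3.84 only in the larger sample.

```python
def chi_square(observed):
    """Chi-square statistic for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return sum(
        (obs - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(observed)
        for j, obs in enumerate(row)
    )


# Made-up tables with identical proportions but different sample sizes.
small = [[52, 48], [48, 52]]               # n = 200
large = [[c * 25 for c in row] for row in small]  # n = 5000

chi2_small = chi_square(small)
chi2_large = chi_square(large)
print(round(chi2_small, 2), round(chi2_large, 2))
```

This is why a significant result should be followed by a measure of the strength of the association rather than interpreted on its own.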

Measuring Strength Of Association

The chi-square test is not a measure of the strength of the association between two variables. Other statistics are needed to measure the strength of the association between nominal (or ordinal) variables, such as phi, Cramer’s V, the contingency coefficient, and gamma.

1) Phi Coefficient (φ)

- Phi is a coefficient based on the value of χ²

- Measure of the strength of the relationship between two dichotomous variables (i.e., 2x2 table)

- Phi ranges between 0 and 1. The higher the value of Phi, the stronger the association between the 2 variables.

- Phi is a symmetrical measure, that is, it does not make the distinction between the independent variable (IV) and the dependent variable (DV). In other words, it does not indicate which variable is the cause of the other.

- Phi is computed as:

$$\varphi = \sqrt{\frac{\chi^2}{N}}$$

Where χ² = value of chi-square statistic and N = sample size.
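Applied to the fear-of-crime example (χ² = 221.7, N = 1649), the computation is a one-liner. The function name below is hypothetical, and the resulting coefficient is this sketch's own calculation rather than a figure quoted in the handout.

```python
import math


def phi_coefficient(chi2, n):
    """Phi for a 2x2 table: square root of chi-square divided by sample size."""
    return math.sqrt(chi2 / n)


# Values from the worked 2x2 example.
phi = phi_coefficient(221.7, 1649)
print(round(phi, 2))
```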

2) Cramer’s V

- V is a coefficient based on the value of χ²

- Measure of the strength of the relationship between two nominal variables, regardless of the size of the contingency table (ex: 3x2, 4x3, 5x2, etc.)

- Same basic idea as Phi, but is not limited to 2x2 tables

- V ranges between 0 and 1. The higher the value of V, the stronger the association between the 2 variables.

- Like Phi, V is a symmetrical measure; it does not make the distinction between the IV and DV.

- In a 2x2 table, V and φ are identical

- V is computed as:

$$V = \sqrt{\frac{\chi^2}{n (k - 1)}}$$

Where χ² = value of chi-square statistic, n = sample size, and k = the smaller of the number of rows and the number of columns in the table (ex: if the table has 2 rows and 3 columns, then k = 2).
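The sketch below computes V for the 2x3 Euro table from Example 2 and confirms that V reduces to phi for a 2x2 table. It is written in plain Python with hypothetical function names; the χ² value for the Euro table is computed inside the script because the handout does not report it.

```python
import math


def chi_square(observed):
    """Chi-square statistic for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return sum(
        (obs - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(observed)
        for j, obs in enumerate(row)
    )


def cramers_v(observed):
    """Cramer's V: sqrt(chi-square / (n * (k - 1))), k = min(rows, columns)."""
    n = sum(sum(row) for row in observed)
    k = min(len(observed), len(observed[0]))
    return math.sqrt(chi_square(observed) / (n * (k - 1)))


euro = [[120, 40, 30], [60, 60, 20]]   # Example 2, 2x3 table
fear = [[290, 364], [795, 200]]        # Example 1, 2x2 table

v_euro = cramers_v(euro)
v_fear = cramers_v(fear)
print(round(v_euro, 3), round(v_fear, 3))
```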

Chi-Square in R

1. Make contingency table
   e.g. with CrossTable() from library(gmodels)

2. Calculate chi-square and p-value
   e.g. included in CrossTable() or use chisq.test()

3. If significant, interpret strength with Phi, Cramer’s V
   e.g. with assocstats() from library(vcd)

Alternative Tests:

- Fisher’s exact test (expected frequencies < 5)

- Yates’s correction (2x2 table)

- Likelihood ratio

IN SUMMARY, WHEN YOU WANT TO ASSESS THE RELATIONSHIP BETWEEN 2 NOMINAL (OR ORDINAL) VARIABLES:

1) Compute the value of χ².

2) If the χ² is statistically significant, measure the strength of the association. If the χ² is not statistically significant, we cannot reject the null hypothesis that the variables are independent (i.e., there is no evidence of an association between the variables, and it is irrelevant to measure the strength of the association).

3) Offer an interpretation of the results.