Error: Inferring Causation from Correlation

Another common error in interpreting a correlation coefficient is to infer that because two variables are correlated, one causes the other.

A nonzero correlation coefficient simply means that there is a concomitant relationship between X and Y; that is, variation in one variable is associated in some way with variation in the other.

It is true that if X causes Y, there must be a correlation between the variables. However, the converse of this statement is not true. A concomitant relationship is necessary but not sufficient for inferring causality. A concomitant relationship often exists because both variables are caused by a third variable. For example, it does not necessarily follow from the positive correlation between Sunday school attendance and honesty that attending Sunday school causes honesty. In all likelihood, both variables are caused by a third variable: parental reinforcement and modeling practices in the home.
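The third-variable explanation can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the variables Z, X, and Y, the noise model, and the helper names `pearson_r` and `simulate_common_cause` are all hypothetical and do not come from the text. X and Y are each built from Z plus independent noise, so neither one influences the other.

```python
import random


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_common_cause(n=5000, seed=0):
    """Generate X and Y that both depend on Z but never on each other."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]   # hypothetical third variable
    x = [zi + rng.gauss(0, 1) for zi in z]    # X = Z + independent noise
    y = [zi + rng.gauss(0, 1) for zi in z]    # Y = Z + independent noise
    return z, x, y


z, x, y = simulate_common_cause()

# X and Y are correlated even though neither causes the other.
r_xy = pearson_r(x, y)

# Subtracting Z (possible here only because we built the data ourselves)
# leaves the independent noise terms, which are essentially uncorrelated.
x_without_z = [xi - zi for xi, zi in zip(x, z)]
y_without_z = [yi - zi for yi, zi in zip(y, z)]
r_resid = pearson_r(x_without_z, y_without_z)

print(f"r(X, Y) = {r_xy:.2f}")
print(f"r(X - Z, Y - Z) = {r_resid:.2f}")
```

Under this noise model the population correlation between X and Y is .50, yet it arises entirely from the shared dependence on Z. Once Z is removed, the remaining association is close to zero.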

It is easy to fall into the trap of inferring causality from correlation, especially when one variable occurs before the other. Consider the well-publicized positive correlation between years of formal education and income. Does such a correlation mean that going to college causes one to earn more money? Before giving an affirmative answer you would have to know how much college graduates would have earned if they had not gone to college. A causal relationship may in fact exist, but this cannot be ascertained from the correlation. Some or all of the correlation between education and income might be explained in terms of other causal variables. For example, colleges attract two kinds of students: the bright and the rich. We know that bright individuals tend to rise to better-paying jobs whether or not they have gone to college and that few children of rich parents end up poor.

CHECK YOUR UNDERSTANDING OF SECTION 5.5

20. Which of the following are incorrect interpretations of a correlation coefficient and why?

a. The strength of association between two forms (L and M) of a psychological test is .96.

b. There is a medium correlation, r = .67, between the age at which babies can roll over and the age at which they can sit up alone.

c. The correlation between women’s scores on the Beck Depression Inventory and a self-report questionnaire measuring marital discord is .30; this correlation is twice as high as that for men, which is r = .15.

d. We can conclude from the high correlation between risk for sexual assault and alcohol consumption by female victims that victimization is caused at least in part by consuming alcohol.

21. In an attempt to help children with low IQs improve their school performance, a special perceptual awareness program was instituted. Suppose that the program was completely ineffective. The group’s mean IQ before the program was 72. Would you expect it to change after the special program, and if so, in what direction? (Hint: If you don’t see the issue, reread “A Bit of History” in Section 5.1.)

22. Terms to remember:

a. Test-retest reliability

b. Validity

c. Concomitant relationship

5.6 FACTORS THAT AFFECT THE SIZE OF A CORRELATION COEFFICIENT

Nature of the Relationship Between X and Y

There are many ways in which two variables can be related. It is sufficient for our purposes to classify them as a linear (straight line) relationship or a nonlinear (curved line) relationship. Three examples showing the straight or curved lines of best fit for paired scores are presented in Figure 5.6-1. In general, the more closely data points cluster around the line of best fit, whether it is a straight or a curved line, the higher the correlation. You saw in Section 5.2 that when r is equal to 1 or -1, the data points fall on a straight line. If X and Y are normally distributed and have equal variances, as the absolute value of r decreases, the points form fatter and fatter ellipses until finally, when r is equal to 0, they tend to fall in a circle. The Pearson product-moment correlation always fits data points by a straight line. This works fine if the relationship is linear but not so well if the relationship is nonlinear, as in Figure 5.6-1(c). If a nonlinear relationship is fitted by a straight line, the data points will not cluster around the line as closely as they would around an appropriate curved line; consequently, r underestimates the strength of association. In fact, an r equal to 0 can be obtained even though X and Y are strongly related.
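The last point, that r can equal 0 even when Y is completely determined by X, can be checked with a few lines of code. The following is a minimal sketch using made-up scores. The data and the helper `pearson_r` are assumptions chosen for illustration, not values from the text.

```python
def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical X scores, symmetric around zero.
x = [-3, -2, -1, 0, 1, 2, 3]

# Linear case: every point falls exactly on the straight line Y = 2X + 1.
y_linear = [2 * xi + 1 for xi in x]

# Nonlinear case: every point falls exactly on the curve Y = X squared.
y_curved = [xi ** 2 for xi in x]

r_linear = pearson_r(x, y_linear)
r_curved = pearson_r(x, y_curved)

print("r for the linear data:", r_linear)
print("r for the curved data:", r_curved)
```

In both data sets Y is perfectly predictable from X, but only the linear case produces a large r. For the curved data the positive and negative cross-products cancel, so the straight-line fit captures none of the relationship.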

A different correlation measure, called the correlation ratio or eta squared, η², has been developed for determining the strength of association between nonlinearly related variables.

[Figure 5.6-1: three scatterplot panels (a, b, and c), each plotting Y against X]

Figure 5.6-1. Parts a and b illustrate linear relationships; part c illustrates a nonlinear relationship. The higher the correlation, the closer the data points cluster around the line of best fit.


Eta squared fits data points by whatever line is appropriate. If the relationship is linear, a straight line is used, and η² = r². For nonlinear relationships in which the correlation is not equal to zero, η² fits the points by a curved line, and its value is always larger than that for r². A discussion of the correlation ratio can be found in more advanced texts.
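The comparison between η² and r² can be made concrete with a short sketch. This section does not give a formula for η², so the code assumes one common definition: the between-group sum of squares of Y divided by the total sum of squares, with groups formed by the distinct X values. The scores and the function names `pearson_r` and `eta_squared` are hypothetical.

```python
from collections import defaultdict


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def eta_squared(xs, ys):
    """Correlation ratio of Y on X (assumed definition):
    between-group SS over total SS, grouping Y by distinct X values."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    grand_mean = sum(ys) / len(ys)
    ss_total = sum((y - grand_mean) ** 2 for y in ys)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return ss_between / ss_total


# Hypothetical data: three Y scores at each of five X levels.
x = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5]

# Group means of Y fall exactly on a straight line (2, 4, 6, 8, 10).
y_linear = [1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11]

# Group means of Y rise and then level off (1, 4, 6, 7, 7).
y_curved = [0, 1, 2, 3, 4, 5, 5, 6, 7, 6, 7, 8, 6, 7, 8]

for label, y in [("linear", y_linear), ("curved", y_curved)]:
    r2 = pearson_r(x, y) ** 2
    eta2 = eta_squared(x, y)
    print(f"{label}: r squared = {r2:.3f}, eta squared = {eta2:.3f}")

r2_linear, eta2_linear = pearson_r(x, y_linear) ** 2, eta_squared(x, y_linear)
r2_curved, eta2_curved = pearson_r(x, y_curved) ** 2, eta_squared(x, y_curved)
```

When the group means lie on a straight line, the two measures agree. When the means follow a curve, η² exceeds r² because the group means track the curve while the straight line cannot.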

How can you determine whether the relationship between X and Y is linear or nonlinear and hence whether to use r or η²? You can use statistical tests;³ however, the simplest method is to examine the scatterplot for evidence of nonlinearity, the so-called eyeball test. Usually, visual inspection is adequate to detect cases in which r would underestimate strength of association.

In summary, r is a measure of the linear relationship between two quantitative variables. If the relationship is not linear, r underestimates the strength of association.
