Warnings about Correlation - Reading Statistics Huck

At this point, you may be tempted to consider yourself a semi-expert when it comes to deciphering discussions about correlation. You now know what a scatter diagram is, you have looked at the correlational continuum (and know that correlation coefficients typically extend from −1.00 to +1.00), you understand what a correlation matrix is, and you have considered several different kinds of bivariate correlation. Before you assume that you know everything there is to know about measuring the relationship between two variables, I want to provide you with six warnings that deal with causality, the coefficient of determination, the possibility of outliers, the assumption of linearity, the notion of independence, and criteria for claims of high and low correlations.

Correlation and Cause

It is important for you to know that a correlation coefficient does not speak to the issue of cause and effect. In other words, whether a particular variable has a causal impact on a different variable cannot be determined by measuring the two variables simultaneously and then correlating the two sets of data. Many recipients of research reports (and even a few researchers) make the mistake of thinking that a high correlation implies that one variable has a causal influence on the other variable. To prevent yourself from making this mistake, I suggest that you memorize this simple statement: correlation ≠ cause.

Competent researchers often collect data using strategies that allow them to address the issue of cause. Those strategies are typically complex and require a consideration of issues that cannot be discussed here. In time, however, I am confident that you will understand the extra demands that are placed on researchers who want to investigate the potential causal connections between variables. For now, all I can do is ask that you believe me when I say that bivariate correlational data alone cannot be used to establish a cause-and-effect situation.

Coefficient of Determination

To get a better feel for the strength of the relationship between two variables, many researchers square the value of the correlation coefficient. For example, if r turns out equal to .80, the researcher squares .80 and obtains .64. When r is squared like this, the resulting value, r², is called the coefficient of determination.

The coefficient of determination indicates the proportion of variability in one variable that is associated with (or explained by) variability in the other variable. The value of r² lies somewhere between 0 and 1.00, and researchers usually multiply it by 100 so they can talk about the percentage of explained variability.

In Excerpt 3.24, we see an example from a stress/eyewitness study where r² has been converted into a percentage.

EXCERPT 3.24 • r² and Explained Variation

Pearson’s correlation coefficient between the change in heart rate (labyrinth mean heart rate–baseline mean heart rate) and state anxiety score showed a reliable association, r = .76 [and] r² = .58. Change in heart rate accounted for 58% of the variance in state anxiety score.

Source: Valentine, T., & Mesout, J. (2009). Eyewitness identification under stress in the London Dungeon. Applied Cognitive Psychology, 23(2), 151–161.

As this excerpt indicates, researchers sometimes refer to this percentage as the amount of variance in one variable that is accounted for by the other variable, or they sometimes say that this percentage indicates the amount of shared variance.

As suggested by the material in Excerpt 3.24, the value of r² indicates how much (proportionately speaking) variability in either variable is explained by the other variable. The implication of this is that the raw correlation coefficient (i.e., the value of r when not squared) exaggerates how strong the relationship really is between two variables. Note that r must be stronger than .70 for there to be at least 50 percent explained variability, because .70 squared is only .49. Or, consider the case where r = .50; here, only one-fourth of the variability is explained.
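The arithmetic behind these statements can be sketched in a few lines of Python. This is only an illustration: the function name `percent_explained` is a hypothetical helper, not something taken from the text, and the r values are the ones mentioned in this section.

```python
def percent_explained(r: float) -> float:
    """Square r (the coefficient of determination) and express it as a percentage."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1.00 and +1.00")
    return (r ** 2) * 100


# r values that appear in this section: .80, .76 (Excerpt 3.24), .70, and .50
for r in (0.80, 0.76, 0.70, 0.50):
    print(f"r = {r:.2f}  r squared = {r ** 2:.2f}  explained = {percent_explained(r):.0f}%")
```

Because the sign of r disappears when it is squared, r = −.50 and r = +.50 both correspond to 25 percent explained variability.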

Outliers

My third warning concerns the effect on r of one or more data points located away from the bulk of the scores. Such data points are called outliers, and they can cause the size of a correlation coefficient to understate or exaggerate the strength of the relationship between two variables. In Excerpt 3.25, we see a case where the researchers were aware of the danger of outliers, so they examined their scatter plots before making claims based on their correlation coefficients.


EXCERPT 3.25 • Outliers

[C]orrelations were run for the whole sample and for married and custodial fathers separately. . . . [W]e examined the scatterplots to ensure that the relations were not attributable to one or a few outliers. The plots show clear group tendencies not inflated by extreme data points.

Source: Bernier, A., & Miljkovitch, R. (2009). Intergenerational transmission of attachment in father–child dyads: The case of single parenthood. Journal of Genetic Psychology, 170(1), 31–52.

In contrast to the good example provided in Excerpt 3.25, most researchers fail to check to see if one or more outliers serve to distort the statistical summary of the bivariate relationships they study. There are not many scatter plots in journal articles, and thus you cannot examine the data yourself to see if outliers were present. Almost always, only the correlation coefficient is provided. Give the researcher some extra credit, however, whenever you see a statement to the effect that the correlation coefficient was computed after an examination of a scatter plot revealed no outliers (or revealed an outlier that was removed prior to computing the correlation coefficient).
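To see how much a single outlier can distort r, consider the following minimal sketch. The data are invented purely for illustration (they come from no study), and `pearson_r` is a hypothetical helper that computes Pearson's r from the usual sums of squares and cross-products.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up scores with almost no relationship between x and y.
x_weak = [1, 2, 3, 4, 5, 6]
y_weak = [3, 5, 2, 6, 3, 4]
r_weak = pearson_r(x_weak, y_weak)
# One extreme point at (20, 20) exaggerates the relationship.
r_weak_outlier = pearson_r(x_weak + [20], y_weak + [20])

# Made-up scores that fall exactly on a straight line.
x_strong = [1, 2, 3, 4, 5, 6]
y_strong = [1, 2, 3, 4, 5, 6]
r_strong = pearson_r(x_strong, y_strong)
# One point far off the line, at (7, 0), understates the relationship.
r_strong_outlier = pearson_r(x_strong + [7], y_strong + [0])

print(f"weak data:   r = {r_weak:.2f}, with outlier r = {r_weak_outlier:.2f}")
print(f"strong data: r = {r_strong:.2f}, with outlier r = {r_strong_outlier:.2f}")
```

With these invented numbers, one added point moves r from about .11 to about .95 in the first case and from 1.00 down to .25 in the second, which is why a scatter plot is worth inspecting before r is trusted.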

Linearity

The most popular technique for assessing the strength of a bivariate relationship is Pearson’s product–moment correlation. This correlational procedure works nicely if the two variables have a linear relationship. Pearson’s technique does not work well, however, if a curvilinear relationship exists between the two variables.

A linear relationship does not require that all data points (in a scatter plot) lie on a straight line. Instead, what is required is that the path of the data points be straight. The path itself can be very narrow, with most data points falling near an imaginary straight line, or the path can be very wide, so long as the path is straight. (Regardless of how narrow or wide the path is, the path to which we refer can be tilted at any angle.)

If a curvilinear relationship exists between two variables, Pearson’s correlation underestimates the strength of the relationship present in the data. Accordingly, you can place more confidence in any correlation coefficient you see when the researcher who presents it indicates that a scatter plot was inspected to see whether the relationship was linear before Pearson’s r was used to summarize the nature and strength of the relationship. Conversely, add a few grains of salt to the rs that are thrown your way without statements concerning the linearity of the data.
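The way Pearson's r can miss a curvilinear relationship can be illustrated with a small sketch. The data below are invented for the purpose (a perfect U-shaped pattern and a perfect straight line), and `pearson_r` is a hypothetical helper, not a function named in the text.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [-3, -2, -1, 0, 1, 2, 3]

# Straight path: y is an exact linear function of x.
y_line = [2 * x + 1 for x in xs]
# Curved path: y is an exact function of x, but the pattern is U-shaped.
y_curve = [x ** 2 for x in xs]

r_line = pearson_r(xs, y_line)
r_curve = pearson_r(xs, y_curve)

print(f"linear data:      r = {r_line:.2f}")
print(f"curvilinear data: r = {r_curve:.2f}")
```

In both data sets y is completely determined by x, yet r equals 1.00 for the straight path and 0 for the U-shaped one. This is an extreme, constructed case; real curvilinear data typically produce an r that is too small rather than exactly zero.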

In Excerpt 3.26, we see an example where a pair of researchers checked to see if their bivariate data sets were linear. These researchers deserve high praise for taking the time to check out the linearity assumption before computing Pearson’s r. Unfortunately, most researchers collect their data and compute correlation coefficients without ever thinking about linearity.

EXCERPT 3.26 • Linearity

Examination of the scatter plots provided further information on linearity [and] no evidence of curvilinear relationship was identified.

Source: Tam, D. M. Y., & Coleman, H. (2009). Construction and validation of a professional suitability scale for social work practice. Journal of Social Work Education, 45(1), 47–63.
