Introduction to Correlation
The Pearson product-moment correlation coefficient measures the
degree of association between two interval (or better)-level
variables: for example, the relationship between daily consumption
of fat calories and body weight, between attitudes towards smoking
and attitudes towards consumption of alcohol, or between student
achievement and dollars per student spent by the school district.
Sometimes both of the variables are treated as “dependent,”
meaning that we haven’t ordered them causally. Sometimes one
of the variables, X, is treated as independent and the other, Y, as
dependent. Which of these is dependent and which is independent
depends on your theory of the relationship.
The correlation coefficient, Pearson’s r, ranges between +1 and -1,
where +1 is a perfect positive association (people who get high
scores on X also get high scores on Y) and -1 is a perfect negative
association (people who get high scores on X get low scores on Y).
A correlation near zero indicates that there is little or no linear
relationship.
Related Measures of Association
The correlation coefficient is related to other types of
measures of association:
The partial correlation, which measures the degree of
association between two variables when the effects on them of a
third variable are removed: what is the relationship between
student achievement and dollars per student spent by the school
district when the effect of parents’ SES is removed?
The multiple correlation, which measures the degree to which one
variable is correlated with two or more other variables: how well
can I predict student
Other Related Measures
The squared Pearson’s correlation coefficient, usually called
R squared or the coefficient of determination, tells us how much
of the variation in Y, the dependent variable, can be explained by
variation in X, the independent variable; for example, how much of
the variation in student achievement can be explained by dollars
per student expenditure by the school district?
The quantity 1 - R² is sometimes called the coefficient of
non-determination, and it is an estimate of the proportion of
variance in the
Scatterplot: Visual Representation of the
Relationship Measured by the Correlation
Coefficient
The scatterplot is a figure which plots two measures against each
other for the cases on which both have been taken (for example,
people who have filled out a survey of their attitudes toward
smoking and another survey about their attitudes toward drinking).
In a scatterplot, one of the variables (usually the independent
variable) is plotted along the horizontal or X axis and the other
is plotted along the vertical or Y axis.
Each point in a scatterplot corresponds to the scores (X,Y) for an
individual case (a person, for example) where X is the score that
person was assigned or obtained on one variable and Y is the score
they attained on the other.
The strength of the linear relationship between X and Y is
An Example of a Scatterplot
In this scatterplot, computer anxiety scores (openness to
computing) are plotted along the Y (vertical) axis and computer
self-efficacy scores are plotted along the X (horizontal) axis.
For example, the person to whom the arrow is pointing had a score
of about 17 on the openness scale and about 162 on the
self-efficacy scale. What were the scores on the two scales of the
person with the star next to his point?
Scatterplot Allows You to Visualize
the Relationship between Variables
The purpose of the scatterplot is to visualize the relationship
between the two variables represented by the horizontal and
vertical axes. Note that although the relationship is not perfect,
there is a tendency for higher values of openness to computing to
be associated with larger values of computer self-efficacy,
suggesting that as openness increases, self-efficacy increases.
This indicates that there is a positive correlation.
Drawing A Possible Regression Line
[Figure: scatterplot with a fitted line. X axis: Computer Self-efficacy (20 to 180). Y axis: Openness to computing (0 to 30).]
Let’s draw a line through the swarm of points that best “fits” the data set (minimizes the squared vertical distances between the line and the points). This imposes a linear description on the relationship between the two variables, when sometimes you might want to find out if a line that represented a
curvilinear relationship (in this case an inverted U) was a better fit, but we’ll leave that question for another time. The line that represents this relationship best mathematically is called a
“regression line” and the point at which the mathematically best fitting line crosses the y
Various Types of Associations
[Figure: three scatterplots, each with a fitted line]
Horsepower (Y) against Engine Displacement in cu. inches (X): Positive relationship between X and Y
Miles per Gallon (Y) against Vehicle Weight in lbs. (X): Strong negative relationship between X and Y; points tightly clustered around line; nonlinear trend at lower weights
Total Years of Education (Y) against Number of Children (X): Essentially no relationship between X and Y; points
How is the Correlation Coefficient Computed?
The conceptual formula for the correlation coefficient is a little
daunting, but it looks like this:

$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\left[\sum (X - \bar{X})^2\right]\left[\sum (Y - \bar{Y})^2\right]}}$$

Where X is a person’s or case’s score on the independent variable,
Y is a person’s or case’s score on the dependent variable, and
X-bar and Y-bar are the means of the scores on the independent and
dependent variables, respectively. The quantity in the numerator is
called the sum of the crossproducts (SP). The quantity in the
denominator is the square root of the product of the sum of squares
for both variables (SSx and SSy).
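The conceptual formula can be sketched in a few lines of Python. This is only an illustration: the function name `pearson_r_conceptual` is made up for this example, and the data are the six Shyness/Speeches pairs that this deck works through by hand in a later slide.

```python
import math

def pearson_r_conceptual(xs, ys):
    """Pearson's r from deviation scores: SP / sqrt(SSx * SSy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of crossproducts (numerator of the conceptual formula)
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squares for X and Y (inside the square root in the denominator)
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)

# Shyness (X) and Speeches (Y) scores from the hand-calculation slide
shyness = [0, 2, 3, 6, 9, 10]
speeches = [8, 10, 4, 6, 1, 3]

r = pearson_r_conceptual(shyness, speeches)
print(round(r, 3))
```

The rounded value should agree with the r reported for these data in the hand calculation and in the SPSS output.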
Meaning of Crossproducts
The notion of the crossproducts is not too difficult to understand.
When we have a positive relationship between two variables, a
person who is high on one of the variables will also score high on
the other. And it follows that if his or her score on X is larger than
the mean of variable X, then if there is a positive relationship his
or her score on Y will be larger than the mean of Y. And this
should hold for all or most of the cases.
When the crossproducts are negative (when for example the
typical person who scores higher than the mean on X scores lower
than the mean on Y) then there still may be a relationship but it is
a negative relationship.
Thus the sign of the crossproducts (positive or negative) in the
numerator of the formula for r tells us whether the relationship is
positive or negative.
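To see the sign argument case by case, here is a small sketch that lists each case's crossproduct for the six Shyness/Speeches pairs used in this deck's hand calculation. The variable names are chosen for this illustration only.

```python
# Shyness (X) and Speeches (Y) scores from the hand-calculation slide
shyness = [0, 2, 3, 6, 9, 10]
speeches = [8, 10, 4, 6, 1, 3]

mean_x = sum(shyness) / len(shyness)
mean_y = sum(speeches) / len(speeches)

# One crossproduct per case: (X - mean of X) * (Y - mean of Y)
crossproducts = [(x - mean_x) * (y - mean_y) for x, y in zip(shyness, speeches)]

for x, y, cp in zip(shyness, speeches, crossproducts):
    sign = "negative" if cp < 0 else "positive"
    print(f"X={x:2d}  Y={y:2d}  crossproduct={cp:8.3f}  ({sign})")

sp = sum(crossproducts)            # sum of crossproducts (SP)
n_negative = sum(1 for cp in crossproducts if cp < 0)
print("SP =", round(sp, 3), "| negative crossproducts:", n_negative)
```

Most cases fall on opposite sides of the two means, so most crossproducts are negative and their sum is negative, which is what gives r its negative sign for these data.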
Computing Formula for Pearson’s r
The conceptual formula for Pearson’s r is rarely used to compute
it. You will find a nice illustration here of a computing formula
and a brief example.
Here is another computing formula:

$$r = \frac{N\sum XY - \sum X \sum Y}{\sqrt{\left[N\sum X^2 - (\sum X)^2\right]\left[N\sum Y^2 - (\sum Y)^2\right]}}$$
Scatterplot for the Correlation.sav data set
Open the correlation.sav file in SPSS
Go to Graphs/Chart Builder/OK
Under Choose From select Scatter/Dot (top leftmost icon) and double click to move it into the preview window
Drag Shyness onto the X axis box
Drag Speeches onto the Y axis box and click OK
In the Output viewer, double click on the chart to bring up the Chart Editor; go to Elements and select “Fit Line at Total,” then select
ScatterPlot of Shyness and Speeches
A negative
Computational Example of r for the relationship between Shyness
and Speeches

$$r = \frac{N\sum XY - \sum X \sum Y}{\sqrt{\left[N\sum X^2 - (\sum X)^2\right]\left[N\sum Y^2 - (\sum Y)^2\right]}}$$

        Shyness (X)   Speeches (Y)   XY    X²    Y²
        0             8              0     0     64
        2             10             20    4     100
        3             4              12    9     16
        6             6              36    36    36
        9             1              9     81    1
        10            3              30    100   9
Sum     30            32             107   230   226

$$r = \frac{(6 \times 107) - (30)(32)}{\sqrt{\left[6(230) - 30^2\right]\left[6(226) - 32^2\right]}} = \frac{-318}{\sqrt{(480)(332)}} = -.797$$
(note that the crossproducts term in the numerator is negative)
SPSS vs. the Hand Calculation: It’s a Lot Quicker
Now let’s try computing the coefficient with that same
data in SPSS
Go to Analyze/Correlate/Bivariate, and move Shyness
and Speeches into the Variables box. Click Pearson,
one-tailed, and OK. Did you get the same result as
the hand calculation?
Correlations
                                 Shyness   Speeches
Shyness    Pearson Correlation   1         -.797*
           Sig. (1-tailed)                 .029
           N                     6         6
Speeches   Pearson Correlation   -.797*    1
           Sig. (1-tailed)       .029
           N                     6         6
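The hand calculation can also be checked outside SPSS. The sketch below applies the computing formula to the same six Shyness/Speeches pairs; the function name `pearson_r_computing` is invented for this example and is not part of SPSS or of the slides.

```python
import math

def pearson_r_computing(xs, ys):
    """Pearson's r from raw sums, following the computing formula."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

shyness = [0, 2, 3, 6, 9, 10]
speeches = [8, 10, 4, 6, 1, 3]

# Numerator for these data: 6 * 107 - 30 * 32
numerator_check = 6 * 107 - 30 * 32
r_check = pearson_r_computing(shyness, speeches)
print(numerator_check, round(r_check, 3))
```

Both routes (raw sums here, SPSS above) should give the same coefficient to three decimals.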
Using SPSS to Test a Hypothesis about the Strength of
Association between Two Interval or Ratio Level
Variables: Correlation Coefficient
Download the file called World95.sav
We are going to test the strength of the association between
population density (the variable is “number of people per square
kilometer”) and “average female life expectancy,” based on data
from 109 cases (109 countries, with each country a case). Our
hypothesis is that the association will be negative; that is, as
population density increases, female life expectancy will decrease.
In SPSS Data Editor, go to Analyze/Correlate/Bivariate
Move the two variables, “number of people per square kilometer” and “average female life expectancy,” into the Variables box
Under Correlation Coefficients, select Pearson
Under Test of Significance, click one-tailed (we are making a directional prediction, so we will only accept as significant results in the “negative” 5% of the distribution)
Click “flag significant results”
Click Options, and under Statistics, select Means and standard deviations, then Continue, then OK
SPSS Output for Bivariate Correlation

Correlations
                                                         Number of people /   Average female
                                                         sq. kilometer        life expectancy
Number of people / sq. kilometer   Pearson Correlation   1                    .128
                                   Sig. (1-tailed)       .                    .093
                                   N                     109                  109
Average female life expectancy     Pearson Correlation   .128                 1
                                   Sig. (1-tailed)       .093                 .
                                   N                     109                  109

Descriptive Statistics
                                   Mean      Std. Deviation   N
Number of people / sq. kilometer   203.415   675.7052         109
Average female life expectancy     70.16     10.572           109
Significance Test of Pearson’s r
Significance of r is tested with a t-statistic with N-2 degrees of
freedom, where

$$t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}$$

SPSS provides the results of the t test of the significance of r
for you. You can also consult Table F in Levin and Fox.
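As a rough check on the formula, the t statistic can be computed directly from r and N. The sketch below uses the values reported in this deck's SPSS output (r = .128 with N = 109, and r = .587 with N = 85); the helper name `t_for_r` is an assumption made for this example. It computes only t and the degrees of freedom; the p value still comes from SPSS or a t table.

```python
import math

def t_for_r(r, n):
    """t statistic for testing Pearson's r, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Population density vs. female life expectancy (r = .128, N = 109)
t_density = t_for_r(0.128, 109)
df_density = 109 - 2

# Percent urban vs. male literacy (r = .587, N = 85)
t_urban = t_for_r(0.587, 85)
df_urban = 85 - 2

print(round(t_density, 3), df_density)
print(round(t_urban, 3), df_urban)
```

The second value comes out close to the t that SPSS reports for the slope in the regression output for the same two variables (6.608); the small difference is due to using r rounded to three decimals.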
Write a sentence which states your findings. Report the correlation coefficient, r, R2
A Hypothesis to Test
Now, test the following hypothesis:
Countries in which there is a larger
proportion of people living in cities (urban)
will have a higher proportion of males who
read (lit_male) (not “people who read”)
Write up your result
Writing up the Results of Your Test
Descriptive Statistics
                              Mean    Std. Deviation   N
People living in cities (%)   56.53   24.203           108
Males who read (%)            78.73   20.445           85

Correlations
                                                    People living     Males who
                                                    in cities (%)     read (%)
People living in cities (%)   Pearson Correlation   1                 .587**
                              Sig. (1-tailed)       .                 .000
                              N                     108               85
Males who read (%)            Pearson Correlation   .587**            1
                              Sig. (1-tailed)       .000              .
                              N                     85                85
**. Correlation is significant at the 0.01 level (1-tailed).
The hypothesis that the proportion of its people living in cities would be positively associated with a country’s rate of male literacy was confirmed (r = .587, DF=83, p < .01, one-tailed).
The Regression Model
Regression takes us a step beyond correlation in that not only are
we concerned with the strength of the association, but we want to
be able to describe its nature with sufficient precision to be able
to make predictions.
To be able to make predictions, we need to be able to characterize
one of the variables in the relationship as independent and the
other as dependent.
Regression model, cont’d
A regression equation is used to predict the value of a dependent
variable, Y, in this case a country’s male literacy rate, on the
basis of some constant a that applies to all cases, plus some
amount b which is applied to each individual value of X (the
country’s % of people living in cities), plus some error term e
that is unique to the individual case and unpredictable:
Y = a + bX + e (male literacy = a + b(percent urban) + e)
Calculating the Regression Line
What line best describes the relationship depicted in the
scattergram?
The formula for the regression line is Y = a + bX + e, where Y is
(in this case) a country’s score on male literacy and X is the
country’s % of people living in cities, a is the y-intercept or
constant (the point where the line crosses the Y axis, or, the
value of Y when X is zero if X is a variable for which there is
zero amount of the property) and b is the slope of the regression
line (the amount by which male literacy changes for each unit
change in percent living in cities).
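For readers curious about where a and b come from, here is a minimal sketch using the standard least-squares formulas b = SP / SSx and a = mean(Y) - b * mean(X). These formulas are not given in this deck and are assumed here. Because the World95.sav data are not reproduced in the slides, the sketch uses the six Shyness/Speeches pairs from the earlier hand calculation instead; the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))  # crossproducts
    ss_x = sum((x - mean_x) ** 2 for x in xs)                      # sum of squares of X
    b = sp / ss_x                # slope: change in Y per unit change in X
    a = mean_y - b * mean_x      # intercept: value of Y when X is zero
    return a, b

shyness = [0, 2, 3, 6, 9, 10]
speeches = [8, 10, 4, 6, 1, 3]

a, b = fit_line(shyness, speeches)
print("intercept a =", round(a, 4), " slope b =", round(b, 4))
```

The slope is negative here for the same reason r is negative for these data: the sum of crossproducts in its numerator is negative.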
Calculating the Regression Line, cont’d
We will not do the hand computations for b, the slope of the
regression line, or a, the intercept. Let’s use another SPSS method
for finding not only the correlation coefficient, Pearson’s r, but
also the regression equation (i.e., find the intercept, a, and the
slope of the regression line, b).
In Data Editor, go to Analyze/Regression/Linear
Put the dependent variable, male literacy, into the Dependent box
Move the independent variable, percentage of people living in cities, into the Independent(s) box, and click OK
Finding the Intercept (Constant) and Slope (β
or unstandardized regression coefficient)
Coefficients(a)
                                  Unstandardized Coefficients   Standardized Coefficients
Model                             B        Std. Error           Beta                        t        Sig.
1   (Constant)                    52.372   4.378                                            11.961   .000
    People living in cities (%)   .495     .075                 .587                        6.608    .000
a. Dependent Variable: Males who read (%)
The regression equation for predicting Y (male literacy) is Y = a + (b)X, or Y = 52.372 + .495X, so if we wanted to predict the male literacy rate in country j we would multiply its percentage of people living in cities by .495, and add the constant, 52.372. Compare this to the scatterplot. Does it look right?
When scores on X and Y are available as Z scores, and are expressed in the same standardized units, then there is no intercept (constant) because you don’t have to make an adjustment for the differences in scale between X and Y, and so the equation for the regression line just becomes Y = (b)X, or in this case Y = .587X, where .587 is the standardized version of b (note that it’s also the value of r, but only when there is just the one X variable and not multiple independent variables).
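The prediction rule can be written out as a short sketch. The constant and slope are the values from the Coefficients table; the function names and the 50% input are made up for illustration and do not correspond to any particular country in the data.

```python
# Values taken from the SPSS Coefficients table
INTERCEPT_A = 52.372   # (Constant)
SLOPE_B = 0.495        # unstandardized B for People living in cities (%)
BETA = 0.587           # standardized coefficient (equals r with one predictor)

def predict_male_literacy(percent_urban):
    """Predicted male literacy (%) from percent living in cities: Y = a + bX."""
    return INTERCEPT_A + SLOPE_B * percent_urban

def predict_z_literacy(z_percent_urban):
    """Predicted literacy in Z-score units: no intercept, slope is Beta."""
    return BETA * z_percent_urban

# Hypothetical country with 50% of its people living in cities
predicted = predict_male_literacy(50)
print(round(predicted, 3))

# A country one standard deviation above the mean on percent urban
print(predict_z_literacy(1.0))
```

Setting the input to zero returns the constant, which matches the description of a as the value of Y when X is zero.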
The intercept, or a (sometimes called β0)
The slope, or β
More Output from Linear Regression

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .587a   .345       .337                16.649
a. Predictors: (Constant), People living in cities (%)
The correlation coefficient and the coefficient of
F test of the regression equation: More Output from Linear
Regression, Cont’d

ANOVA(b)
Model            Sum of Squares   df   Mean Square   F        Sig.
1   Regression   12104.898        1    12104.898     43.668   .000a
    Residual     23007.879        83   277.203
    Total        35112.776        84
a. Predictors: (Constant), People living in cities (%)
b. Dependent Variable: Males who read (%)
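The entries of the ANOVA and Model Summary tables are tied together by simple arithmetic, which the sketch below reproduces from the two sums of squares and their degrees of freedom. The relationships used (mean square = SS / df, F = regression mean square / residual mean square, R Square = regression SS / total SS, standard error of the estimate = square root of the residual mean square, and the usual adjusted R Square formula) are standard results assumed here rather than stated in the slides.

```python
import math

# Values copied from the SPSS ANOVA table
ss_regression = 12104.898
ss_residual = 23007.879
df_regression = 1
df_residual = 83

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F is the ratio of the regression mean square to the residual mean square
f_ratio = ms_regression / ms_residual

# R Square: share of the total sum of squares accounted for by the regression
ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total

# Adjusted R Square with n - 1 = 84 total df and 83 residual df
adjusted_r_squared = 1 - (1 - r_squared) * (df_regression + df_residual) / df_residual

# Std. Error of the Estimate
std_error_estimate = math.sqrt(ms_residual)

print(round(f_ratio, 3), round(r_squared, 3),
      round(adjusted_r_squared, 3), round(std_error_estimate, 3))
```

Each printed value can be compared with the corresponding cell in the SPSS output (F, R Square, Adjusted R Square, Std. Error of the Estimate).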
If the independent variable, X, were of no value in predicting Y, the best
estimate of Y would be the mean of Y. To see how much better our calculated
regression line is as a predictor of Y than the simple mean of Y, we calculate
the sum of squares for the regression line and then a residual sum of squares
(variance left over after the regression line has done its work as a predictor)
which shows how well or how badly the regression line fits the actual obtained
scores on Y. If the residual mean square is large compared to the regression
mean square, the value of F would be low and the resulting F ratio may not be
significant. If the F ratio is statistically significant it suggests that we can
reject the hypothesis that our predictor, β, is zero in the population, and say
that the
Partial Correlation
What is the relationship between a country’s
percentage of people living in cities (X2, the IV) and
male literacy rate (Y, the DV) when the effects of gross
domestic product (X1, another potential IV or control
variable) are removed?
That is, what happens when you statistically remove that portion of the variance that both percentage of people living in cities (X2) and gross domestic product (X1) have in common with each other and with Y, male literacy rate; i.e., when you compute a partial correlation?
Using SPSS to Compute a Partial
Correlation
A partial correlation is the relationship between two variables
after removing the overlap with a third variable completely
from both variables. In the diagram below, this would be the
relationship between male literacy (Y) and percentage living in
cities (X2), after removing the influence of gross domestic
product (X1) on both literacy and percentage living in cities.
In the calculation of the partial correlation coefficient
r(YX2.X1), the area of interest is section a, and the effects
removed are those in b, c, and d; partial correlation is the
relationship of X2 and Y after the influence of X1 is completely
removed from both variables. When only the effect of X1 on X2 is
removed, this is called a
Computing the Partial Correlation in
SPSS
Go to Analyze/Correlate/Partial
Move % People living in cities and Males who read into the Variables box
Put Gross Domestic Product into the Controlling for box
Select one-tailed test and check display actual significance level
Under Options, select zero-order correlations
Click Continue and then OK
Comparing Partial to Zero-Order Correlation: Effect of
Controlling for GDP on Relationship Between Percent
Living in Cities and Male Literacy
Note that the partial correlation of % people living in cities and male literacy is only .4644 when GDP is held constant, whereas the zero-order correlation you obtained previously was .5871. So clearly GDP is a control variable which influences the relationship between % of people living in cities and male literacy, although the % living in cities-literacy relationship is still significant even with GDP removed.
Correlations
                                                                              People living   Males who   Gross domestic
Control Variables                                                             in cities (%)   read (%)    product / capita
-none-(a)          People living in cities (%)       Correlation              1.000           .587        .591
                                                     Significance (1-tailed)  .               .000        .000
                                                     df                       0               83          83
                   Males who read (%)                Correlation              .587            1.000       .417
                                                     Significance (1-tailed)  .000            .           .000
                                                     df                       83              0           83
                   Gross domestic product / capita   Correlation              .591            .417        1.000
                                                     Significance (1-tailed)  .000            .000        .
                                                     df                       83              83          0
Gross domestic     People living in cities (%)       Correlation              1.000           .464
product / capita                                     Significance (1-tailed)  .               .000
                                                     df                       0               82
                   Males who read (%)                Correlation              .464            1.000
                                                     Significance (1-tailed)  .000            .
                                                     df                       82              0
a. Cells contain zero-order (Pearson) correlations.

Zero-order r: .587
r when effect of GDP is removed: .464
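The partial correlation in the lower panel can be recovered from the three zero-order correlations in the upper panel. The sketch below uses the standard first-order partial correlation formula, which is not shown in the slides and is assumed here; the function name `partial_r` and its argument names are chosen for this example.

```python
import math

def partial_r(r_yx2, r_yx1, r_x2x1):
    """Partial correlation of Y and X2 with X1 removed from both.

    r_yx2  : zero-order correlation between Y and X2
    r_yx1  : zero-order correlation between Y and the control variable X1
    r_x2x1 : zero-order correlation between X2 and the control variable X1
    """
    numerator = r_yx2 - r_yx1 * r_x2x1
    denominator = math.sqrt((1 - r_yx1 ** 2) * (1 - r_x2x1 ** 2))
    return numerator / denominator

# Zero-order correlations from the SPSS table (three decimals)
r_literacy_urban = 0.587   # Males who read with People living in cities
r_literacy_gdp = 0.417     # Males who read with GDP per capita
r_urban_gdp = 0.591        # People living in cities with GDP per capita

r_partial = partial_r(r_literacy_urban, r_literacy_gdp, r_urban_gdp)
print(round(r_partial, 3))
```

Because the inputs are rounded to three decimals, the result agrees with the SPSS value only to about three decimals.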