(Lesson-6) Correlation.ppt

(1)

(2)

Introduction to Correlation



The Pearson product-moment correlation coefficient measures the degree of linear association between two interval (or better)-level variables. Examples include the relationship between daily consumption of fat calories and body weight, or between attitudes towards smoking and attitudes towards consumption of alcohol. What is the relationship between student achievement and dollars per student spent by the school district?



Sometimes both of the variables are treated as “dependent,”

meaning that we haven’t ordered them causally. Sometimes one

of the variables, X, is treated as independent and the other, Y, as

dependent. Which of these is dependent and which is independent

depends on your theory of the relationship



The correlation coefficient, Pearson’s

r

, ranges between +1 and -1

where +1 is a perfect positive association (people who get high

scores on X also get high scores on Y) and -1 is a perfect negative

association (people who get high scores on X get low scores on Y).

A correlation near zero indicates that there is no linear relationship

(3)

Related Measures of Association



The correlation coefficient is related to other types of

measures of association:



The partial correlation, which measures the degree of association between two variables when the effect of a third variable on both of them is removed: what is the relationship between student achievement and dollars per student spent by the school district when the effect of parents’ SES is removed?



The multiple correlation, which measures the degree to which one variable is correlated with two or more other variables: how well can I predict student

(4)

Other Related Measures



The squared Pearson’s correlation coefficient, usually called R squared or the coefficient of determination, tells us what proportion of the variation in Y, the dependent variable, can be explained by variation in X, the independent variable; for example, how much of the variation in student achievement can be explained by dollars per student expenditure by the school district?



The quantity 1 − R² is sometimes called the coefficient of non-determination, and it is an estimate of the proportion of variance in the

(5)

Scatterplot: Visual Representation of the

Relationship Measured by the Correlation

Coefficient



The scatterplot is a figure that plots cases on which two measures have been taken (for example, people who have filled out a survey of their attitudes toward smoking and another survey about their attitudes toward drinking), with one measure plotted against the other



In a scatterplot, one of the variables (usually the

independent variable) is plotted along the horizontal or X

axis and the other is plotted along the vertical or Y axis



Each point in a scatterplot corresponds to the scores (X,Y)

for an individual case (a person, for example) where X is

the score that person was assigned or obtained on one

variable and Y is the score they attained on the other



The strength of the linear relationship between X and Y is

(6)

An Example of a Scatterplot

In this scatterplot, computer anxiety scores (openness to computing) are plotted against the Y (vertical) axis and computer self-efficacy scores are plotted along the X (horizontal) axis. For example, the person to whom the arrow is pointing had a score of about 17 on the openness scale and about 162 on the self-efficacy scale. What were the scores on the two scales of the person with the star next to his point?

(7)

Scatterplot Allows You to Visualize

the Relationship between Variables

The purpose of the scatterplot is to visualize the relationship between the two variables represented by the horizontal and vertical axes. Note that although the relationship is not perfect, there is a tendency for higher values of openness to computing to be associated with larger values of computer self-efficacy, suggesting that as openness increases, self-efficacy increases. This indicates that there is a positive correlation

(8)

Drawing A Possible Regression Line

[Figure: scatterplot of Openness to computing (Y axis) against Computer Self-efficacy (X axis)]

Let’s draw a line through the swarm of points that best “fits” the data set (that is, minimizes the sum of the squared vertical distances between the line and the points). This imposes a linear description on the relationship between the two variables; sometimes you might want to find out whether a line representing a curvilinear relationship (in this case an inverted U) is a better fit, but we’ll leave that question for another time. The line that represents this relationship best mathematically is called a “regression line,” and the point at which the mathematically best-fitting line crosses the y

(9)

Various Types of Associations

[Figure: three scatterplots]

Horsepower (Y) against Engine Displacement in cu. inches (X): Positive relationship between X and Y

Miles per Gallon (Y) against Vehicle Weight in lbs. (X): Strong negative relationship between X and Y; points tightly clustered around line; nonlinear trend at lower weights

Total Years of Education (Y) against Number of Children (X): Essentially no relationship between X and Y; points

(10)

How is the Correlation Coefficient

Computed?



The conceptual formula for the

correlation coefficient is a little

daunting, but it looks like this:

$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\left[\sum (X - \bar{X})^2\right]\left[\sum (Y - \bar{Y})^2\right]}}$$

Where X is a person’s or case’s score on the independent variable, Y is a person’s or case’s score on the dependent variable, and X-bar and Y-bar are the means of the scores on the

independent and dependent variables, respectively. The quantity in the numerator is called the sum of the crossproducts (SP). The quantity in the denominator is the square root of the

product of the sum of squares for both variables (SS_x and SS_y)
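As a rough illustration of the conceptual formula, the sketch below computes SP, SS_x, SS_y and r in plain Python. The six (X, Y) pairs are the Shyness and Speeches scores from this lesson's hand-calculation example; the function names are illustrative choices, not part of SPSS or any library.

```python
import math

# Shyness (X) and Speeches (Y) scores for the six cases in the lesson's worked example
shyness = [0, 2, 3, 6, 9, 10]
speeches = [8, 10, 4, 6, 1, 3]

def mean(values):
    return sum(values) / len(values)

def sum_of_crossproducts(xs, ys):
    """SP: sum of (X - X-bar)(Y - Y-bar) over all cases."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

def sum_of_squares(values):
    """SS: sum of squared deviations from the mean."""
    v_bar = mean(values)
    return sum((v - v_bar) ** 2 for v in values)

def pearson_r(xs, ys):
    """Conceptual formula: r = SP / sqrt(SS_x * SS_y)."""
    return sum_of_crossproducts(xs, ys) / math.sqrt(
        sum_of_squares(xs) * sum_of_squares(ys)
    )

print("SP   =", round(sum_of_crossproducts(shyness, speeches), 3))
print("SS_x =", round(sum_of_squares(shyness), 3))
print("SS_y =", round(sum_of_squares(speeches), 3))
print("r    =", round(pearson_r(shyness, speeches), 3))
```

Because the denominator is a square root of sums of squares, it is never negative, so the sign of r comes entirely from SP.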

(11)

Meaning of Crossproducts



The notion of the crossproducts is not too difficult to understand.

When we have a positive relationship between two variables, a

person who is high on one of the variables will also score high on

the other. And it follows that if his or her score on X is larger than

the mean of variable X, then if there is a positive relationship his

or her score on Y will be larger than the mean of Y. And this

should hold for all or most of the cases



When the crossproducts are negative (when for example the

typical person who scores higher than the mean on X scores lower

than the mean on Y) then there still may be a relationship but it is

a negative relationship



Thus the sign of the crossproducts (positive or negative) in the numerator of the formula for r tells us whether the relationship is positive or negative
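To make the sign argument concrete, here is a small sketch that lists the crossproduct for each case separately. It uses the six Shyness and Speeches scores from this lesson's worked example; the variable names are illustrative.

```python
shyness = [0, 2, 3, 6, 9, 10]    # X
speeches = [8, 10, 4, 6, 1, 3]   # Y

x_bar = sum(shyness) / len(shyness)
y_bar = sum(speeches) / len(speeches)

# One crossproduct per case: (X - X-bar) * (Y - Y-bar)
crossproducts = [(x - x_bar) * (y - y_bar) for x, y in zip(shyness, speeches)]

for x, y, cp in zip(shyness, speeches, crossproducts):
    # A negative crossproduct means the case is above one mean and below the other
    side = "opposite sides of" if cp < 0 else "the same side of"
    print(f"X={x:2d}  Y={y:2d}  crossproduct={cp:8.3f}  ({side} the two means)")

negative_cases = sum(1 for cp in crossproducts if cp < 0)
print("cases with a negative crossproduct:", negative_cases)
print("sum of crossproducts (SP):", round(sum(crossproducts), 3))
```

Most cases fall on opposite sides of the two means, so the negative crossproducts outweigh the positive ones and the sum is negative.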

(12)

Computing Formula for Pearson’s

r



The conceptual formula for Pearson’s r is rarely used to compute it. You will find a nice illustration here of a computing formula and a brief example

Here is another computing formula

$$r = \frac{N\sum XY - \sum X \sum Y}{\sqrt{\left[N\sum X^2 - \left(\sum X\right)^2\right]\left[N\sum Y^2 - \left(\sum Y\right)^2\right]}}$$

(13)

Scatterplot for the Correlation.sav

data set



Open the correlation.sav file in SPSS



Go to Graphs/Chart Builder/OK



Under Choose From select Scatter/Dot, then double click the top leftmost icon to move it into the preview window



Drag Shyness onto the X axis box



Drag Speeches onto the Y axis box and

click OK



In the Output viewer, double click on the chart

to bring up the Chart Editor; go to Elements

and select “Fit Line at Total,” then select

(14)

ScatterPlot of Shyness and

Speeches

A negative

(15)

Computational Example of

r

for the

relationship between Shyness and

Speeches

$$r = \frac{N\sum XY - \sum X \sum Y}{\sqrt{\left[N\sum X^2 - \left(\sum X\right)^2\right]\left[N\sum Y^2 - \left(\sum Y\right)^2\right]}}$$

      Shyness (X)   Speeches (Y)    XY     X²     Y²
           0             8           0      0     64
           2            10          20      4    100
           3             4          12      9     16
           6             6          36     36     36
           9             1           9     81      1
          10             3          30    100      9
Sum       30            32         107    230    226

$$r = \frac{(6)(107) - (30)(32)}{\sqrt{\left[6(230) - 30^2\right]\left[6(226) - 32^2\right]}} = -.797$$

(note crossproducts term in the numerator is
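The hand calculation above can be checked with a few lines of Python. This is a minimal sketch of the computing formula, using the six Shyness and Speeches scores from the table; the function name is an illustrative choice.

```python
import math

shyness = [0, 2, 3, 6, 9, 10]    # X
speeches = [8, 10, 4, 6, 1, 3]   # Y

def pearson_r_computing(xs, ys):
    """Computing formula for r, built from raw sums rather than deviations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

r = pearson_r_computing(shyness, speeches)
print("r =", round(r, 3))
```

The raw sums the function builds (30, 32, 107, 230 and 226) are the column totals of the table, so each intermediate value can be compared with the hand calculation.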

(16)

SPSS vs. the Hand Calculation: It’s

a Lot Quicker



Now let’s try computing the coefficient with that same

data in SPSS



Go to Analyze/Correlate/Bivariate, and move Shyness

and Speeches into the Variables box. Click Pearson,

one-tailed, and OK. Did you get the same result as

the hand calculation?

Correlations

                                   Shyness    Speeches
Shyness    Pearson Correlation      1          -.797*
           Sig. (1-tailed)                      .029
           N                        6           6
Speeches   Pearson Correlation     -.797*       1
           Sig. (1-tailed)          .029
           N                        6           6

(17)

Using SPSS to Test a Hypothesis about the Strength of

Association between Two Interval or Ratio Level

Variables: Correlation Coefficient



Download the file called

World95.sav



We are going to test the strength of the association between

population density (the variable is “number of people per square

kilometer”) and “average female life expectancy,” based on data

from 109 cases (109 countries, with each country a case). Our

hypothesis is that the association will be negative; that is, as

population density increases, female life expectancy will decrease

- In SPSS Data Editor, go to Analyze/Correlate/Bivariate
- Move the two variables, “number of people per square kilometer” and “average female life expectancy,” into the Variables box
- Under correlation coefficients, select Pearson
- Under Tests of Significance, click one-tailed (we are making a directional prediction, so we will only accept as significant results in the “negative” 5% of the distribution)
- Click “flag significant results”
- Click Options, and under Statistics, select Means and standard deviations, then Continue, then OK

(18)

SPSS Output for Bivariate

Correlation

Correlations

                                                          Number of people /   Average female
                                                          sq. kilometer        life expectancy
Number of people / sq. kilometer   Pearson Correlation    1                    .128
                                   Sig. (1-tailed)        .                    .093
                                   N                      109                  109
Average female life expectancy     Pearson Correlation    .128                 1
                                   Sig. (1-tailed)        .093                 .
                                   N                      109                  109

Descriptive Statistics

                                   Mean       Std. Deviation   N
Number of people / sq. kilometer   203.415    675.7052         109
Average female life expectancy     70.16      10.572           109

(19)

Significance Test of Pearson’s

r

Significance of r is tested with a t-statistic with N − 2 degrees of freedom, where

$$t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}$$

SPSS provides the results of the t test of the significance of r for you. You can also consult Table F in Levin and Fox
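The t formula can be tried out on the population density result (r = .128, N = 109). The sketch below uses only Python's standard library. The one-tailed p value is approximated by integrating the t density with Simpson's rule, which is a convenience chosen here and not how SPSS computes it. All function names are illustrative.

```python
import math

def t_statistic(r, n):
    """t = r * sqrt(N - 2) / sqrt(1 - r^2), with N - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def t_pdf(t, df):
    """Density of Student's t distribution (log-gamma keeps large df stable)."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + t * t / df))

def one_tailed_p(t, df, steps=2000):
    """Area in one tail beyond |t|, via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

r, n = 0.128, 109
t = t_statistic(r, n)
p = one_tailed_p(t, n - 2)
print("t =", round(t, 3), " df =", n - 2, " one-tailed p is roughly", round(p, 3))
```

With the rounded r of .128 the result should land close to the .093 that SPSS reports. Small differences are expected, since SPSS works from the unrounded coefficient.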

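Correlations

                                                          Number of people /   Average female
                                                          sq. kilometer        life expectancy
Number of people / sq. kilometer   Pearson Correlation    1                    .128
                                   Sig. (1-tailed)        .                    .093
                                   N                      109                  109
Average female life expectancy     Pearson Correlation    .128                 1
                                   Sig. (1-tailed)        .093                 .
                                   N                      109                  109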

Write a sentence which states your findings. Report the correlation coefficient, r, R²

(20)

A Hypothesis to Test



Now, test the following hypothesis:

Countries in which there is a larger

proportion of people living in cities (urban)

will have a higher proportion of males who

read (lit_male) (not “people who read”)



Write up your result

(21)

Writing up the Results of Your Test

Descriptive Statistics

                              Mean     Std. Deviation   N
People living in cities (%)   56.53    24.203           108
Males who read (%)            78.73    20.445           85

Correlations

                                                     People living     Males who
                                                     in cities (%)     read (%)
People living in cities (%)   Pearson Correlation    1                 .587**
                              Sig. (1-tailed)        .                 .000
                              N                      108               85
Males who read (%)            Pearson Correlation    .587**            1
                              Sig. (1-tailed)        .000              .
                              N                      85                85

**. Correlation is significant at the 0.01 level (1-tailed).

The hypothesis that the proportion of a country’s people living in cities would be positively associated with its rate of male literacy was supported (r = .587, df = 83, p < .01, one-tailed).

(22)

The Regression Model



Regression takes us a step beyond correlation in that

not only are we concerned with the strength of the

association, but we want to be able to describe its

nature with sufficient precision to be able to make

predictions



To be able to make predictions, we need to be able to

characterize one of the variables in the relationship as

independent and the other as dependent

(23)

Regression model, cont’d



A regression equation is used to predict the value of a dependent variable, Y, in this case a country’s male literacy rate, on the basis of some constant a that applies to all cases, plus some amount b which is applied to each individual value of X (the country’s % of people living in cities), plus some error term e that is unique to the individual case and unpredictable:

Y = a + bX + e

(male literacy = a + b(percent urban) + e)

(24)

Calculating the Regression Line



What line best describes the

relationship depicted in the

scattergram?



The formula for the regression line is Y = a + bX + e, where Y is (in this case) a country’s score on male literacy and X is the country’s % of people living in cities, a is the y-intercept or constant (the point where the line crosses the Y axis, or the value of Y when X is zero, if X is a variable for which there is zero amount of the property), and b is the slope of the regression line (the amount by which male literacy changes for each unit change in percent living in cities)
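For readers curious about what sits behind a and b, the sketch below uses the standard least-squares results b = SP / SS_x and a = Y-bar − b(X-bar). These formulas are not derived in this lesson. The World95 data are not reproduced here, so the example falls back on the six Shyness and Speeches scores from the earlier worked example; the function name is illustrative.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for Y = a + bX."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of crossproducts and sum of squares for X
    sp = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    b = sp / ss_x            # slope: change in Y per unit change in X
    a = y_bar - b * x_bar    # intercept: predicted Y when X is zero
    return a, b

shyness = [0, 2, 3, 6, 9, 10]    # X
speeches = [8, 10, 4, 6, 1, 3]   # Y

a, b = fit_line(shyness, speeches)
print("intercept a =", round(a, 3))
print("slope b     =", round(b, 4))
```

The slope takes its sign from SP, so a negative correlation always goes with a downward-sloping regression line.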

(25)

Calculating the Regression Line,

cont’d



We will not do the hand computations for b, the slope of the regression line, or a, the intercept. Let’s use another SPSS method for finding not only the correlation coefficient, Pearson’s r, but also the regression equation (i.e., find the intercept, a, and the slope of the regression line, b)



In Data Editor, go to Analyze/Regression/Linear



Put the dependent variable, male literacy, into

the Dependent box



Move the independent variable, percentage of

people living in cities, into the Independent(s)

box, and click OK

(26)

Finding the Intercept (Constant) and Slope (β

or unstandardized regression coefficient)

Coefficients(a)

                                    Unstandardized Coefficients   Standardized Coefficients
Model                               B          Std. Error         Beta                        t         Sig.
1   (Constant)                      52.372     4.378                                          11.961    .000
    People living in cities (%)     .495       .075               .587                        6.608     .000

a. Dependent Variable: Males who read (%)

The regression equation for predicting Y (male literacy) is Y = a + (b)X, or Y = 52.372 +.495X, so if we wanted to predict the male literacy rate in country j we would multiply its percentage of people living in cities by .495, and add the

constant, 52.372. Compare this to the scatterplot. Does it look right?
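A quick way to compare the equation with the scatterplot is to tabulate a few predicted values. The sketch below hard-codes the intercept and slope from the Coefficients table; the function name and the chosen percentages are illustrative.

```python
INTERCEPT = 52.372   # a, the (Constant) row of the Coefficients table
SLOPE = 0.495        # b, the unstandardized coefficient for % living in cities

def predict_male_literacy(percent_urban):
    """Predicted male literacy rate (%) from Y = a + bX."""
    return INTERCEPT + SLOPE * percent_urban

for percent_urban in (0, 25, 50, 75, 100):
    predicted = predict_male_literacy(percent_urban)
    print(f"{percent_urban:3d}% urban -> predicted male literacy {predicted:.1f}%")
```

At 100% urban the line predicts slightly more than 100% literacy. This is a reminder that a straight line is only an approximation near the edges of the data.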

When scores on X and Y are available as Z scores, and so are expressed in the same standardized units, there is no intercept (constant) because you don’t have to make an adjustment for the differences in scale between X and Y. The equation for the regression line then just becomes Z_Y = (Beta) Z_X, or in this case Z_Y = .587 Z_X, where .587 is the standardized version of b (note that it’s also the value of r, but only when there is just the one X variable and not multiple independent variables)

The intercept, or a (sometimes called β₀)

The slope, or β

(27)

More Output from Linear

Regression

Model Summary

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .587(a)    .345       .337                16.649

a. Predictors: (Constant), People living in cities (%)

The correlation coefficient and the coefficient of

(28)

F

test of the regression equation: More

Output from Linear Regression, Cont’d

ANOVA(b)

Model             Sum of Squares   df    Mean Square   F         Sig.
1   Regression    12104.898        1     12104.898     43.668    .000(a)
    Residual      23007.879        83    277.203
    Total         35112.776        84

a. Predictors: (Constant), People living in cities (%)
b. Dependent Variable: Males who read (%)
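The figures in the ANOVA table and in the Model Summary are linked by simple arithmetic, which the sketch below reproduces from the sums of squares and degrees of freedom shown above. The formulas are the standard definitions for a regression with one predictor; the variable names are illustrative.

```python
import math

# Values copied from the ANOVA table
ss_regression, df_regression = 12104.898, 1
ss_residual, df_residual = 23007.879, 83
ss_total, df_total = 35112.776, 84

# Mean square = sum of squares / degrees of freedom
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

f_ratio = ms_regression / ms_residual                        # F in the ANOVA table
r_square = ss_regression / ss_total                          # R Square
adjusted_r_square = 1 - ms_residual / (ss_total / df_total)  # Adjusted R Square
std_error_estimate = math.sqrt(ms_residual)                  # Std. Error of the Estimate

print("F                  =", round(f_ratio, 3))
print("R Square           =", round(r_square, 3))
print("Adjusted R Square  =", round(adjusted_r_square, 3))
print("Std. Error of Est. =", round(std_error_estimate, 3))
```

R Square is the share of the total sum of squares accounted for by the regression line, which is the coefficient of determination for these data.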

If the independent variable, X, were of no value in predicting Y, the best estimate of Y would be the mean of Y. To see how much better our calculated regression line is as a predictor of Y than the simple mean of Y, we calculate the sum of squares for the regression line and then a residual sum of squares (variation left over after the regression line has done its work as a predictor), which shows how well or how badly the regression line fits the actual obtained scores on Y. If the residual mean square is large compared to the regression mean square, the value of F would be low and the resulting F ratio may not be significant. If the F ratio is statistically significant, it suggests that we can reject the hypothesis that the coefficient for our predictor, β, is zero in the population, and say that the

(29)

Partial Correlation



What is the relationship between a country’s

percentage of people living in cities (X2, the IV) and

male literacy rate (Y, the DV) when the effects of gross

domestic product (X1, another potential IV or control

variable) are removed?

- That is, what happens when you statistically remove that portion of the variance that both percentage of people living in cities (X2) and gross domestic product (X1) have in common with each other and with Y, male literacy rate, i.e., compute a partial correlation?

(30)

Using SPSS to Compute a Partial

Correlation



A partial correlation is the relationship between two variables

after removing the overlap with a third variable completely

from both variables. In the diagram below, this would be the

relationship between male literacy (Y) and percentage living in

cities (X2), after removing the influence of gross domestic

product (X1) on both literacy and percentage living in cities

In the calculation of the partial correlation

coefficient r_YX2.X1, the area of interest is section

a, and the effects removed are those in b, c, and d; partial correlation is the relationship of X2 and Y after the influence of X1 is completely removed from both variables. When only the effect of X1 on X2 is removed, this is called a

(31)

Computing the Partial Correlation in

SPSS



Go to Analyze/Correlate/Partial



Move % People living in cities and Males who

read into the Variables box



Put Gross Domestic Product into the Controlling

for box



Select one-tailed test and check display actual

significance level



Under Options, select zero-order correlations



Click Continue and then OK

(32)

Comparing Partial to Zero-Order Correlation: Effect of

Controlling for GDP on Relationship Between Percent

Living in Cities and Male Literacy

Note that the partial correlation of % people living in cities and male literacy is only .4644 when GDP is held constant, whereas the zero-order correlation you obtained previously was .5871. So GDP is clearly a control variable which influences the relationship between % of people living in cities and male literacy, although the relationship between % living in cities and literacy is still significant even with GDP removed

Correlations

Control variables: -none-(a)

                                                             People living    Males who    Gross domestic
                                                             in cities (%)    read (%)     product / capita
People living in cities (%)       Correlation                1.000            .587         .591
                                  Significance (1-tailed)    .                .000         .000
                                  df                         0                83           83
Males who read (%)                Correlation                .587             1.000        .417
                                  Significance (1-tailed)    .000             .            .000
                                  df                         83               0            83
Gross domestic product / capita   Correlation                .591             .417         1.000
                                  Significance (1-tailed)    .000             .000         .
                                  df                         83               83           0

Control variable: Gross domestic product / capita

                                                             People living    Males who
                                                             in cities (%)    read (%)
People living in cities (%)       Correlation                1.000            .464
                                  Significance (1-tailed)    .                .000
                                  df                         0                82
Males who read (%)                Correlation                .464             1.000
                                  Significance (1-tailed)    .000             .
                                  df                         82               0

a. Cells contain zero-order (Pearson) correlations.
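The partial correlation can be recovered from the three zero-order correlations in the table using the standard first-order partial correlation formula, which is not shown in this lesson. The sketch below is an illustration only: the function name is an assumption, and the inputs are the rounded values from the SPSS output.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of X and Y with Z removed from both:
    (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator

# Zero-order correlations from the SPSS output (rounded to three decimals)
r_cities_literacy = 0.587
r_cities_gdp = 0.591
r_literacy_gdp = 0.417

partial = partial_correlation(r_cities_literacy, r_cities_gdp, r_literacy_gdp)
print("partial r (GDP controlled) is about", round(partial, 3))
```

Because the inputs are rounded, the result agrees with the SPSS value of .4644 only to about three decimal places.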

Zero-order r

r when effect of GDP is removed
