Techniques of Statistical
Analysis I
Lect_8: Correlation
Bruno Arpino
Outline
Scatterplot
Covariance and correlation: what they tell us and what they do not
Goal
Imagine we have two quantitative variables, Y and X
E.g.: happiness and income; TFR and GDP
We want to test if there is an association between these two variables
Graphical tool: scatterplot
Statistical tool: test of hypothesis on the correlation coefficient
We want to assess how strong the association is
Scatterplot
Each unit can be represented as a point in the scatterplot
Scatterplot: what info we get
Some units have the same religiosity level (7) but different fertility intentions (individuals are heterogeneous)
Measuring the relationship
µX = 6.2, µY = 2.4
2 units show values of both X and Y above their means;
2 units have values of both X and Y below their means
These 4 pairs indicate a positive relationship: values above the mean of X are associated with values above the mean of Y
(x - µX)(y - µY)
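The counting above can be checked with a short sketch. It assumes the five (religiosity, fertility) pairs tabulated in the covariance example of these slides; the helper name `cross_product` is hypothetical and only used for illustration.

```python
# Five units from the slides: religiosity (X) and fertility intentions (Y).
religiosity = [10, 2, 7, 7, 5]
fertility = [5, 0, 3, 2, 2]

mean_x = sum(religiosity) / len(religiosity)  # 6.2
mean_y = sum(fertility) / len(fertility)      # 2.4


def cross_product(x, y):
    """Return (x - mean_x) * (y - mean_y) for one unit."""
    return (x - mean_x) * (y - mean_y)


products = [cross_product(x, y) for x, y in zip(religiosity, fertility)]

# A positive product means both deviations have the same sign;
# a negative product means the deviations have opposite signs.
n_same_direction = sum(1 for p in products if p > 0)
n_opposite_direction = sum(1 for p in products if p < 0)

print(n_same_direction, n_opposite_direction)
```

Four units have a positive product and one has a negative product, which is why the overall picture suggests a positive relationship.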
Covariances
[Scatterplot with regions labelled (x - µX)(y - µY) > 0 and (x - µX)(y - µY) < 0; µX = 6.2, µY = 2.4]
In the quadrants I and III, units show positive covariances: (x - µX)(y - µY) > 0: values above (below) the mean of X are associated with values of Y above (below) its mean. The two variables vary in the same direction with respect to the means.
Covariances (cont’d)
In the quadrants II and IV, units show negative covariances: (x - µX)(y - µY) < 0: values below (above) the mean of X are associated with values of Y above (below) its mean. The two variables vary in opposite directions with respect to the means.
Population Covariance
$$ Cov(X,Y) = \sigma_{X,Y} = \frac{\sum_{i=1}^{N}(x_i-\mu_X)(y_i-\mu_Y)}{N} $$
It assumes that the relationship that links the two variables is linear
Population Covariance: an example of calculation
religiosity (X)   fertility (Y)   x-mean(X)   y-mean(Y)   (x-mean(X))*(y-mean(Y))
10                5               3.8         2.6         9.88
2                 0               -4.2        -2.4        10.08
7                 3               0.8         0.6         0.48
7                 2               0.8         -0.4        -0.32
5                 2               -1.2        -0.4        0.48
Total                                                     20.6
µX = 6.2, µY = 2.4
$$ Cov(X,Y) = \sigma_{X,Y} = \frac{\sum_{i=1}^{N}(x_i-\mu_X)(y_i-\mu_Y)}{N} = \frac{20.6}{5} = 4.12 $$
Positive linear relationship
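The calculation in the table can be reproduced with a minimal sketch. The function name `population_covariance` is an illustrative choice, not something defined in the slides; the data are the five units from the table.

```python
religiosity = [10, 2, 7, 7, 5]
fertility = [5, 0, 3, 2, 2]


def population_covariance(xs, ys):
    """Average of (x - mean_x) * (y - mean_y) over all N units."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / n  # divide by N: the five units are treated as the population


cov_xy = population_covariance(religiosity, fertility)
print(round(cov_xy, 2))  # 4.12
```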
Sample Covariance
$$ Cov(X,Y) = s_{X,Y} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{n-1} $$
This formula applied on a specific sample gives a point estimate of the population covariance
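To see how the n - 1 divisor changes the result, here is a small sketch. It reuses the five units from the population example and treats them as a sample, which is an assumption made purely for illustration; `sample_covariance` is a hypothetical helper name.

```python
religiosity = [10, 2, 7, 7, 5]
fertility = [5, 0, 3, 2, 2]


def sample_covariance(xs, ys):
    """Sum of cross products of deviations from the sample means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / (n - 1)


s_xy = sample_covariance(religiosity, fertility)
print(round(s_xy, 2))  # 5.15, i.e. 20.6 / 4 instead of 20.6 / 5
```

With only five units the two divisors give visibly different values (5.15 versus 4.12); the gap shrinks as n grows.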
(Population or sample) Covariance: interpretation
Cov(X,Y) > 0: X and Y tend to move in the same direction (Positive linear relationship)
Cov(X,Y) < 0: X and Y tend to move in opposite directions (Negative linear relationship)
(Population or sample) Covariance: interpretation (cont’d)
[Example scatterplots, with panels labelled Cov(X,Y) > 0, Cov(X,Y) < 0 and Cov(X,Y) = 0]
Population Correlation coefficient
$$ \rho_{X,Y} = Corr(X,Y) = \frac{Cov(X,Y)}{\sigma_X \sigma_Y} = \frac{\sigma_{X,Y}}{\sigma_X \sigma_Y} $$
Sample Correlation coefficient
$$ r_{X,Y} = corr(X,Y) = \frac{s_{X,Y}}{s_X s_Y} $$
This formula applied on a specific sample gives a point estimate of the population correlation coefficient
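A minimal sketch of the sample correlation formula follows. It again uses the five (religiosity, fertility) units from the covariance example, treated here as a sample for illustration only; `sample_correlation` is a hypothetical helper name.

```python
import math

religiosity = [10, 2, 7, 7, 5]
fertility = [5, 0, 3, 2, 2]


def sample_correlation(xs, ys):
    """r = s_xy / (s_x * s_y), with all three quantities using the n - 1 divisor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)


r = sample_correlation(religiosity, fertility)
print(round(r, 2))
```

Because the same divisor appears in the numerator and in the denominator, the result is the same whether n or n - 1 is used throughout, and it always lies between -1 and +1.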
(Population or sample) Correlation coefficient: interpretation
Cov(X,Y) > 0 ⇒ Corr(X,Y) > 0 (positive linear relationship)
Cov(X,Y) < 0 ⇒ Corr(X,Y) < 0 (negative linear relationship)
Cov(X,Y) = 0 ⇒ Corr(X,Y) = 0 (no linear relationship)
(Population or sample) Correlation: interpretation (cont’d)
Corr = +1 (perfect + linear relationship)
Corr = +0.9 (strong + linear rel.)
Corr = +0.4 (moderate + linear rel.)
Corr = -1 (perfect - linear relationship)
Corr = -0.9 (strong - linear rel.)
Corr = -0.4 (moderate - linear rel.)
Corr = 0 (no linear relationship)
Corr indicates if there is a linear relationship, its sign and strength
If corr is 0 (or close to 0) a relationship might still exist, but of a different (non-linear) form
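The last point can be illustrated with a small sketch. The data below are hypothetical (they are not from the slides): Y is an exact function of X, yet the correlation is zero because the relationship is quadratic rather than linear.

```python
import math

# Hypothetical data: y = x^2, a perfect but non-linear relationship.
x_values = [-2, -1, 0, 1, 2]
y_values = [x ** 2 for x in x_values]


def sample_correlation(xs, ys):
    """Sample correlation r = s_xy / (s_x * s_y)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)


r_nonlinear = sample_correlation(x_values, y_values)
print(r_nonlinear)
```

The positive cross products on the right of the mean of X cancel the negative ones on the left, so the covariance (and hence the correlation) is zero even though Y is fully determined by X.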
Non linear relationship: an example
Consider again the sample of 5 married women. Is there a relationship between religiosity and age at marriage?
µX = 6.2, µY = 24.4, corr = -0.07
The correlation is very low: very weak linear relationship!
The scatterplot shows the existence of a non-linear relationship
Inference on the correlation coefficient
If we have a sample, the sample correlation coefficient gives a point estimate of the population correlation coefficient. A test of hypothesis can be implemented using the following test statistic:
$$ t = r\sqrt{\frac{n-2}{1-r^2}} $$
which follows a t distribution with n-2 degrees of freedom and so can be compared with the appropriate critical value from the t distribution. However, we will only consider the p-value approach for the test on the correlation coefficient.
Lower-tail test: H0: ρ ≥ 0, H1: ρ < 0
Upper-tail test: H0: ρ ≤ 0, H1: ρ > 0
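As a rough sketch of how the test statistic and its p-values could be obtained, the code below uses only the Python standard library. The values r = -0.30 and n = 30 are hypothetical, the function names are illustrative, and the t distribution is integrated numerically with Simpson's rule, which is an approximation chosen here for self-containment (statistical software would normally provide the p-value directly).

```python
import math


def correlation_t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)); undefined when r is exactly +1 or -1."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(t, df, steps=2000):
    """P(T <= t), via Simpson's rule on [0, |t|] and symmetry around 0."""
    a = abs(t)
    if a == 0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3
    return 0.5 + area if t > 0 else 0.5 - area


# Hypothetical sample values, for illustration only.
r, n = -0.30, 30
t_stat = correlation_t_statistic(r, n)
p_lower = t_cdf(t_stat, n - 2)       # lower-tail test (H1: rho < 0)
p_upper = 1 - p_lower                # upper-tail test (H1: rho > 0)
p_two = 2 * min(p_lower, p_upper)    # two-tail test (H1: rho != 0)

print(t_stat, p_lower, p_upper, p_two)
```

The gamma function overflows for very large degrees of freedom, so this sketch is only meant for small to moderate samples.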
Exercise 1
Data on interest in politics (on a scale from 0 – not interested at all to 10 – very interested) and income (in thousands of Euros) have been collected on a sample of 50 individuals. The following quantities have been calculated on the sample:
$$ \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = -105.35 $$
Calculate and interpret the covariance
Calculate and interpret the correlation
Test the hypothesis that the correlation is negative using the fact that the p-value was found to be equal to 0.027
Exercise 1 (cont’d)
The sample covariance can be calculated with the formula:
$$ Cov(X,Y) = s_{X,Y} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{n-1} = \frac{-105.35}{49} = -2.15 $$
Cov is negative, meaning that in the sample there is a negative linear relationship between the two variables. To assess the strength of this relationship we need to calculate the correlation coefficient, using s_X = 9.15 and s_Y = 0.56:
$$ Cor(X,Y) = \frac{s_{X,Y}}{s_X s_Y} = \frac{-2.15}{9.15 \times 0.56} = -0.42 $$
Cor is negative but not very close to -1. We can conclude that in the sample there is a moderate negative linear relationship between the two variables.
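The arithmetic of this exercise can be checked with a few lines. The numbers are the quantities given in the exercise; the variable names are illustrative.

```python
n = 50                        # sample size
sum_cross_products = -105.35  # sum of (x_i - x_bar) * (y_i - y_bar)
s_x = 9.15                    # sample standard deviation of X
s_y = 0.56                    # sample standard deviation of Y

s_xy = sum_cross_products / (n - 1)  # sample covariance
r = s_xy / (s_x * s_y)               # sample correlation

print(round(s_xy, 2), round(r, 2))  # -2.15 -0.42
```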
Exercise 1 (cont’d)
The test that we want to implement is:
H0: ρ ≥ 0
H1: ρ < 0
The p-value is quite low (0.027), so there is evidence against H0. However, note that we can reject H0 at the 5% level but not at the 1% level.
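The p-value approach amounts to comparing the p-value with the chosen significance level. A minimal sketch, assuming the usual rule "reject H0 when the p-value is below the level" (the helper name `reject_null` is hypothetical):

```python
def reject_null(p_value, alpha):
    """Reject H0 when the p-value is smaller than the significance level."""
    return p_value < alpha


p_value = 0.027  # p-value reported in Exercise 1
decisions = {alpha: reject_null(p_value, alpha) for alpha in (0.10, 0.05, 0.01)}

print(decisions)  # {0.1: True, 0.05: True, 0.01: False}
```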
Exercise 2
Which one among the following is the most plausible value for the correlation coefficient calculated in this sample: r = +0.87; r = +0.12; r = -0.36; r = -0.79
Explain why and interpret the chosen value.
Knowing that the p-value for the two-tail test on the correlation coefficient is 0.004, what can we conclude?
Exercise 2 (cont’d)
The scatterplot shows a negative but not strong linear relationship. The most plausible value for r is r = -0.36, which can be interpreted as evidence of a moderate negative linear relationship between GNP and average population growth in the sample of 63 countries.
The p-value is very low (0.004) and lower than all the commonly used levels of significance. The null hypothesis that the correlation coefficient is zero can be rejected even at the 1% level against the alternative that the correlation is different from zero.
Exercise 3
The following scatterplot displays data on carbon dioxide emissions (CO2, in metric tons per capita) and gross domestic product (GDP, in thousands of dollars per capita)
Which one among the following is the most plausible value for the correlation coefficient?
Remarks
Cov and Corr are symmetric measures (order is not relevant): Cov(X,Y) = Cov(Y,X); corr(X,Y) = corr(Y,X)
They do not assume any direction in the link between the two variables
If something is not clear (or you find mistakes in the slides) do not hesitate to come to office hours or e-mail me