
(1)

Techniques of Statistical Analysis I

Lect_8: Correlation

Bruno Arpino

(2)

Outline

Scatterplot

Covariance and correlation: what do they tell us, and what do they not tell us?

(3)

Goal

Imagine we have two quantitative variables, Y and X

E.g.: happiness and income; TFR and GDP

We want to test whether there is an association between these two variables

Graphical tool: scatterplot

Statistical tool: hypothesis test on the correlation coefficient

We want to assess how strong the association is

(4)

Scatterplot

We measure two quantitative variables on each unit

E.g.: On a sample of 5 recently married women we measure: religiosity index (from 0 to 10) and fertility intentions (how many children do you intend to have?)

Each unit can be represented as a point (x, y) in the scatterplot

(5)

Scatterplot: what info we get

There is a positive relationship (the higher the religiosity, the higher the fertility intention)

We can imagine drawing a line passing through the scatter of points. This line would represent the relationship quite well

The relationship is not perfect: two women have the same religiosity level (7) but different fertility intentions (individuals are heterogeneous)

(6)

Measuring the relationship

How can we quantify the statement “the higher the religiosity, the higher the fertility intention” with a number?

Idea: how many pairs are above the mean of both X and Y? How many pairs deviate from this pattern?

[Figure: scatterplot with reference lines at the means, µ_X = 6.2 and µ_Y = 2.4]

2 units show values of both X and Y above their means;

2 units have values of both X and Y below their means

These 4 pairs indicate a positive relationship: values above the mean of X are associated with values above the mean of Y, and values below the mean of X with values below the mean of Y

(7)

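Covariances

To build an index that quantifies the relationship between X and Y, let us consider the quantities (x - µ_X)(y - µ_Y)

[Figure: scatterplot split into four quadrants by the lines µ_X = 6.2 and µ_Y = 2.4; quadrants where (x - µ_X)(y - µ_Y) > 0 and where (x - µ_X)(y - µ_Y) < 0 are marked]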

(8)

Covariances (cont’d)

In quadrants I and III, units show positive covariances, (x - µ_X)(y - µ_Y) > 0: values above (below) the mean of X are associated with values of Y above (below) its mean. The two variables vary in the same direction with respect to the means.

In quadrants II and IV, units show negative covariances, (x - µ_X)(y - µ_Y) < 0: values below (above) the mean of X are associated with values of Y above (below) its mean. The two variables vary in opposite directions with respect to the means.
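The quadrant idea can be checked on the five-women sample. The following is a minimal sketch in Python; the data values are the (religiosity, fertility) pairs from the worked covariance example in these slides, and the function and variable names are illustrative choices, not part of the course material.

```python
# Five-women sample: religiosity (X) and fertility intentions (Y).
x = [10, 2, 7, 7, 5]
y = [5, 0, 3, 2, 2]

mean_x = sum(x) / len(x)  # 6.2
mean_y = sum(y) / len(y)  # 2.4


def deviation_product(xi, yi):
    """Product of one unit's deviations from the two means."""
    return (xi - mean_x) * (yi - mean_y)


products = [deviation_product(xi, yi) for xi, yi in zip(x, y)]

# Positive products correspond to quadrants I and III,
# negative products to quadrants II and IV.
n_positive = sum(p > 0 for p in products)
n_negative = sum(p < 0 for p in products)

for xi, yi, p in zip(x, y, products):
    sign = "positive" if p > 0 else "negative" if p < 0 else "zero"
    print(f"x={xi:2d}, y={yi}: product = {p:6.2f} ({sign})")
print(n_positive, "positive,", n_negative, "negative")
```

Four of the five units have a positive product, and only the unit (7, 2) falls in a quadrant with a negative product.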

(9)

Population Covariance

The covariance is an index of the LINEAR relationship between two quantitative variables.

It assumes that the relationship that links the two variables is linear

COV is calculated as the average of all covariances for the N pairs of observations:

$$\mathrm{Cov}(X,Y) = \sigma_{X,Y} = \frac{\sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y)}{N}$$

(10)

Population Covariance: an example of calculation

religiosity (X)   fertility (Y)   x-mean(X)   y-mean(Y)   (x-mean(X))*(y-mean(Y))
10                5                3.8         2.6          9.88
2                 0               -4.2        -2.4         10.08
7                 3                0.8         0.6          0.48
7                 2                0.8        -0.4         -0.32
5                 2               -1.2        -0.4          0.48
Total                                                      20.6

µ_X = 6.2, µ_Y = 2.4

$$\mathrm{Cov}(X,Y) = \sigma_{X,Y} = \frac{\sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y)}{N} = \frac{20.6}{5} = 4.12$$

Positive linear relationship
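The calculation in this example can be reproduced in a few lines of Python. This is only a sketch: the function name `population_covariance` is an illustrative choice, and the five units are treated as the whole population, as the slide does.

```python
# Religiosity (X) and fertility intentions (Y) for the five women.
x = [10, 2, 7, 7, 5]
y = [5, 0, 3, 2, 2]


def population_covariance(xs, ys):
    """Average of the deviation products over all N pairs (divides by N)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    return total / n


cov_pop = population_covariance(x, y)
print(round(cov_pop, 2))  # 4.12
```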

(11)

Sample Covariance

If we have a sample and want to make inference on the population covariance, an unbiased estimator is the sample covariance:

$$\mathrm{Cov}(X,Y) = s_{X,Y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$

This formula applied to a specific sample gives a point estimate of the population covariance
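As a sketch of the difference between the two formulas, the five women can be treated as a sample rather than as a population, so that the sum of deviation products is divided by n - 1 instead of N. The function name below is an illustrative choice.

```python
# Religiosity (X) and fertility intentions (Y), treated as a sample of n = 5.
x = [10, 2, 7, 7, 5]
y = [5, 0, 3, 2, 2]


def sample_covariance(xs, ys):
    """Sum of deviation products from the sample means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    return total / (n - 1)


cov_sample = sample_covariance(x, y)
print(round(cov_sample, 2))  # 20.6 / 4 = 5.15
```

With the same data, dividing by n - 1 = 4 gives 5.15, compared with 4.12 when dividing by N = 5; the sign, and therefore the direction of the relationship, is unchanged.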

(12)

(Population or sample) Covariance: interpretation

COV is calculated as the average of all covariances and indicates whether positive or negative covariances prevail:

Cov(X,Y) > 0: X and Y tend to move in the same direction (positive linear relationship)

Cov(X,Y) < 0: X and Y tend to move in opposite directions (negative linear relationship)

(13)

(Population or sample) Covariance: interpretation (cont’d)

[Figure: example scatterplots with Cov(X,Y) > 0, Cov(X,Y) < 0 and Cov(X,Y) = 0]

(14)

Population Correlation coefficient

COV has no upper or lower bound, so its value cannot be interpreted apart from its sign

The correlation coefficient instead is bounded between -1 and +1.

$$\mathrm{Corr}(X,Y) = \rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y}$$
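To see why the covariance cannot be read apart from its sign while the correlation can, here is a small Python sketch using the five-women data. The rescaling of X by 100 is a hypothetical change of units introduced only for illustration, and the function names are not from the slides.

```python
import math

x = [10, 2, 7, 7, 5]
y = [5, 0, 3, 2, 2]


def pop_cov(xs, ys):
    """Population covariance (divides by N)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n


def pop_corr(xs, ys):
    """Population correlation: Cov(X, Y) / (sigma_X * sigma_Y)."""
    # sigma_X and sigma_Y are the square roots of Cov(X, X) and Cov(Y, Y).
    return pop_cov(xs, ys) / math.sqrt(pop_cov(xs, xs) * pop_cov(ys, ys))


x_rescaled = [100 * a for a in x]  # hypothetical change of units for X

print(round(pop_cov(x, y), 2), round(pop_cov(x_rescaled, y), 2))
print(round(pop_corr(x, y), 3), round(pop_corr(x_rescaled, y), 3))
```

The covariance is multiplied by 100 when the units of X change, while the correlation stays at about 0.96 in both cases.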

(15)

Sample Correlation coefficient

If we have a sample and want to make inference on the population correlation coefficient, the natural estimator is the sample correlation coefficient:

$$\mathrm{corr}(X,Y) = r_{X,Y} = \frac{s_{X,Y}}{s_X s_Y}$$

This formula applied to a specific sample gives a point estimate of the population correlation coefficient
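A minimal Python sketch of the sample formula, again on the five-women data treated as a sample; the names `sample_cov`, `s_xy`, `s_x` and `s_y` are illustrative.

```python
import math

x = [10, 2, 7, 7, 5]
y = [5, 0, 3, 2, 2]


def sample_cov(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (n - 1)


s_xy = sample_cov(x, y)          # sample covariance
s_x = math.sqrt(sample_cov(x, x))  # sample standard deviation of X
s_y = math.sqrt(sample_cov(y, y))  # sample standard deviation of Y

r = s_xy / (s_x * s_y)
print(round(s_xy, 2), round(s_x, 2), round(s_y, 2), round(r, 2))
```

Because the factor n - 1 appears in both the numerator and the denominator, r takes the same value as the population formula would give on the same five pairs (about 0.96).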

(16)

(Population or sample) Correlation coefficient: interpretation

From the formulas we can easily see that:

Cov(X,Y) > 0 ⇒ Corr(X,Y) > 0 (positive linear relationship)

Cov(X,Y) < 0 ⇒ Corr(X,Y) < 0 (negative linear relationship)

Cov(X,Y) = 0 ⇒ Corr(X,Y) = 0 (no linear relationship)

(17)

(Population or sample) Correlation: interpretation (cont’d)

[Figure: example scatterplots for the following correlation values]

Corr = +1 (perfect + linear relationship)

Corr = +0.9 (strong + linear rel.)

Corr = +0.4 (moderate + linear rel.)

Corr = -1 (perfect - linear relationship)

Corr = -0.9 (strong - linear rel.)

Corr = -0.4 (moderate - linear rel.)

Corr = 0 (no linear relationship)

Corr indicates whether there is a linear relationship, its sign and its strength

If corr is 0 (or close to 0) a relationship might still exist, but of a form different from linear

(18)

Non-linear relationship: an example

Consider again the sample of 5 married women. Is there a relationship between religiosity and age at marriage?

[Figure: scatterplot of age at marriage (Y) against religiosity (X), with µ_X = 6.2, µ_Y = 24.4 and corr = -0.07]

The correlation is very low: very weak linear relationship!

The scatterplot shows the existence of a non-linear relationship
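The point that a correlation near zero does not rule out a relationship can be illustrated with a small Python sketch. The data below are hypothetical (Y is exactly the square of X) and are not the age-at-marriage values from the slide, which are not reported in the text.

```python
import math

# Hypothetical data: Y is fully determined by X, but not linearly (y = x^2).
x = [-2, -1, 0, 1, 2]
y = [a ** 2 for a in x]


def correlation(xs, ys):
    """Correlation coefficient computed from the deviation products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


r = correlation(x, y)
print(r)
```

Here the positive and negative deviation products cancel out, so the correlation is zero even though Y depends perfectly on X. This is why the scatterplot should always be inspected alongside the coefficient.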

(19)

Inference on the correlation coefficient

If we have a sample, the sample correlation coefficient gives a point estimate of the population correlation coefficient.

A test of hypothesis can be implemented using the following test statistic:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

which, under the null hypothesis ρ = 0, follows a t distribution with n-2 degrees of freedom and so can be compared with the appropriate critical value from the t distribution. However, we will only consider the p-value approach for the test on the correlation coefficient.

Lower-tail test: H₀: ρ ≥ 0, H₁: ρ < 0

Upper-tail test: H₀: ρ ≤ 0, H₁: ρ > 0

Two-tail test: H₀: ρ = 0, H₁: ρ ≠ 0
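As a sketch of how the test statistic is obtained from r and n, the following Python function evaluates t = r·sqrt(n-2)/sqrt(1-r²). The values r = 0.5 and n = 27 are hypothetical and chosen only to make the arithmetic easy to follow; the p-value would then be read from a t distribution with n - 2 degrees of freedom using statistical software.

```python
import math


def t_statistic(r, n):
    """Test statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    if not -1 < r < 1:
        raise ValueError("r must be strictly between -1 and 1")
    if n <= 2:
        raise ValueError("n must be larger than 2")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Hypothetical example: r = 0.5 observed on n = 27 units.
t = t_statistic(0.5, 27)
degrees_of_freedom = 27 - 2
print(round(t, 3), degrees_of_freedom)
```

For a given r, a larger sample gives a larger |t|, so the same sample correlation provides stronger evidence against H₀ when n is large.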

(20)

Exercise 1

Data on interest in politics (on a scale from 0, not interested at all, to 10, very interested) and income (in thousands of Euros) have been collected on a sample of 50 individuals. The following quantities have been calculated on the sample:

$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = -105.35$$

Calculate and interpret the covariance

Calculate and interpret the correlation

Test the hypothesis that the correlation is negative using the fact that the p-value was found to be equal to 0.027

(21)

Exercise 1 (cont’d)

The sample covariance can be calculated with the formula:

$$\mathrm{Cov}(X,Y) = s_{X,Y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1} = \frac{-105.35}{49} = -2.15$$

Cov is negative, meaning that in the sample there is a negative linear relationship between the two variables. To assess the strength of this relationship we need to calculate the correlation coefficient, using s_X = 9.15 and s_Y = 0.56:

$$\mathrm{Cor}(X,Y) = r_{X,Y} = \frac{s_{X,Y}}{s_X s_Y} = \frac{-2.15}{9.15 \times 0.56} = -0.42$$

Cor is negative but not very close to -1. We can conclude that in the sample there is a moderate negative linear relationship between the two variables.
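The arithmetic of this exercise can be double-checked with a short Python sketch. The input values are the ones given in the exercise; the variable names are illustrative.

```python
# Quantities given in Exercise 1.
sum_of_products = -105.35  # sum of (x_i - x_bar) * (y_i - y_bar)
n = 50                     # sample size
s_x = 9.15                 # sample standard deviation of X
s_y = 0.56                 # sample standard deviation of Y

# Sample covariance: divide the sum of deviation products by n - 1.
cov_xy = sum_of_products / (n - 1)

# Sample correlation: covariance divided by the product of the standard deviations.
r_xy = cov_xy / (s_x * s_y)

print(round(cov_xy, 2), round(r_xy, 2))  # -2.15 -0.42
```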

(22)

Exercise 1 (cont’d)

The test that we want to implement is:

H₀: ρ ≥ 0   H₁: ρ < 0

The p-value is quite low (0.027), so there is evidence against H₀. However, note that we can reject H₀ at the 5% level but not at the 1% level.
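The p-value approach amounts to comparing the p-value with the chosen significance level. A minimal Python sketch follows; the function name `decision` and the set of levels are illustrative choices, and the p-value 0.027 is the one given in the exercise.

```python
def decision(p_value, alpha):
    """Reject H0 when the p-value is smaller than the significance level."""
    return "reject H0" if p_value < alpha else "do not reject H0"


p_value = 0.027  # given in Exercise 1

results = {alpha: decision(p_value, alpha) for alpha in (0.10, 0.05, 0.01)}
for alpha, outcome in results.items():
    print(f"alpha = {alpha:.2f}: {outcome}")
```

With p = 0.027 the null hypothesis is rejected at the 10% and 5% levels but not at the 1% level.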

(23)

Exercise 2

(24)

Exercise 2

Which one among the following is the most plausible value for the correlation coefficient calculated in this sample:

r = +0.87; r=+0.12; r=-0.36; r=-0.79

Explain why and interpret the chosen value.

Knowing that the p-value for the two-tail test on the correlation coefficient is 0.004, test the hypothesis that the correlation is zero.

(25)

Exercise 2

The scatterplot shows a negative but not strong linear relationship. The most plausible value for r is r = -0.36, which can be interpreted as evidence of a moderate negative linear relationship between GNP and average population growth in the sample of 63 countries.

The p-value is very low (0.004) and lower than all the commonly used levels of significance. The null hypothesis that the correlation coefficient is zero can be rejected even at the 1% level against the alternative that the correlation is different from zero.

(26)

Exercise 3

The following scatterplot displays data on carbon dioxide emissions (CO₂, in metric tons per capita) and gross domestic product (GDP, in thousands of dollars per capita)

Which one among the following is the most plausible value for the correlation coefficient?

(27)

Remarks

Cov and Corr are symmetric measures (order is not relevant): Cov(X,Y) = Cov(Y,X); corr(X,Y) = corr(Y,X)

They do not assume any direction in the link between the two variables
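The symmetry can be verified numerically. The sketch below uses the five-women data from the earlier slides and illustrative function names; swapping the roles of X and Y leaves both measures unchanged.

```python
import math

x = [10, 2, 7, 7, 5]  # religiosity
y = [5, 0, 3, 2, 2]   # fertility intentions


def cov(xs, ys):
    """Population covariance (divides by N)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n


def corr(xs, ys):
    """Correlation coefficient: covariance over the product of standard deviations."""
    return cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))


same_cov = math.isclose(cov(x, y), cov(y, x))
same_corr = math.isclose(corr(x, y), corr(y, x))
print(same_cov, same_corr)
```

Since neither measure distinguishes which variable comes first, a high correlation by itself says nothing about which variable, if any, influences the other.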

(28)

If something is not clear (or you find mistakes in the slides), do not hesitate to come to office hours or e-mail me