COVARIANCE, VARIANCE, AND CORRELATION
1.8 The Correlation Coefficient
In this chapter a lot of attention has been given to covariance. This is because it is very convenient mathematically, not because it is a particularly good measure of association. We shall discuss its deficiencies in this respect in Section 1.9. A much more satisfactory measure is its near-relative, the correlation coefficient.
Like variance and covariance, the correlation coefficient comes in two forms, population and sample. The population correlation coefficient is traditionally denoted ρ, the Greek letter that is the equivalent of “r”, and pronounced “row”, as in row a boat. For variables X and Y it is defined by
$$\rho_{XY} = \frac{\sigma_{XY}}{\sqrt{\sigma_X^2 \sigma_Y^2}} \tag{1.24}$$
If X and Y are independent, ρXY will be equal to 0 because the population covariance will be 0. If there is a positive association between them, σXY, and hence ρXY, will be positive. If there is an exact positive linear relationship, ρXY will assume its maximum value of 1. Similarly, if there is a negative relationship, ρXY will be negative, with minimum value of –1.
The sample correlation coefficient, rXY, is defined by replacing the population covariance and variances in (1.24) by their unbiased estimators. We have seen that these may be obtained by multiplying the sample variances and covariances by n/(n–1). Hence
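$$r_{XY} = \frac{\frac{n}{n-1}\,\mathrm{Cov}(X, Y)}{\sqrt{\frac{n}{n-1}\,\mathrm{Var}(X)\,\frac{n}{n-1}\,\mathrm{Var}(Y)}} \tag{1.25}$$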
The factors n/(n–1) cancel, so we can conveniently define the sample correlation by
$$r_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} \tag{1.26}$$
Like ρ, r has maximum value 1, which is attained when there is a perfect positive association between the sample values of X and Y (when you plot the scatter diagram, the points lie exactly on an upward-sloping straight line). Similarly, it has minimum value –1, attained when there is a perfect negative association (the points lying exactly on a downward-sloping straight line). A value of 0 indicates that there is no linear association between the observations on X and Y in the sample. Of course, the fact that r = 0 does not necessarily imply that ρ = 0, or vice versa.
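The cancellation of the factors n/(n–1) and the limiting values of r can be checked numerically. The following is a minimal sketch in Python, not part of the original text: the function names (`cov`, `var`, `corr`), the `unbiased` flag, and the small data set are all hypothetical and chosen only for illustration.

```python
import math

def mean(values):
    return sum(values) / len(values)

def cov(x, y, unbiased=False):
    """Sample covariance: divides by n, or by n - 1 when unbiased=True."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    total = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return total / (n - 1 if unbiased else n)

def var(x, unbiased=False):
    """Sample variance, computed as the covariance of x with itself."""
    return cov(x, x, unbiased)

def corr(x, y, unbiased=False):
    """Sample correlation: covariance over the root of the product of variances."""
    return cov(x, y, unbiased) / math.sqrt(var(x, unbiased) * var(y, unbiased))

# Hypothetical illustrative data, not taken from the text.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y_noisy = [2.0, 1.0, 5.0, 6.0, 12.0]
y_up = [3.0 + 2.0 * xi for xi in x]    # exact upward-sloping line
y_down = [3.0 - 2.0 * xi for xi in x]  # exact downward-sloping line

r_plain = corr(x, y_noisy)                    # divisors of n
r_unbiased = corr(x, y_noisy, unbiased=True)  # divisors of n - 1
r_up = corr(x, y_up)
r_down = corr(x, y_down)

print(r_plain, r_unbiased)  # the two versions agree up to floating-point error
print(r_up, r_down)         # the extreme values of r
```

The two versions of r agree because the same factor multiplies the numerator and, after the square root is taken, the denominator. The two exact lines give the extreme values of r, subject to floating-point rounding.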
Illustration
We will use the education and earnings example in Section 1.1 to illustrate the calculation of the sample correlation coefficient. The data are given in Table 1.1 and they are plotted in Figure 1.1. We have already calculated Cov(S, Y) in Table 1.2, equal to 15.294, so we now need only Var(S) and Var(Y), calculated in Table 1.3.
TABLE 1.3
Observation     S        Y     S – S̄     Y – Ȳ    (S – S̄)²     (Y – Ȳ)²   (S – S̄)(Y – Ȳ)
1              15    17.24      1.75     3.016       3.063        9.093            5.277
2              16    15.00      2.75     0.775       7.563        0.601            2.133
3               8    14.91     –5.25     0.685      27.563        0.470           –3.599
4               6     4.50     –7.25    –9.725      52.563       94.566           70.503
5              15    18.00      1.75     3.776       3.063       14.254            6.607
6              12     6.29     –1.25    –7.935       1.563       62.956            9.918
7              12    19.23     –1.25     5.006       1.563       25.055           –6.257
8              18    18.69      4.75     4.466      22.563       19.941           21.211
9              12     7.21     –1.25    –7.015       1.563       49.203            8.768
10             20    42.06      6.75    27.836      45.563      774.815          187.890
11             17    15.38      3.75     1.156      14.063        1.335            4.333
12             12    12.70     –1.25    –1.525       1.563        2.324            1.906
13             12    26.00     –1.25    11.776       1.563      138.662          –14.719
14              9     7.50     –4.25    –6.725      18.063       45.219           28.579
15             15     5.00      1.75    –9.225       3.063       85.091          –16.143
16             12    21.63     –1.25     7.406       1.563       54.841           –9.257
17             16    12.10      2.75    –2.125       7.563        4.514           –5.842
18             12     5.55     –1.25    –8.675       1.563       75.247           10.843
19             12     7.50     –1.25    –6.725       1.563       45.219            8.406
20             14     8.00      0.75    –6.225       0.563       38.744           –4.668
Total         265   284.49                         217.750    1,542.150          305.888
Average    13.250   14.225                          10.888       77.108           15.294
From the last two columns of Table 1.3, you can see that Var(S) is 10.888 and Var(Y) is 77.108.
Hence
$$r_{SY} = \frac{15.294}{\sqrt{10.888 \times 77.108}} = \frac{15.294}{28.975} = 0.53 \tag{1.27}$$
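The arithmetic behind Table 1.3 and the sample correlation coefficient can be reproduced with a short script. The sketch below is an illustration in Python, not part of the original text. The variable names are assumptions. The S and Y values are copied from the table, and the variances and covariance are computed with divisor n, as in the "Average" row.

```python
import math

# Years of schooling (S) and earnings (Y), copied from Table 1.3.
S = [15, 16, 8, 6, 15, 12, 12, 18, 12, 20,
     17, 12, 12, 9, 15, 12, 16, 12, 12, 14]
Y = [17.24, 15.00, 14.91, 4.50, 18.00, 6.29, 19.23, 18.69, 7.21, 42.06,
     15.38, 12.70, 26.00, 7.50, 5.00, 21.63, 12.10, 5.55, 7.50, 8.00]

n = len(S)
s_bar = sum(S) / n
y_bar = sum(Y) / n

# Mean squared deviations and mean cross-product (the "Average" row).
var_s = sum((s - s_bar) ** 2 for s in S) / n
var_y = sum((y - y_bar) ** 2 for y in Y) / n
cov_sy = sum((s - s_bar) * (y - y_bar) for s, y in zip(S, Y)) / n

# Sample correlation coefficient, following the form of equation (1.26).
r_sy = cov_sy / math.sqrt(var_s * var_y)

print(f"Var(S) = {var_s:.3f}, Var(Y) = {var_y:.3f}, Cov(S, Y) = {cov_sy:.3f}")
print(f"r_SY = {r_sy:.2f}")
```

Working from the raw observations, rather than from the rounded entries in the table, avoids accumulating rounding error in the intermediate columns.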
Exercises
1.5 In the years following the Second World War, the economic growth of those countries that had suffered the greatest destruction, Germany and Japan, was more rapid than that of most other industrialized countries. Various hypotheses were offered to explain this. Nicholas Kaldor, a Hungarian economist, argued that the countries that had suffered the worst devastation had had to invest comprehensively with new plant and equipment. Because they were using up-to-date technology, their marginal costs were lower than those of their competitors in export markets, and they gained market share. Because they gained market share, they needed to increase their productive capacity and this meant additional investment, further lowering their marginal costs and increasing their market share. Meanwhile those countries that had suffered least, such as the U.S. and the U.K., had less need to re-invest. As a consequence the same process worked in the opposite direction. Their marginal costs were relatively high, so they lost market share and had less need to increase capacity. As evidence for this hypothesis, Kaldor showed that there was a high correlation between the output growth rate, x, and the productivity growth rate, p, in the manufacturing sectors in the 12 countries listed in the table.
When a critic pointed out that it was inevitable that x and p would be highly correlated, irrespective of the validity of this hypothesis, Kaldor proposed a variation on his hypothesis.
Economic growth was initially high in all countries for a few years after the war, but in some, particularly the U.S. and the U.K., it was soon checked by a shortage of labor, and a negative cycle took hold. In others, like Germany and Japan, where agriculture still accounted for a large share of employment, the manufacturing sector could continue to grow by attracting workers from the agricultural sector, and they would then have an advantage. A positive correlation between the growth rate of employment, e, and that of productivity would be evidence in favor of his hypothesis.
Annual Growth Rates (%)
Country Employment (e) Productivity (p)
Austria 2.0 4.2
Belgium 1.5 3.9
Canada 2.3 1.3
Denmark 2.5 3.2
France 1.9 3.8
Italy 4.4 4.2
Japan 5.8 7.8
Netherlands 1.9 4.1
Norway 0.5 4.4
West Germany 2.7 4.5
U.K. 0.6 2.8
U.S. 0.8 2.6
The table reproduces his data set, which relates to the period 1953/1954 to 1963/1964 (annual exponential growth rates). Plot a scatter diagram and calculate the sample correlation coefficient for e and p. [If you are not able to use a spreadsheet application for this purpose, you are strongly advised to use equations (1.9) and (1.17) for the sample covariance and variance and to keep a copy of your calculation, as this will save you time with another exercise in Chapter 2.]
Comment on your findings.
1.6 Suppose that the observations on two variables X and Y lie on a straight line $Y = b_1 + b_2 X$.
Demonstrate that $\mathrm{Cov}(X, Y) = b_2 \mathrm{Var}(X)$ and that $\mathrm{Var}(Y) = b_2^2 \mathrm{Var}(X)$, and hence that the sample correlation coefficient is equal to 1 if the slope of the line is positive, –1 if it is negative.
1.7* Suppose that a variable Y is defined by the exact linear relationship Y = b1 + b2X
and suppose that a sample of observations has been obtained for X, Y, and a third variable, Z.
Show that the sample correlation coefficient for Y and Z must be the same as that for X and Z, if b2 is positive.