Relationships between random variables
5.2 THE PEARSON’S LINEAR CORRELATION COEFFICIENT
If information is obtained during an investigation that some random variables are mutually statistically dependent, further research should be directed towards the analysis of this interdependence.
Generally, once statistical interdependence has been stated, the first area of consideration is correlation analysis.
Correlation is a certain stochastic relation between two or more random variables. It relies on a regularity such that changes in the values of one variable are accompanied by systematic stochastic changes in the values of the second variable or variables. Sometimes, correlation is described as a certain degree of stochastic ‘brotherhood’ of the changes in random variables.
There are a few different methods that can be used to check whether or not there is a correlation between variables.
One of the most commonly applied methods is the construction of a statistical table like the one that was presented in the previous chapter (table of independence). If the numbers lie principally on the main diagonal or the largest numbers lie on the main diagonal, then it can be expected that the random variables are correlated linearly. If the numbers are located in a certain characteristic way in the table but not linearly, a supposition can be formulated that a nonlinear correlation exists between the random variables.
A different method, which is often applied in practice, is the construction of a scatter diagram of the random variables in a rectangular coordinate system X, Y in which the pairs of observations (x_i, y_i) are plotted. If these points are accumulated in such a way that they can be enclosed by an ellipse, then we can suspect that a linear correlation exists between the random variables. The closer the points lie to each other, the stronger the relationship that can be expected. If the points are scattered, one can expect a lack of correlation.
Example diagrams are shown in Figure 6.2.
A precise statement about whether the correlation between random variables exists can be obtained by analytical reasoning.
There are many correlation measures in mathematical statistics, such as linear correlation coefficients, rank correlation coefficients, nonlinear correlation coefficients, partial correlation coefficients, multiple correlation coefficients and so on. Many of these measures have found applications in mining engineering and they have been used for years.
Let us discuss the problem of correlation between two random variables.
There is no doubt that the measure of correlation that has been most commonly applied for years is Pearson’s linear correlation coefficient. It is defined by formula (1.79). Recall that this is a normalised measure that takes values in the interval [−1, +1]. In the case of a strict linear functional relationship between variables X and Y, this coefficient becomes 1 when an increment in the values of one variable is accompanied by an increment in the values of the second variable. If the relationship is reversed but is still strict, then the coefficient reaches −1. Generally, the greater the modulus of the correlation coefficient, the stronger the interdependence between the random variables. Remember, we are only considering the linear relationship between the variables. If the random variables are mutually independent, then the value of the correlation coefficient is zero.
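The limiting cases described above can be checked numerically. The following is a minimal sketch, assuming the usual product-moment form of the sample coefficient (formula (1.79) is not reproduced in this section); the function name pearson_r and the toy data are introduced here for illustration only.

```python
import math

def pearson_r(xs, ys):
    """Sample product-moment correlation coefficient of paired observations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, n >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of the deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)

# Toy data (hypothetical), symmetric about zero.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
r_increasing = pearson_r(xs, [2 * x + 1 for x in xs])   # strict increasing linear relation
r_decreasing = pearson_r(xs, [-3 * x + 4 for x in xs])  # strict decreasing linear relation
r_parabola = pearson_r(xs, [x * x for x in xs])         # strict but nonlinear relation
```

For these data the first two values equal +1 and −1 up to rounding, whereas the parabola gives 0 even though Y is a function of X. This illustrates that the coefficient measures the linear relationship only: independence implies a zero coefficient, but a zero coefficient does not by itself imply independence.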
Let us now turn from theory to practice.
Having some information about the general population, we can use estimator (3.46); however, some variants of this measure can be found in statistical books.
As in many previous research situations, the point of interest is whether or not the estimate that is obtained is statistically significant. Obviously, a test is needed to resolve the problem. Usually, the procedure is as follows.
A statistical hypothesis H0 is formulated which states that there is no correlation between the random variables being investigated, which is written as H0: ρ = 0, where ρ denotes the correlation coefficient in the whole population. The alternative hypothesis rejects it (H1: ρ ≠ 0).
There are at least two ways to verify the null hypothesis.
One way is to apply Student’s statistic, namely: if n ≥ 3 and the verified hypothesis is true, then the statistic
\[ t = \frac{R_{X,Y}}{\sqrt{1 - R_{X,Y}^{2}}}\,\sqrt{n - 2} \qquad (5.7) \]
has Student’s distribution with n − 2 degrees of freedom². From the table of critical values of Student’s distribution, we read off the value t(α, n − 2) for the presumed level of significance α and n − 2 degrees of freedom. If the modulus of the value of statistic (5.7) is not lower than the critical value, then the verified hypothesis should be rejected. Otherwise, there is no basis to reject the null hypothesis.
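The decision rule can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are introduced here, the critical value is supplied by the user from a table of Student’s distribution, and the numbers in the usage lines (r = 0.60, n = 12) are hypothetical and do not come from the text.

```python
import math

def t_statistic(r, n):
    """Statistic (5.7): t = r * sqrt(n - 2) / sqrt(1 - r^2)."""
    if n < 3:
        raise ValueError("the test requires n >= 3")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

def reject_no_correlation(r, n, t_critical):
    """Return True if H0: rho = 0 is rejected, i.e. |t| >= t(alpha, n - 2)."""
    return abs(t_statistic(r, n)) >= t_critical

# Hypothetical illustration: r = 0.60 estimated from n = 12 pairs, alpha = 0.05.
# 2.228 is the two-sided critical value of Student's distribution
# for 10 degrees of freedom, read from a standard table.
t_value = t_statistic(0.60, 12)
decision = reject_no_correlation(0.60, 12, 2.228)
```

Here t_value is about 2.37, which exceeds 2.228, so in this hypothetical case the hypothesis of no correlation would be rejected.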
The second way is to use the table of critical values for Pearson’s correlation coefficient.
These are given in Table 9.13; the parameter ν there is n − k, where k is the number of random variables being investigated (here k = 2).
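The two ways are closely connected: solving (5.7) for the coefficient shows which value of the coefficient corresponds to a given value of Student’s statistic. The derivation below is a sketch; R stands for R_{X,Y}, the symbols r_crit and t_crit are shorthand introduced here, and whether Table 9.13 was actually constructed in this way is an assumption.

```latex
t = \frac{R\,\sqrt{n-2}}{\sqrt{1-R^{2}}}
\;\Longrightarrow\;
t^{2}\,\bigl(1-R^{2}\bigr) = R^{2}\,(n-2)
\;\Longrightarrow\;
R^{2} = \frac{t^{2}}{t^{2}+n-2},
\qquad
r_{\mathrm{crit}} = \frac{t_{\mathrm{crit}}}{\sqrt{t_{\mathrm{crit}}^{2}+n-2}}
```

Because |R| increases with |t| for a fixed n, comparing |R| with r_crit leads to the same decision as comparing |t| with t_crit.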
■ Example 5.2
At the beginning of the 1970s, a reliability investigation of mine hoist head ropes comprising 35 mines was carried out in Poland.
The information that was gathered contained the technical data of the hoists and operational parameters such as, for instance, the average number I of winds per day. The rate of increase in the number of wire breaks was assumed as the measure of rope wear.
The number T of days of work was counted up to the moment at which this rate attained a certain value, presuming that this value was the maximum for all of the ropes that were investigated. All of the ropes had the same construction (triangle shape of strands) and for this reason it was presumed that the wear measure was selected well.
² The number two is connected with the fact that two random variables are considered.
Once these two parameters had been defined, an interesting problem was whether they were correlated, i.e. whether an increase in the intensity of the hoist work was accompanied by a decrease in the rope durability.
The empirical pairs (T, I) are shown in Figure 5.1.
A statistical hypothesis was formulated which stated that there was no correlation between the random variables, i.e. H0: ρ = 0 versus an alternative hypothesis that rejected it. The level of significance was presumed to be α = 0.05.
Pearson’s correlation coefficient was calculated, giving R_{T,I} = −0.333.
For the given sample size and the presumed level of significance, the critical value was obtained by interpolation, because Table 9.13 gives critical values only for ν = 30 and ν = 35, whereas in the case considered ν = 33. Formula (5.7) can also be applied, which gives:
Table 5.2. Auxiliary calculations.
No.  T (days)  I (average number of winds/day)
1 316 680
2 205 720
3 960 640
4 440 680
5 331 680
6 693 680
7 782 640
8 360 640
9 331 640
10 472 640
11 305 680
12 522 400
13 146 620
14 226 560
15 321 520
16 525 480
17 409 520
18 239 560
19 479 520
20 333 880
21 363 880
22 525 840
23 414 880
24 405 840
25 345 880
26 290 880
27 462 880
28 554 680
29 643 680
30 560 520
31 276 520
32 729 400
33 617 480
34 913 480
35 940 460
\[ t = \frac{-0.333}{\sqrt{1 - (-0.333)^{2}}}\,\sqrt{33} = -2.03 \]
Making use of Table 9.3 we get the critical value:
t(α = 0.05, n – 2 = 33) ≅ 2.04
The critical value is greater than the modulus of the empirical value (2.04 > 2.03), and thus we have no grounds to reject the null hypothesis.
Nevertheless, both values are very close to each other and this fact creates a sensitive point.
If the level of significance is presumed to be slightly higher, the result of the reasoning is different, and so is the final conclusion.
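The calculation in this example can be retraced from the data in Table 5.2. The sketch below assumes the usual product-moment form of the estimator (estimator (3.46) itself is not reproduced in this section); the variable names are introduced here, and the critical value 2.04 is simply the one quoted in the text.

```python
import math

# (T, I) pairs copied from Table 5.2:
# T = durability in days, I = average number of winds per day.
ROPES = [
    (316, 680), (205, 720), (960, 640), (440, 680), (331, 680),
    (693, 680), (782, 640), (360, 640), (331, 640), (472, 640),
    (305, 680), (522, 400), (146, 620), (226, 560), (321, 520),
    (525, 480), (409, 520), (239, 560), (479, 520), (333, 880),
    (363, 880), (525, 840), (414, 880), (405, 840), (345, 880),
    (290, 880), (462, 880), (554, 680), (643, 680), (560, 520),
    (276, 520), (729, 400), (617, 480), (913, 480), (940, 460),
]

n = len(ROPES)
mean_t = sum(t for t, _ in ROPES) / n
mean_i = sum(i for _, i in ROPES) / n

# Sums of products and squares of the deviations from the means.
s_ti = sum((t - mean_t) * (i - mean_i) for t, i in ROPES)
s_tt = sum((t - mean_t) ** 2 for t, _ in ROPES)
s_ii = sum((i - mean_i) ** 2 for _, i in ROPES)

r = s_ti / math.sqrt(s_tt * s_ii)                      # sample coefficient R_{T,I}
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)   # statistic (5.7)

T_CRITICAL = 2.04  # t(alpha = 0.05, 33 degrees of freedom), as quoted in the text
reject_h0 = abs(t_stat) >= T_CRITICAL

print(f"n = {n}, R = {r:.3f}, t = {t_stat:.2f}, reject H0: {reject_h0}")
```

Run on the tabulated data, this should reproduce the values quoted above, R ≈ −0.333 and t ≈ −2.03, and it makes it easy to see how little the coefficient would have to change for the decision to be reversed.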
As there are some doubts about the result of the inference, let us check it using the critical values for Pearson’s correlation coefficient. For the presumed level of significance and the known sample size, Table 9.13 gives the critical value 0.334, which is also obtained by interpolation. Thus, this way of reasoning gives the same result: there is no ground to reject the verified hypothesis, but, again, the empirical and critical values are very close to each other. Notice that from an engineering point of view we suspect the existence of a certain relationship between the random variables that are being investigated. This is the reason why we paid attention to the very small difference between the critical and the empirical values. At this stage of our analysis, we accept the outcome of this part of the statistical inference, although it looks as though further research should be conducted.
Our conclusion that there is no correlation between the variables being investigated can be strengthened by analysing Figure 5.1 and Figure 6.2. It is easy to notice that the scatter of the empirical points in Figure 5.1 does not correspond to any of the sketches shown in Figure 6.2.
Our previous considerations concerned the stochastic interdependence between two random variables. Now it is time to extend the scope of interest to a greater number of random variables.
If the number of random variables of interest is k and the relationship between them is linear, we can then estimate the linear correlation coefficients for the whole set of pairs of variables. These coefficients can be arranged in matrix R, which is called a correlation matrix:
Figure 5.1. Empirical points for the hoist head ropes: durability of the rope (days, vertical axis) versus the average number of winds that were executed (winds/day, horizontal axis).
R = [R_{i,j}], i, j = 1, …, k, where each diagonal element R_{i,i} is defined as 1. The properties of the correlation matrix R are determined by the properties of the covariance matrix Σ, according to the relation:
Σ = D R D