THE PEARSON’S LINEAR CORRELATION COEFFICIENT

Relationships between random variables

5.2 THE PEARSON’S LINEAR CORRELATION COEFFICIENT

If an investigation yields information that some random variables are mutually statistically dependent, further research should be directed towards the analysis of this interdependence.

Generally, once statistical interdependence has been stated, the first area of consideration is correlation analysis.

Correlation is a certain stochastic relation between two or more random variables. It consists in the regularity that changes in the values of one variable are accompanied by systematic stochastic changes in the values of the other variable or variables. Sometimes, correlation is described as a certain degree of stochastic ‘brotherhood’ of the changes in random variables.

There are a few different methods that can be used to check whether or not there is a correlation between variables.

One of the most commonly applied methods is the construction of a statistical table like the one that was presented in the previous chapter (table of independence). If the numbers lie principally on the main diagonal or the largest numbers lie on the main diagonal, then it can be expected that the random variables are correlated linearly. If the numbers are located in a certain characteristic way in the table but not linearly, a supposition can be formulated that a nonlinear correlation exists between the random variables.

A different method, which is often applied in practice, is the construction of a scatter diagram of the random variables in a rectangular coordinate system X, Y in which pairs of observations (x_i, y_i) are plotted. If these points are grouped in such a way that they can be enclosed by an ellipse, then we can suspect that a linear correlation exists between the random variables. If the points lie closer to each other, a stronger relationship can be expected. If the points are scattered, one can expect a lack of correlation.

Example diagrams are shown in Figure 6.2.

A precise statement about whether the correlation between random variables exists can be obtained by analytical reasoning.

There are many correlation measures in mathematical statistics, such as: linear correlation coefficients, rank correlation coefficients, nonlinear correlation coefficients, partial correlation coefficients, multiple correlation coefficients and so on. Many of these measures have found applications in mining engineering and they have been used for years.

Let us discuss the problem of correlation between two random variables.

There is no doubt that the measure of correlation that has been most commonly applied for years is Pearson’s linear correlation coefficient. It is defined by formula (1.79). Recall that this is a normalised measure and is determined over the interval [−1, +1]. In the case of a functional relationship between variables X and Y, this coefficient becomes 1 when an increment in the values of one variable is accompanied by an increment in the values of the second variable. If the relationship is reversed but is still strict, then the coefficient reaches −1. Generally, the greater the value in the modulus of the correlation coefficient, the stronger the interdependence between the random variables. Remember, we are only considering the linear relationship between the variables. If the random variables are mutually independent, then the value of the correlation coefficient is zero.
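The boundary cases described above can be checked numerically. The sketch below is only an illustration: formula (1.79) is not reproduced in this section, so the function assumes the usual sample form of the coefficient (the sum of products of deviations divided by the square root of the product of the sums of squared deviations). The name `pearson_r` and the data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Sample Pearson linear correlation coefficient of paired observations.

    Assumed form: S_xy / sqrt(S_xx * S_yy), where S denotes sums of
    products of deviations from the means.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("coefficient undefined when a variable is constant")
    return s_xy / math.sqrt(s_xx * s_yy)


# Made-up observations, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
increasing = [2.0 * v + 1.0 for v in x]    # strict increasing linear relation
decreasing = [-3.0 * v + 7.0 for v in x]   # strict decreasing linear relation

print(pearson_r(x, increasing))   # close to +1
print(pearson_r(x, decreasing))   # close to -1
```

Because of floating-point rounding, the printed values may differ from ±1 in the last digits.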

Let us now turn from theory to practice.

Having some information about the general population, we can use estimator (3.46); however, some varieties of this measure can be found in statistical books.

As in many previous research situations, the point of interest is whether the estimate that is obtained is statistically significant or not. Obviously, a test is needed to resolve the problem. Usually, the procedure is as follows.

A statistical hypothesis H₀ is formulated which states that there is no correlation between the random variables that are being investigated. This is noted as H₀: ρ = 0, where ρ means the correlation coefficient in the whole population. An alternative hypothesis rejects it.

There are at least two ways to verify the null hypothesis.

One way is to apply the Student’s statistic, namely: if n ≥ 3 and the verified hypothesis is true, then the statistic

$$ t = \frac{R_{X,Y}}{\sqrt{1 - R_{X,Y}^{2}}}\,\sqrt{n - 2} \qquad (5.7) $$

has the Student’s distribution with n − 2 degrees of freedom². From the table of critical values of the Student’s distribution, we read off the value t(α, n − 2) for the presumed level of significance α and n − 2 degrees of freedom. If the absolute value of the statistic (5.7) is not lower than the critical value, then the verified hypothesis should be rejected. Otherwise, there is no basis to reject the null hypothesis.
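The decision rule just described can be sketched in a few lines of Python. This is a minimal illustration, not the book's procedure in code: the function names are hypothetical, and the critical value is not computed but must be read from a table of the Student's distribution and passed in by the user.

```python
import math


def t_statistic(r, n):
    """Statistic (5.7): t = r * sqrt(n - 2) / sqrt(1 - r**2)."""
    if n < 3:
        raise ValueError("at least three pairs of observations are needed")
    if abs(r) >= 1.0:
        raise ValueError("the statistic is undefined for |r| = 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def reject_no_correlation(r, n, t_critical):
    """Return True when |t| is not lower than the critical value,
    i.e. when the hypothesis H0: rho = 0 should be rejected."""
    return abs(t_statistic(r, n)) >= t_critical


# Hypothetical case: r = 0.5 estimated from n = 27 pairs.
# 2.060 is the tabulated two-sided critical value for alpha = 0.05
# and 25 degrees of freedom.
t_value = t_statistic(0.5, 27)
print(round(t_value, 3))
print(reject_no_correlation(0.5, 27, 2.060))
```

Passing the critical value as an argument keeps the sketch free of any statistical library; a reader with access to one could replace the tabulated value with a computed quantile.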

The second way is to use the critical values table for the Pearson’s correlation coefficient.

They are given in Table 9.13; parameter ν here is n − k and k is the number of random variables (k = 2) that are being investigated.

■ Example 5.2

At the beginning of the 1970s, a reliability investigation of mine hoist head ropes comprising 35 mines was carried out in Poland.

The information that was gathered contained the technical data of the hoists and operational parameters such as, for instance, the average number I of winds per day. The speed of the increment of wire breaks was assumed as the measure of rope wear.

The number of days T of work was calculated until the moment at which the speed attained a certain value presuming that this value was the maximum for all of the ropes that were investigated. All of the ropes had the same construction (triangle shape of strands) and for this reason it was presumed that the wear measure was selected well.

² The number two is connected with the fact that two random variables are considered.


Once these two parameters had been defined, an interesting problem was whether they were correlated, i.e. whether an increment in the intensity of the hoist work was accompanied by a decrement in the rope durability.

The empirical pairs (T, I) are shown in Figure 5.1.

A statistical hypothesis was formulated which stated that there was no correlation between the random variables, i.e. H₀: ρ = 0 versus an alternative hypothesis that rejected it. The level of significance was presumed to be α = 0.05.

Pearson’s correlation coefficient was calculated, giving R_{T,I} = −0.333.

For the given sample size and the presumed level of significance, the critical value was calculated by interpolation because the critical values are given for n = 30 and for 35 in Table 9.13. In the case that was considered, ν = 33. Formula (5.7) can also be applied to the data listed in Table 5.2.

Table 5.2. Auxiliary calculations.

No.   T (days)   I (average number of winds per day)
 1      316        680
 2      205        720
 3      960        640
 4      440        680
 5      331        680
 6      693        680
 7      782        640
 8      360        640
 9      331        640
10      472        640
11      305        680
12      522        400
13      146        620
14      226        560
15      321        520
16      525        480
17      409        520
18      239        560
19      479        520
20      333        880
21      363        880
22      525        840
23      414        880
24      405        840
25      345        880
26      290        880
27      462        880
28      554        680
29      643        680
30      560        520
31      276        520
32      729        400
33      617        480
34      913        480
35      940        460


Formula (5.7) gives:

$$ t = \frac{-0.333}{\sqrt{1 - (-0.333)^{2}}}\,\sqrt{33} = -2.03 $$

Making use of Table 9.3 we get the critical value:

t(α = 0.05, n – 2 = 33) ≅ 2.04

The critical value is greater than the empirical value in the modulus and thus we have no ground to reject the null hypothesis.

Nevertheless, both values are very close to each other and this fact creates a sensitive point.

If the level of significance is presumed to be slightly higher, the result of the reasoning is different, and so is the final conclusion.
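The calculation in this example can be retraced with a short script. The sketch below assumes the usual sample form of Pearson's coefficient (estimator (3.46) is not reproduced in this section), uses the pairs transcribed from Table 5.2, and takes the critical value 2.04 directly from the text rather than computing it. The names `TABLE_5_2`, `pearson_r` and `T_CRITICAL` are hypothetical.

```python
import math

# (T, I) pairs transcribed from Table 5.2:
# T = number of days of work, I = average number of winds per day.
TABLE_5_2 = [
    (316, 680), (205, 720), (960, 640), (440, 680), (331, 680),
    (693, 680), (782, 640), (360, 640), (331, 640), (472, 640),
    (305, 680), (522, 400), (146, 620), (226, 560), (321, 520),
    (525, 480), (409, 520), (239, 560), (479, 520), (333, 880),
    (363, 880), (525, 840), (414, 880), (405, 840), (345, 880),
    (290, 880), (462, 880), (554, 680), (643, 680), (560, 520),
    (276, 520), (729, 400), (617, 480), (913, 480), (940, 460),
]

T_CRITICAL = 2.04  # t(alpha = 0.05, n - 2 = 33) as quoted in the text


def pearson_r(pairs):
    """Sample Pearson coefficient (assumed form: S_xy / sqrt(S_xx * S_yy))."""
    n = len(pairs)
    mean_x = sum(p[0] for p in pairs) / n
    mean_y = sum(p[1] for p in pairs) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    s_xx = sum((a - mean_x) ** 2 for a, _ in pairs)
    s_yy = sum((b - mean_y) ** 2 for _, b in pairs)
    return s_xy / math.sqrt(s_xx * s_yy)


n = len(TABLE_5_2)
r = pearson_r(TABLE_5_2)
t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)  # statistic (5.7)

print(f"n = {n}, r = {r:.3f}, t = {t:.2f}")
if abs(t) >= T_CRITICAL:
    print("reject H0: rho = 0")
else:
    print("no ground to reject H0: rho = 0")
```

With the values transcribed as above, the script should reproduce the figures quoted in the example (r ≈ −0.333, t ≈ −2.03) and the same decision.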

As there are some doubts about the result of the inference, let us check it using the critical values for the Pearson’s correlation coefficient. For the presumed level of significance and the known sample size (Table 9.13), we have the critical value 0.334, which is also obtained by interpolation. Thus, because this way of reasoning gives the same result, there is no ground to reject the verified hypothesis, but, again, both values, the empirical and the critical one, are very close to each other. Notice that from an engineering point of view we suspect the existence of a certain relationship between the random variables that are being investigated. This was the reason that we paid attention to the very small difference between the critical and the empirical values. At this stage of our analysis, we accept the outcome of that part of the statistical inference, although it looks as though further research should be conducted.
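A short derivation, which is not given in the text, suggests why the two procedures lead to the same decision. Solving (5.7) for the coefficient shows that a critical value of t corresponds to a critical value of the coefficient, here denoted R_crit (a symbol introduced only for this note). The sketch assumes that Table 9.13 is based on the same Student's distribution, which the text does not state explicitly.

```latex
t = \frac{R\,\sqrt{n-2}}{\sqrt{1-R^{2}}}
\quad\Longleftrightarrow\quad
R = \frac{t}{\sqrt{t^{2} + n - 2}},
\qquad
R_{\mathrm{crit}} \approx \frac{2.04}{\sqrt{2.04^{2} + 33}} \approx 0.335
```

The result agrees with the interpolated value 0.334 up to the rounding of the tabulated critical values.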

Our conclusion that there is no correlation between the variables being investigated can be strengthened by analysing Figure 5.1 and Figure 6.2. It is easy to notice that the scatter of the empirical points in Figure 5.1 does not correspond to any of the sketches visible in Figure 6.2.

Our previous considerations comprised problems related to the stochastic interdependence between two random variables. Now it is time to extend the scope of interest to a greater number of random variables.

If the number of random variables of interest is k and the relationship between them is linear, we can then estimate the linear correlation coefficients for the whole set of pairs of variables. These coefficients can be arranged in matrix R, which is called a correlation matrix:

Figure 5.1. Empirical points for the hoist head ropes: durability of the rope (days, vertical axis) versus the average number of winds per day that were executed (horizontal axis).


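$$ R = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1k} \\ r_{21} & 1 & \cdots & r_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ r_{k1} & r_{k2} & \cdots & 1 \end{bmatrix} $$

where r_{ij} denotes the linear correlation coefficient for the pair of variables i and j, and each element on the main diagonal is defined as 1. The properties of the correlation matrix R are determined by the properties of the covariance matrix Σ, according to the relation: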

Σ = D R D
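The relation above can be illustrated numerically. The definition of D is not visible in this section, so the sketch below assumes that D is the diagonal matrix of the standard deviations of the variables; the function names and the covariance values are hypothetical.

```python
import math


def correlation_from_covariance(cov):
    """Split a covariance matrix into (std, R) so that cov = D R D,
    where D = diag(std) is ASSUMED to hold the standard deviations."""
    k = len(cov)
    std = [math.sqrt(cov[i][i]) for i in range(k)]
    corr = [[cov[i][j] / (std[i] * std[j]) for j in range(k)]
            for i in range(k)]
    return std, corr


def covariance_from_correlation(std, corr):
    """Rebuild the covariance matrix: element (i, j) of D R D."""
    k = len(std)
    return [[std[i] * corr[i][j] * std[j] for j in range(k)]
            for i in range(k)]


# Made-up covariance matrix of k = 3 variables, for illustration only.
cov = [
    [4.0, 1.2, -0.8],
    [1.2, 9.0, 1.5],
    [-0.8, 1.5, 1.0],
]

std, corr = correlation_from_covariance(cov)
rebuilt = covariance_from_correlation(std, corr)

for row in corr:
    print([round(v, 3) for v in row])
```

Each diagonal element of the resulting matrix equals 1, and multiplying back by the assumed D on both sides recovers the original covariance matrix up to rounding.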

In document Statistics for Mining Engineering-(2014) (Page 160-164)