Synthesis of data—regression analysis
6.2 LINEAR REGRESSION
Let us start with the simplest case, in which the proposed regression³ line is a straight-line function:
$$y = \beta_1 x + \beta_0 + \xi \qquad (6.2)$$
This function has very wide application in mining engineering. For example, the rate of corrosion of shaft furnishings versus time can be described by a linear function (Carbogno et al. 2001); the resistance to compression of soil samples as a function of the depth of the sample location can also be modelled by this function (Benjamin and Cornell 1977); the utilisation rate of a means of transport in relation to its productivity can be described using formula (6.2) (Lin Zaikang et al. 1997); the unit energy of an excavation as a function of the rock compressive strength can be modelled using a linear function (Ceylanoğlu and Görgülü 1997); and the relationship between the peak cutting force and the cutting depth is also linear (Brown and Frimpong 2012).
The point of interest is the estimation of the structural parameters β0 and β1 in order to obtain the analytical form of the right side of equation (6.2) that gives the best fit to the empirical data. Notice a certain subtlety. Because the information in hand is in the form of a sample, we have no possibility of finding the real values of the model parameters; we can only estimate these parameters.

3 The term ‘regression’ is used in several different sciences such as biology, economics, geography, psychology, geology and so on. In statistics, it is usually understood as the analysis or measure of the association between random variables.

Figure 6.1. Empirical distribution of points that is probably a picture of a two-dimensional random variable. (Axes: X horizontal, Y vertical.)

Figure 6.2. Examples of the distribution of empirical points and the proposed analytical regression functions. (Each panel plots Y against X.) Functions to consider:
a) $y = ax + b + \xi$
b) $y = a + bx + cx^2 + \xi$
c) $y = b a^{x} c^{\xi}$ or $y = b x^{a} c^{\xi}$ or $y = a + bx + cx^2 + \xi$
d) $y = b + a \log x + \xi$ or $y = b x^{a} c^{\xi}$ or $y = ax\xi/(x + b)$
e) $y = (a + bx + \xi)^{-1}$ or $y = x[a + (b + \xi)x]^{-1}$ or $y = a/x + \xi$
f) $y = k(a + b e^{-cx} + \xi)^{-1}$ or $y = a + bx + cx^2 + dx^3 + \xi$
Denote these estimates by b0 and b1, respectively, and denote the estimate of the random component, the so-called residuals, by u. These residuals are defined as the differences:
$$u_i = y_i - (b_1 x_i + b_0) \qquad (6.3)$$
and they take both negative and positive values.
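If it is assumed that it does not matter whether a difference is negative or positive, then the best estimates of the unknown parameters in equation (6.3) are those for which the sum of squared residuals
$$S = \sum_{i=1}^{n} u_i^2 = \sum_{i=1}^{n} \left[ y_i - (b_1 x_i + b_0) \right]^2$$
attains its minimum. Therefore, the following set of equations should be solved:
$$\frac{\partial S}{\partial b_0} = 0, \qquad \frac{\partial S}{\partial b_1} = 0$$
which gives the following set of equations:
$$b_1 \sum_{i=1}^{n} x_i^2 + b_0 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i y_i$$
$$b_1 \sum_{i=1}^{n} x_i + n b_0 = \sum_{i=1}^{n} y_i \qquad (6.6)$$
The above set is called the set of normal equations of the least squares method.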
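The closed-form solution of the normal equations can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original text: the function name `fit_line` and the sample data are hypothetical, and the two equations are assumed to have the standard least squares form for model (6.2).

```python
def fit_line(xs, ys):
    """Solve the two normal equations for the slope b1 and intercept b0.

    Assumes the standard form:
        b1 * sum(x^2) + b0 * sum(x) = sum(x * y)
        b1 * sum(x)   + n * b0      = sum(y)
    """
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Determinant of the 2x2 system; it is zero when all x values coincide.
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    b1 = (n * sxy - sx * sy) / det
    b0 = (sy - b1 * sx) / n
    return b1, b0


# Hypothetical data lying exactly on y = 2x + 1.
b1, b0 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(b1, b0)  # 2.0 1.0
```

With real measurements the points do not lie exactly on a line, and the same function returns the least squares estimates.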
Let us now move this consideration onto theoretical ground.
Consider a statistical population in which the values of the two variables form a certain two-dimensional distribution. Let the relationship between these variables be given by formula (6.2). Then the following relationships hold:
$$\beta_1 = \rho \frac{\sigma_Y}{\sigma_X}, \qquad \beta_0 = E(Y) - \beta_1 E(X)$$
where:
ρ is the linear correlation coefficient between variables X and Y,
E(X), E(Y) are the expected values of the random variables X and Y, respectively,
σX, σY are the standard deviations of the random variables X and Y, respectively.
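Because the only information available is the sample taken, the estimators of the unknown structural parameters are obtained by solving the set of equations (6.6), namely:
$$b_1 = R_{XY} \frac{S_Y}{S_X}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$
where the symbols are already well known: $R_{XY}$ is the sample correlation coefficient, $S_X$ and $S_Y$ are the sample standard deviations, and $\bar{x}$ and $\bar{y}$ are the sample means.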
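For readers who want the intermediate step, one way to see how the correlation form follows from the normal equations is sketched below. The sketch assumes the standard form of the two normal equations and the usual definitions of the sample means, of $S_X$ and $S_Y$ (computed with a common divisor) and of $R_{XY}$; these definitions are not restated in this section.

$$
\begin{aligned}
b_0 &= \bar{y} - b_1 \bar{x}
  && \text{(from } b_1 \textstyle\sum x_i + n b_0 = \sum y_i \text{)} \\
b_1 \Big( \sum x_i^2 - n \bar{x}^2 \Big) &= \sum x_i y_i - n \bar{x} \bar{y}
  && \text{(substituting } b_0 \text{ into } b_1 \textstyle\sum x_i^2 + b_0 \sum x_i = \sum x_i y_i \text{)} \\
b_1 &= \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
     = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}
       \cdot \sqrt{\frac{\sum (y_i - \bar{y})^2}{\sum (x_i - \bar{x})^2}}
     = R_{XY} \frac{S_Y}{S_X}
\end{aligned}
$$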
Having such estimators, the question arises as to what properties they have.
Look at the stochastic ‘mechanism’ that generates the observations.
Assumptions of the classical least squares method are as follows (Goldberger 1966, Draper and Smith 1998):
a. Equation (6.2) means that each observation yi, i = 1, 2, …, n, is a linear function of the observation xi and the random component ξi.
b. The random component ξ is a random variable with a zero expected value and an unknown constant variance $\sigma_\xi^2$, i.e. E(ξ) = 0 and $\sigma_\xi^2$ = const, with $\sigma_\xi^2 > 0$.
c. The random variables ξi and ξj are uncorrelated for i ≠ j, thus cov(ξi, ξj) = 0.
d. The variable xi is non-random, thus xi and ξi are independent for every i.
If the assumptions above are fulfilled, then the estimators just obtained are the best linear unbiased estimators, i.e. they have the minimum variance among the linear unbiased estimators. If these assumptions are not satisfied, the estimators have worse statistical properties and the estimates obtained may be unreliable. If, for instance, such estimators are applied to safety problems, a safety risk may be underestimated.
Often, an additional assumption (e) is made, which states that the random component distribution is normal, and then we decide on the classical model of normal linear regression (Goldberger 1966).
After the estimation of the structural parameters, an assessment of the random component should be made. The least squares method ensures that the mean value of the residuals equals zero, or that the mean assessed from the sample differs negligibly from zero. The important information is therefore a measure of their dispersion.
The unbiased estimator of the unknown variance $\sigma_\xi^2$ of the random component in the linear regression model is given by the function:
$$s_u^2 = \frac{1}{n-2} \sum_{i=1}^{n} u_i^2 \qquad (6.9)$$
Having the estimates (b0, b1) of the structural parameters (β0, β1) of the regression function that describes how these variables are mutually dependent, and knowing the estimate of the unknown variance of the random component, the question can be asked whether these estimates are significant.
If assumption (e) is a rational one, the significance can be easily verified.
Formulate the null hypothesis H0: β1 = 0, which states that there is no linear relationship between the variables being investigated, against the alternative hypothesis H1: β1 ≠ 0, which rejects it.
It can be proved (Goldberger 1966) that the statistic:
$$t = \frac{b_1}{S_{b_1}} = \frac{b_1}{s_u} \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \qquad (6.10)$$
has Student’s distribution with n − 2 degrees of freedom.
Therefore, if |t| (6.10) is above the critical value taken from Table 9.3 for a presumed level of significance α, then the null hypothesis should be rejected. Otherwise, there is no basis to discard the verified conjecture.
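The chain from the residuals to the test statistic can be sketched as follows. This is a minimal sketch under the assumptions stated in the text: the function name `slope_t_statistic` and the data are hypothetical, and the critical value still has to be read from a table of Student's distribution.

```python
import math


def slope_t_statistic(xs, ys):
    """Return (b1, s_u, s_b1, t) for a simple linear regression.

    s_u follows formula (6.9), i.e. the residual sum of squares is
    divided by n - 2; t follows formula (6.10).
    """
    n = len(xs)
    if n < 3:
        raise ValueError("at least three observations are needed")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    residuals = [y - (b1 * x + b0) for x, y in zip(xs, ys)]
    s_u = math.sqrt(sum(u * u for u in residuals) / (n - 2))
    s_b1 = s_u / math.sqrt(sxx)
    # A perfect fit gives s_b1 == 0; report an infinite statistic then.
    t = math.inf if s_b1 == 0 else b1 / s_b1
    return b1, s_u, s_b1, t


# Hypothetical observations scattered around y = x.
b1, s_u, s_b1, t = slope_t_statistic([0, 1, 2, 3, 4], [0.1, 0.9, 2.1, 2.9, 4.1])
print(round(b1, 3), round(s_b1, 3), round(t, 1))  # 1.0 0.04 25.0
```

The value of |t| is then compared with the tabulated critical value for n − 2 degrees of freedom.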
The reasoning that is performed here can easily be generalised.
Maintaining assumption (e), the confidence interval for the parameter β1 can be determined by applying the following formula:
$$b_1 \pm t_{\alpha,\,n-2}\, S_{b_1} \qquad (6.11)$$
at the confidence level 100(1 − α)%, where $t_{\alpha,\,n-2}$ is the critical value of Student’s distribution with n − 2 degrees of freedom. Taking the plus sign in expression (6.11) gives the right-side boundary; taking the minus sign gives the left-side boundary.
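As a numerical illustration, the interval can be evaluated with the slope and its standard deviation reported in Example 6.1 of this section (b1 = 1.342 × 10⁻³, standard deviation 4 × 10⁻⁵, n = 21). The critical value 2.093 used here is the two-sided 5% value of Student's distribution with 19 degrees of freedom taken from standard tables; it is an assumption of this sketch, as is the helper name `slope_confidence_interval`.

```python
def slope_confidence_interval(b1, s_b1, t_crit):
    """Return the (lower, upper) bounds b1 -/+ t_crit * s_b1."""
    half_width = t_crit * s_b1
    return b1 - half_width, b1 + half_width


# Values quoted in Example 6.1; t_crit is an assumed table value (19 d.o.f.).
lower, upper = slope_confidence_interval(1.342e-3, 4e-5, 2.093)
print(f"{lower:.3e} {upper:.3e}")
```

Because the interval does not contain zero, it leads to the same conclusion as the significance test for the slope.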
Testing the significance of the regression can also be conducted by making use of Snedecor’s F statistic if the relationship between the random variables t and F is known (see the end of Chapter 1).
If a positive result is obtained, i.e. there is a statistically significant linear relationship between the variables tested, then the standard deviations of the estimated parameters become very useful: they indicate how good the estimates are.
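The standard deviation of parameter b1 is given by the formula:
$$S_{b_1} = \frac{s_u}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$
whereas the standard deviation of parameter b0 is:
$$S_{b_0} = s_u \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n \sum_{i=1}^{n} (x_i - \bar{x})^2}}$$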
The greater the standard deviation, the smaller the accuracy of the estimates. The standard deviation of the estimator is called its mean error.
Very important information for the researcher carrying out the investigation is contained in the sequence of the residuals, $u_i = y_i - y_i^{(t)} = y_i - (b_1 x_i + b_0)$, which is in fact the sequence of the differences between the empirical values of the variable being explained and its theoretical values. This sequence is a representation of the random component that is not directly observable.
The sum of these residuals should be zero or insignificantly different from zero. The sequence should be stationary, should have constant dispersion and should show no autocorrelation, in accordance with the assumptions that were made. It is recommended to check whether these conditions are fulfilled in every case being considered.
■ Example 6.1
A tribology investigation was carried out to analyse the wear process of the linings used in the disc brakes of a winder. The linings were made from different materials and some changes in the production process had been introduced.
One of the investigation results was the course of the wear process of the lining for the disc that was fluorescently nitrided versus the number of brakes that were executed by the tester.
The results of the investigation are presented in Figure 6.3.
The results of the investigation clearly indicated that the relationship between the number of brakes that were executed x and the linear loss of linings (measured in mm) is linear, which was expected based on the literature on the subject.
Firstly, the linear correlation coefficient was calculated, giving:
$$R_{XY} = 0.990$$
This value is high, which ‘suggests’ that there really is a significant linear relationship between the variables that were investigated. Formally, a null hypothesis stating that there is no linear correlation between the variables was formulated, versus an alternative hypothesis that rejects it.
For a presumed level of significance α = 0.05 and a sample size n = 21, the critical value, which is 0.433, was taken from Table 9.13. Because the empirical value exceeds the critical one, the null hypothesis should be rejected in favour of the alternative supposition. It can be stated that there is a strong linear relationship between the number of brakes that were executed and the linear loss of the lining.
The next step was the estimation of the structural parameters of the linear function.
Using the set of equations (6.6), the following estimates were obtained:
$$b_1 = 1.342 \times 10^{-3} \qquad b_0 = -0.132$$
Thus, the relationship between the variables that were of interest can be expressed as:
$$y = 1.342 \times 10^{-3}\, x - 0.132 + u$$
Figure 6.3. The linear wear of the lining for the disc that was fluorescently nitrided versus the number of brakes that were executed by the tester. (Axes: number of brakes executed x, from 0 to 1200; linear wear of lining y, from 0.00 to 1.60 mm.)
The accuracy of the estimation was determined by two standard deviations:
$$S_{b_0} = 0.029 \qquad S_{b_1} = 4 \times 10^{-5}$$
The residuals are presented in Figure 6.4.
The mean residual and the corresponding standard deviation were as follows:
$$\bar{u} = 4.7 \times 10^{-4} \qquad s_u = 0.060 \text{ mm}$$
The mean residual was not precisely zero because of the rounding of some values. The small value of the standard deviation indicates that the theoretical function was properly selected and fits the empirical values well.
The sequence of the residuals was calculated and is presented in the table below. This series was the object of further investigation.
Firstly, the stationarity of the sequence was tested using Spearman’s rank correlation coefficient.
The coefficient was calculated and the result was:
rS = 0.056
This is a very low value, which suggests that the sequence is uncorrelated with the number of brakes executed. A null hypothesis was formulated stating that there was no linear correlation between the number of brakes and the goodness of fit of the theoretical function to the empirical values, H0: ρ = 0, versus an alternative hypothesis rejecting it.
The level of significance was kept as before. For the known sample size and α = 0.05, the critical value was 0.368 (Table 9.14).
The empirical value was much lower than the critical one, so there were no grounds to reject the verified hypothesis.
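The rank correlation used in this stationarity check can be sketched as follows. This is a generic illustration, not the author's computation: the helper names and the five-point data set are hypothetical, and Spearman's coefficient is obtained here as the Pearson correlation of the ranks, with tied values given their average rank.

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(xs, ys):
    """Spearman's rank correlation coefficient."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical residuals against the number of brakes executed.
brakes = [0, 60, 120, 180, 240]
resid = [0.03, -0.01, 0.02, -0.02, 0.01]
r_s = spearman(brakes, resid)
print(round(r_s, 3))  # -0.5
```

The absolute value of the coefficient is then compared with the tabulated critical value for the given sample size.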
Let us continue the investigation, starting with a test of the dispersion.
i     ui
0     0.048
1     0.051
2     0.063
3     0.026
4     0.029
5     −8.05 ⋅ 10⁻³
6     −5.2 ⋅ 10⁻³
7     −0.032
8     −0.039
9     −0.047
10    −0.074
11    −0.081
12    −0.058
13    −0.055
14    7.6 ⋅ 10⁻³
15    −0.02
16    −0.027
17    −0.064
18    0.039
19    0.132
20    0.105
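As a cross-check, the tabulated residuals can be fed into formula (6.9), dividing the sum of squares by n − 2. The sketch below is only an illustration; because the tabulated values are rounded, small differences from the figures quoted in the text are to be expected.

```python
import math

# Residuals u_0 ... u_20 copied from the table above (in mm).
residuals = [
    0.048, 0.051, 0.063, 0.026, 0.029, -8.05e-3, -5.2e-3,
    -0.032, -0.039, -0.047, -0.074, -0.081, -0.058, -0.055,
    7.6e-3, -0.02, -0.027, -0.064, 0.039, 0.132, 0.105,
]

n = len(residuals)
mean_u = sum(residuals) / n
# Unbiased estimate of the residual standard deviation, formula (6.9).
s_u = math.sqrt(sum(u * u for u in residuals) / (n - 2))

print(f"n = {n}, |mean| = {abs(mean_u):.1e}, s_u = {s_u:.3f} mm")
```

The standard deviation obtained this way agrees with the reported value of 0.060 mm to three decimal places, and the mean is of the order of 10⁻⁴, i.e. negligibly different from zero.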
By dividing the sequence of residuals in half and calculating the standard deviation for each sub-sequence (subsample), we have:
$$S_1 = 0.028 \text{ mm} \qquad S_2 = 0.052 \text{ mm}$$
These figures differ noticeably at first glance. It is necessary to verify whether this difference is statistically significant. The test that can be applied in this case is the one based on a comparison of the variances of random variables.
After calculating the variances, a null hypothesis was formulated stating that these variances do not differ significantly, $H_0: \sigma_1^2 = \sigma_2^2$. Looking at Figure 6.4, we can suspect that the dispersion increases. Thus, the alternative hypothesis can be formulated as $H_1: \sigma_1^2 < \sigma_2^2$. The test is based on Snedecor’s F statistic because:
$$\frac{S_2^2}{S_1^2} = F(n_2 - 1,\, n_1 - 1)$$
where n1, n2 are the sizes of the first and the second subsample, respectively⁴.
Calculating, we have $S_2^2 / S_1^2 = 3.45$. Compare this value with the corresponding critical one for a level of significance α = 0.05 and (9, 9) degrees of freedom, which is:
F0.05(9, 9) = 3.18 (Table 9.6)
The empirical value is above the critical one⁵. The null hypothesis should be rejected at the presumed level of significance. The dispersion in the second half of the observations is significantly greater than in the first half. It looks as though we are right to suggest that the dispersion increases with an increase in the number of brakes that were executed.
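The arithmetic of this comparison can be reproduced with a short sketch. The function name `variance_ratio` is hypothetical; the two standard deviations and the critical value are the ones quoted in the text.

```python
def variance_ratio(s_first, s_second):
    """Ratio of the second subsample variance to the first one."""
    return (s_second / s_first) ** 2


f_emp = variance_ratio(0.028, 0.052)  # standard deviations in mm
f_crit = 3.18  # F_0.05(9, 9), as quoted from Table 9.6

print(round(f_emp, 2), f_emp > f_crit)  # 3.45 True
```

Placing the larger variance in the numerator matches the one-sided alternative hypothesis that the dispersion increases.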
The last step in the investigation of the residuals is autocorrelation testing.
Calculate the correlation coefficients between the values that are one, two and three steps apart. The results of the calculations are as follows:
$$r_1^{(a)} = \ldots \qquad r_2^{(a)} = 0.374 \qquad r_3^{(a)} = 0.042$$
4 To be more precise, it is assumed that the subsamples are taken from a Gaussian distribution. It looks as though this assumption holds for the residuals considered here.
5 Remember that if the alternative hypothesis rejects only what the null hypothesis says, the critical value is the quantile F0.025(9, 9).
Figure 6.4. The residuals of the function of the linear mass loss of disc brake lining. (Axes: j from 0 to 20, horizontal; uj from −0.1 to 0.2, vertical.)
Let us check the null hypothesis that states a lack of autocorrelation of order c of the investigated random variable, H0: ρc = 0, versus the alternative hypothesis H1: ρc > 0.
A measure that allows the null hypothesis to be tested, as we already know, is the statistic $\chi^2(c)$, which is based on the squared autocorrelation coefficient $\left[r_c^{(a)}\right]^2$.
By making all of the necessary calculations and taking the critical values from the Chi-squared distribution for a presumed level of significance α = 0.05, we have the following results, where the numbers in brackets are the critical values:
order 1: 11.283 (3.841)
order 2: 2.658 (5.991)
order 3: 0.761 (7.815)
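A sketch of the autocorrelation calculation is given below. Two assumptions are made that are not confirmed by the text: the coefficient of order c is computed as the Pearson correlation between the sequence and itself shifted by c steps, and the test statistic is taken as (n − c) times the squared coefficient. The second assumption reproduces the reported second-order value from the quoted coefficient 0.374 and n = 21, but it should be read as a plausible form only. The function names are hypothetical.

```python
import math


def lag_correlation(series, lag):
    """Pearson correlation between the series and itself shifted by `lag`."""
    if lag < 1 or lag > len(series) - 2:
        raise ValueError("lag must leave at least two overlapping pairs")
    a = series[:-lag]
    b = series[lag:]
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    # Note: a constant sub-sequence gives zero variance and no defined value.
    return cov / math.sqrt(var_a * var_b)


def chi2_statistic(r, n, lag):
    """Assumed form of the statistic: (n - lag) * r**2."""
    return (n - lag) * r ** 2


# Second-order coefficient quoted in the text, sample size n = 21.
chi2_order2 = chi2_statistic(0.374, 21, 2)
print(round(chi2_order2, 3))  # 2.658

# Hypothetical alternating sequence: perfectly negative first-order value.
print(round(lag_correlation([1, -1, 1, -1, 1, -1], 1), 3))  # -1.0
```

Each statistic is then compared with the Chi-squared critical value quoted for the corresponding order.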
Only the autocorrelation of the first order is statistically significant. This is important information because in the majority of cases the existence of autocorrelation is connected with physical reasons. Rarely is autocorrelation connected with a purely random arrangement of numbers. In the case being considered, it would be worthwhile to undertake an investigation to identify these reasons, i.e. the physical process that is generating the autocorrelation and very likely causing the increase in the dispersion.
We can only state that when the lining is successively worn, important information for short-term prediction will be information on the wear at the current moment of time. The existence of autocorrelation of the residuals causes the statistical properties of the applied estimators to deteriorate but improves the process of inference about the future. The problem of forecasting the degree of wear of a lining can be significant for the functional reliability of the brake. However, what is more important is that this information is vital from the point of view of safety. ◀
6.3 LINEAR TRANSFORMATIONS AND MULTIDIMENSIONAL MODELS