Geometallurgical Modeling Methodology
4.3. Classification Methods
4.4.1. Regression
The most widely used statistical methods for numerical modeling are simple and multiple regressions, which may be linear or nonlinear. In this thesis, regression is used as one way of predicting comminution parameter values. In simple linear regression, variation in the dependent variable is attributed to changes in a single independent variable. In some cases, however, several factors affect the dependent variable simultaneously, and multiple regression analysis combines their effects in a single model. In this thesis the dependent variables are comminution attributes (i.e. A*b and BMWi) and the independent variables are petrophysical properties (density, magnetic susceptibility, P-wave velocity and P-wave amplitude).
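As a minimal sketch of the multiple linear regression described above, the following fits a model with four predictors by ordinary least squares. The data are synthetic and the coefficients are assumed purely for illustration; the four columns only stand in for the petrophysical properties, and the function names `fit_multiple_linear` and `predict` are hypothetical.

```python
import numpy as np

# Synthetic, illustrative data only: 50 samples and 4 predictors standing in
# for density, magnetic susceptibility, P-wave velocity and P-wave amplitude.
# None of these numbers come from the thesis.
rng = np.random.default_rng(0)
n_samples = 50
X = rng.normal(size=(n_samples, 4))
true_coefs = np.array([2.0, -1.0, 0.5, 0.0])  # assumed for the demo
y = 10.0 + X @ true_coefs + rng.normal(scale=0.3, size=n_samples)


def fit_multiple_linear(X, y):
    """Ordinary least squares fit of y = b0 + b1*x1 + ... + bk*xk."""
    design = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs  # coefs[0] is the intercept, coefs[1:] the slopes


def predict(coefs, X):
    """Evaluate the fitted linear model on a predictor matrix."""
    return coefs[0] + X @ coefs[1:]


coefs = fit_multiple_linear(X, y)
y_hat = predict(coefs, X)
```

Simple linear regression is the special case in which `X` has a single column.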
When relating petrophysical properties to comminution attributes, a linear model can provide a good benchmark against which to judge more complex techniques. When several petrophysical variables must be combined to estimate the comminution parameter, it may become necessary to extend the regression model to account for nonlinear effects.
The coefficient of determination (R²) is a regression statistic that reflects how much of the variability in the dependent variable is explained by the model. It varies from 0 to 1, and the closer the value is to 1, the stronger the correlation. In simple linear regression, R² is simply the square of the correlation coefficient, and many researchers incorrectly interpret it as a reliable measure of model quality on its own. There are cases where the R² between a dependent and an independent attribute is higher for a nonlinear function than for a linear one, yet the nonlinear regression formula is not necessarily the more accurate predictor. The root mean square (RMS) error, a commonly used measure of error, can be applied as a guide in selecting an appropriate regression function. The RMS error measures the difference between the known values and the values predicted by the regression equation. Figure 4.1 shows the correlation between two attributes (X and Y) using a nonlinear and a linear regression fit. Although the nonlinear fit (a power function) has a higher coefficient of determination (R² = 0.68) than the linear fit (R² = 0.46), the RMS error of the linear fit is about 4% lower than that of the nonlinear one. It is therefore desirable to calculate the RMS error as a supplementary parameter for judging the accuracy of a model.
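The comparison between R² and RMS error can be sketched as follows. The data are synthetic, not those of Figure 4.1. The power function is fitted here by linear regression in log-log space, which is an assumption about how such a fit might be obtained; in that case the R² belongs to the transformed values while the RMS error is computed in the original units, which is one way the two measures can rank models differently.

```python
import numpy as np


def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rms_error(y, y_hat):
    """Root mean square difference between known and predicted values."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


# Synthetic data for illustration only (not the Figure 4.1 dataset).
rng = np.random.default_rng(1)
x = rng.uniform(500.0, 6000.0, size=80)
y = 0.02 * x + rng.normal(scale=10.0, size=80)
y = np.clip(y, 1.0, None)  # keep y positive so that log(y) is defined

# Linear fit y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, 1)
y_lin = slope * x + intercept

# Power fit y = a * x**b, obtained as a straight line in log-log space.
b, log_a = np.polyfit(np.log(x), np.log(y), 1)
y_pow = np.exp(log_a) * x ** b

report = {
    "linear": {"r2": r_squared(y, y_lin), "rms": rms_error(y, y_lin)},
    "power": {
        # R² of the straight line fitted to the log-transformed values
        "r2_log_space": r_squared(np.log(y), log_a + b * np.log(x)),
        # RMS error evaluated in the original units of y
        "rms": rms_error(y, y_pow),
    },
}
```

Comparing the two `rms` entries gives both models a common scale, which the two R² values do not share when one of them is computed on transformed data.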
In model building using regression analysis, it is highly desirable to simplify a multiple regression equation where possible, for better understanding and ease of use. Simplification is normally achieved by reducing the number of parameters in the model; the difficulty lies in selecting the most important attributes. Stepwise regression analysis (Tan et al., 2006), which applies statistical criteria to attribute selection, can be used for this purpose. The stepwise process can be carried out in two ways: Backward and Forward regression.
In the Backward method, all of the input parameters (i.e. predictors) are initially included in the model. The variable that is least significant based on its p-value (Brownlee, 1960) is then removed and the model is refitted. Normally, a p-value less than 0.05 is considered significant and a p-value greater than 0.05 insignificant in regression analysis. This parameter is automatically calculated by most statistical software, e.g. Statistica. Forward regression starts with an empty model. The variable that is most significant, based on its p-value when it is the only predictor in the regression equation, is placed in the model first. Each subsequent step then adds the variable with the greatest statistical significance among those remaining.
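The two stepwise procedures can be sketched as below, using NumPy for the least squares fit and `scipy.stats.t.sf` for the two-sided t-test p-values. This is an illustrative sketch, not the Statistica procedure: the function names are hypothetical, the data are synthetic, and the stopping rules (stop removing once every remaining predictor has p ≤ 0.05; stop adding once no candidate has p ≤ 0.05) are assumptions layered on the 0.05 threshold given in the text.

```python
import numpy as np
from scipy import stats


def ols_p_values(X, y):
    """Two-sided t-test p-value for each column of X (intercept is fitted
    but its p-value is not returned)."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coefs
    dof = n - (k + 1)                       # residual degrees of freedom
    sigma2 = resid @ resid / dof            # residual variance estimate
    cov = sigma2 * np.linalg.inv(design.T @ design)
    t_stats = coefs / np.sqrt(np.diag(cov))
    p = 2.0 * stats.t.sf(np.abs(t_stats), dof)
    return p[1:]


def backward_elimination(X, y, alpha=0.05):
    """Start with all predictors; drop the least significant one and refit
    until every remaining predictor has p <= alpha (assumed stopping rule)."""
    selected = list(range(X.shape[1]))
    while selected:
        p = ols_p_values(X[:, selected], y)
        worst = int(np.argmax(p))
        if p[worst] <= alpha:
            break
        selected.pop(worst)
    return selected


def forward_selection(X, y, alpha=0.05):
    """Start empty; at each step add the candidate with the smallest p-value,
    stopping when no candidate has p <= alpha (assumed stopping rule)."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        # p-value of each candidate when added to the current model
        trials = [(ols_p_values(X[:, selected + [j]], y)[-1], j)
                  for j in remaining]
        best_p, best_j = min(trials)
        if best_p > alpha:
            break
        selected.append(best_j)
        remaining.remove(best_j)
    return selected


# Synthetic data: only columns 0 and 1 actually influence y.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

backward = backward_elimination(X, y)
forward = forward_selection(X, y)
```

Both functions return column indices of the retained predictors. The two procedures need not agree on real data, which is why the results of each are compared afterwards.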
[Figure 4.1 panels: (a) power-function fit, y = 0.0011x^1.3146, R² = 0.6782; (b) linear fit, y = 0.0188x - 8.2769, R² = 0.4553. Both panels plot Y (0 to 160) against X (0 to 6000).]
Figure 4.1. Correlation between the X and Y datasets showing (a) a nonlinear and (b) a linear regression fit. Note that although the coefficient of determination for fit (a) is higher than for fit (b), the regression model in (b) is more accurate. The RMS error values for (a) and (b) are 25.11 and 24.15 respectively.
For the geometallurgical class definition methods described in this chapter, stepwise regression has been applied to both the GC and PC approaches (Chapters 5 and 6). First, all petrophysical properties (including parameter averages and their standard deviations) are included in a standard multiple regression analysis, and R² and RMS error are recorded for this full model. Stepwise regression (using the forward and backward approaches) is then conducted to identify the most significant parameters, with selection based on p-value. For each approach (forward and backward), R² and RMS error are compared with those of the standard multiple regression including all variables. Finally, the ‘best’ model is selected on the combined basis of the smallest number of parameters and the lowest RMS error. It should be noted that in some cases two models may differ only very slightly in RMS error. In such cases the preferred model is the one with the smaller number of parameters.
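The final selection step can be sketched as a small rule: keep the models whose RMS error is within a small margin of the lowest, then prefer the one with the fewest parameters. The text does not quantify a "very small" difference, so the 1% tolerance, the function name `select_best_model` and the candidate values below are all hypothetical.

```python
def select_best_model(candidates, rms_tolerance=0.01):
    """Pick the model with the fewest parameters among those whose RMS error
    is within rms_tolerance (relative, assumed value) of the lowest RMS error.

    candidates: list of dicts with keys 'name', 'n_params' and 'rms'.
    """
    lowest_rms = min(c["rms"] for c in candidates)
    near_best = [c for c in candidates
                 if c["rms"] <= lowest_rms * (1.0 + rms_tolerance)]
    # Fewest parameters first; ties broken by lower RMS error.
    return min(near_best, key=lambda c: (c["n_params"], c["rms"]))


# Made-up values for illustration only (not results from the thesis).
candidates = [
    {"name": "all variables", "n_params": 8, "rms": 5.20},
    {"name": "forward stepwise", "n_params": 3, "rms": 5.23},
    {"name": "backward stepwise", "n_params": 4, "rms": 5.21},
]

best = select_best_model(candidates)
```

With these made-up numbers all three models fall within the tolerance, so the three-parameter forward stepwise model is preferred even though the full model has a marginally lower RMS error.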