Correlation and regression analysis

Chapter 3 Correlation Method

3.0 Correlation and regression analysis

The correlation method used in this work is multiple linear regression analysis (MLRA). This is a common statistical technique and is an extension of the simple regression of y against x. A set of y-values, the dependent variable, may be linearly related to the independent variable x. If a plot of the y values against the x values is linear, a straight line can be drawn through the points and the relationship can be written as,

y = mx + c (3.1)

Here c is the intercept on the y-axis and m is the slope of the line. The x-variable is the explanatory or independent variable. If the points in the plot are scattered, the position of the straight line is not obvious, and whichever line is chosen will affect the predicted y values. In such a case, the method of least squares is often used to decide on the best straight line. This is done by taking all the deviations between the observed values of the dependent variable and the values estimated from the line, squaring them and adding them up. The criterion of least squares is that the best line is the one with the smallest sum of squared deviations. Such a line can be drawn when only one independent variable is present, but the procedure becomes complicated as the number of variables increases, as in the case of the general solvation equation. In this case the calculation is carried out by computer.
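The least-squares criterion for a single independent variable can be illustrated with a minimal Python sketch. The function names and the data values are hypothetical and are not taken from this work; the closed-form expressions for m and c are the standard ones for simple linear regression.

```python
def fit_line(x, y):
    """Return the slope m and intercept c of the least-squares line y = m*x + c."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sums of cross-products and of squares of the deviations from the means
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    m = s_xy / s_xx
    c = y_mean - m * x_mean
    return m, c


def sum_squared_deviations(x, y, m, c):
    """Sum of squared deviations between observed y and the line m*x + c."""
    return sum((yi - (m * xi + c)) ** 2 for xi, yi in zip(x, y))


# Illustrative data only (assumed, not measured values)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

m, c = fit_line(x, y)
best = sum_squared_deviations(x, y, m, c)
print(f"m = {m:.3f}, c = {c:.3f}, sum of squared deviations = {best:.4f}")
```

Any other choice of slope or intercept gives a larger sum of squared deviations for the same points, which is the sense in which the fitted line is the best one.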

Correlation gives the association between the variables, but it is regression that uses the independent variables to help explain the variation in the dependent variable. Regression thus estimates the parameters of the model, provides a test of the validity of the model, and allows the confidence limits of the parameters to be calculated. Since correlation alone cannot measure the success of the relationship between the variables, other statistics are used, such as the standard deviation of the estimate, sd, the correlation coefficient, r, and the F-statistic. The standard deviation is the square root of the sum of squares of the deviations of the individual results from the mean, divided by one less than the number of results in the set, and is given by:

$sd = \left[ \sum_i (x_i - \bar{x})^2 / (n - 1) \right]^{1/2}$ (3.2)
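As a minimal sketch, the standard deviation defined here (the square root of the sum of squared deviations from the mean, divided by n - 1) can be evaluated directly in Python. The function name and the data are hypothetical; the comparison with `statistics.stdev` from the standard library is included only as a cross-check.

```python
import math
import statistics


def standard_deviation(values):
    """Square root of [sum of squared deviations from the mean / (n - 1)]."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two results are needed")
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


# Illustrative results only (assumed values)
results = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

sd = standard_deviation(results)
print(f"sd = {sd:.4f}")
print(f"statistics.stdev gives {statistics.stdev(results):.4f}")
```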

The standard deviation has the same units as the property being measured. It becomes a more reliable expression of precision as n gets larger, so sd is a means of assessing the reliability of an equation, and is also used in considering the significance of deviant points. The sd measures the spread of a distribution around the mean. A low sd value indicates a low spread, i.e. a good relationship, and a high sd value indicates that the points in the data set are widely spread about the mean, which is unfavourable in MLRA. The correlation coefficient gives a measure of the success of the correlation of the dependent variable (y) against the independent variable (x). The equation for the correlation coefficient, which takes into account the standard deviation, is:

$r^2 = 1 - sd^2 / s_y^2$ (3.3)

Here $s_y^2 = \sum (y - \bar{y})^2 / n$ and $\bar{y}$ is the mean, $\sum y / n$. The quantity $s_y^2$ is the variance of the sample values of y. From this equation it can be seen that as sd → 0, r → 1, i.e. the correlation gets nearer to perfection. The correlation coefficient r is a measure of how closely the data set fits the relationship given by the MLRA and can range from -1 to 1. A value of -1 or 1 indicates that the data set is explained perfectly by the correlation equation, while a value of zero means there is no relationship between the data set and the MLRA.

A negative value of r may be interpreted as a poor correlation by an inexperienced eye, so more often than not it is $r^2$ that is quoted in relation to multiple linear regression. $r^2$ takes values from zero to one and is an indicator of how well the regression analysis explains the relationship among the variables. Although the correlation coefficient r is often used, it is $r^2$ that is meaningful, because it gives the fraction of the variance of y that is explained by the regression equation. It is more convenient to convert it to a percentage. Thus when $r^2$ = 0.90 the regression equation explains about 90 % of the variance. It is very important that the correlation coefficient should be considered in relation to the number of data points correlated.

Student's t-test can be used to investigate the significance of the coefficient, and assumes a normal distribution of the errors. The t-test is set at a confidence level, usually 90 %, but this can go up to 99 % depending on the accuracy of the test required. In MLR analysis, the t-test is performed on each individual variable to test its significance. Sometimes not all the variables are necessary, and this is indicated by the level of significance. Another significance test used in MLRA is the F-statistic, or Fisher statistic. This test accounts for the number of variables, v, present and the number of data points, n.
The value of the F-statistic gives an indication of the quality of the regression and is used to determine whether the observed relationship between the dependent and independent variables occurs by chance; the higher the value of F, the better the regression.

$F = \frac{r^2}{1 - r^2} \cdot \frac{n - v}{v - 1}$ (3.4)

Here r is the correlation coefficient, n is the number of data points and (v - 1) is the number of degrees of freedom, where v is the number of variables. From the equation, it can be seen that the main factors that contribute to the improvement of the regression are n and r, because as these two parameters increase, F increases.

Once an MLRA output has been obtained it is essential to measure how reliable the relationship is, i.e. it is necessary to validate the model so that predicted values can be obtained with accuracy and confidence. The main problem with MLR is its sensitivity to collinearities among the independent variables. Collinearities occur when there is a high degree of linear correlation between two or more of the independent variables. It is therefore very important to make sure that the variables used in MLR, i.e. the solute descriptors in the case of the Abraham solvation equation, are well defined and independent. The number of data points is also important in MLRA, as a larger number improves the reliability of the correlation. It has been suggested that at least five data points are required per variable, but this should be regarded as the very minimum.
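The checks described above can be sketched in Python under some stated assumptions. The F-statistic is taken here in the form F = [r^2 / (1 - r^2)] [(n - v) / (v - 1)], with v read as the total number of variables (dependent plus independent); this reading is an assumption. The descriptor names d1 to d3, their values, and the 0.9 threshold used to flag collinear pairs are hypothetical and serve only as an illustration.

```python
import math
from itertools import combinations


def f_statistic(r_squared, n, v):
    """Assumed form: F = [r^2 / (1 - r^2)] * [(n - v) / (v - 1)].

    v is read as the total number of variables (dependent plus independent),
    so v - 1 is the number of independent variables.
    """
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must lie in [0, 1)")
    if v < 2 or n <= v:
        raise ValueError("need v >= 2 and n > v")
    return (r_squared / (1.0 - r_squared)) * ((n - v) / (v - 1))


def pearson_r(a, b):
    """Linear correlation coefficient between two equally long columns."""
    n = len(a)
    a_mean, b_mean = sum(a) / n, sum(b) / n
    s_ab = sum((p - a_mean) * (q - b_mean) for p, q in zip(a, b))
    s_aa = sum((p - a_mean) ** 2 for p in a)
    s_bb = sum((q - b_mean) ** 2 for q in b)
    if s_aa == 0.0 or s_bb == 0.0:
        raise ValueError("a column with zero variance has no defined correlation")
    return s_ab / math.sqrt(s_aa * s_bb)


def collinear_pairs(descriptors, threshold=0.9):
    """Return (name_1, name_2, r) for every pair of columns with |r| > threshold."""
    flagged = []
    for (name_1, col_1), (name_2, col_2) in combinations(descriptors.items(), 2):
        r = pearson_r(col_1, col_2)
        if abs(r) > threshold:
            flagged.append((name_1, name_2, r))
    return flagged


def enough_data_points(n, n_variables, per_variable=5):
    """Check the suggested minimum of five data points per variable."""
    return n >= per_variable * n_variables


# Hypothetical descriptor columns; d2 is exactly twice d1, so the pair is collinear
descriptors = {
    "d1": [0.1, 0.4, 0.5, 0.9, 1.2, 1.5],
    "d2": [0.2, 0.8, 1.0, 1.8, 2.4, 3.0],
    "d3": [1.0, 0.2, 0.9, 0.3, 0.8, 0.4],
}

flagged = collinear_pairs(descriptors)
for name_1, name_2, r in flagged:
    print(f"{name_1} and {name_2} are collinear (r = {r:.3f})")

print("enough data points:", enough_data_points(n=6, n_variables=3))
print(f"F = {f_statistic(r_squared=0.90, n=30, v=6):.1f}")
```

A pairwise check of this kind only detects collinearity between two variables at a time; a linear dependence involving three or more variables would need a different test.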

In document Physicochemical descriptors for organic compounds of environmental importance (Page 54-57)