4.5 Data analysis methods
4.5.1 Field data analysis
The following procedures had two goals: first, to assess the similarity of the two field campaign datasets; and second, to identify any statistically significant relationships between individual field metrics. Relationships between metrics were investigated using correlation and regression modelling. This was necessary to assess whether the number of measurements required during field data collection could be reduced in future work.
4.5.1.1 Comparison of the 2010 and 2012 field populations
In order to use the validation field dataset, it was first necessary to determine the overall similarity or dissimilarity between the individual field metrics recorded in 2010 and 2012. This was necessary in order to assess which dataset was best used for validation. Two non-parametric tests for two independent samples were used to compare the two populations: the Mann-Whitney-Wilcoxon test and the Kolmogorov-Smirnov test. These tests can be used to test whether the population distributions are identical without assuming that they follow the normal distribution. Both tests were implemented in the R software.
The Mann-Whitney-Wilcoxon test is used in experiments in which there are two conditions and different subjects have been used in each condition (Field, 2013), in this case data from two field campaigns. Under the assumption that the observations are independent of one another, the observations from both groups are combined and ranked, with the average rank assigned in the case of ties. The number of ties should be small relative to the total number of observations. If the populations are identical in location, the ranks should be randomly mixed between the two samples.
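The ranking step described above can be sketched as follows. This is an illustrative Python sketch only (the study itself used R); the function names and sample values are hypothetical, and no p value is computed.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    ranks = []
    for v in values:
        below = sum(1 for w in values if w < v)
        ties = sum(1 for w in values if w == v)
        ranks.append(below + (ties + 1) / 2)
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) for two independent samples."""
    combined = list(sample_a) + list(sample_b)
    ranks = average_ranks(combined)  # both groups are ranked together
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Made-up values standing in for one metric from each campaign.
metric_2010 = [12.1, 14.3, 15.0, 16.2, 18.4]
metric_2012 = [13.0, 14.3, 17.5, 19.1, 20.2]
u_2010, u_2012 = mann_whitney_u(metric_2010, metric_2012)
```

If the two samples were well mixed in rank, the two U values would be similar; a large imbalance between them suggests a difference in location.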
The Kolmogorov-Smirnov test is a more general test that detects differences in both the locations and shapes of the distributions between the two populations (Field, 2013). The Kolmogorov-Smirnov test is based on the maximum absolute difference between the observed cumulative distribution functions for both samples. When this difference is significantly large, the two distributions are considered different.
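The test statistic described above, the maximum absolute difference between the two empirical cumulative distribution functions, can be sketched as follows. This is an illustrative Python sketch with a hypothetical function name; it computes the statistic only, not its significance.

```python
def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between two empirical CDFs."""
    n_a, n_b = len(sample_a), len(sample_b)
    d_max = 0.0
    # The empirical CDFs only change at observed values,
    # so it is enough to evaluate them at those points.
    for x in sorted(set(sample_a) | set(sample_b)):
        cdf_a = sum(1 for v in sample_a if v <= x) / n_a
        cdf_b = sum(1 for v in sample_b if v <= x) / n_b
        d_max = max(d_max, abs(cdf_a - cdf_b))
    return d_max
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap between the samples).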
For both of these statistical techniques the null hypothesis was that the two populations were identical. The hypothesis was tested by applying both tests within the R software. A significance level of p < 0.05 was used; if the p value met this criterion, the null hypothesis was rejected.
4.5.1.2 Spearman’s rho bivariate correlation
In order to determine whether there were any correlations between the collected field metrics, a bivariate correlation analysis was carried out within the SPSS (version 19) (IBM) statistical software. The bivariate correlation method used was Spearman’s rho. The Spearman correlation coefficient is a non-parametric measure of the strength and direction of the association between two variables measured on at least an ordinal scale. Spearman’s test works by first ranking the data and then applying Pearson’s equation to those ranks (Field, 2013). As the direction of the relationship between variable pairs was unknown, all correlations used the two-tailed method.
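The two steps described above (rank the data, then apply Pearson’s equation to the ranks) can be sketched as follows. This is an illustrative Python sketch rather than the SPSS procedure used in the study; the function names are hypothetical and no significance value is computed.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    ranks = []
    for v in values:
        below = sum(1 for w in values if w < v)
        ties = sum(1 for w in values if w == v)
        ranks.append(below + (ties + 1) / 2)
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient (assumes neither input is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson's correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only the ranks are used, any strictly increasing relationship gives a coefficient of 1, whether or not it is linear.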
A correlation matrix was created in which every field metric was tested against every other field metric; thus a correlation coefficient and an estimate of significance (p) were calculated for every relationship.
4.5.1.3 Ordinary Least Squares (OLS) regression analysis
Linear regression was used to determine the nature of the relationships between all field plot-level metrics. More specifically, the least squares regression approach was used. Ordinary Least Squares (OLS) is a method for estimating the unknown parameters in a linear regression model; ‘ordinary least squares’ means that the overall solution minimises the sum of the squares of the errors. A more detailed outline of this approach is available in Field (2013). The resulting relationship can be expressed by a simple formula (see Equation 4.11).
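To illustrate what minimising the sum of squared errors means, the following Python sketch fits the simplest case, one predictor and an intercept, using the closed-form least squares solution. It is an assumption-laden illustration rather than the SPSS procedure used in the study: the function names are hypothetical and the data are made up.

```python
def fit_simple_ols(x, y):
    """Closed-form least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_errors(x, y, intercept, slope):
    """Sum of squared differences between observed and fitted values."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


# Made-up values for illustration only.
x_values = [1.0, 2.0, 3.0, 4.0]
y_values = [2.9, 5.2, 6.8, 9.1]
b0, b1 = fit_simple_ols(x_values, y_values)
best_sse = sum_squared_errors(x_values, y_values, b0, b1)
```

Any other choice of intercept or slope gives a sum of squared errors at least as large as `best_sse`, which is the defining property of the OLS solution.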
The multiple linear regression approach was implemented through the SPSS software. A multiple forward stepwise regression approach was used, and the outputs were assessed for evidence supporting the assumptions of:
(i) linearity;
(ii) normality;
(iii) homogeneity of variance;
(iv) independence;
(v) model specification.
Those factors are summarised in Table 4.14.
Table 4.14 – Testing the assumptions of a regression analysis (Chen et al., 2003)
Assumption/concern: Description:
i Linearity: The relationships between the predictors and the outcome variable should be linear.
ii Normality: The errors should be normally distributed. Technically, normality is necessary only for the t-tests to be valid; estimation of the coefficients only requires that the errors be identically and independently distributed.
iii Homogeneity of variance (homoscedasticity): The error variance should be constant.
iv Independence: The errors associated with one observation are not correlated with the errors of any other observation.
v Model specification: The model should be properly specified (including all relevant variables, and excluding irrelevant variables).
Multicollinearity is an important consideration here: when there is a perfect linear relationship among the predictors, the estimates for a regression model cannot be uniquely computed (Field, 2013). As the degree of multicollinearity increases, the coefficient estimates become unstable and their standard errors can become greatly inflated. Multicollinearity can be detected by examining diagnostic statistics available in SPSS: (i) ‘tolerance’; (ii) the ‘variance inflation factor’ (VIF); and (iii) the ‘condition index’. Explanations of these metrics can be found in Field (2013). Each of these diagnostics was used to test the correlations between all of the field-plot metrics. If an assumption was found to be false, of concern or out of tolerance, one or more predictor variables were removed and the regression was re-run in order to improve it. The output was then assessed again.
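As a rough illustration of the first two diagnostics, the Python sketch below covers only the special case of two predictors, where the tolerance of each predictor is 1 minus the squared correlation between them and the VIF is the reciprocal of the tolerance. With more predictors, the squared correlation is replaced by the R-squared from regressing each predictor on all the others, which this sketch does not attempt. The condition index is not covered, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient (assumes neither input is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def tolerance_and_vif(predictor_1, predictor_2):
    """Tolerance and VIF for a model with exactly two predictors."""
    r = pearson(predictor_1, predictor_2)
    tolerance = 1 - r ** 2
    if tolerance <= 0:
        # Perfect collinearity: coefficients cannot be uniquely estimated.
        return 0.0, float("inf")
    return tolerance, 1 / tolerance
```

A tolerance near 0 (equivalently, a very large VIF) indicates that one predictor is almost a linear function of the other.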
The regression result was then assessed for significance using an Analysis of Variance (ANOVA) F-test and Student’s t-test. Where the p value was below 0.05, the result was reported as significant for both F- and t-values (Chen et al., 2003; Field, 2013). If a coefficient was not significant, it was dropped from the regression. Additionally, the Residual Standard Error (RSE) was checked for each model; of all the models produced, those which minimised the RSE were selected.
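For reference, the RSE is commonly computed as the square root of the residual sum of squares divided by the residual degrees of freedom (the number of observations minus the number of predictors minus one for the intercept). A minimal Python sketch under that definition follows; the function name is hypothetical and the study itself took this value from SPSS output.

```python
from math import sqrt


def residual_standard_error(observed, predicted, n_predictors):
    """RSE = sqrt(SSE / (n - p - 1)) for a model with an intercept."""
    dof = len(observed) - n_predictors - 1
    if dof <= 0:
        raise ValueError("not enough observations for this many predictors")
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return sqrt(sse / dof)
```

Because the denominator shrinks as predictors are added, the RSE penalises models that add predictors without a matching reduction in residual error.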
In order to get an indication of how much of the variance encountered in the dependent variable was accounted for by the regression model, the R-squared statistic was utilised. This is an overall measure of the strength of association and does not reflect the extent to which any particular independent variable is associated with the dependent variable.
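The R-squared statistic can be sketched as one minus the ratio of the residual sum of squares to the total sum of squares. The Python sketch below assumes that definition; the function name is hypothetical.

```python
def r_squared(observed, predicted):
    """Proportion of variance in the observed values explained by the model."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot
```

A model that reproduces the observations exactly gives a value of 1, while a model that always predicts the mean of the observations gives 0.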
The output from the regression model yields the inputs for the regression equation. These are the values in the regression equation for predicting the dependent variable (Y) from the independent variables (x). The regression equation is presented as:
$Y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$ [4.11]
where $b_0$ is the intercept and $b_1 \dots b_n$ are the coefficients corresponding to the independent (predictor) variables $x_1 \dots x_n$ (e.g. $b_1$ corresponds to $x_1$). This equation was generated for each individual field plot attribute, as predicted by combinations of other field plot metrics using multiple forward stepwise regression. Non-significant metrics were removed from the analysis to produce minimum adequate models. Additionally, efforts were made to limit the standard errors of the best models.
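Once the intercept and coefficients are known, applying Equation 4.11 to a new plot is a weighted sum. The Python sketch below shows this; the function name and the example numbers are hypothetical and are not fitted values from this study.

```python
def predict(intercept, coefficients, predictors):
    """Evaluate Y = b0 + b1*x1 + ... + bn*xn for one observation."""
    if len(coefficients) != len(predictors):
        raise ValueError("need exactly one coefficient per predictor")
    return intercept + sum(b * x for b, x in zip(coefficients, predictors))


# Hypothetical intercept, coefficients and predictor values.
y_hat = predict(1.0, [2.0, 0.5], [3.0, 4.0])
```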