4.5 Data analysis methods
4.5.2 Hyperspectral analysis
As outlined in section 4.4, 154 metrics were extracted from the hyperspectral datasets: 5 from the ITC classification and 149 from the spectral indices. These combined both leaf-on and leaf-off datasets. The statistical approaches used to derive models for predicting field-level attributes from hyperspectral data are described below. Hyperspectral statistics were derived for the 30 × 30 m areas corresponding to the field-plot locations. Data from the 21 field plots visited in 2010 were then regressed, one attribute at a time, against the hyperspectral values in order to generate the equations required for predicting the field plot-level metrics over the whole study site.
4.5.2.1 OLS linear multiple regression
As in section 4.5.1, an OLS regression analysis was performed using SPSS in order to determine the nature of the relationships between a field plot-level metric (dependent variable) and one or more hyperspectral-derived metrics (explanatory variables). Multiple forward stepwise regression was used, as described above, to produce minimum adequate models predicting field plot-level attributes from the hyperspectral-derived metrics.
4.5.2.2 Akaike's Information Criterion
As an alternative method of deriving the regression relationships, Akaike's Information Criterion (AIC) was explored. In essence, AIC balances the number of parameters against the fit to the data (likelihood). This technique was implemented using the R statistical software. AIC is a measure of the relative ‘goodness of fit’ of a statistical model and is defined as:
$$\mathrm{AIC} = 2k - 2\ln(L) \tag{4.12}$$
where k is the number of parameters in the regression model, and L is the maximized value of the likelihood function for the estimated model. A small value of AIC indicates a better combination of simplicity and fit to the data.
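The corrected form, AICc, adjusts AIC for finite sample size:
$$\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n-k-1} \tag{4.13}$$
where n denotes the sample size. AICc therefore includes a penalty correction for extra parameters.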
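As an illustration of the two criteria, the following minimal sketch computes AIC and AICc for an OLS model with normally distributed errors. It is not the R code used in this study. The residual sum of squares and the parameter count are hypothetical values, and k is assumed here to count the slopes, the intercept and the error variance.

```python
import math


def gaussian_log_likelihood(rss: float, n: int) -> float:
    """Maximised log-likelihood of an OLS fit with normal errors."""
    return -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)


def aic(log_lik: float, k: int) -> float:
    """AIC = 2k - 2 ln(L)."""
    return 2 * k - 2 * log_lik


def aicc(log_lik: float, k: int, n: int) -> float:
    """AIC plus the small-sample correction 2k(k + 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined when n <= k + 1")
    return aic(log_lik, k) + 2 * k * (k + 1) / (n - k - 1)


n_plots = 21   # number of field plots
rss = 12.5     # hypothetical residual sum of squares
k = 4          # assumed: 2 slopes + intercept + error variance

log_lik = gaussian_log_likelihood(rss, n_plots)
aic_value = aic(log_lik, k)
aicc_value = aicc(log_lik, k, n_plots)
print(f"AIC = {aic_value:.2f}, AICc = {aicc_value:.2f}")
```

With only 21 plots the correction term is not negligible: for k = 4 it adds 2.5 to the AIC, and it grows rapidly as k approaches n.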
Within R, the ‘MuMIn’ (Multi-Model Inference) package (Barton, 2012) was used to run the AICc analysis, using the ‘dredge’ function. This function generates a set of models from combinations (subsets) of the terms in the global model, with optional rules for inclusion. The function runs through each possible combination of variables in order to derive the most significant regression equation, i.e. the one accounting for the most variance. Unfortunately, the tool could not accept more than 30 input metrics due to computer memory limitations.
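The all-subsets idea can be sketched in a few lines. The example below is an assumption-laden illustration rather than the MuMIn implementation: it fits every subset of three hypothetical predictors (`metric_a`, `metric_b`, `metric_c`) to synthetic data for 21 plots by ordinary least squares, then ranks the models by AICc. The small normal-equations solver is included only to keep the example self-contained.

```python
import itertools
import math
import random


def ols_rss(X, y):
    """Residual sum of squares of an OLS fit with intercept (normal equations)."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    for col in range(p):  # Gaussian elimination with partial pivoting
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, p):
            f = M[r][col] / M[col][col]
            for j in range(col, p + 1):
                M[r][j] -= f * M[col][j]
    b = [0.0] * p
    for i in reversed(range(p)):  # back substitution
        b[i] = (M[i][p] - sum(M[i][j] * b[j] for j in range(i + 1, p))) / M[i][i]
    return sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2
               for r, yi in zip(rows, y))


def aicc_from_rss(rss, n, n_predictors):
    """AICc of a Gaussian OLS model; k = slopes + intercept + error variance."""
    k = n_predictors + 2
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    return 2 * k - 2 * log_lik + 2 * k * (k + 1) / (n - k - 1)


def all_subsets(columns, y):
    """Fit every subset of the predictors and return (AICc, subset) sorted by AICc."""
    names = list(columns)
    n = len(y)
    ranked = []
    for size in range(len(names) + 1):
        for subset in itertools.combinations(names, size):
            X = [[columns[name][i] for name in subset] for i in range(n)]
            ranked.append((aicc_from_rss(ols_rss(X, y), n, size), subset))
    return sorted(ranked)


# Synthetic data: only metric_a is related to the response (hypothetical names).
rng = random.Random(1)
n_plots = 21
columns = {name: [rng.gauss(0, 1) for _ in range(n_plots)]
           for name in ("metric_a", "metric_b", "metric_c")}
y = [2.0 * x + rng.gauss(0, 0.5) for x in columns["metric_a"]]

ranked = all_subsets(columns, y)
best_aicc, best_subset = ranked[0]
print(len(ranked), "models fitted; best subset:", best_subset)
```

Three predictors give 2^3 = 8 candidate models. The count doubles with every additional predictor, so 30 predictors already imply 2^30 (over one billion) subsets, which is consistent with the memory limit noted above.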
The large number of candidate predictor variables (≥155) was problematic, as the number of potential combinations was vast and there was therefore a high risk of identifying spurious relationships. To avoid this problem, a further phase of modelling was carried out in an attempt to identify those variables which would be significant while reducing the potential for collinearity. The approach described below is an adaptation of the methods outlined in Langton et al. (2010) and Burnham and Anderson (2002).
A ‘data mining’ exercise was conducted in order to investigate other important predictor variables. To determine which variables had the most potential for prediction of forest attributes, automatic stepwise AICc selection was applied to a random subset of six predictor variables, and this was repeated for 500,000 iterations. Each AICc model was assessed using an ANOVA test, in which each input variable had a corresponding F-test and p-value. As before, a variable was considered significant if p ≤ 0.05. The variables found to be significant were recorded for each iteration, giving a table summarising which variables were significant in each random subset. A results table was then produced in which the number of times each variable was significant was counted, as a measure of which variables were of most relevance to a given field plot-level variable. This was applied to each of the field-level variables. A full listing of the R code for this task is presented in Appendix D section D.1.
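The bookkeeping behind this random-subset screening can be sketched as follows. This is not the R code of Appendix D. The function `select_significant` is a hypothetical stand-in for the stepwise AICc fit and ANOVA F-test on one subset; here it is replaced by a toy rule so that the example runs on its own. The predictor names are placeholders and the iteration count is reduced.

```python
import random
from collections import Counter


def tally_significant(predictors, select_significant,
                      subset_size=6, iterations=1000, seed=0):
    """Count how often each predictor is flagged across random subsets."""
    rng = random.Random(seed)
    counts = Counter({name: 0 for name in predictors})
    for _ in range(iterations):
        subset = rng.sample(predictors, subset_size)  # six random predictors
        for name in select_significant(subset):       # those with p <= 0.05
            counts[name] += 1
    return counts


def toy_selector(subset):
    """Hypothetical stand-in: flags two 'informative' predictors when drawn."""
    informative = {"index_007", "index_042"}
    return [name for name in subset if name in informative]


predictors = [f"index_{i:03d}" for i in range(154)]  # placeholder names
counts = tally_significant(predictors, toy_selector, iterations=2000)

# The 20 most frequently significant predictors go forward to the next stage.
top20 = [name for name, _ in counts.most_common(20)]
print(counts.most_common(3))
```

In a real run the selector would fit the model to the plot data, and the counts would reflect how consistently each metric remains significant across different combinations of companion variables.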
Following this process, the 20 predictor variables with the highest counts were input into a further AICc process in order to derive a regression equation. If variables were known a priori to have no relation to the dependent variable, they were removed and the next best predictor variable was added. At this stage a limit was imposed on the number of predictor variables allowed into the stepwise AICc regression in each step: a maximum of 6 of the 20 predictor variables could be entered in any single iteration of the model, in order to keep processing time and system memory requirements manageable. The AICc procedure was run for each possible combination of 1 to 6 variables. The delta-AIC value of each model (the difference between its AICc and that of the best-ranked model) was then assessed in order to determine the relative likelihood of the candidate model. When delta-AIC was less than or equal to 2, the given model was considered to be within the range of plausible models that best fit the observed data (Burnham and Anderson, 2002). Therefore any model with a delta-AIC above 2 was discounted.
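The delta-AIC screening, together with the Akaike weights commonly reported alongside it, can be illustrated with a short sketch. The AICc values and model names below are hypothetical. The weight formula follows the standard definition, in which each weight is exp(-delta/2) normalised over the candidate set.

```python
import math


def akaike_table(aicc_by_model, max_delta=2.0):
    """Return delta, Akaike weight and plausibility flag for each model."""
    best = min(aicc_by_model.values())
    deltas = {m: v - best for m, v in aicc_by_model.items()}
    rel_lik = {m: math.exp(-0.5 * d) for m, d in deltas.items()}
    total = sum(rel_lik.values())
    return {
        m: {
            "delta": deltas[m],
            "weight": rel_lik[m] / total,
            "plausible": deltas[m] <= max_delta,  # retained if delta <= 2
        }
        for m in aicc_by_model
    }


# Hypothetical AICc values for three candidate models.
candidates = {"model_1": 100.0, "model_2": 101.5, "model_3": 104.0}
table = akaike_table(candidates)
plausible = sorted(m for m, row in table.items() if row["plausible"])

for name, row in table.items():
    print(f"{name}: delta = {row['delta']:.1f}, weight = {row['weight']:.3f}")
```

In this example `model_1` and `model_2` are retained, whereas `model_3` (delta = 4.0) is discounted.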
As in the previous section, a number of diagnostic tests were applied to assess the regression assumptions in this model, as listed in Table 4.14. Significance tests such as ANOVA and Student's t-test were available as before. The diagnostics used to detect multicollinearity were the VIF and tolerance, calculated using the ‘faraway’ package in R (Faraway, 2011), and the condition index, calculated using the ‘perturb’ package (Hendrickx, 2012). A full listing of this R code is presented in Appendix D section D.2. If the model failed these diagnostic tests, the variable(s) identified as not significant and/or collinear were removed from the analysis and the AICc procedure was re-run. Efforts were made to limit the standard error of the model. AICc delta and weight values were reported for each model.
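For reference, the VIF and tolerance diagnostics can be sketched from first principles. For predictor j, the tolerance is 1 − R²ⱼ and VIF = 1 / tolerance, where R²ⱼ is obtained by regressing predictor j on the remaining predictors. The code below is an illustrative assumption, not the ‘faraway’ implementation. The three predictors are made-up values, with `x2` constructed to be almost identical to `x1`.

```python
def r_squared(X, y):
    """R-squared of an OLS fit with intercept, solved via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    for col in range(p):  # Gaussian elimination with partial pivoting
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, p):
            f = M[r][col] / M[col][col]
            for j in range(col, p + 1):
                M[r][j] -= f * M[col][j]
    b = [0.0] * p
    for i in reversed(range(p)):  # back substitution
        b[i] = (M[i][p] - sum(M[i][j] * b[j] for j in range(i + 1, p))) / M[i][i]
    rss = sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2
              for r, yi in zip(rows, y))
    mean_y = sum(y) / len(y)
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - rss / tss


def vif_table(columns):
    """Return {predictor: (VIF, tolerance)} for a dict of predictor columns."""
    result = {}
    for name, target in columns.items():
        others = [col for other, col in columns.items() if other != name]
        X = [list(row) for row in zip(*others)]
        tolerance = 1.0 - r_squared(X, target)
        result[name] = (1.0 / tolerance, tolerance)
    return result


x1 = [float(i) for i in range(1, 11)]
columns = {
    "x1": x1,
    "x2": [x + (0.1 if i % 2 == 0 else -0.1) for i, x in enumerate(x1)],
    "x3": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0],
}
diagnostics = vif_table(columns)
for name, (vif, tol) in diagnostics.items():
    print(f"{name}: VIF = {vif:.1f}, tolerance = {tol:.4f}")
```

Here `x1` and `x2` return very large VIF values, which would lead to one of them being removed before the AICc procedure is re-run.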