CHAPTER 5 CAN AN EXPLORATORY TECHNIQUE AND A
5.2.4 Modelling procedures
Three modelling techniques, generalised linear modelling (GLM) (McCullagh and Nelder 1989), boosted regression trees (BRT) and boosted generalised linear models (BGLM), were used to explore whether the presence of beetles was influenced by the 15 CWD and 9 CWD-site attributes (see Table 5.3 and Table 5.4). Multiple modelling techniques were chosen to (i) test and compare model efficiency, (ii) accommodate the large number of combinations of data possible in this study (822 logs, three sub-plots in each log, 20 rotten-wood type categories, 14 different species of trees and six beetle species), and (iii) decrease the chance of overfitting (Schapire 2003).
Generalised linear models were fitted using the CWD and CWD-site predictor variables, but their outputs were much more complicated than those presented below and they did not perform as well. For these reasons, only the AUC results are presented for the BGLM. The models were also fitted with age of regeneration from 1990s and forest class taken out, but this made no difference to model performance. The results are therefore the best-supported outcomes from a large number of exploratory model evaluations.
Generalised Linear Model (GLM)
Generalised linear modelling proceeded using a forward stepwise selection technique, which reduced the risk of overfitting. The starting model included only rotten-wood type as a predictor variable, with presence/absence as the dependent variable. All covariates (CWD and CWD-site variables) were then added to the model and the change in fit was assessed using likelihood ratio tests (based on the change in deviance between the two competing models).
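The likelihood ratio comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the thesis: the data are hypothetical, the smaller model is intercept-only rather than the rotten-wood-type model, the added covariate is a single binary variable (so the test has one degree of freedom), and the function names are invented for this example.

```python
import math

def fitted_group_proportions(y, groups):
    """Maximum-likelihood fitted probabilities for a binomial model with one
    categorical predictor: the observed presence proportion in each group."""
    counts = {}
    for yi, g in zip(y, groups):
        n, s = counts.get(g, (0, 0))
        counts[g] = (n + 1, s + yi)
    return [counts[g][1] / counts[g][0] for g in groups]

def log_likelihood(y, p):
    """Bernoulli log-likelihood of 0/1 responses y given fitted probabilities p."""
    return sum(math.log(pi if yi == 1 else 1.0 - pi) for yi, pi in zip(y, p))

def likelihood_ratio_test_1df(ll_small, ll_large):
    """Change in deviance between two nested models that differ by one
    parameter, with its chi-square (1 d.f.) p-value."""
    change_in_deviance = 2.0 * (ll_large - ll_small)
    # For 1 d.f., P(chi-square > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(max(change_in_deviance, 0.0) / 2.0))
    return change_in_deviance, p_value

# Hypothetical data: beetle presence (1) or absence (0) in 20 logs,
# the first 10 unburnt and the last 10 burnt.
presence = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
log_burnt = [0] * 10 + [1] * 10

ll_null = log_likelihood(presence, fitted_group_proportions(presence, ["all"] * 20))
ll_burnt = log_likelihood(presence, fitted_group_proportions(presence, log_burnt))
deviance_change, p_value = likelihood_ratio_test_1df(ll_null, ll_burnt)
print(round(deviance_change, 2), p_value < 0.05)
```

In a forward stepwise search, a comparison of this kind would be repeated for each candidate covariate, and a covariate would be retained only if the change in deviance was large relative to the chi-square reference distribution.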
Regression trees
Tree-based models partition the predictor space into hyper-rectangles by recursively subdividing covariate space, and then fit the most probable constant to each region. Figure 5.2 shows a simple two-dimensional example. The decision rules embodied by the tree (Figure 5.2, right) correspond to a recursive partitioning of the covariate space into rectangles (Figure 5.2, left). Within each rectangle the model fits a mean, so the fitted surface is constant within each rectangle and discontinuous across the edges. The partitioning itself is selected so that the response is as homogeneous as possible within each rectangle.
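To make the idea of choosing a partition that keeps the response homogeneous concrete, the sketch below searches for the single best binary split on one predictor, using the within-node sum of squared deviations from the node mean as the homogeneity measure. The data and function names are hypothetical, and a least-squares criterion is assumed here for simplicity; it is not necessarily the criterion used by the software in this study.

```python
def sum_of_squares(values):
    """Sum of squared deviations from the mean (0.0 for an empty node)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total_within_node_ss) for the best binary split on x.

    Candidate thresholds are midpoints between consecutive distinct x values.
    Returns (None, ss_of_unsplit_node) if no split improves on the unsplit node.
    """
    pairs = sorted(zip(x, y))
    best_threshold, best_ss = None, sum_of_squares(list(y))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical predictor values
        left = [yy for _, yy in pairs[:i]]
        right = [yy for _, yy in pairs[i:]]
        total = sum_of_squares(left) + sum_of_squares(right)
        if total < best_ss:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_ss = total
    return best_threshold, best_ss

# Hypothetical predictor values and 0/1 responses.
x = [0.2, 0.5, 0.9, 1.3, 1.6, 1.9]
y = [0, 0, 0, 1, 1, 1]
threshold, within_ss = best_split(x, y)
print(threshold, within_ss)
```

Growing a tree amounts to applying the same search again within each of the two resulting rectangles, over all predictors, until a stopping criterion is met.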
The recursive subdivision, or binary splitting, creates ‘tree growth’, as it is repeatedly applied to its own output until the stopping criterion is reached. For non-boosted trees, the stopping criterion is set low, which builds very large trees that then have to be pruned. For boosted regression trees the criterion is set higher, so branching tends to stop earlier. An effective strategy (and the one used in this study) is to ‘grow’ a large tree and then prune it by collapsing the weakest links, as identified using cross-validation (Breiman et al. 1984; Hastie et al. 2001; De’ath 2007).
One great advantage of modelling using regression trees is that the results can be presented as a simple sequence of decision rules. Other advantages are that any type of predictor variable can be used; the final model is not affected by monotone transformations of predictor variables or by predictor variables that have different measurement scales; and poorly performing predictor variables are seldom selected (Elith et al. 2008). In addition, regression trees are not sensitive to outliers, and if predictor variables have missing data, surrogates are used (Breiman et al. 1984). They represent information in a way that is intuitive and easy to visualise (Elith et al. 2008). De’ath and Fabricius (2000) give a good description of regression trees for ecological use.
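The insensitivity to monotone transformations follows from the fact that a split depends only on the ordering of the predictor values. The short check below illustrates this with hypothetical data: the best least-squares split sends the same observations to the left node whether the predictor is used on its raw scale or on a log scale. The function names are invented for this example.

```python
import math

def sum_of_squares(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def left_node_members(x, y):
    """Indices of the observations sent to the left node by the best
    single least-squares split on predictor x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best_ss, best_left = None, None
    for k in range(1, len(order)):
        if x[order[k - 1]] == x[order[k]]:
            continue  # no threshold separates identical values
        left = [y[i] for i in order[:k]]
        right = [y[i] for i in order[k:]]
        total = sum_of_squares(left) + sum_of_squares(right)
        if best_ss is None or total < best_ss:
            best_ss, best_left = total, frozenset(order[:k])
    return best_left

# Hypothetical log diameters (cm) and 0/1 presence records.
diameter = [2, 5, 9, 40, 120, 300]
presence = [0, 0, 1, 1, 1, 1]
log_diameter = [math.log(d) for d in diameter]

raw_partition = left_node_members(diameter, presence)
log_partition = left_node_members(log_diameter, presence)
print(raw_partition == log_partition)
```

Only the reported threshold changes with the scale; the grouping of observations, and therefore the fitted values, stay the same.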
Figure 5.2 Simple design of the final output of a classification tree (right), showing two predictor variables, split at two points respectively > 1.1 and < 1.1, and terminal nodes A - F (response variables). In the case of this study: >1.1= tree species and <1.1 = forest class. A and D are the beetle species most likely to be found in CWD of a particular tree species or forest class (for example, A could equal L. menalcas and D could be C. deplanata). The decision rules embodied by the tree (right) correspond to a recursive partitioning of the covariate space into rectangles (left).
Boosting
In this study, regression tree models were boosted to improve model accuracy. Rather than finding a single highly accurate prediction rule (or single best fitting model), boosting identifies and averages the results from many competing models (Schapire 2003).
For boosted regression trees, many trees are fitted in sequence. The tree fitted in the first step is the tree that gives the best overall classification. Trees fitted in subsequent steps focus on residuals from the previous step, so that when applied in concert, the fitted trees outperform any single tree. Schapire (2003) provides a good overview of the boosting approach and its applications. Boosting is unusual in that by merging many simple classification rules, it automatically performs a forward stepwise procedure of model selection (Elith et al. 2008).
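The sequential fitting to residuals can be sketched in a few lines. The example below is a simplified illustration, assuming a squared-error loss, one-split trees ("stumps"), a single predictor and a fixed learning rate; the data, function names and parameter values are hypothetical, and the packages used in this study fit a binomial model with considerably more machinery.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression tree to the residuals.

    Returns (threshold, left_mean, right_mean). Assumes x has at least
    two distinct values.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = None
    for k in range(1, len(order)):
        low, high = x[order[k - 1]], x[order[k]]
        if low == high:
            continue
        left = [residuals[i] for i in order[:k]]
        right = [residuals[i] for i in order[k:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, (low + high) / 2, left_mean, right_mean)
    return best[1:]

def boost(x, y, n_trees=50, learning_rate=0.1):
    """Fit stumps in sequence, each to the residuals left by the previous ones."""
    baseline = sum(y) / len(y)
    fitted = [baseline] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - fi for yi, fi in zip(y, fitted)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        # Add a shrunken version of the new stump to the running prediction.
        fitted = [fi + learning_rate * (left_mean if xi <= threshold else right_mean)
                  for xi, fi in zip(x, fitted)]
    return baseline, stumps, fitted

def squared_error(y, fitted):
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

# Hypothetical predictor and 0/1 response.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 1, 0, 1]
baseline, stumps, fitted = boost(x, y)
baseline_error = squared_error(y, [baseline] * len(y))
boosted_error = squared_error(y, fitted)
print(baseline_error, boosted_error)
```

The learning rate shrinks the contribution of each stump, so that many small corrections accumulate rather than any single tree dominating the fit.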
In the present study overfitting was a potential problem because the number of predictor variables exceeded the number of sites (i.e. 35 sites and >40 predictor variables) and because some descriptors were highly complex. Rotten-wood type is an example: fitting it is complicated by its large number of categories, which increases the risk of overfitting. Its inclusion is nevertheless justified because all rotten-wood types are recognisably different and cannot be pooled or re-categorised to decrease the number of terms used in the model. If they were pooled (for example, into pale vs. dark rotten-wood types), the models would become too general, potentially losing valuable ecological information.
Boosted regression trees (BRT)
Boosted regression trees were used to assess the influence of 15 CWD covariates (diameter at intersect, volume of sample spot, rotten-wood type proportion (expressed as a percentage), rotten-wood type, log length, decay class, log decay class, log off ground, log burnt, log cut, species present, adult or larva, number of individuals, tree species close to log end) and 9 CWD-site covariates (age of regenerating forest, volume sampled, volume of dead wood (CWD), diameter sampled, Rgr, Rgen and Mat/Rgr, most recent fire, broad forest class, dominant eucalypt species, non-eucalypt species) on species presence, for each species separately, while protecting against the risk of overfitting. BRT is a technique that aims to improve the performance of a single model by fitting many models and combining them for prediction. It uses two algorithms: ‘regression trees’, from the classification and regression tree group of models, and ‘boosting’, which builds and combines a collection of models. For a discussion of the important features of BRT in an ecological context and a tutorial on the model, see Elith et al. (2008) and Ridgeway (2006).
Because boosting produces hundreds or even thousands of decision trees, the R ‘gbmboost’ package (Bühlmann and Hothorn 2007) was used to estimate the relative influence of each predictor variable. Model interpretation was further facilitated by the use of partial dependency functions, which show the effect of each influential predictor variable on the response after accounting for the effect of all other predictor variables in the model (Elith et al. 2008). Although these graphs may not be a perfect representation of effect sizes for each predictor variable because of correlations or interactions among them, they provide a useful basis for model interpretation (Elith et al. 2008; Friedman and Meulman 2003). Partial dependency functions were run for the top eight performing variables. In R, the boosted regression tree statistical package is called ‘gbmboost’ but will be referred to as BRT in this thesis.
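A partial dependency function can be sketched as follows: the predictor of interest is fixed at each value on a grid, the other predictors are left at their observed values, and the model predictions are averaged over all observations. The fitted model here is a hypothetical stand-in function, and the variable names, data and grid are invented for illustration; this is not the R implementation used in the study.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` fixed at each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the original data are untouched
            modified[feature] = value  # fix the predictor of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

def toy_model(row):
    """Hypothetical fitted model returning a probability of beetle presence."""
    return 0.2 + (0.5 if row["diameter"] > 30 else 0.0) + 0.1 * row["burnt"]

# Hypothetical observations of two predictors.
rows = [
    {"diameter": 12, "burnt": 0},
    {"diameter": 25, "burnt": 1},
    {"diameter": 41, "burnt": 1},
    {"diameter": 60, "burnt": 0},
]
curve = partial_dependence(toy_model, rows, "diameter", [20, 40])
print(curve)
```

Because the averaging uses the observed values of the other predictors, strong correlations or interactions among predictors can make the resulting curve an imperfect summary, which is the caveat noted above.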
The boosting technique was also applied to the generalised linear models; these are hereafter referred to as boosted generalised linear models (BGLMs). The same predictor variables were used in the BGLMs (15 CWD variables and 9 CWD-site variables).