4. Spatially explicit model predicting residual vegetation patch existences within boreal
4.2. Methods
4.2.6. Model validation
The inherent uncertainty of predictive models needs to be evaluated and quantified (Beauvais et al. 2006). Such evaluation is an integral component of model development and is useful for determining the suitability of a model for specific applications and for comparing different modelling techniques (Pearce and Ferrier 2000). The predictive ability required of a model (i.e., how accurate a model should be) depends on the conditions to which the model is applied, the types of questions asked, and the alternatives available (Jepsen 2004). In general, however, a model with high predictive performance can be used for predicting changes under alternative future scenarios and for informing resource management decisions (Beauvais et al. 2006).
In a presence/absence model, there are two possible prediction errors: false positives (Type I error) and false negatives (Type II error) (Fielding and Bell 1997; Anderson et al. 2003). These errors can result from insufficient sample size, measurement error, and insufficient spatial resolution in the mapped environmental predictors (Pearce et al. 2001). In this study, prediction error was evaluated based on the ability of the model to discriminate correctly between positive and negative records (i.e., residual and null-residual patches, respectively). This ability has traditionally been expressed using a confusion matrix, as shown in Figure 4.3 (Fielding and Bell 1997). The four cells of the matrix are populated by cross-tabulating the observed and predicted category of each point in the evaluation set. Elements a and d are correct classifications: a is the number of positive sites (residual patches) correctly predicted, and d is the number of negative sites (null-residual patches) correctly predicted. Elements c and b are usually interpreted as omission and commission errors, respectively.
Figure 4.3. The derivation of the confusion matrix used as a basis for measuring the performance of presence-absence models. The table cross-tabulates observed and predicted patterns: a) true positives; b) false positives; c) false negatives; and d) true negatives.
                              Actual
Predicted                     Residual (Present)    Null-residual (Absent)
Residual (Present)            a                     b
Null-residual (Absent)        c                     d
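The cross-tabulation behind Figure 4.3 can be illustrated with a minimal Python sketch. The analyses in this study were carried out in R, so this is not the code used here; the function name `confusion_counts`, the 1/0 coding of residual and null-residual patches, and the example records are all assumptions made for illustration.

```python
def confusion_counts(observed, predicted):
    """Cross-tabulate observed and predicted classes.

    Both inputs are sequences of 1 (residual patch) and 0 (null-residual
    patch). Returns the four cells (a, b, c, d) of the confusion matrix.
    """
    a = b = c = d = 0
    for obs, pred in zip(observed, predicted):
        if pred == 1 and obs == 1:
            a += 1  # true positive
        elif pred == 1 and obs == 0:
            b += 1  # false positive (commission error)
        elif pred == 0 and obs == 1:
            c += 1  # false negative (omission error)
        else:
            d += 1  # true negative
    return a, b, c, d


# Hypothetical evaluation set of eight points (not data from this study).
observed = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]
a, b, c, d = confusion_counts(observed, predicted)
print(a, b, c, d)
```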
One specific way of evaluating the predictive performance of a model is to split the data into training and testing sets, which are used to develop and validate the model, respectively. However, there is no standard rule for splitting the data into training (calibration) and testing (validation) sets (Beauvais et al. 2006). Fielding and Bell (1997) summarize the different approaches that have been used to allocate cases for training and testing, including resubstitution, bootstrapping, randomization, prospective sampling, and k-fold partitioning (hold-out or external methods). A classic approach to evaluating the accuracy of a predictive model is to compare its predictions with independent data (i.e., data not used to develop the prediction model). Refaeilzadeh et al. (2008) also indicated that one of the natural approaches to model validation is hold-out validation with independent data, and this approach was used in this study to assess the prediction accuracy of the RF model. Given the 11 fire events, the data records from an individual fire event (e.g., F01) were held out for testing, while the records from the remaining fire events (i.e., F02 to F11) were used for training.
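The hold-out scheme described above can be sketched as follows. This is a Python illustration, not the R code used in the study, and it assumes that the hold-out was rotated so that each fire event served once as the testing set. The record layout (a dictionary with `event` and `residual` keys) and the function name `leave_one_event_out` are hypothetical.

```python
def leave_one_event_out(records):
    """Yield (held_out_event, training_records, testing_records).

    Each fire event is held out once for testing; records from all
    remaining events form the training set.
    """
    events = sorted({r["event"] for r in records})
    for held_out in events:
        train = [r for r in records if r["event"] != held_out]
        test = [r for r in records if r["event"] == held_out]
        yield held_out, train, test


# Hypothetical records from three events (the study used 11, F01 to F11).
records = [
    {"event": "F01", "residual": 1},
    {"event": "F01", "residual": 0},
    {"event": "F02", "residual": 1},
    {"event": "F03", "residual": 0},
]
splits = list(leave_one_event_out(records))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

Because whole events are held out, no record from the testing event contributes to model fitting, which is what makes the testing data independent in the sense used above.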
4.2.6.1. Performance with a fixed-probability threshold
The usefulness of predicted maps for various applications may not be captured in a single map accuracy value; several measures of accuracy should be incorporated (Moisen and Frescino 2002). Some of the measures of model performance are reviewed by Fielding and Bell (1997) and Liu et al. (2009). Each measure tends to emphasize a particular aspect of model performance, and hence serves a specific purpose (Beauvais et al. 2006). Some of these global measures of model performance (e.g., Table 4.1) can be computed from the 2 by 2 contingency table of predictions and observations shown in Figure 4.3. The simplest and most widely used measure of prediction accuracy is the percent correctly classified (PCC), but model assessment using overall accuracy alone, with no indication of presence or absence success, might be misleading. Therefore, the overall measure of accuracy can be broken into presence success (Sensitivity – Sn) and absence success (Specificity – Sp) (Table 4.1). The former, also known as the true positive fraction, is the proportion of presences (i.e., residual patches) correctly predicted; the latter (true negative fraction) is the proportion of absences (null-residual patches) correctly classified. These three indices, which are also referred to as fixed-probability threshold measures, each capture part of the information on model performance, and when presented together they give most users a good sense of model quality. In this study, the validity of the RF model was initially assessed using the three measures of model performance: PCC, Sn, and Sp (Table 4.1). All classification analyses were carried out in the R statistical package (R Development Core Team 2013).
Table 4.1. Potential measures of the performance of presence-absence models. The measures are derived from the confusion matrix shown in Figure 4.3. The formulae are based on correctly predicted positive occurrences (a), falsely predicted positive occurrences (b), falsely predicted negative occurrences (c), and correctly predicted negative occurrences (d).
Measure | Calculation
Percent correctly classified (PCC) | (a + d) / (a + b + c + d)
Presence success rate (Sn) | a / (a + c)
Absence success rate (Sp) | d / (b + d)
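As a worked illustration of the three fixed-threshold measures, the Python sketch below computes them from the four confusion-matrix cells using the standard definitions of overall accuracy, sensitivity, and specificity. It is not the R code used in this study; the function name `threshold_measures` and the cell counts are hypothetical, and the values are returned as proportions (multiply PCC by 100 to express it as a percentage).

```python
def threshold_measures(a, b, c, d):
    """Return (PCC, Sn, Sp) as proportions from confusion-matrix cells.

    a: true positives, b: false positives,
    c: false negatives, d: true negatives.
    """
    total = a + b + c + d
    pcc = (a + d) / total  # overall proportion correctly classified
    sn = a / (a + c)       # presences (residual patches) correctly predicted
    sp = d / (b + d)       # absences (null-residual patches) correctly predicted
    return pcc, sn, sp


# Hypothetical counts, not results from this study.
pcc, sn, sp = threshold_measures(a=40, b=10, c=20, d=30)
print(round(pcc, 3), round(sn, 3), round(sp, 3))
```

In this hypothetical case the overall accuracy (0.70) lies between a lower presence success and a higher absence success, which is the kind of imbalance that PCC alone would hide.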
4.2.6.2. Threshold-independent measures of model performance
The fixed-probability threshold measures are based on a single cut-off value that is used to translate predicted probabilities into binary (0 and 1) classes, where the default threshold is 0.5 (Jepsen 2004; Beauvais et al. 2006). Yet the choice of an appropriate threshold value is difficult, is often arbitrary, and affects the measures of model performance. A fixed threshold does not necessarily provide a more accurate measure of performance (Manel et al. 1999). Therefore, a more universal approach based on threshold-independent indices (i.e., methods based on a broad spectrum of threshold values) is needed (Pearce et al. 2001). Liu et al. (2009) summarize some of the threshold-independent measures of model performance; one of the most widely used is the receiver operating characteristic (ROC) curve (Fielding and Bell 1997; Jepsen 2004; Beauvais et al. 2006; Peters et al. 2007). The ROC curve was originally developed by signal processing and medical researchers (Zweig and Campbell 1993; Peters et al. 2007), and has more recently been integrated into distribution modelling for assessing model performance (Jepsen 2004).
The ROC curve provides a graphical depiction of a model’s discrimination ability over a range of threshold values (Pearce and Ferrier 2000). It is obtained by plotting the true positive fractions (sensitivity values; on the y-axis) against the false positive fractions (1 - specificity; on the x-axis) over all available thresholds (Zweig and Campbell 1993; Fielding and Bell 1997). A model with perfect discrimination ability has an ROC plot that passes through the upper left corner, representing perfect sensitivity (true positive fraction = 1) and perfect specificity (false positive fraction = 0) (Figure 4.4). The theoretical plot for a test with no discrimination (i.e., a completely random guess, or chance performance) is a 45° diagonal from the lower left corner to the upper right corner (Zweig and Campbell 1993; Fielding and Bell 1997; Pearce and Ferrier 2000).
Figure 4.4. Hypothetical example of an ROC graph in which the sensitivity (true positive proportion) is plotted against the false positive proportion for a range of threshold probabilities. A perfect model follows the left axis and the top of the plot, while the 45° line represents the sensitivity and false positive values expected by chance alone at each decision threshold.
[Figure 4.4 plot. x-axis: 1 - Specificity (absence success), 0.0 to 1.0. y-axis: Sensitivity (presence success), 0.0 to 1.0. Annotations: ideal test (perfect classification, AUC = 1.0); chance performance (AUC = 0.5); low performance; high performance.]
The ROC plot has received considerable attention for assessing model performance because 1) it is simple, graphical, and easy to understand visually, and 2) it describes the ability of a model to discriminate between presence and absence over a wide range of threshold values (Zweig and Campbell 1993). Therefore, an ROC curve was generated to assess the predictive performance of the model independently of any specific threshold used to classify the data records into residual and null-residual patches. To construct the ROC curves, the predicted probabilities of residual occurrence across the events (and spatial resolutions) were used to generate several confusion matrices, one for each possible cut-point. A cut-point represents a threshold probability above which the residual patch is modelled to be present (Peters et al. 2007). Pearce and Ferrier (2000) noted that a large number of sensitivity and false positive pairs (i.e., based on a threshold interval of 0.01) would result in a better fit. Thus, a threshold interval of 0.01 was used to produce ROC plots over 100 threshold values evenly spaced across the range of available predicted values (from 0.0 to 1.0).
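The construction of ROC points from evenly spaced cut-points can be sketched in Python as follows. This is an illustration only, not the R code used in the study. It assumes that a record is classed as a residual patch when its predicted probability is strictly above the cut-point, and it includes both end points of the range, which gives 101 cut-points at an interval of 0.01. The function name `roc_points` and the example data are hypothetical.

```python
def roc_points(observed, probabilities, interval=0.01):
    """Return (cut_point, false_positive_fraction, sensitivity) tuples.

    observed: 1 (residual) or 0 (null-residual) for each record.
    probabilities: predicted probability of residual occurrence.
    Both classes must be present in `observed`.
    """
    n_steps = round(1 / interval)
    n_pos = sum(1 for o in observed if o == 1)
    n_neg = len(observed) - n_pos
    points = []
    for i in range(n_steps + 1):
        t = i / n_steps  # computed by division to avoid accumulated rounding error
        # Cells a and b of the confusion matrix at this cut-point.
        a = sum(1 for o, p in zip(observed, probabilities) if o == 1 and p > t)
        b = sum(1 for o, p in zip(observed, probabilities) if o == 0 and p > t)
        points.append((t, b / n_neg, a / n_pos))
    return points


# Hypothetical records, not data from this study.
observed = [1, 1, 0, 1, 0, 0]
probabilities = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
points = roc_points(observed, probabilities)
print(len(points), points[50])
```

Plotting the second element of each tuple on the x-axis against the third on the y-axis gives a curve of the form shown in Figure 4.4.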
However, comparing ROC curves directly from the plot is difficult and subjective (Eunsik and Wenbao 2011); a single index that describes the discrimination ability of a model is required (Zweig and Campbell 1993; Pearce and Ferrier 2000). The area under the ROC curve (AUC) is therefore used as an indicator (discrimination index) of model performance. The AUC provides a single measure of a model’s ability to distinguish between residual and null-residual patches, independent of a specific threshold value (Munoz and Felicisimo 2004; Peters et al. 2007; Refaeilzadeh et al. 2008). The AUC is expressed as a proportion of the total area of the unit square defined by the false positive and true positive axes (Pearce and Ferrier 2000), with high AUC values (i.e., large areas under the curve) indicating high predictive performance. ROC plots were produced in R for each of the fire events, and the AUC value was computed for each ROC curve. As a general rule, the AUC value ranges from 0.5 for a model with no discrimination ability to 1.0 for a model with perfect discrimination ability (Table 4.2). To test whether each computed ROC index was significantly greater than 0.5, a Wilcoxon test was also performed.
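The relationship between the AUC and a rank-based test can be illustrated with the Python sketch below. It is not the R procedure used in this study. It computes the AUC in two ways, by trapezoidal integration of ROC points and as the proportion of residual/null-residual pairs that are ranked correctly (the Mann-Whitney-Wilcoxon statistic scaled to 0 to 1), and then approximates a one-sided p-value for AUC > 0.5 with a normal approximation that applies no tie or continuity correction. All function names and data are hypothetical, and the approximation is rough for samples as small as this example.

```python
import math


def roc_xy(observed, probabilities, n_thresholds=101):
    """Return (false_positive_fraction, sensitivity) pairs over even cut-points."""
    n_pos = sum(1 for o in observed if o == 1)
    n_neg = len(observed) - n_pos
    pairs = []
    for i in range(n_thresholds):
        t = i / (n_thresholds - 1)
        tp = sum(1 for o, p in zip(observed, probabilities) if o == 1 and p > t)
        fp = sum(1 for o, p in zip(observed, probabilities) if o == 0 and p > t)
        pairs.append((fp / n_neg, tp / n_pos))
    return pairs


def auc_trapezoid(pairs):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = sorted(pairs)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


def auc_rank(observed, probabilities):
    """Proportion of (residual, null-residual) pairs ranked correctly; ties count 0.5."""
    pos = [p for o, p in zip(observed, probabilities) if o == 1]
    neg = [p for o, p in zip(observed, probabilities) if o == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def p_value_auc_above_half(observed, probabilities):
    """One-sided p-value for AUC > 0.5 (normal approximation, no corrections)."""
    n_pos = sum(1 for o in observed if o == 1)
    n_neg = len(observed) - n_pos
    u = auc_rank(observed, probabilities) * n_pos * n_neg
    mean_u = n_pos * n_neg / 2
    sd_u = math.sqrt(n_pos * n_neg * (n_pos + n_neg + 1) / 12)
    z = (u - mean_u) / sd_u
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability


# Hypothetical records, not data from this study.
observed = [1, 1, 0, 1, 0, 0]
probabilities = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
auc_a = auc_trapezoid(roc_xy(observed, probabilities))
auc_b = auc_rank(observed, probabilities)
p_value = p_value_auc_above_half(observed, probabilities)
print(round(auc_a, 4), round(auc_b, 4))
```

The two AUC values agree here because the cut-point grid separates every distinct predicted probability; with a coarser grid or tied probabilities the trapezoidal value is an approximation of the rank-based one.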
Table 4.2. Classification of AUC values for assessing model performance (source: Swets 1988). Presence-absence models can be categorized as strong, marginal, or poor based on their AUC values.
AUC values | Description | Remarks
0.5 | Random guess | Discrimination ability of a model is equivalent to the one obtained by a random model (i.e., random assignments of predicted values to sites)
0.5 – 0.7 | Low accuracy (Poor discrimination ability) – weak model | Sensitivity rate is not much more than the false positive rate
0.7 – 0.9 | Reasonable discrimination ability – marginal model | Useful application
> 0.9 | High accuracy (Good discrimination ability) – strong model | The sensitivity rate is high relative to the false positive rate