5.2.4 Evaluation methods
5.2.4.1 Confusion matrix and derived metrics
In binary classification problems, comparing the model predictions with the true observations yields four possible outcomes, which can be expressed in a confusion matrix (Table 5.2).
Translated into PPA terminology, a species that was predicted as likely to establish (obtained a high risk index) and did become established is a true presence (top left in Table 5.2). A species that was predicted as unlikely to establish (obtained a low risk index) but did become established is a false absence (top right in Table 5.2). A species that was predicted as likely to establish but did not become established is a false presence (bottom left), and a species that was predicted as unlikely to establish and did not become established is a true absence (bottom right).
Table 5.2: Confusion matrix
                      Prediction
                      present                 absent
Reality   present     True presences (TP)     False absences (FA)
          absent      False presences (FP)    True absences (TA)
Two commonly used performance measures are the true presence rate and the false absence rate, both derived from the top row of the confusion matrix (Table 5.2). They are the proportions of all real presences that are predicted correctly and incorrectly, respectively. The true presence rate is also known as sensitivity, and it is a widespread performance measure that quantifies the ability of the model both to detect true presences and to avoid false absences (Fielding & Bell, 1997). Sensitivity, in the context of PPA, is the model's ability to correctly predict the species that did become established.
Equivalent measures can be derived from the bottom row of the confusion matrix: the true absence rate and the false presence rate (Table 5.2, bottom row). They are the proportions of all real absences that are predicted correctly and incorrectly, respectively. The true absence rate is also called specificity (Fielding & Bell, 1997) and quantifies the model's ability to correctly detect the species that did not become established.
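As an illustration of these definitions, the following minimal Python sketch counts the four outcomes of Table 5.2 and derives the row-wise rates. The risk indices, the establishment records, the threshold of 0.5 and the convention that a score at or above the threshold counts as a predicted presence are all assumptions made for the example; they are not data or settings from this study.

```python
def confusion_counts(risk_indices, established, threshold):
    """Count the four outcomes of the confusion matrix for one threshold."""
    tp = fa = fp = ta = 0
    for score, is_present in zip(risk_indices, established):
        predicted_present = score >= threshold  # assumed convention
        if is_present and predicted_present:
            tp += 1  # true presence
        elif is_present:
            fa += 1  # false absence
        elif predicted_present:
            fp += 1  # false presence
        else:
            ta += 1  # true absence
    return {"TP": tp, "FA": fa, "FP": fp, "TA": ta}


def row_rates(counts):
    """Derive the rates of the top (presences) and bottom (absences) rows."""
    sensitivity = counts["TP"] / (counts["TP"] + counts["FA"])
    specificity = counts["TA"] / (counts["TA"] + counts["FP"])
    return {
        "sensitivity": sensitivity,                 # true presence rate
        "false_absence_rate": 1.0 - sensitivity,
        "specificity": specificity,                 # true absence rate
        "false_presence_rate": 1.0 - specificity,
    }


# Hypothetical risk indices and establishment records for eight species.
risk_indices = [0.9, 0.8, 0.4, 0.7, 0.6, 0.3, 0.2, 0.1]
established = [True, True, True, False, False, False, False, False]

counts = confusion_counts(risk_indices, established, threshold=0.5)
rates = row_rates(counts)
print(counts)
print(rates)
```

Each rate uses counts from a single row of the matrix only, so the two rates within a row always sum to one.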
A plethora of performance metrics can be derived from the confusion matrix, but they share one main problem: they are threshold dependent. That is, any method that generates scores ranging between 0 and 1 (as the PPA risk indices do) needs a threshold value above which the method's predictions are considered presences and below which they are considered absences. The appropriate threshold depends on the distribution of the risk indices of each method, so it is not suitable to compare different methods by comparing their performance at a single threshold value. As a consequence, many strategies have been developed to assess the overall performance of a model, so that the prediction accuracy of different models can be compared.
5.2.4.2 ROC evaluation
The receiver operating characteristic (ROC) is a technique to select classifiers based on a visualization of their performance (Fawcett, 2006). The ROC consists of a two-dimensional plot of the results of the classification model for a set of thresholds. The true positive rate (the true presence rate, or sensitivity) is plotted on the y-axis and the false positive rate (the false presence rate, equal to 1 - specificity) is plotted on the x-axis. Thus, the plot shows all the possible trade-offs between the benefits (true presences) and the costs (false presences) of the model. A classifier is potentially optimal if it lies on the convex hull of the set of points in ROC space.
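The construction of the ROC plot can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores and labels are hypothetical, every distinct score is used as a threshold, and a score at or above the threshold counts as a predicted presence.

```python
def roc_points(scores, labels):
    """Return (false presence rate, true presence rate) pairs, one per threshold.

    labels: 1 for a species that became established, 0 otherwise.
    """
    n_present = sum(labels)
    n_absent = len(labels) - n_present
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted present
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
        points.append((fp / n_absent, tp / n_present))
    return points


# Hypothetical risk indices (scores) and establishment records (labels).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 0, 1, 0, 0, 0]

points = roc_points(scores, labels)
for x, y in points:
    print(f"false presence rate = {x:.2f}, true presence rate = {y:.2f}")
```

Lowering the threshold can only add predicted presences, so the points move monotonically from (0, 0) to (1, 1).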
An interesting property of the ROC space is that ROC curves are insensitive to changes in class distribution. If the proportion of present to absent species changes in a data set, the ROC curves will not change. The explanation of this phenomenon lies in the confusion matrix (Table 5.2). The class distribution, or the proportion of presences to absences, is the ratio of the top row to the bottom row. Any performance metric that uses values from both rows will be sensitive to class skew. ROC graphs, in contrast, are based upon the true presence rate and the false presence rate, each of which is a ratio within a single row, and therefore do not depend on the class distribution (Fawcett, 2006).
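This property can be checked numerically. The sketch below, using hypothetical scores and an assumed threshold of 0.5, makes the absent class ten times larger by replicating the absent species. The two ROC coordinates stay the same, whereas precision (the proportion of predicted presences that are real presences), which mixes counts from both rows of Table 5.2, changes.

```python
def rates_at(scores, labels, threshold):
    """True presence rate, false presence rate and precision at one threshold.

    Assumes at least one species is predicted present at this threshold.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    n_present = sum(labels)
    n_absent = len(labels) - n_present
    return {
        "true_presence_rate": tp / n_present,   # top row only
        "false_presence_rate": fp / n_absent,   # bottom row only
        "precision": tp / (tp + fp),            # mixes both rows
    }


# Hypothetical risk indices and establishment records (1 = established).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 0, 1, 0, 0, 0]

# Make the absent class ten times larger without changing its score distribution.
absent_scores = [s for s, y in zip(scores, labels) if not y]
skewed_scores = scores + absent_scores * 9
skewed_labels = labels + [0] * (len(absent_scores) * 9)

original = rates_at(scores, labels, threshold=0.5)
skewed = rates_at(skewed_scores, skewed_labels, threshold=0.5)
print(original)
print(skewed)
```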
5.2.4.3 Overall performance metrics
The best-known performance metric for comparing prediction models across a wide range of scientific disciplines is the area under the curve (AUC), which is literally the area under the ROC curve of the model in the unit-square ROC space. The AUC is the probability that the model will rank a randomly chosen present species higher than a randomly chosen absent species (Krzanowski & Hand, 2009) and is equivalent to the Wilcoxon test of ranks (Fawcett, 2006). The AUC can also be defined as the mean specificity value assuming a uniform distribution for the sensitivity (Anagnostopoulos et al., 2012).
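The equivalence between the area interpretation and the ranking interpretation of the AUC can be illustrated with a short Python sketch. The data are hypothetical, ties between a present and an absent species are assumed to count as half a win, and the area is computed with the trapezoidal rule; this is not the implementation used in the 'hmeasure' package.

```python
def auc_by_ranking(scores, labels):
    """Probability that a random present species outscores a random absent one."""
    present = [s for s, y in zip(scores, labels) if y]
    absent = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in present:
        for a in absent:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5  # assumed tie handling
    return wins / (len(present) * len(absent))


def auc_by_area(scores, labels):
    """Trapezoidal area under the ROC curve built from every distinct score."""
    n_present = sum(labels)
    n_absent = len(labels) - n_present
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
        points.append((fp / n_absent, tp / n_present))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical risk indices and establishment records (1 = established).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 0, 1, 0, 0, 0]

ranking_auc = auc_by_ranking(scores, labels)
area_auc = auc_by_area(scores, labels)
print(ranking_auc, area_auc)
```

In this example 13 of the 15 present-absent pairs are ranked correctly, and both functions return the same value up to floating-point rounding.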
However, several authors have reported that the AUC has one important flaw as a metric: it compares all the values of sensitivity and specificity in a way that assigns the same relative misclassification cost to a false presence (Type I error) as to a false absence (Type II error) (Lobo et al., 2008; Hand, 2009; Anagnostopoulos et al., 2012; Hand & Anagnostopoulos, 2014).
In terms of biosecurity, it is clearly more important to avoid predicting false absences than it is to avoid predicting false presences. Species correctly predicted to have a high potential to establish may not have done so (false presences), either through chance or because effective border control measures have excluded them. Such false presences could naively be considered incorrect predictions, but they may not be, because with more time the predictions of high risk could prove true. Similarly, in areas such as the medical diagnosis of life-threatening diseases, a false alarm (false presence) generally costs less than a missed case (false absence) (Anagnostopoulos et al., 2012). However, it is very difficult to ask the end user or the researcher to specify the real cost of one misclassification relative to the other (Hand & Anagnostopoulos, 2014).
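The practical effect of unequal misclassification costs can be illustrated with a small Python sketch. It is not the H measure; it only shows, for hypothetical data and a hypothetical relative cost `cost_false_absence` between 0 and 1, how the threshold that minimizes a cost-weighted loss shifts once false absences are treated as more serious than false presences.

```python
def weighted_loss(scores, labels, threshold, cost_false_absence):
    """Cost-weighted misclassification loss at one threshold.

    cost_false_absence is the relative cost of a false absence (hypothetical);
    a false presence is given the complementary cost 1 - cost_false_absence.
    """
    fa = sum(1 for s, y in zip(scores, labels) if y and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    cost_false_presence = 1.0 - cost_false_absence
    return (cost_false_absence * fa + cost_false_presence * fp) / len(scores)


def best_threshold(scores, labels, cost_false_absence):
    """Threshold, among the observed scores, with the smallest weighted loss."""
    candidates = sorted(set(scores))
    return min(
        candidates,
        key=lambda t: weighted_loss(scores, labels, t, cost_false_absence),
    )


# Hypothetical risk indices and establishment records (1 = established).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 0, 1, 0, 0, 0]

equal_costs = best_threshold(scores, labels, cost_false_absence=0.5)
costly_absences = best_threshold(scores, labels, cost_false_absence=0.9)
print(equal_costs, costly_absences)
```

With equal costs the loss is minimized at a high threshold that tolerates one false absence; when false absences are weighted nine times more heavily, the preferred threshold drops until no established species is missed.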
As a consequence, Hand (2009) developed a metric called the H measure, which is analogous to the AUC but explicitly accounts for different misclassification costs for the different errors (Type I and Type II errors, which are called commission and omission errors, respectively, in Lobo et al. (2008)). Specifically, the H measure treats misclassifications of the smaller class as more serious than those of the larger class (Hand & Anagnostopoulos, 2013), which in biosecurity terms translates into penalizing the PPA methods much more for the false absences they produce than for the false presences.
To compute the confusion matrix, the ROC plots, the H measure and the AUC, we used the package 'hmeasure' (Anagnostopoulos et al., 2012) for the statistical software R (R Foundation for Statistical Computing, 2014).