5. Incorporating biological traits and environmental adaptation in correlative species distribution models
5.4.2 The effect of variable selection
It is important to understand how variable selection affects the interpretation of the environmental space shared by presence points, which are assumed to represent suitable environmental conditions for a species. The potential geographic distribution of any species is a function of this common environmental space, at least under the assumptions of correlative modelling (Elith & Leathwick, 2009; Austin & Van Niel, 2011). Therefore, working with the most appropriate variables for a given species is as important as identifying any cause of variation within presence data. For example, according to the directional SDE analysis (Figure 5.6), the aestivating presences of P. brassicae were significantly further from the majority of the remaining presence data when plotted in a feature space constructed from variables selected according to the aestivating presence locations. This discrimination was not visible when the presence data classes were plotted in a feature space constructed from variables selected using either the unclassified presence points or the non-aestivating presence points. In fact, the directions of maximum standard deviation (indicated by the long axes of the SDEs) of the respective P. brassicae classes show that the distribution direction of the aestivating presences differs from that of the non-aestivating presences when projected onto the environmental space derived from variables selected according to all presences or non-aestivating presences. This result signifies that variables selected using the non-aestivating or the combined presence data would not have characterised the information within the aestivating presence locations, which implies that the contribution of the aestivating presences is effectively masked in direct modelling predictions.
In contrast, the SDEs in the feature space based on variables selected according to the aestivating presences show that: 1) the direction of maximum standard deviation is similar for all presence points, which suggests that environmental conditions lying between the 1SD and 2SD SDEs are likely to be occupied and that the species is not in equilibrium; and 2) the newly invaded New Zealand locations were perfectly aligned within the standard deviational direction of both presence data classes as well as of the overall presence data distribution, and were contained in the 2SD SDE of both the aestivating and non-aestivating presence classes (Figure 5.13-B).
SDEs were shown to be predictive when assessing appropriate environmental variables for P. brassicae. All the predicted presences from the direct prediction for New Zealand (no presence data classes) fell within the 2SD ellipse of all presence data classes, and all the predicted presences from the combination model fell within at least one of the ellipses of the aestivating, non-aestivating or unclassified presences (Figure 5.13-B). This shows that some areas in New Zealand were predicted only by the combination prediction and would have been missed had the direct model, which is trained on all presences, been used. Testing the SDE with presence data from more species with known within-presence variation is recommended before the method can be accepted as a general variable selection optimization tool.
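To make the ellipse test above concrete, the following is a minimal sketch of how a standard deviational ellipse could be computed in a two-dimensional feature space (for example, the first two principal components) and how a point could be checked against the 1SD or 2SD ellipse. It assumes the SDE is derived from the covariance of the point scores, which may differ in scaling from the SDE implementation used for the figures in this chapter. The function names and the example coordinates are hypothetical and are not data from this study.

```python
import math

def standard_deviational_ellipse(points):
    """Centre, long/short axis standard deviations and rotation of a 2-D point set."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Population variances and covariance of the two feature-space axes
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / n
    syy = sum((y - mean_y) ** 2 for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Angle of the long axis (direction of maximum standard deviation)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    # Eigenvalues of the 2x2 covariance matrix give the axis variances
    half_trace = (sxx + syy) / 2.0
    root = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    sd_long = math.sqrt(half_trace + root)
    sd_short = math.sqrt(max(half_trace - root, 0.0))
    return (mean_x, mean_y), sd_long, sd_short, theta

def inside_ellipse(point, ellipse, n_sd=2.0):
    """True if the point lies within the n_sd standard deviational ellipse."""
    (cx, cy), sd_long, sd_short, theta = ellipse
    if sd_long == 0.0 or sd_short == 0.0:
        raise ValueError("degenerate ellipse: points are collinear or identical")
    dx, dy = point[0] - cx, point[1] - cy
    # Rotate the offset into the ellipse's own axes
    u = dx * math.cos(theta) + dy * math.sin(theta)
    v = -dx * math.sin(theta) + dy * math.cos(theta)
    return (u / (n_sd * sd_long)) ** 2 + (v / (n_sd * sd_short)) ** 2 <= 1.0

# Hypothetical feature-space scores for one presence data class
scores = [(-2, -2), (-1, -1), (0, 0), (1, 1), (2, 2), (-1, 1), (1, -1)]
ellipse = standard_deviational_ellipse(scores)
```

A predicted presence could then be projected into the same feature space and passed to `inside_ellipse` with `n_sd=1.0` or `n_sd=2.0` for each presence data class in turn.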
Another important point regarding the use of SDEs for such analyses is that the variables might have complex, non-linear interactions; since SDEs are based on linear statistics, they may not be appropriate for all cases. However, an effective approach could be to use non-linear PCA or another non-linear dimension reduction method that constructs the feature space using non-linear functions instead of the linear relationship assumed in PCA. One could then compute the SDE on the resulting feature space.
Provided that a statistical method rather than expert knowledge is used to perform variable selection from the provided predictor data, the type and number of variables selected depend entirely on the training dataset and the algorithm used for variable selection. This also means that the composition of the training dataset (presence and absence points) directly affects the variables chosen. If there are mixed components within the presence data that are likely to be explained by significantly different environmental variables, as shown in this study, it is important to select the appropriate variables for each component by separately considering the presence data groups that correspond to distinct populations of the species.
Morency et al. (2010) provided an interesting methodology for joint feature selection in multi-modal data, in a study designed to provide better human-virtual interaction systems. Such a process could be applied for joint variable selection in multi-modal presence datasets, comparing the variables selected for the individual component presence data classes with those selected for the unclassified presence data to build a variable set that can jointly explain all components. Such variable selection could be an alternative to modelling the components separately and combining predictions. However, more studies are needed to verify whether the jointly selected variables effectively predict areas that would have been predicted if the components were modelled individually.
5.4.3 Combined predictions
A simple rule was set to combine the component predictions based on the distinct presence data components identified in the D. v. virgifera and P. brassicae presence datasets. Predicting the potential distribution of the species separately for each component within the presence data and combining the results afterwards is the simplest and probably the most straightforward method for dealing with mixed component presence data.
However, the rules used to combine the component predictions could be a source of considerable uncertainty, as there was no unified performance measure to apply to the combined predictions. For example, the rule used in this study was set in such a way that the majority presence data component is given precedence when assigning values to the final combined prediction; wherever the major component failed to predict an area, the alternative component was used to assign prediction values. This rule maximizes the sensitivity of the combined predictions, which means that more environmental variation is accounted for than in the individual component predictions. Unfortunately, it also leads to low specificity, which introduces significant commission error in the combined prediction compared to the individual component predictions.
The comparison between the direct prediction and the combined prediction based on model sensitivity and overall accuracy showed that the direct prediction had better scores (Figure 5.9), which means that the direct prediction performed well at the global extent. However, the combined P. brassicae potential distribution prediction correctly identified the invaded regions in New Zealand and Chile (where P. brassicae is confirmed to be established, but for which no presence points were included in the training and test datasets).
These externally predicted presence locations compelled me to investigate the combined prediction further. One reason for the low performance of the combined models could be the test data used in the assessment. The presence and pseudo-absence points used to test the individual component models were also used for the combination model. While the presence test data do not pose any problem, the pseudo-absences can impose a very conservative measure on the combined prediction. This is because pseudo-absences generated for the aestivating population are highly likely to encompass non-aestivating presences, and vice versa. Therefore, a subset of the pseudo-absences generated for the individual components might include locations that are presences in the combined prediction, leading to lower overall accuracy. Such a drawback can be avoided by setting aside a percentage of presence points from all identified components as a test dataset for the combined prediction, which means that these data points should not be used as either training or test data for the individual component predictions. In this study, it was not possible to set such data aside, as one of the components had very few points.
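The hold-out strategy suggested above could be sketched as follows. This is only an illustration of the idea, assuming that presence points are already labelled by component; the function name, the 20% test fraction and the component sizes are hypothetical.

```python
import random

def stratified_holdout(presences_by_component, test_fraction=0.2, seed=0):
    """Reserve a share of presences from every component for testing the
    combined prediction; the remainder is used for the component models."""
    rng = random.Random(seed)
    train, test = {}, []
    for component, points in presences_by_component.items():
        if len(points) < 2:
            raise ValueError(f"component {component!r} has too few points to split")
        shuffled = list(points)
        rng.shuffle(shuffled)
        # Hold out at least one point, but always leave at least one for training
        n_test = max(1, round(len(shuffled) * test_fraction))
        n_test = min(n_test, len(shuffled) - 1)
        test.extend((component, p) for p in shuffled[:n_test])
        train[component] = shuffled[n_test:]
    return train, test

# Hypothetical point identifiers for two presence data components
components = {
    "aestivating": list(range(5)),
    "non_aestivating": list(range(100, 120)),
}
train, test = stratified_holdout(components, test_fraction=0.2, seed=1)
```

With a component as small as the one in this sketch, only a single point is held out, which illustrates why such a split was not feasible when one component has very few presences.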
Mixed component modelling is a new practice in SDM analysis; therefore, in-depth study is required to develop sound mixed model evaluation methods that allow direct and combined predictions to be compared confidently and the better choice to be decided case by case, as the choice is likely to depend on the species data and study extent. In the case of this study, it is clear that the combined prediction gave better information regarding the suitability of New Zealand for Pieris brassicae. This is further shown in the SDE analysis (Figure 5.6 A), where the invaded area in New Zealand would not have been predicted in the direct prediction because those locations were closer to the aestivating P. brassicae population presences, which were left as outliers in the direct prediction.
The use of mechanistic models that depend on independent physiological limits as a test system to validate correlative SDM predictions has been suggested in a number of studies (Kearney et al., 2008; Monahan, 2009; Aragón et al., 2010; Buckley et al., 2010). Clearly, however, the physiological limits need to be known and established by experimentation, which is not the case for many invasive insect species.
When the physiological limits of a species are known, it may be especially important to validate such combined predictions using physiological environmental thresholds of the target species, as it is often the areas where the species is not currently established that are likely to contribute most of the uncertainty in correlative SDM results, and particularly in combination prediction results such as those demonstrated in this study. For instance, when the combined global potential distribution of P. brassicae in this study was compared to the prediction obtained by training models directly on both aestivating and non-aestivating population presences (Figure 5.11), the core distribution of P. brassicae in Europe was predicted similarly in both cases. For areas away from the P. brassicae native range, the combined prediction identified more suitable areas, for example in the American and African continents, and more locations in Australia and New Zealand (Figure 5.11-B).
There was no information to validate some areas predicted in the combined prediction, for example in Africa and North America. A number of questions need to be answered to validate these new predictions, which were not identified by the direct model prediction. Are these new predictions a result of high commission error? Are they a result of unmasking the effect of the aestivating presences, which were not considered in the direct predictions? Or are they a combination of both factors? Such questions can only be answered if the species is introduced into these areas, or if mechanistic models based on physiological thresholds are used to independently verify the discrepancy in predictions. Meanwhile, the SDE analysis of the P. brassicae presence components in the various feature spaces, as well as the improved prediction by the combined model of the locations where this species has invaded New Zealand, suggests that within-presence data variation can affect overall potential species distribution predictions.