Chapter 2. Presence-only habitat suitability models

2.2.2. Ecological Niche Factor Analysis (ENFA): Biomapper

2.2.2.2. Ecological Niche Factor Analysis

ENFA is not sensitive to irrelevant data or to the order of the data, as all useful information is extracted and summarised into the ecological niche factors; non-relevant data will increase computation time and memory needs but will not significantly influence the accuracy of the result (Hirzel, 2004). However, mathematically, ENFA cannot be computed with more environmental variables (termed ecogeographical variables, EGVs, in Biomapper) than species presence points, and in practice it is best to have at least three times as many presences as EGVs (Hirzel, 2008). Therefore, when there are few presences and/or many potential EGVs, and it is unknown which EGVs are most important for a given species, Hirzel (2008) suggests grouping the EGVs into classes (e.g. habitat type, topography), ideally so that the number of EGVs in each group is less than one third of the number of presence points, and then computing an ENFA separately with each group. The 'best' EGVs (those with the highest coefficients on the marginality and specialisation factors; see below) from the ENFA outputs for each group should then be pooled, and a final ENFA computed and used for the habitat suitability computations.

Because the number of training records available per species ranged from 70 to 180, whereas 98 environmental (EGV) layers were available, this procedure was followed. As noted by Hirzel (2008), there is no automated variable selection function in Biomapper, in the sense of the stepwise regression methods (see section 3.2.2.3, Chapter 3). Instead, there is factor interpretation, through the information provided by the ENFA (see section 2.3.2.2.1.2).
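As a rough illustration of the arithmetic behind this rule of thumb, the sketch below computes the largest number of EGVs allowed for a given number of presences, and the smallest number of equal-sized groups needed to stay within that cap. The function names and the use of equal-sized groups are assumptions made for illustration only; the groups used in this study were thematic rather than equal in size.

```python
import math


def max_egvs(n_presences: int, ratio: int = 3) -> int:
    """Largest EGV count that keeps at least `ratio` presences per EGV."""
    return n_presences // ratio


def min_groups(n_egvs: int, n_presences: int, ratio: int = 3) -> int:
    """Smallest number of equal-sized groups whose size respects the cap."""
    cap = max_egvs(n_presences, ratio)
    if cap < 1:
        raise ValueError("too few presences for even one EGV per group")
    return math.ceil(n_egvs / cap)
```

With 70 presences the cap is 23 EGVs, so 98 layers would need at least 5 groups of that size; with 180 presences the cap is 60 EGVs.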

The number of presence locations required for use in Biomapper is not clear, as it depends on several factors such as the variance of the study area, the specialisation of the focal species, and the design and accuracy of the sampling (Hirzel, 2008). Although Hirzel (2008) has generally used several hundred points, he also notes that far fewer points could have been used without significantly decreasing the accuracy of the model, perhaps even as few as 20 or 30 points.

The environmental layers (EGVs) were split into six groups (habitat cover; patch area; patch compactness; edge density; Euclidean distance; and soil, terrain and climate; see Appendix 7 for explanation) and an ENFA was run for each species with each of the groups. Negative or very large eigenvalue warnings were ignored at this stage, as this was not going to be the final selection of variables for the ENFA.

The most relevant environmental variables (EGVs) were selected by examining the ENFA score matrix (which indicates how the factors are correlated with the variables), in particular the marginality factor (column 1). In ENFA, the first axis is chosen so as to account for all the marginality of the species, and the following axes so as to maximise specialisation, i.e. the ratio of the variance in the global distribution to that in the species distribution (Hirzel et al., 2002). The marginality maximises the multivariate distance of the EGVs between the cells occupied by the species and the cells within the whole reference area (Sattler et al., 2007). The coefficients m_i of the marginality factor express the marginality of the focal species on each EGV, in units of standard deviations of the global distribution (Hirzel et al., 2002). The marginality is defined as the absolute difference between the global mean and the species mean, divided by 1.96 standard deviations of the global distribution (to remove any bias introduced by the variance of the global distribution) (Hirzel et al., 2002). Further explanation of marginality can be found in Hirzel et al. (2002). The higher the absolute value of a coefficient (close to 1 or -1), the further the species departs from the mean available habitat regarding the corresponding variable (i.e. the more particular the habitat relative to the global habitat). Negative coefficients indicate that the focal species prefers values that are lower than the mean with respect to the study area, while positive coefficients indicate preference for higher-than-mean values (Hirzel et al., 2002). A low marginality value (close to 0) indicates that the species tends to live in average conditions throughout the study area (Hirzel, 2008).
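The per-variable definition of marginality quoted above can be sketched as follows. This is only an illustration of that univariate formula with the sign retained (so that negative values mean lower-than-mean preference); it is not the normalised coefficient that Biomapper reports in the score matrix, and the function name, the array layout (rows are cells, columns are EGVs) and the use of the population standard deviation are assumptions.

```python
import numpy as np


def signed_marginality(global_vals, species_vals):
    """(species mean - global mean) / (1.96 * global SD), per EGV column."""
    g = np.asarray(global_vals, dtype=float)
    s = np.asarray(species_vals, dtype=float)
    # Population standard deviation of the global distribution (assumption).
    sd_global = g.std(axis=0)
    return (s.mean(axis=0) - g.mean(axis=0)) / (1.96 * sd_global)


# Hypothetical data: two EGVs, two global cells, one occupied cell.
global_cells = [[-1.0, 0.0], [1.0, 4.0]]
occupied_cells = [[1.96, 0.04]]
m = signed_marginality(global_cells, occupied_cells)
# m is approximately [1.0, -0.5]; np.abs(m) gives the unsigned marginality.
```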

The rows of the score matrix give the EGV contributions to each factor, and the remaining columns are the V-1 specialisation factors (where V is the number of variables). Specialisation is defined as the ratio of the standard deviation of the global distribution to that of the focal species (Hirzel et al., 2002). The specialisation factors account for the decreasing residual variance after removal of upper-ranked explanatory factors (so most of the variance is explained by the first few factors), and denote how narrow the species' distribution on the EGVs is relative to the overall distribution of the EGVs in the whole reference area (Sattler et al., 2007). The inverse of specialisation is therefore a measure of the species' tolerance (Sattler et al., 2007). For the specialisation factors only the absolute value is important (the signs are arbitrary), so EGVs that had particularly high values (>0.5 or <-0.5) on the first few specialisation factors were also considered, provided that the sign of the coefficient of the marginality factor for that variable was appropriate (see below). The higher the absolute value, the more restricted the range of the focal species on the corresponding variable (Hirzel et al., 2002).
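The ratio definition of specialisation, and tolerance as its inverse, can be illustrated per variable as below. This is a minimal sketch of the univariate definition only, not of the factor extraction that Biomapper performs; the function names and the use of the population standard deviation are assumptions.

```python
import numpy as np


def specialisation(global_vals, species_vals):
    """Global SD divided by species SD, per EGV column."""
    g = np.asarray(global_vals, dtype=float)
    s = np.asarray(species_vals, dtype=float)
    return g.std(axis=0) / s.std(axis=0)


def tolerance(global_vals, species_vals):
    """Inverse of specialisation: larger values mean a broader niche."""
    return 1.0 / specialisation(global_vals, species_vals)
```

A value above 1 means the species occupies a narrower range of that variable than is available in the reference area.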

When selecting EGVs for the final pool of 'best' variables, only the EGVs with positive coefficients (positive marginality values) for habitat percentage cover, patch area, patch compactness, edge density, soil type percentage cover and aspect layers were selected. This was because if, for example, percentage cover of coniferous woodland had a strong negative marginality value, it may end up being selected as one of the most important variables (and therefore used as a factor to develop the HS map). However, this indicates that the species tends to occur where there is less coniferous woodland cover than globally (throughout the site) available. Consequently, anywhere that did not have coniferous woodland (regardless of what else the habitat type was) may be more likely to have a high HS value. This is also an issue in stepwise regression (see section 3.2.2.3, Chapter 3). However, for Euclidean distance to habitat type variables, it is negative coefficient (marginality) values that are relevant, because they indicate that the species tends to occur in sites that are closer to that habitat type. For the elevation, slope and climate variables, the coefficient may be positive or negative, as the species occurrence may be associated with higher or lower values than the global mean.
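The sign rule described above can be written as a small lookup. The category labels below are hypothetical names chosen for this sketch, not identifiers from Biomapper or from the layer names used in this study.

```python
# Hypothetical category labels for the EGV types named in the text.
POSITIVE_ONLY = {"habitat cover", "patch area", "patch compactness",
                 "edge density", "soil cover", "aspect"}
NEGATIVE_ONLY = {"euclidean distance"}
EITHER_SIGN = {"elevation", "slope", "climate"}


def sign_is_eligible(category: str, marginality_coef: float) -> bool:
    """Return True if the coefficient's sign is acceptable for this EGV type."""
    if category in POSITIVE_ONLY:
        return marginality_coef > 0
    if category in NEGATIVE_ONLY:
        return marginality_coef < 0
    if category in EITHER_SIGN:
        return marginality_coef != 0
    raise ValueError(f"unknown EGV category: {category!r}")
```

For example, a habitat cover layer with a coefficient of -0.4 would be rejected, whereas a distance layer with -0.3 would be retained.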

Within each group there was generally a clear subset of variables with more strongly positive or negative marginality values (typically <-0.1 or >0.1). If the species correlation tree showed a pair or group of variables to be highly correlated (>0.5), only the variable(s) of the pair or group with the highest marginality value were selected.
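One way to approximate this pruning step outside Biomapper is a greedy pass over the variables in order of decreasing absolute marginality, dropping any variable that is highly correlated with one already kept. This is a sketch under assumptions: the study read correlations from Biomapper's correlation tree, whereas this example uses a hypothetical pairwise lookup table, and ranking by absolute rather than signed marginality is also an assumption.

```python
def prune_correlated(marginality: dict, corr: dict, threshold: float = 0.5) -> list:
    """Keep the highest-|marginality| variable of each correlated pair or group.

    `corr` maps frozenset({name_a, name_b}) to a correlation coefficient;
    pairs that are missing from it are treated as uncorrelated.
    """
    kept = []
    ranked = sorted(marginality, key=lambda n: abs(marginality[n]), reverse=True)
    for name in ranked:
        clashes = any(
            abs(corr.get(frozenset((name, other)), 0.0)) > threshold
            for other in kept
        )
        if not clashes:
            kept.append(name)
    return kept


# Hypothetical example: "a" and "b" are correlated at 0.8, so "b" is dropped.
selected = prune_correlated(
    {"a": 0.6, "b": 0.3, "c": -0.4},
    {frozenset(("a", "b")): 0.8},
)
```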

2.2.2.2.1. ENFA with selected ‘best’ variables from each of the groups

An ENFA was run with the selected EGVs from the six groups of EGVs for each species. If there was no eigenvalue warning, a habitat suitability (HS) map was created (see section 2.2.2.3). If a negative or very large eigenvalue warning was received, the species and global correlation trees were examined, as negative or very large eigenvalues can be caused by highly correlated variables (Hirzel, 2008). Where required, one of each pair of highly correlated (>0.5) variables was removed, namely the one with the lower marginality value. Loss of information through the removal of these variables was not a problem, as the removed maps contain mostly redundant information (Hirzel, 2008). The ENFA was then re-run and the eigenvalues checked again. If there was still a warning, the process was repeated until a warning message was no longer received.

It was also important to ensure that the number of EGVs used for the final ENFA did not exceed the suggested maximum (i.e. no more than one third of the total number of presence records in the training data set; see Appendix 9 for details). If the number of selected variables from the groups was greater than suggested, it could be reduced by removing correlated variables (as above) or those with the lowest coefficients on the marginality and specialisation factors. A list of the final set of variables used in the final ENFA for each species can be found in Appendix 9.
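The trimming step can be sketched as ranking the pooled variables and keeping only as many as the one-third rule allows. For simplicity this example ranks by absolute marginality alone, which is an assumption; the procedure described above also took the specialisation factors into account, and the function name is hypothetical.

```python
def trim_to_cap(marginality: dict, n_presences: int, ratio: int = 3) -> list:
    """Keep at most n_presences // ratio variables, highest |marginality| first."""
    cap = n_presences // ratio
    ranked = sorted(marginality, key=lambda n: abs(marginality[n]), reverse=True)
    return ranked[:cap]


# Hypothetical example: 9 presences allow at most 3 EGVs, so "b" is dropped.
final_vars = trim_to_cap({"a": 0.6, "b": 0.1, "c": -0.4, "d": 0.2}, n_presences=9)
```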

Biomapper does not produce a formula for the habitat suitability, as the maps are produced based on an environmental envelope algorithm fitting to the observed distribution in the niche space (Hirzel, 2008). Therefore, for interpreting the ecological requirements of the study species, Hirzel (2008) suggests examining the coefficients of the ecological niche factors from the ENFA score matrix, which indicate how marginal and specialised the species is on the various relevant environmental variables. This was carried out and the results are displayed in section 2.3.2.2.1.2.

In document Habitat suitablity modelling in the New Forest National Park (Page 60-64)