
2. Project methodology

2.5 Data set and variable selection

2.5.1 Data scale

The SDMs were constructed using the independent (predictor) variables detailed below. A dataset was generated for each variable at the same scale. The choice of scale can change the results. Studies in Europe have shown that models based on a 10’ scale (approximately 18km x 18km [New et al., 2002]) predict a complete loss of habitat, whereas models at a 25m scale predict suitable habitat for 100% of species (see Bellard, et al., 2012; Randin, et al., 2009). Randin et al. (2009) used accurate climatic data and elevation to interpolate variables at a local level (25m x 25m). It is possible to model climatic data for Africa at this scale via interpolation, but due to the limited number of weather stations in the area this would not add information and would limit the validity of the data (New, et al., 2002). In addition, the 10’ scale has previously been used successfully to model mammalian species distributions for Africa (Thuiller, et al., 2006). Therefore all other datasets were aggregated or disaggregated to this scale/domain.
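As a rough check on the "approximately 18km x 18km" figure quoted for the 10’ scale, the arithmetic can be sketched as follows. The mean Earth radius and the function name are assumptions introduced for this illustration and do not come from the cited sources.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, not taken from the text


def cell_size_km(arc_minutes, latitude_deg=0.0):
    """Return the (north-south, east-west) extent in km of a grid cell.

    The cell side is given in arc-minutes; the east-west extent shrinks
    with the cosine of the latitude on a spherical Earth.
    """
    degrees = arc_minutes / 60.0
    km_per_degree = 2 * math.pi * EARTH_RADIUS_KM / 360.0
    north_south = degrees * km_per_degree
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


ns, ew = cell_size_km(10)          # a 10' cell at the equator
print(round(ns, 1), round(ew, 1))  # 18.5 18.5
```

Away from the equator the east-west extent is smaller (about 16km at 30° latitude under the same assumptions), so the 18km figure is best read as an equatorial approximation.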

2.5.2 Land transformation data

Land Transformation (LT) data (Sanderson et al., 2002) was incorporated using a weighted filter. The LT dataset represents the ‘Human Footprint’ and uses four sources of data to demonstrate the human influence on Earth. These sources are population density, LT, access and electrical power infrastructure. LT data was aggregated (resampled) from the original 0.5’ resolution to the 10’ scale by taking the maximum value within each 10’ grid cell. Values range from 0 (no transformation) to 100 (complete transformation) and were subsequently divided by 1000 for incorporation into the distribution probabilities. The initial probability (IP) of occurrence from the model is weighted by the LT to provide a final probability (FP) for each grid cell: $FP_i = IP_i \times LT_i$, where $i$ is a 10’ grid cell (Thuiller et al., 2006b).
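The aggregation and weighting steps can be illustrated with a minimal sketch. It takes the formula and the divisor exactly as stated in the text. The function names and the toy grids are hypothetical, and a 2 x 2 block stands in for the 20 x 20 block of 0.5’ cells that makes up one 10’ cell.

```python
def aggregate_max(grid, factor):
    """Aggregate a 2D list to a coarser grid using the maximum of each block."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            block = [grid[i][j]
                     for i in range(r, min(r + factor, rows))
                     for j in range(c, min(c + factor, cols))]
            out_row.append(max(block))
        out.append(out_row)
    return out


def final_probability(ip, lt, divisor):
    """Apply FP_i = IP_i x LT_i cell by cell, with LT rescaled by the divisor."""
    return [[p * (t / divisor) for p, t in zip(ip_row, lt_row)]
            for ip_row, lt_row in zip(ip, lt)]


# Toy fine-resolution LT values (0-100); factor=2 is illustrative only.
lt_fine = [[0, 10, 50, 20],
           [5, 0, 30, 40],
           [100, 90, 0, 0],
           [80, 70, 0, 10]]
lt_coarse = aggregate_max(lt_fine, factor=2)   # [[10, 50], [100, 10]]

ip = [[0.8, 0.6],
      [0.9, 0.5]]                              # hypothetical model output
fp = final_probability(ip, lt_coarse, divisor=1000)
print(lt_coarse)
print(fp)
```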

2.5.3 Data sources

2.5.3.1 Climate variables present and future

Climate data was sourced from WorldClim (Hijmans et al., 2005) and used to provide both the current and future climate conditions based on a 10’ grid. This dataset provides the basic variables of mean precipitation and temperature. The CRU CL2.0 dataset (New et al., 2002) provided the wind and elevation data required to produce the potential evapotranspiration (PET) values (calculations from Allen, et al., 1998 [Ch. 3-4]) that have been used to predict species distributions (see plants, insects, birds: Huntley, et al., 2004; plants: Thuiller, et al., 2005; mammals: Thuiller, et al., 2006).

By using the temperature range and grouping the monthly values in different ways, it was possible to evaluate the following climatic variables:

• Mean monthly/yearly temperature
• Hottest/coldest mean monthly temperature
• Absolute hottest/coldest monthly temperature (hottest/coldest month +/- ½ the temperature range)
• Temperature Range (absolute hottest – absolute coldest)
• Mean monthly/annual precipitation (with/without log values)
• Driest/Wettest two months
• Potential (Reference) evapotranspiration (based on climate, elevation, solar radiation)
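The temperature and precipitation summaries listed above can be sketched as below. This is an illustration under stated assumptions rather than the procedure actually used: it assumes twelve monthly mean temperatures, a monthly temperature range for each month, and that the driest/wettest period is a run of consecutive months that may wrap from December to January. All names and values are hypothetical.

```python
def temperature_summaries(monthly_mean, monthly_range):
    """Derive summary temperature variables from monthly values."""
    months = range(len(monthly_mean))
    hot = max(months, key=lambda m: monthly_mean[m])   # index of hottest month
    cold = min(months, key=lambda m: monthly_mean[m])  # index of coldest month
    # Absolute extremes: monthly mean +/- half that month's temperature range.
    abs_hottest = monthly_mean[hot] + monthly_range[hot] / 2
    abs_coldest = monthly_mean[cold] - monthly_range[cold] / 2
    return {
        "mean_annual": sum(monthly_mean) / len(monthly_mean),
        "hottest_month_mean": monthly_mean[hot],
        "coldest_month_mean": monthly_mean[cold],
        "absolute_hottest": abs_hottest,
        "absolute_coldest": abs_coldest,
        "temperature_range": abs_hottest - abs_coldest,
    }


def driest_wettest_window(monthly_precip, n_months):
    """Return total precipitation of the driest and wettest run of n months."""
    n = len(monthly_precip)
    totals = [sum(monthly_precip[(start + k) % n] for k in range(n_months))
              for start in range(n)]
    return min(totals), max(totals)


# Hypothetical values for a single grid cell.
temps = [26, 27, 28, 29, 30, 31, 32, 31, 30, 29, 27, 26]
ranges = [10] * 12
precip = [0, 0, 5, 20, 60, 120, 180, 150, 80, 20, 5, 0]

summary = temperature_summaries(temps, ranges)
print(summary["absolute_hottest"], summary["absolute_coldest"],
      summary["temperature_range"])          # 37.0 21.0 16.0
print(driest_wettest_window(precip, 2))      # (0, 330)
```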

2.5.3.2 Soil and vegetation - variable aggregation

The non-climatic datasets were the Normalized Difference Vegetation Index (NDVI) (Tucker et al., 2005) and two soil datasets, from the Harmonized World Soil Database (HWSD, 2012) and the US Department of Agriculture (USDA, 2005). These datasets were chosen because of their ecological links to antelope. NDVI data has previously been shown to be a good indicator of antelope distribution and abundance (Pettorelli et al., 2009; Mueller et al., 2007). Soil data has also been suggested as an important factor influencing species distribution (savannah herbivores: East, 1984; burrowing owls: Stevens, et al., 2012). The HWSD dataset contained a number of soil variables, including the selected Cation Exchange Capacity (CEC) value for topsoil. This provided the numeric nutrient fixing capacity of the soil [range 0-88.4] (Nachtergaele et al., 2012). The USDA dataset offered categorical soil type data.

These variables were aggregated because their original scale was finer than that of the climatic data. NDVI data were at a scale of 8x8km. The two sets of soil data were on a 30’’ (arc-second) scale. All datasets were aggregated to the 10’ scale. All aggregation was achieved using custom-written Java programs that find all the 30’’ or 8x8km grid cells within each 10’ grid cell. For the soil data, the mode was then used to aggregate the data, giving the most common soil value in each cell. For the NDVI data, the mean value was used.
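The mode and mean aggregation described above can be illustrated with a short sketch. It is not the original Java code: it is written in Python, the function names are hypothetical, and it assumes the fine grid nests exactly into the coarse grid by an integer factor (20 for 30’’ cells within a 10’ cell; the 8x8km NDVI grid would need an additional cell-matching step that is not shown).

```python
from collections import Counter


def aggregate(grid, factor, reducer):
    """Aggregate a 2D list by applying reducer to each factor x factor block."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            block = [grid[i][j]
                     for i in range(r, min(r + factor, rows))
                     for j in range(c, min(c + factor, cols))]
            out_row.append(reducer(block))
        out.append(out_row)
    return out


def mode(values):
    """Most common value in the block (used for the soil data)."""
    return Counter(values).most_common(1)[0][0]


def mean(values):
    """Arithmetic mean of the block (used for the NDVI data)."""
    return sum(values) / len(values)


# Toy grids; factor=2 stands in for the real nesting factor.
soil = [[1, 1, 2, 2],
        [1, 3, 2, 2],
        [4, 4, 5, 6],
        [4, 7, 5, 5]]
ndvi = [[0.2, 0.4, 0.6, 0.6],
        [0.2, 0.4, 0.8, 0.8]]

soil_coarse = aggregate(soil, 2, mode)   # [[1, 2], [4, 5]]
ndvi_coarse = aggregate(ndvi, 2, mean)   # one row, values close to 0.3 and 0.7
print(soil_coarse)
print(ndvi_coarse)
```

Passing the reducer as an argument keeps the block-finding logic identical for both datasets, so only the summary statistic differs between the soil and NDVI cases.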

2.5.3.3 Ecological and morphological data

The data on diet, morphological attributes, and social structure were collated from a number of sources (Bro-Jørgensen, 2007, 2008, unpublished; Kingdon, 2003; Gagnon & Chew, 2000; Estes, 1991; Jarman, 1974)¹. These data are used to establish relationships between behavioural, morphological, and ecological traits and range change over time, climatic conditions, conservation status, and dispersal ability.

¹ All available diet proportion data are from Gagnon and Chew (2000), except for hirola, which was not in the dataset. Hirola (Beatragus hunteri) values are set as 92.5% grass, 7.5% browse, 0% fruit following Cerling et al.'s finding that C3 vegetation formed 5-10% of the diet (Cerling et al., 2003). The hirola is a member of the Alcelaphini tribe of antelopes (Estes, 1991) that are considered "predominately pure grazers" (Cerling et al., 2003). The other species of a similar size in the tribe have no fruit in their diet and similar amounts of browse to that set above (from Gagnon & Chew, 2000).

2.5.4 Variable selection

The following describes the independent processes used to identify the variables to be included in the species distribution models:

1. BIOMOD models for all species were produced 26 times using different combinations of variables. On each occasion, the variable importance function (see 2.4) showed which variables were most important within the models. This identified the commonly important variables, across the 26 BIOMOD iterations, for multiple species. Throughout the process the sensitivity, specificity and area under the curve (AUC) values for the receiver operating characteristic (ROC) were analysed to ensure models retained “High usefulness” (AUC > 0.9) where possible (see Huntley et al., 2004).

2. Variables needed to be either static, or have future projections, to allow species distributions to be predicted in the future. Soil was assumed static, as was elevation. However, vegetation indices, such as NDVI, change over time. There are no continent-wide projections for vegetation indices so these were eliminated. NDVI also had a strong correlation with mean precipitation in Africa (r = 0.83).

3. Correlated variables were identified by correlation analysis and principal components analysis (PCA). The variables analysed via PCA were a reduced set of all those evaluated via BIOMOD. I selected the most important variables from the variable importance analysis above (hottest, coldest, mean, and range of temperature; log of mean precipitation). Where two or more variables derived from a single variable (e.g. mean precipitation, log of mean precipitation), I selected the variable from those models producing the highest AUC scores. I included the variables that have previously been used in identifying species distributions (elevation, soil, and evapotranspiration). In the case of soil, the nutrient fixing capacity variable was used as this offered data for 100% of the study areas where the others did not. Finally, I included the driest and wettest three month periods. The driest three month period variable has proved valuable in identifying species distributions (Butt et al., 2008; Buckley & Jetz, 2007). The wettest three month period is not commonly used; however, in forest areas the wettest three month period is negatively associated with above-ground biomass (Lewis et al., 2013). This has the potential to impact on those species and was therefore included in the PCA.

Due to the high level of correlation, the eigenvalues were calculated for each principal component; those with values greater than one were then analysed further (Kaiser-Guttman criterion: Foster et al., 2012; Jackson, 1993). Table 2-1 provides the summary of the PCA. The log of annual precipitation was consistently the most important variable (from BIOMOD) and was also highlighted in principal component 1 (PC1) in the PCA. Table 2-2 displays the variable loadings (eigenvectors) for each principal component. PC1 highlights the importance of hottest temperature and temperature range alongside the log of annual precipitation. It suggests links between high rainfall and small ranges in temperature, which agrees with tropical forest areas. It also suggests lower hottest temperatures in high rainfall areas which, when considering the large desert areas with low rainfall and high hottest temperatures, seems reliable.

PC2 demonstrates the importance of mean temperature, coldest temperature, and elevation. It logically suggests that higher elevation is tied to colder mean temperature and colder coldest temperatures, and vice versa. PC3 accounts for 10.3% of the variance and is dominated (eigenvector 0.915) by soil nutrient fixing capacity suggesting this is an important variable.

4. Two variables displaying correlation and/or similarities may have different ecological importance. For example, hottest mean temperature has a negative correlation (r = -0.57) with the annual precipitation. However, they are hypothesized to be independently important variables in relation to the ecology and morphology of some species, for example, desert species.
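The correlation coefficients quoted in the steps above are Pearson's r values. A minimal sketch of that calculation is given here for illustration; the function name and the toy elevation and temperature values are hypothetical and are not taken from the study data.

```python
import math


def pearson_r(x, y):
    """Pearson's product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations, and the two sums of squares.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical grid-cell values: temperature falls as elevation rises.
elevation = [0, 500, 1000, 1500, 2000]
hottest_temp = [38, 36, 33, 30, 27]

r = pearson_r(elevation, hottest_temp)
print(round(r, 2))  # strongly negative for this toy data
```

In practice the same function would be applied to every pair of candidate variables across all grid cells, and pairs with a large absolute r examined as in steps 2 to 4.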

                             PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9    PC10
Standard deviation         2.063   1.651   1.014   0.898   0.750   0.586   0.390   0.329   0.137   0.000
Eigenvalues                4.257   2.724   1.027   0.806   0.562   0.343   0.152   0.109   0.019   0.000
Proportion of Variance     0.426   0.273   0.103   0.081   0.056   0.034   0.015   0.011   0.002   0.000
Cumulative Proportion      0.426   0.698   0.801   0.882   0.938   0.972   0.987   0.998   1.000   1.000

Table 2-1: Summary of principal components analysis. The first three principal components with eigenvalues greater than 1, and accounting for 80% of the variance, are further assessed (Kaiser-Guttman criterion: Foster, et al., 2012; Jackson, 1993).
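The Kaiser-Guttman selection can be reproduced from the eigenvalues in Table 2-1 with a few lines of arithmetic. The sketch below is illustrative only; the variable names are introduced here and the eigenvalues are copied from the table as printed (three decimal places).

```python
import math

# Eigenvalues for PC1-PC10, copied from Table 2-1.
eigenvalues = [4.257, 2.724, 1.027, 0.806, 0.562,
               0.343, 0.152, 0.109, 0.019, 0.000]

total = sum(eigenvalues)  # close to 10, the number of standardised variables

# Kaiser-Guttman criterion: retain components with an eigenvalue above 1.
retained = [i + 1 for i, ev in enumerate(eigenvalues) if ev > 1.0]

# Proportion of variance explained by each component.
proportions = [ev / total for ev in eigenvalues]
cumulative_retained = sum(proportions[i - 1] for i in retained)

# The standard deviation of a component is the square root of its eigenvalue.
std_devs = [math.sqrt(ev) for ev in eigenvalues]

print(retained)                        # [1, 2, 3]
print(round(cumulative_retained, 2))   # 0.8
print(round(std_devs[0], 3))           # 2.063
```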

Figure 2-3: Left – Eigenvalues from the principal component analysis identifying the importance of the first two principal components. Right – a biplot of the principal components analysis demonstrating the close correlation of the precipitation variables (grouped together). Principal components axis 1 (labelled PC1) identifies the log of annual precipitation (LogMeanPrecip), hottest temperature (HottestTemp), and Temperature Range (TempRange) as important variables within that principal component (see Table 2-2 for loading scores). Principal components axis 2 (PC2) is influenced by Coldest Temperature (ColdestTemp), Mean Temperature (MeanTemp), and Elevation. The variables analysed are those often identified as important via BIOMOD’s variable importance that could be projected into the future, or are static (elevation, soil nutrient fixing capacity [SoilTCEC]).

                                   PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9    PC10
Coldest temperature              0.202  -0.535   0.040   0.028  -0.013   0.299  -0.035   0.280   0.462  -0.536
Hottest temperature             -0.412  -0.260   0.009  -0.064  -0.115  -0.398  -0.223   0.379   0.425   0.462
Temperature range               -0.422   0.236  -0.024  -0.063  -0.065  -0.487  -0.119   0.035  -0.073  -0.707
Mean temperature                -0.153  -0.562   0.019  -0.132  -0.125   0.038  -0.057   0.237  -0.753   0.000
Log driest 3 months              0.296  -0.079  -0.313   0.544  -0.659  -0.268   0.024  -0.064  -0.031   0.000
Log wettest 3 months             0.385  -0.172   0.016  -0.400   0.091  -0.532   0.607   0.016   0.038   0.000
Soil nutrient fixing capacity    0.081  -0.007   0.915   0.355  -0.038  -0.158   0.026   0.015  -0.049   0.000
Elevation                        0.167   0.418   0.169  -0.445  -0.558   0.221  -0.021   0.458  -0.014   0.000
Log mean precipitation           0.421  -0.117   0.080  -0.341  -0.015  -0.229  -0.716  -0.348  -0.001   0.000
Evapotranspiration              -0.374  -0.224   0.164  -0.276  -0.460   0.180   0.220  -0.621   0.169   0.000

Table 2-2: Principal components loadings for each variable within the principal component. PC1-3 are those where the eigenvalues are greater than 1 and are investigated further. Bold values are significant variables within the principal components.
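One way to read Table 2-2 is to rank the variables by the absolute size of their loading on each retained component. The sketch below does this for PC1 to PC3 using the values in the table. The ranking rule and the cut-off of three variables are assumptions made for illustration, not the significance criterion used in the table. The short labels follow Figure 2-3 where available; LogDriest3, LogWettest3 and PET are hypothetical labels introduced here.

```python
# Loadings on PC1, PC2 and PC3, copied from Table 2-2.
LOADINGS = {
    "ColdestTemp":   (0.202, -0.535, 0.040),
    "HottestTemp":   (-0.412, -0.260, 0.009),
    "TempRange":     (-0.422, 0.236, -0.024),
    "MeanTemp":      (-0.153, -0.562, 0.019),
    "LogDriest3":    (0.296, -0.079, -0.313),   # hypothetical label
    "LogWettest3":   (0.385, -0.172, 0.016),    # hypothetical label
    "SoilTCEC":      (0.081, -0.007, 0.915),
    "Elevation":     (0.167, 0.418, 0.169),
    "LogMeanPrecip": (0.421, -0.117, 0.080),
    "PET":           (-0.374, -0.224, 0.164),   # hypothetical label
}


def top_variables(component, n=3):
    """Return the n variables with the largest absolute loading.

    component is a 0-based index (0 for PC1, 1 for PC2, 2 for PC3).
    """
    ranked = sorted(LOADINGS,
                    key=lambda name: abs(LOADINGS[name][component]),
                    reverse=True)
    return ranked[:n]


for pc in range(3):
    print(f"PC{pc + 1}:", top_variables(pc))
# PC1: ['TempRange', 'LogMeanPrecip', 'HottestTemp']
# PC2: ['MeanTemp', 'ColdestTemp', 'Elevation']
# PC3: ['SoilTCEC', 'LogDriest3', 'Elevation']
```

The three highest-ranked variables on PC1 and PC2 match those named in the text and in the caption of Figure 2-3.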

2.5.5 Final model variables

Models including fewer variables were preferred, both for model parsimony and to provide clarity in the production of the optimal climatic values for species. Following the PCA and variable importance assessment, the following three variables were selected:

• Log of annual precipitation: This was consistently the most important variable for a wide range of species. PCA confirmed this importance linked with temperature variables. Rainfall has previously been highlighted as an important variable in predicting savannah species and biomass (Hopcraft et al., 2009; East, 1984). Rainfall is also a key driver of vegetation in an area which in turn provides different foraging opportunities for species.

• Hottest temperature: Highlighted by PC1 as important and as a physiologically important variable. Each species has a thermoneutral zone, a range of conditions where a species can be active without the body temperature exceeding high or low limits, beyond which they must expend energy and/or water to maintain body temperatures within tolerance levels. Larger species typically have wider thermoneutral zones due to smaller body surface to mass ratio and reduced thermal conductance (Owen-Smith, 2002; Schmidt-Nielsen, 1990; Lindstedt & Boyce, 1985). Hottest temperature is closely correlated with temperature range and important in the production of optimal temperature range for each species.

• Coldest temperature: This variable was highlighted in PC2 and is important in relation to the thermoneutral zones and the production of the optimal temperature ranges.

As noted above, hottest and coldest temperatures were selected as they offered the opportunity to produce the temperature range optimal value for each species. Temperature range itself was not included in the model as it was originally derived from hottest and coldest temperatures and was strongly correlated with both. Hottest and coldest temperatures were also consistently high in the variable importance scores. These temperature variables provide the opportunity to investigate biome-specific traits such as those of desert (large temperature ranges) and tropical forest (small temperature ranges). Models built on these final three variables have similarly high sensitivity, specificity, and AUC values to models with more variables.
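For readers unfamiliar with the evaluation statistics used to compare models, the sketch below shows how sensitivity, specificity, and AUC can be computed from observed presences/absences and predicted probabilities. It is a generic illustration, not BIOMOD's implementation; the function names, the 0.5 threshold, and the toy data are all assumptions.

```python
def sensitivity_specificity(observed, predicted, threshold):
    """Sensitivity and specificity after thresholding predicted probabilities."""
    tp = fp = tn = fn = 0
    for obs, prob in zip(observed, predicted):
        present = prob >= threshold
        if obs == 1 and present:
            tp += 1        # presence correctly predicted
        elif obs == 1:
            fn += 1        # presence missed
        elif present:
            fp += 1        # absence predicted as presence
        else:
            tn += 1        # absence correctly predicted
    return tp / (tp + fn), tn / (tn + fp)


def auc(observed, predicted):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen presence receives a
    higher predicted value than a randomly chosen absence (ties count half).
    """
    pos = [p for o, p in zip(observed, predicted) if o == 1]
    neg = [p for o, p in zip(observed, predicted) if o == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical observations (1 = presence) and model probabilities.
observed = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

sens, spec = sensitivity_specificity(observed, predicted, threshold=0.5)
print(sens, spec)                  # 0.75 0.75
print(auc(observed, predicted))    # 0.9375
```

Unlike sensitivity and specificity, AUC does not depend on a chosen threshold, which is one reason it is convenient for comparing models built from different variable sets.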

Other variables highlighted as important, but not included in the models, were soil nutrient fixing capacity, mean temperature and range, and elevation. The USDA soil dataset only covers 90% of Africa and for this reason it was not included in the model. Soil variables were also consistently very low on the variable importance results from BIOMOD. Mean temperature and range are strongly correlated to both hottest and coldest temperatures. Elevation is strongly negatively correlated with hottest temperature (Pearson’s product-moment correlation: r = -0.607; p<0.001).

In document The impact of climate change on the distribution and conservation status of African antelopes (Page 30-37)