Identifying components in occurrence data

5. Incorporating biological traits and environmental adaptation in correlative species distribution models

5.2.2 Identifying components in occurrence data

5.2.2.1 Cluster analysis – D. v. virgifera

Cluster analysis was used to investigate whether there was a significant difference between populations in Central America and the rest of the D. v. virgifera range, and whether such a difference could affect overall model prediction. A presence dataset that includes geographically referenced D. v. virgifera occurrences from N. America, Central America and Europe was prepared. This dataset comprised 39 environmental variables that were extracted at the recorded occurrence points of D. v. virgifera. Principal component analysis (PCA) was used to transform the presence dataset onto orthogonal axes that explain most of the variance in the environmental variables while reducing collinearity.

Var. No  Variable name  DVVI  DVVN  DVVall  PbAes  PbNAes  Pball
01  Annual mean temperature (°C)
02  Mean diurnal temperature range (mean(period max-min)) (°C)
03  Isothermality (Bio02 ÷ Bio07)
04  Temperature seasonality (C of V)
05  Max temperature of warmest week (°C)
06  Min temperature of coldest week (°C)
07  Temperature annual range (Bio05-Bio06) (°C)
08  Mean temperature of wettest quarter (°C)
09  Mean temperature of driest quarter (°C)
10  Mean temperature of warmest quarter (°C)
11  Mean temperature of coldest quarter (°C)
12  Annual precipitation (mm)
13  Precipitation of wettest week (mm)
14  Precipitation of driest week (mm)
15  Precipitation seasonality (C of V)
16  Precipitation of wettest quarter (mm)
17  Precipitation of driest quarter (mm)
18  Precipitation of warmest quarter (mm)
19  Precipitation of coldest quarter (mm)
20  Annual mean radiation (W m-2)
21  Highest weekly radiation (W m-2)
22  Lowest weekly radiation (W m-2)
23  Radiation seasonality (C of V)
24  Radiation of wettest quarter (W m-2)
25  Radiation of driest quarter (W m-2)
26  Radiation of warmest quarter (W m-2)
27  Radiation of coldest quarter (W m-2)
28  Annual mean moisture index
29  Highest weekly moisture index
30  Lowest weekly moisture index
31  Moisture index seasonality (C of V)
32  Mean moisture index of wettest quarter
33  Mean moisture index of driest quarter
34  Mean moisture index of warmest quarter
35  Mean moisture index of coldest quarter
36  Elevation (m)
37  Slope (deg)
38  Aspect (deg)

K-means clustering was performed on the first three principal components of the PCA-transformed data. The parameter K was set to two, as the aim was to test for variation between presences from the invaded and native ranges of D. v. virgifera. The geographic projection of the clustered presence points showed that all but one of the presence points in Central America were included in one cluster, while all the presence points from outside Central America were included in the second cluster. The cluster of presences from outside Central America was labelled I to denote the invaded range, and the Central American cluster was labelled N to denote the native range. It is important to note that D. v. virgifera is now considered native to N. America. The reference to the N. American range as invaded is strictly limited to this study, where the native range is taken to be Central America because of the earlier endemism of the species to that area (Coats et al., 1986).
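As a rough illustration of the clustering step, the sketch below runs K-means with K = 2 on a handful of made-up principal component scores. It is not the software or exact procedure used in this study: the function name `kmeans_two_clusters`, the deterministic choice of starting centroids and the `pc_scores` values are all assumptions made for the example.

```python
import math


def kmeans_two_clusters(points, max_iter=100):
    """Lloyd's algorithm with K = 2 on a list of equal-length tuples."""
    # Assumed deterministic start: the first point and the point farthest from it.
    first = points[0]
    farthest = max(points, key=lambda p: math.dist(p, first))
    centroids = [first, farthest]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            0 if math.dist(p, centroids[0]) <= math.dist(p, centroids[1]) else 1
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical scores on the first three principal components.
pc_scores = [
    (-3.1, 0.4, 0.2), (-2.8, -0.3, 0.1), (-3.4, 0.1, -0.2),
    (2.9, 0.5, -0.1), (3.2, -0.2, 0.3), (2.7, 0.0, 0.0), (3.0, 0.3, -0.3),
]
labels, centroids = kmeans_two_clusters(pc_scores)
```

The cluster numbers returned by K-means are arbitrary, so deciding which cluster is I and which is N still has to come from mapping the clustered points back into geographic space.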

To test if there was significant variation between these two clusters, the means and standard deviations of the two clusters on the first principal component were assessed.

Let

I = the set of values from the first principal component extracted at the presence points of D. v. virgifera in the invaded range cluster

N = the set of values from the first principal component extracted at the presence points of D. v. virgifera in the native range cluster

Then the means for each cluster are given by,

\[
\bar{I} = \frac{1}{n_i} \sum_{j=1}^{n_i} I_j
\quad \text{and} \quad
\bar{N} = \frac{1}{n_n} \sum_{j=1}^{n_n} N_j
\tag{5.1}
\]

where $I = [I_1, I_2, \ldots, I_{n_i}]$ and $N = [N_1, N_2, \ldots, N_{n_n}]$, $n_i$ is the number of presences in the invaded cluster and $n_n$ is the number of presences in the native cluster.

$\bar{I}$ and $\bar{N}$ were used to approximate the population means of the invaded ($\mu_I$) and native ($\mu_N$) ranges respectively.


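The sample standard deviations of the two clusters are given by

\[
S_I = \sqrt{\frac{\sum_{j=1}^{n_i} (I_j - \bar{I})^2}{n_i - 1}}
\quad \text{and} \quad
S_N = \sqrt{\frac{\sum_{j=1}^{n_n} (N_j - \bar{N})^2}{n_n - 1}}
\tag{5.2}
\]

$S_I$ and $S_N$ were used to approximate the standard deviations of all occurrences in the invaded ($\sigma_I$) and native ($\sigma_N$) ranges respectively.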
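The two estimators in Eqs. 5.1 and 5.2 can be sketched in a few lines. The first principal component values below are invented purely for illustration, and the function names `sample_mean` and `sample_sd` are assumptions rather than anything used in the study.

```python
import math


def sample_mean(values):
    """Arithmetic mean of the cluster values (Eq. 5.1)."""
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation with an n - 1 denominator (Eq. 5.2)."""
    if len(values) < 2:
        raise ValueError("at least two values are needed")
    mean = sample_mean(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


# Hypothetical first principal component values for each cluster.
pc1_invaded = [2.0, 3.0, 4.0]
pc1_native = [-4.0, -3.0, -2.0, -3.0]

mean_i, sd_i = sample_mean(pc1_invaded), sample_sd(pc1_invaded)
mean_n, sd_n = sample_mean(pc1_native), sample_sd(pc1_native)
```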

The above sample mean and standard deviations of the two populations were used to parameterize a mixed normal random probability density function that contained the sample estimates of the invaded range and the native range as shown in eq. 5.3.

\[
f(x \mid \mu, \sigma) = \frac{1}{2} \left( f_I(x \mid \bar{I}, S_I) + f_N(x \mid \bar{N}, S_N) \right)
\tag{5.3}
\]

The normal probability density functions (PDFs) of the invaded and native components are given by

\[
f_I(x \mid \bar{I}, S_I) = \frac{1}{\sqrt{2\pi S_I^2}} \, e^{-\frac{(x - \bar{I})^2}{2 S_I^2}}
\quad \text{and} \quad
f_N(x \mid \bar{N}, S_N) = \frac{1}{\sqrt{2\pi S_N^2}} \, e^{-\frac{(x - \bar{N})^2}{2 S_N^2}}
\tag{5.4}
\]

The mixture density of the invaded and native range components is given in Eq. 5.5. Proof and justification for the formulae in Eq. 5.4 and Eq. 5.5 are given by Reschenhofer (2001).

\[
f_M(x \mid \mu_I, \mu_N, \sigma_I, \sigma_N) = \frac{1}{2} \left[ \frac{1}{\sqrt{2\pi S_I^2}} \, e^{-\frac{(x - \bar{I})^2}{2 S_I^2}} + \frac{1}{\sqrt{2\pi S_N^2}} \, e^{-\frac{(x - \bar{N})^2}{2 S_N^2}} \right]
\tag{5.5}
\]

The combined normal distribution from Eq. 5.5 was plotted for visual investigation of variation between the two components of the combined dataset.
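A minimal sketch of the equal-weight mixture in Eqs. 5.3 to 5.5 is given below, assuming hypothetical values for the sample estimates. The helper `count_modes` is introduced only for illustration: it counts local maxima of the curve on a regular grid, which mimics a visual inspection of the plot and is not a formal test of bimodality.

```python
import math


def normal_pdf(x, mean, sd):
    """Normal density with the given mean and standard deviation (Eq. 5.4)."""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / math.sqrt(2 * math.pi * sd ** 2)


def mixture_pdf(x, mean_i, sd_i, mean_n, sd_n):
    """Equal-weight mixture of the invaded and native components (Eq. 5.5)."""
    return 0.5 * (normal_pdf(x, mean_i, sd_i) + normal_pdf(x, mean_n, sd_n))


def count_modes(pdf, lower, upper, steps=2000):
    """Count strict local maxima of pdf on a regular grid (illustrative helper)."""
    xs = [lower + (upper - lower) * k / steps for k in range(steps + 1)]
    ys = [pdf(x) for x in xs]
    return sum(1 for k in range(1, steps) if ys[k - 1] < ys[k] > ys[k + 1])


# Hypothetical estimates: well-separated clusters versus overlapping clusters.
separated = count_modes(lambda x: mixture_pdf(x, 3.0, 1.0, -3.0, 0.8), -8.0, 8.0)
overlapping = count_modes(lambda x: mixture_pdf(x, 0.5, 1.0, -0.5, 1.0), -8.0, 8.0)
```

With these made-up parameters the first curve has two peaks and the second has one, which is the kind of contrast the plotted mixture is meant to reveal.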

A likelihood ratio bimodality test was also performed on the complete D. v. virgifera presence data. This method is more robust than simply plotting the mixed PDF of the two samples, as it compares the sample distribution against a unimodal curve option and an unrestricted fit set by the sample parameters (Holzmann & Vollmer, 2008). The test confirmed the variation between the two clusters of D. v. virgifera presence points, DvvI and


5.2.2.2 Biological traits as a precursor to environmental variation in presence data – P. brassicae

The P. brassicae training dataset was classified into two user-defined classes to represent the aestivating and non-aestivating populations of P. brassicae. This classification followed the geographic boundaries of the aestivating P. brassicae population as per the description by Held and Spieth (1999) and Spieth et al. (2011).

The more or less permanent geographical cline (Figure 5.2) that was reported by Spieth et al. (2011) to represent the transition between aestivating and non-aestivating populations of P. brassicae was constructed using spatial markers given in their publication. All presence points south of the cline in continental Europe were recorded as aestivating and all other presence points were recorded as non-aestivating.
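The classification rule can be sketched as a simple test of which side of a straight line a point falls on. The two line endpoints and the test points below are hypothetical coordinates chosen for illustration, not the spatial markers of Spieth et al. (2011), and the sketch does not handle the restriction to continental Europe.

```python
def latitude_on_line(lon, start, end):
    """Latitude of the straight line through start and end at a given longitude."""
    (lon1, lat1), (lon2, lat2) = start, end
    return lat1 + (lat2 - lat1) * (lon - lon1) / (lon2 - lon1)


def classify_presence(point, start, end):
    """Label a (lon, lat) point by whether it lies south of the cline line."""
    lon, lat = point
    if lat < latitude_on_line(lon, start, end):
        return "aestivating"
    return "non-aestivating"


# Hypothetical endpoints of a diagonal cline, as (longitude, latitude) pairs.
cline_start = (-3.0, 43.5)
cline_end = (3.0, 41.5)

south_point = classify_presence((0.0, 41.0), cline_start, cline_end)
north_point = classify_presence((0.0, 45.0), cline_start, cline_end)
```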

Of the 2,241 spatially unique P. brassicae presence points, 35 fell into the aestivating class and the remaining 2,206 were classed as non-aestivating. Assessing the multimodality of the P. brassicae presence dataset using the equations described above is difficult, as the aestivating class represents only 1.5 % of the sample and any apparent difference could be spurious variation that shows a local maximum because of the lack of data. Moreover, environmental variation may remain undetected because of the small number of observations in the aestivating class. A dataset with two components is not necessarily bimodal, and a unimodal dataset can appear to have two modes if there are insufficient data to characterise its true distribution (Holzmann & Vollmer, 2008). Therefore, a separate method was employed to check whether the contribution of the aestivating population to the overall potential distribution of P. brassicae was masked when all presence points were used in model predictions.

The number and type of variables selected according to presence locations from the aestivating and non-aestivating populations were compared. To check for variation between environments associated with aestivating and non-aestivating presence points, their relative position in the feature space of variables selected according to the aestivating presences as well as the non-aestivating presences were mapped.

Figure 5.2 Classification of P. brassicae presence points.

The diagonal black line shows the cline where P. brassicae populations transition from non-aestivating to aestivating types. The black squares show spatial markers (place names) used to describe the geographical boundary of P. brassicae by Spieth et al. (2011). Labels show place names along with an "a" or "na" suffix, which means aestivating or non-aestivating respectively. It was reported that P. brassicae populations were not aestivating in Gerona even though aestivating populations were found 70 km south of Gerona. Accordingly, a circle with a 70 km radius was drawn around Gerona to mark a point through which the transition line should pass, tangent to the circle while remaining north of Lequetio. While this left Vilafranca, where aestivating populations were reported, outside the aestivating side, I proceeded with the above line as it satisfies all the other descriptions.

Directional distribution (standard deviational) ellipses were used to assess the proximity of the New Zealand invaded locations to the aestivating, non-aestivating and combined presence points. Directional distribution ellipses are usually used to assess the central tendency, dispersion and directional trends of spatial features (Lefever, 1926). The derivation of the standard deviational ellipse was improved by Furfey (1927) to use Cartesian co-ordinates, and a further improved derivation of the directional ellipse of a spatial data distribution was given by Gong (2002). The naming of the standard deviational curve in geographical space as an "ellipse" has been questioned by both Furfey (1927) and Gong (2002), as other geometrical forms of the curve were obtained depending on the spatial dispersion of the data. However, referring to the standard deviational ellipse (SDE) as the standard deviation curve, as suggested by Gong (2002), risks confusing it with the familiar bell-shaped curve of the normal distribution. Therefore, the SDE is referred to as an ellipse in this study. Major spatial analysis software, including ESRI®'s ArcGIS, also still refers to the SDE curve as an ellipse.

I adopted this method to assess the proximity of the New Zealand P. brassicae locations to both aestivating and non-aestivating presences in the feature space of variables selected according to aestivating and non-aestivating presences. Directional ellipses are used for spatial data, where autocorrelation is assumed to decline as the distance between points increases. The principal component values used to construct the feature space were also based on continuous environmental variables that co-vary in the environmental space fulfilling the assumption for the use of SDEs.

The SDEs for the aestivating and non-aestivating P. brassicae classes were derived from the parameters of the distribution of presence points in the PCA-transformed environmental feature space. Three types of feature space were tested: the first two were based on variables selected according to aestivating and non-aestivating presence points respectively, and the third was constructed from variables selected for the unclassed presences (complete presence data).

The directional standard deviation ellipses for the presence points in the aestivating and non-aestivating classes, as well as for the complete presence dataset (with no classification), were constructed as follows.


Let $X_i$ and $Y_i$ denote the values of a presence point from a given presence class on the X and Y axes respectively, where X and Y are the first (PC1) and second (PC2) principal components of the feature space constructed from the PCA-transformed environmental variables.

The standard deviations of each presence class according to the X and Y axes are given by

\[
S_x = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}}
\quad \text{and} \quad
S_y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n}}
\tag{5.6}
\]

Where n = number of points in the presence class, and 𝑋̅ and 𝑌̅ are the mean centres of all the points in the presence class on the X and Y axes respectively.

To construct the standard deviational ellipse in the direction of maximum standard deviation, the standard deviations $S_x$ and $S_y$ obtained in Eq. 5.6 are rotated by an angle $\theta$ about the mean centres $\bar{X}$ and $\bar{Y}$. The angle $\theta$ is determined by selecting the angle that maximizes the resultant standard deviations. A simplified formula for the determination of the angle $\theta$, as well as the derivation of the rotated standard deviations $\sigma_x$ and $\sigma_y$ based on the angle $\theta$, was given by Mitchell (2005); these are given in Eqs. 5.7 to 5.9.

\[
\tan \theta = \frac{a + b}{c}
\tag{5.7}
\]

Let $\bar{x}_i$ and $\bar{y}_i$ be the deviations of the points on the transformed x' and y' coordinates from the mean centre.

Then,

\[
a = \sum_{i=1}^{n} \left( \bar{x}_i^2 - \bar{y}_i^2 \right), \quad
b = \sqrt{\left( \sum_{i=1}^{n} \bar{x}_i^2 - \sum_{i=1}^{n} \bar{y}_i^2 \right)^2 + 4 \left( \sum_{i=1}^{n} \bar{x}_i \bar{y}_i \right)^2}, \quad \text{and} \quad
c = 2 \sum_{i=1}^{n} \bar{x}_i \bar{y}_i
\tag{5.8}
\]

Using the relationship between Eq. 5.7 and Eq. 5.8 the standard deviations in the rotated axes are given by Eq. 5.9.

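\[
\sigma_x = \sqrt{2} \sqrt{\frac{\sum_{i=1}^{n} \left( \bar{x}_i \cos \theta - \bar{y}_i \sin \theta \right)^2}{n}}
\quad \text{and} \quad
\sigma_y = \sqrt{2} \sqrt{\frac{\sum_{i=1}^{n} \left( \bar{x}_i \sin \theta + \bar{y}_i \cos \theta \right)^2}{n}}
\tag{5.9}
\]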

Thus, the centroid of the ellipse is at 𝑥̅ and 𝑦̅, and 2𝜎𝑥 and 2𝜎𝑦 are the long and short axes of


Two directional standard deviational ellipses (1 SD and 2 SD) were derived for each presence data class, PbAes (n = 35) and PbNAes (n = 2,206), as well as for the unclassified P. brassicae dataset (n = 2,241). The SDEs were computed using the spatial statistics extension in ArcGIS.
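For readers without access to ArcGIS, the calculation in Eqs. 5.6 to 5.9 can be sketched as below. This is an illustration under stated assumptions, not the ArcGIS implementation: the function name `standard_deviational_ellipse` and the test points are hypothetical, `atan2` is used to evaluate Eq. 5.7 so that the case c = 0 does not divide by zero, and the two rotated axes are taken to be orthogonal, so the second projection is x sin θ + y cos θ.

```python
import math


def standard_deviational_ellipse(points):
    """Return the mean centre, rotation angle and rotated standard deviations."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Deviations of each point from the mean centre.
    dx = [x - mean_x for x, _ in points]
    dy = [y - mean_y for _, y in points]
    sum_xx = sum(u * u for u in dx)
    sum_yy = sum(v * v for v in dy)
    sum_xy = sum(u * v for u, v in zip(dx, dy))
    # Terms of Eq. 5.8.
    a = sum_xx - sum_yy
    b = math.sqrt(a * a + 4 * sum_xy * sum_xy)
    c = 2 * sum_xy
    # Eq. 5.7: tan(theta) = (a + b) / c, evaluated with atan2 (assumption).
    theta = math.atan2(a + b, c)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Eq. 5.9: standard deviations along the rotated axes.
    sigma_x = math.sqrt(2) * math.sqrt(
        sum((u * cos_t - v * sin_t) ** 2 for u, v in zip(dx, dy)) / n
    )
    sigma_y = math.sqrt(2) * math.sqrt(
        sum((u * sin_t + v * cos_t) ** 2 for u, v in zip(dx, dy)) / n
    )
    return (mean_x, mean_y), theta, sigma_x, sigma_y


# Hypothetical PC1/PC2 scores lying exactly on a diagonal line.
diagonal = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
centre, theta, sigma_x, sigma_y = standard_deviational_ellipse(diagonal)
```

For points on a perfect diagonal, all of the spread lies along one rotated axis and the other standard deviation collapses to (numerically) zero, which is the most extreme directional case.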

The direction of the ellipse was also used as an additional measure to determine the direction of the different presence data distributions in the feature space. As the standard deviations of the distributions studied in the feature space are the long and short axes of the ellipse, let the values obtained in Eq. 5.9 be given as $\sigma_{short}$ and $\sigma_{long}$. In the event that these two values are equal, the distribution is circular, or uniformly distributed in all directions.

To determine whether the distribution is directional in the feature space, I used the recommendation by Gong (2002) to obtain the circularity index (Ci) of the distribution by using the ratio between the two axes. Smaller values show directionality (oblong ellipses), whereas values closer to one show that the distribution is circular. The same assumptions were extended to features in the variable space as those used to implement the SDE in geographical space. Additionally, the exact direction of the ellipse was depicted on the plots by drawing a straight line through the mean centre of the ellipses at the angle $\theta$ determined in Eq. 5.7.

\[
C_i = \frac{\sigma_{short}}{\sigma_{long}}
\tag{5.10}
\]
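Eq. 5.10 amounts to a single ratio, sketched below. The function name `circularity_index` is an assumption, and the two arguments may be passed in either order because the shorter axis is identified inside the function.

```python
def circularity_index(sigma_a, sigma_b):
    """Ratio of the short axis to the long axis of the ellipse (Eq. 5.10)."""
    sigma_short, sigma_long = sorted((sigma_a, sigma_b))
    if sigma_long == 0:
        raise ValueError("both axes are zero, so the index is undefined")
    return sigma_short / sigma_long


oblong = circularity_index(1.0, 2.0)    # 0.5: a directional distribution
circular = circularity_index(2.0, 2.0)  # 1.0: no preferred direction
```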

In document Modelling invasive species-landscape interactions using high resolution, spatially explicit models (Page 153-160)
