Chapter 2: Materials and methods

2.8 Statistical analysis

2.8.1 Exploratory morphological statistical analysis

Two exploratory data analysis techniques (i.e. not necessarily inferential) were used to assess the utility of morphology in delineating between species without any a priori knowledge of their genetic groupings.

2.8.1.1 Principal coordinate analysis

Principal coordinate analysis (PCO) provides a geometric representation of the distances and dissimilarities between specimens and extracts principal coordinates that describe the major trends in multidimensional data (Legendre and Legendre, 1983). PCO looks for patterns of morphological structure between and within taxonomic groups (Davis, 2001) and is commonly used in plant systematics (Loo et al., 2001; Henderson, 2006). PCO was employed in preference to principal component analysis (PCA) because it robustly handles datasets that mix qualitative and quantitative variables (Legendre and Legendre, 1983). A disadvantage of PCO, however, is that it does not provide a breakdown of the component scores associated with each variable. The PCO was calculated using PAST statistical software v.2.17.
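The core calculation behind PCO can be illustrated with a minimal sketch. It is not the routine used in PAST; it assumes that a symmetric dissimilarity matrix has already been computed with a measure suited to the data, that the leading eigenvalue is positive, and it extracts only the first principal coordinate by power iteration. The function name `principal_coordinate` and its parameters are hypothetical.

```python
import math

def principal_coordinate(dist, iterations=200):
    """Return the first principal coordinate for a symmetric distance matrix."""
    n = len(dist)
    # Square the distances, then double-centre them: B = -1/2 * J * D^2 * J.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    # Power iteration for the dominant eigenvector of B (assumed positive).
    vec = [float(i + 1) for i in range(n)]
    eigenvalue = 0.0
    for _ in range(iterations):
        new = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0.0:
            break
        vec = [x / norm for x in new]
        eigenvalue = norm
    # Scale the unit eigenvector by sqrt(eigenvalue) to obtain coordinates.
    return [v * math.sqrt(eigenvalue) for v in vec]

# Three hypothetical specimens lying on a line at positions 0, 1 and 3.
distances = [[0.0, 1.0, 3.0],
             [1.0, 0.0, 2.0],
             [3.0, 2.0, 0.0]]
coords = principal_coordinate(distances)
```

In this example the first coordinate reproduces the original spacing of the three specimens (up to a shift and a possible change of sign), which is the sense in which PCO gives a geometric representation of the dissimilarities.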

2.8.1.2 UPGMA cluster analysis

Cluster analysis is an exploratory tool for classifying objects, whereby the association between specimens is assessed (Legendre and Legendre, 1998). In cluster analysis, no statistical assumptions are made about the data. UPGMA (unweighted pair-group method using arithmetic averages) was employed in this thesis, as this algorithm is the standard cluster analysis approach in systematics (e.g. Hayward et al., 2004; Henderson, 2006). The analysis was calculated using the dendroUPGMA programme (Garcia-Vallve et al., 2010) because its output can be saved in the Newick format, which can subsequently be imported into the Phylowidget software (Jordan et al., 2008) to create a circular dendrogram. It should be noted that the outputs of the UPGMA cluster analysis in the PAST software and in the dendroUPGMA programme are mutually comparable. However, UPGMA was not calculated in the PAST software because it did not yield an optimal visual presentation of the clustering patterns for the large datasets used in this thesis, i.e. the dendrograms created in the PAST software extended across multiple pages.
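To illustrate how UPGMA builds a tree and how the result maps onto the Newick format, a minimal sketch is given below. It is not the dendroUPGMA implementation; the function name `upgma_newick`, the specimen names and the distances are hypothetical, and a precomputed symmetric distance matrix is assumed.

```python
def upgma_newick(names, dist):
    """Cluster with UPGMA and return the tree as a Newick string."""
    # Each active cluster is stored as id -> (Newick label, size, node height).
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # Join the pair of clusters separated by the smallest distance.
        (a, b), dab = min(d.items(), key=lambda item: item[1])
        label_a, size_a, height_a = clusters.pop(a)
        label_b, size_b, height_b = clusters.pop(b)
        height = dab / 2.0
        label = "(%s:%.3f,%s:%.3f)" % (
            label_a, height - height_a, label_b, height - height_b)
        # Distance from the new cluster to every other cluster is the
        # size-weighted mean, so each specimen contributes equally.
        new_d = {}
        for c in clusters:
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        d = {key: value for key, value in d.items()
             if a not in key and b not in key}
        d.update(new_d)
        clusters[next_id] = (label, size_a + size_b, height)
        next_id += 1
    return next(iter(clusters.values()))[0] + ";"

# Hypothetical example: specimens A and B are close, C is distant from both.
tree = upgma_newick(["A", "B", "C"],
                    [[0.0, 2.0, 6.0],
                     [2.0, 0.0, 6.0],
                     [6.0, 6.0, 0.0]])
```

The returned string is a plain-text tree description of the kind that dendrogram viewers accepting Newick input can read.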

2.8.2 Classification techniques

Three distinct multivariate classification analyses were employed in this thesis to assess the relative importance of morphological traits for discriminating between genotypes. In addition, using these classification approaches provides an opportunity to compare the efficacy of the techniques with each other and with the two exploratory statistical approaches (PCO and UPGMA cluster analysis). The three classification techniques were conducted in SPSS v22.

2.8.2.1 Discriminant function analysis

Discriminant function analysis (DFA) is one of the most widely employed statistical approaches in systematics for investigating taxonomic differences and delineating morphologically similar specimens (Fisher, 1936). It is widely utilised in foraminiferal systematics (Quillévéré et al., 2013; Weiner et al., 2015). DFA discriminates amongst pre-defined groups of individuals based on a combination of variables. These variables are used to create classification functions, which in turn determine the group membership of the specimens (Henderson, 2006).

It is important to recognise that DFA rests on several assumptions, including multivariate normality (discussed in greater depth in Tabachnick and Fidell, 2007). However, this classification technique is relatively insensitive to violations of its assumptions (Tabachnick and Fidell, 2007; Hammer and Harper, 2008). This is crucial, as ecological/taxonomic datasets almost never fulfil all of these assumptions (Williams, 1983; Saraswati and Sabnis, 2006). The classification performance of this procedure was cross-validated using a leave-one-out approach. This approach omits one individual from the dataset, recalculates the discriminant function and then assigns the omitted specimen to a group using the new discriminant function (Klecka, 1980).
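The leave-one-out procedure can be sketched as follows. This is an illustration only, not the SPSS routine: a simple nearest-centroid rule is used here as a stand-in for the discriminant function, and the function names and measurements are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Stand-in classifier: assign the sample to the group with the closest mean."""
    groups = {}
    for row, label in zip(train_x, train_y):
        groups.setdefault(label, []).append(row)
    best_label, best_dist = None, float("inf")
    for label, rows in groups.items():
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((s - c) ** 2 for s, c in zip(sample, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def leave_one_out_accuracy(x, y, predict=nearest_centroid_predict):
    """Omit each specimen in turn, refit on the rest, then classify it."""
    correct = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_x, train_y, x[i]) == y[i]:
            correct += 1
    return correct / len(x)

# Hypothetical measurements for two well-separated genotypes.
measurements = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
                [3.0, 3.1], [3.2, 2.9], [2.9, 3.0]]
genotypes = ["T1", "T1", "T1", "T2", "T2", "T2"]
accuracy = leave_one_out_accuracy(measurements, genotypes)
```

Because each specimen is classified by a rule that was fitted without it, the resulting accuracy is a less optimistic estimate than one obtained by reclassifying the specimens used to fit the rule.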

2.8.2.2 Decision tree analysis

A non-parametric decision tree approach to classification was also employed in this thesis. This approach can robustly handle complex ecological data, non-linear relationships and missing data (Breiman et al., 1984; De’ath and Fabricius, 2000; Feldesman, 2002). Presently, decision trees are seldom used in foraminiferal taxonomy (Saraswati and Sabnis, 2006) but have been used to great effect in public health (e.g. Robledo et al., 2007) and aquaculture research (e.g. Elliot and Owens, 2015). Two decision tree algorithms were employed in this thesis. The first, CART (classification and regression tree), builds a tree by binary recursive partitioning (Feldesman, 2002). The second, CHAID (Chi-squared Automatic Interaction Detection) (Kass, 1980), also uses recursive partitioning and classifies based on a dependent measure and a large series of possible predictors. The difference between the two is that the CHAID tree is not restricted to binary decisions, i.e. CHAID allows more than two branches at a node where there are significant differences (Rokach and Maimon, 2007). Additionally, CHAID analysis has been identified as the optimal technique for handling large and unequal datasets (Breiman et al., 1984).
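A single step of binary recursive partitioning, as used by CART, can be sketched as below. The sketch assumes Gini impurity as the splitting criterion and a single numeric predictor; the criterion actually applied in the analyses is not stated here, and the function names and values are hypothetical. A full tree would repeat this search within each resulting subset.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels (0.0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_binary_split(values, labels):
    """Return (threshold, weighted impurity) of the best 'value <= threshold' split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # no split unless one reduces the impurity
    for k in range(1, n):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # identical values cannot be separated by a threshold
        threshold = (pairs[k - 1][0] + pairs[k][0]) / 2.0
        left = [label for _, label in pairs[:k]]
        right = [label for _, label in pairs[k:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical trait values for specimens of two genotypes.
trait = [1.0, 1.5, 2.0, 4.0, 4.5, 5.0]
genotype = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_binary_split(trait, genotype)
```

CHAID differs in that it selects predictors with chi-squared tests and may divide a node into more than two branches; that behaviour is not shown in this sketch.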

A ten-fold (V-fold) cross-validation approach was employed for both the CHAID and CART analyses, whereby the dataset was split into ten random subsamples (Rokach, 2007). A tree was computed ten times, each time omitting one of the subsamples from the computation and using it as the test sample. The cross-validation estimates were computed for each of the ten test samples and the results were averaged to give a cross-validation error.
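The V-fold procedure can be sketched as follows. This is a minimal illustration rather than the SPSS routine: a majority-class rule stands in for the decision tree, the random seed is fixed only to make the example repeatable, and all names and data are hypothetical.

```python
import random
from collections import Counter

def majority_class_predict(train_x, train_y, sample):
    """Stand-in classifier: always predicts the commonest training label."""
    return Counter(train_y).most_common(1)[0][0]

def v_fold_error(x, y, v=10, predict=majority_class_predict, seed=0):
    """Average misclassification rate over v randomly assigned subsamples."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::v] for k in range(v)]  # v roughly equal subsamples
    fold_errors = []
    for test_idx in folds:
        if not test_idx:
            continue  # fewer specimens than folds
        held_out = set(test_idx)
        train_x = [x[i] for i in indices if i not in held_out]
        train_y = [y[i] for i in indices if i not in held_out]
        wrong = sum(predict(train_x, train_y, x[i]) != y[i] for i in test_idx)
        fold_errors.append(wrong / len(test_idx))
    return sum(fold_errors) / len(fold_errors)

# Hypothetical dataset: 15 specimens of genotype A and 5 of genotype B.
specimens = [[float(i)] for i in range(20)]
labels = ["A"] * 15 + ["B"] * 5
cv_error = v_fold_error(specimens, labels, v=10)
```

Here the stand-in rule always predicts genotype A, so every genotype B specimen is misclassified when it falls in a test sample, and the averaged error equals the proportion of genotype B specimens in the dataset.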

2.8.2.3 K-nearest neighbour analysis

K-nearest neighbour analysis (K-NN) is a non-parametric approach which can discriminate between genotypes by assessing the similarity of a specimen to its nearest neighbours (Dudan, 1976). This procedure predicts the category of a test specimen from its K nearest training samples and classifies the specimen into the category with the highest probability (Kim et al., 2011). This statistical approach is rarely employed within taxonomy, but is commonly utilised in ecology (Mäkelä and Pekkarinen, 2004) and medical research (Polat, 2012; Belekar et al., 2015).
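The K-NN rule can be sketched as below. The sketch assumes Euclidean distance and an unweighted vote among the K nearest training specimens, with the vote share taken as the category probability; the distance measure and value of K used in the analyses are not stated here, and the function name and data are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train_x, train_y, sample, k=3):
    """Return (predicted label, share of the k nearest neighbours voting for it)."""
    # Rank all training specimens by Euclidean distance to the test specimen.
    distances = sorted(
        (math.dist(row, sample), label) for row, label in zip(train_x, train_y)
    )
    neighbours = distances[:k]
    votes = Counter(label for _, label in neighbours)
    label, count = votes.most_common(1)[0]
    return label, count / len(neighbours)

# Hypothetical training specimens from two genotypes.
train_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
           [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
train_y = ["g1", "g1", "g1", "g2", "g2", "g2"]
prediction, probability = knn_classify(train_x, train_y, [0.5, 0.5], k=3)
```

The choice of K controls how local the decision is: a small K follows fine structure in the training data, whereas a larger K smooths over it.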

In document Reconciling molecules and morphology in benthic foraminifera : a morphometric study of Ammonia and Elphidiidae in the NE Atlantic (Page 62-65)