Results - Measuring the variability of an automatic whistle classifier

Part I. Classification

Chapter 2: Measuring the variability of an automatic whistle classifier

2.3. Results

2.3.1. Data description

The total number of sections used in the training dataset differed between species (Table 2-1). The majority of whistle contours in the data came from bottlenose and common dolphins. The number of sections for both Risso’s and white beaked dolphins was very small: only four and three sections, respectively, when an eighth of the sections was used to train the classifiers.

Table 2-1: Number of sections Sj for each species used to train the classifier. The number of sections is dependent on the proportion of the data used to train the classifier. The first classifier used half of all the sections, the second a quarter and the third an eighth, whereas for the prediction, 100% of the sections are used to train the classifier.

Sj (by proportion of training sections) 50% 25% 12.5% 100%

Bottlenose dolphin 422 211 105 844

Common dolphin 595 297 148 1190

Risso’s dolphin 17 8 4 34

White beaked dolphin 15 7 3 30

White sided dolphin 55 27 13 110

TOTAL 1104 550 273 2208
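The counts in Table 2-1 are consistent with taking the whole-number part of the full section count multiplied by the training proportion. The sketch below checks this, assuming floor rounding (the rounding rule is inferred from the table and is not stated in the text); the names `full_counts` and `training_counts` are illustrative only.

```python
# Full (100%) section counts per species, copied from Table 2-1.
full_counts = {
    "Bottlenose dolphin": 844,
    "Common dolphin": 1190,
    "Risso's dolphin": 34,
    "White beaked dolphin": 30,
    "White sided dolphin": 110,
}


def training_counts(counts, divisor):
    """Sections per species when 1/divisor of the data is used.

    Floor rounding is an assumption inferred from Table 2-1.
    """
    return {species: n // divisor for species, n in counts.items()}


half = training_counts(full_counts, 2)
quarter = training_counts(full_counts, 4)
eighth = training_counts(full_counts, 8)

for label, subset in (("50%", half), ("25%", quarter), ("12.5%", eighth)):
    print(label, subset, "TOTAL", sum(subset.values()))
```

Under this assumption the column totals come out as 1104, 550 and 273, matching the TOTAL row of the table.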

The variance of a Dirichlet distribution follows a bell-shaped curve as the classification probability moves from 0 to 1, with a maximum when the probability equals 0.5 (Figure 2-4). The observed variances when half, a quarter and an eighth of the data sections were used to train the classifier followed the same bell-shaped curve, but with smaller values than the theoretical Dirichlet variances (Figure 2-4).
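The shape described here follows from the marginal variance of a Dirichlet component, p(1 − p)/(α0 + 1), where p is the component mean and α0 is the sum of the concentration parameters. The sketch below evaluates this standard formula on a grid of probabilities; the value `alpha0 = 10.0` is hypothetical and chosen only for illustration, not taken from the chapter.

```python
def dirichlet_variance(p, alpha0):
    """Marginal variance of a Dirichlet component with mean p
    and total concentration alpha0."""
    return p * (1.0 - p) / (alpha0 + 1.0)


alpha0 = 10.0  # hypothetical concentration, for illustration only
probs = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
variances = [dirichlet_variance(p, alpha0) for p in probs]

# Probability at which the variance is largest.
peak = probs[variances.index(max(variances))]
print("variance peaks at p =", peak)
```

The variance is zero at both ends of the probability range and largest at 0.5. A larger `alpha0` lowers the whole curve without moving the peak.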

Figure 2-4: Variances of the classification probabilities (Vij) for given classification probabilities and a training sample size (S). S is the proportion of the sections used to train the classifier: half of the sections (black open circles), a quarter of the sections (red triangles) and an eighth of the sections (blue crosses). Black crosses show the variances as a function of the probabilities obtained directly from a Dirichlet distribution.

2.3.2. Model selection

Model 3 (variance dependent on the total number of sections for all species, S) was the model with the smallest AIC and residual sum of squares (r2) (Table 2-2). In this model the unknown parameter Š_{_} was not significantly different from zero (p>0.05), whereas Š₆ was positively correlated with the Vij’s.

ˆOP = 70.19N̂OP*1 + N̂_{‹ ‰ 1}OP/‰ 0.01

Table 2-2: ΔAIC, AIC and residual sum of squares (r2) for the three models

Model ΔAIC AIC r2

Model 1 18.38 -475.50 7.2×10⁻³

Model 2 51.63 -442.25 11.1×10⁻³

Model 3 0 -493.88 5.6×10⁻³
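The ΔAIC column of Table 2-2 is each model's AIC minus the smallest AIC among the candidates, so the selected model has ΔAIC = 0. A minimal sketch using the AIC values from the table (the variable names are illustrative):

```python
# AIC values copied from Table 2-2.
aic = {"Model 1": -475.50, "Model 2": -442.25, "Model 3": -493.88}

# The preferred model is the one with the smallest AIC.
best = min(aic, key=aic.get)

# Delta AIC: difference between each model's AIC and the best AIC.
delta_aic = {model: round(value - aic[best], 2) for model, value in aic.items()}

print("best model:", best)
for model, delta in delta_aic.items():
    print(model, delta)
```

This reproduces the tabulated differences of 18.38 for Model 1 and 51.63 for Model 2.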

Model 2 (for which the concentration parameters were associated with the number of sections for each species within the training dataset, Sj) was the model exhibiting the worst fit.

With Model 3, the variances predicted for a classifier trained with all the available sections ranged from 0 (when the classification probability was 0) to 7×10⁻³.

Figure 2-5: Observed data (open symbols) versus predicted values (lines) and the extrapolation to the full dataset (bold black triangles). Each colour represents a training sample size, as described for Figure 2-4.

2.3.3. Comparison of the variance with the version of the PWC described in Gillespie et al. (2013)

Standard deviations were calculated from these predicted variances and compared with the standard deviations measured with the original PWC (Table 2-3). The standard deviation measured with the modified version of the whistle classifier was smaller than with the original version: the average standard deviation for all the confusion matrices was 3.9% (±3%), whereas the average standard deviation measured with the PWC of Gillespie et al. (2013) was 8.2% (±9%). The predicted variance was slightly larger for only three classification probabilities: white sided dolphins misclassified as bottlenose dolphins, and bottlenose dolphins misclassified as Risso’s dolphins and as white beaked dolphins.

Table 2-3: Standard deviation estimated by the least squares Model 3 if 100% of the data were used to train the classifier. Values in brackets show the standard deviation measured by the PWC of Gillespie et al. (2013) when 2/3 of the data are used to train the classifier. BND = bottlenose dolphin, COD = common dolphin, RSD = Risso’s dolphin, WBD = white beaked dolphin and WSD = white sided dolphin.

True species

Standard deviation in % BND COD RSD WBD WSD

BND 8.6 (26.7) 5.8 (9.6) 2.3 (4.5) 2.8 (6.2) 1.8 (1.4)

COD 7.5 (18.0) 8.2 (11.9) 0.0 8.7 (27.0) 5.5 (15.1)

RSD 2.5 (2.2) 0.0 2.4 (4.5) 0.0 0.0

WBD 3.3 (3.1) 5.5 (5.8) 0.0 8.8 (28.6) 3.4 (4.1)

WSD 4.7 (11.1) 4.6 (5.0) 0.0 4.3 (8.8) 6.7 (15.8)
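The comparison in Table 2-3 can be checked cell by cell. The sketch below transcribes the table, averages the Model 3 standard deviations and lists the cells where the Model 3 value exceeds the bracketed PWC value. Cells with no bracketed value are stored as `None` and skipped in the comparison, which is an assumption about how to treat them; the variable names are illustrative.

```python
# Column order of Table 2-3 (true species).
species = ["BND", "COD", "RSD", "WBD", "WSD"]

# Each cell: (Model 3 SD in %, original PWC SD in %), keyed by row label.
# None marks cells for which the table gives no bracketed value.
table = {
    "BND": [(8.6, 26.7), (5.8, 9.6), (2.3, 4.5), (2.8, 6.2), (1.8, 1.4)],
    "COD": [(7.5, 18.0), (8.2, 11.9), (0.0, None), (8.7, 27.0), (5.5, 15.1)],
    "RSD": [(2.5, 2.2), (0.0, None), (2.4, 4.5), (0.0, None), (0.0, None)],
    "WBD": [(3.3, 3.1), (5.5, 5.8), (0.0, None), (8.8, 28.6), (3.4, 4.1)],
    "WSD": [(4.7, 11.1), (4.6, 5.0), (0.0, None), (4.3, 8.8), (6.7, 15.8)],
}

# Mean of the 25 Model 3 standard deviations.
predicted = [pred for row in table.values() for pred, _ in row]
mean_predicted = sum(predicted) / len(predicted)

# Cells (row label, true species) where Model 3 gives the larger SD.
larger = [
    (row_label, species[col])
    for row_label, row in table.items()
    for col, (pred, orig) in enumerate(row)
    if orig is not None and pred > orig
]

print(round(mean_predicted, 1))
print(larger)
```

The mean of the tabulated Model 3 values rounds to 3.9%, in line with the average quoted in the text, and exactly three cells have a larger Model 3 value: row BND under true species WSD, and rows RSD and WBD under true species BND.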

In parallel with the least squares method, a Generalised Additive Model (GAM) was fitted to the data. This approach gave a better fit to the data; however, its extrapolation of the variance to a classifier trained with 100% of the data did not appear realistic. For this reason, only the results of the non-linear least squares models are presented here.

2.4. Discussion

Model 3, which depends on the classification probabilities and on the total number of training sections, gave the best prediction of the data, and its predictions seemed reasonable. This model, selected because of its smaller AIC, tended to homogenise the variance between species. This homogenisation was a consequence of the denominator of the model, S (the total number of training sections). A more easily defended model is one in which species with less data in the training dataset generate more variability. Model 2 should have captured this effect, because its denominator depends directly on the number of training sections per species. The worse AIC value for Model 2 than for Model 3 is perhaps a consequence of the small amount of data on which its predictions were based: for each sample size, only five data points (one per species) were available. In theory, the model with the best diagnostics for fit is considered the 'best' statistically (Model 3 in this case), but biologically another model (in this case Model 2) may be preferred. In this specific case, the homogenisation of the variance generated by Model 3 will make the final precision of the estimate of the true number of detections less sensitive to the number of detections for each species. Consequently, the precision of the true number of detections will probably be lower for rare species, and conversely higher for common species, than if Model 2 were used.
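To illustrate the homogenisation argument, the sketch below contrasts a variance whose denominator uses the total number of sections S with one whose denominator uses the per-species number Sj. It assumes a Dirichlet-like form p(1 − p)/(n + 1) purely as a stand-in; this is not the fitted Model 2 or Model 3, and the function name `illustrative_variance` is hypothetical. The section counts are the 100% column of Table 2-1.

```python
# 100% section counts per species, from Table 2-1.
sections = {
    "Bottlenose dolphin": 844,
    "Common dolphin": 1190,
    "Risso's dolphin": 34,
    "White beaked dolphin": 30,
    "White sided dolphin": 110,
}
S = sum(sections.values())  # total number of sections across species


def illustrative_variance(p, n):
    """Dirichlet-like variance with n in the denominator.

    Illustrative stand-in only, not the model fitted in this chapter.
    """
    return p * (1.0 - p) / (n + 1.0)


p = 0.5  # probability at which the variance is largest

# Denominator based on the total S: one value shared by all species.
pooled = illustrative_variance(p, S)

# Denominator based on Sj: a different value for each species.
per_species = {sp: illustrative_variance(p, n) for sp, n in sections.items()}

for sp, v in per_species.items():
    print(f"{sp}: per-species {v:.2e} vs pooled {pooled:.2e}")
```

With a per-species denominator, the species with the fewest training sections get the largest variances, whereas the pooled denominator assigns every species the same value.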

In conclusion, this chapter proposed a new approach to try to measure the training variability of a whistle classifier. Other solutions may exist requiring a statistical approach more robust to small datasets and dealing with the complexity of the bootstrap method used by the PWC classifier. The following chapters show the importance of the quantity and quality of the training dataset to develop a reliable (low uncertainty) and accurate (high correct classification probability) classifier. Then the second part of this thesis will demonstrate how and why estimates of uncertainty in the performance of a whistle classifier should always be associated with the estimated confusion matrix if the acoustic data are to be used to estimate abundance of species.

In document Assessing and correcting for the effects of species misclassification during passive acoustic surveys of cetaceans (Page 50-55)