Competing models must be evaluated in a consistent manner if the resulting choice of classification model implemented on the ESD is to be justified. The competitive testing of models is complicated by the wide variety of algorithms that can perform classification and by their widely differing implementation complexity. This necessitates the specification of a formal testing environment that will ideally result in the identification of an algorithm or set of algorithms that can be said to perform ‘best’ at ESD acoustic classification.
The specification for the testing environment includes the separation of the data that will be used to train and test the models. It also includes the performance measure that is used to assess the model. The goal of the testing environment is to be able to test the models quickly and consistently against a fair representation of the ESD acoustic data. The outcome of the testing environment is the knowledge of which models to consider for further optimisation.
The result of the testing environment also provides a measure of how ‘learnable’, or suitable for classification, the problem is. Thus far in this dissertation the working assumption has been that the ESD functionality can be obtained through classification. The results of the testing environment show whether the acoustic data are indeed a measure for determining rock structural stability. If a variety of different learning algorithms all perform poorly, it is an indication that the data lack the differentiating structure that needs to be learned.
7.2.1 Feature resampling
The main hidden obstacle in classification training is the problem of overtraining. To illustrate how damaging overtraining of a model can be to its operational performance, consider the following example. A very simple model can be constructed from feature data that purely memorises the training data. Effectively, the model is a matching template to the training data that knows the predicted class value of each given sample. This model would have perfect accuracy in subsequently predicting the training set, but would inevitably perform very badly on any unseen data for which it does not have an exact match. Such a model is said to be unable to generalise. It is therefore essential to achieve generalisation in this model testing environment.
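The memorising model described above can be illustrated with a minimal Python sketch. It is not part of the ESD implementation; the class name `MemorisingModel`, the toy feature values and the default label are all hypothetical and serve only to show perfect training accuracy alongside poor accuracy on unseen samples.

```python
class MemorisingModel:
    """Template-matching 'model' that only remembers its training samples."""

    def __init__(self, default_label=0):
        self.table = {}
        self.default_label = default_label  # returned when no exact match exists

    def fit(self, samples, labels):
        for x, y in zip(samples, labels):
            self.table[tuple(x)] = y
        return self

    def predict(self, samples):
        return [self.table.get(tuple(x), self.default_label) for x in samples]


def accuracy(predicted, observed):
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


# Hypothetical toy data: the label is 1 when the first feature exceeds 0.5.
train_x = [(0.1, 0.3), (0.9, 0.2), (0.4, 0.8), (0.7, 0.6)]
train_y = [0, 1, 0, 1]
test_x = [(0.2, 0.1), (0.8, 0.9), (0.6, 0.4), (0.95, 0.5)]
test_y = [0, 1, 1, 1]

model = MemorisingModel().fit(train_x, train_y)
train_accuracy = accuracy(model.predict(train_x), train_y)  # exact matches only
test_accuracy = accuracy(model.predict(test_x), test_y)     # no matches: default label
```

In this toy case the training accuracy is 1.0, while every unseen sample falls back to the default label, so only the single ‘0’ test sample is predicted correctly.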
k-Fold cross-validation is a standard, rigorous technique for training models. With k = 10, each run trains the algorithm on 90% of the data and tests it on the remaining 10%, and each run changes which 10% of the data is held out for testing. It is the technique chosen in the specification for the testing environment to provide an unbiased assessment of the accuracy of the model for the ESD. This measure of accuracy was used to determine the optimal tuning parameters for each model. However, as will be explained in Subsection 7.2.3, the models that are competitively evaluated against each other should still be measured against an independent testing dataset to which the models were not exposed during training. A separate testing set is therefore defined and kept apart from the training set. The final measurement of performance is done on the testing set, and this measure is used to determine statistically significant differences between the performance of the models.
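The fold rotation described above can be sketched as follows. This is an illustrative Python fragment rather than the resampling code used in this research; the function name `k_fold_indices`, the seed and the sample count of 100 are assumptions chosen so that each fold holds exactly 10% of the data.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, holdout_idx) pairs; each sample is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, holdout in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, holdout


# Hypothetical example: 100 samples, 10 folds, so 90 training and 10 hold-out
# samples per run, with a different 10% held out each time.
splits = list(k_fold_indices(100, k=10))
```

Averaging the hold-out performance over the ten runs gives the cross-validated estimate used for tuning.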
7.2. MODEL TESTING APPROACH 97
For the test environment, a ratio of 80:20 was chosen for the subsetting of the feature data into a training set and a testing set. This effectively means that for the cleaned operational set of 537 samples, the resulting training subset has 403 samples, and the testing subset 134 samples. The sets were chosen in such a way that the class distribution present in the operational dataset was replicated in both the testing set and the training set.
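A split that replicates the class distribution in both subsets can be sketched by sampling each class separately. The Python fragment below is a minimal illustration, not the procedure used to produce the subsets quoted above; the function name `stratified_split` and the 30:70 class mix are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)  # per-class test count
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 30 'unsafe' (1) and 70 'safe' (0) samples.
labels = [1] * 30 + [0] * 70
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
```

Because the rounding is applied per class, the exact subset sizes can differ slightly from the nominal ratio when class counts are not divisible by it.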
7.2.2 Performance measure
In order to compare different classification models with each other, a performance measure indicating their fitness needs to be defined. Various well-defined classification performance measures exist, so the task was to choose one relevant to the functionality of the ESD.
Classification models do not typically generate a discrete prediction, but rather a probability between 0 and 1, which is then converted into a discrete prediction by applying a threshold. For instance, a model could return a value of 0.71, which would be rounded up and interpreted as class 1. In the particular case of the ESD classification requirements, there are two predicted classes, ‘safe’ and ‘unsafe’. The class value of 1, alternatively called the ‘event’, is typically associated with the predicted class that is of most interest to the problem. The ESD is designed to mitigate rock-fall risk, and rocks determined to be in loose structural cohesion are directly responsible for rock falls. In the case of the ESD, therefore, it was decided that the class of ‘unsafe’ is the event of interest for detection. Hence, 0 = ‘safe’ and 1 = ‘unsafe’.
A useful way to present the results of a model’s prediction is in the form of a confusion matrix. A confusion matrix is a cross-tabulation of the observed and predicted classes for the data. The confusion matrix for this research is shown in Table 7.1. The first value is called the True Positive (TP), and it is the number of samples that were correctly predicted to be an event. True Negative (TN), correspondingly, relates to the correct prediction of a non-event. A False Positive (FP) is the case where an event is predicted, but in fact there is no observed event. An example of an FP is when the ESD predicts an ‘unsafe’ state when the rock is in fact secure. A False Negative (FN) occurs when the prediction is for a non-event, but in reality there is an event. An FN example is when the ESD presents a ‘safe’ prediction when the rock mass is dangerously loose.
Table 7.1: Confusion matrix template indicating the number of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN)

                      Real ‘unsafe’    Real ‘safe’
Predicted ‘unsafe’    TP               FP
Predicted ‘safe’      FN               TN
Two additional statistics relevant to the performance measure for this research can be derived from Table 7.1. The sensitivity of a model is the rate at which the event of interest, i.e. ‘unsafe’, is predicted correctly for all samples having the event, or

\[ \mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{7.2.1} \]

The sensitivity of the prediction measures the accuracy in the event population. The converse statistic is the specificity of the model. The specificity is defined as the rate that non-event samples, i.e. ‘safe’, are correctly predicted, or

\[ \mathrm{Specificity} = \frac{TN}{FP + TN} \tag{7.2.2} \]

Sensitivity and specificity tend to be inversely related to each other in most practical model outcomes. An ideal model with perfect prediction, and hence perfect TP and TN values, would have a value of 1 for both these statistics, but that rarely happens and did not happen in this research.
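Equations 7.2.1 and 7.2.2 can be checked on a small worked example. The Python sketch below counts the four confusion-matrix cells and evaluates both statistics; the function names and the ten observed and predicted labels are hypothetical illustration data, not results from this research.

```python
def confusion_counts(observed, predicted, event=1):
    """Return (TP, FP, TN, FN) with `event` treated as the positive class."""
    tp = fp = tn = fn = 0
    for obs, pred in zip(observed, predicted):
        if pred == event and obs == event:
            tp += 1          # 'unsafe' predicted, rock really unsafe
        elif pred == event:
            fp += 1          # 'unsafe' predicted, rock really safe
        elif obs == event:
            fn += 1          # 'safe' predicted, rock really unsafe
        else:
            tn += 1          # 'safe' predicted, rock really safe
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    return tp / (tp + fn)    # Equation 7.2.1


def specificity(tn, fp):
    return tn / (fp + tn)    # Equation 7.2.2


# Hypothetical labels: 1 = 'unsafe' (event), 0 = 'safe'.
observed = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tp, fp, tn, fn = confusion_counts(observed, predicted)
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
```

Here TP = 3, FP = 2, TN = 4 and FN = 1, giving a sensitivity of 3/4 and a specificity of 4/6.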
Recall that model outputs are not discrete numbers, 1 and 0, but rather lie on a continuous scale between these two values that indicates probability. It cannot be assumed that rounding the output to either 1 or 0 at the 0.5 threshold is ideal; in fact, the choice of threshold directly affects the sensitivity and specificity of the model. Intuitively, this can be understood through the following example. If the chosen threshold is virtually zero, the model would predict all samples to be ‘unsafe’, no false negatives would occur, and hence the sensitivity of the model would be perfect. In this same example, no true negatives would be predicted, and the specificity of the model would be zero. This relationship between the variable threshold and the values of specificity and sensitivity can be presented visually in a plot called the Receiver Operating Characteristic curve (Altman and Bland, 1994), typically referred to by its abbreviation, the ROC curve.
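The effect of the threshold can be made concrete with a short Python sketch. The function name `rates_at_threshold`, the eight scores and their labels are hypothetical; the convention that a score at or above the threshold is read as ‘unsafe’ is also an assumption of this illustration.

```python
def rates_at_threshold(observed, scores, threshold):
    """Return (sensitivity, specificity) when score >= threshold means 'unsafe' (1)."""
    pairs = list(zip(observed, scores))
    tp = sum(1 for o, s in pairs if o == 1 and s >= threshold)
    fn = sum(1 for o, s in pairs if o == 1 and s < threshold)
    tn = sum(1 for o, s in pairs if o == 0 and s < threshold)
    fp = sum(1 for o, s in pairs if o == 0 and s >= threshold)
    return tp / (tp + fn), tn / (fp + tn)


# Hypothetical model outputs for four 'unsafe' (1) and four 'safe' (0) samples.
observed = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.80, 0.71, 0.60, 0.45, 0.30, 0.20, 0.05]

low = rates_at_threshold(observed, scores, 0.01)   # nearly everything 'unsafe'
mid = rates_at_threshold(observed, scores, 0.50)   # conventional rounding
high = rates_at_threshold(observed, scores, 0.99)  # nearly everything 'safe'
```

With these values the near-zero threshold gives a sensitivity of 1 and a specificity of 0, the 0.5 threshold gives 0.75 for both, and the near-one threshold reverses the first case.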
The ROC curve will be used as a quantitative assessment of the models in this chapter. Figure 7.1 shows an example of a ROC curve. A perfect model that predicts completely accurately would have 100% sensitivity and specificity. Visually, its ROC curve would be a single step from (0, 0) to (0, 1) and would then remain constant between (0, 1) and (1, 1). The area under the ROC curve for a perfect model would therefore be one. A completely ineffective model results in a ROC curve that closely follows the 45° diagonal line and has an area under the ROC curve of 0.5. A useful characteristic of the ROC curve is that different models can be compared visually by superimposing their ROC curves on the same plot.
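The construction of a ROC curve and its area can be sketched in a few lines of Python. This is an illustrative fragment under stated assumptions, not the plotting code used for Figure 7.1: the points are expressed as (1 - specificity, sensitivity), the area is obtained with the trapezoidal rule, and the function names and sample scores are hypothetical.

```python
def roc_points(observed, scores):
    """Return ROC points (1 - specificity, sensitivity) from (0, 0) to (1, 1)."""
    n_pos = sum(observed)
    n_neg = len(observed) - n_pos
    ranked = sorted(zip(scores, observed), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score (ties).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points ordered by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores: a perfect ranking, a constant (uninformative) output,
# and a realistic ranking with one misordered pair.
perfect = area_under_curve(roc_points([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))
uninformative = area_under_curve(roc_points([1, 1, 0, 0], [0.5, 0.5, 0.5, 0.5]))
realistic = area_under_curve(roc_points(
    [1, 1, 1, 0, 1, 0, 0, 0],
    [0.95, 0.80, 0.71, 0.60, 0.45, 0.30, 0.20, 0.05]))
```

The perfect ranking yields an area of one, the constant output yields 0.5, and the realistic ranking falls between the two, which is the basis on which models are compared.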
Referring to Table 7.1 again, attention should be paid to the FN field. In this research, a false negative would occur if the ESD predicts a rock mass to be structurally cohesive when it is actually loose. This is a category of prediction with a potentially dangerous result. Arguably, a misleading prediction is worse than no prediction at all due to the possibility of convincing the miner that an existing rock-fall risk can be ignored. The statistic that relates to this FN value is the sensitivity of the model. In the choice and optimisation of the model, its sensitivity will be the most important consideration. As an extension of this decision, the area under the ROC curve will be used as the main performance measure when evaluating models against each other.

Figure 7.1: Smoothed ROC curve of example model showing the relation between specificity and sensitivity with diagonal reference line
7.2.3 Model testing environment
The model testing environment was implemented in the R programming language. This language was chosen after evaluating the different implementation options discussed in Section 3.3.7 on page 46. R is an open-source programming language that is widely used in the science of classification. The main strength of R is its reliance on user-submitted packages to extend the language for specific tasks. In particular, virtually all state-of-the-art models currently in use are very well represented in the package repository for R.
In fact, the wide variety of models submitted by various authors presents a problem for implementation. R does not impose a function structure for models, and therefore each model package has different implementation parameters and usage. This complexity is effectively masked by the caret package developed by Dr Kuhn, who is the Director of Non-clinical Statistics at Pfizer Global R&D (Kuhn, 2008). The caret package (short for Classification And REgression Training) creates a unified interface for modelling and prediction in R, and interfaces to 147 regression and classification models.
The additional benefit of caret is the helper functions it implements to assist in the streamlining of model tuning using resampling. These include functions to split the dataset into testing and training datasets that each still maintain the class distribution of the whole. Furthermore, the use of k-fold cross-validation during the training of models can be specified. A useful function provided by caret is the search for optimal model-tuning variables. Some models have specific tuning variables that can be set prior to the training of the model that make the particular model potentially more suitable for describing the problem. The specific tuning variables applicable to each type of model are described during the presentation of the model training results in the following section.
The algorithm to train and select models is as follows:

    Split feature set into testing and training set;
    Define sets of model tuning variables to evaluate;
    for each set of tuning variables do
        for each cross-validation k-fold of training set do
            Hold out specific fold of training data;
            Fit the model on the remainder;
            Predict the hold-out samples;
        end
        Calculate the average performance across hold-out predictions;
    end
    Determine the optimal tuning variables based on highest cross-validated accuracy;
    Apply model to testing set to predict outcomes;
    Construct ROC curve and compare against other models;
This algorithm describes the process to train each model, and provides a measurement of its ability to accurately predict outcomes based on data the model was not exposed to during its training. This provides a measure of the real-world applicability of the constructed model on the ESD in operational environments.
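The tuning and selection loop can be illustrated end to end with a small, self-contained Python sketch. It is not the R and caret implementation used in this research: the nearest-neighbour classifier, the single synthetic feature, the candidate tuning values, the five folds and all function names are assumptions chosen only to keep the example short, and the final ROC construction step is omitted.

```python
import random


def knn_predict(fit_x, fit_y, x, n_neighbours):
    """Majority vote among the n_neighbours nearest 1-D training samples."""
    nearest = sorted(range(len(fit_x)), key=lambda i: abs(fit_x[i] - x))[:n_neighbours]
    votes = sum(fit_y[i] for i in nearest)
    return 1 if 2 * votes > n_neighbours else 0


def accuracy(fit_x, fit_y, eval_x, eval_y, n_neighbours):
    hits = sum(knn_predict(fit_x, fit_y, x, n_neighbours) == y
               for x, y in zip(eval_x, eval_y))
    return hits / len(eval_y)


def cross_validated_accuracy(xs, ys, n_neighbours, n_folds=5):
    scores = []
    for fold in range(n_folds):
        holdout = set(range(fold, len(xs), n_folds))   # hold out one fold
        fit_x = [x for i, x in enumerate(xs) if i not in holdout]
        fit_y = [y for i, y in enumerate(ys) if i not in holdout]
        eval_x = [xs[i] for i in sorted(holdout)]
        eval_y = [ys[i] for i in sorted(holdout)]
        scores.append(accuracy(fit_x, fit_y, eval_x, eval_y, n_neighbours))
    return sum(scores) / n_folds                        # average over folds


# Synthetic data (assumption): one feature with class means 0.0 and 2.0.
rng = random.Random(42)
labels = [rng.randint(0, 1) for _ in range(200)]
features = [rng.gauss(2.0 * y, 0.6) for y in labels]

# Split into training and testing sets; the samples are already in random
# order, so a plain (unstratified) 80:20 cut is used here for brevity.
train_x, test_x = features[:160], features[160:]
train_y, test_y = labels[:160], labels[160:]

# Evaluate each candidate tuning value by cross-validation on the training set.
candidates = [1, 3, 5, 7]
cv_scores = {k: cross_validated_accuracy(train_x, train_y, k) for k in candidates}
best_k = max(candidates, key=lambda k: cv_scores[k])

# Apply the selected model once to the untouched testing set.
test_accuracy = accuracy(train_x, train_y, test_x, test_y, best_k)
```

The key design point mirrored here is that the testing set plays no part in choosing the tuning value, so `test_accuracy` remains an estimate of performance on data the model has not seen.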