
6.3 Experimental evaluation

6.3.1 Classification performance

In this section, we evaluate the classification performance of the proposed GP approaches from several perspectives: the choice between regression and classification, sparse implementation, selection of the mean and covariance functions, and utilization of the uncertainty predictions. For each aspect, we describe the parameter space in which we searched for the best-performing settings and provide detailed results in tables describing the classification performance in the three exteroceptive modalities: visual (attitude) and laser (attitude and position). For completeness, we restate that our publicly available classification dataset, used for training and testing of the different classifiers, contains approximately 1.7 km of indoor data with precise and accurate ground truth, in which anomalous measurements are labeled as anomalous and all others as normal. It contains 20 runs that represent standard conditions and 25 runs with failures of the exteroceptive modalities. We evaluate the approaches on a different dataset from several challenging environments (test cases) with approximately 1.2 km of traveled distance. More details about the dataset can be found in Chapter 6. In our training scenario, the data were shuffled randomly and normalized so that each feature has zero mean and unit variance. Unless specified otherwise, every evaluation reports the following metrics: True Positives (TP), False Positives (FP), True Negatives (TN), False Negatives (FN), Precision (PR), G-mean value (G), and Area under the Receiver Operating Characteristic (ROC) Curve (AUC). All metrics are evaluated on the testing portion of the dataset at the classifier threshold that yields the best G-mean value on the training data.
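The metrics and the threshold selection described above can be illustrated with a minimal sketch. It assumes the usual definition of the G-mean as the geometric mean of sensitivity and specificity, and it assumes that the positive class corresponds to anomalous samples; the function names, the toy data, and the candidate thresholds are hypothetical and serve only as an illustration.

```python
import numpy as np

def confusion_counts(y_true, scores, threshold):
    """Return (TP, FP, TN, FN); scores >= threshold are labeled positive."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(scores, dtype=float) >= threshold
    tp = int(np.sum(y_pred & y_true))
    fp = int(np.sum(y_pred & ~y_true))
    tn = int(np.sum(~y_pred & ~y_true))
    fn = int(np.sum(~y_pred & y_true))
    return tp, fp, tn, fn

def precision(tp, fp):
    """PR = TP / (TP + FP); defined as 0 when nothing is predicted positive."""
    return tp / (tp + fp) if tp + fp > 0 else 0.0

def g_mean(tp, fp, tn, fn):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    sensitivity = tp / (tp + fn) if tp + fn > 0 else 0.0
    specificity = tn / (tn + fp) if tn + fp > 0 else 0.0
    return float(np.sqrt(sensitivity * specificity))

def best_threshold(y_train, scores_train, candidates):
    """Pick the candidate threshold with the highest G-mean on training data."""
    values = [g_mean(*confusion_counts(y_train, scores_train, t))
              for t in candidates]
    return candidates[int(np.argmax(values))]

# Toy training data (hypothetical): 1 = anomalous, 0 = normal.
y_train = [0, 0, 1, 1]
scores_train = [0.1, 0.2, 0.35, 0.8]
threshold = best_threshold(y_train, scores_train, [0.15, 0.3, 0.5])
```

The selected threshold would then be kept fixed when computing TP, FP, TN, FN, PR, and G on the testing portion of the data.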

6.3.1.1 Classification and regression

In the comparison of GP classification and one-class regression, the gradient-based hyper-parameter optimization found a solution for both methods. However, the resulting one-class regression models predicted only one class, no matter what inputs were given. This behavior is described in the literature [Kemmler et al., 2013]. We never encountered such problems with the GP classification, and the hyper-parameter optimization always converged on our training dataset. Therefore, we decided to pursue solely the GP classification trained with both anomalous and normal data in our further analyses.

Sparse implementation. We carried out a thorough evaluation of two sparse implementations: approximation by subsampled data (i.e., a randomly or evenly selected subset of the data) and the FITC approximation. All experiments were evaluated using GP classification with a zero mean function and a squared exponential kernel with automatic relevance determination.

1. Subsampled data

Selecting a subset of the data is one of the most straightforward approaches to a sparse implementation of GP. We study the influence of different features and of the number of negative samples. The number of positive samples is fixed and equals the maximum number of positives available in our dataset. We therefore compare the performance for the following settings:

• features (norm of innovations or innovation vector)

• balanced data (the number of positive samples equals the number of negative samples)

• 10%, 30%, 50%, or 100% of all negative samples
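The subsampling settings above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: it assumes labels 1 (positive) and 0 (negative), and the function name, the `fraction` and `mode` parameters, and the toy data are hypothetical.

```python
import numpy as np

def subsample_negatives(X, y, fraction=None, mode="random", seed=0):
    """Keep all positives and a subset of the negatives.

    fraction=None gives balanced data (as many negatives as positives);
    otherwise `fraction` of all negatives is kept (e.g. 0.1, 0.3, 0.5, 1.0).
    """
    X, y = np.asarray(X), np.asarray(y)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    if fraction is None:
        n_keep = min(len(pos_idx), len(neg_idx))
    else:
        n_keep = int(round(fraction * len(neg_idx)))
    if mode == "random":
        rng = np.random.default_rng(seed)
        kept = rng.choice(neg_idx, size=n_keep, replace=False)
    elif mode == "even":
        # Evenly spaced positions along the sequence of negative samples.
        positions = np.linspace(0, len(neg_idx) - 1, n_keep).astype(int)
        kept = neg_idx[positions]
    else:
        raise ValueError("mode must be 'random' or 'even'")
    idx = np.sort(np.concatenate([pos_idx, kept]))
    return X[idx], y[idx]

# Toy data (hypothetical): 2 positives and 8 negatives.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
X_bal, y_bal = subsample_negatives(X, y)                       # balanced
X_half, y_half = subsample_negatives(X, y, 0.5, mode="even")   # 50% of negatives
```

Keeping all positives in every setting mirrors the fixed number of positive samples stated above; only the negative class is thinned.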

We list the results in Tab. A.1, A.2, A.3 in Appendix A.

2. FITC approximation

The FITC approximation based on inducing points is considered the baseline technique for sparse implementations. In this evaluation, we study the following aspects:

• features (norm of innovations or innovation vector)

• number of inducing points

• range of inducing points (fixed or defined by minimum and maximum of the features)

• distribution of inducing points (equispaced or random)
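The placement options for the inducing points can be sketched as follows. This is only an illustration under stated assumptions: the function and its parameters are hypothetical, and for multi-dimensional features the equispaced points are placed along the diagonal of the bounding box, which is a simplification chosen here and not necessarily the layout used in the experiments.

```python
import numpy as np

def inducing_points(X, n_points, distribution="equispaced", bounds=None, seed=0):
    """Place n_points inducing inputs inside a per-feature range.

    bounds=None uses the minimum and maximum of each feature in X;
    bounds=(lo, hi) applies the same fixed range to every feature.
    """
    X = np.asarray(X, dtype=float)
    X = X.reshape(len(X), -1)              # (n_samples, n_features)
    n_features = X.shape[1]
    if bounds is None:
        lo, hi = X.min(axis=0), X.max(axis=0)
    else:
        lo = np.full(n_features, bounds[0], dtype=float)
        hi = np.full(n_features, bounds[1], dtype=float)
    if distribution == "equispaced":
        # Same relative position in every feature (diagonal of the box).
        t = np.linspace(0.0, 1.0, n_points)[:, None]
    elif distribution == "random":
        rng = np.random.default_rng(seed)
        t = rng.uniform(size=(n_points, n_features))
    else:
        raise ValueError("distribution must be 'equispaced' or 'random'")
    return lo + t * (hi - lo)

# Toy 1-D feature (hypothetical), e.g. a norm of innovations.
X = np.array([0.0, 2.0, 4.0])
Z_equi = inducing_points(X, 5)                                  # data-driven range
Z_rand = inducing_points(X, 5, "random", bounds=(0.0, 10.0))    # fixed range
```

For a one-dimensional feature such as the norm of innovations, the equispaced case reduces to a plain linear grid between the smallest and largest observed value.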

Mean and covariance functions. In the search for the best mean and covariance functions, we follow common practice and set the mean function to be constant and zero. The search therefore includes only the features and different covariance functions:

• features (norm of innovations or innovation vector)

• covariance functions (SE, RQ, Matern 1/2, 3/2, 5/2)

We list the results in Tab. A.7, A.8, A.9 in Appendix A.
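For reference, the compared covariance functions can be written down in their standard textbook forms. The sketch below assumes automatic relevance determination, i.e. one length scale per feature; the function names and the hyper-parameter names (`lengthscales`, `alpha`, `sf2`) are hypothetical and do not reproduce the toolbox used in the experiments.

```python
import numpy as np

def scaled_distance(X1, X2, lengthscales):
    """Pairwise distance after dividing each feature by its length scale."""
    ell = np.asarray(lengthscales, dtype=float)
    A = np.asarray(X1, dtype=float) / ell
    B = np.asarray(X2, dtype=float) / ell
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.sqrt(sq)

def se(r, sf2=1.0):
    """Squared exponential (SE) covariance."""
    return sf2 * np.exp(-0.5 * r ** 2)

def rq(r, alpha=1.0, sf2=1.0):
    """Rational quadratic (RQ) covariance with shape parameter alpha."""
    return sf2 * (1.0 + r ** 2 / (2.0 * alpha)) ** (-alpha)

def matern(r, order="3/2", sf2=1.0):
    """Matern covariance for smoothness 1/2, 3/2, or 5/2."""
    if order == "1/2":
        scale, poly = 1.0, 1.0
    elif order == "3/2":
        scale = np.sqrt(3.0)
        poly = 1.0 + scale * r
    elif order == "5/2":
        scale = np.sqrt(5.0)
        poly = 1.0 + scale * r + 5.0 * r ** 2 / 3.0
    else:
        raise ValueError("order must be '1/2', '3/2', or '5/2'")
    return sf2 * poly * np.exp(-scale * r)

# Toy inputs (hypothetical): two features with different length scales.
X1 = np.array([[0.0, 0.0]])
X2 = np.array([[0.0, 0.0], [0.0, 2.0]])
r = scaled_distance(X1, X2, lengthscales=[1.0, 2.0])   # distances 0 and 1
K_se = se(r)
K_m12 = matern(r, "1/2")
```

The SE kernel is the smoothest of the five, while the Matern family with smoothness 1/2 produces the roughest sample functions, which is one way to read the differences between modalities reported in the tables.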

Uncertainty of predictions. As the GP directly outputs the uncertainty in terms of predicted variance, we study the possibility of discarding predictions that have high uncertainty. We tested one scenario at different levels of the uncertainty threshold:

• predictions with uncertainty greater than a given threshold are classified as negatives

We list the results in Tab. A.10, A.11, A.12 in Appendix A.
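The tested scenario can be sketched as follows. This is one possible reading, assuming that the threshold is expressed as a fraction of the maximum predicted variance and that the retained share is the fraction of predictions at or below that limit; the function name and the toy values are hypothetical.

```python
import numpy as np

def reject_uncertain(y_pred, variance, level):
    """Reclassify overly uncertain predictions as negative (0).

    level is a fraction of the maximum predicted variance (e.g. 0.8 for 80%).
    Returns the adjusted labels and the fraction of predictions retained.
    """
    y_out = np.asarray(y_pred).copy()
    variance = np.asarray(variance, dtype=float)
    limit = level * variance.max()
    uncertain = variance > limit
    y_out[uncertain] = 0
    retained = 1.0 - float(uncertain.mean())
    return y_out, retained

# Toy predictions (hypothetical): 1 = anomalous, 0 = normal.
y_pred = np.array([1, 1, 0, 1])
variance = np.array([0.1, 0.9, 0.5, 1.0])
y_adj, retained = reject_uncertain(y_pred, variance, level=0.8)
```

Because uncertain positives are turned into negatives, this treatment can remove FP but can equally remove TP, so the level has to be chosen per modality.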

6.3.1.2 Summary and discussion of the classification performance

The purpose of this section is to summarize the classification performance and discuss the results. We assess the individual evaluations in the order in which they were presented.

Sparse implementation - subsampled data. The main reason to use sparse implementations is to relieve the computational burden, because data with more than a few thousand samples are difficult to handle in reasonable time. In our case, the processing time for subsampled data ranged from a few seconds for the balanced data to several hours for the whole dataset (e.g., 6 hours for VO attitude). It is evident that using more negatives is beneficial; presumably, balanced data do not provide enough samples to cover the whole feature space. There is no direct indication that using all the samples increases the performance the most (the best results were obtained for 50% of negatives, see Tab. A.1 model #8 for VO attitude, Tab. A.2 model #7 for laser attitude, and Tab. A.3 model #8 for laser position). Results with balanced sets have more FP, but the difference is not crucial, especially if the application aims for a very fast training phase.

Sparse implementation - FITC approximation. As expected, the FITC implementation performed slightly better than straightforward subsampling for some parameter settings. The other parameters searched, such as the number of inducing points, their range, and their distribution, did not have a large effect on the final results. The best results for all modalities were reached by model #19 with an equidistant distribution of 5 inducing points over the whole feature space (see Tab. A.4, A.5, and A.6). It is worth mentioning that there are models for which the training failed and which predicted only one class. Therefore, we claim that it is necessary to search the parameter space for the best settings before a proper deployment.

Mean and covariance functions. Once more, the need to search for the best parameters is apparent, and the choice of covariance function proves to be very application specific. For instance, in the case of VO attitude, the SE and RQ kernels outperform the Matern kernels (see Tab. A.7 models #2 and #4); in the case of the laser modalities, however, the Matern kernels perform similarly (see Tab. A.8 and A.9). We thereby confirm that the smooth SE kernel with automatic relevance determination, a frequent choice among researchers, is also a very viable option for our anomaly detection problem.

Uncertainty of predictions. Estimating the uncertainty of predictions is often discussed in works implementing the GP [Santamaria-Navarro et al., 2015] [Kemmler et al., 2013]. Having the prediction as well as its uncertainty helps to identify cases that are too far from the training set and therefore have large uncertainty. Our proposed treatment of uncertain predictions (i.e., assigning overly uncertain predictions to the negative class) decreased the number of FP and therefore increased the precision. The threshold was always set as a percentage of the maximum estimated uncertainty. The most noticeable results were observed for the VO modality, where the precision increased by about 70% for a threshold limit of 80%, which retained 96% of valid data (model #3 in Tab. A.10). However, the results also show that the threshold limit is very application specific and has to be tuned to match the performance expectations. For instance, setting the threshold as low as 50% can retain 91% of the VO modality data but only 12% of the laser data, which is by no means a desirable outcome. Unfortunately, when we tried to discard uncertain predictions in the best models we have obtained so far (model #19 in Tab. A.4, Tab. A.5, Tab. A.6), the precision and G-mean values decreased, since the threshold discarded too many TP in the process. In the end, using uncertainty as an additional information source can be beneficial in some cases; however, we have shown that it does not provide any further performance improvement in an already fine-tuned model.

In document Data fusion for localization using state estimation and machine learning (Page 77-80)