• No results found

1.2 Classification process

1.2.3 Methods to test the classification system

If we were able to try the classifier on all possible input objects, we would know exactly how accurate the classifier is [73]. Unfortunately, this is hardly possible, since only a subset (generally a small subset) of all the possible inputs is available. Thus only an estimated accuracy can be evaluated [73]. When a classification process is performed the data are generally divided into two subsets, one called training set which is used during the training process to train the classifier and a test set which is used to evaluate the performance of the classifier on unseen data. There are a lot of different procedures to evaluate the performance of a classifier.

Advanced vibration analysis for CBM programs 17

Figure 1.9 ROC curves related to different classifiers, each of which is represented by a different color.

Resubstitution error method

Denoting with D the whole data set, the simplest procedure to eval- uate the performance of a classifier is based on the evaluation of the resubstitution error, i.e., the error on the training set. This procedure is called resubstitution error method (shortly R-method) and consists in training the classifier C using the whole data set D and then test the classifier again on D [73]. Thus the training and the test set coincide with D (Fig. 1.10).

Figure 1.10R-method to create training and test sets. The train- ing set coincides with the test set and both coincide with the whole original data set.

18 Fundamentals of Pattern Recognition Since training an algorithm and evaluating its performance on the same data may lead to an overoptimistic result [7, 75, 73], other testing methods were introduced such as the cross-validation (CV) algorithm. A general description of the CV strategy has been given by Geisser [43]. The basic idea is that a good estimate of the performance would be obtained using new data to test the classifier, i.e., data different from the training data [7, 43, 88]. The main idea behind CV is to split the data, once or several times, for estimating the performance of a specific model. Once split, a part of the samples is used to train the model, while the remaining samples are used to test the model and thus to estimate its performance. Compared to the R-method, CV avoids the overoptimistic result problem since the training samples are independent from the test samples. The major interest of CV lies in the universality of the data splitting heuristics. It only assumes that data are identically distributed, and training and test samples are independent, which can even be relaxed [7, 73]. This makes the CV algorithm suitable for many applications in many frameworks, such as regression [43, 118], density estimation [112, 118], and classification [8, 29] among many others. However, some CV procedures have been proved to fail for some model selection problems, depending on the goal of model selection, namely, estimation or identification [7, 73]. The issue of how to organize the training and test sets has been around for a long time [73, 131]. Hereafter we summarize some of the most used CV procedures.

Hold-out method

One of the simplest CV procedures is called Hold-out (shortly H-method) [29]. This method relies on a single split of the original data. Part of data (training set) is used for training the algorithm, and the remaining data (test set) are used to evaluate the performance of the algorithm [73]. However no study exists to define how to split the data, i.e., which is the optimal percentage of samples of the original data which should be used to create the training and the test sets (Fig. 1.11). However, generally, the training set is bigger than the test set [7, 73].

In most real applications, only a limited amount of data is avail- able, thus the single split can result in either a training or a test set too small to be significant [73]. This leads to the idea of splitting the data more than once. The idea is that a single data split yields a validation

Advanced vibration analysis for CBM programs 19

Figure 1.11H-method to create training and test sets. The orig- inal data set is split once. One part forms the training set and the other one the test set.

estimate of the performance, then averaging over several splits yields a cross-validation estimate [7]. For example considering the H-method, we can average the performance results related to several experiments based on the H-method, where each experiment corresponds to a dif- ferent data split [7].

K-fold cross-validation

K -fold cross-validation is one of the most used test procedure. The algorithm consists in choosing an integer K (preferably a factor of N , where N is the number of samples in the original data set D) and randomly divide D into K subsets of size NK. Of the K subsets, one is retained as test set, while the others are used as training set. This procedure is repeated K times choosing, each time, a different subset as test set. At the end the whole accuracy is evaluated as the average of the K estimated accuracies.

The question of how to select K is still open, even though indications can be given towards an appropriate choice [73]. When the goal of model selection is estimation, it is often reported that the optimal K is between 5 and 10 [7, 49, 73]. Fig. 1.12 provides an example of K-fold CV with K = 3.

Leave-one-out

Leave-one-out [2, 43, 73, 118] is the most classical exhaustive CV pro- cedure [7].

It consists in a K -fold cross-validation with K = N where N rep- resents the number of objects in the original data set D [73].

20 Fundamentals of Pattern Recognition

Figure 1.12 K-cross validation method to create training and test sets (example with K = 3). The original data set D is ran- domly divided into 3 subsets of size N3, with N the number of objects of the data set. Of the 3 subsets, one is retained as test set, while the others are used as training set. The procedure is re- peated 3 times choosing, each time, a different subset as test set. The whole accuracy is evaluated as the average of the 3 estimated accuracies.

Dietterich’s 5x2cross validation

The Dietterich’s 5x2cross validation method [30] consists in 5 repeti- tions of a 2-fold cross validation. This gives 10 pairs of error (accuracy) estimates which can be successively averaged to compute the final error (accuracy) [73].