2.5 Machine Learning Algorithms
2.5.6 Comparison of Algorithms
Several studies comparing remote sensing classification algorithms have been presented. [100] compares an SVM to a ML classifier, an ANN classifier and a DT classifier for land cover classification based on satellite images. A spatially degraded TM image at a resolution of 256.5 m per pixel and a corresponding reference map were used in the evaluation study. Random sampling was used for training sample selection. With seven input variables, the SVM was reported to be more accurate than the DT and ML classifiers. It gave significantly better results than the ANN classifier in six of the 12 training cases and higher, though not significantly higher, accuracies than the ANN in five of the remaining six training cases. With only three input variables, the ANN performed better than the SVM. The explanation given was that the applicability of the SVM to non-linear decision boundaries depends on whether these boundaries can be transformed into linear ones by mapping the input data into a high-dimensional space, and that with three input variables the SVM might have less success in performing this transformation for complex decision boundaries. The ML classifier was reported to give the least accurate results in most of the training cases. Further comparisons regarding algorithm stability and speed as well as impacts of non-algorithmic factors were also reported.
In [112], multiple classifiers were compared to their base classifiers to obtain baseline results. The algorithms used were the minimum Euclidean distance (MED) classifier, the Gaussian ML classifier, a conjugate-gradient backpropagation (CGBP) ANN algorithm with two and three layers, a decision table, a j4.8 [107] DT, which is an implementation of the C4.5 [129] revision 8 decision tree, and a simple 1R classifier [169], which uses only one feature to determine a class. MED and 1R gave very poor results. The ranking of the single classifiers averaged over the four test cases, in descending order, is: CGBP, j4.8 DT, decision table, ML. In most cases the best results were achieved using the boosting algorithms on the j4.8 DT classifier. In one case boosting the 1R classifier and boosting the decision table were superior. In most cases boosting gave better results than bagging when the two methods were compared using the same base classifier. Relatively good results were also achieved using consensus theory on the CGBP algorithm, where the logarithmic opinion pool performed better than the linear opinion pool.
A SAM was compared to a ML classifier using hyperspectral data in [113]. Different texture measures were compared and the ML classifier was reported to yield generally higher accuracies than the SAM. Sieve and clump post-classification algorithms were applied to the classification result to remove isolated pixels, which occurred due to the pixel-based approach. This procedure improved the results by several percentage points.
[20] compared a ML classifier, a DT, a DT using the boosting algorithm, an SVM and fused SVMs. The fused SVMs use an SVM-based decision fusion of several SVMs trained on the individual data sources. The comparison was based on multisensor data sets. The fused SVMs were reported to outperform the other algorithms. Among the single classifiers, the boosted DT and the SVM performed best.
In the context of land cover change, [170] compared the ML classifier, SVMs and DTs based on two Landsat images, one Landsat 5 TM image from 1986 and one Landsat 7 enhanced thematic mapper plus (ETM+) image from 2001. The study concluded that high overall accuracies were obtained for all three techniques. However, the DTs performed slightly better, with a difference of 3 percentage points for the first image and a difference of 0.4 percentage points for the second image.
Based on hyperspectral remote sensing images, SVMs were compared to a RBF neural network and a k-nearest neighbor classifier in [123], where SVMs proved to be much more effective in terms of classification accuracy, computational time and stability with respect to parameter settings. Furthermore, it was stated that SVMs have low sensitivity to the Hughes phenomenon (or Hughes effect) [171], which describes how the predictive power of a classifier can decrease as the dimensionality increases. In addition, four strategies for using binary SVMs on multiclass problems were assessed: OAO, OAA and two hierarchical tree structures, of which the first generates a balanced tree and the second uses the OAA method in a tree structure. The parallel architectures (OAO and OAA) were reported to perform slightly better than the hierarchical tree structures, which was partly explained by the risk of error propagation through the tree structure. Another explanation was that simple information such as the class prior probabilities, which were used to create the tree structure, cannot properly take into account the underlying affinities among individual classes.
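To illustrate how the two parallel architectures differ in the number of binary machines they require, the following minimal sketch enumerates the binary problems for each strategy. It assumes that OAO and OAA stand for one-against-one and one-against-all, as is common usage; the function names and the class labels are hypothetical and are not taken from [123].

```python
from itertools import combinations


def one_against_one_pairs(classes):
    """OAO: one binary problem per unordered pair of classes, k*(k-1)/2 in total."""
    return list(combinations(classes, 2))


def one_against_all_tasks(classes):
    """OAA: one binary problem per class (that class versus all others), k in total."""
    return [(c, [other for other in classes if other != c]) for c in classes]


# Hypothetical class labels, used only for illustration.
classes = ["A", "B", "C", "D"]

oao = one_against_one_pairs(classes)
oaa = one_against_all_tasks(classes)

print(len(oao), "binary problems for OAO")
print(len(oaa), "binary problems for OAA")
```

For k classes, OAO therefore needs k(k-1)/2 binary SVMs, each trained on the samples of two classes only, whereas OAA needs k binary SVMs, each trained on all samples.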
Comparing algorithms is difficult in the case of remote sensing. First, the conditions have to be equal for each algorithm in terms of data, resolution, bands used and classification schemes. This is usually true for comparative studies. However, the differences between algorithms are often very small and depend highly on these conditions. Therefore, an algorithm that performed best in one study might not be the best for a different problem. This effect is comparable to the No Free Lunch Theorem described in [172] for optimization, which states that any superior performance on one problem is paid for by inferior results on other problems. Some studies also use several different test cases and obtain varying results regarding the best classifier.
Data Acquisition and Test Areas
The available input and reference data is the most important factor in tree species classification. The reference data is critical, as it is the information that classifiers are trained and validated on. As described by Congalton and Green in [27], the expression ground truth data is often used. However, the term ground truth data gives the impression that this data set is true and therefore correct, an assumption that is violated in many cases, as the ground truth data set is also subject to measurement errors or misclassifications. Nevertheless, it needs to be assumed to be correct in order to implement the training and validation phases. It is suggested to use the term reference data instead, which states that the data is used as a reference and does not imply that the data set is correct.
In order to understand the algorithms and methods developed in the following chapters, it is important to have detailed knowledge of the data sets that were used. Section 3.1 will therefore introduce additional preprocessing steps and the resulting data sources that will be used for the classification. Section 3.2 will describe the available reference data sources in more detail, including an analysis of their quality. The test areas are described in more detail in sections 3.3.1 to 3.3.3. Each test area has a size of about 300 km² and in each area up to nine tree species groups need to be classified.
For each test area, the available reference data was analyzed and validated and the results are presented in the corresponding sections. Due to missing reference data, in some test areas only a subset of the species groups can be properly trained and therefore classified. According to the specification given by Congalton and Green as described in section 2.4 and in [27], for a map with less than 12 classes and less than one million acres (∼ 4047 km²), 50 samples per class should be used as test set. Both conditions are met in all the training areas described below and therefore a minimum of 50 samples per species for the test set is used. Two separation schemes for the available training data are widely used:
• 2/3 training samples, 1/3 test samples
• 1/2 training samples, 1/4 validation samples, 1/4 test samples
Combining these rules with the minimum number of 50 test samples that is needed for a reliable accuracy analysis, a minimum number of 150–200 reference samples per tree species for training and testing is required. These sample sizes are hard to achieve for tree species classification applications, as it is not possible to use photo interpretation as reference data. Even for experts it can be very hard or even impossible to correctly classify all nine species groups used here in airborne images, and therefore field measurements are needed. As field measurements are very expensive, reference data is rare. More detail on this problem will be given in section 3.2.
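The range of 150–200 reference samples follows directly from the two separation schemes: if the test share must contain at least 50 samples, the total is 50 divided by the test fraction. The following minimal sketch makes this arithmetic explicit; the function name and its parameters are hypothetical and serve only as an illustration.

```python
import math
from fractions import Fraction


def min_reference_samples(test_fraction, min_test_samples=50):
    """Smallest total number of reference samples per class such that
    the test share contains at least `min_test_samples` samples."""
    return math.ceil(min_test_samples / test_fraction)


# Scheme 1: 2/3 training samples, 1/3 test samples
total_scheme_1 = min_reference_samples(Fraction(1, 3))

# Scheme 2: 1/2 training, 1/4 validation, 1/4 test samples
total_scheme_2 = min_reference_samples(Fraction(1, 4))

print(total_scheme_1, total_scheme_2)  # prints: 150 200
```

Exact fractions are used instead of floating-point numbers so that the rounding up is not affected by representation errors.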
Apart from the reference data, the input data that the classification is performed on is very important. Especially for a complex task such as tree species classification, it is necessary to ensure that the species in the chosen classification scheme can be discriminated based on the available input data. This problem will be analyzed in more detail in section 4.2.
The combination of large test areas (∼ 300 km²) and high resolution data makes it necessary to subdivide the area for data storage and data handling. Therefore, all input data is tiled into square tiles of 500 m x 500 m, such that the lower left coordinates of each tile can be divided by 500 without remainder.
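The tiling rule can be sketched as follows. This is a minimal illustration, assuming coordinates are given in meters in a projected coordinate system; the function names and the example coordinates are hypothetical and do not describe the actual implementation used here.

```python
import math

TILE_SIZE_M = 500  # tile edge length in meters, as stated in the text


def tile_origin(x, y, tile_size=TILE_SIZE_M):
    """Lower left corner of the tile that contains the point (x, y).

    Both returned coordinates are integer multiples of `tile_size`."""
    return (
        math.floor(x / tile_size) * tile_size,
        math.floor(y / tile_size) * tile_size,
    )


def tile_bounds(x, y, tile_size=TILE_SIZE_M):
    """Bounding box (x_min, y_min, x_max, y_max) of the tile containing (x, y)."""
    x_min, y_min = tile_origin(x, y, tile_size)
    return (x_min, y_min, x_min + tile_size, y_min + tile_size)


# Hypothetical point coordinates in meters
print(tile_origin(412345.7, 5654321.2))  # prints: (412000, 5654000)
```

With 500 m x 500 m tiles, each tile covers 0.25 km², so a test area of about 300 km² corresponds to roughly 1200 tiles.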
3.1 Derivative Products
For all test areas, secondary derivative products were calculated from the recorded airborne Light Detection And Ranging (LIDAR) data. These derivative products give easier access to specific information in the data sources. The first derivative product is the normalized digital surface model (nDSM), described in section 3.1.1. A description of LIDAR intensity data is given in section 3.1.2, and the second derived product, the region images, is described in section 3.1.3.