6.1 Performance on individual disasters
6.1.2 Multiclass classification
Depending on the purpose of the damage assessment, a more precise granularity than binary separation is needed. While most papers on automated building damage assessment have solely focused on binary classification, arguing that it is too difficult to distinguish more categories from satellite imagery, this study investigates the performance of the model on multiclass classification. A distinction is made between four classes, according to the original labels of the xBD dataset. These classes are no damage, minor damage, major damage and destroyed.
Figure 6.2: Scatter plots of the percentage of data points belonging to the destroyed class versus the AUC (left) and the recall on the destroyed class (right). Each dot represents one disaster, the blue line shows the best polynomial fit, and the blue area the 95% confidence interval.
The set-up of the experiment is the same as for the binary classification: a model is trained for each disaster and the split of training and test data is 80% and 10% respectively. To check whether the model is learning, one can inspect the training losses, see Figure 6.3. We observe that the loss decreases over time, which indicates learning. Moreover, a setting of 100 epochs seems reasonable since the loss has stabilized by then. Training for 100 epochs took between 20 minutes and 10 hours, depending on the training data size. Producing the test results took between 3 and 26 seconds.
The performance of the model on each disaster is displayed in Table 6.2. The harmonic F1, macro F1, and recall on each class are taken as performance measures. Similar to the binary case, one can observe large differences in performance between disasters across all measures. The most striking differences are in the harmonic F1 score, where five disasters have a score of zero. This stems from the fact that the harmonic F1 immediately goes to zero if the F1 score of one of the classes is zero. Looking at the table, for those five disasters the recall of at least one of the classes is zero, which results in a zero F1 score for that class and thus a harmonic F1 of zero. The macro F1, the most common measure for multiclass classification, also shows large differences between the disasters. In the macro F1, a very low recall score in one class can be compensated by a high score in another class. As a result, ranking the disasters by macro F1 gives a different order than ranking them by harmonic F1, which shows the importance of the choice of performance measure.
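The difference between the two aggregate measures can be illustrated with a minimal sketch. It assumes that the harmonic F1 is the harmonic mean of the per-class F1 scores and that the macro F1 is their arithmetic mean. The function names and the per-class scores below are hypothetical and chosen only for illustration; they are not taken from Table 6.2.

```python
def macro_f1(per_class_f1):
    """Arithmetic mean of the per-class F1 scores."""
    return sum(per_class_f1) / len(per_class_f1)


def harmonic_f1(per_class_f1):
    """Harmonic mean of the per-class F1 scores.

    By convention the result is zero as soon as one class has an F1 of zero,
    which avoids a division by zero in the reciprocal sum.
    """
    if any(score == 0 for score in per_class_f1):
        return 0.0
    return len(per_class_f1) / sum(1.0 / score for score in per_class_f1)


# Hypothetical per-class F1 scores (no, minor, major, destroyed).
one_class_missed = [0.95, 0.00, 0.60, 0.80]
all_classes_found = [0.95, 0.40, 0.60, 0.80]

for name, scores in [("one class missed", one_class_missed),
                     ("all classes found", all_classes_found)]:
    print(f"{name}: macro={macro_f1(scores):.3f}, "
          f"harmonic={harmonic_f1(scores):.3f}")
```

In the first case the macro F1 stays well above zero because the three other classes compensate for the missed class, while the harmonic F1 collapses to zero. This is the mechanism behind the five zero scores discussed above.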
            F1            Recall                      Class percentage (%)
Disaster    Harm.  Macro  No     Min.   Maj.   Des.   No    Min.  Maj.  Des.
Palu        0.000  0.702  0.981  1.000  0.000  0.824  83.8   0.0   1.8  14.4
Sunda       0.000  0.332  0.997  1.000  0.000  0.286  98.7   0.0   0.7   0.6
Midwest     0.000  0.372  0.999  0.000  0.429  0.000  97.0   1.7   1.0   0.3
Puna        0.000  0.449  0.980  0.000  0.000  0.800  78.6   1.6   1.6  18.2
Florence    0.000  0.466  0.975  0.000  0.922  0.000  76.8   2.3  19.9   1.0
Nepal       0.073  0.482  0.989  0.010  0.459  0.368  74.3  13.0  12.1   0.6
Harvey      0.381  0.538  0.899  0.162  0.861  0.143  50.5  12.1  36.1   1.4
Michael     0.495  0.542  0.914  0.401  0.413  0.352  65.0  24.0   8.4   2.6
Matthew     0.543  0.577  0.382  0.858  0.315  0.676  16.9  50.0  18.1  15.0
Tuscaloosa  0.706  0.740  0.952  0.716  0.500  0.781  74.4  16.1   3.7   5.8
Joplin      0.758  0.787  0.961  0.705  0.516  0.907  55.9  15.6   7.4  21.1
Moore       0.771  0.802  0.995  0.500  0.766  0.887  87.1   4.0   2.4   6.5
Guatemala   1.000  1.000  1.000  1.000  1.000  1.000  93.8   0.0   0.0   6.2
Table 6.2: Results of multiclass classification per disaster, sorted by the harmonic F1 score. No, Min., Maj., Des. indicate the no damage, minor damage, major damage, destroyed classes respectively.
Looking closely at the recall per class, the majority class always has the highest score. Previous research has suggested that it is harder to distinguish the minor and major damage classes [32,28], but the results of this experiment do not support that statement. Recall scores for those classes are not good in all cases, but they are not significantly worse than the scores of the destroyed class. It seems more likely that low recall scores for the minor and major damage classes can be attributed to the low percentage of data points belonging to those classes. This hypothesis is further explored in Section 6.1.4, where it is shown that this is indeed part of the explanation. From this experiment, it can thus be concluded that it is too simple to state that minor and major damage cannot be recognized well. Whether this statement holds depends strongly on the disaster and on the percentage of labeled samples in those classes.
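As a rough illustration of this hypothesis, the sketch below relates the share of the minor damage class to its recall, using the values from Table 6.2. The use of a Pearson correlation coefficient is an assumption made here for illustration and is not the analysis of Section 6.1.4. Disasters whose minor damage share is reported as 0.0% are excluded, on the assumption that their recall is not informative.

```python
from math import sqrt

# (class share in %, recall) of the minor damage class, copied from Table 6.2.
MINOR = {
    "Palu": (0.0, 1.000), "Sunda": (0.0, 1.000), "Midwest": (1.7, 0.000),
    "Puna": (1.6, 0.000), "Florence": (2.3, 0.000), "Nepal": (13.0, 0.010),
    "Harvey": (12.1, 0.162), "Michael": (24.0, 0.401),
    "Matthew": (50.0, 0.858), "Tuscaloosa": (16.1, 0.716),
    "Joplin": (15.6, 0.705), "Moore": (4.0, 0.500),
    "Guatemala": (0.0, 1.000),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def share_recall_correlation(table):
    """Correlate class share with recall, skipping classes with a 0.0% share."""
    pairs = [(share, recall) for share, recall in table.values() if share > 0.0]
    shares, recalls = zip(*pairs)
    return pearson(shares, recalls)


print(f"share vs. recall (minor damage): {share_recall_correlation(MINOR):.2f}")
```

A positive coefficient is consistent with the hypothesis, but with ten disasters it is only an indication. The same function can be applied to the major damage or destroyed columns of the table.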
Due to the lack of a suitable performance measure for multiclass imbalanced data, it is hard to compare and understand the performance on different disasters by looking at the given performance measures alone. A confusion matrix is another way to understand performance. This matrix is especially helpful in our case, where the classes are ordered and we would therefore rather have the model predict classes that are close to the actual label. For example, when the label is destroyed, predicting major damage is preferable to predicting no damage. Figure 6.4 shows the confusion matrices for two disasters, the Nepal flooding and the Joplin tornado.
In the confusion matrix of Nepal, we can see that the model has a strong tendency to predict no damage, even for destroyed samples. This is undesirable behavior and suggests that the model is not learning the correct damage-related features. To understand where the mispredictions come from, a qualitative analysis can be done. Figure 6.5 shows four examples of misclassified buildings, two where the level of damage is overpredicted and two where it is underpredicted. From these we can see that some mispredictions are explainable. For example, judging from the area of the building crop, the top left building does not seem to be fully surrounded by water because of the trees. When zooming, however, it can be seen that the building is fully surrounded by water, leading to a categorization of major damage. From the input given to the model, it is nevertheless understandable that the model labels the building as no damage. In the bottom left, it is very hard to see the building due to the blurriness and the small building size. However, it is understandable that the building is predicted as destroyed instead of the real label no damage, since the post image appears completely flooded such that no building is visible anymore, and the white pixels on the pre image might be interpreted as the original building. In the bottom right example the damage is overpredicted, probably due to the lighting, which makes the field appear like water. Nevertheless, the model also makes mistakes on cases that should be recognizable, such as the building on the top right in Figure 6.5. This small qualitative analysis indicates that the model is not learning all relevant features, but also that not all data points with damage show clear damage-related features.
Figure 6.4: Confusion matrices of the model trained and tested on the Nepal flooding (left) and Joplin tornado (right).
For Joplin, the confusion matrix shows a different pattern than for Nepal. In general, more samples are classified correctly, and those that are misclassified are often assigned to one of the adjacent classes. This is more desirable behavior and indicates that the model is learning relevant features from the data. Again, examining a few samples helps to understand the misclassifications, see Figure 6.6. From these we can see that most misclassifications are understandable. Sometimes the model even outperforms the human annotators, such as in the bottom left, where the ground truth label is clearly wrong. The top left shows minor damage, but this is difficult to recognize because the building does not show big holes or a blue tarp, which is the case for most minor damage caused by the Joplin tornado. The bottom right is wrongly predicted as destroyed, probably due to the debris in front of the building. The top right is clearly mispredicted, and it is unclear why the model mistook this building for one showing no damage. One possible explanation is that most damaged buildings show brownish damage features, which is not the case here. However, if this is the explanation, it indicates that the learned representation of damage features is not fully correct. In conclusion, the model seems to learn most damage-related features of the Joplin tornado well, sometimes even outperforming the human annotators.
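The notion of misclassifications landing in adjacent classes can be made measurable with a small sketch. It builds a confusion matrix for the four ordered classes and computes the share of errors that are exactly one class away from the true label. The function names and the toy labels are hypothetical; this summary statistic is not one reported in this study.

```python
CLASSES = ["no damage", "minor damage", "major damage", "destroyed"]


def confusion_matrix(y_true, y_pred, n_classes=len(CLASSES)):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def adjacent_error_share(matrix):
    """Fraction of misclassifications that are one class away from the label.

    Returns None when there are no misclassifications at all.
    """
    errors = 0
    adjacent = 0
    for true_label, row in enumerate(matrix):
        for predicted_label, count in enumerate(row):
            if true_label != predicted_label:
                errors += count
                if abs(true_label - predicted_label) == 1:
                    adjacent += count
    return adjacent / errors if errors else None


# Toy labels, encoded as indices into CLASSES (not data from the experiments).
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 3, 2, 1, 3, 0]

matrix = confusion_matrix(y_true, y_pred)
for name, row in zip(CLASSES, matrix):
    print(f"{name:>12}: {row}")
print("adjacent error share:", adjacent_error_share(matrix))
```

A value close to one corresponds to the pattern described for Joplin, where errors stay near the diagonal. A low value corresponds to errors that jump across classes, such as destroyed buildings predicted as no damage.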
The disadvantage of confusion matrices and qualitative analysis is that it is hard to compare a range of disasters based on them. Combining the per-disaster scores with the confusion matrices of Joplin and Nepal, it can be concluded that the multiclass model holds predictive value for a subset of disasters, including on the minor and major damage classes. When the features of the damage classes can be clearly distinguished by the human eye, the model seems to learn those distinctions. Conversely, vaguer and more mixed imagery makes it hard for the model to learn damage-related features and causes it to lean towards the majority class.