Human Evaluation - Automatic aviation safety reports classification

The secondary goal of this research arose from concerns expressed about how human analysts categorize the reports. Work on the first goal, applying machine learning to classify the safety reports automatically, confirmed those concerns: the models trained for taxonomies with a considerable number of examples were not performing as expected.

To diagnose this problem, we examined the misclassified examples. We discovered that labels are partially missing from some examples and, to a lesser extent, that some labels were incorrectly assigned. These discoveries led us to examine the current categorization process, which is described in the next paragraph.

The analysts specialize in one, two or three types of reports, which means that some domain knowledge is required to analyze each type of report. The analysts come from different backgrounds. Some of them are full-time analysts while others work part-time. In certain cases, temporary analysts categorize some reports. A single report is categorized by just one analyst and saved into the database. Once a report has been labeled, the labeling outcome is not checked again by another analyst. There are a few exceptions, such as an analyst reviewing a report for historical reasons and then “fixing” the labels after judging that something is not right. It is important to consider that the categorization is done among other tasks that must be performed in the analysis of the same report, so the analyst’s attention and expertise are not focused solely on categorizing reports. Additionally, we have identified that it is not completely clear when certain labels should be applied. During their training, the analysts are handed some documentation about the taxonomies and how to apply certain labels; however, this documentation is not extensive. Even though some label names are self-descriptive, this may not be enough, and a clear description should be provided. Furthermore, the large taxonomy poses a problem because it is not feasible for the analyst to go through the entire list of possible labels to annotate the report with the exact labels, as pointed out by [36]. Given these facts, we cannot consider our current dataset as the gold-standard ground truth for our machine learning effort: a prediction may be marked as a false positive when it is actually a true positive.

We strongly believe that some of the theories of content analysis [25] can be used to measure and improve the report categorization process. Consequently, we arranged an experiment to measure the reliability of the process using Krippendorff’s multi-valued α for

point to solve the problems exposed above.

The experiment was carried out with 5 analysts who participated voluntarily. Each analyst was asked to review up to 50 real reports of the report types that they categorize in their day-to-day work. If an analyst normally categorizes more than one report type, the 50 reports contain a proportional number of each report type. This proportion is calculated from the report type distribution in our dataset, which means that stratified sampling was employed instead of uniform sampling. For example, if two analysts specialize in the ASR type, each analyst labels the same 50 ASR reports. Analysts who categorize the GHR, CIR and ENV types each classify the same 37 GHR, 5 CIR, and 8 ENV reports. The number of 50 reports was agreed with the analysts according to their workload.
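The proportional split described above can be sketched as follows. This is a minimal illustration, not the procedure actually used: the function name, the largest-remainder rounding rule, and the per-type report counts are all assumptions. The counts in particular are hypothetical and were chosen only so that the output reproduces the 37/5/8 split mentioned in the text.

```python
from math import floor


def proportional_allocation(type_counts, sample_size):
    """Split sample_size across report types in proportion to type_counts.

    Uses the largest-remainder method (an assumption) so that the
    per-type quotas always sum exactly to sample_size.
    """
    total = sum(type_counts.values())
    exact = {t: sample_size * c / total for t, c in type_counts.items()}
    quotas = {t: floor(x) for t, x in exact.items()}
    leftover = sample_size - sum(quotas.values())
    # Give the remaining slots to the types with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda t: exact[t] - quotas[t], reverse=True)
    for t in by_remainder[:leftover]:
        quotas[t] += 1
    return quotas


# Hypothetical dataset counts, NOT taken from the source.
hypothetical_counts = {"GHR": 740, "CIR": 100, "ENV": 160}
quotas = proportional_allocation(hypothetical_counts, 50)
print(quotas)
```

Once the quotas are fixed, the same randomly drawn reports of each type would be shown to every analyst who handles that type, so that their labels can be compared report by report.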

A web application was created to gather the data for this experiment. The user interface was designed to be as similar as possible to the application the analysts normally use to categorize the reports. A screenshot of this user interface is displayed in Appendix A.3. We decided to create this application because we wanted the analysts to use a single interface to read and categorize the reports. Additionally, as the reports for this experiment were real, the web application allowed us to hide the existing labels of reports that were already classified, thereby preventing bias. The results were collected anonymously, so we cannot trace the answers back to a particular analyst.

The outcome of the experiment is the data needed to calculate the current inter-annotator agreement and to suggest improvements to the categorization process. After the suggestions are implemented, the experiment can be carried out again to determine whether they improved the process. Implementing the suggestions and measuring their impact is outside the scope of this research.
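To illustrate how inter-annotator agreement on multi-label data might be computed, the sketch below implements the general form of Krippendorff's α, α = 1 − D_o/D_e, where D_o is the observed disagreement within reports and D_e is the disagreement expected by chance. The distance between two label sets is taken here to be the Jaccard distance; this is an assumption made for illustration, and the distance used by the multi-valued α in this work may differ. The function names and label names are hypothetical.

```python
from itertools import permutations


def jaccard_distance(a, b):
    """Distance between two label sets: 0 if identical, 1 if disjoint."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def krippendorff_alpha(units, distance=jaccard_distance):
    """Krippendorff's alpha for set-valued annotations.

    units: list of lists; each inner list holds the label sets that
    different analysts assigned to the same report.
    """
    # Only reports labeled by at least two analysts are pairable.
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one report with two annotations")
    # Observed disagreement: average pairwise distance within each report.
    d_o = sum(
        sum(distance(a, b) for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: average distance over all pairs of annotations.
    d_e = sum(distance(a, b) for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("all annotations are identical; alpha is undefined")
    return 1.0 - d_o / d_e


# Hypothetical label names; two analysts, two reports, full agreement.
agree = [
    [frozenset({"bird strike"}), frozenset({"bird strike"})],
    [frozenset({"runway incursion"}), frozenset({"runway incursion"})],
]
# Same label sets, but the analysts systematically disagree.
disagree = [
    [frozenset({"bird strike"}), frozenset({"runway incursion"})],
    [frozenset({"bird strike"}), frozenset({"runway incursion"})],
]
alpha_agree = krippendorff_alpha(agree)
alpha_disagree = krippendorff_alpha(disagree)
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.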

Chapter 7

Results & Discussion

The results of all the experiments that were executed during this project are presented in the following subsections.

The following abbreviations are used in the tables: binary relevance (BR), logistic regression (LR), LightGBM (LGBM), decision tree (DT), multi-label decision tree (MLDT), support vector machines (SVM) and classifier chain (CC). Binary relevance and classifier chain are methods that require a base classifier; therefore, we give the method followed by the base classifier used. For example, BR LR means that the method used is binary relevance with a logistic regression base classifier.

Even though we report diverse metrics for our experiments, we are looking to optimize the macro F0.5 measure. We chose the macro average because we have an unbalanced dataset with almost a power-law distribution. Macro-averaging gives equal importance to each class, making it suitable to evaluate the overall model in our case. The F0.5 metric was chosen by business needs (see Section 2.8).
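As a brief illustration of these metrics, the sketch below computes the F-beta score, F_beta = (1 + beta^2) · P · R / (beta^2 · P + R), and contrasts macro- and micro-averaging. With beta = 0.5, precision weighs more than recall. The function names and the per-label counts are hypothetical and serve only to show how a rare, poorly predicted label pulls the macro average down while barely affecting the micro average.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 favors precision, beta > 1 favors recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def precision_recall(tp, fp, fn):
    """Precision and recall from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def macro_f_beta(counts, beta=0.5):
    """Mean of the per-label F-beta scores: every label weighs the same."""
    scores = [f_beta(*precision_recall(tp, fp, fn), beta) for tp, fp, fn in counts]
    return sum(scores) / len(scores)


def micro_f_beta(counts, beta=0.5):
    """F-beta on the pooled counts: frequent labels dominate the score."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f_beta(*precision_recall(tp, fp, fn), beta)


# Hypothetical (tp, fp, fn) counts: one frequent label, one rare label.
hypothetical_counts = [(90, 10, 10), (1, 1, 9)]
macro = macro_f_beta(hypothetical_counts)
micro = micro_f_beta(hypothetical_counts)
print(f"macro F0.5 = {macro:.4f}, micro F0.5 = {micro:.4f}")
```

Note that the macro score here is the mean of the per-label F-beta values, not the F-beta of the macro precision and macro recall; the two generally differ. As a sanity check of the formula, `f_beta(0.8115, 0.3421, beta=1.0)` is approximately 0.4813, which matches the micro F1 reported for BR LR in Table 7.1.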

7.1 Baseline Evaluation

Table 7.1 displays the results obtained using 5-fold cross-validation.

Table 7.1: Baseline results

                   BR LR    MLDT
Macro precision    0.4194   0.3215
Macro recall       0.1634   0.2746
Macro F1           0.2120   0.2862
Macro F0.5         0.2773   0.3020
Micro precision    0.8115   0.4742
Micro recall       0.3421   0.4214
Micro F1           0.4813   0.4462
Micro F0.5         0.6367   0.4626
