4.4 Evaluation of SADGE Framework
4.4.1 The More Efficient and Effective Approach for Extracting Architectural Issues
We conducted two experiments to determine which of the manual, automatic and semi-automatic approaches extracts architectural issues most efficiently. The results would indicate which of the two candidate approaches, automatic or semi-automatic, should be included in SADGE. The results would also evaluate the efficiency and effectiveness of the automatic approach against the completely manual approach. In both experiments (one with IT students and one with expert architects), the participants were first asked to manually annotate a sample text from architectural documents. The outcome of this stage represents the manual approach. Prior to the experiment sessions, the sample text was annotated by the Automatic Annotator (AA), and this output represents the automatic approach. In the second stage of both experiments, the sentences annotated by AA were given to the participants, who accepted or rejected the AA annotations. The outcome of this stage represents the semi-automatic approach. Details about the preparation, sessions, materials, participants and stages of the experiments are published in Paper 3 (P3) and Paper 4 (P4).
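To make the relation between the three approaches concrete, the following minimal Python sketch models annotations as sets of sentence IDs and computes recall as the fraction of gold-standard issue sentences that were annotated. The gold standard, the sentence IDs and the function names `recall` and `semi_automatic` are hypothetical and used for illustration only. The sketch assumes that in the second stage participants could only accept or reject AA annotations and could not add new ones. Under that assumption the semi-automatic set is always a subset of the AA set, so its recall can never exceed the recall of the automatic approach.

```python
from typing import Dict, Set


def recall(annotated: Set[int], gold: Set[int]) -> float:
    """Fraction of gold-standard issue sentences that were annotated."""
    if not gold:
        raise ValueError("the gold standard must not be empty")
    return len(annotated & gold) / len(gold)


def semi_automatic(aa_annotations: Set[int], decisions: Dict[int, bool]) -> Set[int]:
    """Keep only the AA annotations that the participant accepted.

    Assumption: participants can only accept (True) or reject (False) an AA
    annotation; an annotation without a decision is treated as rejected.
    The result is therefore always a subset of aa_annotations.
    """
    return {s for s in aa_annotations if decisions.get(s, False)}


# Hypothetical sentence IDs, invented for illustration only.
gold = {1, 2, 3, 4, 5}          # sentences that truly describe architectural issues
manual = {1, 4}                 # stage 1: the participant's own annotations
aa = {1, 2, 3, 5, 7}            # output of the Automatic Annotator
decisions = {1: True, 2: False, 3: True, 5: False, 7: True}  # stage 2
semi = semi_automatic(aa, decisions)

print(recall(manual, gold))  # 0.4
print(recall(aa, gold))      # 0.8
print(recall(semi, gold))    # 0.4: rejecting sentences 2 and 5 lost two true positives
```

Representing annotations as sets keeps the subset relation explicit. Any recall drop from the automatic to the semi-automatic approach can then only come from rejected true positives.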
Table 4.1 shows the results of the experiment with 19 IT students. It is not surprising that the automatic approach needs much less processing time than the manual and semi-automatic approaches. We did not, however, expect the automatic approach to achieve a higher recall as well. The lower recall of the semi-automatic approach compared with the automatic approach reveals that the participants rejected some of the true positives produced by the automatic part (the AA annotations). As reported in Paper 3, we speculated that the participants' level of expertise was the reason.
Approach          Time (min)   Recall (%)
Manual            9            38
Automatic         0.03         86
Semi-automatic    3            55
Table 4.1: Results of the experiment with IT students
After improving the framework, we conducted a new experiment to test this hypothesis. This time the participants were 21 expert software/IT architects, and the materials differed from those of the first experiment. Table 4.2 shows the results.
Approach          Time (min)   Recall (%)
Manual            14           34
Automatic         0.03         53.5
Semi-automatic    7            32
Table 4.2: Results of the experiment with expert architects
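Assume that the semi-automatic annotations S form a subset of the AA annotations A, because participants could only accept or reject them. Assume also that both recalls are measured against the same set G of true architectural-issue sentences. Then the ratio of the two recalls gives the share ρ of AA's true positives that the participants rejected. The symbols S, A, G and ρ are introduced here for illustration only. If the reported semi-automatic recall is a mean over participants who all worked on the same sample text, ρ is the mean rejected share, because R_auto is then the same for every participant.

```latex
\frac{R_{\mathrm{semi}}}{R_{\mathrm{auto}}}
  = \frac{|S \cap G| / |G|}{|A \cap G| / |G|}
  = \frac{|S \cap G|}{|A \cap G|},
\qquad
\rho = 1 - \frac{R_{\mathrm{semi}}}{R_{\mathrm{auto}}}

\text{Table 4.1: } \rho = 1 - \frac{55}{86} \approx 0.36,
\qquad
\text{Table 4.2: } \rho = 1 - \frac{32}{53.5} \approx 0.40
```

Under these assumptions, the participants in both experiments discarded roughly a third or more of the correct AA suggestions.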
Although the numbers differ from those of the previous experiment, the same pattern occurred again. The automatic approach needs much less processing time than the other two approaches, and it also achieves the highest recall. However, a statistical test is needed to determine whether this difference in recall is significant. To test the data for normality, we applied the Shapiro-Wilk test. Since the data was shown to be normally distributed, a two-sample Kolmogorov-Smirnov (K-S) test was applied to compare the recall of the automatic approach with that of the manual and semi-automatic approaches.
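The two-sample K-S comparison can be illustrated with a short, dependency-free Python sketch. It is not the analysis script used for the reported results. It computes the statistic D, which is the largest vertical gap between the two empirical cumulative distribution functions. It then computes an approximate two-sided p-value from the asymptotic Kolmogorov series, using the small-sample correction given in Numerical Recipes. The function name `ks_2samp_asymptotic` and the sample values are hypothetical and are not the experiment data. For samples as small as the 19 to 21 participants here, the asymptotic p-value is only approximate. A library routine that supports exact p-values, such as SciPy's `scipy.stats.ks_2samp`, would be the safer choice. The K-S statistic is also distribution-free, so it does not itself require normally distributed data.

```python
import math
from bisect import bisect_right
from typing import Sequence, Tuple


def ks_2samp_asymptotic(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Two-sample Kolmogorov-Smirnov test with an asymptotic p-value.

    Returns (D, p): D is the largest absolute difference between the two
    empirical CDFs, p an approximate two-sided p-value.
    """
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")

    # The supremum of |F_x - F_y| is attained at one of the observed values.
    d = 0.0
    for v in xs + ys:
        fx = bisect_right(xs, v) / n   # share of x values <= v
        fy = bisect_right(ys, v) / m   # share of y values <= v
        d = max(d, abs(fx - fy))

    # Effective sample size and corrected argument (Numerical Recipes).
    ne = n * m / (n + m)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.2:
        # The Kolmogorov tail probability is indistinguishable from 1 here.
        return d, 1.0

    # Q_KS(lam) = 2 * sum_{j>=1} (-1)^(j-1) * exp(-2 j^2 lam^2)
    p = 0.0
    for j in range(1, 101):
        term = 2 * (-1) ** (j - 1) * math.exp(-2 * j * j * lam * lam)
        p += term
        if abs(term) < 1e-12:
            break
    return d, min(max(p, 0.0), 1.0)


if __name__ == "__main__":
    # Hypothetical recall values (fractions), not the experiment data.
    d, p = ks_2samp_asymptotic([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
    print(f"D = {d:.2f}, p = {p:.3f}")  # prints: D = 1.00, p = 0.033
```

For these two completely separated toy samples of three values each, the exact two-sided p-value is 2/20 = 0.1. The asymptotic value of about 0.033 understates it noticeably, which shows why exact methods matter for small samples.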
Comparison                      p-value
Automatic vs. Manual            0.001774
Automatic vs. Semi-automatic    0.0001581
Table 4.3: Results of the K-S tests
As Table 4.3 shows, at a significance level of α = 0.01, the recall of the automatic approach is significantly higher than that of the manual and semi-automatic approaches. The experiment with expert architects also showed a significantly lower recall for the semi-automatic approach than for the automatic approach. This rejects the assumption that the participants' level of expertise was the reason for rejecting some of the correct AA suggestions. While searching for another explanation of the lower recall of the semi-automatic approach, we found that researchers in the psychology of decision-making offer an empirical one: "Several studies have shown that human decision makers are inferior to a prediction formula even when they are given the score suggested by the formula. They feel that they can overrule the formula because they have additional information about the case, but they are wrong more often than not" [Kah11]. Accordingly, research in this field suggests that "to maximize predictive accuracy, final decisions should be left to formulas, especially in low-validity environments" [Kah11]. Low-validity environments are domains that involve a substantial degree of uncertainty and unpredictability [Kah11]. Annotating architectural issues in a document involves a significant degree of uncertainty. It can therefore be considered a low-validity environment, and the lower recall of the semi-automatic approach compared with the automatic approach is not a surprising result.

To avoid any misunderstanding, we note that the final decision here is not an architectural decision. It is the decision whether or not to annotate a sentence as an architectural issue. SADGE is not an expert system that replaces humans with artificial intelligence for decision-making. Rather, as a support (recommender) system, it highlights recurring issues and thereby empowers architects to make more comprehensive decisions.
In summary, considering the two evaluation metrics, processing time and recall, the automatic approach has been shown to be both more efficient and more effective than the semi-automatic approach for extracting architectural issues. In the initial version of the framework presented in Paper 3, we considered the manual fine-tuning stage part of the framework. After the experiment with the experts, we conclude that this stage can be excluded. However, based on the results of the case study presented in Paper 4, we selected the hybrid method as the appropriate method for constructing the CoT. The hybrid method includes manual bootstrapping. The framework can therefore still be considered semi-automated, and its name, Semi-Automated Design Guidance Enhancer (SADGE), remains justified.