CHAPTER 4: DEALING WITH SUBJECTIVITY
4.2 Evaluating the Quality of Design
4.2.4 Evaluating our Bayesian Methodology
To empirically validate our approach, we proceeded in three steps. First, we built our corpus so that it supports multiple, contradictory opinions. Then, we analysed the symptoms presented earlier and evaluated their usefulness. Finally, we executed our detection models and compared their performance to that of DECOR.
4.2.4.1 Building an Uncertain Corpus
To test our approach, we used the corpus of DECOR. This corpus contains a set of classes that are instances of each anti-pattern type (as judged by 3/5 groups). We also had the opinions of 2 additional developers concerning those classes. To obtain a corpus that supports conflicting opinions, we treated every opinion as a vote (with data in DECOR counting as 3 positive votes). Thus, if a class tagged by DECOR was also judged to be an anti-pattern by one of our two developers but not by the other, it would have 4 votes for and 1 vote against. All classes that were not part of the corpus received 3 negative votes, as our developers did not review them.
We could not encode this information directly in our model; we therefore proposed the use of “bootstrapping”, a technique that multiplies existing data in a corpus. Our class with 4 positive votes would thus be included 4 times as a yes and once as a no. If that class were the only class considered in the corpus, a class exhibiting its symptoms would consequently have an 80% probability of being evaluated as an anti-pattern. Finally, since we cannot be sure that our prior knowledge (P(Q), unconditionally) is similar across projects, we balanced our corpus to have an equal number of positive and negative classes by randomly sampling our data set.
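The vote replication and balancing steps can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function names, the class names and the vote counts are hypothetical, and balancing by downsampling the larger group is an assumption, since the text only states that the corpus was balanced by random sampling.

```python
import random

def replicate_votes(classes):
    """Expand each class into one labelled example per vote.

    `classes` maps a class name to (positive_votes, negative_votes).
    """
    examples = []
    for name, (pos, neg) in classes.items():
        examples.extend((name, True) for _ in range(pos))
        examples.extend((name, False) for _ in range(neg))
    return examples

def balance(examples, rng):
    """Randomly downsample the larger label group to the size of the smaller.

    Downsampling is an assumption; the source only says "randomly sampling".
    """
    positives = [e for e in examples if e[1]]
    negatives = [e for e in examples if not e[1]]
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n) + rng.sample(negatives, n)

# Hypothetical vote counts, not taken from the DECOR corpus.
votes = {
    "ClassA": (4, 1),  # 3 DECOR votes, one developer agrees, one disagrees
    "ClassB": (0, 3),  # not part of the corpus: 3 negative votes
    "ClassC": (0, 3),
}

examples = replicate_votes(votes)
class_a = [label for name, label in examples if name == "ClassA"]
share_a = sum(class_a) / len(class_a)   # fraction of "yes" copies of ClassA
balanced = balance(examples, random.Random(0))

print(share_a)        # 0.8
print(len(balanced))  # 8 (4 positive and 4 negative examples)
```

With ClassA alone, 4 of its 5 copies are positive, which reproduces the 80% figure given above.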
4.2.4.2 Systems Analysed
Programs               # Classes   KLOCs
GanttProject v1.10.2   188         31
Xerces v2.7.0          589         240
Total                  777         271
Table 4.IV: Program Statistics
We used two open-source Java programs to perform our experiments: GanttProject v1.10.2 and Xerces v2.7.0, presented in Table 4.IV. GanttProject² is a tool for creating project schedules by means of Gantt charts and resource-load charts. Xerces³ is a family of software packages for parsing and manipulating XML.
We chose GanttProject and Xerces because both had already been annotated in the DECOR corpus. Furthermore, they were small enough for students to understand the general software architecture in order to review the annotations. We used the POM framework [GSF04] to extract metrics.
² http://ganttproject.biz/index.php
³ http://xerces.apache.org/
                          % in GanttProject              % in Xerces
Anti-patterns  Symptoms   AP     Not AP   p-values       AP     Not AP   p-values
Blob           B1         100%   4%       X              94%    3%       X
               B2         0%     7%       ×              5%     1%       53%
               B3         0%     26%      ×              0%     19%      ×
               B4         72%    8%       X              63%    48%      X
S.C.           S1         100%   6%       X              89%    6%       X
               S2         49%    3%       X              19%    10%      70%
               S3         38%    4%       X              12%    2%       19%
               S4         0%     44%      ×              27%    49%      ×
               S5         0%     2%       ×              36%    18%      X
F.D.           F1         75%    47%      X              87%    18%      X
               F2         70%    39%      X              60%    19%      X
               F3         6%     8%       93%            67%    26%      X
               F4         63%    17%      X              7%     8%       92%
               F5         6%     4%       87%            0%     3%       ×
               F6         75%    35%      X              33%    57%      7%
Table 4.V: Intra project validation, salient symptom identification
4.2.4.3 Salient Symptom Identification
As mentioned before, some symptoms used by DECOR are inadequate in an industrial context. Identifying which symptoms are useful allows an IV&V team to tweak their detection models. In this section, we identify which symptoms are useful in a detection process.
We decided to use a simple univariate proportion test to identify the interesting symptoms for anti-patterns in both systems. The symptoms tested correspond to those presented in Tables 4.I, 4.II and 4.III. For each symptom, the test evaluates whether the difference between the proportion of anti-patterns and the proportion of non-anti-patterns exhibiting it is statistically significant (not due to chance). When the difference is significant and the symptom is more prevalent in anti-patterns, it is a useful detector for that system.
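The following sketch illustrates one way such a test could be carried out. The text does not name the exact test, so the pooled two-proportion z-test used here is an assumption, as are the function names and the counts, which are hypothetical and not taken from either system. Treating a symptom as contrary only when it is significantly less frequent in anti-patterns is also just one possible reading of the decision rule.

```python
import math

def two_proportion_z_test(hits_ap, n_ap, hits_not_ap, n_not_ap):
    """Two-sided pooled z-test for the difference between two proportions.

    Returns (proportion in AP, proportion in non-AP, p-value).
    """
    p_ap = hits_ap / n_ap
    p_not = hits_not_ap / n_not_ap
    pooled = (hits_ap + hits_not_ap) / (n_ap + n_not_ap)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_ap + 1 / n_not_ap))
    if se == 0:
        # Symptom present in all classes or in none: no evidence either way.
        return p_ap, p_not, 1.0
    z = (p_ap - p_not) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_ap, p_not, p_value

def classify_symptom(hits_ap, n_ap, hits_not_ap, n_not_ap, alpha=0.01):
    """Label a symptom as useful, contrary, or not significant (assumed rule)."""
    p_ap, p_not, p_value = two_proportion_z_test(
        hits_ap, n_ap, hits_not_ap, n_not_ap
    )
    if p_value > alpha:
        return "not significant"
    return "useful" if p_ap > p_not else "contrary"

# Hypothetical counts: (classes with symptom, total classes) for AP and non-AP.
useful = classify_symptom(18, 25, 8, 100)    # 72% of AP vs 8% of non-AP
weak = classify_symptom(1, 25, 5, 100)       # 4% of AP vs 5% of non-AP
contrary = classify_symptom(0, 25, 44, 100)  # 0% of AP vs 44% of non-AP

print(useful, "|", weak, "|", contrary)
```

The z-test relies on a normal approximation, which becomes unreliable for small counts; an exact test may be preferable in that situation.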
We applied this approach to both systems; the results are presented in Table 4.V. The table contains the proportions measured and their significance. When a symptom was significant (p <= 0.01), it is marked with an X; when it was not significant, the exact p-value is shown; and when the relationship is the contrary of what is expected, we use a × symbol. The first thing we notice is that there are many inappropriate symptoms. There are two possible explanations: the measure used might be incorrect, or the symptom itself might not apply. In the case of symptom B3 for Blobs, we are interested in the cohesion of a class, but the LCOM5 measure is not useful.
As mentioned in Section 2.4.3, although there are many alternative measures, none has been shown in the literature to be a better measure of cohesion. The second thing we notice is that a large number of symptoms are useful for only one system; their importance is context-dependent. Naming conventions are such a case (S5 and F3): the terms used are only useful for predictions on Xerces. This suggests that the development teams followed different coding practices. Finally, we observed that at most two symptoms per anti-pattern are useful in both systems. These symptoms are thus the least context-dependent. The conclusion of this analysis is that simply encoding heuristics found in the literature is not a good way to produce a general-purpose detection model.
This conclusion leads us to our empirical validation. We tested two scenarios. First, we tested a general approach in which we built our models using data from one system and tested them on the other. This corresponds to the approach of an IV&V team that uses general knowledge to detect anti-patterns in a new system. The second scenario evaluates the importance of local data. We built a model for a system using only the symptoms that affect that system, and trained it on the local data (we used 3-fold cross-validation). This scenario corresponds to an IV&V team tracking a system over a long period of time, during which they could adapt their prediction models.
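The two scenarios differ only in how training and test data are chosen, which the following sketch makes explicit. It is model-agnostic and purely illustrative: the function names and the placeholder corpora are hypothetical, and the shuffling and fold assignment are assumptions, since the text does not describe how the 3 folds were formed.

```python
import random

def cross_system_splits(corpora):
    """Scenario 1: train on one system, test on each other system."""
    for train_name, train in corpora.items():
        for test_name, test in corpora.items():
            if train_name != test_name:
                yield train_name, test_name, train, test

def k_fold_splits(examples, k=3, seed=0):
    """Scenario 2: shuffle once, then hold out each of the k folds in turn."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield train, folds[i]

# Hypothetical placeholder corpora; real entries would pair symptoms with a label.
corpora = {
    "GanttProject": [f"gantt_{i}" for i in range(9)],
    "Xerces": [f"xerces_{i}" for i in range(12)],
}

pairs = [(a, b) for a, b, _, _ in cross_system_splits(corpora)]
local_folds = list(k_fold_splits(corpora["GanttProject"], k=3))

print(pairs)  # [('GanttProject', 'Xerces'), ('Xerces', 'GanttProject')]
for train, test in local_folds:
    print(len(train), len(test))  # 6 3 for each of the three folds
```

When the corpus contains replicated votes, copies of the same class can fall into both the training and the test fold unless the split is made per class; the text does not say how this was handled.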