Evaluating our Bayesian Methodology - Evaluating the Quality of Design

CHAPTER 4: DEALING WITH SUBJECTIVITY

4.2 Evaluating the Quality of Design

4.2.4 Evaluating our Bayesian Methodology

To empirically validate our approach, we followed a three-step process. First, we built our corpus in such a way as to support multiple, contradictory opinions. Then, we analysed the symptoms presented earlier and evaluated their usefulness. Finally, we executed our detection models to compare their performance to that of DECOR.

4.2.4.1 Building an Uncertain Corpus

To test our approach, we used the corpus of DECOR. This corpus contains a set of classes that are instances of every anti-pattern type (as judged by 3 of 5 groups). Additionally, we had the opinions of 2 additional developers concerning those classes. To obtain a corpus that supports conflicting opinions, we considered that every opinion is a vote (with data in DECOR counting as 3 positive votes). Thus, if a class tagged by DECOR was also found to be an anti-pattern by only one of our two developers, then it would have 4 votes for and one vote against. All classes that were not part of the corpus had 3 negative votes, as our developers did not review them.

We could not encode this information directly in our model, so we proposed the use of "bootstrapping", a technique that multiplies existing data in a corpus. Our class with 4 positive votes would thus be included 4 times as a yes, and once as a no. If that class were the only class considered in the corpus, its symptoms would consequently have a probability of 80% of being evaluated as an anti-pattern. Finally, since we cannot be sure that our prior knowledge (the unconditional probability P(Q)) is similar across projects, we balanced our corpus to have an equal number of positive and negative classes by randomly sampling our data set.
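The vote replication and balancing steps can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names, the tuple layout, and the choice to balance by downsampling the larger group are assumptions made for the example.

```python
import random


def expand_votes(classes):
    """Replicate each class once per vote.

    `classes` is a list of (symptoms, votes_for, votes_against) tuples
    (a hypothetical layout). Returns a list of (symptoms, label) rows.
    """
    rows = []
    for symptoms, votes_for, votes_against in classes:
        rows += [(symptoms, True)] * votes_for
        rows += [(symptoms, False)] * votes_against
    return rows


def balance(rows, rng):
    """Return as many positive as negative rows.

    Assumption: balancing is done by randomly downsampling the larger group.
    """
    positives = [r for r in rows if r[1]]
    negatives = [r for r in rows if not r[1]]
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n) + rng.sample(negatives, n)


# A class with 4 votes for and 1 against appears 4 times as "yes", once as "no".
rows = expand_votes([({"B1": True}, 4, 1)])
print(sum(label for _, label in rows) / len(rows))  # 0.8
```

With the class above as the only entry, 4 of the 5 replicated rows are positive, which reproduces the 80% figure given in the text.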

4.2.4.2 Systems Analysed

Programs               # Classes   KLOCs
GanttProject v1.10.2         188      31
Xerces v2.7.0                589     240
Total                        777     271

Table 4.IV: Program Statistics

We used two open-source Java programs to perform our experiments: GanttProject v1.10.2 and Xerces v2.7.0, presented in Table 4.IV. GanttProject² is a tool for creating project schedules by means of Gantt charts and resource-load charts. Xerces³ is a family of software packages for parsing and manipulating XML.

We chose GanttProject and Xerces because DECOR had already annotated them. Furthermore, they were small enough that students could understand the general software architecture in order to review the annotations. We used the POM framework [GSF04] to extract metrics.

² http://ganttproject.biz/index.php
³ http://xerces.apache.org/

                          % in GanttProject              % in Xerces
Anti-pattern   Symptom    AP      Not AP    p-value      AP      Not AP    p-value
Blob           B1         100%    4%        X            94%     3%        X
               B2         0%      7%        ×            5%      1%        53%
               B3         0%      26%       ×            0%      19%       ×
               B4         72%     8%        X            63%     48%       X
S.C.           S1         100%    6%        X            89%     6%        X
               S2         49%     3%        X            19%     10%       70%
               S3         38%     4%        X            12%     2%        19%
               S4         0%      44%       ×            27%     49%       ×
               S5         0%      2%        ×            36%     18%       X
F.D.           F1         75%     47%       X            87%     18%       X
               F2         70%     39%       X            60%     19%       X
               F3         6%      8%        93%          67%     26%       X
               F4         63%     17%       X            7%      8%        92%
               F5         6%      4%        87%          0%      3%        ×
               F6         75%     35%       X            33%     57%       7%

Table 4.V: Intra project validation, salient symptom identification

4.2.4.3 Salient Symptom Identification

As mentioned before, some symptoms used by DECOR are inadequate in an industrial context. Figuring out which symptoms are useful can allow an IV&V team to tweak their detection models. In this section, we identify which symptoms are useful in a detection process.

We decided to use a simple univariate proportion test to identify the interesting symptoms for anti-patterns in both systems. The symptoms tested correspond to those presented in Tables 4.I, 4.II and 4.III. The test evaluates whether the difference between the proportion of anti-patterns and the proportion of non-anti-patterns that exhibit a symptom is statistically significant (not due to chance). When the difference is significant and the symptom is more present in anti-patterns, it is a useful detector for that system.
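As an illustration of such a test, the sketch below uses a pooled two-proportion z-test with a two-sided p-value. The text does not state which proportion test was applied, so the choice of test, the function names, and the example counts are all assumptions.

```python
import math


def two_proportion_test(hits_ap, n_ap, hits_not_ap, n_not_ap):
    """Pooled two-proportion z-test (an assumed choice of test).

    Returns (proportion in APs, proportion in non-APs, two-sided p-value).
    """
    p_ap = hits_ap / n_ap
    p_not = hits_not_ap / n_not_ap
    pooled = (hits_ap + hits_not_ap) / (n_ap + n_not_ap)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_ap + 1 / n_not_ap))
    if se == 0:
        # The symptom is present in every class or in none: nothing to test.
        return p_ap, p_not, 1.0
    z = (p_ap - p_not) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return p_ap, p_not, p_value


def is_useful(p_ap, p_not, p_value, alpha=0.01):
    """A symptom is useful if it is significant and more present in APs."""
    return p_value <= alpha and p_ap > p_not


# Hypothetical counts: symptom present in 18 of 20 APs and 10 of 200 non-APs.
print(is_useful(*two_proportion_test(18, 20, 10, 200)))  # True
```

The `alpha=0.01` default mirrors the significance level used for Table 4.V; with small groups, an exact test may be preferable to the normal approximation used here.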

We applied this approach to both systems; the results are presented in Table 4.V. The table contains the proportions measured and their significance. When a symptom is significant (at the 0.01 level), it is marked with an X; when it is not significant, the exact p-value is shown; and when the relationship is the contrary of what is expected, we use a × symbol. The first thing we notice is that there are many inappropriate symptoms. There are two possible explanations: the measure used might be incorrect, or the symptom itself might not apply. In the case of symptom B3 for Blobs, we are interested in the cohesion of a class, but the LCOM5 measure is not useful. As mentioned in Section 2.4.3, although there are many alternative measures, none has been shown in the literature to be a better measure of cohesion. The second thing we notice is that a large number of symptoms are useful for only one system; their importance is context-dependent. Naming conventions are such a case (S5 and F3): the terms used are only useful for predictions on Xerces. Evidently, the development teams followed different coding practices. Finally, we observed that at most two symptoms per model are useful in both systems. These symptoms are thus the least context-dependent. The conclusion of this analysis is that simply encoding heuristics found in the literature is not a good way to produce a general-purpose detection model.
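The last observation can be checked directly against Table 4.V. The short sketch below transcribes the symptoms marked as significant for each system and intersects the two sets; the variable names are illustrative only.

```python
# Symptoms marked as significant (X) in Table 4.V, per system.
SIGNIFICANT = {
    "GanttProject": {"B1", "B4", "S1", "S2", "S3", "F1", "F2", "F4", "F6"},
    "Xerces": {"B1", "B4", "S1", "S5", "F1", "F2", "F3"},
}
# The first letter of a symptom identifies its anti-pattern model.
MODEL_OF = {"B": "Blob", "S": "S.C.", "F": "F.D."}

# Symptoms that are useful in both systems, grouped by model.
shared = SIGNIFICANT["GanttProject"] & SIGNIFICANT["Xerces"]
by_model = {}
for symptom in sorted(shared):
    by_model.setdefault(MODEL_OF[symptom[0]], []).append(symptom)

print(by_model)
# {'Blob': ['B1', 'B4'], 'F.D.': ['F1', 'F2'], 'S.C.': ['S1']}
```

No model keeps more than two symptoms once both systems are required to agree.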

This conclusion leads us to our empirical validation. We tested two scenarios. First, we tested a general approach in which we built our models using data from one system and tested them on the other. This corresponds to the approach of an IV&V team that uses general knowledge to detect anti-patterns in a new system. The second scenario evaluates the importance of local data. We built a model for a system using only the symptoms that affect that system, and trained it on the local data (we used 3-fold cross-validation). This scenario corresponds to an IV&V team tracking a system over a long period of time, during which they could adapt their prediction models.
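The two scenarios differ only in how training and test data are separated. The sketch below shows one way to generate both kinds of split. It is an assumption-laden illustration: the function names are hypothetical, and the shuffling and fold assignment are not taken from the study.

```python
import random


def cross_system_splits(corpora):
    """Scenario 1: yield (train_system, test_system) name pairs,
    training on one system and testing on the other."""
    names = list(corpora)
    for train_name in names:
        for test_name in names:
            if train_name != test_name:
                yield train_name, test_name


def k_fold_splits(n_items, k=3, seed=0):
    """Scenario 2: yield (train_indices, test_indices) for k-fold
    cross-validation within a single system."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


corpora = {"GanttProject": [], "Xerces": []}  # placeholders for labelled rows
print(list(cross_system_splits(corpora)))
# [('GanttProject', 'Xerces'), ('Xerces', 'GanttProject')]
```

In the local scenario, each class is used for testing exactly once and for training in the other two folds, so the reported performance never comes from a class the model was trained on.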
