
3.2 Controlled Multiple-Case Study Design for

3.2.4 Data Analysis

Analyses addressing RQ1: Code smells were aggregated at the system level, and each system was ranked according to the number of code smells it contained and its code smell density (i.e., fewer code smells and lower smell densities mean a better maintainability ranking for a system). After the systems had undergone maintenance work, the maintenance outcomes (total effort and introduced defects) were collected per system and used to rank the systems accordingly. To avoid learning effects, we used only the data from the first round per developer. Cohen’s kappa coefficient⁴ was used to statistically measure the degree of agreement between the code-smell-based and maintenance-outcome-based rankings. Previous maintainability assessments of the systems based on a subset of C&K metrics and expert judgments, as reported in Ref. [6], were also compared with the maintenance-outcome-based rankings to analyze the differences in accuracy between the code-smell-based, expert-judgment-based, and metrics-based approaches for maintainability assessments.
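The agreement measure described above can be sketched as follows. This is a minimal illustration, not the computation used in the study: the category labels and the two example rankings are hypothetical, and the function names are assumptions. It reports both the observed agreement (Po) and the chance-corrected kappa.

```python
from collections import Counter


def observed_agreement(a, b):
    """Share of items on which the two ratings coincide (Po)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement (Pe).

    Undefined (division by zero) when Pe equals 1, i.e., when both raters
    put every item in the same single category.
    """
    n = len(a)
    po = observed_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the marginal proportions per category.
    pe = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (po - pe) / (1 - pe)


# Hypothetical maintainability categories for systems A, B, C, D
# (illustrative only, not the rankings reported in the thesis).
outcome_based = ["medium", "low", "high", "high"]
smell_based = ["medium", "low", "high", "medium"]

po = observed_agreement(outcome_based, smell_based)
kappa = cohen_kappa(outcome_based, smell_based)
print(f"Po = {po:.2f}, kappa = {kappa:.2f}")  # Po = 0.75, kappa = 0.64
```

With only four systems, a single mismatch changes Po by 0.25, so such values should be read as coarse indicators of agreement.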

Analyses addressing RQ2: This question was addressed by focusing on Java files as the unit of analysis and by using multiple regression analysis. Effort at the file level (the effort used to view or update a file) was the variable to be explained. Variables representing the different code smells, the file size (measured in LOC), the number of revisions on a file, the system, the developer, and the round were included as independent variables. Several regression models, with different subsets of variables, were built to compare their fit and to discern the predictive capability of each of the variables considered.
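The idea of comparing regression models built from different variable subsets can be sketched as below. This is an assumption-laden illustration: the data are synthetic (effort is generated from a known linear rule), the categorical variables (system, developer, round) are omitted for brevity, and `solve` and `fit_ols` are hypothetical helper names, not part of the study's tooling.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Ordinary least squares with an intercept; returns (coefficients, R^2)."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(xi[i] * xi[j] for xi in x) for j in range(k)] for i in range(k)]
    xty = [sum(xi[i] * yi for xi, yi in zip(x, y)) for i in range(k)]
    beta = solve(xtx, xty)
    pred = [sum(b * v for b, v in zip(beta, xi)) for xi in x]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1 - ss_res / ss_tot


# Synthetic files: (LOC, revisions, smell count). Effort follows a known
# linear rule so that the fitted coefficients can be checked.
files = [(100, 1, 0), (250, 3, 1), (400, 2, 0), (150, 5, 2),
         (600, 4, 1), (320, 1, 3), (80, 2, 1)]
effort = [2 + 0.01 * loc + 3 * rev + 5 * smells for loc, rev, smells in files]

beta_full, r2_full = fit_ols(files, effort)
beta_size, r2_size = fit_ols([(loc,) for loc, _, _ in files], effort)
print(f"R^2 size-only: {r2_size:.3f}, R^2 full model: {r2_full:.3f}")
```

Comparing the fit of the size-only model against the full model indicates how much explanatory power the smell and revision variables add beyond file size; a real analysis would also adjust for the number of predictors.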

Analyses addressing RQ3: This question was addressed by focusing on Java files as the unit of analysis and by using binary logistic regression analysis. The variable to be explained was the variable “problematic,” which was true (1) if a file was deemed problematic during maintenance by at least one developer who worked with the file, and false (0) otherwise. The different types of code smells, file size (measured in LOC), and change size (churn) were used as independent variables. A principal component analysis (PCA) using orthogonal rotation (varimax) was conducted on a set of files to observe patterns of collocated code smells. A follow-up qualitative analysis based on the data from the interviews and the think-aloud sessions was performed (1) to support or challenge the findings from the binary logistic regression and (2) to better understand how the presence of a code smell contributed to the problems experienced by the developers during maintenance.
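A minimal sketch of the binary logistic regression setup follows. The file records, the single smell indicator, the feature scaling, and the plain gradient-descent fit are all assumptions made for illustration; a statistics package would normally be used, and the PCA step is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, x):
    """Predicted probability that a file is problematic; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


def log_loss(w, xs, ys):
    """Mean negative log-likelihood of the labels under the model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict(w, x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(ys)


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit the weights by batch gradient descent on the log-loss."""
    w = [0.0] * (len(xs[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = predict(w, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad)]
    return w


# Hypothetical files: (has_smell, size in KLOC, churn in hundreds of lines),
# plus one flag per developer who worked with the file.
records = [
    ((1, 0.60, 0.50), [True, False]),
    ((1, 0.40, 0.30), [True]),
    ((1, 0.50, 0.10), [False, False]),
    ((0, 0.20, 0.20), [False]),
    ((0, 0.10, 0.10), [False, False]),
    ((0, 0.30, 0.40), [False, True]),
    ((1, 0.70, 0.60), [True, True]),
    ((0, 0.15, 0.05), [False]),
]
xs = [features for features, _ in records]
# "problematic" is 1 if at least one developer flagged the file, else 0.
labels = [int(any(flags)) for _, flags in records]

w = fit_logistic(xs, labels)
print("intercept and coefficients:", [round(v, 2) for v in w])
```

In such a model, the sign and magnitude of each smell coefficient indicate whether the smell raises the odds of a file being problematic once size and churn are accounted for.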

⁴ Cohen’s kappa coefficient is a statistical measure to represent inter-rater agreement for categorical

Analyses addressing RQ4: This question was addressed through a compilation and synthesis of the relevant qualitative data related to problems encountered by the developers during the maintenance work. The record of the problems was based on observational notes, think-aloud sessions, and progress interviews. Based on its origin, each problem was categorized as either source-code-related or non-source-code-related. The extent to which code smells can explain the problems during maintenance was investigated by observing the proportion of problems that were related to the source code compared with the problems caused by other factors (e.g., problems related to infrastructure or external services). The set of problems associated with the source code was further investigated by examining how many of these problems could potentially be related to the presence of code smells. This was done by examining the presence of code smells in files related to maintenance problems.
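The two proportions described above can be computed as in the following sketch. The problem records, file names, and the set of smelly files are hypothetical placeholders, not data from the study.

```python
# Hypothetical problem records: origin category and the files involved.
problems = [
    {"id": 1, "origin": "source code", "files": ["StudyDAO.java"]},
    {"id": 2, "origin": "infrastructure", "files": []},
    {"id": 3, "origin": "source code", "files": ["Person.java", "Util.java"]},
    {"id": 4, "origin": "external service", "files": []},
    {"id": 5, "origin": "source code", "files": ["Menu.java"]},
]
# Hypothetical set of files in which at least one code smell was detected.
smelly_files = {"StudyDAO.java", "Util.java"}

# Step 1: share of all problems that originate in the source code.
source_related = [p for p in problems if p["origin"] == "source code"]
share_source = len(source_related) / len(problems)

# Step 2: share of source-code problems touching at least one smelly file.
smell_related = [
    p for p in source_related
    if any(f in smelly_files for f in p["files"])
]
share_smell = len(smell_related) / len(source_related)

print(f"source-code-related: {share_source:.0%}")
print(f"of those, in files with smells: {share_smell:.0%}")
```

A file containing a smell does not by itself show that the smell caused the problem, which is why the text speaks of problems that could potentially be related to code smells.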

Analyses addressing RQ5: This was addressed through a mainly qualitative analysis, which compared the developers’ perceptions of the maintainability of the systems with the goal of identifying a set of factors relevant to maintainability. These factors were related to current definitions of code smells to observe their conceptual relatedness. The transcripts of the open-ended interviews were analyzed through open and axial coding [135]. The identified factors were summarized and compared across cases using a technique called cross-case synthesis [152]. The factors derived from this analysis were compared with the factors reported in a previous study [6], which were extracted via expert judgment.

Analyses addressing RQ6: This question was addressed through a descriptive case study with a detailed account of how concept mapping (a technique from social research) can be adapted to software engineering. This technique was suggested as a structured approach to enable the use of expert judgment in guiding the selection, combination, and interpretation of code smells for maintainability assessments. Several software engineering researchers and a senior software engineer (with more than 25 years of experience at that time) participated as the experts in the concept mapping process. We compared the resulting concept maps (representing the maintainability of the four systems) with the results from the expert assessment reported in Ref. [6] to evaluate the validity of the concept mapping technique.

Summary of Results

In this section, the key results of the papers submitted as part of this thesis are summarized.

4.1 Code Smells as System-Level Indicators of Maintainability

System-level indicators of maintainability based on code smells were investigated in four systems: each system’s maintainability was ranked according to code smell measures and compared with the maintenance outcome measures (change effort and number of defects). Figure 4.1 shows the standardized values of the total number of code smells and the code smell densities for each system. When differences in code smells were not adjusted for differences in size (i.e., by using the total number of code smells rather than the code smell density), the smallest system (system C) was considered the most maintainable. Conversely, when code smell density was considered (which adjusts for the size of the systems), the largest system (system B) appeared more maintainable.
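The effect of adjusting for size can be illustrated with a small sketch. The smell counts and system sizes below are hypothetical and chosen only to reproduce the pattern described (the smallest system ranks best by count, the largest by density); the use of the population standard deviation for standardization is also an assumption.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert raw values to standard scores: (x - mean) / standard deviation."""
    raw = list(values.values())
    mu, sigma = mean(raw), pstdev(raw)
    return {name: (v - mu) / sigma for name, v in values.items()}


# Hypothetical (number of smells, size in KLOC) per system.
systems = {"A": (80, 8.0), "B": (150, 26.0), "C": (40, 6.0), "D": (70, 10.0)}

smell_count = {name: n for name, (n, _) in systems.items()}
smell_density = {name: n / kloc for name, (n, kloc) in systems.items()}

z_count = standardize(smell_count)
z_density = standardize(smell_density)

# Lower values mean a better maintainability ranking.
by_count = sorted(smell_count, key=smell_count.get)
by_density = sorted(smell_density, key=smell_density.get)
print("ranking by smell count:  ", by_count)    # ['C', 'D', 'A', 'B']
print("ranking by smell density:", by_density)  # ['B', 'C', 'D', 'A']
```

Standardizing puts both measures on the same scale so that they can be plotted side by side, while the ranking itself depends only on the raw order.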

Figure 4.2 presents the standardized scores for both effort and defects. As can be seen, the systems are ranked similarly regardless of whether effort or defects is used as the maintenance outcome measure. Figure 4.2 suggests that system C is the most maintainable, followed closely by system D. System A has an intermediate maintainability level, while system B is assessed to be the least maintainable system.

In Ref. [6], the code-metrics-based assessment approach resulted in system D being the most maintainable, systems A and B being medium maintainable, and system C being the least maintainable. On the other hand, the expert-judgment-based approach resulted in systems A and D as the most maintainable, system C as medium maintainable, and system B as the least maintainable. When comparing the rankings from all the assessment approaches with the actual maintenance outcomes (see Table 4.1), the number of code smells gave the best match [with a kappa coefficient (Po) of 0.75]. Conversely, the code smell density displayed zero matching with the maintenance outcomes. The evaluation approaches reported in Ref. [6] showed a medium level of agreement [kappa coefficient (Po) of 0.50].

[Bar chart: standardized score (-2 to 2) per system (A, B, C, D); series: Smells, Smell Density]

Figure 4.1: Standardized number of code smells and code smell densities.

[Figure 4.2, bar chart: standardized score (-2 to 2) per system (A, B, C, D); series: Effort, Defects]

Table 4.1: Comparison of levels of agreement with actual maintenance outcome.

Assessment approach | Level of matching or agreement | Kappa coefficient
Code metrics from [6] | Matching D as most maintainable and A as intermediate. | 0.50
Expert judgment from [6] | Matching D as most maintainable and B as least maintainable. | 0.50
Number of code smells | Matching C as most maintainable, A as intermediate, and B as least maintainable. | 0.75
Code smell density | No matching with maintenance outcomes. | 0.00

Code smell density analysis implied that system B (the largest of the four systems) was highly maintainable, which is not an accurate assessment, at least for the size and types of maintenance tasks involved in the project. This result suggests that one should be careful when using code smell density to compare systems differing greatly in size. However, when only considering systems similar in size (i.e., when excluding system B), code smell density better reflected the levels of maintainability according to the outcomes in terms of effort and defects. To illustrate this, Figures 4.3(a) and 4.3(b) show parallel plots¹ of the standardized scores for the number of code smells and code smell density, effort, and defects for the three systems with similar sizes. The figure shows that the degree of correspondence between the code smell measure and effort/defects is better for code smell density than for the number of code smells.

[Parallel plots for systems A, C, and D; axes in (a): Num Smells, Effort, Defects; axes in (b): Smell density, Effort, Defects]

Figure 4.3: Parallel plots for systems A, C, and D on the level of matching between the standardized scores of (a) number of code smells and (b) code smell density versus maintenance outcomes, that is, effort and defects.

The figure suggests that code smell density tended to be sensitive to large differences in system size, but that this measure would be more useful if systems of similar sizes were compared, in which case it might provide more information than just the total number of code smells per system. When comparing code-smell-based assessments with other assessment approaches, the C&K metrics provided more insight into which system had the most “balanced design” (e.g., they pointed out the absence of deviant classes in a system), but this measure tended to ignore the effect of the task size when maintenance tasks were of small or medium size. Expert-judgment-based assessment was the most flexible of the three approaches because it considered both the effect of the system size and the potential maintenance scenarios (e.g., small versus large extensions). We conclude that an advantage of the use of code smells is that when comparing similarly sized systems (i.e., with the use of code smell density), they can spot critical areas that experts may overlook.