The effort by the code smell research community has, to a great extent, focused on the formalization and detection of code smells. In the current body of knowledge, few empirical studies shed light on the actual impact of code smells on software maintenance. Code smell detection tools can aid in the detection and measurement of code smells. However, their interpretation and refactoring-related decisions still rely on expert judgment due to the lack of knowledge on quantifiable relations between code smells and maintainability.
From the identified empirical studies on code smells, it is possible to observe that not all code smells are equally harmful. Also, code smells are not harmful to the same extent across different contexts, indicating that their effects are potentially contingent on contextual variables or interaction effects. For example, Li and Shatnawi [77] found that the presence of Shotgun Surgery leads to defects. D’Ambros et al. [28], on the other hand, found no such connection between Shotgun Surgery and defects. Results from studies on Duplicated Code suggest that the effects of duplication depend on factors such as the programming language; e.g., the results from the COBOL system differed from those of the other types of systems in the study by Juergens et al. [63]. Similarly, studies on God Class report conflicting results. Deligiannis et al. [30] reported that the presence of a God Class indicates problems, while Abbes et al. [1] concluded that a God Class in isolation is not harmful. Olbrich et al. [108] reported that God Class is less connected with problems when adjusting for differences in file size, i.e., when file size was used as an independent variable in the regression model.
One reason for the difficulty of integrating and interpreting the results may be the variations in the dependent (outcome) variables, i.e., the variables used to represent maintainability. In the current studies on code smells, there are basically two categories of dependent variables: effort (the amount of time spent or the size of the changes required to finish the tasks) and quality (the presence or number of defects in the resulting product). Only Deligiannis [31] measured effort as the actual work effort spent on maintenance, which was recorded on video. Lozano et al. [80] and Olbrich et al. [108] used, instead of a direct measurement of effort, measures related to change frequency, change impact, and change size. However, it is questionable how well these surrogates measure effort, and these studies do not refer to other studies that have validated or investigated the relationships between actual maintenance effort (time) and its surrogates. Quality is also operationalized differently across the previously reported empirical studies. Deligiannis [31, 30] used measures related to correctness, completeness, and consistency as quality indicators. The remaining studies used the number of defects per class or line as measures. Monden [100] used the number of revisions as a quality-related variable, arguing that a module is, on average, less maintainable the more times the module has been revised.
An additional reason why the empirical results may be hard to interpret is the variations in the context of the studies and in the research methods applied. The context of the studies varies with respect to the domain of the system and its size, the type of task and its size, the characteristics of the developers, and the code smell detection procedure.
The research methods and contexts of the studies reported in this review include one controlled experiment (Deligiannis [30]) in a context with students and relatively small tasks, one case study (Deligiannis [31]) in an academic context, two case studies that analyzed the existing code in commercial systems (Monden [100] and Juergens [63]), and several post hoc correlation and regression studies involving OSS projects [69, 77, 107, 108].
OSS projects have opened a new arena for post hoc correlation and regression studies in software engineering. However, the ability to claim cause-effect relationships and to explain the results may be limited in such studies, because much of the process and context information that affects the outcomes of a software project is inaccessible.
Despite the plethora of detection methods and analysis tools, very little is reported on how code smells actually perform when used in maintainability assessments. Studies seldom address the applicability of code smells in industrial, real-life contexts or validate essential aspects such as their descriptive richness or their capability to assess different maintainability factors.
We believe that more empirical studies are needed to support refactoring decisions and maintainability assessments based on code smells. Moreover, we think that more consistent operationalizations of maintainability constructs would improve comparability across studies. Last but not least, we believe that more in vivo studies are required, involving professionals, realistic maintenance tasks, and industry-relevant systems, to ease the transfer of results from a study to the software industry. Realistic study contexts, in spite of being more difficult to attain, are likely to enable higher confidence in the results and lead to more practical insights for both academia and industry. The present research attempts to address segments of the identified knowledge gap and to improve the use of code smells in industrial software maintenance contexts. For this purpose, we take different perspectives of maintainability into account. Outcome-based interpretations of maintainability (e.g., effort, defects, and change size) are considered, but we also include qualitative process-related aspects (e.g., number and types of maintenance problems and developers’ perception of maintainability) to better understand the underlying mechanisms of the code smell effects.