Clone Assessment and Management - Why and How to Control Cloning in Software Artifacts. Elmar J

Graph Theory Probably the most closely related problem in graph theory is the well-known NP-complete Maximum Common Subgraph problem. An overview of algorithms is presented by Bunke et al. [31]. Most practical applications of this problem seem to be studied in chemoinformatics [191], where it is used to find similarities between molecules. However, while typical molecules considered there have up to about 100 atoms, many Matlab/Simulink models consist of thousands of blocks and thus make the application of exact algorithms as applied in chemoinformatics infeasible.

Summary We require a clone detection algorithm for Matlab/Simulink models to investigate the extent of cloning in industrial Matlab/Simulink models.

Problem While the existing approaches for clone detection in graphs and models provided valuable inspiration, none is suitable to study the extent of cloning in industrial Matlab/Simulink models.

Contribution Chapter 7 presents a novel clone detection approach for data-flow models that is suitable for Matlab/Simulink and scales to industrial-size models.

3.4 Clone Assessment and Management

This section outlines work related to clone management; to be comprehensive, we interpret this to comprise all work that employs clone detection results to support software maintenance.

3.4.1 Clone Assessment

Clone detection tools produce clone candidates. Even when the syntactic criteria for type-x clone candidates are satisfied, the candidates do not necessarily represent duplication of problem domain knowledge. Hence, they are not necessarily relevant for software maintenance. If precision is interpreted as task relevance, existing clone detection approaches thus produce substantial amounts of false positives. Clone assessment needs to achieve high precision to yield conclusive cloning information.

The existence of false positives in produced clone candidates has been reported by several researchers. Kapser and Godfrey report between 27% and 65% of false positives in case studies investigating cloning in open source software [122]. Burd and Bailey [32] compared three clone detection and two plagiarism detection tools using a single small system as study object. Through subjective assessments, 38.5% of the detected clones were rejected as false positives. A more comprehensive study was conducted by Bellon et al. [19]. Six clone detectors were compared using eight different subject systems. A sample of the detected clones was judged manually by Bellon. It was found that, depending on the detection technique, a large number of false positives are among the detected clones. Tiarks et al. [217] categorized type-3 clones detected by different state-of-the-art clone detectors according to their differences. Before categorization, they manually excluded false positives. They found that up to 75% of the clones were false positives.
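
To relate the reported false positive rates to precision, the following minimal Python sketch computes precision as the share of candidates that were judged relevant. The function name and the sample numbers are illustrative assumptions and are not taken from any of the cited studies.

```python
def precision(num_candidates: int, num_false_positives: int) -> float:
    """Share of clone candidates that were judged relevant (not false positives)."""
    if num_candidates <= 0:
        raise ValueError("need at least one candidate")
    if not 0 <= num_false_positives <= num_candidates:
        raise ValueError("false positives must lie between 0 and the number of candidates")
    return (num_candidates - num_false_positives) / num_candidates


# Hypothetical sample: 200 candidates, 150 of them rejected as false positives (75%).
print(precision(200, 150))  # 0.25
```

Under this reading, a false positive rate of 75% leaves a precision of only 0.25.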

Walenstein et al. [229] reveal caveats involved in manual clone assessment. Lack of objective clone relevance criteria results in low inter-rater reliability. Similar results are reported by Kapser et al. [124]. Their work emphasizes the need for measurement of inter-rater reliability to make sure objective clone relevance criteria are used.
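
In its simplest form, inter-rater reliability can be approximated by the share of candidates on which all judges agree. The sketch below assumes a hypothetical data layout (one tuple of ratings per clone candidate, one entry per judge) and is not the measurement procedure used in the cited studies.

```python
from typing import Hashable, Sequence


def share_rated_consistently(ratings: Sequence[Sequence[Hashable]]) -> float:
    """Fraction of candidates for which every judge gave the same rating."""
    if not ratings:
        raise ValueError("need at least one rated candidate")
    unanimous = sum(1 for per_candidate in ratings if len(set(per_candidate)) == 1)
    return unanimous / len(ratings)


# Hypothetical ratings of four candidates by three judges.
ratings = [
    ("relevant", "relevant", "relevant"),
    ("relevant", "irrelevant", "relevant"),
    ("irrelevant", "irrelevant", "irrelevant"),
    ("irrelevant", "relevant", "relevant"),
]
print(share_rated_consistently(ratings))  # 0.5
```

Raw agreement of this kind does not correct for agreement by chance; chance-corrected statistics such as Fleiss' kappa are commonly used for that purpose.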

Some work has been done on tailoring clone detectors to improve their accuracy: Kapser and Godfrey propose to filter clones based on the code regions they occur in. They report that such filters can successfully remove false positives in regions of stereotype code without substantially affecting recall [122]. In addition, all clone detection tools expose parameters whose valuations influence result accuracy. For some individual tools and systems, their effect on the quantity of detected clones has been reported [121]. However, we are not aware of systematic methods for improving result accuracy.
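
The idea of region-based filtering can be sketched as follows. This is an illustration only, not the filter of Kapser and Godfrey: the data model, the region labels and the rule (drop a candidate only if all of its fragments lie in excluded regions) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Fragment:
    path: str
    start_line: int
    end_line: int
    region: str  # hypothetical label, e.g. "function_body" or "import_block"


@dataclass(frozen=True)
class CloneCandidate:
    fragments: Tuple[Fragment, ...]


# Hypothetical labels for regions that are treated as stereotype code.
STEREOTYPE_REGIONS: FrozenSet[str] = frozenset({"import_block", "getter_setter"})


def filter_by_region(
    candidates: List[CloneCandidate],
    excluded: FrozenSet[str] = STEREOTYPE_REGIONS,
) -> List[CloneCandidate]:
    """Keep candidates that have at least one fragment outside the excluded regions."""
    return [
        c for c in candidates
        if any(f.region not in excluded for f in c.fragments)
    ]
```

Whether a candidate with only some fragments in stereotype regions should be kept is a tailoring decision; the sketch keeps it to stay on the side of recall.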

Summary Unfortunately, there is no common, agreed-upon understanding of the criteria that determine the relevance of clones for software maintenance. This is reflected in the multitude of different definitions of software clones in the literature [140, 201]. This lack of relevance criteria introduces subjectivity into clone judgement [124, 229], making objective conclusions difficult. The negative consequences become obvious in the study by Walenstein et al. [229]: three judges independently performed manual assessments of clone relevance; since no objective relevance criteria were given, judges applied subjective criteria, rating only 5 out of 317 candidates consistently. Obviously, such low agreement is unsuitable as a basis for improving clone detection result accuracy.

Problem Clone detection tools produce substantial amounts of false positives, threatening the correctness of research conclusions and the adoption of clone detection by industry. However, we lack explicit criteria that are fundamental to make unbiased assessments of detection result accuracy; consequently, we lack methods for its improvement.

Contribution Chapter 8 introduces clone coupling as an explicit criterion for the relevance of code clones for software maintenance. It outlines a method for clone detection tailoring that employs clone coupling to improve result accuracy. The results of two industrial case studies indicate that developers can estimate clone coupling consistently and correctly and show the importance of tailoring for result accuracy.

3.4.2 Clone Management

In [141], Koschke provides a comprehensive overview of the current work on clone management. He follows Lague et al. [149] and Giesecke [78] in dividing clone management activities into three areas: preventive management aims to avoid the creation of new clones; compensative management aims to alleviate the impact of existing clones; and corrective management aims to remove clones.

Clone Prevention The earlier problems in source code are identified, the easier they are to fix. This also holds for code clones. In [149], Lague et al. propose to prevent the creation of new clones by analyzing code that gets committed to the central source code repository. If a change adds a clone, it needs to pass a special approval process before it is allowed into the system.
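
A commit-time check of this kind can be sketched in a few lines. The sketch is not the tooling of Lague et al.: it substitutes a naive textual detector (identical windows of consecutive, whitespace-trimmed lines) for a real clone detector, and all names and the window size are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Window = Tuple[str, ...]


def duplicated_windows(files: Dict[str, str], min_lines: int = 3) -> Set[Window]:
    """Windows of `min_lines` consecutive non-blank lines that occur more than once."""
    occurrences: Dict[Window, int] = defaultdict(int)
    for text in files.values():
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        for i in range(len(lines) - min_lines + 1):
            occurrences[tuple(lines[i:i + min_lines])] += 1
    return {window for window, count in occurrences.items() if count > 1}


def needs_approval(before: Dict[str, str], after: Dict[str, str], min_lines: int = 3) -> bool:
    """True if the change introduces a duplicated window that did not exist before."""
    new_clones = duplicated_windows(after, min_lines) - duplicated_windows(before, min_lines)
    return bool(new_clones)


# Hypothetical repository states, mapping file names to file contents.
before = {"a.py": "x = 1\ny = 2\nz = x + y\n"}
after = {**before, "b.py": "x = 1\ny = 2\nz = x + y\n"}
print(needs_approval(before, after))  # True
```

Comparing the state before and after the change means that only newly introduced clones trigger the approval step, while clones already present in the system do not block unrelated commits.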

Several processes [5, 51, 177] employ manual reviews of changes before the software can go into production. The LEvD process [51] we employ for the development of ConQAT, e.g., requires all code changes to be reviewed before a release. Manual review is supported by analysis tools, including clone detection. Clones thus draw attention during reviews and are, in most cases, marked as review findings that need to be consolidated by the original author. While this scheme does not prevent clones from being introduced into the source code repository, it does prevent them from being introduced into the released code base.

Existing clone prevention focuses on the clones, not on their root causes. However, while causes for cloning remain, maintainers are likely to continue to create clones. To be effective, clone prevention hence needs to analyze—and rectify—the causes for cloning.

Clone Compensation Clone indication tools point out areas of cloned code to the developer during maintenance of code in an IDE. Their goal is to increase developer awareness of cloning and thus make unintentionally inconsistent changes less likely. Examples include [46, 59, 60, 92, 94, 102, 103, 218]. Real-time clone detection approaches have been proposed to quickly deliver up-to-date clone information for evolving software to clone indication tools [126, 235].

Linked editing tools replicate modifications made to one clone to its siblings [218]. They thus promise to reduce the modification overhead caused by cloning and the likelihood of making unintentionally inconsistent modifications. A similar idea is implemented by CReN [102], which consistently renames identifiers in cloned code.
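
The effect of consistent renaming across clone siblings can be illustrated with a small sketch. It is not the algorithm of CReN or of any linked editing tool: it works on plain text with word-boundary matching rather than on program syntax, and the function name is an assumption made for this example.

```python
import re
from typing import List


def rename_in_clones(fragments: List[str], old: str, new: str) -> List[str]:
    """Rename whole-word occurrences of `old` to `new` in every clone fragment."""
    pattern = re.compile(r"\b" + re.escape(old) + r"\b")
    # A callable replacement inserts `new` literally, without escape processing.
    return [pattern.sub(lambda _match: new, fragment) for fragment in fragments]


# Hypothetical clone siblings that share the identifier `total`.
siblings = ["total = total + x", "subtotal = total * 2"]
print(rename_in_clones(siblings, "total", "acc"))
# ['acc = acc + x', 'subtotal = acc * 2']
```

Because the sketch is purely textual, it would also rename matches inside string literals and comments; a syntax-aware tool would restrict the change to identifier tokens.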

Both clone indication and linked editing tools operate on the source code level. In a large system, clone comprehension, and thus clone compensation, can be supported through tools that offer interactive visualizations at different levels of abstraction. Examples include [219], [238] and [125]. Besides supporting comprehension of clones in a single system version, clone tracking tools aim to support comprehension of the evolution of clones in a system. Several tools to analyze the evolution of cloning have been proposed, including [60, 83, 85, 85, 132, 133, 181, 216]. In [91], Harder and Göde discuss that clone tracking and management face obstacles and raise costs in practice.

Clone Removal Several authors have investigated corrective clone management. Fanta and Rajlich [68] report on an industrial case study in which certain clone types were removed manually from a C++ system. They identify the lack of dedicated tool support for clone removal as an obstacle for clone consolidation. Such tool support is proposed by other authors: Komondoor [136] investigates automated clone consolidation through procedure extraction. Baxter et al. [16] propose to generate C++ macro bodies as abstractions for clone groups and macro invocations to replace the clones. In [8], Balazinska et al. present an approach that consolidates clones through application of the strategy design pattern [74]; in their later paper [9], the same authors present an approach to support system refactoring to remove clones. In a more recent paper, the idea to suggest refactorings based on the results from clone detection is elaborated by Li and Thompson in [154] for the programming language Erlang.
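
What consolidation through procedure extraction amounts to can be shown on a hand-made example. The functions below are hypothetical and were written for this illustration; they do not reproduce the output of any of the cited tools.

```python
# Before consolidation: two clones that differ only in a constant.
def book_price_with_tax(prices):
    total = 0.0
    for price in prices:
        total += price
    return total * 1.07


def game_price_with_tax(prices):
    total = 0.0
    for price in prices:
        total += price
    return total * 1.19


# After consolidation: the shared logic is extracted into one procedure
# and the difference between the clones becomes a parameter.
def price_with_tax(prices, tax_factor):
    total = 0.0
    for price in prices:
        total += price
    return total * tax_factor


def book_price_with_tax_consolidated(prices):
    return price_with_tax(prices, 1.07)


def game_price_with_tax_consolidated(prices):
    return price_with_tax(prices, 1.19)
```

After the extraction, a change to the summation logic has to be made in one place only, which removes the risk of unintentionally inconsistent modifications between the former clones.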

Several authors have identified language limitations as one reason for cloning [140, 201]. To counter this, some authors have investigated further means to remove cloning. Murphy-Hill et al. study clone removal using traits [179]. Basit et al. study clone removal in C++ using a static metaprogramming language [15].


Organizational Change Management Existing research in clone management primarily deals with technical challenges. But, to achieve adoption, and thus impact on software engineering practice, further obstacles have to be overcome. In his keynote speech published in [40], Jim Cordy outlines barriers to the adoption of program comprehension techniques, including clone detection, by his industrial partners. Cordy does not mention technical challenges or immaturity of existing approaches, but instead names business risks, management structures and social and cultural issues as central barriers to adoption. His reports confirm that adoption of clone detection or management approaches by industry faces challenges beyond the capabilities of the employed tools. Work of other researchers confirms challenges in research adoption beyond technical issues [38, 69, 209].

Introducing clone management to reduce the negative impact of cloning on maintenance efforts and program correctness is not a problem that can be solved simply by installing suitable tools. Instead, it requires changes to the work habits of developers. To be successful, the introduction of clone management must thus overcome obstacles that arise when established processes and habits are to be changed.

Challenges faced when changing professional habits are not specific to the introduction of clone management. Instead, they are faced by all changes to development processes, including the introduction of development or quality analysis tools. Furthermore, they are not limited to changes to the development process, but instead permeate all organizational changes. This was realized long ago: management literature contains a substantial body of knowledge on how to successfully coerce established habits into new paths [43, 130, 143–145, 152, 153], some dating back to the 1940s.

Summary The research community produced substantial work on clone management, targeting prevention, compensation and removal of cloning. Much of this work focuses on a single management aspect, for example clone indication or tracking. However, the challenges faced by successful clone management are not limited to developing appropriate tools. Instead, they require both an understanding of the causes for cloning and changes to existing processes and developer behavior. Changing established behavior is hard. Work in organizational change management has shown that it encounters obstacles that need to be addressed for changes to succeed in the long term. This is confirmed by reports on reluctance to adopt clone management [40] and other quality analysis approaches [38] in industry.

Problem Successful introduction of clone management requires changes to established processes and habits. Existing work on clone management, however, focuses primarily on tools for individual management tasks. This does not facilitate organizational change management. Without it, though, clone management approaches are unlikely to achieve long-term success in practice.

Contribution Chapter 8 presents a method to introduce clone control into a software maintenance project. It adapts results from organizational change management to the domain of software cloning. Furthermore, it documents causes of cloning and their solutions for effective clone prevention. The chapter presents a long-term industrial case study that shows that the method can be employed to successfully introduce clone control, and reduce the amount of cloning, in practice.

In document Why and How to Control Cloning in Software Artifacts. Elmar Juergens (Page 48-52)