11.6 Related Work
Clones are well-established in the literature, with hundreds of papers dedicated to their detection and application. A comprehensive survey of the field is given in [99]. In the following, we relate our work to previously published papers, explaining similarities and differences.
11.6.1 Applications in large-scale and industrial context
Large-scale clone detection studies in an industrial setting can be traced back to [32]. Large-scale, inter-project clone detection was investigated in [70] and [62]. With growing interest from industry and the technology achieving scalability, other industrial clone experiments have been published. Notably, the authors of [33, 34], from Microsoft, ran an experiment with code clones and investigated developer feedback. Further developments in industrial applications were produced by [116] and [110], who led experiments on clone application and management in an industrial context. Our research shares the industrial context and large-scale projects with those studies. We differ from those previous experiments by working in a broader industrial setting with more than 10 times the lines of code, and with many projects coming from different development teams interested in finding clones between projects instead of clones within isolated projects. Also, because of our context, the focus of the experiment was on finding coarse-grained sets of clones representing entire common subsystems instead of localized clone management. In that sense, our study is a departure from the traditional use of clone detection technology in the literature.
In an industrial context with a focus on different products branching from a common ancestor, inter-system clone detection is required to detect the common parts in the originally cloned subsystems. Therefore, clone detectors able to operate in an inter-system context may be better suited for industrial applications. Other authors, such as [69] and [29], have, like us, already explored the context of large-scale inter-system clone detection. Our industrial results suggest that we should proceed with this approach.
A good example of a study of similarity between common code bases is [51]. That work identified many replicas of common parts of a mobile operating system. In large-scale development, these duplications, which are never merged together, generate more maintenance and co-maintenance tasks. Developers also sometimes end up reimplementing functionality that already exists. This case is closely related to our study: both share a context of telecommunications-related applications and heavy similarities centered on operating systems.
Advocating the use of clones as refactoring opportunities is widespread in the literature, as in the recent work of Tsantalis [76]. However, as those works focus on refactoring clone pairs and small clone clusters localized in small regions of the code base, they do not take into account larger cloning phenomena like the replication of entire sub-modules, and thus offer little for factoring out broader, cohesive pieces of software. Moreover, our industrial experiment and feedback from developers suggest that localized clones are readily refactored before being committed to a central repository and thus do not live long enough to become an issue. In the end, our work suggests changing the focus of refactoring activities from localized clones to the management of clones at a higher level of abstraction (such as modules). This is an important distinction between our findings and the current state of the art in clone analysis.
11.6.2 Summary of Different Clone Detection Techniques
Despite the numerous tools already available today, many new techniques and tools are created and published every year in various conferences and journals. The following briefly covers the most recently published techniques as well as the established ones still mentioned in the literature.
In recent years, new techniques have tried to solve precision and recall issues with type-3 and type-4 clones. With regard to type-3 cloning, techniques based on n-grams have gained some popularity. Kamiya in [59] has used n-grams to detect type-3 and type-4 clones in Java bytecode. Yuan in [118] has used frequency vectors of 1-grams and the cosine distance to detect type-3 clones. Lavoie in [80] has used frequencies of generalized n-grams with the normalized Manhattan distance and space partitioning to detect type-3 clones. Sajnani in [105] has also used similarity on n-grams, but within a parallel framework.
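To make the n-gram idea concrete, the following is a minimal sketch of comparing two code fragments through n-gram frequency vectors. It is not the implementation of any of the cited tools: the whitespace tokenization, the function names, and the choice of normalizing the Manhattan distance by the total n-gram count are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def ngram_counts(tokens, n):
    """Count the contiguous n-grams (as tuples) of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two sparse frequency vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: an empty fragment is maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def normalized_manhattan(a, b):
    """L1 distance divided by the total n-gram count (assumed normalization)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    diff = sum(abs(a[gram] - b[gram]) for gram in set(a) | set(b))
    return diff / total


# Hypothetical fragments, tokenized naively on whitespace.
frag_a = "int x = a + b ;".split()
frag_b = "int y = a + b ;".split()

d_cos = cosine_distance(ngram_counts(frag_a, 1), ngram_counts(frag_b, 1))
d_l1 = normalized_manhattan(ngram_counts(frag_a, 2), ngram_counts(frag_b, 2))
```

For these two fragments, which differ only in one identifier, the 1-gram cosine distance is 1/7 and the 2-gram normalized Manhattan distance is 1/3. A detector built on this idea would report a pair as a type-3 clone candidate when the distance falls below a chosen threshold.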
Without relying on n-grams for type-3 clone detection, other tools use distance on token strings or image strings. Murakami in [92] uses a weighted Levenshtein distance, also called Smith-Waterman, to detect type-3 clones. Kamiya also uses token strings in [61].
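As an illustration of a distance on token strings, here is a minimal sketch of a weighted edit distance between two token sequences. It is a plain global edit distance with configurable operation costs. It does not reproduce the cited tools and does not perform the local alignment of Smith-Waterman. The function name, the cost parameters, and the example fragments are hypothetical.

```python
def weighted_edit_distance(a, b, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Global edit distance between token sequences a and b.

    Uses the classic dynamic program, keeping only two rows in memory.
    """
    # Row 0: turning the empty prefix of a into each prefix of b by insertions.
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, tok_a in enumerate(a, start=1):
        # Column 0: turning a[:i] into the empty sequence by deletions.
        curr = [i * del_cost]
        for j, tok_b in enumerate(b, start=1):
            match_or_sub = prev[j - 1] + (0.0 if tok_a == tok_b else sub_cost)
            delete = prev[j] + del_cost
            insert = curr[j - 1] + ins_cost
            curr.append(min(match_or_sub, delete, insert))
        prev = curr
    return prev[-1]


# Hypothetical fragments, tokenized naively on whitespace.
frag_a = "int x = a + b ;".split()
frag_b = "int y = a + b + 1 ;".split()

dist = weighted_edit_distance(frag_a, frag_b)
```

With unit costs, the distance between the two fragments is 3.0: one substitution (`x` to `y`) and two insertions (`+` and `1`). Raising `sub_cost` relative to the insertion and deletion costs is one way to make renamed identifiers count for more or less than inserted statements.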
Locality-sensitive hashing (LSH) has been exploited by [57].
Clone detection by suffix tree matching was introduced by the authors of [72]. Although naive suffix tree matching is usually best fit for type-1 and type-2 clone detection, the authors proposed algorithms and heuristics to close gaps between matched segments in order to achieve competitive type-3 clone detection. The technique has been evaluated in [40] and the released tool iClone dates back to the incremental version of the suffix tree in [46].
Older techniques, like [22], used AST-based matching for type-3 clone detection. Although they produced good results, these techniques are less used today because they lack scalability.
Detecting code clones is also possible using by-products of the code instead of the code itself. For example, Program Dependency Graphs (PDG) have been used by [53] and [75]. Analysis of the memory behaviour is another technique and can be found in [64].
Semantics-related techniques have emerged in recent years. A prime example is [87], which uses Latent Semantic Indexing (LSI).
Although most of the tools mentioned in this section are incremental, particular attention should be given to [19], which performs incremental clone detection on source code repositories and might therefore be relevant for large-scale industrial settings.
11.6.3 Parsing Techniques
Like our work with C/C++ and Java, other clone detectors have tackled difficulties with different languages. Microsoft’s C# has been explored by [8], PHP by [39], and Assembler by [35]. Simulink models, although not code, have also been explored with the NiCAD clone detector in [9]. Clone detection in Simulink models has been explored as well in [7] and [37]. Although a minor variant of the existing parsing techniques, the combination of a secondary language lexer with a main parser to infer the original lexical analysis information is a novelty.
11.6.4 Applications
Many applications of clone detection have already been proposed and investigated. Some of them might be relevant to industrial needs. For reference purposes, we provide a quick overview of those applications.
Discordant security clones have been investigated by [39]. Discordance between clones may also be used to detect bugs in programs, as in [84] and [56].
Application of clone detection technology has also gone beyond source code.
Spreadsheets have gathered interest and were investigated in [52]. Ontology alignment using code clone detectors has also shown promising results in [43].
Among all the currently published applications, we are the first to investigate clones in test suites and clones in TTCN-3.