
We handle this in a final step by checking the inclusion relationship between the reported clone pairs. In the example, this reveals that the nodes from circle 2 are entirely contained in one of the pentagons, so there must be a clone of this circle in the other pentagon, too. Using this information (which analogously holds for the rectangle), we can find the third circle and obtain a clone group of cardinality 3. If there were an additional clone overlapping circles 2 and 3, we would have no single clone pair for the circle clone group, and this approach would fail for that case. However, we consider this case to be unlikely enough to ignore it.

Scalability The time and space requirements of clone pair detection grow quadratically with the overall number of blocks in the model(s). While this might be acceptable (though not optimal) for the running time, since we can execute the program in batch mode, the amount of required memory can be prohibitive for even several thousand blocks.

To solve this, we split the model graph into its connected components. We detect clone pairs independently within each component and between each pair of components, which still allows us to find all clone pairs we would find without this technique. This does not improve the running time, as each pair of blocks is still examined (although we might gain something by filtering out components smaller than the minimal clone size). The amount of memory needed, however, now only depends quadratically on the size of the largest connected component. If the model is composed of unconnected submodels, or if we can split the model into smaller parts by some other heuristic (e. g., separating subsystems on the topmost level), memory is hence no longer the limiting factor.
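The component-wise scheduling described above can be sketched as follows. This is a minimal illustration, not ConQAT's actual API: the block/edge representation and the detection_tasks helper are assumptions made for the example.

```python
from collections import defaultdict, deque

def connected_components(blocks, edges):
    """Split the model graph into its connected components via BFS."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    seen, components = set(), []
    for start in blocks:
        if start in seen:
            continue
        comp, queue = [], deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            comp.append(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(comp)
    return components

def detection_tasks(components, min_clone_size):
    """Yield the component (pairs) on which pairwise clone detection runs.
    Components smaller than the minimal clone size cannot contain a clone
    and are filtered out up front."""
    comps = [c for c in components if len(c) >= min_clone_size]
    for i, c in enumerate(comps):
        yield (c, c)              # detect within one component
        for d in comps[i + 1:]:
            yield (c, d)          # detect between two components
```

Memory now depends on the size of the two components handled at a time, not on the whole model, while the union of all tasks still covers every pair of blocks.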

We measured performance for the industrial Matlab/Simulink model we analyzed during the case study presented in Chapter 5, which comprises 20,454 blocks: the entire detection process—including pre- and postprocessing—took 50 s on an Intel Pentium 4 3.0 GHz workstation. The algorithm thus scales well to real-world models.

7.4 Postprocessing

Postprocessing comprises the process steps that are performed on the detected clones before the results are presented to the user. In ConQAT, postprocessing comprises merging, filtering, metric computation and tracking.

7.4.1 Steps

Filtering removes clones that are irrelevant for the task at hand. It can be performed based on clone properties such as length, cardinality or content, or based on external information, such as developer-created blacklists.

7 Algorithms and Tool Support

Clone tracking compares clones detected on the current version of a system against those detected on a previous one. It identifies newly added, modified and removed clones. If tracking is performed regularly, beginning at the start of a project, it determines when each individual clone was introduced.

The following sections describe the postprocessing steps in more detail. Postprocessing steps are, in principle, independent of the artifact type. Each step—filtering, metric computation and clone tracking—can thus be performed for clones discovered in source code, requirements specifications or models. However, for conciseness, this section presents postprocessing for clones in source code. Since the same intermediate representation is used for both code and requirements specifications, all of the presented postprocessing features can also be applied to requirements clones. Most of them, in addition, are either available for clones in models as well, or could be implemented in a similar fashion.

7.4.2 Filtering

Filtering removes clone groups from the detection result. ConQAT performs filtering in two places:

local filters are evaluated right after a new clone group has been detected; global filters are evaluated after detection has finished. While global filters are less memory efficient—the later a clone group is filtered, the longer it occupies memory—they can take information from other clone groups into account. They thus enable more expressive filtering strategies. ConQAT implements clone constraints based on various clone properties.

The NonOverlappingConstraint checks whether the code regions of sibling clones overlap. The SameFileConstraint checks if all siblings are located in a single file. The CardinalityConstraint checks whether the cardinality of a clone group is above a given threshold.

The ContentConstraint is satisfied for a clone group if the content of at least one of its clones matches a given regular expression. Content filtering is, e. g., useful to search for clones that contain special comments such as TODO or FIXME; they often indicate duplication of open issues. Constraints for type-3 clones allow filtering based on their absolute number of gaps or their gap ratio. If, e. g., all clones without gaps are filtered, detection is limited to type-3 clones. This is useful to discover type-3 clones that may indicate faults and convince developers of the necessity of clone management. Clone groups can be filtered based on either satisfied or violated constraints.
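The constraints above behave like predicates over clone groups. A minimal sketch, assuming a simple dictionary representation of clone groups for illustration (not ConQAT's actual data model):

```python
import re

def cardinality_constraint(group, threshold):
    """CardinalityConstraint: the clone group has at least `threshold` clones."""
    return len(group["clones"]) >= threshold

def same_file_constraint(group):
    """SameFileConstraint: all sibling clones are located in a single file."""
    return len({clone["file"] for clone in group["clones"]}) == 1

def content_constraint(group, pattern):
    """ContentConstraint: at least one clone's content matches the regex."""
    return any(re.search(pattern, c["content"]) for c in group["clones"])

def filter_groups(groups, constraint, keep_satisfied=True):
    """Keep groups for which the constraint is satisfied (or, if
    keep_satisfied is False, violated)."""
    return [g for g in groups if constraint(g) == keep_satisfied]
```

A filter for open-issue clones would then pass, e. g., `lambda g: content_constraint(g, r"TODO|FIXME")` to `filter_groups`.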

Blacklisting Even if clone detection is tailored well, false positives may slip through. For continuous clone management, a mechanism is required to remove such false positives. To be useful, it must be robust against code modifications—a false positive remains a false positive independent of whether its file is renamed or its location in the file changes (e. g., because code above it is modified). It thus still needs to be suppressed by the filtering mechanism.

ConQAT implements blacklisting based on location independent clone fingerprints. If a file is renamed, or the location of a clone in the file changes, the value of the fingerprint remains unchanged. For type-1 and type-2 clones, all clones in a clone group have the same fingerprint. A blacklist stores fingerprints of clones that are to be filtered. Fingerprints are added by developers that consider a clone irrelevant for their task. During postprocessing, ConQAT removes all clone groups whose fingerprint appears in the blacklist.14

Fingerprints are computed on the normalized content of a clone. The textual representation of the normalized units is concatenated into a single characteristic string. For type-1 and type-2 clones, all clones in a clone group have the same characteristic string; no clone outside the clone group has the same characteristic string—else it would be part of that clone group. The characteristic string is independent of the filename or location in the file. Since it can be large for long clones, ConQAT uses its MD5 [192] hash as clone fingerprint to save space. Because of the very low collision probability of MD5, we do not expect to unintentionally filter clone groups due to fingerprint collisions.
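The fingerprinting scheme can be sketched as follows; the list-of-strings representation of the normalized units is an assumption made for illustration:

```python
import hashlib

def clone_fingerprint(normalized_units):
    """Concatenate the textual representation of the normalized units into
    one characteristic string and hash it with MD5. The result is
    independent of file name and position in the file, so it survives
    renames and moves caused by edits elsewhere in the file."""
    characteristic = "\n".join(normalized_units)
    return hashlib.md5(characteristic.encode("utf-8")).hexdigest()

def is_blacklisted(normalized_units, blacklist):
    """A clone group is suppressed if its fingerprint is on the blacklist."""
    return clone_fingerprint(normalized_units) in blacklist
```

Since all clones of a type-1 or type-2 clone group share the same normalized content, blacklisting any one clone of the group suppresses the whole group.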

Blacklisting works for type-1 and type-2 clones in source code and requirements specifications. It is currently not implemented for type-3 clones. However, their clone group fingerprints could be computed on the similar parts of the clones to cope with different gaps of type-3 clones.

Cross-Project Clone Filtering Cross-project clone detection searches for clone groups whose clones span at least two different projects. The definition of project, in this case, depends on the context:

Cross-project clone detection can be used in software product lines to discover reusable code fragments that are candidates for consolidation [173]; or to discover clones between applications that build on top of a common framework to spot omissions. Projects in this context are thus individual products of a product family or applications that use the same framework.

To discover copyright infringement or license violations, it is employed to discover cloning between the code base maintained by a company and a collection of open source projects or software from other owners [77, 121]. Projects in this context are the company’s code and the third party code. ConQAT implements a CrossProjectCloneGroupsFilter that removes all clone groups that do not span at least a specified number of different projects. Projects are specified as path or package prefixes. Project membership is expressed via the location in the file system or the package (or namespace) structure.
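A sketch of such prefix-based cross-project filtering; the dictionary representation of clone groups is again an illustrative assumption, not ConQAT's actual data model:

```python
def project_of(path, project_prefixes):
    """Map a clone location to its project via path/package prefixes."""
    for prefix in project_prefixes:
        if path.startswith(prefix):
            return prefix
    return None

def cross_project_filter(groups, project_prefixes, min_projects=2):
    """Remove clone groups that do not span at least `min_projects`
    different projects."""
    kept = []
    for group in groups:
        projects = {project_of(c["file"], project_prefixes)
                    for c in group["clones"]}
        if len(projects) >= min_projects:
            kept.append(group)
    return kept
```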

Figure 7.16 depicts a treemap that shows cloning across three different industrial projects.15 Areas A, B and C mark project boundaries. Only cross-project clone groups are included. The project in the lower left corner does not contain a single cross-project clone, whereas the other two projects do. In both projects, most of the cloning is, however, clustered in a single directory. It contains GUI code that is similar between both projects.

14 All blacklisted clone groups are optionally written to a separate report, to allow checking whether the blacklisting feature has been misused to artificially reduce cloning.


Figure 7.16: Cross-project clone detection results

7.4.3 Metric Computation

ConQAT computes the cloning metrics introduced in Chapter 4, namely clone counts, clone coverage and overhead. Computation of counts and coverage is straightforward. Hence, only computation of overhead is described here in detail.

Overhead is computed as the ratio SS/RFSS − 1. If, for example, a statement in a source file is covered by a single clone that has two siblings, it occurs three times in the system. Perfect removal would eliminate two of the three occurrences. It thus only contributes a single RFSS. RFSS computation is complicated by the fact that clone groups can overlap.

RFSS computation only counts a unit in a source artifact 1/(number of times it is cloned) times. In the above example, each occurrence of the statement is thus only counted as 1/3 RFSS. We employ a union-find data structure to represent cloning relationships at the unit level. All units that are in a cloning relationship are in the same component in the union-find structure; all other units are in separate ones. For RFSS computation, the units are traversed. Each unit accounts for 1/(component size) RFSS.
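The RFSS computation can be sketched with a small union-find structure. Representing units as integer indices and cloning relationships as index pairs is an assumption made for this sketch:

```python
class UnionFind:
    """Minimal union-find with path halving."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

def rfss(num_units, clone_relations):
    """Each unit contributes 1 / (size of its cloning component) RFSS;
    units outside any cloning relationship form singleton components
    and contribute a full unit each."""
    uf = UnionFind(num_units)
    for a, b in clone_relations:
        uf.union(a, b)
    sizes = {}
    for u in range(num_units):
        root = uf.find(u)
        sizes[root] = sizes.get(root, 0) + 1
    return sum(1.0 / sizes[uf.find(u)] for u in range(num_units))
```

For five units where units 0, 1 and 2 are clones of each other, the three cloned units together account for one RFSS, so RFSS is 3 and overhead is 5/3 − 1 = 2/3.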

7.4.4 Tracking

Clone tracking establishes a mapping between clone groups and clones of different (typically consecutive) versions of a software. Based on this mapping, clone churn—added, modified and removed clones—is computed. Tracking goes beyond fingerprint-based blacklisting, since it can also associate clones whose content has changed across versions. Since different content implies different fingerprints, such clones are beyond the capabilities of blacklisting.


ConQAT implements lightweight clone tracking to support clone control with clone churn informa- tion. The clone tracking procedure is based on the work by Göde [83]. It comprises three steps that are outlined in the following:

Update Old Cloning Information Since the last clone detection was performed, the system may have changed. The cloning information from the last detection is thus outdated—clone positions might be inaccurate, some clones might have been removed while others might have been added. ConQAT updates old cloning information based on the edit operations that have been performed since the last detection, to determine where the clones are expected in the current system version.

ConQAT employs a relational database system to persist clone tracking information. Cloning information from the last detection is loaded from it. Then, for each file that contains at least one clone, the diff between the previous version (stored in the database) and the current version is computed.

It is then used to update the positions of all clones in the file. For example, if a clone started in line 30, but 10 lines above it have been replaced by 5 new lines, its new start position is set to 25. If the code region that contained a clone has been removed, the clone is marked as deleted. If the content of a clone has changed between system versions, the corresponding edit operations are stored for each clone.
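The position update can be sketched as follows. Representing each edit operation as a (position, deleted lines, added lines) triple, with positions referring to the previous version, is a simplifying assumption; ConQAT's actual implementation differs:

```python
def update_clone_position(start_line, edits):
    """Update a clone's start line from the diff's edit operations.
    Each edit (pos, deleted, added) replaces `deleted` lines starting at
    line `pos` of the previous version with `added` new lines. Edits
    entirely above the clone shift its start; an edit overlapping its
    start means the clone's region was removed."""
    delta = 0
    for pos, deleted, added in edits:
        if pos + deleted <= start_line:
            delta += added - deleted   # edit fully above: shift the clone
        elif pos <= start_line:
            return None                # clone region removed
    return start_line + delta
```

For the example from the text, a clone starting in line 30 with 10 lines above it replaced by 5 new lines is moved to line 25.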

Detect New Clones While the above step identifies old and removed clones, it cannot discover newly added clones in the system. For this purpose, in the second step, a complete clone detection is run on the current system version. It identifies all its clones.

Compute Churn In the third step, updated clones are matched against newly detected ones to compute clone churn. We differentiate between these cases:

Positions of updated clone and new clone match: this clone has been tracked successfully between system versions.

New clone has no matching updated clone: tracking has identified a clone that was added in the new system version.

Updated clone has no matching new clone: it is no longer detected in the new system version. The clone or its siblings have either been removed, or an inconsistent modification prevents its detection. These two cases need to be differentiated, since inconsistent modifications need to be pointed out to developers for further inspection. Tracking distinguishes them based on the edit operations stored in the clones.

Churn computation determines the list of added and removed clones and of clones that have been modified consistently or inconsistently.
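The matching step can be sketched as follows. The clone representation—an id, an expected position (None if the region was deleted) and a flag for stored edit operations—is an illustrative simplification of the persisted tracking data:

```python
def compute_churn(updated, new_positions):
    """Match updated clones against newly detected ones by position and
    classify the churn cases described above."""
    tracked, disappeared, matched = [], [], set()
    for cid, (pos, modified) in updated.items():
        if pos is not None and pos in new_positions:
            tracked.append(cid)        # tracked across versions
            matched.add(pos)
        else:
            # Either removed, or an inconsistent modification prevented
            # detection; the stored edit operations tell the cases apart.
            reason = "inconsistent" if modified else "removed"
            disappeared.append((cid, reason))
    added = [p for p in new_positions if p not in matched]
    return tracked, added, disappeared
```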
