
Preprocessing transforms the source artifacts into an intermediate representation on which clone detection is performed. The intermediate representation serves two purposes: first, it abstracts from the language of the artifact that gets analyzed, allowing detection to operate independently of the idiosyncrasies of, e. g., C++ or ABAP source code or texts written in English or German; second, different elements in the original artifacts can be normalized to the same intermediate language fragment, thus intentionally masking subtle differences.

This section first introduces artifact-independent preprocessing steps and then outlines artifact-specific strategies for source code, requirements specifications and models.

7.2.1 Steps

ConQAT performs preprocessing in four steps: collection, removal, normalization and unit creation. All of them can be configured to make them suitable for different tasks.

Collection gathers source artifacts from disk and loads them into memory. It can be configured to determine which artifacts are collected and which are ignored. Inclusion and exclusion patterns can be specified on artifact paths and content, e. g., to exclude generated code based on file name patterns, location in the directory structure or typical content.
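The collection step can be illustrated by the following sketch. It combines path-based inclusion and exclusion patterns with content-based exclusion; the pattern defaults and the generator marker string are illustrative assumptions, not ConQAT's actual configuration keys.

```python
import fnmatch
import os

def collect(root, include=("*.java",), exclude=("*/generated/*",),
            content_blacklist=("<auto-generated>",)):
    """Gather source files, skipping excluded paths and generated content."""
    collected = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not any(fnmatch.fnmatch(path, p) for p in include):
                continue  # not an included artifact type
            if any(fnmatch.fnmatch(path, p) for p in exclude):
                continue  # excluded by path pattern
            with open(path, encoding="utf-8", errors="replace") as f:
                text = f.read()
            # Content-based exclusion, e.g. a typical code-generator header.
            if any(marker in text for marker in content_blacklist):
                continue
            collected.append((path, text))
    return collected
```

The same mechanism serves both purposes named above: directory-structure patterns catch generated code kept in dedicated folders, while the content check catches generated files scattered across the code base.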

Removal strips parts from the artifacts that are uninteresting from a clone detection perspective, e. g., comments or generated code.

Normalization splits the (non-ignored parts of the) source artifacts into atomic elements and transforms them into a canonical representation to mask subtle differences that are uninteresting from a clone detection perspective.

Unit creation groups atomic elements created by normalization into units on which clone detection is performed. Depending on the artifact type, it can group several atomic elements into a single unit (e. g., tokens into statements) or produce a unit for each atomic element (e. g., for Matlab/Simulink graphs).
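The grouping variant for code can be sketched as follows: a flat token stream is partitioned into statement units at simple boundary tokens. The boundary set is a simplified assumption; real statement detection is language-specific.

```python
def create_units(tokens):
    """Group a token stream into statement units at ';', '{' and '}'."""
    units, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in (";", "{", "}"):  # assumed statement boundaries
            units.append(tuple(current))
            current = []
    if current:  # trailing tokens form a final unit
        units.append(tuple(current))
    return units
```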


The result of the preprocessing phase is an intermediate representation of the source artifacts. The underlying data structure depends on the artifact type: preprocessing produces a sequence of units for source code and requirements specifications and a graph for models.

7.2.2 Code

Preprocessing for source code operates on the token level. Programming-language-specific scanners are employed to split source code into tokens. Both removal and normalization can be configured to specify which token classes to remove and which normalizing transformations to perform. If no scanner for a programming language is available, preprocessing can alternatively work on the word or line level. However, normalization capabilities are then reduced to regular-expression-based replacements².
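The line-level fallback can be sketched as a list of regular-expression replacement rules. The two rules below (masking string and numeric literals) are illustrative examples of what such a configuration might contain.

```python
import re

# Assumed replacement rules for line-level normalization: each rule maps a
# pattern to a placeholder, masking differences in literals.
RULES = [
    (re.compile(r'"[^"]*"'), "<STR>"),   # string literals
    (re.compile(r"\b\d+\b"), "<NUM>"),   # numeric literals
]

def normalize_line(line):
    """Normalize one source line using regular-expression replacements."""
    line = line.strip()
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line
```

Two lines that differ only in their literals then receive the same normalized representation, so they can still be matched as clones.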

Tokens are removed if they are not relevant for the execution semantics (such as, e. g., comments) or optional (e. g., keywords such as »this« in Java). This way, differences in the source code that are limited to these token types do not prevent clones from being found.

Normalization is performed on identifiers and literals. Literals are simply transformed into a single constant for each literal type (i. e., boolean literals are mapped to another constant than integer literals). For identifier transformation, a heuristic strategy is employed that aims to provide a canonical representation to all statements that can be transformed into each other through consistent renaming of their constituent identifiers. For example, the statement »a = a + b;« gets transformed to »id0 = id0 + id1«. So does »x = x + y«. However, the statement »a = b + c« does not get normalized like this, since it cannot be transformed into the previous examples through consistent renaming. (Instead, it gets normalized to »id0 = id1 + id2«.) This normalization is similar to parameterized string matching proposed by Baker [6].
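The renaming scheme described above can be sketched in a few lines: identifiers are numbered in order of first occurrence within a statement, so that statements differing only by a consistent renaming receive the same canonical form. The identifier pattern below is a simplifying assumption (it also matches keywords, which a real implementation would exclude).

```python
import re

def normalize_statement(statement):
    """Replace identifiers by id0, id1, ... in order of first occurrence."""
    mapping = {}
    def rename(match):
        name = match.group(0)
        if name not in mapping:
            mapping[name] = "id%d" % len(mapping)
        return mapping[name]
    return re.sub(r"[A-Za-z_]\w*", rename, statement)
```

Applied to the examples above, »a = a + b;« and »x = x + y;« yield the same result, while »a = b + c;« yields a different one, exactly as required for consistent-renaming detection.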

ConQAT does not apply the same normalization to all code regions. Instead, different strategies can be applied to different code regions. This allows conservative normalization to be performed on repetitive code—e. g., sequences of Java getters and setters—to avoid false positives; at the same time, non-repetitive code can be normalized aggressively to improve recall. The normalization strategies and their corresponding code regions can be specified by the user; alternatively, ConQAT implements heuristics to provide default behavior suitable to most code bases.

Unit creation forms statements from tokens. This way, clone boundaries coincide with statement boundaries. A clone thus cannot begin or end somewhere in the middle of a statement.

Shapers insert unique units at specified positions. Since unique units are unequal to any other unit, they cannot be contained in any clone. Shapers thus clip clones. ConQAT implements shapers to clip clones to basic blocks, method boundaries or according to user-specified regular expressions.
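The shaper mechanism can be sketched as follows: a globally unique sentinel unit is inserted at each boundary position, e. g., at method starts. Since every sentinel is unequal to all other units (including other sentinels), no clone can span across one. The representation of units as plain values is an assumption for illustration.

```python
import itertools

_counter = itertools.count()  # source of globally unique sentinel ids

def shape(units, boundary_positions):
    """Insert a unique sentinel unit before each boundary position."""
    shaped = []
    boundaries = set(boundary_positions)
    for i, unit in enumerate(units):
        if i in boundaries:
            # Each sentinel carries a fresh id, so it equals nothing else
            # and therefore clips any clone that would cross the boundary.
            shaped.append(("UNIQUE", next(_counter)))
        shaped.append(unit)
    return shaped
```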

² For reasons of conciseness, this section is limited to an overview. A detailed documentation of the existing processors and parameters for normalization is contained in ConQATDoc at www.conqat.org and the ConQAT Book [49].


7.2.3 Requirements Specifications

Preprocessing for natural language documents operates on the word level. A scanner is employed to split text into word and punctuation tokens. Whitespace is discarded. Both removal and normalization operate on the token stream.

Punctuation is removed to allow clones to be found that only differ in, e. g., their commas. Furthermore, stop words are removed from the token stream. Stop words are defined in information retrieval as words that are insignificant or too frequent to be useful in search queries. Examples are “a”, “and”, or “how”.

Normalization applies word stemming to the remaining tokens. Stemming heuristically reduces a word to its stem. ConQAT uses the Porter stemmer algorithm [187], which is available for various languages. Both the list of stop words and the stemming depend on the language of the specification.
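The word-level pipeline can be sketched end to end. The stop-word list and the suffix-stripping rules below are toy stand-ins for the real language-dependent resources (the actual system uses the Porter stemmer); the sketch only illustrates how removal and normalization compose on the token stream.

```python
# Assumed toy resources; real stop-word lists and stemmers are
# language-dependent and far more elaborate.
STOP_WORDS = {"a", "and", "how", "the", "is"}

def toy_stem(word):
    """Crude suffix stripping as a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize_words(tokens):
    """Drop punctuation and stop words, stem the remaining words."""
    result = []
    for tok in tokens:
        if not tok.isalpha():        # remove punctuation tokens
            continue
        word = tok.lower()
        if word in STOP_WORDS:       # remove stop words
            continue
        result.append(toy_stem(word))
    return result
```

After this step, sentences that differ only in punctuation, stop words or inflection map to the same token sequence.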

Unit creation forms sentence units from word tokens. This way, clone boundaries coincide with sentence boundaries. A clone thus cannot begin or end somewhere in the middle of a sentence.

7.2.4 Models

Preprocessing transforms Matlab/Simulink models into labeled graphs. It involves several steps: reading the models, removal of subsystem boundaries, removal of unconnected lines and normalization.

Normalization produces the labels of the vertices and edges in the graph. The label content depends on which vertices are considered equal. For blocks, usually at least the block type is included, while semantically irrelevant information, such as the name, color, or layout position, is excluded. Additionally, some of the block attributes are taken into account, e. g., for the RelationalOperator block the value of the Operator attribute is included, as it decides whether the block performs a greater-than or less-than comparison. For the lines, we store the indices of the source and destination ports in the label, with some exceptions: e. g., for a Product block the input ports do not have to be differentiated. Furthermore, normalization stores weight values for vertices. The weight values are used to treat different vertex types differently when filtering small clones. Weighting can be configured and is an important tool to tailor model clone detection.

The result of these steps is a labeled model graph G = (V, E, L) with the set of vertices (or nodes) V corresponding to the blocks, the directed edges E ⊂ V × V corresponding to the lines, and a labeling function L : V ∪ E → N mapping nodes and edges to normalization labels from some set N. Two vertices or two edges are considered equivalent if they have the same label. As a Simulink block can have multiple ports, each of which can be connected to a line, G is a multi-graph. The ports are not modeled here but implicitly included in the normalization labels of the lines.
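The graph representation just defined can be sketched as a small data structure: vertex labels combine the block type with the semantically relevant attributes, edge labels carry the port indices, and equivalence is label equality. The concrete attribute names are taken from the RelationalOperator example above; everything else is an illustrative assumption.

```python
class ModelGraph:
    """Labeled multi-graph G = (V, E, L) for a Simulink-like model."""

    def __init__(self):
        self.vertices = {}   # vertex id -> normalization label
        self.edges = []      # (src, dst, label); duplicates allowed (multi-graph)

    def add_block(self, vid, block_type, **attrs):
        # Only semantically relevant attributes enter the label; name,
        # color and layout position are deliberately omitted.
        self.vertices[vid] = (block_type,) + tuple(sorted(attrs.items()))

    def add_line(self, src, dst, src_port, dst_port):
        # Ports are not modeled as nodes; their indices go into the edge label.
        self.edges.append((src, dst, (src_port, dst_port)))

    def equivalent(self, v1, v2):
        """Two vertices are equivalent iff they carry the same label."""
        return self.vertices[v1] == self.vertices[v2]
```

Two RelationalOperator blocks are thus equivalent only if their Operator attributes agree, while differently named but otherwise identical blocks remain equivalent.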

For the simple models shown in Figure 7.3, the labeled graph produced by preprocessing is depicted in Figure 7.4. The nodes are labeled according to our normalization function. (The grey portions of the graph mark the part we consider a clone.)