
Extract and Save Blocks

In document Software Maintenance At Commit-Time (Page 58-62)

Chapter 5. Preventing Code Clone Insertion At Commit-Time

5.2 Approach

5.2.3 Extract and Save Blocks

A block is a set of consecutive lines of code that will be compared to all other blocks to identify clones. To implement this critical part of PRECINCT, we rely on TXL [39], a first-order functional programming language based on linear term rewriting, developed by Cordy et al. For TXL to work, one has to write a grammar describing the syntax of the source language and the transformations needed. TXL has three main phases: parse, transform, and unparse. In the parse phase, the grammar controls not only the input form but also the output form. The code sample below, extracted from the official documentation, shows a grammar matching an if-then-else statement in C with some special keywords, [IN] (indent), [EX] (exdent), and [NL] (newline), that will be used for the output form.

    define if_statement
        if ( [expr] ) [IN] [NL]
            [statement] [EX]
        [opt else_statement]
    end define

    define else_statement
        else [IN] [NL]
            [statement] [EX]
    end define

Then, the transform phase, as the name suggests, applies transformation rules that can, for example, normalize or abstract the source code. Finally, the third phase, unparse, converts the transformed parse tree back into text for output. TXL also supports what its creators call Agile Parsing [47], which allows developers to redefine the rules of the grammar and, therefore, apply different rules than the original ones.
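The three phases can be illustrated with a toy pipeline in Python. This is not TXL and does not use its grammar formalism; the tokenizer, the keyword set, and all function names are assumptions made for illustration only. The sketch tokenizes a C-like fragment (parse), replaces every identifier with a placeholder (transform, a normalization commonly used in clone detection), and joins the tokens back into text (unparse).

```python
import re

# Hypothetical, deliberately tiny keyword set; a real grammar covers the whole language.
KEYWORDS = {"if", "else", "return", "int"}
IDENTIFIER = re.compile(r"[A-Za-z_]\w*")
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def parse(source):
    """Parse phase (toy): split the source into a flat list of tokens."""
    return TOKEN.findall(source)


def transform(tokens):
    """Transform phase (toy): abstract every non-keyword identifier to 'ID'."""
    return [
        "ID" if IDENTIFIER.fullmatch(tok) and tok not in KEYWORDS else tok
        for tok in tokens
    ]


def unparse(tokens):
    """Unparse phase (toy): emit the tokens as text, separated by single spaces."""
    return " ".join(tokens)


def normalize(source):
    """Run the three phases in order."""
    return unparse(transform(parse(source)))
```

Under this normalization, two fragments that differ only in variable names produce the same output, which is the property that makes later block comparison insensitive to renaming.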

PRECINCT takes advantage of Agile Parsing by redefining the blocks that should be extracted for the purpose of clone comparison, leaving out the blocks that are out of scope. More precisely, before each commit, we only extract the blocks belonging to the modified parts of the source code. Hence, we only process, in an incremental manner, the latest modification of the source code instead of the source code as a whole.

We have selected TXL for several reasons. First, TXL is easy to install and to integrate with the standard workflow of a developer. Second, it was relatively easy to create a grammar that accepts commits as input. This is because TXL ships with C, Java, C#, Python, and WSDL grammars that define all the particularities of these languages, and these grammars can be customized to accept changesets (chunks of the modified source code that include the added, modified, and deleted lines) instead of the whole code.
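To make the notion of a changeset concrete, the following Python sketch reads the header of one hunk, assuming the changesets arrive as unified diffs (an assumption about the input format, although it is what common version control tools emit). The helper name parse_hunk_header is hypothetical and is not part of PRECINCT or TXL.

```python
import re

# Unified-diff hunk header: @@ -old_start[,old_len] +new_start[,new_len] @@
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line):
    """Return (old_start, old_len, new_start, new_len), or None if not a header."""
    match = HUNK_HEADER.match(line)
    if match is None:
        return None
    old_start, old_len, new_start, new_len = match.groups()
    # An omitted length means 1 in the unified diff format.
    return (int(old_start), int(old_len or 1), int(new_start), int(new_len or 1))
```

For a header such as "@@ -315,36 +315,6 @@", the function reports that 36 lines starting at line 315 of the old file were replaced by 6 lines starting at line 315 of the new file, which is enough to locate the modified region that has to be processed.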

Algorithm 1 presents an overview of the “extract” and “save” blocks operations of PRECINCT. This algorithm receives as arguments the changesets, the blocks that have been previously extracted, and a boolean named compare_history. Lines 1 to 9 contain the for loop that iterates over the changesets. For each changeset (Line 1), we extract the blocks by calling the extract_blocks(Changeset cs) function. In this function, we expand our changeset to the left and the right to have a complete block.

    @@ -315,36 +315,6 @@ int initprocesstree_sysdep(ProcessTree_T **reference) {
     mach_port_deallocate(mytask, task);
     }
     }
    - if (task_for_pid(mytask, pt[i].pid,
    -     &task) == KERN_SUCCESS) {
    - mach_msg_type_number_t count;
    - task_basic_info_data_t taskinfo;

As the changeset above shows, changesets contain only the modified chunk of code and not necessarily complete blocks. Indeed, we have a block from Line 2 to Line 6 and deleted lines from Line 7 to Line 10. However, in Line 7 we can see the end of a block, but we do not have its beginning. Therefore, we need to expand the changeset to the left to obtain syntactically correct blocks. We do so by locating the block’s beginning and ending, using { and } in C, for example. Then, we send these expanded changesets to TXL for block extraction and formalization.
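The expansion step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the PRECINCT implementation: the function names are hypothetical, the changeset is represented as a half-open range of line indices into the full file, and braces inside strings or comments are not skipped.

```python
def unmatched_braces(lines):
    """Return (closers_without_opener, openers_without_closer) for the lines."""
    unmatched_close = 0
    depth = 0
    for line in lines:
        for ch in line:
            if ch == "{":
                depth += 1
            elif ch == "}":
                if depth > 0:
                    depth -= 1
                else:
                    unmatched_close += 1
    return unmatched_close, depth


def expand_changeset(file_lines, start, end):
    """Grow the half-open range [start, end) until its braces balance.

    Naive sketch: braces inside strings or comments are counted too.
    """
    while True:
        missing_open, missing_close = unmatched_braces(file_lines[start:end])
        if missing_open > 0 and start > 0:
            start -= 1  # a stray '}' needs its '{': expand to the left
        elif missing_close > 0 and end < len(file_lines):
            end += 1    # an open '{' needs its '}': expand to the right
        else:
            return start, end
```

Each iteration either widens the range by one line or stops, so the loop terminates at the latest when the range covers the whole file. A production version would need a tokenizer that is aware of string literals and comments.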

For each extracted block, we check if the current block overrides (replaces) a previous block (Line 4). A block is a piece of code that contains the opening and closing delimiters of a code block (i.e., { and } for Java) and is at least 5 lines long. If the current block overrides a previous one, we delete the previous block, as it no longer represents the current version of the program (Line 5). This deletion is optional and is controlled by the compare_history condition in Line 4: overridden blocks are deleted only when compare_history is false; when it is true, they are kept so that new blocks can also be compared against them.
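The block definition given above can be expressed as a small predicate. The sketch below is in Python and makes one simplifying assumption that the text does not state: the opening delimiter is expected on the first line of the candidate and the closing delimiter on the last line. The names is_block and MIN_BLOCK_LINES are hypothetical.

```python
MIN_BLOCK_LINES = 5  # minimum block size stated in the text


def is_block(lines):
    """True if the lines are long enough and are delimited by '{' and '}'.

    Assumption: '{' appears on the first line and '}' on the last line.
    """
    if len(lines) < MIN_BLOCK_LINES:
        return False
    return "{" in lines[0] and "}" in lines[-1]
```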

We believe that deleted blocks have been removed for a good reason (bug, defect, removed feature, ...), and if a newly inserted block matches an old one, it could be worth knowing to improve the quality of the system at hand.

In summary, this step receives the files and lines modified by the developer’s latest changes and incrementally produces an up-to-date block representation of the system at hand. The blocks are analyzed in the next step to discover potential clones.

    Data: Changeset[] changesets; Block[] prior_blocks; Boolean compare_history;
    Result: Up-to-date blocks of the system

    1  for i ← 0 to size of changesets do
    2      Block[] blocks ← extract_blocks(changesets[i]);
    3      for j ← 0 to size of blocks do
    4          if not compare_history AND blocks[j] overrides one of prior_blocks then
    5              delete prior block;
    6          end
    7          write blocks[j];
    8      end
    9  end

    Function extract_blocks(Changeset cs)
        if cs is unbalanced right then
            cs ← expand_left(cs);
        else if cs is unbalanced left then
            cs ← expand_right(cs);
        end
        return txl_extract_blocks(cs);

Algorithm 1: Overview of the Extract Blocks Operation
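The main loop of Algorithm 1 can be sketched in Python as follows. Several points are assumptions made for illustration and are not taken from PRECINCT: the Block fields, the overrides criterion (same file and overlapping line ranges), and the extract_blocks parameter, which is a caller-supplied function standing in for the TXL extraction step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    path: str
    start: int  # first line, inclusive
    end: int    # last line, inclusive
    text: str


def overrides(new, old):
    """Assumed criterion: same file and overlapping line ranges."""
    return new.path == old.path and new.start <= old.end and old.start <= new.end


def save_blocks(changesets, prior_blocks, compare_history, extract_blocks):
    """Return the updated block list after processing all changesets.

    extract_blocks is a stand-in for the TXL step: it maps one changeset
    to a list of Block objects.
    """
    stored = list(prior_blocks)
    for changeset in changesets:
        for block in extract_blocks(changeset):
            if not compare_history:
                # Drop prior versions that the new block replaces.
                stored = [old for old in stored if not overrides(block, old)]
            stored.append(block)
    return stored
```

With compare_history set to false, an older version of a modified block is dropped before the new version is written; with it set to true, both versions remain available for the clone comparison in the next step.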
