Chapter 7. Data collection
7.1.5 Identification of attributes
7.2.2.1 Tracking of methods across snapshots
The data collection algorithm searches for the methods defined in each logical change. This means that methods must be tracked across snapshots by their names. However, the name of a method may change during its lifetime. In order to calculate the lifetime of a method accurately, we store in a separate table the way in which methods are renamed across their lifetime. This table is called method translation (shown in Figure 7-14). The table stores a new identifier for each method (id) and assigns to it the identifier(s) of the corresponding method(s) found during data collection (realMethod_id).
Figure 7-14. Tables that store the renamed methods reconstructed using origin analysis
In this way, the information stored about two methods that are renamed versions of the same method can be reconstructed before the analysis phases without adding complexity to the data collection algorithm. The rest of this section covers the algorithm used to detect the previous version (i.e. the origin) of methods that seem new because they have a combination of name and location that was not previously stored in the database.
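The reconstruction described above can be illustrated with a minimal sketch. It assumes the method translation table is available as (id, realMethod_id) rows, and that the lifetime of each collected method is known as a pair of logical change numbers. The sample rows, the lifetime values, and the function name merged_lifetimes are hypothetical and only serve the illustration.

```python
from collections import defaultdict

# Hypothetical rows of the method translation table: (id, realMethod_id).
# Rows that share an id are renamed versions of one and the same method.
METHOD_TRANSLATION = [
    (1, 101),
    (1, 205),  # 205 was identified as the renamed successor of 101
    (2, 102),
]

# Hypothetical lifetimes recorded during data collection:
# realMethod_id -> (first logical change, last logical change).
RAW_LIFETIMES = {101: (1, 40), 205: (41, 90), 102: (5, 60)}


def merged_lifetimes(translation, raw_lifetimes):
    """Merge the lifetimes of all collected methods that share a translated id."""
    spans = defaultdict(list)
    for method_id, real_method_id in translation:
        spans[method_id].append(raw_lifetimes[real_method_id])
    return {
        method_id: (min(s[0] for s in parts), max(s[1] for s in parts))
        for method_id, parts in spans.items()
    }


lifetimes = merged_lifetimes(METHOD_TRANSLATION, RAW_LIFETIMES)
```

Keeping the translation in a separate table means the collected rows are never rewritten; the merge is applied only when the analysis needs the complete lifetime.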
Figure 7-15. Origin analysis principle. The methods that are supposed to be new may be a renamed version of any of the methods that are supposed to be deleted. (Figure labels: logical changes; conventions: unchanged method, method deleted, new method.)
As explained in the previous chapter, current approaches to find the origin of methods are not adequate for the analysis of applications over a large interval of time because they are costly to calculate in terms of time. Therefore, we propose a new algorithm. As explained before, origin analysis algorithms aim to find the origin of a method that seems new among the set of methods that seem deleted (i.e. the candidates). To decide whether the ‘new’ method and a candidate are the same method, origin analysis techniques rely on two concepts: the identification of the methods, and the algorithm used to compare such identifications. Our algorithm uses two identifications for a method: its complete name, and its content (LOCs). The name of the method is the identification used to filter out most of the candidate origins of the new method. The second identification (i.e. the content of the method) is used only when more than one candidate remains after the filtering by name.
Figure 7-16. Our filtering of candidate methods (empty circles) to find the origin of the method that seems new (filled circle). (Figure labels: name similarity: moved, renamed, different parameters; LOC similarity; lifetime overlapping; LOC similarity above threshold; highest similarity; possible origin; uniqueness of origins; origin.)
The first filter eliminates those candidates that at any point of their lifetime co-existed with the analyzed method [16]. This means that the first filter eliminates those candidates [17] that ‘revived’ while the analyzed method was alive. This happens when methods are temporarily eliminated by commenting them out. ‘Revived’ methods also occur when the file that hosts the method is accidentally deleted from the repository and re-added afterwards.
[16] Analyzed method: the one supposedly new.
[17] Candidate: a method that was deleted when the analyzed method was created.
The second filter divides the candidate methods into three groups depending on their name. The identification of methods by name considers the fully qualified name of the class and the signature of the method. The first group contains the methods in the same class and with the same parameters but with a different name (in Figure 7-16, the ‘renamed’ set). The second group contains the methods in the same class and with the same name but with different parameters (in Figure 7-16, the ‘different parameters’ set). The third group contains the methods with the same signature located in a different class or package than the analyzed method (in Figure 7-16, the ‘moved’ set).
The third filter measures the similarity between the lines of code of the analyzed method and those of the candidates. Given that calculating this similarity may be costly in terms of time, only some of the sets of candidates are compared. By default, the candidates that are in the same class as the analyzed method (i.e. renamed candidates or different parameters candidates) have priority over the methods that seem moved. That means that the LOC similarity is not calculated for moved candidates. However, if the majority of the candidates belong to the same file or class, it is likely that the whole class or file was moved. In that case, moved candidates have priority, and the LOC similarity is not calculated for renamed candidates or different parameters candidates.
The fourth filter eliminates the candidates whose LOC similarity is below a threshold of 70%. If no candidate remains after this filter, the set of candidates that was not analyzed previously is compared for LOC similarity against the analyzed method.
The fifth filter takes the candidates with the highest LOC similarity. If more than one method has the highest similarity, the algorithm assumes that the method does not have a clear origin and marks it as a new method.
It is assumed that the analyzed method is a new method cloned from the candidate methods that resemble it.
Finally, the last filter checks that the origin of every method is unique, i.e. that two different methods do not have the same origin. This check is needed because there are cases in which two methods apparently created at the same time are very similar to a third method, the origin ‘candidate’, that was deleted at that same time. Note that this happens when a method is divided into two methods, each one with a different signature from the initial method. This is a common refactoring of large methods with several parameters that handle multiple responsibilities. As explained before, in these cases the origin is assigned to the new method that has the higher similarity with the deleted method. If the ‘new’ methods are equally similar to the origin, the algorithm assumes that the origin was deleted and that the ‘new’ methods are indeed new.
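The chain of filters can be summarised in a short sketch. It is only an illustration under stated assumptions, not the implementation used for the data collection. The Method record, the encoding of a lifetime as a set of logical change numbers, the loc_similarity callback, and the function names are hypothetical. The rule that gives priority to moved candidates (more than half of the grouped candidates are moved methods coming from one class) is one possible reading of the text.

```python
from collections import Counter
from dataclasses import dataclass

THRESHOLD = 0.70  # LOC similarity threshold given in the text


@dataclass(frozen=True)
class Method:
    cls: str          # fully qualified class name
    name: str         # method name
    params: tuple     # parameter types of the signature
    alive: frozenset  # logical changes in which the method exists (assumed encoding)


def name_group(new, cand):
    """Second filter: classify a candidate by how its name differs from the new method."""
    same_cls, same_name, same_params = (
        cand.cls == new.cls, cand.name == new.name, cand.params == new.params)
    if same_cls and same_params and not same_name:
        return "renamed"
    if same_cls and same_name and not same_params:
        return "different parameters"
    if not same_cls and same_name and same_params:
        return "moved"
    return None  # unrelated by name: not a candidate


def find_origin(new, deleted, loc_similarity):
    """Return the origin of `new`, or None if it is considered a new method."""
    # First filter: drop candidates that ever co-existed with the new method.
    candidates = [c for c in deleted if not (c.alive & new.alive)]
    # Second filter: group the candidates by name.
    same_class, moved = [], []
    for c in candidates:
        group = name_group(new, c)
        if group == "moved":
            moved.append(c)
        elif group is not None:
            same_class.append(c)
    # Third filter: choose which set is compared first (assumed majority rule).
    order = [same_class, moved]
    if moved:
        top_count = Counter(c.cls for c in moved).most_common(1)[0][1]
        if top_count > (len(same_class) + len(moved)) / 2:
            order = [moved, same_class]
    for group in order:
        # Fourth filter: keep candidates at or above the threshold.
        scored = [(loc_similarity(new, c), c) for c in group]
        scored = [(s, c) for s, c in scored if s >= THRESHOLD]
        if not scored:
            continue  # fall back to the set that was not compared yet
        # Fifth filter: highest similarity wins; a tie means no clear origin.
        best = max(s for s, _ in scored)
        winners = [c for s, c in scored if s == best]
        return winners[0] if len(winners) == 1 else None
    return None


def enforce_unique_origins(matches):
    """Last filter. matches: {new_method: (origin, similarity)} -> {new_method: origin}."""
    by_origin = {}
    for new, (origin, sim) in matches.items():
        by_origin.setdefault(origin, []).append((sim, new))
    result = {}
    for origin, claims in by_origin.items():
        best = max(sim for sim, _ in claims)
        winners = [new for sim, new in claims if sim == best]
        if len(winners) == 1:  # equal similarity: all claimants stay new
            result[winners[0]] = origin
    return result


# Hypothetical usage: one new method and four deleted methods.
new = Method("app.Tax", "computeTax", ("int",), frozenset({10, 11}))
renamed = Method("app.Tax", "calcTax", ("int",), frozenset(range(1, 10)))
reparam = Method("app.Tax", "computeTax", ("int", "int"), frozenset(range(1, 10)))
revived = Method("app.Tax", "oldTax", ("int",), frozenset({5, 10}))
moved_m = Method("lib.Util", "computeTax", ("int",), frozenset(range(1, 10)))
sims = {renamed: 0.9, reparam: 0.75, revived: 1.0, moved_m: 0.95}
origin = find_origin(new, [renamed, reparam, revived, moved_m], lambda n, c: sims[c])
```

In this example the revived method is discarded despite its perfect similarity, and the moved method is never compared because the same-class candidates have priority, so the renamed method is selected.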
Code) of the method and the candidate is detected. For each line in the shorter method, the algorithm finds the line that resembles it most from the lines that compose the larger method. Then the similarity between the two methods is the sum of the similarities found for all the lines of the shorter method, over the number of lines in the shorter method.
The similarity between two lines of code is calculated as the average, over the parallel tokens of the two lines, of the percentage of characters that the tokens have in common. The percentage of characters in common between two tokens is the number of pairs of consecutive characters that the tokens have in common (counted twice), over the number of characters of both tokens. Note that the common characters are counted twice so that the similarity of two equal strings is one. Tokens are recognized by taking into account the special characters of the language, such as spaces, operators, braces, semicolons, etc. In addition, tokens may be divided into words wherever there is a change of capitalization, as in taxesCalculation, or an intermediate character, as in taxes_calculation. Separating tokens into words increases the accuracy with which the similarity reflects the semantics of the analyzed lines.
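The similarity measure can be sketched as follows. The sketch makes three assumptions that the text does not settle. First, the denominator of the token similarity is read as the number of consecutive-character pairs in both tokens (the Dice coefficient on character pairs), because that reading yields exactly one for equal tokens. Second, tokens without a parallel partner contribute zero to the line average. Third, the regular expressions are only one way to recognize tokens and words. All function names are hypothetical.

```python
import re
from collections import Counter


def split_words(token):
    """Split an identifier on underscores and on changes of capitalization."""
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", token)
    return words or [token]  # operators and other symbols stay as they are


def tokenize(line):
    """Recognize tokens by special characters, then split them into words."""
    words = []
    for token in re.findall(r"\w+|[^\w\s]", line):
        words.extend(split_words(token))
    return words


def token_similarity(a, b):
    """Common pairs of consecutive characters, counted twice, over all pairs."""
    if a == b:
        return 1.0
    pairs_a = Counter(a[i:i + 2] for i in range(len(a) - 1))
    pairs_b = Counter(b[i:i + 2] for i in range(len(b) - 1))
    total = sum(pairs_a.values()) + sum(pairs_b.values())
    if total == 0:
        return 0.0
    common = sum((pairs_a & pairs_b).values())
    return 2 * common / total


def line_similarity(line_a, line_b):
    """Average similarity of parallel tokens (assumed: unmatched tokens score 0)."""
    tokens_a, tokens_b = tokenize(line_a), tokenize(line_b)
    if not tokens_a or not tokens_b:
        return 1.0 if tokens_a == tokens_b else 0.0
    total = sum(token_similarity(x, y) for x, y in zip(tokens_a, tokens_b))
    return total / max(len(tokens_a), len(tokens_b))


def method_similarity(lines_a, lines_b):
    """Best match of each line of the shorter method, averaged over its lines."""
    shorter, longer = sorted((lines_a, lines_b), key=len)
    if not shorter:
        return 0.0
    best = (max(line_similarity(s, l) for l in longer) for s in shorter)
    return sum(best) / len(shorter)
```

For example, `token_similarity("night", "nacht")` shares only the pair `ht` out of four pairs in each token, which gives 2 × 1 / 8 = 0.25.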
The results of the extraction of logical changes are shown in Table 7-3.
Table 7-3. Summary of methods identified in the analyzed applications
Application     Number of methods detected initially     Number of methods after origin analysis
Freecol         4099                                     4050
JEdit           8434                                     8004
Ganttproject    14895                                    14616
Columba         28876                                    28376
JBoss mod.      12144                                    12132
7.2.3 Identification of clones
Giesecke pointed out desirable characteristics of clone detection tools, which include being language independent, being independent of the detection approach, and being able to detect clones at different levels of similarity [Giesecke '07]. The first characteristic is desirable because it permits analyzing applications written in different languages, which in turn permits comparing the impact of the programming language on the types of clones that an application may have. Besides, a language-independent clone detector allows the analysis of a wider range of applications. The last two characteristics permit deciding which clone instances are false positives.
CCFinder is an automatic clone detection tool that uses lexical analysis to normalize the source code, which is then transformed by language-dependent rules into a sequence of tokens. Finally, a string-based comparison between the tokens locates the clones [Kamiya '02]. We decided to use CCFinder as our clone detection tool for several reasons. First, it allows us to compare our results with many empirical studies on cloning ([Monden '02; Ueda '02; Kapser '03; Kapser '04; Kim '05; Geiger '06; Kapser '06a; Kapser '06b]). Second, CCFinder is capable of detecting three of the four types of clones in terms of similarity level (see, in this chapter, the section ‘Classifications of clones’ on page 51). Third, among the tools that have been used for large-scale analyses, CCFinder has one of the best recall levels while still keeping a reasonable precision [Bellon '07] (see the sections ‘Advantages and disadvantages of each code representation’ on page 44, and ‘Comparison of the most popular clone detection tools’ on page 50). Because it produces data with high recall and reasonable precision, it yields a rich dataset that can be filtered if desired, as Giesecke suggests [Giesecke '07]. Finally, given that CCFinder is based on token comparison, its time performance is very good, which is an important requirement when analyzing the history of clones.
The data collection algorithm parses CCFinder's output. Although CCFinder can detect clones in different programming languages, one of the intermediate files produced by CCFinder that needs to be parsed depends on the programming language; therefore, our algorithm only handles Java applications.
CCFinder was configured to find code clones with a minimal length of 30 tokens, with tokens of at most 10 characters, to distinguish different identifiers, and to ignore block structures so that clones are not partitioned, as shown in the example below.
ccfx.exe d -i inputFile.files -o outputFile.ccfxd -b 30 -t 10 -v
The size of the token was chosen because it is the default of CCFinder, probably for performance reasons. However, this does not mean that similarity in larger tokens is dismissed. If a token is longer than 10 characters, it is divided into several tokens of the same type that are compared sequentially. Therefore, identical tokens would be recognized anyway.
The output of CCFinder is a file containing the clone tokens that form the clone relations and the clone families, and an intermediate file for each source code file analyzed that stores the translation of the file to tokens. An example of the output of CCFinder is presented below:
source_files {
1 C:\client\ClientOptions.java 1587
2 C:\client\control\ClientModelController.java 422
...
}
clone_pairs {
7 1.4-43 1.40-79
918 1.601-667 1.1211-1277
1039 1.739-770 1.780-811
1039 1.739-770 286.95-126
...
}
The first set of lines lists each file: its identifier, its full path, and its number of tokens. The second set of lines describes all clone relations in the application in three columns. The first column contains the unique identifier of a clone family and indicates that the clone relation in the following two columns belongs to that family. The second and third columns specify the fragments that compose the clone relation: the first number is the identifier of the file in which the fragment is located, and the two following numbers are the first and last tokens that belong to the fragment. For instance, the first line of the clone relations says that there is a clone relation belonging to family 7, whose first fragment goes from token 4 to token 43 of file 1, and whose second fragment goes from token 40 to token 79 of file 1.
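A minimal sketch of how the clone relations could be read from this output follows. It assumes the textual layout described above, with one clone relation per line inside the clone_pairs block. The type names Fragment and ClonePair and the function parse_clone_pairs are hypothetical; they are not part of CCFinder.

```python
import re
from typing import NamedTuple


class Fragment(NamedTuple):
    file_id: int
    first_token: int
    last_token: int


class ClonePair(NamedTuple):
    family_id: int
    left: Fragment
    right: Fragment


# family  file.first-last  file.first-last
_PAIR = re.compile(r"^(\d+)\s+(\d+)\.(\d+)-(\d+)\s+(\d+)\.(\d+)-(\d+)$")


def parse_clone_pairs(text):
    """Collect the clone relations listed inside the clone_pairs block."""
    pairs, inside = [], False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("clone_pairs"):
            inside = True
        elif line == "}":
            inside = False
        elif inside:
            match = _PAIR.match(line)
            if match:  # skips elisions and any line in another layout
                fam, f1, s1, e1, f2, s2, e2 = map(int, match.groups())
                pairs.append(ClonePair(fam, Fragment(f1, s1, e1), Fragment(f2, s2, e2)))
    return pairs


SAMPLE = r"""source_files {
1 C:\client\ClientOptions.java 1587
2 C:\client\control\ClientModelController.java 422
}
clone_pairs {
7 1.4-43 1.40-79
918 1.601-667 1.1211-1277
1039 1.739-770 1.780-811
1039 1.739-770 286.95-126
}
"""

pairs = parse_clone_pairs(SAMPLE)
```

Grouping the parsed relations by family_id then gives the members of each clone family; in the sample, family 1039 has two relations that share the fragment 1.739-770.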
It is necessary to translate the output of CCFinder into the lines cloned within each method and the clone family to which those lines belong. Therefore, it is necessary to know the conversion of files to tokens, which is contained in the intermediate files produced by CCFinder. An example of such a file is shown below. Each line in an intermediate file represents a token. The first column indicates the position of the token, the second column indicates its length, and the third column indicates the type of token. The position is described by three hexadecimal numbers separated by periods. The first number indicates the line of the file in which the token is located (starting with line 1), the second number indicates the column in which the token begins (starting with column 1), and the third number indicates the character at which the token starts, counting from the first character of the file and starting with character zero.
6.8.41 +0 (def_block
6.8.41 +5 r_class
6.e.47 +8 id|ListSet
6.16.4f +1 (brace
9.9.88 +7 id|ListSet
9.10.8f +1 (paren
9.11.90 +1 )paren
9.13.92 +1 (brace
a.3.97 +5 id|super
a.8.9c +1 (paren
a.9.9d +1 )paren
a.a.9e +1 ;
...
The lines of code described by the token description above are:
...
6 public class ListSet {
7 private Vector myElements = new Vector();
8
9 public ListSet() {
10 super();
11 }
...
Notice that CCFinder does not translate all the statements in the source code; e.g. none of the elements in line 7 has a token identification. Furthermore, note that modifiers are not taken into account in the translation. The mapping from tokens to methods is done using the lines of code that belong to each method, which are updated after each logical change and stored in the table that stores the changes per method (methodChange). The tokens are stored in the columns startToken and endToken; the rest of the row is filled with the identifier of the method (method_id), the identifier of the logical change (commitTransaction_id), and default or empty values for the remaining columns (see Figure 7-20 in the next section).
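The translation from token ranges to lines, and from lines to methods, can be sketched as follows. Several points are assumptions. The token length is parsed as hexadecimal like the position, although the text only states this for the position. Token indices are taken as 0-based positions in the intermediate file. The line range of each method is supplied as a plain dictionary. The names Token, parse_token_line, lines_of_fragment and methods_of_fragment are hypothetical.

```python
from typing import NamedTuple


class Token(NamedTuple):
    line: int    # 1-based line in the source file
    column: int  # 1-based column in which the token begins
    offset: int  # 0-based character offset from the start of the file
    length: int
    kind: str


def parse_token_line(text):
    """Parse one line of the intermediate file, e.g. '6.e.47 +8 id|ListSet'."""
    position, length, kind = text.split(None, 2)
    line, column, offset = (int(part, 16) for part in position.split("."))
    return Token(line, column, offset, int(length, 16), kind)  # hex length: assumption


def lines_of_fragment(tokens, first_token, last_token):
    """First and last source line covered by an inclusive token range (0-based: assumption)."""
    covered = tokens[first_token:last_token + 1]
    return covered[0].line, covered[-1].line


def methods_of_fragment(line_span, method_lines):
    """Methods whose line range overlaps the span; method_lines is {method_id: (first, last)}."""
    first, last = line_span
    return [m for m, (lo, hi) in method_lines.items() if lo <= last and first <= hi]


SAMPLE = """\
6.8.41 +0 (def_block
6.8.41 +5 r_class
6.e.47 +8 id|ListSet
6.16.4f +1 (brace
9.9.88 +7 id|ListSet
9.10.8f +1 (paren
9.11.90 +1 )paren
9.13.92 +1 (brace
a.3.97 +5 id|super
a.8.9c +1 (paren
a.9.9d +1 )paren
a.a.9e +1 ;
"""

tokens = [parse_token_line(row) for row in SAMPLE.splitlines()]
span = lines_of_fragment(tokens, 4, 11)  # tokens of the ListSet() constructor
# Hypothetical line ranges of two methods after the latest logical change.
owners = methods_of_fragment(span, {42: (9, 11), 43: (13, 20)})
```

The resulting method identifiers are what would be stored next to startToken and endToken in the methodChange table.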
Figure 7-17. Tables that store clone relations between methods, and their corresponding clone families