Categorising Revisions - Software Defect Prediction Using Static Code Metrics: Formulating a Me

7.1 Barcode

7.1.2 Categorising Revisions

Initially each revision will have its change category assigned to one of the following three categories:

• Not A Chosen File Type

• Initial Revision

• Yet To Be Assessed

Table 7.1 shows that 16 of the files comprising Barcode are C source files. It is from the history of these files that function (or module) based fault data can be extracted. Module-level fault data is the most common granularity of fault data reportedly used within the literature. This is probably because module-level predictions, if accurate, would be far more useful in practice than package, file or class level ones (see Chapter 2): there is typically less code to search through and comprehend for each ‘defective’ prediction made, assuming a manual code review is the resulting action taken.

When initialising the Barcode fault database, all files not ending with a ‘.c’ extension were labelled as “Not A Chosen File Type”. This included C header files, as although they may contain faults, they do not contain module implementations. It also included the single Python file; it was decided not to include this file as it is not a main part of the system. Additionally, having to focus on another language as well as C would have increased complexity later on.


By definition, the first/initial revision of a file (revision 1.1 in CVS) cannot include a fault-fix of a previous version of the same file (that exists in the revision control system). Therefore, the initial revision of each Barcode file (as long as it is a ‘chosen file type’) is automatically labelled with the “Initial Revision” category. This category can conceptually be seen to imply the “Definite Non-Fault-Fix” category, as even if the new file contains code intended to rectify faults contained in other files, it cannot contain code intended to rectify faults contained in a previous version of the same file (that exists in the revision control system). Such within-file fault-fixes are of main interest in this study, as they are the most suitable when trying to determine the fix-inducing changes (see Chapter 2). It is worth mentioning that CVS has no native notion of file renaming; therefore, a problem with this method is that a renamed file containing a fault-fix will be missed.

All revisions which had file names ending in ‘.c’ and which were not initial revisions were thus labelled as “Yet To Be Assessed”. It was these revisions that would be manually examined to determine the intended purpose of their modifications. There were 110 such revisions, 31% of all revisions in total.
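The initial labelling rule described above can be sketched in C as follows. This is only an illustrative sketch: the enum and the function names `has_c_extension` and `initial_category` are hypothetical and are not taken from the tooling used in the study, and the check for an initial revision assumes the CVS trunk convention of revision 1.1 mentioned in the text.

```c
#include <assert.h>
#include <string.h>

/* The three initial change categories described in the text. */
typedef enum {
    NOT_A_CHOSEN_FILE_TYPE,
    INITIAL_REVISION,
    YET_TO_BE_ASSESSED
} InitialCategory;

/* Returns 1 if path ends with the two characters ".c", otherwise 0. */
static int has_c_extension(const char *path)
{
    size_t len = strlen(path);
    return len >= 2 && strcmp(path + len - 2, ".c") == 0;
}

/* Hypothetical helper: maps a file path and a CVS revision string
   to one of the three initial categories. */
InitialCategory initial_category(const char *path, const char *revision)
{
    assert(path != NULL && revision != NULL);
    if (!has_c_extension(path))
        return NOT_A_CHOSEN_FILE_TYPE;  /* headers, the Python file, etc. */
    if (strcmp(revision, "1.1") == 0)
        return INITIAL_REVISION;        /* first revision of the file in CVS */
    return YET_TO_BE_ASSESSED;          /* left for manual examination */
}
```

Under this rule, only revisions that reach the final `return` need manual examination, which corresponds to the 110 revisions mentioned above.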

The first five categories shown in Table 7.2 are on an ordinal scale. They describe the degree of membership a revision has with respect to being an intended fault-fix. The first category on this ordinal scale is “Definite Fault-Fix”, the last is “Definite Non-Fault-Fix”, and “Unsure” is the midpoint. The result of manually examining each revision is typically a classification into one of these five categories. If part of the changes in a revision are to a module, and those changes are deemed to be an intended fault-fix, then the revision is labelled as either a “Definite Fault-Fix” or a “Probable Fault-Fix”, depending on the expert’s degree of certainty. Additionally, the module header where the fix is thought to have occurred is recorded in a textual database field (there can be multiple entries if required). All headers not recorded in this field are therefore implied to be in the “Definite Non-Fault-Fix” category for that particular revision.

The final category to be discussed, “Outlier”, is used in circumstances where:

• There are no module (function) changes. Because the module level is the chosen fault granularity level in this study, any revision with no module changes should not be considered with respect to module-based faults. This can occur when, for example, a global variable change is the only modification made in a revision.

• The only change is a syntax-error fix. This occurs when developers commit their code to the server without first checking that it successfully compiles. Syntax errors in compiled languages (such as C) cannot affect end users as they prevent successful compilation. Software defect predictors are not intended to detect such errors as they can be found with far more efficiency by compilers.

100 Chapter 7. Obtaining Fault Data

Figure 7.1: The main screen of the revision-labelling front-end.

When labelling each of the revisions it was often possible to examine not only the source code and corresponding commit message, but also the previously mentioned change log. The change log is well maintained and typically contains concise details on what has been changed, by whom, why, and where. For example:

2000-01-26 Alessandro Rubini <[email protected]>

* code128.c (Barcode_128_encode): new encoding: full-featured code128

Here it is shown that on the 26th of January 2000, Alessandro Rubini committed the ‘code128.c’ source file. The changes made were to the ‘Barcode_128_encode’ function, and the purpose of this modification was to add a new encoding (‘code128’). Such textual notes can be invaluable when trying to determine the intention of modifications, especially for individuals who did not take part in the development process.
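Because change log entries of this kind follow a regular ‘* file (function): description’ layout, their fields can be extracted mechanically. The following C sketch illustrates one way of doing so; the `ChangeLogEntry` structure, its field sizes, and the function name `parse_changelog_entry` are assumptions made for illustration, and no such parser is described in the study, where the change log was read manually.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed change-log entry; the field sizes are arbitrary choices
   made for this sketch. */
typedef struct {
    char file[64];
    char function[64];
    char description[256];
} ChangeLogEntry;

/* Parses a line of the form "* file (function): description".
   Returns 1 on success and 0 if the line does not match that form. */
int parse_changelog_entry(const char *line, ChangeLogEntry *out)
{
    assert(line != NULL && out != NULL);
    /* Width limits keep each field within its buffer. */
    int matched = sscanf(line, " * %63s (%63[^)]): %255[^\n]",
                         out->file, out->function, out->description);
    return matched == 3;
}
```

Applied to the entry shown above, this would yield the file ‘code128.c’, the function ‘Barcode_128_encode’, and the remaining text as the description. Entries that name no function, and the date and author line, are rejected by this simple pattern.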

To simplify the process of manually labelling revisions, I developed a graphical Java front-end to the modified CVSanaly2 MySQL database. This front-end was used as the interface for the entire labelling process. A screenshot of the front-end main screen is shown in Figure 7.1. The figure shows that after selecting a repository name and a transaction (change-set) number, the following information is displayed:


• The transaction committer name.

• The number of transaction-comprising revisions (check-ins).

• The transaction commit message.

• And for each revision: its unique identification number, timestamp, relative filepath, change category, and corresponding note. Only the change category and note fields are editable by the user. The change category options are the same as those in Table 7.2, and are presented to the user via a combo box. The note field is for entering either the suspected fault-fix module header(s) (‘+’ separated list), or the justification for an outlier label (either ‘NO MODULE CHANGES’ or ‘SYNTAX ERROR FIX ONLY’).
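The two uses of the note field could be read back from the database along the following lines. This C sketch is an assumption-laden illustration: the function names `split_note` and `is_outlier_justification` are hypothetical, and it assumes that no whitespace surrounds the ‘+’ separators, which the text does not specify.

```c
#include <assert.h>
#include <string.h>

/* Splits a '+' separated note into module headers.
   The note buffer is modified in place (each separator becomes '\0').
   Returns the number of headers stored in headers[]. */
size_t split_note(char *note, char *headers[], size_t max_headers)
{
    assert(note != NULL && headers != NULL);
    size_t count = 0;
    for (char *tok = strtok(note, "+");
         tok != NULL && count < max_headers;
         tok = strtok(NULL, "+"))
        headers[count++] = tok;
    return count;
}

/* Returns 1 if the note is one of the two outlier justifications
   named in the text, otherwise 0. */
int is_outlier_justification(const char *note)
{
    assert(note != NULL);
    return strcmp(note, "NO MODULE CHANGES") == 0 ||
           strcmp(note, "SYNTAX ERROR FIX ONLY") == 0;
}
```

A caller would first test `is_outlier_justification` and only split the note into headers when that test fails.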

In the table at the bottom of the window (in Figure 7.1), where the details of each revision are displayed, precisely one of the revisions is always highlighted. The highlighted revision indicates which revision the user is interested in when they select the tabs (at the top of the window) to display: the file source, the textual diff against revision n − 1, or the fully annotated source, where each line is labelled with the revision number at which it was last modified.

The process for categorising revisions defined above was carried out twice. The first iteration was a trial run where all of the available categories were utilised. The second iteration was more rigorous than the first, and a more definitive classification was given to each revision. Thus, by the end of the second iteration, only three of the categories from Table 7.2 were used to classify the required source files: “Definite Fault-Fix”, “Definite Non-Fault-Fix” and “Outlier”.

7.1.3 Findings

The process of manually labelling each of the 110 Barcode source revisions typically involved examining the diff of both the source file and the change log (if present in the transaction), as well as the corresponding commit message. This was a lengthy and highly cumbersome process, and there were many failed attempts that resulted in the whole process having to be started afresh. The main problems were: difficulties in determining what was and was not a fault-fix, difficulties in being consistent with labellings across all revisions, and discovering part way through the labelling process that a database structure modification was required. The latter problem is understandable when considering that this was a first attempt at collecting fault data. This highlights the benefit of starting with a small system such as Barcode, as with a larger system having to restart the labelling process several times may have consumed too much time.


The failed labelling attempts, although highly demotivating and time consuming, did have the positive outcome of helping to more clearly define precisely what was considered a fault-fix. By the final labelling iteration a fault-fix was defined as: A module-based non-syntactic modification intended to rectify undesired program behaviour caused by one or more previous versions of the containing module. Note that ‘undesired program behaviour’ in this case is considered from the end user’s perspective. Changes such as replacing the following line:

if (!strlen(characterArray))

with:

if (characterArray[0] == '\0')

to remove the unnecessary function call were not counted as a fault-fix because they should not affect an end user (in the context of the Barcode system). Note that in this context the tiny increase in execution speed and decrease in memory usage should not be a factor. Furthermore, note that the strlen function is a part of the 1989 ANSI C (C89) standard, so compatibility issues should also not be a factor.
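The reason such a change leaves behaviour unaltered can be seen by isolating the two conditions. In the C sketch below, the wrapper functions `is_empty_strlen` and `is_empty_first_char` are hypothetical names introduced for illustration; for any valid NUL-terminated string they return the same result, and they differ only in how much of the string is inspected.

```c
#include <assert.h>
#include <string.h>

/* Original form: scans the whole string to compute its length,
   then tests whether that length is zero. */
int is_empty_strlen(const char *s)
{
    assert(s != NULL);
    return !strlen(s);
}

/* Replacement form: inspects only the first character. */
int is_empty_first_char(const char *s)
{
    assert(s != NULL);
    return s[0] == '\0';
}
```

Since both functions agree on every input, the modification is a refactoring with respect to observable program behaviour, which is the basis for excluding it under the fault-fix definition given above.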

The distribution of file change categories for the 110 manually classified revisions is shown in Figure 7.2. This figure shows that the category for the fault-fixes had the highest number of revisions (50), the category for the non-fault-fixes had the lowest number of revisions (29), and there were a surprisingly large number of outliers (31). Most of these outliers (94%) were due to there being no module changes within the modified file. If it had been known beforehand that such a large number of revisions were to contain no module changes, such revisions would have been assigned their own exclusive category, and a tool for their automated classification would have been developed. This is potential future work.
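As a quick check on these figures, the three counts sum to the 110 classified revisions, and their approximate shares can be computed directly. The count of roughly 29 outliers with no module changes is inferred here from the stated 94% and is not given explicitly in the text.

```latex
\begin{align*}
50 + 29 + 31 &= 110 \\
50 / 110 &\approx 45.5\% && \text{(fault-fixes)} \\
29 / 110 &\approx 26.4\% && \text{(non-fault-fixes)} \\
31 / 110 &\approx 28.2\% && \text{(outliers)} \\
0.94 \times 31 &\approx 29 && \text{(outliers with no module changes, inferred)}
\end{align*}
```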

There were many issues when trying to determine what was and was not considered a fault-fix. To revisit the example just described involving the strlen function, it could be reasonably argued that this was a fault-fix rather than a refactoring, as it resulted in more computationally efficient code. A second example is a revision involving the replacement of the snprintf function with the sprintf function, as the former does not conform to C89. I did not consider this to be a fault-fix as I was only interested in faults that could affect end users, and my definition of an end user is not someone who has to compile the software. It would, however, be entirely reasonable to have a different definition where end users were expected to have to compile the software, especially for open-source systems. A third example is that of revisions involving only minor output-formatting changes. Although I classified these as fault-fixes because they rectified undesired program behaviour, others may believe that such trivial output-formatting issues do not constitute a fault.


Figure 7.2: The distributions of the manually classified revisions.

The potential subjectivity of the fault-fix categorisation process is highlighted further by the findings in [Hall 2010], where three researchers independently labelled the Barcode source revisions and were found to have low inter-rater reliability. These findings suggest that detailed documentation of the labelling process is a prerequisite of reasonable quality fault data, especially if that data is publicly available. Without such documentation (and ideally the originating data sources) it is difficult to have a satisfactory level of confidence in the labels of the data points.

Although the issue of subjectivity was partly mitigated in this study because of the clear fault-fix definition (given previously) and the details of each categorisation being recorded, data set construction ceased because of a lack of domain expertise. I had no prior experience of barcode-label programming and was not especially familiar with such low-level C. This meant that for revisions where the change log was not particularly detailed, it was often very difficult to have confidence in a classification. This leads me to believe that perhaps categorising past revisions should be undertaken only by those who actively develop(ed) a system, or who are particularly knowledgeable in the application domain. Better still, more sophisticated support for documenting the intended purpose of modifications could be integrated into revision control software. This would help automate the process of data set construction, and could potentially increase the accuracy of fault data sets made in future.


7.2 Conclusions

The work described in this chapter showed that constructing accurate software fault data is far more difficult than may be initially perceived. This is especially true for those who did not take part in the development of the system being studied. A major difficulty when constructing fault data is the process of categorising whether or not a past revision contained a fault-fix. Along with the technical difficulties of comprehending the intended purpose of the modifications, there is also the subjectivity as to the definition of a fault. Therefore, to produce accurate and meaningful fault data, strict definitions must be made, documented, and consistently adhered to.

These findings bring back into light the lack of documentation available for the NASA and PROMISE data sets. It may be that this is a much more severe problem than previously thought. The sparse documentation available for the NASA data sets was described in Section 4.1.1. For the PROMISE data sets, there is typically even less documentation, if any is provided at all. For data sets such as the NASA ones where the original data sources (code and revision control archives) are not publicly available, the importance of data set documentation is magnified. This is because there is little opportunity for future data integrity checks. Thus, the worth of the NASA data sets is called into question once more.

Chapter 8 Finalising the Methodology

Contents

8.1 Obtaining Fault Data

8.2 Analysing & Cleansing Fault Data
