

6.1.2 Method Data Cleansing

The NASA data sets are available from the aforementioned NASA MDP and PROMISE repositories. For this study I used the original versions of the data sets from the NASA MDP Repository (see Section 4.1.1). Note, however, that the main issues also apply to the PROMISE versions of these data sets, which are for the most part simply the same data in a different format².

6.1.2.1 Initial Pre-Processing: Binarisation of Class Variable & Removal of Module Identifier and Extra Error-Data Attributes

In order to be suitable for binary classification, the error-count attribute is commonly reported in the literature (see [Menzies 2007b, Lessmann 2008, Elish 2008] for example) as being binarised as follows:

defective? = (error_count ≥ 1).    (6.1)

It is then necessary to remove the ‘unique module identifier’ attribute, as this gives no information toward the defectiveness of a module. Lastly, it is necessary to remove all other error-based attributes, to make the classification task worthwhile. This initial pre-processing is summarised in Figure 6.1. As the NASA data is often reportedly used post this initial pre-processing, an overview of each data set is given in Table 6.1. In this table the number of original recorded values is defined as the number of attributes (features) multiplied by the number of instances (data points). For simplicity missing values are given no special treatment. The number of recorded values metric is used to quantify how much data comprises a data set. These values will be revisited later on to help determine how much of the original data has been removed during data cleansing.

Figure 6.1: Initial pre-processing pseudo-code.

    rmAttributes = [ MODULE, ERROR_DENSITY, ERROR_REPORT_IN_6_MON,
                     ERROR_REPORT_IN_1_YR, ERROR_REPORT_IN_2_YRS ]
    for dataSet in dataSets:
        for rmAttribute in rmAttributes:
            if rmAttribute in dataSet:
                dataSet = dataSet - rmAttribute
        dataSet.binarise(ERROR_COUNT)
        dataSet.rename(ERROR_COUNT, DEFECTIVE)
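The pseudo-code in Figure 6.1 can be sketched as runnable Python. This is only an illustration: it assumes each data set is held in memory as a list of dictionaries mapping attribute names to values, and the function name `initial_preprocess` is hypothetical. The attribute names are those listed in the figure.

```python
# Attributes removed during initial pre-processing (names from Figure 6.1).
RM_ATTRIBUTES = [
    "MODULE",
    "ERROR_DENSITY",
    "ERROR_REPORT_IN_6_MON",
    "ERROR_REPORT_IN_1_YR",
    "ERROR_REPORT_IN_2_YRS",
]


def initial_preprocess(rows):
    """Drop the identifier and extra error attributes, and replace
    ERROR_COUNT with a boolean DEFECTIVE label (Equation 6.1).

    Assumption: `rows` is a list of dicts, one dict per module.
    """
    cleaned = []
    for row in rows:
        new_row = {
            name: value
            for name, value in row.items()
            if name not in RM_ATTRIBUTES and name != "ERROR_COUNT"
        }
        # Binarisation: a module is defective if it has at least one error.
        new_row["DEFECTIVE"] = row["ERROR_COUNT"] >= 1
        cleaned.append(new_row)
    return cleaned


# Hypothetical two-module data set, for illustration only.
example = [
    {"MODULE": 1, "LOC_TOTAL": 12, "ERROR_COUNT": 0, "ERROR_DENSITY": 0.0},
    {"MODULE": 2, "LOC_TOTAL": 40, "ERROR_COUNT": 3, "ERROR_DENSITY": 0.075},
]
preprocessed = initial_preprocess(example)
```

Unlike the pseudo-code, the sketch returns a new list rather than modifying the data set in place, which keeps the original data available for later comparison.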

² Note this is no longer the case; since the publication of [Gray 2011b] updated versions of the

6.1. Data Quality Issues 69

Name  Language  Features  Instances  Recorded Values  % Defective Instances
CM1   C         40        505        20200            10
JM1   C         21        10878      228438           19
KC1   C++       21        2107       44247            15
KC3   Java      40        458        18320            9
KC4   Perl      40        125        5000             49
MC1   C & C++   39        9466       369174           0.7
MC2   C         40        161        6440             32
MW1   C         40        403        16120            8
PC1   C         40        1107       44280            7
PC2   C         40        5589       223560           0.4
PC3   C         40        1563       62520            10
PC4   C         40        1458       58320            12
PC5   C++       39        17186      670254           3

Table 6.1: Details of the NASA data sets post initial pre-processing.

6.1.2.2 Stage 1: Removal of Constant Attributes

A numeric attribute which has a constant/fixed value throughout all instances is easily identifiable as it will have a variance of zero. Such attributes contain no information with which to tell modules apart, and are at best a waste of classifier resources. Each data set had from 0 to 10 percent of its total attributes removed during this stage, with the exception of data set KC4. This data set has 26 constant attributes out of a total of 40, thus 65 percent of available data contains no information with which to train a classifier.

This stage removes data that may be genuine, but in the context of machine learning it is of no use and is therefore discarded. Regarding data set KC4, it appears as though many of the metrics have not been collected; instead of leaving them out of the data set originally however, they were instead included with all values equal to zero.

An additional note regarding data set KC4 is that two of its attributes, ‘essential complexity’ and ‘essential density’, have two unique values each, but in each case one of the values occurs just once. This data may be valid, but after the data is divided into training and testing sets, the training data may contain a constant attribute. This can be problematic for some learning techniques, and is therefore something that researchers should be aware of.
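As an illustration of this stage, the following sketch finds constant attributes and also flags the near-constant case just described for KC4. It assumes the list-of-dictionaries representation, and the function and attribute names are hypothetical. Counting distinct values is used in place of a variance calculation, which gives the same result for numeric attributes.

```python
from collections import Counter


def constant_attributes(rows):
    """Names of attributes that take a single value across all rows."""
    if not rows:
        return []
    return [
        name for name in rows[0]
        if len({row[name] for row in rows}) == 1
    ]


def near_constant_attributes(rows):
    """Names of attributes with exactly two distinct values, one of which
    occurs only once. Such an attribute can become constant in a training
    set after the data is split."""
    if not rows:
        return []
    flagged = []
    for name in rows[0]:
        counts = Counter(row[name] for row in rows)
        if len(counts) == 2 and min(counts.values()) == 1:
            flagged.append(name)
    return flagged


def drop_attributes(rows, names):
    """Return copies of the rows without the named attributes."""
    names = set(names)
    return [
        {name: value for name, value in row.items() if name not in names}
        for row in rows
    ]


# Hypothetical data, for illustration only.
modules = [
    {"LOC_TOTAL": 10, "UNUSED_METRIC": 0, "ESSENTIAL_COMPLEXITY": 1},
    {"LOC_TOTAL": 25, "UNUSED_METRIC": 0, "ESSENTIAL_COMPLEXITY": 1},
    {"LOC_TOTAL": 7, "UNUSED_METRIC": 0, "ESSENTIAL_COMPLEXITY": 4},
]
constant = constant_attributes(modules)
near_constant = near_constant_attributes(modules)
stage1 = drop_attributes(modules, constant)
```

In practice the class label would be excluded from these checks; the sketch treats every key alike for brevity.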

70 Chapter 6. Major Methodological Issues

6.1.2.3 Stage 2: Removal of Repeated Attributes

In addition to constant attributes, repeated attributes occur where two or more attributes have identical values for each instance. Such attributes are therefore fully correlated, which may effectively result in a single attribute being over-represented. Amongst the NASA data sets there are two repeated attributes (post stage 1), namely the ‘number of lines’ and ‘loc total’ attributes in data set KC4. The difference between these two metrics was poorly defined at the NASA MDP Repository. However, they may be identical for this data set as (according to the metrics) there are no modules with any lines either containing comments or which are empty. For this data cleansing stage I removed one of the attributes so that the values were only being represented once. I chose to keep the ‘loc total’ attribute label as this is common to all 13 NASA data sets.

This stage again removes data that may be genuine, because it can be problematic when data mining. It is interesting that data set KC4 has had so much data removed in these first two stages. Table 6.1 shows that KC4 is unique in that it is the only data set based on Perl code. Therefore, it may be that the metrics collection tool (McCabeIQ 7.1) was more limited in the metrics it could collect for this language.
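Repeated attributes can be found by comparing whole columns. The sketch below is illustrative only: it assumes the list-of-dictionaries representation, and the function names, the `prefer` parameter and the attribute spellings are hypothetical.

```python
def repeated_attribute_groups(rows):
    """Group together attribute names whose columns are identical."""
    if not rows:
        return []
    columns = {}
    for name in rows[0]:
        column = tuple(row[name] for row in rows)
        columns.setdefault(column, []).append(name)
    return [names for names in columns.values() if len(names) > 1]


def drop_repeated_attributes(rows, prefer=()):
    """Keep one attribute from each group of identical columns.

    `prefer` lists names to keep when they appear in a group, for example
    a label that is common to all data sets. Otherwise the first name in
    the group is kept.
    """
    drop = set()
    for names in repeated_attribute_groups(rows):
        keep = next((name for name in names if name in prefer), names[0])
        drop.update(name for name in names if name != keep)
    return [
        {name: value for name, value in row.items() if name not in drop}
        for row in rows
    ]


# Hypothetical data, for illustration only.
modules = [
    {"NUMBER_OF_LINES": 10, "LOC_TOTAL": 10, "BRANCH_COUNT": 3},
    {"NUMBER_OF_LINES": 25, "LOC_TOTAL": 25, "BRANCH_COUNT": 3},
]
groups = repeated_attribute_groups(modules)
stage2 = drop_repeated_attributes(modules, prefer=("LOC_TOTAL",))
```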

6.1.2.4 Stage 3: Replacement of Missing Values

Missing values may or may not be problematic for learners depending on the classification method used. However, dealing with missing values within the NASA data sets is very simple. Seven of the data sets contain missing values, but all in the same single attribute: ‘decision density’. This attribute is defined as ‘condition count’ divided by ‘decision count’, and for each missing value both these base attributes have a value of zero. It therefore appears as though missing values have occurred because of a division-by-zero error. In the remaining data set which contains all three of the aforementioned attributes but does not contain missing values, all instances with ‘condition count’ and ‘decision count’ values of zero also have a ‘decision density’ of zero. Because of this I replace all missing values with zero, ensuring consistency between data sets. Note that in [Bezerra 2007] all instances which contained missing values within the NASA data sets were discarded. It is more desirable to cleanse data than to remove it, as the quantity of possible information to learn from will thus be maximised.

This stage adds data via the replacement of missing values, because they are problematic for many learning techniques. Note, however, that some researchers may not wish to carry out this stage, if they are using a learning method that is resilient to missing values (such as naïve Bayes). Additionally, some researchers may wish to exclude derived features (such as ‘decision density’) altogether. There is more discussion on this in Section 6.1.2.7.
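A minimal sketch of the replacement follows. It assumes missing values are represented as `None` and that the attributes are spelled `DECISION_DENSITY`, `CONDITION_COUNT` and `DECISION_COUNT`; both are assumptions made for illustration. As a precaution, it only fills a missing value when both base attributes are zero, so any unexpected missing value stays visible.

```python
def fill_decision_density(rows):
    """Replace a missing DECISION_DENSITY with zero when it appears to stem
    from a division by zero (CONDITION_COUNT and DECISION_COUNT both zero).

    Assumption: missing values are stored as None.
    """
    filled = []
    for row in rows:
        new_row = dict(row)
        if (
            new_row.get("DECISION_DENSITY") is None
            and new_row.get("CONDITION_COUNT") == 0
            and new_row.get("DECISION_COUNT") == 0
        ):
            new_row["DECISION_DENSITY"] = 0
        filled.append(new_row)
    return filled


# Hypothetical data, for illustration only.
modules = [
    {"CONDITION_COUNT": 0, "DECISION_COUNT": 0, "DECISION_DENSITY": None},
    {"CONDITION_COUNT": 4, "DECISION_COUNT": 2, "DECISION_DENSITY": 2.0},
]
stage3 = fill_decision_density(modules)
```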


6.1.2.5 Stage 4: Enforce Integrity with Domain-Specific Expertise

The NASA data sets contain varied quantities of attributes derived from simple equations of other attributes, which are useful for checking data integrity. Additionally, it is possible to use domain-specific expertise to validate data integrity, by searching for theoretically impossible occurrences. The following is a non-exhaustive list of checks that can be carried out for each data point:

• Halstead’s length metric (see [Halstead 1977]) is defined as: ‘number of operators’ + ‘number of operands’.

• Each token that can increment a module’s cyclomatic complexity (see [McCabe 1976]) is counted as an operator according to the original NASA MDP Repository. Therefore, the cyclomatic complexity of a module should not be greater than the number of operators + 1. Note that the minimum cyclomatic complexity is 1.

• The number of function calls within a module is recorded by the ‘call pairs’ metric. A function call operator is counted as an operator according to the original NASA MDP Repository, therefore the number of function calls should not exceed the number of operators.

These three simple rules are a good starting point for removing noise in the NASA data sets. Any data point which does not pass all of the checks contains noise. Because the original NASA software systems/subsystems from which the metrics are derived are not publicly available, it is impossible for us to investigate this issue of noise further. The most viable option is therefore to discard each offending instance. Note that a prerequisite of each check is that the data set must contain all of the relevant attributes (post stage 1). Six of the data sets had data removed during this stage, between 1 and 12 percent of their data points in total.
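The three checks listed above can be sketched as follows. The attribute spellings (`HALSTEAD_LENGTH`, `NUM_OPERATORS`, `NUM_OPERANDS`, `CYCLOMATIC_COMPLEXITY`, `CALL_PAIRS`) and the function names are assumptions made for illustration. Each check is skipped when a data set lacks one of the attributes it needs, in line with the prerequisite noted above.

```python
# Each entry pairs the attributes a check requires with the check itself.
INTEGRITY_CHECKS = [
    # Halstead length = number of operators + number of operands.
    (
        ("HALSTEAD_LENGTH", "NUM_OPERATORS", "NUM_OPERANDS"),
        lambda r: r["HALSTEAD_LENGTH"] == r["NUM_OPERATORS"] + r["NUM_OPERANDS"],
    ),
    # Cyclomatic complexity must not exceed number of operators + 1.
    (
        ("CYCLOMATIC_COMPLEXITY", "NUM_OPERATORS"),
        lambda r: r["CYCLOMATIC_COMPLEXITY"] <= r["NUM_OPERATORS"] + 1,
    ),
    # Function calls must not exceed the number of operators.
    (
        ("CALL_PAIRS", "NUM_OPERATORS"),
        lambda r: r["CALL_PAIRS"] <= r["NUM_OPERATORS"],
    ),
]


def passes_integrity_checks(row):
    """True if the row passes every check whose attributes are present."""
    for required, check in INTEGRITY_CHECKS:
        if all(name in row for name in required) and not check(row):
            return False
    return True


def remove_noisy_instances(rows):
    """Discard every instance that fails at least one applicable check."""
    return [row for row in rows if passes_integrity_checks(row)]


# Hypothetical modules, for illustration only.
valid = {
    "HALSTEAD_LENGTH": 10, "NUM_OPERATORS": 6, "NUM_OPERANDS": 4,
    "CYCLOMATIC_COMPLEXITY": 3, "CALL_PAIRS": 2,
}
bad_length = dict(valid, HALSTEAD_LENGTH=11)
bad_complexity = dict(valid, CYCLOMATIC_COMPLEXITY=8)
bad_calls = dict(valid, CALL_PAIRS=7)
stage4 = remove_noisy_instances([valid, bad_length, bad_complexity, bad_calls])
```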

During this stage it is possible to not only remove noise (inaccurate/incorrect data), but also problematic data. A module which (reportedly) contains no lines of code and no operands or operators should be an empty module containing no code. Should such a data point be discarded? As it is impossible for us to check the validity of the metrics against the original code, this is a grey area. An empty module may still be a valid part of a system; it may just be a question of time before it is implemented. Furthermore, a module missing an implementation may still have been called by an unaware programmer (one who does not know of the missing implementation). As the module is unlikely to have carried out the task its name implies, it may also have been reported to be faulty. Despite this, researchers need to decide for themselves what to do with data that cannot be proved to be noisy, but is nonetheless strange. For example, the original data set MC1 (according to the metrics) contains 4841 modules (51% of modules in total) with no lines of code. I feel that it would therefore not be unreasonable to remove such data points, or even reject the entire data set altogether.


6.1.2.6 Stage 5: Removal of Repeated and Inconsistent Instances

The most severe issue when using the NASA data sets for classification experiments is that of repeated data points. Unfortunately, this issue is often ignored in the defect prediction literature. Repeated, redundant, or duplicate data points are data points (or instances) that appear more than once within a data set. They are either noise, most probably caused by a faulty data collection process, or, if they are genuine, they occur (in this domain) when many modules have the same values for all measured metrics; for example, when they have the same number of: lines of code, lines of comments, blank lines, operands, operators, unique operands, unique operators, function calls, and so on. Additionally, these modules have also been assigned the same class label referring to whether they are or are not ‘defective’. This situation is clearly possible in the real world; for example, in an object-oriented system, there may be many simple accessor and mutator methods that share identical metrics and have not been reported as faulty.

However, such data points may be problematic in the context of machine learning, where it is imperative that classifiers are tested upon data points independent from those used during training [Witten 2005]. The issue is that when data sets containing repeated data points are split into training and testing sets (for example by an x% training, (100−x)% testing split, or n-fold cross-validation), it is possible for there to be instances common to both sets. With test data included in the training data, the learning task is either simplified or reduced entirely to a task of recollection. Ultimately however, if the experiment is intended to show how well a classifier could generalise upon future, unseen data points, the results will be erroneous as the experiment is invalid. This is because the assumption of unseen data has been violated, due to the test data being contaminated with training data.
Note that because of the closed-source nature of the NASA data sets, it is impossible to know whether the repeated data points are genuine or are noise.

Inconsistent (or conflicting) instances are another issue, and are very similar to repeated instances in that both occur when the same feature vectors describe multiple modules. The difference between repeated and inconsistent instances is that with the latter, the class labels differ, thus (in this domain) the same metrics would describe both a ‘defective’ and a ‘non-defective’ module. This is again possible in the real world, and while not as serious an issue as the repeated instances (in the case of the NASA data sets), inconsistent data points can be problematic during binary classification tasks. When building a classifier which outputs a predicted class set membership of either ‘defective’ or ‘non-defective’, it is illogical to train such a classifier with data instructing that the same set of features is resultant in both classes. I focus more on repeated data points than inconsistent ones in this study, as for most data sets the proportion of repeated instances is considerably larger. Note, however, that it is possible for a data point to be both repeated and inconsistent.
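The test set contamination caused by repeated data points can be quantified for any given split. The following sketch is illustrative only: it assumes each instance is a hashable tuple of its metric values followed by its class label, and the function name `seen_fraction` is hypothetical.

```python
def seen_fraction(train, test):
    """Fraction of test instances that also occur in the training set.

    Assumption: each instance is a tuple of metric values plus class label.
    """
    if not test:
        return 0.0
    seen = set(train)
    return sum(1 for instance in test if instance in seen) / len(test)


# Hypothetical split, for illustration only: (loc, operators, defective).
train = [(3, 5, False), (40, 61, True), (3, 5, False)]
test = [(3, 5, False), (12, 20, False)]
contamination = seen_fraction(train, test)
```

Here one of the two test instances is identical to a training instance, so half of the test set has already been seen during training.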


Adding all data points into a mathematical set is the simplest way of guaranteeing that each one is unique. This ensures that classifiers will be tested on unseen data, regardless of how the data is divided. From here it is possible to remove all inconsistent pairs of modules, to ensure that all feature vectors (data points irrespective of class label) are unique. The proportion of instances removed from each data set during this stage is shown in Figure 6.2. All data sets had instances removed during this stage, and in some cases the proportion removed was very large (90, 79 and 75 percent for data sets PC5, MC1 and PC2, respectively). Note that for most data sets the proportion of inconsistent instances removed was negligible. This is partly due to the methodology of removing all repeated instances first, and then inconsistent pairs second, as some of the inconsistent instances are also repeated ones. Looking at Figure 6.2, it appears highly unlikely that all of the repeated data points are genuine; for example, I find it extremely difficult to believe that more than 60% of modules within a large software system would share the same number of: lines of code, lines of comments, blank lines, operands, operators, unique operands, unique operators, function calls, and so on.
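The two steps of this stage, repeated instances first and inconsistent pairs second, can be sketched as below. The sketch assumes each instance is a pair of a feature tuple and a class label, and the function names are hypothetical. An order-preserving dictionary stands in for the mathematical set so that the output order is reproducible.

```python
from collections import defaultdict


def remove_repeated(instances):
    """Keep the first occurrence of each (features, label) pair."""
    return list(dict.fromkeys(instances))


def remove_inconsistent(instances):
    """Drop every instance whose feature vector occurs with more than one
    class label, so that all remaining feature vectors are unique once
    repeated instances have been removed."""
    labels = defaultdict(set)
    for features, label in instances:
        labels[features].add(label)
    return [
        (features, label)
        for features, label in instances
        if len(labels[features]) == 1
    ]


# Hypothetical instances, for illustration only: ((loc, operators), defective).
data = [
    ((3, 5), False),
    ((3, 5), False),    # repeated
    ((9, 14), True),
    ((9, 14), False),   # inconsistent with the previous instance
    ((40, 61), True),
]
unique = remove_repeated(data)
stage5 = remove_inconsistent(unique)
```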


6.1.2.7 Other Issues

The most well-known issue regarding use of the NASA data sets in classification experiments is that of the varied levels of class imbalance (see Table 6.1). The table shows that data set KC4 has an almost balanced class distribution, whereas data set PC2 has only 0.4% of data points belonging to the minority class. This is an issue that researchers should be aware of. Learning from imbalanced data is an active area of research within the data mining community; I therefore refer readers to standard texts [Witten 2005, Chawla 2004, He 2009, Batista 2004]. Note, however, that defect prediction researchers need to be very careful in the way they assess the performance of their classifiers when using highly imbalanced data (see [Davis 2006, Zhang 2007, Gray 2011a] and Section 6.2).
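A small worked example shows why care is needed when assessing performance on highly imbalanced data. The sketch below uses a hypothetical data set of 1000 modules with the same 0.4% minority proportion as PC2, and scores a trivial classifier that labels every module ‘non-defective’. The function name is hypothetical.

```python
def majority_baseline(n_instances, n_defective):
    """Accuracy and recall of a classifier that labels every module
    'non-defective', regardless of its metrics."""
    true_negatives = n_instances - n_defective
    accuracy = true_negatives / n_instances
    recall = 0.0  # no defective module is ever identified
    return accuracy, recall


# Hypothetical data set with 0.4% defective modules.
accuracy, recall = majority_baseline(n_instances=1000, n_defective=4)
print(f"accuracy = {accuracy:.3f}, recall = {recall:.1f}")
# prints "accuracy = 0.996, recall = 0.0"
```

The accuracy is very high even though the classifier finds no defective modules at all, so accuracy alone says little about defect prediction performance on such data.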

Another issue is that, as mentioned previously, there are attributes within the NASA data sets that are simple equations of other attributes. While useful for checking data integrity, they can be problematic (or simply a waste of computational resources) depending on the learning technique used. For example, support vector machines utilising a Gaussian radial basis kernel will typically not benefit from the inclusion of such attributes, as they will be implicitly calculated. Additionally, other highly correlated attributes can be found within the data sets, which are known to harm classification performance with many learning techniques [Hall 1999, Howley 2006]. Therefore, in some contexts, researchers may wish to address these issues. This usually involves removing attributes during pre-processing and/or utilising a feature selection technique on the training data.

6.1.3 Findings

Figure 6.3 shows the proportion of recorded values removed from the 13 NASA data sets (after basic pre-processing, see Table 6.1) post the 5 stage data cleansing process just defined. Stages 1 and 2 of this process can remove attributes (features), stage 3 can replace values, and stages 4 and 5 can remove instances (data points). This was the motivation to use the number of recorded values (attributes × instances) metric, as it takes both attributes and instances into account. Figure 6.3 shows that between 6 and 90 percent of recorded values were removed from each data set in total. Of the data cleansing processes with the potential to reduce the quantity of recorded values, it is the removal of repeated instances that is, by far, responsible for the largest average proportion of data removed (see Figure 6.2). This raises the following questions: Is the complete removal of such instances really necessary? Why are there so many repeated data points and how can they be avoided in future? What proportion of seen data points could end up in testing sets when this data is used in classification experiments? What effect could having such quantities of seen data points in testing sets have on classifier performance? Each of these questions is addressed in the sections that follow.
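The recorded values metric makes the proportion removed straightforward to compute. The sketch below is illustrative: the function names are hypothetical and the before/after sizes are invented rather than taken from any NASA data set.

```python
def recorded_values(n_attributes, n_instances):
    """Number of recorded values: attributes multiplied by instances."""
    return n_attributes * n_instances


def proportion_removed(before, after):
    """Proportion of recorded values removed by data cleansing.

    `before` and `after` are (n_attributes, n_instances) pairs.
    """
    original = recorded_values(*before)
    remaining = recorded_values(*after)
    return 1 - remaining / original


# Hypothetical data set: 40 attributes and 500 instances before cleansing,
# 36 attributes and 400 instances afterwards.
removed = proportion_removed(before=(40, 500), after=(36, 400))
```

In this example 20000 recorded values are reduced to 14400, so 28 percent of the recorded values have been removed, even though only 10 percent of attributes and 20 percent of instances were dropped.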


Figure 6.3: The proportion of recorded values removed during data cleansing.

6.1.3.1 Is the complete removal of repeated instances really necessary?

Removing all repeated instances during initial pre-processing is a simple way to prevent the problems they can cause. The most serious of these problems is test set contamination, the potential effects of which will be discussed in detail in Section 6.1.3.4. An additional issue with repeated data points, separate to the problem of test set contamination, may occur as a result of model construction (training and optimisation). Training a model on data containing small proportions of repeated data points is typically non-problematic. For example, a simple oversampling technique is to duplicate minority class data points in the training set(s). Using training data which contains repeated data points is reasonable, as long as training and testing sets share no common instances. Note, however, that the issue with excessive oversampling, namely overfitting, may also occur here. Overfitting can be identified when a model obtains good training performance, but poor performance on unseen test data [He 2009]. It is also possible to cause overfitting by optimising model parameters using a validation set (a withheld subset of the training set, see Section 3.5) containing duplicate data points. It is for this reason that [Kołcz 2003] recommend “tuning a trained classifier with a duplicate-free sample”. Note that this also applies when optimising using multiple validation sets, for example via n-fold cross-validation. It is also worth noting that feature selection techniques can be negatively affected by duplicates [Kołcz 2003].
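If repeated data points are retained, one way to ensure that training and testing sets share no common instances is to assign all copies of an instance to the same fold. The sketch below illustrates this idea only; it is not the cleansing procedure used in this study, the function name `grouped_folds` is hypothetical, and instances are assumed to be hashable feature tuples.

```python
import random
from collections import defaultdict


def grouped_folds(instances, n_folds, seed=0):
    """Assign instance indices to folds so that identical instances always
    share a fold, meaning no test fold contains an instance that also
    appears in the corresponding training folds.

    Assumption: each instance is a hashable tuple of feature values.
    """
    groups = defaultdict(list)
    for index, instance in enumerate(instances):
        groups[instance].append(index)

    keys = list(groups)
    random.Random(seed).shuffle(keys)  # seeded for reproducibility

    folds = [[] for _ in range(n_folds)]
    for position, key in enumerate(keys):
        folds[position % n_folds].extend(groups[key])
    return folds


# Hypothetical feature vectors containing two pairs of duplicates.
instances = [(3, 5), (3, 5), (9, 14), (40, 61), (40, 61), (7, 8)]
folds = grouped_folds(instances, n_folds=3)
```

Because whole groups are assigned round-robin, fold sizes can become uneven when some instances are repeated many times, which is one reason removing the repeats outright remains the simpler option.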


Following the data cleansing process to remove the repeated instances, researchers will be able to use off-the-shelf data mining tools (such as Weka) to carry out exper-