6.2.1 Description of the System and Data

Case Study 1 (see Chapter 4) investigated the defect history of a large legacy system of over 20 million SLOC and over 20 years of age (see Section 4.3.1). This case study (Case Study 2) focuses on a core subsystem of that system, which has over 1.5 million SLOC and is over 13 years old. In particular, this study investigates the defect-fix history (defect records and change logs) of three major, successive releases of this (sub)system.

This system contains 10 components (labeled C0–C9), and each component is implemented by a number of code files (mainly written in C). The system’s size increased by 14% from release 1 (about 1.6 million SLOC) to release 2 and then by 7% to release 3. The size growth in releases 2 and 3 was mainly due to enhancements; subsequently, release 3 was restructured in order to improve the system structure. All three releases are still under active maintenance.

We focused on the defect-fix history (defect records and change logs) of these three releases, which contain approximately 1100, 550 and 600 defect records and span approximately six, five and three years respectively. Each defect-fix record includes key information pertaining to: the release, phase and component in which the defect was discovered; the state that the defect is currently in (e.g., “working”, “validated”, “closed”, etc.); the reference that indicates defect rediscovery (i.e., associates a defect with its previous occurrence); the date on which the defect was submitted to the defect-tracking database; and the file(s) and component(s) that were changed in order to fix the defect.

Table 6.1 shows four example defect-fix records. For example, defect 0020 (see column “ID”) was discovered in component C1 (see column “Component”) while “testing” (see column “Phase”) release “r1” (see column “Release”), and this defect was fixed in two code files “C1/.../foo1.C” and “C2/.../foo2.C” in components C1 and C2 (see column “Component*”) respectively.

Table 6.1: Example defect-fix records (only key fields).

ID    Release  Phase    Component  File           Component*
0020  r1       testing  C1         C1/.../foo1.C  C1
0020  r1       testing  C1         C2/.../foo2.C  C2
0021  r1       field    C3         C3/.../foo3.C  C3
0021  r1       field    C3         C4/.../foo4.C  C4

Note that the values in columns “Component” and “Component*” can differ. For example, Table 6.1 shows that defect 0020 was discovered in component C1 (the “Component” value) but was fixed in components C1 and C2 (the “Component*” values). Likewise, defect 0021 was discovered in component C3 but was fixed in components C3 and C4. Also note that the “Component*” field does not exist in the collected, raw defect-fix dataset; it was inserted later, during the data cleaning process described below.
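The record layout of Table 6.1 (one row per changed file, so a defect fixed in several files appears several times) can be illustrated with a minimal sketch. The tuple layout and the function name `fixed_components` are assumptions made for illustration only; the source does not describe how the records were actually stored.

```python
from collections import defaultdict

# Rows of Table 6.1: (ID, Release, Phase, Component, File, Component*).
# File paths are kept elided exactly as they appear in the table.
RECORDS = [
    ("0020", "r1", "testing", "C1", "C1/.../foo1.C", "C1"),
    ("0020", "r1", "testing", "C1", "C2/.../foo2.C", "C2"),
    ("0021", "r1", "field", "C3", "C3/.../foo3.C", "C3"),
    ("0021", "r1", "field", "C3", "C4/.../foo4.C", "C4"),
]


def fixed_components(records):
    """Map each defect ID to (discovering component, set of fixed components)."""
    found_in = {}
    fixed_in = defaultdict(set)
    for defect_id, _release, _phase, component, _file, component_star in records:
        found_in[defect_id] = component          # the "Component" value
        fixed_in[defect_id].add(component_star)  # the "Component*" values
    return {d: (found_in[d], fixed_in[d]) for d in found_in}


summary = fixed_components(RECORDS)
for defect_id, (found, fixed) in sorted(summary.items()):
    print(defect_id, "found in", found, "fixed in", sorted(fixed))
```

Grouping by defect ID makes the distinction between the two columns explicit: the discovering component is a single value per defect, whereas the fixed components form a set.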

6.2.2 Data Collection and Clean-up Procedures

The collection process for the defect records in this case study is similar to that for the defect dataset investigated in Case Study 1. We do not repeat this process here; see Section 4.3.2 for details. The collection process for the change logs is described below. Changes were made to the code base in order to fix each newly recorded defect in the defect-tracking database, and these changes were logged in the version control system. We extracted the change logs from the version control system.

Next, we describe the five main steps used to clean up the collected, raw defect-fix dataset. Some of these steps are similar to those for cleaning up the defect dataset in Case Study 1 (see Section 4.3.2). Note that these steps were carried out mainly with programming scripts.

Step 1: we removed defects that are rediscoveries or that are not closed or validated in the system. This was carried out based on the “state” field of the defect records. In particular, defect records whose state is not “closed”, “integrated”, “delivered” or “validated” were excluded from the dataset.

Step 2: we removed change logs where changes were made in non-code files (e.g., documentation files). Here the code files are mainly .c files (in the language C). This was carried out based on the “file” field of the change logs.

Step 3: we filled in the “Component*” value for each defect-fix record (see examples in Table 6.1). For each defect record, the “Component*” value is a component name indicating a component that was changed in order to fix this defect. This information is not recorded automatically in the defect-fix database. A simple text analysis technique was used to identify the component name from the “File” field and copy that component name to the “Component*” field. For example, it identified the component name “C1” from the “File” field value “C1/.../foo1.C”, so the corresponding “Component*” value is “C1”.

Step 4: we identified the Internal and Field phases for defect records. Note that the internal phase subsumes functional and system testing and performance quality assurance phases. Defect-fix records from other phases (e.g., development¹) were removed. This step was carried out based on the “phase” field of the defect records (see examples in Table 6.1).

Step 5: we removed “outliers” from the dataset. For example, we found one defect that required fixes in 130 code files, whereas the other defects required fixes in approximately 2.2 code files on average (and at most 80 code files; see Figure 6.2). This defect was treated as an outlier and was thus excluded from the analysis.

¹ The reason we removed defect-fix records made during the development phases (e.g., design and coding) is that developers could have fixed several defects but recorded them together as one defect, and they could also have made changes in the code base that were recorded as fixes to a defect but did not really fix that defect.
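The five clean-up steps above can be sketched as a single filtering pass. This is only an illustration under stated assumptions, not the scripts used in the study: the dictionary keys (`id`, `state`, `phase`, `file`, `rediscovery_of`), the raw phase labels in `PHASE_MAP`, the restriction to the `.c` extension, and the cutoff of 100 files are all hypothetical. The source states only that a defect with 130 changed files was excluded while the others changed at most 80.

```python
import os
from collections import Counter

ACCEPTED_STATES = {"closed", "integrated", "delivered", "validated"}  # Step 1
CODE_EXTENSIONS = {".c"}  # Step 2; assumption: only .c counts as code
PHASE_MAP = {"testing": "internal", "field": "field"}  # Step 4; assumed labels
MAX_FILES_PER_DEFECT = 100  # Step 5; assumed cutoff between 80 and 130


def clean(records):
    """Apply Steps 1-5 to raw records (one dict per defect and changed file)."""
    kept = []
    for rec in records:
        # Step 1: drop rediscoveries and defects in a non-accepted state.
        if rec.get("rediscovery_of") or rec["state"] not in ACCEPTED_STATES:
            continue
        # Step 2: drop changes to non-code files (extension check, any case).
        if os.path.splitext(rec["file"])[1].lower() not in CODE_EXTENSIONS:
            continue
        # Step 4: keep only Internal and Field phases.
        phase = PHASE_MAP.get(rec["phase"])
        if phase is None:
            continue
        # Step 3: Component* is the leading segment of the file path.
        component_star = rec["file"].split("/", 1)[0]
        kept.append({**rec, "phase": phase, "component_star": component_star})
    # Step 5: drop defects that changed an unusually large number of files.
    # Each kept row is one changed file, so rows per ID equal files per defect.
    files_per_defect = Counter(r["id"] for r in kept)
    return [r for r in kept if files_per_defect[r["id"]] <= MAX_FILES_PER_DEFECT]


# Hypothetical raw records; IDs 0022 and 0023 and notes.txt are invented.
base = {"state": "closed", "phase": "testing", "rediscovery_of": None}
raw = [
    {**base, "id": "0020", "file": "C1/.../foo1.C"},
    {**base, "id": "0020", "file": "C2/.../foo2.C"},
    {**base, "id": "0020", "file": "C1/.../notes.txt"},       # Step 2
    {**base, "id": "0022", "file": "C5/.../foo5.C", "state": "working"},  # Step 1
    {**base, "id": "0023", "file": "C6/.../foo6.C", "phase": "development"},  # Step 4
]
cleaned = clean(raw)
print([(r["id"], r["component_star"], r["phase"]) for r in cleaned])
```

The outlier filter runs last in this sketch so that file counts are computed only over rows that survived the other steps; the source does not say whether the original scripts counted files before or after the other filters.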

6.2.3 Data Analysis Procedures

We then wrote programming scripts to analyze the defect-fix records. Statistical methods such as the Pearson correlation coefficient (or Pearson-value) and Spearman’s rank correlation coefficient (or Spearman-value)² were also used to evaluate correlation and persistence for the component and fix-relationship measures.
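To make the relation between the two coefficients concrete, the following minimal sketch computes the Spearman-value as the Pearson-value of the ranked data. It is written in plain Python for illustration and is not the implementation used in the study; the function names and the sample values (imagined per-component measures in two releases) are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical measure values for five components in two releases.
release_a = [10, 20, 30, 40, 50]
release_b = [1, 4, 9, 16, 200]
print("Pearson: ", pearson(release_a, release_b))
print("Spearman:", spearman(release_a, release_b))
```

In this example the two sequences order the components identically, so the Spearman-value is at its maximum even though the Pearson-value is lowered by the one very large value. This is why a rank-based measure is a natural choice for judging whether a ranking of components persists across releases.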

The data analysis procedures were carried out based on MCDs identified from the defect-fix dataset. In particular, we identified MCDs based on the “Component*” values (rather than the “Component” values). For example, Table 6.1 indicates that defects 0020 and 0021 are two MCDs. Once an MCD is identified, the fix relationships among its components follow directly. For example, there is a fix relationship between components C1 and C2 because of MCD 0020, and likewise between components C3 and C4 because of MCD 0021. Note that, in this case study, we only identified binary fix relationships in the system. The reason is that the majority of MCDs span only two components; see Figure 6.1 (in Section 6.2.4 below) for details. Fix relationships spanning more than two components were decomposed into binary form and then used in the study.
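A minimal sketch of this identification is given below. It assumes that a defect counts as an MCD when its “Component*” values name at least two distinct components, and that a relationship spanning more than two components is decomposed into all unordered component pairs; the source does not spell out the exact decomposition rule. The function names and defects 9001 and 9002 are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def find_mcds(records):
    """Return {defect_id: sorted components} for defects fixed in 2+ components."""
    components = defaultdict(set)
    for defect_id, component_star in records:
        components[defect_id].add(component_star)
    return {d: sorted(c) for d, c in components.items() if len(c) >= 2}


def binary_fix_relationships(mcds):
    """Count, per unordered component pair, the MCDs whose fix spans both."""
    counts = defaultdict(int)
    for comps in mcds.values():
        # Assumption: an n-component fix yields every pair among the n.
        for pair in combinations(comps, 2):
            counts[pair] += 1
    return dict(counts)


# (ID, Component*) pairs: the rows of Table 6.1 plus two invented defects.
records = [
    ("0020", "C1"), ("0020", "C2"),
    ("0021", "C3"), ("0021", "C4"),
    ("9001", "C1"), ("9001", "C2"), ("9001", "C5"),  # hypothetical, 3 components
    ("9002", "C7"), ("9002", "C7"),                  # hypothetical, not an MCD
]
mcds = find_mcds(records)
relationships = binary_fix_relationships(mcds)
for pair, count in sorted(relationships.items()):
    print(pair, count)
```

Sorting the components before pairing keeps each pair in a canonical order, so (C1, C2) and (C2, C1) are counted as the same fix relationship.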

Note that, in Case Study 2, the method of MCD identification is different from that for Case Study 1. The former is based on matching change logs to defect records (mainly, “Component*” field information); the latter is based only on defect records (by identifying parent-children relationships – Section 4.2). See a detailed comparison of these two methods in Section 7.1.

The data analysis procedures of this case study were centered on the MCDs in the system. They are incorporated into the case study design; see Section 6.2.5 for details. Before we describe this study design, we first give some descriptive statistics about the defects in the subject system, below.

² Like the Pearson correlation coefficient (or Pearson-value), Spearman’s rank correlation coefficient (or Spearman-value) is a correlation measure. It can be regarded as the Pearson correlation coefficient between two ranked data arrays, and it ranges from -1 to +1.
