Metrics and Time - Software Measurement for Functional Programming

Large software projects are often developed and maintained over many years. As such software ages it can become difficult to maintain, develop, and debug. The history of the development of such a system, particularly the change history as might be contained in a revision control system such as CVS [35], can provide interesting data about the program which may be used with metrics in several ways. The remainder of this section is divided into the following parts.

• Section 2.5.1 examines work that measures attributes of the change history of a program.

• Section 2.5.2 shows how the change history of a program may be used to validate a metric.

• Section 2.5.3 summarises this section.

2.5.1 Time as a metric

In [40], Graves and his co-workers explored using the history of a program’s development to predict where a program is likely to become unmanageable. They found that the change history contained more useful information than could be obtained from a single snapshot of the program. For their work they compared the following measures:

• Number of Past Faults. This method predicts the number of faults to be found in a module in the future using a constant multiple of the number of faults found over a past period of time. This provided a reasonably accurate prediction of the number of faults and proved to be difficult to improve upon.

• Number of Deltas. Using the number of changes to a module over its entire history provided a better prediction of the number of faults to be found than measures generated from a single snapshot of the program, such as lines of code. This measure may be related to the Number of Past Faults metric because there are likely to be changes, or deltas, after a fault has been discovered. Likewise, if there are a large number of deltas then there may be an increased probability of faults being introduced.

• Average Code Age. Combining the average age of code within a module with the number of deltas produced a measure which increased the accuracy of the number of deltas method.

• Weighted Time Damp. This method computes the fault potential of a module by adding contributions from each change made to a module such that the larger or more recent a change is, the greater the contribution, with recent large changes contributing the most. This method also incorporates a damping mechanism to avoid transient events such as a single large change from skewing the result.
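The difference between counting deltas and weighting them can be illustrated with a minimal sketch. The record layout, the exponential decay with age, and the logarithm used to damp the effect of a single large change are all assumptions made for illustration; they are not the formulas used in [40].

```python
import math
from dataclasses import dataclass


@dataclass
class Delta:
    """One recorded change to a module (hypothetical record layout)."""
    age_years: float    # time elapsed since the change was made
    lines_changed: int  # size of the change


def number_of_deltas(deltas):
    """Number of Deltas: count every change in the module's history."""
    return len(deltas)


def weighted_time_damp(deltas, decay_rate=1.0):
    """Sum one contribution per change; larger and more recent changes count more.

    Assumption: size enters through log(1 + lines_changed), which damps a
    single very large change, and age enters through
    exp(-decay_rate * age_years). Neither form is taken from [40].
    """
    return sum(
        math.log1p(d.lines_changed) * math.exp(-decay_rate * d.age_years)
        for d in deltas
    )


# Two hypothetical modules with the same number and size of changes,
# one changed recently and one changed years ago.
recent = [Delta(age_years=0.1, lines_changed=200), Delta(age_years=0.2, lines_changed=150)]
old = [Delta(age_years=4.0, lines_changed=200), Delta(age_years=5.0, lines_changed=150)]

print(number_of_deltas(recent), number_of_deltas(old))
print(weighted_time_damp(recent) > weighted_time_damp(old))
```

Both modules have the same delta count, so the Number of Deltas measure cannot separate them, whereas the weighted sum ranks the recently changed module higher.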

Of all these methods, the weighted time damp metric provided the most accurate prediction of the number of faults likely to occur in the future. For their experiment a 1.5 million line subsystem of a telephone switching system was used. The metrics described above, along with other complexity and size metrics such as Lines Of Code, were used to predict the number of faults in the system and these results were compared with the actual fault occurrences. The Number of Past Faults metric was used as a benchmark against which the other methods could be compared. From these experiments they found that the change history provides much more useful and accurate predictions than simple metrics that are applied to a single snapshot of the program. Particularly good correlation with the number of faults was obtained when the data from the change history was combined with “snapshot” metrics such as Lines Of Code. An example of such a metric is the Weighted Time Damp measure.

2.5.2 Using time to validate metrics

Barnes and Hopkins [10] describe how they applied simple metrics such as pathcount to a software library written in Fortran and compared the results with the bug-fixing changes appearing in the change history of the library. They found a high correlation between routines which required post-release maintenance and routines which exhibited a pathcount value in excess of 10^5. Their results showed that 41% of all the bug fixes occurred in routines with a pathcount value in excess of 10^5, while those routines accounted for only 16% of the total number of routines in the library. They therefore calculated that routines with a pathcount value greater than 10^5 were six times more likely to contain a bug than routines with a pathcount value of less than 10^5.

One hypothesis that could account for such a result is that if bugs are distributed randomly throughout the program code, and routines with larger pathcount values also have a larger number of lines of code, then those larger routines would be statistically more likely to contain bugs.
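This hypothesis can be made concrete with a small calculation. The sketch below assumes that every line of code is equally likely to contain a bug, and uses invented routine sizes and an invented high-pathcount flag; none of the figures come from Barnes and Hopkins' data.

```python
def expected_bug_share(routine_sizes, is_flagged):
    """Expected fraction of bugs falling in flagged routines,
    assuming every line is equally likely to hold a bug."""
    total_lines = sum(routine_sizes)
    flagged_lines = sum(
        size for size, flagged in zip(routine_sizes, is_flagged) if flagged
    )
    return flagged_lines / total_lines


# Hypothetical library: five routines, sizes in lines of code.
sizes = [20, 40, 60, 80, 300]
# Hypothetical flag: True for the routine with the highest pathcount,
# which here is also the largest routine.
high_pathcount = [False, False, False, False, True]

routine_share = sum(high_pathcount) / len(sizes)       # 1 of 5 routines
bug_share = expected_bug_share(sizes, high_pathcount)  # 300 of 500 lines

print(f"share of routines: {routine_share:.2f}")
print(f"expected share of bugs: {bug_share:.2f}")
```

In this invented example one routine in five holds 300 of the 500 lines, so under random placement it would be expected to attract 60% of the bugs without pathcount playing any causal role.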

This hypothesis depends on there being a correlation between the size of a routine and its pathcount. To perform a quick test to see if there is a correlation, we plotted a graph of pathcount values against routine size, which is shown in Figure 5.

This graph shows that there is a trend for the pathcount values to increase with the routine size. Further analysis showed a statistically significant correlation of 0.4334 between the pathcount values and the size of a routine, measured in lines of code. However, this correlation is quite low, so it is still not clear if Barnes and Hopkins’ results could be caused by random placement of bugs. When considering the pathcount metric in more detail it seems clear that as a pathcount value increases, the number of lines of code must also increase because there is a limit to the number of execution paths that may be present in a given number of lines of code. It therefore seems likely that the correlation between function size and pathcount is caused by this relationship.
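A correlation of this kind can be computed as in the following sketch, which uses the Pearson coefficient on invented measurements. The text does not state which coefficient was used or whether pathcount was log-scaled first, so the sketch computes both variants and reproduces neither the data nor the 0.4334 figure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical measurements for six routines.
lines_of_code = [12, 35, 60, 110, 240, 400]
pathcounts = [2, 8, 150, 40, 90000, 3000]

# Correlation on the raw pathcount values.
r_raw = pearson(lines_of_code, pathcounts)
# Correlation on log10(pathcount), matching the logarithmic axis of the plot.
r_log = pearson(lines_of_code, [math.log10(p) for p in pathcounts])

print(f"raw pathcount: r = {r_raw:.4f}")
print(f"log pathcount: r = {r_log:.4f}")
```

Because pathcount values span many orders of magnitude, a few extreme routines can dominate the raw coefficient, which is one reason to examine the log-scaled variant as well.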

Because of this observation it is unclear if Barnes and Hopkins could have achieved similar results by using function size rather than pathcount values. One reason why pathcount may be a more discriminating predictor of faults than function size may be that pathcount measures cover a significantly larger range of values than the function size metric, and may therefore have a finer granularity.

Figure 5: Plot of path count values against code size in Lines Of Code. Path count plotted on logarithmic scale. (Vertical axis: Path Count, logarithmic, 1 to 1e+09; horizontal axis: Lines Of Code, 0 to 1200.)

2.5.3 Summary

Barnes and Hopkins provide a scientific approach to the validation of metrics, something that has sometimes been lacking in the field. Their method of using the change history of a library to allow validation of predictions based on metric values is an innovative use of the change history.

This section has also shown that the change history contains much useful information about the state of a software system and that combining metrics with software change history can make software measurement a more powerful tool.

Using change history as part of the measurement process may be relatively simple for many large software systems because such systems often employ a source code revision control system such as CVS, from which it is sometimes possible to extract information automatically. However, it is worth noting that this can depend heavily upon the frequency of commits to the revision control system and the quality of the log messages associated with them. This is discussed more in Chapter 3.
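The dependence on log message quality can be seen in a minimal sketch of automatic extraction. The `Commit` record, the keyword list, and the sample log entries are hypothetical; the sketch does not parse real CVS output and only shows the kind of counting that becomes possible once log entries are available.

```python
from collections import Counter
from typing import NamedTuple


class Commit(NamedTuple):
    """One revision-control log entry (hypothetical layout, not CVS's own format)."""
    path: str
    message: str


# Assumed keyword list; a real project would need its own conventions.
FIX_KEYWORDS = ("fix", "bug", "fault")


def is_bug_fix(commit):
    """Guess whether a commit is a bug fix from its log message alone."""
    message = commit.message.lower()
    return any(keyword in message for keyword in FIX_KEYWORDS)


def count_deltas(commits):
    """Number of changes recorded per file."""
    return Counter(c.path for c in commits)


def count_bug_fixes(commits):
    """Number of changes per file whose message looks like a bug fix."""
    return Counter(c.path for c in commits if is_bug_fix(c))


log = [
    Commit("parser.c", "Fix crash on empty input"),
    Commit("parser.c", "Add support for comments"),
    Commit("lexer.c", ""),  # empty message: cannot be classified
    Commit("lexer.c", "bug 17: wrong token position"),
]

print(count_deltas(log))
print(count_bug_fixes(log))
```

The commit with an empty message is counted as a delta but can never be recognised as a bug fix, which is how poor log messages weaken any measure built on fault counts.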
