3.3 Experimental Methodology
4.1.7 Interaction of pattern attributes
In the previous sections several metrics have been presented, each of which measures a specific attribute of pattern expressions. On its own, none of these metrics provides a particularly high correlation with the number of changes. The intriguing question that arises from this is whether these attributes interact in any way to form a more accurate measure.
The first step in investigating any possible interactions is to see whether any of the measures are correlated with each other. This can be done using a correlation matrix. The correlation matrices for the pattern measures taken from the Peg Solitaire and Refactoring programs are shown in Table 14 in Appendix C.
The correlation matrices for both programs show that several of the pattern metrics are strongly correlated. In particular, it appears that “Number of pattern variables” (p1), “Sum of depth of patterns” (p2) and “Pattern size” (p6) are strongly correlated in both programs. In the Refactoring program “Number of constructors in pattern” (p5) is also strongly correlated with this group. One explanation for this is that these measures may all be measuring the size of a pattern.
Looking at the correlation matrices again, it appears that, apart from the cluster of strongly correlated measurements, the other metrics have minimal cross-correlation.
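The correlation matrix step can be sketched as follows. This is a minimal illustration that assumes Pearson correlation is the measure used. The function names, the 0.7 threshold for "strongly correlated" and the sample values are hypothetical and are not taken from either case study.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(metrics):
    """Map every ordered pair of metric names to its correlation."""
    names = list(metrics)
    return {(a, b): pearson(metrics[a], metrics[b]) for a in names for b in names}

def strongly_correlated(metrics, threshold=0.7):
    """List each unordered pair whose |r| meets the (assumed) threshold."""
    names = list(metrics)
    matrix = correlation_matrix(metrics)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(matrix[(a, b)]) >= threshold
    ]

# Invented values for illustration only; one entry per measured function.
example = {
    "p1": [1, 2, 3, 4, 5],
    "p2": [2, 4, 6, 8, 11],
    "p3": [1, 0, 1, 0, 1],
}

print(strongly_correlated(example))
```

Pairs reported by `strongly_correlated` would be candidates for a cluster such as the one formed by p1, p2 and p6.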
It is interesting to see how the metrics might be combined to provide a higher correlation with the number of changes. This can be done using a regression analysis. However, when variables in a regression analysis are strongly correlated with each other the result can be inaccurate, so it is advisable to replace the strongly correlated metrics with a single representative metric.
For this work “Sum of depth of patterns” (p2), for the Peg Solitaire program, and “Number of pattern variables” (p1), for the Refactoring program, were chosen as the representatives of the cluster of correlated metrics. These were chosen because they have the highest correlation with the number of changes out of the metrics that form the clusters.
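The selection rule just described can be written down in a few lines. This is only a sketch: `choose_representative` is a hypothetical helper, and the correlation values below are invented for illustration rather than taken from Table 14.

```python
def choose_representative(cluster, correlation_with_changes):
    """Pick the cluster member with the highest correlation with changes."""
    return max(cluster, key=lambda name: correlation_with_changes[name])

# Invented correlations with the number of changes, for illustration only.
invented = {"p1": 0.30, "p2": 0.50, "p6": 0.40}

print(choose_representative(["p1", "p2", "p6"], invented))
```

The remaining members of the cluster are then left out of the regression, so that only one of the mutually correlated metrics enters the model.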
Because the results for different metrics may not be measured on the same scale, it is important to normalise them so that they all have equal weighting in the regression analysis. This is achieved by performing the analysis on the “z-scores” of the metric results. The z-scores normalise the metric values such that the z-scores have a mean value of zero and a standard deviation of one. The results of the regression analysis are shown in Table 21 in Appendix D.
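A minimal sketch of the z-score normalisation follows. It assumes the population standard deviation (dividing by n); the text does not say whether the population or the sample form was used, and the input values are invented.

```python
from math import sqrt

def z_scores(values):
    """Rescale values to have mean zero and standard deviation one."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation (divide by n): an assumption.
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

# Invented metric values for illustration only.
raw = [2, 4, 4, 4, 5, 5, 7, 9]
scaled = z_scores(raw)
print(scaled)
```

After this step a metric counted in single digits and one counted in hundreds contribute on the same scale, so their regression coefficients can be compared directly.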
The results of the regression analysis show the multiple correlation coefficient, R, to be 0.1584 for the Peg Solitaire program, which is not statistically significant, and 0.6015 for the Refactoring program. These values are higher than the highest individual correlation values for each program, although in the case of the Refactoring program, only marginally.
These results suggest that there is only a small amount of interaction between the metrics. The results from the Refactoring program show that 36% of the variance can be explained by the metrics so it is worth examining this regression analysis in more detail.
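The proportion of variance explained is the square of the multiple correlation coefficient, so R = 0.6015 gives R² ≈ 0.362, which is the 36% quoted above. The sketch below shows one way such a value could be computed: an ordinary least-squares fit on z-scores, solved through the normal equations. The tool actually used for the analysis is not stated, so the function names and the sample data are assumptions for illustration only.

```python
from math import sqrt

def standardise(values):
    """z-scores using the population standard deviation (an assumption)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def multiple_correlation(predictors, response):
    """Return (coefficients, R, R squared) for a least-squares fit on z-scores.

    No intercept is needed because every standardised variable has mean zero.
    """
    zs = [standardise(p) for p in predictors]
    zy = standardise(response)
    k = len(zs)
    xtx = [[sum(a * b for a, b in zip(zs[i], zs[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(a * b for a, b in zip(zs[i], zy)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(beta[i] * zs[i][t] for i in range(k)) for t in range(len(zy))]
    ss_res = sum((y - f) ** 2 for y, f in zip(zy, fitted))
    ss_tot = sum(y * y for y in zy)
    r_squared = max(0.0, 1.0 - ss_res / ss_tot)
    return beta, sqrt(r_squared), r_squared

# Invented data for illustration only.
metric = [1, 2, 3, 4, 5]
changes = [2, 1, 4, 3, 5]
beta, r, r_squared = multiple_correlation([metric], changes)
print(beta, r, r_squared)

# Arithmetic check of the figure quoted in the text.
print(round(0.6015 ** 2, 4))
```

With a single predictor the standardised coefficient equals the ordinary correlation coefficient, which is why a multiple R only slightly above the best individual correlation indicates that the extra metrics add little.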
The coefficients shown in Table 21 in Appendix D for the regression analysis of the Refactoring program show that the largest contribution comes from the “Number of pattern variables” (p1) metric. The correlation value for this metric is only slightly less than the multiple correlation coefficient. This suggests the following observation.
is its size, as indicated by the number of pattern variables.
It seems that for pattern attributes the other measurements do not significantly increase the correlation with the number of changes, although this may be caused by the relatively small number of occurrences of those attributes in the Refactoring program.
4.1.8 Summary
In this section it has been shown that there are several attributes of pattern expressions that can be measured. Each attribute measures a distinct component of a pattern expression that might add to a pattern’s complexity. These attributes can therefore be thought of as atomic attributes. Analysis of these atomic attributes shows that some of the attributes are strongly correlated. These strongly correlated attributes appear to be measuring the size of a pattern in various ways.
The two case studies highlight the differing results that can occur for the measurements’ correlations in differing contexts. In the Peg Solitaire program the “Number of constructors in pattern” (p5) measurement had a slight negative correlation, while in the Refactoring program it had a positive correlation. This may be due to the differing uses of constructors in the two programs. In the Peg Solitaire program the constructors are used in simple data types that have little nesting, and as such the naming of the constructors helps to document the code. By contrast, the Refactoring program uses large, complex, mutually recursive nested data types to represent parse trees. In this case the constructor names are sometimes generic and add little documentation to the code. The complex nature of the data types makes it easy to introduce errors, and hence the metric has a positive correlation.
Performing regression analysis suggests that the largest influence on the correlation with the number of changes is the size of the pattern.
taken may affect the correlation of the metric. For instance, the “Number of constructors in pattern” (p5) measurement appears to exhibit different correlations between programs with complex data types, such as the Refactoring program, and programs with simple data types, such as the Peg Solitaire program. Because of this, it seems that combining such a metric with some measures of the corresponding data type could produce a greater correlation, and therefore perhaps a more accurate prediction.