Studying software evolution using artefacts’ shared information content


Contents lists available at ScienceDirect

Science of Computer Programming

journal homepage: www.elsevier.com/locate/scico

Studying software evolution using artefacts’ shared information content ✩

Tom Arbuckle ∗

Computer Science and Information Systems, University of Limerick, Limerick, Ireland

Article info

Article history:
Received 3 August 2010
Received in revised form 7 November 2010
Accepted 9 November 2010
Available online 27 November 2010

Keywords: Software evolution; Software measurement; Information theory; Kolmogorov complexity; Similarity metric; Information content; CompLearn

Abstract

In order to study software evolution, it is necessary to measure artefacts representative of project releases. If we consider the process of software evolution to be copying with subsequent modification, then, by analogy, placing emphasis on what remains the same between releases will lead to focusing on similarity between artefacts. At the same time, software artefacts – stored digitally as binary strings – are all information. This paper introduces a new method for measuring software evolution in terms of artefacts’ shared information content. A similarity value representing the quantity of information shared between artefact pairs is produced using a calculation based on Kolmogorov complexity. Similarity values for releases are then collated over the software’s evolution to form a map quantifying change through lack of similarity. The method has general applicability: it can disregard otherwise salient software features such as programming paradigm, language or application domain because it considers software artefacts purely in terms of the mathematically justified concept of information content. Three open-source projects are analysed to show the method’s utility. Preliminary experiments on udev and git verify the measurement of the projects’ evolutions. An experiment on ArgoUML validates the measured evolution against experimental data from other studies.

1. Introduction

Within software evolution, several broad themes can be identified [1]. Moreover, the term is often used interchangeably with software maintenance [2]. Specifically, therefore, our focus here is on the theme of change of software artefacts in time or across versions. Following the everyday use of the word evolution, we want to examine how software is modified in response to its environment or requests from its users. With our eye on listed challenges in software evolution research [3–5], we want to study how software evolves.

A rich source of evolutionary data comes from mining software repositories [6]. Open-source projects have proliferated, largely solving former problems with access to proprietary development information. Given that we can obtain many different kinds of software artefacts, how are their evolutions to be measured?

Software is generally measured using software metrics. This field has a long and venerable heritage [7–10] and a disappointing preponderance of difficulties and disagreements [11–19]. With hundreds of software metrics to choose from, it is difficult to make a convincing case that any one software metric is significantly better. They tend to be designed for particular purposes and need to be calibrated against development context.

An application of information theory, specifically employing a measurement based on the (relative) Kolmogorov complexity, provides a means, disregarding purposes, languages and context, of measuring software and thereby software evolution. Software is, after all, information and that information is representative of the decisions made in its design and construction. By choosing information theoretic measurement, we attack the hard problem of software measurement with the precision and ineluctability of a mathematically defined concept.

✩ Figures in this paper make use of colour. Obtaining an electronic or colour-printed version of the paper may aid comprehension.
∗ Tel.: +353 61 23 4284.

The object of this paper is to continue to reinforce our claims [20–22] by experimentally examining data from open-source projects. After preliminary studies of the projects udev [23] and git [24], a more detailed exploration of the project ArgoUML [25] permits validation of the technique against results obtained by other researchers.

The structure of the remainder of this paper is as follows. First, we clarify our terminology and relate shared information measurement to the software engineering literature on measurement and comparison. We state the thesis of the paper and the approach to be followed in validating it. Then, a self-contained section provides only the background theory necessary to understand the experiments. Next we describe the experimental approach. The steps to be followed, interpretation of results, and threats to validity are detailed. We conduct two preliminary studies. There follows a third, comparative, experiment in which the results obtained are validated against those of other researchers. Related work is outlined. The conclusions and future work are presented. Finally, appendices on certain technical aspects of the theory are provided, together with the bibliography.

2. Theme and thesis

2.1. What is software evolution?

We define software evolution as the way in which software artefacts change between versions. A version need not represent a full release but could simply represent the current development status. Versions need not be chronologically ordered. Provided the artefacts are representative, we do not restrict their type. In practice, we often consider source code, representations of structure, representations of behaviour or the machine instructions themselves.

There is an extensive literature on how software evolves in this sense. Mens and Demeyer’s paper concerning software evolution metrics [26] provides key references. In addition, Fernandez-Ramil et al. [27] have written a review starting from the early work of Belady and Lehman and going on to discuss studies of evolution of open-source systems. Israeli and Feitelson’s study of Linux kernel evolution [28] provides a recent notable addition.

2.2. Software measurement = (software) metrics?

Evolution implies change and in order to quantify change, we need a measurement method. Software is traditionally measured using software metrics, more properly called software measures [29].

Early methods of measuring software include counting instructions [30] or lines of code (LoC) [31], with LoC being suggested as a baseline software measurement as late as 1983 by Basili and Hutchens [32]. Further early publications on software metrics include Rubey and Hartwick [33], McCabe’s cyclomatic complexity [7], and Halstead’s software science [8].

Reading early reviews of the field, such as Perlis et al. [9] or Cook [34], it is plain that, on the one hand, the need for means of measuring software has been clearly recognised, and, on the other hand, the field has already run into difficulties and been the topic of intense debate. Indeed, as we have already mentioned, metrics are also the subject of subsequent critical papers [11–14]. More recent metrics, including those by Chidamber and Kemerer [10], the metrics by Lorenz and Kidd [35] and the MOOD set of metrics [36], focus on the measurement of object-oriented code. While the popularity of Chidamber and Kemerer’s metric suite means that it has almost become a de facto standard, it has not escaped extensive criticism [15–17]. The MOOD metrics have also been criticised [18] although other authors have found them to be useful [37]. The Lorenz and Kidd metrics are similarly the subject of claim and counter-claim [38,39]. It is clear that the subject of the measurement of software is both difficult and contentious. As Abran et al. state in their 2003 paper [19]:

This is a clear indication that, when looked at from an engineering perspective, measurement in software engineering is far from being mature and that it constitutes a fairly weak engineering foundation for the field of software engineering.

One area of software measurement that has gained acceptance is the area of software measurement for project estimation and management. From early work by Putnam [40] and Albrecht [41], function point measurement has progressed to become a necessary part of software process improvement initiatives such as CMM or CMMI. Boehm et al. and Jones describe two current leading methods [42,43].

Finally, attempts to employ information theory to measure software have, almost without exception, involved the use of (Shannon [44]) entropy. Starting from Campbell [45] and Hellerman [46], two recent examples are Sarkar et al. [47] or Anan et al. [48]. We will not employ entropy. See Appendix A for more details.

2.3. Edit distances and beyond

There are alternatives to using software metrics and comparing the values they produce on different artefacts. During coding, a developer comparing two files will commonly use a file differencing tool, such as the UNIX command diff to [...] representation of the edits needed to transform one file into another and then find the minimum cost path, the ‘edit distance’, to traverse this graph. (See also the recent language independent Ldiff [51].) A further series of tools, such as JDiff [52,53] and UMLDiff [54] are more aware of the semantic consequences of change in object-oriented (OO) environments like Java and can make more human-meaningful comparisons as a result. Tools for model differencing [55,56] and semantic distances [57] are also relevant here.

Edit distances lead to the consideration of the use of mathematical metrics for the comparison of software. The key to this idea is that we can think of a ‘space’ of edits and measurement of difference as being the result of calculating a ‘metric’ on the space to provide a ‘distance’ between two points (files) in this ‘space’.

A mathematical metric, $d$, is a real, single-valued, non-negative function that defines a distance between the points of a set. A metric has the properties

1. $d(x, y) = 0$ if and only if $x = y$ (identity);
2. $d(x, y) = d(y, x)$ (symmetry);
3. $d(x, z) \le d(x, y) + d(y, z)$ (triangle inequality);

for members $x$, $y$, and $z$ in the set. A normalised metric is a metric where $d(x, y) \le 1$.

There are several relevant mathematical metrics. The Hamming distance [58] and the Levenshtein distance [59] (see also [60]) are used to measure the edit distance between strings. The Bhattacharyya metric [61] has been found to be a good metric for frequency distributed data [62]. The Kullback–Leibler divergence [63] (not symmetric and so not a true metric) is commonly used for comparison of distributions.
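As an illustration of the two string metrics just named, the following is a minimal Python sketch. The function names `hamming` and `levenshtein` are our own choices for this sketch, and the Levenshtein routine is the textbook dynamic-programming formulation rather than the implementation of any tool cited above.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]
```

Both functions return zero only for identical inputs and are symmetric in their arguments, so the metric properties listed above can be spot-checked directly on small strings.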

Mathematical metrics are precisely defined and require no estimation or judgemental evaluation to calculate. In what follows, we will use the word metric to mean mathematical metric: software metrics will not be employed.

2.4. Similarity, not difference

In contrast with measuring dissimilarity in terms of distances, it can also be useful, in general, to look at the complementary facet of shared similarity. In other words, in the case of software artefacts, rather than following the conventional path of using diff, for example, to show what has changed between versions of files, we can use some other kind of tool to see what remains the same between them.

One way in which things remain the same is when they are copied. A copy of a section of code within a file is software cloning. Recent publications [64–68] and reviews [69–71] show that this is an active research theme. Our concern is not clone detection as such: the search for clones involves search through multiple sources.

A second form of copying is duplication of the whole file, important because we consider evolution to be copying plus modifications. There are many early examples of studies to detect copying (plagiarism) and collusion [72–76]. Often studies on detection of software duplication [77–80] employ Karp and Rabin’s important ‘fingerprinting’ algorithm [81] (and variants), but there are also other approaches [82–86]. There are several recent tools for plagiarism detection [87–90]. We also note Church and Helfman’s dotplots [91] employed [92] for visualisation.

Finally, there are a few papers dealing with similarity that are of direct relevance to this work. Kirk and Jenkins [93] compared code before and after deliberate obfuscation to determine similarity. Chen et al. [94] examine information shared between code sources for program plagiarism detection in their plagiarism detection program SID. Similarly, the system by Zhang et al. [95] employs a measurement of shared information and clustering to look for plagiarism. A recent paper by Cebrián et al. [96] discusses an approach to the validation of plagiarism detection systems employing shared information to compare programmatically generated code variants. What all of these references have in common is the use of information theory – specifically Kolmogorov complexity based information distance – for looking at similarity between coding artefacts.

2.5. Thesis and proposed experimental validation

The thesis of this paper is that a theoretically justified means of measuring software and tracking software evolution is provided by an application of information theory to measure the similarity, in terms of shared information, between artefacts representative of software releases.

In contrast with other approaches, our investigation of software evolution in this paper:

• thinks in terms of mathematical metrics and dispenses with ‘‘software metrics’’;
• thinks in terms of information and dispenses with particularities of semantics, of coding, and of languages; and
• looks at similarity and ceases to think of difference.

The paper answers the following questions.

1. To what extent can the measurement of shared information be used to measure software evolution?
2. To what extent is this practical?
3. How does this compare with the measurement of software evolution by other means?

The approach used to answer these questions and validate the method is one of experimental investigation and comparison of results with those obtained by other methods. Before moving on to the series of experiments, in the next section we explain the underlying theoretical approach.


3. Background theory

3.1. What is information?

Unfortunately, the word ‘‘information’’ has too many connotations from normal usage to be used safely without further explanation. We need to address the potential confusion between ‘‘information’’ and ‘‘meaning’’ or ‘‘semantics’’.

By ‘‘information’’, we mean sequences of binary digits. In other words, we are thinking about information recorded in strings of ones and zeros such as 011000101010001....

Our use of digital (rather than analogue) computers means that the information we process on our computers is stored and processed in terms of binary information regardless of the encoding we place on it. In particular, software artefacts and programs are binary strings.

This matters because, when a human-understandable programming artefact – source code, for example – is modified as part of its evolution, we meaningfully interpret those changes. A one bit change in the source code’s binary string may represent a change in a programming symbol from a ‘>’ to a ‘≥’. Our interpretation, the meaning and the behaviour of the executable produced from the source will likely be completely different. However, seen as a stream of binary data, for any non-trivial binary stream, we would be hard pressed to detect the difference. Instead, we need some kind of computation to provide the result of the comparison.

When we dispose of semantic interpretation and treat software artefacts as binary information, we dispose of context and meaning but, in return, can ask for a report of a precisely defined quantity, independent of human interpretation, that provides a measurement of the information in the artefact.

3.2. Measurement of information content

The branch of information theory concerned with measuring the information content of (binary) objects is Kolmogorov complexity. Discovered independently by Solomonoff [97], Kolmogorov [98] and Chaitin [99], it is also sometimes known as ‘algorithmic entropy’. The Kolmogorov complexity $K(x)$ of a (binary) object $x$ is defined to be the length of the shortest (prefix-free) binary program to compute $x$ on a universal computer, such as a universal Turing machine. It gives the number of bits to computationally describe $x$. The book by Cover and Thomas [100] provides a succinct overview of Kolmogorov complexity. For a more detailed treatment, see the book by Li and Vitányi [101].

Kolmogorov complexity can be shown to be a non-partial recursive function. This means that it is not computable on any standard computational machine. Therefore, some means must be found of approximating it if it is to be used for practical (real-world) problems. One way to do this is described in Appendix B. However, Kolmogorov complexity $K(x)$ is an absolute measure of the information content of one object, $x$. Two completely unrelated objects could provide identical measurements of their corresponding information contents. A means of looking for quantities of shared information is required.

3.3. Measurement of shared information content

The relative quantity of shared information between two binary objects $x$ and $y$ is measured by the Normalised Information Distance (NID), introduced by Li et al. [102]. The NID is calculated as:

$$\mathrm{NID}(x, y) = \frac{\max\{K(x \mid y),\; K(y \mid x)\}}{\max\{K(x),\; K(y)\}} \qquad (1)$$

where the conditional Kolmogorov complexity of $x$ given $y$, denoted $K(x \mid y)$, for example, is the length of the shortest program for a universal Turing machine to output $x$ for an input $y$.

There are several points to note about this formula.

1. It is exact: there is no approximation.
2. It is relative: it describes the amount of information shared with respect to the amount of information in the larger object. We can interpret $1 - \mathrm{NID}(x, y)$ as the number of bits of shared information per bit of information of the string with more information.
3. It is normalised: values of the NID are constrained to lie between zero and one, with the value zero meaning that the compared objects are identical.
4. It is a mathematical metric, satisfying the mathematical metric properties up to an additive term of $O(1/K)$, where $K$ is the maximum of the Kolmogorov complexities involved [102].
5. It is uncomputable since the conditional Kolmogorov complexities are uncomputable.

If it were possible to calculate the NID, we would have an exact, mathematical measurement of the quantity of shared information between two (binary) objects. (An absolute information distance and links between this work and the physics of computation, including reversible computation, are given in Bennett et al. [103].)

3.4. Calculating similarity

The NID, a mathematically exact, dimensionless number measuring similarity, cannot be calculated exactly. It is possible to approximate it if we recognise that the Kolmogorov complexity of a binary object is the length of a compressed representation of the object. The program to be run on the Turing machine is, by definition, the shortest representative program. It is the most compressed representation of the object as a program.

General algorithms for the compression of data are not tuned for each individual data sample they are required to compress but nevertheless are an attempt to produce a more compact representation of the original. This was the intuition used by Li et al. [102] to describe a practical approximation. (Li et al. published a slightly different version of the metric earlier [104].) The approximation’s requirements and practical implications were studied in detail by Cilibrasi and Vitányi [105].

Referring to an instantiation of any compression algorithm as a ‘real’ compression mechanism or ‘real’ compressor, as opposed to the ideal one to be run by the Turing machine, it was shown [102,105] that the NID could be approximated in terms of real compression mechanisms as a normalised compression distance (NCD). Denoting by $xy$ the concatenation of $x$ and $y$ and by $C(x)$ the approximation of a Kolmogorov complexity $K(x)$ by the length of the compressed data produced by an instance of a real compressor, the NCD is given as:

$$\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x),\; C(y)\}}{\max\{C(x),\; C(y)\}}. \qquad (2)$$

Cilibrasi and Vitányi were able to place bounds on the accuracy of the approximation by showing that it was dependent on the accuracy of the approximation of an ideal compressor by a real one. They further showed that necessary properties for compression mechanisms were idempotency, monotonicity, symmetry and distributivity. Fortunately, these are features of most real-world compressors. (See, however, [106] for some additional analysis.)
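To make the NCD concrete, consider hypothetical compressed lengths $C(x) = 1000$, $C(y) = 1200$ and $C(xy) = 1300$ bytes: the NCD is then $(1300 - 1000)/1200 = 0.25$. The minimal Python sketch below computes the same quantity with compressors from the Python standard library. It is an illustration only and not the CompLearn implementation; the names `COMPRESSORS`, `compressed_length` and `ncd` are assumptions made for this sketch.

```python
import bz2
import lzma
import zlib

# Standard-library compressors standing in for a 'real' compressor C.
# (Assumption: any of these may be substituted; none is CompLearn itself.)
COMPRESSORS = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lambda data: lzma.compress(data),
}


def compressed_length(data: bytes, compressor: str = "zlib") -> int:
    """C(x): length in bytes of the compressed representation of data."""
    return len(COMPRESSORS[compressor](data))


def ncd(x: bytes, y: bytes, compressor: str = "zlib") -> float:
    """Normalised compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx = compressed_length(x, compressor)
    cy = compressed_length(y, compressor)
    cxy = compressed_length(x + y, compressor)  # xy is plain concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)
```

With a real compressor, a self-comparison such as `ncd(x, x)` is typically small but need not be exactly zero, and values slightly above one can occur on data the compressor handles poorly.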

4. Experimental approach

4.1. Aims and objectives

The purpose of these experiments is to validate the hypothesis that an application of information theory measures software evolution. Given two versions of a file, where one has evolved into the other, the quantity of shared information between them will give a measurement of the degree of similarity between the files. Files that have undergone little change will be detected as similar; files that have been modified will be detected as less similar. Note, again, that since we are looking at ‘information’ rather than meaning, the type of changes that are significant to the measurement will be of a different nature to the semantic changes significant to a human observer.

One aim of this work is to show that major phases in the development of software artefacts, not necessarily code, are represented by major changes in their shared information content. Conversely, a second aim is to show that major changes in the shared information content of artefacts measure development activity. The quantity of shared information between artefacts is measured using the CompLearn implementation of the NCD [107].

4.2. Steps for validation

The experimental approach to the validation consists of a series of steps.

1. Select an artefact which is taken to be representative of the evolution of the software package. This can be a concatenation of source files, for example, but other possibilities, including software binaries or representations of program structure or run-time behaviour are possible.

2. Create the input data set by extracting or creating the artefact for all of the releases to be studied.

3. Select a compression mechanism to be employed to calculate the NCD. See Section 4.4 for details.

4. Use the compression mechanism to calculate the array of NCD values for pairwise comparisons of all the artefacts in the data set. Pairwise comparisons are performed in each order to verify symmetry.

5. Plot the array (using Python matplotlib’s imshow without interpolation). The NCD is normalised and self-comparisons will have the result of 0.0 for all samples. A grey-scale plot of 16 releases using compressor comp will look like Fig. 1(a). [NCD data: ppmd compressor applied to project udev. Details: Section 5.1.]

6. Overlay contours on top of the grey-scale plot to enable regions within it to be more easily seen. The contours are created using matplotlib’s contour function. [Intersections between contours and grid lines are found by linear interpolation then connected by line segments for drawing as contours.] Continuing the previous example, the plot after the addition of contour lines is as shown in Fig. 1(b). The same contour levels are always used.

7. Verify the results by confirming, firstly, that the plot of the matrix shows that the compressor does indeed possess the desired properties (Section 4.4) on the given data set and, secondly, that the plots and known development history of the project as measured by traditional approaches and change logs are mutually supportive.
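Steps 4 to 7 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used for the experiments: zlib stands in for the CompLearn compressor modules, the contour levels and the output file name are hypothetical (the fixed levels used for the figures are not given here), and the helper names `ncd_matrix`, `max_asymmetry` and `plot_matrix` are our own.

```python
import zlib


def ncd(x: bytes, y: bytes) -> float:
    """NCD with zlib standing in for the compressor (an assumption)."""
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_matrix(artefacts):
    """Step 4: NCD for every ordered pair, so both orders are computed."""
    return [[ncd(a, b) for b in artefacts] for a in artefacts]


def max_asymmetry(matrix):
    """Step 7 (in part): largest |NCD(x, y) - NCD(y, x)| found in the matrix."""
    n = len(matrix)
    return max((abs(matrix[i][j] - matrix[j][i])
                for i in range(n) for j in range(i)), default=0.0)


def plot_matrix(matrix, levels=(0.2, 0.4, 0.6, 0.8), path="ncd_matrix.png"):
    """Steps 5 and 6: grey-scale image with contours overlaid (needs matplotlib).

    The levels and path defaults are hypothetical placeholders.
    """
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display required
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    # origin="lower" puts the first release at the bottom-left corner.
    image = ax.imshow(matrix, cmap="gray", interpolation="nearest",
                      origin="lower", vmin=0.0, vmax=1.0)
    ax.contour(matrix, levels=list(levels), cmap="viridis")
    fig.colorbar(image, ax=ax, label="NCD")
    ax.set_xlabel("release index")
    ax.set_ylabel("release index")
    fig.savefig(path)
    plt.close(fig)
```

Importing matplotlib inside `plot_matrix` keeps the matrix calculation usable on machines where only the standard library is available.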

4.3. Interpreting the results

The result for each experiment is a square array of NCD values (for the pairwise comparisons of artefacts) that then needs to be interpreted. In addition to the 2D plots already mentioned, we can also create line plots for individual differences between consecutive versions or cumulative difference between one version and a sequence of others.


Fig. 1. Plotting the NCD distance matrix. (a) Grey-scale plot of NCD matrix. (b) Grey-scale plot overlaid with coloured contours.

Fig. 2. Creating line plots from a distance matrix. (a) Making a line plot for consecutive versions. (b) Line plot of sample data given in Fig. 1(a).

In the 2D distance matrix plots, the values for the artefacts are arranged in release sequence, left-to-right, bottom-to-top. Values along the anti-diagonal are all zero since these values correspond to self-comparisons. From the main diagonal, moving one element to the right or above corresponds to a comparison with the next release in the sequence. Continuing the comparison vertically or to the right corresponds to comparing the release on the main diagonal to later and later releases. We often see blocks of values, as differentiated by the contour lines, lying along the anti-diagonal. If we start at the bottom left-hand corner of a block, the vertical and horizontal ‘edges’ starting from this corner indicate a region of similarity extending to the release corresponding to the top-left or bottom-right corner. The release at the top-right marks the location of a change where the similarity between consecutive releases is disrupted.

For the two types of line plots, examining individual differences between consecutive versions and cumulative difference respectively, we extract the necessary data from the NCD distance matrix. Fig. 2(a) shows the bottom-left corner of the [...] correspond to a comparison between a release and its successor. A difference plot is created from the matrix by plotting the NCD values of the points lying either to the right or immediately above the anti-diagonal. (Unless we have symmetry, these may not actually have the same values.) The first value in the plot (circular arrow) is a self-comparison, value 0.0. The blue plot (solid lines) in Fig. 2(b) shows this process (solid arrows in Fig. 2(a)) being applied to the data of Fig. 1(a) producing a plot of similarity between consecutive versions.

For the cumulative difference plot, dashed arrows in Fig. 2(a) correspond to the process for calculating cumulative difference for the corner point and are shown as the green plot (dashed lines) in Fig. 2(b). Picking a point on the central diagonal and moving up or to the right from it shows how subsequent releases become increasingly dissimilar to it. Starting from a point higher on the anti-diagonal starts the cumulative calculation from a later release.
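The extraction of both kinds of line plot from the distance matrix can be sketched as below. The sketch assumes the matrix is a list of rows indexed as `matrix[i][j]`, holding the NCD between release `i` and release `j` in release order; the function names and the `above` parameter are illustrative choices, not part of any tool described here.

```python
def consecutive_differences(matrix, above=False):
    """Values for the individual difference plot.

    The first value is the self-comparison of the first release; each later
    value compares a release with its successor. With above=False the entry
    matrix[i][i + 1] is used, otherwise the mirrored entry matrix[i + 1][i];
    the two agree only if the matrix is symmetric.
    """
    values = [matrix[0][0]]
    for i in range(len(matrix) - 1):
        values.append(matrix[i + 1][i] if above else matrix[i][i + 1])
    return values


def cumulative_differences(matrix, start=0):
    """Values for the cumulative difference plot: release `start` compared
    with itself and then with each later release in turn."""
    return list(matrix[start][start:])
```

For a hypothetical three-release matrix `[[0.0, 0.1, 0.5], [0.1, 0.0, 0.4], [0.5, 0.4, 0.0]]`, the consecutive values are `[0.0, 0.1, 0.4]` and the cumulative values from the first release are `[0.0, 0.1, 0.5]`.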

To relate the plots to the changes in the software artefacts, we look (either at the 2D plot or the individual difference plot) to see where the largest changes are being introduced. These plots give a measurement of the quantity of shared information between the artefacts for the releases so where there is a discontinuity or jump, we can expect there to have been substantive change. Recall that what is being measured is shared ‘information’: certain types of human-meaningful changes such as reorganisations or renamings may represent only minor changes in information content.

Given the points at which we detect change, we seek validation of the method by manually relating the changes to alterations in the coding artefacts. We are currently only looking for supporting evidence but it is also valuable to look for cases where either the results of this method or those of more traditional approaches fail to support each other.

4.4. The NCD, compressors, and data

Cilibrasi and Vitányi [105] show that the NCD is ‘quasi-universal’, minorising every computable similarity distance. That is, given any effective distance measure, if two objects are found to be close given this measure, then they will also be found to be close using the NCD. They also show that the NCD is robust with respect to the compressor being used. Statistical compressors (such as PPMZ), dictionary based (Lempel-Ziv), and block based compressors (bzip2) as well as special purpose compressors are stated to be permissible. Indeed, the properties needed for a compressor are listed as idempotency, monotonicity, symmetry and distributivity. These properties need to be satisfied up to an additive term $O(\log n)$, with $n$ the maximal length of a binary element involved in the comparison.

Cilibrasi [107] has made an implementation of the NCD available as the open-source project CompLearn. The project employs a modular approach for compression mechanisms. It is quite easy to create a module for any command-line compression tool and library based compressors can be employed with additional labour. We can therefore test both commonplace and esoteric compressors for their utility. We employ the supplied zlib, bzlib, ppmd and lzma modules of version 1.1.5 in these tests.

When calculating the Kolmogorov complexity, in finding the shortest possible length for the program to produce the string as output, the Turing machine is acting as an optimal compressor. No ‘real-world’ compression mechanism will be able to find this optimally compact representation. Thus, in choosing a compressor for use in the NCD, one needs to select a compressor that adequately possesses the necessary properties and performs well on the given data. Some compressors may compress well, or display symmetry, on some kinds of data but not on others. On some data, some compressors will produce values of the NCD greater than 1.0. Intuitively, the most effective compression mechanisms will be closer to the optimal answer but (very crudely), higher compression ratios being related to more computation, will take longer to achieve the results. There are also practical considerations about the data: compressors are not designed to compress very small quantities of data (since they are already small); very large files, on the other hand, may exceed the ability of the compressor to seek longer range opportunities for compression. Cilibrasi and Vitányi also mention border effects: when crossing from a first file to a second, the compressor must relearn the regularities present in the second file having been trained on the first. Thus, in a concatenation of several files used as input data, the order of presentation should not matter in principle but will likely matter somewhat in practice.

What this means is that, in carrying out experiments, we apply a set of compressors to our software artefacts and discard those results that obviously do not detect similarity or are asymmetric while at the same time seeking corroboration of compression results from the others. In other words, as we will see, testing the suitability of the compressor on the data is part of the verification process.
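A rough way to screen a compressor on given data is to measure how far it departs from the required properties. The sketch below uses the definitions as they are usually stated in the NCD literature: idempotency, $C(xx) = C(x)$; monotonicity, $C(xy) \ge C(x)$; and symmetry, $C(xy) = C(yx)$. Distributivity is omitted for brevity. The function name `compressor_report` and the relative-gap measures are assumptions of this sketch, and no acceptance threshold is implied.

```python
import zlib


def compressor_report(x: bytes, y: bytes, compress=zlib.compress):
    """Relative departures from idempotency and symmetry, plus a
    monotonicity check, for one pair of inputs and one compressor.

    `compress` is any callable mapping bytes to compressed bytes.
    """
    cx = len(compress(x))
    cxx = len(compress(x + x))
    cxy = len(compress(x + y))
    cyx = len(compress(y + x))
    return {
        # 0.0 would mean C(xx) == C(x) exactly.
        "idempotency_gap": (cxx - cx) / cx,
        # 0.0 would mean C(xy) == C(yx) exactly.
        "symmetry_gap": abs(cxy - cyx) / max(cxy, cyx),
        # True when appending y never makes the output shorter than C(x).
        "monotonic": cxy >= cx,
    }
```

Calling `compressor_report(a, b, bz2.compress)` or `compressor_report(a, b, lzma.compress)` on the same pair of artefacts gives figures that can be compared across compressors before committing to a full distance matrix.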

Finally, the amount of computation required is an issue. Although there are some opportunities for optimisation, the pairwise comparisons do need to be performed in both orders to check the symmetry of the calculation. A single NCD calculation takes seconds. However, on a 2 GHz PowerPC machine running a single-threaded program in memory without swapping, the time taken to calculate the distance matrix for some of the following tests was several days to one week, depending on the size of the artefacts and the quality of the compression algorithm involved. Even though computing a line plot requires only a single NCD computation for each point on the line, computation time can be seen as an impediment to the practical usefulness of the method.
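The growth in work with the number of releases can be made explicit with a small counting sketch. It assumes that the compressed length of each single artefact is computed once and reused, which is one possible optimisation and not necessarily how CompLearn proceeds; the function names are hypothetical.

```python
def ncd_values_full_matrix(n_releases: int) -> int:
    """NCD values in the full matrix: every ordered pair, self-comparisons included."""
    return n_releases * n_releases


def compressor_runs_full_matrix(n_releases: int) -> int:
    """Compressor invocations if each C(x) is cached: n single artefacts
    plus one concatenation per ordered pair."""
    return n_releases + n_releases * n_releases


def ncd_values_line_plot(n_releases: int) -> int:
    """NCD values for one line plot: one per plotted point."""
    return n_releases
```

For 16 releases this gives 256 NCD values and 272 compressor runs for the full matrix, against 16 NCD values for a single line plot.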

4.5. Threats to validity

In addition to our earlier papers [20–22] employing the NCD for looking at software code, behaviour and structure, other examples of its use include genomics [105], cross-language textual similarity [108], and the classification of musical styles [109]. Applications are listed in the paper by Li [110] or in Li and Vitányi’s book [101]. In line with the recommendations for empirical studies by Kitchenham et al. [111], however, we are aware of the following general threats to the validity of these experiments.

4.5.1. Artefacts need to be representative

An NCD measurement will give us a number which is (an approximation of) the quantity of shared information between the artefacts, but it is still an assumption that the artefacts themselves are representative. The point is that we can measure different artefacts which we assume to be representative of different aspects of change in the code. Traces can be representative of dynamic behaviour; abstract syntax trees can be representative of structure; code and binaries are representative in some way of code development. However, we are currently making no attempt to determine which is the best representation for tracking these evolutions. In addition, as previously stated in Section 4.3, certain types of human-meaningful changes may represent only minor changes in information content. Measurements of representative artefacts may indicate similarities or differences that might not mirror our intuitive expectations despite the measurements reflecting the artefacts’ shared information content.

4.5.2. Quality of approximation and choice of compressor

Recall that the NCD is an approximation of the NID which is in turn an uncomputable quantity. If the properties required for the approximation do not hold, then the correctness of the approximation will be called into question. Following on from this, we have already discussed (Section 4.4) the selection of the compressor, the qualities that need to be possessed by a compressor and the need to verify the action of the compressor at the same time as applying it to the measurement of the quantity of shared information. For more detail on the choice of compressors, see the paper by Cebrián et al. [106].

4.5.3. Interpretation of results

Once we have produced the NCD distance matrix, we interpret its results manually. A large change in the quantity of shared information prompts us to examine changes in the code to see what might have caused the discontinuity (or vice versa). If a correspondence is found, we have not yet conclusively established causality. We dispense with the semantic context but then make assumptions about changes in shared ‘information’ content that we attribute back to human-meaningful change. Indeed, validating this is one of the objectives of this work.

4.5.4. Other potential problems

Noise. The data, the software artefacts themselves, are not filtered or pre-conditioned in any way. It is possible that ‘noise’ – extraneous commentary for example – could skew the results. Cebrián et al. [112] have, however, shown experimentally that the NCD is resistant to noise.

Ordering of data. As already mentioned, ordering of data should not matter in principle but might do in practice. Therefore, if data is taken from multiple sources to form a concatenation, we use the same programmatic device (Python: ‘os.path.walk’) to locate the artefacts within the directory of files. This does not guarantee the same ordering but it is highly likely within evolving source trees.
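The concatenation step can be sketched as below. This is an illustrative reconstruction rather than the script used for the experiments. It uses `os.walk`, the Python 3 replacement for `os.path.walk` (which existed only in Python 2), and it sorts directory and file names explicitly to fix the traversal order, which goes beyond what the text describes. The function name `concatenate_sources` and the default pattern are assumptions.

```python
import os
import re


def concatenate_sources(root: str, pattern: str = r".*\.[ch]\Z") -> bytes:
    """Concatenate every file under `root` whose name matches `pattern`.

    Directory and file names are sorted so that the traversal order is
    deterministic (an addition of this sketch, not part of the paper).
    """
    matcher = re.compile(pattern)
    chunks = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Sorting in place fixes the order in which subdirectories are visited.
        dirnames.sort()
        for name in sorted(filenames):
            if matcher.match(name):
                with open(os.path.join(dirpath, name), "rb") as handle:
                    chunks.append(handle.read())
    # No filtering or pre-conditioning: the raw bytes are joined as they are.
    return b"".join(chunks)
```

Calling `concatenate_sources(release_dir)` once per extracted release would give one artefact per release; passing `pattern=r".*\.java\Z"` would select Java sources instead.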

Measurement uncertainty. The output of applying a measurement process to produce a measurement should also include some indication of the variation of the values attributed to that measurement, the measurement uncertainty [19]. No estimate for the uncertainty in these measurements is presented.

5. Preliminary experiments

5.1. udev

udev [23], coded in C, is a well-documented and integral part of most Linux distributions. The suite of programs included in udev is responsible for dynamically setting up file system nodes corresponding to hardware devices within the system. Other programs within the suite serve to generate unique identifiers for device classes (such as USB devices, for example) and to permit other actions to be taken as a result of the run-time addition or removal of a device.

5.1.1. Method

The artefact chosen as being representative of the evolution is the concatenation of all source – .h and .c – files within the directory and all of its subdirectories (Python: os.path.walk, regex ‘.*\.[ch]\Z’) corresponding to each release. No filtering or pre-conditioning of the files is performed. In all, 141 releases are studied and an index number is assigned to each sequentially. Release 041 does not exist so index 40 corresponds to release 040 but index 41 corresponds to release 042,

Table 1

udev: reasons for major changes, from [22] © [2009] ACM Inc.

| Index | Release | Notable features | Index | Release | Notable features |
|---|---|---|---|---|---|
| 6 | 6 | SCCS files kept in source | 73 | 74 | Remove own copy of klibc |
| 16 | 16 | Removal of SCCS files | 79 | 80 | Replace libsysfs |
| 43, 44 | 44, 45 | No code changes | 98, 99 | 99, 100 | Almost no code changes |
| 53 | 54 | Update klibc with zlib | 126 | 127 | libudev info library |

5.1.2. Results

Four compressors (see Section 4.4) were deployed on the same data set: zlib, bzlib, ppmd and lzma (here labelled as lzmax). The results of plotting the NCD distance matrices, together with the line plots for compressors ppmd and lzma, are shown in Fig. 3. The same contour lines and mapping between values and grey-scale values are used as shown in Fig. 1(b). (Some preliminary results were published in [22].)

5.1.3. Analysis

As explained in Section 4.3, moving from the main diagonal in directions from left to right or from bottom to top represents comparing a given release with a later one. The value $1 - \mathrm{NCD}(x, y)$ represents the number of bits of shared information between samples $x$ and $y$ proportional to the sample with more information. The areas to top left and bottom right of the 2D plots therefore represent those comparisons where the quantity of shared information is least.

The most notable feature of the 2D plots is how the results corresponding to different compressors differ from each other. They can be considered to be presented in (reading) order of increasing quality. Clearly the first result, that for zlib, shows that this compressor is able only to (correctly) detect when consecutive releases are identical or nearly so. The result for bzlib is asymmetrical and since symmetry was one of the properties desired of a compressor, this result is discarded. (In the colour plots for bzlib and ppmd, a yellow contour partly obscures the black anti-diagonal.) The remaining two results for ppmd and for lzma are quite similar. The result from ppmd is quite usable but that from the lzma compressor has considerably more detail although it takes longer to calculate. Results for zlib and bzlib will not be shown in later experiments.

One observes regions of similarity indicated by blocks (see Section 4.3) along the main diagonal. Examining either the results for compressor ppmd or lzma, it is possible to see that after a few initial releases (around 5), there is a set of around 10 releases that are very similar to each other followed by two large blocks, the first of 57 releases and the second with the remaining releases. Within these two large blocks, there are further sub-blocks. In the case of the ppmd compressor, these sub-blocks are: 17 to 53 together with 53 to 72; and 77 to 126 together with 126 to 141. There is additional detail in the case of the lzma compressor. Examination of the computed values for the NCD shows that the compressors are all able to determine the two cases where artefacts of identical or almost identical releases were compared (43, 98). These have NCD values of 0.0 and are most easily seen in the line plots for ppmd. The two ‘wings’ emanating from release 5 form the other feature of note. Release 5 is similar to release 17 and onwards but not to releases 6 to 16, probably due to the retention and subsequent removal of SCCS files in the source (Table 1).

The graphs showing ‘individual’ and ‘cumulative’ similarity for the NCD values for compressors ppmd and lzma confirm the results of the 2D plots while at the same time permitting a comparison of the results from the compressors. The most striking facet of the ‘individual’ graphs is their structural similarity. Peaks present in the lzma graph are also present in the ppmd graph but are seen as smaller peaks added to a background value of around 0.7. The two near identical releases, also visible in the zlib result, are seen as zeros in the graph for ppmd. Both ‘cumulative’ results are unhelpful, showing only that the artefacts for subsequent releases quickly lose similarity with the artefact selected as initial sample.
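The two line plots can be read off an NCD distance matrix as sketched below. The interpretation is an assumption inferred from the discussion here: the ‘individual’ series is taken to compare each release with its immediate predecessor, and the ‘cumulative’ series to compare each release with the artefact selected as initial sample. The function names are hypothetical, and `matrix[i][j]` is assumed to hold NCD(release i, release j).

```python
def individual_series(matrix):
    """NCD between each release and its immediate predecessor (assumed reading)."""
    return [matrix[i - 1][i] for i in range(1, len(matrix))]


def cumulative_series(matrix):
    """NCD between the first release and every later release (assumed reading)."""
    return list(matrix[0][1:])


def shared_information(ncd_value: float) -> float:
    """Proportion of shared information, 1 - NCD(x, y)."""
    return 1.0 - ncd_value


if __name__ == "__main__":
    # Toy 3x3 matrix for illustration only; these are not measured values.
    toy = [
        [0.0, 0.2, 0.9],
        [0.2, 0.0, 0.3],
        [0.9, 0.3, 0.0],
    ]
    print(individual_series(toy))
    print(cumulative_series(toy))
```

Under this reading, each series needs only one NCD value per release, which is why the line plots are far cheaper to compute than the full matrix.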

The alterations thought to be responsible for the large differences in shared information content are listed in Table 1. Entries in this table are determined by manually examining the ChangeLog for the project. The corners of the blocks in the NCD plots tell us where to look for major changes in the log. The converse of this procedure, looking to see whether blocks in the plots of the NCD matrices reflect major changes in the logs, was not performed in this experiment. Using only the logs to detect where a ‘major’ change in the ‘information’ content (see Section 3.1) of the programs has occurred is not straightforward and will require additional tool support. See also Section 4.5 for possible threats to interpretation.

5.2. git

The project git [24] is a source code management system, originally written by Linus Torvalds. Its current maintainer is Junio Hamano. One novel aspect of its operation is that it employs SHA1 digests of binary objects as their identifiers. The system is widely deployed and controls the code for many projects including that for the Linux kernel and the X.org version of the X window system (currently X11R7.5).

The system is still being vigorously developed. At the time of writing, there are 178 consecutive releases (0.01–1.6.5.2) that include test and preview releases. The releases are ordered by release number for the experiments. git is coded in a mixture of C, perl and Bourne shell (sh). In this exploratory experiment, we examine only changes in the C components. However, we could also have made an investigation of the co-evolution of the parts written in different languages as an interesting additional study.

Fig. 3. Plots and graphs of NCD values for udev. 2D ppmd plot from [22], © [2009] ACM Inc.

5.2.1. Method

The experimental approach in this case is similar to that for udev. For each git release, all .c and .h files are located (using Python: os.path.walk) and concatenated together. These files, without any further processing, are the artefacts used for the experiment. We apply CompLearn’s implementation of NCD with the compressors ppmd and lzma to generate distance matrices which are then plotted.

5.2.2. Results

The compressors ppmd and lzma were employed to calculate the NCD distance matrices plotted in Fig. 4. Graphs of the ‘individual’ and ‘cumulative’ similarities for the two compressors are also shown. The compressors zlib and bzlib (not shown) produced line plots with dots for identical releases similar to the central orange contour for the ppmd result.

5.2.3. Analysis

The NCD measurements with the ppmd compressor reveal that since release 0.7.0, git has been undergoing continuous

Fig. 4. Plots and graphs of NCD values for compressors ppmd and lzma for project git. [Panels: 2D NCD plots labelled ‘compressor [ppmd]’ and ‘compressor [lzmax]’, with axes running over releases 0.01 to 1.6.5.2; line graphs labelled ‘Versions upto 178 [ppmd]’ and ‘Versions upto 178 [lzmax]’, on a scale from 0.0 to 1.0.]

Table 2

git: reasons for very similar releases.

| Release | Reason for similarity | Release | Reason for similarity |
|---|---|---|---|
| 1.1.1 | No change from 1.1.0 | 1.5.3.4 | Small fix |
| 1.2.1 | Small change from 1.2.0 | 1.5.4.3 | Small fix |
| 1.5.0.5 | Related releases | 1.5.5.4 | Fix a segfault |
| 1.5.1.7 | Related releases | 1.6.1 | No C code change from rc4 |
| 1.5.3.1 | Packaging change | | |

short sequence of changes around version 1.4.0 was sufficient to make subsequent versions dissimilar to previous versions. Other than in this region, there is little sign of release cycle blocks. The small dots along the main diagonal of the results for the ppmd compressor distinguish nine nearly identical releases beginning at release 1.1.0.

The results for the lzma compressor provide a more detailed picture. The initial block before release 0.7.0 can be clearly seen. The block between 0.7.0 and 1.4.0 is well delineated. What is surprising about the remaining results is the large degree of continuing similarity detected. There are release blocks but the degree of change is much lower than in the case of udev. Even release 1.6.5.2 can be seen to have some degree of similarity with release 0.7.0.

The graphs of ‘individual’ and ‘cumulative’ similarity reinforce the results of the 2D plots. For compressor ppmd, the near similar releases are detected as zeros in the graph as before. Again, the ‘cumulative’ similarity provides little information. This time, however, the peaks in the ‘individual’ graph for compressor ppmd are too small to be easily distinguished against a background value of around 0.8. The ‘individual’ graph for the lzma compressor shows these peaks clearly. They correspond to the corners of the similarity blocks in the 2D plot for lzma. Comparing these results with those for udev, we can already see that they have a different structure. Future studies will attempt to classify development patterns and relate them to changes in the shared information content in projects of different types.

We verified that there were only minimal changes at the nine points most easily seen as dots on the main diagonal in the results for ppmd (Table 2). A factor that we consider for further verification is that we have only analysed the C component

Fig. 5. Alternative studies of ArgoUML by Canfora et al. [113] and Zaidman et al. [114]. [Panels: (a) additions, deletions and changes plotted against snapshots; (b) production code, test code, production classes, test classes and test commands plotted against releases v0.9 to v0.20a4.]

the script code separately and then co-evolving with the C code might reveal more about how the developers are able to make strides by quickly modifying script code sitting on a bedrock of more slowly changing C. We hope to report on further studies on git in a future publication.

6. Comparative analysis for ArgoUML

6.1. ArgoUML

ArgoUML is a project for the construction of a tool for the creation of UML diagrams. The project, created and still owned by Jason Robbins [25] and currently led by Linus Tolke, has a documented history of more than a decade. ArgoUML, in contrast with the previous two projects, is written in Java. It was chosen because many researchers have studied it in the past and we want to have a basis for comparison of results.

6.2. Previous studies

The evolution of the project ArgoUML has previously been studied by other researchers employing different methods. Canfora et al. [113] employ the Normalised Levenshtein Distance – an edit distance – to examine the evolution of earlier releases of ArgoUML. Zaidman et al. [114], on the other hand, employed ArgoUML as a case study in their investigation of the co-evolution of production and test code where the metrics employed were lines of code and counts of classes and commands.

6.2.1. Canfora et al.

An alternative evaluation of the evolution of ArgoUML is provided by Canfora et al. [113]. The authors were concerned that it was not possible to distinguish between modifications of code lines versus additions and subtractions within the code repository since a version control system, such as CVS or subversion, views each commit as a sequence of deletions and additions. Canfora et al. demonstrated how an estimate of source code evolution at the level of individual lines could be garnered from CVS repositories by employing information retrieval techniques, including vector space models, in combination with a normalised edit distance metric, the Normalised Levenshtein Distance. The paper [113] details how their results are obtained.

Canfora et al. observed the HEAD development trunk of the CVS repository directly and extracted 5525 snapshots corresponding to the period between release 0.9 and release 0.20.a4. They state that on average there are approximately one hundred snapshots between each major release and that, over the 58 releases that this period represents, the number of Java classes grew from 446 to 1538 and the number of lines of code grew from approximately 45 thousand to around 200 thousand. Canfora et al.’s results can be seen in Fig. 5(a). The plot lines correspond to additions, deletions and changes (Y-axis) for each of the 5525 snapshots (X-axis).

6.2.2. Zaidman et al.

In their 2008 paper, Zaidman et al. [114] studied the co-evolution of test and production code. Although the focus of

Table 3

Detected correspondences: subversion commits, snapshots, labelled releases.

| SVN | Snap | Release | SVN | Snap | Release | SVN | Snap | Release | SVN | Snap | Release |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 361 | 32 | 0.8.1 | 1716 | 1025 | 0.11.4 | 4141 | 3039 | 0.15.3 | 6073 | 4685 | 0.18.a3 |
| 405 | 67 | 0.9.0 | 1743 | 1084 | 0.12.0 | 4195 | 3076 | 0.15.4 | 6102 | 4707 | 0.18.b1 |
| 457 | 107 | 0.9.1 | 2050 | 1388 | 0.13.1 | 4311 | 3165 | 0.15.5 | 6116 | 4712 | 0.18.b2 |
| 488 | 133 | 0.9.2 | 2487 | 1766 | 0.13.2 | 4382 | 3220 | 0.15.6 | 6146 | 4729 | 0.18.1 |
| 495 | 139 | 0.9.3 | 2617 | 1861 | 0.13.3 | 4434 | 3263 | 0.16.a1 | 6223 | 4785 | 0.19.1 |
| 521 | 161 | 0.9.4 | 2750 | 1969 | 0.13.4 | 4490 | 3308 | 0.16.a2 | 6634 | 5121 | 0.19.3 |
| 884 | 423 | 0.9.6 | 2941 | 2133 | 0.13.5 | 4502 | 3315 | 0.16.b1 | 6799 | 5259 | 0.19.4 |
| 927 | 457 | 0.9.7 | 3059 | 2229 | 0.13.6 | 4956 | 3691 | 0.16.1 | 6898 | 5339 | 0.19.5 |
| 977 | 486 | 0.9.8 | 3172 | 2303 | 0.14.a1 | 5066 | 3784 | 0.17.1 | 6920 | 5352 | 0.19.6 |
| 1035 | 533 | 0.9.9 | 3240 | 2322 | 0.14.a2 | 5325 | 4018 | 0.17.2 | 7158 | 5562 | 0.19.7 |
| 1045 | 538 | 0.10.0 | 3277 | 2345 | 0.14.a3 | 5617 | 4284 | 0.17.3 | 7281 | 5667 | 0.19.8 |
| 1144 | 593 | 0.10.1 | 3288 | 2349 | 0.14.a4 | 5668 | 4319 | 0.17.4 | 7340 | 5714 | 0.20.a1 |
| 1327 | 723 | 0.11.1 | 3332 | 2383 | 0.14.b1 | 5960 | 4593 | 0.17.5 | 7386 | 5752 | 0.20.a2 |
| 1421 | 786 | 0.11.2 | 3559 | 2565 | 0.15.1 | 6007 | 4629 | 0.18.a1 | 7402 | 5763 | 0.20.a3 |
| 1646 | 964 | 0.11.3 | 3897 | 2852 | 0.15.2 | 6029 | 4645 | 0.18.a2 | 7416 | 5773 | 0.20.a4 |

defined a growth history view of software evolution with the aim of identifying growth patterns in software. The growth history view of ArgoUML between version 0.9 and 0.20.a4 that they derived is reproduced as Fig. 5(b).

This figure is an X–Y plot in which the x-axis, time, is annotated with project releases. The y-axis, on the other hand, shows five software metrics for the project, namely the number of lines of production code (red); the number of lines of test code (green); the number of classes in production code (blue); the number of classes in test code (magenta); and the number of test commands (cyan). Each metric is presented as a cumulative percentage, with the last considered version being taken as 100%.

6.3. Comparable data

In order to use the results obtained by these authors, which are shown in Fig. 5, to validate our results, we first need to ensure that we employ comparable data. A copy of the ArgoUML repository used for both studies and the data values needed to make both plots were obtained from the papers’ authors.

There were two difficulties in synchronising these data sources with our own experiments. Firstly, the subversion repository had been created from a CVS repository using cvs2svn meaning that information about branches became difficult to track. The second difficulty was the use of snapshots by the Canfora team. Although studying the same 7477 commits in the repository, their 5525 snapshots were indexed by seconds since the epoch whereas the repository and the Zaidman data both used release number for indexing (although seconds since the epoch data were also available for each commit). In addition, not every labelled release was present in the Canfora snapshots.

Pairwise comparing the quantities of shared information between either 5525 or 7477 separate releases is not practical in view of the computation involved so instead we decided to study those snapshots and commits which were labelled as releases and were common to both data sets as matched by their release times. The commands svnlook and svn2cvsgraph were helpful in this regard. Table 3 shows the correspondences (commit, snapshot and release) that were found and used in the following experiments. No unambiguous correspondence could be found for releases 0.14.0, 0.14.1, 0.16.0, 0.18.0, and 0.19.2 so they are omitted from the experiments.
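The matching of labelled releases to commits and snapshots by time can be sketched as follows. The paper does not give the exact procedure, so this is only one plausible reading: a release is kept when exactly one commit and exactly one snapshot carry its timestamp, and is otherwise set aside as ambiguous. The function name `match_by_time` and the three dictionary inputs (each mapping an identifier to seconds since the epoch) are hypothetical.

```python
def match_by_time(release_times, commit_times, snapshot_times):
    """Match labelled releases to commits and snapshots sharing a timestamp.

    All three arguments map an identifier to seconds since the epoch.
    Returns (matched, omitted): matched is a sorted list of
    (commit, snapshot, release) triples; omitted lists releases for which
    no unambiguous correspondence exists.
    """
    def index_by_time(times):
        by_time = {}
        for identifier, stamp in times.items():
            by_time.setdefault(stamp, []).append(identifier)
        return by_time

    commits_at = index_by_time(commit_times)
    snapshots_at = index_by_time(snapshot_times)

    matched, omitted = [], []
    for release, stamp in release_times.items():
        commits = commits_at.get(stamp, [])
        snapshots = snapshots_at.get(stamp, [])
        if len(commits) == 1 and len(snapshots) == 1:
            matched.append((commits[0], snapshots[0], release))
        else:
            # Missing or ambiguous on either side: leave the release out.
            omitted.append(release)
    return sorted(matched), omitted
```

Exact equality of timestamps is the strictest possible rule; a tolerance window could be substituted, at the cost of more ambiguous cases to resolve by hand.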

6.4. ArgoUML—experiments

Following the procedure set out in the steps for validation (Section 4.2), we select the concatenation of all ‘.java’ files found in the ‘trunk’ directory. Each release is extracted from the repository and then Python’s os.path.walk with a ‘.*\.java\Z’ regex is used to find and concatenate the files. The compression mechanisms ppmd and lzma are selected and the array of NCD values calculated for pairwise comparisons of these 60 releases. Plots and graphs for the array are created and the results then compared with the data obtained by Zaidman et al. and Canfora et al.

6.4.1. NCD measurements

Results of running the NCD tests on the set of releases identified in Section 6.3 are shown in Fig. 6. Data points are being indexed by subversion commit number. The NCD calculation with the ppmd compressor is no longer able to usefully detect similarity and, in addition, some values of the NCD exceed 1.0. Therefore, we employ only the lzma compressor in the comparisons of the next section.

Blocks representing artefact similarity are unmistakable in the 2D plot and confirmed by the graph of individual values although these blocks appear smaller than in previous plots. Only for the last twenty releases or so, does the 2D plot begin to broaden out. Rather unusually, the cumulative similarity measure does not show the rapid increase seen in the previous two projects. This indicates that the artefact for the first measured release (subversion release 361) retains a greater degree of similarity with subsequent versions than in the previous experiments.

Fig. 6. Plots and graphs of NCD values for compressors ppmd and lzma for project ArgoUML. [Panels: 2D NCD plots labelled ‘compressor [ppmd]’ and ‘compressor [lzmax]’, with axes indexed by subversion commit number from 361 to 7416; line graphs labelled ‘Versions upto 60 [ppmd]’ and ‘Versions upto 60 [lzmax]’, on a scale from 0.0 to 1.0.]

Fig. 7. Graphs of NCD values overlaid on the results of Canfora et al. [113]. [Series: individual, cumulative, additions, deletions, changes; x-axis indexed by subversion commit number; y-axis from 0.0 to 1.0.]

6.4.2. Comparison with results of Canfora et al.

Fig. 7 shows the result of overlaying our results, the ‘individual’ and ‘cumulative’ similarity graphs shown in black and

Fig. 8. Graphs of NCD values overlaid on the results of Zaidman et al. [114]. [Series: individual, cumulative, production code, test code, production classes, test classes, test commands; x-axis indexed by subversion commit number; y-axis from 0.0 to 1.2.]

all values supplied by Canfora et al. have been scaled by the maximum value of the additions measurement so that all values on the plot lie in the range zero to one.

Examining the plot, areas where the ‘individual’ graph is near zero (continuing similarity) remain flat in the Canfora et al. data. Areas where the ‘individual’ graph remains above zero (continuing change), are also reflected in measured change in the guest values. However, although other peaks in the ‘individual’ plot are reflected in the guest data with matching increases, the peaks at 1045 (0.10.0) and 3240 (0.14.a2) do not appear to be reflected in the Canfora et al. graphs. At the current stage, we do not have any firm ideas about reasons for these differences.

6.4.3. Comparison with results of Zaidman et al.

Fig. 8 shows the result of overlaying our results, the ‘individual’ and ‘cumulative’ similarity graphs shown in black and gold, over the (‘guest’) experimental data of Zaidman et al. Again, the spacing of points along the x-axis is identical (and also identical to the spacing for the previous Fig. 7). However, on the y-axis, the percentage values of the guest data have been scaled by dividing by 100. Some of the guest data exceeds 1.0 (as in the original graphs) to reflect the subsequent deletion of code during development.
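The two scalings applied to the guest data before overlaying can be written down directly. In this sketch the function names are hypothetical; the arithmetic follows the description in the text, with the Canfora et al. series divided by the maximum of the additions measurement and the Zaidman et al. percentages divided by 100.

```python
def scale_by_max_additions(additions, deletions, changes):
    """Scale all three Canfora et al. series by the peak additions value."""
    peak = max(additions)
    return tuple(
        [value / peak for value in series]
        for series in (additions, deletions, changes)
    )


def scale_percentages(series):
    """Convert Zaidman et al.'s cumulative percentages to fractions."""
    return [value / 100.0 for value in series]


if __name__ == "__main__":
    # Toy numbers for illustration only; these are not the published data.
    adds, dels, chgs = scale_by_max_additions([2, 4], [1, 0], [4, 2])
    print(adds, dels, chgs)
    print(scale_percentages([50.0, 120.0]))
```

Dividing by a single shared peak keeps the three Canfora et al. series comparable with one another, whereas scaling each series by its own maximum would hide their relative sizes.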

Most striking about this plot is the correspondence between the ‘cumulative’ graph and the guest data values for ‘production code’ and ‘production classes’ at least up to roughly subversion release 3277. Peaks in the ‘individual’ graph are reflected in the guest data as are areas where the ‘individual’ graph is flat. The peak at release 1045 is reflected in a large decrease in the values for ‘production code’ and ‘production classes’. The peak at 3240 remains unexplained. The area near release 4141 indicative of continuous change is reflected only in the ‘production code’ graph.

6.5. Discussion

The three data sets – namely ours, those of Canfora et al. and those of Zaidman et al. – are mutually supportive, although there are a couple of intriguing differences. The ‘individual’ peaks do appear to represent points where the project development introduced major change. Areas with no peaks, corresponding to blocks in the 2D diagram, do appear to be areas where the similarity between subsequent releases is being maintained. Where change is occurring between releases, we have a numerical measurement that reflects the degree of similarity.

As previously stated, the artefact being measured is simply the concatenation of all of the ‘.java’ files in the trunk directory. No information about whether these files are employed or not is being used. Some files included in the concatenation may represent junk code or dead code being kept for reference before removal.

7. Related work

Concerning applications of information theory in software engineering, one of the earliest of which we are aware is that of van Emden [115,116]. Extending themes from Simon and Ando [117] and from Alexander [118], van Emden applied


partitionings. This work was then applied to practical problems by Chanon [119,120]. Henry and Kafura [121] provide a critique on work, such as Chanon’s, that uses entropy loading for the investigation of software structure. It nevertheless continues to appear in the literature. See Chapin [122], Torres and Samadzadeh [123] or LaMantia et al. [124] for examples. The book by Zuse [125] provides a detailed view of the work on complexity metrics before 1990. Khoshgoftaar and Allen’s survey [126] also provides a useful overview of work on information theoretic approaches in software engineering. The paper by Allen et al. [127] is particularly interesting in that they use an alternative approximation of Kolmogorov complexity to show that it is strongly correlated with counting metrics.

The use of metrics for measuring the similarity of software has uses in software clustering. This is discussed in papers by Andritsos and Tzerpos [128] and by Tzerpos and Holt [129]. A further paper by Lutz [130] discusses a genetic algorithm approach to software partitioning which employs a measure of software similarity based on the related minimum description length approach. Harman [131] has also suggested the use of information theoretic metrics in search based software engineering. Clark et al. [132] and McCamant and Ernst [133] present work on information flow with implications particularly for software security.

Recent work on information theoretic distance measures, in addition to the papers in Section 2.4, includes a similarity measure by Cerra and Datcu [134], Speidel’s work on string complexity of short strings [135], Galas et al.’s measure for the quantity of information in sets of objects [136] and Long et al.’s multi-document update summarization [137]. Finally we note the paper by Terwijn et al. which proves that the NID is not semicomputable [138].

8. Conclusions and future work

In this paper, a series of experiments has been carried out to validate the hypothesis that employing information theory to measure the shared information content between software artefacts representative of releases tracks the software’s evolution. The investigation employs a mathematical metric – a normalised metric based on Kolmogorov complexity – rather than ‘software metrics’; employs raw binary ‘information’ rather than semantic context; and thinks of similarity rather than difference when measuring the evolution. This paper therefore makes the following contributions.

1. We have related the use of Kolmogorov complexity to the literature on software measurement, comparison and similarity.

2. We have outlined a procedure for applying the normalised compression distance (NCD) to the measurement of software evolution.

3. We have performed two introductory experiments, applying the NCD to examine the projects udev and git.

4. We have performed a comparative analysis of the project ArgoUML with two other studies from the literature and observed a cohesive picture of the software evolution emerging from all three studies.

The experiments provide substantive support for the paper’s hypothesis while also serving to illustrate practical aspects of the technique.

As a short term goal, the next obvious step for these studies is to carry out further empirical validation with larger scale investigations using a variety of compression mechanisms on a larger variety of software projects. The need for study of co-evolution in git has also been noted. Currently we seek only results that confirm our hypothesis. Looking for cases where evolutionary steps are detected by this approach and missed by conventional approaches (or vice versa) will provide valuable information about both. Although we have hints of the relationship between Kolmogorov complexity and counting type metrics [127], much work remains to be done to relate this measurement to existing software engineering measures. The classification of evolution patterns demonstrated by these studies has also been suggested for future study.

In the longer term, there are numerous possible avenues to explore. Beyond further studies on source code, run-time behaviour [20] and program structure [21], we can also envision comparison of specification documents or even requirements documents. We intuit a connection with the specification methods of Prowell and Poore [139]. Given that we are employing a similarity measure, using it to search for code clones has been suggested as a possible application. The method has further implications for testing and maintenance. Nikora and Munson [140], for example, have suggested that examining structural evolution can be used to predict the number of faults injected into a system. D’Ambros and Lanza [141] have shown ways to visualise this relationship. Eisenbarth et al. [142] have used run-time analysis to define features in the behaviour of software. Greevy and Ducasse [143] also use run-time analysis to add power to their analysis of software evolution. We therefore believe that further studies of dynamic behaviour will be useful. In preliminary (unpublished) studies of the evolution of modules from the Linux kernel we employ the measure for software clustering. We are aware that the studies could potentially have implications for project management which we are keen to pursue.

In closing, measurement of software and its evolution is difficult but worthwhile. In this paper, we take the novel approach of applying measurements based on the information theoretic concept of Kolmogorov complexity – itself a measure of intrinsic information content – to measure software artefacts’ similarity and, thereby, to study software projects’ evolutions. The approach has general applicability, since measuring purely in terms of information allows us to ignore conventionally important aspects of software, such as programming paradigm, language or application domain. Indeed, as a result of its generality, the method has aspirations towards measurement standardisation. We would like to see the extent to which this work, and similarly directed studies from other authors, can go on to have a positive impact on software engineering as a whole.


Acknowledgements

The author would like to thank all the reviewers for their helpful suggestions that significantly improved the final text. Mark Lawford provided practical assistance. Adam Balaban and Dennis K. Peters have previously provided support and insight. Andy Zaidman and Bart van Rompaey provided copies of both the ArgoUML repository and the data necessary to reproduce their diagram in paper [114] and Luigi Cerulo provided the data necessary to reproduce the diagram from paper [113]. Without their help in providing this data, detailed comparisons with the current work would not have been possible. The author would like to thank them and their co-authors for their help in allowing use of their data in this way.

‘‘Measure software – and its evolution – using information content’’ in the proceedings of IWPSE-EVOL 2009 (http://doi.acm.org/10.1145/1595808.1595831) and are reprinted with permission.

Appendix A. Entropy

As mentioned in Section 2.2, there have been several attempts to employ information theory to measure software. Some of the previous approaches were described in the related work Section 7. What most of these approaches share in common is the use of entropy or entropy-related measures (such as entropy loading).

In his inspirational paper of 1948, Shannon [44] defined the entropy, in the discrete case, as

$$H(X) = -\sum_{x \in X} p_x \log p_x. \tag{A.1}$$

Here $p_x$ are probabilities of emission of a symbol $x$ taken from a closed alphabet of symbols $X$. Information flows from a source to a sink over a channel in the model so the emphasis is on transfer of information. The concern is about what different symbols can tell us about the distribution of information in a composite message. The appearance of an unusual message in the stream of symbols has a higher significance because of its scarcity: it conveys more information. The entropy says nothing about the complexity of the symbols themselves.
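As a minimal sketch of (A.1), the following Python function computes the entropy of a sequence of symbols. It rests on two assumptions that are not part of the definition above: the logarithm is taken to base 2 (so the result is in bits), and the probabilities $p_x$ are estimated from the relative frequencies of the symbols in the sequence. The function name `shannon_entropy` is illustrative only.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Entropy in bits of the empirical symbol distribution (cf. (A.1)).

    Assumption: p_x is estimated as the relative frequency of x in `symbols`.
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # an empty message carries no information
    # H(X) = -sum over x of p_x * log2(p_x)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A sequence with a single repeated symbol has zero entropy, while a sequence whose symbols are equally frequent attains the maximum for its alphabet size. Note that the result depends only on the frequencies, not on what the symbols are, which is the point made above about entropy saying nothing of the symbols' own complexity.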

Using entropy to measure software presents the problems of determining the closed alphabet of symbols, their probability distributions and constraining the measurement to conform to the expectations of the model – particularly the random emission of symbols by the source. This can be attempted, but with software construction not yet a matter of ‘normal design’ [144], it is difficult to see what a common alphabet of symbols for software might be.

Kolmogorov complexity provides a complementary measurement which fills the gap left by the Shannon theory. Rather than looking at the significance of symbols, we measure the information content of the symbols themselves. In exchange for no longer needing symbols and their probabilities, however, it is necessary to approximate an uncomputable quantity.
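One common way to approximate this uncomputable quantity is to use the output length of a real compressor as an upper bound, which is the idea behind the normalised compression distance (NCD) applied in the experiments. The sketch below is illustrative only: it assumes `zlib` as a stand-in compressor (not necessarily the one used in the paper's experiments), and the helper names `compressed_size` and `ncd`, as well as the sample data, are hypothetical.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data: a crude upper bound on K(data)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance between two byte strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed size and xy is the concatenation.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical sample artefacts: a repetitive text and pseudo-random noise.
text = b"the quick brown fox jumps over the lazy dog " * 50
noise = bytes(random.Random(0).getrandbits(8) for _ in range(len(text)))

print(ncd(text, text))   # small: the two artefacts share their information
print(ncd(text, noise))  # larger: little information is shared
```

Values near zero indicate that two artefacts share most of their information content; values near one indicate that they share very little. The choice of compressor affects the numbers obtained, so results from different compressors should not be compared directly.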

Appendix B. Approximating the Kolmogorov complexity

Despite their conceptual difference, but also because of their complementary nature, there are close relations between entropy and Kolmogorov complexity. If a particular coding of the information to be measured, the Shannon–Fano coding, can be determined, it is possible to approximate the Kolmogorov complexity in terms of Shannon entropy. This was the approach taken by Allen et al. [127] in their important software engineering paper. Using the Shannon–Fano coding of $n$ items from a set $X$ with members $x$, they explain (ibid., p. 189) that an estimate of the Kolmogorov complexity of $X$ can be written

$$K(X) = nH(x). \tag{B.1}$$

By employing graph representations of programs, fixing $n$ by taking elements from closed sets, Allen et al. were able to use counting arguments to derive $H(X)$ and thence $K(X)$. Their experimental results showed that the Kolmogorov complexity correlates with counting metrics.
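As a small worked illustration of (B.1), suppose (purely for the sake of example, and not as a reproduction of Allen et al.'s graph-based procedure) that $n = 8$ items are drawn from three distinct values occurring with relative frequencies $1/2$, $1/4$ and $1/4$, and that the entropy of (A.1) is evaluated with base-2 logarithms on those frequencies:

$$
\begin{aligned}
H &= -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{4}\log_2\tfrac{1}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4}\right) = \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{1}{2} = 1.5 \text{ bits},\\
nH &= 8 \times 1.5 = 12 \text{ bits}.
\end{aligned}
$$

Under these assumptions the estimate of the Kolmogorov complexity is 12 bits, which is lower than the $8 \log_2 3 \approx 12.7$ bits obtained if the three values were equally likely.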

References

[1] T. Mens, S. Demeyer (Eds.), Software Evolution, Springer, 2008.

[2] K.H. Bennett, V.T. Rajlich, Software maintenance and evolution: a roadmap, in: ICSE’00: Proceedings of the Conference on The Future of Software Engineering, ACM, New York, NY, USA, 2000, pp. 73–87.

[3] T. Mens, M. Wermelinger, S. Ducasse, S. Demeyer, R. Hirschfeld, M. Jazayeri, Challenges in software evolution, in: IWPSE’05: Proceedings of the Eighth International Workshop on Principles of Software Evolution, IEEE Computer Society, 2005, pp. 13–22.

[4] J. Maletic, H. Kagdi, Expressiveness and effectiveness of program comprehension: Thoughts on future research directions, in: Frontiers of Software Maintenance, 2008, pp. 31–37.

[5] M. Godfrey, D. German, The past, present, and future of software evolution, in: Frontiers of Software Maintenance, 2008, pp. 129–138.

[6] H. Kagdi, M.L. Collard, J.I. Maletic, A survey and taxonomy of approaches for mining software repositories in the context of software evolution, J. Softw. Maint. Evol. 19 (2007) 77–131.


[7] T.J. McCabe, A complexity measure, in: ICSE’76: Proceedings of the 2nd International Conference on Software Engineering, 1976, p. 407.

[8] M.H. Halstead, Elements of Software Science, Elsevier Science Inc., 1977.

[9] A. Perlis, F. Sayward, M. Shaw (Eds.), Software Metrics: An Analysis and Evaluation, MIT Press, 1981.

[10] S.R. Chidamber, C.F. Kemerer, A metrics suite for object oriented design, IEEE Trans. Softw. Eng. 20 (1994) 476–493.

[11] M. Shepperd, A critique of cyclomatic complexity as a software metric, Softw. Eng. J. 3 (1988) 30–36.

[12] N. Fenton, When a software measure is not a measure, Softw. Eng. J. 7 (1992) 357–362.

[13] C. Jones, Software metrics: good, bad and missing, Computer 27 (1994) 98–100.