In document CRC.Press.Software.Metrics.pdf (Page 139-143)

A Goal-Based Framework for Software Measurement

3.4 SOFTWARE MEASUREMENT VALIDATION

The very large number of software measures in the literature aims to capture information about a wide range of attributes. Even when you know which entity and attribute you want to assess, there are many measures from which to choose. Finding the best measure for your purpose can be difficult, as candidates measure or predict the same attribute (such as cost, size or complexity) in very different ways. So it is not surprising when managers are confused by measurement: they see different measures for the same thing, and sometimes the implications of one measure lead to a management decision opposite to the implications of another! One of the roots of this confusion is the lack of software measurement validation. That is, we do not always stop to ensure that the measures we use actually capture the attribute information we seek.

The formal framework presented in Chapter 2 and in this chapter leads to a formal approach for validating software measures. The validation approach depends on distinguishing measurement from prediction, as discussed in Chapter 2. That is, we must separate our concerns about two types of measuring:

1. Measures or measurement systems are used to assess an existing entity by numerically characterizing one or more of its attributes.

2. Prediction systems are used to predict some attribute of a future entity, involving a mathematical model with associated prediction procedures.

Informally, we say that a measure is “valid” if it accurately characterizes the attribute it claims to measure. However, a prediction system is “valid” if it makes accurate predictions. So not only are measures different from prediction systems, but the notion of validation is different for each. Thus, to understand why validation is important and how it should be done, we consider measures and prediction systems separately.

3.4.1 Validating Prediction Systems

Validating a prediction system in a given environment is the process of establishing the accuracy of the prediction system by empirical means; that is, by comparing model performance with known data in the given environment.

Thus, validation of prediction systems involves experimentation and hypothesis testing, as we shall see in Chapter 4. Rather than being a mathematical proof, validation involves confirming or refuting a hypothesis.

This type of validation is well accepted by the software engineering community. For example, researchers and practitioners use data sets to validate cost estimation or reliability models. If you want to know whether COCOMO is valid for your type of development project, you can use data that represent that type of project and assess the accuracy of COCOMO in predicting effort and duration.

The degree of accuracy acceptable for validation depends on several things, including the person doing the assessment. We must also consider the difference between deterministic prediction systems (we always get the same output for a given input) and stochastic prediction systems (the output for a given input will vary probabilistically) with respect to a given model.

Some stochastic prediction systems are more stochastic than others. In other words, the error bounds for some systems are wider than in others. Prediction systems for software cost estimation, effort estimation, schedule estimation and reliability are very stochastic, as their margins of error are large. For example, Boehm has stated that under certain circumstances the COCOMO effort prediction system will be accurate to within 20%; that is, the predicted effort will be within 20% of the actual effort value. An acceptance range for a prediction system is a statement of the maximum difference between prediction and actual value. Thus, Boehm specifies 20% as the acceptance range of COCOMO. Some project managers find this range to be too large to be useful for planning, while other project managers find 20% to be acceptable, given the other uncertainties of software development. Where no such range has been specified, you must state in advance what range is acceptable before you use a prediction system.
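To make the idea of an acceptance range concrete, the following minimal sketch checks a set of past predictions against a stated range. It assumes the difference is measured relative to the actual value, as in the 20% statement above. The function names and the effort figures are invented for illustration and do not come from COCOMO or any real project.

```python
def relative_error(predicted, actual):
    """Return |actual - predicted| / actual for a positive actual value."""
    if actual <= 0:
        raise ValueError("actual value must be positive")
    return abs(actual - predicted) / actual


def within_acceptance_range(predicted, actual, acceptance_range=0.20):
    """True if the prediction differs from the actual value by at most the range."""
    return relative_error(predicted, actual) <= acceptance_range


def fraction_within_range(pairs, acceptance_range=0.20):
    """Fraction of (predicted, actual) pairs that fall inside the acceptance range."""
    if not pairs:
        raise ValueError("at least one (predicted, actual) pair is required")
    hits = sum(
        1 for predicted, actual in pairs
        if within_acceptance_range(predicted, actual, acceptance_range)
    )
    return hits / len(pairs)


# Invented (predicted, actual) effort values in person-months, for illustration only.
history = [(100, 110), (50, 80), (200, 190), (30, 31)]
print(fraction_within_range(history))  # 0.75
```

The acceptance range is passed as a parameter so that it can be fixed before the prediction system is used, in line with the advice above.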

EXAMPLE 3.16

Sometimes the validity of a complex prediction system may not be much greater than that of a very simple one. For example, Norbert Fuchs points out that if the weather tomorrow in Austria is always predicted to be the same as today’s weather, then the predictions are accurate 67% of the time. The use of sophisticated computer models increases this accuracy to just 70%!
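The comparison in this example can be sketched as follows: a trivial "same as yesterday" predictor is scored against observations, giving a baseline that any more elaborate prediction system would need to beat. The function names and the weather sequence are invented for illustration; they are not Fuchs's data.

```python
def accuracy(predictions, observations):
    """Fraction of predictions that match the corresponding observation."""
    if len(predictions) != len(observations) or not observations:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(1 for p, o in zip(predictions, observations) if p == o)
    return correct / len(observations)


def persistence_forecast(observations):
    """Predict each day's value to be the previous day's observation."""
    return observations[:-1]


# Invented daily observations, for illustration only.
weather = ["sun", "sun", "rain", "rain", "rain", "sun", "sun"]

baseline = persistence_forecast(weather)             # forecasts for days 2 to 7
baseline_accuracy = accuracy(baseline, weather[1:])  # compare with what happened
print(round(baseline_accuracy, 2))  # 0.67
```

Scoring a sophisticated model with the same `accuracy` function on the same days would show how much validity it adds over the simple baseline.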

In Chapter 10, we present a detailed example of how to validate software reliability prediction systems using empirical data.

3.4.2 Validating Measures

Measures used for assessment are the measures discussed in Chapter 2. We can turn to measurement theory to tell us what validation means in this context:

EXAMPLE 3.17

We want to measure the length of a program in a valid way. Here, “program” is the entity and “length” the attribute. The measure we choose must not contradict any intuitive notions about program length. Specifically, we need both a formal model that describes programs (to enable objectivity and repeatability) and a numerical mapping that preserves our intuitive notions of length in relations that describe the programs. For example, if we concatenate two programs P1 and P2, we get a program whose length is the combined lengths of P1 and P2. Thus, we expect any measure m of length always to satisfy the condition

m(P1; P2) = m(P1) + m(P2)

where P1; P2 denotes the program obtained by concatenating P1 and P2.

If program P1 has a greater length than program P2, then any measure m of length must also satisfy

m(P1) > m(P2)

We can measure program length by counting lines of code (in the carefully defined way we describe in Chapter 8). Since this count preserves these relationships, lines of code is a valid measure of length. We will also describe a more rigorous length measure in Chapter 8, based on a formal model of programs. This form of validation has been applied to measures of coupling and cohesion of object-oriented software (Briand et al. 1998, 1999), and to the measurement of diagnosability and vigilance, which are properties of a software design related to testability (Le Traon et al. 2003; Le Traon et al. 2006).
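As a rough illustration of the additivity condition for concatenated programs, the sketch below counts lines of code and checks that the count of a concatenation equals the sum of the counts. The counting rule used here (non-blank lines) is only an assumption standing in for the careful definition deferred to Chapter 8. The names `lines_of_code` and `concatenate` are hypothetical.

```python
def lines_of_code(program_text):
    """Count non-blank lines: a deliberately simple stand-in for a LOC definition."""
    return sum(1 for line in program_text.splitlines() if line.strip())


def concatenate(p1, p2):
    """Place p2 after p1, separated by a single newline."""
    return p1.rstrip("\n") + "\n" + p2


# Two tiny programs, represented as text.
p1 = "x = 1\ny = 2\n"
p2 = "print(x + y)\n"

combined = concatenate(p1, p2)

# Additivity under concatenation.
assert lines_of_code(combined) == lines_of_code(p1) + lines_of_code(p2)

# The program we perceive as longer receives the larger number.
assert lines_of_code(p1) > lines_of_code(p2)
```

A check on a few examples does not prove the condition in general. It only shows what the condition demands of a candidate length measure.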

Validating a software measure is the process of ensuring that the measure is a proper numerical characterization of the claimed attribute by showing that the representation condition is satisfied.

This type of validation is central to the representational theory of measurement. That is, we want to be sure that the measures we use reflect the behavior of entities in the real world. If we cannot validate the measures, then we cannot be sure that the decisions we make based on those measures will have the effects we expect. In some sense, then, we use validation to make sure that the measures are defined properly and are consistent with the entity’s real-world behavior.

3.4.3 A Mathematical Perspective of Metric Validation

In Chapter 2, we discussed the theory of measurement, explaining that we need not use the term “metric” in our exposition. There is another, more formal, reason for using care with the term. In mathematical analysis, a metric has a very specific meaning: it is a rule used to describe how far apart two points are. More formally, a metric is a function m defined on pairs of objects x and y such that m(x,y) represents the distance between x and y. Such metrics must satisfy certain properties:

m(x,x) = 0 for all x: that is, the distance from a point to itself is 0.

m(x,y) = m(y,x) for all x and y: that is, the distance from x to y is the same as the distance from y to x.

m(x,z) ≤ m(x,y) + m(y,z) for all x, y and z: that is, the distance from x to z is no larger than the distance measured by stopping through an intermediate point.
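The three properties above can be tested mechanically on a finite sample of points. The following sketch does so for any two-argument function. The function names and the sample are illustrative assumptions. Note that passing the check on a sample does not prove that a function is a metric, whereas a single failure does refute it.

```python
from itertools import product


def satisfies_metric_properties(m, points, tolerance=1e-12):
    """Check the three metric properties of m on every combination of sample points."""
    # m(x, x) = 0 for all x
    for x in points:
        if abs(m(x, x)) > tolerance:
            return False
    # m(x, y) = m(y, x) for all x and y
    for x, y in product(points, repeat=2):
        if abs(m(x, y) - m(y, x)) > tolerance:
            return False
    # m(x, z) <= m(x, y) + m(y, z) for all x, y and z
    for x, y, z in product(points, repeat=3):
        if m(x, z) > m(x, y) + m(y, z) + tolerance:
            return False
    return True


def absolute_difference(x, y):
    """Ordinary distance between two numbers."""
    return abs(x - y)


def squared_difference(x, y):
    """Symmetric and zero on the diagonal, but it violates the triangle inequality."""
    return (x - y) ** 2


sample = [0, 1, 2, 5]
print(satisfies_metric_properties(absolute_difference, sample))  # True
print(satisfies_metric_properties(squared_difference, sample))   # False
```

The squared difference fails because, for the points 0, 1 and 2, the direct distance 4 exceeds the sum 1 + 1 obtained by stopping at the intermediate point.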

There are numerous examples where we might be interested in “mathematical” metrics in software:

EXAMPLE 3.18

Fault-tolerant techniques like N-version programming have been proposed for increasing the reliability of safety-critical systems. The approach involves developing N different versions of the critical software components independently. Theoretically, by having each of the N different teams solving the same problem without knowledge of what the other teams are doing, the probability of all the teams, or even of the majority, making the same error is kept small. When the behavior of the different versions differs, a voting procedure accepts the behavior of the majority of the systems. The assumption, then, is that the correct behavior will always be chosen.

However, there may be problems in assuring genuine design independence, so we may be interested in measuring the level of diversity between two designs, algorithms or programs. We can define a metric m, where m(P1,P2) measures the diversity between two programs P1 and P2. In this case, the entities being measured are products. Should we use a similar metric to measure the level of diversity between two methods applied during design, we would be measuring attributes of process entities.
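The text does not say how such a diversity metric would be defined. Purely as an illustration, the sketch below uses the Jaccard distance between the sets of non-blank source lines of two programs. This is a crude, hypothetical stand-in chosen because it satisfies the three metric properties listed earlier. It is not a measure proposed by the authors, and textual difference is at best a weak proxy for design diversity.

```python
def line_set(program_text):
    """Set of distinct non-blank lines, with surrounding whitespace removed."""
    return {line.strip() for line in program_text.splitlines() if line.strip()}


def diversity(p1, p2):
    """Jaccard distance between the line sets of two programs (0 = same, 1 = disjoint)."""
    a, b = line_set(p1), line_set(p2)
    union = a | b
    if not union:
        return 0.0  # two empty programs are treated as identical
    return 1.0 - len(a & b) / len(union)


# Two invented versions of the same tiny computation.
version_1 = "a = read()\nb = a * 2\nprint(b)\n"
version_2 = "a = read()\nb = a + a\nprint(b)\n"

print(diversity(version_1, version_2))  # 0.5
print(diversity(version_1, version_1))  # 0.0
```

The two versions share two of their four distinct lines, so their diversity under this definition is 0.5.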

EXAMPLE 3.19

We would hope that every program satisfies its specification completely, but this is rarely the case. Thus, we can view program correctness as a measure of the extent to which a program satisfies its specification, and define a metric m(S,P) where the entities S (specification) and P (program) are both products. Then m(S,P) indicates the distance between the specification and a program that implements the specification.

To reconcile these mathematically precise metrics with the framework we have proposed, we can consider pairs of entities as a single entity. For example, having produced two programs satisfying the same specification, we consider the pair of programs to be a single product system, itself having a level of diversity. This approach is consistent with a systems view of N-version programming. Where we have implemented N versions of a program, the diversity of the system may be viewed as an indirect measure of the pairwise program diversity.

