Software Metrics: Roadmap

(1)

By Norman E. Fenton and Martin Neil

(2)

Authors (1/2)

- Norman Fenton is Professor of Computing at Queen Mary (University of London) and is also Chief Executive Officer of Agena, a company that specialises in risk management for critical systems. He is head of RADAR (Risk Assessment and Decision Analysis)

(3)

Authors (2/2)

- Martin Neil is a Reader in "Systems Risk" at the Department of Computer Science, Queen Mary, University of London, where he teaches decision and risk analysis and software engineering. Martin is also a joint founder and Chief Technology Officer of Agena Ltd (UK).

(4)

Plan

- Introduction
- Brief history of software metrics
- Weaknesses of traditional approaches
- Causal models
- Future work
- Comments on the article

(5)

Introduction (1/9)

- The car accidents example
  - “Data on car accidents in both the US and the UK reveal that January and February are the months when the fewest fatalities occur.”

(6)

Introduction (2/9)

- The car accidents example
  - “Thus, if you collect a database of fatalities organised by months and use this to build a regression model, your model would predict that it is safest to drive when weather is coldest and roads are at their most treacherous.”

(7)

Introduction (3/9)

- The car accidents example
  - “Such a conclusion is perfectly sensible given the data available, but intuitively we know it’s wrong.”
  - “The problem is that you do not have all the relevant data to make a sensible decision about the safest time to drive.”
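
The trap described in this example can be reproduced in a few lines of Python. This is only a sketch: the monthly figures are invented for illustration (they are not real accident data), and the variable names are assumptions. It first fits fatalities against temperature, then adds the missing causal variable, the number of journeys made.

```python
# Hypothetical figures, invented for illustration only (not real accident data).
months = ["Jan", "Apr", "Jul", "Oct"]
temperature_c = [2.0, 10.0, 20.0, 11.0]         # average temperature
journeys_millions = [40.0, 70.0, 120.0, 75.0]   # exposure: how much people drive
fatalities = [60, 77, 108, 82]


def least_squares_slope(xs, ys):
    """Slope b of the ordinary least-squares line y = a + b*x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Naive regression: fatalities rise with temperature, so cold months look "safest".
slope = least_squares_slope(temperature_c, fatalities)

# Adding the missing variable: fatalities per million journeys.
risk_per_million_journeys = {
    month: deaths / journeys
    for month, deaths, journeys in zip(months, fatalities, journeys_millions)
}
riskiest_month = max(risk_per_million_journeys, key=risk_per_million_journeys.get)

print(f"slope = {slope:.2f} fatalities per degree C")
print(f"riskiest month per journey: {riskiest_month}")
```

With these made-up numbers the regression slope is positive, yet the risk per journey is highest in January: fewer people drive in bad weather, and that is the information the regression on monthly totals never sees.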

(8)

Introduction (4/9)

- The car accidents example

(9)

Introduction (5/9)

- “So what has this got to do with software metrics? Well, software metrics has been dominated by statistical models, such as regression models, when what is really needed are causal models.”

(10)

Introduction (6/9)

- Software resource estimation
  - “Much software metrics has been driven by the need for resource prediction models.”
  - “Usually this work has involved models of the form”

    effort = f(size)
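
As an illustration of what a model of the form effort = f(size) looks like in practice, here is a minimal sketch that fits a power law, effort = a * size^b, by least squares on log-transformed data. The power-law form is an assumption (the slide only says f), and the "historical projects" are synthetic points generated from a known curve, not real project data.

```python
import math


def fit_power_law(sizes, efforts):
    """Fit effort = a * size**b by least squares on the log-log data."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in efforts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic data: sizes in KLOC, efforts in person-months, generated from
# effort = 3.0 * size**1.1 (hypothetical coefficients chosen for the example).
sizes_kloc = [10.0, 20.0, 40.0, 80.0]
efforts_pm = [3.0 * s ** 1.1 for s in sizes_kloc]

a, b = fit_power_law(sizes_kloc, efforts_pm)
predicted = a * 50.0 ** b  # point estimate for a hypothetical 50 KLOC project

print(f"a = {a:.3f}, b = {b:.3f}, predicted effort = {predicted:.1f} person-months")
```

The fit returns a single point estimate driven by size alone. It says nothing about staff, process, or schedule pressure, and gives no indication of how uncertain the estimate is.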

(11)

Introduction (7/9)

- Problems with effort = f(size)
  - Size cannot cause effort.
  - Such models cannot be used for risk assessment because they lack an explanatory framework.
  - Managers can’t decide how to improve things from the model’s outputs.

(12)

Introduction (8/9)

- Solution: causal modeling
  - Provide an explanatory structure to explain events that can then be quantified.
  - Provide information to support quantitative managerial decision-making during the software lifecycle.
  - Provide support for risk assessment and reduction.

(13)

Introduction (9/9)

- Software resource estimation

(14)

History of metrics (1/13)

- Def.: Software metrics is a collective term used to describe the very wide range of activities concerned with measurement in software engineering.

(15)

History of metrics (2/13)

- These activities range from:
  - producing numbers that characterize properties of software code, to
  - models that help predict software resource requirements and software quality

(16)

History of metrics (3/13)

- Software metrics have been used since the mid-1960s
- At that time, Lines of Code (LOC) was used as a measure of productivity and effort

(17)

History of metrics (4/13)

- Problems using metrics:
  - Theory and practice have been out of step
  - Metrics are often misunderstood, misused, and even reviled
  - Industry is not convinced of the benefits of metrics
  - Metrics programs are used when things

(18)

History of metrics (5/13)

- The two components of software metrics:
  - The component concerned with defining the actual measures
  - The component concerned with how we collect, manage and use the measures

(19)

History of metrics (6/13)

(20)

History of metrics (7/13)

- Rationale for using metrics
  - The desire to assess or predict effort/cost of development processes
  - The desire to assess or predict quality of software products

(21)

History of metrics (8/13)

- “The key in both cases has been the assumption that product ‘size’ should drive any predictive models.”

(22)

History of metrics (9/13)

- LOC/programmer-month as a productivity measure
- Regression-based resource prediction by Putnam and Boehm:

    Effort = f(LOC)

- Program quality measurement (usually defects/KLOC)
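
The two size-based measures on this slide are simple ratios. The sketch below spells them out; the function names and the sample figures are hypothetical and only serve to show the arithmetic.

```python
def productivity_loc_per_month(lines_of_code, programmer_months):
    """Productivity as lines of code delivered per programmer-month."""
    return lines_of_code / programmer_months


def defect_density_per_kloc(defects, lines_of_code):
    """Quality as defects per thousand lines of code (KLOC)."""
    return defects / (lines_of_code / 1000.0)


# Hypothetical project: 12,000 LOC, 24 programmer-months, 30 recorded defects.
print(productivity_loc_per_month(12_000, 24))  # LOC per programmer-month
print(defect_density_per_kloc(30, 12_000))     # defects per KLOC
```

Both ratios have LOC in them, so any weakness of LOC as a size measure carries straight through to the productivity and quality figures.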

(23)

History of metrics (10/13)

- In the mid-1970s, the drawbacks of using LOC as a measure for different notions of program size were recognized.
- LOC cannot be compared between high- and low-level programming languages

(24)

History of metrics (11/13)

- From the mid-1970s: interest in measures of software complexity and functional size (such as function points)
- The rationale for these metrics is still to assess quality and effort/cost

(25)

History of metrics (12/13)

- The study of software metrics has been dominated by defining specific measures and models.
- Much recent work has been concerned with collecting, managing, and using metrics in practice.

(26)

History of metrics (13/13)

- Most notable advances
  - Work on the mechanics of implementing metrics programs
    - Grady and Caswell: first company-wide software metrics program
    - Basili, Rombach: GQM (Goal-Question-Metric)
  - The use of metrics in empirical software engineering
    - Benchmarking and evaluating the effectiveness of software engineering methods, tools and technologies (Basili)

(27)

Weaknesses of traditional approaches (1/11)

- “The approaches to both quality prediction and resource prediction have remained fundamentally unchanged since the early 1980’s.”

(28)

Weaknesses of traditional approaches (2/11)

- These approaches have provided some extremely valuable empirical results, but cannot be used effectively for quantitative management and risk analysis, the primary objective of metrics.

(29)

Weaknesses of traditional approaches (3/11)

- Regression-based model for quality prediction:

    f(complexity metric) = defect density

- Problems
  - Incapable of predicting defects accurately

(30)

Weaknesses of traditional approaches (4/11)

- A further empirical study (Fenton) showed:
  - Size metrics (while correlated with the gross number of defects) are poor indicators of defects
  - Static complexity metrics are not significantly better as predictors
  - Counts of pre-release defects are a very bad indicator of quality
    - The lunch story

(31)

Weaknesses of traditional approaches (5/11)

(32)

Weaknesses of traditional approaches (6/11)

- These results invalidate models:
  - using pre-release faults as a measure of operational quality
  - using complexity metrics to predict which modules will be fault-prone post-release
    - Complexity metrics were judged ‘valid’ if correlated with pre-release fault density

(33)

Weaknesses of traditional approaches (7/11)

- Empirical phenomenon observed by Adams (1984):
  - “[…] most operational system failures are caused by a small proportion of the latent faults.”
- The fact that fault density (in terms of pre-release faults) was used as a measure of user-perceived software

(34)

Weaknesses of traditional approaches (8/11)

- Explanations of the scatter plot
  - “Most of the modules that had high number of pre-release, low number of post-release faults just happened to be very well tested.”
- A module that is never executed will never reveal latent faults (no matter how many), hence operational usage must be taken into account.
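
The role of operational usage can be made concrete with a deliberately crude sketch. It assumes that every latent fault is triggered independently with the same small probability on each execution; that assumption, the function name, and all the numbers are hypothetical and chosen only to illustrate the point.

```python
def expected_failures(latent_faults, executions, p_trigger):
    """Expected number of fault-triggering events.

    Assumes each latent fault is triggered independently with probability
    p_trigger on every execution of the module (a simplifying assumption).
    """
    return latent_faults * executions * p_trigger


P_TRIGGER = 1e-4  # hypothetical per-fault, per-execution trigger probability

# Hypothetical modules: one full of faults but never run, one clean but busy.
modules = {
    "rarely_used": {"latent_faults": 50, "executions": 0},
    "heavily_used": {"latent_faults": 2, "executions": 10_000},
}

failures = {
    name: expected_failures(m["latent_faults"], m["executions"], P_TRIGGER)
    for name, m in modules.items()
}

for name, value in failures.items():
    print(f"{name}: {value:.1f} expected failures")
```

Under these assumptions the module with 50 latent faults produces no failures at all because it is never executed, while the module with only 2 faults accounts for every observed failure. Fault counts alone therefore say little about operational quality.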

(35)

Weaknesses of traditional approaches (9/11)

- Other problems with regression-based models for resource prediction:
  - Lack causal factors to explain variation
  - Based on limited historical data
  - Resource constraints not modeled
  - Black box models
  - Cannot handle uncertainty
  - Little support for risk assessment and

(36)

Weaknesses of traditional approaches (10/11)

- The classic problem: “Is this system sufficiently reliable to ship?”
- Useful information:
  - Measurement data from testing (such as defects found in various testing phases)
  - Empirical data about the process and resources used
  - Subjective information about the process/resources
  - Very specific and important pieces of evidence (proof of correctness)

(37)

Weaknesses of traditional approaches (11/11)

- In practice, we only possess fragments of such information.
- The question is how to combine such diverse information and then how to use it to help solve a decision problem that involves risk.

(38)

Causal models (1/7)

- We need a model that takes account of the concepts missing from regression-based approaches:
  - Diverse process and product variables
  - Empirical evidence and expert judgement
  - Genuine cause and effect relationships
  - Uncertainty
  - Incomplete information

(39)

Causal models (2/7)

- Def.: A BBN (Bayesian belief network) is a graphical network together with an associated set of probability tables. The nodes represent uncertain variables and the arcs represent the causal/relevance relationships between the variables.
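
To make the definition concrete, here is a minimal sketch of a three-node BBN with inference by brute-force enumeration. The node names, the network structure (defects inserted and testing quality both influence defects found), and every probability in the tables are hypothetical values chosen for illustration; they are not taken from the article, and real tools use far more efficient algorithms.

```python
from itertools import product

# Nodes and their states (hypothetical example network).
STATES = {
    "inserted": ("low", "high"),  # defects inserted during development
    "testing": ("poor", "good"),  # quality of testing
    "found": ("low", "high"),     # defects found before release
}

# Probability tables (all numbers are illustrative assumptions).
P_INSERTED = {"low": 0.5, "high": 0.5}
P_TESTING = {"poor": 0.5, "good": 0.5}
P_FOUND_HIGH = {  # P(found = high | inserted, testing)
    ("low", "poor"): 0.05,
    ("low", "good"): 0.30,
    ("high", "poor"): 0.30,
    ("high", "good"): 0.90,
}


def joint(inserted, testing, found):
    """Joint probability of one full assignment, following the arcs."""
    p_high = P_FOUND_HIGH[(inserted, testing)]
    p_found = p_high if found == "high" else 1.0 - p_high
    return P_INSERTED[inserted] * P_TESTING[testing] * p_found


def posterior(query_var, query_state, evidence):
    """P(query_var = query_state | evidence), summing over unobserved nodes."""
    names = list(STATES)
    numerator = denominator = 0.0
    for values in product(*(STATES[n] for n in names)):
        world = dict(zip(names, values))
        if any(world[k] != v for k, v in evidence.items()):
            continue  # inconsistent with the evidence
        p = joint(**world)
        denominator += p
        if world[query_var] == query_state:
            numerator += p
    return numerator / denominator


# Few defects found, testing quality unknown (missing data is marginalised out).
print(posterior("inserted", "high", {"found": "low"}))
# Same observation, but testing is known to have been poor or good.
print(posterior("inserted", "high", {"found": "low", "testing": "poor"}))
print(posterior("inserted", "high", {"found": "low", "testing": "good"}))
```

With these tables, finding few defects after poor testing leaves a much higher probability that many defects were inserted than finding few defects after good testing. The same query also works when testing quality is simply not observed, which is how such a model copes with incomplete information.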

(40)

Causal models (3/7)

(41)

Causal models (4/7)

- Building and executing realistic BBN models is now possible because of recent algorithms and software tools.
- Practical applications:
  - Medical diagnosis
  - Mechanical failure diagnosis
  - Help wizards in Microsoft Office

(42)

Causal models (5/7)

(43)

Causal models (6/7)

(44)

Causal models (7/7)

- Benefits of using BBNs:
  - Explicit modeling of ignorance and uncertainty
  - Combines diverse types of information
  - Makes assumptions explicit
  - Intuitive graphical format
  - Ability to forecast with missing data
  - Use of ‘what-if?’ analysis
  - Use of subjectively or objectively derived probability distributions
  - Rigorous mathematical semantics
  - Availability of tools like Hugin

(45)