WORKING PAPER
ALFRED P. SLOAN SCHOOL OF MANAGEMENT

Productivity Impacts of Software Complexity and Developer Experience

Geoffrey K. Gill
Chris F. Kemerer

MIT Sloan School WP #3107-90
January 1990

MASSACHUSETTS INSTITUTE OF TECHNOLOGY
50 MEMORIAL DRIVE
Research support for the second author from the Center for Information Systems Research and the International Financial Services Research Center is gratefully acknowledged. Helpful comments on the first draft from W. Harrison, D. R. Jeffery, W. Orlikowski, and D. Zweig are gratefully acknowledged.
Productivity Impacts of Software Complexity and Developer Experience

Geoffrey K. Gill
Chris F. Kemerer

MIT Sloan School of Management
Cambridge, MA
Abstract

The high costs of developing and maintaining software have become widely recognized as major obstacles to the continued successful use of information technology. Unfortunately, against this backdrop of high and rising costs stands only a limited amount of research in measuring software development productivity and in understanding the effect of software complexity on variations in productivity. The current research is an in-depth study of a number of software development projects undertaken by an aerospace defense contracting firm in the late 1980s. These data allow for the analysis of the impact of software complexity and of varying levels of staff expertise on both new development and maintenance.

In terms of the main research questions, the data support the notion that more complex systems, as measured by size-adjusted McCabe cyclomatic complexity, require more effort to maintain. For this data set, the Myers and Hansen variations were found to be essentially equivalent to the original McCabe specification. Software engineers with more experience were more productive, but this labor production input factor revealed declining marginal returns to experience. The average net marginal impact of additional experience (minus its higher cost) was positive.

CR Categories and Subject Descriptors: D.2.7 [Software Engineering]: Distribution and Maintenance; D.2.8 [Software Engineering]: Metrics; D.2.9 [Software Engineering]: Management; F.2.3 [Analysis of Algorithms and Problem Complexity]: Tradeoffs among Complexity Measures; K.6.0 [Management of Computing and Information Systems]: General - Economics; K.6.1 [Management of Computing and Information Systems]: Project and People Management; K.6.3 [Management of Computing and Information Systems]: Software Management

General Terms: Management, Measurement, Performance

Additional Key Words and Phrases: Software Productivity, Software Maintenance, Software Complexity

1 INTRODUCTION
The high costs of developing and maintaining software have become widely recognized as major obstacles to the continued successful use of information technology. For example, Barry Boehm has estimated that $140 billion is spent annually worldwide on software [Boehm, 1987]. Therefore, even a small increase in productivity would have significant cost reduction effects. However, despite the practical importance of software productivity, there is only a limited amount of research in measuring software development productivity and in understanding the effect of software complexity on variations in productivity.

The purpose of this paper is to add to the body of research knowledge in the areas of software development productivity and software complexity. Traditionally, software development productivity has been measured as a simple output/input ratio, most typically source lines of code (SLOC) per work-month [Boehm, 1981, Conte, et al., 1986, Jones, 1986]. While this metric has a long research history, and is widely used by practitioners, criticisms of the metric have sparked new research questions. The numerator, SLOC, is one measure of output. However, in the case of software maintenance, it does not take into consideration possible differences in the underlying complexity of the code being maintained. Since researchers such as Fjeldstad and Hamlen have noted that approximately 50% of the effort in maintenance may be in understanding the existing code, neglecting to account for initial code complexity may be a serious omission [Fjeldstad/Hamlen, 1979].

In terms of the denominator, work-months (often called man-months), this too may be only a rough measure of productivity. Numerous lab experiments and field observations have noted the wide variations in productivity across individual systems developers [Chrysler, 1978, Sackman, et al., 1968, Curtis, 1981, Dickey, 1981]. The idea is that who charges those work-months could be as important as how many work-months are charged. Yet, the usage of the aggregated work-months metric ignores this potential source of productivity variations. Another concern with this metric is that it provides no information on how those hours were spent, in terms of which activities of the systems development life cycle were supported. In particular, some previous research [McKeen, 1973] has suggested that greater time spent in the design phase may result in better systems.

The current research is an in-depth study of a number of software development projects undertaken by an aerospace defense contracting firm in the late 1980s. The data collection effort went beyond collecting only aggregate SLOC/work-month data. In terms of the source code, data were collected on the McCabe [1976] complexity metric and the variations proposed by Myers [1977] and Hansen [1978]. For maintenance projects, these data were also collected on the existing system code before the maintenance effort was begun. Code size and complexity were measured on approximately 100K non-comment Pascal and Fortran SLOC in 465 modules of source code. These data allow for the analysis of the impact of complexity on both new development and maintenance. In terms of the work-month information, data were collected by person by week in terms of both the number of hours and the activities upon which those hours were spent. This information was obtained by collecting and processing approximately 1200 contemporaneously-written activity reports. Therefore, analysis of the impact of varying levels of staff expertise and varying styles of project management is possible.

The statistical results are reported in full in Section 4. In terms of the main research questions, the data support the notion that more complex systems, as measured by cyclomatic complexity, require more effort to maintain. The magnitude of this effect is that the productivity of maintaining the program drops an estimated one SLOC/day for every extra path per 1000 lines of existing code. For this data set, the Myers and Hansen variations were found to be essentially equivalent to the original McCabe specification. In terms of productivity, the data support the notion that more time spent in the design phases of the project is related to increased productivity, at a rate of over six SLOC/day for each additional percentage point of design activity. Software engineers with more experience were more productive, but there appeared to be declining marginal returns to experience. The average net marginal impact of additional experience (minus its higher cost) was positive.

This paper is organized as follows. Section 2 describes the main research questions and compares and contrasts the current work with previous research. Section 3 describes the data site and the data collection methodology, and presents the data. Section 4 presents the analysis of the source code, and Section 5 presents the results of the productivity and complexity analysis. Finally, Section 6 offers some conclusions and suggestions for future research.
2 RESEARCH QUESTIONS AND PRIOR RESEARCH
A vast literature exists on software engineering metrics and modeling, and a comprehensive review is clearly beyond the scope of this paper.[1] Rather, in this section the current research questions will be developed within the framework of specific prior research on the following topics: a) software output metrics, b) new software development productivity, and c) software maintenance productivity.

[1] The interested reader is referred to [Conte, et al., 1986] and [Jones, 1986] for broad coverage of these topics.

2.1 Software Metrics
There are many possible ways to measure the productivity of software development. The most widely used metric for size involves counting source lines of code (SLOC) in some manner.[2] The definition of a line of code adopted for this study is that proposed by Conte, Dunsmore and Shen [1986, p. 35]:

    A line of code is any line of program text that is not a comment or blank line, regardless of the number of statements or fragments of statements on the line. This specifically includes all lines containing program headers, declarations, and executable and non-executable statements.

This set of rules has several major benefits. It is very simple to calculate and it is very easy to translate for any computer language. Because of its simplicity, it has become the most popular SLOC metric [Conte, et al. 1986, p. 35]. Through consistent use of this metric, results become comparable across different studies. One problem with almost all the counting rules for SLOC is that they ignore comments. While comments are not executable, they do provide a useful function and are therefore properly regarded as part of the output of the system development process. For this reason, comment lines were separately counted in the current research. This issue is discussed more thoroughly in the methodology section.

[2] As Boehm [1987] has described, this method has many deficiencies. In particular, SLOC are difficult to define consistently, as Jones [1986] has identified no fewer than eleven different algorithms for counting code. Also, SLOC do not necessarily measure the "value" of the code (in terms of functionality and quality). However, Boehm [1987] also points out that no other metric has a clear advantage over SLOC. Furthermore, SLOC are easy to measure, conceptually familiar to software developers, and are used in most productivity databases and cost estimation models. For these reasons, it was the metric adopted by this research.

The most widely accepted metric for complexity is that proposed by McCabe [1976].[3] He proposed that a measure of the complexity is the number of possible paths through which the software could execute. Since the number of paths in a program with a backward branch is infinite, he proposed that a reasonable measure would be the number of independent paths. After defining the program graph for a given program, the complexity calculation would be:

    V(G) = e - n + 2

where V(G) = the cyclomatic complexity, e = the number of edges, and n = the number of nodes. Analytically, V(G) is the number of predicates plus 1 for each connected single-entry single-exit graph or subgraph.

[3] Another option would have been to use one of the Halstead Software Science metrics in place of the McCabe metric [Halstead, 1977]. However, previous research has shown no advantage for these metrics over the McCabe metric [Curtis, et al., 1979; Lind and Vairavan, 1989], and more recent work has been critical of Software Science metrics [Shen, Conte, and Dunsmore 1983; Card and Agresti, 1987]. For these reasons, the McCabe metric was used.
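To make the formula concrete, here is a minimal sketch (ours, not the study's tooling; the study used purpose-built metrics programs for Pascal and FORTRAN) that computes V(G) from an explicit program graph:

    def cyclomatic_complexity(nodes, edges, components=1):
        # V(G) = e - n + 2p, where p is the number of connected
        # components (p = 1 for a single routine, as in the formula above)
        return len(edges) - len(nodes) + 2 * components

    # Program graph of a routine containing a single IF (a diamond):
    # entry -> test -> (then | else) -> exit
    nodes = ["entry", "test", "then", "else", "exit"]
    edges = [("entry", "test"), ("test", "then"), ("test", "else"),
             ("then", "exit"), ("else", "exit")]
    print(cyclomatic_complexity(nodes, edges))   # 2: one predicate plus 1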
Hansen [1978] gave four simple rules for calculating McCabe's complexity:

1. Increment one for every IF, CASE or other alternate execution construct.
2. Increment one for every iterative DO, DO-WHILE or other repetitive construct.
3. Add two less than the number of logical alternatives in a CASE.
4. Add one for each logical operator (AND, OR) in an IF.

Myers [1977] demonstrated what he believed to be a flaw in McCabe's metric. Myers pointed out that an IF statement with logical operators was in some sense less complicated than two IF statements with one imbedded in the other, because there is the possibility of an extra ELSE clause in the second case. For example:

    IF (A AND B) THEN
        ...
    ELSE
        ...

is less complex than:

    IF (A) THEN
        IF (B) THEN
            ...
        ELSE
            ...
    ELSE
        ...

Myers [1977] notes that, calculating V(G) by individual conditions (as McCabe suggests), V(G) is identical for these two cases. To remedy this problem, Myers suggested using a tuple of (CYCMID, CYCMAX), where CYCMAX is McCabe's measure, V(G) (all four rules are used), and CYCMID uses only the first three. Using this complexity interval, Myers was able to resolve the anomalies in the complexity measure.
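The distinction can be seen with a small counting sketch (our illustration, not the study's tool). It applies rules 1, 2 and 4 by keyword counting; rule 3 is omitted (CASE alternatives require real parsing), and logical operators are counted anywhere rather than only inside IF conditions, both of which are simplifications:

    import re

    def cyclomatic_metrics(source):
        # Return (CYCMIN, CYCMID, CYCMAX) for a toy Pascal-like fragment.
        tokens = re.findall(r"[A-Z-]+", source.upper())
        alternation = sum(t in ("IF", "CASE") for t in tokens)             # rule 1
        repetition = sum(t in ("WHILE", "REPEAT", "FOR") for t in tokens)  # rule 2
        logical_ops = sum(t in ("AND", "OR") for t in tokens)              # rule 4
        cycmin = 1 + alternation + repetition   # Hansen's CYCMIN: rules 1-2 only
        cycmid = cycmin                         # plus rule 3, omitted in this sketch
        cycmax = cycmid + logical_ops           # all four rules: McCabe's V(G)
        return cycmin, cycmid, cycmax

    print(cyclomatic_metrics("IF (A AND B) THEN x ELSE y"))        # (2, 2, 3)
    print(cyclomatic_metrics("IF (A) THEN IF (B) THEN x ELSE y"))  # (3, 3, 3)

CYCMAX agrees for the two fragments, which is exactly Myers's observation, while CYCMID separates them.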
Hansen [1978] raised a conceptual objection to Myers's improvement. Hansen felt that it did not provide enough information to warrant the intrinsic difficulties of using a tuple. Instead he proposed that what was really being measured by the second half of the tuple (CYCMAX) was really the complexity of the expression, not the number of paths through the code. Furthermore, he felt that rule #3 above also was just a measure of the complexity of the expressions. To provide a better metric, he suggested that a tuple consisting of (CYCMIN, operation count) be used, where CYCMIN is calculated using just the first two rules for counting McCabe's complexity.

All of the above tuple measures have a problem in that, while providing an ordering of the complexity of programs, they do not provide a single number that can be used for quantitative comparisons. It is also an open research question whether choosing one or the other of these metrics makes any practical difference. While a number of authors have attempted validation of the CYCMAX metric (e.g. Li and Cheung, 1987; Lind and Vairavan, 1989), there does not appear to be any empirical validation of the practical significance of the Myers or Hansen variations. Therefore, one question investigated in this research is a validation of the CYCMIN, CYCMID, and CYCMAX complexity metrics.

2.2 Productivity of New Software Development
While the issue of software development productivity is of practical significance and has generated a considerable amount of research attention, what has often been lacking is detailed empirical data [Boehm, 1987]. Past research can generally be categorized into two broad areas. The first are termed "black box" or "macro" studies, which collect data on gross inputs, gross outputs and limited descriptive data in order to build software production models. Examples of this type of work would include the classic work of Walston and Felix [1977] and Boehm [1981] and more recent work by Jeffery and Lawrence [1985] and Banker, et al. [1987]. A second type of research is so-called "glass box" or "micro" models, in which the typical methodology is a controlled experiment involving software developers doing a small task, where data are collected at a more detailed activity level, rather than the gross level of the "black box" models. Examples of this type of "process" research include Boehm, et al. [1984], and Harel and McLean [1984].

A possible criticism of the first type of research is that by examining high level data, many potentially interesting research questions cannot be answered. For example, one concern is how best to allocate staffing over the phases of the system development lifecycle. Data collected only at the project summary level cannot be used to address this question. A criticism of the second type of research is that, in order to meet the demands of experimental design and their practical constraints, the researchers are forced either to use small samples or to abstract from reality (e.g. by using undergraduate students as subjects) to such an extent as to bring the external validity of their results into question [Conte, et al. 1986].

The current research was designed to mitigate both of these potential shortcomings. The approach taken is of the "macro" production modelling type, but data were collected at an in-depth, detailed level. Input (effort) data were collected by person by week and mapped to contemporaneously collected project phase/task data. Output data included not only SLOC, but McCabe, Myers, and Hansen complexity measures as well. Therefore these data allow the research to address the following detailed research questions: What are the productivity impacts of differing phase distributions of effort? Specifically, does a greater percent of time spent in the design phase improve overall productivity? How much does the productivity improve with personnel experience? Are there declining marginal returns to additional experience, and if so, what is their magnitude? These questions can be answered on the basis of empirical data from real world sites, and therefore this research avoids some of the potential external validity concerns of the classic experimental approach.[4]

[4] Of course, the field study methodology is unable to provide many of the tight controls provided by the experimental approach and therefore requires its own set of caveats. Some of these are discussed below.

2.3 Software Maintenance Productivity

The definition of software maintenance adopted by this research is that of Parikh and Zvegintzov [1983], namely "work done on a software system after it becomes operational." Lientz and Swanson [1980, pp. 4-5] include three tasks in their definition of maintenance: "...(1) corrective - dealing with failures in processing, performance, or implementation; (2) adaptive - responding to anticipated change in the data or processing environments; (3) perfective - enhancing processing efficiency, performance, or system maintainability." The projects described below as maintenance work pertain to all three of the Lientz and Swanson types.

Software maintenance is an area of increasing practical significance. It is estimated that from 50-80% of information systems resources are spent on software maintenance [Elshoff, 1976]. However, according to Lientz and Swanson [1980], relatively little research has been done in this area. Macro-type studies have included Vessey and Weber [1983] and Banker, et al. [1987]. Micro-level studies have included Rombach [1987] and Gibson and Senn [1989].

One area that is of considerable interest for research is the effect of existing system complexity on maintenance productivity. It is generally assumed a) that more complex systems are harder to maintain, and b) that systems suffer from "entropy", and therefore become more chaotic and complex as time goes on [Belady and Lehman, 1976]. These assumptions raise a number of research questions, in particular: when does software become sufficiently costly to maintain that it is more economic to replace (rewrite) it than to repair (maintain) it? This is a question of some practical importance, as one type of tool in the Computer Aided Software Engineering (CASE) marketplace is the so-called "restructurer", designed to reduce existing system complexity by automatically modifying the existing source code [Parikh, 1986; US GSA, 1987]. Knowledge that a metric such as CYCMAX accurately reflects the difficulty in maintaining a set of source code would allow management to make rational choices in the repair/replace decision, as well as aid in evaluating these CASE tools.
3 DATA COLLECTION AND ANALYSIS

3.1 Data Overview and Background of the Datasite

A prerequisite to addressing the research questions in this study was to collect detailed data about a number of completed software projects, including data on individual personnel. All the projects used in this study were undertaken in the period from 1984-1989, with most of the projects in the last three years of that period. All the data originated in one organization, a defense industry contractor hereafter referred to as Alpha Company. This approach meant that because all the projects were undertaken by one organization, neither inter-organizational differences nor industry differences should confound the results. A possible drawback is that this concentration on one industry and one company may somewhat limit the applicability of the results. However, two factors mitigate this drawback. First, because the concepts studied here are widely applicable across applications, there is no a priori reason to believe that there will be significant differences in the trends across industries. Second, the defense industry is a very important industry for software,[5] and the results will clearly be of interest to researchers in that field.

The projects at this site were small custom defense applications. The company considers the data contained in this study proprietary and therefore requested that its name not be used and that any identifying data that could directly connect it with this study be withheld. The numeric data, however, have not been altered in any way. The company employed an average of about 30 professionals in the software department. The company produced state-of-the-art custom systems primarily for real-time defense-related applications.

At the start of this study, 21 potential projects were identified for possible study. Two of the projects were, however, eliminated as being non-representative of this sample. The first project was a laboratory experiment in which the major task of the chief software engineer was mathematical analysis. While a small amount of software was written for this project, there was no way to separate the analysis effort from the software development effort. The other project was nearly an order of magnitude larger than the next largest project and was believed to be unrepresentative of the Alpha population in its technical design. This project was the only project to employ an array processor, and a substantial fraction of the software developed on the project was micro-code, also non-representative of the Alpha environment. The data set therefore consists of the remaining 19 software projects. These projects represented 20,134 hours (or roughly 11.2 work-years) of software engineering effort, and were selected because of the availability of detailed information provided by the engineers' weekly activity reports.

Table 3.1.1 is a summary of some of the gross statistics on the projects. NCSLOC is the number of non-comment source lines of code; CSLOC includes comments; DURATION is the number of weeks over which the project was undertaken; TOTALHRS is the total number of labor hours expended by software department personnel; NCPROD is the number of non-comment lines of code divided by the total number of labor hours.

[Table 3.1.1: gross project statistics (median, mean, standard deviation)]

[5] Fortune magazine reported in its September 25, 1989 issue that the Pentagon spends $30 billion annually on software.
3.2 Data Sources

Five sources of data were used in this research:

1. Source Code - The source code generated for each project.
2. Labor Hours Database - The hours charged to each project, by person and week.
3. Weekly Activity Reports - Contemporaneously-written descriptions of each engineer's work on each project.
4. Project Data Collection Forms - A questionnaire describing the project.
5. Engineer Profile Data Collection Forms - A survey on the education and experience of every engineer who worked on one of the projects.
The first data source was the source code generated in each project. Of the code developed for the projects, 74.4% was written in Pascal, and the other 25.6% in a variety of other third generation languages. Following Boehm [1981], only deliverable code was included in this study, with deliverable defined as any code which was placed under a configuration management system. This definition eliminated any purely temporary code, but retained code which might not have been part of the delivered system, but was considered valuable enough to be saved and controlled. (Such code included programs for generating test data, procedures to recompile the entire system, utility routines, etc.)

The number of hours worked on the project and the person charging the project were obtained from the database. Because the hourly charges were used to bill the project customer and were subject to United States Government Department of Defense (DOD) audit, the engineers and managers used extreme care to insure proper allocation of charges. For this reason, these data are believed to be accurate representations of the actual effort expended on each project.

At Alpha Company, software engineers wrote a summary, called an activity report, of the work they had done on each project every week. The standard length of these descriptions was one or two pages to describe all the projects. If an engineer worked on only one project, the description of the work for that project would occupy the entire page. If several projects were charged, the lengths of the descriptions tended to be roughly proportional to the amount of time worked, with the entire report fitting on a page. An example of an activity report appears in Appendix A.

Project data collection forms developed for this research provided background data on each project. The surveys requested descriptions of the project, qualitative rankings of various characteristics of the development process (number of interruptions, number of changes in specifications, etc.), and ratings of the quality of the software developed. Unlike the data above, these data are subjective, and results based on them are interpreted cautiously. It is useful, however, to have some measure of these qualitative variables in order to obtain some feel for how the results may be affected. A copy of the survey form used to collect these data is included in Appendix B.

The engineer profile forms obtained data on each engineer's educational background, work experience and Alpha company level. If the employee in question was still with the company (24 out of the 34 engineers in the sample), he or she was asked to fill the form out personally. If the employee no longer worked for the company, personnel records were consulted. A copy of the engineer profile survey is included in Appendix C.

3.3 Categorization of Development Activities

Although unusual for this type of research, this study had access to the week-by-week description of what work was done by each engineer on each project. These descriptions were filled out by the individual engineers every week to inform their managers of the content of their work and also to provide documentation of the work charges in case of a DOD audit. Because these documents were written contemporaneously by everyone who was working on a project, they are believed to present an accurate picture of what happened during the software development process. Approximately 1200 reports were examined for this research.

These activity reports provided a large descriptive database; however, in order to test the hypotheses relating to productivity, a key problem was to determine a way to quantify these data. The method used was to categorize the work done each week by each engineer into one of nine standard systems development lifecycle phases employed by Alpha. One advantage of this approach is that these categories corresponded quite closely to the "local vocabulary", i.e., the engineers at Alpha Company tended to organize their work in terms of the above categories. This connection proved very helpful when the activity reports were read and the work categorized. The nine categories that were used are described in Appendix D.

To categorize a project's activity reports, the researcher selected an engineer and then read through all the engineer's activity reports sequentially. The work done on the project each week was categorized into one or more of the project phases. For the large majority of reports only one phase was charged per report; however, for a fraction of the activity reports the effort was split into two phases.[6] To obtain consistency across projects and people, and because access to the data was limited to on-site inspection, only one researcher was used to do the categorization task for all the reports. Because the researcher read the activity reports in a sequential manner, he was able to follow the logic of the work, as well as getting the "snapshot" that any one report would represent. Finally, in two cases where the reports were unclear, the engineers were shown their own reports and asked to clarify what was actually done.[7]

In some cases, all of the activity reports were not available, typically because the report had been forgotten or lost. In addition, the reports had only been kept for the past three years, so that they were generally not available prior to that time. For four of the nineteen projects studied, 25% or more of these data were missing, and no attempt was made to reduce the activity reports for those projects. To obtain the best breakdown of phases for the unassigned time on the remaining 15 projects, such time was categorized in proportion to the categorization of the rest of the project. The median amount of time categorized as unassigned was 9.9% (see Table 3.3.1).

[6] On a few reports (less than 5%) a maximum of three phases were charged.
[7] Of course, a possible limitation of this single rater approach is an inability to check for possible rater bias.
[Table 3.3.1: total hours and percent of hours unassigned, by project]
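The proportional treatment of unassigned time amounts to a simple scaling rule; the following sketch is our reconstruction of that step, not code from the study:

    def reallocate_unassigned(phase_hours, unassigned_hours):
        # Spread unassigned hours across phases in proportion to
        # the hours already categorized for the project.
        categorized = sum(phase_hours.values())
        return {phase: hours * (1 + unassigned_hours / categorized)
                for phase, hours in phase_hours.items()}

    print(reallocate_unassigned({"design": 60.0, "coding": 30.0, "testing": 10.0}, 10.0))
    # {'design': 66.0, 'coding': 33.0, 'testing': 11.0}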
3.4 Personnel Levels

Alpha Company classified its technical personnel into five levels:

Scientists:
  Level 1: Principal Scientist. Requires peer review to be promoted to this level.
  Level 2: Staff Scientist. Requires peer review to be promoted to this level.
Engineers:
  Level 3: Senior Engineer. Entry level for Ph.D.'s. No peer review required.
  Level 4: Engineer. Entry level for Master's degree engineers. No peer review required.
  Level 5: Junior Engineer. Entry level for bachelor's degree engineers.

A sixth level was defined to include support staff (secretary, etc.), part-time college students in computer science, and other non-full-time professionals. Because many of the projects studied had rather few people, six levels of detail were not required. For this reason, the aggregate levels of scientist (levels 1 and 2), engineer (levels 3, 4, and 5) and support (other) were used. This simple division is also used at Alpha, in that to be promoted to scientist level, a peer review is required. Non-engineering staff were kept in their own group (support staff) because they were only para-professional staff.[8] Table 3.4.1 shows the number of people in each of the levels with each of their highest educational degree levels. All the engineers in this department had technical degrees (most in computer science), so no attempt was made to separate the different fields.

                < Bachelors   Bachelors   Graduate   Total
    Scientist        -             2          8        10
    Engineer         1            14          5        20
    Support          4             -          -         4

Table 3.4.1: Educational Degrees Categorized by Level

It is clear that the company levels are related to the highest level of education reached. The level is also closely related to the number of years of experience, as shown in Table 3.4.2.

                Average Years of       Average Years of
                Company Experience     Total Industry Experience
    Scientist         15.6                      20.0
    Engineer           5.1                       5.8

Table 3.4.2: Average Years of Experience for Company Levels

Use of these levels had many advantages. The first was that it provided a reasonable way to combine degree and length of work experience into a single measure, as these levels are good indicators of both experience and educational level. The engineer's compensation is also highly dependent on his or her company level, and therefore project cost is directly related to the use of staff of different levels.[9] Finally, use of the "level" concept was the common practice at Alpha Company, and therefore provided a natural way to discuss the research with the participants.

[8] Only 5.4% of the total labor hours for the projects in the dataset were charged by these para-professionals.
[9] There is little variation in the salaries of engineers at a given level. Because proposals for projects are costed using a standard cost for each level, but billed using actual costs, large deviations tend to cause client dissatisfaction and are avoided by Alpha Company.

3.5 Statistics on Source Code
One of the advantages of this data site to the research was that the actual source code was available; thus all statistics of interest about the code could be calculated. In addition to NCSLOC (non-comment, non-blank lines of source code), a second statistic called CSLOC was also collected. CSLOC is NCSLOC plus the number of lines of comment (still excluding blank lines).[10] A potential problem with all SLOC statistics is that they do not take into account the complexity of the code. As discussed earlier, the most popular measure of complexity was developed by McCabe, and both Myers and Hansen have offered extensions to McCabe's measure. All of these complexity measures were collected at the module level in order to provide empirical evidence for the relationship among the three measures.

The code studied was contained in a total of 465 modules, of which 435 were written in Pascal and 30 in FORTRAN. Collecting these data required metrics programs for each language. Complexity results were collected for the vast majority of code, which was in Pascal or FORTRAN. No complexity statistics were generated for two projects (one written in C, the other in Ada). Table 3.5.1 shows the percent of code of each project for which the cyclomatic complexity metrics were generated. They were generated for 17 of 19 projects and, of those, 100% of the code was analyzed for 13 of the 17. In the remaining four projects, an average of 98.3% of the code was analyzed.[11]

[10] If a comment appeared on the same line as an executable line of code, both the comment and the line of code incremented CSLOC. If this method were not used, comments would be significantly undercounted. In particular, the measurement of comment density would be distorted.

[11] To adjust for the fact that complexity figures were not available for the utility routines for four of the seventeen projects, an average complexity ratio (cyclomatic complexity/NCSLOC) was defined for each project. By assuming that the complexity densities remained constant for the project, the total complexity (including non-Pascal or FORTRAN code) was calculated by multiplying the average complexity density by the total NCSLOC. This adjustment was done on only four of the projects, and for less than 2% of the code on average on those four projects. Therefore, this is not believed to be a major source of error.
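The counting rules above, including the footnote's treatment of mixed lines, are compact enough to sketch. This is our illustration, not the study's metrics program, and it assumes single-line Pascal-style '{ ... }' comments:

    def count_sloc(lines):
        ncsloc = comment_lines = 0
        for line in lines:
            stripped = line.strip()
            if not stripped:
                continue                      # blank lines never count
            if not stripped.startswith("{"):
                ncsloc += 1                   # any non-comment, non-blank line
            if "{" in stripped:
                comment_lines += 1            # comments tallied separately; a
                                              # trailing comment on a code line
                                              # counts toward both (footnote 10)
        return ncsloc, ncsloc + comment_lines  # (NCSLOC, CSLOC)

    source = ["program demo;", "", "{ entry point }", "begin",
              "  x := 1; { initialize }", "end."]
    print(count_sloc(source))                  # (4, 6)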
Not all the data used in the study were available for all the projects. The cyclomatic complexity data are missing from two of the projects and the phase data from four of the projects. However, for most analyses only a single statistic was required, and, in order to conserve degrees of freedom, all projects with valid data in that subset were used. For example, if the analysis required complexity data, but not phase data, the statistics were generated on the seventeen projects with valid complexity data (including the projects without phase data). It is not believed that this approach introduced any systematic bias.

4 SOURCE CODE ANALYSIS

4.1 Comparison of Metrics

For this portion of the analysis it was possible to analyze the data at the module level, since no work-hours data were required. Table 4.1.1 is a summary of some of the basic statistics for all the modules.
[Table 4.1.1: mean and standard deviation for the basic module-level variables]
The means of the size-related variables, such as TOTALHRS, can be assumed to be different with a high degree of confidence. Because all of these variables are directly related to the size of the project, this difference simply indicates that, on average, the maintenance projects were smaller than new development projects. However, no significant difference in productivity was found. While the average productivity for maintenance projects is less than for development projects, the fact that this difference is not statistically significant means that we were unable to reject the null hypothesis that maintenance productivity is the same as that for development projects. In addition, because the line count ignored changed lines and actually subtracted for deleted lines, the productivity of maintenance projects may be understated. Therefore, the new development projects and the maintenance projects are combined in later productivity analyses, excluding maintenance-specific analysis that requires existing code.

5 RESULTS
Four major productivity effects were investigated, of which two applied to all of the projects. They were: (a) Are more experienced programmers more productive, and, if so, by how much? and (b) How does the proportion of time spent in the design phase affect the productivity? The other two effects studied were appropriate for maintenance projects only. As cited earlier, much has been written under the assumption that characteristics of the code have an effect upon maintenance productivity. One of the reasons that McCabe developed his cyclomatic complexity was to attempt to quantify the complexity of code in order to determine the appropriate length of subroutines, so that the code would not be too complex to maintain. This study has examined this effect using the complexity density metric defined earlier. The second maintenance productivity issue analyzed was the impact of documentation in the form of comments.

5.1 Experience
It has long been thought, and several studies have shown, that the experience of the programmers is an important factor in productivity. Verifying this phenomenon is not always easy, in part because engineers with different levels of experience do different tasks. As described above, there were essentially three different categories of technical personnel. The highest category (scientist) included the top two company levels. The employees in this category were primarily project managers. The second category contained the remaining three professional levels, and the engineers in this category were primarily responsible for developing the software. The lowest level provided support. Because the scientists and the support staff were by and large not directly involved in the writing of the software, their effect on productivity is only indirect. Therefore, the research effort focused on the engineer (levels 3, 4, and 5) staff.

To test the hypothesis that the higher level (more education and more experience) engineers are more productive, the following model was created:

    Y = B0 + (B1 * P3) + (B2 * P4) + (B3 * P5)

where
    Y  = NCPROD
    P3 = the percentage of the total development hours spent by level 3 engineers
    P4 = the percentage of the total development hours spent by level 4 engineers
    P5 = the percentage of the total development hours spent by level 5 engineers

Development hours are only those hours worked on specifications, design, coding, or debugging. The results of this regression were:

    NCPROD = -15.75 + 27.88*P3 + 25.97*P4 + 18.93*P5
             (-2.03)   (2.89)     (2.91)     (1.89)

    R² = 0.45    Adjusted R² = 0.30    F-Value = 3.04    n = 15
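The daily figures quoted below follow from these hourly coefficients if one assumes an eight-hour work day (our reconstruction; the conversion is not stated in the text):

    0.01 x 27.88 NCSLOC/hour x 8 hours/day ≈ 2.2 NCSLOC/day

and the level 4 and level 5 coefficients yield approximately 2.1 and 1.5 NCSLOC/day in the same way.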
The interpretation of these results is that adding an additional one percent more time by level 3 engineers working on direct software tasks results in an increase of approximately 2.2 additional NCSLOC/day, ceteris paribus. The results for levels 4 and 5 are 2.1 and 1.5 respectively. These results are exactly as expected, in that more experienced engineers are more productive. (The t-statistics for the level 3 and 4 engineers are significant at the alpha = .01 level, and at the .10 level for the level 5 engineers.)

By examining the data at this level of detail it is also possible to draw conclusions about the marginal productivity of additional experience, and its value to the firm. In particular, the data show a 37.19% increase in productivity between levels 4 and 5, and a 47.28% increase between levels 3 and 5. What is interesting to note is that the difference in mean salaries between adjacent levels is only 15%. Starting at an index salary of 1 for level 1 scientists, this means that the average salary index for a level 3 engineer is .7225 (.85 squared), a level 4 is .6141, and a level 5, .5220. Therefore, the net benefit in using a level 4 engineer over a level 5 is approximately 20%.

[Figure 5.1: Marginal Productivity versus Average Years of Experience (rounded to 0.5 year)]
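The "approximately 20%" figure can be reproduced by netting the two percentage differences (our arithmetic, assuming each level's mean salary is 15% below the next higher level's). The level 4 salary index exceeds the level 5 index by 0.6141/0.5220 - 1 ≈ 17.6%, so:

    37.19% (productivity gain of level 4 over level 5) - 17.6% (salary premium) ≈ 20%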
This illustrates two points of potential importance to software managers. The first is that more productive staff do not seem to be compensated at a level commensurate with their increased productivity. The intuitive recommendation that follows from this relationship is that managers should seek out higher ability staff, since their marginal benefit exceeds their increased marginal cost. A second result from these data is that there seem to be declining marginal returns to experience (see Figure 5.1). This is consistent with some previous research [Jeffery and Lawrence, 1985]. However, this result may be confounded by a difference in task complexity, i.e., more senior staff may be given more difficult aspects of the system. Further study, perhaps in the form of a lab experiment controlling for task complexity, seems indicated.

5.2 Design Phase Impact
Because of the availability of phase data, it was possible to test the hypothesis, suggested by McKeen [1973], that by spending more time in the early phases the overall productivity of a project can be improved. Specifically, the more time that is spent analyzing and designing the software, the better and more detailed the design will be. This good design would be expected to lead to a faster and easier overall development process. To test this hypothesis, ANALFRAC was defined as the number of hours spent in the specification and design phases divided by the total number of project hours (see Appendix D). The results are plotted in Figure 5.2.

[Figure 5.2: Productivity versus Fraction of Time Spent in Analysis (specification and design)]

The result for a linear regression model of these data is:

    NCPROD = 3.97 + 21.48 * ANALFRAC
            (1.56)   (1.14)

    R² = 0.09    Adjusted R² = 0.02    F-Value = 1.30    n = 15
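A regression of this form can be reproduced from the project-level data with ordinary least squares. The sketch below is ours, not the authors' code, and the file and column names are hypothetical stand-ins for the paper's variables:

    import pandas as pd
    import statsmodels.api as sm

    projects = pd.read_csv("projects.csv")        # hypothetical project-level table
    X = sm.add_constant(projects[["ANALFRAC"]])   # fraction of hours in spec/design
    fit = sm.OLS(projects["NCPROD"], X).fit()
    print(fit.summary())                          # coefficients, t-stats, R², F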
These results suggest, though only weakly, that greater time spent in the specification and design phases has a measurable impact on source lines of code productivity. Of course, the limited number of data points reduces the power of this test.

5.3 Complexity
Complexity density (CYCMAX/NCSLOC) is the appropriate metric for evaluating the effects of code complexity, because McCabe's complexity is so closely related to program length. If McCabe's complexity were left unnormalized, the results would be dominated by the length effect, i.e., the tested hypothesis would actually be that maintenance productivity is related to program length. While this is a possible hypothesis, there are two reasons why it is less interesting than the impact of complexity. The first is that managers typically do not have a great deal of control over the size of a program (which is intimately connected to the size of the problem). However, by measuring and adopting complexity standards, and/or by using CASE restructuring tools, they can manage complexity. The second reason is that there are valid intuitive reasons why the length of the code may not have a large effect on maintenance productivity. When an engineer maintains a program, each change will tend to be localized to a fairly small number of modules. He will, therefore, only need detailed knowledge of a small fraction of the code. The size of this fraction is probably unrelated to the total size of the program. Thus the length of the entire code is less relevant.[12]

Logically, the initial complexity density should only affect those phases that are directly related to writing the code. User support, system operation, and management functions should not be directly affected by the complexity of the code. For this reason, a new partial productivity measure was defined:

ALLDPROD: The total number of lines divided by the time spent in all non-overhead activities: specifying, designing, documenting, coding, or testing and debugging.

[12] In fact, the correlation of productivity with initial code length was not significant (Pearson coefficient, -0.21; alpha = 0.61).
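Both derived measures are simple to compute once module metrics and phase hours are available. A minimal sketch in our notation (the phase names are hypothetical, and the per-1000-line scaling matches the "extra path per 1000 lines" phrasing in the introduction but is our assumption):

    NON_OVERHEAD = ("specification", "design", "documentation",
                    "coding", "testing_and_debugging")

    def complexity_density(cycmax_total, ncsloc):
        # INITCOMP-style density: cyclomatic complexity per 1000 NCSLOC
        return 1000.0 * cycmax_total / ncsloc

    def alldprod(total_lines, hours_by_phase):
        # ALLDPROD: lines divided by hours in all non-overhead activities
        direct_hours = sum(hours_by_phase.get(p, 0.0) for p in NON_OVERHEAD)
        return total_lines / direct_hours

    print(alldprod(5000, {"design": 300.0, "coding": 500.0, "management": 200.0}))
    # 6.25 lines/hour; management time is excluded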
To perform this analysis, both the phase data and the cyclomatic complexity data for the maintenance projects were needed. Because all the data were required, only seven data samples were available. A regression of ALLDPROD was performed with the complexity density of the initial code (INITCOMP):

    ALLDPROD = 28.7 - 0.134 * INITCOMP
               (3.53)  (-2.69)

    R² = 0.59    Adjusted R² = 0.51    F-Value = 7.22    n = 7

This model suggests that productivity declines with increasing complexity density (see Figure 5.3).

[Figure 5.3: Maintenance Productivity (new SLOC/work-hour) versus Complexity Density, showing actual values and the linear prediction]

While it would be speculative to ascribe too much importance to the results, which are based on only seven data points, the results do seem sufficiently strong, and the implications sufficiently important, that further study is indicated.[13] If these results continue to hold on a larger independent sample, then such results would provide strong support for the use of the complexity density measure as a quantitative method for analyzing software to determine its maintainability. It might also be used as an objective measure of this aspect of software quality.

[13] A possible concern with this model might be multicollinearity between the independent regressor and the intercept term. To test for this, the Belsley-Kuh-Welsch test was run, and no significant multicollinearity was found.
5.4 Source Code Comments

The number of comments in the code has also been suggested to affect the maintainability of the code [Tenny, 1988]. A regression of ALLDPROD as a function of the percentage of the lines of code that were comments (PERCOMM) was also performed. (These data are plotted in Figure 5.4.)

    ALLDPROD = 4.96 + 0.05 * PERCOMM
               (1.04)  (0.36)

    R² = 0.02    Adjusted R² = -0.14    F-Value = 0.13    n = 8

The results indicate that the density of comments does not seem to be related to the maintainability of the code. Because more difficult code may be commented more heavily, it is not clear whether heavily commented code will be easier to maintain. Therefore, another question is whether the programmers adjust their programming styles to the complexity of the code which is being written. For example, one would expect programmers to comment complex code more heavily and to break complex code into more subroutines. If this were the case, the density of comments should increase with the complexity density. In fact, the Pearson correlation coefficient between those two metrics is -0.30 (alpha = 0.0001, n = 465). The interpretation of this negative correlation is that more difficult code gets commented less heavily, suggesting that comment density may be simply a function of the time available, rather than being a function of the need for comments.[14]

[14] Programming style may also affect productivity. For example, one would expect that by increasing the complexity density (CYCMAX/NCSLOC) of newly written code, a programmer would require more time to write these more complex lines of code. The correlation of the complexity density with the productivity ...

[Figure 5.4: Maintenance Productivity as a Function of Comment Density (Comments/SLOC), actual values]
6 CONCLUSIONS

Through the availability of detailed data about the software projects developed and maintained at this data-site, this paper has been able to test a number of relationships between software complexity and other factors affecting development and maintenance productivity. In particular, increased software complexity, as measured by the McCabe cyclomatic complexity density metric, is associated with reduced maintenance productivity. Greater emphasis on the design phase is associated with increased productivity. And systems developers with greater experience were found to be more productive, although at a declining marginal rate.

How can software development and maintenance managers make best use of these results? To begin with, the measurement of these effects was made possible only through the collection of metrics related to software development inputs and outputs. In particular, these data argue for the collection of detailed data beyond mere SLOC and work-hour measures. Based on these data, software productivity can be improved through employment of higher experience staff and reduction in the cyclomatic complexity of existing code. Staff with higher experience were shown to be more productive, at an average rate greater than their increased cost. Finally, cost effective reduction in the cyclomatic complexity of existing code should be attempted. One approach may be by measuring the complexity of code as it is written, and requiring that it adhere to an existing standard established through on-site data collection. A second approach may be through existing restructuring products, although, of course, the cost effectiveness of this latter solution is contingent upon the costs of acquiring and operating the restructurer, as well as the size of the existing software inventory. An additional caveat related to this solution is that these restructurers may impose one-time additional maintenance costs, in terms of initially reducing the maintainers' familiarity with the existing code structure. This likely effect must also be taken into account.

In terms of future research, the software development and maintenance community would benefit from validation of these results on data from other firms, particularly from other industries. Also, the effects of complexity on maintenance need to be replicated on a larger sample before accepting these results as guides for standard practice. Beyond these validation-type studies, further work could be devoted to the experience issue, in terms of attempting to determine the causes of the apparent declining marginal returns to experience shown by these data. Research on complexity impacts could examine the actual longitudinal impacts of the use of commercial restructurers. In general, more empirical studies involving detailed data collection will help to provide the insights necessary to help software development progress.

7 BIBLIOGRAPHY
[Banker, et al. 1987] Banker, Rajiv D.; Datar, Srikant M.; and Kemerer, Chris F.; "Factors Affecting Software Development Productivity: An Exploratory Study"; in Proceedings of the Eighth International Conference on Information Systems; December 6-9, 1987, pp. 160-175.

[Basili, et al. 1983] Basili, Victor R.; Selby, Richard W. Jr.; and Phillips, Tsai-Yun; "Metric Analysis and Data Validation Across FORTRAN Projects"; IEEE Transactions on Software Engineering; SE-9 (6):652-663, 1983.

[Belady and Lehman, 1976] Belady, L. A. and Lehman, M. M.; "A Model of Large Program Development"; IBM Systems Journal; v. 15, n. 1, pp. 225-252, 1976.

[Belsley, et al. 1980] Belsley, D.; Kuh, E.; and Welsch, R.; Regression Diagnostics; John Wiley and Sons, NY, NY, 1980.

[Boehm, 1981] Boehm, Barry W.; Software Engineering Economics; Prentice-Hall, Englewood Cliffs, NJ, 1981.

[Boehm, et al. 1984] Boehm, B. W.; Gray, T. E.; and Seewaldt, T.; IEEE Transactions on Software Engineering; May 1984.

[Boehm, 1987] Boehm, Barry W.; "Improving Software Productivity"; Computer; September 1987; pp. 43-57.

[Card and Agresti, 1987] Card, D. N. and Agresti, W. W.; "Resolving the Software Science Anomaly"; Journal of Systems and Software; 7 (1):29-35, March, 1987.

[Chrysler, 1978] Chrysler, Earl; "Some Basic Determinants of Computer Programming Productivity"; Communications of the ACM; 21 (6):472-483, June, 1978.

[Conte, et al. 1986] Conte, S. D.; Dunsmore, H. E.; and Shen, V. Y.; Software Engineering Metrics and Models; Benjamin/Cummings Publishing Company, Inc., 1986.

[Curtis, 1981] Curtis, Bill; "Substantiating Programmer Variability"; Proceedings of the IEEE; 69 (7):846, July, 1981.

[Curtis, et al. 1979] Curtis, Bill, et al.; "Measuring the Psychological Complexity of Software Maintenance Tasks with the Halstead and McCabe Metrics"; IEEE Transactions on Software Engineering; SE-5:99-104, March, 1979.

[Dickey, 1981] Dickey, Thomas E.; "Programmer Variability"; Proceedings of the IEEE; 69 (7):844-845, July, 1981.

[Elshoff, 1976] Elshoff, J. L.; "An Analysis of Some Commercial PL/1 Programs"; IEEE Transactions on Software Engineering; SE-2 (2):113-120, 1976.

[Fjeldstad and Hamlen, 1979] Fjeldstad, R. K. ...