COCOMO II and Big Data
Rachchabhorn Wongsaroj*, Jo Ann Lane,
Supannika Koolmanojwong, Barry Boehm
*Bank of Thailand and Center for Systems and Software Engineering
Computer Science Department, Viterbi School of Engineering
University of Southern California
Outline
Big Data Concept
COCOMO II Cost factor
COCOMO II Cost factor and Big Data
Future Work
(c) USC CSSE
Big Data concept
3V's concept of Big Data (Source: IBM)
Volume: the amount of data generated
Variety: the different data types and sources
Velocity: the speed at which data is generated, flows in and out, and moves around
Big Data
Datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze
[Figure: the 3 V's of Big Data. Panel labels: Volume, Variety, Velocity. Data flows: people to people, people to machine, machine to machine. Example figures: 8 billion messages/day with 845M active users; 340 million tweets/day with 140M active users; 20 hours of video uploaded every minute. Source: IBM]
[Figure: Big Data Landscape. Source: IBM]
Big Data problems
Lots of data is being created and collected
World interconnection
Data quantity
Data quality
Data variety
Data timeliness
COCOMO Black Box Model
Inputs to COCOMO II:
  product size estimate
  product, process, platform, and personnel attributes
  reuse, maintenance, and increment parameters
  organizational project data
Outputs of COCOMO II:
  development, maintenance cost and schedule estimates
  cost, schedule distribution by phase, activity, increment
  recalibration to organizational data
COCOMO II – Cost factor
Significant factors of development cost:
Scale drivers are sources of exponential effort variation
Cost drivers are sources of linear effort variation
  product, platform, personnel and project attributes
  effort multipliers associated with cost driver ratings
Each factor is rated between very low and very high per rating guidelines
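The difference between exponential and linear effort variation can be sketched with the COCOMO II effort equation, PM = A x Size^E x (product of effort multipliers), where E = B + 0.01 x (sum of scale factor values). The constants A = 2.94 and B = 0.91 and the all-nominal scale factor sum of 18.97 are assumed from the COCOMO II.2000 calibration and are not stated on these slides; the function and parameter names are illustrative only.

```python
from math import prod

A = 2.94  # assumed COCOMO II.2000 multiplicative constant
B = 0.91  # assumed COCOMO II.2000 base exponent

def effort_pm(ksloc, scale_factor_sum, effort_multipliers=()):
    """Estimate effort in person-months.

    scale_factor_sum   -- sum of the scale driver values (exponential effect)
    effort_multipliers -- cost driver multipliers (linear effect)
    """
    exponent = B + 0.01 * scale_factor_sum
    return A * ksloc ** exponent * prod(effort_multipliers)

nominal = effort_pm(100, 18.97)            # every driver at its nominal rating
stretched = effort_pm(100, 18.97, [1.10])  # one cost driver 10% above nominal
print(round(nominal), round(stretched / nominal, 2))
```

A cost driver changes effort by the same percentage at any size, whereas a change in the scale factor sum changes the exponent and therefore matters more as the product grows.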
Scale Drivers
Precedentedness (PREC): degree to which system is new and past experience applies
Development Flexibility (FLEX): need to conform with specified requirements
Architecture/Risk Resolution (RESL): degree of design thoroughness and risk elimination
Team Cohesion (TEAM): need to synchronize stakeholders and minimize conflict
Process Maturity (PMAT): SEI CMM process maturity rating
Scale Factors (Wi) by rating level
Precedentedness (PREC)
  Very Low: thoroughly unprecedented
  Low: largely unprecedented
  Nominal: somewhat unprecedented
  High: generally familiar
  Very High: largely familiar
  Extra High: thoroughly familiar
Development Flexibility (FLEX)
  Very Low: rigorous
  Low: occasional relaxation
  Nominal: some relaxation
  High: general conformity
  Very High: some conformity
  Extra High: general goals
Architecture/Risk Resolution (RESL)*
  Very Low: little (20%)
  Low: some (40%)
  Nominal: often (60%)
  High: generally (75%)
  Very High: mostly (90%)
  Extra High: full (100%)
Team Cohesion (TEAM)
  Very Low: very difficult interactions
  Low: some difficult interactions
  Nominal: basically cooperative interactions
  High: largely cooperative
  Very High: highly cooperative
  Extra High: seamless interactions
Process Maturity (PMAT)
  Weighted average of "Yes" answers to CMM Maturity Questionnaire
* % significant module interfaces specified, % significant risks eliminated
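As a rough sketch of how these ratings feed the effort exponent, the snippet below maps each scale driver rating to a numeric value and computes E = B + 0.01 x (sum of values). The numeric values and B = 0.91 are assumed from the COCOMO II.2000 calibration and do not appear on these slides; they should be checked against the calibration actually in use. The example project is hypothetical.

```python
B = 0.91  # assumed COCOMO II.2000 base exponent

# Assumed COCOMO II.2000 scale factor values, ordered Very Low .. Extra High.
RATINGS = ["VL", "L", "N", "H", "VH", "XH"]
SCALE_FACTORS = {
    "PREC": [6.20, 4.96, 3.72, 2.48, 1.24, 0.00],
    "FLEX": [5.07, 4.05, 3.04, 2.03, 1.01, 0.00],
    "RESL": [7.07, 5.65, 4.24, 2.83, 1.41, 0.00],
    "TEAM": [5.48, 4.38, 3.29, 2.19, 1.10, 0.00],
    "PMAT": [7.80, 6.24, 4.68, 3.12, 1.56, 0.00],
}

def effort_exponent(ratings):
    """Return E = B + 0.01 * sum of scale factor values for the given ratings."""
    total = sum(
        SCALE_FACTORS[driver][RATINGS.index(ratings[driver])]
        for driver in SCALE_FACTORS
    )
    return B + 0.01 * total

all_nominal = {driver: "N" for driver in SCALE_FACTORS}
# Hypothetical project whose product is thoroughly unprecedented for the team.
unprecedented = dict(all_nominal, PREC="VL")
print(round(effort_exponent(all_nominal), 4))
print(round(effort_exponent(unprecedented), 4))
```

Lower ratings carry larger values, so an unprecedented product raises the exponent and effort grows faster with size.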
Precedentedness (PREC)
Elaboration of the PREC rating scales:
Organizational understanding of product objectives
  Very Low: General
  Nominal / High: Considerable
  Extra High: Thorough
Experience in working with related software systems
  Very Low: Moderate
  Nominal / High: Considerable
  Extra High: Extensive
Concurrent development of associated new hardware and operational procedures
  Very Low: Extensive
  Nominal / High: Moderate
  Extra High: Some
Need for innovative data processing architectures, algorithms
  Very Low: Considerable
  Nominal / High: Some
  Extra High: Minimal
Cost Drivers
Product Factors
  Reliability (RELY)
  Data (DATA)
  Complexity (CPLX)
  Reusability (RUSE)
  Documentation (DOCU)
Platform Factors
  Time constraint (TIME)
  Storage constraint (STOR)
  Platform volatility (PVOL)
Personnel Factors
  Analyst capability (ACAP)
  Programmer capability (PCAP)
  Applications experience (APEX)
  Platform experience (PLEX)
  Language and tool experience (LTEX)
  Personnel continuity (PCON)
Project Factors
  Software tools (TOOL)
  Multisite development (SITE)
Cost Drivers and Big Data
Product Factors
Required Software Reliability (RELY)
Measures the extent to which the software must perform its intended function over a period of time.
Ask: what is the effect of a software failure?
RELY descriptors by rating level:
  Very Low: slight inconvenience
  Low: low, easily recoverable losses
  Nominal: moderate, easily recoverable losses
  High: high financial loss
  Very High: risk to human life
  Extra High: (no descriptor)
Data Base Size (DATA)
Captures the effect that large data requirements have on development, in particular the effort to generate the test data that will be used to exercise the program.
Calculate the data/program size ratio (D/P):
$$\frac{D}{P} = \frac{\text{DataBaseSize (Bytes)}}{\text{ProgramSize (SLOC)}}$$
DATA rating by D/P (DB bytes / program SLOC):
  Very Low: (no descriptor)
  Low: D/P < 10
  Nominal: 10 ≤ D/P < 100
  High: 100 ≤ D/P < 1000
  Very High: D/P > 1000
  Extra High: (no descriptor)
IBM: the database size of Big Data scales from terabytes to zettabytes
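To see how far Big Data volumes sit beyond the DATA scale, the D/P calculation can be sketched as below. The function names and the example sizes (1 TB of data, 1M SLOC) are hypothetical; the bands follow the DATA rating table, and treating a D/P of exactly 1000 as Very High is an assumption.

```python
def dp_ratio(database_bytes, program_sloc):
    """Data/program size ratio D/P, in bytes per source line of code."""
    return database_bytes / program_sloc

def data_rating(dp):
    """Map a D/P ratio to a DATA rating band."""
    if dp < 10:
        return "Low"
    if dp < 100:
        return "Nominal"
    if dp < 1000:
        return "High"
    return "Very High"  # assumption: exactly 1000 also counts as Very High

TERABYTE = 10 ** 12
dp = dp_ratio(TERABYTE, 1_000_000)  # hypothetical: 1 TB of data, 1M SLOC
print(dp, data_rating(dp))
```

Even a single terabyte against a million lines of code yields a ratio three orders of magnitude above the Very High threshold, and the scale has no band to distinguish it from a ratio of 1000.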
Product Complexity (CPLX)
Complexity is divided into five areas:
control operations,
computational operations,
device-dependent operations,
data management operations, and
user interface management operations.
Select the area or combination of areas that characterize the
product or a sub-system of the product.
Module Complexity Ratings vs. Type of Module
Use a subjective weighted average of the attributes, weighted by their relative product importance.
Control Operations
  Very Low: Straightline code with a few non-nested structured programming operators: DOs, CASEs, IFTHENELSEs. Simple module composition via procedure calls or simple scripts.
  Low: Straightforward nesting of structured programming operators. Mostly simple predicates.
  Nominal: Mostly simple nesting. Some intermodule control. Decision tables. Simple callbacks or message passing, including middleware-supported distributed processing.
  High: Highly nested structured programming operators with many compound predicates. Queue and stack control. Homogeneous, dist. processing. Single processor soft real-time ctl.
  Very High: Reentrant and recursive coding. Fixed-priority interrupt handling. Task synchronization, complex callbacks, heterogeneous dist. processing. Single-processor hard real-time ctl.
  Extra High: Multiple resource scheduling with dynamically changing priorities. Microcode-level control. Distributed hard real-time control.
Computational Operations
  Very Low: Evaluation of simple expressions: e.g., A=B+C*(D-E)
  Low: Evaluation of moderate-level expressions: e.g., D=SQRT(B**2-4.*A*C)
  Nominal: Use of standard math and statistical routines. Basic matrix/vector operations.
  High: Basic numerical analysis: multivariate interpolation, ordinary differential eqns. Basic truncation, roundoff concerns.
  Very High: Difficult but structured numerical analysis: near-singular matrix equations, partial differential eqns. Simple parallelization.
  Extra High: Difficult and unstructured numerical analysis: highly accurate analysis of noisy, stochastic data. Complex parallelization.
Device-dependent Operations
  Very Low: Simple read, write statements with simple formats.
  Low: No cognizance needed of particular processor or I/O device characteristics. I/O done at GET/PUT level.
  Nominal: I/O processing includes device selection, status checking and error processing.
  High: Operations at physical I/O level (physical storage address translations; seeks, reads, etc.). Optimized I/O overlap.
  Very High: Routines for interrupt diagnosis, servicing, masking. Communication line handling. Performance-intensive embedded systems.
  Extra High: Device timing-dependent coding, micro-programmed operations. Performance-critical embedded systems.
Data Management Operations
  Very Low: Simple arrays in main memory. Simple COTS-DB queries, updates.
  Low: Single file subsetting with no data structure changes, no edits, no intermediate files. Moderately complex COTS-DB queries, updates.
  Nominal: Multi-file input and single file output. Simple structural changes, simple edits. Complex COTS-DB queries, updates.
  High: Simple triggers activated by data stream contents. Complex data restructuring.
  Very High: Distributed database coordination. Complex triggers. Search optimization.
  Extra High: Highly coupled, dynamic relational and object structures. Natural language data management.
User Interface Management
  Very Low: Simple input forms, report generators.
  Low: Use of simple graphic user interface (GUI) builders.
  Nominal: Simple use of widget set.
  High: Widget set development and extension. Simple voice I/O, multimedia.
  Very High: Moderately complex 2D/3D, dynamic graphics, multimedia.
  Extra High: Complex multimedia, virtual reality.
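One way to make the subjective weighted average of the complexity areas concrete is sketched below. The numeric encoding of rating levels (0 to 5), the rounding rule, and the example ratings and weights are all assumptions for illustration; the slides leave the weighting to the estimator's judgment.

```python
LEVELS = ["Very Low", "Low", "Nominal", "High", "Very High", "Extra High"]

def cplx_rating(area_ratings, weights):
    """Weighted average of per-area ratings, rounded to a rating level.

    area_ratings -- {area: level name}
    weights      -- {area: relative importance}; need not sum to 1
    Ties between two levels follow Python's round-half-to-even rule.
    """
    total_weight = sum(weights[area] for area in area_ratings)
    score = sum(LEVELS.index(level) * weights[area]
                for area, level in area_ratings.items())
    return LEVELS[round(score / total_weight)]

# Hypothetical data-intensive module in which data management dominates.
ratings = {
    "control": "Nominal",
    "computational": "High",
    "device-dependent": "Low",
    "data management": "Very High",
    "user interface": "Nominal",
}
weights = {
    "control": 1,
    "computational": 2,
    "device-dependent": 1,
    "data management": 4,
    "user interface": 1,
}
print(cplx_rating(ratings, weights))
```

Giving data management the largest weight pulls the overall rating toward its level, which is how a Big Data product with otherwise ordinary control and user interface logic can still rate high on CPLX.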
Platform Factors
Execution Time Constraint (TIME)
Measures the constraint imposed upon a system in terms of the percentage of available execution time expected to be used by the system.
TIME rating by use of available execution time:
  Very Low, Low: (no descriptor)
  Nominal: 50%
  High: 70%
  Very High: 85%
  Extra High: 95%
Main Storage Constraint (STOR)
Measures the degree of main storage constraint imposed on a software system or subsystem.
STOR rating by use of available storage:
  Very Low, Low: (no descriptor)
  Nominal: 50%
  High: 70%
  Very High: 85%
  Extra High: 95%
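Because TIME and STOR share the same percentage anchors, a single lookup can rate either constraint. The sketch below assumes that usage falling between two anchors takes the lower rating and that anything under 70% is Nominal; the slides give only the anchor points, so this interpolation rule and the function name are assumptions.

```python
from bisect import bisect_right

# Anchors from the TIME and STOR tables: percent of the resource used.
THRESHOLDS = [70, 85, 95]
LEVELS = ["Nominal", "High", "Very High", "Extra High"]

def constraint_rating(percent_used):
    """Return the highest rating whose anchor does not exceed percent_used.

    Assumption: values between anchors take the lower rating, and
    anything below 70% is treated as Nominal.
    """
    if not 0 <= percent_used <= 100:
        raise ValueError("percent_used must be between 0 and 100")
    return LEVELS[bisect_right(THRESHOLDS, percent_used)]

for used in (40, 70, 90, 97):
    print(used, constraint_rating(used))
```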
The largest big data practitioners (Google, Facebook, Apple, etc.) run what are known as hyperscale computing environments.
Big data storage must:
  be capable of handling large volumes of data
  be scalable to accommodate growth
  provide the input/output operations per second (IOPS) needed to deliver data to analytic tools
Personnel Factors
Analyst Capability (ACAP)
Analysts work on requirements, high level design and detailed design. Consider analysis and design ability, efficiency and thoroughness, and the ability to communicate and cooperate.
ACAP rating:
  Very Low: 15th percentile
  Low: 35th percentile
  Nominal: 55th percentile
  High: 75th percentile
  Very High: 90th percentile
  Extra High: (no descriptor)
Programmer Capability (PCAP)
Evaluate the capability of the programmers as a team rather than as individuals. Consider ability, efficiency and thoroughness, and the ability to communicate and cooperate.
PCAP rating:
  Very Low: 15th percentile
  Low: 35th percentile
  Nominal: 55th percentile
  High: 75th percentile
  Very High: 90th percentile
  Extra High: (no descriptor)
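Since cost drivers act as linear multipliers on effort, the combined effect of analyst and programmer capability is simply the product of their two multipliers. The multiplier values below are assumed from the COCOMO II.2000 calibration and are not given on these slides; the function name is illustrative.

```python
# Assumed COCOMO II.2000 effort multipliers (Very Low .. Very High);
# verify against the calibration in use before relying on them.
EFFORT_MULTIPLIERS = {
    "ACAP": {"VL": 1.42, "L": 1.19, "N": 1.00, "H": 0.85, "VH": 0.71},
    "PCAP": {"VL": 1.34, "L": 1.15, "N": 1.00, "H": 0.88, "VH": 0.76},
}

def personnel_factor(acap, pcap):
    """Combined linear effect of analyst and programmer capability on effort."""
    return EFFORT_MULTIPLIERS["ACAP"][acap] * EFFORT_MULTIPLIERS["PCAP"][pcap]

# 75th-percentile analysts and programmers
print(round(personnel_factor("H", "H"), 3))
# 15th-percentile analysts and programmers
print(round(personnel_factor("VL", "VL"), 3))
```

Under these assumed values, the gap between a Very Low and a High team on both drivers is roughly a factor of 2.5 in effort, independent of product size.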
Applications Experience (APEX)
Assess the project team's equivalent level of experience with this type of application.
APEX rating by experience:
  Very Low: 2 months
  Low: 6 months
  Nominal: 1 year
  High: 3 years
  Very High: 6 years
  Extra High: (no descriptor)
Platform Experience (PLEX)
Assess the project team's equivalent level of experience with this platform, including the OS, graphical user interface, database, networking, and distributed middleware.
PLEX rating by experience:
  Very Low: 2 months
  Low: 6 months
  Nominal: 1 year
  High: 3 years
  Very High: 6 years
  Extra High: (no descriptor)
Conclusion - Scale Drivers and Big Data
Scale Drivers: COCOMO II Coverage
  Precedentedness (PREC): Covered
  Development Flexibility (FLEX): Covered
  Architecture/Risk Resolution (RESL): Covered
  Team Cohesion (TEAM): Covered
Conclusion - Cost Drivers and Big Data
Cost Drivers: COCOMO II Coverage / Future Work
  Reliability (RELY): Covered
  Data (DATA): Need to define an Extra High cost rating for projects with terabytes to zettabytes of data
  Complexity (CPLX): Covered, but needs more detail for Big Data custom-developed solutions (25% of all projects)
  Reusability (RUSE): Covered
  Documentation (DOCU): Covered
  Time constraint (TIME): Covered
  Storage constraint (STOR): Covered
  Platform volatility (PVOL): Covered
  Analyst capability (ACAP): Covered
  Programmer capability (PCAP): Covered
  Applications experience (APEX): Covered
  Platform experience (PLEX): Covered
  Language and tool experience (LTEX): Covered
  Personnel continuity (PCON): Covered
  Software tools (TOOL): Covered
  Multisite development (SITE): Covered
  Required schedule (SCED): Covered
References
Barry W. Boehm et al. (2000), Software Cost Estimation with COCOMO II, Prentice Hall, New Jersey.
Barry W. Boehm (1981), Software Engineering Economics, Prentice Hall, New Jersey.
McKinsey Global Institute, Big data: The next frontier for innovation, competition, and productivity, June 2011 (www.mckinsey.com/mgi).
Zikopoulos, P., and Eaton, C. (2011), Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill Osborne Media.
Enterprise Strategy Group, Research Report: The Convergence of Big Data Processing and Integrated Infrastructure.