COCOMO II and Big Data

(1)


Rachchabhorn Wongsaroj*, Jo Ann Lane, Supannika Koolmanojwong, Barry Boehm

*Bank of Thailand and Center for Systems and Software Engineering

Computer Science Department, Viterbi School of Engineering

University of Southern California

(2)

Outline



Big Data Concept

COCOMO II Cost Factors

COCOMO II Cost Factors and Big Data

Future Work

(c) USC CSSE

(3)

Big Data Concept

The 3 V’s of Big Data (Source: IBM)



Volume: the amount of data generated

Variety: the different data types and sources

Velocity: the speed at which data is generated, flows in and out, and moves around

Big Data: datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze

(4)

Volume, Variety, Velocity (Source: IBM)

Data flows: People to People, People to Machine, Machine to Machine

8 billion messages/day, 845M active users

340 million Tweets/day, 140M active users

20 hours of video uploaded every minute

(5)

Source: IBM

Big Data Landscape


(6)

Big Data Landscape (cont.)

(7)

Big Data problems

Lots of data is being created and collected; world interconnection

- Data quantity
- Data quality
- Data variety
- Data timeliness

(8)

COCOMO II: COCOMO Black Box Model

Inputs:
- product size estimate
- product, process, platform, and personnel attributes
- reuse, maintenance, and increment parameters
- organizational project data

Outputs:
- development, maintenance cost and schedule estimates
- cost, schedule distribution by phase, activity, increment
- recalibration to organizational data

(9)

COCOMO II - Cost Factors

Significant factors of development cost:

- Scale drivers are sources of exponential effort variation
- Cost drivers are sources of linear effort variation
  - product, platform, personnel, and project attributes
  - effort multipliers associated with cost driver ratings

Each factor is rated between Very Low and Very High per rating guidelines
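The difference between exponential and linear effort variation can be sketched in Python. This is a minimal illustration, assuming the COCOMO II.2000 form of the effort equation, PM = A × Size^E × ΠEM with E = B + 0.01 × ΣSF, and its calibration constants A = 2.94 and B = 0.91. Neither the equation nor the constants appear on the slides, and the function name and sample inputs are hypothetical.

```python
import math

# Assumed COCOMO II.2000 calibration constants (not stated on the slides).
A = 2.94
B = 0.91


def effort_person_months(ksloc, scale_factor_sum, effort_multipliers):
    """Estimate effort as PM = A * Size^E * product(EM), E = B + 0.01 * sum(SF)."""
    exponent = B + 0.01 * scale_factor_sum  # scale drivers act on the exponent
    return A * ksloc ** exponent * math.prod(effort_multipliers)


# Hypothetical project: 100 KSLOC, scale factors summing to 18.97,
# and every cost driver at a multiplier of 1.0.
baseline = effort_person_months(100, 18.97, [1.0])

# A cost driver's multiplier scales effort linearly.
with_multiplier = effort_person_months(100, 18.97, [1.26])

# Doubling size more than doubles effort whenever the exponent exceeds 1.
doubled_size = effort_person_months(200, 18.97, [1.0])

print(with_multiplier / baseline)
print(doubled_size / baseline)
```

The first ratio equals the multiplier itself, while the second exceeds 2 because the exponent is greater than 1 for these inputs.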

(10)



Scale Drivers

Precedentedness (PREC)
- Degree to which system is new and past experience applies

Development Flexibility (FLEX)
- Need to conform with specified requirements

Architecture/Risk Resolution (RESL)
- Degree of design thoroughness and risk elimination

Team Cohesion (TEAM)
- Need to synchronize stakeholders and minimize conflict

Process Maturity (PMAT)
- SEI CMM process maturity rating


(12)

Scale Factors (Wi) by rating level

Precedentedness (PREC)
- Very Low: thoroughly unprecedented
- Low: largely unprecedented
- Nominal: somewhat unprecedented
- High: generally familiar
- Very High: largely familiar
- Extra High: thoroughly familiar

Development Flexibility (FLEX)
- Very Low: rigorous
- Low: occasional relaxation
- Nominal: some relaxation
- High: general conformity
- Very High: some conformity
- Extra High: general goals

Architecture/Risk Resolution (RESL)*
- Very Low: little (20%)
- Low: some (40%)
- Nominal: often (60%)
- High: generally (75%)
- Very High: mostly (90%)
- Extra High: full (100%)

Team Cohesion (TEAM)
- Very Low: very difficult interactions
- Low: some difficult interactions
- Nominal: basically cooperative interactions
- High: largely cooperative
- Very High: highly cooperative
- Extra High: seamless interactions

Process Maturity (PMAT)
- Weighted average of “Yes” answers to CMM Maturity Questionnaire

* % significant module interfaces specified, % significant risks eliminated
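How the five ratings combine into the effort exponent can be sketched as follows. The numeric scale factor values and the base exponent B = 0.91 are quoted from the COCOMO II.2000 calibration and are not given on the slides, so they should be checked against the model definition manual before use. The names `SCALE_FACTOR_VALUES` and `scale_exponent` are hypothetical.

```python
# Scale factor values per rating, quoted from the COCOMO II.2000 calibration.
# They are not given on the slides; verify against the model manual before use.
RATINGS = ["Very Low", "Low", "Nominal", "High", "Very High", "Extra High"]

SCALE_FACTOR_VALUES = {
    "PREC": [6.20, 4.96, 3.72, 2.48, 1.24, 0.00],
    "FLEX": [5.07, 4.05, 3.04, 2.03, 1.01, 0.00],
    "RESL": [7.07, 5.65, 4.24, 2.83, 1.41, 0.00],
    "TEAM": [5.48, 4.38, 3.29, 2.19, 1.10, 0.00],
    "PMAT": [7.80, 6.24, 4.68, 3.12, 1.56, 0.00],
}

B = 0.91  # assumed COCOMO II.2000 base exponent


def scale_exponent(ratings):
    """Return E = B + 0.01 * sum(SF) for a dict mapping each driver to a rating."""
    total = sum(
        SCALE_FACTOR_VALUES[driver][RATINGS.index(ratings[driver])]
        for driver in SCALE_FACTOR_VALUES
    )
    return B + 0.01 * total


all_nominal = {driver: "Nominal" for driver in SCALE_FACTOR_VALUES}
print(round(scale_exponent(all_nominal), 4))
```

Under these values, a project rated Extra High on every scale driver has an exponent equal to B, so its effort grows slightly less than linearly with size.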


(13)



Precedentedness (PREC)

Elaboration of the PREC rating scales:

Organizational understanding of product objectives
- Very Low: General
- Nominal / High: Considerable
- Extra High: Thorough

Experience in working with related software systems
- Very Low: Moderate
- Nominal / High: Considerable
- Extra High: Extensive

Concurrent development of associated new hardware and operational procedures
- Very Low: Extensive
- Nominal / High: Moderate
- Extra High: Some

Need for innovative data processing architectures, algorithms
- Very Low: Considerable
- Nominal / High: Some
- Extra High: Minimal

(14)



Cost Drivers

Product Factors
- Reliability (RELY)
- Data (DATA)
- Complexity (CPLX)
- Reusability (RUSE)
- Documentation (DOCU)

Platform Factors
- Time constraint (TIME)
- Storage constraint (STOR)
- Platform volatility (PVOL)

Personnel Factors
- Analyst capability (ACAP)
- Programmer capability (PCAP)
- Applications experience (APEX)
- Platform experience (PLEX)
- Language and tool experience (LTEX)
- Personnel continuity (PCON)

Project Factors
- Software tools (TOOL)
- Multisite development (SITE)

(15)



Cost Drivers and Big Data

Product Factors
- Reliability (RELY)
- Data (DATA)
- Complexity (CPLX)
- Reusability (RUSE)
- Documentation (DOCU)

Platform Factors
- Time constraint (TIME)
- Storage constraint (STOR)
- Platform volatility (PVOL)

Personnel Factors
- Analyst capability (ACAP)
- Programmer capability (PCAP)
- Applications experience (APEX)
- Platform experience (PLEX)
- Language and tool experience (LTEX)
- Personnel continuity (PCON)

Project Factors
- Software tools (TOOL)
- Multisite development (SITE)

(16)



Product Factors (cont’d)

Required Software Reliability (RELY)

- Measures the extent to which the software must perform its intended function over a period of time.
- Ask: what is the effect of a software failure?

RELY descriptors:
- Very Low: slight inconvenience
- Low: low, easily recoverable losses
- Nominal: moderate, easily recoverable losses
- High: high financial loss
- Very High: risk to human life

(17)

Source: IBM

Big Data Landscape


(18)



Data Base Size (DATA)



Captures the effect that large data requirements have on development, namely the effort to generate test data that will be used to exercise the program.

Calculate the data/program size ratio (D/P):

$$\frac{D}{P} = \frac{\text{DataBaseSize (Bytes)}}{\text{ProgramSize (SLOC)}}$$

DATA rating levels:
- Low: DB bytes / Pgm SLOC < 10
- Nominal: 10 ≤ D/P < 100
- High: 100 ≤ D/P < 1000
- Very High: D/P > 1000

Product Factors

(cont’d)


IBM: the database size of Big Data scales from terabytes to zettabytes
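The D/P rating rule can be sketched in Python to show why Big Data saturates the existing scale. The function name `data_rating` and the sample program size are hypothetical. The slide gives D/P > 1000 for Very High and does not say how a ratio of exactly 1000 is rated, so treating it as Very High is an assumption.

```python
def data_rating(db_bytes, program_sloc):
    """Map the data/program size ratio (D/P) to a COCOMO II DATA rating."""
    if program_sloc <= 0:
        raise ValueError("program size must be positive")
    ratio = db_bytes / program_sloc
    if ratio < 10:
        return "Low"
    if ratio < 100:
        return "Nominal"
    if ratio < 1000:
        return "High"
    return "Very High"  # assumption: D/P of exactly 1000 is also Very High


# Hypothetical 100 KSLOC program against one terabyte and one zettabyte of data.
TERABYTE = 10 ** 12
ZETTABYTE = 10 ** 21
print(data_rating(TERABYTE, 100_000))
print(data_rating(ZETTABYTE, 100_000))
```

Both cases receive the same Very High rating even though the data volumes differ by nine orders of magnitude, since the scale has no level above D/P of 1000.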


(21)



Product Complexity (CPLX)



Complexity is divided into five areas:
- control operations,
- computational operations,
- device-dependent operations,
- data management operations, and
- user interface management operations.

Select the area or combination of areas that characterize the product or a sub-system of the product.

(22)



Module Complexity Ratings vs. Type of Module

- Use a subjective weighted average of the attributes, weighted by their relative product importance.

Control Operations
- Very Low: Straight-line code with a few non-nested structured programming operators: DOs, CASEs, IF-THEN-ELSEs. Simple module composition via procedure calls or simple scripts.
- Low: Straightforward nesting of structured programming operators. Mostly simple predicates.
- Nominal: Mostly simple nesting. Some intermodule control. Decision tables. Simple callbacks or message passing, including middleware-supported distributed processing.
- High: Highly nested structured programming operators with many compound predicates. Queue and stack control. Homogeneous, distributed processing. Single-processor soft real-time control.
- Very High: Reentrant and recursive coding. Fixed-priority interrupt handling. Task synchronization, complex callbacks, heterogeneous distributed processing. Single-processor hard real-time control.
- Extra High: Multiple resource scheduling with dynamically changing priorities. Microcode-level control. Distributed hard real-time control.

Computational Operations
- Very Low: Evaluation of simple expressions: e.g., A=B+C*(D-E)
- Low: Evaluation of moderate-level expressions: e.g., D=SQRT(B**2-4.*A*C)
- Nominal: Use of standard math and statistical routines. Basic matrix/vector operations.
- High: Basic numerical analysis: multivariate interpolation, ordinary differential equations. Basic truncation, roundoff concerns.
- Very High: Difficult but structured numerical analysis: near-singular matrix equations, partial differential equations. Simple parallelization.
- Extra High: Difficult and unstructured numerical analysis: highly accurate analysis of noisy, stochastic data. Complex parallelization.

(23)

Device-dependent Operations
- Very Low: Simple read, write statements with simple formats.
- Low: No cognizance needed of particular processor or I/O device characteristics. I/O done at GET/PUT level.
- Nominal: I/O processing includes device selection, status checking and error processing.
- High: Operations at physical I/O level (physical storage address translations; seeks, reads, etc.). Optimized I/O overlap.
- Very High: Routines for interrupt diagnosis, servicing, masking. Communication line handling. Performance-intensive embedded systems.
- Extra High: Device timing-dependent coding, micro-programmed operations. Performance-critical embedded systems.

Data Management Operations
- Very Low: Simple arrays in main memory. Simple COTS-DB queries, updates.
- Low: Single file subsetting with no data structure changes, no edits, no intermediate files. Moderately complex COTS-DB queries, updates.
- Nominal: Multi-file input and single file output. Simple structural changes, simple edits. Complex COTS-DB queries, updates.
- High: Simple triggers activated by data stream contents. Complex data restructuring.
- Very High: Distributed database coordination. Complex triggers. Search optimization.
- Extra High: Highly coupled, dynamic relational and object structures. Natural language data management.

User Interface Management
- Very Low: Simple input forms, report generators.
- Low: Use of simple graphic user interface (GUI) builders.
- Nominal: Simple use of widget set.
- High: Widget set development and extension. Simple voice I/O, multimedia.
- Very High: Moderately complex 2D/3D, dynamic graphics, multimedia.
- Extra High: Complex multimedia, virtual reality.
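The subjective weighted average over complexity areas can be sketched as follows. Mapping ratings to the integers 0 (Very Low) through 5 (Extra High), rounding half up to the nearest rating, the function name `cplx_rating`, and the example ratings and weights are all illustrative assumptions; COCOMO II leaves the averaging to the estimator's judgment.

```python
RATINGS = ["Very Low", "Low", "Nominal", "High", "Very High", "Extra High"]


def cplx_rating(area_ratings, weights):
    """Combine per-area ratings into one CPLX rating by weighted average.

    area_ratings maps a complexity area to its rating; weights maps the same
    areas to their relative product importance (any positive scale).
    """
    total_weight = sum(weights[area] for area in area_ratings)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    score = sum(
        RATINGS.index(rating) * weights[area]
        for area, rating in area_ratings.items()
    ) / total_weight
    return RATINGS[int(score + 0.5)]  # assumption: round half up


# Hypothetical data-heavy product: data management dominates the weighting.
ratings = {
    "control": "High",
    "computational": "Nominal",
    "data management": "Extra High",
}
importance = {"control": 1, "computational": 1, "data management": 3}
print(cplx_rating(ratings, importance))
```

With these example weights the result is Very High: the Extra High data management rating pulls the average up, while the lower control and computational ratings keep it below Extra High.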

Product Factors

(cont’d)



(26)



Execution Time Constraint (TIME)

- Measures the constraint imposed upon a system in terms of the percentage of available execution time expected to be used by the system consuming the execution time resource.

TIME rating levels:
- Nominal: ≤ 50% use of available execution time
- High: 70%
- Very High: 85%
- Extra High: 95%
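A lookup from measured utilization to a TIME rating can be sketched as follows. The slide gives only anchor percentages, so rating a value between two anchors at the lower level is an assumption, as is the function name `time_rating`. The STOR scale in this deck uses the same anchor percentages, so the same lookup would apply to storage utilization.

```python
# Anchor percentages from the TIME rating table, highest first.
TIME_ANCHORS = [(95, "Extra High"), (85, "Very High"), (70, "High")]


def time_rating(percent_used):
    """Rate execution time constraint from percent of available time used."""
    if not 0 <= percent_used <= 100:
        raise ValueError("percent_used must be between 0 and 100")
    for anchor, rating in TIME_ANCHORS:
        # Assumption: a value between two anchors takes the lower rating.
        if percent_used >= anchor:
            return rating
    return "Nominal"  # at or below 50%, and (by assumption) up to 70%


for usage in (40, 70, 90, 97):
    print(usage, time_rating(usage))
```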

Platform Factors



(28)



Platform Factors

Main Storage Constraint (STOR)

- Measures the degree of main storage constraint imposed on a software system or subsystem.

STOR rating levels:
- Nominal: ≤ 50% use of available storage
- High: 70%
- Very High: 85%
- Extra High: 95%

The largest big data practitioners (Google, Facebook, Apple, etc.) run what are known as hyperscale computing environments.

(29)

The key requirements of big data storage are that it must:
- be capable of handling large volumes of data
- scale with growth
- provide the input/output operations per second (IOPS) needed to deliver data to analytic tools

(30)



Personnel Factors

Analyst Capability (ACAP)

- Analysts work on requirements, high-level design, and detailed design. Consider analysis and design ability, efficiency and thoroughness, and the ability to communicate and cooperate.

ACAP rating levels:
- Very Low: 15th percentile
- Low: 35th percentile
- Nominal: 55th percentile
- High: 75th percentile
- Very High: 90th percentile

Programmer Capability (PCAP)

- Evaluate the capability of the programmers as a team rather than as individuals. Consider ability, efficiency and thoroughness, and the ability to communicate and cooperate.

PCAP rating levels:
- Very Low: 15th percentile
- Low: 35th percentile
- Nominal: 55th percentile
- High: 75th percentile
- Very High: 90th percentile

(31)



Personnel Factors (cont’d)

Applications Experience (APEX)

- Assess the project team's equivalent level of experience with this type of application.

APEX rating levels:
- Very Low: ≤ 2 months
- Low: 6 months
- Nominal: 1 year
- High: 3 years
- Very High: 6 years


(33)



Personnel Factors (cont’d)

Platform Experience (PLEX)

- Assess the project team's equivalent level of experience with this platform, including the OS, graphical user interface, database, networking, and distributed middleware.

PLEX rating levels:
- Very Low: ≤ 2 months
- Low: 6 months
- Nominal: 1 year
- High: 3 years
- Very High: 6 years


(36)

Conclusion - Scale Drivers and Big Data

Scale Drivers: COCOMO II Coverage

Covered

Team Cohesion (TEAM): Covered

(37)

Cost Drivers: COCOMO II Coverage / Future Work

- Reliability (RELY): Covered
- Data (DATA): Need to define an Extra High cost rating for terabyte-to-zettabyte data projects
- Complexity (CPLX): Covered, but needs more detail for Big Data: custom-developed solutions (25% of all projects)
- Reusability (RUSE): Covered
- Documentation (DOCU): Covered
- Time constraint (TIME): Covered
- Storage constraint (STOR): Covered
- Platform volatility (PVOL): Covered

(38)

Conclusion - Cost Drivers and Big Data

Cost Drivers: COCOMO II Coverage / Future Work

- Analyst capability (ACAP): Covered
- Programmer capability (PCAP): Covered
- Applications experience (APEX): Covered
- Platform experience (PLEX): Covered
- Language and tool experience (LTEX): Covered
- Personnel continuity (PCON): Covered
- Software tools (TOOL): Covered
- Multisite development (SITE): Covered
- Required schedule (SCED): Covered

(39)

References

- Boehm, B. W., et al. (2000). Software Cost Estimation with COCOMO II. Prentice Hall, New Jersey.
- Boehm, B. W. (1981). Software Engineering Economics. Prentice Hall, New Jersey.
- McKinsey Global Institute (June 2011). Big data: The next frontier for innovation, competition, and productivity. (www.mckinsey.com/mgi)
- Zikopoulos, P., and Eaton, C. (2011). Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill Osborne Media.
- Enterprise Strategy Group. Research Report: The Convergence of Big Data Processing and Integrated Infrastructure.