Small Data in Big Data July 17, 2013 Software Experts Summit Seattle

(1)

Small Data in Big Data

July 17, 2013

Software Experts Summit

Seattle

Ayse Basar Bener Data Science Lab

Mechanical and Industrial Engineering Ryerson University

(2)

ANALYTICS IN SOFTWARE

ENGINEERING

(3)

Data Analytics in Software Engineering

•  To make decisions under uncertainty

•  How to assign available resources and budget end-to-end?

•  Where to allocate scarce testing resources?

•  How much maintenance effort is required?

•  When to stop testing?

•  How confident are we to release the product?

•  …..

•  Expert judgement is a common way to make decisions

•  Bias, availability, limited experience

(4)

Data Analytics in Software Engineering

•  Necessary, but scattered

•  Code versioning systems

•  Code metric repositories

•  Issue tracking/management systems

•  Specialized tools

(5)

Theory (Kahneman)

•  Human mind works in two modes:

– Fast Thinking Mode:

•  Default

•  Based on heuristics

•  Error prone

– Slow Thinking Mode:

•  Reactive: Triggered by Fast Thinking

•  Based on Facts and Logic

(6)

Where Big Data Techniques fit in SE:

Help experts by simulating human slow thinking mode in a “faster” mode!

(7)

The Problem

Ø Sca>ered Research Clusters

Ø Overlooked Research Clusters

Ø Lack of generalizaKon Eﬀorts

Ø Lack of Theory

Ø Privacy Concerns of Industry

(8)

The Vision

•  Theory

•  Interplay of analytics techniques and SE to work like the human brain

•  Human in the loop models

•  Use big data and experts to not only predict the future, but cause the future

•  Practice

•  Tool support

(9)

How?

Ø Stop validaKng, start applying in real sehngs

Ø to provide tools that combine individually validated

research clusters for enabling applicaKons in real sehngs

Ø to refocus on overlooked research clusters, i.e. people in 3P (People, Product, Process).

Ø to form an academic culture paying a>enKon to underlying theories and assumpKons to avoid academic number

crunching exercises.

Ø to extend our eﬀorts beyond individual cases to pursue generalizaKons.

Ø to address the concerns of business side whose data and support are required to realize the above.

(10)

Putting the Bricks Together…

Where we are… Where we should be…

(11)

DOES SIZE MATTER?

(12)

The devil is hidden in the details

(13)

Big Data versus Data Analysis

•  The sum of the small pieces is larger than the whole?

•  Issues

– Access

– Storage

(14)

Size versus Data

•  Any data or meaningful data

•  Centralized or decentralized

•  Collaboration or control

•  Maybe ‘small is beautiful’

(15)

Small Data

•  Data Analytics = Big Data??

•  Context based

– Case studies

•  Model assumptions

– More data to overcome overfitting?

(16)

SMALL DATA EXAMPLES

(17)

Small data is important in software engineering

•  Sampling: Empirical evidence

– Under/micro sampling

– Dimensionality reduction

(18)

Micro Sampling: Use Even Less...

•  Given N defective modules:

–  M = {25, 50, 75, ...} <= N

–  Select M defective and M defect-free modules.

–  Learn theories on 2M instances

•  Undersampling: M=N

•  8/12 datasets -> M = 25

•  1/12 datasets -> M = 75

•  3/12 datasets -> M = {200, 575, 1025}

T. Menzies, B. Turhan, G. Gay, A. Bener, B. Cukic, H. Jiang PROMISE’08
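The selection step on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the `(features, is_defective)` tuple layout, the `micro_sample` name, and the toy data set are all assumptions.

```python
import random

def micro_sample(modules, m, seed=0):
    """Return M defective and M defect-free modules (2M instances in total).

    `modules` is a list of (features, is_defective) pairs; this layout is
    an assumption made for illustration.
    """
    defective = [mod for mod in modules if mod[1]]
    clean = [mod for mod in modules if not mod[1]]
    if m > len(defective) or m > len(clean):
        raise ValueError("M must not exceed the size of either class")
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return rng.sample(defective, m) + rng.sample(clean, m)

# Hypothetical data set: 400 modules, every tenth one defective (N = 40).
modules = [({"loc": 10 + i}, i % 10 == 0) for i in range(400)]
n_defective = sum(1 for _, is_defective in modules if is_defective)

sample = micro_sample(modules, 25)                # micro sampling, M = 25 <= N
undersample = micro_sample(modules, n_defective)  # undersampling, M = N
```

A learner would then be trained on `sample` (2M = 50 instances) instead of all 400 modules.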

(19)

Qualitative Studies in Software Engineering

•  Connecting the dots

•  Field studies

•  Recommendation systems

– Researcher and Practitioner work together

(20)

Problem

Prediction of defect categories

Goal

•  We aimed to increase the information content of the output of a defect prediction model by estimating the categories of defects in the defect-prone software modules.

Challenges

•  Many defect categorization methodologies in the literature.

•  No standard categorization methodology.

(21)

1-Big Data Analysis

•  We had to use the category definitions that were available in multiple datasets: pre- and post-release defect categories.

•  Definition of pre-release and post-release differs among projects.

•  Predictor performance not satisfactory.

•  Categorization not worth the trouble?

•  These categories were not meaningful for

(22)

(23)

2-Small Data Analysis

•  We identified the defect categories with the quality assurance team of the company.

•  We identified metrics that were significantly correlated with the categories by analyzing the small data.

•  The model since the categorization was tailored for the company needs.

•  We improved defect prediction accuracy
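One way to check whether a metric is significantly correlated with a defect category on a small data set is a permutation test. The slides do not say which test was used, so the sketch below is an assumption throughout: the Pearson coefficient, the function names, and the twelve data points are all hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def permutation_p_value(xs, ys, rounds=2000, seed=0):
    """Two-sided p-value: how often shuffled labels correlate at least as strongly."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (rounds + 1)

# Hypothetical small data: a code metric for 12 modules and a 0/1 flag
# saying whether each module had a defect of one particular category.
metric = [3, 4, 5, 6, 7, 8, 15, 17, 18, 20, 22, 25]
in_category = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

r = pearson(metric, in_category)
p = permutation_p_value(metric, in_category)
```

A permutation test makes no distributional assumption, which is convenient when only a handful of modules carry a category label.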

(24)

(25)

Lessons Learned

•  Analysis of data with the key stakeholders of the organization is the key to delivering a valuable solution.

•  Knowledge gained from one customer may not be directly transferable to another.

Caglayan et al., PROMISE 2010; Tosun et al., WeTSOM 2011

(26)

Problem

Confirmation biases of software engineers

Goal

•  to analyze factors affecting software engineers’ confirmation biases.

Motivation

•  due to the confirmatory behavior of software engineers, defects may be introduced during any phase of SDLC

•  Identification of the factors affecting confirmation bias to circumvent its negative effects

Challenges

•  Quantification of confirmation bias

(27)

1-Big Data Analysis

•  Definition of a methodology to quantify confirmation bias levels of software engineers.

•  Formation of confirmation bias metrics set.

•  Formation of a single derived metric.

•  Conducting N-way ANOVA.
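For readers unfamiliar with N-way ANOVA, the following is a minimal sketch for the two-factor (N = 2) balanced case. It is not the analysis from the study: the factor names (role, experience) and the scores are hypothetical, and only the F statistics are computed.

```python
from itertools import product
from statistics import fmean

def two_way_anova(cells):
    """F statistics for a balanced two-factor design.

    `cells` maps (level_a, level_b) to a list of observations; every cell
    must hold the same number of observations (at least two).
    """
    levels_a = sorted({a for a, _ in cells})
    levels_b = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))  # replicates per cell
    if any(len(cells[(a, b)]) != n for a, b in product(levels_a, levels_b)):
        raise ValueError("design must be balanced")

    grand = fmean(x for obs in cells.values() for x in obs)
    mean_a = {a: fmean(x for b in levels_b for x in cells[(a, b)]) for a in levels_a}
    mean_b = {b: fmean(x for a in levels_a for x in cells[(a, b)]) for b in levels_b}

    # Sums of squares for both main effects, the interaction, and the error.
    ss_a = n * len(levels_b) * sum((m - grand) ** 2 for m in mean_a.values())
    ss_b = n * len(levels_a) * sum((m - grand) ** 2 for m in mean_b.values())
    ss_ab = n * sum(
        (fmean(cells[(a, b)]) - mean_a[a] - mean_b[b] + grand) ** 2
        for a, b in product(levels_a, levels_b)
    )
    ss_err = sum((x - fmean(obs)) ** 2 for obs in cells.values() for x in obs)

    df_a, df_b = len(levels_a) - 1, len(levels_b) - 1
    ms_err = ss_err / (len(levels_a) * len(levels_b) * (n - 1))
    return {
        "A": ss_a / df_a / ms_err,
        "B": ss_b / df_b / ms_err,
        "A:B": ss_ab / (df_a * df_b) / ms_err,
    }

# Hypothetical bias scores (in percent) by role and experience level.
cells = {
    ("developer", "junior"): [60, 64, 62],
    ("developer", "senior"): [58, 62, 60],
    ("tester", "junior"): [40, 44, 42],
    ("tester", "senior"): [38, 42, 40],
}
f_stats = two_way_anova(cells)
```

Turning each F statistic into a p-value needs the F distribution, which the Python standard library does not provide; a statistics package would normally be used for that step and for unbalanced designs.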

(28)

(29)

2-‐Small Data Analysis

•  Identified outliers in the data and analyzed them.

•  Interviews with PMs and SEs who are outliers

•  Investigate task load distribution of developers.

–  Outliers had heavy task loads and they were mentally exhausted.

–  Hence, their test results did not reflect their actual confirmation bias levels.

•  We removed the outliers and repeated the analysis
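The slides do not state how the outliers were identified. As one common option, the sketch below flags scores outside the Tukey fences (1.5 times the interquartile range) and then drops them before the analysis is repeated. The engineer ids, the scores, and the choice of rule are assumptions for illustration.

```python
from statistics import quantiles

def tukey_outliers(scores, k=1.5):
    """Return the keys whose value lies outside the Tukey fences.

    `scores` maps an engineer id to a confirmation bias score; both the
    layout and the fence rule are assumptions for illustration.
    """
    q1, _, q3 = quantiles(list(scores.values()), n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return sorted(key for key, value in scores.items() if not low <= value <= high)

# Hypothetical bias scores for eight engineers.
scores = {
    "e1": 0.42, "e2": 0.45, "e3": 0.47, "e4": 0.44,
    "e5": 0.46, "e6": 0.43, "e7": 0.48, "e8": 0.95,
}

flagged = tukey_outliers(scores)  # candidates for follow-up interviews
cleaned = {key: value for key, value in scores.items() if key not in flagged}
```

In the workflow on this slide, the flagged cases are first examined with the people involved; removal from the data set comes only after the interviews explain why the scores are unrepresentative.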

(30)

(31)

Lessons Learned

•  Analysis of data with software engineers and project managers who are involved in the field studies is crucial.

•  Field studies should cover software companies from different domains as much as possible to overcome threats to external validity and to obtain meaningful results.

Calikli & Bener, ASE 2013

(32)

CONCLUSION

(33)

Raw Data - the process

case study

Schutt, R. 2012, Data Science Course Blog: http://columbiadatascience.com/blog

(34)

Small Data

•  Meaningful small data is all you need

–  Theories can be learned from a very small sample of available data

•  We need to understand the underlying concepts

–  Combine with available data and models

•  Combined use of big data techniques and local

models

–  Remove errors with small data

–  Access and use an enormous amount of data for analysis with big data

(35)
