Small Data in Big Data
July 17, 2013
Software Experts Summit
Seattle
Ayse Basar Bener
Data Science Lab
Mechanical and Industrial Engineering Ryerson University
ANALYTICS IN SOFTWARE
ENGINEERING
Data Analytics in Software Engineering
• To make decisions under uncertainty
• How to assign available resources and budget end-to-end?
• Where to allocate scarce testing resources?
• How much maintenance effort is required?
• When to stop testing?
• How confident are we to release the product?
• …..
• Expert judgement is a common way to make decisions
• Bias, availability, limited experience
Data Analytics in Software Engineering
• Necessary, but scattered
• Code versioning systems
• Code metric repositories
• Issue tracking/management systems
• Specialized tools
Theory (Kahneman)
• Human mind works in two modes:
– Fast Thinking Mode:
• Default
• Based on heuristics
• Error prone
– Slow Thinking Mode:
• Reactive: Triggered by Fast Thinking
• Based on Facts and Logic
Where Big Data Techniques fit in
SE:
Help experts by simulating human slow thinking mode in a “faster” mode!
The Problem
Ø Scattered Research Clusters
Ø Overlooked Research Clusters
Ø Lack of Generalization Efforts
Ø Lack of Theory
Ø Privacy Concerns of Industry
The Vision
• Theory
• Interplay of analytics techniques and SE to work like the human brain
• Human in the loop models
• Use big data and experts to not only predict the future,
but cause the future
• Practice
• Tool support
How?
Ø Stop validating, start applying in real settings
Ø to provide tools that combine individually validated research clusters for enabling applications in real settings
Ø to refocus on overlooked research clusters, i.e. people in 3P (People, Product, Process).
Ø to form an academic culture that pays attention to underlying theories and assumptions to avoid academic number crunching exercises.
Ø to extend our efforts beyond individual cases to pursue generalizations.
Ø to address the concerns of the business side, whose data and support are required to realize the above.
Putting the Bricks Together…
Where we are… Where we should be…
DOES SIZE MATTER?
The devil is hidden in the details
Big Data versus Data Analysis
• The sum of small pieces is larger than the whole?
• Issues
– Access
– Storage
Size versus Data
• Any data or meaningful data
• Centralized or decentralized
• Collaboration or control
• Maybe ‘small is beautiful’
Small Data
• Data Analytics = Big Data??
• Context based – Case studies
• Model assumptions
– More data to overcome overfitting?
SMALL DATA EXAMPLES
Small data is important in software engineering
• Sampling: Empirical evidence
– Under/micro sampling
– Dimensionality reduction
Micro Sampling: Use Even Less...
• Given N defective modules:
– M = {25, 50, 75, ...} <= N
– Select M defective and M defect-free modules.
– Learn theories on 2M instances
• Undersampling: M=N
• 8/12 datasets -> M = 25
• 1/12 datasets -> M = 75
• 3/12 datasets -> M = {200, 575, 1025}
T. Menzies, B. Turhan, G. Gay, A. Bener, B. Cukic, H. Jiang PROMISE’08
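The selection step of micro sampling can be sketched in a few lines. This is a minimal illustration under assumptions, not the procedure from the cited paper: the `(features, is_defective)` record layout, the `micro_sample` name, the `seed` parameter, and the toy data are all hypothetical.

```python
import random


def micro_sample(modules, m, seed=0):
    """Return M defective and M defect-free modules (2M instances in total).

    `modules` is a list of (features, is_defective) pairs. This layout is an
    assumption made for the sketch; the slides do not specify one.
    """
    defective = [mod for mod in modules if mod[1]]
    clean = [mod for mod in modules if not mod[1]]
    if m > len(defective) or m > len(clean):
        raise ValueError("M must not exceed the size of either class")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.sample(defective, m) + rng.sample(clean, m)


# Toy data set (illustrative values only): 30 defective modules out of 200.
modules = [({"loc": 10 + i}, i < 30) for i in range(200)]

micro = micro_sample(modules, m=25)   # micro sampling with M = 25
under = micro_sample(modules, m=30)   # undersampling is the case M = N
```

A learner would then be trained on the 2M selected instances only; the choice of learner is outside the scope of this sketch.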
Qualitative Studies in Software Engineering
• Connecting the dots
• Field studies
• Recommendation systems
– Researcher and Practitioner work together
Problem
Prediction of defect categories
Goal
• We aimed to increase the information content of the output of a defect prediction model by estimating the categories of defects in the defect-prone software modules.
Challenges
• Many defect categorization methodologies in the literature.
• No standard categorization methodology.
1-Big Data Analysis
• We had to use the category definitions that were available in multiple datasets: pre- and post-release defect categories.
• Definition of pre-release and post-release differs among projects.
• Predictor performance not satisfactory.
• Categorization not worth the trouble?
• These categories were not meaningful for
2-Small Data Analysis
• We identified the defect categories with the quality assurance team of the company.
• We identified metrics that were significantly correlated with the categories by analyzing the small data.
• The model since the categorization was tailored for the company needs.
• We improved defect prediction accuracy
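One way to check whether a metric is significantly correlated with a defect category on a small sample is a permutation test. The sketch below is an assumption about how such a check could look, not the analysis used in the cited studies: the metric name `churn`, the binary `in_category` indicator, the helper names, and all data values are hypothetical.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(xs, ys, trials=2000, seed=0):
    """Two-sided permutation p-value for the correlation of xs and ys.

    Shuffles ys repeatedly and counts how often the shuffled correlation is
    at least as strong as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)  # add-one correction avoids p = 0


# Hypothetical small data: one metric value per module, and whether the
# module's defects fall into a given category (1) or not (0).
churn = [2, 3, 5, 8, 13, 21, 34, 40, 41, 45, 50, 60]
in_category = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

r = pearson(churn, in_category)
p = permutation_p_value(churn, in_category)
```

A permutation test makes no distributional assumption, which suits samples this small; a metric would be kept as a predictor only if `p` falls below a threshold chosen in advance.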
Lessons Learned
• Analysis of data with the key stakeholders of the organization is the key to delivering a valuable solution.
• Knowledge gained from one customer may
not be directly transferrable to another.
Caglayan et al., PROMISE 2010
Tosun et al., WETSoM 2011
Problem
Confirmation biases of software engineers
Goal
• to analyze factors affecting software engineers’ confirmation biases.
Motivation
• due to the confirmatory behavior of software engineers, defects may be introduced during any phase of SDLC
• Identification of the factors affecting confirmation bias to circumvent its negative effects
Challenges
• Quantification of confirmation bias
1-Big Data Analysis
• Definition of a methodology to quantify confirmation bias levels of software engineers.
• Formation of confirmation bias metrics set.
• Formation of a single derived metric.
• Conducting N-way ANOVA.
2-‐Small Data Analysis
• Identified outliers in the data and analyzed them.
• Interviewed PMs and SEs who were outliers
• Investigated the task load distribution of developers.
– Outliers had heavy task loads and they were mentally exhausted.
– Hence, their test results did not reflect their actual confirmation bias levels.
• We removed the outliers and repeated the analysis
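The remove-and-repeat step can be illustrated with a small sketch. Several things here are assumptions rather than the study's method: outliers are flagged with Tukey's 1.5 × IQR rule (the slides do not say how they were identified), a one-way ANOVA F statistic stands in for the N-way ANOVA, and the group labels and scores are hypothetical.

```python
from statistics import mean, quantiles


def iqr_outlier_flags(values, k=1.5):
    """Flag values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v < low or v > high for v in values]


def one_way_f(groups):
    """One-way ANOVA F statistic for a list of groups of numbers."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def split_by_group(records):
    """Turn (group, score) records into a list of per-group score lists."""
    by_group = {}
    for group, score in records:
        by_group.setdefault(group, []).append(score)
    return list(by_group.values())


# Hypothetical derived bias scores for two groups of engineers.
records = [("A", s) for s in (0.30, 0.32, 0.35, 0.31, 0.33, 0.95)] + \
          [("B", s) for s in (0.50, 0.52, 0.55, 0.51, 0.53, 0.54)]

flags = iqr_outlier_flags([score for _, score in records])
kept = [rec for rec, is_outlier in zip(records, flags) if not is_outlier]

f_before = one_way_f(split_by_group(records))
f_after = one_way_f(split_by_group(kept))
```

In this toy data a single extreme score masks the difference between the groups. As the slide notes, an outlier should be dropped only after its cause is understood, here through interviews, rather than automatically.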
Lessons Learned
• Analysis of data with software engineers and project managers who are involved in the field studies is crucial.
• Field studies should cover software companies from different domains as much as possible to overcome threats to external validity and to obtain meaningful results.
Calikli & Bener, ASE 2013
CONCLUSION
Raw Data - the process
case study
Schutt, R. 2012, Data Science Course Blog: http://columbiadatascience.com/blog
Small Data
• Meaningful small data is all you need
– Theories can be learned from a very small sample of available data
• We need to understand the underlying concepts
– Combine with available data and models
• Combined use of big data techniques and local
models
– Remove errors with small data
– Access and use enormous amounts of data for analysis with big data