(1)

Small Data in Big Data

July 17, 2013

Software Experts Summit

Seattle

Ayse Basar Bener

Data Science Lab

Mechanical and Industrial Engineering

Ryerson University

(2)

ANALYTICS IN SOFTWARE ENGINEERING

(3)

Data Analytics in Software Engineering

• To make decisions under uncertainty

• How to assign available resources and budget end-to-end?

• Where to allocate scarce testing resources?

• How much maintenance effort is required?

• When to stop testing?

• How confident are we to release the product?

• …

• Expert judgement is a common way to make decisions

• Bias, availability, limited experience

(4)

Data Analytics in Software Engineering

• Necessary, but scattered

• Code versioning systems

• Code metric repositories

• Issue tracking/management systems

•  Specialized  tools  

(5)

Theory  (Kahneman)  

• Human mind works in two modes:

  – Fast Thinking Mode:

    • Default

    • Based on heuristics

    • Error prone

  – Slow Thinking Mode:

    • Reactive: Triggered by Fast Thinking

    • Based on Facts and Logic

(6)

Where Big Data Techniques fit in SE:

Help experts by simulating human slow thinking mode in a “faster” mode!

(7)

The  Problem  

Ø Scattered Research Clusters

Ø Overlooked Research Clusters

Ø Lack of Generalization Efforts

Ø Lack  of  Theory  

Ø Privacy  Concerns  of  Industry  

(8)

The  Vision  

• Theory

• Interplay of analytics techniques and SE to work like the human brain

• Human in the loop models

• Use big data and experts to not only predict the future, but cause the future

• Practice

• Tool support

(9)

How?    

Ø Stop validating, start applying in real settings

Ø to provide tools that combine individually validated research clusters for enabling applications in real settings

Ø to refocus on overlooked research clusters, i.e. people in 3P (People, Product, Process).

Ø to form an academic culture paying attention to underlying theories and assumptions to avoid academic number crunching exercises.

Ø to extend our efforts beyond individual cases to pursue generalizations.

Ø to address the concerns of the business side whose data and support are required to realize the above.

(10)

Putting the Bricks Together…

Where we are… Where we should be…

(11)

DOES  SIZE  MATTER?  

(12)

 

 

The  devil  is  hidden  in  the  details  

(13)

Big  Data  versus  Data  Analysis  

• The sum of small pieces is larger than the whole?

• Issues

– Access  

– Storage  

(14)

Size  versus  Data  

•  Any  data  or  meaningful  data  

•  Centralized  or  decentralized  

• Collaboration or control

• Maybe ‘small is beautiful’

(15)

Small  Data  

• Data Analytics = Big Data??

• Context based

– Case studies

• Model assumptions

– More data to overcome overfitting?

(16)

SMALL  DATA  EXAMPLES  

(17)

Small data is important in software engineering

• Sampling: Empirical evidence

– Under/micro sampling

– Dimensionality reduction

 

 

(18)

Micro Sampling: Use Even Less...

• Given N defective modules:

– M = {25, 50, 75, ...} <= N

– Select M defective and M defect-free modules.

– Learn theories on 2M instances

• Undersampling: M = N

• 8/12 datasets -> M = 25

• 1/12 datasets -> M = 75

• 3/12 datasets -> M = {200, 575, 1025}

T. Menzies, B. Turhan, G. Gay, A. Bener, B. Cukic, H. Jiang, PROMISE’08
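The selection step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `micro_sample`, the `(features, is_defective)` tuple layout, and the toy data are hypothetical and are not taken from the cited study.

```python
import random


def micro_sample(modules, m, seed=0):
    """Pick m defective and m defect-free modules at random (2m in total).

    `modules` is a list of (features, is_defective) pairs. That layout is
    an assumption made for this sketch, not the format used in the study.
    """
    defective = [mod for mod in modules if mod[1]]
    clean = [mod for mod in modules if not mod[1]]
    if m > len(defective) or m > len(clean):
        raise ValueError("m exceeds the size of one of the classes")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.sample(defective, m) + rng.sample(clean, m)


# Toy data (hypothetical): 40 defective and 400 defect-free modules,
# each described by a single made-up metric.
modules = [((loc,), loc % 11 == 0) for loc in range(1, 441)]

micro = micro_sample(modules, 25)   # M = 25: learn on 50 instances
under = micro_sample(modules, 40)   # M = N: plain undersampling
```

A learner would then be trained on `micro` (2M instances) instead of the full, imbalanced data set. Choosing M = N reduces to ordinary undersampling.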

(19)

Qualitative Studies in Software Engineering

• Connecting the dots

• Field studies

• Recommendation systems

– Researcher and Practitioner work together

(20)

Problem

Prediction of defect categories

Goal

• We aimed to increase the information content of the output of a defect prediction model by estimating the categories of defects in the defect-prone software modules.

Challenges

• Many defect categorization methodologies in the literature.

• No standard categorization methodology.

(21)

1-Big Data Analysis

• We had to use the category definitions that were available in multiple datasets: pre- and post-release defect categories.

• Definitions of pre-release and post-release differ among projects.

• Predictor performance not satisfactory.

• Categorization not worth the trouble?

• These categories were not meaningful for

(22)
(23)

2-Small Data Analysis

• We identified the defect categories with the quality assurance team of the company.

• We identified metrics that were significantly correlated with the categories by analyzing the small data.

• The model since the categorization was tailored for the company needs.

• We improved defect prediction accuracy
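One way to check whether a metric is significantly correlated with a defect category on a small data set is a Pearson correlation with a permutation test. The sketch below is an assumption about how such a check could look, not the procedure used in the study. The metric name `churn`, the membership vector `in_category`, and all values are made up for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def permutation_p_value(xs, ys, trials=2000, seed=0):
    """Two-sided p-value: how often shuffled labels correlate as strongly."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)


# Hypothetical small data: one metric per module, and whether the module's
# defects fall into a given category (1) or not (0).
churn = [3, 5, 4, 6, 20, 25, 22, 30, 2, 27]
in_category = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]

r = pearson(churn, in_category)
p = permutation_p_value(churn, in_category)
```

A permutation test makes no distributional assumption, which suits samples of this size.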

(24)
(25)

Lessons Learned

• Analysis of data with the key stakeholders of the organization is key to delivering a valuable solution.

• Knowledge gained from one customer may not be directly transferable to another.

Caglayan et al., PROMISE 2010
Tosun et al., WeTSOM 2011

(26)

Problem

Confirmation biases of software engineers

Goal

• To analyze factors affecting software engineers’ confirmation biases.

Motivation

• Due to the confirmatory behavior of software engineers, defects may be introduced during any phase of SDLC

• Identification of the factors affecting confirmation bias to circumvent its negative effects

Challenges

• Quantification of confirmation bias

   

 

(27)

1-Big Data Analysis

• Definition of a methodology to quantify confirmation bias levels of software engineers.

• Formation of confirmation bias metrics set.

• Formation of a single derived metric.

• Conducting N-way ANOVA.
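As a simplified illustration of the ANOVA step, the sketch below computes the F statistic for the one-factor case only. The slide refers to N-way ANOVA, which handles several factors at once, so this is a reduced example and not the analysis that was run. The groups and scores are hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of sample groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation explained by group membership.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation left inside each group.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


# Hypothetical derived-metric scores for three groups of engineers,
# e.g. split by one factor such as role.
scores_by_group = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(scores_by_group)
```

The F value would then be compared against the F distribution with `(df_b, df_w)` degrees of freedom to judge whether the factor has an effect.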

(28)
(29)

2-Small Data Analysis

• Identified outliers in the data and analyzed them.

• Interviews with PMs and SEs who are outliers

• Investigated task load distribution of developers.

– Outliers had heavy task loads and they were mentally exhausted.

– Hence, their test results did not reflect their actual confirmation bias levels.

• We removed the outliers and repeated the analysis
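The slide does not say how outliers were identified. As one common possibility, the sketch below flags values outside the interquartile-range fences (Tukey's rule) so they can be examined and, if justified, excluded before the analysis is repeated. The rule, the function name `split_outliers`, and the scores are assumptions for illustration.

```python
import statistics


def split_outliers(scores, k=1.5):
    """Split scores into (kept, flagged) using IQR fences (an assumed rule)."""
    q1, _, q3 = statistics.quantiles(scores, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [s for s in scores if low <= s <= high]
    flagged = [s for s in scores if s < low or s > high]
    return kept, flagged


# Hypothetical confirmation bias scores for nine engineers.
scores = [0.42, 0.45, 0.47, 0.50, 0.51, 0.53, 0.55, 0.58, 0.95]
kept, flagged = split_outliers(scores)

mean_before = statistics.mean(scores)
mean_after = statistics.mean(kept)   # input for the repeated analysis
```

In the study, the flagged cases were followed up with interviews before removal. The code only covers the mechanical part of that step.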

(30)
(31)

Lessons Learned

• Analysis of data with software engineers and project managers who are involved in the field studies is crucial.

• Field studies should cover software companies from different domains as much as possible to overcome threats to external validity and to obtain meaningful results.

Calikli & Bener, ASE 2013

(32)

CONCLUSION  

(33)

Raw Data - the process

case study

Schutt, R. 2012, Data Science Course Blog: http://columbiadatascience.com/blog

(34)

Small Data

• Meaningful small data is all you need

– Theories can be learned from a very small sample of available data

• We need to understand the underlying concepts

– Combine with available data and models

• Combined use of big data techniques and local models

– Remove errors with small data

– Access and use enormous amounts of data for analysis with big data

 

(35)

References
