• No results found

Data Collection from Open Source Software Repositories

N/A
N/A
Protected

Academic year: 2021

Share "Data Collection from Open Source Software Repositories"

Copied!
13
0
0

Loading.... (view fulltext now)

Full text

(1)

Data Collection from Open

Source Software Repositories

G

ORAN

M

AUŠA

, T

IHANA

G

ALINAC

G

RBAC

F

ACULTY OF

E

NGINEERING
(2)

Software Defect Prediction (SDP)

Aim:

◦ Focus testing effort to software units with higher fault-proneness probability

Motivation:

◦ High testing costs (80% after release)

◦ Pareto principle can be applied

SDP approach:

◦ Classification based on

parameters of size and complexity

Predictive model building:

Evaluation Training Set Testing Set Model Building Data

Collection processing Data Pre- Division Data

Bugs

80%

20%

Code

80%

20%

Data Collection
(3)

Data Collection for SDP

Motivation:

◦ The context of project development may influence SDP performance

◦ Small number of available datasets => inability to study the context influence

Problem:

◦ Lack of systematic data collection approach

◦ Data collection is time consuming and not trivial

Potential Source of data:

1. Industrial (large telecom. software)

 Rarely available

2. Open repositories (PROMISE gives NASA datasets)

 Impossible to validate (missing data collection procedure and source code)

 Often suffer from: missing values, outliers, duplicated entries, unbalance,...

3. Open source projects (Eclipse, Mozilla, Apache)

(4)

Open Source Software Repositories

Linking 2 repositories :

◦ Source code management & bug tracking

◦ Structured and unstructured data

◦ Problem: there is no formal link

◦ Consequence: different approach -» data bias

Important characteristics :

Bug status: (closed / opened)

Bug resolution: (fixed / otherwise)

Bugs severity:

(blocker - normal / +trivial / +enhancement)

Repository search order:

(start with bugs / source code changes)

Declaration of defect-free units

(5)

Data Collection for SDP

Linking Techniques :

◦ Simple search

◦ Regular expression search

◦ Authorship correspondence

◦ Time correlation

◦ Advanced NLP techniques (ReLink)

Issues :

Granularity level (package / file / class / method)

Software metrics (product / development & process / usage)

Bug – File cardinality (many – to – many)

Bug – File duplicated links

Bug ID varying length

Bug tracking

repository

Bug ID Bug assignee Bug closed Release Bug opened Comments

Source code

repository

Commit message Commit author Commit date Release tag Release date
(6)

Bug – Code (BuCo) Analyzer Tool

[SoftCOM 2014]

Tool developed through :

Systematic literature review

(36 papers from [1] + 35 / 136 / 4447)

Exploratory study

(12 students, observer triangulation, 5 projects, 4 exercises, 5 data forms, 52 tasks)

Software product metrics tools review

(iterative review 35 / 19 / 5 / 2 tools)

Iterative development

(30 students - 13 groups)

Systematic comparison of techniques

(7 techniques, 5 projects, 37 releases)

Tool properties :

Automatic data collection

Simple interface

6 bug-code linking techniques

Calculation of 50 product metrics Bug counting

Report generation

[1] Hall T, Beecham S, Bowes D, Gray D, Counsell S: A systematic literature review on fault prediction performance in software engineering, IEEE Trans Softw Eng 38(6), pp.1276-1304, 2012

(7)

Bug – Code (BuCo) Analyzer Tool

[SoftCOM 2014]

Tool offers:

◦ Bug download from Bugzilla of Eclipse, Apache and Mozilla communities

◦ SCM download from GIT

◦ Bug-Code linking techniques:

◦ Automatic calculation of product metrics

(8)

Bug – Code Linking Techniques

[

]

Analysis 1 :

◦ Comparison: Simple search & ReLink

◦ Aim: define Regex Search

◦ Project: Apache HTTPD

◦ Source: ReLink data, GIT repository

Analyses 2 & 3 :

◦ Comparison: Regex search & ReLink

◦ Aim: benchmark evaluation

◦ Projects: Apache HTTPD, OpenNLP

(9)

Bug – Code Linking Techniques - Results

[

]

Analysis 1 – results :

◦ Unequal input & linking output:

◦ Manual investigation revealed:

(10)

Bug – Code Linking Techniques - Results

[

]

Analyses 2 & 3 – results :

◦ OpenNLP – benchmark dataset (equal input), different linking output:

◦ Manual investigation – REGEX :

(11)

Bug – Code Linking Techniques - Conclusion

[

]

The generalization of research requires:

Datasets from various domains

Systematic procedure with limited bias

Bug – Code linking

Proven to be prone to bias

Complex technique outperformed by regular expression search

Future research

Compare the whole data collection process approaches

(12)

Current Research

Developing a systematic data collection procedure for SDP

Comparison of different linking techniques on various environments:

Comparison of the most

popular SZZ approach [2]

to our own

◦ Interactions between different techniques, approaches and datasets used in our experiment

(13)

Thank you for you attention!

References

Related documents

While it is true that humans are not naturally good monitors, we firmly believe that crew monitoring performance can be significantly improved through policy changes, training and

Total Weak Acceptable Good V.. means that the students' abilities to solve the mathematical problem vary according to their learning styles. This difference is evident in

The major findings of the current study suggest that, among our sample of Chinese business undergraduates in Macau, personality, learning style, and motivational orientation

SEER Cancer Stat Fact Sheet, Cancer of the Brain and Other Nervous System; Mattson Jack Brain Cancer Epidemiology, December 2007, Mattson Jack Cancer Treatment Architecture,

reforms, Attitute towards Indian religions Social and Economic impact of The rule of East India Company...

Company with ERP system data transmission master data adminstration PersoBOX operator basic settings start personalization process data send data as personalized functions

Crucially, repayment of debt is a credible signal of low-risk status because low-risk people have an advantage in resisting opportunistic behavior in the credit market, and

Finally, incorporation of the dependence of volatility estimates on the tenor of the derivative contracts and the level of the principal underlying risk factor price (i.e.