
Exploratory Failure Analysis of Open Source Software

Cobra Rahmani, Satish M. Srinivasan, Azad Azadmanesh

College of Information Science & Technology, University of Nebraska-Omaha, Omaha, U.S.

e-mail: {crahmani, smsrinivasan, azad}@unomaha.edu

Abstract¹- Reliability growth modeling for software systems plays an important role in measuring and controlling software quality during software development. One main approach to reliability growth modeling is based on the statistical correlation of observed failure intensities versus estimated ones by the use of statistical models. Although there are a number of statistical models in the literature, this research concentrates on the following seven models: Weibull, Gamma, S-curve, Exponential, Lognormal, Cubic, and Schneidewind. The failure data are collected from five popular open source software (OSS) products. The objective is to determine which of the seven models best fits the failure data of the selected OSS products and which best predicts the future failure pattern based on partial failure history. The outcome reveals that the model that best fits the failure data is not necessarily the best predictor.

Keywords- empirical software engineering; goodness-of-fit; open source software; software reliability growth modeling

I. INTRODUCTION

Over the past years, open source software (OSS) has drawn increasing attention from both the business and academic worlds. The leading concept of open source presented by Raymond [17] differentiates the collaborative open source approach from traditional in-house, proprietary software development. The success of OSS can be attributed to collaboration among volunteers across organizational and geographical boundaries, faster development due to a surge in the number of developers, and platform independence due to its development environment [16]. In this research, five OSS products, namely Eclipse V.2, Apache HTTP Server 2, Firefox, MPlayer OS X, and ClamWin Free Antivirus, are considered for estimating their reliability and for predicting the failure process. The rationale behind choosing these projects is the popularity of the products in terms of number of users and downloads, the time span over which the products have been in operation, and the availability of bug reports. The failure data collection concentrated on two popular archive sources, i.e., Bugzilla [5] and Sourceforge.net [19]. In this study, failure intensities are collected for each product over consecutive two-week (biweekly) intervals. The Failure Intensity (FI) information for MPlayer OS X and ClamWin Free Antivirus is obtained from Sourceforge.net. For Firefox, Eclipse V.2, and Apache 2, the failure information is gathered from Bugzilla. Table I gives more information about the OSS products used in this study. The table highlights the official release year and the duration of the collected failure data for each product.

¹ This research is funded in part by Department of Defense (DoD)/Air Force Office of Scientific Research (AFOSR), NSF Award Number FA9550-07-1-0499, under the title “High Assurance Software”.

Table I. INFORMATION ON COLLECTED FAILURES FOR THE FIVE OSS PRODUCTS

OSS product              Official release year   Start date   End date
Firefox                  1999                    03/1999²     10/2006
Eclipse V.2              2001                    10/2001³     12/2007
Apache 2                 2002                    03/2002      12/2008
ClamWin Free Antivirus   2004                    03/2004      08/2008
MPlayer                  2002                    09/2002      06/2006

(Start date and End date give the duration of the collected failures.)

The study compares seven distribution models in order to determine whether any of them performs consistently with respect to both goodness-of-fit and reliability prediction for the selected OSS products. The study also attempts to shed light on the probable reasons when this consistency cannot be observed. The distribution models are Cubic, Exponential, Gamma, Lognormal, Schneidewind, S-curve, and Weibull [3,8,9,13,14,18,20,22]. These models are chosen for a combination of reasons, such as their capability to provide various distribution shapes and their potential to create distributions that follow the failure patterns of software systems.

The rest of the paper is organized as follows. Section II provides some definitions and background information about the aforementioned distribution models. Section III focuses on failure data analysis and the reliability modeling process. Section IV concludes the paper with a summary.

II. BACKGROUND

Software reliability growth models (SRGMs) have been in existence for approximately 40 years, with the intent of creating models that can accurately quantify software quality. Other distribution models, developed over the years for purposes other than software reliability, are at times also used in the software quality field. The hope is that, once an appropriate model is chosen, its parameters can reflect the software behavior at one or more software phases such as development, testing, and operation [8,10,14]. In [8], a number of models are discussed, with information that can help users and practitioners decide on a model and assess the reliability of software.

² The failure data collected prior to the official release date of Firefox are obtained from Mozilla bug reports.

Lakey [10] provides a number of SRGMs and a flowchart approach for deciding on a model. However, because of the many interacting factors, no single model can be trusted to perform well at all times in estimating reliability or predicting the expected number of remaining defects [1,4,10,21]. To deal with the inaccurate predictions made by SRGMs, some authors have proposed recalibrating the models. That is, the errors in earlier predictions are used to transform the model into a more accurate prediction model [1,4].

In [7], the authors have used non-linear regression to analyze defect data obtained from testing three releases of a commercial system. They have applied four traditional reliability models in order to select the one that best fits the customer defect data as testing progresses. Their study uses the method presented in [21], but their case study is limited to a single commercial software system. In [23], the authors have compared several SRGMs on one set of OSS failure data and concluded that the logarithmic Poisson execution time model fits the actual data set better than the other SRGMs. Their work concentrates mainly on goodness-of-fit, without any assessment of the prediction capability of these models.

In [24], the author compares six different SRGMs on four different data sets taken from previous studies. Eighty percent of the failure data is used to estimate the goodness-of-fit of those models and the other twenty percent is used to validate the prediction capability of the models. The outcomes have shown differences between the best fitting and the best predicting models. A similar approach is used in [11], in that some observations from the end of the failure data are removed for the purpose of comparing prediction performance between two SRGMs.

In this study, the Probability Density Function (PDF) of each chosen distribution model is used to model the failure patterns of the selected products. The PDF, denoted as f(t), shows the relative concentration of failures at different points of time t. The following gives a brief introduction to the seven distribution models considered in this study.

Weibull – The PDF of the Weibull function is:

$$f(t) = \frac{\beta}{\alpha}\left(\frac{t}{\alpha}\right)^{\beta-1} e^{-(t/\alpha)^{\beta}}$$

where α and β represent the scale and shape of the distribution model, respectively. The shape value determines the shape of the graph, and the effect of the scale parameter is to squeeze or stretch the graph.
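As a small illustration of how the shape and scale parameters act, the sketch below evaluates the Weibull PDF with scale α and shape β at a few time points. The function name `weibull_pdf` and all parameter values are illustrative assumptions; they are not fitted to any product in this study.

```python
import math


def weibull_pdf(t: float, alpha: float, beta: float) -> float:
    """Weibull PDF with scale alpha > 0 and shape beta > 0, evaluated at t > 0."""
    if t <= 0 or alpha <= 0 or beta <= 0:
        raise ValueError("t, alpha, and beta must be positive")
    z = t / alpha
    return (beta / alpha) * z ** (beta - 1) * math.exp(-(z ** beta))


# Illustrative values only: a fixed scale and three different shapes.
for beta in (0.5, 1.0, 2.0):
    row = [round(weibull_pdf(t, alpha=10.0, beta=beta), 4) for t in (1, 5, 10, 20)]
    print(f"beta={beta}: {row}")
```

With β < 1 the density decreases from the start, with β = 1 it reduces to an exponential decay, and with β > 1 it rises to a peak before falling, which is the kind of rise-and-fall pattern often seen in failure intensities.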

S-curve – There are different s-shaped distribution models. The one adopted in this study is used by SPSS [20] and has the following PDF:

$$f(t) = e^{\,b_0 + b_1/t}$$

where $b_0$ is a constant and $b_1$ is the regression coefficient. Since $df/dt = -(b_1/t^2)\,f(t)$, the graph slopes upward toward the asymptote $e^{b_0}$ when $b_1$ is negative, and slopes downward when $b_1$ is positive.

Lognormal – The lognormal model assumes that the natural logarithm of time to failure is normally distributed. The parameters µ and σ are the mean and standard deviation of the natural logarithm of time to failure, respectively. The PDF of the lognormal distribution is given by:

$$f(t) = \frac{1}{t\,\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(\ln t - \mu)^2}{2\sigma^2}\right)$$

Furthermore, σ and µ determine the shape and scale of the distribution model, respectively.

Schneidewind – This model assumes that the cumulative number of failures follows a Non-Homogeneous Poisson Process (NHPP) [12,14], which was originally studied in hardware reliability. NHPP models assume that the failure process varies with time and that the cumulative number of failures up to time t is Poisson distributed with the parameter m(t), the mean value of failures. Specifically,

$$P(M(t) = n) = \frac{[m(t)]^{n}}{n!}\, e^{-m(t)}$$

where M(t) and m(t) are the total and the expected number of failures in the interval [0, t], and n is a non-negative integer. The mean value of the distribution model is:

$$m(t) = \frac{\alpha}{\beta}\left(1 - e^{-\beta t}\right)$$

where α is the initial failure rate and β is the negative of the derivative of the failure rate relative to the failure rate itself, so that the failure rate is $m'(t) = \alpha e^{-\beta t}$. Therefore, the expected number of failures during a period $[t_1, t_2]$ is $m(t_2) - m(t_1)$. Schneidewind’s model is built on the belief that the failure frequency changes over time and that recent failures are more beneficial than past failures in predicting the future behavior of the system [8].
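The NHPP quantities described above can be sketched numerically. The code assumes the mean value function $m(t) = (\alpha/\beta)(1 - e^{-\beta t})$ commonly given for Schneidewind's model; the function names and the parameter values are hypothetical and chosen only for illustration.

```python
import math


def mean_failures(t: float, alpha: float, beta: float) -> float:
    """Expected cumulative failures in [0, t] (assumed form: (alpha/beta)(1 - e^(-beta t)))."""
    return (alpha / beta) * (1.0 - math.exp(-beta * t))


def prob_n_failures(n: int, t: float, alpha: float, beta: float) -> float:
    """P(M(t) = n): Poisson probability with mean m(t)."""
    m = mean_failures(t, alpha, beta)
    return m ** n * math.exp(-m) / math.factorial(n)


def expected_in_interval(t1: float, t2: float, alpha: float, beta: float) -> float:
    """Expected number of failures during [t1, t2], i.e. m(t2) - m(t1)."""
    return mean_failures(t2, alpha, beta) - mean_failures(t1, alpha, beta)


# Illustrative parameters: initial rate of 5 failures per period, decay 0.1.
alpha, beta = 5.0, 0.1
print(mean_failures(10, alpha, beta))
print(expected_in_interval(10, 20, alpha, beta))
print(prob_n_failures(30, 10, alpha, beta))
```

Because the failure rate decays over time under this assumed form, the expected count in a later interval of the same length is smaller than in an earlier one, and the cumulative mean levels off at α/β.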

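Gamma – This model has properties similar to those of the Weibull distribution, with the scale and shape parameters α and β, respectively. The PDF of the Gamma distribution is given by:

$$f(t) = \frac{t^{\beta-1}\, e^{-t/\alpha}}{\alpha^{\beta}\,\Gamma(\beta)}$$

where Γ is the gamma function:

$$\Gamma(x) = \int_{0}^{\infty} u^{x-1} e^{-u}\, du$$

It is known [22] that for positive integer values x > 1,

$$\Gamma(x) = (x-1)!$$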

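Exponential – The Exponential distribution is a special case of the Gamma and Weibull distributions with β = 1. Its PDF is given by:

$$f(t) = \frac{1}{\alpha}\, e^{-t/\alpha}$$

Cubic – The PDF of the Cubic model is given as:

$$f(t) = b_0 + b_1 t + b_2 t^2 + b_3 t^3$$

where $b_0$ is a constant and $b_1$, $b_2$, $b_3$ are the regression coefficient values.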
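The relationships among these models can be checked numerically. The sketch below assumes the standard textbook parameterizations with scale α and shape β for the Weibull, Gamma, and Exponential densities and a plain third-degree polynomial for the Cubic model; all function names and values are illustrative.

```python
import math


def weibull_pdf(t, alpha, beta):
    """Weibull density with scale alpha and shape beta."""
    z = t / alpha
    return (beta / alpha) * z ** (beta - 1) * math.exp(-(z ** beta))


def gamma_pdf(t, alpha, beta):
    """Gamma density with scale alpha and shape beta."""
    return t ** (beta - 1) * math.exp(-t / alpha) / (alpha ** beta * math.gamma(beta))


def exponential_pdf(t, alpha):
    """Exponential density with scale alpha (mean alpha)."""
    return math.exp(-t / alpha) / alpha


def cubic(t, b0, b1, b2, b3):
    """Cubic model b0 + b1*t + b2*t^2 + b3*t^3, evaluated by Horner's rule."""
    return b0 + t * (b1 + t * (b2 + t * b3))


alpha = 8.0  # illustrative scale, not a fitted value
for t in (0.5, 2.0, 8.0, 30.0):
    e = exponential_pdf(t, alpha)
    # With shape beta = 1, both Weibull and Gamma collapse to the Exponential.
    assert math.isclose(weibull_pdf(t, alpha, 1.0), e, rel_tol=1e-12)
    assert math.isclose(gamma_pdf(t, alpha, 1.0), e, rel_tol=1e-12)

# Gamma(x) equals (x - 1)! for integer x > 1.
for x in range(2, 10):
    assert math.isclose(math.gamma(x), math.factorial(x - 1))
```

Unlike the other six models, the cubic polynomial is not bounded and can grow without limit (or turn negative) outside the range of the data it was fitted to.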

III. EXPERIMENTAL ANALYSIS

Prior to analyzing the performance estimates of the reliability growth models, the failure data for the five selected OSS products must first be collected and filtered. Therefore, the reliability estimation process is partitioned into three steps: bug-gathering, bug-filtering, and bug-analysis. For the bug-gathering step, a Java program has been developed to extract the raw failure data from the bug repository systems for each product. Although the breadth and depth of the bug reports vary from one repository system to another, each bug report normally contains a unique identification value for the report, the actual time/date the bug is reported, some information about the user reporting the bug, the product name, and also the status of the bug report as filled in by the organization in charge of the product development, such as whether the bug is fixed, valid, or deleted. The quality of reliability estimation depends highly on having sufficient error reports and on the accuracy of the reports provided by the users.

During the second step, i.e., bug-filtering, the reports extracted in the first step are filtered to remove unwanted reports such as duplicated ones. The reason for filtering is that some reports may not represent a real defect, or the information provided may not be complete. Among the bug reports for MPlayer and ClamWin, which are gathered from Sourceforge.net, those reports with a status other than “Deleted” (not a valid bug report) are collected. For the other three software products, the bug reports are gathered from Bugzilla; those bug reports with the following status values are accepted and the rest are discarded: FIXED (bug is fixed), WONTFIX (bug will not be fixed), LATER (bug won’t be fixed in the current product version), and REMIND (bug probably won’t be fixed in the current product version).
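The filtering rules above can be sketched as follows. This is a minimal Python illustration, not the Java program used in the study; the `BugReport` record, its field names, and the treatment of duplicates as repeated report identifiers within one repository are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date

# Bugzilla status values accepted in the filtering step.
ACCEPTED_BUGZILLA = {"FIXED", "WONTFIX", "LATER", "REMIND"}


@dataclass(frozen=True)
class BugReport:  # hypothetical record layout
    bug_id: int
    reported: date
    status: str
    source: str  # "bugzilla" or "sourceforge"


def is_accepted(report: BugReport) -> bool:
    """Apply the per-repository status rule."""
    if report.source == "bugzilla":
        return report.status.upper() in ACCEPTED_BUGZILLA
    return report.status.lower() != "deleted"


def filter_reports(reports):
    """Drop repeated report ids (assumed meaning of 'duplicated') and rejected statuses."""
    seen = set()
    kept = []
    for r in reports:
        key = (r.source, r.bug_id)
        if key in seen:
            continue
        seen.add(key)
        if is_accepted(r):
            kept.append(r)
    return kept


sample = [
    BugReport(1, date(2004, 3, 1), "FIXED", "bugzilla"),
    BugReport(1, date(2004, 3, 1), "FIXED", "bugzilla"),   # repeated id
    BugReport(2, date(2004, 3, 9), "INVALID", "bugzilla"),  # rejected status
    BugReport(3, date(2004, 4, 2), "Open", "sourceforge"),
    BugReport(4, date(2004, 4, 5), "Deleted", "sourceforge"),
]
print([r.bug_id for r in filter_reports(sample)])
```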

[Figure 1 panels: ClamWin, MPlayer, Apache 2, Firefox, Eclipse, and Eclipse V.2.0; x-axis: Biweekly Time; y-axis: Failure Intensity (FI).]

Figure 1. Filtered failure intensities for the selected OSS products

Table II. R² VALUES FOR THE SEVEN DISTRIBUTION FUNCTIONS

OSS product              Cubic   Exponential   Gamma   Lognormal   Schneidewind   S-curve   Weibull
Apache 2                 0.57    0.52          0.55    0.57        0.51           0.07      0.56
ClamWin Free Antivirus   0.55    0.53          0.54    0.50        0.52           0.14      0.55
Eclipse V.2              0.45    0.44          0.58    0.34        0.36           0.09      0.59
Firefox                  0.53    0.45          0.48    0.39        0.25           0.38      0.51

Finally, in the last step, i.e., bug-analysis, the dates of the filtered bug reports are used to organize the reports into biweekly intervals for further analysis. Figure 1⁴ exhibits the failure intensities for the five OSS products. The x-axis and y-axis represent each biweekly period and its corresponding failure intensity, respectively. Also, each graph in the figure shows the interval for which the failure reports are collected. At a quick glance at the figure, Eclipse does not seem to follow a pattern similar to those of the other software products. Further investigation reveals that the bug reports include failures of multiple Eclipse versions. When the reports for each version are separated, the failure intensities for each version generally follow the same pattern as those of the other products. Therefore, rather than dealing with multiple versions of Eclipse with similar patterns, a single version, i.e., Eclipse V.2.0, is analyzed for reliability estimation. The last graph in Figure 1 shows the failure intensities for Eclipse V.2.0. This version is selected for reliability analysis because of its high volume of bug reports in comparison to other versions.
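Grouping report dates into biweekly intervals can be sketched as below. The function name and the choice of counting 14-day periods from a given start date are assumptions for illustration; index 0 of the result corresponds to the first biweekly period.

```python
from collections import Counter
from datetime import date, timedelta


def biweekly_intensities(report_dates, start):
    """Count reports per 14-day period, counted from `start`.

    Returns a list whose k-th entry is the failure intensity of period k+1;
    periods with no reports get 0.
    """
    counts = Counter((d - start).days // 14 for d in report_dates if d >= start)
    if not counts:
        return []
    return [counts.get(k, 0) for k in range(max(counts) + 1)]


start = date(2004, 3, 1)  # illustrative start date
dates = [start, start + timedelta(days=13), start + timedelta(days=14),
         start + timedelta(days=42)]
print(biweekly_intensities(dates, start))
```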

A. Goodness-of-fit Performance

In this study, SPSS is used for conducting the statistical tests of goodness-of-fit. Specifically, Non-Linear Regression (NLR) is employed to measure the goodness-of-fit of the seven distribution models with respect to the selected OSS products. NLR is used because the failure intensities of the selected OSS products follow a curved pattern instead of a linear trend, which is evident from Figure 1. Table II shows the calculated R² values resulting from NLR for the seven distribution functions. R² is a measure of how well the regression estimate fits the failure data [2]. The R² value is between 0 and 1, inclusive. The closer R² is to 1, the stronger the match between the estimated regression and the observed failure data.
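For reference, one common definition of this goodness-of-fit measure is one minus the ratio of the residual sum of squares to the total (mean-corrected) sum of squares. The sketch below computes it for any pair of observed and fitted series; it is an illustration of that definition, not a reproduction of the SPSS procedure, and the sample numbers are made up.

```python
def r_squared(observed, fitted):
    """1 - SS_residual / SS_total for equal-length observed and fitted series."""
    if len(observed) != len(fitted) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    mean = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed series is constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Made-up failure intensities and fitted values, for illustration only.
observed = [3, 7, 12, 9, 6, 4]
fitted = [4, 8, 10, 9, 6, 3]
print(round(r_squared(observed, fitted), 3))
```

For a least-squares fit that does at least as well as predicting the constant mean, the value falls between 0 and 1.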

In Table II, the highest R² value among the distribution models for each product identifies the best-fitting model for that product. Looking at the R² values, the Cubic model exhibits the overall best estimate in fitting the observed failure data. This is followed by the Weibull distribution. Furthermore, not much discrepancy in R² values is noticed between the Cubic and Weibull distributions. One may also observe that the performance of the Gamma distribution is close to that of Weibull. Recall that the Gamma distribution has properties similar to those of the Weibull distribution. Among all the models, S-curve shows the overall worst performance. Table III provides the best fitting models for the OSS products.

Table III. BEST MODELS FOR FITTING THE FAILURE INTENSITY OF THE SELECTED OSS PRODUCTS

⁴ The intensities of bug reports are connected to form smoother plots. The purpose is to better visualize the pattern of failure reports.

The next objective is to determine whether the model showing the best goodness-of-fit is also the best predictor of future failures. To investigate this, the time interval of the collected failure data for each product is halved. The failure data in the first half is used to estimate the parameters of each distribution model. Then, the same parameter estimates are used to forecast the failures during the second half.

B. Prediction Performance

As indicated, the time interval of the failure data is divided into two halves: the first half is used to estimate the model parameters, and the second half is used to evaluate the resulting predictions of future failures. Since the failure data for the software products under study is gathered over at least four years, there appears to be sufficient data in the first half to obtain a decent estimate of the future failures.

Except for Firefox, the failure data for the other products seem to be in a stabilized phase. So there should be a decent fit for the first half interval. Indeed doing the estimate for the first half supports this observation. For Firefox, even though the failure data is collected for over six years, it does not seem that the failure detection and removal are in a stable state. This study attempts the prediction process for Firefox as well, to obtain better insight for situations where sufficient failure data is not available or the reliability growth of a product may not be stable.

As shown in different studies [13,14,15,16], no single reliability model is always superior to the others. But the failure pattern can be used as a simple way to decide on some models believed to provide a decent prediction. The prediction performance of the chosen distribution models is compared by determining the least average difference between the observed and predicted number of failures in the second half interval. This is measured by the Average Predicted Error (APE) formula given below:

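$$\mathrm{APE} = \frac{1}{n}\sum_{i=1}^{n}\left|\,\mathrm{FI}_i^{\mathrm{actual}} - \mathrm{FI}_i^{\mathrm{predicted}}\right|$$

where n is the number of biweekly periods in the second half interval of a product, and $\mathrm{FI}_i^{\mathrm{actual}}$ and $\mathrm{FI}_i^{\mathrm{predicted}}$ are the actual and predicted failure intensities in the i-th of those periods.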
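The halving of the failure data and the APE computation can be sketched as follows. The code reads APE as the mean absolute difference between actual and predicted failure intensities, which is an assumption based on the description of APE as an average difference; the function names and the sample numbers are illustrative only.

```python
def split_halves(series):
    """Split a series of biweekly failure intensities into first and second halves."""
    mid = len(series) // 2
    return series[:mid], series[mid:]


def average_predicted_error(actual, predicted):
    """Mean absolute difference between actual and predicted intensities (assumed form)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up data: the first half would be used to estimate model parameters,
# and the model's forecast is then compared against the second half.
fi = [2, 5, 9, 12, 10, 8, 6, 4]
first_half, second_half = split_halves(fi)
forecast = [9, 8, 8, 5]  # hypothetical predictions for the second half
print(average_predicted_error(second_half, forecast))
```

A smaller value means the forecast stays closer to the observed failure intensities over the second half.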

After calculating the estimated parameters for the first half interval and extending the fitted curves over the second half, Figure 2 gives a graphical view of the predictions of ClamWin failures by the seven distribution models. Due to the lack of space, the prediction graphs for the other products are not shown. However, Table IV shows the APE values for all selected products. As APE shows the average difference between actual failure intensities and predicted ones, a smaller APE value represents a better prediction. As shown in the table, Gamma and Lognormal are good predictors, whereas the Cubic model, identified earlier as a good fitter, has the worst prediction performance for every product except Firefox. Comparing the table with Figure 2, the APE values support the visual patterns in the figure. Table V provides the best predictor models for each product. Based on these observations, it is concluded that a best goodness-of-fit model may not necessarily be a good predictor model.

Table III. BEST MODELS FOR FITTING THE FAILURE INTENSITY OF THE SELECTED OSS PRODUCTS

OSS product              Best fitting model
Apache 2                 Cubic, Lognormal
ClamWin Free Antivirus   Cubic, Weibull
Eclipse V.2              Weibull
Firefox                  Cubic

[Figure 2 series: ClamWin FI and the predictions of the Exponential, Gamma, Lognormal, S-curve, Weibull, Schneidewind, and Cubic models, plotted over biweekly periods.]

Figure 2. ClamWin actual failure intensity and prediction by the seven distribution functions

Table IV. APE VALUES FOR THE SEVEN DISTRIBUTION FUNCTIONS

OSS product              Cubic     Exponential   Gamma   Lognormal   Schneidewind   S-curve   Weibull
Apache 2                 205.23    3.40          3.07    2.17        3.54           4.32      2.84
ClamWin Free Antivirus   2.40      0.99          0.90    1.19        0.93           2.31      1.02
Eclipse V.2              3782.50   71.09         74.19   68.99       71.08          56.70     73.26
Firefox                  82.45     241.06        19.61   14.29       107.16         29.23     27.62
MPlayer                  56.17     1.49          0.71    0.96        1.85           2.95      3.57

Table V. BEST MODELS FOR PREDICTING THE FUTURE FAILURE INTENSITY OF THE SELECTED OSS PRODUCTS

OSS product              Best prediction model
Apache 2                 Lognormal
ClamWin Free Antivirus   Gamma
Eclipse V.2              S-curve
Firefox                  Lognormal

Comparing Tables III and V, it is noticed that the best models for goodness-of-fit and for prediction disagree for the majority of the products. To better understand why the models are not consistent in terms of goodness-of-fit and prediction of future failures, the Firefox product is further scrutinized. The observations are shown in Figure 3. The graph titled “Filtered Bug Pattern” is the same as the Firefox failure intensity shown in Figure 1. “Fitted FI” is the graph fitted by Weibull based on the entire failure data. The other two graphs in the figure show the predictions when the first one year and the first two years of failure data are used for estimating the parameters of Weibull. The early portions of the two graphs are thus the fitted estimates based on one and two years of failure data, and the latter parts of the graphs are the prediction estimates. As anticipated, the prediction based on one year of failure data is very poor. This is because, as the prediction interval lengthens and less failure data is available to depend on, it becomes more difficult to predict future failures. For the same reason, the prediction using two years of failure data is more accurate. This observation could be the reason that the authors in [11,24] chose to predict only a small percentage of failures compared to the total failure data.

Additionally, the graph of the filtered bug pattern in Figure 3 shows a dip at about the 25th biweekly period, which causes Weibull to adjust its estimate accordingly. This forces the one-year fitted graph to continue the decreasing trend of failures as time increases. This dip can also be wrongly interpreted as a sign of the cumulative failure data becoming stable. A similar dip takes place around the 50th biweekly period, although it is not as severe as the dip affecting the one-year failure data used for prediction.

In general, there are many factors that affect the accuracy of prediction. One obvious factor is the type of model used. A survey done in the late 1990s by the American Society for Quality reported that only 4% of the respondents could apply an SRGM [14]. Additionally, applying an SRGM correctly requires a good understanding of the product profile at different stages. For example, whether the failures are independent of each other, whether the defect removals are imperfect, and whether there has been any shift in the operational profile of the product can all affect the prediction estimate. In the authors' own experience, the Eclipse product in Figure 1 shows a failure pattern that can be modeled by multiple distributions such as Weibull. But most likely, the fitted graph would not provide a good estimate of the actual graph. Investigation revealed that the operational profile of Eclipse changes with each release of the product, which happens around January of each year.

Figure 3. Firefox actual failure intensity and prediction by Weibull based on 1-year, 2-years, and the entire failure data.

IV. CONCLUSION

This study has attempted to compare seven reliability models with respect to estimates of failure intensities and failure forecasts against the actual failure data. The bug reports of five different OSS products are collected and used as input to the seven models.

The study has used the R² value obtained from nonlinear regression analysis as the metric for goodness-of-fit. As the second metric, APE is used to determine which model is the best predictor. For the selected products, Weibull and Cubic are promising models for goodness-of-fit, but the Cubic model is shown to be the worst predictor. In general, the Gamma and Lognormal models were the best predictors of future failures, followed by the S-curve model. Therefore, the results show that, because of the many interacting factors, a model able to provide a good fit may not be a good predictor of future failures.

It is reasonable to believe that some failure intensities, called outliers [6], out of a larger sample may have tangible effect on the parameters of the regression estimates. Therefore, as an avenue of future research, it is worth investigating this phenomenon, as to whether forecast of failures is improved when these outliers are removed from the estimation process based on the available failure data. Another avenue is to determine the effect on prediction by recalibrating the models used in this study.

REFERENCES

[1] A.A. Abdel-Ghaly, P.Y. Chan, B. Littlewood, “Evaluation of competing software reliability predictions”, IEEE Transactions on Software Engineering, vol. SE-12, no. 9, pp. 950-967, 1986.

[2] A.D. Aczel, J. Sounderpandian, Complete Business Statistics, 6th Ed., McGraw-Hill, 2005.

[3] P. Asthana, “Jumping the technology S-curve”, IEEE Spectrum, vol. 32, no. 6, pp. 49-54, 1995.

[4] S. Brocklehurst, B. Littlewood, “New ways to get accurate reliability measures”, IEEE Software, pp. 34-42, July 1992.

[5] Bugzilla, http://www.bugzilla.org.

[6] W.J. Conover, Practical Nonparametric Statistics, 3rd Ed., John Wiley, 1999.

[7] R. Hewett et al., “On effective use of reliability models and defect data in software development”, http://www.docstoc.com/docs/20095671/On-Effective-Use-of-Reliability-Models-and-Defect-Data.

[8] IEEE Reliability Society, “IEEE recommended practice on software reliability”, IEEE Std 1633-2008, June 2008.

[9] S.H. Kan, Metrics and Models in Software Quality Engineering, 2nd Ed., Addison-Wesley, 2003.

[10] P. Lakey, A. Neufelder, “System and software reliability assurance notebook”, Rome Laboratory, 1997.

[11] J.S. Lawson, C.W. Wesselman, D.T. Scott, “Simple plots improve software reliability prediction models”, Quality Engineering, vol. 15, no. 3, pp. 411-417, 2003.

[12] M.R. Lyu, Handbook of Software Reliability Engineering, McGraw-Hill, 1996.

[13] R. Mullen, S.S. Gokhale, “The Lognormal distribution of software failure rates: Applications to software reliability growth modeling”, 9th Int’l Symposium on Software Reliability Engineering, pp. 134-142, 1998.

[14] H. Pham, System Software Reliability, Springer, 2006.

[15] H. Pham, L. Nordmann, “A generalized NHPP software reliability model”, 3rd Int’l Conference on Reliability and Quality in Design, 1997.

[16] C. Rahmani, H. Siy, A. Azadmanesh, “An experimental analysis of open source software reliability”, 28th IEEE Symposium on Reliable Distributed Systems, Sep 2009.

[17] E.S. Raymond, “The cathedral and the bazaar: musings on Linux and open source by an accidental revolutionary”, 2nd Ed., O’Reilly, 2001.

[18] N.F. Schneidewind, “Analysis of error processes in computer software”, SIGPLAN Notices, vol. 10, no. 6, pp. 337-346, 1975.

[19] SourceForge, http://sourceforge.net.

[20] SPSS, http://www.spss.com/statistics.

[21] C. Stringfellow, A.A. Andrews, “An empirical method for selecting software reliability growth models”, Empirical Software Engineering, vol. 7, no. 4, pp. 319-343, Dec 2002.

[22] K.S. Trivedi, Probability and Statistics with Reliability and Computer Science Applications, 2nd Ed., John Wiley, 2002.

[23] Y. Tamura, S. Yamada, “Comparison of software reliability assessment methods for open source software and reliability assessment tool”, Journal of Computer Science, vol. 2, no. 6, pp. 489-495, 2006.

[24] D.R.P. Williams, “Prediction capability analysis of two and three parameters software reliability growth models”, Information Technology Journal, vol. 5, no. 6, pp. 1048-1052, 2006.