Basic Assumptions - The Rayleigh Model - Metrics & Models in Software Quality Engineering.pdf

Chapter 7. The Rayleigh Model

7.3 Basic Assumptions

Using the Rayleigh curve to model software development quality involves two basic assumptions. The first assumption is that the defect rate observed during the development process is positively correlated with the defect rate in the field, as illustrated in Figure 7.3. In other words, the higher the curve (more area under it), the higher the field defect rate (the GA phase in the figure), and vice versa. This is related to the concept of error injection. Assuming the defect removal effectiveness remains relatively unchanged, the higher defect rates observed during the development process are indicative of higher error injection; therefore, it is likely that the field defect rate will also be higher.

Figure 7.3. Rayleigh Model Illustration I

The second assumption is that given the same error injection rate, if more defects are discovered and removed earlier, fewer will remain in later stages. As a result, the field quality will be better. This relationship is illustrated in Figure 7.4, in which the areas under the curves are the same but the curves peak at varying points. Curves that peak earlier have smaller areas at the tail, the GA phase.

Figure 7.4. Rayleigh Model Illustration II
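Both assumptions can be checked numerically with a short Python sketch (Python is used here for illustration; it is not the SAS program shown later in the chapter). The sketch assumes the standard Rayleigh form in which the cumulative defect count is K[1 - exp(-t^2 / (2 tm^2))], so the area remaining after time t is K exp(-t^2 / (2 tm^2)). The function name, the values of K and tm, and the choice of t = 6.0 as the start of the GA phase are all assumptions made for this example.

```python
import math

def rayleigh_tail(k, tm, t_ga):
    """Area under the Rayleigh curve after time t_ga: K * exp(-t_ga^2 / (2 * tm^2))."""
    return k * math.exp(-t_ga**2 / (2 * tm**2))

T_GA = 6.0  # assumed start of the field (GA) phase, in phase units

# First assumption: same peak time, larger total volume K -> larger tail.
for k in (50.0, 100.0):
    print(f"K={k:6.1f}  tm=2.0  tail={rayleigh_tail(k, 2.0, T_GA):.3f}")

# Second assumption: same K, earlier peak -> smaller tail.
for tm in (1.5, 2.0, 2.5):
    print(f"K= 100.0  tm={tm}  tail={rayleigh_tail(100.0, tm, T_GA):.3f}")
```

With tm held fixed the tail scales in direct proportion to K, and with K held fixed the tail shrinks as the peak moves earlier, which is the pattern Figures 7.3 and 7.4 describe.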

Both assumptions are closely related to the "Do it right the first time" principle. This principle means that if each step of the development process is executed properly with minimum errors, the end product's quality will be good. It also implies that if errors are injected, they should be removed as early as possible, preferably before the formal testing phases, when the costs of finding and fixing defects are much higher than at the front end.

Addison Wesley: Metrics and Models in Software Quality Engineering, Second Edition 7.3 Basic Assumptions

To formally examine the assumptions, we conducted a hypothesis-testing study based on component data for an AS/400 product. A component is a group of modules that perform specific functions such as spooling, printing, message handling, file handling, and so forth. The product we used had 65 components, so we had a good-sized sample. Defect data at high-level design inspection (I0), low-level design inspection (I1), code inspection (I2), component test (CT), system test (ST), and operation (customer usage) were available. For the first assumption, we expect significant positive correlations between the in-process defect rates and the field defect rate. Because software data sets are rarely normally distributed, robust statistics need to be used. In our case, because the component defect rates fluctuated widely, we decided to use Spearman's rank-order correlation. We could not use the Pearson correlation because correlation analysis based on interval data, and regression analysis for that matter, is very sensitive to extreme values, which may lead to misleading results.
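The sensitivity of the Pearson correlation to extreme values can be seen in a small Python sketch. The data below are hypothetical, not the AS/400 component data: six made-up components whose field defect rate rises steadily with the in-process rate, with one component having an extreme in-process value. Spearman's coefficient is computed here as the Pearson coefficient of the ranks, with tied values given their average rank; the function names are illustrative.

```python
from statistics import mean

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical defect rates (defects per KLOC); the last component is an outlier.
in_process = [10, 12, 15, 18, 20, 95]
field = [0.5, 0.6, 0.9, 1.0, 1.2, 1.3]

print(f"Pearson  r = {pearson(in_process, field):.3f}")
print(f"Spearman r = {spearman(in_process, field):.3f}")
```

Because the two made-up series are perfectly monotonic, the rank-based coefficient is unaffected by the outlier, while the Pearson coefficient is pulled well below 1.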

Table 7.1 shows the Spearman rank-order correlation coefficients between the defect rates of the development phases and the field defect rate. Significant correlations are observed for I2, CT, ST, and all phases combined (I0, I1, I2, CT, and ST). For I0 and I1 the correlations are not significant. This finding is not surprising because (1) I0 and I1 are the earliest development phases and (2) in terms of the defect removal pattern, the Rayleigh curve peaks after I1.

Overall, the findings shown in Table 7.1 strongly substantiate the first assumption of the Rayleigh model.

The significance of these findings should be emphasized because they are based on component-level data. For any type of analysis, the more granular the unit of analysis, the less chance it will obtain statistical significance. At the product or system level, our experience with the AS/400 strongly supports this assumption. As another case in point, the space shuttle software system developed by IBM Houston has achieved a minimal defect rate (the onboard software is even defect free). The defect rate observed during the IBM Houston development process (about 12 to 18 defects per KLOC), not coincidentally, is much lower than the industry average (about 40 to 60 defects per KLOC).

To test the hypothesis with regard to the second assumption of the Rayleigh model, we have to control for the effects of variations in error injection. Because error injection varies among components, cross-sectional data are not suitable for the task. Longitudinal data are better, but what is needed is a good controlled experiment. Our experience indicates that even developing different functions by the same team in different releases may be prone to different degrees of error. This is especially the case if one release is for a major-function development and the other release is for small enhancements.

Table 7.1. Spearman Rank Order Correlations

In a controlled experiment, a pool of developers with similar skills and experience must be selected and then randomly assigned to two groups, the experiment group and the control group. The two groups separately develop the same functions at time 1 using the same development process and method. At time 2, the two groups develop another set of functions, again separately and again with the same functions for both groups. At time 2, however, the experiment group intentionally does much more front-end defect removal and the control group uses the same method as at time 1. Moreover, the functions at time 1 and time 2 are similar in terms of complexity and difficulty. If the testing defect rate and field defect rate of the project by the experiment group at time 2 are clearly lower than those at time 1 after taking into account the effect of time (which is reflected by the defect rates of the control group at the two times), then the second assumption of the Rayleigh model is substantiated.
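One simple way to take the effect of time into account in such a design is a difference-in-differences calculation: the change in the experiment group from time 1 to time 2, minus the change in the control group over the same interval. The text does not prescribe this particular calculation; the Python sketch below names it as one possible reading, and the function name and defect rates are hypothetical.

```python
def net_effect(exp_t1, exp_t2, ctl_t1, ctl_t2):
    """Change in the experiment group minus change in the control group."""
    return (exp_t2 - exp_t1) - (ctl_t2 - ctl_t1)

# Hypothetical field defect rates (defects per KLOC), for illustration only.
effect = net_effect(exp_t1=1.0, exp_t2=0.5, ctl_t1=1.0, ctl_t2=0.9)
print(f"net change attributable to early defect removal: {effect:+.2f} defects/KLOC")
```

A clearly negative value would be consistent with the second assumption; in practice a significance test on the group differences would still be needed.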

Without data from a controlled experiment, we can look at the second assumption from a somewhat relaxed standard. In this regard, IBM Houston's data again lend strong support for this assumption. As discussed in Chapter 6, for software releases by IBM Houston for the space shuttle software system from November 1982 to December 1986, the early detection percentages increased from about 50% to more than 85%. Correspondingly, the product defect rates decreased monotonically by about 70% (see Figures 6.1 and 6.2 in Chapter 6). Although the error injection rates also decreased moderately, the effect of early defect removal is evident.


7.4 Implementation

Implementation of the Rayleigh model is not difficult. If the defect data (defect counts or defect rates) are reliable, the model parameters can be derived from the data by computer programs (available in many statistical software packages) that use statistical functions. After the model is defined, estimation of end-product reliability can be achieved by substitution of data values into the model.

Figure 7.5 shows a simple example of implementation of the Rayleigh model in SAS, which uses the nonlinear regression procedure. From the several methods in nonlinear regression, we chose the DUD method for its simplicity and efficiency (Ralston and Jennrich, 1978). DUD is a derivative-free algorithm for nonlinear least squares. It competes favorably with even the best derivative-based algorithms when evaluated on a number of standard test problems.

Figure 7.5 An SAS Program for the Rayleigh Model

/*****************************************************************/

/* */

/* SAS program for estimating software latent-error rate based */

/* on the Rayleigh model using defect removal data during */

/* development */

/* */

/* --- */

/* */

/* Assumes: A 6-phase development process: High-level design(I0)*/

/* Low-level design (I1), coding(I2), Unit test (UT), */

TITLE1 'RAYLEIGH MODEL - DEFECT REMOVAL PATTERN';

OPTIONS label center missing=0 number linesize=95;

/*****************************************************************/

6='GA'

/* Now we estimate the parameters of the Rayleigh distribution */

/* */

/*****************************************************************/

proc NLIN method=dud outest=out1;

/*---*/

/* INPUT B: */

/* The non-linear regression procedure requires initial input */

/* for the K and R parameters in the PARMS statement. K is */

/* the defect rate/KLOC for the entire development process, R is */

/* the peak of the Rayleigh curve. NLIN takes these initial */

/* values and the input data above, goes through an iteration */

/* procedure, and comes up with the final estimates of K and R. */

/* Once K and R are determined, we can specify the entire */

/* Rayleigh curve, and subsequently estimate the latent-error */

/* rate. */

/* Specify the entire Rayleigh curve based on the estimated */

/* Prepare for the histograms in the graph, values on the right */

/* hand side of the assignment statements are the actual */

/* defect removal rates--same as those for the INPUT statement */

/*---*/

/* Now we plot the graph on a GDDM79 terminal screen(e.g., 3279G)*/

/* The graph can be saved and plotted out through graphics */

/* interface such as APGS */

/* */

/*****************************************************************/

goptions device=GDDM79;

* GOPTIONS DEVICE=GDDMfam4 GDDMNICK=p3820 GDDMTOKEN=img240x HSIZE=8 VSIZE=11;

* OPTIONS DEVADDR=(.,.,GRAPHPTR);

proc gplot data=out2;

plot DEF*J DEF1*J/overlay vaxis=0 to 25 by 5 vminor=0 fr hminor=0;

CHI_sq = ( y - E_rate)**2 / E_rate;

proc sort data=temp2; by T;

data temp2; set temp2; by T;

if T=1 then T_chisq = 0;

T_chisq + CHI_sq;

proc sort data=temp2; by K T;

data temp3; set temp2; by K T;

The SAS program estimates model parameters, produces a graph of fitted model versus actual data points on a GDDM79 graphic terminal screen (as shown in Figure 7.2), performs chi-square goodness-of-fit tests, and derives estimates for the latent-error rate. The probability (p value) of the chi-square test is also provided. If the test results indicate that the fitted model does not adequately describe the observed data (p < .05), a warning statement is issued in the output. If proper graphic support is available, the colored graph on the terminal screen can be saved as a file and plotted via graphic plotting devices.
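The goodness-of-fit computation can be sketched in Python as follows. The statistic mirrors the CHI_sq expression in the listing, summing (observed - expected)^2 / expected over the phases. The observed and expected rates are hypothetical, and the degrees of freedom (three here, taken as six phases minus one, minus two estimated parameters) are an assumption for this example rather than a value taken from the SAS program. The upper-tail probability uses the closed-form expressions that exist for integer degrees of freedom, so only the standard library is needed.

```python
import math

def chi_square_stat(observed, expected):
    """Sum of (observed - expected)^2 / expected over all phases."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable, integer df >= 1."""
    z = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-z) times a finite power series in z.
        return math.exp(-z) * sum(z**i / math.factorial(i) for i in range(df // 2))
    # Odd df: complementary error function plus a finite sum.
    tail = math.erfc(math.sqrt(z))
    for i in range(1, (df - 1) // 2 + 1):
        tail += math.exp(-z) * z ** (i - 0.5) / math.gamma(i + 0.5)
    return tail

# Hypothetical defect rates per KLOC by phase (I0, I1, I2, UT, CT, ST).
observed = [9.0, 20.5, 22.0, 16.5, 9.0, 4.0]
expected = [8.7, 21.0, 22.6, 15.7, 8.0, 3.2]

stat = chi_square_stat(observed, expected)
p_value = chi2_sf(stat, df=3)  # df=3 is an assumption for this example
print(f"chi-square = {stat:.3f}, p = {p_value:.3f}")
```

The larger the statistic, the smaller the p value returned by `chi2_sf`.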

In the program of Figure 7.5, r represents tm as discussed earlier. The program implements the model on a six-phase development process. Because the Rayleigh model is a function of time (as are other reliability models), input data have to be in terms of defect data by time. The following time equivalent values for the development phases are used in the program:

I0 = 0.5   I1 = 1.5   I2 = 2.5
UT = 3.5   CT = 4.5   ST = 5.5
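For readers without SAS, the estimation step can be sketched in Python under stated assumptions. This is not a port of Figure 7.5 and it does not use the DUD method: because the model is linear in K, the sketch searches a grid of tm values and solves for the least-squares K in closed form at each one. The Rayleigh rate is taken as K (t / tm^2) exp(-t^2 / (2 tm^2)) evaluated at the time-equivalent values above, and the latent-error rate is taken as the area under the curve after t = 6.0; both choices, the function names, and the input data are assumptions. The input rates are generated from known parameters (K = 60, tm = 2.0) so that the fit can be checked.

```python
import math

# Time-equivalent values for the six development phases.
PHASE_TIMES = {"I0": 0.5, "I1": 1.5, "I2": 2.5, "UT": 3.5, "CT": 4.5, "ST": 5.5}

def rayleigh_rate(t, k, tm):
    """Defect rate at time t: K * (t / tm^2) * exp(-t^2 / (2 * tm^2))."""
    return k * (t / tm**2) * math.exp(-t**2 / (2 * tm**2))

def fit_rayleigh(times, rates, tm_lo=0.5, tm_hi=6.0, step=0.001):
    """Least-squares fit: grid search over tm, closed-form K for each tm."""
    best = None
    steps = int(round((tm_hi - tm_lo) / step))
    for i in range(steps + 1):
        tm = tm_lo + i * step
        shape = [rayleigh_rate(t, 1.0, tm) for t in times]
        k = sum(y * g for y, g in zip(rates, shape)) / sum(g * g for g in shape)
        sse = sum((y - k * g) ** 2 for y, g in zip(rates, shape))
        if best is None or sse < best[2]:
            best = (k, tm, sse)
    return best

def latent_rate(k, tm, t_ga=6.0):
    """Area under the fitted curve after t_ga (assumed start of the field phase)."""
    return k * math.exp(-t_ga**2 / (2 * tm**2))

times = list(PHASE_TIMES.values())
# Synthetic input generated from K=60, tm=2.0; not real project data.
rates = [rayleigh_rate(t, 60.0, 2.0) for t in times]

k_hat, tm_hat, sse = fit_rayleigh(times, rates)
print(f"K = {k_hat:.2f}, tm = {tm_hat:.3f}, latent rate = {latent_rate(k_hat, tm_hat):.3f}")
```

A grid search is slower than an iterative method such as DUD but is easy to verify by hand, which suits a six-point data set.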

Implementations of the Rayleigh model are available in industry. One such example is the Software LIfe-cycle Model tool (SLIM) developed by Quantitative Software Management, Inc., of McLean, Virginia. SLIM is a software product designed to help software managers estimate the time, effort, and cost required to build medium and large software systems. It embodies the software life-cycle model developed by Putnam (Putnam and Myers, 1992), using validated data from many projects in the industry. Although the main purpose of the tool is for life-cycle project management, estimating the number of software defects is one of the important elements. Central to the SLIM tool are two important management indicators. The first is the productivity index (PI), a "big picture" measure of the total development capability of the organization.

The second is the manpower buildup index (MBI), a measure of staff buildup rate. It is influenced by scheduling pressure, task concurrency, and resource constraints. The inputs to SLIM include software size (lines of source code, function points, modules, or uncertainty), process productivity (methods, skills, complexity, and tools), and management constraints (maximum people, maximum budget, maximum schedule, and required reliability). The outputs from SLIM include the staffing curve, the cumulative cost curve over time, probability of project success over time, reliability curve and the number of defects in the product, along with other metrics. In SLIM the X-axis for the Rayleigh model is in terms of months from the start of the project.

As a result of Gaffney's work (1984), in 1985 the IBM Federal Systems Division at Gaithersburg, Maryland, developed a PC program called the Software Error Estimation Reporter (STEER). The STEER program implements a discrete version of the Rayleigh model by matching the input data with a set of 11 stored Rayleigh patterns and a number of user patterns. The stored Rayleigh patterns are expressed in terms of percent distribution of defects for the six development phases mentioned earlier. The matching algorithm involves taking logarithmic transformation of the input data and the stored Rayleigh patterns, calculating the separation index between the input data and each stored pattern, and choosing the stored pattern with the lowest separation index as the best-fit pattern.

Several questions arise about the STEER approach. First, the matching algorithm is somewhat different from statistical estimation methodologies, which derive estimates of model parameters directly from the input data points based on proven procedures. Second, it always produces a best-match pattern even when none of the stored patterns is statistically adequate to describe the input data. There is no mention of how small the separation index must be to indicate a good fit. Third, the stored Rayleigh patterns are far apart; specifically, they range from 1.00 to 3.00 in terms of tm, with a large increment of 0.25. Therefore, they are not sensitive enough for estimating the latent-error rate, which is usually a very small number.

There are, however, circumventions to the last two problems. First, use the separation index conservatively; be skeptical of the results if the index exceeds 1.00. Second, use the program iteratively: After selecting the best-match pattern (for instance, the one with tm = 1.75), calculate a series of slightly different Rayleigh patterns that center at the best-match pattern (for instance, patterns ranging from tm = 1.50 to tm = 2.00, with an increment of 0.05 or 0.01), and use them as user patterns to match with the input data again. The outcome will surely be a better "best match."
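The pattern matching and the iterative refinement described above can be sketched in Python. This is not the STEER program, and its separation index is not defined in the text, so the index below (the sum of squared differences between the logarithms of the input percentages and the pattern percentages) is a stand-in chosen for illustration. The phase boundaries at t = 1 through 6, the function names, and the input pattern are likewise assumptions.

```python
import math

PHASE_ENDS = [1, 2, 3, 4, 5, 6]  # assumed end times of the six phases

def rayleigh_pattern(tm, phase_ends=PHASE_ENDS):
    """Percent distribution of in-process defects over the phases for a given tm."""
    cdf = [1 - math.exp(-t**2 / (2 * tm**2)) for t in phase_ends]
    shares = [cdf[0]] + [b - a for a, b in zip(cdf, cdf[1:])]
    total = sum(shares)
    return [100 * s / total for s in shares]

def separation(observed, pattern):
    """Hypothetical separation index: sum of squared differences of logarithms."""
    return sum((math.log(o) - math.log(p)) ** 2 for o, p in zip(observed, pattern))

def best_match(observed, tms):
    """Return the tm whose pattern has the lowest separation index."""
    return min(tms, key=lambda tm: separation(observed, rayleigh_pattern(tm)))

def tm_grid(start, stop, step):
    """Evenly spaced tm values from start to stop inclusive."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(n + 1)]

# Hypothetical input: a defect pattern that truly peaks at tm = 1.85.
observed = rayleigh_pattern(1.85)

coarse = best_match(observed, tm_grid(1.00, 3.00, 0.25))                 # stored patterns
fine = best_match(observed, tm_grid(coarse - 0.25, coarse + 0.25, 0.05))  # user patterns
print(f"coarse match tm = {coarse}, refined match tm = {fine}")
```

The coarse pass can only land on a multiple of 0.25, while the second pass with user-supplied patterns recovers the finer value.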

When used properly, the first two potential weak points of STEER can become its strong points. In other words, STEER plays down the role of formal parameter estimation and relies heavily on matching with existing patterns. If the feature of self-entered user patterns is used well (e.g., use defect patterns of projects from the same development organizations that have characteristics similar to those of the project for which estimation of defects is sought), then empirical validity is established. From our experience in software reliability projection, the most important factor in achieving predictive validity, regardless of the model being used, is to establish empirical validity with historical data.

Table 7.2 shows the defect removal patterns of a number of projects, the defect rates observed during the first year in the field, the life-of-product (LOP, four years) projection based on the first-year data, and the projected total latent defect rate (life-of-product) from STEER. The data show that the STEER projections are very close to the LOP projections based on one year of actual data. One can also observe that the defect removal patterns and the resulting field defects lend support to the basic assumptions of the Rayleigh model as discussed earlier. Specifically, more front-loaded defect patterns lead to lower field defect rates and vice versa.

Table 7.2. Defect Removal Patterns and STEER Projections (Defects per KLOC)
