2.5 Experiment I – Algorithm Viability

In this experiment, we demonstrate the viability of the proposed algorithm in terms of the time taken to derive a suitable input profile for the three example SUTs.

The research questions addressed are:

RQ-1 Is the proposed algorithm capable of deriving near-optimal input profiles (i.e. profiles for which the minimum coverage probability is close to the maximum possible)?

RQ-2 How much time is required, on average, for the algorithm to derive such profiles?

2.5.2 Preparation

The algorithm parameter settings used in this experiment are listed in table 4. Since the objective of the experiment is to demonstrate viability rather than to determine the best performance of the algorithm, only a small amount of effort was spent in tuning the parameters. Section 2.2.2 noted the trade-off between accurately estimating the minimum coverage probability for use as the fitness metric and the resources consumed by the additional executions of the instrumented SUT needed to achieve this accuracy. The settings of the sample size parameter, K, were chosen in the context of this trade-off, but erring on the side of accuracy: the settings are sufficiently high that, even for candidate profiles with minimum coverage probabilities an order of magnitude or more lower than the target minimum coverage probability for these SUTs, every coverage element is exercised many times and thus a relatively accurate estimate of the minimum coverage probability can be made. The lower setting of K for simpleFunc reflects its smaller number of coverage elements and the higher target minimum coverage probability achievable for this SUT compared with bestMove and nsichneu.
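To make the role of K concrete, the following is a minimal sketch of how a minimum coverage probability could be estimated from a sample of K executions. It assumes that each execution of the instrumented SUT yields the set of coverage elements it exercised. The function name `estimate_min_coverage` and the toy data are hypothetical and are not taken from the implementation used in the experiment.

```python
from collections import Counter
from typing import Hashable, Iterable, Set


def estimate_min_coverage(
    executions: Iterable[Set[Hashable]], elements: Set[Hashable]
) -> float:
    """Estimate the minimum coverage probability from a sample of executions.

    Each item of `executions` is the set of coverage elements exercised by one
    run of the instrumented SUT (an assumed data format); `elements` is the
    full set of coverage elements of the SUT.
    """
    counts = Counter()
    k = 0  # number of executions seen, i.e. the sample size K
    for covered in executions:
        k += 1
        counts.update(covered & elements)
    if k == 0:
        raise ValueError("at least one execution is required")
    # An element that was never exercised has a count of 0, so the estimate
    # of the minimum coverage probability is then 0.
    return min(counts[e] / k for e in elements)


# Hypothetical sample: K = 8 executions over three coverage elements.
sample = [{"a"}, {"a", "b"}, {"b"}, {"c"}, {"a"}, {"b"}, {"a", "c"}, {"b"}]
fitness = estimate_min_coverage(sample, {"a", "b", "c"})
print(fitness)  # element "c" is exercised in 2 of the 8 runs, giving 0.25
```

A larger K reduces the sampling noise in each per-element proportion, at the cost of K executions of the SUT for every candidate profile evaluated.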

As discussed in section 2.2.3 above, the weights of mutations that decrease the size of the representation (Mjoi and Mrem) are set higher than those that increase it (Mspl and Madd) in order to encourage parsimony in the representation, with the objective of improving algorithm performance and avoiding excessive memory usage.
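The effect of these weights can be illustrated with a short sketch. It assumes that the weights act as relative selection probabilities when a mutation operator is chosen; the precise selection mechanism is defined in section 2.2.3 and may differ in detail. The helper names are hypothetical, and the weight values are those listed in table 4.

```python
import random

# Mutation weights as listed in table 4.
MUTATION_WEIGHTS = {
    "Mprb": 1.0,
    "Mlen": 1.0,
    "Mspl": 0.8,
    "Mjoi": 1.0,
    "Madd": 0.8,
    "Mrem": 1.0,
}


def selection_probabilities(weights):
    """Normalise the weights so that they sum to 1 (assumed interpretation)."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def choose_mutation(rng, weights=MUTATION_WEIGHTS):
    """Pick one mutation operator with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


probs = selection_probabilities(MUTATION_WEIGHTS)
rng = random.Random(1)  # fixed seed so the sketch is repeatable
picked = [choose_mutation(rng) for _ in range(5)]
print(probs["Mjoi"], probs["Mspl"])
print(picked)
```

Under this interpretation, each size-decreasing operator (Mjoi, Mrem) is selected with probability 1.0/5.6 and each size-increasing operator (Mspl, Madd) with probability 0.8/5.6, so the search is mildly biased towards smaller representations.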

For simpleFunc and bestMove, the optimal minimum coverage probability can be determined by manual examination of the code as 0.25 and 0.1667, respectively. For these SUTs, the target fitness, τpmin, was set to be close to, but not at, these optimum values: obtaining exactly the optimal fitness may require an infrequent chance occurrence or may be impossible given that the fitness is calculated using a finite sample. For example, if there are four mutually exclusive coverage elements, obtaining an optimal fitness of 0.25 would require each coverage element to be exercised by exactly one quarter of the sample, which may occur only rarely even if the actual minimum coverage probability of the profile is this optimal value, and would be impossible if the number of samples is not a multiple of four.
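The four-element example can be quantified with a small calculation. The sketch below assumes an idealised profile under which the four mutually exclusive coverage elements are each exercised with probability exactly 0.25, so that the per-element counts in a sample follow a multinomial distribution. The function name is hypothetical.

```python
from math import comb


def prob_exactly_equal_split(k: int, n_elements: int = 4) -> float:
    """Probability that k samples split into exactly k/n hits per element.

    Assumes n mutually exclusive, equiprobable coverage elements, which is
    the idealised optimal profile of the example in the text.
    """
    if k % n_elements != 0:
        return 0.0  # an exactly equal split is impossible
    per_element = k // n_elements
    ways = 1
    remaining = k
    # Multinomial coefficient built as a product of binomial coefficients.
    for _ in range(n_elements):
        ways *= comb(remaining, per_element)
        remaining -= per_element
    return ways / n_elements ** k


print(prob_exactly_equal_split(4))    # 24/256 = 0.09375
print(prob_exactly_equal_split(5))    # 0.0: 5 is not a multiple of four
print(prob_exactly_equal_split(200))  # K = 200, as used for simpleFunc
```

For K = 200 the probability is far below 1%, which illustrates why the target fitness is set slightly below the optimum rather than at it.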

For nsichneu, the code is too complex to derive an optimal probability manually, and so the target was set close to the maximum value observed during preliminary experimentation.

2.5.3 Method

For each of the three SUTs, the algorithm was executed 32 times. We will refer to each execution of the algorithm as a trial.

| Parameter | Effect | SUT | Value |
|---|---|---|---|
| λ | neighbourhood sample size | | 4 |
| ρprb | bin probability mutation factor | | 10.0 |
| ρlen | bin length mutation factor | | 10.0 |
| wprb | Mprb mutation weight | | 1.0 |
| wlen | Mlen mutation weight | | 1.0 |
| wspl | Mspl mutation weight | | 0.8 |
| wjoi | Mjoi mutation weight | | 1.0 |
| wadd | Madd mutation weight | | 0.8 |
| wrem | Mrem mutation weight | | 1.0 |
| K | evaluation sample size | simpleFunc | 200 |
| | | bestMove | 1 000 |
| | | nsichneu | 1 000 |
| τpmin | target min. coverage probability | simpleFunc | 0.24 |
| | | bestMove | 0.14 |
| | | nsichneu | 0.035 |
| τiter | maximum no. iterations | simpleFunc | 4×10^3 |
| | | bestMove | 6×10^4 |
| | | nsichneu | 4×10^3 |

Table 4 – Algorithm parameter settings used in Experiment I. The SUT name is listed in the column ‘SUT’ only for those settings that differ between the SUTs.

Each trial was provided with a different seed for the pseudo-random number generator. The seeds were obtained from the website random.org, which generates random numbers from atmospheric noise at radio frequencies.

The choice of 32 trials is a compromise between the accuracy of these results and the resources—computing power and time—that were available for experimentation. During the analysis of the results below, confidence intervals are provided as an indication of the accuracy achieved.

The trials were run on a set of Linux servers. Each trial used one core of a CPU running at 3 GHz.

The responses measured were whether or not the algorithm trial was successful in deriving a profile with the target fitness, and the algorithm’s run time. We chose processor time (obtained using the standard C function clock()) as the run time metric because it is less sensitive to the effect of other processes running on the same server than the alternative measure of elapsed time.
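The distinction between processor time and elapsed time can be seen in a short sketch. It uses Python's `time.process_time()` as a stand-in for the C function clock(); this is an illustrative analogue only, not the measurement code used in the experiment. The `measure` helper is hypothetical.

```python
import time


def measure(task):
    """Return (processor time, elapsed time) in seconds for one call of task."""
    cpu_start = time.process_time()   # CPU time consumed by this process
    wall_start = time.perf_counter()  # elapsed (wall-clock) time
    task()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu, wall


# Sleeping stands in for time during which the process is not scheduled,
# for example because other processes are using the same core.
cpu, wall = measure(lambda: time.sleep(0.2))
print(f"processor time: {cpu:.3f} s, elapsed time: {wall:.3f} s")
```

The elapsed time includes the period during which the process was idle, whereas the processor time counts only the work done by the process itself.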

2.5.4 Results

The results are summarised in table 5. The reported values are the mean of the responses observed in the 32 trials for each SUT. The 95% confidence intervals are estimated by a bootstrapping technique.
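The text does not state which bootstrapping variant was used. As one possibility, the sketch below implements a simple percentile bootstrap for the mean of a response. The function names, the number of resamples, and the seed are assumptions made for illustration. The example data encode 23 successes in 32 trials, a count inferred from the rounded proportion 0.72 reported for bestMove.

```python
import random


def mean(values):
    return sum(values) / len(values)


def bootstrap_ci(data, stat=mean, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        stat([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lower = estimates[round(tail * n_resamples)]
    upper = estimates[round((1 - tail) * n_resamples) - 1]
    return lower, upper


# Inferred example data: 1 = successful trial, 0 = unsuccessful trial.
outcomes = [1] * 23 + [0] * 9
low, high = bootstrap_ci(outcomes)
print(f"mean = {mean(outcomes):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

The interval bounds can then be reported as differences from the mean, as in table 5.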

The response π+ is the mean proportion of trials that succeeded in deriving a suitable input profile: one for which the fitness (the estimated minimum coverage probability) met or exceeded the target probability. T+ is the mean processing time taken by trials that are successful, and T₋ the mean time taken when the trial is unsuccessful and is terminated when the maximum number of iterations is reached.

| Mean Response | simpleFunc | bestMove | nsichneu |
|---|---|---|---|
| Proportion of trials successful, π+ | 1.0 | 0.72 (+0.13, −0.19) | 0.31 (+0.16, −0.16) |
| Time taken by successful trial, T+ (min) | 0.04 (+0.02, −0.01) | 80 (+12, −14) | 144 (+25, −30) |
| Time taken by unsuccessful trial, T₋ (min) | - | 124 (+3, −4) | 204 (+5, −4) |
| Estimated total time until success, C+ (min) | 0.04 (+0.02, −0.01) | 129 (+76, −40) | 592 (+702, −252) |

Table 5 – A summary of results of Experiment I. Upper and lower bounds of the 95% confidence intervals are reported in parentheses after the mean, as differences from the mean response.

If we assume that after an unsuccessful trial, the test engineer would in practice run further trials of the algorithm (with different seeds) until one is successful, then we may estimate the total time taken until a near-optimal profile is obtained. To do this, the algorithm trials are modelled as a Bernoulli process: each trial is an independent random variable with two outcomes, success and failure, where the probability of success is π+. For such processes, the number of failures until the first success follows a geometric distribution with mean (1−π+)/π+. Thus the average time taken for the mean number of failures followed by one successful trial is:

$$C_{+} = \frac{1-\pi_{+}}{\pi_{+}}\,T_{-} + T_{+} \qquad (26)$$

This estimate is reported in the final row of table 5.
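As a worked check of equation 26, the sketch below evaluates the estimate from the mean responses in table 5 and compares it with a simple simulation of repeated trials. The success proportions 23/32 and 10/32 are inferred from the rounded values 0.72 and 0.31, and the times are the rounded means from the table, so the results agree with the reported values only approximately. The function names are hypothetical.

```python
import random


def expected_total_time(p_success, t_success, t_failure):
    """Equation 26: mean number of failures times T-, plus one T+."""
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    mean_failures = (1.0 - p_success) / p_success
    return mean_failures * t_failure + t_success


def simulate_total_time(p_success, t_success, t_failure, rng):
    """Run trials until the first success, summing the time they take."""
    total = 0.0
    while rng.random() >= p_success:  # this trial is unsuccessful
        total += t_failure
    return total + t_success


best_move = expected_total_time(23 / 32, 80, 124)  # about 128.5 min
nsichneu = expected_total_time(10 / 32, 144, 204)  # about 592.8 min

rng = random.Random(0)  # fixed seed so the sketch is repeatable
runs = 20000
simulated = sum(
    simulate_total_time(10 / 32, 144, 204, rng) for _ in range(runs)
) / runs

print(f"bestMove: {best_move:.1f} min, nsichneu: {nsichneu:.1f} min")
print(f"nsichneu, simulated mean: {simulated:.1f} min")
```

For simpleFunc every trial succeeded, so the first term vanishes and the estimate equals T+.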

The estimate is conservative: in practice the test engineer is likely to be able to run multiple trials of the algorithm in parallel on a multicore computer (we explore the benefits of such parallelisation in section 5). However, it does provide an upper bound on the expected time taken to obtain a near-optimal input profile.

2.5.5 Discussion and Conclusions

The results show that the algorithm is capable of deriving near-optimal profiles for all three SUTs (RQ-1). In the worst case—that of nsichneu—it would take approximately 10 hours on average to derive a profile (RQ-2).

We did not specify a priori an upper limit on the computing resources and time taken by the algorithm for it to be considered viable. However, we note that if a test engineer were to run a set of trials overnight on a single core of a computer equivalent in power to a desktop PC, there would be a good chance of deriving a profile by morning for any of the three SUTs. We argue that this is indicative that the proposed algorithm is a viable method of deriving input profiles for statistical testing.

2.5.6 Threats to Validity

We identify two important threats to the validity of our conclusions:

Generality It is not possible to extrapolate the results to other software with confidence given the small number of SUTs considered in the experiment. However, the SUTs were chosen to be diverse in terms of the characteristics listed in table 3. Moreover, bestMove and nsichneu were chosen specifically because they are non-trivial in size and have characteristics that we believed would challenge the algorithm. If this is indeed the case, we may reasonably expect the existence of a fairly large number of real-world SUTs—at the very least, those SUTs ‘smaller’ or ‘less complex’ than bestMove and nsichneu—for which the algorithm is a viable method of synthesising profiles. It is not clear, however, exactly which characteristics should be used to identify ‘smaller’ and ‘less complex’ SUTs.

Implementation We assume that the code used in this and subsequent experiments is an accurate implementation of the proposed algorithm described in section 2.2. For practical purposes any differences in implementation are not necessarily a problem: we have implemented an algorithm that is fit for purpose, even if it is not exactly the proposed one. However, in the context of research, it would be difficult to interpret experimental results for the purpose of enhancing the algorithm if there are meaningful differences between the implemented and proposed algorithms.

We have reduced the chance of any such differences by testing the implementation during and after the development process. Many aspects of the algorithm’s operation may be logged to the output file in addition to the data required for the experiments so as to support this testing.

2.6 Experiment II – Fault-Detecting Ability
