Experiment XI – Compatibility - The Use of Automated Search in Deriving Software Testing Strate

In section 4.3.4, we argued that any input profile represented using the BBN representation– i.e. profiles over an input domain consisting of a fixed number of scalar numeric arguments— could also be represented using the SCFG representation. In other words, the set of profiles representable using a SCFG is a superset of the profiles representable using a BBN.

In this experiment, the first objective is to demonstrate how the SCFG representation may be used to represent the input profiles of SUTs that were represented using a BBN in previous chapters. The second objective is to assess how the efficiency of search algorithm changes when the SCFG representation is used in place of the BBN representation for these SUTs.

The research questions are therefore:

RQ-1 Can the SCFG representation be applied to SUTs whose input domains were previously represented using a BBN in chapter 3: bestMove, nsichneu, cArcsin, fct3, and tcas? RQ-2 How does the change of representation from BBN to SCFG affect the efficiency of the

search algorithm?

Normally the choice of parent and child variables for the conditional dependencies pro- posed in section 4.3.2 is unrestricted: the production weights of any variable may be conditional on any other variable in the grammar (including itself). However, we demonstrate below that grammars representing a fixed number of scalar numeric arguments require no non-determinism, and so the order in which variables are processed (rewritten) when gen- erating a string is predictable. It is therefore possible to restrict conditional dependencies so that a variable may only be conditional on a variable that is processed earlier during the string generation process. This motivates a further research question:

RQ-3 Does restricting conditional dependencies to the variable processing order improve algorithm efficiency?

4.5.2 Preparation

SCFG Representation To construct a suitable SCFG representation for each of the SUTs the following pattern is applied:

S → V1 V2 . . . Vn V1 → [a1, b1] V2 → [a2, b2] . . . _→ ... Vn → [an, bn]

where the Vi are the arguments to the SUT, each one of which would be a node in the BBN representation. Each Vi is a binned variable representing the interval[ai, bi].

For example, the SUT bestMove has two parameters, noughts and crosses which both takes value in the range[0, 511](section 2.4.3). The SCFG representation is therefore:

S → Noughts Crosses Noughts → [0,511]

Real-Valued Binned Variables Two of three arguments for the SUT cArcsin are real-valued: inRand inC are the real and imaginary components of the complex angle, and in section 3.4.2 we chose an input domain of [−10, 10)for both. As a result of the implementation decision to represent all binned variables, both integer and real, using 16-bit integers, the following SCFG is used:

S → InR InC realFlag InR → [0,65534]

InC → [0,65534] realFlag _→ [0,1]

The integer value, x, returned for either of the variables inR and inC when a string is generated is converted to a real-valued argument, x0 in the range[−10, 10)using the trans- formation:

x0 =−10+20x+γ

65535 (48)

where γ is a random variable sampled from the interval[0, 1)using an additional call to the PRNG.

Implementation of Conditional Dependency Constraint Research questions RQ-3 requires conditional dependencies to be restricted to the order in which the variable are processed during string derivation. For the SUTs considered in this experiment, the above grammar pattern ensures that the processing order is the same as the order in which the variables are defined in the grammar. Therefore the implementation of this constraint on conditional dependencies uses the definition order and so avoids the need for additional processing of the grammar to determine the processing order.

Algorithm Parameters Experiment VIIIc (section 3.8.5) evaluated the efficiency of the algorithm using the BBN representation. Therefore the method of Experiment VIIIc is repeated using the SCFG representation, and the result compared against the earlier experiment. We additionally include the SUT tcas, using the results of Experiment X for comparison.

The optimal algorithm configuration and parameter settings used in Experiment VIIIc were the outcome of a series of tuning experiments. To ensure a fair comparison of the two algorithms, an equivalent series of tuning experiments would ideally be performed using the SCFG representation. However, undertaking such an equivalent series of experiment would take significant time and computing resources.

Since understanding the effect of representation on algorithm efficiency is a secondary objective of this chapter—the primary objective is to demonstrate the wider applicability of the SCFG representation—we make the pragmatic decision to re-use the configuration and parameter settings of Experiment VIIIc for the SCFG representation. We argue that the two representations are very similar in nature when the SUT input domains, as here, consist of a fixed number of scalar numeric values, and therefore it is reasonable to expect that the configuration and parameter setting that are optimal for the BBN representation are likely to be acceptably near optimal for the SCFG representation.

As a result of this decision, we are unable to make a direct comparison between the two representations: the choice of parameter settings biases the comparison in favour of the BBN representation. However the experiment is able to derive a bound on the algorithm efficiency (an upper bound when quantified as the number of SUT executions) when using the SCFG representation: the efficiency of the algorithm using the SCFG representation will be at least

as good as this bound if the parameter settings were tuned to the representation. Since the two representations are similar, we expect that the bound on the algorithm efficiency is suffi- ciently tight to enable a conclusion to be drawn as to effect of using the SCFG representation. The parameter settings (duplicating those used for the BBN representation in Experi- ment VIIIc) are listed in table 36. The target minimum coverage probability for each SUT is the same as used throughout chapter 3.

Should the algorithm fail to attain the target probability—a situation which occurs only very rarely at these near-optimal parameter settings—it is terminated after a fixed number of iterations, τiter; in chapter 3 a similar mechanism was used, but based on the total number of SUT executions.

Parameter Description SUT Setting

K evaluation sample size 302

λ neighbourhood sample size 9

ρprb production weight mutation factor 54.002

ρlen bin terminal length mutation factor 2.755

Wbins Gbinsmutation group weight 1 000

W_edge G_edgemutation group weight 144

Wdrct Gdrct mutation group weight 10 584

wrem Mremmutation weight 1 000

wadd Maddmutation weight 689

wprb Mprbmutation weight 1 000

wjoi Mjoimutation weight 886

wspl Msplmutation weight 1 159

wlen Mlenmutation weight 1 028

µprnt maximum no. parent variables per child variable 1

µbins maximum no. bin terminals per binned variable (η=2.0)

ζ initial no. bin terminals (as proportion of µbins) 1.0

µeval maximum no. evaluations per profile 10

τiter max no. iterations 1×105

τpmin target min. coverage probability bestMove 0.112

nsichneu 0.028

cArcsin 0.12

fct3 0.05

tcas 0.078

Lstring max. length generated string 32

Linter max. length intermediate string 32

Table 36 – The algorithm parameter settings used for the SCFG representation.

4.5.3 Method

For each of the five SUTs—bestMove, nsichneu, cArcsin, fct3, and tcas—an SCFG representation was constructed using the pattern described above. 64 trials were run for each SUT (each with a different seed to the PRNG) using the parameter settings of table 36. For each trial, the number of times the SUT is executed by the algorithm was recorded, and transformed—if the target minimum coverage probability is not attained—using the method described in section 3.3.3.

The procedure was repeated a second time with the conditional dependencies between grammar variables restricted to the order in which the variables are defined (and therefore processed during string generation).

SUT bestMove nsichneu cArcsin fct3 tcas Average Median Efficiency (106execs)

SCFG unrestricted dependencies 20.6 2.10 0.735 2.83 65.8 5.23

SCFG restricted dependencies 11.9 2.43 0.628 3.21 18.0 3.87

BBN 11.3 0.440 0.940 2.59 13.5 2.93

Significance (rank-sum p-value)

SCFG unrestricted v. restricted 0.047 % 92 % 36 % 90 % <10−5 <10−5

SCFG restricted v. BBN 82 % <10−5 1.6 % 52 % 11 % 0.19 %

Effect Size (Vargha-Delaney A)

SCFG unrestricted v. restricted 0.68 0.51 0.55 0.49 0.82 0.73

SCFG restricted v. BBN 0.49 0.80 0.38 0.53 0.58 0.66

Table 37 – Results of Experiment XI.

4.5.4 Results

The results are summarised in table 37.

For each SUT, the median algorithm efficiency—in units of 106executions—is calculated over each of 64 trials for both SCFG with both unrestricted and restricted dependencies between variables. The results for the BBN representation are taken from Experiment VIIIc for bestMove, nsichneu, cArcsin, and fct3; and from Experiment X for tcas.

To estimate the average efficiency over all five SUTs , 64 mean values are first calculated where the nth value is the geometric mean of the efficiency returned by the nth trial of each SUT. (The choice of geometric mean for this purpose is discussed in section 3.2.5.) The median of these 64 geometric means is then reported in the column labelled ‘Average’.

To assess the statistical significance of the differences in algorithm efficiencies, the non- parametric Mann-Whitney-Wilcoxon (rank-sum) test was used to compare the SCFG with unrestricted and restricted dependencies, and the SCFG with restricted dependencies to the BBN. The non-parameteric Vargha-Delaney test is applied to calculate effect size for the same sets of responses. (The Vargha-Delaney statistic is described in section 2.6.4.)

4.5.5 Discussion and Conclusions

The results show that it is possible to derive input profiles using the SCFG representation: the efficiencies reported in table 37 are much less than if the algorithm were to be terminated only when reaching the maximum number of iterations rather than on finding a suitable profile (RQ-1).

The results also show that there is an improvement in algorithm efficiency when the conditional dependencies in the SCFG representation are restricted to the order in which the variables are processed, compared to unrestricted dependencies: the average efficiency across the five SUTs is improved by approximately 26%, from 5.23×106 to 3.87×106 executions, as a result of this restriction, and the difference is statistically significant at the 5% level (RQ-3). However the effect of the restriction differs greatly between individual SUTs: for tcas the restriction improves the efficiency greatly, while for nsichneu, cArcsin, and fct3 there is no statistically significant difference.

The efficiency of the algorithm when using the SCFG representation (with restricted dependencies) is slightly worse than the BBN representation: on average, the use of the SCFG representation decreases efficiency by approximately 32% from 2.93×106to 3.87×106execu- tions. This difference is statistically significant at the 5% level, and the effect size is medium

according to the guidelines suggested by Vargha and Delaney (2000). Again, the effect on individual SUTs is more complex: there is no statistically significant difference between the representations for bestMove, fct3 and tcas, while the SCFG representation decreases the efficiency of the algorithm applied to nsichneu greatly.

As discussed above, we must treat the efficiency for the SCFG representation as a bound on the tuned efficiency. Thus we conclude that the restricted SCFG representation results in an average algorithm efficiency that is, at worst, 32% less efficient than the BBN representation (RQ-2). We argue that this worst-case difference is small enough that it is reasonable to use the SCFG representation as an alternative for the BBN representation for input domains consisting of a fixed number of scalar numeric arguments.

We argued that the representations are very similar in nature when representing domains consisting of a fixed number of scalar numeric arguments, especially when the variable dependencies are restricted to the derivation order. Nevertheless there are differences in both the algorithm itself and its implementation between the two representations:

• There is no equivalent to the start variable in the BBN representation. The neighbourhood of a candidate profile represented as an SCFG is therefore larger than the equivalent BBN: mutations may add dependencies with the start variable as the parent, and the weights of the start variable’s production may be changed. Since the start variable has only one production in the pattern used for this experiment, and this production is used prior to any other variable being derived, neither of these mutations has any effect. However, the algorithm efficiency is likely to be decreased as a result.

• The new algorithm implementation (used only with the SCFG representation) assesses mutations for validity after they are selected; this may rarely result in a neighbour that is no different from the current candidate input profile. This implementation decision is designed to improve algorithm run time, but may decrease efficiency measured in terms of the number of SUT executions.

• The new algorithm implementation stores bin terminal lengths and weights as integers and enforces a minimum length and weight of 1. The existing algorithm implementation (used with the BBN representation) uses floats to store bin lengths and probabilities and has no minimum value.

These differences may offer an explanation for why the efficiency of the SCFG representation is not always equivalent to that of the BBN representation at the same parameter settings. However, in the absence of further experimentation, we are unable to provide an explanation for the large differences in efficiency observed in a few specific cases: the very poor efficiency of the unrestricted SCFG representation applied to tcas, and the poor efficiency of both restricted and unrestricted SCFG representation when applied to nsichneu compared to the BBN representation.

4.5.6 Threats to Validity

Generality Although a reasonably diverse set of SUTs have been used for this experiment, we are unable to generalise, with confidence, the results to the wider class of SUTs with a fixed number of scalar numeric arguments.

Parameter Settings For the reason discussed above, the parameter settings were not tuned for SCFG representation, and observed efficiencies for the SCFG representation must be considered a bound on the tuned efficiency of the algorithm. This limits the conclusions we are able to draw. We cannot, for example, state confidently whether there is, in general, a difference in algorithm efficiency between the two representations. Moreover, some of the unexplained observations—such as the very poor performance of the SCFG representation when applied to nsichneu—may be artefacts resulting from this decision.

4.6 Experiment XII – Wider Applicability

In document The Use of Automated Search in Deriving Software Testing Strategies (Page 169-175)