Implementation and Enhancements - Response Surface Methodology

3.3 Response Surface Methodology

3.3.3 Implementation and Enhancements

This section describes the implementation of RSM used for the empirical work of this chapter. It also describes some innovative enhancements to the standard technique that facilitate the use of RSM for this particular application: tuning the parameters of a stochastic algorithm. The implementation details and enhancements are based on experience gained during preliminary experimentation, and were motivated by the need to accommodate the stochastic variance in the efficiency of the profile-deriving algorithm, to ensure the objectivity of the methodology, and to enable automation of the process.

Terminology In the following, an algorithm trial refers to a single execution of the algorithm, using a specific set of parameter values, and returning a single response, i.e. a measure of the algorithm efficiency. A tuning run refers to a single application of RSM, consisting of many algorithm trials, that returns a single set of tuned algorithm parameters. The term experiment is reserved for a set of tuning runs that compare the performance of different algorithm configurations.

Coded Parameter Space For some algorithm parameters, the algorithm behaviour may reasonably be expected to be related to the ratio of the change, rather than the absolute difference, in parameter settings. For example, a change in K, the evaluation sample size, from 10 to 20 is likely to have more effect than a change from 110 to 120. In addition, these and other parameters may differ by orders of magnitude between different SUTs.

For these reasons, it is convenient to relate the coded value of such parameters to the logarithm of the actual parameter values. Equation (35) is adapted for these parameters as:

$$z_i = 2\,\frac{\log x_i - \log L_i}{\log U_i - \log L_i} - 1 \tag{38}$$

(where L_i and U_i are ‘reasonable’ bounds on the actual parameter values, as discussed above). The use of logarithms maps an absolute change in the coded parameter space to a multiplicative change in the actual parameter space. This is a more convenient mapping for parameters where algorithm performance is related to the relative change, and it allows parameter values to range more easily across orders of magnitude during the tuning.
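The logarithmic coding of equation (38) and its inverse can be sketched as follows. This is a minimal illustration in Python rather than the scripts used for the experiments; the function names and the bounds chosen for K are hypothetical.

```python
import math

def encode_log(x, lower, upper):
    """Map an actual parameter value x to its coded value using equation (38)."""
    return 2.0 * (math.log(x) - math.log(lower)) / (math.log(upper) - math.log(lower)) - 1.0

def decode_log(z, lower, upper):
    """Invert equation (38): recover the actual value from the coded value z."""
    return lower * (upper / lower) ** ((z + 1.0) / 2.0)

# Hypothetical 'reasonable' bounds for the evaluation sample size K.
L_K, U_K = 10.0, 1000.0

# The geometric midpoint of the bounds is coded as (approximately) zero.
z_centre = encode_log(100.0, L_K, U_K)

# Equal steps in coded space correspond to equal ratios in actual space.
ratio_low = decode_log(0.1, L_K, U_K) / decode_log(0.0, L_K, U_K)
ratio_high = decode_log(0.6, L_K, U_K) / decode_log(0.5, L_K, U_K)
```

Under these assumed bounds, `ratio_low` and `ratio_high` agree up to floating-point error, which is the multiplicative behaviour described above.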

Trial Termination Criteria An individual algorithm trial terminates if a profile is derived whose fitness reaches or exceeds the target minimum coverage probability, τpmin.

Two additional termination criteria are used to avoid long-running jobs. This is a practical consideration: occasionally, some algorithm configurations with particular parameter choices can produce large data structures that are very slow to process. The additional termination criteria are specified as limits on:

• the number of times an instrumented SUT is executed during a trial (parameter τexec); and,

• the processor time used by the trial (parameter τtime).

If either limit is reached, the algorithm terminates cleanly, and the accumulated number of executions and the fitness at the point of termination are recorded.

Termination based on the number of executions is compatible with repeatable experimentation: it is dependent only on the algorithm configuration and the SUT. However, termination based on processor time is less repeatable: the computing environment in which the trial executes is an additional factor. For this reason, τtime is set so that it is exceeded only in exceptional cases. (The actual settings used for both of these limits are specified during the description of each experiment in this chapter.)
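The three termination criteria can be sketched as a single trial loop. This is an illustrative Python sketch only: `search_step` is a hypothetical callable standing in for one iteration of the profile-deriving algorithm, and the order in which the criteria are checked is an assumption.

```python
import time

def run_trial(search_step, tau_pmin, tau_exec, tau_time):
    """Run one algorithm trial until a termination criterion is met.

    search_step is a hypothetical callable: each call performs one search
    iteration and returns (SUT executions used, best fitness so far).
    """
    start = time.process_time()
    executions, best_fitness = 0, 0.0
    while True:
        used, best_fitness = search_step()
        executions += used
        if best_fitness >= tau_pmin:
            reason = "target"        # a suitable profile was derived
        elif executions >= tau_exec:
            reason = "exec_limit"    # censored: execution limit reached
        elif time.process_time() - start >= tau_time:
            reason = "time_limit"    # censored: processor-time limit reached
        else:
            continue
        # The execution count and fitness at termination are always recorded.
        return executions, best_fitness, reason
```

Only the third check depends on the computing environment, which is why a generous processor-time limit keeps the trial outcome repeatable in all but exceptional cases.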

Starting Point RSM is a form of hill climbing that locates a local optimum reached by a climb from its starting point. There is no reason to assume that this local optimum is the global optimum: the response surface may be rugged with many local optima. To increase the probability of locating the global optimum, or at least a ‘good’ local optimum, multiple tuning runs are performed, each time starting at a different, random starting point.

The starting points are chosen by randomly sampling each coded parameter from the interval [−0.5, 0.5] using a uniform distribution. By selecting the point from a large space around the centre of the ‘reasonable’ region for parameter values, there is a chance that the hill climb could start in the basin of attraction of the global optimum (or a ‘good’ local optimum).
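The sampling of starting points can be sketched as below. This is a minimal Python illustration; the function name, the seed, the number of tuning runs and the number of parameters are all hypothetical.

```python
import random

def random_start(n_params, rng):
    """Sample a starting point in coded space: each coded parameter is drawn
    uniformly from the interval [-0.5, 0.5]."""
    return [rng.uniform(-0.5, 0.5) for _ in range(n_params)]

rng = random.Random(1)  # seed chosen arbitrarily, for repeatability only
# For example, 10 tuning runs of a configuration with 5 parameters.
starts = [random_start(5, rng) for _ in range(10)]
```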

Response To accommodate censored observations, i.e. trials that terminate as a result of the two limits described above before a suitable input profile is derived, statistical techniques such as the Tobit model (Tobin, 1958) or Cox’s Proportional Hazards Model (Cox, 1972) could be applied during the analysis. We have previously successfully applied such techniques to the tuning of search algorithm performance (Poulding et al., 2007). However both these techniques make assumptions as to the distribution of the responses. In order to avoid these assumptions, we instead incorporate information from censored observations through the use of the following modified response metric:

$$y = \log\left(\frac{N\,\tau_{p_{\min}}}{\min(\hat{p}_{\min},\, \tau_{p_{\min}})}\right) \tag{39}$$

where N is the number of times the instrumented SUT was executed during the trial (i.e. the original response measure), τpmin is the target minimum coverage probability, and p̂min is the fitness of the elite input profile, i.e. the candidate profile with the highest estimated minimum coverage probability found during the algorithm trial.

If the algorithm completes and an input profile is derived that achieves the target minimum coverage probability, then the response, y, is the original measure of efficiency, log N. If the trial terminates for any other reason, then the number of SUT executions at the point of termination is increased by a proportion dependent on how close the algorithm was to achieving the target fitness. This is not intended to be an estimate of the number of SUT executions that would be made by the algorithm trial if it had run to completion, but simply a means of guiding the tuning process by incorporating information from censored observations. (In the very rare cases that p̂min is zero, an arbitrary response of y = 22, equivalent to e^22 ≈ 3.58 × 10^9 SUT executions, is recorded. This value is much higher than normally possible, but not so high as to distort the analysis.)
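The modified response of equation (39), including the special case of zero elite fitness, can be sketched as follows. This is a minimal Python illustration; the function and constant names are hypothetical.

```python
import math

# Arbitrary response recorded when the elite fitness is zero (see text).
ZERO_FITNESS_RESPONSE = 22.0

def response(n_exec, elite_fitness, tau_pmin):
    """Modified response of equation (39) for a single algorithm trial."""
    if elite_fitness <= 0.0:
        return ZERO_FITNESS_RESPONSE
    # For an uncensored trial the elite fitness is at least tau_pmin, so the
    # ratio is 1 and the response reduces to log N.
    return math.log(n_exec * tau_pmin / min(elite_fitness, tau_pmin))
```

For example, a censored trial that reached only half of the target fitness after N executions is given the same response as an uncensored trial that needed 2N executions.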

Preliminary experiments demonstrated that the limit on the number of executions was set sufficiently high that although some algorithm parameter choices resulted in a relatively large proportion of censored observations at the initial parameter settings, very few, if any, trials were censored at the final parameter settings. For this reason, we may be confident that tuned parameters derived using RSM are indeed very efficient, rather than being an artefact of this modified response.

Design for Linear Model Fitting A two-level fractional factorial design is used for the design points at which algorithm trials are run in order to estimate the model parameters in the first phase of each RSM iteration. If the starting point is (z*_1, z*_2, ..., z*_n), then the design is a subset of the points formed by combinations of coordinates z*_i ± 0.1. The value of 0.1 was chosen to be small enough that a first-order linear model would be a reasonable approximation of the surface in this region, but large enough that there is sufficient difference in algorithm performance (compared to the intrinsic stochastic noise) across the region to establish a steepest path on the surface.

The number of parameters varies depending on the algorithm configuration (the combination of search method, representation and fitness metric), and this would normally affect the size of the design. In order to apply equivalent computation to the tuning of each configuration, we generate the smallest fractional factorial design with a minimum of 128 points and a resolution of 4 (or better), regardless of the number of parameters. The resolution of a fractional factorial design indicates the degree to which estimated effects are confounded with one another, and hence which coefficients can be determined through regression analysis using the design. A resolution of 4 is the least that ensures that the coefficients of the first-order linear model are not confounded with two-factor interactions and so may be accurately determined. Such a design is much smaller (and therefore requires less computation) than a full two-level factorial design for the same number of parameters.
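The construction of design points around a starting point can be sketched as follows. This is a minimal Python illustration, not the scripts used for the experiments: it builds a small 2^(4-1) design from the generator D = ABC, a standard resolution 4 design, rather than a design of at least 128 points, and it does not show the search for the smallest suitable design. The function names and the example starting point are hypothetical.

```python
from itertools import product

def fractional_factorial(n_base, generators):
    """Two-level design: a full factorial in n_base factors, plus one extra
    factor per generator (the product of the listed base-factor columns)."""
    runs = []
    for base in product((-1, 1), repeat=n_base):
        extra = []
        for word in generators:
            sign = 1
            for idx in word:
                sign *= base[idx]
            extra.append(sign)
        runs.append(list(base) + extra)
    return runs

def design_points(centre, signs, half_width=0.1):
    """Place the design around the starting point: z*_i +/- half_width."""
    return [[c + half_width * s for c, s in zip(centre, run)] for run in signs]

# Illustrative 2^(4-1) design with generator D = ABC: 8 runs, 4 parameters.
signs = fractional_factorial(3, [(0, 1, 2)])
points = design_points([0.2, -0.1, 0.0, 0.3], signs)
```

In this example every main-effect column is orthogonal to every two-factor interaction column, which is the property that distinguishes a resolution 4 design from one of resolution 3.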

Steepest Path Descent The steepest path is explored at points separated by the vector:

$$\frac{-0.2}{\sqrt{\sum_{i=1}^{n} \beta_i^2}}\,(\beta_1, \beta_2, \ldots, \beta_n) \tag{40}$$

Here the steepest path vector of equation (37) is normalised and multiplied by a scalar value. Since better algorithm parameters have lower values of the response metric defined by (39), the steepest path should descend and so this scalar is negative. The magnitude of 0.2 was found to be effective during preliminary experimentation.
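The step vector of equation (40) can be sketched as follows. This is a minimal Python illustration that assumes the normalisation is by the Euclidean norm of the fitted coefficients; the function names are hypothetical.

```python
import math

def descent_step(beta, magnitude=0.2):
    """Step vector of equation (40): the fitted first-order coefficients,
    normalised to unit length and scaled by -magnitude so the path descends."""
    norm = math.sqrt(sum(b * b for b in beta))
    if norm == 0.0:
        raise ValueError("flat fitted model: no steepest path direction")
    return [-magnitude * b / norm for b in beta]

def path_point(start, step, s):
    """Coded parameter values s steps along the path from the starting point."""
    return [z + s * d for z, d in zip(start, step)]

# Example with two hypothetical fitted coefficients.
step = descent_step([3.0, 4.0])
second_point = path_point([0.0, 0.0], step, 2)
```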

To assess the average algorithm efficiency at points on the steepest path, 32 trials of the algorithm are performed, each trial using the same parameter settings but a different seed to the pseudo-random number generator. The efficiency is calculated as the median of the responses (equation (39)) over these trials. The median is used as it is more robust to outliers than the mean.

We introduce an innovation in how the coded parameters are converted to the actual parameters at points on the steepest descent path. Some of the actual parameters take only integer (or similarly discrete) values. The process of transforming coded values to actual values and then rounding to the nearest integer can cause inconsistencies in the differences between parameter values at consecutive points. For example, the actual values, before rounding, for a parameter at five consecutive points along the path may be: 3.1, 4.3, 5.5, 6.7, 7.9. When rounded to the nearest integer, the parameter values are: 3, 4, 6, 7, 8, resulting in an inconsistent difference of 2 between the second and third points compared to 1 between the other points. These inconsistencies might result in sudden changes in the response metric, and since our chosen descent stopping criterion (described below) assumes that a smooth curve can be fitted to the response along the steepest path, the sudden change could lead to a poorly fitted curve and therefore unreliable results.

This potential source of error is minimised by applying ‘probabilistic’ rounding when deriving the actual parameter values for points on the steepest path. If the fractional part of the unrounded parameter value is g, then the value is rounded up to the next integer if γ < g, where γ is a random value uniformly sampled from the interval [0, 1), and rounded down otherwise. This rounding rule is applied independently in each of the 32 trials, so that the mean value of this parameter across the trials is close to the value before rounding, but each individual trial uses an integer value. Assuming an approximately linear relationship between the response and this parameter value, the median response will be an estimate of the response at the unrounded parameter value; even without a linear relationship, it will almost certainly be a closer approximation than if the direction of rounding were the same for all 32 trials.
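The rounding rule and its use across the trials at one point on the path can be sketched as follows. This is a minimal Python illustration; `run_trial` is a hypothetical callable standing in for a single algorithm trial, and the seeding scheme is an assumption.

```python
import math
import random
from statistics import median

def probabilistic_round(x, rng):
    """Round x up with probability equal to its fractional part g, else down."""
    low = math.floor(x)
    g = x - low
    gamma = rng.random()  # uniform on [0, 1)
    return low + 1 if gamma < g else low

def median_response(run_trial, x, n_trials=32, seed=0):
    """Median response over n_trials, rounding x independently for each trial.

    run_trial is a hypothetical callable (integer_value, trial_seed) -> response.
    """
    rng = random.Random(seed)
    return median(run_trial(probabilistic_round(x, rng), t) for t in range(n_trials))
```

Because the probability of rounding up equals the fractional part g, the expected value of the rounded parameter equals the unrounded value, while an integer-valued input is never changed.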

Descent Stopping Criterion In most applications of RSM, the steepest path is followed until a simple stopping criterion is met, such as two consecutive points on the path giving a worse response than the previous point. During preliminary experimentation it was found that a simple criterion such as this was unreliable as a result of the stochastic noise in the algorithm response: occasionally the median response at a point was slightly worse than at the previous point, but the overall trajectory of the response was still towards a better response. For this reason, a second innovation was made to the methodology: the stopping criterion was based on a smooth curve fitted to the median responses at points along the path. The process of curve fitting incorporates information from all points along the path, and is therefore robust to the occasional outlier value for the median response.

The fitted curve is a cubic form:

$$\tilde{y} = a s^3 + b s^2 + c s + d \tag{41}$$

where ỹ is the median response and s the number of steps along the path from the starting point. A cubic curve accommodates both a minimum and maximum along the path, and was found to produce a good fit to observed results during preliminary experimentation: an example is shown in figure 18. The coefficients a, b, c, and d are estimated using least-squares regression.

Figure 18 – An example of a cubic curve (dashed line) fitted to responses along the steepest path. At each step along the path, a boxplot illustrates the distribution of observed algorithm efficiencies over 32 trials: the box is drawn between the first and third quartiles, and the horizontal line across the box is the median response. (The response is the metric of equation 39, equivalent to the logarithm of the number of SUT executions.)

Initially the path is followed for a total of 8 steps (the starting point plus 7 steps along the path) and the cubic is fitted to the observed median responses. If the fitted cubic has a minimum, and this predicted minimum is at least two steps from the last point sampled (i.e. it is a true minimum and not simply the lowest point sampled so far), then the descent of the path stops. Otherwise subsequent batches of 4 steps are taken up to a maximum of 21 steps: the starting point plus 20 points along the path. The limit on the number of steps is used since the vector is unlikely to still be a good estimate of the steepest path so far from the starting point. (20 steps represents a distance travelled in the coded parameter space of 10 units, but is otherwise an arbitrary limit.)
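The curve fit and the stopping test can be sketched as follows. This is a minimal, standard-library Python illustration and not the Matlab analysis used for the experiments. It assumes that "at least two steps from the last point sampled" means that the predicted minimum lies within the sampled steps and at least two steps short of the last one; the function names are hypothetical.

```python
import math

def fit_cubic(ys):
    """Least-squares fit of y = a s^3 + b s^2 + c s + d to the median
    responses ys observed at steps s = 0, 1, 2, ...; returns (a, b, c, d).
    Solves the 4x4 normal equations by Gauss-Jordan elimination."""
    rows = [[float(s) ** p for p in (3, 2, 1, 0)] for s in range(len(ys))]
    # Augmented normal-equation matrix [X^T X | X^T y].
    m = [[sum(r[i] * r[j] for r in rows) for j in range(4)]
         + [sum(r[i] * y for r, y in zip(rows, ys))] for i in range(4)]
    for col in range(4):
        pivot = max(range(col, 4), key=lambda k: abs(m[k][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(4):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return tuple(m[i][4] / m[i][i] for i in range(4))

def predicted_minimum(coeffs):
    """Step at which the cubic has a local minimum, or None if it has none."""
    a, b, c, _ = coeffs
    if abs(a) < 1e-12:                 # degenerate fit: quadratic or linear
        return -c / (2 * b) if b > 0 else None
    disc = 4 * b * b - 12 * a * c      # discriminant of dy/ds = 3as^2 + 2bs + c
    if disc <= 0:
        return None
    # At this root the second derivative 6as + 2b equals +sqrt(disc) > 0.
    return (-2 * b + math.sqrt(disc)) / (6 * a)

def should_stop(medians, margin=2):
    """True if the fitted cubic predicts a minimum within the sampled steps
    and at least `margin` steps before the last sampled point."""
    s_min = predicted_minimum(fit_cubic(medians))
    last = len(medians) - 1
    return s_min is not None and 0 <= s_min <= last - margin
```

Since the predicted minimum is a real number rather than a step index, it can fall between sampled points.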

Path descent may also stop at fewer than 20 steps if any of the parameters reaches a hard bound on its value, for example, a parameter that must be positive would become negative at the next step. In this case, a minimum is estimated that leaves sufficient ‘space’ for the fractional factorial design in the first phase of the next RSM iteration.

As well as, we believe, accommodating the noise in algorithm responses more effectively than a simpler stopping criterion, the curve fitting criterion has an additional advantage: the minimum predicted by the fitted curve can lie between the sampled points, potentially resulting in a better estimation of the minimum.

Iterations When RSM is applied to, for example, industrial processes, there is significant benefit in performing many iterations in order to find the best possible settings for the process. However, the goal here is to find only near-optimal parameter settings for the purpose of reliably comparing algorithm configurations; finding the best possible settings is not necessary and would require too many computing resources. For this reason, the RSM process applies just two iterations of hill climbing; preliminary experimentation suggested that further iterations typically produced little improvement in the response. Always performing two iterations also ensures that equivalent effort is expended in tuning each algorithm configuration.

A second-order linear model is not fitted after the last iteration because fitting such a model to a noisy response and a relatively large number of parameters is unlikely to be reliable, and is unnecessary since only near-optimal settings are required. Instead, the minimum point estimated during the steepest path descent of the second iteration is used as the tuned algorithm parameters.

Automation The configuration and analysis of each RSM phase is performed using a combination of shell and Matlab scripts. The algorithm trials and the scripts implementing RSM run on a computing cluster. Each server in the cluster has Linux as the operating system, and the cluster is managed using Oracle Grid Engine software. A single tuning run, consisting of all the phases of RSM, executes without input by the experimenter: at the end of each phase a grid job is automatically submitted to perform the next phase.

3.4 Software Under Test

This section describes the SUTs used in the experiments of this chapter. A threat to validity identified for the experiments of chapter 2 was the relatively small number of SUTs. Therefore two new real-world SUTs, cArcsin and fct3, are added to the existing SUTs bestMove and nsichneu. The very simple toy SUT simpleFunc is no longer used as an example in the experiments.

Ideally, the set of example SUTs would be larger, but further SUTs would add to the computing resources required to perform the experiments. The set of four diverse SUTs is a trade-off between the generality of the empirical results and the available computing resources.

The characteristics and provenance of the new SUTs are described below. The instrumented source code of cArcsin is available at: http://www-users.cs.york.ac.uk/smp/supplemental/; only the object code of fct3 was provided to us and thus we are unable to share the source code for this SUT.

3.4.1 SUT Characteristics

Relevant characteristics of the new SUTs are summarised in table 7. The characteristics of the SUTs bestMove and nsichneu, previously described in chapter 2, are shown for comparison.

| SUT | bestMove | nsichneu | cArcsin | fct3 |
|---|---|---|---|---|
| lines of code | 89 | 1 967 | 70 | 135 |
| no. loops | 7 | 1 | 0 | 0 |
| no. conditional branches | 42 | 490 | 18 | 19 |
| no. input arguments | 2 | 11 | 3 | 8 |
| argument data types | int | int | double, bool | int |
| domain cardinality | 2.62 × 10^5 | 3.40 × 10^6 | ∞ | 6.71 × 10^7 |

Table 7 – Characteristics of the SUTs bestMove, nsichneu, cArcsin, and fct3.

As for the existing SUTs, input profiles are derived for the purpose of branch coverage of the new SUTs.

In document The Use of Automated Search in Deriving Software Testing Strategies (Page 96-102)