The Relationship between Search Based Software Engineering and Predictive Modeling

(1)

The Relationship between Search Based Software

Engineering and Predictive Modeling

Mark Harman

University College London,

Department of Computer Science, CREST Centre, Malet Place, London, WC1E 6BT, UK.

ABSTRACT

Search Based Software Engineering (SBSE) is an approach to soft-ware engineering in which search based optimization algorithms are used to identify optimal or near optimal solutions and to yield insight. SBSE techniques can cater for multiple, possibly compet-ing objectives and/ or constraints and applications where the poten-tial solution space is large and complex. This paper will provide a brief overview of SBSE, explaining some of the ways in which it has already been applied to construction of predictive models. There is a mutually beneficial relationship between predictive mod-els and SBSE. The paper sets out eleven open problem areas for Search Based Predictive Modeling and describes how predictive models also have role to play in improving SBSE.

1. INTRODUCTION

Harman and Clark argued that ‘metrics are fitness functions too’ [54]. This paper extends this analogy to consider the case for ‘pre-dictive models are fitness functions too’. It turns out that the rela-tionship between fitness functions and predictive models is much richer than this slightly facile aphorism might suggest. There are many connections and relationships between Search Based Soft-ware Engineering and Predictive Modeling. This paper seeks to un-cover some of these and describe ways in which they might be ex-ploited, both by the SBSE community and by the Predictive Mod-eling community.

Search based Software Engineering (SBSE) seeks to reformulate Software Engineering problems as search problems. Rather than constructing test cases, project schedules, requirements sets, de-signs, architectures and other software engineering artifacts, SBSE simply searches for them. The search space is the space of all pos-sible candidate solutions. This is typically enormous, making it impossible to enumerate all solutions. However, guided by a fit-ness function that distinguishes good solutions from bad ones, we can automate the search for good solutions in this search space.

SBSE has witnessed a dramatic increase in interest and activity in the last few years, resulting is several detailed surveys of tech-niques, applications and results [2, 9, 50, 57, 60, 100]. It has been shown that SBSE solutions to Software Engineering problems are

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00.

competitive with those produced by humans [29, 115] and that the overall SBSE approach has tremendous scalability potential [13, 49, 85, 94]. Section 2 provides a very brief overview of SBSE, while Section 3 present an overview of previous work on SBSE applied to predictive modeling.

SBSE is attractive because of the way it allows software engi-neers to balance conflicting and competing constraints in search spaces characterized by noisy, incomplete and only partly accu-rate data [50, 53]. This characterization of the problem space may sound surprisingly familiar to a reader from the Predictive Mod-eling community. Predictive modMod-eling also has to contend with noisy, incomplete and only partly accurate data. As a result, there is a great deal of synergy to be expected between the two commu-nities. This paper seeks to explore this synergy. It will also argue that the way in which SBSE can cater for multiple competing and conflicting objectives may also find, as yet largely unexplored, ap-plications in predictive modeling work.

The 2009 PROMISE conference call for papers raised four ques-tions for which answers were sought from the papers submitted. Clearly, in the year that has passed since PROMISE 2009, progress will have been made. However, the general character of the four questions means that they are unlikely to be completely solved in the near future. Since these questions are likely to remain open and important for the predictive modeling research community they serve as a convenient set of questions for this paper to seek to an-swer through the perspective of SBSE. The questions, and the sec-tions that address them, are as follows:

1. How much do we need to know about software engineering in order to build effective models? (addressed in Section 4) 2. How to adapt models to new data? (addressed in Section 5) 3. How to streamline the data or the process? (addressed in

Section 6)

4. How can predictive modeling gain greater acceptance by the SE community? (addressed in Section 7)

Of course, it would be one-sided and simplistic to claim that the predictive modeling community had so much to gain from consid-ering SBSE, without also considconsid-ering the reciprocal benefit to the SBSE community that may come from results and techniques asso-ciated with predictive modeling. Indeed, there are several ways in which predictive modeling may also benefit the SBSE community. These are briefly set out in Section 8.

The paper is written as an accompaniment to the author’s keynote presentation at PROMISE 2010: The6th_{International Conference}

on Predictive Models in Software Engineering. The paper, like the keynote, is constructed to raise questions and possibilities, rather

(2)

than to answer them. The paper presents 11 broad areas of open problems in SBSE for predictive modeling, explaining how tech-niques emerging from the SBSE community may find potentially innovative applications in Predictive Modeling. These open prob-lem areas (and the sections that describe them) are as follows:

1. Multi objective selection of variables for a predictive model (Section 4)

2. Sensitivity analysis for predictive models (Section 4) 3. Risk reduction as an optimization objective for predictive

models (Section 4)

4. Exploiting Bayesian models of Evolutionary Optimization (Section 5)

5. Balancing functional and non–functional properties of pre-dictive models (Section 6)

6. Exploiting the connection between fitness function smooth-ing and model interpolation (Section 6)

7. Validating predictive models through optimization (Section 7) 8. Human–in–the–loop predictive Modeling with Interactive

Evo-lutionary Optimization (Section 7)

9. Identifying the building blocks of effective predictive models (Section 7)

10. Predictive models as fitness functions in SBSE (Section 8) 11. Predictive models of SBSE effort and landscape properties

(Section 8)

2. SEARCH BASED SOFTWARE

ENGINEER-ING

Software Engineering, like other engineering disciplines, is all about optimization. We seek to build systems that are better, faster, cheaper, more reliable, flexible, scalable, responsive, adaptive, main-tainable, testable; the list of objectives for the software engineer is a long and diverse one, reflecting the breadth and diversity of ap-plications to which software is put. The space of possible choices is enormous and the objectives many and varied.

In such situations, software engineers, like their peers in other engineering disciplines [117] have turned to optimization techniques in general and to search based optimization in particular. This ap-proach to software engineering has become known as Search Based Software Engineering [50, 57]. Software Engineering is not merely yet another engineering discipline for which Search Based Opti-mization issuitable; its virtual nature makes software the ideal ‘killer application’ for search based optimization [53].

SBSE has proved to be very widely applicable in Software En-gineering, with applications including testing [16, 18, 23, 56, 81, 90, 113], design, [26, 55, 93, 104, 110], requirements, [14, 39, 38, 66], project management [3, 10, 11] refactoring [61, 95] and, as covered in more detail in the next section, predictive modeling. To these Software Engineering problems a wide range of differ-ent optimisation and search techniques have been applied such as local search [73, 85, 92], simulated annealing, [21, 15, 92], Ant Colony Optimization [7, 91], Tabu search [31], and most widely of all (according to a recent review [60]), genetic algorithms and ge-netic programming [17, 46, 55, 113]. Exact and greedy algorithms, though not strictly ‘search’ based are also often applied to similar problems, for example to requirements [122] and testing [119]. As

such, these algorithms are also regarded by many researchers as falling within the remit of the SBSE agenda.

The reader who is unfamiliar with these algorithms, can find overviews and more detailed treatments in the literature on SBSE [60, 50, 100]. For the purpose of this paper, all that is required is to note that SBSE applies to problems in which there are numer-ous candidate solutions and where there is a fitness function that can guide the search process to locate reasonably good solutions. The algorithms in the literature merely automate the search (using the guide provided by the fitness function in different ways). For example, genetic algorithms, inspired by natural selection, evolve a population of candidate solutions according to the fitness func-tion, while hill climbing algorithms simply take a random solution as a starting point and apply mutations to iteratively seek out near neighbours with improved fitness. As has been noted in the litera-ture on SBSE, the reader can interpret the words ‘fitness function’ as being very closely related to ‘metric’ in the Software Engineer-ing sense of the word [54].

It is important not to be confused by terminology: In Search Based Software Engineering, the term ‘search’ is used to refer to ‘search’ as in ‘search-based optimization’ not ‘search’ in the sense of a ‘web search’ or a ‘library search’. It should also be understood that search seldom yields a globally optimal result. Typically a ‘near optimal’ solution is sought in a search space of candidate so-lutions, guided by a fitness function that distinguishes between bet-ter and worse solutions. This near optimal solution, merely needs to be ‘good enough’ to be useful. SBSE is very much an ‘engineer-ing approach’ and so SBSE researchers often formulate problems in terms of finding solutions that are good enough, better than pre-viously obtained or within acceptable tolerances.

A detailed survey of all work on SBSE up to December 31st 2008 can be found in the work of Harman et al. [60]. There are also more in-depth surveys on particular aspects of SBSE, for example test-ing [9, 86] design level SBSE [100] and SBSE for non–functional properties [2].

It is a useful conceptual device to imagine all possible solutions (including the very worst and best) as lying on a horizontal plain in which similar candidates are proximate. We can think of the fitness of each individual in the plain as being denoted by a vertical high in the vertical dimension. In this way we conceptualize a ‘fitness landscape’, in which any globally optimal solutions will correspond to the highest peaks and in which properties of the landscape may relate to the difficulty of finding solutions using a chosen search technique. If the problem is one of maximizing the value of fitness then we seek to find solutions that denote the tops of hills in the landscape. If the problem is to minimize the value of the fitness function, then we seek the deepest troughs. Figure 1 presents an example of two such fitness landscapes, in which the goal is to min-imize the value of fitness. The visualization of fitness landscapes provides insights into the nature of the search problem. In Figure 1 the upper landscape is far more challenging than the lower land-scape. The lower landscape is produced by transforming the upper landscape, thereby making the search more efficient and effective. Sections 6 and 8 consider the ways in which such transformations may be useful to Predictive Modeling, and how predictive model-ing may help SBSE researchers to better understand their search landscapes.

Like other engineers, software engineers find themselves unable to achieve all possible objectives and therefore some balance be-tween them is typically sought. This motivates a multi objective approach to optimization which is increasingly prevalent in SBSE work. Search based optimization is naturally suited to the balanc-ing of competbalanc-ing and conflictbalanc-ing multiple objectives and so these

(3)

−100 −50 0 50 100 −100 −50 0 50 100 0 0.2 0.4 0.6 0.8 1 1.2 1.4 b a Objective Value Untransformed program −100 −50 0 50 100 −100 −50 0 50 100 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 b a Objective Value Transformed version

Figure 1: Transforming the Fitness Landscape by Transform-ing the Fitness Function (example from previous work [88, 89]). In this example, we seek the global minimum in the search space. The untransformed landscape is exceptionally difficult: The global minimum is found when the value ofaandbare both−1. However, in all other cases where the value ofaandb

are equal one of the local minima is reached. The fitness func-tion is transformed in this case by transforming the program from which fitness is computed. As a result, the local minima disappear from the landscape and all points on the landscape guide the search towards the global minimum. A problem can be transformed so that search will be dramatically more likely to locate a global optima, without necessarily knowing the val-ues denoted by this optimum. As such, this transformation is an effective approach to migrating a hard search problem into an easier search problem from which optimal solutions can be found. Such approaches may also find applications in smooth-ing and optimizsmooth-ing predictive models.

Figure 2: A cost–benefit Pareto front from Search Based Re-quirements Optimization [122]. Moving up the vertical axis de-notes reduced cost; moving rightward on the horizontal axis denotes increased customer satisfaction. Each circle is one valid choice of requirements that minimizes cost and maximizes value. For the set of requirements denoted by each circle, the search can find no solution with a lower cost and greater value. We can use this to find a good trade off between cost and value. This approach can be adapted to Search based Predictive Mod-eling in order to balance competing and conflicting predictive modeling objectives. For example, it could be used to mini-mize risk while maximizing predictive quality. Multi objective predictive modeling possibilities are discussed in more detail in Sections 4 and 6.

multi objective approaches have been adopted across the spectrum in software quality [72], testing [78, 30, 36, 59, 120], requirements [41, 102, 123], design [22, 98], and management [8, 12]. Sections 4 and 6 claim that multi objective SBSE also has an important role to play in optimizing predictive models for several competing and conflicting objectives.

Figure 2 gives an example of a multi objective solution Pareto front from the domain of Requirements Optimization. The solu-tion is a Pareto front of solusolu-tions each of which denotes a non– dominated solution. A solution is Pareto optimal if there is no other solution which better solves one of the two objectives without also giving a worse solution for the other objective. In this way, the Pareto front contains a set of choices, each of which balance in an optimal manner, the trade off between the functional and non– functional properties.

Like all metaheuristic search, one cannot be sure that one has the true globally optimal Pareto front and so we seek to produce as good an approximation as possible. Nevertheless, the Pareto front can yield insights into the tradeoffs present.

3. PREVIOUS WORK ON SBSE FOR

PRE-DICTIVE MODELING

SBSE techniques have a natural application in Predictive Mod-eling. In order to apply an SBSE approach to prediction, we need to capture the objectives of the predictive modeling system as a set of fitness functions which the SBSE algorithm can optimize. One

(4)

natural way in which this can be done is to use the degree of fit of an equation to a set of data points as a fitness function.

The equation is sought from among a very large set of candidate equations, using a Genetic Programming (GP) approach. In this approach, the equation is not constrained to use only certain com-binations of expressions and does not suffer from the limitations of a standard linear regression fit, in which a comparatively limited set of possibilities are usually available.

As with predictive modeling in general, software cost estimation is one of the areas in which this search based predictive model-ing approach has been most widely studied. Software project cost estimation is known to be a very demanding task [106] with in-herent uncertainties and a lack of reliable estimates. Software cost estimation is further hampered by poor quality, missing and incom-plete data sets. All of these problems are very well addressed by SBSE. The SBSE approach is able to handle missing and incom-plete data and can seek a best fit equation that may be piecewise; revealing very different trends (and requiring consequently differ-ent equations) for differdiffer-ent parts of the problem space.

The first author to use a search based approach to project cost es-timation was Dolado [32, 33, 34]. He used a GP approach to evolve functions that fitted the observed data for project effort (measured in function points).

In Dolado’s work, the population consists of a set of equations. The operators that can be included in these GP-evolved equations are very inclusive. Dolado includes arithmetic operators and other potentially useful mathematical functions such as power, square roots and log functions. Dolado’s fitness function is themean squared error, mse= 1 n−2 n X i=1 (yi−yˆi)2

Several other authors have also used GP to fit equations that pre-dict quality, cost and effort [19, 20, 24, 37, 69, 70, 71, 80, 82, 83, 84, 108]. Other authors have compared GP to more ‘classi-cal’ techniques such as linear regression [105] and have combined search based approaches with approaches that use Artificial Neural Networks [109] and Bayesian classifiers [21].

SBSE has also been used to build predictive models for defect prediction [67], performance prediction [75] and for the classifica-tion [112] and selecclassifica-tion [73] of the metrics to be used in predicclassifica-tion. However, SBSE has been much less applied to predicting locations and numbers of bugs [1], which is surprising considering how im-portant this is within the predictive modeling community [96].

4. CAN SBSE TELL US HOW MUCH DO

WE NEED TO KNOW IN ORDER TO BUILD

EFFECTIVE MODELS?

Building an effective model that can produce reasonable predic-tions of a dependent variable setV requires at least two important ingredients:

1. There has to be a set of variablesA, the values of which determine or at least influence the values ofV.

2. There has to be sufficient information about values ofAin order to support the inference of a reasonable approximation to the relationship that exists betweenAandV.

This may seem very obvious, but it is often overlooked or inad-equately accounted for. Given a candidate set of ‘influencing fac-tors’ (which, one hopes to be a superset ofA) it is not unreasonable to ask

“Which of the candidate variables’ values can affect the values of the variables we seek to predict?”

This problem can be addressed in a variety of ways, one of which involves an SBSE approach. Kirsopp et al. [73] showed how a sim-ple local search could be used to find those sets of project features that are good predictors of software project effort and cost. They demonstrated how their approach could assist the performance of Case Based Reasoning systems by identifying, from a set of 43 pos-sible candidate project variables, those which are good predictors.

This approach is attractive because it is generic. For a candidate set, A0, of prediction variables to be included in the model, the search must allocate a fitness toA0. Kirsopp et al. simply ‘jack knife’ the set of data to treat each project as a possible candidate for prediction. They discard information about the chosen project and seek to predict the discarded values usingA0. This gives a fitness based on prediction quality. The experiment can be repeated for each project in the set to give an overall estimate of the predictive quality of the setA0. The aggregate predictive quality denotes the fitness of the choice of variablesA0. In general, the closerA0is to

Athe better should be its fitness.

In a set of 43 variables there are243_{possible subsets so the} pre-diction variable space is too large to enumerate. However, using a simple local search Kirsopp et al. were able to identify good sets of prediction variables. They were also able to use the sim-plicity of their hill climbing search technique to good advantage in determining properties of the search space. They examined the fit-ness landscape to reveal that the search problem had large basins of attraction and clustered peaks, making hill climbing a reasonable search technique to apply.

Their approach was evaluated with respect to Case Based Rea-soning but this was not an inherent aspect of their approach. It would be possible to use a similar approach to locate the set of ‘useful prediction variables’ for any predictive modeling approach. Furthermore, there are other factors that could be taken into ac-count. Kirsopp et al. published their results in 2002. Since that time, there has been an increasing interest in multi objective opti-mization for software engineering (as mentioned in the introduction to this paper).

A multi objective approach to the problem of identifying a good prediction set could take into account other factors as well as pre-diction quality. There may be several factors that are important in a good model, and there may be tensions between these factors be-cause the achievement of one may be won only at the expense of another (a very typical problem for multi objective optimization). Such additional objectives for predictive modeling that a multi ob-jective search based approach might seek to optimize include:

1. Predictive Quality. Clearly we shall want our predictions to be reasonably accurate in order to be useful. Prediction quality may not be a single simple objective. There are many candidates for measuring prediction quality. Typically [24, 33] MMRE has been used. However, this assumes that we simply wish to minimize the likely error. This may not al-ways be the case. Also, MMRE is only one of many possible metrics and there has been evidence to suggest that it may not always select the best model [43].

It has also been argued that information theoretic measures have an important role to play in SBSE as fitness functions [50]. Such measures can capture information content and information flow and make ideal fitness function candidates because the solutions to be optimized in SBSE are typically composed of information [53]. For measuring fitness of

(5)

pre-dictive models we may look to techniques founded on infor-mation theory such as the Akaike Inforinfor-mation Criterion [4] and Schwartz’ Bayesian variant [103].

A prediction model that is often very accurate, but occasion-ally exceptionoccasion-ally inaccurate may not be desirable though it may have an apparently attractive MMRE. The decision maker may seek a prediction system that reduces the risk of an inaccurate prediction, even at the expense of overall pre-dictive quality. There has been previous work on risk reduc-tion for project planning using SBSE [12], but there has been no work on risk as an objective to be minimized in search based predictive modeling; the problem of incorporating risk into search based models therefore remains open.

2. Cost. There may be costs associated with applying a predic-tive model to obtain a prediction, introducing a cost–benefit aspect to the optimization problem: we seek the best predic-tion we can for a given cost. Cost–benefit analysis is useful where there is a cost of collecting data for an attribute of in-terest. For example, it may be difficult or demanding to ob-tain some data, but relatively easy to obob-tain other data. There may also be a monetary cost associated with obtaining infor-mation, for example when such data is obtained from a third party. In such situations we have a cost–benefit trade off for which a multi objective search is ideally suited. However, this has not previously been explored in the SBSE literature. 3. Privacy. Certain information may have degrees of privacy and/or access control that may limit the applicability of the model. It may be preferable to favour, wherever possible, less private data in constructing or applying a predictive model. Once again, a multi objective search based optimization could handle this situation in a very natural manner, provided the degree of privacy could be measured and, thereby, incorpo-rated into a fitness function.

4. ReadabilityPredictive models may be used to give insight to the developer, in which case the readability of the model may be important. Tim Menzies kindly pointed out that readabil-ity and concision may be possible objectives in response to an earlier draft of this paper, also suggesting how they might be measured [45].

5. Coverage and Weighting. We may have several dependent variables inV, for which a prediction is sought. It could be that we can find sets of predictors that predict one of the outcomes inV better than another. In such cases we may seek a solution that is a good generalist, balancing all of the dependent variables equally and seeking to predict all equally well. Alternatively, some predicted variables may be more important and we may therefore allocate a higher weight to these.

Both these situations can be covered in a multi objective search, using standard search based optimization techniques. For coverage of a set of predictors, with equal weighting or where we are unsure about the relative weights of each, we can adopt a Pareto optimal approach, which treats all objec-tives as equal. On the other hand, where we know the relative weighting to be applied to the elements ofV, we can simply use a single objective approach in which the overall fitness is a weighted sum of the predictive quality of each element in

V.

For complex, multiple objective problems such as these, where the constraints may be tight and the objectives may be conflicting

and competing, multi objective SBSE has proved to be exception-ally effective. It remains an open problem to apply multi objec-tive SBSE to the problems of predicobjec-tive modeling. In many cases, a Pareto optimal approach will be appropriate because the objec-tives will be competing and conflicting and there will be no way to define suitable weights to balance them. Using a Pareto opti-mal approach, we can obtain a Pareto front similar to that depicted in Figure 2 that balances, for example, privacy against predictive quality or cost against predictive quality. Such graphs are easier to visualize in two dimensions, but it is also possible to optimize for more than two objectives.

In addition to identifying which sources of data and which vari-ables are important for predictive modeling, there is also a related problem of determining which of the predictor variables inAwill have the greatest impact on the predictions that the model pro-duces. For these highly ‘sensitive’ variables, we shall need to be extra careful. This is particularly important in Software Engineer-ing predictive models, because it is widely known that estimates in Software Engineering are often very inaccurate [99, 107, 111].

If we can identify those sensitive elements ofAthat may have a disproportionately large impact on the solution space, then we can pay particular attention to ensuring that the estimates of these elements is as accurate as possible. This may save time and may increase uptake as a result. It may be cost effective to focus on the top 10% of input variables that require careful prediction rather than paying equal attention to all input variables. Similarly we can prioritize the input variables so that more effort is spent calibrating and checking the estimates of those for which the model is most sensitive.

There has been previous work on such sensitivity analysis for problems in Requirements Optimization [58]. Hitherto, this ap-proach to search based sensitivity analysis remains unexplored for predictive modeling.

5. CAN SBSE HELP US TO ADAPT

MOD-ELS TO NEW DATA?

Re-estimating the values of some ‘sensitive’ variables will present us with new data. The new data will override untrusted existing data because they are better estimated and therefore preferable. However, new data is not always a replacement for existing poor quality data. It may be that existing data is more trusted and new data is viewed with healthy scepticism until it has been demon-strated to provide reliable prediction. In this situation, a model needs to take account of the relative lack of trust in the new data. One obvious way in which this can be achieved is to incorporate a Bayesian model into the prediction system [40].

Over the past decade, researchers in the optimization community have considered models of Evolutionary Computation that involve statistical computations concerning the distribution of candidate so-lutions in a population of such soso-lutions. This trend in research on Estimation of Distributions Algorithms’ (EDAs) may have rel-evance to the predictive modeling community. In particular, two classes of EDAs are worth particular attention: the Bayesian Evo-lutionary Algorithms (BEA) and the Bayesian Optimization Algo-rithms (BOA). Both may have some value to the Predictive Model-ing community.

In a BEA [121], the evolutionary algorithm seeks to find a model of higher posterior probability at each iteration of the evolutionary computation, starting with the prior probabilities. One of the out-puts of a BEA is a Bayesian network. In this way, a BEA can be considered to be one way in which a search based approach can be used to construct a Bayesian Network.

(6)

A BOA [97] uses Bayesian modeling of the population in or-der to predict the distribution of promising solutions. Though not essential for the approach to work, the use of Bayesian modeling allows for the incorporation of prior knowledge into the search con-cerning good solutions and their building blocks. It is widely ob-served that the inclusion of such domain knowledge can improve the efficiency of the search. The BOA provides one direct mecha-nism through which this domain knowledge can be introduced.

6. CAN SBSE HELP US TO STREAMLINE

THE DATA OR THE PROCESS?

SBSE techniques may be able to streamline the predictive mod-eling process and also the data that is used as input to a model. This section explains how SBSE has been used to trade functional and non–functional properties and how this might yield insight into the trade offs between functional and non–functional properties for predictive models.

The section also briefly reviews the work on fitness function smoothing and discusses how this may also have a resonance in work on predictive modeling.

Section 4 briefly discussed the open problem of multi–objective predictive model construction using SBSE to search for models that balance many of the competing objectives that may go towards making up a good model. However, the objectives considered could all be characterized as being ‘functional’ objectives; they concern the predictive quality, cost and characteristics of the predictions made by the model. We may optimize for non–functional prop-erties of the predictive model and may also seek a balance between functional and non–functional aspects of the model.

There has been a long history of work on the optimization of non–functional properties, such as temporal aspects of software testing [6, 114] and stress testing [5, 23, 44]. Afzalet al. [2] provide a survey of non–functional search based software testing. Recent work on SBSE for non–functional properties [68, 116], has also shown how functional and non–functional properties can be jointly optimized in solutions that seek to balance their competing needs.

White et al. [116] show how functional and non–functional prop-erties can be traded for one another. They use GP to evolve a pseudo-random number generator with low power consumption. They do this by treating the functional property (randomness) and the non–functional property (power consumption) as two objectives to be solved using a Pareto optimal multi objective search. Using the Pareto front, White et al. were able to explore the operators that had most impact on the front, thereby gaining insight into the relationship between the operations used to define a pseudo ran-dom number generator and their impact on functional and non– functional properties.

In using SBSE as an aid to predictive modeling we re-formulate the question ‘how should we construct a model’ to ‘what are the criteria to guide a search for the best model among the many can-didates available’. This is a ‘reformulation’ of the software engi-neering problem of predictive modeling as a search problem in the style of Clark et al. [25]. With such a reformulation in mind, there is no reason not to include non-functional aspects of the model in the fitness function which guides the search.

In this way, we could search for a model that balances predictive quality against the execution time of the model. For some appli-cations, the execution time of certain model aspects may be high. This could be a factor in determining whether a predictive model is practical. In such situations it would be advantageous to factor into the search the temporal aspects of the problem. Using such

a multi objective approach, the decision maker could consider a Pareto front of trade offs between different models, each of which optimally balances the choice between predictive quality and time to produce the prediction.

As with the work of White et al. [116], the goal may not be nec-essarily to produce a solution, but to yieldinsightinto the trade offs inherent in the modeling choices available. Producing a sufficiently accurate and fast predictive model may be one possible aim, but we should not overlook the value of simply understanding the balance of trade offs and the effect of modeling choices on solution charac-teristics. SBSE provides a very flexible and generic way to explore these trade offs.

It is not uncommon for approaches to search based optimization to estimate or approximate fitness, rather than computing it directly. Optimizers can also interpolate between fitness values when con-structing a new candidate solution from two parents. This is not merely a convenience, where only approximate fitness is possible. Rather, it has been shown that smoothing the search landscape can improve the performance of the search [101]. Some of the tech-niques used for fitness approximation and interpolation either im-plicitly or exim-plicitly accept that the fitness computed may not be entirely reliable. There would appear to be a strong connection be-tween such realistic acceptance of likely reliability and the problem of constructing a predictive model. Perhaps some of the techniques used to smooth fitness landscapes can be adapted for use in predic-tive modeling.

7. CAN SBSE HELP PREDICTIVE

MODEL-ING GAIN GREATER ACCEPTANCE BY

THE SE COMMUNITY?

As described in Section 6, SBSE may not be used merely to help find good quality predictive models from among the enormous space of candidates. It may also be used as a tool for gaining in-sight into the trade offs inherent in the choices between different model characteristics, both functional and non–functional.

We might consider a predictive model to be a program. It takes inputs in the form of data and produces output in the form of pre-dictions. The inputs can be noisy, imprecise and only partly defined and this raises many problems for modeling as we have discussed earlier. The outputs are not measured according to correctness, but accuracy, forcing a more nuanced statistical view of correctness. Both these input and output characteristics make predictive models unlike traditional programs, but this should not obscure our recog-nition that predictive models are a form of program.

This observation underpins the application of Genetic Program-ming (GP) (as briefly reviewed in Section 3). We know from the literature on GP [79, 74] that GP is good at providing small pro-grams that are nearly correct. Scaling GP to larger propro-grams has proved to be a challenge, though there has been recent progress in this area [65, 78]. However, such GP scalability issues are not a problem for predictive modeling, because the programs required, although subtle, are not exceptionally long. Furthermore, tradi-tional programming techniques cope inadequately with ill-defined, partial and messy input data, whereas, this is precisely the area where the GP approach excels.

In many ways, the GP community and the predictive modeling communities have a similar challenges to overcome with regard to acceptance within the wider Software Engineering community. The perceived‘credibility gap’, in part, may stem from the very nature of the problems both communities seek to tackle. Messy and noisy inputs are deprecated in Software Engineering and much effort has gone into eliminating them. Both the GP and the predictive

(7)

model-ing communities are forced to embrace this ‘real world messiness’ and do not have the luxury of simply rejecting such inputs as ‘in-valid’. Similarly, outputs for which no ‘correctness guarantee’ can be provided sit uneasily on the shoulders of a science that grew out of mathematics and logic and whose founders eschewed any such notions [27, 42, 62].

SBSE researchers face a similar challenge to that faced by re-searchers and practitioners working with predictive models. SBSE is all about engineering optimization. Its motivation and approach are inherently moulded by an engineering mindset, rather than a more traditional mindset derived from mathematical logic. As such, SBSE is concerned with improvingnot withproving. Perhaps the same can be said of Predictive Modeling.

Fortunately there is a growing acceptance that Software Engi-neers are ready to widen their understanding of correctness (and its relationship tousefulness). Increasingly, software engineering is maturing to the point where it is able to start to consider complex messy real world problems, in which the software exhibits engi-neering characteristics as well as mathematical and logical charac-teristics [63]. The growing interest in SBSE may be one modest example of this increasing trend [60], though it can also be seen in the wide acceptance of other useful (though not guaranteed cor-rect) techniques, such as dynamic determination of likely invariants [35].

There are several ways in which SBSE researchers have found it possible to increase acceptability of search based solutions and these may extend to predictive modeling, because of the close simi-larities outlined above. Four possible approaches to gaining greater acceptance using SBSE are outlined below:

1. Justification. It may be thought that SBSE is a form of ‘Arti-ficial Intelligence for Software Engineering’. However, this observation can be misleading. While it is true that some search techniques have been used for AI problems, the appli-cation of search based optimization for Software Engineer-ing is all about optimization and little about artificial intelli-gence. One problem with an ‘AI for SE’ view point is that some AI techniques lack an adequate explanation for the re-sults they produce. This is a common criticism of Artificial Neural Net approaches to predictive modeling [64]. SBSE is not about artificial intelligence; the only ‘intelli-gence’ in the search process is supplied, a priori, by the human in the form of the fitness function and problem rep-resentation and in the construction of the search algorithm. Also, by contrast with ANNs and other so-called ‘black box’ AI techniques, the SBSE approach is far from being ‘black box’. Indeed, one of the great advantages of the SBSE ap-proach is precisely that it isnota ‘black box’ approach. Al-gorithms such as hill climbing and evolutionary alAl-gorithms make available adetailedexplanation of the results they pro-duce in the trace of candidates considered. For example, in hill climbing, each solution found is better than the last, ac-cording to the fitness function. This means that there is a se-quence of reasoning leading from one individual to the next, which explains how the algorithm considered and discarded each candidate solution on the way to its final results. 2. Human–in–the–loopSearch based optimization is typically

automated. It is often the ability to automate the search pro-cess that makes SBSE attractive [50, 53]. However, the search process does not have to befullyautomated. It is possible to employ the human to guide fitness computation while the al-gorithm progresses. For example, interactive evolution com-putes fitness, but allows the human to make a contribution to

the fitness computation, thereby incorporating human knowl-edge and judgement into the evolutionary search process. Interactive evolution provides a very effective way to allow the human a say in the construction of solutions. It may be essential to do this, perhaps because fitness cannot be wholly captured algorithmically or because there is a necessary ele-ment of aesthetics or subjective judgeele-ment involved. In pre-vious work on SBSE the possibility of interactive evolution has been proposed for applications to program comprehen-sion [51] and design [110]. It could also be used to increase acceptance of solutions; if an engineer has helped to define fitness, might such an engineer not be more likely to accept its optimized results?

3. Optimization as EvaluationUsing SBSE, any metric can be treated as a fitness function [54]. Therefore, if we have a candidate prediction system, we can use it as a fitness func-tion and optimize solufunc-tions according to its determinafunc-tion of high values from low. Essentially, we are using the predic-tion system ‘in reverse’; we seek to find those sets of decision variables inAthat lead to high values of one of the predicted variables inV. This could have interesting and useful appli-cations in the validation of predictive models.

Many SBSE techniques can be configured to produce a se-quence of candidate solutions with monotonically increasing fitness. This is one of the advantages of the way a search can auto-justify its findings. Using such a ‘monotonically in-creasing fitness chain’ one can explore the degree to which the prediction system provides good predictions. If it does, then the candidate solutions in such a chain, should increas-ingly exhibit properties that ought to lead to a greater value of the predicted variable. In this way, SBSE can provide an alternative perspective on prediction system validation. 4. InsightSearch based approaches can be used to find the

so-called ‘building blocks’ of good solutions, rather than simply to find the solutions themselves. The building blocks of an optimal or near optimal solution are those sub parts of the solution that are shared by all good solutions. The identifica-tion of good building blocks can provide great insight into the solution space. For example, in predictive modeling, finding building blocks of good solutions may shed light on the rela-tionship between clusters of related parameters that affect the predictive quality of a solution. This could potentially be far more revealing than principal component analysis, which it subsumes to some extent, because it reveals relationships be-tween components of solutions, not merely the components themselves.

Building blocks are an inherent feature of the workings of evolutionary approaches to optimization [28, 118]. However, they can be extracted from any search based approach, in-cluding local searchers [85]. Identifying the building blocks of a solution using search can be an end in itself, rather than a means to an end. It is not necessary to trust the algo-rithm to produce a solution, merely that it is able to identify building blocks likely to be worthy of consideration by the human; a far less stringent requirement. The ultimate solu-tion proposed may be constructed by a human engineer and so the problem of acceptance simply evaporates. However, the engineer can have been informed and guided on the se-lection of parameters and construction of the model using a search based process that flags up potentially valuable build-ing blocks.

(8)

8. HOW CAN PREDICTIVE MODELING HELP

SBSE RESEARCHERS?

Section 7 explored the connections between predictive modeling and the SBSE community and, in particular, the strong connec-tions between GP-based approaches to SBSE and predictive mod-eling. These connections suggest that work on predictive ing may be as helpful to SBSE as SBSE is to predictive model-ing. This section briefly explores the ways in which the results and techniques from the predictive modeling community may be useful to the SBSE research community1. The section briefly considers three areas of SBSE for which predictive models could be help-ful and where there has been only a little initial work. This work demonstrates the need for predictive models but there remain many interesting open problems and challenges.

1. Predictive Models as Fitness FunctionsThe two key ingre-dients for SBSE are the representation of candidate solutions and the fitness function [50]. The fitness function guides the search and so it is important that it correctly distinguishes good solutions from bad. We might say that we want the fitness function to be a good predictorof the attribute we seek to optimize. Naturally, the success of the whole SBSE enterprise rests upon finding fitness functions that are good predictors. In this way, there is a direct contribution to be made by the predictive modeling community.

Defining suitable fitness functions is far from straightforward in many SBSE applications. For example, in Search Based Software Testing, there has been work on exploring the de-gree to which the fitness typically used is a good predictor [76].

SBSE researchers also have to construct fitness functions that fit existing data (for simulations). The predictive modeling community has significant experience and expertise concern-ing missconcern-ing and incomplete data that clearly should have a bearing on fitness function construction. Once we have a good prediction system, based on available data, we may think of this as a metric and, thereby, also think of it as a fitness function [54]. We may therefore claim, as we set out to do in the introduction that “Prediction Systems and Fitness Functions too”.

2. Predicting Global Optima ProximityThe SBSE approach reformulates a Software Engineering problem as a search problem. This implicitly creates a fitness landscape, like those depicted in Figure 1. Predictive Modeling may help to analyze and understand this landscape.

For example, if the landscape consists of one smooth hill, then a simple hill climbing approach will find the global op-timum. Gross et al. [47, 48] have explored the possibility of predicting the proximity of a result to a global optima for Search Based Software Testing. This is a very impor-tant topic, not just in SBSE research, but also in optimization in general. One of the advantages of greedy approaches for 1

The author is from the SBSE community and is not an expert in predictive modeling. As a result, this section is shorter than the remainder of the paper. However, this should not be interpreted as being anything other than a reflection of lack of expertise. An author with greater expertise in predictive modeling would surely produce a more detailed, authoritative and, no doubt, better account of the ways in which predictive modeling can be useful to SBSE researchers. Indeed, it is to be hoped that such a paper (or papers) may emerge from the predictive modeling community; it would be very welcome in the SBSE community.

single objective test case selection and minimization [120] is the fact that greedy algorithms are known to be within a log of the global optimum.

3. Effort predictionSome searches using SBSE can consume considerable computation resources. While greedy algorithms and hill climbing tend to be fast and efficient (at the expense of often locating sub optimal solutions), more sophisticated search techniques such as evolutionary algorithms can be computationally intensive. This is important, since most of the existing work on SBSE has focused on the use of pop-ulation based evolutionary algorithms, such as genetic algo-rithms and genetic programming [60].

There is clearly scope for predictive modeling here. Predict-ing effort is one of the most widely studied aspects of predic-tive modeling, particularly with relation to software project cost and effort estimation [106]. Lammermann and Wegener have explored the possibility of predicting effort in Search Based Software Testing [77]. However, there is much more work that can be done on predictive models to inform SBSE. If effort can be predicted, then perhaps it can be controlled and thereby managed or even reduced. For example, there is work on transformation for search based testing, that seeks to reduce computation search effort by transforming programs for increased ‘testability’ [52, 56, 87]. This work is currently guided by known problems for testability. A more sophisti-cated approach might take account of predictions from mod-els of testability. As was explained in Figure 1, different landscapes have very different properties, with some being more amenable to search based optimization. Therefore, the ability to predict features of the fitness landscape would be extremely attractive.

9. CONCLUSIONS

This paper has explored some of the ways in which SBSE re-search may assist predictive modeling, outlining some open prob-lems and challenges for the SBSE community. The paper also briefly reviews existing work on SBSE techniques for predictive modeling and how they may be further extended.

There is a close relationship between the challenges faced by both the SBSE and the Predictive Modeling communities. It seems likely that any research flow of ideas will be bi-directional. There are many ways in which the results and techniques emerging from the predictive modeling community may also be of benefit to re-searchers in SBSE. The connections between SBSE research work and predictive modeling outlined in this paper suggest that there may be significant value in a joint workshop to further explore these connections, relationships and to cross pollinate ideas between the two research communities.

10. ACKNOWLEDGEMENTS

The ideas presented here have been shaped by discussions with Giulio Antoniol, Edmund Burke, John Clark, Max Di Penta, Jose Javier Dolado, Norman Fenton, Robert Hierons, Mike Holcombe, Yue Jia, Brian Jones, Bill Langdon, Spiros Mancoridis, Phil McMinn, Tim Menzies, Marc Roper, Conor Ryan, Martin Shepperd, Paolo Tonella, Darrell Whitley, Xin Yao, Shin Yoo and Yuanyuan Zhang. I apologize to those I may have neglected to mention. My work is part funded by the generous support for grants from the EPSRC, the EU and Motorola.

(9)

11. REFERENCES

[1] W. Afzal, R. Torkar, and R. Feldt. Search-based prediction of fault count data. InProceedings of the 1st International Symposium on Search Based Software Engineering (SSBSE ’09), pages 35–38, Cumberland Lodge, Windsor, UK, 13-15 May 2009. IEEE Computer Society.

[2] W. Afzal, R. Torkar, and R. Feldt. A systematic review of search-based testing for non-functional system properties. Information and Software Technology, 51(6):957–976, 2009.

[3] J. S. Aguilar-Ruiz, I. Ramos, J. C. Riquelme, and M. Toro. An Evolutionary Approach to Estimating Software Development Projects.Information and Software Technology, 43(14):875–882, December 2001. [4] H. Akaike. A new look at the statistical model

identification.IEEE Transactions on Automatic Control, 19(6):716–723, 1974.

[5] J. T. Alander, T. Mantere, and P. Turunen. Genetic Algorithm based Software Testing. InProceedings of the 3rd International Conference on Artificial Neural Networks and Genetic Algorithms (ICANNGA ’97), pages 325–328, Norwich, UK, April 1997. Springer-Verlag.

[6] J. T. Alander, T. Mantere, P. Turunen, and J. Virolainen. GA in Program Testing. InProceedings of the 2nd Nordic Workshop on Genetic Algorithms and their Applications (2NWGA), pages 205–210, Vaasa, Finland, 19-23 August 1996.

[7] E. Alba and F. Chicano. Finding Safety Errors with ACO. InProceedings of the 9th annual Conference on Genetic and Evolutionary Computation (GECCO ’07), pages 1066–1073, London, England, 7-11 July 2007. ACM. [8] E. Alba and F. Chicano. Software Project Management with

GAs.Information Sciences, 177(11):2380–2401, June 2007. [9] S. Ali, L. C. Briand, H. Hemmati, and R. K.

Panesar-Walawege. A systematic review of the application and empirical investigation of search-based test-case generation.IEEE Transactions on Software Engineering, 2010. To appear.

[10] G. Antoniol, M. Di Penta, and M. Harman. Search-based Techniques for Optimizing Software Project Resource Allocation. InProceedings of the 2004 Conference on Genetic and Evolutionary Computation (GECCO ’04), volume 3103/2004 ofLecture Notes in Computer Science, pages 1425–1426, Seattle, Washington, USA, 26-30 June 2004. Springer Berlin / Heidelberg.

[11] G. Antoniol, M. Di Penta, and M. Harman. Search-based Techniques Applied to Optimization of Project Planning for a Massive Maintenance Project. InProceedings of the 21st IEEE International Conference on Software Maintenance (ICSM ’05), pages 240–249, Los Alamitos, California, USA, 25-30 September 2005. IEEE Computer Society. [12] G. Antoniol, S. Gueorguiev, and M. Harman. Software

project planning for robustness and completion time in the presence of uncertainty using multi objective search based software engineering. InACM Genetic and Evolutionary Computation COnference (GECCO 2009), pages 1673–1680, Montreal, Canada, 8th – 12th July 2009. [13] F. Asadi, G. Antoniol, and Y. Guéhéneuc. Concept locations

with genetic algorithms: A comparison of four distributed architectures. InProceedings of2ndInternational Symposium on Search based Software Engineering (SSBSE

2010), page To Appear, Benevento, Italy, 2010. IEEE Computer Society Press.

[14] A. J. Bagnall, V. J. Rayward-Smith, and I. M. Whittley. The Next Release Problem.Information and Software

Technology, 43(14):883–890, December 2001. [15] P. Baker, M. Harman, K. Steinhöfel, and A. Skaliotis.

Search Based Approaches to Component Selection and Prioritization for the Next Release Problem. InProceedings of the 22nd IEEE International Conference on Software Maintenance (ICSM ’06), pages 176–185, Philadelphia, Pennsylvania, 24-27 September 2006. IEEE Computer Society.

[16] A. Baresel, D. Binkley, M. Harman, and B. Korel. Evolutionary Testing in the Presence of Loop-Assigned Flags: A Testability Transformation Approach. In Proceedings of the 2004 ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA ’04), pages 108–118, Boston, Massachusetts, USA, 11-14 July 2004. ACM.

[17] A. Baresel, H. Sthamer, and M. Schmidt. Fitness Function Design to Improve Evolutionary Structural Testing. In Proceedings of the 2002 Conference on Genetic and Evolutionary Computation (GECCO ’02), pages 1329–1336, New York, USA, 9-13 July 2002. Morgan Kaufmann Publishers.

[18] L. Bottaci. Instrumenting Programs with Flag Variables for Test Data Search by Genetic Algorithms. InProceedings of the 2002 Conference on Genetic and Evolutionary Computation (GECCO ’02), pages 1337–1342, New York, USA, 9-13 July 2002. Morgan Kaufmann Publishers. [19] S. Bouktif, D. Azar, D. Precup, H. Sahraoui, and B. Kégl.

Improving Rule Set Based Software Quality Prediction: A Genetic Algorithm-based Approach.Journal of Object Technology, 3(4):227–241, 2004.

[20] S. Bouktif, B. Kégl, and H. Sahraoui. Combining Software Quality Predictive Models: An Evolutionary Approach. In Proceedings of the International Conference on Software Maintenance (ICSM ’02), pages 385–392, MontrÃl’al, Canada, 3-6 October 2002. IEEE Computer Society. [21] S. Bouktif, H. Sahraoui, and G. Antoniol. Simulated

Annealing for Improving Software Quality Prediction. In Proceedings of the 8th annual Conference on Genetic and Evolutionary Computation (GECCO ’06), pages

1893–1900, Seattle, Washington, USA, 8-12 July 2006. ACM.

[22] M. Bowman, L. Briand, and Y. Labiche. Multi-objective genetic algorithms to support class responsibility assignment. In23rd

IEEE International Conference on Software Maintenance (ICSM 2007), Los Alamitos, California, USA, October 2007. IEEE Computer Society Press. To Appear.

[23] L. C. Briand, Y. Labiche, and M. Shousha. Stress Testing Real-Time Systems with Genetic Algorithms. In Proceedings of the 2005 Conference on Genetic and Evolutionary Computation (GECCO ’05), pages 1021–1028, Washington, D.C., USA, 25-29 June 2005. ACM.

[24] C. J. Burgess and M. Lefley. Can Genetic Programming Improve Software Effort Estimation? A Comparative Evaluation.Information and Software Technology, 43(14):863–873, December 2001.

(10)

M. Lumkin, B. Mitchell, S. Mancoridis, K. Rees, M. Roper, and M. Shepperd. Reformulating software engineering as a search problem.IEE Proceedings — Software,

150(3):161–175, 2003.

[26] S. L. Cornford, M. S. Feather, J. R. Dunphy, J. Salcedo, and T. Menzies. Optimizing Spacecraft Design - Optimization Engine Development: Progress and Plans. InProceedings of the IEEE Aerospace Conference, pages 3681–3690, Big Sky, Montana, March 2003.

[27] O.-J. Dahl, E. W. Dijkstra, and C. A. R. Hoare.Structured Programming. Academic Press, London, 3 edition, 1972. [28] K. De Jong. On using genetic algorithms to search program

spaces. In J. J. Grefenstette, editor,Genetic Algorithms and their Applications: Proceedings of the second international Conference on Genetic Algorithms, pages 210–216, Hillsdale, NJ, USA, 28-31 July 1987. Lawrence Erlbaum Associates.

[29] J. T. de Souza, C. L. Maia, F. G. de Freitas, and D. P. Coutinho. The human competitiveness of search based software engineering. InProceedings of2nd

International Symposium on Search based Software Engineering (SSBSE 2010), page To Appear, Benevento, Italy, 2010. IEEE Computer Society Press.

[30] C. Del Grosso, G. Antoniol, M. Di Penta, P. Galinier, and E. Merlo. Improving Network Applications Security: A New Heuristic to Generate Stress Testing Data. In Proceedings of the 2005 Conference on Genetic and Evolutionary Computation (GECCO ’05), pages 1037–1043, Washington, D.C., USA, 25-29 June 2005. ACM.

[31] E. Díaz, J. Tuya, R. Blanco, and J. J. Dolado. A Tabu Search Algorithm for Structural Software Testing. Computers & Operations Research, 35(10):3052–3072, October 2008.

[32] J. J. Dolado. A Validation of the Component-based Method for Software Size Estimation.IEEE Transactions on Software Engineering, 26(10):1006–1021, October 2000. [33] J. J. Dolado. On the Problem of the Software Cost Function.

Information and Software Technology, 43(1):61–72, January 2001.

[34] J. J. Dolado and L. Fernandez. Genetic Programming, Neural Networks and Linear Regression in Software Project Estimation. In C. Hawkins, M. Ross, G. Staples, and J. B. Thompson, editors,Proceedings of International Conference on Software Process Improvement, Research, Education and Training (INSPIRE III), pages 157–171, London, UK, 10-11 September 1998. British Computer Society.

[35] M. D. Ernst, J. Cockrell, W. G. Griswold, and D. Notkin. Dynamically discovering likely program invariants to support program evolution.IEEE Transactions on Software Engineering, 27(2):1–25, Feb. 2001.

[36] R. M. Everson and J. E. Fieldsend. Multiobjective

Optimization of Safety Related Systems: An Application to Short-Term Conflict Alert.IEEE Transactions on

Evolutionary Computation, 10(2):187–198, April 2006. [37] M. P. Evett, T. M. Khoshgoftaar, P. der Chien, and E. B.

Allen. Using Genetic Programming to Determine Software Quality. InProceedings of the 12th International Florida Artificial Intelligence Research Society Conference (FLAIRS ’99), pages 113–117, Orlando, FL, USA, 3-5 May 1999. Florida Research Society.

[38] M. S. Feather, S. L. Cornford, J. D. Kiper, and T. Menzies. Experiences using Visualization Techniques to Present Requirements, Risks to Them, and Options for Risk Mitigation. InProceedings of the International Workshop on Requirements Engineering Visualization (REV ’06), pages 10–10, Minnesota, USA, 11 September 2006. IEEE. [39] M. S. Feather and T. Menzies. Converging on the Optimal

Attainment of Requirements. InProceedings of the 10th IEEE International Conference on Requirements Engineering (RE ’02), pages 263–270, Essen, Germany, 9-13 September 2002. IEEE.

[40] N. E. Fenton and M. Neil. A critique of software defect prediction models.IEEE Transactions on Software Engineering, 25(5):675–689, 1999.

[41] A. Finkelstein, M. Harman, S. A. Mansouri, J. Ren, and Y. Zhang. “Fairness Analysis" in Requirements Assignments. InProceedings of the 16th IEEE

International Requirements Engineering Conference (RE ’08), pages 115–124, Barcelona, Catalunya, Spain, 8-12 September 2008. IEEE Computer Society.

[42] R. W. Floyd. Assigning meanings to programs. In J. T. Schwartz, editor,Mathematical Aspects of Computer Science, volume 19 ofSymposia in Applied Mathematics, pages 19–32. American Mathematical Society, Providence, RI, 1967.

[43] T. Foss, E. Stensrud, B. Kitchenham, and I. Myrtveit. A simulation study of the model evaluation criterion mmre. IEEE Transactions on Software Engineering, 29:985–995, 2003.

[44] V. Garousi, L. C. Briand, and Y. Labiche. Traffic-aware Stress Testing of Distributed Real-Time Systems based on UML Models using Genetic Algorithms.Journal of Systems and Software, 81(2):161–185, February 2008. [45] G. Gay, T. Menzies, M. Davies, and K. Gundy-Burlet. Automatically finding the control variables for complex system behavior.Automated Software Engineering, page to appear.

[46] N. Gold, M. Harman, Z. Li, and K. Mahdavi. Allowing Overlapping Boundaries in Source Code using a Search Based Approach to Concept Binding. InProceedings of the 22nd IEEE International Conference on Software

Maintenance (ICSM ’06), pages 310–319, Philadelphia, USA, 24-27 September 2006. IEEE Computer Society. [47] H.-G. Groß, B. F. Jones, and D. E. Eyres. Structural

Performance Measure of Evolutionary Testing applied to Worst-Case Timing of Real-Time Systems.IEE Proceedings - Software, 147(2):25–30, April 2000. [48] H.-G. Groß and N. Mayer. Evolutionary Testing in

Component-based Real-Time System Construction. In E. Cantú-Paz, editor,Proceedings of the 2002 Conference on Genetic and Evolutionary Computation (GECCO ’02), pages 207–214, New York, USA, 9-13 July 2002. Morgan Kaufmann Publishers.

[49] R. J. Hall. A quantum algorithm for software engineering search. InAutomated Software Engineering (ASE 2009), pages 40–51, Auckland, New Zealand, 2009. IEEE Computer Society.

[50] M. Harman. The current state and future of search based software engineering. In L. Briand and A. Wolf, editors, Future of Software Engineering 2007, pages 342–357, Los Alamitos, California, USA, 2007. IEEE Computer Society Press.

(11)

[51] M. Harman. Search based software engineering for program comprehension. In15thInternational Conference on Program Comprehension (ICPC 07), pages 3–13, Banff, Canada, 2007. IEEE Computer Society Press.

[52] M. Harman. Open problems in testability transformation (keynote). In1st International Workshop on Search Based Testing (SBT 2008), Lillehammer, Norway, 2008. [53] M. Harman. Why the virtual nature of software makes it

ideal for search based optimization. In13thInternational Conference on Fundamental Approaches to Software Engineering (FASE 2010), pages 1–12, Paphos, Cyprus, March 2010.

[54] M. Harman and J. A. Clark. Metrics Are Fitness Functions Too. InProceedings of the 10th IEEE International Symposium on Software Metrics (METRICS ’04), pages 58–69, Chicago, USA, 11-17 September 2004. IEEE Computer Society.

[55] M. Harman, R. Hierons, and M. Proctor. A New Representation and Crossover Operator for Search-based Optimization of Software Modularization. InProceedings of the 2002 Conference on Genetic and Evolutionary Computation (GECCO ’02), pages 1351–1358, New York, USA, 9-13 July 2002. Morgan Kaufmann Publishers. [56] M. Harman, L. Hu, R. M. Hierons, J. Wegener, H. Sthamer,

A. Baresel, and M. Roper. Testability Transformation. IEEE Transactions on Software Engineering, 30(1):3–16, January 2004.

[57] M. Harman and B. F. Jones. Search based software engineering.Information and Software Technology, 43(14):833–839, Dec. 2001.

[58] M. Harman, J. Krinke, J. Ren, and S. Yoo. Search based data sensitivity analysis applied to requirement engineering. InACM Genetic and Evolutionary Computation

COnference (GECCO 2009), pages 1681–1688, Montreal, Canada, 8th – 12th July 2009.

[59] M. Harman, K. Lakhotia, and P. McMinn. A multi-objective approach to search-based test data generation. InGECCO 2007: Proceedings of the9thannual conference on Genetic and evolutionary computation, pages 1098 – 1105, London, UK, July 2007. ACM Press.

[60] M. Harman, A. Mansouri, and Y. Zhang. Search based software engineering: A comprehensive analysis and review of trends techniques and applications. Technical Report TR-09-03, Department of Computer Science, King’s College London, April 2009.

[61] M. Harman and L. Tratt. Pareto optimal search-based refactoring at the design level. InGECCO 2007: Proceedings of the9thannual conference on Genetic and evolutionary computation, pages 1106 – 1113, London, UK, July 2007. ACM Press.

[62] C. A. R. Hoare. An Axiomatic Basis of Computer Programming.Communications of the ACM, 12:576–580, 1969.

[63] C. A. R. Hoare. How did software get so reliable without proof? InIEEE International Conference on Software Engineering (ICSE’96), Los Alamitos, California, USA, 1996. IEEE Computer Society Press. Keynote talk. [64] A. Idri, T. M. Khoshgoftaar, and A. Abran. Can neural

networks be easily interpreted in software cost estimation?, 2003.

[65] D. Jackson. Self-adaptive focusing of evolutionary effort in hierarchical genetic programming. In A. Tyrrell, editor,

2009 IEEE Congress on Evolutionary Computation, pages –, Trondheim, Norway, 18-21 May 2009. IEEE Computational Intelligence Society, IEEE Press. [66] O. Jalali, T. Menzies, and M. Feather. Optimizing

Requirements Decisions With KEYS. InProceedings of the 4th International Workshop on Predictor Models in Software Engineering (PROMISE ’08), pages 79–86, Leipzig, Germany, 12-13 May 2008. ACM.

[67] G. Jarillo, G. Succi, W. Pedrycz, and M. Reformat. Analysis of Software Engineering Data using

Computational Intelligence Techniques. InProceedings of the 7th International Conference on Object Oriented Information Systems (OOIS ’01), pages 133–142, Calgary, Canada, 27-29 August 2001. Springer.

[68] A. M. Joshi, L. Eeckhout, L. K. John, and C. Isen. Automated Microprocessor Stressmark Generation. In Proceedings of the 14th IEEE International Symposium on High Performance Computer Architecture (HPCA ’08), pages 229–239, Salt Lake City, UT, USA, 16-20 February 2008. IEEE.

[69] T. M. Khoshgoftaar and Y. Liu. A Multi-Objective Software Quality Classification Model Using Genetic Programming. IEEE Transactions on Reliability, 56(2):237–245, June 2007.

[70] T. M. Khoshgoftaar, Y. Liu, and N. Seliya. Genetic Programming-based Decision Trees for Software Quality Classification. InProceedings of the 15th International Conference on Tools with Artificial Intelligence (ICTAI ’03), pages 374–383, Sacramento, California, USA, 3-5 November 2003. IEEE Computer Society.

[71] T. M. Khoshgoftaar, N. Seliya, and D. J. Drown. On the Rarity of Fault-prone Modules in Knowledge-based Software Quality Modeling. InProceedings of the 20th International Conference on Software Engineering and Knowledge Engineering (SEKE ’08), pages 279–284, San Francisco, CA, USA, 1-3 July 2008. Knowledge Systems Institute Graduate School.

[72] T. M. Khoshgoftaar, L. Yi, and N. Seliya. A multiobjective module-order model for software quality enhancement. IEEE Transactions on Evolutionary Computation, 8(6):593– 608, December 2004.

[73] C. Kirsopp, M. Shepperd, and J. Hart. Search heuristics, case-based reasoning and software project effort prediction. InGECCO 2002: Proceedings of the Genetic and

Evolutionary Computation Conference, pages 1367–1374, San Francisco, CA 94104, USA, 9-13 July 2002. Morgan Kaufmann Publishers.

[74] J. R. Koza.Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, 1992.

[75] M. Kuperberg, K. Krogmann, and R. Reussner.

Performance Prediction for Black-Box Components Using Reengineered Parametric Behaviour Models. In

Proceedings of the 11th International Symposium on Component-Based Software Engineering (CBSE ’08), volume 5282 ofLNCS, pages 48–63, Karlsruhe, Germany, 14-17 October 2008. Springer.

[76] F. Lammermann, A. Baresel, and J. Wegener. Evaluating Evolutionary Testability with Software-Measurements. In Proceedings of the 2004 Conference on Genetic and Evolutionary Computation (GECCO ’04), volume 3103 of Lecture Notes in Computer Science, pages 1350–1362,