3. Statistical Methods for the Analysis of Stochastic Optimisers
3.4. Performance measurement
The performance of optimisation algorithms that return approximations of optimal solutions is described by two variables: the quality of the solution and the computation time needed to produce it.
In the case of SLS methods, a natural termination criterion does not exist, and the more time is allocated, the better the solution quality should be. To compare several algorithms, it is therefore common practice to allow all algorithms
to consume the same amount of computational resources (Rardin and Uzsoy, 2001, refer [...]
such an experimental setting, algorithms with a natural termination condition should be restarted if they end before the imposed time limit.) If the comparison is done on more than one instance, an instance-independent measure of solution quality must be defined. A convenient measure of solution quality is the "distance", or error, from the optimal value. This measure raises two problematic issues. The first is the determination of the optimal solution, given that, for the cases of interest to us, exact methods are infeasible. There may be cases in which the optimal solution is known from the way the instances were constructed. An alternative is to substitute bounds or approximations for the optimal solution quality. Unfortunately, these values are often weak indicators of the optimal values. Statistical estimation techniques based on extreme value theory constitute another alternative for the estimation of optimal solutions. In this approach, the distribution of the best solutions in n independent solution samples is approximated by a Weibull distribution, and a confidence interval for the optimal solution quality is derived from it. These techniques have been applied in the field of combinatorial optimisation (McRoberts, 1971; Dannenbring, 1977; Golden and Alt, 1979; Smith and Sucur, 1996; Ovacik et al., 2000), but further investigations on more problems are necessary to assess the actual reliability of their estimates. A last possibility, widely used in practice, is to compare results with the best known solutions. To this end, instances are taken from well-known benchmark sets for which a collection of good solutions is available from previous studies. If instead the instances are new, the best solution can be taken to be the best one produced by any of the algorithms involved in the comparison. Another option is to perform long runs of a good algorithm and record the best solutions found, unless this is computationally too expensive. The drawback of relating the analysis to best known results is that the comparison becomes biased by these values and may change if the best known values improve.
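The option of taking, for new instances, the best result produced by any of the compared algorithms as the reference value can be sketched as follows. This is only an illustration: the function name `best_known_costs`, the layout of the `results` dictionary, and all cost values are assumptions made for this example, and minimisation is assumed.

```python
def best_known_costs(results):
    """Return, for each instance, the lowest cost found by any algorithm.

    `results` maps algorithm name -> {instance name -> list of costs, one per run}.
    Minimisation is assumed (hypothetical data layout, not from the text).
    """
    best = {}
    for runs_by_instance in results.values():
        for instance, costs in runs_by_instance.items():
            candidate = min(costs)
            if instance not in best or candidate < best[instance]:
                best[instance] = candidate
    return best


# Made-up costs for two algorithms on two instances.
results = {
    "alg_A": {"inst1": [105, 102, 110], "inst2": [2040, 2010]},
    "alg_B": {"inst1": [101, 108, 103], "inst2": [2025, 2030]},
}
reference = best_known_costs(results)
print(reference)
```

Because the reference values come from the compared algorithms themselves, adding a further algorithm or more runs can change them, which is the bias mentioned above.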
The second problematic issue in the definition of a measure for the error is that different instances exhibit different scales of solution costs. If we denote the cost of the solution found on a run of an algorithm on an instance i as c(i) and the optimal cost, or a possible
approximation of it, as copt(i), the relative error |c(i) − copt(i)|/copt(i) may help to make
results among instances more comparable. Zemel (1981) defines an error measure as
“proper” if it remains invariant under some trivial transformation on the instance that leaves the problem equivalent. The relative error is not always proper and he proposes as a more robust measure
\[
e(c, i) = \frac{c(i) - c_{opt}(i)}{c_0(i) - c_{opt}(i)} \tag{3.1}
\]
where c0(i) is the worst solution cost. Unfortunately, deriving the worst solution cost may be a problem as hard as finding copt. Zlochin and Dorigo (2002) therefore suggest the use of a surrogate value for c0(i), that is, the expected cost of a uniform distribution of solutions or the expected cost produced by a standard algorithm, such as a heuristic or a random solution generator. With this latter choice, besides being more invariant than the relative error, the error measure 3.1 also has the practical property of providing an immediate indication of how much better an algorithm performs compared to an elementary algorithm. A value of the error e(c, i) close to 1 indicates that the performances of the two algorithms are similar.
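The difference between the two error measures can be illustrated with a small numerical sketch. The function names and all cost values are assumptions made for this example, and adding a constant to every solution cost is taken here as one example of a transformation that leaves the problem equivalent.

```python
def relative_error(c, c_opt):
    """Relative error |c - c_opt| / c_opt."""
    return abs(c - c_opt) / c_opt


def normalised_error(c, c_opt, c_ref):
    """Error measure (3.1): (c - c_opt) / (c_ref - c_opt).

    `c_ref` plays the role of c0(i): the worst cost or a surrogate for it,
    for example the expected cost of a random solution generator.
    """
    return (c - c_opt) / (c_ref - c_opt)


# Made-up costs on one instance.
c, c_opt, c_ref = 120.0, 100.0, 200.0
shift = 1000.0  # constant added to every solution cost (assumed transformation)

rel_before = relative_error(c, c_opt)
rel_after = relative_error(c + shift, c_opt + shift)
norm_before = normalised_error(c, c_opt, c_ref)
norm_after = normalised_error(c + shift, c_opt + shift, c_ref + shift)

print(rel_before, rel_after)    # the relative error changes under the shift
print(norm_before, norm_after)  # measure (3.1) does not
```

The shift cancels in both the numerator and the denominator of (3.1), whereas in the relative error it remains in the denominator.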
A different approach from considering an error measure is to transform results into ranks. This method is appealing in experiments with algorithms on several instances because each instance can be seen as a judge who assigns a vote to the algorithms. A ranking procedure operates within instances and assigns value 1 to the best result, value 2 to the
next, etc., with duplicate ranks allowed in the case of identical results. Most importantly, ranking within instances provides results that are invariant with respect to different instance scales. Nevertheless, ranking necessarily reduces the information available, as it neglects the magnitude of the differences among algorithms on single instances.
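A minimal sketch of ranking within instances is given below. The text does not specify how ranks continue after a tie, so dense ranking (tied results share a rank and the next distinct result receives the next integer) is an assumption here, as are the function name, the data, and the use of the mean rank as a summary.

```python
def rank_within_instance(costs):
    """Assign rank 1 to the lowest cost, 2 to the next distinct cost, and so on.

    `costs` maps algorithm name -> cost on one instance (minimisation assumed).
    Identical costs receive the same rank (dense ranking, an assumption).
    """
    distinct = sorted(set(costs.values()))
    rank_of = {cost: r for r, cost in enumerate(distinct, start=1)}
    return {alg: rank_of[cost] for alg, cost in costs.items()}


# Made-up results on two instances with very different cost scales.
results = {
    "inst1": {"alg_A": 105, "alg_B": 101, "alg_C": 105},
    "inst2": {"alg_A": 2010, "alg_B": 2025, "alg_C": 2300},
}
ranks = {inst: rank_within_instance(costs) for inst, costs in results.items()}

# One possible summary across instances (not prescribed by the text).
algorithms = ["alg_A", "alg_B", "alg_C"]
mean_rank = {
    alg: sum(ranks[inst][alg] for inst in ranks) / len(ranks)
    for alg in algorithms
}
print(ranks)
print(mean_rank)
```

On `inst2` the gap between `alg_B` and `alg_C` is much larger than the gap between `alg_A` and `alg_B`, yet both gaps become a difference of one rank, which illustrates the loss of information noted above.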
If optimisation algorithms are evaluated on the basis of their ability to solve an instance to optimality or to produce a given solution quality, performance can be assessed by measuring the computation time (also called run time). To make run times on different machines comparable, transformation ratios may be used; these are obtained by running on all the machines a benchmark code that implements algorithmic operations similar to those of the algorithms studied. When possible, measuring basic algorithmic operations that are common to all algorithms removes machine dependencies and is, therefore, a better choice.
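The use of transformation ratios can be sketched as follows. The function names, the choice of one machine as the reference, and all timings are assumptions made for this example. The ratio is taken as the benchmark time on a given machine divided by the benchmark time on the reference machine.

```python
def transformation_ratio(benchmark_time, reference_benchmark_time):
    """Ratio of benchmark run times: a given machine over the reference machine."""
    return benchmark_time / reference_benchmark_time


def normalise_run_time(run_time, ratio):
    """Convert a run time measured on a machine into reference-machine seconds."""
    return run_time / ratio


# Made-up timings: the benchmark code takes 10 s on the reference machine
# and 20 s on a slower machine.
ratio_slow = transformation_ratio(20.0, 10.0)

# A run that took 90 s on the slower machine, expressed in reference seconds.
equivalent_time = normalise_run_time(90.0, ratio_slow)
print(ratio_slow, equivalent_time)
```

Such a conversion is only as reliable as the similarity between the benchmark code and the algorithms studied, which is why counting common basic operations is preferable when it is possible.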