Chapter 2 Background
2.6 Optimisation Under Uncertainty
2.6.4 Simulated Annealing
Simulated annealing is based on annealing in metallurgy [Cerny, 1985; Kirkpatrick et al., 1983]. The basis of the method is that a physical system which is cooled quickly will generally not reach an optimal energy configuration, and will be prone to defects. By contrast, a physical system which is cooled slowly is much more likely to reach a globally optimal energy configuration and be without defects.
The simulated annealing algorithm applies the same idea to numerical problems. At the start of the optimisation, moves through the search space that lead to a solution which is worse than the current solution are readily accepted. As the optimisation progresses the likelihood of accepting these bad moves reduces. This idea is akin to gradually reducing the temperature of the system. At a high temperature the optimisation is free to move around the search space, irrespective of the quality of the moves. However, as the temperature reduces the mobility of the optimisation reduces and it is likely to only accept moves that show an improvement over the current choice (state).
To define the algorithm more formally, consider an optimisation where we wish to minimise a function, $f(x)$, over $x \in A \subset \mathbb{R}^d$. First, we consider how to sample from the distribution described by the (relative) density $g(x) = e^{-\beta f(x)}$ for some fixed $\beta \in \mathbb{R}$. Markov Chain Monte Carlo sampling techniques can be used to draw samples from this distribution.
To construct such a chain we must define a transition density between two points in the state space, $q(x \to y)$, where $x, y \in A$. This transition density is composed of two components. The first is the probability of attempting a move to a particular point; we label this the perturbation density, $h(x \to y)$. The second is the probability of accepting such a move, labelled $a(x \to y)$. We then have that
\[ q(x \to y) = h(x \to y)\, a(x \to y). \]
In other words, the transition density is equal to the probability density of considering a move and then accepting that move.
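The decomposition of a transition into a proposal followed by an acceptance test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `chain_step`, `propose` and `accept_prob` are hypothetical, `propose(x, rng)` is assumed to draw a candidate from $h(x \to \cdot)$, and `accept_prob(x, y)` is assumed to return $a(x \to y)$. When a move is rejected the chain simply remains at $x$.

```python
import random


def chain_step(x, propose, accept_prob, rng):
    """Perform one transition of the chain from state x."""
    y = propose(x, rng)  # attempt a move: draw y from h(x -> .)
    if rng.random() < accept_prob(x, y):  # accept with probability a(x -> y)
        return y
    return x  # move rejected: the chain stays where it is


# Hypothetical usage: a Gaussian perturbation with two placeholder
# acceptance rules (always accept, never accept).
rng = random.Random(0)
gaussian_step = lambda x, rng: x + rng.gauss(0.0, 1.0)
always_accept = lambda x, y: 1.0
never_accept = lambda x, y: 0.0

moved = chain_step(0.0, gaussian_step, always_accept, rng)
stayed = chain_step(0.0, gaussian_step, never_accept, rng)
```

With `always_accept` the chain follows the perturbation density alone, and with `never_accept` it never leaves its starting point; a useful acceptance function sits between these two extremes.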
Since we already know the distribution we wish our chain to follow, we can use this to define our choice of acceptance function. The most common choice for $a(x \to y)$ in this case is the Metropolis acceptance function. This is defined to be
\[ a(x \to y) = \min\left\{ \frac{g(y)\, h(y \to x)}{g(x)\, h(x \to y)},\; 1 \right\}, \]
where $g(x)$ is the density we wish to sample from. Justification as to why a chain using this acceptance function has a stationary distribution with relative density $g(x)$ is provided in [Roberts and Rosenthal, 2004].
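As a minimal sketch, the acceptance function above can be written directly in Python. The function name `metropolis_acceptance_general` and the example densities are hypothetical and chosen only for illustration; `g` is assumed to be a relative (unnormalised) density and `h(x, y)` is assumed to evaluate the perturbation density $h(x \to y)$.

```python
import math


def metropolis_acceptance_general(g, h, x, y):
    """Return min(g(y) h(y -> x) / (g(x) h(x -> y)), 1).

    Assumes g(x) * h(x -> y) > 0, which holds whenever the chain is
    currently at x and y was proposed from x.
    """
    ratio = (g(y) * h(y, x)) / (g(x) * h(x, y))
    return min(ratio, 1.0)


# Hypothetical example 1: g(x) = exp(-beta * x^2) with a symmetric proposal.
beta = 2.0
g = lambda x: math.exp(-beta * x * x)
h_symmetric = lambda x, y: math.exp(-0.5 * (y - x) ** 2)

downhill = metropolis_acceptance_general(g, h_symmetric, 1.0, 0.0)
uphill = metropolis_acceptance_general(g, h_symmetric, 0.0, 1.0)

# Hypothetical example 2: a flat target with a proposal that favours
# moves to the right, so the correction factor h(y -> x) / h(x -> y) matters.
flat = lambda x: 1.0
h_rightward = lambda x, y: 2.0 if y > x else 1.0
corrected = metropolis_acceptance_general(flat, h_rightward, 0.0, 1.0)
```

In the first example the move towards higher density is always accepted, while the reverse move is accepted with probability $g(1)/g(0)$. In the second example the target is flat, so the acceptance probability comes entirely from the asymmetry of the proposal.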
In most cases $h(x \to y)$ is chosen to be symmetrical, for example, $h(x \to y) = \phi(x - y)$, where $\phi$ is the density function of the normal distribution. If $h(x \to y)$ is symmetric and we set $g(x) = e^{-\beta f(x)}$ (as above) then the Metropolis acceptance function in our case becomes
\[ a(x \to y \mid \beta) = \min\left\{ e^{-\beta (f(y) - f(x))},\; 1 \right\}. \]
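The effect of $\beta$ on this acceptance function can be seen with a small numerical sketch. The function name `metropolis_acceptance` is hypothetical, and the sketch assumes $\beta \ge 0$ so that any move with $f(y) \le f(x)$ has acceptance probability one.

```python
import math


def metropolis_acceptance(f_x, f_y, beta):
    """a(x -> y | beta) = min(exp(-beta * (f_y - f_x)), 1), assuming beta >= 0."""
    delta = f_y - f_x
    if delta <= 0.0:
        # Improvements (and ties) are always accepted. Branching here also
        # avoids evaluating exp() of a large positive number.
        return 1.0
    return math.exp(-beta * delta)


# A move that worsens the objective by one unit, at three inverse temperatures.
for beta in (0.1, 1.0, 10.0):
    print(beta, metropolis_acceptance(1.0, 2.0, beta))
```

For a fixed worsening of the objective, the acceptance probability falls as $\beta$ grows (that is, as the temperature falls), whereas an improving move is accepted at every temperature.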
In simulated annealing we initially run this chain for a fixed amount of time (for some initial choice of $\beta$). We then reduce the temperature (i.e. $\beta \uparrow$) and run the chain for a further block of time. Each new chain is initiated using the last value of $x$ accepted by the previous chain. To define this precisely we further require a function which defines how the (inverse) temperature of our system changes, $b(k)$; the number of temperature changes, $M$; the number of iterations for each temperature change, $L$; and an initial state, $x_0$. The simulated annealing algorithm is then as defined in Algorithm 2.1.
The resulting algorithm will, given enough time, converge (in probability) to the global minimum of the objective function $f(x)$ [Henderson et al., 2003]. [Gelfand and Mitter, 1989] further showed convergence (in probability) to the global optimum when $f(x)$ could only be estimated and the resulting estimates had a Gaussian error distribution. [Gutjahr and Pflug, 1996] extended this, showing convergence as long as the error distribution was both symmetric and suitably peaked around the correct value. However, in all of these proofs the required cooling schedule would take too long to be feasible in practice, and so convergence to the global minimum cannot be guaranteed in practical use. As described above, we know exactly which distribution the chain explores at a fixed temperature ($\beta^{-1}$). However, each time the temperature changes it will take some time for the chain to return to the equilibrium distribution. The behaviour of the chain during this period is unknown.
Algorithm 2.1 Simulated Annealing Algorithm
    x := x0, v := f(x) and β = 0
    for 1 ≤ k ≤ M do
        β := b(k)
        for L steps do
            Pick x̃ subject to h(x → x̃)
            if a(x → x̃ | β) > u s.t. u ∼ U(0, 1) then
                x := x̃
    return x
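Algorithm 2.1 translates fairly directly into Python. The following is a sketch under stated assumptions rather than a reference implementation: the perturbation density is taken to be symmetric, so the simplified Metropolis acceptance function applies; `perturb(x, rng)` is a hypothetical helper assumed to draw from $h(x \to \cdot)$; and the quadratic objective, linear schedule and step size in the usage lines are arbitrary choices made for illustration.

```python
import math
import random


def simulated_annealing(f, x0, b, M, L, perturb, rng=None):
    """Minimise f following Algorithm 2.1, assuming a symmetric perturbation.

    b(k) gives the inverse temperature for block k, M is the number of
    temperature changes and L the number of attempted moves per block.
    """
    rng = rng or random.Random()
    x, v = x0, f(x0)
    for k in range(1, M + 1):
        beta = b(k)  # inverse temperature for this block
        for _ in range(L):
            x_new = perturb(x, rng)  # pick a candidate from h(x -> .)
            v_new = f(x_new)
            delta = v_new - v
            # Metropolis acceptance: min(exp(-beta * delta), 1)
            accept_prob = 1.0 if delta <= 0.0 else math.exp(-beta * delta)
            if accept_prob > rng.random():  # u ~ U(0, 1)
                x, v = x_new, v_new
    return x  # as in Algorithm 2.1, the last accepted state is returned


# Hypothetical usage: minimise (x - 2)^2 from a poor starting point.
objective = lambda x: (x - 2.0) ** 2
gaussian_step = lambda x, rng: x + rng.gauss(0.0, 0.5)
linear_cooling = lambda k: 0.5 * k

result = simulated_annealing(
    objective, x0=-5.0, b=linear_cooling, M=50, L=100,
    perturb=gaussian_step, rng=random.Random(0),
)
```

The sketch caches the current objective value `v` so that $f$ is evaluated once per attempted move. Like the pseudocode, it returns the final state rather than the best state visited.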
Picking an appropriate cooling schedule (as encompassed by $b(k)$) is a known problem of interest. Two common strategies are to either cool the system linearly ($b(k) = \beta_0 + kw$ where $\beta_0, w \in \mathbb{R}$), or to cool the system exponentially ($b(k) = \alpha^{-k}\beta_0$ where $\alpha \in (0, 1]$ and $\beta_0 \in \mathbb{R}$) [Chen et al., 2007; Guoa and Zhengb, 2005]. It is worth noting that [Strenski and Kirkpatrick, 1991] do not find any measurable difference between the performance of linear and exponential (geometric) cooling schedules. However, for convergence to be certain (in probability) a logarithmic cooling schedule needs to be used ($b(k) = \log(ck + d)$ where $c, d \in \mathbb{R}$, although normally $d = 1$). Unfortunately, this strategy is too slow for normal usage [Nourani and Andresen, 1998].
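The three schedules can be written as small factory functions that return $b(k)$. This is an illustrative sketch: the function names are hypothetical, and the exponential schedule assumes $\alpha$ is strictly positive, since $\alpha^{-k}$ is undefined at $\alpha = 0$.

```python
import math


def linear_schedule(beta0, w):
    """b(k) = beta0 + k * w; the system cools when w > 0."""
    return lambda k: beta0 + k * w


def exponential_schedule(beta0, alpha):
    """b(k) = alpha^(-k) * beta0, assuming 0 < alpha <= 1."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return lambda k: beta0 * alpha ** (-k)


def logarithmic_schedule(c, d=1.0):
    """b(k) = log(c * k + d), with d = 1 as the usual default."""
    return lambda k: math.log(c * k + d)


# Compare how quickly the inverse temperature grows under each schedule.
schedules = {
    "linear": linear_schedule(1.0, 0.5),
    "exponential": exponential_schedule(1.0, 0.5),
    "logarithmic": logarithmic_schedule(1.0),
}
for name, b in schedules.items():
    print(name, [round(b(k), 3) for k in (1, 10, 100)])
```

The logarithmic schedule raises $\beta$ far more slowly than the other two, which is consistent with the remark that it is too slow for normal usage.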
Another consideration is how often to cool the system. In the simple algorithm above the system is cooled every $L$ steps. This is a static schedule. An adaptive schedule could also be used. An adaptive schedule varies the cooling rate using information obtained during the algorithm's execution [Henderson et al., 2003]. For example, the system could be cooled every $L$ accepted moves rather than every $L$ attempted moves.
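A sketch of this adaptive variant is given below, again assuming a symmetric perturbation density. The cap `max_attempts` is an assumption added here, not part of the description above: at low temperatures very few moves are accepted, so without a cap a block might never finish. All names are hypothetical.

```python
import math
import random


def anneal_cool_on_accepted(f, x0, b, M, L, perturb,
                            max_attempts=10_000, rng=None):
    """Simulated annealing that cools after every L accepted moves.

    max_attempts bounds the attempted moves per temperature (an assumed
    safeguard so that each block terminates).
    """
    rng = rng or random.Random()
    x, v = x0, f(x0)
    for k in range(1, M + 1):
        beta = b(k)
        accepted = 0
        attempts = 0
        while accepted < L and attempts < max_attempts:
            attempts += 1
            x_new = perturb(x, rng)
            v_new = f(x_new)
            delta = v_new - v
            # Accept improvements outright; otherwise apply the Metropolis test.
            if delta <= 0.0 or math.exp(-beta * delta) > rng.random():
                x, v = x_new, v_new
                accepted += 1
    return x


# Hypothetical usage on the quadratic objective (x - 2)^2.
objective = lambda x: (x - 2.0) ** 2
gaussian_step = lambda x, rng: x + rng.gauss(0.0, 0.5)
adaptive_result = anneal_cool_on_accepted(
    objective, x0=-5.0, b=lambda k: 0.5 * k, M=50, L=20,
    perturb=gaussian_step, rng=random.Random(0),
)
```

Under this rule the time spent at each temperature lengthens automatically as the acceptance rate falls, whereas the static schedule spends the same number of attempted moves at every temperature.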
There is no commonly accepted cooling method [Henderson et al., 2003]. In reality, the optimal cooling schedule is often problem-specific. Examples of further problem-specific cooling schedules can be found in [Kolonko, 1999; Bertsimas and Tsitsiklis, 1993; Thompson and Dowsland, 1998].
During the course of this work we will focus on simple cooling problems. As discussed above, our intention is to consider the scenario when f(x) cannot be calculated exactly and there is some associated error. Ideas similar to this have previously been considered in [Ball et al., 2003].