

5.4 Covariance Matrix Adaptation Annealing

5.4.2 Covariance Matrix Adaptation Annealing Algorithm

The proposed algorithm has several advantages: 1) It drives the landscape transition of the energy function from roughly convex to the original peaked shape, which allows particles to move gradually towards the global mode; 2) The perturbation is not isotropic, but tends to perturb the samples along the weighted directions conforming to the observation likelihood; 3) An accumulative path, similar to the evolution path of CMA-ES, is incorporated into the perturbation matrix to maintain the primary trend. This prevents the perturbation directions from changing dramatically and reduces the chance of successive perturbations oscillating and cancelling each other out; and 4) It is inspired by human social behaviour, where individuals are attracted by the combined behaviour of both successful individuals and majority interests. This extra influence allows each sample to move closer to better-fitting samples or to the mean sample, so all samples are expected to move towards the centre and gradually concentrate. The algorithm preserves the major features of Simulated Annealing; the major differences lie only in the perturbation matrix update.

Initially, a series of survival rates $\alpha_1, \ldots, \alpha_M$ for $M$ phases is defined. A large survival rate suggests slow convergence and is often used with a large number of phases. Conversely, a small survival rate creates a sharply shaped importance distribution, resulting in fast convergence, and is suitable for a small number of phases. In our experience, it is recommended that $\alpha$ be varied from 0.3 to 1. The perturbation matrix $P_0$ is initialised to account for maximum perturbation according to the specific context. If the perturbation matrix from the previous time is available, $P_{0,t}$ is initialised by:

$$P_{0,t} = \beta P_{0,t-1} + (1 - \beta) P_0$$

where $\beta$ controls the balance between the dynamic part $P_{0,t-1}$ from the previous time and the stable part $P_0$, which acts as a hard constraint. Otherwise, $P_{0,t} = P_0$. If the previous $N$ samples/particles $x_{t-1}$ are available, the $N$ particles $x_t^i$ can be initialised with the temporal model. Otherwise, the particles $x_t^i$ are initialised by applying Gaussian perturbation with covariance $P_{0,t}$ and mean $\mu = 0$ to the given initial position.
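The initialisation of the perturbation matrix can be illustrated with a minimal sketch. It assumes the blend is the convex combination of the previous-time matrix and the stable matrix described above; the function name `blend_perturbation` and the numeric values are hypothetical, and matrices are plain nested lists to keep the example self-contained.

```python
def blend_perturbation(p_prev, p_base, beta):
    """Return beta * p_prev + (1 - beta) * p_base, element by element.

    p_prev plays the role of the dynamic part from the previous time,
    p_base the role of the stable part; both are square nested lists.
    """
    return [
        [beta * a + (1.0 - beta) * b for a, b in zip(row_prev, row_base)]
        for row_prev, row_base in zip(p_prev, p_base)
    ]


# Hypothetical 2-D values, chosen only for illustration.
P0 = [[4.0, 0.0], [0.0, 4.0]]        # stable part (maximum perturbation)
P_prev = [[1.0, 0.5], [0.5, 2.0]]    # dynamic part from the previous time
P0_t = blend_perturbation(P_prev, P0, beta=0.25)
```

A small `beta` keeps the result close to the stable part, while a `beta` near 1 mostly reuses what was learned at the previous time.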

Then, for all particles $x_t^i$, we evaluate the energy function $E(y_t, x_t^i)$ to obtain the


is minimised. It is assumed to be positive. As the evaluation of the energy function forms the computational bottleneck, the energy function should be designed to be quick to compute. After determining the energy for all particles, we form the importance distribution with the specific survival rate $\alpha_m$. This is done by solving Equation (5.1.2) to find $\lambda_m$ such that the importance distribution has the desired shape. We use the resulting $\lambda_m$ to update the weights for all particles, so that the weighted particles can be regarded as an approximation of the importance distribution. When $N$ particles are resampled from the importance distribution, the offspring have a higher chance of coming from better parent particles. Roughly speaking, $100 \cdot \alpha_m$ percent of particles will have offspring, and the other particles will be discarded. This is how the survival rate shapes the importance distribution, influences particle selection and eventually controls the convergence speed.
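One possible way to find $\lambda_m$ is sketched below. Equation (5.1.2) is not reproduced in this section, so the sketch assumes it is the condition listed in step 3 of Algorithm 9, $\alpha_m N \sum_i (w^i)^2 = (\sum_i w^i)^2$ with $w^i = \exp\{-\lambda E^i\}$. Bisection is an assumed solver, not necessarily the one used by the author, and all function names are hypothetical.

```python
import math
import random


def survival_rate(energies, lam):
    """(sum w)^2 / (N * sum w^2) for w = exp(-lam * E).

    Energies are shifted by their minimum for numerical stability;
    the ratio is unchanged by this common scale factor.
    """
    e_min = min(energies)
    w = [math.exp(-lam * (e - e_min)) for e in energies]
    total = sum(w)
    return total * total / (len(w) * sum(wi * wi for wi in w))


def solve_lambda(energies, alpha, tol=1e-10, max_doublings=60):
    """Find lam >= 0 with survival_rate(energies, lam) == alpha by bisection."""
    if alpha >= 1.0:
        return 0.0  # lam = 0 gives uniform weights, i.e. a rate of 1
    lo, hi = 0.0, 1.0
    # The rate is non-increasing in lam, so grow hi until it brackets alpha.
    for _ in range(max_doublings):
        if survival_rate(energies, hi) <= alpha:
            break
        lo, hi = hi, 2.0 * hi
    else:
        raise ValueError("alpha is not reachable for these energies")
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if survival_rate(energies, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def annealing_weights(energies, alpha):
    """Return (lam, normalised weights) for the requested survival rate."""
    lam = solve_lambda(energies, alpha)
    e_min = min(energies)
    w = [math.exp(-lam * (e - e_min)) for e in energies]
    total = sum(w)
    return lam, [wi / total for wi in w]


# Hypothetical energies for four particles.
energies = [1.0, 2.0, 3.0, 4.0]
lam, weights = annealing_weights(energies, alpha=0.5)

# Resampling with replacement in proportion to the weights.
rng = random.Random(0)
particles = ["a", "b", "c", "d"]
offspring = rng.choices(particles, weights=weights, k=len(particles))
```

The quantity computed by `survival_rate` is the effective sample size divided by $N$, which is why fixing it to $\alpha_m$ controls roughly what fraction of the particles produce offspring.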

5.4.2.1 Perturbation Matrix and Particle Velocity Update

The new particles are derived from three major factors: 1) direct resampling from the importance distribution, $\hat{x}^i$; 2) Gaussian perturbation with adaptive covariance, $\delta^i$; and 3) the particle velocity imposed by attraction from superior individuals, $v^i$. This is formulated by:

$$x^i = \hat{x}^i + (1 - c_v)\,\delta^i + c_v v^i$$

$\hat{x}^i$ is similar to the resampled particle in Simulated Annealing, and $\delta^i$ is a Gaussian perturbation generated by $\mathcal{N}\big(0, (\prod_{j=1}^{m} \alpha_j) P_m\big)$. The covariance $P_m$ is learned adaptively throughout the course of optimisation and progressively scaled by the survival rate to simulate cooling and freezing of the particles' movement. Unlike the Gaussian perturbation, the particle velocity $v^i$ drives the particle in the direction of the global best and mean particles, similar to the PSO algorithm. $c_v \in [0, 1]$ is a parameter that controls the relative contributions of Gaussian perturbation and particle velocity: $c_v > 0.5$ favours fast convergence in the roughly convex situation, whereas $c_v < 0.5$ favours broader exploration of the multimodal landscape.
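The particle update can be sketched as follows. The sketch assumes that the scale applied to the covariance is the cumulative product of the survival rates up to the current phase, and it draws the Gaussian perturbation through a Cholesky factor, which is one common choice rather than something the text prescribes. The helper names are hypothetical, and `math.prod` requires Python 3.8 or later.

```python
import math
import random


def cholesky(a):
    """Lower-triangular factor L with L L^T = a, for a symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_perturbation(p_m, alphas_so_far, rng):
    """Draw delta ~ N(0, prod(alphas_so_far) * p_m)."""
    scale = math.prod(alphas_so_far)  # assumed cooling factor
    n = len(p_m)
    chol = cholesky([[scale * p_m[r][c] for c in range(n)] for r in range(n)])
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(chol[r][k] * z[k] for k in range(r + 1)) for r in range(n)]


def move_particle(x_hat, delta, v, c_v):
    """x = x_hat + (1 - c_v) * delta + c_v * v, component by component."""
    return [xh + (1.0 - c_v) * d + c_v * vi for xh, d, vi in zip(x_hat, delta, v)]


# Hypothetical 2-D values.
rng = random.Random(0)
P_m = [[4.0, 2.0], [2.0, 3.0]]
delta = sample_perturbation(P_m, alphas_so_far=[0.9, 0.8], rng=rng)
x_new = move_particle([1.0, 1.0], [0.2, -0.2], [1.0, 0.0], c_v=0.5)
```

With `c_v=0.5` the perturbation and the velocity contribute equally, so `x_new` is approximately `[1.6, 0.9]`.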

In the Gaussian perturbation, the covariance matrix $P_m$ is calculated adaptively, similar to the Rank-µ-Update and Rank-One-Update in CMA-ES, which can be mathematically represented as:

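$$\bar{x}_m = \sum_{i=1}^{N} w^i x^i$$

$$p_c = (1 - c_c)\,p_c + c_c\,(\bar{x}_m - \bar{x}_{m-1})$$

$$P_m = (1 - c_1 - c_\mu)\,P_{m-1} + c_1\,p_c p_c^T + c_\mu \sum_{i=1}^{N} w^i (x^i - \bar{x}_m)(x^i - \bar{x}_m)^T$$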

where the Rank-µ-Update term $c_\mu \sum_{i=1}^{N} w^i (x^i - \bar{x}_m)(x^i - \bar{x}_m)^T$ is a covariance weighted by the observation likelihood that shapes the perturbation direction; it is expected to push the next particle perturbations towards more favourable regions. The Rank-One-Update term $c_1 p_c p_c^T$ incorporates historical information to smooth the perturbation directions, by encouraging particles to move along the accumulative path of the mean particle $\bar{x}_m$ without excessive oscillations. The parameters $c_c$, $c_\mu$ and $c_1$ control the contributions of the Rank-µ-Update and Rank-One-Update, as well as the exponential smoothing between phases. In our experience, $c_c = 0.5$, $c_\mu = 1/3$ and $c_1 = 1/3$ give reasonable results for general cases.
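A minimal sketch of one perturbation-matrix update is given below, using the parameter values quoted above as defaults. It assumes the weights are normalised to sum to one before use, which the text does not state explicitly. The function names are hypothetical, and vectors and matrices are plain lists.

```python
def outer(u, v):
    """Outer product u v^T as a nested list."""
    return [[ui * vj for vj in v] for ui in u]


def update_perturbation(p_prev, p_c, mean_prev, particles, weights,
                        c_c=0.5, c_1=1.0 / 3.0, c_mu=1.0 / 3.0):
    """Return (mean, p_c, p_new) after one rank-one plus rank-mu update."""
    total = sum(weights)
    w = [wi / total for wi in weights]  # assumption: normalised weights
    dim = len(particles[0])

    # Weighted mean of the particles.
    mean = [sum(wi * x[d] for wi, x in zip(w, particles)) for d in range(dim)]

    # Accumulative path: exponential smoothing of the mean displacement.
    p_c = [(1.0 - c_c) * pc + c_c * (m - mp)
           for pc, m, mp in zip(p_c, mean, mean_prev)]
    rank_one = outer(p_c, p_c)

    # Weighted covariance of the particles around the new mean.
    rank_mu = [[0.0] * dim for _ in range(dim)]
    for wi, x in zip(w, particles):
        diff = [xd - md for xd, md in zip(x, mean)]
        o = outer(diff, diff)
        for r in range(dim):
            for c in range(dim):
                rank_mu[r][c] += wi * o[r][c]

    p_new = [[(1.0 - c_1 - c_mu) * p_prev[r][c]
              + c_1 * rank_one[r][c]
              + c_mu * rank_mu[r][c]
              for c in range(dim)] for r in range(dim)]
    return mean, p_c, p_new


# Hypothetical 2-D values: two equally weighted particles on the first axis.
mean, path, P_new = update_perturbation(
    p_prev=[[1.0, 0.0], [0.0, 1.0]],
    p_c=[0.0, 0.0],
    mean_prev=[0.0, 0.0],
    particles=[[0.0, 0.0], [2.0, 0.0]],
    weights=[0.5, 0.5],
)
```

In this example the particles spread only along the first axis and the mean also moved along it, so the updated matrix keeps more variance in that direction than in the second one.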

The particle velocity is perturbed to simulate social behaviour and interaction between particles, through attraction to the best individual and the mean individual, as formulated below:

$$v^i = (1 - r_b)\,\frac{\bar{x}_m - \hat{x}^i}{\|\bar{x}_m - \hat{x}^i\|_2^2} + r_b\,\frac{x^{(best)} - \hat{x}^i}{\|x^{(best)} - \hat{x}^i\|_2^2}$$

where $r_b$ denotes a uniform random variable in $[0, 1]$. Thus, $v^i$ is a vector originating at $\hat{x}^i$ and pointing into the range between $\bar{x}_m$ and $x^{(best)}$. The squared $L_2$ denominator is also consistent with the weakening of influence in a social network as the physical distance between two individuals increases. Therefore, if $\hat{x}^i$ is far away from both $\bar{x}_m$ and $x^{(best)}$, the particle velocity has a very small magnitude and the Gaussian perturbation $\delta^i$ will dominate the new position of the particle. If $\hat{x}^i$ is far from $\bar{x}_m$ and close to $x^{(best)}$, the particle velocity will be dominated by $x^{(best)} - \hat{x}^i$. Conversely, if $\hat{x}^i$ is far from $x^{(best)}$ and close to $\bar{x}_m$, the particle velocity will drive $\hat{x}^i$ to move more towards $\bar{x}_m$.
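The velocity term can be sketched as below. The small constant `eps` is an assumption added here to avoid division by zero when a resampled particle coincides exactly with the mean or the best particle; it is not part of the formulation in the text. The function names are hypothetical.

```python
import random


def attraction(target, x_hat, eps=1e-12):
    """(target - x_hat) divided by the squared L2 distance between them.

    eps is an assumed guard: a particle sitting exactly on the target
    receives zero attraction instead of raising ZeroDivisionError.
    """
    diff = [t - x for t, x in zip(target, x_hat)]
    sq_dist = sum(d * d for d in diff)
    return [d / (sq_dist + eps) for d in diff]


def particle_velocity(x_hat, mean, best, r_b):
    """Blend of the pull towards the mean particle and the pull towards the best one."""
    to_mean = attraction(mean, x_hat)
    to_best = attraction(best, x_hat)
    return [(1.0 - r_b) * a + r_b * b for a, b in zip(to_mean, to_best)]


# Hypothetical 2-D values with a fixed r_b for reproducibility.
v = particle_velocity(x_hat=[0.0, 0.0], mean=[2.0, 0.0], best=[0.0, 1.0], r_b=0.5)

# In the algorithm, r_b would instead be drawn per particle:
rng = random.Random(0)
r_b = rng.random()
```

Here the best particle is at distance 1 and the mean at distance 2, so the resulting `v` is approximately `[0.25, 0.5]` and the pull towards the nearer target is twice as strong. Each attraction term has magnitude equal to the inverse of the distance, so it shrinks for distant targets and grows as the particle approaches one.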

To summarise, Covariance Matrix Adaptation Annealing is given in Algorithm 9.

Algorithm 9 Covariance Matrix Adaptation Annealing at time $t$

Require: a sequence of $\alpha_m$ defined for every phase, the previous particles $x_{t-1}$, the observation $y_t$, the number of phases $M$, and the initial covariance matrices $P_0$ and $P_{0,t}$.

for $m = 1$ to $M$ do

1: Initialise $N$ particles $x_t^i$ from the previous phase or the temporal model $p(x_t^i \mid x_{t-1}^i)$.

2: Calculate the energy $E(y_t, x_t^i)$ for all particles.

3: Find $\lambda_m$ by solving the equation $\alpha_m N \sum_{i=1}^{N} (w_{t,m}^i)^2 = \big(\sum_{i=1}^{N} w_{t,m}^i\big)^2$.

4: Update the weights for all particles using $w^i = p(y_t \mid x_t^i) = \exp\{-\lambda_m E(y_t, x_t^i)\}$.

5: Resample $N$ particles $\hat{x}^i$ from the importance distribution.

6: Update the perturbation matrix:

$$\bar{x}_m = \sum_{i=1}^{N} w^i x^i$$

$$p_c = (1 - c_c)\,p_c + c_c\,(\bar{x}_m - \bar{x}_{m-1})$$

$$P_{m,t} = (1 - c_1 - c_\mu)\,P_{m-1,t} + c_1\,p_c p_c^T + c_\mu \sum_{i=1}^{N} w^i (x^i - \bar{x}_m)(x^i - \bar{x}_m)^T$$

7: Generate $N$ random perturbations $\delta^i$ by Gaussian noise with covariance $(\prod_{j=1}^{m} \alpha_j) P_{m,t}$ and mean $\mu = 0$.

8: Compute the particle velocities imposed by the majority force from $\bar{x}_m$ and the attraction to the current global best $x^{(best)}$: $v^i = (1 - r_b)\,\frac{\bar{x}_m - \hat{x}^i}{\|\bar{x}_m - \hat{x}^i\|_2^2} + r_b\,\frac{x^{(best)} - \hat{x}^i}{\|x^{(best)} - \hat{x}^i\|_2^2}$.

9: Compute the final $N$ particles with $x^i = \hat{x}^i + (1 - c_v)\,\delta^i + c_v v^i$.

end for