MOL Particle Swarm Optimiser - Parallel Implementation

3.4 Parallel Implementation

3.4.1 MOL Particle Swarm Optimiser

As mentioned earlier, the PSO was the focus of much early research into high-performance parallel implementations [256,286,195,304,13,112]. Some of these are GPU-based methods, and some are cluster-based methods; there are clearly several ways of parallelising the PSO. The earliest parallel PSO was that of Schutte et al. [256] in 2003, which divided the work among nodes of a Beowulf cluster using MPI (Message Passing Interface) [206]. Venter and Sobieszczanski-Sobieski also made use of MPI in 2005, and introduced another parallel PSO which uses a deliberate lack of synchronisation to improve parallel efficiency [286]. The simplicity of the PSO algorithm and its inherent parallel nature proved to be very attractive.

In later years, as GPUs have gained increasing interest, other authors have also proposed GPU-parallel PSO algorithms. These include the work of Bastos-Filho et al., who proposed a GPU-based PSO with a variety of communication topologies, both synchronous and asynchronous [13]. Another, named SyncPSO, was developed by Mussi et al. and embraces the limitations of thread blocks in terms of resources and synchronisation [195].

allocate and initialise enough space for n solution vectors with d dimensions on device
allocate space for dn random deviates
while termination criteria not met do
    call CURAND to fill the random number array with uniform deviates in the range [0,1)
    copy vectors to device, including g
    BEGIN CUDA KERNEL
    with dn CUDA threads (i = 0..dn−1);
    // Safety check for unconventional thread grid configurations.
    if i < dn then
        if velocity thread then
            // Thread grid is set up for separate threads to compute velocities and positions for simplicity.
            v_i ← ω v_i + φ_g r_{g,i} (g − x_i)
        end if
        syncthreads()
        if position thread then
            ensure velocity is within bounds
            x_i ← x_i + v_i
            ensure position vector is within bounds
            if i % (2d + 1) == 2d then
                calculate the fitness of vector x which i belongs to
            end if
        end if
    end if
    END CUDA KERNEL
    copy new vectors to host
    obtain the best solution and assign to g
    visualise the result
end while

ALGORITHM 6: The parallel implementation of the MOL PSO.
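For reference, the per-component update that each CUDA thread performs in Algorithm 6 can be sketched serially. This is a minimal sketch under stated assumptions, not the thesis implementation: the function names, the inertia weight `omega=0.7`, the sphere fitness function and the bounds of [-1, 1] are all hypothetical, while `phi_g=0.01` and `v_max=0.04` are the values quoted in the text for the experiments.

```python
import random


def mol_pso_step(positions, velocities, g_best, omega, phi_g, v_max, lo, hi, rng):
    """Advance every particle by one MOL PSO timestep and return new lists."""
    new_positions, new_velocities = [], []
    for x, v in zip(positions, velocities):
        new_x, new_v = [], []
        # On the GPU, each iteration of this inner loop is one thread.
        for x_k, v_k, g_k in zip(x, v, g_best):
            r = rng.random()  # uniform deviate in [0, 1)
            v_k = omega * v_k + phi_g * r * (g_k - x_k)  # velocity update
            v_k = max(-v_max, min(v_max, v_k))  # keep velocity within bounds
            x_k = max(lo, min(hi, x_k + v_k))  # position update, then bounds
            new_v.append(v_k)
            new_x.append(x_k)
        new_positions.append(new_x)
        new_velocities.append(new_v)
    return new_positions, new_velocities


def sphere(x):
    """Hypothetical fitness function (sum of squares), minimum 0 at the origin."""
    return sum(c * c for c in x)


def run(n=32, d=4, steps=200, seed=1):
    """Small driver: returns the best position found (assumed parameters)."""
    rng = random.Random(seed)
    lo, hi = -1.0, 1.0
    xs = [[rng.uniform(lo, hi) for _ in range(d)] for _ in range(n)]
    vs = [[0.0] * d for _ in range(n)]
    g = min(xs, key=sphere)
    for _ in range(steps):
        xs, vs = mol_pso_step(xs, vs, g, 0.7, 0.01, 0.04, lo, hi, rng)
        best = min(xs, key=sphere)
        if sphere(best) < sphere(g):
            g = best  # host-side step: obtain the best solution, assign to g
    return g
```

Because the step returns fresh lists instead of updating in place, it mirrors the separation of read-only and write-only solution vectors used on the device.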

As discussed in Section 3.2.2, the MOL PSO provides a few benefits which make parallelisation easier. Algorithm 6 shows a GPU-parallel version of the MOL PSO, in which each component of each particle has a dedicated thread. Much of this algorithm is dictated by implementation details typical of GPU-enabled algorithms. Firstly, syncthreads() is necessary to avoid race conditions between threads: it ensures that the new velocity has been computed and is ready for use before the position update reads it. It is worth noting that a simple modification using a previously calculated velocity can avoid this synchronisation. Other race conditions are eliminated by separating read-only solution vectors from write-only solution vectors. At every time step, the current best candidate (g) is placed in the device's global memory, where all thread blocks can access it. Assigning a thread to each dimension of each particle maximises the fine-grained parallelism of the algorithm. Once the velocity and position updates have been computed, all threads apart from one per particle are disabled, so that the fitness of the particle can be computed without race conditions. It is noteworthy that memory copies to and from the device are very expensive operations. Mussi et al., for example, opted instead for an algorithm which operates entirely on the GPU, synchronising with the host only once it has reached a maximum number of generations [195].

The reason for having identical if statements in the algorithm is to ensure that all threads reach the barrier synchronisation. This is a necessary condition to avoid undefined behaviour in CUDA [202]. Finally, the vectors are copied to the device such that position and velocity vectors are adjacent, in order to improve memory locality and retrieval speed.

Lévy flights are also used in Algorithm 6. Enough random deviates are generated by CURAND (which is included as part of CUDA) to compute a Lévy deviate for every dimension of every particle on the device.

In testing this algorithm, 2048 particles were used. This would be a considerably difficult task for a single-threaded implementation. For the Lévy deviates, the parameters used were c = 2 and α = 1.5 to give a balance between Cauchy and Rayleigh flights. φ_g was set to a constant value of 0.01 and particle velocities were constrained to between −0.04 and 0.04 in every dimension.
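The text does not state how the uniform deviates from CURAND are transformed into Lévy deviates. One standard option, shown here purely as an illustrative stand-in, is the Chambers-Mallows-Stuck method for symmetric α-stable deviates, which consumes two uniform deviates per output. The function name and its argument order are hypothetical; `alpha` and `c` correspond to the α and c quoted above, with c treated as a scale factor (an assumption), and the sketch is intended for 1 ≤ α ≤ 2.

```python
import math


def stable_deviate(u1, u2, alpha, c):
    """Symmetric alpha-stable deviate from two uniforms in [0, 1).

    Chambers-Mallows-Stuck method; an assumed stand-in for whatever
    transformation the original implementation used.
    """
    u = math.pi * (u1 - 0.5)  # angle in [-pi/2, pi/2)
    # Exponential(1) deviate, guarded so that u2 == 0 cannot divide by zero.
    w = max(-math.log(1.0 - u2), 1e-300)
    if alpha == 1.0:
        return c * math.tan(u)  # Cauchy special case
    a = math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
    b = (math.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha)
    return c * a * b
```

With `alpha=1.0` the expression reduces to a Cauchy deviate and with `alpha=2.0` to a Gaussian one, so `stable_deviate(u1, u2, 1.5, 2.0)` sits between the two heavy-tailed and light-tailed extremes.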

Measurement           Method         Schwefel 4D    Rastrigin 32D  Ackley 64D     de Jong 256D   Rosenbrock 32D
Success %             Original       1%             0%             0%             0%             0%
                      Lévy flights   100%           100%           52%            100%           0%
Best solution         Original       17±39          110±80         1.4±1          1.5±2          35±4
                      Lévy flights   0.00±0.00      0.00±0.00      0.00±0.00      0.00±0.00      30±1
Timestep time (µsec)  Original       140±2.5        1240±6         2221±0.9       3700±8         632±0.5
                      Lévy flights   173±0.7        1380±7         2565±1.8       4723±2.6       910±30
Total timesteps       Original       4000±400       4000±0         4000±0         4000±0         4000±0
                      Lévy flights   2000±430       213±21         2000±1600      1300±390       4000±0
Total time (msec)     Original       560±27         4950±20        8890±4         14800±35       2530±2.35
                      Lévy flights   330±70         300±30         6000±4300      6000±1800      3700±130

TABLE 3.3: Results for the parallel MOL PSO with 2048 particles, with and without Lévy flights (3000 frames each, averaged over 100 runs each), accompanied by standard deviations, for optimising various test functions. For each measurement, the first line pertains to the typical Brownian random space exploration method, and the second to the Lévy-flight method.

The results from comparing the parallel MOL PSO with regular uniform-random deviates against the version with Lévy flights are shown in Table 3.3. For each measurement collected, the first line pertains to the original Brownian motion space exploration method, and the second pertains to the Lévy-flight method. The most difficult function to optimise was the Rosenbrock function in 32 dimensions. It is hypothesised that variable dependence in the function and a generally neutral fitness landscape contribute to this.

The Lévy flights method seems to have an advantage over the original method. As noted earlier, computing Lévy deviates is more expensive than computing uniform random deviates. However, the results in Table 3.3 do not indicate that this increased cost is excessive: the time per timestep rises, but faster convergence means that the global minimum is obtained in fewer time steps, which is perhaps made clearer by the total time taken. For every function except Rosenbrock, less total time is needed to obtain the solution than with lower-quality space exploration.
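The relationship between the rows of Table 3.3 can be checked with simple arithmetic: the total time should be close to the mean timestep time multiplied by the mean number of timesteps. The sketch below uses a few entries copied from the table; the dictionary layout and function name are illustrative only, and the product of two means is only an approximation to the mean total time.

```python
# (function, method): (mean timestep time in microseconds,
#                      mean number of timesteps,
#                      reported total time in milliseconds)
TABLE_3_3 = {
    ("Schwefel 4D", "original"): (140, 4000, 560),
    ("Rastrigin 32D", "original"): (1240, 4000, 4950),
    ("Rastrigin 32D", "levy"): (1380, 213, 300),
    ("de Jong 256D", "original"): (3700, 4000, 14800),
    ("de Jong 256D", "levy"): (4723, 1300, 6000),
}


def estimated_total_ms(step_us, steps):
    """Estimate total run time in milliseconds from per-step microseconds."""
    return step_us * steps / 1000.0


def relative_error(estimate, reported):
    """Relative difference between the estimate and the reported value."""
    return abs(estimate - reported) / reported
```

For Rastrigin in 32 dimensions, for example, the Lévy-flight version costs about 11% more per timestep (1380 against 1240 µsec) but needs roughly 213 timesteps instead of 4000, so the estimated total falls from about 4960 msec to about 294 msec.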

In all cases, the original space exploration method terminated at the maximum number of frames (4000), and did not succeed in consistently minimising any of the functions for the given parameters. Success was defined as reaching within 0.0001 of the function minimum.

In a brief additional comparison, the GPU and CPU algorithms were used for De Jong’s sphere function in 256 dimensions, with 2048 particles. The CPU algorithm required roughly 65 msec for one timestep, and the GPU algorithm required 3.7 msec for one timestep, which is roughly an 18× speedup. It is important to take into account that the CPU algorithm can perform much faster with fewer particles, and compute many more timesteps in the same amount of time. However, with fewer particles it is more likely to succumb to typical difficulties such as deceptiveness in fitness functions.
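The quoted speedup follows directly from the two per-timestep measurements; as a worked check (the symbols t_CPU and t_GPU are introduced here only for illustration):

```latex
\[
  \text{speedup} \;=\; \frac{t_{\mathrm{CPU}}}{t_{\mathrm{GPU}}}
  \;=\; \frac{65\ \text{msec}}{3.7\ \text{msec}} \;\approx\; 17.6 \;\approx\; 18
\]
```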

(a) A plot of the particle population size against the frame calculation time, in microseconds, for varying numbers of particles with 64 dimensions each. The function used here is the Ackley function.

(b) A plot of the number of dimensions against the time taken to compute one timestep with 64 particles (averaged across 100 timesteps in a run) in a range of dimensions on the Ackley function (averaged over 100 separate runs).

FIGURE 3.4: Parallel MOL PSO: Particle population size scaling and dimension count scaling characteristics.

Figures 3.4(a) and 3.4(b) show some scaling performance results. The objective of these tests was to measure population-scaling characteristics and how the algorithm responds to higher dimensions. Scaling appears to be fairly linear for system sizes up to 4096. The fact that the MOL PSO is not of complexity O(n^2) but more of the order O(n log n), due to a lack of interaction except with the global best solution, means that the algorithm will scale very well. In addition, having one thread assigned to each dimension of every particle further reduces the scaling coefficient in Figure 3.4(b).


3.4.2 Firefly Algorithm

(I, p.47)

Parallelising the FA is met with considerable difficulty [118]. System scaling remains a substantial problem even after attempting to rectify the excessive computation required by unrestricted communication between agents. It is therefore prudent to examine how to reformulate the optimiser so that it is better suited to parallelisation, while minimising the loss of effectiveness. The modification discussed here truncates the neighbourhood topology of each particle (agent) in order to reduce the number of interactions necessary. The basis for this is the exponential decay function, which degrades perceived fitness across a distance. Assuming that particles at a great distance have negligible influence on the movement of a particle, these can be removed without causing significant differences in convergence. This does, however, introduce a new tradeoff: neighbourhoods must be neither too small (so as to cause stagnation) nor too large (so as to reintroduce all interactions and hence the original O(n^2) complexity).
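The effect of truncating the neighbourhood can be illustrated with a uniform grid. The sketch below is a simplified assumption, not the spatial partitioning scheme the thesis refers to: fireflies are binned into cubic boxes of side `box_size`, and only fireflies sharing a box are allowed to interact. All names are hypothetical.

```python
from collections import defaultdict


def grid_cell(position, box_size):
    """Integer grid coordinates of the box containing a position."""
    return tuple(int(c // box_size) for c in position)


def build_grid(positions, box_size):
    """Map each occupied grid box to the indices of the fireflies inside it."""
    grid = defaultdict(list)
    for index, p in enumerate(positions):
        grid[grid_cell(p, box_size)].append(index)
    return grid


def interaction_pairs(positions, box_size):
    """Ordered pairs (i, j), i != j, of fireflies that share a grid box."""
    pairs = []
    for members in build_grid(positions, box_size).values():
        for i in members:
            for j in members:
                if i != j:
                    pairs.append((i, j))
    return pairs
```

With n fireflies spread evenly over many boxes, the number of pairs grows with the box occupancy instead of with n(n − 1), which is the saving the truncation is after. Making the boxes large enough to hold the whole population recovers the original all-pairs behaviour.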

For convenience, the Firefly update formula is repeated here:

$$x_{i+1} = x_i + \beta e^{-\gamma r_{ij}^2} (x_j - x_i) + \alpha(d) \qquad (3.12)$$
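A single application of the update formula can be sketched as follows. Two points are assumptions made for the sake of a runnable example: the attraction decays with the squared distance between fireflies i and j, and the α(d) term is interpreted as `alpha` times a uniform deviate in [-0.5, 0.5) drawn independently for each dimension. The function name is hypothetical.

```python
import math
import random


def firefly_move(x_i, x_j, beta, gamma, alpha, rng):
    """Move firefly i towards a brighter firefly j and return the new position."""
    # Squared Euclidean distance between the two fireflies (assumed form).
    r2 = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    attraction = beta * math.exp(-gamma * r2)
    return [
        a + attraction * (b - a) + alpha * (rng.random() - 0.5)
        for a, b in zip(x_i, x_j)
    ]
```

With `alpha=0.0` the move is deterministic: a firefly at distance 1 with `beta=1.0` and `gamma=math.log(2.0)` travels half of the way towards its neighbour.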

The modification made is to choose γ as follows:

$$\gamma = \frac{\ln h}{-g^2}$$

where h ∈ ℝ, 0 < h < 1, is the coefficient given to the β-step when a firefly is observed on the boundary of the local neighbourhood at distance g. Experimental testing revealed that a smoother decay towards the boundary of the neighbourhood had little benefit over a simpler zero-one cutoff at the grid box boundary. To improve compute performance, the zero-one cutoff was used instead of the light decay function.
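The choice of γ can be checked numerically: with γ = ln h / (−g^2), the decay factor exp(−γ r^2) equals exactly h at r = g. The sketch below also contrasts the smooth decay with a zero-one cutoff, interpreted here (as an assumption) as a coefficient of 1 inside the neighbourhood radius and 0 outside it. Function names are hypothetical.

```python
import math


def gamma_for_boundary(h, g):
    """Return gamma such that exp(-gamma * g**2) == h."""
    if not 0.0 < h < 1.0:
        raise ValueError("h must lie strictly between 0 and 1")
    if g <= 0.0:
        raise ValueError("g must be positive")
    return math.log(h) / -(g * g)


def decay_coefficient(r, h, g, hard_cutoff=True):
    """Coefficient applied to the beta-step at distance r."""
    if hard_cutoff:
        # Assumed zero-one cutoff: full attraction inside, none outside.
        return 1.0 if r <= g else 0.0
    return math.exp(-gamma_for_boundary(h, g) * r * r)
```

Since ln h is negative for 0 < h < 1, γ is always positive, so the smooth coefficient falls from 1 at r = 0 to h at the neighbourhood boundary. The cutoff version avoids the exponential entirely.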

To compare against a single-threaded implementation, an implementation written by Mancuso was obtained [177]. Four test functions were used to compare these algorithms: the Rosenbrock, Rastrigin and Schwefel functions, and Ackley’s Path function.

                   Rosenbrock   Rastrigin   Schwefel   Ackley’s Path
GPU Time (msec)    9488.7       966.5       848.6      949.9
GPU Minimum        0.000045     1.1741      25.475     3.061
CPU Time (msec)    368460       367329      369935     368384
CPU Minimum        0.000071     1.445       73.267     1.1382

TABLE 3.4: CPU vs GPU parallel Firefly algorithms in optimising a set of 3-parameter test functions.

Figure 3.5 shows a screenshot of the GPU-based FA, deliberately slowed by small step sizes in order to accentuate the movement characteristics of the fireflies. Thanks to the spatial partitioning techniques developed in Section 2.2.2 in the previous chapter, this algorithm runs very quickly for such a colossal population size.

Table 3.4 contains some performance data for the comparison between the original CPU FA and the GPU parallel Firefly algorithm presented here. The data collected were averaged over 100 independent runs. Both CPU and GPU algorithms maintained populations of 4096 particles and 600 time steps were executed for one run on each. The global minima for each test function in three dimensions were as follows:


FIGURE 3.5: A population of 262,244 fireflies optimising a 3-parameter Rosenbrock function. Step sizes were deliberately smaller to accentuate movement through space and discovery of better solutions.

1. Rosenbrock function: for f(x, y, z), f(1, 1, 1) = 0.

2. Ackley’s Path function: for f(x, y, z), f(0, 0, 0) = 0.

3. Rastrigin function: for f(x, y, z), f(0, 0, 0) = 0.

4. Schwefel function: for f(x, y, z), f(420.9, 420.9, 420.9) = 0 (see footnote 6).
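The four minima listed above can be reproduced with the commonly used textbook definitions of these functions. The definitions below are assumptions, since the thesis does not restate them here; in particular, the Schwefel function is written with the usual offset of 418.9829 per dimension so that its minimum is approximately zero, which may differ from the renormalisation the author applied.

```python
import math


def rosenbrock(x):
    """Rosenbrock function; minimum 0 at (1, ..., 1)."""
    return sum(
        100.0 * (x[k + 1] - x[k] ** 2) ** 2 + (1.0 - x[k]) ** 2
        for k in range(len(x) - 1)
    )


def ackley(x):
    """Ackley's Path function; minimum 0 at the origin."""
    d = len(x)
    mean_sq = sum(c * c for c in x) / d
    mean_cos = sum(math.cos(2.0 * math.pi * c) for c in x) / d
    return -20.0 * math.exp(-0.2 * math.sqrt(mean_sq)) - math.exp(mean_cos) + 20.0 + math.e


def rastrigin(x):
    """Rastrigin function; minimum 0 at the origin."""
    return 10.0 * len(x) + sum(c * c - 10.0 * math.cos(2.0 * math.pi * c) for c in x)


def schwefel(x):
    """Schwefel function, offset so the minimum near 420.9687 is about 0."""
    return 418.9829 * len(x) - sum(c * math.sin(math.sqrt(abs(c))) for c in x)
```

Evaluating `schwefel([420.9687] * 3)` gives a value very close to zero; the 420.9 quoted in the list is a rounded form of the same location.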

Both the CPU and GPU implementations randomly distribute particles in the allowable ranges of each function shown in Table 3.1. The boundary checks in the CPU Firefly algorithm were removed to more closely resemble the implementation of the GPU algorithm. However, the CPU algorithm was implemented using double precision, whereas the GPU algorithm uses single precision. Given the large margin of performance difference indicated by the results, this difference in precision is unlikely to be its main cause. Such a considerable speedup is certainly worth the effort considering the size of the population being used. Saturating the search space with particles certainly assists the algorithms in finding the global minimum, especially for the Schwefel function, where the traditional bounds are −500 < x_i < 500.

Overall, the parallel algorithm achieved both the best compute time and the most accurate solution, except on the Ackley Path function, where the CPU algorithm was more accurate. One plausible explanation is that the Ackley Path function is simply more suited to an optimiser with global interaction, such as the original CPU-based FA. By observation, once the parallel algorithm’s particles prematurely converge, they are completely out of reach of others in a different grid box, and zero useful interactions will take place. In the case of the single-threaded algorithm, all particles are always in contact with each other, and when premature convergence occurs, it may still be possible for particles to escape and move closer to the global minimum.

6. The global minimum of the Schwefel function is −3351.8632, but to yield a minimum of zero, this was renormalised so that the minimum is at 0.

The speed-up obtained by the GPU over the CPU is 39 times for 4096 fireflies, but would be much higher for larger numbers of fireflies. Larger bounding boxes such as −500 < x_i < 500 would suit even larger numbers of fireflies. In these experiments, it is difficult to form conclusive comparisons (as with metaheuristics in general), but in testing, it was not possible to use the CPU FA for optimising the Schwefel, Rastrigin, and Ackley functions in reasonable time. The GPU FA discussed could optimise all three within one second.
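The speed-up figures can be recomputed from the timings in Table 3.4. The sketch below simply divides CPU time by GPU time for each function; the dictionary and function names are illustrative. The quoted factor of 39 appears to correspond to the Rosenbrock row, while the other three rows give considerably larger ratios.

```python
# Function name: (CPU time in msec, GPU time in msec), copied from Table 3.4.
TABLE_3_4_MS = {
    "Rosenbrock": (368460.0, 9488.7),
    "Rastrigin": (367329.0, 966.5),
    "Schwefel": (369935.0, 848.6),
    "Ackley's Path": (368384.0, 949.9),
}


def speedups(table):
    """Ratio of CPU time to GPU time for every test function."""
    return {name: cpu / gpu for name, (cpu, gpu) in table.items()}
```

The GPU times for Rastrigin, Schwefel and Ackley's Path are all below 1000 msec, which matches the statement that the GPU FA handled all three within one second.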

3.4.3 Discussion

It is possible to achieve good performance with far fewer fireflies. In testing these algorithms, what seemed important was whether local minima were present in the test functions. Functions lacking many local optima tended to be easy for the CPU implementation to optimise with very few fireflies and in very short time. Using 16 fireflies, the single-threaded algorithm could achieve an error of less than 0.00005 in approximately 25 msec (on a 10-run average) on the Rosenbrock function. By comparison, the parallel FA discussed here takes approximately 338 msec (10-run average) to obtain an error less than 0.00005. The CPU FA has a clear advantage here. However, using 16 fireflies on a complex function containing many local minima, such as the Schwefel or Rastrigin functions, the CPU algorithm will either take an inordinate amount of time (days or weeks), or fail to reach the global minimum due to premature convergence to local minima.

The basic principle at work in the parallel FA is that of divide and conquer. Saturating the parameter space between the allowable constraints with 2048 or more fireflies reduces the maximum distance between the true global minimum and the nearest firefly. While this still does not provide any guarantee, it greatly increases the chance of success. It is important to note that this is valid for constrained optimisation, but different strategies would be necessary for unconstrained optimisation, where no bounding box is supplied.
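A rough feel for the saturation argument comes from the characteristic spacing between fireflies if they were placed on a regular lattice filling the bounding box. This is an idealisation, since the actual initial placement is random, and the function name is hypothetical.

```python
def characteristic_spacing(side, dims, n):
    """Lattice spacing when n points evenly fill a cube of the given side."""
    return side / n ** (1.0 / dims)
```

For the Schwefel bounds of −500 < x_i < 500 in three dimensions, the side length is 1000: 16 fireflies give a spacing of roughly 397, whereas 4096 fireflies give 62.5, so the nearest firefly starts far closer to any given minimum.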

Little discussion has taken place on the choice of constant parameters in these algorithms. For the MOL PSO and the FA, at least 3 parameters needed to be calibrated by hand. These choices were made empirically by observing the effects on a visualisation of the optimiser. Thoughtful consideration is necessary when choosing these parameters, as in some cases the wrong choice can lead to consistently suboptimal solutions. The next section deals with the intriguing problem of reinterpreting this hand-calibration effort as another optimisation problem.

In document Data parallel structural optimisation in agent based modelling : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Albany, New Zealand (Page 69-75)