2.2 Introduction to Parallel ABMS
2.2.3 Multiple GPUs
Aaby and colleagues at Oak Ridge National Laboratory have shown that multiple GPUs can be harnessed together to improve performance further [1]. While using multiple GPUs yields a considerable performance increase, the underlying algorithmic complexity is not reduced unless a method such as spatial partitioning is also used. The parallelisation process depends essentially on the strategy chosen. In the case of ABM, the most immediate method is to assign a thread to every agent. This is “fine-grained” parallelism. Unless each agent has independent tasks which can themselves be parallelised, there is no other logical way to assign more threads to the problem. “Coarse-grained” parallelism in an agent-based model assigns one thread (or processor) to several agents.
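The difference between the two granularities can be illustrated with a minimal sketch of how agent indices might be divided among workers. The function name `assign_agents` and the contiguous chunking scheme are assumptions made for illustration only; they are not taken from the implementation described here.

```python
def assign_agents(num_agents, num_workers):
    """Split agent indices 0..num_agents-1 into contiguous chunks, one per worker.

    Hypothetical helper: with num_workers == num_agents every worker owns one
    agent (fine-grained); with fewer workers each owns several (coarse-grained).
    """
    if num_agents < 0 or num_workers <= 0:
        raise ValueError("num_agents must be >= 0 and num_workers must be > 0")
    base, extra = divmod(num_agents, num_workers)
    chunks = []
    start = 0
    for worker in range(num_workers):
        # The first `extra` workers take one additional agent each.
        size = base + (1 if worker < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


# Fine-grained: one worker (thread) per agent.
fine = assign_agents(8, 8)
# Coarse-grained: three workers share eight agents.
coarse = assign_agents(8, 3)
```

In the fine-grained case each chunk holds a single index, which corresponds to one thread per agent. In the coarse-grained case a worker loops over its chunk sequentially.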
By using the uniform grid developed above, one can further improve the performance of a multiple-GPU agent-based model. Algorithm 3 shows a simple method to accomplish this. Note that it is not strictly necessary for every device to build the datastructures independently. However, since a timestep cannot be computed until the datastructure exists, every device builds it separately. This avoids having to distribute the datastructure to every device, which would incur an additional memory copy penalty.
32 2. PARALLEL AGENT-BASED MODELLING AND SIMULATION
Allocate arrays
copyVectorsToDevices()
for i ← 0 to n OR NOT exit condition do
    i ← i + 1
    for each GPU in parallel do
        for j ← 0 to NUM_BOIDS in parallel do
            hashes[j] = calculate_hash(position[j])
        end for
        Sort by hash key (hashes, indices)
        Populate boxStart and boxEnd
        Write agents to their sorted locations
        for each agent in parallel do
            for every adjacent grid cell 0 ≤ g < 8 do
                for every agent a in grid cell g do
                    if a in communication radius then
                        Compute rule contributions
                    end if
                end for
            end for
            Update position and velocity using averaged contributions
        end for
    end for
    copyPartialVectorsFromDevices()
    copyVectorsToDevices()
    drawBoids()
end for
ALGORITHM 3: Multiple-GPU (mGPU) implementation of the uniform grid in Boids.
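The grid construction steps of the algorithm (hash, sort by key, populate boxStart and boxEnd, write agents to sorted locations) can be sketched sequentially as follows. This is a minimal illustration under stated assumptions: a 2D space with non-negative coordinates, a square grid of `grid_dim` cells per side, a row-major hash, and `-1` as the marker for an empty cell. None of these details are given in the text, and a GPU version would perform each step in parallel.

```python
EMPTY = -1  # assumed marker for a grid cell that holds no agents


def cell_hash(position, cell_size, grid_dim):
    """Map a 2D position to a row-major linear cell index (assumed hash)."""
    cx = min(int(position[0] / cell_size), grid_dim - 1)
    cy = min(int(position[1] / cell_size), grid_dim - 1)
    return cy * grid_dim + cx


def build_uniform_grid(positions, cell_size, grid_dim):
    """Return (indices, box_start, box_end, sorted_positions)."""
    # Step 1: one hash per agent.
    hashes = [cell_hash(p, cell_size, grid_dim) for p in positions]
    # Step 2: sort agent indices by hash key (stable sort).
    indices = sorted(range(len(positions)), key=lambda j: hashes[j])
    # Step 3: record where each cell's run of agents starts and ends
    # in the sorted order (end is exclusive).
    num_cells = grid_dim * grid_dim
    box_start = [EMPTY] * num_cells
    box_end = [EMPTY] * num_cells
    for slot, j in enumerate(indices):
        h = hashes[j]
        if box_start[h] == EMPTY:
            box_start[h] = slot
        box_end[h] = slot + 1
    # Step 4: write agents to their sorted locations.
    sorted_positions = [positions[j] for j in indices]
    return indices, box_start, box_end, sorted_positions


example_positions = [(2.5, 0.5), (0.2, 0.1), (0.7, 0.9), (1.5, 1.5)]
indices, box_start, box_end, sorted_positions = build_uniform_grid(
    example_positions, cell_size=1.0, grid_dim=3
)
```

After construction, the agents in cell `g` occupy `sorted_positions[box_start[g]:box_end[g]]`, so a neighbour search only needs to visit the cells adjacent to an agent rather than every other agent.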
In Algorithm 3, a timestep is computed by first copying the velocity and position vectors to all devices, so that identical information is cloned onto every device. It is not necessary for all devices to be aware of all agents. It is sometimes suitable to distribute distinct parts of the space (either lattice or continuous) to the devices, each with a read-only border [1]. This is one possible enhancement to the above algorithm. After the GPUs have constructed the uniform grid, they execute the CUDA kernel, which uses the grid by iterating over each adjacent grid cell and the agents within it. Finally, the agents are updated, and the host copies back the sequences of agents modified by each GPU. The process repeats when the host reconstructs the full arrays and copies them to the devices again. There are, however, some redundancies here which can be eliminated.
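The clone, partial update, and gather pattern described above can be sketched on the host side as follows. This is only an illustration under assumptions: devices are simulated by a sequential loop, the update rule is a toy attraction towards the centroid standing in for the Boids rules, and the names `step_slice` and `multi_device_step` are hypothetical.

```python
def step_slice(agent_range, positions, velocities, dt=0.1, gain=0.05):
    """Update only the agents in agent_range, reading the full state.

    Toy rule (an assumption, not the Boids rules): steer towards the centroid.
    """
    n = len(positions)
    cx = sum(p[0] for p in positions) / n
    cy = sum(p[1] for p in positions) / n
    new_pos, new_vel = [], []
    for j in agent_range:
        vx = velocities[j][0] + gain * (cx - positions[j][0])
        vy = velocities[j][1] + gain * (cy - positions[j][1])
        new_vel.append((vx, vy))
        new_pos.append((positions[j][0] + vx * dt, positions[j][1] + vy * dt))
    return new_pos, new_vel


def multi_device_step(positions, velocities, num_devices):
    """One timestep: clone full state per device, update a slice, gather."""
    n = len(positions)
    bounds = [n * d // num_devices for d in range(num_devices + 1)]
    full_pos, full_vel = [], []
    for d in range(num_devices):  # conceptually executed in parallel
        # Stands in for copyVectorsToDevices(): every device gets a full clone.
        dev_pos, dev_vel = list(positions), list(velocities)
        part_pos, part_vel = step_slice(
            range(bounds[d], bounds[d + 1]), dev_pos, dev_vel
        )
        # Stands in for copyPartialVectorsFromDevices(): host gathers slices.
        full_pos.extend(part_pos)
        full_vel.extend(part_vel)
    return full_pos, full_vel


positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0), (2.0, 2.0)]
velocities = [(0.1, 0.0), (0.0, 0.1), (-0.1, 0.0), (0.0, -0.1), (0.05, 0.05)]
multi_pos, multi_vel = multi_device_step(positions, velocities, num_devices=3)
single_pos, single_vel = step_slice(range(len(positions)), positions, velocities)
```

Because every device reads the same cloned state and writes a disjoint slice, the gathered result matches a single-device update. The cost is the full copy to every device at each timestep, which is the redundancy noted above.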
The most immediate drawback here is that the devices must synchronise so that the host can construct the whole array of agents, which is then fed back to the GPUs. This can be improved using newer CUDA releases, which allow several devices to communicate with each other directly through a unified address space [202]. Successive releases of CUDA have improved upon each other significantly, and the trend seems set to continue. Page-locked memory can also be used to construct the full array of agents, which will further improve performance.
It is also very important to consider the memory hierarchy at almost every step of development in these programs. Memory fetches from global memory are extremely slow in comparison to memory fetches from the registers and shared memory on each SM of the GPU. The constant memory bank, which is a cached segment of global memory, should not be ignored either. Use of the texture cache can also be a source of improvement here, since it is designed to operate at maximum efficiency for algorithms which benefit from spatial locality in their data [202].
(a) A log-linear plot of agent-agent interaction time step compute time using the CUDA kernel with different system sizes across timesteps.
(b) Datastructure construction time plot by system size for mGPU.
(c) Comparison of mGPU and single-GPU algorithms for timestep computing by system size.
FIGURE 2.15: Multiple-GPU performance results for datastructure construction and timestep computation.
Figure 2.15 shows some performance data for the multiple-GPU implementation. As expected, the compute time of the multiple-GPU algorithm increases more slowly with system size than that of the single-GPU algorithm (Figure 2.15(c)). Although there could be some algorithmic improvements, a slower increase in compute time is certainly desirable. The time taken to construct the spatial datastructures appears to be fairly consistent across system sizes using the mGPU algorithm (Figure 2.15(b)). Figure 2.15(a) shows that computing time increases quickly over the first few timesteps of a simulation run, followed by a more consistent increase. Computing times continue to increase because the agents occupy fewer grid boxes as they move towards the centre of the space. Ideally, the spatial distribution of agents would be more uniform, to obtain maximum utility from the spatial partitioning algorithm.
Part II
Parallel Evolutionary Algorithms
CHAPTER 3
CONTINUOUS GLOBAL OPTIMISATION
Many problems can be described in terms of a real-valued cost function. For example, a wind turbine blade requires a design which maximises output power given specific weather conditions [138]. Hypothetically, the output power P = F(f, r, m) could be expressed in terms of fibreglass thickness f, number of ribs r, and perhaps more directly mass m. Furthermore, these parameters may be subject to upper and lower limits due to tolerances in allowable blade stresses and displacements. Optimisation of cost functions such as these is relatively simple, but the choice of algorithms is vast enough to cast doubt on “off-the-shelf” solutions, especially given recent theoretical advances. Such optimisers are broadly categorised into population-based and trajectory-based methods. Some are then further divided into stochastic and deterministic methods. Problems themselves can also be categorised as constrained or unconstrained.
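To make the idea of a bounded, real-valued cost function concrete, the following minimal sketch applies a stochastic, trajectory-based optimiser (a random-walk hill climber) to a three-parameter problem. Everything in it is an assumption for illustration: `toy_cost` is a made-up quadratic and not a blade model, the bounds are arbitrary, the rib count is treated as continuous, and maximising power P is expressed as minimising a cost.

```python
import random


def clamp(x, lower, upper):
    """Project a candidate back inside the box constraints."""
    return [min(max(xi, lo), hi) for xi, lo, hi in zip(x, lower, upper)]


def toy_cost(x):
    """Hypothetical cost in (f, r, m); lowest at an arbitrary target point."""
    targets = (2.0, 5.0, 30.0)
    return sum((xi - t) ** 2 for xi, t in zip(x, targets))


def random_walk_minimise(cost, lower, upper, steps=2000, scale=0.1, seed=0):
    """Single-trajectory stochastic search that accepts only improvements."""
    rng = random.Random(seed)
    x = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
    fx = cost(x)
    start_cost = fx
    for _ in range(steps):
        # Gaussian perturbation scaled to the width of each parameter range.
        candidate = clamp(
            [xi + rng.gauss(0.0, scale * (hi - lo))
             for xi, lo, hi in zip(x, lower, upper)],
            lower,
            upper,
        )
        fc = cost(candidate)
        if fc < fx:
            x, fx = candidate, fc
    return x, fx, start_cost


lower_bounds = [0.5, 1.0, 10.0]   # assumed lower limits on f, r, m
upper_bounds = [5.0, 10.0, 50.0]  # assumed upper limits on f, r, m
best, best_cost, start_cost = random_walk_minimise(
    toy_cost, lower_bounds, upper_bounds
)
```

A population-based method would maintain many such candidates at once and let them share information, which is what makes these methods natural candidates for data-parallel hardware.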
The purpose of this chapter is to introduce and characterise a selection of major stochastic numerical optimisers in terms of data-parallel computing, applicability to calibrating agent-based models, and later use in parallel geometric optimisation in Chapter 4. Optimiser evaluation methods and some notable variations in algorithm design are discussed. Spatial partitioning concepts are adopted from Chapter 2 in order to propose improvements to these algorithms. Advanced space exploration techniques such as Lévy flights, and variations of these, are discussed. Finally, a study on optimiser calibration is presented. Given the fundamental similarities between population-based optimisers such as the PSO and agent-based models, the calibration of the two can be treated as equivalent from an optimisation point of view. The chapter ends with a short study on higher-dimensional visualisation.
The contents of this chapter extend upon work previously published by the author in Proc. 12th IASTED Int. Conf. on Artificial Intelligence and Applications¹, Proc. Int. Conf. on Modelling, Identification and Control (AsiaMIC 2013)², Parallel and Cloud Computing³ and also Proc. Int. Conf. on Genetic and Evolutionary Methods (GEM 2012)⁴.
¹ A. V. Husselmann and K. A. Hawick. Random flights for particle swarm optimisers. In Proc. 12th IASTED Int. Conf. on Artificial Intelligence and Applications, Innsbruck, Austria, 11-13 February 2013. IASTED.
² A. V. Husselmann and K. A. Hawick. Particle swarm-based meta-optimising on graphical processing units. In Proc. Int. Conf. on Modelling, Identification and Control (AsiaMIC 2013), Phuket, Thailand, 10-12 April 2013. IASTED.
³ A. V. Husselmann and K. A. Hawick. Levy flights for particle swarm optimisation algorithms on graphical processing units. Parallel and Cloud Computing, 2(2):32–40, April 2013.
⁴ A. V. Husselmann and K. A. Hawick. Parallel parametric optimisation with firefly algorithms on graphical processing units. In Proc. Int. Conf. on Genetic and Evolutionary Methods (GEM’12), number 141 in CSTN, pages 77–83, Las Vegas, USA, 16-19 July 2012. CSREA.