CUDA - Hardware and next generation codes

4.4 Hardware and next generation codes

4.4.2 CUDA

GPUs are normally associated with rendering textures and lighting effects, and they have been designed specifically to excel at those tasks, so it takes a little thought to appreciate how they relate to N-body simulations. To begin with, consider a texture file. A texture is a large array of pixels that is read quickly from memory and then manipulated in some geometric space. A large number of pixels must therefore be held and processed simultaneously, but each pixel requires little computation beyond simple geometric transformations.

Now, a CPU is designed to handle a single request extremely quickly. This makes it powerful for operations such as handling user input or controlling an internet connection, but leaves it unable to cope with a sheer volume of simultaneous operations. A GPU, however, contains a huge number of processors which work in parallel. Each processor is considerably slower than a CPU, but when the texture is broken down into smaller pieces and divided among the processor units of the GPU, the increased parallelism means the operation is carried out considerably faster.

While the actual mechanics of rendering images are a bit more involved than that simple example, the point still stands that GPUs are designed to execute a simple command many times over by using a high-bandwidth connection to a large number of individual processors. This can be applied to solving the N-body problem by, in the simplest instance, tasking each processor with evaluating the force on a single particle. Due to the large number of processors involved, the kick and drift steps can be completed in a fraction of the time it would take even a powerful multi-core CPU.
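The one-particle-per-processor idea can be sketched in plain Python, used here only as a serial stand-in for the per-thread logic rather than as CUDA code. The function names, the code units with G = 1, the softening value, and the simple kick-then-drift update are all assumptions made for illustration and are not taken from any particular N-body code.

```python
import math

G = 1.0        # gravitational constant in code units (assumed)
EPS2 = 1.0e-6  # squared softening length (assumed); keeps r2 strictly positive


def acceleration_on(i, pos, mass):
    """Work one GPU thread would do: sum the pull of every particle on particle i."""
    xi, yi, zi = pos[i]
    ax = ay = az = 0.0
    for (xj, yj, zj), mj in zip(pos, mass):
        dx, dy, dz = xj - xi, yj - yi, zj - zi
        r2 = dx * dx + dy * dy + dz * dz + EPS2
        w = G * mj / (r2 * math.sqrt(r2))  # softened G * m_j / r^3
        ax += w * dx
        ay += w * dy
        az += w * dz
    return (ax, ay, az)


def kick_drift_step(pos, vel, mass, dt):
    """Advance all particles by one kick (velocities) followed by one drift (positions)."""
    n = len(pos)
    # On a GPU these n evaluations would run concurrently, one per thread.
    acc = [acceleration_on(i, pos, mass) for i in range(n)]
    # Kick: update each velocity with the acceleration just computed.
    new_vel = [tuple(v + a * dt for v, a in zip(vel[i], acc[i])) for i in range(n)]
    # Drift: move each particle with its updated velocity.
    new_pos = [tuple(x + v * dt for x, v in zip(pos[i], new_vel[i])) for i in range(n)]
    return new_pos, new_vel
```

Each call to `acceleration_on` reads the shared particle data but writes only its own result, which is why the work divides so naturally among independent processors.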

Using the hardware in this way has been made much easier by the development of specialised software libraries that allow a developer to write code directly for the hardware. CUDA provides libraries that extend popular languages such as C and FORTRAN, which has accelerated the development of scientific codes.

I do not want to get too deeply into technical specifications, but I will note a few key figures to emphasise the scale of the improvement. A typical consumer-level CPU will have around eight cores, and can be connected to other CPUs via an interface such as MPI. A consumer-level GPU, on the other hand, will have hundreds of cores that, in the case of a high-end unit, can reach teraflop processing rates with total chip bandwidths of several TB/s. Overall, a networked series of GPUs can outperform a similar investment in CPU hardware by over an order of magnitude.

The CUDA architecture is not without its drawbacks, however. While NVIDIA have put admirable resources into building GPUs specifically for simulation work, these still use the same underlying architecture as any other CUDA-enabled device which, ultimately, is designed for rendering images. This has several implications. For example, when jobs are divided between processor threads, the threads are grouped into blocks, and threads in separate blocks cannot communicate with each other directly. There are pools of shared memory, but interaction between threads is limited, and attempting to force a level of communication between them comes at a significant cost to speed. For a visual representation of this see Fig. 4.1.

Figure 4.1: A schematic representation of the division of memory between streaming multi-
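One common consequence of blocks being unable to talk to one another is that global quantities, such as a total energy, are accumulated in two passes. Each block first reduces its own slice of the data using its shared memory, and the per-block results are then combined in a separate step. The sketch below imitates that pattern serially in Python; the function names and the block size are hypothetical, and the list slice merely stands in for a block's shared memory.

```python
def block_partial_sums(values, block_size):
    """First pass: each block sums only the values it owns."""
    partial = []
    for start in range(0, len(values), block_size):
        shared = values[start:start + block_size]  # stands in for one block's shared memory
        partial.append(sum(shared))                # reduction confined to a single block
    return partial


def total_from_blocks(values, block_size):
    """Second pass: combine the per-block results, e.g. on the host."""
    return sum(block_partial_sums(values, block_size))
```

For example, ten values split into blocks of four give three partial sums, and the final answer only appears once those three numbers have been gathered in one place.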

Additionally, the processors are not designed for user input or for conditional evaluation, and under such circumstances will exhibit undesirable behaviour such as evaluating all possible outcomes of a conditional statement. It is generally recommended that a host CPU carries out all operations that require conditional evaluation or user input before packing the data off to the GPU for processing, in a way rather reminiscent of GRAPE. Note, however, that the two devices do not share a memory space, so a pointer into host memory cannot simply be passed to the GPU; the data must be copied across explicitly. Also problematic is the fact that every processor in a block must run the same kernel.
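In a pairwise force loop the obvious conditional is the test that skips a particle's interaction with itself. A widely used way to avoid it is gravitational softening: with a small softening term in the denominator the self-interaction is finite and exactly zero, so every thread can execute the same instructions. The following one-dimensional Python sketch compares the two forms; the function names and the softening value are assumptions chosen for illustration.

```python
import math

EPS2 = 1.0e-6  # squared softening length (assumed value)


def accel_branching(i, x, mass):
    """Pairwise sum with an explicit test that skips the self-interaction."""
    total = 0.0
    for j in range(len(x)):
        if j == i:  # the conditional a GPU thread would rather avoid
            continue
        dx = x[j] - x[i]
        r2 = dx * dx + EPS2
        total += mass[j] * dx / (r2 * math.sqrt(r2))
    return total


def accel_branch_free(i, x, mass):
    """Same sum with no test: for j == i, dx is 0 and the term vanishes."""
    total = 0.0
    for j in range(len(x)):
        dx = x[j] - x[i]
        r2 = dx * dx + EPS2  # softening keeps r2 > 0 even when j == i
        total += mass[j] * dx / (r2 * math.sqrt(r2))
    return total
```

The two functions return identical values, but the second contains no branch inside the loop body, which suits hardware that runs the same instruction stream across many threads.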

Overall, the advantages of GPU processing far outweigh the drawbacks if the code is designed properly. In only the last few years a large variety of codes have been created, such as NBSymple (Capuzzo-Dolcetta et al., 2011), OCTGRAV (Bédorf & Portegies Zwart, 2012), and GENGA (Grimm & Stadel, 2014), not to mention those designed for other libraries. While this is an area of significant personal interest, this technology is not used in the present work, and I will now move on to other topics. All the above information is available in much greater technical detail in the CUDA section of NVIDIA's developer portal, listed as online resource [5]. A more detailed look at the astrophysical implications of the technology can be found in the excellent review of Bédorf et al. (2012).

In document On gravity : a study of analytical and computational approaches to problem solving in collisionless systems (Page 111-114)