
Here we study the Mali GPU contained in the Samsung Exynos 5 Dual (described in Section 4.3.2), which is used in the Arndale prototype of the Mont-Blanc project. It has two special features that affect conventional OpenCL programming in our case:

• Firstly, the Mali GPU has a unified memory system, unlike the usual physically separate memory components. Local and private memory are physically located in global memory, so moving data from global memory to local or private memory typically does not improve performance.

• Secondly, all GPU threads have individual program counters. This means that all the threads are independent and can diverge without any performance impact. As a result, branch divergence is not a major issue.

To exploit the accelerator (Mali GPU) of the prototype, the routines that run on it have to be written in OpenCL. This involves several new tasks: translating the code from FORTRAN to C, distributing the work among work-items, and creating and initializing the OpenCL components (context, kernels, command queues and memory objects).

From the previous chapter, we know which sections of the code are the most compute-intensive: pushing the particles (push) and depositing their charge (pull).

Our first aim was to introduce OpenCL into these routines with minimal changes relative to the previous version. This strategy was intended to increase productivity and decrease maintenance cost, because scientists regularly modify and upgrade the code, e.g. to include more detail in the simulations. This is the case for EUTERPE, which requires prompt and efficient responses to keep the code up to date.

8.4.1 Push Phase

In the push phase, planning the work distribution was simple: one work-item was assigned to each particle (Fig. 8.8). Since each work-item works on a different particle, all work-items write to different memory locations (as illustrated in the figure inside Algorithm 7.1, with threads replaced by work-items). Therefore, adapting the routine amounted to a straightforward translation.
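The one-work-item-per-particle mapping can be sketched in plain C as a serial stand-in for an OpenCL launch. This is only an illustration under stated assumptions, not the EUTERPE kernel: the `particle_t` layout, the trivial position update, and the names `push_work_item` and `launch_push` are all hypothetical. The two nested loops play the role of the work-group index and the local index that the OpenCL runtime would provide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical particle layout: one position and one velocity component. */
typedef struct {
    double x;
    double v;
} particle_t;

/* Body of the (hypothetical) push kernel for a single work-item.
 * gid plays the role of the global work-item index. */
static void push_work_item(particle_t *p, size_t gid, size_t n, double dt)
{
    if (gid >= n)
        return;                 /* padding work-items in the last group do nothing */
    p[gid].x += p[gid].v * dt;  /* each work-item writes only to its own particle */
}

/* Serial stand-in for the launch: iterates over work-groups and over the
 * work-items inside each group. Returns the number of work-groups used. */
static size_t launch_push(particle_t *p, size_t n, size_t wg_size, double dt)
{
    assert(wg_size > 0);
    size_t n_groups = (n + wg_size - 1) / wg_size;  /* round up */
    for (size_t wg = 0; wg < n_groups; ++wg)
        for (size_t lid = 0; lid < wg_size; ++lid)
            push_work_item(p, wg * wg_size + lid, n, dt);
    return n_groups;
}
```

Because no two work-items share a particle, the loop iterations are independent and need no synchronization, which is why the translation to a kernel is direct. The bounds check only matters when the number of particles is not a multiple of the work-group size.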

Figure 8.8: Representation of the push kernel in OpenCL, where wi_n are the work-items (in charge of particles) and wg_n represent the work-groups and their common data.

To observe the behaviour of the push kernel, this time the test ran on the Arndale prototype. The simulation moved 1 million particles in a grid with a resolution of n_s × n_θ × n_φ = 32 × 32 × 16. The work-group size was set to 32 work-items (i.e. each work-group was in charge of 32 particles). Note that the data set was smaller than in previous tests because of the smaller amount of memory available on this prototype.

Figure 8.9 provides a timeline showing the OpenCL calls on the host and on the compute device (all compute devices are grouped in a single line). The figure makes evident the significant additional work needed to configure the OpenCL environment (left of the figure), although fortunately this cost is paid only once at the beginning. The data movement between the host and the compute devices is also included in the configuration time, so this data movement must be handled carefully to avoid cancelling out the improvement achieved with the GPU.

8.4.2 Pull Phase

The pull phase was more challenging to implement, since different particles can contribute to the charge density at the same grid point (Fig. 8.10). As we maintained the same work distribution as in the push phase, several work-items could update the same memory location (similar to what is illustrated in the figure inside Algorithm 7.2, with threads replaced by work-items). The two solutions previously used to avoid these memory conflicts in the OpenMP implementation (Section 7.4.2) did not work properly due to the high number of threads on a GPU: atomic operations became inefficient because lock contention serialized the execution, and grid copies required too much memory.

Figure 8.10: Representation of the two OpenCL kernels in the pull phase, where wi_n are the work-items (in charge of particles) and wg_n represent the work-groups and their common data.

The strategy to solve this issue was inspired by the way in which OpenCL distributes data processing. A grid copy is created per work-group, so only the work-items in the same work-group accumulate contributions on the same grid copy. As a consequence, lock contention is much lower, and the reduction in memory conflicts makes the use of atomic operations feasible.

Nevertheless, a new step is required to reduce all these copies into the final grid. This final global reduction is cheap, and a single kernel is enough to perform it, since the number of grid copies is limited by the number of work-groups. In this kernel, each work-item collects the result of one grid point across all the copies of the grid.
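The two-step scheme described above (per-work-group grid copies followed by a global reduction) can be sketched serially in plain C. This is a simplified illustration, not the actual kernels: it assumes a flattened one-dimensional grid and that each particle deposits its whole charge into a single precomputed cell, and the names `pull_deposit_sketch` and `global_reduce_sketch` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Step 1 (stand-in for the pull kernel): every work-group owns one private
 * copy of the grid, and its work-items add their particle's charge to it.
 * copies must hold n_groups * n_cells zero-initialized values. */
static void pull_deposit_sketch(const size_t *cell, const double *charge,
                                size_t n, size_t wg_size,
                                double *copies, size_t n_cells)
{
    assert(wg_size > 0);
    size_t n_groups = (n + wg_size - 1) / wg_size;
    for (size_t wg = 0; wg < n_groups; ++wg) {
        double *grid_copy = copies + wg * n_cells;
        for (size_t lid = 0; lid < wg_size; ++lid) {
            size_t gid = wg * wg_size + lid;
            if (gid >= n)
                break;
            /* On a device this update would have to be atomic, but only the
             * work-items of this one work-group compete for it. */
            grid_copy[cell[gid]] += charge[gid];
        }
    }
}

/* Step 2 (stand-in for the reduction kernel): one work-item per grid point
 * sums that point over all the per-work-group copies. */
static void global_reduce_sketch(const double *copies, size_t n_groups,
                                 size_t n_cells, double *grid)
{
    for (size_t c = 0; c < n_cells; ++c) {
        double sum = 0.0;
        for (size_t wg = 0; wg < n_groups; ++wg)
            sum += copies[wg * n_cells + c];
        grid[c] = sum;
    }
}
```

In the reduction, each output point is written by exactly one work-item, so that step needs no atomic operations at all. The price of the scheme is the extra memory for one grid copy per work-group.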

Finally, one last issue arose during the development of this OpenCL version. Some double-precision arithmetic operations were not properly defined in the OpenCL driver for the Mali GPU, which delayed the work until ARM solved the problem.

To observe the behaviour of the pull kernel, we again ran a test on the Arndale prototype. The data set for the simulation was the same as for the push routine.

Figure 8.11 provides a timeline of the OpenCL calls on the host and on the compute devices. It also shows the calls to the pull and global_reduct kernels. In addition to the observations made for Figure 8.9, we note that the global_reduct kernel computes the global reduction of the charge density, and its execution time is much shorter than that of the pull kernel.

Figure 8.11: Timeline showing OpenCL calls and the pull and global_reduction kernels running.