OmpSs+OpenCL Version - Particle-in-cell algorithms for plasma simulations on heterogeneous arch

As noted above, some work is needed before OmpSs can be used in the code. This process, called taskification, consists of reorganizing parts of the code to introduce tasks.

The easiest way to do this is to split each parallelizable loop into two nested loops. The inner loop runs a block of iterations from the original loop, and the outer loop performs one iteration per block. The next step is to pack the inner loop into a function that is declared as a task. Each time the outer loop calls this function, a task is generated and enqueued, and it runs once its dependencies are satisfied.
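The loop-splitting scheme described above can be sketched as follows. This is a minimal Python illustration of the blocking logic only, not OmpSs code: `push_block`, `run_blocked` and the thread pool that stands in for the OmpSs runtime are hypothetical names chosen for this sketch, and the particle update is a placeholder. In an OmpSs version, the inner function would instead be declared as a task and the runtime would decide when and where it runs.

```python
from concurrent.futures import ThreadPoolExecutor


def push_block(positions, velocities, dt, start, end):
    """Inner loop: advance one contiguous block of particles (one task)."""
    for i in range(start, end):
        positions[i] += velocities[i] * dt


def run_blocked(positions, velocities, dt, block_size, workers=4):
    """Outer loop: one iteration, and one submitted task, per block."""
    n = len(positions)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(push_block, positions, velocities, dt,
                        start, min(start + block_size, n))
            for start in range(0, n, block_size)
        ]
        for future in futures:
            future.result()  # wait for the task and re-raise any error
    return len(futures)


positions = [0.0] * 10
velocities = [float(i) for i in range(10)]
num_tasks = run_blocked(positions, velocities, dt=0.5, block_size=4)
print(num_tasks)   # 3 blocks: [0, 4), [4, 8), [8, 10)
print(positions)
```

Each block touches a disjoint index range, so the tasks in this sketch have no dependencies on each other; the last block is simply shorter when the loop length is not a multiple of the block size.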

8.7.1 Push Phase

To test how OmpSs distributes the push phase among the compute units, we ran a simulation of 125,000 particles on a grid of 32 × 32 × 16 cells. Each task was in charge of 128 particles, so the total number of tasks in this test was 977. The number of particles per task is a trade-off: with too few particles, too much time is spent managing the tasks, while with too many particles there are not enough tasks to distribute the work and balance the load among compute devices.
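The task count quoted above follows from a ceiling division of the number of particles by the block size. The short check below uses only the two figures given in the text; the variable names are chosen for illustration.

```python
import math

num_particles = 125_000
particles_per_task = 128

# 125,000 / 128 = 976.5625, so one extra, partially filled task is needed.
num_tasks = math.ceil(num_particles / particles_per_task)
last_task_size = num_particles - (num_tasks - 1) * particles_per_task

print(num_tasks)       # 977
print(last_task_size)  # 72
```

The first 976 tasks are full and the last one handles the remaining 72 particles.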

Figure 8.15 shows the results obtained when executing the push phase. The top graph shows the distribution of the OpenMP and OpenCL tasks across the CPUs and the GPU, respectively; the red (or darker) colour marks running tasks. The middle graph shows how the tasks evolve during the execution and how the OmpSs runtime distributes them; the green (or lighter) colour marks the first tasks created and the blue (or darker) colour the last ones. The bottom graph depicts the number of enqueued tasks.

Figure 8.15: Timelines showing the functioning of OmpSs in the push phase.

The figure confirms that the tasks of this simulation are distributed in a reasonably balanced way across the hardware resources (CPUs and GPU). These resources are busy most of the time. The application probably does not reach the maximum performance, but this is compensated by the gain in productivity that comes from the lower programming complexity. The short length of the tasks suggests that the number of particles per task could be increased to reduce the task-management overhead. Moreover, the steady pace at which tasks are dequeued suggests that their lengths are similar.

8.7.2 Pull Phase

We repeated the previous test to verify the functioning of OmpSs with the pull phase. This time the number of particles was increased to 1 million to raise the chances of conflicts when the charge density is deposited at the grid points. The number of tasks was limited to 25 to reduce the amount of memory needed, because each task had its own copy of the grid. The number of tasks is also a trade-off: too many tasks imply many grid copies that may exhaust the memory, while too few tasks are not enough to distribute the work and balance the load among compute devices.
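The idea of giving each task its own grid copy can be sketched as below. This is an illustration under stated assumptions rather than the thesis code: it uses a one-dimensional grid with nearest-grid-point deposition for brevity, runs the tasks one after another, and assumes that the private copies are summed into a single grid at the end. The names `deposit_block` and `deposit_all` are hypothetical.

```python
def deposit_block(cell_of_particle, charge, num_cells, start, end):
    """One task: deposit a block of particles onto its own private grid copy."""
    local_grid = [0.0] * num_cells
    for i in range(start, end):
        local_grid[cell_of_particle[i]] += charge
    return local_grid


def deposit_all(cell_of_particle, charge, num_cells, num_tasks):
    """Split the particles into num_tasks blocks and combine the grid copies."""
    n = len(cell_of_particle)
    block = max(1, -(-n // num_tasks))  # ceiling division: particles per task
    copies = [
        deposit_block(cell_of_particle, charge, num_cells,
                      start, min(start + block, n))
        for start in range(0, n, block)
    ]
    # Assumed reduction step: sum the private copies cell by cell.
    grid = [sum(cells) for cells in zip(*copies)]
    return grid, len(copies)


cells = [0, 1, 1, 2, 0, 1, 3, 3, 3, 1]
grid, num_copies = deposit_all(cells, charge=1.0, num_cells=4, num_tasks=3)
print(grid)        # [2.0, 4.0, 1.0, 3.0]
print(num_copies)  # 3
```

The memory cost of this scheme grows linearly with the number of tasks, since `num_copies` full grids exist at the same time, which is the trade-off described above.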

Figure 8.16 shows the results obtained when executing the pull kernel. The graphs are equivalent to those described for Figure 8.15. This time, the different lengths of the tasks and the variable pace at which tasks are dequeued reveal the presence of lock contention. This contention comes from memory conflicts, because different particles contribute to the charge density at the same grid point.

8.7.3 Evaluation

Figure 8.17 and Figure 8.18 show the results of the ARM OmpSs version. The performance is very stable, and a few tasks are enough to get the best performance for both kernels. In this experiment, we reused the same OpenCL kernels developed previously and added the OmpSs pragmas to specify the tasks and their data dependencies.

[Plot: Time (s), axis 6 to 16, versus # Tasks, axis 0 to 125; series: Push]

Figure 8.17: Execution time of the push kernel for different combinations of number of tasks.

[Plot: Time (s), axis 10 to 20, versus # Tasks, axis 0 to 125; series: Pull with 1, 2, 4, 8, 16, 32 and 64 p/wi]

Figure 8.18: Execution time of the pull kernel for different combinations of number of tasks and particles per work-item (p/wi).

Moreover, OmpSs offers a reduction in programming complexity, since the number of OmpSs directives included is far lower than the number of OpenCL calls in the hybrid version (Table 8.1).

          Performance - Best time (s)      Programmability - Productivity
Kernel    OpenMP+OpenCL    OmpSs           OpenCL calls    OmpSs directives
Push      6.02             6.92            161             12
Pull      10.31            10.83           167             18

Table 8.1: Comparison between the hybrid version (OpenMP+OpenCL) and the OmpSs version.
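To make the comparison in Table 8.1 concrete, the short calculation below derives the relative slowdown of the OmpSs version and the ratio of OpenCL calls to OmpSs directives. It uses only the figures from the table; the dictionary layout and key names are chosen for illustration.

```python
table = {
    "Push": {"hybrid_s": 6.02, "ompss_s": 6.92,
             "opencl_calls": 161, "ompss_directives": 12},
    "Pull": {"hybrid_s": 10.31, "ompss_s": 10.83,
             "opencl_calls": 167, "ompss_directives": 18},
}

results = {}
for kernel, row in table.items():
    # Extra time of the OmpSs version relative to the hybrid version.
    slowdown_pct = (row["ompss_s"] / row["hybrid_s"] - 1.0) * 100.0
    # How many OpenCL calls correspond to one OmpSs directive.
    call_ratio = row["opencl_calls"] / row["ompss_directives"]
    results[kernel] = (slowdown_pct, call_ratio)
    print(f"{kernel}: {slowdown_pct:.1f}% slower, "
          f"{call_ratio:.1f} OpenCL calls per OmpSs directive")
```

With these figures, the OmpSs version is roughly 15% slower for the push kernel and 5% slower for the pull kernel, while needing about 13 and 9 times fewer directives than OpenCL calls, respectively.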

The results shown in Figure 8.19 give an idea of the performance of the Mont-Blanc prototype with respect to MareNostrum III. The data used for this comparison is the cylindrical geometry with 32 × 32 × 16 cells and two different distributions of particles: a small test with 125,000 particles and a bigger one with nearly 1 million particles. The figure shows the execution times of the OmpSs+OpenCL version on Mont-Blanc (using one node), which are, as expected, worse than those of the OpenMP version on MareNostrum III (using only 2 cores) because of the large gap in computing power between the two platforms. But the figure also shows that the increase in time from the small test to the big one is proportional on both platforms. Therefore, we can say that the new version for the prototype is working reasonably well.

[Plot for Figure 8.19: two panels, MareNostrum III (Time (ms), axis 0 to 1400) and Mont-Blanc prototype (axis 0 to 12000), each showing the push phase and the pull phase for the small and big tests]

In document Particle-in-cell algorithms for plasma simulations on heterogeneous architectures (Page 153-158)