OpenCL Platform Performance

6.4 OpenCL Platform

6.5.1 OpenCL Platform Performance

The first test checks that the OpenCL platform behaves correctly and evaluates the impact of the implemented optimizations. For that purpose, we executed the three applications in six scenarios: ACPU, AGPU, R1CPU, R4CPU, RGPU, and RGPUO. In ACPU, the application runs an Android-native, sequential version on the processor of the mobile device. The AGPU version replaces the CPU code of the functions performing the computation with the OpenCL commands needed to offload an equivalent high-performing kernel onto the GPU and to transfer all the involved data values to and from the device memory. To simulate the performance obtained by an average programmer, the application submits the commands to the OpenCL platform through an in-order queue. In the remaining four scenarios, the developer codes the application following the COMPSs programming model, and the end user configures the runtime to execute on a specific computing platform. In R1CPU and R4CPU, the runtime uses only the CPU platform, exploiting one and four cores respectively. In RGPU and RGPUO, the runtime offloads all the tasks to the GPU through the OpenCL platform. The former disables all the optimizations, obtaining a behavior similar to the AGPU scenario, while the latter enables all of them (reusing memory buffers and overlapping transfers with other kernel executions).
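The six scenarios can be summarized as a small configuration table. The sketch below is illustrative only: the `Scenario` class and its field names are hypothetical and are not part of COMPSs or of the evaluated applications; the values simply restate the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical record describing one evaluated configuration."""
    name: str
    uses_runtime: bool    # True when the application is coded with COMPSs
    device: str           # "CPU" or "GPU": where the computation runs
    cpu_cores: int        # CPU cores used for computation (0 when offloaded)
    optimizations: bool   # buffer reuse and transfer/kernel overlap enabled


SCENARIOS = [
    Scenario("ACPU", False, "CPU", 1, False),
    Scenario("AGPU", False, "GPU", 0, False),
    Scenario("R1CPU", True, "CPU", 1, False),
    Scenario("R4CPU", True, "CPU", 4, False),
    Scenario("RGPU", True, "GPU", 0, False),
    Scenario("RGPUO", True, "GPU", 0, True),
]

# The two scenarios whose comparison isolates the OpenCL platform optimizations.
gpu_runtime = [s.name for s in SCENARIOS if s.uses_runtime and s.device == "GPU"]
print(gpu_runtime)
```

Comparing AGPU with RGPU exposes the cost of the runtime itself, while comparing RGPU with RGPUO exposes the effect of the optimizations.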

For each scenario, the application measures the execution time and the energy consumption. Within the execution time, it distinguishes the time spent executing tasks (Computation) from the overhead surrounding the computation (Overhead). This experiment isolates the part of this overhead corresponding to transfers between the main and device memories (Ov. Mem.) to evaluate the benefits of the optimizations implemented in the GPU Backend. Regarding the energy consumption, it only separates the energy used for computing the methods (Application) from the energy consumed by the whole system, including the screen (System).

6.5. EVALUATION

6.5.1.1 Digit Recognition

The charts in Figure 6.3 depict the results obtained from processing 512 images with the Digit Recognition (DR) application. The GPU clearly improves both time and energy, with or without COMPSs. Comparing the ACPU and AGPU scenarios, the execution time shrinks from 18,516 ms to 4,358 ms (23.53%), 1,531 ms of which correspond to memory transfers, and the energy consumption drops from 36.48 J to 8.68 J (27.8%). The R1CPU results are those already presented and discussed in Section 5.3.2 for the CPU Platform with Proxied Execution. The R4CPU scenario is dismissed since the application has no task parallelism and never uses more than one CPU core at a time. Using COMPSs incurs an overhead of 31 ms and 0.02 J due to the interprocess communication needed to exchange the commands. This overhead appears in both scenarios where the runtime uses the GPU, since the exchanged commands are the same. Apart from this overhead, with the platform optimizations disabled the application performs as in the AGPU scenario, including the 1,531 ms overhead due to the transfers of values between the host and device memories. With the optimizations enabled, the runtime reuses the values generated in the device memory by one task as the input of the succeeding one, which reduces the overhead of data copies from and to the device memory to 5 ms. The optimizations implemented for the management of the device memory thus speed up the execution of the application on GPUs even when the application has no task-level parallelism. Despite the improvement in execution time, these optimizations have a low impact on the energy consumption (0.56 J), since most of it is caused by the actual computation of the kernels.
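The effect of buffer reuse on a chain of dependent tasks can be illustrated with a toy transfer count. This is a minimal sketch, not the runtime's actual bookkeeping: the function `count_transfers`, the three-task chain, and the buffer names are all hypothetical, and the model assumes that without reuse every task uploads its inputs and downloads its output.

```python
def count_transfers(tasks, final_outputs, reuse):
    """Count host/device copies for a list of (inputs, output) tasks.

    Toy model (assumption): without reuse, every input is uploaded and every
    output is downloaded; with reuse, values already in device memory are not
    copied again and only the final outputs return to the host.
    """
    resident = set()      # buffers currently valid in device memory
    transfers = 0
    for inputs, output in tasks:
        for name in inputs:
            if reuse and name in resident:
                continue          # already on the device: no copy needed
            transfers += 1        # host -> device copy
            resident.add(name)
        resident.add(output)      # the kernel writes its result on the device
        if not reuse:
            transfers += 1        # device -> host copy after every task
    if reuse:
        transfers += len(final_outputs)   # only final results are copied back
    return transfers


# Hypothetical chain: each task consumes the output of the previous one.
chain = [(["img"], "a"), (["a"], "b"), (["b"], "c")]
print(count_transfers(chain, ["c"], reuse=False))
print(count_transfers(chain, ["c"], reuse=True))
```

In this toy chain the copy count stays constant with reuse regardless of the chain length, which is consistent in spirit with the drop from 1,531 ms to 5 ms reported above, although the sketch makes no attempt to reproduce those timings.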

[Bar charts. Left: Time (ms), split into Tasks, Ov. Mem., and Overhead. Right: Energy (J), split into Tasks and System. Scenarios: ACPU, AGPU, R1CPU, R4CPU, RGPU, RGPUO.]

Figure 6.3: Execution time (left) and energy consumption (right) obtained when running DR with 512 images using both devices, the CPU and the GPU.

6.5.1.2 Bézier Surface

Figure 6.4 shows the measurements obtained when calculating a surface of 1024 x 1024 points using 256 x 256 blocks with the Bézier Surface (BS) application. Tasks in BS have no dependencies; thus, the runtime can exploit the parallelism and use the four cores of the CPU at a time, speeding up the execution of the kernels up to 2.72x (2,930 ms) at the cost of increasing the energy consumption up to 19.64 J (124.9%). As with DR, the runtime incurs a small overhead (30 ms and 0.02 J), observed when comparing ACPU to R1CPU and AGPU to RGPU.

[Bar charts. Left: Time (ms), split into Tasks, Ov. Mem., and Overhead. Right: Energy (J), split into Tasks and System. Scenarios: ACPU, AGPU, R1CPU, R4CPU, RGPU, RGPUO.]

Figure 6.4: Execution time (left) and energy consumption (right) obtained when running BS with blocks of 256x256 using both devices, the CPU and the GPU.

Processing the tasks on the GPU is 2.99 times faster than on a single CPU core, as shown by the Computation time of the AGPU and ACPU scenarios (2,672 ms vs. 7,984 ms). However, the memory-transfer overhead (337 ms) slows down the application: the overall execution time is only 2.65x lower (3,009 ms), slightly higher than the one for the R4CPU scenario. Since BS tasks have no dependencies, they never read values generated by other tasks; therefore, the runtime cannot reuse values already transferred for preceding tasks. However, the computation of one task can overlap with the transfers of the output and input values of the preceding and succeeding ones. This optimization allows the runtime to reduce the time spent on memory transfers from 337 ms to 3 ms in the RGPUO scenario, where BS lasts 2,705 ms and consumes 7.68 J.
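Why overlapping hides almost all of the transfer time can be seen with a simple pipeline model. The sketch below is an assumption-laden illustration, not the runtime's scheduler: it assumes 16 independent tasks (one per 256 x 256 block), a kernel time of 167 ms each (the reported 2,672 ms split evenly), hypothetical 10 ms uploads and downloads, and hardware able to run an upload, a kernel, and a download at the same time. It does not attempt to reproduce the measured 3 ms.

```python
def serialized_time(tasks):
    """In-order queue: upload, kernel and download of every task run back to back."""
    return sum(up + kernel + down for up, kernel, down in tasks)


def overlapped_time(tasks):
    """Three-stage pipeline: uploads, kernels and downloads each proceed in task
    order, but different stages of different tasks may run at the same time."""
    up_done = kernel_done = down_done = 0
    for up, kernel, down in tasks:
        up_done += up                                     # uploads run one after another
        kernel_done = max(up_done, kernel_done) + kernel  # needs its input and a free GPU
        down_done = max(kernel_done, down_done) + down    # needs the result and a free copy stage
    return down_done


# Hypothetical timings in ms: 16 independent tasks, each (upload, kernel, download).
tasks = [(10, 167, 10)] * 16
print(serialized_time(tasks))   # every copy is exposed
print(overlapped_time(tasks))   # only the first upload and last download are exposed
```

Under these assumptions the exposed transfer time falls from 320 ms to 20 ms, because every copy except the first upload and the last download runs while a kernel is executing.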

6.5.1.3 Canny Edge Detection

As seen in Figure 6.5, the GPU processes the 30 frames in 420 ms, 11.95x faster than a CPU core; again, the data transfers worsen the application performance, adding a 324 ms overhead. Overall, the application takes 5,020 ms to run in the ACPU scenario and consumes 9.89 J, while in the AGPU case it needs 744 ms and 1.33 J. The runtime adds an overhead of 30 ms and 0.02 J, barely noticeable when comparing ACPU and AGPU to R1CPU and RGPU, respectively.

This application presents both task-level parallelism and dependencies among tasks; thus, both optimizations apply on the GPU. The output of some tasks is reused in device memory as the input of their successors, so the runtime reduces the number of transfers. In addition, the remaining transfers can overlap with the computation of other dependency-free tasks. Enabling these optimizations allows the runtime to reduce the 324 ms overhead caused by memory transfers to 1 ms. In the RGPUO scenario, the application lowers the execution time to 451 ms and its energy consumption to 1.22 J.
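The reported CED totals can be cross-checked with a few lines of arithmetic. The sketch assumes that the execution time decomposes additively into kernel time, memory transfers, and runtime overhead; the constant names are hypothetical, and the end-to-end speedups are derived here from the reported figures rather than quoted from the measurements.

```python
# Figures reported in the text for Canny Edge Detection (30 frames).
GPU_COMPUTE_MS = 420        # kernel execution on the GPU
TRANSFER_MS = 324           # host/device copies without optimizations
TRANSFER_OPT_MS = 1         # copies left exposed with optimizations enabled
RUNTIME_OVERHEAD_MS = 30    # interprocess communication added by the runtime
ACPU_TOTAL_MS = 5020        # sequential CPU execution

# Assumption: the totals are the plain sum of their components.
agpu_total = GPU_COMPUTE_MS + TRANSFER_MS
rgpuo_total = GPU_COMPUTE_MS + RUNTIME_OVERHEAD_MS + TRANSFER_OPT_MS

print(agpu_total)                              # 744, as reported for AGPU
print(rgpuo_total)                             # 451, as reported for RGPUO
print(round(ACPU_TOTAL_MS / agpu_total, 2))    # derived end-to-end speedup of AGPU
print(round(ACPU_TOTAL_MS / rgpuo_total, 2))   # derived end-to-end speedup of RGPUO
```

Both sums match the reported totals, and the derived speedups (about 6.75x for AGPU and 11.13x for RGPUO) show that hiding the transfers brings the end-to-end gain close to the 11.95x speedup of the kernels alone.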

[Bar charts. Left: Time (ms), split into Tasks, Ov. Mem., and Overhead. Right: Energy (J), split into Tasks and System. Scenarios: ACPU, AGPU, R1CPU, R4CPU, RGPU, RGPUO.]

Figure 6.5: Execution time (left) and energy consumption (right) obtained when running CED using both devices, the CPU and the GPU.

In document Programming models for mobile environments (Page 104-107)