

5.2.2. Experimental Performance Evaluations

Let us discuss the performance of the three methods explained in the previous sections using the following dynamic kernel functions: the IIR filter, the Jacobi method, and the LU decomposition.

Although all of these applications work recursively, each has its own characteristics.

The experimental environment is a PC with an Intel Core i7 930 processor at 2.80 GHz, 12 GB of DDR3 memory, and an NVIDIA Tesla C2050 with 3 GB of memory.

The OS is CentOS with Linux kernel 2.6.18. The driver version of the GPU is 3.0.

We measure the execution time of each application under each method, in both single and double precision. For the Copy host and Copy device methods, we measure the data transfer time consumed by the copy operation; for the Swap method, we measure the time used by the swap operation. The results are shown as bar graphs in which each bar has two parts: the gray part shows the time consumed by the copy or swap operation, and the black part shows the time used for the calculation performed in the kernel.

• IIR filter kernel

Figure 37. Performance of IIR filter kernel on CUDA.

Figure 38. Performance of IIR filter kernel on OpenCL.

Figure 39. Performance of Jacobi method on CUDA.

Figure 40. Performance of Jacobi method on OpenCL.

Figure 41. Algorithm of LU decomposition.

Figure 42. Performance of LU decomposition on CUDA.

Figure 43. Performance of LU decomposition on OpenCL.

(Performance figures: bar charts of execution time (sec), split into copy/swap time and calculation time, for single precision and double precision.)

The IIR (Infinite Impulse Response) filter is a well-known kernel applied in image processing to emphasize the edges of objects. We use the following equation for the filter with 16 coefficients:

yn=

Here, y appears on the right side of the equation. This is a typical recursive kernel, which exchanges the input data stream of y with the output data stream of y. This exchange is performed by the Copy host, Copy device, or Swap method. For parallelization, the calculation of each y_i is assigned to a stream processor (i.e., a thread in CUDA or a work item in OpenCL). The total number of input samples of x and y is equal to the number of iterations.
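The recursion described above can be sketched in plain Python. This is a minimal illustration, not the implementation measured here: the filter form y[n] = Σ a[i]·x[n-i] + Σ b[j]·y[n-1-j], the coefficient lists `a` and `b`, and all function names are assumptions made for the example, and the 16-coefficient equation of the text is not reproduced. Each pass of `iir_kernel` computes every output sample independently from the previous y stream, and the two y buffers exchange roles after each pass.

```python
def iir_sequential(x, a, b):
    """Reference filter: y[n] = sum_i a[i]*x[n-i] + sum_j b[j]*y[n-1-j]."""
    y = [0.0] * len(x)
    for n in range(len(x)):
        acc = sum(a[i] * x[n - i] for i in range(len(a)) if n - i >= 0)
        acc += sum(b[j] * y[n - 1 - j] for j in range(len(b)) if n - 1 - j >= 0)
        y[n] = acc
    return y


def iir_kernel(x, y_in, y_out, a, b):
    """One kernel launch: each y_out[n] depends only on x and y_in,
    so every n could be handled by its own thread or work item."""
    for n in range(len(x)):  # on a GPU this loop index is the thread index
        acc = sum(a[i] * x[n - i] for i in range(len(a)) if n - i >= 0)
        acc += sum(b[j] * y_in[n - 1 - j] for j in range(len(b)) if n - 1 - j >= 0)
        y_out[n] = acc


def iir_recursive_swap(x, a, b):
    """Run as many kernel iterations as there are samples,
    exchanging the input and output y streams after each one."""
    y_in = [0.0] * len(x)
    y_out = [0.0] * len(x)
    for _ in range(len(x)):
        iir_kernel(x, y_in, y_out, a, b)
        y_in, y_out = y_out, y_in  # swap: buffer roles change, no data is copied
    return y_in
```

In this sketch, after k iterations the first k samples of the y stream agree with the sequential filter, which is why the iteration count equals the number of samples.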

Figure 37 and Figure 38 show the execution times of the methods on CUDA and OpenCL, respectively. During the iterations of the kernel, the size of y does not change. Moreover, the calculation is not heavy. Therefore, the copy time accounts for a very large share of the total time in the Copy host method. The Copy device and Swap methods achieve almost the same time in both environments. Thus, this kernel function benefits from the elimination of host copies provided by the Swap or the Copy device method.

• Jacobi Method

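The Jacobi method is applied to a system of linear equations Ax = b whose coefficient matrix is sparse; the system is solved approximately by recursive iterations using the following equation:

$$x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j \neq i} a_{ij} x_j^{(k)} \right)$$

where a_ij are the elements of the coefficient matrix, b_i are the elements of the right-hand side, and x^(k)_i approximates x_i after the kth iteration.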

In the kernel function assigned to the thread or the work item, the input data stream for x holds the approximation after the kth iteration. The output data stream becomes the approximation of the (k + 1)th iteration. Therefore, the I/O buffers are exchanged between recursive iterations.
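The buffer exchange between Jacobi iterations can be sketched as follows. This is a host-side illustration in plain Python under assumed names (`jacobi_kernel`, `jacobi`), using the standard Jacobi update; it is not the CUDA or OpenCL kernel evaluated here. The inner loop over `i` stands in for the threads or work items.

```python
def jacobi_kernel(A, b, x_in, x_out):
    """One Jacobi sweep: x_out[i] is computed only from x_in,
    so each i can be processed by an independent thread or work item."""
    n = len(b)
    for i in range(n):
        s = sum(A[i][j] * x_in[j] for j in range(n) if j != i)
        x_out[i] = (b[i] - s) / A[i][i]


def jacobi(A, b, iterations):
    """Repeat the sweep, exchanging the I/O buffers after each iteration."""
    x_in = [0.0] * len(b)   # approximation after the kth iteration
    x_out = [0.0] * len(b)  # approximation of the (k + 1)th iteration
    for _ in range(iterations):
        jacobi_kernel(A, b, x_in, x_out)
        x_in, x_out = x_out, x_in  # the output becomes the next input
    return x_in
```

The sketch converges only for suitable matrices, for example strictly diagonally dominant ones; that condition is an assumption of the example and is not stated in the text.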

Figure 39 and Figure 40 show the execution times of the methods on CUDA and OpenCL, respectively. The number of iterations is set equal to the length of the x vector. In contrast to a kernel function such as the IIR filter, this kernel function requires only a small amount of I/O data for the recursive calculation compared with the amount of calculation. Therefore, the three methods achieve almost the same performance, with negligible copy/swap overhead.

• LU decomposition

Finally, the LU decomposition serves the same objective as the Jacobi method, namely solving linear equations. However, it is applied when the coefficient matrix is dense. Figure 41 shows the shape of the matrix after the kth decomposition step, which calculates the equations listed on the left side of the figure. L₂₂U₂₂ is generated as the coefficient matrix for the next iteration. This means that the matrix is the input data stream, and the kernel also generates a matrix as the output data stream for the next iteration. Thus, the output data stream is recursively used as the input in every iteration, while its size decreases.
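The shrinking trailing matrix can be illustrated with a textbook right-looking LU decomposition. The sketch below is a sequential, in-place Python version without pivoting; the function name `lu_decompose` and the assumption of non-zero pivots are ours, and it does not reproduce the stream-based kernel of this section. At step k the update of the trailing block corresponds to forming L₂₂U₂₂ = A₂₂ − L₂₁U₁₂, which is the input of the next step.

```python
def lu_decompose(A):
    """Right-looking LU without pivoting (assumes every pivot is non-zero).
    Returns (L, U) with unit diagonal in L such that L * U == A."""
    n = len(A)
    M = [row[:] for row in A]  # working copy: L below the diagonal, U on and above
    for k in range(n):
        pivot = M[k][k]
        for i in range(k + 1, n):
            M[i][k] /= pivot  # column k of L
        # Trailing update: the remaining block becomes the matrix
        # decomposed in the next iteration, and it shrinks every step.
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                M[i][j] -= M[i][k] * M[k][j]
    L = [[M[i][j] if j < i else (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    U = [[M[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return L, U
```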

Figure 42 and Figure 43 show the execution times of the considered methods on CUDA and OpenCL, respectively. The sizes of the I/O data streams are controlled by a constant argument that passes the iteration number. Even though the I/O data streams shrink every iteration, this kernel needs to use a k × k matrix at the kth iteration. Therefore, copy overhead is observed with the Copy host and Copy device methods. Here we confirmed again that the Swap method achieves the best performance.

All the applications targeted in this section achieve the best performance when the Swap method is applied. The Copy device method also shows drastically better performance than the Copy host method. However, it still incurs some overhead because it uses the GPU's memory bus during the data copy between the I/O buffers. Thus, we have confirmed that the swap operation should be implemented to avoid the overhead imposed when the algorithm includes recursive data transfer between the I/O buffers: it does not add overhead even when the transferred data size is small, as we observed in the case of the Jacobi method, and when the data size becomes larger, the Swap method clearly obtains the best performance.
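The difference between copying and swapping the I/O buffers can be made concrete with a small host-only model. This is a sketch under our own assumptions: Python lists stand in for device buffers, the names `run`, `exchange_copy`, `exchange_swap`, and `halve` are hypothetical, and the count of moved elements is only a proxy for the copy time measured in the figures.

```python
def exchange_copy(buf_in, buf_out):
    """Copy-style exchange: the output is copied element by element
    into the input buffer, so the cost grows with the buffer size."""
    buf_in[:] = buf_out
    return buf_in, buf_out, len(buf_out)


def exchange_swap(buf_in, buf_out):
    """Swap-style exchange: only the buffer roles are exchanged,
    so no element is moved regardless of the buffer size."""
    return buf_out, buf_in, 0


def run(kernel, data, iterations, exchange):
    """Apply `kernel` recursively, feeding each output back as the next input.
    Returns the final stream and the number of elements moved by the exchange."""
    buf_in = list(data)
    buf_out = [0.0] * len(data)
    moved = 0
    for _ in range(iterations):
        kernel(buf_in, buf_out)
        buf_in, buf_out, n = exchange(buf_in, buf_out)
        moved += n
    return buf_in, moved


def halve(src, dst):
    """Toy kernel used only for the illustration."""
    for i, v in enumerate(src):
        dst[i] = v * 0.5
```

Both exchanges yield the same final stream; only the amount of data moved per iteration differs, which is the overhead the Swap method avoids.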

In document STREAM-BASED PARALLEL COMPUTING METHODOLOGY AND DEVELOPMENT ENVIRONMENT FOR HIGH PERFORMANCE MANYCORE ACCELERATORS (Page 64-68)