Scalar Product - Linking Scheme code to data-parallel CUDA-C code

In this test case, we linked Scheme code to a CUDA-C kernel that computes scalar products in parallel on the GPU. The kernel was taken from NVIDIA’s CUDA SDK [9]. It takes a pair of input vectors of floating-point numbers and a floating-point output vector. Both input vectors are divided into the same number of segments; the kernel computes the scalar product of each pair of corresponding segments and stores the results in the output vector. The length of the output vector therefore equals the number of segments. In this example, each input vector was divided into 2560 segments, so the kernel computed 2560 scalar products.
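The per-segment computation can be illustrated on the CPU. The following is a minimal Python sketch of what the kernel computes, not the SDK kernel itself; the function name `segmented_scalar_product` and the tiny example vectors are assumptions chosen for illustration.

```python
def segmented_scalar_product(a, b, num_segments):
    """Return one scalar product per segment pair of a and b (CPU reference)."""
    if len(a) != len(b):
        raise ValueError("input vectors must have the same length")
    if num_segments <= 0 or len(a) % num_segments != 0:
        raise ValueError("vector length must be a multiple of num_segments")
    seg_len = len(a) // num_segments
    out = []
    for s in range(num_segments):
        start = s * seg_len
        # Scalar product of segment s of a with segment s of b.
        out.append(sum(x * y for x, y in zip(a[start:start + seg_len],
                                             b[start:start + seg_len])))
    return out


# Two vectors of 6 elements split into 3 segments of 2 elements each.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
b = [1.0, 1.0, 2.0, 2.0, 0.5, 0.5]
result = segmented_scalar_product(a, b, 3)
print(result)  # one value per segment
```

With 2560 segments, as in the experiment, the returned list has 2560 entries regardless of the vector length.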

We measured execution times against different numbers of participating thread blocks: for a given input vector size, we varied the grid size in the execution configuration, so the number of threads increased while the vector size stayed fixed. We also increased the vector size linearly and plotted each size in a separate chart to observe the scaling behavior.

The charts in Figures 5.15–5.18 compare the performance of the parallel scalar product implemented in Scheme and CUDA-C for four different vector sizes. In these charts, the X-axis represents the number of thread blocks per grid and the Y-axis represents the execution time in milliseconds. In the execution configuration, each one-dimensional thread block had 256 threads. The number of participating threads for a particular execution time can therefore be found by multiplying the grid size by the fixed block size.
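As a small worked example of that multiplication, the sketch below computes the thread count for a few grid sizes. The helper name `participating_threads` is hypothetical, and the grid sizes 8 to 64 are only the ones named in the discussion of the charts, not necessarily the full set used in the experiments.

```python
THREADS_PER_BLOCK = 256  # fixed one-dimensional block size from the text


def participating_threads(grid_size, block_size=THREADS_PER_BLOCK):
    """Total threads launched for a one-dimensional grid of blocks."""
    return grid_size * block_size


# Grid sizes 8 to 64 are the ones named in the discussion of the charts.
thread_counts = {g: participating_threads(g) for g in (8, 16, 32, 64)}
for grid_size, threads in thread_counts.items():
    print(grid_size, threads)
```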

Figure 5.15: Performance comparison of parallel scalar product

First, the kernel was called with a vector size of 10M for both input vectors. In Figure 5.15, the execution times for Scheme and CUDA-C followed the inverse trends y = 21.912 + 21.744/x and y = 22.140 + 13.208/x, respectively. The R Square values for Scheme and CUDA-C were 0.964 and 0.969, respectively, which means that both lines fit the data well. Both trend lines show that execution time decreased as the number of thread blocks increased. Initially, Scheme took longer to execute than CUDA-C. As the number of thread blocks increased along the X-axis, the distance between the two trend lines, and hence the difference between the two execution times, decreased. The Scheme trend line crossed the CUDA-C trend line after 32 thread blocks, and Scheme took less time than CUDA-C for the remaining values on the X-axis. For this experiment, the difference between the two trend lines was ∆y = −0.228 + 8.536/x, which includes the generated Scheme shim. We observed an initial 1–4% overhead in Scheme for grid sizes 8 to 16, but no overhead from grid size 32 on. Beyond that point, Scheme and CUDA-C both showed consistent execution times.
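The two trend lines for Figure 5.15 can be evaluated directly to see how the relative overhead shrinks with the grid size. The sketch below uses only the fitted coefficients reported above, so it reproduces the fitted curves rather than the measured data points; the names `inverse_fit` and `relative_overhead` are hypothetical.

```python
def inverse_fit(a, b):
    """Return the trend line y(x) = a + b / x as a function."""
    return lambda x: a + b / x


# Coefficients of the two trend lines reported for Figure 5.15 (10M elements).
scheme = inverse_fit(21.912, 21.744)
cuda_c = inverse_fit(22.140, 13.208)


def relative_overhead(x):
    """Scheme overhead relative to CUDA-C, in percent, according to the fits."""
    return 100.0 * (scheme(x) - cuda_c(x)) / cuda_c(x)


for grid_size in (8, 16, 32, 64):
    print(grid_size, round(scheme(grid_size), 3), round(cuda_c(grid_size), 3),
          round(relative_overhead(grid_size), 2))
```

According to these fits, the overhead is a few percent at grid size 8, close to zero at grid size 32, and negative at grid size 64, where the Scheme line lies below the CUDA-C line.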

Figure 5.16: Performance comparison of parallel scalar product

In Figure 5.16, we ran the same programs for both implementations but doubled the vector size to 20M. Both implementations again showed inverse trends, with y = 42.435 + 45.134/x in Scheme and y = 42.825 + 25.752/x in CUDA-C. The R Square values for Scheme and CUDA-C were 0.970 and 0.954, respectively, which indicates a good fit of both lines to the data. The execution times for both implementations were roughly double those in Figure 5.15 because the vector size doubled. This chart shows the same behavior as the previous one. Initially, the Scheme implementation took longer than CUDA-C. As the grid size increased, the difference between the two execution times decreased until the Scheme trend line crossed the CUDA-C trend line after grid size 32. Scheme then took less time than CUDA-C for the remaining values on the X-axis. For this experiment, we found that ∆y = −0.39 + 19.382/x, which includes the Scheme shim. We also found a 1–5% overhead in Scheme compared to CUDA-C for grid sizes 8 to 16. At grid size 32, the overhead became almost 0%, and the execution times of both implementations then remained nearly constant across the remaining grid sizes.

In Figure 5.17, the kernel was called with a vector size of 30M for both input vectors, three times the vector size used in Figure 5.15. Here, too, Scheme and CUDA-C showed inverse trends, y = 63.198 + 65.123/x and y = 63.123 + 38.832/x, respectively. The R Square values for Scheme and CUDA-C were 0.969 and 0.953, respectively, a relatively good fit of both lines to the data.

Figure 5.17: Performance comparison of parallel scalar product

Here, too, Scheme initially had longer execution times than CUDA-C, and the difference between them gradually decreased. After grid size 32, the two trend lines kept a consistent distance across the values on the X-axis. In this experiment, the Scheme trend line never met the CUDA-C trend line. We found that ∆y = 0.074 + 26.58/x, which contains only the Scheme shim. We also found an initial 1–5% overhead in Scheme for grid sizes 8 to 32 and almost 0% overhead after grid size 64. The point of 0% overhead thus shifted to the right on the X-axis compared to Figures 5.15 and 5.16, where we found 0% overhead at grid size 32.

In Figure 5.18, the kernel was called with a vector size of 40M for both input vectors, four times the vector size used in Figure 5.15. Both Scheme and CUDA-C showed inverse trends, y = 83.789 + 87.781/x and y = 83.601 + 51.662/x, respectively. The R Square values for Scheme and CUDA-C were 0.974 and 0.956, respectively, a good fit of both lines to the data. Again, Scheme initially had longer execution times than CUDA-C, and the difference between the two gradually decreased. After grid size 32, the two trend lines kept a consistent distance across the X-axis values. The Scheme trend line never met the CUDA-C trend line, and we found that ∆y = 0.18 + 36.119/x. We also found a 1–6% overhead in the Scheme implementation for grid sizes 8 to 32. After grid size 64, we observed almost 0% overhead in Scheme, as seen in Figure 5.18.

In this example, we observed a 0–5% overhead in Scheme. The overhead decreased as the grid size increased and became almost 0% beyond a certain grid size. Increasing the grid size did not always improve performance for either implementation: beyond a certain grid size, we repeatedly observed similar execution times. Finally, larger vectors required more participating threads to reach optimal performance in Scheme.
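Whether the two trend lines meet follows from the reported coefficients: the difference of two curves of the form y = a + b/x is itself of that form, and it reaches zero at a positive grid size only when its constant term and its 1/x coefficient have opposite signs. The sketch below checks this for all four vector sizes. It is an illustration based solely on the fitted coefficients quoted in the text; the names `FITS` and `crossover` are hypothetical.

```python
# (vector size in millions, Scheme (a, b), CUDA-C (a, b)) for y = a + b / x,
# copied from the trend lines reported for Figures 5.15 to 5.18.
FITS = [
    (10, (21.912, 21.744), (22.140, 13.208)),
    (20, (42.435, 45.134), (42.825, 25.752)),
    (30, (63.198, 65.123), (63.123, 38.832)),
    (40, (83.789, 87.781), (83.601, 51.662)),
]


def crossover(scheme, cuda_c):
    """Grid size where the two trend lines meet, or None if they never do."""
    da = scheme[0] - cuda_c[0]   # difference of the constant terms
    db = scheme[1] - cuda_c[1]   # difference of the 1/x coefficients
    # da + db / x = 0 has a positive solution only if da and db differ in sign.
    if da == 0 or -db / da <= 0:
        return None
    return -db / da


crossovers = {}
for size, scheme, cuda_c in FITS:
    crossovers[size] = crossover(scheme, cuda_c)
    # Constant term per million elements: the level the Scheme curve flattens to.
    print(size, crossovers[size], round(scheme[0] / size, 3))
```

Under these fits, the lines cross only for the 10M and 20M vectors, in both cases beyond grid size 32; for 30M and 40M the Scheme constant term is slightly larger than the CUDA-C one, so the fitted lines approach each other without meeting.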
