SIMD SAC Performance - Improving the Execution Speed Through Parallelism

6. Improving the Execution Speed Through Parallelism

6.2.1. SIMD SAC Performance

Figure 6-9 shows the runtime comparison of both SIMD versions against our baseline.

Both of our SIMD versions were consistently faster than the scalar baseline. For small inputs (4 million characters or fewer), the SIMD Naive version was up to 3× faster than SIMD OPT. We hypothesize that, because the data structures grew along with the input size, the probability of cache misses increased as memory accesses became more frequent and more scattered. This cumulative effect made SIMD Naive less efficient as the input size grew. The generally better performance of SIMD OPT was also due to the data packing optimization, which halved the number of iterations in the inner cycle.

Figure 6-9. SAC execution time with different input sizes from hg19. The vertical axis is log₁₀ and the horizontal axis is log₂.

For very large input sizes (more than 2²⁴ characters), SIMD OPT was between 2× and 3× faster than SIMD Naive. For small inputs (2²² characters or fewer), SIMD Naive performed better than SIMD OPT. In the test with the largest input, corresponding to 2³⁰ bases, the difference between the baseline and SIMD OPT was approximately one order of magnitude.

It was important to measure the impact of every function in our algorithms on the overall performance of our SIMD SAC. This allowed us to observe the effect of the implemented optimizations and to identify future optimization targets. The profiling results for the larger input sizes are shown in Figure 6-10.

In SIMD Naive, the tasks that dominate the performance are evident: mainly the partial radixsort and, with significantly less weight, the histogram computation, ComputeBounds and the memcpy. In the SIMD OPT version, by contrast, additional tasks become important as the time required by the most expensive functions is reduced. The dark green block in the SIMD OPT profile, which does not appear in the naive implementation, represents mostly the functions that perform only contiguous memory accesses (no gather/scatter) and are executed in the first iteration, along with the cost of the scalar direct sorting of groups with only two suffixes. According to additional experiments, functions that performed only contiguous memory accesses were at least 2× faster than their gather/scatter-based counterparts.

(a) SIMD Naive SAC algorithm.

(b) SIMD OPT SAC algorithm

Figure 6-10. Profiling of both versions of the SIMD SAC algorithm.

The second optimization was storing the VPI and VLU instead of recalculating them; its effect is visible when comparing the histogram block (in gray) of both figures. The gray blocks in the SIMD Naive profile tend to be smaller than in SIMD OPT, where more time is consumed by the additional store instructions. This difference became more evident as the input size increased. In turn, storing these values reduced the runtime of the partial radixsort, since it no longer calculates the VLU and VPI.

The effect of the packing strategy is observed in the partial radix blocks, which consume a smaller percentage of the total runtime in the optimized version. Packing also reduced the number of iterations of the inner cycle, which in turn resulted in less time spent in the memcpy instructions. In the histogram block, this positive effect might be offset by the overhead introduced to store the VLU and VPI arrays. Also, both the histogram and the exclusive prefix sum are negatively impacted, since they must now handle much larger data structures (more buckets are needed), which increases the probability of cache misses in both tasks. Additionally, the cost of the expanding function became significant in the SIMD OPT version, both because the expanding process involves more complex operations in this version and because the overall impact of some other functions had been reduced significantly by the packing itself. Due to this runtime decrease, the non-SIMD portion of the code (mostly the scalar direct sorting and ComputeBounds) had a greater impact on the overall performance. Finally, the observed reduction in the execution time of the partial radixsort is also a consequence of the scalar direct sorting of groups of size 2, which had a higher impact for small inputs. These findings also explain the variation in runtime shown in Figure 6-9.
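The trade-off between fewer iterations and more buckets can be illustrated with a small sketch. It assumes 8-bit characters packed into one integer key, which is an illustrative choice and not necessarily the encoding used in SIMD OPT; the names `packed_key`, `passes_needed` and `bucket_count` are hypothetical.

```python
def packed_key(text: bytes, pos: int, chars_per_key: int) -> int:
    """Build a sort key from chars_per_key consecutive characters starting at pos.

    Assumption: 8 bits per character; positions past the end are padded with 0.
    """
    key = 0
    for k in range(chars_per_key):
        c = text[pos + k] if pos + k < len(text) else 0
        key = (key << 8) | c
    return key


def passes_needed(prefix_len: int, chars_per_key: int) -> int:
    """Sorting passes required to cover prefix_len characters (ceiling division)."""
    return -(-prefix_len // chars_per_key)


def bucket_count(chars_per_key: int, bits_per_char: int = 8) -> int:
    """Histogram buckets required when each key holds chars_per_key characters."""
    return 1 << (bits_per_char * chars_per_key)


for width in (1, 2, 4):
    print(width, passes_needed(8, width), bucket_count(width))
```

Under these assumptions, going from one to two characters per key halves the number of passes while the bucket count grows from 256 to 65,536; a key of four characters would require 2³² buckets.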

Finally, we calculated the performance gain of both SAC algorithms, which is presented in Figure 6-11. Considering the Intel AVX-512 architecture used, the ideal parallelism corresponds to a speedup of 16× for 32-bit element arrays. Figure 6-11 shows that SIMD Naive achieved a speedup of up to 6× for small inputs, which decreased as the input size increased. SIMD OPT, on the other hand, exhibited a speedup that grew with the input size, reaching up to 7× over the reference in the experiments presented. Such a speedup can be considered significant given the overhead and implementation issues related to SIMD programming.
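The ideal figure follows from the register width divided by the element width. The short sketch below reproduces that arithmetic and expresses the reported peak speedups as a fraction of the ideal; the function names are hypothetical and the calculation is only a back-of-the-envelope check, not part of our measurements.

```python
def ideal_lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements processed per SIMD register, i.e. the ideal speedup."""
    return register_bits // element_bits


def simd_efficiency(measured_speedup: float,
                    register_bits: int = 512,
                    element_bits: int = 32) -> float:
    """Fraction of the ideal SIMD speedup that a measured speedup represents."""
    return measured_speedup / ideal_lanes(register_bits, element_bits)


print(ideal_lanes(512, 32))   # lanes for 32-bit elements in a 512-bit register
print(simd_efficiency(6.0))   # peak speedup reported for SIMD Naive
print(simd_efficiency(7.0))   # peak speedup reported for SIMD OPT
```

With 16 lanes, the reported peaks of 6× and 7× correspond to 37.5% and 43.75% of the ideal speedup, respectively.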

Although one of the optimizations applied was to pack two characters into one array element, we decided not to pursue a denser packing strategy. We think that doing so would have a negative impact on the final performance, since with bigger integers the histogram and the exclusive prefix sum would need many more buckets. This translates into a much higher probability of cache misses, with an important cumulative effect because of all the scattered accesses. The most convenient solution for this issue would be a more versatile conflict-detection instruction that allows processing more (smaller) elements per AVX-512 register.
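For context, the role of conflict detection in a vectorized histogram can be modeled in scalar code. The sketch below emulates the semantics of the AVX-512CD `VPCONFLICTD` instruction, which is defined for 32-bit elements (with `VPCONFLICTQ` for 64-bit elements): each lane receives a bitmask of the lower-indexed lanes that hold the same value. It is a Python illustration under that assumption, not our implementation, and the names `conflict_mask` and `histogram_update` are hypothetical.

```python
def conflict_mask(lanes: list[int]) -> list[int]:
    """Scalar model of VPCONFLICTD.

    Bit j of result[i] is set when j < i and lanes[j] == lanes[i].
    """
    return [
        sum(1 << j for j in range(i) if lanes[j] == lanes[i])
        for i in range(len(lanes))
    ]


def histogram_update(hist: list[int], lanes: list[int]) -> None:
    """Add one vector of bucket indices to hist, resolving duplicates in the vector.

    Only the last lane of each distinct bucket writes, adding the number of
    earlier lanes with the same bucket plus one for itself.
    """
    masks = conflict_mask(lanes)
    for i, bucket in enumerate(lanes):
        prior = bin(masks[i]).count("1")  # earlier lanes hitting the same bucket
        is_last = all(lanes[j] != bucket for j in range(i + 1, len(lanes)))
        if is_last:
            hist[bucket] += prior + 1


hist = [0, 0, 0, 0]
histogram_update(hist, [3, 1, 3, 3, 0, 1])
print(hist)
```

Because the conflict masks are computed per 32-bit lane, at most 16 bucket indices are handled per 512-bit register in this model; an equivalent instruction for narrower elements would raise that count.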
