5.2 Kernel Benchmarks
5.2.5 OpenMP loop parallelization
There are several ways to parallelize a serial section of code with OpenMP. Combining parallel regions with nested loops can cause performance bottlenecks if the programmer does not take the effects of nested parallel loops into account. The situation becomes even more complicated in hybrid (MPI+OpenMP) codes such as XFLAT, where the performance of parallel regions may depend on the location of the code's synchronization points. XFLAT has an outer loop that persists throughout the lifetime of the application. Within that loop, several regions were parallelized using OpenMP. There are four possible approaches to implementing a parallel region and a parallel loop in XFLAT. Fig. 5.19 illustrates the first approach, in which a single parallel region encloses the outer loop together with multiple parallel for loops. Note that there is a single-threaded section before each for loop. The second approach, illustrated in Fig. 5.20, is to enclose only the body of the outer loop. Thus, the single-threaded regions as well as the inner for loops are enclosed by an OpenMP parallel region that sits inside the outer loop. Fig. 5.21 depicts the third approach, in which the parallel region encloses everything, as in the first approach, and the nowait clause is added to the inner for loops. In this way, a thread that completes its share of a for loop does not sit idle at the end of the loop but continues with the code that follows it. The last approach is to parallelize only the inner for loops, each with its own OpenMP parallel for pragma. Hence, there is no need to define an outer parallel region or to mark single-threaded regions inside the outer loop (see Fig. 5.22).
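To make the effect of the nowait clause in the third approach concrete, the following is a minimal sketch and not XFLAT code. The function name `fill_independent` and the array contents are assumptions chosen for illustration. The two loops touch different arrays, which is the situation in which dropping the implicit barrier after the first loop is safe.

```cpp
#include <cassert>
#include <vector>

// Fills two independent arrays inside one parallel region (illustrative only).
// The first loop carries nowait, so a thread that finishes its share moves
// straight on to the second loop instead of waiting at the implicit barrier.
void fill_independent(std::vector<double>& a, std::vector<double>& b) {
    assert(!a.empty() && !b.empty());  // the sketch assumes non-empty inputs
    const int n = static_cast<int>(a.size());
    const int m = static_cast<int>(b.size());
    #pragma omp parallel
    {
        #pragma omp for nowait
        for (int i = 0; i < n; ++i) {
            a[i] = 2.0 * i;
        }
        // No barrier separates the loops; this is safe only because the
        // second loop never reads or writes a[].
        #pragma omp for
        for (int j = 0; j < m; ++j) {
            b[j] = j + 1.0;
        }
    }
}
```

Calling `fill_independent(a, b)` on two preallocated vectors fills both regardless of the number of threads. If the second loop read values of `a` written by other threads, the nowait clause would have to be removed or an explicit barrier added.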
Several kernels were constructed to benchmark the performance of the four approaches. In all of the benchmarks, the outer loop contained four parallel regions and four single-threaded regions. Each single-threaded region was placed immediately before one for loop. The computations within each region depended on the previous region's result, to ensure that the compiler did not optimize away any part of the kernels.

#pragma omp parallel
{
    Loop(termination_conditions) {
        #pragma omp single
        { /* single-threaded code */ }
        #pragma omp for
        for (int i : index) { /* multi-threaded code */ }
        ...
    }
}
Figure 5.19: First approach for parallelizing a region via a parallel region that encloses everything, and implements a single region within the loop.
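The kernel construction described above can be sketched as follows for the first and fourth approaches. This is only an illustration under stated assumptions: the function names, the integer update rule, and the initial values are hypothetical and are not taken from the actual benchmark kernels. Each single-threaded step reads the array written by the preceding loop, and each loop reads the scalar written by the preceding single-threaded step, so no region can be removed by the compiler. Unsigned integer arithmetic is used so that both variants return identical results for any thread count.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical input: x[i] = i.
std::vector<std::uint64_t> make_input(int inner) {
    std::vector<std::uint64_t> x(inner);
    for (int i = 0; i < inner; ++i) x[i] = static_cast<std::uint64_t>(i);
    return x;
}

// First approach: one parallel region encloses the outer loop.
std::uint64_t kernel_approach1(int outer, int inner) {
    assert(outer > 0 && inner > 0);
    std::vector<std::uint64_t> x = make_input(inner);
    std::uint64_t s = 1;  // shared scalar updated by the single-threaded step
    #pragma omp parallel
    {
        for (int t = 0; t < outer; ++t) {
            for (int r = 0; r < 4; ++r) {  // four single/for pairs
                #pragma omp single
                {
                    s = s * 5 + x[(t + r) % inner];  // depends on previous loop
                }
                #pragma omp for
                for (int i = 0; i < inner; ++i) {
                    x[i] = x[i] * 3 + s + static_cast<std::uint64_t>(i);
                }
            }
        }
    }
    return s + x[0];
}

// Fourth approach: only the inner loops are parallelized.
std::uint64_t kernel_approach4(int outer, int inner) {
    assert(outer > 0 && inner > 0);
    std::vector<std::uint64_t> x = make_input(inner);
    std::uint64_t s = 1;
    for (int t = 0; t < outer; ++t) {
        for (int r = 0; r < 4; ++r) {
            s = s * 5 + x[(t + r) % inner];  // plain serial code
            #pragma omp parallel for
            for (int i = 0; i < inner; ++i) {
                x[i] = x[i] * 3 + s + static_cast<std::uint64_t>(i);
            }
        }
    }
    return s + x[0];
}
```

Both functions perform the same arithmetic, so timing each call (for example with `std::chrono::steady_clock`) for a given pair of outer and inner counts would isolate the cost of the parallelization approach itself.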
Since the amount of computation did not vary between kernels, any performance difference was due to the different parallelization approaches. The kernels were benchmarked using three different loop-length configurations. For the first run, the outer loop iteration count was set to 100k and every inner loop iteration count was set to 10k. The results on the MIC and the CPU for the four approaches are shown in Fig. 5.23. On the CPU there was no visible performance difference between the four methods, and on the MIC the maximum difference was about 10 seconds. For the next run, the outer loop iteration count was set to 500k and every inner loop iteration count was set to 2k. As shown in Fig. 5.24, on the CPU there was almost no performance difference between the approaches; on the MIC, however, the maximum performance gap was about 40 seconds. For the last run, the outer loop iteration count was set to 1M and each inner loop iteration count was set to 1000. This time, as depicted in Fig. 5.25, the CPU times varied by about 7 seconds, while on the MIC the maximum performance gap increased to about 100 seconds. The larger gap on the MIC may be due to the MIC's simpler core architecture and lower clock rate. Over 1 million iterations, the performance difference between the four approaches was negligible on the CPU, and on the MIC it was less than 100 seconds. Note that in real applications 1 million iterations of the outer loop may take hours or days to complete; in absolute terms, therefore, a difference of 100 seconds among the four approaches is negligible.

Loop(termination_conditions) {
    #pragma omp parallel
    {
        #pragma omp single
        { /* single-threaded code */ }
        #pragma omp for
        for (int i : index) { /* multi-threaded code */ }
        ...
    }
}
Figure 5.20: Second approach for parallelizing a region via a parallel region inside the main loop that encloses everything.
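As a rough reading of the MIC gaps quoted above, the total gap can be spread over the outer iterations. The helper below is a back-of-the-envelope sketch: the function names are hypothetical, and the inputs are the approximate gaps stated in the text, not new measurements.

```cpp
#include <cassert>

// Spreads a total run-time gap (in seconds) over the outer iterations and
// returns the gap per outer iteration in microseconds.
double gap_per_outer_iteration_us(double gap_seconds, double outer_iterations) {
    assert(outer_iterations > 0.0);
    return gap_seconds * 1.0e6 / outer_iterations;
}

// Same gap divided further by the number of parallel loops per outer
// iteration (four in these benchmarks).
double gap_per_parallel_loop_us(double gap_seconds, double outer_iterations,
                                int loops_per_iteration = 4) {
    assert(loops_per_iteration > 0);
    return gap_per_outer_iteration_us(gap_seconds, outer_iterations) /
           loops_per_iteration;
}
```

With gaps of about 10 s over 100k, 40 s over 500k, and 100 s over 1M outer iterations, this gives roughly 100, 80, and 100 microseconds per outer iteration, or about 20 to 25 microseconds per parallel loop. The nearly constant value would be consistent with a fixed synchronization cost per outer iteration, although the benchmarks do not measure this directly.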
#pragma omp parallel
{
    Loop(termination_conditions) {
        #pragma omp single
        { /* single-threaded code */ }
        #pragma omp for nowait
        for (int i : index) { /* multi-threaded code */ }
        ...
    }
}
Figure 5.21: Third approach for parallelizing a region via a parallel region that encloses everything, with a single region within the loop. Threads that reach the end of a parallel for loop do not wait for the other threads.

Loop(termination_conditions) {
    // single-threaded code
    ...
    #pragma omp parallel for
    for (int i : index) {
        // multi-threaded code
    }
    ...
}
Figure 5.22: Fourth approach for parallelizing a region via separated parallel for regions.

[Bar chart: total time (s) for OpenMP approaches 1 to 4. CPU: 21, 20, 19, 20. MIC: 26, 29, 19, 20.]
Figure 5.23: CPU (blue bars) and MIC (orange bars) OpenMP performance on Stampede for the 100k × 10k loop configuration (outer loop length × inner loop length). Numbers on top of the bars are total times in seconds.

[Bar chart: total time (s) for OpenMP approaches 1 to 4. CPU: 36, 37, 34, 36. MIC: 126, 146, 100, 107.]
Figure 5.24: CPU (blue bars) and MIC (orange bars) OpenMP performance on Stampede for the 500k × 2k loop configuration (outer loop length × inner loop length). Numbers on top of the bars are total times in seconds.

[Bar chart: total time (s) for OpenMP approaches 1 to 4. CPU: 43, 45, 38, 41. MIC: 227, 259, 163, 182.]
Figure 5.25: CPU (blue bars) and MIC (orange bars) OpenMP performance on Stampede for the 1M × 1k loop configuration (outer loop length × inner loop length). Numbers on top of the bars are total times in seconds.

For XFLAT, the fourth method was chosen, for two main reasons. First, the third and fourth methods consistently performed best. The second and more important factor was simplicity: with the fourth method there is no need to define an OpenMP parallel region that encloses the inner parallel for loops, and no need to mark the single-threaded regions explicitly. Consequently, MPI functions can be placed after each for loop without having to treat them as special lines of code inside OpenMP parallel regions. As a result, the implementation becomes simpler, maintenance becomes easier, and debugging is less complicated.
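The simplicity argument can be illustrated with a small sketch of the fourth approach in which a communication step sits between two parallel for loops. To keep the example runnable without an MPI installation, the communication is replaced by a hypothetical stand-in, `fake_global_sum`; in a real hybrid code a call such as `MPI_Allreduce` would occupy that position. The function names and the update rule are assumptions for illustration and are not XFLAT code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for an MPI exchange; it only sums the local data so
// that the sketch runs without MPI.
long long fake_global_sum(const std::vector<long long>& local) {
    long long total = 0;
    for (long long v : local) total += v;
    return total;
}

// Fourth approach: separate parallel for loops with serial code in between.
long long step(std::vector<long long>& data, int iterations) {
    assert(!data.empty());
    const int n = static_cast<int>(data.size());
    long long total = 0;
    for (int t = 0; t < iterations; ++t) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            data[i] += i;
        }
        // Serial code between the loops: only the initial thread runs here,
        // so the communication call needs no single or master construct.
        total = fake_global_sum(data);
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            data[i] += total % 7;
        }
    }
    return total;
}
```

In the first three approaches the same call would sit inside a parallel region and would have to be wrapped in a single-threaded construct, which is the extra bookkeeping the fourth approach avoids.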