As in our barrier approach, we propose a SIMD reduction algorithm that exploits the SIMD instructions available in modern multi- and many-core processors. Our current proposal is limited to reductions on basic data types, which are the most common reductions in OpenMP. However, reductions on more complex data types could also be computed following the same approach.
6.3.1 Reduction Scheme
Our reduction algorithm follows a scheme that is integrated with the OpenMP synchronization barrier, as we motivated in Section 6.1.1. For the sake of simplicity, we frame our reduction description in the context of our SIMD barrier proposal described in Section 6.2.1. Nevertheless, our approach could easily be extended to different barrier algorithms.
Our reduction scheme is applicable thanks to the split gather/scatter barrier scheme. The computation of reduction operations occurs in the gather phase. In this phase, the arrival of one thread at the barrier implies that its partial reduction value is ready to be used in the reduction computation. Therefore, the final reduction computation is complete once the gather phase ends and the global reduction variable has been updated with the resulting value. Afterwards, the release phase can safely start, as all threads will have the reduction result available through the global variable.
The complete pseudo-code, covering both the gather phase of the SIMD barrier and the SIMD reduction algorithm, is shown in Listing 6.1 in Section 6.2.2. We take advantage of the groups of threads in the tree-like structure of our barrier proposal to compute partial reductions within each group. These partial reductions happen only when all threads in the group have reached the barrier.
In order to use efficient SIMD memory instructions in the computation of reductions, we use one temporary buffer per group of threads to rearrange the partial reduction values of the threads contiguously in memory. All threads copy their partial reduction value to their corresponding buffer at the beginning of the gather phase (function push_reduction_values at line 6). These buffers allow the use of stride-one vector loads to read multiple partial reduction values at a time.
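The buffer layout described above can be illustrated with a minimal C sketch. It is not the implementation behind Listing 6.1: the type red_group_buffer_t, the functions init_reduction_buffer and push_reduction_value, the 64-byte cache line and the group size of 6 are all assumptions made for illustration. The sketch only shows how each thread writes its partial value into a contiguous, padded slot array so that a later stride-one vector load can read the whole group.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64  /* assumed cache-line size */
#define GROUP_SIZE 6         /* assumed number of threads per level-0 group */
#define SLOTS_PER_BUFFER (CACHE_LINE_BYTES / sizeof(int))

/* Hypothetical group buffer: one cache line of contiguous int slots. */
typedef struct {
    _Alignas(CACHE_LINE_BYTES) int slot[SLOTS_PER_BUFFER];
} red_group_buffer_t;

/* Fill every slot, padding included, with the identity of the reduction
 * (0 for addition) so that unused slots do not affect the result. */
void init_reduction_buffer(red_group_buffer_t *buf, int identity) {
    for (size_t i = 0; i < SLOTS_PER_BUFFER; ++i)
        buf->slot[i] = identity;
}

/* Each thread copies its partial value to its own slot of its group buffer. */
void push_reduction_value(red_group_buffer_t *group_buffers,
                          int thread_id, int partial_value) {
    assert(GROUP_SIZE <= SLOTS_PER_BUFFER);
    red_group_buffer_t *buf = &group_buffers[thread_id / GROUP_SIZE];
    buf->slot[thread_id % GROUP_SIZE] = partial_value;
}
```

With 21 threads and this group size, thread 20 would write to slot 2 of the fourth buffer, and the remaining slots of that buffer would keep the identity value.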
In addition, we also keep the master/slave thread roles per group in our reduction approach. The following steps summarize the detailed process of computing the reduction operations piggybacked on the SIMD barrier scheme:
1. All threads reach the gather phase of the barrier and copy their partial reduction value to their assigned position of the group reduction buffer (function push_reduction_values, line 6).
2. Slave threads signal the arrival in their group of threads and go to step 7 (line 29).
3. Master threads realize that all the slave threads in their group have reached the barrier and that their reduction data is ready to be used (line 16).
4. Master threads compute the local reduction of their group of threads (function master_group_reductions, line 20).
5. Master threads continue to the next level of the tree and they repeat steps from 2 to 5 for all the tree levels.
6. The master thread of the last group (root) computes the final reduction value and updates the global reduction variable (function master_group_reductions, line 20).
7. The thread leaves the gather phase of the barrier.
8. The release phase of the barrier starts.
Figure 6.3 shows the scheme of the execution of an integer reduction for 21 threads. Unlike the SIMD counters of the barrier, there is only one buffer per group at level zero of the tree. This means that these group buffers are reused in the subsequent levels of the tree. As depicted, buffer #0 is also reused in level 1 and level 2, and buffer #2 is reused only in level 1.
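The 21-thread example can be traced with a short sequential C simulation. This is only a sketch under stated assumptions: it runs on a single thread, it ignores the barrier signalling, it assumes that thread t holds the partial value t, and the names simulate_tree_reduction, combine_buffers and sum_buffer are hypothetical. It reproduces the buffer reuse described above: buffers #0 and #2 absorb their slaves at level 1, and buffer #0 absorbs buffer #2 at level 2.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_THREADS 21
#define NUM_LEVELS  3
#define SLOTS       6   /* slots per group buffer = level-0 group size */
#define NUM_BUFFERS 4   /* ceil(21 / 6) level-0 groups */

static const int group_size[NUM_LEVELS] = {6, 2, 2};

/* Element-wise dst[i] += src[i]; stands in for a vertical SIMD operation. */
static void combine_buffers(int *dst, const int *src) {
    for (size_t i = 0; i < SLOTS; ++i) dst[i] += src[i];
}

/* Sum of all slots of a buffer; stands in for a horizontal SIMD operation. */
static int sum_buffer(const int *buf) {
    int sum = 0;
    for (size_t i = 0; i < SLOTS; ++i) sum += buf[i];
    return sum;
}

int simulate_tree_reduction(const int *partial, int buffers[NUM_BUFFERS][SLOTS]) {
    assert(partial != NULL && buffers != NULL);

    /* Step 1: padding slots get the identity (0), then every thread
     * pushes its partial value into its slot of its group buffer. */
    for (int b = 0; b < NUM_BUFFERS; ++b)
        for (int s = 0; s < SLOTS; ++s)
            buffers[b][s] = 0;
    for (int t = 0; t < NUM_THREADS; ++t)
        buffers[t / SLOTS][t % SLOTS] = partial[t];

    /* Levels 1 and 2: each group master folds its slaves' buffers
     * into its own buffer, which is reused from the level below. */
    int stride = 1;  /* distance, in buffers, between members of a group */
    for (int level = 1; level < NUM_LEVELS; ++level) {
        int span = stride * group_size[level];
        for (int master = 0; master < NUM_BUFFERS; master += span) {
            for (int k = 1; k < group_size[level]; ++k) {
                int slave = master + k * stride;
                if (slave < NUM_BUFFERS)
                    combine_buffers(buffers[master], buffers[slave]);
            }
        }
        stride = span;
    }

    /* Root: the surviving buffer #0 is reduced to the final scalar. */
    return sum_buffer(buffers[0]);
}
```

With partial values 0 to 20, the simulation leaves buffer #2 holding 30 32 34 15 16 17 and buffer #0 holding 36 40 44 27 30 33, and it returns 210, the sum of 0 to 20.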
 1  void master_group_reductions(...)
 2  {
 3    for each single reduction
 4    {
 5      if (level == 0) // Leaf tree nodes
 6      {
 7        // Multi-register Leaf Reduction
 8        for i = 1 to num_vregister_per_group
 9          Vop_reduce(my_red_group_buffer, my_red_group_buffer + VL*i);
10      }
11      // Vertical Tree Reduction (non-leaf tree nodes)
12      else
13      {
14        for each slave in my group
15          Vop_reduce(my_red_group_buffer, slave_red_group_buffer);
16      }
17
18      // Horizontal Root Reduction (root node)
19      if (level == (num_levels-1))
20        Hop_reduce(global_reduction_data, my_red_group_buffer);
21    }
22  }
Listing 6.3: SIMD reduction steps of the master thread (pseudo-code)
6.3.2 SIMD Reduction Steps
As we described in Section 6.3.1, master threads incrementally compute the reduction operation in the gather phase of the barrier. These master threads perform a partial reduction of their group of threads for each level of the tree. This process is represented by the function master_group_reductions, at line 20 in Listing 6.1 from Section 6.2.2. Listing 6.3 shows the pseudo-code of the reduction steps of the function master_group_reductions.
In order to perform this incremental reduction computation, we define the following two SIMD reduction operations:
Vertical SIMD reduction operation (Vop): It reduces two input vector registers
into a single output vector register using a combiner vector operation.
Horizontal SIMD reduction operation (Hop): It reduces all the scalar elements
within the same input vector register into a single output scalar element.
Moreover, we use these SIMD reduction operations in the following three different computation steps:
Multi-register leaf reduction: This step only happens at level 0 (leaf tree nodes)
if the group reduction buffer is longer than a single vector register (VL). In such a case, the master thread reduces the group reduction buffer into a buffer with a maximum length of a single vector register. Lines 8 and 9 from Listing 6.3 show the pseudo-code of this step. This reduction step is performed using vertical SIMD reduction operations.
Vertical tree reduction: For all levels greater than 0, the group master thread
reduces all the slaves' buffers into its own buffer. Lines 14 and 15 from Listing 6.3 show the pseudo-code of this step. This reduction is also performed with vertical SIMD reduction operations.
Horizontal root reduction: At the root node, after the corresponding vertical
tree reduction of that level, the group master thread reduces its single-register buffer into the final scalar reduction value. This value is copied to the global reduction variable. Lines 19 and 20 from Listing 6.3 show the pseudo-code of this step. This step is performed with a horizontal SIMD reduction operation.
Figure 6.3: SIMD reduction scheme for 21 threads. Addition reduction defined on an integer data type. Tree group sizes of level 0, 1 and 2 are 6, 2 and 2, respectively. 4 SIMD reduction buffers. Level 0 uses buffers #0 to #3, level 1 reuses buffers #0 and #2, and level 2 reuses buffer #0. The buffers are combined with Vop operations, and a final Hop operation produces the value of the global reduction variable (210).
Figure 6.3 shows the scheme of the SIMD reduction computation using the vertical (Vop) and horizontal (Hop) SIMD reduction operations in the vertical tree reduction and the horizontal root reduction steps. The multi-register leaf reduction step is not shown, since we assume that the initial buffer at level zero is smaller than the vector register length.
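The two SIMD operations and the three steps can be sketched in portable C for an integer addition. This is a sketch, not the implementation behind Listing 6.3. The vector length VL = 4 and the names vop_add, hop_add, leaf_reduce, tree_reduce and root_reduce are assumptions, and the plain loops stand in for the vector instructions that a real implementation would use. The loop bound in leaf_reduce is one reading of "for i = 1 to num_vregister_per_group", taking the bound as exclusive.

```c
#include <assert.h>
#include <stddef.h>

#define VL 4  /* assumed vector length, in elements */

/* Vop: combine two vectors element-wise into the first one. */
static void vop_add(int *dst, const int *src) {
    for (size_t i = 0; i < VL; ++i) dst[i] += src[i];
}

/* Hop: combine all elements of one vector into a single scalar. */
static int hop_add(const int *vec) {
    int sum = 0;
    for (size_t i = 0; i < VL; ++i) sum += vec[i];
    return sum;
}

/* Multi-register leaf reduction: fold a buffer spanning several vector
 * registers into its first VL elements. */
void leaf_reduce(int *group_buffer, size_t num_vregisters) {
    assert(num_vregisters >= 1);
    for (size_t i = 1; i < num_vregisters; ++i)
        vop_add(group_buffer, group_buffer + VL * i);
}

/* Vertical tree reduction: the master folds every slave buffer of its
 * group into its own single-register buffer. */
void tree_reduce(int *master_buffer, const int *const *slave_buffers,
                 size_t num_slaves) {
    for (size_t s = 0; s < num_slaves; ++s)
        vop_add(master_buffer, slave_buffers[s]);
}

/* Horizontal root reduction: the root master produces the final scalar,
 * which the caller would store in the global reduction variable. */
int root_reduce(const int *master_buffer) {
    return hop_add(master_buffer);
}
```

For example, a leaf buffer of two registers holding 1 to 8 is folded by leaf_reduce into 6 8 10 12. Combining it with one slave buffer of ones through tree_reduce gives 7 9 11 13, and root_reduce then yields 40.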