SIMD Combining Tree Barrier Algorithm - SIMD@OpenMP : a programming model approach to leverage

Barrier algorithms may benefit from vector memory instructions that can read or write several scalar elements at a time. These instructions can be used either to check the arrival of multiple threads at the barrier or to signal their release at once. Reduction algorithms can use vector memory and arithmetic instructions to compute the partial reduction values of multiple threads at once. In addition, new SIMD instruction sets also include more advanced vector instructions that can efficiently reduce the scalar elements of a vector register into a single scalar element.

These facts suggest that SIMD instructions can improve the performance of barrier and reduction algorithms. They could reduce the number of executed instructions and even the traffic through the communication network. However, applying SIMD instructions to barrier and reduction algorithms requires a specific design of those algorithms. As we already know from Chapter 3 and Chapter 4, the efficient exploitation of SIMD instructions requires satisfying certain constraints on the stride and alignment of data in memory.

6.1.2 Objectives

The main objective of this contribution is to propose new synchronization barrier and reduction schemes specifically designed to exploit SIMD instructions. Furthermore, our design also takes into account that multiple hardware threads could be running within the same core with simultaneous multi-threading technology. Our proposal must fit into the OpenMP programming model and fulfill the requirements of the barrier and reduction primitives defined in the standard.

We target the Intel Xeon Phi coprocessor, a many-core architecture with a large number of cores, 4-way simultaneous multi-threading and a powerful 512-bit SIMD instruction set. The proposal must be evaluated against the current production-quality barrier and reduction schemes available and tuned for the chosen architecture.

6.2 SIMD Combining Tree Barrier Algorithm

We propose a tree barrier algorithm for current multi-core and many-core architectures that makes use of SIMD instructions. Our proposal also benefits from having multiple hardware threads per core. We call our barrier algorithm a reconfigurable multi-degree combining tree barrier with lock-free distributed SIMD counters. The reasons for this name are the following:

Combining tree: The barrier scheme is based on a traditional combining tree data structure.

Reconfigurable: The combining tree data structure of the barrier can be reconfigured [...] them across this structure.

Multi-degree: The combining tree data structure of the barrier can have a different branching factor (number of children) per level.

Distributed SIMD counters: Each node of the combining tree contains individual counters per thread (distributed counters) partially orchestrated by SIMD memory operations.

Lock-free: All the counters of the combining tree are lock-free and locks are not necessary at any point of the synchronization.

The tree-like internal data structure of the barrier allows a split design with independent gather and release phases, as we motivated in Section 6.1.1. Consequently, this barrier algorithm is suitable for implementing the barrier primitive in any OpenMP runtime library, including that of Intel.

6.2.1 Barrier Scheme

Our barrier algorithm deploys a combining tree data structure which is walked from leaves to root in the gather phase and from root to leaves in the release phase. In this tree structure, a pre-established number of threads is assigned to each node in the same level of the tree. The group size of each level is statically defined in the initialization phase of the barrier. This group size makes the number of threads per group the same for every node within that level whereas it may be different for each particular level of the tree. If the number of threads reaching a level is not divisible by the group size, there will be an additional group (the last one) containing the remainder of threads. The total number of nodes/groups per level depends on the group size of that level and the total number of threads that execute that level. Figure 6.2 illustrates the scheme of the barrier for the synchronization of 21 threads with group sizes of 6, 2 and 2 for levels 0, 1 and 2 of the tree, respectively.
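The group arithmetic described above can be sketched in a few lines of C. This is only an illustration: the helper name count_tree_nodes and its parameters are hypothetical and do not come from the barrier implementation. It assumes that exactly one thread per group (the master) proceeds to the next level.

```c
#include <assert.h>

/* Hypothetical helper (not part of the original implementation).
 * Computes the number of groups (tree nodes) at each level, assuming that
 * one master per group moves on to the next level.
 * groups_out[l] receives the number of groups at level l.
 * Returns the total number of tree nodes. */
int count_tree_nodes(int num_threads, const int *group_size,
                     int num_levels, int *groups_out)
{
    int total = 0;
    int arriving = num_threads;   /* threads that execute the current level */

    for (int level = 0; level < num_levels; level++) {
        assert(group_size[level] > 0);
        /* Ceiling division: a last, smaller group holds the remainder. */
        int groups = (arriving + group_size[level] - 1) / group_size[level];
        groups_out[level] = groups;
        total += groups;
        arriving = groups;        /* only the masters reach the next level */
    }
    return total;
}
```

For the configuration of Figure 6.2 (21 threads, group sizes 6, 2 and 2), this yields 4, 2 and 1 groups for levels 0, 1 and 2, that is, seven tree nodes. The last group of level 0 holds the remaining 3 threads.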

In each particular tree node only one thread is designated to play the master role (M) of the group. The remaining threads assume the slave role (S). These roles will affect their duties in the gather and release phases of the barrier, described in Section 6.2.2 and Section 6.2.3, respectively.

Those threads assigned to the same tree node constitute an independent synchronization group. Inside this group, each thread has an exclusive 1-byte counter for taking part in the synchronization process. All the group counters are allocated contiguously in memory, satisfying the alignment constraints of the underlying SIMD instruction set. These group counters will be handled with SIMD memory operations by the master thread. To prevent false sharing between groups, only one group of counters is placed per cache line. The remaining memory in the line is padded. In Figure 6.2, 1-byte counters are depicted in red and green, and padding in white.
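A possible memory layout for one tree node is sketched below. The type name simd_group_counters, the helper init_group and the constant CACHE_LINE_BYTES are assumptions made for this illustration. The sketch further assumes a 64-byte cache line that matches the 512-bit vector length, so that one aligned vector access covers the whole node.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE_BYTES 64   /* assumption: cache line size == vector size */

/* One tree node: the group's 1-byte counters come first and the rest of the
 * line is padding, so two groups never share a cache line. */
typedef struct {
    alignas(CACHE_LINE_BYTES) uint8_t counter[CACHE_LINE_BYTES];
} simd_group_counters;

/* Hypothetical initializer: every counter (and the padding) starts at 0. */
void init_group(simd_group_counters *g)
{
    memset(g->counter, 0, sizeof g->counter);
}
```

With this layout an array of nodes places consecutive groups exactly one cache line apart, and the address of each node satisfies the alignment required by an aligned vector load or store.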

This particular tree-based design with distributed counters allows exploiting SIMD resources and inter-thread cache locality in cores with simultaneous multi-threading. Inter-thread locality may be useful to carry out a first intra-core synchronization step.

Figure 6.2: SIMD Combining tree barrier scheme for 21 threads. Group sizes of level 0, 1 and 2 are 6, 2 and 2, respectively. Seven tree nodes in total (SIMD counters #0 to #6) with their respective distributed SIMD counters. Legend: M, master thread of the group; S, slave thread of the group; each cache line holds the 1-byte thread counters of one group followed by 1-byte padding; arrows mark the gather phase and the release phase.

In addition, the tree structure also offers the possibility of reshaping the barrier from a multi-level combining tree structure into a lock-free, totally centralized barrier. Hence, it is possible to take advantage of two completely different barrier algorithms with only one implementation. The most appropriate one can be chosen depending on the number of threads and the characteristics of the system.

6.2.2 Gather Phase

Listing 6.1 shows the pseudo-code of the gather phase of the SIMD barrier. In this phase, the intra-group synchronization is performed through the distributed counters introduced in Section 6.2.1. These counters are represented in Listing 6.1 with the tree_flags variable.

Each slave thread signals its arrival at the barrier by changing the value of its exclusive 1-byte counter (line 29). It is important to note that cache line false sharing may occur if several slave threads of the same group, running on different cores, signal their arrival at the same time. Nevertheless, simultaneous multi-threading scenarios can benefit from this design because groups whose slave threads all run on the same core will not suffer from this false sharing penalty.

Regarding the master thread, it waits for all its slave threads to reach the group by checking the slave counters using SIMD instructions (lines 16 and 17). A single vector load reads as many slave counters at once as there are bytes in the vector. Therefore, the use of SIMD instructions prevents the master thread from iterating in a scalar fashion over each single counter. While the master thread is waiting for its slaves, it may perform some of the OpenMP barrier duties (OpenMP_barrier_duties) that we discussed in Section 6.1.1, such as task scheduling or checking active cancellation points.

 1  void simd_barrier_gather(int tid)
 2  {
 3      int level = 0;
 4
 5      // Threads copy their reduction values to the appropriate SIMD buffer
 6      push_reduction_values(...);
 7
 8      // Current thread is in a valid level and it is master at that level
 9      while (level < num_levels && is_group_master(level, tid))
10      {
11          int my_group_idx = get_my_group_idx(level, tid);
12          vtype group_arrival_state =
13              get_group_arrival_state(level, my_group_idx);
14
15          // Master waits for group slaves using SIMD memory loads
16          while (vload(&tree_flags[level][my_group_idx]) != group_arrival_state)
17              OpenMP_barrier_duties();
18
19          // Master computes group reductions
20          master_group_reductions(...);
21
22          level++;
23      }
24
25      // Current thread is in a valid level and it is slave at that level
26      if (level < num_levels) // i.e., !is_group_master(level, tid)
27      {
28          // Slaves mark arrival in group and leave the gather phase
29          tree_flags[level][get_my_level_idx(level, tid)] = THREAD_ARRIVAL_STATE;
30      }
31  }

Listing 6.1: Generic scheme of the SIMD barrier gather phase (pseudo-code)
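The master's check in lines 16 and 17 of Listing 6.1 can be illustrated with portable C. The sketch below is not the Xeon Phi implementation: it uses a 64-bit word as a stand-in for the 512-bit vector register, so one wide load covers 8 counters instead of 64. The names group_arrival_state and all_slaves_arrived, the slot assignment (slot 0 for the master) and the encoding of THREAD_ARRIVAL_STATE are assumptions. It is single-threaded on purpose; a real barrier would additionally need atomic or volatile accesses and appropriate memory ordering.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GROUP_SLOTS 8            /* assumption: 8 one-byte counters per wide load */
#define THREAD_ARRIVAL_STATE 1   /* assumed encoding: 1 = arrived, 0 = initial */

/* Expected contents of the group once num_slaves slaves have arrived.
 * Slot 0 is assumed to belong to the master and stays 0; unused slots are
 * padding and also stay 0. */
uint64_t group_arrival_state(int num_slaves)
{
    uint8_t bytes[GROUP_SLOTS] = {0};
    uint64_t v;

    assert(num_slaves >= 0 && num_slaves < GROUP_SLOTS);
    for (int i = 1; i <= num_slaves; i++)
        bytes[i] = THREAD_ARRIVAL_STATE;
    memcpy(&v, bytes, sizeof v);
    return v;
}

/* One wide load and one comparison replace a scalar loop over the counters. */
int all_slaves_arrived(const uint8_t *counters, uint64_t expected)
{
    uint64_t v;
    memcpy(&v, counters, sizeof v);   /* stand-in for vload() */
    return v == expected;
}
```

A master with five slaves would spin on all_slaves_arrived(counters, group_arrival_state(5)), performing its other barrier duties between checks.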

Once all threads of a particular group have reached the barrier, the master thread continues to the next level of the tree (continuous arrow in Figure 6.2). In the meantime, slave threads leave the gather phase of the barrier. In the next level, several master threads from different groups converge at the same group, and new master and slave roles are assigned as in the previous step. This process is repeated until the last level (tree root) is reached. At that point, it is guaranteed that all threads have arrived at the barrier. It is important to note that each thread can be a slave in at most one level. In every level before that one, the thread has a master role (except for slave threads at level 0, which have no previous level). No thread takes part in any level after the one where it plays a slave role.
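The role rule stated above can be made concrete with a small helper that finds the level where a thread plays its last role. The function name last_level_of is hypothetical, and the sketch assumes that the first thread of each group is its master and that masters are renumbered consecutively at the next level; the text does not fix either choice. It also assumes that the top level contains a single group.

```c
#include <assert.h>

/* Hypothetical helper (the master-selection rule is an assumption).
 * Returns the level where tid plays its last role in the gather phase:
 * the level at which it is a slave, or the root level for the root master.
 * *is_master_out is set to 1 only for the root master. */
int last_level_of(int tid, const int *group_size, int num_levels,
                  int *is_master_out)
{
    int idx = tid;   /* index of the thread among those executing this level */

    for (int level = 0; level < num_levels; level++) {
        assert(group_size[level] > 0);
        if (idx % group_size[level] != 0) {
            *is_master_out = 0;      /* not first in its group: slave here */
            return level;
        }
        idx /= group_size[level];    /* masters are renumbered 0, 1, 2, ... */
    }
    *is_master_out = 1;              /* master at every level: root master */
    return num_levels - 1;
}
```

For the 21-thread configuration of Figure 6.2 and under these assumptions, thread 0 is the root master, thread 12 is a slave at level 2, threads 6 and 18 are slaves at level 1, and every other thread is a slave at level 0.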

[...] shown in Listing 6.1 (functions master_group_reductions and push_reduction_values in lines 20 and 6). This aspect will be discussed in Section 6.3.

 1  void simd_barrier_release(int tid)
 2  {
 3      int level = get_deeper_level_of(tid);
 4
 5      // The current thread is a group slave thread
 6      if (!is_group_master(level, tid))
 7      {
 8          int my_level_idx = get_my_level_idx(level, tid);
 9
10          // Group slaves wait for their master to release them
11          while (tree_flags[level][my_level_idx] == THREAD_ARRIVAL_STATE)
12              OpenMP_barrier_duties();
13
14          level--;
15      }
16
17      // The current thread is a group master thread (level-0 slaves skip this)
18      if (level >= 0 && is_group_master(level, tid))
19      {
20          do {
21              int my_group_idx = get_my_group_idx(level, tid);
22              vtype group_release_state =
23                  get_init_group_state(level, my_group_idx);
24
25              // Master releases its group slaves using a SIMD memory store
26              // that resets all the counters of the group
27              vstore(&tree_flags[level][my_group_idx], group_release_state);
28
29              level--;
30          } while (level >= 0);
31      }
32  }

Listing 6.2: Generic scheme of the SIMD barrier release phase (pseudo-code)

6.2.3 Release Phase

The pseudo-code in Listing 6.2 illustrates the different steps of the release phase of the barrier. In this phase, the tree structure is traversed from root to leaves until all threads have been released. Each thread starts by playing its last role at the last level it visited in the gather phase (line 3). This means that at the beginning of the release phase there is only one master thread, located in the root node. The remaining threads are slave threads spread across the different levels of the tree.

These slave threads are all waiting on their respective counters for their respective master threads to release them (lines 11 and 12). While waiting, slave threads may perform some of the OpenMP barrier duties (OpenMP_barrier_duties).

The master thread that starts from the root node performs a vector store upon all the slave counters of that level (line 27). This vector store releases all those slave threads at the same time, setting all the group counters to their initial state. Function get_init_group_state (line 23) returns a vector value that contains the initial state of the counters of a group, given the level of the tree (level) and the index of the master thread of that group (my_group_idx). Afterwards, both master and slaves move back to their respective groups in the previous level and resume their former master role (dashed arrow in Figure 6.2). It is important to note that once a thread plays a master role at some level, it will continue to serve as master until level 0.

As in the gather phase, these steps are applied to each level until the first level is revisited and each master thread releases all its slaves. At that point, the released slave threads and their masters are allowed to leave the barrier and the synchronization is complete.

The biggest benefit of performing only one vector store to release all the slave threads of the group is that master threads avoid the time-consuming ping-pong effect. This effect would occur if master threads wrote each slave counter with a scalar store while slaves, in between, were requesting the same cache line to read their counters. Thus, for each scalar store, the master thread might have to reclaim exclusive ownership of the cache line if some slave thread had already been granted a copy of that line for reading.
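The difference between both release strategies can be sketched as follows. As before, this is a single-threaded illustration in portable C where a 64-bit store stands in for the vector store. The function names, the slot assignment (slot 0 for the master) and the state encodings are assumptions. Each function returns the number of stores it issues to the shared cache line, which is the quantity that drives the ping-pong effect.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GROUP_SLOTS 8            /* assumption: 8 one-byte counters per wide store */
#define THREAD_ARRIVAL_STATE 1   /* assumed encoding: 1 = arrived */
#define THREAD_INIT_STATE 0      /* assumed encoding: 0 = initial / released */

/* Scalar release: one store per slave counter. Between two of these stores,
 * a spinning slave may take a copy of the line, forcing the master to
 * reclaim exclusive ownership again. Returns the number of stores issued. */
int release_group_scalar(uint8_t *counters, int num_slaves)
{
    int stores = 0;

    assert(num_slaves >= 0 && num_slaves < GROUP_SLOTS);
    for (int i = 1; i <= num_slaves; i++) {
        counters[i] = THREAD_INIT_STATE;
        stores++;
    }
    return stores;
}

/* Wide release: a single store resets every counter of the group, so the
 * line needs to be owned exclusively only once. Always returns 1. */
int release_group_wide(uint8_t *counters)
{
    const uint64_t init_state = 0;   /* every slot back to THREAD_INIT_STATE */
    memcpy(counters, &init_state, sizeof init_state);   /* stand-in for vstore() */
    return 1;
}
```

Both functions leave the group in the same final state; for a group with five slaves the scalar version issues five stores to the contended line, whereas the wide version issues one.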
