First, the three chosen models for shared memory programming on the Intel Xeon Phi are described in more detail.
4.1. Three Popular Task-based Parallel Models 42
4.1.1 OpenMP
We have introduced OpenMP as the de-facto standard for shared memory programming. OpenMP provides a set of compiler directives, tools, and environment variables that simplify parallel application development for shared memory architectures with multiple cores. However, application developers need to be aware of the OpenMP memory model, which distinguishes between private and shared data [19]. A notable feature of OpenMP is that it is under active development, and new features are proposed frequently in order to make it more flexible and adaptable to new architectures. One recent example is the support for task dependences, available since the release of OpenMP 4.0.
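As an illustration of the task dependence support mentioned above, the following is a minimal sketch using the OpenMP 4.0 `depend` clause. The function name `produce_then_consume` and the values are hypothetical and chosen only for this example. If the code is compiled without OpenMP support, the pragmas are ignored and it runs sequentially with the same result.

```cpp
// Minimal sketch of OpenMP 4.0 task dependences (illustrative only).
// produce_then_consume is a hypothetical name, not taken from the text.
int produce_then_consume() {
    int x = 0;
    int y = 0;
    #pragma omp parallel
    #pragma omp single
    {
        // Producer task: declares that it writes x.
        #pragma omp task depend(out: x) shared(x)
        {
            x = 21;
        }
        // Consumer task: declares that it reads x, so the runtime
        // orders it after the producer task above.
        #pragma omp task depend(in: x) shared(x, y)
        {
            y = 2 * x;
        }
        // Both tasks are guaranteed to have completed at the implicit
        // barrier that ends the single and parallel regions.
    }
    return y;
}
```

Without the `depend` clauses the two tasks could run in either order; with them, the consumer always observes the value written by the producer.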
Writing multithreaded programs can become quite complex unless certain rules are defined. OpenMP attempts to ease the process by supporting the fork-join model [137]. The term fork-join has to be used with care, as it can be applied at different levels and can carry different meanings. For example, Chapman [19] uses this concept for threads in the earlier versions of OpenMP (before OpenMP 3.0¹) as follows: the starting part of the program is executed by a single thread; whenever a parallel construct is encountered, a team of threads is created (fork); members of the team execute the code collaboratively, and at the end of the construct², all team members except the master thread terminate (join). The concept of fork-join is used in [126] as a parallel control pattern, where the control flow forks into multiple parallel flows that join later. In this context, as well as throughout this dissertation, the focus is on dynamic task creation and execution in newer versions of OpenMP, which also follow the fork-join pattern using the task and taskwait constructs. It is also important to note that parallelism still happens only inside a parallel region.
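The task-level fork-join pattern described above can be sketched with the OpenMP `task` and `taskwait` constructs. This is a minimal illustration under assumptions, not code from the cited works: the function names `fib_omp` and `parallel_fib` are hypothetical, and a practical version would add a cutoff below which no tasks are created. Compiled without OpenMP support, the pragmas are ignored and the code runs sequentially.

```cpp
// Minimal sketch of task-based fork-join in OpenMP (illustrative only).
// fib_omp and parallel_fib are hypothetical names.
long fib_omp(int n) {
    if (n < 2) return n;
    long a = 0;
    long b = 0;
    // Fork: each task may be executed by any thread of the team.
    #pragma omp task shared(a)
    a = fib_omp(n - 1);
    #pragma omp task shared(b)
    b = fib_omp(n - 2);
    // Join: wait for the child tasks created by this task.
    #pragma omp taskwait
    return a + b;
}

long parallel_fib(int n) {
    long result = 0;
    // Tasks only run in parallel inside a parallel region; a single
    // thread creates the root of the task tree.
    #pragma omp parallel
    #pragma omp single
    result = fib_omp(n);
    return result;
}
```

Calling `fib_omp` outside a parallel region is still valid, but the tasks are then executed by the single calling thread, which reflects the remark that parallelism happens only in a parallel region.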
The Intel OpenMP runtime library (as opposed to the GNU implementation) allocates a task list per thread for every OpenMP team. Whenever a thread creates a task that cannot be executed immediately, that task is placed into the thread’s deque (double-ended queue). A random stealing strategy balances the load [138].
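To make the idea of random stealing concrete, the following sketch shows one possible way for an idle thread to choose a victim. It is an assumption made for illustration only: the text states that stealing is random but not how the victim is drawn, so the uniform choice and the helper name `pick_victim` are hypothetical and do not describe the actual Intel runtime code.

```cpp
#include <cstddef>
#include <random>

// Hypothetical helper: choose a victim thread uniformly at random among
// all team members except the thief itself. Requires n_workers >= 2.
std::size_t pick_victim(std::size_t self, std::size_t n_workers,
                        std::mt19937& rng) {
    // Draw from the n_workers - 1 other threads ...
    std::uniform_int_distribution<std::size_t> dist(0, n_workers - 2);
    std::size_t v = dist(rng);
    // ... and skip over the thief's own index.
    return v >= self ? v + 1 : v;
}
```

An idle thread would then try to take a task from the deque of the chosen victim and pick another victim if that deque is empty.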
4.1.2 Cilk Plus
Cilk Plus has evolved from Cilk [139], and is an extension to C/C++ with a few additional keywords and an array section notation. It provides very simple but powerful ways of specifying parallelism, as it is integrated into the compiler. It features a fork-join pattern to support irregular patterns and nesting. Cilk Plus provides the cilk_spawn and cilk_sync keywords to spawn and synchronise tasks; the cilk_for loop is a parallel replacement for sequential loops in C/C++. Cilk Plus has a syntactic extension to express fork-join: instead of calling a function, the caller spawns it with cilk_spawn, which means the caller can continue its execution without waiting for the callee to return (fork); cilk_sync waits for all spawned calls in the current function to complete (join).

¹ OpenMP is said to be thread-based before the concept of tasks was added, where all threads have access to the shared memory and worksharing directives (single, for, sections, etc.) are used to distribute the work between them [3].
The tasks are executed within a work-stealing framework. Every worker thread has a deque of tasks. The worker treats its deque as a stack, pushing and popping tasks at the back. Thieves steal from the front of deques [45]. In the Cilk Plus work-stealing framework, thieves steal continuations, meaning that the spawned task is immediately started by the spawning thread, and the continuation is left available for stealing. The Cilk Plus scheduling policy provides close to optimal load balancing [117]. The Intel implementation of Cilk Plus ensures that running a program on one processor produces the same order of operations as the equivalent sequential program [126].
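The spawn and sync keywords can be illustrated with a recursive Fibonacci function. This is a minimal sketch under assumptions: the function name `fib_cilk` is hypothetical, and the preprocessor fallback assumes that the compiler defines the `__cilk` macro when Cilk Plus is enabled. Where Cilk Plus is not available, the keywords are defined away, which yields the equivalent sequential program (often called the serial elision) and mirrors the one-processor ordering property mentioned above.

```cpp
// Minimal sketch of fork-join with Cilk Plus keywords (illustrative only).
#ifdef __cilk
#include <cilk/cilk.h>
#else
// Assumed fallback: without Cilk Plus, drop the keywords so that the
// code compiles as the equivalent sequential program.
#define cilk_spawn
#define cilk_sync
#endif

// fib_cilk is a hypothetical name used only for this example.
long fib_cilk(int n) {
    if (n < 2) return n;
    // Fork: the spawning worker starts the child immediately; the rest of
    // this function (the continuation) becomes available for stealing.
    long a = cilk_spawn fib_cilk(n - 1);
    long b = fib_cilk(n - 2);
    // Join: wait for all calls spawned in this function.
    cilk_sync;
    return a + b;
}
```

Because the spawning worker runs the child first, a single worker executes the calls in exactly the order of the sequential version.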
4.1.3 Threading Building Blocks (TBB)
Intel Threading Building Blocks (TBB) is another well-known approach for expressing parallelism [43]. TBB is an object-oriented C++ template library that contains data structures and algorithms to be used in parallel programs.
Parallelism can be expressed in terms of tasks, represented as instances of the task class, or through concurrent container classes, which allow multiple threads to access the items of a container concurrently.
TBB supports both regular and irregular parallelism, and has direct support for various parallel patterns, such as task graphs, map, pipelines, etc. TBB abstracts the low-level thread interface. However, conversion of legacy code to TBB requires restructuring certain parts of the program to fit the TBB templates.
TBB supports the fork-join pattern through a library rather than through compiler extensions. Similar to Cilk Plus, a common thread pool is shared by all tasks and load balancing is achieved by work-stealing. Each worker thread in TBB has a deque of tasks. Newly spawned tasks are put at the back of the deque, and each worker thread takes tasks from the back of its own deque. If there is no task in the local deque, the worker steals tasks from the front of the victims’ deques [140]. In contrast to Cilk Plus, however, thieves in TBB steal children: the worker thread spawns a new task, leaves it in its deque, and carries on executing the continuation. For example, if it is executing a loop, it proceeds to the next iteration, spawns more tasks, and likewise leaves them available for stealing. Furthermore, when the worker picks a task to run, it picks the most recently spawned one, as that is the task last pushed at the back of its deque.
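The deque discipline described above (the owner works at the back, thieves take from the front) can be sketched as a small single-threaded model. This is a simplified illustration and not the TBB scheduler: the type `WorkerDeque`, its member names, and the use of integer task identifiers are hypothetical, and a real runtime needs synchronisation between the owner and the thieves.

```cpp
#include <deque>

// Hypothetical single-threaded model of one worker's task deque.
struct WorkerDeque {
    std::deque<int> tasks;  // task identifiers, oldest at the front

    // The owner spawns a child task and leaves it at the back.
    void spawn(int id) { tasks.push_back(id); }

    // The owner takes the most recently spawned task (LIFO).
    bool take_local(int& id) {
        if (tasks.empty()) return false;
        id = tasks.back();
        tasks.pop_back();
        return true;
    }

    // A thief takes the oldest task from the front (FIFO).
    bool steal(int& id) {
        if (tasks.empty()) return false;
        id = tasks.front();
        tasks.pop_front();
        return true;
    }
};
```

If a worker spawns tasks 1, 2 and 3 while iterating over a loop, it will itself run task 3 next, whereas a thief will receive task 1, the oldest one.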