Task-Based Runtime - Software/Hardware Co-Design to Improve Productivity, Portability, and Perf

The LTA runtime is a task-based work-stealing runtime inspired by TBB and is responsible for generating, partitioning, and distributing loop-tasks exposed by the LTA programming API across the cores available on the system. It employs child-stealing, Chase-Lev task queues [CL05], and occupancy-based victim selection [CM08]. Figure 4.4 illustrates how a work-stealing runtime recursively partitions loop-tasks into subtasks to facilitate load balancing.

The master core, GPP 0, begins executing the application binary while all other cores spin in the runtime’s work-stealing routine. When GPP 0 encounters a parallel_for, it generates the initial loop-task by instantiating a loop-task object with the corresponding function pointer, argument pointer, and start/end indices of the range ⟨0, 127⟩. Then GPP 0 partitions this loop-task by splitting the range in half, generating two loop-tasks with smaller ranges: ⟨0, 63⟩ and ⟨64, 127⟩. GPP 0 pushes ⟨64, 127⟩ onto its task queue in memory, and continues to recursively partition ⟨0, 63⟩ into ⟨0, 31⟩ and ⟨32, 63⟩, and then executes the former while pushing the latter onto its task queue. Meanwhile, the work-stealing routine on GPP 2 sees the tasks in GPP 0’s task queue and steals ⟨64, 127⟩. GPP 2 then partitions this task into ⟨64, 95⟩ and ⟨96, 127⟩, executing the former. GPP 1 steals/executes ⟨32, 63⟩ and GPP 3 steals/executes ⟨96, 127⟩. As in traditional work-stealing runtimes, tasks are always stolen in FIFO-fashion (i.e., from the top of the partition tree) to improve locality within a given core [FLR98].
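The walkthrough above can be traced with a small simulation. This is a minimal sketch and not the LTA runtime's implementation: a plain `collections.deque` stands in for the Chase-Lev task queue, the names `split`, `descend`, and `CORE_TASK_SIZE` are hypothetical, and a core task size of 32 iterations is assumed so that the trace matches the ranges in the example.

```python
from collections import deque

CORE_TASK_SIZE = 32  # assumed value, chosen so the trace matches the example


def split(task):
    """Split an inclusive range (start, end) into two halves."""
    start, end = task
    mid = (start + end) // 2
    return (start, mid), (mid + 1, end)


def descend(task, queue):
    """Partition `task` until it is a core task, pushing each right half.

    Returns the core task that the owning core executes itself.
    """
    while task[1] - task[0] + 1 > CORE_TASK_SIZE:
        left, right = split(task)
        queue.append(right)  # owner pushes the right half onto its own queue
        task = left          # and keeps partitioning the left half
    return task


# GPP 0 creates the initial loop-task and descends to its first core task.
gpp0_queue = deque()
gpp0_runs = descend((0, 127), gpp0_queue)

# GPP 2 steals the oldest entry (FIFO end), which is the largest remaining task.
gpp2_queue = deque()
gpp2_runs = descend(gpp0_queue.popleft(), gpp2_queue)

# GPP 1 and GPP 3 steal what is left in the two queues.
gpp1_runs = descend(gpp0_queue.popleft(), deque())
gpp3_runs = descend(gpp2_queue.popleft(), deque())

print(gpp0_runs, gpp1_runs, gpp2_runs, gpp3_runs)
# prints: (0, 31) (32, 63) (64, 95) (96, 127)
```

Stealing from the opposite end of the queue from the one the owner pushes to means a thief takes the oldest, and therefore largest, pending range, while the owner keeps working on adjacent iterations.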

Tasks are partitioned until the range is less than or equal to a configurable core task size, at which point the loop-task is called a core task. A core task is the smallest unit of load balancing across cores. The LTA runtime uses a default core task size of N/(k × P), where N is the size of the initial range, k is a scaling factor, and P is the number of cores. Increasing k generates more core tasks with smaller ranges (better load balancing, higher overhead), whereas decreasing k generates fewer core tasks with larger ranges (worse load balancing, lower overhead). Sensitivity studies on the LTA engine configurations and application kernels explored in this thesis indicate that k = 4 is a reasonable design point for avoiding starvation due to a lack of core tasks.
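To make the trade-off concrete, the sketch below evaluates the default core task size N/(k × P) and counts the core tasks produced by recursive halving. It is only an illustration: the function names are hypothetical, integer division is an assumption, and N = 128 with P = 4 is borrowed from the earlier example rather than from the sensitivity studies.

```python
def core_task_size(n, k, p):
    """Default core task size N / (k * P); integer division is an assumption."""
    return max(1, n // (k * p))


def count_core_tasks(n, size):
    """Count the leaves when a range of n iterations is halved until n <= size."""
    if n <= size:
        return 1
    left = n // 2
    return count_core_tasks(left, size) + count_core_tasks(n - left, size)


N, P = 128, 4
results = {}
for k in (1, 2, 4, 8):
    size = core_task_size(N, k, P)
    results[k] = (size, count_core_tasks(N, size))
    print(f"k={k}: core task size {size}, {results[k][1]} core tasks")
```

For this power-of-two range, doubling k halves the core task size and doubles the number of core tasks, which is the load-balancing versus overhead trade-off described above.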

One of the key differences between a traditional work-stealing runtime and the LTA runtime is how the runtimes actually execute core tasks. A traditional runtime simply uses an indirect function call (i.e., jalr) on the core task’s function pointer with the given argument pointer and start/end indices. However, the LTA runtime uses a special instruction that allows a core task to be executed on an LTA engine, if one is available. At a high level, LTA engines are able to accelerate loop-task execution by further partitioning core tasks into micro-tasks (µtasks) and mapping these µtasks onto micro-threads (µthreads).

The details of the ISA extensions and the mapping of µtasks to µthreads are discussed in the next section, but the important point with respect to the LTA runtime is that an LTA-aware task partitioning scheme allows for more efficient execution of loop-tasks. Figure 4.5 shows the difference between a default task partitioning scheme and an LTA-aware task partitioning scheme. Each number in the abstract task partitioning tree indicates the size of the range of the loop-task at that point in the tree. For example, the initial range in the example contains 42 loop iterations. Assuming that the core task size is four, the default scheme recursively splits the range in half until there is a total of 16 core tasks. Further assuming that there are four cores in the system, each with a four-µthread LTA engine, it will take four steps to execute all core tasks. The execution time of all core tasks is roughly the same because all µthreads must execute in lock-step in this example. Note that because core tasks have ranges that are less than the total number of µthreads available on the LTA engine, some µthreads will remain idle during core task execution.

On the other hand, the LTA-aware scheme splits the range such that at least one of the resulting sub-ranges is a multiple of the number of µthreads available in a single LTA engine. Specifically, the LTA-aware scheme sets the core task size to ⌈(N × t)/(k × P)⌉ × 2t, where t is the total number of µthreads available on the LTA engine. This partitioning generates a total of 11 core tasks, most of which have a range of four loop iterations. Executing all of these 11 core tasks will only take three steps. Note that while LTA-aware task partitioning can increase the maximum difference in size between any two core tasks compared to traditional task partitioning, LTA-aware task partitioning generally improves performance on systems with LTA engines by increasing µthread utilization. The runtime determines the total number of µthreads available on the LTA engine by reading a special coprocessor register on the GPP.

[Figure 4.5 panels: (a) Default Partitioning, a task partitioning tree ending in 16 core tasks; (b) LTA-Aware Partitioning, a task partitioning tree ending in 11 core tasks. Below each tree, an execution diagram plots time against µthreads µt0 to µt3 on Cores 0 to 3.]

Figure 4.5: LTA-Aware Task Partitioning – The abstract task partitioning tree and the corresponding execution diagram for the default scheme and the LTA-aware scheme are shown. Each number in the tree indicates the size of the range of the loop-task at that point in the tree. The initial range in both cases has 42 loop iterations. The default scheme partitions loop-tasks by splitting the range in half, whereas the LTA-aware scheme partitions loop-tasks by ensuring at least one sub-range is a multiple of the number of µthreads available on the LTA engine. The execution diagrams show how the generated core tasks are executed on a four-core system with four-µthread LTA engines. Each core task is executed across all available µthreads. A block represents one loop-iteration-worth of work and, in this example, takes roughly the same amount of time to execute.

Instruction             Description
xpfor r_s               Indirect function call that acts as a hint to accelerate core task execution on LTA engine. Same semantics as jalr.
mtuts r_d, r_s          Moves values from GPP register to LTA engine register for all µthreads.
xplock r_d, r_s, r_t    Attempts to obtain binary lock with special success/failure tokens. Same semantics as amo.xchg.
xpsync                  µthread barrier; forces synchronization of µthreads.

Table 4.1: LTA Instruction-Set Architecture Extensions
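The core task counts and step counts quoted for Figure 4.5 can be reproduced with a short sketch. Several things here are assumptions rather than the runtime's actual code: the function names are hypothetical, the core task size is fixed at four as in the example, and the LTA-aware split rounds the midpoint to the nearest multiple of t (ties rounded down), a rule chosen only because it yields the tree shown in the figure. The step count further assumes perfect load balance and one loop iteration per µthread per step.

```python
def partition_default(n, core_size):
    """Split a range of n iterations in half until each piece is <= core_size."""
    if n <= core_size:
        return [n]
    left = n // 2
    return partition_default(left, core_size) + partition_default(n - left, core_size)


def partition_lta_aware(n, core_size, t):
    """Split so that one sub-range is a multiple of t (requires core_size >= t)."""
    if n <= core_size:
        return [n]
    # Assumed rule: nearest multiple of t to n/2, ties rounded down, at least t.
    left = max(1, (n + t - 1) // (2 * t)) * t
    return (partition_lta_aware(left, core_size, t)
            + partition_lta_aware(n - left, core_size, t))


def steps(core_tasks, p, t):
    """Lock-step steps needed, assuming perfect balance across p cores."""
    slots = sum(-(-size // t) for size in core_tasks)  # ceil(size / t) per task
    return -(-slots // p)                              # ceil(slots / p)


N, P, T, CORE_SIZE = 42, 4, 4, 4
default_tasks = partition_default(N, CORE_SIZE)
aware_tasks = partition_lta_aware(N, CORE_SIZE, T)

for name, tasks in (("default", default_tasks), ("LTA-aware", aware_tasks)):
    n_steps = steps(tasks, P, T)
    utilization = N / (n_steps * P * T)  # fraction of microthread slots doing work
    print(f"{name}: {len(tasks)} core tasks, {n_steps} steps, "
          f"utilization {utilization:.1%}")
```

Under these assumptions the default scheme yields 16 core tasks in four steps and the LTA-aware scheme yields 11 core tasks in three steps, matching the figure. The derived utilization figures are a property of this sketch, not measurements from the thesis.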
