6.3 Automatic exploration of potential parallelism

6.3.1 The search algorithm

We propose an iterative algorithm to explore different task decomposition strategies and estimate their performance. The inputs are the original unmodified sequential code and the number of cores in the target platform. With that information, the algorithm (Figure 6.21) performs the following steps:

1. Start from the most coarse-grain task decomposition, i.e. the one that considers the whole main function as a single task.

2. Perform an estimation of the potential parallelism of the current task decomposition (the speedup with respect to the sequential execution).

3. If the current task decomposition is satisfactory (heuristic 2), report it as final and finish.

4. Else, identify the parallelization bottleneck (heuristic 1) in the current task decomposition, i.e. the task that should be further decomposed into finer-grain tasks.

5. Refine the current task decomposition in order to avoid the identified bottleneck. Go to step 2.

[Figure 6.21: Flowchart of the search algorithm. Starting from the sequential code, choose the most coarse-grain task decomposition (whole main in one task) and identify the potential parallelism of the selected decomposition. If the parallelism is not sufficient, identify the parallelization bottleneck and refine the decomposition to get more parallelism; if it is sufficient, report the task decomposition that provides sufficient parallelism.]
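The control flow of steps 1 to 5 can be sketched in a few lines of Python. This is only an illustrative skeleton, not the implementation used in this work: the function `search`, its four callback parameters, the representation of a decomposition as a set of function names, and the toy speedup numbers are all assumptions introduced for the example.

```python
from typing import Callable, List, Optional, Tuple

# Assumption: a decomposition is the set of function names currently treated as tasks.
Decomposition = frozenset


def search(
    initial: Decomposition,
    estimate_speedup: Callable[[Decomposition], float],
    is_satisfactory: Callable[[float], bool],
    find_bottleneck: Callable[[Decomposition], Optional[str]],
    refine: Callable[[Decomposition, str], Decomposition],
) -> Tuple[Decomposition, float, List[Decomposition]]:
    """Repeat steps 2-5 until a decomposition is accepted or nothing can be refined."""
    current = initial                           # step 1: most coarse-grain decomposition
    history = [current]
    while True:
        speedup = estimate_speedup(current)     # step 2: estimate potential parallelism
        if is_satisfactory(speedup):            # step 3: heuristic 2
            return current, speedup, history
        bottleneck = find_bottleneck(current)   # step 4: heuristic 1
        if bottleneck is None:                  # no task left to break
            return current, speedup, history
        current = refine(current, bottleneck)   # step 5: refine, then back to step 2
        history.append(current)


# Toy stand-ins with made-up numbers, for illustration only.
CHILDREN = {"main": ["stage1", "stage2"], "stage1": [], "stage2": []}
SPEEDUP = {frozenset({"main"}): 1.0, frozenset({"stage1", "stage2"}): 3.2}


def toy_refine(decomposition: Decomposition, task: str) -> Decomposition:
    """Replace the chosen task by its direct children."""
    return frozenset((decomposition - {task}) | set(CHILDREN[task]))


result, speedup, history = search(
    initial=frozenset({"main"}),
    estimate_speedup=lambda d: SPEEDUP.get(d, 1.0),
    is_satisfactory=lambda s: s >= 3.0,
    find_bottleneck=lambda d: sorted(d)[0] if d else None,
    refine=toy_refine,
)
```

Passing the two heuristics and the estimator as callbacks keeps the loop itself independent of how the speedup is estimated and how the bottleneck is chosen.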

In the following sections, we further describe the design choices behind these two heuristics and the three metrics they use. Before that, we must define more precise terminology. First, we need to make a clear distinction between a task type and a task instance. If function compute is encapsulated into a task, we say that compute is a task type, or just a task. Conversely, we call each instantiation of compute a task instance, or just an instance. Then, if we can define some metric for each task instance, we can derive a collective metric for the whole task type.

Second, we will often use the term breaking a task to refer to the process of transforming one task into more fine-grain tasks. For example, Figure 6.22 illustrates the iterative task decomposition process. The process starts with the most coarse-grain decomposition (D1), in which function A is the only task. By breaking task A, we obtain decomposition D2, in which A is not a task anymore and its direct children (B and C) become tasks instead. If in the next step we break task B, since B contains no children tasks, B will be serialized (i.e. B is not a task anymore and it becomes a part of the sequential execution). Similarly, the next refinement serializes task C and leads to the starting sequential code. At this point, no further refinement is possible, so the iterative process naturally stops.

[Figure 6.22: Iterative task decomposition. Breaking A turns D1 into D2, breaking B turns D2 into D3, and breaking C turns D3 into D4, the sequential code. The figure distinguishes sequential code, potential tasks, and tasks.]
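The sequence D1 to D4 can be reproduced with a small sketch. The `CHILDREN` table and the `break_task` helper are hypothetical names introduced here; the sketch assumes a decomposition is represented as the set of functions that are currently tasks.

```python
from typing import Dict, FrozenSet, List

# Hypothetical call tree for the example: A calls B and C, which contain no potential tasks.
CHILDREN: Dict[str, List[str]] = {"A": ["B", "C"], "B": [], "C": []}


def break_task(tasks: FrozenSet[str], task: str) -> FrozenSet[str]:
    """Replace `task` by its direct children; a task without children is serialized."""
    if task not in tasks:
        raise ValueError(f"{task} is not a task in this decomposition")
    return frozenset((tasks - {task}) | set(CHILDREN[task]))


d1 = frozenset({"A"})       # D1: A is the only task
d2 = break_task(d1, "A")    # D2: B and C become tasks
d3 = break_task(d2, "B")    # D3: B has no children, so it is serialized
d4 = break_task(d3, "C")    # D4: no tasks left, i.e. the sequential code
```

An empty set marks the point at which no further refinement is possible.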

Heuristic 1: which task to break

In the manual search for a satisfactory decomposition, the programmer decides which task is the parallelization bottleneck. Our goal is to formalize this programmer experience into simple metrics that can guide an autonomous algorithm for exploring potential task decompositions.

Metric 1: task length cost

A task type that has long instances is a potential parallelization bottleneck. Thus, based on the duration of instances, we define a metric called length cost of a task type. Length cost of some task type is proportional to the duration of the longest instance of that task. Therefore, if task i has task instances whose lengths are in the array Ti, the length cost of task i is:

$$l_i = \max_{t \in T_i} t \tag{6.1}$$

Furthermore, we define a normalized length cost of task i as:

$$l_i(p) = \frac{(l_i)^p}{\sum_{j=1}^{N} (l_j)^p} \tag{6.2}$$

where a control parameter p is used to tune the weight of this metric in the overall cost function.
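As a minimal sketch of Equations 6.1 and 6.2, the following Python computes the length cost of each task type and normalizes it. The function names, the task names `compute` and `reduce`, and the instance durations are made up for illustration.

```python
from typing import Dict, List


def length_cost(durations: List[float]) -> float:
    """Equation 6.1: the duration of the longest instance of a task type."""
    return max(durations)


def normalize(costs: Dict[str, float], p: float) -> Dict[str, float]:
    """Equation 6.2: raise each cost to the power p and divide by the sum over all tasks."""
    powered = {task: cost ** p for task, cost in costs.items()}
    total = sum(powered.values())
    return {task: value / total for task, value in powered.items()}


# Made-up instance durations (arbitrary time units) for two task types.
instances = {"compute": [4.0, 10.0, 6.0], "reduce": [5.0, 2.0]}

raw = {task: length_cost(ts) for task, ts in instances.items()}
normalized = normalize(raw, p=1.0)
```

With these numbers the raw length costs are 10 and 5, so for p = 1 the normalized costs are 2/3 and 1/3.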

Metric 2: task dependency cost

A task type that causes many data dependencies is another potential parallelization bottleneck. Thus, based on the number of dependencies, we define a metric called dependency cost of a task type. Dependency cost of some task is proportional to the maximal number of dependencies caused by some instance of that task. Therefore, if task i has instances whose numbers of dependencies are in the array Di, the dependency cost of task i is:

$$d_i = \max_{z \in D_i} z \tag{6.3}$$

Furthermore, using a control parameter p, we define the normalized dependency cost of task i as:

$$d_i(p) = \frac{(d_i)^p}{\sum_{j=1}^{N} (d_j)^p} \tag{6.4}$$

Metric 3: task concurrency cost

A task type that has low concurrency is another potential parallelization bottleneck. Concurrency of some instance is determined by the number of other instances that execute in parallel with that instance. Thus, we define concurrency cost of some task to be inversely proportional to the number of instances that run concurrently with that task (number of instances that execute on all cores). Therefore, if task i has task instances which run for time Ti, j while there are j instances running concurrently, the concurrency cost of task i is:

$$c_i = \sum_{j} \frac{T_{i,j}}{j} \tag{6.5}$$
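A short sketch shows how Equation 6.5 penalizes time spent at low concurrency. It assumes that the profile of a task type is available as a mapping from the concurrency level j to the time spent at that level, and that j counts the running instance itself (so j ≥ 1). The function name and both profiles are made up for illustration.

```python
from typing import Dict


def concurrency_cost(time_at_concurrency: Dict[int, float]) -> float:
    """Equation 6.5: sum over j of T[i, j] / j for one task type i."""
    return sum(t / j for j, t in time_at_concurrency.items())


# Made-up profiles: both task types run for 10 time units in total.
serial_like = {1: 8.0, 2: 2.0}       # mostly runs with no other instance
well_overlapped = {4: 8.0, 2: 2.0}   # mostly runs with four instances in flight

cost_serial = concurrency_cost(serial_like)        # 8/1 + 2/2
cost_overlap = concurrency_cost(well_overlapped)   # 8/4 + 2/2
```

The task that mostly runs alone gets the higher cost, which is what makes it a candidate bottleneck.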

Again, using a control parameter p, we define the normalized concurrency cost of task i as:

$$c_i(p) = \frac{(c_i)^p}{\sum_{j=1}^{N} (c_j)^p} \tag{6.6}$$

Overall cost

The cost function for task type i is defined as the sum of the three previous normalized metrics:

$$l_i(p_1) + d_i(p_2) + c_i(p_3) \tag{6.7}$$
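The following sketch combines the three normalized metrics as in Equation 6.7. It assumes that the task with the highest overall cost is the one reported as the bottleneck; the function names and all raw costs are made up for illustration.

```python
from typing import Dict


def normalize(costs: Dict[str, float], p: float) -> Dict[str, float]:
    """Normalization shared by Equations 6.2, 6.4 and 6.6."""
    powered = {task: cost ** p for task, cost in costs.items()}
    total = sum(powered.values())
    return {task: value / total for task, value in powered.items()}


def overall_cost(
    length: Dict[str, float],
    dependency: Dict[str, float],
    concurrency: Dict[str, float],
    p1: float = 1.0,
    p2: float = 1.0,
    p3: float = 1.0,
) -> Dict[str, float]:
    """Equation 6.7: sum of the three normalized costs for every task type."""
    l = normalize(length, p1)
    d = normalize(dependency, p2)
    c = normalize(concurrency, p3)
    return {task: l[task] + d[task] + c[task] for task in length}


# Made-up raw costs for three task types.
length = {"A": 10.0, "B": 5.0, "C": 5.0}
dependency = {"A": 2.0, "B": 6.0, "C": 2.0}
concurrency = {"A": 9.0, "B": 3.0, "C": 6.0}

costs = overall_cost(length, dependency, concurrency)
# Assumption: the task with the highest overall cost is the one to break next.
bottleneck = max(costs, key=costs.get)
```

Because each normalized metric sums to 1 over all tasks, the overall costs always sum to 3, whatever the values of p1, p2 and p3.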

Control parameter p

In all the defined metrics, the normalized cost is calculated using a control parameter p. For each metric separately, the sum of the normalized costs across all tasks is equal to 1. The parameter p additionally controls how far apart the normalized costs of different tasks are. For instance, let us assume that the application consists of two tasks, A and B, and that A is twice as long as B. If the control parameter p is equal to 1, task A has a length cost of 0.67, while task B has a length cost of 0.33. However, if the selected control parameter p is equal to 2, task A has a length cost of 0.8, while task B has a length cost of 0.2.

Therefore, the control parameter of some metric determines the impact of that metric on the overall cost. For instance, if we select the control parameter for length cost to be 0, all tasks will have the same normalized length cost, independent of the duration of instances of these tasks. Thus, the length of task instances would have no impact on the overall cost. On the other hand, if we select the control parameter for length cost to be infinite, the task with the longest instance will have the normalized length cost of 1, while all other tasks will have the normalized length cost of 0. In this case, the length of task instances would have a huge impact on the overall cost.
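The example of tasks A and B can also be written in closed form. Substituting l_A = 2 l_B into Equation 6.2 gives a worked illustration of how p spreads the two normalized costs:

$$l_A(p) = \frac{(2 l_B)^p}{(2 l_B)^p + (l_B)^p} = \frac{2^p}{2^p + 1}, \qquad l_B(p) = \frac{1}{2^p + 1}$$

$$p = 0:\ \left(\tfrac{1}{2}, \tfrac{1}{2}\right), \qquad p = 1:\ \left(\tfrac{2}{3}, \tfrac{1}{3}\right), \qquad p = 2:\ \left(\tfrac{4}{5}, \tfrac{1}{5}\right), \qquad p \to \infty:\ (1, 0)$$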

In document Evaluating techniques for parallelization tuning in MPI, OmpSs and MPI/OmpSs (Page 147-151)