Summary and Conclusion - Scalability Engineering for Parallel Programs Using Empirical Performa

In this chapter, we propose a new software engineering approach for extreme-scale systems. With our scheme, we identify scalability issues in libraries that are thought to be scalable and pinpoint possible performance bugs and room for improvement. In contrast to previous ap- proaches, our technique only requires the performance engineer to have a vague (asymptotic) idea of the scalability, although the accuracy improves if more information is available (e.g., a performance litmus test or more precise expectations). We also supply a tool chain that au- tomates large parts of our four-step process and is ready for immediate use by performance engineers.

To achieve this, our tool chain utilizes empirical performance modeling to generate models that describe the behavior of functions in a target HPC library. To understand the scaling behavior, users can model execution time as the number of processes increases, and divergence models, derived from the generated models, reveal how severe the discrepancy between the ob- served and expected performance is. We demonstrate the effectiveness of our mechanism using

a number of use cases, namely MPI collective operations, the MAFIA code, OpenMP constructs, and parallel sorting algorithms.

Our first case study, however, is probably the most important library interface in HPC, that is, the MPI library. We chose to focus on it first because many commercially mature and well- tested implementations are available and clear performance expectations exist in the literature. We show how our approach enables MPI developers to spot scalability bugs early on, before commencing full-scale tests on the target supercomputer. For this, we used automated exper- iments on four different machines with five different MPI libraries, and our tool discovered a number of scalability issues that can be grouped into the following cases: (a) key collective functions on Juropa and Piz Daint display unexpected behavior; (b) the performance guideline

All reduce Reduce + Bcast is potentially violated on Piz Daint; (c) memory consumption on

Juropa limits the number of possible processes; (d) communicator duplication on Piz Daint consumes more memory than necessary; and (e) on Lichtenberg, Open MPI generally has a better scalability than Intel MPI. We conclude that our approach is a viable technique that can both point to limitations of the systems and provide MPI developers with important hints for improving the scalability of their implementations. We also expect that this will motivate the development of clear performance expectations for other parallel libraries, such as ScaLAPACK or the parallel BLAS.

4 Task Dependency Graphs

In this chapter, we discuss task dependency graphs (TDGs). It is based in part on a previous paper by Shudler et al. [58] and will provide the necessary background for the discussion on practical isoefficiency analysis in Chapter5. We focus on techniques to build TDGs and analyze them, as well as a method to replay the tasks in a TDG to simulate the execution of a task-based application.

4.1 Graph Abstraction for Task-based Applications

There is an important difference between tasks and threads: a task is a work package contain- ing a collection of instructions to be executed, whereas a thread is a light-weight process that executes given instructions in an independent context. A task-based code can be executed by one or more threads running on one or more cores. For the purpose of our discussion, we as- sume that each instance of a task is executed at any given time by only one thread and that the number of threads is equal to the number of cores (i.e., each thread runs on a separate core).

The execution of a task-based code can be represented as a directed acyclic graph (DAG)

G = (V, E), where the nodes V are tasks and the edges E represent dependencies between tasks.

Henceforth, we will refer to this DAG as a task dependency graph (TDG). An edge (v , u) ∈ E means that task u cannot begin execution before v finishes. This type of dependency is an

abstraction. It can either represent a data dependency [26] such as Read-After-Write (RAW), Write-After-Read (WAR), and Write-After-Write (WAW), or a control dependency. In Figure4.1, which shows a small TDG with execution times for each task, a task with a runtime of 9 cannot run before tasks with runtimes of 7 and 4 finish running. Except for some really simple cases, most of the interesting and useful problems have numerous dependencies in their algorithm flow, thereby producing non-trivial TDGs.

In the process of execution, the scheduler assigns tasks ready to be executed to threads. Depending on the scheduler, tasks might be stopped (i.e., preempted) and resumed later. In this situation, tasking environments, including OpenMP, distinguish between tied and untied tasks. A tied task can only be executed by the thread that started executing it, meaning that if this task is preempted, it can only be resumed by the thread that executed it before. An untied task, on the other hand, can be resumed by any available thread after preemption. Both types of tasks have advantages—tied tasks provide guarantees for private thread data (e.g.,#pragma omp threadprivatein OpenMP) [28], whereas untied tasks provide more flexibility to the scheduler.

In the next sections, we present important metrics and rules that characterize TDGs, discuss ways to construct and analyze TDGs, and eventually, present the task replay engine that allows us to replay a TDG and simulate the execution of a task-based program.

�

� � �

� � � �

�

Figure 4.1: Task dependency graph; each node contains the task time and the highlighted tasks form the critical path.

4.1.1 Metrics and rules

We characterize TDGs using a set of key metrics [102,105,106]. The work of the computation is the total execution time on one core, or the sum of all the task times. The depth of the computation (also known as span) is the total sum of all task times on the critical path, which is the longest path, in terms of task times, from any source node to any target node. A source node is a node with no incoming edges, and a target node is a node without any outgoing edges. Figure4.1 shows an example of a TDG in which work equals45 and depth equals 24. In the rest of the paper, we use the following notations:

• p, n: the number of threads and the input size, respectively

• T_p_{(n): the execution time of a computation with p threads and input size n}

• T₁(n): the work of the computation for input size n

• T_∞_{(n): the depth (or the critical path) of the computation for input size n}

• S_p_{(n) =} T1(n)

Tp(n): the speedup of the computation for specific p and n

• π(n) = T1(n)

T_∞(n): the average parallelism of the computation forn

From these TDG metrics we can derive important rules [102] and boundaries on speedup and efficiency.

Work rule

The execution cannot be faster than when we divide the whole work T₁_{(n) equally between all}

the available cores, the number of which, in our case, equals the number of threads:

T_p_{(n) ≥} T1(n)

p (4.1)

A direct consequence of the work rule is an upper bound on the speedup S_p(n) ≤ p. We ignore super-linear speedups for the sake of simplicity.

Depth rule

Since the critical path is a chain of dependencies, the tasks on this path must be executed one after the other, giving us another lower bound on the execution:

Tp≥ T∞ (4.2)

In this case, we can derive another upper bound on the speedup Sp(n) ≤ π(n). Combining both of the upper bounds we obtain an upper boundSp(n) ≤ min{p, π(n)}.

In document Scalability Engineering for Parallel Programs Using Empirical Performance Models (Page 84-89)