3.4 Our effort in tuning MPI/OmpSs parallelism
3.4.2 Searching for the optimal task decomposition
Another way to tune parallelism in MPI/OmpSs is to select a different task decomposition. Figure 3.6 shows that executing each MPI process on two cores already causes execution stalls. One way to reduce these stalls would be to refine the decomposition to avoid serializing the green tasks. Another would be to refine the red tasks to extract more computation that is independent of the message transfer. However, testing different decompositions is very hard: to test a decomposition, a programmer must write correct MPI/OmpSs code that implements it and then measure the obtained parallelism. We further illustrate this issue in the motivating example in Section 6.2.1.
This thesis studies techniques to quickly explore the potential parallelism in applications. We provide mechanisms that let the programmer easily evaluate the potential parallelism of any task decomposition. Furthermore, we describe an iterative trial-and-error approach to search for a task decomposition that exposes sufficient parallelism for a given target machine (Section 6.2). Finally, we explore the potential of automating this iterative approach by capturing programmers' experience in an expert system that can autonomously lead the search for efficient task decompositions (Section 6.3).
In our study of potential parallelism in applications, we designed Tareador (Section 4.4), a tool that helps port MPI applications to the MPI/OmpSs programming model. Tareador provides a simple interface for proposing a decomposition of the code into OmpSs tasks. Then, based on the proposed decomposition, Tareador dynamically identifies data dependencies among the annotated tasks and automatically estimates the potential OmpSs parallelization. Furthermore, Tareador gives additional hints on how to complete the port of the application to OmpSs. Using Tareador in an iterative, top-down, trial-and-error approach, a programmer can test various task decompositions and find one that exposes sufficient parallelism to use the target parallel machine efficiently. We also designed an autonomous driver that runs Tareador to automatically explore potential task decompositions of an application.
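The idea of estimating potential parallelism from a proposed decomposition can be illustrated with a minimal sketch. It is not Tareador's interface: the function name `potential_parallelism`, the task names and the input format are hypothetical, and the estimate assumes unlimited cores and zero scheduling overhead. Given task durations and the data dependencies among tasks, it computes the total work, the critical path, and their ratio as an upper bound on the achievable speedup.

```python
from collections import defaultdict


def potential_parallelism(durations, deps):
    """Return (total_work, critical_path, total_work / critical_path).

    durations: dict mapping a task name to its execution time.
    deps: iterable of (producer, consumer) pairs, i.e. data dependencies.
    Assumes unlimited cores and no runtime overhead (an idealized bound).
    """
    preds = defaultdict(list)
    succs = defaultdict(list)
    indegree = {task: 0 for task in durations}
    for producer, consumer in deps:
        preds[consumer].append(producer)
        succs[producer].append(consumer)
        indegree[consumer] += 1

    # Visit tasks in topological order (Kahn's algorithm) and record the
    # earliest time at which each task can finish.
    ready = [task for task, degree in indegree.items() if degree == 0]
    finish = {}
    while ready:
        task = ready.pop()
        start = max((finish[p] for p in preds[task]), default=0.0)
        finish[task] = start + durations[task]
        for successor in succs[task]:
            indegree[successor] -= 1
            if indegree[successor] == 0:
                ready.append(successor)

    if len(finish) != len(durations):
        raise ValueError("dependency graph contains a cycle")

    total = sum(durations.values())
    critical = max(finish.values(), default=0.0)
    return total, critical, (total / critical if critical else 0.0)
```

For a diamond-shaped graph in which task `a` feeds `b` and `c`, and both feed `d`, the two middle tasks can overlap, so the critical path is shorter than the total work. Refining a decomposition corresponds to changing `durations` and `deps` and re-evaluating the ratio.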
4 Infrastructure
The research in this thesis is based on trace-driven simulation using tools developed at BSC (Section 2.3). The initial trace-driven infrastructure includes mpitrace, Dimemas and Paraver. The mpitrace library is a dynamic library that intercepts MPI-related events and emits them to the trace. Paraver is a performance analysis tool that visualizes the traces obtained with mpitrace. Dimemas replays the traces obtained with mpitrace to simulate the execution under a changed configuration of the target parallel machine.
In conventional trace-driven simulation, a new execution feature is modeled entirely in the simulator. The tracing library instruments the execution of the code and inserts into the trace the events that describe that execution. The simulator then consumes the obtained trace, replaying the collected events and calculating new timestamps according to the specified configuration of the target parallel machine. Accounting for a new execution feature is therefore the simulator's job: it takes the new feature into account when calculating the timestamps of the simulated execution.
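The replay step can be sketched as a toy simulator. This is not Dimemas and does not use its trace format: the record tuples `("cpu", seconds)`, `("send", dest)` and `("recv", source)`, the function name `replay`, and the parameters `cpu_ratio` and `latency` are assumptions made for illustration, with sends modeled as non-blocking and the network as a fixed per-message latency.

```python
from collections import defaultdict, deque


def replay(traces, cpu_ratio=1.0, latency=0.0):
    """Replay per-process traces and return each process's final timestamp.

    traces: one list of records per MPI process (hypothetical format).
    cpu_ratio: relative CPU speed of the simulated target machine.
    latency: fixed transfer time of every message on the target machine.
    """
    clock = [0.0] * len(traces)
    position = [0] * len(traces)
    in_flight = defaultdict(deque)  # (source, dest) -> arrival timestamps

    progressed = True
    while progressed:
        progressed = False
        for rank, trace in enumerate(traces):
            while position[rank] < len(trace):
                kind, arg = trace[position[rank]]
                if kind == "cpu":
                    # Rescale the burst for the target CPU speed.
                    clock[rank] += arg / cpu_ratio
                elif kind == "send":
                    in_flight[(rank, arg)].append(clock[rank] + latency)
                elif kind == "recv":
                    arrivals = in_flight[(arg, rank)]
                    if not arrivals:
                        break  # the matching send has not been replayed yet
                    clock[rank] = max(clock[rank], arrivals.popleft())
                else:
                    raise ValueError(f"unknown record kind: {kind!r}")
                position[rank] += 1
                progressed = True

    if any(position[r] < len(t) for r, t in enumerate(traces)):
        raise RuntimeError("replay stalled on an unmatched receive")
    return clock
```

Changing `cpu_ratio` or `latency` changes only the computed timestamps, not the trace itself, which is the sense in which the conventional approach models a new feature in the simulator.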
However, the conventional methodology makes it impossible to simulate low-level architectural features on a highly parallel target machine. The problem arises because the simulator must be sequential. For instance, when simulating an MPI execution, any change in the local timestamps of one MPI process may change the order of messages on the network. Since the network is a shared resource, the simulator must explicitly synchronize all MPI processes on every network access, so it can hardly be parallelized. At the same time, simulating numerous MPI processes makes the simulation very large. The resulting simulation time therefore prevents simulating a low-level architectural feature on a large parallel system. To overcome this problem, we had to design a different simulation methodology, based on modifying the trace so that it models the new feature to be explored.
4.1 Simulation aware tracing
Using state-of-the-art tools for trace-based simulation, we develop a new simulation technique that models the simulated feature in an earlier part of the methodology: during the tracing of the application. As in conventional trace-driven simulation, our tracer instruments the execution and generates the trace of the real run. We call this trace the authentic trace. However, apart from the records needed for the authentic trace, the tracer also emits additional records related to the studied feature. Then, from the authentic trace and the additional records, we derive the artificial trace: the trace that the same application would produce if its execution included the studied feature. Finally, the unchanged replay simulator replays both traces, providing a comparison between the actually executed run and the potential run that includes the studied feature.
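The derivation of an artificial trace can be sketched as follows, under loudly stated assumptions. The record format (`("cpu", seconds)`, `("send", dest)`), the function name `derive_artificial`, and the shape of the additional records (one value per CPU burst, keyed by the burst's index in the trace) are all hypothetical. The `model` callback stands in for whatever feature is being studied.

```python
def derive_artificial(authentic, extra, model):
    """Build an artificial trace from an authentic trace.

    authentic: list of records for one MPI process (hypothetical format).
    extra: dict mapping the index of a CPU burst to the additional record
        the tracer emitted for it.
    model: function (duration, extra_record) -> duration the burst would
        have if the studied feature were present.
    The authentic trace is left untouched so both traces can be replayed.
    """
    artificial = []
    for index, (kind, arg) in enumerate(authentic):
        if kind == "cpu" and index in extra:
            # Only CPU bursts with additional records are re-timed.
            artificial.append((kind, model(arg, extra[index])))
        else:
            # Communication records are copied unchanged.
            artificial.append((kind, arg))
    return artificial
```

As a usage example, suppose each additional record is the time within a burst that the feature affects, and the feature is assumed to halve that time; then `model` is `lambda duration, affected: duration - affected / 2`. Because the transformation runs independently on each process's trace, it can run inside each traced MPI process.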
Our methodology allows simulating the effect of a low-level feature on a highly parallel machine. By making the tracing process responsible for modeling a new execution feature, our methodology shifts the major computational effort from the simulation phase to the tracing phase. Since each MPI process is traced concurrently, the feature-modeling computation is naturally parallelized across MPI processes. This parallelization allowed us to run very complex simulations: simulations that model very low-level features on very large-scale parallel machines. In order to instrument low-level properties of the execution (such as memory
PROCESS 0: CPU_burst (0.112) Broadcast (/* root */ 0) CPU_burst (0.121)
PROCESS 1: CPU_burst (0.111) Broadcast (/* root */ 0) CPU_burst (0.122)
PROCESS 2: CPU_burst (0.112) Broadcast (/* root */ 0) CPU_burst (0.132)
PROCESS 3: CPU_burst (0.121) Broadcast (/* root */ 0) CPU_burst (0.112)
(a) original trace – broadcast implemented with collective call
PROCESS 0: CPU_burst (0.112) Send ( -> 1 ) Send ( -> 2 ) Send ( -> 3 ) CPU_burst (0.121)
PROCESS 1: CPU_burst (0.111) Recv ( 0 -> ) CPU_burst (0.122)
PROCESS 2: CPU_burst (0.112) Recv ( 0 -> ) CPU_burst (0.132)
PROCESS 3: CPU_burst (0.121) Recv ( 0 -> ) CPU_burst (0.112)
(b) changed trace – broadcast implemented with point-to-point calls (one-to-many)
PROCESS 0: CPU_burst (0.112) Send ( -> 1 ) CPU_burst (0.121)
PROCESS 1: CPU_burst (0.111) Recv ( 0 -> ) Send ( -> 2 ) CPU_burst (0.122)
PROCESS 2: CPU_burst (0.112) Recv ( 1 -> ) Send ( -> 3 ) CPU_burst (0.132)
PROCESS 3: CPU_burst (0.121) Recv ( 2 -> ) CPU_burst (0.112)
(c) changed trace – broadcast implemented with point-to-point calls (cyclic)
PROCESS 0: CPU_burst (0.112) Send ( -> 1 ) Send ( -> 3 ) CPU_burst (0.121)
PROCESS 1: CPU_burst (0.111) Recv ( 0 -> ) Send ( -> 2 ) CPU_burst (0.122)
PROCESS 2: CPU_burst (0.112) Recv ( 1 -> ) CPU_burst (0.132)
PROCESS 3: CPU_burst (0.121) Recv ( 0 -> ) CPU_burst (0.112)
(d) changed trace – broadcast implemented with point-to-point calls (logarithmic)
Figure 4.1: Simulating different implementations of broadcast
accesses), we often designed tracers based on binary translation tools (Valgrind tools). These tracers can instrument the execution at the level of a single instruction. The effect of the new feature on every single instruction is calculated and incorporated into the artificial trace of each MPI process. Finally, the simulator replays the trace, propagating the effect of the new feature across the whole MPI execution.
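The trace rewriting shown in Figure 4.1 can be sketched in a few lines. The record tuples (`("cpu", seconds)`, `("bcast", root)`, `("send", dest)`, `("recv", source)`) and all function names are hypothetical, and only broadcasts rooted at process 0 are handled. The pairing rule in `logarithmic` is an assumption chosen because it reproduces panel (d) for four processes; the text does not state the general rule.

```python
def one_to_many(n):
    """Root 0 sends directly to every other process (panel b)."""
    ops = [[] for _ in range(n)]
    for dest in range(1, n):
        ops[0].append(("send", dest))
        ops[dest].append(("recv", 0))
    return ops


def cyclic(n):
    """Each process forwards the message to its successor (panel c)."""
    ops = [[] for _ in range(n)]
    for source in range(n - 1):
        ops[source].append(("send", source + 1))
        ops[source + 1].append(("recv", source))
    return ops


def logarithmic(n):
    """The set of processes holding the message doubles every round (panel d)."""
    ops = [[] for _ in range(n)]
    holders = 1
    while holders < n:
        for source in range(holders):
            dest = 2 * holders - 1 - source  # assumed pairing rule
            if dest < n:
                ops[source].append(("send", dest))
                ops[dest].append(("recv", source))
        holders *= 2
    return ops


SCHEMES = {"one-to-many": one_to_many, "cyclic": cyclic, "logarithmic": logarithmic}


def expand_broadcast(traces, scheme):
    """Replace every ("bcast", 0) record with point-to-point records."""
    ops = SCHEMES[scheme](len(traces))
    rewritten = []
    for rank, trace in enumerate(traces):
        out = []
        for record in trace:
            if record == ("bcast", 0):
                out.extend(ops[rank])
            else:
                out.append(record)
        rewritten.append(out)
    return rewritten
```

Applied to the four-process trace of panel (a), each scheme yields the per-process record sequences of panels (b), (c) and (d), while the CPU bursts stay unchanged.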