Our effort in tuning MPI parallelism - Evaluating techniques for parallelization tuning in MPI

Our effort in tuning MPI parallelism targets communication and synchronization delays. Our goal is to mitigate these stalls by introducing an automatic technique that achieves the maximal potential overlap and requires no refactoring of the targeted application.

    for ( ... ) {
        wait_for_previous_MPI_Sendrecv( 1/3 );
        work( 1/3 );
        nonblocking_MPI_Sendrecv( 1/3 );

        wait_for_previous_MPI_Sendrecv( 2/3 );
        work( 2/3 );
        nonblocking_MPI_Sendrecv( 2/3 );

        wait_for_previous_MPI_Sendrecv( 3/3 );
        work( 3/3 );
        nonblocking_MPI_Sendrecv( 3/3 );
    }

(a) code

(b) execution (timeline spanning ITERATION 0 and ITERATION 1)

Figure 3.3: Overlap with chunks

3.2.1 Automatic overlap

State-of-the-art overlapping techniques achieve limited overlap and require significant code restructuring. The by-hand optimization presented above provides limited overlap because it overlaps transfer time only with transfer-independent work. Moreover, it demands significant refactoring, which seriously harms programming productivity.

Our technique of automatic overlap

In this thesis, we design a technique that maximizes the potential overlap in an application. The technique consists of four mechanisms:

• Message chunking: Each original MPI message is partitioned into independent chunks consisting of one or more data elements.

• Advancing sends: Each chunk is sent as soon as it is produced.

• Double buffering: Two different buffers are used to separate the chunks being consumed in the current iteration from the incoming chunks that will be consumed in the next iteration.

• Postponing receptions: The wait for each chunk is postponed until the moment the chunk is actually needed for consumption.
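The bookkeeping behind message chunking and double buffering can be sketched in a few lines of C. This is a minimal illustration, not the thesis implementation: it assumes contiguous chunks of near-equal size and two buffers selected by iteration parity, and the names `chunk_range`, `chunk_of`, `consume_buffer` and `receive_buffer` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open element range [begin, end) covered by one chunk. */
typedef struct {
    size_t begin;
    size_t end;
} chunk_range;

/* Message chunking (assumed layout): split a message of n elements into
 * k contiguous, near-equal chunks and return the range of chunk c.
 * The first n % k chunks get one extra element, so the chunks cover the
 * whole message exactly once. */
static chunk_range chunk_of(size_t n, size_t k, size_t c)
{
    assert(k > 0 && c < k);
    size_t base = n / k;
    size_t extra = n % k;
    chunk_range r;
    r.begin = c * base + (c < extra ? c : extra);
    r.end = r.begin + base + (c < extra ? 1 : 0);
    return r;
}

/* Double buffering (assumed scheme): index of the buffer whose chunks
 * are consumed in iteration it. */
static int consume_buffer(unsigned it)
{
    return (int)(it % 2u);
}

/* Index of the buffer that receives the chunks for iteration it + 1. */
static int receive_buffer(unsigned it)
{
    return 1 - consume_buffer(it);
}
```

With such helpers, a chunk's send can be issued as soon as the elements in `chunk_of(n, k, c)` are produced, while incoming chunks land in `receive_buffer(it)` and never disturb the data still being consumed from `consume_buffer(it)`.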

Compared to the overlapping technique illustrated in Figure 3.2, our technique additionally breaks the original message into chunks and then overlaps the communication time of each chunk independently. Figure 3.3 illustrates overlap with chunks in the case when the original message is broken into three chunks. The idea is that, ideally, each third of the computation in some iteration produces one third of the message to be sent in that iteration. Similarly, each third of the received message in some iteration enables execution of one third of the computation in the next iteration. Then, the first third of the message can be sent after 33% of the first iteration, and it must arrive before the start of the second iteration. Thus, the transfer of this first chunk can be overlapped with the remaining 66% of the computation in the first iteration. Therefore, our technique has no need to extract independent work, that is, work that neither produces nor consumes the message. Instead, it overlaps the transfer of one part of the original message with the production/consumption of the rest of that message.
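The 66% figure can be checked with a short calculation. As a sketch under the idealized assumptions of this paragraph (uniform production and consumption, equal iteration times), let $T$ be the computation time of one iteration, $k$ the number of chunks, and $c = 0, \dots, k-1$ the chunk index, with time measured from the start of the sending iteration. These symbols are introduced here only for illustration.

```latex
\begin{align*}
t^{\text{produced}}_c &= \frac{c+1}{k}\,T, \qquad
t^{\text{needed}}_c = T + \frac{c}{k}\,T, \\
W_c &= t^{\text{needed}}_c - t^{\text{produced}}_c = \frac{k-1}{k}\,T .
\end{align*}
```

For $k = 3$, every chunk has a window of $\tfrac{2}{3}T$, about 66% of an iteration, in which its transfer can be hidden behind computation.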

It is also important to note that the patterns by which each MPI process locally computes on transferred data can seriously limit the potential for overlap. The overlap of chunks comes from the ability of the execution to advance partial sends and postpone partial receptions. In the presented example, the first chunk is produced after 33% of the iteration, allowing its transfer to be overlapped with the remaining 66% of the computation in that iteration. However, if that first chunk were produced only at 80% of the iteration, its transfer could overlap with only 20% of the computation. Thus, the production/consumption patterns can seriously limit the potential for advancing/postponing message chunks. Therefore, our study of automatic overlap pays special attention to the internal computation patterns of applications.
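The effect of the production point on the achievable overlap can be expressed as a first-order model. The following is a minimal sketch, assuming that consecutive iterations take the same time and that a chunk's transfer can start the moment the chunk is produced. The function and parameter names (`overlap_window`, `exposed_stall`, `produced_at`, `needed_at`) are hypothetical and serve only this illustration.

```c
#include <assert.h>

/* Fraction of one iteration's computation available to hide a chunk's
 * transfer.
 *   produced_at: point in the sending iteration (0..1) at which the
 *                chunk is complete and can be sent;
 *   needed_at:   point in the next iteration (0..1) at which the
 *                receiver first consumes the chunk. */
static double overlap_window(double produced_at, double needed_at)
{
    assert(produced_at >= 0.0 && produced_at <= 1.0);
    assert(needed_at >= 0.0 && needed_at <= 1.0);
    return (1.0 - produced_at) + needed_at;
}

/* Part of the transfer time that cannot be hidden and shows up as a
 * stall, given the iteration time and the chunk's transfer time
 * (both in the same time unit). */
static double exposed_stall(double iter_time, double transfer_time,
                            double produced_at, double needed_at)
{
    double hidden = iter_time * overlap_window(produced_at, needed_at);
    return transfer_time > hidden ? transfer_time - hidden : 0.0;
}
```

For a chunk needed at the very start of the next iteration, `overlap_window(1.0 / 3.0, 0.0)` gives roughly 0.66 and `overlap_window(0.8, 0.0)` roughly 0.2, matching the two cases discussed above. A transfer that fits inside the window yields a stall of zero in this model.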

The goal of this thesis is to explore techniques that achieve the chunked overlap from Figure 3.3b without degrading the maintainability of the code. The code restructuring needed to achieve chunked overlapped execution by hand seriously hurts maintainability (Figure 3.3a). Thus, our goal is to explore techniques that achieve chunked overlapped execution (Figure 3.3b) while still preserving the clarity of the code from Figure 3.1a.

Our contribution in exploring automatic overlap

This thesis aims to design and evaluate an automatic technique that, using specialized hardware support, extracts the maximal potential overlap without the need to restructure the source code of an application (Chapter 5). In Section 5.2, we further describe our technique of automatic overlap. Furthermore, we introduce a new overlapping technique that we named speculative dataflow. Speculative dataflow is a speculative technique that, using specialized hardware support, implements automatic overlap without any restructuring of the original MPI application. We demonstrate the feasibility of speculative dataflow (Section 5.3) and evaluate its potential for real scientific MPI applications (Section 5.4).

In addition, throughout our study we designed a development environment that can be useful in further studies of overlap (Section 4.2). Given only the executable of a legacy MPI application, the environment automatically evaluates the potential overlap. Moreover, its visualization support allows comparison between the original and the overlapped execution of the targeted application. Using our environment, a programmer can quickly evaluate the potential benefits of overlap in any application. Thus, prior to any implementation effort, the programmer can estimate the potential benefits of the intended overlapping technique and decide whether the implementation is worth the effort. Furthermore, using the visualization support, the programmer can identify bottlenecks of the intended implementation and explore solutions to overcome them.

In document Evaluating techniques for parallelization tuning in MPI, OmpSs and MPI/OmpSs (Page 58-61)