A 2D Heat Conduction Algorithm as a Use Case

3.8 3.9 4 4.1 4.2 4.3 4.4 4.5 4.6 1 10 100 Time [usec]

Message Size [Byte] NetPIPE Latency - Infiniband (Viscluster)

with instrumentation without instrumentation

(a) Latency over InniBand connection

0 1000 2000 3000 4000 5000 6000 7000 8000 9000 1 10 100 1000 10000 100000 1e+06 1e+07 Bandwidth [Mbps]

Message Size [Byte] NetPIPE Bandwidth - Infiniband (Viscluster)

with instrumentation without instrumentation

(b) Bandwidth over InniBand connection

Figure 5.5: NetPIPE Inniband latency (left) and bandwidth (right) comparison of Open MPI compiled with and without the memchecker framework

50 100 150 200 250 1 10 100 Time [usec]

Message Size [Byte] NetPIPE Latency

Running with MemPin Running with Valgrind

(a) Latency over TCP connection

0 200 400 600 800 1000 1200 1400 1600 1800 1 10 100 1000 10000 100000 1e+06 1e+07 Bandwidth [Mbps]

Message Size [Byte] NetPIPE Bandwidth

Running with MemPin Running with Valgrind

(b) Bandwidth over TCP connection

Figure 5.6: NetPIPE TCP latency (left) and bandwidth (right) comparison of Open MPI run with the memchecker framework

20 40 60 80 100 120 140 160 180 200 1 10 100 Time [usec]

Message Size [Byte] NetPIPE Latency

Running with MemPin Running with Valgrind

(a) Latency over InniBand connection

0 2000 4000 6000 8000 10000 12000 1 10 100 1000 10000 100000 1e+06 1e+07 Bandwidth [Mbps]

Message Size [Byte] NetPIPE Bandwidth

Running with MemPin Running with Valgrind

(b) Bandwidth over InniBand connection

Figure 5.7: NetPIPE Inniband latency (left) and bandwidth (right) comparison of Open MPI run with the memchecker framework

1 10 100 1000 1 10 100 Time [usec]

Message Size [Byte] NetPIPE Latency

Running with MemPin Running with Valgrind Running without tools

(a) Latency over InniBand connection

0.01 0.1 1 10 100 1000 10000 100000 1 10 100 1000 10000 100000 1e+06 1e+07 Bandwidth [Mbps]

Message Size [Byte] NetPIPE Bandwidth

Running with MemPin Running with Valgrind Running without tools

(b) Bandwidth over InniBand connection

Figure 5.8: NetPIPE Inniband latency (left) and bandwidth (right) comparison of Open MPI run with and without the memchecker framework

5.3 A 2D Heat Conduction Algorithm as a Use Case

Blocks that will be computed Overlap area of two processes

Blocks that are used for computation Data block owned by one process

Process 1 Process 2

Communicated blocks but not used for computation

Figure 5.9: An example of border update in domain decomposition

processor algorithm into a parallel context. For example, a good domain decomposition will normally lead to good load balance and fast communication among sub-domains. While a bad decomposition may result load imbalance and heavy communication, which will show a poor overall performance.

The basic idea of domain decomposition is to divide the original computational domain Ω into sub-domains Ωi, i = 1, ....M, and then solve the global problem as a sum of contributions from each sub-domain, that may be computed in parallel.The process as- sociated with a sub-domain requires elements belonging to its neighbors when it updates the elements on the border of its partition [13], that requires large amount of exchang- ing data between each neighbor pairs. However, there are cases that part of the border data need not to be updated. For example, in a 2D domain decomposition algorithm, it may require calculate elements only from their horizontal and vertical neighbor elements, but the whole border element arrays are updated from neighbor sub-domain, as shown in Figure 5.9. The result is that for the border update in every sub-domain, there will be four corner elements that will never be used for calculation (without periodic boundary condition, the virtual border [36] elements are not taken into account in the example). This might be no harm for the calculation result of the algorithm. But when decomposing the entire problem into a large number of sub-domains, the total amount of transferred but unused data may be high, and as consequence communication might require more time.

(a) 4 x 4 decomposition (b) 8 x 8 decomposition

Figure 5.10: Transferred but unused data in example domain decompositions Figure 5.10 shows a more detailed example of the transferred but unused data in the domain decomposition example. The left gure is a 4x4 domain decomposition, where every element calculation requires horizontal and vertical neighbor elements. In this specic condition, there will be 72 elements transferred but not used (36 corner elements transferred two times). When scaling this code by doubling the number of processors used to compute this domain, the number of elements communicated but not used increases dramatically. On the right side, a similar example of 8x8 domain decomposition, that has 392 elements (196 corner elements transferred two times) might not be communicated. Assuming we have a M × N domain decomposition, the total amount of such elements are described by:

(M − 1) × (N − 1) × 4 × 2 (5.1)

It is obvious that, the number of unnecessary communicated data grows superlinearly with the domain decomposition.

An example 2d heat conduction algorithm, which behave in the way described above, has been used for running with the new implemented memory checking framework on Windows. The algorithm is based on Parallel CFD Test Case [48]. It solves the partial dierential equation for unsteady heat conduction over a square domain. It was run with two processes and under control of the memory checking tool (MemPin in this case). The entire domain has been decomposed into two sub-domains, one has the range of [(0,0), (8,15)], and the other is [(7,0), (15,15)]. The run-time output (see Figure 5.11) shows the details of the execution. In the end of the output, out tool MemPin reports that in total 112 bytes of data on each process have been transferred but actually not used in the program. All of these bytes are from the corner elements exchange. Each of the process has two corner elements, and each element is transferred once for every communication. In the following of this Section, more test results will show that with more corner elements, the communication time of the application is aected. And reducing such data will improve the communication performance.

5.3 A 2D Heat Conduction Algorithm as a Use Case

Figure 5.11: Running the heat program with two processes and checked with memory checking

The heat program has been tested with more processes on dierent number of nodes on BWGrid and Nehalem Cluster at HLRS, in order to discover the relationship between the communicated but unused corner elements and the communication performance. For the rst test, the heat program was set to a 1500 × 1500 domain, and parallelized with dierent number of processes over four compute node (eight cores on each node) on Nehalem cluster. A modied version of the heat program was also used for the test, which does not send any unused corner elements. The average communication time and overall run time are measured based on ve executions on dierent number of processes, as shown in Figure 5.12. When not oversubscribing the nodes (each core has no more than one process), the modied version is generally better than the original version ranging from 3% to 7%. As we can see, communicating the corner elements of the sub-domains will indeed aect the communication time of the program.

Another test on 64 nodes was made on Nehalem cluster to start large number of processes without oversubscribing the compute node cores. The processes are assigned using round-robin algorithm among the nodes, in order to achieve a better load balancing for the simulation. Figure 5.13 shows the communication time of running the same simulation with dierent number of processes on the cluster. It presents the communication time for dierent number of processes (64 to 310). The modied version has a shorter communication time on average, which is 10% better. The best case is even 20% better than the original version.

The communication time does not increase of decrease linearly with the number of processes, because the domain decomposition will inuence the communication eciency.

195 200 205 210 215 220 225 230 5 10 15 20 25 30 Time [usec] Number of Processes

Communication time of the Heat Conduction program Original program Modified program

Figure 5.12: Comparison of the communication time between the original and modied Heat Conduction program on 4 nodes

100 150 200 250 300 350 400 450 500 550 100 150 200 250 300 Time [usec] Number of Processes

Communication time of the Heat Conduction program on 64 nodes Original program

Modified program

Figure 5.13: Communication time comparison between the original and modied Heat Conduction program on 64 nodes

5.4 MD Simulation as a Use Case

In document MPI-Semantic Memory Checking Tools für Parallel Applikationen (Page 91-97)