
5.4 Effects of Thread and Memory Placement in Gaussian

5.4.3 Parallel HF Performance in Gaussian

The previous section discussed the distribution of contention classes for the X4600 M2. To obtain an understanding of the range of possible performance outcomes, in this section we perform Gaussian performance experiments using an instance from the minimum and maximum contention classes.

Table 5.7 presents the elapsed times obtained for the first three SCF iterations using the larger Valinomycin system with the HF method and a 3-21G basis set. The results are divided into two categories: ‘Co-located Threads and Memory’ and ‘All Memory Located at Node 0’. The former corresponds to an allocation in the minimum contention class, the latter to an allocation in the maximum contention class. Within each category, sets of experiments are defined by how the threads are allocated to each node on the X4600 M2. When adding threads, the two options chosen are to use either one core per node (the S* or single-core option) or both cores per node (the D* or dual-core option). In this respect Set 1 and Set 3 are similar (S*), and Set 2 and Set 4 are similar (D*). Threads are allocated based on the ordering given in the ‘Node’ column. Thus for a two-threaded calculation a ‘Node’ value of (0,1) applies to the S* case, whereas a value of 0 applies to the D* case. The times taken for both the 64Kw and 128Kw blocking factors are given in seconds in the ‘Time’ column.
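The two allocation options can be sketched in a few lines of Python. This is only an illustration of the ordering described above, not code from Gaussian: the helper names `placement` and `nodes_used` are hypothetical, and only the first four entries of `NODE_ORDER` (0, 1, 6, 7) are taken from the ‘Node’ column; the order in which the remaining nodes are added is an assumption.

```python
# Hypothetical helpers that reproduce the 'Node' column of Table 5.7.
# Only the leading entries 0, 1, 6, 7 come from the text; the order of
# the remaining four nodes is an assumption made for illustration.
NODE_ORDER = [0, 1, 6, 7, 2, 3, 4, 5]
CORES_PER_NODE = 2


def placement(n_threads, policy):
    """Return a list of (node, core) pairs for n_threads threads.

    policy "S": one core per node (core 0 of each node).
    policy "D": fill both cores of a node before using another node.
    """
    if policy == "S":
        if n_threads > len(NODE_ORDER):
            raise ValueError("S* allows at most one thread per node")
        return [(node, 0) for node in NODE_ORDER[:n_threads]]
    if policy == "D":
        if n_threads > CORES_PER_NODE * len(NODE_ORDER):
            raise ValueError("more threads than cores")
        slots = [(node, core)
                 for node in NODE_ORDER
                 for core in range(CORES_PER_NODE)]
        return slots[:n_threads]
    raise ValueError("policy must be 'S' or 'D'")


def nodes_used(n_threads, policy):
    """Sorted list of distinct nodes, comparable to the 'Node' column."""
    return sorted({node for node, _ in placement(n_threads, policy)})


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, "S*", nodes_used(n, "S"), "D*", nodes_used(n, "D"))
```

With this ordering, four threads occupy nodes 0, 1, 6 and 7 under S* but only nodes 0 and 1 under D*, matching the corresponding table rows.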

We first consider the results given in Table 5.7 for ‘Set 1’. As ‘Set 1’ corresponds to single-core allocation, the one-thread experiment was run using Node 0, core 0A. This results in times of 218 sec. and 267 sec. for the 64Kw and 128Kw blocking factors respectively. Increasing the thread count to 2 places threads on Node 0, core 0A and Node 1, core 1A. This reduces the runtime for both blocking factors, to 111 sec. and 137 sec. For 4 threads Nodes 0, 1, 6 and 7 were used (footnote 5). Timings obtained using the two blocking factors are 58 sec. and 72 sec. For eight threads all 8 nodes were used, with times of 31 sec. and 42 sec. for the two cache blocking factors respectively.

For ‘Set 3’ with a 64Kw blocking factor, execution time reduces with increased thread count. For the 128Kw blocking factor there is also a decrease with increased thread count, but scalability is worse than that observed for the 64Kw blocking factor. This is because there are more cache misses with the 128Kw blocking factor, and these are costly with non-local memory placement.

Comparing the 64Kw and 128Kw results for ‘Set 1’ and ‘Set 3’ shows two distinct behaviours. For the 64Kw blocking factor, the times obtained up to 8 threads are almost identical between ‘Set 1’ and ‘Set 3’. This indicates that cache blocking is mitigating the effects of non-local memory placement. For the 128Kw blocking factor, as thread count increases, the ‘Set 3’ results become progressively slower than those for ‘Set 1’. By 8 threads, the ‘Set 1’ runtime for the 128Kw blocking factor is about 50% lower than the equivalent ‘Set 3’ result (42 sec. versus 83 sec.), with cache coherency traffic overheads now significantly reducing performance. This shows that when using a single core per node, it is important both to minimise cache misses and to use node-local thread and memory placement in order to obtain good performance on the SunFire X4600 M2.

Footnote 5: For 4 threads these nodes were selected for two reasons. First, Nodes 0 and 7 have 2 cHT links, whereas Nodes 1 and 6 have two cHT links and share a third cHT link. Second, the expectation is that greater variation in runtimes would be obtained for these nodes, which have fewer cHT links.

Table 5.7: Elapsed time (sec.) for the first three SCF iterations, as a function of memory and thread placement for a parallel Gaussian calculation on the Valinomycin molecule using HF/3-21G.

Co-located Threads and Memory

| NThreads | Set 1 Node (S*) | Set 1 64Kw | Set 1 128Kw | Set 2 Node (D*) | Set 2 64Kw | Set 2 128Kw |
| 1 | 0 | 218 | 267 | – | – | – |
| 2 | 0,1 | 111 | 137 | 0 | 109 | 145 |
| 4 | 0,1,6,7 | 58 | 72 | 0,1 | 57 | 76 |
| 8 | 0–7 | 31 | 42 | 0,1,6,7 | 31 | 38 |
| 16 | – | – | – | 0–7 | 19 | 27 |

All Memory Located at Node 0

| NThreads | Set 3 Node (S*) | Set 3 64Kw | Set 3 128Kw | Set 4 Node (D*) | Set 4 64Kw | Set 4 128Kw |
| 1 | 0 | 218 | 267 | – | – | – |
| 2 | 0,1 | 110 | 147 | 0 | 111 | 146 |
| 4 | 0,1,6,7 | 59 | 97 | 0,1 | 59 | 94 |
| 8 | 0–7 | 33 | 83 | 0,1,6,7 | 34 | 85 |
| 16 | – | – | – | 0–7 | 26 | 87 |

S* – Threads are allocated on a single core per node

D* – Threads are allocated per core on a node, prior to using another node

For ‘Set 2’, which corresponds to D*, both cores are used and results are reported for 2 – 16 threads. We first consider timings for the 64Kw blocking factor. For 2 threads, both cores on node 0 are used. A time of 109 sec. is measured, which is slightly less than the corresponding time in ‘Set 1’. When the thread count is increased to 4 threads and both cores on nodes 0 and 1 are used, the measured time is again slightly less than the corresponding ‘Set 1’ time. On increasing to 8 threads and using nodes 0, 1, 6 and 7 the same difference is seen. For the 128Kw results, when using 2–4 threads there is an increase in times compared to the ‘Set 1’ times. For 8 threads, the ‘Set 2’ result is slightly faster than the corresponding ‘Set 1’ time. This suggests that use of both cores in each node reduces the intra-node coherency traffic overheads associated with cHT.

For ‘Set 4’, results for 2 – 8 threads using a 64Kw blocking factor follow the same trends as the 64Kw results for ‘Set 2’. Relative to ‘Set 2’, the times become progressively longer as thread count increases; at 16 threads the ‘Set 2’ time (19 sec.) is 27% lower than the ‘Set 4’ time (26 sec.). Use of a 128Kw blocking factor in ‘Set 4’ shows a dramatic increase in execution time compared to the 64Kw results. This result indicates that with 16 threads the 64Kw blocking factor, even with all memory being located at Node 0, is able to perform better than the 128Kw blocking factor.

For all timing results from ‘Set 1’ to ‘Set 4’, it is seen that a well cache blocked algorithm can significantly reduce the effects of poor thread and memory placement.
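The size of the placement penalty can be checked directly from the Table 5.7 timings. The sketch below is illustrative only; the function name `relative_difference` is hypothetical, and normalising by the slower (larger) time is an assumption, inferred from the fact that it reproduces the figures of roughly 27% and 50% quoted in the text.

```python
# Elapsed times in seconds, transcribed from Table 5.7.
# Layout: TIMES[set_number][n_threads] = (t_64Kw, t_128Kw)
TIMES = {
    1: {1: (218, 267), 2: (111, 137), 4: (58, 72), 8: (31, 42)},
    2: {2: (109, 145), 4: (57, 76), 8: (31, 38), 16: (19, 27)},
    3: {1: (218, 267), 2: (110, 147), 4: (59, 97), 8: (33, 83)},
    4: {2: (111, 146), 4: (59, 94), 8: (34, 85), 16: (26, 87)},
}
BLOCKING = {"64Kw": 0, "128Kw": 1}


def relative_difference(slow_set, fast_set, n_threads, blocking):
    """Percentage by which fast_set's time is below slow_set's time.

    Assumption: the difference is normalised by the slower set's time.
    """
    i = BLOCKING[blocking]
    t_slow = TIMES[slow_set][n_threads][i]
    t_fast = TIMES[fast_set][n_threads][i]
    return 100.0 * (t_slow - t_fast) / t_slow


if __name__ == "__main__":
    # Compare each node-0 set with its co-located counterpart.
    for slow, fast in ((3, 1), (4, 2)):
        for n in sorted(set(TIMES[slow]) & set(TIMES[fast])):
            for blocking in BLOCKING:
                diff = relative_difference(slow, fast, n, blocking)
                print(f"Set {slow} vs Set {fast}, {n:2d} threads, "
                      f"{blocking}: {diff:5.1f}%")
```

Under this convention the 64Kw differences stay below 10% up to 8 threads, while the 128Kw differences grow to about 49% (Set 3 versus Set 1, 8 threads) and about 69% (Set 4 versus Set 2, 16 threads).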

Using data from Table 5.7, five speedup curves are presented in Figure 5.4. The Table 5.7 results are augmented with those obtained from an unmodified version of the Gaussian code (i.e. one that does not perform thread or memory placement). The speedup curves are labelled Unmodified, Set 1, Set 2, Set 3 and Set 4 accordingly. The solid black line in the two plots is a reference line for perfect speedup.

We first consider the 64Kw plot. Up to four threads, there is no significant deviation between the five curves. At eight threads the curves separate into two groups: Set 1 and Set 2 are faster than Set 3, Set 4 and unmodified Gaussian. The difference between the two groups is about 10% and is similar to the difference seen between the lowest and highest times recorded for serial 18-Crown-6 Ether in Table 5.4. For sixteen threads, the difference between the two groups increases to around 30%. The best achieved speedup is 11.52 for Set 2. Results obtained for unmodified Gaussian are roughly similar to those obtained using thread and memory placements corresponding to the maximum contention class. This arises because a large block of memory (the workspace in Figure 5.3) is allocated prior to the start of the SCF iterations. Sequential code then touches this memory to create intermediate values, and a section of this shared memory is then handed to each worker thread for use in its parallel section. Owing to the first-touch memory placement policy, any memory accessed prior to the parallel section will result in allocation on or near to the master thread.
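The effect of first-touch placement can be illustrated with a toy model. The sketch below is not Gaussian code and does not call any operating system interface: it simply assigns each page to the node of the thread that touches it first, then counts how many pages end up remote from the worker that later uses them. The page count, the one-thread-per-node mapping and the helper names are all assumptions made for illustration.

```python
def first_touch_nodes(n_pages, toucher_of_page, node_of_thread):
    """Node holding each page: the node of the first thread to touch it."""
    return [node_of_thread[toucher_of_page(p)] for p in range(n_pages)]


def remote_fraction(page_nodes, owner_of_page, node_of_thread):
    """Fraction of pages placed on a different node than their worker."""
    remote = sum(1 for page, node in enumerate(page_nodes)
                 if node_of_thread[owner_of_page(page)] != node)
    return remote / len(page_nodes)


# Assumed setup: 8 worker threads, one per node, each using a contiguous
# eighth of an 800-page workspace in the parallel section.
N_THREADS = 8
N_PAGES = 800
node_of_thread = {t: t for t in range(N_THREADS)}


def owner(page):
    """Worker thread that uses this page in the parallel section."""
    return page * N_THREADS // N_PAGES


def master(page):
    """Serial initialisation: the master thread (thread 0) touches all pages."""
    return 0


serial_init = first_touch_nodes(N_PAGES, master, node_of_thread)
parallel_init = first_touch_nodes(N_PAGES, owner, node_of_thread)

print(remote_fraction(serial_init, owner, node_of_thread))    # 0.875
print(remote_fraction(parallel_init, owner, node_of_thread))  # 0.0
```

In this model, serial initialisation leaves seven eighths of the workspace remote from the threads that use it, which corresponds to the ‘All Memory Located at Node 0’ case, whereas touching each section from its own worker gives the co-located case.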

We now consider the 128Kw blocking factor. Speedups for two threads are similar. For four threads, Set 1 and Set 2 diverge from Set 3, Set 4 and the unmodified Gaussian by 30%. At eight threads, Set 1 and Set 2 differ by 9%, indicating that the policy of allocating one core per node is better, while Sets 1 and 2 vary from Sets 3, 4 and unmodified Gaussian by 57%. For 16 threads there is a 70% difference between Sets 1 and 2 and Sets 3, 4 and unmodified Gaussian.

These results show that thread and memory placement can produce a speedup of 10 for the co-located case, but a speedup of just 3 for the maximum contention case, which is similar to unmodified Gaussian. Note that even though the maximum speedup is similar for both blocking factors, the base times used are very different; the shortest elapsed time is 19s for sixteen threads with the 64Kw blocking factor versus 29s for the 128Kw blocking factor.
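The speedups quoted here follow from the usual definition, serial time divided by parallel time, applied to the rounded Table 5.7 entries. The short sketch below assumes that definition and uses the one-thread times as the base for each blocking factor; the dictionary and function names are illustrative only.

```python
# One-thread and sixteen-thread times in seconds, from Table 5.7.
SERIAL = {"64Kw": 218.0, "128Kw": 267.0}
SIXTEEN_THREADS = {
    "Set 2 (co-located)": {"64Kw": 19.0, "128Kw": 27.0},
    "Set 4 (memory at Node 0)": {"64Kw": 26.0, "128Kw": 87.0},
}


def speedup(t_serial, t_parallel):
    """Speedup relative to the one-thread time for the same blocking factor."""
    return t_serial / t_parallel


if __name__ == "__main__":
    for label, times in SIXTEEN_THREADS.items():
        for blocking, t in times.items():
            s = speedup(SERIAL[blocking], t)
            print(f"{label}, {blocking}: {s:.2f}")
```

With the rounded table entries this gives about 11.5 (64Kw) and 9.9 (128Kw) for Set 2, and about 3.1 for Set 4 with the 128Kw blocking factor. The small difference from the 11.52 quoted for Set 2 presumably reflects unrounded timings.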

Figure 5.4: Speedup results for unmodified Gaussian 03 code compared to sets 1 – 4 for two cache blocking factors (64Kw, 128Kw). Times obtained are for the first three SCF cycles using Valinomycin, the HF method and a 3-21G basis set. Note that speedups for the two cases are relative to different timings, with 64Kw being faster than 128Kw.

[Figure 5.4 panels: (a) Cache Blocking: 64Kw; (b) Cache Blocking: 128Kw. Axes: Speedup versus Number of Threads. Curves: Unmodified, Set_1, Set_2, Set_3, Set_4.]

5.4.4 Summary: Effects of Thread and Memory Placement on

In document Performance Models for Electronic Structure Methods on Modern Computer Architectures (Page 153-157)