Cache Blocking as a Function of Benchmark Systems and Architecture


4.4.5 Cache Blocking as a Function of Benchmark Systems and Architecture


In previous sub-sections, we considered the effects of cache blocking on the cycle, instruction, and L1 and L2 miss counts. We also considered whether it was appropriate to use different blocking factors on a per LTot basis. It was found that while the optimal blocking factor does vary for different values of LTot, there was not a strong case for doing so, as the total execution time was dominated by those values of LTot that gave rise to the largest number of batches. In this sub-section, we examine the sensitivity of the blocking factor to the molecular system, calculation type and hardware architecture.

6 On the AMD848 Opteron, the hardware performance counter event used is the PAPI native event

Figure 4.3: FLOP count per LTot for the k300a-04 and k300a-04 water cluster systems using the HF method on a 2.2 GHz AMD848 Opteron processor

In Table 4.9 the blocking factor that gave the lowest execution time is reported for 5 benchmark systems, 7 hardware platforms and 2 calculation types. Also given are the default cache blocking factors. (Results for the k300a-08 system using the 6-31++G(3df,3pd) basis set are excluded, as this calculation required more physical memory than was available on any machine except the AMD848 system.)

Table 4.9: Optimal cache blocking factor observed for the five test systems, across six processor platforms and two methods. The hardware architecture type is given on the left hand side of the table. The default blocking factor for each hardware architecture is given in the ‘Default Blocking Size’ column. Finally, the blocking factor which gave the lowest cycle counts is given under each molecular system and basis set column.

                     Default         k300a-04        k300a-08   test397   α-Al2O3
Method  Platform     Blocking        A       B†      A          3-21G     3-21G*
                     Size (Kw)
HF      Opteron       64             32      64      32         32        64
HF      Athlon64      32             64      64      64         32        64
HF      EM64T        128             64      64      64         64        64
HF      Pentium 4     64             64      64      64         32        64
HF      Pentium M     64             64      64      64         64        64
HF      G5            32             32      64      32         32        64
HF      G5-XServe     32             32      32      32         32        64
B3LYP   Opteron       64             64      64      64         32        64
B3LYP   Athlon64      32             64      64      64         32        64
B3LYP   EM64T        128             64      64      64         64        64
B3LYP   Pentium 4     64             64      64      64         32        64
B3LYP   Pentium M     64             64      64      64         64        64
B3LYP   G5            32             32      64      32         32        64
B3LYP   G5-XServe     32             32      64      32         32        64

A: 6-31G* basis set
B†: 6-31++G(3df,3pd) basis set

For k300a-04 using the 6-31G* basis set for HF, the optimal blocking factor is 32Kw for the Opteron and 64Kw for the Athlon64. The 64Kw blocking factor is optimal for all Intel processors, while the optimal factor for the PowerPC processors is 32Kw.

Moving to the larger basis set with k300a-04, it is seen that all x86 processors do better with the 64Kw blocking factor, whereas 32Kw can be better for PowerPC processors. For the larger k300a-08 system, the trend is similar to the k300a-04 system using 6-31G*. With test397, the results are a mix of 32Kw and 64Kw blocking factors. For the α-Al2O3 calculation, on all hardware platforms, a blocking factor of 64Kw gave the lowest cycle counts for HF. If we now look across all HF calculations, the default blocking factor gives the lowest cycle count for every benchmark only on the Pentium M. For the EM64T, an optimal blocking factor of 64Kw was observed across all benchmarks, though the default blocking factor is 128Kw.
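The comparison between default and observed optimal factors can be checked mechanically. The following Python sketch transcribes the HF rows of Table 4.9 and reports the platforms whose default factor was optimal for every benchmark, and the platforms where a single factor was optimal throughout. The dictionary layout and function names are illustrative assumptions, not part of the original work.

```python
# HF rows of Table 4.9: platform -> (default factor, optimal factor per benchmark), in Kw.
# Benchmark order: k300a-04 A, k300a-04 B, k300a-08 A, test397, alpha-Al2O3.
# The data layout and all names here are illustrative assumptions.
HF_TABLE_4_9 = {
    "Opteron":   (64,  [32, 64, 32, 32, 64]),
    "Athlon64":  (32,  [64, 64, 64, 32, 64]),
    "EM64T":     (128, [64, 64, 64, 64, 64]),
    "Pentium 4": (64,  [64, 64, 64, 32, 64]),
    "Pentium M": (64,  [64, 64, 64, 64, 64]),
    "G5":        (32,  [32, 64, 32, 32, 64]),
    "G5-XServe": (32,  [32, 32, 32, 32, 64]),
}


def platforms_where_default_is_always_optimal(table):
    """Return platforms whose default factor was optimal for every benchmark."""
    return [name for name, (default, optimal) in table.items()
            if all(factor == default for factor in optimal)]


def platforms_with_single_optimum(table):
    """Return {platform: factor} where one factor was optimal for every benchmark."""
    return {name: optimal[0] for name, (_, optimal) in table.items()
            if len(set(optimal)) == 1}


print(platforms_where_default_is_always_optimal(HF_TABLE_4_9))  # ['Pentium M']
print(platforms_with_single_optimum(HF_TABLE_4_9))  # {'EM64T': 64, 'Pentium M': 64}
```

The second result separates the two cases discussed in the text: the EM64T has a single optimum (64Kw) that differs from its 128Kw default, whereas on the Pentium M the single optimum coincides with the default.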

Considering the B3LYP results, for k300a-04 and the 6-31G* basis set, the 64Kw blocking factor is preferred on x86 processors and 32Kw on PowerPC. For k300a-04 and the larger basis set, it is entirely 64Kw. For k300a-08, all x86 processors performed better with the 64Kw blocking factor, whereas it was 32Kw for PowerPC. The results for test397 are mixed as in the HF case and the α-Al2O3 calculation is similar to the HF case in that the 64Kw blocking factor is preferred.

Table 4.10: Timing differences for a 32Kw blocking factor versus a 64Kw blocking factor, expressed as a percentage.

                     k300a-04         k300a-08   test397   α-Al2O3   Sum of        Preferred
Method  Platform     A       B†       A          3-21G     3-21G*    Percentages   Factor
HF      Opteron      -2.2    +6.0     -0.5       -2.3      +3.4       +4.3         64
HF      Athlon64     +1.2    +5.4     +0.2       -1.8      +3.6       +8.6         64
HF      EM64T        +7.5   +10.6     +5.0       +3.8      +6.8      +33.7         64
HF      Pentium 4    +4.6    +8.8     +0.8       -1.2      +5.5      +18.5         64
HF      Pentium M    +2.8    +7.8     +2.2       +0.5      +5.4      +18.7         64
HF      G5           -6.0    +2.2     -5.1       -6.4      +0.8      -14.5         32
HF      G5-XServe    -7.8    +0.9     -5.3       -7.2      +0.7      -18.6         32
B3LYP   Opteron      +0.6    +3.1     +0.9       -0.3      +2.5       +6.8         64
B3LYP   Athlon64     +0.5    +5.1     +0.1       -1.2      +3.2       +7.7         64
B3LYP   EM64T        +3.6   +10.8     +2.9       +2.4      +5.5      +25.2         64
B3LYP   Pentium 4    +2.0    +7.7     +0.7       -0.6      +4.4      +14.1         64
B3LYP   Pentium M    +1.5    +7.2     +1.9       +0.8      +4.2      +15.5         64
B3LYP   G5           -3.0    +2.4     -3.9       -5.0      +1.3       -8.2         32
B3LYP   G5-XServe    -4.0    +1.9     -3.8       -6.0      +0.6      -11.5         32

A: 6-31G* basis set
B†: 6-31++G(3df,3pd) basis set

As Table 4.9 contains a mix of 32Kw and 64Kw factors, it is useful to determine how each performs relative to the other if only one blocking factor is used for all the benchmark systems. In order to do this we now consider the relative difference in timing between the two cache blocking factors across all systems.

Table 4.10 presents the relative difference in times between 32Kw and 64Kw. The first part of the table corresponds to the HF method and the second to the B3LYP method. The hardware platforms and methods are given on the left hand side. For each platform, the difference in times between 32Kw and 64Kw blocking factors, expressed as a percentage (see footnote 7), is given for each benchmark system. A positive entry in the table indicates that the run with a 32Kw blocking factor was slower, so 64Kw is preferred, whereas a negative entry indicates that 32Kw is preferred. The cumulative percentage for a given hardware platform is given in the ‘Sum of Percentages’ column. Based on the sum of percentages, a preferred blocking factor is determined in the ‘Preferred Factor’ column.

7 i.e. 100*(32KwTime - 64KwTime)

For the Opteron it is seen that the k300a-04 system with 6-31G* runs 2.2% faster if a 32Kw blocking factor is used, whereas with the larger basis set the same system runs 6% slower with the 32Kw blocking factor than with a 64Kw blocking factor. For k300a-08 and 6-31G*, there is a 0.5% decrease in time if 32Kw is used. The test397 system runs 2.3% faster with a 32Kw blocking factor and the α-Al2O3 system runs 3.4% slower with a 32Kw blocking factor. Overall, the sum of these percentages is +4.3%. This indicates that overall use of a 32Kw blocking factor results in times that are 4.3% slower compared to a 64Kw blocking factor; hence the preferred blocking factor for the Opteron is 64Kw.
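The arithmetic behind the Opteron row can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the normalisation by the 64Kw time in `percent_difference` is an assumption, since the footnote defining the percentage does not show a legible denominator. The entries are taken from the HF Opteron row of Table 4.10, with positive values meaning that the 32Kw run was slower.

```python
def percent_difference(time_32kw, time_64kw):
    """Relative timing difference in percent; positive means 32Kw is slower.

    ASSUMPTION: the difference is normalised by the 64Kw time. The source
    footnote does not make the denominator legible.
    """
    return 100.0 * (time_32kw - time_64kw) / time_64kw


def preferred_factor(percentages):
    """Sum the per-benchmark percentages and pick the preferred factor in Kw.

    A positive sum means 32Kw was slower overall, so 64Kw is preferred.
    """
    total = sum(percentages)
    return total, (64 if total > 0 else 32)


# HF, Opteron row of Table 4.10:
# k300a-04 A, k300a-04 B, k300a-08 A, test397, alpha-Al2O3.
# With these signs the entries reproduce the reported sum to within rounding.
opteron_hf = [-2.2, +6.0, -0.5, -2.3, +3.4]

total, factor = preferred_factor(opteron_hf)
print(f"sum = {total:+.1f}%, preferred = {factor}Kw")
```

The rounded entries sum to +4.4%, which agrees with the tabulated +4.3% to within rounding of the individual entries, and the positive sign selects 64Kw.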

If we consider the Athlon64 processor, the absolute difference in timings for individual benchmarks is similar to those obtained for the Opteron, and a 64Kw blocking factor is also preferred.

For the EM64T system all benchmarks perform better using a 64Kw blocking factor. It is interesting to note that the sum of differences shows that the use of a 32Kw blocking factor leads to a 33% increase in overall execution time.

For the Pentium 4 system, all benchmarks apart from test397 perform better with 64Kw. Overall, an 18.5% increase in execution time results from the use of a 32Kw blocking factor, thus a 64Kw blocking factor is preferred. It is of interest to compare the difference in times for the EM64T and Pentium 4 systems, as both use the NetBurst microarchitecture. There is a 15.2% difference in overall timings between the two processors (33.7% versus 18.5%). This indicates that the EM64T is more sensitive to the choice of cache blocking factor than the Pentium 4 system.

The Pentium M performs better with a 64Kw blocking factor, across all molecular systems. Use of a 32Kw blocking factor would result in times that are 18.7% slower.

For the G5 system, use of a 32Kw blocking factor is beneficial except for k300a-04 with the larger basis set and the α-Al2O3 system. Overall, use of a 32Kw blocking factor results in a 14.5% reduction in runtime. Similar trends are seen for the G5-XServe system, where there is an 18.6% reduction in runtime when 32Kw is used compared to 64Kw.

Moving to the B3LYP section of the table, the Opteron and Athlon64 processors perform better with a 64Kw blocking factor: runtime increases by 6.8% and 7.7%, respectively, if a 32Kw blocking factor is used. If we compare B3LYP results with HF, there is very slight variation between the two.

For the EM64T processor, a net 25.2% increase in runtime results from a 32Kw blocking factor. This increase is not as large as that seen for HF. The Pentium 4 system has an overall increase of 14.1% in its runtime when using a 32Kw factor.

The Pentium M shows a 15.5% increase in runtime for the 32Kw blocking factor.

The G5 and G5-XServe show 8.2% and 11.5% reductions in runtime, respectively, when a 32Kw blocking factor is used.

To summarize, it would be beneficial to use a 64Kw blocking factor on x86 machines and a 32Kw blocking factor for PowerPC processors, for the set of benchmarks used here.

4.4.6 Summary: PRISM and Cache Blocking

To conclude this section, the PRISM algorithm limits the number of shell-quartets processed to cache-blocked quantities, which influences the batches processed by the inner loop of PRISM. An appropriate blocking factor produces an optimal run-time, and the best choice depends on both the input molecular system and the basis set used. The optimal blocking factor is shown to influence the L2 and L1 miss rates, as its use leads to the lowest cycle count for the ERI class (i.e. all the integrals for a given LTot) with the largest number of integrals. It is better to use one fixed cache blocking parameter for an entire SCF cycle than to dynamically vary the cache blocking according to the quartet angular momentum type. For the set of benchmark systems which were assessed, a 64Kw blocking factor was found to be best for x86 processors, while a 32Kw blocking factor gave best performance for the PowerPC processors.
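The way a blocking factor bounds the batches seen by an inner loop can be illustrated with a generic Python sketch. This is not the PRISM implementation: the function name, the fixed `words_per_quartet` footprint, and the conversion of 1 Kw to 1024 words are all assumptions made for illustration.

```python
from typing import Iterator, Sequence

WORDS_PER_KW = 1024  # ASSUMPTION: 1 Kw is taken to be 1024 words


def blocked_batches(quartets: Sequence[int],
                    words_per_quartet: int,
                    blocking_factor_kw: int) -> Iterator[Sequence[int]]:
    """Yield consecutive batches whose working set fits within the blocking factor.

    `words_per_quartet` is a hypothetical, fixed memory footprint per
    shell-quartet; a real integral code would derive it from the quartet type.
    """
    if words_per_quartet <= 0:
        raise ValueError("words_per_quartet must be positive")
    capacity_words = blocking_factor_kw * WORDS_PER_KW
    # Always process at least one quartet, even if it exceeds the capacity.
    batch_len = max(1, capacity_words // words_per_quartet)
    for start in range(0, len(quartets), batch_len):
        yield quartets[start:start + batch_len]


# Compare the batch structure for the two blocking factors discussed in the text.
quartets = list(range(10_000))
for kw in (32, 64):
    batches = list(blocked_batches(quartets, words_per_quartet=256,
                                   blocking_factor_kw=kw))
    print(f"{kw}Kw: {len(batches)} batches of up to {len(batches[0])} quartets")
```

Under these assumptions, doubling the blocking factor halves the number of batches and doubles the working set of each one, which is the trade-off between loop overhead and cache residency that the measurements in this section explore.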

In document Performance Models for Electronic Structure Methods on Modern Computer Architectures (Page 107-112)