Effect of Cache Blocking on Cycle, Instruction Counts and Cache Misses


4.4.2 Effect of Cache Blocking on Cycle, Instruction Counts and Cache Misses


To explore the effect of cache blocking on PRISM’s performance, the k300a-04 system was run using the HF method and a 6-31G* basis set with cache blocking sizes ranging from 4 Kw to 1024 Kw on the AMD848 Opteron. The results are given in Table 4.6, which also reports the number of batches.

Cache blocking factors are given on the left-hand side of Table 4.6. For each blocking factor, the table’s columns give the number of batches generated and the hardware event counts for cycles, instructions executed, FLOPs, and L1 and L2 misses. Consider first the effect of cache blocking on the number of batches. Increasing the blocking factor from 4 Kw to 1024 Kw results in a large decrease in the number of batches. This behaviour is to be expected since, as discussed in section 4.2.2, the number of integrals processed per batch (in loop6 of Algorithm 2) increases.

Execution time, which is implied by the cycle count, is 3.85 x 10^10 cycles for 4 Kw, decreasing to a minimum of 2.69 x 10^10 cycles for 32 Kw before increasing to 4.6 x 10^10 cycles for 1024 Kw. Depending on the blocking factor used, a variation in execution time of roughly 40% is observed.
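The cycle counts quoted above can be turned into approximate run times and a relative spread. The following is a minimal sketch, assuming the 2.2 GHz clock rate given in the caption of Table 4.8 and a fixed clock frequency throughout the run; the names `cycles_to_seconds`, `slowdown` and `saving` are illustrative and do not come from PRISM.

```python
# Cycle counts quoted in the text for three blocking factors (in Kw).
cycles = {4: 3.85e10, 32: 2.69e10, 1024: 4.6e10}

# Assumption: the 2.2 GHz clock rate quoted in the caption of Table 4.8,
# held constant for the whole run.
CLOCK_HZ = 2.2e9


def cycles_to_seconds(n_cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to seconds at a fixed clock rate."""
    return n_cycles / clock_hz


fastest = min(cycles.values())
slowest = max(cycles.values())

# How many times longer the slowest run takes than the fastest one.
slowdown = slowest / fastest
# Fraction of cycles saved by the fastest run, relative to the slowest one.
saving = (slowest - fastest) / slowest

for kw, n in sorted(cycles.items()):
    print(f"{kw:5d} Kw: {cycles_to_seconds(n):6.2f} s")
print(f"slowdown = {slowdown:.2f}x, saving = {saving:.1%}")
```

Under these assumptions the 32 Kw run needs roughly 40% fewer cycles than the 1024 Kw run, which is the same spread expressed relative to the slowest case.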

Table 4.7: Variation in the number of batches with varying blocking factors for the k300a-04 system using a 6-31G* basis set with the HF method. Data obtained from one SCF cycle.

            Total    Number of Batches
LTot(1)  NoQrt(2)    4 Kw   16 Kw   32 Kw   64 Kw  256 Kw 1024 Kw  AsyLim(3)
0          352176   13190    3727    1894     963     268     100         66
1          708932   38584   10672    5411    2761     770     295        187
2          806845   56118   16245    8313    4248    1224     503        362
3          589046   57697   17251    8849    4528    1300     531        378
4          310929   39138   14386    7378    3770    1068     419        286
5          116550   16696    9152    4874    2451     652     223        112
6           32097    4601    3398    2057    1025     266      82         34
7            5651     809     809     508     271      68      19          5
8             623      89      89      63      32       8       2          1

(1) LTot: Total Angular Momentum
(2) NoQrt: Number of Quartets
(3) AsyLim: Asymptotic Batch Limit

As the blocking factor increases, the number of instructions decreases. There are two reasons for this. First, increasing the blocking factor reduces the number of batches. Second, the work required to re-compute intermediate quantities shared amongst the shell-quartets in a batch falls as the batch size increases. As the work done falls with increasing batch size, so does the number of instructions executed. A similar reduction is also seen for the FLOP count.
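The argument above can be illustrated with a toy cost model in which every batch pays a fixed cost for its shared intermediates and every quartet pays a fixed cost of its own. This is only a sketch: the batch and quartet counts are taken from Table 4.7, but the two unit costs and the function `modelled_work` are hypothetical and are not measured PRISM values. The model also ignores cache misses entirely, so it describes instruction counts, not cycles.

```python
# Number of batches per LTot (0..8) for each blocking factor in Kw (Table 4.7).
batches = {
    4:    [13190, 38584, 56118, 57697, 39138, 16696, 4601, 809, 89],
    16:   [3727, 10672, 16245, 17251, 14386, 9152, 3398, 809, 89],
    32:   [1894, 5411, 8313, 8849, 7378, 4874, 2057, 508, 63],
    64:   [963, 2761, 4248, 4528, 3770, 2451, 1025, 271, 32],
    256:  [268, 770, 1224, 1300, 1068, 652, 266, 68, 8],
    1024: [100, 295, 503, 531, 419, 223, 82, 19, 2],
}
# Number of shell quartets per LTot (0..8), also from Table 4.7.
no_qrt = [352176, 708932, 806845, 589046, 310929, 116550, 32097, 5651, 623]

# Hypothetical unit costs, chosen only for illustration (not measured).
SHARED_COST_PER_BATCH = 50.0
COST_PER_QUARTET = 1.0


def modelled_work(n_batches, n_quartets):
    """Toy model: per-batch shared work plus per-quartet work."""
    return n_batches * SHARED_COST_PER_BATCH + n_quartets * COST_PER_QUARTET


total_quartets = sum(no_qrt)
work = {kw: modelled_work(sum(col), total_quartets) for kw, col in batches.items()}

for kw in sorted(work):
    print(f"{kw:5d} Kw: {sum(batches[kw]):7d} batches, modelled work = {work[kw]:.3e}")
```

Because the quartet count is fixed, the modelled work falls only through the per-batch term, mirroring the reduction in re-computation described in the text.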

While reducing re-computation by increasing the blocking factor is beneficial, it also increases cache misses, since batches start to overflow the cache. This behaviour is evident from the L1 and L2 miss counts, which increase as the blocking factor gets larger. In addition, the L1 misses are an order of magnitude more numerous than the L2 misses, suggesting that PRISM’s memory access patterns are cache blocked for the L2 cache rather than for the L1 cache.

4.4.3 Effect of Cache Blocking on ERI Batching

The previous sub-section considered the effects of cache blocking on measured cycle counts for the k300a-04 system with a 6-31G* basis set and the HF method. The total number of batches generated as a function of cache blocking was also reported, and increasing the blocking factor was found to reduce the total number of batches processed. It is therefore of interest to examine the effect of cache blocking at a finer level. In Table 4.7 we present a detailed breakdown of the number of batches according to the quartet LTot value for blocking factors ranging from 4 Kw to 1024 Kw. Also included is the asymptotic batch limit (‘AsyLim’). The ‘AsyLim’ results correspond to the number of batches each LTot would generate if an infinitely large cache was used.

Table 4.8: Cycle count per LTot for k300a-04 using HF/6-31G* on a 2.2 GHz AMD848 Opteron

        Total    Asy.      Qrt.     Cycle count (x 10^9)
LTot    NoQrt     Lim      PAB.     4 Kw    16 Kw    32 Kw    64 Kw   256 Kw  1024 Kw
0      352176      66      5336     1.01     0.89     0.84     0.98     0.97     1.30
1      708932     187      3791     3.07     2.54     2.36     2.69     3.36     4.02
2      806845     362      2229     5.96     4.66     4.37     4.70     6.37     7.72
3      589046     378      1558     8.51     6.17     5.72     6.21     8.22     10.4
4      310929     286      1087     9.12     6.53     5.70     5.88     7.86     10.4
5      116550     112      1041     6.48     5.24     4.53     4.20     5.52     7.72
6       32097      34       944     3.24     2.92     2.35     2.02     2.67     3.67
7        5651       5      1130     0.95     0.94     0.84     0.67     0.67     1.01
8         623       1       623     0.17     0.17     0.17     0.17     0.17     0.17
Total cycles (x 10^10)              3.85     3.00     2.69     2.75     3.58     4.64
% increase from 32 Kw             +30.17   +10.56     0.00    +2.32   +24.94   +42.10

Qrt. PAB.: Average number of quartets per batch, without the use of cache blocking
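The ‘Qrt. PAB.’ column of Table 4.8 appears to be the ratio of the ‘NoQrt’ and ‘AsyLim’ columns. The short sketch below recomputes it under that assumption, rounding to the nearest integer; the variable names are illustrative only.

```python
# NoQrt and AsyLim columns of Table 4.7, indexed by LTot = 0..8.
no_qrt = [352176, 708932, 806845, 589046, 310929, 116550, 32097, 5651, 623]
asy_lim = [66, 187, 362, 378, 286, 112, 34, 5, 1]

# Assumption: 'Qrt. PAB.' is NoQrt / AsyLim rounded to the nearest integer.
qrt_pab = [round(q / b) for q, b in zip(no_qrt, asy_lim)]

for ltot, value in enumerate(qrt_pab):
    print(f"LTot = {ltot}: about {value} quartets per batch")
```

The recomputed values agree with the tabulated column for every LTot, which supports reading ‘AsyLim’ as the batch count obtained when cache blocking imposes no limit.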

Table 4.7 is divided into two sections. The first section gives the total number of quartets generated for each LTot (in the ‘Total NoQrt’ column). The second section gives the ‘Number of Batches’ generated as a function of cache blocking and the asymptotic batch limit for each LTot. Values for ‘Total NoQrt’ were obtained by setting a counter to 0 at the start of loop in Algorithm 2. This counter was then incremented by the number of shell quartets that were eventually chosen by PRISM for computation within loop6. The values for ‘AsyLim’ were obtained by counting the number of times loop5 in Algorithm 2 was entered.

From the table it is seen that as the value of LTot increases, the total number of shell-quartets increases from 352,176 at LTot = 0 to a maximum of 806,845 at LTot = 2 and then falls to 623 for LTot = 8. As mentioned before, the distribution of shell-quartets results from the underlying construction of the 6-31G* basis set and the nature of the system under study.

Starting from the 4 Kw blocking factor, as the value of LTot increases, the number of batches rises to a maximum of 57,697 at LTot = 3 and reduces sharply thereafter. Increasing the blocking factor fourfold to 16 Kw results in approximately a 3.3-fold reduction in the number of batches. This reduction drops to 2.7-fold at LTot = 4, while for LTot = 7 there is no reduction at all.

The latter is a consequence of batching being ignored if the value of ‘NPerS4’, in Algorithm 2, is greater than the available cache size.
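The fold reductions quoted above can be recomputed directly from the 4 Kw and 16 Kw columns of Table 4.7. The sketch below does only that arithmetic; the list names are illustrative.

```python
# Number of batches per LTot (0..8) at 4 Kw and 16 Kw, from Table 4.7.
batches_4kw = [13190, 38584, 56118, 57697, 39138, 16696, 4601, 809, 89]
batches_16kw = [3727, 10672, 16245, 17251, 14386, 9152, 3398, 809, 89]

# Fold reduction in the batch count when going from 4 Kw to 16 Kw.
fold_reduction = [b4 / b16 for b4, b16 in zip(batches_4kw, batches_16kw)]

for ltot, fold in enumerate(fold_reduction):
    print(f"LTot = {ltot}: {fold:.2f}-fold reduction")
```

The ratio stays between about 3.3 and 3.6 for LTot = 0 to 3, falls to about 2.7 at LTot = 4, and is exactly 1 for LTot = 7 and LTot = 8, where the batch counts are unchanged.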

From 32 Kw to 1024 Kw, increasing the blocking factor leads to a roughly proportional reduction in the number of batches. Surprisingly, even a 1024 Kw blocking factor, which corresponds to a cache of size 16 MB, would still result in batches being cache blocked.
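One way to check the last observation is to compare the 1024 Kw batch counts with the asymptotic limit for each LTot. The sketch below assumes that a batch count above ‘AsyLim’ indicates that cache blocking is still splitting batches; this reading is an interpretation of Table 4.7, and the names used are illustrative.

```python
# Number of batches per LTot (0..8) at 1024 Kw and in the asymptotic limit,
# both taken from Table 4.7.
batches_1024kw = [100, 295, 503, 531, 419, 223, 82, 19, 2]
asy_lim = [66, 187, 362, 378, 286, 112, 34, 5, 1]

# Ratio of the actual batch count to the asymptotic (unblocked) batch count.
excess = [b / a for b, a in zip(batches_1024kw, asy_lim)]
# Assumption: more batches than AsyLim means cache blocking is still active.
still_blocked = [b > a for b, a in zip(batches_1024kw, asy_lim)]

for ltot, (ratio, blocked) in enumerate(zip(excess, still_blocked)):
    print(f"LTot = {ltot}: {ratio:.2f}x AsyLim, still blocked: {blocked}")
```

Under this reading every LTot class still exceeds its asymptotic limit at 1024 Kw, by a factor of about 1.5 at LTot = 0.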

In document Performance Models for Electronic Structure Methods on Modern Computer Architectures (Page 99-102)