Cache Blocking as a Function of Quartet Type


4.4.4 Cache Blocking as a Function of Quartet Type

In the previous section, it was seen that the effect of cache blocking on the number of batches generated varied according to the LTot value. This raises the question of whether the cache blocking factor should vary according to the quartet type. To address this, we present in Table 4.8 the cycle count for each LTot as a function of the cache blocking factor.

There are four major columns in Table 4.8: ‘Total NoQrt’, ‘AsyLim’, ‘Qrt. PAB.’ and ‘Cycle count’. The ‘Total NoQrt’ and ‘AsyLim’ columns are reproduced from Table 4.7 to aid discussion. The ‘Qrt. PAB.’ column is the average number of quartets per batch, assuming there is no cache blocking. Values in this column are derived from the previous two columns. The ‘Cycle count’ column presents measured cycle counts for each LTot as a function of the cache blocking factor. The minimum cycle count entry for each LTot has been highlighted in bold font. At the bottom of the table, the ‘Total Cycles’ row gives the sum of cycle counts for each blocking factor. The ‘% Increase from 32Kw’ row gives the percentage difference between the cycle counts for the 32Kw blocking factor and those for the other blocking factors.
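The derived entries of the table can be sketched as follows. This is a minimal illustration, assuming that ‘Qrt. PAB.’ is the ratio of ‘Total NoQrt’ to ‘AsyLim’ and that the percentage is taken relative to the 32Kw total. The function names and the numbers are placeholders, not values from Table 4.8.

```python
def quartets_per_batch(total_quartets, asymptotic_batches):
    """Average quartets per batch with no cache blocking.

    Assumption: 'Qrt. PAB.' = 'Total NoQrt' / 'AsyLim'.
    """
    return total_quartets / asymptotic_batches


def percent_increase(total_cycles, baseline="32Kw"):
    """Percentage difference of each total cycle count from the baseline."""
    base = total_cycles[baseline]
    return {factor: 100.0 * (cycles - base) / base
            for factor, cycles in total_cycles.items()}


# Placeholder totals (arbitrary units), NOT measured data.
total_cycles = {"16Kw": 110, "32Kw": 100, "64Kw": 105}

print(quartets_per_batch(1200, 40))
print(percent_increase(total_cycles))
```

A positive percentage means the blocking factor needs more cycles than 32Kw; the baseline itself always maps to zero.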

For each value of LTot, there is a corresponding value for the total number of quartets and the asymptotic batch limit. Trends for these two columns were discussed previously. Values in the ‘Qrt. PAB.’ column decrease from LTot = 0 to 6. After this there is an increase, followed by a slight decrease for LTot = 8.

The measured cycle counts are expected to depend on the number of quartets for a given LTot, the number of asymptotic batches and the FLOP cost associated with computing a given quartet. For 4Kw, we see that as LTot increases, the cycle count gradually increases, peaks at LTot = 4 and then decreases. Moving to 16Kw, the cycle count again peaks at LTot = 4. For 32Kw, the cycle count peaks at LTot = 3, and this trend holds up to 256Kw. For 1024Kw, shell-quartets with LTot = 3 and 4 take the same amount of time to compute. It was mentioned earlier, in section 4.2.3, that higher angular momentum functions cost more in FLOPs to assemble than those of lower angular momentum. If we consider the variation of the ‘Qrt. PAB.’ column, it is seen that lower angular momentum integrals, though numerous, are computed rapidly. As the value of LTot increases, the cost of assembling higher angular momentum integrals increases, and if these are numerous, they dominate the overall cost. However, the number of these integrals decreases beyond some LTot value. This explains the presence of a peak followed by a decrease in the observed cycle counts. Aggregate cycle counts for each blocking factor, given in the ‘Total Cycles’ row, correspond to the cycle counts presented in Table 4.6.

The progression of bold values from the top of the table to the bottom shows that there are two blocking factors which give the lowest cycle count per LTot. From LTot = 0 to 4 it is the 32Kw blocking factor. Following this there is a switch to the 64Kw factor. This transition indicates that it could be beneficial to switch blocking factors at runtime.
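One way to quantify the possible benefit of switching is to compare the total cycle count obtained with the best blocking factor per LTot against the total for each single blocking factor. The sketch below assumes the measurements are held in a nested dictionary indexed by LTot and blocking factor; the helper names and the numbers are hypothetical placeholders, not data from Table 4.8.

```python
def best_factor_per_ltot(cycles):
    """Map each LTot to the blocking factor with the lowest cycle count."""
    return {ltot: min(row, key=row.get) for ltot, row in cycles.items()}


def total_single_factor(cycles, factor):
    """Total cycles when one blocking factor is used for every LTot."""
    return sum(row[factor] for row in cycles.values())


def total_with_switching(cycles):
    """Total cycles when the best blocking factor is chosen per LTot."""
    return sum(min(row.values()) for row in cycles.values())


# cycles[ltot][factor] -> cycle count; placeholder values (arbitrary units).
cycles = {
    0: {"32Kw": 10, "64Kw": 12},
    1: {"32Kw": 30, "64Kw": 33},
    2: {"32Kw": 8, "64Kw": 7},
}

print(best_factor_per_ltot(cycles))
print(total_single_factor(cycles, "32Kw"), total_with_switching(cycles))
```

If the switching total is only marginally below the best single-factor total, the extra complexity of changing the blocking factor at runtime is unlikely to pay off.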

To assess further the usefulness of switching blocking factors, we expand the scope of systems examined to include larger system sizes and bigger basis sets. Thus, we include the k300a-04 system with the larger 6-31++G(3df,3pd) basis set, and the k300a-08 system with both the 6-31G* and 6-31++G(3df,3pd) basis sets.

To aid presentation of data and facilitate comparison, we now switch to using plots for cycle count. We also include total L1 and L2 misses per LTot as a function of the cache blocking factor. These plots are given in Figures 4.1 and 4.2.

Data for the 6-31G* and 6-31++G(3df,3pd) basis sets are given in the first and second columns respectively. For each column there are three sub-plots which correspond to cycles, L1 and L2 misses. For each sub-plot the x-axis denotes LTot and the y-axis corresponds to the units for cycles, L1 and L2 misses.

For the k300a-04 system, cycle counts are reproduced from Table 4.8. As before, as LTot increases (from 0 to 8) the cycle count initially peaks and then gradually decreases, with the 32Kw blocking factor giving the lowest cycle count. Moving on to L1 misses for k300a-04 using the 6-31G* basis set, the curve has peaks which correspond to the peaks seen for cycle counts. In terms of the ordering of misses, the L1 miss curve for the 32Kw blocking factor lies between the 16Kw and 64Kw curves. L2 misses are an order of magnitude fewer than the L1 misses. The L2 misses also peak at the same values of LTot as the L1 misses, and the ordering of the miss curves is the same as for L1 misses. Unlike L1 misses, there is a large separation between the 256Kw and 1024Kw cases. This miss behaviour is to be expected. The Opteron has a 1MB on-chip L2 cache, thus blocking for 256Kw (i.e. a blocking factor which corresponds to a 4MB L2 data cache) and 1024Kw (i.e. a blocking factor which corresponds to a 16MB L2 data cache) will lead to greatly increased cache misses. A 64Kw (512KB) blocking factor is the upper bound on the Opteron, after which the L2 miss penalty becomes much larger and detrimental to performance. Further, we also note that there are cases (e.g. 6-31G*) where the blocking size which results in the lowest cycle count (32Kw) does not have the lowest L2 miss rate (2Kw). This indicates that there are potential processor pipelining issues which need to be investigated further.
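The sizes quoted above can be checked with a short calculation. The sketch assumes that one word is an 8-byte double precision value, which reproduces the 512KB figure given for 64Kw. The 4MB and 16MB cache sizes quoted for 256Kw and 1024Kw are twice the corresponding footprints, which suggests that a blocking factor targets half of the cache; that reading is an inference from the numbers, not something stated explicitly. The names below are illustrative.

```python
WORD_BYTES = 8                 # assumption: one word = one 8-byte double
L2_BYTES = 1 * 1024 * 1024     # 1MB on-chip L2 cache of the Opteron


def footprint_bytes(kilo_words):
    """Memory footprint of a blocking factor given in Kw (1 Kw = 1024 words)."""
    return kilo_words * 1024 * WORD_BYTES


def exceeds_l2(kilo_words, l2_bytes=L2_BYTES):
    """True when the blocked working set alone is larger than the L2 cache."""
    return footprint_bytes(kilo_words) > l2_bytes


for kw in (32, 64, 256, 1024):
    size_kb = footprint_bytes(kw) // 1024
    print(f"{kw:5d}Kw -> {size_kb:5d}KB, exceeds L2: {exceeds_l2(kw)}")
```

Under these assumptions the 256Kw and 1024Kw working sets (2MB and 8MB) cannot be held in the 1MB L2 cache, whereas 64Kw occupies half of it.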

Consider the plots for k300a-04 that use the larger basis set shown in Figure 4.1. For these plots, the value of LTot now varies from 0 to 12 due to the higher angular momentum functions in the basis set. As LTot increases, the cycle count curves gradually increase, peak at LTot = 5 and 6, and decrease thereafter. Though cycle counts vary by two orders of magnitude between 6-31G* and 6-31++G(3df,3pd), the curves for cycle count have the same features, albeit with the peaks shifted. This shift arises from the use of the larger basis set. Unlike k300a-04 with the 6-31G* basis set, the lowest cycle counts are obtained for a 64Kw blocking factor rather than 32Kw. L1 miss counts for the larger basis set have the same ordering of curves as the 6-31G* case. L2 misses for the larger basis set have the same ordering as L1 misses except for 32Kw. Interestingly, use of 32Kw with the larger basis set results in L2 misses which are almost comparable to those from a blocking factor that overflows the cache. Presumably this pathological behaviour is due to a subtle interplay between the system, basis set and input geometry.

Figure 4.1: Per LTot breakdown of cycle counts and total cache misses (L1 and L2) for the k300a-04 water cluster system using a 6-31G* basis set and the HF method on an AMD848 Opteron system.

Results for the k300a-08 system using the 6-31G* and 6-31++G(3df,3pd) basis sets are given in Figure 4.2. For k300a-08 using the 6-31G* basis set, trends for cycle counts are similar to those seen for k300a-04 with 6-31G*. Cycle counts peak at LTot = 3 and 2. The lowest cycle count was obtained for a 32Kw blocking factor. The ordering of miss curves and the trends for L1 and L2 misses using 6-31G* are similar to those seen in k300a-04 with 6-31G*.

In the case of the larger basis set for k300a-08, the ordering of the cycle count curves is similar to that for k300a-08 with the 6-31G* basis set. However, the cycle count peaks occur at LTot = 4 and 5. Cycle counts are two orders of magnitude larger than those obtained for 6-31G*. Unlike the k300a-04 system using the larger basis set, here the 32Kw blocking factor gives the lowest measured cycle counts per LTot. The ordering of L1 misses is identical across the k300a-04 and k300a-08 systems, while the ordering of L2 misses is very similar to that for k300a-08 with the 6-31G* basis set. Unlike k300a-04 with the larger basis set, the use of a 32Kw blocking factor is not pathological. It is interesting to note that as system and basis set size have increased, the 256Kw – 1024Kw blocking factors are now well clustered, indicating their higher L1 miss counts.

From the plots given in Figures 4.1 and 4.2 there are two observations on the blocking factors which gave the lowest cycle counts: (a) the blocking factor that provided the lowest cycle count for a given value of LTot is the one that gives the largest number of batches; (b) in all the plots there were at most two blocking factors that gave the lowest cycle counts per LTot (as in Table 4.6), but one blocking factor always outperformed the other in terms of obtaining the lowest cycle counts. Thus, it would be better to tailor the blocking factor to the entire calculation rather than to the LTot value.

Figures 4.1 and 4.2 considered cycle count and total L1 and L2 miss curves for a set of cache blocking factors, as a function of LTot. To augment these plots we now consider the FLOPs per LTot, with a view to using them to measure how well PRISM is able to use the floating point hardware as a function of cache blocking.

From earlier discussions it was noted that FLOP costs increase as a function of the angular momentum type and contraction length of the underlying basis set. Thus the expectation is that FLOPs should be greatest for those values of LTot which have the largest cycle count per LTot. Plots for the variation in floating point performance per LTot, i.e. FLOP count per cycle, are shown in Figure 4.3. FLOPs for each system are categorized by the basis set and molecular system. Hence on the left hand side of the figure, the first plot is for k300a-04 using a 6-31G* basis set, below which is the k300a-08 system using a 6-31G* basis set. The FLOP count per cycle is a derived quantity which is indicative of the average rate at which floating point operations are executed.

Figure 4.2: Per LTot breakdown of cycle counts and total cache misses (L1 and L2) for the k300a-08 water cluster system using a 6-31G* basis set and the HF method on an AMD848 Opteron system.

Considering the k300a-04 system with the 6-31G* basis set, the FLOPs decrease initially as LTot increases and then rise to a maximum before decreasing again. There is a slight increase at LTot = 8, which is to be expected as higher angular momentum quantities require more FLOPs. For k300a-04 and the 6-31++G(3df,3pd) basis set, there is a drop in the FLOP count at LTot = 1, followed by an increase to a maximal value and then a trailing decrease. With the k300a-08 system, the peak FLOPs occur around LTot = 5 for the 6-31G* basis set and at LTot = 4 for the 6-31++G(3df,3pd) basis set.

Cycle counts peak at LTot = 3 for 6-31G* (for both k300a-04 and k300a-08), and for 6-31++G(3df,3pd) at LTot = 5 (k300a-04) and LTot = 4 (k300a-08). However, the measured FLOP peaks do not correlate with these cycle count peaks. Thus, the initial expectation that FLOPs would peak at the peak values of cycle count is not valid. This observation is possibly a side-effect of the processor being stalled, waiting for the appropriate cache lines to be streamed from the L2 cache, and hence being unable to saturate the floating point functional units. Comparing the overall FLOPs per LTot between k300a-04 and k300a-08 for both basis sets, it is seen that the smaller k300a-04 system achieves better use of the on-chip floating point units than the larger k300a-08 system.
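The comparison between cycle count peaks and FLOP rate peaks can be expressed compactly. The sketch below assumes per-LTot FLOP counts and cycle counts are available as dictionaries; the helper names and all numbers are illustrative placeholders, not the measured data behind Figure 4.3.

```python
def flops_per_cycle(flops, cycles):
    """Derived quantity: FLOP count divided by cycle count for each LTot."""
    return {ltot: flops[ltot] / cycles[ltot] for ltot in cycles}


def peak_ltot(series):
    """LTot at which the series takes its largest value."""
    return max(series, key=series.get)


# Placeholder per-LTot counts (arbitrary units), NOT measured data.
cycles = {0: 100, 1: 200, 2: 400, 3: 800, 4: 600, 5: 300}
flops = {0: 50, 1: 80, 2: 240, 3: 400, 4: 420, 5: 240}

rate = flops_per_cycle(flops, cycles)
print("cycle count peak at LTot =", peak_ltot(cycles))
print("FLOPs per cycle peak at LTot =", peak_ltot(rate))
```

With these placeholder values the two peaks fall at different LTot, which is the kind of mismatch described above: the quartet type that consumes the most cycles need not be the one that sustains the highest floating point rate.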

To summarize the discussion in this sub-section on the effects of cache blocking as a function of quartet type: (a) it was seen that there are two blocking factors which yield the lowest cycle count per LTot for the k300a-04 system using a 6-31G* basis set and the HF method. By use of an expanded set of systems and basis sets, it was seen that (b) the blocking factor which gave the lowest cycle counts per LTot depends on both the molecular system and the basis set being used. For all the expanded systems, (c) the use of a single blocking factor was the better option, as execution time is skewed towards those values of LTot with a large number of batches and shell-quartets of larger angular momentum quantum number. Also, (d) the total FLOPs per cycle for a quartet type was not directly related to those values of LTot with the largest number of batches. This possibly indicates that PRISM’s use of the floating point hardware is being hampered by excessive L1 cache misses.

In document Performance Models for Electronic Structure Methods on Modern Computer Architectures (Page 102-107)