Modeling and Predicting Application Performance on Parallel Computers Using HPC Challenge Benchmarks

(1)

Modeling and Predicting Application Performance

on Parallel Computers Using HPC Challenge Benchmarks

Wayne Pfeiffer and Nicholas J. Wright

San Diego Supercomputer Center, La Jolla CA 92093-0505, USA

{pfeiffer, nwright}@sdsc.edu

Abstract

A method is presented for modeling application performance on parallel computers in terms of the performance of microkernels from the HPC Challenge benchmarks. Specifically, the application run time is expressed as a linear combination of inverse speeds and latencies from microkernels or system characteris-tics. The model parameters are obtained by an auto-mated series of least squares fits using backward elimination to ensure statistical significance. If nec-essary, outliers are deleted to ensure that the final fit is robust. Typically three or four terms appear in each model: at most one each for floating-point speed, memory bandwidth, interconnect bandwidth, and in-terconnect latency. Such models allow prediction of application performance on future computers from easier-to-make predictions of microkernel perform-ance.

The method was used to build models for four benchmark problems involving the PARATEC and MILC scientific applications. These models not only describe performance well on the ten computers used to build the models, but also do a good job of predict-ing performance on three additional computers with newer design features. For the four application benchmark problems with six predictions each, the relative root mean squared error in the predicted run times varies between 13 and 16%.

The method was also used to build models for the HPL and G-FFTE benchmarks in HPCC, including functional dependences on problem size and core count from complexity analysis. The model for HPL predicts performance even better than the application models do, while the model for G-FFTE systematically underpredicts run times.

1. Introduction

Predicting application performance on forthcoming parallel computers is often required as part of large high-performance computing (HPC) system

acquisi-tions. Typically benchmarks are run on one or more existing systems, and some form of extrapolation is made to predict future performance. The effectiveness of such extrapolations depends upon having suitable performance models for the benchmarks.

Here we explore a modeling method based on the assumption that application run time can be approxi-mated as a linear combination of inverse speeds and latencies mostly obtained from microkernels, in par-ticular, those contained in the HPC Challenge bench-mark set [1]. This allows application performance prediction from easier-to-make predictions of microk-ernel performance.

Such a method is similar to that described by McCalpin [2] for computation performance, but con-siders communication performance as well. Our method has the advantage of simplicity over more elaborate methods, such as those of Clement and Quinn [3], Kerbyson, et al. [4], Snavely, et al. [5], and Tikir, et al. [6].

The question, of course, is: “How accurate are the models that result from such a simple method, i.e., how well do they capture the dominant computation and communication performance behavior of applica-tions?” That is the subject of this paper.

2. Model equation

In general, the run time, t, of an application benchmark can be represented as the sum of three com-ponents:

€

t=t_comp+t_comm+t_io . (1) Here

€

t_comp≡ the computation time,

€

t_comm≡ the communication time that is not over-lapped with computation, and

€

t_io≡ the I/O time that is not overlapped with computation or communication.

The computation time can be further separated into components associated with floating-point performance and memory access. Similarly, the communication

(2)

time can be separated into components related to the interconnect bandwidth and latency. For the bench-marks considered here, the I/O time is negligible.

Our modeling method assumes that

€

t_i, the meas-ured run time of an application benchmark on com-puter i can be approximated by

€

ˆ

t _i, the corresponding model run time, where

€ ˆ t _i= c_jt_ij j

∑

. (2) Here €

t_ij is the inverse speed or latency for system characteristic j on computer i, and the application-specific coefficients or parameters,

€

c_j, are taken to be independent of computer.

For a given computer i, the

€

t_ij are the predictor variables in the model described by Eq. (2). Some of these will be associated with computation and some with communication, in accordance with the separation expected from Eq. (1). We take most of the possible predictors from the HPCC benchmark set, though two are specified directly from the system design. To be useful for predicting application performance on future computer systems, the predictors need to be easy to estimate in advance of system availability and few in number. In each of our models, we expect no more than four predictors.

3. Transformations for least squares fits

The model defined by Eq. (2) is linear in the pre-dictor variables. Thus, given measured values for

€

p

predictors and the application run time on each of

€

n computers (or distinct configurations, with

€

p≤n), the model parameters in Eq. (2) can be fit via least squares (or regression analysis) using standard statisti-cal techniques [7]. In conventional terminology, the application run time is the response variable.

We make two transformations of Eq. (2) before per-forming the least squares fits. The first normalizes the times (and parameters) to dimensionless form and is merely a convenience. The second weights the run times consistent with our expectation of their impor-tance and has a modest effect on the resulting fits.

In the first transformation we normalize the times to those for an arbitrarily chosen, reference computer r within our model-building set. Thus Eq. (2) is rewrit-ten as follows: € ˆ y _i≡ t ˆ i t_r = cj t_ij t_r j

∑

= c_j trj t_r       j

∑

tij t_rj = bjXij j

∑

, (3) where b_j≡c_j trj t_r , and (4) € X_ij≡ tij t_rj . (5)

Also, the normalized value of the measured run time (or response) is

€

y_i≡t_i t_r. (6)

For the reference computer, the normalized re-sponse,

€

y_r, and the normalized predictors,

€

X_rj, are equal to one. Hence, Eq. (3) shows that the normalized parameters,

€

b_j, should sum to approximately one for a good fit. This provides one of several checks on the goodness of fit.

An important assumption in least squares analysis is that the residuals (measured minus fit values) have a common variance. Since the run times can vary by severalfold for the computers considered here and are always positive, it seems more plausible that the rela-tive residuals (or percentage errors) have a common variance. This leads us to adopt a weighted fit, with the weights chosen inversely proportional to the squares of the measured run times. This will mini-mize the sum of the squares of the relative residuals.

The weighting is implemented by a second trans-formation. Specifically, we introduce a diagonal weight matrix with elements

€

W_ii≡1/y_i2 (7)

and multiply Eq. (3) by

€

W_ii1/ 2 (

€

=1/y_i). (See Ref. [8].) This transforms Eq. (3) to the following:

€ ˆ y _w,i≡ y ˆ i y_i = bj X_ij y_i j

∑

= b_jX_w,ij j

∑

, (8) where € X_w_,_i≡ Xij y_i . (9)

Eq. (8) can now be fit with least squares in the usual way, since the variance of the relative residuals is expected to be constant. Note that the measured run time after normalization and weighting is equal to one, i.e.,

€

y_w,i≡W_ii1/ 2y_i=1. (10) Another assumption in least squares analysis is that the elements of the predictor matrix X are error-free. This is not the case for most of the columns of X, which are obtained from HPCC measurements. Here we assume that the errors in X are small compared to those in the application run time and so can be ne-glected.

4. Least squares fits for model building

Important issues in performing least squares fits are deciding which predictors to include (in this case, HPCC metrics and system characteristics) and check-ing that the fits are consistent with the data and robust.

(3)

4.1. Picking the predictors

The number of possible predictors that we consider is comparable to the number of computers used to build our models (as discussed in the next two sec-tions). However, the number of predictors in any given model is smaller and is obtained by the succes-sive application of two routines in MATLAB [9].

1. For each application benchmark being modeled, a least squares fit of Eq. (8) to the measured run times of Eq. (10) is made enforcing the constraint that the parameters of the fit must be positive. This is done using the lsqnonneg routine and eliminates some of the possible predictors.

2. Although the resulting fit is suggestive and may be very good, typically some of the parameters (and associated predictors) are not statistically significant. To prevent such overfitting, backward elimination [7] is then applied, successively removing the least sig-nificant parameters until all that remain are statistically significant (i.e., greater than zero) at the 95% confi-dence level. This is done using the regstats rou-tine and the model matrix to specify which predictors are retained at each step.

4.2. Checking goodness and robustness of fit

Once a fit is obtained, various checks on the quality of the fit (and hence of the implied model) need to be made. The first thing to check is the root mean squared error: € RMSE≡ 1 n−p (y_i−y ˆ _i)2 y_i2 i

∑

        1/ 2 . (11)

This is the quantity that is minimized in our weighted least squares fits. With the weighting adopted here, the RMSE is an estimate of the relative error, and the frequently-used

€

R2 statistic is no longer meaningful. In general, the smaller the RMSE, the better the fit.

By convention, the residual of a measurement is the measured value minus the fit value, i.e.,

€

y_i−y ˆ _i here. Thus the residual is positive when the measure-ment is greater than the fit. Because of our weighting, the summation in Eq. (11) is over the squares of the relative residuals.

A more detailed check is to examine the individual residuals for outliers and influential measurements. Outliers have large relative residuals (either positive or negative), which increase the RMSE. Influential meas-urements (which are often outliers) significantly per-turb the fitted parameters and can lead to a fit that is not robust. Outliers frequently suggest a measurement problem, in which case they can be omitted. Alterna-tively, they may indicate a limitation of the model.

To increase our confidence in the validity of a model, we require that the fit be robust against all single-measurement deletions, i.e., the fit should not change significantly upon deletion of a single meas-urement. We take this to mean that the predictors re-main the same and that the parameters vary by at most a few percent. If this is not the case, then we selec-tively delete one or more measurements from the model data set until the remaining data are consistent enough to give a robust fit.

Specifically, if the fit is not robust, then we iden-tify the two most extreme outliers, i.e., the ones with the most positive and negative relative residuals. We then delete each separately, perform two new fits, and test for robustness. If either is robust, we are done. If not, we delete the most extreme outliers from these new fits and continue as before. Provided sufficient data are available, this procedure generally converges to a robust fit, as was found to be the case for all of the benchmark problems discussed here.

4.3. Collecting enough data for robust fits

A major challenge in model building is collecting sufficient data to have robust fits. For building our models of application performance, we have used two techniques to obtain additional data with only a mod-est number of computers.

First, on some computers we have made two sets of HPCC and application runs: the standard ones using all cores per node and additional ones using half the cores per node, but twice as many nodes. Typically, better performance is obtained with half the cores per node because there is less memory and interconnect contention. In such cases, each additional set of runs provides another effective computer configuration and another equation to be fit.

A second technique to obtain more fitting data in-volves using the IPM tool [10]. Among other things, IPM measures the communication run time (and hence communication fraction). Since IPM has negligible overhead, it can be used in the same run that measures the total run time. Each IPM measurement thus pro-vides another equation to be fit similar to Eq. (2), with the communication time on the left and only commu-nication terms in the sum on the right. In practice, more outliers appear to arise from the communication times than from the total times. This may be associ-ated with modest amounts of load imbalance or over-lap of communication with computation, which intro-duce more variability in the communication times.

When more information or alternate models exist that indicate how the terms in Eq. (2) vary with core count, it is then possible to combine data from multi-ple core counts into a single fit. We have used this third technique to build models for the HPCC com-plex synthetics.

(4)

Table 1. Computers used for model building and as prediction targets

Clock Peak

speed Flops/ Gflop/s Cores/

No. Processor (GHz) clock / core node System name Location Integrator & node Interconnect

Computers used for model building

1 AMD Opteron 2.2 2 4.4 2 Jacquard NERSC Linux Networx InfiniBand 4x SDR

2 AMD Opteron 2.6 2 5.2 2 Jaguar XT3 ORNL Cray XT3 3D torus

3 IBM Power3-II 0.375 4 1.5 16 Seaborg NERSC IBM Nighthawk 2 Colony

4 IBM PowerPC 440 0.7 4 2.8 2 Blue Gene Data SDSC IBM Blue Gene/L 3D torus + tree

5 IBM Power4+ 1.5 4 6.0 8 DataStar 1.5-GHz SDSC IBM p655 HPS (Federation)

6 IBM Power5 1.9 4 7.6 8 Bassi NERSC IBM p575 HPS (Federation)

7 Intel Itanium 2 1.5 4 6.0 2 Mercury NCSA IBM Tiger 2 Myrinet 2000

8 Intel Itanium 2 1.6 4 6.4 2+512 Cobalt NCSA SGI Altix 3700 NUMAlink 4 + IB 4x SDR

9 intel Xeon 3.2 2 6.4 2 Tungsten NCSA Dell PowerEdge 1750 Myrinet 2000

10 Intel Xeon (EM64T) 3.6 2 7.2 2 T2 NCSA Dell PowerEdge 1850 InfiniBand 4x SDR

Computers used as prediction targets

11 AMD Opteron 2.6 2 5.2 2 Jaguar XT4 ORNL Cray XT4 3D torus

12 Intel Xeon (Woodcrest) 2.66 4 10.6 4 Lonestar TACC Dell PowerEdge 1955 InfiniBand 4x SDR

13 Intel Xeon (Clovertown) 2.33 4 9.3 8 Abe NCSA Dell PowerEdge 1955 InfiniBand 4x SDR

Table 2. Metrics used to derive predictors and their values on DataStar

Metric group Flop speed Interconnect bandwidth Interconnect latency

Metric number 1 2 3 4 5 6 7 8 9 10 11

and name Clock Peak flop EP-DGEMM EP-STREAM EP-Random Random ring Natural ring Ping pong Random ring Natural ring Ping pong

speed speed/core speed Triad bw Access rate bandwidth bandwidth bandwidth latency latency latency

(GHz) (Gflop/s) (Gflop/s) (GB/s) (Gup/s) (GB/s) (GB/s) GB/s) (µs) (µs) (µs)

DataStar value on 64 cores 1.5 6.0 3.98 1.64 0.00210 0.223 0.634 1.56 8.98 7.25 5.53 256 cores 1.5 6.0 3.96 1.65 0.00210 0.153 0.586 1.49 9.93 7.92 5.79 512 cores 1.5 6.0 3.74 1.63 0.00218 0.126 0.584 1.49 10.31 8.23 5.99 1024 cores 1.5 6.0 3.73 1.71 0.00219 0.091 0.594 1.47 10.86 8.46 6.10 Memory bandwidth

5. Computers used for model building and

as prediction targets

To build our models we collected HPCC data and application run times on ten different computers. We also obtained data on three more recently installed computers, which are targets for prediction. Table 1 lists information on these 13 computers, all of which are at National Science Foundation or Department of Energy supercomputer centers. (Here the Cray XT3 and XT4 partitions of Jaguar are treated as different computers.)

The computers in this study encompass most of the processor and interconnect types available on high-end systems in recent years. They also span a sizable range of speeds and years of system installation.

HPCC data were obtained on 256 cores for all con-figurations. Additional HPCC data were obtained on 64, 512, and 1024 cores for some configurations. Ap-plication data were collected on 64 and 256 cores of all computers except Blue Gene, where there was not enough memory to run some of the problems.

Data were also obtained on as many as eight more effective computer configurations by using half the cores per node and twice as many nodes. These data give useful information on the effects of memory and interconnect contention, as noted previously. In addi-tion, IPM measurements of the communication time were obtained on several computers, further increasing the available data.

6. Predictors based primarily on HPCC metrics

The HPCC benchmark set [1] consists of seven synthetic benchmarks: three targeted and four complex. The targeted synthetics are DGEMM, STREAM, and bench_lat_bw. These are microkernels to quantify basic system parameters that separately characterize computation and communication performance. The complex synthetics are HPL, FFTE, PTRANS, and RandomAccess. These combine computation and communication, although FFTE and RandomAccess also have embarrassingly parallel (EP) variants without communication. Most of the synthetics report more than one metric from which suitable predictors can be chosen.

We consider nine predictors from the HPCC benchmark set and two from system specifications, giving eleven overall. The metrics used to derive the predictors are listed in Table 2. Also listed are metric values on various core counts of DataStar, which we take as our reference computer configuration.

The metrics underlying the predictors fall into four groups related to flop speed (the speed of floating-point operations), memory bandwidth, interconnect bandwidth, and interconnect latency. Each group con-tains three possible predictors, except for memory bandwidth, which has only two. Flop speed and memory bandwidth are correlated with computation performance, while interconnect bandwidth and latency are correlated with communication performance. Note

(5)

that inverse speeds, i.e., reciprocals of speed, are used as predictors in the first three groups. Out of the eleven possible predictors, we expect that no more than four will typically have statistically significant parameters for a given application model. These would be at most one from each of the four groups.

The first two flop-speed metrics that we consider are the clock speed and peak flop speed per core. These are determined from system specifications rather than HPCC. The third flop speed considered is the EP variant of DGEMM, which measures the speed of ma-trix-matrix multiplication.

The two memory-bandwidth metrics we use are the EP variants of the STREAM Triad bandwidth and the RandomAccess rate. The former measures the band-width for unit-stride memory access. The latter meas-ures the bandwidth for random memory access and is inversely proportional to the memory latency.

The EP or “Star” metrics in HPCC are measured with all cores in a node doing the same computation simultaneously. Doing so accounts for the memory contention that is typical of applications.

To model communication performance, we consider six predictors, all obtained from three pairs of meas-urements by bench_lat_bw in HPCC. The relevant pairs are the interconnect bandwidth and interconnect latency for the randomly ordered ring (RR), naturally ordered ring (NR) and average ping pong (PP) tests.

All five of the computation predictors are essen-tially independent of core count (as can be seen from the DataStar data in Table 2). Likewise, apart from the RR bandwidth, the interconnect bandwidths and laten-cies vary little with core count (except on Abe and on 1024 cores of Cobalt, where the interconnect topology changes). Consequently, predictors measured at one core count can often be used for other core counts as well.

All of our HPCC results are for baseline runs. That is, no code changes were made. The only tuning was the use of optimal compiler flags and optimized librar-ies for the Basic Linear Algebra Subprograms (BLAS).

Some of the possible predictors are highly corre-lated with each other, especially within each of four groups. An important consideration is how well our method selects between such correlated predictors to obtain a robust fit.

7. Models for applications

To assess the quality of the models generated by our method, we consider two scientific applications – PARATEC [11] and MILC [12]. These materials sci-ence and physics codes are used extensively at NSF and DOE supercomputer centers. Moreover, recent system acquisitions have required performance predic-tions for benchmark problems on both of these

applica-tions, so having models to make such predictions would be very useful.

For each application we collected results for two problem sizes run on different core counts: medium problems on 64 cores and large problems on 256 cores. The PARATEC results are for baseline runs, with code changes made only to ensure correct execution. Tuning consisted of the use of optimal compiler flags and optimized libraries for LAPACK, ScaLAPACK, BLACS, and FFTW. No changes were made to the input files.

For MILC, some code changes were made to opti-mize performance, especially on systems with AMD processors.

7.1. PARATEC

PARATEC “performs ab initio quantum-mechanical total energy calculations using pseudopo-tentials and a plane wave basis set.” [11]

We consider the two standard benchmark problems, both for silicon in the diamond structure [13]. The medium problem has 250 silicon atoms, while the large problem has 686 atoms. We modeled the me-dium problem on 64 cores and the large problem on 256 cores.

7.1.1. Medium problem. We begin with the medium problem on 64 cores, for which slightly more run-time data are available. HPCC predictor data were obtained on 64 cores for only some of the computers (including DataStar). We used 64-core predictors normalized to DataStar when available and 256-core normalized pre-dictors otherwise. In addition, we excluded the RR bandwidth from our predictor set, since the correspond-ing data on 256 cores may not be appropriate for 64 cores.

First we fit the total run-time data for computers with all cores per node. We have such data for nine of the ten model-building computers; the computer with missing data is Blue Gene, which does not have enough memory to run this problem using both cores of its nodes. lsqnonneg indicates that a fit with seven positive parameters is possible (excluding the RR bandwidth). However, backward elimination leaves only four parameters that are statistically sig-nificant. These parameters and their values are listed in the first row of data in Table 3 along with the R M S E of 3.9%, which indicates an excellent fit. (Throughout the text, we report RMSE and relative residual values in percent.)

With so few data, the four-parameter fit is not ro-bust. We confirmed this by deleting each run-time measurement in turn and repeating the fit. The nine fits so obtained (of eight measurements each) have widely varying parameters and values.

(6)

Table 3. Parameters for application fits

Data added (+) Inverse Inverse

Predic-or deleted (-) flop-speed interconnect bandwidth Interconnect latency Model tion

Benchmark Run Half Comm Out- Ro- parameters parameters parameters building target

problem times c/n times liers bust? b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 RMSE RMSE

PARATEC medium on 9 no 0.755 0.135 0.047 0.078 0.039 0.077 64 cores 14 + no 0.741 0.142 0.037 0.088 0.056 (tr=808 s) 20 + + yes* 0.796 0.046 0.131 0.075 0.154 PARATEC 9 no 0.309 0.377 0.280 0.065 0.172 large on 13 + no 0.625 0.152 0.206 256 cores 18 + + no 0.412 0.327 0.188 (tr=1,557 s) 16 + + - yes 0.295 0.350 0.316 0.087 0.127 MILC 10 no 0.243 0.638 0.090 0.143 0.301 medium on 15 + no 0.278 0.472 0.102 0.190 64 cores 21 + + no 0.272 0.352 0.053 0.227 (tr=440 s) 19 + + - yes* 0.256 0.343 0.053 0.213 0.138 MILC 9 no 0.316 0.609 0.040 0.079 0.088 0.181 large on 14 + no 0.363 0.549 0.094 0.128 256 cores 19 + + no 0.387 0.613 0.062 0.220 (tr=3,749 s) 16 + + - yes 0.372 0.573 0.082 0.127 0.164

DataStar reference times are in parenthesis. Parameters in bold are used for the fits in Figure 1. For the * fits, one of the single-measurement deletions changes an interterconnect predictor.

Inverse parameters memory bandwidth

Accordingly, we added data for five more computer configurations corresponding to using half the cores per node. The resulting fit still has four parameters, as shown in the second row of data in Table 3, and is very similar to the first fit. However, the fit is much more robust; in just three of 14 cases does single-measurement deletion lead to a markedly different fit. With an RMSE of 5.6%, the fit is only slightly poorer than before.

Finally we added six measurements of the commu-nication time obtained by IPM. This leads to a further refined fit with only three parameters, as shown in the third data row of Table 3. The inverse flop-speed pa-rameter has increased slightly, the inverse memory bandwidth parameter has disappeared, the inverse inter-connect bandwidth parameter is essentially unchanged, and the interconnect latency parameter has switched from the PP metric to the RR metric. At the same time, the RMSE has increased a little more to 7.5%. Of special note is that the resulting fit, which we take as our reference, is now robust, i.e., there is no signifi-cant variation of the fit in response to deletion of a single measurement (out of 20). (One deletion sug-gests use of the PP latency instead of the RR latency, but the associated parameter values are similar and small, so the overall fit is essentially unchanged.)

The relatively small value for the RMSE indicates that the reference fit is very good. The goodness of fit can be further checked by examining the individual values for the relative residuals, from which the RMSE is computed. These values, which are plotted in Fig-ure 1a, are relatively small too (as expected) and show no systematic variations.

As a check on the reasonableness of the fit, note that flop speed dominates overall performance, since the inverse flop-speed parameter is much larger than the other two fit parameters. This dominance is

con-sistent with the findings of Oliker, et al., [14], who reported that PARATEC achieved a very high fraction of peak performance on several systems for a similar test problem.

For reference, it is useful to have an explicit for-mula for the fitted run time. This follows from Eqs. (3) to (5), the fit parameters and absolute run time on DataStar in Table 3, and the predictor metrics for DataStar in Table 2. The resulting formula is

€ ˆ t _i= 0.7963.98Gflop/s s_i3 +0.046 1.56GB/s s_i8    € +0.131 ti9 8.98ms    ×808s. (12) Here € s_i3, € s_i₈, and €

t_i₉ are, respectively, the DGEMM speed, PP bandwidth, and RR latency on computer i.

The model given by Eq. (12) provides a good de-scription of performance on the ten computers on which it is based. However, we also want the model to be predictive as well as descriptive. Thus, we used it to predict performance for three newer computers – the Jaguar XT4 at ORNL, Lonestar at TACC, and Abe at NCSA – none of which were used to build the model. These computers have many similarities, but some important differences as compared to the earlier XT3 and T2 systems. In particular, the XT4 has better memory and interconnect bandwidths compared to the XT3, while Lonestar and Abe have more flops/clock, but slower clock speeds and more cores per node than T2.

The agreement between the predicted and measured results (corresponding to six total run times on the three newer computers) can be seen graphically from the location of the asterisks in Figure 1a. Three of the relative run-time residuals are less than 5% (in absolute value), while the other three are between 14 and 28% (in absolute value). As shown in the last column of

(7)

Figure 1. Relative residuals (

€

(y_i−y ˆ _i)/y_i) versus fit values (

€

ˆ

y _i) for normalized run times

of applications. Circles show total run times using all the cores per node, squares show total run times using half the cores per node, and triangles show communication run times, all on computers used to build the models. Solid symbols are for o u t -liers excluded from the model fits. Asterisks show total run times on computers that are prediction targets.

Table 3, this leads to a larger RMSE for the predictions (15.4%, per Eq. (11) with

€

n−p=6) as compared to the RMSE for the model-building data (7.5%). Never-theless, the relative residuals are small enough that the predictions should prove useful.

7.1.2. Large problem. We used a similar analysis to build a performance model for the PARATEC large problem on 256 cores. Table 3 lists the parameters for three successive fits as before, starting with nine total run times using all the cores per node, then adding four total run times using half the cores per node, and

finally including five communication run times. The changes between the fits are much more substantial than for the medium problem considered previously, and none of the fits are robust against all possible dele-tions of individual measurements.

The third fit (based on all 18 measurements) con-tains a significant outlier with a large negative residual (from Seaborg using half the cores per node). Deleting that outlier, which is also an influential measurement, gives a fit that is much better with a completely differ-ent set of parameters (in fact, the same set as for the first fit). This fit is still not robust, and contains yet

(8)

another outlier, this time having a large positive resid-ual (from the Seaborg communication time). Deleting this second outlier gives a fit (based on 16 measure-ments) that is better still and robust against all single-measurement deletions. With an RMSE of 8.7%, this final fit (which we take as our reference) is only slightly poorer than the final fit for the medium prob-lem. The goodness of the fit is also evident from the relative residuals of the run times shown in Figure 1b. (Note that the outliers are included in the figure, even though they were not used in the final fit.)

Although the values of the parameters are similar for both the initial and final fits, the robustness of the final fit provides greater confidence in its predictive ability. Indeed, the predicted run times for the six newer computer configurations are in reasonable agreement with the measured run times, as shown by the relative residuals plotted as asterisks in Figure 1b. The RMSE for the predictions is 12.7%, and all of the predictions have relative residuals smaller than 23% (in absolute value).

A further observation is that the model for the large problem is appreciably different from that for the me-dium problem. There is a sizeable parameter for the inverse memory bandwidth, and communication is more important (as reflected by the magnitude of the interconnect latency parameter). Indeed, the three model parameters for the large problem are all of com-parable magnitude. Moreover, the appearance of the inverse memory bandwidth parameter is apparently associated with the buffering of MPI data in memory, which is a side effect of communication.

The communication time can be reduced and scal-ing improved by adjustscal-ing the parameter num-ber_bands_fft in the input file, thereby allowing messages to be aggregated. We did not take advantage of this, because we only had a complete set of data with the default input file.

7.2. MILC

“The MILC Code is a body of high performance re-search software written in C for doing SU(3) lattice gauge theoryon several different (MIMD) parallel computers in current use.” [12]

There are three standard benchmark problems corre-sponding to increasingly large lattices [13]. We con-sider two of these: the medium problem using a 32^4 lattice and the large problem using a 64^4 lattice. Similarly to PARATEC, we modeled the medium problem on 64 cores and the large problem on 256 cores.

To develop models for the MILC problems, we fol-lowed the same procedure as for the PARATEC prob-lems. That is, we successively pooled the available run-time data from the first ten computers in Table 1 to develop model fits. Also, we prevented the RR

bandwidth from appearing in the fits for the medium problem.

The initial fits were not robust for either problem, so we selectively deleted outliers until robust fits were obtained. The evolution in the values of the parame-ters and the RMSE over four successive fits for each MILC problem are listed in Table 3.

The final, reference fits have RMSE values over the model-building data that are somewhat worse than those obtained before: 21.3% for the medium problem and 12.7% for the large problem (excluding the out-liers). These are still reasonable, however, as are the RMSE values for the predictions: 13.8% for the me-dium problem and 16.4% for the large problem. These latter values are comparable to those obtained for PARATEC. Graphical displays of the goodness of fit are shown in Figures 1c and 1d,

Examination of the parameter values listed in Table 3 shows that performance is dominated by the memory bandwidth, especially for the large problem. This is consistent with the observation of Gottlieb [15] and with scaling scans that show superlinear speedup at higher core counts. Also, the appearance of the clock speed rather than the other flop speed parameters sug-gests that the code can make only limited use of the extra flops/clock on many of the computers.

8. Models for HPCC complex synthetics

As a further test of our modeling method, we apply it to two of the HPCC complex synthetics: HPL and G-FFTE. On the one hand, their code is much simpler than that of full applications. On the other hand, they introduce a further complication, in that their problem size varies between computers depending upon the memory available.

Thus we need to correct for the varying problem size to compare run times between computers at the same core count. We will also find it useful to com-pare performance for runs made at different core counts, so we make another correction for that as well.

Making such corrections requires complexity for-mulas for the relevant algorithms, i.e., forfor-mulas for the operation count dependence on the numerical parame-ters of the algorithm. Such formulas are equivalent to performance models, once the constants or coefficients of each term are known. Thus, our models can be viewed as calibrating the coefficients in the complexity formulas.

The formulas we use involve one computation term and two communication terms. These formulas, which modify Eqs. (2) and (3), are discussed in the appendix.

8.1. HPL

HPL is the high-performance version of the Linpack benchmark [16]. It solves a dense linear system of

(9)

Table 4. Parameters for HPL and G-FFTE fits

Data added Inverse Inverse

Predic-or deleted flop-speed interconnect bandwidth Interconnect latency Model tion

Benchmark Run Half Comm Out- Ro- parameters parameters parameters building target

problem times c/n times liers bust? b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 RMSE RMSE

HPL on 10 no 0.925 0.025 0.021 0.022 0.058 256 cores 15 + no 0.929 0.027 0.025 0.020 0.031 (Nr=192,000, 23 + + no 0.952 0.011 0.026 0.136 0.037 sr=0.888 Tflop/s) 22 + + - yes 0.955 0.011 0.024 0.100 0.038 HPL on 32 + no 0.985 0.010 0.013 0.049 0.087 256, 512, & 31 + - yes 0.936 0.009 0.013 0.018 0.028 0.078 1,024 cores 49 + + yes 0.966 0.013 0.005 0.020 0.136 0.080 G-FFTE on 10 no 1.067 0.196 0.235 0.383 256 cores 15 + no 0.146 0.581 0.256 0.199 0.297 (mr=4,294,967,296, 23 + + no 0.331 0.464 0.191 0.243 0.193 sr=29.1 Gflop/s) 20 + + - yes 0.324 0.451 0.195 0.189 0.198 G-FFTE on 32 + no 0.207 0.576 0.233 0.183 0.241 256, 512, & 29 + - yes 0.218 0.544 0.225 0.118 0.249 1,024 cores 49 + + yes 0.312 0.531 0.182 0.230 0.214

DataStar reference values are in parenthesis. Parameters in bold are used for the fits in Figure 2. Inverse memory bandwidth

parameters

equations. Since most of the run time is spent in DGEMM, HPL is expected to perform similarly, and this should be reflected in our model.

Performance is primarily controlled by four numeri-cal parameters:

-€

N_i is the matrix order;

-€

P_i and

€

Q_i describe the mapping of the matrix to the processor grid, with

€

P_i×Q_i equal to the number of processor cores,

€

p_i;

-€

NB_i is the block size used to improve cache reuse. The dominant parameters for HPL are

€

N_i and

€

p_i. The complexity of computation and communication of all the HPCC benchmarks at constant

€

p_i is summa-rized by Luszczek, Dongarra, and Kepner [17]. A more detailed analysis for HPL, including separate treat-ments of the communication bandwidth and latency terms along with their

€

p_i dependence, is in the “Scal-ability” discussion of Ref. [16]. The associated formu-las and their incorporation in our model equation are discussed in Appendix A.1.

Using the HPCC predictor data, augmented by val-ues for the HPL speed and

€

N_i, we made model fits for HPL running on 256 cores. The parameters ob-tained in a manner analogous to that used for the ap-plication fits are summarized in the first block of re-sults in Table 4. All of the fits are similar and good. After deleting the largest communication time outlier (from DataStar), the final fit for 256 cores is robust and has an R M S E of 10.0% over the 22 run-time measurements used in the fit.

Additional HPCC data were collected on 512 and 1,024 cores and then added to those on 256 cores. This increased the number of total run times from 15 to 32 and the number of communication times from 8 to 17 on the ten computers used to build our models. We then used these data to build HPL model fits, in-cluding the dependence on

€

p_i as well as

€

N_i. The resulting parameter and RMSE values for three such fits

are in the second block of Table 4. Again, all three fits are similar, and the last two are robust.

We take the next-to-last fit for HPL in the table, which is highlighted in bold, as our reference. It has a very small RMSE of 2.8% after deletion of a single outlier (from Cobalt). Because sufficient total run-time data are available to ensure robustness, we have not included the communication run times in our refer-ence fit. Including them introduces many outliers and greatly increases the RMSE without materially chang-ing the parameters of the fit. The exceptionally good fit is apparent from the relative run-time residuals shown in Figure 2a.

The dominant parameter by far is the inverse DGEMM speed, as expected. Two inverse intercon-nect bandwidths enter along with an interconintercon-nect la-tency, but all are small.

An explicit formula for the HPL speed, including the dependence on

€

N_i and

€

p_i, can be obtained by combining Eqs. (A2) through (A12) with the fit pa-rameters in Table 4 and various quantities for DataStar in Tables 2 and 4. The resulting formula is

€ ˆ s _i= 0.9363.96Gflop/s s_i3    + 0.0090.153GB/s s_i6    € +0.0131.49GB/s s_i8    pi 1/ 2_/_N i 256 /192,000 € + 0.018 ti10 7.92ms       pilog2pi/Ni 2 256×8 /192,0002    −1 € × pi 2560.888Tflop/s. (13) Here € s_i₃, € s_i6, € s_i₈, and €

t_i₁₀ are, respectively, the DGEMM speed, RR bandwidth, PP bandwidth, and NR latency on computer i.

We used the model given by Eq. (13) to predict HPL speeds on six different core-count and core-per-node configurations of the XT4 and Lonestar. The

(10)

Figure 2. Relative residuals (

€

(y_i−y ˆ _i)/y_i) versus fit values (

€

ˆ

y _i) for normalized run times

of HPL and G-FFTE. Circles show total run times on computers used to build t h e models. Solid symbols are for outliers excluded from the model fits. Asterisks show total run times on computers that are prediction targets, while stars show total run times on additional computers at high core counts.

individual values of the relative run-time residuals are shown by the asterisks in Figure 2a. The associated RMSE is 7.8%, which is quite good, despite a notable outlier from Lonestar.

We also used our model to predict HPL speeds at high core counts on three selected computers with baseline data at the HPCC Web site [1]. Results on these computers – the XT3 at 4,096 cores, Blue Gene/L at 65,536 cores, and ASCI Purple at 8,192 cores – are shown by stars in Figure 2a. The measured and predicted results are in good agreement even on 65,536 cores of Blue Gene/L, where the relative run-time residual is 8.8%.

8.2. G-FFTE

G-FFTE does a complex one-dimensional Fast Fourier Transform (FFT) on a very long array of length

€

m_i. Demmel [18] presents the results of complexity analyses for two possible algorithms: block FFT and FFT with transpose, which differ in their communica-tion dependence on

€

p_i. Choosing the more appropri-ate FFT with transpose leads to the formulas given in Appendix A.2.

Proceeding in a manner similar to that for HPL, we generated a series of model fits, obtaining the parame-ters listed in the third and fourth blocks of Table 4. For all of the fits, the inverse memory bandwidth dominates performance, as expected. Note that both of the inverse memory bandwidth parameters appear in most of the fits, including the next-to-last one, which

we take as our reference. This behavior is consistent with a mix of regular and random memory accesses. The only interconnect parameter that enters the final fit is the inverse RR bandwidth.

Figure 2b shows the relative run-time residuals on the computers used to build the model as well as on those used as prediction targets. The fit across the model-building computers is good, especially after deleting three outliers, which leads to an RMSE of 11.8% (per Table 4). However, all of the residuals on the prediction targets are positive, i.e., the measured times are systematically higher than the fit times, and the associated RMSE is 24.9%. The reason for the systematic underprediction of the run times is not known. The relative residuals for the three selected computers at high core counts are also positive, though that on 65,536 cores of Blue Gene/L is only 3.6%.

9. Discussion of limitations

Our modeling method does remarkably well at gen-erating reasonable performance models, despite its simplicity. Nevertheless, the method does have its limitations.

• Differences in cache size between computers are not treated. A method similar to that of McCalpin [2] could be used to treat that and might improve models where memory bandwidth is important.

• The correspondence between the communication time and the sum of the interconnect terms in the model is only approximate. Complicating factors are

(11)

load imbalance and overlap of communication with computation, which evidently lead to an appreciable number of communication time outliers.

• Differences in the level of software tuning, whether via compiler flags, mathematical libraries, or reprogramming, are not explicitly modeled.

• Sufficient data are needed to get robust fits. Our experience suggests that 15 to 20 measurements are needed for a given benchmark problem to fit three or four parameters.

10. Summary and conclusions

We have presented a performance modeling method that approximates the run time of an application on a parallel computer as a linear combination of predictors. Each predictor is an inverse speed or latency obtained from a microkernel or system characteristic, with all of the microkernels from the HPCC benchmark set. Model generation begins with measured values for the predictors and application run times on a collection of computers at a common core count. Then an auto-mated series of least squares fits is made using back-ward elimination to ensure statistical significance of the model parameters. If necessary, outliers are deleted to ensure that the final fit is robust.

We constructed performance models for four benchmark problems involving the widely used PARATEC and MILC applications. In all cases, the fits describe the measurements well on the ten comput-ers on which they are based, and the dominant parame-ters in the fits are those expected from separate investi-gations. In addition, the fits predict performance well on three newer computers not used to build the mod-els. For the four application benchmark problems with six predictions each, the relative RMSE in the predicted run times varies between 12.7 and 16.4%. This is only moderately higher than the RMSE of 11.7% (cor-responding to an average absolute value of the relative error of 9.3%) obtained by Tikir, et al. [6] on another set of application benchmarks using a more elaborate and time-consuming method.

We also used our method, augmented by complex-ity analyses, to construct models for two of the HPCC complex synthetics: HPL and G-FFTE. These models account for variations in problem size and core count. The model for HPL is quite good, while that for G-FFTE systematically underpredicts run times. Both models predict performance well on 65,536 cores of Blue Gene/L, even though the model-building data were collected on ≤1,024 cores.

In summary, our method offers a straightforward way to capture the dominant performance behavior of applications in models that are easy-to-understand and reasonably accurate in most cases.

11. Acknowledgments

Insightful discussions with Tzu-Yi Chen and Allan Snavely at SDSC were most helpful and greatly appre-ciated at the outset of this project. Greg Bauer and Nahil Sobh of NCSA provided many results on sys-tems with Intel processors as part of the Cyberinfra-structure Partnership supported by the National Science Foundation. John Shalf, Harvey Wasserman, and An-drew Canning of NERSC provided results on systems at Department of Energy sites. Valuable help in mak-ing additional benchmark runs was generously given by Bauer, Kent Milfeld of TACC, and Roger Golliver and Mike Greenfield of Intel. The work described here was supported, in part, by NSF, which together with DOE funded operation of the computers used in this study.

12. References

[1] HPCC, icl.cs.utk.edu/hpcc.

[2] J.D. McCalpin, “Composite Metrics for System Throughput in HPC,” www.cs.Virginia.edu/~mccalpin/ SimpleCompositeMetrics2003-12-08.pdf.

[3] M.J. Clement and M.J. Quinn, “Automated perform-ance prediction for scalable parallel computing,” Parallel

Computing, vol. 23, pp. 1405-1420 (1997).

[4] D.J. Kerbyson, H.J. Alme, A. Hoisie, F. Petrini, H.J. Wasserman, and M. Gittings, “Predictive Performance and Scalability Modeling of a Large-Scale Application,” Proc.

SC2001, Denver, CO (2001), www.sc2001.org/papers/

pap.pap255.pdf.

[5] A. Snavely, L. Carrington, N. Wolter, J. Labarta, R. Badia, and A. Purkayastha, “A Framework for Application Performance Modeling and Prediction,” Proc. SC2002, Baltimore, MD (2002),

www.supercomp.org/sc2002/paperpdfs/pap.pap201.pdf. [6] M.M. Tikir, L. Carrington, E. Strohmaier, and A. Snavely, “A Genetic Algorithms Approach to Modeling the Performance of Memory-bound Applications,” Proc.

SC2007, Reno, NV (2007), sc07.supercomputing.org/

schedule/pdf/pap255.pdf.

[7] N.R. Draper and H. Smith, Applied Regression

Analy-sis (Third Edition), John Wiley & Sons (1998).

[8] R.D. Cook and S. Weisberg, Residuals and Influence

in Regression, p. 209, Chapman and Hall (1982).

[9] The MathWorks – MATLAB – The Language of Tech-nical Computing, www.mathworks.com/products/matlab. [10] IPM: Integrated Performance Monitoring,

ipm-hpc.sourceforge.net.

[11] PARAllel Total Energy Code (PARATEC), www.nersc.gov/projects/paratec.

[12] C. Detar, The MILC Code (version: 6.20sep02), www.physics.utah.edu/~detar/milc/milcv6.html.

[13] NERSC5 Benchmarks, www.nersc.gov/projects/ SDSA /software/?benchmark=NERSC5.

[14] L. Oliker, et al., “Leading Computation Methods o n Scalar and Vector HEC Platforms,” Proc.SC|05, Seattle, WA (2005), http://sc05.supercomputing.org/schedule/ pdf/pap293.pdf

(12)

[15] S. Gottlieb, MILC QCD Code Benchmarks, physics.indiana.edu/~sg/milc/benchmark.html.

[16] A. Petitet, R.C. Whaley, J. Dongarra, and A. Cleary, HPL – A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers, www.netlib.org/benchmark/hpl.

[17] P. Luszczek, J. Dongarra, and J. Kepner, “Design and Implementation of the HPC Challenge Benchmark Suite,”

CTWatch Quarterly, vol. 2 (4A), pp. 18-23 (November

2006), www.ctwatch.org/quarterly/articles/2006/11/ design-and-implementation-of-the-hpc-challenge-benchmark-suite.

[18] J. Demmel, CS 267 Applications of Parallel Comput-ers Lecture 24: Solving Linear Systems arising from PDEs – I, www.cs.berkeley.edu/~demmel/cs267_Spr99/

Lectures/Lect_24_1999-new.ppt.

Appendix: Formulas for HPCC complex

synthetics

The formulas we use involve one computation term and two communication terms. Accordingly, we mod-ify Eq. (2), the basic form of the model, as follows:

€ ˆ t _i= ncomp,i/pi n_comp_,_r/p_r cjtij j comp

∑

+ ncomm,bw,i/pi n_comm_,_bw_,_r/p_r cjtij j comm,bw

∑

€ + ncomm,lat,i/pi n_comm,lat,r/p_r cjtij j comm,lat

∑

. (A1) Here € n_comp_,_i, € n_comm_,_bw_,_i, and €

n_comm_,_lat_,_i are the opera-tion counts for computaopera-tion, bandwidth-limited com-munication, and latency-limited comcom-munication, re-spectively, while

€

p_i is the number of processor cores. Eq. (A1) reduces to Eq. (2) when computer i solves the same problem as computer r on the same number of cores. Note that

€

n_comm_,_bw_,_i and

€

n_comm,lat,i typically depend explicitly upon

€

p_i, whereas

€

n_comp_,_i does not. Proceeding as in Section 2, Eq. (A1) can be rewrit-ten in normalized form as follows:

€ ˆ y _i≡ ˆ t i/(ncomp,i/pi) t_r/(n_comp_,_r/p_r)= p_i/ ˆ s _i p_r/s_r = bjXij comp j comp

∑

€ + b_jX_ijcomm,bw j comm,bw

∑

+ b_jX_ijcomm,lat j comm,lat

∑

. (A2) Here €

b_j is given by Eq. (4), as before, but now there are three variants of the normalized predictors:

€

X_ijcomp≡ tij

t_rj, (A3)

€

X_ijcomm.bw≡ (ncomm,bw,i/ncomp,i) (n_comm,bw,r/n_comp,r)

t_ij

t_rj, (A4) X_ijcomm.lat≡ (ncomm,lat,i/ncomp,i)

(n_comm,lat,r/n_comp,r)

t_ij

t_rj . (A5)

Also appearing in Eq. (A2) is the overall speed:

€

s_i≡n_comp_,_i/t_i. (A6) Likewise, the normalized value of the measured run time is now € y_i≡ ti/(ncomp,i/pi) t_r/(n_comp,r/p_r)= p_i/s_i p_r/s_r. (A7)

Before fitting Eq. (A2) by least squares, we also in-troduce weights that are inversely proportional to the squares of the normalized values of the measured run time. This weighting is done by another transforma-tion, just as described in Section 3.

A.1. HPL

The complexity of computation and communication of all the HPCC benchmarks at constant

€

p_i is summa-rized by Luszczek, Dongarra, and Kepner [17]. A more detailed analysis for HPL, including separate treat-ments of the communication bandwidth and latency terms along with their

€

p_i dependence, is in the “Scal-ability” discussion of Ref. [16]. The latter implies that

€

n_comp_,_i∝N_i3, (A8)

€

n_comm,bw,i∝N_i2p1/ 2_i , (A9)

€

n_comm_,_lat_,_i∝N_ip_ilog₂ p_i, (A10) so

€

n_comm,bw,i/n_comp,i∝p_i1/ 2/N_i, and (A11)

€

n_comm_,_lat_,_i/n_comp_,_i∝p_ilog₂ p_i/N_i2. (A12) The last two equations give the expressions needed for the corrections in Eqs. (A4) and (A5), assuming that the constants of proportionality are the same for computers i and r. (Careful examination of the formu-las in Ref. [16] shows that the constants of proportion-ality in Eqs. (A11) and (A12) have a weak dependence upon

€

P_i/Q_i. This ratio differs depending upon whether

€

P_i×Q_i= p_i is an even or odd power of 2, but the effect on the constant is small and ignored here.)

A.2. G-FFTE

For the FFT with transpose, the complexity analy-sis results of Demmel (on Slide 37 of Ref. [18]) lead to the following formulas:

€

n_comp,i∝m_ilog₂m_i, (A13)

€

n_comm,bw,i∝m_i, (A14)

€

n_comm_,_lat_,_i∝p_i2, (A15) so

€

n_comm_,_bw_,_i/n_comp_,_i∝1/ log₂m_i, and (A16)

€