3.6. BoT Size, Runtime, Parallelism and Estimate

In this section, we focus our analysis on four attributes of BoTs, namely size, runtime, parallelism and estimate. The characterisation concentrates on the autocorrelations of these BoT attributes and the cross-correlations between them. Figure 3.8 shows that BoT runtimes can exhibit weak to strong autocorrelations. The autocorrelation in the sequence of BoT runtimes arises because users tend to submit the same applications over and over again. This behaviour also causes autocorrelation in job runtimes. However, an autocorrelation calculated over the sequence of job runtimes is affected by the repetition of similar jobs within the same BoT: consecutive jobs of one BoT have similar runtimes, which inflates the autocorrelation, particularly at short lags. Therefore, we argue that the autocorrelation should be calculated on BoT runtimes rather than on job runtimes.

With respect to BoT size, Table 3.13 shows the mean and the maximum BoT size of each trace. Notably, four of the six traces (FS0, FS2, HPC and NIK) have similar mean BoT sizes of 7 to 8 jobs. Furthermore, the maximum BoT size is rather large and can reach thousands of jobs. Parallel systems can experience periods of severe congestion when such a large BoT arrives. Hence, we claim that realistic BoT workloads used in scheduling evaluation should contain large BoTs of hundreds to thousands of jobs for the evaluation results to be reliable.
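
The section does not state how the autocorrelation functions in Figure 3.8 were computed. The following minimal sketch assumes the standard sample estimator (lag-k autocovariance divided by the total sum of squared deviations) and uses invented toy runtimes, not trace data, to illustrate the argument above: when each BoT's runtime is repeated for every job in that BoT, the lag-1 autocorrelation of the job sequence becomes large and positive even though consecutive BoTs are anti-correlated. The function name `acf` and the toy values are hypothetical.

```python
from statistics import fmean


def acf(series, max_lag):
    """Sample autocorrelation r_k of `series` for lags k = 0..max_lag.

    r_k = sum_{t=0}^{n-k-1} (x_t - m)(x_{t+k} - m) / sum_{t=0}^{n-1} (x_t - m)^2,
    where m is the sample mean (the estimator used by most statistics packages).
    """
    n = len(series)
    if not 0 <= max_lag < n:
        raise ValueError("max_lag must be in [0, len(series))")
    m = fmean(series)
    dev = [x - m for x in series]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("constant series: autocorrelation is undefined")
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
            for k in range(max_lag + 1)]


# Toy data (invented, not from the traces): BoTs alternate between short and long.
bot_runtimes = [10, 500] * 4
# Job-level view: each BoT holds 4 jobs with the same runtime as the BoT.
job_runtimes = [r for r in bot_runtimes for _ in range(4)]

print(acf(bot_runtimes, 1)[1])  # -0.875: consecutive BoTs are anti-correlated
print(acf(job_runtimes, 1)[1])  # 0.53125: repeats inside BoTs inflate lag 1
```

The toy case is deliberately extreme, but the same mechanism applies to real traces: within-BoT repetitions add many near-identical adjacent pairs, which dominate the short-lag terms of the job-level estimate.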

Figure 3.8: Autocorrelation functions of BoT runtimes (ACF versus lag, for lags 0 to 200, shown for FS0, HPC, LLN and NIK).

Table 3.13: Statistics of BoT sizes (number of jobs per BoT).

       FS0   FS2   FS3   HPC   LLN   NIK
Mean     7     8    14     7     5     8
Max   1263   558  3675  2418   169  1201

It is also interesting to see how BoT sizes correlate with other BoT attributes, which we measure with the Spearman rank correlation coefficient. First, we examine the correlation between size and runtime. Before the examination, we expected BoT runtimes to decrease as BoT sizes increase, that is, a negative correlation, because we assumed that users who enlarge their BoTs do so by dividing their applications into more, smaller jobs. However, the results in Table 3.14 indicate that this expectation appears to hold only for NIK. For FS2 and LLN the correlations are also negative but rather weak, whereas FS0, FS3 and HPC show positive correlations. For these three systems, jobs tend to run longer when users increase their BoT sizes, which harms the performance of the parallel system. Therefore, we believe that modeling and scheduling studies should take this realistic behaviour into account.
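
Spearman's coefficient is the Pearson correlation applied to the ranks of the two attributes, which makes it robust to the heavy-tailed sizes and runtimes summarised in Table 3.13. A minimal sketch in plain Python follows. Tied values, which are common because many BoTs share the same size, receive the average of their ranks; the section does not say which tie rule was used, so this choice is an assumption, as are the helper names `average_ranks` and `spearman` and the toy data.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks of `values`; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the last index of the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the average ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("constant input: correlation is undefined")
    return cov / sqrt(sxx * syy)


# Hypothetical BoTs: sizes (jobs per BoT) and BoT runtimes in seconds.
sizes = [1, 2, 4, 8, 16]
runtimes = [900, 700, 400, 300, 100]
print(spearman(sizes, runtimes))  # -1.0: larger BoTs always ran shorter here
```

The same function applies unchanged to size versus parallelism and size versus estimate (Tables 3.15 and 3.16). In practice, `scipy.stats.spearmanr` from SciPy computes the same coefficient, also assigning average ranks to ties, and can serve as a cross-check.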

Table 3.14: Correlation between BoT sizes and BoT runtimes.

FS0 FS2 FS3 HPC LLN NIK


Our next expectation was that BoT parallelism would decrease as BoT size increases. The correlations between these two attributes, shown in Table 3.15, confirm this expectation for FS2, FS3 and HPC, whose correlations are clearly negative (the other traces also yield negative correlations, but these are weak). We anticipated this result because users are likely to reduce the number of processors they request when they increase the size of their BoTs; otherwise, there might not be enough free processors to allocate to all of their jobs.

Table 3.15: Correlation between BoT sizes and BoT parallelisms.

FS0     FS2     FS3     HPC     LLN     NIK
-0.032  -0.198  -0.310  -0.150  -0.091  -0.014

Finally, we calculate the correlation between size and estimate and show the results³ in Table 3.16. The correlation is noticeably negative for FS2, FS3 and NIK. This means that users of these systems tend to reduce the runtime they request for their jobs when they increase the size of their BoTs, possibly because they do not want the scheduler to keep their jobs waiting long in the queue. In contrast, users of the HPC system seem unconcerned about the longer wait times of their jobs: they tend to increase the estimate together with the size of their BoTs, as shown by the positive correlation (0.175). To understand why HPC users tolerate longer wait times, we calculate the occupation time and the estimated occupation time of BoTs⁴. Table 3.17 shows how well users utilize their estimates. Since HPC has the best utilization, HPC users appear to estimate their job runtimes more accurately than users of the other systems. Consequently, if HPC users decreased their estimates, they would risk underestimation, which can cause their jobs to be killed. To ensure that their jobs complete successfully, they have to tolerate longer wait times.
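
Footnote 4 defines the occupation time of a job as R × P and its estimated occupation time as E × P, each summed over the jobs of a BoT. The minimal sketch below computes both. The `Job` fields are hypothetical, and the "utilization" ratio (total occupation time divided by total estimated occupation time over all BoTs of a trace) is an assumption, since the exact metric behind Table 3.17 is not given in this section.

```python
from dataclasses import dataclass


@dataclass
class Job:
    runtime: float    # R: actual runtime in seconds
    estimate: float   # E: user-requested runtime in seconds
    processors: int   # P: number of processors allocated


def occupation_time(bot):
    """Sum of R x P over all jobs of a BoT (footnote 4)."""
    return sum(j.runtime * j.processors for j in bot)


def estimated_occupation_time(bot):
    """Sum of E x P over all jobs of a BoT (footnote 4)."""
    return sum(j.estimate * j.processors for j in bot)


def estimate_utilization(bots):
    """Assumed metric: share of the requested processor time actually used,
    aggregated over all BoTs of a trace (1.0 would mean perfect estimates)."""
    used = sum(occupation_time(b) for b in bots)
    requested = sum(estimated_occupation_time(b) for b in bots)
    return used / requested


# Hypothetical BoT with two 4-processor jobs, each requesting 1200 s.
bot = [Job(runtime=600, estimate=1200, processors=4),
       Job(runtime=900, estimate=1200, processors=4)]
print(occupation_time(bot))            # 6000
print(estimated_occupation_time(bot))  # 9600
print(estimate_utilization([bot]))     # 0.625
```

Aggregating before dividing weights large BoTs more heavily than averaging per-BoT ratios would; which weighting the thesis used is not stated here.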

Table 3.16: Correlation between BoT sizes and BoT estimates.

FS0     FS2     FS3     HPC     LLN     NIK
-0.033  -0.244  -0.236  0.175   -       -0.145

³ Since 65% of the jobs in LLN lack information about their user estimates, we exclude LLN from the analysis of BoT estimates.

⁴ We define the occupation time of a job as the total time that it occupies processors, calculated as $R \times P$, where $R$ and $P$ are the runtime and the number of processors of the job, respectively. Similarly, the estimated occupation time of the job is $E \times P$, where $E$ is the estimate of the job. The occupation time and the estimated occupation time of a BoT are, respectively, the total occupation time and the total estimated occupation time of all jobs within the BoT.

In document Workload modeling and performance evaluation in parallel systems (Page 51-54)