Growth Factor Analysis - Chimaera Complex Model


7.4 Chimaera Complex Model

7.4.1 Growth Factor Analysis

One of the most important elements of the model is the component relating to core count, as this determines the growth factor as the core count is scaled (represented as the C₂ term in Equation 7.1).

The majority of this consumption is likely to originate from the MPI library unless the application manages its own rank-to-rank communication buffers. This means that we can visualise the model’s prediction against that of our two problem sizes, to gain an insight into the accuracy of our trend prediction. Whilst the model term is not designed to specifically model MPI growth, and will not factor in constant size allocations, it is a good approximation.

[Plot: MPI Memory (MB) against Core Count (16 to 512); series: 60³, 120³ and Model]

Figure 7.3: Chimaera MPI memory growth against model prediction

Figure 7.3 shows that there are essentially two magnitudes to the experienced MPI memory consumption. MPI memory consumption should be dependent on core count, and roughly problem size independent, thus we would expect both problem sizes to exhibit roughly the same MPI memory consumption.

What we see from Figure 7.3 is that our model predicts, with reasonable accuracy, the trend of growth but arguably fails to grasp the magnitude. One potential reason is that we trained our model on the 120³ problem on 32 and 64 cores, which from the graph do not exhibit the same magnitude as the equivalent sizes for the 60³ problem.

To experiment, we retrain our model on the HWM traces from the 32 and 64 core runs for the 60³ problem and revalidate. An alternative would have been to utilise the higher core count runs (256 and 512) from the 120³ problem, which also exhibit the increased consumption. Our approach, of using the 60³ problem, shows how accurate models can be generated from varying problem sizes.

\[
F(P, N) = 12480.2\,\frac{P}{N} + 108125\,N + 9894469 + 10968.9\,(\mathit{GhostCells}) \tag{7.5}
\]

Equation 7.5 represents our updated model, based on the 60³ problem traces. We can see that this model has a vastly increased static term, which will increase predictions by ≈8 MB, accompanied by a reduction in the local problem size component. Table 7.6 shows the predictions, and their associated error, for both problem sizes when modelling is based on Equation 7.5.

                 60³                            120³
Cores   Prediction (MB)   Error (%)   Prediction (MB)   Error (%)
16          217.98          -5.40         1473.59          n/a
32          133.37          -2.97          784.82         -8.77
64           86.43          -2.56          423.78         -7.28
128          64.32          -2.14          260.35         -5.30
256          60.57          -1.93          174.52         -2.79
512          78.49           n/a           144.34         -2.63

Table 7.6: Model predictions for Chimaera using Equation 7.5
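As a minimal sketch, Equation 7.5 can be evaluated directly. Two points here are assumptions rather than statements from the text: the model output is taken to be in bytes and converted to MB by dividing by 2^20, and the P/N term is taken to be the per-rank local cell count. Under these assumptions the sketch reproduces the 144.34 MB prediction in Table 7.6 for the 512 core, 120³ case, using the 3840 local cells and 3480 ghost cells quoted for the 2D decomposition. The function names are hypothetical.

```python
# Coefficients copied from Equation 7.5.
C_LOCAL = 12480.2     # multiplies the local problem size (P/N)
C_CORES = 108125.0    # multiplies the core count N (the growth factor term)
C_STATIC = 9894469.0  # static term
C_GHOST = 10968.9     # multiplies the ghost cell count

# Assumption: the model yields bytes and 1 MB = 2**20 bytes.
BYTES_PER_MB = 1024 * 1024


def predict_mb(local_cells, ghost_cells, cores):
    """Evaluate Equation 7.5 for one rank and return the result in MB."""
    total_bytes = (C_LOCAL * local_cells
                   + C_CORES * cores
                   + C_STATIC
                   + C_GHOST * ghost_cells)
    return total_bytes / BYTES_PER_MB


def core_term_mb(cores):
    """Contribution of the core count term alone, in MB."""
    return C_CORES * cores / BYTES_PER_MB


# 512 core, 120^3 case with the 2D decomposition cell counts from the text.
print(round(predict_mb(3840, 3480, 512), 2))  # 144.34

# Growth of the core count term as the job is scaled.
for cores in (16, 32, 64, 128, 256, 512):
    print(cores, round(core_term_mb(cores), 2))
```

The loop shows why the term matters at scale: its contribution doubles with every doubling of the core count, so it becomes the largest term at 512 cores even as the local problem term shrinks.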

In comparison to the results in Tables 7.5(a) and 7.5(b) the general error rate is a bit higher, but in contrast to the previous model our accuracy actually increases at scale. This trend is likely to be the result of the reduction in the local problem component, which plays a more important role for larger problems and at small scale. This means that for a few allocations the model is incorrectly identifying a relationship as constant, where it is actually proportional to problem size.

7.5 Modelling Implementation Changes

In this section we make two conjectures about the design of new features within the Chimaera code, and use our models to investigate their properties. From our study in Section 5.4 we established the importance of processor decompositions and hybrid parallelism models in reducing ghost cells and improving memory consumption scalability.

Here we will apply modifications to our models to simulate the implementation of these features and make conjectures about the resulting memory consumptions. As such we are unable to validate these results, and pay no consideration to implementation design choices, but rather model based on theoretical savings. Additionally we make no comment on the performance of any such implementations.

Cores   Decomposition   Local Cells   Ghost Cells   Prediction (MB)   Predicted Saving (%)
16      4x2x2              13500          3908           212.64              2.45
32      4x4x2               6750          2498           119.21             10.62
64      4x4x4               3375          1538            72.29             16.35
128     8x4x4               1800          1090            55.46             13.77
256     8x8x4                960           740            55.00              9.19
512     8x8x8                512           488            73.43              6.45

(a) Chimaera 60³

Cores   Decomposition   Local Cells   Ghost Cells   Prediction (MB)   Predicted Saving (%)
16      4x2x2             108000         15008          1453.50              1.36
32      4x4x2              54000          9488           754.70              3.84
64      4x4x4              27000          5768           397.73              6.15
128     8x4x4              13500          3908           224.19             13.89
256     8x8x4               6750          2498           142.30             18.46
512     8x8x8               3375          1538           118.49             17.91

(b) Chimaera 120³

Table 7.7: Model predictions for Chimaera with 3D processor decomposition

We base our further analysis on the architecture of the Cab platform, in accordance with the Chimaera model generated in Section 7.4.

7.5.1 3D Processor Decomposition

A 2D decomposition of a 3D problem domain will result in local problems in the shape of a cuboid (a 3D rectangle). Utilising a 3D processor decomposition will enable the generation of more regular cubic shapes. As we demonstrated in Section 5.4 the closer to a regular cube the lower the surface to volume ratio, thus minimising ghost cells.

For both the 120³ and the 60³ problem we simulate the best 3D processor decompositions and use these to generate memory predictions based on the model in Equation 7.5, and generate an estimated memory saving from the model results presented in Table 7.6; these predictions are presented in Table 7.7. If we study the balance of ghost cells to problem cells with our 3D processor decomposition, against the standard 2D decomposition, we can see a vast improvement. For the 512 core, 120³ case we observe 3840 local cells and 3480 ghost cells for the 2D decomposition (Table 7.5(a)) and 3375 local cells with 1538 ghost cells for the 3D decomposition (Table 7.7(b)), a ≈2× increase in the ratio of local cells to ghost cells. This trend of improvement is exhibited across the experiments when comparing the 2D and 3D decompositions, though it is more pronounced at higher core counts, where the difference between the 2D pencil and the newly established cubic shape is most extreme.
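The cell counts quoted above can be reproduced with a short sketch. It rests on assumptions inferred from the tabulated values rather than stated in the text: each dimension is divided by the processor count and rounded up, the ghost region is a halo one cell thick on every face of the local block (including the undivided dimension of a 2D decomposition), and the 512 core 2D processor grid is 32x16. The function name is hypothetical.

```python
from math import prod


def cell_counts(n, procs):
    """Return (local_cells, ghost_cells) per rank for an n^3 domain.

    procs is the processor grid (px, py, pz); a 2D decomposition uses pz = 1.
    Assumptions: block sizes are rounded up, and the halo is one cell thick
    on every face of the local block.
    """
    dims = [-(-n // p) for p in procs]          # ceiling division
    local = prod(dims)
    ghost = prod(d + 2 for d in dims) - local   # block plus halo, minus block
    return local, ghost


# 512 core, 120^3 case: assumed 32x16 2D grid versus the 8x8x8 3D grid.
local_2d, ghost_2d = cell_counts(120, (32, 16, 1))
local_3d, ghost_3d = cell_counts(120, (8, 8, 8))

print(local_2d, ghost_2d)  # 3840 3480
print(local_3d, ghost_3d)  # 3375 1538

# Improvement in the ratio of local cells to ghost cells.
print(round((local_3d / ghost_3d) / (local_2d / ghost_2d), 2))  # 1.99
```

The same function also reproduces the Local Cells and Ghost Cells columns of Table 7.7, for example `cell_counts(60, (4, 2, 2))` gives 13500 and 3908.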

We note that in certain circumstances, such as the 512 core 60³ problem (Table 7.7(a)), the 3D decomposition results in more local cells than the comparative 2D decomposition. This is a result of the decomposition of non-power-of-two problem sizes onto power-of-two processor counts. Fortunately we also see an approximate halving of ghost cells, thus an overall memory reduction is still achieved.

As a whole the memory savings presented in Table 7.7 are significant, and if they were implemented in a sufficiently performant configuration, could prove very beneficial.
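The Predicted Saving column of Table 7.7 appears to be the relative reduction of the 3D prediction against the 2D prediction of Table 7.6. This formula is an assumption, as the text does not state it, but it agrees with the tabulated values. Writing M_2D and M_3D (hypothetical symbols) for the two predictions:

```latex
\[
\text{Saving}\,(\%) = 100 \times \frac{M_{2D} - M_{3D}}{M_{2D}}
\]
% 512 cores, 120^3 problem:
\[
100 \times \frac{144.34 - 118.49}{144.34} \approx 17.91
\]
% 16 cores, 60^3 problem:
\[
100 \times \frac{217.98 - 212.64}{217.98} \approx 2.45
\]
```

Some rows differ in the last digit when computed from the rounded table entries, which suggests the tabulated savings were derived from unrounded predictions.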

Increased Scale

Using a 3D processor decomposition has one additional benefit: the ability to scale to more processes. A 2D decomposition of 60³ cannot scale beyond 3600 cores, as this would represent a 1×1×60 problem decomposition; more cores could not decrease the local problem size, and would be wasted. Using a 3D decomposition, it would be theoretically possible to scale to the maximum 216000 cores, where a 1×1×1 problem decomposition would be achieved.
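The two limits follow from the number of dimensions being divided, assuming each rank must hold at least one cell along every divided dimension. With n = 60 cells per side (the symbols below are introduced for illustration only):

```latex
\[
N^{2D}_{\max} = n^2 = 60^2 = 3600,
\qquad
N^{3D}_{\max} = n^3 = 60^3 = 216000
\]
```

At the 2D limit each rank still holds a full column of 60 cells, whereas at the 3D limit each rank holds a single cell.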

In document Addressing parallel application memory consumption (Page 146-150)