Optimal segment size for split-binary broadcast

6.1.1 Optimal segment size for split-binary broadcast

In Chapter 4 we introduced a number of segmented versions of collective algorithms. Introducing another non-trivial parameter increases the complexity of the decision functions we need to construct. Thus, we would like to examine further the conditions under which segmentation improves performance, and whether parallel communication models can be used for this purpose.

[Figure 6.1 plots omitted: two panels, one for the 768 KB message and one for the 2 MB message; x-axis: segment size [bytes]; y-axis: duration [µsec]; series: Hockney, LogGP, and PLogP predictions and measured values, each for TCP and MX.]

Figure 6.1: Effect of segmentation on split-binary broadcast performance on 64 processes for two message sizes, 768 KB and 2 MB, on the Grig cluster. Model predictions use the model parameter values from Tables 4.12 and 4.13 and Figure 4.19.

The effect of message segmentation on the performance of the split-binary broadcast algorithm was examined on the Grig cluster with 64 processes, for an intermediate (768 KB) and a large (2 MB) message size. The model parameters used in this section can be found in Tables 4.12 and 4.13 and Figure 4.19.

We first determine the optimal segment size according to the analytical Hockney and LogGP models, based on the results of Theorems 5.1 and 5.2. An analytical solution for the optimal segment size under the PLogP model does not exist in this case, so we evaluate a large number of segment sizes in the range from 312 B to 1 MB and report the one with the shortest duration. Finally, we measure the performance of the split-binary algorithm on 64 processes for 768 KB and 2 MB message sizes on the Grig cluster, using 32 different segment sizes in the same range.
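The numerical search used when no closed-form optimum exists can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: `pipeline_cost` is a hypothetical Hockney-style pipeline model (not the split-binary model of Chapter 5), and the values of `alpha`, `beta`, and `depth` are illustrative, not the measured Grig parameters.

```python
import math

def pipeline_cost(seg_size, msg_size, alpha, beta, depth):
    """Hypothetical Hockney-style cost of a segmented pipeline (an assumption).

    Each of the ceil(msg_size / seg_size) segments costs alpha + beta * seg_size,
    and `depth` extra steps are needed for the last segment to arrive.
    """
    n_seg = math.ceil(msg_size / seg_size)
    return (n_seg + depth) * (alpha + beta * seg_size)

def candidate_sizes(lo, hi, count):
    """Return up to `count` logarithmically spaced integer sizes in [lo, hi]."""
    ratio = (hi / lo) ** (1.0 / (count - 1))
    return sorted({min(hi, max(lo, round(lo * ratio ** i))) for i in range(count)})

def best_segment_size(cost, sizes):
    """Evaluate `cost` on every candidate and return (best_size, best_duration)."""
    best = min(sizes, key=cost)
    return best, cost(best)

if __name__ == "__main__":
    msg = 768 * 1024  # 768 KB message

    def cost(seg_size):
        # Illustrative parameter values only (microseconds, microseconds/byte).
        return pipeline_cost(seg_size, msg, alpha=50.0, beta=0.01, depth=6)

    sizes = candidate_sizes(312, 1024 * 1024, 200)
    size, duration = best_segment_size(cost, sizes)
    print(f"best segment size: {size} B, predicted duration: {duration:.1f} us")
```

Any model that maps a segment size to a predicted duration can be passed as `cost`, so the same sweep applies to a PLogP-style prediction or to a table of measured durations.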

Figure 6.1 shows both the experimental results and the model predictions, and Table 6.1 summarizes them. We report the optimal segment size on both the MX and the TCP / FastEthernet interconnect. For the PLogP model and the experimental results, we report the segment size that achieved the absolute minimum, as well as the segment sizes whose running time was within 5% and 10% of that minimum.

Table 6.1: Optimal segment size.

| Model | Message size | Optimal segment size, MX | Optimal segment size, TCP |
|---|---|---|---|
| Hockney | 768 KB | 10308 B | 10116 B |
| | 2 MB | 16834 B | 16519 B |
| LogP / LogGP | 768 KB | 746 B | 309 B |
| | 2 MB | 12185 B | 505 B |
| PLogP | 768 KB | 3584 B | 1024 B |
| | | within 5%: 1536 B, 2560 B, 3072 B | within 5%: 640 B through 6144 B |
| | 2 MB | 3584 B | 4096 B |
| | | within 5%: 1536 B, 2560 B, 3 KB | within 5%: 768 B through 8192 B |
| Measured | 768 KB | 743 B | 4096 B |
| | | within 5%: 1 KB, 2 KB, 512 B | within 5%: 3 KB, 8 KB, 10 KB, 12 KB |
| | | within 10%: 32 KB | within 10%: 16 KB |
| | 2 MB | 128 KB | 8192 B |
| | | within 5%: 512 B, 743 B, 32 KB, 64 KB, 80 KB, 90 KB, 100 KB | within 5%: 512 B, 743 B, 3 KB, 10 KB through 32 KB |
| | | within 10%: 20 KB, 24 KB, 40 KB, 200 KB, 256 KB | within 10%: 312 B through 40 KB |

Figure 6.1 shows that none of the models properly captures the performance over TCP, although the LogGP and PLogP models do show somewhat similar behavior. In the case of the MX interconnect, the PLogP model accurately follows the measured results for segment sizes in the range from 300 B to 100 KB. For large segment sizes and the 2 MB message, all three models give acceptable estimates of the duration. However, none of the models accurately captures the performance of the split-binary method with small segment sizes.

The results in Table 6.1 are not uniform. The Hockney model predictions were rather off the mark: the predicted segment size of approximately 10 KB for the 768 KB message, in both the MX and the TCP case, would take more than 10% extra time over the experimental minimum. Interestingly, according to the Hockney model, the optimal segment size over MX and over FastEthernet (TCP) is very similar. This is because the optimal segment size, m_s, is proportional to the square root of α/β. The latency (α) over MX is approximately 20 times lower than over TCP, while the transfer time (β) is approximately 20 times higher over TCP, essentially making the ratio α/β constant.
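The role of the latency-to-transfer-time ratio can be illustrated with a simplified derivation. The sketch below assumes a generic Hockney-style pipeline cost, which is not necessarily the split-binary model behind Theorem 5.1; the symbols m (message size), c (a constant pipeline depth), and k (a common scaling factor) are introduced here only for illustration.

```latex
\begin{align*}
T(m_s) &= \left(\frac{m}{m_s} + c\right)\left(\alpha + \beta m_s\right)
        = \frac{m\alpha}{m_s} + m\beta + c\alpha + c\beta m_s, \\
\frac{dT}{dm_s} &= -\frac{m\alpha}{m_s^{2}} + c\beta = 0
  \;\Longrightarrow\;
  m_s^{*} = \sqrt{\frac{m}{c}\cdot\frac{\alpha}{\beta}}, \\
(\alpha,\beta) \to (k\alpha,\,k\beta) &\;\Longrightarrow\;
  m_s^{*} = \sqrt{\frac{m}{c}\cdot\frac{k\alpha}{k\beta}}
          = \sqrt{\frac{m}{c}\cdot\frac{\alpha}{\beta}}.
\end{align*}
```

Under this assumed model, scaling both parameters by the same factor (here roughly k = 20 between MX and TCP) leaves the optimum unchanged. The square-root dependence on m is also consistent with the Hockney entries reported for the two message sizes: 16834 / 10308 ≈ 1.63 and 16519 / 10116 ≈ 1.63, while √(2048 / 768) ≈ 1.63.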

The LogP / LogGP model predictions are surprisingly accurate for the intermediate message size over MX, but the remaining predictions would incur a performance penalty of more than 10%. The PLogP predictions for TCP were fairly accurate: using one of the segment sizes that achieve an algorithm duration within 5% of the absolute minimum yields a performance penalty between 5% and 10%. Over MX, however, this is not the case, and we would again incur a performance penalty of more than 10%.

The results of this experiment show that parallel communication models by themselves will not yield optimal results. In addition, there tend to be a number of methods that achieve comparable performance (within 5% of the minimum duration). Thus, relying only on the absolute minimum can lead to over-fitting over negligible run-time performance differences.
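The notion of segment sizes with comparable performance can be made concrete with a small helper. This is a minimal sketch: the function names `within_tolerance` and `penalty` are hypothetical, and the durations in `example` are made up for illustration; they are not measurements from the Grig cluster.

```python
def within_tolerance(durations, tolerance):
    """Return segment sizes whose duration is within `tolerance` of the minimum.

    `durations` maps segment size (bytes) to duration; `tolerance` is a
    fraction, e.g. 0.05 for 5%.
    """
    if not durations:
        return []
    best = min(durations.values())
    limit = best * (1.0 + tolerance)
    return sorted(size for size, t in durations.items() if t <= limit)

def penalty(durations, chosen_size):
    """Relative slowdown of `chosen_size` versus the best duration in the map."""
    best = min(durations.values())
    return durations[chosen_size] / best - 1.0

# Made-up durations in microseconds, for illustration only.
example = {512: 1040.0, 1024: 1000.0, 4096: 1030.0, 16384: 1090.0, 65536: 1400.0}
print(within_tolerance(example, 0.05))  # sizes within 5% of the minimum
print(within_tolerance(example, 0.10))  # sizes within 10% of the minimum
print(f"{penalty(example, 16384):.1%}")  # slowdown if 16384 B were chosen
```

Treating every size in the 5% band as equally good, rather than singling out the absolute minimum, is one way to avoid fitting decision functions to measurement noise.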

Finally, to answer the question of whether segmentation improves algorithm performance, we consider the right-most experimental value reported in Figure 6.1. This value corresponds to the non-segmented case. For the split-binary broadcast algorithm on the Grig cluster, using virtually any segment size improves the overall performance for the communicator and message sizes we considered.

In document Towards Automatic and Adaptive Optimizations of MPI Collective Operations (Page 106-109)