Timing Characteristics and Experiments - Master-Slave architecture based system design

PART 5 MPI-GIS : GIS POLYGONAL OVERLAY PROCESSING

5.3 Master-Slave architecture based system design

5.3.6 Timing Characteristics and Experiments

We have performed our experiments on a Linux cluster that has 80 cores distributed among 9 compute nodes. The cluster contains 1) four nodes with each having two AMD Quad Core Opteron model 2376 (2.3 GHz), 2) one node with four AMD Quad Core Opteron model 8350 (2.0 GHz), and 3) four nodes with each having two Intel Xeon Quad Core 5410 (2.33 GHz) connected using Infiniband switch. In our cluster all the nodes share the same file system hosted at the head node.

We experimented with two different sets of data. First set consists of files of size 770 MBs containing 465,940 polygons and 64 MBs containing 8600 polygons. This data set has skewed load distribution. The second data set consists of files of size 484 MBs containing 200,000 polygons and 636 MBs containing 250,000 polygons. This is the larger data set but the load distribution is uniform here. Figure 5.3 and Figure 5.4 shows the load distribution plots

Figure (5.4) Comparatively uniform load distribution for larger data set.

for a sample of the base layer polygons used in experiments. To calculate absolute speedup against the one-processor time without any parallel overhead, we used the execution time of R-Tree based version on a single processor. Unless otherwise stated, all benchmarking has been performed over the end-to-end 1-processor time (the process of taking two GML files as input, performing overlay processing, and saving the output as a GML file) using R-Tree based algorithm on the smaller data set.

Figure 5.5 shows the absolute speedup when we use sorting-based algorithm. For dynamic version using R-tree, the overall end-to-end (starting from two input GML files to producing output GML file) acceleration is about 15x as shown in Figure 5.6. R-Tree based version shows better performance in comparison to the sorting based version. Dynamic version works better than static version due to the non-uniform distribution of polygonal data as we mentioned earlier. Moreover, master-slave communication time is very small owing to the small size of communication messages.

Figure 5.8 and 5.9 shows the execution time breakdown of subprocesses for the static versions. Task creation step involves data partitioning so that slave processors can work on

Figure (5.5)Performance impact of varying worker processors using sorting-based algorithm.

Figure (5.6)Performance impact of varying worker processors using R-Tree based algorithm.

independent tasks. Overlay processing step involves computing overlay and writing the local output polygons to GML files. Figure 5.10 and 5.11 shows the execution time breakdown of subprocesses for dynamic versions. The reported time in the above-mentioned figures is the average time recorded by noting the time for each of the three subprocesses, i.e., parsing, task creation, and overlay task processing, for each slave process and then taking an

average. Overlay task processing includes assigning the tasks toGP C library and is followed

by the output storing step where the local outputs are stored in the shared file system as a separate file (one file for each slave process). The overlay processing time in case of R-Tree based version is more than sorting-based version for the same dataset as can be seen from Figure 5.9 and Figure 5.11. This is due to the fact that when we use R-Tree datastructure, we get more potentially intersecting polygons for a given base layer polygon in comparison to the sorting-based version. It should be noted here that all the potentially intersecting

Figure (5.7) Performance impact of varying worker processors on task creation.

polygons may not actually intersect. So, this does not affect the correctness of our final output.

Figure 5.7 shows that the task creation using sorting-based algorithm takes much more time in comparison to the R-Tree based algorithm. For a cluster with more than 80 processors, the sorting-based algorithm may take lesser time than R-Tree based algorithm. Although the time taken for task creation step in case of sorting-basaed algorithm decreases along with increase in number of slave processors, it is not suitable for a cluster with a few number of nodes. The comparison shows efficiency of R-Tree based algorithm over sorting-based algorithm for intersection graph creation. Use of R-Tree makes sure that the percentage of sequential portion in the overall algorithm is reduced thereby increasing the parallelism by Amdahl’s law. This is the main reason for the improvement over the sorting- based algorithm. Thus, for cluster with few number of nodes, R-Tree based algorithm works better.

For both the smaller and the larger data sets, the overlay processing task scales very well for static as well as dynamic loadbalancing as the number of slave processors increase. Since, the size of the message is very small, the communication cost is small in comparison with the cost of input file parsing and task creation. However, we observed that due to the contention for parallel file access, parsing of the input GML files and writing of output files takes longer with the number of processors growing. Further scaling of this system is challenging unless high throughput file access can be supported.

Figure (5.8) Execution time breakdown for static version (Sorting-based algorithm).

Figure (5.9) Execution time breakdown for static version (smaller data set) (R-Tree based

algorithm).

In document THE RHETORIC OF REFUGEES: LITERACY, NARRATIVE AND IDENTITY FOR SOMALI WOMEN (Page 92-96)