8.1.3 Cluster Size Versus Speed-Up Factor

Figure 7.14 shows the speed-up factor of approach one and Figure 7.17 the speed-up of approach two. These figures show that an increase in cluster size always results in a decrease in runtime for both approaches. However, there are two interesting trends that emerge from further analysis of the speed-up results.

Approach one responds better to increases in node numbers than approach two. For example, going from a cluster size of one to eight speeds up query time for approach one by a factor of 5.9 with a dataset size of 256M triples, whereas approach two only produces a speed-up factor of 2.9 for the same increase in nodes and dataset size. This shows that approach one makes more efficient use of the available cluster resources, which could be due to its use of the map-side join technique. This technique allows for better use of the cluster, as each node can do more independently, without the need to transfer data over the network to a reduce task. Approach two uses the reduce-side method for all joins, and so nodes must wait to send the data to reduce tasks. This means that individual nodes can do less independently, which would explain why approach two utilises extra nodes less efficiently than approach one.

Another interesting characteristic is that the speed-up at a given cluster size varies depending on the input dataset size. For example, approach one’s speed-up going from one to eight nodes is only 2 at a dataset size of 32M triples, but it increases dramatically to 5.9 at 256M triples. This means that the more data approach one has to query, the more effectively the Hadoop cluster utilises its available nodes. This trend is not limited to the step from one to eight nodes: the speed-up for all cluster sizes increases towards the optimal value as dataset size increases. This is clearly a very promising trait when dealing with NHS-sized datasets. While this characteristic is also present in the speed-up results for approach two, it is not as pronounced. Again, this could be due to the increased network transfers that the approach two algorithm requires.

Further tests on greater cluster sizes would need to be performed to assess at what point an increase in nodes no longer results in a decrease in run time. Research conducted by Fadika, Dede, Govindaraju, and Ramakrishnan (2011) shows that Hadoop’s performance increase does diminish once the cluster is increased past a certain size. However, this is dependent upon the specific task being run.
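To make the phrase "towards the optimal value" concrete, the quoted speed-up factors can be expressed as a fraction of ideal linear scaling. The sketch below assumes that the speed-up factor is the single-node run time divided by the n-node run time, and that the optimal speed-up on n nodes is n; neither definition is stated explicitly in this section, and the function name is illustrative only.

```python
def parallel_efficiency(speed_up: float, nodes: int) -> float:
    """Fraction of ideal (linear) speed-up achieved on `nodes` nodes.

    Assumes the ideal speed-up on n nodes is n (an assumption made here,
    not a definition taken from the text).
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return speed_up / nodes


# Speed-up factors quoted in the text for scaling from one to eight nodes.
reported = {
    ("approach one", "256M triples"): 5.9,
    ("approach two", "256M triples"): 2.9,
    ("approach one", "32M triples"): 2.0,
}

efficiencies = {key: parallel_efficiency(s, 8) for key, s in reported.items()}

for (approach, dataset), eff in efficiencies.items():
    print(f"{approach}, {dataset}: {eff:.0%} of linear speed-up")
```

Under these assumptions, approach one reaches roughly 74% of linear speed-up at 256M triples but only 25% at 32M triples, while approach two reaches roughly 36% at 256M triples.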

8.1.4 Approach Comparison

Figure 7.8 shows the comparison between the total time taken for the two approaches on the single machine. When considering only the query stages, approach one is substantially faster than approach two. However, once the upload stage is included with the query stage, approach two is the better performing approach. The faster query stage of approach one suggests it is more suited to a scenario where the data will be queried multiple times and thus needs to be stored for longer, since the performance cost of the upload stage would then be offset by the performance benefit of approach one’s query stage. Approach two would be more suited to a scenario where the data only needs to be queried once and the errors removed from the dataset. The performance difference between the approaches is interesting when considering the results from the single machine, as one of the biggest performance bottlenecks for Hadoop, transferring data over the network, is not a factor. This allows for a more direct and accurate comparison between approaches. As expected, approach one uses the pre-processed data to increase query performance relative to approach two. Even though the map-side join is less of a benefit when running on a single machine, approach one still performs its conditional logic on, and writes to disk, smaller triple elements due to the compressed data. The reduce task of approach one’s selection stage also does less work than that of approach two, due to the pre-formatted data. These two reasons are the main factors behind approach one’s superior query stage when running on the single machine.
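The difference between the two join styles can be illustrated with a toy, single-process sketch. It is not the implementation used in this work: the record contents, the function names and the `transferred` counter are hypothetical, and it assumes that a map-side join is made possible by partitioning both inputs on the join key in advance, which may or may not match what the upload stage does.

```python
from collections import defaultdict
import zlib


def partition(records, num_partitions):
    """Split (key, value) records into partitions with a stable hash of the key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[zlib.crc32(key.encode("utf-8")) % num_partitions].append((key, value))
    return parts


def reduce_side_join(left, right):
    """Join in the 'reducer': every record is tagged and shuffled by key first."""
    shuffled = defaultdict(lambda: {"left": [], "right": []})
    transferred = 0
    for tag, records in (("left", left), ("right", right)):
        for key, value in records:
            shuffled[key][tag].append(value)  # stands in for the network shuffle
            transferred += 1
    joined = [
        (key, l, r)
        for key, groups in shuffled.items()
        for l in groups["left"]
        for r in groups["right"]
    ]
    return joined, transferred


def map_side_join(left_parts, right_parts):
    """Join inside each 'mapper': matching partitions are already co-located."""
    joined = []
    for left_part, right_part in zip(left_parts, right_parts):
        lookup = defaultdict(list)
        for key, value in right_part:
            lookup[key].append(value)
        for key, value in left_part:
            for other in lookup.get(key, []):
                joined.append((key, value, other))
    return joined, 0  # nothing is shuffled before the join


# Hypothetical records keyed by a shared subject identifier.
left = [("p1", "diagnosisA"), ("p2", "diagnosisB"), ("p3", "diagnosisC")]
right = [("p1", "drugX"), ("p3", "drugY"), ("p3", "drugZ")]

reduce_joined, reduce_transferred = reduce_side_join(left, right)
map_joined, map_transferred = map_side_join(partition(left, 2), partition(right, 2))
```

Both functions produce the same joined records. The reduce-side version moves every input record before any joining can start, whereas the map-side version pays its cost once, up front, when the inputs are partitioned.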

The comparison between approaches is not as clear-cut when considering the cluster results. From a cluster size of one to four nodes, the results mirror those from the single machine, with approach two taking less total time than approach one. However, as Figure 7.19 shows, once the cluster size is increased to eight nodes, approach one becomes the better performing approach, even with its costly upload stage included. The disparity between the one-, two- and four-node clusters and the eight-node cluster must be down to the increased demand on the network which an eight-node cluster incurs.

Overall, the results suggest that approach two is more suited to running on smaller cluster sizes, because its performance is affected by the additional network transfer that larger cluster sizes incur. They also suggest that approach one is more scalable and is therefore suited to larger cluster sizes, because each node can do more work independently and requires less inter-node communication. On cluster sizes of four and below, approach two is more suitable for situations where the data only needs to be queried once. Approach one is more suitable if multiple queries have to be run, as the cost of the upload stage would be nullified after only a few queries. However, further tests on much larger clusters would be needed to assess whether these performance characteristics continue.
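The trade-off between a costlier upload stage and a faster query stage can be written as a simple break-even calculation. The sketch below assumes that the total time for each approach is one upload plus a fixed cost per query, which is a simplification; the function name and the timings in the usage line are hypothetical and are not measurements from this work.

```python
import math


def break_even_queries(upload_one, query_one, upload_two, query_two):
    """Smallest number of queries at which approach one's total time
    (upload_one + n * query_one) drops below approach two's
    (upload_two + n * query_two). Returns None if that never happens."""
    if query_one >= query_two:
        # No per-query saving: only an upload that is already cheaper can win.
        return 1 if upload_one + query_one < upload_two + query_two else None
    extra_upload = upload_one - upload_two
    saving_per_query = query_two - query_one
    # Need n > extra_upload / saving_per_query, with at least one query run.
    return max(1, math.floor(extra_upload / saving_per_query) + 1)


# Hypothetical timings in seconds, chosen only to illustrate the calculation.
n = break_even_queries(upload_one=600, query_one=100, upload_two=200, query_two=250)
print(n)
```

With these illustrative timings, approach one is slower in total for one or two queries (800 s against 700 s at two queries) and faster from the third query onwards (900 s against 950 s).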

In document Using Hadoop to implement a semantic method for assessing the quality of medical data (Page 117-120)