Hadoop: Small Cluster Performance

(1)

Hadoop:

Small Cluster Performance

Joshua Nester, Garrison Vaughan, Jonathan Pingilley, Calvin Sauerbier, and Adam Albertson

Abstract – This essay is intended to to show the performance and reliability of a small scale Hadoop cluster. Methods of testing will be explained, possible connections between data sets will be shown, and further tests that could reveal more information surveyed.

Index Terms – hadoop, cluster, small cluster, performance, reliability

I. INTRODUCTION

Hadoop is a distributed computing framework designed for data-intensive distributed applications. It is normally used on large clusters of commodity hardware. Hadoop provides these clusters with high reliability and speed. The system is designed so that node failures are automatically handled [1]. Petabytes of data are processed every day by Hadoop clusters. However, one area that is not often explored is how the performance and reliability of Hadoop translates to smaller clusters.

The intention of this project was to discover how Hadoop performed in a small scale cluster of four off-the-shelf, budget computers. For best results four identical computers would be selected. However, four identical computers were not available. In order to keep the testing as reliable as possible four available computers that were largely similar were selected. The specifications of the computers that were used for this project were:

• 1x Dell Optiplex 270 Pentium 4 3.2GHz 1GB Ram 333MHz DDR 30GB HDD 100MB/s NIC • 3x Dell Optiplex 280 Pentium 4 3.4GHz 1GB Ram 533MHz DDR2 Various size HDDs 100MB/s NIC

In order to get a picture of how Hadoop performs in small scale operation a few key areas to test were selected. The testing will examine data loss tolerance, performance, performance with node failure, and performance with node recovery.

II. TESTING DETAILS

The program that would be used for all testing was a simple word count program that can take an input of text files, count the number of occurrences of all words, and output the results to another text file. For these tests the text files would be books downloaded from www.gutenberg.org. In order to make the tests long enough to provide reliable results while keeping the test length within the time allotted for this research, 200 books were selected to be counted. The tables at the end of this document include individual times for each test performed.

A. Data Loss Tolerance

While it was known to this group that Hadoop is a single point-of-failure system if the master node goes down and there are no backup nodes, it was not known how Hadoop would handle data being deleted out of the DFS during the execution of a program. This was the shortest and simplest test that would be performed. The test would consist of starting a word count waiting one minute, deleting all of the books on the DFS, and monitoring the reaction of the cluster.

B. Hadoop Speed

To test the speed of Hadoop a baseline without Hadoop must first be determined. To find the baseline performance of a single node without Hadoop a single test would be run using a similar word count problem to the one that is used with Hadoop. Unfortunately the word count program that is used with Hadoop cannot be used without Hadoop. Thus, a replacement was found that uses terminal commands to perform a similar duty. More than a single test is not required as during preliminary testing the command yielded very consistent times. The command used for this test was:

cat *.txt | tr ' ' '\n' | sort | uniq -ic

Once the time was recorded for the test without Hadoop the group would move on to testing the cluster speed with Hadoop. The Hadoop word count program would be run using all books and record the time to complete using one to four nodes with three runs each. The times for each three run set would be averaged together to for consistency in comparing times.

(2)

C. Hadoop Speed with Node Failure

To test speed with node failure the Hadoop word count program would be run using all books and record the time to complete. All tests start with four nodes connected and after a minute into the run one to three nodes with be disconnected. Three runs with each amount of disconnected nodes will recorded. The times for each three run set would be averaged together to for consistency in comparing times.

D. Hadoop Speed with Node Recovery

To test speed with node recovery the Hadoop word count program would be run using all books and record the time to complete. All tests start with four nodes connected and after a minute into the run one to three nodes with be disconnected and then reconnected one minute later. Three runs with each amount of disconnected nodes will recorded. The times for each three run set would be averaged together to for consistency in comparing times.

III. TESTING PREPARATION

In order to prepare for the testing runs all books were copied to the master node and also copied to the DFS. Also, during initial testing it was found that the node timeout needed to be adjusted to allow the nodes to time out within a reasonable amount of time. The default timeout for nodes is set to ten minutes. The timeout was shortened to 30 seconds. This was the only preparation that was required for testing.

IV. RESULTS A. Data Loss Tolerance

When the test was performed, it was discovered that Hadoop does not handle deleting the currently used data from the DFS gracefully. If the data is deleted during a programs execution Hadoop will begin to display data inaccessible error messages and possibly crash. The test was run multiple times to ensure that the result was not a fluke. There were no instances where Hadoop did anything except display error messages or crash.

B. Hadoop Speed

Graph 1: Hadoop Average Speed

In the Hadoop speed tests it was shown that overhead for Hadoop is fairly large on anything less than four nodes (see Graph 1). The Hadoop overhead for one node is about 24%. With three nodes the relative performance per node is only roughly 50% (assuming the goal of a linear 100% computational power increase from each introduced node). However, when the fourth node is introduced, performance jumps upward greatly. With four nodes introduced the relative performance per node suddenly jumps up to 144%. This must be due to the optimizations made to Hadoop for distributed computing and data manipulation.

Graph 2: Hadoop Average Speed with Node Failure The node failure tests revealed very similar data (see Graph 2). What was more impressive with the failure tests was that loosing nodes after a run had started did not seem to affect the finish times much. Each number of disconnected nodes yielded times that were still better than their comparable standard speed runs.

1 2 3 0 10 20 30 Nodes Removed M in u te s 1 w/o 1 2 3 4 0 10 20 30 40 Nodes M in u te s

(3)

Graph 3: Hadoop Average Speed with Node Recovery The most impressive tests of all were the node recovery tests. When disconnecting and reconnecting nodes there was only one minute and a few seconds per node disconnected, lost (see Table 3). The time that it took all nodes to reconnect was about one minute, no matter how many nodes were disconnected. This makes the total time for the runs even more impressive. Each disconnected node would have been down for two minutes, but the time lost averaged one and a half minutes or less in most cases.

CONCLUSION

The tests that were run on this small Hadoop cluster show that, while performance with less than four nodes is not very high, Hadoop can still serve as a cheap way to improve data processing times. Once four nodes are introduced, the Hadoop cluster suddenly turns into a power-house that increases performance by leaps and bounds. It may be interesting to see how much extra performance is provided and if Hadoop performs any differently with more nodes.

Due to Hadoop's excellent handling of node recovery, Hadoop can even be recommended on systems that are somewhat unreliable. All of this data points to Hadoop being a cheap way to obtain very high performance even on very small and in-expensive clusters. Some consideration will still be needed in order to find the best hardware/price configuration for each individual use, but the money that could be saved and the performance gained is very promising.

REFERENCES

[1] Glen Mazza. (November 28, 2012). Hadoop Wiki. Retrieved November 2012, from http://wiki.apache.org/hadoop/.

[2] Noll, Michael. (June 29, 2012). Running Hadoop On Ubuntu Linux. In undefined. Retrieved November 2012, from http://www.michael-noll.com/tutorials/.

TABLES

Table 1: Hadoop Speed

Table 2: Hadoop Speed with Node Failure

1 2 3 4.8 5 5.2 5.4 5.6 Nodes Removed M in u te s

1 Node 2 Nodes 3 Nodes 4 Nodes

Run #1 22min 33sec 29min 55sec 17min 14sec 14min 55sec 4min

Run #2 29min 28sec 17min 42sec 15min 22sec 3min 52sec

Run #3 30min 7sec 17min 39sec 15min 3min 51sec

Avg Time 22min 33sec 29min 50sec 17min 32sec 15min 6sec 3min 54sec

22sec 18sec 16sec 6sec

1 Node w/o Hadoop

Largest Deviation From Avg

Nodes Removed

1 Node 2 Nodes 3 Nodes

Run #1 13min 40sec 16min 27sec 28min 10sec Run #2 14min 14sec 16min 9sec 28min 38sec Run #3 13min 57sec 15min 40sec 28min 9sec Avg Time 13min 57sec 16min 5sec 28min 19sec

(4)

Table 3: Hadoop Speed with Node Recovery Nodes Removed

1 Node 2 Nodes 3 Nodes

Finish Time Recovery Time Finish Time Recovery Time Finish Time Recovery Time

Run #1 5min 16sec 1min 6sec 5min 26sec 51sec 5min 37sec 52sec

Avg Time 5min 9sec 1min 3sec 5min 27sec 51sec 5min 31sec 54sec

7sec 3sec 8sec 2sec 6sec 5sec

(5)

Addendum:

Extended Testing

I. INTRODUCTION

Following the initial test of Hadoop, it was decided that, during the limited time frame available, extra testing would be done with more nodes added. This was intended to shed some light on the performance differences with extra nodes beyond four. Two extra nodes were selected for testing. Unfortunately, computers that were a close match to the original machines were unavailable. The selected computers were not ideal to keep the results consistent, but the patterns and trends should still be apparent. The two new node specification were:

• 1x Lenovo ThinkCenter 9645-WFS Core 2 Duo 2.4GHz dual core 2GB Ram 667MHz DDR2 250GB HDD

100MB/s NIC

• 1x Lenovo ThinkCenter 9482-W31 Core 2 Duo 2.66GHz dual core 3GB Ram 667MHz DDR2 250GB HDD

100MB/s NIC

Another concession that had to be made for the extended testing was the amount of tests to be run. To provide more consistent data all of the failure and recovery tests would need to be run, in addition to the new tests with the extra nodes. The time available was not conducive to this. Instead it was opted to only execute the new tests that were unique. The extended tests would be executed exactly in the same manner as the previous tests.

II. TESTING PREPARATION

Little preparation was required for the extended testing. The only steps required were the installation and setup of the two new nodes on the cluster.

III. RESULTS A. Hadoop Speed

Graph 4: Hadoop Average Speed (combined with 4 node tests)

The added performance of the extra connected nodes did not have a huge impact (see Graph 4). Performance is still great with a relative performance per node remaining at 144% for five nodes and slightly dropping to 133% for six nodes, but the actual time advantage grows less with each added node. As it was mentioned in the conclusion of the previous tests, the lower return on investment as more nodes are added must be considered when determining the best hardware/price combination.

Graph 5: Hadoop Average Speed with Node Failure (combined with 4 node tests)

The node failure tests revealed something that was not shown in any of the previous tests (see Graph 5). The test with two nodes disconnected performed as expected, slightly faster than the four node speed test. The test with one node

1 2 3 4 5 0 10 20 30 Nodes Removed M in u te s 1 w/o 1 2 3 4 5 6 0 10 20 30 40 Nodes M in u te s

(6)

disconnected yielded a totally different result. When one node was disconnected the resulting finish time jumped up greatly. The test was repeated multiple times to ensure that the result was not an anomaly. The only explanation that the group was able to come up with is that disconnecting one of the new nodes affects the reducer selection, for which there would likely be another timeout value. The group was unable to find the timeout value quickly, but is is still believed that this is the most likely reason.

Graph 6: Hadoop Average Speed with Node Recovery (combined with 4 node tests)

The tests that were done with four nodes are mirrored with six nodes (see Graph 3). The six node tests with one and two nodes removed displayed roughly a one minute time advantage over the four node tests. This can easily be explained away as the speed difference between having six nodes instead of four. No significant data was discovered in the new tests.

CONCLUSION

The extended tests provided the needed data to show if any significant differences would be encountered with added nodes. No significant differences were found beyond the possible issue with disconnecting a reducer candidate. The performance increases continued in line with the four node tests. The same things said in the conclusion for the four node tests still stand. Tests must still be performed to find the best hardware/price ratio for each implementation. With the promising performance gains Hadoop provides, the largest consideration for companies that need to process large data will not be to Hadoop or not, but will instead be how many nodes to use.

TABLES

Table 4: Hadoop Speed (6 Node tests)

Table 2: Hadoop Speed with Node Failure (6 Node tests)

1 2 3 4 5 0 2 4 6 Nodes Removed M in u te s 5 Nodes 6 Nodes

Run #1 3min 14sec 2min 47sec

Run #3 3min 2min 53sec

Avg Time 3min 8sec 2min 50sec

8sec 3sec

Nodes Removed 1 Node 2 Nodes

Avg Time 13min 30sec 3min 21sec

8sec 8sec

(7)

Table 6: Hadoop Speed with Node Recovery (6 Node tests) Nodes Removed

1 Node 2 Nodes

Finish Time Recovery Time Finish Time Recovery Time

Run #1 3min 39sec 33sec 3min 58sec 48sec

Avg Time 3min 34sec 36sec 3min 52sec 49sec

5sec 3sec 6sec 2sec