Performance and Power Implications: Energy Efficiency in Hadoop Ecosystem

Muhammad Nor Al Amin Ishak

MSc Advance Computer Science

2013/2014


The candidate confirms that the work submitted is their own and the appropriate credit has been given where reference has been made to the work of others.

I understand that failure to attribute material which is obtained from another source may be considered as plagiarism.

(Signature of student)


ABSTRACT

This project examines the execution time, power consumption, and energy efficiency of two data processing tools, namely Hadoop and Spark. A power consumption model is used to estimate the power consumption of each tool. Both tools are used to perform a word count task on data files of different sizes. The data files were sourced from Project Gutenberg and loaded into Virtual Machines in a Cloud environment. The files are sized 500MB, 2GB and 4GB respectively. In addition, the data were processed and evaluated in each tool with a varying number of nodes. Subsequently, a combined analysis of execution time and power consumption is carried out to determine the more energy-efficient of the two tools.

The experiment results suggest that Spark performs better than Hadoop in terms of overall execution time. An independent evaluation of the two tools shows that Spark consumes more power than Hadoop. However, a combined analysis of execution time and power consumption reveals that Spark demonstrates better energy efficiency than Hadoop.


ACKNOWLEDGEMENT

First and foremost, I would like to thank my project supervisor, Dr Karim Djemame, for his continuous support and advice throughout this project. I would also like to thank Dr Django Armstrong for the technical support he provided during the implementation of the experiments.

My appreciation also goes to Dr Hamish Carr, for his constructive feedback on the interim report and valuable remarks in the progress meetings.

Lastly, I would like to express my gratitude to my fiancée, family and friends for their support throughout this one year as an MSc student.


LIST OF ACRONYMS

CPU Central Processing Unit

DVFS Dynamic Voltage and Frequency Scaling

EE Energy Efficiency

GB Gigabyte

HDFS Hadoop Distributed File System

I/O Input/Output

IaaS Infrastructure as a Service

IT Information Technology

MB Megabyte

OS Operating System

PaaS Platform as a Service

PC Personal Computer

QoS Quality of Service

RAM Random Access Memory

SaaS Software as a Service

SLA Service Level Agreements

SoC School of Computing

SSH Secure Shell

VM Virtual Machine


CONTENTS

CHAPTER 1: INTRODUCTION
1.1 Project Aim
1.1.1 Motivation
1.1.2 Research Question
1.2 Objectives
1.2.1 Minimum requirements
1.2.2 Possible extensions
1.3 Methodology
1.3.1 Research and Planning
1.3.2 Experimentation
1.3.3 Evaluation
1.4 Schedule
1.4.1 Project task
1.4.2 Project Milestones
1.4.3 Schedule

CHAPTER 2: BACKGROUND RESEARCH
2.1 Cloud Computing
2.1.2 Cloud Service Models
2.1.3 Cloud Deployment model
2.1.4 Virtualization
2.3 Big Data
2.4 Hadoop
2.4.1 Hadoop MapReduce
2.4.2 Hadoop Distributed Filesystem (HDFS)
2.4.3 YARN
2.5 Spark
2.6 Energy efficiency

CHAPTER 3: EXPERIMENTAL DESIGN
3.1 Experiment Variables
3.1.1 Data processing tools
3.1.2 Data processing task
3.1.3 Cluster size
3.1.4 Input data size
3.2 Metrics
3.2.2 Estimated power consumption
3.2.3 Energy Efficiency
3.3 Hypothesis
3.3.1 Does Spark has better execution time than Hadoop?
3.3.2 Does Spark consume more power than Hadoop?
3.3.3 Which is more energy efficient; Hadoop or Spark?

CHAPTER 4: EXPERIMENT IMPLEMENTATION
4.1 Choice of data
4.1.1 Criteria
4.1.2 E-book and Project Gutenberg
4.2 Data collection and measurement
4.2.1 Execution Time
4.2.2 Collectd
4.3 Deployment Platform
4.3.1 Virtual Machine / Node specifications
4.3.2 Software
4.4 Programming language and Frameworks
4.5 Configuration setup
4.5.1 Pre-configuration setup
4.5.2 Hadoop installation and configuration
4.5.3 Spark installation

CHAPTER 5: TECHNICAL EVALUATION
5.1 Small data size
5.1.1 Execution time Analysis
5.2 Medium data size
5.2.1 Execution Time Analysis
5.3 Large data size
5.3.1 Execution time analysis
5.3.2 Estimated Power consumption
5.4 Hypothesis Evaluation
5.4.2 Spark consume more power than Hadoop
5.4.3 Hadoop is more energy efficient

CHAPTER 6: PROJECT EVALUATION
6.1 Methodology evaluation
6.2 Achievements of aims and objectives
6.2.1 Minimum requirements
6.3 Schedule Adjustment
6.4 Related Work
6.4.1 Comparison to Prior Work
6.4.2 Contribution to research
6.5 Future work & Possible Extensions

CHAPTER 7: PROJECT CONCLUSION

REFERENCES

APPENDIX A

APPENDIX B

APPENDIX C


LIST OF TABLES

Table 1 : Hadoop and Spark Small Data Execution Time
Table 2 : Hadoop and Spark Small Data Power Consumption
Table 3 : Hadoop and Spark Medium Data Execution Time
Table 4 : Hadoop and Spark Medium Data Power Consumption
Table 5 : Hadoop and Spark Large Data Execution Time
Table 6 : Hadoop and Spark Large Data Power Consumption
Table 7 : Comparison to previous research

LIST OF FIGURES

Figure 1 : Updated Schedule
Figure 2 : Cloud Computing Layered Architecture[6]
Figure 3 : Hadoop Framework Architecture
Figure 4 : MapReduce Work flow
Figure 5 : Illustration of the Data Flow in Spark [14]
Figure 6 : Power Aware Diagram on Energy Efficiency [18]
Figure 7 : Power consumption breakdown of Server components [24]
Figure 8 : SoC Cloud testbed Virtual Machine
Figure 9 : Comparison of Hadoop and Spark Small Data Execution Time
Figure 10 : Comparison of Hadoop and Spark Small Data Power Consumption
Figure 11 : Comparison of Hadoop and Spark Small Data Energy Efficiency
Figure 12 : Comparison of Hadoop and Spark Medium Data Execution Time
Figure 13 : Comparison of Hadoop and Spark Medium Data Power Consumption
Figure 14 : Comparison of Hadoop and Spark Medium Data Energy Efficiency
Figure 15 : Comparison of Hadoop and Spark Large Data Execution Time
Figure 16 : Comparison of Hadoop and Spark Large Data Power Consumption


CHAPTER 1: INTRODUCTION

1.1 Project Aim

Recently, “energy efficiency” has become a widely discussed topic among Cloud providers and services as a way to promote a greener environment by lowering carbon emissions and reducing electricity consumption. Improving energy efficiency in the Cloud has long captured the interest of researchers because of its potential to make energy usage more effective and to reduce cost.

In the same spirit, the aim of this project is to evaluate and compare the power consumption of two data processing tools, namely Hadoop and Spark. A power consumption model will be used to calculate the power consumption of each tool. Next, by comparing the power consumption of the two tools, the tool with better energy efficiency will also be determined.

1.1.1 Motivation

Hadoop MapReduce has been the de facto standard platform used in the Cloud for analyzing big data. However, the Apache Software Foundation, which also hosts Hadoop, now offers a newer open-source engine for big data analytics named Spark. Spark can be used in standalone mode or run on top of Hadoop. Given the bold tagline on its website, “100 times faster than Hadoop MapReduce in memory”[19], and the fact that people in industry have dubbed it the successor to MapReduce, it is intriguing to examine the potential of this new tool in terms of performance and power consumption.

1.1.2 Research Question

This research aims to answer the questions below.

Between Hadoop and Spark:

● Which tool offers the better running time as a measure of performance?

● Which tool offers lower power consumption?

● Based on the performance and power consumption parameters studied, which data processing tool is more energy efficient?


1.2 Objectives

Two objectives underpin this project. The first is to determine the power consumption and task execution time of Hadoop and Spark. The second is to determine the more energy-efficient of the two data processing tools. The steps to achieve these objectives are:

● Gain a good understanding of cloud computing, big data, energy efficiency and other relevant background topics

● Investigate how the Hadoop framework and the data processing tools work

● Design a set of experiments to evaluate power consumption and energy efficiency

● Implement the experiments for Hadoop and Spark

● Analyze the experiment data to evaluate the hypotheses and draw meaningful answers to the research questions

1.2.1 Minimum requirements

● Hadoop installation

Hadoop installation is required within the project scope. Hadoop is used to execute a data processing task in order to measure its total execution time and power consumption. At a minimum, Hadoop should be installed in a Virtual Machine on the School of Computing (SoC) Cloud Testbed.

● Spark installation

Spark installation is required within the project scope. Spark is used to execute a data processing task in order to measure its total execution time and power consumption, allowing a comparison to be made with Hadoop. At a minimum, Spark should be installed in a Virtual Machine on the School of Computing (SoC) Cloud Testbed.

● Execute data processing task

This step is required within the project scope in order to analyze the execution time and power consumption while the task is running. As a minimum requirement, Spark and Hadoop need to carry out at least one data processing task.

● Run experiments on different workloads

In order to better analyze the power consumption of Hadoop and Spark when they run as parallel, distributed systems, the experiments will use input data of different sizes. As a minimum requirement, the data processing task should be executed with two different data sizes: small and large.

● Run experiments on different cluster sizes

In order to better analyze the power consumption of Hadoop and Spark when running in parallel, the experiments will run on different numbers of nodes. As a minimum requirement, the data processing task should be performed on 2 or 4 nodes, allowing comparison and efficiency analysis to be made.

● Analyze power consumption and energy efficiency

This is the most crucial part of the project. A power consumption model should be applied in order to analyze and calculate the power consumption of the Virtual Machines in the Cloud environment.
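The way execution time and power consumption combine can be illustrated with a minimal sketch. It assumes that the energy used by a task is approximated as average power multiplied by execution time. The function name and the sample figures are hypothetical and are not measurements from this project.

```python
def energy_joules(avg_power_watts: float, execution_time_s: float) -> float:
    """Approximate energy as average power (W) multiplied by run time (s)."""
    if avg_power_watts < 0 or execution_time_s < 0:
        raise ValueError("power and time must be non-negative")
    return avg_power_watts * execution_time_s


# Hypothetical figures, chosen only to illustrate the trade-off.
tool_a = energy_joules(avg_power_watts=100.0, execution_time_s=300.0)  # slower, lower power
tool_b = energy_joules(avg_power_watts=120.0, execution_time_s=200.0)  # faster, higher power

print(f"Tool A: {tool_a:.0f} J, Tool B: {tool_b:.0f} J")
```

In this illustration the tool that draws more power still uses less energy for the task because it finishes sooner, which is why execution time and power consumption have to be analyzed together.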

1.2.2 Possible extensions

The possible extensions for the project are:

● Power consumption measurement using hardware tools

Even though students are not permitted to enter the SoC Cloud Testbed server room, Hadoop and Spark can be installed on a Virtual Machine on the student's own laptop or PC, using the same specification and variables as in the experiments. In this way, it is possible to reproduce the same virtual environment and measure the power consumption of Hadoop and Spark using hardware tools such as a watt meter. Although this may not be as accurate as measuring the actual server, it should provide a useful benchmark against which to compare the power consumption model.

● Real world usage

It is possible to extend the project implementation to cover a real-world data processing task. Many large datasets are available online that can be used, for example Twitter data. These data can be processed to simulate real-world usage rather than relying on synthetic benchmarks.

● Public cloud

It is possible to extend the implementation to a Public Cloud such as Amazon EC2. With the additional resources available in a Public Cloud, it is also possible to experiment with different kinds of variables and data processing tasks.

1.3 Methodology

In order to accomplish the objectives and meet the minimum requirements of this project, a methodical approach was needed to ensure its completion. The approach is based on the Agile method and was broken into three phases, as follows:

1.3.1 Research and Planning

The purpose of this phase is to develop an understanding of the background topics related to the project scope. Research on topics such as Cloud Computing covers the latest accompanying trends and technologies. Next, big data topics are covered, including data processing tools and the Hadoop framework. Lastly, energy efficiency topics are covered in order to investigate past and current research on how energy efficiency can be achieved in the Cloud.


1.3.2 Experimentation

In order to investigate the power consumption of both Hadoop and Spark, experiments will be designed according to certain variables, such as input data size and cluster size. Resource and time constraints are taken into consideration, as they may affect the project schedule. Execution time and power consumption were identified as the metrics that need to be collected and measured. The experiments are then executed and the results collected for analysis.

1.3.3 Evaluation

Once all the experiments have been carried out, the results will be analyzed. The execution time and power consumption of each experiment are analyzed and evaluated against the proposed hypotheses.

Next, an overall review will be carried out to determine whether the research questions have been answered and whether the aim and objectives of the project have been met.

1.4 Schedule

In order to create a proper schedule for completing the project, the high-level tasks and significant project milestones are defined first. Next, a project Gantt chart is created based on these tasks and milestones.

1.4.1 Project task

Below is a list of the high-level tasks, with an explanation of each.

1) Determine aim

A proper aim and objectives are crucial, as they will be the main focus when designing the experiments. In addition, they help to keep the work within the project scope.

2) Choose experiments

Significant experiments are chosen based on the objectives of the project. Each variable and the reasoning behind it are presented in order to ensure that they are applicable within the project scope.

3) Initial testing

In this task, initial experiments are conducted in order to test Hadoop and Spark. The purpose is to see whether the planned experiments are feasible. Both Hadoop and Spark will be installed on a single virtual machine and run with the examples that ship with the installation packages (for example, the Pi calculation).


4) Configure multi-node cluster

Once initial testing has verified that the experiments can be put into operation, Hadoop and Spark clusters will be set up in the SoC Cloud Testbed.

5) Run experiments

At this stage, all the experiments will be conducted on the clusters configured previously. Results will be measured and collected in order to analyze and calculate the power consumption and energy efficiency of both Hadoop and Spark.

6) Evaluate data

Evaluation will be carried out to analyze the results of the experiments and to derive reasonable conclusions.

1.4.2 Project Milestones

A few milestones must be met within the project timeline. They are fundamental to making sure that the necessary tasks are accomplished on time and that the project stays on track.

1) Submission of Aims and Objectives

It is vital to determine the Aims and Objectives of the project clearly, to make sure that the project is achievable and can be implemented successfully within the specified scope.

2) Interim report

The interim report needs to include the overall aim and objectives, the minimum requirements and information on the current progress of the project. It will also include draft chapters on the literature review and the evaluation method. All this information is essential for the assessor to provide feedback, as well as a perspective on tackling any issues that arise during the course of the project.

3) Progress Meeting

The progress meeting is expected to take place within the six weeks before project completion. By this point, the implementation and experiments should ideally have been performed.

4) Final Report Submission


1.4.3 Schedule

In order to keep track of each milestone and the project tasks mentioned previously, an initial schedule (refer to Appendix D) was created in the first few weeks after submitting the aims and objectives. The schedule was constructed on a weekly basis, with each piece of work allocated a period in which it needed to be done. The time allocated to each task was based on its complexity and precedence, with the overall goal of completing the project on time.

However, because the initial schedule was created at an early stage of the project, it was vague in certain aspects and required further amendment. Once substantial information had been obtained from the background research and prototype testing had begun, a much better idea of the time required for each task was available. The updated project schedule is shown in Figure 1:

Figure 1 : Updated Schedule (Gantt chart covering weeks 1 to 13). Tasks listed: Background Research, Aim & Objectives, Literature Review, Choose Experiment, Initial Testing, Interim Report, Configure Cluster, Experiment Implementation, Run Hadoop Job, Run Spark Job, Report write-up, Progress meeting, Progress Meeting reflection, Evaluation, Final Submission.


CHAPTER 2: BACKGROUND RESEARCH

In the background research, the foundational technologies and concepts related to the project are discussed. The purpose of this research is to gain a better understanding of the context of the project. Three main topics have been identified:

1. Cloud Computing

2. Big Data

3. Energy efficiency in Cloud

2.1 Cloud Computing

Overview of the Cloud

“Clouds are a large pool of easily usable and accessible virtualised resources. These resources can be dynamically reconfigured to adjust to a variable load (scale), allowing also for an optimum resource utilisation. This pool of resources is typically exploited by a pay-per-use model in which guarantees are offered by the infrastructure provider by means of customised Service Level Agreements” [1]

The Cloud is the key concept in defining cloud computing technology. In rough terms, a cloud can be equated with a large group of interconnected computers set up for public or private use[2, 3]. These interconnected computers can be PCs or network servers. To illustrate, Google may host a cloud built on PCs or network servers for its own use. This type of cloud is usually set up as a private cloud, which limits its use to the company itself. On the other hand, Google also owns clouds that can be accessed by the general public. This means that any authorized user can access the data and applications from his or her computer. The infrastructure and the technology remain invisible to the user[4, 5].

Cloud computing is not to be confused with network computing. In network computing, documents or applications are hosted on a single server owned by the company, making them accessible from any computer on the network. In comparison, cloud computing utilizes multiple servers and multiple networks, and also involves multiple companies. Secondly, cloud computing, together with its services and storage, can be accessed from anywhere using an internet connection[3], whereas in network computing access is limited to the company's network. Thirdly, cloud computing is also not outsourcing, where a company hires an external firm or agency to provide its computing services. If a company outsources its data or applications, access remains limited to the private use of the company's employees and is not available to the public via the internet[3].


2.1.1 Cloud Architecture

According to Buyya et al. [16], Cloud architecture mainly consists of user-level middleware, core middleware, and the system level, as shown in Figure 2:

Figure 2 : Cloud Computing Layered Architecture[6]

Beginning from the top of the architecture, the Cloud application layer contains applications that can be accessed directly by end-users. Alternatively, users' own applications can be deployed in this layer [6]. The user-level middleware layer contains software frameworks that help developers create an environment in which applications can be developed, deployed and executed in Clouds. The platform-level services that provide the run-time environment are performed in the core middleware layer, which hosts and controls user-level application services. Finally, the system level is where massive physical resources, such as servers, exist; these resources are managed by the virtualization services that sit above this layer[6].

2.1.2 Cloud Service Models

The National Institute of Standards and Technology (NIST)[2] has defined three service models: Infrastructure, Platform and Software as a Service. Each of the services is described briefly below:

● Software as a Service (SaaS) – This service provides access to whole applications that run on existing cloud infrastructure and are used over the internet. These applications are accessible from many common devices through various platforms[3]. However, access may be limited, and accessing the same internet-based SaaS application from different browsers, or from different devices such as mobile phones, may result in different presentations of the application. Users accessing the applications cannot see or alter the underlying infrastructure and are only able to personalize some minor configuration settings[1, 3].

● Platform as a Service (PaaS) – This service allows the deployment, management and alteration of a certain range of applications on existing cloud infrastructure. This can be achieved by utilizing “programming languages, libraries, services, and tools supported by the provider”[3]. As with SaaS, users cannot see, manipulate or alter the underlying infrastructure, but are able to configure some settings of the applications. In addition, PaaS gives users more control over the applications and, to some extent, the application environment[3, 5].

● Infrastructure as a Service (IaaS) – This service provides infrastructure, including storage, networking capacity and related resources, on which users can run their own software or platforms. The consumer, however, does not manage the underlying infrastructure. Instead, users are able to adjust the amount of resources they request, and have some control over storage management, operating systems, and their deployed applications. The consumer also has limited control over some networking components, e.g. the firewall[4, 7].

2.1.3 Cloud Deployment model

In addition to the classification listed above, NIST also defines a set of Cloud deployment models. It outlines four different types of model, as follows:

● Private cloud: A private cloud and its resources are operated solely for one organization. In contrast to internet connections designed for public use, the private access provided by this type of cloud gives each client a dedicated information space[1, 3]. The environment is highly managed and is usually customized according to the customer's specifications. In some cases, the client is offered access protection and management. A private cloud may be operated by the company itself or outsourced to a third party[3, 4].

● Community cloud: The facilities in this cloud type are usually shared by several organizations. The cloud supports a specific community with shared concerns, such as objectives, security requirements, policy or compliance considerations[3]. Similar to a private cloud, a community cloud may be run by the organizations themselves or by third parties[5].

● Public cloud: The facilities of this cloud type are owned by a company that provides cloud services for use by the general public or a large industry group. Normally, a public cloud serves multiple customers with web servers, storage and connectivity in a shared operating environment. Resources from the cloud are flexibly assigned according to the level of demand. The service provider commonly sets the level of accessibility[3].

● Hybrid cloud: The facilities of a hybrid cloud combine two or more of the environments explained above, that is, private, community or public clouds. The individual clouds remain distinct entities but are connected by standardized or proprietary technology that permits data and application portability[4].


2.1.4 Virtualization

System virtualization adds a hardware abstraction layer on top of the bare hardware. This layer presents several virtual machines with an interface that imitates, and functions very similarly to, the actual hardware. Virtualization permits users to run two or more environments on the same machine while keeping each environment isolated, as if it were running exclusively on its own machine[8].

The benefit of virtualization is that it minimizes capital cost, power consumption and cooling cost. This is because many hardware resources used in Clouds can practically be “virtualized”. In addition, multiple virtual servers can run at the same time on a single physical server. Live migration allows virtual machines to be moved away from under-utilized physical servers, so that many physical servers can be turned off[9]. This leads to better energy efficiency for data centers. Virtualization in Cloud Computing also offers dynamic configuration for the varied resource requirements of applications, and the ability to aggregate these resources for differing needs[9].

This technique also enhances responsiveness by automatically monitoring, maintaining and provisioning available resources. In short, the features offered by virtualization are exploited to meet the requirements of business SLAs[8].
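The consolidation benefit described in this section, where lightly loaded virtual machines are packed onto fewer physical servers so that the remaining servers can be turned off, can be sketched as a simple packing problem. The sketch below is only an illustration: it assumes a single resource dimension (CPU demand as a percentage of one host), uses a first-fit decreasing heuristic chosen here for simplicity, and the function name and load figures are hypothetical. It is not the placement policy of any particular hypervisor or of the SoC Cloud Testbed.

```python
from typing import List


def consolidate(vm_loads: List[int], host_capacity: int) -> List[List[int]]:
    """Pack VM loads onto as few hosts as possible (first-fit decreasing)."""
    hosts: List[List[int]] = []
    for load in sorted(vm_loads, reverse=True):
        if load > host_capacity:
            raise ValueError("a single VM exceeds the host capacity")
        for host in hosts:
            # Place the VM on the first host that still has room for it.
            if sum(host) + load <= host_capacity:
                host.append(load)
                break
        else:
            # No existing host has room, so another host stays powered on.
            hosts.append([load])
    return hosts


# Hypothetical CPU demands (percent of one host) for six lightly loaded VMs.
vm_loads = [20, 50, 30, 10, 40, 25]
packed = consolidate(vm_loads, host_capacity=100)
print(f"{len(vm_loads)} VMs fit on {len(packed)} hosts: {packed}")
```

With these illustrative figures, six virtual machines that might otherwise occupy six physical servers fit on two, and the other servers could in principle be switched off.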

There are two main ways in which virtualization can be deployed. The first is hosted system virtualization, or application virtualization, in which the virtualization layer is installed as an application on an existing operating system. The second is the hypervisor system, in which the virtualization layer is installed directly on clean x86-based hardware as a replacement for the conventional operating system. Hypervisor systems are usually used for server consolidation.

Several other kinds of virtualization can be identified in addition to the two types mentioned above. These include full virtualization, hardware assisted virtualization, partial virtualization, paravirtualization, hybrid virtualization, and operating system level virtualization. Each of these kinds of virtualization is highlighted next.

● Full virtualization – A guest operating system can be run unmodified and in isolation through full virtualization. This is done by fully simulating the actual hardware[8].

● Hardware assisted virtualization – This technique improves the efficiency of the guest environment by passing instructions directly to the host processor, without the need to interpret and isolate them[8].

● Partial virtualization – Most, but not all, of the underlying host environment is virtualized in this approach. Paravirtualization, hybrid virtualization and operating system level virtualization are ways of realizing partial virtualization.


a) Paravirtualization presents virtual machines with a software interface (API) that is similar to that of the underlying host hardware[8].

b) Hybrid virtualisation combines the advantages of hardware assisted virtualisation and paravirtualisation to obtain better performance than software-only paravirtualisation[7].

2.3 Big Data

“Big Data” refers to large and complex data sets, comprising a variety of structured and unstructured data, that are often too big, too fast or too difficult to be handled by traditional techniques. Four qualities underpin Big Data: Volume, Velocity, Variety, and Veracity, collectively known as the 4Vs. Each of the qualities is described briefly below:

1. Volume : The quantity or amount of the data[10].

2. Variety : The diversity of the data types[10].

3. Velocity : The speed at which the data are generated and the speed at which they need to be processed[10].

4. Veracity : The ability to trust the data to be highly accurate and reliable when critical decisions are made[10].

Many modern enterprises are now focusing on Big Data, as it is believed to be potentially advantageous in influencing core business processes, providing competitive benefits and increasing company revenues and profits [10]. As such, many organizations are researching ways to exploit Big Data, especially by analyzing it to produce meaningful findings that lead to better business decisions and add value to their business. In the context of data analytics, the features of the Cloud support the ability to process and manage large quantities of data in parallel. Regardless of the volume of data collected, data processing and management remain the core issue in almost all projects.


2.4 Hadoop

Figure 3 : Hadoop Framework Architecture

Hadoop is an open-source software project consisting of various related sub-projects, which focuses on the distributed processing of large data sets across clusters of commodity servers[19]. It is now a top-level project within the Apache Software Foundation and is used by a global community of users. The scope of this project is limited to the framework architecture shown in Figure 3. Other Hadoop sub-projects, such as Apache Pig and Hive, are not included in the background research.

From this point on, the term Hadoop is used in the experiments reported in this project to refer to Hadoop MapReduce. Other components, such as HDFS and YARN, are referred to by their own names.

2.4.1 Hadoop MapReduce

In 2004, the MapReduce programming model was introduced by Google as a solution to the processing demands arising from its large-scale data sets. It quickly became popular in industry and within academia [11]. However, the complexity of the programming model prevented some users from writing useful programs [11]. In the same year, Doug Cutting built on the MapReduce model by creating Hadoop MapReduce. Named after his son's stuffed toy elephant[19][11], Hadoop MapReduce is Apache's own open-source implementation of Google's model. Four years later, in January 2008, Hadoop established itself as a top-level Apache Software Foundation project. To date, Hadoop MapReduce has received many contributions from academia and from industry. Among these contributors, Yahoo is the largest user, with 25,000 servers handling 25 petabytes of application data; its largest cluster comprises 3,500 servers. Hadoop MapReduce is also used by almost 100 other organizations worldwide. In 2008, Hadoop MapReduce set a world record by sorting a terabyte of data on a 910-node cluster in less than three and a half minutes.


2.4.1.1 Hadoop MapReduce Programming Model

Figure 4 : MapReduce Work flow

The central design goals of Hadoop MapReduce are a scalable model for large-scale data-intensive computing and fault tolerance. To use the model, users first need to specify two functions (see Figure 4). The first is the Map function, which processes key/value pairs to generate a set of intermediate key/value pairs[12]. The second is the Reduce function, which combines the intermediate values that share the same key[13]. Both functions can be executed in parallel on non-overlapping inputs and intermediate data. Any program written using the Hadoop MapReduce programming model is automatically parallelized, so it can be run on a large cluster of commodity machines. The runtime system manages several tasks: partitioning the input data, scheduling the program's execution over a number of machines, handling machine failures and managing inter-machine communication. This helps programmers with limited knowledge or experience of parallel or distributed programming, as they can create applications that use the resources of a large distributed system, tailored to their needs[12].
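
As an illustration of the two user-supplied functions, the following is a minimal single-process sketch in Python of word count written in the MapReduce style. It is not Hadoop code: the names map_words, shuffle and reduce_counts are hypothetical, and the grouping step that the Hadoop runtime performs across machines is simulated here with an in-memory dictionary.

```python
from collections import defaultdict


def map_words(document):
    """Map: emit an intermediate (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group intermediate values by key (done by the runtime between the two phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce: combine all intermediate values that share the same key."""
    return key, sum(values)


def word_count(documents):
    # Each document is an independent input split, so the map calls
    # could be executed in parallel on different machines.
    intermediate = [pair for doc in documents for pair in map_words(doc)]
    return dict(reduce_counts(k, v) for k, v in shuffle(intermediate).items())


counts = word_count(["the quick brown fox", "the lazy dog", "The fox"])
print(counts["the"], counts["fox"])  # prints: 3 2
```

Because every map call reads only its own document and every reduce call reads only the values of one key, neither phase needs shared state, which is what allows the runtime to parallelize the program automatically.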

2.4.2 Hadoop Distributed Filesystem (HDFS)

HDFS is a file system designed for MapReduce jobs. It is specialized for workloads that read a large input of data, process it, and write a large output[19]. HDFS is, however, known to handle random access poorly. In HDFS, file data is mirrored to multiple storage nodes to ensure reliability; the Hadoop community refers to this as replication[19]. To safeguard the system against failure, a set of replicated data is always kept available, so a data consumer is not aware when a storage server fails[11].


Two processes support the HDFS services:

• NameNode is the node responsible for managing the file system metadata. It also controls and operates the management services[11].

• DataNode provides the block storage and retrieval services[11].

2.4.3 YARN

Yet Another Resource Negotiator, known by its shorter name Hadoop YARN, operates as a scheduler and resource negotiation system. YARN allocates tasks to machines according to the resources they have available in the cluster[11]. Resources are represented as abstract containers, each of which describes a share of a machine's resources, for example its memory or CPU. Each machine can hold more than one container, and work is allocated on a per-container basis[11].

YARN also handles job submission. This follows from the role that YARN plays within the Hadoop architecture, where job management and task allocation are decoupled from the actual data processing component. As a result, new data processing frameworks can be created as substitutes for Hadoop MapReduce while still benefiting from the scheduling and resource allocation components of the Hadoop project, which are known to be very competent at distributing the workload of data processing tasks[11].

2.5 Spark

Spark is a cluster computing system that performs analytics using in-memory techniques, originally developed in the AMPLab at UC Berkeley. Spark is built on top of the Hadoop Distributed File System (HDFS), but it is not tied to an acyclic data flow model such as MapReduce or Dryad[14]. Spark's task-parallel jobs focus on iterative machine learning algorithms and interactive data mining, which sets it apart from existing traditional systems. An advantage of Spark is its distributed memory abstraction, the Resilient Distributed Dataset (RDD), which supports coarse-grained transformations. RDDs can access the process heap directly, without any system calls to access memory, which makes them highly efficient. For fault tolerance, an RDD logs the transformations used to build a dataset as its lineage, so that a lost RDD can be recovered using the information from other, uncorrupted RDDs. Figure 5 describes the data flow model of a single parallel operation in Spark.


Figure 5 : Illustration of the Data Flow in Spark [14]

By caching data in memory on each worker node, Spark reduces the amount of data transmitted over the network. This makes it well suited to iterative algorithms that reuse a working set of data across multiple parallel operations. Under these conditions, Spark applications can run faster than disk-based implementations on Hadoop because less data is passed over the network[14].

2.6 Energy efficiency

A large number of research papers have been produced on the energy efficiency of Cloud computing. Figure 6 shows how energy efficiency can be improved through power-aware computing. Different methods and approaches, such as Dynamic Voltage Frequency Scaling (DVFS)[15] and task and Virtual Machine consolidation[13, 15, 16], have been aimed at data centers and at cloud architecture such as networks and protocols.


Figure 6 : Power Aware Diagram on Energy Efficiency [18]

Several considerations in applying energy efficiency in the Cloud are discussed below:

1) Metrics

One of the important steps in analyzing energy efficiency is to determine the key metrics. It is also necessary to understand the difference between power and energy. Power is the rate of energy consumption, whereas energy is power multiplied by time[16].

2) System parameters:

Three types of parameters can potentially be optimized in data processing tools such as Hadoop:

• Static parameters, covering the number of workers to use in the Hadoop cluster, the number of HDFS data nodes, the HDFS block size and the HDFS replication factor[12].

• Dynamic parameters, which consist of the number of MapReduce jobs, HDFS block placement schemes and scheduling algorithms[13].

• Hardware parameters, such as the CPU and the speed of the various I/O paths, which comprise the memory, the network and the size of memory[16].


CHAPTER 3: EXPERIMENTAL DESIGN

This chapter explains how the series of experiments was designed to analyze the performance and power consumption of Hadoop and Spark. The factors that shaped the experiments are presented and discussed in relation to the research questions. Resource limitations and the time constraints of the project schedule were also taken into consideration.

3.1 Experiment Variables

In designing an appropriate set of experiments for the aims of the project, a few key variables had to be taken into consideration.

3.1.1 Data processing tools

One of the most important factors to consider was the tool used to process the data. This follows from the research goal: to analyze the performance and energy efficiency of MapReduce and Spark.

3.1.1.1 Hadoop

As mentioned in Section 2.4, Hadoop is the de facto standard in data processing and implements the MapReduce paradigm. Even though Hadoop as an ecosystem comprises other features such as HDFS, for the specific goals of this research it is Hadoop MapReduce that is compared with Spark.

3.1.1.2 Spark

While Spark has a wider set of operations, solving data processing tasks using only Map and Reduce allows Spark and Hadoop to be compared directly, consistent with the objective of answering the second research question.

It is important to note that all the experiments are run within the Hadoop ecosystem, with Spark in Standalone mode. This requires a shared file system (in this case, HDFS) for Spark to be able to run in a multi-node cluster.

3.1.2 Data processing task

The second important variable is the data processing task to be executed by the selected tools. It is important that the task “behaves” in the same way on both tools and has a significant impact on runtime and power consumption.


3.1.2.1 WordCount

Due to the time constraints of the project, it is not feasible to implement a complex data-intensive computing task, or to use the additional features that are available in Spark but not in Hadoop. In this study, word count was chosen in order to measure how effective Hadoop and Spark are at processing the most basic MapReduce problem.

As a classic MapReduce problem, word count can easily be formulated so that each document is processed in parallel. Previous studies of the performance of the MapReduce paradigm have used word count as one of the tasks in synthetic data processing benchmarks. For example, in a recent study on MPI data[18], word count is one of the tasks used to compare the performance of Hadoop and Spark.

3.1.3 Cluster size

Another important aspect is the variation of cluster size. This depends on the availability of resources: the CPU, RAM and disk size of each physical machine in the SoC Cloud testbed (see Section 4.3 for specification details). In addition, other users share the SoC testbed for their own experiments and projects, so it is unreasonable for this project to use all of the resources.

However, it is crucial to see how well the data processing task performs with different numbers of nodes. The experiments are scaled over 1, 2, 4 and 6 nodes, each processing a specific size of data. This variable also helps to determine power consumption at different Cloud scales, and so supports the main goal of this research project: evaluating power consumption and energy efficiency.

3.1.4 Input data size

Data scalability is another important factor in these experiments. It is useful to see the runtime and power consumption of each data processing tool when dealing with different sizes of input data. As the experiments also vary the number of nodes, it is interesting to see how runtime and power consumption change with data size. Further details of the data sizes are discussed in Section 4.1.1.1.

3.2 Metrics

This section defines the quantitative metrics that need to be collected from the experiments for use in the analysis.


3.2.1 Execution time

Execution time, the time taken for the data processing task to complete, is measured. It is one of the main criteria for evaluating the performance of each experiment across the data processing tools. The variables mentioned previously affect execution time: a larger input is expected to take longer to process, and a larger cluster is expected to reduce processing time, since the additional nodes help to process the data in parallel.

3.2.2 Estimated power consumption

To answer the second research question, a calculation of power consumption is essential. However, there is no specific method for accurately determining the power consumption of a Virtual Machine, even though the number of studies on the subject is increasing. Based on the research background, the work of Bohra (2011) was chosen. The method is a mathematical formula based on the utilization of the CPU, DRAM, cache and disk, with which the authors predicted the power consumption of a single Virtual Machine with up to 82%-85% accuracy.

Total Power = c0 + (c1*CPU) + (c2*CACHE) + (c3*DRAM) + (c4*DISK)[17]

Nonetheless, the method requires a weight for each component, that is, its share of total power consumption. These weights can only be obtained with a hardware tool, for example a watt meter, by measuring power consumption manually and analyzing the weight of each component, which was not possible due to time constraints. To overcome this problem, a fixed constant was decided on for each component before the experiments were carried out.

Figure 7 : Power consumption breakdown of Server components [24]

To determine the idle weight, the power consumption breakdown of a server from the work of Tsirogiannis et al. (2010) was used (see Figure 7). While idle, the server consumes more than half of its total power. Therefore, a value of 0.5 is used as the weight of the idle constant. The CPU weight is 0.25 and the disk weight is 0.1. For memory and cache, the weights are 0.1 and 0.5 respectively.
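
The linear model and the weights above can be sketched in Python as follows. This is an illustrative sketch rather than the project's actual calculation script: the function name estimate_power is hypothetical, utilization values are assumed to be fractions between 0 and 1, the weights are copied from the text as stated, and the result is a relative value (a share of the server's total power), not a reading in watts.

```python
# Weights as stated in the text: idle constant (c0), CPU (c1), cache (c2),
# DRAM (c3) and disk (c4).
WEIGHTS = {"idle": 0.5, "cpu": 0.25, "cache": 0.5, "dram": 0.1, "disk": 0.1}


def estimate_power(cpu, cache, dram, disk, weights=WEIGHTS):
    """Total Power = c0 + c1*CPU + c2*CACHE + c3*DRAM + c4*DISK.

    All utilization arguments are assumed to be fractions in [0, 1].
    """
    for name, value in (("cpu", cpu), ("cache", cache), ("dram", dram), ("disk", disk)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} utilization must be between 0 and 1")
    return (weights["idle"]
            + weights["cpu"] * cpu
            + weights["cache"] * cache
            + weights["dram"] * dram
            + weights["disk"] * disk)


# Hypothetical utilization sample for one node (not measured data).
print(round(estimate_power(cpu=0.8, cache=0.2, dram=0.5, disk=0.1), 2))
```

An idle node (all utilizations zero) yields the idle constant alone, which reflects the observation that a server draws a large share of its power even when it is doing no work.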


3.2.3 Energy Efficiency

Another metric that needs to be measured is the energy efficiency of each data processing tool. Energy, as defined in physics, is the physical currency used to accomplish a particular task. Extending this concept, energy efficiency is defined as the amount of work done per unit of energy consumed. Using the power consumption metric from Section 3.2.2, the energy efficiency metric is derived as follows:

Energy Efficiency = Work Done / Energy

= Work Done / (Power x Time)

= Performance / Power [24]
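
The derivation can be checked with a short Python sketch. The function names and the numbers used here are hypothetical and chosen only to illustrate the arithmetic; work done is taken to be the input size in MB, which is an assumption rather than a definition given in the text.

```python
def energy(power, seconds):
    """Energy is power multiplied by time."""
    return power * seconds


def energy_efficiency(work_done, power, seconds):
    """Work done per unit of energy consumed."""
    return work_done / energy(power, seconds)


# Hypothetical run: 500 MB processed in 100 s at an estimated power of 2.0 units.
work_mb, power_units, runtime_s = 500, 2.0, 100.0

efficiency = energy_efficiency(work_mb, power_units, runtime_s)
performance = work_mb / runtime_s  # work done per unit of time

print(efficiency)                 # prints: 2.5
print(performance / power_units)  # prints: 2.5
```

Both lines print the same value because dividing work by (power x time) is the same as dividing performance (work per unit time) by power. A tool that draws more power can therefore still be more energy efficient if it finishes the same work sufficiently faster.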

This metric is important because it shows which of the two data processing tools, Hadoop or Spark, is the more energy efficient, thus answering the third research question.

3.3 Hypothesis

The aim of the project was to provide a comparative analysis of Spark and Hadoop by answering three core research questions. Each experiment is evaluated to gather the data required to answer those questions. However, it is more useful to state a hypothesis for each research question than for each individual experiment, as this gives a broader perspective on the study.

3.3.1 Does Spark have a better execution time than Hadoop?

With the use of RDDs, Spark is expected to outperform Hadoop in total execution time. RDDs allow objects partitioned across the nodes to be cached and stored in memory[25], rather than written to hard disk. The in-memory data sharing that Spark provides reduces the communication overhead and data synchronization costs that MapReduce incurs, and Spark also manages these distributed cache blocks.

In contrast, the original MapReduce workflow is simple but rigid. The mapping, shuffling and reducing steps of the MapReduce model lead to frequent input/output (I/O) read and write operations and hence to inefficient performance. Spark is therefore expected to outperform MapReduce, resulting in a lower execution time.

3.3.2 Does Spark consume more power than Hadoop?

The heavy use of memory and cache in Spark is expected to lead to high power consumption. Because Spark stores intermediate results in memory and can cache them for use by other RDDs, its memory usage is higher. In terms of power consumption, memory is the second-highest consumer after the CPU[24][23].

Meanwhile, MapReduce relies on I/O reads and writes to disk. In terms of power consumption, the disk is one of the lowest-consuming components[23]. The hypothesis is therefore that Spark will consume more power.


3.3.3 Which is more energy efficient; Hadoop or Spark?

At this point, without any statistical evidence or experimental results, it is difficult to infer which of the two data processing tools, Spark or Hadoop, is more energy efficient. Energy efficiency depends on both performance and power consumption, so neither aspect alone is sufficient to determine a hypothesis.

However, Hadoop is assumed to consume less power than Spark, as stated in the hypothesis in Section 3.3.2. Thus, as an initial hypothesis, this research project assumes that Hadoop has better energy efficiency.


CHAPTER 4: EXPERIMENT IMPLEMENTATION

This chapter discusses the implementation details of the experiments. It describes the criteria for the chosen input data, the deployment environment, and the programming language and frameworks used. The installation and configuration steps for the Hadoop MapReduce and Spark environments are also described briefly.

4.1 Choice of data

In any big data problem, the type and source of the data need to be considered in relation to the selected data processing task. As mentioned in the previous chapter, the task selected is word count. It is therefore appropriate to select a dataset that represents what data processing tools are usually used for, in order to show how the tools perform the word count task.

4.1.1 Criteria

The desirable criteria for the dataset are outlined and discussed below to clarify their relevance to the experiments.

4.1.1.1 Size

Data size scalability is measured in the experiments, since in real-world scenarios an increase in data size has an impact. Three sizes of input data were chosen for this project, on a relatively “small”, “medium” and “large” scale. These sizes are based on the resource limitations: the disk usage of the physical host and the default Virtual Machine disk size of 10GB.

The “small” dataset should not be so small that the data processing tools cannot distribute it in parallel, as this would affect the results of the experiments. Likewise, the “large” dataset should not be so big that it causes problems with storing and processing the data, especially when running on a single node.

Taking all of this into account, the experiments are conducted on three relative sizes:

• 500 MB Input Data - Small

• 2 GB Input Data - Medium

• 4 GB Input Data - Large


4.1.1.2 Ease of processing

Even though word count is considered a classic example of a MapReduce problem, it is important to find a dataset that can be processed by the data processing tools without extra work. Certain formats, such as PDF, require custom input adaptors to be written in the data processing program to parse and process the data. Such formats are not suitable within the scope of this project, since the additional parsing has a significant impact on the power consumption of the tools.

An ideal candidate is plain text data containing a variety of words to be counted, such as e-books, speeches, articles and other texts. Both Hadoop MapReduce and Spark have APIs that can read text files and handle plain text. Such input datasets could simply be created by a random text writer. For these experiments, however, it is preferable for the datasets to make “sense” and have context, rather than being random, meaningless words. With this in mind, e-book text files from Project Gutenberg were chosen, as they fit the criteria outlined. More details are given in the next section.

4.1.2 E-book and Project Gutenberg

An e-book, as defined by The Oxford Dictionary of English, is an electronic version of a printed text which can be read on a device or computer screen. As computer systems and communication technology have evolved, e-books have become more relevant and are available in various formats such as txt (text), html (web browser), pdf, jpg (image) and others[22].

Project Gutenberg is the longest-established volunteer e-book initiative; it places public domain works online, where they can be used freely in electronic format[22]. This community project was founded by Michael S. Hart, and the first e-book was officially added to the catalogue in December 1971. There are an estimated 43,000 e-books in the Project Gutenberg catalogue. These e-books are offered for download in several file formats, ranging from the commonly used pdf to the more recent Kindle-compatible mobi format, as well as epub for Nook, Kobo or iBooks.

For the purpose of the experiments, I downloaded and compiled various e-book text files to meet the data size criteria specified in the previous section: small (500 MB), medium (2 GB) and large (4 GB). The use of Project Gutenberg as a data source is also justifiable, since several past studies have used the same source to measure the performance of data processing tools in the Cloud[22].

4.2 Data collection and measurement

The data collection method is a fundamental aspect of the experiments. As mentioned in the previous chapter, the project scope requires the execution time of the data processing task to be monitored. In addition, performance data such as CPU, RAM, cache and disk utilization must be collected in order to estimate power consumption. The data collection methods for these metrics are discussed in this section.


4.2.1 Execution Time

To measure the execution time, the UNIX command time was used. When time is used with a Hadoop or Spark task, it shows the time taken to complete the job once the job has finished. The full UNIX command is /usr/bin/time <hadoop job/spark job command>. This information was important so that the total runtime of the task could be measured. The example below is based on the Spark word count task:

hduser@master ~ % /usr/bin/time /usr/local/spark/bin/spark-submit --class org.apache.spark.examples.JavaWordCount --deploy-mode cluster --master yarn /usr/local/spark/examples/target/scala-2.10/spark-examples-1.0.2-hadoop2.4.0.jar gutenberg1G

Example output:

12.11user 7.48system 2:55.36elapsed 1%CPU (0avgtext+0avgdata 240608maxresident)k 354064inputs+3424outputs (18major+98236minor)pagefaults 0swaps
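
For analysis, the elapsed field of this output has to be converted to seconds. The following Python sketch shows one way to do so; it is not part of the original tooling, the function name elapsed_seconds is hypothetical, and it assumes the default GNU time output format shown above, in which the elapsed time is written as [hours:]minutes:seconds followed by the word elapsed.

```python
import re

# Matches the elapsed field, e.g. "2:55.36elapsed" or "1:02:55elapsed".
ELAPSED = re.compile(r"(?:(\d+):)?(\d+):(\d+(?:\.\d+)?)elapsed")


def elapsed_seconds(time_output):
    """Return the wall-clock time in seconds from /usr/bin/time output."""
    match = ELAPSED.search(time_output)
    if match is None:
        raise ValueError("no elapsed field found")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes) * 60 + float(seconds)


sample = ("12.11user 7.48system 2:55.36elapsed 1%CPU "
          "(0avgtext+0avgdata 240608maxresident)k")
print(elapsed_seconds(sample))  # about 175.36 seconds
```

The elapsed (wall-clock) value is used rather than the user and system values because the job runs on the cluster, so the CPU time of the local submitting process says little about the total runtime.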

4.2.2 Collectd

To estimate power consumption using the formula given in Section 3.2.2, this project uses Collectd, a daemon that gathers statistical information about different components of the system it is running on. Examples of the standard system metrics that Collectd can track are CPU load, memory usage and network traffic. It also has a plugin system that extends its functionality, allowing users to track common software such as PostgreSQL, MySQL and Apache. In addition, Collectd provides a variety of ways to store the tracked values, such as RRD files or CSV files.

Other monitoring and measurement tools or methods could be used for system monitoring and data collection. The most basic example is a shell script that obtains and records the required parameters from a UNIX/Linux system. However, this was not feasible due to time constraints and the complexity of collecting data in a cluster environment. Collectd offers a client/server feature between nodes and is therefore a better fit for the scope of this project. The use of Collectd is also justifiable because a similar approach to collecting system information has been applied in previous research[21].

4.2.2.1 Collectd configuration

Once Collectd is successfully installed, it automatically enables the default plug-ins, which are the CPU and memory plug-ins. However, calculating power consumption requires other data, such as disk utilization and memory cache utilization. Additional plug-ins (disk and memory cache) therefore need to be enabled and running to collect the required statistical information. It is also necessary to enable the CSV plug-in so that the data can be stored in CSV format for result analysis.

1) Enable additional plugin

To enable these plug-ins, the Collectd configuration file was edited manually by running the command below:

vi /etc/collectd.conf

To enable a required plug-in, the # sign in front of its LoadPlugin line is removed. As mentioned above, the required plug-ins are:

LoadPlugin CSV

LoadPlugin disk

LoadPlugin memcache

This enables the required plug-ins, which start working once the Collectd service is restarted.

2) Enable Client/Server

One of the advantages of Collectd, and a reason it was chosen as the data collection method, is that it allows centralized data collection on one main server. This feature allows all clients to send their data to that server. Since the experiments run on different numbers of nodes, the feature is helpful because the statistical information from all slave nodes is collected on one master node.

To enable the client/server feature in Collectd, the configuration file is again edited manually by running the UNIX command below:

vi /etc/collectd.conf

On the server/master side, the information below is appended. The placeholder must be replaced with the IP address of the Collectd master server.

# Server

<Plugin "network">

Listen "IP address of Collectd master server" "25827"

</Plugin>

On each client/slave, the information below is appended. The placeholder must be replaced with the IP address of the Collectd master server.

# Client

<Plugin "network">

Server "IP address of Collectd master server" "25827"

</Plugin>

This will enable the client/server services and it will start working once the Collectd service is restarted.
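
Once the CSV files have been collected on the master node, each metric has to be reduced to a single utilization figure per experiment. The Python sketch below shows one possible way to do this and is not the script used in this project. It assumes that each CSV file starts with a header line epoch,value followed by one sample per line, and that the values are already utilization fractions; both are assumptions about the Collectd output, and the sample data is invented for illustration.

```python
import csv
import io

# Invented sample in the assumed layout: a header line, then one
# "timestamp,value" row per collection interval.
SAMPLE = """epoch,value
1409570000,0.20
1409570010,0.40
1409570020,0.60
"""


def mean_utilisation(csv_text):
    """Average the 'value' column of one metric file."""
    reader = csv.DictReader(io.StringIO(csv_text))
    values = [float(row["value"]) for row in reader]
    if not values:
        raise ValueError("no samples found")
    return sum(values) / len(values)


print(round(mean_utilisation(SAMPLE), 2))
```

In practice the text would be read from the file written for each node and metric, and the resulting averages for CPU, cache, memory and disk would be the inputs to the power formula in Section 3.2.2.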

4.3 Deployment Platform

The experiments for this project were implemented on the School of Computing (SoC) Cloud Testbed. The setup of this environment is illustrated below:

Figure 8 : SoC Cloud testbed Virtual Machine

The SoC Cloud Testbed is made up of 8 physical machines named testgrid3 to testgrid10. Testgrid3 is the cloud's front end, also referred to as the head node or controller node. The 7 other machines are hosts. Each machine has 4 processor cores, giving a total capacity of 32 simultaneous processors. The front end, testgrid3, has OpenNebula and Xen installed for running and managing virtual resources and machines. With the OpenNebula toolkit, virtual machines can be created as required and added when the cluster size increases.

4.3.1 Virtual Machine / Node specifications

By default, the SoC Cloud Testbed allows up to 10 Virtual Machines to be created per user, each with one virtual processor core and 1GB of memory. However, after running a few trial Hadoop experiments on the large dataset, it was observed that numerous “map” tasks were failing. Even though the job completed in the end, the failures affected the actual execution time. According to the log output, this was due to a memory issue that required a larger heap. To overcome this problem, an increase in the memory quota was requested, and the actual experiments were conducted with 2GB of memory.

However, as noted previously, resources are limited and other users also access the SoC Cloud Testbed, so resource use had to be coordinated and scheduled with other users so that the experiments could be conducted with the required memory. Consequently, this affected the project's schedule.

Although data processing tools such as Hadoop are designed to use large numbers of nodes, smaller cluster sizes are commonly used in practice and in research experiments. The cluster size and scalability used here are therefore considered appropriate for this case study and consistent with previous research in the field.

4.3.2 Software

According to the initial plan, the experiments were to use the latest versions of both the Hadoop and Spark software packages. However, due to a lack of support and compatibility issues between the two packages, Hadoop version 2.4.0 and Spark 1.0.2 were installed on each of the nodes. Spark was configured to use HDFS in order to limit the effect of shared file system usage on the experiments, and also because time constraints prevented other options from being explored. Even though the default Spark package includes a YARN-dependent mechanism, it is excluded from the scope of these experiments, as it would limit the effectiveness of the comparison between Spark and Hadoop. Spark standalone mode using HDFS is sufficient for this research project.

Moreover, to limit communication overhead, the data was loaded into HDFS before the experiments. This allows the data to be distributed across the nodes in advance, rather than transferred each time an experiment is launched.

4.4 Programming language and Frameworks

The Word Count processing task can be implemented in both Hadoop and Spark using the Java programming language. Java is an object-oriented programming language that uses the Java Virtual Machine as its runtime environment.

The alternative is the Scala programming language. However, for Hadoop to be used with the standard Scala collections API, the Scalding library first needs to be configured. Scalding is a Scala library developed and maintained by Twitter.

Considering that there is no significant difference in processing performance between Scala and Java, and that time constraints prevented Scalding from being set up and explored in Hadoop, it was decided to implement the Word Count processing task in Java.

4.5 Configuration setup

This section presents the details of the configuration setup for both Hadoop and Spark. The purpose of each step is explained.

4.5.1 Pre-configuration setup


4.5.1.1 Java Version 1.7 installation

Java is required by Hadoop and Spark, so it is installed on each node. For this experiment, the latest Java version at the time, Java 7, is used. The command apt-get install openjdk-7-jdk is used to install the Java Development Kit (JDK). Once the JDK is installed, the java -version command is used to check its version and confirm that the installation was successful.

4.5.1.2 Hadoop group and user creation

A Hadoop group was created using the command addgroup hadoop. Then, a Hadoop system user called hduser was created and added to the new group using the command adduser --ingroup hadoop hduser. The purpose of these steps is to create a dedicated Hadoop system user. It is good practice for a system administrator to separate the Hadoop installation, configuration and any other Hadoop-related tasks from other users for reasons of security and user permissions. The hduser account is used to run both the Hadoop MapReduce and Spark jobs.

4.5.1.3 Edit /etc/hosts

The purpose of this step is to add the mapping between the hostnames and IP addresses of the master and the slaves to the /etc/hosts file on all nodes. Once configured, the nodes were pinged from one another to verify that everything was set up correctly.

4.5.1.4 SSH

Both Hadoop and Spark require SSH to be installed and configured with public key authentication on each node. This enables the master to make password-less SSH connections to all the slaves.

First, an RSA key pair with an empty password was generated for the master node using the command ssh-keygen -t rsa -P "". Then, SSH access to the localhost was enabled by running the command cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys. To test the connection and save the localhost's host key fingerprint to the hduser user's known_hosts file, the command ssh localhost was used.

Next, the command ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@<slave name> was run on the master node, where slave name is the name of each slave according to the cluster size required; for example, slave1, slave2, slave3…slaven.

4.5.2 Hadoop installation and configuration

The process of installing Hadoop and configuring it for a multi-node cluster environment is explained step by step.