
Transformation of Hadoop: A Survey

Yogesh Prabhu and Prof. Sachin Deshpande

Department of Computer Engineering, VIT, Mumbai, India

Abstract

In recent years, the issue of large amounts of growing data has gained a lot of attention. Big Data is defined as data that is too big to fit on a single server, too unstructured to fit into a traditional row-and-column database, or too continuously flowing to fit into a static data warehouse. This data provides huge opportunities to uncover new insights. Volume, Velocity and Variety are the three major characteristics used to define Big Data. Hadoop is a widely adopted open-source tool which implements Google's famous computation model, MapReduce. It is a batch-processing, Java-based programming model which can process large data sets in a distributed environment. Hadoop consists of two major components: the Hadoop Distributed File System (HDFS) and a processing unit, which since Hadoop 2.x is called YARN. HDFS is a distributed file system that stores large amounts of data on a cluster, and Yet Another Resource Negotiator (YARN) provides distributed processing of data on the cluster. In this paper, we study the journey of the open-source software framework called Hadoop. Being an open-source project, Hadoop has evolved tremendously over the years. Every version has improved the capabilities of the platform to help users solve big data challenges.

Keywords: Hadoop, MapReduce, YARN, big data

________________________________________________________________________________________________________

I. INTRODUCTION

Hadoop is an open-source software framework written in Java for storing enormous amounts of data and for distributed processing of very large data sets. It is capable of handling thousands of terabytes of data and of running applications on systems with thousands of hardware nodes. Its distributed file system facilitates rapid data transfers while allowing the system to continue operating in case of a node failure. Consequently, Hadoop has emerged as the leading foundation for big data processing tasks.

It all started when two computer scientists, Mike Cafarella and Doug Cutting, wanted to build a search engine that could index up to a billion pages[3]. They decided to create a "web crawler" program to index the World Wide Web, which was dubbed the Apache Nutch project. After extensive research they estimated that such a system would cost around half a million dollars in hardware, with a monthly running cost of $30,000. However, their bigger problem was that their architecture lacked the processing bandwidth to support billions of pages on the web. Just as the developers were stumped as to how to proceed, they came across a paper published by Google that described the architecture of the Google File System (GFS). It contained the very blueprints for solving the problem of storing the large files generated as part of the web crawling and indexing process. In 2004, Google published another paper that introduced MapReduce to the world. These two papers laid the foundation of the framework called Hadoop.

Meanwhile, the web giant Yahoo! was having trouble scaling its search engine. Its data scientists and researchers witnessed the benefits GFS and MapReduce brought to Google and wanted the same thing, so they turned their sights to Hadoop. Doug Cutting joined Yahoo!, creating a win-win situation in which Hadoop received a dedicated team and the resources to run at web scale. In February 2007, Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster. A year later, Hadoop became a top-level open-source project of the Apache Software Foundation, confirming its success and its diverse, active community. In April 2008, Hadoop broke a world record to become the fastest system to sort a terabyte of data: running on a 910-node cluster, it sorted one terabyte in 209 seconds. From then on, Hadoop was adopted by many notable companies such as JP Morgan, Facebook, IBM, Microsoft, and Amazon Web Services[4].

II. JOURNEY OF HADOOP

Hadoop 1.x

Hadoop 1.x consists of two components. The first is HDFS, which stands for Hadoop Distributed File System; it is also known as HDFS V1, as it is part of Hadoop 1.x, and is used as the distributed storage system in the Hadoop architecture. The second is MapReduce, a batch-processing or distributed data-processing module built by following Google's MapReduce model. It is also known as "MR V1" or "classic MapReduce", as it is part of Hadoop 1.x.

HDFS:

HDFS follows a master/slave architecture, consisting of a single Namenode (master node) and a number of Datanodes (slave nodes). HDFS can be deployed on a broad spectrum of machines that support Java. Though one can run several Datanodes on a single machine, in the practical world these Datanodes are spread across various machines[1].

a) Namenode:

Namenode is the master node in the Apache Hadoop HDFS architecture; it maintains and manages the blocks present on the Datanodes (slave nodes). The Namenode is a highly available server that manages the file system namespace and controls access to files by clients.

1) Functions of Namenode: It is the master daemon that maintains and manages the Datanodes (slave nodes). Its main functions are:
- It records the metadata of all the files stored in the cluster, e.g. the location of stored blocks, the size of the files, permissions, and hierarchy. Two files are associated with the metadata: FsImage, which contains the complete state of the file system namespace since the start of the Namenode, and EditLogs, which contains all the recent modifications made to the file system with respect to the most recent FsImage.
- It records each change that takes place to the file system metadata. For example, if a file is deleted in HDFS, the Namenode immediately records this in the EditLog.
- It regularly receives a Heartbeat and a block report from all the Datanodes in the cluster to ensure that the Datanodes are live.
- It keeps a record of all the blocks in HDFS and of the nodes on which these blocks are located.
- It is responsible for maintaining the replication factor of all the blocks.
- In case of a Datanode failure, the Namenode chooses new Datanodes for new replicas, balances disk usage, and manages the communication traffic to the Datanodes.

b) Datanode:

Datanodes are the slave nodes in HDFS. Unlike the Namenode, a Datanode runs on commodity hardware, that is, an inexpensive system that is not required to be of high quality or high availability. The Datanode is a block server that stores the data in the local filesystem.

1) Functions of Datanode: These are slave daemons or processes which run on each slave machine. The actual data is stored on the Datanodes. The Datanodes serve the low-level read and write requests from the file system's clients. They periodically send heartbeats to the Namenode to report the overall health of HDFS; by default, this frequency is set to 3 seconds.

MapReduce:

MapReduce is a distributed data-processing or batch-processing programming model. Like HDFS, the MapReduce component also uses commodity hardware to process a high volume of a variety of data at a high velocity in a reliable and fault-tolerant manner. The canonical example of the model is counting word occurrences across a large corpus, sketched below.
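As an illustration of the model (our sketch, not taken from the original sources), the following minimal word-count program uses the org.apache.hadoop.mapreduce API. The map phase emits a (word, 1) pair per token and the reduce phase sums the counts per word; the class name and input/output paths are placeholders.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts emitted for each distinct word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/demo/input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

In Hadoop 1.x such a job is submitted to the Job Tracker and executed by Task Trackers; under YARN the same program runs with its own ApplicationMaster, as discussed in the Hadoop 2.x section.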

The MapReduce component is again divided into two sub-components:

a) Job Tracker:

The Job Tracker is used to assign MapReduce tasks to the Task Trackers in the cluster of nodes. Sometimes it reassigns the same tasks to other Task Trackers when the previous Task Trackers have failed or shut down. The Job Tracker maintains the status of all Task Trackers, such as up/running, failed, and recovered.

b) Task Tracker:

The Task Tracker executes the tasks assigned by the Job Tracker and sends the status of those tasks back to the Job Tracker.

Hadoop 2.x:

Hadoop 2.x consists of three components: HDFS, YARN, and MapReduce. HDFS stands for Hadoop Distributed File System; it is also known as HDFS V2, as it is part of Hadoop 2.x with some enhanced features, and is used as the distributed storage system in the Hadoop architecture. YARN stands for Yet Another Resource Negotiator and is a new component in the Hadoop 2.x architecture; MapReduce running on YARN is known as "MR V2"[7].

HDFS:

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS has a master/slave architecture. An HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of Datanodes, usually one per node in the cluster, which manage the storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of Datanodes[7]. The Namenode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to Datanodes. The Datanodes are responsible for serving read and write requests from the file system's clients; they also perform block creation, deletion, and replication upon instruction from the Namenode.
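To make this division of labour concrete, the following is a minimal sketch (our illustration, not from the HDFS documentation) of a client writing and reading a file through the HDFS Java API. The Namenode URI and file path are placeholder assumptions; in a real deployment the URI comes from fs.defaultFS in core-site.xml.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder Namenode address; the client only contacts the Namenode
        // for metadata (namespace operations, block locations), never for the
        // file bytes themselves.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/user/demo/hello.txt");

        // create() asks the Namenode to allocate blocks; the stream then
        // writes those blocks to a pipeline of Datanodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, HDFS");
        }

        // open() fetches block locations from the Namenode; the bytes are
        // read directly from a Datanode holding each block.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}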

YARN:

Yet Another Resource Negotiator (YARN) is the resource-management and distributed-processing layer introduced in Hadoop 2.x; its motivation and design are discussed in detail in the next subsections.

Difference between Hadoop 1.x & 2.x:

Hadoop gives unprecedented access to cluster computational resources to every individual in an organization. The MapReduce programming model is simple and supports a 'develop once, deploy at any scale' paradigm. This led to users exploiting Hadoop for data-processing jobs where MapReduce is not a good fit, for example, web servers being deployed as long-running map jobs. MapReduce is not known to be amenable to iterative algorithms, and hacks were developed to make Hadoop run them. These hacks posed severe challenges to cluster resource utilization and capacity planning. Hadoop 1.X has centralized job flow control. Centralized systems are hard to scale, as they are the single point of load lifting. A Job Tracker failure means that all the jobs in the system have to be restarted, exerting extreme pressure on a centralized component. Integration of Hadoop with other kinds of clusters is difficult with this model.

The early releases of Hadoop 1.X had a single Namenode that stored all the metadata about the HDFS directories and files. The data on the entire cluster hinged on this single point of failure. Subsequent releases had a cold standby in the form of a secondary Namenode, which periodically merged the edit logs and the Namenode image file, bringing two benefits. First, primary Namenode startup time was reduced, as the Namenode did not have to do the entire merge on startup. Second, the secondary Namenode acted as a replica that could minimize data loss in Namenode disasters. However, the secondary Namenode (which is not a backup node for the Namenode) was still not a hot standby, leading to high failover and recovery times and affecting cluster availability.

Hadoop 1.X is mainly a Unix-based massive data processing framework; native support for machines running Microsoft Windows Server was not available. With Microsoft entering cloud computing and big data analytics in a big way, coupled with the existing heavy Windows Server investments in the industry, it became very important for Hadoop to enter the Microsoft Windows landscape as well. Hadoop's success comes mainly from enterprise adoption, which in turn depends on the availability of enterprise features. Though Hadoop 1.X supports some of them, such as security, there is a list of other features that are badly needed by enterprises. In Hadoop 1.X, resource allocation and job execution were the responsibilities of the JobTracker. Since the computing model was closely tied to the resources in the cluster, MapReduce was the only supported model. This tight coupling led to developers force-fitting other paradigms, leading to unintended uses of MapReduce.

a) Yet Another Resource Negotiator (YARN):

The primary goal of YARN is to separate concerns relating to resource management and application execution. By separating these functions, other application paradigms can be brought onboard a Hadoop computing cluster. Improvements in interoperability and support for diverse applications lead to efficient and effective utilization of resources, and YARN integrates well with the existing infrastructure in an enterprise.

Fig. 1: Architecture of Hadoop Version 1 vs. Hadoop Version 2[7]

Achieving loose coupling between resource management and job management should not come at the cost of a loss of backward compatibility. For almost six years, Hadoop had been the leading software to crunch massive datasets in a parallel and distributed fashion, which means huge investments in development, testing, and deployment were already in place. YARN maintains backward compatibility with Hadoop 1.X (hadoop-0.20.205+) APIs: an older MapReduce program can continue execution in YARN with no code changes, though recompiling the older code is mandatory.

In YARN, jobs are scheduled to run by the ResourceManager (RM). The metadata of the jobs is stored in persistent storage to recover from RM crashes. When a job is scheduled, the RM allocates a container for the job's ApplicationMaster (AM) on a node in the cluster. The AM then takes over orchestrating the specifics of the job. These specifics include requesting resources, managing task execution, optimizations, and handling task or job failures. An AM can be written in any language, and different versions of an AM can execute independently on a cluster. An AM resource request contains specifications about the locality and the kind of resource expected by it. The RM puts in its best effort to satisfy the AM's needs based on policies and the availability of resources. When a container is available for use by the AM, it can launch application-specific code in this container. The container is free to communicate with its AM; the RM is agnostic to this communication.
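As a small illustration (ours, not the paper's) of the client-to-RM handshake described above, the sketch below uses the YARN client API to ask the ResourceManager for a new application. Submitting a real job would additionally require filling an ApplicationSubmissionContext with the AM's launch command and resource requirements, which is elided here.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;

public class YarnClientSketch {
    public static void main(String[] args) throws Exception {
        // Connects to the ResourceManager configured in yarn-site.xml.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        // Ask the RM for a new application; the response carries the
        // ApplicationId under which an AM container would be allocated.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationId appId = app.getApplicationSubmissionContext().getApplicationId();
        System.out.println("Granted application id: " + appId);

        yarnClient.stop();
    }
}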

b) Other improvements:

Namenode is a directory service for Hadoop and contains metadata pertaining to the files within cluster storage. Hadoop 1.X had a secondary Namenode, a cold standby that needed minutes to come up. Hadoop 2.X provides features to have a hot standby for the Namenode: on the failure of the active Namenode, the standby can become the active Namenode in a matter of minutes, with no data loss and no loss of Namenode service availability. With hot standbys, automated failover becomes easier too. The wire protocol for RPCs within Hadoop is now based on Protocol Buffers, whereas previously Java serialization via Writables was used. This improvement not only eases maintaining backward compatibility, but also aids rolling upgrades of different cluster components. RPCs allow for client-side retries as well.

HDFS in Hadoop 1.X was agnostic about the type of storage being used: mechanical and SSD drives were treated uniformly, and the user did not have any control over data placement. Hadoop 2.X releases in 2014 are aware of the type of storage and expose this information to applications as well, so applications can use it to optimize their data-fetch and placement strategies. HDFS append support has also been brought into Hadoop 2.X. HDFS access in Hadoop 1.X releases was through HDFS clients; in Hadoop 2.X, support for NFSv3 has been brought in through the NFS gateway component. Clients can now mount HDFS onto a compatible local filesystem, allowing them to download and upload files directly to and from HDFS. Appends to files are allowed, but random writes are not.

In over six years of its existence, Hadoop became the number one choice as a framework for massively parallel and distributed computing, and the community has been shaping it for enterprise use. In 1.X releases, HDFS append and security were the key features that made Hadoop enterprise-friendly. Hadoop's storage layer was enhanced in 2.X to separate the filesystem from the block storage service. This enables features such as support for multiple namespaces and integration with other filesystems, as well as improvements in Hadoop storage availability and snapshotting.

Hadoop 3.x:

Apache Hadoop 3.x incorporates a number of significant enhancements over the previous major release line, hadoop-2.x. At the time of this survey it is in the alpha stage, which is intended to facilitate testing and the collection of feedback from users and application developers[6]; there is as yet no guarantee of API stability or quality. This section covers a few significant changes that come along with Hadoop 3.x.

Hadoop 3.x requires Java 8 as the minimum Java version. It also introduces erasure coding (EC), a method for durably storing data with significant space savings compared to replication, while providing the same level of fault tolerance[9]. In typical EC setups, the storage overhead is no more than 50%. Integrating EC with HDFS can improve storage efficiency while still providing data durability similar to traditional replication-based HDFS deployments. As an example, a 3x-replicated file with 6 blocks consumes 6 x 3 = 18 blocks of disk space, but with an EC (6 data, 3 parity) deployment it consumes only 9 blocks of disk space.
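The space savings can be stated generally (a worked restatement of the example above, not taken from the paper): for an erasure code with $k$ data blocks and $m$ parity blocks, the storage overhead beyond the user data is $m/k$, whereas $n$-way replication costs $n-1$ times the data size:

\[
\text{overhead}_{\mathrm{EC}(k,m)} = \frac{m}{k}, \qquad \text{overhead}_{n\text{-way repl.}} = n - 1
\]

\[
\mathrm{EC}(6,3):\ \frac{3}{6} = 50\% \quad \text{vs.} \quad 3\times\text{replication}:\ 3 - 1 = 200\%
\]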

Hadoop 3.x provides a major revision of the YARN Timeline Service. YARN Timeline Service v.2 addresses two major challenges: improving the scalability and reliability of the Timeline Service, and enhancing usability by introducing flows and aggregation[10]. YARN Timeline Service v.2 separates the collection (writes) of data from the serving (reads) of data. It uses distributed collectors, essentially one collector per YARN application, while the readers are separate instances dedicated to serving queries via a REST API. Hadoop 3.x also provides support for more than two Namenodes. The initial implementation of HDFS Namenode high availability provided for a single active Namenode and a single standby Namenode. By replicating edits to a quorum of three Journalnodes, this architecture is able to tolerate the failure of any one node in the system. However, some deployments require higher degrees of fault tolerance. This is enabled by the new feature, which allows users to run multiple standby Namenodes. For instance, by configuring three Namenodes and five Journalnodes, the cluster is able to tolerate the failure of two nodes rather than just one.
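The Journalnode figures follow from majority-quorum arithmetic (our restatement, not from the paper): a quorum of $N$ Journalnodes remains writable as long as a majority survives, so it tolerates

\[
f = \left\lfloor \frac{N-1}{2} \right\rfloor
\]

failures. $N = 3$ gives $f = 1$, and $N = 5$ gives $f = 2$, matching the one-node and two-node tolerance figures above.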

III. CONCLUSION

In this paper we surveyed the journey of the open-source Hadoop framework across its major release lines. Hadoop 1.x established HDFS for distributed storage and classic MapReduce (Job Tracker/Task Tracker) for batch processing. Hadoop 2.x separated resource management from application execution through YARN and improved availability with a hot-standby Namenode, while maintaining backward compatibility with 1.x APIs. Hadoop 3.x, still in alpha at the time of this survey, adds erasure coding for storage efficiency, YARN Timeline Service v.2, and support for more than two Namenodes. Every version has improved the capabilities of the platform to help users solve big data challenges.

REFERENCES

[1] T. White, "Hadoop: The Definitive Guide," O'Reilly Media, Inc., 2015.

[2] A. C. Murthy, V. K. Vavilapalli, D. Eadline, J. Niemiec, and J. Markham, "Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2," Pearson Education, 2013.

[3] http://aponia.co/17620-2/

[4] https://www.edureka.co/blog/hadoop-tutorial/

[5] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, vol. 51, no. 1, 2008.

[6] http://blog.bigdataweek.com/2016/08/01/hadoop-importanthandling-big-data/

[7] https://www.packtpub.com/books/content/evolution hadoop/

[8] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica, "Improving MapReduce Performance in Heterogeneous Environments," USENIX OSDI, 2008.

[9] https://hadoop.apache.org/docs/r3.0.0-alpha2/index.html

[10] https://hadoop.apache.org/docs/r3.0.0-alpha2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html

[11] Gartner IT Glossary, "What is Big Data?", http://www.gartner.com/it-glossary/big

[12] S. O. Fadiya, S. Saydam, and V. V. Zira, "Advancing Big Data for Humanitarian Needs," Procedia Engineering, vol. 78, pp. 88-95, 2014.

[13] "Apache Hadoop," http://hadoop.apache.org/
