Hadoop on Linux (Ubuntu)

Hadoop is an open-source framework for storing and processing big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different CPU nodes. In short, the Hadoop framework supports applications that run on clusters of computers and perform complete statistical analysis on huge amounts of data.

Map stage : The map or mapper’s job is to process the input data. Generally the input data is a file or directory stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.

Reduce stage : This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in the HDFS.
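The two stages can be imitated on a single machine with an ordinary shell pipeline. The sketch below is only an analogy and not how Hadoop itself is implemented: the function names map_stage, shuffle_stage, reduce_stage and wordcount are made up for this illustration, and the word-count task is an assumed example.

```shell
# Map: split the input into words and emit one "word<TAB>1" pair per word.
map_stage() {
  tr -s '[:space:]' '\n' | awk 'NF { print tolower($0) "\t1" }'
}

# Shuffle: sort the pairs so that equal keys end up next to each other.
shuffle_stage() {
  LC_ALL=C sort
}

# Reduce: add up the counts of each run of identical keys.
reduce_stage() {
  awk -F'\t' '
    ($1 "") != (key "") { if (NR > 1) print key "\t" sum; key = $1; sum = 0 }
    { sum += $2 }
    END { if (NR > 0) print key "\t" sum }
  '
}

# The whole "job": map, then shuffle, then reduce.
wordcount() {
  map_stage | shuffle_stage | reduce_stage
}
```

For example, printf 'to be or not to be\n' | wordcount prints one line per distinct word with its count (be 2, not 1, or 1, to 2). In Hadoop the same three steps run in parallel on many nodes, and the input and output live in HDFS rather than in a pipe.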

HDFS Architecture

HDFS follows the master-slave architecture and it has the following elements.

Namenode

The namenode is commodity hardware running the GNU/Linux operating system and the namenode software, which can run on any commodity machine. The system hosting the namenode acts as the master server and performs the following tasks:

• Manages the file system namespace.

• Regulates clients’ access to files.

• Executes file system operations such as renaming, closing, and opening files and directories.

Datanode

The datanode is commodity hardware running the GNU/Linux operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there is a datanode. These nodes manage the data storage of their system.

• Datanodes perform read-write operations on the file systems, as per client request.

• They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.

Block

Generally, user data is stored in HDFS files. A file is divided into one or more segments, which are stored on individual datanodes. These file segments are called blocks. In other words, a block is the minimum amount of data that HDFS can read or write. The default block size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x), and it can be changed in the HDFS configuration.
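As a small worked example of how a file maps onto blocks, the sketch below computes the number of blocks needed for a file of a given size. The helper name blocks_needed is hypothetical, sizes are given in whole megabytes for simplicity, and the 64 MB default is the figure quoted above.

```shell
# blocks_needed FILE_SIZE_MB [BLOCK_SIZE_MB]
# Print how many blocks a file of the given size occupies (rounding up).
blocks_needed() {
  size_mb=$1
  block_mb=${2:-64}
  echo $(( (size_mb + block_mb - 1) / block_mb ))
}

blocks_needed 200       # 200 MB file with 64 MB blocks
blocks_needed 200 128   # same file with 128 MB blocks
```

The two calls print 4 and 2. With 64 MB blocks the 200 MB file fills three blocks completely and leaves 8 MB for the fourth; a partially filled last block only takes up as much disk space as the data it holds.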

Advantages

• Allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines.

• Hadoop does not rely on hardware to provide fault tolerance; rather, the Hadoop library itself has been designed to detect and handle failures at the application layer.

• Servers can be added to or removed from the cluster dynamically, and Hadoop continues to operate without interruption.

Disadvantages

• Security concerns

• Vulnerable by nature

• Not fit for small data

• Potential stability issues

Terminology

• PayLoad - Applications implement the Map and the Reduce functions, and form the core of the job.

• Mapper - Maps the input key/value pairs to a set of intermediate key/value pairs.

• NameNode - Node that manages the Hadoop Distributed File System (HDFS).

• DataNode - Node where the data resides before any processing takes place.

• MasterNode - Node where the JobTracker runs and which accepts job requests from clients.

• SlaveNode - Node where the Map and Reduce programs run.

• JobTracker - Schedules jobs and tracks the jobs assigned to the TaskTracker.

• TaskTracker - Tracks the tasks and reports status to the JobTracker.

• Job - An execution of a Mapper and Reducer across a dataset.

• Task - An execution of a Mapper or a Reducer on a slice of data.

• Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.

Prerequisites

•Update

sudo apt-get update

•Install java

sudo apt-get install default-jdk

•Install ssh

sudo apt-get install ssh

Adding a dedicated Hadoop system user

$ sudo addgroup hadoop

$ sudo adduser --ingroup hadoop hduser

This adds the user hduser and the group hadoop to your local machine.

Configuring SSH

SSH setup to localhost

The first SSH connection to localhost saves your local machine’s host key fingerprint to the hduser user’s known_hosts file.
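Hadoop's start scripts log in to each node over SSH, including localhost, so hduser needs a key that works without a password prompt. One common way to set this up is sketched below as an assumption, not as the exact commands of this guide. The function name setup_ssh_keys and its optional directory argument are hypothetical; ssh-keygen and the authorized_keys file are standard OpenSSH.

```shell
# setup_ssh_keys [SSH_DIR]
# Create an RSA key pair with an empty passphrase (if none exists yet) and
# authorize it for logins to this same account.
setup_ssh_keys() {
  ssh_dir=${1:-$HOME/.ssh}
  mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1
  if [ ! -f "$ssh_dir/id_rsa" ]; then
    ssh-keygen -t rsa -P "" -f "$ssh_dir/id_rsa" || return 1
  fi
  touch "$ssh_dir/authorized_keys"
  # Append the public key only once, so the function can be re-run safely.
  if ! grep -qxF -- "$(cat "$ssh_dir/id_rsa.pub")" "$ssh_dir/authorized_keys"; then
    cat "$ssh_dir/id_rsa.pub" >> "$ssh_dir/authorized_keys"
  fi
  chmod 600 "$ssh_dir/authorized_keys"
}
```

Run setup_ssh_keys while logged in as hduser, then connect once with ssh localhost and answer yes to the fingerprint question; later connections should no longer ask for a password.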

Install Hadoop

hduser@laptop:~$ wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz

hduser@laptop:~$ tar xvzf hadoop-2.6.0.tar.gz

Move the Hadoop installation to the /usr/local/hadoop directory using the following command:

hduser@laptop:~/hadoop-2.6.0$ sudo mv * /usr/local/hadoop

[sudo] password for hduser:

hduser is not in the sudoers file. This incident will be reported.

Oops! We got:

"hduser is not in the sudoers file. This incident will be reported."

This error can be resolved by switching to a user who already has sudo rights (here, k) and adding hduser to the sudo group:

hduser@laptop:~/hadoop-2.6.0$ su k

Password:

k@laptop:/home/hduser$ sudo adduser hduser sudo

[sudo] password for k:

Now that hduser has sudo privileges, we can move the Hadoop installation to the /usr/local/hadoop directory without any problem. The target directory must exist before the move:

k@laptop:/home/hduser$ sudo su hduser

hduser@laptop:~/hadoop-2.6.0$ sudo mkdir -p /usr/local/hadoop

hduser@laptop:~/hadoop-2.6.0$ sudo mv * /usr/local/hadoop

hduser@laptop:~/hadoop-2.6.0$ sudo chown -R hduser:hadoop /usr/local/hadoop

Setup Configuration Files

The following files will have to be modified to complete the Hadoop setup:

~/.bashrc

/usr/local/hadoop/etc/hadoop/hadoop-env.sh

/usr/local/hadoop/etc/hadoop/core-site.xml

/usr/local/hadoop/etc/hadoop/mapred-site.xml.template

/usr/local/hadoop/etc/hadoop/hdfs-site.xml

1. ~/.bashrc:

Before editing the .bashrc file in our home directory, we need to find the path where Java has been installed to set the JAVA_HOME environment variable:
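On Ubuntu, update-alternatives --config java lists the installed Java paths. As an alternative, the sketch below derives a candidate JAVA_HOME from the real location of the java executable. The helper name java_home_from_binary is hypothetical, and the result should be checked against the directories under /usr/lib/jvm before it is used.

```shell
# java_home_from_binary PATH_TO_JAVA
# Strip the trailing /bin/java (and /jre, if present) from the path of the
# real java executable to obtain a candidate JAVA_HOME.
java_home_from_binary() {
  printf '%s\n' "$1" | sed -e 's:/bin/java$::' -e 's:/jre$::'
}

# Resolve the symlink chain behind the java command, if one is installed.
if command -v java >/dev/null 2>&1; then
  java_home_from_binary "$(readlink -f "$(command -v java)")"
fi
```

For a binary at /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java the function prints /usr/lib/jvm/java-7-openjdk-amd64.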

Now we can append the following to the end of ~/.bashrc:

hduser@laptop:~$ vi ~/.bashrc

#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END

hduser@laptop:~$ source ~/.bashrc

2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh

We need to set JAVA_HOME by modifying the hadoop-env.sh file.

hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

Adding the above statement in the hadoop-env.sh file ensures that the value of JAVA_HOME variable will be available to Hadoop whenever it is started up.

3. /usr/local/hadoop/etc/hadoop/core-site.xml:

The /usr/local/hadoop/etc/hadoop/core-site.xml file contains configuration properties that Hadoop uses when starting up.

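This file can be used to override the default settings that Hadoop starts with. First create the temporary directory and make hduser its owner:

hduser@laptop:~$ sudo mkdir -p /app/hadoop/tmp

hduser@laptop:~$ sudo chown hduser:hadoop /app/hadoop/tmp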

Open the file and enter the following in between the <configuration></configuration> tag:

hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/core-site.xml

<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>

4. /usr/local/hadoop/etc/hadoop/mapred-site.xml

By default, the /usr/local/hadoop/etc/hadoop/ folder contains the /usr/local/hadoop/etc/hadoop/mapred-site.xml.template file, which has to be renamed/copied with the name mapred-site.xml:

hduser@laptop:~$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

The mapred-site.xml file is used to specify which framework is being used for MapReduce.

We need to enter the following content, so that the <configuration> element looks like this:

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
</configuration>

5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml

The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the cluster that is being used.

It is used to specify the directories which will be used as the namenode and the datanode on that host.

Before editing this file, we need to create two directories which will contain the namenode and the datanode for this Hadoop installation.

This can be done using the following commands:

hduser@laptop:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode

hduser@laptop:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode

hduser@laptop:~$ sudo chown -R hduser:hadoop /usr/local/hadoop_store

Open the file and enter the following content, so that the <configuration> element looks like this:

hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replications can be specified when the file is created.</description>
</property>

<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>

<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
</configuration>

Format the New Hadoop Filesystem

Now the Hadoop file system needs to be formatted so that we can start using it. The format command should be issued with write permission, since it creates a directory named current under the namenode directory (/usr/local/hadoop_store/hdfs/namenode).
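In Hadoop 2.x the namenode is formatted with hdfs namenode -format (the older hadoop namenode -format still works but prints a deprecation warning). The sketch below wraps the call in a hypothetical helper, format_namenode, and assumes that /usr/local/hadoop/bin is on the PATH of hduser; if it is not, call /usr/local/hadoop/bin/hdfs directly.

```shell
# format_namenode: initialize the namenode storage directory.
# Warning: formatting erases the metadata of any existing HDFS instance,
# so run it only once, before the first start.
format_namenode() {
  if ! command -v hdfs >/dev/null 2>&1; then
    echo "hdfs not found on PATH; re-check the ~/.bashrc settings" >&2
    return 1
  fi
  hdfs namenode -format
}
```

Calling format_namenode as hduser should end with a log line saying that the storage directory has been successfully formatted.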

Starting/Restarting Hadoop
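Hadoop 2.x ships its start scripts in the sbin directory of the installation. A minimal sketch, assuming /usr/local/hadoop/sbin is on the PATH and using a hypothetical wrapper named start_hadoop, starts HDFS first and YARN second, then lists the running daemons with jps (a tool from the JDK). The combined start-all.sh script also exists but is marked as deprecated in this version.

```shell
# start_hadoop: start HDFS, then YARN, then list the running Java daemons.
# Each step runs only if the previous one succeeded.
start_hadoop() {
  start-dfs.sh && start-yarn.sh && jps
}
```

On a healthy single-node setup the jps listing is expected to include NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.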

Stopping Hadoop
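The matching stop scripts live in the same sbin directory. The sketch below, again with a hypothetical wrapper name (stop_hadoop) and the assumption that sbin is on the PATH, shuts the daemons down in the reverse order of startup.

```shell
# stop_hadoop: stop YARN first, then HDFS.
stop_hadoop() {
  stop-yarn.sh
  stop-dfs.sh
}
```

Running jps afterwards should no longer show any of the Hadoop daemons.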

Hadoop Web Interfaces
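Once the daemons are running, each one serves a status page over HTTP. The sketch below prints the default Hadoop 2.x addresses; it assumes the default ports have not been changed in the configuration, and the helper name hadoop_ui_urls is hypothetical.

```shell
# hadoop_ui_urls [HOST]: print the default Hadoop 2.x web UI addresses.
hadoop_ui_urls() {
  host=${1:-localhost}
  # printf repeats the format once for every group of three arguments.
  printf '%-18s http://%s:%s/\n' \
    "NameNode" "$host" 50070 \
    "SecondaryNameNode" "$host" 50090 \
    "DataNode" "$host" 50075 \
    "ResourceManager" "$host" 8088
}

hadoop_ui_urls
```

Opening the NameNode address in a browser shows the cluster summary and a file system browser; the ResourceManager page lists submitted MapReduce jobs.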

Running a MapReduce job

Copy local example data to HDFS

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/Gutenberg

Run the MapReduce job
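Hadoop 2.6.0 bundles a word-count example in its MapReduce examples jar. The sketch below runs it on the data copied above. The wrapper name run_wordcount and the output directory used in the usage note are assumptions for illustration, and the jar path assumes the installation under /usr/local/hadoop.

```shell
# run_wordcount INPUT_DIR OUTPUT_DIR
# Run the bundled word-count example on an HDFS input directory.
# OUTPUT_DIR must not exist yet; Hadoop refuses to overwrite it.
run_wordcount() {
  examples_jar=${HADOOP_INSTALL:-/usr/local/hadoop}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar
  hadoop jar "$examples_jar" wordcount "$1" "$2"
}
```

A possible invocation is run_wordcount /user/hduser/Gutenberg /user/hduser/gutenberg-output. When the job finishes, the counts can be inspected with hdfs dfs -cat /user/hduser/gutenberg-output/part-r-00000.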
