HADOOP in Linux(Ubuntu)
Hadoop is an open-source framework that allows to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different CPU nodes. In short, Hadoop framework is capable enough to develop applications capable of running on clusters of computers and they could perform complete statistical analysis for a huge amounts of data.
Map stage : The map or mapper’s job is to process the input data. Generally the input data is in the form of file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
Reduce stage : This stage is the combination of the Shuffle stage and
the Reduce stage. The Reducer’s job is to process the data that comes from the
mapper. After processing, it produces a new set of output, which will be stored in the HDFS.
HDFS Architecture
HDFS follows the master-slave architecture and it has the following elements.
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. It is a software that can be run on commodity hardware. The system having the namenode acts as the master server and it does the following tasks:
Manages the file system namespace.
Regulates client’s access to files.
It also executes file system operations such as renaming, closing, and opening files and directories.
Datanode
The datanode is a commodity hardware having the GNU/Linux operating system and datanode software. For every node (Commodity hardware/System) in a cluster, there will be a datanode. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
Block
Generally the user data is stored in the files of HDFS. The file in a file system will be divided into one or more segments and/or stored in individual data nodes. These file segments are called as blocks. In other words, the minimum amount of data that HDFS can read or write is called a Block. The default block size is 64MB, but it can be increased as per the need to change in HDFS configuration.
Advantages
Allows the user to quickly write and test distributed systems. It is efficient, and it automatic distributes the data and work across the machines.
Hadoop does not rely on hardware to provide fault-tolerance, rather Hadoop library itself has been designed to detect and handle failures at the application layer.
Servers can be added or removed from the cluster dynamically and Hadoop continues to operate without interruption.
Disadvantages
Security is concerns Vulnerable by nature Not fit for small data Potential stability issues
Terminology
PayLoad - Applications implement the Map and the Reduce functions, and form the core of the job.
Mapper - Mapper maps the input key/value pairs to a set of intermediate key/value pair.
NamedNode - Node that manages the Hadoop Distributed File System (HDFS).
DataNode - Node where data is presented in advance before any processing takes place.
MasterNode - Node where JobTracker runs and which accepts job requests from clients.
SlaveNode - Node where Map and Reduce program runs.
JobTracker - Schedules jobs and tracks the assign jobs to Task tracker.
Task Tracker - Tracks the task and reports status to JobTracker.
Job - A program is an execution of a Mapper and Reducer across a dataset.
Task - An execution of a Mapper or a Reducer on a slice of data.
Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.
Prerequisites
•Update
•Install java
sudo apt-get install default-jdk •Install ssh
sudo apt-get install ssh
Adding a dedicated Hadoop system user
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
This will add user hduser and the group Hadoop to your local machine
Configuring SSH
SSH setup to localhost
Need to save your local machine’s host key fingerprint to the hduser user’s known host.
Install Hadoop
hduser@laptop:~$ wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
hduser@laptop:~$ tar xvzf hadoop-2.6.0.tar.gz
Move the Hadoop installation to the /usr/local/hadoop directory using the following command:
hduser@laptop:~/hadoop-2.6.0$ sudo mv * /usr/local/hadoop
[sudo] password for hduser:
hduser is not in the sudoers file. This incident will be reported. Oops!... We got:
"hduser is not in the sudoers file. This incident will be reported."
This error can be resolved by logging in as a root user, and then add hduser to sudo: hduser@laptop:~/hadoop-2.6.0$ su k
Password:
k@laptop:/home/hduser$ sudo adduser hduser sudo [sudo] password for k:
Now, the hduser has root priviledge, we can move the Hadoop installation to the/usr/local/hadoop directory without any problem:
k@laptop:/home/hduser$ sudo su hduser
hduser@laptop:~/hadoop-2.6.0$ sudo mv * /usr/local/hadoop
hduser@laptop:~/hadoop-2.6.0$ sudo chown -R hduser:hadoop /usr/local/hadoop
Setup Configuration Files
The following files will have to be modified to complete the Hadoop setup: ~/.bashrc /usr/local/hadoop/etc/hadoop/hadoop-env.sh /usr/local/hadoop/etc/hadoop/core-site.xml /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/hdfs-site.xml 1. ~/.bashrc:
Before editing the .bashrc file in our home directory, we need to find the path where Java has been installed to set the JAVA_HOME environment variable:
Now we can append the following to the end of ~/.bashrc: hduser@laptop:~$ vi ~/.bashrc
#HADOOP VARIABLES START
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib" #HADOOP VARIABLES END
hduser@laptop:~$ source ~/.bashrc
2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh
We need to set JAVA_HOME by modifying hadoop-env.sh file. hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
Adding the above statement in the hadoop-env.sh file ensures that the value of JAVA_HOME variable will be available to Hadoop whenever it is started up.
3. /usr/local/hadoop/etc/hadoop/core-site.xml:
The /usr/local/hadoop/etc/hadoop/core-site.xml file contains configuration properties that Hadoop uses when starting up.
This file can be used to override the default settings that Hadoop starts with. hduser@laptop:~$ sudo mkdir -p /app/hadoop/tmp
Open the file and enter the following in between the <configuration></configuration> tag:
hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/core-site.xml
<configuration> <property>
<name>hadoop.tmp.dir</name> <value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description> </property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
4. /usr/local/hadoop/etc/hadoop/mapred-site.xml
By default, the /usr/local/hadoop/etc/hadoop/ folder contains /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
file which has to be renamed/copied with the name mapred-site.xml:
hduser@laptop:~$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
The mapred-site.xml file is used to specify which framework is being used for MapReduce.
We need to enter the following content in between the <configuration></configuration> tag:
<name>mapred.job.tracker</name> <value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map
and reduce task. </description> </property> </configuration>
5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the cluster that is being used.
It is used to specify the directories which will be used as the namenode and the datanode on that host.
Before editing this file, we need to create two directories which will contain the namenode and the datanode for this Hadoop installation.
This can be done using the following commands:
hduser@laptop:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode hduser@laptop:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode hduser@laptop:~$ sudo chown -R hduser:hadoop /usr/local/hadoop_store Open the file and enter the following content in between the
<configuration></configuration> tag:
hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration> <property>
<name>dfs.replication</name> <value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
</description> </property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value> </property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value> </property>
</configuration>
Format the New Hadoop Filesystem
Now, the Hadoop file system needs to be formatted so that we can start to use it. The format command should be issued with write permission since it
creates current directory
Starting/Restarting Hadoop
Stopping Hadoop
Hadoop Web Interfaces
Running a MapReduce job
Copy local example data to HDFS
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/Gutenberg
Run the MapReduce job