Apache Hadoop: a new way for companies
to store and analyze big data
Reyna Ulaque
Software Engineer
Agenda
What is Big Data?
What is Hadoop?
Who uses Hadoop?
Hadoop Architecture
− Hadoop Distributed File System
− Hadoop MapReduce
Hadoop Ecosystem
Running Hadoop
What is Big Data?
Big data is a popular term used to describe the exponential
growth and availability of data, both structured and unstructured
When dealing with large datasets, organizations face difficulties
creating, manipulating, and managing them. Basically,
there are two problems with big data:
− How to store and work with large volumes of data.
− And, most importantly, how to interpret and analyze this data.
Hadoop appeared on the market as a solution to these problems,
providing a way to store and process this data
What is Hadoop?
Hadoop is open source software that provides a framework
written in Java for the distributed processing of large
data sets across clusters built with commodity hardware.
Designs can scale from a few nodes to thousands of nodes in a flexible way
Hadoop is a distributed system using a master-slave architecture,
− Using the Hadoop Distributed File System (HDFS) for storage
− And MapReduce algorithms for computing
Both MapReduce and the Hadoop Distributed File System are
designed so that node failures are automatically handled by the
framework
Who uses Hadoop?
Hadoop Architecture
Hadoop Distributed File System
HDFS is a distributed file system designed to run on
commodity hardware
HDFS is highly fault-tolerant and is designed to be
deployed on low-cost hardware.
HDFS provides high throughput access to application data
and is suitable for applications that have large data sets
Write-once-read-many access model for files
HDFS has a master/slave architecture. An HDFS cluster
consists of a single NameNode and many DataNodes
Hadoop Distributed File System
NameNode
− A master server that manages the file system namespace and regulates access to files by clients.
− Executes file system namespace operations like opening, closing, and renaming files and directories and determines the mapping of blocks to DataNodes.
− Keeps the file system metadata in memory and tracks which file blocks are stored on each DataNode.
DataNodes
− They are responsible for serving read and write requests from the file system’s clients.
− They also perform block creation, deletion, and replication upon instruction from the NameNode.
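This division of labor can be sketched in a few lines of plain Python. This is a conceptual model only, not the real HDFS code or API; the class and method names, the block size, and the placement policy are all illustrative:

```python
# Conceptual sketch of the NameNode/DataNode split (NOT real HDFS code).
# The NameNode holds only metadata; block contents live on the DataNodes.

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}            # block_id -> data

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]

class NameNode:
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.namespace = {}         # path -> list of (block_id, [DataNode])

    def create(self, path, data, block_size=4):
        chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
        blocks = []
        for i, chunk in enumerate(chunks):
            block_id = f"{path}#blk{i}"
            # Choose `replication` DataNodes and instruct each to store the block.
            targets = self.datanodes[:self.replication]
            for dn in targets:
                dn.write_block(block_id, chunk)
            blocks.append((block_id, targets))
        self.namespace[path] = blocks

    def read(self, path):
        # Clients ask the NameNode *where* the blocks are,
        # then fetch the bytes directly from a DataNode.
        return "".join(dns[0].read_block(bid) for bid, dns in self.namespace[path])

dns = [DataNode(f"dn{i}") for i in range(4)]
nn = NameNode(dns)
nn.create("/user/hduser/demo.txt", "hello hdfs!")
print(nn.read("/user/hduser/demo.txt"))  # hello hdfs!
```

Note that the file contents never pass through the NameNode on a read; it only hands out block locations, which is what lets HDFS scale its data traffic across the DataNodes.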
HDFS Architecture
Hadoop MapReduce
It is a software framework for easily writing applications
which process vast amounts of data (multi-terabyte data-
sets) in-parallel on large clusters (thousands of nodes) of
commodity hardware in a reliable, fault-tolerant manner.
MapReduce divides workloads up into multiple tasks that
can be executed in parallel.
Typically the compute nodes and the storage nodes are the
same
Hadoop MapReduce
The Map/Reduce framework consists of a single master
JobTracker and many TaskTrackers, one per cluster node.
− The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks.
− The slaves execute the tasks as directed by the master.
Two main phases: Map and Reduce:
− Map step: the master node receives a job, divides it into smaller tasks, and distributes them to other nodes to process.
− Reduce step: the master node collects all the responses received and combines them to generate the output.
Hadoop MapReduce
A MapReduce job is converted into map and reduce tasks
Developers need ONLY to implement the Map and Reduce
classes
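The two phases can be illustrated with a minimal single-process word count in plain Python. This is only a sketch of the programming model; real Hadoop jobs implement Mapper and Reducer classes in Java, and the framework runs many mappers and reducers in parallel with a shuffle step between them:

```python
from collections import defaultdict

# Map phase: each "mapper" turns a line of text into (word, 1) pairs.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# Shuffle: the framework groups all intermediate values by key
# before handing them to the reducers.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: each "reducer" sums the counts for one word.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["Hadoop stores data", "Hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["hadoop"], counts["data"])  # 2 2
```

Because each (word, 1) pair can be produced independently and each word's counts can be summed independently, both phases parallelize naturally across nodes, which is exactly what the JobTracker/TaskTracker machinery exploits.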
MapReduce Logical Architecture
Hadoop ecosystem
Running Hadoop
Hadoop Cluster Installation
Supported Platforms
− GNU/Linux is supported as a development and production platform
− Win32 is supported as a development platform. Distributed operation has not been well tested on Win32
Required Software
− Java 1.6 or later, preferably from Sun, must be installed.
− ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons.
To get a Hadoop distribution, download a recent stable
release from one of the Apache Download Mirrors
http://www.gtlib.gatech.edu/pub/apache/hadoop/core/
Hadoop Cluster Installation
Installing a Hadoop cluster typically involves unpacking the
software on all the machines in the cluster
The root of the distribution is referred to as
HADOOP_HOME
Configuration Files
− hadoop-env.sh
− Master and slave configuration (master only): conf/masters and conf/slaves
− Site-specific configuration (all machines): conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml.
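As a minimal illustration, the site-specific files for a small Hadoop 1.x cluster might look like the following. The hostname master and the port numbers are placeholders for your own cluster; the property names (fs.default.name, dfs.replication, mapred.job.tracker) are the Hadoop 1.x ones:

```xml
<!-- conf/core-site.xml: where HDFS lives -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: how many copies of each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: where the JobTracker listens -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
  </property>
</configuration>
```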
Hadoop Cluster Installation
Formatting the HDFS filesystem via the NameNode
− /usr/local/hadoop/bin/hadoop namenode -format
Starting your multi-node cluster (only master server)
− HDFS daemons: bin/start-dfs.sh
− MapReduce daemons: bin/start-mapred.sh
− After startup, the following Java processes should be running on the master (NameNode, SecondaryNameNode, JobTracker) and on each slave (DataNode, TaskTracker); check with the jps command
Hadoop Cluster Installation
Stopping the multi-node cluster (only master server)
− MapReduce daemons: bin/stop-mapred.sh
− HDFS daemons: bin/stop-dfs.sh
If there are any errors, examine the log files in the /logs/
directory.
− hadoop-hduser-namenode-hadoopsrv2.log, hadoop-hduser-datanode-hadoopsrv2.log, hadoop-hduser-tasktracker-hadoopsrv2.log
Hadoop Web Interfaces
− http://localhost:50070/ – web UI of the NameNode daemon
− http://localhost:50030/ – web UI of the JobTracker daemon
Hadoop Cluster Setup common problems
Problem 1
− When starting a Hadoop cluster, a warning message is raised:
$HADOOP_HOME is deprecated
− To resolve this, replace the "HADOOP_HOME" variable with the
"HADOOP_PREFIX" variable in the .bashrc file.
Problem 2
− java.io.IOException: Incompatible namespaceIDs: raised if you have
formatted the NameNode more than once; in this case the new namespaceID is not propagated to the DataNodes
(see /app/hadoop/tmp/dfs/name/current/VERSION)
Running a MapReduce job
We will use the WordCount example job which reads text
files and counts how often words occur. Download
example input data
Copy local example data to HDFS
− bin/hadoop dfs -copyFromLocal /tmp/inputtext/ /user/hduser/input-text
Running a MapReduce job
Run the MapReduce job
− bin/hadoop jar hadoop-examples-1.1.2.jar wordcount /user/hduser/input-text /user/hduser/results-output
− bin/hadoop dfs -ls /user/hduser
Retrieve the job result from HDFS
− bin/hadoop dfs -cat /user/hduser/results-output/part-r-00000
− mkdir /tmp/results-output
− bin/hadoop dfs -getmerge /user/hduser/results-output /tmp/results-output
Running a MapReduce job
Retrieve the job result from HDFS
− head /tmp/results-output/results-output