(1)

Apache Hadoop: a new way for companies to store and analyze big data

Reyna Ulaque

Software Engineer

(2)

Agenda

What is Big Data?

What is Hadoop?

Who uses Hadoop?

Hadoop Architecture

− Hadoop Distributed File System

− Hadoop MapReduce

Hadoop Ecosystem

Running Hadoop

(3)

What is Big Data?

Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured.

When dealing with larger datasets, organizations face difficulties in creating, manipulating, and managing big data. Basically, there are two problems with big data:

− How to store and work with large volumes of data.

− And, most importantly, how to interpret and analyze this data.

Hadoop appeared on the market as a solution to these problems, providing a way to store and process this data.

(4)

What is Hadoop?

Hadoop is open-source software that provides a framework, written in Java, for the distributed processing of large data sets across clusters built with commodity hardware.

A cluster can scale flexibly from a few nodes to thousands of nodes.

Hadoop is a distributed system with a master-slave architecture:

− It uses the Hadoop Distributed File System (HDFS) for storage.

− And the MapReduce programming model for computation.

Both MapReduce and the Hadoop Distributed File System are designed so that node failures are automatically handled by the framework.

(5)

Who uses Hadoop?

(6)

Hadoop Architecture

(7)

Hadoop Distributed File System

HDFS is a distributed file system designed to run on commodity hardware.

HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.

HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.

Files follow a write-once-read-many access model.

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode and many DataNodes.

(8)

Hadoop Distributed File System

NameNode

− A master server that manages the file system namespace and regulates access to files by clients.

− Executes file system namespace operations like opening, closing, and renaming files and directories and determines the mapping of blocks to DataNodes.

− Keeps the file system metadata in memory and tracks which file blocks are stored on each DataNode.

DataNodes

− They are responsible for serving read and write requests from the file system’s clients.

− They also perform block creation, deletion, and replication upon instruction from the NameNode.
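
To make this division of labor concrete, here is a minimal sketch using the standard HDFS Java client API (the file path and class name are illustrative; the calls are from the org.apache.hadoop.fs package):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml/hdfs-site.xml from the classpath;
        // fs.default.name there tells the client where the NameNode is.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The client asks the NameNode for block locations; the file bytes
        // themselves are streamed to and from DataNodes directly.
        Path file = new Path("/user/hduser/hello.txt");   // illustrative path
        FSDataOutputStream out = fs.create(file);         // write once...
        out.writeUTF("hello hdfs");
        out.close();

        System.out.println("exists: " + fs.exists(file)); // ...read many
        fs.close();
    }
}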

(9)

HDFS Architecture

(10)

Hadoop MapReduce

It is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

MapReduce divides workloads into multiple tasks that can be executed in parallel.

Typically, the compute nodes and the storage nodes are the same.

(11)

Hadoop MapReduce

The Map/Reduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node.

The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing failed tasks.

The slaves execute the tasks as directed by the master.

Two main phases, Map and Reduce:

Map step: the master node receives a task and divides it into smaller tasks, which it distributes to other nodes to process.

Reduce step: the master node collects all the responses received and combines them to generate the output.
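
As a worked illustration (using the WordCount job shown later in this deck), two input lines flow through the phases like this:

  Input lines:    "apache hadoop"          "hadoop hdfs"
  Map output:     (apache,1) (hadoop,1)    (hadoop,1) (hdfs,1)
  Shuffle/sort:   (apache,[1])  (hadoop,[1,1])  (hdfs,[1])
  Reduce output:  (apache,1) (hadoop,2) (hdfs,1)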

(12)

Hadoop MapReduce

A MapReduce job is converted into map and reduce tasks.

Developers need ONLY to implement the Map and Reduce classes, as the sketch below shows.
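
A minimal sketch of those two classes, following the canonical WordCount example from the Hadoop documentation (TokenizerMapper and IntSumReducer are the names used there; splitting, shuffling, sorting, and scheduling are all handled by the framework):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: emit (word, 1) for every word in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: sum the 1s for each word; the framework has already
    // grouped and sorted the map output by key.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}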

(13)

MapReduce Logical Architecture

(14)

Hadoop Ecosystem

(15)

Running Hadoop

(16)

Hadoop Cluster Installation

Supported Platforms

− GNU/Linux is supported as a development and production platform

− Win32 is supported as a development platform. Distributed operation has not been well tested on Win32

Required Software

− Java 1.6 or later, preferably from Sun, must be installed.

− ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons.

To get a Hadoop distribution, download a recent stable release from one of the Apache Download Mirrors:

http://www.gtlib.gatech.edu/pub/apache/hadoop/core/

(17)

Hadoop Cluster Installation

Installing a Hadoop cluster typically involves unpacking the software on all the machines in the cluster.

The root of the distribution is referred to as HADOOP_HOME.

Configuration Files

− hadoop-env.sh

− Master and slave configuration (master only): conf/masters and conf/slaves

− Site-specific configuration (all machines): conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml; minimal examples follow below.
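
A minimal site-specific configuration could look like the sketch below (the hostname master and the ports are placeholders; fs.default.name, mapred.job.tracker, and dfs.replication are the standard Hadoop 1.x property names):

conf/core-site.xml (where the NameNode lives):

  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://master:54310</value>
    </property>
  </configuration>

conf/mapred-site.xml (where the JobTracker lives):

  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>master:54311</value>
    </property>
  </configuration>

conf/hdfs-site.xml (how many replicas per block):

  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>
  </configuration>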

(18)

Hadoop Cluster Installation

Formatting the HDFS filesystem via the NameNode

− /usr/local/hadoop/bin/hadoop namenode -format

Starting your multi-node cluster (on the master server only)

− HDFS daemons: bin/start-dfs.sh

− MapReduce daemons: bin/start-mapred.sh

− The following Java processes should then be running on the master and slaves:
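
For example, running jps on each machine (the process IDs below are illustrative; the daemon names are the standard Hadoop 1.x ones):

On the master:

  $ jps
  14799 NameNode
  15183 SecondaryNameNode
  15596 JobTracker
  16284 Jps

On each slave:

  $ jps
  14880 DataNode
  15897 TaskTracker
  16120 Jps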

(19)

Hadoop Cluster Installation

Stopping the multi-node cluster (on the master server only)

− MapReduce daemons: bin/stop-mapred.sh

− HDFS daemons: bin/stop-dfs.sh

If there are any errors, examine the log files in the logs/ directory:

− hadoop-hduser-namenode-hadoopsrv2.log, hadoop-hduser-datanode-hadoopsrv2.log, hadoop-hduser-tasktracker-hadoopsrv2.log

Hadoop Web Interfaces

− http://localhost:50070/ – web UI of the NameNode daemon

− http://localhost:50030/ – web UI of the JobTracker daemon

(20)

Hadoop Cluster Setup common problems

Problem 1

− When starting a Hadoop cluster, the warning message "$HADOOP_HOME is deprecated" is raised.

− To resolve this, replace the HADOOP_HOME variable with the HADOOP_PREFIX variable in your .bashrc file, for example:

Problem 2

− java.io.IOException: Incompatible namespaceIDs. This occurs if you have formatted the NameNode twice; the new namespaceID is not replicated to the DataNodes (see /app/hadoop/tmp/dfs/name/current/VERSION).
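
One common fix is to make the DataNodes' stored namespaceID match the NameNode's (the dfs/data path below assumes the same hadoop.tmp.dir as the path above):

  1. Stop the cluster.
  2. Note the namespaceID in /app/hadoop/tmp/dfs/name/current/VERSION on the NameNode.
  3. Edit namespaceID in /app/hadoop/tmp/dfs/data/current/VERSION on each affected DataNode to that value.
  4. Restart the cluster with bin/start-dfs.sh.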

(21)

Running a MapReduce job

We will use the WordCount example job, which reads text files and counts how often words occur. Download some example input data first.

Copy the local example data to HDFS:

− bin/hadoop dfs -copyFromLocal /tmp/inputtext/ /user/hduser/input-text

(22)

Running a MapReduce job

Run the MapReduce job

− bin/hadoop jar hadoop-examples-1.1.2.jar wordcount /user/hduser/input-text /user/hduser/results-output

− bin/hadoop dfs -ls /user/hduser

Retrieve the job result from HDFS

− bin/hadoop dfs -cat /user/hduser/results-output/part-r-00000

− mkdir /tmp/results-output

− bin/hadoop dfs -getmerge /user/hduser/results-output /tmp/results-output

(23)

Running a MapReduce job

Retrieve the job result from HDFS

− head /tmp/results-output/results-output
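
Each line of the merged result is a word and its count, separated by a tab, for example (the counts shown are purely illustrative):

  apache   125
  hadoop   273
  hdfs     91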

(24)

Q&A

(25)

References
