Hadoop Installation MapReduce Examples Jake Karnes

Academic year: 2021

(1)

Big Data Management

Hadoop Installation

MapReduce Examples

Jake Karnes

These slides are based on materials / slides from

• Cloudera.com

• Amazon.com

(2)

Prerequisites

You must have an Amazon Web Services account before you begin.

– You can sign up for an account here: go to http://aws.amazon.com and click the Sign Up button.

– You will have to provide a credit card number during sign-up. You will likely incur some charges, but we can take steps to minimize these.

● Although Amazon has a Free Tier we will require

(3)

Prerequisites

This tutorial requires accessing a remote server through SSH.

– UNIX-based operating systems (Mac and Linux distros) have this functionality available from their terminals.

– Windows does not. You'll need to download and install an SSH client.

● PuTTY should work for our purposes.

– I have not personally tested this.

A working understanding of Java is also expected when we discuss code examples for MapReduce.

(4)
(5)

Your First Server

First, log in to your Amazon Web Console.

Go to EC2 (it should be in the upper left).

You should see this screen.

(6)

Creating a Security Group

Click Security Groups in the left menu.

Click the Create Security Group button.

Provide a Name and Description when prompted.

In the bottom panel, go to the Inbound tab.

Authorize all TCP communications.

Authorize SSH access on port 22.

Authorize ICMP (Echo Reply).

Click the button underneath the rule

(7)
(8)
(9)

Creating SSH keys

Click Key Pairs in the left menu.

Click the Create Key button.

Provide a name for your key pair.

Your private key <keypair-name>.pem will be downloaded automatically.

AWS does not store the private keys.

If you lose this file, you won't be able to SSH into instances you provision with this key pair.

(10)
(11)

Launch an EC2 Instance

Click Instances in the left menu.

Click the Launch Instance button.

Choose Ubuntu 12.04 LTS 64-bit.

Go to the General Purpose tab and select m1.large.

In Step 3, choose to create 4 instances.

In Step 4, allocate 20 GB to the root drive.

Continue past Step 5.

In Step 6 (Configure Security Group), choose the group that you created earlier.

Ignore warnings about the security group.

Choose the Key Pair you created earlier.

(12)
(13)
(14)
(15)
(16)
(17)
(18)

Connect to your server

Click on Instances in the left menu.

Choose one of the instances you just created and copy the public DNS.

Ex:

(19)

Connect to your server

Open a terminal on your local computer.

Enter the following command to ensure your private key isn't publicly viewable.

– chmod 400 ~/.ssh/<my-key-pair>.pem

Enter the following command to connect to your Amazon instance.

– ssh -i ~/.ssh/<my-key-pair>.pem ubuntu@<Public DNS>

– EX: ssh -i ~/.ssh/HadoopKey.pem ubuntu@<Public DNS>

Accept the fingerprint.

You are now connected!

(20)

Install Cloudera & Hadoop

(21)

What's Cloudera Manager?

● Cloudera was the first, and is currently the leading, provider and supporter of Apache Hadoop for enterprise users.

● We will be using Cloudera Manager.

● Cloudera Manager is an administrative tool for installing and maintaining Hadoop and many other tools in the Hadoop ecosystem.

● CDH is Cloudera's open source distribution of Apache Hadoop.

(22)

Installing Cloudera Manager

After you've connected to your instance, enter the following command to download the Cloudera installer.

– wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin

Execute the installer with these commands:

– sudo su

– chmod +x cloudera-manager-installer.bin

– ./cloudera-manager-installer.bin

Accept the licenses and wait for the installer to

(23)
(24)
(25)
(26)

Troubleshooting

● If the installation pauses at any one step for more than 5 minutes, something has gone wrong.

● First try to cancel the installation by using CTRL+C. Exit the installer and re-execute the .bin file.

● If you cannot exit using CTRL+C, close the terminal window, reconnect to the server, and relaunch the installer.

(27)

Using Cloudera Manager

Afterward, point your browser to: http://<Public DNS>:7180

– EX: http://ec2-54-193-92-102.us-west-1.compute.amazonaws.com:7180

(28)

Using Cloudera Manager

Enter the Public DNS for each of your instances.

Click Search. Ensure that all instances are

(32)

Using Cloudera Manager

Enter ubuntu as the user.

Upload the .pem file that was downloaded

(34)

Using Cloudera Manager

Use embedded databases.

(41)

(44)

Using Hadoop and MapReduce

(45)

Getting Test Data

● Download the following tar.gz file to your local machine:

– https://drive.google.com/file/d/0B9FMXVD4BtEdQWZsTEgyaUE5cTg/edit?usp=sharing

● Upload the file to your EC2 instance with the following command.

– scp -i ~/.ssh/<KEY FILE NAME>.pem <LOCAL PATH TO FILE>/shakespeare.tar.gz ubuntu@<PUBLIC DNS>:~

– EX: scp -i ~/.ssh/HadoopKey.pem /home/jake/Desktop/cs157b/shakes/output/shakespeare.tar.gz ubuntu@<PUBLIC DNS>:~

● Log into the same EC2 instance.

● Unzip the file with these commands:

– mkdir shakes

(46)

All of Shakespeare's Work

● You now have all of Shakespeare's written works.

● Typically Hadoop works better with larger files, but

(47)

Deploying Test Data into HDFS

● Run the following command to make a directory on HDFS.

– sudo -u hdfs hadoop fs -mkdir /user/ubuntu

● The next command changes the ownership of the newly created directory to our user (ubuntu).

– sudo -u hdfs hadoop fs -chown -R ubuntu /user/ubuntu

● Create an input directory.

– hadoop fs -mkdir /user/ubuntu/input

● Load our test text files into HDFS.

– hadoop fs -put ~/shakes/* /user/ubuntu/input

● Our files are now replicated and distributed across our

(48)

Word Count

● Let's count how many times each word is used.

– The data has been normalized to remove punctuation and case sensitivity.

● Download the WordCount.java file to your EC2 instance:

– cd ~

– wget cs.cmu.edu/~abeutel/WordCount.java

● Let's compile the code into a jar with these commands:

– mkdir wordcount_classes

– javac -classpath /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-core.jar:/opt/cloudera/parcels/CDH/lib/hadoop/hadoop-common.jar -d wordcount_classes WordCount.java

– jar -cvf ~/wordcount.jar -C wordcount_classes/ .

● Let's run it!

– hadoop jar ~/wordcount.jar WordCount /user/ubuntu/input

(49)

What Did We Just Do?

● We've just run our first MapReduce job!

● We have counted how many times each word appears.

● To check on the output, run the following command:

– hadoop fs -cat /user/ubuntu/output/part-00000

● On the left side we have the individual words.

● On the right is the number of times they appeared in all of

(50)

Let's Look at Code (Finally)

You can download the WordCount.java file by going here:

– cs.cmu.edu/~abeutel/WordCount.java

At a high level, we'll see a class called WordCount. It contains:

– 2 inner, static classes that define a single method each.

● Map

● Reduce
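Between these two classes, Hadoop groups the Map output by key, so that each call to Reduce sees one word together with all of its values. The slides do not show that step, so here is a minimal plain-Java sketch of the idea. The class name ShuffleSketch and the method groupByKey are hypothetical, and no Hadoop types are used.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the grouping Hadoop performs between Map and Reduce.
class ShuffleSketch {

    // Collects every value emitted for a key into one list per key.
    // A TreeMap keeps the keys sorted, which makes the result easy to read.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Made-up (word, 1) pairs, as a mapper might emit them.
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        mapOutput.add(new SimpleEntry<>("be", 1));
        mapOutput.add(new SimpleEntry<>("to", 1));
        mapOutput.add(new SimpleEntry<>("be", 1));

        System.out.println(groupByKey(mapOutput)); // prints {be=[1, 1], to=[1]}
    }
}
```

Each entry of the grouped result corresponds to one Reduce call: the key is the word, and the list is what the iterator would walk over.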

(51)

The Map Class

● LongWritable key = byte offset of the line.

● Text value = a single line of text.

● OutputCollector = A collection of KV pairs that will be sent to a Reducer once all Mappers are finished.

– OutputCollector – Text = A single word

(52)

Map Method I/O

[Figure: example input and output of the map method]
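To make the map step concrete without needing a cluster, here is a minimal plain-Java sketch. It is not the code from WordCount.java: the class name WordCountMapSketch is hypothetical, a long and a String stand in for LongWritable and Text, and the returned list stands in for the OutputCollector. It assumes each word is emitted with a count of 1, which matches the "1 values" the Reduce class iterates over.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java stand-in for the Map class; no Hadoop types are used.
class WordCountMapSketch {

    // key   = byte offset of the line (unused in this sketch)
    // value = a single line of text
    // Returns the (word, 1) pairs that an OutputCollector would receive.
    static List<Map.Entry<String, Integer>> map(long key, String value) {
        List<Map.Entry<String, Integer>> output = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(value);
        while (tokenizer.hasMoreTokens()) {
            output.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return output;
    }

    public static void main(String[] args) {
        // Made-up input line; the offset 0 is arbitrary.
        for (Map.Entry<String, Integer> pair : map(0L, "to be or not to be")) {
            System.out.println(pair.getKey() + "\t" + pair.getValue());
        }
    }
}
```

Note that the mapper does no counting of its own: a word that appears twice on a line is simply emitted twice.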

(53)

The Reduce Class

● Text key = A single word

● Iterator<IntWritable> value = An iterator over all of the 1 values associated with the given key (word).

● OutputCollector = A collection of KV pairs that make up the output of the Reducer.

– OutputCollector – Text = The same word

– OutputCollector – IntWritable = The number of

(54)

Reduce Method I/O

Input:
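The reduce step can be sketched the same way. This is a plain-Java illustration, not the code from WordCount.java: the class name WordCountReduceSketch is hypothetical, an Iterator<Integer> stands in for Iterator<IntWritable>, and the return value stands in for the number handed to the OutputCollector.

```java
import java.util.Arrays;
import java.util.Iterator;

// Plain-Java stand-in for the Reduce class; no Hadoop types are used.
class WordCountReduceSketch {

    // key    = a single word (not needed to compute the sum)
    // values = an iterator over the 1s emitted for that word
    // Returns the total that would be collected alongside the key.
    static int reduce(String key, Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up input: the word "be" was emitted three times by the mappers.
        Iterator<Integer> ones = Arrays.asList(1, 1, 1).iterator();
        System.out.println("be\t" + reduce("be", ones)); // prints be<TAB>3
    }
}
```

Summing the values rather than counting them means the same method would still give the right answer if a mapper emitted partial counts larger than 1.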

(55)
(56)

Inverted Index

● Let's count how many times each word is used in total and how many times it's used per file!

● Download the InvertedIndex.java file to your local machine:

– https://drive.google.com/file/d/0B9FMXVD4BtEdQWszYTJMMTZFTXc/edit?usp=sharing

● Move the file to your EC2 instance.

– scp -i ~/.ssh/HadoopKey.pem <LOCAL PATH TO FILE>/InvertedIndex.java ubuntu@<PUBLIC DNS>:~

● Log into your EC2 instance.

● Let's compile the code into a jar with these commands:

– mkdir invertedindex_classes

– javac -classpath /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-core.jar:/opt/cloudera/parcels/CDH/lib/hadoop/hadoop-common.jar -d invertedindex_classes InvertedIndex.java

– jar -cvf ~/invertedindex.jar -C invertedindex_classes/ .

● Let's run it!

– hadoop fs -rm -r /user/ubuntu/output

– hadoop jar ~/invertedindex.jar InvertedIndex /user/ubuntu/input

(57)
(58)

What did we change?

● Only minor changes were needed to enhance our WordCount program into the InvertedIndex program.

● You should already have the InvertedIndex.java file downloaded to your computer if you want to open it and inspect it for yourself.

(59)

The New Map Class

● LongWritable key = byte offset of the line.

● Text value = a single line of text.

● OutputCollector = A collection of KV pairs that will be sent to a Reducer once all Mappers are finished.

– OutputCollector – Text = A single word

– OutputCollector – Text = The file name containing this

(60)

New Map Method I/O

[Figure: example input and output of the new map method]
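The only change on the map side is the value that gets emitted: the file name instead of a 1. Here is a minimal plain-Java sketch of that idea, not the code from InvertedIndex.java. The class name InvertedIndexMapSketch is hypothetical, and the file name is passed in as an extra parameter for simplicity; a real mapper would have to obtain it from the framework.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java stand-in for the new Map class; no Hadoop types are used.
class InvertedIndexMapSketch {

    // key      = byte offset of the line (unused in this sketch)
    // value    = a single line of text
    // fileName = name of the file the line came from (an assumption of this
    //            sketch; it is not a parameter of a real Hadoop map method)
    // Returns the (word, fileName) pairs that an OutputCollector would receive.
    static List<Map.Entry<String, String>> map(long key, String value, String fileName) {
        List<Map.Entry<String, String>> output = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(value);
        while (tokenizer.hasMoreTokens()) {
            output.add(new SimpleEntry<>(tokenizer.nextToken(), fileName));
        }
        return output;
    }

    public static void main(String[] args) {
        // Made-up line and file name, for illustration only.
        for (Map.Entry<String, String> pair : map(0L, "to be or not to be", "hamlet.txt")) {
            System.out.println(pair.getKey() + "\t" + pair.getValue());
        }
    }
}
```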

(61)

The New Reduce Class

● Text key = A single word

● Iterator<Text> value = An iterator over all of the filenames containing the given key (word).

● OutputCollector = A collection of KV pairs that make up the output of the Reducer.

– OutputCollector – Text = The same word

– OutputCollector – Text = The number of occurrences of that

(62)

New Reduce Method I/O

Input:
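On the reduce side, counting how often each file name appears gives the per-file counts, and counting every value gives the total. The sketch below is plain Java under those assumptions, not the code from InvertedIndex.java. The class name InvertedIndexReduceSketch is hypothetical, and the "total | {file=count}" output layout is invented for illustration; the real program's format may differ.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the new Reduce class; no Hadoop types are used.
class InvertedIndexReduceSketch {

    // key       = a single word (not needed to compute the counts)
    // fileNames = an iterator over the file names emitted for that word,
    //             one entry per occurrence
    // Returns the text that would be collected alongside the key.
    static String reduce(String key, Iterator<String> fileNames) {
        Map<String, Integer> perFile = new TreeMap<>();
        int total = 0;
        while (fileNames.hasNext()) {
            perFile.merge(fileNames.next(), 1, Integer::sum);
            total++;
        }
        // Output layout is an assumption of this sketch.
        return total + " | " + perFile;
    }

    public static void main(String[] args) {
        // Made-up file names, for illustration only.
        Iterator<String> names =
                Arrays.asList("hamlet.txt", "macbeth.txt", "hamlet.txt").iterator();
        // prints be<TAB>3 | {hamlet.txt=2, macbeth.txt=1}
        System.out.println("be\t" + reduce("be", names));
    }
}
```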

(63)
(64)

Retrieving files

● Now that we're done with MapReduce, let's get our files from HDFS to our local machines.

● Begin by logging into your EC2 instance.

● Get the files out of HDFS.

– hadoop fs -get /user/ubuntu/output/part* ~

● Now you have 2 new files in your home directory of the EC2 instance.

– Verify this by running: ls

● To download these to your local machine:

– Log out of the EC2 instance.

● Enter: ~.

● Your terminal will be returned to controlling your local machine.

– Run this command to download the output part files:

● scp -i ~/.ssh/<KeyFile>.pem ubuntu@<Public DNS>:~/part* ~/Desktop/

● You can now open the new files on your desktop in a text

(65)

Terminate Your Instances

● After you're done using Hadoop, you want to terminate your EC2 instances.

● If you don't, you will continue to be charged per hour (even if you aren't actively using them)!

● When you terminate your instances, though, you will lose ALL data/customizations.

● Therefore, always download any necessary files to your local machine before terminating your instances.

● From the AWS console, click Instances in the left menu.

● Mark the check box for all of your instances on the left side.

● Click on Actions, then choose Terminate.

● You will then see your instances shutting down.

● They will disappear after a few hours.

(66)
