Hadoop Installation MapReduce Examples Jake Karnes

(1)

Big Data Management

Hadoop Installation

MapReduce Examples

Jake Karnes

These slides are based on materials / slides from
• Cloudera.com
• Amazon.com

(2)

Prerequisites

● You must have an Amazon Web Services account before you begin.
– You can sign up for an account here: go to http://aws.amazon.com and click the Sign Up button.
– You will have to provide a credit card number during sign-up. You will likely incur some charges, but we can take steps to minimize these.

● Although Amazon has a Free Tier we will require

(3)

Prerequisites

● This tutorial requires accessing a remote server through SSH.
– UNIX-based operating systems (Mac and Linux distros) have this functionality available from their terminals.
– Windows does not. You'll need to download and install an SSH client.
● PuTTY should work for our purposes.
– I have not personally tested this.
● A working understanding of Java is also expected when we discuss code examples for MapReduce.

(5)

Your First Server

● First log into your Amazon Web Console.
● Go to EC2 (it should be in the upper left).
● You should see this screen.

(6)

Creating a Security Group

● Click Security Groups in the left menu.
● Click the Create Security Group button.
● Provide a Name and Description when prompted.
● In the bottom panel, go to the Inbound tab.
● Authorize all TCP communications.
● Authorize SSH access on port 22.
● Authorize ICMP (Echo Reply).
● Click the button underneath the rule

(9)

Creating SSH keys

● Click Key Pairs in the left menu.
● Click the Create Key button.
● Provide a name for your key pair.
● Your private key <keypair-name>.pem will be downloaded automatically.
– AWS does not store the private keys.
– If you lose this file, you won't be able to SSH into instances you provision with this key pair.

(11)

Launch an EC2 Instance

● Click Instances in the left menu.
● Click the Launch Instance button.
● Choose Ubuntu 12.04 LTS 64 bit.
● Go to the General Purpose tab and select m1.large.
● In Step 3, choose to create 4 instances.
● In Step 4, allocate 20GB to the root drive.
● Continue past Step 5.
● In Step 6 (Configure Security Group) choose the group that you created earlier.
– Ignore warnings about the security group.
● Choose the Key Pair you created earlier.

(18)

Connect to your server

● Click on Instances in the left menu.
● Choose one of the instances you just created and copy the public DNS.
– Ex:

(19)

Connect to your server

● Open a terminal on your local computer.
● Enter the following command to ensure your private key isn't publicly viewable.
– chmod 400 ~/.ssh/<my-key-pair>.pem
● Enter the following command to connect to your Amazon instance.
– ssh -i ~/.ssh/<my-key-pair>.pem ubuntu@<Public DNS>
– EX: ssh -i ~/.ssh/HadoopKey.pem [email protected]
● Accept the fingerprint.
● You are now connected!

(20)

Install Cloudera & Hadoop

(21)

What's Cloudera Manager?

● Cloudera was the first, and is currently the leading, provider and supporter of Apache Hadoop for enterprise users.
● We will be using Cloudera Manager.
● Cloudera Manager is an administrative tool for installing and maintaining Hadoop and many other tools in the Hadoop ecosystem.
● CDH is Cloudera's open source distribution of Apache Hadoop.

(22)

Installing Cloudera Manager

● After you've connected to your instance, enter the following command to download the Cloudera installer.
– wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
● Execute the installer with these commands:
– sudo su
– chmod +x cloudera-manager-installer.bin
– ./cloudera-manager-installer.bin
● Accept the licenses and wait for the installer to

(26)

Troubleshooting

● If the installation pauses at any one step for more than 5 minutes, something has gone wrong.
● First try to cancel the installation by using CTRL+C. Exit the installer and re-execute the .bin file.
● If you cannot exit using CTRL+C, close the terminal window, reconnect to the server, and relaunch the installer.

(27)

Using Cloudera Manager

● After the installation completes, point your browser to: http://<Public DNS>:7180
– EX: http://ec2-54-193-92-102.us-west-1.compute.amazonaws.com:7180

(31)

Using Cloudera Manager

● Enter the Public DNS for each of your instances.
● Click Search. Ensure that all instances are

(33)

Using Cloudera Manager

● Enter ubuntu as the user.
● Upload the .pem file that was downloaded

(40)

Using Cloudera Manager

● Use embedded databases.

(44)

Using Hadoop and MapReduce

(45)

Getting Test Data

● Download the following tar.gz file to your local machine:
– https://drive.google.com/file/d/0B9FMXVD4BtEdQWZsTEgyaUE5cTg/edit?usp=sharing
● Upload the file to your EC2 instance with the following command.
– scp -i ~/.ssh/<KEY FILE NAME>.pem <LOCAL PATH TO FILE>/shakespeare.tar.gz ubuntu@<PUBLIC DNS>:~
– EX: scp -i ~/.ssh/HadoopKey.pem /home/jake/Desktop/cs157b/shakes/output/shakespeare.tar.gz [email protected]:~
● Log into the same EC2 instance.
● Unzip the file with these commands:
– mkdir shakes

(46)

All of Shakespeare's Work

● You now have all of Shakespeare's written works.
● Typically Hadoop works better with larger files, but

(47)

Deploying Test Data into HDFS

● Run the following command to make a directory on HDFS.
– sudo -u hdfs hadoop fs -mkdir /user/ubuntu
● The next command changes the ownership of the newly created directory to our user (ubuntu).
– sudo -u hdfs hadoop fs -chown -R ubuntu /user/ubuntu
● Create an input directory.
– hadoop fs -mkdir /user/ubuntu/input
● Load our test text files into HDFS.
– hadoop fs -put ~/shakes/* /user/ubuntu/input
● Our files are now replicated and distributed across our

(48)

Word Count

● Let's count how many times each word is used.
– The data has been normalized to remove punctuation and case sensitivity.
● Download the WordCount.java file to your EC2 instance:
– cd ~
– wget cs.cmu.edu/~abeutel/WordCount.java
● Let's compile the code into a jar with these commands:
– mkdir wordcount_classes
– javac -classpath /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-core.jar:/opt/cloudera/parcels/CDH/lib/hadoop/hadoop-common.jar -d wordcount_classes WordCount.java
– jar -cvf ~/wordcount.jar -C wordcount_classes/ .
● Let's run it!
– hadoop jar ~/wordcount.jar WordCount /user/ubuntu/input

(49)

What Did We Just Do?

● We've just run our first MapReduce job!
● We have counted how many times each word appears.
● To check on the output, run the following command:
– hadoop fs -cat /user/ubuntu/output/part-00000
● On the left side we have the individual words.
● On the right is the number of times they appeared in all of

(50)

Let's Look at Code (Finally)

● You can download the WordCount.java file by going here:
– cs.cmu.edu/~abeutel/WordCount.java
● At a high level, we'll see a class called WordCount. It contains:
– 2 inner, static classes that define a single method each.
● Map
● Reduce
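
The overall shape described above can be sketched in plain Java without any Hadoop classes. This is only an illustration of the structure, not the contents of WordCount.java: the class name WordCountShape, the run method, and the TreeMap that stands in for Hadoop's grouping of values by key between the two phases are all assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Hypothetical stand-in for the WordCount class: two static inner
// classes with one method each. No Hadoop classes are used.
class WordCountShape {

    static class Map {
        // Adds every word on the line to `out`, once per occurrence.
        static void map(String line, List<String> out) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(word);
                }
            }
        }
    }

    static class Reduce {
        // Adds up the 1 values collected for a single word.
        static int reduce(List<Integer> ones) {
            int sum = 0;
            for (int one : ones) {
                sum += one;
            }
            return sum;
        }
    }

    // Runs map on each line, groups a 1 under each word, then reduces each group.
    static TreeMap<String, Integer> run(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            Map.map(line, words);
        }
        // Assumption: this in-memory grouping mimics what happens between the phases.
        TreeMap<String, List<Integer>> groups = new TreeMap<>();
        for (String word : words) {
            groups.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String word : groups.keySet()) {
            counts.put(word, Reduce.reduce(groups.get(word)));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "to thine own self be true");
        run(lines).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

Running main prints one word per line followed by a tab and its count, which is the same left/right layout as the part-00000 output file.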

(51)

The Map Class

● LongWritable key = byte offset of the line.
● Text value = a single line of text
● OutputCollector = A collection of key-value pairs that will be sent to a Reducer once all Mappers are finished.
– OutputCollector – Text = A single word
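
To make the parameter roles concrete, here is a minimal plain-Java sketch of what a word-count map step does with them. It does not use Hadoop: a long stands in for LongWritable, a String for Text, and a List of pairs for the OutputCollector. The class name MapSketch and the use of StringTokenizer are assumptions for illustration, and the value 1 paired with each word follows the Reduce slide's mention of "the 1 values".

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java stand-in for the Map step; no Hadoop classes are used.
class MapSketch {

    // byteOffset plays the role of the LongWritable key and line the Text value.
    // The returned list stands in for the OutputCollector.
    static List<Map.Entry<String, Integer>> map(long byteOffset, String line) {
        List<Map.Entry<String, Integer>> output = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            // Emit (word, 1) for every token; the byte offset is not needed for counting.
            output.add(new SimpleEntry<>(tokens.nextToken(), 1));
        }
        return output;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> pair : map(0L, "to be or not to be")) {
            System.out.println(pair.getKey() + "\t" + pair.getValue());
        }
    }
}
```

Note that the sketch emits "to" and "be" twice each rather than summing them; adding up the 1 values is left to the reduce step.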

(52)

Map Method I/O

● Input:
● Output:

(53)

The Reduce Class

● Text key = A single word
● Iterator<IntWritable> value = An iterator over all of the 1 values associated with the given key (word).
– OutputCollector – Text = The same word
– OutputCollector – IntWritable = The number of
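
The reduce step for word counting can be sketched the same way in plain Java, with no Hadoop classes. Here a String stands in for the Text key and an Iterator<Integer> for the Iterator<IntWritable>; the class name ReduceSketch and returning the sum directly (instead of collecting it) are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.Iterator;

// Plain-Java stand-in for the Reduce step; no Hadoop classes are used.
class ReduceSketch {

    // word plays the role of the Text key; values holds the 1s emitted for it.
    // Returns the total that would be collected alongside the word.
    static int reduce(String word, Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        Iterator<Integer> ones = Arrays.asList(1, 1, 1).iterator();
        System.out.println("be\t" + reduce("be", ones));
    }
}
```

Because the values arrive as an iterator, they can be consumed in a single pass without holding every value for a word in memory at once.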

(54)

Reduce Method I/O

Input:

(56)

Inverted Index

● Let's count how many times each word is used in total and how many times it's used per file!
● Download the InvertedIndex.java file to your local machine:
– https://drive.google.com/file/d/0B9FMXVD4BtEdQWszYTJMMTZFTXc/edit?usp=sharing
● Move the file to your EC2 instance.
– scp -i ~/.ssh/HadoopKey.pem <LOCAL PATH TO FILE>/InvertedIndex.java ubuntu@<PUBLIC DNS>:~
● Log into your EC2 instance.
● Let's compile the code into a jar with these commands:
– mkdir invertedindex_classes
– javac -classpath /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-core.jar:/opt/cloudera/parcels/CDH/lib/hadoop/hadoop-common.jar -d invertedindex_classes InvertedIndex.java
– jar -cvf ~/invertedindex.jar -C invertedindex_classes/ .
● Let's run it!
– hadoop fs -rm -r /user/ubuntu/output
– hadoop jar ~/invertedindex.jar InvertedIndex /user/ubuntu/input

(58)

What did we change?

● Only minor changes were needed to turn our WordCount program into the InvertedIndex program.
● You should already have the InvertedIndex.java file downloaded to your computer if you want to open it and inspect it for yourself.

(59)

The New Map Class

● LongWritable key = byte offset of the line.
● Text value = a single line of text
– OutputCollector – Text = A single word
– OutputCollector – Text = The file name containing this
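
A minimal plain-Java sketch of the changed map step follows; it emits (word, file name) pairs in place of (word, 1). No Hadoop classes are used. The class name IndexMapSketch and the sample file name hamlet.txt are hypothetical, and the file name is simply passed in as a parameter here because the slides do not show how InvertedIndex.java obtains it.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the inverted-index Map step; no Hadoop classes are used.
class IndexMapSketch {

    // byteOffset and line play the LongWritable key and Text value roles.
    // fileName is an assumed extra parameter naming the file the line came from.
    static List<Map.Entry<String, String>> map(long byteOffset, String line, String fileName) {
        List<Map.Entry<String, String>> output = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                // Emit (word, fileName) so the reducer can count per file.
                output.add(new SimpleEntry<>(word, fileName));
            }
        }
        return output;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> pair : map(0L, "to be or not to be", "hamlet.txt")) {
            System.out.println(pair.getKey() + "\t" + pair.getValue());
        }
    }
}
```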

(60)

New Map Method I/O

● Input:
● Output:

(61)

The New Reduce Class

● Text key = A single word
● Iterator<Text> value = An iterator over all of the filenames containing the given key (word).
● OutputCollector = A collection of key-value pairs that make up the job's output.
– OutputCollector – Text = The same word
– OutputCollector – Text = The number of occurrences of that
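
The changed reduce step can be sketched in plain Java as below, with no Hadoop classes. It walks the file names once, keeping a running total and a per-file tally. The class name IndexReduceSketch, the sample file names, and the "total=... file=count" output format are all assumptions for illustration; the format written by InvertedIndex.java may differ.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the inverted-index Reduce step; no Hadoop classes are used.
class IndexReduceSketch {

    // word plays the role of the Text key; fileNames has one entry per occurrence.
    // Returns a single string standing in for the Text value that would be collected.
    static String reduce(String word, Iterator<String> fileNames) {
        Map<String, Integer> perFile = new TreeMap<>(); // sorted for stable output
        int total = 0;
        while (fileNames.hasNext()) {
            perFile.merge(fileNames.next(), 1, Integer::sum);
            total++;
        }
        StringBuilder out = new StringBuilder("total=" + total);
        for (Map.Entry<String, Integer> entry : perFile.entrySet()) {
            out.append(' ').append(entry.getKey()).append('=').append(entry.getValue());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Iterator<String> files = Arrays.asList("hamlet.txt", "macbeth.txt", "hamlet.txt").iterator();
        System.out.println("be\t" + reduce("be", files));
    }
}
```

Packing the total and the per-file counts into one string matches the slide's Text output value type, since a single Text has to carry both pieces of information.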

(62)

New Reduce Method I/O

Input:

(64)

Retrieving files

● Now that we're done with MapReduce, let's get our files from HDFS to our local machines.
● Begin by being logged into your EC2 instance.
● Get the files out of HDFS
– hadoop fs -get /user/ubuntu/output/part* ~
● Now you have 2 new files in your home directory of the EC2 instance.
– Verify this by running: ls
● To download these to your local machine
– Log out of the EC2 instance.
● Enter: ~.
● Your terminal will be returned to controlling your local machine
– Run this command to download the output part files:
● scp -i ~/.ssh/<KeyFile>.pem ubuntu@<Public DNS>:~/part* ~/Desktop/
● You can now open the new files on your desktop in a text

(65)

Terminate Your Instances

● After you're done using Hadoop, you want to terminate your EC2 instances.
● If you don't, you will continue to be charged per hour (even if you aren't actively using them)!
● When you terminate your instances though, you will lose ALL data/customizations.
● Therefore always download any necessary files to your local machine before terminating your instances.
● From the AWS console, click Instances in the left menu.
● Mark the check box for all of your instances on the left side.
● Click on Actions, then choose Terminate.
● You will then see your instances shutting down.
● They will disappear after a few hours.
