Big Data
Management
Hadoop Installation
MapReduce Examples
Jake Karnes
These slides are based on materials / slides from • Cloudera.com
• Amazon.com
Prerequistes
●
You must have an Amazon Web Services
account before you begin.
– You can sign up for an account here:
http://aws.amazon.com and click the Sign Up button.
– You will have to provide a credit card number during
installation. You will likely incur some charges, but we can take steps to minimize these.
● Although Amazon has a Free Tier we will require
Prerequistes
●
This tutorial requires accessing a remote server
through SSH.
– UNIX based Operating Systems (Mac and Linux
Distros) have this functionality available from their terminals.
– Windows does not. You'll need to download and install
a SSH client.
● PuTTY should work for our purposes.
– I have not personally tested this.
●
A working understanding of Java is also
expected when we discuss code examples for
MapReduce.
Your First Server
●
First log into your
Amazon Web Console
.
●Go to EC2 (It should be in the upper left).
●You should see this screen.
Creating a Security Group
●
Click Security Groups in the left menu.
●Click the Create Security Group Button.
●Provide a Name and Description when
prompted.
●
In the bottom panel, go to the Inbound tab.
●Authorize all TCP communications.
●
Authorise SSH Access on port 22.
●Authorize ICMP (Echo Reply).
●
Click the button underneath the rule
Creating SSH keys
●
Click Key Pairs in the left menu.
●Click the Create Key Button
●
Provide a name for your key pair.
●
Your private key < keypair-name >.pem will be
downloaded automatically.
–
AWS does not store the private keys.
–
If you lose this file, you won't be able to SSH
into instances you provision with this key
pair.
Launch an EC2 Instance
●
Click Instances in the left menu.
●
Click the Launch Instance button
●
Choose Ubuntu 12.04 LTS 64 bit.
●
Go to the General Purpose tab and select m1.large.
●
In Step 3, choose to create 4 instances.
●
In Step 4, allocate 20GB to the root drive.
●
Continue past Step 5.
●
In Step 6 (Configure Security Group) choose the
group that you created earlier.
–
Ignore warnings about the security group.
●
Choose the Key Pair you created earlier.
Connect to your server
●
Click on Instance in the left menu.
●
Choose one of the instances you just created
and copy the public DNS.
–
Ex:
Connect to your server
●
Open a terminal on your local computer
●
Enter the following command to ensure your
private key isn't publicly viewable.
– chmod 400 ~/.ssh/<my-key-pair>.pem
●
Enter the following command to connect to your
Amazon instance.
– ssh -i ~/.ssh/<my-key-pair>.pem ubuntu@<Public
DNS>
– EX: ssh -i ~/.ssh/HadoopKey.pem
●
Accept the fingerprint.
●You are now connected!
Install Cloudera &
Hadoop
What's Cloudera Manager?
● Cloudera was the first,
and is currently, the leading provider and supporter of Apache Hadoop for Enterprise users.
● We will be using Cloudera
Manager.
● Cloudera Manager is
adminstrative tool for installing and maintaing Hadoop and many other tools in the Hadoop
Ecosystem.
● CDH is Cloudera's open
source distribution of Apache Hadoop.
Installing Cloudera Manager
●
After you've connected to your instance. Enter
the following command to download the
Cloudera Installer.
– wget
http://archive.cloudera.com/cm4/installer/latest/cloude ra-manager-installer.bin
●
Execute the installer with these commands:
– sudo su
– chmod +x cloudera-manager-installer.bin – ./cloudera-manager-installer.bin
●
Accept the licenses and wait for the installer to
Troubleshooting
● If the installation pauses at any one step for more than
5 minutes, something has gone wrong.
● First try to cancel the installation by using CTRL+C.
Exit the installater and reexecute the .bin file.
● If you cannot exit using CTRL+C, close the terminal
window, reconnect to the server, and relauch the installer.
Using Cloudera Manager
●
After point your browser to: http:\\<Public
DNS>:7180
– EX:
http://ec2-54-193-92-102.us-west-1.compute.amazonaws.com:7180
Using Cloudera Manager
Using Cloudera Manager
Using Cloudera Manager
Using Cloudera Manager
●
Enter the Public DNS for each of your instances.
●Click Search. Ensure that all instances are
Using Cloudera Manager
Using Cloudera Manager
●
Enter ubuntu as the user.
●
Upload the .pem file that was downloaded
Using Cloudera Manager
Using Cloudera Manager
Using Cloudera Manager
Using Cloudera Manager
Using Cloudera Manager
Using Cloudera Manager
Using Cloudera Manager
●
Use embedded databases.
Using Cloudera Manager
Using Cloudera Manager
Using Cloudera Manager
Using Hadoop and
MapReduce
Getting Test Data
● Download the following tar.gz file to your local
machine:
– https://drive.google.com/file/d/0B9FMXVD4BtEdQWZsTEgyaUE5cTg/edit?usp=sharing ● Upload the file to your EC2 instance with the following
command.
– scp -i ~/.ssh/<KEY FILE NAME>.pem <LOCAL PATH TO
FILE>/shakespere.tar.gz ubuntu@<PUBLIC DNS>:~
– EX: scp -i ~/.ssh/HadoopKey.pem
/home/jake/Desktop/cs157b/shakes/output/shakespeare.tar.gz [email protected]:~
● Log into the same EC2 instance.
● Unzip the file with these commands:
– mkdir shakes
All of Shakespeare's Work
● You now have all of Shakepeare's written works. ● Typically Hadoop works better with larger files, but
Deploying Test Data into HDFS
● Run the following command to make a directory on
HDFS.
– sudo -u hdfs hadoop fs -mkdir /user/ubuntu
● The next command changes the ownership of the
newly created directory to our user (ubuntu)
– sudo -u hdfs hadoop fs -chown -R ubuntu /user/ubuntu
● Create an input directory
– hadoop fs -mkdir /user/ubuntu/input
● Load our test text files into HDFS
– hadoop fs -put ~/shakes/* /user/ubuntu/input
● Our files are now replicated and distrubuted across our
Word Count
● Let's count how many times each word is used.
– The data has been normalized to remove punctuation and case
sensitivity.
● Download the WordCount.java file to your EC2
Instance:
– cd ~
– wget cs.cmu.edu/~abeutel/WordCount.java
● Let's compile the code into a jar with these commands:
– mkdir wordcount_classes
– javac -classpath
/opt/cloudera/parcels/CDH/lib/hadoop-0.20-
mapreduce/hadoop-core.jar:/opt/cloudera/parcels/CDH/lib/hadoop/hadoop-common.jar -d wordcount_classes WordCount.java
– jar -cvf ~/wordcount.jar -C wordcount_classes/ .
● Let's run it!
– hadoop jar ~/wordcount.jar WordCount /user/ubuntu/input
What Did We Just Do?
● We've just run our first
MapReduce job!
● We have counted how
many times each word appears.
● To check on the output,
run the following command:
– hadoop dfs -cat
/user/ubuntu/output/part-00000
● On the left side we have
the individual words.
● On the right is the
number of times they appeared in all of
Let's Look at Code (Finally)
●
You can download the WordCount.java file by
going here:
– cs.cmu.edu/~abeutel/WordCount.java
●
At a high level, we'll see a class called
WordCount. It contains:
– 2 inner, static classes that define a single method
each.
● Map
● Reduce
The Map Class
● LongWritable key = byte offset of the line. ● Text value = a single line of text
● OutputCollector = A collection of KV pairs that will be
sent to a Reducer once all Mappers are finished.
– OutputCollector – Text = A single word
Map Method I/O
● Input: ● ● ● ● ● ● Output:The Reduce Class
● Text key = A single word
● Iterator<IntWritable> value = An iterator over all of
the 1 values associated with the given key (word).
● OutputCollector = A collection of KV pairs that will be
sent to a Reducer once all Mappers are finished.
– OutputCollector – Text = A same word
– OutputCollector – IntWritable = The number of
Reduce Method I/O
Input:
Inverted Index
● Let's count how many times each word is used in total and
how many times it's used per file!
● Download the WordCount.java file to your local machine:
– https://drive.google.com/file/d/0B9FMXVD4BtEdQWszYTJMMTZFTXc/edit?usp=sharing
● Move the file to your EC2 instance.
– scp -i ~/.ssh/HadoopKey.pem <LOCAL PATH TO FILE>InvertedIndex.java
ubuntu@<PUBLIC DNS>:~
● Log into your EC2 instance.
● Let's compile the code into a jar with these commands:
– mkdir invertedindex_classes
– javac -classpath
/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-core.jar:/opt/cloudera/parcels/CDH/lib/hadoop/hadoop-common.jar -d invertedindex_classes InvertedIndex.java
– jar -cvf ~/invertedindex.jar -C invertedindex_classes/ .
● Let's run it!
– hadoop fs -rm -r /user/ubuntu/output
– hadoop jar ~/invertedindex.jar InvertedIndex /user/ubuntu/input
What did we change?
● Only minor changes were needed to enhance our
WordCount program into the InvertedIndex program.
● You should already have the InvertedIndex.java file
downloaded to your computer if you want to open it and inspect for yourself.
The New Map Class
● LongWritable key = byte offset of the line. ● Text value = a single line of text
● OutputCollector = A collection of KV pairs that will be
sent to a Reducer once all Mappers are finished.
– OutputCollector – Text = A single word
– OutputCollector – Text = The file name containing this
New Map Method I/O
● Input: ● ● ● ● ● ● Output:The New Reduce Class
● Text key = A single word
● Iterator<Text> value = An iterator over all of the filenames
containing the given key (word).
● OutputCollector = A collection of KV pairs that will be sent to a
Reducer once all Mappers are finished.
– OutputCollector – Text = A same word
– OutputCollector – Text = The number of occurrences of that
New Reduce Method I/O
Input:
Retrieving files
● Now that we're done with MapReduce, let's get out files
from HDFS to our local machines.
● Begin by being logged into your EC2 instance ● Get the files out of HDFS
– hadoop fs -get /user/ubuntu/output/part* ~
● Now you have 2 new files in your home directory of the
EC2 instance.
– Verify this by running: ls
● To download these to your local machine – Log out of the EC2 instance.
● Enter: ~.
● You terminal will be returned to controlling your local machine
– Run this command to download the output part files:
● scp -i ~/.ssh/<KeyFile>.pem ubuntu@<Public DNS>:~/part* ~/Desktop/
● You can now open the new files on your desktop in a text
Terminate Your Instances
● After you're done using Hadoop, you want to terminate
your EC2 instances.
● If you don't, you will continue to be charged per hour
(even if you aren't actively using them)!
● When you terminate your instances though, you will
lose ALL data/customizations.
● Therefore always download any necessary files to your
location machine before terminating your instances.
● From the AWS console, click Instances in the left menu. ● Mark the check box for all of your instances on the left
side.
● Click on Actions, then choose terminate.
● You will then see your instances shutting down. ● They will disappear after a few hours.