www.hpottech.com
Last Updated:- 3
rdNov 2012
All software will be in D:\\Software
Verify that VM player is installed else you need to install it.
Start your VM player and start the Hadoop VM Node master
Logon with the following credentials:
root / root123
Create directory hadoop :- mkdir /hadoop
The required steps for setting up a single-node
Hadoop
cluster using the
Hadoop Distributed File System
(HDFS)
on
RedHat Linux
.
Red Hat Linux
Hadoop
1.0.3, released May 2012
Prerequisites
Sun Java 6 – Verify java as below.
# java -version
java version "1.6.0_20"
www.hpottech.com
Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)
If Java is not there then installed Java : change directory to /hadoop and run the following
# sh /mnt/hgfs/Hadoopsw/jdk-6u17-linux-i586.bin
Configuring SSH
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use
Hadoop on it .
#su - root
# ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/root/.ssh/id_rsa):
Created directory '/home/root/.ssh'.
Your identification has been saved in /home/root/.ssh/id_rsa.
Your public key has been saved in /home/root/.ssh/id_rsa.pub.
The key fingerprint is:
www.hpottech.com
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 root@ubuntu
The key's randomart image is:
[...snipp...]
The second line will create an RSA key pair with an empty password.
Second, you have to enable SSH access to your local machine with this newly created key.
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
The final step is to test the SSH setup by connecting to your local machine with the
root
user. The step is also
needed to save your local machine’s host key fingerprint to the
root
user’s
known_hosts
file.
$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
www.hpottech.com
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
[...snipp...]
# ssh localhost
www.hpottech.com
Start the VM and follow the following steps
Hadoop
Installation
Extract the contents of the Hadoop package to a /hadoop.
$ cd /hadoop
$ tar xzf hadoop-1.0.0.tar.gz
$ mv hadoop-1.0.0 hadoop
$ chown -R hduser:hadoop hadoop (optional)
Update $HOME/.bashrc
Add the following lines to the end of the $HOME/.bashrc file of user root. If you use a shell other than bash, you should of course
update its appropriate configuration files instead of .bashrc.
# Set Hadoop-related environment variables
export HADOOP_HOME=/hadoop/hadoop
(Changes the directory app. to your installation.)
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/hadoop/jdk1.6.0_17
www.hpottech.com
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
Hadoop Distributed File System (HDFS)
Configuration
Our goal in this tutorial is a single-node setup of Hadoop
hadoop-env.sh
The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME.
Open /hadoop/hadoop/conf/hadoop-env.sh in the editor of your choice and set the JAVA_HOME environment variable to the Sun
JDK/JRE 6 directory.
Change
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
www.hpottech.com
to
# The java implementation to use. Required.
export JAVA_HOME=/hadoop/jdk1.6.0_17/
conf/*-site.xml
Now we create the directory and set the required ownerships and permissions:
www.hpottech.com
Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.
In file conf/core-site.xml:
<!-- In: conf/core-site.xml -->
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
<description>The name of the default file system </description>
</property>
www.hpottech.com
In file conf/mapred-site.xml:
<!-- In: conf/mapred-site.xml -->
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
<description>The host and port that the MapReduce job tracker runs at.</description>
</property>
www.hpottech.com
In file conf/hdfs-site.xml:
<!-- In: conf/hdfs-site.xml -->
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. </description>
</property>
www.hpottech.com
Formatting the HDFS filesystem via the NameNode
The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is implemented on top of the local
filesystem of your “cluster” .
Do not format a running Hadoop filesystem as you will lose all the data currently in the cluster (in HDFS).
To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command
hduser@ubuntu:~$ /hadoop/hadoop/bin/hadoop namenode -format
The output will look like this:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop namenode -format
10/05/08 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ubuntu/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo'
on Fri Feb 19 08:07:34 UTC 2010
www.hpottech.com
************************************************************/
10/05/08 16:59:56 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop
10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/05/08 16:59:57 INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been successfully formatted.
10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
hduser@ubuntu:/usr/local/hadoop$
Starting your single-node cluster
Run the command:
www.hpottech.com
This will startup a Namenode, Datanode, Jobtracker and a Tasktracker on your machine.
The output will look like this:
hduser@ubuntu:/usr/local/hadoop$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-ubuntu.out
hduser@ubuntu:/usr/local/hadoop$
A nifty tool for checking whether the expected Hadoop processes are running is jps (part of Sun’s Java since v1.5.0).
hduser@ubuntu:$ jps
(if jps doesn’t found, go to /Software/JDK/bin and execute ./jps)
2287 TaskTracker
2149 JobTracker
1938 DataNode
2085 SecondaryNameNode
2349 Jps
www.hpottech.com
1788 NameNode
You can also check with netstat if Hadoop is listening on the configured ports.
hduser@ubuntu:~$ netstat -plten | grep java
tcp 0 0 0.0.0.0:50070 0.0.0.0:* LISTEN 1001 9236 2471/java
tcp 0 0 0.0.0.0:50010 0.0.0.0:* LISTEN 1001 9998 2628/java
tcp 0 0 0.0.0.0:48159 0.0.0.0:* LISTEN 1001 8496 2628/java
tcp 0 0 0.0.0.0:53121 0.0.0.0:* LISTEN 1001 9228 2857/java
tcp 0 0 127.0.0.1:54310 0.0.0.0:* LISTEN 1001 8143 2471/java
tcp 0 0 127.0.0.1:54311 0.0.0.0:* LISTEN 1001 9230 2857/java
tcp 0 0 0.0.0.0:59305 0.0.0.0:* LISTEN 1001 8141 2471/java
tcp 0 0 0.0.0.0:50060 0.0.0.0:* LISTEN 1001 9857 3005/java
tcp 0 0 0.0.0.0:49900 0.0.0.0:* LISTEN 1001 9037 2785/java
tcp 0 0 0.0.0.0:50030 0.0.0.0:* LISTEN 1001 9773 2857/java
hduser@ubuntu:~$
If there are any errors, examine the log files in the /logs/ directory.
Stopping your single-node cluster
Run the command
www.hpottech.com
to stop all the daemons running on your machine.
Exemplary output:
hduser@ubuntu:/usr/local/hadoop$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
Lab objectives
In this lab you will practice HDFS command line interface.
Lab instructions
This lab has been developed as a tutorial. Simply execute the commands provided, and analyze the results.
Basic Hadoop Filesystem commands
1. In order to work with HDFS you need to use the hadoop fs command. For example to list the / and /app directories you need to input the following commands:
> >
hadoop fs -ls / hadoop fs -ls /app
> hadoop fs -mkdir test
Now let's see the directory we've created: >
>
hadoop fs -ls /
hadoop fs -ls /user/root
You will notice that the test directory got created under the /user/root directory. This is because as the root user, your default path is /user/root and thus if you don't specify an absolute path all HDFS commands work out of /user/root (this will be your default working directory).
3. You should be aware that you can pipe (using the | character) any HDFS command to be used with the Linux shell. For example, you can easily use grep with HDFS by doing the following: >
>
hadoop fs -mkdir /user/root/test2 hadoop fs -ls /user/root | grep test
As you can see the grep command only returned the lines which had test in them (thus removing the "Found x items" line and oozie-root directory from the listing.
Copy pg20417.txt from software folder to /hadoop >
>
hadoop fs -put /hadoop/pg20417.txt pg20417.txt hadoop fs -ls /user/root
You should now see a new file called /user/root/README listed. In order to view the contents of this file we will use the -cat command as follows:
> hadoop fs -cat pg20417.txt
You should see the output of the README file (that is stored in HDFS). We can also use the linux diff command to see if the file we put on HDFS is actually the same as the original on the local filesystem. You can do this as follows:
> diff <( hadoop fs -cat pg20417.txt) /hadoop/pg20417.txt
Since the diff command produces no output we know that the files are the same (the diff command prints all the lines in the files that differ).
Some more Hadoop Filesystem commands
1. In order to use HDFS commands recursively generally you add an "r" to the HDFS command (In the Linux shell this is generally done with the "-R" argument) For example, to do a recursive listing we'll use the -lsr command rather than just -ls. Try this:
> hadoop fs -ls /user
2. In order to find the size of files you need to use the -du or -dus commands. Keep in mind that these commands return the file size in bytes. To find the size of the pg20417.txt file use the following command:
> hadoop fs -du pg20417.txt
To find the size of all files individually in the /user/root directory use the following command: > hadoop fs -du /user/root
To find the size of all files in total of the /user/root directory use the following command: > hadoop fs -dus /user/root
3. If you would like to get more information about a given command, invoke -help as follows: > hadoop fs -help
For example, to get help on the dus command you'd do the following: > hadoop fs -help dus
--- This is the end of this lab ---
System -> Network Config -> DNS –
Hostname - Activate
/etc/sysconfig/network
www.hpottech.com
bin/hadoop dfsadmin –report
www.hpottech.com
bin/hadoop dfsadmin -metasave hadoop.txt
This will save some of NameNode’s metadata into its log directory under filename.
In this metadata, you’ll find lists of blocks waiting for replication, blocks being replicated, and blocks awaiting
deletion. For replication each block will also have a list of DataNodes being replicated to. Finally, the metasave file
will also have summary statistics on each DataNode.
Go to log folder:
# cd /hadoop/hadoop/logs
# ls
www.hpottech.com
hadoop dfsadmin -safemode get
hadoop dfsadmin -safemode enter
www.hpottech.com
hadoop fsck /
Hpot-Tech.com
Installation
To run the fair scheduler in your Hadoop installation, you need to put it on the CLASSPATH.
copy the hadoop-fairscheduler-2.0.0-mr1-cdh4.1.0.jar from
/hadoop/hadoop-2.0.0-mr1-cdh4.1.0/contrib/fairscheduler to HADOOP_HOME/lib. Using the
following command.
cp hadoop-fairscheduler-2.0.0-mr1-cdh4.1.0.jar $HADOOP_HOME/lib
Edit HADOOP_CONF_DIR/mapred-site.xml to have Hadoop use the fair scheduler:
<property>
<name>mapred.jobtracker.taskScheduler</name>
<value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
Once you restart the cluster, you can check that the fair scheduler is running by going to
http://<jobtracker URL>/scheduler on the JobTracker's web UI.
Hpot-Tech.com
http://192.168.1.5:50030/scheduler
Run the map reduce job from two telnet window and observer the link display.
Create two different folders for input and output.
hPotTech | hadoop
1)
Verifying File System Health
a.
bin/hadoop fsck /
hPotTech | hadoop
Hadoop Web Interfaces
Hadoop comes with several web interfaces which are by default (see conf/core-site.xml) available at these
locations:
http://192.168.80.133:50030/
– web UI for MapReduce job tracker(s)
http://192.168.80.133:50060/
– web UI for task tracker(s)
http://
92.168.80.133:50070/
– web UI for HDFS name node(s)
These web interfaces provide concise information about what’s happening in your Hadoop cluster. You might want to give
them a try.
MapReduce Job Tracker Web Interface
The job tracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed
jobs and a job history log file. It also gives access to the ”local machine’s” Hadoop log files (the machine on which the web
UI is running on).
By default, it’s available at
http://localhost:50030/
A screenshot of Hadoop's Job Tracker web interface.
hPotTech
MapReduce Job Tracker Web Interface
provides information about general job statistics of the Hadoop cluster, running/completed/failed
jobs and a job history log file. It also gives access to the ”local machine’s” Hadoop log files (the machine on which the web
http://localhost:50030/
.
A screenshot of Hadoop's Job Tracker web interface.
hPotTech | hadoop
provides information about general job statistics of the Hadoop cluster, running/completed/failed
jobs and a job history log file. It also gives access to the ”local machine’s” Hadoop log files (the machine on which the web
Task Tracker Web Interface
The task tracker web UI shows you running and non
log files.
By default, it’s available at
http://localhost:50060/
A screenshot of Hadoop's Task Tracker web interface.
hPotTech
The task tracker web UI shows you running and non-running tasks. It also gives access to the ”local machine’s” Hadoop
http://localhost:50060/
.
A screenshot of Hadoop's Task Tracker web interface.
hPotTech | hadoop
hPotTech | hadoop
HDFS Name Node Web Interface
The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead
nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It
also gives access to the ”local machine’s” Hadoop log files.
A screenshot of Hadoop's Name Node web interface.
hPotTech
A screenshot of Hadoop's Name Node web interface.
www.hpottech.com
Start the Hadoop VM Master
Base Directory:- /hadoop
Java Installation
1.
Java 1.6.x (from Sun) is installed on /usr/jre16.
www.hpottech.com
Pig Installation
www.hpottech.com
#Set the PIG_HOME environment variable (vi $HOME/.bashrc)
www.hpottech.com
#export PIG_HOME=/hadoop/pig-0.10.0
www.hpottech.com
Start the hadoop cluster.
#bash
www.hpottech.com
$ cd /hadoop/pig-0.10.0/
www.hpottech.com
$ bin/pig -x local
Enter the following command in the Grunt shell;
log = LOAD '/hadoop/pig-0.10.0/tutorial/data/excite-small.log' AS (user, timestamp, query);
grpd = GROUP log BY user;
cntd = FOREACH grpd GENERATE group, COUNT(log);
STORE cntd INTO ‘output’;
www.hpottech.com
# quit
file:///hadoop/pig-0.10.0/tutorial/data/output
Hadoop | HPot-Tech
1
Install: Hbase
-1)
Get the hbase software from /software.
2)
Untar the software in /hadoop
$ tar xfz hbase-0.92.1.tar.gz –C /hadoop $ cd /hadoop/hbase-0.92.1
3)
edit
conf/hbase-site.xmland set the directory for HBase to write to
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>hbase.rootdir</name> <value>file:///hadoop/hbase</value> </property> </configuration>
Hadoop | HPot-Tech
2
Edit
conf/hbase-env.sh, uncommenting the
JAVA_HOMEline pointing it to your java install.
Hadoop | HPot-Tech
3
Start hadoop
Start HBase
$ ./bin/start-hbase.sh
starting Master, logging to logs/hbase-user-master-example.org.out
Shell Exercises
Connect to your running HBase via the shell.
$ ./bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands. Type "exit<RETURN>" to leave the HBase Shell
Version: 0.90.0, r1001068, Fri Sep 24 13:55:42 PDT 2010
Hadoop | HPot-Tech
4
Create a table named
testwith a single column family named
cf. Verify its creation by listing all tables and then insert
some values.
hbase(main):003:0> create 'test', 'cf' 0 row(s) in 1.2200 seconds
hbase(main):003:0> list 'test' ..
1 row(s) in 0.0550 seconds
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value1' 0 row(s) in 0.0560 seconds
hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2' 0 row(s) in 0.0370 seconds
hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3' 0 row(s) in 0.0450 seconds
Hadoop | HPot-Tech
5
Above we inserted 3 values, one at a time. The first insert is at
row1, column
cf:awith a value of
value1. Columns in
HBase are comprised of a column family prefix --
cfin this example -- followed by a colon and then a column qualifier
Hadoop | HPot-Tech
6
Verify the data insert.
Run a scan of the table by doing the following
hbase(main):007:0> scan 'test' ROW COLUMN+CELL
row1 column=cf:a, timestamp=1288380727188, value=value1 row2 column=cf:b, timestamp=1288380738440, value=value2 row3 column=cf:c, timestamp=1288380747365, value=value3 3 row(s) in 0.0590 seconds
Hadoop | HPot-Tech
7
Get a single row as follows
hbase(main):008:0> get 'test', 'row1' COLUMN CELL
cf:a timestamp=1288380727188, value=value1 1 row(s) in 0.0400 seconds
Hadoop | HPot-Tech
8
Now, disable and drop your table. This will clean up all done above.
hbase(main):012:0> disable 'test' 0 row(s) in 1.0930 seconds
hbase(main):013:0> drop 'test' 0 row(s) in 0.0770 seconds
Exit the shell by typing exit.
hbase(main):014:0> exit
Stopping HBase
Stop your hbase instance by running the stop script.
$ ./bin/stop-hbase.sh
www.hpottech.com
Page 1
Installing Hive
www.hpottech.com
Page 2
Set the environment variable HIVE_HOME to point to the installation directory at $HOME/.bashrc:
$ export HIVE_HOME=/hadoop/hive-0.9.0
Finally, add $HIVE_HOME/bin to your PATH:
www.hpottech.com
Page 3
Running Hive
Hive uses hadoop that means:
• you must have hadoop in your path OR
• export HADOOP_HOME=<hadoop-install-dir>
In addition, you must create /tmp and /user/hive/warehouse
(aka hive.metastore.warehouse.dir) and set them chmod g+w in HDFS before a table can be created in Hive. Start hadoop service.
Commands to perform this setup
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse $ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
www.hpottech.com
Page 4
Errors log:
/tmp/<user.name>/hive.log Type hive #bash #hiveCreates a table called pokes with two columns, the first being an integer and the other a string
www.hpottech.com
Page 5
hive> SHOW TABLES;hive> DESCRIBE invites;
hive> LOAD DATA LOCAL INPATH
'/hadoop/hive-0.9.0/examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
www.hpottech.com
Page 6
hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';www.hpottech.com
Page 7
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';www.hpottech.com
Page 8
#hadoop fs -cat /tmp/hdfs_out/000000_0
Dropping tables:
hive> DROP TABLE invites;
Hpot-Tech
Start Hadoop cluster – start-all.sh
Start hive.
copy extend1502.log to /hadoop
Create table from the Hive console.
hive > CREATE TABLE weblog (mdate STRING , mtime STRING, ssitename STRING, scomputername STRING, sip STRING, csmethod STRING,
csuristem STRING, suriquery STRING, sport STRING, csusername STRING, cip STRING, csversion STRING, csUserAgent STRING, csCookie STRING,
csReferer STRING, cshost STRING, scstatus STRING, scsubstatus STRING, scwin32status STRING, scbytes STRING, csbytes STRING, timetaken
STRING)
Hpot-Tech
hive> SET hive.enforce.bucketing = true;
hPot-Tech
hPot-Tech
package com.hp.types; // == JobBuilder import java.io.IOException; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.util.GenericOptionsParser; import org.apache.hadoop.util.Tool;publicclass JobBuilder {
privatefinal Class<?> driverClass; privatefinal Job job;
privatefinalintextraArgCount; privatefinal String extrArgsUsage;
private String[] extraArgs;
public JobBuilder(Class<?> driverClass) throws IOException { this(driverClass, 0, "");
}
public JobBuilder(Class<?> driverClass, int extraArgCount, String extrArgsUsage) throws IOException { this.driverClass = driverClass;
this.extraArgCount = extraArgCount; this.job = new Job();
this.job.setJarByClass(driverClass); this.extrArgsUsage = extrArgsUsage; }
// vv JobBuilder
publicstatic Job parseInputAndOutput(Tool tool, Configuration conf, String[] args) throws IOException {
if (args.length != 2) {
hPot-Tech
returnnull; }
Job job = new Job(conf);
job.setJarByClass(tool.getClass());
FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); return job;
}
publicstaticvoid printUsage(Tool tool, String extraArgsUsage) { System.err.printf("Usage: %s [genericOptions] %s\n\n",
tool.getClass().getSimpleName(), extraArgsUsage);
GenericOptionsParser.printGenericCommandUsage(System.err); }
// ^^ JobBuilder
public JobBuilder withCommandLineArgs(String... args) throws IOException { Configuration conf = job.getConfiguration();
GenericOptionsParser parser = new GenericOptionsParser(conf, args); String[] otherArgs = parser.getRemainingArgs();
if (otherArgs.length < 2 && otherArgs.length > 3 + extraArgCount) {
System.err.printf("Usage: %s [genericOptions] [-overwrite] <input path> <output path> %s\n\n", driverClass.getSimpleName(), extrArgsUsage);
GenericOptionsParser.printGenericCommandUsage(System.err); System.exit(-1);
}
int index = 0;
boolean overwrite = false;
if (otherArgs[index].equals("-overwrite")) { overwrite = true;
index++; }
Path input = new Path(otherArgs[index++]); Path output = new Path(otherArgs[index++]);
if (index < otherArgs.length) {
extraArgs = new String[otherArgs.length - index];
System.arraycopy(otherArgs, index, extraArgs, 0, otherArgs.length - index); }
hPot-Tech
if (overwrite) {
output.getFileSystem(conf).delete(output, true); }
FileInputFormat.addInputPath(job, input); FileOutputFormat.setOutputPath(job, output); returnthis;
}
public Job build() { returnjob; }
public String[] getExtraArgs() { returnextraArgs;
} }
hPot-Tech
package com.hp.types;// cc SmallFilesToSequenceFileConverter A MapReduce program for packaging a collection of small files as a single SequenceFile import java.io.IOException; import org.apache.hadoop.conf.Configured; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.BytesWritable; import org.apache.hadoop.io.NullWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.InputSplit; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.lib.input.FileSplit; import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat; import org.apache.hadoop.util.Tool; import org.apache.hadoop.util.ToolRunner; //vv SmallFilesToSequenceFileConverter
publicclass SmallFilesToSequenceFileConverter extends Configured implements Tool {
staticclass SequenceFileMapper
extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {
private Text filenameKey;
@Override
protectedvoid setup(Context context) throws IOException, InterruptedException {
InputSplit split = context.getInputSplit(); Path path = ((FileSplit) split).getPath(); filenameKey = new Text(path.toString()); }
@Override
protectedvoid map(NullWritable key, BytesWritable value, Context context) throws IOException, InterruptedException {
hPot-Tech
} }
@Override
publicint run(String[] args) throws Exception {
Job job = JobBuilder.parseInputAndOutput(this, getConf(), args); if (job == null) { return -1; } job.setInputFormatClass(WholeFileInputFormat.class); job.setOutputFormatClass(SequenceFileOutputFormat.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(BytesWritable.class); job.setMapperClass(SequenceFileMapper.class);
return job.waitForCompletion(true) ? 0 : 1; }
publicstaticvoid main(String[] args) throws Exception { args = new String[2];
args[0]="input";
args[1]="output"+System.currentTimeMillis();
int exitCode = ToolRunner.run(new SmallFilesToSequenceFileConverter(), args); System.exit(exitCode);
} }
hPot-Tech
package com.hp.types;// cc WholeFileInputFormat An InputFormat for reading a whole file as a record import java.io.IOException; import org.apache.hadoop.fs.*; import org.apache.hadoop.io.*; import org.apache.hadoop.mapreduce.InputSplit; import org.apache.hadoop.mapreduce.JobContext; import org.apache.hadoop.mapreduce.RecordReader; import org.apache.hadoop.mapreduce.TaskAttemptContext; import org.apache.hadoop.mapreduce.lib.input.*; //vv WholeFileInputFormat
publicclass WholeFileInputFormat
extends FileInputFormat<NullWritable, BytesWritable> {
@Override
protectedboolean isSplitable(JobContext context, Path file) { returnfalse;
}
@Override
public RecordReader<NullWritable, BytesWritable> createRecordReader( InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
WholeFileRecordReader reader = new WholeFileRecordReader(); reader.initialize(split, context);
return reader; }
}
hPot-Tech
package com.hp.types;// cc WholeFileRecordReader The RecordReader used by WholeFileInputFormat for reading a whole file as a record import java.io.IOException; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.FSDataInputStream; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.BytesWritable; import org.apache.hadoop.io.IOUtils; import org.apache.hadoop.io.NullWritable; import org.apache.hadoop.mapreduce.InputSplit; import org.apache.hadoop.mapreduce.RecordReader; import org.apache.hadoop.mapreduce.TaskAttemptContext; import org.apache.hadoop.mapreduce.lib.input.FileSplit; //vv WholeFileRecordReader
class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
private FileSplit fileSplit; private Configuration conf;
private BytesWritable value = new BytesWritable(); privatebooleanprocessed = false;
@Override
publicvoid initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
this.fileSplit = (FileSplit) split;
this.conf = context.getConfiguration(); }
@Override
publicboolean nextKeyValue() throws IOException, InterruptedException { if (!processed) {
byte[] contents = newbyte[(int) fileSplit.getLength()]; Path file = fileSplit.getPath();
FileSystem fs = file.getFileSystem(conf); FSDataInputStream in = null;
try {
hPot-Tech
IOUtils.readFully(in, contents, 0, contents.length); value.set(contents, 0, contents.length);
} finally { IOUtils.closeStream(in); } processed = true; returntrue; } returnfalse; } @Override
public NullWritable getCurrentKey() throws IOException, InterruptedException { return NullWritable.get();
}
@Override
public BytesWritable getCurrentValue() throws IOException, InterruptedException {
returnvalue; }
@Override
publicfloat getProgress() throws IOException { returnprocessed ? 1.0f : 0.0f;
}
@Override
publicvoid close() throws IOException { // do nothing
} }
hPot-Tech
Copy the following files:
hPot-Tech
Submit the jobs to cluster
Un common the path initialization as follow:
/* args = new String[2]; args[0]="input";
args[1]="output"+System.currentTimeMillis();*/
Export the jar and run the following command
#hadoop fs -mkdir smallfiles/
#hadoop fs -copyFromLocal /hadoop/data/a smallfiles/
#hadoop fs -copyFromLocal /hadoop/data/b smallfiles/
#hadoop fs -copyFromLocal /hadoop/data/c smallfiles/
#hadoop fs -copyFromLocal /hadoop/data/d smallfiles/
#hadoop fs -copyFromLocal /hadoop/data/e smallfiles/
#hadoop fs -copyFromLocal /hadoop/data/f smallfiles/
hPot-Tech
ww.hpottech.com
From two single-node clusters to a multi-node cluster – We will build a multi-node cluster using two Red hat boxes in
this tutorial. The best way to do this for starters is to install, configure and test a “local” Hadoop setup for each of the two
RH boxes, and in a second step to “merge” these two single-node clusters into one multi-node cluster in which one RH
box will become the designated master (but also act as a slave with regard to data storage and processing), and the other
box will become only a slave. It’s much easier to track down any problems you might encounter due to the reduced
complexity of doing a single-node cluster setup first on each machine.
ww.hpottech.com
Prerequisites
Configuring single-node clusters first in both the VM
Use the earlier tutorial.
Now that you have two single-node clusters up and running, we will modify the Hadoop configuration to make one RH
box the ”master” (which will also act as a slave) and the other RH box a ”slave”.
We will call the designated master machine just the
master
from now on and the slave-only machine the
slave
. We will also give the two machines these respective hostnames in their networking setup, most notably
in
/etc/hosts
. If the hostnames of your machines are different (e.g.
node01
) then you must adapt the
settings in this tutorial as appropriate.
ww.hpottech.com
ww.hpottech.com
ww.hpottech.com
Start both the VM
Change the New VM host name:
System -> Administration -> Network -> DNS – hadoopslave
ww.hpottech.com
Verify the IP as belows:
ww.hpottech.com
Networking
Both machines must be able to reach each other over the network.
Update /etc/hosts on both machines with the following lines:
# vi /etc/hosts (for master AND slave)
10.72.47.42 master
ww.hpottech.com
SSH access
The root user on the master must be able to connect
a) to its own user account on the master – i.e. ssh master in this context and not necessarily ssh
localhost – and
b) to the root user account on the slave via a password-less SSH login.
You have to add the root@master‘s public SSH key (which should be in$HOME/.ssh/id_rsa.pub) to
the authorized_keys file of root@slave (in this user’s$HOME/.ssh/authorized_keys).
You can do this manually or use the
following SSH command
:
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub root@slave
This command will prompt you for the login password for user root on slave, then copy the public SSH key for you,
creating the correct directory and fixing the permissions as necessary.
ww.hpottech.com
The final step is to test the SSH setup by connecting with user root from the master to the user account root on
the slave. The step is also needed to save slave‘s host key fingerprint to the root@master‘sknown_hosts file.
So, connecting from master to master…
$ ssh master
ww.hpottech.com
And from master to slave.
ww.hpottech.com
Hadoop
Cluster Overview
ww.hpottech.com
The master node will run the “master” daemons for each layer: NameNode for the HDFS storage layer, and JobTracker
for the MapReduce processing layer. Both machines will run the “slave” daemons: DataNode for the HDFS layer, and
TaskTracker for MapReduce processing layer. Basically, the “master” daemons are responsible for coordination and
management of the “slave” daemons while the latter will do the actual data storage and data processing work.
Masters vs. Slaves
Typically one machine in the cluster is designated as the NameNode and another machine the as JobTracker, exclusively.
These are the actual “master nodes”. The rest of the machines in the cluster act as both DataNode and TaskTracker.
These are the slaves or “worker nodes”.
Configuration
conf/masters (
master
only)
On master, update /conf/masters that it looks like this:
ww.hpottech.com
conf/slaves (
master
only)
On master, update conf/slaves that it looks like this:
master
slave
ww.hpottech.com
conf/*-site.xml (all machines)
Note: As of Hadoop 0.20.x and 1.x, the configuration settings previously found in
hadoop-site.xmlwere moved
to conf/core-site.xml (fs.default.name), conf/mapred-site.xml(mapred.job.tracker)
and conf/hdfs-site.xml (dfs.replication).
Assuming you configured each machine as described in the
single-node cluster tutorial
, you will only have to change a
few variables.
Important: You have to change the configuration files
conf/core-site.xml
,
conf/mapred-site.xml
and
conf/hdfs-site.xml
on ALL machines as follows.
First, we have to change the
fs.default.name
variable (in conf/core-site.xml) which specifies
the
NameNode
(the HDFS master) host and port. In our case, this is the master machine.
ww.hpottech.com
<!-- In: conf/core-site.xml -->
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
<description>The name of the default file system </description>
</property>
ww.hpottech.com
Second, we have to change the
mapred.job.tracker
variable (in conf/mapred-site.xml) which specifies
the
JobTracker
(MapReduce master) host and port. Again, this is the master in our case.
<!-- In: conf/mapred-site.xml -->
<property>
<name>mapred.job.tracker</name>
<value>master:9001</value>
<description>The host and port that the MapReduce job tracker runs at </description>
</property>
ww.hpottech.com
<!-- In: conf/hdfs-site.xml -->
<property>
<name>dfs.name.dir</name>
<value>/hadoop/hdfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.data.dir</name>
<value>/hadoop/hdfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication. </description>
</property>
ww.hpottech.com
Create the necessary folder structure in both the node
#mkdir –p /hadoop/hdfs/name
#mkdir –p /hadoop/hdfs/data
Formatting the HDFS filesystem via the NameNode
ww.hpottech.com
Background: The HDFS name table is stored on the NameNode’s (here: master) local filesystem in the directory
specified by dfs.name.dir. The name table is used by the NameNode to store tracking and coordination information
for the DataNodes.
Starting the multi-node cluster
HDFS daemons
Run the command /bin/start-dfs.sh on the machine you want the (primary) NameNode to run on. This will
bring up HDFS with the NameNode running on the machine you ran the previous command on, and DataNodes on the
machines listed in the conf/slaves file.
In our case, we will run bin/start-dfs.sh on master:
ww.hpottech.com
On slave, you can examine the success or failure of this command by inspecting the log file logs/. Exemplary
ww.hpottech.com
As you can see in slave‘s output above, it will automatically format it’s storage directory (specified
bydfs.data.dir) if it is not formatted already. It will also create the directory if it does not exist yet.
At this point, the following Java processes should run on master…
ww.hpottech.com
…and the following on slave.
ww.hpottech.com
MapReduce daemons
Run the command /bin/start-mapred.sh on the machine you want the JobTracker to run on. This will bring up
the MapReduce cluster with the JobTracker running on the machine you ran the previous command on, and TaskTrackers
on the machines listed in the conf/slaves file.
In our case, we will run bin/start-mapred.sh on master:
ww.hpottech.com
On slave, you can examine the success or failure of this command by inspecting the log file
ww.hpottech.com
At this point, the following Java processes should run on master…
$ jps
ww.hpottech.com
…and the following on slave.
ww.hpottech.com
Stopping the multi-node cluster
Like starting the cluster, stopping it is done in two steps. The workflow is the opposite of starting, however. First, we begin
with stopping the MapReduce daemons: the JobTracker is stopped on master, and TaskTracker daemons are stopped
on all slaves (here: master and slave). Second, the HDFS daemons are stopped: the NameNode daemon is
stopped on master, and DataNode daemons are stopped on all slaves (here: master and slave).
MapReduce daemons
Run the command /bin/stop-mapred.sh on the JobTracker machine. This will shut down the MapReduce
cluster by stopping the JobTracker daemon running on the machine you ran the previous command on, and TaskTrackers
on the machines listed in the conf/slaves file.
In our case, we will run bin/stop-mapred.sh on master:
ww.hpottech.com
(Note: The output above might suggest that the JobTracker was running and stopped on slave, but you can be assured
that the JobTracker ran on master.)
At this point, the following Java processes should run on master…
ww.hpottech.com
…and the following on slave.
$ jps
ww.hpottech.com
HDFS daemons
Run the command /bin/stop-dfs.sh on the NameNode machine. This will shut down HDFS by stopping the
NameNode daemon running on the machine you ran the previous command on, and DataNodes on the machines listed in
the conf/slaves file.
In our case, we will run bin/stop-dfs.sh on master:
ww.hpottech.com
At this point, the only following Java processes should run on master…
$ jps
…and the following on slave.
ww.hpottech.com
Running a MapReduce job
Just follow the steps described in the section
Running a MapReduce job
of the
single-node cluster tutorial
.
Here’s the exemplary output on master…
Copy the data before running the following.
ww.hpottech.com
…and on slave for its datanode…
ww.hpottech.com
…and on slave for its tasktracker.
# from logs/ hadoop-root-tasktracker-hadoopslave.log on slave
Hpot-Tech
Hpot-Tech
package com.hp.join; // == JobBuilder import java.io.IOException; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.util.GenericOptionsParser; import org.apache.hadoop.util.Tool;publicclass JobBuilder {
privatefinal Class<?> driverClass; privatefinal Job job;
privatefinalintextraArgCount; privatefinal String extrArgsUsage;
private String[] extraArgs;
public JobBuilder(Class<?> driverClass) throws IOException { this(driverClass, 0, "");
}
public JobBuilder(Class<?> driverClass, int extraArgCount, String extrArgsUsage) throws IOException { this.driverClass = driverClass;
this.extraArgCount = extraArgCount; this.job = new Job();
this.job.setJarByClass(driverClass); this.extrArgsUsage = extrArgsUsage; }
// vv JobBuilder
publicstatic Job parseInputAndOutput(Tool tool, Configuration conf, String[] args) throws IOException {
if (args.length != 2) {
printUsage(tool, "<input> <output>"); returnnull;
Hpot-Tech
}
Job job = new Job(conf);
job.setJarByClass(tool.getClass());
FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); return job;
}
publicstaticvoid printUsage(Tool tool, String extraArgsUsage) { System.err.printf("Usage: %s [genericOptions] %s\n\n",
tool.getClass().getSimpleName(), extraArgsUsage);
GenericOptionsParser.printGenericCommandUsage(System.err); }
// ^^ JobBuilder
public JobBuilder withCommandLineArgs(String... args) throws IOException { Configuration conf = job.getConfiguration();
GenericOptionsParser parser = new GenericOptionsParser(conf, args); String[] otherArgs = parser.getRemainingArgs();
if (otherArgs.length < 2 && otherArgs.length > 3 + extraArgCount) {
System.err.printf("Usage: %s [genericOptions] [-overwrite] <input path> <output path> %s\n\n", driverClass.getSimpleName(), extrArgsUsage);
GenericOptionsParser.printGenericCommandUsage(System.err); System.exit(-1);
}
int index = 0;
boolean overwrite = false;
if (otherArgs[index].equals("-overwrite")) { overwrite = true;
index++; }
Path input = new Path(otherArgs[index++]); Path output = new Path(otherArgs[index++]);
if (index < otherArgs.length) {
extraArgs = new String[otherArgs.length - index];
System.arraycopy(otherArgs, index, extraArgs, 0, otherArgs.length - index); }
Hpot-Tech
output.getFileSystem(conf).delete(output, true); }
FileInputFormat.addInputPath(job, input); FileOutputFormat.setOutputPath(job, output); returnthis;
}
public Job build() { returnjob; }
public String[] getExtraArgs() { returnextraArgs;
} }
Hpot-Tech
package com.hp.join; import java.io.IOException; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapred.MapReduceBase; import org.apache.hadoop.mapred.Mapper; import org.apache.hadoop.mapred.OutputCollector; import org.apache.hadoop.mapred.Reporter;publicclass JoinRecordMapper extends MapReduceBase implements Mapper<LongWritable, Text, TextPair, Text> { private NcdcRecordParser parser = new NcdcRecordParser();
publicvoid map(LongWritable key, Text value,
OutputCollector<TextPair, Text> output, Reporter reporter) throws IOException {
parser.parse(value);
output.collect(new TextPair(parser.getStationId(), "1"), value); }
Hpot-Tech
package com.hp.join; import org.apache.hadoop.conf.Configured; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapred.*; import org.apache.hadoop.mapred.lib.MultipleInputs; import org.apache.hadoop.util.*; @SuppressWarnings("deprecation")publicclass JoinRecordWithStationName extends Configured implements Tool {
publicstaticclass KeyPartitioner implements Partitioner<TextPair, Text> { @Override
publicvoid configure(JobConf job) {}
@Override
publicint getPartition(TextPair key, Text value, int numPartitions) {
return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions; }
}
@Override
publicint run(String[] args) throws Exception { if (args.length != 3) {
JobBuilder.printUsage(this, "<ncdc input> <station input> <output>"); return -1;
}
JobConf conf = new JobConf(getConf(), getClass()); conf.setJobName("Join record with station name");
Path ncdcInputPath = new Path(args[0]); Path stationInputPath = new Path(args[1]); Path outputPath = new Path(args[2]);
MultipleInputs.addInputPath(conf, ncdcInputPath, TextInputFormat.class, JoinRecordMapper.class); MultipleInputs.addInputPath(conf, stationInputPath, TextInputFormat.class, JoinStationMapper.class);
Hpot-Tech
FileOutputFormat.setOutputPath(conf, outputPath); conf.setPartitionerClass(KeyPartitioner.class); conf.setOutputValueGroupingComparator(TextPair.FirstComparator.class); conf.setMapOutputKeyClass(TextPair.class); conf.setReducerClass(JoinReducer.class); conf.setOutputKeyClass(Text.class); JobClient.runJob(conf); return 0; }publicstaticvoid main(String[] args) throws Exception { args = new String[3];
args[0] = "inputncdc"; args[1] = "inputstation";
args[2] = "output"+System.currentTimeMillis();
int exitCode = ToolRunner.run(new JoinRecordWithStationName(), args); System.exit(exitCode);
} }
Hpot-Tech
package com.hp.join; import java.io.IOException; import java.util.Iterator; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapred.*;publicclass JoinReducer extends MapReduceBase implements
Reducer<TextPair, Text, Text, Text> {
publicvoid reduce(TextPair key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
Text stationName = new Text(values.next()); while (values.hasNext()) {
Text record = values.next();
Text outValue = new Text(stationName.toString() + "\t" + record.toString()); output.collect(key.getFirst(), outValue);
} } }
Hpot-Tech
package com.hp.join;import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
publicclass JoinStationMapper extends MapReduceBase implements Mapper<LongWritable, Text, TextPair, Text> {
private NcdcStationMetadataParser parser = new NcdcStationMetadataParser();
publicvoid map(LongWritable key, Text value,
OutputCollector<TextPair, Text> output, Reporter reporter) throws IOException {
if (parser.parse(value)) {
output.collect(new TextPair(parser.getStationId(), "0"), new Text(parser.getStationName()));
} } }
Hpot-Tech
package com.hp.join;import java.math.*;
import org.apache.hadoop.io.Text;
publicclass MetOfficeRecordParser {
private String year;
private String airTemperatureString; privateintairTemperature;
privatebooleanairTemperatureValid;
publicvoid parse(String record) { if (record.length() < 18) { return; } year = record.substring(3, 7); if (isValidRecord(year)) { airTemperatureString = record.substring(13, 18); if (!airTemperatureString.trim().equals("---")) {
BigDecimal temp = new BigDecimal(airTemperatureString.trim()); temp = temp.multiply(new BigDecimal(BigInteger.TEN));
airTemperature = temp.intValueExact(); airTemperatureValid = true; } } }
privateboolean isValidRecord(String year) { try { Integer.parseInt(year); returntrue; } catch (NumberFormatException e) { returnfalse; } }
publicvoid parse(Text record) { parse(record.toString()); }
Hpot-Tech
public String getYear() { returnyear;
}
publicint getAirTemperature() { returnairTemperature;
}
public String getAirTemperatureString() { returnairTemperatureString;
}
publicboolean isValidTemperature() { returnairTemperatureValid;
}
Hpot-Tech
package com.hp.join;import java.text.*;
import java.util.Date;
import org.apache.hadoop.io.Text;
publicclass NcdcRecordParser {
privatestaticfinalintMISSING_TEMPERATURE = 9999;
privatestaticfinal DateFormat DATE_FORMAT = new SimpleDateFormat("yyyyMMddHHmm");
private String stationId;
private String observationDateString; private String year;
private String airTemperatureString; privateintairTemperature;
privatebooleanairTemperatureMalformed; private String quality;
publicvoid parse(String record) {
stationId = record.substring(4, 10) + "-" + record.substring(10, 15); observationDateString = record.substring(15, 27);
year = record.substring(15, 19); airTemperatureMalformed = false;
// Remove leading plus sign as parseInt doesn't like them
if (record.charAt(87) == '+') {
airTemperatureString = record.substring(88, 92);
airTemperature = Integer.parseInt(airTemperatureString); } elseif (record.charAt(87) == '-') {
airTemperatureString = record.substring(87, 92);
airTemperature = Integer.parseInt(airTemperatureString); } else {
airTemperatureMalformed = true; }
airTemperature = Integer.parseInt(airTemperatureString); quality = record.substring(92, 93);
}