MapReduce v1 (MR1) Setup and Run


Unpack the MR1 tarball hadoop-<PHD_MR1_VERSION>.tar.gz and set the HADOOP_HOME environment variable by appending the following to ~/.bashrc:

~/.bashrc

# Add HADOOP_HOME to the path

export HADOOP_HOME=/path/to/hadoop-mr1
export PATH=$HADOOP_HOME/bin:$PATH
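After appending these lines, reload your shell configuration and confirm the variable is set; a minimal check, assuming a bash shell:

bash

$ source ~/.bashrc
$ echo $HADOOP_HOME    # should print /path/to/hadoop-mr1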

When you start the HDFS service, you need to specify the full path of hadoop-daemon.sh, for example $HADOOP_HDFS_HOME/sbin/hadoop-daemon.sh.

Edit the files under the hadoop-mr1/tar/hadoop-<PHD_MR1_VERSION>/conf/ directory and set up HDFS according to the Prerequisite and HDFS Setup sections.

Modify the file $HADOOP_HOME/conf/mapred-site.xml

conf/mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/mapred/system</value>
    <final>true</final>
  </property>
  <property>
    <name>mapreduce.jobtracker.staging.root.dir</name>
    <value>/user</value>
  </property>
</configuration>

Ensure you have already started the HDFS services by checking that you can access the HDFS dashboard at http://localhost:50070/ (if you are on the name node machine). If you are using a browser to open that dashboard from another machine, replace localhost in the URL with the full host name of your name node machine.
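You can also check from the command line that the HDFS daemons are up; a minimal sketch, assuming jps is available and the HDFS client is configured as in the HDFS Setup section:

bash

$ jps | grep -E 'NameNode|DataNode'             # both process names should be listed
$ $HADOOP_HDFS_HOME/bin/hdfs dfsadmin -report   # should report at least one live datanode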

If you cannot access the dashboard, refer to the HDFS Setup section to start the services. Create the basic directories on HDFS:

bash

$ $HADOOP_HDFS_HOME/bin/hdfs dfs -mkdir /tmp
$ $HADOOP_HDFS_HOME/bin/hdfs dfs -chmod -R 1777 /tmp
$ $HADOOP_HDFS_HOME/bin/hdfs dfs -mkdir -p /user/hadoop
$ $HADOOP_HDFS_HOME/bin/hdfs dfs -mkdir -p /mapred/system

Start job tracker and task tracker services:

bash

$ $HADOOP_HOME/bin/hadoop-daemon.sh start jobtracker
$ $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker

Access the map/reduce administration page at http://localhost:50030. If you are using a browser to open that dashboard from another machine, replace localhost in the URL with the full host name of your job tracker machine.
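A quick command-line check that both daemons are running, assuming jps is on the PATH:

bash

$ jps | grep -E 'JobTracker|TaskTracker'   # both process names should appear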

Now run an example Map/Reduce job to check that MR1 is working:

bash

$ cd $HADOOP_HOME

$ bin/hadoop jar hadoop-examples-*.jar pi 2 10000
<you should see job succeeded>

$ exit

If the job finishes successfully, you have set up PHD MR1 successfully; otherwise, check your configuration files and ensure that the HDFS services, job tracker, and task tracker have all started successfully.
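If a daemon did not start, its log file is the first place to look. The commands below are a minimal sketch; daemons started with hadoop-daemon.sh write to $HADOOP_HOME/logs by default, and the exact file names depend on your user and host names:

bash

$ ls $HADOOP_HOME/logs/
$ tail -n 50 $HADOOP_HOME/logs/hadoop-*-jobtracker-*.log   # look for ERROR or FATAL lines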

4.3.3 Zookeeper

Unpack the Zookeeper tarball zookeeper-<PHD_ZOOKEEPER_VERSION>.tar.gz and add the ZK_HOME environment variable by appending the following to ~/.bashrc:

~/.bashrc

# Add ZK_HOME to the path

export ZK_HOME=/path/to/zookeeper
export PATH=$PATH:$ZK_HOME/bin

Make sure the ~/.bashrc file takes effect:

$ source ~/.bashrc

Go to the folder $ZK_HOME/conf and create zoo.cfg from the sample configuration:

$ cd $ZK_HOME/conf

$ cp zoo_sample.cfg zoo.cfg
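For a single-node test the two settings that matter most are the client port and the data directory; a quick way to inspect them (the values in the comment are the usual zoo_sample.cfg defaults):

bash

$ grep -E '^(clientPort|dataDir)' zoo.cfg
# typically: dataDir=/tmp/zookeeper and clientPort=2181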

Since you are running Zookeeper on a single node, there is no need to change the configuration file. Start the Zookeeper server:

$ cd $ZK_HOME

$ bin/zkServer.sh start
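You can quickly confirm that the server is up before moving on; a minimal sketch, assuming nc (netcat) is installed and Zookeeper is listening on the default port 2181:

bash

$ bin/zkServer.sh status          # should report the server mode
$ echo ruok | nc localhost 2181   # should print: imok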

Confirm that Zookeeper is running properly by running the following test:

$ cd $ZK_HOME
$ bin/zkCli.sh

> create /zk_test my_data
> get /zk_test

> quit

To stop the Zookeeper server service:

$ cd $ZK_HOME

$ bin/zkServer.sh stop

4.3.4 HBase

The following is an example of installing an instance of HBase running in pseudo-distributed mode. There is also an option to install a standalone or fully distributed HBase; refer to the Apache HBase documentation for details.

Unpack the HBase tar file hbase-<PHD_HBASE_VERSION>.tar.gz (the extracted folder is referred to as $HBASE_HOME) and edit the file $HBASE_HOME/conf/hbase-site.xml to add the following properties:

$HBASE_HOME/conf/hbase-site.xml

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:8020/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>mode: fully distributed, not to manage zookeeper</description>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
</configuration>
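The hbase.rootdir value must point at the same NameNode address that HDFS was configured with earlier; a quick way to double-check, assuming your HDFS configuration uses the fs.defaultFS property:

bash

$ $HADOOP_HDFS_HOME/bin/hdfs getconf -confKey fs.defaultFS
# should print something like hdfs://localhost:8020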

Edit $HBASE_HOME/conf/hbase-env.sh so that HBase does not manage its own Zookeeper instance:

$HBASE_HOME/conf/hbase-env.sh

HBASE_MANAGES_ZK=false

HBase ships Hadoop jars in its $HBASE_HOME/lib directory. If the <PHD_HADOOP_VERSION> version of the Hadoop jar libraries is already in that directory, you can omit this step. Otherwise, you need to:

Delete the $HBASE_HOME/lib/hadoop-*.jar files.

Copy the $HADOOP_HOME/*/hadoop-*.jar files to $HBASE_HOME/lib/ (see the sketch below).
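A minimal sketch of those two sub-steps; the exact locations of the Hadoop jars under $HADOOP_HOME depend on your PHD layout, so the glob below is an assumption:

bash

$ rm $HBASE_HOME/lib/hadoop-*.jar                   # remove the bundled Hadoop jars
$ cp $HADOOP_HOME/*/hadoop-*.jar $HBASE_HOME/lib/   # copy the PHD Hadoop jars in their place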

Start HBase. Before starting it, make sure the Zookeeper server is running:

$ $HBASE_HOME/bin/start-hbase.sh

You can check the status of HBase at the following location: http://localhost:60010. If you are using a browser to open that dashboard from another machine, replace localhost in the URL with the full host name of your HBase master machine.

Confirm that HBase is installed and running properly by conducting the following test:

$ cd $HBASE_HOME
$ bin/hbase shell

hbase(main):003:0> create 'test', 'cf'
hbase(main):003:0> list 'test'
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value1'
hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
hbase(main):007:0> scan 'test'
hbase(main):008:0> get 'test', 'row1'
hbase(main):012:0> disable 'test'
hbase(main):013:0> drop 'test'
hbase(main):014:0> exit

To stop HBase:

$ $HBASE_HOME/bin/stop-hbase.sh

4.3.5 Hive

Unpack the Hive tarball hive-<PHD_HIVE_VERSION>.tar.gz and append the following environment variables to ~/.bashrc:

~/.bashrc

export HIVE_HOME=/path/to/hive
export PATH=$HIVE_HOME/bin:$PATH
export CLASSPATH=$HIVE_HOME/lib:$CLASSPATH

Make sure the ~/.bashrc file takes effect:

$ source ~/.bashrc

Create /user/hive/warehouse (that is, hive.metastore.warehouse.dir) and give it group write access in HDFS:

$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

Test

$ cd $HIVE_HOME
$ bin/hive

hive> CREATE TABLE pokes (foo INT, bar STRING);

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
hive> SELECT a.* FROM pokes a where foo=400;

hive> DROP TABLE pokes;
hive> quit;


4.3.6 HCatalog

HCatalog is contained in the same tarball as Hive. After you have extracted the tarball hive-<PHD_HIVE_VERSION>.tar.gz, append the following environment variables to ~/.bashrc:

~/.bashrc

export HCAT_HOME=$HIVE_HOME/hcatalog
export HCAT_PREFIX=$HCAT_HOME
export HIVE_CONF_DIR=$HIVE_HOME/conf
export HADOOP_LIBEXEC_DIR=$HADOOP_HOME/libexec

Make sure the ~/.bashrc file takes effect:

bash

$ source ~/.bashrc

Now you can run some HCatalog commands to verify your setup. You should see output similar to the following; some trivial output is omitted for clarity:

bash

$ cd $HCAT_HOME

$ bin/hcat -e "create table pokes (foo int, bar string)"
OK
Time taken: 9.625 seconds
$ bin/hcat -e "show tables"
OK
pokes
Time taken: 7.783 seconds
$ bin/hcat -e "describe pokes"
OK
foo int None
bar string None
Time taken: 7.301 seconds
$ bin/hcat -e "alter table pokes add columns (new_col int)"
OK
Time taken: 7.003 seconds
$ bin/hcat -e "describe pokes"
OK
foo int None
bar string None
new_col int None
Time taken: 7.014 seconds
$ bin/hcat -e "drop table pokes"
OK
Time taken: 9.78 seconds
$ exit


After you have installed HCatalog, manually copy the configuration file:

$ cp $HCAT_HOME/etc/webhcat/webhcat-default.xml $HIVE_CONF_DIR/webhcat-site.xml

Then edit the file you just copied:

webhcat-site.xml

<property>
  <name>templeton.exec.envs</name>
  <value>...,HIVE_CONF_DIR,HADOOP_LIBEXEC_DIR</value>
  <description>The environment variables passed through to exec.</description>
</property>

Note that the "..." in the snippet above stands for the original value of the property; you need to append the two additional variable names, HIVE_CONF_DIR and HADOOP_LIBEXEC_DIR, to the existing value.
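A quick way to inspect the property before and after editing; a minimal sketch, assuming the property keeps its usual multi-line XML layout:

bash

$ grep -A 3 'templeton.exec.envs' $HIVE_CONF_DIR/webhcat-site.xml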

Start the WebHCat service:

$ cd $HCAT_HOME

$ sbin/webhcat_server.sh start

Note that starting the WebHCat service writes files under the current directory, so ensure that the current user has permission to write there.

Now you can run a test:

$ curl http://localhost:50111/templeton/v1/ddl/database/?user.name=hadoop

To stop the WebHCat service:

bash

$ cd $HCAT_HOME

$ sbin/webhcat_server.sh stop

4.3.7 Pig

Unpack the Pig tarball pig-<PHD_PIG_VERSION>.tar.gz and append the following environment variables to ~/.bashrc:

export PIG_HOME=/path/to/pig
export PATH=$PIG_HOME/bin:$PATH

Make sure the ~/.bashrc file takes effect:

$ source ~/.bashrc

Test

[hadoop@localhost ~]$ hadoop fs -put /etc/passwd passwd
[hadoop@localhost ~]$ pig

grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;
(root)
(bin)
(daemon)
...
(flume)
(sqoop)
(oozie)
grunt> quit;

Part of the output from the above commands is omitted. After the dump B command, a Map/Reduce job should start, and the users defined in your /etc/passwd file should be listed in the output.

4.3.8 Mahout

Unpack the Mahout tarball mahout-distribution-<PHD_MAHOUT_VERSION>.tar.gz and append the following environment variables to ~/.bashrc:

~/.bashrc

export MAHOUT_HOME=/path/to/mahout
export PATH=$MAHOUT_HOME/bin:$PATH
export CLASSPATH=$MAHOUT_HOME/lib:$CLASSPATH

Make sure the ~/.bashrc file takes effect:

$ source ~/.bashrc

Test (make sure the HDFS and Map/Reduce services are running):

$ wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
$ $HADOOP_HDFS_HOME/bin/hdfs dfs -mkdir -p /user/hadoop/testdata
$ $HADOOP_HDFS_HOME/bin/hdfs dfs -put synthetic_control.data testdata
$ $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
$ $HADOOP_HDFS_HOME/bin/hdfs dfs -ls -R output


4.3.9 Sqoop

Install and Deploy MySQL

Unpack the Sqoop tarball sqoop-<PHD_SQOOP_VERSION>.bin__hadoop-<PHD_HADOOP_VERSION>.tar.gz and append the following environment variables to ~/.bashrc:

~/.bashrc

export SQOOP_HOME=/path/to/sqoop
export PATH=$SQOOP_HOME/bin:$PATH
export CLASSPATH=$SQOOP_HOME/lib:$CLASSPATH

Make sure the ~/.bashrc file takes effect:

$ source ~/.bashrc

Move the file mysql-connector-java.jar to the directory /usr/share/java/ and create a symbolic link pointing to it in Sqoop's lib folder:
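If the connector jar is not already under /usr/share/java/, move it there first; the source path below is only an illustration of where a downloaded connector might live:

bash

$ sudo mv /path/to/mysql-connector-java-*.jar /usr/share/java/mysql-connector-java.jar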

$ ln -sf /usr/share/java/mysql-connector-java.jar $SQOOP_HOME/lib/mysql-connector-java.jar

Create the user hadoop in MySQL and grant all privileges to that user:

$ mysql -u root [-p]

mysql> insert into mysql.user(Host,User,Password) values("%","hadoop",password("hadoop"));
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'%' identified by 'hadoop';

mysql> flush privileges;

Start the MySQL service:

$ service mysqld start
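You can now confirm that the hadoop user can connect; the password 'hadoop' comes from the GRANT statement above:

bash

$ mysql -u hadoop -phadoop -e "SELECT 1;"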

Now run a test. First, create a table student in MySQL:

$ mysql

mysql> use test

mysql> CREATE TABLE student (id INT PRIMARY KEY, name VARCHAR(100));
mysql> insert into student (id, name) values (1, "Elena");
mysql> insert into student (id, name) values (2, "Stephan");
mysql> insert into student (id, name) values (3, "Damon");
mysql> exit


Create a user home folder in HDFS; since you are using the user hadoop, create the directory /user/hadoop in HDFS. Then, as the user hadoop, execute:

[hadoop@localhost]$ sqoop import --connect jdbc:mysql://localhost/test --table student --username hadoop --target-dir hdfs://localhost/tmp/sqoop_output

If you installed MySQL on another machine, replace localhost in the JDBC URL with the real MySQL server host name.

You should see a Map/Reduce job start and import the data from the MySQL table into HDFS.
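To confirm the import, you can print the files Sqoop wrote to the target directory; Sqoop's map-only output files are normally named part-m-*:

bash

$ $HADOOP_HDFS_HOME/bin/hdfs dfs -cat /tmp/sqoop_output/part-m-*
# should print the three student rows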

4.3.10 Flume

Unpack the Flume tarball apache-flume-<PHD_FLUME_VERSION>-bin.tar.gz and append the following environment variable to ~/.bashrc:

~/.bashrc

export FLUME_HOME=/path/to/flume

Make sure the ~/.bashrc file takes effect:

$ source ~/.bashrc

Create a Flume configuration file under $FLUME_HOME (assume you name it example.conf); you can copy it from $FLUME_HOME/conf/flume-conf.properties.template and edit it to match the following example:

example.conf

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Run the Flume agent with the example configuration to verify that Flume is working properly.

$ cd $FLUME_HOME

$ bin/flume-ng agent --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
(Note: in the above command, "a1" refers to the agent name set in the file example.conf.)
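With the agent running, you can send it a test event from a second terminal; the netcat source configured above listens on localhost:44444, and the logger sink should print the event in the agent's console. A minimal sketch, assuming telnet is installed:

bash

$ telnet localhost 44444
Hello Flume!
# the agent console should log an event containing "Hello Flume!"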

