4 Pivotal HD MR1 1.1 Stack Binary Package


4.1 Overview

Pivotal HD MapReduce V1 (MR1) 1.1 is a full Apache Hadoop distribution with Pivotal add-ons and a native integration with the Pivotal Greenplum database.

The binary distribution of PHD MR1 1.1 contains the following:

HDFS 2.0.5-alpha
MapReduce 1.0.3
Pig 0.10.1
Zookeeper 3.4.5
HBase 0.94.8
Hive 0.11.0
HCatalog 0.11.0
Mahout 0.7
Flume 1.3.1
Sqoop 1.4.2

4.2 Accessing PHD MR1 1.1

You can download the MR1 package PHDMR1-1.1.x.0-bin-xx.tar.gz from the EMC Download Center and expand the package in your working directory:

$ tar zxvf PHDMR1-1.1.0.0-bin-xx.tar.gz
$ ls -l PHDMR1-1.1.0.0-bin-xx

total 44

drwxr-xr-x 3 hadoop hadoop 4096 Jun 26 04:38 flume
drwxr-xr-x 3 hadoop hadoop 4096 Jun 26 04:38 hadoop
drwxr-xr-x 3 hadoop hadoop 4096 Jun 26 04:38 hadoop-mr1
drwxr-xr-x 3 hadoop hadoop 4096 Jun 26 04:38 hbase
drwxr-xr-x 3 hadoop hadoop 4096 Jun 26 04:38 hive
drwxr-xr-x 3 hadoop hadoop 4096 Jun 26 04:38 hcatalog
drwxr-xr-x 3 hadoop hadoop 4096 Jun 26 04:38 mahout
drwxr-xr-x 3 hadoop hadoop 4096 Jun 26 04:38 pig
-rw-rw-r-- 1 hadoop hadoop  406 Jun 26 04:38 README
drwxr-xr-x 3 hadoop hadoop 4096 Jun 26 04:38 sqoop
drwxr-xr-x 3 hadoop hadoop 4096 Jun 26 04:38 utility
drwxr-xr-x 3 hadoop hadoop 4096 Jun 26 04:38 zookeeper

The following table lists the PHD version of each component and the placeholder string that stands in for that version in the instructions below.

Component      PHD Version                  Replaced String
Hadoop         2.0.5_alpha_gphd_2_1_0_0     <PHD_HADOOP_VERSION>
MR1            mr1-1.0.3_gphd_2_1_0_0       <PHD_MR1_VERSION>
HBase          0.94.8_gphd_2_1_0_0          <PHD_HBASE_VERSION>
Hive           0.11.0_gphd_2_1_0_0          <PHD_HIVE_VERSION>
Pig            0.10.1_gphd_2_1_0_0          <PHD_PIG_VERSION>
Mahout         0.7_gphd_2_1_0_0             <PHD_MAHOUT_VERSION>
HCatalog       0.10.1_gphd_2_1_0_0          <PHD_HCATALOG_VERSION>
Sqoop          1.4.2_gphd_2_1_0_0           <PHD_SQOOP_VERSION>
Flume          1.3.1_gphd_2_1_0_0           <PHD_FLUME_VERSION>
Zookeeper      3.4.5_gphd_2_1_0_0           <PHD_ZOOKEEPER_VERSION>
Oozie          3.3.2_gphd_2_1_0_0           <PHD_OOZIE_VERSION>
bigtop-jsvc    1.0.15_gphd_2_1_0_0          <PHD_BIGTOP_JSVC_VERSION>
bigtop-utils   0.4.0_gphd_2_1_0_0           <PHD_BIGTOP_UTILS_VERSION>

All component packages should come from the same distribution package (PHDMR1).
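As an example of how the placeholder strings are used, the Hadoop tarball referenced later as hadoop-<PHD_HADOOP_VERSION>.tar.gz expands to a concrete file name taken from the table above (shown here as an assumed example; verify the exact name and location inside your extracted PHDMR1 directory):

# <PHD_HADOOP_VERSION> is 2.0.5_alpha_gphd_2_1_0_0 in this release, so
# hadoop-<PHD_HADOOP_VERSION>.tar.gz refers to:
hadoop-2.0.5_alpha_gphd_2_1_0_0.tar.gz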

4.3 Installation

This section provides instructions for installing and running the Pivotal HD MR1 1.1.0 components from the downloaded binary tarball files.

The installation instructions provided here are intended only as a quick-start guide that starts the services on a single host. For other installation configurations, refer to the Apache Hadoop documentation: http://hadoop.apache.org/docs/r2.0.5-alpha/

PHDMR1 and the YARN-based PHD stack should not be installed on the same cluster.

All packages used during this process should come from the same distribution tarball; do not mix packages from different tarballs.

4.3.1 Prerequisites

Follow the instructions below to install the Hadoop components (cluster install):

If not already created, add a new user hadoop and switch to that user (run useradd and passwd as root). All packages should be installed as the hadoop user.

$ useradd hadoop
$ passwd hadoop
$ su - hadoop

Make sure Oracle Java Runtime Environment (JRE) 1.7 is installed on the system, and set the JAVA_HOME environment variable to point to the directory where the JRE is installed by appending the following snippet to the file ~/.bashrc:

~/.bashrc

export JAVA_HOME=/usr/java/default

Make sure the changes to ~/.bashrc take effect:

$ source ~/.bashrc
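To confirm that JAVA_HOME is set and the JRE is usable, you can run a quick sanity check (not part of the original steps; the output depends on your Java installation):

$ echo $JAVA_HOME
$ $JAVA_HOME/bin/java -version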

SSH (both client and server) is required. Set up password-less SSH login using the following commands.

Password-less SSH login must be set up from the HDFS name node to each HDFS data node, and from the YARN resource manager to each YARN node manager.

Because we are setting up a single-node cluster, the same machine acts as the HDFS name node, the YARN resource manager, the only HDFS data node, and the only YARN node manager, so the setup is simpler.

# Assume you are already logged in to the single node as user hadoop
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# Set the permissions on the authorized_keys file
$ chmod 0600 ~/.ssh/authorized_keys
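You can verify that password-less login works on the single node; the following should print the host name without asking for a password (you may be prompted once to accept the host key on the first connection):

$ ssh localhost hostname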

On a real (distributed) cluster, use the following commands to set up password-less SSH login. They need to be executed twice: once on the HDFS name node and once on the YARN resource manager node, unless the HDFS name node and YARN resource manager run on the same machine. (For your reference only; not needed for this single-node installation.)


# First log in to the master host (YARN resource manager or HDFS name node).
# Replace master@host-master with the real user name and host name of your master host.

$ ssh master@host-master

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# Copy authorized_keys to each slave host (YARN node manager or HDFS data node) in the cluster using scp

# Replace slave@host-slave with the real user name and host name of your slave host, and do it for each of your slave hosts.

# NOTE: if an authorized_keys file already exists for the user, rename your file to authorized_keys2

$ scp ~/.ssh/authorized_keys slave@host-slave:~/.ssh/

# Set the permissions on the file on each slave host

# Replace slave@host-slave with the real user name and host name of your slave host, and do it for each of your slave hosts.

$ ssh slave@host-slave

$ chmod 0600 ~/.ssh/authorized_keys
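To confirm the distributed setup, the following should complete from the master host without a password prompt for each slave host (for your reference only, like the steps above; replace slave@host-slave as before):

$ ssh slave@host-slave hostname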

4.3.2 Hadoop

Unpack the Hadoop tarball file:

$ tar zxf hadoop-<PHD_HADOOP_VERSION>.tar.gz

Edit the file ~/.bashrc to set the environment variables HADOOP_HOME and HADOOP_HDFS_HOME to the directory where the tarball is extracted, and add the hadoop binaries to the search path:

~/.bashrc

# export HADOOP_HOME, HADOOP_HDFS_HOME
export HADOOP_HOME=/path/to/hadoop
export HADOOP_HDFS_HOME=/path/to/hadoop
export PATH=$HADOOP_HOME/bin:$PATH

Make sure the changes to ~/.bashrc take effect:

$ source ~/.bashrc
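To confirm the environment is in effect, the hadoop command should now resolve from the search path (a quick check; the reported version string depends on your build):

$ echo $HADOOP_HOME
$ hadoop version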

In the sections below, unless explicitly specified otherwise, all shell commands are run as the hadoop user with this environment in effect.


HDFS setup

Modify the file $HADOOP_HOME/etc/hadoop/core-site.xml and add the following to the configuration section:

$HADOOP_HOME/etc/hadoop/core-site.xml

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:8020/</value>
</property>

Modify the file $HADOOP_HOME/etc/hadoop/hdfs-site.xml and add the following to the configuration section:

$HADOOP_HOME/etc/hadoop/hdfs-site.xml

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

Format the HDFS name node directory using default configurations:

$ $HADOOP_HDFS_HOME/bin/hdfs namenode -format

The default location for storing the name node data is:

/tmp/hadoop-hadoop/dfs/name/
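Because this default location is under /tmp, the metadata may not survive a reboot; if you need it to persist, set dfs.namenode.name.dir in hdfs-site.xml to a durable directory. To confirm that the format step succeeded, you can list the directory (a simple check; the exact file names vary by version):

$ ls /tmp/hadoop-hadoop/dfs/name/current/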

Start name node service:

$ $HADOOP_HDFS_HOME/sbin/hadoop-daemon.sh start namenode

Start each data node service:

$ $HADOOP_HDFS_HOME/sbin/hadoop-daemon.sh start datanode

After the name node and data node services are started, you can access the HDFS dashboard at http://localhost:50070/ if you are on the name node machine. If you open the dashboard in a browser from another machine, replace localhost in the URL with the full host name of your name node machine.
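If no browser is available on the machine, you can still confirm that the dashboard port answers from the shell (a minimal check only; it does not validate cluster health):

$ curl -sI http://localhost:50070/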

Test HDFS with a few dfs commands:

$ $HADOOP_HDFS_HOME/bin/hdfs dfs -ls /
$ $HADOOP_HDFS_HOME/bin/hdfs dfs -mkdir -p /user/hadoop

# you can see a full list of hdfs dfs command options
$ $HADOOP_HDFS_HOME/bin/hdfs dfs

# put a local file to hdfs
$ $HADOOP_HDFS_HOME/bin/hdfs dfs -copyFromLocal /etc/passwd /user/hadoop/
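To verify the copy, list the target directory and read the file back (the passwd file name comes from the copyFromLocal example above):

$ $HADOOP_HDFS_HOME/bin/hdfs dfs -ls /user/hadoop/
$ $HADOOP_HDFS_HOME/bin/hdfs dfs -cat /user/hadoop/passwd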

To stop data node service:

$ $HADOOP_HDFS_HOME/sbin/hadoop-daemon.sh stop datanode

To stop name node service:

$ $HADOOP_HDFS_HOME/sbin/hadoop-daemon.sh stop namenode

The HDFS data node and name node services must be running for the examples below to work.
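A quick way to confirm that both services are running is the jps tool (jps ships with a full JDK, not the bare JRE; if only a JRE is installed, use ps -ef | grep instead). Expect to see NameNode and DataNode processes listed:

$ jps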
