Hadoop Implementation on Windows

Hadoop:

Hadoop is a software framework written in Java for distributed storage and distributed processing of large data sets.

Hadoop is mainly divided into two parts:

1. Hadoop Distributed File System (HDFS), the storage part.

2. MapReduce, the processing part.

Hadoop splits files into large blocks and distributes them among the nodes of a cluster. To process the data, MapReduce transfers packaged code to the nodes, which then process their blocks in parallel.
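The block-splitting idea can be illustrated with a small Python sketch. This is only a toy model: the function names, the node names, and the round-robin placement are assumptions made for illustration, and real HDFS placement is decided by the NameNode and also involves replication.

```python
from itertools import cycle


def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut the data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def assign_blocks(blocks: list, nodes: list) -> dict:
    """Assign block indexes to nodes in round-robin order (toy placement only)."""
    placement = {node: [] for node in nodes}
    for index, node in zip(range(len(blocks)), cycle(nodes)):
        placement[node].append(index)
    return placement


# Hypothetical 10-byte "file", 4-byte blocks, and two made-up node names.
data = b"x" * 10
blocks = split_into_blocks(data, 4)
placement = assign_blocks(blocks, ["node-a", "node-b"])
print(placement)
```

Each node can then work on its own blocks independently, which is what makes parallel processing possible.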

MapReduce:

MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
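The student example can be sketched in plain Python to show how the two procedures fit together. This is a single-process illustration, not Hadoop code; the function names and the sample student list are made up for the example.

```python
from collections import defaultdict


def map_phase(students):
    """Emit a (first_name, 1) pair for every student record."""
    for full_name in students:
        first_name = full_name.split()[0]
        yield first_name, 1


def shuffle(pairs):
    """Group the emitted values by key: one 'queue' per first name."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Summarise each queue by counting its entries."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input data.
students = ["Ann Lee", "Bob Ray", "Ann Kim", "Cy Fox", "Bob Poe"]
name_counts = reduce_phase(shuffle(map_phase(students)))
print(sorted(name_counts.items()))
```

On a real cluster the map and reduce steps run on many nodes at once, and the framework performs the grouping step between them.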

Terminology

1. NameNode: The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

2. DataNode: A DataNode stores the actual file data in HDFS. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add, copy, move, or delete a file. The NameNode responds to successful requests by returning a list of the relevant DataNode servers where the data lives.

3. JobTracker: The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack.

4. TaskTracker: A TaskTracker is a node in the cluster that accepts tasks - Map, Reduce and Shuffle operations - from a JobTracker.
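The split between NameNode and DataNode described in the list above can be pictured with a toy lookup table. The class, file path, and DataNode names below are hypothetical and only illustrate that the NameNode holds locations, not file contents.

```python
class ToyNameNode:
    """Keeps only metadata: which DataNodes hold each block of each file."""

    def __init__(self):
        self._block_locations = {}

    def add_file(self, path, block_locations):
        """Record, for every block of the file, the DataNodes that store it."""
        self._block_locations[path] = block_locations

    def locate(self, path):
        """Return the DataNode list per block; the data itself stays on the DataNodes."""
        return self._block_locations[path]


# Hypothetical file with two blocks, each stored on two made-up DataNodes.
name_node = ToyNameNode()
name_node.add_file(
    "/logs/a.txt",
    [["datanode-1", "datanode-2"], ["datanode-2", "datanode-3"]],
)
locations = name_node.locate("/logs/a.txt")
print(locations)
```

A client would use such a list to contact the DataNodes directly for the block contents.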

can be hosted on a separate machine.

Hadoop Installation on Windows Using Cygwin

Download Prerequisite Software:

There is a set of prerequisite software that you need to install before you set up Hadoop.

Cygwin:

You need to download the Unix command-line tool Cygwin. Cygwin is a large collection of GNU and Open Source tools which provide functionality similar to a Linux distribution on Windows. It is needed to run the scripts supplied with Hadoop because they are all written for the Unix platform.

Download the Cygwin installer, selecting either the 32-bit or the 64-bit version to match your operating system.

Java:

You need to download Java 1.6 or a later version.

Install Cygwin:

1. Run the downloaded Cygwin setup file

2. On the Choose Installation Type screen, select Install from Internet, then click Next.

4. Select Local Package Directory as the same folder - C:\Cygwin

6. On the Choose Download Site(s) screen, select any site from the available list, then click Next.

7. Keep clicking Next until you see the Select Packages screen.

9. Search again, this time for "dos2unix", and select this package too.

11. Click Finish to complete the installation process.

Cygwin is installed successfully.

Configure OpenSSH in Cygwin:

2. Right-click on your Cygwin shortcut, and click on "Run as administrator". This will make sure we have the proper privileges for everything. You'll see an empty Cygwin window come up.

3. Execute the following command to start the configuration wizard for OpenSSH.

ssh-host-config

You'll see the script generate some default files, and then you'll be prompted for whether or not you want to enable "Privilege Separation." It's on by default in standard installations of OpenSSH on other systems, so go ahead and say "yes" to the prompt.

Next, you'll be asked if you want sshd to run as a service. This will allow you to get SSH access regardless of whether or not Cygwin is currently running, which is what we want. Go ahead and hit "yes" to continue.

Next, you'll be asked to enter a value for the daemon. Enter the value ntsec. You'll see the script give you some information on your system, and then it will ask you to create a privileged account with the default username "cyg_server". The default works well, so type "no" when it asks whether you want to use a different account name, although you can change it if you prefer.

4. You can either restart, or enter the following command to start the sshd service:

net start sshd

Set up authorization keys

1. Open the Cygwin command prompt.

2. Execute the following command to generate keys:

ssh-keygen -t rsa

3. When prompted for filenames and pass phrases press ENTER to accept default values.

4. After the command has finished generating the key, enter the following command to change into your .ssh directory

cd ~/.ssh

5. Check that the keys were indeed generated by executing the following command

ls -l

You should see two files, id_rsa.pub and id_rsa, with recent creation dates. These files contain the authorization keys.

6. To register the new authorization keys, enter the following command. Note the double angle brackets (>>); they are very important because they append to the file instead of overwriting it.

cat id_rsa.pub >> authorized_keys

7. Now check that the keys were set up correctly by executing the following command

ssh localhost

Since it is a new SSH installation, you will be warned that the authenticity of the host could not be established and asked whether you really want to connect. Answer yes and press ENTER. You should see the Cygwin prompt again, which means that you have connected successfully.

8. Now execute the command again:

ssh localhost

This time you should not be prompted for anything.

Set Environment Variable for Cygwin and Java:

The next step is to set up the PATH environment variable. PATH is an environment variable that the operating system uses to find executables.

1. Right-click on "My Computer" and select the Properties item from the menu.

2. Click on "Advanced System Settings". In the "System Properties" window, click on the "Environment Variables" button and locate the PATH variable in the "System Variables" section.

3. Append the bin folder path of the installed Cygwin, C:\cygwin64\bin; and click OK. By default the path will be C:\cygwin64\bin; if your installation path is different, use that path instead.

4. Add a new system variable JAVA_HOME and set it to the installed Java path, C:\JAVA (no semicolon at the end).

Download Hadoop:

Now that you have successfully installed and configured all prerequisite software, let us download and install Hadoop.

1. Download Hadoop from the Apache website.

2. Open a Cygwin terminal (Run as administrator).

3. Execute the command "explorer ." to locate your home directory. It will open the Cygwin home directory folder.

4. Copy the Hadoop package you just downloaded and place it in the home directory folder (which was opened in the previous step).

Unzip the Hadoop Package:

Now we need to unzip the downloaded Hadoop package and save it to the Cygwin home folder.

1. Open a Cygwin terminal (Run as administrator).

2. Execute the tar command as below to start unpacking the Hadoop package.

tar -xzf hadoop-x.y.z.tar.gz

Here x.y.z will be your Hadoop version.

Configurations:

1. Open Cygwin and type the command "explorer ." to open the home folder. Create a folder named "hadoop-dir", and inside the "hadoop-dir" folder create two folders named "datadir" and "namedir".

2. In Cygwin, execute the chmod command to change the folder permissions so that the folders can be accessed by Hadoop. Change into the hadoop-dir folder with the cd command before running chmod on the two subfolders.

$ chmod 755 hadoop-dir
$ cd hadoop-dir
$ chmod 755 datadir
$ chmod 755 namedir

Open a Cygwin terminal (Run as administrator) and execute the following commands:

$ cd hadoop-x.y.z
$ cd conf
$ explorer .

This opens the conf folder in a Windows Explorer window.

Open the hadoop-env.sh file to set the Java home, as you did earlier for the environment variable. Uncomment the line that contains "export JAVA_HOME" and provide your Java path. Do not put a space after the equals sign or a backslash directly before the closing quote.

export JAVA_HOME="C:\Java"

Open the core-site.xml file and add the code below.

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:50000</value>
  </property>
</configuration>

Open the mapred-site.xml file and add the code below.

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:50001</value>
  </property>
</configuration>

Open the hdfs-site.xml file and add the code below. Change USER_FOLDER_NAME in the code to your own user name.

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/USER_FOLDER_NAME/hadoop-dir/datadir</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/USER_FOLDER_NAME/hadoop-dir/namedir</value>
  </property>
</configuration>
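If you prefer not to edit the user name by hand, the hdfs-site.xml content above could be generated with a short Python sketch. The function name and the sample user name "alice" are assumptions for illustration; the property names and paths are the ones used in this tutorial.

```python
import xml.etree.ElementTree as ET


def build_hdfs_site(user_folder_name: str) -> str:
    """Return hdfs-site.xml content for the given Cygwin user folder name."""
    base = f"/home/{user_folder_name}/hadoop-dir"
    settings = {
        "dfs.data.dir": f"{base}/datadir",
        "dfs.name.dir": f"{base}/namedir",
    }
    configuration = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(configuration, encoding="unicode")


# "alice" is a placeholder; substitute your own user folder name.
hdfs_site_xml = build_hdfs_site("alice")
print(hdfs_site_xml)
```

The printed text would still need to be saved as hdfs-site.xml in the conf folder.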

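Run the dos2unix command on every file we changed, so that Windows line endings (CRLF) are converted to the Unix line endings (LF) that the Hadoop scripts expect.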
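To see what this conversion amounts to, here is a minimal Python sketch of the line-ending change. It only illustrates the idea and is not a replacement for dos2unix; the function name and the sample text are made up.

```python
def crlf_to_lf(content: bytes) -> bytes:
    """Replace Windows line endings (CRLF) with Unix line endings (LF)."""
    return content.replace(b"\r\n", b"\n")


# Hypothetical file content as a Windows editor might save it.
windows_text = b"export JAVA_HOME=/some/path\r\n# comment\r\n"
unix_text = crlf_to_lf(windows_text)
print(unix_text)
```

A stray carriage return at the end of a line in a shell script is treated as part of the command or value, which is why the scripts can fail without this step.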

Now we have successfully installed Hadoop on Windows.

Format the NameNode and Run Hadoop Daemons:

Format the NameNode:

We first need to format the NameNode to create a Hadoop Distributed File System (HDFS).

Open a Cygwin terminal (Run as administrator) and execute the following commands:

$ cd hadoop-x.y.z
$ bin/hadoop namenode -format

This command will run for some time. You should see the message "Storage Directory has been successfully formatted".

The next step is to start the Hadoop cluster daemons: NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker.

Restart the Cygwin Terminal and execute below command to start all daemons on Hadoop Cluster.

$ bin/start-all.sh

This command starts all the services in the cluster, and you now have your Hadoop cluster running.

Stop Hadoop Daemons:

To stop all the daemons, execute the command

$ bin/stop-all.sh

Web interface for the NameNode and the JobTracker:

After you have started the Hadoop daemons with the command bin/start-all.sh, you can check the NameNode and JobTracker in a browser. By default they are available at the addresses below.

NameNode: http://localhost:50070/

JobTracker: http://localhost:50030/

The web interfaces for these services provide information about, and the status of, each of these components. They are the first entry point for viewing the state of a Hadoop cluster.

NameNode Web Interface:

The NameNode web interface can be accessed via the URL http://<host>:50070/, in this case http://localhost:50070/

1. The first section of the web page displays the name of the server running the NameNode, which is 127.0.0.1, and the port, 9100, along with when it was started and version information.

2. In the next section you see the Cluster Summary, which gives a high-level view of the state of the cluster.

files and directories, blocks: Each Filesystem metadata item consumes this much memory.

Configured Capacity: The total capacity of HDFS.

DFS Used: The space used in HDFS.

Non DFS Used: The space used by non-HDFS items, such as other applications running on the system.

JobTracker Web Interface:

The JobTracker web interface can be accessed via the URL http://<namenode_host>:50030/, in this case http://localhost:50030/

MapReduce wordcount example

1. First, create a new directory named “example” in the hadoop-0.19.1 folder.

3. Next, copy this file to HDFS using the command below.

bin/hadoop dfs -copyFromLocal example example

4. Run the wordcount java program from the hadoop directory

bin/hadoop jar hadoop-0.19.1-examples.jar wordcount example example-output

The progress of the map and reduce operations can be followed in the JobTracker web interface.