Extreme computing lab exercises Session one
Michail Basios, Stratis Viglas
1 Getting started
First you need to access the machine where you will be doing all the work. Do this by typing:
ssh namenode
If this machine is busy, try another one (for example bw1425n13, bw1425n14 or any other number up to bw1425n24). To access Hadoop from a non-DICE machine, first log in to a DICE machine through ssh with the following command:
ssh [email protected]
where sXXXXXXX is your matriculation number. Then type:
ssh namenode
to access Hadoop.
Hadoop is installed on DICE under /opt/hadoop/hadoop-0.20.2 (this will also be referred to as the Hadoop home directory). The hadoop command can be found in the /opt/hadoop/hadoop-0.20.2/bin directory. To run the hadoop command from any directory, open the ~/.benv file in your favorite editor (for example gedit) by typing:
gedit ~/.benv
and add the following line:
PATH=$PATH:/opt/hadoop/hadoop-0.20.2/bin/
Don’t forget to save the changes! After that, run the following command:
source ~/.benv
so that the changes take effect right away. After you do this, run
hadoop dfs -ls /
If you get a listing of directories on HDFS, you’ve successfully configured everything. If not, make sure you do ALL of the described steps EXACTLY as they appear in this document. Keep in mind that you should not continue until you have completed this section.
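If the listing does not appear, a quick check is whether the hadoop command is actually reachable through PATH. The following is a minimal sketch of that check; the helper name find_on_path is made up for this illustration and is not part of Hadoop or DICE.

```python
import os


def find_on_path(command, path=None):
    """Return the full path of the first executable `command` found on `path`, or None.

    `path` defaults to the PATH environment variable of the current process.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, command)
        # The candidate must be a regular file that we are allowed to execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    location = find_on_path("hadoop")
    if location:
        print("hadoop found at %s" % location)
    else:
        print("hadoop is not on PATH; check the line added to ~/.benv")
```

If the script reports that hadoop is not on PATH, the most likely cause is that ~/.benv was not saved or not sourced in the current terminal.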
2 HDFS
Hadoop consists of two major parts: a distributed filesystem (HDFS for Hadoop DFS) and the mechanism for running jobs. Files on HDFS are distributed across the network and replicated on different nodes for reliability and speed. When a job requires a particular file, it is fetched from the machine/rack nearest to the machines that are actually executing the job.
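The idea of fetching from the nearest replica can be illustrated with a toy model. This is only a sketch and not Hadoop's actual code: the rack and node names are hypothetical, and the distance weights (0 for the same node, 2 for the same rack, 4 otherwise) are chosen purely for illustration.

```python
def distance(reader, replica):
    """Toy distance between two (rack, node) locations.

    0 = same node, 2 = same rack, 4 = different rack (illustrative weights).
    """
    if reader == replica:
        return 0
    if reader[0] == replica[0]:
        return 2
    return 4


def nearest_replica(reader, replicas):
    """Pick the replica location closest to the node that runs the job."""
    return min(replicas, key=lambda replica: distance(reader, replica))


# Hypothetical cluster layout: the job runs on rack1/n13 and the file
# has copies on one node of the same rack and one node of another rack.
reader = ("rack1", "n13")
replicas = [("rack2", "n20"), ("rack1", "n14")]
print(nearest_replica(reader, replicas))
```

In this model the copy on the same rack is preferred over the copy on the remote rack, which is the behaviour the paragraph above describes.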
You will now run some simple commands in HDFS. REMEMBER: all the commands mentioned here refer to HDFS shell commands, NOT to UNIX commands which may have the same name. You should keep all your data within a directory named /user/sXXXXXXX (where sXXXXXXX is your matriculation number), which should already have been created for you.
It is very important to understand the difference between files stored locally on a DICE machine and files stored on HDFS (see Figure 1). This difference will become clear once we have completed this tutorial. Figure 1 shows some existing folders (s1250553, s1250553/data/) and other folders (s14xxxxxx/) that we will need to create for this task.
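[Figure 1 diagram. HDFS side: /user/ contains s1250553/data/example1.txt and s14xxxxxx/ with the folders uploads and data (input, output). DICE side: a local folder (Desktop, under /afs/inf.ed.ac.uk/user/s14/s14xxxxxx/) containing sampleInput.txt. Files are copied from DICE to HDFS with -copyFromLocal and from HDFS to DICE with -copyToLocal.]
Figure 1: File transfer between HDFS and local Dice machine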
1. Initially, make sure that your home directory exists:
hadoop dfs -ls /user/sXXXXXXX
where sXXXXXXX should be your matriculation number. To create a directory in Hadoop the following command is used:
hadoop dfs -mkdir dirName
For this task, first create the following directories in the same way (these directories will NOT have been created for you, so you need to create them yourself):
• /user/sXXXXXXX/data
• /user/sXXXXXXX/data/input
• /user/sXXXXXXX/data/output
• /user/sXXXXXXX/uploads
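To avoid typos in the four paths, the commands can be generated from the matriculation number. The sketch below only prints the commands so that they can be checked and pasted into the terminal; the function name mkdir_commands is invented for this example.

```python
def mkdir_commands(matric):
    """Return the four `hadoop dfs -mkdir` commands for the given matriculation number."""
    base = "/user/%s" % matric
    # Parent directories come before their children.
    subdirs = ["data", "data/input", "data/output", "uploads"]
    return ["hadoop dfs -mkdir %s/%s" % (base, sub) for sub in subdirs]


# s0123456 is the example matriculation number used in this document.
for command in mkdir_commands("s0123456"):
    print(command)
```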
Confirm that you’ve done the right thing by typing
hadoop dfs -ls /user/sXXXXXXX
For example, if your matriculation number is s0123456, you should see something like:
Found 2 items
drwxr-xr-x - s0123456 s0123456 0 2011-10-19 09:55 /user/s0123456/data
drwxr-xr-x - s0123456 s0123456 0 2011-10-19 09:54 /user/s0123456/uploads
2. Copy the file /user/s1250553/data/example1.txt to /user/sXXXXXXX/data/output by typing:
hadoop dfs -cp /user/s1250553/data/example1.txt /user/sXXXXXXX/data/output
3. Obviously, example1.txt doesn’t belong there. Move it from /user/sXXXXXXX/data/output to /user/sXXXXXXX/data/input where it belongs and delete the /user/sXXXXXXX/data/output directory:
hadoop dfs -mv /user/sXXXXXXX/data/output/example1.txt /user/sXXXXXXX/data/input
hadoop dfs -rmr /user/sXXXXXXX/data/output/
4. Examine the content of example1.txt by using cat and then tail:
hadoop dfs -cat /user/sXXXXXXX/data/input/example1.txt
hadoop dfs -tail /user/sXXXXXXX/data/input/example1.txt
5. Create an empty file named example2.txt in /user/sXXXXXXX/data/input.
hadoop dfs -touchz /user/sXXXXXXX/data/input/example2.txt
6. Remove the file example2.txt:
hadoop dfs -rm /user/sXXXXXXX/data/input/example2.txt
7. Create locally on the Desktop of your DICE machine a folder named diceFolder:
mkdir ~/Desktop/diceFolder
cd ~/Desktop/diceFolder
8. Inside diceFolder create a file named sampleInput.txt and add some text content:
touch sampleInput.txt
echo "This is a message" > sampleInput.txt
9. Next, to copy (upload) the sampleInput.txt file to HDFS, execute the following command:
hadoop dfs -copyFromLocal ~/Desktop/diceFolder/sampleInput.txt /user/s14xxxxxx/uploads
Observe that the first path (~/Desktop/diceFolder/sampleInput.txt) refers to the local Dice machine and the second path (/user/s14xxxxxx/uploads) to HDFS.
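The same upload can be driven from a script. The sketch below builds the argument list for the command shown above; copy_from_local_args is a name invented for this illustration. Note that the local path needs "~" expanded explicitly, whereas the HDFS path is passed through unchanged because it is interpreted by Hadoop and not by the local filesystem.

```python
import os


def copy_from_local_args(local_path, hdfs_dir):
    """Build the argument list for uploading one local file to an HDFS directory."""
    # The shell normally expands "~"; an argument list bypasses the shell,
    # so the local path is expanded here. The HDFS path is left untouched.
    local = os.path.expanduser(local_path)
    return ["hadoop", "dfs", "-copyFromLocal", local, hdfs_dir]


args = copy_from_local_args("~/Desktop/diceFolder/sampleInput.txt",
                            "/user/s14xxxxxx/uploads")
print(" ".join(args))
```

On a machine where Hadoop is configured, passing this list to subprocess.call from the Python standard library would run the upload.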
2.1 List of HDFS commands
What follows is a list of useful HDFS shell commands.
• cat – copy files to stdout, similar to UNIX cat command.
Example:
hadoop dfs -cat /user/hadoop/file4
• copyFromLocal – copy single src, or multiple srcs from local file system to the destination filesystem. Source has to be a local file reference.
Example:
hadoop dfs -copyFromLocal localfile /user/hadoop/file1
• copyToLocal – copy files to the local file system. Destination must be a local file reference.
Example:
hadoop dfs -copyToLocal /user/hadoop/file localfile
• cp – copy files from source to destination. This command allows multiple sources as well in which case the destination must be a directory. Similar to UNIX cp command.
Example:
hadoop dfs -cp /user/hadoop/file1 /user/hadoop/file2
• getmerge – take a source directory and a destination file as input and concatenate the files in src into the destination local file. Optionally addnl can be set to enable adding a newline character at the end of each file.
Example:
hadoop dfs -getmerge /user/hadoop/mydir/ ~/result_file
• ls – for a file, returns stat on the file with the format:
filename <number of replicas> size modification_date modification_time permissions userid groupid
For a directory, returns a list of its direct children as in UNIX, with the format:
dirname <dir> modification_date modification_time permissions userid groupid
Example:
hadoop dfs -ls /user/hadoop/file1
• lsr – recursive version of ls. Similar to UNIX ls -R command.
Example:
hadoop dfs -lsr /user/hadoop/
• mkdir – create a directory. Behaves like the UNIX mkdir -p command, creating parent directories along the path (for bragging rights, what is the difference?)
Example:
hadoop dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
• mv – move files from source to destination similar to UNIX mv command. This command allows multiple sources as well in which case the destination needs to be a directory. Moving files across filesystems is not permitted.
Example:
hadoop dfs -mv /user/hadoop/file1 /user/hadoop/file2
• rm – delete files, similar to UNIX rm command. Only deletes empty directories and files.
Example:
hadoop dfs -rm /user/hadoop/file1
• rmr – recursive version of rm. Same as rm -r on UNIX.
Example:
hadoop dfs -rmr /user/hadoop/dir1/
• tail – display the last kilobyte of the file to stdout. Similar to UNIX tail command.
Options:
-f output appended data as the file grows (follow)
Example:
hadoop dfs -tail /user/hadoop/file1
• touchz – create a file of zero length. Similar to UNIX touch command.
Example:
hadoop dfs -touchz /user/hadoop/file1
More detailed information about Hadoop commands can be found here (http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html).
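When the output of ls or lsr is processed by a script, each line has to be split into fields. The sketch below assumes the column layout of the sample listing shown earlier in this document (permissions, replication, owner, group, size, date, time, path), which may differ between Hadoop versions; parse_ls_line is a name made up for this example.

```python
def parse_ls_line(line):
    """Parse one line of `hadoop dfs -ls` output into a dictionary.

    Returns None for lines that do not have the expected eight columns,
    such as the "Found N items" header.
    """
    # Split on whitespace at most 7 times so that the path stays in one piece.
    fields = line.split(None, 7)
    if len(fields) != 8:
        return None
    perms, replicas, owner, group, size, date, time, path = fields
    return {
        "path": path.rstrip("\n"),
        "is_dir": perms.startswith("d"),
        "replicas": replicas,  # "-" for directories
        "owner": owner,
        "group": group,
        "size": int(size),
        "modified": date + " " + time,
    }


sample = "drwxr-xr-x - s0123456 s0123456 0 2011-10-19 09:55 /user/s0123456/data"
print(parse_ls_line(sample))
```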
3 Running jobs
3.1 Demo: Word counting
Hadoop has a number of demo applications and here we will look at the canonical task of word counting.
We will count the number of times each word appears in a document. For this purpose, we will use the /user/s1250553/data/example1.txt file, so first copy that file to your input directory. Second, make sure you delete your output directory before running the job or the job will fail (IMPORTANT). We run the wordcount example by typing:
hadoop jar /opt/hadoop/hadoop-0.20.2/hadoop-0.20.2-examples.jar wordcount <in> <out>
where <in> and <out> are the input and output directories, respectively. After running the example, examine (using ls) the contents of the output directory. From the output directory, copy the file part-r-00000 to a local directory (somewhere in your home directory) and examine the contents.
Was the job successful?
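One way to examine the local copy of part-r-00000 is to list the most frequent words. The sketch below assumes that each line of the file holds a word and its count separated by a tab; the function name top_words and the sample lines are invented for this illustration.

```python
def top_words(lines, n=10):
    """Return the n most frequent (word, count) pairs from wordcount output lines."""
    counts = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Each line is assumed to look like "word<TAB>count".
        word, _, count = line.partition("\t")
        counts.append((word, int(count)))
    # Highest count first; ties are broken alphabetically.
    counts.sort(key=lambda pair: (-pair[1], pair[0]))
    return counts[:n]


# Made-up sample standing in for the contents of part-r-00000.
sample = ["hadoop\t3\n", "hdfs\t1\n", "job\t2\n"]
print(top_words(sample, n=2))
```

To use it on the real output, open the local copy of part-r-00000 and pass the file object to top_words in place of the sample list.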
3.2 Running streaming jobs
To run your own programs in Hadoop, we use the Hadoop streaming utility. It allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer. The way it works is very simple: input is converted into lines which are fed to the stdin of the mapper process. The mapper processes this data and writes to stdout. Lines from the stdout of the mapper process are converted into key/value pairs by splitting them on the first tab character (see footnote 1). The key/value pairs are fed to the stdin of the reducer process, which collects and processes them.
Finally, the reducer writes to stdout, which is the final output of the program. Everything will become much clearer through examples later.
It is important to note that with Hadoop streaming mappers and reducers can be any programs that read from stdin and write to stdout, so the choice of the programming language is left to the programmer. Here, we will use Python.
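The data flow described above can be imitated in a single process. The following is a simplified sketch, not the streaming utility itself: the names split_key_value, simulate_streaming, word_mapper and sum_reducer are all invented for this illustration. It also makes explicit one step the description leaves implicit, namely that Hadoop sorts the mapper output by key before handing it to the reducer.

```python
from itertools import groupby


def split_key_value(line):
    """Split a mapper output line on the first tab into (key, value)."""
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value


def simulate_streaming(input_lines, mapper, reducer):
    """Run mapper and reducer functions over input lines in a single process."""
    # Map phase: every input line is handed to the mapper, which emits lines.
    mapped = []
    for line in input_lines:
        mapped.extend(mapper(line))
    # Between the phases the pairs are sorted by key, as Hadoop does.
    pairs = sorted(split_key_value(line) for line in mapped)
    # Reduce phase: the reducer sees the sorted pairs and emits output lines.
    return list(reducer(pairs))


def word_mapper(line):
    """Example mapper: emit "word<TAB>1" for every token of the line."""
    return ["%s\t1" % token for token in line.split()]


def sum_reducer(pairs):
    """Example reducer: add up the values of consecutive pairs with the same key."""
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%d" % (key, sum(int(value) for _, value in group))


print(simulate_streaming(["a b a", "b a"], word_mapper, sum_reducer))
```

In a real streaming job the mapper and the reducer are separate programs connected through stdin and stdout, possibly running on different machines; plain functions are used here only to keep the example short.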
3.2.1 Writing a word counting program in Python
In this subsection, we will see how to create a program in Python that counts how many times each word appears in a specific file. First, we will test the code locally on small files before using it in a MapReduce job. As we will see later, this is important because it helps avoid running Hadoop jobs that give wrong results.
Step 1: Mapper
• Using Streaming, a Mapper reads from STDIN and writes to STDOUT
• Keys and Values are delimited (by default) using tabs.
• Records are split using newlines
For this task, create a file named mapper.py and add the following code:
#!/usr/bin/python
import sys

# input from standard input
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into tokens
    tokens = line.split()
    for token in tokens:
        # write the results to standard output
        print('%s\t%s' % (token, 1))
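Before submitting a job, the mapper logic can be checked on a few lines of text without Hadoop. The sketch below repeats the logic of mapper.py inside a function that reads from any file-like object, so that a string can stand in for stdin; run_mapper and the sample text are made up for this illustration.

```python
import io


def run_mapper(stream):
    """Apply the mapper.py logic to a file-like object and return the output lines."""
    output = []
    for line in stream:
        # Same steps as in mapper.py: strip, split into tokens, emit token<TAB>1.
        line = line.strip()
        for token in line.split():
            output.append("%s\t%s" % (token, 1))
    return output


# An in-memory text stream plays the role of standard input.
fake_stdin = io.StringIO(u"hello world\nhello hadoop\n")
print(run_mapper(fake_stdin))
```

The script itself can be checked in the same spirit from the shell by piping a small local file, such as sampleInput.txt, into python mapper.py.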
1