Hadoop Tutorial
Group 7 - Tools For Big Data
Indian Institute of Technology Bombay
Dipojjwal Ray
Sandeep Prasad
1 Introduction
In the installation manual we listed the installation steps for hadoop-1.0.3 and
hadoop-1.0.4. In this report we present various examples run on Hadoop.
After installation is complete, any of the examples mentioned below can be run
on Hadoop as a check for proper installation. The examples explained in this
report are as mentioned below:
1. wordcount: listing the words that occur in a given file along with their
occurrence frequency [1]
2. pi: calculating the value of pi [2]
3. pagerank:
4. inverted indexing:
5. indexing wikipedia: in this section we will index the entire English wikipedia
2 Wordcount
The wordcount example counts and sorts the words in a given single file or
group of files. Files of various sizes were used for this example. The 1st set of
experiments was conducted using single files and the 2nd set of experiments was
conducted using groups of files. For the 1st set of experiments 5 files were used,
whose details along with the time required to execute wordcount are given in
Table 1. For the 2nd set of experiments combinations of files from the 1st set
were used, whose details can be found in Table 2.
The figures given below are for line 3 of Table 2, with 3 files in the gutenberg
directory in /tmp. Figure 1 shows the command given in Listing 1 executed on
my machine. It is assumed that the files are located in the /tmp directory under
an appropriate name (in my case the directory name is /tmp/gutenberg).
1 $ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
2 $ bin/hadoop dfs -ls /user/hduser/gutenberg
Listing 1: Copying files from the local machine to Hadoop's file system
1st set of experiments
file name      size       cpu time required (ms)
pg20417.txt    674.6 KB   3380
pg2243.txt     137.3 KB   2270
pg28885.txt    177.4 KB   2520
pg4300.txt     1.6 MB     4090
pg5000.txt     1.4 MB     3700
Table 1: Time required to count words in single files
2nd set of experiments
file names                                                      total size   cpu time required (ms)
pg4300.txt, pg5000.txt                                          3.0 MB       6860
pg4300.txt, pg5000.txt, pg20417.txt                             3.7 MB       9580
pg2243.txt, pg5000.txt, pg20417.txt, pg28885.txt                2.4 MB       9090
pg2243.txt, pg4300.txt, pg5000.txt, pg20417.txt, pg28885.txt    4.0 MB       11410
Table 2: Time required to count words in multiple files
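As a side note, the single-file timings in Table 1 can be turned into rough throughput figures with a one-line awk sketch (the data below is copied from Table 1, with the MB sizes converted to KB assuming 1 MB = 1024 KB); the larger files show a noticeably higher KB/ms rate, suggesting each job pays a fixed startup overhead:

```shell
# Compute KB processed per ms of CPU time for each row of Table 1.
# Columns: file name, size in KB, cpu time in ms.
awk '{ printf "%s\t%.3f KB/ms\n", $1, $2 / $3 }' <<'EOF'
pg20417.txt 674.6 3380
pg2243.txt 137.3 2270
pg28885.txt 177.4 2520
pg4300.txt 1638.4 4090
pg5000.txt 1433.6 3700
EOF
```

The two MB-sized files come out around 0.4 KB/ms while the small KB-sized files stay well below 0.2 KB/ms.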
Line 1 in Listing 1 copies the files from /tmp/gutenberg on the local machine to
Hadoop's file system in the directory /user/hduser/gutenberg. Line 2 in Listing 1
lists/checks the files just copied to /user/hduser/gutenberg.
Figure 1: copy files to dfs
The command to run wordcount is given in Listing 2 and the command
executed on my machine is given in Listing 3. Files from /user/hduser/gutenberg
are used and the output is stored in /user/hduser/gutenberg-output.
1 $ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
Listing 2: Command to run wordcount
1 hduser@ada-desktop:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
2 Warning: $HADOOP_HOME is deprecated.
3
4 13/07/29 14:20:57 INFO input.FileInputFormat: Total input paths to process : 3
5 13/07/29 14:20:57 INFO util.NativeCodeLoader: Loaded the native-hadoop library
6 13/07/29 14:20:57 WARN snappy.LoadSnappy: Snappy native library not loaded
7 13/07/29 14:20:57 INFO mapred.JobClient: Running job: job_201307291349_0001
8 13/07/29 14:20:58 INFO mapred.JobClient:  map 0% reduce 0%
9 13/07/29 14:21:13 INFO mapred.JobClient:  map 66% reduce 0%
10 13/07/29 14:21:19 INFO mapred.JobClient:  map 100% reduce 0%
11 13/07/29 14:21:22 INFO mapred.JobClient:  map 100% reduce 22%
12 13/07/29 14:21:31 INFO mapred.JobClient:  map 100% reduce 100%
13 13/07/29 14:21:36 INFO mapred.JobClient: Job complete: job_201307291349_0001
14 13/07/29 14:21:36 INFO mapred.JobClient: Counters: 29
15 13/07/29 14:21:36 INFO mapred.JobClient:   Job Counters
16 13/07/29 14:21:36 INFO mapred.JobClient:     Launched reduce tasks=1
17 13/07/29 14:21:36 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=20523
18 13/07/29 14:21:36 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
19 13/07/29 14:21:36 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
20 13/07/29 14:21:36 INFO mapred.JobClient:     Launched map tasks=3
21 13/07/29 14:21:36 INFO mapred.JobClient:     Data-local map tasks=3
22 13/07/29 14:21:36 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=16245
23 13/07/29 14:21:36 INFO mapred.JobClient:   File Output Format Counters
24 13/07/29 14:21:36 INFO mapred.JobClient:     Bytes Written=880838
25 13/07/29 14:21:36 INFO mapred.JobClient:   FileSystemCounters
26 13/07/29 14:21:36 INFO mapred.JobClient:     FILE_BYTES_READ=2214875
27 13/07/29 14:21:36 INFO mapred.JobClient:     HDFS_BYTES_READ=3671884
28 13/07/29 14:21:36 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=3775583
29 13/07/29 14:21:36 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=880838
30 13/07/29 14:21:36 INFO mapred.JobClient:   File Input Format Counters
31 13/07/29 14:21:36 INFO mapred.JobClient:     Bytes Read=3671523
32 13/07/29 14:21:36 INFO mapred.JobClient:   Map-Reduce Framework
33 13/07/29 14:21:36 INFO mapred.JobClient:     Map output materialized bytes=1474367
34 13/07/29 14:21:36 INFO mapred.JobClient:     Map input records=77931
35 13/07/29 14:21:36 INFO mapred.JobClient:     Reduce shuffle bytes=1207341
36 13/07/29 14:21:36 INFO mapred.JobClient:     Spilled Records=255966
37 13/07/29 14:21:36 INFO mapred.JobClient:     Map output bytes=6076101
38 13/07/29 14:21:36 INFO mapred.JobClient:     Total committed heap usage (bytes)=586285056
39 13/07/29 14:21:36 INFO mapred.JobClient:     CPU time spent (ms)=9580
40 13/07/29 14:21:36 INFO mapred.JobClient:     Combine input records=629172
41 13/07/29 14:21:36 INFO mapred.JobClient:     SPLIT_RAW_BYTES=361
42 13/07/29 14:21:36 INFO mapred.JobClient:     Reduce input records=102324
43 13/07/29 14:21:36 INFO mapred.JobClient:     Reduce input groups=82335
44 13/07/29 14:21:36 INFO mapred.JobClient:     Combine output records=102324
45 13/07/29 14:21:36 INFO mapred.JobClient:     Physical memory (bytes) snapshot=625811456
46 13/07/29 14:21:36 INFO mapred.JobClient:     Reduce output records=82335
47 13/07/29 14:21:36 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1897635840
48 13/07/29 14:21:36 INFO mapred.JobClient:     Map output records=629172
49 hduser@ada-desktop:/usr/local/hadoop$
Listing 3: wordcount executed on /user/hduser/gutenberg
In case the system is not able to detect the jar file, the following error message
is received:
1 Exception in thread "main" java.io.IOException: Error opening job jar: hadoop*examples*.jar at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
2 Caused by: java.util.zip.ZipException: error in opening zip file
In such cases use the complete name of the jar file (instead of hadoop*examples*.jar
use hadoop-examples-1.0.3.jar) and run the command again.
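Instead of typing the full name by hand, the shell can resolve the glob once and hand the exact file name to bin/hadoop; a minimal sketch, assuming Hadoop is installed under /usr/local/hadoop (adjust the path to match your installation):

```shell
# Resolve the examples-jar glob to one concrete file name. The installation
# path /usr/local/hadoop is an assumption; change it to your own setup.
JAR=$(ls /usr/local/hadoop/hadoop-examples-*.jar 2>/dev/null | head -n 1)
if [ -n "$JAR" ]; then
    echo "found examples jar: $JAR"   # pass "$JAR" to: bin/hadoop jar "$JAR" ...
else
    echo "no examples jar found under /usr/local/hadoop"
fi
```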
As mentioned, the output is stored in /user/hduser/gutenberg-output. To
check if the file exists, run the command given in line 2 of Listing 1, replacing
gutenberg with gutenberg-output in the command. Figure 2 shows the files present
in my system.
Figure 2: checking the files produced by wordcount
Figure 3 shows the retrieved output, which can be checked by importing the
results back to the local system. Notice -getmerge in line 2 of Listing 4; it merges
everything present in the gutenberg-output folder.
1 $ mkdir /tmp/gutenberg-output
2 $ bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output
3 $ head /tmp/gutenberg-output/gutenberg-output
Listing 4: Checking wordcount results after importing results to local system
Figure 3: Checking wordcount results
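Once the merged file is on the local disk, ordinary shell tools can inspect it further; for instance, the most frequent words can be listed by sorting on the count field. The sketch below uses a small made-up sample file in place of /tmp/gutenberg-output/gutenberg-output so that it is self-contained; wordcount's real output has the same one-word-per-line, tab-separated "word count" shape:

```shell
# Wordcount emits one "word<TAB>count" line per distinct word. Sort by the
# second field (numeric, descending) and keep the top entries. The sample
# data below is invented purely for illustration.
printf 'the\t4520\nof\t2341\nwhale\t1226\nand\t3017\n' > /tmp/wc-sample.txt
sort -k2,2nr /tmp/wc-sample.txt | head -n 3
# → the 4520, and 3017, of 2341
```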
Results can also be retrieved without importing them; just use the command
given in Listing 5.
1 $ bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-r-00000
Listing 5: Checking wordcount results without importing them
3 Value of PI
Hadoop can be used to calculate the value of pi (pi ≈ 3.14159). The value
of pi is calculated using a quasi-Monte Carlo method in this example. The value of
pi can be estimated using the command in Listing 6. We give two values after ‘pi’:
the first value ‘x’ is the number of maps and the second value ‘y’ is the number of
samples per map. Results of some experiments conducted are given in Table 3.
1 $ bin/hadoop jar hadoop*examples*.jar pi 10 100
Listing 6: Estimating the value of pi with 10 maps and 100 samples per map
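To illustrate the estimator behind this example, here is a minimal single-process sketch in plain shell. It uses awk's pseudo-random numbers rather than the quasi-random (low-discrepancy) points the Hadoop example generates, and it runs everything in one process, whereas the Hadoop job spreads the sampling over the map tasks and sums the counts in the reduce step:

```shell
# Monte Carlo sketch of the pi estimator: sample points uniformly in the unit
# square; the fraction falling inside the quarter circle x^2 + y^2 <= 1
# approaches pi/4, so 4 * fraction approaches pi.
awk 'BEGIN {
    srand(1); n = 200000; inside = 0
    for (i = 0; i < n; i++) {
        x = rand(); y = rand()
        if (x * x + y * y <= 1) inside++
    }
    printf "pi is approximately %.4f\n", 4 * inside / n
}'
```

With 200000 samples the printed value typically agrees with pi to about two decimal places; the Hadoop job's x * y samples play exactly the same role as n here.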