
Hadoop Tutorial

Group 7 - Tools For Big Data

Indian Institute of Technology Bombay

Dipojjwal Ray

Sandeep Prasad

1 Introduction

In the installation manual we listed the steps for installing hadoop-1.0.3 and hadoop-1.0.4. In this report we present various examples run on Hadoop. After installation is complete, any of the examples mentioned below can be run as a check for proper installation. The examples explained in this report are:

1. wordcount: listing the words that occur in a given file along with their occurrence frequency [1]

2. pi: calculating the value of pi [2]

3. pagerank:

4. inverted indexing:

5. indexing wikipedia: in this section we will index the entire English Wikipedia

2 Wordcount

The wordcount example counts and sorts the words in a given single file or group of files. Files of various sizes were used for this example. The 1st set of experiments was conducted using single files and the 2nd set using groups of files. For the 1st set, 5 files were used; their details, along with the time required to execute wordcount, are given in Table 1. For the 2nd set, combinations of files from the 1st set were used; their details can be found in Table 2.
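Conceptually, the wordcount job maps every line of input to (word, 1) pairs and then reduces by summing the counts for each distinct word. The following is a minimal single-machine sketch of that dataflow in Python; it illustrates the idea only and is not Hadoop's actual Java implementation:

```python
from collections import Counter
from itertools import chain

def map_phase(lines):
    # Map: emit a (word, 1) pair for every whitespace-separated token.
    return (((word, 1) for word in line.split()) for line in lines)

def reduce_phase(pairs):
    # Reduce: sum the emitted counts per distinct word.
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

lines = ["hadoop counts words", "hadoop sorts words"]
result = reduce_phase(chain.from_iterable(map_phase(lines)))
print(sorted(result.items()))
# [('counts', 1), ('hadoop', 2), ('sorts', 1), ('words', 2)]
```

On a real cluster the map phase runs in parallel over file splits and a combiner performs the same summation locally before the shuffle, which is why the job output below reports separate combine and reduce record counts.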

The figures given below are for line 3 of Table 2, with 3 files in the gutenberg directory in /tmp. Figure 1 shows the command given in Listing 1 executed on my machine. It is assumed that the files are located in the /tmp directory under an appropriate name (in my case the directory name is /tmp/gutenberg).

1 $ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
2 $ bin/hadoop dfs -ls /user/hduser/gutenberg

Listing 1: Copying files from the local machine to hadoop's file system


1st set of experiments

file name      size        cpu time required (ms)
pg20417.txt    674.6 KB    3380
pg2243.txt     137.3 KB    2270
pg28885.txt    177.4 KB    2520
pg4300.txt     1.6 MB      4090
pg5000.txt     1.4 MB      3700

Table 1: Time required to count words in single files

2nd set of experiments

file names                                                      total size  cpu time required (ms)
pg4300.txt, pg5000.txt                                          3.0 MB      6860
pg4300.txt, pg5000.txt, pg20417.txt                             3.7 MB      9580
pg2243.txt, pg5000.txt, pg20417.txt, pg28885.txt                2.4 MB      9090
pg2243.txt, pg4300.txt, pg5000.txt, pg20417.txt, pg28885.txt    4.0 MB      11410

Table 2: Time required to count words in multiple files

Line 1 in Listing 1 copies files from /tmp/gutenberg on the local machine to hadoop's file system, into the directory /user/hduser/gutenberg. Line 2 in Listing 1 lists/checks the files just copied into /user/hduser/gutenberg.

Figure 1: copy files to dfs

The command to run wordcount is given in Listing 2, and the command executed on my machine is given in Listing 3. Files from /user/hduser/gutenberg are used and the output is stored in /user/hduser/gutenberg-output.

1 $ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

Listing 2: Command to run wordcount

1 hduser@ada-desktop:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
2 Warning: $HADOOP_HOME is deprecated.
3
4 13/07/29 14:20:57 INFO input.FileInputFormat: Total input paths to process : 3
5 13/07/29 14:20:57 INFO util.NativeCodeLoader: Loaded the native-hadoop library
6 13/07/29 14:20:57 WARN snappy.LoadSnappy: Snappy native library not loaded
7 13/07/29 14:20:57 INFO mapred.JobClient: Running job: job_201307291349_0001
8 13/07/29 14:20:58 INFO mapred.JobClient:  map 0% reduce 0%
9 13/07/29 14:21:13 INFO mapred.JobClient:  map 66% reduce 0%
10 13/07/29 14:21:19 INFO mapred.JobClient:  map 100% reduce 0%
11 13/07/29 14:21:22 INFO mapred.JobClient:  map 100% reduce 22%
12 13/07/29 14:21:31 INFO mapred.JobClient:  map 100% reduce 100%
13 13/07/29 14:21:36 INFO mapred.JobClient: Job complete: job_201307291349_0001
14 13/07/29 14:21:36 INFO mapred.JobClient: Counters: 29
15 13/07/29 14:21:36 INFO mapred.JobClient:   Job Counters
16 13/07/29 14:21:36 INFO mapred.JobClient:     Launched reduce tasks=1
17 13/07/29 14:21:36 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=20523
18 13/07/29 14:21:36 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
19 13/07/29 14:21:36 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
20 13/07/29 14:21:36 INFO mapred.JobClient:     Launched map tasks=3
21 13/07/29 14:21:36 INFO mapred.JobClient:     Data-local map tasks=3
22 13/07/29 14:21:36 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=16245
23 13/07/29 14:21:36 INFO mapred.JobClient:   File Output Format Counters
24 13/07/29 14:21:36 INFO mapred.JobClient:     Bytes Written=880838
25 13/07/29 14:21:36 INFO mapred.JobClient:   FileSystemCounters
26 13/07/29 14:21:36 INFO mapred.JobClient:     FILE_BYTES_READ=2214875
27 13/07/29 14:21:36 INFO mapred.JobClient:     HDFS_BYTES_READ=3671884
28 13/07/29 14:21:36 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=3775583
29 13/07/29 14:21:36 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=880838
30 13/07/29 14:21:36 INFO mapred.JobClient:   File Input Format Counters
31 13/07/29 14:21:36 INFO mapred.JobClient:     Bytes Read=3671523
32 13/07/29 14:21:36 INFO mapred.JobClient:   Map-Reduce Framework
33 13/07/29 14:21:36 INFO mapred.JobClient:     Map output materialized bytes=1474367
34 13/07/29 14:21:36 INFO mapred.JobClient:     Map input records=77931
35 13/07/29 14:21:36 INFO mapred.JobClient:     Reduce shuffle bytes=1207341
36 13/07/29 14:21:36 INFO mapred.JobClient:     Spilled Records=255966
37 13/07/29 14:21:36 INFO mapred.JobClient:     Map output bytes=6076101
38 13/07/29 14:21:36 INFO mapred.JobClient:     Total committed heap usage (bytes)=586285056
39 13/07/29 14:21:36 INFO mapred.JobClient:     CPU time spent (ms)=9580
40 13/07/29 14:21:36 INFO mapred.JobClient:     Combine input records=629172
41 13/07/29 14:21:36 INFO mapred.JobClient:     SPLIT_RAW_BYTES=361
42 13/07/29 14:21:36 INFO mapred.JobClient:     Reduce input records=102324
43 13/07/29 14:21:36 INFO mapred.JobClient:     Reduce input groups=82335
44 13/07/29 14:21:36 INFO mapred.JobClient:     Combine output records=102324
45 13/07/29 14:21:36 INFO mapred.JobClient:     Physical memory (bytes) snapshot=625811456
46 13/07/29 14:21:36 INFO mapred.JobClient:     Reduce output records=82335
47 13/07/29 14:21:36 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1897635840
48 13/07/29 14:21:36 INFO mapred.JobClient:     Map output records=629172
49 hduser@ada-desktop:/usr/local/hadoop$

Listing 3: wordcount executed on /user/hduser/gutenberg

In case the system is not able to detect the jar file, the following error message is received:

1 Exception in thread "main" java.io.IOException: Error opening job jar: hadoop*examples*.jar
      at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
2 Caused by: java.util.zip.ZipException: error in opening zip file

In such cases, use the complete name of the jar file (instead of hadoop*examples*.jar use hadoop-examples-1.0.3.jar) and run the command again.


As mentioned, the output is stored in /user/hduser/gutenberg-output. To check that the files exist, run the command given in line 2 of Listing 1, replacing gutenberg with gutenberg-output in the command. Figure 2 shows the files present on my system.

Figure 2: checking the files produced by wordcount

Figure 3 shows the retrieved output, which can be checked by importing the results back to the local system. Notice -getmerge in line 2 of Listing 4: it merges everything present in the gutenberg-output folder.

1 $ mkdir /tmp/gutenberg-output
2 $ bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output
3 $ head /tmp/gutenberg-output/gutenberg-output

Listing 4: Checking wordcount results after importing results to local system
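In effect, -getmerge concatenates every file in the given HDFS directory (here, the reducer's part-r-* outputs) into one local file. The Python sketch below imitates that behaviour locally, using a temporary directory in place of HDFS and made-up part-file contents for illustration:

```python
import os
import tempfile

def getmerge(src_dir, dst_path):
    # Concatenate all files in src_dir, sorted by name (as HDFS part
    # files are numbered), into the single local file dst_path.
    with open(dst_path, "w") as out:
        for name in sorted(os.listdir(src_dir)):
            with open(os.path.join(src_dir, name)) as part:
                out.write(part.read())

# Demo with two fake reducer outputs in wordcount's "word<TAB>count" format.
src = tempfile.mkdtemp()
with open(os.path.join(src, "part-r-00000"), "w") as f:
    f.write("alpha\t3\n")
with open(os.path.join(src, "part-r-00001"), "w") as f:
    f.write("beta\t5\n")
dst = os.path.join(tempfile.mkdtemp(), "merged.txt")
getmerge(src, dst)
print(open(dst).read())
```

This is only a local analogy for what the hadoop command does; the function name getmerge and the sample file contents are illustrative, not part of any Hadoop API.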

Figure 3: Checking wordcount results

Results can also be retrieved without importing them to the local system; just use the command given in Listing 5.

1 $ bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-r-00000

Listing 5: Reading wordcount results directly from hadoop's file system


3 Value of PI

Hadoop can be used to calculate the value of PI (pi is approximately 3.14159). In this example the value of pi is estimated using a quasi-Monte Carlo method. Pi can be estimated using the command in Listing 6. We pass two values after 'pi': the first value, 'x', is the number of maps, and the second value, 'y', is the number of samples per map. Results of some experiments conducted are given in Table 3.

1 $ bin/hadoop jar hadoop*examples*.jar pi 10 100

Listing 6: command to calculate value of pi
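The estimate works by sampling points in the unit square and counting how many fall inside the quarter circle, whose area is pi/4 of the square. Hadoop's example distributes the x*y samples over the maps and draws them from a quasi-random (low-discrepancy) sequence; the sketch below uses plain pseudo-random sampling instead, which is simpler but illustrates the same estimator:

```python
import random

def estimate_pi(maps, samples_per_map, seed=0):
    # Draw maps * samples_per_map points in the unit square and count
    # those inside the quarter circle x^2 + y^2 <= 1.
    rng = random.Random(seed)
    total = maps * samples_per_map
    inside = sum(
        1
        for _ in range(total)
        if (lambda x, y: x * x + y * y <= 1.0)(rng.random(), rng.random())
    )
    # (points inside) / (total points) estimates pi/4.
    return 4.0 * inside / total

print(estimate_pi(10, 100000))
```

As in Table 3, the estimate tightens as y (samples per map) grows, while increasing x mainly adds map tasks and hence scheduling overhead.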

x     y         Time required (secs)   Value calculated
10    100       60.53                  3.148
10    200       53.53                  3.144
10    400       55.58                  3.14
10    1000000   54.45                  3.1415844
50    100       178.82                 3.1418

Table 3: Time required to calculate value of PI for different x and y

References

[1] Michael G. Noll. Running Hadoop on Ubuntu Linux (single-node cluster).
    http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

[2] Cloud9: A MapReduce library for Hadoop - Getting started in standalone mode.
    http://lintool.github.io/Cloud9/docs/content/start-standalone.html
