To compile this code, I used the Java compiler javac, which was installed with the JDK as part of the Java 1.6 installation. The compiler expects the file name to match the name of the public class it contains, so I renamed the example code as WordCount.java.

The classes on which this example relies are found in the Hadoop core library that ships with the Hadoop release, so I specified that library on the classpath when compiling the code. I also placed the compiled output into a subdirectory called wc_classes, which can be used when building an example jar file.

[hadoop@hc1nn wordcount]$ cp wc-ex1.java WordCount.java

[hadoop@hc1nn wordcount]$ mkdir wc_classes

[hadoop@hc1nn wordcount]$ javac -classpath $HADOOP_PREFIX/hadoop-core-1.2.1.jar -d wc_classes WordCount.java

The following recursive listing shows all of the subdirectories and classes from the build of the first example code:

[hadoop@hc1nn wordcount]$ ls -R wc_classes

wc_classes:

org

wc_classes/org:

myorg

wc_classes/org/myorg:

WordCount.class WordCount$Map.class WordCount$Reduce.class
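The listing reflects two Java conventions: the package name org.myorg becomes the directory path org/myorg, and each nested class is compiled into its own file named Outer$Inner.class. The following is a minimal stand-alone sketch of that mapping, not the WordCount source itself; ClassFileNames, its empty nested classes Map and Reduce, and the helper classFilePath are hypothetical names used only for illustration.

```java
// Illustrative sketch only: the nested classes are empty stand-ins,
// not the Map and Reduce classes of the WordCount example.
class ClassFileNames {
    static class Map {}
    static class Reduce {}

    // Returns the relative path of the .class file that javac -d writes for c.
    static String classFilePath(Class<?> c) {
        // getName() yields the binary name, e.g. "pkg.Outer$Inner";
        // dots in the package part become directory separators.
        return c.getName().replace('.', '/') + ".class";
    }

    public static void main(String[] args) {
        System.out.println(classFilePath(String.class));
        System.out.println(classFilePath(ClassFileNames.class));
        System.out.println(classFilePath(Map.class));
        System.out.println(classFilePath(Reduce.class));
    }
}
```

For a packaged class such as java.lang.String the helper returns java/lang/String.class, which is the same mapping that places WordCount$Map.class under wc_classes/org/myorg.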

Building the code into a jar library using the jar command creates the wordcount1.jar file:

[hadoop@hc1nn wordcount]$ jar -cvf ./wordcount1.jar -C wc_classes .

added manifest

adding: org/(in = 0) (out= 0)(stored 0%)

adding: org/myorg/(in = 0) (out= 0)(stored 0%)

adding: org/myorg/WordCount.class(in = 1546) (out= 750)(deflated 51%)

adding: org/myorg/WordCount$Reduce.class(in = 1611) (out= 648)(deflated 59%)

adding: org/myorg/WordCount$Map.class(in = 1938) (out= 798)(deflated 58%)

[hadoop@hc1nn wordcount]$ ls -l *.jar

-rw-rw-r--. 1 hadoop hadoop 3169 Jun 15 15:05 wordcount1.jar
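A jar file is an archive holding a manifest plus one entry per packaged file, which is what the "added manifest" and "adding:" lines report. As a rough illustration, the sketch below writes and lists such an archive with the standard java.util.jar classes. It is not a replacement for the jar command; the class JarSketch, its method names, and the placeholder entry contents are assumptions made for the example.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

// Illustrative sketch: JarSketch and its methods are hypothetical names.
class JarSketch {

    // Writes a jar containing a manifest plus one entry per given name.
    static void writeJar(Path jar, List<String> entryNames) {
        Manifest manifest = new Manifest();
        // Set a version so that the main manifest attributes are written out.
        manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        try (OutputStream out = Files.newOutputStream(jar);
             JarOutputStream jos = new JarOutputStream(out, manifest)) {
            for (String name : entryNames) {
                jos.putNextEntry(new JarEntry(name));
                jos.write(0); // placeholder byte, not real bytecode
                jos.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Returns the names of the entries stored in the jar.
    static List<String> listJar(Path jar) {
        List<String> names = new ArrayList<>();
        try (JarFile jf = new JarFile(jar.toFile())) {
            for (JarEntry entry : Collections.list(jf.entries())) {
                names.add(entry.getName());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return names;
    }

    // Builds a throwaway jar in the temp directory and lists it back.
    static List<String> roundTrip(List<String> entryNames) {
        try {
            Path jar = Files.createTempFile("wordcount-sketch", ".jar");
            try {
                writeJar(jar, entryNames);
                return listJar(jar);
            } finally {
                Files.deleteIfExists(jar);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        for (String name : roundTrip(Arrays.asList(
                "org/myorg/WordCount.class", "org/myorg/WordCount$Map.class"))) {
            System.out.println(name);
        }
    }
}
```

The listing returned by roundTrip includes META-INF/MANIFEST.MF alongside the class entries, mirroring the manifest that the jar command adds automatically.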

This file can now be used to run a word-count task on Hadoop. As in previous Map Reduce runs, the input and output data for the job will be taken from HDFS. To provide the words to count, I copied some data from Edgar Allan Poe books into a directory on HDFS from the Linux file system. The Linux ls command shows the text files that will be used:

[hadoop@hc1nn wordcount]$ ls $HOME/edgar

10031.txt 15143.txt 17192.txt 2149.txt 932.txt

Copying these files to the HDFS directory called /user/hadoop/edgar, using the Hadoop file system copyFromLocal command, sets up the data for the word-count job:

[hadoop@hc1nn wordcount]$ hadoop dfs -copyFromLocal $HOME/edgar/* /user/hadoop/edgar

[hadoop@hc1nn wordcount]$ hadoop dfs -ls /user/hadoop/edgar

Found 5 items

-rw-r--r-- 1 hadoop supergroup 410012 2014-06-15 15:53 /user/hadoop/edgar/10031.txt

-rw-r--r-- 1 hadoop supergroup 559352 2014-06-15 15:53 /user/hadoop/edgar/15143.txt

-rw-r--r-- 1 hadoop supergroup 66401 2014-06-15 15:53 /user/hadoop/edgar/17192.txt

-rw-r--r-- 1 hadoop supergroup 596736 2014-06-15 15:53 /user/hadoop/edgar/2149.txt

-rw-r--r-- 1 hadoop supergroup 63278 2014-06-15 15:53 /user/hadoop/edgar/932.txt

By running the word-count example against the data in the input directory (/user/hadoop/edgar), you create the results data in the output directory (/user/hadoop/edgar-results). First, though, use jps to make sure that all of the Hadoop processes are up before you run the job.

[hadoop@hc1nn wordcount]$ jps

1959 SecondaryNameNode

1839 DataNode

4166 TaskTracker

4272 Jps

1720 NameNode

4044 JobTracker

This shows that the HDFS NameNode and DataNode processes are running on hc1nn, as are the Map Reduce JobTracker and TaskTracker processes. If you rerun this job, you must first delete the results directory on HDFS, because Hadoop will not overwrite an existing output directory. Use the Hadoop file system rmr command:

[hadoop@hc1nn wordcount]$ hadoop dfs -rmr /user/hadoop/edgar-results

You can run the job via the Hadoop jar command. The parameters passed to it are the library file you have just created, the name of the class to run in that library, the input directory on HDFS, and the output directory:

[hadoop@hc1nn wordcount]$ hadoop jar ./wordcount1.jar org.myorg.WordCount /user/hadoop/edgar /user/hadoop/edgar-results

14/06/15 16:04:50 INFO util.NativeCodeLoader: Loaded the native-hadoop library

14/06/15 16:04:50 INFO mapred.FileInputFormat: Total input paths to process : 5

14/06/15 16:04:51 INFO mapred.JobClient: Running job: job_201406151602_0001

14/06/15 16:04:52 INFO mapred.JobClient: map 0% reduce 0%

14/06/15 16:05:02 INFO mapred.JobClient: map 20% reduce 0%

14/06/15 16:05:03 INFO mapred.JobClient: map 40% reduce 0%

14/06/15 16:05:04 INFO mapred.JobClient: map 60% reduce 0%

...

14/06/15 16:05:19 INFO mapred.JobClient: Combine input records=284829

14/06/15 16:05:19 INFO mapred.JobClient: Reduce input records=55496

14/06/15 16:05:19 INFO mapred.JobClient: Reduce input groups=36348

14/06/15 16:05:19 INFO mapred.JobClient: Combine output records=55496

14/06/15 16:05:19 INFO mapred.JobClient: Physical memory (bytes) snapshot=912035840

14/06/15 16:05:19 INFO mapred.JobClient: Reduce output records=36348

14/06/15 16:05:19 INFO mapred.JobClient: Virtual memory (bytes) snapshot=7949012992

14/06/15 16:05:19 INFO mapred.JobClient: Map output records=284829
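The record counters in this run fit together: every record emitted by the mappers is fed to the combiner (Map output records = Combine input records = 284829), the combiner's output is what the reducer receives here (Combine output records = Reduce input records = 55496), and the reducer writes one record per distinct word (Reduce input groups = Reduce output records = 36348). The sketch below imitates that flow in plain Java on two made-up input splits. It does not use the Hadoop API, and the class CombinerCounters, its fields, and its methods are hypothetical names chosen for the illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Illustrative simulation only; this is not Hadoop code.
class CombinerCounters {
    long mapOutputRecords;      // one per (word, 1) pair emitted by the map step
    long combineOutputRecords;  // one per distinct word within each split
    long reduceInputGroups;     // one per distinct word overall

    // Map step emits (word, 1) per token; combine step sums them per split.
    Map<String, Integer> mapAndCombine(String split) {
        Map<String, Integer> local = new TreeMap<>();
        StringTokenizer tokens = new StringTokenizer(split);
        while (tokens.hasMoreTokens()) {
            local.merge(tokens.nextToken(), 1, Integer::sum);
            mapOutputRecords++;
        }
        combineOutputRecords += local.size();
        return local;
    }

    // Reduce step: merge the combined partial counts into final totals.
    Map<String, Integer> run(List<String> splits) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String split : splits) {
            for (Map.Entry<String, Integer> e : mapAndCombine(split).entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        reduceInputGroups = totals.size();
        return totals;
    }

    public static void main(String[] args) {
        CombinerCounters c = new CombinerCounters();
        Map<String, Integer> totals =
            c.run(Arrays.asList("the raven the bells", "the bells bells"));
        System.out.println(totals);
        System.out.println("Map output records=" + c.mapOutputRecords);
        System.out.println("Combine output records=" + c.combineOutputRecords);
        System.out.println("Reduce input groups=" + c.reduceInputGroups);
    }
}
```

With these two splits the map step emits 7 records, the combiner shrinks them to 5, and the reduce step sees 3 distinct words, which is the same narrowing pattern as 284829, 55496, and 36348 in the job output.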

The job has completed (the output shown above has been trimmed), so you can check the output on HDFS under /user/hadoop/edgar-results/ by using the Hadoop file system ls command:

[hadoop@hc1nn wordcount]$ hadoop dfs -ls /user/hadoop/edgar-results/

Found 3 items

-rw-r--r-- 1 hadoop supergroup 0 2014-06-15 16:05 /user/hadoop/edgar-results/_SUCCESS

drwxr-xr-x - hadoop supergroup 0 2014-06-15 16:04 /user/hadoop/edgar-results/_logs

-rw-r--r-- 1 hadoop supergroup 396500 2014-06-15 16:05 /user/hadoop/edgar-results/part-00000

These results show a _SUCCESS file, so the job was completed without error. As in previous examples, you use the Hadoop file system cat command to dump the contents of the results file and the Linux head command to limit the job results to the first 10 rows:

[hadoop@hc1nn wordcount]$ hadoop dfs -cat /user/hadoop/edgar-results/part-00000 | head -10

!) 1

"''T 1

"'And 1

"'As 1

"'Be 2

"'But--still--monsieur----' 1

"'Catherine, 1

"'Comb 1

"'Come 1

"'Eyes,' 1

Well done! You have just compiled and run your own native Map Reduce job from a source file. To create more, you can simply change the algorithm in Java (or write your own) and follow the same process. One change that might be useful is to ignore white space and symbol characters when counting the words, because the example’s output still contains words with symbols such as “ or - attached. The next example adds these refinements.
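The stray quotation marks and dashes in the output are consistent with a mapper that splits each line on white space only, so punctuation stays attached to the word. Assuming that is the cause, the sketch below contrasts white-space splitting with one possible cleanup that keeps only letters and digits. It is not the code of the next example; the class TokenCleanup and its methods are hypothetical, and stripping every symbol is just one of several reasonable normalization rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Illustrative sketch: TokenCleanup is a hypothetical helper, not Hadoop code.
class TokenCleanup {

    // Splits on white space only, so punctuation stays attached to each token.
    static List<String> rawTokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            out.add(tokens.nextToken());
        }
        return out;
    }

    // Removes every character that is not a letter or digit and
    // drops tokens that end up empty (for example "--").
    static List<String> cleanTokens(String line) {
        List<String> out = new ArrayList<>();
        for (String token : rawTokens(line)) {
            String cleaned = token.replaceAll("[^\\p{L}\\p{Nd}]", "");
            if (!cleaned.isEmpty()) {
                out.add(cleaned);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "\"'And so -- 'Eyes,' said";
        System.out.println(rawTokens(line));
        System.out.println(cleanTokens(line));
    }
}
```

A rule this aggressive also merges hyphenated words and drops apostrophes inside contractions, so a real job might prefer a narrower pattern.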