To run this second Java example, you copy its file to the WordCount.java file so that the file name matches the Java class.
[hadoop@hc1nn wordcount]$ cp wc-ex2.java WordCount.java
Then, you remove the wc_classes directory and re-create it to receive the Java build output. Use the Linux rm command with the r (“recursive”) and f (“force”) switches to remove it, and the Linux mkdir command to re-create the directory:
[hadoop@hc1nn wordcount]$ rm -rf wc_classes
[hadoop@hc1nn wordcount]$ mkdir wc_classes
You compile the WordCount.java file, specifying wc_classes as the output directory:
[hadoop@hc1nn wordcount]$ javac -classpath $HADOOP_PREFIX/hadoop-core-1.2.1.jar -d wc_classes WordCount.java
Then, you list the contents of the wc_classes directory recursively to ensure that the org.myorg directory structure exists and contains the newly compiled classes:
[hadoop@hc1nn wordcount]$ ls -R wc_classes
wc_classes:
org
wc_classes/org:
myorg
wc_classes/org/myorg:
WordCount.class WordCount$Map.class WordCount$Map$Counters.class WordCount$Reduce.class
You build these classes into a jar library called wordcount1.jar, so that the resulting jar file can be used for a Hadoop Map Reduce job run. Use the jar command for this (it ships with the JDK and operates in a similar manner to tar) with the options c for “create,” v for “verbose,” and f to specify the file to create. The -C option changes to the wc_classes directory before the files are added:
[hadoop@hc1nn wordcount]$ jar -cvf ./wordcount1.jar -C wc_classes .
added manifest
adding: org/(in = 0) (out= 0)(stored 0%)
adding: org/myorg/(in = 0) (out= 0)(stored 0%)
adding: org/myorg/WordCount.class(in = 2671) (out= 1289)(deflated 51%)
adding: org/myorg/WordCount$Reduce.class(in = 1611) (out= 648)(deflated 59%)
adding: org/myorg/WordCount$Map$Counters.class(in = 983) (out= 504)(deflated 48%)
adding: org/myorg/WordCount$Map.class(in = 4661) (out= 2217)(deflated 52%)
[hadoop@hc1nn wordcount]$ ls -l *.jar
-rw-rw-r--. 1 hadoop hadoop 5799 Jun 21 17:19 wordcount1.jar
The test data from the first example is still available on HDFS under the directory /user/hadoop/edgar; this is shown by using the Hadoop file system ls command:
[hadoop@hc1nn wordcount]$ hadoop dfs -ls /user/hadoop/edgar
Found 5 items
-rw-r--r-- 1 hadoop supergroup 410012 2014-06-19 11:59 /user/hadoop/edgar/10031.txt
-rw-r--r-- 1 hadoop supergroup 559352 2014-06-19 11:59 /user/hadoop/edgar/15143.txt
-rw-r--r-- 1 hadoop supergroup 66401 2014-06-19 11:59 /user/hadoop/edgar/17192.txt
-rw-r--r-- 1 hadoop supergroup 596736 2014-06-19 11:59 /user/hadoop/edgar/2149.txt
-rw-r--r-- 1 hadoop supergroup 63278 2014-06-19 11:59 /user/hadoop/edgar/932.txt
To give this example a thorough test, I also created a patterns file called patterns.txt that contains a series of unwanted characters. Its contents are shown here by using the Linux cat command. Note that some characters are preceded by an escape character (\) to avoid processing errors for characters that Java might consider to have special meaning. By using an escape character, you ensure that these patterns are treated as literal text:
[hadoop@hc1nn wordcount]$ cat patterns.txt
!
"
'
_
;
\(
\)
\#
\$
\&
\.
\,
\*
\-
\/
\{
\}
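The need for the escape characters can be illustrated with a small, self-contained Java sketch. It assumes that the job applies each line of the patterns file as a Java regular expression through String.replaceAll and lowercases the text when case sensitivity is switched off; the class name SkipPatternSketch and its helper methods are hypothetical and are not part of the WordCount source:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class SkipPatternSketch {

    // Hypothetical helper: optionally lowercases the line, then removes every
    // match of each skip pattern, treating the pattern as a Java regex.
    static String clean(String line, List<String> patterns, boolean caseSensitive) {
        String result = caseSensitive ? line : line.toLowerCase();
        for (String pattern : patterns) {
            result = result.replaceAll(pattern, "");
        }
        return result;
    }

    // Returns true when Java accepts the text as a regular expression.
    static boolean isValidRegex(String text) {
        try {
            Pattern.compile(text);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A subset of the patterns file; "\\(" in Java source is the text \(
        List<String> patterns =
            Arrays.asList("!", "\"", "\\(", "\\)", "\\.", "\\,", "\\*");

        // prints: hello world twice
        System.out.println(clean("Hello, (World)! *Twice*.", patterns, false));

        // An unescaped ( is an unclosed group, so it is rejected: prints false
        System.out.println(isValidRegex("("));

        // The escaped form is a literal parenthesis: prints true
        System.out.println(isValidRegex("\\("));
    }
}
```

An unescaped period would be accepted as a regular expression but would match any character, so under the same assumption it would strip the whole line rather than just the full stops.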
Copy the patterns.txt file to HDFS, into the directory /user/hadoop/java, by using the Hadoop file system copyFromLocal command. Then, using the Hadoop file system ls command, list the patterns.txt file that is now on HDFS:
[hadoop@hc1nn wordcount]$ hadoop dfs -copyFromLocal ./patterns.txt /user/hadoop/java/patterns.txt
[hadoop@hc1nn wordcount]$ hadoop dfs -ls /user/hadoop/java
Found 1 items
-rw-r--r-- 1 hadoop supergroup 46 2014-06-21 17:29 /user/hadoop/java/patterns.txt
Now you are ready to run this extended version of the Java Map Reduce task. The library that was just created is specified via the Hadoop jar option. This is followed by the name of the class to be called within that library. Next, a flag is set via the -D option to switch the case sensitivity off. After that, the input data file and output directory names on HDFS are listed. Finally, you use the -skip option to specify a skip file that removes any unwanted characters from the data processed:
[hadoop@hc1nn wordcount]$ hadoop jar ./wordcount1.jar org.myorg.WordCount -Dwordcount.case.sensitive=false /user/hadoop/edgar/10031.txt /user/hadoop/edgar-results -skip /user/hadoop/java/patterns.txt
The command produces the following Map Reduce task output:
14/06/21 17:40:06 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/06/21 17:40:06 INFO mapred.FileInputFormat: Total input paths to process : 1
14/06/21 17:40:07 INFO mapred.JobClient: Running job: job_201406211041_0004
14/06/21 17:40:08 INFO mapred.JobClient: map 0% reduce 0%
14/06/21 17:40:15 INFO mapred.JobClient: map 50% reduce 0%
14/06/21 17:40:23 INFO mapred.JobClient: map 100% reduce 16%
14/06/21 17:40:30 INFO mapred.JobClient: map 100% reduce 100%
14/06/21 17:40:31 INFO mapred.JobClient: Job complete: job_201406211041_0004
14/06/21 17:40:31 INFO mapred.JobClient: Counters: 32
14/06/21 17:40:31 INFO mapred.JobClient: Job Counters
14/06/21 17:40:31 INFO mapred.JobClient: Launched reduce tasks=1
14/06/21 17:40:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=17198
14/06/21 17:40:31 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
...
14/06/21 17:40:31 INFO mapred.JobClient: CPU time spent (ms)=5880
14/06/21 17:40:31 INFO mapred.JobClient: Map input bytes=410012
14/06/21 17:40:31 INFO mapred.JobClient: SPLIT_RAW_BYTES=198
14/06/21 17:40:31 INFO mapred.JobClient: Combine input records=63590
14/06/21 17:40:31 INFO mapred.JobClient: Reduce input records=12581
14/06/21 17:40:31 INFO mapred.JobClient: Reduce input groups=9941
14/06/21 17:40:31 INFO mapred.JobClient: Combine output records=12581
14/06/21 17:40:31 INFO mapred.JobClient: Physical memory (bytes) snapshot=404115456
14/06/21 17:40:31 INFO mapred.JobClient: Reduce output records=9941
14/06/21 17:40:31 INFO mapred.JobClient: Virtual memory (bytes) snapshot=4109373440
14/06/21 17:40:31 INFO mapred.JobClient: Map output records=63590
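The counters above show 63590 map output records being combined down to 12581 records, which the single reduce task then merges into 9941 groups. One way to read those numbers, offered here as an interpretation rather than something the log states, is that the combiner aggregates separate portions of the map output locally, so a word that occurs in more than one portion still reaches the reducer more than once. The following plain-Java sketch imitates that flow with two tiny made-up word lists; the class CombinerCounterSketch and its methods are hypothetical and do not use the Hadoop API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerCounterSketch {

    // Stands in for a combiner pass: sums the counts within one portion
    // of the map output.
    static Map<String, Integer> combine(String[] words) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String word : words) {
            partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    // Stands in for the reduce phase: merges the partial counts from
    // every portion into one total per word.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> entry : partial.entrySet()) {
                totals.merge(entry.getKey(), entry.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up sample data, not taken from the test file on HDFS
        String[] portion1 = {"zeal", "zone", "zeal", "zoar"};
        String[] portion2 = {"zeal", "zoilus", "zoilus"};

        List<Map<String, Integer>> partials = new ArrayList<>();
        partials.add(combine(portion1));
        partials.add(combine(portion2));

        int combineInput = portion1.length + portion2.length;
        int combineOutput = 0;
        for (Map<String, Integer> partial : partials) {
            combineOutput += partial.size();
        }
        Map<String, Integer> totals = reduce(partials);

        System.out.println("Combine input records=" + combineInput);   // 7
        System.out.println("Combine output records=" + combineOutput); // 5
        System.out.println("Reduce input groups=" + totals.size());    // 4
    }
}
```

In this sketch the word zeal appears in both portions, so it accounts for two combine output records but only one reduce input group, which mirrors the gap between 12581 and 9941 in the job counters.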
Check the results directory on HDFS by using the Hadoop file system ls command. The existence of a _SUCCESS file shows that the job was a success:
[hadoop@hc1nn wordcount]$ hadoop dfs -ls /user/hadoop/edgar-results
Found 3 items
-rw-r--r-- 1 hadoop supergroup 0 2014-06-21 17:40 /user/hadoop/edgar-results/_SUCCESS
drwxr-xr-x - hadoop supergroup 0 2014-06-21 17:40 /user/hadoop/edgar-results/_logs
-rw-r--r-- 1 hadoop supergroup 103300 2014-06-21 17:40 /user/hadoop/edgar-results/part-00000
Checking the last 10 lines of the results part file using the Hadoop file system cat command and the Linux tail command gives a sorted word count with any unwanted characters removed:
[hadoop@hc1nn wordcount]$ hadoop dfs -cat /user/hadoop/edgar-results/part-00000 | tail -10
zanthe 1
zeal 2
zeboin 1
zelo 1
zephyr 1
zimmermann 1
zipped 1
zoar 1
zoilus 3
zone 1
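The shape of this output, lowercased words in sorted order with one count each, can be reproduced locally with a few lines of plain Java. This is only a sketch of the end result, assuming whitespace tokenization and case-insensitive counting; it does not run on Hadoop, and the class name LocalWordCountSketch is hypothetical:

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LocalWordCountSketch {

    // Counts whitespace-separated words, ignoring case. A TreeMap keeps
    // the keys in sorted order, similar to the sorted part file.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer tokens = new StringTokenizer(text.toLowerCase());
        while (tokens.hasMoreTokens()) {
            counts.merge(tokens.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up input chosen to echo a few words from the results above
        Map<String, Integer> counts = count("Zone zeal zoilus Zeal zoilus zoilus");
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
        // prints:
        // zeal 2
        // zoilus 3
        // zone 1
    }
}
```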