USING HDFS ON DISCOVERY CLUSTER – TWO EXAMPLES - test1 and test2
(Using HDFS on Discovery Cluster for Discovery Cluster Users – email [email protected] if you have questions or need more
clarifications. Nilay K. Roy, Ph.D)
To use HDFS on Discovery Cluster login to any of the login nodes as usual. Make sure that your .bashrc has the following
modules loaded. “module whatis hadoop-2.4.1” will give you the prerequisites. After login type “module list” to make sure the
proper modules are loaded as shown below:
Next using the “hadoop-10g” queue get an interactive-node. Every user is restricted to no more than 10 cores. The interactive
nodes are the hadoop file system data nodes. Generally you do not need X-windows forwarding so “bsub -Is -q hadoop-10g -n 1
/bin/bash” will suffice. If you do need X-windows forwarding add the “-XF” tag for “bsub -Is -XF -q hadoop-10g -n 1 /bin/bash”.
This is shown below. “-n” is number of cores requested. If you have a bad ssh connection and want a persistent one use “screen”
available on the two login nodes “discovery2” and “discovery4”. This way you can logout of the cluster while your job runs and
login later without killing anything. Remember to detach from the “screen” session before logging out of the login nodes. You can
reattach to the relevant “screen” after login. Further details are out-of-scope here. Please note that there is no time restriction on
this queue. So when you are done “exit” from the login to the compute node and the LSF interactive session will automatically be
killed. Check it again by using “bjobs -w” and if still running use “bkill <job_ib>”. Similarly exit all “screen” sessions if you are
not doing anything in them.
Using HDFS on Discovery Cluster consists of three steps:
1) Compile hadoop code using Java and get the *.jar file.
2) Create your data directory on the HDFS file system following the “Discovery Cluster” best practice procedure as in
/scratch.
3) Move (stage) the input data to the HDFS file system in your top level data folder.
4) Run your code.
5) Move your output and other data back to your /home/<my_neu_id> or /scratch/<my_neu_id> folder.
6) Exit your interactive session. Check your interactive job is killed on exit using “bjobs –w”.
7) Exit “screen” if using it.
Example – “test1”
1. Compile Hadoop code:
Download the example file
here - http://nuweb12.neu.edu/rc/wp-content/uploads/2014/09/NKR_hadoop_example.tar.gz.
Now upload to your /home/<my_neu_id> or /scratch/<my_neu_id> and extract then compile using “javac” as shown
below. Make a directory for the classes. In this example it is called “wordcount_classes”. Before compiling run “export
CLASSPATH=$(hadoop classpath):$CLASSPATH” where “hadoop” will be in your path if you have loaded the modules
correctly as shown above. You can ignore the “deprecated API” note. In the example java source file you will see “import
org.apache.hadoop.filecache.DistributedCache;”. This is for fully distributed Hadoop implementation with multi-level
replication and fully distributed cache as in this case. In this case the replication and distributed cache is across three data
nodes “compute-2-004, compute-2-005, compute-2-006”.
2. Create your HDFS top-level directory:
The rule followed for the top-level HDFS directory is <myneu_id>. This your “id” that you use to login into the cluster.
You create the HDFS directory as follows:
The command to use is “hadoop fs -mkdir hdfs://discovery3:9000/tmp/nilay.roy” where my “my_neu_id” is “nilay.roy”.
Ignore the native-hadoop stack guard warnings. These will be fixed in later versions. We use a global Hadoop install via
modules not a local per node install. Hence the “WARN util.NativeCodeLoader: Unable to load native-hadoop library for
your platform... using builtin-java classes where applicable” warnings.
[nilay.roy@compute-2-005 ~]$ tar -zxvf NKR_hadoop_example.tar.gz hadoop_test/ hadoop_test/test1/ hadoop_test/test1/WordCount.java hadoop_test/test1/file02 hadoop_test/test1/file01 [nilay.roy@compute-2-005 ~]$ cd hadoop_test/ [nilay.roy@compute-2-005 hadoop_test]$ cd test1 [nilay.roy@compute-2-005 test1]$ ls -la
total 83
drwxr-xr-x 2 nilay.roy GID_nilay.roy 80 Aug 18 15:24 . drwxr-xr-x 3 nilay.roy GID_nilay.roy 23 Aug 15 11:50 .. -rw-r--r-- 1 nilay.roy GID_nilay.roy 24 Aug 15 19:32 file01 -rw-r--r-- 1 nilay.roy GID_nilay.roy 33 Aug 15 19:32 file02
-rw-r--r-- 1 nilay.roy GID_nilay.roy 4544 Aug 15 12:12 WordCount.java [nilay.roy@compute-2-005 test1]$ mkdir wordcount_classes
[nilay.roy@compute-2-005 test1]$ export CLASSPATH=$(hadoop classpath):$CLASSPATH [nilay.roy@compute-2-005 test1]$ javac -d wordcount_classes WordCount.java Note: WordCount.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
[nilay.roy@compute-2-005 test1]$ cd wordcount_classes/org/myorg [nilay.roy@compute-2-005 myorg]$ ls -la
total 109
drwxr-xr-x 2 nilay.roy GID_nilay.roy 156 Aug 18 15:44 . drwxr-xr-x 3 nilay.roy GID_nilay.roy 23 Aug 18 15:44 ..
-rw-r--r-- 1 nilay.roy GID_nilay.roy 2671 Aug 18 15:44 WordCount.class -rw-r--r-- 1 nilay.roy GID_nilay.roy 4661 Aug 18 15:44 WordCount$Map.class -rw-r--r-- 1 nilay.roy GID_nilay.roy 983 Aug 18 15:44 WordCount$Map$Counters.class -rw-r--r-- 1 nilay.roy GID_nilay.roy 1611 Aug 18 15:44 WordCount$Reduce.class [nilay.roy@compute-2-005 myorg]$
You can now list the directory and also view it in the GUI. For GUI go to http://discovery3.neu.edu:50070 and click the
“Utilities” tab on top. From the drop down select “Browse the file system”. From the command line - see below:
3. Move (stage) the input data to the HDFS file system in your top level data folder:
Now we can move the input file into the input directory. Create a local input directory for this file and put the files there
and then copy the whole hadoop_test folder over. This is shown below:
4. Run your code:
Now create the jar file to run as shown below using “jar -cvf wordcount.jar -C wordcount_classes/ .”
[nilay.roy@compute-2-005 test1]$ hadoop fs -mkdir hdfs://discovery3:9000/tmp/nilay.roy
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /shared/apps/hadoop-2.4.1/hadoop-2.4.1/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/08/18 17:04:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[nilay.roy@compute-2-005 test1]$
[nilay.roy@compute-2-005 test1]$ hdfs dfs -ls hdfs://discovery3:9000/tmp/
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /shared/apps/hadoop-2.4.1/hadoop-2.4.1/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/08/18 17:13:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
drwxrwx--- - hadoopuser supergroup 0 2014-08-15 16:34 hdfs://discovery3:9000/tmp/hadoop-yarn drwxr-xr-x - nilay.roy supergroup 0 2014-08-18 17:04 hdfs://discovery3:9000/tmp/nilay.roy [nilay.roy@compute-2-005 test1]$
[nilay.roy@compute-2-005 ~]$ hdfs dfs -put hadoop_test/ hdfs://discovery3:9000/tmp/nilay.roy/.
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /shared/apps/hadoop-2.4.1/hadoop-2.4.1/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/08/18 17:20:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[nilay.roy@compute-2-005 ~]$ hdfs dfs -lsr hdfs://discovery3:9000/tmp/nilay.roy lsr: DEPRECATED: Please use 'ls -R' instead.
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /shared/apps/hadoop-2.4.1/hadoop-2.4.1/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/08/18 17:20:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
drwxr-xr-x - nilay.roy supergroup 0 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test drwxr-xr-x - nilay.roy supergroup 0 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1
-rw-r--r-- 3 nilay.roy supergroup 4544 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/WordCount.java -rw-r--r-- 3 nilay.roy supergroup 24 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/file01
-rw-r--r-- 3 nilay.roy supergroup 33 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/file02 drwxr-xr-x - nilay.roy supergroup 0 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/input -rw-r--r-- 3 nilay.roy supergroup 24 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/input/file01 -rw-r--r-- 3 nilay.roy supergroup 33 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/input/file02 drwxr-xr-x - nilay.roy supergroup 0 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/wordcount_classes drwxr-xr-x - nilay.roy supergroup 0 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/wordcount_classes/org drwxr-xr-x - nilay.roy supergroup 0 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/wordcount_classes/org/myorg -rw-r--r-- 3 nilay.roy supergroup 983 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/wordcount_classes/org/myorg/WordCount$Map$Counters.class -rw-r--r-- 3 nilay.roy supergroup 4661 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/wordcount_classes/org/myorg/WordCount$Map.class -rw-r--r-- 3 nilay.roy supergroup 1611 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/wordcount_classes/org/myorg/WordCount$Reduce.class -rw-r--r-- 3 nilay.roy supergroup 2671 2014-08-18 17:20 hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/wordcount_classes/org/myorg/WordCount.class [nilay.roy@compute-2-005 ~]$
Then run in on the HDFS file system as show below using:
“
hadoop jar /home/nilay.roy/hadoop_test/test1/wordcount.jar org.myorg.WordCount hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/input hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/output”Although the above seems intimidating what we are doing is simply giving the full path to the *.jar to run and the full
paths for input and output in HDFS with the “org.myorg.WordCount” compiled class libraries. When you run this you
will see output as shown below:
5. Move data out from HDFS to your home directory:
Browse your output data using the command shown below and then move it over using the “hdfs dfs -get” command.
This “cat” example is shown below and the “get” is left as an exercise. You can then remove your data with the “hdfs dfs
-rmr” command. This is also left as an exercise for the user.
[nilay.roy@compute-2-005 test1]$ jar -cvf wordcount.jar -C wordcount_classes/ . added manifest
adding: org/(in = 0) (out= 0)(stored 0%) adding: org/myorg/(in = 0) (out= 0)(stored 0%)
adding: org/myorg/WordCount$Map.class(in = 4661) (out= 2217)(deflated 52%) adding: org/myorg/WordCount.class(in = 2671) (out= 1289)(deflated 51%)
adding: org/myorg/WordCount$Map$Counters.class(in = 983) (out= 504)(deflated 48%) adding: org/myorg/WordCount$Reduce.class(in = 1611) (out= 648)(deflated 59%) [nilay.roy@compute-2-005 test1]$ ls -la
total 114
drwxr-xr-x 4 nilay.roy GID_nilay.roy 169 Aug 18 17:27 . drwxr-xr-x 3 nilay.roy GID_nilay.roy 23 Aug 15 11:50 .. -rw-r--r-- 1 nilay.roy GID_nilay.roy 24 Aug 15 19:32 file01 -rw-r--r-- 1 nilay.roy GID_nilay.roy 33 Aug 15 19:32 file02 drwxr-xr-x 2 nilay.roy GID_nilay.roy 48 Aug 18 17:18 input
drwxr-xr-x 3 nilay.roy GID_nilay.roy 21 Aug 18 15:44 wordcount_classes -rw-r--r-- 1 nilay.roy GID_nilay.roy 5799 Aug 18 17:27 wordcount.jar -rw-r--r-- 1 nilay.roy GID_nilay.roy 4544 Aug 15 12:12 WordCount.java [nilay.roy@compute-2-005 test1]$
[nilay.roy@compute-2-005 test1]$ hadoop jar /home/nilay.roy/hadoop_test/test1/wordcount.jar org.myorg.WordCount hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/input hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/output
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /shared/apps/hadoop-2.4.1/hadoop-2.4.1/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/08/18 17:35:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/08/18 17:35:44 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id 14/08/18 17:35:44 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/08/18 17:35:44 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized 14/08/18 17:35:45 INFO mapred.FileInputFormat: Total input paths to process : 2
14/08/18 17:35:45 INFO mapreduce.JobSubmitter: number of splits:2
14/08/18 17:35:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1163063881_0001
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::LOT OF OUTPUT OMITTED FOR CONCISENESS:::::::::::::::::::::::::::::::::::::::::::::::::::::::::: FILE: Number of write operations=0
HDFS: Number of bytes read=147 HDFS: Number of bytes written=67 HDFS: Number of read operations=25 Reduce output records=8
File Input Format Counters Bytes Read=57 File Output Format Counters
Bytes Written=67 org.myorg.WordCount$Map$Counters
INPUT_WORDS=9 [nilay.roy@compute-2-005 test1]$
6. Exit your interactive session:
Exit your interactive session. Check your interactive job is killed on exit using “bjobs -w”.
The entire online Hadoop User Guide is
here:
http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
File System Shell GUIDE is
here:
http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html
If you are not sure what you are doing, have questions, encounter problems, and require help with Hadoop, JAVA, PYTHON or
any other issues contact ITS-Research Computing at [email protected]. Alternatively email me at [email protected].
Please be mindful that this 50TB Hadoop cluster is a shared resource and must not be used for data storage. Once you are done
with your work please clean up everything under your top - level HDFS directory that must be:
hdfs://discovery.neu.edu:9000/tmp/<my_neu_id>
- where <my_neu_id> is your Discovery Cluster Login ID and the same ID that you use to
login to “myneu”
DO NOT USE any other paths here.
[nilay.roy@compute-2-005 test1]$ hdfs dfs -cat hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/output/part-00000
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /shared/apps/hadoop-2.4.1/hadoop-2.4.1/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/08/18 17:54:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Bye 1 Goodbye 1 Hadoop, 1 Hello 2 World! 1 World, 1 hadoop. 1 to 1 [nilay.roy@compute-2-005 test1]$
Another Example – “test2”
Download the example file here - http://nuweb12.neu.edu/rc/wp-content/uploads/2014/09/NKR_hadoop_example.tar.gz.
Here is a more advanced example of secondary sort that shows of the power of Hadoop Map-Reduce. Here we sort the values
coming into the Reducer of a Hadoop Map/Reduce (MR) Job.
We extend the Java API for our use and do the following:
1.Use a composite key.
2.Extend org.apache.hadoop.mapreduce.Partitioner.
3.Extend org.apache.hadoop.io.WritableComparator.
The main part of the code that one should focus on is the use of the M/R API “org.apache.hadoop.mapreduce.*”.
The problem
Imagine we have stock data that looks like the following. Each line represents the value of a stock at a particular time.
Each value in a line is delimited by a comma. The first value is the stock symbol (i.e. GOOG), the second value is the
timestamp (i.e. the number of milliseconds since January 1, 1970, 00:00:00 GMT), and the third value is the stock’s
price. The data below is a toy data set. As you can see, there are 3 stock symbols: a, b, and c. The timestamps are also
simple: 1, 2, 3, 4. The values are fake as well: 1.0, 2.0, 3.0, and 4.0.
a, 1, 1.0
b, 1, 1.0
c, 1, 1.0
a, 2, 2.0
b, 2, 2.0
c, 2, 2.0
a, 3, 3.0
b, 3, 3.0
c, 3, 3.0
a, 4, 4.0
b, 4, 4.0
c, 4, 4.0
Let’s say we want for each stock symbol (the reducer key input, or alternatively, the mapper key output), to order the
values descendingly by timestamp when they come into the reducer. How do we sort the timestamp descendingly? This
problem is known as secondary sorting. Hadoop’s M/R platform sorts the keys, but not the values. (Note, Google’s
M/R platform explicitly supports secondary sorting, see Lin and Dyer 2010).
A solution for secondary sorting
Use a composite key
A solution for secondary sorting involves doing multiple things. First, instead of simply emitting the stock symbol as
the key from the mapper, we need to emit a composite key, a key that has multiple parts. The key will have the stock
symbol and timestamp. The process for a M/R Job is as follows.
•
(K1,V1) –> Map –> (K2,V2)
•
(K2,List[V2]) –> Reduce –> (K3,V3)
In the toy data above, K1 will be of type LongWritable, and V1 will be of type Text. Without secondary sorting, K2
will be of type Text and V2 will be of type DoubleWritable (we simply emit the stock symbol and price from the
mapper to the reducer). So, K2=symbol, and V2=price, or (K2,V2) = (symbol,price). However, if we emit such an
intermediary key-value pair, secondary sorting is not possible. We have to emit a composite key,
K2={symbol,timestamp}. So the intermediary key-value pair is (K2,V2) = ({symbol,timestamp},price). Note that
composite data structures, such as the composite key, is held within the curly braces. Our reducer simply outputs a K3
of type Text and V3 of type Text; (K3,V3) = (symbol, price). The complete M/R job with the new composite key is
shown below.
•
(LongWritable,Text) –> Map –> ({symbol,timestamp},price)
•
({symbol,timestamp},List[price]) –> Reduce –> (symbol,price)
K2 is a composite key, but inside it, the symbol part/component is referred to as the “natural” key. It is the key which
values will be grouped by.
Use a composite key comparator
The composite key comparator is where the secondary sorting takes place. It compares composite key by symbol
ascendingly and timestamp descendingly. It is shown below. Notice here we sort based on symbol and timestamp. All
the components of the composite key are considered.
public
class
CompositeKeyComparator extends
WritableComparator {
protected
CompositeKeyComparator() {
super(StockKey.class, true);
}
@SuppressWarnings("rawtypes")
@Override
public
int
compare(WritableComparable w1, WritableComparable w2) {
StockKey k1 = (StockKey)w1;
StockKey k2 = (StockKey)w2;
int
result = k1.getSymbol().compareTo(k2.getSymbol());
if(0
== result) {
result = -1* k1.getTimestamp().compareTo(k2.getTimestamp());
}
return
result;
}
}
Use a natural key grouping comparator
The natural key group comparator “groups” values together according to the natural key. Without this component, each
K2={symbol,timestamp} and its associated V2=price may go to different reducers. Notice here, we only consider the
“natural” key.
public
class
NaturalKeyGroupingComparator extends
WritableComparator {
protected
NaturalKeyGroupingComparator() {
super(StockKey.class, true);
}
@SuppressWarnings("rawtypes")
@Override
public
int
compare(WritableComparable w1, WritableComparable w2) {
StockKey k1 = (StockKey)w1;
StockKey k2 = (StockKey)w2;
return
k1.getSymbol().compareTo(k2.getSymbol());
}
}
Use a natural key partitioner
The natural key partitioner uses the natural key to partition the data to the reducer(s). Again, note that here, we only
consider the “natural” key.
public
class
NaturalKeyPartitioner extends
Partitioner<StockKey, DoubleWritable> {
@Override
public
int
getPartition(StockKey key, DoubleWritable val, int
numPartitions) {
int
hash = key.getSymbol().hashCode();
int
partition = hash % numPartitions;
return
partition;
}
}
The M/R Job
Once we define the Mapper, Reducer, natural key grouping comparator, natural key partitioner, composite key
comparator, and composite key, in Hadoop’s M/R API, we may configure the Job as follows.
public
class
SsJob extends
Configured implements
Tool {
public
static
void
main(String[] args) throws
Exception {
ToolRunner.run(new
Configuration(), new
SsJob(), args);
}
@Override
public
int
run(String[] args) throws
Exception {
Configuration conf = getConf();
Job job = new
Job(conf, "secondary sort");
job.setJarByClass(SsJob.class);
job.setPartitionerClass(NaturalKeyPartitioner.class);
job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
job.setSortComparatorClass(CompositeKeyComparator.class);
job.setMapOutputKeyClass(StockKey.class);
job.setMapOutputValueClass(DoubleWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapperClass(SsMapper.class);
job.setReducerClass(SsReducer.class);
job.waitForCompletion(true);
return
0;
}
}
Comments
You need at least 4 new classes. The composite key class needs to hold the natural key and other data that you will sort
on. The composite key comparator will perform the sorting of the keys (and thus values). The natural key grouping
comparator will group values based on the natural key. The natural key partitioner will send values with the same
natural key to the same reducer.
NOTES:
After the job is executed the result is shown below:
The job was run using the command:
“hadoop jar demo_classes.jar demo.SsJob Dmapred.input.dir=hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test2/data
-Dmapred.output.dir=hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test2/result”
The jar file was created as shown below:
The code was compiled using the command:
“javac -d demo_classes CompositeKeyComparator.java NaturalKeyGroupingComparator.java NaturalKeyPartitioner.java
SsJob.java SsMapper.java SsReducer.java StockKey.java”
============================================================================================
============================================================================================
Nilay K Roy, Ph.D
========================================================== Nilay Roy, PhD Computational Physics, MS Computer Science
Assistant Director - Research Computing, Information Technology Services Northeastern University, 221-177, 360 Huntington Avenue, Boston, MA 02115 Email: [email protected] (C) 508.226.2261 (Preferred) / (O) 617.373.6048 Northeastern Research Computing Website: http://www.northeastern.edu/rc ==========================================================