XML Configuration File - What is Map‐Reduce

cnt++;

avg += value.get();

}

avg = avg/cnt;

Then it writes both the NPI number and the average value to a BasicBSONObject.

BasicBSONObject output = new BasicBSONObject();

output.put("Provider NPI Number", pKey.get());

output.put("Average Claim Amount", avg);

Finally, this BSON object is written to pContext, and its contents are written back into MongoDB after the reducer runs.

pContext.write(null, new BSONWritable(output));

The first parameter above specifies the document's _id field. If it is set to null, MongoDB automatically generates a unique _id; to specify an _id yourself, pass the desired id in place of null. The second parameter tells the writer which

6.1.3 XML Configuration File

All of the options for running the map‐reduce jobs are stored in an XML file named test.xml, which can be found in Appendix D. The TestXMLConfig.java file in Appendix C tells the map‐reduce job where to find the configuration file with the code:

Configuration.addDefaultResource("test.xml");
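To make the layout of such a file concrete, the following is a minimal sketch that reads name/value pairs from a Hadoop-style configuration document using only the JDK's XML parser. It is not how Hadoop itself loads test.xml; the class name ConfigReader and the method readProperties are hypothetical names chosen for this illustration, and the sketch assumes every property element contains both a name and a value.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; Hadoop's Configuration class
// does its own parsing when addDefaultResource is called.
class ConfigReader {

    // Returns the <name>/<value> pairs of every <property> element, in file order.
    // Assumes each <property> has exactly one <name> and one <value> child.
    static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                String name = property.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = property.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse configuration XML", e);
        }
    }

    public static void main(String[] args) {
        // A one-property document in the same shape as test.xml.
        String xml = "<configuration>"
                + "<property>"
                + "<name>mongo.job.input.format</name>"
                + "<value>com.mongodb.hadoop.MongoInputFormat</value>"
                + "</property>"
                + "</configuration>";
        System.out.println(readProperties(xml));
    }
}
```

Each option is a property element holding one name and one value, wrapped in a single configuration root element.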

The options that we need to pay attention to in this file are:

1. mongo.input.uri
2. mongo.output.uri
3. mongo.job.mapper
4. mongo.job.reducer
5. mongo.job.input.format
6. mongo.job.output.format
7. mongo.job.output.key
8. mongo.job.output.value
9. mongo.job.mapper.output.key
10. mongo.job.mapper.output.value

mongo.input.uri

Set this to the URI of the MongoDB collection to read from, for example:

<property>
  <!-- If you are reading from mongo, the URI -->
  <name>mongo.input.uri</name>
  <value>mongodb://ec2-54-203-82-12.us-west-2.compute.amazonaws.com/testData.Provider</value>
</property>

Here ec2-54-203-82-12.us-west-2.compute.amazonaws.com is the public DNS name of the server running MongoDB, testData is the name of the database, and Provider is the name of the collection in that database. Alternatively, you could probably use the alias “mongo” that we set up earlier, but we did not test it this way.
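The breakdown of the URI into host, database, and collection can be sketched with the JDK's java.net.URI class. This is only an illustration of the URI's shape, not the parser the mongo‐hadoop connector uses; the class name MongoUriParts is hypothetical, and the sketch assumes the simple host/database.collection form shown above with no credentials, port, or options.

```java
import java.net.URI;

// Hypothetical helper for illustration; assumes the form
// mongodb://<host>/<database>.<collection>
class MongoUriParts {
    final String host;
    final String database;
    final String collection;

    MongoUriParts(String uri) {
        URI parsed = URI.create(uri);
        String path = parsed.getPath();   // e.g. "/testData.Provider"
        // Split on the first dot: the part before it is the database,
        // everything after it is the collection.
        int dot = (path == null) ? -1 : path.indexOf('.');
        if (parsed.getHost() == null || dot < 2 || dot == path.length() - 1) {
            throw new IllegalArgumentException("expected mongodb://host/database.collection: " + uri);
        }
        host = parsed.getHost();
        database = path.substring(1, dot);      // skip the leading '/'
        collection = path.substring(dot + 1);
    }

    public static void main(String[] args) {
        MongoUriParts parts = new MongoUriParts(
                "mongodb://ec2-54-203-82-12.us-west-2.compute.amazonaws.com/testData.Provider");
        System.out.println("host:       " + parts.host);
        System.out.println("database:   " + parts.database);
        System.out.println("collection: " + parts.collection);
    }
}
```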

mongo.output.uri

Likewise, set this to the URI of the MongoDB collection to write to:

<property>
  <!-- If you are writing to mongo, the URI -->
  <name>mongo.output.uri</name>
  <value>mongodb://ec2-54-203-82-12.us-west-2.compute.amazonaws.com/testData.AvgPerNPI</value>
</property>

mongo.job.mapper

Set this to the fully qualified class name of the mapper class:

<property>
  <!-- Class for the mapper -->
  <name>mongo.job.mapper</name>
  <value>avgPerNPI.AvgPerNPIMapper</value>
</property>

mongo.job.reducer

Set this to the fully qualified class name of the reducer class:

<property>
  <!-- Reducer class -->
  <name>mongo.job.reducer</name>
</property>

mongo.job.input.format

When using MongoDB as input, this should be set to:

<property>
  <!-- InputFormat Class -->
  <name>mongo.job.input.format</name>
  <value>com.mongodb.hadoop.MongoInputFormat</value>
</property>

mongo.job.output.format

When using MongoDB as output, this should be set to:

<property>
  <!-- OutputFormat Class -->
  <name>mongo.job.output.format</name>
  <value>com.mongodb.hadoop.MongoOutputFormat</value>
</property>

mongo.job.output.key

Set this to the same class as the mapper output key:

<property>
  <!-- Output key class for the output format -->
  <name>mongo.job.output.key</name>
  <value>org.apache.hadoop.io.LongWritable</value>
</property>

Here the mapper output key is LongWritable, the third type parameter in the mapper declaration:

public class AvgPerNPIMapper

extends Mapper<Object, BSONObject, LongWritable, DoubleWritable> {

mongo.job.output.value

Set this to the class used for the final output of the map‐reduce job:

<property>
  <!-- Output value class for the output format -->
  <name>mongo.job.output.value</name>
  <value>com.mongodb.hadoop.io.BSONWritable</value>
</property>

Here the final output value class is BSONWritable, the fourth type parameter in the reducer declaration:

public class AvgPerNPIReducer extends Reducer<LongWritable, DoubleWritable, BSONWritable, BSONWritable>

mongo.job.mapper.output.key

The example code for the mongo‐hadoop connector on GitHub describes this option as optional, but it is not. It must be set to the same class as the mapper output key in the key‐value pair:

<property>
  <!-- Output key class for the mapper -->
  <name>mongo.job.mapper.output.key</name>
  <value>org.apache.hadoop.io.LongWritable</value>
</property>

The mapper output key we used was LongWritable, the third type parameter in the mapper declaration:

public class AvgPerNPIMapper

extends Mapper<Object, BSONObject, LongWritable, DoubleWritable> {

mongo.job.mapper.output.value

The example code for the mongo‐hadoop connector on GitHub describes this option as optional, but it is not. It must be set to the same class as the mapper output value in the key‐value pair:

<property>
  <!-- Output value class for the mapper -->
  <name>mongo.job.mapper.output.value</name>
  <value>org.apache.hadoop.io.DoubleWritable</value>
</property>

The mapper output value we used was DoubleWritable, the fourth type parameter in the mapper declaration:

public class AvgPerNPIMapper

extends Mapper<Object, BSONObject, LongWritable, DoubleWritable> {
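The link between the mapper's type parameters and the two mapper output options can be checked with a short reflection sketch. So that it compiles without Hadoop or the mongo‐hadoop connector on the classpath, it declares empty stand-in classes named after the real ones; those stand-ins, the class MapperTypes, and the method typeArgument are assumptions made for this illustration and are not part of either library.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

class MapperTypes {

    // Empty stand-ins for the Hadoop and BSON types (assumption: only the
    // order of the four type parameters matters for this illustration).
    static class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> { }
    static class BSONObject { }
    static class LongWritable { }
    static class DoubleWritable { }

    // Same type parameters as the mapper declaration in the text.
    static class AvgPerNPIMapper
            extends Mapper<Object, BSONObject, LongWritable, DoubleWritable> { }

    // Returns the type argument at the given position (0-based) of the
    // generic superclass, e.g. index 2 is the mapper output key.
    static Class<?> typeArgument(Class<?> mapperClass, int index) {
        ParameterizedType superType = (ParameterizedType) mapperClass.getGenericSuperclass();
        Type argument = superType.getActualTypeArguments()[index];
        return (Class<?>) argument;
    }

    public static void main(String[] args) {
        // Index 2 corresponds to mongo.job.mapper.output.key,
        // index 3 to mongo.job.mapper.output.value.
        System.out.println("mapper output key:   "
                + typeArgument(AvgPerNPIMapper.class, 2).getSimpleName());
        System.out.println("mapper output value: "
                + typeArgument(AvgPerNPIMapper.class, 3).getSimpleName());
    }
}
```

With the real classes, the same positions would correspond to org.apache.hadoop.io.LongWritable and org.apache.hadoop.io.DoubleWritable, the values used in the configuration file.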