6.6 Importing Content Into MarkLogic Server
6.6.10 Creating Documents from Hadoop Sequence Files
A Hadoop sequence file is a flat binary file of key-value pairs. You can use mlcp to create a document from each key-value pair. The only supported value types are Text and BytesWritable.
This section covers the following topics:
• Basic Steps
• Implementing the Key and Value Interfaces
• Deploying your Key and Value Implementation
• Loading Documents From Your Sequence Files
• Running the SequenceFile Example
6.6.10.1 Basic Steps
You must implement a Hadoop SequenceFile reader and writer that also implement two special
mlcp interfaces. To learn more about Apache Hadoop SequenceFile, see
http://wiki.apache.org/hadoop/SequenceFile/. The basic steps are:
1. Implement com.marklogic.contentpump.SequenceFileKey and com.marklogic.contentpump.SequenceFileValue.
2. Generate one or more sequence files using your classes.
3. Deploy your classes into mlcp_install_dir/lib.
4. Use the mlcp import command to create documents from your sequence files.
The source distribution of mlcp, available from http://developer.marklogic.com, includes an example in com.marklogic.contentpump.examples.
6.6.10.2 Implementing the Key and Value Interfaces
You must read and write your sequence files using classes that implement
com.marklogic.contentpump.SequenceFileKey and
com.marklogic.contentpump.SequenceFileValue. These interfaces are included in the mlcp jar
file:
mlcp_install_dir/lib/mlcp-version.jar
Where version is your mlcp version. For example, if you install mlcp version 1.3 to
/opt/mlcp-1.3, then the jar file is:
/opt/mlcp-1.3/lib/mlcp-1.3.jar
Source and an example implementation are available in the mlcp source distribution on developer.marklogic.com.
Your key class must implement the following interface:
package com.marklogic.contentpump;
import com.marklogic.mapreduce.DocumentURI;
public interface SequenceFileKey {
    DocumentURI getDocumentURI();
}
Your value class must implement the following interface:
package com.marklogic.contentpump;
public interface SequenceFileValue<T> {
    T getValue();
}
For an example, see com.marklogic.contentpump.examples.SimpleSequenceFileKey and com.marklogic.contentpump.examples.SimpleSequenceFileValue.
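As a rough illustration of the two interfaces, the following sketch shows one possible pair of implementing classes. It is not the example shipped with mlcp. So that it compiles without the mlcp or connector jars, it declares local stand-ins for DocumentURI, SequenceFileKey, and SequenceFileValue. The class names MyKey and MyValue are hypothetical, and String stands in for the Hadoop Text value type. A real implementation would import the actual types and would also have to be serializable by Hadoop so that it can be written to and read from a sequence file.

```java
// Stand-ins so this sketch compiles on its own. In real code, import
// com.marklogic.mapreduce.DocumentURI and the com.marklogic.contentpump
// interfaces instead of declaring these three types.
class DocumentURI {
    private final String uri;
    DocumentURI(String uri) { this.uri = uri; }
    String getUri() { return uri; }
}

interface SequenceFileKey {
    DocumentURI getDocumentURI();
}

interface SequenceFileValue<T> {
    T getValue();
}

// Hypothetical key class: wraps the URI of the document to create.
class MyKey implements SequenceFileKey {
    private final DocumentURI uri;
    MyKey(String uri) { this.uri = new DocumentURI(uri); }
    @Override
    public DocumentURI getDocumentURI() { return uri; }
}

// Hypothetical value class: holds the document content.
// String is an assumption standing in for Hadoop's Text type.
class MyValue implements SequenceFileValue<String> {
    private final String content;
    MyValue(String content) { this.content = content; }
    @Override
    public String getValue() { return content; }
}

class KeyValueDemo {
    public static void main(String[] args) {
        MyKey key = new MyKey("/doc/foo.xml");
        MyValue value = new MyValue("<foo/>");
        System.out.println(key.getDocumentURI().getUri() + " -> " + value.getValue());
    }
}
```

Keeping the key and value in separate classes mirrors the two interfaces: the key supplies only the document URI, and the value supplies only the content.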
These interfaces depend on Hadoop and the MarkLogic Connector for Hadoop. The connector library is included in the mlcp distribution as:
mlcp_install_dir/lib/marklogic-mapreduceN-version.jar
where N is the Hadoop major version and version is the connector version. The Hadoop major version will correspond to the Hadoop major version of your mlcp distribution. For example, if you install the Hadoop v2 compatible version of mlcp, then the connector jar file name might be:
marklogic-mapreduce2-2.1.jar
For details, see the MarkLogic Connector for Hadoop Developer’s Guide and the MarkLogic Hadoop MapReduce Connector API.
You must implement a sequence file creator. You can find an example in
com.marklogic.contentpump.examples.SimpleSequenceFileCreator.
When compiling your classes, include the following on the Java class path:
• mlcp_install_dir/lib/mlcp-version.jar
• mlcp_install_dir/lib/marklogic-mapreduceN-version.jar
• mlcp_install_dir/lib/hadoop-common-version.jar
For example:
$ javac -cp \
$MLCP_DIR/lib/mlcp-1.3.jar:$MLCP_DIR/lib/marklogic-mapreduce2-2.1.jar:$MLCP_DIR/lib/hadoop-mapreduce-client-core-2.6.0.jar \
*.java
$ jar -cf myseqfile.jar *.class
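The class path in the command above follows the jar naming patterns described in this section. The sketch below shows one way to assemble such a class path string from its parts. The class and parameter names are hypothetical, the jar list mirrors the javac example rather than any authoritative requirement, and the separator is passed in because it differs by platform (":" on Linux and macOS, ";" on Windows).

```java
class CompileClasspath {
    // Builds a class path from the jar naming patterns used in this section.
    // All parameter names are illustrative, not part of mlcp.
    static String build(String installDir, String mlcpVersion, int hadoopMajor,
                        String connectorVersion, String hadoopClientVersion,
                        String separator) {
        String lib = installDir + "/lib/";
        return String.join(separator,
            lib + "mlcp-" + mlcpVersion + ".jar",
            lib + "marklogic-mapreduce" + hadoopMajor + "-" + connectorVersion + ".jar",
            lib + "hadoop-mapreduce-client-core-" + hadoopClientVersion + ".jar");
    }

    public static void main(String[] args) {
        // Values taken from the examples in this section.
        System.out.println(build("/opt/mlcp-1.3", "1.3", 2, "2.1", "2.6.0", ":"));
    }
}
```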
6.6.10.3 Deploying your Key and Value Implementation
Once you compile your SequenceFileKey and SequenceFileValue implementations into a JAR file,
copy your JAR file and any dependent libraries into the mlcp lib/ directory so that mlcp can find
your classes at runtime. For example:
$ cp myseqfile.jar /space/mlcp-1.3/lib
6.6.10.4 Loading Documents From Your Sequence Files
Once you have created one or more sequence files using your implementation, you can create a document from each key-value pair using the following procedure:
1. Set -input_file_path:
• To load from a single file, set -input_file_path to the path to the file.
• To load from multiple files, set -input_file_path to a directory containing the
sequence files.
2. Set -sequencefile_key_class to the name of your SequenceFileKey implementation.
3. Set -sequencefile_value_class to the name of your SequenceFileValue implementation.
4. Set -sequencefile_value_type to either Text or BytesWritable, depending on the contents
of your sequence files.
5. Set -input_file_type to sequencefile.
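The five settings above can be collected into an argument list for the mlcp import command. The following is a minimal sketch, assuming you launch mlcp from a Java program. The class and method names are hypothetical, and connection options such as -host and -username are deliberately left out. The value type check reflects the two supported types, Text and BytesWritable.

```java
import java.util.ArrayList;
import java.util.List;

class SequenceFileImportArgs {
    // Assembles the sequence-file-specific options for "mlcp import".
    // Connection options (host, port, credentials) are not included here.
    static List<String> build(String inputPath, String keyClass,
                              String valueClass, String valueType) {
        if (!valueType.equals("Text") && !valueType.equals("BytesWritable")) {
            throw new IllegalArgumentException(
                "Value type must be Text or BytesWritable: " + valueType);
        }
        List<String> args = new ArrayList<>();
        args.add("import");
        args.add("-input_file_path");          // step 1: file or directory
        args.add(inputPath);
        args.add("-sequencefile_key_class");   // step 2
        args.add(keyClass);
        args.add("-sequencefile_value_class"); // step 3
        args.add(valueClass);
        args.add("-sequencefile_value_type");  // step 4
        args.add(valueType);
        args.add("-input_file_type");          // step 5
        args.add("sequencefile");
        return args;
    }

    public static void main(String[] args) {
        List<String> mlcpArgs = build("seq_output",
            "com.marklogic.contentpump.examples.SimpleSequenceFileKey",
            "com.marklogic.contentpump.examples.SimpleSequenceFileValue",
            "Text");
        System.out.println(String.join(" ", mlcpArgs));
    }
}
```

The resulting list could be prefixed with the path to the mlcp script and passed to java.lang.ProcessBuilder, or simply printed and pasted into a shell.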
By default, the key in each key-value pair is used as the document URI. You can further tailor the URI using command line options, as described in “Controlling Database URIs During Ingestion” on page 45.
For an example, see “Running the SequenceFile Example” on page 60.
6.6.10.5 Running the SequenceFile Example
This section walks you through creating a sequence file and loading its contents as documents.
Create an input text file from which to create a sequence file. The file should contain pairs of lines where the first line is a URI that acts as the key, and the second line is the value. For example:
$ cat > seq_input.txt
/doc/foo.xml
<foo/>
/doc/bar.xml
<bar/>
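To make the paired-line layout concrete, the sketch below reads such lines into (URI, content) pairs. It only illustrates the input format and is not the shipped SimpleSequenceFileCreator, which additionally writes the pairs to a Hadoop sequence file. The class and method names are hypothetical, and ignoring a trailing key line that has no value is a choice made for this sketch.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class PairedLineReader {
    // Treats lines 1, 3, 5, ... as keys (URIs) and lines 2, 4, 6, ... as values.
    // A final key line without a following value line is ignored.
    static List<Map.Entry<String, String>> readPairs(List<String> lines) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < lines.size(); i += 2) {
            pairs.add(new SimpleEntry<>(lines.get(i), lines.get(i + 1)));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Same content as the seq_input.txt example.
        List<String> lines = Arrays.asList(
            "/doc/foo.xml", "<foo/>",
            "/doc/bar.xml", "<bar/>");
        for (Map.Entry<String, String> pair : readPairs(lines)) {
            System.out.println(pair.getKey() + " -> " + pair.getValue());
        }
    }
}
```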
To use the example classes provided with mlcp, put the following libraries on your Java classpath:
• mlcp_install_dir/lib/mlcp-version.jar
• mlcp_install_dir/lib/hadoop-mapreduce-client-core-version.jar
• mlcp_install_dir/lib/commons-logging-1.1.3.jar
• mlcp_install_dir/lib/marklogic-mapreduceN-version.jar
For example:
• mlcp_install_dir/lib/mlcp-1.3.jar
• mlcp_install_dir/lib/hadoop-mapreduce-client-core-2.6.0.jar
• mlcp_install_dir/lib/commons-logging-1.1.3.jar
• mlcp_install_dir/lib/marklogic-mapreduce2-2.1.jar
Generate a sequence file from your test data using
com.marklogic.contentpump.examples.SimpleSequenceFileCreator. The first argument to
the program is the output sequence file name. The second argument is the input data file name. The following command generates seq_output from seq_input.txt:
$ java com.marklogic.contentpump.examples.SimpleSequenceFileCreator seq_output seq_input.txt
Load the contents of the sequence file into MarkLogic Server:
# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh import -username user -password password -host localhost \
    -port 8000 -input_file_path seq_output -mode local \
    -input_file_type sequencefile -sequencefile_key_class \
    com.marklogic.contentpump.examples.SimpleSequenceFileKey \
    -sequencefile_value_class \
    com.marklogic.contentpump.examples.SimpleSequenceFileValue \
    -sequencefile_value_type Text -document_type xml
Two documents are created in the database with URIs /doc/foo.xml and /doc/bar.xml.