6.6 Importing Content Into MarkLogic Server

6.6.10 Creating Documents from Hadoop Sequence Files

A Hadoop sequence file is a flat binary file of key-value pairs. You can use mlcp to create a document from each key-value pair. The only supported value types are Text and BytesWritable.

This section covers the following topics:
• Basic Steps
• Implementing the Key and Value Interfaces
• Deploying your Key and Value Implementation
• Loading Documents From Your Sequence Files
• Running the SequenceFile Example

6.6.10.1 Basic Steps

You must implement a Hadoop SequenceFile reader and writer whose key and value classes also implement two special mlcp interfaces. To learn more about Apache Hadoop SequenceFile, see http://wiki.apache.org/hadoop/SequenceFile/.

1. Implement com.marklogic.contentpump.SequenceFileKey and com.marklogic.contentpump.SequenceFileValue.

2. Generate one or more sequence files using your classes.

3. Deploy your classes into mlcp_install_dir/lib.

4. Use the mlcp import command to create documents from your sequence files.

The source distribution of mlcp, available from http://developer.marklogic.com, includes an example in com.marklogic.contentpump.examples.

6.6.10.2 Implementing the Key and Value Interfaces

You must read and write your sequence files using classes that implement com.marklogic.contentpump.SequenceFileKey and com.marklogic.contentpump.SequenceFileValue. These interfaces are included in the mlcp jar file:

mlcp_install_dir/lib/mlcp-version.jar

Where version is your mlcp version. For example, if you install mlcp version 1.3 to /opt/mlcp-1.3, then the jar file is:

/opt/mlcp-1.3/lib/mlcp-1.3.jar

Source and an example implementation are available in the mlcp source distribution on developer.marklogic.com.

Your key class must implement the following interface:

package com.marklogic.contentpump;

import com.marklogic.mapreduce.DocumentURI;

public interface SequenceFileKey {
    DocumentURI getDocumentURI();
}

Your value class must implement the following interface:

package com.marklogic.contentpump;

public interface SequenceFileValue<T> {
    T getValue();
}

For an example, see com.marklogic.contentpump.examples.SimpleSequenceFileKey and com.marklogic.contentpump.examples.SimpleSequenceFileValue.
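The following is a minimal sketch of what a key class and a value class could look like; it is not the code of the example classes shipped with mlcp. So that it compiles without the mlcp, connector, and Hadoop jars, it declares local stand-ins for Writable, DocumentURI, SequenceFileKey, and SequenceFileValue, and it uses String where a real implementation would use the Hadoop Text type. The names KeyValueSketch, MyKey, MyValue, and copy are hypothetical. It also assumes that your classes are serialized through Hadoop's Writable mechanism, which is Hadoop's default but is not stated in this section.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch only: the nested interfaces and DocumentURI are local stand-ins.
// In a real implementation, import them from the mlcp, connector, and Hadoop jars.
final class KeyValueSketch {

    // Stand-in with the same two methods as org.apache.hadoop.io.Writable.
    interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Simplified, hypothetical stand-in for com.marklogic.mapreduce.DocumentURI.
    static final class DocumentURI {
        private String uri = "";
        String getUri() { return uri; }
        void setUri(String uri) { this.uri = uri; }
    }

    // Same shape as the two mlcp interfaces shown above.
    interface SequenceFileKey { DocumentURI getDocumentURI(); }
    interface SequenceFileValue<T> { T getValue(); }

    // Hypothetical key class. The no-argument constructor is kept because
    // Hadoop creates Writable instances reflectively before calling readFields.
    static final class MyKey implements SequenceFileKey, Writable {
        private final DocumentURI uri = new DocumentURI();
        MyKey() { }
        MyKey(String uri) { this.uri.setUri(uri); }
        @Override public DocumentURI getDocumentURI() { return uri; }
        // writeUTF is limited to 64 KB of encoded text, which is enough for a sketch.
        @Override public void write(DataOutput out) throws IOException { out.writeUTF(uri.getUri()); }
        @Override public void readFields(DataInput in) throws IOException { uri.setUri(in.readUTF()); }
    }

    // Hypothetical value class; String stands in for the Hadoop Text type.
    static final class MyValue implements SequenceFileValue<String>, Writable {
        private String content = "";
        MyValue() { }
        MyValue(String content) { this.content = content; }
        @Override public String getValue() { return content; }
        @Override public void write(DataOutput out) throws IOException { out.writeUTF(content); }
        @Override public void readFields(DataInput in) throws IOException { content = in.readUTF(); }
    }

    // Serializes 'from' to bytes and reads them back into 'to',
    // imitating a write to and a read from a sequence file.
    static void copy(Writable from, Writable to) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            from.write(new DataOutputStream(buffer));
            to.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        MyKey key = new MyKey();
        MyValue value = new MyValue();
        copy(new MyKey("/doc/foo.xml"), key);
        copy(new MyValue("<foo/>"), value);
        System.out.println(key.getDocumentURI().getUri() + " -> " + value.getValue());
    }
}
```

The round trip through copy is only there to show that the key and the value survive serialization; in a real project the sequence file writer and mlcp perform those steps.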

These interfaces depend on Hadoop and the MarkLogic Connector for Hadoop. The connector library is included in the mlcp distribution as:

mlcp_install_dir/lib/marklogic-mapreduceN-version.jar

where N is the Hadoop major version and version is the connector version. The Hadoop major version will correspond to the Hadoop major version of your mlcp distribution. For example, if you install the Hadoop v2 compatible version of mlcp, then the connector jar file name might be:

marklogic-mapreduce2-2.1.jar

For details, see the MarkLogic Connector for Hadoop Developer’s Guide and the MarkLogic Hadoop MapReduce Connector API.

You must implement a sequence file creator. You can find an example in com.marklogic.contentpump.examples.SimpleSequenceFileCreator.

When compiling your classes, include the following on the Java class path:

• mlcp_install_dir/lib/mlcp-version.jar
• mlcp_install_dir/lib/marklogic-mapreduceN-version.jar
• mlcp_install_dir/lib/hadoop-common-version.jar

For example, where source_files stands for your Java source files:

$ javac -cp $MLCP_DIR/lib/mlcp-1.3.jar:$MLCP_DIR/lib/marklogic-mapreduce2-2.1.jar:$MLCP_DIR/lib/hadoop-mapreduce-client-core-2.6.0.jar \
    source_files
$ jar -cf myseqfile.jar *.class

6.6.10.3 Deploying your Key and Value Implementation

Once you compile your SequenceFileKey and SequenceFileValue implementations into a JAR file, copy your JAR file and any dependent libraries into the mlcp lib/ directory so that mlcp can find your classes at runtime. For example:

$ cp myseqfile.jar /space/mlcp-1.3/lib

6.6.10.4 Loading Documents From Your Sequence Files

Once you have created one or more sequence files using your implementation, you can create a document from each key-value pair using the following procedure:

1. Set -input_file_path:

• To load from a single file, set -input_file_path to the path to the file.

• To load from multiple files, set -input_file_path to a directory containing the sequence files.

2. Set -sequencefile_key_class to the name of your SequenceFileKey implementation.

3. Set -sequencefile_value_class to the name of your SequenceFileValue implementation.

4. Set -sequencefile_value_type to either Text or BytesWritable, depending on the contents of your sequence files.

5. Set -input_file_type to sequencefile.
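If you launch mlcp from your own tooling, the option settings in the procedure above can be collected into an argument list. The sketch below does only that: it uses the option names from this section, checks the value type against the two supported types, and omits connection options such as -host and -username. The class name SequenceFileImportArgs, the method build, and the com.example class names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: assembles the sequence-file options for an mlcp import command.
final class SequenceFileImportArgs {

    // The value types this section lists as supported.
    private static final List<String> VALUE_TYPES = Arrays.asList("Text", "BytesWritable");

    // inputPath is a single sequence file or a directory of sequence files.
    static List<String> build(String inputPath, String keyClass,
                              String valueClass, String valueType) {
        if (!VALUE_TYPES.contains(valueType)) {
            throw new IllegalArgumentException("Unsupported value type: " + valueType);
        }
        List<String> args = new ArrayList<>();
        args.add("import");
        args.addAll(Arrays.asList("-input_file_path", inputPath));
        args.addAll(Arrays.asList("-sequencefile_key_class", keyClass));
        args.addAll(Arrays.asList("-sequencefile_value_class", valueClass));
        args.addAll(Arrays.asList("-sequencefile_value_type", valueType));
        args.addAll(Arrays.asList("-input_file_type", "sequencefile"));
        return args;
    }

    public static void main(String[] unused) {
        // Hypothetical directory and class names, for illustration only.
        List<String> args = build("/data/seqfiles", "com.example.MyKey",
                                  "com.example.MyValue", "Text");
        System.out.println("mlcp.sh " + String.join(" ", args));
    }
}
```

Rejecting an unsupported value type before starting mlcp is a design choice of this sketch, not documented mlcp behavior.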

By default, the key in each key-value pair is used as the document URI. You can further tailor the URI using command line options, as described in “Controlling Database URIs During Ingestion” on page 45.

For an example, see “Running the SequenceFile Example” on page 60.

6.6.10.5 Running the SequenceFile Example

This section walks you through creating a sequence file and loading its contents as documents. Create an input text file from which to create a sequence file. The file should contain pairs of lines where the first line is a URI that acts as the key, and the second line is the value. For example:

$ cat > seq_input.txt
/doc/foo.xml
<foo/>
/doc/bar.xml
<bar/>
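To make the paired-line layout concrete, the sketch below reads such text into URI and value pairs. It is an illustration of the input format only, not the code of SimpleSequenceFileCreator; the class name PairedLineReader and the decision to skip blank lines between pairs are assumptions of the sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: parses text made of alternating URI lines and value lines.
final class PairedLineReader {

    // Returns the pairs in input order, keyed by URI.
    static Map<String, String> parse(String text) {
        Map<String, String> pairs = new LinkedHashMap<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String uri;
            while ((uri = reader.readLine()) != null) {
                if (uri.trim().isEmpty()) {
                    continue; // assumption: blank lines between pairs are ignored
                }
                String value = reader.readLine();
                if (value == null) {
                    throw new IllegalArgumentException("Missing value line for key: " + uri);
                }
                pairs.put(uri.trim(), value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pairs;
    }

    public static void main(String[] args) {
        String input = "/doc/foo.xml\n<foo/>\n/doc/bar.xml\n<bar/>\n";
        parse(input).forEach((uri, value) -> System.out.println(uri + " => " + value));
    }
}
```

A URI line without a following value line is treated as an error here, because each document needs both a key and a value.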

To use the example classes provided with mlcp, put the following libraries on your Java classpath:

• mlcp_install_dir/lib/mlcp-version.jar
• mlcp_install_dir/lib/hadoop-mapreduce-client-core-version.jar
• mlcp_install_dir/lib/commons-logging-1.1.3.jar
• mlcp_install_dir/lib/marklogic-mapreduceN-version.jar

For example:

• mlcp_install_dir/lib/mlcp-1.3.jar
• mlcp_install_dir/lib/hadoop-mapreduce-client-core-2.6.0.jar
• mlcp_install_dir/lib/commons-logging-1.1.3.jar
• mlcp_install_dir/lib/marklogic-mapreduce2-2.1.jar

Generate a sequence file from your test data using com.marklogic.contentpump.examples.SimpleSequenceFileCreator. The first argument to the program is the output sequence file name. The second argument is the input data file name. The following command generates seq_output from seq_input.txt.

$ java com.marklogic.contentpump.examples.SimpleSequenceFileCreator seq_output seq_input.txt

Load the contents of the sequence file into MarkLogic Server:

# Windows users, see Modifying the Example Commands for Windows
$ mlcp.sh import -username user -password password -host localhost \
    -port 8000 -input_file_path seq_output -mode local \
    -input_file_type sequencefile -sequencefile_key_class \
    com.marklogic.contentpump.examples.SimpleSequenceFileKey \
    -sequencefile_value_class \
    com.marklogic.contentpump.examples.SimpleSequenceFileValue \
    -sequencefile_value_type Text -document_type xml

Two documents are created in the database with URIs /doc/foo.xml and /doc/bar.xml.