Data serialization— working with text

3.3.5 Avro

Doug Cutting created Avro, a data serialization and RPC library, to help improve data interchange, interoperability, and versioning in MapReduce. Avro utilizes a compact binary data format (which you have the option to compress) that results in fast serialization times. Although it has the concept of a schema, similar to Protocol Buffers, Avro improves on Protocol Buffers because its code generation is optional, and it embeds the schema in the container file format, allowing for dynamic discovery and data interactions. Avro has a mechanism to work with schema data that uses generic data types (an example of which can be seen in chapter 4).

The Avro file format is shown in figure 3.11. The schema is serialized as part of the header, which makes deserialization simple and frees users from having to maintain and access the schema outside of the Avro data files they are working with. Each data block contains a number of Avro records, and by default is 16 KB in size.
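To make the layout concrete, here is a minimal sketch that walks a container file using only the JDK, with no Avro dependency. It leans on details from the Avro specification rather than from this section: the file starts with the bytes Obj followed by a version byte of 1, the header metadata is an Avro map whose entries include avro.schema and avro.codec, and counts and lengths are zigzag variable-length integers. The class name AvroContainerPeek and its fields are hypothetical, and in practice you would use Avro's own DataFileStream, so treat this as an illustration of the format rather than a replacement reader.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: inspects an Avro container file without the Avro library. */
final class AvroContainerPeek {
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};
    private static final int SYNC_SIZE = 16;

    /** Header metadata decoded as UTF-8 text (an assumption that holds for schema and codec). */
    final Map<String, String> metadata = new LinkedHashMap<>();
    long blocks;   // number of data blocks seen
    long records;  // sum of the per-block record counts

    static AvroContainerPeek read(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        AvroContainerPeek peek = new AvroContainerPeek();

        byte[] magic = new byte[MAGIC.length];
        in.readFully(magic);
        if (!Arrays.equals(magic, MAGIC)) {
            throw new IOException("Not an Avro container file");
        }

        // The metadata map is written as blocks of key/value pairs, ended by a zero count.
        for (long n = readLong(in); n != 0; n = readLong(in)) {
            if (n < 0) {       // a negative count is followed by the block's size in bytes
                n = -n;
                readLong(in);
            }
            for (long i = 0; i < n; i++) {
                String key = new String(readBytes(in), StandardCharsets.UTF_8);
                String value = new String(readBytes(in), StandardCharsets.UTF_8);
                peek.metadata.put(key, value);
            }
        }

        // The header ends with the file's sync marker.
        byte[] sync = new byte[SYNC_SIZE];
        in.readFully(sync);

        // Data section: <record count> <byte size> <serialized records> <sync marker>.
        byte[] marker = new byte[SYNC_SIZE];
        while (true) {
            long count;
            try {
                count = readLong(in);
            } catch (EOFException endOfFile) {
                break;         // no more blocks (a truncated file also ends up here)
            }
            readBytes(in);     // byte size followed by the (possibly compressed) records
            in.readFully(marker);
            if (!Arrays.equals(marker, sync)) {
                throw new IOException("Sync marker mismatch");
            }
            peek.blocks++;
            peek.records += count;
        }
        return peek;
    }

    /** Reads one zigzag, variable-length encoded long. */
    static long readLong(DataInputStream in) throws IOException {
        long z = 0;
        int shift = 0;
        int b;
        do {
            b = in.readUnsignedByte();
            z |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (z >>> 1) ^ -(z & 1);
    }

    /** Reads a length-prefixed run of bytes. */
    static byte[] readBytes(DataInputStream in) throws IOException {
        byte[] buf = new byte[(int) readLong(in)];
        in.readFully(buf);
        return buf;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: AvroContainerPeek <local-avro-file>");
            return;
        }
        try (InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get(args[0])))) {
            AvroContainerPeek peek = read(in);
            System.out.println("schema:  " + peek.metadata.get("avro.schema"));
            System.out.println("codec:   " + peek.metadata.get("avro.codec"));
            System.out.println("blocks:  " + peek.blocks + ", records: " + peek.records);
        }
    }
}
```

Pointing main at a local Avro file prints the embedded schema, the codec (which may be absent when no compression is used), and the block and record counts. Because the schema travels in the header, none of this needs any information from outside the file.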

The holy grail of data serialization supports code generation, versioning, and compression, and has a high level of integration with MapReduce. Equally important is schema evolution, and that’s the reason why Hadoop SequenceFiles aren’t appealing: they don’t support the notion of a schema or any form of data evolution.

In this section you’ll get an overview of Avro’s schema and code-generation capabilities, how to read and write Avro container files, and the various ways Avro can be integrated with MapReduce. At the end we’ll also look at Avro support in Hive and Pig.

Let’s get rolling with a look at Avro’s schema and code generation.

TECHNIQUE 12 Avro’s schema and code generation

Avro has the notion of generic data and specific data:

Generic data allows you to work with data at a low level without having to understand schema specifics.

Specific data allows you to work with Avro using code-generated Avro primitives, which supports a simple and type-safe method of working with your Avro data.

This technique looks at how to work with specific data in Avro.

Figure 3.11 The Avro file format

Header: Magic, Metadata, Sync
Data: Block, Block, Block, ...

Magic: four bytes (“Obj” followed by a version byte of 1) that identify the file as being Avro.
Metadata: includes the schema and compression codec.
Sync: a randomly generated sync marker used to delimit blocks in the data section.
Block: each block contains a count of the serialized objects, the sizes of the objects, and a sync marker to delimit the end of the block.

■ Problem

You want to define an Avro schema and generate code so you can work with your Avro records in Java.

■ Solution

Author your schema in JSON form, and then use Avro tools to generate rich APIs to interact with your data.

■ Discussion

You can use Avro in one of two ways: either with code-generated classes or with its generic classes. In this technique we’ll work with the code-generated classes, but you can see an example of how Avro’s generic records are used in technique 29 in chapter 4.

Getting Avro
The appendix contains instructions on how to get your hands on Avro.

In the code-generated approach, everything starts with a schema. The first step is to create an Avro schema to represent an entry in the stock data [25]:

    {
      "name": "Stock",
      "type": "record",
      "namespace": "hip.ch3.avro.gen",
      "fields": [
        {"name": "symbol",   "type": "string"},
        {"name": "date",     "type": "string"},
        {"name": "open",     "type": "double"},
        {"name": "high",     "type": "double"},
        {"name": "low",      "type": "double"},
        {"name": "close",    "type": "double"},
        {"name": "volume",   "type": "int"},
        {"name": "adjClose", "type": "double"}
      ]
    }

Avro supports code generation for schema data as well as RPC messages (which aren’t covered in this book). To generate Java code for a schema, use the Avro tools JAR as follows:

    # Create a directory for the sources.
    $ cd $HIP_HOME && mkdir src && cd src

    # Expand the source JAR into the directory.
    $ jar -xvf ../hip-2.0.0-sources.jar
    $ cd ..

    # Tell the Avro tool that you want to generate classes for an Avro
    # schema ("compile schema"). The tool supports multiple input schema
    # files; the last argument is the output directory where generated
    # code is written.
    $ java -jar $HIP_HOME/lib/avro-tools-1.7.4.jar \
        compile schema \
        $HIP_HOME/src/hip/ch3/avro/stock.avsc \
        $HIP_HOME/src/hip/ch3/avro/stockavg.avsc \
        $HIP_HOME/src/

[25] GitHub source: https://github.com/alexholmes/hiped2/blob/master/src/main/java/hip/ch3/avro/stock.avsc

Generated code will be put into the hip.ch3.avro.gen package. Now that you have generated code, how do you use it to read and write Avro container files? [26]

Listing 3.5 Writing Avro files from outside of MapReduce

    // Create a writer that can write Avro's data file format.
    DataFileWriter<Stock> writer = new DataFileWriter<Stock>(
        new SpecificDatumWriter<Stock>());

    // Specify that Snappy should be used to compress the data.
    writer.setCodec(CodecFactory.snappyCodec());

    // Indicate the schema that will be used.
    writer.create(Stock.SCHEMA$, outputStream);

    // Write each stock to the Avro file.
    for (Stock stock : StockUtils.fromCsvFile(inputFile)) {
      writer.append(stock);
    }

    IOUtils.closeStream(writer);
    IOUtils.closeStream(outputStream);

As you see, you can specify the compression codec that should be used to compress the data. In this example you’re using Snappy, which, as shown in chapter 4, is the fastest codec for reads and writes.

The following code example shows how you can marshal a Stock object from a line in the input file. As you can see, the generated Stock class is a POJO with a bunch of setters (and matching getters):

    public static Stock fromCsv(String line) throws IOException {
      String[] parts = parser.parseLine(line);

      Stock stock = new Stock();
      stock.setSymbol(parts[0]);
      stock.setDate(parts[1]);
      stock.setOpen(Double.valueOf(parts[2]));
      stock.setHigh(Double.valueOf(parts[3]));
      stock.setLow(Double.valueOf(parts[4]));
      stock.setClose(Double.valueOf(parts[5]));
      stock.setVolume(Integer.valueOf(parts[6]));
      stock.setAdjClose(Double.valueOf(parts[7]));
      return stock;
    }

Now, how about reading the file you just wrote? [27]

    // Use Avro's file container deserialization class to read
    // from an input stream.
    DataFileStream<Stock> reader = new DataFileStream<Stock>(
        is, new SpecificDatumReader<Stock>(Stock.class));

    for (Stock a : reader) {
      System.out.println(ToStringBuilder.reflectionToString(a,
          ToStringStyle.SIMPLE_STYLE));
    }

    // Close the reader before the stream it wraps.
    IOUtils.closeStream(reader);
    IOUtils.closeStream(is);

[26] GitHub source: https://github.com/alexholmes/hiped2/blob/master/src/main/java/hip/ch3/avro/AvroStockFileWrite.java

[27] GitHub source: https://github.com/alexholmes/hiped2/blob/master/src/main/java/hip/ch3/avro/AvroStockFileRead.java

Go ahead and execute this writer and reader pair. The reader loops through the Stock objects and uses the Apache Commons ToStringBuilder to dump all the members to the console:

    # Reads the stocks.txt file from the local filesystem and
    # writes the Avro output file stocks.avro to HDFS.
    $ hip hip.ch3.avro.AvroStockFileWrite \
        --input test-data/stocks.txt \
        --output stocks.avro

    # Reads the Avro file stocks.avro from HDFS and dumps the
    # records to the terminal.
    $ hip hip.ch3.avro.AvroStockFileRead \
        --input stocks.avro
    AAPL,2009-01-02,85.88,91.04,85.16,90.75,26643400,90.75
    AAPL,2008-01-02,199.27,200.26,192.55,194.84,38542100,194.84
    AAPL,2007-01-03,86.29,86.58,81.9,83.8,44225700,83.8
    AAPL,2006-01-03,72.38,74.75,72.25,74.75,28829800,74.75
    AAPL,2005-01-03,64.78,65.11,62.6,63.29,24714000,31.65
    ...

Avro comes bundled with some tools to make it easy to examine the contents of Avro files. To view the contents of an Avro file as JSON, simply run this command:

    $ java -jar $HIP_HOME/lib/avro-tools-1.7.4.jar tojson stocks.avro
    {"symbol":"AAPL","date":"2009-01-02","open":85.88,"high":91.04,...
    {"symbol":"AAPL","date":"2008-01-02","open":199.27,"high":200.26,...
    {"symbol":"AAPL","date":"2007-01-03","open":86.29,"high":86.58,...
    ...

This assumes that the file exists on the local filesystem. Similarly, you can get a JSON representation of the schema stored in your Avro file with the following command:

    $ java -jar $HIP_HOME/lib/avro-tools-1.7.4.jar getschema stocks.avro
    {
      "type" : "record",
      "name" : "Stock",
      "namespace" : "hip.ch3.avro.gen",
      "fields" : [
        { "name" : "symbol", "type" : "string" },
        { "name" : "date", "type" : "string" },
        { "name" : "open", "type" : "double" },
        { "name" : "high", "type" : "double" },
        { "name" : "low", "type" : "double" },
        { "name" : "close", "type" : "double" },
        { "name" : "volume", "type" : "int" },
        { "name" : "adjClose", "type" : "double" }
      ]
    }

You can run the Avro tools without any options to view all the tools you can use:

    $ java -jar $HIP_HOME/lib/avro-tools-1.7.4.jar

    compile        Generates Java code for the given schema.
    concat         Concatenates avro files without re-compressing.
    fragtojson     Renders a binary-encoded Avro datum as JSON.
    fromjson       Reads JSON records and writes an Avro data file.
    fromtext       Imports a text file into an avro data file.
    getmeta        Prints out the metadata of an Avro data file.
    getschema      Prints out schema of an Avro data file.
    idl            Generates a JSON schema from an Avro IDL file
    induce         Induce schema/protocol from Java class/interface
                   via reflection.
    jsontofrag     Renders a JSON-encoded Avro datum as binary.
    recodec        Alters the codec of a data file.
    rpcprotocol    Output the protocol of a RPC service
    rpcreceive     Opens an RPC Server and listens for one message.
    rpcsend        Sends a single RPC message.
    tether         Run a tethered mapreduce job.
    tojson         Dumps an Avro data file as JSON, one record per line.
    totext         Converts an Avro data file to a text file.
    trevni_meta    Dumps a Trevni file's metadata as JSON.
    trevni_random  Create a Trevni file filled with random instances of a schema.
    trevni_tojson  Dumps a Trevni file as JSON.

One shortcoming of the tojson tool is that it doesn’t support reading data in HDFS. I’ve therefore bundled a utility with the book’s code called AvroDump that can dump a text representation of Avro data in HDFS, which we’ll use shortly to examine the output of Avro MapReduce jobs.

This utility supports multiple files (they need to be comma-separated) and globbing, so you can use wildcards. The following example shows how you would dump out the contents of a MapReduce job that produced Avro output into a directory called mr-output-dir:

$ hip hip.util.AvroDump --file mr-output-dir/part*
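The shape of the --file argument can be illustrated with a small sketch. The source of AvroDump isn't shown in this section, so this is not its implementation: it is a hypothetical helper, FileArgExpander, that splits a comma-separated argument and expands each wildcard against the local filesystem with java.nio, whereas the real utility works against HDFS. It also assumes that wildcards appear only in the final path component.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: expands "dir/part*,other/file" style arguments on the local filesystem. */
final class FileArgExpander {

    static List<Path> expand(String fileArg) throws IOException {
        List<Path> matches = new ArrayList<>();
        for (String pattern : fileArg.split(",")) {
            pattern = pattern.trim();
            if (pattern.isEmpty()) {
                continue;
            }
            // Split into a literal directory and a glob for the file name.
            int slash = pattern.lastIndexOf('/');
            String dirName = slash < 0 ? "." : slash == 0 ? "/" : pattern.substring(0, slash);
            String glob = pattern.substring(slash + 1);

            List<Path> found = new ArrayList<>();
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(Paths.get(dirName), glob)) {
                for (Path p : stream) {
                    found.add(p);
                }
            }
            // Directory listings are unordered, so sort for predictable output.
            Collections.sort(found);
            matches.addAll(found);
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        String fileArg = args.length > 0 ? args[0] : "mr-output-dir/part*";
        for (Path p : expand(fileArg)) {
            System.out.println(p);
        }
    }
}
```

Results keep the order in which the patterns were given, so files matched by the first pattern are listed before those matched by the second.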

Let’s see how Avro integrates with MapReduce.