Columnar storage - Avro and Pig - Data serialization

Data serialization— working with text

TECHNIQUE 19 Avro and Pig

3.4 Columnar storage

When data is written to an I/O device (say a flat file, or a table in a relational database), the most common way to lay out that data is row-based, meaning that all the fields for the first row are written first, followed by all the fields for the second row, and so on. This is how most relational databases write out tables by default, and the same goes for most data serialization formats such as XML, JSON, and Avro container files.

Columnar storage works differently—it lays out data by column first, and then by row. All the values of the first field across all the records are written first, followed by the second field, and so on. Figure 3.12 highlights the differences between the two storage schemes in how the data is laid out.

There are two main benefits to storing data in columnar form:

■ Systems that read columnar data can efficiently extract a subset of the columns, reducing I/O. Row-based systems typically need to read the entire row even if just one or two columns are needed.

■ Optimizations can be made when writing columnar data, such as run-length encoding and bit packing, to efficiently compress the size of the data being written. General compression schemes also work well for compressing columnar data because compression works best on data that has a lot of repeating data, which is the case when columnar data is physically colocated.

These records would be laid out diﬀerently for row- and column-based storage.

39.54 MSFT 05-10-2014 526.62 05-10-2014 GOOGL Price Date Symbol Row storage Sample records 05-10-2014 05-10-2014 MSFT 526.62 39.54 GOOGL Row 1 Row 2 Column storage 526.62 MSFT 05-10-2014 05-10-2014 39.54 GOOGL Column 1 (Symbol) Column 3 (Price) Column 2 (Date)

As a result, columnar file formats work best when working with large datasets where you wish to filter or project data, which is exactly the type of work that’s commonly performed in OLAP-type use cases, as well as MapReduce.

The majority of data formats used in Hadoop, such as JSON and Avro, are row- ordered, which means that you can’t apply the previously mentioned optimizations when reading and writing these files. Imagine that the data in figure 3.12 was in a Hive table and you were to execute the following query:

SELECT AVG(price) FROM stocks;

If the data was laid out in a row-based format, each row would have to be read, even though the only column being operated on is price. In a column-oriented store, only the price column would be read, which could result in drastically reduced processing times when you’re working with large datasets.

There are a number of columnar storage options that can be used in Hadoop:

■ RCFile was the first columnar format available in Hadoop; it came out of a collaboration between Facebook and academia in 2009.39_RCF_{ile is a basic colum-} nar store that supports separate column storage and column compression. It can support projection during reads, but misses out on the more advanced techniques such as run-length encoding. As a result, Facebook has been moving away from RCFile to ORC file.40

■ _ORC_file_{was created by Facebook and Hortonworks to address}_RCF_{ile’s short-}

comings, and its serialization optimizations have yielded smaller data sizes com- pared to RCFile.41_{It also uses indexes to enable predicate pushdowns to} optimize queries so that a column that doesn’t match a filter predicate can be skipped. ORC file is also fully integrated with Hive’s type system and can support nested structures.

■ Parquet is a collaboration between Twitter and Cloudera and employs many of the tricks that ORC file uses to generate compressed files.42_{Parquet is a} language-independent format with a formal specification.

RCFile and ORC file were designed to support Hive as their primary usage, whereas Parquet is independent of any other Hadoop tool and tries to maximize compatibility with the Hadoop ecosystem. Table 3.2 shows how these columnar formats integrate with various tools and languages.

39_{Yongqiang He, et al., “RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based}

Warehouse Systems,” ICDE Conference 2011: www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/ papers/TR-11-4.pdf.

40_{Facebook Engineering Blog, “Scaling the Facebook data warehouse to 300 PB,”}_{https://code.facebook.com/} posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/.

41_{Owen O’Malley, “ORC File Introduction,”}_{www.slideshare.net/oom65/orc-fileintro}_. 42_{Features such as column stats and indexes are planned for the Parquet 2 release.}

115

Columnar storage

For this section, I’ll focus on Parquet due to its compatibility with object models such as Avro.

3.4.1 Understanding object models and storage formats

Before we get started with the techniques, we’ll cover a few Parquet concepts that are important in understanding the interplay between Parquet and Avro (and Thrift and Protocol Buffers):

■ _{Object models}_{are in-memory representations of data. Parquet exposes a simple}

object model that’s supplied more as an example than anything else. Avro, Thrift, and Protocol Buffers are full-featured object models. An example is the Avro Stock class, which was generated by Avro to richly model the schema using Java POJOs.

■ _{Storage formats}_{are serialized representations of a data model. Parquet is a stor-}

age format that serializes data in columnar form. Avro, Thrift, and Protocol Buffers also have their own storage formats that serialize data in row-oriented formats.43_{Storage formats can be thought of as the at-rest representation} of data.

■ _{Parquet object model converters}_{are responsible for converting an object model to}

Parquet’s data types, and vice versa. Parquet is bundled with a number of converters to maximize the interoperability and adoption of Parquet.

Figure 3.13 shows how these concepts work in the context of Parquet.

What’s unique about Parquet is that it has converters that allow it to support common object models such as Avro. Behind the scenes, the data is stored in Parquet binary form, but when you’re working with your data, you’re using your preferred object model, such as Avro objects. This gives you the best of both worlds: you can continue to use a rich object model such as Avro to interact with your data, and that data will be efficiently laid out on disk using Parquet.

Table 3.2 Columnar storage formats supported in Hadoop

Format Hadoop support Supported object models

Supported programming languages

Advanced compression support

RCFile MapReduce, Pig, Hive (0.4+), Impala

Thrift, Protocol Buffersa

Java No

ORC file MapReduce, Pig, Hive (0.11+)

None Java Yes

Parquet MapReduce, Pig, Hive, Impala

Avro, Protocol Buffers, Thrift

Java, C++, Python Yes

a _{Elephant Bird provides the ability to use Thrift and Protocol Buffers with RCFile.}

43_{Avro does have a columnar storage format called Trevni:}_{http://avro.apache.org/docs/1.7.6/trevni/} spec.html.

Storage format interoperability Storage formats generally aren’t interoperable. When you’re combining Avro and Parquet, you’re combining Avro’s object model and Parquet’s storage format. Therefore, if you have existing Avro data sitting in HDFS that was serialized using Avro’s storage format, you can’t read that data using Parquet’s storage format, as they are two very different ways of encoding data. The reverse is also true—Parquet can’t be read using the nor- mal Avro methods (such as the AvroInputFormat in MapReduce); you must use Parquet implementations of input formats and Hive SerDes to work with Par- quet data.

To summarize, choose Parquet if you want your data to be serialized in a columnar form. Once you’ve selected Parquet, you’ll need to decide which object model you’ll be working with. I recommend you pick the object model that has the most traction in your organization. Otherwise I recommend going with Avro (section 3.3.5 explains why Avro can be a good choice).

The Parquet file format The Parquet file format is beyond the scope of this book; for more details, take a look at the Parquet project page at https:// github.com/Parquet/parquet-format.

3.4.2 Parquet and the Hadoop ecosystem

The goal of Parquet is to maximize support throughout the Hadoop ecosystem. It cur- rently supports MapReduce, Hive, Pig, Impala, and Spark, and hopefully we’ll see it being supported by other systems such as Sqoop.

Because Parquet is a standard file format, a Parquet file that’s written by any one of these technologies can also be read by the others. Maximizing support across the Hadoop ecosystem is critical to the success of a file format, and Parquet is poised to become the ubiquitous file format in big data.

Parquet

Parquet contains both a storage format and converters that map object models to and

from Parquet's ﬁle format types. Object

models

Storage

format Binary Parquet file format

Example Hive Pig protobuf Thrift Avro Object model

converters Avro Thrift protobuf Pig Hive Example

117

In document Hadoop in Practice 2nd Edition {PRG} pdf (Page 138-142)