Data serialization— working with text
TECHNIQUE 19 Avro and Pig
3.4 Columnar storage
When data is written to an I/O device (say a flat file, or a table in a relational database), the most common way to lay out that data is row-based, meaning that all the fields for the first row are written first, followed by all the fields for the second row, and so on. This is how most relational databases write out tables by default, and the same goes for most data serialization formats such as XML, JSON, and Avro container files.
Columnar storage works differently—it lays out data by column first, and then by row. All the values of the first field across all the records are written first, followed by the second field, and so on. Figure 3.12 highlights the differences between the two storage schemes in how the data is laid out.
There are two main benefits to storing data in columnar form:
■ Systems that read columnar data can efficiently extract a subset of the columns, reducing I/O. Row-based systems typically need to read the entire row even if just one or two columns are needed.
■ Optimizations can be made when writing columnar data, such as run-length encoding and bit packing, to efficiently compress the size of the data being writ- ten. General compression schemes also work well for compressing columnar data because compression works best on data that has a lot of repeating data, which is the case when columnar data is physically colocated.
These records would be laid out differently for row- and column-based storage.
39.54 MSFT 05-10-2014 526.62 05-10-2014 GOOGL Price Date Symbol Row storage Sample records 05-10-2014 05-10-2014 MSFT 526.62 39.54 GOOGL Row 1 Row 2 Column storage 526.62 MSFT 05-10-2014 05-10-2014 39.54 GOOGL Column 1 (Symbol) Column 3 (Price) Column 2 (Date)
As a result, columnar file formats work best when working with large datasets where you wish to filter or project data, which is exactly the type of work that’s commonly performed in OLAP-type use cases, as well as MapReduce.
The majority of data formats used in Hadoop, such as JSON and Avro, are row- ordered, which means that you can’t apply the previously mentioned optimizations when reading and writing these files. Imagine that the data in figure 3.12 was in a Hive table and you were to execute the following query:
SELECT AVG(price) FROM stocks;
If the data was laid out in a row-based format, each row would have to be read, even though the only column being operated on is price. In a column-oriented store, only the price column would be read, which could result in drastically reduced processing times when you’re working with large datasets.
There are a number of columnar storage options that can be used in Hadoop:
■ RCFile was the first columnar format available in Hadoop; it came out of a col- laboration between Facebook and academia in 2009.39RCFile is a basic colum- nar store that supports separate column storage and column compression. It can support projection during reads, but misses out on the more advanced techniques such as run-length encoding. As a result, Facebook has been moving away from RCFile to ORC file.40
■ ORC file was created by Facebook and Hortonworks to address RCFile’s short-
comings, and its serialization optimizations have yielded smaller data sizes com- pared to RCFile.41 It also uses indexes to enable predicate pushdowns to optimize queries so that a column that doesn’t match a filter predicate can be skipped. ORC file is also fully integrated with Hive’s type system and can support nested structures.
■ Parquet is a collaboration between Twitter and Cloudera and employs many of the tricks that ORC file uses to generate compressed files.42 Parquet is a language-independent format with a formal specification.
RCFile and ORC file were designed to support Hive as their primary usage, whereas Parquet is independent of any other Hadoop tool and tries to maximize compatibility with the Hadoop ecosystem. Table 3.2 shows how these columnar formats integrate with various tools and languages.
39Yongqiang He, et al., “RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based
Warehouse Systems,” ICDE Conference 2011: www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/ papers/TR-11-4.pdf.
40Facebook Engineering Blog, “Scaling the Facebook data warehouse to 300 PB,” https://code.facebook.com/ posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/.
41Owen O’Malley, “ORC File Introduction,” www.slideshare.net/oom65/orc-fileintro. 42Features such as column stats and indexes are planned for the Parquet 2 release.
115
Columnar storage
For this section, I’ll focus on Parquet due to its compatibility with object models such as Avro.
3.4.1 Understanding object models and storage formats
Before we get started with the techniques, we’ll cover a few Parquet concepts that are important in understanding the interplay between Parquet and Avro (and Thrift and Protocol Buffers):
■ Object models are in-memory representations of data. Parquet exposes a simple
object model that’s supplied more as an example than anything else. Avro, Thrift, and Protocol Buffers are full-featured object models. An example is the Avro Stock class, which was generated by Avro to richly model the schema using Java POJOs.
■ Storage formats are serialized representations of a data model. Parquet is a stor-
age format that serializes data in columnar form. Avro, Thrift, and Protocol Buffers also have their own storage formats that serialize data in row-oriented formats.43 Storage formats can be thought of as the at-rest representation of data.
■ Parquet object model converters are responsible for converting an object model to
Parquet’s data types, and vice versa. Parquet is bundled with a number of con- verters to maximize the interoperability and adoption of Parquet.
Figure 3.13 shows how these concepts work in the context of Parquet.
What’s unique about Parquet is that it has converters that allow it to support com- mon object models such as Avro. Behind the scenes, the data is stored in Parquet binary form, but when you’re working with your data, you’re using your preferred object model, such as Avro objects. This gives you the best of both worlds: you can continue to use a rich object model such as Avro to interact with your data, and that data will be efficiently laid out on disk using Parquet.
Table 3.2 Columnar storage formats supported in Hadoop
Format Hadoop support Supported object models
Supported programming languages
Advanced compression support
RCFile MapReduce, Pig, Hive (0.4+), Impala
Thrift, Protocol Buffersa
Java No
ORC file MapReduce, Pig, Hive (0.11+)
None Java Yes
Parquet MapReduce, Pig, Hive, Impala
Avro, Protocol Buffers, Thrift
Java, C++, Python Yes
a Elephant Bird provides the ability to use Thrift and Protocol Buffers with RCFile.
43Avro does have a columnar storage format called Trevni: http://avro.apache.org/docs/1.7.6/trevni/ spec.html.
Storage format interoperability Storage formats generally aren’t interoperable. When you’re combining Avro and Parquet, you’re combining Avro’s object model and Parquet’s storage format. Therefore, if you have existing Avro data sitting in HDFS that was serialized using Avro’s storage format, you can’t read that data using Parquet’s storage format, as they are two very different ways of encoding data. The reverse is also true—Parquet can’t be read using the nor- mal Avro methods (such as the AvroInputFormat in MapReduce); you must use Parquet implementations of input formats and Hive SerDes to work with Par- quet data.
To summarize, choose Parquet if you want your data to be serialized in a columnar form. Once you’ve selected Parquet, you’ll need to decide which object model you’ll be working with. I recommend you pick the object model that has the most traction in your organization. Otherwise I recommend going with Avro (section 3.3.5 explains why Avro can be a good choice).
The Parquet file format The Parquet file format is beyond the scope of this book; for more details, take a look at the Parquet project page at https:// github.com/Parquet/parquet-format.
3.4.2 Parquet and the Hadoop ecosystem
The goal of Parquet is to maximize support throughout the Hadoop ecosystem. It cur- rently supports MapReduce, Hive, Pig, Impala, and Spark, and hopefully we’ll see it being supported by other systems such as Sqoop.
Because Parquet is a standard file format, a Parquet file that’s written by any one of these technologies can also be read by the others. Maximizing support across the Hadoop ecosystem is critical to the success of a file format, and Parquet is poised to become the ubiquitous file format in big data.
Parquet
Parquet contains both a storage format and converters that map object models to and
from Parquet's file format types. Object
models
Storage
format Binary Parquet file format
Example Hive Pig protobuf Thrift Avro Object model
converters Avro Thrift protobuf Pig Hive Example
117