File Formats - Microsoft Big Data Solutions (2014) pdf

Hive uses Hadoop as the underlying data store. Because the actual data is stored in Hadoop, it can be in a wide variety of formats. As discussed in Chapter 5, “Storing and Managing Data in HDFS,” Hadoop stores files and doesn’t impose any restrictions on the content or format of those files. Hive offers enough flexibility that you can work with almost any file format, but some formats require significantly more effort.

The simplest files to work with in Hive are text files, and this is the default format Hive expects for files. These text files are normally delimited by specific characters. Common formats in business settings are comma-separated value files or tab-separated value files. However, the drawback of these formats is that commas and tabs often appear in real data; that is, they are embedded inside other text and are not always intended as delimiters. For that reason, Hive by default uses control characters as delimiters, which are less likely to appear in real data. Table 6-2 describes these default delimiters.

Table 6-2: Hive Default Delimiters for Text Files

DELIMITER   OCTAL CODE   DESCRIPTION
\n          \012         New line character; this delimits rows in a text file.
^A          \001         Separates columns in each row.
^B          \002         Separates elements in an ARRAY, STRUCT, and key/value pairs in a MAP.
^C          \003         Separates the key from the value in a MAP column.
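
To make the roles of these delimiters concrete, the following is a minimal Python sketch that splits raw text rows the way the table describes. It is an illustration only, not Hive's own parser; the three-column layout (a STRING, an ARRAY of STRING, and a MAP of STRING to STRING), the sample data, and the function name parse_row are all hypothetical.

```python
# Hive's default text delimiters, as listed in Table 6-2.
ROW_DELIM = "\n"        # octal \012, ends each row
FIELD_DELIM = "\x01"    # ^A, octal \001, separates columns
ITEM_DELIM = "\x02"     # ^B, octal \002, separates collection elements
MAP_KEY_DELIM = "\x03"  # ^C, octal \003, separates a map key from its value


def parse_row(line):
    """Split one row into (name, phones, attributes).

    Assumed (hypothetical) layout: STRING, ARRAY<STRING>, MAP<STRING,STRING>.
    """
    name, phones_raw, attrs_raw = line.split(FIELD_DELIM)
    phones = phones_raw.split(ITEM_DELIM)
    attrs = dict(
        pair.split(MAP_KEY_DELIM, 1) for pair in attrs_raw.split(ITEM_DELIM)
    )
    return name, phones, attrs


# Two sample rows, invented for this example.
sample = (
    "Ann\x01555-0100\x02555-0101\x01dept\x03sales\x02city\x03Oslo\n"
    "Bob\x01555-0199\x01dept\x03ops\n"
)

rows = [parse_row(line) for line in sample.split(ROW_DELIM) if line]
for row in rows:
    print(row)
```

Because the control characters almost never occur in business text, the splits above do not need any quoting or escaping logic, which is the practical advantage over commas and tabs.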

NOTE The default delimiters can be overridden when the table is created. This is useful when you are dealing with text files that use different delimiters, but are still formatted in a very similar way. The options for that are shown in the section “Creating Tables” in this chapter.

What if one of the many text files that is accessed through a Hive table uses a different value as a column delimiter? In that case, Hive won’t be able to parse the file accurately. The exact results will vary depending on how the text file is formatted and how the Hive table was configured. However, it’s likely that Hive will find fewer than the expected number of columns in the text file. In this case, it will fill in the columns it finds values for, and then output null values for any “missing” columns.

112 Part III ■ Storing and Managing Big Data

The same thing will happen if the data values in the files don’t match the data type defined on the Hive table. If a file contains alphanumeric characters where Hive is expecting only numeric values, it will return null values. This enables Hive to be resilient to data quality issues with the files stored in Hadoop.
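
The null-filling behavior described above can be imitated in a few lines of Python. This is a sketch of the idea under stated assumptions, not Hive's implementation: the schema (name STRING, id INT, qty INT), the sample lines, and the helper names to_int and read_row are hypothetical.

```python
def to_int(text):
    """Return an int, or None when the text is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        return None


def read_row(line, casts, delimiter="\x01"):
    """Parse one line against a schema, using None where Hive would show NULL."""
    fields = line.split(delimiter)
    row = []
    for position, cast in enumerate(casts):
        if position < len(fields):
            row.append(cast(fields[position]))  # value found; may still fail the cast
        else:
            row.append(None)                    # column missing from the file
    return row


# Hypothetical schema: name STRING, id INT, qty INT.
casts = [str, to_int, to_int]

good = read_row("widget\x017\x0112", casts)
wrong_delimiter = read_row("widget,7,12", casts)
bad_type = read_row("widget\x01seven\x0112", casts)

print(good)             # [ 'widget', 7, 12 ]
print(wrong_delimiter)  # whole line lands in the first column; the rest are None
print(bad_type)         # the non-numeric id becomes None
```

Note that the comma-delimited line does not raise an error. The whole line is read as the first column, which is why this kind of mismatch tends to surface as unexpected nulls rather than as a failed query.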

Some data, however, isn’t stored as text. Binary file formats can be faster and more efficient than text formats, as the data takes less space in the files. If the data is stored in a smaller number of bytes, more of it can be read from the disk in a single read operation, and more of it can fit in memory. This can improve performance, particularly in a big data system.
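
As a rough illustration of the space difference, the Python sketch below stores the same three integers once as newline-delimited text and once as fixed-width 32-bit binary values. The sample values are invented, and the fixed-width layout is only an assumption for the example; it is not the encoding used by any particular Hive file format.

```python
import struct

# Three sample values that each fit in a signed 32-bit integer.
values = [1234567890, 2000000000, 1987654321]

# Text encoding: one decimal number per line.
text_bytes = "\n".join(str(v) for v in values).encode("ascii")

# Binary encoding: three little-endian 4-byte integers, no delimiters needed.
binary_bytes = struct.pack("<3i", *values)

print(len(text_bytes), "bytes as text")
print(len(binary_bytes), "bytes as binary")

# The binary form is only readable by code that knows the layout.
decoded = list(struct.unpack("<3i", binary_bytes))
```

The decode step also shows the trade-off discussed next: the bytes are meaningless without knowledge of the format that produced them.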

Unlike a text file, though, you can’t open a binary file in your favorite text editor and understand the data. Other applications can’t understand the data either, unless they have been built specifically to understand the format. In some cases, though, the improved performance can offset the lack of portability of the binary file formats.

Hive supports several binary formats natively for files. One option is the Sequence File format. Sequence files consist of binary encoded key/value pairs. This is a standard file format for Hadoop, so it will be usable by many other tools in the Hadoop ecosystem.

Another option is the RCFile format. RCFile uses a columnar storage approach, rather than the row-based approach familiar to users of relational systems. In the columnar approach, the values of each column are stored together and compressed, so a value that repeats across many rows does not have to be written out in full for every row. This can help compress the data a great deal, particularly if the column values are repeated for many rows. RCFiles are readable through Hive, but not from most other Hadoop tools.
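
The benefit of grouping values by column can be sketched in Python. The example below pivots a handful of invented rows into columns and applies simple run-length encoding to each one. Run-length encoding is used here only as a stand-in to show why repeated column values compress well; it is not a description of the actual RCFile on-disk encoding, and the function names are hypothetical.

```python
from itertools import groupby

# Row-oriented sample data (invented): date, country, quantity.
rows = [
    ("2014-01-01", "US", 10),
    ("2014-01-01", "US", 12),
    ("2014-01-01", "US", 9),
    ("2014-01-01", "CA", 4),
    ("2014-01-02", "CA", 7),
    ("2014-01-02", "CA", 7),
]


def to_columns(rows):
    """Pivot a list of row tuples into one list per column."""
    return [list(column) for column in zip(*rows)]


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)
encoded = [run_length_encode(column) for column in columns]

for column in encoded:
    print(column)
```

The date and country columns shrink to two entries each, while the mostly distinct quantity column barely shrinks at all, which matches the point that the gain depends on how often values repeat.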

A variation on the RCFile is the Optimized Row Columnar file format (ORCFile). This format stores additional metadata in the files themselves, which can vastly speed up the querying of Hive data. This was released as part of Hive 0.11.
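
One way extra metadata can speed up queries is by letting a reader skip data that cannot match a filter. The Python sketch below shows that general idea with minimum and maximum values kept per chunk of a column. The chunk size, the sample data, and the function names are assumptions made for illustration; this is not the ORCFile layout or its reader logic.

```python
def build_stats(values, chunk_size):
    """Split a column into chunks and record (min, max, chunk) for each."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    return [(min(chunk), max(chunk), chunk) for chunk in chunks]


def find_equal(stats, target):
    """Return matching values and the number of chunks actually scanned."""
    hits = []
    scanned = 0
    for low, high, chunk in stats:
        if target < low or target > high:
            continue  # metadata proves this chunk cannot contain the target
        scanned += 1
        hits.extend(value for value in chunk if value == target)
    return hits, scanned


# Hypothetical column of order IDs, stored in chunks of four values.
order_ids = list(range(1, 13))
stats = build_stats(order_ids, 4)

hits, scanned = find_equal(stats, 6)
print(hits, "found after scanning", scanned, "of", len(stats), "chunks")
```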

NOTE Compression is an option for your Hadoop data, and Hive can decompress the data as needed for processing. Hive and Hadoop have native support for compressing and decompressing files on demand using a variety of compression types, including common formats such as gzip. This can be an alternative that allows you to get the benefits of smaller data formats while still keeping the data in a text format.
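
The trade-off in this note can be seen with Python's standard gzip module, used here as an assumed stand-in for a Hadoop compression codec. The sample rows are invented. The point is only that the data stays delimited text once decompressed, while occupying far fewer bytes at rest.

```python
import gzip

# 1,000 identical delimited text rows (invented sample data).
text = ("2014-01-01\x01US\x0110\n" * 1000).encode("utf-8")

compressed = gzip.compress(text)
restored = gzip.decompress(compressed)

print(len(text), "bytes uncompressed")
print(len(compressed), "bytes compressed")

# After decompression the content is the same delimited text as before.
first_row = restored.decode("utf-8").split("\n")[0].split("\x01")
```

Highly repetitive data like this compresses unusually well; real files will typically show a smaller, though often still substantial, reduction.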

If the data is in a binary or text format that Hive doesn’t understand, custom logic can be developed to support it. The next section discusses how these can be implemented.

Hive has robust support for both standard and complex data types, stored in a wide variety of formats. And as highlighted in the preceding section, if support for a particular file format is not included, it can be added via third-party add-ons or custom implementations. This works very well with the type of data that is often found in Hadoop data stores. Hive’s ability to apply a tabular structure to the data makes it easier for users and tools to consume. But there is another component to making access much easier for existing tools, which is discussed next.

In document Microsoft Big Data Solutions (2014) pdf (Page 133-136)