Hive Data Types - Microsoft Big Data Solutions (2014) pdf

Table 6.1 lists the data types supported by Hive. Many of these data types have equivalent values in SQL Server, but a few are unique to Hive. Even for the data types that appear familiar, it is important to remember that Hive is coded as a Java application, and so these data types are implemented in Java. Their behavior will match the behavior from a Java application that uses the same data type. One immediate difference you will notice is that STRING types do not have a defi ned length. This is normal for Java and other programming languages, but is not typical for relational databases.

Table 6-1: Hive Data Types

TYPE DESCRIPTION EXAMPLES

SQL SERVER EQUIVALENT STRING String enclosed by single or

double quotation marks.

'John Smith' or "John Smith"

varchar(n), nvarchar(n) TINYINT 1-byte signed integer in the

range of -128 to 127.

10 tinyint

SMALLINT 2-byte signed integer in the range of -32,768 to 32,767.

32000 smallint

INT 4-byte signed integer in the range of -2,147,483,648 to 2,147,483,647.

2000000 int

BIGINT 8-byte signed integer in the range of

-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.

20000000 bigint

BOOLEAN Boolean true or false. TRUE FALSE

bit

FLOAT 4-byte single-precision ﬂ oating point.

25.189764 real

DOUBLE 8-byte double-precision ﬂ oating point.

25.1897645126 float(53)

DECIMAL A 38-digit precision number. 25.1897654 decimal, numeric

TYPE DESCRIPTION EXAMPLES

SQL SERVER EQUIVALENT TIMESTAMP UNIX timestamp that can be

in one of three forms: Integer: Represents the number of seconds from the UNIX epoch date and time (January 1, 1970 12:00 AM). Floating point: Represents second oﬀ set from UNIX epoch with nanosecond precision.

String: JDBC-compliant timestamp format YYYY-MM-DD HH:MM:SS.fffffffff. 123412123 123412123.1234567 '2013-01-01 12:00:00' datetime2 datetime2(7)

DATE A date in YYYY-MM-DD format.

{{2012-01-01}} date

BINARY A series of bytes. binary(n) STRUCT Deﬁ nes a column that con-

tains a deﬁ ned set of addi- tional values and their types.

struct('John', 'Smith')

MAP Deﬁ nes a collection of key/ value pairs.

map('first', 'John', 'last', 'Smith')

ARRAY Deﬁ nes a sequenced collection of values.

array('John', 'Smith') UNION Similar to sql_variant

types. They hold one value at a time, but it can be any one of the deﬁ ned types for the column.

Varies depending on column

sql_variant

N O T E DATE types are new for Hive 0.12. The Hortonworks Data Platform (HDP) 1.3 release is built using Hive 0.11, so this data type cannot be used with it yet.

The types that are unique to Hive are MAP, ARRAY, and STRUCT. These types are supported in Hive so that it can better work with the denormalized data that is often found in Hadoop data stores. Relational database tables are typically normalized; that is, a row holds only one value for a given column. In Hadoop, though, it is not uncommon to fi nd data where many values are stored in a row for a “column.” This denormalization of the data makes it easier and faster to write the data, but makes it more challenging to retrieve it in a tabular format.

Hive addresses this with the MAP, ARRAY, and STRUCT types, which let a devel- oper fl atten out the denormalized data into a multicolumn structure. The details

110 Part III ■ Storing and Managing Big Data

of querying this denormalized data are discussed later in this chapter. For now, we will review the data types that support these structures.

A STRUCT is a column that contains multiple defi ned fi elds. Each fi eld can have its own data type. This is comparable to structs in most programming languages. In Hive, you can declare a STRUCT for a full name using the following syntax:

STRUCT <FirstName:string, MiddleName:string, LastName:string>

To access the individual fi elds of the STRUCT type, use the column name followed by a period and the name of the fi eld:

FullName.FirstName

An ARRAY is a column that contains an ordered sequence of values. All the values must be of the same type:

ARRAY<STRING>

Because it is ordered, the individual values can be accessed by their index. As with Java and .NET languages, ARRAY types use a zero-based index, so you use an index of 0 to access the fi rst element, and an index of 2 to access the third element. If the preceding Full Name column were declared as an ARRAY, with fi rst name in the fi rst position, middle name in the second position, and last name in the third position, you would access the fi rst name with index 0 and last name with index 2:

FullName[0], FullName[2]

A MAP column is a collection of key/value pairs, where both the key and values have data types. The key and value do not have to use the same data type. A

MAP for Full Name might be declared using the following syntax:

MAP<string, string>

In the Full Name case, you would populate the MAP column with the following key/value pairs:

'FirstName', 'John' 'MiddleName', 'Doe' 'LastName', 'Smith'

You can access MAP column elements using the same syntax as you use with an

ARRAY, except that you use the key value instead of the position as the index. Accessing the fi rst and last names would be done with this syntax:

FullName['FirstName'], FullName['LastName']

After looking at the possible data types, you may be wondering how these are stored in Hive. The next section covers the fi le formats that can be used to store the data.

In document Microsoft Big Data Solutions (2014) pdf (Page 130-133)