Querying a Table - Microsoft Big Data Solutions (2014) pdf

Writing queries against Hive is fairly straightforward if you are familiar with writing SQL queries. Instead of focusing on the everyday SQL, this section focuses on the aspects of querying Hive that differ from most relational databases.

The basic SELECT statement is intact, along with familiar elements such as

WHERE clauses, table and column aliases, and ORDER BY clauses: SELECT c.name, c.city, c.state, c.postalCode, c.purchases FROM MsBigData.customer c LIMIT 100

WHERE c.state='FL' ORDER BY c.postalCode;

N O T E One useful diﬀ erence to note is the LIMIT clause. This restricts the query to an upper limit of rows that it can return. If you are used to SQL Server, you might be familiar with the TOP clause. LIMIT works in the same way, but it doesn’t support per- centage based row limits. LIMIT can prove very handy when you are exploring data and don’t want to process millions or billions of rows in your Hive tables.

When you run the SELECT statement, you’ll notice that the results are as expected, with the exception of the purchases column. Because that column represents a collection of values, Hive fl attens it into something that it can return as a column value. It does this using Java Script Object Notation (JSON), a standard format for representing objects:

John Smith Jacksonville FL 32226 {"Food":456.98,"Lodging":1245.45} This might be useful to get a quick look at a table’s data, but in most instances you will want to extract portions of the value out. Querying individual elements of complex types is fairly straightforward. For MAP types, you reference the key value:

124 Part III ■ Storing and Managing Big Data

If purchases were an ARRAY, you would use the index of the value you are interested in:

SELECT c.name, c.city, c.state, c.postalCode, c.purchases[1] And if purchases were a STRUCT, you would use the fi eld name:

SELECT c.name, c.city, c.state, c.postalCode, c.purchases.Lodging

You can use this syntax in any location where you would use a regular column. Calculations and functions are used in the same way as you would in most SQL dialects. For example, this SELECT statement returns the sum of lodging purchases for any customer who purchased over 100 in food:

SELECT SUM(c.purchases['Lodging']) FROM MsBigData.customer c

WHERE c.purchases['Food'] > 100;

N O T E One interesting feature of Hive is that you can use regular expressions in the column list of the SELECT. For example, this query returns the name column and all columns that start with “address” from the speciﬁ ed table:

SELECT name, 'address.*' FROM shipments;

You can also use the functions RLIKE and REGEXP, which function in the same way as LIKE but allow the use of regular expressions for matching.

Some functions that are of particular interest are those that deal with complex types, because these don’t have equivalent versions in many relational systems. For example, there are functions for determining the size of a collection. There are also functions that generate tables as output. These are the opposite of aggre- gating functions, such as SUM, which take multiple rows and aggregate them into a single result. Table generating functions take a single row of input and produce multiple rows of output. These are useful when dealing with complex types that need to be fl attened out. However, they must be used by themselves in SELECT column lists. Table 6.3 describes the table-generating functions, along with other functions that work with complex types.

Table 6-3: Functions Related to Complex Types

NAME DESCRIPTION

size(MAP | ARRAY) Returns the number of elements in the MAP or ARRAY passed to the function

map_keys(MAP) Returns the key values from a MAP as an ARRAY map_values(MAP) Returns the values from a MAP as an ARRAY array_contains(ARRAY, value) Returns true if the array contains the value,

NAME DESCRIPTION

sort_array(ARRAY) Sorts and returns the ARRAY by the natural order of the elements

explode(MAP | ARRAY) Returns a row for each item in the MAP or ARRAY

inline(ARRAY<STRUCT>) Explodes an array of STRUCTs into a table

N O T E There are also functions for parsing URLs and JSON objects into tables of information that can prove extremely useful if you need to deal with this type of data. For a complete, current list of Hive operators and functions, a good resource is the Hive wiki: https://cwiki.apache.org/confluence/display/Hive/ LanguageManual+UDF.

Hive supports joining tables, but only using equi-join logic. This restriction is due to the distributed nature of the data, and because Hive has to translate many queries to MapReduce jobs. Performing non-equi-joins across distributed data sets is extremely resource intensive, and performance would often be unreasonably poor. For the same reason, ORs cannot be used in JOIN clauses.

INNER, LEFT OUTER, RIGHT OUTER, and FULL OUTER JOINs are supported. These function like their SQL equivalents. When processing a SELECT with both JOIN

and WHERE clauses, Hive evaluates the JOIN fi rst, then the WHERE clause is applied on the joined results.

During join operations, Hive makes the assumption that the largest table appears last in the FROM clause. Therefore, it attempts to process the other tables fi rst, and then streams the content of the last table. If you keep this in mind when writing your Hive queries, you will get better performance. You can use a query hint to indicate which table should be streamed, too:

SELECT /*+ STREAMTABLE(bt) */ bt.name, bt.transactionAmount, c.state FROM bigTable bt JOIN customer c ON bt.postalCode = c.PostalCode

When you are using ORDER BY, be aware that this requires ordering of the whole data set. Because this operation cannot be distributed across multiple nodes, it can be quite slow. Hive offers the alternative SORT BY. Instead of sort- ing the entire data set, SORT BY lets each node that is processing results sort its results locally. The overall data set won’t be ordered, but the results from each node will be sorted:

SELECT c.name, c.city, c.state, c.postalCode, c.purchases FROM MsBigData.customer c

126 Part III ■ Storing and Managing Big Data

You can use SORT BY in conjunction with DISTRIBUTE BY to send related data to the same nodes for processing so that there is less overlap from sort- ing on multiple nodes. In the following example, the data is distributed to nodes, based on the state, and then the postal codes are sorted per state for each node:

SELECT c.name, c.city, c.state, c.postalCode, c.purchases FROM MsBigData.customer c

DISTRIBUTE BY c.state;

SORT BY c.state, c.postalCode;

Now that you have explored the basic operations in Hive, the next section will address the more advanced features, like partitioning, views, and indexes.

In document Microsoft Big Data Solutions (2014) pdf (Page 145-148)