Integrating with Apache Hadoop HPE Vertica Analytic Database. Software Version: 7.2.x

(1)

HPE Vertica Analytic Database

Software Version: 7.2.x

(2)

statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HPE shall not be liable for technical or editorial errors or omissions contained herein. The information contained herein is subject to change without notice.

Restricted Rights Legend

Confidential computer software. Valid license from HPE required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial license.

Copyright Notice

Trademark Notices

Adobe™ is a trademark of Adobe Systems Incorporated.

Microsoft® and Windows® are U.S. registered trademarks of Microsoft Corporation. UNIX® is a registered trademark of The Open Group.

(3)

Introduction to Hadoop Integration 8 Hadoop Distributions 8 Integration Options 8 Cluster Layout 10 Co-Located Clusters 10 Hardware Recommendations 10

Configuring Hadoop for Co-Located Clusters 12

webHDFS 12

YARN 12

Hadoop Balancer 12

Replication Factor 13

Disk Space for Non-HDFS Use 13

Separate Clusters 13

Choosing Which Hadoop Interface to Use 16

Creating an HDFS Storage Location 16

Using the ORC Reader 16

Using the HCatalog Connector 16

Using the HDFS Connector 17

Using the MapReduce Connector 17

Using Kerberos with Hadoop 18

How Vertica uses Kerberos With Hadoop 18

User Authentication 18

Vertica Authentication 19

Introduction to Hadoop Integration

Apache™ Hadoop, like Vertica, uses a cluster of nodes for distributed processing. The primary component of interest is HDFS, the Hadoop Distributed File System. You can use HDFS from Vertica in several ways:

l You can import HDFS data into locally-stored ROS files. l You can access HDFS data in place, using external tables. l You can use HDFS as a storage location for ROS files. Hadoop includes two other components of interest:

l Hive, a data warehouse that provides the ability to query data stored in Hadoop. l HCatalog, a component that makes Hive metadata available to applications, such as

Vertica, outside of Hadoop.

A Hadoop cluster can use Kerberos authentication to protect data stored in HDFS. Vertica integrates with Kerberos to access HDFS data if needed. SeeUsing Kerberos with Hadoop.

Hadoop Distributions

Vertica can be used with Hadoop distributions from Hortonworks, Cloudera, and MapR. SeeVertica Integrations for Hadoopfor the specific versions that are supported.

Integration Options

Vertica supports two cluster architectures. Which you use affects the decisions you make about integration.

l You can co-locate Vertica on some or all of your Hadoop nodes. Vertica can then take advantage of local data. This option is supported only for Vertica for SQL on Hadoop.

l You can build a Vertica cluster that is separate from your Hadoop cluster. In this configuration, Vertica can fully use each of its nodes; it does not share resources with Hadoop. This option is not supported for Vertica for SQL on Hadoop..

These layout options are described inCluster Layout. Both layouts support several interfaces for using Hadoop:

(9)

l AnHDFS Storage Locationuses HDFS to hold Vertica data (ROS files).

l TheHCatalog Connectorlets Vertica query data that is stored in a Hive database the same way you query data stored natively in a Vertica schema.

l TheORC Readerlets Vertica query data that is stored in the ORC format native to Hadoop. This is faster than using the HCatalog Connector for this type of data. l TheHDFS Connectorlets Vertica import HDFS data. It also lets VerticaVertica read

HDFS data as an external table without using Hive.

l TheMapReduce Connectorlets you create Hadoop MapReduce jobs that retrieve data from Vertica. These jobs can also insert data into Vertica.

(10)

Cluster Layout

Vertica and Hadoop each use a cluster of nodes for distributed processing. These clusters can be co-located, meaning you run both products on the same machines, or separate.

Co-Located Clustersare for use with Vertica for SQL on Hadoop licenses.

Separate Clustersare for use with Premium Edition and Community Edition licenses.

Co-Located Clusters

With co-located clusters, Vertica is installed on some or all of your Hadoop nodes. The Vertica nodes use a private network in addition to the public network used by all

Hadoop nodes, as the following figure shows:

You might choose to place Vertica on all of your Hadoop nodes or only on some of them. If you are using HDFS Storage Locations you should use at least three Vertica nodes, the minimum number forK-Safety. Using more Vertica nodes can improve performance because the HDFS data needed by a query is more likely to be local. Normally, both Hadoop and Vertica use the entire node. Because this configuration uses shared nodes, you must address potential resource contention in your

configuration on those nodes. SeeConfiguring Hadoop for Co-Located Clustersfor more information. No changes are needed on Hadoop-only nodes.

You can place Hadoop and Vertica clusters within a single rack, or you can span across many racks and nodes. Spreading node types across racks can improve efficiency.

Hardware Recommendations

Hadoop clusters frequently do not have identical provisioning requirements or hardware configurations. However, Vertica nodes should be equivalent in size and capability, per

(11)

the best-practice standards recommended inGeneral Hardware and OS Requirements and Recommendationsin Installing Vertica.

Because Hadoop cluster specifications do not always meet these standards, Hewlett Packard Enterprise recommends the following specifications for Vertica nodes in your Hadoop cluster.

Specifications For...

Recommendation

Processor For best performance, run:

l Two-socket servers with 8–14 core CPUs, clocked at or above

2.6 GHz for clusters over 10 TB

l Single-socket servers with 8–12 cores clocked at or above 2.6

GHz for clusters under 10 TB

Memory Distribute the memory appropriately across all memory channels in the server:

l Minimum—8 GB of memory per physical CPU core in the server l High-performance applications12–16 GB of memory per

physical core

l Type—at least DDR3-1600, preferably DDR3-1866

Storage Read/write:

l Minimum— 40 MB/s per physical core of the CPU l For best performance— 60–80 MB/s per physical core

Storage post RAID:Each node should have 1–9 TB. For a

production setting, RAID 10 is recommended. In some cases, RAID 50 is acceptable.

Because of the heavy compression and encoding that Verticadoes, SSDs are not required. In most cases, a RAID of more,

less-expensive HDDs performs just as well as a RAID of fewer SSDs. If you intend to use RAID 50 for your data partition, you should keep a spare node in every rack, allowing for manual failover of a Vertica node in the case of a drive failure. A Vertica node recovery is faster than a RAID 50 rebuild. Also, be sure to never put more than 10 TB compressed on any node, to keep node recovery times at an

(12)

Network 10 GB networking in almost every case. With the introduction of 10 GB over cat6a (Ethernet), the cost difference is minimal.

Configuring Hadoop for Co-Located

Clusters

If you are co-locating Vertica on any HDFS nodes, there are some additional configuration requirements.

webHDFS

Hadoop has two services that can provide web access to HDFS:

l webHDFS

l httpFS

For Vertica, you must use the webHDFS service.

YARN

The YARN service is available in newer releases of Hadoop. It performs resource management for Hadoop clusters. When co-locating Vertica on

YARN-managed Hadoop nodes you must make some changes in YARN.

HPE recommends reserving at least 16GB of memory for Vertica on shared nodes. Reserving more will improve performance. How you do this depends on your Hadoop distribution:

l If you are using Hortonworks, create a "Vertica" node label and assign this to the nodes that are running Vertica.

l If you are using Cloudera, enable and configure static service pools.

Consult the documentation for your Hadoop distribution for details. Alternatively, you can disable YARN on the shared nodes.

Hadoop Balancer

The Hadoop Balancer can redistribute data blocks across HDFS. For many Hadoop services, this feature is useful. However, for Vertica this can reduce performance under some conditions.

(13)

If you are using HDFS storage locations, the Hadoop load balancer can move data away from the Vertica nodes that are operating on it. This degrades performance. This can also occur when reading ORC files if Vertica is not running on all Hadoop nodes. (If you are using separate Vertica and Hadoop clusters, all Hadoop access is over the network, and the performance cost is less noticeable.)

To prevent the undesired movement of data blocks across the HDFS cluster, consider excluding Vertica nodes from rebalancing. See the Hadoop documentation to learn how to do this.

Replication Factor

By default, HDFS stores three copies of each data block. Vertica is generally set up to store two copies of each data item through K-Safety. Thus, lowering the replication factor to 2 can save space and still provide data protection.

To lower the number of copies HDFS stores, set HadoopFSReplication, as explained in

Troubleshooting HDFS Storage Locations.

Disk Space for Non-HDFS Use

You also need to reserve some disk space for non-HDFS use. To reserve disk space using Ambari, setdfs.datanode.du.reservedto a value in thehdfs-site.xml

configuration file.

Setting this parameter preserves space for non-HDFS files that Vertica requires.

Separate Clusters

In the Premium Edition product, your Vertica and Hadoop clusters must be set up on separate nodes, ideally connected by a high-bandwidth network connection. This is different from the configuration for Vertica for SQL on Hadoop, in which Vertica nodes are co-located on Hadoop nodes.

(14)

The network is a key performance component of any well-configured cluster. When Vertica stores data to HDFS it writes and reads data across the network.

The layout shown in the figure calls for two networks, and there are benefits to adding a third:

l Database Private Network: Vertica uses a private network for command and control and moving data between nodes in support of its database functions. In some networks, the command and control and passing of data are split across two networks.

l Database/Hadoop Shared Network: Each Vertica node must be able to connect to each Hadoop data node and the Name Node. Hadoop best practices generally require a dedicated network for the Hadoop cluster. This is not a technical

requirement, but a dedicated network improves Hadoop performance. Vertica and Hadoop should share the dedicated Hadoop network.

l Optional Client Network: Outside clients may access the clustered networks through a client network. This is not an absolute requirement, but the use of a third network that supports client connections to either Vertica or Hadoop can improve

(15)

performance. If the configuration does not support a client network, than client connections should use the shared network.

(16)

Choosing Which Hadoop Interface to

Use

Vertica provides several ways to interact with data stored in Hadoop. This section explains how to choose among them. Decisions aboutCluster Layoutcan affect the decisions you make about Hadoop interfaces.

Creating an HDFS Storage Location

Using a storage location to store data in the Vertica native file format (ROS) delivers the best query performance among the available Hadoop options. (Storing ROS files on the local disk rather than in Hadoop is faster still.) If you already have data in Hadoop, however, doing this means you are importing that data into Vertica.

For co-located clusters, which does not use local file storage, you might still choose to use an HDFS storage location for better performance. You can use the

HDFS Connector to load data that is already in HDFS into Vertica.

For separate clusters, which use local file storage, consider using an HDFS storage location for lower-priority data.

SeeUsing HDFS Storage LocationsandUsing the HDFS Connector.

Using the ORC Reader

If your data is stored in the Optimized Row Columnar format, an open format supported by most Hadoop providers, Vertica can query that data directly from HDFS. This is faster than using the HCatalog Connector, but you cannot pull schema definitions from Hive directly into the database. The ORC Reader reads the data in place; no extra copies are made.

SeeUsing the ORC Reader.

Using the HCatalog Connector

The HCatalog Connector uses Hadoop services (Hive and HCatalog) to query data stored in HDFS. Like the ORC Reader, it reads data in place rather than making copies. Using this interface you can read all file formats supported by Hadoop, including

Parquet and ORC, and Vertica can use Hive's schema definitions. However,

(17)

changes in the Hadoop libraries on which it depends; upgrading your Hadoop cluster might affect your HCatalog connections.

SeeUsing the HCatalog Connector.

Using the HDFS Connector

The HDFS Connector can be used to create and query external tables, reading the data in place rather than making copies. The HDFS Connector can be used with any data format for which a parser is available. It does not use Hive data; you have to define the table yourself. Its performance can be poor because, like the HCatalog Connector, it cannot take advantage of the benefits of columnar file formats.

SeeUsing the HDFS Connector.

Using the MapReduce Connector

The other interfaces described in this section allow you to read Hadoop data from Vertica or create Vertica data in Hadoop. The MapReduce Connector, in contrast, allows you to integrate with Hadoop's MapReduce jobs. Use this connector to send Vertica data to MapReduce or to have MapReduce jobs create data in Vertica. SeeUsing the MapReduce Connector.

(18)

Using Kerberos with Hadoop

If your Hadoop cluster uses Kerberos authentication to restrict access to HDFS, you must configure Vertica to make authenticated connections. The details of this

configuration vary, based on which methods you are using to access HDFS data: l How Vertica uses Kerberos With Hadoop

l Configuring Kerberos

How Vertica uses Kerberos With Hadoop

Vertica authenticates with Hadoop in two ways that require different configurations: l User Authentication—On behalf of the user, by passing along the user's existing

Kerberos credentials, as occurs with the HDFS Connector and the HCatalog Connector.

l VerticaAuthentication—On behalf of system processes (such as the Tuple Mover), by using a special Kerberos credential stored in a keytab file.

User Authentication

To use Vertica with Kerberos and Hadoop, the client user first authenticates with the Kerberos server (Key Distribution Center, or KDC) being used by the Hadoop cluster. A user might run kinit or sign in to Active Directory, for example. A user who authenticates to a Kerberos server receives a Kerberos ticket. At the beginning of a client session, Vertica automatically retrieves this ticket.The database then uses this ticket to get a Hadoop token, which Hadoop uses to grant access. Vertica uses this token to access HDFS, such as when executing a query on behalf of the user. When the token expires, the database automatically renews it, also renewing the Kerberos ticket if necessary. The following figure shows how the user, Vertica, Hadoop, and Kerberos interact in user authentication:

(19)

When using the HDFS Connector or the HCatalog Connector, or when reading an ORC file stored in HDFS, Vertica uses the client identity as the preceding figure shows.

Vertica Authentication

Automatic processes, such as the Tuple Mover, do not log in the way users do. Instead, Vertica uses a special identity (principal) stored in a keytab file on every database node. (This approach is also used for Vertica clusters that use Kerberos but do not use

Hadoop.) After you configure the keytab file, Vertica uses the principal residing there to automatically obtain and maintain a Kerberos ticket, much as in the client scenario. In this case, the client does not interact with Kerberos.

(20)

Each Vertica node uses its own principal; it is common to incorporate the name of the node into the principal name. You can either create one keytab per node, containing only that node's principal, or you can create a single keytab containing all the principals and distribute the file to all nodes. Either way, the node uses its principal to get a

Kerberos ticket and then uses that ticket to get a Hadoop token. For simplicity, the preceding figure shows the full set of interactions for only one database node.

When creating HDFS storage locations Vertica uses the principal in the keytab file, not the principal of the user issuing the CREATE LOCATION statement.

Configuring Kerberos

Vertica can connect with Hadoop in several ways, and how you manage Kerberos authentication varies by connection type. This documentation assumes that you are using Kerberos for both your HDFS and Vertica clusters.

Prerequisite:

Setting Up Users and the Keytab

File

If you have not already configured Kerberos authentication for Vertica, follow the instructions inConfigure for Kerberos Authentication. In particular:

l Create one Kerberos principal per node.

l Place the keytab file(s) in the same location on each database node and set its location in KerberosKeytabFile (seeSpecify the Location of the Keytab File). l Set KerberosServiceName to the name of the principal (seeInform About the

Kerberos Principal).

HCatalog Connector

You use the HCatalog Connector to query data in Hive. Queries are executed on behalf of Vertica users. If the current user has a Kerberos key, then Vertica passes it to the HCatalog connector automatically. Verify that all users who need access to Hive have been granted access to HDFS.

In addition, in your Hadoop configuration files (core-site.xml in most distributions), make sure that you enable all Hadoop components to impersonate the Vertica user. The easiest way to do this is to set the proxyuser property using wildcards for all users on all hosts and in all groups. Consult your Hadoop documentation for instructions. Make sure you do this before running hcatUtil (seeConfiguring Vertica for HCatalog).

HDFS Connector

The HDFS Connector loads data from HDFS into Vertica on behalf of the user, using a User Defined Source. If the user performing the data load has a Kerberos key, then the UDS uses it to access HDFS. Verify that all users who use this connector have been granted access to HDFS.

(22)

HDFS Storage Location

You can create a database storage location in HDFS. An HDFS storage location provides improved performance compared to other HDFS interfaces (such as the HCatalog Connector).

After you create Kerberos principals for each node, give all of them read and write permissions to the HDFS directory you will use as a storage location. If you plan to back up HDFS storage locations, take the following additional steps:

l Grant Hadoop superuser privileges to the new principals.

l Configure backups, including setting the HadoopConfigDir configuration parameter, following the instructions inConfiguring Hadoop and Vertica to Enable Backup of HDFS Storage

l Configure user impersonation to be able to restore from backups following the instructions in "Setting Kerberos Parameters" inConfiguring Vertica to Restore HDFS Storage Locations.

Because the keytab file supplies the principal used to create the location, you must have it in place before creating the storage location. After you deploy keytab files to all

database nodes, use theCREATE LOCATIONstatement to create the storage location as usual.

Token Expiration

Vertica attempts to automatically refresh Hadoop tokens before they expire, but you can also set a minimum refresh frequency if you prefer. The

HadoopFSTokenRefreshFrequency configuration parameter specifies the frequency in seconds:

=> ALTER DATABASE exampledb SET HadoopFSTokenRefreshFrequency = '86400';

If the current age of the token is greater than the value specified in this parameter, Vertica refreshes the token before accessing data stored in HDFS.

Using the ORC

Reader

If your HDFS data is in the Optimized Row Columnar (ORC) format and uses no complex data types, then instead of using the HCatalog Connector you can use the ORC Reader to access the data directly. Reading directly may provide better performance.

The decisions you make when writing ORC files can affect performance when using them. To get the best performance from the ORC Reader, do the following when writing:

l Use the latest available Hive version to write ORC files. (You can still read them with earlier versions.)

l Use a large stripe size; 256MB or greater is preferred. l Partition the data at the table level.

l Sort the columns based on frequency of access, most-frequent first. l Use Snappy or ZLib compression.

Syntax

In theCOPYstatement, specify a format of ORC as follows:

COPY tableName FROM path ORC;

In theCREATE EXTERNAL TABLE AS COPYstatement, specify a format of ORC as follows:

CREATE EXTERNAL TABLE tableName (columns) AS COPY FROM path ORC;

If the file resides on the local file system of the node where you are issuing the

command, use a local file path forpath. If the file resides elsewhere in HDFS, use the

webhdfs://prefix and then specify the host name, port, and file path. Use ON ANY NODE for files that are not local to improve performance.

COPY t FROM 'webhdfs://somehost:port/opt/data/orcfile' ON ANY NODE ORC;

The ORC reader supports ZLib and Snappy compression. It does not support GZIP, BZIP, or LZO compression.

The CREATE EXTERNAL TABLE AS COPY statement must consume all of the columns in the ORC file; unlike with some other data sources, you cannot select only

(24)

the columns of interest. If you omit columns the ORC reader aborts with an error and does not copy any data.

If you load from multiple ORC files in the same COPY statement and any of them is aborted, the entire load is aborted. This is different behavior than for delimited files, where the COPY statement loads what it can and ignores the rest.

Supported Data Types

The Vertica ORC file reader can natively read columns of all data types supported in HIVE version 0.11 and later except for complex types. If complex types such as maps are encountered, the COPY or CREATE EXTERNAL TABLE AS COPY statement aborts with an error message. The ORC reader does not attempt to read only some columns; either the entire file is read or the operation fails. For a complete list of supported types, seeHIVE Data Types.

Timestamps

Reading timestamps from an ORC file in Vertica might result in different values than reading the same ORC file with another tool such as Hive. ORC files store timestamps without any time zone information. Vertica interprets timestamps without time zones as values in the local time zone, which might not be the time zone from which the ORC file was written.

If you know which time zone was used as the reference to write the ORC file, you can set the default time zone in Vertica to correct the problem. For example, suppose the following values were written in an ORC file in the America/New_York time zone:

2016-02-01 00:00:00 2016-08-01 00:00:00

When you read these values into Vertica the values might change, depending on your local time zone:

=> CREATE EXTERNAL TABLE t (ts timestamp) AS COPY FROM '/path/to/file.orc' orc; CREATE TABLE => select ts from t ; ts ---2016-02-01 05:00:00 2016-08-01 04:00:00

To adjust ORC time zones, set theTIMEZONEvariable in Vertica:

(25)

SET

=> SELECT ts AT timezone 'GMT' as tstz from t ; tstz

---2016-02-01 00:00:00-05

2016-08-01 00:00:00-04

TheTIMEZONEvariable is global to your Vertica cluster and affects all timestamp values that do not include a time zone. If you have ORC files written in more than one time zone, you cannot adjust all of them using this approach. For more information about the

TIMEZONEvariable, seeUsing Time Zones With Vertica.

When Hive writes ORC files it converts dates before 1583 to the Julian calendar. Vertica does not perform this conversion. If your ORC file contains dates before this time, values in Hive and the corresponding values in Vertica will differ by up to ten days. This applies to both DATE and TIMESTAMP values.

Kerberos

If the ORC file is located on an HDFS cluster that uses Kerberos authentication, Vertica uses the current user's principal to authenticate. It does not use the database's principal.

Query Performance

When working with external tables in ORC format, Vertica tries to improve performance in two ways: by pushing query execution closer to the data so less has to be read and transmitted, and by taking advantage of data locality in planning the query.

Predicate pushdownmoves parts of the query execution closer to the data, reducing the amount of data that must be read from disk or across the network. ORC files have three levels of indexing: file statistics, stripe statistics and row group indexes. Predicates are applied only to the first two levels.

Predicate pushdown works and is automatically applied for ORC files written with HIVE version 0.14 and later. ORC files written with earlier versions of HIVE might not contain the required statistics. When executing a query against an ORC file that lacks these statistics, Vertica logs an EXTERNAL_PREDICATE_PUSHDOWN_NOT_

SUPPORTED event in theQUERY_EVENTSsystem table. If you are seeing performance problems with your queries, check this table for these events.

In a cluster where Vertica nodes are co-located on HDFS nodes, the query can also take advantage of data locality. If data is on an HDFS node where a database node is also present, and if the query is not restricted to specific nodes using ON NODE, then

(26)

the query planner uses that database node to read that data. This allows Vertica to read data locally instead of making a network call.

You can see how much ORC data is being read locally by inspecting the query plan. The label for LoadStep(s) in the plan contains a statement of the form: "X% of ORC data matched with co-located Vertica nodes". To increase the volume of local reads,

consider adding more database nodes. HDFS data, by its nature, can't be moved to specific nodes, but if you run more database nodes you increase the likelihood that a database node is local to one of the copies of the data.

Examples

The following example shows how to read from all ORC files in a directory. It uses all supported data types.

CREATE EXTERNAL TABLE t (a1 TINYINT, a2 SMALLINT, a3 INT, a4 BIGINT, a5 FLOAT, a6 DOUBLE PRECISION, a7 BOOLEAN, a8 DATE, a9 TIMESTAMP,

a10 VARCHAR(20), a11 VARCHAR(20), a12 CHAR(20), a13 BINARY(20), a14 DECIMAL(10,5))

AS COPY FROM '/data/orc_test_*.orc' ORC;

The following example shows the error that is produced if the file you specify is not recognized as an ORC file:

CREATE EXTERNAL TABLE t (a1 TINYINT, a2 SMALLINT, a3 INT, a4 BIGINT, a5 FLOAT) AS COPY FROM '/data/not_an_orc_file.orc' ORC;

(27)

Using the HCatalog Connector

The Vertica HCatalog Connector lets you access data stored in Apache's Hive data warehouse software the same way you access it within a native Vertica table.

If your files are in the Optimized Columnar Row (ORC) format, you might be able to read them directly instead of going through this connector. For more information, seeUsing the ORC Reader.

Hive, HCatalog, and WebHCat Overview

There are several Hadoop components that you need to understand in order to use the HCatalog connector:

l Apache's Hive lets you query data stored in a Hadoop Distributed File System (HDFS) the same way you query data stored in a relational database. Behind the scenes, Hive uses a set of serializer and deserializer (SerDe) classes to extract data from files stored on the HDFS and break it into columns and rows. Each SerDe handles data files in a specific format. For example, one SerDe extracts data from comma-separated data files while another interprets data stored in JSON format. l Apache HCatalog is a component of the Hadoop ecosystem that makes Hive's

metadata available to other Hadoop components (such as Pig).

l WebHCat (formerly known as Templeton) makes HCatalog and Hive data available via a REST web API. Through it, you can make an HTTP request to retrieve data stored in Hive, as well as information about the Hive schema.

Vertica's HCatalog Connector lets you transparently access data that is available through WebHCat. You use the connector to define a schema in Vertica that

corresponds to a Hive database or schema. When you query data within this schema, the HCatalog Connector transparently extracts and formats the data from Hadoop into tabular data. The data within this HCatalog schema appears as if it is native to Vertica. You can even perform operations such as joins between Vertica-native tables and HCatalog tables. For more details, seeHow the HCatalog Connector Works.

HCatalog Connection Features

The HCatalog Connector lets you query data stored in Hive using the Vertica native SQL syntax. Some of its main features are:

(28)

l The HCatalog Connector always reflects the current state of data stored in Hive. l The HCatalog Connector uses the parallel nature of both Vertica and Hadoop to

process Hive data. The result is that querying data through the HCatalog Connector is often faster than querying the data directly through Hive.

l Since Vertica performs the extraction and parsing of data, the HCatalog Connector does not signficantly increase the load on your Hadoop cluster.

l The data you query through the HCatalog Connector can be used as if it were native Vertica data. For example, you can execute a query that joins data from a table in an HCatalog schema with a native table.

HCatalog Connection Considerations

There are a few things to keep in mind when using the HCatalog Connector:

l Hive's data is stored in flat files in a distributed filesystem, requiring it to be read and deserialized each time it is queried. This deserialization causes Hive's performance to be much slower than Vertica. The HCatalog Connector has to perform the same process as Hive to read the data. Therefore, querying data stored in Hive using the HCatalog Connector is much slower than querying a native Vertica table. If you need to perform extensive analysis on data stored in Hive, you should consider loading it into Vertica through the HCatalog Connector or the WebHDFS connector. Vertica optimization often makes querying data through the HCatalog Connector faster than directly querying it through Hive.

l Hive supports complex data types such as lists, maps, and structs that Vertica does not support. Columns containing these data types are converted to a

JSON representation of the data type and stored as a VARCHAR. SeeData Type Conversions from Hive to Vertica.

Note:The HCatalog Connector is read only. It cannot insert data into Hive.

How the HCatalog Connector Works

When planning a query that accesses data from a Hive table, the Vertica HCatalog Connector on the initiator node contacts the WebHCat server in your Hadoop cluster to determine if the table exists. If it does, the connector retrieves the table's metadata from the metastore database so the query planning can continue. When the query executes, all nodes in the Vertica cluster directly retrieve the data necessary for completing the

(29)

query from the Hadoop HDFS. They then use the Hive SerDe classes to extract the data so the query can execute.

This approach takes advantage of the parallel nature of both Vertica and Hadoop. In addition, by performing the retrieval and extraction of data directly, the HCatalog Connector reduces the impact of the query on the Hadoop cluster.

HCatalog Connector Requirements

Before you can use the HCatalog Connector, both your Vertica and Hadoop installations must meet the following requirements.

Vertica Requirements

All of the nodes in your cluster must have a Java Virtual Machine (JVM) installed. See

Installing the Java Runtime on Your Vertica Cluster.

You must also add certain libraries distributed with Hadoop and Hive to your Vertica installation directory. SeeConfiguring Vertica for HCatalog.

Hadoop Requirements

Your Hadoop cluster must meet several requirements to operate correctly with the Vertica Connector for HCatalog:

(30)

l It must have Hive and HCatalog installed and running. See Apache'sHCatalogpage for more information.

l It must have WebHCat (formerly known as Templeton) installed and running. See Apache' sWebHCatpage for details.

l The WebHCat server and all of the HDFS nodes that store HCatalog data must be directly accessible from all of the hosts in your Vertica database. Verify that any firewall separating the Hadoop cluster and the Vertica cluster will pass WebHCat, metastore database, and HDFS traffic.

l The data that you want to query must be in an internal or external Hive table.

l If a table you want to query uses a non-standard SerDe, you must install the SerDe's classes on your Vertica cluster before you can query the data. SeeUsing Non-Standard SerDes.

Testing Connectivity

To test the connection between your database cluster and WebHcat, log into a node in your Vertica cluster. Then, run the following command to execute an HCatalog query:

$ curl http://webHCatServer:port/templeton/v1/status?user.name=hcatUsername

Where:

l webHCatServeris the IP address or hostname of the WebHCat server l portis the port number assigned to the WebHCat service (usually 50111) l hcatUsernameis a valid username authorized to use HCatalog

Usually, you want to append;echoto the command to add a linefeed after the curl command's output. Otherwise, the command prompt is automatically appended to the command's output, making it harder to read.

For example:

$ curl http://hcathost:50111/templeton/v1/status?user.name=hive; echo

If there are no errors, this command returns a status message in JSON format, similar to the following:

{"status":"ok","version":"v1"}

This result indicates that WebHCat is running and that the Vertica host can connect to it and retrieve a result. If you do not receive this result, troubleshoot your Hadoop

(31)

installation and the connectivity between your Hadoop and Vertica clusters. For details, seeTroubleshooting HCatalog Connector Problems.

You can also run some queries to verify that WebHCat is correctly configured to work with Hive. The following example demonstrates listing the databases defined in Hive and the tables defined within a database:

$ curl http://hcathost:50111/templeton/v1/ddl/database?user.name=hive; echo {"databases":["default","production"]}

$ curl http://hcathost:50111/templeton/v1/ddl/database/default/table?user.name=hive; echo {"tables":["messages","weblogs","tweets","transactions"],"database":"default"}

See Apache'sWebHCat referencefor details about querying Hive using WebHCat.

Installing the Java Runtime on Your Vertica

Cluster

The HCatalog Connector requires a 64-bit Java Virtual Machine (JVM). The JVM must support Java 6 or later, and must be the same version as the one installed on your Hadoop nodes.

Note:If your Vertica cluster is configured to execute User Defined Extensions (UDxs) written in Java, it already has a correctly-configured JVM installed. See

Developing User Defined Functions in Javain Extending Vertica for more information.

Installing Java on your Vertica cluster is a two-step process: 1. Install a Java runtime on all of the hosts in your cluster.

2. Set the JavaBinaryForUDx configuration parameter to tell Vertica the location of the Java executable.

Installing a Java Runtime

For Java-based features, Vertica requires a 64-bit Java 6 (Java version 1.6) or later Java runtime. Vertica supports runtimes from either Oracle orOpenJDK. You can choose to install either the Java Runtime Environment (JRE) or Java Development Kit (JDK), since the JDK also includes the JRE.

Many Linux distributions include a package for the OpenJDK runtime. See your Linux distribution's documentation for information about installing and configuring OpenJDK. To install the Oracle Java runtime, see theJava Standard Edition (SE) Download Page. You usually run the installation package as root in order to install it. See the download page for instructions.

(32)

Once you have installed a JVM on each host, ensure that thejavacommand is in the search path and calls the correct JVM by running the command:

$ java -version

This command should print something similar to:

java version "1.6.0_37"

Java(TM) SE Runtime Environment (build 1.6.0_37-b06)

Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01, mixed mode)

Note:Any previously installed Java VM on your hosts may interfere with a newly installed Java runtime. See your Linux distribution's documentation for instructions on configuring which JVM is the default. Unless absolutely required, you should uninstall any incompatible version of Java before installing the Java 6 or Java 7 runtime.

Setting the JavaBinaryForUDx Configuration

Parameter

The JavaBinaryForUDx configuration parameter tells Vertica where to look for the JRE to execute Java UDxs. After you have installed the JRE on all of the nodes in your cluster, set this parameter to the absolute path of the Java executable. You can use the symbolic link that some Java installers create (for example/usr/bin/java). If the Java executable is in your shell search path, you can get the path of the Java executable by running the following command from the Linux command line shell:

$ which java /usr/bin/java

If thejavacommand is not in the shell search path, use the path to the Java executable in the directory where you installed the JRE. Suppose you installed the JRE in

/usr/java/default(which is where the installation package supplied by Oracle installs the Java 1.6 JRE). In this case the Java executable is

/usr/java/default/bin/java.

You set the configuration parameter by executing the following statement as adatabase superuser:

=> ALTER DATABASE mydb SET JavaBinaryForUDx = '/usr/bin/java';

SeeALTER DATABASEfor more information on setting configuration parameters. To view the current setting of the configuration parameter, query the

(33)

=> \x

Expanded display is on.

description | Path to the java binary for executing UDx written in Java

Once you have set the configuration parameter, Vertica can find the Java executable on each node in your cluster.

Note:Since the location of the Java executable is set by a single configuration parameter for the entire cluster, you must ensure that the Java executable is installed in the same path on all of the hosts in the cluster.

Configuring Vertica for HCatalog

Before you can use the HCatalog Connector, you must add certain Hadoop and Hive libraries to your Vertica installation. You must also copy the Hadoop configuration files that specify various connection properties. Vertica uses the values in those

configuration files to make its own connections to Hadoop.

You need only make these changes on one node in your cluster. After you do this you can install the HCatalog connector.

Copy Hadoop Libraries and Configuration Files

Vertica provides a tool, hcatUtil, to collect the required files from Hadoop. This tool copies selected libraries and XML configuration files from your Hadoop cluster to your Vertica cluster. This tool might also need access to additional libraries:

l If you plan to use Hive to query files that use Snappy compression, you need access to the Snappy native libraries, libhadoop*.so and libsnappy*.so.

l If you plan to use a JSON SerDe with a Hive table, you need access to its library. This is the same library that you used to configure Hive; for example:

hive> add jar /home/release/json-serde-1.3-jar-with-dependencies.jar;

(34)

ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION '/user/release/vt/nationjson';

If either of these cases applies to you, do one of the following:

l Include the path(s) in the path you specify as the value of --hcatLibPath, or l Copy the file(s) to a directory already on that path.

If Vertica is not co-located on a Hadoop node, you should do the following:

1. Copy /opt/vertica/packages/hcat/tools/hcatUtil to a Hadoop node and run it there, specifying a temporary output directory. Your Hadoop, HIVE, and HCatalog lib paths might be different; in particular, in newer versions of Hadoop the HCatalog directory is usually a subdirectory under the HIVE directory. Use the values from your

environment in the following command:

hcatUtil --copyJars --hadoopHiveHome="/hadoop/lib;/hive/lib;/hcatalog/dist/share" --hadoopHiveConfPath="/hadoop;/hive;/webhcat"

--hcatLibPath=/tmp/hadoop-files

2. Verify that all necessary files were copied:

hcatUtil --verifyJars --hcatLibPath=/tmp/hadoop-files

3. Copy that output directory (/tmp/hadoop-files, in this example) to

/opt/vertica/packages/hcat/lib on the Vertica node you will connect to when installing the HCatalog connector. If you are updating a Vertica cluster to use a new Hadoop cluster (or a new version of Hadoop), first remove all JAR files in

/opt/vertica/packages/hcat/lib except vertica-hcatalogudl.jar. 4. Verify that all necessary files were copied:

hcatUtil --verifyJars --hcatLibPath=/opt/vertica/packages/hcat

If you are using the Vertica for SQL on Hadoop product with co-located clusters, you can do this in one step on a shared node. Your Hadoop, HIVE, and HCatalog lib paths might be different; use the values from your environment in the following command:

hcatUtil --copyJars --hadoopHiveHome="/hadoop/lib;/hive/lib;/hcatalog/dist/share" --hadoopHiveConfPath="/hadoop;/hive;/webhcat"

--hcatLibPath=/opt/vertica/packages/hcat/lib

(35)

-c, --copyJars copy the required JARs from hadoopHivePath to hcatLibPath.

-v, --verifyJars verify that the required JARs are present in hcatLibPath.

--hadoopHiveHome=

"value1;value2;..."

paths to the Hadoop, Hive, and HCatalog home directories. Separate multiple paths by a semicolon (;). Enclose paths in double quotes.

In newer versions of Hadoop, look for the HCatalog directory under the HIVE directory (for example, /hive/hcatalog/share).

--hcatLibPath=

"value1;value2;..."

output path of the lib/ folder of the HCatalog dependency JARs. Usually this is /opt/vertica/packages/hcat. You may use any folder, but make sure to copy all JARs to the hcat/lib folder before installing the HCatalog connector. If you have previously run hcatUtil with a different version of Hadoop, remove the old JAR files first (all except vertica-hcatalogudl.jar).

--hadoopHiveConfPath=

"value"

paths of the Hadoop, HIVE, and other components'

configuration files (such as core-site.xml, hive-site.xml, and webhcat-site.xml). Separate multiple paths by a semicolon (;). Enclose paths in double quotes. These files contain values that would otherwise have to be specified to

CREATE HCATALOG SCHEMA.

If you are using Cloudera, or if your HDFS cluster uses Kerberos authentication, this parameter is required. Otherwise this parameter is optional.

Once you have copied the files and verified them, install the HCatalog connector.

Install the HCatalog Connector

On the same node where you copied the files from hcatUtil, install the HCatalog connector by running the install.sql script. This script resides in the ddl/ folder under your HCatalog connector installation path. This script creates the library and

VHCatSource and VHCatParser.

Note:The data that was copied using hcatUtil is now stored in the database. If you change any of those values in Hadoop, you need to rerun hcatUtil and install.sql. The following statement returns the names of the libraries and configuration files currently being used:

(36)

=> SELECT dependencies FROM user_libraries WHERE lib_name='VHCatalogLib';

Now you can create HCatalog schema parameters, which point to your existing Hadoop/Hive/WebHCat services, as described inDefining a Schema Using the HCatalog Connector.

Using the HCatalog Connector with HA

NameNode

Newer distributions of Hadoop support the High Availability NameNode (HA NN) for HDFS access. Some additional configuration is required to use this feature with the HCatalog Connector. If you do not perform this configuration, attempts to retrieve data through the connector will produce an error.

To use HA NN with Vertica, first copy /etc/hadoop/conf from the HDFS cluster to every node in your Vertica cluster. You can put this directory anywhere, but it must be in the same location on every node. (In the example below it is in /opt/hcat/hadoop_conf.) Then uninstall the HCat library, configure the UDx to use that configuration directory, and reinstall the library:

=> \i /opt/vertica/packages/hcat/ddl/uninstall.sql DROP LIBRARY

=> ALTER DATABASE mydb SET JavaClassPathSuffixForUDx = '/opt/hcat/hadoop_conf';

WARNING 2693: Configuration parameter JavaClassPathSuffixForUDx has been deprecated; setting it has no effect

=> \i /opt/vertica/packages/hcat/ddl/install.sql CREATE LIBRARY

CREATE SOURCE FUNCTION GRANT PRIVILEGE CREATE PARSER FUNCTION GRANT PRIVILEGE

Despite the warning message, this step is necessary. After taking these steps, HCatalog queries will now work.

Defining a Schema Using the HCatalog

Connector

After you set up the HCatalog Connector, you can use it to define a schema in your Vertica database to access the tables in a Hive database. You define the schema using

the CREATE HCATALOG SCHEMA statement. SeeCREATE HCATALOG SCHEMA

(37)

When creating the schema, you must supply at least two pieces of information: l the name of the schema to define in Vertica

l the host name or IP address of Hive's metastore database (the database server that contains metadata about Hive's data, such as the schema and table definitions) Other parameters are optional. If you do not supply a value, Vertica uses default values. After you define the schema, you can query the data in the Hive data warehouse in the same way you query a native Vertica table. The following example demonstrates creating an HCatalog schema and then querying several system tables to examine the contents of the new schema. SeeViewing Hive Schema and Table Metadatafor more information about these tables.

=> CREATE HCATALOG SCHEMA hcat WITH hostname='hcathost' HCATALOG_SCHEMA='default' -> HCATALOG_USER='hcatuser';

CREATE SCHEMA

=> -- Show list of all HCatalog schemas => \x

Expanded display is on.

(38)

Querying Hive Tables Using HCatalog

Connector

Once you have defined the HCatalog schema, you can query data from the Hive database by using the schema name in your query.

=> SELECT * from hcat.messages limit 10;

messageid | userid | time | message

8 | ZJWg185DkZ | 2013-10-29 01:25:51 | sapien adipiscing eget Aliquam tor 9 | E7ipAsYC3 | 2013-10-29 01:36:35 | varius Cum iaculis metus

10 | kStCv | 2013-10-29 01:47:19 | aliquam libero nascetur Cum mal (10 rows)

Since the tables you access through the HCatalog Connector act like Vertica tables, you can perform operations that use both Hive data and native Vertica data, such as a join:

=> SELECT u.FirstName, u.LastName, d.time, d.Message from UserData u -> JOIN hcat.messages d ON u.UserID = d.UserID LIMIT 10;

FirstName | LastName | time | Message

Fay | Moss | 2013-10-29 00:53:39 | sit eleifend tempus a aliquam mauri

Viewing Hive Schema and Table Metadata

When using Hive, you access metadata about schemas and tables by executing statements written in HiveQL (Hive's version of SQL) such asSHOW TABLES. When using the HCatalog Connector, you can get metadata about the tables in the Hive database through several Vertica system tables.

There are four system tables that contain metadata about the tables accessible through the HCatalog Connector:

(39)

l HCATALOG_SCHEMATA lists all of the schemas that have been defined using the HCatalog Connector. SeeHCATALOG_SCHEMATAin the SQL Reference Manual for detailed information.

l HCATALOG_TABLE_LIST contains an overview of all of the tables available from all schemas defined using the HCatalog Connector. This table only shows the tables which the user querying the table can access. The information in this table is retrieved using a single call to WebHCat for each schema defined using the HCatalog Connector, which means there is a little overhead when querying this table. SeeHCATALOG_TABLE_LISTin the SQL Reference Manual for detailed information.

l HCATALOG_TABLES contains more in-depth information than HCATALOG_ TABLE_LIST. However, querying this table results in Vertica making a REST web service call to WebHCat for each table available through the HCatalog Connector. If there are many tables in the HCatalog schemas, this query could take a while to complete. SeeHCATALOG_TABLESin the SQL Reference Manual for more information.

l HCATALOG_COLUMNS lists metadata about all of the columns in all of the tables available through the HCatalog Connector. Similarly to HCATALOG_TABLES, querying this table results in one call to WebHCat per table, and therefore can take a while to complete. SeeHCATALOG_COLUMNSin the SQL Reference Manual for more information.

The following example demonstrates querying the system tables containing metadata for the tables available through the HCatalog Connector.

=> CREATE HCATALOG SCHEMA hcat WITH hostname='hcathost'

-> HCATALOG_SCHEMA='default' HCATALOG_DB='default' HCATALOG_USER='hcatuser'; CREATE SCHEMA

=> SELECT * FROM HCATALOG_SCHEMATA;

=> SELECT * FROM HCATALOG_TABLE_LIST; -[ RECORD 1 ]---+---table_schema_id | 45035996273864536

(40)

=> -- Get detailed description of a specific table

=> SELECT * FROM HCATALOG_TABLES WHERE table_name = 'msgjson';

=> -- Get list of columns in a specific table

(41)

(42)

(43)

Synching an HCatalog Schema With a Local

Schema

Querying data from an HCatalog schema can be slow due to Hive and WebHCat performance issues. This slow performance can be especially annoying when you use the HCatalog Connector to query the HCatalog schema's metadata to examine the structure of the tables in the Hive database.

To avoid this problem you can use the SYNC_WITH_HCATALOG_SCHEMA function to create a snapshot of the HCatalog schema's metadata within a Vertica schema. You supply this function with the name of a pre-existing Vertica schema and an HCatalog schema available through the HCatalog Connector. It creates a set of external tables within the Vertica schema that you can then use to examine the structure of the tables in the Hive database. Because the metadata in the Vertica schema is local, query planning is much faster. You can also use standard Vertica statements and system tables queries to examine the structure of Hive tables in the HCatalog schema.

Caution:The SYNC_WITH_HCATALOG_SCHEMA function overwrites tables in the Vertica schema whose names match a table in the HCatalog schema. To avoid losing data, always create an empty Vertica schema to sync with an HCatalog schema.

The Vertica schema is just a snapshot of the HCatalog schema's metadata. Vertica does not synchronize later changes to the HCatalog schema with the local schema after you call SYNC_WITH_HCATALOG_SCHEMA. You can call the function again to re-synchronize the local schema to the HCatalog schema.

Note:By default, the function does not drop tables that appear in the local schema that do not appear in the HCatalog schema. Thus after the function call the local schema does not reflect tables that have been dropped in the Hive database. You can change this behavior by supplying the optional third Boolean argument that tells

(44)

the function to drop any table in the local schema that does not correspond to a table in the HCatalog schema.

The following example demonstrates calling SYNC_WITH_HCATALOG_SCHEMA to sync the HCatalog schema named hcat with a local schema.

=> CREATE SCHEMA hcat_local; CREATE SCHEMA

=> SELECT sync_with_hcatalog_schema('hcat_local', 'hcat'); sync_with_hcatalog_schema

---Schema hcat_local synchronized with hcat tables in hcat = 56

tables altered in hcat_local = 0 tables created in hcat_local = 56 stale tables in hcat_local = 0 table changes erred in hcat_local = 0 (1 row)

=> -- Use vsql's \d command to describe a table in the synced schema

=> \d hcat_local.messages List of Fields by Tables

Foreign Key

----+---+--+--++--+---+---+---

---hcat_local | messages | id | int | 8 | | f | f |

hcat_local | messages | userid | varchar(65000) | 65000 | | f | f |

hcat_local | messages | "time" | varchar(65000) | 65000 | | f | f |

hcat_local | messages | message | varchar(65000) | 65000 | | f | f |

(4 rows)

Note:You can query tables in the local schema you synched with an HCatalog schema. Querying tables in a synched schema isn't much faster than directly querying the HCatalog schema because SYNC_WITH_HCATALOG_SCHEMA only duplicates the HCatalog schema's metadata. The data in the table is still retrieved using the HCatalog Connector,

Data Type Conversions from Hive to

Vertica

The data types recognized by Hive differ from the data types recognize by Vertica. The following table lists how the HCatalog Connector converts Hive data types into data types compatible with Vertica.

(45)

Hive Data Type Vertica Data Type

TINYINT (1-byte) TINYINT (8-bytes) SMALLINT (2-bytes) SMALLINT (8-bytes)

INT (4-bytes) INT (8-bytes)

BIGINT (8-bytes) BIGINT (8-bytes)

BOOLEAN BOOLEAN

FLOAT (4-bytes) FLOAT (8-bytes) DECIMAL (precision,

scale)

DECIMAL (precision, scale)

DOUBLE (8-bytes) DOUBLE PRECISION (8-bytes)

CHAR (length) CHAR (length)

VARCHAR (length) VARCHAR (length), if length <= 65000 LONG VARCHAR (length), if length > 65000 STRING (2 GB max) VARCHAR (65000)

BINARY (2 GB max) VARBINARY (65000)

DATE DATE

TIMESTAMP TIMESTAMP

LIST/ARRAY VARCHAR (65000) containing a JSON-format

representation of the list.

MAP VARCHAR (65000) containing a JSON-format

representation of the map.

STRUCT VARCHAR (65000) containing a JSON-format

representation of the struct.

Data-Width Handling Differences Between Hive

and Vertica

The HCatalog Connector relies on Hive SerDe classes to extract data from files on HDFS. Therefore, the data read from these files are subject to Hive's data width restrictions. For example, suppose the SerDe parses a value for an INT column into a

Integrating with Apache Hadoop HPE Vertica Analytic Database. Software Version: 7.2.x

HPE Vertica Analytic Database

Introduction to Hadoop Integration

Hadoop Distributions

Integration Options

Cluster Layout

Co-Located Clusters

Hardware Recommendations

Configuring Hadoop for Co-Located

Clusters

webHDFS

YARN

Hadoop Balancer

Replication Factor

Disk Space for Non-HDFS Use

Separate Clusters

Choosing Which Hadoop Interface to

Use

Creating an HDFS Storage Location

Using the ORC Reader

Using the HCatalog Connector

Using the HDFS Connector

Using the MapReduce Connector

Using Kerberos with Hadoop

How Vertica uses Kerberos With Hadoop

User Authentication

Vertica Authentication

See Also

Configuring Kerberos

Prerequisite:

Setting Up Users and the Keytab

File

HCatalog Connector

HDFS Connector

HDFS Storage Location

Token Expiration

See Also

Using the ORC

Reader

Syntax

Supported Data Types

Timestamps

Kerberos

Query Performance

Examples

Using the HCatalog Connector

Hive, HCatalog, and WebHCat Overview

HCatalog Connection Features

HCatalog Connection Considerations

How the HCatalog Connector Works

HCatalog Connector Requirements

Vertica Requirements

Hadoop Requirements

Testing Connectivity

Installing the Java Runtime on Your Vertica

Cluster

Installing a Java Runtime

Setting the JavaBinaryForUDx Configuration

Parameter

Configuring Vertica for HCatalog

Copy Hadoop Libraries and Configuration Files

Install the HCatalog Connector

Using the HCatalog Connector with HA

NameNode

Defining a Schema Using the HCatalog

Connector

Querying Hive Tables Using HCatalog

Connector

Viewing Hive Schema and Table Metadata

Synching an HCatalog Schema With a Local

Schema

Data Type Conversions from Hive to

Vertica

Data-Width Handling Differences Between Hive

and Vertica