SAS In-Database Code Accelerator for Hadoop

• SAS/ACCESS Interface to your database (Greenplum, Hadoop, or Teradata)

• The SAS Embedded Process must be installed and configured on your database.

For information about installing and configuring the SAS Embedded Process, see SAS In-Database Products: Administrator's Guide.

• Your DS2 code includes a thread program and a data program.

• The table used as input to the thread program must reside in the database.

• Either the PROC DS2 DS2ACCEL option must be set to YES or the DS2ACCEL system option must be set to ANY.

With in-database processing, data is distributed on different data partitions. Each DS2 thread that is running inside the database has access to its own data partition. When doing BY-group processing, each DS2 thread with a BY statement can group and order only the rows in the same data partition. The data partition might have only part of the entire group of data. You would need to do a final aggregation in the main data program.

However, if you use the PROC DS2 statement’s BYPARTITION=YES option, the entire group of data resides on the same data partition. For more information, see “BY-Group Processing When Running Thread Programs inside the Database” on page 181.

SAS In-Database Code Accelerator for Greenplum

When you use the SAS In-Database Code Accelerator for Greenplum, the thread program and its associated files (format files, packages, and so on) are published to the database. The thread program is executed inside the database, and its result is brought to the data program running on client machine for final processing or aggregation if needed.

SAS In-Database Code Accelerator for Hadoop

Overview

When you use the SAS In-Database Code Accelerator for Hadoop, the data and thread programs run in the MapReduce framework, either MapReduce 1 or YARN/MapReduce 2.

Note: The SAS In-Database Code Accelerator for Hadoop supports only Cloudera 5.2 and Hortonworks 2.1 or later.

Supported File Types

The SAS In-Database Code Accelerator for Hadoop supports these file types:

• Hive: Avro*

• Hive: delimited

176 Chapter 14 • Using the SAS In-Database Code Accelerator

• Hive: ORC*

• Hive: Parquet*

• Hive: RCFile*

• Hive: sequence

• HDMD: binary

• HDMD: delimited

• HDMD: sequence

• HDMD: XML

*In the February 2015 release, the SAS In-Database Code Accelerator for Hadoop supports these file types.

T I P Partitioned Avro or Parquet data is not supported as input to the SAS In-Database Code Accelerator for Hadoop.

T I P The availability of these file types depends on the version of Hive that you use.

The following file types are not supported:

• SPD file format

• SASHDAT

Automatic File Compression with SAS Hadoop

By default, the SAS In-Database Code Accelerator for Hadoop automatically compresses certain output files.

The following default file compressions apply to Hive files unless the user has explicitly configured another compression algorithm:

• Delimited files are not automatically compressed.

• ORC files compress themselves using ZLIB.

• Avro, Parquet, and Sequence files are automatically compressed using Snappy.

HDMD files are never automatically compressed.

Using HCatalog within the SAS Environment

HCatalog is a table management layer that presents a relational view of data in the HDFS to applications within the Hadoop ecosystem. With HCatalog, data structures that are registered in the Hive metastore, including SAS data, can be accessed through standard MapReduce code and Pig. HCatalog is part of Apache Hive.

In the February 2015 release, the SAS In-Database Code Accelerator for Hadoop uses HCatalog to process complex, non-delimited files.

SAS In-Database Code Accelerator for Hadoop 177

Table 14.1 Summary of HCatalog File I/O

File Type Input Output

delimited HDFS direct-read

HCat if partitioned, skewed, or escaped

HDFS direct-read

HCat if partitioned, skewed, or escaped

RCFile HCat HCat

ORC HCat HCat

Parquet HCat^* CREATE TABLE AS SELECT^**

sequence HDMD

HCat if partitioned or skewed

HCat

Avro HCat^* CREATE TABLE AS SELECT^***

* Partitioned input is not supported.

** Unable to write output directly to Parquet files due to these issues: https://issues.apache.org/jira/browse/

HIVE-8838.

*** Unable to write output directly to Avro files due to these issues: https://issues.apache.org/jira/browse/

HIVE-8687.

Consider these requirements when using HCatalog:

• Data that you want to access with HCatalog must first be registered in the Hive metastore.

• The recommended Hive version for the SAS In-Database Code Accelerator for Hadoop is 0.13.0.

• Avro is not a native file type. Additional JAR files are required and must be defined in the SAS_HADOOP_JAR_PATH environment variable.

• Support for HCatalog varies by vendor. For more information, see the documentation for your Hadoop vendor.

Additional Prerequisites When Accessing Files That Are Processed Using HCatalog

If you plan to access complex, non-delimited file types such as Avro or Parquet, there are additional prerequisites:

• To access Avro file types, the avro-1.7.4.jar file must be added to the

SAS_HADOOP_JAR_PATH environment variable. To access Parquet file types, the parquet-hadoop-bundle.jar file must be added to the SAS_HADOOP_JAR_PATH environment variable.

• If you have performed a default CLASSPATH installation, the HCatalog JAR files that are required to access the Hive metastore within a MapReduce job are

automatically included in the Hadoop CLASSPATH. Otherwise, you must manually include the HCatalog JAR files in either the MapReduce library or the Hadoop CLASSPATH.

178 Chapter 14 • Using the SAS In-Database Code Accelerator

BY-Group Processing with Hadoop

When there is no BY statement in the thread program, the number of reducers is set to 0, and the program is run as a map-only task. When there is a BY statement in the thread program and the PROC DS2 statement uses the BYPARTITION=YES option, a MapReduce task runs, where the map task partitions the data, and the reducer task runs the DS2 thread program.

Note: The SAS In-Database Code Accelerator for Hadoop might not produce sorted BY groups when re-partitioning is involved.

For more information, see “BY-Group Processing When Running Thread Programs inside the Database” on page 181.

Using the DBCREATE_TABLE_OPTS Table Option

The DBCREATE_TABLE_OPTS table option is used to provide a free form string in the DATA statement. For the SAS In-Database Code Accelerator for Hadoop, you can use the DBCREATE_TABLE_OPTS table option to specify the output SerDe, the output delimiter of the Hive table, the output escaped by, and any other CREATE TABLE syntax allowed by Hive.

For more information, see “DBCREATE_TABLE_OPTS= Table Option” in SAS DS2 Language Reference.

In document SAS 9.4 In-Database Products (Page 190-193)