9.4 SPD Engine: Storing Data in the Hadoop Distributed File System

(1)

SAS

^®

9.4 SPD Engine:

Storing Data in the

Hadoop Distributed File System

Third Edition

SAS^® Documentation

(2)

For a hard-copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

For a web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication.

The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do not participate in or encourage electronic piracy of copyrighted materials. Your support of others' rights is appreciated.

U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer software developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government. Use, duplication or disclosure of the Software by the United States Government is subject to the license terms of this Agreement pursuant to, as applicable, FAR 12.212, DFAR 227.7202-1(a), DFAR 227.7202-3(a) and DFAR 227.7202-4 and, to the extent required under U.S. federal law, the minimum restricted rights as set out in FAR 52.227-19 (DEC 2007). If FAR 52.227-19 is applicable, this provision serves as notice under clause (c) thereof and no other notice is required to be affixed to the Software or documentation. The Government's rights in Software and documentation shall be only those set forth in this Agreement.

SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513-2414.

July 2015

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.

(3)

What ʼs New

What’s New in the SAS 9.4 SPD Engine to Store Data in HDFS

Overview

In the second maintenance release for SAS 9.4, the SPD Engine has improved

performance. The SPD Engine creates a SAS index much faster, sets a larger I/O block size and expands the scope of the block size, expands parallel processing support for Read operations, performs data filtering in the Hadoop cluster, and enables you to control the number of MapReduce tasks when writing data in HDFS.

In the third maintenance release for SAS 9.4, the SPD Engine expands the supported Hadoop distributions, enables parallel processing for Write operations, expands WHERE processing optimization with more WHERE expression syntax, enhances file system locking by enabling you to specify a pathname for the SPD Engine lock

directory, supports distributed locking, and provides a custom Hive SerDe so that SPD Engine data stored in HDFS can be accessed using Hive.

v

(6)

Hadoop Distribution Support

In the third maintenance release for SAS 9.4, the SPD Engine has expanded the supported Hadoop distributions. For the list of supported Hadoop distributions, see

“Hadoop Distribution Support” on page 6.

Improved Performance When Creating a SAS Index

In the second maintenance release for SAS 9.4, when you create a SAS index for a data set in HDFS, the performance of creating a large index is significantly improved because the index is partitioned. For more information, see “Creating SAS Indexes” on page 11.

Improved Performance By Setting SPD Engine I/O Block Size

In the second maintenance release for SAS 9.4, the scope of the SPD Engine I/O block size is expanded. The default block size is larger at 1,048,576 bytes (1 megabyte). The block size affects compressed, uncompressed, and encrypted data sets. The block size influences the size of I/O operations when reading all data sets and writing compressed data sets. For more information, see “I/O Operation Performance” on page 11. To specify an I/O block size, use the IOBLOCKSIZE= data set option on page 40 or the new IOBLOCKSIZE= LIBNAME statement option on page 33.

(7)

Improved Performance of Reading Data in HDFS

In the second maintenance release for SAS 9.4, to improve the performance of reading data stored in HDFS, the SPD Engine has expanded its support of parallel processing.

You can request parallel processing for all Read operations of data stored in HDFS. For more information, see “Parallel Processing for Data in HDFS” on page 12. To request parallel processing for all Read operations of data stored in HDFS, use the

SPDEPARALLELREAD= system option on page 45, the PARALLELREAD= LIBNAME statement option on page 36, or the PARALLELREAD= data set option on page 42.

Improved Performance of Writing Data to HDFS

In the third maintenance release for SAS 9.4, you can now request parallel processing for all Write operations in HDFS. For more information, see “Parallel Processing for Data in HDFS” on page 12. To request parallel processing for Write operations, use the PARALLELWRITE= LIBNAME statement option on page 36 or the

PARALLELWRITE= data set option on page 43.

Optimized WHERE Processing

To optimize the performance of WHERE processing, you can request that data subsetting be performed in the Hadoop cluster, which takes advantage of the filtering and ordering capabilities of the MapReduce framework. For more information, see

“WHERE Processing Optimization with MapReduce” on page 15. To request that data subsetting be performed in the Hadoop cluster, use the ACCELWHERE= LIBNAME statement option on page 31 or the ACCELWHERE= data set option on page 39.

Optimized WHERE Processing vii

(8)

In the third maintenance release for SAS 9.4, optimized WHERE processing is

expanded to include more operators and compound expressions. For more information, see “WHERE Expression Syntax Support” on page 16.

Controlling Tasks When Writing Data in HDFS

In the second maintenance release for SAS 9.4, to specify the number of MapReduce tasks when writing data in HDFS, you can use the NUMTASKS= LIBNAME statement option. This option controls parallel processing on the Hadoop cluster when writing output from a SAS High-Performance Analytics procedure. For more information, see the NUMTASKS= LIBNAME statement option on page 35.

SPD Engine File System Locking

In the second maintenance release for SAS 9.4, the SPD Engine implements a locking strategy that honors the HDFS concurrent access model and provides additional levels of concurrent access to ensure the integrity of the data stored in HDFS. For more information, see “SPD Engine File System Locking” on page 18.

In the third maintenance release for SAS 9.4, to store the lock files, the SPD Engine creates a lock directory in the /tmp directory. You can specify a pathname for the SPD Engine lock directory by defining the new SAS environment variable SPDELOCKPATH.

For more information, see “SPDELOCKPATH SAS Environment Variable” on page 51.

SPD Engine Distributed Locking

In the third maintenance release for SAS 9.4, the SPD Engine supports distributed locking for data stored in HDFS. Distributed locking provides synchronization and group

(9)

coordination services to clients over a network connection. For more information, see

“SPD Engine Distributed Locking” on page 20.

To request SPD Engine distributed locking, you must first create an XML configuration file, and then define the SAS environment variable SPDE_CONFIG_FILE to specify the location of the user-defined XML file that is available to the SAS client machine. For more information, see “SPDE_CONFIG_FILE SAS Environment Variable” on page 46.

Configuring the SPD Engine to Store Data in HDFS

To store data in HDFS using the SPD Engine, required Hadoop JAR files and Hadoop cluster configuration files must be available to the SAS client machine. For information about configuring the SPD Engine, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

Accessing SPD Engine Data Using Hive

In the third maintenance release for SAS 9.4, SAS provides a custom Hive SerDe for SPD Engine data that is stored in HDFS. The SerDe makes the data available for

applications outside of SAS to query using HiveQL. For more information, see Appendix 1, “Hive SerDe for SPD Engine Data,” on page 73.

Accessing SPD Engine Data Using Hive ix

(10)

(11)

1

Introduction to Storing Data in HDFS

Deciding to Store Data in HDFS . . . 1

Using the SPD Engine to Store Data in HDFS . . . 2

What Is the SPD Engine? . . . 2

Understanding the SPD Engine File Format . . . 2

How to Use the SPD Engine . . . 3

Deciding to Store Data in HDFS

Storing data in the Hadoop Distributed File System (HDFS) is a good strategy for very large data sets. HDFS is a component of Apache Hadoop, which is an open-source software framework of tools that are written in Java. HDFS provides distributed data storage and processing of large amounts of data. Reasons for storing SAS data in HDFS include the following:

n HDFS is a low-cost alternative for data storage. Organizations are exploring it as an alternative to commercial relational database solutions.

n HDFS is well suited for distributed storage and processing using commodity

hardware. It is fault tolerant, scalable, and simple to expand. HDFS manages files as blocks of equal size, which are replicated across the machines in a Hadoop cluster to provide fault tolerance.

n SAS provides support within the current SAS product offering and product roadmap.

SAS provides the ability to manage, process, and analyze data in HDFS.

1

(12)

n Hadoop storage is for big data. If standard SAS optimization techniques such as indexes no longer meet your performance needs, then storing the data in HDFS could improve performance.

Using the SPD Engine to Store Data in HDFS

What Is the SPD Engine?

The SAS Scalable Performance Data (SPD) Engine is a scalable engine delivered to SAS customers as part of Base SAS. The SPD Engine is designed for high-

performance data delivery, reading data sets that contain billions of observations. The engine uses threads to read data very rapidly and in parallel.

The SPD Engine reads, writes, and updates data in HDFS. You can use the SPD Engine with standard SAS applications to retrieve data for analysis, perform administrative functions, and update the data.

Understanding the SPD Engine File Format

The SPD Engine organizes data into a streamlined file format that has advantages for a distributed file system like HDFS. The advantages of the SPD Engine file format include the following:

n Data is separate from the metadata. The file format consists of separate files: one for data, one for metadata, and two for indexes. Each type of file has an identifying file extension. The extensions are .dpf for data, .mdf for metadata, and .hbx and .idx for indexes.

n The SPD Engine file format partitions the data by spreading it across multiple files based on a partition size. Each partition is stored as a separate physical file with the extension .dpf. Depending on the amount of data and the partition size, the data can consist of one or more physical files, but is referenced as one logical file.

(13)

The default partition size is 128 megabytes. You can specify a different partition size with the PARTSIZE= LIBNAME statement option on page 38 or the PARTSIZE=

data set option on page 41.

How to Use the SPD Engine

The SPD Engine works like other SAS data access engines. That is, you execute a LIBNAME statement to assign a libref, specify the engine, and connect to the Hadoop cluster. You then use that libref throughout the SAS session where a libref is valid. The libref is associated with a specific directory in the Hadoop cluster.

Arguments in the LIBNAME statement specify a libref, the engine name, the pathname to a directory in the Hadoop cluster, and the HDFSHOST=DEFAULT argument to indicate that you want to connect to a Hadoop cluster. Here is an example of a LIBNAME statement to connect to a Hadoop cluster:

libname myspde spde '/user/abcdef' hdfshost=default;

To interface with Hadoop and connect to a specific Hadoop cluster, required Hadoop JAR files and Hadoop cluster configuration files must be available to the SAS client machine. To make the required files available, you must define two SAS environment variables to set the location of the required files. For more information about the SAS environment variables, see “SAS and Hadoop Requirements” on page 6.

Any data source that can be accessed with a SAS engine can be loaded into a Hadoop cluster using the SPD Engine. For example:

n You can use the default Base SAS engine to access an existing SAS data set and the SPD Engine to connect to the Hadoop cluster. You can then use SAS code to load the data to the Hadoop cluster. See “Example 1: Loading Existing SAS Data Using the COPY Procedure” on page 57.

n You can use a SAS/ACCESS engine such as the SAS/ACCESS to Oracle engine to access an Oracle table and the SPD Engine to connect to the Hadoop cluster. You can then use SAS code to load the data to the Hadoop cluster. See “Example 4:

Loading Oracle Data Using the COPY Procedure” on page 61.

Note: Most existing SAS programs can run with the SPD Engine with little modification other than to the LIBNAME statement. However, some limitations apply. For example, if

Using the SPD Engine to Store Data in HDFS 3

(14)

your default Base SAS engine data has integrity constraints, then the integrity constraints are dropped when the data is converted for the SPD Engine. For more information about supported SAS file features, see “Supported SAS File Features Using the SPD Engine” on page 7.

(15)

2

Storing Data in HDFS

Overview: Storing Data in HDFS . . . 5

SAS and Hadoop Requirements . . . 6

SAS Version . . . 6

Hadoop Distribution Support . . . 6

Configuring Hadoop JAR Files . . . 6

Making Required Hadoop Cluster Configuration Files Available to Your Machine . . . 7

Supported SAS File Features Using the SPD Engine . . . 7

Security . . . 8

Overview: Storing Data in HDFS

To store data in HDFS using the SPD Engine, you must do the following:

n Ensure that all version and configuration requirements are met. See “SAS and Hadoop Requirements” on page 6.

n Understand what the supported and not supported SAS file features are when using the SPD Engine. See “Supported SAS File Features Using the SPD Engine” on page 7.

n Use the LIBNAME statement for the SPD Engine to establish the connection to the Hadoop cluster. See “LIBNAME Statement for HDFS” on page 28.

5

(16)

SAS and Hadoop Requirements

SAS Version

To store data in HDFS using the SPD Engine, you must have the first maintenance release or later for SAS 9.4.

Note: Access to data in HDFS using the SPD Engine is not supported from a SAS session in the z/OS operating environment.

Hadoop Distribution Support

In the third maintenance release for SAS 9.4, the SPD Engine supports the following Hadoop distributions, with or without Kerberos:

n Cloudera CDH 4.x

n Cloudera CDH 5.x

n Hortonworks HDP 2.x

n IBM InfoSphere BigInsights 3.x

n MapR 4.x (for Microsoft Windows and Linux operating environments only)

n Pivotal HD 2.x

Configuring Hadoop JAR Files

To store data in HDFS using the SPD Engine, you must use a supported Hadoop distribution and configure a required set of Hadoop JAR files. The JAR files must be available to the SAS client machine. The SAS environment variable

SAS_HADOOP_JAR_PATH must be defined and set to the location of the Hadoop JAR files. For complete instructions, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

(17)

Making Required Hadoop Cluster

Configuration Files Available to Your Machine

Hadoop cluster configuration files contain information such as the name of the computer that hosts the Hadoop cluster and the TCP port. To connect to the Hadoop cluster, Hadoop configuration files must be copied from the specific Hadoop cluster to a physical location that the SAS client machine can access. The SAS environment

variable SAS_HADOOP_CONFIG_PATH must be defined and set to the location of the Hadoop configuration files. For complete instructions, see the SAS Hadoop

Configuration Guide for Base SAS and SAS/ACCESS.

Supported SAS File Features Using the SPD Engine

The following SAS file features are supported for data sets using the SPD Engine:

n Encryption

n File compression

n Member-level locking

n SAS indexes

n SAS passwords

n Special missing values

n Physical ordering of returned observations

n User-defined formats and informats

Note: When you create a data set, you cannot request both encryption and file compression.

The following SAS file features are not supported for data sets using the SPD Engine:

n Audit trails

Supported SAS File Features Using the SPD Engine 7

(18)

n Cross-Environment Data Access (CEDA)

n Extended attributes

n Generation data sets

n Integrity constraints

n NLS support (such as to specify encoding for the data)

n Record-level locking

n SAS catalogs, SAS views, and MDDB files

The following SAS software does not support SPD Engine data sets:

n SAS/CONNECT

n SAS/SHARE

Security

HDFS supports defined levels of permissions at both the directory and file levels. The SPD Engine honors those permissions. For example, if the file is available as Read only, you cannot modify it.

If the Hadoop cluster supports Kerberos, the SPD Engine honors Kerberos

authentication and authorization as long as the Hadoop cluster configuration files are accessed. For more information about accessing the Hadoop cluster configuration files, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

Restricting access to members of SAS libraries by assigning SAS passwords to the members is supported when a data set is stored in HDFS. You can specify three levels of permission: Read, Write, and Alter. For more information about SAS passwords, see SAS Language Reference: Concepts.

(19)

3

Using the SPD Engine

Overview: Using the SPD Engine . . . 10

How the SPD Engine Supports Data Distribution . . . 10

I/O Operation Performance . . . 11

Creating SAS Indexes . . . 11

Parallel Processing for Data in HDFS . . . 12

Overview: Parallel Processing for Data in HDFS . . . 12

Parallel Processing Considerations . . . 14

Tuning Parallel Processing Performance . . . 14

WHERE Processing Optimization with MapReduce . . . 15

Overview: WHERE Processing Optimization with MapReduce . . 15

WHERE Expression Syntax Support . . . 16

Data Set and SAS Code Requirements . . . 16

Hadoop Requirements . . . 17

SPD Engine File System Locking . . . 18

Overview: SPD Engine File System Locking . . . 18

Requesting Read Access Lock Files . . . 19

Specifying a Pathname for the SPD Engine Lock Directory . . . 20

SPD Engine Distributed Locking . . . 20

Overview: SPD Engine Distributed Locking . . . 20

Understanding the Service Provider . . . 21

Requirements for SPD Engine Distributed Locking . . . 21

9

(20)

Requesting Distributed Locking . . . 22 Updating Data in HDFS . . . 23 Using SAS High-Performance Analytics Procedures . . . 24

Overview: Using the SPD Engine

The SPD Engine reads, writes, and updates data in HDFS. Specific SPD Engine features are supported for Hadoop storage and are explained in this document.

For more information about the SPD Engine and its features that are not specific to Hadoop storage, see SAS Scalable Performance Data Engine: Reference.

How the SPD Engine Supports Data Distribution

When loading data into a Hadoop cluster, the SPD Engine ensures that the data is distributed appropriately. The SPD Engine uses the SPD Engine partition size and the HDFS block size to compute the maximum number of observations that can fit into both parameters. That is, observations never span multiple partitions or multiple blocks.

After a data set is loaded into a Hadoop cluster, the actual block size of the loaded data might be less than the block size that was defined by the Hadoop administrator. The reason for the size difference can be because of the SPD Engine calculations regarding the partition size, block size, and observation length.

Note: Defragmenting the Hadoop cluster is not recommended. Changing the block size and re-creating the files could result in the data becoming inaccessible by SAS.

(21)

I/O Operation Performance

To improve I/O operation performance, consider setting a different SPD Engine I/O block size. The larger the block size, the less I/O. For example, when reading a data set, the block size can significantly affect performance. When retrieving a large percentage of the data, a larger block size improves performance. However, when retrieving a subset of the data such as with WHERE processing, a smaller block size performs better. You can specify a different block size with the IOBLOCKSIZE=

LIBNAME statement option and the IOBLOCKSIZE= data set option. For more

information, see the IOBLOCKSIZE= LIBNAME statement option on page 33 and the IOBLOCKSIZE= data set option on page 40.

Creating SAS Indexes

When you create a SAS index for a data set that is stored in HDFS, a large index could require a long time to create.

To provide efficient index creation, the SPD Engine partitions the two index files (.hbx and .idx). The index files are spread across multiple files based on the index partition size, which is 2 megabytes. Even though the index files are partitioned, the PARTSIZE=

option, which specifies a size for the SPD Engine data partition file, does not affect the index partition size. You cannot increase or decrease the index partition size.

To improve the performance of creating an index, consider these options:

n Request that indexes be created in parallel, asynchronously. To enable

asynchronous parallel index creation, use the ASYNCINDEX= data set option.

n Request more temporary utility file space for sorting the data. To allocate an

adequate amount of space for processing, use the SPDEUTILLOC= system option.

Specify the utility file location on the SAS client machine, not on the Hadoop cluster.

Creating SAS Indexes 11

(22)

n Request larger memory space for the sorting utility to use when sorting values for creating an index. To specify the amount of memory, use the

SPDEINDEXSORTSIZE= system option.

For more information about these options, see SAS Scalable Performance Data Engine:

Reference.

Parallel Processing for Data in HDFS

Overview: Parallel Processing for Data in HDFS

Parallel processing uses multiple threads that run in parallel so that a large operation is divided into multiple smaller ones that are executed simultaneously. The SPD Engine supports parallel processing to improve the performance of reading and writing data stored in HDFS.

By default, the SPD Engine performs parallel processing only if a Read operation includes WHERE processing. If the Read operation does not include WHERE processing, the Read operation is performed by a single thread. To request parallel processing for all Read operations for all SAS releases and for Write operations in the third maintenance release for SAS 9.4 only, use these options:

n SPDEPARALLELREAD= system option on page 45 to request parallel read processing for the SAS session.

n PARALLELREAD= LIBNAME statement option on page 36 to request parallel read processing when using the assigned libref.

n PARALLELREAD= data set option on page 42 to request parallel read processing for the specific data set.

n In the third maintenance release for SAS 9.4, PARALLELWRITE= LIBNAME statement option on page 36 to request parallel write processing when using the assigned libref.

(23)

n In the third maintenance release for SAS 9.4, PARALLELWRITE= data set option on page 43 to request parallel write processing for the specific data set.

Here is an example of the SPDEPARALLELREAD= system option to request parallel processing for all Read operations for the SAS session:

options spdeparallelread=yes;

In this example, the LIBNAME statement requests parallel processing for all Read operations using the assigned libref. By specifying the PARALLELREAD= LIBNAME statement option, parallel processing is performed for all Read operations using the Class libref:

libname class spde '/user/abcdef' hdfshost=default parallelread=yes;

proc freq data=class.StudentID;

tables age;

run;

In this example, the PARALLELREAD= data set option requests parallel processing for all Read operations for the Class.StudentID data set:

libname class spde '/user/abcdef' hdfshost=default;

proc freq data=class.StudentID (parallelread=yes);

tables age;

run;

Here is an example of the PARALLELWRITE= LIBNAME statement option to request parallel processing for all Write operations using the assigned libref. By specifying the PARALLELWRITE= LIBNAME statement option, parallel processing is performed for all Write operations using the Class libref:

libname class spde '/user/abcdef' hdfshost=default parallelwrite=yes;

TIP To display information in the SAS log about parallel processing, set the

MSGLEVEL= system option to I. When you set options msglevel=i;, the SAS log reports whether parallel processing is in effect.

Parallel Processing for Data in HDFS 13

(24)

Parallel Processing Considerations

The following are considerations for requesting parallel processing:

n For some environments, parallel processing might not improve the performance. The availability of network bandwidth and the number of CPUs on the SAS client

machine determine the performance improvement. It is recommended that you set up a test in your environment to measure performance with and without parallel processing.

n When parallel read processing occurs, the order in which the observations are returned might not be in the physical order of the observations in the data set. Some applications require that observations be returned in the physical order. For

example, the COMPARE procedure expects that observations are read from the data set in the same order that they were written to the data set. Also, legacy code that uses the DATA step or the OBS= data set option might rely on physical order to produce the expected results.

Tuning Parallel Processing Performance

To tune the performance of parallel processing, consider these SPD Engine options:

n The SPD Engine SPDEMAXTHREADS= system option specifies the maximum number of threads that the SPD Engine uses for processing.

n The SPD Engine THREADNUM= data set option specifies the maximum number of threads to use for the processing.

For more information about these options, see SAS Scalable Performance Data Engine:

Reference.

Note: The Base SAS NOTHREADS= and CPUCOUNT= system options have no effect on SPD Engine parallel processing.

(25)

WHERE Processing Optimization with MapReduce

Overview: WHERE Processing Optimization with MapReduce

WHERE processing enables you to conditionally select a subset of observations so that SAS processes only the observations that meet specified conditions. To optimize the performance of WHERE processing, you can request that data subsetting be performed in the Hadoop cluster. Then, when you submit SAS code that includes a WHERE

expression (which defines the condition that selected observations must satisfy), the SPD Engine instantiates the WHERE expression as a Java class. The SPD Engine submits the Java class to the Hadoop cluster as a component in a MapReduce program.

By requesting that data subsetting be performed in the Hadoop cluster, performance might be improved by taking advantage of the filtering and ordering capabilities of the MapReduce framework. As a result, only the subset of the data is returned to the SAS client. Performance is often improved with large data sets when the WHERE expression qualifies only a relatively small subset.

By default, data subsetting is performed by the SPD Engine on the SAS client. To request that data subsetting be performed in the Hadoop cluster, you must specify the ACCELWHERE= LIBNAME statement option on page 31 or the ACCELWHERE= data set option on page 39.

Here is an example of a LIBNAME statement that connects to a Hadoop cluster and requests that data subsetting be performed in the Hadoop cluster. By specifying the ACCELWHERE= LIBNAME statement option, subsequent WHERE processing for all data sets accessed with the Class libref are performed in the Hadoop cluster.

libname class spde '/user/abcdef' hdfshost=default accelwhere=yes;

proc freq data=class.StudentID;

tables age;

where age gt 14;

run;

WHERE Processing Optimization with MapReduce 15

(26)

In this example, the ACCELWHERE= data set option requests that data subsetting be performed in the Hadoop cluster. The WHERE processing for the Class.StudentID data set is performed in the Hadoop cluster. WHERE processing for any other data set with the Class libref is performed by the SPD Engine on the SAS client machine.

libname class spde '/user/abcdef' hdfshost=default;

proc freq data=class.StudentID (accelwhere=yes);

tables age;

where age gt 14;

run;

WHERE Expression Syntax Support

In the third maintenance release for SAS 9.4, WHERE processing optimization is expanded. WHERE processing optimization supports the following syntax:

n comparison operators such as EQ (=), NE (^=), GT (>), LT (<), GE (>=), LE (<=)

n IN operator

n full bounded range condition, such as where 500 <= empnum <= 1000;

n BETWEEN-AND operator, such as where empnum between 500 and 1000;

n compound expressions using the logical operators AND, OR, and NOT, such as where skill = 'java' or years = 4;

n parentheses to control the order of evaluation, such as where (product='GRAPH' or product='STAT') and country='Canada';

Data Set and SAS Code Requirements

To perform the data subsetting in the Hadoop cluster, the following data set and SAS code requirements must be met. If any of these requirements are not met, the

subsetting of the data is performed by the SPD Engine, not by a MapReduce program in the Hadoop cluster.

n The data set cannot be encrypted.

n The data set cannot be compressed.

(27)

n The data set must be larger than the HDFS block size.

n The submitted SAS code cannot request BY-group processing.

n The submitted SAS code cannot include the STARTOBS= or ENDOBS= options.

n The LIBNAME statement cannot include the HDFSUSER= option.

n The submitted WHERE expression cannot include any of the following syntax:

o a variable as an operand, such as where lastname;

o variable-to-variable comparison

o SAS functions, such as SUBSTR, TODAY, UPCASE, and PUT

o arithmetic operators *, /, +, -, and **

o IS NULL or IS MISSING and IS NOT NULL or IS NOT MISSING operators

o concatenation operator, such as || or !!

o negative prefix operator, such as where z = -(x + y);

o pattern-matching operators LIKE and CONTAINS

o sounds-like operator SOUNDEX (=*)

o truncated comparison operator using the colon (:) modifier, such as where lastname=: 'S';

TIP To display information in the SAS log regarding WHERE processing optimization, set the MSGLEVEL= system option to I. When you issue options msglevel=i;, the SAS log reports whether the data filtering occurred in the Hadoop cluster. If the optimization occurred, the Hadoop Job ID is displayed in the SAS log. If the

optimization did not occur, additional messages explain why.

Hadoop Requirements

To perform the data subsetting in the Hadoop cluster, the following Hadoop requirements must be met.

WHERE Processing Optimization with MapReduce 17

(28)

n The Hadoop configuration file must include the properties to run MapReduce (MR1) or MapReduce 2 (MR2) and YARN.

n The JRE version for the Hadoop cluster must be either 1.6, which is the default, or 1.7. If the JRE version is 1.7, use the ACCELJAVAVERSION= LIBNAME statement option on page 31 to specify the version.

SPD Engine File System Locking

Overview: SPD Engine File System Locking

The HDFS concurrent access model allows multiple readers and a single writer. If an application accesses a file to write to it, no other application can write to the file, but multiple applications can read the file. The SPD Engine supports a file system locking strategy that honors the HDFS concurrent access model and provides additional levels of concurrent access to ensure the integrity of the data stored in HDFS.

By default, the SPD Engine creates a Write access lock file when a data set stored in HDFS is opened for Write access. With the Write access lock file, no other SAS session can write to the file, but multiple SAS sessions can read the file if the readers accessed the data set before the Write access lock file was created.

During concurrent access, the following describes the results of the default SPD Engine locking mechanism:

n Once a SAS session opens a data set for Write access, any previous readers can continue to access the data set. However, the readers could experience unexpected data results. For example, the writer could delete the data set while the readers are accessing the data set.

n Once a SAS session opens a data set for Write access, any subsequent reader is not allowed to access the data set.

With the Write access locking mechanism, a lock error message occurs in these situations:

(29)

n When a SAS session requests Write access to a data set that another SAS session has open for Write access.

n When a SAS session requests Read access to a data set that another SAS session has open for Write access.

n When a SAS session requests to delete a data set that another SAS session has open for Write access.

In the third maintenance release for SAS 9.4, to store the lock files, the SPD Engine creates a lock directory in the /tmp directory. The lock directory name includes the name of the data set, an eight-character hexadecimal value (which is the checksum of the Hadoop cluster directory that contains the data set), and the suffix _spdslock9, such as BigFile_0000393a_spdslock9. In most situations, you will not see the lock directory because lock files are deleted when the process completes.

TIP In some situations, such as an abnormal termination of a SAS session, lock files might not be properly deleted. The leftover lock files could prohibit access to a data set. If this occurs, the leftover lock files must be manually deleted by submitting HDFS commands.

Requesting Read Access Lock Files

In some situations, you might want to control the level of concurrent access to

guarantee the integrity of the data by requesting that a Read access lock file be created.

To request a Read access lock file, define the SAS environment variable

SPDEREADLOCK and set it to YES. Then, when a SAS session opens a data set for Read access, a Read access lock file is created in addition to any Write access lock files. For more information, see “SPDEREADLOCK SAS Environment Variable” on page 52.

With the Read and Write access locking mechanism, a lock error message occurs in these situations:

n When a SAS session requests Write access to a data set that another SAS session has open for either Read or Write access.

SPD Engine File System Locking 19

(30)

n When a SAS session requests Read access to a data set that another SAS session has open for Write access.

n When a SAS session requests to delete a data set that another SAS session has open for either Read or Write access.

Note: When you request a Read access lock file, all data access, even for Read access, requires Write permission to the Hadoop cluster.

TIP By creating both Read and Write access lock files, the possibility of leftover lock files is increased. If you experience situations such as an abnormal termination of a SAS session, lock files that were not properly deleted must be manually deleted by submitting HDFS commands.

Specifying a Pathname for the SPD Engine Lock Directory

By default, for HDFS concurrent access, the SPD Engine creates a lock directory in the /tmp directory. The lock directory name includes the name of the data set, an eight- character hexadecimal value (which is the checksum of the Hadoop cluster directory that contains the data set, and the suffix _spdslock9, such as

BigFile_0000393a_spdslock9.

In the third maintenance release for SAS 9.4, you can specify a pathname for the SPD Engine lock directory by defining the SAS environment variable SPDELOCKPATH to specify a directory in the Hadoop cluster. For more information, see “SPDELOCKPATH SAS Environment Variable” on page 51.

SPD Engine Distributed Locking

Overview: SPD Engine Distributed Locking

In the third maintenance release for SAS 9.4, the SPD Engine supports distributed locking for data stored in HDFS. Distributed locking provides synchronization and group

(31)

coordination services to clients over a network connection. For the service provider, the SPD Engine uses the Apache ZooKeeper coordination service, specifically the

implementation of the recipe for Shared Lock that is provided by Apache Curator.

Distributed locking provides the following benefits:

n The lock server maintains the lock state information in memory and does not require Write permission to any client or data library disk storage locations.

n A process requesting a lock on a data set that is not available (because the data set is already locked) can choose to wait for the data set to become available, rather than having the lock request fail immediately.

n If a process abnormally terminates while holding locks on data sets, the lock server automatically drops all locks that the client was holding, which eliminates the

possibility of leftover lock files.

Understanding the Service Provider

Apache ZooKeeper is an open-source distributed server that enables reliable distributed coordination to distributed client applications over a network. ZooKeeper safely

coordinates access to shared resources with other applications or processes. At its core, ZooKeeper is a fault tolerant multi-machine server that maintains a virtual hierarchy of data nodes that store coordination data. For more information about ZooKeeper and the ZooKeeper data nodes, see Apache ZooKeeper.

Apache Curator is a high-level API that simplifies using ZooKeeper. Curator adds many features that build on ZooKeeper and handles the complexity of managing connections to the ZooKeeper cluster. For more information about Curator, see Curator.

The SPD Engine accesses the Curator API to provide the locking services.

Requirements for SPD Engine Distributed Locking

SPD Engine distributed locking has the following requirements:

n ZooKeeper 3.4.0 or later must be downloaded, installed, and running on the Hadoop cluster. The zookeeper JAR file is required.

SPD Engine Distributed Locking 21

(32)

n Curator 2.7.0 or later must be downloaded on the Hadoop cluster. The following Curator JAR files are required:

o curator-client

o curator-framework

o curator-recipes

n The following Hadoop distribution JAR files are required on the client side:

o guava

o log4j

o slf4j

n The JAR files must be available to the SAS client machine. The SAS environment variable SAS_HADOOP_JAR_PATH must be defined and set to the location of the Hadoop JAR files. For complete instructions, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

TIP To be effective, all access to SPD data sets must use the same locking method.

If some processes or instances use distributed locking and others do not, proper coordination of access to the data sets cannot be guaranteed, and at a minimum, lock failures will be encountered.

Requesting Distributed Locking

To request distributed locking, you must first create an XML configuration file that contains information so that the SPD Engine can communicate with ZooKeeper. The format of the XML is similar to Hadoop configuration files in that the XML contains properties and attributes as name-value pairs. For an example of an XML configuration file, see “XML Configuration File” on page 46.

In addition, you must define the SAS environment variable SPDE_CONFIG_FILE to specify the location of the user-defined XML configuration file. The location must be available to the SAS client machine. For more information, see “SPDE_CONFIG_FILE SAS Environment Variable” on page 46.

(33)

Updating Data in HDFS

HDFS does not support updating data. However, because traditional SAS processing involves updating data, the SPD Engine supports SAS Update operations for data stored in HDFS.

To update data in HDFS, the SPD Engine uses an approach that replaces the data set’s data partition file for each observation that is updated. When an update is requested, the SPD Engine re-creates the data partition file in its entirety (including all replications), and then inserts the updated data into the new data partition file. Because the data partition file is replaced for each observation that is updated, the greater the number of observations to be updated, the longer the process.

For a general-purpose data storage engine like the SPD Engine, the ability to perform small, infrequent updates can be beneficial. However, updating data in HDFS is intended for situations when the time it takes to complete the update outweighs the alternatives.

The following are best practices for Update operations using the SPD Engine:

n It is recommended that you set up a test in your environment to measure Update operation performance. For example, update a small number of observations to gauge how long updates take in your environment. Then, project the test results to a larger number of observations to determine whether updating is realistic.

n It is recommended that you do not use the SQL procedure to update data in HDFS because of how PROC SQL opens, updates, and closes a file. There are other SAS methods that provide better performance such as the DATA step UPDATE statement and MODIFY statement.

n The performance of appending a data set can be slower if the data set has a unique index. Test case results show that appending a data set to another data set without a unique index was significantly faster than appending the same data set to another data set with a unique index.

Updating Data in HDFS 23

(34)

Using SAS High-Performance Analytics Procedures

You can use the SPD Engine with SAS High-Performance Analytics procedures to read and write the SPD Engine file format in HDFS. In many cases, the SPD Engine data used by the procedures can be read and written in parallel using the SAS Embedded Process.

The following are requirements for a SAS Embedded Process parallel read:

n Access to the machines in the cluster where a SAS High-Performance Analytics deployment of Hadoop is installed and running.

n The data set cannot be encrypted or compressed.

n The STARTOBS= and ENDOBS= data set options cannot be specified.

The following are requirements for a SAS Embedded Process parallel write:

n The ALIGN=, COMPRESS=, ENCRYPT=, and PADCOMPRESS= data set options cannot be specified.

n The SAS client machine must have a data representation that is compatible with the data representation of the Hadoop cluster. The SAS client machine must be either Linux x64 or Solaris x64.

The following are best practices when using the SPD Engine with SAS High- Performance Analytics procedures:

n With SAS Enterprise Miner, a SAS process can be terminated in such a way that the SPD Engine does not follow normal shutdown procedures, which can result in a lock file not being deleted. The orphan lock file could prevent a subsequent open of the data set. If this occurs, the orphan lock file must be manually deleted by submitting Hadoop commands. To delete the orphan lock file, you can use the HADOOP procedure to submit Hadoop commands.

n For SAS High-Performance Analytics Work files, the SPD Engine uses the standard UNIX temporary directory /tmp. To override the default Work directory, you can

(35)

define the SAS environment variable SPDE_HADOOP_WORK_PATH to specify a directory in the Hadoop cluster. The directory must exist and you must have Write access. For example, the following OPTIONS statement sets the Work directory:

options set=SPDE_HADOOP_WORK_PATH="/sasdata/cluster1/hpawork";

For more information, see “SPDE_HADOOP_WORK_PATH SAS Environment Variable” on page 50.

Using SAS High-Performance Analytics Procedures 25

(36)

(37)

4

SPD Engine Reference

Overview: SPD Engine Reference . . . 27 Dictionary . . . 28 LIBNAME Statement for HDFS . . . 28 ACCELWHERE= Data Set Option for HDFS . . . 39 IOBLOCKSIZE= Data Set Option for HDFS . . . 40 PARTSIZE= Data Set Option for HDFS . . . 41 PARALLELREAD= Data Set Option for HDFS . . . 42 PARALLELWRITE= Data Set Option for HDFS . . . 43 SPDEPARALLELREAD= System Option for HDFS . . . 45 SPDE_CONFIG_FILE SAS Environment Variable . . . 46 SPDE_HADOOP_WORK_PATH SAS Environment Variable . . . 50 SPDELOCKPATH SAS Environment Variable . . . 51 SPDEREADLOCK SAS Environment Variable . . . 52

Overview: SPD Engine Reference

The SPD Engine reads, writes, and updates data in HDFS. A specific SPD Engine LIBNAME statement and options are provided for Hadoop storage and are explained in this document.

For more information about the SPD Engine LIBNAME statement and options that are not specific to Hadoop storage, see SAS Scalable Performance Data Engine:

Reference.

27

(38)

Dictionary

LIBNAME Statement for HDFS

Associates a libref with a Hadoop cluster to read, write, and update a data set in HDFS.

Restrictions: The SPD Engine LIBNAME statement arguments that are specific to HDFS are not supported in the z/OS operating environment.

You can connect to only one Hadoop cluster at a time per SAS session. You can submit multiple LIBNAME statements to different directories in the Hadoop cluster, but there can be only one Hadoop cluster connection per SAS session.

Requirements: To associate a libref with a Hadoop cluster, you must have the first maintenance release or later for SAS 9.4.

Supported Hadoop distributions: Cloudera CDH 4.x, Cloudera CDH 5.x, Hortonworks HDP 2.x, IBM InfoSphere BigInsights 3.x, MapR 4.x (Microsoft Windows and Linux only), Pivotal HD 2.x, with or without Kerberos

To store data in HDFS using the SPD Engine, you must use a supported Hadoop distribution and configure a required set of Hadoop JAR files. The JAR files must be available to the SAS client machine. The SAS environment variable

SAS_HADOOP_JAR_PATH must be defined and set to the location of the Hadoop JAR files. For complete instructions, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

To connect to the Hadoop cluster, Hadoop configuration files must be copied from the specific Hadoop cluster to a physical location that the SAS client machine can access.

The SAS environment variable SAS_HADOOP_CONFIG_PATH must be defined and set to the location of the Hadoop configuration files. For complete instructions, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

Example: Chapter 5, “How to Use Hadoop Data Storage,” on page 55

Syntax

LIBNAME libref SPDE 'primary-pathname' HDFSHOST=DEFAULT

<ACCELJAVAVERSION=version> <ACCELWHERE=NO | YES>

<DATAPATH=('pathname')> <HDFSUSER=ID> <IOBLOCKSIZE=n>

<NUMTASKS=n> <PARALLELREAD=NO | YES>

<PARALLELWRITE=NO | YES | threads> <PARTSIZE=n | nM | nG | nT>;

(39)

Summary of Optional Arguments ACCELJAVAVERSION=version

When requesting that WHERE processing be optimized by being performed in the Hadoop cluster, specifies the Java Runtime Environment (JRE) version for the Hadoop cluster.

ACCELWHERE=NO | YES

Determines whether WHERE processing is optimized by data subsetting being performed in the Hadoop cluster.

DATAPATH=('pathname')

When creating a data set, specifies the fully qualified pathname to a directory in the Hadoop cluster to store data partition files.

HDFSUSER=ID

Is an authorized user ID on the Hadoop cluster.

IOBLOCKSIZE=n

Specifies a size in bytes of a block of observations to be used in an I/O operation.

NUMTASKS=n

Specifies the number of MapReduce tasks when writing data in HDFS.

PARALLELREAD=NO | YES

Determines when the SPD Engine uses parallel processing to read data stored in HDFS.

PARALLELWRITE=NO | YES | threads

Determines whether the SPD Engine uses parallel processing to write data in HDFS.

PARTSIZE=n | nM | nG | nT

Specifies a size for the SPD Engine data partition file.

LIBNAME Statement for HDFS 29

(40)

Required Arguments libref

is a valid SAS library name that serves as a shortcut name to associate with a data set in a Hadoop cluster. The name can be up to eight characters long and must conform to the rules for SAS names.

SPDE

is the engine name for the SAS Scalable Performance Data (SPD) Engine.

'primary-pathname'

specifies the fully qualified pathname to a directory in the Hadoop cluster, which is typically UNIX based. Enclose the primary pathname in single or double quotation marks. An example is '/user/abcdef/'.

When data is loaded into a Hadoop cluster directory, the SPD Engine automatically creates a subdirectory with the specified data set name and the suffix _spde. The SPD Engine data partition files are contained in that subdirectory. For example, if you load a data set named BigFile into the directory /user/abcdef/, the data partition files are located at /user/abcdef/bigfile_spde/. The SPD Engine metadata and index files are located at /user/abcdef/.

Restrictions Maximum length is 260 characters for Windows and 1024 characters for UNIX.

The primary pathname must be unique for each assigned libref. Assigned librefs that are different but reference the same primary pathname can result in lost data.

Requirement You must use valid directory syntax for the host. The pathname must be recognized by the operating environment.

Interaction You can specify a different location to store the data partition files with the DATAPATH= option on page 32.

HDFSHOST=DEFAULT

specifies that you want to connect to the Hadoop cluster that is defined in the Hadoop cluster configuration files. The SPD Engine locates the Hadoop cluster

(41)

configuration files using the SAS_HADOOP_CONFIG_PATH environment variable.

The environment variable sets the location of the configuration files for a specific cluster. For more information about the SAS_HADOOP_CONFIG_PATH

environment variable, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.

Requirement You must specify the HDFSHOST=DEFAULT argument.

Optional Arguments ACCELJAVAVERSION=version

When requesting that WHERE processing be optimized by being performed in the Hadoop cluster, specifies the Java Runtime Environment (JRE) version for the Hadoop cluster. The value must be either 1.6 or 1.7.

Default 1.6

Interaction To request that data subsetting be performed in the Hadoop cluster, use the ACCELWHERE= LIBNAME statement option on page 31. By default, data

subsetting is performed by the SPD Engine on the SAS client.

Example “Example 8: Optimizing WHERE Processing with MapReduce” on page 69

ACCELWHERE=NO | YES

Determines whether WHERE processing is optimized by data subsetting being performed in the Hadoop cluster.

NO

specifies that data subsetting is performed by the SPD Engine on the SAS client.

This is the default setting.

YES

specifies that data subsetting is performed by a MapReduce program in the Hadoop cluster.

(42)

Requirements To perform data subsetting in the Hadoop cluster, there are data set and SAS code requirements. See

“WHERE Processing Optimization with MapReduce”

on page 15.

To submit the MapReduce program to the Hadoop cluster, the Hadoop configuration file must include the properties to run MapReduce (MR1) or

MapReduce 2 (MR2) and YARN.

Interactions If the JRE version for the Hadoop cluster is 1.7 instead of the default version 1.6, use the

ACCELJAVAVERSION= LIBNAME statement option on page 31 to specify the version.

The ACCELWHERE= data set option overrides the ACCELWHERE= LIBNAME statement option. For more information, see ACCELWHERE= data set option on page 39.

Example “Example 8: Optimizing WHERE Processing with MapReduce” on page 69

Default NO DATAPATH=('pathname')

When creating a data set, specifies the fully qualified pathname to a directory in the Hadoop cluster to store data partition files. Enclose the pathname in single or double quotation marks within parentheses. An example is datapath=('/sasdata').

When data is loaded into the directory, a subdirectory is automatically created with the specified data set name and the suffix _spde. The SPD Engine data partition files are contained in that subdirectory. For example, if you load a data set named BigFile into the directory /user/abcdef/ and specify datapath=(‘/sasdata/’), the data partition files are located at /sasdata/bigfile_spde/. The SPD Engine

metadata and index files are located at /user/abcdef/.

(43)

Restrictions You can specify only one pathname to store data partition files.

Maximum length is 260 characters for Windows and 1024 characters for UNIX.

The pathname must be unique for each assigned libref.

Assigned librefs that are different but reference the same pathname can result in lost data.

Requirement You must use valid directory syntax for the host. The pathname must be recognized by the operating environment.

Interaction Specifying the DATAPATH= option overrides the primary pathname for storing the data partition files only. The SPD Engine metadata and index files are always stored in the primary pathname.

HDFSUSER=ID

Is an authorized user ID on the Hadoop cluster. You can specify a user ID to connect to the Hadoop cluster with a different ID than your current logon ID.

Restrictions If the HDFSUSER= option is specified, Kerberos

authentication is bypassed, which prevents access to a secure Hadoop cluster.

If the HDFSUSER= option is specified, WHERE

processing optimization with the ACCELWHERE= option cannot be performed in the Hadoop cluster.

HDFSUSER= is not supported by a MapR Apache Hadoop distribution.

IOBLOCKSIZE=n

Specifies a size in bytes of a block of observations to be used in an I/O operation.

The I/O block size determines the amount of data that is physically transferred together in an I/O operation. The larger the block size, the less I/O. The SPD Engine

(44)

uses blocks in memory to collect the observations to be written to or read from a data component file. The IOBLOCKSIZE= option specifies the size of the block. (The actual size is computed to accommodate the largest number of observations that fit in the specified size of n bytes. Therefore, the actual size is a multiple of the

observation length.)

The block size affects I/O operations for compressed, uncompressed, and encrypted data sets. However, the effects are different and depend on the I/O operation.

n For a compressed data set, the block size determines how many observations are compressed together, which determines the amount of data that is physically transferred for both Read and Write operations. The block size is a permanent attribute of the file. To specify a different block size, you must copy the data set to a new data set, and then specify a new block size for the output file. For a

compressed data set, a larger block size can improve performance for both Read and Write operations.

n For an encrypted data set, the block size is a permanent attribute of the file.

n For an uncompressed data set, the block size determines the size of the blocks that are used to read the data from disk to memory. The block size has no affect when writing data to disk. For an uncompressed data set, the block size is not a permanent attribute of the file. That is, you can specify a different block size based on the Read operation that you are performing. For example, reading data that is randomly distributed or reading a subset of the data calls for a smaller block size because accessing smaller blocks is faster than accessing larger blocks. In contrast, reading data that is uniformly or sequentially distributed or that requires a full data set scan works better with a larger block size.

Default 1,048,576 bytes (1 megabyte)

Ranges The minimum block size is 32,768 bytes.

The maximum block size is half the size of the SPD Engine data partition file.

Restriction The SPD Engine I/O block size must be smaller than or equal to the Hadoop cluster block size.

(45)

Interaction The IOBLOCKSIZE= data set option overrides the IOBLOCKSIZE= LIBNAME statement option. For more information, see “IOBLOCKSIZE= Data Set Option for HDFS” on page 40.

Tip When reading a data set, the block size can significantly affect performance. If retrieving a large percentage of the data, a larger block size improves performance.

However, if retrieving a subset of the data (such as with WHERE processing), a smaller block size performs better.

Example “Example 7: Setting the SPD Engine I/O Block Size” on page 68

NUMTASKS=n

Specifies the number of MapReduce tasks when writing data in HDFS. This option controls parallel processing on the Hadoop cluster when writing output from a SAS High-Performance Analytics procedure using the SAS Embedded Process.

When a high-performance procedure reads and writes Hadoop data, and the amount of output data is similar to the amount of input data, the same number of output tasks as input tasks should be a good default. However, if the amount of output data differs significantly from the amount of input data, you should use this option to tune the number of tasks proportionally to the output data.

Default The number of MapReduce tasks is the number of SAS High-Performance Analytics nodes. Or, if the high-

performance procedure reads a Hadoop file as input, it is the number of tasks that were used to read the input file.

Restriction This option affects writing data in HDFS only when a high-performance procedure writes output to HDFS using the SAS Embedded Process.

Interaction If the specified number of MapReduce tasks is less than the number of SAS High-Performance Analytics nodes on which the procedure runs, the setting is ignored.

(46)

PARALLELREAD=NO | YES

Determines when the SPD Engine uses parallel processing to read data stored in HDFS.

NO

specifies that parallel processing occurs only if a Read operation includes WHERE processing. This is the default behavior for the SPD Engine.

YES

specifies parallel processing for all Read operations using the assigned libref.

Default NO

Interactions The SET statement POINT= option is inconsistent with parallel processing.

When parallel processing occurs, the order in which the observations are returned might not be in the physical order of the observations in the data set. Some

applications require that observations be returned in the physical order.

The PARALLELREAD= LIBNAME statement option overrides the SPDEPARALLELREAD= system option.

For more information, see “SPDEPARALLELREAD=

System Option for HDFS” on page 45.

The PARALLELREAD= LIBNAME statement option can be overridden by the PARALLELREAD= data set option.

For more information, see “PARALLELREAD= Data Set Option for HDFS” on page 42.

See “Parallel Processing for Data in HDFS” on page 12

PARALLELWRITE=NO | YES | threads

Determines whether the SPD Engine uses parallel processing to write data in HDFS.

NO

specifies that parallel processing for a Write operation does not occur. This is the default behavior for the SPD Engine.

(47)

YES

specifies parallel processing for all Write operations using the assigned libref. A thread is used for each CPU on the SAS client machine. For example, if eight CPUs exist on the SAS client machine, then eight threads are used to write data.

threads

specifies parallel processing for all Write operations using the assigned libref and specifies the number of threads to use for the Write operations.

Default The default is 1, which specifies that parallel processing for a Write operation does not occur.

Range 2 to 512

Default NO

Restrictions You cannot use parallel processing for a Write operation and also request to create a SAS index.

You cannot use parallel processing for a Write operation and also request BY-group processing or sorting.

Interactions When parallel Write processing occurs, the order in which the observations are written is unpredictable. The order in which the observations are returned cannot be determined unless the application imposes ordering criteria.

The SPD Engine SPDEMAXTHREADS= system option specifies the maximum number of threads that the SPD Engine uses for processing. For more information, see SAS Scalable Performance Data Engine: Reference.

The PARALLELWRITE= LIBNAME statement option can be overridden by the PARALLELWRITE= data set

option. For more information, see “PARALLELWRITE=

Data Set Option for HDFS” on page 43.

9.4 SPD Engine: Storing Data in the Hadoop Distributed File System

SAS

9.4 SPD Engine:

Storing Data in the

Hadoop Distributed File System

Third Edition

Contents

What ʼs New

What’s New in the SAS 9.4 SPD Engine to Store Data in HDFS

Overview

Hadoop Distribution Support

Improved Performance When Creating a SAS Index

Improved Performance By Setting SPD Engine I/O Block Size

Improved Performance of Reading Data in HDFS

Improved Performance of Writing Data to HDFS

Optimized WHERE Processing

Controlling Tasks When Writing Data in HDFS

SPD Engine File System Locking

SPD Engine Distributed Locking

Configuring the SPD Engine to Store Data in HDFS

Accessing SPD Engine Data Using Hive

1

Introduction to Storing Data in HDFS

Deciding to Store Data in HDFS

Using the SPD Engine to Store Data in HDFS

What Is the SPD Engine?

Understanding the SPD Engine File Format

How to Use the SPD Engine

2

Storing Data in HDFS

Overview: Storing Data in HDFS

SAS and Hadoop Requirements

SAS Version

Hadoop Distribution Support

Configuring Hadoop JAR Files

Making Required Hadoop Cluster

Configuration Files Available to Your Machine

Supported SAS File Features Using the SPD Engine

Security

3

Using the SPD Engine

Overview: Using the SPD Engine

How the SPD Engine Supports Data Distribution

I/O Operation Performance

Creating SAS Indexes

Parallel Processing for Data in HDFS

Overview: Parallel Processing for Data in HDFS

Parallel Processing Considerations

Tuning Parallel Processing Performance

WHERE Processing Optimization with MapReduce

Overview: WHERE Processing Optimization with MapReduce

WHERE Expression Syntax Support

Data Set and SAS Code Requirements

Hadoop Requirements

SPD Engine File System Locking

Overview: SPD Engine File System Locking

Requesting Read Access Lock Files

Specifying a Pathname for the SPD Engine Lock Directory

SPD Engine Distributed Locking

Overview: SPD Engine Distributed Locking

Understanding the Service Provider

Requirements for SPD Engine Distributed Locking

Requesting Distributed Locking

Updating Data in HDFS

Using SAS High-Performance Analytics Procedures

4

SPD Engine Reference

Overview: SPD Engine Reference

Dictionary