Platfora Installation Guide


Version 4.5

For On-Premise Hadoop Deployments

Copyright Platfora 2015

Contents

Document Conventions
Contact Platfora Support
Copyright Notices

Chapter 1: Installation Overview (On-Premise)
  On-Premise Hadoop Deployments
  Master vs Worker Node Installations
  Preinstall Check List
  High-Level Install Steps

Chapter 2: System Requirements (On-Premise)
  Platfora Server Requirements
  Port Configuration Requirements
  Ports to Open on Platfora Nodes
  Ports to Open on Hadoop Nodes
  Supported Hadoop and Hive Versions
  Hadoop Resource Requirements
  Browser Requirements

Chapter 3: Configure Hadoop for Platfora Access
  Create Platfora User on Hadoop Nodes
  Create Platfora Directories and Permissions in Hadoop
  HDFS Tuning for Platfora
  Increase Open File Limits
  Increase Platfora User Limits
  Increase DataNode File Limits
  Allow Platfora Local Access
  MapReduce Tuning for Platfora
  YARN Tuning for Platfora

Chapter 4: Install Platfora Software and Dependencies
  About the Platfora Installer Packages
  Install Using RPM Packages
  Install Dependencies RPM Package
  Install Optional Security RPM Package
  Install Platfora RPM Package (Master Only)
  Install Using the TAR Package
  Create the Platfora System User
  Install Platfora TAR Package (Master Only)
  Install PDF Dependencies (Master Only)

Chapter 5: Configure Environment on Platfora Nodes
  Install the MapR Client Software (MapR Only)
  Configure Network Environment
  Configure /etc/hosts File
  Verify Connectivity Between Platfora Nodes
  Verify Connectivity to Hadoop Nodes
  Open Firewall Ports
  Configure Passwordless SSH
  Verify Local SSH Access
  Exchange SSH Keys (Multi-Node Only)
  Synchronize the System Clocks
  Create Local Storage Directories
  Verify Environment Variables

Chapter 6: Configure Platfora for Secure Hadoop Access
  About Secure Hadoop
  Configure Kerberos Authentication to Hadoop
  Obtain Kerberos Tickets for a Platfora Server
  Auto-Renew Kerberos Tickets for a Platfora Server
  Configure Secure Impersonation in Hadoop

Chapter 7: Initialize Platfora Master Node
  Connect Platfora to Your Hadoop Services
  Understand How Platfora Connects to Hadoop
  Obtain Hadoop Configuration Files
  Create Local Hadoop Configuration Directory
  Initialize the Platfora Master
  Configure SSL for Client Connections
  Configure SSL for Catalog Connections
  About System Diagnostic Data
  Troubleshoot Setup Issues
  View the Platfora Log Files
  Setup Fails Setting up Catalog Metadata Service
  TEST FAILED: Checking integrity of binaries

Chapter 8: Start Platfora
  Start the Platfora Server
  Log in to the Platfora Web Application
  Add a License Key
  Load the Tutorial Data

Chapter 9: Initialize a Worker Node

Appendix A: Command Line Utility Reference
  setup.py
  hadoop-check
  hadoopcp
  hadoopfs
  install-node
  platfora-catalog
  platfora-catalog ssl
  platfora-config
  platfora-export
  platfora-import
  platfora-license
  platfora-license install
  platfora-license uninstall
  platfora-license view
  platfora-node
  platfora-node add
  platfora-node config
  platfora-services
  platfora-services start
  platfora-services stop
  platfora-services restart
  platfora-services status
  platfora-services sync
  platfora-syscapture
  platfora-syscheck

Appendix B: Glossary

This guide provides information and instructions for installing and initializing a Platfora® cluster. This guide is intended for system administrators with knowledge of Linux/Unix system administration and basic Hadoop administration.

This on-premise installation guide is for data center environments (either physical or virtual data centers) that have a permanent, managed Hadoop cluster. Platfora is installed in the same network as your Hadoop cluster.

Document Conventions

This documentation uses certain text conventions for language syntax and code examples.

Convention: $
Usage: Command-line prompt. Precedes a command to be entered in a command-line terminal session.
Example: $ ls

Convention: $ sudo
Usage: Command-line prompt for a command that requires root permissions (commands will be prefixed with sudo).
Example: $ sudo yum install open-jdk-1.7

Convention: UPPERCASE
Usage: Function names and keywords are shown in all uppercase for readability, but keywords are case-insensitive (can be written in upper or lower case).
Example: SUM(page_views)

Convention: italics
Usage: Italics indicate a user-supplied argument or variable.
Example: SUM(field_name)

Convention: [ ] (square brackets)
Usage: Square brackets denote optional syntax items.
Example: CONCAT(string_expression[,...])

Convention: ... (ellipsis)
Usage: An ellipsis denotes a syntax item that can be repeated any number of times.

Contact Platfora Support

For technical support, you can send an email to: [email protected]

Or visit the Platfora support site for the most up-to-date product news, knowledge base articles, and product tips.

http://support.platfora.com

To access the support portal, you must have a valid support agreement with Platfora. Please contact your Platfora sales representative for details about obtaining a valid support agreement or with questions about your account.

Copyright Notices

Platfora believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” PLATFORA CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any Platfora software described in this publication requires an applicable software license. Platfora®, You Should Know™, Interest Driven Pipeline™, Fractal Cache™, and Adaptive Job Synthesis™ are trademarks of the Platfora Corporation. Apache Hadoop™ and Apache Hive™ are trademarks of the Apache Software Foundation. All other trademarks used herein are the property of their respective owners.

Embedded Software Copyrights and License Agreements

Platfora contains the following open source and third-party proprietary software subject to their respective copyrights and license agreements:

• Apache Hive PDK
• dom4j
• freemarker
• GeoNames
• Google Maps API
• javax.servlet
• Mortbay Jetty 6.1.26
• OWASP CSRFGuard 3
• PostgreSQL JDBC 9.1-901
• Scala
• sjsxp : 1.0.1
• Unboundid

Chapter 1: Installation Overview (On-Premise)

This section provides an overview of the Platfora installation process for environments that will use an on-premise deployment of Hadoop with Platfora.

Topics:
• On-Premise Hadoop Deployments
• Master vs Worker Node Installations
• Preinstall Check List
• High-Level Install Steps

On-Premise Hadoop Deployments

An on-premise Hadoop deployment means that you already have an existing Hadoop installation in your data center (either a physical data center or a virtual private cloud).


Platfora connects to the Hadoop cluster managed by your organization, and the majority of your organization's data is stored in the distributed file system of this primary Hadoop cluster.

For on-premise Hadoop deployments, the Platfora servers should be on their own dedicated hardware co-located in the same data center as your Hadoop cluster. A data center can be a physical location with actual hardware resources, or a virtual private cloud environment with virtual server instances (such as Rackspace or Amazon EC2). Platfora recommends putting the Platfora servers on a network with at least 1 Gbps connectivity to the Hadoop nodes.

Platfora users access the Platfora master node using an HTML5-compliant web browser. The Platfora master node accesses the HDFS NameNode and the MapReduce JobTracker or YARN Resource Manager using native Hadoop protocols. The Platfora worker nodes access the HDFS DataNodes directly. If using a firewall, Platfora recommends placing the Platfora servers on the same side of the firewall as your Hadoop cluster.

Platfora software can run on a wide variety of server configurations, from a single server to a cluster scaled across multiple servers. Since Platfora runs best with all of the active lenses readily available in RAM, Platfora recommends obtaining servers optimized for higher RAM capacity and a minimum of 8 CPUs.

Master vs Worker Node Installations

If you are installing Platfora for the very first time, you begin by installing, configuring and initializing the Platfora master node. Once you have the master node up and running, you can then add in additional worker nodes as needed.


All nodes in a Platfora cluster (master and workers) must meet the minimum system requirements and have the required prerequisite software installed. If you are using the RPM installer packages, you can use the base installer package to install the required software on each Platfora node. If you are using the TAR installer packages, you must manually install the required software on each Platfora node.

However, you only need to install the Platfora server software on the master node. Platfora copies the server software from the master to the worker nodes during the worker node initialization process.

All nodes in a Platfora cluster also require you to configure the network environment so that all the nodes can talk to each other, as well as to the Hadoop cluster nodes.

If you are adding worker nodes to an existing Platfora cluster, make sure to follow the instructions for installing dependencies and configuring the environment. You can skip any tasks denoted as 'Master Only'; these tasks are only required for first-time installations of the Platfora master node.

Preinstall Check List

Here is a list of items and information you will need in order to install a new Platfora cluster with an on-premise Hadoop deployment. Platfora must be able to connect to Hadoop services during setup, so you will also need information from your Hadoop installation.

Platfora Checklist

This is a list of things you will need in order to install Platfora nodes.

What You Need: Platfora License
Description: Platfora Customer Support must issue you a license file. Trial period licenses are available upon request for pilot installations.

What You Need: Platfora Software
Description: A Platfora customer support representative can give you the download link to the Platfora installation package for your chosen operating system and Hadoop distribution and version. Platfora provides both rpm and tar installer packages.

What You Need: (MapR Only) MapR Client Software
Description: If you are using a MapR Hadoop cluster with Platfora, you will need the MapR client software for the version of MapR you are using. The MapR client software must be installed on all Platfora nodes.

Hadoop Checklist

This is a list of things you will need from your Hadoop environment in order to install Platfora.

What You Need: Hadoop Distribution and Version Number
Description: When you install Platfora, you need to specify what Hadoop distribution you have (Cloudera, Hortonworks, MapR, etc.) and what version you are running.

What You Need: Hadoop Hostnames and Ports
Description: You will need to know the hostnames and ports of your Hadoop services (NameNode, Resource Manager or JobTracker, Hive Server, DataNodes, etc.)

What You Need: Hadoop Configuration Files
Description: Platfora requires local versions of Hadoop's configuration files. It uses these files to connect to Hadoop services:
• core-site.xml and hdfs-site.xml for HDFS
• mapred-site.xml and yarn-site.xml for data processing
• hive-site.xml for the Hive metastore
The locations of these files vary based on your Hadoop distribution.

What You Need: Platfora Data Directory Location in HDFS
Description: Platfora requires a directory location in HDFS to store its library files and output (lenses).

High-Level Install Steps

This section lists the high-level steps involved in installing Platfora to work with an on-premise Hadoop cluster. Note that there are different procedures if you are installing a new Platfora cluster versus adding a worker node to an existing Platfora cluster.

New Platfora Installation

When installing Platfora for the first time, you begin with installing and configuring the Platfora master node first. After the master node is installed, initialized and connected to the Hadoop services it needs, then you can use the master node to add additional worker nodes into the cluster.

These are the high-level steps for installing Platfora for the first time:

1. Make sure your systems meet the minimum System Requirements.
2. Configure Hadoop for Platfora Access.
3. Install Platfora Software and Dependencies.
4. Configure Environment on Platfora Nodes.
5. (Secure Hadoop Only) Configure Platfora for Secure Hadoop Access.
6. Obtain a Copy of Your Hadoop Configuration Files.
7. Configure Access to Your Hadoop Services.
8. Initialize the Platfora Master.
9. Start Platfora.
10. Log in to the Platfora Application.
11. Install the License File.
12. (Optional) Load the Tutorial Data (as a quick way to test that everything works).
13. Add Worker Nodes.

Additional Worker Node Installation

Once you have a Platfora master node up and running, you can use it to initialize additional worker nodes. Before you can initialize a worker node, however, you must make sure that it has the required dependencies installed.

These are the high-level steps for adding a worker node to an existing Platfora cluster:

1. Install only the prerequisite software directly on the worker node.
   • If using the RPM installer packages, Install Dependencies RPM Package.
   • If using the TAR installer packages, you must manually Create the Platfora System User, Set OS Kernel Parameters, and Install Dependent Software.
2. Configure Environment on Platfora Nodes.
3. (Secure Hadoop Only) Configure Kerberos Authentication to Hadoop.
4. Add Worker Node to Platfora Cluster.

Chapter 2: System Requirements (On-Premise)

The Platfora software runs on a scale-out cluster of servers. You can install Platfora on a single node to start, and then scale up storage and processing capacity by adding additional nodes. Platfora requires access to an existing, compatible Hadoop implementation in order to start. Users then access the Platfora application using a compatible web browser client. This section describes the system requirements for on-premise deployments of the Platfora servers, Hadoop source systems, network connectivity, and web browser clients.

Topics:
• Platfora Server Requirements
• Port Configuration Requirements
• Supported Hadoop and Hive Versions
• Hadoop Resource Requirements
• Browser Requirements

Platfora Server Requirements

Platfora recommends the following minimum system requirements for Platfora servers. For multi-node installations, the master server and all worker servers must be the same operating system (OS) and system configuration (same amount of memory, CPU, etc.).

64-bit Operating System or Amazon Machine Image (AMI)
    CentOS 6.2-6.5 (7.0 is not supported)
    RHEL 6.2-6.5 (7.0 is not supported)
    Scientific Linux 6.2
    Amazon Linux AMI 2014.03+
    Oracle Enterprise Linux 6.x
    Ubuntu 12.04.1 LTS or higher
    Security-Enhanced Linux 6.2 [1]

    [1] If you wish to install Security-Enhanced Linux, refer to Platfora's Support site for installation instructions.

Software
    Java 1.7
    Python 2.6.8, 2.7.1, 2.7.3 through 2.7.6 (3.0 not supported)
    PostgreSQL 9.2.1-1, 9.2.5, 9.2.7 or 9.3 (master only)
    OpenSSL 1.0.1 or higher

Unix Utilities
    rsync, ssh, scp, cp, tar, tail, sysctl, ntp, wget

Memory
    64 GB minimum, 256 GB recommended
    The server needs enough memory to accommodate actively used lens data. Additionally, it needs 1-2 GB reserved for normal operations and the lens query engine workspace.

CPU
    8 cores minimum, 16 recommended

Disk
    All Platfora nodes (master or worker) require 300 MB for the Platfora installation. Every node requires high-speed local storage and a local disk cache configured as a single logical volume. Hardware RAID is recommended for the best performance.
    All nodes combined require appropriate free space for aggregated data structures (Platfora lenses). At a minimum, you will need twice as much disk space as system memory. The Platfora master node requires approximately 700 MB of additional space for the metadata catalog (dataset definitions, vizboard and visualization definitions, lens definitions, etc.)

Network
    1 Gbps reliable network connectivity between Platfora master server and query processing servers
    1 Gbps reliable network connectivity between Platfora master server and Hadoop NameNode and JobTracker/ResourceManager node
    Network bandwidth should be comparable to the amount of memory on the Platfora master server
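As a rough pre-check of the CPU and memory minimums listed above, a small shell function can compare a node's values against them. This is only a sketch and not part of the Platfora installer: the function name check_platfora_minimums is hypothetical, and the core count and memory size (in GB) are passed in as arguments rather than detected.

```shell
# Hypothetical helper (not a Platfora utility): compare a node's CPU core
# count and memory size in GB with the minimums listed in this section.
check_platfora_minimums() {
    cores=$1
    mem_gb=$2
    status=0
    if [ "$cores" -lt 8 ]; then
        echo "CPU: $cores cores is below the 8-core minimum"
        status=1
    fi
    if [ "$mem_gb" -lt 64 ]; then
        echo "Memory: $mem_gb GB is below the 64 GB minimum"
        status=1
    fi
    if [ "$status" -eq 0 ]; then
        echo "OK: $cores cores, $mem_gb GB"
    fi
    return "$status"
}
```

For example, on a node with 64 GB of installed memory, the local core count can be supplied by nproc and the memory size entered by hand:

$ check_platfora_minimums "$(nproc)" 64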

Port Configuration Requirements

You must open ports in the firewall of your Platfora nodes to allow client access and intra-cluster communications. You also must open ports within your Hadoop cluster to allow access from Platfora. This section lists the default ports required.


Ports to Open on Platfora Nodes

Your Platfora master node must allow HTTP connections from your user network. All nodes must allow connections from the other Platfora nodes in a multi-node cluster.

On Amazon EC2 instances, you must configure the port firewall rules on the Platfora server instances in addition to the EC2 Security Group Settings.

Platfora Service                          Default Port  Allow connections from
Master Web Services Port (HTTP)           8001          External user network, Platfora worker servers, localhost
Secure Master Web Services Port (HTTPS)   8443          External user network, Platfora worker servers, localhost
Master Server Management Port             8002          Platfora worker servers, localhost
Worker Server Management Port             8002          Platfora master server, other Platfora worker servers, localhost
Master Data Port                          8003          Platfora worker servers, localhost
Worker Data Port                          8003          Platfora master server, other Platfora worker servers, localhost
Master PostgreSQL Database Port           5432          Platfora worker servers, localhost
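To make the master-node port list concrete, the sketch below prints one firewall rule per default port without applying anything. It assumes the node uses iptables; the function name print_master_port_rules is hypothetical, and the printed rules accept traffic from any source, so they would still need to be narrowed to the sources listed in the table.

```shell
# Print (do not apply) one iptables rule per default Platfora master port.
# Assumption: iptables is the firewall in use. The rules printed here accept
# traffic from any source; restrict them to the sources listed in the table.
print_master_port_rules() {
    for port in 8001 8443 8002 8003 5432; do
        echo "iptables -I INPUT -p tcp --dport $port -j ACCEPT"
    done
}
```

Review the printed commands before running any of them as root.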

Ports to Open on Hadoop Nodes

Platfora must be able to access certain services of your Hadoop cluster. This section lists the Hadoop services Platfora needs to access and the default ports for those services.


Note that this only applies to on-premise Hadoop deployments or to self-managed Hadoop deployments in a virtual private cloud, not to Amazon Elastic MapReduce (EMR).

Default Ports by Hadoop Distro

Hadoop Service                   CDH, HDP, Pivotal   Apache Hadoop   MapR    Allow connections from
HDFS NameNode                    8020                9000            N/A     Platfora master and worker servers
HDFS DataNodes                   50010               50010           N/A     Platfora master and worker servers
MapRFS CLDB                      N/A                 N/A             7222    Platfora master and worker servers
MapRFS DataNodes                 N/A                 N/A             5660    Platfora master and worker servers
MRv1 JobTracker                  8021                9001            9001    Platfora master server
MRv1 JobTracker Web UI           50030               50030           50030   External user network (optional)
YARN ResourceManager             8032                8032            8032    Platfora master server
YARN ResourceManager Web UI      8088                8088            8088    External user network (optional)
YARN Job History Server          10020               10020           10020   Platfora master server
YARN Job History Server Web UI   19888               19888           19888   External user network (optional)
HiveServer Thrift Port           10000               10000           10000   Platfora master server
Hive Metastore DB Port           9083, 9933 (HDP2)

Supported Hadoop and Hive Versions

This section lists the Hadoop distributions and versions that are compatible with the Platfora installation packages. If using Hive as a data source for Platfora, the version of Hive must be compatible with the version of Hadoop you are using.

Hadoop Distro             Version        Hive Version   M/R Version   Platfora Package
Cloudera 5                CDH5.0         0.12           YARN          cdh5
                          CDH5.1         0.12           YARN          cdh5
                          CDH5.2         0.13           YARN          cdh52
                          CDH5.3         0.13.1         YARN          cdh52
                          CDH5.4         1.1            YARN          cdh54
Hortonworks               HDP 2.1.x      0.13.0         YARN          hadoop_2_4_0_hive_0_13_0
                          HDP 2.2.x      0.14.0         YARN          hadoop_2_6_0_hive_0_14_0
MapR                      MapR 4.0.1     0.12.0         YARN          mapr4
Pivotal Labs              PivotalHD 3.0  0.14.0         YARN          hadoop_2_6_0_hive_0_14_0
Amazon EMR (AMI 3.7.x)    Hadoop 2.4.0   0.13.1         YARN          hadoop_2_4_0_hive_0_13_0

Hadoop Resource Requirements

Platfora must be able to connect to an existing Hadoop installation. Platfora also requires permissions and resources in the Hadoop source system. This section describes the Hadoop resource requirements for Platfora.

Platfora uses the remote Distributed File System (DFS) of the Hadoop cluster for persistent storage and as the primary data source. Optionally, you can also configure Platfora to use a Hive metastore server as a data source.


Platfora uses the Hadoop MapReduce services to process data and build lenses. For larger lens builds to succeed, Platfora requires minimum resources on the Hadoop cluster for MapReduce tasks.

DFS Disk Space
    Platfora requires a designated persistent storage directory in the remote distributed file system (DFS) with appropriate free space for Platfora system files and data structures (lenses). The location is configurable.

DFS Permissions
    The platfora system user needs read permissions to source data directories and files.
    The platfora system user needs write permissions to Platfora's persistent storage directory on DFS.

MapReduce Permissions
    The platfora system user needs to be added to the submit-jobs and administer-jobs access control list (or added to a group that has these permissions).

DFS Resources
    Minimum Open File Limit = 5000

MapReduce Resources
    Minimum Memory for Task Processes = 1 GB

Browser Requirements

Users can connect to the Platfora web application using the latest HTML5-compliant web browsers. Platfora supports the latest releases of the following web browsers:

• Chrome (preferred browser)
• Firefox
• Safari
• Internet Explorer with the Compatibility View feature disabled (versions prior to IE 10 are not supported)

Chapter 3: Configure Hadoop for Platfora Access

Before initializing and starting Platfora for the first time, you must make sure that Platfora can connect to Hadoop and access the directories and services it needs. The tasks in this section are performed in your Hadoop environment, and apply to on-premise Hadoop installations only (not to Amazon EMR).

Topics:
• Create Platfora User on Hadoop Nodes
• Create Platfora Directories and Permissions in Hadoop
• HDFS Tuning for Platfora
• MapReduce Tuning for Platfora
• YARN Tuning for Platfora

Create Platfora User on Hadoop Nodes

Platfora requires a platfora system user account on each node in your Hadoop cluster. The Platfora server uses this system user account to submit jobs to the Hadoop cluster and to access the necessary files and directories in the Hadoop distributed file system (HDFS).

Creating a system user requires root or sudo permissions.

1. Create the platfora user:

$ sudo useradd -s /bin/bash -m -d /home/platfora platfora

2. Set a password for the platfora user:

$ sudo passwd platfora
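Because the account must exist on every Hadoop node, it can help to confirm the result on each one. A minimal sketch, assuming the standard id utility is available; the function name user_exists is hypothetical:

```shell
# Return success if the named system user exists on this node.
# `id -u NAME` exits non-zero when the user is unknown.
user_exists() {
    id -u "$1" >/dev/null 2>&1
}
```

For example:

$ user_exists platfora && echo "platfora user found"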

Create Platfora Directories and Permissions in Hadoop

Platfora requires read and write permissions to a designated directory in the Hadoop file system where it can store its metadata and MapReduce output. Platfora connects to HDFS as the platfora user and also runs its MapReduce jobs as this same user.


Create a data directory for Platfora and set the platfora system user as its owner. In the example below, the Hadoop file system has a user called hdfs, the directory is called /platfora and the Platfora server is running as the platfora system user:

$ sudo -u hdfs hadoop fs -mkdir /platfora
$ sudo -u hdfs hadoop fs -chown platfora /platfora
$ sudo -u hdfs hadoop fs -chmod 711 /platfora

Note that for MapR, run the commands as the mapr user:

$ sudo -u mapr hadoop fs -mkdir /platfora
$ sudo -u mapr hadoop fs -chown platfora /platfora
$ sudo -u mapr hadoop fs -chmod 711 /platfora

The platfora system user needs access to the location where MapReduce writes its staging files. The location of the staging area depends on your Hadoop distribution. In Cloudera, MapR, Pivotal, and Hortonworks, the default location is /user/username. In Apache, the location is /tmp/xxx_/xxx_/username.

Make sure this location exists and is writable by the platfora system user. For example, on Cloudera:

$ sudo -u hdfs hadoop fs -mkdir /user/platfora
$ sudo -u hdfs hadoop fs -chown platfora /user/platfora

For example, on MapR:

$ sudo -u mapr hadoop fs -mkdir /user/platfora
$ sudo -u mapr hadoop fs -chown platfora /user/platfora
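After creating the directories, their existence can be confirmed with the hadoop fs -test -d command, which exits with status 0 when the path is a directory. The wrapper below is a sketch only: the function name check_hdfs_dirs is hypothetical, and it checks existence, not ownership or write permission.

```shell
# Report which of the given HDFS paths exist as directories.
# Relies on `hadoop fs -test -d PATH`, which exits 0 for a directory.
# Returns non-zero if any path is missing.
check_hdfs_dirs() {
    missing=0
    for dir in "$@"; do
        if hadoop fs -test -d "$dir"; then
            echo "found:   $dir"
        else
            echo "missing: $dir"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, using the directory names from this section:

$ check_hdfs_dirs /platfora /user/platfora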

During lens build processing, the platfora system user needs to be able to write to the intermediate and log directories on the Hadoop nodes. Check the following Hadoop configuration properties and make sure the specified locations exist in HDFS and are writable by the platfora system user.

Property                          Hadoop Configuration File   Description
mapreduce.cluster.local.dir       mapred-site.xml             Tells the MapReduce servers where to store intermediate files for a job.
mapreduce.jobtracker.system.dir   mapred-site.xml             The directory where MapReduce stores control files.
mapreduce.cluster.temp.dir        mapred-site.xml             A shared directory for temporary files.
mapr.centrallog.dir (MapR Only)   mapred-site.xml             The central job log directory for MapR Hadoop.

The platfora system user also needs to be added to the submit-jobs and administer-jobs access control lists (or added to a group that has these permissions).

The platfora system user also needs read permissions to the source data directories and files that you want to analyze in Platfora.

HDFS Tuning for Platfora

Platfora opens files on the Hadoop NameNode and DataNodes as it does its work to build the lens. This section describes how to ensure your Hadoop cluster has file limits that support lens build operations.

Increase Open File Limits

Platfora opens files on the Hadoop NameNode and DataNodes as it builds the lens. For multiple lens builds or for lenses that have a lot of fields selected, a lens build can cause your Hadoop nodes to exceed the maximum open file limit. When this limit is exceeded, Platfora lens builds will fail with a "Too many open files..." exception.

Linux operating systems limit the number of open files and connections a process can have. This prevents one application from slowing down the entire system by requesting too many file handles. When an application exceeds the limit, the operating system prevents the application from requesting more file handles, causing the process to fail with a "Too many open files..." error.

Verify your file limits are adequate on each Hadoop node. Increase the limits on your Hadoop nodes where the limits are too low. There are two places file limits are set in the Linux operating system:
• A global limit for the entire system (set in /etc/sysctl.conf)
• A per-user process limit (set in /etc/security/limits.conf)

You can check the global limit by running the command:

$ cat /proc/sys/fs/file-nr

This should return a set of three numbers like this:

704 0 294180

The first number is the number of allocated file handles. The second number is the number of allocated but unused file handles. The third number is the maximum number of file handles for the whole system. The maximum should be at least 250000.
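The comparison against the recommended maximum can be scripted. The sketch below takes the three numbers from /proc/sys/fs/file-nr as arguments and checks only the third; the function name check_file_max is hypothetical.

```shell
# Hypothetical helper: check the system-wide file handle maximum.
# Arguments are the three fields of /proc/sys/fs/file-nr:
#   $1 = allocated handles, $2 = allocated but unused handles, $3 = maximum
check_file_max() {
    max=$3
    if [ "$max" -ge 250000 ]; then
        echo "OK: fs.file-max is $max"
    else
        echo "LOW: fs.file-max is $max; raise it to at least 250000"
        return 1
    fi
}
```

For example, on a Hadoop node:

$ check_file_max $(cat /proc/sys/fs/file-nr)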


To increase the global limit, edit /etc/sysctl.conf (as root) and set the property:

fs.file-max = 294180

Increase Platfora User Limits

You can check the per-user process limit by running the command:

$ ulimit -n

This should return the file limit for the currently logged in user, for example:

1024

This limit should be at least 5000 for the platfora system user (or whatever user runs Platfora lens build jobs).

To increase the limit, edit /etc/security/limits.conf (as root) and add the following lines (the * increases the limit for all system users):

* hard nofile 65536
* soft nofile 65536
root hard nofile 65536
root soft nofile 65536

You must reboot the server whenever you change OS kernel settings.
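Once the new limits are in effect, the per-user value can be compared with the 5000 minimum. A minimal sketch, assuming the value comes from ulimit -n run as the platfora user; the function name nofile_limit_ok is hypothetical.

```shell
# Hypothetical helper: succeed if an open-file limit (as printed by
# `ulimit -n`) is at least 5000. The value "unlimited" always passes.
nofile_limit_ok() {
    limit=$1
    if [ "$limit" = "unlimited" ]; then
        return 0
    fi
    [ "$limit" -ge 5000 ]
}
```

For example, when logged in as the platfora user:

$ nofile_limit_ok "$(ulimit -n)" || echo "open file limit is too low"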

Increase DataNode File Limits

A Hadoop HDFS DataNode has an upper bound on the number of files that it can serve at any one time. In your Hadoop configuration, make sure the DataNodes are tuned to have an upper bound of at least 5000 by setting the following properties in the hdfs-site.xml file (located in the conf directory on your Hadoop NameNode):

Framework      hdfs-site.xml Property              Minimum Value
MapReduce v1   dfs.datanode.max.xcievers           5000
YARN           dfs.datanode.max.transfer.threads   5000

Allow Platfora Local Access

If the platfora system user is not able to make HDFS calls during lens build processing, lens build jobs in Platfora will stall at 0% progress. To prevent this, make sure your hdfs-site.xml files contain the dfs.block.local-path-access.user parameter and that its value includes the platfora system user. For example:

<name>dfs.block.local-path-access.user</name>
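To check an existing hdfs-site.xml for this parameter, the value can be pulled out with standard tools. The sketch below is not a Platfora utility and makes two assumptions: the file uses the usual layout with each name and value element on one line, and the parameter value is a comma-separated list of user names without spaces. Both function names are hypothetical.

```shell
# Print the <value> that follows a given <name> in a Hadoop *-site.xml file.
# Assumption: each <name> and <value> element sits on a single line.
hadoop_property_value() {
    awk -v name="$1" '
        index($0, "<name>" name "</name>") { found = 1 }
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$2"
}

# Succeed if the comma-separated list in $1 contains the user name in $2.
value_includes_user() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

For example, with the path to your own hdfs-site.xml in place of the file name:

$ value_includes_user "$(hadoop_property_value dfs.block.local-path-access.user hdfs-site.xml)" platfora || echo "platfora is not listed"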

MapReduce Tuning for Platfora

It is common in Hadoop to customize configuration file properties to suit a specific MapReduce workload. This section lists the mapred-site.xml properties that Platfora needs for its lens builds. Platfora can pass in certain properties at runtime for its lens build jobs. Other properties must be set on the Hadoop nodes themselves.

Runtime properties can be set in the Platfora local copy of the mapred-site.xml file, and are then passed to Hadoop with the lens build job configuration. Non-runtime properties must be configured in your Hadoop environment directly.

Consult your Hadoop vendor's documentation for recommended memory configuration settings for Hadoop task/container nodes. These settings depend on the node hardware specifications, and can vary for each environment.

Required Properties for MapReduce v1 Hadoop Clusters

These properties must be set in order for lens build jobs to succeed. You can set these in the local mapred-site.xml file on the Platfora master, and they will be passed to Hadoop at runtime.

Property: mapred.child.java.opts
Recommended Value: At least -Xmx1024m. Can be set higher based on the amount of memory on your Hadoop nodes and the number of simultaneous task slots available per node.
Default Value: -Xmx200m
Runtime?: YES
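When reviewing an existing java.opts setting, the heap size has to be read out of the option string before it can be compared with the 1024 MB minimum. The sketch below does that conversion; the function name xmx_to_mb is hypothetical, and it only handles values written with a k, m, or g suffix.

```shell
# Hypothetical helper: print the heap size in MB from the last -Xmx option
# in a JVM options string. Only values with a k, m, or g suffix are handled;
# anything else makes the function return non-zero.
xmx_to_mb() {
    parsed=$(printf '%s\n' "$1" |
        sed -n 's/.*-Xmx\([0-9][0-9]*\)\([kKmMgG]\).*/\1 \2/p')
    number=${parsed% *}
    unit=${parsed#* }
    case "$unit" in
        k|K) echo $(( number / 1024 )) ;;
        m|M) echo "$number" ;;
        g|G) echo $(( number * 1024 )) ;;
        *)   return 1 ;;
    esac
}
```

For example, the default value converts to 200, which is below the 1024 minimum:

$ xmx_to_mb "-Xmx200m"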


Required Properties for YARN Hadoop Clusters

These properties must be set in order for lens build jobs to succeed. You can set the runtime properties in the local mapred-site.xml file on the Platfora master, and they will be passed to Hadoop at runtime. Non-runtime properties must be configured in your Hadoop environment directly.

Property: mapreduce.map.java.opts, mapreduce.reduce.java.opts
Recommended Value: At least -Xmx1024m. Can be set higher based on the amount of memory on your Hadoop nodes and the number of simultaneous task slots available per node.
Default Value: -Xmx200m
Runtime?: YES

Property: mapreduce.map.shuffle.input.buffer.percent
Recommended Value: 0.30. The percentage of total JVM heap size to allocate to storing map outputs during the shuffle phase.
Default Value: 0.70
Runtime?: YES

Property: mapreduce.reduce.shuffle.input.buffer.percent
Recommended Value: 0.30. The percentage of total JVM heap size to allocate to storing reduce outputs during the shuffle phase.
Default Value: 0.70
Runtime?: YES

Property: mapreduce.map.memory.mb
Recommended Value: The calculated RAM per container size for your hardware specifications. Platfora requires at least 1024.
Default Value: 512
Runtime?: NO

Property: mapreduce.reduce.memory.mb
Recommended Value: The calculated RAM per container size for your hardware specifications. Platfora requires at least 1024.
Default Value: 512
Runtime?: NO

Property: mapreduce.framework.name
Recommended Value: yarn. Make sure this is set to yarn to prevent jobs from running in local mode.

Optional Sort Tuning Properties

These properties increase the number of streams to merge at once when sorting files and set a higher memory limit for sort operations. If the sort phase can fit the data in memory, performance will be better than if it spills to disk. You may decide to increase these values if the lens build job details show that records are spilling to disk. However, setting them too high can result in job failures: if too much of the JVM heap is reserved for sorting, not enough is left for other task operations.

The following optional mapred-site.xml properties apply to both MapReduce v1 and YARN Hadoop clusters.

Property: io.sort.factor
Recommended Value: 100
Default Value: 10
Runtime?: YES

Property: io.sort.mb
Recommended Value: 25-30% of the *.java.opts values. For example, if the java.opts properties are set to 1024MB, this should be about 256MB.
Default Value: 100
Runtime?: YES

Property: io.sort.record.percent
Recommended Value: 0.15
Default Value: 0.05
Runtime?: YES
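The io.sort.mb guideline above is simple percentage arithmetic. The following is a minimal sketch of that calculation, not a Platfora utility: the function name suggest_io_sort_mb and the default of 25% are assumptions made for illustration.

```shell
# Hypothetical helper: suggest an io.sort.mb value as a percentage of the
# task JVM heap. $1 is the -Xmx value in MB, $2 is the percentage to reserve
# (assumed default: 25, the low end of the 25-30% guideline).
suggest_io_sort_mb() {
  heap_mb=$1
  pct=${2:-25}
  # Shell arithmetic is integer-only, so the result is rounded down.
  echo $(( heap_mb * pct / 100 ))
}

suggest_io_sort_mb 1024      # prints 256, matching the 1024MB example
suggest_io_sort_mb 2048 30   # 30% of a 2048 MB heap
```

Treat the result as a starting point and check the lens build job details for spilled records before raising it further.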

YARN Tuning for Platfora

This configuration is only required for Hadoop MapReduce v2 clusters with YARN. This section lists the yarn-site.xml properties that Platfora needs for its lens builds. Platfora can pass in certain properties at runtime for its lens build jobs. Other properties must be set on the Hadoop nodes themselves.

Runtime properties can be set in the Platfora local copy of the yarn-site.xml file, and are then passed to Hadoop with the lens build job configuration. Non-runtime properties must be configured in your Hadoop environment directly.

Consult your Hadoop vendor's documentation for recommended memory configuration settings for Hadoop task/container nodes. These settings depend on the Hadoop node's hardware specifications, and can vary for each environment.


Required Properties for YARN Hadoop Clusters

Tuning these properties properly on your Hadoop nodes will optimize Platfora lens build jobs.

Property: yarn.nodemanager.resource.memory-mb
Recommended Value: The total memory size for all containers on a node (in MB). Should be the total amount of RAM on the node, minus 15-20% for reserved system memory space.
Default Value: 8192
Runtime?: NO

Property: yarn.scheduler.minimum-allocation-mb
Recommended Value: The minimum memory size per container. Depends on the amount of total memory on a node:
• 512 MB (on nodes with 4-8 GB total RAM)
• 1024 MB (on nodes with 8-24 GB total RAM)
• 2048 MB (on nodes with more than 24 GB total RAM)
Default Value: 1024
Runtime?: YES

Property: yarn.scheduler.maximum-allocation-mb
Recommended Value: The maximum memory size per container. Same as yarn.nodemanager.resource.memory-mb.
Default Value: 8192
Runtime?: YES
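As a rough illustration of the sizing guidance above, the sketch below derives candidate values from a node's total RAM. The function names are hypothetical, the 20% default reserve is an assumption within the stated 15-20% range, and a node with exactly 8 GB is placed in the 1024 MB tier because the stated ranges overlap at that boundary. Your Hadoop vendor's recommendations take precedence.

```shell
# Hypothetical helper: candidate yarn.nodemanager.resource.memory-mb value.
# $1 is total node RAM in MB, $2 is the percentage reserved for the system
# (assumed default: 20).
nodemanager_memory_mb() {
  total_mb=$1
  reserve_pct=${2:-20}
  echo $(( total_mb * (100 - reserve_pct) / 100 ))
}

# Hypothetical helper: candidate yarn.scheduler.minimum-allocation-mb value.
# $1 is total node RAM in GB.
min_allocation_mb() {
  ram_gb=$1
  if [ "$ram_gb" -lt 8 ]; then
    echo 512
  elif [ "$ram_gb" -le 24 ]; then
    echo 1024
  else
    echo 2048
  fi
}

nodemanager_memory_mb 65536 20   # a node with 64 GB of RAM
min_allocation_mb 64
```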

Determine Maximum Reduce Tasks for Platfora

In addition to these YARN settings in Hadoop, you will need to determine the maximum number of MapReduce reduce tasks allowed for a Platfora lens build job. This number is then configured in Platfora after you initialize the Platfora master by setting the Platfora server configuration property platfora.reduce.tasks.

The number of reducer tasks can be determined using the following formula:

(yarn.nodemanager.resource.memory-mb / mapreduce.reduce.memory.mb) * number_of_hadoop_nodes
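The formula can be evaluated with shell arithmetic. This is a minimal sketch with a hypothetical function name and example values (8192 MB per node, 1024 MB per reduce container, 10 Hadoop nodes) chosen only for illustration.

```shell
# Hypothetical helper: evaluate the reduce task formula above.
# $1 = yarn.nodemanager.resource.memory-mb
# $2 = mapreduce.reduce.memory.mb
# $3 = number of Hadoop nodes
max_reduce_tasks() {
  node_memory_mb=$1
  reduce_memory_mb=$2
  nodes=$3
  # Integer division yields whole containers per node before multiplying.
  echo $(( node_memory_mb / reduce_memory_mb * nodes ))
}

max_reduce_tasks 8192 1024 10   # prints 80
```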

Chapter 4: Install Platfora Software and Dependencies

This section describes how to provision a Platfora node with the required prerequisites and Platfora software. If you are installing a new Platfora cluster, the master node needs everything (prerequisites and Platfora software). Worker nodes only need the prerequisite software installed prior to initialization.

Most of the tasks in this section require root permissions. The example commands in the documentation use sudo to denote the commands that require root permissions.

Topics:

• About the Platfora Installer Packages
• Install Using RPM Packages
• Install Using the TAR Package

About the Platfora Installer Packages

Platfora provides RPM or TAR installer packages that are specific to the Hadoop distribution you are using. Platfora Customer Support can provide you with the link to download the installer packages for your environment.

Make sure to download the correct Platfora installer packages for your Hadoop distribution and version. See Supported Hadoop and Hive Versions if you are not sure which Platfora package to use for your chosen Hadoop distribution.

RPM Packages

If you plan to install Platfora on a Linux operating system that supports the RPM package manager, such as RedHat or CentOS, Platfora recommends using the RPM packages to install Platfora and its required dependencies.

The platfora-base RPM package includes all the prerequisite software that Platfora needs, plus automates the OS configurations needed by Platfora. This package should be installed on all Platfora nodes (master and workers).


The platfora-server package includes the Platfora software only, which only needs to be installed on the master node. The Platfora software is copied to the worker nodes during initialization or upgrade, so you don't need to install it on the worker nodes ahead of time.

TAR Package

If you plan to install Platfora on a Linux operating system that does not support the RPM package manager, such as Ubuntu, you must use the TAR package. You may also use the TAR package if you prefer to install and manage the dependent software in your environment yourself. The TAR package contains the Platfora server software only, which only needs to be installed on the master node.

The TAR package does not contain the prerequisite software that Platfora needs. You must manually install the required prerequisite software and do the required OS configurations on all Platfora nodes prior to installing and initializing Platfora.

Install Using RPM Packages

Follow the instructions in this section to install the Platfora dependencies and server software using the RPM packages. Install the platfora-base RPM package on all Platfora nodes, and the platfora-server RPM package on the master node only.

Install Dependencies RPM Package

The platfora-base RPM package contains all of the dependent software required by Platfora, and also automates several OS configuration tasks. This package is installed on all Platfora nodes.

This task requires root permissions. Commands that begin with sudo denote root commands.

The platfora-base RPM package does the following:

• Creates a /usr/local/platfora/base directory containing Platfora's third-party dependencies.

• Creates the platfora system user. The platfora user has no password set.

• Generates an SSH key for the platfora system user and adds the key to the user's authorized_keys file.

• Adds the platfora system user to the sudoers file. This allows you to execute commands as root while logged in as the platfora user.

• Ensures the OS kernel parameters are appropriate for Platfora and sets them if they are not.

• Creates a .bashrc file for the platfora system user.

The platfora-base package uses the following file naming convention, where version-build is the version and build number of the base package only, and x86_64 is the supported system architecture. The base and Platfora server packages use different versioning schemes.

platfora-base-version-build-x86_64.rpm

The base package is not updated with every Platfora release. It is only updated when the Platfora dependencies change, which happens less often. When upgrading Platfora, check the release notes to see if an upgrade of the base package is required.

1. Log on to the machine on which you are installing Platfora.

2. Using the download link provided by Platfora Customer Support, download the base package. For example:

$ wget http://downloads.platfora.com/release/platfora-base-version-build-x86_64.rpm

3. Install the package using the yum package manager (requires root permission). For example:

$ sudo yum --nogpgcheck localinstall platfora-base-version-build-x86_64.rpm

4. Confirm that the /usr/local/platfora/base directory was created.

$ sudo ls -a /usr/local/platfora/base

Install Optional Security RPM Package

The platfora-security RPM package contains SSL-enabled PostgreSQL and the OpenSSL package it depends on. This package is only needed if you plan to enable SSL communications between the Platfora worker nodes and the Platfora metadata catalog database.

The platfora-security package is installed after the platfora-base package. The platfora-security RPM package does the following:

• Creates a /usr/local/platfora/security directory containing the SSL-enabled version of PostgreSQL.

• Checks if OpenSSL version 1.0.1 or later is installed, and if not downloads and installs the openssl package dependency from the OpenSSL public repo.

• Edits the .bashrc file for the platfora system user and changes the PATH environment variable so that secure PostgreSQL is listed before the default PostgreSQL installed by the platfora-base package.

The platfora-security package uses the following file naming convention, where version-build is the version and build number of the security package only, and x86_64 is the supported system architecture. The base, security and Platfora server packages use different versioning schemes.

platfora-security-version-build-x86_64.rpm

The security package only needs to be upgraded when the base package is upgraded, which is not every release. When upgrading Platfora, check the release notes to see if an upgrade of the base and security packages is required.

1. Log on to the machine on which you are installing Platfora.

2. Using the download link provided by Platfora Customer Support, download the security package. For example:

$ wget http://downloads.platfora.com/release/platfora-security-version-build-x86_64.rpm

3. Install the package using the yum package manager (requires root permission). For example:

$ sudo yum --nogpgcheck localinstall platfora-security-version-build-x86_64.rpm

4. Confirm that the /usr/local/platfora/security directory was created.

$ sudo ls -a /usr/local/platfora/security

Install Platfora RPM Package (Master Only)

The platfora-server RPM package contains the Platfora server software. This package is installed on the Platfora master node only.

The platfora-server RPM package creates a /usr/local/platfora/platfora-server directory containing the Platfora software.

The platfora-server package uses the following file naming convention, where hadoop_distro corresponds to the Hadoop distribution you are using, version-build is the version and build number of the Platfora software, and x86_64 is the supported system architecture.

platfora-server-hadoop_distro-version-build-x86_64.rpm

Make sure to download the correct Platfora installer packages for your Hadoop distribution and version. See Supported Hadoop and Hive Versions if you are not sure which Platfora package to use for your chosen Hadoop distribution.

This task requires root permissions. Commands that begin with sudo denote root commands.

1. Log on to the machine on which you are installing the Platfora master.

2. Using the download link provided by Platfora Customer Support, download the Platfora server package. For example:

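$ wget http://downloads.platfora.com/release/platfora-server-hadoop_distro-version-build-x86_64.rpm

3. Install the package using the yum package manager (requires root permission). For example:

$ sudo yum --nogpgcheck localinstall platfora-server-hadoop_distro-version-build-x86_64.rpm

4. Confirm that the /usr/local/platfora/platfora-server directory was created.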

$ sudo ls -a /usr/local/platfora/platfora-server

Install Using the TAR Package

Follow the instructions in this section to install the Platfora dependencies and server software using the TAR packages. The TAR package contains the Platfora server software only. You must install all dependencies yourself.

For the Platfora master node, do all the tasks described in this section.

For a Platfora worker node, do all the tasks described in this section except for:

• Install PostgreSQL

• Install Platfora TAR Package

• Install PDF Dependencies

Create the Platfora System User

Platfora requires a platfora system user account to own the Platfora installation and run the Platfora server processes. This same system user must be created on all Platfora nodes.

(MapR Only) If you are using MapR as your Hadoop distribution with Platfora, make sure to follow the additional steps for MapR. The platfora system user must exist on all Platfora nodes and all MapR nodes. The UID/GID must also be the same on the MapR nodes as on Platfora nodes.

1. Create the platfora system user:

$ sudo useradd -s /bin/bash -m -d /home/platfora platfora

2. Set a password for the platfora user:

$ sudo passwd platfora

3. (MapR Only) Check the /etc/passwd file on your MapR CLDB node, and find the entry for the platfora user. Note the user and group id numbers that are used. For example:

platfora:x:1002:1002::/home/platfora:/bin/bash

4. (MapR Only) Check the /etc/passwd file on your Platfora master node. If the user and group id numbers for the platfora user are different, update them so that they are the same as on the MapR nodes.

For example:

$ sudo usermod -u 1002 platfora

$ sudo groupmod -g 1002 platfora

Configure sudo for the platfora User

This is an optional task. Configuring sudo access for the platfora system user is a convenient way to run commands as root while logged in as the platfora user.

If you do not configure sudo access for the platfora user, then you must change to the root user to execute the system commands that require root permissions.

This documentation assumes that you have sudo access configured. If you do not, every time you see sudo at the beginning of a command, it means you need to be root to run the command.

1. Edit the /etc/sudoers file using the visudo command.

$ sudo visudo

2. Add a line such as the following in this file:

# User privilege specification
platfora ALL=(ALL:ALL) ALL

3. Save your changes and exit the visudo editor.

Generate and Authorize an SSH Key

Generating and authorizing an SSH key for the platfora system user on the localhost is required by the Platfora management utilities. This task should be performed on all Platfora nodes.

The Platfora management utilities require a trusted-host environment (the ability to SSH to a remote system in the Platfora cluster without a password prompt). Even in single-node installations, you must exchange SSH keys for the localhost.

1. Make sure that SELinux is disabled using either the sestatus or getenforce command.

$ sestatus

If SELinux is enabled, disable it using the recommended procedure for the node's operating system.

2. Make sure you are logged in to the Platfora server as the platfora system user.

$ su - platfora

3. Go to the ~/.ssh directory (create it if it does not exist):

$ mkdir -p ~/.ssh

$ cd ~/.ssh

4. Generate a public/private key pair that is NOT passphrase-protected. Press the ENTER or RETURN key for each prompt:

$ ssh-keygen -C 'platfora key for node 0' -t rsa

Enter file in which to save the key (/home/platfora/.ssh/id_rsa): ENTER

Enter passphrase (empty for no passphrase): ENTER

Enter same passphrase again: ENTER

5. Append the public key to the ~/.ssh/authorized_keys file (this allows SSH access from the current host to itself):

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

6. Make sure the home directory, .ssh directory, and the files it contains have the correct permissions:

$ chmod 700 $HOME && chmod 700 ~/.ssh && chmod 600 ~/.ssh/*

7. Test that you can SSH to localhost without a password prompt.

If prompted to add localhost to the list of known hosts, enter yes:

$ ssh localhost

The authenticity of host 'localhost (127.0.0.1)' can't be established...

Are you sure you want to continue connecting (yes/no)? yes

Set OS Kernel Parameters

This section has the Linux OS kernel settings required for Platfora. You must have root or sudo permissions to change kernel parameter settings. Changing kernel settings requires a system reboot in order for the changes to take effect.

Kernel ulimit Setting

Linux operating systems set limits on the number of open files and connections a process can have. For some applications, such as Platfora and Hadoop, having a lot of open file handlers during processing is normal. Having the limit set too low can cause Platfora lens builds to fail.

There are two places file limits are set in the Linux operating system:

• A global limit for the entire system (set in /etc/sysctl.conf)

• A per-user process limit (set in /etc/security/limits.conf)

You must have root or sudo permissions to change OS ulimit settings. You can check the global limit by running the command:

$ cat /proc/sys/fs/file-nr

This should return a set of three numbers like this:

704 0 294180

The first number is the number of allocated file handles. The second number is the number of allocated but unused file handles. The third number is the maximum number of file handles for the whole system. This limit should be at least 250000.

To increase the global limit, edit /etc/sysctl.conf (as root) and set the property:

fs.file-max = 294180
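The global limit check can be scripted. The following is a minimal sketch, assuming a hypothetical helper name and using the 250000 minimum stated above as the default threshold. It takes the file-nr contents as an argument so that it can be tried with sample values.

```shell
# Hypothetical helper: report whether the system-wide file limit is high
# enough. $1 is the contents of /proc/sys/fs/file-nr, $2 is the required
# minimum (assumed default: 250000).
check_file_max() {
  # The third field of file-nr is the system-wide maximum.
  max=$(echo "$1" | awk '{print $3}')
  if [ "$max" -ge "${2:-250000}" ]; then
    echo "ok"
  else
    echo "too low"
  fi
}

check_file_max "704 0 294180"   # prints ok
```

On a Platfora node, the current values can be passed in with check_file_max "$(cat /proc/sys/fs/file-nr)".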

You can check the per-user process limit by running the command:

$ ulimit -n

This should return the file limit for the currently logged in user, for example:

1024

This limit should be at least 20000 for the platfora user (or whatever user runs the Platfora server). To increase the limit, edit /etc/security/limits.conf (as root) and add the following lines (the * increases the limit for all system users):

* hard nofile 65536
* soft nofile 65536
root hard nofile 65536
root soft nofile 65536

Reboot the server for the changes to take effect.

$ sudo reboot

Kernel Memory Overcommit Setting

Linux operating systems allow memory to be overcommitted, meaning the OS will allow an application to reserve more memory than actually exists within the system. Allowing overcommit prevents the OS from killing processes when a process requests more memory than is available.

If you are using a version 1.6 Java Runtime Environment (JRE), you must configure your OS to allow memory overcommit. If you are using a version 1.7 JRE, overcommit is not necessary.

You must have root or sudo permissions to change kernel memory overcommit settings.

1. Check your version of Java.

$ java -version

If you are running a 1.6 version, proceed to the next steps. If you are running a 1.7 version, you do not need to make any further changes.

2. Edit the /etc/sysctl.conf file.

$ sudo vi /etc/sysctl.conf

3. Set the following value:

vm.overcommit_memory=1

4. Save and close the file.

5. Reboot your system for the change to take effect:

$ sudo reboot

Kernel Shared Memory Settings

Some default OS installations have the system shared memory values set too low for Platfora. You may need to increase the shared memory settings if they are set too low.

You must have root or sudo permissions to set the system shared memory parameters.

1. In /etc/sysctl.conf, make sure the shared memory parameters have the minimum values or higher.

If your settings are lower than these minimum values, you will need to change them. If they are higher than the minimum, leave them as is.

kernel.shmmax=17179869184
kernel.shmall=4194304

2. If you made changes to /etc/sysctl.conf, reboot the server for the changes to take effect.

$ sudo reboot

Install Dependent Software

If using the TAR installation package to install Platfora, you must install all of the dependencies yourself. This section provides instructions for manually installing the prerequisite software on a Platfora node.

If you are provisioning a Platfora master node, you must install all dependencies.

If you are provisioning a Platfora worker node, you can skip the task for installing PostgreSQL. PostgreSQL is only needed on the Platfora master node.

Confirm Linux OS Utilities

Platfora requires several standard Linux utilities to be installed on your system and in your environment PATH. Check your system for the required utilities before installing Platfora. Most Linux operating systems already have these utilities installed by default.

• rsync
• ssh
• scp
• tail
• tar
• cp
• wget
• ntp
• sysctl (/usr/sbin must be in your PATH)

To verify that a utility is installed and can be found in the PATH, you can check its location using the which command. For example:

$ which rsync

$ which tar

$ which sysctl

If a utility is not installed, you will need to install it before installing Platfora. Check your OS documentation for instructions on installing these utilities.
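To check the whole list in one pass, a small loop can report only the utilities that are missing. This is a sketch with a hypothetical function name. It uses the shell built-in command -v in place of which, and it leaves out ntp because the name of the installed NTP command or service varies by operating system, so verify that one separately.

```shell
# Hypothetical helper: print the name of each utility that cannot be found
# on the current PATH. Prints nothing if all of them are found.
missing_utils() {
  for util in "$@"; do
    command -v "$util" >/dev/null 2>&1 || echo "$util"
  done
}

missing_utils rsync ssh scp tail tar cp wget sysctl
```

Run the check as the platfora user so that it reflects the PATH that the Platfora server will use.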


Install Java

The Platfora server requires a Java Runtime Environment (JRE) version 1.7 or higher. Platfora recommends installing the full Java Development Kit (JDK) for access to the latest Java features and diagnostic tools.

The instructions in this section are for installing version 1.7 of the Open Java Development Kit (OpenJDK).

You must have root or sudo permissions to install Java.

1. Check if Java 1.7 or higher is already installed.

$ java -version

If java is not found, you will need to install it.

2. Install OpenJDK using your OS package manager.

On Ubuntu Systems:

$ sudo apt-get install openjdk-7-jdk

On RedHat/CentOS Systems:

$ su -c "yum install java-1.7.0-openjdk"

3. Set the JAVA_HOME environment variable in the platfora user’s profile file. For example, where java_directory is the versioned directory where Java is installed:

$ echo 'export JAVA_HOME=/usr/lib/jvm/java_directory/jre' >> /home/platfora/.bashrc

$ echo 'export PATH=$JAVA_HOME/bin:$PATH' >> /home/platfora/.bashrc

$ source /home/platfora/.bashrc

4. Make sure JAVA_HOME is set correctly for the platfora user:

$ su - platfora

$ echo $JAVA_HOME

Confirm Python Installation

The Platfora management utilities require Python version 2.6.8, 2.7.1, or 2.7.3 through 2.7.6. Python version 3.0 is not supported. Most Linux operating systems already have Python installed by default, but you need to make sure the version is compatible with Platfora.

To check if the correct version of Python is installed:

$ python -V

If Python is not installed (or you have an incompatible version of Python) you will need to install or upgrade/downgrade it before installing Platfora. Check your OS documentation for instructions on installing or upgrading/downgrading Python to one of the supported 2.x versions listed above.
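The supported version list (2.6.8, 2.7.1, and 2.7.3 through 2.7.6) can be checked with a shell pattern match. This is a minimal sketch with a hypothetical function name. It only compares a version string and does not install or change anything.

```shell
# Hypothetical helper: report whether a Python version string is one of the
# versions listed as supported in this section.
python_version_supported() {
  case "$1" in
    2.6.8|2.7.1|2.7.[3-6]) echo "yes" ;;
    *) echo "no" ;;
  esac
}

python_version_supported 2.7.5   # prints yes
python_version_supported 3.0     # prints no
```

To test the installed interpreter, pass in the version it reports, for example python_version_supported "$(python -V 2>&1 | awk '{print $2}')". The 2>&1 is needed because Python 2 prints its version to standard error.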


Install PostgreSQL (Master Only)

Platfora stores its metadata catalog in a PostgreSQL relational database. PostgreSQL version 9.2 or 9.3 must be installed (but not running) on the Platfora master server before you start Platfora for the first time. Platfora worker nodes do not require a PostgreSQL installation.

You must have root or sudo permissions to install PostgreSQL.

Install PostgreSQL 9.2 on Ubuntu Systems

These instructions are for installing PostgreSQL 9.2 on Ubuntu Linux operating systems.

1. Install the dependent libraries:

$ sudo apt-get install libpq-dev

2. Add the PostgreSQL repository to your system configuration:

$ sudo add-apt-repository ppa:pitti/postgresql

$ sudo apt-get update

3. Install PostgreSQL 9.2:

$ sudo apt-get install postgresql-9.2

4. Stop the PostgreSQL service.

$ sudo service postgresql stop

5. Remove the PostgreSQL automatic start-up scripts:

$ sudo rm /etc/rc*/*postgresql

6. Create and change the ownership on the directory where PostgreSQL writes its lock files:

$ sudo mkdir /var/run/postgresql

$ sudo chown platfora /var/run/postgresql

7. Update the platfora user’s PATH environment variable to include the PostgreSQL executable directory and /usr/sbin:

$ echo "export PATH=/usr/lib/postgresql/9.2/bin:/usr/sbin:$PATH" >> /home/platfora/.bashrc

$ source /home/platfora/.bashrc

Install PostgreSQL 9.2 on RedHat/CentOS Systems

These instructions are for installing PostgreSQL 9.2 on RedHat Enterprise Linux (RHEL) or CentOS operating systems.

1. Download the appropriate PostgreSQL 9.2 YUM repository for your operating system. Go to the PostgreSQL yum repository website, copy the URL link for the appropriate YUM repository configuration, and download it using wget.

For example, to download the YUM repository configuration for PostgreSQL 9.2 on a 64-bit RHEL 6 operating system:

$ wget http://yum.pgrpms.org/9.2/redhat/rhel-6-x86_64/pgdg-redhat92-9.2-7.noarch.rpm

2. Add the PostgreSQL YUM repository to your system configuration:


3. Install PostgreSQL:

$ sudo yum install postgresql92 postgresql92-server

4. If it is enabled, disable the PostgreSQL automatic start-up.

Each operating system has its own technique for auto starting PostgreSQL. If your system uses chkconfig to manage init scripts, you can remove PostgreSQL from the chkconfig control using the following command:

chkconfig --del postgresql

For some operating systems, the PostgreSQL start.conf file configures the auto-start of a specific PostgreSQL cluster.

5. Create and change the ownership on the directory where PostgreSQL writes its lock files:

$ sudo mkdir /var/run/postgresql

$ sudo chown platfora /var/run/postgresql

6. Update the platfora user’s PATH environment variable to include the PostgreSQL executable directory and /usr/sbin:

$ echo "export PATH=/usr/pgsql-9.2/bin:/usr/sbin:$PATH" >> /home/platfora/.bashrc

$ source /home/platfora/.bashrc

Confirm OpenSSL Installation

Platfora uses OpenSSL for secure communications between the Platfora worker servers and its metadata catalog database. If you decide to enable SSL for the Platfora catalog, which is optional, you will need OpenSSL version 1.0.1 or higher on your Platfora nodes.

As an optional security feature, you can choose to enable SSL communications between the Platfora metadata catalog and the Platfora worker nodes. If you decide to enable this, you will need to have:

• SSL-enabled PostgreSQL. If using the RPM installation packages, Platfora provides an optional platfora-security package that contains SSL-enabled PostgreSQL. If using the TAR installation packages, the packages provided in the PostgreSQL public repo come with SSL enabled.

• OpenSSL. If using the RPM installation packages, Platfora provides an optional platfora-security RPM package that pulls this dependency from the public repo. If using the TAR installation packages, you will have to install the openssl package yourself.

Many Linux operating systems already have OpenSSL installed by default, but you need to make sure the version is compatible with the version that PostgreSQL uses.

1. Check that OpenSSL version 1.0.1 or higher is installed.

$ openssl version

2. If OpenSSL is not installed (or you have an incompatible version) you will need to install or upgrade it before enabling SSL for the Platfora catalog. Check your OS documentation for instructions on installing or upgrading the openssl package.
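The version check in step 1 can be automated by parsing the output of the openssl version command. The sketch below is an illustration only: the function name is hypothetical, and it assumes the usual output layout in which the second field is the version, for example "OpenSSL 1.0.1e-fips 11 Feb 2013".

```shell
# Hypothetical helper: report whether an "openssl version" output line shows
# version 1.0.1 or higher. $1 is the full output line.
openssl_version_ok() {
  echo "$1" | awk '{
    split($2, v, ".")
    # Drop any letter or suffix from the patch part, e.g. "1e-fips" -> "1".
    sub(/[^0-9].*/, "", v[3])
    major = v[1] + 0; minor = v[2] + 0; patch = v[3] + 0
    if (major > 1 || (major == 1 && (minor > 0 || patch >= 1)))
      print "ok"
    else
      print "too old"
  }'
}

openssl_version_ok "OpenSSL 1.0.1e-fips 11 Feb 2013"   # prints ok
```

On a Platfora node, pass in the live output with openssl_version_ok "$(openssl version)".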


Install Platfora TAR Package (Master Only)

The TAR installation package contains the Platfora server software only. You only need to install this package on the Platfora master node. You can skip this task if you are provisioning a Platfora worker node.

The platfora tar package uses the following file naming convention, where version-build.num is the version and build number of the Platfora software and hadoop_distro corresponds to the Hadoop distribution you are using.

platfora-version-build.num-hadoop_distro.tgz

Make sure to download the correct Platfora installer package for your Hadoop distribution and version. See Supported Hadoop and Hive Versions if you are not sure which Platfora package to use for your chosen Hadoop distribution.

This task requires root permissions. Commands that begin with sudo denote root commands.

1. Log on to the machine on which you are installing the Platfora master.

2. Create a Platfora installation directory and ensure that it is owned by the platfora system user. For example:

$ sudo mkdir /usr/local/platfora

$ sudo chown platfora /usr/local/platfora -R

3. Log in as the platfora user and go to the installation directory that you just created:

$ su - platfora

$ cd /usr/local/platfora

4. Download the 4.5.0 release package and checksum file using the URLs provided by Platfora Customer Support.

Make sure to download the correct packages for your Hadoop distribution version. For example:

$ wget http://downloads.platfora.com/release/platfora-version-build.num-hadoop_distro.tgz

$ wget http://downloads.platfora.com/release/platfora-version-build.num-hadoop_distro.tgz.sha

5. After downloading the package and checksum file, make sure the package is valid using the shasum command.

For example:

$ shasum -c platfora-version-build.num-hadoop_distro.tgz.sha

If the package is valid, you should see a message such as:

platfora-version-build.num-hadoop_distro.tgz: OK

6. Unpack the package within the installation directory. For example:

$ tar -zxvf platfora-version-build.num-hadoop_distro.tgz

7. Create a symbolic link named platfora-server that points to the unpacked directory. For example:

$ ln -s platfora-version-build.num-hadoop_distro platfora-server

8. Set the PLATFORA_HOME environment variable for the platfora system user.

$ echo "export PLATFORA_HOME=/usr/local/platfora/platfora-server" >> $HOME/.bashrc

9. Set the PATH environment variable for the platfora system user.

The PATH should include /usr/sbin, $PLATFORA_HOME/bin, and the PostgreSQL executable directories. If your system has more than one version of PostgreSQL installed, make sure that 9.2 is listed first in the PATH of the platfora user.

For example (Ubuntu):

$ echo 'export PATH=/usr/lib/postgresql/9.2/bin:/usr/sbin:$PLATFORA_HOME/bin:$PATH' >> $HOME/.bashrc

$ source $HOME/.bashrc

For example (RedHat/CentOS):

$ echo 'export PATH=/usr/pgsql-9.2/bin:/usr/sbin:$PLATFORA_HOME/bin:$PATH' >> $HOME/.bashrc

$ source $HOME/.bashrc

10. Make sure the JAVA_HOME environment variable is set (if it's not, see Install Java).

$ echo $JAVA_HOME

Install PDF Dependencies (Master Only)

One feature of Platfora is the ability to save a vizboard as a PDF document. In order for the Platfora server to render PDFs, it needs PhantomJS and the OpenSans font to be installed on the Platfora master node. You can skip this task if you are provisioning a Platfora worker node.

The PhantomJS installation relies on several fonts that ship with the Platfora software. For this reason, the PhantomJS installation must be done after installing the Platfora software.

To install PhantomJS, do the following:

1. Log into the Platfora master node as the platfora user.

2. Install the PhantomJS dependencies.

On Ubuntu:

$ sudo apt-get install fontconfig

$ sudo apt-get install libfreetype6

$ sudo apt-get install libfontconfig1

$ sudo apt-get install libstdc++6

On RedHat/CentOS:

$ sudo yum install fontconfig

$ sudo yum install freetype

$ sudo yum install libfreetype.so.6

$ sudo yum install libfontconfig.so.1

3. Download the compiled PhantomJS executable.

$ sudo wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-1.9.7-linux-x86_64.tar.bz2

4. Extract the files.

$ sudo tar xjf phantomjs-1.9.7-linux-x86_64.tar.bz2

5. Copy the PhantomJS binary to an accessible bin directory.

You should choose a bin directory that is common to most user environments.

$ sudo cp phantomjs-1.9.7-linux-x86_64/bin/phantomjs /usr/local/bin

6. Verify the phantomjs command is accessible to the platfora user.

$ which phantomjs

/usr/local/bin/phantomjs

If the command is not found, add the bin directory to the platfora user's environment:

$ echo "export PATH=/usr/local/bin:/usr/sbin:$PATH" >> /home/platfora/.bashrc

$ source /home/platfora/.bashrc

7. Install the OpenSans font for use by the PDF feature.

a) Make a directory to contain the typeface.

$ sudo mkdir -p /usr/share/fonts/truetype

b) Copy the font to the truetype directory.

$ sudo cp -r $PLATFORA_HOME/server/webapps/proton/dist/fonts/OpenSans /usr/share/fonts/truetype

c) Refresh the font cache.

$ sudo fc-cache -f

After installing, you'll want to verify the installation is running correctly. One easy way to do this is using examples that came with the PhantomJS tarball:

$ phantomjs phantomjs-1.9.7-linux-x86_64/examples/hello.js

Hello, world!

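You can also output a PDF to verify the fonts were installed correctly. To output to PDF, choose Share in the vizboard. In the comparison figure, the left side shows the output when the fonts are installed. The right side was rendered without the proper fonts installed.

[Figure: PDF output rendered with the fonts installed (left) and without the proper fonts installed (right)]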

Chapter 5: Configure Environment on Platfora Nodes

This section describes how to configure a Platfora node's operating system and network environment. You should perform these tasks on every node in the Platfora cluster (master and workers) after you have installed the Platfora dependencies and software, but before you initialize Platfora (or initialize a new worker node).

Topics:

• Install the MapR Client Software (MapR Only)
• Configure Network Environment
• Configure Passwordless SSH
• Synchronize the System Clocks
• Create Local Storage Directories
• Verify Environment Variables

Install the MapR Client Software (MapR Only)

If you are using MapR as your Hadoop distribution, you must install the MapR client software on all Platfora nodes (master and workers). If you are not using MapR with Platfora, you can skip this task.

Platfora uses the MapR client to submit MapReduce jobs and file system commands directly to the MapR cluster. For more information about the MapR client, see the MapR documentation.

If you use MapR 4.1, Platfora requires that you install the MapR 4.0.2 client software.

You must have root or sudo permissions to install the MapR client.

Installing the MapR Client on Ubuntu

1. Add the following line to the /etc/apt/sources.list file:

deb http://package.mapr.com/releases/version/ubuntu/ mapr optional

2. Update the repository and install the MapR client:

$ sudo apt-get update

$ sudo apt-get install mapr-client

3. Configure the MapR client where clusterName is the name of your MapR cluster and cldbhost is the hostname and port of the MapR CLDB node:

$ sudo /opt/mapr/server/configure.sh -N clusterName -c -C cldbhost:7222

4. Check if the /opt/mapr/hostname file exists on the node.

$ sudo ls /opt/mapr

If the file doesn't exist, create it:

$ hostname -f | sudo tee /opt/mapr/hostname

5. Set the PLATFORA_HADOOP_LIB environment variable. For example (check the path for your version of the MapR client):

$ echo "export PLATFORA_HADOOP_LIB=/opt/mapr/hadoop/lib" >> $HOME/.bashrc

Installing the MapR Client on RedHat/CentOS

1. Create the file /etc/yum.repos.d/maprtech.repo with the following contents:

[maprtech]

name=MapR Technologies

baseurl=http://p