Create Local Hadoop Configuration Directory

3. In this directory, copy or recreate the Hadoop configuration files needed for your Hadoop distribution.

core-site.xml (HDFS / MapRFS)

Platfora uses the core-site.xml configuration file to connect to the distributed file system service for your Hadoop deployment. For example: HDFS for Cloudera and Hortonworks, MapRFS for MapR.

Apache/Cloudera/Hortonworks with MapReduce v1

Platfora requires the following minimum property where namenode_hostname is the DNS hostname of your Hadoop NameNode, and hdfs_port is the HDFS server port.

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration>

<property>

<name>fs.default.name</name>

<value>hdfs://namenode_hostname:hdfs_port</value> </property>

</configuration>

Apache/Cloudera/Hortonworks/Pivotal with YARN

Platfora requires the following minimum property where namenode_hostname is the DNS hostname of your Hadoop NameNode, and hdfs_port is the HDFS server port.

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration>

<property>

<name>fs.defaultFS</name>

<value>hdfs://namenode_hostname:hdfs_port</value> </property>

MapR with MapReduce v1

Platfora requires the following minimum properties, where cldbhost is the DNS hostname of the MapR CLDB node, and 7222 is the CLDB server port. If you are using file compression, you must also specify the compression libraries you are using.

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration>

<property>

<name>fs.default.name</name>

<value>maprfs://cldbhost:7222</value> </property> <property> <name>io.compression.codecs</name> <value>org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, org.apache.hadoop.io.compress.DeflateCodec, org.apache.hadoop.io.compress.SnappyCodec </value> </property> </configuration>

MapR with YARN

Platfora requires the following minimum properties, where cldbhost is the DNS hostname of the MapR CLDB node, and 7222 is the CLDB server port. If you are using file compression, you must also specify the compression libraries you are using.

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration>

<property>

<name>fs.defaultFS</name>

<value>maprfs://cldbhost:7222</value> </property> <property> <name>io.compression.codecs</name> <value>org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, org.apache.hadoop.io.compress.DeflateCodec, org.apache.hadoop.io.compress.SnappyCodec

</value> </property> </configuration>

hdfs-site.xml

Platfora uses the hdfs-site.xml configuration file to configure how Platfora data is stored in the remote Hadoop distributed file system (HDFS).

HDFS

This file should have at least the following content. If you want Hadoop replication enabled for Platfora lens data, increase the dfs.replication value from 1 to a higher number.

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <!-- required --> <property> <name>dfs.replication</name> <value>1</value> </property>

<!-- required for Cloudera 5.3 and later with HDFS Encryption enabled -->

<property>

<name>dfs.encryption.key.provider.uri</name>

<value>kms://http@hadoop_name_node:16000/kms</value> </property>

</configuration>

mapred-site.xml

Platfora uses the properties in its local mapred-site.xml file to connect to the Hadoop JobTracker service, and pass in client-side configuration options for Platfora-initiated MapReduce jobs.

Any Hadoop MapReduce runtime properties can be passed along by Platfora with a lens build job configuration. See MapReduce Tuning for Platfora for a description of the required and recommended properties that Platfora needs for lens building. Any properties marked as runtime can be set in the local Platfora mapred-site.xml file instead of on the Hadoop cluster.
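For example, a runtime property could be added to Platfora's local mapred-site.xml alongside the required properties shown in the sections that follow. The property and value below are purely illustrative, not a Platfora recommendation; refer to MapReduce Tuning for Platfora for the properties that actually apply to your cluster.

<!-- illustrative only: a runtime MapReduce property set in Platfora's local mapred-site.xml -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>16</value>
</property>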

Apache/Cloudera/Hortonworks/MapR with MapReduce v1

Platfora requires the following minimum properties in its local mapred-site.xml file for MapReduce v1 distributions.

If you are using the high-availability (HA) JobTracker feature in your Hadoop cluster, use the HA JobTracker properties in Platfora's mapred-site.xml file instead of just the mapred.job.tracker property.

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration>

<!-- required --> <property>

<name>mapred.job.tracker</name> <value>jobtracker_hostname:jt_port</value> </property>

<!-- should be at least 1024m, but may be more based on memory on your Hadoop nodes -->

<property> <name>mapred.child.java.opts</name> <value>-Xmx1024m</value> </property> <!-- required --> <property> <name>mapred.job.shuffle.input.buffer.percent</name> <value>0.30</value> </property> <!-- optional --> <property> <name>io.sort.record.percent</name> <value>0.15</value> </property> <!-- optional --> <property> <name>io.sort.factor</name> <value>100</value> </property> <!-- optional --> <property>

<name>io.sort.mb</name> <value>256</value>

</property> </configuration>
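For the HA JobTracker case mentioned above, the exact client-side property names vary by distribution and version, so consult your Hadoop vendor's HA documentation. As a hedged sketch only, a CDH4-style HA JobTracker client configuration might replace the single mapred.job.tracker host:port entry with properties along these lines (logicaljt, jt1, and jt2 are placeholder names assumed for illustration):

<!-- hedged sketch only: CDH4-style HA JobTracker client properties; logical name and IDs are placeholders -->
<property>
  <name>mapred.job.tracker</name>
  <value>logicaljt</value>
</property>
<property>
  <name>mapred.jobtrackers.logicaljt</name>
  <value>jt1,jt2</value>
</property>
<property>
  <name>mapred.jobtracker.rpc-address.logicaljt.jt1</name>
  <value>jobtracker1_hostname:jt_port</value>
</property>
<property>
  <name>mapred.jobtracker.rpc-address.logicaljt.jt2</name>
  <value>jobtracker2_hostname:jt_port</value>
</property>
<property>
  <name>mapred.client.failover.proxy.provider.logicaljt</name>
  <value>org.apache.hadoop.mapred.ConfiguredFailoverProxyProvider</value>
</property>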

Apache/Cloudera/Hortonworks/Pivotal with YARN

Platfora requires the following minimum properties in its local mapred-site.xml file for YARN distributions.

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration>

<property>

<name>mapreduce.jobhistory.address</name> <value>yarn_rm_hostname:port</value> </property>

<property>

<name>mapreduce.jobhistory.webapp.address</name> <value>yarn_rm_hostname:web_port</value>

</property> <property> <name>mapreduce.framework.name</name> <value>yarn</value> </property>

<!-- should be at least 1024m, but may be more based on memory on your Hadoop nodes -->

<property> <name>mapreduce.reduce.java.opts</name> <value>-Xmx1024k</value> </property> <property> <name>mapreduce.map.java.opts</name> <value>-Xmx1024k</value> </property> <property> <name>mapreduce.task.io.sort.factor</name> <value>100</value> </property> <property> <name>mapreduce.job.user.classpath.first</name>

</property>

<!-- Needed For Hortonworks 2.2 Only --> <property>

<name>hdp.version</name> <value>2.2.0.0-2041</value> </property>

<!-- Needed For Pivotal 3.0 Only --> <property> <name>stack.version</name> <value>3.0.0.0-249</value> </property> <property> <name>stack.name</name> <value>phd</value> </property> </configuration>

MapR with YARN

Platfora requires the following minimum properties in its local mapred-site.xml file for MapR distributions using YARN.

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration>

<property>

<name>mapr.host</name>

<value>yarn_rm_hostname</value> </property>

<property>

<name>mapreduce.jobhistory.address</name> <value>yarn_rm_hostname:port</value> </property>

<property>

<name>mapreduce.jobhistory.webapp.address</name> <value>yarn_rm_hostname:web_port</value>

</property> <property>

<name>mapreduce.framework.name</name> <value>yarn</value>

<property> <name>mapr.centrallog.dir</name> <value>${hadoop.tmp.dir}/logs</value> </property> </configuration> yarn-site.xml

Platfora uses the properties in its local yarn-site.xml file to connect to the Hadoop ResourceManager service, and pass in client-side configuration options for Platfora-initiated YARN jobs.

Any Hadoop YARN runtime properties can be passed along by Platfora with a lens build job configuration. See YARN Tuning for Platfora for a description of the required and recommended properties that Platfora needs for lens building. Any properties marked as runtime can be set in the local Platfora yarn-site.xml file instead of on the Hadoop cluster.

All Hadoop Distributions with YARN

Platfora requires the following minimum properties in its local yarn-site.xml file for Hadoop distributions using YARN.

<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>yarn_rm_hostname:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>yarn_rm_hostname:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>yarn_rm_hostname:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>yarn_rm_hostname:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>yarn_rm_hostname:8030</value>
  </property>
  <property>
    <name>mapreduce.job.hdfs-servers</name>
    <value>hdfs://yarn_rm_hostname:8020</value>
  </property>
  <!-- Adjust these properties based on available Hadoop memory resources -->
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>
</configuration>

hive-site.xml

Platfora uses a local hive-site.xml configuration file to connect to the Hive metastore service. You only need a local hive-site.xml file if you plan to use Hive as a data source for Platfora.

There are two ways to configure how clients connect to the Hive metastore service in your Hadoop environment. You can set up the HiveServer or HiveServer2 Thrift service, which allows various remote clients to connect to the Hive metastore indirectly. This is called a remote metastore client configuration, and is the configuration recommended by the Hadoop vendors. If you add a Hive data source through the Platfora web application, you can connect to the Hive Thrift service without needing a Platfora copy of the hive-site.xml file.

Optionally, you can connect directly to the Hive metastore database using a JDBC connection. This requires that you have the login credentials for the Hive metastore database. This is called a local metastore configuration because you are connecting directly to the metastore database rather than through a service. If you want to connect to the Hive metastore database directly using JDBC, then you must specify the connection information in a hive-site.xml.

Platfora can only connect to a single Hive instance via a remote or a local metastore configuration.

Remote Metastore (Thrift) Server Configuration

If you are using the Hive Thrift remote metastore, in addition to the URI, you may want to include the following performance properties:

<?xml version="1.0"?>

<configuration> <property>

<name>hive.metastore.uris</name>

<value>thrift://hostname:hiveserver_thrift_port</value> </property>

<property>

<name>hive.metastore.client.socket.timeout</name> <value>120</value>

<description>

Number of seconds to wait for the client to retieve all of the objects (tables and partitions) from Hive. For tables with thousands of partitions, you may need to increase. </description> </property> <property> <name>hive.metastore.batch.retrieve.max</name> <value>100</value> <description>

Maximum number of objects to get from metastore in one batch. A higher number means less round trips to the Hive metastore server,

but may also require more memory on the client side. </description>

</property> </configuration>

Local JDBC Configuration

Connecting Platfora directly to a local JDBC metastore requires additional configuration on the Platfora servers. Each Platfora server requires a hive-site.xml file with the correct connection information, as well as the appropriate JDBC driver installed. Here is an example hive-site.xml that connects to a MySQL local metastore:

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration>

<property>

<name>javax.jdo.option.ConnectionURL</name>

<value>jdbc:mysql://hive_hostname:metastore_db_port/metastore</ value>

</property> <property>

</property> <property>

<name>javax.jdo.option.ConnectionUserName</name> <value>hive_username</value>

</property> <property>

<name>javax.jdo.option.ConnectionPassword</name> <value>password</value>

</property> <property> <name>hive.metastore.client.socket.timeout</name> <value>120</value> </property> <property> <name>hive.metastore.batch.retrieve.max</name> <value>100</value> </property> </configuration>

The Platfora server would also need the MySQL JDBC driver installed in order to use this configuration. You can place the JDBC driver .jar files in $PLATFORA_DATA_DIR/extlib to install them (requires a restart of the Platfora server).
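For example, installing a MySQL Connector/J driver might look like the following (the jar file name and version are illustrative, and the restart step uses whatever procedure you normally follow for your Platfora deployment):

# copy the MySQL JDBC driver into Platfora's external library directory
# (jar name is illustrative -- use the file you actually downloaded)
cp mysql-connector-java-5.1.35-bin.jar $PLATFORA_DATA_DIR/extlib/
# then restart the Platfora server so it picks up the new driver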
