
White Paper

Abstract

This white paper explains how Pentaho Data Integration (Kettle) can be configured and used with Greenplum database by using Greenplum Loader (GPLOAD). This boosts connectivity and interoperability of Pentaho Data Integration with Greenplum Database.

February 2012

PENTAHO DATA INTEGRATION WITH GREENPLUM LOADER

The interoperability between Pentaho Data Integration and Greenplum Database with Greenplum Loader


Copyright © 2012 EMC Corporation. All Rights Reserved. EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

The information in this publication is provided “as is”. EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.

For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com.

VMware is a registered trademark of VMware, Inc. All other trademarks used herein are the property of their respective owners.


Table of Contents

Executive summary
Audience
Organization of this paper
Overview of Pentaho Data Integration
Overview of Greenplum Database
Integration of Pentaho PDI and Greenplum Database
Using JDBC drivers for Greenplum database connections
Installation of new driver
Greenplum Loader: Greenplum’s Scatter/Gather Streaming Technology
Parallel Loading
External Tables
Greenplum Parallel File Distribution Server (gpfdist)
How does gpfdist work?
Using gpload to invoke gpfdist
1) Single ETL Server, Multiple NICs
2) Multiple ETL Servers
Usage: How to use Greenplum Loader in Pentaho Data Integration
Setup
Future expansion and interoperability
Conclusion


Executive summary

Greenplum Database is a popular analytical database that works with different open source data integration products such as Pentaho Data Integration (PDI), a.k.a. Kettle. Pentaho Kettle is part of the Pentaho Business Intelligence suite. Greenplum Database is capable of managing, storing and analyzing large amounts of data. One of Pentaho's latest enhancements for expanded OLAP support is a native bulk-loader integration with EMC Greenplum to improve data loading performance. Pentaho offers a native adapter for the Greenplum GPLoad capability (bulk loader), which enables joint customers to leverage data integration capabilities to quickly capture, transform and load massive amounts of data into Greenplum Databases.

Currently, Pentaho Data Integration connects to Greenplum through JDBC (Java Database Connectivity) drivers. Greenplum Database can be used on both the source and target sides of Pentaho ETL transformations.

Audience

This white paper is intended for EMC field-facing employees such as sales, technical consultants and support, as well as customers who will be using the Pentaho Data Integration tool for their ETL work. This is neither an installation guide nor introductory material on Pentaho. It documents Pentaho's connectivity and operational capabilities with Greenplum Loader, and shows the reader how Pentaho PDI can be used in conjunction with the Greenplum database to retrieve, transform and present data to users. Though the reader is not expected to have extensive Pentaho knowledge, a basic understanding of Pentaho data integration concepts and ETL tools will help the reader understand this document better.


Organization of this paper

This paper covers the following topics:

• Executive summary
• Organization of this paper
• Overview of Pentaho Data Integration (PDI)
• Overview of Greenplum Database
• Integration of Pentaho PDI and Greenplum Database
• Using JDBC drivers for Greenplum database connections
• Greenplum Loader: Greenplum’s Scatter/Gather Streaming Technology
• Usage: How to use Greenplum Loader in Pentaho Data Integration
• Future expansion and interoperability


Overview of Pentaho Data Integration

Pentaho Data Integration (PDI) delivers comprehensive Extraction, Transformation and Loading (ETL) capabilities using a meta-data driven approach. It is commonly used in building data warehouses, designing business intelligence applications, migrating data and integrating data models. It consists of different components:

• Spoon – Main GUI, graphical Jobs/Transformation Designer
• Carte – HTTP server for remote execution of Jobs/Transformations
• Pan – Command line execution of Transformations
• Kitchen – Command line execution of Jobs
• Encr – Command line tool for encrypting strings for storage
• Enterprise Edition (EE) Data Integration Server – Data Integration Engine, Security integration with LDAP/Active Directory, Monitor/Scheduler, Content Management

Pentaho is capable of loading big data sets in terms of terabytes or petabytes into Greenplum Database, taking full advantage of the massively parallel processing environment provided by the Greenplum product family.
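As an illustration of the command-line tools listed above, a transformation designed in Spoon can be executed with Pan from a shell. This is a minimal sketch; the transformation file path is hypothetical:

cd <Pentaho_installed_directory>/design-tools/data-integration
./pan.sh -file=/home/pentaho/transformations/load_lineitem.ktr -level=Basic

Kitchen follows the same pattern for jobs (kitchen.sh with a -file option), which makes it straightforward to schedule PDI work from cron or another scheduler.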

Overview of Greenplum Database

Greenplum Database is designed on an MPP (Massively Parallel Processing) shared-nothing architecture, which facilitates Business Intelligence, data integration and big data analytics. Data is distributed and replicated across multiple nodes in the Greenplum Database's parallel architecture. Greenplum's MPP architecture allows for increased scalability versus traditional databases and leverages parallelism to deliver orders-of-magnitude improvements in query performance. The "shared-nothing" architecture is optimal for fast queries and loads because processors are placed as close as possible to the data itself, allowing faster operations with the maximum degree of parallelism possible. Highlights of the Greenplum Database:

• Dynamic Query Prioritization
• Self-Healing Fault Tolerance
  - Provides intelligent fault detection and fast online differential recovery.
• Polymorphic Data Storage – Multi-Storage/SSD Support
  - Includes tunable compression and support for both row- and column-oriented storage.
• Analytics Support
  - Supports analytical functions for advanced in-database analytics.
• Health Monitoring and Alerting
  - Provides an integrated Greenplum Command Center for advanced support capabilities.

Integration of Pentaho PDI and Greenplum Database

The following diagram shows the basic interoperability between Pentaho Data Integration and the Greenplum Database:


Using JDBC drivers for Greenplum database connections

Pentaho Kettle ships with many different JDBC drivers, each packaged as a Java archive (.jar) file in the libext/JDBC directory. By default, Pentaho PDI ships with a PostgreSQL JDBC jar file, which is used to connect to Greenplum (including when loading through gpload/gpfdist) when you define your database connection and choose Native (JDBC) as the access method.

Java JDK 1.6 is required for the installation.
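To confirm the Java requirement is met, you can check the installed JDK version from a shell. This is a minimal sketch; the exact version string depends on your JDK vendor:

java -version
# expect output reporting a 1.6.x JDK (or a later release supported by your PDI version)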


Installation of new driver

To add a new driver, simply drop/copy the .jar file containing the driver into the libext/JDBC directory. For example,

• For Data Integration Server: <Pentaho_installed_directory>/server/data-integration-server/tomcat/lib/
• For Data Integration client: <Pentaho_installed_directory>/design-tools/data-integration/libext/JDBC/
• For BI Server: <Pentaho_installed_directory>/server/biserver-ee/tomcat/lib/
• For Enterprise Console: <Pentaho_installed_directory>/server/enterprise-console/jdbc/

If you installed a new JDBC driver for Greenplum to the BI Server or DI Server, you have to restart all affected servers to load the newly installed database driver. In addition, if you want to establish a Greenplum data source in the Pentaho Enterprise Console, you must install that JDBC driver in both Enterprise Console and the BI Server to make it effective. In brief, to update the driver, the user would need to update the jar file in /data-integration/libext/JDBC/.

Assuming that there is a Greenplum Database (GPDB) installed and ready to use, you can define the Greenplum database connection in the Database Connection dialog. Give the connection a name, choose Greenplum as the Connection Type, choose “Native (JDBC)” in the Access field, and provide the Host Name, Database Name, Port Number, User Name and Password in the Settings section.
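Under the covers, such a connection resolves to a PostgreSQL-style JDBC URL, since Greenplum speaks the PostgreSQL wire protocol. A minimal sketch, with hypothetical host, port and database values:

jdbc:postgresql://gp-mdw.example.com:5432/ops

Driver class: org.postgresql.Driver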

Special attention may be required to set up the host files and configuration files in the Greenplum database, as well as on the hosts where Pentaho is installed. For instance, in the Greenplum database, the user may need to configure pg_hba.conf with the IP address of the Pentaho host. In addition, the user may need to add the hostnames and the corresponding IP addresses on both systems (i.e., the Pentaho PDI server and the Greenplum Database) to ensure both machines can communicate.
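For illustration, a pg_hba.conf entry allowing a hypothetical Pentaho host (192.168.10.25) to connect as gpadmin, and a matching /etc/hosts entry, might look like the following. The addresses, hostnames and authentication method are assumptions; adapt them to your environment:

# $MASTER_DATA_DIRECTORY/pg_hba.conf on the Greenplum master
host    all    gpadmin    192.168.10.25/32    md5

# /etc/hosts on both the Greenplum master and the Pentaho server
192.168.10.25    pentaho-etl1

# reload the configuration on the Greenplum master without restarting
gpstop -u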


Greenplum Loader: Greenplum’s Scatter/Gather Streaming Technology

Parallel Loading

Greenplum’s Scatter/Gather Streaming™ (SGS) technology, typically referred to as gpfdist, eliminates the bottlenecks associated with data loading, enabling ETL applications to stream data into the Greenplum database quickly. This technology is intended for loading huge data sets that are typical of large-scale analytics and data warehousing. It manages the flow of data into all nodes of the database.

Figure 1 shows how Greenplum utilizes a parallel everywhere approach to loading. In this approach, data flows from one or more source systems to every node of the database without any sequential bottlenecks.

Figure 1

Greenplum’s SGS technology ensures parallelism by scattering data from source systems across 100s or 1000s of parallel streams that simultaneously flow to all nodes of the Greenplum Database. Performance scales with the number of Greenplum Database nodes, and the technology supports both large batch and continuous near-real-time loading patterns with negligible impact on concurrent database operations.

Figure 2 shows how the final gathering and storage of data to disk takes place on all nodes simultaneously, with data automatically partitioned across nodes and optionally compressed. This technology is exposed via a flexible and programmable external table (explained below) interface and a traditional command-line loading interface.


Figure 2

External Tables

External tables enable users to access data in external sources as if it were in a table in the database. In the Greenplum database, there are two types of external data sources: external tables and Web tables. They have different access methods. External tables contain static data that can be scanned multiple times; the data does not change during queries. Web tables provide access to dynamic data sources as if those sources were regular database tables; they cannot be scanned multiple times, and the data can change during the course of a query.

Greenplum Parallel File Distribution Server (gpfdist)

gpfdist is Greenplum’s parallel file distribution server utility. It is used with read-only external tables for fast, parallel loading of text, CSV and XML files into a Greenplum database. The benefit of using gpfdist is that users can take advantage of maximum parallelism while reading from or writing to external tables, thereby offering the best performance as well as easier administration of external tables.

gpfdist can be thought of as a networking protocol, much like the HTTP protocol. Running gpfdist is similar to running an HTTP server: it exposes a local file directory containing the target files via TCP/IP. The files are usually delimited or CSV files, although gpfdist can also read tar and gzipped files; in that case, the PATH must contain the location of the tar and gzip utilities.
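As a hedged illustration of how gpfdist and external tables fit together, a readable external table pointing at a gpfdist instance might be defined as follows. The hostname, port, file pattern and column list are hypothetical:

CREATE EXTERNAL TABLE ext_expenses
( name text, amount float4, category text, description text, trans_date date )
LOCATION ('gpfdist://etl1-1:8081/*.dat')
FORMAT 'TEXT' (DELIMITER '|');

Once defined, the table can be queried or used as the source of an INSERT INTO ... SELECT, and Greenplum segments pull the file data from gpfdist in parallel.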

For data uploading into a Greenplum database, you can generate the flat files from an operational database or transactional database, using export, COPY, dump, or user-written software, depending on the business requirements. This process can be automated to run periodically.
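For example, a flat file could be produced from a PostgreSQL-compatible source database with a COPY statement such as the following sketch; the table name, path and delimiter are hypothetical:

COPY expenses TO '/etl-data/expenses.dat' WITH DELIMITER '|';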


How does gpfdist work?

gpfdist runs in a client-server model. To start the gpfdist process, you indicate the directory where the source files are dropped or copied. Optionally, you may also designate the TCP port number to be used.

A simple startup of the gpfdist server uses the following command syntax:

gpfdist -d <files_directory> -p <port_number> -l <log_file> &

For example:

# gpfdist -d /etl-data -p 8887 -l gpfdist_8887.log &
[1] 28519

In the above example, gpfdist is set up to run on the Greenplum DIA server, anticipating data loading from flat files stored in the file directory /etl-data. Port 8887 is opened and listening for data requests, and a log file is created in /home/gpadmin called etl-log.

Using gpload to invoke gpfdist

Pentaho leverages the parallel bulk loading capabilities of GPDB using the Greenplum data loading utility, “gpload”. “gpload” is a data loading utility that acts as an interface to Greenplum Database’s external table parallel loading feature. The Greenplum EXTERNAL TABLE feature allows us to define network data sources as tables that we can query to speed up the data loading process. Using a load specification defined in a YAML-formatted control file, “gpload” executes a load by invoking the Greenplum parallel file server (gpfdist) – Greenplum’s parallel file distribution program – creating an external table definition based on the source data defined, and executing an INSERT, UPDATE or MERGE operation to load the source data into the target table in the database.

The gpload program processes the control file document in order and uses indentation (spaces) to determine the document hierarchy and the relationships of the sections to one another. The use of white space is significant. White space should not be used simply for formatting purposes, and tabs should not be used at all.

The basic structure of a load control file:

---
VERSION: 1.0.0.1
DATABASE: db_name
USER: db_username
HOST: master_hostname
PORT: master_port
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - hostname_or_ip
         PORT: http_port
       | PORT_RANGE: [start_port_range, end_port_range]
         FILE:
           - /path/to/input_file
    - COLUMNS:
           - field_name: data_type
    - FORMAT: text | csv
    - DELIMITER: 'delimiter_character'
    - ESCAPE: 'escape_character' | 'OFF'
    - NULL_AS: 'null_string'
    - FORCE_NOT_NULL: true | false
    - QUOTE: 'csv_quote_character'
    - HEADER: true | false
    - ENCODING: database_encoding
    - ERROR_LIMIT: integer
    - ERROR_TABLE: schema.table_name
   OUTPUT:
    - TABLE: schema.table_name
    - MODE: insert | update | merge
    - MATCH_COLUMNS:
           - target_column_name
    - UPDATE_COLUMNS:
           - target_column_name
    - UPDATE_CONDITION: 'boolean_condition'
    - MAPPING:
           target_column_name: source_column_name | 'expression'
   PRELOAD:
    - TRUNCATE: true | false
    - REUSE_TABLES: true | false
   SQL:
    - BEFORE: "sql_command"
    - AFTER: "sql_command"

The above example shows the syntax of the GPLOAD control file in YAML format; it is divided into sections for easy reference. Users can then run a load job as defined in my_load.yml using gpload:

gpload -f my_load.yml

It is recommended to confirm that gpload runs successfully before proceeding, to reduce the chance of errors later. As a first step, you can run gpload at the system (command) prompt. By copying a small representative sample of a source file and a control (YAML) file, you can run gpload.py with a sample load control file.
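As a quick sanity check from the shell, you can confirm that the loader environment is set and that gpload responds before wiring it into Pentaho. This is a minimal sketch; the control file name is hypothetical, and flag availability may vary by loader release:

# confirm the loader environment variables are set
echo $GPHOME_LOADERS
echo $PYTHONPATH

# confirm gpload is on the PATH and responds
which gpload
gpload --version

# run a small test load with a sample control file
gpload -f sample_load.yml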

If the gpload.py script does not execute successfully, please confirm the following settings:

- Check that the correct version is installed by checking the gpload readme.
- Check that the environment variables PATH, GPHOME_LOADERS and PYTHONPATH are set correctly.
- Check that the pathname environment variables point to, or include, the correct paths.

Example of the load control file - my_load.yml:

---
VERSION: 1.0.0.1
DATABASE: ops
USER: gpadmin
HOST: mdw-1
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - etl1-1
           - etl1-2
           - etl1-3
           - etl1-4
         PORT: 8081
         FILE:
           - /var/load/data/*
    - COLUMNS:
           - name: text
           - amount: float4
           - category: text
           - desc: text
           - date: date
    - FORMAT: text
    - DELIMITER: '|'
    - ERROR_LIMIT: 25
    - ERROR_TABLE: payables.err_expenses
   OUTPUT:
    - TABLE: payables.expenses
    - MODE: INSERT
   SQL:
    - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
    - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"

Note: The YAML control file is not a free-format file; field names and most of the content must follow the prescribed structure and indentation.

With Pentaho, you do not need to write your own YAML file; there are pre-built steps inside the “Bulk loading” folder in the Design window of Spoon. The customized Greenplum step is called “Greenplum Load”, and it generates the YAML file when all the necessary details are provided.

The “Greenplum Load” step wraps the Greenplum GPLoad data loading utility we just discussed. The GPLoad data loading utility is used for massively parallel data loading using Greenplum's external table parallel loading feature. As you can see in the above example, four ETL servers are used for feeding data into Greenplum through GPLOAD. GPLoad can be implemented with either a single Pentaho ETL server or multiple Pentaho ETL servers. The following diagrams show the typical deployment scenarios for performing parallel loading into the Greenplum Database:

1) Single ETL Server, Multiple NICs

2) Multiple ETL Servers

Usage: How to use Greenplum Loader in Pentaho Data Integration

Setup

Here are the steps to set up a simple transformation to test out the Greenplum Loader:

1. Create the Text File Input step by defining a source file (e.g. a CSV or delimited file). Choose the ‘Text File Input’ component under the Design tab, inside the Input folder:


2. Click on the next tab, Content, to define how the CSV file should be parsed:

3. Go to the Fields tab and click on Get Fields to define all the fields:

A sample source file lineitem.csv/lineitem.dat should look like this:

1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|lineitem 1 comments

2|67310|7311|2|36|45983.16|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE BACK RETURN|MAIL|lineitem 2 comments

…….

100|61336|8855|1|31|40217.23|0.09|0.04|A|F|1993-10-29|1993-12-19|1993-11-08|COLLECT COD|TRUCK|lineitem 100 comments


4. You should create a target table called “lineitem” which contains:

CREATE TABLE lineitem
(
  l_orderkey integer,
  l_partkey integer,
  l_suppkey integer,
  l_linenumber integer,
  l_quantity numeric(15,2),
  l_extendedprice numeric(15,2),
  l_discount numeric(15,2),
  l_tax numeric(15,2),
  l_returnflag character(1),
  l_linestatus character(1),
  l_shipdate date,
  l_commitdate date,
  l_receiptdate date,
  l_shipinstruct character(25),
  l_shipmode character(10),
  l_comment character varying(44)
)
WITH (
  OIDS=FALSE
)
DISTRIBUTED BY (l_orderkey);

ALTER TABLE lineitem OWNER TO gpadmin;
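One way to create the table is to save the DDL to a file and run it with psql against the target database. This is a sketch; the database name and file path are hypothetical:

psql -d ops -f create_lineitem.sql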


The details of the Greenplum Load step need to be defined as follows. First, choose the correct connection and target table.

Then, click on the Get Fields button to generate all the target table fields:


Next, go to the GP Configuration tab to define the correct GPLOAD path, control file and data file locations:

Once you complete the definitions, please click OK to save.

A sample transformation can be created by adding a hop between the Text File Input and Greenplum Load steps.


When everything is defined and saved, you can execute the transformation/job by clicking the green arrow in the top left corner.

Once the execution is finished, you can check the Logging and Step Metrics sections to see whether the transformation executed successfully. You can also verify that the data loaded through gpload has arrived in the target Greenplum database table, lineitem.
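For example, a simple row-count query against the lineitem table created earlier is enough to confirm that rows arrived:

SELECT count(*) FROM lineitem;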

The above transformation is just a sample; users can add other components to this transformation or incorporate it into a well-developed job for transforming the data.

Future expansion and interoperability

Both Greenplum and Pentaho are rapidly innovating and extending their capabilities to satisfy the requirements of the big data industry. To meet the challenges of fast data loading, the EMC Data Integration Accelerator (DIA) is purpose-built for batch and micro-batch loading, and leverages a growing number of data integration applications such as Pentaho. Both companies are working together to expand their interoperability to meet constantly growing demands.


Conclusion

This white paper discussed how the Greenplum Loader step (GPLOAD) can be used to enhance the loading capability and performance of Pentaho Data Integration. It covered the preliminary interoperability between Pentaho PDI and the Greenplum database for data integration and business intelligence projects, using Greenplum’s Scatter/Gather Streaming technology embedded in Greenplum Loader.


References

1) Pentaho Kettle Solutions – Building Open Source ETL Solutions with Pentaho Data Integration (ISBN-10: 0470635177 / ISBN-13: 978-0470635179)

2) Getting Started with Pentaho Data Integration guide from www.pentaho.com

3) Greenplum Database 4.1 Load Tools for UNIX guide

4) Greenplum Database 4.1 Load Tools for Windows guide

5) Pentaho Community – Greenplum Load
