White Paper

PENTAHO DATA INTEGRATION WITH GREENPLUM LOADER
The interoperability between Pentaho Data Integration and
Greenplum Database with Greenplum Loader

Abstract
This white paper explains how Pentaho Data Integration (Kettle)
can be configured and used with Greenplum database by using
Greenplum Loader (GPLOAD). This boosts connectivity and
interoperability of Pentaho Data Integration with Greenplum
Database.
February 2012

Copyright 2012 EMC Corporation. All Rights Reserved.


EMC believes the information in this publication is accurate as of
its publication date. The information is subject to change
without notice.
The information in this publication is provided as is. EMC
Corporation makes no representations or warranties of any kind
with respect to the information in this publication, and
specifically disclaims implied warranties of merchantability or
fitness for a particular purpose.
Use, copying, and distribution of any EMC software described in
this publication requires an applicable software license.
For the most up-to-date listing of EMC product names, see EMC
Corporation Trademarks on EMC.com.
VMware is a registered trademark of VMware, Inc. All other
trademarks used herein are the property of their respective
owners.
Part Number h8309

Table of Contents
Executive summary
    Audience
Organization of this paper
Overview of Pentaho Data Integration
Overview of Greenplum Database
Integration of Pentaho PDI and Greenplum Database
Using JDBC drivers for Greenplum database connections
    Installation of new driver
Greenplum Loader: Greenplum's Scatter/Gather Streaming Technology
    Parallel Loading
    External Tables
    Greenplum Parallel File Distribution Server (gpfdist)
    How does gpfdist work?
    Using gpload to invoke gpfdist
    1) Single ETL Server, Multiple NICs
    2) Multiple ETL Servers
Usage: How to use Greenplum Loader in Pentaho Data Integration
    Setup
Future expansion and interoperability
Conclusion
References

Executive summary
Greenplum Database is a popular analytical database that works with open source
data integration products such as Pentaho Data Integration (PDI), also known as
Kettle. Pentaho Kettle is part of the Pentaho Business Intelligence suite. Greenplum
Database is capable of managing, storing and analyzing large amounts of data.
One of Pentaho's latest enhancements for expanded OLAP support is a native bulk
loader integration with EMC Greenplum that improves data loading performance.
Pentaho offers native adapter support for the Greenplum GPLoad capability (bulk
loader), which enables joint customers to use its data integration capabilities to
quickly capture, transform and load massive amounts of data into Greenplum
Database.
Currently, Pentaho Data Integration connects to Greenplum through JDBC (Java
Database Connectivity) drivers. Greenplum Database can be used on both the source
and target sides of Pentaho ETL transformations.

Audience
This white paper is intended for EMC field-facing employees such as sales staff,
technical consultants and support staff, as well as customers who will use the
Pentaho Data Integration tool for their ETL work. It is neither an installation guide
nor an introduction to Pentaho. It documents Pentaho's connectivity and operational
capabilities with Greenplum Loader, and shows readers how Pentaho PDI can be used
in conjunction with Greenplum Database to retrieve, transform and present data to
users. Readers are not expected to have extensive Pentaho knowledge, but a basic
understanding of Pentaho data integration concepts and ETL tools will make this
document easier to follow.

Organization of this paper

This paper covers the following topics:

- Executive summary
- Organization of this paper
- Overview of Pentaho Data Integration (PDI)
- Overview of Greenplum Database
- Integration of Pentaho PDI and Greenplum Database
- Using JDBC drivers for Greenplum database connections
- Greenplum Loader: Greenplum's Scatter/Gather Streaming Technology
- Usage: How to use Greenplum Loader in Pentaho Data Integration
- Future expansion and interoperability
- Conclusion

Overview of Pentaho Data Integration


Pentaho Data Integration (PDI) delivers comprehensive Extraction, Transformation
and Loading (ETL) capabilities using a metadata-driven approach. It is commonly
used for building data warehouses, designing business intelligence applications,
migrating data and integrating data models. It consists of the following components:

- Spoon: main GUI, graphical Jobs/Transformations designer
- Carte: HTTP server for remote execution of Jobs/Transformations
- Pan: command-line execution of Transformations
- Kitchen: command-line execution of Jobs
- Encr: command-line tool for encrypting strings for storage
- Enterprise Edition (EE) Data Integration Server: data integration engine, security
  integration with LDAP/Active Directory, monitor/scheduler, content management

Pentaho can load big data sets, on the order of terabytes or petabytes, into
Greenplum Database, taking full advantage of the massively parallel processing
environment provided by the Greenplum product family.

Overview of Greenplum Database


Greenplum Database is based on an MPP (Massively Parallel Processing) shared-nothing
architecture, which facilitates business intelligence, data integration and big data
analytics. Data is distributed and replicated across multiple nodes of this parallel
architecture. Greenplum's MPP architecture allows for greater scalability than
traditional databases and uses parallelism to improve query performance by orders of
magnitude. A shared-nothing architecture is optimal for fast queries and loads because
processors are placed as close as possible to the data itself, giving faster operations
with the maximum possible degree of parallelism.
Highlights of the Greenplum Database:

Dynamic Query Prioritization
- Provides continuous real-time balancing of resources across queries.

Self-Healing Fault Tolerance
- Provides intelligent fault detection and fast online differential recovery.

Polymorphic Data Storage-MultiStorage/SSD Support
- Includes tunable compression and support for both row- and column-oriented
  storage.

Analytics Support
- Supports analytical functions for advanced in-database analytics.

Health Monitoring and Alerting
- Provides integrated Greenplum Command Center for advanced support
  capabilities.

Integration of Pentaho PDI and Greenplum Database


The following diagram shows the basic interoperability between Pentaho Data
Integration and the Greenplum Database:

Using JDBC drivers for Greenplum database connections


Pentaho Kettle ships with many different JDBC drivers, each of which resides in a single
Java archive (.jar) file in the libext/JDBC directory. By default, Pentaho PDI ships with a
PostgreSQL JDBC .jar file, which is used to connect through Greenplum Loader
(gpload/gpfdist) when you define your database connection and choose Native (JDBC)
as the access method.

Java JDK 1.6 is required for the installation.

A startup script adds all these .jar files to the environment.

Installation of new driver


To add a new driver, simply drop or copy the .jar file containing the driver into the
libext/JDBC directory (or the equivalent driver directory of each component). For example:
- For Data Integration Server: <Pentaho_installed_directory>/server/data-integration-server/tomcat/lib/
- For Data Integration client: <Pentaho_installed_directory>/design-tools/data-integration/libext/JDBC/
- For BI Server: <Pentaho_installed_directory>/server/biserver-ee/tomcat/lib/
- For Enterprise Console: <Pentaho_installed_directory>/server/enterprise-console/jdbc/
If you install a new JDBC driver for Greenplum on the BI Server or DI Server, you must
restart all affected servers to load the newly installed database driver. In addition, if you
want to establish a Greenplum data source in the Pentaho Enterprise Console, you must
install that JDBC driver in both the Enterprise Console and the BI Server for it to take effect.
In brief, to update the driver for the Data Integration client, update the .jar file in
/data-integration/libext/JDBC/.
Assuming that a Greenplum Database (GPDB) is installed and ready to use, you can
define the Greenplum database connection in the Database Connection dialog. Give the
connection a name, choose Greenplum as the Connection Type, choose Native (JDBC) in
the Access field, and enter the Host Name, Database Name, Port Number, User Name and
Password in the Settings section.
Special attention may be required to set up the hosts files and configuration files on both
the Greenplum Database side and the hosts on which Pentaho is installed. For instance, on
the Greenplum Database side, the user may need to add the IP address of the Pentaho host
to pg_hba.conf. In addition, the user may need to add the hostnames and corresponding IP
addresses on both systems (i.e. the Pentaho PDI server and the Greenplum Database) to
ensure that both machines can communicate.
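
As a rough illustration of the two pieces of configuration described above, the following Python sketch formats a pg_hba.conf host entry for the Pentaho machine and the JDBC URL that the PostgreSQL driver expects. The IP address, database name, user name and the md5 authentication method are assumptions chosen for the example, not values taken from this paper; use whatever your site's security policy requires.

```python
import ipaddress


def pg_hba_entry(client_ip, database="all", user="all", method="md5"):
    """Format one 'host' line for pg_hba.conf that admits a single client address.

    The authentication method defaults to md5 purely for illustration.
    """
    ip = ipaddress.ip_address(client_ip)      # raises ValueError on a malformed address
    prefix = 32 if ip.version == 4 else 128   # single-host mask for IPv4 / IPv6
    return f"host    {database}    {user}    {ip}/{prefix}    {method}"


def jdbc_url(host, port, database):
    """Build the URL form used by the PostgreSQL JDBC driver."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


# Hypothetical Pentaho host address; mdw-1, ops and gpadmin mirror the
# sample control file shown later in this paper.
print(pg_hba_entry("192.0.2.15", database="ops", user="gpadmin"))
print(jdbc_url("mdw-1", 5432, "ops"))
```

The generated host line would be appended to pg_hba.conf on the Greenplum master, after which the database configuration has to be reloaded for the change to take effect.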

Greenplum Loader: Greenplum's Scatter/Gather Streaming Technology

Parallel Loading
Greenplum's Scatter/Gather Streaming (SGS) technology, typically referred to as gpfdist,
eliminates the bottlenecks associated with data loading, enabling ETL applications to stream
data into the Greenplum Database quickly. This technology is intended for loading the huge
data sets that are normally used in large-scale analytics and data warehousing. It manages
the flow of data into all nodes of the database.
Figure 1 shows how Greenplum uses a parallel-everywhere approach to loading. In this
approach, data flows from one or more source systems to every node of the database
without any sequential bottlenecks.

Figure 1
Greenplum's SGS technology ensures parallelism by scattering data from source systems
across hundreds or thousands of parallel streams that simultaneously flow to all nodes of the
Greenplum Database. Performance scales with the number of Greenplum Database nodes,
and the technology supports both large batch and continuous near-real-time loading
patterns with negligible impact on concurrent database operations.
Figure 2 shows how the final gathering and storage of data to disk takes place on all nodes
simultaneously, with data automatically partitioned across nodes and optionally
compressed. This technology is exposed through a flexible and programmable external table
interface (explained below) and a traditional command-line loading interface.

Figure 2

External Tables
External tables enable users to access data in external sources as if it were in a table in the
database. Greenplum Database has two types of external data sources, external tables and
Web tables, which have different access methods. External tables contain static data that
can be scanned multiple times; the data does not change during queries. Web tables provide
access to dynamic data sources as if those sources were regular database tables. Web tables
cannot be scanned multiple times, and the data can change during the course of a query.

Greenplum Parallel File Distribution Server (gpfdist)


gpfdist is Greenplum's parallel file distribution server utility. It is used with read-only
external tables for fast, parallel loading of text, CSV and XML files into a Greenplum
database. The benefit of using gpfdist is that users can take advantage of maximum
parallelism while reading from or writing to external tables, which offers the best
performance as well as easier administration of external tables.
gpfdist can be thought of as a network protocol, much like HTTP, and running gpfdist is
similar to running an HTTP server. It exposes a local directory containing the source files
over TCP/IP. The files are usually delimited files or CSV files, although gpfdist can also
read tar and gzipped files. To serve tar and gzip files, the PATH must contain the location
of the tar and gzip utilities.
To load data into a Greenplum database, you can generate the flat files from an
operational or transactional database using export, COPY, dump, or user-written
software, depending on the business requirements. This process can be automated to run
periodically.
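
To make the relationship between gpfdist and read-only external tables more concrete, the sketch below assembles a CREATE EXTERNAL TABLE statement whose LOCATION clause points at one or more gpfdist instances. The table name, column list, host names, port and file pattern are illustrative assumptions, and the helper performs no identifier quoting, so it should only be fed trusted values.

```python
def external_table_ddl(name, columns, hosts, port, pattern, delimiter="|"):
    """Return a CREATE EXTERNAL TABLE statement that reads text files via gpfdist.

    columns is a list of (column_name, data_type) pairs; hosts is a list of
    ETL host names on which a gpfdist process is assumed to be listening.
    """
    col_list = ",\n    ".join(f"{col} {dtype}" for col, dtype in columns)
    # One gpfdist:// URI per ETL host lets the segments pull from all of them.
    locations = ",\n    ".join(
        f"'gpfdist://{host}:{port}/{pattern}'" for host in hosts
    )
    return (
        f"CREATE EXTERNAL TABLE {name} (\n    {col_list}\n)\n"
        f"LOCATION (\n    {locations}\n)\n"
        f"FORMAT 'TEXT' (DELIMITER '{delimiter}');"
    )


# Hypothetical table and hosts, chosen only to show the shape of the statement.
print(external_table_ddl(
    "ext_expenses",
    [("name", "text"), ("amount", "float4")],
    ["etl1-1", "etl1-2"],
    8081,
    "*.txt",
))
```

Once such a table exists, an ordinary INSERT ... SELECT from it into a regular table performs the parallel load; gpload, described below, automates these steps.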

How does gpfdist work?


gpfdist runs in a client-server model. When you start the gpfdist process, you indicate the
directory into which you drop or copy your source files. Optionally, you may also designate
the TCP port number to be used.
A simple startup of the gpfdist server uses the following command syntax:
gpfdist -d <file_files_directory> -p <port_number> -l <log_file> &
For example:
# gpfdist -d /etl-data -p 8887 -l gpfdist_8887.log &
[1] 28519
# Serving HTTP on port 8887, directory /home/gpadmin/etl-log

In the above example, gpfdist is set up to run on the Greenplum DIA server, anticipating
data loading from flat files stored in the directory /etl-data. Port 8887 is opened and
listening for data requests, and log output is written to the file gpfdist_8887.log.
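
When gpfdist is started from a script rather than by hand, the same command line can be assembled programmatically. The following minimal sketch builds the argument list from the three options shown above (-d, -p and -l); the helper name and its validation rule are assumptions made for illustration.

```python
import shlex


def gpfdist_command(directory, port=None, log_file=None):
    """Build the argument list for starting gpfdist with the -d, -p and -l options."""
    argv = ["gpfdist", "-d", directory]
    if port is not None:
        if not 1 <= int(port) <= 65535:
            raise ValueError(f"invalid TCP port: {port}")
        argv += ["-p", str(port)]
    if log_file:
        argv += ["-l", log_file]
    return argv


argv = gpfdist_command("/etl-data", 8887, "gpfdist_8887.log")
# shlex.join renders the list as a shell-safe command line for logging.
print(shlex.join(argv))
```

The resulting list can be handed to subprocess.Popen(argv) to launch the server in the background, which avoids shell quoting problems when directory names contain spaces.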

Using gpload to invoke gpfdist


Pentaho leverages the parallel bulk loading capabilities of GPDB through the Greenplum data
loading utility, gpload. gpload acts as an interface to Greenplum Database's external table
parallel loading feature. The Greenplum EXTERNAL TABLE feature allows network data
sources to be defined as tables that can be queried, which speeds up the data loading
process. Using a load specification defined in a YAML-formatted control file, gpload executes
a load by invoking the Greenplum parallel file distribution server (gpfdist), creating an
external table definition based on the source data defined, and executing an INSERT, UPDATE
or MERGE operation to load the source data into the target table in the database.
The gpload program processes the control file document in order and uses indentation
(spaces) to determine the document hierarchy and the relationships of the sections to one
another. White space is therefore significant: it should not be used simply for formatting
purposes, and tabs should not be used at all.
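
Because tabs are not allowed in the control file, a quick pre-flight check can catch them before gpload does. The sketch below is one possible check, not part of gpload itself; the function name is hypothetical.

```python
def find_tab_lines(text):
    """Return the 1-based line numbers of every line that contains a tab character."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if "\t" in line
    ]


# Line 3 of this sample is indented with a tab instead of spaces.
sample = "GPLOAD:\n   INPUT:\n\t- SOURCE:\n"
print(find_tab_lines(sample))
```

An empty list means the file contains no tabs; it does not prove that the indentation is otherwise correct.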
The basic structure of a load control file:
---
VERSION: 1.0.0.1
DATABASE: db_name
USER: db_username
HOST: master_hostname
PORT: master_port
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - hostname_or_ip
         PORT: http_port
       | PORT_RANGE: [start_port_range, end_port_range]
         FILE:
           - /path/to/input_file
    - COLUMNS:
           - field_name: data_type
    - FORMAT: text | csv
    - DELIMITER: 'delimiter_character'
    - ESCAPE: 'escape_character' | 'OFF'
    - NULL_AS: 'null_string'
    - FORCE_NOT_NULL: true | false
    - QUOTE: 'csv_quote_character'
    - HEADER: true | false
    - ENCODING: database_encoding
    - ERROR_LIMIT: integer
    - ERROR_TABLE: schema.table_name
   OUTPUT:
    - TABLE: schema.table_name
    - MODE: insert | update | merge
    - MATCH_COLUMNS:
           - target_column_name
    - UPDATE_COLUMNS:
           - target_column_name
    - UPDATE_CONDITION: 'boolean_condition'
    - MAPPING:
           target_column_name: source_column_name | 'expression'
   PRELOAD:
    - TRUNCATE: true | false
    - REUSE_TABLES: true | false
   SQL:
    - BEFORE: "sql_command"
    - AFTER: "sql_command"

The example above shows the gpload control file syntax in YAML. The file is divided into
sections for easy reference; any horizontal lines used to separate the sections in print must
not be placed in an actual YAML file. For example, users can run a load job as defined in
my_load.yml using gpload:
gpload -f my_load.yml
It is recommended to confirm that gpload runs successfully on its own, to reduce the chance
of errors later. As a first step, run gpload at the system (command) prompt: copy a small
sample of a source file and a control (YAML) file to the host, and run gpload.py with that
sample load control file.
If the gpload.py script does not execute successfully, check the following settings:
- Check that the correct version is installed by consulting the gpload readme.
- Check that the environment variables PATH, GPHOME_LOADERS and PYTHONPATH are set
  correctly.
- Check that the path environment variables point to or include the correct paths.
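
The environment checks in the list above can be partly automated. The following sketch reports which of the three variables are unset or empty and which directories named in a path-style variable do not exist; the function names are hypothetical, and the check says nothing about whether the values point at the right gpload installation.

```python
import os

REQUIRED_VARIABLES = ("PATH", "GPHOME_LOADERS", "PYTHONPATH")


def unset_variables(environ=None, required=REQUIRED_VARIABLES):
    """Return the names of required variables that are missing or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


def missing_directories(environ=None, name="PATH"):
    """Return entries of a path-style variable that are not existing directories."""
    environ = os.environ if environ is None else environ
    entries = environ.get(name, "").split(os.pathsep)
    return [entry for entry in entries if entry and not os.path.isdir(entry)]


print("Unset variables:", unset_variables())
print("PATH entries that do not exist:", missing_directories())
```

Passing a plain dictionary as environ makes it possible to test a proposed configuration before exporting it in the shell.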
Example of the load control file - my_load.yml:
---
VERSION: 1.0.0.1
DATABASE: ops
USER: gpadmin
HOST: mdw-1
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - etl1-1
           - etl1-2
           - etl1-3
           - etl1-4
         PORT: 8081
         FILE:
           - /var/load/data/*
    - COLUMNS:
           - name: text
           - amount: float4
           - category: text
           - desc: text
           - date: date
    - FORMAT: text
    - DELIMITER: '|'
    - ERROR_LIMIT: 25
    - ERROR_TABLE: payables.err_expenses
   OUTPUT:
    - TABLE: payables.expenses
    - MODE: INSERT
   SQL:
    - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
    - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"

Note: A YAML file is not free-format; field names and most of the content must follow a
specific format.
With Pentaho, you do not need to write your own YAML file. Spoon provides pre-built
steps in the Bulk loading folder of the Design tab. The dedicated Greenplum step is called
Greenplum Load, and it generates the YAML file once all the necessary details are
provided.
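
To give a feel for what generating a control file from a handful of parameters involves, here is a small Python sketch that renders a minimal insert-mode control file. It is only an illustration of the idea and does not reproduce the output of the Greenplum Load step; the function name, its parameters and the reduced set of keys are assumptions.

```python
def render_control_file(database, user, host, port, local_hosts, gpfdist_port,
                        files, columns, delimiter, table, mode="insert"):
    """Render a minimal gpload control file using spaces only (no tabs)."""
    lines = [
        "---",
        "VERSION: 1.0.0.1",
        f"DATABASE: {database}",
        f"USER: {user}",
        f"HOST: {host}",
        f"PORT: {port}",
        "GPLOAD:",
        "   INPUT:",
        "    - SOURCE:",
        "         LOCAL_HOSTNAME:",
    ]
    lines += [f"           - {name}" for name in local_hosts]
    lines.append(f"         PORT: {gpfdist_port}")
    lines.append("         FILE:")
    lines += [f"           - {path}" for path in files]
    lines.append("    - COLUMNS:")
    lines += [f"           - {col}: {dtype}" for col, dtype in columns]
    lines.append("    - FORMAT: text")
    lines.append(f"    - DELIMITER: '{delimiter}'")
    lines.append("   OUTPUT:")
    lines.append(f"    - TABLE: {table}")
    lines.append(f"    - MODE: {mode}")
    return "\n".join(lines) + "\n"


# Hypothetical values for a two-column load; adjust to the real source and target.
print(render_control_file(
    "ops", "gpadmin", "mdw-1", 5432, ["etl1-1"], 8081,
    ["/var/load/data/*"], [("name", "text"), ("amount", "float4")],
    "|", "payables.expenses",
))
```

Writing the returned text to a file and passing it to gpload -f is then equivalent to maintaining the control file by hand.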
The Greenplum Load step wraps the Greenplum GPLoad data loading utility discussed
above, which performs massively parallel data loading using Greenplum's external table
parallel loading feature. In the example control file above, four ETL servers feed data
into Greenplum through GPLoad.
GPLoad can be deployed on a single Pentaho ETL server or on multiple Pentaho ETL
servers. The following diagrams show the typical deployment scenarios for parallel
loading into Greenplum Database:

1) Single ETL Server, Multiple NICs

2) Multiple ETL Servers

Usage: How to use Greenplum Loader in Pentaho Data Integration


Setup
The following steps set up a simple transformation to test the Greenplum Loader:
1. Create the Text File Input step by defining a source file (e.g. a CSV or other delimited
file). Choose the Text File Input component under the Design tab, inside the Input folder:

Double-click the Text File Input step and choose the correct delimited input file.

2. Click the next tab, Contents, to define how to parse the CSV file:

3. Go to the next tab, Fields, and click Get Fields to define all the fields:

A sample source file, lineitem.csv or lineitem.dat, should look like this:


1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|lineitem 1 comments
2|67310|7311|2|36|45983.16|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE BACK RETURN|MAIL|lineitem 2 comments
...
100|61336|8855|1|31|40217.23|0.09|0.04|A|F|1993-10-29|1993-12-19|1993-11-08|COLLECT COD|TRUCK|lineitem 100 comments
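
Before loading, it can be worth checking that every row of the source file has the expected number of pipe-delimited fields. The sketch below parses rows in the format shown above with Python's csv module and converts a few fields to typed values; the column names are taken from the lineitem table used in this walkthrough, while the function itself is a hypothetical helper.

```python
import csv
import io
from datetime import date
from decimal import Decimal

COLUMNS = [
    "l_orderkey", "l_partkey", "l_suppkey", "l_linenumber", "l_quantity",
    "l_extendedprice", "l_discount", "l_tax", "l_returnflag", "l_linestatus",
    "l_shipdate", "l_commitdate", "l_receiptdate", "l_shipinstruct",
    "l_shipmode", "l_comment",
]

SAMPLE = (
    "1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|"
    "1996-03-22|DELIVER IN PERSON|TRUCK|lineitem 1 comments\n"
)


def parse_lineitem(text):
    """Parse pipe-delimited rows, rejecting any row with the wrong field count."""
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter="|", quoting=csv.QUOTE_NONE)
    for lineno, fields in enumerate(reader, start=1):
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"line {lineno}: expected {len(COLUMNS)} fields, got {len(fields)}"
            )
        row = dict(zip(COLUMNS, fields))
        # Convert a few representative fields; the rest stay as strings.
        row["l_orderkey"] = int(row["l_orderkey"])
        row["l_extendedprice"] = Decimal(row["l_extendedprice"])
        row["l_shipdate"] = date.fromisoformat(row["l_shipdate"])
        rows.append(row)
    return rows


rows = parse_lineitem(SAMPLE)
print(rows[0]["l_orderkey"], rows[0]["l_shipdate"], rows[0]["l_shipmode"])
```

A row that fails this check would otherwise surface only as a rejected record during the load.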

4. Create a target table called lineitem with the following definition:


CREATE TABLE lineitem
(
l_orderkey integer,
l_partkey integer,
l_suppkey integer,
l_linenumber integer,
l_quantity numeric(15,2),
l_extendedprice numeric(15,2),
l_discount numeric(15,2),
l_tax numeric(15,2),
l_returnflag character(1),
l_linestatus character(1),
l_shipdate date,
l_commitdate date,
l_receiptdate date,
l_shipinstruct character(25),
l_shipmode character(10),
l_comment character varying(44)
)
WITH (
OIDS=FALSE
)
DISTRIBUTED BY (l_orderkey);
ALTER TABLE lineitem OWNER TO gpadmin;
Next, you will need to create the Greenplum Load Step:

Define the details of the Greenplum Load step as follows:
First, choose the correct connection and target table.
Then, click the Get fields button to generate all the target table fields:

After that, click the Edit Mapping button to define all the mappings from source fields to target fields:

Next, go to the GP Configuration tab to define the correct locations of GPLOAD, the control file
and the data file:

Once you have completed the definitions, click OK to save.


A sample job can be created by adding a hop between the Text File Input and Greenplum Load
steps.

When everything is defined and saved, you can execute the transformation/job by clicking the
green arrow in the top left corner.

Once the execution has finished, check the Logging and Step Metrics sections to see whether the
transformation executed successfully. You can also verify that the data was loaded through gpload
into the target Greenplum database table, lineitem.
The above transformation is just a sample; users can add other components to this
transformation or incorporate it into a more fully developed job for transforming the data.

Future expansion and interoperability


Both Greenplum and Pentaho are rapidly innovating and extending their capabilities to
satisfy the requirements of the big data industry. To meet the challenges of fast data
loading, the EMC Data Integration Accelerator (DIA) is purpose-built for batch loading
and micro-batch loading, and works with a growing number of data integration
applications such as Pentaho. Both companies are therefore working together to
expand their interoperability to meet constantly growing demands.

Conclusion
This white paper discussed how to use the Greenplum Load step (GPLOAD) to
enhance the loading capability and performance of Pentaho Data Integration. It
covered the preliminary interoperability between Pentaho PDI and Greenplum
Database for data integration and business intelligence projects, using
Greenplum's Scatter/Gather Streaming technology embedded in Greenplum Loader.

References
1) Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data
Integration (ISBN-10: 0470635177 / ISBN-13: 978-0470635179)
2) Getting Started with Pentaho Data Integration guide from www.pentaho.com
3) Greenplum Database 4.1 Load Tools for UNIX guide
4) Greenplum Database 4.1 Load Tools for Windows guide
5) Pentaho Community - Greenplum Load
