Scott Brokaw
slbrokaw@us.ibm.com
Srinivas Mudigonda
msrinivas@in.ibm.com
Please note IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without notice and
at IBM's sole discretion.
Information regarding potential future products is intended to outline our general product direction and it should
not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to
deliver any material, code or functionality. Information about potential future products may not be incorporated into
any contract.
The development, release, and timing of any future features or functionality described for our products remains at
our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled
environment. The actual throughput or performance that any user will experience will vary depending upon many
factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O
configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that
an individual user will achieve results similar to those stated here.
Overview
Hadoop (HDFS, Ambari, YARN)
BigIntegrate/BigQuality
Configuration for BigIntegrate/BigQuality
APT_YARN_CONFIG
APT_CONFIG_FILE
APT_YARN_FC_DEFAULT_KEYTAB_PATH
Binary Localization
Kerberos Overview
Configuration assistance
IBM JDK recommendation
Troubleshooting for BigIntegrate/BigQuality
Container Resource Requirements
Logs
Hadoop Connectivity
File Connector
Hive Connector
BigSQL
Kafka Connector
The Apache Hadoop project includes:
HDFS
YARN
Other Apache projects related to Hadoop
Ambari
Hive
Spark
etc.
HDFS
Helpful commands:
If YARN Log aggregation is enabled, the following command can be used to collect the container logs
belonging to an Application Master:
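Assuming the standard Hadoop CLI, this is typically `yarn logs -applicationId <id>`. A sketch with a placeholder application ID (find the real one via `yarn application -list` or the ResourceManager UI):

```shell
# Placeholder application ID; substitute the real one from the ResourceManager.
APP_ID="application_1510000000000_0001"
# Standard YARN CLI for fetching aggregated container logs; run this on a node
# with the Hadoop client installed and log aggregation enabled.
CMD="yarn logs -applicationId $APP_ID"
echo "$CMD"
```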
Jobs are submitted from an IIS Client to the IIS Engine Tier on a Hadoop Edge Node (/opt/IBM/InformationServer) (1)
The Conductor asks the IIS YARN Client for an Application Master (AM) to run the job (2)
The IIS YARN Client manages the IIS AM pool and starts new AMs when necessary (3)
The Conductor passes the IIS AM resource requirements and the commands to start Section Leaders (4)
The IIS AM gets containers from the YARN ResourceManager (not pictured)
YARN NodeManagers (NM) on the data nodes run the Section Leaders and their Players (Player 1 ... Player N) (6)
[Architecture diagram: IIS Engine Tier on a Hadoop Edge Node with the Conductor and IIS YARN Client; Hadoop cluster running the IIS Application Master, Section Leaders, and Players.]
/tmp on local disk for the Engine and all data nodes needs at least 5 GB free to localize binaries
Set APT_YARN_BINARIES_PATH if you wish to change the default /tmp location
If using HDFS or passwordless SSH for APT_YARN_BINARY_COPY_MODE, run the following on each
data node, matching the full path that was used for the Engine install:
mkdir -p IBM/InformationServer
mkdir -p IBM/InformationServer/Server/Projects/<InfoSphere_DataStage_project_name>
Users must have permissions to access (rwx) and create directories on all data nodes
under the above directory structure(s)
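A minimal sketch of the per-data-node setup. A scratch base path is used here for illustration; in production the base must match the real Engine install path (e.g. /opt/IBM/InformationServer), and the project name below is hypothetical:

```shell
# Scratch base path for illustration only; substitute the real Engine install path.
BASE=/tmp/IBM/InformationServer
PROJECT=my_dstage_project                    # hypothetical project name
mkdir -p "$BASE/Server/Projects/$PROJECT"
chmod 775 "$BASE/Server/Projects/$PROJECT"   # job users need rwx here
ls -d "$BASE/Server/Projects/$PROJECT"
```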
Users that will run jobs need to have valid permissions/access in Hadoop and have a
directory in HDFS:
e.g. /user/dsadm
Database clients need to be installed on each node in the cluster for all databases that you are
using as a source or target
If you do not want to run the jobs on all data nodes, you can use a node map
constraint instead of installing database clients on each node.
APT_YARN_CONFIG
Variable that provides a path for InfoSphere DataStage to read the yarnconfig.cfg file, which
specifies all the environment variables that you need to run Information Server on Hadoop.
Ensure that APT_YARN_CONFIG points to a yarnconfig.cfg file where APT_YARN_MODE
is set to the default value of true (or 1)
Otherwise, the jobs will not run on Hadoop, but will run in the standard manner, without
using the YARN ResourceManager
Best Practice:
Store yarnconfig.cfg in $DSHOME
Set APT_YARN_CONFIG to
$DSHOME/yarnconfig.cfg
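Following the best practice above, a sketch of the setup. A throwaway directory stands in for $DSHOME here, purely for illustration; in production, store yarnconfig.cfg in the real $DSHOME:

```shell
# Illustrative only: a scratch directory stands in for the real $DSHOME.
DEMO_DSHOME=/tmp/dshome_demo
mkdir -p "$DEMO_DSHOME"
# APT_YARN_MODE must be true (or 1), otherwise jobs run without YARN.
printf 'APT_YARN_MODE=true\n' > "$DEMO_DSHOME/yarnconfig.cfg"
export APT_YARN_CONFIG="$DEMO_DSHOME/yarnconfig.cfg"
grep '^APT_YARN_MODE' "$APT_YARN_CONFIG"
```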
APT_CONFIG_FILE
Configuration file changes will not trigger binary localization, such as:
.odbc.ini, ishdfs.config, isjdbc.config, etc.
Force Binary localization
cd $DSHOME/../..
echo '<!-- Forced PX Yarn Client Binary Localization on' "`date` -->" >> Version.xml
cd $APT_ORCHHOME/etc/yarn_conf
./stop-pxyarn.sh
./start-pxyarn.sh
Best practices to force binary localization to all data nodes
Run a generic job with a static configuration file that references all data nodes
Force binary localization and prevent startup delays during other production runs
Kerberos
Important commands:
kinit principal/hostname@realm.com -k -t /pathTo/keytab
klist -e
kdestroy
Troubleshooting/debugging
JVM Options (for IBM JDK)
-Dcom.ibm.security.jgss.debug=all -Dcom.ibm.security.krb5.Krb5Debug=all
Depending on encryption sizes, unrestricted JCE policy files may be needed
Kerberos cache or keytab files need to be available to Hadoop data nodes for processes
Set Environment Variables to automate the localization of these files to the YARN application cache:
File Connector default keytab option
APT_YARN_FC_DEFAULT_KEYTAB_PATH
JDBC Connector, same as JDBCDriverLogin.conf for the useKeytab option
APT_YARN_CONNECTOR_USER_KEYTAB_PATH
ODBC/JDBC Connector
APT_YARN_CONNECTOR_USER_CACHE_CRED_PATH
BDFS Stage, cache must be from IBM JDK
APT_YARN_USER_CACHED_CRED_PATH
APT_YARN_FC_DEFAULT_KEYTAB_PATH
When using the File Connector with the keytab option, the keytab mentioned in the stage
properties must exist on the respective data nodes where the processes are running
Set the environment variable to point to a default keytab that is available on your engine/edge
node
The default keytab is automatically sent to all data/compute nodes from the engine tier
Localization of the keytab is done via YARN localization, i.e. the keytab is localized to the
appcache for the YARN application (this can be seen in the NodeManager log)
File Connector will take care of using the dynamic appcache path
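A sketch of setting the default keytab variable; the path and user are hypothetical, and the File Connector then resolves the dynamic appcache path itself:

```shell
# Hypothetical keytab path on the engine/edge node; YARN localization ships
# this file into each YARN application's appcache on the data nodes.
export APT_YARN_FC_DEFAULT_KEYTAB_PATH=/home/dsadm/dsadm.keytab
echo "$APT_YARN_FC_DEFAULT_KEYTAB_PATH"
```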
Kerberos
There are many different implementations of Kerberos client libraries, some examples:
MIT Kerberos i.e. /usr/bin/klist
IBM JDK i.e. /IBM/InformationServer/jdk/jre/bin/klist
Another vendor's JDK, typically shipped with the Hadoop distribution
It is important to understand which client libraries the different components of Information Server on Hadoop
are built against. This table presents the components that have hard requirements for the Kerberos client
libraries.
Component                            Kerberos Client
PX Engine (DataSets in HDFS, etc.)   IBM JDK
File Connector                       IBM JDK
PX YARN Client                       MIT or Hadoop JDK
Kerberos Recommendation:
Use the IBM JDK's kinit to generate a ticket cache and then set the environment variable KRB5CCNAME to the
value of that ticket cache, i.e. ~/krb5cc_$user
This allows MIT's kinit and other Kerberos client libraries to work with the IBM JDK ticket cache.
Other vendors' Kerberos clients can be used to generate the ticket cache, but the IBM JDK must
be upgraded to at least 1.7 SR3 FP30 in order for the IBM JDK to work with the generated ticket
cache using the KRB5CCNAME environment variable
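A sketch of this recommendation with illustrative paths; the kinit invocation is commented out because it needs a real KDC, keytab, and principal:

```shell
# IBM JDK kinit shipped with Information Server (path is illustrative).
JDK_KINIT=/opt/IBM/InformationServer/jdk/jre/bin/kinit
CACHE="$HOME/krb5cc_$(id -un)"
# "$JDK_KINIT" -k -t /path/to/user.keytab user@EXAMPLE.COM  # run on a real system
# MIT tools and the IBM JDK can now share this cache:
export KRB5CCNAME="FILE:$CACHE"
echo "$KRB5CCNAME"
```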
Upgrade to latest IBM JDK
Refer to security bulletins; Fix Central will display the latest IBM JDK that is certified with Information
Server.
Connectivity to Kerberized Hadoop Clusters
If the cluster is Kerberized, you must use a Kerberos client (i.e. DataStage must have a valid
ticket) in order to connect
Knox
Avoid requiring a ticket
https://knox.apache.org/
Note: This will NOT work for JDBC connections to Hive using the DataDirect JDBC
driver, as it does not currently support using an HTTP port
This means the Hive Connector and the File Connector's Hive table create option cannot be
used on a Kerberized cluster unless the DataStage server can obtain a Kerberos ticket.
Troubleshooting BigIntegrate/BigQuality
gather_px_yarn_logs.sh
An automated script that runs upon PX job failure if the following variables are set:
APT_YARN_GATHER_LOGS_NM_PATH
APT_YARN_GATHER_LOGS_RM_PATH
YARN log aggregation collection is automated if YARN log aggregation is enabled (recommended)
Collection of the following additional files is attempted:
Version.xml
yarnconfig.cfg
NodeManager logs (if passwordless SSH is enabled between the Edge node and the Hadoop data nodes for
the runtime user)
Log Files:
Shows errors/logging for YARN client startup
/tmp/yarn_client.[USERID].out
Shows binary localization details [From Engine to HDFS]
/tmp/copy-orchdist.log
PX Yarn client logging (after it is started)
/IBM/InformationServer/Server/PXEngine/logs/yarn_logs
Troubleshooting BigIntegrate/BigQuality
File Connector
Allows reading/writing files directly in HDFS
Can read or write in parallel
Can be used to write large volumes of data into a Hive table. This can be accomplished
by writing the data into HDFS files and then creating a Hive table association using the
Create Hive table option in the File Connector
Supports multiple data formats
Supports connecting through WebHDFS, HttpFS or Native HDFS
Note:
(1) Native HDFS API support has been added via a patch
(2) The Native HDFS API would be useful when running with BigIntegrate
File Connector (Configuration)
Hive Table Create Option
Uses DataDirect JDBC driver for Hive
Creates an external table to the HDFS file that was loaded as part of the File Connector
JDBC driver requires additional configuration, isjdbc.config (in $DSHOME)
CLASSPATH=/opt/IBM/InformationServer/ASBNode/lib/java/IShive.jar
CLASS_NAMES=com.ibm.isf.jdbc.hive.HiveDriver;
JDBCDriverLogin.conf should be in ASBNode/lib/java when Kerberos is
being used.
File Connector (Configuration)
Sample JAAS Configuration file
JDBCDriverLogin.conf
JDBC_DRIVER_01{
com.ibm.security.auth.module.Krb5LoginModule required
credsType=initiator
principal="slbrokaw/kvm313-rh6.swg.usma.ibm.com@IBM.COM"
useCcache="FILE:/home/slbrokaw/krb5cc_slbrokaw";
};
These entries are specific to the IBM JDK; only the IBM JDK is supported
Either a cache or a keytab can be used
To use a keytab, replace useCcache with:
useKeytab="FILE:/home/slbrokaw/slbrokaw.keytab"
Multiple entries can be specified in JDBCDriverLogin.conf
Specific stanza can be set in the JDBC URL using the configuration property
loginConfigName=JDBC_DRIVER_01
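Putting the pieces together, a hedged sketch of a JDBC URL selecting the stanza above; the hostname and port are placeholders, and the property names follow DataDirect driver conventions:

```
jdbc:ibm:hive://hive-server.example.com:10000;AuthenticationMethod=kerberos;loginConfigName=JDBC_DRIVER_01
```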
Hive Connector
Writes to Hive are possible. (with the ODBC / JDBC Connector, there were certain
limitations with the write functionality)
Can be used for transactional writes into Hive, while File Connector serves the purpose of
initial loading of data into Hive
Partitioned Reads and Partitioned Writes are possible
Supports creating tables with different file formats. In this case, the statements
are generated, whereas with the ODBC / JDBC Connector the user has to provide the
statement
The initial implementation doesn't support cache localization, i.e. it won't localize the credential
cache with BigIntegrate/BigQuality
Every data node must have the credential cache or keytab
Easy to configure and the URL and configuration are similar to JDBC Connector
BigSQL
DB2 Connector
Uses DB2 client to connect to BigSQL
Limitations around special BigSQL syntax such as CREATE HADOOP TABLE
Reads
Normal SQL
Fast
Writes
Use a staging table approach: load into a temporary DB2 table,
then run an After SQL statement to Insert/Select
File Connector/Hive table create option
Use external Hive table through File Connector as staging table
Run After-Job subroutine
Call BigSQL to Insert/Select into internal Hive table (BigSQL)
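The staging pattern above can be sketched as follows; the table names are hypothetical, and the external staging table is assumed to have been created by the File Connector's Hive table create option:

```sql
-- Hypothetical names: move rows from the external staging table (loaded via the
-- File Connector) into an internal BigSQL/Hive table, run as an after-job step.
INSERT INTO myschema.sales
SELECT * FROM myschema.sales_staging_ext;
```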
Kafka Connector
Kafka is a distributed publish-subscribe messaging system that is designed to be fast,
scalable, and durable. It maintains feeds of messages in categories called topics.
Producers publish (write) messages to topics and Consumers subscribe to (read from) topics.
Kafka being a distributed system, topics can be partitioned and replicated across multiple
nodes.
The connector waits for the specified timeout before ending the job, which is useful when
data is coming from streaming sources.
Kafka Connector (Read mode)
Kafka Connector can read messages from topics into the ETL job flow
The first time a consumer group reads a topic, it retrieves messages based on the
reset policy: earliest or latest
For subsequent reads, it stores the offset from the earlier read and retrieves messages
after that offset.
Kafka Connector (Read mode)
Kerberos Security in Kafka Connector
The Kafka Connector implements the Kerberos authentication mechanism.
It also implements the pluggable authorizer from Kafka, which restricts users to the
data they have permissions on
Troubleshooting Connectors
CC_MSG_LEVEL
Additional debug logs can be generated by setting the value of this environment
variable to 1
Supported by all the connectors to capture additional debug information, which
helps diagnose issues
CC_JVM_OPTIONS
Used to set additional JVM options. This can be used to set the debug options while
using SSL or Kerberos authentication
When SSL is used, setting the following value provides debug information
CC_JVM_OPTIONS=-Djavax.net.debug=all
When Kerberos is used, setting the following value provides debug information
CC_JVM_OPTIONS=-Dcom.ibm.security.jgss.debug=all -Dcom.ibm.security.krb5.Krb5Debug=all
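The variables above can be combined in the job or session environment; a sketch enabling connector debug logging together with Kerberos JVM debugging:

```shell
# Enable connector-level debug logging.
export CC_MSG_LEVEL=1
# Enable Kerberos debug output in the connector JVM (IBM JDK flags).
export CC_JVM_OPTIONS="-Dcom.ibm.security.jgss.debug=all -Dcom.ibm.security.krb5.Krb5Debug=all"
echo "$CC_MSG_LEVEL"
```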
Notices and disclaimers
Copyright 2016 by International Business Machines Corporation (IBM). No part of this document may be reproduced or
transmitted in any form without written permission from IBM.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with
IBM.
Information in these presentations (including information relating to products that have not yet been announced by IBM) has been
reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall
have no responsibility to update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY,
EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS
INFORMATION, INCLUDING BUT NOT LIMITED TO, LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS
OF OPPORTUNITY. IBM products and services are warranted according to the terms and conditions of the agreements under
which they are provided.
IBM products are manufactured from new parts or new and used parts. In some cases, a product may not be new and may have
been previously installed. Regardless, our warranty terms apply.
Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without
notice.
Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are
presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual
performance, cost, savings or other results in other operating environments may vary.
References in this document to IBM products, programs, or services does not imply that IBM intends to make such products,
programs or services available in all countries in which IBM operates or does business.
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily
reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor
shall constitute legal or other guidance or advice to any individual participant or their specific situation.
It is the customer's responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal
counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer's
business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent
or warrant that its services or products will ensure that the customer is in compliance with any law.
Notices and disclaimers continued
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other
publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of
performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be
addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such
third-party products to interoperate with IBM's products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED,
INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents,
copyrights, trademarks or other intellectual property right.
IBM, the IBM logo, ibm.com, Aspera, Bluemix, Blueworks Live, CICS, Clearcase, Cognos, DOORS, Emptoris, Enterprise Document
Management System, FASP, FileNet, Global Business Services , Global Technology Services , IBM ExperienceOne, IBM
SmartCloud, IBM Social Business, Information on Demand, ILOG, Maximo, MQIntegrator, MQSeries, Netcool, OMEGAMON,
OpenPower, PureAnalytics, PureApplication, pureCluster, PureCoverage, PureData, PureExperience, PureFlex, pureQuery,
pureScale, PureSystems, QRadar, Rational, Rhapsody, Smarter Commerce, SoDA, SPSS, Sterling Commerce, StoredIQ,
Tealeaf, Tivoli, Trusteer, Unica, urban{code}, Watson, WebSphere, Worklight, X-Force and System z Z/OS, are trademarks of
International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at:
www.ibm.com/legal/copytrade.shtml.
Thank You