
DIG-1318: IBM Information Server on Hadoop

Technical Dive into Setup,


Configuring, Tuning and Troubleshooting

Scott Brokaw
slbrokaw@us.ibm.com
Srinivas Mudigonda
msrinivas@in.ibm.com
Please note IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without notice and
at IBM's sole discretion.

Information regarding potential future products is intended to outline our general product direction and it should
not be relied on in making a purchasing decision.

The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to
deliver any material, code or functionality. Information about potential future products may not be incorporated into
any contract.

The development, release, and timing of any future features or functionality described for our products remains at
our sole discretion.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled
environment. The actual throughput or performance that any user will experience will vary depending upon many
factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O
configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that
an individual user will achieve results similar to those stated here.



Agenda

Overview
Hadoop (HDFS, Ambari, YARN)
BigIntegrate/BigQuality
Configuration for BigIntegrate/BigQuality
APT_YARN_CONFIG
APT_CONFIG_FILE
APT_YARN_FC_DEFAULT_KEYTAB_PATH
Binary Localization
Kerberos Overview
Configuration assistance
IBM JDK recommendation
Troubleshooting for BigIntegrate/BigQuality
Container Resource Requirements
Logs
Hadoop Connectivity
File Connector
Hive Connector
BigSQL
Kafka Connector
Hadoop
The Apache Hadoop project includes:
HDFS
YARN
Other Apache projects related to Hadoop:
Ambari
Hive
Spark
etc.
HDFS
Helpful commands:

List out files in /tmp on HDFS:


hdfs dfs -ls /tmp
List out files in /user/dsadm on HDFS:
hdfs dfs -ls /user/dsadm
Be cautious that you understand which file system files are being written to (see the example after this list), i.e.:
/tmp exists on the local file system and a separate /tmp exists in the HDFS file system
Change owner on HDFS:
hdfs dfs -chown -R user:group /pathTo/HDFS/directory
Delete files/directories on HDFS:
hdfs dfs -rm -r -f /pathTo/HDFS/directory
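A quick way to see the distinction between the two locations (output will differ on your system):
List the /tmp directory on the local file system of the engine node:
ls -ld /tmp
List the separate /tmp directory that lives in HDFS:
hdfs dfs -ls /tmp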
YARN
Terminology:
Client
A user developed tool that submits an application to run on a YARN cluster.
ResourceManager (RM)
The master process of a YARN cluster. Handles scheduling and allocating
resources to applications on the cluster
NodeManager (NM)
The worker processes on each node. Handles launching and monitoring
container processes
ApplicationMaster (AM)
User developed application launched in the cluster to manage the life cycle of
the application in the cluster. Can request additional containers within the cluster
to run the user job.
Container
A logical set of resources a process tree is given to run within the YARN cluster.
YARN
Important YARN configuration parameters:
yarn.nodemanager.pmem-check-enabled
Enforce physical memory limit on containers
yarn.nodemanager.vmem-check-enabled
Enforce virtual memory limit on containers
yarn.scheduler.minimum-allocation-mb
Minimum size for containers
yarn.scheduler.maximum-allocation-mb
Maximum size for containers

Other parameters can be found on Apache's site; a sample yarn-site.xml snippet follows
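For illustration, these parameters are set in yarn-site.xml; the values below are placeholders only, not tuning recommendations:
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>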

Important YARN commands:


yarn application -list
yarn node -list

If YARN Log aggregation is enabled, the following command can be used to collect the container logs
belonging to an Application Master:

yarn logs -applicationId [ApplicationID]


Where [ApplicationID] is the Application Master for the job run.
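For example, using an application ID taken from yarn application -list (the ID below is the one that appears later in this deck; the output file name is arbitrary):
yarn logs -applicationId application_1460725533569_0006 > px_container_logs.txt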
Overview

The IS for Hadoop engine is Information Server on Hadoop, also known as BigIntegrate/BigQuality


BigIntegrate/BigQuality

[Architecture diagram: the IIS Client Tier, IIS Service Tier, and IIS Metadata Repository Tier connect to the IIS Engine Tier on a Hadoop edge node (/opt/IBM/InformationServer) running the Conductor and the IIS YARN Client; each DataNode in the Hadoop cluster has a localized /opt/IBM/InformationServer and runs the IIS Application Master, Section Leaders, and Players inside YARN containers.]

Jobs are submitted from an IIS Client (1)
Conductor asks the IIS YARN Client for an Application Master (AM) to run the job (2)
IIS YARN Client manages the IIS AM pool, starting new ones when necessary (3)
Conductor passes the IIS AM resource requirements and commands to start Section Leaders (4)
IIS AM gets containers from the YARN ResourceManager (not pictured)
YARN NodeManagers (NM) on the DataNodes start YARN containers with Section Leaders (5)
Section Leaders connect back to the Conductor and start Players (6)
Configuration: edge node vs. data node

Hadoop edge node


Engine tier or all tiers installed on a Hadoop edge node within the cluster
A node within the Hadoop cluster that does not store any HDFS data, but has the Hadoop
client software installed and configured
Provides the best performance and is the most common and preferred topology

Hadoop data node


Engine tier or all tiers installed on a Hadoop data node within the cluster.
This option is typically used for smaller clusters or single machine deployments.
Set APT_YARN_EDGE_NODE_INSTALL=false
Configuration

/tmp on the local disk of the Engine tier and all data nodes needs at least 5 GB free to localize binaries
Set APT_YARN_BINARIES_PATH if you wish to change the default /tmp location
If using HDFS or passwordless SSH for APT_YARN_BINARY_COPY_MODE, run the following on each
data node, matching the full path that was used for the Engine install (see the fuller sketch at the end of this slide):
mkdir -p IBM/InformationServer
mkdir -p IBM/InformationServer/Server/Projects/<InfoSphere_DataStage_project_name>
Users must have permissions to access (rwx) and create directories on all data nodes
under the above directory structure(s)
Users that will run jobs need to have valid permissions/access in Hadoop and have a
directory in HDFS:
i.e. /user/dsadm
Database clients need to be installed on each node in the cluster for all databases that you are
using as a source or target
If you do not want to run the jobs on all data nodes, you can use a node map
constraint instead of installing database clients on each node.
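A minimal preparation sketch for one data node, assuming the default /opt/IBM/InformationServer install path and the common dsadm user and dstage group (all placeholders for your own values):
# Match the Engine tier install path on the data node
mkdir -p /opt/IBM/InformationServer/Server/Projects/<InfoSphere_DataStage_project_name>
chown -R dsadm:dstage /opt/IBM/InformationServer
# Give the job run user a home directory in HDFS
hdfs dfs -mkdir -p /user/dsadm
hdfs dfs -chown dsadm /user/dsadm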
APT_YARN_CONFIG

APT_YARN_CONFIG
Variable that provides a path for InfoSphere DataStage to read the yarnconfig.cfg file, which
specifies all the environment variables that you need to run Information Server on Hadoop.
Ensure that APT_YARN_CONFIG points to a yarnconfig.cfg file where APT_YARN_MODE
is set to the default value of true or 1
Otherwise, the jobs will not run on Hadoop, but will run in the standard manner, without
using the YARN resource manager
Best Practice:
Store yarnconfig.cfg in $DSHOME
Set APT_YARN_CONFIG to
$DSHOME/yarnconfig.cfg
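A minimal sketch of that best practice (a real yarnconfig.cfg contains many more entries; only the variables named on this slide are shown):
# Typically set in dsenv or as a project-level environment variable
APT_YARN_CONFIG=$DSHOME/yarnconfig.cfg
# Inside $DSHOME/yarnconfig.cfg
APT_YARN_MODE=true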
APT_CONFIG_FILE

New format options available


Static and Dynamic can be combined

Static, 30 node configuration file (node1 shown; node2 through node30 follow the same pattern):

{
  node "node1"
  {
    fastname "mymachine.domain.com"
    pools ""
    resource disk "/mydisk/tmp" {pools ""}
    resource scratchdisk "/myscratch/tmp" {pools ""}
  }
  ...
}

Dynamic, 30 node configuration file (one conductor node on the edge node plus one "$host" node expanded to 30 instances at run time):

{
  node "node0"
  {
    fastname "engine-tier.domain.com"
    pools "conductor"
    resource disk "/mydisk/tmp" {pools ""}
    resource scratchdisk "/myscratch/tmp" {pools ""}
  }
  node "node1"
  {
    fastname "$host"
    instances 30
    pools ""
    resource disk "/mydisk/tmp" {pools ""}
    resource scratchdisk "/myscratch/tmp" {pools ""}
  }
}
Binary Localization

When the PX Yarn Client is started:


A check is performed:
If the Version.xml checksum matches the copy in HDFS, no action is taken
Otherwise, binaries are loaded into HDFS for localization onto the data nodes
Check /tmp/copy-orchdist.log on Engine tier for progress
At job runtime:
Data nodes selected for job run will be examined to see if binaries need to be pulled down
from HDFS and localized
Localization will take some time to complete during startup
Check the status in the stdout of the container, for example:
/hadoop/yarn/log/applicationID/container_ApplicationID_01_containerID
Binary Localization

Configuration file changes will not trigger binary localization, such as:
.odbc.ini, ishdfs.config, isjdbc.config, etc.
To force binary localization:
cd $DSHOME/../..
echo '<!-- Forced PX Yarn Client Binary Localization on' "`date` -->" >> Version.xml
cd $APT_ORCHHOME/etc/yarn_conf
./stop-pxyarn.sh
./start-pxyarn.sh
Best practices to force binary localization to all data nodes
Run a generic job with a static configuration file that references all data nodes
Force binary localization and prevent startup delays during other production runs
Kerberos

Network authentication protocol


Provides strong authentication for client/server applications using secret-key cryptography
Strict time requirements for all servers
Tickets have a time window before expiration (Default 24 hours)
Operating System clocks must be kept in sync
Terms
Principal
A Kerberos principal is a service or user. Principal names consist of three parts: a
service or user name, an instance name, and a realm name in the following form:
principal-name/instance-name@realm-name
Realm
A Kerberos realm is a set of managed nodes that share the same Kerberos database
Keytab
Stores long-term keys for one or more principals.
Kerberos

Important commands:
kinit principal/hostname@realm.com -k -t /pathTo/keytab
klist -e
kdestroy
Troubleshooting/debugging
JVM Options (for IBM JDK)
-Dcom.ibm.security.jgss.debug=all -Dcom.ibm.security.krb5.Krb5Debug=all
Depending on encryption key sizes, unrestricted JCE policy files may be needed
Kerberos cache or keytab files need to be available on the Hadoop data nodes for the processes running there
Set environment variables to automate the localization of these files to the YARN application cache:
File Connector default keytab option
APT_YARN_FC_DEFAULT_KEYTAB_PATH
JDBC Connector, same as JDBCDriverLogin.conf for the useKeytab option
APT_YARN_CONNECTOR_USER_KEYTAB_PATH
ODBC/JDBC Connector
APT_YARN_CONNECTOR_USER_CACHE_CRED_PATH
BDFS Stage, cache must be from IBM JDK
APT_YARN_USER_CACHED_CRED_PATH
APT_YARN_FC_DEFAULT_KEYTAB_PATH

When using the File Connector with the keytab option, the keytab specified in the stage
properties must exist on the respective data nodes where the processes are running
Set this environment variable to point to a default keytab that is available on your engine/edge
node
The default keytab is automatically sent to all data/compute nodes from the engine tier
Localization of the keytab is done via YARN localization, i.e. the keytab will be localized to
appcache for the YARN application e.g: from the NodeManager log:

INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource
hdfs://ipsvm00495.swg.usma.ibm.com:8020/user/dsadm/PXApplication/application_1460725533569_0006/defaultkeytabfile
(->/hadoop/yarn/local/usercache/dsadm/appcache/application_1460725533569_0006/filecache/10/defaultkeytabfile)
transitioned from DOWNLOADING to LOCALIZED

File Connector will take care of using the dynamic appcache path
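A sketch of the setting; the keytab path is a placeholder and the file must exist on the engine/edge node:
APT_YARN_FC_DEFAULT_KEYTAB_PATH=/home/dsadm/dsadm.keytab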
Kerberos
There are many different implementations of Kerberos client libraries, some examples:
MIT Kerberos, i.e. /usr/bin/klist
IBM JDK, i.e. /IBM/InformationServer/jdk/jre/bin/klist
Another vendor's JDK, typically shipped with the Hadoop distribution
It is important to understand which client libraries the different components of Information Server on Hadoop
are built against. This table presents the components that have hard requirements for the Kerberos client
libraries.
Component: Kerberos client
PX Engine (DataSets in HDFS, etc.): IBM JDK
File Connector: IBM JDK
PX YARN Client: MIT or Hadoop JDK
Kerberos Recommendation:
Use the IBM JDK's kinit to generate a ticket cache and then set the environment variable KRB5CCNAME to the
value of that ticket cache, i.e. ~/krb5cc_$user (see the sketch at the end of this slide)
This allows MIT's kinit and other Kerberos client libraries to work with the IBM JDK ticket cache
Other vendors' Kerberos clients can be used to generate the ticket cache, but the IBM JDK must
be upgraded to at least 1.7 SR3 FP30 in order for the IBM JDK to work with the generated ticket
cache using the KRB5CCNAME environment variable
Upgrade to the latest IBM JDK
Refer to security bulletins; Fix Central will display the latest IBM JDK that is certified with Information
Server.
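A minimal sketch of that recommendation, assuming the default /opt install path, a dsadm run-time user, and a keytab at /home/dsadm/dsadm.keytab (principal, realm, and paths are placeholders):
# Generate the ticket cache with the IBM JDK's kinit
/opt/IBM/InformationServer/jdk/jre/bin/kinit -k -t /home/dsadm/dsadm.keytab dsadm@IBM.COM
# Point other Kerberos client libraries at the same cache
export KRB5CCNAME=/home/dsadm/krb5cc_dsadm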
Connectivity to Kerberized Hadoop Clusters

If cluster is kerberized, you must use a Kerberos client (i.e. DataStage must have a valid
ticket) in order to connect
Knox
Avoids the need for a Kerberos ticket
https://knox.apache.org/
Note: This will NOT work for JDBC connections to Hive using the DataDirect JDBC
driver, as it does not currently support using an HTTP port
This means the Hive Connector and the File Connector's Hive table create option cannot be
used on a Kerberized cluster unless the DataStage server can obtain a Kerberos ticket.
Troubleshooting BigIntegrate/BigQuality

gather_px_yarn_logs.sh
An automated script that runs upon PX job failure if the following variables are set:
APT_YARN_GATHER_LOGS_NM_PATH
APT_YARN_GATHER_LOGS_RM_PATH
YARN log aggregation collection is automated if YARN log aggregation is enabled (recommended)
The script also attempts to collect additional files:
Version.xml
yarnconfig.cfg
NodeManager logs (if passwordless SSH is enabled between Edge node and Hadoop data nodes for
runtime user)
Log Files:
Shows errors/logging for YARN client startup
/tmp/yarn_client.[USERID].out
Shows binary localization details [From Engine to HDFS]
/tmp/copy-orchdist.log
PX Yarn client logging (after it is started)
/IBM/InformationServer/Server/PXEngine/logs/yarn_logs
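A sketch of enabling the automated collection; the paths are placeholders and should match where your distribution writes NodeManager and ResourceManager logs (often under /var/log/hadoop-yarn):
APT_YARN_GATHER_LOGS_NM_PATH=/var/log/hadoop-yarn
APT_YARN_GATHER_LOGS_RM_PATH=/var/log/hadoop-yarn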
Troubleshooting BigIntegrate/BigQuality

Key yarn-site.xml file (YARN client) parameters:


yarn.nodemanager.log-dirs - Location where the container logs are stored
yarn.nodemanager.local-dirs - Working directory for containers
yarn.nodemanager.delete.debug-delay-sec - Number of seconds before the NodeManager cleans up
logs and files related to a container's execution. Recommend setting this to 600 to allow time to debug
container startup issues.
yarn.log-aggregation-enable - Determines whether YARN log aggregation is enabled
yarn.nodemanager.remote-app-log-dir - Location where logs are aggregated, typically in HDFS
YARN_LOG_DIR - Environment variable, typically set in yarn-env.sh, that defines where YARN NodeManager
logs are stored. In most cases this is /var/log/hadoop-yarn.
These NodeManager logs contain helpful messages concerning container resource sizes, etc. A sample yarn-site.xml fragment follows.
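For illustration, the two settings called out above as they would appear in yarn-site.xml, using the values suggested on this slide:
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>600</value>
</property>
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>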
Hadoop Connectivity
Hadoop connectivity options
IBM InfoSphere Information Server provides multiple connectivity options to handle data
stored in HDFS
Users can read data directly from files stored in HDFS, use the SQL-like interface to run
queries against Hive tables, or use the Kafka Connector to publish/subscribe to topics in a
Kafka cluster

File Connector
Allows reading / writing the files directly into HDFS
Can read or write in parallel
Can be used to write large volumes of data into a Hive table, by writing the data into
HDFS files and then creating a Hive table association using the Create Hive table option
in the File Connector
Supports multiple data formats
Supports connecting through WebHDFS, HttpFS or Native HDFS
Note:
(1) Native HDFS API support has been added via a patch
(2) The native HDFS API is useful when running with BigIntegrate
File Connector (Configuration)
Hive Table Create Option
Uses DataDirect JDBC driver for Hive
Creates an external table over the HDFS file that was loaded by the File Connector
JDBC driver requires additional configuration, isjdbc.config (in $DSHOME)
CLASSPATH=/opt/IBM/InformationServer/ASBNode/lib/java/IShive.jar
CLASS_NAMES=com.ibm.isf.jdbc.hive.HiveDriver;

JDBCDriverLogin.conf should be in ASBNode/lib/java when Kerberos is being used.
File Connector (Configuration)
Sample JAAS Configuration file
JDBCDriverLogin.conf
JDBC_DRIVER_01 {
  com.ibm.security.auth.module.Krb5LoginModule required
  credsType=initiator
  principal="slbrokaw/kvm313-rh6.swg.usma.ibm.com@IBM.COM"
  useCcache="FILE:/home/slbrokaw/krb5cc_slbrokaw";
};
The entries shown are specific to the IBM JDK; only the IBM JDK is supported
Either a ticket cache or a keytab can be used
To use a keytab, replace useCcache with:
useKeytab="FILE:/home/slbrokaw/slbrokaw.keytab"
Multiple entries can be specified in JDBCDriverLogin.conf
A specific stanza can be selected in the JDBC URL using the configuration property
loginConfigName=JDBC_DRIVER_01
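For illustration only, a Hive JDBC URL selecting that stanza might look like the following; the host, port, and database are placeholders, and the exact property names should be confirmed against the DataDirect Hive driver documentation:
jdbc:ibm:hive://hiveserver.example.com:10000;DatabaseName=default;AuthenticationMethod=kerberos;loginConfigName=JDBC_DRIVER_01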
Hive Connector
Writes to Hive are possible (with the ODBC/JDBC Connector, there were certain
limitations with the write functionality)
Can be used for transactional writes into Hive, while File Connector serves the purpose of
initial loading of data into Hive
Partitioned Reads and Partitioned Writes are possible
Supports creating tables with different file formats; here the statements are generated,
whereas with the ODBC/JDBC Connector the user has to provide the statement
The initial implementation doesn't support cache localization, i.e. it won't localize the credential
cache with BigIntegrate/BigQuality
Every data node must have the credential cache or keytab
Easy to configure and the URL and configuration are similar to JDBC Connector
BigSQL
DB2 Connector
Uses DB2 client to connect to BigSQL
Limitations around special BigSQL syntax such as CREATE HADOOP TABLE
Reads
Normal SQL
Fast
Writes
Use a staging table approach: load into a temporary DB2 table, then use an
after-SQL statement to Insert/Select into the target table (see the example after this list)
File Connector/Hive table create option
Use external Hive table through File Connector as staging table
Run After-Job subroutine
Call BigSQL to Insert/Select into internal Hive table (BigSQL)
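A sketch of the Insert/Select step used by both staging approaches; the table names are placeholders:
INSERT INTO target_hadoop_table SELECT * FROM staging_table;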
Kafka Connector
Kafka is a distributed publish-subscribe messaging system that is designed to be fast,
scalable, and durable. It maintains feeds of messages in categories called topics.

Producers publish (write) messages to topics and Consumers subscribe to (read from) topics.

Because Kafka is a distributed system, topics can be partitioned and replicated across multiple
nodes.

Kafka can be run as a cluster comprised of one or more servers.


Kafka Connector
Kafka Connector implements Kafka Producer and Kafka Consumer functionalities

Kafka Producer writes messages into Kafka topics

Kafka Consumer reads messages from Kafka topics

Kafka Connector also supports Kerberos authentication


Note : Localization of the Keytab is not supported in the initial release of the connector

Note: The connector is released as a patch on IS 11.5.0.1 running on the Linux x64 platform


Kafka Connector (Write mode)
Kafka Connector can write messages to multiple topics from upstream source stages in the
ETL flow.

The connector waits for the specified timeout before ending the job, which can be used in case
data is coming from streaming sources.
Kafka Connector (Read mode)
Kafka Connector can read messages from topics into the ETL job flow

Reads messages from the topic based on consumer group specified

For the first time read of a topic by a consumer group, it retrieves messages based on the
Reset policy: earliest or latest

For subsequent reads, it stores the offset from the earlier read and retrieves messages
after that.
Kerberos Security in Kafka Connector
Kafka connector implements Kerberos authentication mechanism.

Implements the pluggable authorizer from Kafka, which restricts users to accessing only the
data they have permissions on
Troubleshooting Connectors
CC_MSG_LEVEL
Additional debug logs can be generated by setting the value of this environment
variable to 1
Supported by all the connectors to capture additional debug information, which is
helpful when diagnosing issues
CC_JVM_OPTIONS
Used to set additional JVM options. This can be used to set the debug options while
using SSL or Kerberos authentication
When SSL is used, setting the following value provides debug information
CC_JVM_OPTIONS=-Djavax.net.debug=all
When Kerberos is used, setting the following value provides debug information
CC_JVM_OPTIONS=-Dcom.ibm.security.jgss.debug=all -Dcom.ibm.security.krb5.Krb5Debug=all
Notices and disclaimers
Copyright 2016 by International Business Machines Corporation (IBM). No part of this document may be reproduced or
transmitted in any form without written permission from IBM.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with
IBM.

Information in these presentations (including information relating to products that have not yet been announced by IBM) has been
reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall
have no responsibility to update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY,
EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS
INFORMATION, INCLUDING BUT NOT LIMITED TO, LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS
OF OPPORTUNITY. IBM products and services are warranted according to the terms and conditions of the agreements under
which they are provided.

IBM products are manufactured from new parts or new and used parts. In some cases, a product may not be new and may have
been previously installed. Regardless, our warranty terms apply.

Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without
notice.

Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are
presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual
performance, cost, savings or other results in other operating environments may vary.

References in this document to IBM products, programs, or services do not imply that IBM intends to make such products,
programs or services available in all countries in which IBM operates or does business.

Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily
reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor
shall constitute legal or other guidance or advice to any individual participant or their specific situation.

It is the customer's responsibility to insure its own compliance with legal requirements and to obtain advice of competent legal
counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer's
business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent
or warrant that its services or products will ensure that the customer is in compliance with any law.
Notices and disclaimers, continued
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other
publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of
performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be
addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such
third-party products to interoperate with IBM's products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED,
INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE.

The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents,
copyrights, trademarks or other intellectual property right.

IBM, the IBM logo, ibm.com, Aspera, Bluemix, Blueworks Live, CICS, Clearcase, Cognos, DOORS, Emptoris, Enterprise Document
Management System, FASP, FileNet, Global Business Services , Global Technology Services , IBM ExperienceOne, IBM
SmartCloud, IBM Social Business, Information on Demand, ILOG, Maximo, MQIntegrator, MQSeries, Netcool, OMEGAMON,
OpenPower, PureAnalytics, PureApplication, pureCluster, PureCoverage, PureData, PureExperience, PureFlex, pureQuery,
pureScale, PureSystems, QRadar, Rational, Rhapsody, Smarter Commerce, SoDA, SPSS, Sterling Commerce, StoredIQ,
Tealeaf, Tivoli, Trusteer, Unica, urban{code}, Watson, WebSphere, Worklight, X-Force and System z Z/OS, are trademarks of
International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at:
www.ibm.com/legal/copytrade.shtml.
Thank You
