Enterprise Edition
Day 1
Review of EE Concepts
Sequential Access
Best Practices
DBMS as Source
Combining Data
Configuration Files
Extending EE
Meta Data in EE
Day 2
EE Architecture
Transforming Data
DBMS as Target
Sorting Data
Day 3
Day 4
Job Sequencing
Testing and Debugging
Online Help
Intro
Part 1
Introduction to DataStage EE
What is DataStage?
DataStage Administrator
Client Logon
DataStage Manager
DataStage Designer
DataStage Director
Developing in DataStage
Compile Designer
DataStage Projects
Intro
Part 2
Configuring Projects
Module Objectives
Project Properties
Licensing Tab
Environment Variables
Permissions Tab
Tracing Tab
Tunables Tab
Parallel Tab
Intro
Part 3
Module Objectives
What Is Metadata?
Diagram: Source → Transform → Target; metadata from each stage is stored in the Meta Data Repository.
DataStage Manager
Manager Contents
Export Procedure
Import Procedure
Import Options
Exercise
Metadata Import
Intro
Part 4
Module Objectives
What Is a Job?
Designer Toolbar
Provides quick access to the main functions of Designer
Show/hide metadata markers
Job properties
Compile
Tools Palette
Transformer Stage
Result
Job Properties
Short and long descriptions
Shows in Manager
Annotation stage
Is a stage on the tool palette
Shows on the job GUI (work area)
Compiling a Job
Intro
Part 5
Running Jobs
Module Objectives
DataStage Director
Module 1
DSEE (DataStage Enterprise Edition)
Review
Ascential's Enterprise
Data Integration Platform
Diagram: Command & Control spans the platform, together with Parallel Execution and Meta Data Management.
ANY SOURCE: CRM, ERP, SCM, RDBMS, legacy, real-time, client-server, web services, data warehouse, other apps.
DISCOVER (Data Profiling): gather relevant information for target enterprise applications
PREPARE (Data Quality): cleanse, correct and match input data
TRANSFORM (Extract, Transform, Load): standardize and enrich data and load to targets
ANY TARGET: CRM, ERP, SCM, BI/Analytics, RDBMS, real-time, client-server, web services, data warehouse, other apps.
Course Objectives
Course Agenda
Day 1
Review of EE Concepts
Sequential Access
Standards
DBMS Access
Combining Data
Configuration Files
Day 2
EE Architecture
Transforming Data
Sorting Data
Day 3
Day 4
Extending EE
Meta Data Usage
Job Control
Testing
Module Objectives
Tasks
Review Topics
DataStage architecture
Client-Server Architecture
Diagram: the DataStage clients (Designer, Administrator, Manager, Director) connect to the Server and its Repository, providing command and control.
Data flows from any source through Discover (extract), Prepare (transform), Transform (cleanse), and Extend (integrate) steps to any target: CRM, ERP, SCM, BI/Analytics, RDBMS, real-time, client-server, web services, data warehouse, other apps.
Process Flow
Administrator: Project Creation/Removal
Functions specific to a project
Variables for parallel processing
Administrator: Environment Variables
Variables are category specific
OSH is what is run by the EE Framework
DataStage Manager
Push meta data to MetaStage
Designer Workspace
Can execute the job from Designer
The EE Framework runs OSH
Messages from the previous run appear in a different color
Stages
Can now customize the Designer's palette:
select desired stages and drag to Favorites
Row Generator
Peek
Row Generator
Edit row in the Columns tab
Repeatable property
Peek
Why EE is so Effective
Emphasis on memory
Diagram: operational and archived source data is written to disk and re-read between each Transform, Clean, and Load step on its way to the data warehouse.
Traditional approach to batch processing:
Write to disk and read from disk before each processing operation
Sub-optimal utilization of resources
a 10 GB stream leads to 70 GB of I/O
processing resources can sit idle during I/O
Very complex to manage (lots and lots of small jobs)
Becomes impractical with big data volumes
disk I/O consumes the processing resources
terabytes of disk required for temporary staging
Pipeline Multiprocessing
Data Pipelining
Transform, clean and load processes are executing simultaneously on the same processor
rows are moving forward through the flow
Diagram: operational and archived source data flows through Transform → Clean → Load into the data warehouse without pausing between stages.
Start a downstream process while an upstream process is still running.
This eliminates intermediate storing to disk, which is critical for big data.
This also keeps the processors busy.
Still has limits on scalability
Partition Parallelism
Data Partitioning
Break up big data into partitions
Run one partition on each processor
4X faster on 4 processors
With data big enough:
100X faster on 100 processors
This is exactly how parallel databases work!
Data partitioning requires applying the same transform to all partitions:
Aaron Abbott and Zygmund Zorn undergo the same transform.
Diagram: source data is partitioned A-F, G-M, N-T, U-Z across Nodes 1-4, each node running the same Transform.
Diagram: partitioning combined with pipelining; partitioned source data flows through Transform → Clean → Load into the data warehouse.
Repartitioning
Diagram: data partitioned A-F, G-M, N-T, U-Z is repartitioned between Transform and Clean while pipelining continues from the source through Load into the data warehouse.
Without Landing To Disk!
EE Program Elements
DataStage EE Architecture
DataStage:
Provides data integration platform
Orchestrate Framework:
Provides application scalability
Orchestrate Program
(sequential data flow)
Flat Files
Relational Data
Clean 1
Import
Analyze
Merge
Clean 2
Configuration File
Performance Visualization
Analyze
Inter-node communications
Parallelization of operations
Introduction to DataStage EE
DSEE moves data:
in parallel (partitioners)
sequentially (collectors)
Exercise
Module 2
Module Objectives
Sequential
File Set
Data Set
Importing/Exporting Data
Data import:
Data export
EE internal format
EE internal format
How
Recordization
Divides input stream into records
Set on the format tab
Columnization
Divides the record into columns
Defaults set on the Format tab, but can be overridden on the Columns tab
Can be incomplete if using a schema, or not even specified in the stage if using RCP
Final Delimiter = end: Field1, Field1, Field1, Last field, then the record delimiter (nl)
Final Delimiter = comma: Field1, Field1, Field1, Last field, then a comma before the record delimiter (, nl)
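For example, a three-field record with comma field delimiters would look like this on disk under each setting (illustrative sample data, not from the course):
    101,widget,9.95<nl>       Final Delimiter = end
    101,widget,9.95,<nl>      Final Delimiter = comma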
Stage categories
Multiple output links
Show records
Format Tab
Record into columns
Read Methods
Reject Link
Source
Target
Descriptor file
Key column specified
Key column dropped in descriptor file
Data Set
Suffixed by .ds
Persistent Datasets
Two parts:
Descriptor file: contains the record schema, e.g. input.ds:
    record (
      partno: int32;
      description: string;
    )
Data file(s): contain the data as multiple Unix files (one per node), accessible in parallel:
    node1:/local/disk1/
    node2:/local/disk2/
Quiz!
True or False?
Everything that has been data-partitioned must be collected in the same job.
Occurs on import
From sequential files or file sets
From RDBMS
Occurs on export
From datasets to file sets or sequential files
From datasets to RDBMS
Managing DataSets
Alternative methods:
Orchadmin: Unix command line; lists datasets, displays data, shows the schema
Dsrecords: gives the record count
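As a rough sketch of how these utilities are invoked from the Unix shell (exact options vary by release; treat these as illustrative):
    dsrecords input.ds          # report the number of records
    orchadmin dump input.ds     # display the data
    orchadmin rm input.ds       # delete descriptor and data files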
Exercise
Module 3
Objectives
Will cover:
Job documentation
Naming conventions for jobs, links, and stages
Iterative job design
Useful stages for job development
Using configuration files for development
Using environment variables
Job parameters
Job Presentation
Document using the Annotation stage
Description shows in DS Manager and MetaStage
Naming conventions
Container
Copy stage
Transformer Stage
Techniques
Often used:
Partitioners
Sort / Remove Duplicates
Rename, Drop column
Developing Jobs
Keep it simple
Final Result
$APT_DUMP_SCORE: reports the job score
$APT_CONFIG_FILE: establishes the configuration file
Click to add environment variables
Partitioner and Collector
Mapping: node → partition
Exercise
Module 4
DBMS Access
Objectives
Diagram: many clients feed Enterprise Edition, which sorts and loads the data into a parallel RDBMS in parallel.
RDBMS Access
Supported Databases
DB2
Informix
Oracle
Teradata
RDBMS Access
RDBMS Stages
DB2/UDB Enterprise
Informix Enterprise
Oracle Enterprise
Teradata Enterprise
RDBMS Usage
As a source
As a target
Inserts
Upserts (Inserts and updates)
Loader
Stream link
Columns in the SQL statement must match the metadata in the Columns tab
Exercise
User-defined SQL
Exercise 4-1
Reject link
Null Handling
Link name
DBMS as a Target
DBMS As Target
Write Methods
Delete
Load
Upsert
Write (DB2)
Target Properties
Generated code can be copied
Upsert mode determines options
Exercise
Module 5
Platform Architecture
Objectives
Concepts
An EE stage is a piece of application logic running against individual records, parallel or sequential.
Diagram: each stage has an input interface, a partitioner, business logic, and an output interface; a producer feeds a consumer through the pipeline, one process per partition.
OSH
What is OSH?
Orchestrate shell
Has a UNIX command-line interface
OSH Script
Where:
op is an Orchestrate operator
in.ds is the input data set
out.ds is the output data set
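A minimal one-operator script, using the names above, has this shape:
    osh "op < in.ds > out.ds"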
OSH Operators
Will be enabled for all projects
Operator
Schema
OSH Practice
Datasets
Consist of Partitioned Data and Schema
Can be Persistent (*.ds) or Virtual (*.v, Link)
Overcome 2 GB File Limit
What you program (the data flow in the GUI) vs. what executes:
Diagram: Operator A runs on Nodes 1-4, each instance reading the data files of x.ds.
Uniprocessor: one CPU with its own memory and disk (PC / workstation / single processor server)
SMP System (Symmetric Multiprocessor): multiple CPUs with shared memory and shared disk
Shared Nothing: each node has its own CPU, memory, and disk
Job Execution:
Orchestrate
Conductor Node (step composer):
Creates Section Leader processes (one per node)
Consolidates messages, outputs them
Manages orderly shutdown
Section Leader (SL), one per processing node:
Forks Player processes (one per stage)
Manages up/down communication
Players run on the processing nodes
single-node file: for sequential execution, lighter reports; handy for testing
'MedN-nodes' file: aims at a mix of pipeline and data-partitioned parallelism
'BigN-nodes' file: aims at full data-partitioned parallelism
$APT_CONFIG_FILE
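The configuration file in effect is selected through this environment variable; a shell sketch (the path is illustrative):
    export APT_CONFIG_FILE=/opt/datastage/configs/4node.apt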
Scheduling
Nodes, Processes, and CPUs
The user specifies the operators (ops) and nodes
Orchestrate creates the processes: nodes * ops
The O/S schedules those processes onto the available CPUs
"
node "n1" {
fastname "s1"
pool "" "n1" "s1" "app2" "sort"
resource disk "/orch/n1/d1" {}
resource disk "/orch/n1/d2" {}
resource scratchdisk "/temp" {"sort"}
}
node "n2" {
fastname "s2"
pool "" "n2" "s2" "app1"
resource disk "/orch/n2/d1" {}
resource disk "/orch/n2/d2" {}
resource scratchdisk "/temp" {}
}
node "n3" {
fastname "s3"
pool "" "n3" "s3" "app1"
resource disk "/orch/n3/d1" {}
resource scratchdisk "/temp" {}
}
node "n4" {
fastname "s4"
pool "" "n4" "s4" "app1"
resource disk "/orch/n4/d1" {}
resource scratchdisk "/temp" {}
}
node "n1" {
fastname "s1"
pool "" "n1" "s1" "app2" "sort"
resource disk "/orch/n1/d1" {}
resource disk "/orch/n1/d2" {"bigdata"}
resource scratchdisk "/temp" {"sort"}
}
node "n2" {
fastname "s2"
pool "" "n2" "s2" "app1"
resource disk "/orch/n2/d1" {}
resource disk "/orch/n2/d2" {"bigdata"}
resource scratchdisk "/temp" {}
}
node "n3" {
fastname "s3"
pool "" "n3" "s3" "app1"
resource disk "/orch/n3/d1" {}
resource scratchdisk "/temp" {}
}
node "n4" {
fastname "s4"
pool "" "n4" "s4" "app1"
resource disk "/orch/n4/d1" {}
resource scratchdisk "/temp" {}
}
Re-Partitioning
Parallel to parallel flow may incur reshuffling:
Records may jump between nodes
Diagram: records from node 1 and node 2 pass through a partitioner and may land on different nodes.
Partitioning Methods
Auto
Hash
Entire
Range
Range Map
Collectors
Collectors combine partitions of a dataset into a single input stream to a sequential stage.
Diagram: data partitions → collector → sequential stage
Partitioner
Collector
Exercise
Module 6
Transforming Data
Module Objectives
Transformed Data
Date and time
Mathematical
Logical
Null handling
More
Stages Review
Transformer
Parallel
Basic
Create derivations
Flow Control
Rejecting Data
Constraint: Other/log option
Property Reject Mode = Output
If Not Found property
Stage Variables
Multi-purpose:
Counters
Hold values from previous rows to make comparisons
Hold derivations to be used in multiple field derivations
Can be used to control execution of constraints
Show/Hide button
Transforming Data
Derivations
Using expressions
Using functions
Date/time
Constraint Rejects
Transformer
Basic Transformer
Is the non-Universe transformer
No DS routines available
Logical
Null Handling
Number
String
Type Conversion
Exercise
Module 7
Sorting Data
Objectives
Sorting Data
Important because
Some stages require sorted input
Some stages may run faster, e.g. the Aggregator
Can be performed
Option within stages (use the Input > Partitioning tab and set partitioning to anything other than Auto)
As a separate stage (more complex sorts)
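In the generated OSH, a standalone sort appears as the tsort operator; a minimal sketch (assuming the tsort operator; the key name is illustrative):
    osh "tsort -key lastname < in.ds > sorted.ds"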
Sorting Alternatives
Sort Stage
Sort Utility
UNIX
Sort Stage
Removing Duplicates
OR
Exercise
Module 8
Combining Data
Objectives
Combining Data
Horizontally:
Several input links; one output link (+ optional rejects) made of columns from different input links, e.g.:
Joins
Lookup
Merge
Vertically:
One input link; one output link whose columns combine values from all input rows, e.g.:
Aggregator
Naming convention:
Joins: Left, Right
Lookup: Source, LU Table(s)
Merge: Master, Update(s)
Tip: check the "Input Ordering" tab to make sure the intended Primary is listed first
Link Order is immaterial for Inner and Full Outer Joins (but VERY important for Left/Right Outer Joins, Lookup, and Merge)
Diagram: Lookup takes a Source input and one or more lookup tables (LUTs) and produces an Output plus an optional Reject.
No pre-sort necessary
Allows multiple keys and LUTs
Flexible exception handling for source input rows with no match
RDBMS LOOKUP
NORMAL
SPARSE
Combines:
A master row and one or more update rows are merged if they have the same value in user-specified key column(s).
If a non-key column occurs in several inputs, the lowest input port number prevails (e.g., master over update; update values are ignored).
Unmatched ("bad") master rows can be either kept or dropped.
Diagram: Merge takes one Master and one or more Updates and produces an Output plus Rejects.
Synopsis:
Joins, Lookup, & Merge
Model: Joins: RDBMS-style relational; Merge: Master / Update(s)
Memory usage: Joins: light; Merge: light
Number and names of inputs: Lookup: 1 Source, N LU Tables; Merge: 1 Master, N Update(s)
Mandatory input sort: Lookup: no; Merge: all inputs
Duplicates in primary input: Lookup: OK; Merge: Warning!
Duplicates in secondary input(s): Lookup: Warning!; Merge: OK only when N = 1
Options on unmatched primary: Lookup: [fail] | continue | drop | reject; Merge: [keep] | drop
Options on unmatched secondary: Lookup: NONE; Merge: capture in reject set(s)
On match, secondary entries are: Lookup: reusable; Merge: consumed
# Outputs: Joins: 1; Lookup: 1 out (1 reject); Merge: 1 out (N rejects)
Captured in reject set(s): Joins: nothing (N/A); Lookup: unmatched primary entries; Merge: unmatched secondary entries
Columns to be aggregated
Aggregation functions:
count (nulls/non-nulls), sum, max/min/range
Grouping Methods
Aggregator Functions
Sum
Min, max
Mean
Aggregator Properties
Aggregation Types
Containers
Two varieties:
Local
Shared
Creating a Container
Create a job
Using a Container
Exercise
Module 9
Configuration Files
Objectives
Optimizing Parallelism
Configuration File
Components
Node
Fast name
Pools
Resource
Node Options
Fastname
Pools
Resource
Disk
Scratchdisk
Disk Pools
pool "bigdata"
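In the configuration file, a disk is placed in the pool by naming it in the resource's braces, as in the earlier example:
    resource disk "/orch/n1/d2" {"bigdata"}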
Sorting Requirements
Resource pools can also be specified for sorting:
node "n1" {
fastname "s1"
pool "" "n1" "s1" "sort"
resource disk "/data/n1/d1" {}
resource disk "/data/n1/d2" {}
resource scratchdisk "/scratch" {"sort"}
}
node "n2" {
fastname "s2"
pool "" "n2" "s2" "app1"
resource disk "/data/n2/d1" {}
resource scratchdisk "/scratch" {}
}
node "n3" {
fastname "s3"
pool "" "n3" "s3" "app1"
resource disk "/data/n3/d1" {}
resource scratchdisk "/scratch" {}
}
node "n4" {
fastname "s4"
pool "" "n4" "s4" "app1"
resource disk "/data/n4/d1" {}
resource scratchdisk "/scratch" {}
}
...
Resource Types
Disk
Scratchdisk
DB2
Oracle
Saswork
Sortwork
Number of CPUs
CPU speed
Available memory
Available page/swap space
Connectivity (network/back-panel speed)
Exercise
Module 10
Extending DataStage EE
Objectives
EE Extensibility Overview
Sometimes it will be to your advantage to leverage EE's extensibility. This extensibility includes:
Wrappers
Buildops
Custom Stages
Wrapped executables must be pipe-safe:
can read rows sequentially
no random access to data
Wrappers (cont'd)
LS Example
Creating a Wrapper
Name of stage
Conscientiously maintaining the Creator page for all your wrapped stages will eventually earn you the thanks of others.
Interfaces: input and output columns; these should first be entered into the table definition metadata (DS Manager); let's do that now.
Interface schemas
Layout interfaces describe what columns the stage:
Needs for its inputs (if any)
Creates for its outputs (if any)
Interfaces should be created as tables with columns in Manager
Data flow: input schema → export → stdin or named pipe → UNIX executable → stdout or named pipe → import → output schema
Resulting Job
Wrapped stage
Job Run
Hardware Environment:
Software:
DB2/EEE, COBOL, EE
Buildops
Buildop provides a simple means of extending beyond the functionality provided by EE, but does not use an existing executable (unlike the wrapper).
Reasons to use Buildop include:
Speed / Performance
Complex business logic that cannot be easily represented
using existing stages
BuildOps
"Build" stages
from within Enterprise Edition
"Wrapping existing Unix
executables
General Page
Identical to Wrappers, except:
Temporary variables are declared [and initialized] here
Logic here is executed once BEFORE processing the FIRST row
Logic here is executed once AFTER processing the LAST row
First line: output 0
Optional renaming of the output port from the default "out0"
Write row
Input page: 'Auto Read'
Read next row
In-repository table definition
'False' setting, so as not to interfere with the Transfer page
First line: transfer of index 0
Example - sumNoTransfer
Diagram: a:int32; b:int32 → sumNoTransfer → sum:int32 (no transfer)
From Peek:
NO TRANSFER
- RCP set to "False" in stage definition
and
- Transfer page left blank, or Auto Transfer = "False"
Effects:
- input columns "a" and "b" are not transferred
- only new column "sum" is transferred
Transfer
TRANSFER
- RCP set to "True" in stage definition
or
- Auto Transfer set to "True"
Effects:
- new column "sum" is transferred, as well as
- input columns "a" and "b" and
- input column "ignored" (present in input, but
not mentioned in stage)
Columns vs.
Temporary C++ Variables
Columns: DS-EE types, defined in Table Definitions
Temporary C++ variables: C/C++ types; values persist throughout the "loop" over rows, unless modified in code
Exercise
Custom Stage
Use EE API
Custom Stage
DataStage Manager > select the Stage Types branch > right click
Custom Stage
Name of the Orchestrate operator to be used
The Result
Module 11
Objectives
Data definitions
Recordization and columnization
Fields have properties that can be set at the individual field level
Data
Schemas
Can be imported into Manager
Can be pointed to by some job stages (e.g., Sequential)
Format tab
Column Overrides
Field and string settings
Editing Columns
Properties depend on the data type
Schema
Creating a Schema
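A schema is a plain text file; a minimal sketch (the field names and format properties here are illustrative):
    record {final_delim=end, delim=',', quote=double} (
      partno: int32;
      description: string[max=30];
    )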
Importing a Schema
Data Types
Date
Vector
Decimal
Subrecord
Floating point
Raw
Integer
Tagged
String
Time
Timestamp
Exercise
Module 12
Objectives
Job Sequencer
Build a controlling job much the same way you build other jobs
Comprised of stages and links
No basic coding
Job Sequencer
Example
Job Activity stage contains conditional triggers
Job to be executed: select from the dropdown
Job parameters to be passed
Options
Different links having different triggers
Sequencer Stage
Can be set to all or any
Notification Stage
Notification
Notification Activity
E-Mail Message
Exercise
Module 13
Objectives
Environment Variables
Stage Specific Environment Variables
Compiler
The Director
Typical Job Log Messages:
Environment variables
Tracing/Debug output
Must compile job
Adds overhead in trace mode
Troubleshooting
If you get an error during compile, check the following:
Compilation problems:
If a Transformer is used, check the C++ compiler and LD_LIBRARY_PATH
If Buildop errors occur, try buildop from the command line
Some stages may not support RCP; this can cause column mismatches
Use the Show Error and More buttons
Examine the generated OSH
Check environment variable settings
Very little integrity checking occurs during compile; run Validate from Director.