DataStage Essentials v8
Core Modules
4.0.3
02/012007
1
DataStage Essentials v8
DynamicServer, WorkgroupEdition
Enterprise Storage Server
FFST/2
Foundation.2000
Illustra
Informix
Informix4GL
InformixExtendedParallelServer
InformixInternet Foundation.2000
Informix RedBrick Decision Server
J/Foundation
MaxConnect
MVS
MVS/ESA
Net.Data
NUMA-Q
ON-Bar
OnLineDynamicServer
OS/2
OS/2 WARP
OS/390
OS/400
PTX
QBIC
QMF
RAMAC
RedBrickDesign
RedBrickDataMine
Microsoft, Windows, Window NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
Java, JDBC, and all Java-based trademarks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
All other product or brand names may be trademarks of their respective companies.
All information contained in this document has not been submitted to any formal IBM test and is distributed on an as is basis without any warranty either express or implied. The use of this information or the implementation of any of
these techniques is a customer responsibility and depends on the customers ability to evaluate and integrate them into the customers operational environment. While each item may have been reviewed by IBM for accuracy in a specific
situation, there is no guarantee that the same or similar results will result elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk. The original repository material for this course has
been certified as being Year 2000 compliant.
This document may not be reproduced in whole or in part without the priori written permission of IBM.
Note to U.S. Government Users Documentation related to restricted rights Use, duplication, or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corp.
02/012007
2
DataStage Essentials v8
Course Contents
02/012007
3
DataStage Essentials v8
02/012007
4
DataStage Essentials v8
4.0.3
02/012007
5
DataStage Essentials v8
Unit objectives
After completing this unit, you should be able to:
List and describe the uses of DataStage
List and describe the DataStage clients
Describe the DataStage workflow
List and compare the different types of DataStage jobs
Describe the two types of parallelism exhibited by DataStage
parallel jobs
Notes:
02/012007
6
DataStage Essentials v8
Design the jobs that extract, integrate, aggregate, load, and transform
the data for your data warehouse or data mart.
02/012007
7
DataStage Essentials v8
Security
Repository
Logging and reporting
Metadata management
02/012007
8
DataStage Essentials v8
Information
Services
Director
Business
Glossary
Information
Analyzer
DataStage
QualityStage
Metadata
Metadata
Access Services
Analysis Services
MetaBrokers
Federation
Server
Metadata Server
02/012007
9
DataStage Essentials v8
10
02/012007
10
DataStage Essentials v8
DataStage Architecture
Clients
Parallel engine
Server engine
Engines
Shared Repository
11
The top half displays the Clients. Below there are two engines: The Server
engine that runs DataStage server jobs and the parallel engine that runs
parallel jobs. Our focus in this course is on Parallel jobs.
The DataStage client components are:
Administrator
Administers DataStage projects and conducts housekeeping on the server
Designer
Creates DataStage jobs that are compiled into executable programs
Director
Used to run and monitor the DataStage jobs
The Repository is used to store DataStage objects. The Repository is
shared with other applications in the Suite.
02/012007
11
DataStage Essentials v8
DataStage Administrator
12
Use the Administrator to specify general server defaults, add and delete
projects, and to set project properties. The Administrator also provides a
command interface to the DataStage repository.
Use the Administrator Project Properties window to:
tab.
Set job monitoring limits and other Director defaults on the General
02/012007
12
DataStage Essentials v8
DataStage Designer
DataStage
parallel job
13
02/012007
13
DataStage Essentials v8
DataStage Director
Log
message
events
14
Use the Director to validate, run, schedule, and monitor your DataStage
jobs. You can also gather statistics as the job runs.
02/012007
14
DataStage Essentials v8
Developing in DataStage
15
Development Workflow:
Import metadata that defines the format of data stores your jobs will
read from or write to
02/012007
15
DataStage Essentials v8
Standard
jobs folder
Standard
Table
Definitions
folder
16
All your work is stored in a DataStage project. Before you can do anything,
other than some general administration, you must open (attach to) a project.
Projects are created during and after the installation process. You can add
projects after installation on the Projects tab of Administrator.
A project is associated with a directory. The project directory is used by
DataStage to store your jobs and other DataStage objects and metadata on
your server.
You must open (attach to) a project before you can do any work in it.
Projects are self-contained. Although multiple projects can be open at the
same time, they are separate environments. You can, however, import and
export objects between them.
Multiple users can be working in the same project at the same time.
However, DataStage will prevent multiple users from editing the same
DataStage object (job, Table Definition, etc.) at the same time.
02/012007
16
DataStage Essentials v8
Server jobs
17
This course focuses on parallel jobs and job sequences that control parallel jobs.
However, job sequences can include both server and parallel jobs in the same
sequence.
02/012007
17
DataStage Essentials v8
Read data
Write data
E.g., Sequential File, DB2, Oracle, Peek stages
Transform data
Filter data
Aggregate data
Generate data
Split / Merge data
E.g., Transformer, Aggregator, Join, Sort stages
Links
Pipes through which the data moves from stage to stage
18
02/012007
18
DataStage Essentials v8
Pipeline Parallelism
Advantages:
Reduces disk usage for staging areas
Keeps processors busy
19
02/012007
19
DataStage Essentials v8
Partition Parallelism
Divide the incoming stream of data into subsets to be
separately processed by an operator
Subsets are called partitions
20
02/012007
20
DataStage Essentials v8
Three-Node Partitioning
Node 1
Operation
subset1
Node 2
subset2
Data
Operation
subset3
Node 3
Operation
21
02/012007
21
DataStage Essentials v8
22
02/012007
22
DataStage Essentials v8
Checkpoint
1. True or false: DataStage Designer is used to build and
compile your ETL jobs
2. True or false: User Director to monitor your job during
execution
3. True or false: Administrator is used to set global and project
properties
23
Notes:
Write down your answers here:
1.
2.
3.
02/012007
23
DataStage Essentials v8
Checkpoint solutions
1. True.
2. True.
3. True.
24
02/012007
24
DataStage Essentials v8
Unit summary
Having completed this unit, you should be able to:
List and describe the uses of DataStage
List and describe the DataStage clients
Describe the DataStage workflow
List and compare the different types of DataStage jobs
Describe the two types of parallelism exhibited by DataStage
parallel jobs
25
Notes:
02/012007
25
DataStage Essentials v8
26
02/012007
26
DataStage Essentials v8
4.0.3
02/012007
27
DataStage Essentials v8
Unit objectives
After completing this unit, you should be able to:
Identify the components of Information Server that need to be
installed
Describe what a deployment domain consists of
Describe different domain deployment options
Describe the installation process
Start the Information Server
28
Notes:
02/012007
28
DataStage Essentials v8
Administrator
Designer
Director
Information Analyzer
Business Glossary
Rational Data Architect
Information Services Director
Federation Server
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
29
Database servers other than DB2 can be configured and used. DB2, however, is
available on the installation image.
Which additional applications are available for installation depends on which have
been licensed.
02/012007
29
DataStage Essentials v8
DataStage server
Clients
30
Not shown here and in later slides is the possibility of additional DataStage playernode machines connected to the DataStage server machine using a high-speed
network. For simplicity in the diagram only the conductor node machine is
displayed.
02/012007
30
DataStage Essentials v8
DataStage server
MetaData server and
DB2 Repository
DataStage server
Clients
31
Additional DataStage servers can be part of this domain, but they would have to be
on separate machines.
Here the Database is depicted as DB2, but additional databases are supported.
02/012007
31
DataStage Essentials v8
DataStage server
MetaData server
DB2 Repository
Clients
DataStage server
32
Additional DataStage Servers can be part of this domain, but they would have to be
separate from one another.
02/012007
32
DataStage Essentials v8
33
02/012007
33
DataStage Essentials v8
Engine
Domain
Repository
Documentation
34
Selected configuration layers determine what gets installed on the machine local to
the installation. If your domain is spread across multiple machines, you will select
just the configuration types that are to be installed on the machine you are currently
installing on.
02/012007
34
DataStage Essentials v8
From Windows Start menu, click Start the Server after the profile
to be used (e.g., default)
From the command line, open the profile bin directory
Enter startup server1
>
From Windows Start menu, click Start the agent after selecting the
Information Server folder
Only required if DataStage and the MetaData Server are on
different machines
35
By default upon installation in Windows, the application server and ASB agent are
set to start upon system startup.
02/012007
35
DataStage Essentials v8
Profile
Profile
Startup command
36
02/012007
36
DataStage Essentials v8
37
02/012007
37
DataStage Essentials v8
Checkpoint
1. What application components make up a domain?
2. Can a domain contain multiple DataStage servers?
3. Does the DB2 instance and the Repository database need
to be on the same machine as the Application Server?
4. Suppose DataStage is on a separate machine from the
Application Server. What two components need to be
running before you log onto DataStage?
38
Notes:
Write down your answers here:
1.
2.
02/012007
38
DataStage Essentials v8
Checkpoint solutions
1. Metadata server hosted by the Application Server. One or
more DataStage servers. One DB2/UDB instance
containing the Suite Repository database
2. Yes. The DataStage servers must be on separate
machines. They can be on different platforms, e.g., one
server running on Windows and another running on Linux
3. No. The DB2 instance with the Repository can reside on a
separate machine/platform than the Application Server
4. The Application Server and the ASB agent
39
02/012007
39
DataStage Essentials v8
Unit summary
Having completed this unit, you should be able to:
Identify the components of Information Server that need to be
installed
Describe what a deployment domain consists of
Describe different domain deployment options
Describe the installation process
Start the Information Server
40
Notes:
02/012007
40
DataStage Essentials v8
4.0.3
02/012007
41
DataStage Essentials v8
Unit objectives
After completing this unit, you should be able to:
Open the Administrative console
Create new users and groups
Assign Suite roles and Product roles to users and groups
Give users DataStage credentials
Log onto DataStage Administrator
Add a DataStage user on the Permissions tab and specify the
users role
Specify DataStage global and project defaults
List and describe important environment variables
42
Notes:
02/012007
42
DataStage Essentials v8
Domain management
Session management
Management of users and groups
Logging management
Scheduling management
43
Our focus in this course is on the management of users and groups and domain
management.
02/012007
43
DataStage Essentials v8
Information Server
console address
Information
Server
administrator ID
44
To open the Administrative Web Console, select Web Console for IBM Information
Server Web Console.
The console address is of the form: http://machine:nnnn.
Here machine is the host name of the machine running the application server that
hosts MetaData Server.
nnnn is the port address of the console. By default, it is 9080.
The Information Server administrator ID and password is specified during
installation. After installation, new administrator IDs can be specified.
02/012007
44
DataStage Essentials v8
45
02/012007
45
DataStage Essentials v8
Users that are members of a group acquire the authorizations of the group
Suite roles
Administrator
Perform user and group management tasks
Includes all the privileges of the Suite User role
User
Create views of scheduled tasks and logged messages
Create and run reports
DataStage
DataStage user
DataStage administrator
46
A user ID that is assigned Suite roles can immediately log onto the Information
Server console.
What about a user ID that is assigned a DataStage Suite Component role? If the
user ID is assigned the DataStage Administrator role, then the user will immediately
acquire the DataStage Administrator permission for all projects.
If the user ID is assigned the DataStage User role, one more step is required for any
DataStage project that user will log onto. A DataStage Administrator must assign a
Developer role to that user ID on the Permissions tab. This is done within
DataStage, as we will see later.
02/012007
46
DataStage Essentials v8
Administration
console
Create
new user
Users
47
The process of creating a new group is similar to creating a new user. Users
assigned to a group inherit the authorizations assigned to the group.
02/012007
47
DataStage Essentials v8
User ID
Assign Suite
Administrator role
Assign Suite
User role
Users
Assign DataStage
User role
Assign DataStage
Administrator role
48
02/012007
48
DataStage Essentials v8
49
This assumes that when the DataStage Server was installed, the style of user
registry selected for the installation was Internal User Registry.
02/012007
49
DataStage Essentials v8
Assign DataStage
User role
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
50
All Suite users without their own DataStage credentials will be mapped to this user
ID and password. Here the user name and password are demohawk / demohawk.
demohawk is assumed to be a valid user on the DataStage Server machine and has
file permissions on the DataStage engine and projects directories.
Suite users can also be mapped individually to specific users.
Note that demohawk need not be a Suite administrator or user.
02/012007
50
DataStage Essentials v8
DataStage Administrator
51
02/012007
51
DataStage Essentials v8
DataStage
administrator ID
and password
Name or IP
address of
DataStage server
machine
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
52
This port is also used when logging into the Information Server Administration
console.
Recall that multiple DataStage servers can exist in a domain, although they must be
on different machines. Here you select the server that has the DataStage projects
you want to administer.
The user ID requires a DataStage Administrator or DataStage User product role.
There are some limits as to what a DataStage User role provides. For example, a
DataStage User cannot delete projects and cannot set permissions.
02/012007
52
DataStage Essentials v8
Server projects
Click to specify
project properties
53
02/012007
53
DataStage Essentials v8
Enable runtime
column propagation
(RCP) by default
Environment
variable settings
54
02/012007
54
DataStage Essentials v8
Environment Variables
Parallel job variables
User-defined variables
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
55
02/012007
55
DataStage Essentials v8
Display Score
Display record
counts
Display OSH
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
56
02/012007
56
DataStage Essentials v8
Assigned product
role
Add DataStage
users
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
57
This lists Suite users and groups that have a DataStage User or Administrator role.
When Suite users or groups that have a DataStage Administrator role are added,
they automatically they are automatically entered here and assigned the role of
DataStage Administrator.
Suite users or groups that have a DataStage User role need to be manually added.
To accomplish this, click the Add User or Group button (which is hidden in this
screen shot). Then you need to select the DataStage user role (Operator, Super
Operator, Developer, Production Manager) that this user ID is to have.
02/012007
57
DataStage Essentials v8
Available users /
groups. Must
have a DataStage
product User role
Add DataStage
users
58
Click the Add User or Group button to open this window. On the left are
Information Server users and groups that have been given a DataStage User role.
02/012007
58
DataStage Essentials v8
Added
DataStage user
Select DataStage
role
59
Use this page to set user group permissions for accessing and using DataStage. All
DataStage users must belong to a recognized user role before they can log on to
DataStage. This helps to prevent unauthorized access to DataStage projects.
There are four roles that can be assigned to a DataStage user:
DataStage Developer, who has full access to all areas of a DataStage project.
DataStage Operator, who can run and manage released DataStage jobs.
DataStage Super Operator, who can open Designer and view the Repository in a
read-only mode.
DataStage Production Manager, who has create and manipulate protected
projects.
02/012007
59
DataStage Essentials v8
60
02/012007
60
DataStage Essentials v8
61
Use this tab to specify Parallel job defaults. In addition to displaying the
OSH generated by DataStage from Parallel jobs, you specify default formats
for dates and times.
In general, you should choose to display the OSH. This provides useful
information about how your job works.
02/012007
61
DataStage Essentials v8
62
02/012007
62
DataStage Essentials v8
Checkpoint
1. Authorizations can be assigned to what two items?
2. What two types of authorization roles can be assigned to a
user or group?
3. In addition to Suite authorization to log onto DataStage what
else does a DataStage developer require to work in
DataStage
63
Notes:
Write down your answers here:
1.
2.
02/012007
63
DataStage Essentials v8
Checkpoint solutions
1. Users and groups. Members of a group acquire the
authorizations of the group.
2. Suite roles and Product roles.
3. Must be mapped to a user with DataStage credentials.
64
02/012007
64
DataStage Essentials v8
Unit objectives
Having completing this unit, you should be able to:
Open the Administrative console
Create new users and groups
Assign Suite roles and Product roles to users and groups
Give users DataStage credentials
Log onto DataStage Administrator
Add a DataStage user on the Permissions tab and specify the
users role
Specify DataStage global and project defaults
List and describe important environment variables
65
Notes:
02/012007
65
DataStage Essentials v8
66
02/012007
66
DataStage Essentials v8
4.0.3
02/012007
67
DataStage Essentials v8
Unit objectives
After completing this unit, you should be able to:
Log onto DataStage
Navigate around DataStage Designer
Import and export DataStage objects to a file
Import a Table Definition for a sequential file
68
Notes:
02/012007
68
DataStage Essentials v8
DataStage server
machine / project
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
69
02/012007
69
DataStage Essentials v8
Menus
Toolbar
Parallel
canvas
Palette
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
70
The appearance of the designer work space is configurable; the graphic shown here
is only one example of how you might arrange components.
In the right center is the Designer canvas, where you create stages and links. On
the top left is the Repository window. Items in the Repository, such as jobs and
Table Definitions can be dragged to the canvas area. On the bottom left is the
Tools Palette, which contains stages you can add to the canvas.
Click View>Repository to display the Repository window. Click View>Palette to
display the Palette window.
02/012007
70
DataStage Essentials v8
71
02/012007
71
DataStage Essentials v8
Repository Window
Search for
objects in the
project
Project
Default jobs
folder
Default Table
Definitions
folder
72
02/012007
72
DataStage Essentials v8
73
Any set of DataStage objects, including whole projects, which are stored in
the Repository, can be exported to a file. This export file can then be
imported back into DataStage.
Import and export can be used for many purposes, including:
Sharing jobs and projects between developers. The export files, when
zipped, are small and can be easily emailed from one developer to another.
02/012007
73
DataStage Essentials v8
Export Procedure
Click Export>DataStage Components
Select DataStage objects for export
Specify type of export:
DSX: Default format
XML: Enables processing of export file by XML applications,
e.g., for generating reports
74
02/012007
74
DataStage Essentials v8
Export Window
Click to
select
objects
from the
repository
Selected
objects
Export type
Default jobs
folder
Begin export
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
75
02/012007
75
DataStage Essentials v8
Import Procedure
Click Import>DataStage Components
Or Import>DataStage Components (XML) if you are importing
an XML-format export file
76
02/012007
76
DataStage Essentials v8
Import Options
Import all objects
in the file
Display list to
select from
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
77
For large imports, you may want to disable Perform impact analysis. This
adds overhead to the import process.
02/012007
77
DataStage Essentials v8
78
02/012007
78
DataStage Essentials v8
Sequential files
Relational tables
COBOL files
Many other things
79
Table Definitions define the formats of a variety of data files and tables.
These definitions can then be used and reused in your jobs to specify the
formats of data stores.
For example, you can import the format and column definitions of the
Customers.txt file. You can then load this into the sequential source stage
of a job that extracts data from the Customers.txt file.
You can load this same metadata into other stages that access data with the
same format. In this sense the metadata is reusable. It can be used with
any file or data store with the same format.
If the column definitions are similar to what you need you can modify the
definitions and save the Table Definition under a new name.
You can import and define several different kinds of Table Definitions
including: Sequential files and ODBC data sources.
02/012007
79
DataStage Essentials v8
80
02/012007
80
DataStage Essentials v8
Import Table
Definition for
a sequential
file
81
02/012007
81
DataStage Essentials v8
Start import
Select file
Select repository
folder
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
82
02/012007
82
DataStage Essentials v8
Specify Format
Edit columns
Delimiter
83
02/012007
83
DataStage Essentials v8
84
You can also add Extended Properties. Double-click on the number to the left of
the column name to open up a window in which you specify these extended
properties. Extended properties are discussed later in this course.
02/012007
84
DataStage Essentials v8
Parallel properties
Property
categories
Available
properties
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
85
02/012007
85
DataStage Essentials v8
Type
Source
Stored Table
Definition
86
In the Repository window, select the folder that contains the Table Definition.
Double-click the Table Definition to open the Table Definition window.
Click the Columns tab to view and modify any column definitions. Select
the Format tab to edit the file format specification. Select the Parallel tab to
specify parallel format properties.
02/012007
86
DataStage Essentials v8
Checkpoint
True or False? The directory to which you export is on the
DataStage client machine, not on the DataStage server
machine.
Can you import Table Definitions for sequential files with
fixed-length record formats?
87
Notes:
Write down your answers here:
1.
2.
02/012007
87
DataStage Essentials v8
Checkpoint solutions
1. True.
2. Yes. Record lengths are determined by the lengths of the
individual columns.
88
02/012007
88
DataStage Essentials v8
Unit summary
Having completed this unit, you should be able to:
Log onto DataStage
Navigate around DataStage Designer
Import and export DataStage objects to a file
Import a Table Definition for a sequential file
89
Notes:
02/012007
89
DataStage Essentials v8
90
02/012007
90
DataStage Essentials v8
4.0.3
02/012007
91
DataStage Essentials v8
Unit objectives
After completing this unit, you should be able to:
Design a simple Parallel job in Designer
Define a job parameter
Use the Row Generator, Peek, and Annotation stages in a job
Compile your job
Run your job in Director
View the job log
92
Notes:
02/012007
92
DataStage Essentials v8
93
02/012007
93
DataStage Essentials v8
94
In this module, you will go through the whole process with a simple job,
except for the first bullet. In this module you will manually define the
metadata.
02/012007
94
DataStage Essentials v8
Tools Palette
Stage
categories
Stages
95
The tool palette contains icons that represent the components you can add
to your job design.
02/012007
95
DataStage Essentials v8
96
02/012007
96
DataStage Essentials v8
97
02/012007
97
DataStage Essentials v8
Open
New
window
Parallel job
98
Click the New button in the toolbar to open the New window. Click on the
Parallel Job icon to create a new parallel job (the focus of this course).
02/012007
98
DataStage Essentials v8
Peek
Job
properties
Row
Generator
99
The tools palette may be shown as a floating dock or placed along a border.
Alternatively, it may be hidden and the developer may choose to pull needed
stages from the repository onto the design work area.
02/012007
99
DataStage Essentials v8
Documentation
Clarity
Fewer development errors
100
02/012007
100
DataStage Essentials v8
RowGenerator Stage
101
02/012007
101
DataStage Essentials v8
Set property
value
Property
102
02/012007
102
DataStage Essentials v8
Columns Tab
Double-click to specify
extended properties
View data
Load a
Table
Definition
Select Table
Definition
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
103
On the Columns tab, define the column definitions. Either manually specify the
columns or load the column definitions from a Table Definition. A Table Definition
can either be loaded, as shown here, or dragged from the Repository and dropped
on the link.
02/012007
103
DataStage Essentials v8
Extended Properties
Specified
properties and
their values
Additional
properties to add
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
104
Double-click on the column number to define the extended properties for the column
02/012007
104
DataStage Essentials v8
Peek Stage
Displays field values
105
The peek stage will display column values in a job's output messages log.
02/012007
105
DataStage Essentials v8
Output to
job log
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
106
02/012007
106
DataStage Essentials v8
Job Parameters
Defined in Job Properties window
Makes the job more flexible
Parameters can be:
Used in directory and file names
Used to specify property values
Used in constraints and derivations
107
02/012007
107
DataStage Essentials v8
Parameter
Add environment
variable
108
02/012007
108
DataStage Essentials v8
Job parameter
Click to insert
Job parameter
109
Select the property. Then enter the value in the text box. Click the button at the
right of the text box to display a menu for selecting a job parameter.
02/012007
109
DataStage Essentials v8
Annotation stage
Added from the Tools Palette
Displays formatted text descriptions on diagram
110
02/012007
110
DataStage Essentials v8
Documentation
111
02/012007
111
DataStage Essentials v8
112
You can type in whatever you want; the default text comes from the short
description of the jobs properties you entered, if any.
Add one or more Annotation stages to the canvas to document your job.
An Annotation stage works like a text box with various formatting options.
You can optionally show or hide the Annotation stages by pressing a button
on the toolbar.
There are two Annotation stages. The Description Annotation stage
correlates its text with the Descriptions specified as part of the job properties.
02/012007
112
DataStage Essentials v8
Compiling a Job
Run
Compile
113
Before you can run your job, you must compile it. To compile it, click
File>Compile or click the Compile button on the toolbar. The Compile Job
window displays the status of the compile.
A compile will generate OSH.
02/012007
113
DataStage Essentials v8
Highlight stage
with error
114
If an error occurs:
Click Show Error to identify the stage where the error occurred. This will
highlight the stage in error.
Click More to retrieve more information about the error. This can be lengthy
for parallel jobs.
Many errors also show up on the diagram if Show Stage Validation Errors
is turned on.
02/012007
114
DataStage Essentials v8
115
02/012007
115
DataStage Essentials v8
DataStage Director
Use to run and schedule jobs
View runtime messages
Can invoke directly from Designer
Tools > Run Director
116
As you know, you run your jobs in Director. You can open Director from within
Designer by clicking Tools>Run Director.
In a similar way, you can move between Director and Designer.
There are two methods for running a job:
Run it immediately.
Select the job in the Job Status view. The job must have been compiled.
Click Job>Run Now or click the Run Now button in the toolbar. The Job Run
Options window is displayed.
02/012007
116
DataStage Essentials v8
Run Options
Assign values to
parameter
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
117
The Job Run Options window is displayed when you click Job>Run Now.
This window allows you to stop the job after:
Click Run to run the job after it is validated. The Status column displays the status
of the job run.
02/012007
117
DataStage Essentials v8
Run Options
118
The Job Run Options window is displayed when you click Job>Run Now.
This window allows you to stop the job after:
Click Run to run the job after it is validated. The Status column displays the status
of the job run.
02/012007
118
DataStage Essentials v8
Schedule view
Log view
Select job to
view
messages in
the log view
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
119
02/012007
119
DataStage Essentials v8
Peek messages
120
Click the Log button in the toolbar to view the job log. The job log records
events that occur during the execution of a job.
These events include control events, such as the starting, finishing, and
aborting of a job; informational messages; warning messages; error
messages; and program-generated messages.
02/012007
120
DataStage Essentials v8
Message Details
121
02/012007
121
DataStage Essentials v8
122
02/012007
122
DataStage Essentials v8
Runs a job
Use run to run the job
Use param to specify parameters
In this example, dx444 is the name of the project
In this example, GenDataJob is the name of the job
123
02/012007
123
DataStage Essentials v8
Checkpoint
1. Which stage can be used to display output data in the job
log?
2. Which stage is used for documenting your job on the job
canvas?
3. What command is used to run jobs from the operating
system command line?
124
Notes:
Write down your answers here:
1.
2.
02/012007
124
DataStage Essentials v8
Checkpoint solutions
1. Peek stage
2. Annotation stage
3. dsjob -run
125
02/012007
125
DataStage Essentials v8
Unit summary
Having completed this unit, you should be able to:
Design a simple Parallel job in Designer
Define a job parameter
Use the Row Generator, Peek, and Annotation stages in a job
Compile your job
Run your job in Director
View the job log
126
Notes:
02/012007
126
DataStage Essentials v8
4.0.3
02/012007
127
DataStage Essentials v8
Unit objectives
After completing this unit, you should be able to:
Understand the stages for accessing different kinds of file
data
Sequential File stage
Data Set stage
Create jobs that read from and write to sequential files
Create Reject links
Work with NULLs in sequential files
Read from multiple files using file patterns
Use multiple readers
128
Notes:
02/012007
128
DataStage Essentials v8
Data Set
Complex flat file
129
Several stages handle sequential data. Each stage as both advantages and
differences from the other stages that handle sequential data.
Sequential data can come in a variety of types -- including both fixed length and
variable length.
02/012007
129
DataStage Essentials v8
130
02/012007
130
DataStage Essentials v8
Importing/Exporting Data
Internal format
Data import:
Data export
Internal format
131
When data is imported the imported operator translates that data into the framework
internal format. The export operator performs the reverse action.
02/012007
131
DataStage Essentials v8
132
02/012007
132
DataStage Essentials v8
Field 12
Field 13
, Last field
nl
Field 1
Field 12
Field 13
, Last field
, nl
133
This shows the format of one type of sequential file. Delimiters separate columns.
Fields all columns are defined my delimiters. Similarly, records are defined by
terminating characters.
02/012007
133
DataStage Essentials v8
134
02/012007
134
DataStage Essentials v8
Stream link
Reject link
(broken line)
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
135
The source Sequential stage has a stream output link and a reject output link. The
target Sequential stage has an input link and a reject, output link.
02/012007
135
DataStage Essentials v8
View data
136
The Columns tab has these features: Column definitions, View Data, Load
button. The Save button is used to save the column definitions into the
Repository as a new Table Definition.
02/012007
136
DataStage Essentials v8
Column names
in first row
137
If specified individually, you can make a list of files that are unrelated in name.
If you select read method and choose file pattern, you effectively select an
undetermined number of files.
Set the Report Progress property to Yes to show percent of processing during the
job run.
02/012007
137
DataStage Essentials v8
Format Tab
Record format
Column format
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
138
02/012007
138
DataStage Essentials v8
Use wild
cards
Select File
Pattern
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
139
02/012007
139
DataStage Essentials v8
140
This option allows you to read a single sequential file in parallel. However, the row
order is not maintained. Therefore if input rows need to be identified, this option can
only be used if the data itself provides a unique identifier. This works for both fixedlength and variable-length records.
02/012007
140
DataStage Essentials v8
Append /
Overwrite
141
02/012007
141
DataStage Essentials v8
Reject Link
Reject mode =
In a source stage
In a target stage
142
The sequential stage can have a single reject link. This is typically used when you
are writing to a file and provides a location where records that have failed to be
written to a file for some reason can be sent. When you are reading files, you can
use a reject link as a destination for rows that do not match the expected column
definitions.
02/012007
142
DataStage Essentials v8
Column
mappings
143
02/012007
143
DataStage Essentials v8
144
02/012007
144
DataStage Essentials v8
145
02/012007
145
DataStage Essentials v8
Nullable
column
Added
property
146
02/012007
146
DataStage Essentials v8
DataSet Stage
147
02/012007
147
DataStage Essentials v8
Data Set
Binary data file
Preserves partitioning
Component dataset files are written to each partition
Suffixed by .ds
Referred to by a header file
Managed by Data Set Management utility from GUI (Manager,
Designer, Director)
Represents persistent data
Key to good performance in set of linked jobs
No import / export conversions are needed
No repartitioning needed
Data file(s)
148
02/012007
148
DataStage Essentials v8
DataSet stage
properties
149
02/012007
149
DataStage Essentials v8
Display record
counts for each
data file
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
150
The window is available (data sets management) from Designer and Director. In
Designer, click Tools>Data Set Management to open this window.
02/012007
150
DataStage Essentials v8
151
02/012007
151
DataStage Essentials v8
Partitioned
Implemented with header file and data files
152
Number of raw data files depends on: the configuration file more on configuration
files later.
02/012007
152
DataStage Essentials v8
Checkpoint
1. List three types of file data
2. What makes datasets perform better than other types of
files in parallel jobs
3. What is the difference between a data set and a file set?
153
Notes:
Write down your answers here:
1.
2.
02/012007
153
DataStage Essentials v8
Checkpoint solutions
1. Sequential, dataset, complex flat files
2. They are partitioned and they store data in the native
parallel format
3. Both are partitioned. Data sets store data in a binary
format not readable by user applications. File sets are
readable.
154
02/012007
154
DataStage Essentials v8
Unit summary
Having completed this unit, you should be able to:
Understand the stages for accessing different kinds of file
data
Sequential File stage
Data Set stage
Create jobs that read from and write to sequential files
Create Reject links
Work with NULLs in sequential files
Read from multiple files using file patterns
Use multiple readers
155
Notes:
02/012007
155
DataStage Essentials v8
156
02/012007
156
DataStage Essentials v8
4.0.3
02/012007
157
DataStage Essentials v8
Unit objectives
After completing this unit, you should be able to:
Describe parallel processing architecture
Describe pipeline parallelism
Describe partition parallelism
List and describe partitioning and collecting algorithms
Describe configuration files
Describe the parallel job compilation process
Explain OSH
Explain the Score
158
Notes:
02/012007
158
DataStage Essentials v8
Scalable processing:
Add more resources (CPUs and disks) to increase system
performance
159
Parallel processing is the key to building jobs that are highly scalable.
The parallel engine uses the processing node concept. A processing node
is a CPU or an SMP, or a board on an MPP.
02/012007
159
DataStage Essentials v8
Single CPU
Dedicated memory
& disk
SMP
Multi-CPU (2-64+)
Shared memory &
disk
GRID / Clusters
MPP
2 1000s of CPUs
160
02/012007
160
DataStage Essentials v8
Pipeline Parallelism
Advantages:
Reduces disk usage for staging areas
Keeps processors busy
161
02/012007
161
DataStage Essentials v8
Partition Parallelism
Divide the incoming stream of data into subsets to be
separately processed by an operation
Subsets are called partitions (nodes)
162
02/012007
162
DataStage Essentials v8
Three-Node Partitioning
Node 1
Operation
subset1
Node 2
subset2
Data
Operation
subset3
Node 3
Operation
163
02/012007
163
DataStage Essentials v8
The method to use to distribute the data into the available nodes
The method to use to collect the data from the nodes into a single node
Configuration file
164
02/012007
164
DataStage Essentials v8
165
Much of the parallel processing paradigm is hidden from the programmer. The
programmer simply designates the process flow, as shown in the upper portion of
this diagram. The parallel engine, using the definitions in the configuration file, will
actually execute processes that are partitioned and parallelized, as illustrated in the
bottom portion.
02/012007
165
DataStage Essentials v8
Configuration File
Configuration file separates configuration (hardware / software) from
job design
Defines nodes with their resources (need not match physical CPUs)
Optimizes overall throughput and matches job characteristics to overall hardware resources
Allows runtime constraints on resource usage on a per job basis
166
DataStage jobs can point to different configuration files by using job parameters.
Thus, a job can utilize different hardware architectures without being recompiled.
It can pay to have a 4-node configuration file running on a 2 processor box, for
example, if the job is resource bound. We can spread disk I/O among more
controllers.
02/012007
166
DataStage Essentials v8
node "n1" {
fastname "s1"
pool "" "n1" "s1" "app2" "sort"
resource disk "/orch/n1/d1" {}
resource disk "/orch/n1/d2" {"bigdata"}
resource scratchdisk "/temp" {"sort"}
}
node "n2" {
fastname "s2"
pool "" "n2" "s2" "app1"
resource disk "/orch/n2/d1" {}
resource disk "/orch/n2/d2" {"bigdata"}
resource scratchdisk "/temp" {}
}
node "n3" {
fastname "s3"
pool "" "n3" "s3" "app1"
resource disk "/orch/n3/d1" {}
resource scratchdisk "/temp" {}
}
node "n4" {
fastname "s4"
pool "" "n4" "s4" "app1"
resource disk "/orch/n4/d1" {}
resource scratchdisk "/temp" {}
}
Key points:
1.
2.
3.
Advanced resource
optimizations and configuration
(named pools, database, SAS)
}
167
This example shows a typical configuration file. Pools can be applied to nodes or
other resources. Note the curly braces following some disk resources.
Following the keyword node is the name of the node (logical processing unit).
The order of resources is significant. The first disk is used before the second, and
so on.
Keywords, such as sort and bigdata, when used, restrict the signified processes
to the use of the resources that are identified. For example, sort restricts sorting
to node pools and scratchdisk resources labeled sort.
Database resources (not shown here) can also be created that restrict database
access to certain nodes.
Question: Can objects be constrained to CPUs? No, a request is made to the
operating system and the operating system chooses the CPU.
02/012007
167
DataStage Essentials v8
168
02/012007
168
DataStage Essentials v8
169
02/012007
169
DataStage Essentials v8
Round robin
Random
Hash: Determine partition based on key value
Modulus
Entire: Send all rows down all partitions
Same: Preserve the same partitioning
Auto: Let DataStage choose the algorithm
Read in by key
Presumes data is sorted by the key in each partition
Builds a single sorted stream based on the key
Sort Merge
Ordered
170
02/012007
170
DataStage Essentials v8
Round Robin
Random
Entire
Same
Example: Key is State. All CA rows go into the same partition; all
MA rows go in the same partition. Two rows of the same state never
go into different partitions
171
02/012007
171
DataStage Essentials v8
8 7 6 5 4 3 2 1 0
Round Robin
6
3
0
7
4
1
8
5
2
172
02/012007
172
DataStage Essentials v8
ENTIRE Partitioning
Keyless
8 7 6 5 4 3 2 1 0
ENTIRE
.
3
2
1
0
.
3
2
1
0
.
3
2
1
0
173
02/012007
173
DataStage Essentials v8
HASH Partitioning
Keyed partitioning
method
Rows are distributed
according to the values
in key columns
HASH
Keyed
0
3
0
3
1
1
1
2
2
2
distributed
174
02/012007
174
DataStage Essentials v8
Modulus Partitioning
Keyed partitioning method
Rows are distributed according
to the values in one integer key
column
Keyed
Values of key column
0 3 2 1 0 2 3 2 1 1
Uses modulus
MODULUS
0
3
0
3
1
1
1
2
2
2
175
02/012007
175
DataStage Essentials v8
Auto Partitioning
DataStage inserts partition components as necessary to
ensure correct results
Before any stage with Auto partitioning
Generally chooses ROUND-ROBIN or SAME
Inserts HASH on stages that require matched key values
(e.g. Join, Merge, Remove Duplicates)
Inserts ENTIRE on Normal (not Sparse) Lookup reference
links
176
Auto generally chooses Round Robin when going from sequential to parallel. It
generally chooses Same when going from parallel to parallel.
Since DataStage has limited awareness of your data and business rules, explicitly
specify HASH partitioning when needed, that is, when processing requires groups of
related records.
02/012007
176
DataStage Essentials v8
Partition imbalances
If all the records are going down only one of the nodes,
then the job is in effect running sequentially
177
02/012007
177
DataStage Essentials v8
Address
Ford
Henry
66 Edison Avenue
Ford
Clara
66 Edison Avenue
Ford
Edsel
7900 Jefferson
Ford
Eleanor
7900 Jefferson
Dodge
Horace
17840 Jefferson
Dodge
John
75 Boston Boulevard
Ford
Henry
4901 Evergreen
Ford
Clara
4901 Evergreen
Ford
Edsel
1100 Lakeshore
10
Ford
Eleanor
1100 Lakeshore
ID
LName
FName
Address
Dodge
Horace
17840 Jefferson
Dodge
John
75 Boston Boulevard
Partition 1
LName
Part 0
Source Data
ID
ID
LName
FName
Address
Ford
Henry
66 Edison Avenue
Ford
Clara
66 Edison Avenue
Ford
Edsel
7900 Jefferson
Ford
Eleanor
7900 Jefferson
Ford
Henry
4901 Evergreen
Ford
Clara
4901 Evergreen
Ford
Edsel
1100 Lakeshore
10
Ford
Eleanor
1100 Lakeshore
178
02/012007
178
DataStage Essentials v8
Partitioning icon
Collecting icon
179
02/012007
179
DataStage Essentials v8
SAME partitioner
Re-partition
watch for this!
AUTO partitioner
180
02/012007
180
DataStage Essentials v8
Partitioning Tab
Input tab
Key
specification
Algorithms
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
181
02/012007
181
DataStage Essentials v8
Collecting Specification
Key
specification
Algorithms
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
182
02/012007
182
DataStage Essentials v8
Runtime Architecture
183
02/012007
183
DataStage Essentials v8
Compile
DataStage server
Executable
Job
Gene
rated
OSH
C++ fo
r
Trans each
forme
r
Transformer
Components
184
Compilation generates OSH (Orchestrate Shell Script) and also C++ code for any
Transformer stages used.
For each Transformer, DataStage builds a C++ Framework operator that is then
compiled. So when a job contains a Transformer, it takes longer to compile (but not
to run).
02/012007
184
DataStage Essentials v8
Generated OSH
Enable viewing of generated
OSH in Administrator:
Comments
Operator
Schema
- Job
properties
- Job run log
- View Data
- Table Defs
185
You can view generated OSH through DataStage Designer. This provides an
overview of the OSH that will be generated. It is important to note, however, that
this OSH will go through some changes for optimization and execution.
02/012007
185
DataStage Essentials v8
Source: import
Target: export
DataSet: copy
Sort: tsort
Aggregator: group
Row Generator, Column Generator, Surrogate Key
Generator: generator
Oracle
Source: oraread
Sparse Lookup: oralookup
Target Load: orawrite
Target Upsert: oraupsert
186
The stages on the diagram do not necessarily map one-to-one to operators. For
example, the Sequential stage when used as a source is mapped to the import
operator. When used as a target it is mapped to the export operator.
The converse is also true. Different stages can be mapped to a single operator.
For example, Row Generator and Column Generator are both mapped to the
generator operator.
02/012007
186
DataStage Essentials v8
Syntax
####################################################
#### STAGE: Row_Generator_0
## Operator
generator
## Operator options
-schema record
(
a:int32;
b:string[max=12];
c:nullable decimal[10,2] {nulls=10};
)
-records 50000
## General options
[ident('Row_Generator_0'); jobmon_ident('Row_Generator_0')]
## Outputs
0> [] 'Row_Generator_0:lnk_gen.v'
;
Virtual dataset is
used to connect
output of one
operator to input of
another
####################################################
#### STAGE: SortSt
## Operator
tsort
## Operator options
-key 'a'
-asc
## General options
[ident('SortSt'); jobmon_ident('SortSt'); par]
## Inputs
0< 'Row_Generator_0:lnk_gen.v'
## Outputs
0> [modify (
keep
a,b,c;
)] 'SortSt:lnk_sorted.v'
;
187
02/012007
187
DataStage Essentials v8
Job Score
Generated from OSH and configuration file
Think of Score as in musical score, not game score
188
02/012007
188
DataStage Essentials v8
Set $APT_DUMP_SCORE
to output the Score to the
job log
To identify the Score dump,
look for main program:
This step
Virtual
datasets
Operators with
node
assignments
189
02/012007
189
DataStage Essentials v8
Processing Node
SL
Players
The actual processes associated with stages
Combined players: one process only
Sends stderr, stdout to Section Leader
SL
Processing Node
190
The conductor node has the start-up process. Creates the score. Starts up section
leaders.
Section leaders communicate with the conductor only. The Conductor
communicates with the players.
02/012007
190
DataStage Essentials v8
Section Leader,0
Section Leader,1
generator,0
copy,0
Section Leader,2
generator,1
copy,1
generator,2
copy,2
191
Every player has to be able to communicate with every other player. There are
separate communication channels (pathways) for control, messages, errors, and
data. Note that the data channel does not go through the section leader/conductor,
as this would limit scalability. Data flows directly from upstream operator to
downstream operator using APT_Communicator class.
02/012007
191
DataStage Essentials v8
Checkpoint
1. What file defines the degree of parallelism a job runs
under?
2. Name two partitioning algorithms that partition based on
key values?
3. What partitioning algorithm produces the most even
distribution of data in the various partitions?
192
Notes:
Write down your answers here:
1.
2.
02/012007
192
DataStage Essentials v8
Checkpoint solutions
1. Configuration file
2. Hash, Modulus, DB2
3. Round robin. Or Entire
193
02/012007
193
DataStage Essentials v8
Unit summary
Having completed this unit, you should be able to:
Describe parallel processing architecture
Describe pipeline parallelism
Describe partition parallelism
List and describe partitioning and collecting algorithms
Describe configuration files
Describe the parallel job compilation process
Explain OSH
Explain the Score
194
Notes:
02/012007
194
DataStage Essentials v8
4.0.3
02/012007
195
DataStage Essentials v8
Unit objectives
After completing this unit, you should be able to:
Combine data using the Lookup stage
Define range lookups
Combine data using Merge stage
Combine data using the Join stage
Combine data using the Funnel stage
196
Notes:
02/012007
196
DataStage Essentials v8
Combining Data
Ways to combine data:
Horizontally:
Vertically:
One input link, one output link combining groups of related records
into a single record
Aggregator
Remove Duplicates
197
This module discusses the horizontal way of merging data. The next module
discusses the vertical way.
02/012007
197
DataStage Essentials v8
198
02/012007
198
DataStage Essentials v8
Lookup Stage
199
02/012007
199
DataStage Essentials v8
Lookup Features
Reject link
Indexed by key
Should be small enough to fit into physical memory
200
02/012007
200
DataStage Essentials v8
Lookup Types
Equality match
Match exactly values in the lookup key column of the reference link to
selected values in the source row
Return row or rows (if multiple matches are to be returned) that match
Caseless match
Like an equality match except that its caseless
E.g., abc matches AbC
201
02/012007
201
DataStage Essentials v8
Lookup Example
Reference link
Source (stream)
link
202
02/012007
202
DataStage Essentials v8
Lookup
constraints
Mappings to
output columns
Reference key
column and source
mapping
Column
definitions
203
For an equality or caseless match lookup, one or more columns in the reference link
are selected as keys (see lower left panel). Columns from the source link are
matched to the key columns using drag and drop. To specify an equality match,
select the equal sign (=) from the Key Type box of the reference link panel. To
specify a caseless match, select the Caseless from the Key Type box of the
reference link panel.
Output columns are specified in the top, right panel. Columns from the source and
reference link are dragged to the front of these columns to specify the values to be
mapped to the output columns.
02/012007
203
DataStage Essentials v8
Drop
Input row is dropped
Continue
Input row is transferred to the output. Reference link columns are
filled with null or default values
Reject
Input row sent to a reject link
This requires that a reject link has been created for the stage
204
02/012007
204
DataStage Essentials v8
Select to return
multiple rows
Select lookup
Select action failure action
205
Click the Constraints icon (top, second from left) to open this window.
02/012007
205
DataStage Essentials v8
Source link
Revolution
1789
1776
Citizen
Lefty
M_B_Dextrous
Reference link
Citizen
M_B_Dextrous
Righty
Exchange
Nasdaq
NYSE
Lookup key
column
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
206
02/012007
206
DataStage Essentials v8
Lookup Stage
Output of Lookup with Continue option
Revolution
1789
1776
Citizen
Lefty
M_B_Dextrous
Exchange
Nasdaq
Empty string
or NULL
Citizen
M_B_Dextrous
Exchange
Nasdaq
207
For the first source row (1789), the lookup fails to find a match. Since Continue is
the lookup failure option, the row is output. The Exchange column is populated with
NULL (if the column is nullable) or a default value, empty string (if the column is not
nullable).
For the second source row (1776), the lookup finds a match, so the Exchange
column gets a value from the lookup file.
When Drop is the lookup failure action, the first, unmatched row is dropped.
02/012007
207
DataStage Essentials v8
208
02/012007
208
DataStage Essentials v8
Reference link
Lookup stage
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
209
02/012007
209
DataStage Essentials v8
Reference
range values
Retrieve
description
Source values
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
210
02/012007
210
DataStage Essentials v8
Reference link
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
211
Select the column in the source link that contains the value to match to the range on
the reference link.
02/012007
211
DataStage Essentials v8
Select
operators
Select range
columns
212
02/012007
212
DataStage Essentials v8
Retrieve other
column values
Source range
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
213
02/012007
213
DataStage Essentials v8
Reference link
key column
Range type
214
02/012007
214
DataStage Essentials v8
Select range
columns
215
02/012007
215
DataStage Essentials v8
Join Stage
216
02/012007
216
DataStage Essentials v8
Join Stage
Inner
Left outer
Right outer
Full outer
Input links must be sorted
Light-weight
217
02/012007
217
DataStage Essentials v8
Join stage
218
02/012007
218
DataStage Essentials v8
Column to
match
Select left
input link
Select if multiple
columns make up
the join key
219
02/012007
219
DataStage Essentials v8
Citizen
Lefty
M_B_Dextrous
Exchange
Nasdaq
NYSE
Join key
column
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
220
02/012007
220
DataStage Essentials v8
Inner Join
Transfers rows from both data sets whose
key columns have matching values
Output of inner join on key Citizen
Revolution
1776
Citizen
M_B_Dextrous
Exchange
Nasdaq
221
02/012007
221
DataStage Essentials v8
Citizen
Lefty
M_B_Dextrous
Exchange
Nasdaq
Null or default
value
222
02/012007
222
DataStage Essentials v8
Transfers all values from the right link and transfers values
from the left link only where key columns match.
Revolution
1776
Citizen
M_B_Dextrous
Righty
Exchange
Nasdaq
NYSE
Null or default
value
223
02/012007
223
DataStage Essentials v8
Merge Stage
224
02/012007
224
DataStage Essentials v8
Merge Stage
Light-weight
225
02/012007
225
DataStage Essentials v8
Merge stage
Capture update
link non-matches
Copyright IBM Corporation 2007
The course materials are provided for internal use only by the individual receiving the materials.
The materials may not be modified, copied, distributed or transferred without the express prior written consent of IBM.
226
02/012007
226
DataStage Essentials v8
Match key
Keep or drop
unmatched
masters
227
Unmatched updates are captured by adding additional reject links (one for each
update link).
02/012007
227
DataStage Essentials v8
Model
Memory usage
# and names of Inputs
Mandatory Input Sort
Duplicates in primary input
Duplicates in secondary input(s)
Options on unmatched primary
Options on unmatched secondary
On match, secondary entries are
# Outputs
Captured in reject set(s)
Joins
Lookup
Merge
RDBMS-style relational
light
Master -Update(s)
light
1 Source, N LU Tables
2 or more: left, right
all inputs
no
OK
OK
OK
W arning!
Keep (left outer), Drop (Inner) [fail] | continue | drop | reject
Keep (right outer), Drop (Inner)
NONE
captured
captured
1
Nothing (N/A)
1 out, (1 reject)
unmatched primary entries
1 Master, N Update(s)
all inputs
W arning!
OK only when N = 1
[keep] | drop
capture in reject set(s)
consumed
1 out, (N rejects)
unmatched secondary entries
228
02/012007
228
DataStage Essentials v8
Funnel Stage
229
02/012007
229
DataStage Essentials v8
Continuous
Records are combined in no particular order
Sort Funnel
Sequence:
Outputs all records from the first input link, then all
from the second input link, and so on
230
02/012007
230
DataStage Essentials v8
231
02/012007
231
DataStage Essentials v8
Checkpoint
1. Name three stages that horizontally join data?
2. Which stage uses the least amount of memory? Join or
Lookup?
3. Which stage requires that the input data is sorted? Join or
Lookup?
232
Notes:
Write down your answers here:
1.
2.
02/012007
232
DataStage Essentials v8
Checkpoint solutions
1. Lookup, Merge, Join
2. Lookup
3. Join
233
02/012007
233
DataStage Essentials v8
Unit summary
Having completed this unit, you should be able to:
Combine data using the Lookup stage
Define range lookups
Combine data using Merge stage
Combine data using the Join stage
Combine data using the Funnel stage
234
Notes:
02/012007
234