
A Practical Introduction to Ab Initio Software: Part 1

24 August 2007

Course Structure

Day 1

Part 1: Basic Concepts and DML
Part 2: Building Applications & Parallelism

Day 2

Part 3: Parallel Topics
Database Connectivity (Optional)

Finger Exercises

Intermediate Exercises

What Does Ab Initio Mean?


"Ab Initio" is Latin for "From the Beginning."
From the beginning, our software was designed to support a complete
range of business applications, from the simplest to the most complex.
Crucial capabilities like parallelism and checkpointing can't be added
after the fact.
The Graphical Development Environment and a powerful set of
components allow our customers to get valuable results from the
beginning.

Ab Initio's Focus

Moving Data
move small and large volumes of data in an efficient manner
deal with the complexity associated with business data
High Performance
scalable solutions
Better productivity

Ab Initio Software
Ab Initio software is a general-purpose data processing platform
for mission-critical applications such as:
Data warehousing
Batch processing
Click-stream analysis
Real Time Applications
Data movement
Data transformation

Parallel Computer Architecture


Computers come in many shapes and sizes:
Single-CPU, Multi-CPU
Network of single-CPU nodes
Network of multi-CPU nodes

Multi-CPU machines are often called SMPs (for Symmetric Multiprocessors).

Specially built networks of machines are often called MPPs (for
Massively Parallel Processors).

A Multi-CPU Computer (SMP)

A Network of Multi-CPU Nodes

A Network of Networks

Ab Initio Provides For:


Distribution - a platform for applications to execute across a
collection of processors within the confines of a single machine
or across multiple machines.
Reduced Run Time Complexity - the ability for applications to run
in parallel on any combination of computers where the Ab Initio
Co>Operating System is installed from a single point of control.

Applications of Ab Initio Software


Processing just about any form and volume of data.

Parallel sort/merge processing.

Data transformation.

Rehosting of corporate data.

Parallel execution of existing applications.

Applications of Ab Initio Software


Front end of Data Warehouse:
Transformation of disparate sources
Aggregation and other preprocessing
Referential integrity checking
Database loading

Back end of Data Warehouse:


Extraction for external processing
Aggregation and loading of Data Marts

Ab Initio Product Architecture


[Figure: layered product architecture, top layer first]

User Applications
Development Environments: GDE, Shell
Component Library | User-defined Components | 3rd Party Components | Ab Initio EME
The Ab Initio Co>Operating System
Native Operating System (Unix, Windows, OS/390)

Co>Operating System Services


Parallel and distributed application execution
Control
Data Transport

Transactional semantics at the application level.


Checkpointing.
Monitoring and debugging.
Parallel file management.
Metadata-driven components.

The Graph Model

The Graph Model: Naming the Pieces

[Figure: a graph with its components, datasets, and flows labeled]

The Graph Model: Some Details

[Figure: a graph with its ports, record format metadata, and expression metadata labeled]

Components
Components may run on any computer running the Co>Operating
System.

Different components do different jobs.

The particular work a component accomplishes depends upon its
parameter settings.

Some parameters are data transformations, that is, business rules to be
applied to one or more inputs to produce a required output.

Datasets
A dataset is a source or destination of data. It can be a simple file, a
database table, a SAS dataset, ...
Datasets may reside on any machine running the Co>Operating
System.
Datasets may reside on other machines if connected by FTP or
database middleware.
Data is always described by record format metadata (termed dml).

Dataset: Records and Fields

A dataset is made up of records; a record consists of fields.
Analogous database terms are rows and columns.

Records (one per line), with fields in columns:

0345  John   Smith
0212  Sam    Spade
0322  Elvis  Jones
0492  Sue    West
0121  Mary   Forth
0221  Bill   Black

Sources of Record Format Metadata


Record formats can be generated from:
Database catalogs
COBOL copybooks
Other third-party products
SAS datasets

One can always resort to manual entry!

A Sandbox Environment
Setting up a standard working environment helps a development
team work together.

The Sandbox capability allows an application to be designed to
be trivially portable.

Managing the Sandbox contents is a project administration function.

Sandbox Parameters

Start the Ab Initio GDE


Open mp/figure-01.mp
Go to Project-Edit Sandbox...

Environment Quick Overview


$AI_RUN - run directory
$AI_DML - record format files
$AI_XFR - transform files
$AI_MP - graphs
$AI_DB - database config files
$AI_SERIAL - serial source data, other serial data
$AI_MFS - Ab Initio multifile directory; in training it will also
contain partition directories (more about this later!)
$AI_LOG - a location to place logging files, etc.

Environment Overview
We will make use of environment variables (shortcuts, parameters)
during class.

The goal is to have a development environment which enables
the migration of a graph or set of graphs to any other
environment with absolutely no changes.
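
As a rough illustration of why parameterized paths make a graph portable, the Python sketch below resolves one path template against two environments. It is an analogy only, not how the Co>Operating System resolves sandbox parameters; the directory values and the file name customers.dml are invented for the example.

```python
from string import Template

# Hypothetical sandbox settings for two environments (all values are invented).
DEV = {"AI_DML": "/home/dev/sand/dml", "AI_SERIAL": "/data/dev/serial"}
PROD = {"AI_DML": "/apps/prod/dml", "AI_SERIAL": "/data/prod/serial"}


def resolve(path_template: str, env: dict) -> str:
    """Replace $NAME references in a path with the values found in env."""
    return Template(path_template).substitute(env)


# The graph stores only the template, so it does not change between environments.
record_format_path = "$AI_DML/customers.dml"

print(resolve(record_format_path, DEV))
print(resolve(record_format_path, PROD))
```

Only the environment settings differ between the two calls; the path stored with the graph stays the same.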

Viewing Component Properties

Double-click on a component to bring up its Properties page.

Viewing Port Properties

Click on the Ports tab to view the port properties.

Record Format Metadata in Graphical Form

[Figure: the sample records displayed with their record format]

0345  John   Smith
0212  Sam    Spade
0322  Elvis  Jones
0492  Sue    West
0121  Mary   Forth
0221  Bill   Black

Editing Types in GDE


Don't save when exiting.

Field name

Field type

Field length

The Record Format Metadata in text form

record
decimal(4) id;
string(6) first_name;
string(6) last_name;
string(5) newfield;
end

Field Names
Names consist of letters, digits, and underscores:
a-z, A-Z, 0-9, _
Note: No spaces, hyphens, $s, #s, %s

Case matters! ABC and abc are different!

Some words are reserved (record, end, date, ...)
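
The naming rules above can be sketched as a small Python check. This is an illustration, not an Ab Initio tool: the function name is made up, the reserved-word set contains only the three words named on the slide, and any rule the slide does not state (for example, about the first character) is not checked.

```python
import re

# Only the reserved words named on the slide; the real list is longer.
RESERVED = {"record", "end", "date"}

# Letters, digits, and underscores only (no spaces, hyphens, $, #, or %).
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_field_name(name: str) -> bool:
    """Return True if name uses only allowed characters and is not reserved."""
    return NAME_PATTERN.fullmatch(name) is not None and name not in RESERVED


for candidate in ["first_name", "first-name", "pur date", "date", "amt2"]:
    print(candidate, is_valid_field_name(candidate))
```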

Field Type and Field Length


There are several built-in types available via the drop-down menu. This
course uses three types: string, decimal (for all numbers), and date.
A date type requires a format specifier that is an exact representation
of the date (e.g., MM-DD-YYYY).
A field length is either a number for fixed-length fields, or the delimiter
that terminates the field for variable-length fields.
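
To make the difference between fixed-length and delimited fields concrete, here is a minimal Python sketch that splits one record using a list of (name, length) pairs, where the length is either a number or a delimiter string. The function and the sample delimited record are illustrative assumptions; this is not how the Co>Operating System reads data.

```python
def parse_record(data: str, fields: list) -> dict:
    """Split one record into named fields.

    Each entry in fields is (name, length): an int means a fixed-length
    field, a string means the delimiter that terminates the field.
    """
    record, pos = {}, 0
    for name, length in fields:
        if isinstance(length, int):
            end = pos + length
            record[name] = data[pos:end]
            pos = end
        else:
            end = data.index(length, pos)  # ValueError if the delimiter is missing
            record[name] = data[pos:end]
            pos = end + len(length)        # skip over the delimiter itself
    return record


# Fixed-length: decimal(4) id; string(6) first_name; string(6) last_name;
fixed = [("id", 4), ("first_name", 6), ("last_name", 6)]
print(parse_record("0345John  Smith ", fixed))

# Delimited: a comma ends id, a newline ends last_name.
delimited = [("id", ","), ("last_name", "\n")]
print(parse_record("0345,Smith\n", delimited))
```

Fixed-length fields keep their padding spaces, while delimited fields contain only the characters before the delimiter.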

What Data Can Be Described?


There are both fixed-size and variable-length types.
ASCII, EBCDIC, UNICODE character sets are supported.
Supported types can represent strings, numbers, binary
numbers, packed decimals, dates
Complex data formats can consist of nested records, vectors, ...

Access to Field Characteristics


Some aspects of field descriptions (e.g., date formats) must be
accessed via the attribute pane.
To see additional attributes, use the Attributes item on the
Record Format Editor's View menu or use the Attributes button.

More Record Format Editing

View Attributes.

Length can be delimiter string

Field Type drop-down

Date format goes here

Text Record Format for Date Field

record
decimal(4) id;
string(6) first_name;
string(6) last_name;
date("YYYY-DD-MM") newfield;
end;

Expressions in DML
Computations are expressed in the algebraic syntax of C, Pascal, etc.
Field names act as variables.
Arithmetic operators: +, -, *, ...
Comparison operators: >, <, ==, !=, ...
Many built-in functions: string_concat, string_trim, today,
date_day_of_week, ...
(See the Data Manipulation Language Reference for more information on
expressions and built-in functions.)

Viewing Data (mp/figure-01.mp)

1. Right click on dataset.

2. Select View Data...

The View Data Panel

Evaluating Expressions from View Data

Type in an expression...
or use the expression editor

Expression Editor
Fields

Functions

Expression text

Operators

Exercise 1: Writing DML


Open mp/ex1.mp
The data file ex1.dat contains these lines:
Smith,John,1992.02.23,2400
Jones,Jane,1993.10.29,320
Warren,Jake,1994.11.02,9045

Use the Record Format Editor (New) to create a description of this data:
lastname, firstname, pur_date, and amt. Then use View Data to verify
the description is correct.
Hint: Newline delimiters are written: \n

Simple Components

In these components the record format metadata does not
change from input to output.

The Filter by Expression Component


For each record on the input port, the select_expr parameter is
evaluated. If select_expr evaluates to true (non-zero), the input record is
written to the out port exactly as it was read.

If select_expr evaluates to false (zero), the record is written to the
deselect port.

The out port, which carries the records meeting the select_expr
criteria, must be connected downstream.

Connecting the deselect port is optional.
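
The routing behavior described above can be sketched in Python as follows. This is an analogy, not the component's implementation; the function name, the sample records, and the threshold in the expression are chosen only for illustration.

```python
def filter_by_expression(records, select_expr):
    """Route each record, unchanged, to out or deselect."""
    out, deselect = [], []
    for rec in records:
        # A true (non-zero) result sends the record to out, otherwise to deselect.
        (out if select_expr(rec) else deselect).append(rec)
    return out, deselect


records = [
    {"id": 345, "last_name": "Smith"},
    {"id": 212, "last_name": "Spade"},
    {"id": 121, "last_name": "Forth"},
]

out, deselect = filter_by_expression(records, lambda rec: rec["id"] > 300)
print([rec["id"] for rec in out])
print([rec["id"] for rec in deselect])
```

Every input record ends up on exactly one of the two outputs, and its content is not modified.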

Filter Data (Selection)

(figure-02)

1. Push Run button.

2. View monitoring information.

3. View output data.

Expression Parameter

Exercise 2: Data Filtering (Selection)


Using example graph figure-02.mp, change the select expression
parameter of the Filter by Expression component to select
records with id greater than 215.

Run the application and examine the resulting data.

Keys
A key identifies a single field or set of fields (a composite key) used to
organize a dataset in some way.
Single field:

{id}

Multiple field:

{last_name; first_name}

Modifiers:

{id descending}

Used for sorting, grouping, partitioning.


(See the Data Manipulation Language Reference for more information on
keys. Note: keys are also called collators.)
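
For readers more familiar with general-purpose languages, the three key forms above correspond roughly to the following Python sort calls. This is only an analogy: the sample records are taken from the earlier slides, and Python's ordering of strings may differ from the collation the Co>Operating System applies.

```python
people = [
    {"id": 345, "first_name": "John", "last_name": "Smith"},
    {"id": 212, "first_name": "Sam", "last_name": "Spade"},
    {"id": 322, "first_name": "Elvis", "last_name": "Jones"},
    {"id": 492, "first_name": "Sue", "last_name": "West"},
]

# {id}: a single-field key
by_id = sorted(people, key=lambda rec: rec["id"])

# {last_name; first_name}: a composite key, compared field by field
by_name = sorted(people, key=lambda rec: (rec["last_name"], rec["first_name"]))

# {id descending}: a key with a modifier
by_id_desc = sorted(people, key=lambda rec: rec["id"], reverse=True)

print([rec["id"] for rec in by_id])
print([rec["last_name"] for rec in by_name])
print([rec["id"] for rec in by_id_desc])
```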

The Sort Component


Reads records from input port, sorts them by key, and writes the
result on the output port.

Sorting (mp/figure-03.mp)

Sorting - The Key Specifier Editor

Exercise 3: Sorting
Using example graph figure-03.mp, change the key parameter of
the Sort component to sort the data by first_name.

Run the application and examine the resulting data.

More Complex Components


In these components the record
format metadata typically changes
(goes through a transformation)
from input to output

Data Transformation

Input record:

0345,090263John,Smith;

Input record format:
record
  decimal(",") id;
  date("MMDDYY") bday;
  string(",") first_name;
  string(";") last_name;
end

Reformat:
  Drop: first_name
  Reorder: last_name moves ahead of bday
  id + 1000000

Output record format:
record
  decimal(7) id;
  string(8) last_name;
  date("YYYY.MM.DD") bday;
end

Output record:

1000345Smith   1963.09.02
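
The drop, reorder, and reformat steps in this example can be traced with a short Python sketch. It is an illustration of the data movement only, not DML; the function name is made up, and the century for the two-digit year is assumed to be 1900, which matches the result shown on the slide.

```python
def reformat(line: str) -> str:
    """Convert the delimited input record into the fixed-length output record."""
    id_text, rest = line.split(",", 1)     # id ends at the first comma
    bday, rest = rest[:6], rest[6:]        # bday is six characters, MMDDYY
    first_name, rest = rest.split(",", 1)  # first_name is read, then dropped
    last_name = rest.split(";", 1)[0]      # last_name ends at the semicolon

    month, day, year = bday[0:2], bday[2:4], bday[4:6]
    new_id = int(id_text) + 1000000

    # Assumption: two-digit years belong to the 1900s, as in the slide's result.
    # Output layout: 7-digit id, last_name padded to 8, date as YYYY.MM.DD.
    return f"{new_id:07d}{last_name:<8}19{year}.{month}.{day}"


print(reformat("0345,090263John,Smith;"))
```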

The Reformat Component (mp/figure-04.mp)


Reads records from input port, reformats each according to a
transform function (optional in the case of the Reformat
Component), and writes the result records to the output (out0) port.
Additional output ports (out1, ...) can be created by adjusting the
count parameter.

Transformation Functions
A transform function specifies the business rules used to create
the output record.

Each field of the output record must successfully be assigned a
value. Partial output records are not allowed!

The Transform Editor is used to create a transform function in a
graphical manner.

The Transform Function Editor

Text DML: Transform Function Syntax

Transform functions look like:

output-variables :: name ( input-variables ) =
begin
  assignments;
end;

Assignments look like:

output-variable.field :: expression;

(See the Data Manipulation Language Reference for more information on
transform functions.)

The Transform Function in Text Format

out :: reformat(in) =
begin
  out.id :: in.id + 1000000;
  out.last_name :: string_concat("Mac", in.last_name);
end;

A Look Inside the Reformat Component

Input fields: a b c        Output fields: x y z

out :: trans(in) =
begin
  out.x :: in.b - 1;
  out.y :: in.a;
  out.z :: fn(in.c);
end;

1. A record arrives at the input port: 9 45 QF
2. The record is read into the component.
3. The transformation function is evaluated.
4. Since every rule within the transform function is successful, a
   result record is issued: 44 9 RG
5. The result record is written to the output port of the component.
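
The walkthrough above can be replayed in Python. This is a sketch under assumptions: the slides do not define fn, so the version here (shift each letter forward by one) is a hypothetical stand-in chosen because it turns QF into RG; returning None stands for "no result record is issued".

```python
def fn(text: str) -> str:
    """Hypothetical stand-in for the slide's fn: shift each letter forward by one."""
    return "".join(chr(ord(ch) + 1) for ch in text)


def trans(rec: dict):
    """Apply the three rules; return None if any rule cannot be evaluated."""
    try:
        return {
            "x": rec["b"] - 1,   # out.x :: in.b - 1;
            "y": rec["a"],       # out.y :: in.a;
            "z": fn(rec["c"]),   # out.z :: fn(in.c);
        }
    except (KeyError, TypeError):
        return None              # a failed rule means no partial record is written


print(trans({"a": 9, "b": 45, "c": "QF"}))
```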

Exercise 4: Reformat Data


Using graph figure-04.mp, change the record format metadata of the
Simple-Out dataset to add a new field called name of type string(20).

Add a business rule to the existing transform function to populate


name by concatenating first_name and last_name using string_concat.

Run the graph and examine the results.

Then modify the transform to trim the spaces from the first name before
concatenating it with the last name, to get "John Smith" rather than
"John   Smith".

Data Aggregation

Input:

id    name   city      amount
0345  Smith  Bristol   56
0212  Spade  London     8
0322  Jones  Compton   12
0492  West   London    23
0121  Forth  Bristol    7
0221  Black  New York  42

Output (amount summed per city):

city      amount
Bristol   63
Compton   12
London    31
New York  42

Data Aggregation of Sorted/Grouped Input

Input, sorted (grouped) by city:

id    name   city      amount
0345  Smith  Bristol   56
0121  Forth  Bristol    7
0322  Jones  Compton   12
0212  Spade  London     8
0492  West   London    23
0221  Black  New York  42

Output (amount summed per city):

city      amount
Bristol   63
Compton   12
London    31
New York  42
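
Aggregating grouped input, as pictured above, can be sketched in Python with itertools.groupby, which, like the sorted case here, only combines adjacent records with equal keys. The function name is made up and the code is an analogy, not the Rollup component itself.

```python
from itertools import groupby

# Input already grouped by city: (id, name, city, amount)
rows = [
    ("0345", "Smith", "Bristol", 56),
    ("0121", "Forth", "Bristol", 7),
    ("0322", "Jones", "Compton", 12),
    ("0212", "Spade", "London", 8),
    ("0492", "West", "London", 23),
    ("0221", "Black", "New York", 42),
]


def rollup_sum(rows):
    """Sum the amount over each run of adjacent rows that share a city."""
    return [
        (city, sum(row[3] for row in group))
        for city, group in groupby(rows, key=lambda row: row[2])
    ]


print(rollup_sum(rows))
```

If the input were not grouped by city, the same city could appear in several runs and produce several partial totals, which is why grouped (sorted) input matters for this style of aggregation.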

The Rollup Component (mp/figure-05.mp)

By default, Rollup reads grouped (sorted) records from the input
port, aggregates them as indicated by key and transform
parameters, and writes the resulting aggregate record on the out
port.

Built-in Functions for Rollup


The following aggregation functions are predefined and are only
available in the rollup component:

avg, count, first, last, max, min, product, sum

Rollup Wizard

Note the use of an aggregation function in the expression

Exercise 6: Rollup Data


Using example graph figure-05.mp, modify the transform function
to count the number of records for the same city.

Run the application and examine the results.

Joining Data

First input:

id    name   city      amount
0345  Smith  Bristol   56
0212  Spade  London     8
0322  Jones  Compton   12
0492  West   London    23
0121  Forth  Bristol    7
0221  Black  New York  42

Second input:

id    dt      cost
0322  970402   1242.50
0345  970924    923.75
0121  961211  12392.00
0492  971123    234.12
0666  950616   2312.10

Joined output:

id    city      amount  dt
0345  Bristol   56      1997/09/24
0212  London     8      1900/01/01
0322  Compton   12      1997/04/02
0492  London    23      1997/11/23
0121  Bristol    7      1996/12/11
0221  New York  42      1900/01/01

Joining Sorted Data on the id field

First input, sorted by id:

id    name   city      amount
0121  Forth  Bristol    7
0212  Spade  London     8
0221  Black  New York  42
0322  Jones  Compton   12
0345  Smith  Bristol   56
0492  West   London    23

Second input, sorted by id:

id    dt      cost
0121  961211  12392.00
0322  970402   1242.50
0345  970924    923.75
0492  971123    234.12
0666  950616   2312.10

Joined output:

id    city      amount  dt
0121  Bristol    7      1996/12/11
0212  London     8      1900/01/01
...

Building the Output Record

in0:
record
  decimal(4) id;
  string(6) name;
  string(8) city;
  decimal(3) amount;
end

in1:
record
  decimal(4) id;
  date("YYMMDD") dt;
  decimal(9.2) cost;
end

out:
record
  decimal(4) id;
  string(8) city;
  decimal(3) amount;
  date("YYYY/MM/DD") dt;
end

What if the in1 record is missing?

in0:
record
  decimal(4) id;
  string(6) name;
  string(8) city;
  decimal(3) amount;
end

in1 (missing, so dt has no value: ???):
record
  decimal(4) id;
  date("YYMMDD") dt;
  decimal(9.2) cost;
end

out:
record
  decimal(4) id;
  string(8) city;
  decimal(3) amount;
  date("YYYY/MM/DD") dt;
end

Prioritized Assignment

Destination  Priority  Source

out.dt       :1:       in1.dt;
out.dt       :2:       "1900/01/01";

In DML, a missing value (say, if there is no in1 record) causes an
assignment to fail.
If an assignment for a left-hand side fails, the next-priority
assignment is tried. There must be one successful assignment for
each output field.
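
The try-in-priority-order behavior can be sketched in Python as follows. This is an analogy rather than DML semantics in full: a missing in1 record is modeled as None, a failed assignment as an exception or a None value, and the helper names are made up.

```python
def assign(rules):
    """Try the rules in priority order and return the first value that succeeds."""
    for rule in rules:
        try:
            value = rule()
        except (KeyError, TypeError):
            continue  # this rule failed, so fall through to the next priority
        if value is not None:
            return value
    raise ValueError("no rule succeeded for this output field")


def out_dt(in1):
    """out.dt :1: in1.dt;  out.dt :2: "1900/01/01";"""
    return assign([
        lambda: in1["dt"],      # priority 1: fails when in1 is missing
        lambda: "1900/01/01",   # priority 2: constant default
    ])


print(out_dt({"dt": "1997/09/24"}))
print(out_dt(None))
```

When no rule succeeds for a field, the sketch raises an error, mirroring the requirement that each output field needs one successful assignment.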

Assigning Priorities to Business Rules

Resulting display when out.dt is selected

The Join Component


Join performs a join of inputs. By default, the inputs to join
must be sorted and an inner join is computed.
Note: The following slides and the on-line example assume the
join-type parameter is set to Outer, and thus compute an outer
join.

Driving Key, max-core, Record - Required

Joining (mp/figure-06.mp)

A Look Inside the Join Component*

Inputs: in0 (a b) and in1 (a q)
Align inputs by key

out :: fname(in0, in1) =
begin
  ...
end;

Output: out (a x)

*join-type = Full Outer join

The inputs are aligned by a and passed to this transform function:

out :: join(in0, in1) =
begin
  out.a :: in0.a;
  out.x :1: in1.r + 20;
  out.x :2: in0.b + 10;
  out.q :1: in1.q;
  out.q :2: "XX";
end;

First pair of records:

1. Records arrive at the inputs of the Join: G 234 42 (in0) and G NY (in1).
2. The input records are read into the Join component.
3. The input key fields are compared.
4. The aligned records are passed to the transformation function.
5. The transformation engine evaluates based on the inputs.
6. A result record is emitted and written out as long as all output
   fields have been successfully computed: G 24 NY

Second pair of records:

1. New records arrive at the inputs of the Join: H 79 23 (in0) and K IL (in1).
2. Again, they are read into the Join component.
3. The input key fields are compared.
4. The aligned records are passed to the transformation function. The
   keys differ, so only H 79 23 is passed on; K IL waits at the input.
5. The transformation engine evaluates based on the inputs.
6. A result record is generated and written out as all output fields
   are successfully computed: H 89 XX
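
The two passes above can be reproduced with a Python sketch of a full outer join over key-sorted inputs. Several things here are assumptions made for illustration: keys are unique within each input, a missing record is modeled as None, the r values are invented (r = 4 is chosen so that x comes out as 24), and a record for which out.a cannot be assigned is simply dropped.

```python
def outer_join_sorted(left, right, key):
    """Pair up two key-sorted lists (unique keys); None marks a missing side."""
    i = j = 0
    pairs = []
    while i < len(left) or j < len(right):
        if j >= len(right) or (i < len(left) and left[i][key] < right[j][key]):
            pairs.append((left[i], None))
            i += 1
        elif i >= len(left) or right[j][key] < left[i][key]:
            pairs.append((None, right[j]))
            j += 1
        else:
            pairs.append((left[i], right[j]))
            i += 1
            j += 1
    return pairs


def join_transform(in0, in1):
    """The slide's rules, with each priority 2 rule used when in1 is missing."""
    if in0 is None:
        return None  # out.a :: in0.a has no fallback, so no record is produced
    return {
        "a": in0["a"],
        "x": in1["r"] + 20 if in1 is not None else in0["b"] + 10,
        "q": in1["q"] if in1 is not None else "XX",
    }


in0_records = [{"a": "G", "b": 234}, {"a": "H", "b": 79}]
in1_records = [{"a": "G", "r": 4, "q": "NY"}, {"a": "K", "r": 9, "q": "IL"}]  # r values assumed

results = [join_transform(l, r) for l, r in outer_join_sorted(in0_records, in1_records, "a")]
joined = [rec for rec in results if rec is not None]
print(joined)
```

The unmatched K record produces no output in this sketch because nothing can supply out.a; giving every output field a fallback rule is one way to avoid losing such records.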

Exercise 7: Join Data


Using example graph figure-06.mp, modify the transform function
to join visits.dat and last-visits.dat so that no records are
rejected.

Run the application and examine the results. The Unmatched
Last Visits dataset should be empty.

Exercise 8 (if time): Join Retaining All Fields


Building upon the graph you created in Exercise 7, create a new
output record format and transform function to join visits.dat
and last-visits.dat according to the following rules:
Retain all fields from each dataset.
Supply defaults where necessary.

Change the necessary parameters, run the application, and
examine the results.

Lookup Files
DML provides a facility for looking up records in a dataset based
on a key:
lookup(file-name, key-expression)

The data is read from a file into memory.

The GDE provides a Lookup File component as a special dataset
with no ports.

Using lookup instead of Join

Using Last-Visits
as a lookup file

Configuring a Lookup File


1. Label used as name in lookup expression

2. Browse for pathname

3. Set record format

4. Set the lookup key

Using a lookup file in a Transform Function

Input 0 record format:
record
  decimal(4) id;
  string(6) name;
  string(8) city;
  decimal(3) amount;
end

Output record format:
record
  decimal(4) id;
  string(8) city;
  decimal(3) amount;
  date("YYYY/MM/DD") dt;
end

Transform function:
out :: lookup_info(in) =
begin
  out.id     :: in.id;
  out.city   :: in.city;
  out.amount :: in.amount;
  out.dt :1: lookup("Last-Visits", in.id).dt;
  out.dt :2: "1900/01/01";
end;
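
A Python sketch of the same idea: the lookup data is loaded into memory once, indexed by key, and consulted for each input record. The dictionary index and the function names are illustrative assumptions rather than the Co>Operating System's mechanism, and the dates are written already converted to YYYY/MM/DD to keep the sketch short.

```python
# In-memory copy of the lookup data, using rows from the earlier join slides.
LAST_VISITS = [
    {"id": "0322", "dt": "1997/04/02"},
    {"id": "0345", "dt": "1997/09/24"},
    {"id": "0121", "dt": "1996/12/11"},
]

# Index built once on the lookup key so each lookup is a dictionary access.
INDEX = {rec["id"]: rec for rec in LAST_VISITS}


def lookup(key):
    """Return the matching record, or None when the key is absent."""
    return INDEX.get(key)


def lookup_info(rec):
    """Copy id, city, and amount; take dt from the lookup or use the default."""
    match = lookup(rec["id"])
    return {
        "id": rec["id"],
        "city": rec["city"],
        "amount": rec["amount"],
        "dt": match["dt"] if match is not None else "1900/01/01",
    }


print(lookup_info({"id": "0345", "name": "Smith", "city": "Bristol", "amount": 56}))
print(lookup_info({"id": "0212", "name": "Spade", "city": "London", "amount": 8}))
```

Unlike the Join, the main input does not need to be sorted here, because each record is matched by a direct lookup on its key.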

Exercise 9 (if time): Lookup


Building upon the graph you created in Exercise 8, convert it to
use a lookup file in place of the Join.
Change the necessary parameters, run the application, and
examine the results.

The GDE Debugger


The GDE has a built-in debugger capability.
To enable the Debugger, select Debugger > Enable Debugger.
The Debugger Toolbar

Enable Debugger

Add Watcher File

Remove All Watchers

Isolate Components

The GDE Debugger


To add a Watcher File, select a flow and click Add Watcher File.
To remove the Watcher Files, click Remove All Watchers.
To isolate a set of components, select the components to be isolated and
click Isolate Components; Watcher Files will automatically be placed
into the graph by the Debugger.
Note that if the Watcher Files do not exist, the GDE builds them during
the first run only and uses the existing Watchers on successive runs.

Q&A
Any Questions ?

Capgemini
WORLDWIDE HEADQUARTERS 6400 SHAFER COURT ROSEMONT, ILLINOIS USA 60018
Tel. 847.384.6100 Fax 847.384.0500 WWW.Capgemini.COM

