
Overview of DataStage Parallelism

Server & Parallel Jobs

• ‘Parallel’ jobs are separate from ‘Server’ jobs
• They have a separate palette and different stages
• A single project can contain parallel as well as server jobs, and both can be called from the same
sequencer
• Use parallel jobs where there is a large volume of data to be processed
• There is an overhead for the management of parallel processes
• Hardware support for parallelization:
  • MPP – single box, multiple shared-nothing processors: parallelization will give performance
  improvements
  • Clustered – multiple boxes: parallelization will give performance improvements
  • SMP – shared memory, single OS:
    • CPU-limited jobs – parallelization to increase the number of processors used will help
    improve performance
    • Memory-limited jobs – memory is shared across processors, so such jobs may need a
    hardware upgrade
    • Disk I/O limited jobs – some SMP systems allow scalability of disk I/O, so that
    throughput improves as the number of processors increases
• Server jobs also support a limited degree of parallelization

Types of Parallelism

• Internal Performance Enhancements

• Parallel execution of independent flows

[Figure: two independent flows (Operator 1 → Operator 2 → Operator 3, and Operator 5 → Operator 6 → Operator 7) execute in parallel; Operator 4 is the wait point for the dependent process.]
• Pipelining: each row is processed and passed on to the next process in a pipeline

[Figure: Row 1 has already reached Process 4 while Row 4 is still in Process 1 – rows flow through the four processes concurrently.]
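
As an illustration of pipelining, here is a minimal plain-Python sketch (not DataStage code; the stage names are invented). Each stage is modelled as a generator, so a row flows onward as soon as it is produced instead of waiting for the whole dataset:

def read_rows():                        # stand-in for a source stage
    for i in range(1, 5):
        yield {"row": i}

def transform(rows):                    # stand-in for a transformer stage
    for r in rows:
        r["doubled"] = r["row"] * 2
        yield r                         # row moves on immediately

def write_rows(rows):                   # stand-in for a target stage
    for r in rows:
        print(r)

write_rows(transform(read_rows()))      # rows stream through all three stages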

• Partitioning: rows are split across different processes, each performing the same logic (a sketch follows below)

[Figure: a logical unit of data is partitioned into P1 … Pn; each partition is processed independently.]

• Note that these options also depend on:
  • support from the h/w & OS
  • server settings & configurations
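
A minimal plain-Python sketch of round-robin partitioning (illustrative only; DataStage performs this internally via its partitioners):

def round_robin(rows, n_partitions):
    # deal rows across partitions like cards: row i goes to partition i mod n
    parts = [[] for _ in range(n_partitions)]
    for i, row in enumerate(rows):
        parts[i % n_partitions].append(row)
    return parts

print(round_robin(list(range(1, 9)), 4))    # [[1, 5], [2, 6], [3, 7], [4, 8]]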

DataStage Enterprise Edition
• Pipelining & Partitioning Combined (a sketch follows below)

[Figure: an N-way partitioned read, process & write between Oracle sources and targets; each partition is itself pipelined for read, process & write.]

• Optimize the parallelism:
  • better utilization of available CPU & hardware resources
  • versus the overhead of copying and preparing data and managing the processes
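
Combining the two ideas, each partition runs its own copy of the pipeline. A compact plain-Python illustration (not DataStage code; the pipeline body is a stand-in):

def pipeline(rows):
    # stand-in for a read -> transform -> write pipeline over one partition
    return [r * 2 for r in rows]

partitions = [[1, 5], [2, 6], [3, 7], [4, 8]]   # a 4-way round-robin split
print([pipeline(p) for p in partitions])        # each partition flows independently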

Pipelining

• By default, DataStage combines operators where possible

  • E.g. two Transformer stages may be combined into a single operator
  • This saves data copying & preparation

• This can be overridden at a project or stage level (e.g. via the APT_DISABLE_COMBINATION
environment variable) if we wish to create separate processes for each operator

Data Partitioning
Default
• The configuration file decides how many process instances of each operator are created,
e.g. if 4 nodes are defined, there is a 4-way partition of data
• By default, Auto partitioning is set
  • DS chooses the optimum partitioning & repartitioning mechanism
  • “Round Robin” is applied at the first level, followed by “Same”
  • If there is a need for a key-based partition upstream or downstream, then alternative
  modes are chosen
    • e.g. in the case of a join, the data on the input link is sorted & partitioned by the join
    key

Degree of parallelism
• This is decided by the configuration file
• The configuration file used can be varied at the job level (via the $APT_CONFIG_FILE
environment variable) to suit different job requirements
• Individual stages may also be executed on selected nodes by specifying node map
constraints
• Where the overhead of partitioning is not worth the performance improvement, the entire
job or a specific stage may be executed sequentially

Avoid Repartitioning & Redundant Sorting
• The Designer link markings indicate the links at which these occur (when not auto partitioning)
• Designers may try to optimize this by changing the order in which the stages occur
or otherwise modifying the jobs
• Sequential files are non-partitioned, so they are less optimal as intermediate storage formats;
DataSets may be used instead to preserve partitioning across jobs
Configuration File

The Configuration File

• System size & configuration details are maintained external to the job design
• Can be modified to suit development & production environments, handle hardware
upgrades, etc. without redesigning/recompiling jobs
• The configuration file describes the available processing power in terms of processing
nodes
  • It determines how many instances of a process will be produced when you
  run a parallel job
  • Minimum recommended #Nodes = ½ × #CPUs
  • Usual starting point: #Nodes = #CPUs
  • #Nodes < #CPUs if some CPUs are left free for the OS, DB and other
  applications
  • #Nodes > #CPUs for I/O-intensive streams with poor CPU usage

• Associates a scratchdisk with each node

Configuration File

Sample

{
    node "node1"
    {
        fastname "ibmsceai"
        pools ""
        resource disk "/home/dsadm/Ascential/DataStage/Datasets" {pools ""}
        resource scratchdisk "/home/dsadm/Ascential/DataStage/Scratch" {pools ""}
    }
    node "node2"
    {
        fastname "ibmsceai"
        pools ""
        resource disk "/home/dsadm/Ascential/DataStage/Datasets" {pools ""}
        resource scratchdisk "/home/dsadm/Ascential/DataStage/Scratch" {pools ""}
    }
}

Partitioning & Collecting

Partitioning techniques: to be specified at each stage (or left as default)

• General load-balancing techniques:
  • ‘Round Robin’ – most efficient
  • ‘Random’
• ‘Same’ – use the partitioning of the input feed; no repartitioning
• ‘Entire’ – each node receives every row, e.g. to create a lookup file
• ‘Hash’ by key – e.g. to remove duplicates, or to aggregate over fewer groups
• ‘Modulus’ – key modulo the number of partitions
• ‘Range’ – ensures related records come together
• ‘DB2’ – the same algorithm used by DB2
• ‘Auto’ – leave it to DataStage: usually ‘Round Robin’ initially, then ‘Same’

Collecting (a sketch of key-based partitioning and sort-merge collection follows below)
• ‘Round Robin’
• ‘Ordered’ – read all records from the first partition, then from the second, and so on
• ‘Sort Merge’ – read records in order, based on one or more columns (the collecting key)
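
The following plain-Python sketch (illustrative only, not DataStage internals) shows why hashing by key keeps all rows of a group in one partition, and models the Sort Merge collector as a k-way merge of sorted partitions:

import heapq

def hash_partition(rows, key, n):
    # all rows with the same key value hash to the same partition
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts

def sort_merge_collect(parts, key):
    # each partition is sorted on the collecting key, then k-way merged
    sorted_parts = [sorted(p, key=lambda r: r[key]) for p in parts]
    return list(heapq.merge(*sorted_parts, key=lambda r: r[key]))

rows = [{"grp": g, "amt": a}
        for g, a in [("A", 10), ("B", 50), ("A", 20), ("B", 60)]]
parts = hash_partition(rows, "grp", 2)
print(sort_merge_collect(parts, "grp"))   # rows emerge ordered by "grp"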

Discussion on Partitioning Data Within a Job

Partitioning

• Assume that a <#Node> configuration file has been set

• In the simplest case, when no specific options have been set:

  • data is processed as <#Node> partitions, unless downstream stages require otherwise
  • the partitioning mechanism is ‘Round Robin’ to begin with
  • subsequent stages use “Same”, i.e. no repartitioning happens
• This normally gives the maximum performance benefit & resource usage

• Link markings can be set/unset via the Designer menu option Diagram -> Link Markings

Partitioning
Sequential Files – Read
• Normally read in sequence, i.e. one process reads the data & passes it to the next stage
• The output data is partitioned round-robin into <#Node> partitions

Processing Stages
• Receive partitioned data & propagate it using the “Same” method

Sequential Files – Write
• The partitioned data is collected & written sequentially

[Figure: the source file is read sequentially & round-robin partitioned; no repartitioning occurs in between (Auto partitioning sets it to “Same”); at the target the data is “collected” & written sequentially – a sort can be applied on collection.]

Partitioning

Sequential Files

• Execute in parallel when reading multiple files (multiple files => 1 partition per file)
• For fixed-width files, reading can be parallelized using multiple readers per node OR
multiple nodes:
  • on SMP systems, set “Number of Readers Per Node”
  • on cluster systems, set “Read From Multiple Nodes”

Partitioning

• Back to EE_TRG_Demo_1

• Note that we did not set any specific options for parallelism or partitioning

[Figure: the source is read sequentially & a partition is applied (round-robin by default); the links show the icon for Auto partitioning.]

Partitioning

• DataStage inserts operators for sort, partition, buffering, etc.

• The Aggregator stage needs key-partitioned data
  • Implicit insert: Hash partition

• The Join stage expects all input links to be key-partitioned & sorted, and the partitioning
mode must be the same on each link
  • Implicit insert: Hash partition (if required**) & sort on the join key(s)

** Note that in this case the data output from the aggregator is not partitioned again, since it is already in the
required partitioning format; it is only sorted
Partitioning

• USUALLY – the Auto mode works sufficiently well

• When would we worry about the partitioning mechanism?

  • Some cases of debugging
  • Performance tuning

• Look at the link icons to identify where partitioning, explicit repartitioning & collection have occurred

• Advanced users: look at the DataStage log for

  • what partitioning has been implicitly applied
  • what the stages have been interpreted as within the OSH, and how
  • how data is distributed across the partitions
  • where & why repartitioning occurs
  • Note that the level of reporting will depend on the environment variable settings

• Tune parallelism by:
  • changing the configuration file
  • running specific stages sequentially or on selected node pool(s)
  • changing the partitioning mode
  • enabling or disabling operator combinability, etc.

Partitioning

• Most stages have an “Advanced” tab with parallelization options:

  • Sequential/Parallel execution mode
  • Combine operators (where possible) or do not combine
  • Constraints or limitations on which nodes are to be used

Partitioning

• Each input link into a stage has a Partitioning tab:

  • select the partitioning type
  • options for sorting the incoming link data
  • select the partitioning key (if applicable)

• If the stage is executed sequentially & the preceding stage is parallel, then the “Collection” options are available instead

Partitioning

• How can inappropriate partitioning cause wrong results to be produced?

  • If the partitioning mode on, say, the Aggregator stage is for some reason explicitly set to a non-key
  mode (such as round-robin), the stage will return WRONG RESULTS

Input rows (Grp Key, Amt Val):
A 10, A 20, A 30, A 40, B 50, B 60, B 70, B 80

Round-robin partitioning scatters each group across the partitions:
Partition 1: A 10, A 30, B 50, B 70
Partition 2: A 20, A 40, B 60, B 80

Aggregating within each partition then yields partial totals:
Partition 1 output: A 40, B 120
Partition 2 output: A 60, B 140

Wrong output! Each group key appears once per partition with only a partial total;
the correct results are A 100 and B 260.
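
The same effect can be reproduced with a small plain-Python sketch (illustrative only; the partition layout matches the figure above):

from collections import defaultdict

rows = [("A", 10), ("A", 20), ("A", 30), ("A", 40),
        ("B", 50), ("B", 60), ("B", 70), ("B", 80)]

def aggregate(partition):
    # sums Amt Val per Grp Key within one partition
    sums = defaultdict(int)
    for grp, amt in partition:
        sums[grp] += amt
    return dict(sums)

# Round-robin: rows of the same group are scattered across partitions,
# so each partition emits its own partial (wrong) total per group.
round_robin = [rows[0::2], rows[1::2]]
print([aggregate(p) for p in round_robin])
# [{'A': 40, 'B': 120}, {'A': 60, 'B': 140}]  <- wrong, duplicated groups

# Hash by key: every row of a group lands in exactly one partition,
# so the per-partition totals are the true group totals.
hashed = [[r for r in rows if hash(r[0]) % 2 == i] for i in range(2)]
print([aggregate(p) for p in hashed])          # totals A 100 and B 260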

Case Study 2
