May 9, 2011 2
Types of Parallelism
[Figure: operators labelled Operator 5, Operator 6, Operator 7]
• Partitioning: rows split across different processes, each performing the same logic
[Figure: Data, Partitioned Data, Logical Unit (P1)]
Note that these options also depend on
• support from the hardware & OS
• server settings & configurations
DataStage Enterprise Edition
• Pipelining & Partitioning Combined
[Figure: pipelining & partitioning combined; diagram labels: Oracle]
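The combination of the two types of parallelism can be sketched in Python. This is only a rough model: the generators imitate pipelining (each operator handles a row as soon as the previous operator yields it), and the same operator chain is applied to every partition. The operator names (`read_rows`, `double_amount`, `keep_large`) and the `amount` column are hypothetical, and the code runs in a single process, whereas a real parallel job runs separate processes.

```python
def read_rows(rows):
    """Source operator: hands rows downstream one at a time."""
    for row in rows:
        yield row


def double_amount(rows):
    """Second operator: works on each row as soon as it arrives."""
    for row in rows:
        yield {**row, "total": row["amount"] * 2}


def keep_large(rows, threshold):
    """Third operator: filters rows without waiting for the full input."""
    for row in rows:
        if row["total"] >= threshold:
            yield row


def split_round_robin(rows, n):
    """Split rows into n partitions, dealing them out in turn."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def run_partitioned(rows, n, threshold=50):
    """Apply the same three-operator pipeline to every partition."""
    results = []
    for part in split_round_robin(rows, n):
        pipeline = keep_large(double_amount(read_rows(part)), threshold)
        results.append(list(pipeline))
    return results


if __name__ == "__main__":
    sample = [{"amount": a} for a in (10, 20, 30, 40)]
    for number, part in enumerate(run_partitioned(sample, 2)):
        print(number, part)
```

Each partition passes through an identical copy of the pipeline, which is the sense in which every process performs the same logic.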
Pipelining
• By default,
• This can be overridden at the project or stage level if we wish to create separate processes for each operator
Data Partitioning
Default
• The configuration file decides how many process instances of each operator are created, e.g. if 4 nodes are defined, there is a 4-way partition of the data
• By default, Auto-Partitioning is set
• DS chooses the optimum partitioning & repartitioning mechanism
• “Round-robin” is applied at the first level, followed by “Same”
• If there is a need for key-based partitioning upstream or downstream, then alternative modes are chosen
• e.g. in the case of a join, the data in the input link is sorted & partitioned by the join key
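Why key-based partitioning is needed for some stages can be illustrated with a small Python sketch. It is not how DataStage partitions internally: the function names, the `grp` and `amt` columns, the sample rows and the use of `zlib.crc32` as the hash are all assumptions made for illustration. It compares per-partition sums after round-robin partitioning with sums after partitioning on the grouping key.

```python
import zlib
from collections import defaultdict


def round_robin_partition(rows, n):
    """Deal rows out to n partitions in turn, ignoring their content."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def hash_partition(rows, n, key):
    """Send every row with the same key value to the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        # crc32 is an arbitrary, stable hash chosen for this sketch.
        index = zlib.crc32(str(row[key]).encode("utf-8")) % n
        parts[index].append(row)
    return parts


def sum_by_key(rows, key, value):
    """Group rows by key and sum the value column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def aggregate_per_partition(parts, key, value):
    """Run the same aggregation independently on each partition."""
    return [sum_by_key(part, key, value) for part in parts]


if __name__ == "__main__":
    rows = [
        {"grp": "A", "amt": 10},
        {"grp": "A", "amt": 20},
        {"grp": "B", "amt": 30},
        {"grp": "B", "amt": 40},
    ]
    # Round-robin: each partition sees only part of each group.
    print(aggregate_per_partition(round_robin_partition(rows, 2), "grp", "amt"))
    # Key-based: each group is complete within one partition.
    print(aggregate_per_partition(hash_partition(rows, 2, "grp"), "grp", "amt"))
```

With round-robin input, each partition produces a partial sum per group, so the results would need a further combining step. With key-based input, every group total is already final, which is why a join or aggregation calls for partitioning on its key.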
Degree of parallelism
• This is decided by the configuration file.
• The configuration file used can be varied at the job level to suit different job requirements
• Individual stages may also be executed on selected nodes by specifying node map constraints
• Where the overhead of partitioning is not worth the performance improvement, the entire job or a specific stage may be executed sequentially.
• System size & configuration details are maintained external to the job design
• Can be modified to suit development & production environments, handle hardware upgrades, etc. without redesigning/recompiling jobs
• The configuration file describes the available processing power in terms of processing nodes
• It determines how many instances of a process will be produced when you compile a parallel job.
• Minimum recommended: # Nodes < ½ times # CPUs
• Usual starting point: # Nodes = # CPUs
• # Nodes < # CPUs if some CPUs are left free for the OS, DB and other applications
• # Nodes > # CPUs for I/O-intensive streams with poor CPU usage
Configuration File
Sample
{
    node "node1"
    {
        fastname "ibmsceai"
        pools ""
        resource disk "/home/dsadm/Ascential/DataStage/Datasets" {pools ""}
        resource scratchdisk "/home/dsadm/Ascential/DataStage/Scratch" {pools ""}
    }
    node "node2"
    {
        fastname "ibmsceai"
        pools ""
        resource disk "/home/dsadm/Ascential/DataStage/Datasets" {pools ""}
        resource scratchdisk "/home/dsadm/Ascential/DataStage/Scratch" {pools ""}
    }
}
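Since the number of node entries sets the degree of parallelism, it can be convenient to generate files like the sample for different node counts. The Python sketch below does this under some assumptions: `render_node` and `render_config` are hypothetical helper names, every node uses the same host name and the same disk and scratch directories (as in the sample), and only the keywords shown in the sample are emitted.

```python
def render_node(name, fastname, disk, scratch):
    """Return the text of one node entry, laid out like the sample."""
    lines = [
        f'    node "{name}"',
        "    {",
        f'        fastname "{fastname}"',
        '        pools ""',
        f'        resource disk "{disk}" {{pools ""}}',
        f'        resource scratchdisk "{scratch}" {{pools ""}}',
        "    }",
    ]
    return "\n".join(lines)


def render_config(fastname, disk, scratch, node_count):
    """Return a configuration file with node1..nodeN on a single host."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    nodes = [
        render_node(f"node{i}", fastname, disk, scratch)
        for i in range(1, node_count + 1)
    ]
    return "{\n" + "\n".join(nodes) + "\n}\n"


if __name__ == "__main__":
    print(
        render_config(
            "ibmsceai",
            "/home/dsadm/Ascential/DataStage/Datasets",
            "/home/dsadm/Ascential/DataStage/Scratch",
            4,
        )
    )
```

Calling `render_config` with a node count of 4 would describe a 4-way partition on the same server without any change to the job design.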
Partitioning & Collecting
Collecting
• Round Robin
• Ordered – Read all records from first partition, then from second and so on
• Sorted Merge – Read records in the order given by one or more columns (the collecting key)
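The three collecting methods can be modelled in a few lines of Python. This is an illustrative sketch only: partitions are plain lists, the function names are hypothetical, and the sorted merge assumes that each partition is already sorted on the collecting key.

```python
from heapq import merge
from itertools import chain, zip_longest

_MISSING = object()  # marks partitions that have run out of records


def collect_round_robin(partitions):
    """Take one record from each partition in turn until all are empty."""
    collected = []
    for group in zip_longest(*partitions, fillvalue=_MISSING):
        collected.extend(r for r in group if r is not _MISSING)
    return collected


def collect_ordered(partitions):
    """Read all records from the first partition, then the second, etc."""
    return list(chain.from_iterable(partitions))


def collect_sorted_merge(partitions, key):
    """Merge partitions that are each already sorted on the key."""
    return list(merge(*partitions, key=key))


if __name__ == "__main__":
    parts = [[1, 5, 9], [2, 3], [4, 6, 7, 8]]
    print(collect_round_robin(parts))
    print(collect_ordered(parts))
    print(collect_sorted_merge(parts, key=lambda record: record))
```

Only the sorted merge yields a single stream that is ordered on the key; the other two preserve order within each partition but not across partitions.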
Discussion on Partitioning Data Within A Job
Partitioning
Partitioning
Sequential Files - Read
• Normally read in sequence, i.e. 1 process reads the data & passes it to the next stage
• Output data is partitioned round-robin into <#Node> partitions
Processing Stages
• Receive partitioned data & propagate it using the “Same” method
Partitioning
Sequential Files
Partitioning
• Back to EE_TRG_Demo_1
• Note that we did not set any specific options for parallelism or partitioning
Partitioning
** Note that in this case the data output from the aggregator is not partitioned again, since it is already partitioned as required. It is only sorted.
Partitioning
• Look at the link icons to identify where partitioning, explicit repartitioning & collection have occurred
• Tune parallelism
• through the configuration file
• by running specific stages sequentially or on selected node pool(s)
• by changing the partition mode
• by enabling or disabling Operator Combinability, etc.
Partitioning
Sequential/Parallel
Partitioning
• If a stage is executed sequentially & the preceding stage is parallel, then the “Collection” options are available
Partitioning
[Figure: sample data tables with columns Grp Key and Amt Val]
Case Study 2