
14 Good design tips in DataStage 8.0.

------------------------------------------------------------------------------------------------------------------------------

1) When you need to run the same set of jobs again and again, it is better to create a
sequence job containing all the jobs you need to run. Running this sequence runs all the
jobs, and you can order them as your requirement dictates.

2) If you use a Copy or a Filter stage immediately before or after a Transformer stage,
you are reducing efficiency by using more stages than necessary, because a Transformer
does the job of both a Copy stage and a Filter stage, as shown below.
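
For example (link and column names are hypothetical), instead of following a Transformer
with a Filter stage, put the filter condition directly on the Transformer output link as a
constraint:

   Constraint on output link lnk_active:   lnk_in.Status = 'A'

Only rows satisfying the constraint flow down that link, so the separate Filter stage is
unnecessary.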

3) Use Sort stages instead of Remove Duplicates stages where you can. The Sort stage
has more grouping options and a sort indicator (key mode) option; see the sketch below.
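
As a sketch (key name hypothetical, and option names may vary slightly by version), a
Sort stage doing the work of a Remove Duplicates stage could be set up as:

   Sorting Keys             = CustomerID  (Sort Key Mode: Sort)
   Allow Duplicates         = False
   Create Key Change Column = True

Allow Duplicates = False drops duplicate keys during the sort, while Create Key Change
Column flags the first row of each key group if you prefer to handle duplicates
downstream.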

4) Turn off Runtime Column Propagation wherever it is not required.

5) Use Modify, Filter, Aggregator, Column Generator, etc. stages instead of a
Transformer stage only if the anticipated volumes are high and performance becomes a
problem. Otherwise use a Transformer; it is much easier to code a Transformer than a
Modify stage.

6) Avoid propagating unnecessary metadata between stages. Use a Modify stage to drop
it. Note that the Modify stage drops metadata only when explicitly told to with the
DROP clause.
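
For example, a minimal Modify stage specification (column names hypothetical):

   DROP Address, Phone

or, equivalently, keep only the columns you need:

   KEEP CustomerID, Amount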

7) Add reject links wherever you need to reprocess rejected records or where
considerable data loss could otherwise happen. At a minimum, keep reject links on
Sequential File stages and on writes to database stages.

8) Use an ORDER BY clause when a database stage feeds a Join. The intention is to
use the database's power for sorting instead of DataStage resources. Keep the Join
partitioning as Auto. When using ORDER BY, indicate the 'don't sort' option between the
DB stage and the Join stage using a Sort stage, as sketched below.
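
As a sketch with hypothetical table and column names, the user-defined SQL in the DB
stage would be:

   SELECT order_id, cust_id, amount
   FROM orders
   ORDER BY cust_id

and the Sort stage placed between the DB stage and the Join would use cust_id as its
key with Sort Key Mode set to 'Don't Sort (Previously Sorted)', so DataStage trusts the
database ordering instead of re-sorting.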

9) When doing outer joins, you can use a dummy column just for null checking instead
of fetching an explicit column from the table, as sketched below.
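
A sketch of the idea, with hypothetical names: put a Column Generator on the reference
link that adds a constant column, say REF_FOUND = 1. After the left outer join, a
Transformer can test IsNull(lnk_join.REF_FOUND) to detect source rows that found no
match, without dragging a real reference column through the job.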

10) Data partitioning is a very important part of parallel job design. It is advisable to
leave partitioning as 'Auto' unless you are comfortable with partitioning, since all
DataStage stages are designed to perform correctly with Auto partitioning.

11) Do remember that Modify drops metadata only when it is explicitly asked to do so
using the KEEP/DROP clauses.

12) Range lookup: a range lookup is the equivalent of the BETWEEN operator. Looking up
against a range of values was difficult to implement in previous DataStage versions. With
this functionality in the Lookup stage, comparing a source column to a range bounded by
two lookup columns, or a lookup column to a range bounded by two source columns, is
easy to implement.
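
For example, a range lookup condition in the Lookup stage amounts to (column names
hypothetical):

   source.txn_date >= ref.effective_from AND source.txn_date <= ref.effective_to

which is the same test as txn_date BETWEEN effective_from AND effective_to.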

13) Use a Copy stage to dump data to intermediate Peek stages or sequential debug
files. Copy stages are removed at compile time, so they add no overhead.

14) Where you are using a Copy stage with a single input and a single output, you should
ensure that you set the stage's Force property to True. This prevents DataStage from
deciding that the Copy operation is superfluous and optimizing it out of the job.

Some useful info there. I'd like to comment on:

6) Avoid propagating unnecessary metadata between stages. Use a Modify stage to drop it.
Note that the Modify stage drops metadata only when explicitly told to with the DROP
clause.

This is not strictly true, as there are many ways to achieve this, although the Modify
stage is the recommended medium.

For example, you can use a Data Set stage and select only the columns you want to use.
Such actions should be annotated clearly.

You should identify the columns that are not required and drop them at the earliest
opportunity. You can drop columns as part of the output in most stages. Ideally, local
coding standards should make the method consistent. As long as it is clearly indicated
on the canvas or in the job description, of course ;)

10) Data partitioning is a very important part of parallel job design. It is advisable to
leave partitioning as 'Auto' unless you are comfortable with partitioning, since all
DataStage stages are designed to perform correctly with Auto partitioning.

I disagree strongly with your use of 'Auto' as the default. Unless you partition properly,
you will get unexpected and possibly incorrect results, especially if you are running
multi-node. You have to understand partitioning and use it correctly, especially when
performing join or lookup operations.

Some very interesting pointers in there, thanks for sharing :)

DR
Palgun | May 28
Daren,

Thanks for posting your comments.

Your comment on point 10 seems valid.

However, you may not have understood my perspective on point 6. Please note that even
though you drop columns in the Output tab of other stages (other than Modify), those
columns are still carried throughout the job in the background. This can be proved using RCP.
If you want to drop the columns completely, you need to drop them using a Modify stage.
Sometimes you may need to read several columns and, after a bit of processing, those
columns become useless to carry any further. Such cases are the best use of a Modify
stage. Also note that NO OTHER STAGE drops the metadata at runtime.


Vincent McBurney | Jun 1


Nice post; keep up the tips and tricks for DataStage postings. Point 8 can seem a bit
confusing, although I know what you are trying to say. The Sort stage has several
functions: 1) sort data, 2) remove duplicates, 3) indicate that no sort is required because
the data is already sorted, which avoids the insertion of an unnecessary tsort operator
at run time.

saurabh gupta | Jul 3


@Palgun : Nice post as usual :)

But I have one question about point 6:

If we use a Modify stage just to drop columns, won't that affect performance? We are
adding an extra stage just to remove columns.

Palgun | Jul 3
@Saurocks

You need to understand that there are a few stages in DataStage that have nil or minimal
memory usage and should be the top priority when it comes to 'performance'. The Modify
and Copy stages are examples of this.

Furthermore, there is no other stage in DataStage which drops columns at runtime, except
of course the Modify stage. You may try not sending a few columns to the output tab of a
stage, e.g. a Transformer stage, but those columns still get propagated throughout the
job at runtime. This can be checked using the "Runtime column propagation" option.
However, when you use Modify's DROP option, the columns get dropped all the way!

In this way, Modify is quite unique. Type conversions also run faster in this stage.
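
As an illustration of such a conversion (column name hypothetical), a Modify
specification like the following converts a decimal to an int32 without needing a
Transformer:

   ID:int32 = int32_from_decimal(ID)
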
Arash | Aug 31
Hi all,

thank to Palgun for this blog and to all for your posts,

I have a question; maybe it is related to point 3).

I have 20,000,000 records in a job which uses more than 20 range lookups, and 2 of
them do not use a data set as reference (they use an ODBC source as reference).

At the end I have to remove some duplicates by sorting the data (in my Copy stage
instead of a Sort stage) and then sending the same partitioning to a Remove Duplicates
stage.

My job takes 55 minutes, and the Sort and Remove Duplicates I am adding take 20 minutes
on their own (I tested them in a separate job).

Is there any other way to get more performance out of this job?

Thank you
Best regards
Arash

Palgun | Aug 31
Dear Arash,

1) If the reference link of a lookup has a very large number of records, replace that
Lookup with a Join.
2) Are you using that Copy stage just to sort data? If so, better to use a Sort stage
itself to sort the data and then pass it on to the Remove Duplicates stage. That is the
best performance tuning you can do.

Regards,
Palgun

Arash | Sep 21
Hi Palgun,

Thanks for your reply, but even if I use a Join instead of a Lookup, it will be difficult
to implement the range!
I mean, ranges become complicated when you use a Join :(
Regards
Arash
Reading Multiple files through Parallel jobs

Hi,

I want to read multiple files from a single directory, and I have to do this through
parallel jobs. I'm using DataStage version 7.5.2. Can anyone help me in this regard?
Your quick reply will be highly appreciated.

Shiva Ram
Replied Oct 8, 2007

Hi,
Set the read method to File Pattern in the Sequential File stage and use a keyword
that is common to all the files. For example, if you have the files
UNITED_STATES.txt, STATES_UNITED.txt, and UNITED_STATES_OF_AMERICA.txt, and you
set the mode to file pattern with the pattern 'STATES', all three files will be read
by the same Sequential File stage.
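
Concretely, the stage settings would look something like this (path hypothetical):

   Read Method  = File Pattern
   File Pattern = /data/input/*STATES*.txt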

Thanks

vasu koti
Replied Oct 9, 2007

Hi,

In the Sequential File stage properties you need to set the multiple files option:

File = path\abc.txt
File = path\def.csv

Hope this helps.

with cheers
Vasudev

Multiple file extract in Sequential File


Asked by kiran kumar | posted Aug 9 | Replies (6)

How will you extract multiple files using the Sequential File stage without using a file pattern?


6 Replies

Probal Banerjee
Replied Aug 9

Using a loop.
probal1u

vasudev koti
Replied Aug 9

Hi Kiran Kumar

You can read multiple files with the Sequential File stage. In the stage properties,
specify the File property once for each file being read:

File = pathname of file1
File = pathname of file2
File = pathname of file3

(repeat this for reading multiple files). The stage executes in parallel mode.

Thanking you & Regards


Vasudev

karthik reddy
Replied Aug 9

Hi
I have multiple files like file1, file2, etc., and they contain different delimiters:
file1 is comma-delimited and file2 is tab-delimited. How do I extract both files using
a Sequential File stage?

Thanks in advance.

Deepak
Replied Aug 9

@ Kiran
Another option: you can simply use "cat file1 file2 file3" if you know the file names,
and this can be used in the Filter option; see the sketch below.
Or you can use a multiple-instance job and pass the file name as the instance
(invocation) id.
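
A sketch of the cat approach, with hypothetical paths: set the stage's Filter property
to a command such as

   cat /data/in/file1.txt /data/in/file2.txt /data/in/file3.txt

so the stage reads the concatenated output of the command rather than a single file.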

Deepak
Replied Aug 9

@ Karthik
You cannot handle multiple delimiters in the same stage.
You have to pre-process the files to unify the delimiters.
A simple tr command should work for that, as below.
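
For example, to turn the tab-delimited file into a comma-delimited one before reading it
(file names hypothetical):

   tr '\t' ',' < file2.tab > file2.csv

After that, both files can be read with the same comma-delimited settings.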

Shin Takakura
Replied Aug 9

In that case, you need two Sequential File stages to handle the different file types.
I'm not sure what you want to do, but if the difference is the delimiter only and you
want to handle the contents in the same way, read the files through separate Sequential
File stages and use a Funnel stage to merge all the records into one stream, then pass
it to the rest of your logic.

If you don't know which file uses which delimiter, you might need some kind of
pre-processor to convert all tabs to commas (or vice versa) to keep the logic simple.

Best Regards,
Shin Takakura
How to handle nulls with the Modify stage?

Hi, I am trying to handle null warnings using a Modify stage, but warnings still appear in the log.
My data types: ID (integer), name (varchar(20)).
I put this in the Modify stage:
>> ID:int32 = handle_null(ID, 98);name:string = handle_null(name, "NA");
The run finishes with warnings like this:
Sequential_File_6: When checking operator: When validating export schema: At field "name": Exporting nullable field without null handling properties
Sequential_File_6: When checking operator: When validating export schema: At field "ID": Exporting nullable field without null handling properties
Please help clarify this. How can I get rid of these warnings?
Thank you in advance,

2 Replies
0

tigerpeng
Replied Oct 14, 2009

Is Sequential_File_6 an upstream stage of the Modify stage?



Renjie Luo
Replied Oct 19, 2009

The method to fix the problem:

Open the Sequential File stage and go to the Columns tab. For each column that appears
in the warnings, edit the column: open its properties, select Nullable, add a 'Null field
value' property, and save.
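
The equivalent in schema form, using the field names from the question, would look
roughly like this (treat it as a sketch, since the exact format properties depend on
your file definition):

   record {final_delim=end, delim=','} (
     ID: nullable int32 {null_field='98'};
     name: nullable string[max=20] {null_field='NA'};
   )

The null_field property tells the export operator what to write when a column is null,
which is exactly what the warning is complaining about.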
