
ETL Design & Code Guidelines

Guidelines to be followed when preparing for Design and Code Reviews.


Presented in checklist format.

Design
- Design jobs for restartability.
- Do not follow a Sequential File stage with "Same" partitioning.

Code
- Set $APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS to reject rows when a data value exceeds the column size (a sample setting is sketched after this list).
- Do not hard-code parameters.
- Do not hard-code directory paths.
- Do not use fork-joins to generate lookup data sets.
- Use "Hash" aggregation when there is a limited number of distinct key values; it outputs after all rows are read.
- Use "Sort" aggregation for a large number of distinct key values; the data must be pre-sorted, and output is produced after each aggregation group.
- Use multiple aggregators to reduce collection time when aggregating all rows: define a constant key column using a Row Generator, then have the first aggregator sum in parallel and the second aggregator sum sequentially.
- Make sure sequences are not too long; break them up into logical units of work.
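
One concrete way to apply the reject-handling item above is to define the variable at the engine level in dsenv, or to add it to an individual job as an environment-variable job parameter. The sketch below assumes the dsenv route; the value and the $DSHOME location are typical defaults, not settings taken from this document.

    # Sketch: enable overrun rejection engine-wide in $DSHOME/dsenv.
    # Alternatively, add the variable to a job as an environment-variable
    # parameter (defaulting to $PROJDEF) so it can be overridden per run.
    APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS=True   # reject, rather than truncate, over-length string values on import
    export APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS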

Source
- What is the source (database or files, and which category)?
- Is this the best source from which to extract this data?
- Is this source maintained manually or systematically?
- If data is entered manually, what additional checks are used to ensure that data quality issues caused by manual mistakes are handled?
- Is the source cleaned and loaded every time?
- If the source is cleaned and loaded, how are we making sure we are not getting incomplete data?
- Do we get changed data or complete data (and if complete, why can't we get delta data)?
- How frequently does the data get updated in the source?
- What is the volume of the source (delta or complete)?
- If the volume of delta records is huge, can we increase the frequency of the load?
- What is the percentage of growth after 2 and 5 years?
- What is the record length of the source?
- Is the source partitioned?
- Is the source in Detroit or in another location?

Target
- Is this target table populated with data from multiple source systems?
- If yes, how are we differentiating each source system?
- Is this target table populated with data from multiple tables?
- If yes, how are we differentiating the detail from each table?
- Are there any integrity/referential constraints that should be satisfied?
- What would happen if an integrity constraint is not satisfied during the load?
- If data comes from multiple locations, how are we making sure it gets into the system with a single format for dates and other fields in this table?
- What are the business rules around this table load?
- How are we making sure that data is not shown to the business while we are in the process of loading it?

Scheduling
- How many jobs are we going to have as part of this review?
- Is the number of jobs more or less than the standard?
- Can we combine or split jobs to reduce the number of jobs or the complexity, respectively?
- In what time window should this job be executed?
- If we have a resource bottleneck in the specified time window, can we run this schedule at a non-busy time?
- How is the dependency scheduled for the master-child relationship?
- How do we get the initial data into the system?
- If we have a separate process for the initial load, will there be a separate review for it?
- What would happen if two schedules are deployed in the system at the same time?
- How do we avoid that scenario?
- Will this flow satisfy all of the requirements?

Lookups & Joins
- What are the volume and growth figures for the lookup/join tables?
- How much space will be taken on the work node if we use a lookup?
- How much space is required on the work node if we use a join?
- Do we expect any mismatches when we do this operation, and are we handling them?
- At what time and frequency is this going to be scheduled?
- What is the overall impact of this on the database and DataStage environments?
- What is the overall resource utilization after this implementation (complete, not just this job or project)?
- Are we carrying any unnecessary columns to the next level?

Data Quality / Business Rules
- Are all possibly nullable columns handled for null?
- If a value is null, are we defaulting to the proper value?
- Do we need all the columns in the target (consider future requirements as well)?
- Is there a code-values column (like address type) in the table?
- Is the code value what the business would like to see, or the description?
- Do we need validation for the code values, if any?
- Do we need range validation on any of the fields?
- Run through the flow with exceptional (negative) data in it.
- Are we loading the data if it violates one or more validation or business rules?
- Is there any possibility of late-arriving transactions?
- Have we handled late-arriving transactions appropriately?
- How are updates and inserts handled?

Exception Handling
- Do we expect any rejections in the process?
- Do we have exception handling at the stage level or for a group of stages (for records that don't satisfy a condition)?
- How is an abort in the middle of the flow handled?
- Are we going to restart or re-run after an abort?
- How are restart and re-run handled so that they do not impact data integrity?
- What if the environment is not up and the job is failing because of that?
- In what ways could data get populated wrongly in the target table?
- Is there a recovery plan if data does get populated wrongly?
- Would it be OK if the process loads only part of the data?

Restartability
- Is this flow restartable at any point (job) of failure?
- Should it be re-run if a failure happens?
- What would need to be done if the source data is cleaned up by the source system (for re-processing, or during a failure of processing)?
- Is there a possibility of the same data being processed more than once?

Knowledge Sharing
- How busy is the system during the scheduled time window?
- Update on resource availability and resource utilization/sharing.
- Partitioning/sorting.

Environmental
- Is all metadata available?
- Are the ETL-related objects available, and does the developer have access to them?
- Are the database-related objects available?
- Is all connection information available?
- Is source database access available?
- Is target database access available?
ETL Code Review Checklist

Guidelines to be followed when preparing for Design and Code Reviews, presented in checklist format.

1. Design jobs for restartability; if they are not designed that way, what is the reason?
2. Do not follow a Sequential File stage with "Same" partitioning.
3. Check that the APT_CONFIG_FILE parameter is added; this is required to change the number of nodes at runtime (a sample configuration file is sketched after this checklist).
4. Do not hard-code parameters.
5. Do not hard-code directory paths.
6. Do not use fork-joins to generate lookup data sets.
7. Use "Hash" aggregation for a limited number of distinct key values. Outputs after all rows are read.
8. Use "Sort" aggregation for a large number of distinct key values. Data must be pre-sorted. Outputs after each aggregation group.
9. Use multiple aggregators to reduce collection time when aggregating all rows. Define a constant key column using a Row Generator. The first aggregator sums in parallel; the second aggregator sums sequentially.
10. Make sure sequences are not too long. Break them up into logical units of work.
11. Is the error handling done properly? It is preferred to propagate errors from lower jobs to the highest level (e.g. a sequence).
12. What is the volume of the extracted data (is there a WHERE clause in the SQL)?
13. Are the correct scripts to clean up datasets after job completion invoked?
14. Is there a reject process in place?
15. Can we combine or split jobs to reduce the number of jobs or the complexity, respectively?
16. It is not recommended to increase the number of nodes if there are too many stages in the job (this increases the number of processes spun off).
17. What are the volume and growth figures for the lookup/join tables?
18. Check whether there is a SELECT * in any of the queries. SELECT * is not advised; instead, the required columns have to be listed in the statement (see the query sketch after this checklist).
19. Check the partitioning and sorting at each stage.
20. When a sequence is used, make sure none of the parameters passed are left blank.
21. Check that there are separate jobs for at least extract, transform, and load.
22. Check that there is an annotation for each stage and for the job; the job properties should have the author, date, etc. filled out.
23. Check the naming convention of the jobs, stages, and links.
24. Try to avoid Peek stages in production jobs; Peeks are generally used for debugging during development.
25. Make sure the developer has not suppressed warnings that are valid.
27. Verify that the jobs conform to the flat file and dataset naming specification. This is especially important for cleaning up files and logging errors appropriately.
28. Verify that all fields are written to the reject flat files. This is necessary for debugging and reconciliation.
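
For item 3: pointing APT_CONFIG_FILE at different configuration files is what allows the same compiled job to run on a different number of nodes per environment. Below is a minimal two-node configuration file sketch; the host name and directory paths are placeholders, not values taken from this document.

    {
        node "node1"
        {
            fastname "etl-host"
            pools ""
            resource disk "/data/datasets" {pools ""}
            resource scratchdisk "/data/scratch" {pools ""}
        }
        node "node2"
        {
            fastname "etl-host"
            pools ""
            resource disk "/data/datasets" {pools ""}
            resource scratchdisk "/data/scratch" {pools ""}
        }
    }

The job can then expose $APT_CONFIG_FILE as a parameter and be pointed at, for example, a 2-node file in development and a larger file in production, without recompiling.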
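
For items 12 and 18: the extract SQL should name only the required columns and, where possible, limit the volume with a WHERE clause. The table, column, and parameter names below are purely illustrative.

    -- Avoid: SELECT * FROM customer_address;
    -- Prefer an explicit column list plus a delta-limiting predicate.
    SELECT cust_id,
           addr_type_cd,
           addr_line_1,
           updated_ts
    FROM   customer_address
    WHERE  updated_ts >= '#LAST_EXTRACT_TS#'  -- job-parameter reference; exact syntax depends on the stage used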