
Ab Initio

 A suite of products which together provide a platform for data processing applications
 Main Ab-Initio products are:
 Co>Operating System
 Component Library
 GDE
 EME
 Data Profiler
 Conduct>It

The introduction includes:


 Dataset Components
 Input File, Output File
 Input Table, Output Table
 Sort Components
 Sort
 Transform Components
 Reformat
 Multi Reformat
 Filter by Expression
 Dedup Sorted
 Join
 Rollup
 Scan
 Normalize
 Partition Components
 Partition by Key
 Partition by Round-robin
 Partition by Expression
 Departition Components
 Gather
 Concatenate
 Merge
Ab-Initio Concepts & Performance Tuning
 Phases & Checkpoints

 The difference between a phase and a checkpoint comes down to how the temporary files containing
the data landed to disk are handled.
Phases are used to break up a graph so that it does not use up all the memory; they reduce the number of components
running in parallel and hence improve performance (used for performance fine-tuning, by managing resources in a
sensible manner).
Checkpoints are used for the purpose of recovery.
A phase is a stage in a graph that runs to completion before the next stage starts.
A checkpoint is an intermediate stopping point in the graph used to safeguard against failure.
We can have phases without checkpoints.
We cannot assign checkpoints without phases.

In other words:

You can have checkpoints only at phase breaks.

The major difference between the two is that phasing deletes the intermediate files created at the end of each phase
as soon as the graph enters the next phase. Checkpointing, on the other hand, keeps these intermediate files
until the end of the graph. Thus we can easily use the intermediate files to restart the process from where it failed,
which cannot be done with phasing alone.

Phases are used to dedicate resources such as memory, disk space, and CPU cycles to the most demanding
part of the job. Say we have memory-consuming components in a straight flow and the data volume is in the millions of records; we
can separate that processing into its own phase so that more resources are available to it and the whole
process finishes in less time.

Checkpoints, in contrast, are like save points while playing a PC game. They are required if we need to restart the graph
from the last saved phase recovery file (a phase break with checkpoint) when it fails unexpectedly.

Using phase breaks that include checkpoints degrades performance somewhat but ensures a safe restart
point. The Toggle Checkpoint option can be used to remove the checkpoint from a phase break.
 Parallelism
 Dynamic Script Generation
 Plans & Psets
Plan
 A plan is an Ab-Initio Conduct>It feature
 It is a representation of all the interrelated elements of a system
 Using a plan, you can control the sequence, relationships, and communication between tasks by how you
connect the tasks and by how you specify methods and parameters. You also control how tasks use system
resources and how to group tasks for safe recovery
 A subplan is a complete Conduct>It plan embedded in a larger plan
Pset
 A pset is a file containing a set of input parameter values that reference a graph or plan
 Every .pset file contains information linking it back to the original graph or plan it was created from

DML Overview:--

 record
string(10) name;
decimal(10) roll_no;
string("\n") newline;
end;
Useful DML Utilities: m_eval, m_dump
 m_eval
 Evaluates DML expressions and displays their derived types
 Used to test and evaluate simple, multiple, cast, and other expressions that you want to use in a graph
E.g.: $ m_eval '(date("YYYYMMDD")) (today() - 10)'  =>  "20041130"
 m_dump
 Prints information about data records, their record formats, and the evaluations of expressions
E.g.: $ m_dump -string "record int a; string(12) b; double c; end" -describe
• Record formats are set in the following 2 ways :
Use a file
Embed

• Embed – The record format is written for each port in the below format:
record
string("\x01", maximum_length=7) clm_nbr;
decimal("\x01") agr_id;
date("YYYY-MM-DD")("\x01") eff_strt_dt;
end;
• Use file – A DML file is created which contains only the record format and it is stored in the DML folder of the
sandbox.
• In the component we specify the path for this DML to import the record format.

Q. What is the relation between eme, gde and co-operating system?


EME stands for Enterprise Metadata Environment, GDE for Graphical Development Environment, and the Co>Operating System can be thought of as the Ab Initio server.
The relation between them is as follows: the Co>Operating System is the Ab Initio server; it is installed on a particular OS
platform, which is called the native OS. The EME is comparable to the repository in Informatica; it holds the metadata, transformations,
db config files, and source and target information. The GDE is the end-user environment where we develop the graphs
(comparable to mappings in Informatica); the designer uses the GDE to design graphs and saves them to the EME or to a sandbox on the user side,
whereas the EME is on the server side.
Informatica vs Ab Initio

Feature                    Ab Initio                                                  Informatica

About tool                 Code-based ETL                                             Engine-based ETL

Parallelism                Supports three types of parallelism                        Supports one type of parallelism

Scheduler                  No scheduler                                               Scheduling through script available

Error handling             Can attach separate error and reject files                 One file for all

Robustness                 Robust, by function comparison                             Basic in terms of robustness

Feedback                   Provides performance metrics for each component executed   Debug mode, but slow implementation

Delimiters while reading   Supports multiple delimiters                               Only a dedicated delimiter

Q. What kinds of layouts does Ab Initio support?


Basically, there are serial and parallel layouts supported by Ab Initio. A graph can have both at the same time. The parallel one
depends on the degree of data parallelism. If the multifile system is 4-way parallel, then a component in the graph can run 4-way
parallel if its layout is defined so that it matches the degree of parallelism.
What is the diff b/w look-up file and look-up, with a relevant example?
Generally, Lookup file represents one or more serial files (Flat files). The amount of data is small enough to be held in the memory.
This allows transform functions to retrieve records much more quickly than it could retrieve from Disk.
How many components in your most complicated graph?
It depends on the type of components you use. Usually, avoid using overly complicated transform functions in a single graph.

Have you used rollup component? Describe how?


If the user wants to group records on particular field values, then Rollup is the best way to do that. Rollup is a multi-stage transform
and it contains the following mandatory functions:
1. Initialize
2. Rollup
3. Finalize
You also need to declare a temporary variable if you want to get counts for a particular group.
For each group, the initialize function is called once first, followed by a rollup call for each of the records in the
group, and finally the finalize function is called once after the last rollup call.
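A minimal sketch of such a rollup transform, counting the records in each group (the field names cust_id and cnt are illustrative, not taken from any particular graph):

type temporary_type =
record
decimal(8) cnt;
end; /* temporary variable holding the running count */

temp :: initialize(in) =
begin
temp.cnt :: 0;
end;

temp :: rollup(temp, in) =
begin
temp.cnt :: temp.cnt + 1;
end;

out :: finalize(temp, in) =
begin
out.cust_id :: in.cust_id; /* the rollup key */
out.cnt :: temp.cnt;       /* number of records in the group */
end;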

Q. How do you improve the performance of a graph?


There are many ways the performance of the graph can be improved.
1) Use a limited number of components in a particular phase
2) Use optimum value of max core values for sort and join components
3) Minimize the number of sort components
4) Minimize sorted join component and if possible replace them by in-memory join/hash join
5) Use only required fields in the sort, reformat, join components
6) Use phasing/flow buffers in case of merge, sorted joins
7) If the two inputs are huge then use sorted join, otherwise use hash join with proper driving port
8) For large dataset don’t use broadcast as partitioner
9) Minimize the use of regular expression functions like re_index in the transfer functions
10) Avoid repartitioning of data unnecessarily
Try to run the graph in MFS for as long as possible. For this, the input files should be partitioned and, if possible, the output files should also be
partitioned.
Q. Have you ever encountered an error called “depth not equal”?
When two components are linked together by a straight flow and their layouts do not match, this problem can occur during compilation of the
graph. A solution to this problem is to use a partition (or departition) component in between wherever the layout changes.
2) Explain what is the architecture of Abinitio?
Architecture of Abinitio includes

 GDE (Graphical Development Environment)


 Co-operating System
 Enterprise meta-environment (EME)
 Conduct-IT
3) Mention what is the role of Co-operating system in Abinitio?

The Ab Initio Co>Operating System provides features such as:

 Managing and running Ab Initio graphs and controlling the ETL processes
 Providing Ab Initio extensions to the operating system
 Monitoring and debugging ETL processes
 Metadata management and interaction with the EME
7) List out the file extensions used in Abinitio?

The file extensions used in Abinitio are

 .mp: It stores Ab initio graph or graph component


 .mpc: Custom component or program
 .mdc: Dataset or custom data-set component
 .dml: Data manipulation language file or record type definition
 .xfr: Transform function file
 .dat: Data file (multifile or serial file)

 9) Explain how you can run a graph infinitely in Ab initio?

 To execute a graph infinitely, the graph's End Script should call the .ksh file of the graph itself. Therefore, if the graph name is
abc.mp, then the End Script of the graph should call abc.ksh. This makes the graph run indefinitely.
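For example (illustrative only; $AI_RUN here stands for the sandbox's run directory and is an assumption, not taken from a specific project), the End Script of abc.mp could contain a line such as:

nohup $AI_RUN/abc.ksh &

so that the deployed script re-launches itself each time the graph completes.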
Mention what is the difference between a "Lookup File" and "Lookup" in Abinitio?

A Lookup File defines one or more serial files (flat files); it is a physical file where the data for the lookup is stored. Lookup,
on the other hand, is the component of an Ab Initio graph where we can hold data and retrieve it by using a key parameter.

Mention what are the different types of parallelism used in Abinitio?

Different types of parallelism used in Abinitio includes

 Component parallelism: A graph with multiple processes executing simultaneously on separate data uses component parallelism
 Data parallelism: A graph that works with data divided into segments and operates on each segment simultaneously
uses data parallelism.
 Pipeline parallelism: A graph that deals with multiple components executing simultaneously on the same data uses
pipeline parallelism. Each component in the pipeline reads continuously from the upstream components, processes data,
and writes to downstream components, so both components can operate in parallel.
2) Explain what is Sort Component in Abinitio?

The Sort Component in Abinitio re-orders the data. It has two main parameters, "Key" and "Max-core".

 Key: It is one of the parameters for sort component which determines the collation order
 Max-core: This parameter controls how often the sort component dumps data from memory to disk
13) Mention what dedup-component and replicate component does?
 Dedup component: It is used to remove duplicate records
 Replicate component: It combines the data records from the inputs into one flow and writes a copy of that flow to each
of its output ports
Mention what is a partition and what are the different types of partition components in Abinitio?

In Abinitio, partition is the process of dividing data sets into multiple sets for further processing. Different types of partition
component includes
 Partition by Round-robin: Distributes data evenly, in block-sized chunks, across the output partitions
 Partition by Range: Divides data evenly among nodes, based on a set of partitioning ranges and a key
 Partition by Percentage: Distributes data so that the output is proportional to fractions of 100
 Partition by Load balance: Performs dynamic load balancing
 Partition by Expression: Divides data according to a DML expression
 Partition by Key: Groups data by a key

Explain what is de-partition in Abinitio?


De-partitioning is done in order to read data from multiple flows or operations and is used to re-join data records from
different flows. There are several de-partition components available, which include Gather, Merge, Interleave, and
Concatenate.

List out some of the air commands used in Abinitio?

Air commands used in Abinitio include

 air object ls <EME path for the object, e.g. /Projects/edf/..> : It is used to see the listing of objects in a directory inside the
project
 air object rm <EME path for the object, e.g. /Projects/edf/..> : It is used to remove an object from the repository
 air object versions -verbose <EME path for the object, e.g. /Projects/edf/..> : It gives the version history of the object.
 air versions -verbose <path to the object in EME>
 air sandbox diff -version 437959 -version 397048

Other air commands for Abinitio include air object cat, air object modify, air lock show user, etc.

18) Mention what is Rollup Component?


The Rollup component enables users to group records on certain field values. It is a multi-stage transform and consists of the
initialize, rollup, and finalize functions.

19) Mention what is the syntax for m_dump in Abinitio?

m_dump in Abinitio is used to view the data in a multifile (or serial file) from the Unix prompt. The command for m_dump
includes

 m_dump a.dml a.dat: This command will print the data as it appears in the GDE when we view data as formatted text
 m_dump a.dml a.dat > b.dat: The output is redirected to b.dat, which will act as a serial file; b.dat can be referred to when
it is required.
 What is PDL in Abinitio?
 PDL is a new concept introduced in later versions of Abinitio. Using this feature you can run a graph without deploying it
through the GDE. The .mp file can be executed directly using the air sandbox run command, which sets up
the host environment. To summarize, it is a kind of parameterized environment.
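For example (the sandbox path and pset name below are illustrative), a graph can be run straight from its pset:

air sandbox run /disk1/sandboxes/my_sandbox/pset/my_graph.pset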

What is max core value ? what is the use of it?


Q: Can anybody give me a clear explanation of how to separate header, trailer, and body records in Ab Initio?
Answer: You will have an indicator field (something like
record_indicator, which identifies whether the record is
a header, trailer, or detail record) in your DML. So, use a
Partition by Expression component in your graph and, based on
the indicator values, separate the records.

i.e. give the expression in the PBE component something like:

if (record_indicator == "H") 0 else if (record_indicator == "T") 1 else 2;

Or: use next_in_sequence() > 1 in a Reformat to remove the header, and
use the entire record as a parameter in Dedup to eliminate the
trailer record.
Q: What is regex (lookup)? When should you use it? How is it used in an Ab Initio graph?
Questions:
Why does the creation of temporary files depend on the value of MAX-CORE?
10. What is the difference between the abinitiorc and .abinitiorc files?
11. What is the use of allocate()?
12. What is the use of a branch in EME?
13. How can you break a lock in EME? How can you lock a file so that no one other than the EME admin can break it?
14. When should you use ablocal()? How can you use ablocal_expr?
15. Why should you not keep the layout as 'default' for an Input Table component?
16. What is a dynamic lookup?
17. What is a dependent parameter?
18. What is BRE? (Business Rule Environment - a recent addition to the Ab Initio package)
19. What is output index?
20. How can you track the records that are not selected by the 'select' parameter in a Reformat component?
21. Can we have more than one launcher process for a particular graph? How about the agent?
22. There are a lot of new functions added in 2.15; you can be asked about them.
23. How can you run multiple instances of a graph in parallel?
24) Difference between force_error & force_abort?
force_error can be used by the developer to exit by throwing a user-specified error.
Eg: if (age < 18) force_error("Age not suitable for Voting")

In the above case the graph fails with exit code 3, along with an error message in the error log file saying "Age not suitable for Voting".

Whereas force_abort just fails the graph without any message, also with return code exit 3 for failure.

 Which version are you using?


 What are the components you have worked with?
 What scheduling tool have you used?
 Tell me about Join.
 What are the parameters of Join? Explain all of them.
 What is a semi-join?
 What are cartesian joins?
 What is a lookup?
 Which partition components have you worked with?
 What is the difference between Merge and Concatenate?
 What is the difference between Scan and Rollup?
 How will you test a dbc file from the command prompt?
 What is a generic graph?
 What is a wrapper script?
 What are m_commands?
 What is the purpose of m_rollup?
 How do you write conditional DML if the file has a huge amount of data?
 How do you convert a 4-way MFS to a 16-way MFS?
 How would you do performance tuning for an already built graph?
 Explain the sed and awk commands with some suitable examples.
 Write two sample scripts you wrote previously.
:---

Air Project Commands to perform CHECK IN :-


 Copying files from your sandbox to EME datastore. This process is called check-in.
 The command line syntax:
air project import <project-path> -basedir <basedir > { [-files <relative path >] \ [-force ] }
To check-in a file from sandbox to EME datastore.
air project import /Projects/bi/sbis/cmi/cmi_loss
-basedir /ab_sand_bi_01/sbis/n516dd/n516dd/sandbox/cmi/cmi_loss
-files mp/LHST1101CLAIMClmPty_blddata_JulVer.mp

Perform Check-out
 Copying one or more files, projects from EME datastore to sandbox. This process is called Check-out.
 The command line syntax:
air project export <project-path> -basedir <basedir > { [-files <relative path >] \ [-force ] }

To check-out a file from EME datastore to sandbox


air project export /Projects/bi/sbis/cmi/cmi_loss
-basedir /ab_sand_bi_01/sbis/n4c980/n4c980/sandbox_new/cmi/cmi_loss
-files mp/LHST1101CLAIMClmPty_blddata_JulVer.mp
Ex: To check out a file which is already present in the sandbox (use -force to overwrite it):
air project export /Projects/bi/sbis/cmi/cmi_loss
-basedir /ab_sand_bi_01/sbis/n4c980/n4c980/sandbox_new/cmi/cmi_loss
-force -files mp/LHST1031CLAIMCatEvntLs_Blddata_JulVer.mp

What is Locking?

How to check if an object is Locked?


 Use the air lock show command.
 The air lock show command displays the list of locks for a given user, object, or project, or all locks in the EME datastore.
 The command line syntax:
air lock show { [ -user <username> ] | [ -object <object name> ] | [ -project <project name> ] | [ -all ] }
 The display lists the path, username, time, and status of each lock.
 air lock show -user n4c980

When you need a Lock?


 Before you can modify a graph or a file and check it in, you must first lock it in your sandbox. The locking mechanism gives you
exclusive permission to edit an object while preventing other users from making changes on the same object.
 Using air lock set command we can set or modify the lock on a project/object/file.
 Command line syntax:
air lock set [-force] | [-breakable | -unbreakable] | [-auto-release | -manual-release] | [-modify attribute] { -object object | -project project }
 To modify a lock, one should be the owner or TR administrator.

To set whether other users can break this lock on an object or not (by default, locks are breakable):
air lock set -unbreakable /Projects/bi/sbis/cmi/cmi_loss/mp/LHST1031CLAIMCatEvntLs_Blddata_JulVer.mp

Air lock release Command


 Unlocks a lock on an object, on all objects in a project, or on all objects in all projects within EME datastore.
 Only a TR administrator or lock owner can release a lock.
 The command line syntax:
air lock release { [-object object] | [-project project] ..| [-all]}

Air Tag command for Listing


 Lists information about tags in technical repository.
 The command line syntax:
air tag list [-flags] [tag | object]
Ex: air tag list -o /Projects/bi/sbis/cmi/cmi_loss /pset/LHST_MIS417138_CLCV_STS_STS_CD_1S.pset
Here, -o in the command lists all the tags on an object.
Ex: air tag list -c CMI_LOSS_20120607065616.FULL
Here, -c in the command lists the comments on the tag.
Air Tag command for Listing Version difference
 Lists the version differences between the objects in two tags.
 The command line syntax:
air tag diff [-show-all] tag1 tag2
Here, tag1 and tag2 are the names of the two tags to be compared; -show-all also lists objects which have no difference.

Use air sandbox revert to revert an object to a previous version.

Dynamic Output DML generation based on values in input field and then dynamic xfr generation

Hello everyone, I need help to achieve the below requirement:

I have a input table, which has the columns as :

Policy_number | Coverage_id

I want in the output:
1) only one record, which will have:

i) a policy_number_count column --> contains the count of distinct policy_number values

ii) coverage_id_1, coverage_id_2, coverage_id_3 ... coverage_id_N columns --> contain the counts of the corresponding
Coverage_ids

Note : Number of coverage id columns in output dml would be created based on distinct coverage_id values present in input
coverage_id column

Look at the scenarios below to get a clear picture of what I exactly want:

Scenario 1:

Lets say i/p table is having following data:

Policy_number Coverage_id

1 1

2 1

3 1

Expected o/p:

Policy_number_distinct_count Coverage_id_1
3 3

As coverage_id in input has only one distinct value(i.e., '1') - so there should be one coverage_id column in output with
columnname as coverage_id_1

Scenario 2:

i/p:

Policy_number Coverage_id

1 1

2 1

3 2

Expected o/p:

Policy_number_distinct_count Coverage_id_1 Coverage_id_2

3 2 1
As coverage_id in input has two distinct values('1' and '2 ') - so there should be two coverage_id columns in output with
columnnames as coverage_id_1 and coverage_id_2 respectively

Scenario 3:

i/p:

Policy_number Coverage_id

1 1

2 1

3 2

4 3

5 3

Expected o/p:

Policy_number_distinct_count Coverage_id_1 Coverage_id_2 Coverage_id_3

5 2 1 2

As coverage_id in input has three distinct values(1,2 and 3 ) - so there should be three coverage_id columns in output with
columnnames as coverage_id_1,coverage_id_2 and coverage_id_3 respectively

Scenario 4:

i/p:

Policy_number Coverage_id

1 1

1 1

1 2

2 3

Expected o/p:

Policy_number_distinct_count Coverage_id_1 Coverage_id_2 Coverage_id_3

2 2 1 1

As there are two distinct policy_numbers in the input, policy_number_count in the output should have '2' as its value.
I have implemented a solution for the above requirement for a fixed set of coverage_id values in the input using Rollup, but in the above
case I want a solution where coverage_id in the input can have any set of values (i.e., not fixed), and the output columns/DML
should be created based on that.

Thanks in advance.

Answer:---
1) First sort by Policy_number, then dedup it, then use a Rollup and take the count.
2) Use another Sort by Coverage_id and use a Rollup with the below xfr [NOTE: USE A SORT IF REQUIRED]

type temporary_type = record
decimal("|") a;
end; /*Temporary variable*/

temp :: initialize(in) =
begin
temp.a :: 0;
end;

temp :: rollup(temp, in) =
begin
temp.a :: temp.a + 1;
end;

out :: finalize(temp, in) =
begin
out.a :: string_concat(temp.a, "+", in.b);
end;

DML:
record
string('\n') a;
end

3) Use concatenate component to combine the file (in.0 -> count and in.1 -> 2nd rollup)

4) Then use the run program component with script

#!/bin/ksh
cut -d "+" -f1 /data/sandboxes/jprathap/jaga_ts/dm1.dat > /data/sandboxes/jprathap/jaga_ts/dm4.dat
export b=1;
export b1=`wc -l /data/sandboxes/jprathap/jaga_ts/dm1.dat | cut -d " " -f1`
awk '
{
for (i=1; i<=NF; i++) {
a[NR,i] = $i
}
}
NF>p { p = NF }
END {
for(j=1; j<=p; j++) {
str=a[1,j]
for(i=2; i<=NR; i++){
str=str" "a[i,j];
}
print str
}
}' /data/sandboxes/jprathap/jaga_ts/dm4.dat > /data/sandboxes/jprathap/jaga_ts/dm5.dat
while read a
do
if [ $b == 1 ]
then
echo "record
decimal ('|') Policy_number;" > /data/sandboxes/jprathap/jaga_ts/dml/dm3.dml
elif [ $b != $b1 ]
then
b2=`echo $a | cut -d "+" -f2`;
echo "decimal ('|') Coverage_id_$b2;" >> /data/sandboxes/jprathap/jaga_ts/dml/dm3.dml
elif [ $b == $b1 ]
then
b2=`echo $a | cut -d "+" -f2`;
echo "decimal ('\n') Coverage_id_$b2;
end;" >> /data/sandboxes/jprathap/jaga_ts/dml/dm3.dml
else
echo "0";
fi
((b++))
done < /data/sandboxes/jprathap/jaga_ts/dm1.dat
sed -i 's/ /|/g' /data/sandboxes/jprathap/jaga_ts/dm5.dat

NOTE:- CHANGE THE PATH NAME ACCORDINGLY.

Now
/data/sandboxes/jprathap/jaga_ts/dm5.dat --it will be the output flat file and

/data/sandboxes/jprathap/jaga_ts/dml/dm3.dml --it will the dynamic dml

I have tried it with all of your inputs and it is working fine.


ex: INPUT DATA
Policy_number Coverage_id
1 1
2 1
3 2
4 3
5 3

OUTPUT DATA:

Policy_number Coverage_id_1 Coverage_id_2 Coverage_id_3


5 2 1 2

How to return multiple records from lookup in Ab Initio?

LOOKUP SYNTAX:- lookup("RSRV_CD Lookup file", string_lrtrim(RSRV_CD)).RSRV_ID

"Lookup returns a single record (I mean a single field from that record) even when multiple records match with key field. Suppose a
lookup file contains multiple records for the matched key, and I want all those records (I mean again fields from all matching
records) to be retrieved. How can I emulate this behavior? I am permitted to use any components except Join. Please help. Thank
you.
Answer:-
You will need to first use lookup_count to get the number of matching records in the lookup file, and then loop
that many times using the lookup_next function.
The lookup_next function moves the pointer forward as long as there are matching records.

Let me know, if you face any difficulties in achieving this.


Answer 2:- Yes, it's quite possible. If you need the output as individual records, then how about a Normalize without vectors?

In the Normalize component, define the length function as lookup_count("lookup", in.key);

There you go: the normalize function now knows how many times it has to run for each record in the input dataset, i.e. the
normalize function will be called as many times as the value returned by the length function. Now use the lookup_next function to get the
subsequent matching records. Each time the normalize function is called, it produces an output record; in this scenario it
produces one output record per matching record in the lookup file.

But could you let me know what your business case is if there are no matches in the lookup file for an input record?
I was just trying to understand your business requirement, i.e. what should happen if there are no matches in the lookup.

If you need to reject the input record, then prioritize the rules of the length function of the Normalize component as below:

ANSWER 3:-
out :1: lookup_count("lookup_file", key);
out :: 1;

The reason is that if there is no match (i.e. the count is 0), the normalize function will not be called, so prioritize the rules as above.

Now in the normalize function you can implement the reject mechanism, for example using force_error to reject the record.
out :: length(in) =
begin
    let decimal(",") idx_cnt = lookup_count("lookup_file", in.key);
    out :1: if (idx_cnt != 0) idx_cnt;
    out :: 1;
end;

type lkp_type =
record
    string(",") key;
    string(",") prod_type;
    string("\n") prod_man_name;
end;

out :: normalize(in, index) =
begin
    let lkp_type lkp_rec;

    if (lookup_count("lookup_file", in.key) != 0)
    begin
        if (index == 0)
            lkp_rec = lookup("lookup_file", in.key);
        else
            lkp_rec = lookup_next("lookup_file");
    end

    out.prod_type :1: lkp_rec.prod_type;
    out.prod_type :: "";
    out.prod_man_name :1: lkp_rec.prod_man_name;
    out.prod_man_name :: "";
    out.key :: in.key;
end;

No need for a vector structure in the output dml unless you wanted to create any based on your requirement.
Ab Initio Differences for In-Memory Sort With Multifiles

Questions:-
I am running a simple graph with 2 multifiles (one with huge volume (8 billion) and other file small volume (1 million).
I have followed 2 approaches.
1. I have used the PBKS component before join for the 2 files and I have done inner join.
2. I have used the in-memory sort, keeping huge volume for the driving port and applied inner join.
So, I see the difference in the output. Want to know, is there any difference between the 2 approaches when working with
multifiles?
Answer:-
The logic of MFS is
Partition_1 file of in0 port will always look into Partition_1 file of in1 port
Partition_2 file of in0 port will always look into Partition_2 file of in1 port and so on...
Partition_1 never look into Partition_2 data
-> Since you have used PBKS in the first approach, all records with the same key are in the same partition, so it gives the expected result; but in
the second approach the same key may be split across different partitions, so the results may differ.
-> Never use an in-memory join for a huge file; it will slow down the performance.

How to Handle Tab Separated Records Which Have Some Tabs and Newline Characters in the Data Ab Initio?

Can someone please help on how to handle tab-separated records which have some tabs and newline characters in the data in
Ab Initio? I'm new to Ab Initio. One of my friends suggested using the RSV component, but I'm not able to figure it out.
Answer:-
(1) OP had a file with field separators, but some process had folded the rows over multiple lines: that is, the data had
extra newlines in most of the records.

Logical solution: we knew there should be 15 fields in each row. So we appended lines from the input file until there
were 15 (or more) fields, and output that line. So the field separators told us where the real line breaks should be.

(2) OP had a CSV file where just one field could have extra commas in the data (but it had not been quoted as
demanded by the CSV specification).

Logical solution: There were always 13 data fields, and field 4 was the only one that had extra commas (it was an
address). So we identified fields 1-3 by counting from the front, and fields 5-13 by counting from the end of each line,
and field 4 was anything in the middle, and we quoted it to make valid CSV. In that case, the line breaks told us
where the trailing fields should be.

If your file format has that much internal consistency, we can fix it. If it really has random characters, nobody can fix
it, because information has been destroyed when it was created.
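A rough ksh/awk sketch of approach (1), assuming tab-delimited data with 15 expected fields per logical record; the file names and field count here are illustrative, not from the original post:

awk '
{
    # re-join folded lines: an embedded newline is replaced by a space
    buf = (buf == "" ? $0 : buf " " $0)
    # once the buffer holds at least 15 tab-separated fields,
    # we have a complete logical record, so emit it and start over
    if (split(buf, f, "\t") >= 15) { print buf; buf = "" }
}' broken_input.tsv > repaired_output.tsv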

Use of Redefine Format Over Reformat

Hi, what is the use of Redefine Format over Reformat? What can Redefine Format do that Reformat cannot? In
the help file I saw the FAQ "REFORMAT versus REDEFINE FORMAT", but I can't get the exact difference between them.
According to the help file: 1. REFORMAT can actually change the bytes in the data. I am able to convert string(3) to
string(",") with Reformat, but with Redefine I can't, so this point is understandable. 2. REDEFINE
FORMAT simply changes the record format on the data as it flows through, leaving the data unchanged. I am able to
convert string to decimal with Reformat as well as with Redefine Format, for example string(3) to decimal(3).
Reformat also changes the record format leaving the data unchanged, so what is the exact use of Redefine Format
over Reformat? Please give your answers. Regards, Syam

Answer 1:-
To remove processing overhead.
Well, I am sure you know what Reformat does. It's the primary transform component of Ab Initio. Any kind of
standard record-by-record transformation is performed by it. This involves applying field-to-field transformation logic
and increasing or reducing the number of fields.

With Redefine Format we cannot do any kind of TRANSFORMATION as such.

Redefine Format is just used to RE-INTERPRET the same data with a different DML. For example, initially you might read
the file below with

10|abc|200|2015-01-01

string("\n") input_record;

You are reading the whole input record, but after a few components you need to interpret the same data with the actual
DML,

say

decimal("|") employee_id;
string("|") employee_name;
decimal("|") salary;
date("YYYY-MM-DD")("\n") joining_date;

Now you just need to provide this DML on the output port of Redefine Format and the component will automatically re-
interpret the same data with the new DML.
One major use of Redefine Format is when a file has header, trailer, and data records. Since the file has three
types of records with three different DMLs, we can read the file first using

string("\n") input_line;

But after we remove the header and trailer from the file and are left with the actual data records, we can re-interpret the
flow using a Redefine Format and start reading the data records with the actual DML.

Lookups and Abinitio errors


LOOKUP FILE
A lookup file is an indexed dataset and it actually consists of two files: one file holds the data and the other holds a hash index into
the data file. We commonly use a lookup file to hold in physical memory the data that a transform component frequently needs to
access.

LOOKUP FILE
How to use a LOOKUP FILE COMPONENT:
• To perform a memory-resident lookup using a Lookup File component:

• Place a LOOKUP_FILE component in the graph and open its Properties dialog.

• On the Description tab, set the Label to a name we will use in the lookup functions that reference this file.

• Click Browse to locate the file to use as the lookup file.

• Set the RecordFormat parameter to the record format of the lookup file.
• Set the key parameter to specify the fields to search.

LOOKUP FILE
• Set the Special attribute of the key to the type of lookup we want to do.

• Add a lookup function to the transform of the component that will use the lookup file.

• The first argument to a lookup function is the name of the lookup file. The remaining arguments are values to be matched against
the fields named by the key parameter of the lookup file.
lookup("MyLookupFile", in.key)
• If the lookup file key's Special attribute (in the Key Specifier Editor) is exact, the lookup functions return a record that matches the
key values and has the format specified by the RecordFormat parameter.
Partitioned lookup files:
Lookup files can be either serial or partitioned (multifiles). The lookup
functions we use to access lookup data come in both local and non-local
varieties, depending on whether the lookup data files are partitioned.
When a component accesses a serial lookup file, the Co>Operating System
loads the entire file into the component’s memory. If the component is
running in parallel (and you use a _local lookup function), the
Co>Operating System splits the lookup file into partitions.
The benefits of partitioning lookup files are:
1. The per-process footprint is lower. This means the lookup file as a whole can exceed the 2 GB limit.
2. If the component is partitioned across machines, the total memory needed on any one machine is reduced.

DYNAMIC LOOKUP
A disadvantage of statically loading a lookup file is that the dataset occupies a fixed amount of
memory even when the graph isn’t using the data.
By dynamically loading lookup data, we control how many lookup datasets are loaded, which
lookup datasets are loaded, and when lookup datasets are loaded. This control is useful in
conserving memory; applications can unload datasets that are not immediately needed and load
only the ones needed to process the current input record.
The idea behind dynamically loading lookup data is to:
1. Load the dataset into memory when it is needed.
2. Retrieve data with your graph.
3. Free up memory by unloading the dataset after use.

DYNAMIC LOOKUP
How to look up data dynamically:
To look up data dynamically:
1. Prepare a LOOKUP TEMPLATE component:
a. Add a Lookup Template component to the graph and open its Properties dialog.
b. On the Description tab of the Properties dialog, enter a label in the Label text box.
c. On the Parameters tab, set the RecordFormat parameter.
Here, we specify the DML record format of the lookup data file.
• Set the key parameter to the key we will use for the lookup.
• Load the lookup file using the lookup_load function inside a transform function.
DYNAMIC LOOKUP
For example, enter:
let lookup_identifier_type LID =
lookup_load(MyData, MyIndex, "MyTemplate", -1)
where:
LID is a variable to hold the lookup ID returned by the lookup_load function. This ID references the lookup file in memory.
The lookup ID is valid only within the scope of the transform.
MyData is the pathname of the lookup data file.
MyIndex is the pathname of the lookup index file.
If no index file exists, we must enter the DML keyword NULL. The graph creates an
index on the fly.

LOOKUP TEMPLATE component:


A LOOKUP TEMPLATE component substitutes for a LOOKUP FILE
component when we want to load lookup data dynamically.
Defining a lookup template:
When you place a LOOKUP TEMPLATE component in our graph, we define
it by specifying two parameters:
RecordFormat — A DML description of the data
key — The field or fields by which the data is to be searched
Note: In a lookup template, we do not provide a static URL for the
dataset’s location as we do with a lookup file. Instead, we specify the
dataset’s location in a call to the lookup_load function when the data is
actually loaded.

Appendable lookup files (ALFs)


Data has a tendency to change and grow. In situations where new data is arriving all the time,
static datasets loaded into memory at graph execution time are not up to the task. Even
dynamically loaded lookup datasets may require complex logic to check whether the data has
changed before using it.
Appendable lookup files (ALFs) are a special kind of dynamically loaded lookup file in which a
newly arriving record is made available to our graph as soon as the complete record appears on
disk. ALFs can enable the applications to process new data quickly— often less than a second
after it is landed to disk.
DYNAMIC LOOKUP
How to create an appendable lookup file (ALF):
To create an ALF:
Call the lookup_load function, specifying:
The integer -2 as the load-behavior argument to lookup_load
The DML constant NULL for the index
For example, this command creates an ALF using the data in the existing disk file mydata, and following the record format specified
in the lookup template My_Template:
let lookup_identifier_type LID =
lookup_load($DATA/mydata, NULL, "My_Template", -2)

Compressed lookup data:


The data stored in a lookup file can be either uncompressed or block-compressed. Block-compressed data
forms the basis of indexed compressed flat files (ICFFs). This kind of data is both compressed and divided
into blocks of roughly equal size.
Obviously, we can store data more efficiently when it is compressed. On the other hand, raw data can be
read faster, since it does not need to be uncompressed first.
Typically, we would use compressed lookup data when the total size of the data is large but only a
relatively small amount of it is needed at any given time.
Compressed LOOKUP
Block-compressed lookup data:
With block-compressed data, only the index resides in memory. The lookup function uses the index file to
locate the proper block, reads the indicated block from disk, decompresses the block in memory, and
searches it for matching records.
• Exact and range lookup operations only.

• The only lookup operations we can perform on block-compressed lookup data are exact and range.

• Interval and regex lookup operations are not supported.

• In addition, we must use only fixed-length keys for block-compressed lookup operations.
Compressed LOOKUP
Handling compressed versus uncompressed data:
The Co>Operating System manages memory differently when handling block-compressed
And uncompressed lookup data.
Uncompressed lookup data
Any file can serve as an uncompressed lookup file as long as the data is not compressed
and has a field you can define as a key.
We can also create an uncompressed lookup file using the WRITE LOOKUP (or
WRITE MULTIPLE LOOKUPS) component. The component writes two files:
a file containing the lookup data
and an index file that references the data file.
With an uncompressed lookup file, both the data and its index reside in memory. The
lookup function uses the index to find the probable location of the lookup key value in the
data file. Then it goes to that location and retrieves the matching record.

ICFF
An indexed compressed flat file (ICFF) is a specific kind of lookup file that can store large
volumes of data while also providing quick access to individual records.
Why use indexed compressed flat files?
A disadvantage of using a lookup file like this is that there is a limit to how much data we can
keep in it. What happens when the dataset grows large? Is there a way to maintain the
benefits of a lookup file without swamping physical memory? Yes, there is a way: it
involves using indexed compressed flat files.
ICFFs present advantages in a number of categories:
• Disk requirements — Because ICFFs store compressed data in flat files without the overhead associated with a DBMS, they
require much less disk storage capacity than databases — on the order of 10 times less.
• Memory requirements — Because ICFFs organize data in discrete blocks, only a small portion of the data needs to be loaded in
memory at any one time.
ICFF
• Speed — ICFFs allow us to create successive generations of updated information without any pause in processing. This means
the time between a transaction taking place and the results of that transaction being accessible can be a matter of seconds.
• Performance — Making large numbers of queries against database tables that are continually being updated can slow down a
DBMS. In such applications, ICFFs outperform databases.
• Volume of data — ICFFs can easily accommodate very large amounts of data — so large, in fact, that it can be feasible to take
hundreds of terabytes of data from archive tapes, convert it into ICFFs, and make it available for online access and processing.

ICFF
• ICFFs are usually dynamically loaded. To define an ICFF dataset, place a BLOCK-COMPRESSED LOOKUP TEMPLATE
component in your graph.
About the BLOCK-COMPRESSED LOOKUP TEMPLATE component :
• A BLOCK-COMPRESSED LOOKUP TEMPLATE component is identical to a LOOKUP TEMPLATE, except that in the former the
block_compressed and keep_on_disk parameters are set to True by default, while in the latter they are False.
Defining a BLOCK-COMPRESSED LOOKUP TEMPLATE component:
• When we place a BLOCK-COMPRESSED LOOKUP TEMPLATE component in the graph, we define it by specifying two
parameters:
RecordFormat — A DML description of the data
key — The field or fields by which the data is to be searched

Note: In a BLOCK-COMPRESSED LOOKUP TEMPLATE component, we do not provide a static URL for the dataset’s
location as we do with a lookup file. Instead, we specify the dataset’s location in a call to the lookup_load function when the data is
actually loaded.

Lookup Functions :

Lookup -- Returns the first record from a lookup file that matches a specified expression.
lookup_local -- Behaves like lookup, except that this function searches only one partition of a lookup file

lookup_match -- Searches for records matching a specified expression in a lookup file.

lookup_match_local -- Behaves like lookup_match, except that this function searches only one partition of a lookup file.

lookup_first -- Returns the first record from a lookup file that matches a specified expression. In Co>Operating System Version
2.15.2 and later, this is another name for the lookup function.

lookup_first_local -- Returns the first record from a partition of a lookup file that matches a specified expression. In Co>Operating
System Version 2.15.2 and later, this is another name for the lookup_local function.

lookup_last -- Returns the last record from a lookup file that matches a specified expression.

lookup_last_local -- Behaves the same as lookup_last, except that this function searches only one partition of a lookup file.

lookup_count -- Returns the number of records in a lookup file that match a specified expression.

lookup_next -- Returns the next successive matching record or the next successive record in a range,
if any, that appears in the lookup file.

lookup_nth -- Returns a specific record from a lookup file.

lookup_previous -- Returns the record from the lookup file that precedes the record returned by the last successful call to a lookup
function.

lookup_add -- Adds a record or vector to a specified lookup table.

lookup_create -- Creates a lookup table in memory.


lookup_load -- Returns a lookup identifier that you can pass to other lookup functions.

lookup_not_loaded -- Initializes a global lookup identifier for a lookup operation.

lookup_range -- Returns the first record whose key matches a value in a specified range. For use only with block-compressed
lookup files.

lookup_range_count -- Returns the number of records whose keys match a value in a specified range. For use only with block-
compressed lookup files.

lookup_range_last -- Returns the last record whose key matches a value in a specified range.

lookup_unload -- Unloads a lookup file previously loaded by lookup_load.

----------------------------------------------------------------------------------------------------------------------

Ab-Initio errors and resolution details:


What does the error message "Mismatched straight flow" mean?
Answer: This error message appears when you have two components that are connected by a straight flow and running at
different levels of parallelism. A straight flow requires the depths — the number of partitions in the layouts — to be the same for
both the source and destination. Very often, this error occurs after a graph is moved to a new environment. A common cause for
this error is that the depths set in the development environment do not match those in the new environment.
What does the error message "File table overflow" mean?
Answer: This error message indicates that the system-wide limit on open files has been exceeded. Either there are too many
processes running on the system, or the kernel configuration needs to be changed. This error message might occur if the maximum
number of open files allowed on the machine is set too low, or if max-core is set too low in the components that are processing
large amounts of data. In the latter case, much of the data processed in a component (such as a SORT or JOIN component) spills
to disk, causing many files to be opened. Increasing the value of max-core is an appropriate first step in the case of a sort, because
it reduces the number of separate merge files that must be opened at the conclusion of the sort. NOTE: Because increasing max-
core also increases the memory requirements of your graph, be careful not to increase it too much (and you might need
to consider changing the graph's phasing to reduce memory requirements). It is seldom necessary to increase max-core beyond
100MB. If the error still occurs, see your system administrator. Note that the kernel setting for the maximum number of system-wide
open files is operating system-dependent (for example, this is the nfile parameter on Unix systems), and, on many platforms,
requires a reboot in order to take effect. See the Co>Operating System Installation Notes for the recommended settings.

What does the error message "broken pipe" mean?


Answer: This error message means that a downstream component has gone away unexpectedly, so the flow is broken. For
example, the database might have run out of memory making database components in the graph unavailable. In general, broken
pipe errors indicate the failure of a downstream component, often a custom component or a database component. When the
downstream component failed, the named pipe the component was writing to broke. In the majority of cases, the problem is that the
database ran out of memory, or some other problem occurred during database load. There could be a networking problem, seen in
graphs running across multiple machines where a TCP/IP problem causes the sender to see a "Connection reset by peer"
message from the remote machine. If a component has failed, you typically see either of two scenarios.
What does the error message "Trouble writing to socket: No space left on device" mean?
Answer: This error message means your work directory (AB_WORK_DIR) is full. NOTE: Any jobs running when AB_WORK_DIR
fills up are unrecoverable. An error message like the following means you have run out of space in your work directory,
AB_WORK_DIR: ABINITIO: host.foo.bar: Trouble writing to socket: No space left on device Trouble creating layout "layout1": [B9]
/~ab_work_dir/host/a0c5540-3dd4143c-412c/history.000 [/var/abinitio/host/a0c5540- 3dd4143c-412c/history.000]: No space left on
device. Url: /~ab_work_dir/host/a0c5540-3dd4143c-412c/history.000 [/var/abinitio/host/a0c5540-3dd4143c-
412c/history.000]. Check the disk where this directory resides to see if it is full. If it is, you can try to clean it up. Note that although
utilities are provided to clean up AB_WORK_DIR, they succeed only for those files for which you have permissions (nonprivileged
users can clean up only the temporary files from their own jobs; root should be able to clean up any jobs). It is critically important that
you not clean up files that are associated with a job that is still running, or that you want to be able to recover later. Be aware that
some types of Unix filesystems allocate a fixed number of inodes (information nodes) when the filesystem is created, and you
cannot make more files than that. Use df -i to see the status of inodes. If you make many little files, inodes can run out well ahead
of data space on the disk. The way to deal with that would be to make sure any extraneous files on your system are backed up and
removed.

What does the error message "Failed to allocate bytes" mean?


Answer: This error message is generated when an Ab Initio process has exceeded its limit for some type of memory allocation.
Three things can prevent a process from being able to allocate memory: • The user data limit (ulimit -Sd and ulimit -Hd). These
settings do not apply to Windows systems. • Address space limit. • The entire computer is out of swap space.
What is ABLOCAL and how can I use it to resolve failures when unloading in parallel (Failed parsing SQL)?

Answer: Some complex SQL statements contain grammar that is not recognized by the Ab Initio parser when unloading in
parallel. In this case you can use the ABLOCAL construct to prevent the input component from parsing the SQL (it will get passed
through to the database). It also specifies which table to use for the parallel clause.

We know the Rollup component in Ab Initio is used to summarize groups of data records, so why do we use
Aggregate?
Aggregation and Rollup, both are used to summarize the data.

- Rollup is much better and more convenient to use.


- Rollup can perform some additional functionality, like input filtering and output filtering of records.

- Aggregate does not display the intermediate results in main memory, whereas Rollup can.

- Analyzing a particular summarization is much simpler with Rollup compared to Aggregate.


What kind of layouts does Abinitio support?

- Abinitio supports serial and parallel layouts.

- A graph can use both serial and parallel layouts at the same time.

- The parallel layout depends on the degree of data parallelism.

- For example, if a multifile system is 4-way parallel,

- a component laid out on it can run 4-way parallel.


How do you add default rules in transformer?

The following is the process to add default rules in transformer

- Double click on the transform parameter in the parameter tab page in component properties

- Click on Edit menu in Transform editor

- Select Add Default Rules from the dropdown list box.

- It shows Match Names and Wildcard options. Select either of them.


What is a look-up?

- A lookup file represents a set of serial files / flat files

- A lookup is a specific data set that is keyed.

- The key is used for mapping values based on the data available in a particular file

- The data set can be static or dynamic.

- A hash join can be replaced by a Reformat that uses a lookup, provided the lookup input of the join contains a small number of
records with a short record length

- Abinitio has specific functions for retrieving values from the lookup using its key
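For example (the lookup label and field names below are illustrative), a transform rule can fetch a value by key:

out.cust_name :: lookup("Customer Lookup", in.cust_id).cust_name;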
What is a ramp limit?

- The limit is an integer parameter which represents the number of reject events allowed

- The ramp parameter contains a real number representing a rate of reject events per processed record

- The formula is: no. of bad records allowed = limit + no. of records processed x ramp

- A ramp is a rate value from 0 to 1.

- Together, these two provide the threshold value of bad records.
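- For example, with limit = 10 and ramp = 0.01, after 1,000 records have been processed the component tolerates up to 10 + 1,000 x 0.01 = 20 reject events before it aborts.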


What is a Rollup component? Explain about it.

- The Rollup component allows users to group records on certain field values.
- It is a multi-stage transform and contains

- 1. Initialize 2. Rollup 3. Finalize functions, which are mandatory

- To count the records of a particular group, Rollup needs a temporary variable

- The initialize function is invoked first, once for each group


- Rollup is then called for each of the records in the group.

- The finalize function is called only once, at the end of the last rollup call.
What is the difference between partitioning with key / hash and round robin?

Partitioning by Key / Hash Partition :

- The partitioning technique that is used when the keys are diverse

- Large data skew can exist when one key value is present in large volume

- It is apt for parallel data processing

Round Robin Partition :

- This partition technique uniformly distributes the data across all destination partitions

- When the number of records is divisible by the number of partitions, the skew is zero.

- For example – a pack of 52 cards is distributed among 4 players in a round-robin fashion.


Explain the methods to improve performance of a graph?

The following are the ways to improve the performance of a graph :

- Make sure that a limited number of components are used in a particular phase

- Implement the usage of optimum value of max core values for the purpose of sorting and joining components.

- Utilize the minimum number of sort components

- Utilize the minimum number of sorted join components and replace them by in-memory join / hash join, if needed
and possible

- Restrict only the needed fields in sort, reformat, join components

- Utilize phasing or flow buffers with merge or sorted joins

- Use sorted join, when two inputs are huge, otherwise use hash join
What is the function that transfers a string into a decimal?

- Use a decimal cast with the size in the transform function, when the size of the string and the decimal is the same.

- Ex: If the source field is defined as string(8).

- The destination is defined as decimal(8)

- Let us assume the field name is salary.

- The rule is out.field :: (decimal(8)) in.salary


- If the size of the destination field is less than the input, then the string_substring() function can be used

- Ex: Say the destination field is decimal(5); then use

- out.field :: (decimal(5)) string_lrtrim(string_substring(in.field, 1, 5))

- The string_lrtrim function is used to remove leading and trailing spaces in the string
Describe the Evaluation of Parameters order.

Following is the order of evaluation:

- Host setup script will be executed first

- All common parameters (that is, parameters of included projects) are evaluated

- All Sandbox parameters are evaluated

- The project script – project-start.ksh is executed

- All form parameters are evaluated

- Graph parameters are evaluated

- The Start Script of graph is executed


Explain PDL with an example?
- PDL is used to make a graph behave dynamically

- Suppose there is a need to add a dynamic field to a predefined DML while executing the graph

- Then a graph-level parameter can be defined

- Utilize this parameter while embedding the DML in the output port.

- For example: define a parameter named mystring with the value string("|") name;

- Use ${mystring} at the time of embedding the DML in the out port.

- Use ${} substitution (PDL) as the interpretation option


State the working process of decimal_strip function.

- A decimal strip takes the decimal values out of the data.

- It trims any leading zeros

- The result is a valid decimal number

Ex:
decimal_strip("-0184o") := "-184"
decimal_strip("oxyas97abc") := "97"
decimal_strip("+$78ab=-*&^*&%cdw") := "78"
decimal_strip("Honda") "0"
State the first_defined function with an example.

- This function is similar to the function NVL() in Oracle database


- It returns the first non-NULL value among the values passed to it and assigns it to the
variable

Example: A set of variables, say v1, v2, v3, v4, v5, v6, are all assigned NULL.


Another variable num is assigned the value 340 (num = 340).
num = first_defined(NULL, v1, v2, v3, v4, v5, v6, num)
The result of num is 340, since num is the first non-NULL argument (a usage sketch follows).
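A common usage sketch in a transform rule, supplying a default when a field is NULL (the field names are illustrative):

out.balance :: first_defined(in.balance, 0);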
What is MAX CORE of a component?

- MAX CORE is the maximum amount of memory a component may use for its in-memory work before it starts spilling data to disk

- Different components have different default MAX CORE values

- A component's performance is directly influenced by its MAX CORE setting

- The process may slow down (or speed up) depending on whether the MAX CORE value is set appropriately
What are the operations that help avoid duplicate records?

Duplicate records can be avoided by using the following:

- Using Dedup sort

- Performing aggregation

- Utilizing the Rollup component (see the sketch below)
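A minimal Rollup sketch for de-duplication, keyed on an illustrative field cust_id: one output record is produced per key, and count(1) shows how many input records shared that key:

out :: rollup(in) =
begin
out.cust_id :: in.cust_id;
out.dup_cnt :: count(1);
end;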


What parallelisms does Abinitio support?

AbInitio supports 3 parallelisms. They are


- Data Parallelism : the data is split into partitions and each partition is processed simultaneously by its own instance of the component

- Component Parallelism : independent components in the same graph run at the same time on different data

- Pipeline Parallelism : records flow from one component to the next, and both components work on different records
at the same time.
State the relation between EME, GDE and Co-operating system.

EME:

- EME stands for Enterprise Metadata Environment

- It is the AbInitio repository. It holds transformations, database configuration files, metadata, and target information

GDE:

- GDE – Graphical Development Environment

- It is an end user environment. Graphs are developed in this environment

- It provides GUI for editing and executing AbInitio programs

Co>Operating System:

- The Co>Operating System is the AbInitio server / runtime engine.

- It is installed on a specific OS platform known as the native OS.


- Graphs developed in the GDE are deployed to and executed on the Co>Operating System
What is a deadlock and how it occurs?

- A graph or program hang is known as a deadlock.

- A graph stops making progress when a deadlock occurs.

- Certain data flow patterns are likely to cause a deadlock

- If flows in a graph diverge and then converge within a single phase, there is potential for a deadlock

- At the converging component, one flow may wait for records to arrive while unread data
accumulates on the other flows.

- From GDE version 1.8 onwards, the occurrence of a deadlock is very rare

What is the difference between check point and phase?

Check point:

- A checkpoint is a recovery point saved at a phase break; if the graph fails later in the process, it can be restarted from the last completed checkpoint

- The rest of the process is continued from the checkpoint

- After the problem is corrected, the data saved at the checkpoint is fetched and execution continues from there.

Phase:
- If a graph is built with phases, the phases run one after another, each getting the resources it needs.

- All the phases run one by one

- The intermediate files of a completed phase are deleted as soon as the next phase starts

HOW COMPLEX A GRAPH HAVE YOU DEVELOPED?


How many components were in your most complicated graph? It depends on the type of components you use.

Usually, avoid using overly complicated transform functions in a graph.

Which lookup functions are used to retrieve duplicate records from a lookup file?

Use lookup_count to find how many records match a key (the duplicates) and lookup_next to retrieve each of them.

If lookup_count (string file_label, [ expression [ , expression ... ] ] )>0


lookup_next ( lookup_identifier_type lookup_id, string lookup_template )

Data from 1 Column to be separated in Multiple Columns

Input file
col1
1
2
3
4
5
6
7
8
output file
col1 col2 col3 col4
1 2 3 4
5 6 7 8
How to achieve this?

The rollup transform below is written for 3 output columns; for the 4-column output asked above, change the modulus in key_change to 4 and add an out.col4 rule in finalize.


/* Temporary variable: a running index and the concatenated values */
type temporary_type = record
decimal("") ind;
string("") rec;
end;

/* Start of each group: keep the first value and reset the index */
temp :: initialize(in) =
begin
temp.ind :: 0;
temp.rec :: in.data;
end;

/* Every record of the group: append the value with a "|" separator (the first value is already stored) */
temp :: rollup(temp, in) =
begin
temp.rec :: if (temp.ind != 0) string_concat(temp.rec, "|", in.data) else temp.rec;
temp.ind :: temp.ind + 1;
end;

/* End of the group: split the concatenated string into the output columns */
out :: finalize(temp, in) =
begin
out.col1 :: (string_split(temp.rec, "|"))[0];
out.col2 :: (string_split(temp.rec, "|"))[1];
out.col3 :: (string_split(temp.rec, "|"))[2];
end;

/* key_change is called once for each consecutive pair of records; returning true on every
   third call starts a new group, so each group holds 3 input records */
out :: key_change(in1, in2) =
begin
out :: (next_in_sequence() % 3) == 0;
end;

Difference between Departitioning Components.

GATHER, MERGE, CONCATENATE, INTERLEAVE

GATHER:
- Combines data records from multiple flows or partitions arbitrarily.
- Not key-based; the result ordering is unpredictable.
- The most useful method for efficient collection of data from multiple flows.

MERGE:
- Key-based: all inbound flows must be sorted on the same key.
- The output is a single serialized flow that preserves the sort order.
- It is more powerful to sort a flow in this manner because you can partition the sort across multiple CPUs and then merge the sorted partitions.
- Merge is only useful when the flows are partitioned, or when merging multiple disparate sorted flows into one; it only makes sense for key-based, sorted records.

CONCATENATE:
- Takes the inbound flows or partitions and serializes them by stacking them one on top of another in the order of the flow / partition id.
- Handy when the first flow contains a header, the second flow contains a body of records, and the last flow contains a footer: they are serialized in header-body-footer order.
- Somewhat similar to the Unix cat command.

INTERLEAVE:
- Collects records from the inbound flows in round-robin order (one record or block from each flow in turn).

• Layout
1)Layout determines the location of the resources.
2)Layout is either Serial or Parallel.
3)A serial layout specifies one node or one directory.
4)A parallel layout specifies multiple nodes or multiple directories.

Phase:

Phases are basically used to break a graph up into blocks for performance tuning. Phasing limits the number of simultaneous processes
by splitting the graph into sequential stages, and its main use is to avoid deadlocks. The temporary files generated at a
phase break are deleted at the end of the phase, regardless of whether the job succeeded or not.
Checkpoint:

The temporary files generated at a checkpoint are not deleted until the graph completes, so a failed job can be restarted from the last good
checkpoint. Checkpoints are used for recovery.

A sandbox typically consists of the following folders:

db– database configuration files


dml– record formats, user defined data types
mp– graphs
plan– plans
run– Korn shell scripts, other scripts
xfr– transform functions (business logic)
PARALLELISM IN ABINITIO
 Parallelism means doing more than 1 thing at the same time.
 In AbInitio Parallelism is achieved via its “Co>Operating system” which provides the facilities for parallel execution.

Multifiles

Multifiles are parallel files composed of individual files, which may be located on separate disks or systems. These
individual files are the partitions of the multifile.

 An AbInitio multifile organizes all partitions of a multifile into a single virtual file that you can reference as one entity.

 You organize multifiles by using a multifile system, which has a directory tree structure that allows you to work with
multifiles.

 A multifile has a control file that contains URLs pointing to one or more data files.

AbInitio has 3 kinds of parallelism:

1)Pipeline: This kind of parallelism is available in all graphs and for most components. You can see it when you run a
graph: different numbers of records have been processed in different parts of the graph at the same moment. For
example, the input file may have delivered 10 records while a downstream component has processed only 6 of them
so far. This is pipeline parallelism: a component does not wait for all the data to arrive; it starts processing records as
they flow through the pipe.

Pipeline parallelism occurs when several connected program components on the same branch of a graph execute simultaneously.
If a component must read all of its input records before writing any output records, pipeline parallelism does not occur through it.
Components which break pipeline parallelism include Sort, Rollup (in-memory), Join (in-memory), Scan, Sort within
Groups, Fuse, and Interleave.

2) Component: This kind of parallelism occurs when two components in a graph are not interrelated and process their
data at the same time. For example, if you have two input files and sort each of them in two separate flows, those two
Sort components run under component parallelism.

3) Data: This is the most common form of parallelism: the data is partitioned so that each partition is processed
simultaneously. This is achieved through partitioning. For example, 1000 records can be divided across 8 machines
(or CPUs) so the job finishes faster.
Packages & Deployment:-
• Two types of packages are present:

1. Full

2. Incremental

• Full package will contain the entire project.

• Incremental package contains only the objects which have been modified.

• 1.Log file : Operational metadata

• 2. Config File :- Contains information about TAG name, EME project path and Sandbox path

• 3. Save File:- Contains details about the objects and their associated fields

ABINITO COMPONENTS :--

Reformat:-

• Example:

Input fields : LOSS_DT and CAT_CD

Output field : ACC_YR_CAT_CD

Variable : TEMP_ACC_YR

(decimal(4))date_year((date("YYYY-MM-DD"))in.LOSS_DT)
Business Rule for ACC_YR_CAT_CD

if (is_null(TEMP_ACC_YR) or is_null(in.CAT_CD)) "^"

else

string_concat(TEMP_ACC_YR, in.CAT_CD)
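Putting the example together as a single Reformat transform (a sketch only; the record formats and the wildcard rule are assumptions):

out :: reformat(in) =
begin
/* Derive the accident year from the loss date */
let decimal(4) TEMP_ACC_YR = (decimal(4)) date_year((date("YYYY-MM-DD")) in.LOSS_DT);

/* Business rule for the output field */
out.ACC_YR_CAT_CD :: if (is_null(TEMP_ACC_YR) or is_null(in.CAT_CD)) "^"
                     else string_concat(TEMP_ACC_YR, in.CAT_CD);

/* Carry the remaining fields through unchanged */
out.* :: in.*;
end;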

2) Join:--

3) Sort, Dedup Sort & Sort within groups:-

• Dedup Sort :

 Dedup Sorted keeps one particular record (first, last, or only the unique ones, depending on the keep parameter) from each group of records.

 The input to the Dedup Sorted component must always be grouped, as it operates on groups.

 The key parameter of Dedup Sorted should be the same key on which the input is grouped.

Example :-

• Sort within groups :

 Sort within Groups sorts the records within groups that have already been formed by a previous sort.

 For this component the Major key parameter contains the value of the field on which the component is already
sorted.

 The Minor key parameter contains the value of the field on which the component will sort the data.
 Example

I have a file containing 5 unique rows and I am passing them through a SORT component using a null key, and then passing the output
of SORT to Dedup Sorted. What will happen, and what will the output be?

Answer:- If no key is used in the Sort component, then with Dedup Sorted the output depends on the keep
parameter.
If it is set to first, the output will have only the first record;
if it is set to last, the output will have only the last record;
if it is set to unique_only, there will be no records in the output file.

dedup :
{} key - one record (the first in sequence) goes to the out port (in case of keep = first)
NULL values in the key data - the first NULL goes to the out port (in case of keep = first)

the best answer:-

Case 1: If we use a null key ({}) in Dedup Sorted, the whole file is one group, so the output depends on the keep parameter.
keep: first: 1st record
last: last record
unique_only: 0 records

Case 2: If we use an actual key in Dedup Sorted and the input file contains only unique rows, the output will be 5 records

Keep: first : 5 records


last : 5 records
unique_only: 5 records
1) keep : first - passes the first record of each group from the input port
2) keep : last - passes the last record of each group from the input port
3) keep : unique_only - compares the records within each group; since every record here differs from every other
record, each group contains a single unique record, so all records are found to be unique and all of them appear at the out
port.

SORT and SORT GROUP examples:-

Question :-
I have some queries regarding Sort and Sort Within Groups components...
i) Which one is more useful?
ii) Do they both work on the same logic?
iii) My file is already sorted on account number but now I want to sort on 2
more keys.
iv) In such a case my major key will be acct_num and the minor keys will be the other
2 keys on which I want to sort my file.
v) I have referred to the component help but it still has not completely clarified
all my points.
Answer:-
If your file is sorted on acct_num and you want to
sort on 2 other keys, you can use Sort within Groups, provided acct_num is
your first (major) key.
For example:
if you require the file to be sorted on acct_num, key2, key3, you
can use Sort within Groups.
But if you require the file to be sorted on keys such as key1, acct_num, key2, then you
will have to use the Sort component.
It is preferred to use Sort within Groups wherever applicable, as it reduces
the number of keys on which the sort needs to be done, which helps
performance.

• Rollup :

The Rollup component groups input records on a key parameter and performs aggregation functions like
count(),sum(),avg(), max() etc within the group.
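A minimal aggregation-style Rollup sketch, keyed on an illustrative field cust_id, assuming a numeric field amount in the input:

out :: rollup(in) =
begin
out.cust_id :: in.cust_id;
out.total_amt :: sum(in.amount);
out.avg_amt :: avg(in.amount);
out.max_amt :: max(in.amount);
end;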

Scan :

Scan creates a series of cumulative aggregate or summarized records for grouped data.

Scan can create intermediate summarized records unlike Rollup.
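A minimal running-total sketch in the expanded Scan form (field names are illustrative; amt is assumed to be a numeric input field and the output format is assumed to hold the input fields plus running_total):

type temporary_type = record
decimal("") running_total;
end;

temp :: initialize(in) =
begin
temp.running_total :: 0;
end;

temp :: scan(temp, in) =
begin
temp.running_total :: temp.running_total + in.amt;
end;

out :: finalize(temp, in) =
begin
out.* :: in.*;
out.running_total :: temp.running_total;
end;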

Example question:

Input data:
123|A|X| |12.0|
123|A|X|2012-02-17|18.5|
123|D|Y|2012-02-18|20.5|
123|C|X|2012-02-19|09.5|
123|A|X|2012-02-21|10.0|
123|C|X|2012-02-22|32.5|
123|D|X|2012-02-23|45.5|

DML of the input file:
record
string('|') tran_id;
string('|') tran_cd;
string('|') src_cd;
string('|') tran_dt;
decimal('|') tran_amt;
string('\n') new_ln;
end

I am using the below scan:

type temporary_type=record
string(1) temp_rej;
string(1) curr_cd;
string(1) temp_src;
string(1) prev_cd;
end; /*Temporary variable*/

temp :: scan(temp, in) =
begin
temp.prev_cd :: temp.curr_cd;
temp.curr_cd :: in.tran_cd;
temp.temp_rej :: temp.temp_rej;
temp.temp_src :: in.src_cd;
end;

out :: finalize(temp, in) =
begin
out.tran_cd :: temp.curr_cd;
out.rec_rej :: switch (temp.curr_cd)
case "A" : if ((temp.prev_cd == 'A') || (temp.prev_cd == 'C') || (temp.prev_cd == 'D')) 'M';
case "C" : if ((temp.prev_cd == 'A') || (temp.prev_cd == 'C') || (temp.prev_cd == 'D')) 'N';
case "D" : if ((temp.prev_cd == 'A') || (temp.prev_cd == 'C') || (temp.prev_cd == 'D')) 'O';
end;
out.new_ln :: '\n';
out.tran_id :: in.tran_id;
out.tran_dt :: in.tran_dt;
out.tran_amt :: in.tran_amt;
out.src_cd :: temp.temp_src;
end;

temp :: initialize(in) =
begin
temp.temp_rej :: '';
temp.curr_cd :: '';
temp.temp_src :: '';
temp.prev_cd :: '';
end;

The logic that I am trying to implement is: if the first record has the code "A" and the 2nd record has "A"/"C"/"D", then the
reject_cd should be "M"; else, if the first record has the code "C" and the 2nd record has "A"/"C"/"D", the reject_cd should be
"N"; else, if the first record has the code "D" and the 2nd record has "A"/"C"/"D", the reject_cd should be "O".

Using the above scan, the output result that I got is:
123|A|X| |12.0||
123|A|X|2012-02-17|18.5|M|
123|D|Y|2012-02-18|20.5|O|
123|C|X|2012-02-19|09.5|N|
123|A|X|2012-02-21|10.0|M|
123|C|X|2012-02-22|32.5|N|
123|D|X|2012-02-23|45.5|O|

Where exactly am I going wrong, as the first record has NULL populated instead of "M"?

Answer:-

I was able to solve the above issue after I used the below code in my finalize function in the scan:

out.rec_rej:1: if (!is_null(temp.prev_cd) && !is_blank(temp.prev_cd))
switch (temp.prev_cd)
case "A" : if ((temp.curr_cd == 'A') || (temp.curr_cd == 'C') || (temp.curr_cd == 'D')) 'M';
case "C" : if ((temp.curr_cd == 'A') || (temp.curr_cd == 'C') || (temp.curr_cd == 'D')) 'N';
case "D" : if ((temp.curr_cd == 'A') || (temp.curr_cd == 'C') || (temp.curr_cd == 'D')) 'O';
end;
out.rec_rej:2: if (is_blank(temp.prev_cd) && (temp.curr_count == 1)) "Z";

(curr_count here is an extra counter field that would have to be added to the temporary record and incremented in the scan function.)

It sets "Z" for the first record. But my requirement is that if I have only 1 record, then I need to set the value to "N"
instead.
