Ans: The layout determines whether a component runs serially or in parallel. If you specify a serial directory as the layout, the component runs as a single stream; if you specify a multifile directory, the component runs in parallel. The path you specify there also serves as the working directory of the graph, where all intermediate files are stored.
Layout can be specified as:
1) propagate from neighbors
2) URL
3) custom
4) host
Before you can run an Ab Initio graph, you must specify layouts that describe to the Co>Operating System the locations of files and the number and locations of the partitions in which program components run.
2) What is skew?
Ans: Skew describes how unevenly data is distributed across partitions. You can tune performance by controlling skew, max-core, and other parameters; there are many ways.
3) How do you convert a 4-way multifile to an 8-way multifile?
Ans: Put a partition component and a gather component in between: the partition component reads the 4-way MFS, and the gather side runs 8-way.
5) In Ab Initio, what are upstream and downstream?
Ans: Upstream and downstream are used in conjunction with the EME for dependency and impact analysis of the graphs we have developed and saved into the repository. Basically, they help track changes between versions, including changes to individual components and to the variables within those components.
6) I need to extract files from both an Oracle DB and from mainframes (DB2). Is it possible to extract the data directly from the DBs, or do I need to convert them into flat files and load?
Ans: You can extract directly from the database in each case. You just have to make sure that you have a database configuration file set up for DB2 and Oracle, and that all your login variables are set in your run settings. To my knowledge, you can unload the data directly from DB2, Oracle, or Informix using the Unload DB Table component.
7) Does anybody know the maximum number of columns in a lookup file? What is the maximum amount of data we can have in a lookup file? I am doing a code review for my application, and I see 8 to 10 columns in each lookup file with a large amount of data.
Ans: 1) There is no set limit on the number of columns a lookup file can contain. There is, however, a limit on the size of the data file. If you believe that 8-10 columns are too large, you might be correct. If the lookup contains anything over 750,000-1M records, I would highly recommend using a join instead. The lookup will die if it gets too large, and you will have to code a join.
2) Lookups get cached into memory during graph execution; it is always a good idea to keep the data in the lookup to the bare minimum required.
Don't keep any columns in the lookup file that you don't need or don't access.
If the graph is partitioned, try to use lookup_local wherever possible. For this, your partition key and lookup key must match, or the lookup key should be a leading subset of the partition key.
Rule of thumb: trim any fields from the data that you don't use in downstream processing.
3) The limit for a lookup file is 2GB. Whether or not it is sensible to use a lookup of that
sort of size depends on what it's being used for.
9) How can I stop an executing graph in the middle for some conditions then how to
restart it?
Ans: 1) Doing a kill -9 PID1 PID2 will only kill the Ab Initio processes running on the host node. We may still have Ab Initio processes running on different agent nodes. During runtime, Ab Initio creates a recovery file named <graph_name>.rec in the host directory specified in Run -> Settings in the GDE. If the host directory is not specified, the file is created in the default $HOME of the user specified in Run -> Settings. This recovery file contains pointers to the temporary files created dynamically during runtime. To kill an Ab Initio job and all its associated processes running across all the nodes, execute the following two commands in order:
1. m_kill -9 <recovery_file>
2. m_rollback -d <recovery_file>
If the graph execution has to be stopped, depending on certain conditions, then use
force_error() function.
a. today() :: Returns the internal representation of the current date on each call
b. today1() :: Returns the internal representation of the current date on the first
call.
Note [DML represents dates internally as integer values specifying days relative to
January 1, 1900]
out.fieldname :: (date("YYYY-MM-DD")) in.fieldname;
Note: However, if any of the input fields contains NULL data, this fails, so use the is_valid() or is_defined() functions to check the validity of the input data.
12) What is the relation between EME, GDE and Co-operating system?
Ans. EME stands for Enterprise Meta>Environment, GDE for Graphical Development Environment, and the Co>Operating System can be thought of as the Ab Initio server. The relationship between them is as follows: the Co>Operating System is the Ab Initio server, installed on a particular OS platform, which is called the native OS. The EME is a repository, much like the repository in Informatica; it holds metadata, transformations, db config files, and information about sources and targets. The GDE is the end-user environment where we develop graphs (analogous to mappings in Informatica): the designer uses the GDE to design graphs and saves them to the EME or to a sandbox. The sandbox is on the user side, whereas the EME is on the server side.
13) What is the use of aggregation when we have rollup as we know rollup component
in abinitio is used to summarize group of data record, then where we will use
aggregation?
Ans: Aggregate and Rollup can both summarize data, but Rollup is much more convenient to use, and a summarization expressed as a rollup is more self-explanatory than the equivalent aggregate. Rollup also offers extra functionality, such as input and output filtering of records. Aggregate and Rollup perform the same action, but Rollup can expose intermediate results in main memory, whereas Aggregate does not support intermediate results.
A lookup file consists of data records that can be held in main memory. This lets a transform function retrieve records much faster than retrieving them from disk, so a transform component can process the records of multiple files quickly.
17) What is the difference between a lookup file and a lookup, with a relevant example?
Ans: Generally, a lookup file represents one or more serial files (flat files) whose data is small enough to be held in memory. This allows transform functions to retrieve records much more quickly than they could from disk.
A lookup is a component of an Ab Initio graph where we can store data and retrieve it by using a key parameter.
A lookup file is the physical file where the data for the lookup is stored.
A lookup is basically a specific keyed dataset. It can be used to map values based on the data present in a particular file (serial or multifile). The dataset can be static or dynamic (for example, when the lookup file is generated in a previous phase and used as a lookup in the current phase). Sometimes hash joins can be replaced by a Reformat plus a lookup if one of the inputs to the join has few records with a narrow record length.
Ab Initio has built-in functions to retrieve values from a lookup using its key.
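This is not Ab Initio code, but the mechanism is easy to sketch in shell and awk (the file name and data below are invented for illustration): the small keyed file is loaded into memory once, and each incoming record is then enriched by key without a disk read per record.

```shell
# awk analogue of a keyed lookup: load the small file into an
# in-memory array, then enrich a record stream by key.
lkp_file=$(mktemp)
printf '101 east\n102 west\n' > "$lkp_file"
printf '101 1000\n102 1050\n101 500\n' |
awk -v lkp="$lkp_file" '
  BEGIN { while ((getline line < lkp) > 0) {    # cache the lookup once
            split(line, f, " "); region[f[1]] = f[2] } }
  { print $1, $2, region[$1] }                  # in-memory retrieval by key
'
rm -f "$lkp_file"
```

Each input record gains the region of its customer, e.g. `101 1000 east`, which is the effect an Ab Initio lookup function gives inside a transform.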
If you want to group records on particular field values, Rollup is the best way to do it. Rollup is a multi-stage transform function and contains the following mandatory functions:
1. initialize
2. rollup
3. finalize
You also need to declare a temporary variable if you want to get counts for a particular group.
For each group, Rollup first calls the initialize function once, then calls the rollup function for each record in the group, and finally calls the finalize function once after the last rollup call.
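That three-stage call pattern can be mimicked in plain awk (not DML; the sample customer data is invented) on input that is already sorted by key:

```shell
# awk analogue of rollup on key-sorted input: initialize a new group,
# accumulate (rollup) each record, finalize when the key changes.
printf '101,1000\n101,500\n102,1050\n103,1140\n103,1000\n103,500\n' |
awk -F, '
  $1 != key { if (key != "") print key "," sum   # finalize previous group
              key = $1; sum = 0 }                # initialize new group
  { sum += $2 }                                  # rollup: accumulate record
  END { if (key != "") print key "," sum }       # finalize the last group
'
```

This prints one summarized record per customer (101,1500 / 102,1050 / 103,2640), just as a rollup keyed on Cust_id would.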
Add Default Rules opens the Add Default Rules dialog. Select one of the following: Match Names, which generates a set of rules that copy input fields to output fields with the same name, or Use Wildcard (.*) Rule, which generates a single wildcard rule that copies input fields to output fields with the same name.
In the case of Reformat, if the destination field names are the same as, or a subset of, the source fields, there is no need to write anything in the reformat xfr, unless you want a real transform beyond reducing the set of fields or splitting the flow into a number of flows.
24) What is the difference between partitioning with key and round robin?
Partition by key (hash partition) -> a partitioning technique used when the keys are diverse. If a particular key value is present in large volume, there can be a large data skew, but this method is the one most often used for parallel data processing.
Round-robin partition is another partitioning technique, which distributes the data uniformly across the destination partitions. The skew is zero when the number of records is divisible by the number of partitions. A real-life example is how a pack of 52 cards is dealt to 4 players in a round-robin manner.
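The card-dealing example can be checked in a couple of lines of shell: dealing 52 records round-robin to 4 partitions gives 13 records each, i.e. zero skew.

```shell
# Deal record numbers 0..51 to partition (n mod 4), then count how
# many records land in each of the 4 partitions.
seq 0 51 | awk '{ print $1 % 4 }' | sort | uniq -c
```

Every partition receives a count of 13; with 53 records one partition would get 14, which is the (small) skew the answer describes.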
There are many ways the performance of the graph can be improved.
1) Use a limited number of components in a particular phase
2) Use optimum value of max core values for sort and join components
3) Minimise the number of sort components
4) Minimise sorted join component and if possible replace them by in-memory join/hash join
5) Use only required fields in the sort, reformat, join components
6) Use phasing/flow buffers in case of merge, sorted joins
7) If the two inputs are huge then use sorted join, otherwise use hash join with proper driving
port
8) For large dataset don't use broadcast as partitioner
9) Minimise the use of regular expression functions like re_index in the transform functions
10) Avoid repartitioning of data unnecessarily
Try to run the graph as long as possible in MFS. For these input files should be partitioned and
if possible output file should also be partitioned.
27) Have you ever encountered an error called "depth not equal"?
When two components are linked together and their layouts don't match, this problem can occur during compilation of the graph. A solution is to put a partitioning component in between wherever the layout changes.
28) What is the function you would use to convert a string into a decimal?
In this case no specific function is required if the size of the string and the decimal are the same; a decimal cast with the size in the transform function will suffice. For example, if the source field is defined as string(8), the destination as decimal(8), and the field name is field1:
out.field1 :: (decimal(8)) in.field1;
If the destination field is smaller than the input, the string_substring function can be used. Say the destination field is decimal(5):
out.field1 :: (decimal(5)) string_substring(in.field1, 1, 5);
An outer join is used when one wants to select all the records from a port - whether it has
satisfied the join criteria or not.
33) Explain the difference between the TRUNCATE and DELETE commands.
TRUNCATE is a DDL command, whereas DELETE is a DML command. A rollback cannot be performed after a TRUNCATE statement, whereas a DELETE can be rolled back. A WHERE clause cannot be used with TRUNCATE, whereas DELETE supports one.
34) How can we create a job sequencer in Ab Initio, i.e., run a number of graphs in order?
There is no job sequencer in Ab Initio up to GDE 1.13.3 and Co>Op 2.12.1, but we can sequence jobs by creating wrapper scripts in UNIX, i.e., a Korn shell script that calls the graphs in sequence.
Scheduling of the jobs can also be done with a scheduling tool such as Control-M: the graphs' corresponding scripts and wrapper scripts are placed as per the sequence of execution, and we can monitor the execution of the graphs. There is no sequencer concept in Ab Initio itself. Suppose you have graphs A, B, and C; you would write a wrapper script that calls the jobs like this:
a.ksh
b.ksh
c.ksh
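A minimal sketch of such a wrapper, with an exit-status check between graphs (the run_* functions below are stand-ins for the real deployed scripts a.ksh, b.ksh, c.ksh):

```shell
#!/bin/sh
# Hypothetical wrapper: each graph runs only if the previous one
# exited successfully; the first failure aborts the sequence.
run_a() { echo "graph A done"; }
run_b() { echo "graph B done"; }
run_c() { echo "graph C done"; }

for job in run_a run_b run_c; do
    if ! "$job"; then
        echo "$job failed; aborting sequence" >&2
        exit 1
    fi
done
echo "all graphs completed"
```

In a real wrapper, each run_* call would be replaced by the corresponding deployed .ksh script, and the exit status of each graph drives whether the next one starts.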
There is a Read Excel component that reads Excel files either from the host or from a local drive; the DML will be a default one. Alternatively, save the file as a CSV (delimited) file and read it through an input file component.
36) What is the use of parameterized graphs?
One of the main purposes of parameterized graphs is that if we need to run the same graph n times for different files, we set up graph parameters like $INPUT_FILE and $OUTPUT_FILE and supply their values in Edit > Parameters. These parameters are substituted at run time. We can set different types of parameters: positional, keyword, local, etc.
The idea is that instead of maintaining different versions of the same graph, we maintain one version that works for different files.
/*string_trim.xfr*/
out :: trim(input_string) =
begin
  out :: string_lrtrim(input_string);
end;
The above xfr can then be included in a transform, and the function called, with:
include "~/xfr/string_trim.xfr";
If you want to see all the records of one input file regardless of whether there is a matching record in the other file, that is an outer join. Depending on the requirement, it is sometimes more advisable to create a lookup instead, but that depends on the requirement and design.
Example: you need to run 3 graphs, but the condition is that after the first graph runs successfully you must take the feed it generated and use it in the next graph, and so on; after each graph finishes you have to check that it ran successfully before running the next ksh. You would then write a Unix script that runs the ksh of the first graph, checks its status, and continues accordingly.
DML whose record format depends on a condition is known as conditional DML.
Suppose we have data that includes a header (record type 10), main data (record type 20: emp_id, emp_name, salary), and a trailer (record type 30: count). A conditional DML record format along these lines (the field delimiters shown are illustrative) would be:
record
  decimal(2) id;
  if (id == 10)
  begin
    /* header fields */
  end
  else if (id == 20)
  begin
    string(",") emp_id;
    string(",") emp_name;
    decimal("\n") salary;
  end
  else if (id == 30)
  begin
    decimal("\n") count;
  end
end;
This is conditional DML.
Could anybody provide the major UNIX commands for the Ab Initio multifile system?
Common m_ commands include m_mkfs (create a multifile system), m_ls, m_cp, m_mv, m_rm, m_mkdir, m_touch, m_dump (view the data in a multifile), and m_expand (list the partitions of a multifile).
A vector is a sequence of the same type of elements. The element type may be any type
including a vector or record type.
A repetition-count field tells us how many times a particular field is repeated. For example, take this input:
Cust_id  purchase_amount  purchase_date
101      1000             29.08.06
101      500              30.08.06
102      1050             31.08.06
103      1140             01.09.06
103      1000             02.09.06
103      500              30.09.06
Here, a no_purchase field holding the number of purchases each customer has made would govern a vector of purchase records per customer.
Dependency analysis answers questions regarding data lineage: where does the data come from, and what applications produce and depend on this data?
For data parallelism, we can use partition components. For component parallelism, we can use the Replicate component. Which component(s) can we use for pipeline parallelism?
Pipeline parallelism occurs when a connected sequence of components on the same branch of a graph executes concurrently.
Components like Reformat, where we distribute the input flow to multiple output flows using an output index based on some selection criterion and process those output flows simultaneously, create pipeline parallelism.
Components like Sort, where the entire input must be read before a single record is written to the output, cannot achieve pipeline parallelism.
Put simply, whenever you run any graph you can observe the number of records processed on the flows; this is the best everyday illustration of pipeline parallelism.
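A Unix pipe is a close analogue: in the sketch below both stages run concurrently, with records streaming from the producer to the consumer as they are generated rather than after the producer finishes.

```shell
# Pipeline parallelism in miniature: the producer (seq) and the
# consumer (the while loop) run at the same time, connected by a pipe.
seq 1 5 | while read n; do
    echo "processed $n"
done
```

A stage that must consume all its input before emitting anything (like sort, or Ab Initio's Sort component) breaks this streaming and ends the pipeline-parallel section.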
.abinitiorc is the configuration file for Ab Initio; the per-user copy is found in the user's home directory. Generally it contains the Ab Initio home path and login information, such as IDs, encrypted passwords, and the login method for the hosts the graph connects to at execution time. A system-wide abinitiorc containing configuration variables such as AB_WORK_DIR and AB_DATA_DIR can be found in $AB_HOME/config.
.profile is a file that is executed automatically when a particular user logs in.
You can change your .profile to include any commands you want executed whenever you log in; you can even put commands in your .profile that override settings made in /etc/profile (which is set up by the system administrator). It typically contains:
- environment settings
- aliases
- path variables
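A minimal sketch of a .profile along those lines (the AB_HOME path is a placeholder, not a real install location):

```shell
# Example ~/.profile contents: an environment setting, a PATH
# addition, and an alias.
export AB_HOME=/opt/abinitio/current     # placeholder install path
export PATH="$AB_HOME/bin:$PATH"         # make m_* and air commands findable
alias ll='ls -l'                         # convenience alias
```

Anything set here runs on every login, which is why Ab Initio environment variables are commonly placed in this file.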
If you want a semi join, set the record-requiredN parameter to true for the required input and false for the other inputs.
Data mapping deals with the transformation of the extracted data at the FIELD level, i.e., the transformation of a source field to a target field is specified by the mapping defined on the target field. Data mapping is specified during the cleansing of the data to be loaded.
The driving port in a join supplies the data that drives the join: every record from the driving port is compared against the data from the non-driving port.
We set the driving port to the larger dataset so that the smaller, non-driving data can be kept in main memory, speeding up the operation.
Data cleansing: the act of detecting and removing and/or correcting a database's dirty data (i.e., data that is incorrect, out of date, redundant, incomplete, or formatted incorrectly).
The Co>Operating System is essentially a distributed operating system layer, which runs as a backend server.
Reformat
Rollup
Join
Sort
Replicate
Partition by expression and key
Redefine
Multi update
Lookup
Intermediate
Reformat can change record formats by dropping, adding, or combining fields.
Ports:
Input
Output
Reject
Log
Error
Specific Parameters:
Select
Output Index
5. What is the difference between the Output Index and Select parameters in Reformat?
Select and output index are both used to filter data, but with the select parameter we cannot get the deselected records. With the output index parameter you can filter the data and also route the deselected records to another output port.
Reads the data from two or more inputs, combines the records with matching keys, and sends them to the output ports.
Specific parameters:
Driving port: the driving port is the largest input; the remaining inputs are read directly into memory. (Available only when Sorted-Input is set to "In memory: Input need not be sorted".)
Join type:
1. Inner join
2. Full outer join
3. Explicit join
Record-required parameters: available when the join type is set to Explicit. For a left outer join, set record-required0 to true and record-required1 to false; for a right outer join, set record-required0 to false and record-required1 to true.
Override key: sets alternative names for particular key fields.
Max memory: the maximum number of bytes the join uses before writing temporary files to disk (available only when Sorted-Input is set to "Input must be sorted"); the default is 8 MB.
Max-core: the maximum number of bytes the join uses before writing temporary files to disk (available only when Sorted-Input is set to "In memory: Input need not be sorted"); the default is 64 MB.
Sorted-Input: when set to "Input must be sorted", the join accepts only sorted input; when set to "In memory: Input need not be sorted", it accepts unsorted data.
Specific ports:
For three inputs, if you want a right outer join, set the record-required parameter to false for input 0 and input 1 and to true for input 2.
Both components are used to combine data based on keys: with Join we can combine two input flows, whereas Merge combines the partitions of sorted, partitioned data into a single flow.
Parameters:
Key
Maxcore (Default is 100MB)
Partition by Key: distributes the records to the output flow partitions according to their key values.
Partition by key
Partition by Expression
Partition by round robin
Partition by range
Broadcast
Broadcast: combines the records it receives into a single flow and writes a copy of that flow to each of its output flow partitions. Broadcast supports data parallelism.
Replicate: combines the records it receives into a single flow and writes a copy of that flow to each of its output flows. Replicate supports component parallelism.
Merge
Interleave (Combines in round robin fashion)
Concatenate
Gather (Combines the data arbitrarily)
In both components we can filter data based on a select expression, but in Reformat we cannot get the deselected records on a separate port; Filter by Expression has a separate deselect port.
Both components are used for summarization, but Aggregate does not have the built-in functions; in Rollup we have built-in functions like sum(), avg(), count(), min(), max(), first(), last(), and product().
Rollup gives total control over the summarization, whereas Scan produces only intermediate (cumulative) summary records.
Sort
Sort with groups
Checkpointed sort
23. What is a multifile, and how can we create one from the command line?
An Ab Initio multifile is a partitioning of a large serial file into a tree structure (a control file plus data partitions) so that it can be processed in parallel.
We can create a multifile system from the command line with the m_mkfs command, followed by the URL of the multifile system and the URLs of its partitions; files within it can then be created with commands such as m_touch.
Phases are used to break up a graph into blocks for performance tuning.
Check point is used for recovery
Component parallelism
Pipeline parallelism
Data parallelism
Component parallelism:
Pipeline parallelism:
Pipeline parallelism occurs when a connected sequence of program components on the same
branch of a graph execute simultaneously.
Data parallelism:
Data parallelism occurs when you separate data into multiple divisions, allowing multiple
copies of program components to operate on the data in all the divisions simultaneously.
26. Explain the types of flows in Ab Initio.
Straight flow
Fan-in flow
Fan-out flow
Straight flow: connects two components with the same depth of parallelism.
Fan-in flow: connects a component with a greater depth of parallelism to one with a lesser depth; in other words, it follows a many-to-one pattern.
Fan-out flow:
A fan-out flow connects a component with a lesser number of partitions to one with a greater
number of partitions in other words, it follows a one-to-many pattern.
You can make any component or sub graph conditional by specifying a conditional expression
that the GDE evaluates at runtime to determine whether or not the component runs.
If the conditional expression evaluates to true, the GDE runs the subgraph or component. If the
conditional expression evaluates to false, the GDE either disables the component and any flows
connected to its ports, or replaces it with a flow, depending on your choice on the Properties
dialog: Condition tab.
A subgraph is a graph fragment. Just like graphs, subgraphs contain components and flows. A
subgraph groups together components that perform a subtask in a graph. The subgraph creates
a reusable component that performs the subtask.
Lookup functions
Date functions
is_error (tests whether an error occurs while evaluating the expression)
Decimal_lpad:
Decimal_lrpad
String_compare
String_substring
String_concat
String_Index
String_length
String_lpad
String_lrpad
Lookup File represents one or more serial files or a multifile. The amount of data is small
enough to be held in main memory. This allows a transform function to retrieve records much
more quickly than it could retrieve them if they were stored on disk.
Lookup File associates key values with corresponding data values to index records and retrieve
them.
Key
Record format
Unlike other dataset components, Lookup File is not connected to other components in graphs.
In other words, it has no ports. However, its contents are accessible from other components in
the same or later phases.
You use the Lookup File in other components by calling one of the following DML functions in
any transform function or expression parameter: lookup, lookup_count, or lookup_next.
The first argument to these lookup functions is the name of the Lookup File. The remaining
arguments are values to be matched against the fields named by the key parameter. The
lookup functions return a record that matches the key values and has the format given by the
RecordFormat parameter. For details, see the Data Manipulation Language Reference.
A file you want to use as a Lookup File must fit into memory. If a file is too large to fit into
memory, use Input File followed by Match Sorted or Join instead.
Information about Lookup Files is stored in a catalog, which allows you to share them with
other graphs.
Lookup
Lookup_count
Lookup_Local
Lookup_next
(Note please go through help document for the description)
The DB config file has the information Ab Initio needs to connect to the database.
Creation: in the Input Table or Output Table component, select DB config file > New, then give the DB name, DB node, database version, user id, and password, and click Create.
Check-In
Check-Out
(Note :Please go through the help document for more Information)
Whenever you check out a graph you need to give tag information on the Tag tab (it represents the version).
To view all versions of an object, use the air command line, for example: air object versions <object_path>.
Ans: Using watchers we can debug the graph: a watcher adds an intermediate file on the flow, so you can view the data that passes through the flow when you run the graph.
Non-phased
Phased
o On the GDE Debugging toolbar, click the Add Watcher to Flow button .
o Right-click the flow and choose Add Watcher from the shortcut menu.
46. What is a sandbox?
A sandbox is a collection of graphs and related files that are stored in a single directory tree and treated as a group for purposes of version control, navigation, and migration.
When you create a sandbox, the directory tree (dml, xfr, db, mp, and run folders), parameters, and environment variables are created automatically, along with the ab_project_setup.ksh file in the sandbox.
The "depth not equal" error occurs when the depth of parallelism (the number of partitions in the layout) is mismatched between upstream and downstream components.
A max-core error occurs when the max-core value is too low while a component is executing, so we need to set an appropriate max-core value for that component.
If you set checkpoints on phases, a .rec file is created automatically. Once a failure occurs, rerunning the graph automatically recovers the data up to the last checkpoint.
If you want to run it from the beginning instead, you need to perform a manual rollback from the command line (m_rollback -d <recovery_file>).
A local variable is a variable declared within a transform function. You can use local variables to simplify the structure of rules or to hold values used by multiple rules.
Declaration:
let type_specifier variable_name [not NULL] [ = expression ];
NOTE: The declaration of a local variable must occur before the statements and rules in a transform function.
[not NULL] Optional. Keywords indicating that the variable cannot take on the value NULL. These must appear after the variable name and before the initial value.
NOTE: If you create a local variable without the not NULL keywords and do not assign an initial value, the local variable initially takes on the value NULL.
For example, local variable definitions like the following (illustrative expressions) define two variables, x and y, where the value of x depends on the value of the amount field of the variable in, and the value of y depends on the value of x:
let decimal(8.2) x = in.amount * 2;
let decimal(8.2) y = x + 1;
Within a package you can create and use a global variable in all the transform functions present in the package, but you must declare the global variable outside the transform functions.
Declaration:
let type_specifier variable_name [not NULL] [ = expression ] ;
let: the keyword for declaring a variable.
[not NULL] Optional. Keywords indicating that the variable cannot take on the value of
null. These must appear after the variable name and before the initial value.
NOTE: If you create a global variable without the not NULL keywords, and do
not assign an initial value, the global variable initially takes on the value of
null.
m_rollback rolls back a partially completed graph to its beginning state. m_cleanup cleans up
files left over from unsuccessfully executed graphs and manually recovered graphs.
To find temporary files and directories before cleaning them up, you use the m_cleanup
command. You can run this utility with or without arguments:
m_cleanup prints usage for the command.
m_cleanup -help prints usage for the command.
m_cleanup -j job_log_file [job_log_file... ] lists the temporary files and directories
listed in the log file specified by job_log_file. To specify multiple files, separate each
filename with a space.
Log files have either a .hlg or .nlg suffix. A log file ending in .hlg is on the control, or host,
machine of a graph. A log file ending in .nlg is on a processing machine of a graph.
The job_log_file can be an absolute or relative pathname. Paths have the following syntax:
o On the control machine AB_WORK_DIR/host/job_id/job_id.hlg
o On a processing machine AB_WORK_DIR/vnode/job_id-XXX/job_id.nlg,
where the XXX on a processing machine path is an internal ID assigned to each
machine by the Co>Operating System.
58. How can I generate DML for a database table from the command line?
Using the m_db command line utility we can generate the DML. The syntax is along the lines of the following (check the Co>Operating System documentation for the exact options):
m_db gendml <db_config_file> -table <table_name>
Yes, we can do check-in and check-out using air commands such as air object import and air object export.
Ans) EME is short for Enterprise Meta>Environment. The EME is a high-performance object-oriented storage system that manages Ab Initio applications (including data formats and business rules) and related information. It provides an integrated and consolidated view of your business. It is used for version control, navigation, and migration.
An EME datastore is a specific instance of the EME: the term denotes the specific EME storage
that you are currently connected to through the GDE, there can be many such datastore
instances resident in an environment in which the EME has been installed. But you can only be
connected to one datastore at a time: this is determined by your GDE's current EME datastore
settings.
2) What is Sandbox?
Ans) A sandbox is a collection of graphs and related files that are stored in a single directory
tree, and treated as a group for purposes of version control, navigation, and migration. A
sandbox can be a file system copy of a datastore project.
The Co>Operating System is layered on top of the native operating systems of a collection of
computers. It provides a distributed model for process execution, file management, process
monitoring, checkpointing, and debugging.
The Graphical Development Environment (GDE) provides a graphical user interface into the
services of the Co>Operating System.
4) What are the differences between the various GDE connection methods?
Ans) There are a number of communication methods used to communicate between the GDE
and the Co>Operating System, including:
Ab Initio Server/REXEC:
Ab Initio Server/TELNET:
DCOM:
REXEC:
RSH:
TELNET:
SSH(/Ab Initio)
When using the GDE to connect to the Co>Operating System, the normal process for a
connection differs depending upon which communication method is selected. In broad
terms, two things tend to happen: files are transferred from the GDE to the target host (or
from the host to the GDE), and processes are started/executed on the host.
When using telnet, rexec and rsh, the basic steps are as follows.
A. The GDE transfers the execution script to the server via FTP.
B. The GDE connects to the server by means of the selected method.
C. The GDE executes that script on the server by means of the connection set up in
step B.
The process is different for connection methods that use the Ab Initio Server, however.
These methods include Ab Initio Server/Telnet and Ab Initio Server/Rexec, as well as SSH and
DCOM. The use of the Ab Initio Control Server replaces the need for FTP and adds enhanced
server-side services. When the Ab Initio Control Server is involved, the basic steps are as
follows:
All file transfer occurs across the same Control Server connection.
5) What is metadata?
Ans) Metadata is data about data: it gives a description of the data. In Ab Initio, metadata is
associated with graphs. This includes the information needed to build a graph, such as record
formats, key specifiers, and transform functions.
Ans) The Co>Operating System accepts either of two names for the per-user Ab Initio
configuration file. In addition to .abinitiorc, the Co>Operating System now also accepts
abinitio.abrc in order to conform to Windows file name conventions. Other supported
platforms also recognize the new name. Only one configuration file is permitted, however.
Using both .abinitiorc and abinitio.abrc results in an error.
Ans)
.cfg Database table configuration files for use with 2.1 Database Components
Ans) The GDE provides default settings and behaviors for several features:
Flow Buffering and Deadlock Avoidance
Layout
Record Format Propagation
9) What kinds of flat file formats does the Ab Initio Graphical Design Interface (GDE) support?
Ans) The Ab Initio Graphical Design Interface (GDE) supports the following flat file formats. All
file types use the .dat extension.
Serial Files
Multifiles
Ad-hoc Multifile
Serial Files
A serial file is a flat, non-parallel file, also known as one-way parallel. You create serial files
using a Universal Resource Locator (URL) on the component's Description tab. The URL starts
with file:
Multifiles:
A multifile is a parallel file consisting of individual files called partitions and often stored on
different disks or computers. A multifile has a control file that contains URLs pointing to one or
more data files. You can divide data across partition files using these methods: random or
round-robin partitioning, partitioning based on ranges or functions, and replication or
broadcast, in which each partition is an identical copy of the serial data. You create multifiles
using a URL on the component's Description tab.
Ad-hoc Multifile: An ad-hoc multifile is also a parallel file. Unlike a multifile, however, the
content of an ad-hoc multifile is not stored in multiple directories. In a custom layout, the
partitions are serial files. You create an ad-hoc multifile using partitions on the component's
Description tab.
10) What is a database configuration (.dbc) file?
Ans) A file with a .dbc extension provides the GDE with the information it needs to connect to a
database. A configuration file contains the following information:
The name and version number of the database to which you want to connect
The name of the computer on which the database instance or server to which you want
to connect runs, or on which the database remote access software is installed
The name of the database instance, server, or provider to which you want to connect
11) What are the default sandbox parameters?
Ans) The default sandbox parameters in a GDE-created sandbox are these six:
These six parameters are automatically created (and assigned their correct value) whenever
you create a sandbox.
12) What is the difference b/w sandbox parameters & graph parameters?
Ans) The difference between sandbox parameters and graph parameters is:
Graph parameters are visible only to the particular graph to which they belong
Sandbox parameters are visible to all the graphs stored in a particular sandbox
Ans) A sandbox that is not associated with a project is simply a special directory.
Ans) The big difference between the contents of a sandbox and its corresponding project in the
EME is that the project contains, for each file, each and every version that has ever been
checked in by anybody. The sandbox, on the other hand, contains only the latest version of
each file checked out into that sandbox.
A sandbox can be associated with only one project. However, there is no limit (other than the
physical one of disk space) to the number of sandboxes that a user can have. Although a given
sandbox can be associated with only one project, a given project can have any number of
sandboxes.
Ans) A formal graph parameter is a parameter you substitute for a path and/or filename when
you create a graph. This allows you to specify the value of that parameter at runtime.
Ans) When you run a graph, parameters are evaluated in the following order:
Ans) A transform function (or transform) is the logic that drives data transformation; most
commonly, transform functions express record-reformatting logic. In general, however, you can
use transform functions in data cleansing, record merging, and record aggregation.
To be more specific, a transform function is a collection of business rules, local variables, and
statements. The transform expresses the connections between the rules, variables, and
statements, as well as the connections between these elements and the input and output
fields.
Transform functions are always associated with transform components; these are components
that have a transform parameter: Aggregate, Denormalize Sorted, Fuse, Join, Match Sorted,
MultiReformat, Normalize, Reformat, Rollup, and Scan components.
Ans) You control the order of evaluation of rules in a transform function by assigning priority
numbers to the rules. The rules are attempted in order of priority, starting with the assignment
with the lowest-numbered priority, proceeding to assignments with higher-numbered priorities,
and finally to any assignment for which no priority has been given.
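The rule ordering described above can be sketched in Python (a conceptual illustration only; Ab Initio transforms are written in DML, and the rule names here are invented):

```python
# Priority-ordered rule evaluation: numbered rules run first, lowest number
# first; rules with no priority run last, in the order they were written.

def evaluate_rules(rules):
    """rules: list of (priority_or_None, name); return names in evaluation order."""
    prioritized = sorted(
        (r for r in rules if r[0] is not None), key=lambda r: r[0]
    )
    unprioritized = [r for r in rules if r[0] is None]
    return [name for _, name in prioritized + unprioritized]

order = evaluate_rules([(None, "default"), (2, "second"), (1, "first")])
print(order)  # ['first', 'second', 'default']
```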
Ans) A local variable is a named storage location in an expression or transform function. You
declare a local variable within the transform function in which you want to use it. The local
variable is reinitialized each time the transform function is called, and it persists for one single
evaluation of the transform function.
Ans) A package is a named collection of related DML objects. A package can hold types,
transform functions, and variables, as well as other packages. Packages provide a means of
locating in one place DML objects that are needed more than once in a given graph, or needed
by multiple developers. Packages allow developers to avoid redundant code; this makes
maintenance of DML objects more efficient.
Packages are typically useful when:
The record formats of multiple ports use common record formats and/or type specifiers
Multiple components use common transforms
Ans) The multi-stage transform components require packages because, unlike other transform
components, they are driven by more than a single transform function. These components each
take a package as a parameter and, in order to process data, look for particular variables,
functions, and types in that package. For example, a multi-stage component might look for a
type named temporary_type, a transform function named finalize, or a variable named
count_items.
Ans) A phase is a stage of a graph that runs to completion before the start of the next stage. By
dividing a graph into phases, you can save resources, avoid deadlock, and safeguard against
failures. To protect a graph, all phases are checkpoints by default.
Ans) A checkpoint is a phase that acts as an intermediate stopping point in a graph and saves
status information to allow you to recover from failures. By assigning phases with checkpoints
to a graph, you can recover completed stages of the graph if failure occurs.
24) How will use the subgraph of graph A in the Graph B?
Ans) When you build a subgraph, it becomes a part of the graph in which you build it. If you
want to use it in other graphs, or in other places in the original graph, save it in the
Component Organizer of the GDE.
25) Is there a way to make my graph conditional, so that certain components may not run?
Ans) You can enter a Condition statement on the Condition tab of graph components. This is an
expression that evaluates to the string value for true or false (see details below). The GDE then
evaluates the expression at runtime. If the expression evaluates to true, the component or
subgraph is executed. If it is false, then the component or subgraph is not executed, and is
either removed completely or replaced with a flow between two user-designated ports.
Ans) If the GDE is performing slowly, you can improve performance with one or more of these
methods:
Turn off Undo by choosing File > Autosave/Undo on the GDE menu bar and clearing the
selection of Undo/Redo Enabled.
Turn off Propagation by choosing Edit > Propagation on the GDE menu bar and clearing
the selection of Record Format and Layout.
Increase the Tracking Interval by choosing Run > Default Settings on the GDE menu bar,
clicking the Code Generation tab, and increasing the Tracking Interval to 60 seconds.
Ans) Lookup File represents one or more serial files or a multifile. The amount of data is small
enough to be held in main memory. This allows a transform function to retrieve records much
more quickly than it could retrieve them if they were stored on disk. Lookup File associates key
values with corresponding data values to index records and retrieve them.
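The idea behind a Lookup File can be sketched in Python (purely illustrative; the record and field names are invented):

```python
# A lookup file is loaded into main memory once, then a transform retrieves
# records by key without any disk I/O.

records = [
    {"cust_id": 1, "region": "EAST"},
    {"cust_id": 2, "region": "WEST"},
]

# Build the key-to-record index once, as if loading the lookup file into memory.
lookup = {rec["cust_id"]: rec for rec in records}

def lookup_region(cust_id, default="UNKNOWN"):
    """Return the region for a key, or a default when the key is absent."""
    rec = lookup.get(cust_id)
    return rec["region"] if rec else default

print(lookup_region(2))   # WEST
print(lookup_region(99))  # UNKNOWN
```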
Ans) When an all-to-all flow connects components with layouts containing a large number of
partitions, the Co>Operating System uses many networking resources. If the number of
partitions in the source and destination components is N, an all-to-all flow uses resources
proportional to N × N (N squared).
To save network resources, you can mark an all-to-all flow as using two-stage routing. With
two-stage routing, the all-to-all flow uses only about 2 × N × √N resources.
For example, an all-to-all flow with 25 partitions uses 25 × 25 = 625 resources, but with two-
stage routing uses only 2 × 25 × 5 = 250 resources.
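The resource arithmetic above can be checked with a short Python sketch:

```python
import math

# Resource cost of an all-to-all flow with N partitions: direct routing uses
# about N*N connections; two-stage routing reduces this to about 2*N*sqrt(N).

def all_to_all_resources(n):
    """Return (direct, two_stage) resource counts for N partitions."""
    direct = n * n
    two_stage = int(2 * n * math.sqrt(n))
    return direct, two_stage

print(all_to_all_resources(25))  # (625, 250)
```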
Component Parallelism
Pipeline Parallelism
Data Parallelism
Component parallelism scales with the number of branches of a graph: the more branches a
graph has, the greater the component parallelism. If a graph has only one branch, component
parallelism cannot occur.
Ans) Pipeline parallelism occurs when a connected sequence of program components on the
same branch of a graph execute simultaneously.
Ans) Data parallelism occurs when you separate data into multiple divisions, allowing multiple
copies of program components to operate on the data in all the divisions simultaneously.
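Data parallelism can be sketched in Python (illustrative only; here the partitions are processed sequentially, but each chunk could run on its own processor or machine):

```python
# Data parallelism: split the records into partitions and run the same
# operation independently on every partition.

def partition(records, n):
    """Round-robin the records into n partitions."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts

def process(partition_data):
    """Stand-in transform applied to one partition."""
    return [rec * 10 for rec in partition_data]

parts = partition([1, 2, 3, 4, 5, 6], 3)
results = [process(p) for p in parts]
print(parts)    # [[1, 4], [2, 5], [3, 6]]
print(results)  # [[10, 40], [20, 50], [30, 60]]
```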
33) What are Multifiles and Multifile Systems & Multi directories?
Ans) Ab Initio multifiles are parallel files composed of individual files, typically located on
different disks and usually, but not necessarily, on different systems. These individual files are
the partitions of the multifile.
Ab Initio multifiles reside in parallel directories called multidirectories, which are organized
into multifile systems. An Ab Initio multifile system consists of multiple replications of a
directory tree structure containing multidirectories and multifiles. Each replication constitutes
a partition of the multifile system.
Each partition holds a subset of the data contained in the multifile system, and the system has
one additional partition that contains control information. The partitions containing data are
the data partitions of the system, and the additional partition is the control partition. The
control partition contains no user data, only the information the Co>Operating System needs to
manage the multifile system.
Ans) To create a multifile system, issue the m_mkfs command, using as arguments the URLs of
the partitions of the multifile system you want to create. The first URL creates the control
partition, and each subsequent URL creates the next partition of the multifile system.
Similarly, use m_mkdir for multidirectories.
Ans) Using the EME, you can conduct project analyses of the dependencies within and between
graphs. The EME examines the project and develops an analytical survey of it in its entirety,
tracing how data is transformed and transferred, field by field, from component to component.
Ans) The Analysis Level options are:
None: Turns off all translation and dependency analysis during checkin.
Translation Only: Translates graphs from GDE format to datastore format, but does not do
error checking and does not store results in the datastore. Tip: We recommend that at
minimum you do translation only, since it is required for analysis, which you can run anytime.
Translation with Checking: Translates graphs from GDE to datastore format and checks for
errors that will interfere with dependency analysis. See Checked-for Errors.
Full Dependency Analysis (Default): Performs full dependency analysis on the graph and saves
the results in the datastore. Tip: We recommend that you do not do analysis now, as it can
greatly prolong checkin.
What to Analyze
The What to Analyze group of checkboxes allow you to specify which files will be subjected to
the level of analysis you specified in Analysis Level. The following table explains the four
choices:
Only My Checked In Files: Only the files checked in by you. This group can include files you
checked in earlier which are still on the analysis queue and have not yet been analyzed.
Analysis Scope
The Analysis Scope group of checkboxes allow you to specify how far the specified level of
analysis will be extended to files dependent on those being analyzed, both in the current
project and in other projects. The following table describes the three choices.
Dependent Files from All Projects (Default): Files in other projects common to (included in)
the one you are checking, if they are dependent on the files being analyzed.
Dependent Files from Specified Project: Only the dependent files that are in the same project
as the file(s) being analyzed.
Ans) A switch parameter has a fixed set of string values which you define when you create the
parameter. The purpose of a switch parameter is to allow you to change your sandbox's
context: its value determines the values of various other parameters that you make dependent
on that switch. For each switch value, each of the dependent parameters has a dependent
value. Changing the switch's value thus changes the values of all its dependent parameters.
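The switch/dependent relationship can be sketched in Python (parameter names and values here are invented):

```python
# A switch parameter has a fixed set of values; each dependent parameter
# resolves to a different value per switch setting.

SWITCH_VALUES = ("dev", "prod")

DEPENDENT = {
    "DB_NAME":  {"dev": "sales_dev", "prod": "sales"},
    "DATA_DIR": {"dev": "/tmp/data", "prod": "/data"},
}

def resolve(switch_value):
    """Return the values of all dependent parameters for one switch setting."""
    if switch_value not in SWITCH_VALUES:
        raise ValueError("not a defined switch value: %s" % switch_value)
    return {name: vals[switch_value] for name, vals in DEPENDENT.items()}

print(resolve("dev"))  # {'DB_NAME': 'sales_dev', 'DATA_DIR': '/tmp/data'}
```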
Standard Parameters
Switch Parameters
Dependent Parameters
If you set the max-core value too low, the component runs more slowly than expected. If you
set the max-core value too high, the component might use too many machine resources, slow
the process drastically, and cause hard-to-diagnose failures.
Ans) The Ordered attribute is a port attribute. It determines whether the order in which you
attach flows to a port, from top to bottom, is significant to the definition and purpose of the
component. If a port is ordered, the order in which flows are attached determines the result of
the processing the component does: if you change the order in which you attach the flows, you
create a different result.
Note: GDE indicates the difference between a port that is ordered and one that is not by
drawing them differently. If you inspect the ordered port of Concatenate in the graph, you see
a line dividing the port between the two flows; that line is not present in the port of Gather,
which is not ordered.
Ans) Components maintain the ordering of the input data records unless their explicit purpose
is to reorder records. For most components, if record x appears before record y in an input
flow partition, and if record x and record y are both in the same output flow partition, then
record x appears before record y in that output flow partition.
For example, if you supply sorted input to a Partition component, it produces sorted output
partitions.
Exceptions are:
The components that explicitly reorder records, such as Sort, Sort within Groups, and
Partition by Key and Sort.
The components that have fan-in flows, such as the Departition components. They each
define their own record order.
Ans) The transform components and some other components have a logging parameter. This
parameter specifies whether or not you want the component to generate log records for
certain events. The value of the logging parameter is True or False. The default is False.
If you set the logging parameter to True, you must also connect the component's log port to a
component that collects the log records.
Ans) There are a number of components that compress and uncompress data.
Components:
Ans) Broadcast arbitrarily combines all the data records it receives into a single flow and writes
a copy of that flow to each of its output flow partitions.
Replicate arbitrarily combines all the data records it receives into a single flow and writes a
copy of that flow to each of its output flows.
Use Replicate to support component parallelism for example, when you want to perform
more than one operation on a flow of data records coming from an active component.
Use Broadcast to increase data parallelism when you have connected a single fan-out flow to
the out port or to increase component parallelism when you have connected multiple straight
flows to the out port.
Ans) Fuse applies a transform function to corresponding records of each input flow. The first
time the transform function executes, it uses the first record of each flow. The second time
the transform function executes, it uses the second record of each flow, and so on. Fuse sends
the result of the transform function to the out port.
The component works as follows. The component tries to read from each of its input flows.
Ans) Join reads data from two or more input ports, combines records with matching keys
according to the transform you specify, and sends the transformed records to the output port.
Additional ports allow you to collect rejected and unused records. There can be as many as 20
input ports.
Types of join:
Inner join sets the record-required parameters for all ports to True. Inner join is
the default. The GDE does not display the record-required parameters because they all
have the same value.
Outer join sets the record-required parameters for all ports to False. The GDE does
not display the record-required parameters because they all have the same value.
Explicit allows you to set the record-required parameter for each port individually.
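The record-required semantics can be sketched in Python (a toy two-input join; field names are invented, and real Join supports up to 20 ports):

```python
# record_required=True on a port means the other side's unmatched records are
# dropped; False keeps them. All True = inner join, all False = full outer join.

def join(left, right, key, left_required, right_required):
    left_keys = {r[key] for r in left}
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            for r in matches:
                out.append({**l, **r})
        elif not right_required:      # no right match required: keep left record
            out.append(dict(l))
    if not left_required:             # keep right records with no left match
        for r in right:
            if r[key] not in left_keys:
                out.append(dict(r))
    return out

L = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}]
R = [{"k": 1, "b": "p"}, {"k": 3, "b": "q"}]
inner = join(L, R, "k", True, True)    # only k=1 matches
outer = join(L, R, "k", False, False)  # all records survive
print(len(inner), len(outer))  # 1 3
```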
49) What is the use of the override-key parameter & where is it used?
Ans) The override-key parameter is used in the Join component to specify the alternative
name(s) for the key field(s) for a particular in port.
50) What are the different options available in reject-threshold?
Never abort
Abort on first reject
Use limit/ramp
Ans) Limit is a number representing the acceptable total of reject events. The default is 0.
Ramp is a decimal representing the rate of reject events per number of records processed.
The component stops the execution of the graph when the number of reject events
exceeds the result of the following formula:
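The formula itself is not shown above; as commonly documented for Ab Initio reject thresholds, the graph aborts when the reject count exceeds limit + (ramp × number of records processed so far). A Python sketch of that rule:

```python
# Reject-threshold rule: abort when rejects exceed limit + ramp * records_processed.

def should_abort(rejects, records_processed, limit=0, ramp=0.0):
    return rejects > limit + ramp * records_processed

print(should_abort(1, 100, limit=0, ramp=0.0))   # True: any reject aborts
print(should_abort(4, 100, limit=0, ramp=0.05))  # False: threshold is 5
print(should_abort(6, 100, limit=0, ramp=0.05))  # True: 6 exceeds 5
```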
Ans) Join with DB joins records from the flow or flows connected to its input port with records
read directly from a database (via an SQL statement), and outputs new records containing data
based on, or calculated from, the joined records.
execute_on_miss parameter:
A statement executed when no rows are returned by the select_sql statement. The statement
should be an INSERT (or possibly an UPDATE); after it is executed, the select_sql is executed a
second time. If no results are generated on the second attempt, the input record is rejected. A
database commit is by default performed after each execution of execute_on_miss, but this
can be altered by setting the commit number parameter.
53) What is the difference b/w JOIN & JOIN with DB?
Ans) The main difference is that in the Join with DB component we join the incoming feed file
with the TABLE by writing an SQL statement, whereas in a normal Join we don't have SQL
statements.
Instead of using a statement in SQL, you can now extract the to-be-joined data from the
database by calling a stored procedure, specified in the sql_select parameter. The syntax for
calling a stored procedure using Oracle or DB2 is as follows:
Ans) The Meta Pivot component allows you to split records by data fields (columns). The
component converts each input record into a series of separate output records. There is one
separate output record for each field of data in the original input record. Each output record
contains the name and value of a single data field from the original input record.
Ans) Reformat changes the record format of data records by dropping fields, or by using DML
expressions to add fields, combine fields, or transform the data in the records.
output-index:
Either the name of a file containing a transform function, or a transform string. The
Reformat component calls the specified transform function for each input record. The
transform function uses the value of the input record to direct that input record to a
particular output port. The expected output of the transform function is the index of
an output port (zero-based). The Reformat component directs the input record to the
identified output port and executes the transform function, if any, associated with that
port.
When you specify a value for output-index, each input record goes to exactly one
transform/output port pair. For example, suppose there are 100 input records and two
output ports. Each output port receives between 0 and 100 records. According to the
transform function you specify for output-index, the split can be 50/50, 60/40, 0/100,
99/1, or any other combination that adds up to 100.
If you do not specify a value for output-index, Reformat sends every input record to
every transform/output port pair. For example, if Reformat has two output ports and
there are no rejects, 100 input records results in 100 output records on each port for a
total of 200 output records.
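The output-index routing can be sketched in Python (illustrative; the routing rule here, evens to port 0 and odds to port 1, is invented):

```python
# output-index: a function maps each input record to the zero-based index of
# the single output port that receives it.

def output_index(record):
    return 0 if record % 2 == 0 else 1   # evens to port 0, odds to port 1

def reformat_with_output_index(records, n_ports):
    """Route each record to exactly one output port."""
    ports = [[] for _ in range(n_ports)]
    for rec in records:
        ports[output_index(rec)].append(rec)
    return ports

ports = reformat_with_output_index(range(6), 2)
print(ports)  # [[0, 2, 4], [1, 3, 5]]
```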
56) What is the difference between the Reformat and Redefine Format components?
Ans) The difference between Reformat and Redefine Format is that Reformat can actually
change the bytes in the data while Redefine Format simply changes the record format on the
data as it flows through, leaving the data unchanged.
The Reformat component can change the record format of data records by dropping fields, or
by using DML expressions to add fields, combine fields, or transform the data in the records.
The Redefine Format component copies data records from its input to its output without
changing the values in the data records. You use Redefine Format to change or rename fields in
a record format without changing the values in the records. In this way, it is similar to the DML
built-in function, reinterpret_as. Typically this component has different DML on its input and
output ports, and allows the unmodified data to be interpreted in a different form.
57) Explain Multi Reformat?
Ans) Multi Reformat changes the record format of data records flowing through from one to 20
pairs of in and out ports by dropping fields, or by using DML expressions to add fields, combine
fields, or transform the data in the records. A typical use for Multi Reformat is to put it
immediately before a custom component that takes multiple inputs.
The component operates separately on the data flowing between each pair of its inn-outn
ports. The count parameter specifies the total number of port pairs. Each inn-outn port pair
has its own associated transformn to reformat the data flowing between those ports.
58) What is ABLOCAL() and how can I use it to resolve failures when unloading in parallel?
Ans) Some complex SQL statements contain grammar that is not recognized by the Ab Initio
parser when unloading in parallel. You can use the ABLOCAL() construct in this case to prevent
the Input Table component from parsing the SQL (it will get passed through to the database). It
also specifies which table to use for the parallel clause.
59) What is the difference b/w Update table & Multi update table?
Ans) The main difference is commit number & commit table are mandatory parameters in multi
update table, where as in update table they are optional.
Update Table modifies only a single table in the database, whereas Multi Update Table can
modify more than one table, so we require the commit table & commit number in Multi Update
Table.
The statements are applied to the incoming records as follows. For each record:
Ans) Normalize generates multiple output data records from each of its input records. You can
directly specify the number of output records for each input record, or the number of output
records can depend on some calculation.
Denormalize consolidates groups of related data records into a single output record
with a vector field for each group, and optionally computes summary fields in the
output record for each group.
Both these components are multi-stage transform components. The multi-stage transform
components require packages because, unlike other transform components, they are driven by
more than a single transform function. These components each take a package as a parameter
and, in order to process data, look for particular variables, functions, and types in that
package. For example, a multi-stage component might look for a type named temporary_type,
a transform function named finalize, or a variable named count_items.
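The Normalize/Denormalize relationship can be sketched in Python (illustrative only; field names are invented):

```python
# Normalize expands each input record's vector field into one output record per
# element; Denormalize groups records back into a single record with a vector.

def normalize(records, vector_field):
    out = []
    for rec in records:
        for item in rec[vector_field]:
            flat = {k: v for k, v in rec.items() if k != vector_field}
            flat[vector_field[:-1]] = item   # naive singular: "items" -> "item"
            out.append(flat)
    return out

def denormalize(records, key, value_field):
    groups = {}
    for rec in records:
        groups.setdefault(rec[key], []).append(rec[value_field])
    return [{key: k, value_field + "s": v} for k, v in groups.items()]

rows = normalize([{"id": 1, "items": ["a", "b"]}], "items")
print(rows)  # [{'id': 1, 'item': 'a'}, {'id': 1, 'item': 'b'}]
print(denormalize(rows, "id", "item"))  # [{'id': 1, 'items': ['a', 'b']}]
```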
61) How can I generate DML for a database table from the command line?
Ans) The Ab Initio command-line utility, m_db, with the gendml argument, generates
appropriate metadata for a database table or expression. The syntax for the utility is
Ans) Departition components combine multiple flow partitions of data records into a single flow
as follows:
Concatenate appends multiple flow partitions of data records one after another.
Gather combines data records from multiple flow partitions arbitrarily.
Interleave combines blocks of data records from multiple flow partitions in round-robin
fashion.
Merge combines data records from multiple flow partitions that have been sorted
according to the same key specifier and maintains the sort order.
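Concatenate and Merge behavior can be sketched in Python (illustrative; Merge relies on each partition already being sorted on the key):

```python
import heapq

# Concatenate appends partitions one after another; Merge combines pre-sorted
# partitions into a single sorted flow.

p0 = [1, 4, 7]
p1 = [2, 5, 8]
p2 = [3, 6, 9]

concatenated = p0 + p1 + p2
merged = list(heapq.merge(p0, p1, p2))

print(concatenated)  # [1, 4, 7, 2, 5, 8, 3, 6, 9]
print(merged)        # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```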
Ans) Partition components distribute data records to multiple flow partitions to support data
parallelism, as follows:
Broadcast arbitrarily combines all the data records it receives into a single flow and
writes a copy of that flow to each of its output flow partitions.
Partition by Expression distributes data records to its output flow partitions according
to a specified DML expression.
Partition by Key distributes data records to its output flow partitions according to key
values.
Partition by Range distributes data records to its output flow partitions according to
the ranges of key values specified for each partition.
Partition with Load Balance distributes data records to its output flow partitions,
writing more records to the flow partitions that consume records faster.
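Partition by Key behavior can be sketched in Python (illustrative; hashing the key guarantees records with equal keys land in the same partition):

```python
# Partition by Key: a hash of the key chooses the output partition, so all
# records with the same key share a partition.

def partition_by_key(records, key, n):
    parts = [[] for _ in range(n)]
    for rec in records:
        parts[hash(rec[key]) % n].append(rec)
    return parts

records = [{"k": i % 4, "v": i} for i in range(8)]
parts = partition_by_key(records, "k", 2)
print([sorted({r["k"] for r in p}) for p in parts])  # [[0, 2], [1, 3]]
```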
65) What are FTP components?
Ans) FTP (file transfer protocol) components transfer data records as follows:
FTP From transfers files of data records from a computer that is not running the
Co>Operating System to a computer that is running the Co>Operating System.
FTP To transfers files of data records to a computer that is not running the
Co>Operating System from a computer that is running the Co>Operating System
Ans) You can use a Reformat component with a force_error() function to test for a condition
and terminate the graph if that condition is met.
Ans) Rollup evaluates a group of input records that have the same key and then generates data
records that either summarize each group or select certain information from each group. There
are two ways to use a Rollup component:
Template Mode:
This mode uses a transform that uses built-in aggregation functions such as sum, max, min,
count, and avg.
Expanded Mode:
This mode uses a multi-stage transform package with explicit functions (such as initialize,
rollup, and finalize), giving you full control over the computation.
Ans) For every input record, Scan generates an output record that includes a running,
cumulative summary for the group of data records that the input record belongs to. For
example, the output records might include successive year-to-date totals for groups of data
records.
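Scan's running summaries can be sketched in Python (illustrative; field names are invented):

```python
# Scan: one output record per input record, carrying the running total for that
# record's group.

def scan(records, key, amount):
    totals = {}
    out = []
    for rec in records:
        totals[rec[key]] = totals.get(rec[key], 0) + rec[amount]
        out.append({key: rec[key], "running_total": totals[rec[key]]})
    return out

txns = [
    {"store": "A", "amt": 10},
    {"store": "B", "amt": 5},
    {"store": "A", "amt": 7},
]
print(scan(txns, "store", "amt"))
# [{'store': 'A', 'running_total': 10},
#  {'store': 'B', 'running_total': 5},
#  {'store': 'A', 'running_total': 17}]
```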
Ans) For every input record, Scan with Rollup sends an output record to its out port that
includes a running, cumulative summary for the input group that input record belongs to. In
addition, after reading all input records for a particular input group, Scan with Rollup sends a
summary record to its rollup port for that input group. For example, suppose transaction
records are keyed on the stores in which they occur. Each record sent to the out port might
include the year-to-date transaction total for the store in which the transaction occurred.
Each record sent to the rollup port would include the year's total for transactions at one
store, and there would be one record for each store.
When Sort within Groups reaches the end of a group, it does the following:
NOTE: When connecting a fan-in or all-to-all flow to the in port of a Sort, you do not need to
use a Gather because Sort can gather internally on its in port.
Ans) Validate Records uses the is_valid function to check each field of the input data records
to determine if the value in the field is:
Consistent with the data type specified for the field in the input record format
Meaningful in the context of the kind of information it represents
72) What is the difference between m_rollback and m_cleanup? When would you use them?
Ans) m_rollback rolls back a partially completed graph to its beginning state. m_cleanup
cleans up files left over from unsuccessfully executed graphs and manually recovered graphs.
The Co>Operating System automatically creates a recovery (.rec) file and other temporary files
and directories in the course of executing a graph. When a graph terminates abnormally, it
leaves the temporary files and directories on disk. At this point there are several alternatives
possible:
The Co>Operating System rolls back the graph automatically, if possible. You can roll back the
graph manually by explicitly using the m_rollback command without the -d option. After a
rollback, some temporary files and directories remain on disk. To remove them, follow one of
the other three alternatives.
If the graph is not already rolled back, rerunning the graph first rolls back the graph to the last
checkpoint. The graph then starts re-executing. If the re-execution is successful, it removes all
temporary files and directories.
For old job files, you can use the m_cleanup utility to list the
temporary files and directories, and m_cleanup_rm to delete them. You can also use
m_cleanup_du to display the amount of space these files use. Because recovery files and
temporary files are automatically created in the course of a run, remember not to delete these
files for jobs that are still running.
73) What does the error message "straight flows may only connect ports having equal
depths" mean?
Ans) The "straight flows may only connect ports having equal depths" error message appears
when you connect two components running at different levels of parallelism (or depth) with a
straight flow (one that does not have an arrow symbol on it). For example, you get this error
message if you connect a Join running 10-way parallel to a serial output file, or if you connect
a serial Join to a 4-way multifile.
74) What is AB_WORK_DIR and what do you need to know about it?
Ans) AB_WORK_DIR is a configuration variable whose value is a working space for graph
execution. You can view its value by using m_env describe.
75) What does the error message "too many open files" mean, and how do you fix it?
Ans) The "too many open files" error message occurs most commonly because the value of the
max-core parameter of the Sort component is set too low. In these cases, increasing the value
of the max-core parameter solves the problem.
76) What does the error message "Failed to allocate <n> bytes" mean and how do you fix it?
Ans) The "failed to allocate" type of error message is generated when an Ab Initio process has
exceeded its limit for some type of memory allocation.
Reduce the value of max-core in order to reduce the amount of memory allocated to a
component before temporary files are used. When the amount of memory specified by
max-core is used up by a component, the component starts writing temporary files to
hold the data being processed.
Be aware that while reducing the value of max-core may solve the problem of running
out of swap space, it may have an adverse effect on the graph's performance and will
increase the number of temporary files.
Increase available swap space, for example, by waiting until other memory intensive
jobs have completed.
77) What do you need to do to configure a graph to run across two or more machines?
Ans) In order to execute a graph across multiple machines, you need to carry out the following
steps:
Make sure that all the machines involved have compatible Co>Operating Systems
installed.
Set up the configuration files (.abinitiorc files) so that the different Co>Operating
Systems can communicate with each other.
Set up the environment variables and make sure that they are propagated properly
from one machine to another, when appropriate.
Set up the graph so that it can run across the machines as desired.
78) What communication ports does the GDE use when communicating with the
Co>Operating System?
Ans) The communication ports used depend upon the communication protocol selected. In
short, the GDE uses:
SSH(/AI): 22
AI/REXEC: 512
AI/TELNET: 23 & **
The ** above refers to the dynamically determined port that the control server sets up for
the file transfer.
79) If you use the layout Database: default in your database component, which working
directory does the Co>Operating System use?
Ans) The $AB_WORK_DIR directory is the working directory for database layouts.
$AB_DATA_DIR provides disk storage for the temporary files.
80) What are vectors?
Ans) Vectors are arrays of elements. An element can be a single field or an entire record. They
are often used to provide a logical grouping of information. Many programming languages use
the concept of an array. In broad terms, an array is a collection of elements that are logically
grouped for ease of access.
81) How can you quickly test DML expressions?
Ans) You can use the m_eval utility to quickly test the expressions that you intend to use in
your graphs.
82) Where does the debugger place watcher files?
Ans) The debugger places watcher files in the layout of the component downstream of the
watcher.
83) How do you delete all watcher datasets?
Ans) To delete all watcher datasets in the host directory (for all graphs), you can either use the
GDE menu option, Debugger > Clean-out Watcher Datasets or invoke the following command:
m_rm -f -rmdata GDE-WATCHER-xxx
84) How can I determine which version of the GDE and Co>Operating System I am using?
Ans) To determine your GDE version, on the GDE menu bar choose Help > About Ab Initio. To
determine your Co>Operating System version, run either of the following commands:
m_env -version
m_env -v
85) Should you use a Reformat component with a lookup file or a Join component in graph?
Ans) First of all, there are situations in which you cannot use a Reformat with Lookup instead
of a Join. For example, you cannot do a Full Outer Join using a Reformat and Lookup. The
answer below assumes that in your particular case either Reformat with Lookup or Join can be
used in principle, and that the question is about performance benefits of one over the other.
When the lookup file (in case of lookup) or the nondriving input (in case of a Join) fits into the
available memory, the Join and the lookup offer very similar performance.
86) How can you increase the time-out value for starting an Ab Initio process?
Ans) You can increase time-out values with the Ab Initio environment variables
AB_STARTUP_TIMEOUT and AB_RTEL_TIMEOUT_SECONDS.
To copy: m_cp
To move: m_mv
To count: m_wc
88) What are data-sized vectors? How do you work with them?
Ans) Data-sized vectors are vectors that have no set length of elements but, rather, are
variably sized based upon the number of elements in each data record. For example, if an
input dataset has three records, each with a vector, the first record's vector might have 5
elements, the second 1 element, and the third record, 7.
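As an analogy (a hedged Python sketch, not Ab Initio DML), a data-sized vector is like a record that carries its own per-record element list:

```python
# Python analogy for data-sized vectors: each record's vector length varies
# with the data, rather than being fixed in the record format.
records = [
    {"elements": [10, 20, 30, 40, 50]},  # 5 elements
    {"elements": [7]},                   # 1 element
    {"elements": list(range(7))},        # 7 elements
]

# The length is discovered per record, not declared once for all records.
lengths = [len(r["elements"]) for r in records]
print(lengths)  # [5, 1, 7]
```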
89) What is the difference b/w today (now) and today1 (now1)?
Ans) The today (now) function calls the operating system for the current date on each call.
In contrast, the function today1 (now1) calls the operating system for the current date
only on the first call in a job, returning the same value on subsequent calls. The
difference between the two functions is particularly noticeable on jobs that start
before and end after midnight.
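The difference can be sketched in Python (an analogy only; the function names merely mirror the DML ones):

```python
# Sketch of today() vs today1(): today() asks the OS for the date on every
# call, while today1() caches the result of its first call for the rest of
# the job, so its value cannot change even if the job crosses midnight.
import datetime

def today():
    return datetime.date.today()

_cached = None
def today1():
    global _cached
    if _cached is None:
        _cached = datetime.date.today()
    return _cached

first = today1()
later = today1()
print(first == later)  # True: today1 is stable within one run
```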
Ans) The Read Raw component reads a flow of data whose structure requires it to be parsed
programmatically rather than with declarative DML type declarations. Typically, the data
written to the output port can be readily described with DML types.
Ans) first_defined returns the first defined (not NULL) argument of two arguments. Note that
the Oracle NVL function is very similar to this function.
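The semantics can be illustrated with a small Python sketch (an analogy; None stands in for DML NULL):

```python
# first_defined returns the first argument that is not NULL, like Oracle NVL.
def first_defined(a, b):
    return a if a is not None else b

print(first_defined(None, "default"))   # default
print(first_defined("value", "default"))  # value
```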
3) What function would you use to convert a string into a decimal?
Ans) To convert a string to a decimal you typecast it, for example:
out.decimal_field :: (decimal(8)) in.string_field;
A statement of this form converts the string to a decimal and populates the decimal field in
the output.
There are many ways to handle DMLs that change dynamically within a single file.
Among the suitable methods are using a conditional DML or invoking
vector functionality when calling the DMLs.
.air-sandbox-overrides
This file exists only if you are using version 1.11 or a later version of the GDE. It contains
the user's private values for any parameters in .air-project-parameters that have the Private
Value flag set. It has the same format as the .air-project-parameters file.
When you edit a value (in the GDE) for a parameter that has the Private Value flag checked,
the value is stored in the .air-sandbox-overrides file rather than the
.air-project-parameters file.
8) air lock show -user <UNIX User ID> -- shows all the
files locked by a user in various projects.
9) air sandbox status <file name with the relative path> ---
shows the status of file in the sandbox with respect to
the EME (Current, Stale, Modified are few statuses)
13) What is unique-only in Dedup Sorted?
Answer
The keep parameter of the Dedup Sorted component (choice, required) determines which record
of each key group is kept: first, last, or unique-only, which keeps only records whose key
value is not duplicated. The default is first.
14) What is Ad hoc multifile? How is it used?
ANSWER:
Here is a description of Ad hoc multifile:
Ad hoc Multifiles treat several serial files having the same record format as a single graph
component.
Frequently, the input of a graph consists of a set of serial files, all of which have to be
processed as a unit. An Ad hoc multifile is a multifile created 'on the fly' out of a set of serial
files, without needing to define a multifile system to contain it. This enables you to represent
the needed set of serial files with a single input file component in the graph. Moreover, the set
of files used by the component can be determined at runtime. This lets the user customize
which set of files the graph uses as input without having to change the graph itself, even after
it goes into production.
Ad hoc Multifiles can be used as output, intermediate, and lookup files as well as input files.
The simplest way to define an Ad hoc multifile is to list the files explicitly.
If you have listed 'n' files, the input file then acts something like a file in an n-way
multifile system, whose data partitions are the n files you listed. It is possible for
components to run in the layout of the input file component. However, there is no way to run
commands such as m_ls or m_dump on the files, because they do not comprise a real multifile
system.
There are other ways than listing the input files explicitly in an Ad hoc multifile.
1. Listing files using wildcards - If the input file names have a common pattern then you can
use a wild card for all the files. E.g. $AI_SERIAL/ad_hoc_input_*.dat. All the files that are
found at the runtime matching the wild card pattern will be taken for the Ad hoc multifile.
2. Listing files in a variable. You can create a runtime parameter for the graph and inside the
parameter you can list all the files separated by spaces.
3. Listing files using a command - E.g. $(ls $AI_SERIAL/ad_hoc_input_*.dat), which produces
the list of files to be used for the ad hoc multifile. This method gives maximum flexibility
in choosing the input files, since you can use complex commands that involve, for example,
the file's owner or its date-time stamp.
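The wildcard method can be sketched in Python (an analogy using glob; the directory and file names are invented for illustration, not a real project layout):

```python
# Resolving a wildcard into a file list at runtime, the way an ad hoc
# multifile resolves a pattern like $AI_SERIAL/ad_hoc_input_*.dat.
import glob
import os
import tempfile

# Stand-in for $AI_SERIAL: a temp directory with three matching files.
serial_dir = tempfile.mkdtemp()
for i in range(3):
    open(os.path.join(serial_dir, "ad_hoc_input_%d.dat" % i), "w").close()

# The set of files is determined when the pattern is expanded, not when
# the "graph" was written.
files = sorted(glob.glob(os.path.join(serial_dir, "ad_hoc_input_*.dat")))
print(len(files))  # 3: these would be the 3 partitions of the ad hoc multifile
```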
15) How can I tune a graph so it does not excessively consume my CPU?
How to Tune a Graph against Excessive CPU consumption?
ANSWER:
Options:
1. Reduce the DOP (degree of parallelism) for components.
Example:
1. Change from a 4-way parallel to a 2-way parallel.
2. Examine each transformation for inefficiencies.
Example:
1. If a transformation uses many local variables, make these variables global.
2. If the same function call is performed more than once, call it once and store its value in
a global variable.
3. When reading data, reduce the amount of data that needs to be carried forward to the next
component.
16) I'm having trouble finding information about the AB_JOB variable. Where and how can I
set this variable?
ANSWER:
You can change the value of the AB_JOB variable in the start script of a given graph. This
enables you to run the same graph multiple times concurrently (thus in parallel). However,
make sure you append a unique identifier, such as a timestamp or a sequential number, to the
end of each AB_JOB variable name you assign. You will also need to vary the file names of any
outputs to keep the graphs from stepping on each other's outputs. I have used this technique
to create a "utility" graph as a container for a start script that runs another graph
multiple times depending on the local variable input to the "utility" graph. Be careful you
don't max out the capacity of the server you are running on.
17) I have a job that will do the following: ftps files from remote server; reformat data in
those files and updates the database; deletes the temporary files.
How do we trap errors generated by Ab Initio when an ftp fails? If I have to re-run / re-start
a graph again, what are the points to be considered? Does *.rec file have anything to do
with it?
ANSWER:
Ab Initio has very good restartability and recovery features built into it. In your situation
you can do the tasks you mentioned in one graph with phase breaks:
FTP in phase 1, your transformation in the next phase, and then the DB update in another
phase. (This is just an example; it may not be the best way of doing it, as the best design
depends on various other factors.)
If the graph fails during FTP, it fails in Phase 0 and you can simply restart it. If the
graph fails in Phase 1, an AB_JOB.rec file exists; when you restart the graph you will see a
message saying a recovery file exists and asking whether you want to start the graph from the
last successful checkpoint or restart from the beginning. The same applies if it fails in
Phase 2.
Phases are expensive from a disk I/O perspective, so be careful not to do too much phasing.
Coming back to error trapping: each component has reject, error, and log ports. The reject
port captures rejected records, the error port captures the corresponding errors, and the log
port captures the execution statistics of the component. You can control the reject status of
each component by setting the reject threshold to "Never abort" or "Abort on first reject",
or by setting a ramp/limit.
Recovery files keep track of crucial information for recovering the graph from a failed
status, such as which node each component is executing on. It is a bad idea to simply remove
the *.rec files; you always want to roll back the recovery files cleanly so that temporary
files created during graph execution don't hang around, occupy disk space, and create issues.
ANSWER:
1) Component parallelism:
A graph with multiple processes running simultaneously on separate data uses component
parallelism.
2) Data parallelism:
A graph that deals with data divided into segments and operates on each segment
simultaneously uses data parallelism. Nearly all commercial data processing tasks can use data
parallelism. To support this form of parallelism, Ab Initio software provides Partition
Components to segment data, and Departition Components to merge segmented data back
together.
3) Pipeline parallelism:
A graph with multiple components running simultaneously on the same data uses pipeline
parallelism.
Each component in the pipeline continuously reads from upstream components, processes data,
and writes to downstream components. Since a downstream component can process records
previously written by an upstream component, both components can operate in parallel.
NOTE: To limit the number of components running simultaneously, set phases in the graph.
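Pipeline parallelism can be sketched with Python generators (an analogy only: generators run in one process, but they show the same record-at-a-time overlap between stages):

```python
# Each stage consumes records from the upstream stage as they are produced,
# so a downstream stage can work on record N while upstream stages are
# already producing record N+1.
def read_records():
    for i in range(5):
        yield i  # source component emitting records one at a time

def transform(stream):
    for rec in stream:
        yield rec * 2  # middle component processing each record as it arrives

def collect(stream):
    return list(stream)  # sink component consuming the pipeline

out = collect(transform(read_records()))
print(out)  # [0, 2, 4, 6, 8]
```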
ANSWER:
A sandbox is a directory structure in which each directory level is assigned a variable name;
it is used to manage check-in and checkout of repository-based objects such as graphs.
Within the EME, an identical structure exists for the same project.
This structure exists under the operating system (e.g. UNIX); for instance, for a project
called fin, fin is usually the name of the top-level directory.
When you check out or check in a whole project or an object belonging to a project, the
information is exchanged between these two structures.
For instance, if you check out a dml called fin.dml for the project called fin, you need a
sandbox with the same structure as the EME project called fin. Once you have created that,
fin.dml (or a copy of it) comes out of the EME and is placed in the dml directory of your
sandbox.
21) How can I read data which contains variable length records with different record
structures and no delimiters?
ANSWER:
a)Try using the Read Raw component, it should do exactly what you are looking for.
ANSWER:
First, highlight all of the components you would like to have in the sub graph, click on edit,
then click on sub graph, and finally click on create.
ANSWER:
Checkin and checkout from EME
Checkin (sandbox) ---------------> EME
Checkout (sandbox) <------------- EME
1. AbInitio gives command line interfaces via air command to perform
Checkin and checkout
2. Checkin and checkout must be performed via sandbox
3. GDE gives option to perform checkin and checkout via Project ---->
Checkin option ----> checkout "
4. You create a sandbox from GDE via Project -----> Create sandbox
option
5. When creating a sandbox you specify a directory name ( try it
out; don't be afraid)
6. EME contains one or many Projects ( a project is a collection of
graphs and related files and a parameter called Project parameter file )
7. The project parameter file, when it resides within the EME, is called the
project parameter file.
8. The project parameter file, when it resides within one's sandbox, is called the sandbox
parameter file. The sandbox parameter file is therefore a copy of the project parameter file
and is local to the sandbox owner.
9. When project parameters change, the change will be reflected in your sandbox parameters
if, after the change has taken place, you check out a graph (and therefore a copy of the
latest project parameters).
10. You edit sandbox parameter via Project ----->edit sandbox option
11. You edit project parameters via Project -----> Administrative -------> edit project
12. When checking out an object, use Project--------> checkout option.
Navigate down to the Project of your choice Navigate down to required directory
( eg.mp, dml or xfr etc ) Select the object required Then specify a sandbox name ( ie.
the top level directory of the directory structure called sandbox) You will be prompted to
confirm the checkout
13. Sometimes, when you check out an object, you get a number of other objects checked
out for you automatically; this happens due to dependency.
Example
Check out a graph (.mp file)
In addition, you might get a .dml or .xfr file
You will also certainly get a .ksh file for the graph
27)I was trying to use a User Defined Function (int_to_date) inside a Rollup, to type cast
date and time values originally stored as integers back to date forms and then concatenate
the same.
The code I wrote is as below.
record
datetime("YYYY-MM-DD HH24:MI:SS")("\001") output_date_format;
end out::int_to_date(record
big endian integer(4) input_date_part;
end in0, record
big endian integer(4) input_time_part;
end in1) begin
let datetime("YYYY-MM-DD HH24:MI:SS")("\001") v_output_format =(datetime("YYYY-MM-DD
HH24:MI:SS"))string_concat((string("|"))(date("YYYY-MM-DD"))in0.input_date_part,(string("|"))
(datetime("HH24:MI:SS"))decimal_lpad(((string("|"))(decimal("|")) in1.input_time_part),6));
out.output_date_format :: v_output_format;
end;
out::rollup(in) begin
let datetime("YYYY-MM-DD HH24:MI:SS")("\001") rfmt_dt;
rfmt_dt=int_dat(in.reg_date, in.reg_time);
out.datetime_output :: rfmt_dt;
out.* :: in.*;
end;
out::reformat(in) =
begin
let string(integer(2)) temp_code2 = in.code2;
let string(integer(2)) temp_code22 = " ";
let integer(2) i=0;
while (string_index(temp_code2, ",") != 0 || temp_code2 != "")
begin
temp_code22 = string_concat(in.code,",", string_substring(temp_code2,
1,string_index(temp_code2,",")));
temp_code2 = string_substring(temp_code2, string_index(temp_code2, ","),
string_length(temp_code2));
i=i+1;
end
out.code :: in.code;
out.code2 :: string_lrtrim(temp_code22);
out.count:: i;
end;
my expected output is
string_1,string_1,name,string_1,location,string_1,firstname,string_1,lastname,string_1,middlen
ame,5
string_2,string_2,job,string_2,location,string_2,firstjob,string_2,lastjob,4
string_3,string_3,design,string_3,color,string_3,paint,string_3,architect,4
ANSWER:
record
begin
string(",") code, code2;
integer(2) count ;
end("\n")
In my Ab Initio it is not validated.
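The intent of the transform's while-loop can be sketched in Python (an analogy, not DML; the helper name expand is invented):

```python
# Walk a comma-separated code2 string, prefix every token with code, and
# count the tokens, which is what the expected output above shows.
def expand(code, code2):
    tokens = [t for t in code2.split(",") if t != ""]
    pairs = ",".join("%s,%s" % (code, t) for t in tokens)
    return pairs, len(tokens)

pairs, count = expand("string_1",
                      "name,location,firstname,lastname,middlename")
print(pairs)
# string_1,name,string_1,location,string_1,firstname,string_1,lastname,string_1,middlename
print(count)  # 5
```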
29) In my graph I am creating a file with account data. For a given account there can be
multiple rows of data. I have to split this file into 4 (specifically) files which are nearly
equal in size. The trick is to keep the accounts confined to one file. In other words
account data should not span across these files. How do I do it?
Also if the records are less than 4 (different accounts) i should be able to create empty
files. But I need atleast 4 files.
FYI: The requirement to have 4 files is because I need to start 4 parallel processes for load
balancing the subsequent processes.
ANSWER:
a)
I could not understand your requirement very clearly, as you want to split the file into 4
equal parts as well as keep the same account numbers in the same file. Can you explain what
you would do in the case of 5 account numbers having 20 records each? As far as splitting is
concerned, a very crude solution would be as follows.
In the end script do the following:
1. Find the size of the file and store it in a variable (say v_size)
2. v_qtr_size=`expr $v_size / 4`
3. split -b $v_qtr_size <filename>
4. Rename the split files as per your requirement. Note that the split files have a specific
pattern in their names.
b)
Your requirement is such that it essentially depends on the skewness of your data across
accounts. If you want to keep same accounts in same partition, then partition the data by key
(account) with the out port connected to 4 way parallel layout. But this does not guarantee
equal load in all partitions unless the data has little skewness.
But I can suggest you an alternative approach, though cumbersome, still might give you a
result, close to your requirement.
You replicate your original dataset into two, and take one of them and rollup on account no to
find the record count per account_no. Now sort this result on record count so that you have the
account_no with min count at top and the one with max count at bottom. Now apply a
partition by round robin and separate out the four partitions (partition 0, 1, 2 & 3).
Now take the first partition and join with your main dataset ( that you have replicated earlier)
on account_no and write the matching records (out port) into the first file. Take the unused
records of your main flow of the previous join and now join it with the second partition
(partition1) on account_no and write the matching records (out port) to the second file.
Similarly again take the unused records of the previous join and join it with the third partition
(partition 2) on acount_no. Write the matching record (out port) to the third file and the
unused records of the main flow in the fourth file.
This way you can get four files, nearly equal in size, with no account spread across files.
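The idea in answer (b) can be sketched in Python (a hedged analogy: a simple greedy assignment stands in for the rollup/sort/round-robin/join chain, but the guarantee is the same, each account lands in exactly one output):

```python
# Spread whole accounts across 4 buckets so no account spans files and the
# bucket sizes stay close. Account names and counts are invented test data.
from collections import Counter

records = [("acct_%d" % (i % 5), i) for i in range(20)]  # 5 accounts, 4 rows each
counts = Counter(acct for acct, _ in records)            # rows per account

buckets = [[] for _ in range(4)]
sizes = [0, 0, 0, 0]
for acct, n in counts.most_common():      # largest accounts first
    target = sizes.index(min(sizes))      # lightest bucket so far
    buckets[target].append(acct)
    sizes[target] += n

# Every account appears in exactly one bucket, and all rows are accounted for.
assigned = sorted(a for b in buckets for a in b)
print(assigned == sorted(counts))  # True
```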
30) I have a graph parameter state_cd having values based on a If statement. This variable I
would like to use in SQL statement in AI_SQL directory. I have 20 SQL statements for 20
table codes. I will be using corresponding SQL statement based on table code passed as
parameter to a graph.
Problem is ${STATE_CD} is not getting interpreted when I echo the 'Select Statement',
hence giving problem.
What is the difference between m_rollback and m_cleanup and when would I use them?
Answer
m_rollback has the same effect as an automatic rollback using the jobname.rec file, it rolls
back a job to the last completed checkpoint, or to the beginning if the job has not completed
any checkpoints. The m_cleanup commands are used when the jobname.rec file doesn't exist
and you want to remove temporary files and directories left by failed jobs.
Details
In the course of running a job, the Co>Operating System creates a jobname.rec file in the
working directory on the run host.
NOTE: The script takes jobname from the value of the AB_JOB environment variable. If
you have not specified a value for AB_JOB, the GDE supplies the filename of the graph as the
default value for AB_JOB when it generates the script.
The jobname.rec file contains a set of pointers to the internal job-specific files written by the
launcher, some of which the Co>Operating System uses to recover a job after a failure. The
Co>Operating System also creates temporary files and directories in various locations. When a
job fails, it typically leaves the jobname.rec file, the temporary files and directories, and
many of the internal job-specific files on disk. (When a jobs succeeds, these files are
automatically removed, so you don't have to worry about them.)
If your job fails, determine the cause and fix the problem. Then rerun the job.
If the job succeeds, the jobname.rec file and all the temporary files and directories are
cleaned up. Alternatively, run m_rollback -d to clean up the files left behind by the failed
job.
Short answer
The max-core parameter is found in the SORT, JOIN, and ROLLUP components, among others.
There is no single, optimal value for the max-core parameter, because a "good" value depends
on your particular graph and the environment where it runs.
Details
The SORT component works in memory, and the ROLLUP and JOIN components have the option
to do so. These components have a parameter called max-core, which determines the
maximum amount of memory they will consume per partition before they spill to disk. When
the value of max-core is exceeded in any of the in-memory components, all inputs are dropped
to disk. This can have a dramatic impact on performance; but this does not mean that it is
always better to increase the value of max-core.
The higher you set the value of max-core, the more memory the component can use. Using
more memory generally improves performance up to a point. Beyond this point,
performance will not improve and might even decrease. If the value of max-core is set too
high, operating system swapping can occur and the graph might fail if memory on the machine
is exhausted.
When setting the value for max-core, you can use the suffixes k, m, and g (upper case is also
supported) to indicate powers of 1024. For max-core, the suffix k (kilobytes) means precisely
1024 bytes, not 1000. Similarly, the suffix m (megabytes) means precisely 1048576 (1024^2)
bytes, and g (gigabytes) means precisely 1024^3 bytes. Note that the maximum value for
max-core is 2g-1.
SORT component
For the SORT component, 100 MB is the default value for max-core. This default is used to
cover a wide variety of situations and might not be ideal for your particular circumstances.
Increasing the value of max-core will not increase performance unless the full dataset can be
held in memory, or the data volume is so large that a reduction in the number of temporary
files improves performance. You can estimate the number of temporary files by multiplying the
data volume being sorted by three and dividing by the value of max-core (because data is
written to disk in blocks that are one third the size of the max-core setting). This number
should be less than 1000. For example, suppose you are sorting 1 GB of data with the default
max-core setting of 100 MB and the process is running in serial. The number of temporary
files that will be created is (3 x 1024 MB) / 100 MB, or about 31.
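The worked example above, as arithmetic (a sketch of the estimate; the one-third block size is taken from the text):

```python
# Estimate of SORT temporary files: three times the data volume divided by
# max-core, because data is written to disk in blocks one third of max-core.
data_mb = 1024        # 1 GB of data being sorted, serial
max_core_mb = 100     # default SORT max-core

temp_files = (3 * data_mb) / max_core_mb
print(round(temp_files))  # 31, comfortably under the 1000 guideline
```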
You should decrease the value of a SORT component's max-core if an in-memory ROLLUP or
JOIN component in the same phase would benefit from additional memory. The net
performance gain will be greater.
If you get a "Too many open files" error message, your SORT component's max-core might be
set too low. If this is the case, SORT can also fill AB_WORK_DIR (usually set to /var/abinitio at
installation), which will cause all graphs to fail with a message about semaphores. This
directory is where recovery information and inode information for named pipes are stored and
is typically mounted on a small filesystem.
It is difficult to be precise about the amount of memory an in-memory ROLLUP or JOIN will
consume. An in-memory JOIN tries to hold all its nondriving inputs in memory, so make the
largest input by volume the driving one. A ROLLUP component must hold the size of the keys,
plus the size of the temporaries, plus the size of any input fields required in finalize to
produce the output. In practice, in most ROLLUP components, this is just the size of the
output. In addition, some space is needed for the hash table.
You should always set the max-core parameter in in-memory ROLLUP and JOIN components
with a parameter, like AI_GRAPH_MAX_CORE. The default can be set to the appropriate value
and changed at runtime if required. You can create additional parameters such as
AI_GRAPH_MAX_CORE_HALF and AI_GRAPH_MAX_CORE_QUARTER to divide up the available
max-core among different in-memory components in a phase. If two in-memory components
each need most or all of AI_GRAPH_MAX_CORE, you should put them in separate phases,
provided you have the disk space to hold the data at the phase break.
A second use of phasing is to control the allocation of memory among in-memory components.
Because there is a limited amount of memory available, you can use phasing to make sure each
in-memory component gets a sufficient amount. Typically, only one to four in-memory
components should occupy the same phase, depending on memory availability and demands.
To compute a value for AI_GRAPH_MAX_CORE, take the total memory on the machine and
subtract the memory used by lookups and competing processes, including other graphs running
at the same time on the machine. This is the available memory. Divide this by twice the
number of partitions to get AI_GRAPH_MAX_CORE (max-core is measured per partition, and the
factor of two gives a contingency safety factor). So:
AI_GRAPH_MAX_CORE = available memory / (2 x number of partitions)
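The sizing rule can be sketched as arithmetic (the memory figures are illustrative assumptions, not recommendations):

```python
# AI_GRAPH_MAX_CORE = available memory / (2 * partitions), per the rule above.
total_memory_mb = 16 * 1024       # machine memory (assumed: 16 GB)
lookups_and_others_mb = 4 * 1024  # lookups + competing processes (assumed)
partitions = 4                    # degree of parallelism

available_mb = total_memory_mb - lookups_and_others_mb
ai_graph_max_core_mb = available_mb // (2 * partitions)  # factor of 2 = safety
print(ai_graph_max_core_mb)  # 1536 MB per partition
```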