
DWH Material

Version 1.0

REVISION HISTORY


The following table reflects all changes to this document.

Date          Author / Contributor   Version   Reason for Change
01-Nov-2004                          1.0       Initial Document
14-Sep-2010                          1.1       Updated Document

Table of Contents

1 Introduction
  1.1 Purpose
2 ORACLE
  2.1 DEFINITIONS
    NORMALIZATION
    First Normal Form
    Second Normal Form
    Third Normal Form
    Boyce-Codd Normal Form
    Fourth Normal Form
    ORACLE SET OF STATEMENTS
    Data Definition Language (DDL)
    Data Manipulation Language (DML)
    Data Querying Language (DQL)
    Data Control Language (DCL)
    Transactional Control Language (TCL)
    Syntaxes
    ORACLE JOINS
    Equi Join/Inner Join
    Non-Equi Join
    Self Join
    Natural Join
    Cross Join
    Outer Join
    Left Outer Join
    Right Outer Join
    Full Outer Join
    What’s the difference between View and Materialized View?
    View
    Materialized View
    Inline view
    Indexes
    Why are hints required?
    Explain Plan
    Stored Procedure
    Packages
    Triggers
    Data files Overview
  2.2 IMPORTANT QUERIES
3 DWH CONCEPTS
  What is BI?
4 ETL-INFORMATICA
  4.1 Informatica Overview
  4.2 Informatica Scenarios
  4.3 Development Guidelines
  4.4 Performance Tips
  4.5 Unit Test Cases (UTP)
5 UNIX



1 Introduction

1.1 Purpose
The purpose of this document is to provide detailed information about DWH concepts and Informatica, based on real-time training.

2 ORACLE

2.1 DEFINITIONS
Organizations can store data on various media and in different formats, such as a hard-copy document in a filing cabinet or data stored in electronic spreadsheets or in databases.

A database is an organized collection of information.

To manage databases, you need database management systems (DBMS). A DBMS is a program that stores, retrieves, and modifies data in the database on request.

There are four main types of databases: hierarchical, network, relational, and more recently object relational (ORDBMS).


NORMALIZATION:
Some Oracle databases are modeled according to the rules of normalization, which are intended to eliminate redundancy. The rules of normalization help you understand the relationships and functional dependencies in your data.

First Normal Form:

A row is in first normal form (1NF) if all underlying domains contain atomic values only.

• Eliminate duplicative columns from the same table.
• Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key).

Second Normal Form:

An entity is in Second Normal Form (2NF) when it meets the requirement of being in First Normal Form (1NF) and additionally:

• Does not have a composite primary key, meaning that the primary key cannot be subdivided into separate logical entities.
• All the non-key columns are functionally dependent on the entire primary key.
• A row is in second normal form if, and only if, it is in first normal form and every non-key attribute is fully dependent on the key.
• 2NF eliminates functional dependencies on a partial key by putting the fields in a separate table from those that are dependent on the whole key. An example is resolving a many-to-many relationship using an intersecting entity, as sketched below.
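As an illustration (the table and column names here are hypothetical), a STUDENT_COURSE table keyed on (student_id, course_id) where course_name depends only on course_id violates 2NF; moving the course attributes into their own table and keeping an intersecting (junction) table resolves it:

CREATE TABLE courses (
  course_id    NUMBER PRIMARY KEY,
  course_name  VARCHAR2(60)
);

CREATE TABLE students (
  student_id   NUMBER PRIMARY KEY,
  student_name VARCHAR2(60)
);

-- Intersecting entity resolving the many-to-many relationship
CREATE TABLE student_courses (
  student_id NUMBER REFERENCES students(student_id),
  course_id  NUMBER REFERENCES courses(course_id),
  PRIMARY KEY (student_id, course_id)
);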

Third Normal Form:

An entity is in Third Normal Form (3NF) when it meets the requirement of being in Second Normal Form (2NF) and additionally:

• Functional dependencies on non-key fields are eliminated by putting them in a separate table. At this level, all non-key fields are dependent on the primary key.
• A row is in third normal form if, and only if, it is in second normal form and attributes that do not contribute to a description of the primary key are moved into a separate table. An example is creating look-up tables.

Boyce-Codd Normal Form:

Boyce-Codd Normal Form (BCNF) is a further refinement of 3NF. In his later writings Codd refers to BCNF as 3NF. A row is in Boyce-Codd normal form if, and only if, every determinant is a candidate key. Most entities in 3NF are already in BCNF.

Fourth Normal Form:

An entity is in Fourth Normal Form (4NF) when it meets the requirement of being in Third Normal Form (3NF) and additionally:

Has no multiple sets of multi-valued dependencies. In other words, 4NF states that no entity can have more than a single one-to-many relationship.

ORACLE SET OF STATEMENTS:

Data Definition Language :(DDL)

Create

Alter

Drop

Truncate

Data Manipulation Language (DML)

Insert

Update


Delete

Data Querying Language (DQL)

Select

Data Control Language (DCL)

Grant

Revoke

Transactional Control Language (TCL)

Commit

Rollback

Savepoint

Syntaxes:

CREATE OR REPLACE SYNONYM HZ_PARTIES FOR SCOTT.HZ_PARTIES;

CREATE DATABASE LINK CAASEDW CONNECT TO ITO_ASA IDENTIFIED BY exact123 USING 'CAASEDW';

Materialized View syntax:

CREATE MATERIALIZED VIEW EBIBDRO.HWMD_MTH_ALL_METRICS_CURR_VIEW
REFRESH COMPLETE
START WITH SYSDATE
NEXT TRUNC(SYSDATE + 1) + 4/24
WITH PRIMARY KEY
AS
select * from HWMD_MTH_ALL_METRICS_CURR_VW;

Another Method to refresh:

DBMS_MVIEW.REFRESH('MV_COMPLEX', 'C');

Case Statement:

Select NAME,
  (CASE
     WHEN (CLASS_CODE = 'Subscription')
     THEN ATTRIBUTE_CATEGORY
     ELSE TASK_TYPE
   END) TASK_TYPE,
  CURRENCY_CODE
From EMP;

Decode()

Select empname, Decode(address, 'HYD', 'Hyderabad',
  'Bang', 'Bangalore', address) as address
from emp;

Procedure:

CREATE OR REPLACE PROCEDURE update_bal (
  cust_id_IN IN NUMBER,
  amount_IN  IN NUMBER DEFAULT 1) AS
BEGIN
  UPDATE account_tbl SET amount = amount_IN WHERE cust_id = cust_id_IN;
END;

Trigger:


CREATE OR REPLACE TRIGGER EMP_AUR
AFTER UPDATE ON EMP   -- can also be BEFORE UPDATE
REFERENCING NEW AS NEW OLD AS OLD
FOR EACH ROW
DECLARE
BEGIN
  IF (:NEW.last_upd_tmst <> :OLD.last_upd_tmst) THEN
    -- Insert a record into the control table
    INSERT INTO emp_w VALUES ('wrk', SYSDATE);
  ELSE
    -- Call a procedure
    update_sysdate;
  END IF;
END;

ORACLE JOINS:

• Equi join
• Non-equi join
• Self join
• Natural join
• Cross join
• Outer join
   Left outer
   Right outer
   Full outer

Equi Join/Inner Join:

SQL> select empno,ename,job,dname,loc from emp e,dept d where e.deptno=d.deptno;

USING CLAUSE

SQL> select empno,ename,job,dname,loc from emp e join dept d using(deptno);

ON CLAUSE

SQL> select empno,ename,job,dname,loc from emp e join dept d on(e.deptno=d.deptno);

Non-Equi Join

A join which contains an operator other than '=' in the join condition.

Ex: SQL> select empno,ename,job,dname,loc from emp e,dept d where e.deptno > d.deptno;

Self Join

Joining a table to itself is called a self join.

Ex: SQL> select e1.empno,e2.ename,e1.job,e2.deptno from emp e1,emp e2 where e1.empno=e2.mgr;


Natural Join

A natural join compares all the common columns.

Ex: SQL> select empno,ename,job,dname,loc from emp natural join dept;

Cross Join

This gives the Cartesian (cross) product.

Ex: SQL> select empno,ename,job,dname,loc from emp cross join dept;

Outer Join

An outer join gives the non-matching records along with the matching records.

Left Outer Join

This displays all matching records, plus the records in the left-hand table that have no match in the right-hand table.

Ex: SQL> select empno,ename,job,dname,loc from emp e left outer join dept d on(e.deptno=d.deptno);

Or

SQL> select empno,ename,job,dname,loc from emp e,dept d where e.deptno=d.deptno(+);

Right Outer Join

This displays all matching records, plus the records in the right-hand table that have no match in the left-hand table.

Ex: SQL> select empno,ename,job,dname,loc from emp e right outer join dept d on(e.deptno=d.deptno);

Or

SQL> select empno,ename,job,dname,loc from emp e,dept d where e.deptno(+) = d.deptno;

Full Outer Join

This displays all matching records plus the non-matching records from both tables.

Ex: SQL> select empno,ename,job,dname,loc from emp e full outer join dept d on(e.deptno=d.deptno);

OR

SQL> select p.part_id, s.supplier_name
     from part p, supplier s
     where p.supplier_id = s.supplier_id (+)
     union
     select p.part_id, s.supplier_name
     from part p, supplier s
     where p.supplier_id (+) = s.supplier_id;

What’s the difference between View and Materialized View?

View:

Why Use Views?

• To restrict data access
• To make complex queries easy
• To provide data independence

A simple view is one that:

– Derives data from only one table
– Contains no functions or groups of data
– Allows DML operations through the view

A complex view is one that:

– Derives data from many tables
– Contains functions or groups of data
– Does not always allow DML operations through the view

A view has a logical existence but a materialized view has a physical existence. Moreover, a materialized view can be indexed, analyzed and so on; everything that we can do with a table can also be done with a materialized view.

We can keep aggregated data in a materialized view, and we can schedule the MV to refresh, which a table can't do. An MV can be created based on multiple tables. A simple view can be created as shown below.
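A minimal sketch of view creation, using the EMP/DEPT sample tables used elsewhere in this document (the view names are illustrative):

-- Simple view: one base table, no functions or groups
CREATE OR REPLACE VIEW emp_names_v AS
  SELECT empno, ename FROM emp;

-- Complex view: joins and aggregation; DML through it is restricted
CREATE OR REPLACE VIEW dept_sal_v AS
  SELECT d.dname, SUM(e.sal) AS total_sal
  FROM emp e, dept d
  WHERE e.deptno = d.deptno
  GROUP BY d.dname;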

Materialized View:

In a DWH, materialized views are very important because if we perform aggregate calculations on the reporting side as per the business requirement, report performance would be degraded. So to improve report performance, rather than doing the calculations and joins at reporting time, we put the same logic in the MV; then we can select the data directly from the MV without any joins or aggregations. We can also schedule the MV (materialized view) to refresh.


Inline view:

If we write a select statement in the FROM clause, that is an inline view.

Ex: Get the department-wise max salary along with the employee name and number.

Select a.ename, a.empno, b.sal, b.deptno
From EMP a, (Select max(sal) sal, deptno from EMP group by deptno) b
Where a.sal = b.sal
  and a.deptno = b.deptno;

What is the difference between view and materialized view?

View                                       Materialized view
A view has a logical existence. It         A materialized view has a physical
does not contain data.                     existence.
It is not a database object.               It is a database object.
We can perform DML operations on a         We cannot perform DML operations on a
view.                                      materialized view.
Select * from a view fetches the data      Select * from a materialized view fetches
from the base table.                       the data stored in the materialized view.
A view cannot be scheduled to refresh.     A materialized view can be scheduled to
                                           refresh. We can keep aggregated data in a
                                           materialized view, and it can be created
                                           based on multiple tables.

What is the difference between Delete, Truncate and Drop?

DELETE

The DELETE command is used to remove rows from a table. A WHERE clause can be used to remove only some rows. If no WHERE condition is specified, all rows will be removed. After performing a DELETE operation you need to COMMIT or ROLLBACK the transaction to make the change permanent or to undo it.

TRUNCATE

TRUNCATE removes all rows from a table. The operation cannot be rolled back. As such, TRUNCATE is faster and doesn't use as much undo space as a DELETE.

DROP

The DROP command removes a table from the database. All the table's rows, indexes and privileges will also be removed. The operation cannot be rolled back.
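A quick side-by-side sketch of the three commands on the EMP sample table:

DELETE FROM emp WHERE deptno = 10;  -- removes some rows; can be rolled back
ROLLBACK;                           -- undoes the delete

TRUNCATE TABLE emp;                 -- removes all rows; cannot be rolled back

DROP TABLE emp;                     -- removes the table definition itself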

Difference between Rowid and Rownum?

ROWID

A globally unique identifier for a row in a database. It is created at the time the row is inserted into a table, and destroyed when it is removed from a table. Its format is 'BBBBBBBB.RRRR.FFFF', where BBBBBBBB is the block number, RRRR is the slot (row) number, and FFFF is a file number.

ROWNUM

For each row returned by a query, the ROWNUM pseudocolumn returns a number indicating the order in which Oracle selects the row from a table or set of joined rows. The first row selected has a ROWNUM of 1, the second has 2, and so on.

You can use ROWNUM to limit the number of rows returned by a query, as in this example:

SELECT * FROM employees WHERE ROWNUM < 10;

Rowid                                      Rownum
Rowid is an Oracle internal ID that is     Rownum is a row number returned by a
allocated every time a new record is       select statement.
inserted in a table. This ID is unique
and cannot be changed by the user.
Rowid is permanent.                        Rownum is temporary.
Rowid is a globally unique identifier      The rownum pseudocolumn returns a number
for a row in a database. It is created     indicating the order in which Oracle
at the time the row is inserted into       selects the row from a table or set of
the table, and destroyed when it is        joined rows.
removed from a table.

Order of where and having:

SELECT column, group_function

FROM table

[WHERE condition]

[GROUP BY group_by_expression]

[HAVING group_condition]

[ORDER BY column];

The WHERE clause cannot be used to restrict groups. You use the HAVING clause to restrict groups.

Differences between where clause and having clause

Where clause                               Having clause
Both the where clause and the having clause can be used to filter data.
The where clause does not require          The having clause must be used with
group by.                                  group by.
The where clause applies to                The having clause is used to test a
individual rows.                           condition on the group rather than
                                           on individual rows.
The where clause is used to restrict       The having clause is used to restrict
rows.                                      groups.
Restricts a normal query.                  Restricts group by functions.
In the where clause every record is        The having clause works with aggregate
filtered individually.                     records (group by functions).

MERGE Statement

You can use the MERGE command to perform insert and update in a single command.

Ex: Merge into student1 s1
    Using (select * from student2) s2
    On (s1.no = s2.no)
    When matched then
      Update set marks = s2.marks
    When not matched then
      Insert (s1.no, s1.name, s1.marks) Values (s2.no, s2.name, s2.marks);

What is the difference between sub-query & co-related sub-query?

A sub-query is executed once for the parent statement, whereas a correlated sub-query is executed once for each row of the parent query.

Sub Query:

Example:

Select deptno, ename, sal from emp a where sal in (select sal from Grade where sal_grade='A' or sal_grade='B');

Co-related Sub-query:

Example:

Find all employees who earn more than the average salary in their department.

SELECT last_name, salary, department_id FROM employees A
WHERE salary > (SELECT AVG(salary)
                FROM employees B
                WHERE B.department_id = A.department_id
                GROUP BY B.department_id);

EXISTS:

The EXISTS operator tests for the existence of rows in the result set of the subquery.

Select dname from dept where exists
  (select 1 from EMP where dept.deptno = emp.deptno);

Sub-query                                  Co-related sub-query
A sub-query is executed once for the       A co-related sub-query is executed once
parent query.                              for each row of the parent query.
Example:                                   Example:
Select * from emp where deptno in          Select e.* from emp e where sal >=
(select deptno from dept);                 (select avg(sal) from emp a where
                                           a.deptno = e.deptno group by a.deptno);

Indexes:

1. Bitmap indexes are most appropriate for columns having low distinct values, such as GENDER, MARITAL_STATUS, and RELATION. This assumption is not completely accurate, however. In reality, a bitmap index is always advisable for systems in which data is not frequently updated by many concurrent sessions. In fact, a bitmap index on a column with 100-percent unique values (a candidate column for a primary key) can be as efficient as a B-tree index.

2. When to create an index. You should create an index if:

   • A column contains a wide range of values
   • A column contains a large number of null values
   • One or more columns are frequently used together in a WHERE clause or a join condition
   • The table is large and most queries are expected to retrieve less than 2 to 4 percent of the rows

3. By default, the index you create is a B-tree index.
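A short sketch of both index types on the sample EMP table (the index names, and the low-cardinality GENDER column, are illustrative assumptions):

-- B-tree index (the default) on a frequently joined column
CREATE INDEX emp_deptno_idx ON emp (deptno);

-- Bitmap index on a low-cardinality column (hypothetical column)
CREATE BITMAP INDEX emp_gender_bix ON emp (gender);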

Why are hints required?

It is a perfectly valid question to ask why hints should be used. Oracle comes with an optimizer that promises to optimize a query's execution plan. When this optimizer is really doing a good job, no hints should be required at all.

Sometimes, however, the characteristics of the data in the database change rapidly, so that the optimizer (or more accurately, its statistics) is out of date. In this case, a hint could help.

You should first get the explain plan of your SQL and determine what changes can be done to make the code operate without using hints if possible. However, hints such as ORDERED, LEADING, INDEX, FULL, and the various AJ and SJ hints can take a wild optimizer and give you optimal performance.

Analyzing tables: the ANALYZE statement

The ANALYZE statement can be used to gather statistics for a specific table, index or cluster. The statistics can be computed exactly, or estimated based on a specific number of rows or a percentage of rows:

ANALYZE TABLE employees COMPUTE STATISTICS;

ANALYZE TABLE employees ESTIMATE STATISTICS SAMPLE 15 PERCENT;

EXEC DBMS_STATS.gather_table_stats('SCOTT', 'EMPLOYEES');

Automatic Optimizer Statistics Collection

By default Oracle 10g automatically gathers optimizer statistics using a scheduled job called GATHER_STATS_JOB. By default this job runs within maintenance windows between 10 P.M. and 6 A.M. on week nights and all day on weekends. The job calls the DBMS_STATS.GATHER_DATABASE_STATS_JOB_PROC internal procedure, which gathers statistics for tables with either empty or stale statistics, similar to the DBMS_STATS.GATHER_DATABASE_STATS procedure using the GATHER AUTO option. The main difference is that the internal job prioritizes the work such that tables most urgently requiring statistics updates are processed first.

Hint categories:

Hints can be categorized as follows:

• ALL_ROWS
One of the hints that 'invokes' the Cost based optimizer.
ALL_ROWS is usually used for batch processing or data
warehousing systems.

(/*+ ALL_ROWS */)

• FIRST_ROWS
One of the hints that 'invokes' the Cost based optimizer.
FIRST_ROWS is usually used for OLTP systems.

(/*+ FIRST_ROWS */)

• CHOOSE
One of the hints that 'invokes' the Cost based optimizer.
This hint lets the server choose between ALL_ROWS and
FIRST_ROWS, based on the statistics gathered.

• Hints for Join Orders,

• Hints for Join Operations,

• Hints for Parallel Execution, e.g. (/*+ parallel(a,4) */); specify a degree such as 2, 4 or 16

• Additional Hints

• HASH
Hashes one table (full scan) and creates a hash index for that table. Then hashes the other table and uses the hash index to find corresponding records. Therefore not suitable for < or > join conditions.

/*+ use_hash */

Using a hint to force index use:



SELECT /*+INDEX (TABLE_NAME INDEX_NAME) */ COL1,COL2
FROM TABLE_NAME

Select /*+ use_hash */ empno from …

ORDERED- This hint forces tables to be joined in the order


specified. If you know table X has fewer rows, then ordering it
first may speed execution in a join.

PARALLEL (table, instances)This specifies the operation is to


be done in parallel.

If index is not able to create then will go for /*+ parallel(table,


8)*/-----For select and update example---in where clase like
st,not in ,>,< ,<> then we will use.

Explain Plan:

Explain plan tells us whether the query is using indexes properly, what the cost of the query is, and whether it is doing a full table scan. Based on these statistics we can tune the query.

The explain plan process stores data in the PLAN_TABLE. This table can be located in the current schema or a shared schema and is created in SQL*Plus as follows:

SQL> CONN sys/password AS SYSDBA
Connected
SQL> @$ORACLE_HOME/rdbms/admin/utlxplan.sql
SQL> GRANT ALL ON sys.plan_table TO public;
SQL> CREATE PUBLIC SYNONYM plan_table FOR sys.plan_table;
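Once the PLAN_TABLE exists, a typical usage sketch looks like this (the query itself is just an example on the EMP/DEPT sample tables):

SQL> EXPLAIN PLAN FOR
  2  SELECT e.ename, d.dname FROM emp e, dept d WHERE e.deptno = d.deptno;

SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);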

What is your tuning approach if a SQL query is taking a long time? Or how do you tune a SQL query?

If a query is taking a long time, first run the query through EXPLAIN PLAN; the explain plan process stores data in the PLAN_TABLE.

It gives us the execution plan of the query, showing, for example, whether the query is using the relevant indexes on the joining columns or whether indexes to support the query are missing.

If the joining columns don't have indexes, the query will do a full table scan; if it is a full table scan, the cost will be higher, so we create indexes on the joining columns and rerun the query, which should give better performance. We also need to analyze the tables if they were analyzed long ago. The ANALYZE statement can be used to gather statistics for a specific table, index or cluster using

ANALYZE TABLE employees COMPUTE STATISTICS;

If we still have a performance issue, then we use HINTS; a hint is nothing but a clue. We can use hints like

• ALL_ROWS
One of the hints that 'invokes' the Cost based optimizer.
ALL_ROWS is usually used for batch processing or data
warehousing systems.

(/*+ ALL_ROWS */)

• FIRST_ROWS
One of the hints that 'invokes' the Cost based optimizer.
FIRST_ROWS is usually used for OLTP systems.

(/*+ FIRST_ROWS */)

• CHOOSE
One of the hints that 'invokes' the Cost based optimizer.
This hint lets the server choose between ALL_ROWS and
FIRST_ROWS, based on the statistics gathered.

• HASH
Hashes one table (full scan) and creates a hash index for that table. Then hashes the other table and uses the hash index to find corresponding records. Therefore not suitable for < or > join conditions.

/*+ use_hash */

Hints are most useful to optimize the query performance.

Stored Procedure:

What are the differences between stored procedures and triggers?

Stored procedures are normally used for performing tasks, but triggers are normally used for tracing and auditing logs.

Stored procedures must be called explicitly by the user in order to execute, but a trigger is called implicitly based on the events defined on the table.

A stored procedure can run independently, but a trigger runs as part of a DML event on the table.

A stored procedure can be executed from a trigger, but a trigger cannot be executed from a stored procedure.

Stored procedures can have parameters, but a trigger cannot have any parameters.

Stored procedures are compiled collections of programs or SQL statements in the database. Using a stored procedure we can access and modify data present in many tables. Also, a stored procedure is not associated with any particular database object.

But triggers are event-driven special procedures which are attached to a specific database object, say a table. Stored procedures are not run automatically and have to be called explicitly by the user, but triggers get executed when the particular event associated with them gets fired.

Packages:

Packages provide a method of encapsulating related procedures, functions, and associated cursors and variables together as a unit in the database. A package can contain several procedures and functions that process related transactions. A sketch of the spec/body structure follows.
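A minimal, hypothetical spec/body sketch against the EMP sample table (the package and parameter names are illustrative):

CREATE OR REPLACE PACKAGE emp_pkg AS
  -- public declarations visible to callers
  FUNCTION get_sal (p_empno NUMBER) RETURN NUMBER;
  PROCEDURE give_raise (p_empno NUMBER, p_amount NUMBER);
END emp_pkg;
/

CREATE OR REPLACE PACKAGE BODY emp_pkg AS
  FUNCTION get_sal (p_empno NUMBER) RETURN NUMBER IS
    v_sal emp.sal%TYPE;
  BEGIN
    SELECT sal INTO v_sal FROM emp WHERE empno = p_empno;
    RETURN v_sal;
  END get_sal;

  PROCEDURE give_raise (p_empno NUMBER, p_amount NUMBER) IS
  BEGIN
    UPDATE emp SET sal = sal + p_amount WHERE empno = p_empno;
  END give_raise;
END emp_pkg;
/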

Triggers:

Oracle lets you define procedures called triggers that run implicitly when an INSERT, UPDATE, or DELETE statement is issued against the associated table.

Triggers are similar to stored procedures. A trigger stored in the database can include SQL and PL/SQL.

Types of Triggers

This section describes the different types of triggers:


• Row Triggers and Statement Triggers

• BEFORE and AFTER Triggers

• INSTEAD OF Triggers

• Triggers on System Events and User Events

Row Triggers

A row trigger is fired each time a row in the table is affected by the triggering statement. For example, if an UPDATE statement updates multiple rows of a table, a row trigger is fired once for each row affected by the UPDATE statement. If a triggering statement affects no rows, a row trigger is not run.

BEFORE and AFTER Triggers

When defining a trigger, you can specify the trigger timing: whether the trigger action is to be run before or after the triggering statement. BEFORE and AFTER apply to both statement and row triggers.

BEFORE and AFTER triggers fired by DML statements can be defined only on tables, not on views.

Difference between Trigger and Procedure

Triggers                                   Stored Procedures
Triggers need not be executed manually;    Procedures must be executed manually.
they are fired automatically.
Triggers run implicitly when an INSERT,
UPDATE, or DELETE statement is issued
against the associated table.

Differences between stored procedures and functions

Stored Procedure                           Function
May or may not return values.              Must return a value; can return
                                           additional values using OUT arguments.
Used to implement business logic.          Used mainly for calculations.
Accepts several arguments (IN, OUT         Takes IN parameters and returns a
and IN OUT).                               single value.
Mainly used to process tasks.              Mainly used to compute values.
Cannot be invoked from SQL                 Can be invoked from SQL statements,
statements, e.g. SELECT.                   e.g. SELECT.
Can affect the state of the database       Cannot affect the state of the
using commit.                              database.

Data files Overview:

A tablespace in an Oracle database consists of one or more physical datafiles. A datafile can be associated with only one tablespace and only one database.

Table Space:

Oracle stores data logically in tablespaces and physically in datafiles associated with the corresponding tablespace.

A database is divided into one or more logical storage units called tablespaces. Tablespaces are divided into logical units of storage called segments.

Control File:

A control file contains information about the associated database that is required for access by an instance, both at startup and during normal operation. Control file information can be modified only by Oracle; no database administrator or user can edit a control file.

2.2 IMPORTANT QUERIES

1. Get duplicate rows from the table:

Select empno, count(*) from EMP group by empno having count(*) > 1;

2. Remove duplicates in the table:


Delete from EMP where rowid not in (select max (rowid) from
EMP group by empno);

3. Below query transpose columns into rows.

Name   No    Add1     Add2
abc    100   hyd      bang
xyz    200   Mysore   pune

Select name, no, add1 from A
UNION
Select name, no, add2 from A;

4. Below query transpose rows into columns.

select emp_id,
       max(decode(row_id, 0, address)) as address1,
       max(decode(row_id, 1, address)) as address2,
       max(decode(row_id, 2, address)) as address3
from (select emp_id, address, mod(rownum, 3) row_id from temp order by emp_id)
group by emp_id;

Other query:

select emp_id,
       max(decode(rank_id, 1, address)) as add1,
       max(decode(rank_id, 2, address)) as add2,
       max(decode(rank_id, 3, address)) as add3
from (select emp_id, address,
             rank() over (partition by emp_id order by emp_id, address) rank_id
      from temp)
group by emp_id;

5. Rank query:

Select empno, ename, sal, r from (select empno, ename, sal, rank() over (order by sal desc) r from EMP);

6. Dense rank query:

The DENSE_RANK function acts like the RANK function except that it assigns consecutive ranks:

Select empno, ename, sal, r from (select empno, ename, sal, dense_rank() over (order by sal desc) r from emp);

7. Top 5 salaries by using rank:

Select empno, ename, sal, r from (select empno, ename, sal, dense_rank() over (order by sal desc) r from emp) where r <= 5;

Or

Select * from (select * from EMP order by sal desc) where rownum <= 5;

8. 2nd highest sal:

Select empno, ename, sal, r from (select empno, ename, sal, dense_rank() over (order by sal desc) r from EMP) where r = 2;

9. Top sal:

Select * from EMP where sal= (select max (sal) from EMP);

10. How to display alternative rows in a table?



SQL> select * from emp where (rowid, 0) in (select rowid, mod(rownum, 2) from emp);

11. Hierarchical queries

Starting at the root, walk from the top down, eliminate employee Higgins from the result, but process the child rows.

SELECT department_id, employee_id, last_name, job_id, salary
FROM employees
WHERE last_name != 'Higgins'
START WITH manager_id IS NULL
CONNECT BY PRIOR employee_id = manager_id;

3 DWH CONCEPTS

What is BI?
Business Intelligence refers to a set of methods and techniques
that are used by organizations for tactical and strategic decision
making. It leverages methods and technologies that focus on
counts, statistics and business objectives to improve business
performance.

The objective of Business Intelligence is to better understand customers and improve customer service, make the supply and distribution chain more efficient, and to identify and address business problems and opportunities quickly.

A warehouse is used for high-level data analysis. It is used for predictions, time-series analysis, financial analysis, what-if simulations, and so on. Basically it is used for better decision making.

What is a Data Warehouse?


Data Warehouse is a "Subject-Oriented, Integrated, Time-
Variant Nonvolatile collection of data in support of decision
making".

In terms of design data warehouse and data mart are almost


the same.

In general a Data Warehouse is used on an enterprise level and


a Data Marts is used on a business division/department level.

Subject Oriented:

Data that gives information about a particular subject instead of about a company's ongoing operations.

Integrated:

Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.

Time-variant:

All data in the data warehouse is identified with a particular time period.

Non-volatile:

Data is stable in a data warehouse. More data is added, but data is never removed.

What is a DataMart?

A data mart is usually sponsored at the department level and developed with a specific issue or subject in mind; a data mart is a subset of a data warehouse with a focused objective.

What is the difference between a data warehouse and a data mart?

In terms of design, a data warehouse and a data mart are almost the same.

In general, a Data Warehouse is used on an enterprise level and a Data Mart is used on a business division/department level.

A data mart only contains data specific to a particular subject area.

Difference between data mart and data warehouse

Data Mart                                  Data Warehouse
A data mart is usually sponsored at the    A data warehouse is a "subject-oriented,
department level and developed with a      integrated, time-variant, nonvolatile
specific issue or subject in mind; it is   collection of data in support of
a data warehouse with a focused            decision making".
objective.
A data mart is used on a business          A data warehouse is used on an
division/department level.                 enterprise level.
A Data Mart is a subset of data from a     A Data Warehouse is an integrated
Data Warehouse. Data Marts are built       consolidation of data from a variety of
for specific user groups.                  sources that is specially designed to
                                           support strategic and tactical decision
                                           making.
By providing decision makers with only     The main objective of a Data Warehouse
a subset of data from the Data             is to provide an integrated environment
Warehouse, privacy, performance and        and coherent picture of the business at
clarity objectives can be attained.        a point in time.

What is a factless fact table?

A fact table that contains only primary keys from the dimension tables and does not contain any measures is called a factless fact table.

What is a Schema?

A graphical representation of the data structure.

It is the first phase in the implementation of a universe.

What are the most important features of a data warehouse?

DRILL DOWN, DRILL ACROSS, graphs, pie charts, dashboards and TIME HANDLING.

Being able to drill down/drill across is the most basic requirement of an end user in a data warehouse. Drilling down most directly addresses the natural end-user need to see more detail in a result. Drill down should be as generic as possible, because there is absolutely no good way to predict a user's drill-down path.

What does it mean by grain of the star schema?

In data warehousing, grain refers to the level of detail available in a given fact table as well as to the level of detail provided by a star schema.

It is usually given as the number of records per key within the table. In general, the grain of the fact table is the grain of the star schema.

What is a star schema?

A star schema is a data warehouse schema where there is only one "fact table" and many denormalized dimension tables.

The fact table contains primary keys from all the dimension tables and other columns of additive, numeric facts.


What is a snowflake schema?

Unlike a star schema, a snowflake schema contains normalized dimension tables in a tree-like structure with many nesting levels.

A snowflake schema is easier to maintain, but queries require more joins.

What is the difference between snowflake and star schema?

Star Schema                                Snow Flake Schema
The star schema is the simplest data       A snowflake schema is a more complex
warehouse schema.                          data warehouse model than a star schema.
In a star schema each of the dimensions    In a snowflake schema at least one
is represented in a single table; there    hierarchy exists between dimension
should not be any hierarchies between      tables.
dimensions.
It contains a fact table surrounded by     It contains a fact table surrounded by
dimension tables. If the dimensions are    dimension tables. If a dimension is
de-normalized, we say it is a star         normalized, we say it is a snowflaked
schema design.                             design.
In a star schema a single join             In a snowflake schema, since there are
establishes the relationship between       relationships between the dimension
the fact table and any one of the          tables, many joins are needed to fetch
dimension tables.                          the data.
A star schema optimizes performance by     Snowflake schemas normalize dimensions
keeping queries simple and providing       to eliminate redundancy. The result is
fast response time. All the information    more complex queries and reduced query
about each level is stored in one row.     performance.
It is called a star schema because the     It is called a snowflake schema because
diagram resembles a star.                  the diagram resembles a snowflake.

What is Fact and Dimension?

A "fact" is a numeric value that a business wishes to count or


sum. A "dimension" is essentially an entry point for getting at
the facts. Dimensions are things of interest to the business.

A set of level properties that describe a specific aspect of a


business, used for analyzing the factual measures.

What is Fact Table?

A fact table in a dimensional model consists of one or more numeric facts of importance to a business. Examples of facts are as follows:

• the number of products sold
• the value of products sold
• the number of products produced
• the number of service calls received

What is Factless Fact Table?

A factless fact table captures the many-to-many relationships between dimensions, but contains no numeric or textual facts. They are often used to record events or coverage information.

Common examples of factless fact tables include:

• Identifying product promotion events (to determine promoted products that didn't sell)
• Tracking student attendance or registration events (see the sketch below)
• Tracking insurance-related accident events
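A minimal sketch of a factless fact table for the attendance example (all table and column names here are hypothetical, and the referenced dimension tables are assumed to exist):

-- Only dimension keys; the "fact" is the existence of the row itself
CREATE TABLE attendance_f (
  student_key NUMBER REFERENCES student_d (student_key),
  course_key  NUMBER REFERENCES course_d  (course_key),
  date_key    NUMBER REFERENCES date_d    (date_key),
  PRIMARY KEY (student_key, course_key, date_key)
);

-- Counting events is just counting rows
SELECT course_key, COUNT(*) AS attendance_count
FROM attendance_f
GROUP BY course_key;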

Types of facts?

There are three types of facts:

• Additive: facts that can be summed up through all of the dimensions in the fact table.

• Semi-Additive: facts that can be summed up for some of the dimensions in the fact table, but not the others.

• Non-Additive: facts that cannot be summed up for any of the dimensions present in the fact table.

What is Granularity?

Principle: create fact tables with the most granular data possible to support analysis of the business process.

In data warehousing, grain refers to the level of detail available in a given fact table as well as to the level of detail provided by a star schema.

It is usually given as the number of records per key within the table. In general, the grain of the fact table is the grain of the star schema.

Facts: facts must be consistent with the grain; all facts are at a uniform grain.

• Watch for facts of mixed granularity, such as total sales for a day mixed with monthly totals.

Dimensions: each dimension associated with a fact table must take on a single value for each fact row.

• Each dimension attribute must take on one value.
• Outriggers are the exception, not the rule.

Dimensional Model (diagram not reproduced)

What is slowly Changing Dimension?

Slowly changing dimensions refers to the change in dimensional attributes over time.

An example of a slowly changing dimension is a Resource dimension, where attributes of a particular employee change over time, such as their designation or department.

What is Conformed Dimension?

Conformed Dimensions (CD): these dimensions are built once in your model and can be reused multiple times with different fact tables. For example, consider a model containing multiple fact tables, representing different data marts. Now look for a dimension that is common to these fact tables. In this example, let's say the product dimension is common, and hence can be reused by creating shortcuts and joining the different fact tables. Some examples are the time dimension, customer dimension and product dimension.

What is Junk Dimension?

A "junk" dimension is a collection of random transactional


codes, flags and/or text attributes that are unrelated to any
particular dimension. The junk dimension is simply a structure
that provides a convenient place to store the junk attributes. A
good example would be a trade fact in a company that brokers
equity trades.

When you consolidate lots of small dimensions and instead of


having 100s of small dimensions, that will have few records in
them, cluttering your database with these mini ‘identifier’
tables, all records from all these small dimension tables are
loaded into ONE dimension table and we call this dimension
table Junk dimension table. (Since we are storing all the junk in
this one table) For example: a company might have handful of
manufacture plants, handful of order types, and so on, so forth,
and we can consolidate them in one dimension table called
junked dimension table

It’s a dimension table which is used to keep junk attributes
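A hypothetical sketch: a few tiny code sets collapsed into one junk dimension that holds every combination (names are illustrative):

CREATE TABLE order_junk_d (
  junk_key   NUMBER PRIMARY KEY,
  order_type VARCHAR2(20),   -- e.g. 'ONLINE', 'PHONE'
  rush_flag  CHAR(1),        -- 'Y' / 'N'
  gift_flag  CHAR(1)         -- 'Y' / 'N'
);

-- The fact table then carries a single junk_key instead of several
-- separate low-cardinality columns or mini-dimension tables.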

What is a Degenerate Dimension?

An item that is in the fact table but is stripped of its description, because the description belongs in a dimension table, is referred to as a degenerate dimension. Since it looks like a dimension, but is really in the fact table and has been degenerated of its description, it is called a degenerate dimension.

Degenerate dimension: a dimension which is located in the fact table is known as a degenerate dimension.

Dimensional Model:

A type of data modeling suited for data warehousing. In a dimensional model, there are two types of tables: dimension tables and fact tables. A dimension table records information on each dimension, and a fact table records all the "facts", or measures.

Data modeling

There are three levels of data modeling: conceptual, logical, and physical. This section will explain the difference among the three, the order in which each one is created, and how to go from one level to the other.

Conceptual Data Model

Features of conceptual data model include:

• Includes the important entities and the relationships among them.
• No attribute is specified.
• No primary key is specified.

At this level, the data modeler attempts to identify the highest-level relationships among the different entities.

Logical Data Model

Features of logical data model include:

• Includes all entities and relationships among them.
• All attributes for each entity are specified.
• The primary key for each entity is specified.
• Foreign keys (keys identifying the relationship between different entities) are specified.
• Normalization occurs at this level.

At this level, the data modeler attempts to describe the data in as much detail as possible, without regard to how it will be physically implemented in the database.

In data warehousing, it is common for the conceptual data model and the logical data model to be combined into a single step (deliverable).


The steps for designing the logical data model are as follows:

1. Identify all entities.

2. Specify primary keys for all entities.

3. Find the relationships between different entities.

4. Find all attributes for each entity.

5. Resolve many-to-many relationships.

6. Normalization.

Physical Data Model

Features of physical data model include:

• Specification of all tables and columns.
• Foreign keys are used to identify relationships between tables.
• Denormalization may occur based on user requirements.
• Physical considerations may cause the physical data model to be quite different from the logical data model.

At this level, the data modeler will specify how the logical data model will be realized in the database schema.

The steps for physical data model design are as follows:

1. Convert entities into tables.

2. Convert relationships into foreign keys.

3. Convert attributes into columns.

Reference: http://www.learndatamodeling.com/dm_standard.htm

Modeling is an efficient and effective way to represent the organization's needs; it provides information in a graphical way to the members of an organization to understand and communicate the business rules and processes. Business Modeling and Data Modeling are the two important types of modeling.

The differences between a logical data model and a physical data model are shown below.

Logical vs Physical Data Modeling

Logical Data Model                         Physical Data Model
Represents business information and        Represents the physical implementation
defines business rules                     of the model in a database
Entity                                     Table
Attribute                                  Column
Primary Key                                Primary Key Constraint
Alternate Key                              Unique Constraint or Unique Index
Inversion Key Entry                        Non Unique Index
Rule                                       Check Constraint, Default Value
Relationship                               Foreign Key
Definition                                 Comment


Below is the simple data model (diagram not reproduced).

Below is the SQ for the project dimension (screenshot not reproduced).

EDIII – Logical Design

(Entity-relationship diagram of the logical design, covering ACW_DF_FEES_STG, ACW_DF_FEES_F, ACW_ORGANIZATION_D, ACW_USERS_D, ACW_PCBA_APPROVAL_STG, ACW_PCBA_APPROVAL_F, ACW_DF_APPROVAL_STG, ACW_DF_APPROVAL_F, ACW_PART_TO_PID_D, ACW_PRODUCTS_D, ACW_SUPPLY_CHANNEL_D and EDW_TIME_HIERARCHY; diagram not reproduced.)

EDII – Physical Design

(Corresponding physical design diagram showing the column datatypes for the same tables; diagram not reproduced.)

Types of SCD Implementation:

Type 1 Slowly Changing Dimension

In a Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept.

In our example, recall we originally have the following table:

Customer Key   Name        State
1001           Christina   Illinois

After Christina moved from Illinois to California, the new information replaces the original record, and we have the following table:

Customer Key   Name        State
1001           Christina   California

Advantages:

- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.

Disadvantages:

- All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before.

Usage:

About 50% of the time.

When to use Type 1:

Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes. A minimal sketch follows.
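A minimal Type 1 sketch (the CUSTOMER_D table name is hypothetical): the dimension row is simply overwritten in place.

UPDATE customer_d
SET state = 'California'
WHERE customer_key = 1001;  -- the old value 'Illinois' is lost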

Type 2 Slowly Changing Dimension

In a Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key.

In our example, recall we originally have the following table:

Customer Key   Name        State
1001           Christina   Illinois

After Christina moved from Illinois to California, we add the new information as a new row into the table:

Customer Key   Name        State
1001           Christina   Illinois
1005           Christina   California

Advantages:

- This allows us to accurately keep all historical information.

Disadvantages:

- This will cause the size of the table to grow fast. In cases
where the number of rows for the table is very high to start
with, storage and performance can become a concern.

- This necessarily complicates the ETL process.

Usage:

About 50% of the time.

When to use Type 2:

Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes. A SQL sketch follows.
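A simplified Type 2 sketch (the CUSTOMER_D table and CUSTOMER_KEY_SEQ sequence are hypothetical; real implementations usually also carry effective/expiry dates or a current-row flag):

-- Keep the old row; insert a new row with a new surrogate key
-- carrying the changed attribute value.
INSERT INTO customer_d (customer_key, name, state)
VALUES (customer_key_seq.NEXTVAL, 'Christina', 'California');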

Type 3 Slowly Changing Dimension

Page 49 of 115

DWH Training -9739096158


In Type 3 Slowly Changing Dimension, there will be two
columns to indicate the particular attribute of interest, one
indicating the original value, and one indicating the current
value. There will also be a column that indicates when the
current value becomes active.

In our example, recall we originally have the following table:

Customer Key   Name        State
1001           Christina   Illinois

To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:

• Customer Key
• Name
• Original State
• Current State
• Effective Date

After Christina moved from Illinois to California, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2003):

Customer Key   Name        Original State   Current State   Effective Date
1001           Christina   Illinois         California      15-JAN-2003

Advantages:

- This does not increase the size of the table, since new information is updated in place.

- This allows us to keep some part of history.

Disadvantages:

- Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information will be lost.

Usage:

Type 3 is rarely used in actual practice.

When to use Type 3:

Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.

What is a staging area and why do we need it in a DWH?

If the target and source databases are different and the target table volume is high (it contains millions of records), then without a staging table we would need to design the Informatica mapping with a lookup to find out whether each record exists in the target table. Since the target has huge volumes, it is costly to build the lookup cache, which hurts performance.

If we create staging tables in the target database, we can simply do an outer join in the source qualifier to determine insert/update; this approach gives good performance (see the sketch after this list).

It avoids a full table scan to determine inserts/updates on the target. Also, we can create indexes on the staging tables; since these tables are designed for a specific application, this will not impact any other schemas/users.

While processing flat files into the data warehouse we can perform cleansing. Data cleansing, also known as data scrubbing, is the process of ensuring that a set of data is correct and accurate. During data cleansing, records are checked for accuracy and consistency.

• Since it is a one-to-one mapping from ODS to staging, we do truncate and reload.
• We can create indexes in the staging state, to make our source qualifier perform best.
• If we have the staging area, there is no need to rely on an Informatica transformation to know whether the record exists or not.
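A sketch of the staging-based insert/update detection (the table and column names are hypothetical); this is the kind of outer join one would place in the source qualifier:

-- Rows with no match in the target are inserts; matched rows
-- whose attributes differ are updates.
SELECT s.cust_id,
       s.cust_name,
       CASE WHEN t.cust_id IS NULL THEN 'INSERT' ELSE 'UPDATE' END AS load_flag
FROM   stg_customer s, dim_customer t
WHERE  s.cust_id = t.cust_id (+);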

Data cleansing

Weeding out unnecessary or unwanted things (characters and spaces etc.) from incoming data to make it more meaningful and informative.

Data merging

Data can be gathered from heterogeneous systems and put together.

Data scrubbing

Data scrubbing is the process of fixing or eliminating individual pieces of data that are incorrect, incomplete or duplicated before the data is passed to the end user.

Data scrubbing is aimed at more than eliminating errors and redundancy. The goal is also to bring consistency to various data sets that may have been created with different, incompatible business rules.

ODS (Operational Data Store):

My understanding of ODS is that it is a replica of the OLTP system; the need for it is to reduce the burden on the production system (OLTP) while fetching data for loading targets. Hence it is a mandatory requirement for every warehouse.

So every day do we transfer data to ODS from OLTP to keep it up to date?

OLTP is a sensitive database; it should not be hit with multiple heavy select statements, as that may impact its performance, and if something goes wrong while fetching data from OLTP to the data warehouse it will directly impact the business.

ODS is the replication of OLTP.

ODS is usually refreshed through some Oracle jobs, and it enables management to gain a consistent picture of the business.

What is a surrogate key?

A surrogate key is a substitution for the natural primary key. It is a unique identifier or number (normally created by a database sequence generator) for each record of a dimension table that can be used as the primary key of the table.

A surrogate key is useful because natural keys may change.
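A minimal Oracle sketch, assuming a hypothetical DIM_CUSTOMER dimension:

    -- Sequence that generates the surrogate key values
    CREATE SEQUENCE dim_customer_sk_seq START WITH 1 INCREMENT BY 1;

    -- The surrogate key is the primary key; the natural key is kept alongside
    INSERT INTO dim_customer (customer_key, customer_id, cust_name, state)
    VALUES (dim_customer_sk_seq.NEXTVAL, 'C-1001', 'Christina', 'Illinois');

Even if the natural key CUSTOMER_ID is later reissued or reformatted in the source, the warehouse joins on CUSTOMER_KEY stay valid.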

What is the difference between a primary key and a surrogate key?

A primary key is a special constraint on a column or set of columns. A primary key constraint ensures that the column(s) so designated have no NULL values, and that every value is unique. Physically, a primary key is implemented by the database system using a unique index, and all the columns in the primary key must have been declared NOT NULL. A table may have only one primary key, but it may be composite (consist of more than one column).

A surrogate key is any column or set of columns that can be declared as the primary key instead of a "real" or natural key. Sometimes there can be several natural keys that could be declared as the primary key, and these are all called candidate keys. So a surrogate is a candidate key. A table could actually have more than one surrogate key, although this would be unusual. The most common type of surrogate key is an incrementing integer, such as an auto-increment column in MySQL, a sequence in Oracle, or an identity column in SQL Server.



4 ETL-INFORMATICA

4.1 Informatica Overview

Informatica is a powerful Extraction, Transformation, and Loading tool that has been deployed at GE Medical Systems for data warehouse development in the Business Intelligence Team. Informatica comes with the following clients to perform various tasks.

• Designer – used to develop transformations/mappings


• Workflow Manager / Workflow Monitor replace the
Server Manager - used to create sessions / workflows/
worklets to run, schedule, and monitor mappings for data
movement
• Repository Manager – used to maintain folders, users,
permissions, locks, and repositories.
• Integration Services – the “workhorse” of the domain.
Informatica Server is the component responsible for the
actual work of moving data according to the mappings
developed and placed into operation. It contains several
distinct parts such as the Load Manager, Data
Transformation Manager, Reader and Writer.
• Repository Services- Informatica client tools and
Informatica Server connect to the repository database over
the network through the Repository Server.

Informatica Transformations:

Mapping: A mapping is the Informatica object which contains a set of transformations, including source and target. It looks like a pipeline.


Mapplet:

A mapplet is a set of reusable transformations. We can use a mapplet in any mapping within the folder.

A mapplet can be active or passive depending on the transformations in the mapplet. Active mapplets contain one or more active transformations. Passive mapplets contain only passive transformations.

When you add transformations to a mapplet, keep the following restrictions in mind:

• If you use a Sequence Generator transformation, you must use a reusable Sequence Generator transformation.

• If you use a Stored Procedure transformation, you must configure the Stored Procedure Type to be Normal.

• You cannot include the following objects in a mapplet:

o Normalizer transformations

o COBOL sources

o XML Source Qualifier transformations

o XML sources

o Target definitions

o Other mapplets

• The mapplet contains Input transformations and/or source definitions with at least one port connected to a transformation in the mapplet.

• The mapplet contains at least one Output transformation with at least one port connected to a transformation in the mapplet.

Input Transformation: Input transformations are used to create a logical interface to a mapplet in order to allow data to pass into the mapplet.

Output Transformation: Output transformations are used to create a logical interface from a mapplet in order to allow data to pass out of a mapplet.

System Variables

$$$SessStartTime returns the initial system date value on the machine hosting the Integration Service when the server initializes a session. $$$SessStartTime returns the session start time as a string value. The format of the string depends on the database you are using.

Session: A session is a set of instructions that tells the Informatica Server how to move data from sources to targets.

Workflow: A workflow is a set of instructions that tells the Informatica Server how to execute tasks such as sessions, email notifications and commands. In a workflow, multiple sessions can be included to run in a parallel or sequential manner.

Source Definition: The Source Definition is used to logically represent a database table or flat file.

Target Definition: The Target Definition is used to logically represent a database table or file in the Data Warehouse / Data Mart.

Aggregator: The Aggregator transformation is used to perform aggregate calculations on a group basis.

Expression: The Expression transformation is used to perform arithmetic calculations on a row-by-row basis, to convert a string to an integer, and to concatenate two columns.

Filter: The Filter transformation is used to filter the data based on a single condition and pass it through to the next transformation.

Router: The Router transformation is used to route the data based on multiple conditions and pass it through to the next transformations.

It has three groups:

1) Input group

2) User defined group

3) Default group

Joiner: The Joiner transformation is used to join two sources residing in different databases or different locations, like a flat file and an Oracle source, or two relational tables existing in different databases.

Source Qualifier: The Source Qualifier transformation is used to describe in SQL the method by which data is to be retrieved from a source application system, and is also used to join two relational sources residing in the same database.

What is Incremental Aggregation?

A. Whenever a session is created for a mapping with an Aggregator transformation, the session option for Incremental Aggregation can be enabled. When PowerCenter performs incremental aggregation, it passes new source data through the mapping and uses historical cache data to perform new aggregation calculations incrementally.

Lookup: The Lookup transformation is used in a mapping to look up data in a flat file or a relational table, view, or synonym.

Two types of lookups:

1) Connected

2) Unconnected

Differences between connected lookup and unconnected lookup

Connected Lookup vs. Unconnected Lookup:

1. A connected lookup is connected to the pipeline and receives input values from the pipeline. An unconnected lookup is not connected to the pipeline and receives input values from the result of a :LKP expression in another transformation, via arguments.

2. A connected lookup cannot be reused more than once in a mapping. An unconnected lookup can be called more than once within the mapping.

3. A connected lookup can return multiple columns from the same row. An unconnected lookup designates one return port (R) and returns one column from each row.

4. A connected lookup can be configured to use a dynamic cache. An unconnected lookup cannot.

5. A connected lookup passes multiple output values to another transformation (link lookup/output ports to another transformation). An unconnected lookup passes one output value: the lookup/output/return port passes the value to the transformation calling the :LKP expression.

6. A connected lookup uses a dynamic or static cache. An unconnected lookup uses a static cache.

7. A connected lookup supports user-defined default values. An unconnected lookup does not.

8. In a connected lookup, the cache includes the lookup source columns in the lookup condition and the lookup source columns that are output ports. In an unconnected lookup, the cache includes all lookup/output ports in the lookup condition and the lookup/return port.

Lookup Caches:

When configuring a lookup cache, you can specify any of the following options:

• Persistent cache

• Recache from lookup source

• Static cache

• Dynamic cache

• Shared cache

Dynamic cache: When you use a dynamic cache, the PowerCenter Server updates the lookup cache as it passes rows to the target.

If you configure a Lookup transformation to use a dynamic cache, you can only use the equality operator (=) in the lookup condition.

The NewLookupRow port is enabled automatically.

NewLookupRow Value   Description

0                    The PowerCenter Server does not update or insert the row in the cache.

1                    The PowerCenter Server inserts the row into the cache.

2                    The PowerCenter Server updates the row in the cache.

Static cache: This is the default cache; the PowerCenter Server does not update the lookup cache as it passes rows to the target.

Persistent cache: If the lookup table does not change between sessions, configure the Lookup transformation to use a persistent lookup cache. The PowerCenter Server then saves and reuses cache files from session to session, eliminating the time required to read the lookup table.

Differences between dynamic lookup and static lookup

Dynamic Lookup Cache vs. Static Lookup Cache:

1. With a dynamic lookup, the cache memory is refreshed as soon as a record is inserted, updated or deleted through the lookup during the run. With a static lookup, the cache memory is not refreshed even though records are inserted or updated in the lookup table; it is refreshed only in the next session run.

2. When we configure a lookup transformation to use a dynamic lookup cache, we can only use the equality operator in the lookup condition, and the NewLookupRow port is enabled automatically. The static cache is the default cache.

3. The best example of where we need a dynamic cache: suppose the first record and the last record in the source are for the same key, but with a change in the address. The mapping has to insert the first record and update the target with the last record. With a dynamic cache, the first record is inserted into the cache as it passes through, so when the last record reaches the lookup it finds the match and is routed to the update flow. With a static cache, the cache is built once before any rows are loaded, so the last record still gets a null from the cache and is wrongly routed to the insert flow even though it should go to the update flow.

Normalizer: The Normalizer transformation is used to generate multiple records from a single record based on columns (it transposes column data into rows).

We can use the Normalizer transformation to process COBOL sources instead of a Source Qualifier.

Rank: The Rank transformation allows you to select only the top or bottom rank of data. You can use a Rank transformation to return the largest or smallest numeric value in a port or group.

The Designer automatically creates a RANKINDEX port for each Rank transformation.

Sequence Generator: The Sequence Generator transformation is used to generate numeric key values in sequential order.

Stored Procedure: The Stored Procedure transformation is used to execute externally stored database procedures and functions. It is used to perform database-level operations.

Sorter: The Sorter transformation is used to sort data in ascending or descending order according to a specified sort key. You can also configure the Sorter transformation for case-sensitive sorting, and specify whether the output rows should be distinct. The Sorter transformation is an active transformation. It must be connected to the data flow.

Union Transformation:

The Union transformation is a multiple input group transformation that you can use to merge data from multiple pipelines or pipeline branches into one pipeline branch. It merges data from multiple sources similar to the UNION ALL SQL statement. Like the UNION ALL statement, the Union transformation does not remove duplicate rows. Input groups should have a similar structure.

Update Strategy: The Update Strategy transformation is used to indicate the DML statement.

We can implement the update strategy at two levels:

1) Mapping level

2) Session level

Session-level properties will override the mapping-level properties.

Aggregator Transformation:

Transformation type:

Active

Connected

The Aggregator transformation performs aggregate calculations, such as averages and sums. The Aggregator transformation is unlike the Expression transformation, in that you use the Aggregator transformation to perform calculations on groups. The Expression transformation permits you to perform calculations on a row-by-row basis only.

Components of the Aggregator Transformation:

The Aggregator is an active transformation, changing the number of rows in the pipeline. The Aggregator transformation has the following components and options:

Aggregate cache: The Integration Service stores data in the aggregate cache until it completes aggregate calculations. It stores group values in an index cache and row data in the data cache.

Group by port: Indicates how to create groups. The port can be any input, input/output, output, or variable port. When grouping data, the Aggregator transformation outputs the last row of each group unless otherwise specified.

Sorted input: Select this option to improve session performance. To use sorted input, you must pass data to the Aggregator transformation sorted by group by port, in ascending or descending order.

Aggregate Expressions:

The Designer allows aggregate expressions only in the Aggregator transformation. An aggregate expression can include conditional clauses and non-aggregate functions. It can also include one aggregate function nested within another aggregate function, such as:

MAX(COUNT(ITEM))

The result of an aggregate expression varies depending on the group by ports used in the transformation.

Aggregate Functions

Use the following aggregate functions within an Aggregator transformation. You can nest one aggregate function within another aggregate function.

The transformation language includes the following aggregate functions:

AVG, COUNT, FIRST, LAST, MAX, MEDIAN, MIN, PERCENTILE, SUM, VARIANCE and STDDEV

When you use any of these functions, you must use them in an expression within an Aggregator transformation.

Performance Tips in Aggregator

Use sorted input to increase the mapping performance, but the data must be sorted before being sent to the Aggregator transformation.

Filter the data before aggregating it.

If you use a Filter transformation in the mapping, place the transformation before the Aggregator transformation to reduce unnecessary aggregation.

SQL Transformation

Transformation type:

Active/Passive

Connected

The SQL transformation processes SQL queries midstream in a pipeline. You can insert, delete, update, and retrieve rows from a database. You can pass the database connection information to the SQL transformation as input data at run time. The transformation processes external SQL scripts or SQL queries that you create in an SQL editor. The SQL transformation processes the query and returns rows and database errors.

For example, you might need to create database tables before adding new transactions. You can create an SQL transformation to create the tables in a workflow. The SQL transformation returns database errors in an output port. You can configure another workflow to run if the SQL transformation returns no errors.

When you create an SQL transformation, you configure the following options:

Mode. The SQL transformation runs in one of the following modes:

Script mode. The SQL transformation runs ANSI SQL scripts that are externally located. You pass a script name to the transformation with each input row. The SQL transformation outputs one row for each input row.

Query mode. The SQL transformation executes a query that you define in a query editor. You can pass strings or parameters to the query to define dynamic queries or change the selection parameters. You can output multiple rows when the query has a SELECT statement.

Database type. The type of database the SQL transformation connects to.

Connection type. Pass database connection information to the SQL transformation or use a connection object.

Script Mode

An SQL transformation configured for script mode has the following default ports:

Port          Type     Description

ScriptName    Input    Receives the name of the script to execute for the current row.

ScriptResult  Output   Returns PASSED if the script execution succeeds for the row; otherwise contains FAILED.

ScriptError   Output   Returns errors that occur when a script fails for a row.

Java Transformation Overview

Transformation type:

Active/Passive

Connected

The Java transformation provides a simple native programming interface to define transformation functionality with the Java programming language. You can use the Java transformation to quickly define simple or moderately complex transformation functionality without advanced knowledge of the Java programming language or an external Java development environment.

For example, you can define transformation logic to loop through input rows and generate multiple output rows based on a specific condition. You can also use expressions, user-defined functions, unconnected transformations, and mapping variables in the Java code.

Transaction Control Transformation

Transformation type:

Active

Connected

PowerCenter lets you control commit and roll back transactions based on a set of rows that pass through a Transaction Control transformation. A transaction is the set of rows bound by commit or roll back rows. You can define a transaction based on a varying number of input rows. You might want to define transactions based on a group of rows ordered on a common key, such as employee ID or order entry date.

In PowerCenter, you define transaction control at the following levels:

Within a mapping. Within a mapping, you use the Transaction Control transformation to define a transaction. You define transactions using an expression in a Transaction Control transformation. Based on the return value of the expression, you can choose to commit, roll back, or continue without any transaction changes.

Within a session. When you configure a session, you configure it for user-defined commit. You can choose to commit or roll back a transaction if the Integration Service fails to transform or write any row to the target.

When you run the session, the Integration Service evaluates the
expression for each row that enters the transformation. When it
evaluates a commit row, it commits all rows in the transaction
to the target or targets. When the Integration Service evaluates
a roll back row, it rolls back all rows in the transaction from the
target or targets.

If the mapping has a flat file target you can generate an output
file each time the Integration Service starts a new transaction.
You can dynamically name each target flat file.



What is the difference between joiner and lookup

1. On multiple matches, a joiner returns all matching records, whereas a lookup returns either the first record, the last record, any value, or an error value.

2. A joiner cannot be configured to use a persistent cache, shared cache, uncached mode or dynamic cache, whereas a lookup can.

3. We cannot override the query in a joiner, whereas in a lookup we can override the query to fetch the data from multiple tables.

4. We can perform an outer join in a Joiner transformation, but we cannot perform an outer join in a Lookup transformation.

5. We cannot use relational operators (i.e. <, >, <= and so on) in a Joiner transformation, whereas in a lookup we can.

What is the difference between source qualifier and lookup

1. A source qualifier returns all the matching records, whereas in a lookup we can restrict whether to return the first value, last value or any value.

2. In a source qualifier there is no concept of cache, whereas a lookup is built around the cache concept.

3. When both the source and the lookup table are in the same database we can use a source qualifier; when the source and lookup table exist in different databases we need to use a lookup.

Have you done any performance tuning in Informatica?

1) Yes. One of my mappings was taking 3-4 hours to process 40 million rows into a staging table. There was no transformation inside the mapping; it was a 1-to-1 mapping, so there was nothing to optimize in the mapping itself. I created session partitions using key range on the effective date column. This improved performance a lot: rather than 4 hours, it ran in 30 minutes for the entire 40 million rows. With partitions, the DTM creates multiple reader and writer threads.

2) There was one more scenario where I got very good performance at the mapping level. Rather than using a lookup transformation, if we can do an outer join in the source qualifier query override, this gives good performance when both the lookup table and the source are in the same database. If the lookup table has huge volumes, then creating the cache is costly.

3) Optimizing the mapping to use fewer transformations also always gives good performance.

4) If any mapping takes a long time to execute, first we need to look into the source and target statistics in the monitor for the throughput, and find out where exactly the bottleneck is by looking at the busy percentage in the session log; this tells us which transformation is taking more time. If the source query is the bottleneck, the session log will show "query issued to database" at the end, which means there is a performance issue in the source query and we need to tune it.

Informatica Session Log shows busy percentage

If we look into session logs, they show the busy percentage; based on that we need to find out where the bottleneck is.

***** RUN INFO FOR TGT LOAD ORDER GROUP [1], CONCURRENT SET [1] ****

Thread [READER_1_1_1] created for [the read stage] of partition point [SQ_ACW_PCBA_APPROVAL_STG] has completed: Total Run Time = [7.193083] secs, Total Idle Time = [0.000000] secs, Busy Percentage = [100.000000]

Thread [TRANSF_1_1_1] created for [the transformation stage] of partition point [SQ_ACW_PCBA_APPROVAL_STG] has completed. The total run time was insufficient for any meaningful statistics.

Thread [WRITER_1_*_1] created for [the write stage] of partition point [ACW_PCBA_APPROVAL_F1, ACW_PCBA_APPROVAL_F] has completed: Total Run Time = [0.806521] secs, Total Idle Time = [0.000000] secs, Busy Percentage = [100.000000]

Suppose I have to load 40 lakh (4 million) records into the target table and the workflow is taking about 10-11 hours to finish. I've already increased the cache size to 128MB. There are no joiners, just lookups and expression transformations.

Ans:

(1) If the lookups have many records, try creating indexes on the columns used in the lookup condition, and try increasing the lookup cache. If this doesn't increase the performance, and the target has any indexes, disable them in the target pre-load and enable them in the target post-load.

(2) Three things you can do with respect to it:

1. Increase the commit interval (by default it is 10000).

2. Use bulk mode instead of normal mode in case your target doesn't have primary keys, or use pre- and post-session SQL to implement the same (depending on the business requirement).

3. Use key partitioning to load the data faster.

(3) If your target contains key constraints and indexes, they slow the loading of data. To improve the session performance in this case, drop the constraints and indexes before you run the session and rebuild them after completion of the session.

What is Constraint Based Loading in Informatica?

By setting the Constraint Based Loading property at session level in the Configuration tab, we can load the data into parent and child relational tables (primary key/foreign key).

Generally what it does is load the data first into the parent table and then into the child table.

What is the use of shortcuts in Informatica?

If we copy source definitions, target definitions or mapplets from a Shared folder to any other folder, they become shortcuts.

Let's assume we have imported some source and target definitions in a shared folder, and we are using those source and target definitions in other folders as shortcuts in some mappings.

If any modifications occur in the backend (database) structure, like adding new columns or dropping existing columns, either in the source or the target, and we reimport into the shared folder, those changes are automatically reflected in all folders/mappings wherever those source or target definitions were used.

Target Update Override

If we don't have a primary key on the target table, we can perform updates using the Target Update Override option. By default, the Integration Service updates target tables based on key values. However, you can override the default UPDATE statement for each target in a mapping. You might want to update the target based on non-key columns.

Overriding the WHERE Clause

You can override the WHERE clause to include non-key columns. For example, you might want to update records for employees named Mike Smith only. To do this, you edit the WHERE clause as follows:

UPDATE T_SALES
SET    DATE_SHIPPED = :TU.DATE_SHIPPED,
       TOTAL_SALES  = :TU.TOTAL_SALES
WHERE  EMP_NAME = :TU.EMP_NAME AND EMP_NAME = 'MIKE SMITH'

If you modify the UPDATE portion of the statement, be sure to use :TU to specify ports.

How do you perform incremental logic or Delta or CDC?

Incremental means: suppose today we processed 100 records; for tomorrow's run we need to extract only the records inserted or updated after the previous run, based on the last-updated timestamp. This process is called incremental or delta load.

Approach_1: Using SETMAXVARIABLE()

1) First create a mapping variable ($$Pre_sess_max_upd) and assign an initial value of an old date (01/01/1940).

2) Then override the source qualifier query to fetch only LAST_UPD_DATE >= $$Pre_sess_max_upd (mapping variable).

3) In the expression, assign the max LAST_UPD_DATE value to $$Pre_sess_max_upd (mapping variable) using SETMAXVARIABLE.

4) Because it is a variable, it stores the max LAST_UPD_DATE value in the repository; in the next run our source qualifier query will fetch only the records updated or inserted after the previous run.
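A minimal sketch of the override and the expression logic, assuming a hypothetical source table SRC_ORDERS with an audit column LAST_UPD_DATE:

    -- Source qualifier SQL override: pick up only new/changed rows
    SELECT *
    FROM   src_orders
    WHERE  last_upd_date >= TO_DATE('$$Pre_sess_max_upd', 'MM/DD/YYYY HH24:MI:SS')

In the downstream Expression transformation, a call such as SETMAXVARIABLE($$Pre_sess_max_upd, LAST_UPD_DATE) keeps the variable at the highest timestamp seen, which the repository persists for the next run.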

Approach_2: Using a parameter file

1. First create a mapping parameter ($$Pre_sess_start_tmst) and assign an initial value of an old date (01/01/1940) in the parameter file.

2. Then override the source qualifier query to fetch only LAST_UPD_DATE >= $$Pre_sess_start_tmst (mapping parameter).

3. Update the mapping parameter ($$Pre_sess_start_tmst) value in the parameter file using a shell script or another mapping after the first session completes successfully.

4. Because it is a mapping parameter, every time we need to update the value in the parameter file after completion of the main session.

Approach_3: Using Oracle control tables

1. First we create two control tables, cont_tbl_1 and cont_tbl_2, with the structure (session_st_time, wf_name).

2. Then insert one record into each table with session_st_time = 1/1/1940 and the workflow name.

3. Create two stored procedures. The first updates cont_tbl_1 with the session start time; set its stored procedure type property to Source Pre-load.

4. For the second stored procedure, set the stored procedure type property to Target Post-load. This procedure updates the session_st_time in cont_tbl_2 from cont_tbl_1.

5. Then override the source qualifier query to fetch only LAST_UPD_DATE >= (SELECT session_st_time FROM cont_tbl_2 WHERE wf_name = 'actual workflow name').
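A minimal sketch of the control tables and the override, with hypothetical names:

    -- Control tables: one updated at source pre-load, one at target post-load
    CREATE TABLE cont_tbl_1 (wf_name VARCHAR2(100), session_st_time DATE);
    CREATE TABLE cont_tbl_2 (wf_name VARCHAR2(100), session_st_time DATE);

    -- Source qualifier override: read everything changed since the last
    -- successfully completed run recorded in cont_tbl_2
    SELECT *
    FROM   src_orders
    WHERE  last_upd_date >= (SELECT session_st_time
                             FROM   cont_tbl_2
                             WHERE  wf_name = 'wf_actual_workflow_name');

Because cont_tbl_2 is only refreshed at target post-load, a failed run leaves the old timestamp in place and the next run safely re-extracts the same window.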

SCD Type-II Effective-Date Approach

• We have a dimension in the current project called the resource dimension. Here we are maintaining history to keep track of SCD changes.

• To maintain history in this slowly changing (resource) dimension, we followed the SCD Type-II effective-date approach.

• The resource dimension structure would be eff-start-date, eff-end-date, the surrogate key and the source columns.

• Whenever I insert into the dimension, I populate eff-start-date with sysdate, eff-end-date with a future date, and the surrogate key with a sequence number.

• If the record is already present in the dimension but there is a change in the source data, then what I need to do is:

• Update the previous record's eff-end-date with sysdate and insert the source data as a new record (a SQL sketch of this expire-and-insert step follows this list).
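A minimal sketch of that expire-and-insert step, assuming a hypothetical DIM_RESOURCE table keyed on the natural key RESOURCE_ID:

    -- Expire the currently active row for this resource
    UPDATE dim_resource
    SET    eff_end_date = SYSDATE
    WHERE  resource_id  = 'R-1001'
    AND    eff_end_date = DATE '9999-12-31';

    -- Insert the new version as the active row
    INSERT INTO dim_resource
           (resource_sk, resource_id, resource_name, eff_start_date, eff_end_date)
    VALUES (dim_resource_seq.NEXTVAL, 'R-1001', 'New Name',
            SYSDATE, DATE '9999-12-31');

The far-future eff_end_date marks the active row, which is what the lookup override filters on while building the cache.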

Informatica design to implement the SCD Type-II effective-date approach

• Once we fetch the record from the source qualifier, we send it to a lookup to find out whether the record is present in the target or not, based on the source primary key column.

• Once we find the match in the lookup, we take the SCD columns from the lookup and the source columns from the SQ into an Expression transformation.

• In the Lookup transformation we need to override the lookup query to fetch only Active records from the dimension while building the cache.

• In the Expression transformation we compare the source data with the lookup return data.

• If the source and target data are the same, we set a flag of 'S'.

• If the source and target data are different, we set a flag of 'U'.

• If the source data does not exist in the target, the lookup returns a null value, and we flag it as 'I'.

• Based on the flag values, in the Router we route the data into the insert and update flows.

• If flag = 'I' or 'U' we pass it to the insert flow.

• If flag = 'U' we also pass the record to the eff-end-date update flow.

• Whenever we insert, we pass the sequence value to the surrogate key.

• Whenever we update, we update the eff-end-date column based on the surrogate key value returned by the lookup.

Complex Mapping

• We have an order file requirement: every day the source system places a file with a timestamp in its name on the Informatica server.

• We have to process the current date's file through Informatica.

• The source file directory contains files older than 30 days, each with timestamps.

• For this requirement, if I hardcode the timestamped source file name, it will process the same file every day.

• So what I did here is create $InputFilename as a session variable for the source file name.

• Then I use the parameter file to supply the values to the session variable ($InputFilename).

• To update this parameter file, I created one more mapping.

• This mapping updates the parameter file with the timestamp appended to the file name.

• I make sure to run this parameter-file-update mapping before my actual mapping.

How to handle errors in Informatica?

• We have a source with numerator and denominator values, and we need to calculate num/deno when populating the target.

• If deno = 0, I should not load this record into the target table.

• We send those records to a flat file after completion of the 1st session run, and a shell script checks the file size.

• If the file size is greater than zero, it sends an email notification to the source system POC (point of contact) along with the deno-zero record file and an appropriate email subject and body.

• If the file size <= 0, that means there are no records in the flat file; in this case the shell script will not send any email notification.

• Or:

• We are expecting a not-null value for one of the source columns.

• If it is null, that means it is an error record.

• We can use the above approach for error handling.

Why do we need a source qualifier?

Simply put, it performs a select statement.

A select statement fetches the data in the form of rows.

The source qualifier selects the data from the source table; it identifies the records to read from the source.

Parameter file: it supplies the values to session-level variables and mapping-level variables.

Variables are of two types:

• Session level variables

• Mapping level variables

Session level variables are of four types:

• $DBConnection_Source

• $DBConnection_Target

• $InputFile

• $OutputFile

Mapping level variables are of two types:



• Variable

• Parameter

What is the difference between mapping-level and session-level variables?

Mapping-level variables always start with $$.

Session-level variables always start with $.

Flat File

A flat file is a collection of data in a file in a specific format.

Informatica supports two types of files:

• Delimited

• Fixed Width

For a delimited file we need to specify the separator.

For fixed width we need to know the format first, i.e. how many characters to read for a particular column.

For delimited files it is also necessary to know the structure of the file, in order to know about the headers.

If the file contains a header, then in the definition we need to skip the first row.

List file:

If you want to process multiple files with the same structure, we don't need multiple mappings and multiple sessions.

We can use one mapping and one session using the list file option.

First we need to create the list file for all the files; then we can use this file in the main mapping.

Parameter file Format:

It is a text file; below is the format for the parameter file. We place this file on the Unix box where our Informatica server is installed.

[GEHC_APO_DEV.WF:w_GEHC_APO_WEEKLY_HIST_LOAD.WT:wl_
GEHC_APO_WEEKLY_HIST_BAAN.ST:s_m_GEHC_APO_BAAN_SALE
S_HIST_AUSTRI]

$InputFileName_BAAN_SALE_HIST=/interface/dev/etl/apo/srcfile
s/HS_025_20070921

$DBConnection_Target=DMD2_GEMS_ETL

$$CountryCode=AT

$$CustomerNumber=120165

[GEHC_APO_DEV.WF:w_GEHC_APO_WEEKLY_HIST_LOAD.WT:wl_
GEHC_APO_WEEKLY_HIST_BAAN.ST:s_m_GEHC_APO_BAAN_SALE
S_HIST_BELUM]

$DBConnection_Sourcet=DEVL1C1_GEMS_ETL

$OutputFileName_BAAN_SALES=/interface/dev/etl/apo/trgfiles/
HS_002_20070921

$$CountryCode=BE

$$CustomerNumber=101495



Difference between 7.x and 8.x

Power Center 7.X Architecture.



Power Center 8.X Architecture.



Developer Changes:


For example, in PowerCenter:

• PowerCenter Server has become a service, the Integration Service

• No more Repository Server, but PowerCenter includes a Repository Service

• Client applications are the same, but work on top of the new services framework

Below are the differences between Informatica 7.1 and 8.1:

1) PowerCenter Connect for SAP NetWeaver BW option

2) SQL transformation is added

3) Service-oriented architecture

4) Grid concept is an additional feature

5) Random file names can be generated in the target

6) Command line programs: infacmd and infasetup are new commands that were added

7) Java transformation is an added feature

8) Concurrent cache creation and faster index building are additional features in the Lookup transformation

9) Caches are automatic; you don't need to allocate them at transformation level

10) Pushdown optimization techniques

11) We can append data into the flat file target

12) Dynamic file names can be generated in Informatica 8

13) Flat file names can be populated to the target while processing through a list file

14) For flat files, headers and footers can be populated using advanced options at session level in version 8

15) GRID option at session level




Effective in version 8.0, you create and configure a grid in the
Administration Console. You configure a grid to run on multiple
nodes, and you configure one Integration Service to run on the
grid. The Integration Service runs processes on the nodes in the
grid to distribute workflows and sessions. In addition to running
a workflow on a grid, you can now run a session on a grid.
When you run a session or workflow on a grid, one service
process runs on each available node in the grid.

Pictorial Representation of Workflow execution:

1. A PowerCenter Client request IS to start workflow

2. IS starts ISP

3. ISP consults LB to select node

4. ISP starts DTM in node selected by LB

Integration Service (IS)

The key functions of IS are:

 Interpretation of the workflow and mapping metadata from the repository

 Execution of the instructions in the metadata

 Management of the data from source system to target system within memory and disk



The three main components of the Integration Service which enable data movement are:

 Integration Service Process

 Load Balancer

 Data Transformation Manager

Integration Service Process (ISP)

The Integration Service starts one or more Integration Service processes to run and monitor workflows. When we run a workflow, the ISP starts and locks the workflow, runs the workflow tasks, and starts the process to run sessions. The functions of the Integration Service Process are:

 Locks and reads the workflow

 Manages workflow scheduling, i.e., maintains session dependency

 Reads the workflow parameter file

 Creates the workflow log

 Runs workflow tasks and evaluates the conditional links

 Starts the DTM process to run the session

 Writes historical run information to the repository

 Sends post-session emails
Load Balancer

The Load Balancer dispatches tasks to achieve optimal performance. It dispatches tasks to a single node or across the nodes in a grid after performing a sequence of steps. Before understanding these steps we have to know about Resources, Resource Provision Thresholds, Dispatch mode and Service levels.

 Resources – we can configure the Integration Service to check the resources available on each node and match them with the resources required to run the task. For example, if a session uses an SAP source, the Load Balancer dispatches the session only to nodes where the SAP client is installed.

 Three Resource Provision Thresholds: the maximum number of runnable threads waiting for CPU resources on the node, called Maximum CPU Run Queue Length; the maximum percentage of virtual memory allocated on the node relative to the total physical memory size, called Maximum Memory %; and the maximum number of running Session and Command tasks allowed for each Integration Service process running on the node, called Maximum Processes.

 Three Dispatch modes – Round-Robin: the Load Balancer dispatches tasks to available nodes in a round-robin fashion after checking the Maximum Processes threshold. Metric-based: checks all three resource provision thresholds and dispatches tasks in round-robin fashion. Adaptive: checks all three resource provision thresholds and also ranks nodes according to current CPU availability.

 Service Levels establish priority among tasks that are waiting to be dispatched; the three components of service levels are Name, Dispatch Priority and Maximum dispatch wait time. Maximum dispatch wait time is the amount of time a task can wait in the queue, and this ensures no task waits forever.

A. Dispatching tasks on a node

1. The Load Balancer checks different resource provision thresholds on the node depending on the Dispatch mode set. If dispatching the task would cause any threshold to be exceeded, the Load Balancer places the task in the dispatch queue and dispatches it later.

2. The Load Balancer dispatches all tasks to the node that runs the master Integration Service process.

B. Dispatching tasks on a grid

1. The Load Balancer verifies which nodes are currently running and enabled.

2. The Load Balancer identifies nodes that have the PowerCenter resources required by the tasks in the workflow.

3. The Load Balancer verifies that the resource provision thresholds on each candidate node are not exceeded. If dispatching the task would cause a threshold to be exceeded, the Load Balancer places the task in the dispatch queue and dispatches it later.

4. The Load Balancer selects a node based on the dispatch mode.

Data Transformation Manager (DTM) Process

When the workflow reaches a session, the Integration Service Process starts the DTM process. The DTM is the process associated with the session task. The DTM process performs the following tasks:

 Retrieves and validates session information from the repository.

 Validates source and target code pages.

 Verifies connection object permissions.

 Performs pushdown optimization when the session is configured for pushdown optimization.

 Adds partitions to the session when the session is configured for dynamic partitioning.

 Expands the service process variables, session parameters, and mapping variables and parameters.

 Creates the session log.

 Runs pre-session shell commands, stored procedures, and SQL.

 Sends a request to start worker DTM processes on other nodes when the session is configured to run on a grid.

 Creates and runs mapping, reader, writer, and transformation threads to extract, transform, and load data.

 Runs post-session stored procedures, SQL, and shell commands and sends post-session email.

After the session is complete, it reports the execution result to the ISP.

Approach_1: Using SETMAXVARIABLE()

1) First create a mapping variable ($$INCREMENT_TS) and assign an initial value of an old date (01/01/1940).
2) Then override the source qualifier query to fetch only LAST_UPD_DATE >= $$INCREMENT_TS (mapping variable).
3) In the expression, assign the max LAST_UPD_DATE value to $$INCREMENT_TS (mapping variable) using SETMAXVARIABLE.
4) Because it is a variable, it stores the max LAST_UPD_DATE value in the repository; in the next run our source qualifier query will fetch only the records updated or inserted after the previous run.



Logic in the mapping variable:


Logic in the SQ:

In the expression, assign the max last update date value to the variable using the function SETMAXVARIABLE.





Logic in the update strategy is below



Approach_2: Using a parameter file

First create a mapping parameter ($$LastUpdateDateTime) and assign an initial value of an old date (01/01/1940) in the parameter file.

Then override the source qualifier query to fetch only LAST_UPD_DATE >= $$LastUpdateDateTime (mapping parameter).

Update the mapping parameter ($$LastUpdateDateTime) value in the parameter file using a shell script or another mapping after the first session completes successfully.

Because it is a mapping parameter, every time we need to update the value in the parameter file after completion of the main session.

Parameterfile:

[GEHC_APO_DEV.WF:w_GEHC_APO_WEEKLY_HIST_LOAD.WT:wl_
GEHC_APO_WEEKLY_HIST_BAAN.ST:s_m_GEHC_APO_BAAN_SALE
S_HIST_AUSTRI]

$DBConnection_Source=DMD2_GEMS_ETL

$DBConnection_Target=DMD2_GEMS_ETL

$$LastUpdateDateTime=01/01/1940



Updating parameter File

Logic in the expression

Main mapping


Sql override in SQ Transformation

Workflow Design



4.2 Informatica Scenarios:

1) How to populate the 1st record to the 1st target, the 2nd record to the 2nd target, the 3rd record to the 3rd target and the 4th record back to the 1st target through Informatica?

We can do this using a Sequence Generator by setting end value = 3 and enabling the cycle option; then in the Router take 3 groups:

In the 1st group, specify the condition as seq next value = 1 and pass those records to the 1st target; similarly,

In the 2nd group, specify the condition as seq next value = 2 and pass those records to the 2nd target;

In the 3rd group, specify the condition as seq next value = 3 and pass those records to the 3rd target.

Since we have enabled the cycle option, after reaching the end value the sequence generator starts again from 1, so for the 4th record seq next value is 1 and it goes to the 1st target.


2) How to do dynamic file generation in Informatica?

I want to generate a separate file for every State (as per state, it should generate a file). It has to generate 2 flat files, and the name of each flat file is the corresponding state name; that is the requirement.

Below is my mapping:

Source (Table) -> SQ -> Target (FF)

Source:

State   Transaction   City
AP      2             HYD
AP      1             TPT
KA      5             BANG
KA      7             MYSORE
KA      3             HUBLI

This functionality was added from Informatica 8.5 onwards; in earlier versions it was not there.

We can achieve it with the use of a Transaction Control transformation and the special "FileName" port in the target file.

In order to generate the target file names from the mapping, we should make use of the special "FileName" port in the target file. You can't create this special port from the usual New Port button; there is a special button with the label "F" on it at the right-most corner of the target flat file when viewed in the Target Designer.

When you have different sets of input data with different target files to be created, use the same target instance, but with a Transaction Control transformation which defines the boundary for the source sets.

In the target flat file there is an option in the columns tab, i.e. FileName as column; when you click it, a non-editable column gets created in the metadata of the target.

In the Transaction Control transformation, give a condition that commits whenever the grouping value changes, for example IIF(state <> v_prev_state, TC_COMMIT_BEFORE, TC_CONTINUE_TRANSACTION), where v_prev_state is a variable port holding the previous row's state, and map the state column to the target's FileName column.

The mapping will be like this:

source -> SQ -> transaction control -> target

Run it, and separate files will be created, each named after the value mapped to the FileName column.

3) How to concatenate row data through Informatica?

Source:

Ename    EmpNo
stev     100
methew   100
john     101
tom      101

Target:

Ename         EmpNo
stev methew   100
john tom      101

Approach 1: Using a dynamic lookup on the target table:

If the record doesn't exist, insert it into the target. If it already exists, get the corresponding Ename value from the lookup, concatenate it in an expression with the current Ename value, and then update the target Ename column using an update strategy.

Approach 2: Using variable ports:

Sort the data in the SQ based on the EmpNo column, then use an expression to store the previous record's information in variable ports; after that, use a router to insert the record if it is seen for the first time, and if it was already inserted, update Ename with the concatenated value of the previous name and the current name.
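For comparison, the same result can be produced in the database itself, e.g. with Oracle's LISTAGG function (a sketch, not the Informatica approach; EMP_SRC is a hypothetical source table):

    -- Collapse all names for an EmpNo into one space-separated string
    SELECT empno,
           LISTAGG(ename, ' ') WITHIN GROUP (ORDER BY ename) AS enames
    FROM   emp_src
    GROUP  BY empno;

When the source database supports it, pushing this into the source qualifier query avoids the variable-port bookkeeping entirely.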

4) How to send unique (distinct) records into one target and duplicates into another target?

Source:

Ename    EmpNo
stev     100
stev     100
john     101
mathew   102

Output:

Target_1:

Ename    EmpNo
stev     100
john     101
mathew   102

Target_2:

Ename    EmpNo
stev     100

Approach 1: Using a dynamic lookup on the target table:

If the record doesn't exist, insert it into Target_1. If it already exists, send it to Target_2 using a router.

Approach 2: Using variable ports:

Sort the data in the SQ based on the EmpNo column, then use an expression to store the previous record's information in variable ports; after that, use a router to route the data into the targets: if the record is seen for the first time, send it to Target_1; if it was already inserted, send it to Target_2.
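A SQL equivalent, useful for checking the expected counts (a sketch using Oracle's ROW_NUMBER over a hypothetical EMP_SRC table):

    -- rn = 1 marks the first occurrence per EmpNo (Target_1);
    -- rn > 1 marks the duplicates (Target_2)
    SELECT ename, empno, rn
    FROM  (SELECT ename, empno,
                  ROW_NUMBER() OVER (PARTITION BY empno ORDER BY ename) AS rn
           FROM   emp_src);

Rows with rn = 1 correspond to the distinct set, and rows with rn > 1 to the duplicate set.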

5) How to process multiple flat files into a single target table through Informatica, if all files have the same structure?

We can process all flat files through one mapping and one session using a list file.

First we need to create the list file, using a Unix script, for all the flat files; the extension of the list file is .LST.

The list file contains only the flat file names.

At session level we need to set:

the source file directory as the list file path,

the source file name as the list file name,

and the file type as Indirect.



6) How to populate the file name into the target while loading multiple files using the list file concept?

In Informatica 8.6, by selecting the Add Currently Processed Flat File Name option in the Properties tab of the source definition (after importing the source file definition in the Source Analyzer), a new column called Currently Processed File Name is added. We can map this column to the target to populate the file name.

7) If we want to run 2 workflows one after another (how to set the dependency between workflows)

• If both workflows exist in the same folder, we can create 2 worklets rather than 2 workflows.

• Then we can call these 2 worklets in one workflow.

• There we can set the dependency.

• If the workflows exist in different folders or repositories, then we cannot create worklets.

• We can set the dependency between these two workflows using a shell script, which is one approach.

• The other approach is event wait and event raise.

If the workflows exist in different folders or different repositories, we can use the approaches below.

1) Using a shell script

• As soon as the first workflow completes, we create a zero-byte file (indicator file).

• If the indicator file is available in the particular location, we run the second workflow.

• If the indicator file is not available, we wait for 5 minutes and check for the indicator again; we continue this loop 5 times, i.e. 30 minutes.

• After 30 minutes, if the file still does not exist, we send out an email notification.

2) Event wait and event raise approach

We can put an Event-Wait task before the actual session in the workflow to wait for an indicator file; if the file is available, it runs the session, otherwise the Event-Wait waits indefinitely until the indicator file is available.

8) How to load cumulative salary into the target?

Solution:

Using variable ports in an expression, we can load cumulative salary into the target: a variable port keeps a running total that is incremented with each row's salary, and an output port writes the running total out.
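A SQL equivalent for checking the expected output (a sketch using Oracle's analytic SUM over a hypothetical EMP_SRC table):

    -- Running total of salary in EmpNo order
    SELECT empno, sal,
           SUM(sal) OVER (ORDER BY empno ROWS UNBOUNDED PRECEDING) AS cum_sal
    FROM   emp_src;

In the Informatica expression, the same effect comes from a variable port v_cum_sal = v_cum_sal + sal with an output port o_cum_sal = v_cum_sal.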



4.3 Development Guidelines
General Development Guidelines

The starting point of development is the logical model created by the Data Architect. This logical model forms the foundation for metadata, which will be continuously maintained throughout the Data Warehouse Development Life Cycle (DWDLC). The logical model is formed from the requirements of the project. At the completion of the logical model, technical documentation is produced defining the sources, targets, requisite business rule transformations, mappings and filters. This documentation serves as the basis for the creation of the Extraction, Transformation and Loading processes that actually manipulate the data from the application sources into the Data Warehouse/Data Mart.

To start development on any data mart you should have the following things set up by the Informatica Load Administrator:

 Informatica folder. The development team, in consultation with the BI Support Group, can decide a three-letter code for the project, which would be used to create the Informatica folder as well as the Unix directory structure.
 Informatica user IDs for the developers.
 Unix directory structure for the data mart.
 A schema XXXLOAD on the DWDEV database.

Transformation Specifications

Before developing the mappings you need to prepare the specifications document for the mappings you need to develop. A good template is placed in the templates folder. You can use your own template as long as it has as much detail as, or more than, this template.

When estimating the time required to develop mappings, the rule of thumb is as follows:

 Simple mapping – 1 person day
 Medium complexity mapping – 3 person days
 Complex mapping – 5 person days

Usually the mapping for the fact table is the most complex and should be allotted as much time for development as possible.

Data Loading from Flat Files

It's an accepted best practice to always load a flat file into a staging table before any transformations are done on the data in the flat file.

Always use the LTRIM and RTRIM functions on string columns before loading data into a stage table.

You can also use the UPPER function on string columns, but before using it you need to ensure that the data is not case sensitive (e.g. ABC is different from Abc).

If you are loading data from a delimited file, then make sure the delimiter is not a character which could appear in the data itself. Avoid using comma-separated files. Tilde (~) is a good delimiter to use.
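A minimal sketch of the trimming rule as it would appear in a stage-load query (STG_CUSTOMER and the raw external table EXT_CUSTOMER_FILE are hypothetical):

    -- Trim stray spaces before the data lands in the stage table
    INSERT INTO stg_customer (cust_id, cust_name)
    SELECT LTRIM(RTRIM(cust_id)),
           LTRIM(RTRIM(cust_name))
    FROM   ext_customer_file;

The same LTRIM(RTRIM(...)) pattern applies in an Informatica Expression transformation port.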

Failure Notification

Once in production, your sessions and batches need to send out notifications to the Support team when they fail. You can do this by configuring an email task at the session level.

Naming Conventions and usage of Transformations

Port Standards:

Input Ports – It will be necessary to change the name of input ports for lookups, expressions and filters where ports might have the same name. If ports do have the same name, then they will default to having a number after the name. Change this default to a prefix of "in_". This will allow you to keep track of input ports throughout your mappings.
Prefixed with: IN_

Variable Ports – Variable ports that are created within an Expression transformation should be prefixed with a "v_". This will allow the developer to distinguish between input/output and variable ports. For more explanation of variable ports see the section "VARIABLES".
Prefixed with: V_

Output Ports – If organic data is created with a transformation that will be mapped to the target, make sure that it has the same name as the target port that it will be mapped to.
Prefixed with: O_

Quick Reference

Object Type            Syntax

Folder                 XXX_<Data Mart Name>
Mapping                m_fXY_ZZZ_<Target Table Name>_x.x
Session                s_fXY_ZZZ_<Target Table Name>_x.x
Batch                  b_<Meaningful name representing the sessions inside>
Source Definition      <Source Table Name>
Target Definition      <Target Table Name>
Aggregator             AGG_<Purpose>
Expression             EXP_<Purpose>
Filter                 FLT_<Purpose>
Joiner                 JNR_<Names of Joined Tables>
Lookup                 LKP_<Lookup Table Name>
Normalizer             Norm_<Source Name>
Rank                   RNK_<Purpose>
Router                 RTR_<Purpose>
Sequence Generator     SEQ_<Target Column Name>
Source Qualifier       SQ_<Source Table Name>
Stored Procedure       STP_<Database Name>_<Procedure Name>
Update Strategy        UPD_<Target Table Name>_xxx
Mapplet                MPP_<Purpose>
Input Transformation   INP_<Description of Data being funneled in>
Output Transformation  OUT_<Description of Data being funneled out>
Database Connections   XXX_<Database Name>_<Schema Name>

4.4 Performance Tips


What is performance tuning in Informatica?

The aim of performance tuning is to optimize session performance so that sessions run within the available load window for the Informatica Server.

Increase session performance by noting the following.

The performance of the Informatica Server is related to network connections. Data generally moves across a network at less than 1 MB per second, whereas a local disk moves data five to twenty times faster. Network connections therefore often affect session performance, so minimize network traffic between the server and the data where possible.

1. Cache lookups if the source table is under 500,000 rows and DON’T cache for tables over 500,000 rows.

2. Reduce the number of transformations. Don’t use an Expression transformation just to collect fields. Don’t use an Update Strategy transformation if you are only inserting; insert mode is the default.

3. If a value is used in multiple ports, calculate the value once (in a variable) and reuse the result instead of recalculating it for multiple ports.
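
As a sketch of this, an Expression transformation could compute a tax amount once in a variable port and reuse it in several output ports (the port names and rate are illustrative):

V_TAX   = PRICE * 0.0825    -- variable port, evaluated once per row
O_TAX   = V_TAX             -- output port reusing the variable
O_TOTAL = PRICE + V_TAX     -- second output port reusing the same variable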

4. Reuse objects where possible.

5. Delete unused ports, particularly in the Source Qualifier and Lookups.

6. Use operators in expressions in preference to functions.

7. Avoid using Stored Procedures, and call them only once
during the mapping if possible.

8. Remember to turn off verbose logging after you have finished debugging.

9. Use default values where possible instead of using IIF(ISNULL(X),,) in an Expression port.

10. When overriding the Lookup SQL, always include a valid ORDER BY clause in the SQL. This causes the database to perform the ordering, rather than the Informatica Server, while building the cache.
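
For example, an override of this kind might look as follows (a sketch; the table and columns are illustrative, and the trailing "--" is commonly used to comment out the ORDER BY that the Informatica Server otherwise appends to a cached lookup):

SELECT CUST_ID, CUST_NAME
FROM CUSTOMER_DIM
ORDER BY CUST_ID, CUST_NAME --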

11. Improve session performance by using sorted data with the Joiner transformation. When the Joiner transformation is configured to use sorted data, the Informatica Server improves performance by minimizing disk input and output.

12. Improve session performance by using sorted input with the Aggregator transformation, since it reduces the amount of data cached during the session.

13. Improve session performance by using a limited number of connected input/output or output ports to reduce the amount of data the Aggregator transformation stores in the data cache.

14. Use a Filter transformation prior to an Aggregator transformation to reduce unnecessary aggregation.

15. Performing a join in the database is faster than performing the join in the session, so use the Source Qualifier to perform the join where possible.

16. Designate the source with the fewer rows as the master source in Joiner transformations, since this reduces the search time and also the cache size.

17. When using multiple conditions in a lookup, specify the conditions with the equality operator first.

18. Improve session performance by caching small lookup tables.

19. If the lookup table is on the same database as the source table, instead of using a Lookup transformation, join the tables in the Source Qualifier transformation itself if possible.
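
As a sketch, the SQL override in the Source Qualifier could then perform the join directly (the table and column names are illustrative):

SELECT ORD.ORDER_ID, ORD.ORDER_AMT, CUST.CUST_NAME
FROM ORDERS ORD, CUSTOMER CUST
WHERE ORD.CUST_ID = CUST.CUST_ID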

20. If the lookup table does not change between sessions, configure the Lookup transformation to use a persistent lookup cache. The Informatica Server saves and reuses cache files from session to session, eliminating the time required to read the lookup table.

21. Use the :LKP reference qualifier in expressions only when calling unconnected Lookup transformations.

22. The Informatica Server generates an ORDER BY statement for a cached lookup that contains all lookup ports. By providing an override ORDER BY clause with fewer columns, session performance can be improved.

23. Eliminate unnecessary data type conversions from mappings.

24. Reduce the number of rows being cached by using the Lookup SQL Override option to add a WHERE clause to the default SQL statement.
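
For instance, a WHERE clause added through the Lookup SQL Override might look like this (a sketch; the table, columns and filter are illustrative):

SELECT PROD_ID, PROD_DESC
FROM PROD_DIM
WHERE ACTIVE_FLG = 'Y'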

4.5 Unit Test Cases (UTP):


The QA life cycle consists of four types of testing regimens:

1. Unit Testing

2. Functional Testing

3. System Integration Testing

4. User Acceptance Testing

Unit testing: The testing, by development, of the application modules to verify that each unit (module) itself meets the accepted user requirements and design and development standards.

Functional Testing: The testing of all the application’s modules individually to ensure the modules, as released from development to QA, work together as designed and meet the accepted user requirements and system standards.

System Integration Testing: Testing of all of the application modules in the same environment, database instance, network and inter-related applications, as it would function in production. This includes security, volume and stress testing.

User Acceptance Testing (UAT): The testing of the entire application by the end-users, ensuring the application functions as set forth in the system requirements documents and that the system meets the business needs.

UTP Template:

Each test case records: Step #, Description, Test Conditions, Expected Results, Actual Results, Pass or Fail (P or F), and Tested By.

SAP-CMS Interfaces

Step 1
Description: Check for the total count of records in the source tables that is fetched and the total records in the PRCHG table for a particular session timestamp.
Test Conditions:
SOURCE:
SELECT count(*) FROM XST_PRCHG_STG
TARGET:
SELECT count(*) FROM T_PRCHG
Expected Results: Both the source and target table load record counts should match.
Actual Results: Should be same as the expected.
Pass or Fail: Pass
Tested By: Stev

Step 2
Description: Check all the target columns to verify whether they are getting populated correctly with source data.
Test Conditions:
SELECT PRCHG_ID, PRCHG_DESC, DEPT_NBR, EVNT_CTG_CDE, PRCHG_TYP_CDE, PRCHG_ST_CDE
FROM T_PRCHG
MINUS
SELECT PRCHG_ID, PRCHG_DESC, DEPT_NBR, EVNT_CTG_CDE, PRCHG_TYP_CDE, PRCHG_ST_CDE
FROM PRCHG
Expected Results: The comparison of source and target record values should return zero records.
Actual Results: Should be same as the expected.
Pass or Fail: Pass
Tested By: Stev

Step 3
Description: Check the Insert strategy used to load records into the target table.
Test Conditions: Identify one record from the source which is not in the target table, then run the session.
Expected Results: It should insert a record into the target table with the source data.
Actual Results: Should be same as the expected.
Pass or Fail: Pass
Tested By: Stev

Step 4
Description: Check the Update strategy used to load records into the target table.
Test Conditions: Identify one record from the source which is already present in the target table with different PRCHG_ST_CDE or PRCHG_TYP_CDE values, then run the session.
Expected Results: It should update the record in the target table with the source data for that existing record.
Actual Results: Should be same as the expected.
Pass or Fail: Pass
Tested By: Stev

5 UNIX

How strong are you in UNIX?

1) I have UNIX shell scripting knowledge covering whatever Informatica requires, such as running workflows from UNIX using pmcmd.

Below is the script to run a workflow using UNIX:

cd /pmar/informatica/pc/pmserver/

/pmar/informatica/pc/pmserver/pmcmd startworkflow -u $INFA_USER -p $INFA_PASSWD -s $INFA_SERVER:$INFA_PORT -f $INFA_FOLDER -wait $1 >> $LOG_PATH/$LOG_FILE

2) If we are supposed to process flat files using Informatica but those files exist on a remote server, then we have to write a script to FTP them onto the Informatica server before starting to process those files.
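
A minimal sketch of such a transfer, assuming key-based authentication has been set up (the host name and paths are illustrative):

# pull the remote flat file onto the Informatica server before the session starts
scp etluser@remotehost:/data/outbound/sales_extract.dat /pmar/informatica/pc/srcfiles/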

3) File watch means that if an indicator file is available in the specified location, then we need to start our Informatica jobs; otherwise we send an email notification using the mailx command saying that the previous jobs didn’t complete successfully.
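
A minimal file-watch sketch along those lines (the indicator path, workflow name and email address are illustrative):

if [ -f /data/indicators/sales_load.ind ]; then
  /pmar/informatica/pc/pmserver/pmcmd startworkflow -u $INFA_USER -p $INFA_PASSWD -s $INFA_SERVER:$INFA_PORT -f $INFA_FOLDER -wait wf_load_sales
else
  # indicator file missing - upstream jobs did not complete
  echo "Indicator file not found; previous jobs may not have completed" | mailx -s "Load not started" support_team@company.com
fi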

4) Using a shell script, update the parameter file with the session start time and end time.
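
A minimal sketch of that, assuming the workflow reads the parameter file shown (the path, section header and variable name are illustrative):

PARAM_FILE=/pmar/informatica/pc/paramfiles/wf_load_sales.param
echo "[MY_FOLDER.WF:wf_load_sales.ST:s_m_load_sales]" > $PARAM_FILE
# record the session start time as a mapping parameter
echo "\$\$SESSION_START_TIME=`date '+%m/%d/%Y %H:%M:%S'`" >> $PARAM_FILE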

This is the kind of scripting knowledge I have. If any new UNIX requirement comes up, I can Google it, get the solution and implement it.

Basic Commands:

cat > file1 (cat is the command to create a non-zero-byte file; type the content and press Ctrl+D to save)
cat file1 file2 > all (combines file1 and file2 into 'all'; it will create the file if it doesn’t exist)
cat file1 >> file2 (appends the contents of file1 to file2)

o > will redirect output from standard out (the screen) to a file, printer, or whatever you like.

o >> filename will append at the end of a file called filename.

o < will redirect input to a process or command.

How to create a zero-byte file?

touch filename (touch is the command to create a zero-byte file)

How to find all processes that are running:

ps -A

Crontab command

The crontab command is used to schedule jobs. You must have permission from the UNIX administrator to run this command. Jobs are scheduled using five fields, as follows:

Minutes (0-59) Hour (0-23) Day of month (1-31) Month (1-12) Day of week (0-6) (0 is Sunday)

So, for example, if you want to schedule a job which runs the script named backup_jobs in the /usr/local/bin directory on Sunday (day 0) at 22:25 on the 15th of the month, the entry in the crontab file will be as below. * represents all values.

25 22 15 * 0 /usr/local/bin/backup_jobs

The * here tells the system to run this in every month.

The syntax is: crontab file. So create a file with the scheduled jobs as above and then type crontab filename. This will schedule the jobs.

The command below gives the total number of users logged in at this time:

who | wc -l

echo "`who | wc -l` are the total number of people logged in at this time."

The command below will display only directories:

$ ls -l | grep '^d'

Pipes:

The pipe symbol "|" is used to direct the output of one command to the input of another.

Moving, renaming, and copying files:

cp file1 file2          copy a file
mv file1 newname        move or rename a file
mv file1 ~/AAA/         move file1 into sub-directory AAA in your home directory
rm file1 [file2 ...]    remove or delete a file

To display hidden files:

ls -a

Viewing and editing files:

cat filename        Dump a file to the screen in ASCII.
more filename       View the file contents one page at a time.
head filename       Show the first few lines of a file.
head -5 filename    Show the first 5 lines of a file.
tail filename       Show the last few lines of a file.
tail -7 filename    Show the last 7 lines of a file.

Searching for files: the find command

find . -name aaa.txt    Finds all the files named aaa.txt in the current directory or any subdirectory tree.

find / -name vimrc      Finds all the files named 'vimrc' anywhere on the system.

find /usr/local/games -name "*xpilot*"
                        Finds all files whose names contain the string 'xpilot' which exist within the '/usr/local/games' directory tree.
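
find also combines with -exec for housekeeping tasks; for example, a sketch that deletes session log files older than seven days (the path and pattern are illustrative):

find /pmar/informatica/pc/logs -name "*.log" -mtime +7 -exec rm {} \;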



You can find out what shell you are using by the command:

echo $SHELL

If a file exists, then send an email with it as an attachment:

if [[ -f $your_file ]]; then
  uuencode $your_file $your_file | mailx -s "$your_file exists..." your_email_address
fi

The line below is the first line of a script:

#!/usr/bin/sh

or

#!/bin/ksh

What does #!/bin/sh mean in a shell script?

It tells the script which interpreter to refer to. As you know, the bash shell has some specific functions that other shells do not have, and vice-versa. The same applies to Perl, Python and other languages.

It tells your shell which shell to use when executing the statements in your shell script.

Interactive History

A feature of bash and tcsh (and sometimes others): you can use the up-arrow keys to access your previous commands, edit them, and re-execute them.

Basics of the vi editor

Opening a file:

vi filename

Creating text

Edit modes: these keys enter editing modes so that you can type in the text of your document.

i       Insert before current cursor position
I       Insert at beginning of current line
a       Insert (append) after current cursor position
A       Append to end of line
r       Replace 1 character
R       Replace mode
<ESC>   Terminate insertion or overwrite mode

Deletion of text

x    Delete single character
dd   Delete current line and put in buffer

Saving and quitting:

:w                  Write the current file.
:w new.file         Write the file to the name 'new.file'.
:w! existing.file   Overwrite an existing file with the file currently being edited.
:wq                 Write the file and quit.
:q                  Quit.
:q!                 Quit with no changes.
