A DWH (data warehouse) is a database specifically designed for analyzing the business, not
for business transactional processing.
A warehouse is designed to support the decision-making process.
Time Variant: A DWH is a time-variant database, which supports users in analyzing
the business and comparing it across different time periods. Users can identify the
trends in the business.
[Figure: bar chart comparing a business measure for 2008 Q1 and 2009 Q1, with the difference between the two bars marked as the variance.]
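The quarter-over-quarter comparison illustrated above can be sketched in a few lines; the revenue figures here are assumed purely for illustration.

```python
# Minimal sketch of a time-variant comparison: because the warehouse
# stores the same measure for different periods, users can compare
# quarters. The sample figures are assumed for illustration.
revenue_by_quarter = {"2008-Q1": 20.0, "2009-Q1": 27.0}

variance = revenue_by_quarter["2009-Q1"] - revenue_by_quarter["2008-Q1"]
growth_pct = 100.0 * variance / revenue_by_quarter["2008-Q1"]
```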
Non Volatile: A DWH is a non-volatile database. Once data is entered into the DWH, it does not
reflect the changes which take place in the operational DB.
Integrated: A DWH is an integrated database which collects the data from multiple operational
sources & integrates the data into a homogeneous format.
Subject Oriented: A DWH is a subject-oriented database, which supports the business needs
of individual departments in the enterprise or of middle-management users.
E.g. Sales, HR, Accounts, Loans
OLTP DB                                    DWH
Designed to support operational            Designed to support the decision-making
monitoring                                 process
Data is volatile                           Data is non-volatile
Current data                               Historical data
Detailed data                              Summary data
Normalized                                 Denormalized
Designed for running the business          Designed for analyzing the business
Designed for clerical access               Designed for managerial access
More joins                                 Fewer joins
Few indexes                                More indexes
1. Data Extraction: It’s a process of reading the data from various types of source
   systems such as databases, SAP, PeopleSoft, JD Edwards, flat files, XML files, COBOL
   files, etc.
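As a minimal sketch of the extraction step, the snippet below reads every record from a flat-file source using Python's csv module; the file content and its column names are assumed for illustration (a real source could equally be a database table or an XML file).

```python
import csv
import io

# Sample flat-file content (assumed for illustration).
flat_file = io.StringIO(
    "emp_id,emp_name,salary\n"
    "101,Ravi,45000\n"
    "102,Kiran,52000\n"
)

def extract(source):
    """Read every record from a delimited flat-file source."""
    return list(csv.DictReader(source))

records = extract(flat_file)
```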
2. Data Transformation: It’s a process of converting and cleansing the data into the
   required business format. The following types of transformation activities take
   place in the staging area:
   a. Data Merging
   b. Data Cleansing
   c. Data Scrubbing
   d. Data Aggregation
3. Data Loading: It’s a process of inserting the data into the target system. There
   are 2 types of data loads:
   a. Initial Load / Full Load: It’s a process of loading all the required data at the
      first load.
   b. Incremental Load / Delta Load: It’s a process of loading only the new records
      after the initial load.
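The two load types can be sketched as follows; the target here is a plain dict keyed by record id, and the record fields and load timestamps are assumed for illustration.

```python
# Minimal sketch of full vs. incremental (delta) loading, assuming each
# source record carries a load timestamp (field names are illustrative).
source = [
    {"id": 1, "name": "Asha",  "loaded_on": "2009-01-10"},
    {"id": 2, "name": "Vijay", "loaded_on": "2009-02-15"},
    {"id": 3, "name": "Meena", "loaded_on": "2009-03-20"},
]

def initial_load(records):
    """Full load: move every record into the target at the first run."""
    return {r["id"]: r for r in records}

def incremental_load(target, records, last_load_date):
    """Delta load: insert only records newer than the previous load."""
    for r in records:
        if r["loaded_on"] > last_load_date:
            target[r["id"]] = r
    return target

target = initial_load(source[:2])               # first run loads ids 1 and 2
incremental_load(target, source, "2009-02-15")  # later run picks up id 3 only
```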
Data Transformation:
Data Merging: It’s a process of integrating the data from multiple sources into a single
output pipeline.
There are 2 types of data merging activities: Join and Union.
[Diagram: a Join combines two Oracle (OLTP) sources that hold different types of data (e.g. an Emp table joined with Pdt/Category tables); a Union combines same-structured data from two different source systems (e.g. an Emp table in Oracle with an Emp table in SQL Server).]
Data Cleansing: It’s a process of removing unwanted data and of correcting
inconsistencies and inaccuracies.
Example: inconsistent Country values (india, austria, Italy, Australia) are standardized
with Initcap(country); City values (Hyd, Hbad, hyderabad) are mapped to a single
consistent value with Decode(…); Sales figures ($7.6, $4.780, $3.21, $3.0) are
standardized with Round().
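The three cleansing rules in the example above can be sketched as follows; the input rows and the city-code lookup table are assumed for illustration.

```python
# Minimal sketch of the cleansing rules: Initcap on country,
# a Decode-style lookup on city codes, and Round on sales.
rows = [
    {"country": "india",   "city": "Hyd",  "sales": 7.6},
    {"country": "austria", "city": "Hbad", "sales": 4.780},
]

CITY_DECODE = {"Hyd": "Hyderabad", "Hbad": "Hyderabad"}  # code -> full name

def cleanse(row):
    return {
        "country": row["country"].capitalize(),             # Initcap()
        "city": CITY_DECODE.get(row["city"], row["city"]),  # Decode(...)
        "sales": round(row["sales"], 2),                    # Round()
    }

clean_rows = [cleanse(r) for r in rows]
```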
Data Scrubbing: It’s a process of deriving new data definitions using the source data
definitions.
[Diagram: a Sale record in the OLTP source has sid, customer (Cust_FirstName,
Cust_LastName), pdt, qty and price; in the DW, the Sale record carries sid, customer
(Cust_Name), pdt, qty and a derived revenue = qty * price.]
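The scrubbing step shown above, where new definitions (cust_name, revenue) are derived from the source definitions, can be sketched as follows; the sample values are assumed.

```python
# Minimal sketch of scrubbing: derive cust_name and revenue from
# existing OLTP source columns. Sample values are illustrative.
oltp_sale = {
    "sid": 501,
    "cust_firstname": "Anil",
    "cust_lastname": "Kumar",
    "pdt": "Laptop",
    "qty": 3,
    "price": 250.0,
}

def scrub(sale):
    """Derive the DW record's new definitions from the source columns."""
    return {
        "sid": sale["sid"],
        "cust_name": sale["cust_firstname"] + " " + sale["cust_lastname"],
        "pdt": sale["pdt"],
        "qty": sale["qty"],
        "revenue": sale["qty"] * sale["price"],  # revenue = qty * price
    }

dw_sale = scrub(oltp_sale)
```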
Meta Data: It is defined as data about data. Metadata describes the field name, data type,
precision, scale & keys.
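A field's metadata can be represented as a small record per column; the Customer table and its columns below are illustrative, as is the helper that shows how metadata can drive a simple precision check.

```python
# Minimal sketch of column metadata: field name, data type,
# precision, scale, and keys. The Customer columns are illustrative.
customer_metadata = [
    {"field": "cno",    "type": "number",   "precision": 5, "scale": 0,    "key": "pk"},
    {"field": "cfname", "type": "varchar2", "precision": 5, "scale": None, "key": None},
    {"field": "gender", "type": "number",   "precision": 1, "scale": 0,    "key": None},
]

def fits(meta, value):
    """Check an integer value fits the field's declared precision (digit count)."""
    return len(str(abs(int(value)))) <= meta["precision"]
```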
To design a plan for data acquisition, the following inputs are required:
1. Source definition
2. Target definition
3. Transformation rule (business logic) (optional)
Repository: A repository is a central metadata storage location, which contains all the
metadata required to build the extraction, transformation & loading to the target system.
Source definition (OLTP DB [Source]):
  Customer
  - cno    number(5) pk
  - cfname varchar2(5)
  - clname varchar2(6)
  - gender number(1)

Target definition (DW [Target]):
  T_Customer
  - cid    number(5) pk
  - cname  varchar2(12)
  - gender varchar2(1)

T/R rules: Concat() (cfname + clname -> cname), Decode() (gender number -> character).

ETL Plan: [Diagram: the ETL plan maps the Customer source in the OLTP DB to the
T_Customer target in the DW, applying the T/R rules in between.]
Client: A client is a GUI component where we can design the plan of the ETL process.

ETL Tool            ETL Plan
Informatica         Mapping
DataStage           Job
Ab Initio           Graph
Data Integration    Batch Job
Server: The server is the main component, which executes the ETL plan to move the data
from source to target.