
Introduction to Data Warehousing

Pragim Technologies

Roadmap

Introduction to Data Warehousing


OLTP Vs OLAP
Data Warehouse Cycle
Data Warehouse Architecture
Dimensional Modeling and Design
Types of Dimensions & Facts
Slowly Changing Dimensions
Data warehouse design
Indexing & Partitioning


What is Data Warehouse?


A data warehouse is a relational database that is specifically
designed for analyzing the business and making timely, effective
decisions, not for business transaction processing. (Ralph Kimball)
A data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management's decision-making process. (W. H. Inmon)

OLTP Vs OLAP

Transaction Systems (OLTP)
- Support the business transactional process
- Designed to run the business
- Detailed data
- No redundancy (normalized)
- Data is normally updated
- Current data
- Few indexes
- Supports the E-R model

Data Warehouse (OLAP)
- Support the decision-making process
- Designed to analyze the business
- Summarized data
- Allows redundancy (denormalized)
- Data is normally loaded
- Historical data
- More indexes
- Supports the dimensional model

Data Warehousing Cycle


Transaction Based Process
- Day-to-day operations
- On-line, real time
- Insert / update / delete
- Detailed information to operational systems

Extract & Transform, then batch load (Load & Summarise) into:

Warehouse Based Process
- Decision support for management use

Data Warehousing Architecture -Simple


[Figure: Operational Data and External Data feed ETL, which loads the Warehouse; Information Delivery then serves Report, Query, and Analyze.]

Data Warehousing Architecture -Typical


[Figure: typical data warehouse architecture. Components shown:]
- Data Sources: flat files and RDBMS data, each with an Extract step
- Staging Area / ODS: integration, cleansing, intermediate calculation
- ETL: extraction, transformation, loading, scheduling, refreshing
- Data Warehouse: summarized, de-normalized, historical, nonvolatile, subject oriented
- Data Marts
- Report, Query, Analyze: user-friendly report creation and web publishing for managers and power users

Why a Data Warehouse?


Why are companies
building DWs?


Why a Data Warehouse? - Cont


1. Reporting Pressures:
Relieve reporting pressure on transactional databases.
2. Restructure Data:
Restructure data to speed up data analysis and reporting.
3. Reduce Complexity:
Reduce the complexity users face when generating new reports.
4. Clean Data:
Create a repository of clean data that does not require wholesale changes to
the transactional systems or business processes.
5. Multiple Source Analysis:
Allow easier reporting across multiple transactional systems and external data
sources.
6. Historic Analysis:
Provide a data source supporting a longer span of time than can be
reasonably supported on the transactional systems.


What is Fact & Dimension ?


Fact:
- Contains keys to dimensions, and measures
- Measures are typically described as the performance measures of the business
- Usually numerical: counts, currency amounts, percentages, or ratios
- Contains measures at the lowest level of granularity

Dimension:
- Contains descriptive information about the business
- Dimensions contain current and historical information
- Dimensions may be hierarchical in nature, e.g. Time dimension:
  Year > Quarter > Month > Week > Day

Relationship - Fact & Dimensions

Star Schema
- A database design that stores a central fact table surrounded by multiple denormalized dimension tables
- Star schema uses all denormalized dimensions
- All the dimensions are directly related to the fact table


Star Schema - Cont


Dimension Tables

Region_Dimension_Table
region_key   region_doc
10           Northeast
11           Northwest
12           Southeast
13           Southwest

Product_Dimension_Table
prod_grp_key   prod_key   prod_grp_desc    prod_desc
10             100        Fewer devices    Power supply
20             140        Circuit boards   Motherboard
30             220        Components       Co-processor

Account_Dimension_Table
account_key   account_doc
100000        ABC Electronics
110000        Midway Electric
120000        Victor Components
130000        Washburn, Inc.
140000        Zerox

Vendor_Dimension_Table
vend_key   vendor_desc
100        PowerAge, Inc.
200        Advanced Micro Devices
300        Farad Incorporated

Time_Dimension_Table
Time_key   month     month_name
1          01-1996   January
2          02-1996   February
3          03-1996   March

Fact Table

Sales Fact Table
Time_key   prod_key   region_key   Account_key   vend_key   net-sales   gross_sales
1          100        10           100000        100        30,000      50,000
2          140        11           110000        200        23,000      42,000
3          220        12           100000        300        32,000      49,000

Star Schema - Cont


TIME (dimension)
time_key, day, day_of_the_week, month, quarter, year

PRODUCT (dimension)
product_key, product_name, category, brand, color, supplier_name

LOCATION (dimension)
location_key, store, street_address, city, state, country, region

SALES (fact)
Keys: time_key, product_key, location_key
Measures: units_sold, amount

Advantages of Star Schema


- Dimension tables are relatively static; new data is loaded mainly into the fact table(s)
- Easy to write queries
- Simplifies joins
- Contains few tables

Example: find total sales per product category in our stores in Europe

SELECT PRODUCT.category, SUM(SALES.amount) AS total_sales
FROM SALES
JOIN PRODUCT ON SALES.product_key = PRODUCT.product_key
JOIN LOCATION ON SALES.location_key = LOCATION.location_key
WHERE LOCATION.region = 'Europe'
GROUP BY PRODUCT.category;
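
To try the query, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names follow the slide; the sample rows (stores, products, amounts) are made up for illustration, and the string literal 'Europe' is quoted as SQL requires.

```python
import sqlite3

# Hypothetical sample rows; only the table and column names come from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PRODUCT  (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE LOCATION (location_key INTEGER PRIMARY KEY, store TEXT, region TEXT);
CREATE TABLE SALES    (product_key INTEGER, location_key INTEGER,
                       units_sold INTEGER, amount REAL);

INSERT INTO PRODUCT  VALUES (1, 'Laptop', 'Computers'), (2, 'Phone', 'Mobile');
INSERT INTO LOCATION VALUES (10, 'Paris store', 'Europe'),
                            (20, 'Boston store', 'North America');
INSERT INTO SALES    VALUES (1, 10, 2, 2000.0), (2, 10, 1, 500.0),
                            (1, 20, 1, 1000.0), (2, 10, 3, 1500.0);
""")

# The fact table joins directly to each dimension it needs: one join per dimension.
rows = conn.execute("""
    SELECT PRODUCT.category, SUM(SALES.amount) AS total_sales
    FROM SALES
    JOIN PRODUCT  ON SALES.product_key  = PRODUCT.product_key
    JOIN LOCATION ON SALES.location_key = LOCATION.location_key
    WHERE LOCATION.region = 'Europe'
    GROUP BY PRODUCT.category
    ORDER BY PRODUCT.category
""").fetchall()

print(rows)
```

The Boston sale is filtered out by the region predicate, so only the two European rows per category contribute to the totals.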


Star Schema Query Processing


[Figure: the TIME, PRODUCT, SALES, and LOCATION tables of the star schema, annotated with a selection on LOCATION (region = Europe) and a projection on PRODUCT (category).]

Snowflake Schema


A database design that stores a central fact table surrounded by
multiple normalized dimension tables

Advantages:
Storage space can be minimized

Disadvantages:
Can hamper query performance because of the larger number of
joins


Galaxy Schema

A schema in which multiple fact tables share dimension tables; it can be viewed as a collection of star schemas.

Also known as Integrated schema or Constellation schema.

- Fact Constellation
Two or more fact tables related to each other through shared dimensions


Types of Dimensions
CONFORMED DIMENSIONS
A dimension that can be shared by multiple fact tables (e.g. customer dimension)

DEGENERATE DIMENSIONS
A dimension key stored in the fact table that is not connected to any dimension table (e.g. Transaction Id)

JUNK DIMENSIONS
A dimension holding text descriptions, flags, and Booleans that are not used in
describing the key performance indicators (e.g. gender description,
product description)


Types of Facts

ADDITIVE FACTS
Facts that can be summed across all of the dimensions in the fact table
(e.g. dollars sold)
SEMI-ADDITIVE FACTS
Facts that can be summed across some of the dimensions in the fact table but not others
(e.g. inventory levels cannot be added across time)
NON-ADDITIVE FACTS
Facts that cannot be summed across any of the dimensions in the fact table
(e.g. ratios and percentages, or a true textual fact, which probably should not be in the
data warehouse to begin with)
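
A small sketch of why inventory levels are semi-additive. The snapshot rows (days, stores, unit counts) are invented for illustration.

```python
# Hypothetical inventory snapshots: (day, store, units_on_hand).
snapshots = [
    ("Mon", "Store A", 10), ("Mon", "Store B", 5),
    ("Tue", "Store A", 8),  ("Tue", "Store B", 7),
]

# Summing across the store dimension for one day is meaningful:
# it is the total stock on hand on Monday.
monday_total = sum(units for day, _, units in snapshots if day == "Mon")

# Summing across time is not meaningful: Store A never held this many units.
store_a_sum_over_time = sum(units for _, store, units in snapshots if store == "Store A")

# Across time, use an average or the latest snapshot instead.
store_a_levels = [units for _, store, units in snapshots if store == "Store A"]
store_a_average = sum(store_a_levels) / len(store_a_levels)
store_a_latest = store_a_levels[-1]

print(monday_total, store_a_sum_over_time, store_a_average, store_a_latest)
```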


Dimensional Modeling
Dimensional modeling consists of the following phases to build the DW.

i. Conceptual Modeling
- Understand the business requirements
- Identify the entities (tables)
- Identify the attributes (columns) for each entity
- Identify the relationships between the entities (PK-FK)

ii. Logical Modeling
- Design the dimension tables
- Design the fact table
- Create relationships between the dimension tables and the fact table

iii. Physical Modeling
- Create the tables physically in the database to load the data
Note: First load the data into all the dimension tables, then load the fact table.
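
The load-order note follows from the foreign keys: a fact row cannot reference a dimension row that does not exist yet. A minimal sketch with Python's sqlite3 module, using hypothetical table names (product_dim, sales_fact) and a helper (load_fact) written for this example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE sales_fact (
    product_key INTEGER NOT NULL REFERENCES product_dim(product_key),
    amount REAL
);
""")

def load_fact(product_key, amount):
    """Insert one fact row; return False if its dimension row is missing."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO sales_fact VALUES (?, ?)", (product_key, amount))
        return True
    except sqlite3.IntegrityError:
        return False

# Loading the fact first fails because product_key 100 has no dimension row yet.
fact_before_dim = load_fact(100, 30000.0)

# Load the dimension, then the same fact row is accepted.
with conn:
    conn.execute("INSERT INTO product_dim VALUES (100, 'Power supply')")
fact_after_dim = load_fact(100, 30000.0)

print(fact_before_dim, fact_after_dim)
```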


Slowly Changing Dimensions


- Types of Dimensions:
i) Slowly Changing Dimensions (SCD)
E.g. employee address
ii) Continuously Changing Dimensions (CCD)
E.g. policyholder's age
iii) Rapidly Changing Dimensions (RCD)
E.g. product version


Slowly Changing Dimensions - Cont


A slowly changing dimension contains attributes whose values occasionally change over time
(e.g. employee's address).
Three options for handling slowly changing dimensions:
Type 1: Overwrite the dimension record with the new values (thereby losing history).
Used when history does not matter (e.g. correcting errors).
Type 2: Create an additional dimension record for the new value, with a new
surrogate key.
Used when history is important.
Type 3: Add an "old value" field to the dimension record to store the immediately
previous attribute value.
Used when limited history is important.
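
The three options can be sketched in plain Python over a list of dimension rows. The column names (emp_key, emp_id, address, previous_address) and the is_current flag are assumptions made for this example; the slide does not prescribe them. The counter stands in for a sequence generator of surrogate keys.

```python
from itertools import count

next_key = count(1)  # stands in for a sequence generator of surrogate keys

def scd_type1(rows, emp_id, new_address):
    """Type 1: overwrite in place; the old address is lost."""
    for row in rows:
        if row["emp_id"] == emp_id:
            row["address"] = new_address

def scd_type2(rows, emp_id, new_address):
    """Type 2: expire the current row and add a new row with a new surrogate key."""
    for row in rows:
        if row["emp_id"] == emp_id and row["is_current"]:
            row["is_current"] = False
    rows.append({"emp_key": next(next_key), "emp_id": emp_id,
                 "address": new_address, "previous_address": None,
                 "is_current": True})

def scd_type3(rows, emp_id, new_address):
    """Type 3: keep only the immediately previous value in an extra column."""
    for row in rows:
        if row["emp_id"] == emp_id:
            row["previous_address"] = row["address"]
            row["address"] = new_address

def new_dimension():
    """One employee row as the starting state for each demonstration."""
    return [{"emp_key": next(next_key), "emp_id": "E1", "address": "Hyderabad",
             "previous_address": None, "is_current": True}]

t1 = new_dimension()
scd_type1(t1, "E1", "Bangalore")

t2 = new_dimension()
scd_type2(t2, "E1", "Bangalore")

t3 = new_dimension()
scd_type3(t3, "E1", "Bangalore")
```

After the calls, t1 still has one row with only the new address, t2 has two rows with different surrogate keys for the same employee, and t3 has one row holding both the current and the previous address.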


SURROGATE KEYS
A system-generated key used to uniquely identify a record in a fact or
dimension table.
This key is generated by a sequence generator and is not derived
from the OLTP system.
A 4-byte integer is a good choice to improve join performance.


INDEXING
An index is a pointer structure that locates the physical address of data.
Indexing can be used to increase the performance and scalability of the data
warehouse solution.
Using an index replaces a full-table scan with an index lookup, followed by a read of only those
disk blocks that contain the rows needed.
It improves performance when retrieving or manipulating data using
the indexed column in the WHERE clause.
Types of Indexes:
i) B-Tree Index
ii) Bitmap Index
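
One way to see the full-table scan being replaced is to ask the database for its query plan before and after creating an index. The sketch below uses SQLite (whose ordinary indexes are B-trees) through Python's sqlite3 module; the table, index name, and data are hypothetical, and the exact plan wording differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, product_key INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (product_key, amount) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(10000)],
)

def query_plan(sql):
    """Return SQLite's plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the description

query = "SELECT * FROM sales WHERE product_key = 42"

plan_without_index = query_plan(query)   # reports a scan of the whole table
conn.execute("CREATE INDEX idx_sales_product ON sales (product_key)")
plan_with_index = query_plan(query)      # reports a search using idx_sales_product

print(plan_without_index)
print(plan_with_index)
```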


What is Cardinality ?

Cardinality is defined as the number of distinct values expressed as a percentage of
the number of rows in a table. For example, a million-row table with five distinct
values has a low cardinality, while a 100-row table with 80 distinct values has a high
cardinality.
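
The two examples worked out under this definition, as a short sketch (the function name cardinality_pct is made up for illustration):

```python
def cardinality_pct(values):
    """Distinct values as a percentage of the number of rows."""
    return 100.0 * len(set(values)) / len(values)

# A million rows with five distinct values: low cardinality.
low = cardinality_pct([i % 5 for i in range(1_000_000)])

# A hundred rows with 80 distinct values: high cardinality.
high = cardinality_pct([i % 80 for i in range(100)])

print(low, high)
```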


B-Tree Index
The most common type of indexing is the B-tree index. This type of
indexing is often used for high-cardinality columns such as product key or
customer key. B-tree indexes are designed to return few rows.


Bitmap Index
This index is used for low-cardinality columns. When a bitmap index is
created on a column, a bit stream is created for each distinct value in the
indexed column. A bit stream is composed of ones and zeros.
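
A toy illustration of the idea, not how any particular database stores its bitmaps: each distinct value gets one bit per row, and a filter on two columns becomes a bitwise AND. The status column and its values are hypothetical.

```python
def build_bitmap_index(column):
    """Map each distinct value to a list of 0/1 bits, one bit per row."""
    return {value: [1 if v == value else 0 for v in column]
            for value in sorted(set(column))}

region = ["Northeast", "Southwest", "Northeast", "Southeast", "Southwest"]
status = ["open", "open", "closed", "open", "closed"]  # hypothetical second column

region_index = build_bitmap_index(region)
status_index = build_bitmap_index(status)

# WHERE region = 'Southwest' AND status = 'open' becomes a bitwise AND.
matches = [a & b for a, b in zip(region_index["Southwest"], status_index["open"])]
matching_rows = [i for i, bit in enumerate(matches) if bit]

print(region_index["Southwest"], status_index["open"], matching_rows)
```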

B-Tree Vs Bitmap

Partitioning
Partitioning enables you to divide tables into smaller units that are more
manageable.
This feature addresses the problem of supporting large tables and indexes
that are inherent to data warehouses


Partition Types
Data partitioning can be divided into two broad categories: Horizontal
Partitioning and Vertical Partitioning


Horizontal Partitioning
Horizontal partitioning is commonly used in data warehouse environments
because it enables the data in a very large table to be stored in smaller tables. It gives
the DBA control over the rows that go into each table.


Vertical Partitioning
Vertical Partitioning divides tables on a column-by-column basis


Range Partitioning
Range partitioning lets users specify the range of values for each
partition. The rows may not be evenly distributed across the partitions.


Hash Partitioning
Hash partitioning gives control to the system, which distributes rows
evenly across the partitions.


Composite Partitioning
Composite partitioning is the combination of
both range and hash partitioning.
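
The three schemes can be sketched as functions that decide which partition a row belongs to. This is an illustration of the routing logic only, not a database's partitioning syntax; the partition names, boundaries, and the choice of CRC32 as the hash are assumptions.

```python
import zlib

def range_partition(sale_year, boundaries):
    """Return the first partition whose exclusive upper bound exceeds the year."""
    for name, upper in boundaries:
        if sale_year < upper:
            return name
    raise ValueError(f"no partition for {sale_year}")

def hash_partition(key, n_partitions):
    """Let a deterministic hash pick the partition number for a key."""
    return zlib.crc32(str(key).encode()) % n_partitions

def composite_partition(sale_year, key, boundaries, n_subpartitions):
    """Range partition first, then hash within the chosen range."""
    return (range_partition(sale_year, boundaries),
            hash_partition(key, n_subpartitions))

# Hypothetical yearly ranges; the last one catches everything later.
boundaries = [("p1996", 1997), ("p1997", 1998), ("p_max", 10**9)]

print(range_partition(1996, boundaries))
print(hash_partition(100000, 4))
print(composite_partition(1997, 100000, boundaries, 4))
```

With range partitioning the user picks the boundaries, so a busy year yields a large partition; with hashing the system picks the placement, which tends to even out partition sizes.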


Data Acquisition
It is the process of extracting the relevant business information,
transforming the data into the required business format, and loading
it into the data warehouse.
i) Data Extraction
ii) Data Transformation
iii) Data Loading


Data Acquisition - Cont


Data Extraction: It is the process of extracting data from various
types of source systems.
i) Relational sources: Oracle, SQL Server, Teradata, DB2
ii) ERP sources (Enterprise Resource Planning): SAP R/3,
PeopleSoft, and Siebel
iii) Legacy sources: mainframes
iv) File sources: flat files, XML files, COBOL files, and Excel files
v) Other sources: MS Access, web log files


Data Acquisition - Cont


Data Transformation: It is the process of transforming data from
one format to the required format.
i) Data cleansing: the process of resolving inconsistencies
in the data and removing unwanted data.
ii) Data scrubbing: the process of deriving new
information or definitions from existing information.
iii) Data merging: the process of combining multiple input
flows into a single output flow.
iv) Data aggregation: the process of converting detailed
information into summarized information by grouping it.
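
The four transformation steps can be sketched as small Python functions. The source rows, the is_large_sale attribute, and its 30000 threshold are invented for illustration; a real pipeline would take its rules from the business requirements.

```python
from collections import defaultdict

# Hypothetical extracts from two source systems.
source_a = [{"region": " northeast ", "amount": "30000"},
            {"region": "Southwest", "amount": None}]
source_b = [{"region": "NORTHEAST", "amount": "23000"},
            {"region": "southwest", "amount": "32000"}]

def cleanse(rows):
    """Cleansing: make region names consistent and drop rows without an amount."""
    return [{"region": r["region"].strip().title(), "amount": float(r["amount"])}
            for r in rows if r["amount"] is not None]

def scrub(rows):
    """Scrubbing: derive a new attribute from existing ones."""
    return [{**r, "is_large_sale": r["amount"] >= 30000} for r in rows]

def merge(*flows):
    """Merging: combine several input flows into one output flow."""
    return [row for flow in flows for row in flow]

def aggregate(rows):
    """Aggregation: summarize detail rows by region."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["region"]] += r["amount"]
    return dict(totals)

merged = merge(scrub(cleanse(source_a)), scrub(cleanse(source_b)))
totals = aggregate(merged)

print(merged)
print(totals)
```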


Data Acquisition - Cont


Types of ETL:
i) Code-based ETL: design ETL applications using
programming languages.
E.g. SQL, PL/SQL, Teradata utilities
ii) GUI-based ETL: design ETL applications using a simple
graphical user interface.
E.g. Informatica, DataStage, Ab Initio, Oracle Warehouse Builder


ETL Tools
Informatica
Data Integrator
Ardent Data Stage

MS SQL-Server DTS
Ab Initio
Data Junction


Reporting Tools
Business Objects
Cognos
Brio

Hyperion
Seagate
Eureka Strategy

MicroStrategy


ETL Steps
Extract
Extract source system data to populate interim stage

Transform
Apply the business logic
Process to find New Dimension records
Add Changed/New Dimension record to staging area with new key

Load
Load directly from staging area into DW


DW Extract Transform Load (ETL)


[Figure: data is extracted from ERP/CRM/eCRM source systems, passed through ETL (Extract, Transform, Load), and delivered for QR&A.]

Incremental Vs Full Load


Full load: load all records every refresh cycle.
Incremental load: load only the records that are new or updated since the last
refresh.
An incremental load requires a date/time stamp in the source systems.
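
A minimal sketch of the difference, assuming each source row carries an updated_at timestamp (a hypothetical column name) and that the warehouse is keyed by id:

```python
from datetime import datetime

# Hypothetical source rows; updated_at stands in for the source system's timestamp.
source_rows = [
    {"id": 1, "amount": 100, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "amount": 200, "updated_at": datetime(2024, 1, 5)},
    {"id": 3, "amount": 300, "updated_at": datetime(2024, 1, 9)},
]

def full_load(source):
    """Reload every record on every refresh cycle."""
    return {row["id"]: row for row in source}

def incremental_load(warehouse, source, last_refresh):
    """Apply only rows created or updated after the last refresh."""
    changed = [row for row in source if row["updated_at"] > last_refresh]
    for row in changed:
        warehouse[row["id"]] = row  # insert new rows, overwrite updated ones
    return len(changed)

# The first refresh saw rows 1 and 2; the next one picks up only row 3.
warehouse = full_load(source_rows[:2])
loaded = incremental_load(warehouse, source_rows, last_refresh=datetime(2024, 1, 5))

print(loaded, sorted(warehouse))
```

Without the timestamp there is no cheap way to tell which rows changed, which is why the incremental approach depends on the source systems recording it.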


The End

Thank you!
