Introduction To Data Warehousing: Pragim Technologies

Roadmap
Introduction to Data Warehousing

OLTP Vs OLAP
Data Warehouse Cycle
Data Warehouse Architecture
Dimensional Modeling and Design
Types of Dimensions & Facts
Slowly Changing Dimensions
Data warehouse design
Indexing & Partitioning
What is a Data Warehouse?

A data warehouse is a relational database that is specifically designed for analyzing the business and making decisions effectively and in time, not for business transaction processing. (Ralph Kimball)

A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process. (W. H. Inmon)
OLTP Vs OLAP
Transaction Systems (OLTP)                    Data Warehouse (OLAP)
Support the business transactional process    Supports the decision-making process
Designed to run the business                  Designed to analyze the business
Detailed data                                 Summarized data
No redundancy (normalized)                    Allows redundancy (denormalized)
Data is normally updated                      Data is normally loaded
Current data                                  Historical data
Few indexes                                   More indexes
Support the E-R model                         Supports the dimensional model
Data Warehousing Cycle

Transaction-based process: day-to-day operations; on-line, real time; insert/update/delete; detailed information for operational systems.
Extract & Transform, then batch Load & Summarise, move the data from the transaction systems into the warehouse.
Warehouse-based process: decision support for management use.
Data Warehousing Architecture - Simple

[Figure: Operational Data and External Data feed the Warehouse through ETL; Information Delivery from the Warehouse supports Report, Query, and Analyze.]
Data Warehousing Architecture - Typical

[Figure: Data Sources (flat files and RDBMS data) are extracted into a Staging Area / ODS for integration, cleansing, and intermediate calculation. ETL (extraction, transformation, loading, scheduling, refreshing) populates the Data Warehouse, which is summarized, de-normalized, historical, nonvolatile, and subject oriented. Data Marts are extracted from the warehouse. Managers and power users Report, Query, and Analyze the data, with user-friendly report creation and web publishing.]
Why a Data Warehouse?

Why are companies building DWs?

1. Reporting pressures:
Relieve reporting pressure on transactional databases.
2. Restructure data:
Restructure data to speed up data analysis and reporting.
3. Reduce complexity:
Reduce the complexity users face when generating new reports.
4. Clean data:
Create a repository of clean data that does not require wholesale changes to the transactional systems or business processes.
5. Multiple source analysis:
Allow easier reporting across multiple transactional systems and external data sources.
6. Historic analysis:
Provide a data source supporting a longer span of time than can reasonably be supported on the transactional systems.
What is Fact & Dimension?

Fact:
- Contains keys to dimensions, and measures.
- Measures are typically described as the performance measures of the business.
- Measures are usually numerical: counts, currency amounts, percentages, or ratios.
- Contains measures (facts) at the lowest level of granularity.
Dimension:
- Contains descriptive information about the business.
- Dimensions contain current and historical information.
- Dimensions may be hierarchical in nature, e.g., a time dimension: year, quarter, month, week, day.
Relationship - Fact & Dimensions
Star Schema
- A database design that stores a central fact table surrounded by multiple denormalized dimension tables.
- A star schema uses only denormalized dimensions.
- All the dimensions are directly related to the fact table.
Star Schema - Cont

Dimension Tables

Region_Dimension_Table
region_key  region_desc
10          Northeast
11          Northwest
12          Southeast
13          Southwest

Product_Dimension_Table
prod_grp_key  prod_key  prod_grp_desc   prod_desc
10            100       Fewer devices   Power supply
20            140       Circuit boards  Motherboard
30            220       Components      Co-processor

Account_Dimension_Table
account_key  account_desc
100000       ABC Electronics
110000       Midway Electric
120000       Victor Components
130000       Washburn, Inc.
140000       Zerox

Vendor_Dimension_Table
vend_key  vendor_desc
100       PowerAge, Inc.
200       Advanced Micro Devices
300       Farad Incorporated

Time_Dimension_Table
time_key  month    month_name
1         01-1996  January
2         02-1996  February
3         03-1996  March

Fact Table

Sales Fact Table
time_key  prod_key  region_key  account_key  vend_key  net_sales  gross_sales
1         100       10          100000       100       30,000     50,000
2         140       11          110000       200       23,000     42,000
3         220       12          100000       300       32,000     49,000
Star Schema - Cont

SALES (fact table)
- Keys: time_key, product_key, location_key
- Measures: units_sold, amount
TIME
- time_key, day, day_of_the_week, month, quarter, year
PRODUCT
- product_key, product_name, category, brand, color, supplier_name
LOCATION
- location_key, store, street_address, city, state, country, region
Advantages of Star Schema

- Dimension tables are relatively static; data is loaded into the fact table(s).
- Queries are easy to write.
- Joins are simple.
- It contains few tables.

Example: find total sales per product category in our stores in Europe.

SELECT PRODUCT.category, SUM(SALES.amount)
FROM SALES, PRODUCT, LOCATION
WHERE SALES.product_key = PRODUCT.product_key
  AND SALES.location_key = LOCATION.location_key
  AND LOCATION.region = 'Europe'
GROUP BY PRODUCT.category;
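The query can be tried end to end with Python's built-in sqlite3 module. This is a minimal sketch: the table and column names come from the slides, but the sample rows, the cut-down column lists, and the helper name sales_per_category are assumptions made for illustration.

```python
import sqlite3

# In-memory database holding a cut-down star schema.
# The sample rows are invented for illustration only.
star_conn = sqlite3.connect(":memory:")
star_conn.executescript("""
CREATE TABLE PRODUCT  (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE LOCATION (location_key INTEGER PRIMARY KEY, store TEXT, region TEXT);
CREATE TABLE SALES    (product_key  INTEGER REFERENCES PRODUCT(product_key),
                       location_key INTEGER REFERENCES LOCATION(location_key),
                       units_sold INTEGER, amount REAL);
INSERT INTO PRODUCT  VALUES (1, 'Laptop', 'Computers'), (2, 'Phone', 'Mobile');
INSERT INTO LOCATION VALUES (10, 'Berlin store', 'Europe'), (20, 'Boston store', 'America');
INSERT INTO SALES    VALUES (1, 10, 2, 2000.0), (2, 10, 1, 500.0),
                            (1, 20, 1, 1000.0), (2, 10, 3, 1500.0);
""")


def sales_per_category(region):
    """Total sales amount per product category for stores in one region."""
    query = """
        SELECT PRODUCT.category, SUM(SALES.amount)
        FROM SALES
        JOIN PRODUCT  ON SALES.product_key  = PRODUCT.product_key
        JOIN LOCATION ON SALES.location_key = LOCATION.location_key
        WHERE LOCATION.region = ?
        GROUP BY PRODUCT.category
        ORDER BY PRODUCT.category
    """
    return star_conn.execute(query, (region,)).fetchall()


europe_totals = sales_per_category("Europe")
print(europe_totals)
```

The explicit JOIN ... ON form is equivalent to the comma-separated FROM list with join conditions in the WHERE clause; each dimension needs exactly one join to the fact table.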
Star Schema Query Processing

[Figure: the TIME, PRODUCT, SALES, and LOCATION star schema, annotated with the selection σ region = 'Europe' on LOCATION and the projection π category on PRODUCT.]
Snowflake Schema

A database design that stores a central fact table surrounded by multiple normalized dimension tables.
Advantages:
- Storage space can be minimized.
Disadvantages:
- Query performance can suffer because more joins are needed.
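The extra join cost can be seen in a small sketch, again using Python's sqlite3 module. Here the product dimension is normalized into PRODUCT and a separate CATEGORY table; the CATEGORY table, its category_key column, and the sample rows are assumptions made for illustration.

```python
import sqlite3

# Snowflake variant: the category attribute is moved out of PRODUCT
# into its own CATEGORY table (hypothetical names and data).
snow_conn = sqlite3.connect(":memory:")
snow_conn.executescript("""
CREATE TABLE CATEGORY (category_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE PRODUCT  (product_key INTEGER PRIMARY KEY, product_name TEXT,
                       category_key INTEGER REFERENCES CATEGORY(category_key));
CREATE TABLE SALES    (product_key INTEGER REFERENCES PRODUCT(product_key), amount REAL);
INSERT INTO CATEGORY VALUES (1, 'Computers'), (2, 'Mobile');
INSERT INTO PRODUCT  VALUES (100, 'Laptop', 1), (101, 'Desktop', 1), (200, 'Phone', 2);
INSERT INTO SALES    VALUES (100, 1000.0), (101, 800.0), (200, 500.0), (200, 300.0);
""")

# Two joins are needed to reach the category name; a star schema,
# which keeps category inside PRODUCT, would need only one.
category_totals = snow_conn.execute("""
    SELECT CATEGORY.category, SUM(SALES.amount)
    FROM SALES
    JOIN PRODUCT  ON SALES.product_key    = PRODUCT.product_key
    JOIN CATEGORY ON PRODUCT.category_key = CATEGORY.category_key
    GROUP BY CATEGORY.category
    ORDER BY CATEGORY.category
""").fetchall()
print(category_totals)
```

The category text is stored once per category instead of once per product, which is where the space saving comes from.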
Galaxy Schema
A galaxy schema contains multiple fact tables that share dimension tables; its dimensions can combine the star schema and snowflake schema styles.
Also known as an integrated schema or constellation schema.
- Fact constellation: the process of joining two fact tables.
Types of Dimensions
CONFORMED DIMENSIONS
A dimension that can be shared by multiple fact tables (e.g., a customer dimension).
DEGENERATE DIMENSIONS
A dimension key stored in the fact table that is not connected to any dimension table (e.g., transaction ID).
JUNK DIMENSIONS
A dimension that holds text descriptions, flags, and Booleans that are not used in describing the key performance indicators (e.g., gender description, product description).
Types of Facts
ADDITIVE FACTS
Facts that can be summed across all of the dimensions in the fact table (e.g., dollars sold).
SEMI-ADDITIVE FACTS
Facts that can be summed across some of the dimensions in the fact table but not others (e.g., inventory levels cannot be added across time).
NON-ADDITIVE FACTS
Facts that cannot be summed across any of the dimensions in the fact table (e.g., a true textual fact, which probably should not be in the data warehouse to begin with).
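The inventory example can be made concrete with a short sketch. The snapshot rows and the function names are hypothetical; the point is only that units on hand may be summed across products for one day, while across days an average (or the latest value) is the meaningful figure.

```python
from collections import defaultdict

# Hypothetical daily inventory snapshots: (day, product, units_on_hand).
snapshots = [
    ("2024-01-01", "A", 10), ("2024-01-01", "B", 5),
    ("2024-01-02", "A", 8),  ("2024-01-02", "B", 7),
]


def total_per_day(rows):
    """Summing across the product dimension is valid for inventory levels."""
    totals = defaultdict(int)
    for day, _product, units in rows:
        totals[day] += units
    return dict(totals)


def average_over_time(rows):
    """Across the time dimension, average the daily totals instead of summing them."""
    per_day = total_per_day(rows)
    return sum(per_day.values()) / len(per_day)


print(total_per_day(snapshots))
print(average_over_time(snapshots))
```

Summing the two daily totals would suggest 30 units in stock, although at no point were more than 15 units on hand.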
Dimensional Modeling
Dimensional modeling consists of the following phases to build the DW.
i. Conceptual Modeling
- Understand the business requirements.
- Identify the entities (tables).
- Identify the attributes (columns) for each entity.
- Identify the relationships between the entities (PK/FK).
ii. Logical Modeling
- Design the dimension tables.
- Design the fact table.
- Create the relationships between the dimension tables and the fact table.
iii. Physical Modeling
- Create the tables physically in the database to load the data.
Note: Load the data into all the dimension tables first, and into the fact table afterwards.
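The load-order note follows from the PK/FK relationships: a fact row refers to dimension rows that must already exist. A minimal sketch with Python's sqlite3 module shows this; the product_dim and sales_fact tables and the helper try_insert_fact are hypothetical, and SQLite only enforces foreign keys after the PRAGMA shown.

```python
import sqlite3

load_conn = sqlite3.connect(":memory:")
# SQLite leaves foreign-key enforcement off unless it is enabled per connection.
load_conn.execute("PRAGMA foreign_keys = ON")
load_conn.execute(
    "CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT)"
)
load_conn.execute(
    "CREATE TABLE sales_fact ("
    " product_key INTEGER NOT NULL REFERENCES product_dim(product_key),"
    " amount REAL)"
)


def try_insert_fact(product_key, amount):
    """Return True if the fact row was accepted, False if the foreign key check rejected it."""
    try:
        load_conn.execute("INSERT INTO sales_fact VALUES (?, ?)", (product_key, amount))
        return True
    except sqlite3.IntegrityError:
        return False


# Loading the fact first fails because its dimension row does not exist yet.
fact_before_dimension = try_insert_fact(1, 9.99)
load_conn.execute("INSERT INTO product_dim VALUES (1, 'Widget')")
# After the dimension row is loaded, the same fact row is accepted.
fact_after_dimension = try_insert_fact(1, 9.99)
print(fact_before_dimension, fact_after_dimension)
```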
Slowly Changing Dimensions

Dimensions can be classified by how often their attribute values change:
i) Slowly Changing Dimensions (SCD), e.g., employee address
ii) Continuously Changing Dimensions (CCD), e.g., policyholder's age
iii) Rapidly Changing Dimensions (RCD), e.g., product version
Slowly Changing Dimensions - Cont

A slowly changing dimension contains attributes whose values occasionally change over time (e.g., an employee's address).
There are three options for handling slowly changing dimensions:
Type 1: Overwrite the dimension record with the new values (thereby losing history). Used when history does not matter (e.g., correcting errors).
Type 2: Create an additional dimension record with a new surrogate key value. Used when history is important.
Type 3: Add an "old value" field to the dimension record to store the immediately previous attribute value. Used when only limited history is important.
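The three options can be sketched on an employee dimension held as Python dictionaries. The field names (employee_key, employee_id, address, previous_address) are hypothetical, and the start_date/end_date columns used for Type 2 are a common convention assumed here, not something the slides specify.

```python
from datetime import date


def scd_type1(row, new_address):
    """Type 1: overwrite the attribute; the old value is lost."""
    return {**row, "address": new_address}


def scd_type2(rows, employee_id, new_address, change_date, next_key):
    """Type 2: close the current record and add a new one under a new surrogate key."""
    result = []
    for row in rows:
        if row["employee_id"] == employee_id and row["end_date"] is None:
            row = {**row, "end_date": change_date}  # expire the current version
        result.append(row)
    result.append({
        "employee_key": next_key,       # new surrogate key
        "employee_id": employee_id,     # same natural key
        "address": new_address,
        "start_date": change_date,
        "end_date": None,
    })
    return result


def scd_type3(row, new_address):
    """Type 3: keep only the immediately previous value in an extra field."""
    return {**row, "previous_address": row["address"], "address": new_address}


current = {"employee_key": 1, "employee_id": "E1", "address": "Old St",
           "start_date": date(2020, 1, 1), "end_date": None}
history = scd_type2([current], "E1", "New Ave", date(2024, 6, 1), next_key=2)
print(history)
```

Type 2 is the only option that lets a fact row keep pointing at the address that was valid when the fact occurred, because each version has its own surrogate key.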
SURROGATE KEYS
A surrogate key is a system-generated key used to uniquely identify a record in a fact or dimension table.
The key is generated by a sequence generator and is not derived from the OLTP system.
A 4-byte integer is a good choice because it improves join performance.
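A sequence generator of this kind can be sketched in a few lines. The class name SurrogateKeyGenerator and the sample natural keys are hypothetical; a real warehouse would normally rely on a database sequence or identity column.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Hands out sequential integer keys, one per distinct natural (source) key."""

    def __init__(self, start=1):
        self._sequence = count(start)
        self._keys = {}

    def key_for(self, natural_key):
        # Reuse the existing surrogate key if this source record was seen before.
        if natural_key not in self._keys:
            self._keys[natural_key] = next(self._sequence)
        return self._keys[natural_key]


generator = SurrogateKeyGenerator()
first = generator.key_for("CUST-ABC")
second = generator.key_for("CUST-XYZ")
repeat = generator.key_for("CUST-ABC")
print(first, second, repeat)
```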
INDEXING
An index is a pointer structure that locates the physical address of data.
Indexing can be used to increase the performance and scalability of the data warehouse solution.
Using an index replaces a full-table scan with an index lookup followed by a read of only those disk blocks that contain the rows needed.
It improves performance when retrieving or manipulating data using the indexed column in a WHERE clause.
Types of Indexes:
i) B-Tree Index
ii) Bitmap Index
What is Cardinality?
Cardinality is defined as the number of distinct values expressed as a percentage of the number of rows in a table. For example, a million-row table with five distinct values has a low cardinality, while a 100-row table with 80 distinct values has a high cardinality.
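The definition translates directly into a small function. This is a sketch with a hypothetical function name and made-up column data chosen to mirror the two examples in the text.

```python
def cardinality_percent(values):
    """Distinct values as a percentage of the number of rows."""
    values = list(values)
    if not values:
        raise ValueError("cardinality is undefined for an empty column")
    return 100.0 * len(set(values)) / len(values)


# 100 rows with 80 distinct values: high cardinality.
high = cardinality_percent(list(range(80)) + [0] * 20)
# 1,000,000 rows with 5 distinct values: low cardinality.
low = cardinality_percent(i % 5 for i in range(1_000_000))
print(high, low)
```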
B-Tree Index
The most common type of indexing is the B-tree index. This type of
indexing is often used for high-cardinality columns such as product key or
customer key. B-tree indexes are designed to return few rows.
Bitmap Index
A bitmap index is used for low-cardinality columns. When a bitmap index is created on a column, a bit stream is created for each distinct value in the indexed column. A bit stream is composed of ones and zeros.
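The bit streams can be illustrated with plain Python lists. This is a simplified sketch with hypothetical function names and a made-up region column; real database engines store the bit streams compressed.

```python
def build_bitmap_index(column):
    """One bit stream per distinct value; bit i is 1 when row i holds that value."""
    index = {}
    for position, value in enumerate(column):
        bits = index.setdefault(value, [0] * len(column))
        bits[position] = 1
    return index


def rows_matching(index, value):
    """Row positions whose bit is set in the bit stream of the given value."""
    return [position for position, bit in enumerate(index.get(value, [])) if bit]


region_column = ["East", "West", "East", "North", "West"]
region_index = build_bitmap_index(region_column)
print(region_index)
print(rows_matching(region_index, "West"))
```

With only a few distinct values there are only a few bit streams, which is why this structure suits low-cardinality columns.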
B-Tree Vs Bitmap
Partitioning
Partitioning enables you to divide tables into smaller units that are more manageable.
This feature addresses the problem of supporting the large tables and indexes that are inherent to data warehouses.
Partition Types
Data partitioning can be divided into two broad categories: horizontal partitioning and vertical partitioning.
Horizontal Partitioning
Horizontal partitioning is commonly used in data warehouse environments because it enables the data in a very large table to be stored in smaller tables. It gives the DBA control over the rows that go into each table.
Vertical Partitioning
Vertical partitioning divides tables on a column-by-column basis.
Range Partitioning
Range partitioning allows users to specify the range of values for each partition. The rows may not be evenly distributed across the partitions.
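A range partition can be pictured as a lookup against user-chosen boundaries. The boundary years, partition names, and function name below are hypothetical and only illustrate the idea.

```python
from bisect import bisect_right

# Hypothetical boundaries chosen by the user: each value starts a new partition.
BOUNDS = [2022, 2023, 2024]
NAMES = ["p_before_2022", "p_2022", "p_2023", "p_2024_onward"]


def range_partition(year):
    """Name of the partition whose range contains the given year."""
    return NAMES[bisect_right(BOUNDS, year)]


sales_years = [2019, 2020, 2021, 2022, 2024, 2024, 2025]
placement = [range_partition(year) for year in sales_years]
print(placement)
```

In this sample no row lands in p_2023 while three land in p_before_2022, which shows how user-defined ranges can leave partitions unevenly filled.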
Hash Partitioning
Hash partitioning gives control to the system, which distributes the rows evenly across the partitions.
Composite Partitioning
Composite partitioning is a combination of range and hash partitioning.
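Both ideas can be sketched with a stable hash function. The sketch assumes CRC32 as the hash, four partitions, and one range partition per year for the composite case; the function names are hypothetical and databases use their own internal hash functions.

```python
import zlib


def hash_partition(key, partitions=4):
    """The system, not the user, picks the partition: a stable hash of the key modulo the partition count."""
    return zlib.crc32(str(key).encode("utf-8")) % partitions


def composite_partition(year, key, partitions=4):
    """Range partition on year first, then a hash sub-partition inside that range."""
    return (f"y{year}", hash_partition(key, partitions))


counts = [0, 0, 0, 0]
for customer_id in range(1000):
    counts[hash_partition(customer_id)] += 1
print(counts)
print(composite_partition(2023, "CUST-ABC"))
```

zlib.crc32 is used instead of Python's built-in hash() because string hashes from hash() change between interpreter runs, and a partition function has to be repeatable.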
Data Acquisition
Data acquisition is the process of extracting the relevant business information, transforming the data into the required business format, and loading it into the data warehouse.
i) Data Extraction
ii) Data Transformation
iii) Data Loading
Data Acquisition - Cont

Data Extraction: the process of extracting data from various types of source systems.
i) Relational sources: Oracle, SQL Server, Teradata, DB2
ii) ERP (Enterprise Resource Planning) sources: SAP R/3, PeopleSoft, and Siebel
iii) Legacy sources: mainframes
iv) File sources: flat files, XML files, COBOL files, and Excel (XL) files
v) Other sources: MS Access, web log files

Data Transformation: the process of transforming data from one format into the required format.
i) Data cleansing: converting inconsistent data into consistent data and removing unwanted data.
ii) Data scrubbing: deriving new information or definitions from existing information.
iii) Data merging: combining multiple input flows into a single output flow.
iv) Data aggregation: converting detailed information into summarized information by grouping it.
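Three of these transformations (cleansing, merging, and aggregation) can be sketched on small lists of dictionaries. The row layout with region and amount fields, the sample flows, and the function names are hypothetical.

```python
from collections import defaultdict


def cleanse(rows):
    """Cleansing: make region names consistent and drop rows without an amount."""
    cleaned = []
    for row in rows:
        if row.get("amount") is None:
            continue  # unwanted data
        cleaned.append({
            "region": row["region"].strip().title(),
            "amount": float(row["amount"]),
        })
    return cleaned


def merge(*flows):
    """Merging: combine several input flows into one output flow."""
    return [row for flow in flows for row in flow]


def aggregate(rows):
    """Aggregation: summarize detailed rows by grouping on region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


flow_a = [{"region": " europe ", "amount": "10"}, {"region": "Asia", "amount": None}]
flow_b = [{"region": "EUROPE", "amount": 5}, {"region": "asia", "amount": 2.5}]
summary = aggregate(cleanse(merge(flow_a, flow_b)))
print(summary)
```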

Types of ETL:
i) Code-based ETL: ETL applications are designed using programming languages, e.g., SQL/PL/SQL, Teradata utilities.
ii) GUI-based ETL: ETL applications are designed using a simple graphical user interface, e.g., Informatica, DataStage, Ab Initio, Oracle Warehouse Builder.
ETL Tools
Informatica
Data Integrator
Ardent Data Stage
MS SQL-Server DTS
Ab Initio
Data Junction
Reporting Tools
Business Objects
Cognos
Brio
Hyperion
Seagate
Eureka Strategy
MicroStrategy
ETL Steps
Extract
- Extract source system data to populate the interim staging area.
Transform
- Apply the business logic.
- Process the data to find new dimension records.
- Add changed/new dimension records to the staging area with a new key.
Load
- Load directly from the staging area into the DW.
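The transform and load steps for new dimension records can be sketched as follows. The customer_id natural key, the customer_key surrogate key, the sample rows, and the function names are hypothetical; changed records would additionally need one of the slowly changing dimension strategies.

```python
def stage_new_dimension_records(extracted, dimension):
    """Find extracted records not yet in the dimension and stage them under new keys."""
    known = {row["customer_id"] for row in dimension}
    next_key = max((row["customer_key"] for row in dimension), default=0) + 1
    staged = []
    for record in extracted:
        if record["customer_id"] in known:
            continue  # already present in the dimension
        staged.append({"customer_key": next_key, **record})
        known.add(record["customer_id"])
        next_key += 1
    return staged


def load_dimension(staged, dimension):
    """Load the staged records straight into the warehouse dimension."""
    return dimension + staged


customer_dim = [{"customer_key": 1, "customer_id": "C-100", "name": "ABC Electronics"}]
extracted_rows = [
    {"customer_id": "C-100", "name": "ABC Electronics"},
    {"customer_id": "C-200", "name": "Midway Electric"},
]
staging_area = stage_new_dimension_records(extracted_rows, customer_dim)
customer_dim_after = load_dimension(staging_area, customer_dim)
print(customer_dim_after)
```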
DW Extract Transform Load (ETL)

[Figure: data is extracted from ERP/CRM/eCRM source systems, transformed, and loaded by ETL into the DW, which serves QR&A.]
Incremental Vs Full Load

Full load: load all records every refresh cycle.
Incremental load: load only the records that are new or updated since the last refresh.
An incremental load requires a date/time stamp in the source systems.
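The difference between the two strategies comes down to a timestamp filter. This is a minimal sketch, assuming each source row carries a hypothetical updated_at timestamp and that the time of the last refresh is stored somewhere by the ETL job.

```python
from datetime import datetime


def full_load(source_rows):
    """Full load: take every source record on each refresh cycle."""
    return list(source_rows)


def incremental_load(source_rows, last_refresh):
    """Incremental load: take only rows created or updated after the last refresh."""
    return [row for row in source_rows if row["updated_at"] > last_refresh]


source = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 5)},
    {"id": 3, "updated_at": datetime(2024, 1, 9)},
]
last_refresh = datetime(2024, 1, 3)
changed_ids = [row["id"] for row in incremental_load(source, last_refresh)]
print(changed_ids)
```

Without a reliable timestamp in the source system there is nothing to filter on, which is why the incremental approach depends on it.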
The End
Thank you!