Anda di halaman 1dari 43

INTRODUCTION TO

DATA
WAREHOUSING
BY
QUONTRA SOLUTIONS

PHONE :(404)-900-9988
EMAIL

: INFO@QUONTRASOLUTIONS.COM

WEBSITE : WWW.QUONTRASOLUTIONS.COM

DATA WAREHOUSE

Maintain historic data


Analysis to get better understanding of business
Better Decision making
Definition: A data warehouse is a

subject-oriented

integrated

time-varying

non-volatile
collection of data that is used primarily in organizational
decision making.
-- Bill Inmon, Building the Data Warehouse
1996

SUBJECT ORIENTED

Data warehouse is organized around subjects such as


sales, product, customer.
It focuses on modeling and analysis of data for decision
makers.
Excludes data not useful in decision support process.

INTEGRATED

Data Warehouse is constructed by integrating multiple


heterogeneous sources.
Data Preprocessing are applied to ensure consistency.

RDBMS
Data Processing
Data Transformation
Legacy
System

Flat File

Data
Warehouse
Data Processing
Data Transformation

NON-VOLATILE

Mostly, data once recorded will not be updated.


Data warehouse requires two operations in data accessing
- Incremental loading of data
- Access of data

load

access

TIME VARIANT

Provides information from historical perspective e.g. past


5-10 years
Every key structure contains either implicitly or explicitly
an element of time

WHY DATA WAREHOUSE?

Problem Statement:
ABC Pvt Ltd is a company with branches at USA,
UK,CANADA,INDIA
The Sales Manager wants quarterly sales report across
the branches.
Each branch has a separate operational system where
sales transactions are recorded.

WHY DATA WAREHOUSE?


USA

UK

Get quarterly sales figure


for each branch
and manually calculate
sales figure across branches.

Sales
Manager

CANADA

INDIA

What if he need daily sales report across the branches?

WHY DATA WAREHOUSE?

Solution:
Extract sales information from each database.
Store the information in a common repository at a single
site.

WHY DATA WAREHOUSE?


USA

UK
Data
Warehouse
CANADA

INDIA

Query &
Analysis tools

Sales
Manager

CHARACTERISTICS OF DATA
WAREHOUSE
Relational / Multidimensional database

Query and Analysis rather than transaction


Historical data from transactions
Consolidates Multiple data sources
Separates query load from transactions
Mostly non volatile
Large amount of data in order of TBs

WHEN WE SAY LARGE - WE


MEAN
IT!
Terabytes
-- 10^12 bytes: Yahoo! 300 Terabytes and
Petabytes -- 10^15 bytes:
Exabytes -- 10^18 bytes:

growing

Geographic Information
Systems

Zettabytes -- 10^21 bytes: National Medical Records


Zottabytes -- 10^24 bytes: Weather images
Intelligence Agency Videos

OLTP VS DATA WAREHOUSE (OLAP)


OLTP

Data Warehouse (OLAP)

Indexes

Few

Many

Data

Normalized

Generally De-normalized

Joins

Many

Some

Derived data and aggregates

Rare

Common

DATA WAREHOUSE ARCHITECTURE


Operational
System

Sales
Data Mart
Analysis

Operational
System
ETL
(Extract
Transform
and Load)

Data
Warehouse

Generic
Data Mart

Flat
Files

Flat
Files

Data Mining

Inventory
Data Mart

Reporting

ETL

ETL stands for Extract, Transform and Load

Data is distributed across different sources

Flat files, Streaming Data, DB Systems, XML,


JSON
Data can be in different format

CSV, Key Value Pairs


Different units and representation

Country: IN or India

Date: 20 Nov 2010 or 20101020

ETL FUNCTIONS

Extract

Collect data from different sources

Parse data

Remove unwanted data


Transform

Project

Generate Surrogate keys

Encode data

Join data from different sources

Aggregate
Load

ETL STEPS

The first step in ETL process is mapping the data between


source systems and target database.
The second step is cleansing of source data in staging area.
The third step is transforming cleansed source data.
Fourth step is loading into the target system.

Data before ETL Processing:

Data after ETL Processing:

ETL GLOSSARY

Mapping:
Defining relationship between source and target objects.
Cleansing:
The process of resolving inconsistencies in source data.
Transformation:
The process of manipulating data. Any manipulation
beyond copying is a transformation. Examples include
aggregating, and integrating data from multiple sources.
Staging Area:
A place where data is processed before entering the
warehouse.

DIMENSION

Categorizes the data. For example - time, location, etc.


A dimension can have one or more attributes. For
example - day, week and month are attributes of time
dimension.
Role of dimensions in data warehousing.
- Slice and dice
- Filter by dimensions

TYPES OF DIMENSIONS

Conformed Dimension - A dimension that is shared across fact


tables.
Junk Dimension - A junk dimension is a convenient grouping of
flags and indicators. For example, payment method, shipping
method.
De-generated Dimension - A dimension key, that has no attributes
and hence does not have its own dimension table. For example,
transaction number, invoice number. Value of these dimension is
mostly unique within a fact table.
Role Playing Dimensions - Role Playing dimension refers to a
dimension that play different roles in fact tables depending on the
context. For example, the Date dimension can be used for the
ordered date, shipment date, and invoice date.
Slowly Changing Dimensions - Dimensions that have data that
changes slowly, rather than changing on a time-based, regular
schedule.

TYPES OF SLOWLY CHANGING


DIMENSION

Type1 - The Type 1 methodology overwrites old data with new data, and
therefore does not track historical data at all.
Type 2 - The Type 2 method tracks historical data by creating multiple
records for a given value in dimension table with separate surrogate keys.
Type 3 - The Type 3 method tracks changes using separate columns.
Whereas Type 2 had unlimited history preservation, Type 3 has limited
history preservation, as it's limited to the number of columns we designate
for storing historical data.
Type 4 - The Type 4 method is usually referred to as using "history tables",
where one table keeps the current data, and an additional table is used to
keep a record of all changes.

Type 1, 2 and 3 are commonly used.


Some books talks about Type 0 and 6 also.
http://en.wikipedia.org/wiki/Slowly_changing_dimension

FACTS

Facts are values that can be examined and analyzed.


For Example - Page Views, Unique Users, Pieces Sold,
Profit.
Fact and measure are synonymous.
Types of facts:

Additive - Measures that can be added across all


dimensions.

Non Additive - Measures that cannot be added


across all dimensions.

Semi Additive - Measures that can be added across


few dimensions and not with others.

HOW TO STORE DATA?

Facts and Dimensions:


1.
Select the business process to model
2.
Declare the grain of the business process
3.
Choose the dimensions that apply to each fact table row
4.
Identify the numeric facts that will populate each fact
table row

DIMENSION TABLE

Contains attributes of dimensions e.g. month is an


attribute of Time dimension.
Can also have foreign keys to another dimension table
Usually identified by a unique integer primary key called
surrogate key

FACT TABLE

Contains Facts
Foreign keys to dimension tables
Primary Key: usually composite key of all FKs

TYPES OF SCHEMA USED IN DATA WAREHOUSE

Star Schema
Snowflake Schema
Fact Constellation Schema

STAR SCHEMA

Multi-dimensional Data
Dimension and Fact Tables
A fact table with pointers to Dimension tables

STAR SCHEMA

SNOWFLAKE SCHEMA

An extension of star schema in which the dimension


tables are partly or fully normalized.
Dimension table hierarchies broken down into simpler
tables.

SNOWFLAKE SCHEMA

FACT CONSTELLATION SCHEMA


A fact constellation schema allows dimension tables to be
shared between fact tables.
This Schema is used mainly for the aggregate fact tables,
OR where we want to split a fact table for better
comprehension.

For example, a separate fact table for daily, weekly and


monthly reporting requirement.

FACT CONSTELLATION SCHEMA

In this example, the dimensions tables for time, item, and location are
shared between both the sales and shipping fact tables.

OPERATIONS ON DATA
WAREHOUSE
Drill Down

Roll up
Slice & Dice
Pivoting

DRILL DOWN
Product
Category e.g Home Appliances
Sub Category e.g Kitchen Appliances

Region

Product e.g Toaster

Time

ROLL UP
Year

Fiscal Year

Quarter

Fiscal Quarter

Month

Fiscal Month

Fiscal Week

Day

SLICE & DICE


Product = Toaster

Region

Region

Product

Time
Time

PIVOTING
Product

Time

Region

Product

Time

Also called rotation


Rotate on an axis
Interchange Rows and Columns

Region

ADVANTAGES OF DATA
One consistent data store for reporting, forecasting, and
WAREHOUSE

analysis
Easier and timely access to data
Scalability
Trend analysis and detection
Drill down analysis

DISADVANTAGES OF DATA
WAREHOUSE
Preparation may be time consuming.

High associated cost

CASE STUDY: WHY DATA


WAREHOUSE
G2G Courier Pvt. Ltd. is an established brand in courier

industry which has its own network in main cities and


also have sub contracted in rural areas across the country
to various partners.
The President of the company wants to look deep into the
financial health of the company and different performance
aspects.

CHALLENGES

Apart from G2Gs own transaction system, each partner


has their own system which make the data very
heterogeneous.
Granularity of data in various systems is also different.
For eg: minute accuracy and day accuracy.
To do analysis on metrics like Revenue and Timely
delivery across various geographical locations and
partner, we need to have a unified system.

DATA WAREHOUSE MODEL


Time

Product

Sales Fact

Region

Product
Category

THANK YOU

Anda mungkin juga menyukai