
ETL Testing Training – Day 2 Agenda

• Data warehousing overview
• Data warehouse vs OLTP
• Data warehouse vs data mart
• Overview of OLAP
OLTP vs Data Warehouse

OLTP                      Warehouse (DSS)
Application oriented      Subject oriented
Used to run business      Used to analyze business
Detailed data             Summarized and refined
Current, up to date       Snapshot data
Isolated data             Integrated data
Repetitive access         Ad-hoc access
Clerical user             Knowledge user (manager)
OLTP vs Data Warehouse (continued)

OLTP                                    Data Warehouse
Performance sensitive                   Performance relaxed
Few records accessed at a time (tens)   Large volumes accessed at a time (millions)
Read/update access                      Mostly read (batch update)
No data redundancy                      Redundancy present
Database size 100 MB - 100 GB           Database size 100 GB - few terabytes
Thousands of users                      Hundreds of users
To summarize ...

• OLTP systems are used to “run” a business
• The data warehouse helps to “optimize” the business
Generic two-level architecture

[Figure: generic two-level architecture; ETL feeds one company-wide warehouse]

Periodic extraction means data is not completely current in the warehouse


Independent data mart

Data marts: mini-warehouses, limited in scope
Separate ETL for each independent data mart
Data access complexity due to multiple data marts
Dependent data mart with operational data store

ODS provides option for obtaining current data
Single ETL for enterprise data warehouse (EDW)
Dependent data marts loaded from EDW
Simpler data access
Data Warehouse Architecture - 3
The ETL Process

• Capture
• Scrub or data cleansing
• Transform
• Load

ETL = Extract, transform, and load


Steps in data reconciliation

Capture = extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse

Static extract = capturing a snapshot of the source data at a point in time
Incremental extract = capturing changes that have occurred since the last static extract
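The difference between the two capture styles can be sketched in a few lines of Python. This is a minimal illustration, not a production extractor: the `SOURCE_ROWS` data and the `updated_at` change-tracking column are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical source rows; `updated_at` is an assumed change-tracking column.
SOURCE_ROWS = [
    {"id": 1, "name": "Ann", "updated_at": datetime(2024, 1, 5)},
    {"id": 2, "name": "Bob", "updated_at": datetime(2024, 2, 10)},
    {"id": 3, "name": "Cy", "updated_at": datetime(2024, 3, 15)},
]

def static_extract(rows):
    """Static extract: copy the chosen source data as it is right now."""
    return [dict(r) for r in rows]

def incremental_extract(rows, last_extract_time):
    """Incremental extract: keep only rows changed since the last extract."""
    return [dict(r) for r in rows if r["updated_at"] > last_extract_time]

snapshot = static_extract(SOURCE_ROWS)
changes = incremental_extract(SOURCE_ROWS, datetime(2024, 2, 1))
print(len(snapshot), [r["id"] for r in changes])
```

In practice the "changed since" test depends on what the source system offers (timestamps, logs, triggers); a timestamp column is only one possible choice.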
Steps in data reconciliation (continued)

Scrub = cleanse: uses pattern recognition and AI techniques to upgrade data quality

Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
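A small sketch of a scrub step is shown below. It uses simple hand-written rules (reformatting, missing-data detection, duplicate detection, error logging) rather than the pattern recognition or AI techniques mentioned above; the record layout with `name` and `state` fields is an assumption for illustration.

```python
def scrub(records):
    """Return (clean_rows, error_log) after applying simple cleansing rules."""
    clean, errors, seen = [], [], set()
    for rec in records:
        # Reformatting: collapse whitespace and normalize capitalization.
        name = " ".join((rec.get("name") or "").split()).title()
        state = (rec.get("state") or "").strip().upper()
        if not name:
            # Error detection/logging: missing data.
            errors.append((rec, "missing name"))
            continue
        key = (name, state)
        if key in seen:
            # Error detection/logging: duplicate data.
            errors.append((rec, "duplicate"))
            continue
        seen.add(key)
        clean.append({"name": name, "state": state})
    return clean, errors

raw = [
    {"name": "  john   doe ", "state": "ct"},
    {"name": "John Doe", "state": "CT "},
    {"name": "", "state": "WA"},
]
clean_rows, error_log = scrub(raw)
print(clean_rows)
print([reason for _, reason in error_log])
```

Rejected rows are logged instead of silently dropped, so a tester can reconcile source counts against loaded counts.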
Steps in data reconciliation (continued)

Transform = convert data from format of operational system to format of data warehouse

Record-level:
Selection – data partitioning
Joining – data combining
Aggregation – data summarization

Field-level:
single-field – from one field to one field
multi-field – from many fields to one, or one field to many
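The record-level and field-level transformations listed above can be illustrated with plain Python. The `sales` and `items` data and the helper names `full_name` and `split_date` are hypothetical, chosen only to show each kind of transformation once.

```python
from collections import defaultdict

# Hypothetical operational data.
sales = [
    {"item_key": 1, "region": "East", "amount": 100.0},
    {"item_key": 2, "region": "West", "amount": 50.0},
    {"item_key": 1, "region": "East", "amount": 25.0},
]
items = {1: "Laptop", 2: "Mouse"}

# Record-level selection: partition the data by a condition.
east = [s for s in sales if s["region"] == "East"]

# Record-level joining: combine each sale with its item name.
joined = [{**s, "item_name": items[s["item_key"]]} for s in sales]

# Record-level aggregation: summarize amounts per item.
totals = defaultdict(float)
for row in joined:
    totals[row["item_name"]] += row["amount"]

def full_name(first, last):
    """Multi-field transform: many fields to one."""
    return f"{last}, {first}"

def split_date(iso_date):
    """Multi-field transform: one field to many."""
    year, month, day = iso_date.split("-")
    return {"year": int(year), "month": int(month), "day": int(day)}

print(len(east), dict(totals))
print(full_name("John", "Doe"), split_date("2001-01-14"))
```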
Steps in data reconciliation (continued)

Load/Index = place transformed data into the warehouse and create indexes

Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to data warehouse
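The two load modes can be contrasted with a short sketch. Here the warehouse target is modeled as a Python dictionary keyed by `id`, which is an assumption made for the example; index creation is not shown.

```python
def refresh_load(target, source_rows):
    """Refresh mode: bulk-rewrite the whole target from the source."""
    target.clear()
    target.update({row["id"]: row for row in source_rows})

def update_load(target, changed_rows):
    """Update mode: write only the rows that changed in the source."""
    for row in changed_rows:
        target[row["id"]] = row

warehouse = {}
refresh_load(warehouse, [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}])
update_load(warehouse, [{"id": 2, "value": "B"}, {"id": 3, "value": "c"}])
print(sorted(warehouse), warehouse[2]["value"])
```

Refresh mode is simpler to test (target should equal source after the load), while update mode needs checks that unchanged rows were left alone.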
Data Mart

Data mart is:

A functional segment of an enterprise restricted for purposes of security, locality, performance, or business necessity, using modeling and information delivery techniques identical to data warehousing.
Data Mart

Why build a data mart?

Allows an organization to visualize the large but focus on the small and attainable.

Provides a platform for rapid delivery of an operational system.

Minimizes risk.

A corporate warehouse can be constructed from the union of the enterprise data marts.
Top-down approach

[Figure: components shown are source data (external data, operational data), staging area, data warehouse, data marts]

Physical Data Warehouse: Data Warehouse --> Data Marts
Bottom-up approach

[Figure: components shown are source data (external data, operational data), staging area, data marts, data warehouse]

Physical Data Warehouse: Data Marts --> Data Warehouse
Hybrid approach

[Figure: components shown are source data (external data, operational data), staging area, data warehouse, data marts]

Physical Data Warehouse: Parallel Data Warehouse & Data Marts
Conceptual Modeling of Data Warehouses

• Modeling data warehouses: dimensions & measures
• Star schema: A fact table in the middle connected to a set of dimension tables
• Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
• Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
Example of Star Schema

Sales Fact Table
  Keys:     time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales

Dimension tables
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_state, country
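A reduced version of a sales star schema can be tried out with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: only two dimensions are included, the table names `time_dim`, `item_dim` and `sales_fact` are chosen for the example, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
-- The fact table holds keys to the dimensions plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
INSERT INTO item_dim VALUES (10, 'Laptop', 'Acme'), (11, 'Mouse', 'Acme');
INSERT INTO sales_fact VALUES (1, 10, 2, 2000.0), (1, 11, 5, 100.0), (2, 10, 1, 1000.0);
""")

# A typical star query: join the fact table to its dimensions, then aggregate.
rows = conn.execute("""
SELECT t.quarter, i.brand, SUM(f.units_sold), SUM(f.dollars_sold)
FROM sales_fact AS f
JOIN time_dim AS t ON t.time_key = f.time_key
JOIN item_dim AS i ON i.item_key = f.item_key
GROUP BY t.quarter, i.brand
ORDER BY t.quarter
""").fetchall()
print(rows)
```

Every dimension is one join away from the fact table, which is what keeps star queries simple compared with a snowflake layout.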
Example of Snowflake Schema

Sales Fact Table
  Keys:     time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales

Dimension tables
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_key
  supplier: supplier_key, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city_key
  city:     city_key, city, province_or_state, country
Example of Fact Constellation

Sales Fact Table
  Keys:     time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table
  Keys:     time_key, item_key, shipper_key, from_location, to_location
  Measures: dollars_cost, units_shipped

Dimension tables
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_state, country
  shipper:  shipper_key, shipper_name, location_key, shipper_type
What Is a Slowly Changing Dimension?

• A slowly changing dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse.
• Historical attribute values are retained when the attributes are updated.
• Used when the organization does not want to lose track of what actually happened.
• Example: a customer moves from Connecticut to Seattle.

What Is a Slowly Changing Dimension? (Cont.)

There are three types of slowly changing dimensions.

Type 1 overwrites old values.

Type 2 creates another dimension record.

Type 3 creates a current value field.


Type 1 SCD: Does Not Store History

Type 1 overwrites old values.

Old record

ID  Customer ID  Customer Name  Marital Status
3   1125         Steve          Single

New record

ID  Customer ID  Customer Name  Marital Status
3   1125         Steve          Married
Overwriting a Record

• As described earlier, this is referred to as a type 1 slowly changing dimension.
• Implementation is easy.
• History is lost.
• This technique is not recommended.

[Figure: customer record for John Doe overwritten with marital status Married]
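A type 1 change can be sketched as an in-place update. The function name `apply_scd_type1` and the dictionary-based dimension are assumptions for illustration; the sample record is the Steve example used for Type 1.

```python
def apply_scd_type1(dimension, customer_id, **changes):
    """Type 1: overwrite attribute values in place; no history is kept."""
    for row in dimension:
        if row["customer_id"] == customer_id:
            row.update(changes)
    return dimension

customers = [
    {"id": 3, "customer_id": 1125, "name": "Steve", "marital_status": "Single"},
]
apply_scd_type1(customers, 1125, marital_status="Married")
print(customers)
```

After the update there is still exactly one row for the customer, and nothing in the table records that the status was ever "Single".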
Type 2 SCD: Preserves Complete History

Type 2 stores complete change history in a new record.

Before

ID  Customer ID  Customer Name  Marital Status  Effective Date  Expiration Date
3   1125         Steve          Single          04-04-1999      NULL             (open/current record)

After

ID  Customer ID  Customer Name  Marital Status  Effective Date  Expiration Date
3   1125         Steve          Single          04-04-1999      01-13-2001       (closed record)
8   1125         Steve          Married         01-14-2001      NULL             (open/current record)


Adding a New Record

• This is an example of a type 2 slowly changing dimension.
• History is preserved; dimensions grow.
• Time constraints are required.
• A generalized key is created.
• Metadata tracks the use of keys.

[Figure: two records for the same customer]
1   Customer ID  John Doe  Single   1-Feb-41  31-Dec-95
42  Customer ID  John Doe  Married  1-Jan-96
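A type 2 change can be sketched as "close the open record, then add a new one". The function name `apply_scd_type2`, the `next_key` parameter, and the rule that the old record expires one day before the change are assumptions for this example; the data reproduces the Steve record with IDs 3 and 8.

```python
from datetime import date, timedelta

def apply_scd_type2(dimension, customer_id, new_status, change_date, next_key):
    """Type 2: close the current record and append a new current record."""
    for row in dimension:
        if row["customer_id"] == customer_id and row["expiration_date"] is None:
            # Close the open record the day before the change takes effect.
            row["expiration_date"] = change_date - timedelta(days=1)
            # Add a new open record under a new surrogate key.
            dimension.append({
                **row,
                "id": next_key,
                "marital_status": new_status,
                "effective_date": change_date,
                "expiration_date": None,
            })
            break
    return dimension

customers = [{
    "id": 3, "customer_id": 1125, "name": "Steve", "marital_status": "Single",
    "effective_date": date(1999, 4, 4), "expiration_date": None,
}]
apply_scd_type2(customers, 1125, "Married", date(2001, 1, 14), next_key=8)
for row in customers:
    print(row["id"], row["marital_status"], row["effective_date"], row["expiration_date"])
```

A useful test for any type 2 load is that each customer has exactly one record with no expiration date.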
Type 3 SCD: Stores Only the Previous Value

Type 3 stores current and previous version of a selected attribute.

ID  Customer ID  Customer Name  Marital Status  Previous Marital Status  Effective Date
3   1125         Steve          Married         Single                   01-14-2001

3   1125         Steve          Widower         Married                  10-30-2004


Adding a Current Field

• This is an example of a type 3 slowly changing dimension.
• Some history is maintained.
• Intermediate values are lost.
• This method is enhanced by adding an Effective Date field.
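A type 3 change can be sketched as shifting the current value into a "previous" column on the same row. The function name `apply_scd_type3` and the column names are assumptions for illustration; the values follow the Steve example (Single, then Married, then Widower).

```python
from datetime import date

def apply_scd_type3(row, new_status, change_date):
    """Type 3: keep only the current and the immediately previous value."""
    row["previous_marital_status"] = row["marital_status"]
    row["marital_status"] = new_status
    row["effective_date"] = change_date
    return row

steve = {
    "id": 3, "customer_id": 1125, "name": "Steve",
    "marital_status": "Single", "previous_marital_status": None,
    "effective_date": None,
}
apply_scd_type3(steve, "Married", date(2001, 1, 14))
first_change = dict(steve)
apply_scd_type3(steve, "Widower", date(2004, 10, 30))
print(first_change["marital_status"], first_change["previous_marital_status"])
print(steve["marital_status"], steve["previous_marital_status"])
```

After the second change the value "Single" no longer appears anywhere in the row, which is the loss of intermediate values noted above.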
Dimensions

• Dimensions determine the contextual background for the facts.
• A dimension is a collection of members or units of the same type of views.
• Dimensions describe who, what, when, where and why for the facts.
• Dimensions should consist of the following data types
1. Surrogate key.
2. Primary key of the loaded source(s)
3. Any additional attributes (columns) that
Facts

• A fact is a collection of related data items, consisting of measures and context data.
• Each fact typically represents a business item, a business transaction, or an event that can be used in analyzing the business or business process.
• Facts are measured, “continuously valued”, rapidly changing information. Can be calculated and/or derived.
Facts (Cont.)

Facts are the key metrics used to measure business results:

Sales
Production
Inventory

Can be additive, semi-additive, or non-additive.
A fact table consists of at least two types of data: keys and measures.
Keys are usually surrogate keys that link to the dimension tables.
Fact Table

• A table that is used to store business information (measures) that can be used in mathematical equations.
  - Quantities
  - Percentages
  - Prices
Types of Facts

Additive
  - Able to add the facts along all the dimensions
  - Discrete numerical measures, e.g. retail sales in $
Semi-additive
  - Snapshot, taken at a point in time
  - Measures of intensity
  - Not additive along the time dimension, e.g. account balance, inventory balance
  - Added and divided by the number of time periods to get a time-average
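The semi-additive case can be checked with a few lines of arithmetic. The account balances below are invented for the example; the point is only which sums are meaningful.

```python
# Hypothetical month-end balance snapshots.
balances = [
    {"account": "A", "month": "Jan", "balance": 100.0},
    {"account": "A", "month": "Feb", "balance": 300.0},
    {"account": "B", "month": "Jan", "balance": 50.0},
    {"account": "B", "month": "Feb", "balance": 150.0},
]

# Adding across accounts at one point in time is meaningful.
jan_total = sum(b["balance"] for b in balances if b["month"] == "Jan")

# Adding one account across time is not meaningful as a balance,
# so the sum is divided by the number of periods to get a time-average.
a_rows = [b["balance"] for b in balances if b["account"] == "A"]
a_time_average = sum(a_rows) / len(a_rows)

print(jan_total, a_time_average)
```

Account A never held 400.0, which is why the raw sum over time is replaced by the average of 200.0.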
Types of Facts (continued)

• Non-additive
  - Numeric measures that cannot be added across any dimensions
  - Intensity measure averaged across all dimensions, e.g. room temperature
  - Textual facts: AVOID THEM
