ETL Training - Day 1

ETL Testing Training – Day 2 Agenda
• Data warehousing – overview
• Data warehouse vs OLTP
• Data warehouse vs data mart
• Overview of OLAP
OLTP vs Data Warehouse
OLTP                        Warehouse (DSS)
Application oriented        Subject oriented
Used to run business        Used to analyze business
Detailed data               Summarized and refined
Current, up to date         Snapshot data
Isolated data               Integrated data
Repetitive access           Ad-hoc access
Clerical user               Knowledge user (manager)
OLTP vs Data Warehouse
OLTP                                     Data Warehouse
Performance sensitive                    Performance relaxed
Few records accessed at a time (tens)    Large volumes accessed at a time (millions)
Read/update access                       Mostly read (batch update)
No data redundancy                       Redundancy present
Database size 100 MB - 100 GB            Database size 100 GB - few terabytes
Thousands of users                       Hundreds of users
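The access patterns in the table above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the orders table, its columns, and the sample rows are made up for the example and do not come from the slides.

```python
import sqlite3

# Hypothetical table and data, used only to contrast the two access styles.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (order_id, region, amount) VALUES (?, ?, ?)",
    [(1, "East", 100.0), (2, "West", 250.0), (3, "East", 50.0)],
)

# OLTP-style access: read/update a single record located by its key.
conn.execute("UPDATE orders SET amount = ? WHERE order_id = ?", (120.0, 1))
oltp_row = conn.execute(
    "SELECT order_id, region, amount FROM orders WHERE order_id = ?", (1,)
).fetchone()

# Warehouse-style access: a read-only query that summarizes many records.
dss_rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(oltp_row)
print(dss_rows)
```

In a real system the two workloads would normally run on separate databases; a single in-memory table is used here only to keep the sketch short.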
To summarize ...
• OLTP systems are used to “run” a business
• The data warehouse helps to “optimize” the business
Generic two-level architecture
[Figure: E-T-L flow into one, company-wide warehouse]
Periodic extraction --> data is not completely current in warehouse

Independent data mart
Data marts: mini-warehouses, limited in scope
Separate ETL for each independent data mart
Data access complexity due to multiple data marts
Dependent data mart with operational data store
ODS provides option for obtaining current data
Single ETL for enterprise data warehouse (EDW)
Simpler data access
Dependent data marts loaded from EDW
Data Warehouse Architecture - 3
The ETL Process
• Capture
• Scrub or data cleansing
• Transform
• Load
ETL = Extract, transform, and load

Steps in data reconciliation
Capture = extract…obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
Static extract = capturing a snapshot of the source data at a point in time
Incremental extract = capturing changes that have occurred since the last static extract
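The two extract styles can be sketched as follows. This is a minimal illustration, assuming each source row carries an updated_at timestamp; that field, the function names, and the sample rows are hypothetical and not part of the slides.

```python
from datetime import datetime

# Hypothetical source rows; "updated_at" is an assumed change-tracking column.
source_rows = [
    {"id": 1, "name": "Ann", "updated_at": datetime(2024, 1, 5)},
    {"id": 2, "name": "Bob", "updated_at": datetime(2024, 2, 10)},
    {"id": 3, "name": "Cy", "updated_at": datetime(2024, 3, 1)},
]


def static_extract(rows):
    """Static extract: copy every source row as of this point in time."""
    return [dict(r) for r in rows]


def incremental_extract(rows, last_extract_time):
    """Incremental extract: copy only rows changed since the last extract."""
    return [dict(r) for r in rows if r["updated_at"] > last_extract_time]


snapshot = static_extract(source_rows)
changes = incremental_extract(source_rows, datetime(2024, 2, 1))

print(len(snapshot))
print([r["id"] for r in changes])
```

Timestamp comparison is only one way to detect changes; log-based capture or row comparison are alternatives the slides do not discuss.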
Steps in data reconciliation (continued)
Scrub = cleanse…uses pattern recognition and AI techniques to upgrade data quality
Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
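A few of the scrubbing tasks listed above (reformatting, decoding, duplicate removal, error logging) can be sketched with simple rules. The slides mention pattern recognition and AI techniques; this sketch deliberately swaps in a plain lookup table instead, and the scrub function, the state_codes mapping, and the sample records are all assumptions made for illustration.

```python
def scrub(records):
    """Rule-based cleansing sketch: normalize, decode, de-duplicate, log rejects."""
    # Assumed decoding table mapping free-text state names to codes.
    state_codes = {"connecticut": "CT", "ct": "CT", "washington": "WA", "wa": "WA"}
    cleaned, rejected, seen = [], [], set()
    for rec in records:
        # Reformatting: collapse extra whitespace and standardize capitalization.
        name = " ".join((rec.get("name") or "").split()).title()
        # Decoding: translate the raw state value into a standard code.
        state = state_codes.get((rec.get("state") or "").strip().lower())
        if not name or state is None:
            rejected.append(rec)  # error detection/logging
            continue
        key = (name, state)
        if key in seen:  # duplicate data
            continue
        seen.add(key)
        cleaned.append({"name": name, "state": state})
    return cleaned, rejected


raw = [
    {"name": "  john   doe ", "state": "Connecticut"},
    {"name": "John Doe", "state": "CT"},
    {"name": "", "state": "WA"},
    {"name": "Jane Roe", "state": "??"},
]
cleaned, rejected = scrub(raw)
print(cleaned)
print(len(rejected))
```

Rejected rows are kept rather than dropped so that they can be reviewed, which corresponds to the error detection/logging item in the list.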
Transform = convert data from format of operational system to format of data warehouse
Record-level:
• Selection – data partitioning
• Joining – data combining
• Aggregation – data summarization
Field-level:
• single-field – from one field to one field
• multi-field – from many fields to one, or one field to many
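The record-level and field-level transformations above can be sketched on a small in-memory data set. The sales rows, the items lookup, and the field names are hypothetical and chosen only to make each step visible.

```python
from collections import defaultdict

# Hypothetical operational rows and lookup table.
sales = [
    {"item_key": 1, "units": 2, "price": 10.0, "status": "ok"},
    {"item_key": 2, "units": 1, "price": 99.0, "status": "void"},
    {"item_key": 1, "units": 3, "price": 10.0, "status": "ok"},
    {"item_key": 2, "units": 4, "price": 5.0, "status": "ok"},
]
items = {1: "pen", 2: "ink"}

# Selection (data partitioning): keep only valid rows.
selected = [s for s in sales if s["status"] == "ok"]

# Joining (data combining): attach the item name from the lookup table.
joined = [{**s, "item_name": items[s["item_key"]]} for s in selected]

# Multi-field transformation: many fields (units, price) to one (dollars_sold).
for row in joined:
    row["dollars_sold"] = row["units"] * row["price"]

# Aggregation (data summarization): total dollars per item.
totals = defaultdict(float)
for row in joined:
    totals[row["item_name"]] += row["dollars_sold"]
totals = dict(totals)

print(totals)
```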
Load/Index = place transformed data into the warehouse and create indexes
Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to data warehouse
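The difference between refresh mode and update mode can be sketched with sqlite3. The dim_customer table, the index name, and the two helper functions are assumptions for illustration; a production warehouse would typically use its own bulk-load tooling.

```python
import sqlite3


def refresh_load(conn, rows):
    """Refresh mode: bulk rewrite of the whole target table."""
    with conn:  # commit on success, roll back on error
        conn.execute("DELETE FROM dim_customer")
        conn.executemany(
            "INSERT INTO dim_customer (customer_id, name) VALUES (?, ?)", rows
        )


def update_load(conn, changed_rows):
    """Update mode: write only the rows that changed in the source."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO dim_customer (customer_id, name) VALUES (?, ?)",
            changed_rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
# The "index" part of load/index: support lookups by name.
conn.execute("CREATE INDEX idx_customer_name ON dim_customer (name)")

refresh_load(conn, [(1, "Ann"), (2, "Bob")])
update_load(conn, [(2, "Robert"), (3, "Cy")])

loaded = conn.execute(
    "SELECT customer_id, name FROM dim_customer ORDER BY customer_id"
).fetchall()
print(loaded)
```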
Data Mart
Data mart is:

A functional segment of an enterprise restricted for
purposes of security, locality, performance, or
business necessity using modeling and information
delivery techniques identical to data warehousing.
Data Mart
Why build a data mart?
Allows an organization to visualize the large but focus on the small and attainable.
Provides a platform for rapid delivery of an operational system.
Minimizes risk.
A corporate warehouse can be constructed from the union of the enterprise data marts.
Top-down approach
[Figure: source data (operational data, external data), staging area, data warehouse, data marts]
Physical data warehouse: Data Warehouse --> Data Marts
Bottom-up approach
[Figure: source data (operational data, external data), staging area, data marts, data warehouse]
Data Marts --> Data Warehouse
Hybrid approach
[Figure: source data (operational data, external data), staging area, data warehouse, data marts]
Parallel Data Warehouse & Data Marts
Conceptual Modeling of Data Warehouses
• Modeling data warehouses: dimensions & measures
• Star schema: A fact table in the middle connected to a set of dimension tables
• Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
• Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
Example of Star Schema
Sales Fact Table
  Keys:     time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
Dimension tables
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_state, country
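A star-schema query joins the central fact table to its dimension tables and aggregates the measures. The sketch below uses sqlite3 with only two of the four dimensions and a subset of their columns; the table names time_dim and item_dim and all sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE time_dim (
        time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
        quarter TEXT, year INTEGER
    );
    CREATE TABLE item_dim (
        item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT
    );
    -- Fact table: foreign keys to the dimensions plus numeric measures.
    CREATE TABLE sales_fact (
        time_key INTEGER REFERENCES time_dim(time_key),
        item_key INTEGER REFERENCES item_dim(item_key),
        units_sold INTEGER,
        dollars_sold REAL
    );
    INSERT INTO time_dim VALUES (1, 15, 1, 'Q1', 2024);
    INSERT INTO time_dim VALUES (2, 20, 4, 'Q2', 2024);
    INSERT INTO item_dim VALUES (10, 'pen', 'Acme', 'stationery');
    INSERT INTO item_dim VALUES (11, 'ink', 'Acme', 'stationery');
    INSERT INTO sales_fact VALUES (1, 10, 5, 50.0);
    INSERT INTO sales_fact VALUES (1, 11, 2, 40.0);
    INSERT INTO sales_fact VALUES (2, 10, 3, 30.0);
    """
)

# Join the fact table to each dimension and summarize the measures.
star_rows = conn.execute(
    """
    SELECT t.quarter, i.item_name, SUM(f.units_sold), SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN item_dim AS i ON i.item_key = f.item_key
    GROUP BY t.quarter, i.item_name
    ORDER BY t.quarter, i.item_name
    """
).fetchall()
print(star_rows)
```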
Example of Snowflake Schema
Sales Fact Table
  Keys:     time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
Dimension tables
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_key
  supplier: supplier_key, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city_key
  city:     city_key, city, province_or_state, country
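In a snowflake schema, reaching an attribute in a normalized sub-dimension takes one extra join. The sketch below follows the location and city tables from the example, with a reduced column set; the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalized hierarchy: location points to city instead of repeating it.
    CREATE TABLE city (city_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
    CREATE TABLE location (
        location_key INTEGER PRIMARY KEY, street TEXT,
        city_key INTEGER REFERENCES city(city_key)
    );
    CREATE TABLE sales_fact (
        location_key INTEGER REFERENCES location(location_key),
        dollars_sold REAL
    );
    INSERT INTO city VALUES (1, 'Seattle', 'USA');
    INSERT INTO city VALUES (2, 'Hartford', 'USA');
    INSERT INTO location VALUES (100, '1 Pine St', 1);
    INSERT INTO location VALUES (101, '9 Oak Ave', 1);
    INSERT INTO location VALUES (102, '5 Elm St', 2);
    INSERT INTO sales_fact VALUES (100, 10.0);
    INSERT INTO sales_fact VALUES (101, 15.0);
    INSERT INTO sales_fact VALUES (102, 7.0);
    """
)

# Fact -> location -> city: the second join is the cost of normalization.
city_sales = conn.execute(
    """
    SELECT c.city, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN location AS l ON l.location_key = f.location_key
    JOIN city AS c ON c.city_key = l.city_key
    GROUP BY c.city
    ORDER BY c.city
    """
).fetchall()
print(city_sales)
```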
Example of Fact Constellation
Sales Fact Table
  Keys:     time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table
  Keys:     time_key, item_key, shipper_key, from_location, to_location
  Measures: dollars_cost, units_shipped
Dimension tables
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_state, country
  shipper:  shipper_key, shipper_name, location_key, shipper_type
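Because both fact tables share the item dimension, measures from each can be compared per item. The sketch below aggregates each fact table separately before joining, so that rows from one fact table do not multiply rows from the other. Table names, the reduced column set, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT);
    CREATE TABLE sales_fact (item_key INTEGER, units_sold INTEGER);
    CREATE TABLE shipping_fact (item_key INTEGER, units_shipped INTEGER);
    INSERT INTO item_dim VALUES (10, 'pen');
    INSERT INTO item_dim VALUES (11, 'ink');
    INSERT INTO sales_fact VALUES (10, 5);
    INSERT INTO sales_fact VALUES (10, 3);
    INSERT INTO sales_fact VALUES (11, 2);
    INSERT INTO shipping_fact VALUES (10, 6);
    INSERT INTO shipping_fact VALUES (11, 1);
    INSERT INTO shipping_fact VALUES (11, 1);
    """
)

# Summarize each fact table on the shared dimension key, then join the results.
constellation_rows = conn.execute(
    """
    SELECT i.item_name, s.units_sold, sh.units_shipped
    FROM item_dim AS i
    JOIN (SELECT item_key, SUM(units_sold) AS units_sold
          FROM sales_fact GROUP BY item_key) AS s
      ON s.item_key = i.item_key
    JOIN (SELECT item_key, SUM(units_shipped) AS units_shipped
          FROM shipping_fact GROUP BY item_key) AS sh
      ON sh.item_key = i.item_key
    ORDER BY i.item_name
    """
).fetchall()
print(constellation_rows)
```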
What Is a Slowly Changing Dimension?
• A slowly changing dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse.
• Historical attribute values are retained when the attributes are updated.
• Used when the organization does not want to lose track of what actually happened.
• Example: customer moves from Connecticut to Seattle

What Is a Slowly Changing Dimension? (Cont.)
There are three types of slowly changing dimensions.
Type 1 overwrites old values.
Type 2 creates another dimension record.
Type 3 creates a current value field.

Type 1 SCD: Does Not Store History
Type 1 overwrites old values.
Old record
ID   Customer ID   Customer Name   Marital Status
3    1125          Steve           Single
New record
ID   Customer ID   Customer Name   Marital Status
3    1125          Steve           Married
Overwriting a Record
• As taught earlier, this is referred to as a type 1 slowly changing dimension.
• Implementation is easy.
• History is lost.
• This technique is not recommended.
Customer ID   John Doe   Married
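A type 1 change can be sketched as an in-place overwrite. The apply_type1 helper and the dictionary field names are hypothetical; the sample row follows the Steve example.

```python
def apply_type1(dimension, customer_id, **changes):
    """Type 1: overwrite attribute values in place; the old values are lost."""
    for row in dimension:
        if row["customer_id"] == customer_id:
            row.update(changes)
    return dimension


customers = [
    {"id": 3, "customer_id": 1125, "customer_name": "Steve", "marital_status": "Single"}
]
apply_type1(customers, 1125, marital_status="Married")
print(customers)
```

The dimension still holds a single row for the customer, and nothing records that the status was ever "Single".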
Type 2 SCD: Preserves Complete History
Type 2 stores complete change history in a new record.
Before
ID   Customer ID   Customer Name   Marital Status   Effective Date   Expiration Date
3    1125          Steve           Single           04-04-1999       NULL              (open/current record)
After
ID   Customer ID   Customer Name   Marital Status   Effective Date   Expiration Date
3    1125          Steve           Single           04-04-1999       01-13-2001        (closed record)
8    1125          Steve           Married          01-14-2001       NULL              (open/current record)

Adding a New Record
• This is an example of a type 2 slowly changing dimension.
• History is preserved; dimensions grow.
• Time constraints are required.
• A generalized key is created.
• Metadata tracks the use of keys.
1    Customer ID   John Doe   Single    1-Feb-41   31-Dec-95
42   Customer ID   John Doe   Married   1-Jan-96
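A type 2 change can be sketched as closing the open record and appending a new one under a new key. The apply_type2 helper, the field names, and the way the new key is supplied (next_id) are assumptions; the dates follow the Steve example.

```python
from datetime import date, timedelta


def apply_type2(dimension, customer_id, change_date, next_id, **changes):
    """Type 2: expire the current row and add a new row carrying the change."""
    # The open/current row is the one with no expiration date.
    current = next(
        r for r in dimension
        if r["customer_id"] == customer_id and r["expiration_date"] is None
    )
    # Close the old row the day before the change takes effect.
    current["expiration_date"] = change_date - timedelta(days=1)
    new_row = {
        **current,
        **changes,
        "id": next_id,  # new generalized (surrogate) key
        "effective_date": change_date,
        "expiration_date": None,
    }
    dimension.append(new_row)
    return new_row


customers = [
    {
        "id": 3, "customer_id": 1125, "customer_name": "Steve",
        "marital_status": "Single",
        "effective_date": date(1999, 4, 4), "expiration_date": None,
    }
]
apply_type2(customers, 1125, date(2001, 1, 14), next_id=8, marital_status="Married")
for row in customers:
    print(row)
```

The customer ID stays the same across both rows, which is why a separate generalized key is needed to tell the versions apart.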
Type 3 SCD: Stores Only the Previous Value
Type 3 stores current and previous version of a selected attribute.
ID   Customer ID   Customer Name   Marital Status   Previous Marital Status   Effective Date
3    1125          Steve           Married          Single                    01-14-2001
3    1125          Steve           Widower          Married                   10-30-2004

Adding a Current Field
• This is an example of a type 3 slowly changing dimension.
• Some history is maintained.
• Intermediate values are lost.
• This method is enhanced by adding an Effective Date field.
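A type 3 change can be sketched as shifting the current value into a "previous" field on the same row. The apply_type3 helper and field names are hypothetical, and the starting effective date of the "Single" status is an assumption added so the row is complete.

```python
from datetime import date


def apply_type3(row, new_status, effective_date):
    """Type 3: keep only the current and the immediately previous value."""
    row["previous_marital_status"] = row["marital_status"]
    row["marital_status"] = new_status
    row["effective_date"] = effective_date
    return row


steve = {
    "id": 3, "customer_id": 1125, "customer_name": "Steve",
    "marital_status": "Single", "previous_marital_status": None,
    "effective_date": date(1999, 4, 4),  # assumed starting date
}
apply_type3(steve, "Married", date(2001, 1, 14))
apply_type3(steve, "Widower", date(2004, 10, 30))
print(steve)
```

After the second change the original "Single" value no longer appears anywhere in the row, which is the loss of intermediate values noted above.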
Dimensions
• Dimensions determine the contextual background for the facts.
• A dimension is a collection of members or units of the same type of views.
• Dimensions describe who, what, when, where and why for the facts.
• Dimensions should consist of the following data types:
  1. Surrogate key
  2. Primary key of the loaded source(s)
  3. Any additional attributes (columns) that
Facts
• A fact is a collection of related data items, consisting of measures and context data.
• Each fact typically represents a business item, a business transaction, or an event that can be used in analyzing the business or business process.
• Facts are measured, “continuously valued”, rapidly changing information. Can be calculated and/or derived.
Facts (Cont.)
Facts are the key metrics used to measure business results:
Sales
Production
Inventory
Can be additive, semi-additive, or non-additive.
A fact table consists of at least two types of data: keys and measures.
Keys are usually surrogate keys that link to the dimension tables.
Fact Table
• A table that is used to store business information (measures) that can be used in mathematical equations.
• Quantities
• Percentages
• Prices
Types of Facts
Additive
• Able to add the facts along all the dimensions
• Discrete numerical measures, e.g. retail sales in $
Semi-additive
• Snapshot, taken at a point in time
• Measures of intensity
• Not additive along time dimension, e.g. account balance, inventory balance
• Added and divided by number of time periods to get a time-average
Types of Facts (Cont.)
Non-additive
• Numeric measures that cannot be added across any dimensions
• Intensity measure averaged across all dimensions, e.g. room temperature
• Textual facts - AVOID THEM
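The difference between additive and semi-additive facts can be sketched with a few rows. The sales and balances lists and their field names are made up for illustration.

```python
# Additive fact: retail sales can be summed along every dimension.
sales = [
    {"month": "Jan", "store": "S1", "dollars": 10.0},
    {"month": "Jan", "store": "S2", "dollars": 20.0},
    {"month": "Feb", "store": "S1", "dollars": 30.0},
]
total_sales = sum(r["dollars"] for r in sales)

# Semi-additive fact: account balances are point-in-time snapshots.
balances = [
    {"month": "Jan", "account": "A", "balance": 100.0},
    {"month": "Jan", "account": "B", "balance": 50.0},
    {"month": "Feb", "account": "A", "balance": 120.0},
    {"month": "Feb", "account": "B", "balance": 30.0},
]

# Summing across accounts within one month is meaningful.
jan_total_balance = sum(r["balance"] for r in balances if r["month"] == "Jan")

# Summing across months is not; add and divide by the number of periods instead.
a_balances = [r["balance"] for r in balances if r["account"] == "A"]
a_time_average = sum(a_balances) / len(a_balances)

print(total_sales, jan_total_balance, a_time_average)
```

A non-additive measure such as room temperature would be averaged along every dimension in the same way the balance is averaged along time here.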
