
The Data Warehouse ETL Toolkit
VSV Training
Chapter 6: Delivering Fact Tables
Prepared by: Tho HOANG, Hien NGUYEN
Date: 09/02/2008

6. Delivering Fact Tables
   1. The Basic Structure of a Fact Table
   2. Guaranteeing Referential Integrity
   3. Surrogate Key Pipeline
      1. Using the Dimension Instead of a Lookup Table
   4. Fundamental Grains
      1. Transaction Grain Fact Tables
      2. Periodic Snapshot Fact Tables
      3. Accumulating Snapshot Fact Tables
   5. Preparing for Loading Fact Tables
      1. Managing Indexes
      2. Managing Partitions
      3. Outwitting the Rollback Log
      4. Loading the Data

6. Delivering Fact Tables (cont.)
   5. Preparing for Loading Fact Tables (cont.)
      5. Incremental Loading
      6. Inserting Facts
      7. Updating and Correcting Facts
      8. Negating Facts
      9. Updating Facts
      10. Deleting Facts
      11. Physically Deleting Facts
      12. Logically Deleting Facts

6. Delivering Fact Tables (cont.)
   6. Factless Fact Tables
   7. Augmenting a Type 1 Fact Table with Type 2 History
   8. Graceful Modifications
   9. Multiple Units of Measure in a Fact Table
   10. Collecting Revenue in Multiple Currencies
   11. Late Arriving Facts
   12. Aggregations
      1. Design Requirement #1
      2. Design Requirement #2
      3. Design Requirement #3
      4. Design Requirement #4
      5. Administering Aggregations, Including Materialized Views

6. Delivering Fact Tables (cont.)
   13. Delivering Dimensional Data to OLAP Cubes
      1. Cube Data Sources
      2. Processing Dimensions
      3. Changes in Dimension Data
      4. Processing Facts
      5. Integrating OLAP Processing into the ETL System
      6. OLAP Wrap-up

6.0 Introduction
Fact tables hold the measurements
of an enterprise.
Fact tables contain measurements,
and dimension tables contain the
context surrounding measurements.

6.1. The Basic Structure of a Fact Table
Every fact table is defined by the grain of the table (the grain of the fact table is the definition of the measurement event).
All fact tables possess a set of foreign keys connected to the dimensions that provide the context of the fact table measurements, as sketched below.
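A minimal sketch of such a structure, using illustrative table and column names (not taken from the book), might look like:

create table sales_fact (
  date_key      integer      not null,   -- foreign key to date_dimension
  product_key   integer      not null,   -- foreign key to product_dimension
  store_key     integer      not null,   -- foreign key to store_dimension
  quantity      number(9),               -- numeric measurement (fact)
  sales_amount  number(12,2)             -- numeric measurement (fact)
);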

6.1 (cont.)

6.2. Guaranteeing Referential Integrity
There are only two ways to violate referential integrity in a dimensional schema:
1. Load a fact record with one or more bad foreign keys.
2. Delete a dimension record whose primary key is being used in the fact table.

6.2 (cont.)
There are three main places in the ETL pipeline where referential integrity can be enforced:
1. Careful bookkeeping and data preparation just before loading the fact table records into the final tables, coupled with careful bookkeeping before deleting any dimension records
2. Enforcement of referential integrity in the database itself at the moment of every fact table insertion and every dimension table deletion (a constraint sketch follows this list)
3. Discovery and correction of referential integrity violations after loading has occurred by regularly scanning the fact table, looking for bad foreign keys
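For the second option, the database itself enforces the relationship through declared foreign keys. A minimal sketch, assuming Oracle-style constraint syntax; table and constraint names are illustrative:

alter table sales_fact
  add constraint fk_sales_product
  foreign key (product_key) references product_dimension (product_key);

-- constraints are commonly disabled before a bulk load and re-enabled
-- (with validation) afterward so they do not slow the load itself
alter table sales_fact disable constraint fk_sales_product;
-- ... bulk load ...
alter table sales_fact enable validate constraint fk_sales_product;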

6.2 (cont.)

6.2 (cont.)
The queries checking referential integrity must be of the form:

select f.product_key
from fact_table f
where f.product_key not in
  (select p.product_key from product_dimension p)

6.3. Surrogate Key Pipeline


All records to be loaded into the fact
table are current
When building a fact table, the final
ETL step is converting the natural
keys in the new input records into
the correct, contemporary surrogate
keys.

6.3 (cont.)

6.3.1. Using the Dimension Instead of a Lookup Table
The lookup table approach works best when the overwhelming fraction of fact records processed each day are contemporary.
But what happens when a significant number of fact records are late arriving?
In that case, the dimension itself must be the source for the correct surrogate key, as sketched below.
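For contemporary fact records, the lookup amounts to joining on the natural key and picking the current dimension row. A minimal sketch, assuming each Type 2 dimension flags its most recent row; all table and column names are illustrative:

insert into sales_fact (date_key, product_key, store_key, quantity, sales_amount)
select d.date_key,
       p.product_key,
       s.store_key,
       stg.quantity,
       stg.sales_amount
from   stage_sales_fact stg
join   date_dimension    d on d.full_date           = stg.transaction_date
join   product_dimension p on p.product_natural_key = stg.product_code
                          and p.current_row_flag    = 'Y'   -- most recent Type 2 row
join   store_dimension   s on s.store_natural_key   = stg.store_number
                          and s.current_row_flag    = 'Y';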

6.4. Fundamental Grains


Every fact table should have one,
and only one, grain.
The three kinds of fact tables are: the
transaction grain, the periodic
snapshot, and the accumulating
snapshot.

6.4.1. Transaction Grain Fact Tables
The transaction grain represents an instantaneous measurement at a specific point in space and time.

The tiny atomic measurements typical of transaction grain fact tables have a large number of dimensions.
In environments like a retail store, there may be only one transaction type (the retail sale) being measured.

6.4.1. (cont.)

6.4.1. (cont.)

6.4.2. Periodic Snapshot Fact Tables
The periodic snapshot represents a
span of time, regularly repeated.
The date dimension in the periodic
snapshot fact table refers to the
period.
Periodic snapshot fact tables have
completely predictable sparseness

6.4.2. Periodic Snapshot Fact Tables (cont.)
Periodic snapshot fact tables have
similar loading characteristics to
those of the transaction grain tables
However, there are two somewhat
different strategies for maintaining
periodic snapshot fact tables.

6.4.3. Accumulating Snapshot Fact Tables

The accumulating snapshot fact table is used to describe processes that have a definite beginning and end, such as order fulfillment, claims processing, and most workflows.
The grain of an accumulating snapshot fact table is the complete history of an entity from its creation to the present moment.
Accumulating snapshot fact tables have several unusual characteristics.

6.4.3. (cont.)

6.5. Preparing for Loading Fact Tables
How to build efficient load processes
and overcome common obstacles

6.5.1. Managing Indexes


Indexes are performance enhancers
at query time, but they are
performance killers at load time.

6.5.1. Managing Indexes (cont.)
In a nutshell, perform the steps that follow to prevent table indexes from causing a bottleneck in your ETL process (a SQL sketch appears after the list):
1. Segregate updates from inserts.
2. Drop any indexes not required to support
updates.
3. Load updates.
4. Drop all remaining indexes.
5. Load inserts (through bulk loader).
6. Rebuild the indexes.
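A minimal sketch of steps 2, 4, and 6, assuming Oracle-style DDL and a bitmap index on one of the fact table's foreign keys; all names are illustrative:

drop index sales_fact_product_idx;

-- ... load updates, then load inserts through the bulk loader ...

create bitmap index sales_fact_product_idx
  on sales_fact (product_key)
  nologging;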

6.5.2. Managing Partitions


Partitions allow a table (and its
indexes) to be physically divided into
minitables for administrative purposes
and to improve query performance.
The partitions of a table are under the
covers, hidden from the users.
Only the DBA and ETL team should be
aware of partitions.

6.5.2. Managing Partitions (cont.)
The most common partitioning strategy on
fact tables is to partition the table by the
date key.
Tables that are partitioned by a date
interval are usually partitioned by year,
quarter, or month.
Unless your DBA team takes a proactive
role in administering your partitions, the
ETL process must manage them.

6.5.2. Managing Partitions (cont.)
Suppose your fact table is partitioned by year and the first three years are created by the DBA team. When you attempt to load any data after December 31, 2004, in Oracle you receive the following error:

ORA-14400: inserted partition key is beyond highest legal partition key

At this point, the ETL process has a choice:


Notify the DBA team, wait for them to manually
create the next partition, and resume loading.
Dynamically add the next partition required to
complete loading.

6.5.2. Managing Partitions (cont.)
To decide whether a new partition is needed, compare the highest date key waiting in the staging table with the highest partition boundary defined on the fact table:

select max(date_key) from stage_fact_table

compared with

select high_value
from all_tab_partitions
where table_name = 'FACT_TABLE'
and partition_position =
  (select max(partition_position)
   from all_tab_partitions
   where table_name = 'FACT_TABLE')
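If the incoming data exceeds the highest boundary, the ETL process can create the next partition itself. A minimal sketch, assuming an Oracle fact table range-partitioned on a DATE column; the partition and table names are illustrative:

alter table fact_table
  add partition fact_2005
  values less than (to_date('2006-01-01', 'YYYY-MM-DD'));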

6.5.3. Outwitting the Rollback Log
Reasons why the data warehouse does not need rollback logging include:
All data is entered by a managed process: the ETL system.
Data is loaded in bulk.
Data can easily be reloaded if a load
process fails.

6.5.4. Loading the Data


The initial load of a new table has a
unique set of challenges. The primary
challenge is handling the one-time
immense volume of data.
Separate inserts from updates
Utilize a bulk-load utility
Load in parallel
Minimize physical updates
Build aggregates outside of the database

6.5.5. Incremental Loading


The incremental load is the process
that occurs periodically to keep the
data warehouse synchronized with its
respective source systems.
ETL routines that load data
incrementally are usually a result of
the process that initially loaded the
historic data into the data warehouse.

6.5.6. Inserting Facts


Fact tables are too immense to
process via SQL INSERT statements.

6.5.7. Updating and Correcting Facts
Handle data corrections in the data
warehouse in three essential ways:
Negate the fact.
Update the fact.
Delete and reload the fact.

6.5.8. Negating Facts


The negative measures in the reversing fact table record cancel out the original record, as sketched below.
Many reasons exist for negating an
error rather than taking other
approaches to correcting fact data.
Other reasons for negating facts,
instead of updating or deleting, involve
data volume and ETL performance.
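A minimal sketch of inserting a reversing record, assuming an additive sales_fact and a hypothetical transaction_id column that identifies the erroneous row; all names are illustrative:

insert into sales_fact (date_key, product_key, store_key, transaction_id,
                        quantity, sales_amount)
select date_key, product_key, store_key, transaction_id,
       -quantity, -sales_amount          -- negated measures cancel the original
from   sales_fact
where  transaction_id = :bad_transaction_id;
-- the corrected record, if any, is then inserted as a normal new fact row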

6.5.9. Updating Facts


Updating data in fact tables can be a
process-intensive endeavor.
The best approach to updating fact
data is to REFRESH the table via the
bulk-load utility.

6.5.10. Deleting Facts


Deleting facts from a fact table is
forbidden in data warehousing.
If your business requires deletions,
two ways to handle them exist:
Physical deletes
Logical deletes

6.5.11. Physically Deleting Facts
Physically deleting facts means data is permanently removed from the data warehouse.
Be careful when users say they never want to see deleted data:
Never: users tend to think in terms of today's requirements, because those are based on their current way of thinking about the data they use.
See: users often have no idea what exists in the raw data.

6.5.11. Physically Deleting Facts (cont.)

insert into deleted_rows nologging
select * from prior_extract
MINUS
select * from current_extract

6.5.12. Logically Deleting Facts
A logical delete entails the utilization of an additional column named deleted (a sketch follows).
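A minimal sketch, assuming a simple flag column and a hypothetical transaction_id that identifies the row to hide; all names are illustrative:

alter table sales_fact add (deleted char(1) default 'N');

update sales_fact
set    deleted = 'Y'
where  transaction_id = :deleted_transaction_id;

-- user queries (or a view over the fact table) then filter on deleted = 'N'
create or replace view sales_fact_active as
  select * from sales_fact where deleted = 'N';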

6.6. Factless Fact Tables


The grain of every fact table is a
measurement event.

6.6. Factless Fact Tables (cont.)

Another common type of factless fact table represents coverage, for example recording which products were on promotion in which stores, whether or not they sold.

6.7. Augmenting a Type 1 Fact Table with Type 2 History
In a Type 1 dimension, the natural
key and the primary key of the
dimension have a 1-to-1 relationship.
In many of these Type 1
environments, there is a desire to
access the customer history for
specialized analysis.

6.7. Augmenting a Type 1 Fact Table with Type 2 History (cont.)
Three approaches can be used in this case:
1. Maintain a full Type 2 dimension off to the
side
2. Build the primary dimension as a full Type
2 dimension.
3. Build the primary dimension as a full Type
2 dimension and simultaneously embed
the natural key of the dimension in the
fact table alongside the surrogate key

6.8. Graceful Modifications


There are four types of graceful
modifications to dimensional schemas:
1. Adding a fact to an existing fact table at the
same grain
2. Adding a dimension to an existing fact table
at the same grain
3. Adding an attribute to an existing dimension
4. Increasing the granularity of existing fact
and dimension tables.

6.8. Graceful Modifications (cont.)

6.8. Graceful Modifications (cont.)

The first three types raise the issue of how to populate the old history of the tables prior to the addition of the fact, dimension, or attribute.

6.8. Graceful Modifications (cont.)
When the change is valid only from today forward, we handle the first three modifications as follows (a sketch of the first case appears after the list):
1. Adding a Fact
2. Adding a Dimension
3. Adding a Dimension Attribute
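A minimal sketch of case 1 (adding a fact at the same grain), assuming Oracle DDL; the table and column names are illustrative:

alter table sales_fact add (discount_amount number(12,2));

-- historical rows keep NULL in the new column, which aggregates
-- gracefully because SUM() ignores NULLs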

6.9. Multiple Units of Measure in a Fact Table

6.10. Collecting Revenue in Multiple Currencies
Multinational businesses often book
transactions, collect revenues, and pay
expenses in many different currencies

6.11. Late Arriving Facts
If we have been time-stamping the dimension records in our Type 2 SCDs, our processing involves the following steps (a sketch of Step 1 appears after the list):
1. For each dimension, find the corresponding dimension record in effect at the time of the purchase.
2. Using the surrogate keys found in each of the dimension records from Step 1, replace the natural keys of the late-arriving fact record with the surrogate keys.
3. Insert the late-arriving fact record into the correct physical partition of the database containing the other fact records from the time of the late-arriving purchase.
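A minimal sketch of Step 1 for one dimension, assuming the Type 2 customer dimension carries row_effective_date and row_end_date columns; all names are illustrative:

select customer_key
from   customer_dimension
where  customer_natural_key = :incoming_natural_key
and    :purchase_date between row_effective_date and row_end_date;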

6.12. Aggregations
The single most dramatic way to affect performance in a large data warehouse is to provide a proper set of aggregate (summary) records that coexist with the primary base records.
Aggregate navigation is a standard data warehouse topic that has been discussed extensively in the literature.

6.12. Aggregations (cont.)


In a properly designed data warehouse
environment, multiple sets of aggregates are
built, representing common grouping levels
within the key dimensions of the data warehouse.
Aggregate navigation has been defined and
supported only for dimensional data warehouses.
There is no coherent approach for aggregate
navigation in a normalized environment.
An aggregate navigator is a piece of middleware
that sits between the requesting client and the
DBMS.

6.12. Aggregations (cont.)

An aggregate navigator intercepts the client's SQL and, wherever possible, transforms base-level SQL into aggregate-aware SQL (an example rewrite follows).
The aggregate navigator understands how to transform base-level SQL into aggregate-aware SQL because the navigator uses special metadata that describes the data warehouse aggregate portfolio.
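A hedged illustration of such a rewrite; all table and column names are hypothetical:

-- base-level query issued by the client:
select p.category, sum(f.sales_amount)
from   sales_fact f
join   product_dimension p on f.product_key = p.product_key
group  by p.category;

-- equivalent query the navigator might substitute, using a category-level
-- aggregate fact table and its shrunken product dimension:
select c.category, sum(a.sales_amount)
from   sales_fact_by_category a
join   category_dimension c on a.category_key = c.category_key
group  by c.category;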

6.12. Aggregations (cont.)

6.12. Aggregations (cont.)


A good aggregate program for a large
data warehouse should:
1. Provide dramatic performance gains for as
many categories of user queries as possible
2. Add only a reasonable amount of extra
data storage to the warehouse.
3. Be completely transparent to end users
and to application designers, except for the
obvious performance benefits.

6.12. Aggregations (cont.)


A good aggregate program for a large
data warehouse should:
4. Affect the cost of the data-extract
system as little as possible.
5. Affect the DBAs administrative
responsibilities as little as possible.

6.12. Aggregations (cont.)

6.12.1. Design Requirement #1

Aggregates must be stored in their own fact tables, separate from base-level data. Each distinct aggregation level must occupy its own unique fact table.

The separation of aggregates into their own fact tables is very important and has a whole series of beneficial side effects.

6.12.2. Design Requirement #2


The dimension tables attached to the
aggregate fact tables must, wherever
possible, be shrunken versions of the
dimension tables associated with the
base fact table.

6.12.2. Design Requirement #2 (cont.)

6.12.3. Design Requirement #3


The base fact table and all its related
aggregate fact tables can be associated
together as a family of schemas so that
the aggregate navigator knows which
tables are related to each other.
The registration of this family of fact
tables, together with the associated
full-size and shrunken dimension
tables, is the sole metadata needed in
this design.

6.12.4. Design Requirement #4


Force all SQL created by any end user
or application to refer exclusively to
the base fact table and its associated
full-size dimension tables.
This design requirement pervades all
user interfaces and all end user
applications.

6.12.5. Administering Aggregations, Including Materialized Views

There are a number of different physical variations of aggregations, depending on the DBMS and the front-end tools.
There are two fundamental approaches to aggregate navigation at the time of this writing.

6.12.5. Administering Aggregations, Including Materialized Views (cont.)

Generally, a given aggregate fact table should be at least ten times smaller than the base fact table in order to make the tradeoff between administrative overhead and performance gain worthwhile (a materialized view sketch follows).
Large hardware-based, parallel-processing architectures gain exactly the same performance advantages from aggregates as uniprocessor systems with conventional disk storage, since the gain comes simply from reducing total I/O.
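In Oracle, one physical variation is a materialized view with query rewrite enabled, which lets the optimizer itself play the role of the aggregate navigator. A minimal sketch; the monthly grain and all names are illustrative:

create materialized view sales_by_month_mv
  build immediate
  refresh complete on demand
  enable query rewrite
as
select d.month_key, f.product_key, sum(f.sales_amount) as sales_amount
from   sales_fact f
join   date_dimension d on f.date_key = d.date_key
group  by d.month_key, f.product_key;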

6.13. Delivering Dimensional Data to OLAP Cubes
OLAP servers deliver two primary functions:
Query performance
Analytic richness

The best source for an OLAP cube is a dimensional data warehouse stored in an RDBMS.

6.13.1. Cube Data Sources


The various server-based OLAP products
and versions have different features.
If you are sourcing your cube from flat
files, one of the last steps in your ETL
process is, obviously, to write out the
appropriate datasets.
Analyze any queries that your cube
processing issues against the relational
data warehouse.

6.13.2. Processing Dimensions


Just as relational dimensions are
processed before the fact tables that
use them, so must OLAP dimensions
be processed before facts
Your ETL system design needs to be
aware of a few characteristics of
OLAP dimensions.

6.13.3. Changes in Dimension Data
OLAP servers handle different kinds of dimension-data changes completely differently.

Depending on the costs and system usage, you could decide to design your system to:
Let the OLAP and relational data warehouse databases diverge
Keep the OLAP cubes in synch with the relational data warehouse by halting relational processing as well
Keep the OLAP cubes in synch with the relational data warehouse by accepting the expensive reprocessing operation during a nightly load

6.13.4. Processing Facts


Server-based OLAP products are capable
of managing very large volumes of data
and are increasingly used to hold data at
the same grain as the relational data
warehouse.
Most server-based OLAP products support
some form of incremental processing;
others support only full processing.
Loading into a partition is an appealing way to load a subset of the cube's data.

6.13.4. Processing Facts (cont.)


Partitioning the OLAP cube can
provide significant benefits for both
query and processing performance.
Some OLAP servers also support true
incremental fact processing.

6.13.5. Integrating OLAP Processing into the ETL System
Technologies that include ETL and OLAP
offerings from a single vendor provide
more elegant integration.
Many systems can fully process the entire
OLAP database on a regular schedule,
usually weekly or monthly.
For larger systems, a common integration
structure includes adding the OLAP
dimension processing as the final step in
each dimension table processing branch,
module, or package in your ETL system.

6.13.6. OLAP Wrap-up


If an OLAP database is part of your
data warehouse system, it should be
managed rigorously

Summary
We defined the fact table as the
vessel that holds all numeric
measurements of the enterprise.
We saw that referential integrity is
hugely important to the proper
functioning of a dimensional schema,
and we proposed three places where
referential integrity can be enforced.

Summary (cont.)
We showed how to build a surrogate key pipeline for data warehouses that accurately track the historical changes in their dimensional entities.
We described the structure of the
three kinds of fact tables: transaction
grain, periodic snapshot grain, and
accumulating snapshot grain

Summary (cont.)
We then proposed a number of specific
techniques for handling graceful
modifications to fact and dimension
tables, multiple units of measurement,
late-arriving fact data, and building
aggregations.
We concluded with a specialized section on loading OLAP cubes.
