October 2006
Note:
This document is for informational purposes. It is not a commitment to deliver
any material, code, or functionality, and should not be relied upon in making
purchasing decisions. The development, release, and timing of any features or
functionality described in this document remains at the sole discretion of Oracle.
This document in any form, software or printed matter, contains proprietary
information that is the exclusive property of Oracle. This document and
information contained herein may not be disclosed, copied, reproduced, or
distributed to anyone outside Oracle without prior written consent of Oracle. This
document is not part of your license agreement nor can it be incorporated into
any contractual agreement with Oracle or its subsidiaries or affiliates.
INTRODUCTION
In many organizations, information requirements are becoming more advanced, and
delivering on them in a timely manner requires an adaptive architecture. This
forces organizations to put more emphasis on planning ahead and on implementing
more advanced concepts in production environments.
The specific solutions we are discussing in this paper are technically elegant and
effective, but need to achieve real business goals to warrant their development
time. For Elizabeth Arden Inc. the goals were concrete and tangible:
· Ensure business users own good as well as bad data, allowing for better
overall information quality and clear information ownership in the
business.
· Empower business users to create their own complex reporting solutions
reducing the involvement of IT developers and reducing the cost to the
business.
· Allow for quick and easy conversion of outside or external data reducing
the time and effort required to integrate these new data sets.
Ultimately, if all of these goals are achieved, Elizabeth Arden gains better
information quality, faster turnaround cycles for adding or moving new data, and
better decisions made directly by users. The company also offloads some
reporting-related effort from the IT staff, removing bottlenecks from the
reporting process for end users.
In this paper we discuss three advanced concepts that are useful in any data
warehouse situation:
· In place rejection and realignment of data elements to allow for business
ownership of all data.
· Embedded special time constructs to give business users easy access to
pre-built reporting entities.
· Dimension-based conversion to integrate outside or external data quickly and
non-destructively.
IN PLACE REJECTION AND REALIGNMENT
Concepts
In a dimensional model, the fact table is linked to the dimensions by key
values. For a fact record to be loaded into the fact table, every dimension key
must find a match; otherwise the constraints reject the record and it is not
loaded. To load facts, therefore, the dimensions must already hold the exact
data elements that the new fact records reference.
If a fact record contains bad data (for at least one of the dimensions it
references), the normal process in a relational database is to reject the
record. Rejecting the record, however, does not expose it to the business
user. In many cases it ends up in a pile of incorrect data that someone
should go through at some point in time. Experience shows that this task is
hardly ever completed in a timely manner.
Hence the concept of in place rejection is introduced. By rejecting the records
in place, within the fact table, the business users see the misaligned data in
every report and are consciously aware that the bad data is there and how much
of it is lying around. They also see the business value (in dollars, units, or
whatever the measurement type is) of the misaligned data.
In place rejection
To place the bad data in the actual reporting system, in place rejection uses
generated dimension members to capture the bad data. For each dimension, a
member is introduced at the linked level (typically the lowest level of the
dimension) that then acts as a garbage collector.
All data that does not match any existing member is assigned to this garbage
collector value. The garbage collector value can either be a static value like
rejected, or a dynamic value like REJ_originalvalue. Alternatively, you can
log the entire input row and the new key value in a table for later processing
of reloads.
In all cases the garbage collector member rolls up to a distinct member at the
top of the hierarchy, and the grand total includes the in place rejected
records, so the total of a star always rolls up correctly.
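The loading step above can be sketched in a few lines. The sketch below uses SQLite for a runnable illustration; all table, column, and key names (d_product, c_sales, REJECTED) are invented assumptions, not the actual Elizabeth Arden schema.

```python
import sqlite3

# Minimal sketch of in-place rejection with a generated garbage collector member.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE d_product (product_key TEXT PRIMARY KEY)")
cur.executemany("INSERT INTO d_product VALUES (?)", [("A",), ("B",)])

# One generated member per dimension acts as the garbage collector.
GARBAGE_KEY = "REJECTED"
cur.execute("INSERT INTO d_product VALUES (?)", (GARBAGE_KEY,))

cur.execute("CREATE TABLE c_sales (product_key TEXT, sales_units INTEGER)")
known = {k for (k,) in cur.execute(
    "SELECT product_key FROM d_product WHERE product_key <> ?", (GARBAGE_KEY,))}

incoming = [("A", 100), ("B", 50), ("Z", 25)]  # "Z" has no dimension match
for key, units in incoming:
    # Instead of rejecting the mismatched record, load it against the
    # garbage collector so it stays visible in every report.
    cur.execute("INSERT INTO c_sales VALUES (?, ?)",
                (key if key in known else GARBAGE_KEY, units))

totals = dict(cur.execute(
    "SELECT product_key, SUM(sales_units) FROM c_sales GROUP BY product_key"))
print(totals)  # {'A': 100, 'B': 50, 'REJECTED': 25}
```

Note that the grand total (175 units) still balances to the source: the rejected units roll up under the garbage collector member instead of disappearing.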
In some cases in place rejection is caused not by bad data but by bad timing. In
that case the realignment happens without any interference, directly upon the
next load of the facts and dimensions. When bad data is truly the cause of
misalignment, realignment can only happen after the data is fixed and reloaded.
Regardless of the fix, the system must be built in such a way that the
realignment is detected and the data that was labeled as incorrect is realigned
in the database to be correct. It is crucial that the system does not double
count due to a previous error.
If there is no recoding (e.g. the record is still not found in the dimension) the
default key is loaded into the recoding table again.
Because the recoding table holds the identifying information for the entire fact
record (all of its logical key references) as well as the surrogate key for the
recoded dimension, facts now correctly roll up to their new dimension link.
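A hedged sketch of this realignment pass follows. The schema (d_product, recode_product, the default key -1) is invented for illustration and is not the paper's actual recoding table.

```python
import sqlite3

# Sketch of realignment through a recoding table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE d_product (product_sk INTEGER, product_code TEXT)")
cur.executemany("INSERT INTO d_product VALUES (?, ?)", [(1, "A"), (2, "B")])

# The recoding table stores the fact's logical key plus the surrogate key it
# was loaded under; -1 stands for the default (garbage collector) key.
cur.execute("CREATE TABLE recode_product (source_code TEXT, product_sk INTEGER)")
cur.execute("INSERT INTO recode_product VALUES ('Z', -1)")  # earlier rejection

# On the next load the dimension finally contains the missing member 'Z'.
cur.execute("INSERT INTO d_product VALUES (3, 'Z')")

# Realign: rejected rows whose code can now be resolved get their real
# surrogate key; rows still unresolved keep the default key, so no record
# is ever counted twice.
cur.execute("""
    UPDATE recode_product
    SET product_sk = COALESCE(
        (SELECT d.product_sk FROM d_product d
         WHERE d.product_code = recode_product.source_code), -1)
    WHERE product_sk = -1
""")
realigned = cur.execute(
    "SELECT source_code, product_sk FROM recode_product").fetchall()
print(realigned)  # [('Z', 3)]
```

If the code is still not found in the dimension, the COALESCE writes the default key back, matching the behavior described above.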
Benefits
By applying this strategy Elizabeth Arden achieves some of the following benefits
in either the infrastructure provided or in the business environment:
· No special structure is needed to store error data, which simplifies the data model.
· All data is visible in a single environment rather than in a special area for
erroneous data.
· Balancing data to the source is more efficient.
· Simplicity in dependencies between ETL jobs makes the loading process
simpler and less prone to delays.
EMBEDDED SPECIAL TIME
Concepts
The most important concept is pre-calculation, where a reporting entity is
pre-built for the users. A data warehouse is itself a good example of this
concept: instead of having each user join the tables to get the sales for all
customers over the last month out of the ERP system, the result is pre-built.
Embedded special time goes one step further. It creates more reporting
constructs up front, giving easy access to hard-to-get data.
Many options exist for pre-calculation. On-Line Analytical Processing
(OLAP) is one storage technique that heavily uses pre-built information. In this
section, however, we cover a strictly relational implementation, using two
distinct techniques to pre-build the special time constructs:
· Analytic SQL
· A bitmap table linking fact to dimensions
As with OLAP data, if you start to create facts that only exist in certain cells for
the dimensions, sparsity is the result. The data is sparse in that for certain
dimension combinations no fact is stored. The problem with sparse data is that it
increases storage and typically can decrease performance.
To avoid this sparsity in the query objects, an extra join is used to condense the
data set, and then analytic SQL functions enable the data to be shown to users as
embedded special time constructs.
A bitmap table is an intermediate table that links the fact table to the time
dimension allowing the facts to be looked at for different time constructs by pre-
joining the data set.
A bitmap table would look something like this (note this is not a complete table
but should convey the idea):
from_period  to_period  CP  LY_CP  LU_R2  LY_R3  QTD_CP  LY_QTD_CP
200701       200701      1      0      0      0       1          0
200601       200701      0      1      1      1       0          0
200512       200701      0      0      1      1       0          0
200511       200701      0      0      0      1       0          0
The columns in the bitmap table represent the special time constructs we want to
work with. The SQL will join on the rows to determine when there are values to
be shown leveraging the bitmap in the table.
As we said, there are two important concepts here. The first is the analytic
SQL itself. The fragment below shows a snippet from the select clause of the
statement; note the analytic SQL constructs.
Select
<snip>
The second part is the actual join used to densify the data. It literally
squeezes the air out of the result set and gives us better performance in
retrieving the results. That join is shown in the snippet below.
Select
<snip>
FROM Hybrid_Eaden.c_inventory_rpt fact
PARTITION BY (fact.company_id,fact.item_sku_id)
RIGHT OUTER JOIN Hybrid_Eaden.d_ea_month v_time
ON ( fact.ea_month_id = v_time.ea_month_id )
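The same densification can be demonstrated in a runnable form. Oracle's partitioned outer join (the PARTITION BY ... RIGHT OUTER JOIN above) has no SQLite equivalent, so the sketch below substitutes a cross join of the item list with the time dimension to build the same dense grid; table and column names are illustrative, not the paper's schema.

```python
import sqlite3

# Densification sketch: every (item, month) combination appears in the
# output, with 0 where no fact row exists.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE d_ea_month (ea_month_id INTEGER)")
cur.executemany("INSERT INTO d_ea_month VALUES (?)",
                [(200701,), (200702,), (200703,)])
cur.execute("CREATE TABLE c_inventory "
            "(item TEXT, ea_month_id INTEGER, units INTEGER)")
cur.executemany("INSERT INTO c_inventory VALUES (?, ?, ?)",
                [("sku1", 200701, 10), ("sku1", 200703, 5)])  # 200702 is sparse

dense = cur.execute("""
    SELECT i.item, t.ea_month_id, COALESCE(f.units, 0) AS units
    FROM (SELECT DISTINCT item FROM c_inventory) i
    CROSS JOIN d_ea_month t
    LEFT JOIN c_inventory f
      ON f.item = i.item AND f.ea_month_id = t.ea_month_id
    ORDER BY t.ea_month_id
""").fetchall()
print(dense)  # [('sku1', 200701, 10), ('sku1', 200702, 0), ('sku1', 200703, 5)]
```

With the gaps filled, analytic functions over the time axis see a value for every period instead of skipping missing rows.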
The bitmap table is linked from the fact table to the dimension. The bitmap
value (either a 1 or a 0) is then added into the select clause of the query. If
we look at our bitmap table again, and then at its usage in the select clause,
we can see where data is shown (there is a 1) or not shown (there is a 0) in
the results, allowing us to create a clean reporting solution.
, (fact.sales_units*time.ly_qtd_cp)
, (fact.sales_units*time.ly_tot_qtr_cp)
, (fact.sales_units*time.ly_tot_hyr_cp)
, (fact.sales_units*time.ly_tot_yr_cp)
, (fact.sales_units*time.ly_r3)
, (fact.sales_units*time.ly_r6)
, (fact.sales_units*time.ly_r12)
FROM c_global_sales_affiliate fact
, d_ea_month_matrix time
WHERE fact.ea_month_id = time.from_ea_month_id
The bitmap table is simply joined to the fact table, but the measures in the
fact are multiplied by the value in the bitmap.
This allows each measure to either hold data or hold a zero value. In this case
zero means that the cell is not applicable for the measure. Rolling up the
totals will then just add a zero value and not influence any totals.
In the case where the bitmap has a one, the value is shown in the measure and is
counted in rollups and in reporting. Users do not have to figure this out; they
simply see either a value or a zero value.
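The multiplication trick can be reproduced end to end. The sketch below echoes the fragment above with a two-flag month matrix; the data and the reduced set of measure columns are invented for illustration.

```python
import sqlite3

# Bitmap-table join: each measure is multiplied by the 0/1 flag, so a 1
# passes the value through and a 0 zeroes it out of that rollup.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE d_ea_month_matrix
               (from_ea_month_id INTEGER, to_ea_month_id INTEGER,
                cp INTEGER, ly_cp INTEGER)""")
cur.executemany("INSERT INTO d_ea_month_matrix VALUES (?, ?, ?, ?)",
                [(200701, 200701, 1, 0),   # current period
                 (200601, 200701, 0, 1)])  # same period last year
cur.execute("CREATE TABLE c_sales (ea_month_id INTEGER, sales_units INTEGER)")
cur.executemany("INSERT INTO c_sales VALUES (?, ?)",
                [(200701, 40), (200601, 30)])

rows = cur.execute("""
    SELECT t.to_ea_month_id,
           SUM(f.sales_units * t.cp)    AS cp_units,
           SUM(f.sales_units * t.ly_cp) AS ly_cp_units
    FROM c_sales f, d_ea_month_matrix t
    WHERE f.ea_month_id = t.from_ea_month_id
    GROUP BY t.to_ea_month_id
""").fetchall()
print(rows)  # [(200701, 40, 30)]
```

For reporting period 200701 the current-period column picks up only the 200701 sales, while the last-year column picks up only the 200601 sales, exactly as the flags dictate.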
DIMENSION-BASED CONVERSION
Concepts
The dimension-based conversion is based on a simple but elegant concept. Rather
than changing data structures or original data loaded into a dimension to convert
data, the rollup in a hierarchy manages the changes.
This concept has a number of very compelling benefits. One is that a change is
not permanent or destructive to the original data: the original record and
linkage are still in place; the data just rolls up differently. The second is
that such changes can typically be made quickly, with much less impact on the
system than a normal update.
The two benefits mentioned above are achieved by adding a level to each
dimension (or rather, to each hierarchy) that acts as the lowest level. This
lowest level is called the load level; in other words, fact data is always
loaded at this level in the hierarchy. The load level, however, is never
exposed to users of the reporting system. These users only see the so-called
reporting levels.
Many conversion methods are destructive. For example, fact data might be stored
at a higher grain, and the new data needs to be aggregated. Doing this in a
normal load converts the data once; if that turns out to be not completely
accurate, there is no easy fix, and if the source is lost by that time, there
is no fix at all. Realignment simply keeps the new records as is, in the grain
they arrived in. Overwriting is not what we want to do to our data.
An example
If we look at a table structure and data, this concept will become a lot clearer and
the benefits become visible instantly.
Let’s say we have loaded three products into the dimension table (in this example
we only show a few dimension columns). The result with the booked (which is
what the load level is called in this example) and reporting levels will look like this:
booked_products reported_products source_key
A               A                 STANDARD
C               C                 STANDARD
D               D                 STANDARD
Now let’s say we need to merge data from a new vendor into this dimension, and
we have the following mapping table that shows how a product from the new
vendor can be recoded to the Elizabeth Arden product codes (and therefore
hierarchies):
Vendor XYZ Product  Elizabeth Arden Product
123                 A
654                 A
234                 D
A                   C
With the levels as shown in Table 3 we can now quickly load the dimension and
see the following as a result after loading and recoding:
booked_products reported_products source_key
A               A                 STANDARD
C               C                 STANDARD
D               D                 STANDARD
123             A                 VENDORXYZ
654             A                 VENDORXYZ
234             D                 VENDORXYZ
A               C                 VENDORXYZ
We can now see where the big benefits come in. Table 5 shows both the original
data and the Elizabeth Arden product to the user. The revenue for this
dimension is now directly available in the relevant business units.
The other advantage appears when a recoding is required: if, for example,
product 234 should be linked to Elizabeth Arden product A, the dimension can be
updated instead of all related facts.
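The whole pattern can be sketched in a few lines. The schema mirrors the tables above; the sales figures are invented for illustration.

```python
import sqlite3

# Dimension-based conversion: facts stay at the load ("booked") level,
# while reports roll up through the reporting level (reported_products).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE d_product
               (booked_products TEXT, reported_products TEXT, source_key TEXT)""")
cur.executemany("INSERT INTO d_product VALUES (?, ?, ?)",
                [("A", "A", "STANDARD"), ("D", "D", "STANDARD"),
                 ("123", "A", "VENDORXYZ"), ("234", "D", "VENDORXYZ")])
cur.execute("CREATE TABLE c_sales (booked_products TEXT, units INTEGER)")
cur.executemany("INSERT INTO c_sales VALUES (?, ?)",
                [("A", 10), ("123", 5), ("234", 7)])

def report():
    # Users only ever query the reporting level.
    return cur.execute("""
        SELECT d.reported_products, SUM(f.units)
        FROM c_sales f JOIN d_product d
          ON f.booked_products = d.booked_products
        GROUP BY d.reported_products ORDER BY 1
    """).fetchall()

before = report()
print(before)  # [('A', 15), ('D', 7)]

# Recoding product 234 to A touches a single dimension row, not every fact.
cur.execute("UPDATE d_product SET reported_products = 'A' "
            "WHERE booked_products = '234'")
after = report()
print(after)  # [('A', 22)]
```

The original booked values and fact rows are untouched by the recoding; only the rollup changes, which is exactly the non-destructive behavior described above.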
Oracle Corporation
World Headquarters
500 Oracle Parkway
Redwood Shores, CA 94065
U.S.A.
Worldwide Inquiries:
Phone: +1.650.506.7000
Fax: +1.650.506.7200
www.oracle.com