
FUNDAMENTALS OF DATA WAREHOUSING

Unit 3 - Dimensional Modeling



2005 Evolve computer Solutions


Unit Outline
Unit 3
Overview of Modeling in the Warehouse Environment
Introduction to Data Mart / Star Schema Modeling
Extending the Dimensional Model
Demo: Dimensional Tools
  Modeling Tools
  Front End Tools & The Star-Schema


Overview of Modeling in the Warehouse Environment


Unit Learning Objectives


Review the basic architectural layers from a data modeler's perspective.
Highlight how modeling at each layer differs.


Industry & Academic Terms


Modeling for The Data Mart has several industry terms associated with it:
OLAP: Online Analytical Processing. The acronym was coined to distinguish decision support query/analysis from regular Online Transaction Processing (OLTP). The notion is that OLAP should strive to deliver analytical transactions with response times comparable to those we are used to experiencing with transaction processing.
Several acronyms derive from OLAP, each naming a different implementation of the same premise:
ROLAP: Relational OLAP
MOLAP: Multidimensional OLAP
HOLAP: Hybrid OLAP (Multidimensional / Relational)


Dimensional Model: A relational database design composed of dimensions and facts (also known as a Star Schema). A dimensional model is the foundation of data mart design and is typically the starting point of any OLAP solution.
Dimensional Bus Architecture: A data warehouse architecture pattern that uses reusable business dimensions that plug into different analytical subject areas (facts).
The term "bus" comes from the analogy to a hardware bus within a PC, into which any number of different cards can plug in and function.


Overview: Data Warehouse Architecture


The Data Warehouse Environment
(Figure: data flows from the source systems through ETL into the warehouse or staging area, then through ETL into the data marts, and finally to data access.)

Source Systems (Data Sources)

ETL
Data cleansing, matching, and consolidation.

Data Warehouse - or - Data Staging Area
Contains:
Time-capable data models (temporal).
A perfect historical record.
A unified view/definition of dimensional information.

Warehouse Meta-Data
Collection and maintenance of documentation to help the builders and users of the data warehouse.

Data Mart # 1, Data Mart # 2, ... Data Mart # n
Dimensional data models.

Data Access
Content development, multi-dimensional analysis, reporting, scorecards, web/portal, content delivery, email, wireless.


Understanding the Role of Each Layer: An Example


Consider a warehouse environment dealing with sales to customers. We examine how each layer handles the customer's address information:

Handling of Customer Address History


Source Systems
System allows part or all of a customer's address to be updated. Keeps track of the current address of the customer.

Data Warehouse - or - Data Staging Area
Historical versions of the address are maintained over time. The ETL process used to load the DW or DSA is designed to differentiate a meaningless address correction from a meaningful relocation.

Data Mart # 1, Data Mart # 2, ... Data Mart # n
Depending upon the analysis required, the user is able to perform sales analysis using the customer's current address or their address at different points in time. Flexibility is given to the user. Using the correct time frame of a customer address is made simple.


Modeling: Traditional Data Sources


Data Sources

The Data Source Model


Is designed for minimal redundancy (3rd normal form or greater).
Is designed to enforce very specific rules.
Will only store the history that is relevant to the operational system.
Is usually not designed with analytical/historical needs in mind.
Will change when rules surrounding the business process are changed.

Example: "Our system must track the customer's current legal address."
Supports the requirement of the operational system: "I need to track the customer's current address." As the address changes, previous addresses are lost.


Modeling & The Warehouse Layers

Q: How will we model each layer?

Source Systems
Standard 3rd or 4th normal form data model. Minimal redundancy.
Model remains unchanged.

Data Warehouse - or - Data Staging Area
Model to capture the history of anything that has direct/indirect analytical significance to our users.
Data Warehouse: A slightly flattened version of the original data source with temporal extensions.
Data Staging Area: Dimensions and facts capable of storing history.

Data Mart # 1, Data Mart # 2, ... Data Mart # n
Model to provide an optimal, easy-to-use data structure that interprets data from the required perspective(s).
Star Schema: A series of dimensional tables tied to a fact table using the required perspectives on the data.


Modeling & The Warehouse Layers

Source Systems
Source model remains unchanged.

Data Warehouse
Normalization is collapsed.
Temporal extensions added.
Natural keys replaced with surrogate keys.

How to Model the Data Staging Area?
There are different options/opinions on this. Generally the Data Staging Area is modeled dimensionally with temporal capabilities.

(Figure callouts: Temporal Elements, Surrogate Key)


Modeling & The Warehouse Layers


A data mart can interpret time as needed. Temporal information from the warehouse or staging is used to determine how to join a dimension to a fact.

ETL uses time/date comparisons to load and link the fact and dimension as required. Using this approach, a data mart can offer a variety of time/history perspectives.
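The time/date comparison described above can be sketched as follows. This is a minimal illustration only, using Python's built-in sqlite3 module. The table names (customer_version, sale), the column names, the sample rows, and the '9999-12-31' "still current" end date are all assumptions made for the example, not part of the course material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical warehouse/staging table: one row per version of a customer.
CREATE TABLE customer_version (
    customer_key   INTEGER PRIMARY KEY,   -- surrogate key
    customer_id    TEXT,                  -- natural key from the source
    city           TEXT,
    effective_date TEXT,
    end_date       TEXT
);
CREATE TABLE sale (sale_id INTEGER, customer_id TEXT, sale_date TEXT, revenue REAL);

INSERT INTO customer_version VALUES
    (1, 'C100', 'Toronto',   '2003-01-01', '2004-05-31'),
    (2, 'C100', 'Vancouver', '2004-06-01', '9999-12-31');
INSERT INTO sale VALUES
    (10, 'C100', '2004-03-15', 200.0),
    (11, 'C100', '2004-08-02', 350.0);
""")

# Perspective 1: link each sale to the address version in effect on the sale date.
as_at_sale = conn.execute("""
    SELECT s.sale_id, v.customer_key, v.city
    FROM sale s
    JOIN customer_version v
      ON v.customer_id = s.customer_id
     AND s.sale_date BETWEEN v.effective_date AND v.end_date
    ORDER BY s.sale_id
""").fetchall()

# Perspective 2: link every sale to the customer's current address.
as_at_today = conn.execute("""
    SELECT s.sale_id, v.customer_key, v.city
    FROM sale s
    JOIN customer_version v
      ON v.customer_id = s.customer_id
     AND v.end_date = '9999-12-31'
    ORDER BY s.sale_id
""").fetchall()

print(as_at_sale)
print(as_at_today)
```

The same two sales resolve to different surrogate keys depending on which comparison the load uses, which is how one warehouse can feed marts with different time perspectives.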


Modeling & The Warehouse Layers: Summary


The Sources
Have limits/problems:
Not good for analytical question/answer queries.
Don't store history particularly well (or at all).
Inconsistent representation of the same information (e.g., two sources of redundant customer information).

The Warehouse / Data Staging Area
Is designed to do three things for us:
Overcome the problems of historical data we see in the sources.
Consolidate and conform the same information even if it comes from different sources.
Provide a clean set of centralized data that makes it easy to build data marts.

The Mart
Is designed to:
Provide clarity and ease of analysis.
Reduce the potential for user error when performing analysis.
Provide query performance and accuracy.
Offer a unique/customized view of the data.
Offer the required time perspectives between dimensions and measures.


Dimensional Modeling
Basic Terms & Concepts


Analysis: Input to Modeling The Data Mart


Establishing A Subject Area


Prior to focusing on a single data mart, the practitioner has already worked with the organization to establish an initial subject area upon which to focus and build.

Gathering and Documenting Requirements


Getting a list of reports is a good starting point for understanding what the requirements are. DO NOT RELY SOLELY ON THIS!!
Establishing an understanding of the decision maker's role in the organization, and of the different ways that data can be analyzed to uncover opportunity or gain productivity, is essential.

Documenting Requirements
A list of sample reports? (good, but can often be misleading or short-sighted)
Use Cases? (format doesn't necessarily lend itself well to documenting flexible analytics)
Business Questions / Sample Analysis? (often too high level to exhibit meaning)
SUBJECT AREA PROFILE: Conceptually contains all of the above.
Describes strategic business objective, use, and context.
Contains sample reports or analysis.
References sample business questions / analysis that must be accommodated.
Contains a list of dimensions and measures required for analysis to be performed.
May contain known constraints or risks associated with different elements of data.


Inputs To Establishing the Dimensional Model


What do you need to establish a logical dimensional model?


A good understanding of the decision maker's perspective.
Why are they looking at the data?
What types of business processes will the analysis support?

A profile of the data mart requirements.
Sample questions / use cases.
Identified and well-defined dimensions and measures.

A good understanding of the source data.
Where is the grain of the fact in the source?
How is dimensional information maintained?
How/why does data change?
What limitations need to be considered due to source data quality?


Overview of Key Terms and Concepts


Dimensions
The descriptive information used in OLAP: Customers, Products, Time, etc.
Associated Terms:
Conformed Dimension: A dimension that is established as an enterprise definition.

Facts (Fact Tables)
The transactional information that forms the basis for analysis.
Associated Terms:
Measures
Key Performance Indicators

Star Schema
The star pattern which emerges when dimensions are connected to a central fact table.
Associated Terms:
Dimensional Model
ROLAP


Basic Terms & Concepts: Dimensions


A Dimension is
The what, when, where, who, and how information that is related to a measurable event (i.e., a fact).
An analytical perspective used by a decision maker to assess a pattern, exception, or trend.
A viewpoint taken when analyzing key performance indicators.

A Dimension Table
A table containing attributes (typically descriptive text elements) that collectively describe a single what, when, how, where, or who.

(Figure: the Sales Transaction Fact at the center of a star, connected to five dimensions.)
WHO? Customer Dimension
WHAT? Product Dimension
WHEN? Sale Date Dimension
WHERE? Geography Dimension
HOW? Sales Channel Dimension


Basic Terms & Concepts: Facts & Measures


A Fact is
A single event that occurs at the intersection of one or more dimensions that contains measurable attributes.
E.g., a sales fact occurs at a point in time and makes reference to customer, product, geography, and sales channel.

A fact contains one or more measurable attributes or measures that when summarized, provide a basis for decision support.

A Fact Table
A table containing multiple fact records, each of which relates to exactly one row in each dimension table (a one-to-one relationship from the perspective of the fact record).

(Figure: the Sales Tx Fact at the center of the star, connected to WHO? Customer Dimension, WHAT? Product Dimension, WHEN? Sale Date Dimension, WHERE? Geography Dimension, and HOW? Sales Channel Dimension.)
Measures: Quantity, Revenue, Cost, Profit, % Margin, # Sales Transactions (Count)
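To make the fact/dimension relationship concrete, here is a minimal sketch of a cut-down star schema using Python's built-in sqlite3 module. Only two of the five dimensions are included to keep it short. All table names, column names, and sample rows are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE product_dim  (product_key  INTEGER PRIMARY KEY, product_name  TEXT);

-- The fact holds only foreign keys and measures.
CREATE TABLE sales_fact (
    customer_key INTEGER NOT NULL REFERENCES customer_dim(customer_key),
    product_key  INTEGER NOT NULL REFERENCES product_dim(product_key),
    quantity     INTEGER,
    revenue      REAL,
    cost         REAL
);

INSERT INTO customer_dim VALUES (1, 'Acme'), (2, 'Baker');
INSERT INTO product_dim  VALUES (1, 'Tent'), (2, 'Sleeping Bag');
INSERT INTO sales_fact VALUES
    (1, 1, 2, 400.0, 300.0),
    (1, 2, 1,  40.0,  25.0),
    (2, 1, 1, 200.0, 150.0);
""")

# Summarize the measures from the "WHAT?" perspective (product).
rows = conn.execute("""
    SELECT p.product_name,
           SUM(f.quantity)              AS quantity,
           SUM(f.revenue)               AS revenue,
           SUM(f.revenue) - SUM(f.cost) AS profit,
           COUNT(*)                     AS sales_transactions
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()

for row in rows:
    print(row)
```

A ratio such as % Margin is best derived from the summed revenue and profit at query time, because summing stored percentages across rows does not give a meaningful result.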


Basic Terms & Concepts: Grain / Granularity


The term "grain" in dimensional modeling refers to a description of how detailed or granular the fact table is. This is somewhat analogous to defining the uniqueness of a row in a fact table, but relates more to how atomic the data is. A statement of the fact table's grain is a vital aspect of developing a dimensional model. Establishing the grain will help identify problems or limitations early in the modeling process.
Note: This term can also be used to describe dimensions, however, it is more commonly associated with defining the fact table.


Grain & Aggregates


Examples of Grain Statements:


A Fact Table where a row represents the most detailed sales information available.
GRAIN Statement: The grain of the fact table is equivalent to a detailed line item of an invoice. (one row in the fact is equivalent in detail to an invoice line item)

A Fact Table where a row represents a month end balance of a chequing/savings account.
GRAIN Statement: The grain of the fact table is equivalent to one row, per month, per account

Aggregation
When the grain of a fact table is intentionally set at a summary level, we can say that the data has been aggregated, or that the fact table's grain is an aggregate of more detailed data.
A common pattern in dimensional modeling is to have a detailed, fine-grain fact table with any number of aggregated/summarized versions of that same fact table to improve performance for certain types of audiences/analysis.
When an aggregated version of the fact table is used, one or more of the dimensions will either drop off or shift grain as well.
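The effect of shifting grain can be sketched as follows, assuming a hypothetical detailed fact at invoice line item grain that is summarized to one row per product per month. It uses Python's built-in sqlite3 module. The table names, sample rows, and the YYYYMMDD integer date key are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Detailed grain: one row per invoice line item.
CREATE TABLE sales_fact (
    date_key     INTEGER,   -- YYYYMMDD integer key (assumed convention)
    product_key  INTEGER,
    customer_key INTEGER,
    quantity     INTEGER,
    revenue      REAL
);
INSERT INTO sales_fact VALUES
    (20040604, 1, 1, 2, 400.0),
    (20040618, 1, 2, 1, 200.0),
    (20040702, 1, 1, 1, 200.0),
    (20040604, 2, 2, 3, 120.0);

-- Aggregate grain: one row per product per month.
-- The customer dimension drops off; the date dimension shifts from day to month.
CREATE TABLE sales_month_agg AS
SELECT date_key / 100 AS month_key,
       product_key,
       SUM(quantity)  AS quantity,
       SUM(revenue)   AS revenue,
       COUNT(*)       AS line_items
FROM sales_fact
GROUP BY date_key / 100, product_key;
""")

agg = conn.execute(
    "SELECT * FROM sales_month_agg ORDER BY month_key, product_key"
).fetchall()

for row in agg:
    print(row)
```

Questions about individual customers can no longer be answered from the aggregate alone, which is why the detailed fact table is kept alongside it.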


Grain & Aggregates


Sample Fact Table & Corresponding Aggregate: Shifting Grain for Performance

(Figure: a Detailed fact table shown beside its Summary aggregate.)
Important! Do not build a summary-only data mart, as this severely limits flexibility.


Guidelines: Dimensional Modeling


No outer joins! Every fact record has a relationship to every dimension. If a fact can legitimately exist without a piece of dimensional information, we attach that fact to a "Not Applicable" row in the dimension.
Large Fact, Small Dimensions! Typically (there are exceptions) the fact will contain the greatest number of rows in the schema.
One-to-Many Relationships ONLY! A dimension may only have a one-to-many relationship to the fact. Many-to-many relationships between a dimension and fact are not allowed.
The Dimension is typically a hierarchy! The perfect dimension has multiple levels of hierarchy, ideal for drilling down.
Only FKs and Measures! A perfect fact table has ONLY foreign keys and measures in it (again, there are exceptions).

(Figure: the Sales Transaction Fact connected to the Customer, Geography, Product, and Sale Date Dimensions.)


Guidelines: Dimensional Modeling


Inventory: What should be in/out of the model?


Only the elements that are required by the subject area.
We do not include data elements that have no meaning to the user (with the exception of necessary FK/PK columns). This includes:
(1) Primary keys or IDs which have no meaning to the user.
(2) Information that we need solely to populate the star schema.

*Example: If we needed account number in order to tie a customer to a fact, but the user clearly does not use account number in their work, we could safely exclude it from the model if we were concerned about table size or confusing the user.

*Tip: When in doubt, include the attribute. You can always suppress it from view when presenting the data.


Guidelines: Dimensional Modeling


No Outer Joins / Optional Relationships


Each dimension table will have a zero/one-to-many relationship to the fact table (a dimension row may relate to zero, one, or many fact rows).
No null values are allowed in foreign key columns on the fact.
If a fact record can legitimately exist without a dimensional record, then a "Not Applicable" record, which explains why the dimension is missing, is added to the dimension.
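A minimal sketch of the "Not Applicable" pattern follows, using Python's built-in sqlite3 module. The sales channel example, the reserved key 0, and all table and column names are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE sales_channel_dim (
    channel_key  INTEGER PRIMARY KEY,
    channel_name TEXT NOT NULL
);
CREATE TABLE sales_fact (
    -- NOT NULL: every fact row must point at some dimension row.
    channel_key INTEGER NOT NULL REFERENCES sales_channel_dim(channel_key),
    revenue     REAL
);
INSERT INTO sales_channel_dim VALUES
    (0, 'Not Applicable'),   -- reserved row for facts with no sales channel
    (1, 'Store'),
    (2, 'Web');
INSERT INTO sales_fact VALUES (1, 100.0), (2, 50.0), (0, 25.0);
""")

# A plain inner join keeps every fact row; no outer join is required.
by_channel = conn.execute("""
    SELECT d.channel_name, SUM(f.revenue)
    FROM sales_fact f
    JOIN sales_channel_dim d ON d.channel_key = f.channel_key
    GROUP BY d.channel_name
    ORDER BY d.channel_name
""").fetchall()

# A NULL foreign key is rejected by the NOT NULL constraint.
try:
    conn.execute("INSERT INTO sales_fact VALUES (NULL, 10.0)")
    null_key_rejected = False
except sqlite3.IntegrityError:
    null_key_rejected = True

print(by_channel, null_key_rejected)
```

Because the fact with no channel is attached to the reserved row, the grouped totals still add up to the full revenue in the fact table.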

Keep Fact Tables Pure


Fact tables should only contain foreign key columns and measures (some exceptions apply).

Always Use Surrogate Keys


Use Surrogate/Integer Keys to Create Joins between facts and dimensions
Surrogate keys using integers should be used for performance reasons.
Surrogate keys allow us to avoid compound keys, which are poor performers.
As we will see in a later unit, surrogate keys are fundamental to providing accurate historical views of data.


When You Can Store It, Don't Calculate It


Model items that the user frequently calculates or descriptions that are always shown as a single field (e.g., first/last name). This avoids additional processing at query time, producing better performance.


Guidelines: Dimensional Modeling


Bullet-Proof Analysis
The star schema design should never require that a user remember to filter something or manipulate their analysis in a certain way to avoid a mistake.
Avoid creating a situation in the model where a user may accidentally double count during analysis.
The design (and loading) of the model should highlight invalid or suspect data by attaching appropriate labels to it.
E.g., your design should allow for a "Product Not Found" row to be assigned to a fact record that arrives with a missing/bad product code, rather than discarding the fact.
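The "Product Not Found" idea can be sketched at load time as a key lookup with a fallback. This is a plain-Python illustration under stated assumptions: the reserved key -1, the product codes, and the function name to_fact_row are all hypothetical.

```python
PRODUCT_NOT_FOUND_KEY = -1  # hypothetical reserved surrogate key

# Dimension rows keyed by surrogate key, including the reserved row.
product_dim = {
    PRODUCT_NOT_FOUND_KEY: "Product Not Found",
    1: "Tent",
    2: "Sleeping Bag",
}

# Lookup from source product code to surrogate key.
product_key_by_code = {"TNT-01": 1, "SLP-02": 2}

incoming = [
    {"product_code": "TNT-01", "revenue": 200.0},
    {"product_code": "???",    "revenue": 35.0},   # bad code
    {"product_code": None,     "revenue": 10.0},   # missing code
]


def to_fact_row(record):
    """Resolve the surrogate key, falling back to the reserved row."""
    key = product_key_by_code.get(record["product_code"], PRODUCT_NOT_FOUND_KEY)
    return {"product_key": key, "revenue": record["revenue"]}


fact_rows = [to_fact_row(r) for r in incoming]

# Suspect revenue stays visible under its own label instead of vanishing.
revenue_by_product = {}
for row in fact_rows:
    name = product_dim[row["product_key"]]
    revenue_by_product[name] = revenue_by_product.get(name, 0.0) + row["revenue"]

print(revenue_by_product)
```

No fact is discarded, so grand totals reconcile with the source, and the size of the "Product Not Found" bucket gives users a direct view of the data quality problem.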

Avoid Tunnel Vision
Design as much future-proofing into the model as your project timeline will allow.
Use your understanding of the business's strategic direction to predict future requirements.

Use It While You Build It
Always budget time to prepare a set of sample analytical reports/queries at each iteration of the data mart model's design.
Help your users work with the data for real. This will help validate your design and will demonstrate to stakeholders that you are on the right track.
A solid approach to iterative construction allows the user to reliably use the model while you work on the next iteration.



Process / Steps
Step 1: Select the Subject Area
Step 2: Declare the Grain
Step 3: Choose the Dimensions
Step 4: Identify the Facts


Exercise 1
Creating A Simple Star-Schema Model


Exercise Overview
Objectives
Get a feel for data mart modeling.
Develop preliminary star schema designs from simple requirements.

Recognize
These are samples. We will not be able to truly complete the models, but we should be able to produce a good first cut.
We don't have true business context; we may need to fill in the blanks in the absence of a real user.
Much of the context will come from the data model of the source system. In reality, you would establish the requirements primarily from your business audience.


Exercise Material: High Level Requirements


Subject Area 1: Order Processing Volume Analysis
Target Audience: Store Operations Management & Staffing

Overview
Certa has a number of floating employees that can work in different stores. Generally, the floaters are used to bolster staffing levels when stores have spikes in the number of orders they take. Management would like to be able to develop a weekly clerk requirement forecast based on this week in history for each store location, so that they can assign the right number of floaters to each location. Each store has a base staff level (full time, regular employees) and a maximum number of staff it can handle. This, of course, is important information to consider. Experience of management suggests that on average there should be one staff person for every 150 orders processed in a day. Currently, floater staff assignment is done manually. Management has a series of spreadsheet reports with a number of calculations and manipulations.



Sample Analytical Business Questions


For a given store, what was the historical order volume for a given week?
Which stores look like they may have a shortage of staff for next week? Which appear to have the most critical/significant shortage forecast?
Which stores appear to be maxed out based on their forecasted transaction volume and maximum staffing levels?


Exercise Material: High Level Requirements


Subject Area 2: Invoiced Sales Analysis
Target Audience: Sales & Marketing

Overview
The sales management team at head office needs the ability to analyze trends and patterns in sales across the country. This information will be used to measure and monitor the performance of managers, clerks, stores, and regions, and to identify the products that are most profitable. The management team will use their analysis to set sales targets and identify possible incentives for certain products.

Sample Analytical Business Questions
What products show consistently high margins in each store and region?
Are there preferences for certain colors in certain regions of the country?
Who are our top ten customers in each province by quarter?
Which geographic regions are showing a steady increase/decrease for our new Outdoor Camping product division?
What types of trends exist between seasons and different product divisions?
Which managers have shown a consistent improvement in sales over the past 6 months for certain product groups?


Exercise 1: Questions / Tasks


1. Develop a Star Schema Model (on paper or using a diagramming tool) which contains all of the required dimensions and a single fact table with the appropriate measures.
To simplify our initial modeling exercise, make the following exclusions:
Exclude the customer category from your initial model.
Exclude tax-related items from the initial model.
Exclude the sales clerk from consideration.

2. Present an overview of your model, highlighting any trouble spots or challenges you detected.

3. Answer the following questions:
What is the grain of the fact table in your model?
Are there any business questions that your model would struggle or fail to answer correctly?


Exercise 1: Questions/Tasks
You may write your answers to the exercise questions below:

2. Highlighting any trouble spots or challenges you detected.

3. Answer the following questions:


What is the grain of the fact table in your model?


Unit III: Dimensional Modeling


Intermediate Topics


Unit Learning Objectives


Explore and Understand the following Topics:
Date Dimensions
Dimensional Role Playing
Slowly Changing Dimensions
Exceptions to the Rules: Examine common patterns / situations where it will be necessary to bend the rules of dimensional modeling.


Date Dimension

We often think of a calendar date as one item; in reality, a single date can represent five or more distinct elements (month, day, year, quarter, season).
A common dimension used to capture information about the date/time that a fact took place.
Contains a series of attributes designed to eliminate performance-robbing date parsing/manipulation at query time.
Typically uses the integer value of the date as the primary key (e.g., 20040604 is the key for June 4, 2004).
Contains the actual date (formatted as a date).
Contains a series of attributes that describe different aspects of the date.
Can be easily expanded to accommodate any future date needs of the business.

Date Dimension columns: Date (PK), JULIAN_DT, YEAR_NUM, MONTH_NUM, MONTH_NM, DAY_OF_MNTH, DAY_OF_WK, DAY_TEXT, DAY_OF_YR, DAYS_LEFT, QUARTER, WEEKDAY_FLAG, HOLIDAY_CAN
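As a rough sketch, rows for a date dimension like the one above could be generated as follows. The column meanings are assumptions inferred from the names: DAY_OF_WK is taken as ISO numbering (1 = Monday) and DAYS_LEFT as days remaining in the year. JULIAN_DT and HOLIDAY_CAN are left out because their exact definitions are not given. DATE_KEY, DATE_VALUE, and the function names are hypothetical.

```python
from datetime import date, timedelta


def date_dim_row(d):
    """Build one date dimension row; attribute meanings are assumed."""
    year_end = date(d.year, 12, 31)
    return {
        "DATE_KEY": d.year * 10000 + d.month * 100 + d.day,  # e.g. 20040604
        "DATE_VALUE": d,                      # the actual date
        "YEAR_NUM": d.year,
        "MONTH_NUM": d.month,
        "MONTH_NM": d.strftime("%B"),
        "DAY_OF_MNTH": d.day,
        "DAY_OF_WK": d.isoweekday(),          # assumed: 1 = Monday ... 7 = Sunday
        "DAY_TEXT": d.strftime("%A"),
        "DAY_OF_YR": d.timetuple().tm_yday,
        "DAYS_LEFT": (year_end - d).days,     # assumed: days left in the year
        "QUARTER": (d.month - 1) // 3 + 1,
        "WEEKDAY_FLAG": "Y" if d.isoweekday() <= 5 else "N",
    }


def build_date_dim(start, end):
    """Return one row per calendar day from start to end, inclusive."""
    rows = []
    d = start
    while d <= end:
        rows.append(date_dim_row(d))
        d += timedelta(days=1)
    return rows


date_dim = build_date_dim(date(2004, 1, 1), date(2004, 12, 31))
june_4 = next(r for r in date_dim if r["DATE_KEY"] == 20040604)
print(june_4)
```

Because every attribute is computed once at load time, queries can filter or group on QUARTER or WEEKDAY_FLAG directly without any date arithmetic.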


Dimensional Role Playing


In most business models, the same dimension plays multiple roles in relation to a fact table.
Examples:
A customer both pays for and receives a shipment of goods. An account manager can be the approver and the signing officer on a loan.

The most common, and simplest, example of role playing occurs with dates (e.g., an order date and an invoice date on the same fact, both referencing the Date Dimension).
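Role playing with dates can be sketched as a single physical date dimension that is joined twice under different aliases. The example uses Python's built-in sqlite3 module. The order date and invoice date roles, the table and column names, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, month_nm TEXT, quarter INTEGER);

-- Two foreign keys on the fact reference the same dimension table.
CREATE TABLE sales_fact (
    order_date_key   INTEGER NOT NULL REFERENCES date_dim(date_key),
    invoice_date_key INTEGER NOT NULL REFERENCES date_dim(date_key),
    revenue          REAL
);

INSERT INTO date_dim VALUES
    (20040329, 'March', 1),
    (20040402, 'April', 2),
    (20040604, 'June',  2);
INSERT INTO sales_fact VALUES
    (20040329, 20040402, 100.0),
    (20040402, 20040604, 250.0);
""")

# The same table plays two roles by being aliased twice in the query.
rows = conn.execute("""
    SELECT od.month_nm  AS order_month,
           idt.month_nm AS invoice_month,
           f.revenue
    FROM sales_fact f
    JOIN date_dim od  ON od.date_key  = f.order_date_key
    JOIN date_dim idt ON idt.date_key = f.invoice_date_key
    ORDER BY f.order_date_key
""").fetchall()

for row in rows:
    print(row)
```

In practice, each role is often exposed to users as a separately named view over the one table, so the two dates cannot be confused in a report.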


Slowly Changing Dimensions: What are they?


Slowly changing dimensions enable a star schema model to correctly and accurately represent changes to who/what/where and, if necessary, associate fact records with the correct version of a dimensional record.
The term "slowly changing" comes from the notion that it is relatively rare for dimensional rows to change, but when changes occur they may be significant to the business analysis at hand.
Examples of changes which would normally be handled by slowly changing dimensions:
Change in marital status for a customer.
Change of address / relocation of a large customer.
Credit rating of a customer.
Suggested retail price of a product/service.


Sample Analytical Questions that rely on Slowly Changing Dimensions


What were our sales like when the price of our sleeping bags was set below $35.00?
What is the 5 year trend of sales for products to married customers?
NOTE: In both cases, the answers to these questions would be incorrect if only current information was used to drive the reporting of the measures.


Types of Dimensions

Type 1 (No History)


Used in situations where an attribute that has been changed/corrected is overwritten with the new information.
Overwrites any historical information that was contained in this field.

Type 2 (Different Versions of Rows Are Stored: Temporal Rows)


A new record is inserted with the changed attribute(s) while the old record is maintained for historical purposes. Since this is a situation where we require that both records are present in the database, a different generated key is necessary for each record, even though the primary key from the source system has not changed.

Type 3 (Columns are added to track previous n values of a specific column)


Type 3 SCD uses an extra attribute(s) in the dimensional record to store the old value(s) of the changed attribute. This allows the system to track both the current value for the field and the n previous value(s) that the attribute held.


NOTE: Upon asking Ralph Kimball why the types were simply enumerated, he replied that we simply assigned numbers to them and never got around to giving them better taxonomy.


Type 1 Dimension: No History


Usually used for:


Information that was entered incorrectly originally.
Data that is of limited analytical value.
Generated primary key does not change.
The same record is maintained for the same customer.
The surrogate key exists for either (a) query efficiency or (b) future-proofing the data mart.

Customer_dm columns: PK_Customer, Customer_Id, Customer_First_Name, Customer_Last_Name, Customer_Full_Name, Customer_Address_1, Customer_Address_2, Customer_Address_3, Customer_Address_4, Customer_City, Customer_Province, Customer_Region, Customer_Country, Customer_Postal_Code, Customer_FSA, Customer_Phone, Customer_Fax, Customer_Email, Customer_Type, Customer_Birth_Date, Customer_Language, Customer_Status, Customer_Occupation

Data elements are overwritten with new values
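A Type 1 change is simply an in-place update. Here is a minimal sketch using Python's built-in sqlite3 module, with a cut-down Customer_dm table and an assumed misspelled last name as the sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dm (
    pk_customer        INTEGER PRIMARY KEY,   -- surrogate key, never changes
    customer_id        TEXT,                  -- key from the source system
    customer_last_name TEXT
);
INSERT INTO customer_dm VALUES (1, 'C100', 'Reynlods');  -- entered incorrectly
""")

# Type 1: overwrite the attribute; the old value is not kept anywhere.
conn.execute(
    "UPDATE customer_dm SET customer_last_name = ? WHERE customer_id = ?",
    ("Reynolds", "C100"),
)

type1_rows = conn.execute("SELECT * FROM customer_dm").fetchall()
print(type1_rows)
```

Existing fact rows keep pointing at the same surrogate key, so all history is now reported under the corrected value.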


Type 2 Slowly Changing Dimension


Usually used for:


Situations where important analytical information has changed.
Situations where history needs to be maintained.
A new customer record with a new key is created each time any relevant information changes.

Customer_dm columns: PK_Customer, Customer_Id, Customer_First_Name, Customer_Last_Name, Customer_Full_Name, Customer_Address_1, Customer_Address_2, Customer_Address_3, Customer_Address_4, Customer_City, Customer_Province, Customer_Region, Customer_Country, Customer_Postal_Code, Customer_FSA, Customer_Phone, Customer_Fax, Customer_Email, Customer_Type, Customer_Birth_Date, Customer_Language, Customer_Status, Customer_Occupation, Effective_Date, End_Date

The addition of date fields allows multiple customer records to exist for different time periods. It is also common to add an attribute that identifies which record is the current or latest version.
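One possible way to apply a Type 2 change is sketched below with Python's built-in sqlite3 module. The '9999-12-31' "still current" end date, the current_flag column, the half-open date ranges (a closed row ends on the day the next one starts), and the function name apply_type2_change are all assumptions. Real implementations vary.

```python
import sqlite3

HIGH_DATE = "9999-12-31"  # assumed convention for "still current"

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dm (
    pk_customer    INTEGER PRIMARY KEY AUTOINCREMENT,  -- new key per version
    customer_id    TEXT,                               -- unchanged source key
    customer_city  TEXT,
    effective_date TEXT,
    end_date       TEXT,
    current_flag   TEXT
);
INSERT INTO customer_dm
    (customer_id, customer_city, effective_date, end_date, current_flag)
VALUES ('C100', 'Toronto', '2003-01-01', '9999-12-31', 'Y');
""")


def apply_type2_change(conn, customer_id, new_city, change_date):
    """Close the current row, then insert a new version with a new key."""
    conn.execute(
        "UPDATE customer_dm SET end_date = ?, current_flag = 'N' "
        "WHERE customer_id = ? AND current_flag = 'Y'",
        (change_date, customer_id),
    )
    conn.execute(
        "INSERT INTO customer_dm "
        "(customer_id, customer_city, effective_date, end_date, current_flag) "
        "VALUES (?, ?, ?, ?, 'Y')",
        (customer_id, new_city, change_date, HIGH_DATE),
    )


apply_type2_change(conn, "C100", "Vancouver", "2004-06-01")

versions = conn.execute(
    "SELECT pk_customer, customer_city, effective_date, end_date, current_flag "
    "FROM customer_dm ORDER BY pk_customer"
).fetchall()

for row in versions:
    print(row)
```

Facts loaded before the change keep surrogate key 1 and facts loaded afterwards receive key 2, so sales are reported against the address that applied at the time.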


Type 3 Slowly Changing Dimension


Usually used for:


Situations where both the current and previous value need to be examined on the same record (from a single point in time).
The amount of historical context required is known to be fixed and predictable.
Simplifying the data mart design (tactical approach).

Customer_dm columns: PK_Customer, Customer_Id, Customer_First_Name, Customer_Last_Name, Customer_Full_Name, Customer_Address_1, Customer_Address_2, Customer_Address_3, Customer_Address_4, Customer_City, Customer_Province, Customer_Current_Region, Customer_Past_Region, Customer_Country, Customer_Postal_Code, Customer_FSA, Customer_Phone, Customer_Fax, Customer_Email, Customer_Type, Customer_Birth_Date, Customer_Language, Customer_Status, Customer_Occupation

Addition of a second region field allows for analysis of both the current and past regions while only evaluating one record. The amount of history is limited by the number of columns added by the designer.
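A Type 3 change can be sketched as shifting the current value into the "past" column before overwriting it. The example uses Python's built-in sqlite3 module with a cut-down Customer_dm table. The region values and the function name apply_type3_change are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dm (
    pk_customer             INTEGER PRIMARY KEY,
    customer_id             TEXT,
    customer_current_region TEXT,
    customer_past_region    TEXT
);
INSERT INTO customer_dm VALUES (1, 'C100', 'East', NULL);
""")


def apply_type3_change(conn, customer_id, new_region):
    """Move the current region into the past column, then overwrite it."""
    # In an UPDATE, the right-hand sides see the row's pre-update values.
    conn.execute(
        "UPDATE customer_dm "
        "SET customer_past_region = customer_current_region, "
        "    customer_current_region = ? "
        "WHERE customer_id = ?",
        (new_region, customer_id),
    )


query = "SELECT customer_current_region, customer_past_region FROM customer_dm"

apply_type3_change(conn, "C100", "West")
after_first = conn.execute(query).fetchone()

apply_type3_change(conn, "C100", "North")
after_second = conn.execute(query).fetchone()

print(after_first, after_second)
```

After the second change the original 'East' value is gone, which illustrates how the number of columns caps the history available.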


Degenerate Dimensions

A degenerate dimension has the following characteristics:


A who/what/when/where/how item which has only one attribute/column and has no other descriptive information or association to other dimensions.
A descriptive element which occurs at or very near the same grain as the fact table.
Is rarely (if ever) used to aggregate/group facts.
Typically these are unique identifiers relating to a transaction, such as receipt #, invoice #, etc.

Degenerate Dimensions, Once Identified, can be placed within the fact table as shown.
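A minimal sketch of a degenerate dimension follows: the invoice number is stored directly on the fact row, with no dimension table behind it. It uses Python's built-in sqlite3 module. The table names, column names, and invoice numbers are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT);

CREATE TABLE sales_fact (
    product_key    INTEGER NOT NULL REFERENCES product_dim(product_key),
    invoice_number TEXT NOT NULL,   -- degenerate dimension: lives on the fact
    quantity       INTEGER,
    revenue        REAL
);

INSERT INTO product_dim VALUES (1, 'Tent'), (2, 'Sleeping Bag');
INSERT INTO sales_fact VALUES
    (1, 'INV-1001', 1, 200.0),
    (2, 'INV-1001', 2,  80.0),
    (1, 'INV-1002', 1, 200.0);
""")

# The invoice number is used to look up a transaction, not to group facts.
invoice_lines = conn.execute("""
    SELECT p.product_name, f.quantity, f.revenue
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    WHERE f.invoice_number = 'INV-1001'
    ORDER BY p.product_name
""").fetchall()

for line in invoice_lines:
    print(line)
```

A separate invoice dimension would hold nearly one row per fact row and add a join without adding any descriptive attributes.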


Snowflake & The Star Schema


Breaking the Rules


What is a Snow-Flake?

A snowflake is a table connected to a dimensional arm of a star schema. The relationship between the dimension and the snowflake can be one-to-one, one-to-zero, or one-to-many.

They should be used when they accomplish the following goals:


Considerable space savings and performance improvement are achieved. Simplification of the logical model is required.**


** Some authors prescribe that use of snowflake designs provide logical simplicity to a model. I believe that their use should be exclusively reserved for making physical optimizations to a star schema model, or implementing a design when there are no other alternatives. - C. Sousa


Snowflakes: When are they dangerous?


A snowflake with a one-to-many relation to a dimension is technically an invalid design.
In some situations, there may be no other approach to the design.
Double counting can occur by accident unless the analyst understands the implications of using a snowflaked dimensional element.

When the snowflake is not used, the correct total is produced.


Because Reynolds belongs to both groups, their sales are double counted in the total. Strictly speaking, there is nothing wrong with this report; the danger arises when a total is applied and an incorrect interpretation is made.
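The double counting described above can be reproduced with a small sketch using Python's built-in sqlite3 module. Only the customer name Reynolds comes from the slide. The second customer, the group names, the revenue figures, and all table names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
-- The snowflake: one customer can appear in many groups (one-to-many).
CREATE TABLE customer_group (customer_key INTEGER, group_name TEXT);
CREATE TABLE sales_fact (customer_key INTEGER, revenue REAL);

INSERT INTO customer_dim VALUES (1, 'Reynolds'), (2, 'Baker');
INSERT INTO customer_group VALUES
    (1, 'Group A'), (1, 'Group B'),   -- Reynolds belongs to both groups
    (2, 'Group A');
INSERT INTO sales_fact VALUES (1, 100.0), (2, 50.0);
""")

# Without the snowflake, the total is correct.
true_total = conn.execute("SELECT SUM(revenue) FROM sales_fact").fetchone()[0]

# Joining through the snowflake repeats Reynolds' fact row once per group.
by_group = conn.execute("""
    SELECT g.group_name, SUM(f.revenue)
    FROM sales_fact f
    JOIN customer_dim c   ON c.customer_key = f.customer_key
    JOIN customer_group g ON g.customer_key = c.customer_key
    GROUP BY g.group_name
    ORDER BY g.group_name
""").fetchall()

total_of_groups = sum(revenue for _, revenue in by_group)

print(true_total, by_group, total_of_groups)
```

Each group subtotal is correct on its own, but adding the subtotals overstates revenue by the amount of Reynolds' sales.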


Snowflakes: When are they ok?


A snowflake design with a one-to-zero/one relation is permitted as a design alternative to make the model more efficient.
This design is optimal when dealing with large dimensions and sparse information (i.e., few records have data) in a number of attributes.
The normalization of sparse attributes implies that they are not used frequently, and when they are used, analysis is limited to those records that possess this information.
NOTE: This snowflake would perform poorly if an outer join was made between the snowflaked table and the customer dimension (i.e., if sales analysis of customers with demographics was needed along with customers that do not have demographics).


Snowflakes: When should they be used?


They should be used when they accomplish the following goals:
Considerable space savings and performance improvement are achieved.
Simplification of the logical model is required.**


** Some authors hold that snowflake designs bring logical simplicity to a model. Front-end tools can achieve simplicity without any need to normalize the model. I believe that their use should be reserved exclusively for making physical optimizations to a star schema model, or for implementing a design when there are no other alternatives.


Snowflakes: Common Mistakes


In the example shown below, the model becomes more cumbersome and no space savings are achieved. This design pattern is a common mistake made by designers new to dimensional modeling, to whom de-normalization feels unnatural.
Package_Type: PK_Product_Type, Product_Type_Description
Package_Category: PK_Product_Category, Product_Category_Description
Product_Dimension: PK_Product, Product_Number, Product_Name, Product_Description, Product_SKU, Product_Size, FK_Product_Type, FK_Product_Category
Fact_Table: FK_Product (FK), FK_Customer, FK_Location, FK_Order_Date, FK_Invoice_Date, Product_Number, Invoice_Number, Order_Number, Invoice_Quantity, Invoice_Product_Cost
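
One way to read the point above is that the type and category descriptions belong directly in the product dimension. The sketch below uses the table and column names from the example, but the flattened table name (Product_Dimension_Flat) and all sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaked design from the example (columns trimmed for brevity).
CREATE TABLE Package_Type (
    PK_Product_Type INTEGER PRIMARY KEY, Product_Type_Description TEXT);
CREATE TABLE Package_Category (
    PK_Product_Category INTEGER PRIMARY KEY, Product_Category_Description TEXT);
CREATE TABLE Product_Dimension (
    PK_Product INTEGER PRIMARY KEY,
    Product_Name TEXT,
    FK_Product_Type INTEGER,
    FK_Product_Category INTEGER
);

-- Hypothetical sample rows.
INSERT INTO Package_Type VALUES (1, 'Boxed'), (2, 'Bulk');
INSERT INTO Package_Category VALUES (1, 'Hardware');
INSERT INTO Product_Dimension VALUES (10, 'Widget', 1, 1), (11, 'Gadget', 2, 1);

-- De-normalized alternative: descriptions folded into the dimension.
CREATE TABLE Product_Dimension_Flat AS
SELECT p.PK_Product,
       p.Product_Name,
       t.Product_Type_Description,
       c.Product_Category_Description
FROM Product_Dimension p
JOIN Package_Type t ON t.PK_Product_Type = p.FK_Product_Type
JOIN Package_Category c ON c.PK_Product_Category = p.FK_Product_Category;
""")

flat_rows = conn.execute(
    "SELECT * FROM Product_Dimension_Flat ORDER BY PK_Product"
).fetchall()
print(flat_rows)
# [(10, 'Widget', 'Boxed', 'Hardware'), (11, 'Gadget', 'Bulk', 'Hardware')]
```

With the flattened dimension, a query reaches the type and category descriptions with a single join from the fact table instead of three.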


Helper Tables
Handling Many-To-Many Relations


The Problem with Many-to-Many Relationships


A many-to-many relationship between the fact and a dimension occurs when any number of rows in the dimension can be said to own or share the same fact.
Examples:
Banking: Many customers jointly hold one or more accounts. No distinction is made between account holders.
Insurance: Many individuals contribute to a jointly held policy/fund. No distinction is made between policy holders.

A many-to-many (M:N) relationship is not permitted in dimensional modeling. A helper table design is used to overcome this design challenge.

Customer_dm: PK_Customer, Customer_Id, Customer_Name, more customer information
Policy_Fact_Table: FK_Policy_Owner, FK_Date, FK_Location, Some_Measures

Example: John Smith, Jane Smith, John Smith Jr., etc. co-own Policy 234-89 (value: $250K).


Helper Tables & Many-to-Many Relationships


Helper tables are used to handle many-to-many relationships between the fact and a dimension when the number of dimension rows that can share a fact is unbounded.
A helper table requires a weighting factor to ensure that the measures in the fact are not double counted.
At query time (or in a view), the measure(s) in the fact table are multiplied by the weighting factor.


Example: John Smith, Jane Smith, and John Smith Jr. each own 33.33% of Policy 234-89 (value: $250K).
The weighting factor effectively splits ownership over the same fact to eliminate double counts.
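
A minimal sketch of the weighting factor at query time follows. Customer_dm, Policy_Fact_Table, and FK_Policy_Owner come from the slides; the helper table name (Policy_Owner_Helper), its columns, the Policy_Value measure, and the key values are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer_dm (PK_Customer INTEGER PRIMARY KEY, Customer_Name TEXT);
CREATE TABLE Policy_Fact_Table (FK_Policy_Owner INTEGER, Policy_Value REAL);
-- Hypothetical helper table: one row per (owner group, customer) pair.
CREATE TABLE Policy_Owner_Helper (
    FK_Policy_Owner INTEGER,
    FK_Customer INTEGER,
    Weighting_Factor REAL
);

INSERT INTO Customer_dm VALUES
    (1, 'John Smith'), (2, 'Jane Smith'), (3, 'John Smith Jr.');
INSERT INTO Policy_Fact_Table VALUES (23489, 250000.0);
INSERT INTO Policy_Owner_Helper VALUES
    (23489, 1, 1.0 / 3), (23489, 2, 1.0 / 3), (23489, 3, 1.0 / 3);
""")

rows = conn.execute("""
    SELECT c.Customer_Name,
           f.Policy_Value                      AS unweighted_value,
           f.Policy_Value * h.Weighting_Factor AS weighted_value
    FROM Policy_Fact_Table f
    JOIN Policy_Owner_Helper h ON h.FK_Policy_Owner = f.FK_Policy_Owner
    JOIN Customer_dm c ON c.PK_Customer = h.FK_Customer
    ORDER BY c.PK_Customer
""").fetchall()

# Without the weighting factor the policy is counted once per owner.
unweighted_total = sum(r[1] for r in rows)
# With the weighting factor the owners' shares add back up to the policy value.
weighted_total = sum(r[2] for r in rows)

print(unweighted_total)          # 750000.0
print(round(weighted_total, 2))  # 250000.0
```

The same multiplication could be placed in a view so that report writers never see the unweighted measure.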


Exercise 2: Intermediate Topics


Extending the Certa Sales Analysis Data Mart


Exercise 2: Questions

Question 1: Include customer category and clerk in the model.


Make any necessary revisions to the model.

Question 2: The CFO, upon seeing the new reports produced by Marketing & Sales, wishes to use the data mart to analyze and report on sales tax collection as well. Finance needs to analyze sales tax by product division and store geography; however, it does not need to analyze customers. Finance only requires monthly reports on taxation.
If possible, change the grain of the fact table to suit Finance's requirements. Add the necessary tax measures to the star schema model to accommodate these needs.
What is the problem with the tax measures? How can this problem be solved?


Exercise 2: Questions

Question 3: We have just discovered that some stores are recycling (reusing) manager and sales clerk IDs.
Alter the star schema as required to accommodate this new quirk coming from the source system.

Question 4: The Sales and Marketing team would like to accurately profile market share in different geographic areas.
Data showing total industry sales by year (including all competitors) by census boundaries was purchased. Marketing would like to express Certa's sales as a percentage of these sales figures.
Describe the best approach to accommodating these requirements, given that we have already designed a basic dimensional model for sales analysis.


Exercise 2: Questions
Question 5: A decision needs to be made on which line of product should receive an injection of R&D money. The decision will rest in part on a five-year trend of sales of the different product lines, analyzed by customer geography.
What dimensions would need to store history correctly in order to answer this question? Illustrate the changes/additions to the model that would be required.
