Table of Contents
1 DIMENSION MODELING
2 DIMENSIONAL MODELING SCHEMAS
  2.1 STAR SCHEMA
  2.2 SNOWFLAKE SCHEMA
3 DIMENSIONAL MODELING CONSTRUCTS
  3.1 DIMENSIONS
    3.1.1 Dimension Keys
    3.1.2 Slowly Changing Dimensions (SCDs)
    3.1.3 Rapidly Changing Dimensions (RCDs)
    3.1.4 Degenerate Dimensions
    3.1.5 Demographic Mini Dimension (or Sub Dimensions)
    3.1.6 Junk/Dirty Dimensions
    3.1.7 Conformed Dimensions
  3.2 FACTS
    3.2.1 Granularity
    3.2.2 Type of Fact table grains
    3.2.3 Type of Measures
    3.2.4 Factless Fact Table
    3.2.5 Aggregation
  3.3 DIMENSIONAL MODELING TERMINOLOGY
    3.3.1 Hierarchies
    3.3.2 Browsing
    3.3.3 Drilling
4 THE DIMENSIONAL MODELING DESIGN PROCESS
  4.1 FOUR PHASES OF KIMBALL'S APPROACH FOR DESIGNING DIMENSIONAL DATABASE
  4.2 THE DATA WAREHOUSE BUS ARCHITECTURE
5 ADVANCED DESIGN
  5.1 INDEXING
1 Dimension Modeling
Dimension Modeling is a database modeling technique often used in data warehousing and OLAP, and it differs from the entity-relationship (ER) technique used for normal transactional (OLTP) applications. Primarily used for designing data marts and data warehouses, Dimensional Modeling seeks to present the data in a standard, intuitive framework that allows for high-performance access.
BI Dimensional Modeling
It is inherently dimensional and adheres to a discipline that uses the relational model with some important restrictions. Simply put, it is a way to store data in multidimensional form in a two-dimensional relational database management system. When the database can be visualized as a cube of three (or more) dimensions, people can imagine slicing and dicing that cube along each of its axes, or dimensions, to look at the cells, or measures, inside the cube.
Figure 1. Multi Dimensional Model
Every dimensional model is composed of one dominant, central table with a multipart key, called the fact table, and a set of smaller tables called the dimension tables. Each dimension table has a single-part primary key that corresponds exactly to one of the components of the multipart key in the fact table. This characteristic star-like structure is often called a star join schema.
Fact Tables: Fact tables contain the detail information you want to look at, such as numbers of online items. Access control to sensitive information is maintained in fact tables. These tables can be very large, holding as many as several million rows of data.
Dimension Tables: Dimension tables are designed especially for selection and grouping. There is no access control on these tables and all users can view this information. These tables are much smaller than the fact tables and may contain 10,000 rows of data.
Physically, the database consists of a single fact table and a single table for each dimension. Each tuple in the fact table consists of pointers (foreign keys) to each of the dimension tables. Each dimension table consists of columns that correspond to attributes of the dimension.
Salient features of the star schema:
- Easy to understand
- Better performance, because it minimizes the number of joins
- Supports multi-dimensional analysis
- Allows relatively easy maintenance
- Recommended for most Decision Support System user tools
- Extensible design supports changing business requirements (debatable)
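The physical layout described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names, column names and sample rows are hypothetical and do not come from the figures in this document.

```python
import sqlite3

# Hypothetical star schema: one fact table plus one table per dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time    (time_key    INTEGER PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, city TEXT, state TEXT);

-- The fact table's multipart key is made of one foreign key per dimension.
CREATE TABLE fact_sales (
    time_key    INTEGER REFERENCES dim_time(time_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    units_sold  INTEGER,
    PRIMARY KEY (time_key, product_key, store_key)
);

INSERT INTO dim_time    VALUES (1, 'Jan', 2024), (2, 'Feb', 2024);
INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'), (2, 'Lamp', 'Home');
INSERT INTO dim_store   VALUES (1, 'Akron', 'OH'), (2, 'Austin', 'TX');
INSERT INTO fact_sales  VALUES (1, 1, 1, 10), (1, 2, 1, 3), (2, 1, 2, 7), (2, 2, 2, 4);
""")

# A star join: constrain and group on dimension attributes, sum the measure.
rows = conn.execute("""
    SELECT p.category, s.state, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_store   s ON s.store_key   = f.store_key
    GROUP BY p.category, s.state
    ORDER BY p.category, s.state
""").fetchall()

for row in rows:
    print(row)
```

Each dimension is reached with exactly one join from the fact table, which is the property the feature list refers to when it mentions a minimal number of joins.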
Snowflaking allows for easy update and load of data because redundancy is avoided to some extent, but browsing capabilities are greatly compromised. Snowflaking often becomes necessary when you need data for which there is a one-to-many relationship with a dimension table. Consolidating this data into the dimension table would necessarily lead to redundancy (this violates second normal form and produces a Cartesian product). This sort of redundancy can cause misleading results in queries, since the count of rows is artificially large (due to the Cartesian product).
A simple example of such a situation might be a "customer" dimension for which there is a need to store multiple contacts. If the contact information is brought into the customer table, there would be one row for each contact (i.e., one for each customer/contact combination). In this situation, it is better just to create a "contact" snowflake table with a foreign key to the customer. In general, it is better to avoid snowflaking if possible, but sometimes the consequences of avoiding it are much worse.
Figure 3. Snow Flake Schema
Note: Dimensions that are snowflaked are also called outrigger tables. The above figure has City Outrigger, State Outrigger and Region Outrigger tables.
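The customer/contact case can be illustrated with a small sqlite3 sketch. The tables and rows are hypothetical; the point is only to show how folding a one-to-many contact list into the customer dimension inflates a total, while a separate contact table leaves it intact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);

-- Snowflake (outrigger) table: many contacts per customer.
CREATE TABLE contact (
    contact_key  INTEGER PRIMARY KEY,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    phone        TEXT
);
CREATE TABLE fact_sales (customer_key INTEGER, amount INTEGER);

INSERT INTO dim_customer VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO contact VALUES (1, 1, '555-0100'), (2, 1, '555-0101'), (3, 2, '555-0199');
INSERT INTO fact_sales VALUES (1, 100), (2, 50);

-- The alternative being warned against: contacts folded into the dimension,
-- giving one row per customer/contact combination.
CREATE TABLE dim_customer_flat AS
SELECT c.customer_key, c.name, k.phone
FROM dim_customer c JOIN contact k ON k.customer_key = c.customer_key;
""")

def total_sales(dimension_table):
    # The table name is one of the two literals used below, never user input.
    sql = (
        "SELECT SUM(f.amount) FROM fact_sales f "
        f"JOIN {dimension_table} d ON d.customer_key = f.customer_key"
    )
    return conn.execute(sql).fetchone()[0]

snowflaked_total = total_sales("dim_customer")      # one dimension row per customer
flattened_total = total_sales("dim_customer_flat")  # Acme is counted once per contact

print(snowflaked_total, flattened_total)
```

With these sample rows the flattened dimension reports 250 instead of the true 150, because the customer with two contacts is joined to its sales twice.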
3.1 Dimensions
The dimension tables are where the attributes of the dimensions of the business are stored. The best attributes are textual and discrete and are used to constrain the fact table. Each of these textual descriptions helps describe a member of the respective dimension.
- They are the entry points into the fact tables. The dimensions determine the grain of the fact table and vice versa.
- A single primary key identifies each dimension record.
- They serve as the primary source of query constraints, groupings and report labels (row headers).
- They are relatively shallow in terms of rows but wide, with many large columns.
- They are not usually time dependent.
- They contain hierarchical relationships.
- Robust dimension attributes deliver analytic slicing and dicing capabilities.
- Dimension tables are de-normalized.
Examples of dimensions: Employee, Time, Product, Customer, etc.
Avoid smart keys
Keys where you can tell something about the record just by looking at the key are called smart keys.
Avoid production keys
Production keys should be avoided because:
- Production may reuse keys that it has purged but that you are still maintaining, as I described.
- Production may make a mistake and reuse a key even when it isn't supposed to. This happens frequently in the world of UPCs in the retail world, despite everyone's best intentions.
- Production may recompact its key space because it has a need to garbage-collect the production system. One of my customers was recently handed a data warehouse load tape with all the production customer keys reassigned!
- Production may legitimately overwrite some part of a product description or a customer description with new values but not change the product key or the customer key to a new value. You are left holding the bag and wondering what to do about the revised attribute values. This is the Slowly Changing Dimension crisis, which I will explain in a moment.
- Production may generalize its key format to handle some new situation in the transaction system. Now the production keys that used to be integers become alphanumeric. Or perhaps the 12-byte keys you are used to have become 20-byte keys.
- Your company has just made an acquisition, and you need to merge more than a million new customers into the master customer list. You will now need to extract from two production systems, but the newly acquired production system has nasty customer keys that don't look remotely like the others.
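One common way to insulate the warehouse from these problems is to assign its own meaningless integer (surrogate) keys during the load. The sketch below is a minimal, hypothetical illustration; the class name, the source-system labels and the idea of keying the lookup on a (source, production key) pair are assumptions, not something prescribed by the text above.

```python
from itertools import count

class SurrogateKeyMap:
    """Hypothetical lookup from (source system, production key) to a warehouse key."""

    def __init__(self):
        self._next_key = count(1)  # plain sequential integers with no meaning
        self._keys = {}

    def key_for(self, source, production_key):
        # Production keys are normalised to text so that integer and
        # alphanumeric formats can coexist in the same map.
        natural = (source, str(production_key))
        if natural not in self._keys:
            self._keys[natural] = next(self._next_key)
        return self._keys[natural]

keys = SurrogateKeyMap()
first = keys.key_for("crm", 1001)               # existing production system
merged = keys.key_for("acquired_erp", "C-1001")  # differently shaped key from an acquisition
again = keys.key_for("crm", 1001)               # same customer seen on a later load

print(first, merged, again)
```

Because the fact tables carry only the surrogate key, a change in the production key format or a second source system affects this lookup and nothing else.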
Type One (Overwriting history): A Type One change overwrites an existing dimensional attribute with new information. In the customer name-change example, the new name overwrites the old name, and the value for the old version is lost. A Type One change updates only the attribute, doesn't insert new records, and affects no keys.
Type Two (Preserving history): A Type Two change creates an additional dimension record at the time of the change with the new attribute values, thereby segmenting history very accurately between the old description and the new description. Implementing Type Two changes within a data warehouse might require significant analysis and development. Type Two changes partition history across time more accurately than the other types. However, because Type Two changes add records, they can significantly increase the database's size.
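The difference between the two change types can be sketched as follows. The customer table, the valid_from and valid_to housekeeping columns, and the convention that a NULL valid_to marks the current row are all assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    customer_id  TEXT,   -- production key
    name         TEXT,
    territory    TEXT,
    valid_from   TEXT,   -- assumed housekeeping column
    valid_to     TEXT    -- NULL marks the current row (assumed convention)
)""")
conn.execute(
    "INSERT INTO dim_customer (customer_id, name, territory, valid_from) "
    "VALUES ('C1', 'Ann Smith', 'East', '2020-01-01')"
)

def type_one_change(customer_id, new_name):
    """Overwrite in place: no new row, no key change, the old value is lost."""
    conn.execute(
        "UPDATE dim_customer SET name = ? WHERE customer_id = ?",
        (new_name, customer_id),
    )

def type_two_change(customer_id, new_territory, change_date):
    """Close the current row and add a new row under a new surrogate key."""
    old_key, name = conn.execute(
        "SELECT customer_key, name FROM dim_customer "
        "WHERE customer_id = ? AND valid_to IS NULL",
        (customer_id,),
    ).fetchone()
    conn.execute(
        "UPDATE dim_customer SET valid_to = ? WHERE customer_key = ?",
        (change_date, old_key),
    )
    cur = conn.execute(
        "INSERT INTO dim_customer (customer_id, name, territory, valid_from) "
        "VALUES (?, ?, ?, ?)",
        (customer_id, name, new_territory, change_date),
    )
    return cur.lastrowid

type_one_change("C1", "Ann Jones")                      # name is overwritten
new_key = type_two_change("C1", "West", "2024-06-01")   # territory history is kept
history = conn.execute("SELECT * FROM dim_customer ORDER BY customer_key").fetchall()

for row in history:
    print(row)
```

Facts loaded before the territory change keep pointing at the old surrogate key, while facts loaded afterwards use the new one, which is how history ends up partitioned.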
Type Three (Preserving a version of history): A Type Three change creates new "current" fields within the original dimension record to record the new attribute values, while keeping the original attribute values as well. History can then be described both forward and backward from the change, either in terms of the original attribute values or in terms of the current attribute values. You usually implement Type Three changes only if you have a limited need to preserve and accurately describe history, such as when someone gets married and you need to retain the previous name. Instead of creating a new dimensional record to hold the attribute change, a Type Three change places a value for the change in the original dimensional record. You can create multiple fields to hold distinct values for separate points in time. In the case of a name change, you could create an OLD_NAME and NEW_NAME field and a NAME_CHANGE_EFF_DATE field to record when the change occurs.
This method preserves the change. But how would you handle a second name change, or a third, and so on? The side effects of this method are increased table size and, more important, increased complexity of the queries that analyze historical values from these old fields. After more than a couple of iterations, queries become impossibly complex, and ultimately you're constrained by the maximum number of attributes allowed on a table.
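A minimal sketch of a Type Three change, using the OLD_NAME, NEW_NAME and NAME_CHANGE_EFF_DATE fields named above. The table name, key column and sample values are hypothetical. With only one pair of name fields, the second change in the example pushes the original name out of the record, which is the limitation the questions above point to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_customer (
    customer_key         INTEGER PRIMARY KEY,
    OLD_NAME             TEXT,
    NEW_NAME             TEXT,
    NAME_CHANGE_EFF_DATE TEXT
)""")
conn.execute("INSERT INTO dim_customer VALUES (1, NULL, 'Ann Smith', NULL)")

def type_three_name_change(customer_key, new_name, effective_date):
    # The right-hand sides see the row as it was before the update, so
    # OLD_NAME receives the name that NEW_NAME held until now.
    conn.execute(
        "UPDATE dim_customer "
        "SET OLD_NAME = NEW_NAME, NEW_NAME = ?, NAME_CHANGE_EFF_DATE = ? "
        "WHERE customer_key = ?",
        (new_name, effective_date, customer_key),
    )

def current_row():
    return conn.execute("SELECT * FROM dim_customer WHERE customer_key = 1").fetchone()

type_three_name_change(1, "Ann Jones", "2021-05-01")
after_first = current_row()

type_three_name_change(1, "Ann Lee", "2023-09-01")
after_second = current_row()   # 'Ann Smith' is no longer recorded anywhere

print(after_first)
print(after_second)
```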
Because most business requirements include tracking changes over time, data warehouse architects commonly implement Type Two changes. A data warehouse might use Type Two changes for all attributes in all tables. As an alternative, you can implement a mix of Type One and Type Two changes at the attribute level by implementing Type Two changes only for attributes whose historical values are important when you're slicing and dicing. For example, users might not need to know an individual's previous name if a name change occurs, so a Type One change would suffice. Users might want the system to show only the person's current name. However, if the company reassigns sales territories, users might need to track who sold what, at what time, and in what territory, necessitating a Type Two change. Although most data warehouses include Type Two changes, you need to seriously examine the business need to record historical data. Implementing Type Two changes might be necessary, but those changes will increase the database size, degrade performance, and lengthen the development time. You need to carefully evaluate using a Type Two implementation, a Type One implementation, or a hybrid implementation.
- Force the attributes selected for the demographic dimension to have a relatively small number of discrete values.
- Build up the demographic dimension with all possible combinations of the discrete attribute values.
- Construct a surrogate demographic key for this dimension.
Note: Demographic attributes are among the most heavily used attributes. Their values are often compared in order to identify interesting subsets.
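The three steps above can be sketched in a few lines of Python. The attributes (age band, income band, home ownership), their band boundaries and the key numbering are hypothetical choices made for the illustration.

```python
from itertools import product

# Step 1: each attribute is forced into a small set of discrete values.
AGE_BANDS = ["<25", "25-44", "45-64", "65+"]
INCOME_BANDS = ["low", "medium", "high"]
HOME_OWNER = ["Y", "N"]

def age_band(age):
    """Map a continuous age onto one of the discrete bands."""
    if age < 25:
        return "<25"
    if age < 45:
        return "25-44"
    if age < 65:
        return "45-64"
    return "65+"

# Steps 2 and 3: one row per possible combination, each with a surrogate key.
demographic_dim = [
    {"demographic_key": key, "age_band": a, "income_band": i, "home_owner": h}
    for key, (a, i, h) in enumerate(product(AGE_BANDS, INCOME_BANDS, HOME_OWNER), start=1)
]

# Lookup used at load time to find the key for a customer's current profile.
key_lookup = {
    (row["age_band"], row["income_band"], row["home_owner"]): row["demographic_key"]
    for row in demographic_dim
}

customer_key = key_lookup[(age_band(37), "high", "Y")]
print(len(demographic_dim), customer_key)
```

The dimension stays small (4 x 3 x 2 = 24 rows here) no matter how many customers there are, which is what makes it practical to pre-build every combination.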
Degenerate dimensions usually occur in line item-oriented fact table designs. A degenerate dimension is represented by a dimension key attribute with no corresponding dimension table. Many dimensional designs revolve around some kind of control document like an order, an invoice, a bill of lading, or a ticket. Usually these control documents are a kind of container with one or more line items inside. A very natural grain for a fact table in these cases is the individual line item. In other words, a fact table record is a line item. Given this perspective, we can quickly visualize the necessary dimensions for describing each line item, e.g. Product, Customer, Time, etc. Generally, the attributes on the order number automatically go over to these chosen dimensions.
But what do we do with the order number itself? At the end of the design, the order number is sitting by itself, without any attributes. We call this a degenerate dimension. The degenerate dimension key should be the actual production order number and should sit in the fact table without a join to anything. There is no point in making a dimension table because the dimension table would not contain anything.
Note: If one or more attributes are legitimately left over after all the other dimensions have been created, and they seem to belong to this control document entity, we should simply create a normal dimension record with a normal join. You don't have a degenerate dimension any more.
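As a small illustration, the sqlite3 sketch below keeps the order number directly in a line-item fact table with no order dimension behind it. Table names, column names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);

-- Grain: one row per order line item.
CREATE TABLE fact_order_line (
    order_number TEXT,     -- degenerate dimension: no dim_order table exists
    product_key  INTEGER REFERENCES dim_product(product_key),
    quantity     INTEGER,
    amount       INTEGER
);

INSERT INTO dim_product VALUES (1, 'Pen'), (2, 'Lamp');
INSERT INTO fact_order_line VALUES
    ('SO-1001', 1, 2, 20),
    ('SO-1001', 2, 1, 35),
    ('SO-1002', 1, 5, 50);
""")

# The order number still groups line items back into their control document,
# even though it joins to nothing.
orders = conn.execute("""
    SELECT order_number, COUNT(*) AS line_items, SUM(amount) AS order_total
    FROM fact_order_line
    GROUP BY order_number
    ORDER BY order_number
""").fetchall()

print(orders)
```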
A better alternative is to create a junk dimension. A junk dimension is a convenient grouping of flags and attributes to get them out of a fact table into a useful dimensional framework.
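A rough sketch of the idea: the flags on each incoming order are replaced by a single key into a junk dimension. The flag names and sample orders are hypothetical, and building the dimension from the combinations actually observed (rather than pre-building every combination) is an assumption of this example.

```python
# Hypothetical source rows carrying several low-cardinality flags.
orders = [
    {"order_id": 1, "is_gift": "Y", "payment_type": "card", "rush": "N", "amount": 40},
    {"order_id": 2, "is_gift": "N", "payment_type": "cash", "rush": "N", "amount": 15},
    {"order_id": 3, "is_gift": "Y", "payment_type": "card", "rush": "N", "amount": 22},
    {"order_id": 4, "is_gift": "N", "payment_type": "card", "rush": "Y", "amount": 90},
]
FLAG_COLUMNS = ("is_gift", "payment_type", "rush")

junk_dim = {}    # flag combination -> junk_key
fact_rows = []   # fact rows now carry one junk_key instead of three flags

for order in orders:
    combo = tuple(order[col] for col in FLAG_COLUMNS)
    # Reuse the key if this combination has been seen, otherwise assign the next one.
    junk_key = junk_dim.setdefault(combo, len(junk_dim) + 1)
    fact_rows.append(
        {"order_id": order["order_id"], "junk_key": junk_key, "amount": order["amount"]}
    )

print(junk_dim)
print([row["junk_key"] for row in fact_rows])
```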
Suppose now that you add a marketing data mart to help you analyze product promotions. Again, with conformed customer and time dimensions, you're able to analyze the effects of a particular product promotion on sales. (Analyzing facts from more than one fact table in this way is termed drilling across. My previous article, "Thinking dimensionally aids business intelligence design and use," explains the function of facts and dimensions.) The same conformed dimensions (in this case, the time and customer dimensions) have meaning in the context of three independently developed data marts. These dimensions become enterprise property and can be used later in other marts as you evolve the enterprise data warehouse.
In order to use multiple data sources together, the data warehouse team has no choice but to conform the dimensions. The reason is that data arriving from multiple sources may have incompatible granularity. Conforming the dimensions means forcing the two data sources to share identical dimensions. Conformed dimensions have consistent definitions regardless of where they are used. This allows a single query to be run across multiple tables, data marts and data warehouses.
3.2 Facts
As described earlier, the fact table is the table at the center of a star schema and holds the primary data. It contains the actual numerical measurements that the business is interested in. A fact table typically has two types of columns: those that contain measures and those that are foreign keys to dimension tables. Some key features of a fact table are:
- Multipart key, i.e. a composite key with one foreign key for each dimension.
- Time is always a part of the key.
- Usually numeric. Keys are surrogate integers and the measures are numeric.
- Typically additive.
3.2.1 Granularity
By granularity we mean the level of detail of the data in the fact table. The lowest granularity is referred to as atomic data. The granularity is determined by the grain, which is the meaning of a single record in the fact table. The granularity or frequency of the data is usually determined by the time dimension. For example, you may want to store only weekly or monthly totals. The granularity also determines how far you can drill down without returning to the base, transaction-level data. Many OLAP systems have a daily grain. The lower the grain, the more records the fact table holds. However, we must also make sure that the grain is low enough to support our decision support needs. One of the major benefits of the star schema is that the low-level transactions are summarized to the fact table grain. This greatly speeds the queries we perform as part of our decision support.
A second kind of factless fact table is called a coverage table. Coverage tables are frequently needed when a primary fact table in a dimensional data warehouse is sparse. The figure below shows a simple sales fact table that records the sales of products in stores on particular days under each promotion condition. The sales fact table does answer many interesting questions but cannot answer questions about things that didn't happen. For instance, it cannot answer the question,
"Which products were on promotion that didn't sell?" because it contains only the records of products that did sell. In this case the coverage table is used. A record is placed in the coverage table for each product in each store that is on promotion in each time period. You need the full generality of a fact table to record which products are on promotion. In general, which products are on promotion varies by all of the dimensions of product, store, promotion, and time. This complex many-to-many relationship must be expressed as a fact table.
One option is simply to fill out the original fact table with records representing zero sales for all possible products. This is logically valid, but it would expand the fact table enormously, and the coverage factless fact table can be made much smaller. The coverage table need only contain the items on promotion; the items not on promotion that also did not sell can be left out. Also, the coverage table will be populated less frequently than the fact table.
Answering the question, "Which products were on promotion that did not sell?" requires a two-step application. First, consult the coverage table for the list of products on promotion on that day in that store. Second, consult the sales table for the list of products that did sell. The desired answer is the set difference between these two lists of products.
Coverage tables are also useful for recording the assignment of sales teams to customers in businesses in which the sales teams make occasional very large sales. In such a business, the sales fact table is too sparse to provide a good place to record which sales teams were associated with which customers. The sales team coverage table provides a complete map of the assignment of sales teams to customers, even if some of the combinations never result in a sale.
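The two-step lookup against a promotion coverage table (coverage list first, sales list second, then the set difference) can be sketched with sqlite3, where EXCEPT computes the difference. The table layout and rows are hypothetical, and plain text values stand in for dimension keys to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (day TEXT, store TEXT, product TEXT, units INTEGER);

-- Factless coverage table: keys only, no measures.
CREATE TABLE fact_promotion_coverage (day TEXT, store TEXT, product TEXT, promotion TEXT);

INSERT INTO fact_promotion_coverage VALUES
    ('2024-03-01', 'S1', 'Pen',  'Spring Sale'),
    ('2024-03-01', 'S1', 'Lamp', 'Spring Sale'),
    ('2024-03-01', 'S1', 'Mug',  'Spring Sale');
INSERT INTO fact_sales VALUES
    ('2024-03-01', 'S1', 'Pen',  5),
    ('2024-03-01', 'S1', 'Desk', 1);   -- sold, but was not on promotion
""")

def promoted_but_unsold(day, store):
    """Products on promotion that day in that store, minus products that sold."""
    rows = conn.execute(
        """
        SELECT product FROM fact_promotion_coverage WHERE day = ? AND store = ?
        EXCEPT
        SELECT product FROM fact_sales WHERE day = ? AND store = ?
        ORDER BY product
        """,
        (day, store, day, store),
    ).fetchall()
    return [product for (product,) in rows]

unsold = promoted_but_unsold("2024-03-01", "S1")
print(unsold)
```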
3.2.5 Aggregation
Aggregates are statistical summaries of a fact table. You can have multiple levels of aggregates based on the same fact table. The summarization is done over levels of different dimensions; for example, monthly sales is a summarization at the month level of the time dimension. This is typically done to improve performance. Fact tables that are aggregated are also called summary tables.
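As an illustration of the monthly-sales case, the sketch below derives a summary table from a daily fact table. The table names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE fact_sales_daily (date_key INTEGER, product_key INTEGER, units INTEGER);

INSERT INTO dim_date VALUES
    (1, '2024-01-15', '2024-01'),
    (2, '2024-01-20', '2024-01'),
    (3, '2024-02-03', '2024-02');
INSERT INTO fact_sales_daily VALUES (1, 1, 5), (2, 1, 7), (3, 1, 2), (1, 2, 1);

-- Summary table: the same measure rolled up to the month level of the time dimension.
CREATE TABLE agg_sales_monthly AS
SELECT d.month AS month, f.product_key AS product_key, SUM(f.units) AS units
FROM fact_sales_daily f
JOIN dim_date d ON d.date_key = f.date_key
GROUP BY d.month, f.product_key;
""")

monthly = conn.execute(
    "SELECT month, product_key, units FROM agg_sales_monthly ORDER BY month, product_key"
).fetchall()
print(monthly)
```

Queries that only need monthly totals can read the smaller summary table instead of scanning the daily fact table.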
Alternate hierarchy
Alternate hierarchies are very powerful. The preferred method for handling alternate hierarchies is to build the alternate hierarchy structure in columns to the right of the primary hierarchy. Not all leaf-level members, or primary keys of the dimension table, are members in alternate hierarchies. For those members, the alternate hierarchy columns should be left NULL. A sample of a dimension with a primary hierarchy and an alternate hierarchy follows:
3.3.2 Browsing
Browsing is the act of navigating around a single dimension, either to gain an intuitive understanding of how the various attributes correlate with each other or to build a constraint on the dimension as a whole. Browsing often involves constraining one or more dimensional attributes and looking at the distinct values of another attribute in the presence of these constraints. For example, to see which of the stores in Kent County, Ohio have the new upgraded plan, we need to go through the store name, county, state and floor plan fields in the Store dimension. Browsing may or may not be across defined hierarchies. To support browsing, the dimension tables should remain as flat tables and not be normalized. Normalized dimension tables destroy the ability to browse.
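The Store example can be sketched as queries against a single flat dimension table. The column names and rows are hypothetical; no fact table and no joins are involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store (
    store_key  INTEGER PRIMARY KEY,
    store_name TEXT,
    county     TEXT,
    state      TEXT,
    floor_plan TEXT
);
INSERT INTO dim_store VALUES
    (1, 'Store A', 'Kent',   'OH', 'Upgraded'),
    (2, 'Store B', 'Kent',   'OH', 'Classic'),
    (3, 'Store C', 'Kent',   'MI', 'Upgraded'),
    (4, 'Store D', 'Summit', 'OH', 'Upgraded');
""")

# Constrain two attributes and look at the distinct values of a third.
floor_plans = [
    plan
    for (plan,) in conn.execute(
        "SELECT DISTINCT floor_plan FROM dim_store "
        "WHERE county = 'Kent' AND state = 'OH' ORDER BY floor_plan"
    )
]

# Tighten the constraint to arrive at the stores of interest.
upgraded_stores = [
    name
    for (name,) in conn.execute(
        "SELECT store_name FROM dim_store "
        "WHERE county = 'Kent' AND state = 'OH' AND floor_plan = 'Upgraded' "
        "ORDER BY store_name"
    )
]

print(floor_plans, upgraded_stores)
```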
3.3.3 Drilling
Drill down and drill up are common terms used when describing OLAP applications. When the user navigates from summary-level data to detail-level data it is called drill down, and the reverse is called drill up (or roll up). This is typically done across a hierarchy. In dimensional modeling terms this means adding or subtracting grouping columns from a query. Imagine creating a grouping column in a report by opportunistically dragging a dimension attribute from any of the dimension tables down into the report, thereby making it a grouping column (see figure below). All dimension attributes can become grouping columns (though they are typically part of the same hierarchy).
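The idea that drilling is just adding or removing grouping columns can be sketched as a function that builds the GROUP BY clause from a list of attributes. The schema, the sample rows and the fixed whitelist of attribute names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time    (time_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales  (time_key INTEGER, product_key INTEGER, units INTEGER);

INSERT INTO dim_time    VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO dim_product VALUES (1, 'Stationery'), (2, 'Home');
INSERT INTO fact_sales  VALUES (1, 1, 10), (1, 2, 3), (2, 1, 7);
""")

# Only these attributes may be used, so no user text reaches the SQL string.
GROUPABLE = {"year": "t.year", "month": "t.month", "category": "p.category"}

def sales_by(grouping):
    """Total units for a non-empty list of grouping attributes."""
    columns = ", ".join(GROUPABLE[name] for name in grouping)
    sql = (
        f"SELECT {columns}, SUM(f.units) "
        "FROM fact_sales f "
        "JOIN dim_time t ON t.time_key = f.time_key "
        "JOIN dim_product p ON p.product_key = f.product_key "
        f"GROUP BY {columns} ORDER BY {columns}"
    )
    return conn.execute(sql).fetchall()

summary = sales_by(["year"])                       # most summarized level
by_month = sales_by(["year", "month"])             # drill down: add a grouping column
detail = sales_by(["year", "month", "category"])   # drill down again

print(summary)
print(by_month)
print(detail)
```

Drilling up is the same call with a column removed, for example going from `detail` back to `by_month`.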
Drilling Across
Drilling across is the process of linking two or more fact tables at the same granularity, or, in other words, tables with the same set of grouping columns and dimensional constraints. Drilling across is a valuable technique whenever a business has several fundamental business processes that can be arranged in a value chain. Each business process gets its own separate fact table. For example, almost all manufacturers have an obvious value chain representing the demand side of their businesses, consisting of finished goods inventory, orders, shipments, customer inventory, and customer sales. The figure shows how these fact tables are arranged in a sequence. The product and time dimensions thread through all of these fact tables. Some dimensions, such as customer ship to, thread through some, but not all, of the fact tables. For instance, customer ship to does not apply to finished goods inventory.
A drill across report can be created by using grouping columns that apply to all the fact tables used in the report. Thus in the example, attributes may be freely chosen from the product and time dimension tables because they make sense for every fact table. Attributes from customer ship to can only be used as grouping columns if we avoid touching the finished goods inventory fact table. When multiple fact tables are tied to a dimension table, the fact tables should all link to that dimension table. When we use precisely the same dimension table with each of the fact tables, we say that the dimension is "conformed" to each fact table. Dimensions that are not conformed (such as those that differ in grain or detail) across fact tables will defeat the drill across application.
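A minimal sketch of a drill across report over two fact tables that share a conformed product dimension. Each fact table is queried separately with the same grouping column, and the two answer sets are then outer joined in the client code. The orders and shipments tables and their rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One conformed dimension shared by both fact tables.
CREATE TABLE dim_product    (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_orders    (product_key INTEGER, quantity INTEGER);
CREATE TABLE fact_shipments (product_key INTEGER, quantity INTEGER);

INSERT INTO dim_product    VALUES (1, 'Lamp'), (2, 'Pen'), (3, 'Mug');
INSERT INTO fact_orders    VALUES (1, 5), (2, 10), (2, 2);
INSERT INTO fact_shipments VALUES (2, 8), (3, 1);
""")

# One query per fact table, both grouped by the same dimension attribute.
ordered = dict(conn.execute("""
    SELECT p.name, SUM(f.quantity)
    FROM fact_orders f JOIN dim_product p USING (product_key)
    GROUP BY p.name
"""))
shipped = dict(conn.execute("""
    SELECT p.name, SUM(f.quantity)
    FROM fact_shipments f JOIN dim_product p USING (product_key)
    GROUP BY p.name
"""))

# Outer join of the two answer sets on the shared grouping column.
report = [
    (name, ordered.get(name), shipped.get(name))
    for name in sorted(ordered.keys() | shipped.keys())
]

for line in report:
    print(line)
```

Joining the two fact tables to each other directly would multiply rows whenever a product has several orders or several shipments; querying each one on its own avoids that.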
Drilling Around
The final variant of drilling is drilling around a value circle. This is similar to the linear value chain in the previous example, but occurs in a data warehouse where the related fact tables that share common dimensions are not arranged in a linear order. The best example is from health care, where as many as 10 separate entities are processing patient encounters and are sharing this information with one another. The figure shows a typical health care value circle with 10 separate entities surrounding the patient. Although this is not a value chain like manufacturing, the data warehouse issues of combining facts from separate fact tables across a single line of a report are very much the same as in the previous discussion.
When the common dimensions are conformed and the requested grouping columns are drawn from dimensions that tie to all the fact tables in a given report, you can generate really powerful drill around reports by performing separate queries on each fact table and outer joining the answer sets in the client tool. Once you have set up multiple fact tables for either drilling across or drilling around, you can certainly drill up and down at the same time. In this case, you take the whole value chain, or value circle, and simultaneously ask all the fact tables for more granular data (drill down) or less
The four steps involved in Kimball's approach are:
1. Choose a business process to model.
A business process is a major operational process in your organization that is supported by some kind of legacy system (or systems) from which data can be collected for the purpose of the data warehouse. Examples of business processes are orders, invoices, shipments, inventory, account administration, sales and the general ledger.
3. Choose the dimensions that will apply to each fact table record.
Typical dimensions are Time, Product, Customer, Promotion, Warehouse, Transaction type and Status. With the choice of each dimension, describe all the discrete, text-like dimension attributes (fields) that fill out each dimension table.
4. Choose the measured facts that will populate each fact table record.
Typical measured facts are numeric additive quantities like quantity sold and dollars sold.
communication (with the users or business owners). This matrix approach has been exceptionally effective for distributed data warehouses without a center. The matrix is simply a vertical list of data marts and a horizontal list of dimensions. The figure below is an example matrix for the enterprise data warehouse of a large telecommunications company.
The Matrix Plan for the enterprise data warehouse of a large telecommunications company.
First-level data marts are directly derived from production applications. Second-level data marts are developed later and represent combinations of first-level data marts. You start the matrix by listing all the first-level data marts. A first-level data mart is a collection of related fact tables and dimension tables that is typically:
- Derived from a single data source
- Supported and implemented by a single department
- Based on the most atomic data possible to collect from the source
- Conformed to the data warehouse bus
A second-level data mart is a combination of two or more first-level marts. In most cases, a second-level mart is more than a simple union of data sets from the first-level marts. For example, a second-level profitability mart may result from a complex allocation process that associates costs from several first-level cost-oriented data marts onto products and customers contained in a first-level revenue mart. The matrix planning technique helps you build an enterprise data warehouse, especially when the warehouse is a distributed combination of far-flung data marts. The matrix becomes a resource that is part technical tool, part project management tool, and part communication vehicle to senior management.
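The matrix itself is a simple structure: data marts down the side, dimensions across the top, and a mark wherever a mart uses a dimension. The sketch below shows one way to hold and print such a matrix. The mart and dimension names are invented for illustration and are not taken from the telecommunications example in the figure.

```python
# Hypothetical first-level data marts and the dimensions each one uses.
bus_matrix = {
    "Customer Billing": {"Date", "Customer", "Service", "Rate Category"},
    "Service Orders":   {"Date", "Customer", "Service", "Sales Rep"},
    "Trouble Reports":  {"Date", "Customer", "Service", "Equipment"},
}

# The horizontal list of dimensions is the union over all marts.
dimensions = sorted(set().union(*bus_matrix.values()))

def render(matrix, dims):
    """Lay the matrix out as text: one row per mart, one column per dimension."""
    width = max(len(mart) for mart in matrix)
    lines = [" " * width + " | " + " | ".join(dims)]
    for mart, used in matrix.items():
        cells = [("X" if dim in used else "").center(len(dim)) for dim in dims]
        lines.append(mart.ljust(width) + " | " + " | ".join(cells))
    return "\n".join(lines)

# Dimensions used by every mart are the first candidates for conforming.
used_everywhere = set.intersection(*bus_matrix.values())

print(render(bus_matrix, dimensions))
print(sorted(used_everywhere))
```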
5 Advanced Design
5.1 Indexing
a. Dimension table indexing
- Dimensions should be heavily indexed.
- The attributes should generally have B-tree indexes.
- Where cardinality is low, use bitmap indexes.
b. Fact table indexing
- Fact tables should be indexed carefully.
- Build a single clustered index consisting of the dimension table foreign keys.
- A B-tree index may be needed if the clustered dimension keys don't ensure uniqueness.
- Use caution when building more fact indexes.
- Determine the primary key sort order.
- The leading column should be the time dimension key.
- Consider other leading index terms to best clump data in blocks and provide the fastest subsetting.
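A rough sqlite3 sketch of some of these guidelines follows. It is an approximation under stated assumptions: SQLite offers only B-tree indexes, so the bitmap index guideline cannot be shown, and a WITHOUT ROWID table (which stores rows in primary key order) is used here as a stand-in for a clustered index. Table, column and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);

-- Dimension attributes used for constraining and grouping get their own indexes.
CREATE INDEX ix_product_name     ON dim_product(name);
CREATE INDEX ix_product_category ON dim_product(category);

-- Fact table stored in primary key order, with the time key as the leading column.
CREATE TABLE fact_sales (
    time_key    INTEGER NOT NULL,
    product_key INTEGER NOT NULL,
    store_key   INTEGER NOT NULL,
    units       INTEGER,
    PRIMARY KEY (time_key, product_key, store_key)
) WITHOUT ROWID;

INSERT INTO fact_sales VALUES (2, 1, 1, 7), (1, 2, 1, 3), (1, 1, 1, 10);
""")

# The explicitly created dimension indexes.
index_names = sorted(
    name
    for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
    if name.startswith("ix_")
)

# A constraint on the leading key column selects a contiguous range of the key order.
first_period = conn.execute(
    "SELECT product_key, units FROM fact_sales WHERE time_key = 1 ORDER BY product_key"
).fetchall()

print(index_names)
print(first_period)
```

On a database that supports them, the low-cardinality `category` column would be a candidate for a bitmap index instead of the B-tree index used here.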